Thanks Joris for clearing that up! It's correct that pyspark will allow the user to do operations on the resulting DataFrame, so it doesn't sound like I should set `split_blocks=True` in the conversion. You're right that the unnecessary assignments can easily be avoided for non-timestamp columns, so that will be a big help. I'll link this discussion to the JIRA in case it helps others. Thanks again.

Bryan
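For reference, a minimal sketch of the write-back pattern discussed above, assuming pandas' `is_datetime64tz_dtype` check; the function name `localize_timestamps` and the exact tz handling are illustrative stand-ins for pyspark's helper, not its real code:

    import pandas as pd
    from pandas.api.types import is_datetime64tz_dtype

    def localize_timestamps(pdf, timezone):
        # Check the dtype inline and assign back only the columns that were
        # actually modified; non-timestamp columns are never written back,
        # so no unnecessary copies into pandas' 2D blocks are triggered.
        for column in pdf.columns:
            series = pdf[column]
            if is_datetime64tz_dtype(series.dtype):
                # Convert tz-aware values to the session timezone, then
                # drop the timezone information (illustrative handling).
                pdf[column] = series.dt.tz_convert(timezone).dt.tz_localize(None)
        return pdf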
On Fri, Jan 24, 2020 at 2:10 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Hi Bryan,
>
> For the case that the column is not a timestamp and was not modified: I
> don't think it will take copies of the full dataframe by assigning columns
> in a loop like that. But it is still doing work (it will copy the data for
> that column into the array holding that data for the 2D blocks), which can
> easily be avoided by only assigning back when the column was actually
> modified (e.g. by moving the is_datetime64tz_dtype check inline into the
> loop iterating through all columns, so you only write back when you
> actually have tz-aware data).
>
> Further, even if you do the above to avoid writing back to the dataframe
> when not needed, I am not sure you should directly try to use the new
> zero-copy feature of the Table.to_pandas conversion (with
> split_blocks=True). It depends very much on what happens next with the
> converted dataframe. Once you do some operations in pandas, those split
> blocks will get combined (resulting in a memory copy at that point), and
> it also means you can't modify the dataframe (if this dataframe is used
> in Python UDFs, it might limit what can be done in those UDFs; just
> guessing here, I don't know the pyspark code well enough).
>
> Joris
>
> On Thu, 23 Jan 2020 at 21:03, Bryan Cutler <cutl...@gmail.com> wrote:
>
> > Thanks for investigating this and the quick fix, Joris and Wes! I just
> > have a couple of questions about the behavior observed here. The pyspark
> > code either assigns the same series back to the pandas.DataFrame or
> > makes some modifications if it is a timestamp. In the case where there
> > are no timestamps, is this potentially making extra copies, or will it
> > be unable to take advantage of the new zero-copy features in pyarrow?
> > For the case of timestamp columns that need to be modified, is there a
> > more efficient way to create a new dataframe with copies of only the
> > modified series? Thanks!
> >
> > Bryan
> >
> > On Thu, Jan 16, 2020 at 11:48 PM Joris Van den Bossche <
> > jorisvandenboss...@gmail.com> wrote:
> >
> > > That sounds like a good solution. Having the zero-copy behavior depend
> > > on whether you have only one column of a certain type might lead to
> > > surprising results. To avoid yet another keyword, only doing it when
> > > split_blocks=True sounds good to me (in practice, that's also when it
> > > will mostly happen, except for very narrow dataframes with only a few
> > > columns).
> > >
> > > Joris
> > >
> > > On Thu, 16 Jan 2020 at 22:44, Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > hi Joris,
> > > >
> > > > Thanks for investigating this. It seems there were some unintended
> > > > consequences of the zero-copy optimizations from ARROW-3789. Another
> > > > way forward might be to "opt in" to this behavior, or to only do the
> > > > zero-copy optimizations when split_blocks=True. What do you think?
> > > >
> > > > - Wes
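The trade-off Joris and Wes describe can be seen directly. A small sketch, assuming a pyarrow version that includes the `split_blocks` option; whether a given column actually comes out read-only depends on its dtype and on whether zero-copy was possible:

    import numpy as np
    import pyarrow as pa

    table = pa.table({"a": np.arange(5.0), "b": np.arange(5.0)})

    # Default conversion: columns of the same dtype are consolidated into
    # one 2D block, which always copies, so the result stays writable.
    df = table.to_pandas()

    # split_blocks=True keeps each column in its own 1D block, which can be
    # backed zero-copy by Arrow memory; such arrays come out read-only.
    df_split = table.to_pandas(split_blocks=True)
    print(df_split["a"].to_numpy().flags.writeable)  # may print False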
> > > >
> > > > On Thu, Jan 16, 2020 at 3:42 AM Joris Van den Bossche <
> > > > jorisvandenboss...@gmail.com> wrote:
> > > >
> > > > > So the spark integration build started to fail, with the following
> > > > > test error:
> > > > >
> > > > > ======================================================================
> > > > > ERROR: test_toPandas_batch_order
> > > > > (pyspark.sql.tests.test_arrow.EncryptionArrowTests)
> > > > > ----------------------------------------------------------------------
> > > > > Traceback (most recent call last):
> > > > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 422, in test_toPandas_batch_order
> > > > >     run_test(*case)
> > > > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 409, in run_test
> > > > >     pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
> > > > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 152, in _toPandas_arrow_toggle
> > > > >     pdf_arrow = df.toPandas()
> > > > >   File "/spark/python/pyspark/sql/pandas/conversion.py", line 115, in toPandas
> > > > >     return _check_dataframe_localize_timestamps(pdf, timezone)
> > > > >   File "/spark/python/pyspark/sql/pandas/types.py", line 180, in _check_dataframe_localize_timestamps
> > > > >     pdf[column] = _check_series_localize_timestamps(series, timezone)
> > > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py", line 3487, in __setitem__
> > > > >     self._set_item(key, value)
> > > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py", line 3565, in _set_item
> > > > >     NDFrame._set_item(self, key, value)
> > > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/generic.py", line 3381, in _set_item
> > > > >     self._data.set(key, value)
> > > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1090, in set
> > > > >     blk.set(blk_locs, value_getitem(val_locs))
> > > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 380, in set
> > > > >     self.values[locs] = values
> > > > > ValueError: assignment destination is read-only
> > > > >
> > > > > It's from a test that does conversions from spark to arrow to pandas
> > > > > (so calling pyarrow.Table.to_pandas here
> > > > > <https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/conversion.py#L111-L115>),
> > > > > and on the resulting DataFrame it iterates through all columns,
> > > > > potentially fixing timezones, and writes each column back into the
> > > > > DataFrame (here
> > > > > <https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/types.py#L179-L181>).
> > > > >
> > > > > Since it is giving an error about read-only data, it might be
> > > > > related to the zero-copy behaviour of to_pandas, and thus to the
> > > > > refactor of the arrow->pandas conversion that landed yesterday
> > > > > (https://github.com/apache/arrow/pull/6067, which changed it to do
> > > > > zero-copy for 1-column blocks if possible).
> > > > >
> > > > > I am not sure if something should be fixed in pyarrow for this, but
> > > > > the obvious thing that pyspark can do is specify they don't want
> > > > > zero-copy.
> > > > >
> > > > > Joris
> > > > >
> > > > > On Wed, 15 Jan 2020 at 14:32, Crossbow <cross...@ursalabs.org> wrote:
> > > > >
> > > > > >
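The ValueError in the traceback above is NumPy's standard refusal to write into a read-only array. A tiny illustration, using a manually frozen array to stand in for one backed zero-copy by Arrow memory:

    import numpy as np

    values = np.arange(3.0)
    values.flags.writeable = False  # stand-in for zero-copy Arrow-backed memory

    try:
        values[0] = 42.0  # the kind of in-place write pandas attempts
    except ValueError as exc:
        print(exc)  # assignment destination is read-only

Writing a column back with `pdf[column] = ...` can route through exactly this kind of in-place write inside pandas' block manager, which is why copying the data during conversion (or skipping the write-back entirely) avoids the error.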