Thanks Joris for clearing that up! It's correct that pyspark will allow the user to do operations on the resulting DataFrame, so it doesn't sound like I should set `split_blocks=True` in the conversion. You're right that the unnecessary assignments can easily be avoided for non-timestamp columns, so that will be a big help. I'll link this discussion to the JIRA in case it helps others. Thanks again.

Bryan
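For reference, a minimal sketch of the write-back pattern discussed above, assuming pandas' `is_datetime64tz_dtype` check; the function name `localize_timestamps` and the exact tz handling are illustrative stand-ins for pyspark's helper, not its real code:

    import pandas as pd
    from pandas.api.types import is_datetime64tz_dtype

    def localize_timestamps(pdf, timezone):
        # Check the dtype inline and assign back only the columns that were
        # actually modified; non-timestamp columns are never written back,
        # so no unnecessary copies into pandas' 2D blocks are triggered.
        for column in pdf.columns:
            series = pdf[column]
            if is_datetime64tz_dtype(series.dtype):
                # Convert tz-aware values to the session timezone, then
                # drop the timezone information (illustrative handling).
                pdf[column] = series.dt.tz_convert(timezone).dt.tz_localize(None)
        return pdf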
On Fri, Jan 24, 2020 at 2:10 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Hi Bryan,
>
> For the case that the column is not a timestamp and was not modified: I
> don't think it will take copies of the full dataframe by assigning columns
> in a loop like that. But it is still doing work (it will copy the data for
> that column into the array holding that data for the 2D blocks), which can
> easily be avoided by only assigning back when the column was actually
> modified (e.g. by moving the is_datetime64tz_dtype check inline into the
> loop iterating through all columns, so you only write back when you
> actually have tz-aware data).
>
> Further, even if you do the above to avoid writing back to the dataframe
> when not needed, I am not sure you should directly try to use the new
> zero-copy feature of the Table.to_pandas conversion (with
> split_blocks=True). It depends very much on what happens next with the
> converted dataframe. Once you do some operations in pandas, those split
> blocks will get combined (resulting in a memory copy at that point), and
> it also means you can't modify the dataframe (if this dataframe is used
> in Python UDFs, it might limit what can be done in those UDFs; just
> guessing here, I don't know the pyspark code well enough).
>
> Joris
>
> On Thu, 23 Jan 2020 at 21:03, Bryan Cutler <cutl...@gmail.com> wrote:
>
> > Thanks for investigating this and the quick fix, Joris and Wes! I just
> > have a couple of questions about the behavior observed here. The pyspark
> > code either assigns the same series back to the pandas.DataFrame or
> > makes some modifications if it is a timestamp. In the case where there
> > are no timestamps, is this potentially making extra copies, or will it
> > be unable to take advantage of the new zero-copy features in pyarrow?
> > For the case of timestamp columns that need to be modified, is there a
> > more efficient way to create a new dataframe with copies of only the
> > modified series? Thanks!
> >
> > Bryan
> >
> > On Thu, Jan 16, 2020 at 11:48 PM Joris Van den Bossche <
> > jorisvandenboss...@gmail.com> wrote:
> >
> > > That sounds like a good solution. Having the zero-copy behavior depend
> > > on whether you have only one column of a certain type might lead to
> > > surprising results. To avoid yet another keyword, only doing it when
> > > split_blocks=True sounds good to me (in practice, that's also when it
> > > will mostly happen, except for very narrow dataframes with only a few
> > > columns).
> > >
> > > Joris
> > >
> > > On Thu, 16 Jan 2020 at 22:44, Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > hi Joris,
> > > >
> > > > Thanks for investigating this. It seems there were some unintended
> > > > consequences of the zero-copy optimizations from ARROW-3789. Another
> > > > way forward might be to "opt in" to this behavior, or to only do the
> > > > zero-copy optimizations when split_blocks=True. What do you think?
> > > >
> > > > - Wes
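The trade-off Joris and Wes describe can be seen directly. A small sketch, assuming a pyarrow version that includes the `split_blocks` option; whether a given column actually comes out read-only depends on its dtype and on whether zero-copy was possible:

    import numpy as np
    import pyarrow as pa

    table = pa.table({"a": np.arange(5.0), "b": np.arange(5.0)})

    # Default conversion: columns of the same dtype are consolidated into
    # one 2D block, which always copies, so the result stays writable.
    df = table.to_pandas()

    # split_blocks=True keeps each column in its own 1D block, which can be
    # backed zero-copy by Arrow memory; such arrays come out read-only.
    df_split = table.to_pandas(split_blocks=True)
    print(df_split["a"].to_numpy().flags.writeable)  # may print False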
> > > >
> > > > On Thu, Jan 16, 2020 at 3:42 AM Joris Van den Bossche <
> > > > jorisvandenboss...@gmail.com> wrote:
> > > >
> > > > > So the spark integration build started to fail, with the following
> > > > > test error:
> > > > >
> > > > > ======================================================================
> > > > > ERROR: test_toPandas_batch_order
> > > > > (pyspark.sql.tests.test_arrow.EncryptionArrowTests)
> > > > > ----------------------------------------------------------------------
> > > > > Traceback (most recent call last):
> > > > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 422, in test_toPandas_batch_order
> > > > >     run_test(*case)
> > > > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 409, in run_test
> > > > >     pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
> > > > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 152, in _toPandas_arrow_toggle
> > > > >     pdf_arrow = df.toPandas()
> > > > >   File "/spark/python/pyspark/sql/pandas/conversion.py", line 115, in toPandas
> > > > >     return _check_dataframe_localize_timestamps(pdf, timezone)
> > > > >   File "/spark/python/pyspark/sql/pandas/types.py", line 180, in _check_dataframe_localize_timestamps
> > > > >     pdf[column] = _check_series_localize_timestamps(series, timezone)
> > > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py", line 3487, in __setitem__
> > > > >     self._set_item(key, value)
> > > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py", line 3565, in _set_item
> > > > >     NDFrame._set_item(self, key, value)
> > > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/generic.py", line 3381, in _set_item
> > > > >     self._data.set(key, value)
> > > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1090, in set
> > > > >     blk.set(blk_locs, value_getitem(val_locs))
> > > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 380, in set
> > > > >     self.values[locs] = values
> > > > > ValueError: assignment destination is read-only
> > > > >
> > > > > It's from a test that does conversions from spark to arrow to pandas
> > > > > (so calling pyarrow.Table.to_pandas here
> > > > > <https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/conversion.py#L111-L115>),
> > > > > and on the resulting DataFrame it iterates through all columns,
> > > > > potentially fixing timezones, and writes each column back into the
> > > > > DataFrame (here
> > > > > <https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/types.py#L179-L181>).
> > > > >
> > > > > Since it is giving an error about read-only data, it might be
> > > > > related to the zero-copy behaviour of to_pandas, and thus to the
> > > > > refactor of the arrow->pandas conversion that landed yesterday
> > > > > (https://github.com/apache/arrow/pull/6067, which changed it to do
> > > > > zero-copy for 1-column blocks if possible).
> > > > >
> > > > > I am not sure if something should be fixed in pyarrow for this, but
> > > > > the obvious thing that pyspark can do is specify they don't want
> > > > > zero-copy.
> > > > >
> > > > > Joris
> > > > >
> > > > > On Wed, 15 Jan 2020 at 14:32, Crossbow <cross...@ursalabs.org> wrote:
> > > > >
> > > > > >
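The ValueError in the traceback above is NumPy's standard refusal to write into a read-only array. A tiny illustration, using a manually frozen array to stand in for one backed zero-copy by Arrow memory:

    import numpy as np

    values = np.arange(3.0)
    values.flags.writeable = False  # stand-in for zero-copy Arrow-backed memory

    try:
        values[0] = 42.0  # the kind of in-place write pandas attempts
    except ValueError as exc:
        print(exc)  # assignment destination is read-only

Writing a column back with `pdf[column] = ...` can route through exactly this kind of in-place write inside pandas' block manager, which is why copying the data during conversion (or skipping the write-back entirely) avoids the error.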