Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18664#discussion_r145271690
  
    --- Diff: python/pyspark/serializers.py ---
    @@ -259,11 +261,13 @@ def load_stream(self, stream):
             """
             Deserialize ArrowRecordBatches to an Arrow table and return as a list of pandas.Series.
             """
    +        from pyspark.sql.types import _check_dataframe_localize_timestamps
             import pyarrow as pa
             reader = pa.open_stream(stream)
             for batch in reader:
    -            table = pa.Table.from_batches([batch])
    -            yield [c.to_pandas() for c in table.itercolumns()]
    +            # NOTE: changed from pa.Columns.to_pandas, timezone issue in conversion fixed in 0.7.1
    +            pdf = _check_dataframe_localize_timestamps(batch.to_pandas())
    +            yield [c for _, c in pdf.iteritems()]
    --- End diff ---
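
    As a side note for reviewers: `_check_dataframe_localize_timestamps` is not shown in this diff; it converts any timezone-aware timestamp columns in the DataFrame back to timezone-naive values in local time before the Series are yielded. A minimal sketch of the idea (not the exact helper, which lives in `pyspark.sql.types`; the function name here is just for illustration):

    ```python
    from dateutil import tz
    from pandas.api.types import is_datetime64tz_dtype

    def localize_timestamps(pdf):
        # For each timezone-aware column, convert to local time and drop
        # the tz info so downstream code sees naive local timestamps.
        for column in pdf.columns:
            series = pdf[column]
            if is_datetime64tz_dtype(series.dtype):
                pdf[column] = series.dt.tz_convert(tz.tzlocal()).dt.tz_localize(None)
        return pdf
    ```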
    
    After running some tests, this change does not significantly degrade performance, but there seems to be a small difference.  cc @ueshin
    
    I ran various columns of random data through a `pandas_udf` repeatedly with and without this change.  The test was run in local mode with the default Spark conf, looking at the min wall clock time over 10 loops.
    
    before change: 2.595558
    after change: 2.681813
    
    Do you think the difference here is acceptable for now, until Arrow is upgraded and we can look into it again?
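
    For reference, the timing harness was roughly along these lines (a sketch only; the column types, data sizes, and the `plus_one` UDF here are stand-ins, not the exact test):

    ```python
    import time

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Hypothetical setup: a single column of random doubles.
    df = spark.range(1 << 20).selectExpr("rand() AS v")

    @pandas_udf(DoubleType())
    def plus_one(v):
        return v + 1

    times = []
    for _ in range(10):
        start = time.time()
        # Aggregate over the UDF result so the projection is not pruned
        # away and the UDF actually runs.
        df.select(plus_one(df["v"]).alias("r")).agg({"r": "max"}).collect()
        times.append(time.time() - start)

    print("min wall clock time: %f" % min(times))
    ```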

