Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18664#discussion_r145271690

    --- Diff: python/pyspark/serializers.py ---
    @@ -259,11 +261,13 @@ def load_stream(self, stream):
             """
             Deserialize ArrowRecordBatches to an Arrow table and return as a list of pandas.Series.
             """
    +        from pyspark.sql.types import _check_dataframe_localize_timestamps
             import pyarrow as pa
             reader = pa.open_stream(stream)
             for batch in reader:
    -            table = pa.Table.from_batches([batch])
    -            yield [c.to_pandas() for c in table.itercolumns()]
    +            # NOTE: changed from pa.Columns.to_pandas, timezone issue in conversion fixed in 0.7.1
    +            pdf = _check_dataframe_localize_timestamps(batch.to_pandas())
    +            yield [c for _, c in pdf.iteritems()]
    --- End diff --

    After running some tests, this change does not significantly degrade performance, but there seems to be a small difference. cc @ueshin

    I ran various columns of random data through a `pandas_udf` repeatedly, with and without this change. The test was in local mode with the default Spark conf, looking at the minimum wall clock time of 10 loops:

    before change: 2.595558
    after change: 2.681813

    Do you think the difference here is acceptable for now, until Arrow is upgraded and we can look into it again?
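For reference, the measurement described in the comment (minimum wall clock time over 10 loops, before and after the change) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script that produced the numbers: `min_wall_clock` and `relative_slowdown` are hypothetical helper names, and the timed callable stands in for whatever job runs the `pandas_udf`.

```python
import time


def min_wall_clock(fn, loops=10):
    """Return the smallest wall-clock duration over `loops` calls of fn.

    Hypothetical helper: fn is any zero-argument callable, for example a
    function that runs the pandas_udf job end to end.
    """
    best = float("inf")
    for _ in range(loops):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def relative_slowdown(before, after):
    """Fractional increase of `after` relative to `before`."""
    return (after - before) / before


# The two timings reported in the comment, in the order given there.
before, after = 2.595558, 2.681813
print("slowdown: {:.1%}".format(relative_slowdown(before, after)))  # slowdown: 3.3%
```

Taking the minimum rather than the mean is a common way to reduce noise from other activity on the machine; on the reported numbers it puts the change at roughly a 3.3% slowdown.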