[ https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-33189.
----------------------------------
    Fix Version/s: 2.4.8
                   3.0.2
                   3.1.0
       Resolution: Fixed

Issue resolved by pull request 30111
[https://github.com/apache/spark/pull/30111]

> Support PyArrow 2.0.0+
> ----------------------
>
>                 Key: SPARK-33189
>                 URL: https://issues.apache.org/jira/browse/SPARK-33189
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.1.0
>            Reporter: Hyukjin Kwon
>            Assignee: Bryan Cutler
>            Priority: Major
>             Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
> Some tests fail with PyArrow 2.0.0 in PySpark:
> {code}
> ======================================================================
> ERROR [0.774s]: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 595, in test_grouped_over_window_with_key
>     .select('id', 'result').collect()
>   File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in collect
>     sock_info = self._jdf.collectToPython()
>   File "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
>     answer, self.gateway_client, self.target_id, self.name)
>   File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
>     raise converted from None
> pyspark.sql.utils.PythonException:
>   An exception was thrown from the Python worker. Please see the stack trace below.
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, in main
>     process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, in process
>     serializer.dump_stream(out_iter, outfile)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 255, in dump_stream
>     return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 81, in dump_stream
>     for batch in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 248, in init_stream_yield_batches
>     for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, in mapper
>     return f(keys, vals)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, in <lambda>
>     return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, in wrapped
>     result = f(key, pd.concat(value_series, axis=1))
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in wrapper
>     return f(*args, **kwargs)
>   File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 590, in f
>     "{} != {}".format(expected_key[i][1], window_range)
> AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'end': datetime.datetime(2018, 3, 20, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>)}
> {code}
>
> We should verify and support PyArrow 2.0.0+.
> See also https://github.com/apache/spark/runs/1278918780

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
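The AssertionError quoted above boils down to comparing a naive `datetime` (the test's expected window boundary) against a timezone-aware one (what the worker received once PyArrow 2.0.0 attached `Etc/UTC` to the window key). A minimal standalone sketch of that comparison behavior — using `timezone.utc` as a stand-in for the `StaticTzInfo 'Etc/UTC'` shown in the traceback, and not any actual PySpark or PyArrow code:

```python
from datetime import datetime, timezone

# Naive window boundary, as held in the test's expected_key.
naive = datetime(2018, 3, 15, 0, 0)

# The same instant with tzinfo attached, analogous to what
# PyArrow 2.0.0+ produced for the grouped-window key.
aware = datetime(2018, 3, 15, 0, 0, tzinfo=timezone.utc)

# Equality between a naive and an aware datetime is always False,
# which is exactly what trips the test's assertion.
print(naive == aware)

# Stripping (or attaching) tzinfo makes the comparison meaningful again.
print(aware.replace(tzinfo=None) == naive)
```

This is why the fix involves normalizing the timezone on the values coming back from Arrow rather than changing the expected boundaries themselves.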