[jira] [Assigned] (SPARK-33189) Support PyArrow 2.0.0+

2020-10-20 Thread Hyukjin Kwon (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-33189:


Assignee: Bryan Cutler

> Support PyArrow 2.0.0+
> ----------------------
>
>                 Key: SPARK-33189
>                 URL: https://issues.apache.org/jira/browse/SPARK-33189
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.1.0
>            Reporter: Hyukjin Kwon
>            Assignee: Bryan Cutler
>            Priority: Major
>
> Some tests fail with PyArrow 2.0.0 in PySpark:
> {code}
> ======================================================================
> ERROR [0.774s]: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 595, in test_grouped_over_window_with_key
>     .select('id', 'result').collect()
>   File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in collect
>     sock_info = self._jdf.collectToPython()
>   File "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
>     answer, self.gateway_client, self.target_id, self.name)
>   File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
>     raise converted from None
> pyspark.sql.utils.PythonException:
>   An exception was thrown from the Python worker. Please see the stack trace below.
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, in main
>     process()
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, in process
>     serializer.dump_stream(out_iter, outfile)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 255, in dump_stream
>     return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 81, in dump_stream
>     for batch in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 248, in init_stream_yield_batches
>     for series in iterator:
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, in mapper
>     return f(keys, vals)
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, in <lambda>
>     return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, in wrapped
>     result = f(key, pd.concat(value_series, axis=1))
>   File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in wrapper
>     return f(*args, **kwargs)
>   File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 590, in f
>     "{} != {}".format(expected_key[i][1], window_range)
> AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<...>), 'end': datetime.datetime(2018, 3, 20, 0, 0, tzinfo=<...>)}
> {code}
> We should verify and support PyArrow 2.0.0+.
> See also https://github.com/apache/spark/runs/1278918780
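
Judging from the AssertionError above, the grouped-map UDF receives timezone-aware window bounds under PyArrow 2.0.0 while the test expects naive datetimes. A minimal standalone sketch of that mismatch (plain Python, not the Spark test itself):

{code}
import datetime

# A naive datetime (no tzinfo) never compares equal to a timezone-aware one
# in Python, even when every date/time field matches.
naive = datetime.datetime(2018, 3, 15, 0, 0)
aware = datetime.datetime(2018, 3, 15, 0, 0, tzinfo=datetime.timezone.utc)

print(naive == aware)                        # False
print({'start': naive} == {'start': aware})  # False -- so the key dicts differ
{code}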
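"Verify and support" would presumably involve branching on the installed PyArrow version. The issue does not spell out the fix; the sketch below is only one illustrative approach, with the normalize helper being hypothetical:

{code}
from distutils.version import LooseVersion

import pyarrow as pa

# Under PyArrow 2.0.0+ the window bounds may carry tzinfo, so normalize
# them before comparing against naive expected datetimes (illustrative only).
if LooseVersion(pa.__version__) >= LooseVersion("2.0.0"):
    def normalize(dt):
        # Drop tzinfo so tz-aware bounds compare against naive expectations.
        return dt.replace(tzinfo=None)
else:
    def normalize(dt):
        return dt
{code}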



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33189) Support PyArrow 2.0.0+

2020-10-20 Thread Apache Spark (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33189:


Assignee: Apache Spark

> Support PyArrow 2.0.0+
> ----------------------
>
>                 Key: SPARK-33189
>                 URL: https://issues.apache.org/jira/browse/SPARK-33189
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.1.0
>            Reporter: Hyukjin Kwon
>            Assignee: Apache Spark
>            Priority: Major






[jira] [Assigned] (SPARK-33189) Support PyArrow 2.0.0+

2020-10-20 Thread Apache Spark (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-33189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33189:


Assignee: (was: Apache Spark)

> Support PyArrow 2.0.0+
> ----------------------
>
>                 Key: SPARK-33189
>                 URL: https://issues.apache.org/jira/browse/SPARK-33189
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.1.0
>            Reporter: Hyukjin Kwon
>            Priority: Major


