[ https://issues.apache.org/jira/browse/ARROW-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16823289#comment-16823289 ]
Bryan Cutler edited comment on ARROW-2590 at 4/22/19 5:55 PM:
--------------------------------------------------------------
I wasn't able to reproduce the exact error from above; when I mix up a long column with string data I get:
{noformat}
pyarrow.lib.ArrowNotImplementedError: No cast implemented from string to int64{noformat}
I am using Linux Ubuntu 18.04 and tried Spark 2.3 and master with pyarrow 0.8.0, 0.10.0, and 0.12.1. If this is a matter of mixing up columns, there was a JIRA, SPARK-24324, that changed column assignment to use name by default, fixed in Spark 2.4.0. You can work around it by specifying column labels explicitly when creating the DataFrame so that they match the expected schema, for example:
{code:java}
@pandas_udf("z string, a long", PandasUDFType.GROUPED_MAP)
def foo(_):
    return pd.DataFrame({'a': [1], 'z': ['hi']}, columns=['z', 'a']){code}
I'll close this for now, but feel free to reopen if this doesn't solve your issue.
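The workaround above hinges on pandas' `columns=` argument fixing the column order to match the declared schema. A minimal pandas-only sketch (no Spark required; the schema order `['z', 'a']` is taken from the example above) of why this matters on pre-2.4 Spark, which pairs returned columns with schema fields by position rather than by name:

```python
import pandas as pd

# Declared UDF schema order: "z string, a long"
schema_order = ['z', 'a']

# Without columns=, the DataFrame keeps dict insertion order: ['a', 'z'].
# Pre-2.4 Spark (before SPARK-24324) pairs columns with schema fields by
# position, so the string column 'z' would land in the long field 'a'.
implicit = pd.DataFrame({'a': [1], 'z': ['hi']})
assert list(implicit.columns) == ['a', 'z']

# With columns=, the frame is reordered to match the schema exactly,
# so positional pairing works even on older Spark versions.
explicit = pd.DataFrame({'a': [1], 'z': ['hi']}, columns=schema_order)
assert list(explicit.columns) == ['z', 'a']
assert explicit.iloc[0].tolist() == ['hi', 1]
```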
> [Python] Pyspark python_udf serialization error on grouped map (Amazon EMR)
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-2590
>                 URL: https://issues.apache.org/jira/browse/ARROW-2590
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>        Environment: Amazon EMR 5.13
>                     Spark 2.3.0
>                     PyArrow 0.9.0 (and 0.8.0)
>                     Pandas 0.22.0 (and 0.21.1)
>                     Numpy 1.14.1
>            Reporter: Daniel Fithian
>            Priority: Critical
>              Labels: spark
>             Fix For: 0.14.0
>
> I am writing a python_udf grouped map aggregation on Spark 2.3.0 in Amazon EMR. When I try to run any aggregation, I get the following Python stack trace:
> {noformat}
> 18/05/16 14:08:56 ERROR Utils: Aborting task
> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/mnt/yarn/usercache/hadoop/appcache/application_1526400761989_0068/container_1526400761989_0068_01_000002/pyspark.zip/pyspark/worker.py", line 229, in main
>     process()
>   File "/mnt/yarn/usercache/hadoop/appcache/application_1526400761989_0068/container_1526400761989_0068_01_000002/pyspark.zip/pyspark/worker.py", line 224, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/mnt/yarn/usercache/hadoop/appcache/application_1526400761989_0068/container_1526400761989_0068_01_000002/pyspark.zip/pyspark/serializers.py", line 261, in dump_stream
>     batch = _create_batch(series, self._timezone)
>   File "/mnt/yarn/usercache/hadoop/appcache/application_1526400761989_0068/container_1526400761989_0068_01_000002/pyspark.zip/pyspark/serializers.py", line 239, in _create_batch
>     arrs = [create_array(s, t) for s, t in series]
>   File "/mnt/yarn/usercache/hadoop/appcache/application_1526400761989_0068/container_1526400761989_0068_01_000002/pyspark.zip/pyspark/serializers.py", line 239, in <listcomp>
>     arrs = [create_array(s, t) for s, t in series]
>   File "/mnt/yarn/usercache/hadoop/appcache/application_1526400761989_0068/container_1526400761989_0068_01_000002/pyspark.zip/pyspark/serializers.py", line 237, in create_array
>     return pa.Array.from_pandas(s, mask=mask, type=t)
>   File "array.pxi", line 372, in pyarrow.lib.Array.from_pandas
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "array.pxi", line 77, in pyarrow.lib._ndarray_to_array
>   File "error.pxi", line 98, in pyarrow.lib.check_status
> pyarrow.lib.ArrowException: Unknown error: 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000){noformat}
> To be clear, this happens when I run any aggregation, including the identity aggregation (return the Pandas DataFrame that was passed in). I do not get this error when I return an empty DataFrame, so it seems to be a symptom of serializing the Pandas DataFrame back to Spark.
> I have observed this behavior with the following versions:
> * Spark 2.3.0
> * PyArrow 0.9.0 (also 0.8.0)
> * Pandas 0.22.0 (also 0.22.1)
> * Numpy 1.14.1
> Here is some sample code:
> {code:java}
> @func.pandas_udf(SCHEMA, func.PandasUDFType.GROUPED_MAP)
> def aggregation(df):
>     return df
>
> df.groupBy('a').apply(aggregation)  # get error{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)