[jira] [Created] (SPARK-31930) Pandas_udf does not properly return ArrayType

Julia Maddalena (Jira) Mon, 08 Jun 2020 08:08:14 -0700

Julia Maddalena created SPARK-31930:
---------------------------------------


             Summary: Pandas_udf does not properly return ArrayType
                 Key: SPARK-31930
                 URL: https://issues.apache.org/jira/browse/SPARK-31930
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.3
         Environment: Azure Databricks
            Reporter: Julia Maddalena


Returning an ArrayType() from a pandas_udf consistently drops specific list 
elements: they come back as null in the result. 

We were able to create a reproducible example, as below. 
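{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import ArrayType, LongType

# `spark` is the SparkSession provided by the Databricks notebook
df = spark.createDataFrame(
    [('A', 1), ('A', 2), ('B', 5), ('B', 6), ('C', 10)], ['group', 'val'])

# Grouped-aggregate pandas UDF that ignores its input and returns a fixed nested list
@pandas_udf(ArrayType(ArrayType(LongType())), PandasUDFType.GROUPED_AGG)
def get_list(x):
    return [[1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7], [8,8]]

df.groupby('group').agg(get_list(df['val']).alias('list_col')).show(3, False)
{code}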
The resulting output:
{code:java}
+-----+-----------------------------+
|group|list_col                     |
+-----+-----------------------------+
|B    |[[1, 1],,,,,, [7, 7], [8, 8]]|
|C    |[[1, 1],,,,,, [7, 7], [8, 8]]|
|A    |[[1, 1],,,,,, [7, 7], [8, 8]]|
+-----+-----------------------------+
{code}
 

 

In every example we've come up with, elements 2 through 6 of the returned list 
are consistently replaced with None, and some later elements are as well. 
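
To pin down exactly which positions are affected, the expected list can be compared against the value that comes back. This is a minimal sketch in plain Python, not part of the original report: `null_positions` is a hypothetical helper, and `actual` is typed in by hand to mirror the `list_col` value shown in the output above.

```python
def null_positions(expected, actual):
    """Return 1-indexed positions where `actual` is None but `expected` is not."""
    return [
        i
        for i, (exp, act) in enumerate(zip(expected, actual), start=1)
        if act is None and exp is not None
    ]


# The list the UDF returns in the reproducible example.
expected = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8]]

# Hand-copied from the displayed output (assumption: empty slots are nulls).
actual = [[1, 1], None, None, None, None, None, [7, 7], [8, 8]]

print(null_positions(expected, actual))  # prints [2, 3, 4, 5, 6]
```

In a live session, `actual` could presumably be taken from a collected row (for example `row['list_col']`) instead of being typed in, which would make it easy to check whether the affected positions change with the list length.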

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

