[ https://issues.apache.org/jira/browse/SPARK-31930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133921#comment-17133921 ]
Hyukjin Kwon commented on SPARK-31930:
--------------------------------------

It seems to depend on which version you use. I can't reproduce this in the latest master:

{code}
+-----+----------------------------------------------------------------+
|group|list_col                                                        |
+-----+----------------------------------------------------------------+
|B    |[[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8]]|
|C    |[[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8]]|
|A    |[[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8]]|
+-----+----------------------------------------------------------------+
{code}

We should identify which JIRA fixed this and see whether it can be backported. Alternatively, it may have been fixed in a newer version of pyarrow or pandas.

> Pandas_udf does not properly return ArrayType
> ---------------------------------------------
>
>                 Key: SPARK-31930
>                 URL: https://issues.apache.org/jira/browse/SPARK-31930
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>         Environment: Azure Databricks
>            Reporter: Julia Maddalena
>            Priority: Major
>
> Attempting to return an ArrayType() from pandas_udf reveals a consistent
> error with skipping specific list elements upon return.
> We were able to create a reproducible example, as below.
> {code:java}
> df = spark.createDataFrame([('A', 1), ('A', 2), ('B', 5), ('B', 6), ('C', 10)], ['group', 'val'])
>
> @pandas_udf(ArrayType(ArrayType(LongType())), PandasUDFType.GROUPED_AGG)
> def get_list(x):
>     return [[1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7], [8,8]]
>
> df.groupby('group').agg(get_list(df['val']).alias('list_col')).show(3, False)
> {code}
> {code:java}
> +-----+-----------------------------+
> |group|list_col                     |
> +-----+-----------------------------+
> |B    |[[1, 1],,,,,, [7, 7], [8, 8]]|
> |C    |[[1, 1],,,,,, [7, 7], [8, 8]]|
> |A    |[[1, 1],,,,,, [7, 7], [8, 8]]|
> +-----+-----------------------------+
> {code}
>
> In every example we've come up with, it consistently replaces elements 2-6
> with None (as well as some later elements too).
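
Since the behavior appears to depend on the installed versions, a self-contained variant of the reproduction that also records those versions may help narrow down where the fix landed. The following is only a sketch: the helper names (`installed_versions`, `missing_positions`, `run_repro`) are hypothetical and not part of Spark, and the imports are the ones the reported snippet would need, assuming an existing SparkSession is passed in as `spark`.

```python
from importlib.metadata import PackageNotFoundError, version

# The value returned by the UDF in the report: [[1, 1], [2, 2], ..., [8, 8]].
EXPECTED = [[i, i] for i in range(1, 9)]


def installed_versions(packages=("pyspark", "pyarrow", "pandas")):
    """Return {package: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def missing_positions(nested):
    """Zero-based indices of elements that came back as None."""
    return [i for i, item in enumerate(nested) if item is None]


def run_repro(spark):
    """Re-run the reported aggregation on an existing SparkSession.

    Returns {group: indices of list elements that came back as None}.
    """
    # Imported lazily so the helpers above work without Spark installed.
    from pyspark.sql.functions import PandasUDFType, pandas_udf
    from pyspark.sql.types import ArrayType, LongType

    df = spark.createDataFrame(
        [("A", 1), ("A", 2), ("B", 5), ("B", 6), ("C", 10)],
        ["group", "val"],
    )

    @pandas_udf(ArrayType(ArrayType(LongType())), PandasUDFType.GROUPED_AGG)
    def get_list(x):
        return EXPECTED

    rows = df.groupby("group").agg(get_list(df["val"]).alias("list_col")).collect()
    return {row["group"]: missing_positions(row["list_col"]) for row in rows}
```

Running `installed_versions()` and `run_repro(spark)` on both the affected 2.4.3 environment and a recent build, then comparing the results, would show whether the difference tracks the Spark version or the pyarrow/pandas versions.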