Julia Maddalena created SPARK-31930:
---------------------------------------

             Summary: Pandas_udf does not properly return ArrayType
                 Key: SPARK-31930
                 URL: https://issues.apache.org/jira/browse/SPARK-31930
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.3
         Environment: Azure Databricks
            Reporter: Julia Maddalena


Returning an ArrayType() from pandas_udf consistently drops specific list elements from the returned value. We were able to create a reproducible example, as below.
{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import ArrayType, LongType

df = spark.createDataFrame([('A', 1), ('A', 2), ('B', 5), ('B', 6), ('C', 10)], ['group', 'val'])

@pandas_udf(ArrayType(ArrayType(LongType())), PandasUDFType.GROUPED_AGG)
def get_list(x):
    # Return the same fixed nested list for every group, ignoring the input.
    return [[1,1], [2,2], [3,3], [4,4], [5,5], [6,6], [7,7], [8,8]]

df.groupby('group').agg(get_list(df['val']).alias('list_col')).show(3, False)
{code}
{code:java}
+-----+-----------------------------+
|group|list_col                     |
+-----+-----------------------------+
|B    |[[1, 1],,,,,, [7, 7], [8, 8]]|
|C    |[[1, 1],,,,,, [7, 7], [8, 8]]|
|A    |[[1, 1],,,,,, [7, 7], [8, 8]]|
+-----+-----------------------------+
{code}
In every example we've come up with, it consistently replaces elements 2-6 with None (as well as some later elements).
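
To make the symptom easier to pin down across other examples, here is a minimal sketch that lists which positions of a returned list differ from what the UDF was supposed to return. It is plain Python and does not touch Spark. The helper name `diff_positions` is hypothetical, and `observed` is hand-copied from one row of the output above with the blank slots written as None; in practice it would presumably come from collecting `list_col` on the driver.

```python
from typing import Any, List, Sequence


def diff_positions(expected: Sequence[Any], observed: Sequence[Any]) -> List[int]:
    """Return the 1-based positions where observed differs from expected.

    Hypothetical helper for illustration only; not part of PySpark.
    Compares position by position over the shorter of the two sequences.
    """
    return [
        pos
        for pos, (want, got) in enumerate(zip(expected, observed), start=1)
        if want != got
    ]


# What get_list returns for every group in the report.
expected = [[n, n] for n in range(1, 9)]

# One row of list_col as shown in the report, blanks written as None.
observed = [[1, 1], None, None, None, None, None, [7, 7], [8, 8]]

print(diff_positions(expected, observed))  # prints [2, 3, 4, 5, 6]
```

Running the same comparison on longer return values would show whether the affected positions stay fixed or shift with the list length.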