[ https://issues.apache.org/jira/browse/SPARK-34348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raman Srinivasan updated SPARK-34348:
-------------------------------------

Description:

{code:java}
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

def subtract_mean(pdf):
    # pdf is a pandas.DataFrame
    pdf['count'] = pdf.shape[0]
    return pdf
{code}

Using a DDL-formatted string for the output schema works fine:

{code:java}
df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double, count int").show()
+---+----+-----+
| id|   v|count|
+---+----+-----+
|  1| 1.0|    2|
|  1| 2.0|    2|
|  2| 3.0|    3|
|  2| 5.0|    3|
|  2|10.0|    3|
+---+----+-----+
{code}

But using a StructType schema (appending an integer count column) fails:

{code:java}
from pyspark.sql.types import StructField, IntegerType

df.groupby("id").applyInPandas(subtract_mean, schema=df.schema.add(StructField('count', IntegerType(), False))).show()

AnalysisException: Cannot resolve column name "count" among (id, v);
{code}

It appears to be looking for the new return field in the input schema.

As a workaround, is there a toDDL method I can use to get the current schema as a DDL string to which I can append the new return fields?


> applyInPandas doesn't seem to work with StructType output schema
> -----------------------------------------------------------------
>
>                 Key: SPARK-34348
>                 URL: https://issues.apache.org/jira/browse/SPARK-34348
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.1
>            Reporter: Raman Srinivasan
>            Priority: Major
>
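
A note on possible workarounds, following up on the question above. In PySpark, StructType.add() appends to the instance it is called on and returns it, so df.schema.add(...) also modifies the schema object cached on the DataFrame; whether that mutation is what trips up the analysis here is an assumption, not verified. Under that assumption, a minimal sketch is to build a fresh StructType from the existing fields instead:

{code:java}
from pyspark.sql.types import StructType, StructField, IntegerType

# Sketch of a possible workaround (untested against this exact case):
# construct a new StructType from df.schema.fields rather than
# mutating the cached schema with .add().
out_schema = StructType(df.schema.fields + [StructField('count', IntegerType(), False)])

df.groupby("id").applyInPandas(subtract_mean, schema=out_schema).show()
{code}

On the toDDL question: the Scala-side StructType does expose a toDDL method, so something like df._jdf.schema().toDDL() should yield a DDL string from PySpark, though that goes through the private _jdf attribute rather than a public API.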