[ https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432344#comment-16432344 ]
Omri commented on SPARK-23929:
------------------------------

[~icexelloss], I couldn't recreate the problem I had where the order was mixed, but I have a different example to illustrate the problem. Here the schema struct is [id, zeros, ones] but the UDF returns a data frame with columns [id, ones, zeros]. The column names are taken from the provided schema by position rather than matched against the column names of the returned data frame.

{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pandas as pd

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

schema = StructType([
    StructField("id", LongType()),
    StructField("zeros", DoubleType()),
    StructField("ones", DoubleType())
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def median_per_group(grp):
    return pd.DataFrame({"id": grp.iloc[0]["id"], "ones": 1, "zeros": 0}, index=[0])

df.groupby("id").apply(median_per_group).show()
{code}

results:

{code:java}
+---+-----+----+
| id|zeros|ones|
+---+-----+----+
|  1|  1.0| 0.0|
|  2|  1.0| 0.0|
+---+-----+----+
{code}

So the user expects ones to contain 1 and zeros to contain 0, but gets the values swapped.

> pandas_udf schema mapped by position and not by name
> ----------------------------------------------------
>
>                 Key: SPARK-23929
>                 URL: https://issues.apache.org/jira/browse/SPARK-23929
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>         Environment: PySpark
> Spark 2.3.0
>            Reporter: Omri
>          Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by
> name. Currently it's not the case.
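Until the mapping behavior changes, one possible workaround (a sketch, not an official Spark recommendation) is to explicitly reorder the returned pandas DataFrame's columns to match the declared schema before returning it, since Spark 2.3.0 maps columns to the schema by position. The snippet below demonstrates just the reordering step with plain pandas; the `SCHEMA_FIELDS` list is an assumption mirroring the [id, zeros, ones] schema from the example above.

{code:java}
import pandas as pd

# Field order as declared in the pandas_udf schema: [id, zeros, ones]
SCHEMA_FIELDS = ["id", "zeros", "ones"]

def median_per_group(grp):
    # Build the result with columns in an arbitrary order, as in the report.
    out = pd.DataFrame({"id": grp.iloc[0]["id"], "ones": 1.0, "zeros": 0.0},
                       index=[0])
    # Workaround: select columns in schema order, because Spark 2.3.0
    # assigns them to the schema fields positionally, not by name.
    return out[SCHEMA_FIELDS]

# Quick check outside Spark: column order now matches the schema.
sample = pd.DataFrame({"id": [1, 1], "v": [1.0, 2.0]})
result = median_per_group(sample)
print(list(result.columns))  # ['id', 'zeros', 'ones']
{code}

With this reordering in place, zeros holds 0.0 and ones holds 1.0 regardless of the order in which the dict keys were written.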
> Consider these two examples, where the only change is the order of the fields
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
>
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))
>
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
>
> df.groupby("id").apply(normalize).show()
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
>
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))
>
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
>
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+-------------------+
> | id|                  v|
> +---+-------------------+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+-------------------+
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org