[ https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431014#comment-16431014 ]
Bryan Cutler commented on SPARK-23929: -------------------------------------- cc [~icexelloss] > pandas_udf schema mapped by position and not by name > ---------------------------------------------------- > > Key: SPARK-23929 > URL: https://issues.apache.org/jira/browse/SPARK-23929 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.3.0 > Environment: PySpark > Spark 2.3.0 > > Reporter: Omri > Priority: Major > > The return struct of a pandas_udf should be mapped to the provided schema by > name. Currently it's not the case. > Consider these two examples, where the only change is the order of the fields > in the provided schema struct: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP) > def normalize(pdf): > v = pdf.v > return pdf.assign(v=(v - v.mean()) / v.std()) > df.groupby("id").apply(normalize).show() > {code} > and this one: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP) > def normalize(pdf): > v = pdf.v > return pdf.assign(v=(v - v.mean()) / v.std()) > df.groupby("id").apply(normalize).show() > {code} > The results should be the same but they are different: > For the first code: > {code:java} > +---+---+ > | v| id| > +---+---+ > |1.0| 0| > |1.0| 0| > |2.0| 0| > |2.0| 0| > |2.0| 1| > +---+---+ > {code} > For the second code: > {code:java} > +---+-------------------+ > | id| v| > +---+-------------------+ > | 1|-0.7071067811865475| > | 1| 0.7071067811865475| > | 2|-0.8320502943378437| > | 2|-0.2773500981126146| > | 2| 1.1094003924504583| > +---+-------------------+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org