[ https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432344#comment-16432344 ]
Omri commented on SPARK-23929:
------------------------------

[~icexelloss], I couldn't recreate the problem I had where the order was mixed, but I have a different example to illustrate the problem. Here the schema struct is [id, zeros, ones] but the UDF returns a data frame with columns [id, ones, zeros]. The column names are taken from the provided schema by position rather than matched against the column names of the returned data frame.

{code:java}
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pandas as pd

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

schema = StructType([
    StructField("id", LongType()),
    StructField("zeros", DoubleType()),
    StructField("ones", DoubleType())
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def median_per_group(grp):
    return pd.DataFrame({"id": grp.iloc[0]["id"], "ones": 1, "zeros": 0}, index=[0])

df.groupby("id").apply(median_per_group).show()
{code}

results:

{code:java}
+---+-----+----+
| id|zeros|ones|
+---+-----+----+
|  1|  1.0| 0.0|
|  2|  1.0| 0.0|
+---+-----+----+
{code}

So the user expects ones to contain 1 and zeros to contain 0, but gets the values swapped.

> pandas_udf schema mapped by position and not by name
> ----------------------------------------------------
>
>                 Key: SPARK-23929
>                 URL: https://issues.apache.org/jira/browse/SPARK-23929
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>         Environment: PySpark
> Spark 2.3.0
>            Reporter: Omri
>          Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by
> name. Currently it's not the case.
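Until the mapping behavior changes, one possible workaround (a sketch, not an official Spark recommendation) is to explicitly reorder the returned pandas DataFrame's columns to match the declared schema before returning it, since Spark 2.3.0 maps columns to the schema by position. The snippet below demonstrates just the reordering step with plain pandas; the `SCHEMA_FIELDS` list is an assumption mirroring the [id, zeros, ones] schema from the example above.

{code:java}
import pandas as pd

# Field order as declared in the pandas_udf schema: [id, zeros, ones]
SCHEMA_FIELDS = ["id", "zeros", "ones"]

def median_per_group(grp):
    # Build the result with columns in an arbitrary order, as in the report.
    out = pd.DataFrame({"id": grp.iloc[0]["id"], "ones": 1.0, "zeros": 0.0},
                       index=[0])
    # Workaround: select columns in schema order, because Spark 2.3.0
    # assigns them to the schema fields positionally, not by name.
    return out[SCHEMA_FIELDS]

# Quick check outside Spark: column order now matches the schema.
sample = pd.DataFrame({"id": [1, 1], "v": [1.0, 2.0]})
result = median_per_group(sample)
print(list(result.columns))  # ['id', 'zeros', 'ones']
{code}

With this reordering in place, zeros holds 0.0 and ones holds 1.0 regardless of the order in which the dict keys were written.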
> Consider these two examples, where the only change is the order of the fields
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
>
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))
>
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
>
> df.groupby("id").apply(normalize).show()
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
>
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))
>
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
>
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+-------------------+
> | id|                  v|
> +---+-------------------+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+-------------------+
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org