[ https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862287#comment-16862287 ]

Li Jin commented on SPARK-27463:
--------------------------------

I think one way to design this API is to mimic the existing Dataset cogroup API:
{code:python}
gdf1 = df1.groupByKey('id')
gdf2 = df2.groupByKey('id')

result = gdf1.cogroup(gdf2).apply(my_pandas_udf){code}
However, KeyValueGroupedDataset and groupByKey aren't really exposed to PySpark 
(or maybe they don't apply to PySpark because of the typing?), so another way to go 
about this is to use RelationalGroupedDataset:
{code:python}
gdf1 = df1.groupBy('id')
gdf2 = df2.groupBy('id')

result = gdf1.cogroup(gdf2).apply(my_pandas_udf){code}
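
For illustration, under either design the user-supplied function could be an ordinary Python function that receives the two per-key pandas DataFrames and returns a single pandas DataFrame. This is just a sketch of one possible my_pandas_udf; the 'time' column and the as-of merge are hypothetical and not part of the proposal:
{code:python}
import pandas as pd

# Hypothetical cogroup function: for each key, receive the matching pandas
# DataFrames from gdf1 and gdf2 and return one combined pandas DataFrame.
# Here it performs an as-of merge on a (hypothetical) 'time' column, keyed by 'id'.
def my_pandas_udf(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    return pd.merge_asof(left.sort_values('time'),
                         right.sort_values('time'),
                         on='time', by='id')
{code}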

> Support Dataframe Cogroup via Pandas UDFs 
> ------------------------------------------
>
>                 Key: SPARK-27463
>                 URL: https://issues.apache.org/jira/browse/SPARK-27463
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Chris Martin
>            Priority: Major
>
> Recent work on Pandas UDFs in Spark has allowed for improved 
> interoperability between Pandas and Spark.  This proposal aims to extend this 
> by introducing a new Pandas UDF type which would allow a cogroup 
> operation to be applied to two PySpark DataFrames.
> Full details are in the Google document linked below.
>  


