[ https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861993#comment-16861993 ]
Chris Martin commented on SPARK-27463:
--------------------------------------

Hi [~hyukjin.kwon] - I've just started working on the code side of this (as an aside, I seem unable to assign this Jira to myself - do you know how I can do that?).

Regarding your questions: I don't think there is an analogous API in pandas, although perhaps [~icexelloss] knows of one. Compared to the Dataset cogroup there are obviously a number of similarities, but the biggest difference is that in the Scala version you end up operating on a pair of Scala Iterators, whereas under this proposal you would operate on a pair of pandas DataFrames. As a result the Scala version doesn't necessarily need to hold the entire cogroup in memory, but on the other hand it gives you a much less rich data structure (a Scala Iterator as opposed to a pandas DataFrame). I think this distinction is basically analogous to the one between Python's groupby().apply() and Scala's groupByKey().flatMapGroups(): in each case you end up operating on a data structure that is idiomatic for the language at hand. I've put a rough sketch of what I mean at the bottom of this message.

> Support Dataframe Cogroup via Pandas UDFs
> ------------------------------------------
>
>                 Key: SPARK-27463
>                 URL: https://issues.apache.org/jira/browse/SPARK-27463
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Chris Martin
>            Priority: Major
>
> Recent work on Pandas UDFs in Spark has allowed for improved interoperability between pandas and Spark. This proposal aims to extend this by introducing a new Pandas UDF type which would allow a cogroup operation to be applied to two PySpark DataFrames.
> Full details are in the Google document linked below.
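For concreteness, here is a minimal sketch of the kind of thing this would enable. The method names (cogroup, applyInPandas) are placeholders for whatever the final API ends up being, the as-of join is just one illustrative use case, and the sample data is made up:

{code:python}
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two small Spark DataFrames sharing a grouping key ("id").
df1 = spark.createDataFrame(
    [(1, 1000, 1.0), (1, 2000, 2.0), (2, 1000, 3.0)], ("id", "time", "v1"))
df2 = spark.createDataFrame(
    [(1, 1500, "x"), (2, 500, "y")], ("id", "time", "v2"))

def asof_join(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    # Each side of a single cogroup arrives as a complete pandas
    # DataFrame, so the full pandas API is available - e.g. merge_asof,
    # which needs both inputs sorted on the join key. This is the richer
    # data structure you don't get from the Scala Iterators in
    # Dataset.cogroup.
    return pd.merge_asof(left.sort_values("time"),
                         right.sort_values("time"),
                         on="time", by="id")

# Hypothetical API: cogroup the two grouped DataFrames and apply the
# pandas function to each pair of per-key DataFrames.
result = (df1.groupby("id")
             .cogroup(df2.groupby("id"))
             .applyInPandas(asof_join,
                            schema="id long, time long, v1 double, v2 string"))
result.show()
{code}

The key point of the sketch is the signature of asof_join: it receives both sides of one cogroup as whole pandas DataFrames rather than as iterators, which is exactly the trade-off described above (more memory per group, but a much richer data structure to work with).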