[ https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861993#comment-16861993 ]
Chris Martin commented on SPARK-27463:
--------------------------------------

Hi [~hyukjin.kwon] - I've just started working on the code side of this (as an aside, I seem unable to assign this Jira to myself - do you know how I can do that?).

Regarding your questions: I don't think there is an analogous API in pandas, although perhaps [~icexelloss] knows of one. Compared to the Dataset cogroup there are obviously a number of similarities, but the biggest difference is that in the Scala version you end up operating on a pair of Scala Iterators, whereas under this proposal you would operate on a pair of pandas DataFrames. As a result the Scala version doesn't necessarily need to hold the entire cogroup in memory, but on the other hand it gives you a much less rich data structure (a Scala Iterator as opposed to a pandas DataFrame). I think this distinction is basically analogous to the one between Python's groupby().apply() and Scala's groupByKey().flatMapGroups(): in each case you end up operating on a data structure that is idiomatic for the language at hand. I've put a rough sketch of what I mean at the bottom of this message.

> Support Dataframe Cogroup via Pandas UDFs
> ------------------------------------------
>
>                 Key: SPARK-27463
>                 URL: https://issues.apache.org/jira/browse/SPARK-27463
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Chris Martin
>            Priority: Major
>
> Recent work on Pandas UDFs in Spark has allowed for improved interoperability between pandas and Spark. This proposal aims to extend this by introducing a new Pandas UDF type which would allow a cogroup operation to be applied to two PySpark DataFrames.
> Full details are in the Google document linked below.
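For concreteness, here is a minimal sketch of the kind of thing this would enable. The method names (cogroup, applyInPandas) are placeholders for whatever the final API ends up being, the as-of join is just one illustrative use case, and the sample data is made up:

{code:python}
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two small Spark DataFrames sharing a grouping key ("id").
df1 = spark.createDataFrame(
    [(1, 1000, 1.0), (1, 2000, 2.0), (2, 1000, 3.0)], ("id", "time", "v1"))
df2 = spark.createDataFrame(
    [(1, 1500, "x"), (2, 500, "y")], ("id", "time", "v2"))

def asof_join(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    # Each side of a single cogroup arrives as a complete pandas
    # DataFrame, so the full pandas API is available - e.g. merge_asof,
    # which needs both inputs sorted on the join key. This is the richer
    # data structure you don't get from the Scala Iterators in
    # Dataset.cogroup.
    return pd.merge_asof(left.sort_values("time"),
                         right.sort_values("time"),
                         on="time", by="id")

# Hypothetical API: cogroup the two grouped DataFrames and apply the
# pandas function to each pair of per-key DataFrames.
result = (df1.groupby("id")
             .cogroup(df2.groupby("id"))
             .applyInPandas(asof_join,
                            schema="id long, time long, v1 double, v2 string"))
result.show()
{code}

The key point of the sketch is the signature of asof_join: it receives both sides of one cogroup as whole pandas DataFrames rather than as iterators, which is exactly the trade-off described above (more memory per group, but a much richer data structure to work with).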