[ https://issues.apache.org/jira/browse/SPARK-20396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin updated SPARK-20396:
--------------------------------
    Issue Type: Sub-task  (was: New Feature)
        Parent: SPARK-22216

> groupBy().apply() with pandas udf in pyspark
> --------------------------------------------
>
>                 Key: SPARK-20396
>                 URL: https://issues.apache.org/jira/browse/SPARK-20396
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.1.0
>            Reporter: Li Jin
>
> Split-apply-merge is a common pattern when analyzing data. It is implemented
> in many popular data analysis libraries such as Spark, pandas, and R.
> The split and merge operations in these libraries are similar to each other,
> mostly implemented by grouping operators: for instance, Spark DataFrame has
> groupBy and pandas DataFrame has groupby. Therefore, users familiar with
> either Spark DataFrame or pandas DataFrame will have little difficulty
> understanding how grouping works in the other library. The apply step,
> however, is more native to each library and therefore differs considerably
> between them: a pandas user who knows how to use apply to do a certain
> transformation in pandas might not know how to do the same using PySpark.
> Also, the current implementation of passing data from the Java executor to
> the Python executor is not efficient; there is an opportunity to speed it up
> using Apache Arrow. This feature can enable use cases that combine Spark's
> grouping operators, such as groupBy, rollup, cube, and window, with pandas's
> native apply operator.
> Related work:
> SPARK-13534
> This enables faster data serialization between PySpark and pandas using
> Apache Arrow. Our work will build on top of this and use the same
> serialization for pandas udfs.
> SPARK-12919 and SPARK-12922
> These implemented two functions, dapply and gapply, in SparkR, which follow
> the same split-apply-merge pattern that we want to implement in PySpark.
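The split-apply-merge pattern the issue describes can be sketched with the Python standard library alone (no Spark, pandas, or Arrow required). The sample data and the per-group demean function below are purely illustrative and not part of the proposed PySpark API:

```python
from itertools import groupby
from operator import itemgetter

rows = [("a", 1), ("b", 4), ("a", 3), ("b", 2), ("a", 2)]

# Split: group rows by key (itertools.groupby requires sorted input;
# Python's sort is stable, so within-group order is preserved).
rows.sort(key=itemgetter(0))
groups = {k: [v for _, v in g] for k, g in groupby(rows, key=itemgetter(0))}

# Apply: run a function on each group independently.
def demean(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Merge: combine the per-group results back into one structure.
result = {k: demean(vs) for k, vs in groups.items()}
print(result)
```

A grouped pandas UDF generalizes this idea: the split and merge happen on Spark's distributed groupBy, while the apply step runs a user's pandas function on each group's data, shipped efficiently via Arrow.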
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org