Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20211#discussion_r161085442 --- Diff: python/pyspark/sql/group.py --- @@ -233,6 +233,27 @@ def apply(self, udf): | 2| 1.1094003924504583| +---+-------------------+ + Notes on grouping column: --- End diff -- @cloud-fan Good point. I did some more research on `KeyValueGroupedDataset. flatMapGroups ` and also `gapply` in SparkR. These two seem to have consistent behavior: * Grouping key is passed as the first arg to the udf, the second arg is the actual data * No grouping key is prepend to the result, i.e., the output of the udf is the schema of the result Spark DataFrame. I think this could be a good option for `groupby apply` too for the following reason: * Consistent with similar API in Spark, i.e., `flatMapGroups` and `gapply` * Is flexible enough to implement the two use cases above: ``` df = ... # id int, v double pandas_udf('id int, v double', GROUP_MAP) def foo(key, pdf): # key is not used return pdf.assign(v=pdf.v+1) df.groupby('id').apply(foo) ``` and ``` import statsmodels.api as sm # df has four columns: id, y, x1, x2 group_column = 'id' y_column = 'y' x_columns = ['x1', 'x2'] schema = df.select(group_column, *x_columns).schema @pandas_udf(schema, PandasUDFType.GROUP_MAP) # Input/output are both a pandas.DataFrame def ols(key, pdf): y = pdf[y_column] X = pdf[x_columns] X = sm.add_constant(X) model = sm.OLS(y, X).fit() return pd.DataFrame([key + [model.params[i] for i in x_columns]]) beta = df.groupby(group_column).apply(ols) ``` In the second example here, output schema for the udf is still coupled with grouping key, but the function itself is decoupled. The downside is it's now different from the pandas groupby apply API (the pandas one only takes one argument), but we cannot be consistent with everything... Thoughts?
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org