Github user icexelloss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20211#discussion_r161085442
  
    --- Diff: python/pyspark/sql/group.py ---
    @@ -233,6 +233,27 @@ def apply(self, udf):
             |  2| 1.1094003924504583|
             +---+-------------------+
     
    +        Notes on grouping column:
    --- End diff --
    
    @cloud-fan Good point. I did some more research on `KeyValueGroupedDataset.flatMapGroups` and also `gapply` in SparkR. These two seem to have consistent behavior:
    
    * The grouping key is passed as the first argument to the udf; the second argument is the actual data.
    * No grouping key is prepended to the result, i.e., the output of the udf is the schema of the resulting Spark DataFrame.
    
    I think this could be a good option for `groupby apply` too, for the following reasons:
    * It is consistent with similar APIs in Spark, i.e., `flatMapGroups` and `gapply`.
    * It is flexible enough to implement the two use cases above:
    ```
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    df = ... # id int, v double

    @pandas_udf('id int, v double', PandasUDFType.GROUP_MAP)
    def foo(key, pdf):
        # key is not used
        return pdf.assign(v=pdf.v + 1)

    df.groupby('id').apply(foo)
    ```
    and
    ```
    import pandas as pd
    import statsmodels.api as sm

    # df has four columns: id, y, x1, x2

    group_column = 'id'
    y_column = 'y'
    x_columns = ['x1', 'x2']
    schema = df.select(group_column, *x_columns).schema

    # Input/output are both a pandas.DataFrame
    @pandas_udf(schema, PandasUDFType.GROUP_MAP)
    def ols(key, pdf):
        y = pdf[y_column]
        X = pdf[x_columns]
        X = sm.add_constant(X)
        model = sm.OLS(y, X).fit()
        # key is a tuple of grouping values, so convert it to a list
        # before concatenating with the coefficients
        return pd.DataFrame([list(key) + [model.params[i] for i in x_columns]])

    beta = df.groupby(group_column).apply(ols)
    ```
    In the second example here, the output schema of the udf is still coupled with the grouping key, but the function itself is decoupled from it.
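    
    To make the proposed semantics concrete, here is a minimal pandas-only sketch of the calling convention (no Spark involved): for each group, the framework would invoke the udf with `(key, group_data)` and concatenate the outputs. The helper name `apply_with_key` is hypothetical, purely for illustration:
    
    ```python
    import pandas as pd

    def apply_with_key(pdf, group_column, udf):
        # Simulate the proposed semantics: call the udf with (key, group_data)
        # for each group, then concatenate the per-group results.
        pieces = []
        for key, group in pdf.groupby(group_column):
            # Normalize the key to a tuple so multi-column grouping works too.
            key = key if isinstance(key, tuple) else (key,)
            pieces.append(udf(key, group))
        return pd.concat(pieces, ignore_index=True)

    df = pd.DataFrame({'id': [1, 1, 2], 'v': [1.0, 2.0, 3.0]})

    def foo(key, pdf):
        # key is not used; the udf's output columns define the result schema
        return pdf.assign(v=pdf.v + 1)

    result = apply_with_key(df, 'id', foo)
    # result has columns id, v with v incremented by 1 per row
    ```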
    
    The downside is that this now differs from the pandas groupby-apply API (the pandas one passes only a single argument), but we cannot be consistent with everything...
    
    Thoughts?

