[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

icexelloss Sun, 01 Oct 2017 20:17:00 -0700

Github user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/18732
  
    @rxin, `transform` takes a function: pd.Series -> pd.Series and applies 
the function to every column within each group:
    
    ```
    df.show()
    
     id   v1   v2  v3
     a  1.0  4.0  0.0
     a  2.0  5.0  1.0
     a  3.0  6.0  1.0
    
    df.groupby('id').transform(pandas_udf(lambda v: v - v.mean(), DoubleType())).show()
    
     id   v1   v2        v3
     a  -1.0 -1.0    -0.666667
     a   0.0  0.0     0.333333
     a   1.0  1.0     0.333333
    ```
    
    This mimics `pd.DataFrame.groupby().transform`.
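
For comparison, here is a minimal pandas-only sketch of the behaviour being mimicked, using the same toy data as above. The variable names `pdf` and `demeaned` are just for illustration. Note that in pandas the grouping column is left out of the result, whereas the Spark output shown above keeps `id`.

```python
import pandas as pd

# Same toy data as in the Spark example above.
pdf = pd.DataFrame({
    "id": ["a", "a", "a"],
    "v1": [1.0, 2.0, 3.0],
    "v2": [4.0, 5.0, 6.0],
    "v3": [0.0, 1.0, 1.0],
})

# transform calls the function with one pd.Series per column per group
# and expects a pd.Series of the same length back.
demeaned = pdf.groupby("id").transform(lambda v: v - v.mean())
print(demeaned)
```

The values in `v1`, `v2` and `v3` should match the Spark output above, up to floating-point formatting.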
    
    `apply` takes a function: pd.DataFrame -> pd.DataFrame and is similar to 
`flatMapGroups`.
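
A rough pandas-only sketch of the pd.DataFrame -> pd.DataFrame shape, to show why this resembles `flatMapGroups`: the function sees a whole group at once and may return a different number of rows than it received. The data and the helper `summarize` are made up for illustration and are not part of the PR.

```python
import pandas as pd

pdf = pd.DataFrame({
    "id": ["a", "a", "b"],
    "v1": [1.0, 3.0, 5.0],
    "v2": [4.0, 6.0, 8.0],
})


def summarize(group: pd.DataFrame) -> pd.DataFrame:
    # Receives all selected columns of one group at once and may return
    # any number of rows; here, a single summary row per group.
    return pd.DataFrame({
        "rows": [len(group)],
        "v1_mean": [group["v1"].mean()],
        "v2_max": [group["v2"].max()],
    })


# Selecting the value columns explicitly keeps the grouping column out of
# the frame passed to summarize.
summary = pdf.groupby("id")[["v1", "v2"]].apply(summarize)
print(summary)
```

Unlike `transform`, the function here can combine several columns (`v1` and `v2`) in one call, which a per-column pd.Series -> pd.Series function cannot do.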
    
    The name `apply` originates from the R paper "The Split-Apply-Combine 
Strategy for Data Analysis" and is used in both pandas and R to describe this 
operation, so the name `apply` should be pretty straightforward to 
pandas/Python users.



---

