Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20211#discussion_r161371619
  
    --- Diff: python/pyspark/sql/group.py ---
    @@ -233,6 +233,27 @@ def apply(self, udf):
             |  2| 1.1094003924504583|
             +---+-------------------+
     
    +        Notes on grouping column:
    --- End diff ---
    
    To my knowledge, the current implementation follows Pandas's default groupby-apply behaviour when the function maps a Pandas DataFrame to a Pandas DataFrame (correct me if I am wrong). So, I was thinking we shouldn't start by prepending the grouping columns; instead, we could consider a `gapply`-like approach somehow.
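    Just to illustrate what I mean by the Pandas default, a minimal sketch (the column names here are made up for illustration, not taken from the PR):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2], "v": [1.0, 2.0, 3.0, 4.0]})

def subtract_mean(pdf):
    # The function receives the whole group as a DataFrame (the grouping
    # column is part of it), and the result has exactly the columns the
    # function returns; nothing is prepended automatically.
    return pdf.assign(v=pdf.v - pdf.v.mean())

print(df.groupby("id").apply(subtract_mean))
```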
    
    I think it's still feasible to support both ideas: if the given function takes a single argument, we pass the group as a pdf; if it takes two arguments, we pass the key and the pdf. That way, the `gapply`-like behaviour can be supported optionally.
    
    It's a rough idea, but I think we can do this in theory since we can `inspect` the function ahead of the actual computation.
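    A very rough sketch of the dispatch I am imagining (`_wrap_grouped_map_func` and its error message are made up for illustration, not an actual API proposal):

```python
import inspect

def _wrap_grouped_map_func(func):
    # Inspect the user function once, before any computation, to decide
    # how each group should be passed to it.
    num_args = len(inspect.signature(func).parameters)

    if num_args == 1:
        # Current behaviour: the function only sees the group as a pdf.
        return lambda key, pdf: func(pdf)
    elif num_args == 2:
        # `gapply`-like behaviour: the function also receives the grouping key.
        return lambda key, pdf: func(key, pdf)
    else:
        raise ValueError("the function should take either (pdf) or (key, pdf)")
```

    Then the internal code can always call the wrapped function as `wrapped(key, pdf)` regardless of which form the user wrote.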
    
    WDYT guys?

