GitHub user icexelloss commented on the issue:

    https://github.com/apache/spark/pull/20295
  
    @cloud-fan Currently I send the group columns along with the data columns. 
For example, if the original DataFrame has columns `id, v` and the group column is 
`id`, the current implementation in this PR sends three series `id, id, v` to the 
Python worker, along with an `argOffsets` of `[1, 2]` to indicate that the data 
columns are `id, v`. The first value of each group column is used as the group 
key, since all values in a group column are equal within a group.
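
    To make the layout concrete, here is a minimal standalone sketch (plain 
pandas, not the actual worker code; the variable names are illustrative) of how 
the worker side can recover the group key and the data columns from the series 
it receives:

    ```python
    import pandas as pd

    # Series as sent for one group in the example above: the group
    # column `id` first, then the data columns `id, v`.
    ids = pd.Series([3, 3, 3])
    vs = pd.Series([0.1, 0.2, 0.3])
    series = [ids, ids, vs]
    arg_offsets = [1, 2]  # offsets of the data columns

    # All values in a group column are equal within a group, so the
    # first value of each group column serves as the group key.
    group_key = tuple(s.iloc[0] for i, s in enumerate(series)
                      if i not in arg_offsets)        # -> (3,)

    # The user's function only sees the data columns.
    data = pd.concat([series[i] for i in arg_offsets], axis=1)
    data.columns = ["id", "v"]
    ```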
    
    I implemented it this way because it doesn't change the existing 
serialization protocol. Alternatively, we could implement a new serialization 
protocol for the GROUP_MAP eval type, i.e., instead of sending just an Arrow 
batch, we could send a group key row followed by an Arrow batch. What do you 
think?
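
    For reference, here is a rough sketch of what that alternative framing 
could look like (purely illustrative: `write_groups` and the length-prefixed 
pickled key are assumptions, not a defined protocol):

    ```python
    import io
    import pickle
    import pyarrow as pa

    def write_groups(sink, groups):
        # For each group: a pickled group-key row prefixed by its
        # length, followed by an Arrow stream with that group's batch.
        for key, batch in groups:
            payload = pickle.dumps(key)
            sink.write(len(payload).to_bytes(4, "big"))
            sink.write(payload)
            with pa.ipc.new_stream(sink, batch.schema) as writer:
                writer.write_batch(batch)

    batch = pa.record_batch([pa.array([3, 3, 3]),
                             pa.array([0.1, 0.2, 0.3])],
                            names=["id", "v"])
    buf = io.BytesIO()
    write_groups(buf, [((3,), batch)])
    ```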

