Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20295

@cloud-fan Currently I send the group columns along with the data columns. For example, if the original DataFrame has columns `id, v` and the group column is `id`, the current implementation in this PR sends three series `id, id, v` to the Python worker, along with an `argOffsets` of `[1, 2]` to specify that the data columns are `id, v`. The first value of the group column is used as the group key, since all values in a group column are equal. I implemented it this way because it doesn't change the existing serialization protocol.

Alternatively, we could implement a new serialization protocol for the GROUP_MAP eval type, i.e., instead of sending only an Arrow batch, we could send a group row followed by an Arrow batch. What do you think?
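To illustrate, here is a minimal sketch (not the actual Spark worker code; the helper name and plain-list "series" are hypothetical) of how the scheme above works: the group column is sent twice, once as the key column and once among the data columns selected by `argOffsets`:

```python
# Hypothetical sketch of the protocol described above: the group column
# is duplicated in the columns sent to the worker, and argOffsets picks
# out the data columns while offset 0 supplies the group key.

def split_key_and_data(series, arg_offsets):
    """series: columns sent to the worker, e.g. [id, id, v].
    arg_offsets: offsets of the data columns, e.g. [1, 2]."""
    # All values in a group column are equal, so the first value
    # of the key column serves as the group key.
    group_key = series[0][0]
    data = [series[i] for i in arg_offsets]
    return group_key, data

# Example: original DataFrame columns (id, v), grouped by id = 3.
id_col = [3, 3, 3]
v_col = [1.0, 2.0, 3.0]
sent = [id_col, id_col, v_col]  # group column sent twice

key, data = split_key_and_data(sent, [1, 2])
# key == 3, data == [[3, 3, 3], [1.0, 2.0, 3.0]]
```

The upside of this layout is that the existing Arrow-batch serialization is reused unchanged; the cost is sending the group column's values redundantly rather than a single key row.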