Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20295 Hi all, I did some digging and I think adding a serialization form that serialize a key object along with a Arrow record batch is quite complicated because we are using ArrowStreamReader/Writer for sending batches and send extra key data would have to use a lower level Arrow API for sending/receiving batches. I did two things to convince myself the current approach is fine: * I add logic to de duplicate grouping key they are already in data columns. i.e., if a user calls ``` df.groupby('id').apply(foo_udf) ``` We will not send extra grouping columns because those are already part of data columns. Instead, we will just use the corresponding data column to get grouping key to pass to user function. However, if user calls: ``` df.groupby(df.id % 2).apply(foo_udf) ``` then an extra column `df.id % 2` will be created and sent to python worker. But I think this is an uncommon case. * I did some benchmark to see the impact of sending extra grouping column. I used a Spark DataFrame of a single column to maximize the effect of the extra grouping column (basically sending extra grouping column will double the data to be sent to python in the benchmark, however in real use cases the effect of sending extra grouping columns should be far less). Even with the setting of the benchmark, I have not observed performance diffs when sending extra grouping columns, I think this is because the total time is dominated by other parts of the computation. [micro benchmark](https://gist.github.com/icexelloss/88f6c6fdaf04aac39d68d74cd0942c07) I'd like to leave the work for more flexible arrow serialization as future work because it doesn't seems to affect performance of this patch and proceed with the current patch based on the two points above. What do you guys think?
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org