Github user hvanhovell commented on the pull request:

    https://github.com/apache/spark/pull/11688#issuecomment-196382153

@mccheah I am fully on board with adding a better `collect_list`/`collect_set` implementation. I recently submitted a similar PR (using mutable `ArrayData` classes): https://github.com/apache/spark/pull/11004.

I have one question: what is the exact use case for this? Collecting elements into a dimension of some sort, or turning a flat DataFrame into a hierarchical DataFrame? For the latter it might be better to create a custom physical operator, much like the current `org.apache.spark.sql.execution.MapGroups` class.

As for your PR: the current GC churn will be quite high, because you create a new Array and `ArrayData` instance after each update (that is why I created custom `ArrayData` classes). One way to counter this would be to take the current `HiveUDAFFunction` approach: disable `supportsPartial` (which you already do) and do not use the buffer at all (this banks on the aggregation operator being sort-based).

As for performance: I am curious to see the differences between the declarative, imperative, and Hive approaches. I cannot imagine the imperative one being so much slower, because we don't do type conversions for imperative operators (only if you use an actual UDAF).
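To make the GC-churn point concrete, here is a minimal standalone sketch (not actual Spark code; the class names `CopyingAgg` and `MutableAgg` are made up for illustration). The first class mimics a buffer that allocates a fresh array on every update, as described above; the second grows a reusable mutable buffer in place, which is roughly what the mutable `ArrayData` classes buy you:

```scala
import scala.collection.mutable.ArrayBuffer

object CollectSketch {
  // Copy-per-update: each add allocates a brand-new array,
  // so N updates cause N allocations plus N array copies.
  final class CopyingAgg {
    private var elems: Array[Int] = Array.empty
    def add(v: Int): Unit = elems = elems :+ v // allocates a new array
    def result: Seq[Int] = elems.toSeq
  }

  // Mutable buffer: appends amortize to O(1) with far fewer
  // allocations, analogous to a mutable ArrayData buffer.
  final class MutableAgg {
    private val elems = ArrayBuffer.empty[Int]
    def add(v: Int): Unit = elems += v // grows in place
    def result: Seq[Int] = elems.toSeq
  }

  def main(args: Array[String]): Unit = {
    val a = new CopyingAgg
    val b = new MutableAgg
    (1 to 5).foreach { v => a.add(v); b.add(v) }
    assert(a.result == b.result)
    println(b.result.mkString(",")) // 1,2,3,4,5
  }
}
```

Both classes produce the same result; the difference is purely in allocation behavior, which is what dominates GC pressure when an aggregate buffer is updated once per input row.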