Github user hvanhovell commented on the pull request:

    https://github.com/apache/spark/pull/11688#issuecomment-196382153
  
    @mccheah I am fully on board with adding a better 
`collect_list`/`collect_set` implementation. I recently submitted a similar PR 
(using mutable `ArrayData` classes): 
https://github.com/apache/spark/pull/11004. 
    
    I have one question: what is the exact use case for this? Collecting 
elements into a dimension of some sort, or turning a flat DataFrame into a 
hierarchical one? For the latter it might be better to create a custom 
physical operator, much like the current 
`org.apache.spark.sql.execution.MapGroups` class.
    
    As for your PR: the current GC churn will be quite high because you 
create a new `Array` and `ArrayData` instance after each update (which is why 
I created custom `ArrayData` classes). One way to counter this would be to 
take the current `HiveUDAFFunction` approach: disable `supportsPartial` (which 
you already do) and do not use the buffer at all (this banks on the 
aggregation operator being sort-based). 
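    To illustrate the churn concern, here is a minimal, self-contained sketch 
(not Spark's actual Catalyst API; class names are hypothetical) contrasting a 
copy-on-update buffer, which allocates a fresh array per row the way wrapping 
a new `ArrayData` around a new `Array` does, with a mutable buffer that grows 
amortized and produces far fewer short-lived objects:

    ```scala
    // Hypothetical illustration only: mirrors the allocation pattern under
    // discussion, not the real Catalyst aggregate interfaces.

    // Copy-on-update: every add allocates and copies a brand-new array,
    // so an aggregation over N rows creates N throwaway arrays.
    final class CopyingBuffer {
      private var elems: Array[Int] = Array.empty[Int]
      def add(v: Int): Unit = {
        val next = new Array[Int](elems.length + 1) // fresh allocation per row
        System.arraycopy(elems, 0, next, 0, elems.length)
        next(elems.length) = v
        elems = next
      }
      def result: Array[Int] = elems
    }

    // Mutable buffer: appends in place with amortized growth, avoiding the
    // per-row array churn (the idea behind the mutable ArrayData classes).
    final class GrowableBuffer {
      private val elems = scala.collection.mutable.ArrayBuffer.empty[Int]
      def add(v: Int): Unit = elems += v
      def result: Array[Int] = elems.toArray
    }
    ```

    Both produce the same result; the difference is purely in how many 
intermediate arrays the garbage collector has to clean up.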
    
    As for performance: I am curious to see the differences between the 
declarative, imperative and Hive approaches. I cannot imagine the imperative 
one being that much slower, because we don't do type conversions for 
imperative operators (only if you use an actual UDAF).
