GitHub user thunterdb commented on the issue:

    https://github.com/apache/spark/pull/17419

I looked a bit deeper into the performance aspect. Here are some quick insights:
- There was an immediate bottleneck in `VectorUDT`; fixing it already improves performance by 3x.
- It is not clear whether switching to pure Breeze operations helps, given the overhead for tiny vectors. I will need to do more analysis on larger vectors.
- Now, most of the time is roughly split between `ObjectAggregationIterator.processInputs` (40%), some codegen'ed expression (20%), and our own `MetricsAggregate.update` (35%).

That benchmark focuses on the overhead of Catalyst. I will do another benchmark with dense vectors to see how it fares in practice with more realistic data.
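To put the time split above in perspective, here is a rough Amdahl-style estimate of how much each component could contribute to an overall speedup. It is only a sketch: it assumes the quoted percentages are non-overlapping fractions of total run time from the same profile, and the helper name `max_speedup` is made up for illustration.

```python
# Approximate time shares quoted in the comment (they sum to 95%, not 100%).
shares = {
    "ObjectAggregationIterator.processInputs": 0.40,
    "codegen'ed expression": 0.20,
    "MetricsAggregate.update": 0.35,
}


def max_speedup(share: float) -> float:
    """Upper bound on overall speedup if this component's cost dropped to zero."""
    return 1.0 / (1.0 - share)


# Best-case overall speedup from eliminating each component on its own.
bounds = {name: max_speedup(share) for name, share in shares.items()}

for name, bound in bounds.items():
    print(f"{name}: at most {bound:.2f}x")
```

Under these assumptions, even making `MetricsAggregate.update` free would speed up this benchmark by at most about 1.54x, which is consistent with the remark that the benchmark mostly measures Catalyst overhead.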