[ https://issues.apache.org/jira/browse/SPARK-10936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949306#comment-14949306 ]
Frederick Reiss commented on SPARK-10936: ----------------------------------------- Mode is not an algebraic aggregate. To find the mode in a single pass over the original data, one needs to track the full set of distinct values in the underlying set, as well as the number of times each value occurs in the records seen so far. For high-cardinality columns, this requirement will result in unbounded state. I can see three ways forward here: a) Stuff a hash table into the PartialAggregate2 API's result buffer, and hope that this buffer does not exhaust the heap, produce O(n^2) behavior when the column cardinality is high, or stop working on future (or present?) versions of codegen. b) Implement an approximate mode with fixed-size intermediate state (for example, a compressed reservoir sample), similar to how the current HyperLogLog++ aggregate works. Approximate computation of the mode will work well most of the time but will have unbounded error in corner cases. c) Add mode as another member of the "distinct" family of Spark aggregates, such as SUM/COUNT/AVERAGE DISTINCT. Use the same pre-Tungsten style of processing to handle mode for now. Create a follow-on JIRA to move mode over to the fast path at the same time that the other DISTINCT aggregates switch over. I think that (c) is the best option overall, but I'm happy to defer to others with deeper understanding. My thinking is that, while it would be good to have a mode aggregate available, mode is a relatively uncommon use case. Slow-path processing for mode is ok as a short-term expedient. Once SUM DISTINCT and related aggregates are fully moved onto the new framework, transitioning mode to the fast path should be easy. > UDAF "mode" for categorical variables > ------------------------------------- > > Key: SPARK-10936 > URL: https://issues.apache.org/jira/browse/SPARK-10936 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL > Reporter: Xiangrui Meng > > This is similar to frequent items except that we don't have a threshold on > the frequency. So an exact implementation might require a global shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org