[ 
https://issues.apache.org/jira/browse/SPARK-10936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949306#comment-14949306
 ] 

Frederick Reiss commented on SPARK-10936:
-----------------------------------------

Mode is not an algebraic aggregate.  To find the mode in a single pass over the 
original data, one needs to track the full set of distinct values in the 
underlying set, as well as the number of times each value occurs in the records 
seen so far. For high-cardinality columns, this requirement will result in 
unbounded state.

I can see three ways forward here: 

a) Stuff a hash table into the PartialAggregate2 API's result buffer, and hope 
that this buffer does not exhaust the heap, produce O(n^2) behavior when the 
column cardinality is high, or stop working on future (or present?) versions of 
codegen.

b) Implement an approximate mode with fixed-size intermediate state (for 
example, a compressed reservoir sample), similar to how the current 
HyperLogLog++ aggregate works. Approximate computation of the mode will work 
well most of the time but will have unbounded error in corner cases.

c) Add mode as another member of the "distinct" family of Spark aggregates, 
such as SUM/COUNT/AVERAGE DISTINCT. Use the same pre-Tungsten style of 
processing to handle mode for now.  Create a follow-on JIRA to move mode over 
to the fast path at the same time that the other DISTINCT aggregates switch 
over.

I think that (c) is the best option overall, but I'm happy to defer to others 
with deeper understanding. My thinking is that, while it would be good to have 
a mode aggregate available, mode is a relatively uncommon use case. Slow-path 
processing for mode is ok as a short-term expedient. Once SUM DISTINCT and 
related aggregates are fully moved onto the new framework, transitioning mode 
to the fast path should be easy.


> UDAF "mode" for categorical variables
> -------------------------------------
>
>                 Key: SPARK-10936
>                 URL: https://issues.apache.org/jira/browse/SPARK-10936
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, SQL
>            Reporter: Xiangrui Meng
>
> This is similar to frequent items except that we don't have a threshold on 
> the frequency. So an exact implementation might require a global shuffle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to