[ https://issues.apache.org/jira/browse/SPARK-23996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-23996. ---------------------------------- Resolution: Incomplete > Implement the optimal KLL algorithms for quantiles in streams > ------------------------------------------------------------- > > Key: SPARK-23996 > URL: https://issues.apache.org/jira/browse/SPARK-23996 > Project: Spark > Issue Type: Improvement > Components: MLlib, SQL > Affects Versions: 2.3.0 > Reporter: Timothy Hunter > Priority: Major > Labels: bulk-closed > > The current implementation for approximate quantiles - a variant of > Grunwald-Khanna, which I implemented - is not the best in light of recent > papers: > - it is not exactly the one from the paper for performance reasons, but the > changes are not documented beyond comments on the code > - there are now more optimal algorithms with proven bounds (unlike q-digest, > the other contender at the time) > I propose that we revisit the current implementation and look at the > Karnin-Lang-Liberty algorithm (KLL) for example: > [https://arxiv.org/abs/1603.05346] > [https://edoliberty.github.io//papers/streamingQuantiles.pdf] > This algorithm seems to have favorable characteristics for streaming and a > distributed implementation, and there is a python implementation for > reference. > It is a fairly standalone piece, and in that respect available to people who > don't know too much about spark internals. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org