[ https://issues.apache.org/jira/browse/SPARK-23996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447781#comment-16447781 ]
Timothy Hunter commented on SPARK-23996: ---------------------------------------- [~wm624] yes this is the implementation: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala] you can see the test suite here: [https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala] The current implementation focused on doubles, but I do not see much issue in switch to floats. The main entry points are fairly similar: [https://github.com/DataSketches/sketches-core/blob/master/src/main/java/com/yahoo/sketches/kll/KllFloatsSketch.java#L299] > Implement the optimal KLL algorithms for quantiles in streams > ------------------------------------------------------------- > > Key: SPARK-23996 > URL: https://issues.apache.org/jira/browse/SPARK-23996 > Project: Spark > Issue Type: Improvement > Components: MLlib, SQL > Affects Versions: 2.3.0 > Reporter: Timothy Hunter > Priority: Major > > The current implementation for approximate quantiles - a variant of > Grunwald-Khanna, which I implemented - is not the best in light of recent > papers: > - it is not exactly the one from the paper for performance reasons, but the > changes are not documented beyond comments on the code > - there are now more optimal algorithms with proven bounds (unlike q-digest, > the other contender at the time) > I propose that we revisit the current implementation and look at the > Karnin-Lang-Liberty algorithm (KLL) for example: > [https://arxiv.org/abs/1603.05346] > [https://edoliberty.github.io//papers/streamingQuantiles.pdf] > This algorithm seems to have favorable characteristics for streaming and a > distributed implementation, and there is a python implementation for > reference. > It is a fairly standalone piece, and in that respect available to people who > don't know too much about spark internals. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org