So looking at it, it seems to fit the bill, with a couple of comments:

- The quantiles stuff provides a CDF and PMF function, which is sufficient for our purposes. I haven't seen any real comparison between t-digests and their approach. A cursory glance at the source code leads me to believe that it's not tree-based, so I'd have to dig into it a bit more to understand the tradeoffs of their approach vs a tree-based approach like in t-digest.
- The HLL stuff seems to be pure HLL, rather than HLL+, which is what we support. HLL+ has better accuracy characteristics for small sets, as I recall. I'll defer to Mike Miklavcic on that as I haven't read the paper in a while.

On the whole, I'd love to integrate with it and maybe swap out the t-digest approach for this since it has an active community around it. Anyway, thanks for bringing it to our attention and if anyone wants to take that on, I'd be on board with a +1 ;)

Casey

On Tue, Feb 21, 2017 at 10:22 PM, Matt Foley <ma...@apache.org> wrote:

> Looks interesting. Any indication whether it supports MAD (median
> absolute deviation) for outlier detection?
>
>
> On 2/21/17, 8:08 AM, "Nick Allen" <n...@nickallen.org> wrote:
>
>     We currently use the tdunning/t-digest
>     <https://github.com/tdunning/t-digest> library for generating our
>     STATS_* sketches and then a separate library addthis/stream-lib
>     <https://github.com/addthis/stream-lib> for doing the HLL distinct
>     count.
>
>     I ran across another library originating from Yahoo that looks quite
>     featureful, well documented and quite active. On the surface it
>     *seems* to be able to do what we need for both the STATS_* sketches
>     and HLL.
>
>     https://datasketches.github.io/
>
>     Has anyone evaluated this library before? Are there deficiencies as
>     compared to the libraries that we currently use?