@Casey That's correct - the stream-lib library has an HLL+ implementation that not only works better for small data set sizes, but also scales to much larger cardinalities than the previous HLL algorithm. The DataSketches library appears to be a vanilla HLL implementation.
On Wed, Feb 22, 2017 at 7:11 AM, Casey Stella <ceste...@gmail.com> wrote:

> So looking at it, it seems to fit the bill, with a couple of comments:
>
> - The quantiles stuff provides a CDF and PMF function, which is
>   sufficient for our purposes. I haven't seen any real comparison between
>   t-digests and their approach. A cursory glance at the source code leads me
>   to believe that it's not tree-based, so I'd have to dig into it a bit more
>   to understand the tradeoffs of their approach vs a tree-based approach like
>   in t-digest
> - The HLL stuff seems to be pure HLL, rather than HLL+, which is what we
>   support. HLL+ has better accuracy characteristics for small sets, as I
>   recall. I'll defer to Mike Miklavcic on that as I haven't read the paper
>   in a while.
>
> On the whole, I'd love to integrate with it and maybe swap out the t-digest
> approach for this since it has an active community around it.
>
> Anyway, thanks for bringing it to our attention and if anyone wants to take
> that on, I'd be on board with a +1 ;)
>
> Casey
>
> On Tue, Feb 21, 2017 at 10:22 PM, Matt Foley <ma...@apache.org> wrote:
>
> > Looks interesting. Any indication whether it supports MAD (median
> > absolute deviation) for outlier detection?
> >
> > On 2/21/17, 8:08 AM, "Nick Allen" <n...@nickallen.org> wrote:
> >
> >     We currently use the tdunning/t-digest
> >     <https://github.com/tdunning/t-digest> library for generating our
> >     STATS_* sketches and then a separate library addthis/stream-lib
> >     <https://github.com/addthis/stream-lib> for doing the HLL distinct
> >     count.
> >
> >     I ran across another library originating from Yahoo that looks quite
> >     featureful, well documented and quite active. On the surface it
> >     *seems* to be able to do what we need for both the STATS_* sketches
> >     and HLL.
> >
> >     https://datasketches.github.io/
> >
> >     Has anyone evaluated this library before? Are there deficiencies as
> >     compared to the libraries that we currently use?