Re: [DISCUSS] Sketch Libraries

Michael Miklavcic Wed, 22 Feb 2017 09:43:35 -0800

@Casey That's correct - the stream-lib library has an HLL+ implementation
that not only works better for small data set sizes, but also scales to
much larger cardinalities than the previous HLL algorithm. The DataSketches
library appears to be a vanilla HLL implementation.
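
To make the small-set point concrete, here is a rough toy sketch, not code
from stream-lib or DataSketches. It assumes a 64-bit slice of SHA-1 as the
hash and p=10 (1024 registers); the function names are made up for
illustration. It compares the raw HLL estimator with the linear-counting
fallback from the original HLL paper. HLL+ goes further than this (sparse
representation plus empirical bias correction), which is not shown here.

```python
import hashlib
import math


def hll_registers(items, p=10):
    """Build 2**p HLL registers from an iterable of items (toy version)."""
    m = 1 << p
    regs = [0] * m
    for item in items:
        # Assumption: first 8 bytes of SHA-1 serve as a 64-bit hash.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                  # top p bits pick the register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64-p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        regs[idx] = max(regs[idx], rank)
    return regs


def raw_estimate(regs):
    """Raw HLL estimate: alpha * m^2 / sum(2^-register)."""
    m = len(regs)
    alpha = 0.7213 / (1 + 1.079 / m)         # standard constant for m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in regs)


def corrected_estimate(regs):
    """Raw estimate with the classic small-range (linear counting) fallback."""
    m = len(regs)
    e = raw_estimate(regs)
    zeros = regs.count(0)
    if e <= 2.5 * m and zeros:
        return m * math.log(m / zeros)       # linear counting
    return e


regs = hll_registers(range(50))
print(raw_estimate(regs), corrected_estimate(regs))
```

With only 50 distinct items in 1024 registers, the raw estimator lands far
above the true count, while the corrected one stays close to it. That gap
at small cardinalities is the region where HLL+ is designed to do better.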


On Wed, Feb 22, 2017 at 7:11 AM, Casey Stella <ceste...@gmail.com> wrote:

> So looking at it, it seems to fit the bill, with a couple of comments:
>
>    - The quantiles stuff provides a CDF and PMF function, which is
>    sufficient for our purposes.  I haven't seen any real comparison between
>    t-digests and their approach.  A cursory glance at the source code leads
>    me to believe that it's not tree-based, so I'd have to dig into it a bit
>    more to understand the tradeoffs of their approach vs a tree-based
>    approach like in t-digest.
>    - The HLL stuff seems to be pure HLL, rather than HLL+, which is what we
>    support.  HLL+ has better accuracy characteristics for small sets, as I
>    recall.  I'll defer to Mike Miklavcic on that as I haven't read the paper
>    in a while.
>
> On the whole, I'd love to integrate with it and maybe swap out the t-digest
> approach for this since it has an active community around it.
>
> Anyway, thanks for bringing it to our attention and if anyone wants to take
> that on, I'd be on board with a +1 ;)
>
> Casey
>
> On Tue, Feb 21, 2017 at 10:22 PM, Matt Foley <ma...@apache.org> wrote:
>
> > Looks interesting.  Any indication whether it supports MAD (median
> > absolute deviation) for outlier detection?
> >
> >
> > On 2/21/17, 8:08 AM, "Nick Allen" <n...@nickallen.org> wrote:
> >
> >     We currently use the tdunning/t-digest
> >     <https://github.com/tdunning/t-digest> library for generating our
> >     STATS_* sketches and then a separate library addthis/stream-lib
> >     <https://github.com/addthis/stream-lib> for doing the HLL distinct
> >     count.
> >
> >     I ran across another library originating from Yahoo that looks quite
> >     featureful, well documented and quite active.  On the surface it
> >     *seems* to be able to do what we need for both the STATS_* sketches
> >     and HLL.
> >
> >     https://datasketches.github.io/
> >
> >
> >     Has anyone evaluated this library before?  Are there deficiencies as
> >     compared to the libraries that we currently use?
> >
> >
> >
> >
>
