Re: [DISCUSS] Sketch Libraries

2017-02-22 Thread Michael Miklavcic
@Casey That's correct - the stream-lib library has an HLL+ implementation
that not only works better for small data set sizes, but also scales to
much larger cardinalities than the previous HLL algorithm. The DataSketches
library appears to be a vanilla HLL implementation.
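
To make the small-set point concrete, here is a minimal pure-Python sketch of a plain HyperLogLog register array. It is neither stream-lib's nor DataSketches' implementation; the class name, the SHA-1 based hash helper and the register count are assumptions chosen for illustration. It compares the raw harmonic-mean estimate with the linear-counting fallback that plain HLL uses for small sets. HLL+ goes further than this toy, with a 64-bit hash, empirical bias correction and a sparse representation.

```python
import hashlib
import math


def hash64(item: str) -> int:
    """Hypothetical helper: 64-bit hash taken from the first 8 bytes of SHA-1."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")


class PlainHLL:
    """Toy HyperLogLog with m = 2**p registers (not a library implementation)."""

    def __init__(self, p: int = 10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = hash64(item)
        width = 64 - self.p
        index = h >> width                    # top p bits pick the register
        rest = h & ((1 << width) - 1)         # remaining bits feed the rank
        rank = width - rest.bit_length() + 1  # 1-based position of the first 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def raw_estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        return alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)

    def estimate(self) -> float:
        raw = self.raw_estimate()
        zeros = self.registers.count(0)
        # Small-range correction: use linear counting while registers are still empty.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw


small = PlainHLL()
for i in range(50):
    small.add(f"user-{i}")
print("raw:", round(small.raw_estimate()), "corrected:", round(small.estimate()))
```

With only 50 distinct items in 1024 registers, the raw estimate overshoots badly while the linear-counting value lands close to the true count, which is why the handling of small cardinalities matters when comparing libraries.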



Re: [DISCUSS] Sketch Libraries

2017-02-22 Thread Casey Stella
Oh, one thing we take advantage of in t-digest is that the library can
serialize itself to a byte stream, (presumably) in a tighter representation
than the default Kryo serialization, which is nice.  Not sure if DataSketches
has the ability to serialize itself, but I wouldn't be surprised.  Anyway,
not a dealbreaker per se, just a thought.
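
As a rough picture of what a "tighter representation" can mean, here is a minimal sketch that packs a list of centroids into a fixed binary layout and compares the size with a general-purpose serializer. Everything in it is an assumption for illustration: the (mean, count) layout is hypothetical and is not t-digest's actual wire format, and Python's pickle merely stands in for a generic serializer such as Kryo.

```python
import pickle
import struct

# Hypothetical centroid layout: (mean as float64, count as uint32).
RECORD = struct.Struct("<dI")  # 12 bytes per centroid, no padding
HEADER = struct.Struct("<I")   # number of centroids that follow


def to_bytes(centroids):
    """Pack centroids as a length header followed by fixed-width records."""
    out = bytearray(HEADER.pack(len(centroids)))
    for mean, count in centroids:
        out += RECORD.pack(mean, count)
    return bytes(out)


def from_bytes(data):
    """Inverse of to_bytes: read the header, then each fixed-width record."""
    (n,) = HEADER.unpack_from(data, 0)
    return [RECORD.unpack_from(data, HEADER.size + RECORD.size * i) for i in range(n)]


centroids = [(i * 0.5, i + 1) for i in range(100)]
blob = to_bytes(centroids)
print("custom:", len(blob), "bytes; generic:", len(pickle.dumps(centroids)), "bytes")
```

The fixed layout carries no per-record type information, which is where the savings come from. Whether a real library's format beats Kryo by a useful margin would still need to be measured.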



Re: [DISCUSS] Sketch Libraries

2017-02-22 Thread Casey Stella
So looking at it, it seems to fit the bill, with a couple of comments:

   - The quantiles stuff provides CDF and PMF functions, which is
   sufficient for our purposes.  I haven't seen any real comparison between
   t-digest and their approach.  A cursory glance at the source code leads me
   to believe that it's not tree-based, so I'd have to dig into it a bit more
   to understand the tradeoffs of their approach vs a tree-based approach like
   the one in t-digest.
   - The HLL stuff seems to be pure HLL, rather than HLL+, which is what we
   support.  HLL+ has better accuracy characteristics for small sets, as I
   recall.  I'll defer to Mike Miklavcic on that as I haven't read the paper
   in a while.
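
For anyone unfamiliar with the CDF/PMF terminology in the first bullet, here is a toy, exact version computed from a fully sorted sample rather than from a sketch, so it says nothing about the accuracy or internals of either library. The function names, the convention of appending a trailing 1.0 to the CDF, and treating each bucket as "strictly below the split point" are all assumptions made for this illustration.

```python
from bisect import bisect_left


def cdf(sorted_values, split_points):
    """Fraction of values strictly below each split point, plus a final 1.0."""
    n = len(sorted_values)
    return [bisect_left(sorted_values, s) / n for s in split_points] + [1.0]


def pmf(sorted_values, split_points):
    """Fraction of values falling in each bucket delimited by the split points."""
    c = cdf(sorted_values, split_points)
    return [c[0]] + [c[i] - c[i - 1] for i in range(1, len(c))]


values = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(cdf(values, [3, 7]))
print(pmf(values, [3, 7]))
```

A quantile sketch answers the same two questions approximately from a bounded-size summary, which is the part that would need comparing between the two approaches.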

On the whole, I'd love to integrate with it and maybe swap out the t-digest
approach for this since it has an active community around it.

Anyway, thanks for bringing it to our attention and if anyone wants to take
that on, I'd be on board with a +1 ;)

Casey



Re: [DISCUSS] Sketch Libraries

2017-02-21 Thread Matt Foley
Looks interesting.  Any indication whether it supports MAD (median absolute 
deviation) for outlier detection?
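
For reference, a minimal sketch of what MAD-based outlier detection computes, done exactly over an in-memory list rather than from a sketch. The function names are hypothetical. The 0.6745 scaling and the 3.5 cutoff follow the commonly used modified z-score convention and are assumptions here, not something either library is known to provide.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: the median distance from the median."""
    med = median(values)
    return median([abs(v - med) for v in values])


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    spread = mad(values)
    if spread == 0:
        return []  # the modified z-score is undefined when MAD is 0
    return [v for v in values if abs(0.6745 * (v - med) / spread) > threshold]


readings = [10, 11, 12, 11, 10, 12, 11, 100]
print(mad(readings), mad_outliers(readings))
```

Doing this over a stream is harder because the deviations depend on the median, which is only known approximately until the data has been seen. That is the part a sketch library would have to help with.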



[DISCUSS] Sketch Libraries

2017-02-21 Thread Nick Allen
We currently use the tdunning/t-digest library for generating our STATS_*
sketches and then a separate library, addthis/stream-lib, for doing the HLL
distinct count.

I ran across another library originating from Yahoo that looks quite
featureful, well documented and quite active.  On the surface it *seems* to
be able to do what we need for both the STATS_* sketches and HLL.

https://datasketches.github.io/


Has anyone evaluated this library before?  Are there deficiencies as
compared to the libraries that we currently use?