One thing we take advantage of in t-digest is that the library can serialize
itself to a byte stream, presumably in a tighter representation than the
default Kryo serialization, which is nice.  I'm not sure whether DataSketches
can serialize its sketches the same way, but I wouldn't be surprised.  Anyway,
not a dealbreaker per se, just a thought.
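
To make the "tighter representation" point concrete, here is a rough sketch
of the idea.  It is in Python rather than Java for brevity, and everything in
it is an assumption for illustration: the sketch is modeled as a plain list of
(mean, weight) centroids, to_bytes/from_bytes are hypothetical helpers rather
than t-digest's actual wire format, and pickle stands in for a generic object
serializer such as Kryo.

```python
import pickle
import struct

# Hypothetical stand-in for a sketch: a list of (mean, weight) centroids.
# This is NOT t-digest's real format, only an illustration of the idea.


def to_bytes(centroids):
    """Pack a 4-byte count followed by (float64 mean, float64 weight) pairs."""
    parts = [struct.pack(">i", len(centroids))]
    for mean, weight in centroids:
        parts.append(struct.pack(">dd", mean, weight))
    return b"".join(parts)


def from_bytes(data):
    """Inverse of to_bytes: read the count, then each 16-byte centroid."""
    (count,) = struct.unpack_from(">i", data, 0)
    return [struct.unpack_from(">dd", data, 4 + 16 * i) for i in range(count)]


centroids = [(i * 0.5, float(i + 1)) for i in range(100)]

compact = to_bytes(centroids)        # purpose-built encoding
generic = pickle.dumps(centroids)    # generic serializer, stand-in for Kryo

print(len(compact), len(generic))
```

The purpose-built encoding only pays for the raw doubles plus a small header,
while the generic serializer also has to describe the object structure.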

On Wed, Feb 22, 2017 at 6:11 AM, Casey Stella <ceste...@gmail.com> wrote:

> So looking at it, it seems to fit the bill, with a couple of comments:
>
>    - The quantiles stuff provides CDF and PMF functions, which is
>    sufficient for our purposes.  I haven't seen any real comparison between
>    t-digest and their approach.  A cursory glance at the source code leads me
>    to believe that it's not tree-based, so I'd have to dig into it a bit more
>    to understand the tradeoffs of their approach vs a tree-based approach like
>    the one in t-digest.
>    - The HLL stuff seems to be pure HLL, rather than HLL+, which is what
>    we support.  HLL+ has better accuracy characteristics for small sets, as I
>    recall.  I'll defer to Mike Miklavcic on that as I haven't read the paper
>    in a while.
>
> On the whole, I'd love to integrate with it and maybe swap out the
> t-digest approach for this since it has an active community around it.
>
> Anyway, thanks for bringing it to our attention and if anyone wants to
> take that on, I'd be on board with a +1 ;)
>
> Casey
>
> On Tue, Feb 21, 2017 at 10:22 PM, Matt Foley <ma...@apache.org> wrote:
>
>> Looks interesting.  Any indication whether it supports MAD (median
>> absolute deviation) for outlier detection?
>>
>>
>> On 2/21/17, 8:08 AM, "Nick Allen" <n...@nickallen.org> wrote:
>>
>>     We currently use the tdunning/t-digest
>>     <https://github.com/tdunning/t-digest> library for generating our
>>     STATS_* sketches and then a separate library, addthis/stream-lib
>>     <https://github.com/addthis/stream-lib>, for doing the HLL distinct
>>     count.
>>
>>     I ran across another library originating from Yahoo that looks quite
>>     featureful, well documented and quite active.  On the surface it
>>     *seems* to be able to do what we need for both the STATS_* sketches
>>     and HLL.
>>
>>     https://datasketches.github.io/
>>
>>
>>     Has anyone evaluated this library before?  Are there deficiencies as
>>     compared to the libraries that we currently use?
>>
>>
>>
>>
>
