Oh, one thing we take advantage of with t-digest is that the library can serialize itself to a byte stream, presumably in a tighter representation than the default Kryo serialization, which is nice. Not sure if datasketches has the ability to serialize itself, but I wouldn't be surprised. Anyway, not a dealbreaker per se, just a thought.
On Wed, Feb 22, 2017 at 6:11 AM, Casey Stella <ceste...@gmail.com> wrote:

> So looking at it, it seems to fit the bill, with a couple of comments:
>
>    - The quantiles stuff provides a CDF and PMF function, which is
>    sufficient for our purposes. I haven't seen any real comparison between
>    t-digests and their approach. A cursory glance at the source code leads me
>    to believe that it's not tree-based, so I'd have to dig into it a bit more
>    to understand the tradeoffs of their approach vs a tree-based approach like
>    in t-digest
>    - The HLL stuff seems to be pure HLL, rather than HLL+, which is what
>    we support. HLL+ has better accuracy characteristics for small sets, as I
>    recall. I'll defer to Mike Miklavcic on that as I haven't read the paper
>    in a while.
>
> On the whole, I'd love to integrate with it and maybe swap out the
> t-digest approach for this since it has an active community around it.
>
> Anyway, thanks for bringing it to our attention and if anyone wants to
> take that on, I'd be on board with a +1 ;)
>
> Casey
>
> On Tue, Feb 21, 2017 at 10:22 PM, Matt Foley <ma...@apache.org> wrote:
>
>> Looks interesting. Any indication whether it supports MAD (median
>> absolute deviation) for outlier detection?
>>
>>
>> On 2/21/17, 8:08 AM, "Nick Allen" <n...@nickallen.org> wrote:
>>
>>     We currently use the tdunning/t-digest
>>     <https://github.com/tdunning/t-digest> library for generating our
>>     STATS_* sketches and then a separate library addthis/stream-lib
>>     <https://github.com/addthis/stream-lib> for doing the HLL distinct
>>     count.
>>
>>     I ran across another library originating from Yahoo that looks quite
>>     featureful, well documented and quite active. On the surface it
>>     *seems* to be able to do what we need for both the STATS_* sketches
>>     and HLL.
>>
>>     https://datasketches.github.io/
>>
>>     Has anyone evaluated this library before? Are there deficiencies as
>>     compared to the libraries that we currently use?