Re: Hive HLL for appx count distinct

2015-12-30 Thread Gopal Vijayaraghavan
> In the hive-hll-udf, you seem to mention RRD. Is that something
> supported by Hive?

No. RRDTool is what most people are replacing with Hive for storing
time-series data.

Raw RRDTool files on a local disk have no availability model (i.e. lose a
disk and you lose the data).

The rollup concept, however, is very powerful: you maintain distinct
aggregates of a time series (and throw out the expired ones), which is what
my example did:

last 30 days HLL + last 23 hours HLL + an HLL generated over current_hour

to count billions of distincts across them with a few megabytes of storage.
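
A rough sketch of that rollup in Python, for illustration only. It treats
each HLL as a plain register array and does not hash or estimate anything;
the names (merge, Rollup, close_hour, window), the precision of 2**14
one-byte registers, and the calendar-aligned day boundary are all
assumptions of this sketch, not something taken from hive-hll-udf.

```python
from collections import deque

P = 14          # assumed precision: 2**14 registers per sketch
M = 1 << P      # at one byte per register, 16 KiB per sketch


def merge(sketches, m=M):
    """Union HLL register arrays by taking the element-wise max."""
    out = bytes(m)
    for s in sketches:
        out = bytes(map(max, out, s))
    return out


class Rollup:
    """Keeps daily sketches, the closed hours of today, and the current hour."""

    def __init__(self, m=M, keep_days=30):
        self.m = m
        self.days = deque(maxlen=keep_days)  # the oldest day expires automatically
        self.hours = []                      # closed hours of the current day
        self.current = bytearray(m)          # sketch still receiving writes

    def close_hour(self):
        self.hours.append(bytes(self.current))
        self.current = bytearray(self.m)
        if len(self.hours) == 24:            # day complete: collapse into one sketch
            self.days.append(merge(self.hours, self.m))
            self.hours = []

    def window(self):
        """Registers covering retained days + closed hours + current hour."""
        parts = [*self.days, *self.hours, bytes(self.current)]
        return merge(parts, self.m)

    def max_stored_bytes(self):
        # worst case: all retained days + 23 closed hours + the current hour
        return self.m * (self.days.maxlen + 23 + 1)
```

Under those assumptions the worst case is 54 sketches of 16 KiB each, i.e.
864 KiB per tracked key, however many distinct values went into them.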

This can then be extended further to build hundreds of bitsets per hour,
one for each tracked A/B experiment you want to collect stats on.

Cheers,
Gopal




Re: Hive HLL for appx count distinct

2015-12-30 Thread Gopal Vijayaraghavan

> I'm trying to explore the HLL UDF option to compute the number of unique
> users for each time range (week, month, year, etc.) and wanted to know if
> it's possible to just maintain an HLL struct for each day and then use
> those per-day structs to compute the uniques for the various time ranges,
> instead of running the queries across all the data?

Yes, raw HLLs can be unioned (though not intersected).

https://github.com/t3rmin4t0r/hive-hll-udf
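
To show why per-day structs are enough, here is a minimal, self-contained
HLL in Python. It is a textbook-style sketch, not the code behind the UDF
above: the class name, the SHA-1 based hash, and the default precision of
2**12 registers are assumptions made for the example.

```python
import hashlib
import math


class HLL:
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.reg = bytearray(self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        idx = h >> (64 - self.p)                   # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1-bit
        if rank > self.reg[idx]:
            self.reg[idx] = rank

    def union(self, other):
        if other.p != self.p:
            raise ValueError("precision mismatch")
        out = HLL(self.p)
        out.reg = bytearray(map(max, self.reg, other.reg))  # register-wise max
        return out

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


# Seven per-day sketches over overlapping users (9,000 distinct in total).
days = []
for d in range(7):
    day = HLL()
    for user in range(d * 1000, d * 1000 + 3000):
        day.add(f"user-{user}")
    days.append(day)

week = days[0]
for day in days[1:]:
    week = week.union(day)
print(round(week.estimate()))
```

The union of the daily sketches has exactly the same registers as a sketch
built over all seven days of raw data, so nothing is lost by storing one
struct per day. There is no matching register operation for intersections;
inclusion-exclusion over unions can approximate one, but its error grows
with the size of the union rather than the size of the intersection.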


Or better yet, use the Yahoo sketches which work better than raw HLL.

http://yahooeng.tumblr.com/post/135390948446/data-sketches

+
http://datasketches.github.io/

+
https://github.com/DataSketches/sketches-hive


Cheers,
Gopal



Re: Hive HLL for appx count distinct

2015-12-30 Thread Buntu Dev
Thanks Gopal!

In the hive-hll-udf, you seem to mention RRD. Is that something
supported by Hive?

Will go over the Data Sketches as well, thanks for the pointer :)
