Re: Hive HLL for appx count distinct
> In the hive-hll-udf, you seem to mention about RRD. Is that something
> supported by Hive?

No. RRDTool is what most people are replacing with Hive as a store for time-series data. Raw RRDTool files on a local disk have no availability model (i.e. lose a disk and you lose the data).

The rollup concept, however, is very powerful: maintain distinct aggregates over a time series and throw out the expired ones. That is what my example does: last 30 days HLL + last 23 hours HLL + an HLL generated over current_hour, to count billions of distincts across them with a few megabytes of storage.

This can then be extended further to build hundreds of bitsets per hour, one for each tracked A/B experiment you want to collect stats on.

Cheers,
Gopal
Re: Hive HLL for appx count distinct
> I'm trying to explore the HLL UDF option to compute # of uniq users for
> each time range (week, month, yr, etc.) and wanted to know if
> its possible to just maintain HLL struct for each day and then use those
> to compute the uniqs for various time
> ranges using these per day structs instead of running the queries across
> all the data?

Yes, unions of raw HLL can be done (though not intersections).

https://github.com/t3rmin4t0r/hive-hll-udf

Or better yet, use the Yahoo sketches, which work better than raw HLL.

http://yahooeng.tumblr.com/post/135390948446/data-sketches
+
http://datasketches.github.io/
+
https://github.com/DataSketches/sketches-hive

Cheers,
Gopal
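To illustrate why per-day structs are enough, here is a minimal sketch of the union idea in plain Python. It is a toy HyperLogLog written for this example, not the code in hive-hll-udf or DataSketches; the function names, the register count (P = 12) and the SHA-1 hash are all assumptions chosen for brevity. The point it shows: merging daily sketches register by register gives exactly the sketch you would get from scanning all the raw data.

```python
import hashlib
import math

P = 12          # index bits (illustrative choice, not from any library)
M = 1 << P      # number of registers

def new_sketch():
    return [0] * M

def add(sketch, value):
    # 64-bit hash of the value; SHA-1 is only a convenient stand-in here
    h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
    idx = h >> (64 - P)                      # top P bits pick the register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 64-P bits
    rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1 bit
    if rank > sketch[idx]:
        sketch[idx] = rank

def union(*sketches):
    # Union of HLLs = element-wise max of the registers
    return [max(regs) for regs in zip(*sketches)]

def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)       # small-range correction
    return raw

# Hypothetical data: three days of user ids with overlap, 100000 distinct overall
days = {
    "day1": range(0, 50000),
    "day2": range(30000, 80000),
    "day3": range(60000, 100000),
}

# One sketch per day, built once and stored
daily = {}
for day, users in days.items():
    s = new_sketch()
    for u in users:
        add(s, u)
    daily[day] = s

# Distincts over the 3-day range, from the stored sketches only
merged = union(*daily.values())

# Same range computed the expensive way, over all the raw rows
everything = new_sketch()
for users in days.values():
    for u in users:
        add(everything, u)

print(round(estimate(merged)), merged == everything)
```

The same merge works for any grouping of days (week, month, year), which is why only the per-day structs need to be kept. Intersections have no equivalent register-wise operation, hence the caveat above.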
Re: Hive HLL for appx count distinct
Thanks Gopal!

In the hive-hll-udf, you seem to mention RRD. Is that something supported by Hive?

I will go over the Data Sketches as well; thanks for the pointer :)

On Wed, Dec 30, 2015 at 4:29 PM, Gopal Vijayaraghavan wrote:

> > I'm trying to explore the HLL UDF option to compute # of uniq users for
> > each time range (week, month, yr, etc.) and wanted to know if
> > its possible to just maintain HLL struct for each day and then use those
> > to compute the uniqs for various time
> > ranges using these per day structs instead of running the queries across
> > all the data?
>
> Yes, unions of raw HLL can be done (though not intersects).
>
> https://github.com/t3rmin4t0r/hive-hll-udf
>
> Or better yet, use the Yahoo sketches which work better than raw HLL.
>
> http://yahooeng.tumblr.com/post/135390948446/data-sketches
> +
> http://datasketches.github.io/
> +
> https://github.com/DataSketches/sketches-hive
>
> Cheers,
> Gopal