What's the expected size of your unique key set? Thousands? Millions?
Billions?
This project is something to occupy me in my spare time, and it's intended to
explore aspects of Accumulo that I haven't needed to use yet. In the past,
I simply ran a MapReduce job using the word-counting technique.
If I have the following simple set of data:
NAME John
NAME Jake
NAME John
NAME Mary
I want to end up with the following:
NAME 3
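For a small keyspace, that word-count-style pass can be sketched exactly, without MapReduce at all. This toy Java example (the class and method names are my own, not from any Accumulo schema) produces the NAME 3 result above:

```java
import java.util.*;

public class DistinctCount {
    // field -> number of distinct values seen for that field
    public static Map<String, Integer> countDistinct(List<String[]> rows) {
        Map<String, Set<String>> seen = new HashMap<>();
        for (String[] row : rows) {
            // row[0] is the field name, row[1] the value; the set dedupes
            seen.computeIfAbsent(row[0], k -> new HashSet<>()).add(row[1]);
        }
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : seen.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"NAME", "John"},
            new String[]{"NAME", "Jake"},
            new String[]{"NAME", "John"},
            new String[]{"NAME", "Mary"});
        System.out.println("NAME " + countDistinct(rows).get("NAME")); // NAME 3
    }
}
```

The obvious limitation is that the `HashSet` holds every distinct value in memory, which is exactly what HyperLogLog avoids.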
I'm thinking that perhaps a HyperLogLog approach should work. See
http://en.wikipedia.org/wiki/HyperLogLog for more information.
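To illustrate the idea, here is a toy HyperLogLog in plain Java. It is a simplified sketch of the algorithm from the article above (the hash mixing and the bias constant are simplified for readability), not a substitute for a tested library such as stream-lib:

```java
// Toy HyperLogLog: 2^p registers, each storing the maximum "rank"
// (leading-zero count + 1) seen among hashed values routed to it.
public class ToyHLL {
    private final int p;       // bits of the hash used as register index
    private final int m;       // number of registers, 2^p
    private final byte[] regs;

    public ToyHLL(int p) {
        this.p = p;
        this.m = 1 << p;
        this.regs = new byte[m];
    }

    // Spread the value's 32-bit hashCode across 64 bits (splitmix-style mix).
    private static long hash64(Object o) {
        long x = o.hashCode() * 0x9E3779B97F4A7C15L;
        x ^= x >>> 32;
        x *= 0xBF58476D1CE4E5B9L;
        x ^= x >>> 29;
        return x;
    }

    public void add(Object o) {
        long h = hash64(o);
        int idx = (int) (h >>> (64 - p));              // top p bits pick a register
        int rank = Long.numberOfLeadingZeros(h << p) + 1;
        if (rank > regs[idx]) regs[idx] = (byte) rank;
    }

    public long estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : regs) {
            sum += Math.pow(2.0, -r);                  // harmonic mean accumulator
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1 + 1.079 / m);       // bias correction, m >= 128
        double est = alpha * m * m / sum;
        if (est <= 2.5 * m && zeros > 0)               // small-range (linear counting) fix
            est = m * Math.log((double) m / zeros);
        return Math.round(est);
    }

    public static void main(String[] args) {
        ToyHLL hll = new ToyHLL(12);                   // 4096 registers, ~1.6% error
        for (int i = 0; i < 10000; i++) hll.add("value-" + i);
        System.out.println("estimated distinct: " + hll.estimate());
    }
}
```

The memory footprint is fixed (4 KB here) regardless of how many distinct values are offered, which is what makes the approach attractive at ingest time.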
Has anyone done this before in Accumulo?
Yes, the data has not yet been ingested. I can control the table structure,
ideally by integrating with (or extending) the D4M schema.
I'm leaning towards using https://github.com/addthis/stream-lib as part of
the ingest process. On startup, existing tables would be analyzed to find
their cardinality.
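One way to avoid re-analyzing the tables on every restart: since a sketch's state is just its register array, it can be persisted alongside the table and restored at startup. A minimal round-trip sketch (the length-prefixed framing here is made up for illustration; stream-lib has its own serialization):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Persist and restore an HLL register array so a restarted ingester can
// resume its sketches instead of rescanning the table.
public class SketchState {
    public static byte[] serialize(byte[] registers) {
        ByteBuffer buf = ByteBuffer.allocate(4 + registers.length);
        buf.putInt(registers.length);   // length prefix
        buf.put(registers);
        return buf.array();
    }

    public static byte[] deserialize(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte[] registers = new byte[buf.getInt()];
        buf.get(registers);
        return registers;
    }

    public static void main(String[] args) {
        byte[] regs = {0, 3, 1, 7};
        byte[] back = deserialize(serialize(regs));
        System.out.println(Arrays.equals(regs, back)); // true
    }
}
```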
You could probably use a table structure similar to
https://github.com/calrissian/accumulo-recipes/tree/master/store/metrics-store
but just have it emit 1's instead of summing them.
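A hedged sketch of what "emit 1's" could look like: if the (field, value) pair forms the cell key, re-ingesting a duplicate overwrites the same cell rather than adding a new one, and the distinct count for a field is just its cell count. Simulated here with a `TreeMap` standing in for the sorted table (the key encoding is my own, not the metrics-store layout):

```java
import java.util.*;

// Simulated "emit 1's" table: the (field, value) pair is the cell key, so
// duplicates collapse onto the same cell instead of accumulating.
public class OnesTable {
    private final TreeMap<String, Integer> cells = new TreeMap<>();

    public void emit(String field, String value) {
        cells.put(field + "\u0000" + value, 1);  // key encodes field + value
    }

    // Distinct count for a field = number of surviving cells under it.
    public long distinct(String field) {
        String lo = field + "\u0000", hi = field + "\u0001";
        return cells.subMap(lo, hi).size();      // range scan over the field
    }

    public static void main(String[] args) {
        OnesTable t = new OnesTable();
        t.emit("NAME", "John");
        t.emit("NAME", "Jake");
        t.emit("NAME", "John");
        t.emit("NAME", "Mary");
        System.out.println("NAME " + t.distinct("NAME")); // NAME 3
    }
}
```

Note this is exact rather than approximate: the count is right, but the table still stores one cell per distinct value, so it trades storage for accuracy compared to HLL.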
I'm thinking maybe your mappings
Whoops, sorry for the empty response; I'm still getting used to this e-mail
client. The bit set within HLL supports union, and intersection can be
estimated from it. You should be able to estimate cardinality without
re-reading the data. In effect, you can segment your estimation and keep the
error around 2%.
Union is straightforward, whereas intersection has to be estimated indirectly
(typically via inclusion-exclusion), which inflates the relative error.
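The register-level version of those operations, assuming the simple register-array representation of an HLL sketch: union is an element-wise max over the registers (and is lossless, giving exactly the sketch of the combined stream), while intersection is usually estimated afterwards from the cardinalities via inclusion-exclusion:

```java
public class HllSetOps {
    // Union of two same-sized HLL sketches: element-wise max of registers.
    public static byte[] union(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) Math.max(a[i], b[i]);
        }
        return out;
    }

    // Inclusion-exclusion: |A ∩ B| ≈ |A| + |B| - |A ∪ B|. The subtraction of
    // two noisy estimates is why intersection error is worse than union error.
    public static long intersectEstimate(long cardA, long cardB, long cardUnion) {
        return Math.max(0, cardA + cardB - cardUnion);
    }

    public static void main(String[] args) {
        byte[] u = union(new byte[]{1, 4, 0}, new byte[]{2, 3, 5});
        System.out.println(java.util.Arrays.toString(u)); // [2, 4, 5]
        System.out.println(intersectEstimate(100, 80, 150)); // 30
    }
}
```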
On Fri, May 16, 2014 at 6:04 PM, Corey Nolet cjno...@gmail.com wrote: