>What's the expected size of your unique key set? Thousands? Millions? Billions?
This project is something to occupy me in my spare time, and it's intended to
explore aspects of Accumulo that I haven't needed to use yet. In the past, I
simply ran a map-reduce job using the Word Counting technique.

tl;dr - The expected size of the unique key set would be in the millions. Too
large to calculate on-the-fly for a web application.


On Fri, May 16, 2014 at 6:04 PM, Corey Nolet <cjno...@gmail.com> wrote:

> What's the expected size of your unique key set? Thousands? Millions?
> Billions?
>
> You could probably use a table structure similar to
> https://github.com/calrissian/accumulo-recipes/tree/master/store/metrics-store
> but just have it emit 1's instead of summing them.
>
> I'm thinking maybe your mappings could be like this:
> group=anything, type=NAME, name=John(etc...)
>
> perhaps a ColumnQualifierGrouping iterator could be applied at scan time
> to add up the cardinalities for the quals over the given time range being
> scanned where cardinalities across different time units get aggregated
> client side.
>
>
> On Fri, May 16, 2014 at 5:19 PM, David Medinets
> <david.medin...@gmail.com> wrote:
>
>> Yes, the data has not yet been ingested. I can control the table
>> structure; hopefully by integrating (or extending) the D4M schema.
>>
>> I'm leaning towards using https://github.com/addthis/stream-lib as part
>> of the ingest process. Upon start up, existing tables would be analyzed to
>> find cardinality. Then as records are ingested, the cardinality would be
>> adjusted as needed. I don't yet know how to store the cardinality
>> information so that restarting the ingest process doesn't require
>> re-processing all the data. Still researching.
>>
>>
>> On Fri, May 16, 2014 at 4:19 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>
>>> Can we assume this data has not yet been ingested? Do you have control
>>> over the way in which you structure your table?
>>>
>>>
>>> On Fri, May 16, 2014 at 1:54 PM, David Medinets <
>>> david.medin...@gmail.com> wrote:
>>>
>>>> If I have the following simple set of data:
>>>>
>>>> NAME John
>>>> NAME Jake
>>>> NAME John
>>>> NAME Mary
>>>>
>>>> I want to end up with the following:
>>>>
>>>> NAME 3
>>>>
>>>> I'm thinking that perhaps a HyperLogLog approach should work. See
>>>> http://en.wikipedia.org/wiki/HyperLogLog for more information.
>>>>
>>>> Has anyone done this before in Accumulo?
>>>>
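
For anyone following the thread, here is a rough sketch of the HyperLogLog idea applied to the NAME example above. It is plain Python, not stream-lib and not an Accumulo iterator. The class name, the p=12 register setting, and the SHA-1 based hash are all assumptions picked for illustration. It also shows that the registers can be serialized and merged, which is one possible way to keep cardinality state across ingest restarts or to combine sketches from different time units on the client side.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical; not stream-lib's API)."""

    def __init__(self, p=12):
        self.p = p                          # number of index bits (assumption)
        self.m = 1 << p                     # number of registers
        self.registers = bytearray(self.m)  # one small counter per register

    def add(self, value):
        # Take a 64-bit hash from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                    # top p bits pick the register
        rest = h & ((1 << width) - 1)       # remaining bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Union of two sketches: keep the larger value in each register.
        if self.p != other.p:
            raise ValueError("sketches must use the same p")
        for i, r in enumerate(other.registers):
            if r > self.registers[i]:
                self.registers[i] = r

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)    # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return round(m * math.log(m / zeros))
        return round(raw)

    def to_bytes(self):
        # Compact form that could be stored, e.g. as a cell value.
        return bytes([self.p]) + bytes(self.registers)

    @classmethod
    def from_bytes(cls, data):
        hll = cls(data[0])
        hll.registers = bytearray(data[1:])
        return hll


# One sketch per key, fed with the rows from the original question.
rows = [("NAME", "John"), ("NAME", "Jake"), ("NAME", "John"), ("NAME", "Mary")]
sketches = {}
for key, value in rows:
    sketches.setdefault(key, HyperLogLog()).add(value)

for key, hll in sketches.items():
    print(key, hll.count())  # an estimate of the distinct values per key
```

The result is an estimate, not an exact count, so for very small sets an exact set-based count may be simpler. The appeal for the millions-of-keys case is that each sketch stays a fixed size (4096 bytes of registers with p=12) regardless of how many values it has seen.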