Yes, the data has not yet been ingested. I can control the table structure, hopefully by integrating (or extending) the D4M schema.
I'm leaning towards using https://github.com/addthis/stream-lib as part of the ingest process. Upon start up, existing tables would be analyzed to find cardinality. Then as records are ingested, the cardinality would be adjusted as needed. I don't yet know how to store the cardinality information so that restarting the ingest process doesn't require re-processing all the data. Still researching.

On Fri, May 16, 2014 at 4:19 PM, Corey Nolet <cjno...@gmail.com> wrote:

> Can we assume this data has not yet been ingested? Do you have control
> over the way in which you structure your table?
>
> On Fri, May 16, 2014 at 1:54 PM, David Medinets
> <david.medin...@gmail.com> wrote:
>
>> If I have the following simple set of data:
>>
>> NAME John
>> NAME Jake
>> NAME John
>> NAME Mary
>>
>> I want to end up with the following:
>>
>> NAME 3
>>
>> I'm thinking that perhaps a HyperLogLog approach should work. See
>> http://en.wikipedia.org/wiki/HyperLogLog for more information.
>>
>> Has anyone done this before in Accumulo?
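
For illustration only, here is a minimal sketch of the idea discussed in this thread: keep a HyperLogLog sketch per field during ingest, and serialize its registers so a restarted ingest process can resume without re-reading the data. This is a self-contained Python toy, not stream-lib and not an Accumulo API. The class name, the `offer`/`cardinality`/`to_bytes`/`from_bytes` methods, the SHA-1 hashing, and the precision `p=12` are all assumptions chosen for the example. Where the serialized bytes would live (for example, as a cell value in a metadata table) is likewise an assumption.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch (hypothetical; not stream-lib's implementation)."""

    def __init__(self, p=12, registers=None):
        self.p = p                      # number of bits used for the register index
        self.m = 1 << p                 # number of registers
        if registers is None:
            self.registers = bytearray(self.m)
        else:
            self.registers = bytearray(registers)
        if len(self.registers) != self.m:
            raise ValueError("register count does not match precision")

    def offer(self, value):
        # Hash the value to 64 bits; the top p bits select a register.
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the first 1 bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return round(self.m * math.log(self.m / zeros))
        return round(raw)

    def to_bytes(self):
        # One byte of precision followed by the raw registers.
        return bytes([self.p]) + bytes(self.registers)

    @classmethod
    def from_bytes(cls, data):
        return cls(p=data[0], registers=data[1:])


# Ingest the sample NAME values from the original question.
sketch = HyperLogLog()
for name in ["John", "Jake", "John", "Mary"]:
    sketch.offer(name)

# Persist the sketch, then simulate a restart by rebuilding it from bytes.
saved = sketch.to_bytes()
restored = HyperLogLog.from_bytes(saved)
restored.offer("John")                  # a duplicate seen after the restart
print("NAME", restored.cardinality())   # estimate; should be close to 3
```

The estimate is approximate by design. Because each register only ever keeps its maximum, two sketches with the same precision can also be combined by taking the element-wise maximum of their registers, which is what would make a per-tablet or per-ingester sketch mergeable.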