I am wondering what the complexity would be for this, and also how it compares to creating a new table with the required reversed data and calculating the sum using an iterator.
Sent from my iPhone

> On Oct 20, 2016, at 2:07 PM, ivan bella <[email protected]> wrote:
>
> You could cache results in an internal map. Once the number of entries in
> your map gets to a certain point, you could dump them to a separate file in
> HDFS and then start building a new map. Once you have completed the
> underlying scan, do a merge sort and aggregation of the written files to
> start returning the keys. I did something similar to this and it seems to
> work well. You might want to use RFiles as the underlying format, which
> would enable reuse of some Accumulo code when doing the merge sort. It
> would also allow more efficient reseeking into the RFiles if your iterator
> gets torn down and reconstructed, provided you detect this and at least
> avoid redoing the entire scan.
>
>> On October 20, 2016 at 1:22 PM Yamini Joshi <[email protected]> wrote:
>>
>> Hello all
>>
>> I am trying to find the number of times a set of column families appears
>> in a set of records (irrespective of the rowIds). Is it possible to do
>> this on the server side? My concern is that if the set of column families
>> is huge, it might face memory constraints on the server side. Also, we
>> might need to generate new keys with the column family name as the key
>> and the count as the value.
>>
>> Best regards,
>> Yamini Joshi
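For what it's worth, below is a rough Java sketch of the spill-and-merge pattern Ivan describes, with counts keyed by column family name. The class name SpillingAggregator is made up for illustration, and sorted tab-separated text files on local disk stand in for RFiles in HDFS; error handling and the Accumulo iterator plumbing are omitted.

import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.BiConsumer;

public class SpillingAggregator {
    private final int maxEntries;
    private final List<Path> spills = new ArrayList<>();
    private TreeMap<String, Long> counts = new TreeMap<>();

    public SpillingAggregator(int maxEntries) { this.maxEntries = maxEntries; }

    // Count one occurrence of a key (e.g. a column family name).
    public void add(String key) throws IOException {
        counts.merge(key, 1L, Long::sum);
        if (counts.size() >= maxEntries) spill();
    }

    // Dump the in-memory map to a sorted "key<TAB>count" file and start
    // a new map; TreeMap iteration keeps each spill file sorted by key.
    private void spill() throws IOException {
        if (counts.isEmpty()) return;
        Path file = Files.createTempFile("agg-spill-", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(file)) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                w.write(e.getKey() + "\t" + e.getValue());
                w.newLine();
            }
        }
        spills.add(file);
        counts = new TreeMap<>();
    }

    // Merge-sort the spill files with a priority queue, summing counts
    // for equal keys and emitting each (key, total) in sorted order, so
    // the full result never has to be held in memory at once.
    public void merge(BiConsumer<String, Long> consumer) throws IOException {
        spill(); // flush whatever is still in memory
        PriorityQueue<StreamHead> heads = new PriorityQueue<>();
        List<BufferedReader> readers = new ArrayList<>();
        try {
            for (Path p : spills) {
                BufferedReader r = Files.newBufferedReader(p);
                readers.add(r);
                StreamHead h = StreamHead.advance(r);
                if (h != null) heads.add(h);
            }
            String curKey = null;
            long curCount = 0;
            while (!heads.isEmpty()) {
                StreamHead h = heads.poll();
                if (curKey != null && !curKey.equals(h.key)) {
                    consumer.accept(curKey, curCount); // key is complete
                    curCount = 0;
                }
                curKey = h.key;
                curCount += h.count;
                StreamHead next = StreamHead.advance(h.reader);
                if (next != null) heads.add(next);
            }
            if (curKey != null) consumer.accept(curKey, curCount);
        } finally {
            for (BufferedReader r : readers) r.close();
        }
    }

    // The current (key, count) record of one spill file, ordered by key.
    private static final class StreamHead implements Comparable<StreamHead> {
        final String key;
        final long count;
        final BufferedReader reader;

        StreamHead(String key, long count, BufferedReader reader) {
            this.key = key; this.count = count; this.reader = reader;
        }

        static StreamHead advance(BufferedReader r) throws IOException {
            String line = r.readLine();
            if (line == null) return null;
            int tab = line.indexOf('\t');
            return new StreamHead(line.substring(0, tab),
                    Long.parseLong(line.substring(tab + 1)), r);
        }

        @Override
        public int compareTo(StreamHead o) { return key.compareTo(o.key); }
    }
}

Calling add() for each column family seen during the scan, then merge((cf, n) -> ...) once the scan completes, yields the per-column-family totals in sorted order while only ever holding maxEntries counts in memory.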
