I am wondering what the complexity of this would be, and how it compares to 
creating a new table with the required reversed data and calculating the sum 
using an iterator.
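
For comparison, here is a minimal sketch of that reversed-table approach, 
assuming an existing Connector conn and a Scanner over the source table; the 
table name and column layout are hypothetical, not something from this thread. 
A second table is keyed by column family, and a SummingCombiner lets Accumulo 
do the summing at scan and compaction time:

    import java.util.Map;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.Combiner;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;
    import org.apache.hadoop.io.Text;

    public class ReversedCountTable {

        /** Build a "reversed" table whose rowId is the column family name. */
        public static void build(Connector conn, Scanner source, String countTable)
                throws Exception { // sketch: real code would list the Accumulo exceptions
            conn.tableOperations().create(countTable);

            // Let Accumulo sum the per-column-family counts for us.
            IteratorSetting is = new IteratorSetting(10, "sum", SummingCombiner.class);
            Combiner.setCombineAllColumns(is, true);
            LongCombiner.setEncodingType(is, LongCombiner.Type.STRING);
            conn.tableOperations().attachIterator(countTable, is);

            // One +1 mutation per source entry, keyed by its column family.
            BatchWriter bw = conn.createBatchWriter(countTable, new BatchWriterConfig());
            try {
                for (Map.Entry<Key, Value> e : source) {
                    Mutation m = new Mutation(e.getKey().getColumnFamily());
                    m.put(new Text("count"), new Text(""), new Value("1".getBytes()));
                    bw.addMutation(m);
                }
            } finally {
                bw.close();
            }
        }
    }

Cost-wise, this is one extra full write pass plus the storage for the second 
table, after which each count is a single-row lookup; the in-iterator approach 
discussed below avoids the second table but has to re-aggregate on every query.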


> On Oct 20, 2016, at 2:07 PM, ivan bella <[email protected]> wrote:
> 
> You could cache results in an internal map. Once the number of entries in 
> your map reaches a certain threshold, dump them to a separate file in HDFS 
> and start building a new map. Once the underlying scan completes, do a 
> merge sort and aggregation of the written files to start returning the 
> keys. I did something similar and it seems to work well. You might want to 
> use RFiles as the underlying format, which would let you reuse some 
> Accumulo code for the merge sort. RFiles would also allow more efficient 
> reseeking if your iterator gets torn down and reconstructed, provided you 
> detect this and at least avoid redoing the entire scan.
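
A minimal sketch of this cache-and-spill scheme, using the public RFile API 
added in Accumulo 1.8 (RFile.newWriter / RFile.newScanner); the class name, 
spill threshold, and spill-file layout are illustrative assumptions, not code 
from this thread:

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.rfile.RFile;
    import org.apache.accumulo.core.client.rfile.RFileWriter;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.PartialKey;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.fs.FileSystem;

    public class SpillingAggregator {

        private static final int SPILL_THRESHOLD = 100_000; // hypothetical cutoff

        private final FileSystem fs;
        private final String spillDir; // e.g. a scratch directory in HDFS
        private final TreeMap<Key, Long> counts = new TreeMap<>(); // sorted, so spills are RFile-ready
        private int spillCount = 0;

        public SpillingAggregator(FileSystem fs, String spillDir) {
            this.fs = fs;
            this.spillDir = spillDir;
        }

        /** Tally one key; spill the map to an RFile once it grows too large. */
        public void add(Key key) throws IOException {
            counts.merge(key, 1L, Long::sum);
            if (counts.size() >= SPILL_THRESHOLD) {
                spill();
            }
        }

        /** Write the sorted in-memory map to a new RFile, then start a new map. */
        private void spill() throws IOException {
            String file = spillDir + "/spill-" + (spillCount++) + ".rf";
            try (RFileWriter writer = RFile.newWriter().to(file).withFileSystem(fs).build()) {
                writer.startDefaultLocalityGroup();
                for (Map.Entry<Key, Long> e : counts.entrySet()) {
                    writer.append(e.getKey(), new Value(Long.toString(e.getValue()).getBytes()));
                }
            }
            counts.clear();
        }

        /**
         * The RFile scanner presents one sorted, merged view over all spill
         * files, so the merge sort comes for free; we just sum counts for
         * runs of equal keys.
         */
        public void mergeAndReport() throws IOException {
            spill(); // flush whatever is still in memory
            String[] files = new String[spillCount];
            for (int i = 0; i < spillCount; i++) {
                files[i] = spillDir + "/spill-" + i + ".rf";
            }
            Scanner merged = RFile.newScanner().from(files).withFileSystem(fs).build();
            Key current = null;
            long sum = 0;
            for (Map.Entry<Key, Value> e : merged) {
                if (current != null && !e.getKey().equals(current, PartialKey.ROW_COLFAM_COLQUAL_COLVIS)) {
                    System.out.println(current + " -> " + sum); // or emit as a synthetic key/value
                    sum = 0;
                }
                current = e.getKey();
                sum += Long.parseLong(e.getValue().toString());
            }
            if (current != null) {
                System.out.println(current + " -> " + sum);
            }
            merged.close();
        }
    }

Handling the teardown case mentioned above would additionally mean remembering 
the last key returned and seeking the merged scanner past it, rather than 
redoing the whole scan.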
> 
>> On October 20, 2016 at 1:22 PM Yamini Joshi <[email protected]> wrote:
>> 
>> Hello all
>> 
>> I am trying to count how many times each column family in a set appears 
>> across a set of records (irrespective of the rowIds). Is it possible to do 
>> this on the server side? My concern is that if the set of column families 
>> is huge, the iterator might hit memory constraints on the tablet server. 
>> We might also need to generate new keys with the column family name as the 
>> key and the count as the value.
>> 
>> Best regards,
>> Yamini Joshi
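
Returning to the original question, a rough sketch of a server-side 
column-family counting iterator (all names hypothetical). Note that it buffers 
the entire tally in memory, which is exactly the constraint raised above and 
what the spill-to-HDFS scheme is meant to relieve:

    import java.io.IOException;
    import java.util.Collection;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.accumulo.core.data.ByteSequence;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.WrappingIterator;
    import org.apache.hadoop.io.Text;

    /**
     * Drains the underlying scan and emits one synthetic entry per column
     * family seen: row = column family name, value = occurrence count.
     */
    public class ColumnFamilyCountIterator extends WrappingIterator {

        private TreeMap<String, Long> counts;           // sorted so output stays in key order
        private Iterator<Map.Entry<String, Long>> out;  // cursor over the aggregated counts
        private Key topKey;
        private Value topValue;

        @Override
        public void seek(Range range, Collection<ByteSequence> columnFamilies,
                boolean inclusive) throws IOException {
            super.seek(range, columnFamilies, inclusive);
            counts = new TreeMap<>();
            // Tally column families irrespective of rowId.
            while (super.hasTop()) {
                String cf = super.getTopKey().getColumnFamily().toString();
                counts.merge(cf, 1L, Long::sum);
                super.next();
            }
            out = counts.entrySet().iterator();
            advance();
        }

        private void advance() {
            if (out.hasNext()) {
                Map.Entry<String, Long> e = out.next();
                // Synthetic key: column family name as the row, count as the value.
                topKey = new Key(new Text(e.getKey()));
                topValue = new Value(Long.toString(e.getValue()).getBytes());
            } else {
                topKey = null;
                topValue = null;
            }
        }

        @Override
        public boolean hasTop() { return topKey != null; }

        @Override
        public void next() { advance(); }

        @Override
        public Key getTopKey() { return topKey; }

        @Override
        public Value getTopValue() { return topValue; }
    }

Since an iterator runs independently on each tablet, the client still has to 
sum the partial counts returned per tablet; and because the synthetic keys can 
fall outside the seeked range, a production version would need care around 
re-seek semantics.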
