Alright! Do you happen to have some reference code that I can refer to? I am a newbie and I am not sure if by caching, aggregating, and merge sorting you mean to use an Accumulo wrapper or to write simple Java code.
Best regards,
Yamini Joshi

On Thu, Oct 20, 2016 at 2:49 PM, ivan bella <i...@ivan.bella.name> wrote:

> That is essentially the same thing, but instead of doing it within an
> iterator, you are letting Accumulo do the work! Perfect.
>
> On October 20, 2016 at 3:38 PM yamini.1...@gmail.com wrote:
>
> I am wondering what the complexity would be for this and also how it
> compares to creating a new table with the required reversed data and
> calculating the sum using an iterator.
>
> Sent from my iPhone
>
> On Oct 20, 2016, at 2:07 PM, ivan bella <i...@ivan.bella.name> wrote:
>
> You could cache results in an internal map. Once the number of entries in
> your map reaches a certain point, you could dump them to a separate file in
> HDFS and then start building a new map. Once you have completed the
> underlying scan, do a merge sort and aggregation of the written files to
> start returning the keys. I did something similar to this and it seems to
> work well. You might want to use RFiles as the underlying format, which
> would enable reuse of some Accumulo code when doing the merge sort. It
> would also allow more efficient reseeking into the RFiles if your iterator
> gets torn down and reconstructed, provided you detect this and at least
> avoid redoing the entire scan.
>
> On October 20, 2016 at 1:22 PM Yamini Joshi <yamini.1...@gmail.com> wrote:
>
> Hello all
>
> I am trying to find the number of times a set of column families appears in
> a set of records (irrespective of the rowIds). Is it possible to do this on
> the server side? My concern is that if the set of column families is huge,
> it might face memory constraints on the server side. Also, we might need to
> generate new keys with the column family name as the key and the count as
> the value.
>
> Best regards,
> Yamini Joshi
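[For readers of the archive: the cache/spill/merge idea described in the quoted messages can be sketched in plain Java. This is a hypothetical illustration only, not the code the posters used: it counts column-family occurrences, spills the in-memory map to sorted temp files once it passes a threshold, and then k-way merge-sorts the spill files while summing counts for equal keys. Sorted text files and a local temp directory stand in for the RFiles and HDFS mentioned in the thread, and all class and method names are made up for the sketch.]

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Sketch of the spill-and-merge counting pattern from the thread.
public class SpillMergeCount {
    private final int spillThreshold;
    private final List<Path> runs = new ArrayList<>();
    private TreeMap<String, Long> cache = new TreeMap<>();

    public SpillMergeCount(int spillThreshold) {
        this.spillThreshold = spillThreshold;
    }

    // Count one occurrence of a key (e.g. a column family name);
    // spill the map to disk once it grows past the threshold.
    public void add(String key) throws IOException {
        cache.merge(key, 1L, Long::sum);
        if (cache.size() >= spillThreshold) spill();
    }

    // TreeMap iterates in sorted order, so each run file is already sorted.
    private void spill() throws IOException {
        Path run = Files.createTempFile("run", ".tsv");
        try (BufferedWriter w = Files.newBufferedWriter(run)) {
            for (Map.Entry<String, Long> e : cache.entrySet())
                w.write(e.getKey() + "\t" + e.getValue() + "\n");
        }
        runs.add(run);
        cache = new TreeMap<>();
    }

    // Merge-sort the sorted runs with a priority queue, summing counts
    // for equal keys and emitting each (key, total) pair in key order.
    public void mergeTo(BiConsumer<String, Long> sink) throws IOException {
        if (!cache.isEmpty()) spill();
        PriorityQueue<Cursor> pq =
            new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.key));
        List<BufferedReader> readers = new ArrayList<>();
        for (Path run : runs) {
            BufferedReader r = Files.newBufferedReader(run);
            readers.add(r);
            Cursor c = Cursor.next(r);
            if (c != null) pq.add(c);
        }
        String curKey = null;
        long total = 0;
        while (!pq.isEmpty()) {
            Cursor c = pq.poll();
            if (!c.key.equals(curKey)) {
                if (curKey != null) sink.accept(curKey, total);
                curKey = c.key;
                total = 0;
            }
            total += c.count;
            Cursor nxt = Cursor.next(c.reader);
            if (nxt != null) pq.add(nxt);
        }
        if (curKey != null) sink.accept(curKey, total);
        for (BufferedReader r : readers) r.close();
    }

    // One read position in one sorted run file.
    private static final class Cursor {
        final String key; final long count; final BufferedReader reader;
        Cursor(String k, long c, BufferedReader r) { key = k; count = c; reader = r; }
        static Cursor next(BufferedReader r) throws IOException {
            String line = r.readLine();
            if (line == null) return null;
            String[] p = line.split("\t");
            return new Cursor(p[0], Long.parseLong(p[1]), r);
        }
    }

    public static void main(String[] args) throws IOException {
        SpillMergeCount counter = new SpillMergeCount(2); // tiny threshold to force spills
        for (String cf : new String[] {"name", "age", "name", "city", "name", "age"})
            counter.add(cf);
        counter.mergeTo((k, v) -> System.out.println(k + "=" + v));
        // prints: age=2, city=1, name=3 (one per line, in key order)
    }
}
```

In a real iterator you would replace the text files with RFile writers/readers so the merge can reuse Accumulo's own key comparison and seeking, as suggested in the thread.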