I want to push all the computation to the server. I am using a test DB, but the DB is huge in the actual dev environment. I am also not sure that writing to a new table is a good option. It is not a one-time operation; it needs to be computed for every query that a user fires with a set of parameters.
I am back to square one. But I guess if there is no other option, I will
try to benchmark and keep you guys in the loop :)

Best regards,
Yamini Joshi

On Thu, Oct 20, 2016 at 4:22 PM, Josh Elser <josh.el...@gmail.com> wrote:

> I would like to inject some hesitation here. This is getting into what I'd
> call "advanced Accumulo development".
>
> I'd encourage you to benchmark the simple implementation (bring back the
> columns you want to count to the client, and perform the summation there)
> and see if that runs in an acceptable amount of time.
>
> Creating a "pivot" table (where you move the column family from your
> source table to the row of a new table) is fairly straightforward to do,
> but you will run into problems in keeping both tables in sync with each
> other. :)
>
> ivan bella wrote:
>
>> I do not have any reference code for you. However, basically you want to
>> write a program that scans from one table and creates new transformed
>> Keys, which you write as Mutations to another table. The transformed Key
>> object's row would be the column family of the key you pulled from the
>> scan, and the value would be a 1 encoded using one of the encoders in
>> the LongCombiner class. You would create the new table you are going to
>> write to manually in the Accumulo shell and set a SummingCombiner on the
>> majc, minc, and scan scopes with the same encoder you used. Run your
>> program, compact the new table, and then scan it.
>>
>> On October 20, 2016 at 4:07 PM Yamini Joshi <yamini.1...@gmail.com>
>> wrote:
>>>
>>> Alright! Do you happen to have some reference code that I can refer
>>> to? I am a newbie and I am not sure if by caching, aggregating and
>>> merge sort you mean to use some Accumulo wrapper or write simple
>>> Java code.
>>>
>>> Best regards,
>>> Yamini Joshi
>>>
>>> On Thu, Oct 20, 2016 at 2:49 PM, ivan bella <i...@ivan.bella.name>
>>> wrote:
>>>
>>> That is essentially the same thing, but instead of doing it within
>>> an iterator, you are letting Accumulo do the work! Perfect.
>>>
>>> On October 20, 2016 at 3:38 PM yamini.1...@gmail.com wrote:
>>>>
>>>> I am wondering what the complexity would be for this and also how
>>>> it compares to creating a new table with the required reversed
>>>> data and calculating the sum using an iterator.
>>>>
>>>> On Oct 20, 2016, at 2:07 PM, ivan bella <i...@ivan.bella.name>
>>>> wrote:
>>>>
>>>>> You could cache results in an internal map. Once the number of
>>>>> entries in your map gets to a certain point, you could dump them
>>>>> to a separate file in HDFS and then start building a new map.
>>>>> Once you have completed the underlying scan, do a merge sort and
>>>>> aggregation of the written files to start returning the keys. I
>>>>> did something similar to this and it seems to work well. You
>>>>> might want to use RFiles as the underlying format, which would
>>>>> enable reuse of some Accumulo code when doing the merge sort.
>>>>> It would also allow more efficient reseeking into the RFiles if
>>>>> your iterator gets torn down and reconstructed, provided you
>>>>> detect this and at least avoid redoing the entire scan.
>>>>>
>>>>> On October 20, 2016 at 1:22 PM Yamini Joshi
>>>>> <yamini.1...@gmail.com> wrote:
>>>>>>
>>>>>> Hello all
>>>>>>
>>>>>> I am trying to find the number of times a set of column
>>>>>> families appear in a set of records (irrespective of the
>>>>>> rowIds). Is it possible to do this on the server side? My
>>>>>> concern is that if the set of column families is huge, it might
>>>>>> face memory constraints on the server side. Also, we might need
>>>>>> to generate new keys with the column family name as the key and
>>>>>> the count as the value.
>>>>>>
>>>>>> Best regards,
>>>>>> Yamini Joshi
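
For reference, here is a minimal sketch of the counting discussed in this thread, written in plain Java with no Accumulo dependency. It assumes the scan results are already available as (row, column family) pairs; the names ScannedKey and countColumnFamilies are hypothetical and are not part of the Accumulo API. Each entry is pivoted to (column family, 1) and the ones are summed per column family. That is the aggregation a SummingCombiner on the pivot table would be expected to perform, and it is also what the simple client-side summation amounts to.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ColumnFamilyCount {

    // Hypothetical stand-in for a scanned key: only the row and the
    // column family matter for this count.
    static final class ScannedKey {
        final String row;
        final String columnFamily;

        ScannedKey(String row, String columnFamily) {
            this.row = row;
            this.columnFamily = columnFamily;
        }
    }

    // Pivot every scanned key to (columnFamily -> 1) and add the ones up
    // per column family, ignoring the row ID entirely.
    static Map<String, Long> countColumnFamilies(Iterable<ScannedKey> scan) {
        // TreeMap keeps the result sorted by column family, similar to how
        // rows come back sorted from a table scan.
        Map<String, Long> counts = new TreeMap<>();
        for (ScannedKey key : scan) {
            counts.merge(key.columnFamily, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<ScannedKey> scan = Arrays.asList(
                new ScannedKey("row1", "cfA"),
                new ScannedKey("row1", "cfB"),
                new ScannedKey("row2", "cfA"),
                new ScannedKey("row3", "cfA"));

        // Prints:
        // cfA 3
        // cfB 1
        countColumnFamilies(scan)
                .forEach((cf, n) -> System.out.println(cf + " " + n));
    }
}
```

This sketch holds every distinct column family in memory, which is exactly the constraint raised in the original question. If the set of column families is too large for that, the map would have to be replaced by writes to a pivot table or by the spill-and-merge approach described above.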