I will take a look at it. Thanks Josh :)

Best regards,
Yamini Joshi
On Thu, Oct 20, 2016 at 5:30 PM, Josh Elser <[email protected]> wrote:

> You can do a partial summation in an Iterator, but managing memory
> pressure (like you originally pointed out) would require some tricks.
>
> In general, Iterators work well for performing partial computations and
> letting the client perform a final computation over the batches.
>
> https://blogs.apache.org/accumulo/entry/thinking_about_reads_over_accumulo
> might help.
>
> Yamini Joshi wrote:
>
>> I want to push all the computation to the server. I am using a test DB,
>> but the DB is huge in the actual dev environment. I am also not sure
>> whether writing to a new table is a good option either. It is not a
>> one-time operation; it needs to be computed for every query that a user
>> fires with a set of parameters.
>>
>> I am back to square one. But I guess if there is no other option, I
>> will try to benchmark and keep you guys in the loop :)
>>
>> Best regards,
>> Yamini Joshi
>>
>> On Thu, Oct 20, 2016 at 4:22 PM, Josh Elser <[email protected]> wrote:
>>
>> I would like to inject some hesitation here. This is getting into what
>> I'd call "advanced Accumulo development".
>>
>> I'd encourage you to benchmark the simple implementation (bring the
>> columns you want to count back to the client and perform the summation
>> there) and see if that runs in an acceptable amount of time.
>>
>> Creating a "pivot" table (where you move the column family from your
>> source table to the row of a new table) is fairly straightforward to
>> do, but you will run into problems keeping both tables in sync with
>> each other. :)
>>
>> ivan bella wrote:
>>
>> I do not have any reference code for you. However, basically you want
>> to write a program that scans one table and creates new transformed
>> Keys, which you write as Mutations to another table.
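[The pattern Josh describes above — partial sums computed server-side in an
Iterator, with the client performing the final computation over the batches —
reduces, on the client, to merging per-batch count maps. A minimal sketch
using only the JDK, with plain maps standing in for the iterator's per-tablet
results; the class and method names are hypothetical, not an Accumulo API:]

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class PartialSumMergeSketch {

    /**
     * Final client-side step: merge the partial per-column-family sums
     * returned by each server-side batch into one grand total.
     */
    static SortedMap<String, Long> mergePartials(List<Map<String, Long>> batches) {
        SortedMap<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> batch : batches)
            batch.forEach((cf, n) -> totals.merge(cf, n, Long::sum));
        return totals;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> batches = List.of(
                Map.of("cfA", 2L, "cfB", 1L),  // partial sums from one batch
                Map.of("cfA", 3L));            // partial sums from another
        System.out.println(mergePartials(batches)); // {cfA=5, cfB=1}
    }
}
```

[Memory pressure stays bounded on the server because each batch only ever
holds its own partial counts; only the small merged map lives at the client.]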
>> The transformed Key object's row would be the column family of the key
>> you pulled from the scan, and the value would be a 1 encoded using one
>> of the encoders in the LongCombiner class. You would create the new
>> table you are going to write to manually in the Accumulo shell and set
>> a SummingCombiner on the majc, minc, and scan scopes with the same
>> encoder you used. Run your program, compact the new table, and then
>> scan it.
>>
>> On October 20, 2016 at 4:07 PM Yamini Joshi <[email protected]> wrote:
>>
>> Alright! Do you happen to have some reference code that I can refer
>> to? I am a newbie, and I am not sure whether by caching, aggregating,
>> and merge sort you mean to use some Accumulo wrapper or to write
>> simple Java code.
>>
>> Best regards,
>> Yamini Joshi
>>
>> On Thu, Oct 20, 2016 at 2:49 PM, ivan bella <[email protected]> wrote:
>>
>> That is essentially the same thing, but instead of doing it within an
>> iterator, you are letting Accumulo do the work! Perfect.
>>
>> On October 20, 2016 at 3:38 PM [email protected] wrote:
>>
>> I am wondering what the complexity would be for this, and also how it
>> compares to creating a new table with the required reversed data and
>> calculating the sum using an iterator.
>>
>> Sent from my iPhone
>>
>> On Oct 20, 2016, at 2:07 PM, ivan bella <[email protected]> wrote:
>>
>> You could cache results in an internal map. Once the number of entries
>> in your map gets to a certain point, you could dump them to a separate
>> file in HDFS and then start building a new map.
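[Ivan's pivot-table recipe earlier in the thread — new row = source column
family, value = an encoded 1, SummingCombiner on the scan/minc/majc scopes —
can be sketched with plain JDK collections. Here a TreeMap stands in for
both the pivot table and the combiner (each inserted 1 is summed on the
spot); the names are hypothetical and no Accumulo API is used:]

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class PivotCountSketch {

    /**
     * Count how often each column family appears, the way the pivot table
     * would read after the SummingCombiner collapses the written 1s.
     * Each source entry is modeled as {row, columnFamily}.
     */
    static SortedMap<String, Long> pivotCounts(List<String[]> sourceEntries) {
        SortedMap<String, Long> pivot = new TreeMap<>();
        for (String[] entry : sourceEntries)
            pivot.merge(entry[1], 1L, Long::sum); // "write" a 1; combiner sums
        return pivot;
    }

    public static void main(String[] args) {
        List<String[]> source = List.of(
                new String[] {"row1", "cfA"},
                new String[] {"row2", "cfA"},
                new String[] {"row3", "cfB"});
        System.out.println(pivotCounts(source)); // {cfA=2, cfB=1}
    }
}
```

[In the real version, each `merge` call would instead be a Mutation of value
1 written to the second table, and the summing would happen in the
SummingCombiner at scan and compaction time rather than in memory here.]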
>> Once you have completed the underlying scan, do a merge sort and
>> aggregation of the written files to start returning the keys. I did
>> something similar to this, and it seems to work well. You might want
>> to use RFiles as the underlying format, which would enable reuse of
>> some Accumulo code when doing the merge sort. It would also allow
>> more efficient reseeking into the RFiles if your iterator gets torn
>> down and reconstructed, provided you detect this and at least avoid
>> redoing the entire scan.
>>
>> On October 20, 2016 at 1:22 PM Yamini Joshi <[email protected]> wrote:
>>
>> Hello all,
>>
>> I am trying to find the number of times a set of column families
>> appears in a set of records (irrespective of the rowIds). Is it
>> possible to do this on the server side? My concern is that if the set
>> of column families is huge, it might face memory constraints on the
>> server side. Also, we might need to generate new keys with the column
>> family name as the key and the count as the value.
>>
>> Best regards,
>> Yamini Joshi
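[The spill-and-merge approach Ivan describes above — dump each full map to a
sorted file, then merge-sort and aggregate the files while returning keys —
can be sketched with JDK temp files and a heap standing in for HDFS and
RFiles. All names are hypothetical; a real implementation would use RFile
writers/readers so Accumulo's own merge code could be reused:]

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;

public class SpillMergeSketch {

    /** Spill one full in-memory map to a sorted "cf<TAB>count" file. */
    static Path spill(SortedMap<String, Long> counts) throws IOException {
        Path file = Files.createTempFile("spill", ".txt");
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet())
            lines.add(e.getKey() + "\t" + e.getValue());
        Files.write(file, lines); // TreeMap iteration order keeps it sorted
        return file;
    }

    /** Streaming k-way merge of sorted spill files, summing per family. */
    static Map<String, Long> merge(List<Path> files) throws IOException {
        record Head(String cf, long n, BufferedReader in) {}
        PriorityQueue<Head> heap =
                new PriorityQueue<>(Comparator.comparing(Head::cf));
        for (Path f : files) {
            BufferedReader in = Files.newBufferedReader(f);
            String line = in.readLine();
            if (line == null) { in.close(); continue; }
            String[] p = line.split("\t");
            heap.add(new Head(p[0], Long.parseLong(p[1]), in));
        }
        Map<String, Long> merged = new LinkedHashMap<>(); // sorted emit order
        while (!heap.isEmpty()) {
            Head h = heap.poll();
            merged.merge(h.cf(), h.n(), Long::sum);
            String line = h.in().readLine();
            if (line == null) { h.in().close(); continue; }
            String[] p = line.split("\t");
            heap.add(new Head(p[0], Long.parseLong(p[1]), h.in()));
        }
        return merged;
    }

    public static void main(String[] args) throws IOException {
        Path a = spill(new TreeMap<>(Map.of("cfA", 2L, "cfB", 1L)));
        Path b = spill(new TreeMap<>(Map.of("cfA", 1L, "cfC", 4L)));
        System.out.println(merge(List.of(a, b))); // {cfA=3, cfB=1, cfC=4}
    }
}
```

[Because every spill file is sorted, the heap always surfaces the globally
smallest column family next, so keys can be returned as they are merged —
the same reason the RFile variant supports efficient reseeking after an
iterator teardown.]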
