I will take a look at it. Thanks Josh :)

Best regards,
Yamini Joshi

On Thu, Oct 20, 2016 at 5:30 PM, Josh Elser <[email protected]> wrote:

> You can do a partial summation in an Iterator, but managing memory
> pressure (like you originally pointed out) would require some tricks.
>
> In general, Iterators work well with performing partial computations and
> letting the client perform a final computation over the batches.
>
> https://blogs.apache.org/accumulo/entry/thinking_about_reads_over_accumulo
> might help
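>
> For a rough illustration of that pattern, here is a minimal sketch (the
> class name, batch size, and value encoding are made up for the example;
> this is not production code). The iterator tallies column families over
> a bounded batch of source entries and emits each batch's tallies as a
> single entry keyed by the last source key it consumed, which keeps the
> returned keys sorted and inside the seeked range:
>
> import java.io.IOException;
> import java.nio.charset.StandardCharsets;
> import java.util.Collection;
> import java.util.Map;
> import java.util.TreeMap;
>
> import org.apache.accumulo.core.data.ByteSequence;
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.Range;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.iterators.WrappingIterator;
> import org.apache.hadoop.io.Text;
>
> public class CfPartialCountIterator extends WrappingIterator {
>   private static final int BATCH = 10_000; // bounds memory per batch
>   private Key topKey;
>   private Value topValue;
>
>   @Override
>   public void seek(Range range, Collection<ByteSequence> cfs,
>       boolean inclusive) throws IOException {
>     super.seek(range, cfs, inclusive);
>     aggregateBatch();
>   }
>
>   @Override
>   public boolean hasTop() { return topKey != null; }
>
>   @Override
>   public void next() throws IOException { aggregateBatch(); }
>
>   @Override
>   public Key getTopKey() { return topKey; }
>
>   @Override
>   public Value getTopValue() { return topValue; }
>
>   // Consume up to BATCH source entries, tallying column families, and
>   // emit the tallies as one entry keyed by the last key consumed.
>   // Because the emitted key is the last key consumed, a torn-down
>   // iterator that is re-seeked after the last returned key resumes
>   // without losing or double counting anything.
>   private void aggregateBatch() throws IOException {
>     Map<Text,Long> counts = new TreeMap<>();
>     Key last = null;
>     for (int i = 0; i < BATCH && getSource().hasTop(); i++) {
>       last = new Key(getSource().getTopKey());
>       counts.merge(last.getColumnFamily(), 1L, Long::sum);
>       getSource().next();
>     }
>     if (last == null) {
>       topKey = null;
>       topValue = null;
>       return;
>     }
>     StringBuilder sb = new StringBuilder();
>     counts.forEach(
>         (cf, n) -> sb.append(cf).append('=').append(n).append(','));
>     topKey = last;
>     topValue = new Value(sb.toString().getBytes(StandardCharsets.UTF_8));
>   }
> }
>
> The client side is then just a scan with this iterator attached: split
> each value on commas and sum the per-family tallies into one map.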
>
> Yamini Joshi wrote:
>
>> I want to push all the computation to the server. I am using a test DB,
>> but the DB is huge in the actual dev environment. I am also not sure
>> that writing to a new table is a good option: it is not a one-time
>> operation, it needs to be computed for every query that a user fires
>> with a set of parameters.
>>
>> I am back to square one. But I guess if there is no other option, I will
>> try to benchmark and keep you guys in the loop :)
>>
>>
>>
>> Best regards,
>> Yamini Joshi
>>
>> On Thu, Oct 20, 2016 at 4:22 PM, Josh Elser <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>>     I would like to inject some hesitation here. This is getting into
>>     what I'd call "advanced Accumulo development".
>>
>>     I'd encourage you to benchmark the simple implementation (bring the
>>     columns you want to count back to the client, and perform the
>>     summation there) and see if that runs in an acceptable amount of time.
>>
>>     Creating a "pivot" table (where you move the column family from your
>>     source table to the row of a new table) is fairly straightforward to
>>     do, but you will run into problems in keeping both tables in sync
>>     with each other. :)
>>
>>     ivan bella wrote:
>>
>>         I do not have any reference code for you. However, basically you
>>         want to write a program that scans one table and creates new,
>>         transformed Keys which you write as Mutations to another table.
>>         The transformed Key object's row would be the column family of
>>         the key you pulled from the scan, and the value would be a 1
>>         encoded using one of the encoders in the LongCombiner class. You
>>         would create the new table you are going to write to manually in
>>         the Accumulo shell and set a SummingCombiner on the majc, minc,
>>         and scan scopes with the same encoder you used. Run your program,
>>         compact the new table, and then scan it.
>>
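>>         A minimal sketch of that recipe using the Java client API (the
>>         table names, iterator priority, and the STRING encoding below
>>         are illustrative; any LongCombiner encoding works as long as
>>         the writes and the combiner agree):
>>
>>         import java.nio.charset.StandardCharsets;
>>         import java.util.Collections;
>>         import java.util.EnumSet;
>>         import java.util.Map;
>>
>>         import org.apache.accumulo.core.client.BatchWriter;
>>         import org.apache.accumulo.core.client.BatchWriterConfig;
>>         import org.apache.accumulo.core.client.Connector;
>>         import org.apache.accumulo.core.client.IteratorSetting;
>>         import org.apache.accumulo.core.client.Scanner;
>>         import org.apache.accumulo.core.data.Key;
>>         import org.apache.accumulo.core.data.Mutation;
>>         import org.apache.accumulo.core.data.Value;
>>         import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
>>         import org.apache.accumulo.core.iterators.LongCombiner;
>>         import org.apache.accumulo.core.iterators.user.SummingCombiner;
>>         import org.apache.accumulo.core.security.Authorizations;
>>
>>         public class CfPivot {
>>           public static void pivot(Connector conn) throws Exception {
>>             // Create the pivot table and attach a SummingCombiner to
>>             // the scan, minc, and majc scopes, summing the "count"
>>             // column with the STRING encoder.
>>             conn.tableOperations().create("cf_counts");
>>             IteratorSetting sum =
>>                 new IteratorSetting(10, "sum", SummingCombiner.class);
>>             SummingCombiner.setEncodingType(sum, LongCombiner.Type.STRING);
>>             SummingCombiner.setColumns(sum, Collections
>>                 .singletonList(new IteratorSetting.Column("count")));
>>             conn.tableOperations().attachIterator("cf_counts", sum,
>>                 EnumSet.allOf(IteratorScope.class));
>>
>>             // Scan the source table and write one "1" per entry, keyed
>>             // by the source entry's column family. The value written
>>             // must match the combiner's encoder (STRING here).
>>             Scanner scanner =
>>                 conn.createScanner("source", Authorizations.EMPTY);
>>             BatchWriter bw =
>>                 conn.createBatchWriter("cf_counts", new BatchWriterConfig());
>>             for (Map.Entry<Key,Value> e : scanner) {
>>               Mutation m = new Mutation(e.getKey().getColumnFamily());
>>               m.put("count", "",
>>                   new Value("1".getBytes(StandardCharsets.UTF_8)));
>>               bw.addMutation(m);
>>             }
>>             bw.close();
>>
>>             // Compact so the combiner collapses the 1s; scanning
>>             // cf_counts then yields one row per column family with its
>>             // total count.
>>             conn.tableOperations().compact("cf_counts", null, null,
>>                 true, true);
>>           }
>>         }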
>>
>>             On October 20, 2016 at 4:07 PM Yamini Joshi <[email protected]> wrote:
>>
>>             Alright! Do you happen to have some reference code that I
>>             can refer to? I am a newbie, and I am not sure if by caching,
>>             aggregating, and merge-sorting you mean using some Accumulo
>>             wrapper or writing simple Java code.
>>
>>             Best regards,
>>             Yamini Joshi
>>
>>             On Thu, Oct 20, 2016 at 2:49 PM, ivan bella <[email protected]> wrote:
>>
>>
>>                 That is essentially the same thing, but instead of doing
>>                 it within an iterator, you are letting Accumulo do the
>>                 work! Perfect.
>>
>>                     On October 20, 2016 at 3:38 PM [email protected] wrote:
>>
>>                     I am wondering what the complexity would be for
>>                     this, and also how it compares to creating a new
>>                     table with the required reversed data and
>>                     calculating the sum using an iterator.
>>
>>                      Sent from my iPhone
>>
>>                     On Oct 20, 2016, at 2:07 PM, ivan bella <[email protected]> wrote:
>>
>>                         You could cache results in an internal map.
>>                         Once the number of entries in your map gets to
>>                         a certain point, you could dump them to a
>>                         separate file in HDFS and then start building a
>>                         new map. Once you have completed the underlying
>>                         scan, do a merge sort and aggregation of the
>>                         written files to start returning the keys. I
>>                         did something similar to this and it seems to
>>                         work well. You might want to use RFiles as the
>>                         underlying format, which would enable reuse of
>>                         some Accumulo code when doing the merge sort.
>>                         It would also allow more efficient re-seeking
>>                         into the RFiles if your iterator gets torn down
>>                         and reconstructed, provided you detect this and
>>                         at least avoid redoing the entire scan.
>>
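>>                         A generic sketch of that spill-and-merge
>>                         pattern in plain Java (temporary files stand in
>>                         for RFiles in HDFS, and every name below is
>>                         made up for the example): a sorted map is
>>                         spilled to a sorted "run" file at a threshold,
>>                         and a priority queue then merge-sorts the runs,
>>                         summing the counts for equal keys:
>>
>>                         import java.io.BufferedReader;
>>                         import java.io.IOException;
>>                         import java.io.PrintWriter;
>>                         import java.nio.file.Files;
>>                         import java.nio.file.Path;
>>                         import java.util.ArrayList;
>>                         import java.util.List;
>>                         import java.util.PriorityQueue;
>>                         import java.util.TreeMap;
>>                         import java.util.function.BiConsumer;
>>
>>                         public class SpillAndMerge {
>>                           private static final int SPILL_AT = 100_000;
>>                           private final TreeMap<String,Long> counts =
>>                               new TreeMap<>();
>>                           private final List<Path> runs = new ArrayList<>();
>>
>>                           // Called once per column family seen by the
>>                           // underlying scan.
>>                           void add(String cf) throws IOException {
>>                             counts.merge(cf, 1L, Long::sum);
>>                             if (counts.size() >= SPILL_AT)
>>                               spill();
>>                           }
>>
>>                           // Dump the sorted map as a run file of
>>                           // tab-separated (cf, count) lines, then
>>                           // start a new map.
>>                           private void spill() throws IOException {
>>                             if (counts.isEmpty())
>>                               return;
>>                             Path run = Files.createTempFile("cf-run-", ".txt");
>>                             try (PrintWriter w =
>>                                 new PrintWriter(Files.newBufferedWriter(run))) {
>>                               counts.forEach((cf, n) -> w.println(cf + "\t" + n));
>>                             }
>>                             runs.add(run);
>>                             counts.clear();
>>                           }
>>
>>                           // After the scan completes: k-way merge of
>>                           // the sorted runs. The queue holds one cursor
>>                           // per run; cursors on equal keys are drained
>>                           // and summed before the total is emitted, so
>>                           // memory stays bounded by the number of runs
>>                           // rather than the number of distinct keys.
>>                           void merge(BiConsumer<String,Long> emit)
>>                               throws IOException {
>>                             spill();
>>                             PriorityQueue<RunCursor> pq = new PriorityQueue<>();
>>                             for (Path run : runs) {
>>                               RunCursor c =
>>                                   new RunCursor(Files.newBufferedReader(run));
>>                               if (c.advance())
>>                                 pq.add(c);
>>                             }
>>                             while (!pq.isEmpty()) {
>>                               RunCursor c = pq.poll();
>>                               String cf = c.key;
>>                               long sum = c.count;
>>                               while (!pq.isEmpty() && pq.peek().key.equals(cf)) {
>>                                 RunCursor same = pq.poll();
>>                                 sum += same.count;
>>                                 if (same.advance())
>>                                   pq.add(same);
>>                               }
>>                               if (c.advance())
>>                                 pq.add(c);
>>                               emit.accept(cf, sum);
>>                             }
>>                           }
>>
>>                           private static final class RunCursor
>>                               implements Comparable<RunCursor> {
>>                             final BufferedReader reader;
>>                             String key;
>>                             long count;
>>
>>                             RunCursor(BufferedReader reader) {
>>                               this.reader = reader;
>>                             }
>>
>>                             // Read this run's next line; false (and
>>                             // close) at end of file.
>>                             boolean advance() throws IOException {
>>                               String line = reader.readLine();
>>                               if (line == null) {
>>                                 reader.close();
>>                                 return false;
>>                               }
>>                               int tab = line.indexOf('\t');
>>                               key = line.substring(0, tab);
>>                               count = Long.parseLong(line.substring(tab + 1));
>>                               return true;
>>                             }
>>
>>                             @Override
>>                             public int compareTo(RunCursor o) {
>>                               return key.compareTo(o.key);
>>                             }
>>                           }
>>                         }
>>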
>>                             On October 20, 2016 at 1:22 PM Yamini Joshi <[email protected]> wrote:
>>
>>                             Hello all,
>>
>>                             I am trying to find the number of times a
>>                             set of column families appears in a set of
>>                             records (irrespective of the rowIds). Is it
>>                             possible to do this on the server side? My
>>                             concern is that if the set of column
>>                             families is huge, it might face memory
>>                             constraints on the server side. Also, we
>>                             might need to generate new keys with the
>>                             column family name as the key and the count
>>                             as the value.
>>
>>                             Best regards,
>>                             Yamini Joshi
>>