I would like to inject some hesitation here. This is getting into what I'd call "advanced Accumulo development".

I'd encourage you to benchmark the simple implementation (bring back the columns you want to count to the client, and perform the summation there) and see if that runs in an acceptable amount of time.
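
For what it's worth, here is a rough sketch of that client-side summation. It uses only the JDK so it can be run without a cluster: the `families` argument is a stand-in for the column families you would read off the Keys returned by a Scanner, and the class and method names are made up for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class ClientSideFamilyCount {
    /**
     * Counts how often each wanted column family occurs.
     * "families" is a hypothetical stand-in for the column family of each
     * entry returned by a scan; rows are ignored on purpose.
     */
    static Map<String, Long> count(Iterable<String> families, Set<String> wanted) {
        Map<String, Long> counts = new TreeMap<>();
        for (String cf : families) {
            if (wanted.contains(cf)) {
                counts.merge(cf, 1L, Long::sum); // add 1 to the running total
            }
        }
        return counts;
    }
}
```

The map holds one entry per distinct column family, so client memory grows with the number of families you ask about, not with the number of records scanned.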

Creating a "pivot" table (where you move the column family from your source table to the row of a new table) is fairly straightforward to do, but you will run into problems in keeping both tables in sync with each other. :)

ivan bella wrote:
I do not have any reference code for you. Basically, though, you want
to write a program that scans one table, creates a new, transformed Key
for each entry, and writes those as Mutations to another table. The
transformed Key's row would be the column family of the key you pulled
from the scan, and the value would be a 1 encoded using one of the
encoders in the LongCombiner class. You would create the new table
manually in the accumulo shell and set a SummingCombiner on the majc,
minc, and scan scopes with the same encoder you used. Run your program,
compact the new table, and then scan it.
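
A JDK-only simulation of that pivot-and-sum idea might look like the sketch below. It is not Accumulo code: the `TreeMap` stands in for the new table, `write` stands in for sending one Mutation, and the merge function imitates what a summing combiner does when it sees several values for the same key. The decimal-string encoding is an assumption meant to resemble the string encoder in LongCombiner; all names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

final class PivotSketch {
    // Assumed encoding: the long written as a decimal string in UTF-8.
    static byte[] encode(long v) {
        return Long.toString(v).getBytes(StandardCharsets.UTF_8);
    }

    static long decode(byte[] b) {
        return Long.parseLong(new String(b, StandardCharsets.UTF_8));
    }

    // Stand-in for the pivot table: row (the source column family) -> encoded count.
    private final Map<String, byte[]> pivot = new TreeMap<>();

    // Stand-in for one Mutation: row = source column family, value = encoded 1.
    void write(String sourceFamily) {
        // The merge function imitates a summing combiner: decode, add, re-encode.
        pivot.merge(sourceFamily, encode(1L), (a, b) -> encode(decode(a) + decode(b)));
    }

    // Stand-in for scanning the pivot table for one row.
    long countFor(String family) {
        byte[] v = pivot.get(family);
        return v == null ? 0L : decode(v);
    }
}
```

The point of the sketch is that the writer never reads the current count; it only ever writes an encoded 1, and the summation happens wherever values for the same key meet.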


On October 20, 2016 at 4:07 PM Yamini Joshi <yamini.1...@gmail.com> wrote:

Alright! Do you happen to have some reference code that I can refer
to? I am a newbie, and I am not sure whether by caching, aggregating,
and merge sort you mean using some Accumulo wrapper or writing some
simple Java code.

Best regards,
Yamini Joshi

On Thu, Oct 20, 2016 at 2:49 PM, ivan bella <i...@ivan.bella.name> wrote:


    That is essentially the same thing, but instead of doing it within
    an iterator, you are letting accumulo do the work! Perfect.

    On October 20, 2016 at 3:38 PM yamini.1...@gmail.com wrote:

    I am wondering what the complexity of this would be, and also how
    it compares to creating a new table with the required reversed
    data and calculating the sum using an iterator.

    Sent from my iPhone

    On Oct 20, 2016, at 2:07 PM, ivan bella <i...@ivan.bella.name> wrote:

    You could cache results in an internal map. Once the number of
    entries in your map gets to a certain point, you could dump them
    to a separate file in hdfs and then start building a new map.
    Once you have completed the underlying scan, do a merge sort and
    aggregation of the written files to start returning the keys. I
    did something similar to this and it seems to work well. You
    might want to use RFiles as the underlying format, which would
    enable reuse of some Accumulo code when doing the merge sort.
    It would also allow more efficient reseeking into the RFiles if
    your iterator gets torn down and reconstructed, provided you
    detect this and at least avoid redoing the entire scan.
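
The cache, spill, and merge steps described above could be sketched roughly as follows. This is a JDK-only illustration under several assumptions: it writes tab-separated temp files on the local disk instead of RFiles in HDFS, it assumes column family names contain no newlines, and the class name, threshold parameter, and file naming are all hypothetical. It also does not handle an iterator being torn down and reconstructed.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

final class SpillingCounter {
    private final int maxEntries;                         // spill threshold (hypothetical knob)
    private final List<Path> spills = new ArrayList<>();  // sorted spill files written so far
    private TreeMap<String, Long> buffer = new TreeMap<>();

    SpillingCounter(int maxEntries) { this.maxEntries = maxEntries; }

    void add(String family) {
        buffer.merge(family, 1L, Long::sum);
        if (buffer.size() >= maxEntries) spill();
    }

    // Writes the in-memory map, already sorted by key, to a new file and starts a fresh map.
    private void spill() {
        if (buffer.isEmpty()) return;
        try {
            Path file = Files.createTempFile("cf-counts-", ".tsv");
            file.toFile().deleteOnExit();
            try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                for (Map.Entry<String, Long> e : buffer.entrySet()) {
                    out.write(e.getKey() + "\t" + e.getValue());
                    out.newLine();
                }
            }
            spills.add(file);
            buffer = new TreeMap<>();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // One open spill file plus its current line, used as a heap entry.
    private static final class Cursor {
        final BufferedReader in;
        String key;
        long count;

        Cursor(BufferedReader in) { this.in = in; }

        boolean advance() throws IOException {
            String line = in.readLine();
            if (line == null) { in.close(); return false; }
            int tab = line.lastIndexOf('\t');
            key = line.substring(0, tab);
            count = Long.parseLong(line.substring(tab + 1));
            return true;
        }
    }

    // K-way merge of the sorted spill files, summing counts for equal keys.
    // Collected into a map here for brevity; a real iterator would emit
    // each total as soon as the key changes instead of holding them all.
    Map<String, Long> finish() {
        spill();
        PriorityQueue<Cursor> heap = new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.key));
        Map<String, Long> totals = new LinkedHashMap<>();
        try {
            for (Path p : spills) {
                Cursor c = new Cursor(Files.newBufferedReader(p, StandardCharsets.UTF_8));
                if (c.advance()) heap.add(c);
            }
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();
                totals.merge(c.key, c.count, Long::sum);
                if (c.advance()) heap.add(c);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return totals;
    }
}
```

Because every spill file is written in sorted order, the merge only needs one open reader and one buffered line per file, which is what keeps memory bounded while the scan is running.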

    On October 20, 2016 at 1:22 PM Yamini Joshi <yamini.1...@gmail.com> wrote:

    Hello all

    I am trying to find the number of times each of a set of column
    families appears in a set of records (irrespective of the
    rowIds). Is it possible to do this on the server side? My
    concern is that if the set of column families is huge, the
    computation might run into memory constraints on the server
    side. Also, we might need to generate new keys with the column
    family name as the key and the count as the value.

    Best regards,
    Yamini Joshi

