You can do a partial summation in an Iterator, but managing memory pressure (as you originally pointed out) would require some tricks.

In general, Iterators work well for performing partial computations and letting the client perform a final computation over the batches.

https://blogs.apache.org/accumulo/entry/thinking_about_reads_over_accumulo might help
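
For the client-side half of that, the final merge over partial results is short. Here is a rough, untested sketch against the 1.x Connector API; the table name is a placeholder, and it assumes a server-side iterator was already configured to emit partial counts encoded with LongCombiner's FIXED_LEN_ENCODER:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.security.Authorizations;

public class ClientSideMerge {
  // conn is an already-connected Connector; "mytable" is a placeholder name.
  static Map<String,Long> mergePartialCounts(Connector conn) throws Exception {
    Map<String,Long> totals = new HashMap<>();
    BatchScanner bs = conn.createBatchScanner("mytable", Authorizations.EMPTY, 4);
    try {
      bs.setRanges(Collections.singleton(new Range())); // full-table scan
      for (Entry<Key,Value> e : bs) {
        // Each entry is assumed to be a partial count for one column family,
        // encoded server-side with LongCombiner.FIXED_LEN_ENCODER.
        String cf = e.getKey().getColumnFamily().toString();
        long partial = LongCombiner.FIXED_LEN_ENCODER.decode(e.getValue().get());
        totals.merge(cf, partial, Long::sum);
      }
    } finally {
      bs.close();
    }
    return totals;
  }
}

Since each tablet can return its own partial sum for the same column family, this client-side merge is what produces the global totals.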

Yamini Joshi wrote:
I want to push all the computation to the server. I am using a test DB,
but the DB is huge in the actual dev environment. I am also not sure
that writing to a new table is a good option either: it is not a
one-time operation; the sum needs to be computed for every query that a
user fires with a set of parameters.

I am back to square one. But I guess if there is no other option, I will
try to benchmark and keep you guys in the loop :)



Best regards,
Yamini Joshi

On Thu, Oct 20, 2016 at 4:22 PM, Josh Elser <[email protected]> wrote:

    I would like to inject some hesitation here. This is getting into
    what I'd call "advanced Accumulo development".

    I'd encourage you to benchmark the simple implementation (bring back
    the columns you want to count to the client, and perform the
    summation there) and see if that runs in an acceptable amount of time.

    Creating a "pivot" table (where you move the column family from your
    source table to the row of a new table) is fairly straightforward to
    do, but you will run into problems in keeping both tables in sync
    with each other. :)
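
The simple implementation above is only a few lines, so benchmarking it first is cheap. A rough sketch, with a placeholder table name and authorizations:

import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class NaiveCount {
  // Pull the entries back to the client and count column families there.
  static Map<String,Long> count(Connector conn) throws Exception {
    Map<String,Long> counts = new HashMap<>();
    Scanner scanner = conn.createScanner("mytable", Authorizations.EMPTY);
    for (Entry<Key,Value> e : scanner) {
      counts.merge(e.getKey().getColumnFamily().toString(), 1L, Long::sum);
    }
    scanner.close();
    return counts;
  }
}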

    ivan bella wrote:

        I do not have any reference code for you. However, basically you
        want to write a program that scans one table and creates new
        transformed Keys which you write as Mutations to another table.
        The transformed Key object's row would be the column family of
        the key you pulled from the scan, and the value would be a 1
        encoded using one of the encoders in the LongCombiner class. You
        would create the new table you are going to write to manually in
        the accumulo shell and set a SummingCombiner on the majc, minc,
        and scan scopes with the same encoder you used. Run your
        program, compact the new table, and then scan it.
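
A rough, untested sketch of the program ivan describes; table names are placeholders, and the SummingCombiner is attached programmatically here, though the shell's setiter command accomplishes the same thing:

import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class PivotLoad {
  static void pivot(Connector conn) throws Exception {
    // Create the pivot table with a SummingCombiner; attachIterator
    // defaults to all scopes (scan, minc, majc).
    conn.tableOperations().create("pivot");
    IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
    LongCombiner.setEncodingType(setting, LongCombiner.Type.FIXEDLEN);
    SummingCombiner.setCombineAllColumns(setting, true);
    conn.tableOperations().attachIterator("pivot", setting);

    // Re-key: the source column family becomes the row of the pivot
    // table, with an encoded 1 as the value for the combiner to sum.
    Scanner scan = conn.createScanner("source", Authorizations.EMPTY);
    BatchWriter writer = conn.createBatchWriter("pivot", new BatchWriterConfig());
    byte[] one = LongCombiner.FIXED_LEN_ENCODER.encode(1L);
    for (Entry<Key,Value> e : scan) {
      Mutation m = new Mutation(e.getKey().getColumnFamily());
      m.put(new Text("count"), new Text(""), new Value(one));
      writer.addMutation(m);
    }
    writer.close();
    scan.close();
  }
}

After the load, compacting the pivot table forces the SummingCombiner to collapse the per-entry 1s into a single count per row, which is why ivan suggests compacting before scanning.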


            On October 20, 2016 at 4:07 PM Yamini Joshi
            <[email protected]> wrote:

            Alright! Do you happen to have some reference code that I
            can refer to? I am a newbie, and I am not sure if by caching,
            aggregating, and merge sorting you mean using some Accumulo
            wrapper or writing simple Java code.

            Best regards,
            Yamini Joshi

            On Thu, Oct 20, 2016 at 2:49 PM, ivan bella
            <[email protected]> wrote:


                 That is essentially the same thing, but instead of
                 doing it within an iterator, you are letting Accumulo
                 do the work! Perfect.

                     On October 20, 2016 at 3:38 PM
                     [email protected] wrote:

                     I am wondering what the complexity would be for
                     this, and also how it compares to creating a new
                     table with the required reversed data and
                     calculating the sum using an iterator.

                     Sent from my iPhone

                     On Oct 20, 2016, at 2:07 PM, ivan bella
                     <[email protected]> wrote:

                          You could cache results in an internal map.
                          Once the number of entries in your map gets to
                          a certain point, you could dump them to a
                          separate file in HDFS and then start building
                          a new map. Once you have completed the
                          underlying scan, do a merge sort and
                          aggregation of the written files to start
                          returning the keys. I did something similar to
                          this and it seems to work well. You might want
                          to use RFiles as the underlying format, which
                          would enable reuse of some Accumulo code when
                          doing the merge sort. It would also allow more
                          efficient reseeking into the RFiles if your
                          iterator gets torn down and reconstructed,
                          provided you detect this and at least avoid
                          redoing the entire scan.
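
Roughly, that spill-and-merge pattern could look like the following untested sketch. It assumes the public RFile API added in Accumulo 1.8; the threshold and HDFS paths are made up:

import java.io.IOException;
import java.util.Map.Entry;
import java.util.TreeMap;

import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.rfile.RFile;
import org.apache.accumulo.core.client.rfile.RFileWriter;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.fs.FileSystem;

public class SpillingAggregator {
  private static final int SPILL_THRESHOLD = 100_000; // made-up limit
  private final TreeMap<Key,Value> buffer = new TreeMap<>(); // sorted for RFile append order
  private int spills = 0;

  // Buffer one entry; real code would combine values for duplicate keys here.
  void add(Key k, Value v, FileSystem fs) throws IOException {
    buffer.put(k, v);
    if (buffer.size() >= SPILL_THRESHOLD) {
      spill(fs);
    }
  }

  // Dump the in-memory map to an RFile in HDFS and start a fresh one.
  void spill(FileSystem fs) throws IOException {
    String file = "/tmp/agg-spill-" + (spills++) + ".rf"; // made-up path
    RFileWriter writer = RFile.newWriter().to(file).withFileSystem(fs).build();
    try {
      for (Entry<Key,Value> e : buffer.entrySet()) {
        writer.append(e.getKey(), e.getValue());
      }
    } finally {
      writer.close();
    }
    buffer.clear();
  }

  // After the underlying scan finishes, a scanner over all spill files
  // presents a single merge-sorted view to aggregate and return from.
  Scanner merged(FileSystem fs) {
    String[] files = new String[spills];
    for (int i = 0; i < spills; i++) {
      files[i] = "/tmp/agg-spill-" + i + ".rf";
    }
    return RFile.newScanner().from(files).withFileSystem(fs).build();
  }
}

RFile.newScanner() over several files presents one merge-sorted view, which is the "reuse of some Accumulo code" ivan mentions; it also supports efficient reseeking if the iterator is torn down and rebuilt.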

                         On October 20, 2016 at 1:22 PM Yamini Joshi
                         <[email protected]> wrote:

                             Hello all

                             I am trying to find the number of times
                             each column family in a set appears across
                             a set of records (irrespective of the
                             rowIds). Is it possible to do this on the
                             server side? My concern is that if the set
                             of column families is huge, it might face
                             memory constraints on the server side. Also,
                             we might need to generate new keys with the
                             column family name as the key and the count
                             as the value.

                             Best regards,
                             Yamini Joshi



