I want to push all the computation to the server. I am using a test DB, but
the DB is huge in the actual dev environment. I am also not sure that
writing to a new table is a good option: it is not a one-time operation, the
counts need to be computed for every query that a user fires with a set of
parameters.

I am back to square one. But I guess if there is no other option, I will
try to benchmark and keep you guys in the loop :)



Best regards,
Yamini Joshi

On Thu, Oct 20, 2016 at 4:22 PM, Josh Elser <josh.el...@gmail.com> wrote:

> I would like to inject some hesitation here. This is getting into what I'd
> call "advanced Accumulo development".
>
> I'd encourage you to benchmark the simple implementation (bring back the
> columns you want to count to the client, and perform the summation there)
> and see if that runs in an acceptable amount of time.
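>
> Roughly, the simple version is just a scan plus a client-side map. A
> minimal sketch (the instance, user, and table names are placeholders, not
> from this thread):
>
>   import java.util.HashMap;
>   import java.util.Map;
>   import org.apache.accumulo.core.client.Connector;
>   import org.apache.accumulo.core.client.Scanner;
>   import org.apache.accumulo.core.client.ZooKeeperInstance;
>   import org.apache.accumulo.core.client.security.tokens.PasswordToken;
>   import org.apache.accumulo.core.data.Key;
>   import org.apache.accumulo.core.data.Value;
>   import org.apache.accumulo.core.security.Authorizations;
>
>   public class CfCounter {
>     public static void main(String[] args) throws Exception {
>       // Connection details below are placeholders.
>       Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
>           .getConnector("user", new PasswordToken("pass"));
>       Map<String, Long> counts = new HashMap<>();
>       Scanner scanner = conn.createScanner("source_table", Authorizations.EMPTY);
>       for (Map.Entry<Key, Value> entry : scanner) {
>         // Tally one hit per entry for its column family, ignoring the rowId.
>         counts.merge(entry.getKey().getColumnFamily().toString(), 1L, Long::sum);
>       }
>       scanner.close();
>       counts.forEach((cf, n) -> System.out.println(cf + " " + n));
>     }
>   }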
>
> Creating a "pivot" table (where you move the column family from your
> source table to the row of a new table) is fairly straightforward to do,
> but you will run into problems in keeping both tables in sync with each
> other. :)
>
> ivan bella wrote:
>
>> I do not have any reference code for you. However, basically you want to
>> write a program that scans one table and creates a new, transformed Key
>> for each entry, which you write as a Mutation to another table. The
>> transformed Key's row would be the column family of the key you pulled
>> from the scan, and the value would be a 1 encoded using one of the
>> encoders in the LongCombiner class. You would create the new table
>> manually in the Accumulo shell and set a SummingCombiner on majc, minc,
>> and scan with the same encoder you used. Run your program, compact the
>> new table, and then scan it.
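>>
>> A rough sketch of that loader program (table names are placeholders and
>> the connector setup is elided; the shell commands in the comment assume
>> the stock SummingCombiner, which prompts for columns and encoding type):
>>
>>   import java.util.Map;
>>   import org.apache.accumulo.core.client.BatchWriter;
>>   import org.apache.accumulo.core.client.BatchWriterConfig;
>>   import org.apache.accumulo.core.client.Scanner;
>>   import org.apache.accumulo.core.data.Key;
>>   import org.apache.accumulo.core.data.Mutation;
>>   import org.apache.accumulo.core.data.Value;
>>   import org.apache.accumulo.core.iterators.LongCombiner;
>>   import org.apache.accumulo.core.security.Authorizations;
>>
>>   // Beforehand, in the shell (answer the prompts with the same STRING
>>   // encoding used below):
>>   //   createtable cf_counts
>>   //   setiterator -t cf_counts -p 10 -scan -minc -majc \
>>   //       -class org.apache.accumulo.core.iterators.user.SummingCombiner
>>   Scanner scanner = connector.createScanner("source_table", Authorizations.EMPTY);
>>   BatchWriter writer = connector.createBatchWriter("cf_counts", new BatchWriterConfig());
>>   Value one = new Value(LongCombiner.STRING_ENCODER.encode(1L));
>>   for (Map.Entry<Key, Value> e : scanner) {
>>     // The transformed key's row is the source key's column family.
>>     Mutation m = new Mutation(e.getKey().getColumnFamily());
>>     m.put("count", "", one);
>>     writer.addMutation(m);
>>   }
>>   writer.close();
>>   scanner.close();
>>
>> Once it has run, a compaction (or any scan) lets the combiner collapse
>> the individual +1 entries into per-column-family totals.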
>>
>>
>> On October 20, 2016 at 4:07 PM Yamini Joshi <yamini.1...@gmail.com> wrote:
>>>
>>> Alright! Do you happen to have some reference code that I can refer
>>> to? I am a newbie and I am not sure if by caching, aggregating and
>>> merge sort you mean to use some Accumulo wrapper or write a simple
>>> java code.
>>>
>>> Best regards,
>>> Yamini Joshi
>>>
>>> On Thu, Oct 20, 2016 at 2:49 PM, ivan bella <i...@ivan.bella.name> wrote:
>>>
>>>
>>>     That is essentially the same thing, but instead of doing it within
>>>     an iterator, you are letting Accumulo do the work! Perfect.
>>>
>>>     On October 20, 2016 at 3:38 PM yamini.1...@gmail.com wrote:
>>>>
>>>>     I am wondering what the complexity would be for this, and also how
>>>>     it compares to creating a new table with the required reversed
>>>>     data and calculating the sum using an iterator.
>>>>
>>>>     Sent from my iPhone
>>>>
>>>>     On Oct 20, 2016, at 2:07 PM, ivan bella <i...@ivan.bella.name> wrote:
>>>>
>>>>>     You could cache results in an internal map. Once the number of
>>>>>     entries in your map gets to a certain point, you could dump them
>>>>>     to a separate file in HDFS and then start building a new map.
>>>>>     Once you have completed the underlying scan, do a merge sort and
>>>>>     aggregation of the written files to start returning the keys. I
>>>>>     did something similar to this and it seems to work well. You
>>>>>     might want to use RFiles as the underlying format, which would
>>>>>     enable reuse of some Accumulo code when doing the merge sort.
>>>>>     It would also allow more efficient reseeking into the RFiles if
>>>>>     your iterator gets torn down and reconstructed, provided you
>>>>>     detect this and at least avoid redoing the entire scan.
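>>>>>
>>>>>     A condensed sketch of the spill half of that pattern, assuming
>>>>>     the public RFile API added in Accumulo 1.8 (the path, threshold,
>>>>>     and class name are placeholders, and the merge sort over the
>>>>>     spilled files is not shown):
>>>>>
>>>>>     import java.io.IOException;
>>>>>     import java.util.Map;
>>>>>     import java.util.TreeMap;
>>>>>     import org.apache.accumulo.core.client.rfile.RFile;
>>>>>     import org.apache.accumulo.core.client.rfile.RFileWriter;
>>>>>     import org.apache.accumulo.core.data.Key;
>>>>>     import org.apache.accumulo.core.data.Value;
>>>>>     import org.apache.hadoop.fs.FileSystem;
>>>>>
>>>>>     class SpillBuffer {
>>>>>       private final TreeMap<Key, Value> buffer = new TreeMap<>();
>>>>>       private final FileSystem fs;
>>>>>       private final int maxEntries;
>>>>>       private int spills = 0;
>>>>>
>>>>>       SpillBuffer(FileSystem fs, int maxEntries) {
>>>>>         this.fs = fs;
>>>>>         this.maxEntries = maxEntries;
>>>>>       }
>>>>>
>>>>>       // Aggregate into the sorted in-memory map; spill when full.
>>>>>       void add(Key k, Value v) throws IOException {
>>>>>         buffer.put(k, v); // real code would combine with an existing value
>>>>>         if (buffer.size() >= maxEntries) {
>>>>>           spill();
>>>>>         }
>>>>>       }
>>>>>
>>>>>       // Dump the sorted map to an RFile and start a fresh map.
>>>>>       private void spill() throws IOException {
>>>>>         String path = "/tmp/agg-spill-" + (spills++) + ".rf";
>>>>>         try (RFileWriter writer =
>>>>>             RFile.newWriter().to(path).withFileSystem(fs).build()) {
>>>>>           writer.startDefaultLocalityGroup();
>>>>>           for (Map.Entry<Key, Value> e : buffer.entrySet()) {
>>>>>             writer.append(e.getKey(), e.getValue());
>>>>>           }
>>>>>         }
>>>>>         buffer.clear();
>>>>>       }
>>>>>     }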
>>>>>
>>>>>>     On October 20, 2016 at 1:22 PM Yamini Joshi
>>>>>>     <yamini.1...@gmail.com> wrote:
>>>>>>
>>>>>>     Hello all
>>>>>>
>>>>>>     I am trying to find the number of times each column family in a
>>>>>>     set appears across a set of records (irrespective of the
>>>>>>     rowIds). Is it possible to do this on the server side? My
>>>>>>     concern is that if the set of column families is huge, it might
>>>>>>     face memory constraints on the server side. Also, we might need
>>>>>>     to generate new keys with the column family name as the key and
>>>>>>     the count as the value.
>>>>>>
>>>>>>     Best regards,
>>>>>>     Yamini Joshi
>>>>>>
>>>>>
>>>
>>>
