Re: Add client complexity or use a coprocessor?
I would look first at how concurrent your coprocessor is in operation. There has been quite a bit of effort to make upserts (increments), and the MemStore in general, efficient at high concurrency.

> Is the table auto-flush option the same as manually batching all the updates?

I think the answer to your question is yes. Setting HTable.setAutoFlush(false) will buffer Puts (only) until the write buffer is full or until a call to HTable.flushCommits(). So that would be like manually batching a bunch of Puts.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

- Original Message -
> From: Tom Brown
> To: user@hbase.apache.org; Andrew Purtell
> Sent: Thursday, April 12, 2012 1:37 AM
> Subject: Re: Add client complexity or use a coprocessor?
>
> Andy,
>
> Is the table auto-flush option the same as manually batching all the updates?
>
> --Tom
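A minimal sketch of the client-side batching Andy describes, against the 0.92-era client API. The table name, buffer size, and method shape are illustrative, not from the thread:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

// Client-side write buffering: Puts accumulate locally and go out
// together when the write buffer fills or flushCommits() is called.
public static void bufferedLoad(Configuration conf, List<Put> puts)
        throws IOException {
    HTable table = new HTable(conf, "cube");        // table name is illustrative
    try {
        table.setAutoFlush(false);                  // stop flushing on every put()
        table.setWriteBufferSize(2 * 1024 * 1024);  // optional tuning (bytes)
        for (Put put : puts) {
            table.put(put);                         // buffered, not yet sent
        }
        table.flushCommits();                       // ship the whole batch
    } finally {
        table.close();
    }
}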
Re: Add client complexity or use a coprocessor?
Assuming there are X records per region, the performance will be determined by the scan time for X records plus the time required to update the bitset counter for X records. I would not be surprised if updating the bitset dominates the cost here, as for each record you may have to generate one or more hash values; this really depends on the implementation of the counting algorithm. But you can optimize this by precomputing the hash value / bit position to be set in the bitset at insertion time itself.

On Wed, Apr 11, 2012 at 11:11 PM, Tom Brown wrote:
> kisalay,
>
> Are you talking about storing all my data in a non-aggregate format,
> and just aggregating as needed? If so, do you have any idea what kind
> of performance I should expect when scanning over 15 million rows to
> summarize the specific cubes I need (using bitfields to estimate
> unique users on each region and merging them together later)?
>
> Or are you suggesting something else?
>
> --Tom
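The merge step kisalay describes is cheap on the client: the union of two probabilistic-counting sketches is just a bitwise OR. A sketch of that merge, assuming each region endpoint returns a byte[] of identical length; the final estimate shown uses linear counting, which is one common variant and may differ from the thread's exact algorithm:

import java.util.List;

public final class BitsetMerge {

    // Union of per-region bitsets == bitwise OR of their bytes.
    public static byte[] or(List<byte[]> perRegion) {
        byte[] merged = null;
        for (byte[] bits : perRegion) {
            if (merged == null) {
                merged = bits.clone();
            } else {
                for (int i = 0; i < merged.length; i++) {
                    merged[i] |= bits[i];
                }
            }
        }
        return merged;
    }

    // Linear-counting estimate: n ~= -m * ln(zeroBits / m).
    // Returns infinity if the sketch is saturated (no zero bits left).
    public static double estimate(byte[] bits) {
        long m = (long) bits.length * 8;
        long zeros = 0;
        for (byte b : bits) {
            zeros += 8 - Integer.bitCount(b & 0xff);
        }
        return -m * Math.log((double) zeros / m);
    }
}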
Re: Add client complexity or use a coprocessor?
kisalay,

Are you talking about storing all my data in a non-aggregate format, and just aggregating as needed? If so, do you have any idea what kind of performance I should expect when scanning over 15 million rows to summarize the specific cubes I need (using bitfields to estimate unique users on each region and merging them together later)?

Or are you suggesting something else?

--Tom

On Tue, Apr 10, 2012 at 11:59 PM, kisalay wrote:
> Tom,
>
> I was way too curious to resist a reply here.
> If you want to store a byte array estimating the unique count for a
> particular OLAP cell, will you not see a lot of updates to the same
> cell and create a hotspot?
Re: Add client complexity or use a coprocessor?
Andy,

Is the table auto-flush option the same as manually batching all the updates?

--Tom

On Tue, Apr 10, 2012 at 5:53 PM, Andrew Purtell wrote:
> You may be seeing a performance difference if you are packing multiple
> Increments into one round trip but not doing a similar kind of batching
> when calling a custom endpoint. Each Endpoint invocation is a round trip
> unless you do something like:
>
> List<Row> actions = new ArrayList<Row>();
> actions.add(new Exec(conf, row, protocol, method, ...));
> actions.add(new Exec(conf, row, protocol, method, ...));
> actions.add(new Exec(conf, row, protocol, method, ...));
> Object[] results = table.batch(actions);
> ...
>
> I've not personally tried that particular API combination but don't see
> why it would not be possible.
Re: Add client complexity or use a coprocessor?
Tom,

I was way too curious to resist a reply here. If you want to store a byte array estimating the unique count for a particular OLAP cell, will you not see a lot of updates to the same cell, and create a hotspot?

Another option comes to mind. I assume that you receive all the user activities. Now consider for a moment that you store each of these activities in your table in HBase, with the row key being udid-timestamp and the value being some blob representing the activity detail. Now if I had to do a unique count of users for an OLAP cell, I would create endpoint coprocessors that execute per region, scan it, and prepare a bitset representing the unique counts for the OLAP cell of interest for that region.

If you are implementing Probabilistic Counting, you can OR the bitsets returned from each region to get the final bitset, which will give you the overall unique counts for all the regions together.

You would not only save on network transfers, as you are doing the counts per region in the coprocessor and returning only one bitset per region; you would also be able to resolve the query in the time taken by one coprocessor to scan one region.

I have a rudimentary implementation of Probabilistic Counting which I once used as a Bolt in Storm (Storm was developed at BackType and open-sourced by Twitter) to count unique users for a similar use case. Let me know if you would like to look at the implementation of the algorithm.

~Kisalay

On Wed, Apr 11, 2012 at 5:23 AM, Andrew Purtell wrote:
>> Even my implementation of an atomic increment
>> (using a coprocessor) is two orders of magnitude slower than the
>> provided implementation. Are there properties inherent to
>> coprocessors or Incrementors that would force this kind of performance
>> difference?
>
> No.
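A sketch of the per-region endpoint kisalay describes, against the 0.92-era coprocessor API. The protocol name, method, sketch width, and hash choice are all assumptions for illustration; hashing the full row key stands in for hashing the udid prefix:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.coprocessor.BaseEndpointCoprocessor;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.ipc.CoprocessorProtocol;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical protocol; the name and method are illustrative only.
interface UniqueCountProtocol extends CoprocessorProtocol {
    byte[] regionBitset() throws IOException;
}

public class UniqueCountEndpoint extends BaseEndpointCoprocessor
        implements UniqueCountProtocol {

    private static final int BITS = 1 << 16;  // 64K-bit sketch per region

    public byte[] regionBitset() throws IOException {
        byte[] bits = new byte[BITS / 8];
        RegionCoprocessorEnvironment env =
                (RegionCoprocessorEnvironment) getEnvironment();
        // Scan only this region's rows; runs region-side, so only the
        // finished bitset crosses the network.
        InternalScanner scanner = env.getRegion().getScanner(new Scan());
        try {
            List<KeyValue> kvs = new ArrayList<KeyValue>();
            boolean more;
            do {
                kvs.clear();
                more = scanner.next(kvs);
                for (KeyValue kv : kvs) {
                    int bit = (Bytes.hashCode(kv.getRow()) & 0x7fffffff) % BITS;
                    bits[bit >>> 3] |= (1 << (bit & 7));  // set the bit
                }
            } while (more);
        } finally {
            scanner.close();
        }
        return bits;
    }
}

The client would invoke this per region with HTable.coprocessorExec() over the table's key range and OR the returned arrays together.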
Re: Add client complexity or use a coprocessor?
> Even my implementation of an atomic increment
> (using a coprocessor) is two orders of magnitude slower than the
> provided implementation. Are there properties inherent to
> coprocessors or Incrementors that would force this kind of performance
> difference?

No.

You may be seeing a performance difference if you are packing multiple Increments into one round trip but not doing a similar kind of batching when calling a custom endpoint. Each Endpoint invocation is a round trip unless you do something like:

List<Row> actions = new ArrayList<Row>();
actions.add(new Exec(conf, row, protocol, method, ...));
actions.add(new Exec(conf, row, protocol, method, ...));
actions.add(new Exec(conf, row, protocol, method, ...));
Object[] results = table.batch(actions);
...

I've not personally tried that particular API combination but don't see why it would not be possible.

Beyond that, I'd suggest running a regionserver with your coprocessor installed under a profiler to see if you have monitor contention or a hotspot or similar. It could be something unexpected.

> Can you think of an efficient way to implement an atomic bitfield
> (other than adding it as a separate feature like atomic increments)?

I think the idea of an atomic bitfield operation as part of the core API is intriguing. It has applicability to your estimator use case, and I can think of a couple of things I could use it for. If there is more support for this idea, this may be something to consider.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

- Original Message -
> From: Tom Brown
> To: user@hbase.apache.org; Andrew Purtell
> Sent: Tuesday, April 10, 2012 3:53 PM
> Subject: Re: Add client complexity or use a coprocessor?
>
> Andy,
>
> I have attempted to use coprocessors to achieve passable performance
> but have failed so far. Even my implementation of an atomic increment
> (using a coprocessor) is two orders of magnitude slower than the
> provided implementation. Are there properties inherent to
> coprocessors or Incrementors that would force this kind of performance
> difference?
>
> Can you think of an efficient way to implement an atomic bitfield
> (other than adding it as a separate feature like atomic increments)?
>
> Thanks!
>
> --Tom
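Fleshing out Andy's fragment under his own caveat that the combination is untried: Exec implements Row in the 0.92-era API, so one Exec per row can ride in a single batch() round trip. The ViewProtocol interface and recordView method are hypothetical stand-ins, not a real HBase API:

import java.io.IOException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.client.coprocessor.Exec;
import org.apache.hadoop.hbase.ipc.CoprocessorProtocol;

// Hypothetical endpoint protocol for illustration only.
interface ViewProtocol extends CoprocessorProtocol {
    void recordView(byte[] udid) throws IOException;
}

// One Exec per row, all shipped together instead of one RPC each.
public static Object[] batchedEndpointCalls(Configuration conf,
        HTable table, List<byte[]> rows, byte[] udid) throws Exception {
    Method m = ViewProtocol.class.getMethod("recordView", byte[].class);
    List<Row> actions = new ArrayList<Row>();
    for (byte[] row : rows) {
        actions.add(new Exec(conf, row, ViewProtocol.class, m, udid));
    }
    return table.batch(actions);  // one round trip for all invocations
}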
Re: Add client complexity or use a coprocessor?
Andy,

I have attempted to use coprocessors to achieve passable performance but have failed so far. Even my implementation of an atomic increment (using a coprocessor) is two orders of magnitude slower than the provided implementation. Are there properties inherent to coprocessors or Incrementors that would force this kind of performance difference?

Can you think of an efficient way to implement an atomic bitfield (other than adding it as a separate feature like atomic increments)?

Thanks!

--Tom

On Tue, Apr 10, 2012 at 12:01 PM, Andrew Purtell wrote:
> Tom,
>
>> I created an RPC endpoint coprocessor to perform this function but
>> performance suffered heavily under load (it appears that the endpoint
>> performs all functions in serial).
>
> Did you serialize access to your data structure(s)?
Re: Add client complexity or use a coprocessor?
On Tue, Apr 10, 2012 at 9:19 AM, Tom Brown wrote:
> Jacques,
>
> The technique I've been trying to use is similar to a bloom filter
> (except that it's more space efficient).

Got it. I didn't realize.

> It's my understanding that
> bloom filters in HBase are only implemented in the context of finding
> individual columns (for improving read performance). Are there
> specific bloom operations I can use atomically on a specific cell?

Your understanding is correct. My statement was about using the data structure as a compressed version of a duplication filter, not any HBase feature.
Re: Add client complexity or use a coprocessor?
Tom,

> I am a big fan of the Increment class. Unfortunately, I'm not doing
> simple increments for the viewer count. I will be receiving duplicate
> messages from a particular client for a specific cube cell, and don't
> want them to be counted twice

Gotcha.

> I created an RPC endpoint coprocessor to perform this function but
> performance suffered heavily under load (it appears that the endpoint
> performs all functions in serial).

Did you serialize access to your data structure(s)?

> When I tried implementing it as a region observer, I was unsure of how
> to correctly replace the provided "put" with my own. When I issued a
> put from within "prePut", the server blocked the new put (waiting for
> the "prePut" to finish). Should I be attempting to modify the WALEdit
> object?

You can add KVs to the WALEdit. Or, you can get a reference to the Put's familyMap:

Map<byte[], List<KeyValue>> familyMap = put.getFamilyMap();

and if you modify the map, you'll change what gets committed.

> Is there a way to extend the functionality of "Increment" to provide
> arbitrary bitwise operations on the contents of a field?

As a matter of design, this should be a new operation. It does sound interesting and useful, some sort of atomic bitfield.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
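A sketch of the familyMap approach Andy describes, against the 0.92-era RegionObserver API: rewrite each incoming value in place inside prePut instead of issuing a second (blocking) put, so the raw user ID is never stored. The class name is illustrative, and MD5 merely stands in for whatever computed value Tom intends:

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

public class UdidRewriteObserver extends BaseRegionObserver {
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> c,
            Put put, WALEdit edit, boolean writeToWAL) throws IOException {
        Map<byte[], List<KeyValue>> familyMap = put.getFamilyMap();
        for (List<KeyValue> kvs : familyMap.values()) {
            for (int i = 0; i < kvs.size(); i++) {
                KeyValue kv = kvs.get(i);
                // Replace the raw udid value with its hash, keeping the
                // row, family, qualifier, and timestamp unchanged.
                kvs.set(i, new KeyValue(kv.getRow(), kv.getFamily(),
                        kv.getQualifier(), kv.getTimestamp(),
                        digest(kv.getValue())));
            }
        }
        // Mutating the familyMap changes what gets committed (per Andy);
        // no second put() is needed from inside the hook.
    }

    private static byte[] digest(byte[] value) throws IOException {
        try {
            return MessageDigest.getInstance("MD5").digest(value);
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }
}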
Re: Add client complexity or use a coprocessor?
Jacques,

The technique I've been trying to use is similar to a bloom filter (except that it's more space efficient). It's my understanding that bloom filters in HBase are only implemented in the context of finding individual columns (for improving read performance). Are there specific bloom operations I can use atomically on a specific cell?

Thanks!

--Tom

On Tue, Apr 10, 2012 at 12:01 AM, Jacques wrote:
> What about maintaining a bloom filter in addition to an increment to
> minimize double counting? You couldn't do atomic without some custom
> work, but it would get you mostly there. If you wanted to be fancy you
> could actually maintain the bloom as a bunch of separate columns to
> avoid update contention.
Re: Add client complexity or use a coprocessor?
What about maintaining a bloom filter in addition to an increment to minimize double counting? You couldn't do atomic without some custom work, but it would get you mostly there. If you wanted to be fancy, you could actually maintain the bloom as a bunch of separate columns to avoid update contention.

On Apr 9, 2012 10:14 PM, "Tom Brown" wrote:
> Andy,
>
> I am a big fan of the Increment class. Unfortunately, I'm not doing
> simple increments for the viewer count. I will be receiving duplicate
> messages from a particular client for a specific cube cell, and don't
> want them to be counted twice (my stats don't have to be 100%
> accurate, but the expected rate of duplicates will be higher than the
> allowable error rate).
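A sketch of Jacques' "bloom as a bunch of separate columns" idea: each of a udid's k bloom bits lives in its own column qualifier, so concurrent writers rarely touch the same cell. The family name, 4-hash/1M-bit sizing, and MD5-derived hashing are all assumptions for illustration:

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public final class ColumnBloom {
    private static final byte[] BLOOM_CF = Bytes.toBytes("bloom");
    private static final byte[] ONE = new byte[] { 1 };
    private static final int M_BITS = 1 << 20;  // filter width in bits
    private static final int K = 4;             // hash functions

    // Sets this udid's k bloom bits for one cube cell; one column per bit.
    public static void record(HTable table, byte[] cubeRow, byte[] udid)
            throws IOException {
        for (int bit : bits(udid)) {
            Put put = new Put(cubeRow);
            put.add(BLOOM_CF, Bytes.toBytes(bit), ONE);
            table.put(put);
        }
    }

    // Derives K bit positions from the udid's MD5 (16 bytes = four ints).
    private static int[] bits(byte[] udid) throws IOException {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(udid);
            int[] bits = new int[K];
            for (int i = 0; i < K; i++) {
                bits[i] = (Bytes.toInt(d, i * 4) & 0x7fffffff) % M_BITS;
            }
            return bits;
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }
}

Reading those k qualifiers back with a Get tells you whether a udid was probably already counted; as Jacques notes, the check-then-increment pair still isn't atomic without custom work.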
Re: Add client complexity or use a coprocessor?
Andy,

I am a big fan of the Increment class. Unfortunately, I'm not doing simple increments for the viewer count. I will be receiving duplicate messages from a particular client for a specific cube cell, and don't want them to be counted twice (my stats don't have to be 100% accurate, but the expected rate of duplicates will be higher than the allowable error rate).

I created an RPC endpoint coprocessor to perform this function, but performance suffered heavily under load (it appears that the endpoint performs all functions in serial).

When I tried implementing it as a region observer, I was unsure of how to correctly replace the provided "put" with my own. When I issued a put from within "prePut", the server blocked the new put (waiting for the "prePut" to finish). Should I be attempting to modify the WALEdit object?

Is there a way to extend the functionality of "Increment" to provide arbitrary bitwise operations on the contents of a field?

Thanks again!

--Tom

> If it helps, yes this is possible:
>
>> Can I observe updates to a
>> particular table and replace the provided data with my own? (The
>> client calls "put" with the actual user ID, my co-processor replaces
>> it with a computed value, so the actual user ID never gets stored in
>> HBase.)
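Absent the atomic bitfield operation Tom asks for, one way to get an atomic OR with the stock client API is an optimistic read-modify-write loop around checkAndPut. A sketch, assuming a fixed-width bitfield so the stored value and the mask are equal length; under heavy contention on one cell this retries often, which is exactly why a first-class operation is attractive:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;

// Atomic OR via compare-and-swap: retry until no concurrent writer
// changed the cell between our read and our write.
public static byte[] atomicOr(HTable table, byte[] row, byte[] cf,
        byte[] qual, byte[] mask) throws IOException {
    while (true) {
        Result r = table.get(new Get(row).addColumn(cf, qual));
        byte[] current = r.getValue(cf, qual);
        byte[] next = (current == null) ? mask.clone() : current.clone();
        if (current != null) {
            for (int i = 0; i < next.length; i++) {
                next[i] |= mask[i];  // OR the new bits into the old value
            }
        }
        Put put = new Put(row);
        put.add(cf, qual, next);
        // Succeeds only if the cell still holds `current`; a null
        // `current` means "only if the cell doesn't exist yet".
        if (table.checkAndPut(row, cf, qual, current, put)) {
            return next;
        }
    }
}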
Re: Add client complexity or use a coprocessor?
If it helps, yes this is possible:

> Can I observe updates to a
> particular table and replace the provided data with my own? (The
> client calls "put" with the actual user ID, my co-processor replaces
> it with a computed value, so the actual user ID never gets stored in
> HBase.)

Since your option #2 requires atomic updates to the data structure, have you considered native atomic increments? See

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue%28byte[],%20byte[],%20byte[],%20long,%20boolean%29

or

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Increment.html

The former is a round trip for each value update. The latter allows you to pack multiple updates into a single round trip. This would give you accurate counts even with concurrent writers.

It should be possible for you to do partial aggregation on the client side too, whenever parallel requests colocate multiple updates to the same cube within some small window of time.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

- Original Message -
> From: Tom Brown
> To: user@hbase.apache.org
> Sent: Monday, April 9, 2012 9:48 AM
> Subject: Add client complexity or use a coprocessor?
>
> To whom it may concern,
>
> Ignoring the complexities of gathering the data, assume that I will be
> tracking millions of unique viewers. Updates from each of our millions
> of clients are gathered in a centralized platform and spread among a
> group of machines for processing and inserting into HBase (assume that
> this group can be scaled horizontally). The data is stored in an OLAP
> cube format, and one of the metrics I'm tracking across various
> attributes is viewership (how many people from Y are watching X).
>
> I'm writing this to ask for your thoughts as to the most appropriate
> way to structure my data so I can count unique TV viewers (assume a
> service like Netflix or Hulu).
>
> Here are the solutions I'm considering:
>
> 1. Store each unique user ID as the cell name within the cube(s) where
> it occurs. This has the advantage of 100% accuracy, but the downside is
> the enormous space required to store each unique cell. Consuming this
> data is also problematic, as the only way to provide a viewership count
> is by counting each cell. To save the overhead of sending each cell
> over the network, counting could be done by a coprocessor on the region
> server, but that still doesn't avoid the overhead of reading each cell
> from disk. I'm also not sure what happens if a single row is larger
> than an entire region (48 bytes per user ID * 10,000,000 users = 480MB).
>
> 2. Store a byte array that allows estimating unique viewers (with a
> small margin of error*). Add a co-processor for updating this column so
> I can guarantee that updates to a specific OLAP cell will be atomic.
> The main benefit of this path is that the nodes that update HBase can
> be less complex. Another benefit is that I can just add more HBase
> regions as scale requires. However, I'm not sure I can use a
> coprocessor the way I want: can I observe updates to a particular table
> and replace the provided data with my own? (The client calls "put" with
> the actual user ID, my co-processor replaces it with a computed value,
> so the actual user ID never gets stored in HBase.)
>
> 3. Store a byte array that allows estimating unique viewers (with a
> small margin of error*). Re-arrange my architecture so that each OLAP
> cell is only updated by a single node. The main benefit of this would
> be that I don't need to worry about atomic operations in HBase, since
> all updates for a single cell will be atomic and in serial. The biggest
> downside is that I believe it will add significant complexity to my
> overall architecture.
>
> Thanks for your time, and I look forward to hearing your thoughts.
>
> Sincerely,
> Tom Brown
>
> *(For information about the byte array mentioned in #2 and #3, see:
> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html)
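A minimal sketch of the packing Andy contrasts with incrementColumnValue: a single Increment can carry many column updates for one row and arrives in one round trip. Row and column names are illustrative:

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Several counters for one OLAP cell, updated atomically in one RPC.
public static Result bumpCell(HTable table, byte[] cubeRow)
        throws IOException {
    Increment inc = new Increment(cubeRow);
    inc.addColumn(Bytes.toBytes("metrics"), Bytes.toBytes("views"), 1L);
    inc.addColumn(Bytes.toBytes("metrics"), Bytes.toBytes("seconds"), 30L);
    return table.increment(inc);  // all columns in a single round trip
}

Note that all columns in one Increment must belong to the same row; batching across rows would need separate Increments (or table.batch).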