Data aggregation -- help me design a solution

2012-08-21 Thread Oleg Dulin

Here are my requirements.

We use Cassandra.

I get millions of invoice line items into the system. As I load them I 
need to build up some data structures.


* Invoice line items by invoice id (each line item has an invoice id on 
it), with total dollar value
* Invoice line items by customer id, with total dollar value
* Invoice line items by territory, with total dollar value

In all of those cases, what we want is to see the total by a given 
attribute; that's all there is to it.


Line items may change daily, e.g. a territory may change or the values 
may be corrected. In this case I need to update the aggregations 
accordingly.


Here are my ideas:

- I can use counters and store the data in buckets
- I can just store the data in buckets and do the math in Java

In both cases the challenge is that the items can be updated, which 
means I need to look up the current version of an item and decide how to 
proceed. That puts a huge performance penalty on the application (the 
number of line items we receive is in the millions, and we need to 
process them in a timely fashion).
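
To make the "look up the current version and decide how to proceed" step concrete, here is a minimal Java sketch of the arithmetic only. It assumes the previous version of the item has already been fetched (or is null for a new item) and computes the net change per aggregate. The LineItem and AggregateDeltas classes, the "invoice:"/"customer:"/"territory:" key prefixes, and the use of integer cents are all assumptions made for illustration, not part of any existing schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical line item; field names are illustrative, not from the thread.
final class LineItem {
    final String invoiceId;
    final String customerId;
    final String territory;
    final long amountCents; // dollar value held as integer cents (an assumption)

    LineItem(String invoiceId, String customerId, String territory, long amountCents) {
        this.invoiceId = invoiceId;
        this.customerId = customerId;
        this.territory = territory;
        this.amountCents = amountCents;
    }
}

final class AggregateDeltas {
    private AggregateDeltas() {}

    // Net change per aggregate key when 'previous' is replaced by 'current'.
    // 'previous' is null for a line item that has never been seen before.
    static Map<String, Long> between(LineItem previous, LineItem current) {
        Map<String, Long> net = new LinkedHashMap<>();
        if (previous != null) {
            apply(net, previous, -previous.amountCents); // back out the old value
        }
        apply(net, current, current.amountCents);        // add the new value
        net.values().removeIf(v -> v == 0L);             // unchanged totals need no write
        return net;
    }

    private static void apply(Map<String, Long> net, LineItem item, long cents) {
        net.merge("invoice:" + item.invoiceId, cents, Long::sum);
        net.merge("customer:" + item.customerId, cents, Long::sum);
        net.merge("territory:" + item.territory, cents, Long::sum);
    }
}
```

With this shape, an item that moves from one territory to another yields a negative delta for the old territory and a positive one for the new, while an unchanged item yields an empty map and therefore no writes.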


Help me out here -- any ideas on how I could design this in Cassandra?


Regards,
Oleg




Re: Data aggregation -- help me design a solution

2012-08-21 Thread Milind Parikh
1. Assuming that the majority of the line items are new, and

2. that the lookup of an existing line item will dictate the performance of
the system, because reads are slower than writes in C*, and

3. assuming that you are using counters in C*:

Eliminate that problem by implementing a Bloom filter or similar
structure (e.g. a stable Bloom filter) to figure out whether you actually
need to go to C* at all to read an existing line item.

If you do need to go to C* for reads, handle that event (receiving a
line item that already exists) in a separate set of threads, decrementing
the chosen counters by the previous value of the invoice line item before
the new value is added.
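
A rough Java sketch of the filter step might look like the following. It is a plain Bloom filter, not the stable variant mentioned above, and the class name SeenLineItems, the sizing parameters, and the FNV-1a second hash are all assumptions chosen for illustration. The property that matters is that a "false" answer means the id has definitely not been recorded, so the read can be skipped, while a "true" answer only means "possibly seen" and still needs the read.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Minimal Bloom filter over line item ids (hypothetical helper, not a C* API).
final class SeenLineItems {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    SeenLineItems(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Records the id and reports whether it may have been recorded before.
    // false: definitely new, no read needed. true: possibly seen, do the read.
    synchronized boolean mightContainThenAdd(String lineItemId) {
        int h1 = lineItemId.hashCode();
        int h2 = fnv1a(lineItemId) | 1; // odd step for double hashing
        boolean seen = true;
        for (int i = 0; i < hashes; i++) {
            int idx = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(idx)) {
                seen = false;
                bits.set(idx);
            }
        }
        return seen;
    }

    // 32-bit FNV-1a, used here only as a second, independent-ish hash.
    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }
}
```

A loader could call mightContainThenAdd for every incoming item, write the counters straight away when it returns false, and hand the item to the separate reader threads when it returns true. The filter lives in memory here, so it would have to be rebuilt or persisted across restarts.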


HTH
Regards
Milind








Re: Data aggregation -- help me design a solution

2012-08-21 Thread Guillermo Winkler
Oleg,

If you have the aggregates in counters you only need to read the current
counter when adding/removing invoice lines.

In this situation you only need to be sure that this sequence:

+ Read current counter value
+ Update current value according to newly created/updated lines

is done safely, so that concurrent updates don't mess up the current
counter.

Assuming you don't need to have the counters updated in real time, you can
also batch the counter updates in Java/Redis/whatever and do the updates in
C* less often.
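
As a sketch of the batching idea in plain Java: deltas are summed per bucket in memory and only the net amounts are handed on at flush time. CounterBatcher and its sink callback are hypothetical names; the sink stands in for whatever actually issues the counter increment in C*, which is not shown here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Accumulates per-bucket deltas in memory (hypothetical helper, not a C* API).
final class CounterBatcher {
    private Map<String, Long> pending = new HashMap<>();

    // Adds a delta for a bucket such as "territory:EAST"; safe to call from many threads.
    synchronized void add(String bucket, long deltaCents) {
        pending.merge(bucket, deltaCents, Long::sum);
    }

    // Swaps out the pending map and passes each non-zero net delta to the sink.
    // The sink is an assumption: it represents one counter update per bucket.
    void flush(BiConsumer<String, Long> sink) {
        Map<String, Long> batch;
        synchronized (this) {
            batch = pending;
            pending = new HashMap<>();
        }
        batch.forEach((bucket, delta) -> {
            if (delta != 0L) {
                sink.accept(bucket, delta);
            }
        });
    }
}
```

Calling flush from a scheduled task every few seconds would turn many small updates to the same bucket into one write. The trade-off is that deltas held in memory are lost if the process dies before a flush, so this only fits if that window is acceptable or the input can be replayed.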

Best,
Guille
