Data aggregation -- help me design a solution
Here are my requirements. We use Cassandra. I get millions of invoice line items into the system. As I load them I need to build up some data structures:

* Invoice line items by invoice id (each line item has an invoice id on it), with total dollar value
* Invoice line items by customer id, with total dollar value
* Invoice line items by territory, with total dollar value

In all of those cases, all we want is to see the total by a given attribute.

Line items may change daily, i.e. a territory may change or the values may be corrected. In this case I need to update the aggregations accordingly.

Here are my ideas:

- I can use counters and store the data in buckets
- I can just store the data in buckets and do the math in Java

In both cases the challenge is that the items can be updated, which means I need to look up the current version of an item and decide how to proceed. That puts a huge performance penalty on the application (the number of line items we receive is in the millions and we need to process them in a timely fashion).

Help me out here -- any ideas on how I could design this in Cassandra?

Regards,
Oleg
Re: Data aggregation -- help me design a solution
Assuming that:

1. the majority of the line items are new,
2. the lookup of an existing line item dictates the performance of the system, because reads are slower than writes in C*, and
3. you are using counters in C*,

you can eliminate that problem by implementing a Bloom filter or a similar structure (e.g. a stable Bloom filter) to figure out whether you need to go to C* at all to read an existing line item.

If you do need to go to C* for reads, handle that event (receiving a line item that already exists) in a separate set of threads, decrementing the chosen counters by the previous value of the invoice line item.

HTH

Regards,
Milind

On Tue, Aug 21, 2012 at 1:08 PM, Oleg Dulin oleg.du...@gmail.com wrote:
> Here are my requirements. We use Cassandra. I get millions of invoice line items into the system. [...]
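
To make the Bloom filter suggestion concrete, here is a minimal sketch in plain Java. It assumes line items carry a unique string id; the class names `LineItemBloomFilter` and `LineItemRouter`, the filter size, and the hash mixing are all hypothetical choices for illustration, not part of Cassandra or any library. A `false` answer means the id has definitely not been seen, so the read can be skipped; a `true` answer means it may have been seen, so the item goes to the slower read path.

```java
import java.util.BitSet;

/** Minimal Bloom filter over line-item ids (hypothetical helper, not a Cassandra API). */
final class LineItemBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    LineItemBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Double hashing: the i-th probe is h1 + i * h2, reduced modulo the filter size.
    private int index(String id, int i) {
        int h1 = id.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1; // forced odd, so never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String id) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(id, i));
        }
    }

    /** false = definitely never added; true = possibly added (false positives can occur). */
    boolean mightContain(String id) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(id, i))) {
                return false;
            }
        }
        return true;
    }
}

/** Decides whether an incoming line item needs a read from C* before the counters are touched. */
final class LineItemRouter {
    // Size and hash count are illustrative assumptions, not tuned values.
    private final LineItemBloomFilter seen = new LineItemBloomFilter(1 << 20, 4);

    /** Returns true when the item may already exist and should take the slow read path. */
    boolean needsRead(String lineItemId) {
        boolean possiblySeen = seen.mightContain(lineItemId);
        seen.add(lineItemId);
        return possiblySeen;
    }
}
```

Items for which `needsRead` returns `false` can go straight to the write path; the rest would be handed to the separate thread pool that reads the previous version and decrements the old counters. This filter lives only in memory, so it would have to be rebuilt or persisted across restarts, and a plain Bloom filter never forgets ids, which is where a stable Bloom filter could come in.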
Re: Data aggregation -- help me design a solution
Oleg,

If you keep the aggregates in counters, you only need to read the current counter when adding or removing invoice lines. In that case you only need to make sure that this sequence:

+ Read the current counter value
+ Update the current value according to the newly created/updated lines

is done safely, so that concurrent updates do not corrupt the counter.

Assuming you don't need the counters updated in real time, you can also batch the counter updates in Java/Redis/whatever and apply them to C* less often.

Best,
Guille

On Tue, Aug 21, 2012 at 5:08 PM, Oleg Dulin oleg.du...@gmail.com wrote:
> Here are my requirements. We use Cassandra. I get millions of invoice line items into the system. [...]
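
As a rough illustration of the batching idea, the sketch below accumulates counter deltas in memory and hands them out in one batch. Everything here is an assumption made for the example: the `LineItem` and `CounterDeltaBatcher` classes, the `invoice:`/`customer:`/`territory:` key format, and storing amounts as whole cents (Cassandra counters hold 64-bit integers, so fractional dollars would not fit directly). The actual counter writes to C* are deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical line item; amounts are kept in cents so they fit an integer counter. */
final class LineItem {
    final String invoiceId;
    final String customerId;
    final String territory;
    final long amountCents;

    LineItem(String invoiceId, String customerId, String territory, long amountCents) {
        this.invoiceId = invoiceId;
        this.customerId = customerId;
        this.territory = territory;
        this.amountCents = amountCents;
    }
}

/** Accumulates per-aggregate deltas in memory so counters can be updated less often. */
final class CounterDeltaBatcher {
    private final ConcurrentHashMap<String, Long> pending = new ConcurrentHashMap<>();

    /** previous is null for a brand-new item; otherwise its old contribution is backed out first. */
    void apply(LineItem previous, LineItem current) {
        if (previous != null) {
            addToAll(previous, -previous.amountCents);
        }
        addToAll(current, current.amountCents);
    }

    private void addToAll(LineItem item, long delta) {
        // merge() is atomic per key, so concurrent loaders do not lose updates.
        pending.merge("invoice:" + item.invoiceId, delta, Long::sum);
        pending.merge("customer:" + item.customerId, delta, Long::sum);
        pending.merge("territory:" + item.territory, delta, Long::sum);
    }

    /** Removes and returns the accumulated non-zero deltas; the caller applies them to C*. */
    Map<String, Long> drain() {
        Map<String, Long> batch = new HashMap<>();
        for (String key : pending.keySet()) {
            Long delta = pending.remove(key); // atomic: a concurrent merge starts a fresh entry
            if (delta != null && delta != 0L) {
                batch.put(key, delta);
            }
        }
        return batch;
    }
}
```

A scheduled task could call `drain()` every few seconds and issue one counter increment per entry. Because only deltas are kept, a correction that moves an item from one territory to another shows up as a negative delta on the old key and a positive one on the new key, and changes that cancel out within a batch never reach C* at all.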