Data aggregation -- help me design a solution

2012-08-21 Thread Oleg Dulin
Here are my requirements. We use Cassandra. I get millions of invoice line items into the system. As I load them I need to build up some data structures. * Invoice line items by invoice id (each line item has an invoice id on it ), with total dollar value * Invoice line items by customer
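A minimal in-memory sketch of the two structures described above, for illustration only. It does not use Cassandra, and the field names `invoice_id`, `customer_id` and `amount` are hypothetical, as is the `aggregate` helper. It groups line items by invoice while keeping a running dollar total, and also groups them by customer.

```python
from collections import defaultdict
from decimal import Decimal


def aggregate(line_items):
    """Group line items by invoice (with a running total) and by customer."""
    by_invoice = defaultdict(lambda: {"items": [], "total": Decimal("0")})
    by_customer = defaultdict(list)
    for item in line_items:
        bucket = by_invoice[item["invoice_id"]]
        bucket["items"].append(item)
        bucket["total"] += item["amount"]  # Decimal avoids float rounding on money
        by_customer[item["customer_id"]].append(item)
    return by_invoice, by_customer


# Hypothetical sample data.
sample_items = [
    {"invoice_id": "inv-1", "customer_id": "cust-a", "amount": Decimal("10.00")},
    {"invoice_id": "inv-1", "customer_id": "cust-a", "amount": Decimal("5.50")},
    {"invoice_id": "inv-2", "customer_id": "cust-b", "amount": Decimal("7.25")},
]
by_invoice, by_customer = aggregate(sample_items)
```

In a real load the same per-item update would be applied to whatever store holds the aggregates; the thread replies discuss counters for that purpose.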

Re: Data aggregation -- help me design a solution

2012-08-21 Thread Milind Parikh
1. Assuming that the majority of the line items are new and 2. The lookup of an existing line-item will dictate the performance of the system because reads are slower than writes in C*. 3. Assuming that you are using counters in C* Therefore eliminate that problem by implementing a bloom

Re: Data aggregation -- help me design a solution

2012-08-21 Thread Guillermo Winkler
Oleg, If you have the aggregates in counters you only need to read the current counter when adding/removing invoice lines. In this situation you only need to be sure this sequence: + Read current counter value + Update current value according to newly created/updated lines Is done safely to

OutOfMemory log in Cassandra

2012-08-21 Thread Xu Renjie
Hi, all I have a problem about the log. I have set the CASSANDRA_HEAPDUMP_DIR in the cassandra-env.in file to some path to store the dump file when an OutOfMemory exception happens. But after out of memory happens (I judge it from the /var/log/messages which says kernel: out of memory: kill process

Re: OutOfMemory log in Cassandra

2012-08-21 Thread Xu Renjie
BTW, I have checked the jvm by jps that the options are correctly added. On Wed, Aug 22, 2012 at 11:09 AM, Xu Renjie xrjxrjxrj...@gmail.com wrote: Hi, all I have a problem about the log. I have set the CASSANDRA_HEAPDUMP_DIR in the cassandra-env.in file to some path to store the dump file

Re: OutOfMemory log in Cassandra

2012-08-21 Thread Guillermo Winkler
Xu, what's your configuration? How many CF, how much data (size/rows/cols), how many clients operations/sec and how much memory assigned for the heap? Guille On Wed, Aug 22, 2012 at 12:09 AM, Xu Renjie xrjxrjxrj...@gmail.com wrote: Hi, all I have a problem about the log. I have set the

Re: How to add secondary index to existing column family with CLI?

2012-08-21 Thread aaron morton
The column name must be valid according to the type specified for the comparator. cannot parse ‘title’ as hex bytes. Looks like you don't have a comparator type, so it defaulted to bytes. You can either change the comparator *IF* all column names are strings or get the

Re: OutOfMemory log in Cassandra

2012-08-21 Thread Xu Renjie
Guille, Thanks for your reply. I think I have found the problem. I guess it is because the machine's memory is used up, not the JVM heap (I use a micro EC2 instance just for toy use). It seems that I set the Cassandra heap too high to keep GC from affecting my performance, but the total memory is not enough.

Re: Heap size question

2012-08-21 Thread aaron morton
How do I know if my off-heap memory is not used? If you are using the default memory mapped file access, memory not used by the Cassandra JVM will be used to cache files. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 22/08/2012, at

Re: Why the StageManager thread pools have 60 seconds keepalive time?

2012-08-21 Thread aaron morton
One thing we did change in the past weeks was the memtable_flush_queue_size in order to occupy less heap space with memtables, this was due to having received this warning message and some OOM exceptions: Danger. Do you know any strategy to diagnose if memtables flushing to disk and

Re: OutOfMemory log in Cassandra

2012-08-21 Thread aaron morton
CASSANDRA_HEAPDUMP_DIR Is for JVM out of memory. You were seeing the OS kill the JVM because of low os memory. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 22/08/2012, at 4:28 PM, Xu Renjie xrjxrjxrj...@gmail.com wrote: Guille,

Fwd: Problem while configuring key and row cache?

2012-08-21 Thread Amit Handa
Hi, Thanks Jonathan for your reply. I modified the key_cache_size_in_mb and row_cache_size_in_mb values inside cassandra.yaml, but I am not able to see their effect using the command *./nodetool -h 107.108.189.212 cfstats*. Can you let me know how to verify that the setting for key_cache_size and

Re: Secondary index and/or row key in the read path ?

2012-08-21 Thread aaron morton
- do we need to post-process (filter) the result of the query in our application ? That's the one :) Right now the code paths don't exist to select a row using a row key *and* apply a column level filter. The RPC API does not work that way and I'm not sure if this is something that is planned

Re: Cassandra with large number of columns per row

2012-08-21 Thread aaron morton
I think the limit of the size per row in cassandra is 2G? That was a pre 0.7 restriction http://wiki.apache.org/cassandra/CassandraLimitations and I insert 10000 columns into a row, each column has 1MB of data. So a single row with 10GB of data. That's what we call a big one.

Re: Cassandra with large number of columns per row

2012-08-21 Thread Chuan-Heng Hsiao
Thank you very much! That also cleared my erroneous understanding of the size limitation before. Hsiao On Tue, Aug 21, 2012 at 5:03 PM, aaron morton aa...@thelastpickle.com wrote: I think the limit of the size per row in cassandra is 2G? That was a pre 0.7 restriction

Heap size question

2012-08-21 Thread Tamar Fraenkel
Hi! I have a question regarding Cassandra heap size. Cassandra calculates heap size in cassandra-env.sh according to the following algorithm # set max heap size based on the following # max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB)) # calculate 1/2 ram and cap to 1024MB # calculate
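As a worked illustration of the formula quoted above, max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB)), here is a small sketch that evaluates it for a few memory sizes. The function name `default_max_heap_mb` is hypothetical, and integer division is an assumption; it is not a copy of the script in cassandra-env.sh.

```python
def default_max_heap_mb(system_memory_mb: int) -> int:
    """Evaluate max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB)) in megabytes."""
    half_capped = min(system_memory_mb // 2, 1024)      # 1/2 ram, capped at 1024MB
    quarter_capped = min(system_memory_mb // 4, 8192)   # 1/4 ram, capped at 8GB
    return max(half_capped, quarter_capped)


# Heap the formula would pick for machines with 2GB, 16GB and 64GB of RAM.
heap_2gb = default_max_heap_mb(2 * 1024)
heap_16gb = default_max_heap_mb(16 * 1024)
heap_64gb = default_max_heap_mb(64 * 1024)
```

With these inputs the formula gives 1024MB, 4096MB and 8192MB respectively: small machines get half their RAM up to 1GB, larger ones get a quarter up to 8GB.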

Re: Secondary index and/or row key in the read path ?

2012-08-21 Thread Jean-Armel Luce
Hi Aaron, Thank you for your answer. So, I shall do post-processing for selecting a row using a row key *and* applying a column level filter. Best Regards, Jean-Armel 2012/8/21 aaron morton aa...@thelastpickle.com - do we need to post-process (filter) the result of the query in our

Re: Best strategy to increase cluster size and keep nodes balanced

2012-08-21 Thread aaron morton
Unless you really need to, consider moving to 6; it will be easier. That said, if you want to get to 7 I would: * bring the new nodes in with tokens selected for 7. * move the old nodes to new 7-node tokens. * cleanup on the old nodes There is a way to expedite things by copying files around,
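A sketch of how evenly spaced tokens for a 7-node ring could be computed, assuming the cluster uses RandomPartitioner (token range 0 to 2**127). The helper name `balanced_tokens` is hypothetical; other partitioners use a different token range, so the constants would change.

```python
def balanced_tokens(node_count: int) -> list[int]:
    """Evenly spaced tokens for RandomPartitioner (assumed ring size 2**127)."""
    ring_size = 2 ** 127
    return [i * ring_size // node_count for i in range(node_count)]


# Target tokens for the 7-node layout discussed above.
tokens = balanced_tokens(7)
for index, token in enumerate(tokens):
    print(f"node {index}: {token}")
```

The new nodes would be started with some of these values as their initial tokens, and the existing nodes moved to the remaining ones before running cleanup.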

Re: Why so slow?

2012-08-21 Thread aaron morton
I did a talk on server side latency at Cassandra Summit 12 the other week http://www.datastax.com/events/cassandrasummit2012/presentations If you want to do some baseline tests think about: multiple clients, batch calls with maybe 10's of rows, connection pooling. There is a stress tool in

Re: get_slice on wide rows

2012-08-21 Thread aaron morton
Is the problem that cassandra is attempting to load all the deleted columns into memory? Yup. The talk by Mat Dennis at the Cassandra Summit may be of interest to you. He talks about similar things http://www.datastax.com/events/cassandrasummit2012/presentations Drop the gc_grace_seconds

Re: Heap size question

2012-08-21 Thread Alain RODRIGUEZ
I have the same configuration and I recently changed my cassandra-env.sh to: MAX_HEAP_SIZE=4G HEAP_NEWSIZE=200M I guess it depends on how much you use the cache (which is now in the off-heap memory). I don't use row cache and use the default key cache size. I have no more memory pressure nor

Re: Heap size question

2012-08-21 Thread Tamar Fraenkel
Thanks for your prompt response. Please see follow up questions below Thanks!!! *Tamar Fraenkel * Senior Software Engineer, TOK Media ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956 On Tue, Aug 21, 2012 at 12:57 PM, Alain

Re: Problem while configuring key and row cache?

2012-08-21 Thread Jonathan Ellis
setcachecapacity is obsolete in 1.1+. Looks like we missed removing it from nodetool. See http://www.datastax.com/dev/blog/caching-in-cassandra-1-1 for background. (Moving to users@.) On Tue, Aug 21, 2012 at 8:19 AM, Amit Handa amithand...@gmail.com wrote: I started exploring apache cassandra

Re: Heap size question

2012-08-21 Thread Alain RODRIGUEZ
You're welcome. I'll answer your new questions but keep in mind that I am not a cassandra committer nor even a cassandra specialist. you mean that key cache is not in heap? I am using cassandra 1.0.8 and I was under the impression it was, see http://www.datastax.com/docs/1.0/operations/tuning,

Re: Heap size question

2012-08-21 Thread Tamar Fraenkel
Much appreciated. What you described makes a lot of sense from all my readings :) Thanks! *Tamar Fraenkel * Senior Software Engineer, TOK Media ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956 On Tue, Aug 21, 2012 at 6:43 PM,

Re: Why the StageManager thread pools have 60 seconds keepalive time?

2012-08-21 Thread Guillermo Winkler
Aaron, thanks for your answer. We do have big batch updates, not always with the columns belonging to the same row (i.e. many threads are needed to handle the updates), but it did not represent a problem when the CFs had less data on them. One thing we did change in the past weeks was the