Re: performance degradation in cluster
> The first time, I ran a single instance of Cassandra and my application on one system (16 GB RAM, 8 cores), and the run took 480 sec. When I added one more system (so that two Cassandra instances were running as a cluster) and ran the application from a single client, the time taken increased to 1000 sec. I also found that the data distribution was very uneven between the two systems (about 2.5 GB on one and 140 MB on the other).
>
> Is any configuration required when running Cassandra in a cluster, other than adding seeds?

For starters:

(1) Are you spreading your data evenly across rows? Row keys determine where data is placed in the cluster.
(2) Is your ring actually balanced? (Check with nodetool ring; with two nodes, each should own 50%.)
(3) Is your test concurrent/multi-threaded?

An increase in total time would be expected if your test is a sequential workload and you're moving from local-only traffic to running against remote machines. Adding machines increases aggregate throughput across multiple clients; it won't make individual requests faster (except indirectly, of course, by avoiding overload conditions).

--
/ Peter Schuller
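Points (1) and (2) above can be illustrated with a small sketch of how a row key ends up on a node. It assumes the default RandomPartitioner, where a key's token is the absolute value of its MD5 digest read as a signed integer and each node owns the range ending at its own token. The function names and the sample keys are hypothetical; this is an approximation for reasoning about balance, not Cassandra's actual code.

```python
import hashlib
from bisect import bisect_left


def token_for(key):
    """Approximate RandomPartitioner token: |MD5(key)| as a signed 128-bit integer."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def owner(token, node_tokens):
    """Index of the node owning `token`; `node_tokens` must be sorted ascending.

    A node owns the range (previous node's token, its own token], so the owner
    is the first node whose token is >= the key's token, wrapping around.
    """
    idx = bisect_left(node_tokens, token)
    return idx % len(node_tokens)


def placement_counts(keys, node_tokens):
    """Count how many of `keys` land on each node of the ring."""
    counts = [0] * len(node_tokens)
    for key in keys:
        counts[owner(token_for(key), node_tokens)] += 1
    return counts


if __name__ == "__main__":
    keys = ["thread_name_%d" % i for i in range(10000)]
    # Two nodes with evenly spaced tokens versus two nodes with tokens close together.
    print("even tokens:  ", placement_counts(keys, [0, 2 ** 126]))
    print("skewed tokens:", placement_counts(keys, [0, 2 ** 120]))
```

With evenly spaced tokens the keys split roughly in half; with tokens close together one node receives almost everything, even though the keys themselves hash uniformly. Note that this counts rows only: a few very large rows can still unbalance the data size on a ring whose tokens are even.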
Re: performance degradation in cluster
Hi Peter,

Thanks for your reply.

Our application is multi-threaded, and we are using an 8-core machine. The application uses 4 column families, one of which contains rows that are huge relative to the rows in the other column families. The balance in the ring is highly skewed. Can you suggest how we can ensure even balancing of the load across the cluster?

The row IDs in one column family are combinations of cell numbers (e.g. 9883240354_9885430354), and the other row IDs look like thread_name_12234.

How can we ensure that the data is spread across rows?

Thanks
Regards,
abhinav

On Thu, Feb 3, 2011 at 1:46 PM, Peter Schuller peter.schul...@infidyne.com wrote:
> [...]

--
Regards,
Abhinav P. Rai
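One way to see whether a few rows dominate a column family is to count, before writing, how many entries each row key would receive. The sketch below is illustrative only: the record layout (an ID plus a dict of fields) and the field name "caller" are hypothetical and do not come from the application discussed here.

```python
from collections import Counter


def row_widths(records, field):
    """Count how many record IDs would be stored under each value of `field`.

    `records` is an iterable of (record_id, fields_dict) pairs; each distinct
    value of `field` is treated as one row key.
    """
    widths = Counter()
    for _record_id, fields in records:
        widths[fields[field]] += 1
    return widths


def widest_rows(widths, top=5):
    """Return the `top` row keys with the most entries, widest first."""
    return widths.most_common(top)


if __name__ == "__main__":
    sample = [
        ("thread_name_1", {"caller": "9883240354"}),
        ("thread_name_2", {"caller": "9883240354"}),
        ("thread_name_3", {"caller": "9885430354"}),
    ]
    print(widest_rows(row_widths(sample, "caller"), top=2))
```

If a handful of keys account for most of the entries, those rows will sit on whichever nodes own their tokens, and no choice of tokens will spread them further.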
Re: performance degradation in cluster
This page has a guide to setting the initial tokens for the nodes:

http://wiki.apache.org/cassandra/Operations#Ring_management

You can also use the bin/nodetool cfstats command or JConsole to check the maximum row size on each node, to see if you have a monster row.

Aaron

On 3/02/2011, at 10:22 PM, abhinav prakash rai wrote:
> [...]
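For the initial tokens mentioned in Aaron's reply, a minimal sketch of the usual calculation is below. It assumes RandomPartitioner, whose token space runs from 0 to 2**127, and gives node i of N the token i * 2**127 / N. The function name is hypothetical; the resulting values would go into each node's initial token setting (initial_token in the 0.7 cassandra.yaml, InitialToken in the 0.6 storage-conf.xml).

```python
RING_SIZE = 2 ** 127  # RandomPartitioner token space (assumed partitioner)


def balanced_tokens(node_count):
    """Evenly spaced initial tokens for `node_count` nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    for i, token in enumerate(balanced_tokens(2)):
        print("node %d: initial token = %d" % (i, token))
```

For a two-node ring this yields 0 and 2**126, so each node owns half of the token space. Tokens chosen automatically at bootstrap are not guaranteed to be spaced this way, which is one possible cause of an uneven nodetool ring output.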
Re: performance degradation in cluster
Are you using virtual machines to run Cassandra? I've found that performance in VMs is crap.

Nicolas Santini

On Thu, Feb 3, 2011 at 11:17 PM, aaron morton aa...@thelastpickle.com wrote:
> [...]
Re: performance degradation in cluster
Hi

I'll explain a bit. I'm working with Abhinav.

We have an application that was earlier based on Lucene; it would index a huge volume of data and later use the indices to fetch data and perform a fuzzy matching operation. We wanted to use Cassandra primarily because of its sharding/availability/no-SPOF capabilities and its write speed.

The application runs on an 8-core machine with 8 threads, each reading different files and writing to 3 different CFs:

- one to store the raw data, keyed by an ID of the form ThreadName-counter, which is unique
- one to store a subset of the raw data (a small set of fields), keyed by the same ID as before
- one to store the inverted index, keyed by a field in the data and holding the IDs of all the records for which that field matched

On the 8-core machine, with 8 threads, it took approx. 20 min to create the index store for a data set of 24M rows, using a single instance of Cassandra. The 480 sec mentioned by Abhinav earlier was for a smaller data set.

When we created a ring by adding another similar machine and re-ran the application from scratch (consistency level = ONE), the total time increased considerably; it actually doubled. The nodes were also unbalanced, showing a 70-30 distribution of load (sometimes even more skewed). Effectively, in the ring, the run takes much longer and the data distribution is skewed. A similar thing happened when we tried the application on a collection of desktops (4/5 of them).

We have faced another issue while doing this. We ran jstack on the application and found output similar to JIRA issue 1594 (which I mentioned in another mail earlier); this is true for both the 0.6.8 and 0.7 versions.

CPU usage on the nodes is never greater than 50-60% (user+sys), while the disk busy time is quite high. With Lucene, CPU usage was pretty high on all the cores (90% or more). It may be that the usage has gone down because of disk IO, but we aren't completely sure about this.

We have a feeling that we aren't creating the cluster properly or have missed certain important configuration aspects. The configuration we are using is the default one. Changes to the memtable throughput in MB didn't have much effect.

Following is a snapshot from the cfstats output (for a data set of 2M rows):

Keyspace: fct_cdr
    Read Count: 277537
    Read Latency: 0.43607250564789557 ms.
    Write Count: 3781264
    Write Latency: 0.01323008708199163 ms.
    Pending Tasks: 0

    Column Family: RawCDR
    SSTable count: 1
    Space used (live): 719796067
    Space used (total): 1439605485
    Memtable Columns Count: 218459
    Memtable Data Size: 120398507
    Memtable Switch Count: 4
    Read Count: 0
    Read Latency: NaN ms.
    Write Count: 1203177
    Write Latency: 0.016 ms.
    Pending Tasks: 0
    Key cache capacity: 1
    Key cache size: 0
    Key cache hit rate: NaN
    Row cache capacity: 1000
    Row cache size: 0
    Row cache hit rate: NaN
    Compacted row minimum size: 535
    Compacted row maximum size: 924
    Compacted row mean size: 642

    Column Family: Index
    SSTable count: 5
    Space used (live): 326960041
    Space used (total): 564423442
    Memtable Columns Count: 264507
    Memtable Data Size: 9443853
    Memtable Switch Count: 15
    Read Count: 178785
    Read Latency: 0.425 ms.
    Write Count: 1203177
    Write Latency: 0.012 ms.
    Pending Tasks: 0
    Key cache capacity: 1
    Key cache size: 1
    Key cache hit rate: 0.0
    Row cache capacity: 1000
    Row cache size: 1000
    Row cache hit rate: 0.0
    Compacted row minimum size: 215
    Compacted row maximum size: 310
    Compacted row mean size: 215

    Column Family: IndexInverse
    SSTable count: 3
    Space used (live): 164782651
    Space used (total): 164782651
    Memtable Columns Count: 289647
    Memtable Data Size: 12757041
    Memtable Switch Count: 3
    Read Count: 98950
    Read Latency: 0.457 ms.
    Write Count: 1201911
    Write Latency: 0.017 ms.
    Pending Tasks: 0
    Key cache capacity: 1
    Key cache size: 1
    Key cache hit rate: 0.0
    Row cache capacity: 1000
    Row cache size: 1000
    Row cache hit rate: 0.0
    Compacted row
performance degradation in cluster
The first time, I ran a single instance of Cassandra and my application on one system (16 GB RAM, 8 cores), and the run took 480 sec. When I added one more system (so that two Cassandra instances were running as a cluster) and ran the application from a single client, the time taken increased to 1000 sec. I also found that the data distribution was very uneven between the two systems (about 2.5 GB on one and 140 MB on the other).

Is any configuration required when running Cassandra in a cluster, other than adding seeds?

Thanks
Regards,
abhinav