Hi, I'll explain a bit. I'm working with Abhinav.
We have an application that was earlier based on Lucene: it indexes a huge volume of data and later uses the indices to fetch data and perform a fuzzy matching operation. We wanted to move to Cassandra primarily for the sharding/availability/no-SPOF capabilities and the write speed.

The application runs on an 8-core machine with 8 threads, each reading different files and writing to 3 different CFs:
- one to store the raw data, keyed by an ID; the ID is of the form ThreadName-<counter> and is unique
- one to store a subset of the raw data (a small set of fields), keyed by the same ID as before
- one to store the inverted index, keyed by a field in the data, with the IDs of all the records for which that field matched

On the 8-core machine, with 8 threads, it took us approx. 20 min. to create the index store for a data set of 24M rows, and this was for a single instance of Cassandra. The 480 sec. mentioned by Abhinav earlier was for a smaller data set.

When we created a ring by adding another similar machine and re-executed the application from scratch (consistency level = ONE), the total time increased considerably (it actually doubled), and the nodes were unbalanced, showing a 70-30 distribution of load (sometimes even more skewed). Effectively, in the ring it takes much longer and the data distribution is skewed. The same thing happened when we tried the application on a collection of desktops (4 or 5 of them).

We have faced another issue while doing this. We ran jstack on the application and found output similar to JIRA issue 1594 (which I mentioned in an earlier mail), and this is true for both the 0.6.8 and 0.7 versions. The CPU usage on the nodes is never greater than 50-60% (user+sys), while the disk busy time is quite high. When we were using Lucene, the CPU usage was pretty high on all the cores (90% or more). It may be that the usage has gone down because of the disk I/O, but we aren't completely sure of this.
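To make the write path concrete, here is a minimal in-memory sketch of the three-CF layout described above (plain Python dicts standing in for the column families; the class and field names are illustrative, not our actual schema or client code):

```python
from collections import defaultdict
from itertools import count

class CdrStore:
    """In-memory stand-in for the three CFs: raw data, a field
    subset, and an inverted index. Names are hypothetical."""

    def __init__(self, thread_name):
        self.thread_name = thread_name
        self.counter = count()                 # per-thread counter for unique IDs
        self.raw = {}                          # full record, keyed by ID
        self.subset = {}                       # small set of fields, same key
        self.inverted = defaultdict(list)      # field value -> list of IDs

    def write(self, record, subset_fields, index_field):
        # The ID is of the form ThreadName-<counter> and is unique per thread.
        row_id = f"{self.thread_name}-{next(self.counter)}"
        self.raw[row_id] = record
        self.subset[row_id] = {f: record[f] for f in subset_fields}
        self.inverted[record[index_field]].append(row_id)
        return row_id
```

With 8 threads, each thread would own one such store name prefix, so IDs never collide across threads (e.g. thread "T1" produces "T1-0", "T1-1", ...).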
We have a feeling that we aren't creating the cluster properly or have missed certain important configuration aspects. The configuration we are using is the default one; changing the memtable throughput (in MB) didn't have much effect.

The following is a snapshot from the cfstats output (for a data set of 2M rows):

Keyspace: fct_cdr
  Read Count: 277537
  Read Latency: 0.43607250564789557 ms.
  Write Count: 3781264
  Write Latency: 0.01323008708199163 ms.
  Pending Tasks: 0

  Column Family: RawCDR
  SSTable count: 1
  Space used (live): 719796067
  Space used (total): 1439605485
  Memtable Columns Count: 218459
  Memtable Data Size: 120398507
  Memtable Switch Count: 4
  Read Count: 0
  Read Latency: NaN ms.
  Write Count: 1203177
  Write Latency: 0.016 ms.
  Pending Tasks: 0
  Key cache capacity: 10000
  Key cache size: 0
  Key cache hit rate: NaN
  Row cache capacity: 1000
  Row cache size: 0
  Row cache hit rate: NaN
  Compacted row minimum size: 535
  Compacted row maximum size: 924
  Compacted row mean size: 642

  Column Family: Index
  SSTable count: 5
  Space used (live): 326960041
  Space used (total): 564423442
  Memtable Columns Count: 264507
  Memtable Data Size: 9443853
  Memtable Switch Count: 15
  Read Count: 178785
  Read Latency: 0.425 ms.
  Write Count: 1203177
  Write Latency: 0.012 ms.
  Pending Tasks: 0
  Key cache capacity: 10000
  Key cache size: 10000
  Key cache hit rate: 0.0
  Row cache capacity: 1000
  Row cache size: 1000
  Row cache hit rate: 0.0
  Compacted row minimum size: 215
  Compacted row maximum size: 310
  Compacted row mean size: 215

  Column Family: IndexInverse
  SSTable count: 3
  Space used (live): 164782651
  Space used (total): 164782651
  Memtable Columns Count: 289647
  Memtable Data Size: 12757041
  Memtable Switch Count: 3
  Read Count: 98950
  Read Latency: 0.457 ms.
  Write Count: 1201911
  Write Latency: 0.017 ms.
  Pending Tasks: 0
  Key cache capacity: 10000
  Key cache size: 10000
  Key cache hit rate: 0.0
  Row cache capacity: 1000
  Row cache size: 1000
  Row cache hit rate: 0.0
  Compacted row minimum size: 149
  Compacted row maximum size: 14237
  Compacted row mean size: 179

The write latency shown here is not bad, but we need to confirm this. It may be that the problem lies with the application and/or our configuration.

Regards
Arijit

--
"And when the night is cloudy, There is still a light that shines on me, Shine on until tomorrow, let it be."
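P.S. For the 70-30 skew, one thing we still need to check is whether we assigned initial tokens explicitly rather than letting the nodes pick them. With RandomPartitioner the token space is the integers in [0, 2**127), so a balanced N-node ring needs tokens at i * 2**127 / N. A quick sketch to compute them (the function name is just for illustration):

```python
def balanced_tokens(num_nodes):
    """Evenly spaced initial tokens for RandomPartitioner,
    whose token space is the integers in [0, 2**127)."""
    return [i * 2**127 // num_nodes for i in range(num_nodes)]

# Tokens for a 2-node ring like ours:
for node, token in enumerate(balanced_tokens(2)):
    print(f"node {node}: initial_token = {token}")
```

These values would go into each node's initial token setting in the configuration before the first start.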