Make Cassandra sampling and startup faster
------------------------------------------

                 Key: CASSANDRA-1526
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1526
             Project: Cassandra
          Issue Type: New Feature
            Reporter: Edward Capriolo


http://wiki.apache.org/cassandra/CassandraHardware mentions very large 
disks. I do not see how that would be possible.

We have a server-class system with 4 processors, 16GB RAM, and a 6-disk RAID5 
array (yes, RAID0 would be faster, but still):
{noformat}
INFO [main] 2010-09-21 12:58:26,348 SSTableReader.java (line 120) Sampling index for /var/lib/cassandra/data/system/LocationInfo-699-Data.db
...
INFO [main] 2010-09-21 13:05:51,333 CassandraDaemon.java (line 124) Binding thrift service to cdbsd07/10.71.71.57:9160
{noformat}

This node has 200GB of data in two column families, and the time to sample all 
tables and start up is 7+ minutes. The logging suggests this process handles 
a single SSTable at a time. Additionally, the normal system vitals, 
mainly disk and CPU, do not look overtaxed.

* Since SSTables are immutable, is there a way the sampling of the tables could 
be saved?
* Could this process be done in parallel for speedup?
* Can multiple column families be processed at once?
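As a rough illustration of the second and third questions, the sketch below samples several index files concurrently on a fixed-size thread pool. It is not taken from the Cassandra code base: `ParallelSampler`, `sampleEvery`, and `sampleAll` are hypothetical names, and an in-memory list of keys stands in for an on-disk index file. Whether this helps in practice depends on whether the disks can serve concurrent reads.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names come from Cassandra itself.
final class ParallelSampler {

    // Keep every `interval`-th key of one index. This stands in for the
    // per-SSTable sampling step that the log shows running one file at a time.
    static List<String> sampleEvery(List<String> indexKeys, int interval) {
        List<String> sample = new ArrayList<>();
        for (int i = 0; i < indexKeys.size(); i += interval) {
            sample.add(indexKeys.get(i));
        }
        return sample;
    }

    // Sample all indexes on a fixed-size pool instead of sequentially.
    // The map key is the SSTable name; the value is its full list of index keys.
    static Map<String, List<String>> sampleAll(
            Map<String, List<String>> indexes, int interval, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per SSTable; the column family does not matter here.
            Map<String, Future<List<String>>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, List<String>> e : indexes.entrySet()) {
                List<String> keys = e.getValue();
                pending.put(e.getKey(), pool.submit(() -> sampleEvery(keys, interval)));
            }
            // Collect results in submission order.
            Map<String, List<String>> result = new LinkedHashMap<>();
            for (Map.Entry<String, Future<List<String>>> e : pending.entrySet()) {
                result.put(e.getKey(), e.getValue().get());
            }
            return result;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException(ex.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because SSTables are immutable, each task only reads its own file, so the tasks need no coordination beyond collecting their results.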

Unless someone has an insanely powerful disk pack, the mention of 2TB 
limitations seems out of place. Unless my calculations are wrong (which they 
usually are), I have pretty decent hardware, and if I had 2TB of data I 
would have a 95-minute node startup?
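The extrapolation can be checked with a small sketch, assuming startup time scales linearly with data size and taking only the two timestamps in the log excerpt as input. `StartupEstimate` and its methods are hypothetical helper names. Under those assumptions, 200GB to 2000GB works out to roughly 74 minutes, so the 95-minute figure presumably rests on inputs not visible in the elided log.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for the back-of-the-envelope estimate; not part of Cassandra.
final class StartupEstimate {

    // Matches the timestamp layout in the log excerpt, e.g. "2010-09-21 12:58:26,348".
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    // Seconds elapsed between two log timestamps.
    static double elapsedSeconds(String start, String end) {
        Duration d = Duration.between(
                LocalDateTime.parse(start, LOG_TIME),
                LocalDateTime.parse(end, LOG_TIME));
        return d.toMillis() / 1000.0;
    }

    // Assumption: sampling time grows in direct proportion to data size.
    static double extrapolateMinutes(double measuredSeconds, double measuredGb, double targetGb) {
        return measuredSeconds * (targetGb / measuredGb) / 60.0;
    }
}
```

For example, `elapsedSeconds("2010-09-21 12:58:26,348", "2010-09-21 13:05:51,333")` gives 444.985 seconds, and `extrapolateMinutes(444.985, 200, 2000)` gives about 74.2 minutes.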

I hope that sampling multiple ColumnFamilies at once would let nodes with 
at least a few hundred GB start up reasonably fast.
