Misc Performance Questions
Is there a performance hit when dropping a CF? What if it contains .5 TB of data? If not, is there a quick and painless way to drop a large amount of data w/minimal perf hit?

Is there a performance hit running multiple keyspaces on a cluster versus only one keyspace given a constant total data size? Is there some quantity limit?

Using a Random Partitioner, but with a RF = 1, will the rows still be spread-out evenly on the cluster or will there be an affinity to a single node (like the one receiving the data from the client)?

I see a lot of mention of using RAID-0, but not RAID-5/6. Why? Even though Cass can tolerate a down node due to data loss, it would still be more efficient to just rebuild a bad hdd live, right?

Maybe perf related: Will there be a problem having multiple keyspaces on a cluster all with different replication factors, from 1-3?

Thanks!
Re: Misc Performance Questions
Hi AJ,

On Wed, Jun 8, 2011 at 9:29 AM, AJ a...@dude.podzone.net wrote:

> Is there a performance hit when dropping a CF? What if it contains .5 TB of data? If not, is there a quick and painless way to drop a large amount of data w/minimal perf hit?

Dropping a CF is quick - it snapshots the files (which creates hard links) and removes the CF definition. The disk space is not freed at that point; to actually delete the data, remove the snapshot files from your data directory.

> Is there a performance hit running multiple keyspaces on a cluster versus only one keyspace given a constant total data size? Is there some quantity limit?

There is a tiny amount of memory used per keyspace, but unless you have very many keyspaces you won't notice any impact from running multiple keyspaces.

There is, however, a difference between running multiple column families and putting everything in the same column family, separated by e.g. a key prefix. For example, if you have a large data set and a small one, it will be quicker to query the small one if it is in its own column family.

> Using a Random Partitioner, but with a RF = 1, will the rows still be spread-out evenly on the cluster or will there be an affinity to a single node (like the one receiving the data from the client)?

The rows will be spread out the same way: placement depends on the hash of the row key, not on which node receives the write, so RF=1 doesn't affect the load balancing.

> I see a lot of mention of using RAID-0, but not RAID-5/6. Why? Even though Cass can tolerate a down node due to data loss, it would still be more efficient to just rebuild a bad hdd live, right?

There's a trade-off - RAID-0 will give better performance, but rebuilds are over a network. With RF > 1, RAID-0 is enough that you're unlikely to lose data, but as you say, replacing a failed node will be slower.

> Maybe perf related: Will there be a problem having multiple keyspaces on a cluster all with different replication factors, from 1-3?

No.

Richard.

--
Richard Low
Acunu | http://www.acunu.com | @acunu
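
To illustrate the RandomPartitioner answer above, here is a minimal Python sketch, not Cassandra's actual code. It assumes a hypothetical 4-node ring with evenly spaced tokens and approximates the token as the MD5 digest of the row key reduced into the range 0 to 2**127 (the real partitioner derives the token from the MD5 digest in a slightly different way). The names `token_for`, `evenly_spaced_tokens`, `owner` and `distribution` are made up for this example. The point it shows is that the owning node is a function of the key alone, so with RF=1 the rows still spread evenly regardless of which node the client talks to.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

RING_SIZE = 2 ** 127  # RandomPartitioner tokens fall in the range [0, 2**127)


def token_for(key: bytes) -> int:
    # Simplification: treat the MD5 digest as an unsigned integer and
    # reduce it into the ring. Only the key is used, never the coordinator.
    return int.from_bytes(hashlib.md5(key).digest(), "big") % RING_SIZE


def evenly_spaced_tokens(node_count: int) -> list:
    # Hypothetical balanced ring: node i is assigned token i * RING_SIZE / N.
    return [i * RING_SIZE // node_count for i in range(node_count)]


def owner(token: int, node_tokens: list) -> int:
    # A node owns the range (previous node's token, its own token], so the
    # owner is the first node whose token is >= the key's token, wrapping
    # around to node 0 past the end of the ring.
    return bisect_left(node_tokens, token) % len(node_tokens)


def distribution(keys, node_count: int) -> Counter:
    # Count how many keys land on each node when RF=1 (one owner per key).
    tokens = evenly_spaced_tokens(node_count)
    return Counter(owner(token_for(k), tokens) for k in keys)


counts = distribution([f"row-{i}".encode() for i in range(10000)], 4)
for node, rows in sorted(counts.items()):
    print(f"node {node}: {rows} rows")
```

With a higher RF the same token lookup picks the first replica, and further replicas are chosen by the replication strategy; that part is left out of the sketch.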
Re: Misc Performance Questions
Thank you Richard!

On 6/8/2011 2:57 AM, Richard Low wrote:

[snip]

> There is however a difference in running multiple column families versus putting everything in the same column family and separating them with e.g. a key prefix. E.g. if you have a large data set and a small one, it will be quicker to query the small one if it is in its own column family.

I assumed that a read would be O(1) for any size CF since Cass is implemented with hashmaps. Do you know why size matters? (forgive the pun)
Re: Misc Performance Questions
On Wed, Jun 8, 2011 at 12:30 PM, AJ a...@dude.podzone.net wrote:

> > There is however a difference in running multiple column families versus putting everything in the same column family and separating them with e.g. a key prefix. E.g. if you have a large data set and a small one, it will be quicker to query the small one if it is in its own column family.
>
> I assumed that a read would be O(1) for any size CF since Cass is implemented with hashmaps. Do you know why size matters? (forgive the pun)

You may not notice a difference, but it can happen. For a query, each SSTable is queried. If there is more data, there are (most likely) more SSTables to query, which slows the query down. For point queries this isn't so bad, because the Bloom filters let most SSTables that don't contain the key be skipped, but for range queries you will notice a big difference: you will have to do more seeks to skip over unwanted data.

Separating them will also help buffer caching - the small SSTables are more likely to remain in cache.

--
Richard Low
Acunu | http://www.acunu.com | @acunu
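
The difference between point and range queries described above can be sketched with a toy model in Python. This is an illustration under stated assumptions, not how Cassandra is implemented: `BloomFilter`, `SSTable`, `point_query` and `range_query` are hypothetical names, each SSTable is just an in-memory dict, plain key order stands in for the on-disk ordering, and the `reads` counter stands in for disk seeks.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = 0  # Python int used as a bit set

    def _positions(self, key):
        # Derive several bit positions per key by salting an MD5 hash.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.array |= 1 << pos

    def might_contain(self, key):
        return all((self.array >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Toy immutable table: sorted rows plus a Bloom filter over its keys."""

    def __init__(self, rows):
        self.rows = dict(sorted(rows.items()))
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def point_query(sstables, key):
    # Only SSTables whose Bloom filter says "maybe" are actually read.
    reads, found = 0, []
    for table in sstables:
        if not table.bloom.might_contain(key):
            continue
        reads += 1
        if key in table.rows:
            found.append(table.rows[key])
    return found, reads


def range_query(sstables, start, end):
    # A Bloom filter cannot answer "any key in this range?", so every
    # SSTable has to be consulted.
    reads, found = 0, {}
    for table in sstables:
        reads += 1
        for key, value in table.rows.items():
            if start <= key <= end:
                found[key] = value
    return found, reads


# Five toy SSTables with 20 distinct rows each.
sstables = [
    SSTable({f"key-{t}-{i}": f"value-{t}-{i}" for i in range(20)})
    for t in range(5)
]
values, point_reads = point_query(sstables, "key-2-7")
rows, range_reads = range_query(sstables, "key-1-0", "key-3-9")
print(f"point query read {point_reads} of {len(sstables)} SSTables")
print(f"range query read {range_reads} of {len(sstables)} SSTables")
```

In this model, adding more SSTables raises the cost of every range query, while a point query only pays for the tables its Bloom filters cannot rule out.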