Misc Performance Questions

2011-06-08 Thread AJ


Is there a performance hit when dropping a CF?  What if it contains .5 
TB of data?  If not, is there a quick and painless way to drop a large 
amount of data w/minimal perf hit?


Is there a performance hit running multiple keyspaces on a cluster 
versus only one keyspace given a constant total data size?  Is there 
some quantity limit?


Using a Random Partitioner, but with a RF = 1, will the rows still be 
spread-out evenly on the cluster or will there be an affinity to a 
single node (like the one receiving the data from the client)?


I see a lot of mention of using RAID-0, but not RAID-5/6.  Why?  Even 
though Cass can tolerate a down node due to data loss, it would still be 
more efficient to just rebuild a bad hdd live, right?


Maybe perf related:  Will there be a problem having multiple keyspaces 
on a cluster all with different replication factors, from 1-3?


Thanks!


Re: Misc Performance Questions

2011-06-08 Thread Richard Low
Hi AJ,

On Wed, Jun 8, 2011 at 9:29 AM, AJ a...@dude.podzone.net wrote:

> Is there a performance hit when dropping a CF?  What if it contains .5 TB of
> data?  If not, is there a quick and painless way to drop a large amount of
> data w/minimal perf hit?

Dropping a CF is quick - it snapshots the files (which creates hard
links) and removes the CF definition.  To actually delete the data,
remove the snapshot files from your data directory.
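
As an illustration of why the drop is quick but frees no disk space, here
is a minimal Python sketch of hard-link behaviour on a POSIX-style
filesystem.  It is not Cassandra code: the file name cf-1-Data.db, the
snapshots/drop directory and the helper drop_with_snapshot are all
hypothetical.

```python
import os
import tempfile


def drop_with_snapshot(data_file, snapshot_dir):
    """Hard-link data_file into snapshot_dir, then remove the original name."""
    os.makedirs(snapshot_dir, exist_ok=True)
    snap_path = os.path.join(snapshot_dir, os.path.basename(data_file))
    os.link(data_file, snap_path)  # second name for the same blocks, no copy
    os.remove(data_file)           # drops one name; the data is still on disk
    return snap_path


def demo():
    with tempfile.TemporaryDirectory() as root:
        data_file = os.path.join(root, "cf-1-Data.db")  # hypothetical name
        with open(data_file, "wb") as f:
            f.write(b"x" * 4096)

        snap = drop_with_snapshot(
            data_file, os.path.join(root, "snapshots", "drop"))
        original_gone = not os.path.exists(data_file)
        data_survives = os.path.getsize(snap) == 4096

        os.remove(snap)  # last link removed: only now is the space released
        return original_gone, data_survives, os.path.exists(snap)
```

Creating the link costs the same regardless of file size, which is why the
amount of data in the CF should not matter much for the drop itself.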

> Is there a performance hit running multiple keyspaces on a cluster versus
> only one keyspace given a constant total data size?  Is there some quantity
> limit?

There is a tiny amount of memory used per keyspace, but unless you
have very many keyspaces you won't notice any impact of running
multiple keyspaces.

There is however a difference in running multiple column families
versus putting everything in the same column family and separating
them with e.g. a key prefix.  E.g. if you have a large data set and a
small one, it will be quicker to query the small one if it is in its
own column family.

> Using a Random Partitioner, but with a RF = 1, will the rows still be
> spread-out evenly on the cluster or will there be an affinity to a single
> node (like the one receiving the data from the client)?

The rows will be spread out the same way - RF=1 doesn't affect the
load balancing.
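
To illustrate why RF does not change the spread, here is a simplified
Python sketch of hash-based placement on a token ring.  It assumes evenly
spaced node tokens and replicas placed on the next nodes around the ring;
the function names and the MD5-modulo token are illustrative
simplifications, not Cassandra's actual code.  Note that the node that
received the write from the client never enters the calculation.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 2 ** 127  # simplified token space for this sketch


def token_for(key: bytes) -> int:
    """Map a row key to a ring position by hashing it."""
    return int.from_bytes(hashlib.md5(key).digest(), "big") % RING_SIZE


def build_ring(node_names):
    """Give each node an evenly spaced token (an assumption of this sketch)."""
    step = RING_SIZE // len(node_names)
    return [(i * step, name) for i, name in enumerate(node_names)]


def replicas_for(key: bytes, ring, rf: int):
    """First node whose token is >= the key's token, then the next rf - 1."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]


def load_per_node(num_keys: int, ring, rf: int) -> Counter:
    """Count how many row copies each node ends up storing."""
    load = Counter()
    for i in range(num_keys):
        load.update(replicas_for(f"row-{i}".encode(), ring, rf))
    return load
```

Calling load_per_node(20000, build_ring(["n1", "n2", "n3", "n4"]), rf=1)
gives each node roughly a quarter of the rows; raising rf only adds
further copies on the neighbouring nodes.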

> I see a lot of mention of using RAID-0, but not RAID-5/6.  Why?  Even though
> Cass can tolerate a down node due to data loss, it would still be more
> efficient to just rebuild a bad hdd live, right?

There's a trade-off - RAID-0 will give better performance, but
rebuilds are over a network.  With RF > 1, RAID-0 is enough that
you're unlikely to lose data, but as you say, replacing a failed
node will be slower.

> Maybe perf related:  Will there be a problem having multiple keyspaces on a
> cluster all with different replication factors, from 1-3?

No.

Richard.

-- 
Richard Low
Acunu | http://www.acunu.com | @acunu


Re: Misc Performance Questions

2011-06-08 Thread AJ

Thank you Richard!

On 6/8/2011 2:57 AM, Richard Low wrote:
[snip]

> There is however a difference in running multiple column families
> versus putting everything in the same column family and separating
> them with e.g. a key prefix.  E.g. if you have a large data set and a
> small one, it will be quicker to query the small one if it is in its
> own column family.



I assumed that a read would be O(1) for any size CF since Cass is 
implemented with hashmaps.  Do you know why size matters?  (forgive the pun)


Re: Misc Performance Questions

2011-06-08 Thread Richard Low
On Wed, Jun 8, 2011 at 12:30 PM, AJ a...@dude.podzone.net wrote:

>> There is however a difference in running multiple column families
>> versus putting everything in the same column family and separating
>> them with e.g. a key prefix.  E.g. if you have a large data set and a
>> small one, it will be quicker to query the small one if it is in its
>> own column family.


> I assumed that a read would be O(1) for any size CF since Cass is
> implemented with hashmaps.  Do you know why size matters?  (forgive the pun)


You may not notice a difference, but it can happen.

For a query, each SSTable is queried.  If there is more data then
there are (most likely) more SSTables to query, slowing it down.  For
point queries, this isn't so bad because the Bloom filters will help,
but for range queries you will notice a big difference, because the
reads have to seek past the unwanted data.

It will also help buffer caching to separate them - the small SSTables
are more likely to remain in cache.
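
As a rough illustration of the point-query versus range-query difference,
here is a toy Python model.  It is a sketch under stated assumptions, not
Cassandra's read path: ToyBloom, ToySSTable, point_read and range_read
are hypothetical names, rows are plain dicts, the range is over raw keys,
and there is no timestamp reconciliation.

```python
import hashlib


class ToyBloom:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.field |= 1 << p

    def might_contain(self, key):
        return all((self.field >> p) & 1 for p in self._positions(key))


class ToySSTable:
    """Immutable set of rows plus a Bloom filter over its keys."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = ToyBloom()
        for key in self.rows:
            self.bloom.add(key)


def point_read(sstables, key):
    """Look up one key, skipping tables whose filter rules the key out."""
    value, touched = None, 0
    for table in sstables:
        if table.bloom.might_contain(key):
            touched += 1  # stands in for a disk seek
            value = table.rows.get(key, value)
    return value, touched


def range_read(sstables, lo, hi):
    """Read keys in [lo, hi); the filter cannot help, so every table is read."""
    result, touched = {}, 0
    for table in sstables:
        touched += 1
        result.update({k: v for k, v in table.rows.items() if lo <= k < hi})
    return result, touched
```

With many tables of "big" rows and one table of "small" rows, point_read
for a small key usually touches only a table or two, while range_read
always touches all of them.  Keeping the small data in its own column
family removes the big tables from the loop entirely.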

-- 
Richard Low
Acunu | http://www.acunu.com | @acunu