[Cassandra Wiki] Update of "LargeDataSetConsiderations" by jeremyhanna

Apache Wiki Wed, 04 Sep 2013 07:08:32 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "LargeDataSetConsiderations" page has been changed by jeremyhanna:
https://wiki.apache.org/cassandra/LargeDataSetConsiderations?action=diff&rev1=22&rev2=23

  
  Note that not all of these issues are specific to Cassandra.  For example, 
any storage system is subject to the trade-offs of cache sizes relative to 
active set size, and IOPS will always be strongly correlated with the 
percentage of requests that penetrate caching layers.  Also of note, the more 
data stored per node, the more data will have to be streamed in bootstrapping 
new or replacement nodes.
  
- '''Assumes Cassandra 1.2+'''
+ ==== Assumes Cassandra 1.2+ ====
  
- Significant work has been done to allow for more data to be stored on each 
node:
+ ==== Significant work has been done to allow for more data to be stored on 
each node: ====
-  * Row cache can be serialized off-heap.  Keep in mind that the existing row 
cache implementation still maintains off-heap entire rows of data and when that 
row is called for those rows are deserialized into the heap.  Alternate row 
cache implementations are being worked on to make the row cache more generally 
useful, see 
[[https://issues.apache.org/jira/browse/CASSANDRA-2864|CASSANDRA-2864]].
+  * Row cache can be serialized off-heap.  Keep in mind that this still stores 
entire rows off-heap and when they are used the full row is temporarily 
deserialized in the heap.  Alternate row cache implementations are being worked 
on to make the row cache more generally useful, see 
[[https://issues.apache.org/jira/browse/CASSANDRA-2864|CASSANDRA-2864]].
   * Bloom filters:
    * Moved off-heap 
[[https://issues.apache.org/jira/browse/CASSANDRA-4865|CASSANDRA-4865]]
    * Tunable via bloom_filter_fp_chance.  Starting in Cassandra 1.2 there are 
better defaults: 0.01 for column families using the 
!SizeTieredCompactionStrategy and 0.1 for column families using the 
!LeveledCompactionStrategy.  Note that a change to this property will take 
effect as new sstables are built.
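
To give a feel for what the bloom_filter_fp_chance defaults mean for memory, here is a minimal sketch using the textbook sizing formula for an optimally configured Bloom filter, bits per key = -ln(p) / (ln 2)^2. It is an approximation only: Cassandra's own sizing may round to whole buckets and hash counts, so actual sstable filter sizes can differ. The function names are hypothetical and not part of any Cassandra API.

```python
import math


def bloom_bits_per_key(fp_chance: float) -> float:
    """Bits per key for an optimally sized classic Bloom filter.

    Textbook approximation; not Cassandra's exact sizing logic.
    """
    if not 0.0 < fp_chance < 1.0:
        raise ValueError("fp_chance must be strictly between 0 and 1")
    return -math.log(fp_chance) / (math.log(2) ** 2)


def bloom_filter_bytes(num_keys: int, fp_chance: float) -> float:
    """Approximate filter size in bytes for num_keys row keys."""
    return num_keys * bloom_bits_per_key(fp_chance) / 8


if __name__ == "__main__":
    # Compare the two Cassandra 1.2 defaults mentioned above.
    for fp in (0.01, 0.1):
        mib = bloom_filter_bytes(1_000_000_000, fp) / 2**20
        print(f"fp_chance={fp}: {bloom_bits_per_key(fp):.2f} bits/key, "
              f"about {mib:.0f} MiB per billion keys")
```

Under this approximation, going from 0.01 to 0.1 halves the filter size (roughly 9.6 versus 4.8 bits per key), at the cost of more false-positive sstable reads.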
@@ -23, +23 @@

   * Virtual nodes to increase the parallelism and reduce the time of 
bootstrapping new nodes
   * sstable index files are no longer loaded on startup (reference)
  
- Other points to consider:
+ '''On moving data structures off-heap'''
+  * Moving data structures off-heap means that the structure gets serialized 
off-heap until it is needed.  Then it is deserialized temporarily in the heap 
and is then garbage collected when it is no longer used.
  
-  * Moving data structures off-heap means that the structure gets serialized 
off-heap until it is needed.  Then it is deserialized temporarily in the heap 
and is then garbage collected when it is no longer used.
+ ==== Other points to consider: ====
+ 
   * Disk space usage in Cassandra can vary over time:
    * Compaction: with the !SizeTieredCompactionStrategy, compaction can as 
much as double the disk space used.  The !LeveledCompactionStrategy usually 
requires only about 10% overhead (see 
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra).
    * Repair: repair operations can increase disk space demands, see 
http://www.datastax.com/dev/blog/advanced-repair-techniques for details and how 
it can be improved.
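
The compaction figures above can be turned into a rough capacity check. The sketch below is a back-of-the-envelope estimate only: it takes the "up to double" and "about 10%" numbers at face value as worst-case headroom, and it ignores repair, snapshots, and commit logs. The constant and function names are hypothetical.

```python
# Fractional disk overhead per compaction strategy, taken from the figures
# quoted above (up to 2x for size-tiered, about 10% for leveled).
COMPACTION_OVERHEAD = {
    "SizeTieredCompactionStrategy": 1.0,
    "LeveledCompactionStrategy": 0.10,
}


def required_disk_gb(live_data_gb: float, strategy: str) -> float:
    """Estimate the disk needed per node, including compaction headroom."""
    if live_data_gb < 0:
        raise ValueError("live_data_gb must be non-negative")
    try:
        overhead = COMPACTION_OVERHEAD[strategy]
    except KeyError:
        raise ValueError(f"unknown compaction strategy: {strategy}") from None
    return live_data_gb * (1.0 + overhead)


if __name__ == "__main__":
    for name in COMPACTION_OVERHEAD:
        needed = required_disk_gb(500, name)
        print(f"{name}: plan for about {needed:.0f} GB of disk "
              f"for 500 GB of data")
```

Since repair can add to disk demands as well, treat the result as a lower bound on provisioning rather than a target.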
