[Cassandra Wiki] Update of "LargeDataSetConsiderations" by jeremyhanna

Apache Wiki Wed, 04 Sep 2013 06:55:31 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "LargeDataSetConsiderations" page has been changed by jeremyhanna:
https://wiki.apache.org/cassandra/LargeDataSetConsiderations?action=diff&rev1=21&rev2=22

Comment:
Revising for current Cassandra features and considerations.

  = Using Cassandra for large data sets (lots of data per node) =
  
- This page aims to to give some advise as to the issues one may need to 
consider when using Cassandra for large data sets (meaning hundreds of 
gigabytes or terabytes per node). The intent is not to make original claims, 
but to collect in one place some issues that are operationally relevant. Other 
parts of the wiki are highly recommended in order to fully understand the 
issues involved.
+ This page aims to give some advice as to the issues one may need to 
consider when using Cassandra for large data sets (meaning hundreds of 
gigabytes or terabytes per node). The intent is not to make original claims, 
but to collect in one place some issues that are operationally relevant. Other 
parts of the wiki are highly recommended in order to fully understand the 
issues involved.
  
- This is a work in progress. If you find information out of date (e.g., a JIRA 
ticket referenced has been resolved but this document has not been updated), 
please help by editing or e-mail:ing cassandra-user.
+ This is a work in progress. If you find information out of date (e.g., a JIRA 
ticket referenced has been resolved but this document has not been updated), 
please help by editing or e-mailing cassandra-user.
  
- Note that not all of these issues are specific to Cassandra (for example, any 
storage system is subject to the trade-offs of cache sizes relative to active 
set size, and IOPS will always be strongly correlated with the percentage of 
requests that penetrate caching layers).
+ Note that not all of these issues are specific to Cassandra.  For example, 
any storage system is subject to the trade-offs of cache sizes relative to 
active set size, and IOPS will always be strongly correlated with the 
percentage of requests that penetrate caching layers.  Also of note, the more 
data stored per node, the more data will have to be streamed in bootstrapping 
new or replacement nodes.
  
- Unless otherwise noted, the points refer to Cassandra 0.7 and above.
+ '''Assumes Cassandra 1.2+'''
  
+ Significant work has been done to allow for more data to be stored on each 
node:
+  * Row cache can be serialized off-heap.  Keep in mind that the existing row 
cache implementation still stores entire rows off-heap, and when a row is 
requested it is deserialized into the heap.  Alternate row cache 
implementations are being worked on to make the row cache more generally 
useful; see 
[[https://issues.apache.org/jira/browse/CASSANDRA-2864|CASSANDRA-2864]].
+  * Bloom filters:
+   * Moved off-heap 
[[https://issues.apache.org/jira/browse/CASSANDRA-4865|CASSANDRA-4865]]
+   * Tunable via bloom_filter_fp_chance.  Starting in Cassandra 1.2 there are 
better defaults: 0.01 for column families using the 
!SizeTieredCompactionStrategy and 0.1 for column families using the 
!LeveledCompactionStrategy.  Note that a change to this property will take 
effect as new sstables are built.
+  * Compression metadata has been moved off-heap in 1.2 (reference)
+  * Partition summary has been reduced in 1.2.5 and moved off-heap in 2.0
+  * Key cache has been serialized off-heap in 2.0 (reference)
-  * Disk space usage in Cassandra can vary fairly suddenly over time. If you 
have significant amounts of data such that available disk space is not 
significantly higher than usage, consider:
-   * Compaction of a column family can up to double the disk space used by 
said column family (in the case of a major compaction and no deletions). If 
your data is predominantly made up of a single, or a select few, column 
families then doubling the disk space for a CF may be a significant amount 
compared to your total disk usage.
-   * Repair operations can increase disk space demands (particularly in 0.6, 
less so in 0.7; TODO: provide actual maximum growth and what it depends on).
-  * As your data set becomes larger and larger (assuming significantly larger 
than memory), you become more and more dependent on caching to elide I/O 
operations. As you plan and test your capacity, keep in mind that:
-   * The cassandra row cache is in the JVM heap and unaffected (remains warm) 
by compactions and repair operations. This is a plus, but the down-side is that 
the row cache is not very memory efficient compared to the operating system 
page cache.
-   * For 0.6.8 and below, the key cache is affected by compaction because it 
is per-sstable, and compaction moves data to new sstables.
-    * Was fixed/improved as of 
[[https://issues.apache.org/jira/browse/CASSANDRA-1878|CASSANDRA-1878]], for 
0.6.9 and 0.7.0.
-   * The operating system's page cache is affected by compaction and repair 
operations. If you are relying on the page cache to keep the active set in 
memory, you may see significant degradation on performance as a result of 
compaction and repair operations.
-    * Potential future improvements: 
[[https://issues.apache.org/jira/browse/CASSANDRA-1470|CASSANDRA-1470]], 
[[https://issues.apache.org/jira/browse/CASSANDRA-1882|CASSANDRA-1882]].
-  * Prior to 0.7.1 (fixed in 
[[https://issues.apache.org/jira/browse/CASSANDRA-1555|CASSANDRA-1555]]), if 
you had column families with more than 143 million row keys in them, bloom 
filter false positive rates would be likely to go up because of implementation 
concerns that limited the maximum size of a bloom filter. See 
[[ArchitectureInternals]] for information on how bloom filters are used. The 
negative effects of hitting this limit is that reads will start taking 
additional seeks to disk as the row count increases. Note that the effect you 
are seeing at any given moment will depend on when compaction was last run, 
because the bloom filter limit is per-sstable. It is an issue for column 
families because after a major compaction, the entire column family will be in 
a single sstable.
-  * Compaction is currently not concurrent, so only a single compaction runs 
at a time. This means that sstable counts may spike during larger compactions 
as several smaller sstables are written while a large compaction is happening. 
This can cause additional seeks on reads.
-   * Potential future improvements: 
[[https://issues.apache.org/jira/browse/CASSANDRA-1876|CASSANDRA-1876]] and 
[[https://issues.apache.org/jira/browse/CASSANDRA-1881|CASSANDRA-1881]]
+  * Parallel compactions as in 
[[https://issues.apache.org/jira/browse/CASSANDRA-2191|CASSANDRA-2191]] and 
[[https://issues.apache.org/jira/browse/CASSANDRA-4310|CASSANDRA-4310]]
-   * Potentially already fixed for 0.8 (todo: go through ticket history and 
make sure what it implies): 
[[https://issues.apache.org/jira/browse/CASSANDRA-2191|CASSANDRA-2191]]
+  * Multi-threaded compaction for high IO hardware (reference)
+  * Virtual nodes to increase the parallelism and reduce the time of 
bootstrapping new nodes
+  * sstable index files are no longer loaded on startup (reference)
+ 
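As a rough illustration of why the bloom_filter_fp_chance defaults above matter for memory, the sketch below applies the textbook formula for an optimally sized Bloom filter (bits per key = -ln(p) / (ln 2)^2). This is an approximation under stated assumptions, not Cassandra's actual sizing code; the function names and the one-billion-key figure are hypothetical.

```python
import math


def bloom_bits_per_key(fp_chance: float) -> float:
    """Bits per key for an optimally sized Bloom filter (textbook formula).

    Assumption: Cassandra's real implementation may round or bucket these
    values differently; treat the result as an estimate only.
    """
    if not 0.0 < fp_chance < 1.0:
        raise ValueError("fp_chance must be strictly between 0 and 1")
    return -math.log(fp_chance) / (math.log(2) ** 2)


def bloom_filter_megabytes(num_keys: int, fp_chance: float) -> float:
    """Approximate Bloom filter size in MiB for num_keys row keys."""
    bits = num_keys * bloom_bits_per_key(fp_chance)
    return bits / 8 / (1024 * 1024)


if __name__ == "__main__":
    keys = 1_000_000_000  # hypothetical number of row keys on one node
    for label, p in (("SizeTiered default", 0.01), ("Leveled default", 0.1)):
        size = bloom_filter_megabytes(keys, p)
        print(f"{label}: fp_chance={p} -> roughly {size:.0f} MiB")
```

Under this formula, moving from 0.01 to 0.1 halves the filter size, at the cost of more false positives and therefore more unnecessary sstable reads.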
+ Other points to consider:
+ 
+  * Moving data structures off-heap means that the structure is kept 
serialized off-heap until it is needed.  Then it is deserialized temporarily 
in the heap and is then garbage collected when it is no longer used.
+  * Disk space usage in Cassandra can vary over time:
+   * Compaction: with the !SizeTieredCompactionStrategy, compaction can up to 
double the disk space used.  The !LeveledCompactionStrategy usually requires 
only about 10% overhead (see 
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra).
+   * Repair: repair operations can increase disk space demands, see 
http://www.datastax.com/dev/blog/advanced-repair-techniques for details and how 
it can be improved.
-  * Consider the choice of file system. Removal of large files is notoriously 
slow and seek bound on e.g. ext2/ext3. Consider xfs or ext4fs. This affects 
background unlink():ing of sstables that happens every now and then, and also 
affects start-up time (if there are sstables pending removal when a node is 
starting up, they are removed as part of the start-up proceess; it may thus be 
detrimental if removing a terabyte of sstables takes an hour (numbers are 
ballparks, not accurately measured and depends on circumstances)).
+  * Consider the choice of file system. Removal of large files is notoriously 
slow and seek bound on e.g. ext2/ext3. Consider xfs or ext4fs. This affects 
background unlink():ing of sstables that happens every now and then, and also 
affects start-up time (if there are sstables pending removal when a node is 
starting up, they are removed as part of the start-up process; it may thus be 
detrimental if removing a terabyte of sstables takes an hour (numbers are 
ballparks, not accurately measured, and depend on circumstances)).
   * Adding nodes is a slow process if each node is responsible for a large 
amount of data. Plan for this; do not try to throw additional hardware at a 
cluster at the last minute.
+  * The operating system's page cache is affected by compaction and repair 
operations. If you are relying on the page cache to keep the active set in 
memory, you may see significant degradation in performance as a result of 
compaction and repair operations.  See the cassandra.yaml for settings to 
reduce this impact.
+  * The partition (or sampled) index entries for each sstable can start to add 
up.  You can reduce the memory usage by tuning the interval at which it 
samples.  The setting is index_interval in cassandra.yaml.  See the comments 
there for more information.
-  * Cassandra will read through sstable index files on start-up, doing what is 
known as "index sampling". This is used to keep a subset (currently and by 
default, 1 out of 100) of keys and and their on-disk location in the index, in 
memory. See [[ArchitectureInternals]]. This means that the larger the index 
files are, the longer it takes to perform this sampling. Thus, for very large 
indexes (typically when you have a very large number of keys) the index 
sampling on start-up may be a significant issue.
-  * A negative side-effect of a large row-cache is start-up time. The periodic 
saving of the row cache information only saves the keys that are cached; the 
data has to be pre-fetched on start-up. On a large data set, this is probably 
going to be seek-bound and the time it takes to warm up the row cache will be 
linear with respect to the row cache size (assuming sufficiently large amounts 
of data that the seek bound I/O is not subject to optimization by disks).
-   * Potential future improvement: 
[[https://issues.apache.org/jira/browse/CASSANDRA-1625|CASSANDRA-1625]].
-  * The total number of rows per node correlates directly with the size of 
bloom filters and sampled index entries. Expect the base memory requirement of 
a node to increase linearly with the number of keys (assuming the average row 
key size remains constant). If you are not using caching at all (e.g. you are 
doing analysis type workloads), expect these two to be the two biggest 
consumers of memory.
-   * You can decrease the memory use due to index sampling by changing the 
index sampling interval in cassandra.yaml
-   * You should soon be able to tweak the bloom filter sizes too once 
[[https://issues.apache.org/jira/browse/CASSANDRA-3497|CASSANDRA-3497]] is done
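
The disk space and index sampling points above lend themselves to back-of-the-envelope arithmetic. The sketch below is a hypothetical planning helper, not part of Cassandra: it takes the overhead figures quoted in this page (up to 100% extra for !SizeTieredCompactionStrategy, about 10% for !LeveledCompactionStrategy) and assumes one sampled index entry per index_interval keys. The function names and the example numbers are illustrative only; check cassandra.yaml for the interval your version actually uses.

```python
# Worst-case extra disk space during compaction, as a fraction of data size.
# Figures are the ones quoted on this page, not measured values.
COMPACTION_OVERHEAD = {
    "SizeTieredCompactionStrategy": 1.00,  # can up to double disk usage
    "LeveledCompactionStrategy": 0.10,     # usually about 10% overhead
}


def compaction_headroom_gb(data_gb: float, strategy: str) -> float:
    """Free disk space (GB) to reserve for compaction of data_gb of data."""
    if strategy not in COMPACTION_OVERHEAD:
        raise ValueError(f"unknown compaction strategy: {strategy}")
    return data_gb * COMPACTION_OVERHEAD[strategy]


def sampled_index_entries(num_keys: int, index_interval: int) -> int:
    """Number of index entries held in memory if 1 in index_interval is sampled."""
    if index_interval <= 0:
        raise ValueError("index_interval must be positive")
    return -(-num_keys // index_interval)  # ceiling division on integers


if __name__ == "__main__":
    data_gb = 1000.0  # hypothetical: 1 TB of data on the node
    for name in COMPACTION_OVERHEAD:
        print(f"{name}: reserve about {compaction_headroom_gb(data_gb, name):.0f} GB")

    keys = 500_000_000  # hypothetical key count
    for interval in (128, 256, 512):  # illustrative interval values
        print(f"index_interval={interval}: {sampled_index_entries(keys, interval)} samples")
```

Doubling index_interval halves the number of sampled entries kept in memory, in exchange for scanning more of the on-disk index per lookup.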
  
+ Other references to improvements:
+  * 
[[http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2|Performance
 improvements in Cassandra 1.2]]
+  * 
[[http://www.datastax.com/dev/blog/six-mid-series-changes-to-know-about-in-1-2-x|Six
 mid-series changes in Cassandra 1.2]]
+
