Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "LargeDataSetConsiderations" page has been changed by jeremyhanna: https://wiki.apache.org/cassandra/LargeDataSetConsiderations?action=diff&rev1=21&rev2=22 Comment: Revising for current Cassandra features and considerations. = Using Cassandra for large data sets (lots of data per node) = - This page aims to to give some advise as to the issues one may need to consider when using Cassandra for large data sets (meaning hundreds of gigabytes or terabytes per node). The intent is not to make original claims, but to collect in one place some issues that are operationally relevant. Other parts of the wiki are highly recommended in order to fully understand the issues involved. + This page aims to to give some advice as to the issues one may need to consider when using Cassandra for large data sets (meaning hundreds of gigabytes or terabytes per node). The intent is not to make original claims, but to collect in one place some issues that are operationally relevant. Other parts of the wiki are highly recommended in order to fully understand the issues involved. - This is a work in progress. If you find information out of date (e.g., a JIRA ticket referenced has been resolved but this document has not been updated), please help by editing or e-mail:ing cassandra-user. + This is a work in progress. If you find information out of date (e.g., a JIRA ticket referenced has been resolved but this document has not been updated), please help by editing or e-mailing cassandra-user. - Note that not all of these issues are specific to Cassandra (for example, any storage system is subject to the trade-offs of cache sizes relative to active set size, and IOPS will always be strongly correlated with the percentage of requests that penetrate caching layers). + Note that not all of these issues are specific to Cassandra. For example, any storage system is subject to the trade-offs of cache sizes relative to active set size, and IOPS will always be strongly correlated with the percentage of requests that penetrate caching layers. Also of note, the more data stored per node, the more data will have to be streamed in bootstrapping new or replacement nodes. - Unless otherwise noted, the points refer to Cassandra 0.7 and above. + '''Assumes Cassandra 1.2+''' + Significant work has been done to allow for more data to be stored on each node: + * Row cache can be serialized off-heap. Keep in mind that the existing row cache implementation still maintains off-heap entire rows of data and when that row is called for those rows are deserialized into the heap. Alternate row cache implementations are being worked on to make the row cache more generally useful, see [[https://issues.apache.org/jira/browse/CASSANDRA-2864|CASSANDRA-2864]]. + * Bloom filters: + * Moved off-heap [[https://issues.apache.org/jira/browse/CASSANDRA-4865|CASSANDRA-4865]] + * Tunable via bloom_filter_fp_chance. Starting in Cassandra 1.2 there are better defaults: 0.01 for column families using the !SizeTieredCompactionStrategy and 0.1 for column families using the !LeveledCompactionStrategy. Note that a change to this property will take effect as new sstables are built. + * Compression metadata has been moved off-heap in 1.2 (reference) + * Partition summary has been reduced in 1.2.5 and moved off-heap in 2.0 + * Key cache has been serialized off-heap in 2.0 (reference) - * Disk space usage in Cassandra can vary fairly suddenly over time. If you have significant amounts of data such that available disk space is not significantly higher than usage, consider: - * Compaction of a column family can up to double the disk space used by said column family (in the case of a major compaction and no deletions). If your data is predominantly made up of a single, or a select few, column families then doubling the disk space for a CF may be a significant amount compared to your total disk usage. - * Repair operations can increase disk space demands (particularly in 0.6, less so in 0.7; TODO: provide actual maximum growth and what it depends on). - * As your data set becomes larger and larger (assuming significantly larger than memory), you become more and more dependent on caching to elide I/O operations. As you plan and test your capacity, keep in mind that: - * The cassandra row cache is in the JVM heap and unaffected (remains warm) by compactions and repair operations. This is a plus, but the down-side is that the row cache is not very memory efficient compared to the operating system page cache. - * For 0.6.8 and below, the key cache is affected by compaction because it is per-sstable, and compaction moves data to new sstables. - * Was fixed/improved as of [[https://issues.apache.org/jira/browse/CASSANDRA-1878|CASSANDRA-1878]], for 0.6.9 and 0.7.0. - * The operating system's page cache is affected by compaction and repair operations. If you are relying on the page cache to keep the active set in memory, you may see significant degradation on performance as a result of compaction and repair operations. - * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1470|CASSANDRA-1470]], [[https://issues.apache.org/jira/browse/CASSANDRA-1882|CASSANDRA-1882]]. - * Prior to 0.7.1 (fixed in [[https://issues.apache.org/jira/browse/CASSANDRA-1555|CASSANDRA-1555]]), if you had column families with more than 143 million row keys in them, bloom filter false positive rates would be likely to go up because of implementation concerns that limited the maximum size of a bloom filter. See [[ArchitectureInternals]] for information on how bloom filters are used. The negative effects of hitting this limit is that reads will start taking additional seeks to disk as the row count increases. Note that the effect you are seeing at any given moment will depend on when compaction was last run, because the bloom filter limit is per-sstable. It is an issue for column families because after a major compaction, the entire column family will be in a single sstable. - * Compaction is currently not concurrent, so only a single compaction runs at a time. This means that sstable counts may spike during larger compactions as several smaller sstables are written while a large compaction is happening. This can cause additional seeks on reads. - * Potential future improvements: [[https://issues.apache.org/jira/browse/CASSANDRA-1876|CASSANDRA-1876]] and [[https://issues.apache.org/jira/browse/CASSANDRA-1881|CASSANDRA-1881]] + * Parallel compactions as in [[https://issues.apache.org/jira/browse/CASSANDRA-2191|CASSANDRA-2191]] and [[https://issues.apache.org/jira/browse/CASSANDRA-4310|CASSANDRA-4310]] - * Potentially already fixed for 0.8 (todo: go through ticket history and make sure what it implies): [[https://issues.apache.org/jira/browse/CASSANDRA-2191|CASSANDRA-2191]] + * Multi-threaded compaction for high IO hardware (reference) + * Virtual nodes to increase the parallelism and reduce the time of bootstrapping new nodes + * sstable index files are no longer loaded on startup (reference) + + Other points to consider: + + * Moving data structures off-heap means that the structure gets serialized off-heap until it is needed. Then it is deserialized temporarily in the heap and is the garbage collected when it is no longer used. + * Disk space usage in Cassandra can vary over time: + * Compaction: with the !SizeTieredCompactionStrategy, compaction can up to double the disk space used. With the !LeveledCompactionStrategy, usually only requires about 10% overhead (see http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra). + * Repair: repair operations can increase disk space demands, see http://www.datastax.com/dev/blog/advanced-repair-techniques for details and how it can be improved. - * Consider the choice of file system. Removal of large files is notoriously slow and seek bound on e.g. ext2/ext3. Consider xfs or ext4fs. This affects background unlink():ing of sstables that happens every now and then, and also affects start-up time (if there are sstables pending removal when a node is starting up, they are removed as part of the start-up proceess; it may thus be detrimental if removing a terabyte of sstables takes an hour (numbers are ballparks, not accurately measured and depends on circumstances)). + * Consider the choice of file system. Removal of large files is notoriously slow and seek bound on e.g. ext2/ext3. Consider xfs or ext4fs. This affects background unlink():ing of sstables that happens every now and then, and also affects start-up time (if there are sstables pending removal when a node is starting up, they are removed as part of the start-up procees; it may thus be detrimental if removing a terabyte of sstables takes an hour (numbers are ballparks, not accurately measured and depends on circumstances)). * Adding nodes is a slow process if each node is responsible for a large amount of data. Plan for this; do not try to throw additional hardware at a cluster at the last minute. + * The operating system's page cache is affected by compaction and repair operations. If you are relying on the page cache to keep the active set in memory, you may see significant degradation on performance as a result of compaction and repair operations. See the cassandra.yaml for settings to reduce this impact. + * The partition (or sampled) index entries for each sstable can start to add up. You can reduce the memory usage by tuning the interval that it samples at. The setting is index_interval the cassandra.yaml. See the comments there for more information. - * Cassandra will read through sstable index files on start-up, doing what is known as "index sampling". This is used to keep a subset (currently and by default, 1 out of 100) of keys and and their on-disk location in the index, in memory. See [[ArchitectureInternals]]. This means that the larger the index files are, the longer it takes to perform this sampling. Thus, for very large indexes (typically when you have a very large number of keys) the index sampling on start-up may be a significant issue. - * A negative side-effect of a large row-cache is start-up time. The periodic saving of the row cache information only saves the keys that are cached; the data has to be pre-fetched on start-up. On a large data set, this is probably going to be seek-bound and the time it takes to warm up the row cache will be linear with respect to the row cache size (assuming sufficiently large amounts of data that the seek bound I/O is not subject to optimization by disks). - * Potential future improvement: [[https://issues.apache.org/jira/browse/CASSANDRA-1625|CASSANDRA-1625]]. - * The total number of rows per node correlates directly with the size of bloom filters and sampled index entries. Expect the base memory requirement of a node to increase linearly with the number of keys (assuming the average row key size remains constant). If you are not using caching at all (e.g. you are doing analysis type workloads), expect these two to be the two biggest consumers of memory. - * You can decrease the memory use due to index sampling by changing the index sampling interval in cassandra.yaml - * You should soon be able to tweak the bloom filter sizes too once [[https://issues.apache.org/jira/browse/CASSANDRA-3497|CASSANDRA-3497]] is done + Other references to improvements: + * [[http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2|Performance improvements in Cassandra 1.2]] + * [[http://www.datastax.com/dev/blog/six-mid-series-changes-to-know-about-in-1-2-x|Six mid-series changes in Cassandra 1.2]] +