Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "LargeDataSetConsiderations" page has been changed by jeremyhanna:
https://wiki.apache.org/cassandra/LargeDataSetConsiderations?action=diff&rev1=22&rev2=23

  Note that not all of these issues are specific to Cassandra. For example, any storage system is subject to the trade-offs of cache sizes relative to active set size, and IOPS will always be strongly correlated with the percentage of requests that penetrate caching layers. Also of note, the more data stored per node, the more data will have to be streamed in bootstrapping new or replacement nodes.

- '''Assumes Cassandra 1.2+'''
+ ==== Assumes Cassandra 1.2+ ====

- Significant work has been done to allow for more data to be stored on each node:
+ ==== Significant work has been done to allow for more data to be stored on each node: ====
-  * Row cache can be serialized off-heap. Keep in mind that the existing row cache implementation still maintains off-heap entire rows of data and when that row is called for those rows are deserialized into the heap. Alternate row cache implementations are being worked on to make the row cache more generally useful, see [[https://issues.apache.org/jira/browse/CASSANDRA-2864|CASSANDRA-2864]].
+  * Row cache can be serialized off-heap. Keep in mind that this still stores entire rows off-heap and when they are used the full row is temporarily deserialized in the heap. Alternate row cache implementations are being worked on to make the row cache more generally useful, see [[https://issues.apache.org/jira/browse/CASSANDRA-2864|CASSANDRA-2864]].
   * Bloom filters:
    * Moved off-heap [[https://issues.apache.org/jira/browse/CASSANDRA-4865|CASSANDRA-4865]]
    * Tunable via bloom_filter_fp_chance. Starting in Cassandra 1.2 there are better defaults: 0.01 for column families using the !SizeTieredCompactionStrategy and 0.1 for column families using the !LeveledCompactionStrategy. Note that a change to this property will take effect as new sstables are built.
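
As a rough illustration of the bloom_filter_fp_chance trade-off mentioned above, the sketch below applies the textbook sizing formula for an optimally configured Bloom filter (bits per key = -ln(p) / (ln 2)^2). This is an assumption for illustration only: it is not Cassandra's actual sizing code, and the function names `bloom_bits_per_key` and `bloom_size_mib` are hypothetical. Real off-heap usage may differ because of rounding and per-sstable overhead.

```python
import math


def bloom_bits_per_key(fp_chance: float) -> float:
    """Bits per key for an ideal Bloom filter with the given false-positive chance.

    Textbook formula, not taken from Cassandra's source: -ln(p) / (ln 2)^2.
    """
    if not 0.0 < fp_chance < 1.0:
        raise ValueError("fp_chance must be strictly between 0 and 1")
    return -math.log(fp_chance) / (math.log(2) ** 2)


def bloom_size_mib(num_keys: int, fp_chance: float) -> float:
    """Approximate filter size in MiB for num_keys keys (idealized estimate)."""
    total_bits = num_keys * bloom_bits_per_key(fp_chance)
    return total_bits / 8 / (1024 ** 2)


if __name__ == "__main__":
    # The two defaults quoted above: 0.01 (size-tiered) and 0.1 (leveled).
    for p in (0.01, 0.1):
        print(f"fp_chance={p}: {bloom_bits_per_key(p):.2f} bits/key, "
              f"{bloom_size_mib(1_000_000_000, p):.0f} MiB per billion keys")
```

Under this idealized model, 0.01 costs roughly 9.6 bits per key and 0.1 roughly 4.8, so the looser default needs about half the memory for the same number of keys.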
@@ -23, +23 @@
   * Virtual nodes to increase the parallelism and reduce the time of bootstrapping new nodes
   * sstable index files are no longer loaded on startup (reference)

- Other points to consider:
+ '''On moving data structures off-heap'''
+  * Moving data structures off-heap means that the structure is kept serialized off-heap until it is needed. It is then deserialized temporarily in the heap and garbage collected when it is no longer used.

-  * Moving data structures off-heap means that the structure is kept serialized off-heap until it is needed. It is then deserialized temporarily in the heap and garbage collected when it is no longer used.
+ ==== Other points to consider: ====
+  * Disk space usage in Cassandra can vary over time:
    * Compaction: with the !SizeTieredCompactionStrategy, compaction can up to double the disk space used. The !LeveledCompactionStrategy usually requires only about 10% overhead (see http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra).
    * Repair: repair operations can increase disk space demands; see http://www.datastax.com/dev/blog/advanced-repair-techniques for details and how this can be improved.
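
To make the compaction figures above concrete, here is a minimal sketch that estimates peak disk usage per node from the two rules of thumb quoted in the page (up to 2x for size-tiered, about 10% extra for leveled). The function name `peak_disk_gb` and the overhead table are hypothetical helpers for illustration, not part of Cassandra or its tooling, and the estimate ignores repair, snapshots, and other transient usage.

```python
# Worst-case extra space during compaction, as a fraction of live data.
# Values come from the rules of thumb quoted above, not from measurements.
COMPACTION_OVERHEAD = {
    "SizeTieredCompactionStrategy": 1.00,  # can up to double disk usage
    "LeveledCompactionStrategy": 0.10,     # usually about 10% overhead
}


def peak_disk_gb(live_data_gb: float, strategy: str) -> float:
    """Estimate peak disk usage (GB) for a node holding live_data_gb of data."""
    if live_data_gb < 0:
        raise ValueError("live_data_gb must be non-negative")
    try:
        overhead = COMPACTION_OVERHEAD[strategy]
    except KeyError:
        raise ValueError(f"unknown compaction strategy: {strategy}") from None
    return live_data_gb * (1.0 + overhead)


if __name__ == "__main__":
    for name in COMPACTION_OVERHEAD:
        print(f"{name}: plan for about {peak_disk_gb(500, name):.0f} GB "
              f"to hold 500 GB of live data")
```

A node sized this way for size-tiered compaction needs roughly twice its live data in free capacity, which is one reason the amount of data stored per node matters for provisioning.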