[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benedict updated CASSANDRA-7447: -------------------------------- Attachment: (was: storage_format.pdf) > New sstable format with support for columnar layout > --------------------------------------------------- > > Key: CASSANDRA-7447 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7447 > Project: Cassandra > Issue Type: Improvement > Components: Core > Reporter: Benedict > Assignee: Benedict > Labels: performance, storage > Fix For: 3.0 > > Attachments: ngcc-storage.odp > > > h2. Storage Format Proposal > C* has come a long way over the past few years, and unfortunately our storage > format hasn't kept pace with the data models we are now encouraging people to > utilise. This ticket proposes a collections of storage primitives that can be > combined to serve these data models more optimally. > It would probably help to first state the data model at the most abstract > level. We have a fixed three-tier structure: We have the partition key, the > clustering columns, and the data columns. Each have their own characteristics > and so require their own specialised treatment. > I should note that these changes will necessarily be delivered in stages, and > that we will be making some assumptions about what the most useful features > to support initially will be. Any features not supported will require > sticking with the old format until we extend support to all C* functionality. > h3. Partition Key > * This really has two components: the partition, and the value. Although the > partition is primarily used to distribute across nodes, it can also be used > to optimise lookups for a given key within a node > * Generally partitioning is by hash, and for the moment I want to focus this > ticket on the assumption that this is the case > * Given this, it makes sense to optimise our storage format to permit O(1) > searching of a given partition. It may be possible to achieve this with > little overhead based on the fact we store the hashes in order and know they > are approximately randomly distributed, as this effectively forms an > immutable contiguous split-ordered list (see Shalev/Shavit, or > CASSANDRA-7282), so we only need to store an amount of data based on how > imperfectly distributed the hashes are, or at worst a single value per block. > * This should completely obviate the need for a separate key-cache, which > will be relegated to supporting the old storage format only > h3. Primary Key / Clustering Columns > * Given we have a hierarchical data model, I propose the use of a > cache-oblivious trie > * The main advantage of the trie is that it is extremely compact and > _supports optimally efficient merges with other tries_ so that we can support > more efficient reads when multiple sstables are touched > * The trie will be preceded by a small amount of related data; the full > partition key, a timestamp epoch (for offset-encoding timestamps) and any > other partition level optimisation data, such as (potentially) a min/max > timestamp to abort merges earlier > * Initially I propose to limit the trie to byte-order comparable data types > only (the number of which we can expand through translations of the important > types that are not currently) > * Crucially the trie will also encapsulate any range tombstones, so that > these are merged early in the process and avoids re-iterating the same data > * Results in true bidirectional streaming without having to read entire range > into memory > h3. Values > There are generally two approaches to storing rows of data: columnar, or > row-oriented. The above two data structures can be combined with a value > storage scheme that is based on either. However, given the current model we > have of reading large 64Kb blocks for any read, I am inclined to focus on > columnar support first, as this delivers order-of-magnitude benefits to those > users with the correct workload, while for most workloads our 64Kb blocks are > large enough to store row-oriented data in a column-oriented fashion without > any performance degradation (I'm happy to consign very large row support to > phase 2). > Since we will most likely target both behaviours eventually, I am currently > inclined to suggest that static columns, sets and maps be targeted for a > row-oriented release, as they don't naturally fit in a columnar layout > without secondary heap-blocks. This may be easier than delivering heap-blocks > also, as it keeps both implementations relatively clean. This is certainly > open to debate, and I have no doubt there will be conflicting opinions here. > Focusing on our columnar layout, the goals are: > * Support sparse and dense column storage > * Efficient compression of tombstones, timestamps, ttls, etc. into near-zero > space based on offset/delta/bitmap encoding > * Normalisation of column names once per file > * Per-file block-layout index, defining how each block's data is encoded, so > we can index directly within a block for dense fields (permitting more > efficient page cache utilisation) > * Configurable grouping of fields per block, so that we can get closer to > row-oriented or column-oriented performance, based on user goals > I have attached my slides from the ngcc for reference. -- This message was sent by Atlassian JIRA (v6.3.4#6332)