[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117267#comment-14117267
 ] 

Benedict commented on CASSANDRA-7447:
-------------------------------------

It sounds like we're approaching agreement on our terms of disagreement, which 
is great :)

bq. I'm just not sure what is intrinsically less incremental in what I'm 
suggesting. 

The relatively simple (but still time consuming) solution you seem to be 
advocating sounds like one we would obsolete entirely once the full featureset 
I want to deliver is completed, since it is less efficient. It's quite possible 
it wouldn't even translate in the way you imagine, since the indexing is very 
different in the new scheme and so the current approach would not be possible 
(we need to implement some kind of tree structure for search in clustering 
columns, at the very least, since we will not have any of the current KeyCache 
machiner, and limiting ourselves to byte-ordered types permits us to not only 
use but _assume_ especially efficient tries that are both more efficient on 
disk but also algorithmically more efficient for merging on read). But, either 
way, this obsolescence seems to largely get in the way of the real end-state 
we're aiming for, by introducing unnecessary work. I want incremental and 
non-superfluous deliverables en route.

As I stated above I might be willing to aim for a row-oriented approach first 
(although incrementally delivered, so likely not supporting everything in one 
step, simply because this is easier and more digestable) which, once fully 
delivered, would support everything - except this implementation I want to aim 
for would be for byte-ordered clustering columns only. If instead we want to 
support non-byte-ordered, we either have to do more work to deliver an optimal 
result (i.e. a complex hybrid btree/trie-variant), which is distinctly less 
incremental, or deliver a suboptimal result that only exists temporarily until 
optimised, so is wasted effort. Either seem to run the risk of a reduction in 
final featureset when done earlier in the process, simply by consuming precious 
time, so I prefer to implement these features much later, if at all (like I 
said, I strongly favour dropping support altogether), since at last minute we 
can simply deliver an average/poor implementation if necessary. Especially 
given we know of no use cases outside of CQL types, to my knowledge, and we'll 
be eliminating those from the equation.

In summary, we need to agree on your points (1) and (2) with my new comments, 
but also need (1a) and (2a):

(1a): Do we want to try and eliminate this early in the process, or should we 
aim our end state for eliminating it and only fill in our suboptimal solution 
we'd deliver earlier if we are running out of time?
(2a): Does 'everything' include non-byte-ordered types, or if we can try to EOL 
them. Bearing in mind the cost if this needs to be aborted is not dramatic, but 
the payoff is potentially huge by simplifying many chunks of low-level codebase 
(and permitting pretty substantial computational benefits).


> New sstable format with support for columnar layout
> ---------------------------------------------------
>
>                 Key: CASSANDRA-7447
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7447
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>              Labels: performance, storage
>             Fix For: 3.0
>
>         Attachments: ngcc-storage.odp
>
>
> h2. Storage Format Proposal
> C* has come a long way over the past few years, and unfortunately our storage 
> format hasn't kept pace with the data models we are now encouraging people to 
> utilise. This ticket proposes a collections of storage primitives that can be 
> combined to serve these data models more optimally.
> It would probably help to first state the data model at the most abstract 
> level. We have a fixed three-tier structure: We have the partition key, the 
> clustering columns, and the data columns. Each have their own characteristics 
> and so require their own specialised treatment.
> I should note that these changes will necessarily be delivered in stages, and 
> that we will be making some assumptions about what the most useful features 
> to support initially will be. Any features not supported will require 
> sticking with the old format until we extend support to all C* functionality.
> h3. Partition Key
> * This really has two components: the partition, and the value. Although the 
> partition is primarily used to distribute across nodes, it can also be used 
> to optimise lookups for a given key within a node
> * Generally partitioning is by hash, and for the moment I want to focus this 
> ticket on the assumption that this is the case
> * Given this, it makes sense to optimise our storage format to permit O(1) 
> searching of a given partition. It may be possible to achieve this with 
> little overhead based on the fact we store the hashes in order and know they 
> are approximately randomly distributed, as this effectively forms an 
> immutable contiguous split-ordered list (see Shalev/Shavit, or 
> CASSANDRA-7282), so we only need to store an amount of data based on how 
> imperfectly distributed the hashes are, or at worst a single value per block.
> * This should completely obviate the need for a separate key-cache, which 
> will be relegated to supporting the old storage format only
> h3. Primary Key / Clustering Columns
> * Given we have a hierarchical data model, I propose the use of a 
> cache-oblivious trie
> * The main advantage of the trie is that it is extremely compact and 
> _supports optimally efficient merges with other tries_ so that we can support 
> more efficient reads when multiple sstables are touched
> * The trie will be preceded by a small amount of related data; the full 
> partition key, a timestamp epoch (for offset-encoding timestamps) and any 
> other partition level optimisation data, such as (potentially) a min/max 
> timestamp to abort merges earlier
> * Initially I propose to limit the trie to byte-order comparable data types 
> only (the number of which we can expand through translations of the important 
> types that are not currently)
> * Crucially the trie will also encapsulate any range tombstones, so that 
> these are merged early in the process and avoids re-iterating the same data
> * Results in true bidirectional streaming without having to read entire range 
> into memory
> h3. Values
> There are generally two approaches to storing rows of data: columnar, or 
> row-oriented. The above two data structures can be combined with a value 
> storage scheme that is based on either. However, given the current model we 
> have of reading large 64Kb blocks for any read, I am inclined to focus on 
> columnar support first, as this delivers order-of-magnitude benefits to those 
> users with the correct workload, while for most workloads our 64Kb blocks are 
> large enough to store row-oriented data in a column-oriented fashion without 
> any performance degradation (I'm happy to consign very large row support to 
> phase 2). 
> Since we will most likely target both behaviours eventually, I am currently 
> inclined to suggest that static columns, sets and maps be targeted for a 
> row-oriented release, as they don't naturally fit in a columnar layout 
> without secondary heap-blocks. This may be easier than delivering heap-blocks 
> also, as it keeps both implementations relatively clean. This is certainly 
> open to debate, and I have no doubt there will be conflicting opinions here.
> Focusing on our columnar layout, the goals are:
> * Support sparse and dense column storage
> * Efficient compression of tombstones, timestamps, ttls, etc. into near-zero 
> space based on offset/delta/bitmap encoding
> * Normalisation of column names once per file
> * Per-file block-layout index, defining how each block's data is encoded, so 
> we can index directly within a block for dense fields (permitting more 
> efficient page cache utilisation)
> * Configurable grouping of fields per block, so that we can get closer to 
> row-oriented or column-oriented performance, based on user goals
> I have attached my slides from the ngcc for reference.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to