[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976469#action_12976469 ]
Stu Hood commented on CASSANDRA-674: ------------------------------------ Thinking about this issue again. Dumped some thoughts I had on paper to the wiki: http://wiki.apache.org/cassandra/FileFormatDesignDoc . > New SSTable Format > ------------------ > > Key: CASSANDRA-674 > URL: https://issues.apache.org/jira/browse/CASSANDRA-674 > Project: Cassandra > Issue Type: Improvement > Components: Core > Reporter: Stu Hood > Fix For: 0.8 > > Attachments: 674-v1.diff, perf-674-v1.txt, > perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt > > > Various tickets exist due to limitations in the SSTable file format, > including #16, #47 and #328. Attached is a proposed design/implementation of > a new file format for SSTables that addresses a few of these limitations. The > implementation has a bunch of issues/fixmes, which I'll describe in the > comments. > The file format is described in the javadoc for the o.a.c.io.SSTableWriter > class, but briefly: > * Blocks are opaque (except for their header) so that they can be > compressed. The index file contains an entry for the first key in every > Block. Blocks contain Slices. > * Slices are series of columns with the same parents and (deletion) > metadata. They can be used to represent ColumnFamilies or SuperColumns (or a > slice of columns at any other depth). A single CF can be split across > multiple Slices, which can be split across multiple blocks. > * Neither Slices nor Blocks have a fixed size or maximum length, but they > each have target lengths which can be stretched and broken by very large > columns. > The most interesting concepts from this patch are: > * Block compression is possible (currently using GZIP, which has one bug > mentioned in the comments), > * Compaction involves merging intersecting Slices from input SSTables. Since > large rows will be broken down into multiple slices, only the portions of > rows that intersect between tables need to be > deserialized/merged/held-in-memory, > * Indexes for individual rows are gone, since the global index allows random > access to the middle of column families that span Blocks, and Slices allow > batches of columns to be skipped within a Block. > * Bloom filters for individual rows are gone, and the global filter contains > ColumnKeys instead, meaning that a query for a column that doesn't exist in a > row that does will often not need to seek to the row. > * Metadata (deletion/gc time) and ColumnKeys (key, colname1, colname2...) > for columns are defined recursively, so deeply nested slices are possible, > * Slices representing a single parent (CF, SC, etc) can have different > Metadata, meaning that a tombstone Slice from d-f could sit between Slices > containing columns a-c and g-h. This allows for eventually consistent range > deletes of columns. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.