[jira] [Commented] (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081105#comment-13081105 ]

Stu Hood commented on CASSANDRA-674:
------------------------------------

bq. Is replicate-on-write=disabled why uncompressed went from the highest latency to the lowest?

Yes: replicate-on-write triggers a huge number of reads, which are much more expensive in trunk, due to 2319 not being included.

> New SSTable Format
> ------------------
>
> Key: CASSANDRA-674
> URL: https://issues.apache.org/jira/browse/CASSANDRA-674
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Stu Hood
> Fix For: 1.0
>
> Attachments: 674-v1.diff, 674-v2.tgz, 674-v3.tgz, 674-ycsb.log, trunk-ycsb.log
>
>
> Various tickets exist due to limitations in the SSTable file format, including #16, #47 and #328. Attached is a proposed design/implementation of a new file format for SSTables that addresses a few of these limitations.
> This v2 implementation is not ready for serious use: see comments for remaining issues. It is roughly the format described here: http://wiki.apache.org/cassandra/FileFormatDesignDoc

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080929#comment-13080929 ]

Chris Burroughs commented on CASSANDRA-674:
-------------------------------------------

Is replicate-on-write=disabled why uncompressed went from the highest latency to the lowest?
[jira] [Commented] (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080757#comment-13080757 ]

Stu Hood commented on CASSANDRA-674:
------------------------------------

I reran the test mentioned in [#comment-13054228] with replicate-on-write disabled, which makes for a much more fair comparison (trunk/47 require 2 seeks to miss for a column, and 3 to hit). This version of trunk also includes CASSANDRA-47 snappy compression.

|| build || disk volume (bytes) || bytes per column || runtime (s) || throughput (ops/s) || avg read ms || 99th % read ms ||
| trunk - uncompressed | 16,713,328,798 | 66.8 | 6154 | 40620 | 2.54 | 6 |
| trunk - gz 6 * | 2,747,319,000 | 10.98 | - | - | - | - |
| trunk - [snappy|https://issues.apache.org/jira/browse/CASSANDRA-47] | 4,356,461,652 | 17.4 | 7906 | 31618 | 4.64 | 15 |
| 674+2319 | 2,675,888,207 | 10.7 | 7703 | 32454 | 3.04 | 10 |

\* _trunk - gz 6_ is the size of compressing the data directory of the trunk result at GZIP level 6

In this workload, we're reading from the tail of the row, which means that CASSANDRA-47 needs to decode two blocks per read (one for the row index at the head of the row, and one for the columns at the tail).
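As a reading aid for the benchmark table above, here is a minimal sketch that derives two quantities the table implies but does not state: the on-disk size reduction of each build relative to uncompressed trunk, and the column count implied by dividing disk volume by bytes per column. The figures are copied from the table; the variable names are illustrative only.

```python
# Figures copied from the benchmark table above (disk volume in bytes).
disk_volume = {
    "trunk - uncompressed": 16_713_328_798,
    "trunk - gz 6": 2_747_319_000,
    "trunk - snappy": 4_356_461_652,
    "674+2319": 2_675_888_207,
}
bytes_per_column = {
    "trunk - uncompressed": 66.8,
    "trunk - gz 6": 10.98,
    "trunk - snappy": 17.4,
    "674+2319": 10.7,
}

baseline = disk_volume["trunk - uncompressed"]

# Size reduction relative to uncompressed trunk (higher means smaller on disk).
ratios = {build: baseline / size for build, size in disk_volume.items()}

# Column count implied by each row of the table: disk volume / bytes per column.
implied_columns = {
    build: disk_volume[build] / bytes_per_column[build] for build in disk_volume
}

for build in disk_volume:
    print(f"{build:22s} {ratios[build]:5.2f}x  ~{implied_columns[build] / 1e6:.0f}M columns")
```

Every row implies roughly 250 million columns, so the four builds are being compared on the same data set. Snappy on trunk comes out at a bit under 4x smaller than uncompressed, while 674+2319 and the gz 6 reference both land slightly above 6x.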
[jira] [Commented] (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062662#comment-13062662 ]

Stu Hood commented on CASSANDRA-674:
------------------------------------

I've posted the slightly-divergent branch of YCSB I used for this workload at https://github.com/stuhood/YCSB/tree/monotonic-timeseries
[jira] [Commented] (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054287#comment-13054287 ]

Stu Hood commented on CASSANDRA-674:
------------------------------------

To clarify, I included the "trunk gz 6" result since it is essentially a lower bound for block-based compression. On the other hand, there is some low-hanging fruit that could decrease the size of the 674+2319 result by another 1 to 1.5 bytes per column.
[jira] Commented: (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989112#comment-12989112 ]

Stu Hood commented on CASSANDRA-674:
------------------------------------

One of the key blockers is implementing rebuilding of SSTables post-streaming. Based on an IRC conversation yesterday, the smoothest way to support streaming of older SSTable versions is to turn what is now the SSTableWriter.Builder object into an abstract base class and subclass it: I'll probably try to do this in a separate ticket.

> New SSTable Format
> ------------------
>
> Key: CASSANDRA-674
> URL: https://issues.apache.org/jira/browse/CASSANDRA-674
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Stu Hood
> Fix For: 0.8
>
> Attachments: 674-v1.diff, 674-v2.tgz, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>
> Various tickets exist due to limitations in the SSTable file format, including #16, #47 and #328. Attached is a proposed design/implementation of a new file format for SSTables that addresses a few of these limitations.
> This v2 implementation is not ready for serious use: see comments for remaining issues. It is roughly the format described here: http://wiki.apache.org/cassandra/FileFormatDesignDoc

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981109#action_12981109 ]

Holden Robbins commented on CASSANDRA-674:
------------------------------------------

Feel free to tell me I'm off-base here, but what about doing something super simple like storing the segment as compressed and un-compressing when it's accessed on disk. Compaction process can possibly clean up uncompressed segments?

I'm thinking this would solve my particular use case well (log data) since our requirements are to store a large amount of data but the majority of the reads will only be on a small subset of recently inserted data.

If it sounds like a decent approach I'll be happy to put together a patch.

> New SSTable Format
> ------------------
>
> Key: CASSANDRA-674
> URL: https://issues.apache.org/jira/browse/CASSANDRA-674
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Stu Hood
> Fix For: 0.8
>
> Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>
> Various tickets exist due to limitations in the SSTable file format, including #16, #47 and #328. Attached is a proposed design/implementation of a new file format for SSTables that addresses a few of these limitations. The implementation has a bunch of issues/fixmes, which I'll describe in the comments.
> The file format is described in the javadoc for the o.a.c.io.SSTableWriter class, but briefly:
> * Blocks are opaque (except for their header) so that they can be compressed. The index file contains an entry for the first key in every Block. Blocks contain Slices.
> * Slices are series of columns with the same parents and (deletion) metadata. They can be used to represent ColumnFamilies or SuperColumns (or a slice of columns at any other depth). A single CF can be split across multiple Slices, which can be split across multiple blocks.
> * Neither Slices nor Blocks have a fixed size or maximum length, but they each have target lengths which can be stretched and broken by very large columns.
> The most interesting concepts from this patch are:
> * Block compression is possible (currently using GZIP, which has one bug mentioned in the comments),
> * Compaction involves merging intersecting Slices from input SSTables. Since large rows will be broken down into multiple slices, only the portions of rows that intersect between tables need to be deserialized/merged/held-in-memory,
> * Indexes for individual rows are gone, since the global index allows random access to the middle of column families that span Blocks, and Slices allow batches of columns to be skipped within a Block.
> * Bloom filters for individual rows are gone, and the global filter contains ColumnKeys instead, meaning that a query for a column that doesn't exist in a row that does will often not need to seek to the row.
> * Metadata (deletion/gc time) and ColumnKeys (key, colname1, colname2...) for columns are defined recursively, so deeply nested slices are possible,
> * Slices representing a single parent (CF, SC, etc) can have different Metadata, meaning that a tombstone Slice from d-f could sit between Slices containing columns a-c and g-h. This allows for eventually consistent range deletes of columns.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
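The last bullet of the quoted description (a tombstone Slice for d-f sitting between live Slices for a-c and g-h) can be illustrated with a small sketch. This is not the patch's implementation: the `Slice` class, its field names, and the rule that a tombstone wins a timestamp tie are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Slice:
    """Hypothetical model: a run of columns sharing a parent and deletion metadata."""
    start: str                          # first column name covered (inclusive)
    end: str                            # last column name covered (inclusive)
    deleted_at: Optional[int] = None    # tombstone timestamp, None if the slice is live
    columns: Dict[str, tuple] = field(default_factory=dict)  # name -> (value, timestamp)


def read_column(slices: List[Slice], name: str):
    """Return the live value for `name`, or None if absent or shadowed by a tombstone."""
    best = None           # newest (value, timestamp) seen so far
    newest_delete = -1    # newest tombstone timestamp covering `name`
    for s in slices:
        if not (s.start <= name <= s.end):
            continue      # slice does not cover this column, skip it entirely
        if s.deleted_at is not None:
            newest_delete = max(newest_delete, s.deleted_at)
        if name in s.columns:
            value, ts = s.columns[name]
            if best is None or ts > best[1]:
                best = (value, ts)
    # Assumption: a tombstone with an equal timestamp shadows the column.
    if best is None or best[1] <= newest_delete:
        return None
    return best[0]


# An older SSTable holds columns a..h; a newer one carries a range delete of d-f.
older = [Slice("a", "h", columns={c: (c.upper(), 1) for c in "abcdefgh"})]
newer = [
    Slice("a", "c", columns={"a": ("A2", 2)}),
    Slice("d", "f", deleted_at=2),      # tombstone Slice: no columns, only metadata
    Slice("g", "h"),
]
merged = older + newer
```

Reading `e` from `merged` yields nothing because the d-f tombstone is newer than the stored column, while `b` and `g` are still served from the older table. A later write to `e` with a higher timestamp would become visible again, which is the eventually consistent behaviour the description refers to.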
[jira] Commented: (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979172#action_12979172 ]

Jonathan Ellis commented on CASSANDRA-674:
------------------------------------------

Here is an interesting paper on a way to get both good inter-record and intra-record data locality: http://scholar.google.com/scholar?q=A+Storage+Model+to+Bridge+the+Processor/Memory+Speed+Gap. Not sure how to apply that to an arbitrarily-large-rows model like ours tho.
[jira] Commented: (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978183#action_12978183 ]

Stu Hood commented on CASSANDRA-674:
------------------------------------

> If we assume we keep the datamodel as is how can we simplify the open ended-ness of your design to make the approach fit our current data model.
To keep this from becoming a point of contention, I'll remove that goal from the design doc: the design so far has this feature as a side effect though.
[jira] Commented: (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978155#action_12978155 ]

Stu Hood commented on CASSANDRA-674:
------------------------------------

>> Indexes for individual rows are gone, since the global index allows random access...
> ^ This wouldn't be useful to cache? in the situation you only want a small range of columns?
That information is outdated: it's from the original implementation. But yes... we will want to keep the index in app memory or page cache.

> Roughly how large would the actual chunk be? This is the unit of deserialization right?
The span is the unit of deserialization (made up of at most 1 chunk per level), and its size would be 100% configurable. The main question is how frequently to index the spans in the sstable index: does each span get an index entry? or only the first span of a row (this is our approach in the current implementation).

> So if you are doing a range query on a very wide row how do you know when to stop processing chunks?
By looking at the global index: if all spans get entries in the index, you know the last interesting span.

> Let me know if this is wrong, but this design opens the cassandra data model to contain arbitrarily nested data.
> Given the complexity we already have surrounding the supercolumn concept do you think this is the right way forward?
The super column concept is only confusing _because_ we call them "supercolumns" rather than just calling them "compound column names". People use them, and the consensus I've heard is that they are useful.

> If we assume we keep the datamodel as is how can we simplify the open ended-ness of your design to make the approach fit our current data model.
The only difference is what you call the structures, and whether you put arbitrary limits on the nesting: I'm open to suggestions.
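The answer above about knowing "the last interesting span" from the global index can be sketched as a pair of binary searches. This is a minimal illustration assuming every span has an index entry keyed by the first (row key, column name) it contains; the function name `spans_for_range` and the tuple layout are hypothetical and not taken from the patch.

```python
from bisect import bisect_right
from typing import List, Tuple

Key = Tuple[str, str]  # (row key, column name), compared lexicographically


def spans_for_range(index_keys: List[Key], start: Key, end: Key) -> range:
    """index_keys[i] is the first key stored in span i, in sorted order.

    Returns the indices of the spans that may hold keys in [start, end].
    """
    # Last span whose first key is <= start: the range may begin inside it.
    first = max(bisect_right(index_keys, start) - 1, 0)
    # Spans whose first key is > end cannot contain anything of interest.
    last = bisect_right(index_keys, end) - 1
    return range(first, last + 1)


# A wide row "row1" split across three spans, followed by the start of "row2".
index = [("row1", "a"), ("row1", "m"), ("row1", "t"), ("row2", "a")]

print(list(spans_for_range(index, ("row1", "n"), ("row1", "p"))))
print(list(spans_for_range(index, ("row1", "c"), ("row1", "u"))))
```

With this layout a range query never needs a sentinel value: the reader stops at the span just before the first index entry that sorts after the end of the range. If only the first span of each row were indexed, the reader would instead have to scan forward within the row.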
[jira] Commented: (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977785#action_12977785 ]

T Jake Luciani commented on CASSANDRA-674:
------------------------------------------

Let me know if this is wrong, but this design opens the cassandra data model to contain arbitrarily nested data.

Given the complexity we already have surrounding the supercolumn concept do you think this is the right way forward? As much as my inner geek wants to build a tree or graph model I don't think the C* community or committers want to take it this way.

If we assume we keep the datamodel as is how can we simplify the open ended-ness of your design to make the approach fit our current data model.
[jira] Commented: (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1296#action_1296 ]

T Jake Luciani commented on CASSANDRA-674:
------------------------------------------

bq. the metadata is useless on its own. It only becomes useful when it is attached to data (a column or to a range), so there is no reason to cache the metadata independently of the data.

But above you mention:
{code}
Indexes for individual rows are gone, since the global index allows random access to the middle of column families that span Blocks, and Slices allow batches of columns to be skipped within a Block.
{code}
^ This wouldn't be useful to cache? in the situation you only want a small range of columns?

-
More questions

Roughly how large would the actual chunk be? This is the unit of deserialization right? Or can avro deserialize only part of a structure?

So if you are doing a range query on a very wide row how do you know when to stop processing chunks? Do you keep going till you hit the sentinel value?
[jira] Commented: (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977697#action_12977697 ]

Stu Hood commented on CASSANDRA-674:

> How will ranges be stored? The parent ordering would mean the sorting of data at that level is lost no?
Added some explanation of how I think ranges should work to the wiki: http://wiki.apache.org/cassandra/FileFormatDesignDoc?action=diff&rev1=15&rev2=16

> Are chunks broken up by size only?
Technically "spans" are the largest unit, so they define the boundaries; I tried to clarify this part as well. There are a few possible thresholds, including a max number of rows, columns, range tombstones or total bytes in the span. One semi-undefined portion is what happens when a row is larger than can be stuffed in a span. Most likely we'll want to use the range metadata to indicate the portion of the row covered by the span (the approach I took in the original implementation attached here).

> Will the metadata be ripe for caching?
I don't think so: the metadata is useless on its own. It only becomes useful when it is attached to data (a column or a range), so there is no reason to cache the metadata independently of the data.

Thanks!

> New SSTable Format
> ------------------
>
> Key: CASSANDRA-674
> URL: https://issues.apache.org/jira/browse/CASSANDRA-674
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Stu Hood
> Fix For: 0.8
>
> Attachments: 674-v1.diff, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt
>
>
> Various tickets exist due to limitations in the SSTable file format, including #16, #47 and #328. Attached is a proposed design/implementation of a new file format for SSTables that addresses a few of these limitations. The implementation has a bunch of issues/fixmes, which I'll describe in the comments.
> The file format is described in the javadoc for the o.a.c.io.SSTableWriter class, but briefly:
> * Blocks are opaque (except for their header) so that they can be compressed. The index file contains an entry for the first key in every Block. Blocks contain Slices.
> * Slices are series of columns with the same parents and (deletion) metadata. They can be used to represent ColumnFamilies or SuperColumns (or a slice of columns at any other depth). A single CF can be split across multiple Slices, which can be split across multiple blocks.
> * Neither Slices nor Blocks have a fixed size or maximum length, but they each have target lengths which can be stretched and broken by very large columns.
> The most interesting concepts from this patch are:
> * Block compression is possible (currently using GZIP, which has one bug mentioned in the comments),
> * Compaction involves merging intersecting Slices from input SSTables. Since large rows will be broken down into multiple slices, only the portions of rows that intersect between tables need to be deserialized/merged/held-in-memory,
> * Indexes for individual rows are gone, since the global index allows random access to the middle of column families that span Blocks, and Slices allow batches of columns to be skipped within a Block.
> * Bloom filters for individual rows are gone, and the global filter contains ColumnKeys instead, meaning that a query for a column that doesn't exist in a row that does will often not need to seek to the row.
> * Metadata (deletion/gc time) and ColumnKeys (key, colname1, colname2...) for columns are defined recursively, so deeply nested slices are possible,
> * Slices representing a single parent (CF, SC, etc) can have different Metadata, meaning that a tombstone Slice from d-f could sit between Slices containing columns a-c and g-h. This allows for eventually consistent range deletes of columns.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
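The comment above lists several possible thresholds for closing a span. As a rough sketch of how two of them (column count and total bytes) might interact, the following splits a sequence of serialized column sizes into spans. The class `SpanSplitSketch`, the method `columnsPerSpan` and all limit values are hypothetical and do not come from the patch or the wiki page; row and range tombstone limits would presumably be further counters checked in the same way.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of span thresholds; not code from the attached patch. */
final class SpanSplitSketch {
    /**
     * Walks the serialized column sizes in order and closes the current span as soon as
     * either target is reached. Returns the number of columns placed in each span.
     * A single column larger than maxBytes still goes into a span, so the byte target
     * is a goal that very large columns can stretch rather than a hard limit.
     */
    static List<Integer> columnsPerSpan(long[] columnSizes, int maxColumns, long maxBytes) {
        List<Integer> spans = new ArrayList<Integer>();
        int columns = 0;
        long bytes = 0;
        for (long size : columnSizes) {
            columns++;
            bytes += size;
            if (columns >= maxColumns || bytes >= maxBytes) {
                spans.add(columns);
                columns = 0;
                bytes = 0;
            }
        }
        if (columns > 0) {
            spans.add(columns); // final, partially filled span
        }
        return spans;
    }
}
```

For example, with sizes `{10, 10, 10, 100, 10}`, a limit of 3 columns and a target of 50 bytes, the first span closes on the column limit, the 100-byte column ends up in a span of its own that exceeds the byte target, and the last column forms a final short span.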
[jira] Commented: (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977609#action_12977609 ]

T Jake Luciani commented on CASSANDRA-674:

As I try to wrap my head around this I'm listing questions that come to mind:
- How will ranges be stored? The parent ordering would mean the sorting of data at that level is lost no?
- Are chunks broken up by size only?
- Will the metadata be ripe for caching?
[jira] Commented: (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976469#action_12976469 ]

Stu Hood commented on CASSANDRA-674:

Thinking about this issue again. Dumped some thoughts I had on paper to the wiki: http://wiki.apache.org/cassandra/FileFormatDesignDoc .
[jira] Commented: (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886030#action_12886030 ]

Ryan King commented on CASSANDRA-674:

YES!
[jira] Commented: (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886023#action_12886023 ]

Stu Hood commented on CASSANDRA-674:

After having played with Avro a bit more, I'm all for using its DataFile format in the SSTable. The variable length integer encoding, built in compression, schema migration and block recovery schemes are win.
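For readers unfamiliar with the variable length integer encoding mentioned in the comment above: Avro's binary encoding writes int and long values as zig-zag encoded varints, so values of small magnitude (positive or negative) take only one or two bytes. The following is a standalone sketch of that scheme. The class `ZigZagVarintSketch` and its method names are made up for illustration; this is neither Avro's own implementation nor anything from the attached patch.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative zig-zag varint codec; class and method names are hypothetical. */
final class ZigZagVarintSketch {
    /** Encodes a long as a zig-zag varint: 7 payload bits per byte, high bit set on all but the last byte. */
    static byte[] encodeLong(long n) {
        // Zig-zag maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ... so small magnitudes stay small.
        long z = (n << 1) ^ (n >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // more bytes follow
            z >>>= 7;
        }
        out.write((int) z); // last byte, high bit clear
        return out.toByteArray();
    }

    /** Decodes a single zig-zag varint from the start of buf. */
    static long decodeLong(byte[] buf) {
        long z = 0;
        int shift = 0;
        for (byte b : buf) {
            z |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear marks the final byte
            }
            shift += 7;
        }
        // Undo the zig-zag mapping.
        return (z >>> 1) ^ -(z & 1);
    }
}
```

With this scheme the value 64 encodes to the two bytes `80 01`, and -1 encodes to the single byte `01`, whereas a fixed-width long would always take eight bytes.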