[ https://issues.apache.org/jira/browse/CASSANDRA-8542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shawn Walker updated CASSANDRA-8542: ------------------------------------ Comment: was deleted (was: As written, the current {{InvertibleBloomReconciler}} uses finite field arithmetic over the prime field {{GF[2^30 - 35]}}. This is a trade off: * (pro) Addition, multiplication, subtraction are fairly fast, and comprehensible. * (pro) During reconciliation, we can actually determine what sources produced each non-common record, rather than just discovering what the differences were. * (pro) We can reconcile up to 28 sources at once. * (con) The way I've encoded from bytes into {{GF[2^30 - 35]}} is not space efficient, creating 4+ bytes of output for each 3 bytes of input. * (con) Division is a bit slow. Similar techniques could be used with a different finite field. {{GF[2^16]}} is a possible choice. For it: * (pro) Addition and subtraction become just XOR, and so should be blazingly fast. * (pro) Encoding is trivial, and has nearly perfect space efficiency: 1-2 bytes of padding per record should suffice. * (con) We can only reconcile 16 sources at once. * (con) Performing multiplication/division in {{GF[2^16]}} can be a bit arcane. That said, it can be done quickly with a few lookup tables.* (con) During reconciliation we wouldn't be able to determine what sources produced each non-common record, only which records are the non-common records ) > Make get_range_slices and related succeed most of the time on tombstone heavy > column families > --------------------------------------------------------------------------------------------- > > Key: CASSANDRA-8542 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8542 > Project: Cassandra > Issue Type: New Feature > Components: Core > Reporter: Shawn Walker > Priority: Minor > Attachments: trunk-8542-InvertibleBloomReconciler.txt > > > I've got a ColumnFamily in which some rows end up being used in a queue-like > fashion, and where many tombstones tend to pile up at the left end of the > row. As a result, I run into {{TombstoneOverwhelming}} (e.g. CASSANDRA-6117) > when trying to list the columns of said rows. > Please don't yell at me too loudly. I know the issue I'm describing will > generally be considered as being due to poor use of Cassandra. I understand > the rationale of the current behavior, and am hesitant to play with fire by > increasing {{tombstone_fail_threshold}} to a high value. Instead, what I > propose is an alternate method of resolving such queries on the read path. > ---- > This is based on the following observation: on the coordinator node, when > {{RangeSliceResponseResolver}} is resolving a range slice query, it needn't > be aware of all tombstones that all responding nodes have within the > specified range. Instead, it would suffice if it could determine only those > tombstones which aren't shared by all responding nodes. E.g. if node #1 > responds with tombstones (A, B, D), node #2 responds with tombstones (A, B, > C), and node #3 responds with tombstones (A, B, C, D), > {{RangeSliceResponseResolver}} need only actually know about the tombstones > (C, D): All of the responders will already have removed any relevant data for > the tombstones (A, B) from their individual responses. > Arranging for {{RangeSliceResponseResolver}} to discover only the non-common > tombstones can be accomplished by using a variant of the "invertible bloom > filter" data structure described in e.g. > http://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p218.pdf. Using > an invertible Bloom filter, each responding node can build a (roughly) fixed > size data structure holding a representation of all the tombstones that node > encounters. The coordinator node can then combine the invertible Bloom > filters. If there aren't too many non-common tombstones, the coordinator > node will be able to enumerate all of them, and so resolve the range slice > query. > I see accomplishing this in a few discrete steps: > 1. Implement an appropriate variant of the "invertible bloom filter". I've > started this already, and will shortly upload a patch including my > {{InvertibleBloomReconciler}} implementation. From a black box perspective, > {{InvertibleBloomReconcilerTest.verifyFiveWayReconciliation()}} demonstrates > how this data structure and algorithm could be used. > 2. ("db layer") Wrap {{InvertibleBloomReconciler}} into > {{TombstoneReconciler}}, a structure for spilling tombstones into. Refactor > large swaths of {{org.apache.cassandra.db}}'s read path to accomodate the > possibility of placing tombstones discovered during a read into a > {{TombstoneReconciler}} instead of returning them within a {{Row}}, > {{List<Row>}}, {{ColumnFamily}}, etc. I've attempted a start on this, though > this means carrying the {{TombstoneReconciler}} around through most of > {{ColumnFamilyStore}}, practically all of {{org.apache.db.filter}}, and other > places I haven't yet discovered. I'm wondering if I wouldn't be better off > waiting for CASSANDRA-8099 before starting this step -- a fully iterator flow > through {{org.apache.cassandra.db}} might make this easier. > 3. ("dynamo layer") Arrange for {{RangeSliceCommand}} to include parameters > for the IBR (table size, hash count, seed), possibly by making these part of > {{CFMetaData}}. Allow a {{RangeSliceResponse}} to optionally return a > {{TombstoneReconciler}} in addition to its {{List<Row>}}. Refactor > {{RangeSliceResponseResolver}} to be capable of handling > {{TombstoneReconciler}} s. Possibly refactor > {{StorageProxy.getRangeSlices(...)}} to disable read repair if any responses > contained a {{TombstoneReconciler}}. > Since the invertible bloom filter is a probabilistic data structure, it is > possible that resolving a query in this manner could fail. As such, I'm > proposing that the current read path not make use of the > {{TombstoneReconciler}} idea unless it would otherwise encounter a > {{TombstoneOverwhelming}} situation. > Also, there is an astronomically minute possibility (on the order if 1 / > 2^120 ) that tombstones reconciled via a {{TombstoneReconciler}} could be bad > data. Possibly less astronomical if someone were attempting to maliciously > abuse this functionality, given the fact that I'm using the Murmur3 hash > instead of a cryptographic hash within {{InvertibleBloomReconciler}}. As > such, I'm suggesting that read repair not be enabled on this path, just in > case. > ---- > Any thoughts? Would there be any interest in an implementation as I've > described? Any suggestions on who I might ask for help navigating > {{org.apache.cassandra.db}} if/when I run into trouble there? -- This message was sent by Atlassian JIRA (v6.3.4#6332)