[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default
[ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889494#comment-13889494 ] Marcus Eriksson edited comment on CASSANDRA-5351 at 2/3/14 2:18 PM:

More complete version now pushed to https://github.com/krummas/cassandra/tree/marcuse/5351

Lots of testing required, but I think it is mostly 'feature-complete'. Repair flow is now:
# Repair coordinator sends out Prepare messages to all neighbors.
# All involved parties figure out which sstables should be included in the repair: if it is a full repair, all sstables are included; otherwise, only the ones with repairedAt set to 0. Note that we don't do any locking of the sstables - if they are gone when we do anticompaction, that is fine, we will repair them next round.
# Repair coordinator prepares itself, waits until all neighbors have prepared, and sends out TreeRequests.
# All nodes calculate merkle trees based on the sstables picked in step 2.
# Coordinator waits for replies and then sends AnticompactionRequests to all nodes.
# If we are doing a full repair, we simply skip anticompaction.

Notes:
* SSTables are tagged with repairedAt timestamps; compactions keep min(repairedAt) of the included sstables.
* nodetool repair defaults to the old behaviour. Use --incremental for the new repairs.
* Anticompaction
** Splits an sstable into 2 new ones: one sstable with all keys that were in the repaired ranges and one with the unrepaired data.
** If the repaired ranges cover the entire sstable, we just rewrite the sstable metadata. This means the optimal way to run incremental repairs is to not do partitioner-range repairs etc.
* LCS
** We always first check if there are any unrepaired sstables to do STCS on; if there are, we do that. The reasoning is that new data (which needs compaction) is unrepaired.
** We keep all sstables in the LeveledManifest, then filter out the unrepaired ones when getting compaction candidates etc.
* STCS
** Major compaction is done by taking the biggest set of sstables, so for a total major compaction you will need to run nodetool compact twice.
** Minor compactions work the same way: the biggest set of sstables will be compacted.
* Streaming - a streamed SSTable keeps its repairedAt time.
* BulkLoader - loaded sstables are unrepaired.
* Scrub - sets repairedAt to UNREPAIRED; since we can drop rows during scrub, the new sstable is not repaired.
* Upgradesstables - keeps the repaired status.
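The anticompaction split described above - one sstable partitioned into a repaired and an unrepaired group based on the repaired token ranges - can be sketched roughly as follows. This is a minimal illustration, not Cassandra's actual implementation; the `Range` and `split` names are hypothetical stand-ins.

```java
import java.util.*;

// Hypothetical sketch of anticompaction: split one sstable's keys into a
// "repaired" and an "unrepaired" group based on the repaired token ranges.
class AnticompactionSketch {
    // A token range (start exclusive, end inclusive), mirroring Cassandra's convention.
    static class Range {
        final long left, right;
        Range(long left, long right) { this.left = left; this.right = right; }
        boolean contains(long token) { return token > left && token <= right; }
    }

    // Partition the tokens of one sstable into two new groups: tokens covered
    // by a repaired range go to "repaired", the rest stay unrepaired.
    static Map<String, List<Long>> split(List<Long> tokens, List<Range> repairedRanges) {
        List<Long> repaired = new ArrayList<>(), unrepaired = new ArrayList<>();
        for (long t : tokens) {
            boolean covered = repairedRanges.stream().anyMatch(r -> r.contains(t));
            (covered ? repaired : unrepaired).add(t);
        }
        Map<String, List<Long>> result = new HashMap<>();
        result.put("repaired", repaired);
        result.put("unrepaired", unrepaired);
        return result;
    }

    public static void main(String[] args) {
        List<Long> tokens = Arrays.asList(10L, 25L, 40L, 55L);
        List<Range> repairedRanges = Arrays.asList(new Range(0, 30));
        Map<String, List<Long>> out = split(tokens, repairedRanges);
        System.out.println(out.get("repaired"));   // tokens 10 and 25 fall in (0, 30]
        System.out.println(out.get("unrepaired")); // tokens 40 and 55 do not
    }
}
```

If the repaired ranges cover every token, the "unrepaired" group comes out empty, which corresponds to the metadata-rewrite shortcut mentioned in the notes.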
[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default
[ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890438#comment-13890438 ] Yuki Morishita edited comment on CASSANDRA-5351 at 2/4/14 6:20 AM:

bq. Dropping sstable to UNREPAIRED during major compaction means that all repaired data status is cleared for the node.

That's what I meant. The current major compaction produces one SSTable, and I think changing that behavior might confuse users. My opinion is to keep it as is.

Additional review comments:
* Does PrepareMessage need to carry around dataCenters? Only the coordinator sends out messages, so I think you can drop it (also from ParentRepairSession).
* Using the CF ID is preferred over the keyspace name/CF name pair.
* PrepareMessage is sent per CF, but that can produce a lot of round trips. Isn't one message per replica node enough?
* I think we need to clean up parentRepairSessions when something bad happens; otherwise, the ParentRepairSession in the map keeps references to SSTables.

I just worked on the first one above and the commit is here (on top of your branch): https://github.com/yukim/cassandra/commit/7c65e532dd69f9f4c1ea2d3fdf0401ed70291361

Avoid repairing already-repaired data by default
Key: CASSANDRA-5351
URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
Project: Cassandra
Issue Type: Task
Components: Core
Reporter: Jonathan Ellis
Assignee: Lyuben Todorov
Labels: repair
Fix For: 2.1
Attachments: 5351_node1.log, 5351_node2.log, 5351_node3.log, 5351_nodetool.log

Repair has always built its merkle tree from all the data in a columnfamily, which is guaranteed to work but is inefficient. We can improve this by remembering which sstables have already been successfully repaired, and only repairing sstables new since the last repair. (This automatically makes CASSANDRA-3362 much less of a problem too.) The tricky part is, compaction will (if not taught otherwise) mix repaired data together with non-repaired. So we should segregate unrepaired sstables from the repaired ones.

-- This message was sent by Atlassian JIRA (v6.1.5#6160)
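The two invariants the ticket revolves around - incremental repair only considers sstables with repairedAt still 0, and compactions that mix inputs keep min(repairedAt) so unrepaired data never silently becomes repaired - can be sketched as below. Names (`SSTable`, `candidates`, `mergedRepairedAt`) are illustrative, not the actual Cassandra API.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the sstable-selection and repairedAt-merge rules described above.
class SSTableSelection {
    static final long UNREPAIRED = 0L;

    static class SSTable {
        final String name; final long repairedAt;
        SSTable(String name, long repairedAt) { this.name = name; this.repairedAt = repairedAt; }
    }

    // Full repair includes every sstable; incremental repair includes only
    // those whose repairedAt is still 0 (UNREPAIRED).
    static List<SSTable> candidates(List<SSTable> all, boolean fullRepair) {
        if (fullRepair) return all;
        return all.stream().filter(s -> s.repairedAt == UNREPAIRED).collect(Collectors.toList());
    }

    // A compaction output keeps min(repairedAt) of its inputs, so any
    // unrepaired input (repairedAt == 0) keeps the result unrepaired.
    static long mergedRepairedAt(List<SSTable> inputs) {
        return inputs.stream().mapToLong(s -> s.repairedAt).min().orElse(UNREPAIRED);
    }

    public static void main(String[] args) {
        List<SSTable> all = Arrays.asList(new SSTable("a", 0L), new SSTable("b", 1385466263L));
        System.out.println(candidates(all, false).size()); // only "a" is unrepaired
        System.out.println(mergedRepairedAt(all));          // min is 0 -> output stays unrepaired
    }
}
```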
[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default
[ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875155#comment-13875155 ] Marcus Eriksson edited comment on CASSANDRA-5351 at 1/17/14 7:30 PM:

Just pushed a work-in-progress branch with a bunch of updates: https://github.com/krummas/cassandra/tree/marcuse/5351
* It locks the sstables for the whole duration of the repair. It might be better to do a best effort here, i.e., remember which sstables we calculated merkle trees on, and then if an sstable is gone during anticompaction, we don't mark that data as repaired - we will catch it next repair round instead.
* Adds a test for anticompaction.
* For LCS, we always first check if there are STCS compactions to do on the unrepaired data, the reasoning being that new data is unrepaired, so if you have flushed 4 sstables, there will be an STCS compaction.
* For STCS, we just pick the bucket with the most sstables - probably not the best way, but we can tweak that heuristic later.

[~lyubent] - I think the bug you had above was because you didn't check whether markCompacting was successful in AnticompactionSession.lockTables().

It works, but at least these things are still needed:
* Fix streaming: keep the repairedAt times etc.
* Make full repairs work somehow (anticompacting the whole dataset is probably not preferable).
* Timeouts: if we lock all unrepaired sstables, we must handle stalled repairs somehow.
* Make anticompaction smarter: if we have 10 sstables that we should anticompact, why not combine anticompaction with compaction and create 2 new sstables, one with repaired data and one with unrepaired.
* More tests, cleanups, refactoring.
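The STCS heuristic mentioned above - pick the bucket with the most sstables - could look roughly like the sketch below. The size-tiered bucketing here is deliberately simplified (group sorted sizes while each fits within 1.5x the running bucket average); it is not Cassandra's actual getBuckets() logic.

```java
import java.util.*;

// Illustrative sketch of "pick the bucket with the most sstables" for STCS
// on unrepaired data. Bucketing is simplified for the example.
class StcsPickSketch {
    // Group sorted sstable sizes into buckets: a size joins the current bucket
    // while it is within 1.5x the bucket's running average.
    static List<List<Long>> buckets(List<Long> sizes) {
        List<Long> sorted = new ArrayList<>(sizes);
        Collections.sort(sorted);
        List<List<Long>> out = new ArrayList<>();
        for (long size : sorted) {
            List<Long> last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && size <= avg(last) * 1.5) {
                last.add(size);
            } else {
                List<Long> b = new ArrayList<>();
                b.add(size);
                out.add(b);
            }
        }
        return out;
    }

    static double avg(List<Long> b) {
        return b.stream().mapToLong(Long::longValue).average().orElse(0);
    }

    // The heuristic from the comment above: compact the bucket holding the
    // most sstables.
    static List<Long> biggestBucket(List<Long> sizes) {
        return buckets(sizes).stream()
                .max(Comparator.comparingInt((List<Long> b) -> b.size()))
                .orElse(Collections.emptyList());
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(100L, 110L, 120L, 5000L, 5100L);
        System.out.println(biggestBucket(sizes)); // the three ~100-sized sstables win
    }
}
```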
[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default
[ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838926#comment-13838926 ] Lyuben Todorov edited comment on CASSANDRA-5351 at 12/4/13 2:40 PM:

Added checks to LeveledManifest#replace and LeveledManifest#add to ensure un-repaired data is kept at L0 ([patch here|https://github.com/lyubent/cassandra/commit/70f63e577f531f904997934c53022c1d6a94b9f3]). The out-of-order key error is still a problem; the logs show the same error as in the comment above. One thing I haven't accounted for so far is sstables being added straight to levels higher than L0 - is it possible for newly flushed data to go straight to a level above L0?
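The rule being added to LeveledManifest above - un-repaired data stays at L0 - can be sketched as follows. `LeveledManifestSketch` and its fields are hypothetical stand-ins for the real manifest, just illustrating the demotion check.

```java
import java.util.*;

// Sketch of the "unrepaired data stays at L0" rule described in the comment
// above. This is an illustrative stand-in for LeveledManifest, not its API.
class LeveledManifestSketch {
    static class SSTable {
        final String name; final boolean repaired; final int sstableLevel;
        SSTable(String name, boolean repaired, int sstableLevel) {
            this.name = name; this.repaired = repaired; this.sstableLevel = sstableLevel;
        }
    }

    // levels.get(i) holds the sstables currently assigned to level i
    final List<List<SSTable>> levels = new ArrayList<>();

    LeveledManifestSketch(int maxLevel) {
        for (int i = 0; i <= maxLevel; i++) levels.add(new ArrayList<>());
    }

    // Unrepaired sstables are forced to L0 even if their metadata claims a
    // higher level; repaired ones keep their recorded level.
    void add(SSTable sstable) {
        int level = sstable.repaired ? sstable.sstableLevel : 0;
        levels.get(level).add(sstable);
    }

    public static void main(String[] args) {
        LeveledManifestSketch manifest = new LeveledManifestSketch(3);
        manifest.add(new SSTable("repaired-l2", true, 2));
        manifest.add(new SSTable("unrepaired-l2", false, 2)); // demoted to L0
        System.out.println(manifest.levels.get(0).size()); // 1
        System.out.println(manifest.levels.get(2).size()); // 1
    }
}
```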
[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default
[ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832570#comment-13832570 ] Lyuben Todorov edited comment on CASSANDRA-5351 at 11/26/13 1:27 PM:

After [this commit|https://github.com/lyubent/cassandra/commit/903e416539cdde78514850bda25076f3f2fc57ec] to keep unrepaired data at L0, repairs start failing to validate after a few inserts and compactions. The stack trace from the error on each node is below (a 3-node ccm cluster was used here, with the repair being issued to node 2).
{code}
INFO 15:19:10,321 Starting repair command #3, repairing 2 ranges for keyspace test
INFO 15:19:10,322 [repair #55f4d610-569d-11e3-b553-975f903ccf5a] new session: will sync /127.0.0.2, /127.0.0.3 on range (-9223372036854775808,-3074457345618258603] for test.[lvl]
INFO 15:19:10,325 Handshaking version with /127.0.0.3
INFO 15:19:10,343 [repair #55f4d610-569d-11e3-b553-975f903ccf5a] requesting merkle trees for lvl (to [/127.0.0.3, /127.0.0.2])
INFO 15:19:11,493 [repair #55f4d610-569d-11e3-b553-975f903ccf5a] Received merkle tree for lvl from /127.0.0.3
ERROR 15:19:16,138 Failed creating a merkle tree for [repair #55f4d610-569d-11e3-b553-975f903ccf5a on test/lvl, (-9223372036854775808,-3074457345618258603]], /127.0.0.2 (see log for details)
ERROR 15:19:16,138 Exception in thread Thread[ValidationExecutor:2,1,main]
java.lang.AssertionError: row DecoratedKey(-9223264645216044815, 73636c744c546e56534c4741775141) received out of order wrt DecoratedKey(-3331959603918038206, 685863786a586464616b794f597075)
	at org.apache.cassandra.repair.Validator.add(Validator.java:136)
	at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:820)
	at org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:61)
	at org.apache.cassandra.db.compaction.CompactionManager$8.call(CompactionManager.java:417)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)
ERROR 15:19:16,139 [repair #55f4d610-569d-11e3-b553-975f903ccf5a] session completed with the following error
org.apache.cassandra.exceptions.RepairException: [repair #55f4d610-569d-11e3-b553-975f903ccf5a on test/lvl, (-9223372036854775808,-3074457345618258603]] Validation failed in /127.0.0.2
	at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:152)
	at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:212)
	at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:91)
	at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)
INFO 15:19:16,138 Range (3074457345618258602,-9223372036854775808] has already been repaired. Skipping repair.
S.AD: lvl repairedAt: 1385466263703792000
ERROR 15:19:16,139 Exception in thread Thread[AntiEntropySessions:5,5,RMI Runtime]
java.lang.RuntimeException: org.apache.cassandra.exceptions.RepairException: [repair #55f4d610-569d-11e3-b553-975f903ccf5a on test/lvl, (-9223372036854775808,-3074457345618258603]] Validation failed in /127.0.0.2
	at com.google.common.base.Throwables.propagate(Throwables.java:160)
	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.cassandra.exceptions.RepairException: [repair #55f4d610-569d-11e3-b553-975f903ccf5a on test/lvl, (-9223372036854775808,-3074457345618258603]] Validation failed in /127.0.0.2
	at org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:152)
	at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:212)
	at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:91)
	at
{code}
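The AssertionError in the log above comes from an ordering invariant: the validator building the merkle tree must receive rows in token order, so each added key must compare >= the previous one. A minimal sketch of that invariant (illustrative names, not Validator's actual API) follows, using the two tokens from the log:

```java
// Minimal sketch of the invariant behind the AssertionError above: rows must
// arrive in token order, so each added token must be >= the previous one.
// Illustrative stand-in for org.apache.cassandra.repair.Validator.add().
class ValidatorOrderSketch {
    private Long lastToken = null;

    void add(long token) {
        // Mirrors the assertion in the log: a smaller token arriving after a
        // larger one means the scanned sstables are not sorted as expected.
        if (lastToken != null && token < lastToken)
            throw new AssertionError("row with token " + token
                    + " received out of order wrt " + lastToken);
        lastToken = token;
    }

    public static void main(String[] args) {
        ValidatorOrderSketch v = new ValidatorOrderSketch();
        v.add(-3331959603918038206L);
        try {
            v.add(-9223264645216044815L); // smaller token after a larger one
            System.out.println("no error");
        } catch (AssertionError e) {
            System.out.println("out of order detected");
        }
    }
}
```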
[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default
[ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822426#comment-13822426 ] Lyuben Todorov edited comment on CASSANDRA-5351 at 11/14/13 1:37 PM:

I think I misunderstood Marcus' idea. If we keep all the un-repaired data at L0 and promote it to L1 once repaired, would that mean overriding what currently happens when there are 10 sstables at L0 (where they get promoted), or is that where triggering automatic repairs would come into play?
[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default
[ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822771#comment-13822771 ] Jonathan Ellis edited comment on CASSANDRA-5351 at 11/14/13 6:57 PM:

I think everyone agrees that repaired data should have a leveling system exactly like current LCS. Freshly repaired sstables (streamed from other replicas) start in L0 and get leveled from there. The tricky part is, what do we do about the unrepaired data? The two proposals are:
# Keep a separate unrepaired arena -- let's call it L0' to distinguish it from the repaired L0 -- and perform STCS in it the way we do with an overflowing L0 currently. Once repaired, we move sstables into L0 proper.
# Keep a complete extra set of levels for unrepaired data, L0', L1', ..., LN', and perform leveling on these, separate from the repaired levels. Once repaired, data from here is dropped into repaired L0.

The downside to 1 is that you don't get full leveling benefits until the data is repaired. The downside to 2 is that LCS already has super-high write amplification (relative to STCS), so doubling that is going to be even more painful. Both downsides are mitigated by repairing more often. Having written that out ... I'd lean towards the STCS option because it's so much simpler to implement.
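Proposal 1 above - a single size-tiered unrepaired arena (L0') feeding into the repaired leveled structure - can be sketched as below. All names here are hypothetical illustrations of the idea, not an actual Cassandra strategy.

```java
import java.util.*;

// Sketch of proposal 1: one leveled structure for repaired data, plus a
// single size-tiered "arena" (the L0' in the comment) for unrepaired data.
class SplitStrategySketch {
    final List<String> unrepairedArena = new ArrayList<>(); // L0': STCS until repaired
    final List<String> repairedL0 = new ArrayList<>();      // leveled from here

    // New or streamed sstables land in the arena unless already repaired.
    void add(String sstable, boolean repaired) {
        (repaired ? repairedL0 : unrepairedArena).add(sstable);
    }

    // After a successful repair, sstables move from the arena into repaired L0,
    // where normal LCS leveling takes over.
    void markRepaired(String sstable) {
        if (unrepairedArena.remove(sstable))
            repairedL0.add(sstable);
    }

    public static void main(String[] args) {
        SplitStrategySketch s = new SplitStrategySketch();
        s.add("sst-1", false);
        s.add("sst-2", true);
        s.markRepaired("sst-1");
        System.out.println(s.unrepairedArena.size()); // 0
        System.out.println(s.repairedL0.size());      // 2
    }
}
```

The downside noted above shows up directly here: reads touching the arena must consult every sstable in it, STCS-style, until repair promotes them.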
[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default
[ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13795120#comment-13795120 ] Jason Brown edited comment on CASSANDRA-5351 at 10/15/13 12:31 PM: --- Interesting ideas here. However, here are some problems off the top of my head that need to be addressed (in no particular order): * Nodes that are fully replaced (see: Netflix running in the cloud). When a node is replaced, we bootstrap it by streaming data from the closest peers (usually) in the local DC. The new node would have no anti-compacted sstables, as it has never had a chance to repair. I'm not sure whether bootstrapped data can be considered anti-compacted through commutativity; it might be true, but I'd need to think about it more. Assuming not, when this new node is involved in any repair, it would generate a different MT than its already-repaired peers, and thus all hell would break loose streaming already-repaired data to every node involved in the repair, worse than today's repair (think streaming TBs of data across multiple Amazon datacenters). If we can prove that the new node's data is commutatively repaired just by bootstrap, then this is not a problem as such. Note this also affects move (to a lesser degree) and rebuild. * Consider nodes A, B, and C. If nodes A and B successfully repair, but C fails to repair with them (due to partitioning, app crash, etc.) during the repair, C is forced to do an -ipr repair, as A and B have already anti-compacted and that is the only way C will be able to repair against them. * If the operator chooses to cancel the repair, we are left in an indeterminate state wrt which nodes have successfully completed repairs with one another (similar to the last point). * Local DC repair vs. global is largely incompatible with this. Looks like you get one shot with each sstable's range for repair, so if you choose to do local-DC repair with an sstable, you are forced to do -ipr if you later want to repair globally. Note that these problems are magnified immensely when you run in multiple datacenters, especially datacenters separated by great distances. While none of these situations is unsolvable, there are many non-obvious ways to get into a non-deterministic state in which operators will either see tons of data being streamed due to differing anti-compaction points, or will be forced to run -ipr without an easily understood reason. I already see operators terminate repair jobs because they hang or take too long, for better or worse (mostly worse). At that point, the operator is pretty much required to do an -ipr repair, which gets us back to the same situation we are in today, but with more confusion and possibly with -ipr as the default. It would probably be good to run -ipr every n days/weeks/months as a best practice anyway, to help with bit rot. I worry about the very non-obvious edge cases this ticket introduces and the possibility that operators will simply fall back to using -ipr whenever something goes bump or doesn't make sense. Thanks for listening.
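The merkle-tree divergence described above can be illustrated with a toy sketch (assumed names and hashing, not Cassandra's actual merkle tree implementation): a freshly bootstrapped node treats all of its data as unrepaired, so its incremental-repair tree covers a different set of data than its already-repaired peers', and every range mismatches.

```python
import hashlib

def merkle_root(keys):
    # Toy stand-in for a merkle tree built over a node's unrepaired data.
    h = hashlib.sha256()
    for k in sorted(keys):
        h.update(k.encode())
    return h.hexdigest()

# An established peer has only k5 left unrepaired...
peer_tree = merkle_root(["k5"])
# ...but a freshly bootstrapped node treats everything as unrepaired.
new_node_tree = merkle_root(["k1", "k2", "k3", "k4", "k5"])
print(peer_tree == new_node_tree)  # False: everything gets re-streamed
```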
[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default
[ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13773150#comment-13773150 ] Jonathan Ellis edited comment on CASSANDRA-5351 at 9/20/13 4:46 PM: bq. The more often you repair the less big a full separate set of levels for unrepaired data would be. So maybe that's the way to go. Which is to say, we'd be kicking repairs off as automatically as we currently kick off compaction. I still don't have any better ideas. [~krummas]? was (Author: jbellis): bq. I think it would be simpler to anticompact after repair This is straightforward for STCS (bucket repaired/non-repaired separately) but less so for LCS. Now that we're already doing STCS in L0, I suggest extending that here: reserve the levels for repaired data, and STCS until we can repair. This implies making repair as automatic as compaction, which is a big change for us. I think it's a lot more user friendly, but I'm not 100% confident the performance impact will be negligible. Any better ideas? Avoid repairing already-repaired data by default Key: CASSANDRA-5351 URL: https://issues.apache.org/jira/browse/CASSANDRA-5351 Project: Cassandra Issue Type: Task Components: Core Reporter: Jonathan Ellis Assignee: Lyuben Todorov Labels: repair Fix For: 2.1 Repair has always built its merkle tree from all the data in a columnfamily, which is guaranteed to work but is inefficient. We can improve this by remembering which sstables have already been successfully repaired, and only repairing sstables new since the last repair. (This automatically makes CASSANDRA-3362 much less of a problem too.) The tricky part is, compaction will (if not taught otherwise) mix repaired data together with non-repaired. So we should segregate unrepaired sstables from the repaired ones. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
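The core idea in the issue description, remembering which sstables have already been repaired and only repairing the new ones, can be sketched as follows (a minimal sketch with hypothetical names; the actual mechanism later used a per-sstable repairedAt field, with 0 meaning unrepaired):

```python
from dataclasses import dataclass

@dataclass
class SSTable:
    name: str
    repaired_at: int  # 0 means never successfully repaired

def sstables_to_repair(sstables, incremental=True):
    # A full repair includes every sstable; an incremental repair only
    # includes sstables that have never been marked repaired.
    if not incremental:
        return list(sstables)
    return [s for s in sstables if s.repaired_at == 0]

tables = [SSTable("a", 0), SSTable("b", 1375000000), SSTable("c", 0)]
print([s.name for s in sstables_to_repair(tables)])                     # ['a', 'c']
print([s.name for s in sstables_to_repair(tables, incremental=False)])  # ['a', 'b', 'c']
```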
[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default
[ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13735160#comment-13735160 ] Jonathan Ellis edited comment on CASSANDRA-5351 at 8/9/13 7:17 PM: --- bq. Given our setup we would need to repair once a day which wouldn't fly Given the tiny amounts of data being repaired, I think you could get down to hourly. But, the more often you repair the less big a full separate set of levels for unrepaired data would be. So maybe that's the way to go.
[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default
[ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13710131#comment-13710131 ] Jonathan Ellis edited comment on CASSANDRA-5351 at 7/16/13 7:37 PM: Right. More precisely, the idea is to build a merkle tree only for new-since-last-repair data.
[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default
[ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13699687#comment-13699687 ] Jeremiah Jordan edited comment on CASSANDRA-5351 at 7/4/13 1:35 AM: Anti-compaction sounds like it could work. Then you really do just need an am-I-repaired flag, because during repair you anti-compact into repaired and not-repaired data. So something like: 1. Calculate merkle trees, anti-compacting each sstable into a data-being-repaired and a data-not-being-repaired tmp sstable during the process. Set a flag in the data-being-repaired sstables to show them as repaired. 2. Perform the merkle exchange/streaming, flagging tmp sstables coming in from streaming as repaired. 3. When the repair is done, convert all tmp sstables into real ones, and delete the originals. Sstables involved in the repair would be marked as already compacting so they won't participate in compaction during the repair. Since you don't promote from tmp to real until the repair completes successfully, if the node dies in the middle of the repair, all the tmp sstables will just be removed at startup. Then only compact like sstables, so there will be two sets of sstables: fully repaired and not repaired at all. This is going to use a lot of disk IO for all the anti-compaction, but as long as you run repair a lot, since it is cheap after the first time, it shouldn't be too bad. We probably want to let people pick their repair strategy to begin with; this is going to hurt, disk-IO and space wise, the first time you do it on an already-existing 1-TB-per-node data set...
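The anticompact-then-promote flow in the steps above can be sketched as follows (function names like `anticompact` and `finish_repair` are illustrative, not Cassandra internals; sstables are modeled as simple key/value lists):

```python
def anticompact(rows, repaired_range):
    # Split one sstable's rows into a "being repaired" tmp sstable (keys in
    # the repaired token range) and a "not being repaired" tmp sstable.
    lo, hi = repaired_range
    in_range = [(k, v) for k, v in rows if lo <= k < hi]
    out_of_range = [(k, v) for k, v in rows if not (lo <= k < hi)]
    return in_range, out_of_range

def finish_repair(rows, repaired_range, succeeded):
    # Step 3: promote tmp sstables only on success; on failure the tmp
    # output is discarded and the original sstable survives untouched.
    repaired_tmp, unrepaired_tmp = anticompact(rows, repaired_range)
    if not succeeded:
        return {"unrepaired": [rows], "repaired": []}
    return {"unrepaired": [unrepaired_tmp], "repaired": [repaired_tmp]}

rows = [(1, "x"), (5, "y"), (9, "z")]
print(finish_repair(rows, (0, 6), True))   # keys 1 and 5 promoted as repaired
print(finish_repair(rows, (0, 6), False))  # original sstable kept as-is
```

The key property sketched here is crash safety: because nothing is promoted until the repair succeeds, a node that dies mid-repair simply discards the tmp output at startup.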
[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default
[ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692096#comment-13692096 ] Jeremiah Jordan edited comment on CASSANDRA-5351 at 6/24/13 4:08 PM: - I've been thinking about this issue this morning. Here are my current thoughts on how it could be accomplished: 1. Keep track, on a per-range basis, of the data that has been repaired in a given sstable. As new ranges are repaired, union them with the existing repaired ranges to update what has been repaired. 2. When sstables are compacted, take the intersection of the repaired ranges of the input sstables as the repaired ranges for the resulting sstable(s). 3. Do not compact tables that have never been repaired with tables that have had repairs done. This prevents new sstables from blowing away the fact that older tables are all repaired when intersecting ranges per step 2. 4. Make sure to mark sstables that are the result of streaming from repair as having been repaired. 5. Have repair skip sstables that have already been repaired on the specified range. I think with those 5 things this should be doable.
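The range bookkeeping in steps 1 and 2 above can be sketched with simple interval arithmetic (a minimal sketch assuming sorted, half-open, non-wrapping token ranges; real token ranges wrap around the ring):

```python
def union(ranges):
    # Step 1: merge overlapping/touching half-open (start, end) ranges.
    merged = []
    for a, b in sorted(ranges):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged

def intersect(r1, r2):
    # Step 2: only ranges repaired in *both* input sstables stay repaired
    # in the compacted output.
    out = []
    for a1, b1 in r1:
        for a2, b2 in r2:
            lo, hi = max(a1, a2), min(b1, b2)
            if lo < hi:
                out.append((lo, hi))
    return union(out)

# A new repair of (5, 25) is unioned into an sstable's repaired ranges.
repaired = union([(0, 10), (20, 30), (5, 25)])
print(repaired)                        # [(0, 30)]
# Compacting with an sstable repaired only on (0, 15) keeps the intersection.
print(intersect(repaired, [(0, 15)]))  # [(0, 15)]
```

Step 3 follows directly from this sketch: an sstable that has never been repaired has an empty range set, and intersecting anything with it yields the empty set, wiping out the repaired status of its compaction partners.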
[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default
[ https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13662593#comment-13662593 ] Jonathan Ellis edited comment on CASSANDRA-5351 at 5/21/13 2:17 AM: *All* nodes (including the coordinator) will only have portions repaired in the general case, since (a) the user can request a repair of an arbitrary range, and (b) even without that, repairing an entire vnode's range will still leave data from other vnodes unrepaired in the same sstables. So the two options that I see are (1) making ranges repaired, rather than sstables, or (2) anti-compacting repaired parts into new sstables.
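Option (2) above, combined with the issue description's requirement that compaction not mix repaired with unrepaired data, can be sketched as follows (a hypothetical sketch; per the eventual design, a compacted sstable keeps min(repairedAt) of its inputs, so mixing in anything unrepaired would reset the result to unrepaired):

```python
def compaction_buckets(sstables):
    # Segregate candidates so repaired and unrepaired sstables are never
    # compacted together.
    repaired = [s for s in sstables if s["repaired_at"] > 0]
    unrepaired = [s for s in sstables if s["repaired_at"] == 0]
    return repaired, unrepaired

def compacted_metadata(bucket):
    # The output sstable can only claim the oldest repair time of its
    # inputs; min() over anything unrepaired (repaired_at == 0) yields 0.
    return {"repaired_at": min(s["repaired_at"] for s in bucket)}

tables = [{"repaired_at": 100}, {"repaired_at": 0}, {"repaired_at": 50}]
rep, unrep = compaction_buckets(tables)
print(compacted_metadata(rep))  # {'repaired_at': 50}
print(len(unrep))               # 1
```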