[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2014-02-03 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13889494#comment-13889494
 ] 

Marcus Eriksson edited comment on CASSANDRA-5351 at 2/3/14 2:18 PM:


A more complete version is now pushed to 
https://github.com/krummas/cassandra/tree/marcuse/5351
Lots of testing is still required, but I think it is mostly 'feature-complete'.

Repair flow is now (a sketch of the coordinator side follows the list):
# Repair coordinator sends out Prepare messages to all neighbors.
# All involved parties figure out which sstables should be included in the 
repair: for a full repair, all sstables are included; otherwise only the ones 
with repairedAt set to 0. Note that we don't do any locking of the sstables - 
if they are gone when we do anticompaction, that is fine, we will repair them 
next round.
# Repair coordinator prepares itself, waits until all neighbors have prepared, 
and then sends out TreeRequests.
# All nodes calculate merkle trees based on the sstables picked in step #2.
# Coordinator waits for replies and then sends AnticompactionRequests to all 
nodes.
# If we are doing a full repair, we simply skip the anticompaction.
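
To make the ordering above concrete, here is a minimal, hypothetical sketch of 
the coordinator side of that flow. The class and method names (PrepareMessage, 
TreeRequest, AnticompactionRequest and the send* helpers) are placeholders for 
illustration only, not the actual Cassandra classes or messaging API.

{code}
// Hypothetical sketch of the coordinator side of the incremental repair flow
// described above; all names are illustrative placeholders.
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;

class RepairCoordinatorSketch
{
    final UUID parentSession = UUID.randomUUID();

    void runRepair(List<String> neighbors, boolean fullRepair)
    {
        // Steps 1-3: every participant (coordinator included) picks the sstables
        // to repair: all sstables for a full repair, only repairedAt == 0 otherwise.
        List<CompletableFuture<Void>> prepared = new ArrayList<>();
        for (String neighbor : neighbors)
            prepared.add(sendPrepare(neighbor, fullRepair));
        prepareLocally(fullRepair);
        CompletableFuture.allOf(prepared.toArray(new CompletableFuture[0])).join();

        // Steps 4-5: request merkle trees built only from the sstables picked above,
        // wait for all replies, then diff them and stream mismatching ranges.
        List<CompletableFuture<MerkleTree>> trees = new ArrayList<>();
        for (String neighbor : neighbors)
            trees.add(sendTreeRequest(neighbor));
        CompletableFuture.allOf(trees.toArray(new CompletableFuture[0])).join();
        // ... tree comparison and streaming elided ...

        // Step 6: only incremental repairs anticompact afterwards, splitting the
        // repaired ranges out of the sstables that took part in the repair.
        if (!fullRepair)
            for (String neighbor : neighbors)
                sendAnticompactionRequest(neighbor).join();
    }

    // Placeholders standing in for the real messaging and validation machinery.
    CompletableFuture<Void> sendPrepare(String node, boolean fullRepair) { return CompletableFuture.completedFuture(null); }
    CompletableFuture<MerkleTree> sendTreeRequest(String node) { return CompletableFuture.completedFuture(new MerkleTree()); }
    CompletableFuture<Void> sendAnticompactionRequest(String node) { return CompletableFuture.completedFuture(null); }
    void prepareLocally(boolean fullRepair) {}

    static class MerkleTree {}
}
{code}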

Notes:
* SSTables are tagged with repairedAt timestamps; compactions keep the 
min(repairedAt) of the included sstables (see the sketch after this list).
* nodetool repair defaults to the old behaviour. Use --incremental to get the 
new incremental repairs.
* anticompaction
  ** Splits an sstable into two new ones: one sstable with all keys that were 
in the repaired ranges and one with the unrepaired data.
  ** If the repaired ranges cover the entire sstable, we only rewrite the 
sstable metadata. This means that the optimal way to run incremental repairs is 
to not do partitioner range repairs etc.
* LCS
  ** We always first check whether there are any unrepaired sstables to do STCS 
on; if there are, we do that. The reasoning is that new data (which needs 
compaction) is unrepaired.
  ** We keep all sstables in the LeveledManifest, then filter out the 
unrepaired ones when getting compaction candidates etc.
* STCS
  ** Major compaction is done by taking the biggest set of sstables - so for a 
total major compaction, you will need to run nodetool compact twice.
  ** Minor compactions work the same way: the biggest set of sstables will be 
compacted.
* Streaming - A streamed SSTable keeps its repairedAt time.
* BulkLoader - Loaded sstables are unrepaired.
* Scrub - Sets repairedAt to UNREPAIRED - since we can drop rows during scrub, 
the new sstable is not repaired.
* Upgradesstables - Keeps the repaired status.
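
As a rough illustration of two of the points above - compactions keeping 
min(repairedAt), and anticompaction splitting an sstable into a repaired and an 
unrepaired part - here is a small hypothetical sketch. The types and method 
names are made up for the example and are not the real Cassandra classes.

{code}
// Hypothetical sketch of repairedAt bookkeeping: a compaction result is only as
// "repaired" as its least-repaired input, and anticompaction splits one sstable
// into a repaired and an unrepaired output. All types are illustrative only.
import java.util.ArrayList;
import java.util.List;

class RepairedAtSketch
{
    static final long UNREPAIRED = 0L;

    record Row(long token, byte[] data) {}
    record SSTable(List<Row> rows, long repairedAt) {}
    record TokenRange(long left, long right)
    {
        boolean contains(long token) { return token > left && token <= right; }
    }

    // Compaction keeps min(repairedAt) of the included sstables.
    static long repairedAtForCompaction(List<SSTable> inputs)
    {
        long min = Long.MAX_VALUE;
        for (SSTable sstable : inputs)
            min = Math.min(min, sstable.repairedAt());
        return min;
    }

    // Anticompaction: one input sstable, up to two outputs - keys inside the
    // repaired ranges go to a repaired sstable, everything else stays unrepaired.
    static List<SSTable> anticompact(SSTable input, List<TokenRange> repairedRanges, long repairedAt)
    {
        List<Row> repaired = new ArrayList<>();
        List<Row> unrepaired = new ArrayList<>();
        for (Row row : input.rows())
        {
            boolean inRepairedRange = repairedRanges.stream().anyMatch(r -> r.contains(row.token()));
            (inRepairedRange ? repaired : unrepaired).add(row);
        }
        List<SSTable> out = new ArrayList<>();
        if (!repaired.isEmpty())
            out.add(new SSTable(repaired, repairedAt));
        if (!unrepaired.isEmpty())
            out.add(new SSTable(unrepaired, UNREPAIRED));
        return out;
    }
}
{code}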



was (Author: krummas):
A more complete version is now pushed to 
https://github.com/krummas/cassandra/tree/marcuse/5351
Lots of testing is still required, but I think it is mostly 'feature-complete'.

Repair flow is now:
# Repair coordinator sends out Prepare messages to all neighbors.
# All involved parties figure out which sstables should be included in the 
repair: for a full repair, all sstables are included; otherwise only the ones 
with repairedAt set to 0. Note that we don't do any locking of the sstables - 
if they are gone when we do anticompaction, that is fine, we will repair them 
next round.
# Repair coordinator prepares itself, waits until all neighbors have prepared, 
and then sends out TreeRequests.
# All nodes calculate merkle trees based on the sstables picked in step #2.
# Coordinator waits for replies and then sends AnticompactionRequests to all 
nodes.
# If we are doing a full repair, we simply skip the anticompaction.

Notes:
* SSTables are tagged with repairedAt timestamps; compactions keep the 
min(repairedAt) of the included sstables.
* nodetool repair defaults to the old behaviour. Use --incremental to get the 
new incremental repairs.
* anticompaction
  - Splits an sstable into two new ones: one sstable with all keys that were in 
the repaired ranges and one with the unrepaired data.
  - If the repaired ranges cover the entire sstable, we only rewrite the 
sstable metadata. This means that the optimal way to run incremental repairs is 
to not do partitioner range repairs etc.
* Compaction
  * LCS
    - We always first check whether there are any unrepaired sstables to do 
STCS on; if there are, we do that. The reasoning is that new data (which needs 
compaction) is unrepaired.
    - We keep all sstables in the LeveledManifest, then filter out the 
unrepaired ones when getting compaction candidates etc.
  * STCS
    - Major compaction is done by taking the biggest set of sstables - so for a 
total major compaction, you will need to run nodetool compact twice.
    - Minor compactions work the same way: the biggest set of sstables will be 
compacted.
* Streaming - A streamed SSTable keeps its repairedAt time.
* BulkLoader - Loaded sstables are unrepaired.
* Scrub - Sets repairedAt to UNREPAIRED - since we can drop rows during repair, 
the new sstable is not repaired.
* Upgradesstables - Keeps the repaired status.


 Avoid repairing already-repaired data by default
 

 Key: 

[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2014-02-03 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13889494#comment-13889494
 ] 

Marcus Eriksson edited comment on CASSANDRA-5351 at 2/3/14 2:30 PM:


A more complete version is now pushed to 
https://github.com/krummas/cassandra/tree/marcuse/5351
Lots of testing is still required, but I think it is mostly 'feature-complete'.

Repair flow is now:
# Repair coordinator sends out Prepare messages to all neighbors.
# All involved parties figure out which sstables should be included in the 
repair: for a full repair, all sstables are included; otherwise only the ones 
with repairedAt set to 0. Note that we don't do any locking of the sstables - 
if they are gone when we do anticompaction, that is fine, we will repair them 
next round.
# Repair coordinator prepares itself, waits until all neighbors have prepared, 
and then sends out TreeRequests.
# All nodes calculate merkle trees based on the sstables picked in step #2.
# Coordinator waits for replies and then sends AnticompactionRequests to all 
nodes.
# If we are doing a full repair, we simply skip the anticompaction.

Notes:
* SSTables are tagged with repairedAt timestamps; compactions keep the 
min(repairedAt) of the included sstables.
* nodetool repair defaults to the old behaviour. Use --incremental to get the 
new incremental repairs.
* anticompaction
  ** Splits an sstable into two new ones: one sstable with all keys that were 
in the repaired ranges and one with the unrepaired data.
  ** If the repaired ranges cover the entire sstable, we only rewrite the 
sstable metadata. This means that the optimal way to run incremental repairs is 
to not do partitioner range repairs etc.
* LCS
  ** We always first check whether there are any unrepaired sstables to do STCS 
on; if there are, we do that (see the sketch after this list). The reasoning is 
that new data (which needs compaction) is unrepaired.
  ** We keep all sstables in the LeveledManifest, then filter out the 
unrepaired ones when getting compaction candidates etc.
* STCS
  ** Major compaction is done by taking the biggest set of sstables - so for a 
total major compaction, you will need to run nodetool compact twice.
  ** Minor compactions work the same way: the biggest set of sstables will be 
compacted.
* Streaming - A streamed SSTable keeps its repairedAt time.
* BulkLoader - Loaded sstables are unrepaired.
* Scrub - Sets repairedAt to UNREPAIRED - since we can drop rows during scrub, 
the new sstable is not repaired.
* Upgradesstables - Keeps the repaired status.
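
As a rough sketch of the LCS point above (unrepaired sstables get size-tiered 
compaction first, since new data is unrepaired), the selection order being 
described looks roughly like the following. The interfaces and method names are 
illustrative assumptions, not the real compaction strategy API.

{code}
// Hypothetical sketch of the LCS behaviour described above: look for size-tiered
// work among the unrepaired sstables first, and only fall back to the normal
// leveled candidates (repaired data) if there is none.
import java.util.List;
import java.util.stream.Collectors;

class LeveledWithUnrepairedSketch
{
    interface SSTable { boolean isRepaired(); }
    interface CompactionTask {}

    CompactionTask nextBackgroundTask(List<SSTable> allSSTables)
    {
        // New data (which is what actually needs compaction) is unrepaired,
        // so check for size-tiered work on the unrepaired set first.
        List<SSTable> unrepaired = allSSTables.stream()
                                              .filter(s -> !s.isRepaired())
                                              .collect(Collectors.toList());
        CompactionTask stcs = sizeTieredTask(unrepaired);
        if (stcs != null)
            return stcs;

        // Otherwise compact the repaired sstables with the normal leveled logic.
        List<SSTable> repaired = allSSTables.stream()
                                            .filter(SSTable::isRepaired)
                                            .collect(Collectors.toList());
        return leveledTask(repaired);
    }

    // Placeholders for the real strategy implementations.
    CompactionTask sizeTieredTask(List<SSTable> sstables) { return null; }
    CompactionTask leveledTask(List<SSTable> sstables) { return null; }
}
{code}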



was (Author: krummas):
A more complete version is now pushed to 
https://github.com/krummas/cassandra/tree/marcuse/5351
Lots of testing is still required, but I think it is mostly 'feature-complete'.

Repair flow is now:
# Repair coordinator sends out Prepare messages to all neighbors.
# All involved parties figure out which sstables should be included in the 
repair: for a full repair, all sstables are included; otherwise only the ones 
with repairedAt set to 0. Note that we don't do any locking of the sstables - 
if they are gone when we do anticompaction, that is fine, we will repair them 
next round.
# Repair coordinator prepares itself, waits until all neighbors have prepared, 
and then sends out TreeRequests.
# All nodes calculate merkle trees based on the sstables picked in step #2.
# Coordinator waits for replies and then sends AnticompactionRequests to all 
nodes.
# If we are doing a full repair, we simply skip the anticompaction.

Notes:
* SSTables are tagged with repairedAt timestamps; compactions keep the 
min(repairedAt) of the included sstables.
* nodetool repair defaults to the old behaviour. Use --incremental to get the 
new incremental repairs.
* anticompaction
  ** Splits an sstable into two new ones: one sstable with all keys that were 
in the repaired ranges and one with the unrepaired data.
  ** If the repaired ranges cover the entire sstable, we only rewrite the 
sstable metadata. This means that the optimal way to run incremental repairs is 
to not do partitioner range repairs etc.
* LCS
  ** We always first check whether there are any unrepaired sstables to do STCS 
on; if there are, we do that. The reasoning is that new data (which needs 
compaction) is unrepaired.
  ** We keep all sstables in the LeveledManifest, then filter out the 
unrepaired ones when getting compaction candidates etc.
* STCS
  ** Major compaction is done by taking the biggest set of sstables - so for a 
total major compaction, you will need to run nodetool compact twice.
  ** Minor compactions work the same way: the biggest set of sstables will be 
compacted.
* Streaming - A streamed SSTable keeps its repairedAt time.
* BulkLoader - Loaded sstables are unrepaired.
* Scrub - Sets repairedAt to UNREPAIRED - since we can drop rows during repair, 
the new sstable is not repaired.
* Upgradesstables - Keeps the repaired status.


 Avoid repairing already-repaired data by default
 

 Key: CASSANDRA-5351

[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2014-02-03 Thread Yuki Morishita (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13890438#comment-13890438
 ] 

Yuki Morishita edited comment on CASSANDRA-5351 at 2/4/14 6:20 AM:
---

bq. Dropping sstable to UNREPAIRED during major compaction means that all 
repaired data status is cleared for the node.

That's what I meant. The current major compaction produces one SSTable, and I 
think changing that behavior might confuse users. My opinion is to keep it as 
is.

Additional review comments:

* Does PrepareMessage need to carry around dataCenters? Only the coordinator 
sends out messages, so I think you can drop it (also from ParentRepairSession).
* The CF ID is preferred over the keyspace name / CF name pair.
* PrepareMessage is sent per CF, but that can produce a lot of round trips. 
Isn't one message per replica node enough?
* I think we need to clean up parentRepairSessions when something bad happens; 
otherwise the ParentRepairSession entries in the map keep references to 
SSTables (a sketch of such cleanup follows below).

I just worked on the first one above and the commit is here (on top of your 
branch): 
https://github.com/yukim/cassandra/commit/7c65e532dd69f9f4c1ea2d3fdf0401ed70291361
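
On the last review point, a minimal sketch of the kind of failure-path cleanup 
meant there might look like the following. Only the names parentRepairSessions 
and ParentRepairSession come from the comment above; the surrounding method and 
types are hypothetical.

{code}
// Hypothetical sketch: always remove a parent repair session from the map, even
// on failure, so it stops holding references to the selected sstables.
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class ParentSessionCleanupSketch
{
    static class ParentRepairSession { void releaseReferences() { /* drop sstable refs */ } }

    final Map<UUID, ParentRepairSession> parentRepairSessions = new ConcurrentHashMap<>();

    void runRepair(UUID parentSessionId, ParentRepairSession session)
    {
        parentRepairSessions.put(parentSessionId, session);
        try
        {
            // ... validation, streaming, anticompaction ...
        }
        finally
        {
            // Clean up even when something bad happened; otherwise the session
            // (and the sstables it references) would be retained forever.
            ParentRepairSession removed = parentRepairSessions.remove(parentSessionId);
            if (removed != null)
                removed.releaseReferences();
        }
    }
}
{code}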



was (Author: yukim):
bq. Dropping sstable to UNREPAIRED during major compaction means that all 
repaired data status is cleared for the node.

That's what I meant. Current major compaction produces one SSTable and I think 
changing that behavior would confuse users, maybe. My opinion is to keep it as 
is, but .

Additional review comments:

* Does PrepareMessage needs to carry around dataCenters? Only coordinator sends 
out messages so I think you can drop it(also from ParentRepairSession).
* CF ID is preferred to use over Keyspace name/CF name pair.
* PrepareMessage is sent per CF but it can produce a lot of round trip. Isn't 
one message per replica node enough?
* I think we need clean up for parentRepairSessions when something bad 
happened. Otherwise ParentRepairSession in the map keep reference to SSTables.

I just worked on the first one above and the commit is here(on top of your 
branch): 
https://github.com/yukim/cassandra/commit/7c65e532dd69f9f4c1ea2d3fdf0401ed70291361


 Avoid repairing already-repaired data by default
 

 Key: CASSANDRA-5351
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
 Project: Cassandra
  Issue Type: Task
  Components: Core
Reporter: Jonathan Ellis
Assignee: Lyuben Todorov
  Labels: repair
 Fix For: 2.1

 Attachments: 5351_node1.log, 5351_node2.log, 5351_node3.log, 
 5351_nodetool.log


 Repair has always built its merkle tree from all the data in a columnfamily, 
 which is guaranteed to work but is inefficient.
 We can improve this by remembering which sstables have already been 
 successfully repaired, and only repairing sstables new since the last repair. 
  (This automatically makes CASSANDRA-3362 much less of a problem too.)
 The tricky part is, compaction will (if not taught otherwise) mix repaired 
 data together with non-repaired.  So we should segregate unrepaired sstables 
 from the repaired ones.





[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2014-01-17 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13875155#comment-13875155
 ] 

Marcus Eriksson edited comment on CASSANDRA-5351 at 1/17/14 7:30 PM:
-

Just pushed a work-in-progress branch with a bunch of updates: 
https://github.com/krummas/cassandra/tree/marcuse/5351

* It locks the sstables for the whole duration of the repair; it might be 
better to do best-effort here, i.e. remember which sstables we calculated 
merkle trees on, and then if an sstable is gone during anticompaction we don't 
mark that data as repaired - we will catch it in the next repair round instead.
* Adds a test for anticompaction.
* For LCS, we always first check if there are STCS compactions to do on the 
unrepaired data, the reasoning being that new data is unrepaired, so if you 
have flushed 4 sstables, there will be an STCS compaction.
* For STCS, we just pick the one with the most sstables; probably not the best 
way, but we can tweak that heuristic later.

[~lyubent] - I think the bug you had above was because you didn't check whether 
markCompacting was successful in AnticompactionSession.lockTables().
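
A small, hypothetical sketch of that last point follows - the idea being that 
sstables only count as locked for anticompaction if markCompacting actually 
succeeded for them. The markCompacting and lockTables names come from the 
comment above; the signatures and surrounding types are assumptions.

{code}
// Hypothetical sketch: skip the anticompaction candidates for this round if
// they could not all be marked as compacting.
import java.util.HashSet;
import java.util.Set;

class AnticompactionLockSketch
{
    interface SSTable {}

    interface Tracker
    {
        // Assumed to return true only if none of the candidates were already
        // being compacted and all of them could be marked.
        boolean markCompacting(Set<SSTable> candidates);
    }

    Set<SSTable> lockTables(Tracker tracker, Set<SSTable> candidates)
    {
        Set<SSTable> locked = new HashSet<>(candidates);
        if (!tracker.markCompacting(locked))
        {
            // Another compaction grabbed (some of) these sstables first; skip
            // them this round instead of anticompacting them anyway.
            return new HashSet<>();
        }
        return locked;
    }
}
{code}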

It works, but at least these things are still needed:
* Fix streaming - keep the repairedAt times etc.
* Make full repairs work somehow (anticompacting the whole dataset is probably 
not preferable).
* Timeouts - if we lock all unrepaired sstables, we must handle stalled repairs 
somehow.
* Make anticompaction smarter - if we have 10 sstables that we should 
anticompact, why not combine anticompaction with compaction and create two new 
sstables, one with repaired data and one with unrepaired.
* More tests, cleanups, refactoring.


was (Author: krummas):
Just pushed a work-in-progress-branch with a bunch of updates: 
https://github.com/krummas/cassandra/tree/marcuse/5351 

* It locks the sstables for the whole duration of the repair, might be better 
to do a best-effort here, ie, remember what sstables we calculated merkle trees 
on
* Adds a test for anticompaction
* For LCS, we always first check if there are STCS-compactions to do on the 
unrepaired data, reasoning being that new data is unrepaired, so if you have 
flushed 4 sstables, there will be a stcs compaction.
* For STCS, we just pick the one with the most sstables, probably not the best 
way, but we can tweak that heuristic later.

[~lyubent] - I think the bug you had above was because you didn't check if 
markCompacting was successful or not in AnticompactionSession.lockTables()

It works but atleast these things are needed;
* Fix streaming, keep the repairedAt times etc.
* Make full repairs work somehow (anticompacting the whole dataset is probably 
not preferable)
* Timeouts, if we lock all unrepaired sstables, we must handle stalled repairs 
somehow.
* Make anticompaction smarter, if we have 10 sstables that we should 
anticompact, why not combine anticompaction with compaction and create 2 new 
sstables, one with repaired data and one with unrepaired
* More tests, cleanups, refactor

 Avoid repairing already-repaired data by default
 

 Key: CASSANDRA-5351
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
 Project: Cassandra
  Issue Type: Task
  Components: Core
Reporter: Jonathan Ellis
Assignee: Lyuben Todorov
  Labels: repair
 Fix For: 2.1

 Attachments: 5351_node1.log, 5351_node2.log, 5351_node3.log, 
 5351_nodetool.log


 Repair has always built its merkle tree from all the data in a columnfamily, 
 which is guaranteed to work but is inefficient.
 We can improve this by remembering which sstables have already been 
 successfully repaired, and only repairing sstables new since the last repair. 
  (This automatically makes CASSANDRA-3362 much less of a problem too.)
 The tricky part is, compaction will (if not taught otherwise) mix repaired 
 data together with non-repaired.  So we should segregate unrepaired sstables 
 from the repaired ones.





[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2013-12-04 Thread Lyuben Todorov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838926#comment-13838926
 ] 

Lyuben Todorov edited comment on CASSANDRA-5351 at 12/4/13 2:40 PM:


Added checks to LeveledManifest#replace and LeveledManifest#add to ensure 
un-repaired data is kept at L0 ([patch 
here|https://github.com/lyubent/cassandra/commit/70f63e577f531f904997934c53022c1d6a94b9f3]).
The out-of-order key error is still a problem; the logs show the same error as 
in the comment above. One thing that I haven't accounted for so far is sstables 
being added straight to levels higher than L0 - is it possible for newly 
flushed data to go straight to a level above L0?
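
For context, the kind of check being described could look roughly like this. 
LeveledManifest#add is the real method under discussion, but the body below is 
a simplified, hypothetical illustration rather than the actual patch.

{code}
// Hypothetical, simplified illustration of forcing unrepaired sstables to stay
// in L0 when they are added to the leveled manifest. Not the actual patch.
import java.util.ArrayList;
import java.util.List;

class LeveledManifestSketch
{
    interface SSTable
    {
        boolean isRepaired();  // repairedAt != 0
        int level();           // level recorded in the sstable metadata
    }

    private final List<List<SSTable>> generations = new ArrayList<>();

    void add(SSTable sstable)
    {
        // Unrepaired data must stay in L0 regardless of the level recorded in
        // its metadata; only repaired sstables keep their original level.
        int level = sstable.isRepaired() ? sstable.level() : 0;
        while (generations.size() <= level)
            generations.add(new ArrayList<>());
        generations.get(level).add(sstable);
    }
}
{code}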


was (Author: lyubent):
Added checks to LeveledManifest#replace and LeveledManifest#add to ensure 
un-repaired data is kept at L0 ([patch 
here|https://github.com/lyubent/cassandra/commit/70f63e577f531f904997934c53022c1d6a94b9f3])
 The out-of-order key error is still a problem, logs show same error as above 
comment. 

 Avoid repairing already-repaired data by default
 

 Key: CASSANDRA-5351
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
 Project: Cassandra
  Issue Type: Task
  Components: Core
Reporter: Jonathan Ellis
Assignee: Lyuben Todorov
  Labels: repair
 Fix For: 2.1


 Repair has always built its merkle tree from all the data in a columnfamily, 
 which is guaranteed to work but is inefficient.
 We can improve this by remembering which sstables have already been 
 successfully repaired, and only repairing sstables new since the last repair. 
  (This automatically makes CASSANDRA-3362 much less of a problem too.)
 The tricky part is, compaction will (if not taught otherwise) mix repaired 
 data together with non-repaired.  So we should segregate unrepaired sstables 
 from the repaired ones.





[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2013-11-26 Thread Lyuben Todorov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13832570#comment-13832570
 ] 

Lyuben Todorov edited comment on CASSANDRA-5351 at 11/26/13 1:27 PM:
-

After [this 
commit|https://github.com/lyubent/cassandra/commit/903e416539cdde78514850bda25076f3f2fc57ec]
 to keep unrepaired data at L0 repairs start failing to validate after a few 
inserts and compactions. The stack trace from the error in each node is below 
(3 node ccm cluster was used here with the repair being issued to node 2).

{code}
 INFO 15:19:10,321 Starting repair command #3, repairing 2 ranges for keyspace 
test
 INFO 15:19:10,322 [repair #55f4d610-569d-11e3-b553-975f903ccf5a] new session: 
will sync /127.0.0.2, /127.0.0.3 on range 
(-9223372036854775808,-3074457345618258603] for test.[lvl]
 INFO 15:19:10,325 Handshaking version with /127.0.0.3
 INFO 15:19:10,343 [repair #55f4d610-569d-11e3-b553-975f903ccf5a] requesting 
merkle trees for lvl (to [/127.0.0.3, /127.0.0.2])
 INFO 15:19:11,493 [repair #55f4d610-569d-11e3-b553-975f903ccf5a] Received 
merkle tree for lvl from /127.0.0.3
ERROR 15:19:16,138 Failed creating a merkle tree for [repair 
#55f4d610-569d-11e3-b553-975f903ccf5a on test/lvl, 
(-9223372036854775808,-3074457345618258603]], /127.0.0.2 (see log for details)
ERROR 15:19:16,138 Exception in thread Thread[ValidationExecutor:2,1,main]
java.lang.AssertionError: row DecoratedKey(-9223264645216044815, 
73636c744c546e56534c4741775141) received out of order wrt 
DecoratedKey(-3331959603918038206, 685863786a586464616b794f597075)
at org.apache.cassandra.repair.Validator.add(Validator.java:136)
at 
org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:820)
at 
org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:61)
at 
org.apache.cassandra.db.compaction.CompactionManager$8.call(CompactionManager.java:417)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
ERROR 15:19:16,139 [repair #55f4d610-569d-11e3-b553-975f903ccf5a] session 
completed with the following error
org.apache.cassandra.exceptions.RepairException: [repair 
#55f4d610-569d-11e3-b553-975f903ccf5a on test/lvl, 
(-9223372036854775808,-3074457345618258603]] Validation failed in /127.0.0.2
at 
org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:152)
at 
org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:212)
at 
org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:91)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
 INFO 15:19:16,138 Range (3074457345618258602,-9223372036854775808] has already 
been repaired. Skipping repair.
S.AD: lvl repairedAt: 1385466263703792000
ERROR 15:19:16,139 Exception in thread Thread[AntiEntropySessions:5,5,RMI 
Runtime]
java.lang.RuntimeException: org.apache.cassandra.exceptions.RepairException: 
[repair #55f4d610-569d-11e3-b553-975f903ccf5a on test/lvl, 
(-9223372036854775808,-3074457345618258603]] Validation failed in /127.0.0.2
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.cassandra.exceptions.RepairException: [repair 
#55f4d610-569d-11e3-b553-975f903ccf5a on test/lvl, 
(-9223372036854775808,-3074457345618258603]] Validation failed in /127.0.0.2
at 
org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:152)
at 
org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:212)
at 
org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:91)
at 

[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2013-11-14 Thread Lyuben Todorov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13822426#comment-13822426
 ] 

Lyuben Todorov edited comment on CASSANDRA-5351 at 11/14/13 1:37 PM:
-

I think I misunderstood Marcus' idea. If we keep all the un-repaired data at L0 
and promote it to L1 once repaired, would that mean overriding what currently 
happens when there are 10 sstables at L0 (where they get promoted), or is that 
where triggering automatic repairs would come into play?


was (Author: lyubent):
I think I misunderstood Marcus' idea. If we keep all the un-repaired data at L0 
and promote it to L1 once repaired, would that mind overriding what currently 
happens when there are 10 sstables at L0 (where they get promoted), or is that 
where triggering automatic repairs would come in play?

 Avoid repairing already-repaired data by default
 

 Key: CASSANDRA-5351
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
 Project: Cassandra
  Issue Type: Task
  Components: Core
Reporter: Jonathan Ellis
Assignee: Lyuben Todorov
  Labels: repair
 Fix For: 2.1


 Repair has always built its merkle tree from all the data in a columnfamily, 
 which is guaranteed to work but is inefficient.
 We can improve this by remembering which sstables have already been 
 successfully repaired, and only repairing sstables new since the last repair. 
  (This automatically makes CASSANDRA-3362 much less of a problem too.)
 The tricky part is, compaction will (if not taught otherwise) mix repaired 
 data together with non-repaired.  So we should segregate unrepaired sstables 
 from the repaired ones.





[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2013-11-14 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13822771#comment-13822771
 ] 

Jonathan Ellis edited comment on CASSANDRA-5351 at 11/14/13 6:57 PM:
-

I think everyone agrees that repaired data should have a leveling system 
exactly like current LCS.  Freshly repaired sstables (streamed from other 
replicas) start in L0 and get leveled from there.

The tricky part is, what do we do about the unrepaired data?  The two proposals 
are:

# Keep a separate unrepaired arena -- let's call it L0' to distinguish from the 
repaired L0 -- and perform STCS in it the way we do with an overflowing L0 
currently.  Once repaired we move sstables into L0 proper.
# Keep a complete extra set of levels for unrepaired data, L0', L1', ..., LN' 
and perform leveling on these, separate from the repaired levels.  Once 
repaired, data from here will be dropped into repaired L0.

The downside to 1 is that you don't get full leveling benefits until it's 
repaired.  The downside to 2 is that LCS already has super high write 
amplification properties (relative to STCS) so doubling that is going to be 
even more painful.

Both downsides get mitigated by repairing more often.

Having written that out ... I'd lean towards the STCS option because it's so 
much simpler to implement.


was (Author: jbellis):
I think everyone agrees that repaired data should have a leveling system 
exactly like current LCS.  Freshly repaired sstables (streamed from other 
replicas) start in L0 and get leveled from there.

The tricky part is, what do we do about the unrepaired data?  The two proposals 
are:

# Keep a separate unrepaired arena -- let's call it L0' to distinguish from the 
repaired L0 -- and perform STCS in it the way we do with an overflowing L0 
currently.  Once repaired we move sstables into L0 proper.
# Keep a complete extra set of levels for unrepaired data, L0', L1', ..., LN'.  
Once repaired, data from here will be dropped into repaired L0.

The downside to 1 is that you don't get full leveling benefits until it's 
repaired.  The downside to 2 is that LCS already has super high write 
amplification properties (relative to STCS) so doubling that is going to be 
even more painful.

Both downsides get mitigated by repairing more often.

Having written that out ... I'd lean towards the STCS option because it's so 
much simpler to implement.

 Avoid repairing already-repaired data by default
 

 Key: CASSANDRA-5351
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
 Project: Cassandra
  Issue Type: Task
  Components: Core
Reporter: Jonathan Ellis
Assignee: Lyuben Todorov
  Labels: repair
 Fix For: 2.1


 Repair has always built its merkle tree from all the data in a columnfamily, 
 which is guaranteed to work but is inefficient.
 We can improve this by remembering which sstables have already been 
 successfully repaired, and only repairing sstables new since the last repair. 
  (This automatically makes CASSANDRA-3362 much less of a problem too.)
 The tricky part is, compaction will (if not taught otherwise) mix repaired 
 data together with non-repaired.  So we should segregate unrepaired sstables 
 from the repaired ones.





[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2013-10-15 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13795120#comment-13795120
 ] 

Jason Brown edited comment on CASSANDRA-5351 at 10/15/13 12:31 PM:
---

Interesting ideas here. However, here are some problems off the top of my head 
that need to be addressed (in no particular order):

* Nodes that are fully replaced (see: Netflix running in the cloud). When a 
node is replaced, we bootstrap the node by streaming data from the closest 
peers (usually) in the local DC. The new node would not have anti-compacted 
sstables, as it has never had a chance to repair. I'm not sure if bootstrapped 
data can be considered anti-compacted through commutativity; it might be true, 
but I'd need to think about it more. Assuming not, when this new node is 
involved in any repair, it would generate a different MT than its 
already-repaired peers, and thus all hell would break loose streaming 
already-repaired data to every node involved in the repair - worse than today's 
repair (think streaming TBs of data across multiple Amazon datacenters). If we 
can prove that the new node's data is commutatively repaired just by bootstrap, 
then this is not a problem as such. Note this also affects move (to a lesser 
degree) and rebuild.
* Consider nodes A, B, and C. If nodes A and B successfully repair, but C fails 
to repair with them (due to partitioning, app crash, etc.) during the repair, C 
is forced to do an -ipr repair, as A and B have already anti-compacted and that 
is the only way C will be able to repair against A and B.
* If the operator chooses to cancel the repair, we are left in an indeterminate 
state wrt which node has successfully completed repairs with another (similar 
to the last point).
* Local DC repair vs. global is largely incompatible with this. It looks like 
you will get one shot with each sstable's range for repair, so if you choose to 
do a local DC repair with an sstable, you are forced to do -ipr if you later 
want to repair globally.

Note that these problems are magnified immensely when you run in multiple 
datacenters, especially datacenters separated by great distances.

While none of these situations is unsolvable, it seems that there are many 
non-obvious ways in which we can get into a non-deterministic state where 
operators will either see tons of data being streamed (due to anti-compaction 
points being different) or will be forced to run -ipr without an easily 
understood reason. I already see operators terminate repair jobs because they 
hang or take too long, for better or worse (mostly worse). At that point, the 
operator is pretty much required to do an -ipr repair, which gets us back into 
the same situation we are in today, but with more confusion and possibly using 
-ipr as the default.

It would probably be good to run -ipr as a best practice anyway every n 
days/weeks/months, to help with bit rot. I worry about the very non-obvious 
edge cases this ticket introduces and the possibility that operators will 
simply fall back to using -ipr whenever something goes bump or doesn't make 
sense.

Thanks for listening.


was (Author: jasobrown):
Interesting ideas here. However, here's some problems off the top of my head 
that need to be addressed (in no order):

* Nodes that are fully replaced (see: Netflix running in the cloud). When a 
node is replaced, we bootstrap the node by streaming data from closest peers 
(usually) in the local DC. The new node would not have anti-compacted sstables, 
as it's never had a chance to repair. I'm not sure if bootstrapping data can be 
considered anti-compacted through cummutativity; it might be true, but I'd need 
to think about it more. Asuuming not, when this new node is involved in any 
repair, it would generate a different MT than it's already repaired peers, and 
thus all hell would break loose streaming already repaired data to every node 
involved in the repair, worse than today's repair (think streaming TBs of data 
across multiple amazon datacenters). If we can prove that the new node's data 
is commutatively repaired just by bootstrap, then this is not a problem as 
such. Note this also affects move (to a lesser degree) and rebuild.
* Consider nodes A, B, and C. If nodes A and B successfully repair, but C fails 
to repair with them (due to partitioning, app crash, etc) during the repair. C 
is forced to do an -ipr repair as A and B have already anti-compacted and that 
is the only way C will be able to repair against A and B. 
* If the operator chooses to cacncel the repair, we are left at an indetermant 
state wrt which node has successfully completed repairs with another (similar 
to last point).
* Local DC repair vs. global is largely incompatible with this. Looks like you 
will get one shot with each sstable's range for repair, so if you choose do 
local DC repair with an ssttable, you are forced to do 

[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2013-09-20 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13773150#comment-13773150
 ] 

Jonathan Ellis edited comment on CASSANDRA-5351 at 9/20/13 4:46 PM:


bq. The more often you repair the less big a full separate set of levels for 
unrepaired data would be. So maybe that's the way to go.

Which is to say, we'd be kicking repairs off as automatically as we currently 
kick off compaction.

I still don't have any better ideas.  [~krummas]?

  was (Author: jbellis):
bq. I think it would be simpler to anticompact after repair

This is straightforward for STCS (bucket repaired/non-repaired separately) but 
less so for LCS.

Now that we're already doing STCS in L0, I suggest extending that here: reserve 
the levels for repaired data, and STCS until we can repair.

This implies making repair as automatic as compaction, which is a big change 
for us.  I think it's a lot more user friendly, but I'm not 100% confident the 
performance impact will be negligible.  Any better ideas?
  
 Avoid repairing already-repaired data by default
 

 Key: CASSANDRA-5351
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
 Project: Cassandra
  Issue Type: Task
  Components: Core
Reporter: Jonathan Ellis
Assignee: Lyuben Todorov
  Labels: repair
 Fix For: 2.1


 Repair has always built its merkle tree from all the data in a columnfamily, 
 which is guaranteed to work but is inefficient.
 We can improve this by remembering which sstables have already been 
 successfully repaired, and only repairing sstables new since the last repair. 
  (This automatically makes CASSANDRA-3362 much less of a problem too.)
 The tricky part is, compaction will (if not taught otherwise) mix repaired 
 data together with non-repaired.  So we should segregate unrepaired sstables 
 from the repaired ones.



[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2013-08-09 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13735160#comment-13735160
 ] 

Jonathan Ellis edited comment on CASSANDRA-5351 at 8/9/13 7:17 PM:
---

bq. Given our setup we would need to repair once a day which wouldn't fly

Given the tiny amounts of data being repaired, I think you could get down to 
hourly.

But, the more often you repair the less big a full separate set of levels for 
unrepaired data would be.  So maybe that's the way to go.

  was (Author: jbellis):
bq. Given our setup we would need to repair once a day which wouldn't fly

Given the tiny amounts of data being repaired, I think you could get down to 
hourly.
  
 Avoid repairing already-repaired data by default
 

 Key: CASSANDRA-5351
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
 Project: Cassandra
  Issue Type: Task
  Components: Core
Reporter: Jonathan Ellis
  Labels: repair
 Fix For: 2.1


 Repair has always built its merkle tree from all the data in a columnfamily, 
 which is guaranteed to work but is inefficient.
 We can improve this by remembering which sstables have already been 
 successfully repaired, and only repairing sstables new since the last repair. 
  (This automatically makes CASSANDRA-3362 much less of a problem too.)
 The tricky part is, compaction will (if not taught otherwise) mix repaired 
 data together with non-repaired.  So we should segregate unrepaired sstables 
 from the repaired ones.



[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2013-07-16 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13710131#comment-13710131
 ] 

Jonathan Ellis edited comment on CASSANDRA-5351 at 7/16/13 7:37 PM:


Right.  More precisely, the idea is to build a merkle tree only for 
new-since-last-repair data.

  was (Author: jbellis):
Right.  More precisely, the idea is to reconstruct the merkle tree only for 
new-since-last-repair data.
  
 Avoid repairing already-repaired data by default
 

 Key: CASSANDRA-5351
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
 Project: Cassandra
  Issue Type: Task
  Components: Core
Reporter: Jonathan Ellis
  Labels: repair
 Fix For: 2.1


 Repair has always built its merkle tree from all the data in a columnfamily, 
 which is guaranteed to work but is inefficient.
 We can improve this by remembering which sstables have already been 
 successfully repaired, and only repairing sstables new since the last repair. 
  (This automatically makes CASSANDRA-3362 much less of a problem too.)
 The tricky part is, compaction will (if not taught otherwise) mix repaired 
 data together with non-repaired.  So we should segregate unrepaired sstables 
 from the repaired ones.



[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2013-07-03 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13699687#comment-13699687
 ] 

Jeremiah Jordan edited comment on CASSANDRA-5351 at 7/4/13 1:35 AM:


Anti-compaction sounds like it could work.
Then you really do just need an "am I repaired" flag, because during repair you 
anti-compact into repaired and not-repaired data.
So something like:
1. Calculate merkle trees, anti-compacting each sstable into "being repaired" 
and "not being repaired" tmp sstables during the process.  Set a flag in the 
"being repaired" sstables to show them as repaired.
2. Perform the merkle exchange/streaming, and flag tmp sstables coming in from 
streaming as repaired.
3. When the repair is done, convert all tmp sstables into real ones, and delete 
the originals.

SSTables involved in the repair would be marked as already compacting so they 
won't participate in compaction during the repair.

Since you don't promote from tmp to real until the repair completes 
successfully, if the node dies in the middle of the repair, all the tmp 
sstables will just be removed at startup (a sketch of this follows below).

Then only compact like sstables, so there will be two sets of sstables: fully 
repaired and not repaired at all.

This is going to use a lot of disk I/O for all the anti-compaction, but as long 
as you run repair a lot it shouldn't be too bad, since it is cheap after the 
first time.  We probably want to let people pick their repair strategy to begin 
with - this is going to hurt, disk-I/O- and space-wise, the first time you do 
it on an already existing data set of 1 TB per node...
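
A minimal sketch of the tmp-then-promote idea follows. The file-naming scheme, 
the promote step, and the startup cleanup are all hypothetical, purely to make 
the crash behaviour concrete.

{code}
// Hypothetical sketch of "write tmp sstables, promote only when the repair has
// finished": a crash before promotion leaves only tmp files behind, which are
// deleted on startup, so a half-finished repair never marks data as repaired.
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class TmpSSTablePromotionSketch
{
    // Step 3: once every range has been validated and streamed cleanly, make the
    // tmp sstables real (the old originals are deleted by the caller).
    static void promote(List<Path> tmpSSTables) throws IOException
    {
        for (Path tmp : tmpSSTables)
        {
            String finalName = tmp.getFileName().toString().replace("-tmp-", "-");
            Files.move(tmp, tmp.resolveSibling(finalName));
        }
    }

    // On startup, any tmp sstables left over from a repair that died mid-way are
    // simply removed; the original (unrepaired) sstables are still intact.
    static void cleanupOnStartup(Path dataDirectory) throws IOException
    {
        try (DirectoryStream<Path> leftovers = Files.newDirectoryStream(dataDirectory, "*-tmp-*"))
        {
            for (Path leftover : leftovers)
                Files.delete(leftover);
        }
    }
}
{code}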

  was (Author: jjordan):
Anti compaction sounds like it could work.
Then you really do just need an am I repaired flag, because during repair you 
anti-compact into repaired and not repaired data.
So something like:
1. Calculate merkle trees, anti compacting each sstable into data being 
repaired and data not being repaired tmp sstables during the process.  Set a 
flag in the data being repaired sstables to show them as repaired.
2. Perform merkle exchange/streaming, flag tmp sstables coming in from 
streaming as repaired.
3. When the repair is done, convert all tmp sstables into real ones, and delete 
originals

sstables involved in the repair would be marked already compacting so they 
won't participate in compaction during the repair.

Since you don't promote from tmp to real until the repair complete's 
successfully, if the node dies in the middle of the repair, all the tmp 
sstables will just be removed at startup.

Then only compact like sstables, so there will be two sets of sstables fully 
repaired and not repaired at all.
  
 Avoid repairing already-repaired data by default
 

 Key: CASSANDRA-5351
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
 Project: Cassandra
  Issue Type: Task
  Components: Core
Reporter: Jonathan Ellis
  Labels: repair
 Fix For: 2.1


 Repair has always built its merkle tree from all the data in a columnfamily, 
 which is guaranteed to work but is inefficient.
 We can improve this by remembering which sstables have already been 
 successfully repaired, and only repairing sstables new since the last repair. 
  (This automatically makes CASSANDRA-3362 much less of a problem too.)
 The tricky part is, compaction will (if not taught otherwise) mix repaired 
 data together with non-repaired.  So we should segregate unrepaired sstables 
 from the repaired ones.



[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2013-06-24 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13692096#comment-13692096
 ] 

Jeremiah Jordan edited comment on CASSANDRA-5351 at 6/24/13 4:08 PM:
-

I've been thinking about this issue this morning.  Here are my current 
thoughts on how it could be accomplished:

1. Keep track, on a per-range basis, of the data that has been repaired in a 
given sstable.  As new ranges are repaired, union them with the existing 
repaired ranges to update what has been repaired (see the sketch below).
2. When sstables are compacted, take the intersection of the repaired ranges of 
the given sstables to be the repaired ranges for the resulting sstable(s).
3. Do not compact sstables which have never been repaired with sstables that 
have had repairs done.  This will prevent new sstables from blowing away the 
fact that older sstables are all repaired when intersecting ranges per step 2.
4. Make sure to mark sstables which are the result of streaming from repair as 
having been repaired.
5. Have repair skip sstables which have already been repaired on the specified 
range.

I think with those 5 things this should be doable.
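
To make steps 1 and 2 concrete, here is a small hypothetical sketch of that 
range bookkeeping - union on repair, intersection on compaction. The Range type 
and the half-open interval representation are assumptions for the example, not 
Cassandra's actual token range machinery.

{code}
// Hypothetical sketch of per-sstable repaired-range bookkeeping: union the
// ranges on a successful repair, intersect them when sstables are compacted
// together. Token ranges are modelled as half-open [start, end) longs.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RepairedRangesSketch
{
    record Range(long start, long end) {}

    // Step 1: a newly repaired range is merged into what is already repaired.
    static List<Range> union(List<Range> repaired, Range newlyRepaired)
    {
        List<Range> all = new ArrayList<>(repaired);
        all.add(newlyRepaired);
        all.sort(Comparator.comparingLong(Range::start));
        List<Range> merged = new ArrayList<>();
        for (Range r : all)
        {
            if (!merged.isEmpty() && merged.get(merged.size() - 1).end() >= r.start())
            {
                Range last = merged.remove(merged.size() - 1);
                merged.add(new Range(last.start(), Math.max(last.end(), r.end())));
            }
            else
                merged.add(r);
        }
        return merged;
    }

    // Step 2: a compacted sstable is only repaired where *all* of its inputs were.
    static List<Range> intersection(List<Range> a, List<Range> b)
    {
        List<Range> out = new ArrayList<>();
        for (Range x : a)
            for (Range y : b)
            {
                long start = Math.max(x.start(), y.start());
                long end = Math.min(x.end(), y.end());
                if (start < end)
                    out.add(new Range(start, end));
            }
        return out;
    }
}
{code}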

  was (Author: jjordan):
I've been thinking about this issues this morning.  Here are my current 
thoughts on how it could be accomplished:

1. Keep track on a per range basis the data that has been repaired in a given 
sstable.  As new ranges are repaired, union them with existing repaired ranges 
to update what has been repaired.
2. When sstables are compacted, take the intersection of repaired ranges in the 
given sstables to be the repaired ranges for the resulting sstable(s).
3. Do not compact tables which have never been repaired with tables that have 
had repairs done.  This will prevent new sstables from blowing away the fact 
that older tables are all repaired when intersecting ranges per step 2.
4. Make sure to mark sstables which are the result of streaming from repair as 
having been repaired.
5. Have repair skip tables which have already been repaired on the specified 
range.

I think with those 5 things this should be doable.
  
 Avoid repairing already-repaired data by default
 

 Key: CASSANDRA-5351
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
 Project: Cassandra
  Issue Type: Task
  Components: Core
Reporter: Jonathan Ellis
  Labels: repair
 Fix For: 2.1


 Repair has always built its merkle tree from all the data in a columnfamily, 
 which is guaranteed to work but is inefficient.
 We can improve this by remembering which sstables have already been 
 successfully repaired, and only repairing sstables new since the last repair. 
  (This automatically makes CASSANDRA-3362 much less of a problem too.)
 The tricky part is, compaction will (if not taught otherwise) mix repaired 
 data together with non-repaired.  So we should segregate unrepaired sstables 
 from the repaired ones.



[jira] [Comment Edited] (CASSANDRA-5351) Avoid repairing already-repaired data by default

2013-05-20 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13662593#comment-13662593
 ] 

Jonathan Ellis edited comment on CASSANDRA-5351 at 5/21/13 2:17 AM:


*All* nodes (including the coordinator) will only have portions repaired in the 
general case, since (a) the user can request a repair of an arbitrary range, 
and (b) even without that, repairing an entire vnode's range will still leave 
data from other vnodes unrepaired in the same sstables.

So the two options that I see are (1) making ranges repaired, rather than 
sstables, or (2) anti-compacting repaired parts into new sstables.

  was (Author: jbellis):
All nodes will only have portions repaired in the general case, since (a) 
the user can request a repair of an arbitrary range, and (b) even without that, 
repairing an entire vnode's range will still leave data from other vnodes 
unrepaired in the same sstables.

So the two options that I see are (1) making ranges repaired, rather than 
sstables, or (2) anti-compacting repaired parts into new sstables.
  
 Avoid repairing already-repaired data by default
 

 Key: CASSANDRA-5351
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5351
 Project: Cassandra
  Issue Type: Task
  Components: Core
Reporter: Jonathan Ellis
  Labels: repair
 Fix For: 2.0


 Repair has always built its merkle tree from all the data in a columnfamily, 
 which is guaranteed to work but is inefficient.
 We can improve this by remembering which sstables have already been 
 successfully repaired, and only repairing sstables new since the last repair. 
  (This automatically makes CASSANDRA-3362 much less of a problem too.)
 The tricky part is, compaction will (if not taught otherwise) mix repaired 
 data together with non-repaired.  So we should segregate unrepaired sstables 
 from the repaired ones.
