[ https://issues.apache.org/jira/browse/CASSANDRA-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sylvain Lebresne updated CASSANDRA-2433:
----------------------------------------

    Attachment: 0004-Reports-validation-compaction-errors-back-to-repair-v3.patch
                0003-Report-streaming-errors-back-to-repair-v3.patch
                0002-Register-in-gossip-to-handle-node-failures-v3.patch
                0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v3.patch

Attaching v3 rebased (on 0.8).

bq. Since we're not trying to control throughput or monitor sessions, could we 
just use Stage.MISC?

The thing is that repair sessions are very long-lived, and MISC is
single-threaded, so that would block other tasks that are not supposed to
block. We could make MISC multi-threaded, but even then it's not a good idea to
mix short-lived and long-lived tasks on the same stage.
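
To illustrate the concern, here is a minimal sketch using plain
java.util.concurrent executors as stand-ins for stages. The class and method
names are hypothetical and this is not code from the patch; it only shows that
a long-lived task on a single-threaded executor holds back a short task queued
behind it, while a dedicated executor does not.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class StageSketch {
    /** Returns true if a short task completes while a long-lived task is still running. */
    static boolean shortTaskRunsPromptly(boolean dedicatedRepairStage) {
        // Assumption: a single-threaded executor stands in for a stage like MISC.
        ExecutorService misc = Executors.newSingleThreadExecutor();
        ExecutorService repair = dedicatedRepairStage ? Executors.newSingleThreadExecutor() : misc;
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Long-lived "repair session": holds its thread until released.
            repair.submit(() -> {
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            started.await();
            // Short task that should never have to wait for a repair.
            Future<?> shortTask = misc.submit(() -> { });
            try {
                shortTask.get(200, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false;
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            release.countDown();   // let the long task finish
            misc.shutdown();
            repair.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("shared stage:    " + shortTaskRunsPromptly(false));
        System.out.println("dedicated stage: " + shortTaskRunsPromptly(true));
    }
}
```

With the shared executor the short task should time out behind the long one;
with the dedicated executor it should complete right away.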

bq. I think RepairSession.exception needs to be volatile to ensure that the 
awoken thread sees it

Done in v3.
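
A minimal sketch of the pattern under discussion, with hypothetical names (this
is not the RepairSession code): one thread records a failure and signals
completion, and the waiting thread rethrows it. The CountDownLatch used here
already orders the write before the read, so the volatile is defensive in this
sketch; it keeps the field visible even if the signalling mechanism changes to
one that gives no such guarantee.

```java
import java.util.concurrent.CountDownLatch;

final class SessionSketch {
    // volatile: a write in fail() is visible to any thread that later reads the field.
    private volatile Exception exception;
    private final CountDownLatch completed = new CountDownLatch(1);

    /** Called by a worker thread when something goes wrong. */
    void fail(Exception e) {
        exception = e;
        completed.countDown();
    }

    /** Called by a worker thread on success. */
    void finish() {
        completed.countDown();
    }

    /** Blocks until the session ends, rethrowing a recorded failure if there is one. */
    void awaitCompletion() throws Exception {
        completed.await();
        Exception e = exception;
        if (e != null)
            throw e;
    }

    /** Runs one session on a helper thread and reports what the waiting thread saw. */
    static String outcome(boolean injectFailure) {
        SessionSketch session = new SessionSketch();
        Thread worker = new Thread(() -> {
            if (injectFailure)
                session.fail(new RuntimeException("stream failed"));
            else
                session.finish();
        });
        worker.start();
        try {
            session.awaitCompletion();
            return "ok";
        } catch (Exception e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(false));
        System.out.println(outcome(true));
    }
}
```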

bq. Would it be better if RepairSession implemented 
IEndpointStateChangeSubscriber directly?

Good idea, it's slightly simpler, done in v3.

bq. The endpoint set needs to be threadsafe, since it will be modified by the 
endpoint state change thread, and the AE_STAGE thread

Done in v3. That will probably change with CASSANDRA-2610 anyway (which I have
to update).
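
For illustration, one way to get a thread-safe set is a concurrent key set from
the JDK. The sketch below is an assumption about the shape of the problem, not
the patch itself: the class name, the method names, and the use of String for
endpoints (real code would more likely use InetAddress) are all hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class EndpointSetSketch {
    // Concurrent set: safe to modify from several threads without extra locking.
    private final Set<String> pending = ConcurrentHashMap.newKeySet();

    EndpointSetSketch(String... endpoints) {
        pending.addAll(Arrays.asList(endpoints));
    }

    /** Hypothetical callback from the endpoint state change thread. */
    boolean onDead(String endpoint) {
        return pending.remove(endpoint);
    }

    /** Hypothetical callback from the repair stage thread. */
    boolean onResponse(String endpoint) {
        return pending.remove(endpoint);
    }

    int remaining() {
        return pending.size();
    }

    /** Two threads remove disjoint halves of the set; returns how many entries are left. */
    static int remainingAfterConcurrentRemoval(int count) {
        String[] endpoints = new String[count];
        for (int i = 0; i < count; i++)
            endpoints[i] = "10.0.0." + i;   // placeholder labels, not real addresses
        EndpointSetSketch session = new EndpointSetSketch(endpoints);
        Thread gossip = new Thread(() -> {
            for (int i = 0; i < count; i += 2)
                session.onDead(endpoints[i]);
        });
        Thread repair = new Thread(() -> {
            for (int i = 1; i < count; i += 2)
                session.onResponse(endpoints[i]);
        });
        gossip.start();
        repair.start();
        try {
            gossip.join();
            repair.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return session.remaining();
    }

    public static void main(String[] args) {
        System.out.println("remaining: " + remainingAfterConcurrentRemoval(200));
    }
}
```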

bq. Should StreamInSession.retries be volatile/atomic? (likely they won't retry 
quickly enough for it to be a problem, but...)

I did not change that, but if it's a problem for retries not to be volatile, I
suspect StreamInSession.current not being volatile is also a problem. But
really, I'd be curious to see that actually be a problem.

bq. Playing devil's advocate: would sending a half-built tree in case of 
failure still be useful?

I don't think it is. Or more precisely, if you do send a half-built tree,
you'll have to be careful that the other side doesn't consider what's missing
as ranges not being in sync (I don't think people will be happy with tons of
data being streamed just because we happen to have a bug that makes compaction
throw an exception during the validation). So I think you cannot do much with a
half-built tree, and it would add complication, all for a case where people
will need to restart the repair anyway once whatever happened is fixed.
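
A toy illustration of that risk, under loud assumptions: the "tree" here is
just a flat map from a range label to a hash, not a real Merkle tree, and all
names are hypothetical. It shows how a naive comparison treats every range
missing from a half-built tree as out of sync.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartialTreeSketch {
    /** Returns the ranges a naive comparison would flag for streaming. */
    static List<String> rangesToStream(Map<String, String> local, Map<String, String> remote) {
        List<String> mismatched = new ArrayList<>();
        for (Map.Entry<String, String> entry : local.entrySet()) {
            // null when the remote validation never got as far as this range
            String remoteHash = remote.get(entry.getKey());
            if (!entry.getValue().equals(remoteHash))
                mismatched.add(entry.getKey());
        }
        return mismatched;
    }

    /** Both sides hold identical data, but the remote validation failed halfway. */
    static List<String> demo() {
        Map<String, String> local = new LinkedHashMap<>();
        local.put("(0,25]", "a1");
        local.put("(25,50]", "b2");
        local.put("(50,75]", "c3");
        local.put("(75,100]", "d4");

        Map<String, String> remote = new LinkedHashMap<>();
        remote.put("(0,25]", "a1");
        remote.put("(25,50]", "b2");   // the tree stops here

        return rangesToStream(local, remote);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this toy case the second half of the ring is flagged for streaming even
though nothing actually differs, so the receiver would need extra logic to
tell "missing" apart from "different".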

bq. success might need to be volatile as well

Done in v3.


> Failed Streams Break Repair
> ---------------------------
>
>                 Key: CASSANDRA-2433
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2433
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Sylvain Lebresne
>              Labels: repair
>             Fix For: 0.8.1
>
>         Attachments: 
> 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v2.patch, 
> 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re-v3.patch, 
> 0001-Put-repair-session-on-a-Stage-and-add-a-method-to-re.patch, 
> 0002-Register-in-gossip-to-handle-node-failures-v2.patch, 
> 0002-Register-in-gossip-to-handle-node-failures-v3.patch, 
> 0002-Register-in-gossip-to-handle-node-failures.patch, 
> 0003-Report-streaming-errors-back-to-repair-v2.patch, 
> 0003-Report-streaming-errors-back-to-repair-v3.patch, 
> 0003-Report-streaming-errors-back-to-repair.patch, 
> 0004-Reports-validation-compaction-errors-back-to-repair-v2.patch, 
> 0004-Reports-validation-compaction-errors-back-to-repair-v3.patch, 
> 0004-Reports-validation-compaction-errors-back-to-repair.patch
>
>
> Running repair in cases where a stream fails we are seeing multiple problems.
> 1. Although retry is initiated and completes, the old stream doesn't seem to 
> clean itself up and repair hangs.
> 2. The temp files are left behind and multiple failures can end up filling up 
> the data partition.
> These issues together are making repair very difficult for nearly everyone 
> running repair on a non-trivial sized data set.
> This issue is also being worked on w.r.t CASSANDRA-2088, however that was 
> moved to 0.8 for a few reasons. This ticket is to fix the immediate issues 
> that we are seeing in 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
