[ 
https://issues.apache.org/jira/browse/CASSANDRA-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176420#comment-13176420
 ] 

Vijay commented on CASSANDRA-3112:
----------------------------------

"But do you know what is the reason for it making no progress? Because unless 
we know what can cause it, not sure what to fix?"
It is usually in the streaming phase. I think adding a SoTimeout might fix 
it, but it is so random that I couldn't reproduce it in my tests, though I am 
definitely seeing it in production.
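
To make the SoTimeout idea concrete, here is a minimal sketch of a read timeout on a plain java.net.Socket. It is not Cassandra's streaming code: the class name StreamTimeoutSketch, the helper readTimesOut and the 200 ms value are all made up for illustration. It only shows that with SO_TIMEOUT set, a read from a peer that has stopped sending throws SocketTimeoutException instead of blocking forever.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration only; none of these names exist in Cassandra.
public class StreamTimeoutSketch {

    /** Attempts a single read; returns true if it gave up because of SO_TIMEOUT. */
    static boolean readTimesOut(Socket socket, int timeoutMillis) throws IOException {
        // A timeout of 0 (the default) means "block forever", which is how a
        // stalled stream can hang its caller indefinitely.
        socket.setSoTimeout(timeoutMillis);
        InputStream in = socket.getInputStream();
        try {
            in.read();          // blocks until data, EOF, or the timeout
            return false;       // got a byte or EOF: the peer is still talking
        } catch (SocketTimeoutException e) {
            return true;        // no progress within the timeout: report failure
        }
    }

    /** Connects to a local peer that never writes, then reads with a timeout. */
    static boolean demo() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket silentPeer = server.accept()) {
            // silentPeer stays open but sends nothing, simulating a stalled stream.
            return readTimesOut(client, 200);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + demo());
    }
}
```

Note that the socket remains usable after a SocketTimeoutException, so the caller has to decide whether to retry the read or fail the whole session.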

"How can we "lose" messages, aren't tcp supposed to avoid this?"
Once you send the message, the other node might get restarted (without 
validating or starting anything) or the sockets can get reset. Actually, I 
think when I posted this message it was because of CASSANDRA-3577. There isn't 
something like hints or a retry for the messages sent for repairs.

I understand this isn't in the scope of this ticket, but I still think there 
should be a way to orchestrate repairs with slightly more complicated logic, 
and I will try to do some parts of it in the other ticket.



                
> Make repair fail when an unexpected error occurs
> ------------------------------------------------
>
>                 Key: CASSANDRA-3112
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3112
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>            Priority: Minor
>              Labels: repair
>             Fix For: 1.1
>
>         Attachments: 0003-Report-streaming-errors-back-to-repair-v4.patch, 
> 0004-Reports-validation-compaction-errors-back-to-repair-v4.patch
>
>
> CASSANDRA-2433 makes it so that nodetool repair will fail if a node 
> participating in the repair dies before completing its part of the repair. 
> This handles most of the situations where repair was previously hanging, but 
> repair can still hang if an unexpected error occurs during either the merkle 
> tree creation (an on-disk corruption triggers an IOError, say) or during 
> streaming (though I'm not sure what could make streaming fail outside of 
> 'one of the nodes died' (besides a bug)).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
