[jira] [Commented] (CASSANDRA-15566) Repair coordinator can hang under some cases

2022-04-07 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519273#comment-17519273
 ] 

David Capwell commented on CASSANDRA-15566:
---

Now that the vtables are in, progress can be made here.  I have a few things to 
look into outside of repair, but do plan to come back to work on this; if 
anyone else has cycles, feel free to pick up whatever you can (there are many 
edge cases).

> Repair coordinator can hang under some cases
> 
>
> Key: CASSANDRA-15566
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: David Capwell
>Priority: Normal
> Fix For: 4.x
>
>
> Repair coordination makes a few assumptions about message delivery which 
> cause it to hang forever when those assumptions don’t hold true: that a fire 
> and forget message will not get rejected (the participant has an issue and 
> rejects the message), and that a very delayed message will one day be seen 
> (messages can be dropped under load or when the failure detector thinks a 
> node is bad but it is just GCing).
> Given this and the desire to have better observability with repair (see 
> CASSANDRA-15399), coordination should be changed into a request/response 
> pattern (with retries) and polling (validation status and MerkleTree 
> sending).  This would allow the coordinator to detect changes in state (it 
> was known the participant was working on validation, but it no longer knows 
> about the validation task), and to be able to recover from ephemeral issues.
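
To make the polling idea in the description concrete, here is a minimal sketch in Python (chosen for brevity; it is not Cassandra code). `ValidationState`, `Participant` and `poll_validation` are hypothetical names, and the participant is an in-memory stand-in for a remote node. It only illustrates how a coordinator that asks for status can tell a running validation apart from one the participant no longer knows about.

```python
import enum
import time


class ValidationState(enum.Enum):
    RUNNING = "running"
    DONE = "done"
    UNKNOWN = "unknown"  # the participant has no record of the task


class Participant:
    """Hypothetical in-memory stand-in for a remote repair participant."""

    def __init__(self):
        self.tasks = {}  # task id -> ValidationState

    def status(self, task_id):
        # Request/response: the caller always gets an answer back.
        return self.tasks.get(task_id, ValidationState.UNKNOWN)


def poll_validation(participant, task_id, max_polls, interval_s=0.0):
    """Poll until the task is no longer RUNNING or the poll budget runs out.

    Returns the last observed state; RUNNING means the budget was exhausted.
    """
    state = ValidationState.RUNNING
    for _ in range(max_polls):
        state = participant.status(task_id)
        if state is not ValidationState.RUNNING:
            break
        time.sleep(interval_s)
    return state
```

An `UNKNOWN` result is the state change the description mentions: the coordinator believed validation was in progress, but the participant has lost the task, so the coordinator can fail or retry instead of hanging.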



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15566) Repair coordinator can hang under some cases

2020-05-12 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105683#comment-17105683
 ] 

David Capwell commented on CASSANDRA-15566:
---

Moved out of 4.0 since these are improvements rather than regressions, and 
they can be done in a minor release.  During the 4.0 timeline we should flesh 
out a plan for how to fix and validate these changes, but the changes 
themselves should not be made in that timeline.

> Repair coordinator can hang under some cases
> 
>
> Key: CASSANDRA-15566
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: David Capwell
>Assignee: David Capwell
>Priority: Normal
> Fix For: 4.x






[jira] [Commented] (CASSANDRA-15566) Repair coordinator can hang under some cases

2020-03-09 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055063#comment-17055063
 ] 

David Capwell commented on CASSANDRA-15566:
---

bq. C* 4.0 code is quite new to me...

Me too :)

One of the best ways to start is testing; we need more tests to show where 
repair needs improvement. When I joined this project I asked operators for 
their top pain points with repair (all were from 2.1), and as I write tests I 
see 4.0 has the same issues.  More tests which show new areas would be great!

I think your 5 classifications are good, though 1 and 2 can merge; our 
networking is lossy (not a bad thing; under load it’s crash or drop).  I would 
love a smoke test which runs user/operator tasks constantly under “load” (we 
should be able to artificially lower resources). This test would also help show 
whether the different subsystems work well or need improvement.

About the participant crashing, I added a jvm dtest which shows this is handled, 
assuming the failure detector detects it (restarting the node also fails the 
repair).

About detection and abort, I agree it should be external for now. Any/all 
things the external tools need must be identified and tested to show they work 
(for example, does aborting a repair work?).

> Repair coordinator can hang under some cases
> 
>
> Key: CASSANDRA-15566
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: David Capwell
>Assignee: David Capwell
>Priority: Normal
> Fix For: 4.0-beta






[jira] [Commented] (CASSANDRA-15566) Repair coordinator can hang under some cases

2020-03-09 Thread ZhaoYang (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054748#comment-17054748
 ] 

ZhaoYang commented on CASSANDRA-15566:
--

Thanks for the update.

I don't have any concrete implementation details in mind yet; C* 4.0 code is 
quite new to me...

Based on my understanding of 3.x, the main reasons for repair hanging are:
 # Request/response messages are dropped if they exceed the expiration time, 
which is 10s.
 # Internode connections are closed, clearing all queued messages, due to 
network or gossip status changes.
 # The participant crashed.
 # A failure response is not sent to the coordinator in 
{{RepairMessageVerbHandler.doVerb()}} in case of an unknown exception; 
currently it only handles dropped tables.
 # The participant is making progress, but very slowly during validation 
because of disk IO throttling.

For problems #1-2, I am thinking of making repair messages idempotent, with the 
sender periodically resending a message until it gets a reply.
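
A rough sketch of the idempotent-resend idea, in Python for brevity rather than as Cassandra code. `IdempotentReceiver`, `send_until_acked` and the message id scheme are hypothetical; the sketch assumes each repair message carries a stable id so that a resend can be recognised as a duplicate.

```python
class IdempotentReceiver:
    """Hypothetical participant side: a duplicate id replays the cached reply."""

    def __init__(self, handler):
        self.handler = handler
        self.replies = {}    # message id -> reply already produced
        self.executions = 0  # how many times the handler really ran

    def receive(self, message_id, payload):
        if message_id not in self.replies:
            self.executions += 1
            self.replies[message_id] = self.handler(payload)
        return self.replies[message_id]


def send_until_acked(send, message_id, payload, max_attempts):
    """Resend the same message id until `send` returns a reply.

    `send` is assumed to return None when the message or its reply was lost.
    Returns (reply, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        reply = send(message_id, payload)
        if reply is not None:
            return reply, attempt
    raise TimeoutError(f"no reply to {message_id} after {max_attempts} attempts")
```

Because the receiver caches replies by id, a resend after a lost reply does not run the work a second time; a real implementation would also need a delay between attempts and a way to expire the cache.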

For problem #3, make sure the repair manager responds to endpoint status 
changes (e.g. up/down/remove) if it doesn't already.

For problem #4, make sure all exceptions are caught and answered with a failure 
response. We need to add some failure injection to dtests.
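
As an illustration of "always respond", here is a minimal Python sketch (not the actual {{RepairMessageVerbHandler}} code). `always_respond` and `validate` are hypothetical names, and the reply shape is an assumption made for the example.

```python
def always_respond(handler):
    """Wrap a handler so every message yields a success or failure reply."""

    def wrapped(message):
        try:
            return {"status": "ok", "result": handler(message)}
        except Exception as exc:  # deliberately broad: no failure goes unanswered
            return {"status": "failed", "error": f"{type(exc).__name__}: {exc}"}

    return wrapped


def validate(message):
    # Hypothetical handler that fails in a way its author did not anticipate.
    if message.get("table") is None:
        raise KeyError("table")
    return "validated " + message["table"]
```

Wrapping `validate` with `always_respond` means the sender gets a `failed` reply for the unexpected `KeyError` rather than silence, which is the behaviour a failure-injection test would check for.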

For problem #5, as you suggested in CASSANDRA-15399, the coordinator should be 
able to check each participant's in-memory virtual table to determine whether 
it is making progress.
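
One way to separate "slow" from "stuck" is sketched below in Python. It does not query a real virtual table; `is_stalled` is a hypothetical helper, and it assumes the participant exposes some monotonically increasing progress counter (for example bytes validated) that the coordinator samples over time.

```python
def is_stalled(samples, stall_after_s):
    """Return True if progress has not advanced for at least `stall_after_s`.

    `samples` is a time-ordered list of (timestamp_s, progress) pairs read
    from the participant.
    """
    if not samples:
        return False
    last_time, last_progress = samples[-1]
    # Walk backwards to find when the current progress value was first seen.
    first_seen = last_time
    for timestamp, progress in reversed(samples):
        if progress != last_progress:
            break
        first_seen = timestamp
    return last_time - first_seen >= stall_after_s
```

A participant throttled by disk IO still moves the counter between samples and is not flagged, while one whose counter has been flat for longer than the threshold is. The threshold is a trade-off between false positives and detection delay.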

In order to make repair great again, I think it's important to be able to 
identify hung repairs automatically (even with some false positives) and abort 
them via nodetool, because I don't expect repair operations to be run manually 
by operators. In production, repair should be managed by an automation tool, 
such as a repair service or Reaper, which will abort and retry hung repairs. 
This can probably be done in CASSANDRA-15399 or a separate ticket.

> Repair coordinator can hang under some cases
> 
>
> Key: CASSANDRA-15566
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: David Capwell
>Assignee: David Capwell
>Priority: Normal
> Fix For: 4.0-beta






[jira] [Commented] (CASSANDRA-15566) Repair coordinator can hang under some cases

2020-03-06 Thread David Capwell (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053776#comment-17053776
 ] 

David Capwell commented on CASSANDRA-15566:
---

Thanks!

I have and have not started!  There are several things going on so this task 
really should be split into smaller tasks (kinda like this is an epic!).

I am currently trying to finish CASSANDRA-15564, which fixes some of the issues 
but doesn't fix anything that fails because of networking; that JIRA also adds 
an @Ignored test which fails 100% of the time (which is what triggered this 
JIRA).
I have a POC patch for CASSANDRA-15399 which exposes the state across the 
coordinator and participants.
I am working on longevity testing (without explicitly causing failures) with a 
focus on repair; this is to help boost my confidence in any structural changes 
made.

That's the current state of this all!

There are a few things already found problematic, so the next steps are mostly

1) testing
2) come up with a proposal for how to fix

I have not fleshed out #2, so if you want to take a stab at fleshing out how to 
solve this, that would be great!  Without that we can't really split this JIRA.

> Repair coordinator can hang under some cases
> 
>
> Key: CASSANDRA-15566
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: David Capwell
>Assignee: David Capwell
>Priority: Normal
> Fix For: 4.0-beta






[jira] [Commented] (CASSANDRA-15566) Repair coordinator can hang under some cases

2020-03-06 Thread ZhaoYang (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053336#comment-17053336
 ] 

ZhaoYang commented on CASSANDRA-15566:
--

Hi [~dcapwell], have you started working on this ticket? I am happy to help.

> Repair coordinator can hang under some cases
> 
>
> Key: CASSANDRA-15566
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15566
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Repair
>Reporter: David Capwell
>Assignee: David Capwell
>Priority: Normal
> Fix For: 4.0-beta


