[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos

Stefan Miklosovic (Jira) Fri, 13 Oct 2023 00:45:04 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-18866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774815#comment-17774815
 ]


Stefan Miklosovic edited comment on CASSANDRA-18866 at 10/13/23 7:44 AM:
-------------------------------------------------------------------------

[~brandon.williams] do you remember CASSANDRA-18854 / CASSANDRA-18543 where we 
reverted the logic around missed echo message? This one fixes it. Repeated 
tests seem to be stable, it is seen that in some cases it resends echo request 
when lost (in 1% of cases). Do you think this is something you could take a 
look at?  

[trunk 
j17|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3337/workflows/f44ee4a8-03fb-488f-ba54-43306bdc86d0]
[trunk 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3337/workflows/459e6b55-3d43-4ee8-85cd-b308e8797e51]
[5.0 
j17|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3336/workflows/50a0bc41-b800-478d-b7e1-38cc73c16f84]
[5.0 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3336/workflows/d92a1a77-1f81-4210-a14a-da61fb18e1dd]
[4.1 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3335/workflows/dc9128f2-94f0-4651-afc8-df4a2db53a9b]
[4.1 
j8|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3335/workflows/8205db12-78f4-429a-81b8-508ba83b98bb]
[4.0 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3326/workflows/86b807f3-801f-47a0-925b-9ca49eb76d97]
[4.0 
j8|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3326/workflows/0af2bd74-55fc-4d3f-a4e0-2a4c1461ed90]
[3.11 j8| 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/3325/workflows/ff2b2562-c238-49d2-abb2-3457acb9618d]


was (Author: smiklosovic):
[~brandon.williams] do you remember CASSANDRA-18854 / CASSANDRA-18543 where we 
reverted the logic around missed echo message? This one fixes it. Repeated 
tests seem to be stable, it is seen that in some cases it resends echo request 
when lost (in 1% of cases). Do you think this is something you could take a 
look at?  

[trunk 
j17|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3337/workflows/f44ee4a8-03fb-488f-ba54-43306bdc86d0]
[trunk 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3337/workflows/459e6b55-3d43-4ee8-85cd-b308e8797e51]
[5.0 
j17|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3336/workflows/50a0bc41-b800-478d-b7e1-38cc73c16f84]
[5.0 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3336/workflows/d92a1a77-1f81-4210-a14a-da61fb18e1dd]
[4.1 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3335/workflows/dc9128f2-94f0-4651-afc8-df4a2db53a9b]
[4.1 
j8|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3335/workflows/8205db12-78f4-429a-81b8-508ba83b98bb]
[4.0 
j11|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3326/workflows/86b807f3-801f-47a0-925b-9ca49eb76d97]
[4.0 
j8|https://app.circleci.com/pipelines/github/instaclustr/cassandra/3326/workflows/0af2bd74-55fc-4d3f-a4e0-2a4c1461ed90]
[3.11 j8 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/3325/workflows/ff2b2562-c238-49d2-abb2-3457acb9618d]

> Node sends multiple inflight echos
> ----------------------------------
>
>                 Key: CASSANDRA-18866
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18866
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Cluster/Gossip
>            Reporter: Cameron Zemek
>            Assignee: Cameron Zemek
>            Priority: Normal
>         Attachments: 18866-regression.patch, duplicates.log, echo.log
>
>
> CASSANDRA-18854 rolled back the changes from CASSANDRA-18845. In particular, 
> 18845 had change to only allow 1 inflight ECHO request at a time. As per 
> 18854 some tests have an error rate due to this change. Creating this ticket 
> to discuss this further. As the current state also does not have retry logic, 
> it just allowing multiple ECHO requests inflight at the same time so less 
> likely that all ECHO will timeout or get lost.
> With the change from 18845 adding in some extra logging to track what is 
> going on, I do see it retrying ECHOs. Likewise, I patched a node to drop ECHO 
> requests from a node and also see it retrying ECHOs when it doesn't get a 
> reply.
> Therefore, I think the problem is more specific than the dropping of one ECHO 
> request. Yes there no retry logic for failed ECHO requests, but this is the 
> case even both before and after 18845. ECHO requests are only sent via gossip 
> verb handlers calling applyStateLocally. In these failed tests I therefore 
> assuming their cases where it won't call markAlive when other nodes consider 
> the node UP but its marked DOWN by a node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Comment Edited] (CASSANDRA-18866) Node sends multiple inflight echos

Reply via email to