[jira] [Commented] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076236#comment-17076236 ] Massimiliano Tomassi commented on CASSANDRA-15667: -- PR: [https://github.com/maxtomassi/cassandra/pull/1] CircleCI: [https://app.circleci.com/pipelines/github/maxtomassi/cassandra?branch=15667-4.0] (JVM dtests failed running) > StreamResultFuture check for completeness is inconsistent, leading to races > --- > > Key: CASSANDRA-15667 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15667 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Sergio Bossa >Assignee: Massimiliano Tomassi >Priority: Normal > Fix For: 4.0 > > > {{StreamResultFuture#maybeComplete()}} uses > {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are > completed, but then accesses each session state via > {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the > former relies on the actual {{StreamSession}} state, while the latter on the > {{SessionInfo}} state, and the two are concurrently updated with no > coordination whatsoever. > This leads to races, i.e. apparent in some dtest spurious failures, such as > {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc > [~e.dimitrova]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077514#comment-17077514 ] Ekaterina Dimitrova commented on CASSANDRA-15667: - Just to be sure you saw that - resumable bootstrap is tested in another test too, that's why this one was removed, not blindly removed because it was flakey or something. Fix was applied to the other one but it was pointed in the ticket that in extremely rare cases it might fail. Why? From what I recall the bootstrap was sometimes completing too fast before the streaming is really interrupted from the byteman code and we didn't really have an instrument to control that. So the only fix was to introduce bigger delay which solves the issue in most of the cases. I think I haven't seen this test failing anymore but it might be really possible in very rare cases. There is in parallel additional ticket to try to implement the test in the in-jvm test suite as according to Jordan West who is the Shepard for testing 4.0, there we might be able to control the flow deterministically. I will look In detail at this patch and the test again later today or tomorrow, thanks a lot for bringing it up! > StreamResultFuture check for completeness is inconsistent, leading to races > --- > > Key: CASSANDRA-15667 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15667 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Sergio Bossa >Assignee: Massimiliano Tomassi >Priority: Normal > Fix For: 4.0 > > > {{StreamResultFuture#maybeComplete()}} uses > {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are > completed, but then accesses each session state via > {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the > former relies on the actual {{StreamSession}} state, while the latter on the > {{SessionInfo}} state, and the two are concurrently updated with no > coordination whatsoever. > This leads to races, i.e. apparent in some dtest spurious failures, such as > {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc > [~e.dimitrova]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079258#comment-17079258 ] Sergio Bossa commented on CASSANDRA-15667: -- Thanks [~e.dimitrova] for chiming in. {quote}From what I recall the bootstrap was sometimes completing too fast before the streaming is really interrupted from the byteman code and we didn't really have an instrument to control that. {quote} I went through the "resumable bootstrap" test, and unfortunately I don't see how the bootstrap could ever complete before the byteman script is invoked: this is because such script [makes node 1 fail before it starts to stream files|[https://github.com/ekaterinadimitrova2/cassandra-dtest/blob/b56887d67c353d6d69cd60cfd74859405fa37685/byteman/4.0/stream_failure.btm#L10]], which means there's no way for node 3 to finish bootstrapping before it received all files from both nodes, which will never happen due to said script causing node 1 to fail. So why did the test fail? I believe that's because of this issue: in other words, node 3 was correctly seeing its streaming session completed (after node 1 finished streaming with an error) but *not* failed; this is because the "completed" state is read through the actual session state, while the "failed" state is read through the {{SessionInfo}} state, which is what we're fixing here. That said, I would propose to still re-introduce the original {{resumable_bootstrap_test}}, because it's an important enough feature to deserve its own test, and it uses 3 nodes which increases the chances of detecting errors/races. Thoughts? > StreamResultFuture check for completeness is inconsistent, leading to races > --- > > Key: CASSANDRA-15667 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15667 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Sergio Bossa >Assignee: Massimiliano Tomassi >Priority: Normal > Fix For: 4.0 > > > {{StreamResultFuture#maybeComplete()}} uses > {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are > completed, but then accesses each session state via > {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the > former relies on the actual {{StreamSession}} state, while the latter on the > {{SessionInfo}} state, and the two are concurrently updated with no > coordination whatsoever. > This leads to races, i.e. apparent in some dtest spurious failures, such as > {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc > [~e.dimitrova]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17079380#comment-17079380 ] Ekaterina Dimitrova commented on CASSANDRA-15667: - Hi Sergio, Thanks for the patch and the broad explanation. I just returned the test [here|https://github.com/ekaterinadimitrova2/cassandra-dtest/tree/CASSANDRA-15667] Unfortunately, I am not a committer so anyone else should do it. Thanks again! > StreamResultFuture check for completeness is inconsistent, leading to races > --- > > Key: CASSANDRA-15667 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15667 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Sergio Bossa >Assignee: Massimiliano Tomassi >Priority: Normal > Fix For: 4.0 > > > {{StreamResultFuture#maybeComplete()}} uses > {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are > completed, but then accesses each session state via > {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the > former relies on the actual {{StreamSession}} state, while the latter on the > {{SessionInfo}} state, and the two are concurrently updated with no > coordination whatsoever. > This leads to races, i.e. apparent in some dtest spurious failures, such as > {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc > [~e.dimitrova]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080281#comment-17080281 ] ZhaoYang commented on CASSANDRA-15667: -- [~maxtomassi] thanks for the patch.. should we also fix 3.0 and 3.11 ? > StreamResultFuture check for completeness is inconsistent, leading to races > --- > > Key: CASSANDRA-15667 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15667 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Sergio Bossa >Assignee: Massimiliano Tomassi >Priority: Normal > Fix For: 4.0 > > > {{StreamResultFuture#maybeComplete()}} uses > {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are > completed, but then accesses each session state via > {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the > former relies on the actual {{StreamSession}} state, while the latter on the > {{SessionInfo}} state, and the two are concurrently updated with no > coordination whatsoever. > This leads to races, i.e. apparent in some dtest spurious failures, such as > {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc > [~e.dimitrova]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080472#comment-17080472 ] Ekaterina Dimitrova commented on CASSANDRA-15667: - I am looking now at the test as I triggered it to run again 100 times yesterday and it seems there is some error. I am gonna figure it out and update the ticket later > StreamResultFuture check for completeness is inconsistent, leading to races > --- > > Key: CASSANDRA-15667 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15667 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Sergio Bossa >Assignee: Massimiliano Tomassi >Priority: Normal > Fix For: 4.0 > > > {{StreamResultFuture#maybeComplete()}} uses > {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are > completed, but then accesses each session state via > {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the > former relies on the actual {{StreamSession}} state, while the latter on the > {{SessionInfo}} state, and the two are concurrently updated with no > coordination whatsoever. > This leads to races, i.e. apparent in some dtest spurious failures, such as > {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc > [~e.dimitrova]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081026#comment-17081026 ] Ekaterina Dimitrova commented on CASSANDRA-15667: - Attached a log of 100 successful runs. (Turned out the issue was false alarm, my local config got messed up, nothing related to the test or the patch) > StreamResultFuture check for completeness is inconsistent, leading to races > --- > > Key: CASSANDRA-15667 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15667 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Sergio Bossa >Assignee: Massimiliano Tomassi >Priority: Normal > Fix For: 4.0 > > > {{StreamResultFuture#maybeComplete()}} uses > {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are > completed, but then accesses each session state via > {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the > former relies on the actual {{StreamSession}} state, while the latter on the > {{SessionInfo}} state, and the two are concurrently updated with no > coordination whatsoever. > This leads to races, i.e. apparent in some dtest spurious failures, such as > {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc > [~e.dimitrova]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083119#comment-17083119 ] Massimiliano Tomassi commented on CASSANDRA-15667: -- [~jasonstack], sorry for the delay. I've updated my first comment with links to PRs against 3.0 and 3.11 too. > StreamResultFuture check for completeness is inconsistent, leading to races > --- > > Key: CASSANDRA-15667 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15667 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Sergio Bossa >Assignee: Massimiliano Tomassi >Priority: Normal > Fix For: 4.0 > > Attachments: log_bootstrap_resumable > > > {{StreamResultFuture#maybeComplete()}} uses > {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are > completed, but then accesses each session state via > {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the > former relies on the actual {{StreamSession}} state, while the latter on the > {{SessionInfo}} state, and the two are concurrently updated with no > coordination whatsoever. > This leads to races, i.e. apparent in some dtest spurious failures, such as > {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc > [~e.dimitrova]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083146#comment-17083146 ] Sergio Bossa commented on CASSANDRA-15667: -- [~maxtomassi], [~e.dimitrova], thanks for your PRs. I'm +1 to both. > StreamResultFuture check for completeness is inconsistent, leading to races > --- > > Key: CASSANDRA-15667 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15667 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Sergio Bossa >Assignee: Massimiliano Tomassi >Priority: Normal > Fix For: 4.0 > > Attachments: log_bootstrap_resumable > > > {{StreamResultFuture#maybeComplete()}} uses > {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are > completed, but then accesses each session state via > {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the > former relies on the actual {{StreamSession}} state, while the latter on the > {{SessionInfo}} state, and the two are concurrently updated with no > coordination whatsoever. > This leads to races, i.e. apparent in some dtest spurious failures, such as > {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc > [~e.dimitrova]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15667) StreamResultFuture check for completeness is inconsistent, leading to races
[ https://issues.apache.org/jira/browse/CASSANDRA-15667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091505#comment-17091505 ] Benjamin Lerer commented on CASSANDRA-15667: Jenkin CI | [[4.0|https://ci-cassandra.apache.org/job/Cassandra-devbranch/74/]|[3.0|https://ci-cassandra.apache.org/job/Cassandra-devbranch/75/]|[3.11|https://ci-cassandra.apache.org/job/Cassandra-devbranch/76/]| > StreamResultFuture check for completeness is inconsistent, leading to races > --- > > Key: CASSANDRA-15667 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15667 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: Sergio Bossa >Assignee: Massimiliano Tomassi >Priority: Normal > Fix For: 4.0 > > Attachments: log_bootstrap_resumable > > > {{StreamResultFuture#maybeComplete()}} uses > {{StreamCoordinator#hasActiveSessions()}} to determine if all sessions are > completed, but then accesses each session state via > {{StreamCoordinator#getAllSessionInfo()}}: this is inconsistent, as the > former relies on the actual {{StreamSession}} state, while the latter on the > {{SessionInfo}} state, and the two are concurrently updated with no > coordination whatsoever. > This leads to races, i.e. apparent in some dtest spurious failures, such as > {{TestBootstrap.resumable_bootstrap_test}} in CASSANDRA-15614 cc > [~e.dimitrova]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org