[jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers
[ https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813718#comment-16813718 ]

Andrzej Bialecki commented on SOLR-13376:
-----------------------------------------

[~hossman] - this patch changes the {{OverseerTriggerThread}} so that it no longer removes markers once it has finished initializing all triggers; it only marks them "inactive". This kills two birds with one stone: it prevents straggler nodes from re-creating these markers, and it allows triggers to avoid processing them multiple times (on multiple Overseer leader changes). It also speeds up removal of markers in {{InactiveMarkersPlanAction}}.

I also added some Ref Guide documentation about the maintenance trigger.

I'd appreciate a review.

> Multi-node race condition to create/remove nodeLost markers
> -----------------------------------------------------------
>
> Key: SOLR-13376
> URL: https://issues.apache.org/jira/browse/SOLR-13376
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Assignee: Andrzej Bialecki
> Priority: Major
> Attachments: SOLR-13376.patch
>
>
> NodeMarkersRegistrationTest.testNodeMarkersRegistration is frequently failing
> on jenkins builds in the same spot, with similar-looking logs.
> Although I haven't been able to reproduce these failures locally, I am fairly
> confident that the problem is a race condition bug that exists between
> when/how a new Overseer will process & clean up "nodeLost" markers in ZK,
> with how other nodes may (mistakenly) re-create those markers in their
> liveNodes listener.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
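The idea in the comment above (flag markers "inactive" instead of deleting them) can be pictured with a small sketch. This is not the SOLR-13376 patch: the class `InactiveMarkerSketch`, its methods, and the in-memory map standing in for the marker nodes in ZK are all hypothetical, and the cleanup step is a simplification of whatever {{InactiveMarkersPlanAction}} actually does. It only illustrates why a marker that still exists cannot be re-created by a straggler and is not processed a second time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for the nodeLost marker directory in ZK. */
public class InactiveMarkerSketch {
    enum State { ACTIVE, INACTIVE }

    // Lost node name -> marker state. All names here are illustrative assumptions.
    private final Map<String, State> markers = new ConcurrentHashMap<>();

    /** Create-if-absent; returns true only when a new marker was written. */
    boolean createMarker(String lostNode) {
        return markers.putIfAbsent(lostNode, State.ACTIVE) == null;
    }

    /** Called once a new Overseer has initialized its triggers: keep markers, flag them inactive. */
    void deactivateAll() {
        markers.replaceAll((node, state) -> State.INACTIVE);
    }

    /** A trigger would only act on markers that are still active. */
    boolean shouldProcess(String lostNode) {
        return markers.get(lostNode) == State.ACTIVE;
    }

    /** Periodic cleanup: drop inactive markers and report how many were removed. */
    int removeInactive() {
        int before = markers.size();
        markers.values().removeIf(state -> state == State.INACTIVE);
        return before - markers.size();
    }

    public static void main(String[] args) {
        InactiveMarkerSketch zk = new InactiveMarkerSketch();
        zk.createMarker("node1");                      // first watcher records the lost node
        zk.deactivateAll();                            // Overseer is done initializing triggers
        boolean recreated = zk.createMarker("node1");  // straggler tries to create it again
        System.out.println("straggler re-created marker: " + recreated);
        System.out.println("trigger should process: " + zk.shouldProcess("node1"));
        System.out.println("removed by cleanup: " + zk.removeInactive());
    }
}
```

Because the marker is never absent between the Overseer's pass and the periodic cleanup, the create-if-absent call from a late node has nothing to re-create.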
[jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers
[ https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813350#comment-16813350 ]

ASF subversion and git services commented on SOLR-13376:
--------------------------------------------------------

Commit ab55b6386b701ec91afb92b269decd081f398ca8 in lucene-solr's branch refs/heads/jira/LUCENE-8738 from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ab55b63 ]

SOLR-13376: Disable test until it can be re-written to reflect actual expected behavior of how/when node markers will be cleaned up
[jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers
[ https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813271#comment-16813271 ]

Andrzej Bialecki commented on SOLR-13376:
-----------------------------------------

bq. it's expected that InactiveMarkersPlanAction is what will clean up the markers

It's expected to _eventually_ clean them up - the trigger runs once a day. That's why the section in {{OverseerTriggerThread.run()}} was removing them on Overseer leader change: to clean up the markers that we know for sure are no longer needed. And apparently this creates the race condition.

bq. you just re-enabled the test (w/o any modifications to it) and re-resolved this issue

Well, for the record, see 1cfbd3e1c84d35e741cfc068a8e88f0eff4ea9e1, where I tried to address another source of the test's instability, and the test's reliability improved after that change. The race condition that you discovered is something new that I wasn't aware of before, so I'm going to fix it (and add the missing documentation on the {{.scheduled_maintenance}} trigger).
[jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers
[ https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812907#comment-16812907 ]

ASF subversion and git services commented on SOLR-13376:
--------------------------------------------------------

Commit deb79872720fbaffa138ab4ef3c7226bec935aaf in lucene-solr's branch refs/heads/branch_8x from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=deb7987 ]

SOLR-13376: Disable test until it can be re-written to reflect actual expected behavior of how/when node markers will be cleaned up

(cherry picked from commit ab55b6386b701ec91afb92b269decd081f398ca8)
[jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers
[ https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812908#comment-16812908 ]

ASF subversion and git services commented on SOLR-13376:
--------------------------------------------------------

Commit ab55b6386b701ec91afb92b269decd081f398ca8 in lucene-solr's branch refs/heads/master from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ab55b63 ]

SOLR-13376: Disable test until it can be re-written to reflect actual expected behavior of how/when node markers will be cleaned up
[jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers
[ https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812880#comment-16812880 ]

Hoss Man commented on SOLR-13376:
---------------------------------

{quote}This cleaning of leftover markers in OverseerTriggerThread was added early on when we added this functionality, and it may not be necessary anymore - there's InactiveMarkersPlanAction that runs periodically to remove stale markers.
{quote}

If this test doesn't reflect reality, and it's expected that {{InactiveMarkersPlanAction}} is what will clean up the markers, then the test needs to be fixed - because right now it (like many other auto-scaling tests) goes out of its way to disable the {{.scheduled_maintenance}} trigger.

For the record, this is the *exact* question I asked you about when you first resolved SOLR-13072 (but initially left this test marked AwaitsFix), but you never replied ... you just re-enabled the test (w/o any modifications to it) and re-resolved this issue...

https://issues.apache.org/jira/browse/SOLR-13072?focusedCommentId=16732499&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16732499

Also: AFAICT there is *nothing* in the ref-guide that mentions the {{.scheduled_maintenance}} trigger, or any of its (default) actions ({{inactive_shard_plan}}, {{inactive_markers_plan}}, {{execute_plan}}), or what they do, or why they (may) be important for cleaning up things like the nodeLost/nodeAdded markers. That seems like a problematic omission?
[jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers
[ https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16810887#comment-16810887 ]

Andrzej Bialecki commented on SOLR-13376:
-----------------------------------------

Hmm, indeed there's a race condition here. The reason for having more than 1 node attempt to create a nodeLost marker is that more than 1 node may go away (3 was a magic number ;) that we felt wasn't excessive and still reduced the chance of losing the event due to multiple node failures).

This cleaning of leftover markers in {{OverseerTriggerThread}} was added early on when we added this functionality, and it may not be necessary anymore - there's {{InactiveMarkersPlanAction}} that runs periodically to remove stale markers.
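The interleaving suspected in this issue can be replayed deterministically in a few lines. This is only a sketch under stated assumptions: `NodeLostMarkerRace` and its methods are hypothetical names, a concurrent set stands in for the marker nodes in ZK, and the three `createIfAbsent` calls stand in for the several nodes that each try to record the same lost node. It is not Solr code.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical replay of the create/remove race on nodeLost markers. */
public class NodeLostMarkerRace {
    // Stand-in for the nodeLost marker directory in ZK (illustrative only).
    private final Set<String> markers = ConcurrentHashMap.newKeySet();

    /** Create-if-absent; returns true only when the marker did not exist yet. */
    boolean createIfAbsent(String lostNode) {
        return markers.add(lostNode);
    }

    /** Delete-based cleanup, as done by a new Overseer after it has processed the markers. */
    void overseerCleanup() {
        markers.clear();
    }

    boolean hasMarker(String lostNode) {
        return markers.contains(lostNode);
    }

    /** Replays the problematic ordering; returns true if a stale marker is left behind. */
    static boolean replay() {
        NodeLostMarkerRace zk = new NodeLostMarkerRace();
        zk.createIfAbsent("nodeA");  // first surviving node notices nodeA is gone
        zk.createIfAbsent("nodeA");  // second node: marker exists, nothing happens
        zk.overseerCleanup();        // new Overseer processes and deletes the marker
        zk.createIfAbsent("nodeA");  // straggler's listener fires late and re-creates it
        return zk.hasMarker("nodeA");
    }

    public static void main(String[] args) {
        System.out.println("stale marker left behind: " + replay());
    }
}
```

With delete-based cleanup, the late create succeeds because nothing records that the event was already handled, which matches a test that expects the marker directory to be empty and then finds a marker in it.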