[jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers

2019-04-09 Thread Andrzej Bialecki (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813718#comment-16813718
 ] 

Andrzej Bialecki  commented on SOLR-13376:
--

[~hossman] - this patch changes the {{OverseerTriggerThread}} so that it no 
longer removes markers once it has finished initializing all triggers; it only 
marks them "inactive". This kills two birds with one stone - it prevents 
straggler nodes from re-creating these markers, and it allows triggers to avoid 
processing them multiple times (on multiple Overseer leader changes). It also 
speeds up removal of markers in {{InactiveMarkersPlanAction}}.
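
To illustrate the idea (this is only a sketch, not the code in the patch): the 
class below uses an in-memory map as a hypothetical stand-in for the marker 
nodes in ZK. All names here ({{MarkerRegistrySketch}}, {{createIfAbsent}}, 
{{deactivate}}, {{needsProcessing}}) are made up for the example. It shows why 
keeping a marker and flagging it inactive stops a straggler from re-creating 
it, while deleting it does not.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the marker nodes in ZK; not Solr's actual code.
class MarkerRegistrySketch {
    enum State { ACTIVE, INACTIVE }

    private final Map<String, State> markers = new ConcurrentHashMap<>();

    // A node's liveNodes listener tries to create a marker.
    // Returns true only if no marker existed and this call created it.
    boolean createIfAbsent(String nodeName) {
        return markers.putIfAbsent(nodeName, State.ACTIVE) == null;
    }

    // Old behavior: the marker is deleted once the triggers are initialized.
    void remove(String nodeName) {
        markers.remove(nodeName);
    }

    // Behavior described in the comment above: keep the marker, flag it inactive.
    void deactivate(String nodeName) {
        markers.replace(nodeName, State.ACTIVE, State.INACTIVE);
    }

    // Triggers only act on markers that are still active.
    boolean needsProcessing(String nodeName) {
        return markers.get(nodeName) == State.ACTIVE;
    }

    public static void main(String[] args) {
        // Delete-based cleanup: a straggler can re-create the marker.
        MarkerRegistrySketch removed = new MarkerRegistrySketch();
        removed.createIfAbsent("node1");
        removed.remove("node1");
        System.out.println(removed.createIfAbsent("node1"));  // true
        System.out.println(removed.needsProcessing("node1")); // true

        // Inactive-flag cleanup: the straggler's create is a no-op.
        MarkerRegistrySketch kept = new MarkerRegistrySketch();
        kept.createIfAbsent("node1");
        kept.deactivate("node1");
        System.out.println(kept.createIfAbsent("node1"));     // false
        System.out.println(kept.needsProcessing("node1"));    // false
    }
}
```

In the delete-based variant the late create succeeds and the event looks new 
again; in the inactive-flag variant the existing marker blocks the create and 
is skipped by anything that only processes active markers.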

I also added some Ref Guide documentation about the maintenance trigger. I'd 
appreciate a review.

> Multi-node race condition to create/remove nodeLost markers
> ---
>
> Key: SOLR-13376
> URL: https://issues.apache.org/jira/browse/SOLR-13376
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Andrzej Bialecki 
>Priority: Major
> Attachments: SOLR-13376.patch
>
>
> NodeMarkersRegistrationTest.testNodeMarkersRegistration is frequently failing 
> on Jenkins builds in the same spot, with similar looking logs.
> Although I haven't been able to reproduce these failures locally, I am fairly 
> confident that the problem is a race condition bug that exists between 
> when/how a new Overseer will process & clean up "nodeLost" markers in ZK, 
> and how other nodes may (mistakenly) re-create those markers in their 
> liveNodes listener.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers

2019-04-09 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813350#comment-16813350
 ] 

ASF subversion and git services commented on SOLR-13376:


Commit ab55b6386b701ec91afb92b269decd081f398ca8 in lucene-solr's branch 
refs/heads/jira/LUCENE-8738 from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ab55b63 ]

SOLR-13376: Disable test until it can be re-written to reflect actual expected 
behavior of how/when node markers will be cleaned up








[jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers

2019-04-09 Thread Andrzej Bialecki (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813271#comment-16813271
 ] 

Andrzej Bialecki  commented on SOLR-13376:
--

bq. it's expected that InactiveMarkersPlanAction is what will clean up the 
markers

It's expected to _eventually_ clean them - the trigger runs once a day. That's 
why the section in {{OverseerTriggerThread.run()}} was removing them on 
overseer leader change, to clean the markers that we know for sure are no 
longer needed. And apparently this creates the race condition.
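
As a rough sketch of what "eventually" means here (not the actual 
{{InactiveMarkersPlanAction}} code): a periodic job only drops markers once 
they are older than some time-to-live, so a marker can linger until the next 
run after it expires. The class name, the map standing in for ZK, and the 
48-hour TTL below are all assumptions made for the example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of TTL-based marker cleanup; not Solr's actual code.
class StaleMarkerCleanupSketch {
    // marker path -> creation time (stand-in for marker nodes in ZK)
    private final Map<String, Instant> markers = new ConcurrentHashMap<>();

    void add(String path, Instant createdAt) {
        markers.put(path, createdAt);
    }

    boolean contains(String path) {
        return markers.containsKey(path);
    }

    // Removes every marker older than ttl as of 'now'; returns how many were removed.
    int cleanup(Instant now, Duration ttl) {
        int before = markers.size();
        markers.values().removeIf(created -> Duration.between(created, now).compareTo(ttl) > 0);
        return before - markers.size();
    }

    public static void main(String[] args) {
        StaleMarkerCleanupSketch sketch = new StaleMarkerCleanupSketch();
        Instant now = Instant.parse("2019-04-09T00:00:00Z");
        sketch.add("nodeLost/node1", now.minus(Duration.ofHours(50)));
        sketch.add("nodeLost/node2", now.minus(Duration.ofMinutes(5)));

        // One simulated daily run with an assumed 48-hour TTL.
        System.out.println(sketch.cleanup(now, Duration.ofHours(48))); // 1
        System.out.println(sketch.contains("nodeLost/node2"));         // true
    }
}
```

With a scheme like this, a test that expects markers to disappear promptly 
cannot rely on the periodic job alone.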
 
bq.  you just re-enabled the test (w/o any modifications to it) and re-resolved 
this issue

Well, for the record, see 1cfbd3e1c84d35e741cfc068a8e88f0eff4ea9e1 where I 
tried to address another source of the test's instability, and the test's 
reliability improved after that change. The race condition that you discovered 
is something new that I wasn't aware of before, so I'm going to fix it (and add 
the missing documentation on {{.scheduled_maintenance}} trigger).







[jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers

2019-04-08 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812907#comment-16812907
 ] 

ASF subversion and git services commented on SOLR-13376:


Commit deb79872720fbaffa138ab4ef3c7226bec935aaf in lucene-solr's branch 
refs/heads/branch_8x from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=deb7987 ]

SOLR-13376: Disable test until it can be re-written to reflect actual expected 
behavior of how/when node markers will be cleaned up

(cherry picked from commit ab55b6386b701ec91afb92b269decd081f398ca8)








[jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers

2019-04-08 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812908#comment-16812908
 ] 

ASF subversion and git services commented on SOLR-13376:


Commit ab55b6386b701ec91afb92b269decd081f398ca8 in lucene-solr's branch 
refs/heads/master from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ab55b63 ]

SOLR-13376: Disable test until it can be re-written to reflect actual expected 
behavior of how/when node markers will be cleaned up








[jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers

2019-04-08 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812880#comment-16812880
 ] 

Hoss Man commented on SOLR-13376:
-

{quote}This cleaning of leftover markers in OverseerTriggerThread was added 
early on when we added this functionality, and it may not be necessary anymore 
- there's InactiveMarkersPlanAction that runs periodically to remove stale 
markers.
{quote}
If this test doesn't reflect reality, and it's expected that 
{{InactiveMarkersPlanAction}} is what will clean up the markers, then the test 
needs to be fixed – because right now it (like many other autoscaling tests) 
goes out of its way to disable the {{.scheduled_maintenance}} trigger.

For the record, this is the *exact* question I asked you about when you first 
resolved SOLR-13072 (but initially left this test marked AwaitsFix), but you 
never replied .. you just re-enabled the test (w/o any modifications to it) and 
re-resolved this issue...

https://issues.apache.org/jira/browse/SOLR-13072?focusedCommentId=16732499=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16732499

Also: AFAICT there is *nothing* in the ref-guide that mentions the 
{{.scheduled_maintenance}} trigger, or any of its (default) actions 
({{inactive_shard_plan}}, {{inactive_markers_plan}}, {{execute_plan}}) or what 
they do, or why they (may) be important for cleaning up things like the 
nodeLost/nodeAdded markers.  That seems like a problematic omission?







[jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers

2019-04-05 Thread Andrzej Bialecki (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810887#comment-16810887
 ] 

Andrzej Bialecki  commented on SOLR-13376:
--

Hmm, indeed there's a race condition here.

The reason for having more than 1 node attempt creating a nodeLost marker is 
that more than 1 node may go away (3 was a magic number ;) that we felt wasn't 
excessive and still reduced the chance of losing the event due to multiple node 
failures).
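
A minimal sketch of that "up to 3 nodes" idea, under an assumption the comment 
above does not spell out: here the candidates are simply the first three 
entries of the sorted set of surviving live nodes. The class and method names 
are hypothetical and this is not the listener code itself.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: decide whether this node is one of the few that
// should try to create a nodeLost marker. Not Solr's actual code.
class NodeLostElectionSketch {
    // The "magic number" from the comment above.
    static final int MAX_CREATORS = 3;

    // True if 'self' is among the first MAX_CREATORS entries of the sorted live nodes.
    // The sorted-order rule is an assumption made for this example.
    static boolean shouldCreateMarker(SortedSet<String> liveNodes, String self) {
        int position = 0;
        for (String node : liveNodes) {
            if (position >= MAX_CREATORS) {
                return false; // self is not in the top three
            }
            if (node.equals(self)) {
                return true;
            }
            position++;
        }
        return false; // self is not a live node at all
    }

    public static void main(String[] args) {
        SortedSet<String> live = new TreeSet<>(
                Arrays.asList("node1", "node2", "node3", "node4", "node5"));
        System.out.println(shouldCreateMarker(live, "node2")); // true
        System.out.println(shouldCreateMarker(live, "node5")); // false
    }
}
```

Because several nodes may pass this check at once, the marker creation itself 
has to tolerate "already exists", and a cleanup that deletes the marker can be 
undone by whichever of those nodes acts last.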

This cleaning of leftover markers in {{OverseerTriggerThread}} was added early 
on when we added this functionality, and it may not be necessary anymore - 
there's {{InactiveMarkersPlanAction}} that runs periodically to remove stale 
markers.



