[jira] [Commented] (SOLR-5859) Harden the Overseer restart mechanism
[ https://issues.apache.org/jira/browse/SOLR-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437946#comment-16437946 ]

Mikhail Khludnev commented on SOLR-5859:

[~noble.paul], I want to clarify https://github.com/apache/lucene-solr/commit/3fd292234166105f96fcb5acd3999c9c2abff737#diff-9ed614eee66b9e685d73446b775dc043R287
{quote}
//do this in a separate thread because any wait is interrupted in this main thread
new Thread(this::checkIfIamStillLeader, "OverseerExitThread").start();
{quote}
Can't we clear the interrupt flag with {{Thread.interrupted()}} and avoid spawning a new thread?

> Harden the Overseer restart mechanism
> -------------------------------------
>
> Key: SOLR-5859
> URL: https://issues.apache.org/jira/browse/SOLR-5859
> Project: Solr
> Issue Type: Improvement
> Reporter: Noble Paul
> Assignee: Noble Paul
> Priority: Major
> Fix For: 4.8, 6.0
>
> Attachments: SOLR-5859.patch, SOLR-5859.patch, SOLR-5859.patch, SOLR-5859.patch
>
>
> SOLR-5476 depends on Overseer restart. The current strategy is to remove the zk node for leader election, wait for STATUS_UPDATE_DELAY + 100 ms, and start the new overseer.
> Though overseer ops are short running, it is not a 100% foolproof strategy, because if an operation takes longer than the wait period there can be a race condition.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
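The question about {{Thread.interrupted()}} in the comment above rests on standard JDK behaviour, which the following minimal sketch illustrates in isolation. It is not Overseer code: the class and method names are hypothetical, and {{Thread.sleep}} stands in for whatever blocking wait the real leader check performs. It only shows that a pending interrupt makes the next blocking call fail immediately, while clearing the flag first lets the same thread wait again.

```java
public class InterruptFlagSketch {

    /** Sleeps briefly; returns true if the sleep finished without being interrupted. */
    static boolean sleepCompletes(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            // Thread.sleep clears the interrupt status when it throws,
            // so no pending interrupt leaks back to the caller.
            return false;
        }
    }

    /** Pending interrupt left in place: the next blocking call fails at once. */
    public static boolean waitWithPendingInterrupt() {
        Thread.currentThread().interrupt(); // simulate the interrupted main thread
        return sleepCompletes(10);
    }

    /** Pending interrupt cleared first: the same thread is able to wait again. */
    public static boolean waitAfterClearingInterrupt() {
        Thread.currentThread().interrupt(); // simulate the interrupted main thread
        boolean wasInterrupted = Thread.interrupted(); // reads the flag and clears it
        return wasInterrupted && sleepCompletes(10);
    }

    public static void main(String[] args) {
        System.out.println("pending interrupt, sleep completes: " + waitWithPendingInterrupt());
        System.out.println("cleared interrupt, sleep completes: " + waitAfterClearingInterrupt());
    }
}
```

Note that clearing the flag discards the interruption signal for any code that runs later on that thread. Whether that is acceptable in the Overseer's main loop is exactly what the comment asks, and this sketch does not settle it.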
[jira] [Commented] (SOLR-5859) Harden the Overseer restart mechanism
[ https://issues.apache.org/jira/browse/SOLR-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961372#comment-13961372 ]

ASF subversion and git services commented on SOLR-5859:
-------------------------------------------------------

Commit 1585276 from no...@apache.org in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1585276 ]

SOLR-5859 Fixing test errors
[jira] [Commented] (SOLR-5859) Harden the Overseer restart mechanism
[ https://issues.apache.org/jira/browse/SOLR-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961370#comment-13961370 ]

ASF subversion and git services commented on SOLR-5859:
-------------------------------------------------------

Commit 1585274 from no...@apache.org in branch 'dev/trunk'
[ https://svn.apache.org/r1585274 ]

SOLR-5859 Fixing test errors
[jira] [Commented] (SOLR-5859) Harden the Overseer restart mechanism
[ https://issues.apache.org/jira/browse/SOLR-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958599#comment-13958599 ]

ASF subversion and git services commented on SOLR-5859:
-------------------------------------------------------

Commit 1584273 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1584273 ]

SOLR-5859 improved logging, and fix a potential bug
[jira] [Commented] (SOLR-5859) Harden the Overseer restart mechanism
[ https://issues.apache.org/jira/browse/SOLR-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958586#comment-13958586 ]

ASF subversion and git services commented on SOLR-5859:
-------------------------------------------------------

Commit 1584271 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1584271 ]

SOLR-5859 improved logging, and fix a potential bug
[jira] [Commented] (SOLR-5859) Harden the Overseer restart mechanism
[ https://issues.apache.org/jira/browse/SOLR-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957986#comment-13957986 ]

ASF subversion and git services commented on SOLR-5859:
-------------------------------------------------------

Commit 1584120 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1584120 ]

SOLR-5859 removing accidental removal of SOLR-5908 changes
[jira] [Commented] (SOLR-5859) Harden the Overseer restart mechanism
[ https://issues.apache.org/jira/browse/SOLR-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957969#comment-13957969 ]

ASF subversion and git services commented on SOLR-5859:
-------------------------------------------------------

Commit 1584115 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1584115 ]

SOLR-5859 removing accidental removal of SOLR-5908 changes
[jira] [Commented] (SOLR-5859) Harden the Overseer restart mechanism
[ https://issues.apache.org/jira/browse/SOLR-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957951#comment-13957951 ]

ASF subversion and git services commented on SOLR-5859:
-------------------------------------------------------

Commit 1584110 from [~steve_rowe] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1584110 ]

SOLR-5859: add OCP.getCollectionStatus() param description for 'clusterState' to stop 'ant precommit' bitching 'Javadoc: Description expected after this reference' and failing the build (merged trunk r1584108)
[jira] [Commented] (SOLR-5859) Harden the Overseer restart mechanism
[ https://issues.apache.org/jira/browse/SOLR-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957944#comment-13957944 ]

ASF subversion and git services commented on SOLR-5859:
-------------------------------------------------------

Commit 1584108 from [~steve_rowe] in branch 'dev/trunk'
[ https://svn.apache.org/r1584108 ]

SOLR-5859: add OCP.getCollectionStatus() param description for 'clusterState' to stop 'ant precommit' bitching 'Javadoc: Description expected after this reference' and failing the build
[jira] [Commented] (SOLR-5859) Harden the Overseer restart mechanism
[ https://issues.apache.org/jira/browse/SOLR-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957855#comment-13957855 ]

ASF subversion and git services commented on SOLR-5859:
-------------------------------------------------------

Commit 1584085 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1584085 ]

SOLR-5859 Harden Overseer restart
[jira] [Commented] (SOLR-5859) Harden the Overseer restart mechanism
[ https://issues.apache.org/jira/browse/SOLR-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957786#comment-13957786 ]

ASF subversion and git services commented on SOLR-5859:
-------------------------------------------------------

Commit 1584069 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1584069 ]

SOLR-5859 Harden Overseer restart
[jira] [Commented] (SOLR-5859) Harden the Overseer restart mechanism
[ https://issues.apache.org/jira/browse/SOLR-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954824#comment-13954824 ]

Mark Miller commented on SOLR-5859:
-----------------------------------

Yes, very nice change. This approach is great.

Patch looks good, but some nits with the current version listed below:

bq. OCP

We info log closing OCP - we probably should not abbreviate it though, a user won't know what it is.

bq. } else if( QUIT.equals(operation)){

{code}
  }
  String getId(){
    return myId;
  }
{code}

There are also some project formatting violations - eg spacing, missing new line:

{code}
log.info("IsClosed :{} , {}", isClosed, this);
log.warn("OverseerCollectionProcessor.processMessage : "+ operation + " , "+ message.toString());
{code}

I think both of those are wrong - should be one log line under debug.

bq. import org.apache.zookeeper.data.Stat;

Unused import added.
[jira] [Commented] (SOLR-5859) Harden the Overseer restart mechanism
[ https://issues.apache.org/jira/browse/SOLR-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13946745#comment-13946745 ]

Shalin Shekhar Mangar commented on SOLR-5859:
---------------------------------------------

This seems to be a much better way of killing an overseer.
[jira] [Commented] (SOLR-5859) Harden the Overseer restart mechanism
[ https://issues.apache.org/jira/browse/SOLR-5859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13946663#comment-13946663 ]

Noble Paul commented on SOLR-5859:
----------------------------------

New strategy: implement a new message called _quit_.

On receiving the message, the Overseer would set isClosed=true, and the loop would exit as soon as the current in-flight message is done. After exiting the loop, it checks whether it is still the leader (most likely it is). If yes, it removes the leader node from ZK and removes itself from the front of the election queue.
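The _quit_ strategy described above can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the patch itself: {{Election}} and {{Worker}} are hypothetical stand-ins for the ZooKeeper leader node, the election queue and the Overseer loop, and only the names {{QUIT}}, {{isClosed}} and {{myId}} echo identifiers that appear in this thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Queue;

public class QuitMessageSketch {
    public static final String QUIT = "quit";

    /** Hypothetical in-memory stand-in for the election state kept in ZooKeeper. */
    public static class Election {
        public String leaderId;                                // stands in for the leader node
        public final Deque<String> queue = new ArrayDeque<>(); // election queue, head = front
    }

    /** Hypothetical stand-in for the Overseer's message loop. */
    public static class Worker {
        public final String myId;
        public final Election election;
        public final List<String> processed = new ArrayList<>();
        public boolean isClosed = false;

        public Worker(String myId, Election election) {
            this.myId = myId;
            this.election = election;
        }

        public void run(Queue<String> workQueue) {
            while (!isClosed) {
                String operation = workQueue.poll();
                if (operation == null) {
                    break; // sketch only: a real loop would block waiting for work
                }
                if (QUIT.equals(operation)) {
                    // Messages are handled one at a time, so nothing is in flight here.
                    isClosed = true;
                } else {
                    processed.add(operation); // stands in for handling the message
                }
            }
            // After leaving the loop: step down only if this node is still the leader.
            if (myId.equals(election.leaderId)) {
                election.leaderId = null;          // remove the leader node
                if (myId.equals(election.queue.peekFirst())) {
                    election.queue.pollFirst();    // leave the front of the election queue
                }
            }
        }
    }

    public static void main(String[] args) {
        Election election = new Election();
        election.leaderId = "n1";
        election.queue.add("n1");
        election.queue.add("n2");

        Queue<String> work = new ArrayDeque<>();
        work.add("op-a");
        work.add(QUIT);
        work.add("op-b");

        Worker worker = new Worker("n1", election);
        worker.run(work);
        System.out.println("processed=" + worker.processed
            + " leader=" + election.leaderId + " next=" + election.queue.peekFirst());
    }
}
```

Because the loop handles one message at a time and only re-checks {{isClosed}} between messages, the hand-over no longer depends on a fixed wait period, which is the race the issue description calls out. Messages queued after _quit_ stay in the queue for the next leader.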