[ 
https://issues.apache.org/jira/browse/SOLR-17764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17980485#comment-17980485
 ] 

Mark Robert Miller commented on SOLR-17764:
-------------------------------------------

You can see the tests used to do that kind of hard shutdown here (how it was 
done varied over time):
   // stop timeout is 0, so we will interrupt right away

That used to be the case, but the comment now appears to be out of date: the 
code uses the default 5-second grace period. Unless your graceful shutdown made 
things even worse than a 5-second wait, it doesn't make sense that it would 
correlate with this failure.

{noformat}
    // Do not let Jetty/Solr pollute the MDC for this thread
    Map<String, String> prevContext = MDC.getCopyOfContextMap();
    MDC.clear();
    try {
      QueuedThreadPool qtp = (QueuedThreadPool) server.getThreadPool();
      ReservedThreadExecutor rte = qtp.getBean(ReservedThreadExecutor.class);

      server.stop();

      // stop timeout is 0, so we will interrupt right away
      while (!qtp.isStopped()) {
        qtp.stop();
        if (qtp.isStopped()) {
          Thread.sleep(50);
        }
      }

      // we tried to kill everything, now we wait for executor to stop
      qtp.setStopTimeout(Integer.MAX_VALUE);
      qtp.stop();
      qtp.join();
{noformat}
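
For anyone following along, the difference between "interrupt right away" and 
waiting out a grace period can be sketched with a plain java.util.concurrent 
executor. This is only an analogy under stated assumptions, not Jetty's or 
Solr's actual shutdown code: the class ShutdownSketch and the method 
stopAndReportInterrupted are hypothetical names, and the timings are 
arbitrary.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; not taken from Jetty or Solr.
public class ShutdownSketch {

  /**
   * Starts one worker that sleeps for workMillis, then stops the pool.
   * graceMillis == 0 models a hard stop (interrupt right away);
   * graceMillis > 0 models waiting out a grace period before interrupting.
   * Returns true if the worker was interrupted before it finished.
   */
  static boolean stopAndReportInterrupted(long workMillis, long graceMillis)
      throws InterruptedException {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    AtomicBoolean interrupted = new AtomicBoolean(false);
    CountDownLatch started = new CountDownLatch(1);

    pool.execute(() -> {
      started.countDown();
      try {
        Thread.sleep(workMillis); // stands in for an in-flight request
      } catch (InterruptedException e) {
        interrupted.set(true);
      }
    });
    started.await(); // make sure the worker is running before stopping

    pool.shutdown(); // stop accepting new work
    if (!pool.awaitTermination(graceMillis, TimeUnit.MILLISECONDS)) {
      pool.shutdownNow(); // grace period expired: interrupt what is left
    }
    pool.awaitTermination(10, TimeUnit.SECONDS);
    return interrupted.get();
  }

  public static void main(String[] args) throws InterruptedException {
    // Hard stop: the 2s task is interrupted immediately.
    System.out.println("hard stop interrupted worker: "
        + stopAndReportInterrupted(2000, 0));
    // Grace period longer than the task: the task finishes on its own.
    System.out.println("graceful stop interrupted worker: "
        + stopAndReportInterrupted(200, 2000));
  }
}
```

With a zero grace period the in-flight task is cut off; with a grace period 
longer than the task, it completes normally. That is the behavioral 
difference the stale comment glosses over.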


> "graceful" jetty shutdown causes ChaosMonkeySafeLeaderWithPullReplicasTest 
> failures
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-17764
>                 URL: https://issues.apache.org/jira/browse/SOLR-17764
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: 
> E7F93005B9386058.OUTPUT-org.apache.solr.cloud.ChaosMonkeySafeLeaderWithPullReplicasTest.txt
>
>
> Reviewing recent jenkins test failure metrics, I noticed that (Nightly) test 
> ChaosMonkeySafeLeaderWithPullReplicasTest started failing ~60% of the time 
> right around the time that SOLR-17744 was committed.
> Things I have observed:
>  * Seeds from failing runs seem to reliably reproduce the failure
>  ** These failures do *NOT* reproduce if I revert to just before SOLR-17744
>  * In ad-hoc testing, seeds that do _not_ fail on the first attempt seem 
> to reliably succeed on all subsequent attempts
>  ** Suggesting that the root cause is something deterministic in the 
> {{random()}}-ness of the test, and not something dependent on timing or 
> concurrency.


