[ https://issues.apache.org/jira/browse/SOLR-9189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316873#comment-15316873 ]
Hoss Man edited comment on SOLR-9189 at 6/6/16 5:55 PM:
--------------------------------------------------------

My initial gut paranoia skimming the jenkins emails this morning was to assume that this might be because of SOLR-9107 / SOLR-5776 -- the hypothesis being: "The increased randomized use of ssl (factoring in tests.nightly / tests.multiplier) is causing more tests to slow down due to the crypto calculations"

...but that hypothesis seemed weak once I started looking at the logs -- there is a "Randomized ssl" line as part of the logs for every SolrTestCaseJ4 subclass showing if ssl is being used or not...

* http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Linux/834/
** 25 test failures
** only 7 of those were using ssl
* https://builds.apache.org/job/Lucene-Solr-NightlyTests-master/1034/
** 44 test failures
** only 17 of those were using ssl

...even if we assume every test failure where ssl was in use was directly caused by ssl, that still leaves a really high increase in the number of failed tests in those two runs.

So my amended (paranoid) hypothesis is "The increased randomized use of ssl (factoring in tests.nightly / tests.multiplier) is causing more tests to slow down due to the crypto calculations *EVEN IN OTHER TESTS AT THE SAME TIME DUE TO CPU STARVATION*"

I'm going to commit a blanket disable of all SSL randomization _on master_ ASAP to test this hypothesis.
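
The counts quoted above can be checked with a little arithmetic. This is only a sketch: the numbers come straight from the two jenkins runs cited in the comment, while the dictionary layout and the helper name `non_ssl_failures` are made up for illustration.

```python
# (total failures, failures where the "Randomized ssl" log line showed ssl in use),
# as quoted in the comment for each jenkins run.
runs = {
    "Lucene-Solr-6.x-Linux/834": (25, 7),
    "Lucene-Solr-NightlyTests-master/1034": (44, 17),
}


def non_ssl_failures(total, with_ssl):
    """Failures that cannot be blamed directly on ssl in the failing test itself."""
    return total - with_ssl


summary = {}
for job, (total, with_ssl) in runs.items():
    without = non_ssl_failures(total, with_ssl)
    # Store the count and its share of all failures, rounded to a whole percent.
    summary[job] = (without, round(100 * without / total))
    print(f"{job}: {without} of {total} failures did not use ssl")
```

Even under the most pessimistic reading (every ssl failure is caused by ssl), well over half of the failures in each run still need another explanation, which is what motivates the CPU starvation hypothesis.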

Part of me feels like this is an overkill reaction, and that a more rational response would simply be to undo the "increased odds of using ssl" portion of SOLR-9107 -- but I'd really like to get a definitive understanding of whether SSL usage is really having such a seriously pronounced effect on other tests in the same jenkins run -- OR -- *is it just a red herring, and some other recent change has caused serious timeout issues?*

----

EDIT: clarified jira references


was (Author: hossman):
My initial gut paranoia skimming the jenkins emails this morning was to assume that this might be because of SOLR-5776 -- the hypothesis being: "The increased randomized use of ssl (factoring in tests.nightly / tests.multiplier) is causing more tests to slow down due to the crypto calculations"

...but that hypothesis seemed weak once I started looking at the logs -- there is a "Randomized ssl" line as part of the logs for every SolrTestCaseJ4 subclass showing if ssl is being used or not...

* http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Linux/834/
** 25 test failures
** only 7 of those were using ssl
* https://builds.apache.org/job/Lucene-Solr-NightlyTests-master/1034/
** 44 test failures
** only 17 of those were using ssl

...even if we assume every test failure where ssl was in use was directly caused by ssl, that still leaves a really high increase in the number of failed tests in those two runs.

So my amended (paranoid) hypothesis is "The increased randomized use of ssl (factoring in tests.nightly / tests.multiplier) is causing more tests to slow down due to the crypto calculations *EVEN IN OTHER TESTS AT THE SAME TIME DUE TO CPU STARVATION*"

I'm going to commit a blanket disable of all SSL randomization _on master_ ASAP to test this hypothesis.

Part of me feels like this is an overkill reaction, and that a more rational response would simply be to undo the "increased odds of using ssl" portion of SOLR-5776 -- but I'd really like to get a definitive understanding of whether SSL usage is really having such a seriously pronounced effect on other tests in the same jenkins run -- OR -- *is it just a red herring, and some other recent change has caused serious timeout issues?*

> explosion of timeout related failures in jenkins the past few days
> ------------------------------------------------------------------
>
>                 Key: SOLR-9189
>                 URL: https://issues.apache.org/jira/browse/SOLR-9189
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>            Priority: Critical
>
> In the past few days, something has gone seriously wonky with our jenkins
> tests -- causing a serious explosion in the number of test failures --
> notably due to various sorts of timeouts...
> * "Unable to create core ... Timed out getting coreNodeName for ..."
> * "msg=SolrCore is loading,code=503"
> * "Timeout occured while waiting response from server"
> * "No registered leader was found after waiting for 30000ms"
> * "Unable to create core ... Caused by: Timed out getting shard id for core:
> ..."