Re: PLEASE READ! BadApple report. Last week was horrible!
Phew! Thanks for digging, Erick, and for producing these BadApple reports.

Mike McCandless
http://blog.mikemccandless.com
Re: PLEASE READ! BadApple report. Last week was horrible!
OK, this morning things are back to normal. I think the disk space issue was to blame, because a check after Mike’s fix alone didn’t look like it had cured the problem.

Thanks all!
Re: PLEASE READ! BadApple report. Last week was horrible!
OK, thanks Chris. The 24-hour rollup still shows many failures in several classes; I’ll check tomorrow to see whether that’s a consequence of the disk-full problem.
Re: PLEASE READ! BadApple report. Last week was horrible!
: And FWIW, I beasted one of the failing suites last night _without_
: Mike’s changes and didn’t get any failures so I can’t say anything about
: whether Mike’s changes helped or not.

IIUC McCandless's failure only affects you if you use the "jenkins" test
data file (the really big wikipedia dump) ... see the jira he mentioned
for details.

-Hoss
http://www.lucidworks.com/
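A bit of context on "the jenkins test data file": in the Lucene test framework, LineFileDocs picks its input from the tests.linedocsfile system property (surfaced as LuceneTestCase.TEST_LINE_DOCS_FILE); the Jenkins nightly jobs point it at the huge enwiki dump, while ordinary runs fall back to the small file bundled with the framework. A sketch of what that looks like from a test, written from memory of the API, with a placeholder path:

    import java.util.Random;
    import org.apache.lucene.util.LineFileDocs;

    // Which file backs LineFileDocs is decided per run via a system property:
    //   -Dtests.linedocsfile=/data/enwiki.random.lines.txt   <- jenkins-style run
    // (that path is a placeholder). Without the property, tests fall back to the
    // small line-docs file bundled with the test framework and never touch the
    // big dump, which is why most local runs never saw the failure.
    public class LineDocsSketch {
      public static void main(String[] args) throws Exception {
        String path = System.getProperty("tests.linedocsfile");
        try (LineFileDocs docs = (path == null)
                 ? new LineFileDocs(new Random())          // framework default file
                 : new LineFileDocs(new Random(), path)) { // the big wikipedia dump
          System.out.println(docs.nextDoc());              // one document per line
        }
      }
    }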
Re: PLEASE READ! BadApple report. Last week was horrible!
2119758844836987 UNLOAD) [n:127.0.0.1:49613_solr x:replicaTypesTestColl_shard1_replica_p4 ] o.a.s.m.r.SolrJmxReporter Closing reporter [org.apache.solr.metrics.reporters.SolrJmxReporter@1f2a6e95: rootName = solr_49613, domain = solr.core.replicaTypesTestColl.shard1.replica_p4, service url = null, agent id = null] for registry solr.core.replicaTypesTestColl.shard1.replica_p4/com.codahale.metrics.MetricRegistry@2edb03e2
   [junit4]   2> 33770 ERROR (indexFetcher-621-thread-1) [n:127.0.0.1:49612_solr ] o.a.s.h.ReplicationHandler Index fetch failed :java.lang.NullPointerException
   [junit4]   2>    at org.apache.solr.handler.IndexFetcher.getLeaderReplica(IndexFetcher.java:709)
   [junit4]   2>    at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:387)
   [junit4]   2>    at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:351)
   [junit4]   2>    at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:422)
   [junit4]   2>    at org.apache.solr.handler.ReplicationHandler.lambda$setupPolling$13(ReplicationHandler.java:1208)
   [junit4]   2>    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
   [junit4]   2>    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
   [junit4]   2>    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
   [junit4]   2>    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   [junit4]   2>    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   [junit4]   2>    at java.base/java.lang.Thread.run(Thread.java:834)
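The stack trace above reads like a poll-versus-teardown race: the scheduled replication poll fires while the UNLOAD is tearing the replica down, the leader lookup comes back empty, and an unguarded dereference in getLeaderReplica throws. That is an inference from the log, not a confirmed diagnosis. Here is a hypothetical, self-contained illustration of the shape of such a race; the names are invented, not Solr's code:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch (not Solr code): a scheduled replication poll races
    // a core UNLOAD. Once the unload removes the entry, the leader lookup
    // returns null and the unguarded dereference throws NPE, which the
    // handler logs as "Index fetch failed :java.lang.NullPointerException".
    public class PollVsUnload {
      static final Map<String, String> leaders =
          new ConcurrentHashMap<>(Map.of("shard1", "http://127.0.0.1:49613/solr"));

      public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
        poller.scheduleAtFixedRate(() -> {
          try {
            String leaderUrl = leaders.get("shard1"); // null after the unload
            System.out.println("fetching index from " + leaderUrl.trim());
          } catch (Exception e) {
            System.out.println("Index fetch failed :" + e); // mirrors the log line
          }
        }, 0, 10, TimeUnit.MILLISECONDS);

        Thread.sleep(25);
        leaders.remove("shard1"); // the UNLOAD wins the race
        Thread.sleep(50);
        poller.shutdownNow();
      }
    }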
RE: PLEASE READ! BadApple report. Last week was horrible!
Hi,

there was also a problem with the Windows node. It ran out of disk space, because some test seems to have filled up all of the disk. All follow-up builds failed. I cleaned all workspaces (8.x, master) and that freed 20 gigabytes!

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de
Re: PLEASE READ! BadApple report. Last week was horrible!
Mike:

I saw the push. Hoss’ rollups cover “the last 24 hours”, so it’ll be Tuesday evening before things have had a chance to work their way through; I’ll look tomorrow.

Meanwhile I’m beasting one of the failing test suites (without the change): 280 iterations so far and no failures. That said, the failure rate was < 1%, so it’s not conclusive. Only another 720 runs to go before I pull the latest changes and try again… ;)
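A note on why 280 clean runs is “not conclusive”: if a test fails independently with probability p per run, the chance of a fully clean beast of n runs is (1-p)^n. At p = 1% that is about 6% for n = 280, so a clean 280 proves little, while a clean 1,000 pushes it down to roughly 0.004%, which is presumably why Erick is going the full thousand (likely via the build’s beast target, something like ant beast -Dbeast.iterations=1000 -Dtests.class="*.SomeFailingTest"; the class name there is a placeholder). A quick sketch of the arithmetic, with p assumed rather than measured:

    // Back-of-the-envelope check on beasting:
    // P(no failures in n independent runs) = (1 - p)^n.
    // p is the assumed per-run failure rate from the thread ("< 1%").
    public class BeastOdds {
      public static void main(String[] args) {
        double p = 0.01;
        for (int n : new int[] {280, 1000}) {
          System.out.printf("n=%4d  P(clean beast despite bug) = %.5f%n",
              n, Math.pow(1 - p, n));
        }
        // Prints approximately:
        // n= 280  P(clean beast despite bug) = 0.05996
        // n=1000  P(clean beast despite bug) = 0.00004
      }
    }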
Re: PLEASE READ! BadApple report. Last week was horrible!
Hi Erick,

OK, I pushed a fix! See if it decreases the failure rate for those newly bad apples?

Sorry and thanks :)

Mike McCandless
http://blog.mikemccandless.com
Re: PLEASE READ! BadApple report. Last week was horrible!
Mike:

I have no idea. Hoss’ rollups don’t link back to builds; they just aggregate the results.

Not a huge deal if it’s something like this, of course. Let’s just say I’ve had my share of “moments” ;).

And unfortunately, the test failures are pretty rare on a percentage basis, so it’s hard to tell.

I’m watching LUCENE-9191 and I’ll look back at Hoss’ rollups a day after you push it and see if the failures disappear. It’ll take a while for the fixes to roll through all the reporting.

Tell you what: I’ll try beasting one of the classes that fails a lot, then try it again after you push LUCENE-9191, and we’ll go from there.

Thanks for getting into this so promptly!

Erick
Re: PLEASE READ! BadApple report. Last week was horrible!
Hi Erick,

It's possible this was the root cause of many of the failures: https://issues.apache.org/jira/browse/LUCENE-9191

Do these transient failures look something like this?

   [junit4]    > Throwable #1: java.nio.charset.MalformedInputException: Input length = 1
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([172C6414BE5E2A2C:E5829DFC005A1F0]:0)
   [junit4]    >    at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
   [junit4]    >    at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
   [junit4]    >    at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
   [junit4]    >    at java.base/java.io.InputStreamReader.read(InputStreamReader.java:185)
   [junit4]    >    at java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
   [junit4]    >    at java.base/java.io.BufferedReader.readLine(BufferedReader.java:326)
   [junit4]    >    at java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
   [junit4]    >    at org.apache.lucene.util.LineFileDocs.open(LineFileDocs.java:175)
   [junit4]    >    at org.apache.lucene.util.LineFileDocs.<init>(LineFileDocs.java:65)
   [junit4]    >    at org.apache.lucene.util.LineFileDocs.<init>(LineFileDocs.java:69)

If so, then it is likely the root cause ... I'm working on a fix. Sorry!

Mike McCandless
http://blog.mikemccandless.com
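For anyone who hasn't seen that exception class before: a strict CharsetDecoder (CodingErrorAction.REPORT, which StandardCharsets.UTF_8.newDecoder() gives you by default) throws MalformedInputException the moment decoding starts in the middle of a multi-byte UTF-8 sequence. As far as I understand LUCENE-9191, that is what happens when LineFileDocs seeks to a random byte offset in the big line file and lands inside a character. A minimal standalone reproduction of the decoder behavior, not Lucene's actual code:

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Start decoding one byte into a two-byte UTF-8 character and readLine()
    // fails just like the trace above: MalformedInputException: Input length = 1.
    public class MidCharSeek {
      public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("linedocs", ".txt");
        Files.write(p, "ééééé doc text\n".getBytes(StandardCharsets.UTF_8)); // 'é' = 0xC3 0xA9

        try (InputStream in = Files.newInputStream(p)) {
          in.read(); // consume one byte: now positioned inside the first 'é'
          BufferedReader reader = new BufferedReader(new InputStreamReader(
              in, StandardCharsets.UTF_8.newDecoder())); // strict: REPORT, not REPLACE
          reader.readLine(); // throws java.nio.charset.MalformedInputException
        }
      }
    }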
PLEASE READ! BadApple report. Last week was horrible!
I don’t know whether we had some temporary glitch that broke lots of tests and they’ve since been fixed, or we had a major regression, but this needs to be addressed ASAP if they’re still failing. See everything below the line "ALL OF THE TESTS BELOW HERE HAVE ONLY FAILED IN THE LAST WEEK!" in this e-mail. I’ll raise a JIRA if we can’t get some traction quickly here.

Hey, stuff happens. There’s no problem with tests going totally weird for a while. If you can say “Oh, yeah, all those failures for class XYZ are probably fixed”, that’s fine.

Gosh-a-rooni, I hope my logging changes aren’t the culprit (gulp)….

Hoss’ rollup for the last 24 hours is not encouraging in terms of the problem already being fixed. There are lots of failures in some classes, notably:

CloudHttp2SolrClientTest
CollectionsAPIDistributedZkTest
DeleteReplicaTest
TestDocCollectionWatcher

Unfortunately, the failure rate is not very high, so reliably reproducing is hard.

I’ve reproduced the last week’s failures in this e-mail; the full report is attached.

Here’s Hoss’ rollup: http://fucit.org/solr-jenkins-reports/failure-report.html

Usual synopsis:

Raw fail count by week totals, most recent week first (corresponds to bits):
Week: 0 had 343 failures
Week: 1 had  86 failures
Week: 2 had  78 failures
Week: 3 had 117 failures

Failures in Hoss' reports for the last 4 rollups.

There were 497 unannotated tests that failed in Hoss' rollups, ordered by the date I downloaded the rollup file, newest->oldest. See above for the dates the files were collected. These tests were NOT BadApple'd or AwaitsFix’d.

Failures in the last 4 reports:

Report   Pct  runs  fails  test
  0123   0.7  1617     11  ConnectionManagerTest.testReconnectWhenZkDisappeared
  0123   1.5  1606     12  ExecutePlanActionTest.testTaskTimeout
  0123   1.6  1320     19  MultiThreadedOCPTest.test
  0123   1.0  1620     13  RollingRestartTest.test
  0123   1.2  1617     12  SearchRateTriggerTest.testWaitForElapsed
  0123   3.8   119      7  ShardSplitTest.testSplitWithChaosMonkey
  0123   0.3  1519      7  TestInPlaceUpdatesDistrib.test
  0123   0.7  1629     14  TestIndexWriterDelete.testDeleteAllNoDeadLock
  0123   2.4  1548     18  TestPackages.testPluginLoading
  0123   0.3  1587      4  UnloadDistributedZkTest.test

FAILURES IN THE LAST WEEK (343!)
Look particularly at the ones with only a zero in the “Report” column; those are failures that were _not_ in the previous 3 weeks’ rollups.
Report   Pct  runs  fails  test
  012    0.5  1165      4  CustomHighlightComponentTest.test
  012    1.0  1168      6  NodeMarkersRegistrationTest.testNodeMarkersRegistration
  012    1.0  1170      8  TestCryptoKeys.test
  01 3   0.7  1233     11  LeaderFailoverAfterPartitionTest.test
  01 3  63.2   102     39  StressHdfsTest.test
  01     0.3   709      2  ScheduledTriggerIntegrationTest.testScheduledTrigger
  01     0.2   768      2  ShardRoutingTest.test
  01     2.6   807     22  TestAllFilesHaveChecksumFooter.test
  01     2.6   808     22  TestAllFilesHaveCodecHeader.test
  01     0.2   769      2  TestCloudSchemaless.test
  01     0.2   769      2  TestDynamicLoading.testDynamicLoading
  01     0.3   707      2  TestDynamicLoadingUrl.testDynamicLoadingUrl
  01     0.5   767      4  TestPointFields.testFloatPointStats
  012    7.1    83     19  TestSQLHandler.doTest
  01     0.2   794     12  TestSameScoresWithThreads.test
  01     2.6   806     22  TestShardSearching.testSimple
  01     0.5   726      4  TestSimScenario.testSplitShard
  01     1.1   726      7  TestSimScenario.testSuggestions
  01     0.3   771      2  TestWithCollection.testAddReplicaSimple
  0 23   0.3  1223      4  CdcrVersionReplicationTest.testCdcrDocVersions
  0 23   0.8  1172      6  CloudHttp2SolrClientTest.testRetryUpdatesWhenClusterStateIsStale
  0 23   1.4  1202      8  CollectionsAPISolrJTest.testColStatus
  0 23   1.0  1249     11  HttpPartitionTest.test
  0 23   1.1  1210      8  HttpPartitionWithTlogReplicasTest.test
  0 23   0.5  1258      4  ShardSplitTest.testSplitShardWithRuleLink
  0 23   0.2  1231      4  TestQueryingOnDownCollection.testQueryToDownCollectionShouldFailFast
  0 23   0.2  1232      6  TestSolrConfigHandlerCloud.test
  0 2    0.3   767      2  DocValuesNotIndexedTest.testGroupingDVOnlySortLast
  0 2    0.3   750      2  TestLBHttp2SolrClient.testTwoServers
  0 2    0.3   794      2  TestSolrCloudSnapshots.testSnapshots
  0 2   40.7    51     12  TestXYMultiPolygonShapeQueries.testRa
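Since the “Report” column is terse, one note on reading these tables: each digit names a rollup the test failed in, 0 being the newest, so “0123” means failed in all four weekly rollups and a bare “0” is a brand-new failure this week, the rows Erick is flagging. A toy filter showing that reading; the second row is invented for illustration:

    import java.util.List;

    // The "Report" column: digit k present = failed in rollup k (0 = newest).
    // Rows whose report is exactly "0" only started failing this week.
    public class NewFailures {
      public static void main(String[] args) {
        List<String> rows = List.of(
            "0123  0.7  1617  11  ConnectionManagerTest.testReconnectWhenZkDisappeared",
            "0     2.2   807  18  HypotheticalNewTest.test"); // invented example row
        for (String row : rows) {
          if (row.split("\\s+")[0].equals("0")) {
            System.out.println("new this week: " + row);
          }
        }
      }
    }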