Re: PLEASE READ! BadApple report. Last week was horrible!
Phew! Thanks for digging Erick, and for producing these BadApple reports.

Mike McCandless
http://blog.mikemccandless.com

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
Re: PLEASE READ! BadApple report. Last week was horrible!
OK, this morning things are back to normal. I think the disk space issue was to blame, because when I checked after Mike’s fix it didn’t look like the fix alone had cured the problem.

Thanks all!
Re: PLEASE READ! BadApple report. Last week was horrible!
OK, thanks Chris. The 24-hour rollup still shows many failures in several classes; I’ll check tomorrow to see whether that’s a consequence of the disk-full problem.
Re: PLEASE READ! BadApple report. Last week was horrible!
: And FWIW, I beasted one of the failing suites last night _without_
: Mike’s changes and didn’t get any failures so I can’t say anything about
: whether Mike’s changes helped or not.

IIUC McCandless's failure only affects you if you use the "jenkins" test data file (the really big wikipedia dump) ... see the jira he mentioned for details.

-Hoss
http://www.lucidworks.com/
Re: PLEASE READ! BadApple report. Last week was horrible!
Thanks Uwe and Mike. I’ll check Hoss’ rollups regularly this week; it’ll be tomorrow (Wednesday) before Uwe’s changes get a chance to be reflected there.

And FWIW, I beasted one of the failing suites last night _without_ Mike’s changes and didn’t get any failures, so I can’t say anything about whether Mike’s changes helped or not.

Side note: I’m seeing about 40% of the test runs throw an NPE, both before and after the changes, but the tests succeed, so I’d guess it’s totally unrelated. At a glance, the replica is being unloaded while the replication handler hasn’t stopped yet:

[junit4] 2> 33676 INFO (qtp1195412159-1051) [n:127.0.0.1:49614_solr ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={qt=/admin/cores=9929b23d-8505-48a5-841a-0992da1af4752119758844417043=REQUESTSTATUS=javabin=2} status=0 QTime=0
[junit4] 2> 33746 INFO (parallelCoreAdminExecutor-541-thread-6-processing-n:127.0.0.1:49614_solr x:replicaTypesTestColl_shard3_replica_n12 9929b23d-8505-48a5-841a-0992da1af4752119758845055723 UNLOAD) [n:127.0.0.1:49614_solr x:replicaTypesTestColl_shard3_replica_n12 ] o.a.s.m.SolrMetricManager Closing metric reporters for registry=solr.core.replicaTypesTestColl.shard3.replica_n12 tag=null
[junit4] 2> 33746 INFO (parallelCoreAdminExecutor-541-thread-6-processing-n:127.0.0.1:49614_solr x:replicaTypesTestColl_shard3_replica_n12 9929b23d-8505-48a5-841a-0992da1af4752119758845055723 UNLOAD) [n:127.0.0.1:49614_solr x:replicaTypesTestColl_shard3_replica_n12 ] o.a.s.m.r.SolrJmxReporter Closing reporter [org.apache.solr.metrics.reporters.SolrJmxReporter@57b06a4b: rootName = solr_49614, domain = solr.core.replicaTypesTestColl.shard3.replica_n12, service url = null, agent id = null] for registry solr.core.replicaTypesTestColl.shard3.replica_n12/com.codahale.metrics.MetricRegistry@6cceb313
[junit4] 2> 33746 INFO (parallelCoreAdminExecutor-541-thread-4-processing-n:127.0.0.1:49614_solr x:replicaTypesTestColl_shard1_replica_n1 9929b23d-8505-48a5-841a-0992da1af4752119758844417043 UNLOAD) [n:127.0.0.1:49614_solr ] o.a.s.c.SolrCore [replicaTypesTestColl_shard1_replica_n1] CLOSING SolrCore org.apache.solr.core.SolrCore@55431ea6
[junit4] 2> 33748 INFO (parallelCoreAdminExecutor-523-thread-6-processing-n:127.0.0.1:49612_solr x:replicaTypesTestColl_shard2_replica_t8 9929b23d-8505-48a5-841a-0992da1af4752119758844946997 UNLOAD) [n:127.0.0.1:49612_solr x:replicaTypesTestColl_shard2_replica_t8 ] o.a.s.m.SolrMetricManager Closing metric reporters for registry=solr.core.replicaTypesTestColl.shard2.replica_t8 tag=null
[junit4] 2> 33748 INFO (parallelCoreAdminExecutor-523-thread-4-processing-n:127.0.0.1:49612_solr x:replicaTypesTestColl_shard1_replica_t2 9929b23d-8505-48a5-841a-0992da1af4752119758844761938 UNLOAD) [n:127.0.0.1:49612_solr x:replicaTypesTestColl_shard1_replica_t2 ] o.a.s.c.ZkController replicaTypesTestColl_shard1_replica_t2 stopping background replication from leader
[junit4] 2> 33749 INFO (parallelCoreAdminExecutor-523-thread-6-processing-n:127.0.0.1:49612_solr x:replicaTypesTestColl_shard2_replica_t8 9929b23d-8505-48a5-841a-0992da1af4752119758844946997 UNLOAD) [n:127.0.0.1:49612_solr x:replicaTypesTestColl_shard2_replica_t8 ] o.a.s.m.r.SolrJmxReporter Closing reporter [org.apache.solr.metrics.reporters.SolrJmxReporter@534a73a: rootName = solr_49612, domain = solr.core.replicaTypesTestColl.shard2.replica_t8, service url = null, agent id = null] for registry solr.core.replicaTypesTestColl.shard2.replica_t8/com.codahale.metrics.MetricRegistry@33c32317
[junit4] 2> 33749 INFO (parallelCoreAdminExecutor-523-thread-4-processing-n:127.0.0.1:49612_solr x:replicaTypesTestColl_shard1_replica_t2 9929b23d-8505-48a5-841a-0992da1af4752119758844761938 UNLOAD) [n:127.0.0.1:49612_solr ] o.a.s.c.SolrCore [replicaTypesTestColl_shard1_replica_t2] CLOSING SolrCore org.apache.solr.core.SolrCore@1ce79412
[junit4] 2> 33750 INFO (parallelCoreAdminExecutor-546-thread-5-processing-n:127.0.0.1:49613_solr x:replicaTypesTestColl_shard1_replica_p4 9929b23d-8505-48a5-841a-0992da1af4752119758844836987 UNLOAD) [n:127.0.0.1:49613_solr x:replicaTypesTestColl_shard1_replica_p4 ] o.a.s.m.SolrMetricManager Closing metric reporters for registry=solr.core.replicaTypesTestColl.shard1.replica_p4 tag=null
[junit4] 2> 33750 INFO (parallelCoreAdminExecutor-546-thread-4-processing-n:127.0.0.1:49613_solr x:replicaTypesTestColl_shard2_replica_n6 9929b23d-8505-48a5-841a-0992da1af4752119758844903977 UNLOAD) [n:127.0.0.1:49613_solr ] o.a.s.c.SolrCore [replicaTypesTestColl_shard2_replica_n6] CLOSING SolrCore org.apache.solr.core.SolrCore@36707173
[junit4] 2> 33750 INFO (parallelCoreAdminExecutor-546-thread-5-processing-n:127.0.0.1:49613_solr x:replicaTypesTestColl_shard1_replica_p4 9929b23d-8505-48a5-841a-0992da1af4752119758844836987 UNLOAD) [n:127.0.0.1:49613_solr
RE: PLEASE READ! BadApple report. Last week was horrible!
Hi,

there was also a problem with the Windows node. It ran out of disk space, because some test seems to have filled up all of the disk. All follow-up builds failed. I cleaned all workspaces (8.x, master) and it freed 20 gigabytes!

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Erick Erickson
> Sent: Monday, May 4, 2020 1:54 PM
> To: dev@lucene.apache.org
> Subject: PLEASE READ! BadApple report. Last week was horrible!
>
> I don’t know whether we had some temporary glitch that broke lots of tests
> and they’ve been fixed, or we had a major regression, but this needs to be
> addressed ASAP if they’re still failing. See everything below the line "ALL OF
> THE TESTS BELOW HERE HAVE ONLY FAILED IN THE LAST WEEK!” in this e-mail.
> I’ll raise a JIRA if we can’t get some traction quickly here.
>
> Hey, stuff happens. There’s no problem with tests going totally weird for a
> while. If you can say “Oh, yeah, all those failures for class XYZ are probably
> fixed” that’s fine.
>
> Gosh-a-rooni, I hope my logging changes aren’t the culprit (gulp)….
>
> Hoss’ rollup for the last 24 hours is not encouraging in terms of the
> problem already being fixed. There are lots of failures in some
> classes, notably:
>
> CloudHttp2SolrClientTest
> CollectionsAPIDistributedZkTest
> DeleteReplicaTest
> TestDocCollectionWatcher
>
> Unfortunately, the failure rate is not very high, so reliably
> reproducing is hard.
>
> I’ve reproduced the last week’s failures in this e-mail, full
> report attached.
>
> Here’s Hoss’ rollup:
> http://fucit.org/solr-jenkins-reports/failure-report.html
>
> Usual synopsis:
>
> Raw fail count by week totals, most recent week first (corresponds to bits):
> Week: 0 had 343 failures
> Week: 1 had 86 failures
> Week: 2 had 78 failures
> Week: 3 had 117 failures
>
> Failures in Hoss' reports for the last 4 rollups.
>
> There were 497 unannotated tests that failed in Hoss' rollups. Ordered by the
> date I downloaded the rollup file, newest->oldest. See above for the dates the
> files were collected.
> These tests were NOT BadApple'd or AwaitsFix’d.
>
> Failures in the last 4 reports..
>    Report  Pct   runs  fails  test
>    0123    0.7   1617  11     ConnectionManagerTest.testReconnectWhenZkDisappeared
>    0123    1.5   1606  12     ExecutePlanActionTest.testTaskTimeout
>    0123    1.6   1320  19     MultiThreadedOCPTest.test
>    0123    1.0   1620  13     RollingRestartTest.test
>    0123    1.2   1617  12     SearchRateTriggerTest.testWaitForElapsed
>    0123    3.8   119   7      ShardSplitTest.testSplitWithChaosMonkey
>    0123    0.3   1519  7      TestInPlaceUpdatesDistrib.test
>    0123    0.7   1629  14     TestIndexWriterDelete.testDeleteAllNoDeadLock
>    0123    2.4   1548  18     TestPackages.testPluginLoading
>    0123    0.3   1587  4      UnloadDistributedZkTest.test
>
> FAILURES IN THE LAST WEEK (343!)
> Look particularly at the ones with only a zero in the “Report” column; those are
> failures that were _not_ in the previous 3 weeks’ rollups.
>
>    Report  Pct   runs  fails  test
>    012     0.5   1165  4      CustomHighlightComponentTest.test
>    012     1.0   1168  6      NodeMarkersRegistrationTest.testNodeMarkersRegistration
>    012     1.0   1170  8      TestCryptoKeys.test
>    01 3    0.7   1233  11     LeaderFailoverAfterPartitionTest.test
>    01 3    63.2  102   39     StressHdfsTest.test
>    01      0.3   709   2      ScheduledTriggerIntegrationTest.testScheduledTrigger
>    01      0.2   768   2      ShardRoutingTest.test
>    01      2.6   807   22     TestAllFilesHaveChecksumFooter.test
>    01      2.6   808   22     TestAllFilesHaveCodecHeader.test
>    01      0.2   769   2      TestCloudSchemaless.test
>    01      0.2   769   2      TestDynamicLoading.testDynamicLoading
>    01      0.3   707   2      TestDynamicLoadingUrl.testDynamicLoadingUrl
>    01      0.5   767   4      TestPointFields.testFloatPointStats
>    012     7.1   83    19     TestSQLHandler.doTest
>    01      0.2   794   12     TestSameScoresWithThreads.test
>    01      2.6   806   22     TestShardSearching.testSimple
>    01      0.5   726   4      TestSimScenario.testSplitShard
>    01      1.1   726   7      TestSimScenario.testSuggestions
>    01      0.3   771   2      TestWithCollection.testAddReplicaSimple
>    0 23    0.3   1223  4      CdcrVersionReplicationTest.testCdcrDocVersions
>    0 23    0.8   1172  6      CloudHttp2SolrClientTest.testRetryUpdatesWhenClusterStateIsStale
>    0 23    1.4   1202  8      CollectionsAPISolrJTest.testColStatus
>    0 23    1.0
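A full disk on a CI node, as described above, fails every subsequent build in confusing ways. As an illustration only (not something the Lucene/Solr builds actually run, and the 20 GB threshold is an arbitrary assumption), a pre-build guard could make the root cause obvious immediately:

```java
import java.io.File;

public class DiskSpaceGuard {
    /** Free space on the partition holding 'path', in gigabytes. */
    static double freeGb(String path) {
        return new File(path).getUsableSpace() / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        double free = freeGb(args.length > 0 ? args[0] : ".");
        System.out.printf("free: %.1f GB%n", free);
        if (free < 20.0) { // arbitrary threshold, sized to a test workspace
            System.err.println("Low disk space -- clean old workspaces before building");
            System.exit(1);
        }
    }
}
```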
Re: PLEASE READ! BadApple report. Last week was horrible!
Mike:

I saw the push. Hoss’ rollups cover “the last 24 hours”, so it’ll be Tuesday evening before things have had a chance to work their way through; I’ll look tomorrow.

Meanwhile I’m beasting one of the failing test suites (without the change): 280 iterations so far and no failures. That said, the failure rate was < 1%, so it’s not conclusive. Only another 720 runs to go before I pull the latest changes and try again… ;)
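A quick back-of-the-envelope calculation shows why 280 clean runs are inconclusive at a sub-1% failure rate: the chance of seeing zero failures by pure luck is still sizable. A small sketch (the 1% rate and iteration counts are taken from the message above; `BeastOdds` is just an illustrative name):

```java
public class BeastOdds {
    /** Probability of observing zero failures in 'runs' independent runs. */
    static double pZeroFailures(double failRate, int runs) {
        return Math.pow(1.0 - failRate, runs);
    }

    public static void main(String[] args) {
        // ~6%: even a genuinely 1%-flaky suite passes 280 runs more than 1 time in 20.
        System.out.printf("280 clean runs at 1%% fail rate: %.1f%%%n",
                100 * pZeroFailures(0.01, 280));
        // At 1000 runs the odds of a clean streak are essentially nil.
        System.out.printf("1000 clean runs at 1%% fail rate: %.4f%%%n",
                100 * pZeroFailures(0.01, 1000));
    }
}
```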
Re: PLEASE READ! BadApple report. Last week was horrible!
Hi Erick,

OK I pushed a fix! See if it decreases the failure rate for those newly bad apples?

Sorry and thanks :)

Mike McCandless
http://blog.mikemccandless.com
Re: PLEASE READ! BadApple report. Last week was horrible!
Mike:

I have no idea. Hoss’ rollups don’t link back to builds, they just aggregate the results.

Not a huge deal if it’s something like this, of course. Let’s just say I’ve had my share of “moments” ;).

And unfortunately, the test failures are pretty rare on a percentage basis, so it’s hard to tell.

I’m watching LUCENE-9191 and I’ll look back at Hoss’ rollups a day after you push it to see if the failures disappear.

It’ll take a while for the fixes to roll through all the reporting.

Tell you what: I’ll try beasting one of the classes that fails a lot, then try it again after you push LUCENE-9191, and we’ll go from there.

Thanks for getting into this so promptly!

Erick
Re: PLEASE READ! BadApple report. Last week was horrible!
Hi Erick,

It's possible this was the root cause of many of the failures: https://issues.apache.org/jira/browse/LUCENE-9191

Do these transient failures look something like this?

   [junit4]   > Throwable #1: java.nio.charset.MalformedInputException: Input length = 1
   [junit4]   >    at __randomizedtesting.SeedInfo.seed([172C6414BE5E2A2C:E5829DFC005A1F0]:0)
   [junit4]   >    at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
   [junit4]   >    at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
   [junit4]   >    at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
   [junit4]   >    at java.base/java.io.InputStreamReader.read(InputStreamReader.java:185)
   [junit4]   >    at java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
   [junit4]   >    at java.base/java.io.BufferedReader.readLine(BufferedReader.java:326)
   [junit4]   >    at java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
   [junit4]   >    at org.apache.lucene.util.LineFileDocs.open(LineFileDocs.java:175)
   [junit4]   >    at org.apache.lucene.util.LineFileDocs.<init>(LineFileDocs.java:65)
   [junit4]   >    at org.apache.lucene.util.LineFileDocs.<init>(LineFileDocs.java:69)

If so, then it is likely the root cause ... I'm working on a fix. Sorry!

Mike McCandless
http://blog.mikemccandless.com
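For context on the MalformedInputException stack trace above: a strict charset decoder throws exactly this exception when decoding starts in the middle of a multi-byte UTF-8 sequence, for example after seeking to an arbitrary byte offset in a large line file. A minimal standalone sketch of the failure mode (this is not Lucene's actual LineFileDocs code, and the class name is made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MalformedInputDemo {
    /**
     * Decode bytes[offset..] with a strict UTF-8 decoder.
     * Returns null on success, or the exception text on failure.
     */
    static String decodeStrict(byte[] bytes, int offset) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of replacing
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes, offset, bytes.length - offset));
            return null;
        } catch (CharacterCodingException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        byte[] euro = "\u20ac".getBytes(StandardCharsets.UTF_8); // 3 bytes: E2 82 AC
        // Decoding from the start of the character is fine...
        System.out.println(decodeStrict(euro, 0)); // null
        // ...but starting one byte in lands on a continuation byte,
        // which a strict decoder rejects:
        System.out.println(decodeStrict(euro, 1)); // java.nio.charset.MalformedInputException: Input length = 1
    }
}
```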