Re: PLEASE READ! BadApple report. Last week was horrible!

2020-05-06 Thread Michael McCandless
Phew!  Thanks for digging, Erick, and for producing these BadApple reports.

Mike McCandless

http://blog.mikemccandless.com


On Wed, May 6, 2020 at 7:59 AM Erick Erickson 
wrote:

> OK, this morning things are back to normal. I think the disk space issue
> was to blame, because when I checked after Mike’s fix it didn’t look like
> that fix had cured the problem.
>
> Thanks all!
>
> > On May 5, 2020, at 1:41 PM, Chris Hostetter 
> wrote:
> >
> >
> > : And FWIW, I beasted one of the failing suites last night _without_
> > : Mike’s changes and didn’t get any failures so I can’t say anything
> about
> > : whether Mike’s changes helped or not.
> >
> > IIUC McCandless's failure only affects you if you use the "jenkins" test
> > data file (the really big wikipedia dump) ... see the jira he mentioned
> > for details.
> >
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
> >
>
>
>
>


Re: PLEASE READ! BadApple report. Last week was horrible!

2020-05-06 Thread Erick Erickson
OK, this morning things are back to normal. I think the disk space issue
was to blame, because when I checked after Mike’s fix it didn’t look like
that fix had cured the problem.

Thanks all!

> On May 5, 2020, at 1:41 PM, Chris Hostetter  wrote:
> 
> 
> : And FWIW, I beasted one of the failing suites last night _without_ 
> : Mike’s changes and didn’t get any failures so I can’t say anything about 
> : whether Mike’s changes helped or not.
> 
> IIUC McCandless's failure only affects you if you use the "jenkins" test 
> data file (the really big wikipedia dump) ... see the jira he mentioned 
> for details.
> 
> 
> 
> -Hoss
> http://www.lucidworks.com/
> 





Re: PLEASE READ! BadApple report. Last week was horrible!

2020-05-05 Thread Erick Erickson
OK, thanks Chris. 

The 24 hour rollup still shows many failures in several classes; I’ll check
tomorrow to see if that’s a consequence of the disk-full problem.

> On May 5, 2020, at 1:41 PM, Chris Hostetter  wrote:
> 
> 
> : And FWIW, I beasted one of the failing suites last night _without_ 
> : Mike’s changes and didn’t get any failures so I can’t say anything about 
> : whether Mike’s changes helped or not.
> 
> IIUC McCandless's failure only affects you if you use the "jenkins" test 
> data file (the really big wikipedia dump) ... see the jira he mentioned 
> for details.
> 
> 
> 
> -Hoss
> http://www.lucidworks.com/
> 





Re: PLEASE READ! BadApple report. Last week was horrible!

2020-05-05 Thread Chris Hostetter

: And FWIW, I beasted one of the failing suites last night _without_ 
: Mike’s changes and didn’t get any failures so I can’t say anything about 
: whether Mike’s changes helped or not.

IIUC McCandless's failure only affects you if you use the "jenkins" test 
data file (the really big wikipedia dump) ... see the jira he mentioned 
for details.



-Hoss
http://www.lucidworks.com/


Re: PLEASE READ! BadApple report. Last week was horrible!

2020-05-05 Thread Erick Erickson
Thanks, Uwe and Mike. I’ll check Hoss’ rollups regularly this week; it’ll be
tomorrow (Wednesday) before Uwe’s changes have a chance to be reflected there.

And FWIW, I beasted one of the failing suites last night _without_ Mike’s 
changes and didn’t get any failures so I can’t say anything about whether 
Mike’s changes helped or not.

Side note: I’m seeing about 40% of the tests throw an NPE both before and
after, but the tests succeed, so I’d guess it’s totally unrelated. At a glance,
the replica is being unloaded and the replication handler hasn’t stopped yet:

   [junit4]   2> 33676 INFO  (qtp1195412159-1051) [n:127.0.0.1:49614_solr ] 
o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores 
params={qt=/admin/cores=9929b23d-8505-48a5-841a-0992da1af4752119758844417043=REQUESTSTATUS=javabin=2}
 status=0 QTime=0
   [junit4]   2> 33746 INFO  
(parallelCoreAdminExecutor-541-thread-6-processing-n:127.0.0.1:49614_solr 
x:replicaTypesTestColl_shard3_replica_n12 
9929b23d-8505-48a5-841a-0992da1af4752119758845055723 UNLOAD) 
[n:127.0.0.1:49614_solrx:replicaTypesTestColl_shard3_replica_n12 ] 
o.a.s.m.SolrMetricManager Closing metric reporters for 
registry=solr.core.replicaTypesTestColl.shard3.replica_n12 tag=null
   [junit4]   2> 33746 INFO  
(parallelCoreAdminExecutor-541-thread-6-processing-n:127.0.0.1:49614_solr 
x:replicaTypesTestColl_shard3_replica_n12 
9929b23d-8505-48a5-841a-0992da1af4752119758845055723 UNLOAD) 
[n:127.0.0.1:49614_solrx:replicaTypesTestColl_shard3_replica_n12 ] 
o.a.s.m.r.SolrJmxReporter Closing reporter 
[org.apache.solr.metrics.reporters.SolrJmxReporter@57b06a4b: rootName = 
solr_49614, domain = solr.core.replicaTypesTestColl.shard3.replica_n12, service 
url = null, agent id = null] for registry 
solr.core.replicaTypesTestColl.shard3.replica_n12/com.codahale.metrics.MetricRegistry@6cceb313
   [junit4]   2> 33746 INFO  
(parallelCoreAdminExecutor-541-thread-4-processing-n:127.0.0.1:49614_solr 
x:replicaTypesTestColl_shard1_replica_n1 
9929b23d-8505-48a5-841a-0992da1af4752119758844417043 UNLOAD) 
[n:127.0.0.1:49614_solr ] o.a.s.c.SolrCore 
[replicaTypesTestColl_shard1_replica_n1]  CLOSING SolrCore 
org.apache.solr.core.SolrCore@55431ea6
   [junit4]   2> 33748 INFO  
(parallelCoreAdminExecutor-523-thread-6-processing-n:127.0.0.1:49612_solr 
x:replicaTypesTestColl_shard2_replica_t8 
9929b23d-8505-48a5-841a-0992da1af4752119758844946997 UNLOAD) 
[n:127.0.0.1:49612_solrx:replicaTypesTestColl_shard2_replica_t8 ] 
o.a.s.m.SolrMetricManager Closing metric reporters for 
registry=solr.core.replicaTypesTestColl.shard2.replica_t8 tag=null
   [junit4]   2> 33748 INFO  
(parallelCoreAdminExecutor-523-thread-4-processing-n:127.0.0.1:49612_solr 
x:replicaTypesTestColl_shard1_replica_t2 
9929b23d-8505-48a5-841a-0992da1af4752119758844761938 UNLOAD) 
[n:127.0.0.1:49612_solrx:replicaTypesTestColl_shard1_replica_t2 ] 
o.a.s.c.ZkController replicaTypesTestColl_shard1_replica_t2 stopping background 
replication from leader
   [junit4]   2> 33749 INFO  
(parallelCoreAdminExecutor-523-thread-6-processing-n:127.0.0.1:49612_solr 
x:replicaTypesTestColl_shard2_replica_t8 
9929b23d-8505-48a5-841a-0992da1af4752119758844946997 UNLOAD) 
[n:127.0.0.1:49612_solrx:replicaTypesTestColl_shard2_replica_t8 ] 
o.a.s.m.r.SolrJmxReporter Closing reporter 
[org.apache.solr.metrics.reporters.SolrJmxReporter@534a73a: rootName = 
solr_49612, domain = solr.core.replicaTypesTestColl.shard2.replica_t8, service 
url = null, agent id = null] for registry 
solr.core.replicaTypesTestColl.shard2.replica_t8/com.codahale.metrics.MetricRegistry@33c32317
   [junit4]   2> 33749 INFO  
(parallelCoreAdminExecutor-523-thread-4-processing-n:127.0.0.1:49612_solr 
x:replicaTypesTestColl_shard1_replica_t2 
9929b23d-8505-48a5-841a-0992da1af4752119758844761938 UNLOAD) 
[n:127.0.0.1:49612_solr ] o.a.s.c.SolrCore 
[replicaTypesTestColl_shard1_replica_t2]  CLOSING SolrCore 
org.apache.solr.core.SolrCore@1ce79412
   [junit4]   2> 33750 INFO  
(parallelCoreAdminExecutor-546-thread-5-processing-n:127.0.0.1:49613_solr 
x:replicaTypesTestColl_shard1_replica_p4 
9929b23d-8505-48a5-841a-0992da1af4752119758844836987 UNLOAD) 
[n:127.0.0.1:49613_solrx:replicaTypesTestColl_shard1_replica_p4 ] 
o.a.s.m.SolrMetricManager Closing metric reporters for 
registry=solr.core.replicaTypesTestColl.shard1.replica_p4 tag=null
   [junit4]   2> 33750 INFO  
(parallelCoreAdminExecutor-546-thread-4-processing-n:127.0.0.1:49613_solr 
x:replicaTypesTestColl_shard2_replica_n6 
9929b23d-8505-48a5-841a-0992da1af4752119758844903977 UNLOAD) 
[n:127.0.0.1:49613_solr ] o.a.s.c.SolrCore 
[replicaTypesTestColl_shard2_replica_n6]  CLOSING SolrCore 
org.apache.solr.core.SolrCore@36707173
   [junit4]   2> 33750 INFO  
(parallelCoreAdminExecutor-546-thread-5-processing-n:127.0.0.1:49613_solr 
x:replicaTypesTestColl_shard1_replica_p4 
9929b23d-8505-48a5-841a-0992da1af4752119758844836987 UNLOAD) 
[n:127.0.0.1:49613_solr
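
For what it’s worth, here is a tiny, purely illustrative Java sketch of the kind of
shutdown-ordering race the log above suggests: a background poller keeps firing after
the core state it reads has been torn down, so it NPEs until it is stopped. The class
and field names are made up for illustration; this is not Solr’s ReplicationHandler
or SolrCore code.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Made-up stand-in for a core whose state goes away on unload.
    class FakeCore {
      volatile Object searcher = new Object();
      void close() { searcher = null; }   // "unload": the state is gone
    }

    public class PollerRaceSketch {
      public static void main(String[] args) throws Exception {
        FakeCore core = new FakeCore();
        ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
        // Background "replication" poll that dereferences core state.
        poller.scheduleAtFixedRate(() -> {
          try {
            System.out.println("poll ok: " + core.searcher.hashCode());
          } catch (NullPointerException e) {
            // This is the benign-looking NPE: the poll ran after unload.
            System.out.println("poll hit NPE after unload");
          }
        }, 0, 10, TimeUnit.MILLISECONDS);

        Thread.sleep(30);
        core.close();              // unload while the poller is still scheduled
        Thread.sleep(30);
        // The usual fix pattern is to stop the poller *before* closing the core:
        // poller.shutdown(); poller.awaitTermination(5, TimeUnit.SECONDS); core.close();
        poller.shutdownNow();
      }
    }

If the real NPE has that shape, it would also explain why the tests still pass: the
poll task just logs the exception and the core is going away anyway.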

RE: PLEASE READ! BadApple report. Last week was horrible!

2020-05-05 Thread Uwe Schindler
Hi,

there was also a problem with the Windows node. It ran out of disk space
because some test seems to have filled up all of the disk. All follow-up builds
failed. I cleaned all workspaces (8.x, master) and it freed 20 gigabytes!

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Erick Erickson 
> Sent: Monday, May 4, 2020 1:54 PM
> To: dev@lucene.apache.org
> Subject: PLEASE READ! BadApple report. Last week was horrible!
> 
> I don’t know whether we had some temporary glitch that broke lots of tests
> and they’ve been fixed or we had a major regression, but this needs to be
> addressed ASAP if they’re still failing. See everything below the line "ALL OF
> THE TESTS BELOW HERE HAVE ONLY FAILED IN THE LAST WEEK!” in this e-mail.
> I’ll raise a JIRA if we can’t get some traction quickly here.
> 
> Hey, stuff happens. There’s no problem with tests going totally weird for a
> while. If you can say “Oh, yeah, all those failures for class XYZ are probably
> fixed” that’s fine.
> 
> Gosh-a-rooni, I hope my logging changes aren’t the culprit (gulp)….
> 
> Hoss’ rollup for the last 24 hours is not encouraging in terms of the
> problem already being fixed. There are lots of failures in some
> classes, notably:
> 
> CloudHttp2SolrClientTest
> CollectionsAPIDistributedZkTest
> DeleteReplicaTest
> TestDocCollectionWatcher
> 
> Unfortunately, the failure rate is not very high so reliably
> reproducing is hard.
> 
> I’ve reproduced the last week’s failure in this e-mail, full
> report attached.
> 
> Here’s Hoss’ rollup:
> http://fucit.org/solr-jenkins-reports/failure-report.html
> 
> Usual synopsis:
> 
> Raw fail count by week totals, most recent week first (corresponds to bits):
> Week: 0  had  343 failures
> Week: 1  had  86 failures
> Week: 2  had  78 failures
> Week: 3  had  117 failures
> 
> 
> Failures in Hoss' reports for the last 4 rollups.
> 
> There were 497 unannotated tests that failed in Hoss' rollups. Ordered by the
> date I downloaded the rollup file, newest->oldest. See above for the dates the
> files were collected
> These tests were NOT BadApple'd or AwaitsFix’d
> 
> Failures in the last 4 reports..
>Report   Pct runsfails   test
>  0123   0.7 1617 11
> ConnectionManagerTest.testReconnectWhenZkDisappeared
>  0123   1.5 1606 12  ExecutePlanActionTest.testTaskTimeout
>  0123   1.6 1320 19  MultiThreadedOCPTest.test
>  0123   1.0 1620 13  RollingRestartTest.test
>  0123   1.2 1617 12  SearchRateTriggerTest.testWaitForElapsed
>  0123   3.8  119  7  ShardSplitTest.testSplitWithChaosMonkey
>  0123   0.3 1519  7  TestInPlaceUpdatesDistrib.test
>  0123   0.7 1629 14  
> TestIndexWriterDelete.testDeleteAllNoDeadLock
>  0123   2.4 1548 18  TestPackages.testPluginLoading
>  0123   0.3 1587  4  UnloadDistributedZkTest.test
> 
> 
> FAILURES IN THE LAST WEEK (343!)
> Look particularly at the ones with only a zero in the “Report” column; those
> are failures that were _not_ in the previous 3 weeks’ rollups.
> 
>Report   Pct runsfails   test
>  0120.5 1165  4  CustomHighlightComponentTest.test
>  0121.0 1168  6
> NodeMarkersRegistrationTest.testNodeMarkersRegistration
>  0121.0 1170  8  TestCryptoKeys.test
>  01 3   0.7 1233 11  LeaderFailoverAfterPartitionTest.test
>  01 3  63.2  102 39  StressHdfsTest.test
>  01 0.3  709  2
> ScheduledTriggerIntegrationTest.testScheduledTrigger
>  01 0.2  768  2  ShardRoutingTest.test
>  01 2.6  807 22  TestAllFilesHaveChecksumFooter.test
>  01 2.6  808 22  TestAllFilesHaveCodecHeader.test
>  01 0.2  769  2  TestCloudSchemaless.test
>  01 0.2  769  2  TestDynamicLoading.testDynamicLoading
>  01 0.3  707  2  
> TestDynamicLoadingUrl.testDynamicLoadingUrl
>  01 0.5  767  4  TestPointFields.testFloatPointStats
>  0127.1   83 19  TestSQLHandler.doTest
>  01 0.2  794 12  TestSameScoresWithThreads.test
>  01 2.6  806 22  TestShardSearching.testSimple
>  01 0.5  726  4  TestSimScenario.testSplitShard
>  01 1.1  726  7  TestSimScenario.testSuggestions
>  01 0.3  771  2  TestWithCollection.testAddReplicaSimple
>  0 23   0.3 1223  4  
> CdcrVersionReplicationTest.testCdcrDocVersions
>  0 23   0.8 1172  6
> CloudHttp2SolrClientTest.testRetryUpdatesWhenClusterStateIsStale
>  0 23   1.4 1202  8  CollectionsAPISolrJTest.testColStatus
>  0 23   1.0 

Re: PLEASE READ! BadApple report. Last week was horrible!

2020-05-04 Thread Erick Erickson
Mike:

I saw the push. Hoss’ rollups cover “the last 24 hours”, so it’ll be Tuesday
evening before things have had a chance to work their way through; I’ll look
tomorrow.

Meanwhile I’m beasting one of the failing test suites (without the change):
280 iterations so far and no failures. That said, the failure rate was < 1%, so
it’s not conclusive. Only another 720 runs to go before I pull the latest
changes and try again… ;)
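
In case anyone wants to beast a single suite locally without the full beasting
setup, the randomizedtesting framework the tests already run under has a @Repeat
annotation that re-runs a method many times inside one JVM. Rough sketch only;
the test class and method names below are placeholders, not the actual failing
suite:

    import com.carrotsearch.randomizedtesting.annotations.Repeat;
    import org.apache.lucene.util.LuceneTestCase;

    // Placeholder suite: substitute the real flaky test class/method here.
    public class MyFlakySuiteTest extends LuceneTestCase {

      // Re-runs this one method 1000 times within a single suite run; each
      // iteration gets its own derived random seed unless useConstantSeed is set.
      @Repeat(iterations = 1000)
      public void testSuspectedFlake() throws Exception {
        // ... body of the suspect test ...
      }
    }

(If memory serves, the command-line equivalent is the tests.iters property,
e.g. -Dtests.iters=1000 together with -Dtestcase, but double-check the build docs.)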



> On May 4, 2020, at 1:33 PM, Michael McCandless  
> wrote:
> 
> Hi Erick,
> 
> OK I pushed a fix!  See if it decreases the failure rate for those newly bad 
> apples?
> 
> Sorry and thanks :)
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Mon, May 4, 2020 at 1:06 PM Erick Erickson  wrote:
> Mike:
> 
> I have no idea. Hoss’ rollups don’t link back to builds, they
> just aggregate the results.
> 
> Not a huge deal if it’s something like this of course. Let’s just
> say I’ve had my share of “moments” ;).
> 
> And unfortunately, the test failures are pretty rare on a 
> percentage basis, so it’s hard to tell.
> 
> I’m watching LUCENE-9191 and I’ll look back at Hoss’ rollups
> a day after you push it and see if the failures disappear.
> 
> It’ll take a while for the fixes to roll through all the reporting.
> 
> Tell you what. I’ll try beasting one of the classes that fails a lot and then
> try it again after you push LUCENE-9191 and we’ll go from there.
> 
> Thanks for getting into this so promptly!
> 
> Erick
> 
> > On May 4, 2020, at 9:10 AM, Michael McCandless  
> > wrote:
> > 
> > Hi Erick,
> > 
> > It's possible this was the root cause of many of the failures: 
> > https://issues.apache.org/jira/browse/LUCENE-9191
> > 
> > Do these transient failures look something like this?
> > 
> >[junit4]> Throwable #1: java.nio.charset.MalformedInputException: 
> > Input length = 1
> >[junit4]>at 
> > __randomizedtesting.SeedInfo.seed([172C6414BE5E2A2C:E5829DFC005A1F0]:0)
> >[junit4]>at 
> > java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
> >[junit4]>at 
> > java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
> >[junit4]>at 
> > java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
> >[junit4]>at 
> > java.base/java.io.InputStreamReader.read(InputStreamReader.java:185)
> >[junit4]>at 
> > java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
> >[junit4]>at 
> > java.base/java.io.BufferedReader.readLine(BufferedReader.java:326)
> >[junit4]>at 
> > java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
> >[junit4]>at 
> > org.apache.lucene.util.LineFileDocs.open(LineFileDocs.java:175)
> >[junit4]>at 
> > org.apache.lucene.util.LineFileDocs.<init>(LineFileDocs.java:65)
> >[junit4]>at 
> > org.apache.lucene.util.LineFileDocs.<init>(LineFileDocs.java:69)
> > 
> > 
> > If so, then it is likely the root cause ... I'm working on a fix.  Sorry!
> > 
> > Mike McCandless
> > 
> > http://blog.mikemccandless.com
> > 
> > 
> > On Mon, May 4, 2020 at 7:54 AM Erick Erickson  
> > wrote:
> > I don’t know whether we had some temporary glitch that broke lots of tests 
> > and they’ve been fixed or we had a major regression, but this needs to be 
> > addressed ASAP if they’re still failing. See everything below the line "ALL 
> > OF THE TESTS BELOW HERE HAVE ONLY FAILED IN THE LAST WEEK!” in this e-mail. 
> > I’ll raise a JIRA if we can’t get some traction quickly here.
> > 
> > Hey, stuff happens. There’s no problem with tests going totally weird for a
> > while. If you can say “Oh, yeah, all those failures for class XYZ are 
> > probably fixed” that’s fine.
> > 
> > Gosh-a-rooni, I hope my logging changes aren’t the culprit (gulp)….
> > 
> > Hoss’ rollup for the last 24 hours is not encouraging in terms of the
> > problem already being fixed. There are lots of failures in some
> > classes, notably:
> > 
> > CloudHttp2SolrClientTest
> > CollectionsAPIDistributedZkTest
> > DeleteReplicaTest
> > TestDocCollectionWatcher
> > 
> > Unfortunately, the failure rate is not very high so reliably 
> > reproducing is hard.
> > 
> > I’ve reproduced the last week’s failure in this e-mail, full 
> > report attached. 
> > 
> > Here’s Hoss’ rollup:
> > http://fucit.org/solr-jenkins-reports/failure-report.html
> > 
> > Usual synopsis:
> > 
> > Raw fail count by week totals, most recent week first (corresponds to bits):
> > Week: 0  had  343 failures
> > Week: 1  had  86 failures
> > Week: 2  had  78 failures
> > Week: 3  had  117 failures
> > 
> > 
> > Failures in Hoss' reports for the last 4 rollups.
> > 
> > There were 497 unannotated tests that failed in Hoss' rollups. Ordered by 
> > the date I downloaded the rollup file, newest->oldest. See above for the 
> > dates the files were collected 
> > These tests were NOT BadApple'd or AwaitsFix’d
> 

Re: PLEASE READ! BadApple report. Last week was horrible!

2020-05-04 Thread Michael McCandless
Hi Erick,

OK I pushed a fix!  See if it decreases the failure rate for those newly
bad apples?

Sorry and thanks :)

Mike McCandless

http://blog.mikemccandless.com


On Mon, May 4, 2020 at 1:06 PM Erick Erickson 
wrote:

> Mike:
>
> I have no idea. Hoss’ rollups don’t link back to builds, they
> just aggregate the results.
>
> Not a huge deal if it’s something like this of course. Let’s just
> say I’ve had my share of “moments” ;).
>
> And unfortunately, the test failures are pretty rare on a
> percentage basis, so it’s hard to tell.
>
> I’m watching LUCENE-9191 and I’ll look back at Hoss’ rollups
> a day after you push it and see if the failures disappear.
>
> It’ll take a while for the fixes to roll through all the reporting.
>
> Tell you what. I’ll try beasting one of the classes that fails a lot and
> then
> try it again after you push LUCENE-9191 and we’ll go from there.
>
> Thanks for getting into this so promptly!
>
> Erick
>
> > On May 4, 2020, at 9:10 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
> >
> > Hi Erick,
> >
> > It's possible this was the root cause of many of the failures:
> https://issues.apache.org/jira/browse/LUCENE-9191
> >
> > Do these transient failures look something like this?
> >
> >[junit4]> Throwable #1: java.nio.charset.MalformedInputException:
> Input length = 1
> >[junit4]>at
> __randomizedtesting.SeedInfo.seed([172C6414BE5E2A2C:E5829DFC005A1F0]:0)
> >[junit4]>at
> java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
> >[junit4]>at
> java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
> >[junit4]>at
> java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
> >[junit4]>at java.base/java.io
> .InputStreamReader.read(InputStreamReader.java:185)
> >[junit4]>at java.base/java.io
> .BufferedReader.fill(BufferedReader.java:161)
> >[junit4]>at java.base/java.io
> .BufferedReader.readLine(BufferedReader.java:326)
> >[junit4]>at java.base/java.io
> .BufferedReader.readLine(BufferedReader.java:392)
> >[junit4]>at
> org.apache.lucene.util.LineFileDocs.open(LineFileDocs.java:175)
> >[junit4]>at
> org.apache.lucene.util.LineFileDocs.<init>(LineFileDocs.java:65)
> >[junit4]>at
> org.apache.lucene.util.LineFileDocs.<init>(LineFileDocs.java:69)
> >
> >
> > If so, then it is likely the root cause ... I'm working on a fix.  Sorry!
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Mon, May 4, 2020 at 7:54 AM Erick Erickson 
> wrote:
> > I don’t know whether we had some temporary glitch that broke lots of
> tests and they’ve been fixed or we had a major regression, but this needs
> to be addressed ASAP if they’re still failing. See everything below the
> line "ALL OF THE TESTS BELOW HERE HAVE ONLY FAILED IN THE LAST WEEK!” in
> this e-mail. I’ll raise a JIRA if we can’t get some traction quickly here.
> >
> > Hey, stuff happens. There’s no problem with tests going totally weird
> for a while. If you can say “Oh, yeah, all those failures for class XYZ are
> probably fixed” that’s fine.
> >
> > Gosh-a-rooni, I hope my logging changes aren’t the culprit (gulp)….
> >
> > Hoss’ rollup for the last 24 hours is not encouraging in terms of the
> > problem already being fixed. There are lots of failures in some
> > classes, notably:
> >
> > CloudHttp2SolrClientTest
> > CollectionsAPIDistributedZkTest
> > DeleteReplicaTest
> > TestDocCollectionWatcher
> >
> > Unfortunately, the failure rate is not very high so reliably
> > reproducing is hard.
> >
> > I’ve reproduced the last week’s failure in this e-mail, full
> > report attached.
> >
> > Here’s Hoss’ rollup:
> > http://fucit.org/solr-jenkins-reports/failure-report.html
> >
> > Usual synopsis:
> >
> > Raw fail count by week totals, most recent week first (corresponds to
> bits):
> > Week: 0  had  343 failures
> > Week: 1  had  86 failures
> > Week: 2  had  78 failures
> > Week: 3  had  117 failures
> >
> >
> > Failures in Hoss' reports for the last 4 rollups.
> >
> > There were 497 unannotated tests that failed in Hoss' rollups. Ordered
> by the date I downloaded the rollup file, newest->oldest. See above for the
> dates the files were collected
> > These tests were NOT BadApple'd or AwaitsFix’d
> >
> > Failures in the last 4 reports..
> >Report   Pct runsfails   test
> >  0123   0.7 1617 11
> ConnectionManagerTest.testReconnectWhenZkDisappeared
> >  0123   1.5 1606 12
> ExecutePlanActionTest.testTaskTimeout
> >  0123   1.6 1320 19  MultiThreadedOCPTest.test
> >  0123   1.0 1620 13  RollingRestartTest.test
> >  0123   1.2 1617 12
> SearchRateTriggerTest.testWaitForElapsed
> >  0123   3.8  119  7
> ShardSplitTest.testSplitWithChaosMonkey
> >  0123   0.3 1519  7  

Re: PLEASE READ! BadApple report. Last week was horrible!

2020-05-04 Thread Erick Erickson
Mike:

I have no idea. Hoss’ rollups don’t link back to builds, they
just aggregate the results.

Not a huge deal if it’s something like this of course. Let’s just
say I’ve had my share of “moments” ;).

And unfortunately, the test failures are pretty rare on a 
percentage basis, so it’s hard to tell.

I’m watching LUCENE-9191 and I’ll look back at Hoss’ rollups
a day after you push it and see if the failures disappear.

It’ll take a while for the fixes to roll through all the reporting.

Tell you what. I’ll try beasting one of the classes that fails a lot and then
try it again after you push LUCENE-9191 and we’ll go from there.

Thanks for getting into this so promptly!

Erick

> On May 4, 2020, at 9:10 AM, Michael McCandless  
> wrote:
> 
> Hi Erick,
> 
> It's possible this was the root cause of many of the failures: 
> https://issues.apache.org/jira/browse/LUCENE-9191
> 
> Do these transient failures look something like this?
> 
>[junit4]> Throwable #1: java.nio.charset.MalformedInputException: 
> Input length = 1
>[junit4]>at 
> __randomizedtesting.SeedInfo.seed([172C6414BE5E2A2C:E5829DFC005A1F0]:0)
>[junit4]>at 
> java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
>[junit4]>at 
> java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>[junit4]>at 
> java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>[junit4]>at 
> java.base/java.io.InputStreamReader.read(InputStreamReader.java:185)
>[junit4]>at 
> java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
>[junit4]>at 
> java.base/java.io.BufferedReader.readLine(BufferedReader.java:326)
>[junit4]>at 
> java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
>[junit4]>at 
> org.apache.lucene.util.LineFileDocs.open(LineFileDocs.java:175)
>[junit4]>at 
> org.apache.lucene.util.LineFileDocs.<init>(LineFileDocs.java:65)
>[junit4]>at 
> org.apache.lucene.util.LineFileDocs.<init>(LineFileDocs.java:69)
> 
> 
> If so, then it is likely the root cause ... I'm working on a fix.  Sorry!
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Mon, May 4, 2020 at 7:54 AM Erick Erickson  wrote:
> I don’t know whether we had some temporary glitch that broke lots of tests 
> and they’ve been fixed or we had a major regression, but this needs to be 
> addressed ASAP if they’re still failing. See everything below the line "ALL 
> OF THE TESTS BELOW HERE HAVE ONLY FAILED IN THE LAST WEEK!” in this e-mail. 
> I’ll raise a JIRA if we can’t get some traction quickly here.
> 
> Hey, stuff happens. There’s no problem with tests going totally weird for a
> while. If you can say “Oh, yeah, all those failures for class XYZ are 
> probably fixed” that’s fine.
> 
> Gosh-a-rooni, I hope my logging changes aren’t the culprit (gulp)….
> 
> Hoss’ rollup for the last 24 hours is not encouraging in terms of the
> problem already being fixed. There are lots of failures in some
> classes, notably:
> 
> CloudHttp2SolrClientTest
> CollectionsAPIDistributedZkTest
> DeleteReplicaTest
> TestDocCollectionWatcher
> 
> Unfortunately, the failure rate is not very high so reliably 
> reproducing is hard.
> 
> I’ve reproduced the last week’s failure in this e-mail, full 
> report attached. 
> 
> Here’s Hoss’ rollup:
> http://fucit.org/solr-jenkins-reports/failure-report.html
> 
> Usual synopsis:
> 
> Raw fail count by week totals, most recent week first (corresponds to bits):
> Week: 0  had  343 failures
> Week: 1  had  86 failures
> Week: 2  had  78 failures
> Week: 3  had  117 failures
> 
> 
> Failures in Hoss' reports for the last 4 rollups.
> 
> There were 497 unannotated tests that failed in Hoss' rollups. Ordered by the 
> date I downloaded the rollup file, newest->oldest. See above for the dates 
> the files were collected 
> These tests were NOT BadApple'd or AwaitsFix’d
> 
> Failures in the last 4 reports..
>Report   Pct runsfails   test
>  0123   0.7 1617 11  
> ConnectionManagerTest.testReconnectWhenZkDisappeared
>  0123   1.5 1606 12  ExecutePlanActionTest.testTaskTimeout
>  0123   1.6 1320 19  MultiThreadedOCPTest.test
>  0123   1.0 1620 13  RollingRestartTest.test
>  0123   1.2 1617 12  SearchRateTriggerTest.testWaitForElapsed
>  0123   3.8  119  7  ShardSplitTest.testSplitWithChaosMonkey
>  0123   0.3 1519  7  TestInPlaceUpdatesDistrib.test
>  0123   0.7 1629 14  
> TestIndexWriterDelete.testDeleteAllNoDeadLock
>  0123   2.4 1548 18  TestPackages.testPluginLoading
>  0123   0.3 1587  4  UnloadDistributedZkTest.test
> 
> 
> FAILURES IN THE LAST WEEK (343!)
> Look particularly at the ones with only a zero in 

Re: PLEASE READ! BadApple report. Last week was horrible!

2020-05-04 Thread Michael McCandless
Hi Erick,

It's possible this was the root cause of many of the failures:
https://issues.apache.org/jira/browse/LUCENE-9191

Do these transient failures look something like this?

   [junit4]> Throwable #1:
java.nio.charset.MalformedInputException: Input length = 1
   [junit4]>at
__randomizedtesting.SeedInfo.seed([172C6414BE5E2A2C:E5829DFC005A1F0]:0)
   [junit4]>at
java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
   [junit4]>at
java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
   [junit4]>at
java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
   [junit4]>at
java.base/java.io.InputStreamReader.read(InputStreamReader.java:185)
   [junit4]>at
java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
   [junit4]>at
java.base/java.io.BufferedReader.readLine(BufferedReader.java:326)
   [junit4]>at
java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
   [junit4]>at
org.apache.lucene.util.LineFileDocs.open(LineFileDocs.java:175)
   [junit4]>at
org.apache.lucene.util.LineFileDocs.<init>(LineFileDocs.java:65)
   [junit4]>at
org.apache.lucene.util.LineFileDocs.<init>(LineFileDocs.java:69)


If so, then it is likely the root cause ... I'm working on a fix.  Sorry!
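
For anyone who wants to see the symptom in isolation: the exception in the trace is
what a strict UTF-8 decoder (CodingErrorAction.REPORT) throws when decoding starts in
the middle of a multi-byte sequence, for example after starting to read at an arbitrary
byte offset in the line file. A minimal standalone sketch of just that JDK behavior
(illustrative only, not LineFileDocs itself):

    import java.io.BufferedReader;
    import java.io.ByteArrayInputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class MalformedStartDemo {
      public static void main(String[] args) throws Exception {
        byte[] utf8 = "hällo wikipedia line\n".getBytes(StandardCharsets.UTF_8);
        // Skip the first two bytes so decoding starts on the trailing byte of
        // the two-byte sequence for 'ä', a malformed start for the decoder.
        ByteArrayInputStream in = new ByteArrayInputStream(utf8, 2, utf8.length - 2);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in,
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)));
        // Throws java.nio.charset.MalformedInputException: Input length = 1
        System.out.println(reader.readLine());
      }
    }

An InputStreamReader built from just a Charset would instead replace the bad bytes
with U+FFFD rather than throw, which suggests the reader here is opened with strict
decoding.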

Mike McCandless

http://blog.mikemccandless.com


On Mon, May 4, 2020 at 7:54 AM Erick Erickson 
wrote:

> I don’t know whether we had some temporary glitch that broke lots of tests
> and they’ve been fixed or we had a major regression, but this needs to be
> addressed ASAP if they’re still failing. See everything below the line "ALL
> OF THE TESTS BELOW HERE HAVE ONLY FAILED IN THE LAST WEEK!” in this e-mail.
> I’ll raise a JIRA if we can’t get some traction quickly here.
>
> Hey, stuff happens. There’s no problem with tests going totally weird for
> a while. If you can say “Oh, yeah, all those failures for class XYZ are
> probably fixed” that’s fine.
>
> Gosh-a-rooni, I hope my logging changes aren’t the culprit (gulp)….
>
> Hoss’ rollup for the last 24 hours is not encouraging in terms of the
> problem already being fixed. There are lots of failures in some
> classes, notably:
>
> CloudHttp2SolrClientTest
> CollectionsAPIDistributedZkTest
> DeleteReplicaTest
> TestDocCollectionWatcher
>
> Unfortunately, the failure rate is not very high so reliably
> reproducing is hard.
>
> I’ve reproduced the last week’s failure in this e-mail, full
> report attached.
>
> Here’s Hoss’ rollup:
> http://fucit.org/solr-jenkins-reports/failure-report.html
>
> Usual synopsis:
>
> Raw fail count by week totals, most recent week first (corresponds to
> bits):
> Week: 0  had  343 failures
> Week: 1  had  86 failures
> Week: 2  had  78 failures
> Week: 3  had  117 failures
>
>
> Failures in Hoss' reports for the last 4 rollups.
>
> There were 497 unannotated tests that failed in Hoss' rollups. Ordered by
> the date I downloaded the rollup file, newest->oldest. See above for the
> dates the files were collected
> These tests were NOT BadApple'd or AwaitsFix’d
>
> Failures in the last 4 reports..
>Report   Pct runsfails   test
>  0123   0.7 1617 11
> ConnectionManagerTest.testReconnectWhenZkDisappeared
>  0123   1.5 1606 12  ExecutePlanActionTest.testTaskTimeout
>  0123   1.6 1320 19  MultiThreadedOCPTest.test
>  0123   1.0 1620 13  RollingRestartTest.test
>  0123   1.2 1617 12
> SearchRateTriggerTest.testWaitForElapsed
>  0123   3.8  119  7
> ShardSplitTest.testSplitWithChaosMonkey
>  0123   0.3 1519  7  TestInPlaceUpdatesDistrib.test
>  0123   0.7 1629 14
> TestIndexWriterDelete.testDeleteAllNoDeadLock
>  0123   2.4 1548 18  TestPackages.testPluginLoading
>  0123   0.3 1587  4  UnloadDistributedZkTest.test
> 
>
> FAILURES IN THE LAST WEEK (343!)
> Look particularly at the ones with only a zero in the “Report” column;
> those are failures that were _not_ in the previous 3 weeks’ rollups.
>
>Report   Pct runsfails   test
>  0120.5 1165  4  CustomHighlightComponentTest.test
>  0121.0 1168  6
> NodeMarkersRegistrationTest.testNodeMarkersRegistration
>  0121.0 1170  8  TestCryptoKeys.test
>  01 3   0.7 1233 11  LeaderFailoverAfterPartitionTest.test
>  01 3  63.2  102 39  StressHdfsTest.test
>  01 0.3  709  2
> ScheduledTriggerIntegrationTest.testScheduledTrigger
>  01 0.2  768  2  ShardRoutingTest.test
>  01 2.6  807 22  TestAllFilesHaveChecksumFooter.test
>  01 2.6  808 22  TestAllFilesHaveCodecHeader.test
>  01 0.2  769  2  TestCloudSchemaless.test
>  01 0.2