Re: PLEASE READ! BadApple report. Last week was horrible!
Phew! Thanks for digging, Erick, and for producing these BadApple reports.

Mike McCandless
http://blog.mikemccandless.com
Re: PLEASE READ! BadApple report. Last week was horrible!
OK, this morning things are back to normal. I think the disk space issue was to blame, because a check after Mike’s fix alone didn’t look like it had cured the problem.

Thanks all!
Re: PLEASE READ! BadApple report. Last week was horrible!
OK, thanks Chris. The 24-hour rollup still shows many failures in several classes; I’ll check tomorrow to see whether that’s a consequence of the disk-full problem.
Re: PLEASE READ! BadApple report. Last week was horrible!
: And FWIW, I beasted one of the failing suites last night _without_
: Mike’s changes and didn’t get any failures so I can’t say anything about
: whether Mike’s changes helped or not.

IIUC McCandless's failure only affects you if you use the "jenkins" test
data file (the really big wikipedia dump) ... see the jira he mentioned
for details.

-Hoss
http://www.lucidworks.com/
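A bit of context on "the jenkins test data file": in the Lucene test framework, LineFileDocs picks its input from the tests.linedocsfile system property (surfaced as LuceneTestCase.TEST_LINE_DOCS_FILE); the Jenkins nightly jobs point it at the huge enwiki dump, while ordinary runs fall back to the small file bundled with the framework. A sketch of what that looks like from a test, written from memory of the API, with a placeholder path:

    import java.util.Random;
    import org.apache.lucene.util.LineFileDocs;

    // Which file backs LineFileDocs is decided per run via a system property:
    //   -Dtests.linedocsfile=/data/enwiki.random.lines.txt   <- jenkins-style run
    // (that path is a placeholder). Without the property, tests fall back to the
    // small line-docs file bundled with the test framework and never touch the
    // big dump, which is why most local runs never saw the failure.
    public class LineDocsSketch {
      public static void main(String[] args) throws Exception {
        String path = System.getProperty("tests.linedocsfile");
        try (LineFileDocs docs = (path == null)
                 ? new LineFileDocs(new Random())          // framework default file
                 : new LineFileDocs(new Random(), path)) { // the big wikipedia dump
          System.out.println(docs.nextDoc());              // one document per line
        }
      }
    }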
Re: PLEASE READ! BadApple report. Last week was horrible!
2119758844836987 UNLOAD) [n:127.0.0.1:49613_solr x:replicaTypesTestColl_shard1_replica_p4 ] o.a.s.m.r.SolrJmxReporter Closing reporter [org.apache.solr.metrics.reporters.SolrJmxReporter@1f2a6e95: rootName = solr_49613, domain = solr.core.replicaTypesTestColl.shard1.replica_p4, service url = null, agent id = null] for registry solr.core.replicaTypesTestColl.shard1.replica_p4/com.codahale.metrics.MetricRegistry@2edb03e2
   [junit4]   2> 33770 ERROR (indexFetcher-621-thread-1) [n:127.0.0.1:49612_solr ] o.a.s.h.ReplicationHandler Index fetch failed :java.lang.NullPointerException
   [junit4]   2>    at org.apache.solr.handler.IndexFetcher.getLeaderReplica(IndexFetcher.java:709)
   [junit4]   2>    at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:387)
   [junit4]   2>    at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:351)
   [junit4]   2>    at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:422)
   [junit4]   2>    at org.apache.solr.handler.ReplicationHandler.lambda$setupPolling$13(ReplicationHandler.java:1208)
   [junit4]   2>    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
   [junit4]   2>    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
   [junit4]   2>    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
   [junit4]   2>    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
   [junit4]   2>    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   [junit4]   2>    at java.base/java.lang.Thread.run(Thread.java:834)
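The stack trace above reads like a poll-versus-teardown race: the scheduled replication poll fires while the UNLOAD is tearing the replica down, the leader lookup comes back empty, and an unguarded dereference in getLeaderReplica throws. That is an inference from the log, not a confirmed diagnosis. Here is a hypothetical, self-contained illustration of the shape of such a race; the names are invented, not Solr's code:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch (not Solr code): a scheduled replication poll races
    // a core UNLOAD. Once the unload removes the entry, the leader lookup
    // returns null and the unguarded dereference throws NPE, which the
    // handler logs as "Index fetch failed :java.lang.NullPointerException".
    public class PollVsUnload {
      static final Map<String, String> leaders =
          new ConcurrentHashMap<>(Map.of("shard1", "http://127.0.0.1:49613/solr"));

      public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService poller = Executors.newSingleThreadScheduledExecutor();
        poller.scheduleAtFixedRate(() -> {
          try {
            String leaderUrl = leaders.get("shard1"); // null after the unload
            System.out.println("fetching index from " + leaderUrl.trim());
          } catch (Exception e) {
            System.out.println("Index fetch failed :" + e); // mirrors the log line
          }
        }, 0, 10, TimeUnit.MILLISECONDS);

        Thread.sleep(25);
        leaders.remove("shard1"); // the UNLOAD wins the race
        Thread.sleep(50);
        poller.shutdownNow();
      }
    }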
RE: PLEASE READ! BadApple report. Last week was horrible!
Hi,

there was also a problem with the Windows node. It ran out of disk space, because some test seems to have filled up all of the disk. All follow-up builds failed. I cleaned all workspaces (8.x, master) and that freed 20 gigabytes!

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de
Re: PLEASE READ! BadApple report. Last week was horrible!
Mike:

I saw the push. Hoss’ rollups cover “the last 24 hours”, so it’ll be Tuesday evening before things have had a chance to work their way through; I’ll look tomorrow.

Meanwhile I’m beasting one of the failing test suites (without the change): 280 iterations so far and no failures. That said, the failure rate was < 1%, so it’s not conclusive. Only another 720 runs to go before I pull the latest changes and try again… ;)
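A note on why 280 clean runs is “not conclusive”: if a test fails independently with probability p per run, the chance of a fully clean beast of n runs is (1-p)^n. At p = 1% that is about 6% for n = 280, so a clean 280 proves little, while a clean 1,000 pushes it down to roughly 0.004%, which is presumably why Erick is going the full thousand (likely via the build’s beast target, something like ant beast -Dbeast.iterations=1000 -Dtests.class="*.SomeFailingTest"; the class name there is a placeholder). A quick sketch of the arithmetic, with p assumed rather than measured:

    // Back-of-the-envelope check on beasting:
    // P(no failures in n independent runs) = (1 - p)^n.
    // p is the assumed per-run failure rate from the thread ("< 1%").
    public class BeastOdds {
      public static void main(String[] args) {
        double p = 0.01;
        for (int n : new int[] {280, 1000}) {
          System.out.printf("n=%4d  P(clean beast despite bug) = %.5f%n",
              n, Math.pow(1 - p, n));
        }
        // Prints approximately:
        // n= 280  P(clean beast despite bug) = 0.05996
        // n=1000  P(clean beast despite bug) = 0.00004
      }
    }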
Re: PLEASE READ! BadApple report. Last week was horrible!
Hi Erick,

OK, I pushed a fix! See if it decreases the failure rate for those newly bad apples?

Sorry and thanks :)

Mike McCandless
http://blog.mikemccandless.com
Re: PLEASE READ! BadApple report. Last week was horrible!
Mike:

I have no idea. Hoss’ rollups don’t link back to builds; they just aggregate the results.

Not a huge deal if it’s something like this, of course. Let’s just say I’ve had my share of “moments” ;).

And unfortunately, the test failures are pretty rare on a percentage basis, so it’s hard to tell.

I’m watching LUCENE-9191 and I’ll look back at Hoss’ rollups a day after you push it and see if the failures disappear. It’ll take a while for the fixes to roll through all the reporting.

Tell you what: I’ll try beasting one of the classes that fails a lot, then try it again after you push LUCENE-9191, and we’ll go from there.

Thanks for getting into this so promptly!

Erick
Re: PLEASE READ! BadApple report. Last week was horrible!
Hi Erick,

It's possible this was the root cause of many of the failures: https://issues.apache.org/jira/browse/LUCENE-9191

Do these transient failures look something like this?

   [junit4]    > Throwable #1: java.nio.charset.MalformedInputException: Input length = 1
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([172C6414BE5E2A2C:E5829DFC005A1F0]:0)
   [junit4]    >    at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
   [junit4]    >    at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
   [junit4]    >    at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
   [junit4]    >    at java.base/java.io.InputStreamReader.read(InputStreamReader.java:185)
   [junit4]    >    at java.base/java.io.BufferedReader.fill(BufferedReader.java:161)
   [junit4]    >    at java.base/java.io.BufferedReader.readLine(BufferedReader.java:326)
   [junit4]    >    at java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
   [junit4]    >    at org.apache.lucene.util.LineFileDocs.open(LineFileDocs.java:175)
   [junit4]    >    at org.apache.lucene.util.LineFileDocs.<init>(LineFileDocs.java:65)
   [junit4]    >    at org.apache.lucene.util.LineFileDocs.<init>(LineFileDocs.java:69)

If so, then it is likely the root cause ... I'm working on a fix. Sorry!

Mike McCandless
http://blog.mikemccandless.com
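For anyone who hasn't seen that exception class before: a strict CharsetDecoder (CodingErrorAction.REPORT, which StandardCharsets.UTF_8.newDecoder() gives you by default) throws MalformedInputException the moment decoding starts in the middle of a multi-byte UTF-8 sequence. As far as I understand LUCENE-9191, that is what happens when LineFileDocs seeks to a random byte offset in the big line file and lands inside a character. A minimal standalone reproduction of the decoder behavior, not Lucene's actual code:

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Start decoding one byte into a two-byte UTF-8 character and readLine()
    // fails just like the trace above: MalformedInputException: Input length = 1.
    public class MidCharSeek {
      public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("linedocs", ".txt");
        Files.write(p, "ééééé doc text\n".getBytes(StandardCharsets.UTF_8)); // 'é' = 0xC3 0xA9

        try (InputStream in = Files.newInputStream(p)) {
          in.read(); // consume one byte: now positioned inside the first 'é'
          BufferedReader reader = new BufferedReader(new InputStreamReader(
              in, StandardCharsets.UTF_8.newDecoder())); // strict: REPORT, not REPLACE
          reader.readLine(); // throws java.nio.charset.MalformedInputException
        }
      }
    }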
PLEASE READ! BadApple report. Last week was horrible!
I don’t know whether we had some temporary glitch that broke lots of tests and they’ve since been fixed, or we had a major regression, but this needs to be addressed ASAP if they’re still failing. See everything below the line "ALL OF THE TESTS BELOW HERE HAVE ONLY FAILED IN THE LAST WEEK!" in this e-mail. I’ll raise a JIRA if we can’t get some traction quickly here.

Hey, stuff happens. There’s no problem with tests going totally weird for a while. If you can say “Oh, yeah, all those failures for class XYZ are probably fixed”, that’s fine.

Gosh-a-rooni, I hope my logging changes aren’t the culprit (gulp)….

Hoss’ rollup for the last 24 hours is not encouraging in terms of the problem already being fixed. There are lots of failures in some classes, notably:

CloudHttp2SolrClientTest
CollectionsAPIDistributedZkTest
DeleteReplicaTest
TestDocCollectionWatcher

Unfortunately, the failure rate is not very high, so reliably reproducing is hard.

I’ve reproduced the last week’s failures in this e-mail; the full report is attached.

Here’s Hoss’ rollup: http://fucit.org/solr-jenkins-reports/failure-report.html

Usual synopsis:

Raw fail count by week totals, most recent week first (corresponds to bits):
Week: 0 had 343 failures
Week: 1 had  86 failures
Week: 2 had  78 failures
Week: 3 had 117 failures

Failures in Hoss' reports for the last 4 rollups.

There were 497 unannotated tests that failed in Hoss' rollups, ordered by the date I downloaded the rollup file, newest->oldest. See above for the dates the files were collected. These tests were NOT BadApple'd or AwaitsFix’d.

Failures in the last 4 reports:

Report   Pct  runs  fails  test
  0123   0.7  1617     11  ConnectionManagerTest.testReconnectWhenZkDisappeared
  0123   1.5  1606     12  ExecutePlanActionTest.testTaskTimeout
  0123   1.6  1320     19  MultiThreadedOCPTest.test
  0123   1.0  1620     13  RollingRestartTest.test
  0123   1.2  1617     12  SearchRateTriggerTest.testWaitForElapsed
  0123   3.8   119      7  ShardSplitTest.testSplitWithChaosMonkey
  0123   0.3  1519      7  TestInPlaceUpdatesDistrib.test
  0123   0.7  1629     14  TestIndexWriterDelete.testDeleteAllNoDeadLock
  0123   2.4  1548     18  TestPackages.testPluginLoading
  0123   0.3  1587      4  UnloadDistributedZkTest.test

FAILURES IN THE LAST WEEK (343!)
Look particularly at the ones with only a zero in the “Report” column; those are failures that were _not_ in the previous 3 weeks’ rollups.
Report   Pct  runs  fails  test
  012    0.5  1165      4  CustomHighlightComponentTest.test
  012    1.0  1168      6  NodeMarkersRegistrationTest.testNodeMarkersRegistration
  012    1.0  1170      8  TestCryptoKeys.test
  01 3   0.7  1233     11  LeaderFailoverAfterPartitionTest.test
  01 3  63.2   102     39  StressHdfsTest.test
  01     0.3   709      2  ScheduledTriggerIntegrationTest.testScheduledTrigger
  01     0.2   768      2  ShardRoutingTest.test
  01     2.6   807     22  TestAllFilesHaveChecksumFooter.test
  01     2.6   808     22  TestAllFilesHaveCodecHeader.test
  01     0.2   769      2  TestCloudSchemaless.test
  01     0.2   769      2  TestDynamicLoading.testDynamicLoading
  01     0.3   707      2  TestDynamicLoadingUrl.testDynamicLoadingUrl
  01     0.5   767      4  TestPointFields.testFloatPointStats
  012    7.1    83     19  TestSQLHandler.doTest
  01     0.2   794     12  TestSameScoresWithThreads.test
  01     2.6   806     22  TestShardSearching.testSimple
  01     0.5   726      4  TestSimScenario.testSplitShard
  01     1.1   726      7  TestSimScenario.testSuggestions
  01     0.3   771      2  TestWithCollection.testAddReplicaSimple
  0 23   0.3  1223      4  CdcrVersionReplicationTest.testCdcrDocVersions
  0 23   0.8  1172      6  CloudHttp2SolrClientTest.testRetryUpdatesWhenClusterStateIsStale
  0 23   1.4  1202      8  CollectionsAPISolrJTest.testColStatus
  0 23   1.0  1249     11  HttpPartitionTest.test
  0 23   1.1  1210      8  HttpPartitionWithTlogReplicasTest.test
  0 23   0.5  1258      4  ShardSplitTest.testSplitShardWithRuleLink
  0 23   0.2  1231      4  TestQueryingOnDownCollection.testQueryToDownCollectionShouldFailFast
  0 23   0.2  1232      6  TestSolrConfigHandlerCloud.test
  0 2    0.3   767      2  DocValuesNotIndexedTest.testGroupingDVOnlySortLast
  0 2    0.3   750      2  TestLBHttp2SolrClient.testTwoServers
  0 2    0.3   794      2  TestSolrCloudSnapshots.testSnapshots
  0 2   40.7    51     12  TestXYMultiPolygonShapeQueries.testRa
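Since the “Report” column is terse, one note on reading these tables: each digit names a rollup the test failed in, 0 being the newest, so “0123” means failed in all four weekly rollups and a bare “0” is a brand-new failure this week, the rows Erick is flagging. A toy filter showing that reading; the second row is invented for illustration:

    import java.util.List;

    // The "Report" column: digit k present = failed in rollup k (0 = newest).
    // Rows whose report is exactly "0" only started failing this week.
    public class NewFailures {
      public static void main(String[] args) {
        List<String> rows = List.of(
            "0123  0.7  1617  11  ConnectionManagerTest.testReconnectWhenZkDisappeared",
            "0     2.2   807  18  HypotheticalNewTest.test"); // invented example row
        for (String row : rows) {
          if (row.split("\\s+")[0].equals("0")) {
            System.out.println("new this week: " + row);
          }
        }
      }
    }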