[jira] [Commented] (SOLR-7215) non reproducible Suite failures due to excessive sysout due to HDFS lease renewal WARN logs due to connection refused -- even if test doesn't use HDFS (ie: threads leaki

2019-02-02 Thread Kevin Risden (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759194#comment-16759194
 ] 

Kevin Risden commented on SOLR-7215:


Not sure what the status is here. I would guess this is one of the following:

a) not an issue any more (lots has changed with HDFS thread cleanup)
b) fixed later due to HDFS thread cleanup
c) still an issue, but it isn't clear that this has happened recently.

Planning to resolve this since I haven't seen this and the last comment was 3+ 
years ago.

SOLR-9515 with the Hadoop 3 upgrade was recent, so I am trying to clean up old 
HDFS-related JIRAs if it isn't clear they still happen.

> non reproducible Suite failures due to excessive sysout due to HDFS lease 
> renewal WARN logs due to connection refused -- even if test doesn't use HDFS 
> (ie: threads leaking between tests)
> --
>
> Key: SOLR-7215
> URL: https://issues.apache.org/jira/browse/SOLR-7215
> Project: Solr
> Issue Type: Bug
> Reporter: Hoss Man
> Priority: Major
> Attachments: tests-report.txt_suite-failure-due-to-sysout.txt.zip
>
>
> On my local machine, I've lately noticed a lot of sporadic, non-reproducible 
> failures like these...
> {noformat}
>   2> NOTE: reproduce with: ant test  -Dtestcase=ScriptEngineTest -Dtests.seed=E254A7E69EC7212A -Dtests.slow=true -Dtests.locale=sv -Dtests.timezone=SystemV/CST6 -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> [14:34:23.749] ERROR   0.00s J1 | ScriptEngineTest (suite) <<<
>    > Throwable #1: java.lang.AssertionError: The test or suite printed 10984 bytes to stdout and stderr, even though the limit was set to 8192 bytes. Increase the limit with @Limit, ignore it completely with @SuppressSysoutChecks or run with -Dtests.verbose=true
>    >  at __randomizedtesting.SeedInfo.seed([E254A7E69EC7212A]:0)
>    >  at org.apache.lucene.util.TestRuleLimitSysouts.afterIfSuccessful(TestRuleLimitSysouts.java:212)
> {noformat}
> Invariably, looking at the logs of tests that fail for this reason, I see 
> multiple instances of these WARN msgs...
> {noformat}
>   2> 601361 T3064 oahh.LeaseRenewer.run WARN Failed to renew lease for [DFSClient_NONMAPREDUCE_-253604438_2947] for 92 seconds.  Will retry shortly ... java.net.ConnectException: Call From frisbee/127.0.1.1 to localhost:40618 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
>   2>  at sun.reflect.GeneratedConstructorAccessor268.newInstance(Unknown Source)
>   2>  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ...
> {noformat}
> ...the full stack traces of these exceptions are typically 36 lines long 
> (not counting the suppressed "... 17 more" at the end).
> Doing some basic crunching of the "tests-report.txt" file from a recent run 
> of all "solr-core" tests (that caused the above failure) leads to some pretty 
> damn disconcerting numbers...
> {noformat}
> hossman@frisbee:~/tmp$ wc -l tests-report.txt_suite-failure-due-to-sysout.txt
> 1049177 tests-report.txt_suite-failure-due-to-sysout.txt
> hossman@frisbee:~/tmp$ grep "Suite: org.apache.solr" tests-report.txt_suite-failure-due-to-sysout.txt | wc -l
> 465
> hossman@frisbee:~/tmp$ grep "LeaseRenewer.run WARN Failed to renew lease" tests-report.txt_suite-failure-due-to-sysout.txt | grep http://wiki.apache.org/hadoop/ConnectionRefused | wc -l
> 1988
> hossman@frisbee:~/tmp$ calc
> 1988 * 36
> 71568
> {noformat}
> So running 465 Solr test suites, we got ~2 thousand of these "Failed to renew 
> lease" WARNings.  Of the ~1 million total lines of log messages from all 
> tests, ~70 thousand (~7%) are coming from these WARNing messages -- which can 
> evidently be safely ignored?
> Something seems broken here.
> Someone who understands this area of the code should either:
> * investigate & fix the code/test not to have these lease renewal problems
> * tweak our test logging configs to suppress these WARN messages
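
The estimate in the description can be re-derived with a few lines of Java. This is only a sketch of the arithmetic: the counts (1988 warnings, 36 lines per stack trace, 1049177 total log lines) are taken from the shell session quoted above, while the class and method names are hypothetical and not part of Solr.

```java
// Hypothetical helper, not part of Solr: re-derives the numbers quoted in the issue.
class LeaseWarnShare {

    /** Estimated log lines produced by a warning, given its count and lines per stack trace. */
    static long warnLines(long occurrences, long linesPerTrace) {
        return occurrences * linesPerTrace;
    }

    /** Share of the whole log, as a percentage. */
    static double percentOfLog(long warnLines, long totalLines) {
        return 100.0 * warnLines / totalLines;
    }

    public static void main(String[] args) {
        long lines = warnLines(1988, 36);                 // 1988 WARNs * 36 lines each
        double share = percentOfLog(lines, 1_049_177L);   // against the `wc -l` total
        System.out.println(lines + " lines, share of log in percent: " + share);
    }
}
```

The product is 71568 lines, which is roughly 6.8% of the report and matches the "~70 thousand (~7%)" figure in the description.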



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (SOLR-7215) non reproducible Suite failures due to excessive sysout due to HDFS lease renewal WARN logs due to connection refused -- even if test doesn't use HDFS (ie: threads leaki

2015-03-12 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359227#comment-14359227
 ] 

Dawid Weiss commented on SOLR-7215:
---

Uncomment the ThreadLeakFilters, Hoss. Nothing should get through. 
SolrIgnoredThreadsFilter has way too many exclusions -- these have to be shut 
down and cleaned properly, not ignored (leading to errors like this one):
{code}
/*
 * IMPORTANT! IMPORTANT!
 * 
 * Any threads added here should have ABSOLUTELY NO SIDE EFFECTS
 * (should be stateless). This includes no references to cores or other
 * test-dependent information.
 */

String threadName = t.getName();
if (threadName.equals(TimerThread.THREAD_NAME)) {
  return true;
}

if (threadName.startsWith("facetExecutor-") ||
    threadName.startsWith("cmdDistribExecutor-") ||
    threadName.startsWith("httpShardExecutor-")) {
  return true;
}

// This is a bug in ZooKeeper where they call System.exit(11) when
// this thread receives an interrupt signal.
if (threadName.startsWith("SyncThread")) {
  return true;
}

// THESE ARE LIKELY BUGS - these threads should be closed!
if (threadName.startsWith("Overseer-") ||
    threadName.startsWith("aliveCheckExecutor-") ||
    threadName.startsWith("concurrentUpdateScheduler-")) {
  return true;
}

return false;
{code}
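
To illustrate why an ignored thread is still a leak, here is a minimal, self-contained sketch of snapshot-based leak detection. It is not the randomizedtesting implementation and does not use its ThreadFilter API; it only uses the JDK's Thread.getAllStackTraces(). The class name, method names, and the "hypothetical-lease-renewer" thread are assumptions made up for the example.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch, not Solr or randomizedtesting code.
class ThreadLeakSketch {

    /** Snapshot of the threads that are currently live in this JVM. */
    static Set<Thread> liveThreads() {
        return new HashSet<>(Thread.getAllStackTraces().keySet());
    }

    /** Threads that are alive now but were absent from the earlier snapshot. */
    static Set<Thread> leakedSince(Set<Thread> before) {
        Set<Thread> after = liveThreads();
        after.removeAll(before);
        after.removeIf(t -> !t.isAlive());
        return after;
    }

    /** Returns {leaked while running, leaked after proper shutdown}. */
    static boolean[] demo() {
        Set<Thread> before = liveThreads();
        CountDownLatch stop = new CountDownLatch(1);

        // Stand-in for a background thread (such as a lease renewer) started by one test.
        Thread renewer = new Thread(() -> {
            try {
                stop.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "hypothetical-lease-renewer");
        renewer.setDaemon(true);
        renewer.start();

        // Without cleanup the thread outlives the "test" and would keep logging.
        boolean leakedWhileRunning = leakedSince(before).contains(renewer);

        // Proper cleanup: signal the thread to stop and wait for it to finish.
        stop.countDown();
        try {
            renewer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        boolean leakedAfterShutdown = leakedSince(before).contains(renewer);

        return new boolean[] {leakedWhileRunning, leakedAfterShutdown};
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("leaked while running: " + result[0]);
        System.out.println("leaked after shutdown: " + result[1]);
    }
}
```

A name-based exclusion like the ones in the filter above would hide the first condition without changing it: the thread would keep running, and keep writing to the log, during whichever suite happens to run next.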
