[ 
https://issues.apache.org/jira/browse/SOLR-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359123#comment-14359123
 ] 

Hoss Man commented on SOLR-7215:
--------------------------------

This is more jacked up then i thought -- i just got one of the Suite failures 
from "TestDocSet" which directly extends LuceneTestCase and doesn't do _ANY_ 
initialization of any Solr specific functionality (no CoreContainers, no 
SolrCores, no ZooKeeper)

which means not only are these HDFS Client "ConnectExceptions" causing test 
failures due to too much logging -- these threads appear to be leaking from the 
test suites and affecting other tests run in the same JVM *EVEN WHEN WHATEVER 
TEST CREATED THESE THREADS PASSES* ... The _only_ failure i got was from 
TestDocSet. and yet it failed because of excessive logging from a thread 
created by some other test that had already passed.

{noformat}
hossman@frisbee:~/lucene/dev/solr$ ant test
...
   [junit4] Suite: org.apache.solr.search.TestDocSet
   [junit4]   2> 1460665 T5379 oahh.LeaseRenewer.run WARN Failed to renew lease 
for [DFSClient_NONMAPREDUCE_1277984620_5262] for 402 seconds.  Will retry 
shortly ... java.net.ConnectException: Call From frisbee/127.0.1.1 to 
localhost:47570 failed on connection exception: java.net.ConnectException: 
Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused
   [junit4]   2>        at 
sun.reflect.GeneratedConstructorAccessor303.newInstance(Unknown Source)
   [junit4]   2>        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   [junit4]   2>        at 
java.lang.reflect.Constructor.newInstance(Constructor.java:408)
   [junit4]   2>        at 
org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
   [junit4]   2>        at 
org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
   [junit4]   2>        at org.apache.hadoop.ipc.Client.call(Client.java:1410)
   [junit4]   2>        at org.apache.hadoop.ipc.Client.call(Client.java:1359)
   [junit4]   2>        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
   [junit4]   2>        at com.sun.proxy.$Proxy43.renewLease(Unknown Source)
   [junit4]   2>        at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown 
Source)
   [junit4]   2>        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]   2>        at java.lang.reflect.Method.invoke(Method.java:483)
   [junit4]   2>        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
   [junit4]   2>        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
   [junit4]   2>        at com.sun.proxy.$Proxy43.renewLease(Unknown Source)
   [junit4]   2>        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:519)
   [junit4]   2>        at 
org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:773)
   [junit4]   2>        at 
org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
   [junit4]   2>        at 
org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
   [junit4]   2>        at 
org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
   [junit4]   2>        at 
org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
   [junit4]   2>        at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> Caused by: java.net.ConnectException: Connection refused
   [junit4]   2>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native 
Method)
   [junit4]   2>        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
   [junit4]   2>        at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
   [junit4]   2>        at 
org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
   [junit4]   2>        at 
org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
   [junit4]   2>        at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:601)
   [junit4]   2>        at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:696)
   [junit4]   2>        at 
org.apache.hadoop.ipc.Client$Connection.access$2700(Client.java:367)
   [junit4]   2>        at 
org.apache.hadoop.ipc.Client.getConnection(Client.java:1458)
   [junit4]   2>        at org.apache.hadoop.ipc.Client.call(Client.java:1377)
   [junit4]   2>        ... 16 more
   [junit4]   2> 
   [junit4]   2> 1460924 T8206 oahh.LeaseRenewer.run WARN Failed to renew lease 
for [DFSClient_NONMAPREDUCE_602751345_8088] for 91 seconds.  Will retry shortly 
... java.net.ConnectException: Call From frisbee/127.0.1.1 to localhost:47687 
failed on connection exception: java.net.ConnectException: Connection refused; 
For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
   [junit4]   2>        at 
sun.reflect.GeneratedConstructorAccessor303.newInstance(Unknown Source)
   [junit4]   2>        at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   [junit4]   2>        at 
java.lang.reflect.Constructor.newInstance(Constructor.java:408)
   [junit4]   2>        at 
org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
   [junit4]   2>        at 
org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
   [junit4]   2>        at org.apache.hadoop.ipc.Client.call(Client.java:1410)
   [junit4]   2>        at org.apache.hadoop.ipc.Client.call(Client.java:1359)
   [junit4]   2>        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
   [junit4]   2>        at com.sun.proxy.$Proxy43.renewLease(Unknown Source)
   [junit4]   2>        at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown 
Source)
   [junit4]   2>        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]   2>        at java.lang.reflect.Method.invoke(Method.java:483)
   [junit4]   2>        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
   [junit4]   2>        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
   [junit4]   2>        at com.sun.proxy.$Proxy43.renewLease(Unknown Source)
   [junit4]   2>        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:519)
   [junit4]   2>        at 
org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:773)
   [junit4]   2>        at 
org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
   [junit4]   2>        at 
org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
   [junit4]   2>        at 
org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
   [junit4]   2>        at 
org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
   [junit4]   2>        at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> Caused by: java.net.ConnectException: Connection refused
   [junit4]   2>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native 
Method)
   [junit4]   2>        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
   [junit4]   2>        at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
   [junit4]   2>        at 
org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
   [junit4]   2>        at 
org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
   [junit4]   2>        at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:601)
   [junit4]   2>        at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:696)
   [junit4]   2>        at 
org.apache.hadoop.ipc.Client$Connection.access$2700(Client.java:367)
   [junit4]   2>        at 
org.apache.hadoop.ipc.Client.getConnection(Client.java:1458)
   [junit4]   2>        at org.apache.hadoop.ipc.Client.call(Client.java:1377)
   [junit4]   2>        ... 16 more
   [junit4]   2> 
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene50): {}, 
docValues:{}, sim=DefaultSimilarity, locale=es_CU, timezone=Asia/Dacca
   [junit4]   2> NOTE: Linux 3.2.0-76-generic amd64/Oracle Corporation 1.8.0_31 
(64-bit)/cpus=4,threads=1,free=75167392,total=524812288
   [junit4]   2> NOTE: All tests run in this JVM: [TestShortCircuitedRequests, 
TestPHPSerializedResponseWriter, TestBinaryResponseWriter, 
ReplicationFactorTest, TestMiniSolrCloudCluster, TestManagedResource, 
BufferStoreTest, DistributedMLTComponentTest, TestExceedMaxTermLength, 
TestSchemaVersionResource, TestSimpleQParserPlugin, DirectUpdateHandlerTest, 
TestWriterPerf, DirectUpdateHandlerOptimizeTest, TestRTGBase, 
DistanceFunctionTest, EnumFieldTest, TestTrackingShardHandlerFactory, 
OverseerRolesTest, TestUtils, AnalysisErrorHandlingTest, SolrPluginUtilsTest, 
ShardSplitTest, TestDynamicLoading, ClusterStateTest, SOLR749Test, 
VMParamsZkACLAndCredentialsProvidersTest, AnalyticsMergeStrategyTest, 
TestJmxMonitoredMap, TestFaceting, TestAddFieldRealTimeGet, 
TestSchemaNameResource, TestOrdValues, DistributedFacetPivotSmallAdvancedTest, 
CopyFieldTest, BasicDistributedZkTest, AssignTest, TestRandomMergePolicy, 
DisMaxRequestHandlerTest, TestRequestStatusCollectionAPI, 
RecoveryAfterSoftCommitTest, PreAnalyzedUpdateProcessorTest, 
TestLuceneMatchVersion, SynonymTokenizerTest, RankQueryTest, 
DocValuesMultiTest, TestClassNameShortening, SolrInfoMBeanTest, 
CloudMLTQParserTest, TestManagedStopFilterFactory, 
StatelessScriptUpdateProcessorFactoryTest, TestSolrXml, 
DistributedFacetPivotSmallTest, TestLFUCache, TestNonDefinedSimilarityFactory, 
BadIndexSchemaTest, TestFiltering, TestCopyFieldCollectionResource, 
AlternateDirectoryTest, TestHashPartitioner, RollingRestartTest, 
TestRebalanceLeaders, DocValuesTest, TestStressVersions, 
TestSchemaSimilarityResource, TestSolr4Spatial, CircularListTest, 
TestCollationField, HdfsBasicDistributedZk2Test, TestFieldCollectionResource, 
AutoCommitTest, SuggestComponentTest, DistanceUnitsTest, TestSearchPerf, 
StandardRequestHandlerTest, TestComplexPhraseQParserPlugin, 
TestDefaultStatsCache, TestDistributedSearch, MultiThreadedOCPTest, 
TestCloudSchemaless, UnloadDistributedZkTest, StressHdfsTest, 
TestSolrConfigHandlerCloud, CloudExitableDirectoryReaderTest, 
SaslZkACLProviderTest, TestCryptoKeys, TestCloudManagedSchema, 
TriLevelCompositeIdRoutingTest, SpellCheckComponentTest, 
DistributedIntervalFacetingTest, DistributedFacetPivotLargeTest, 
TestRandomDVFaceting, TestStressRecovery, ExitableDirectoryReaderTest, 
DistributedTermsComponentTest, TestExactSharedStatsCache, 
HdfsDirectoryFactoryTest, DistributedFacetPivotWhiteBoxTest, 
TestHighlightDedupGrouping, DistributedQueryComponentCustomSortTest, ZkCLITest, 
TestStressUserVersions, SolrCmdDistributorTest, TestBadConfig, 
DistributedQueryElevationComponentTest, SolrIndexSplitterTest, 
FacetPivotSmallTest, LukeRequestHandlerTest, TestAnalyzedSuggestions, 
AddSchemaFieldsUpdateProcessorFactoryTest, TestExtendedDismaxParser, 
TestDynamicFieldCollectionResource, TestCSVLoader, TestStressLucene, 
TestFreeTextSuggestions, DateRangeFieldTest, DocumentBuilderTest, 
ZkStateWriterTest, JsonLoaderTest, TestCollapseQParserPlugin, TestQueryTypes, 
TestCoreDiscovery, TermVectorComponentTest, CacheHeaderTest, TestOmitPositions, 
TestFieldTypeResource, TestDynamicFieldResource, TestExpandComponent, 
TestJmxIntegration, ReturnFieldsTest, FieldAnalysisRequestHandlerTest, 
InfoHandlerTest, SegmentsInfoRequestHandlerTest, RequiredFieldsTest, 
DateFieldTest, NotRequiredUniqueKeyTest, SchemaVersionSpecificBehaviorTest, 
DefaultValueUpdateProcessorTest, TestHashQParserPlugin, TestDocSet]
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestDocSet 
-Dtests.seed=31CA503F5F477364 -Dtests.slow=true -Dtests.locale=es_CU 
-Dtests.timezone=Asia/Dacca -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.00s J1 | TestDocSet (suite) <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: The test or suite 
printed 10988 bytes to stdout and stderr, even though the limit was set to 8192 
bytes. Increase the limit with @Limit, ignore it completely with 
@SuppressSysoutChecks or run with -Dtests.verbose=true
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([31CA503F5F477364]:0)
   [junit4]    >        at java.lang.Thread.run(Thread.java:745)
   [junit4] Completed on J1 in 0.67s, 2 tests, 1 failure <<< FAILURES!
...
   [junit4] 
   [junit4] Tests with failures:
   [junit4]   - org.apache.solr.search.TestDocSet (suite)
   [junit4] 
   [junit4] 
   [junit4] JVM J0:     1.03 ..  1470.41 =  1469.38s
   [junit4] JVM J1:     0.84 ..  1470.20 =  1469.36s
   [junit4] JVM J2:     1.02 ..  1470.36 =  1469.33s
   [junit4] Execution time total: 24 minutes 30 seconds
   [junit4] Tests summary: 473 suites, 1882 tests, 1 suite-level error, 31 
ignored (21 assumptions)
{noformat}


> non reproducible Suite failures due to excessive sysout due to HDFS lease 
> renewal WARN logs due to connection refused
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7215
>                 URL: https://issues.apache.org/jira/browse/SOLR-7215
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: tests-report.txt_suite-failure-due-to-sysout.txt.zip
>
>
> On my local machine, i've noticed lately a lot of sporadic, non reproducible, 
> failures like these...
> {noformat}
>   2> NOTE: reproduce with: ant test  -Dtestcase=ScriptEngineTest 
> -Dtests.seed=E254A7E69EC7212A -Dtests.slow=true -Dtests.locale=sv 
> -Dtests.timezone=SystemV/CST6 -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> [14:34:23.749] ERROR   0.00s J1 | ScriptEngineTest (suite) <<<
>    > Throwable #1: java.lang.AssertionError: The test or suite printed 10984 
> bytes to stdout and stderr, even though the limit was set to 8192 bytes. 
> Increase the limit with @Limit, ignore it completely with 
> @SuppressSysoutChecks or run with -Dtests.verbose=true
>    >  at __randomizedtesting.SeedInfo.seed([E254A7E69EC7212A]:0)
>    >  at 
> org.apache.lucene.util.TestRuleLimitSysouts.afterIfSuccessful(TestRuleLimitSysouts.java:212)
> {noformat}
> Invariably, looking at the logs of test that fail for this reason, i see 
> multiple instances of these WARN msgs...
> {noformat}
>   2> 601361 T3064 oahh.LeaseRenewer.run WARN Failed to renew lease for 
> [DFSClient_NONMAPREDUCE_-253604438_2947] for 92 seconds.  Will retry shortly 
> ... java.net.ConnectException: Call From frisbee/127.0.1.1 to localhost:40618 
> failed on connection exception: java.net.ConnectException: Connection 
> refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   2>  at sun.reflect.GeneratedConstructorAccessor268.newInstance(Unknown 
> Source)
>   2>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ...
> {noformat}
> ...the full stack traces of these exceptions typically being 36 lines long 
> (not counting the supressed "... 17 more" at the end)
> doing some basic crunching of the "tests-report.txt" file from a recent run 
> of all "solr-core" tests (that caused the above failure) leads to some pretty 
> damn disconcerting numbers...
> {noformat}
> hossman@frisbee:~/tmp$ wc -l tests-report.txt_suite-failure-due-to-sysout.txt
> 1049177 tests-report.txt_suite-failure-due-to-sysout.txt
> hossman@frisbee:~/tmp$ grep "Suite: org.apache.solr" 
> tests-report.txt_suite-failure-due-to-sysout.txt | wc -l
> 465
> hossman@frisbee:~/tmp$ grep "LeaseRenewer.run WARN Failed to renew lease" 
> tests-report.txt_suite-failure-due-to-sysout.txt | grep 
> http://wiki.apache.org/hadoop/ConnectionRefused | wc -l
> 1988
> hossman@frisbee:~/tmp$ calc
> 1988 * 36
> 71568
> {noformat}
> So running 465 Solr test suites, we got ~2 thousand of these "Failed to renew 
> lease" WARNings.  Of the ~1 million total lines of log messages from all 
> tests, ~70 thousand (~7%) are coming from these WARNing mesages -- which can 
> evidently be safetly ignored?
> Something seems broken here.
> Someone who understands this area of the code should either:
> * investigate & fix the code/test not to have these lease renewal problems
> * tweak our test logging configs to supress these WARN messages



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to