[ https://issues.apache.org/jira/browse/SOLR-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359123#comment-14359123 ]
Hoss Man commented on SOLR-7215: -------------------------------- This is more jacked up then i thought -- i just got one of the Suite failures from "TestDocSet" which directly extends LuceneTestCase and doesn't do _ANY_ initialization of any Solr specific functionality (no CoreContainers, no SolrCores, no ZooKeeper) which means not only are these HDFS Client "ConnectExceptions" causing test failures due to too much logging -- these threads appear to be leaking from the test suites and affecting other tests run in the same JVM *EVEN WHEN WHATEVER TEST CREATED THESE THREADS PASSES* ... The _only_ failure i got was from TestDocSet. and yet it failed because of excessive logging from a thread created by some other test that had already passed. {noformat} hossman@frisbee:~/lucene/dev/solr$ ant test ... [junit4] Suite: org.apache.solr.search.TestDocSet [junit4] 2> 1460665 T5379 oahh.LeaseRenewer.run WARN Failed to renew lease for [DFSClient_NONMAPREDUCE_1277984620_5262] for 402 seconds. Will retry shortly ... java.net.ConnectException: Call From frisbee/127.0.1.1 to localhost:47570 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused [junit4] 2> at sun.reflect.GeneratedConstructorAccessor303.newInstance(Unknown Source) [junit4] 2> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [junit4] 2> at java.lang.reflect.Constructor.newInstance(Constructor.java:408) [junit4] 2> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783) [junit4] 2> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730) [junit4] 2> at org.apache.hadoop.ipc.Client.call(Client.java:1410) [junit4] 2> at org.apache.hadoop.ipc.Client.call(Client.java:1359) [junit4] 2> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) [junit4] 2> at com.sun.proxy.$Proxy43.renewLease(Unknown Source) [junit4] 2> at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown Source) [junit4] 2> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [junit4] 2> at java.lang.reflect.Method.invoke(Method.java:483) [junit4] 2> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) [junit4] 2> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) [junit4] 2> at com.sun.proxy.$Proxy43.renewLease(Unknown Source) [junit4] 2> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:519) [junit4] 2> at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:773) [junit4] 2> at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417) [junit4] 2> at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442) [junit4] 2> at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71) [junit4] 2> at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298) [junit4] 2> at java.lang.Thread.run(Thread.java:745) [junit4] 2> Caused by: java.net.ConnectException: Connection refused [junit4] 2> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) [junit4] 2> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716) [junit4] 2> at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) [junit4] 2> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) [junit4] 2> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) [junit4] 2> at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:601) [junit4] 2> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:696) [junit4] 2> at org.apache.hadoop.ipc.Client$Connection.access$2700(Client.java:367) [junit4] 2> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1458) [junit4] 2> at org.apache.hadoop.ipc.Client.call(Client.java:1377) [junit4] 2> ... 16 more [junit4] 2> [junit4] 2> 1460924 T8206 oahh.LeaseRenewer.run WARN Failed to renew lease for [DFSClient_NONMAPREDUCE_602751345_8088] for 91 seconds. Will retry shortly ... java.net.ConnectException: Call From frisbee/127.0.1.1 to localhost:47687 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused [junit4] 2> at sun.reflect.GeneratedConstructorAccessor303.newInstance(Unknown Source) [junit4] 2> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [junit4] 2> at java.lang.reflect.Constructor.newInstance(Constructor.java:408) [junit4] 2> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783) [junit4] 2> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730) [junit4] 2> at org.apache.hadoop.ipc.Client.call(Client.java:1410) [junit4] 2> at org.apache.hadoop.ipc.Client.call(Client.java:1359) [junit4] 2> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) [junit4] 2> at com.sun.proxy.$Proxy43.renewLease(Unknown Source) [junit4] 2> at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown Source) [junit4] 2> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [junit4] 2> at java.lang.reflect.Method.invoke(Method.java:483) [junit4] 2> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) [junit4] 2> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) [junit4] 2> at com.sun.proxy.$Proxy43.renewLease(Unknown Source) [junit4] 2> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:519) [junit4] 2> at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:773) [junit4] 2> at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417) [junit4] 2> at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442) [junit4] 2> at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71) [junit4] 2> at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298) [junit4] 2> at java.lang.Thread.run(Thread.java:745) [junit4] 2> Caused by: java.net.ConnectException: Connection refused [junit4] 2> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) [junit4] 2> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716) [junit4] 2> at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) [junit4] 2> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) [junit4] 2> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) [junit4] 2> at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:601) [junit4] 2> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:696) [junit4] 2> at org.apache.hadoop.ipc.Client$Connection.access$2700(Client.java:367) [junit4] 2> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1458) [junit4] 2> at org.apache.hadoop.ipc.Client.call(Client.java:1377) [junit4] 2> ... 16 more [junit4] 2> [junit4] 2> NOTE: test params are: codec=Asserting(Lucene50): {}, docValues:{}, sim=DefaultSimilarity, locale=es_CU, timezone=Asia/Dacca [junit4] 2> NOTE: Linux 3.2.0-76-generic amd64/Oracle Corporation 1.8.0_31 (64-bit)/cpus=4,threads=1,free=75167392,total=524812288 [junit4] 2> NOTE: All tests run in this JVM: [TestShortCircuitedRequests, TestPHPSerializedResponseWriter, TestBinaryResponseWriter, ReplicationFactorTest, TestMiniSolrCloudCluster, TestManagedResource, BufferStoreTest, DistributedMLTComponentTest, TestExceedMaxTermLength, TestSchemaVersionResource, TestSimpleQParserPlugin, DirectUpdateHandlerTest, TestWriterPerf, DirectUpdateHandlerOptimizeTest, TestRTGBase, DistanceFunctionTest, EnumFieldTest, TestTrackingShardHandlerFactory, OverseerRolesTest, TestUtils, AnalysisErrorHandlingTest, SolrPluginUtilsTest, ShardSplitTest, TestDynamicLoading, ClusterStateTest, SOLR749Test, VMParamsZkACLAndCredentialsProvidersTest, AnalyticsMergeStrategyTest, TestJmxMonitoredMap, TestFaceting, TestAddFieldRealTimeGet, TestSchemaNameResource, TestOrdValues, DistributedFacetPivotSmallAdvancedTest, CopyFieldTest, BasicDistributedZkTest, AssignTest, TestRandomMergePolicy, DisMaxRequestHandlerTest, TestRequestStatusCollectionAPI, RecoveryAfterSoftCommitTest, PreAnalyzedUpdateProcessorTest, TestLuceneMatchVersion, SynonymTokenizerTest, RankQueryTest, DocValuesMultiTest, TestClassNameShortening, SolrInfoMBeanTest, CloudMLTQParserTest, TestManagedStopFilterFactory, StatelessScriptUpdateProcessorFactoryTest, TestSolrXml, DistributedFacetPivotSmallTest, TestLFUCache, TestNonDefinedSimilarityFactory, BadIndexSchemaTest, TestFiltering, TestCopyFieldCollectionResource, AlternateDirectoryTest, TestHashPartitioner, RollingRestartTest, TestRebalanceLeaders, DocValuesTest, TestStressVersions, TestSchemaSimilarityResource, TestSolr4Spatial, CircularListTest, TestCollationField, HdfsBasicDistributedZk2Test, TestFieldCollectionResource, AutoCommitTest, SuggestComponentTest, DistanceUnitsTest, TestSearchPerf, StandardRequestHandlerTest, TestComplexPhraseQParserPlugin, TestDefaultStatsCache, TestDistributedSearch, MultiThreadedOCPTest, TestCloudSchemaless, UnloadDistributedZkTest, StressHdfsTest, TestSolrConfigHandlerCloud, CloudExitableDirectoryReaderTest, SaslZkACLProviderTest, TestCryptoKeys, TestCloudManagedSchema, TriLevelCompositeIdRoutingTest, SpellCheckComponentTest, DistributedIntervalFacetingTest, DistributedFacetPivotLargeTest, TestRandomDVFaceting, TestStressRecovery, ExitableDirectoryReaderTest, DistributedTermsComponentTest, TestExactSharedStatsCache, HdfsDirectoryFactoryTest, DistributedFacetPivotWhiteBoxTest, TestHighlightDedupGrouping, DistributedQueryComponentCustomSortTest, ZkCLITest, TestStressUserVersions, SolrCmdDistributorTest, TestBadConfig, DistributedQueryElevationComponentTest, SolrIndexSplitterTest, FacetPivotSmallTest, LukeRequestHandlerTest, TestAnalyzedSuggestions, AddSchemaFieldsUpdateProcessorFactoryTest, TestExtendedDismaxParser, TestDynamicFieldCollectionResource, TestCSVLoader, TestStressLucene, TestFreeTextSuggestions, DateRangeFieldTest, DocumentBuilderTest, ZkStateWriterTest, JsonLoaderTest, TestCollapseQParserPlugin, TestQueryTypes, TestCoreDiscovery, TermVectorComponentTest, CacheHeaderTest, TestOmitPositions, TestFieldTypeResource, TestDynamicFieldResource, TestExpandComponent, TestJmxIntegration, ReturnFieldsTest, FieldAnalysisRequestHandlerTest, InfoHandlerTest, SegmentsInfoRequestHandlerTest, RequiredFieldsTest, DateFieldTest, NotRequiredUniqueKeyTest, SchemaVersionSpecificBehaviorTest, DefaultValueUpdateProcessorTest, TestHashQParserPlugin, TestDocSet] [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestDocSet -Dtests.seed=31CA503F5F477364 -Dtests.slow=true -Dtests.locale=es_CU -Dtests.timezone=Asia/Dacca -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] ERROR 0.00s J1 | TestDocSet (suite) <<< [junit4] > Throwable #1: java.lang.AssertionError: The test or suite printed 10988 bytes to stdout and stderr, even though the limit was set to 8192 bytes. Increase the limit with @Limit, ignore it completely with @SuppressSysoutChecks or run with -Dtests.verbose=true [junit4] > at __randomizedtesting.SeedInfo.seed([31CA503F5F477364]:0) [junit4] > at java.lang.Thread.run(Thread.java:745) [junit4] Completed on J1 in 0.67s, 2 tests, 1 failure <<< FAILURES! ... [junit4] [junit4] Tests with failures: [junit4] - org.apache.solr.search.TestDocSet (suite) [junit4] [junit4] [junit4] JVM J0: 1.03 .. 1470.41 = 1469.38s [junit4] JVM J1: 0.84 .. 1470.20 = 1469.36s [junit4] JVM J2: 1.02 .. 1470.36 = 1469.33s [junit4] Execution time total: 24 minutes 30 seconds [junit4] Tests summary: 473 suites, 1882 tests, 1 suite-level error, 31 ignored (21 assumptions) {noformat} > non reproducible Suite failures due to excessive sysout due to HDFS lease > renewal WARN logs due to connection refused > --------------------------------------------------------------------------------------------------------------------- > > Key: SOLR-7215 > URL: https://issues.apache.org/jira/browse/SOLR-7215 > Project: Solr > Issue Type: Bug > Reporter: Hoss Man > Attachments: tests-report.txt_suite-failure-due-to-sysout.txt.zip > > > On my local machine, i've noticed lately a lot of sporadic, non reproducible, > failures like these... > {noformat} > 2> NOTE: reproduce with: ant test -Dtestcase=ScriptEngineTest > -Dtests.seed=E254A7E69EC7212A -Dtests.slow=true -Dtests.locale=sv > -Dtests.timezone=SystemV/CST6 -Dtests.asserts=true -Dtests.file.encoding=UTF-8 > [14:34:23.749] ERROR 0.00s J1 | ScriptEngineTest (suite) <<< > > Throwable #1: java.lang.AssertionError: The test or suite printed 10984 > bytes to stdout and stderr, even though the limit was set to 8192 bytes. > Increase the limit with @Limit, ignore it completely with > @SuppressSysoutChecks or run with -Dtests.verbose=true > > at __randomizedtesting.SeedInfo.seed([E254A7E69EC7212A]:0) > > at > org.apache.lucene.util.TestRuleLimitSysouts.afterIfSuccessful(TestRuleLimitSysouts.java:212) > {noformat} > Invariably, looking at the logs of test that fail for this reason, i see > multiple instances of these WARN msgs... > {noformat} > 2> 601361 T3064 oahh.LeaseRenewer.run WARN Failed to renew lease for > [DFSClient_NONMAPREDUCE_-253604438_2947] for 92 seconds. Will retry shortly > ... java.net.ConnectException: Call From frisbee/127.0.1.1 to localhost:40618 > failed on connection exception: java.net.ConnectException: Connection > refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused > 2> at sun.reflect.GeneratedConstructorAccessor268.newInstance(Unknown > Source) > 2> at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > ... > {noformat} > ...the full stack traces of these exceptions typically being 36 lines long > (not counting the supressed "... 17 more" at the end) > doing some basic crunching of the "tests-report.txt" file from a recent run > of all "solr-core" tests (that caused the above failure) leads to some pretty > damn disconcerting numbers... > {noformat} > hossman@frisbee:~/tmp$ wc -l tests-report.txt_suite-failure-due-to-sysout.txt > 1049177 tests-report.txt_suite-failure-due-to-sysout.txt > hossman@frisbee:~/tmp$ grep "Suite: org.apache.solr" > tests-report.txt_suite-failure-due-to-sysout.txt | wc -l > 465 > hossman@frisbee:~/tmp$ grep "LeaseRenewer.run WARN Failed to renew lease" > tests-report.txt_suite-failure-due-to-sysout.txt | grep > http://wiki.apache.org/hadoop/ConnectionRefused | wc -l > 1988 > hossman@frisbee:~/tmp$ calc > 1988 * 36 > 71568 > {noformat} > So running 465 Solr test suites, we got ~2 thousand of these "Failed to renew > lease" WARNings. Of the ~1 million total lines of log messages from all > tests, ~70 thousand (~7%) are coming from these WARNing mesages -- which can > evidently be safetly ignored? > Something seems broken here. > Someone who understands this area of the code should either: > * investigate & fix the code/test not to have these lease renewal problems > * tweak our test logging configs to supress these WARN messages -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org