[ https://issues.apache.org/jira/browse/SOLR-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yonik Seeley updated SOLR-3180:
-------------------------------

    Attachment: fail.130102_020942.txt

OK, here's a fail with more logging. What finally manifests as the error the test detects seems to be caused by retries (but there are still threads hanging around from the first attempt that eventually execute). I didn't follow this fail all the way through to the end, since I saw enough potentially wonky stuff earlier. C14 tries to publish success, fails because the ChaosMonkey (CM) caused a connection loss, and then future updates to C14 hang for a long time.

{code}
##### Since we know C14 (P56842) seems problematic, we look back in time before the problematic update.
 2> ASYNC NEW_CORE C14 name=collection1 org.apache.solr.core.SolrCore@15c617e0 url=http://127.0.0.1:56842/collection1 node=127.0.0.1:56842_
 C14_STATE=coll:collection1 core:collection1 props:{shard=shard1, roles=null, state=active, core=collection1, collection=collection1, node_name=127.0.0.1:56842_, base_url=http://127.0.0.1:56842}
 2> 73226 T323 C14 P56842 oasc.RecoveryStrategy.run Starting recovery process. core=collection1 recoveringAfterStartup=true
 2> 73233 T323 C14 P56842 oasc.ZkController.publish publishing core=collection1 state=recovering
 2> 75058 T256 oasc.ChaosMonkey.monkeyLog monkey: chose a victim! 56842
 2> 75059 T256 oasc.ChaosMonkey.monkeyLog monkey: expire session for 56842 !
 2> 75059 T256 oasc.ChaosMonkey.monkeyLog monkey: cause connection loss!
 2> C14_STATE=coll:collection1 core:collection1 props:{shard=shard1, roles=null, state=recovering, core=collection1, collection=collection1, node_name=127.0.0.1:56842_, base_url=http://127.0.0.1:56842}
 2> 77316 T323 C14 P56842 oasc.RecoveryStrategy.doRecovery Attempting to PeerSync from http://127.0.0.1:41533/collection1/ core=collection1 - recoveringAfterStartup=true
 2> 77328 T323 C14 P56842 oasu.PeerSync.handleVersions PeerSync: core=collection1 url=http://127.0.0.1:56842 Our versions are newer. ourLowThreshold=1423033399394697216 otherHigh=1423033403051081728
 2> 77329 T323 C14 P56842 oasu.PeerSync.sync PeerSync: core=collection1 url=http://127.0.0.1:56842 DONE. sync succeeded
##### OK, perhaps the real problem starts here, when we try to publish as "active" but
##### we have issues talking to ZK (because CM caused a connection loss previously?)
 2> 77478 T323 C14 P56842 oasc.ZkController.publish publishing core=collection1 state=active
 2> 77480 T319 oaz.ClientCnxn$SendThread.run WARNING Session 0x13bfa14a03d0020 for server null, unexpected error, closing socket connection and attempting reconnect java.nio.channels.CancelledKeyException
##### The first request that comes in for us is on T314, but this is the last we hear from this thread until time 109571 (same as the problematic add)
 2> 82034 T314 C14 P56842 oasup.LogUpdateProcessor.processDelete PRE_UPDATE DELETE delete{flags=0,_version_=-1423033432318935040,id=60072,commitWithin=-1} {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
############## start of troublesome update
 2> 82115 T28 C12 P49621 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {wt=javabin&version=2}
 2> 82116 T28 C12 P49621 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10079 version=1423033432421695488
 2> 82117 T28 C12 P49621 PRE_UPDATE FINISH {wt=javabin&version=2}
 2> 82118 T28 C12 P49621 /update {wt=javabin&version=2} {add=[10079 (1423033432421695488)]} 0 3
##### Below, C4 receives the update and forwards it to the leader C2, which forwards it to C14 and C0.
##### Looking ahead, it looks like C14 is a problem node - this update takes a long time to complete.
 2> 82121 T59 C4 P48210 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {wt=javabin&version=2}
 2> 82121 T59 C4 P48210 PRE_UPDATE FINISH {wt=javabin&version=2}
 2> 82124 T44 C2 P41533 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {distrib.from=http://127.0.0.1:48210/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
 2> 82124 T44 C2 P41533 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10079 version=1423033432430084096
 2> 82125 T44 C2 P41533 PRE_UPDATE FINISH {distrib.from=http://127.0.0.1:48210/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
 2> 82127 T315 C14 P56842 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 2> 82128 T89 C0 P56785 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 2> 82128 T89 C0 P56785 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10079 version=1423033432430084096
 2> 82129 T89 C0 P56785 PRE_UPDATE FINISH {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 2> 82129 T89 C0 P56785 /update {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[10079 (1423033432430084096)]} 0 1
 2> 82247 T256 oasc.ChaosMonkey.monkeyLog monkey: chose a victim! 56785
 2> 82247 T256 oasc.ChaosMonkey.monkeyLog monkey: expire session for 56785 !
 2> 82248 T256 oasc.ChaosMonkey.monkeyLog monkey: cause connection loss!
##### 56785 is C0, but it looks like we already completed our update forwarded from C2 (although it's always possible the response was not completely written yet)
##### So C14 hung for some reason, but it doesn't look like the chaos monkey touched him!
 2> 97029 T46 C2 P41533 oasc.SolrException.log SEVERE shard update error StdNode: http://127.0.0.1:56842/collection1/:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://127.0.0.1:56842/collection1
 2> 97031 T46 C2 P41533 oasup.DistributedUpdateProcessor.doFinish try and ask http://127.0.0.1:56842 to recover
##### I presume this is C14, and it got the request to recover
 2> 97044 T316 oasha.CoreAdminHandler.handleRequestRecoveryAction It has been requested that we recover
##### It looks like C2 finally returned failure to C4, which retries sending to the leader
 2> 97133 T59 C4 P48210 oasc.SolrException.log SEVERE forwarding update to http://127.0.0.1:41533/collection1/ failed - retrying ...
##### Another timeout from the leader to that node - not sure what request this was...
 2> 97135 T44 C2 P41533 oasc.SolrException.log SEVERE shard update error StdNode: http://127.0.0.1:56842/collection1/:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://127.0.0.1:56842/collection1
##### But we ask the node to recover again, and it looks like the node gets that request again
 2> 97137 T44 C2 P41533 oasup.DistributedUpdateProcessor.doFinish try and ask http://127.0.0.1:56842 to recover
 2> 97138 T44 C2 P41533 oascsi.HttpClientUtil.createClient Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
 2> 97148 T311 oasha.CoreAdminHandler.handleRequestRecoveryAction It has been requested that we recover
##### Below, the request was re-sent to the leader C2 from C4.
##### Either the top-level cloud client timed things out and re-sent, or the command distributor in C4 re-sent.
 2> 97639 T47 C2 P41533 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {distrib.from=http://127.0.0.1:48210/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
 2> 97641 T47 C2 P41533 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10079 version=1423033448699789312
 2> 97643 T47 C2 P41533 PRE_UPDATE FINISH {distrib.from=http://127.0.0.1:48210/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
 2> 97648 T313 C14 P56842 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 2> 97648 T93 C0 P56785 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10079} {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
##### Why did this time out? We saw C14 get the request... perhaps more needs to be done asynchronously on the receiving end of such a request?
##### Also, what's happening on C14 and C0 right now???
 2> 102044 T46 C2 P41533 oasup.DistributedUpdateProcessor.doFinish Could not tell a replica to recover org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://127.0.0.1:56842
##### The original hung update to C14 finally finished after 27 sec.
 2> 109516 T419 oascc.ConnectionManager.process Watcher org.apache.solr.common.cloud.ConnectionManager@58046530 name:ZooKeeperConnection Watcher:127.0.0.1:40151/solr got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
 2> 109518 T320 oascc.ConnectionManager.waitForConnected Client is connected to ZooKeeper
 2> 109518 T320 oascc.ConnectionManager$1.update Connection with ZooKeeper reestablished.
 2> 109519 T320 oasc.RecoveryStrategy.close WARNING Stopping recovery for zkNodeName=127.0.0.1:56842__collection1core=collection1
 2> 109564 T313 C14 P56842 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10079 version=1423033448699789312
 2> 109571 T314 C14 P56842 PRE_UPDATE FINISH {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 2> 109571 T313 C14 P56842 PRE_UPDATE FINISH {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 2> 109571 T315 C14 P56842 PRE_UPDATE FINISH {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 2> 109572 T314 C14 P56842 /update {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {delete=[60072 (-1423033432318935040)]} 0 27538
 2> 109572 T313 C14 P56842 /update {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[10079 (1423033448699789312)]} 0 11925
 2> 109573 T315 C14 P56842 /update {distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[10079 (1423033432430084096)]} 0 27446
 2> ###### Only in cloudDocList: [{id=10079}]
 2> 207369 T545 C22 P57160 REQ /select {fl=id,_version_&shard.url=127.0.0.1:41533/collection1/|127.0.0.1:57160/collection1/&NOW=1357110561123&tests=checkShardConsistency(vsControl)/getVers&q=id:(+10079)&ids=10079&distrib=false&isShard=true&wt=javabin&rows=100000&version=2} status=0 QTime=1
 2> 207372 T10 oasc.AbstractFullDistribZkTestBase.checkShardConsistency SEVERE controlClient :{numFound=0,start=0,docs=[]}
 2> cloudClient :{numFound=1,start=0,docs=[SolrDocument{id=10079, _version_=1423033464957960192}]}
{code}

> ChaosMonkey test failures
> -------------------------
>
>                 Key: SOLR-3180
>                 URL: https://issues.apache.org/jira/browse/SOLR-3180
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Yonik Seeley
>         Attachments: CMSL_fail1.log, CMSL_hang_2.txt, CMSL_hang.txt, fail.130101_034142.txt, fail.130102_020942.txt, fail.inconsistent.txt, test_report_1.txt
>
>
> Handle intermittent failures in the ChaosMonkey tests.
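For readers following along: the check that flags this failure at the end of the log is, at its core, a set difference between the doc lists of the control collection and the cloud collection. A minimal sketch of that comparison (illustrative names only, not the actual AbstractFullDistribZkTestBase code):

```python
# Illustrative sketch (not the real test code) of the doc-list diff that
# produces the "Only in cloudDocList" line in the log above.
# Doc lists are modeled as {id: _version_} dicts.

def diff_doc_lists(control_docs, cloud_docs):
    """Return ids found only in the cloud collection and only in control."""
    only_in_cloud = sorted(set(cloud_docs) - set(control_docs))
    only_in_control = sorted(set(control_docs) - set(cloud_docs))
    return only_in_cloud, only_in_control

# The failure at the end of the log: the control collection never saw
# id=10079 (numFound=0), while the cloud collection has it (numFound=1) -
# apparently the retried add eventually went through on the cloud side.
control = {}
cloud = {"10079": 1423033464957960192}
only_cloud, only_control = diff_doc_lists(control, cloud)
print("Only in cloudDocList:", [{"id": i} for i in only_cloud])
# -> Only in cloudDocList: [{'id': '10079'}]
```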