[ https://issues.apache.org/jira/browse/SOLR-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-3180:
-------------------------------

    Attachment: fail.130102_020942.txt

OK, here's a fail with more logging.

What finally manifests as the error the test detects seems to be caused by 
retries (though there are still threads hanging around from the first attempt 
that eventually execute).
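
To make the suspected race concrete, here's a minimal sketch (plain Java with an 
executor, not Solr's actual update distribution code; the method name and 
timeouts are made up) of how a client-side timeout plus retry can leave the 
original forwarding thread alive and completing much later:

{code}
import java.util.concurrent.*;

public class RetryRaceSketch {

  // Hypothetical stand-in for forwarding an update to a replica.
  static String forwardUpdate(String docId, long versionStamp) throws InterruptedException {
    Thread.sleep(30_000);   // simulate the replica (C14) hanging for a long time
    return "added " + docId + " version=" + versionStamp;
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newCachedThreadPool();

    // First attempt: the forwarding thread gets stuck on the slow replica.
    Future<String> first = pool.submit(() -> forwardUpdate("10079", 1L));
    try {
      first.get(15, TimeUnit.SECONDS);               // client-side timeout
    } catch (TimeoutException e) {
      // The caller gives up and retries with a new version stamp,
      // but the first task is still running and will finish eventually.
      Future<String> retry = pool.submit(() -> forwardUpdate("10079", 2L));
      System.out.println("retry:    " + retry.get());
    }
    System.out.println("original: " + first.get());  // the "hung" attempt completes late
    pool.shutdown();
  }
}
{code}

That's the same shape as the log below: id=10079 gets a LOCAL_ADD with one 
_version_ on the first attempt and a different _version_ on the retry, and the 
first attempt on C14 still completes roughly 27 seconds later.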

I didn't follow this fail all the way through to the end since I saw enough 
potentially wonky stuff earlier.

C14 tries to publish that it recovered successfully (state=active), fails 
because the ChaosMonkey (CM) caused a connection loss, and then future updates 
to C14 hang for a long time.
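
Publishing the state here roughly boils down to a ZooKeeper write, so a 
connection loss at the wrong moment makes that write fail, and retrying until 
the connection comes back stalls the publishing thread. A minimal sketch of 
that failure mode (raw ZooKeeper client, illustrative path and retry policy, 
not ZkController's actual code):

{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class PublishSketch {

  // Hypothetical helper: write the core's state, retrying on connection loss.
  static void publishState(ZooKeeper zk, String path, byte[] state) throws Exception {
    while (true) {
      try {
        zk.setData(path, state, -1);   // -1 = ignore the node's version
        return;
      } catch (KeeperException.ConnectionLossException e) {
        // The session's connection was dropped (here, by the ChaosMonkey).
        // Retrying blocks the publishing thread until the client reconnects
        // (or until the session itself expires).
        Thread.sleep(1000);
      }
    }
  }
}
{code}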

{code}

###  Since we know C14 (P56842) seems problematic, we look back in time to before the problematic update.
  2> ASYNC  NEW_CORE C14 name=collection1 
org.apache.solr.core.SolrCore@15c617e0 url=http://127.0.0.1:56842/collection1 
node=127.0.0.1:56842_ C14_STATE=coll:collection1 core:collection1 
props:{shard=shard1, roles=null, state=active, core=collection1, 
collection=collection1, node_name=127.0.0.1:56842_, 
base_url=http://127.0.0.1:56842}
  2> 73226 T323 C14 P56842 oasc.RecoveryStrategy.run Starting recovery process. 
 core=collection1 recoveringAfterStartup=true
 
  2> 73233 T323 C14 P56842 oasc.ZkController.publish publishing 
core=collection1 state=recovering

  2> 75058 T256 oasc.ChaosMonkey.monkeyLog monkey: chose a victim! 56842
  2> 75059 T256 oasc.ChaosMonkey.monkeyLog monkey: expire session for 56842 !
  2> 75059 T256 oasc.ChaosMonkey.monkeyLog monkey: cause connection loss!
 
  2>  C14_STATE=coll:collection1 core:collection1 props:{shard=shard1, 
roles=null, state=recovering, core=collection1, collection=collection1, 
node_name=127.0.0.1:56842_, base_url=http://127.0.0.1:56842}
  2> 77316 T323 C14 P56842 oasc.RecoveryStrategy.doRecovery Attempting to 
PeerSync from http://127.0.0.1:41533/collection1/ core=collection1 - 
recoveringAfterStartup=true


  2> 77328 T323 C14 P56842 oasu.PeerSync.handleVersions PeerSync: 
core=collection1 url=http://127.0.0.1:56842  Our versions are newer. 
ourLowThreshold=1423033399394697216 otherHigh=1423033403051081728
  2> 77329 T323 C14 P56842 oasu.PeerSync.sync PeerSync: core=collection1 
url=http://127.0.0.1:56842 DONE. sync succeeded
 
##### OK, perhaps the real problem starts here, when we try to publish as "active"
##### but we have issues talking to ZK (because CM caused a connection loss previously?)
  2> 77478 T323 C14 P56842 oasc.ZkController.publish publishing 
core=collection1 state=active
  2> 77480 T319 oaz.ClientCnxn$SendThread.run WARNING Session 0x13bfa14a03d0020 
for server null, unexpected error, closing socket connection and attempting 
reconnect java.nio.channels.CancelledKeyException

##### The first request that comes in for us is on T314, but this is the last we hear from this thread until time 109571 (same as the problematic add)
  2> 82034 T314 C14 P56842 oasup.LogUpdateProcessor.processDelete PRE_UPDATE 
DELETE delete{flags=0,_version_=-1423033432318935040,id=60072,commitWithin=-1} 
{distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}

 
############## start of troublesome update

  2> 82115 T28 C12 P49621 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10079} {wt=javabin&version=2}
  2> 82116 T28 C12 P49621 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10079 version=1423033432421695488
  2> 82117 T28 C12 P49621 PRE_UPDATE FINISH  {wt=javabin&version=2}
  2> 82118 T28 C12 P49621 /update {wt=javabin&version=2} {add=[10079 
(1423033432421695488)]} 0 3


##### Below, C4 receives the update, forwards it to the leader C2, which 
forwards it to C14 and C0.  Looking ahead, it looks like C14 is a problem node 
- this update takes a long time to complete.

  2> 82121 T59 C4 P48210 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10079} {wt=javabin&version=2}
  2> 82121 T59 C4 P48210 PRE_UPDATE FINISH  {wt=javabin&version=2}
  2> 82124 T44 C2 P41533 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10079} 
{distrib.from=http://127.0.0.1:48210/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
  2> 82124 T44 C2 P41533 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10079 version=1423033432430084096
  2> 82125 T44 C2 P41533 PRE_UPDATE FINISH  
{distrib.from=http://127.0.0.1:48210/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
  2> 82127 T315 C14 P56842 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10079} 
{distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 82128 T89 C0 P56785 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10079} 
{distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 82128 T89 C0 P56785 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10079 version=1423033432430084096
  2> 82129 T89 C0 P56785 PRE_UPDATE FINISH  
{distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 82129 T89 C0 P56785 /update 
{distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 {add=[10079 (1423033432430084096)]} 0 1

  2> 82247 T256 oasc.ChaosMonkey.monkeyLog monkey: chose a victim! 56785
  2> 82247 T256 oasc.ChaosMonkey.monkeyLog monkey: expire session for 56785 !
  2> 82248 T256 oasc.ChaosMonkey.monkeyLog monkey: cause connection loss!
##### 56785 is C0, but it looks like we already completed our update forwarded 
from C2 (although it's always possible the response was not completely written 
yet)



##### So C14 hung for some reason, but it doesn't look like the chaos monkey 
touched him!
  2> 97029 T46 C2 P41533 oasc.SolrException.log SEVERE shard update error 
StdNode: 
http://127.0.0.1:56842/collection1/:org.apache.solr.client.solrj.SolrServerException:
 Timeout occured while waiting response from server at: 
http://127.0.0.1:56842/collection1
  2> 97031 T46 C2 P41533 oasup.DistributedUpdateProcessor.doFinish try and ask 
http://127.0.0.1:56842 to recover

##### I presume this is C14, and it got the request to recover
  2> 97044 T316 oasha.CoreAdminHandler.handleRequestRecoveryAction It has been 
requested that we recover

##### It looks like C2 finally returned failure to C4, which retries sending to 
the leader
  2> 97133 T59 C4 P48210 oasc.SolrException.log SEVERE forwarding update to 
http://127.0.0.1:41533/collection1/ failed - retrying ... 

##### Another timeout from the leader to that node - not sure what request this 
was...
  2> 97135 T44 C2 P41533 oasc.SolrException.log SEVERE shard update error 
StdNode: 
http://127.0.0.1:56842/collection1/:org.apache.solr.client.solrj.SolrServerException:
 Timeout occured while waiting response from server at: 
http://127.0.0.1:56842/collection1

##### But we ask the node to recover again, and it looks like the node gets 
that request again
  2> 97137 T44 C2 P41533 oasup.DistributedUpdateProcessor.doFinish try and ask 
http://127.0.0.1:56842 to recover
  2> 97138 T44 C2 P41533 oascsi.HttpClientUtil.createClient Creating new http 
client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
  2> 97148 T311 oasha.CoreAdminHandler.handleRequestRecoveryAction It has been 
requested that we recover
 

##### Below, the request was re-sent to the leader C2 from C4
##### Either the top level cloud client timed things out and re-sent, or the 
command distributor in C4 re-sent.
  2> 97639 T47 C2 P41533 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10079} 
{distrib.from=http://127.0.0.1:48210/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
  2> 97641 T47 C2 P41533 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10079 version=1423033448699789312
  2> 97643 T47 C2 P41533 PRE_UPDATE FINISH  
{distrib.from=http://127.0.0.1:48210/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
  2> 97648 T313 C14 P56842 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10079} 
{distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 97648 T93 C0 P56785 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD 
add{flags=0,_version_=0,id=10079} 
{distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 

##### Why did this time out?  We saw C14 get the request... perhaps more needs 
to be done asynchronously on the receiving end of such a request? 
##### Also, what's happening on C14 and C0 right now???
   2> 102044 T46 C2 P41533 oasup.DistributedUpdateProcessor.doFinish Could not 
tell a replica to recover org.apache.solr.client.solrj.SolrServerException: 
Timeout occured while waiting response from server at: http://127.0.0.1:56842


##### The original hung update to C14 finally finished after 27 sec.
  2> 109516 T419 oascc.ConnectionManager.process Watcher 
org.apache.solr.common.cloud.ConnectionManager@58046530 
name:ZooKeeperConnection Watcher:127.0.0.1:40151/solr got event WatchedEvent 
state:SyncConnected type:None path:null path:null type:None
  2> 109518 T320 oascc.ConnectionManager.waitForConnected Client is connected 
to ZooKeeper
  2> 109518 T320 oascc.ConnectionManager$1.update Connection with ZooKeeper 
reestablished.
  2> 109519 T320 oasc.RecoveryStrategy.close WARNING Stopping recovery for 
zkNodeName=127.0.0.1:56842__collection1core=collection1
 
  2> 109564 T313 C14 P56842 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: 
id=10079 version=1423033448699789312
  2> 109571 T314 C14 P56842 PRE_UPDATE FINISH  
{distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 109571 T313 C14 P56842 PRE_UPDATE FINISH  
{distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 109571 T315 C14 P56842 PRE_UPDATE FINISH  
{distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 109572 T314 C14 P56842 /update 
{distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 {delete=[60072 (-1423033432318935040)]} 0 27538
  2> 109572 T313 C14 P56842 /update 
{distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 {add=[10079 (1423033448699789312)]} 0 11925
  2> 109573 T315 C14 P56842 /update 
{distrib.from=http://127.0.0.1:41533/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
 {add=[10079 (1423033432430084096)]} 0 27446
 

  2> ###### Only in cloudDocList: [{id=10079}]

  2> 207369 T545 C22 P57160 REQ /select 
{fl=id,_version_&shard.url=127.0.0.1:41533/collection1/|127.0.0.1:57160/collection1/&NOW=1357110561123&tests=checkShardConsistency(vsControl)/getVers&q=id:(+10079)&ids=10079&distrib=false&isShard=true&wt=javabin&rows=100000&version=2}
 status=0 QTime=1 
  2> 207372 T10 oasc.AbstractFullDistribZkTestBase.checkShardConsistency SEVERE 
controlClient :{numFound=0,start=0,docs=[]}
  2>            cloudClient :{numFound=1,start=0,docs=[SolrDocument{id=10079, 
_version_=1423033464957960192}]}
 
{code}
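
For reference, the consistency failure at the end is just the test comparing 
the control collection against the cloud collection for the same id. A rough 
SolrJ sketch of that kind of comparison (placeholder URLs, not the test's 
actual checkShardConsistency code):

{code}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ConsistencyCheckSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder URLs; the test builds these from the jetties it started.
    HttpSolrServer control = new HttpSolrServer("http://127.0.0.1:8983/solr/collection1");
    HttpSolrServer cloud   = new HttpSolrServer("http://127.0.0.1:7574/solr/collection1");

    SolrQuery q = new SolrQuery("id:10079").setFields("id", "_version_");
    long controlFound = control.query(q).getResults().getNumFound();
    long cloudFound   = cloud.query(q).getResults().getNumFound();

    if (controlFound != cloudFound) {
      // Matches the SEVERE output above: the doc shows up in the cloud
      // collection but not in the control collection.
      System.err.println("controlClient numFound=" + controlFound
          + " cloudClient numFound=" + cloudFound);
    }

    control.shutdown();
    cloud.shutdown();
  }
}
{code}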
                
> ChaosMonkey test failures
> -------------------------
>
>                 Key: SOLR-3180
>                 URL: https://issues.apache.org/jira/browse/SOLR-3180
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Yonik Seeley
>         Attachments: CMSL_fail1.log, CMSL_hang_2.txt, CMSL_hang.txt, 
> fail.130101_034142.txt, fail.130102_020942.txt, fail.inconsistent.txt, 
> test_report_1.txt
>
>
> Handle intermittent failures in the ChaosMonkey tests.
