[ https://issues.apache.org/jira/browse/SOLR-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-3180:
-------------------------------

    Attachment: fail.inconsistent.txt

Uploading fail.inconsistent.txt - I had to truncate the start of the log file 
to get it under the limit for JIRA.

Analysis:
{code}

  2> ASYNC  NEW_CORE C6 name=collection1 org.apache.solr.core.SolrCore@eaecb09 
url=http://127.0.0.1:58270/collection1 node=127.0.0.1:58270_ 
C6_STATE=coll:control_collection core:collection1 props:{shard=shard1, 
roles=null, state=active, core=collection1, collection=control_collection, 
node_name=127.0.0.1:58270_, base_url=http://127.0.0.1:58270, leader=true}

  2> ASYNC  NEW_CORE C5 name=collection1 org.apache.solr.core.SolrCore@54eeabe8 
url=http://127.0.0.1:37198/collection1 node=127.0.0.1:37198_ 
C5_STATE=coll:collection1 core:collection1 props:{shard=shard3, roles=null, 
state=active, core=collection1, collection=collection1, 
node_name=127.0.0.1:37198_, base_url=http://127.0.0.1:37198, leader=true}
  2> 25510 T80 C5 P37198 REQ /get 
{distrib=false&qt=/get&wt=javabin&version=2&getVersions=100} status=0 QTime=0 


  2> 187637 T669 C21 P39620 oasu.PeerSync.sync PeerSync: core=collection1 
url=http://127.0.0.1:39620 DONE. sync succeeded
  2> 188653 T669 C21 P39620 oasc.RecoveryStrategy.doRecovery PeerSync Recovery 
was successful - registering as Active. core=collection1

#
# C21 (the replica) is recovering around the same time that the update for 
id:52720 comes in (we only
# see when the update finishes below, not when it starts).
#

# update control finished
  2> 187923 T24 C6 P58270 /update {wt=javabin&version=2} {add=[52720 
(1422073056556220416)]} 0 1
# update leader for shard3 finished
  2> 187927 T77 C5 P37198 /update {wt=javabin&version=2} {add=[52720 
(1422073056559366144)]} 0 1
# these are the only adds for id:52720 in the logs...
# TODO: verify that there was no replica for C5 to forward to?


--------------------
  2> 225993 T77 C5 P37198 REQ /select 
{tests=checkShardConsistency&q=*:*&distrib=false&wt=javabin&rows=0&version=2} 
hits=835 status=0 QTime=1 
# Note that C5 is still the leader - this means that C21 recovered from it at 
some point?

  2> 225997 T658 C21 P39620 REQ /select 
{tests=checkShardConsistency&q=*:*&distrib=false&wt=javabin&rows=0&version=2} 
hits=833 status=0 QTime=1 
  2>  live:true
  2>  num:833
  2> 
  2> ######shard3 is not consistent.  Got 835 from 
http://127.0.0.1:37198/collection1lastClient and got 833 from 
http://127.0.0.1:39620/collection1
 
  2> ###### sizes=835,833
  2> ###### Only in http://127.0.0.1:37198/collection1: [{id=52720, 
_version_=1422073056559366144}, {id=52710, _version_=1422073056325533696}, 
{id=52717, _version_=1422073056485965825}, {id=2225, 
_version_=1422073056602357760}, {id=52709, _version_=1422073056298270720}, 
{id=2226, _version_=1422073056612843520}, {id=2219, 
_version_=1422073056477577216}, {id=52723, _version_=1422073056605503488}]
  2> ###### Only in http://127.0.0.1:39620/collection1: [{id=52680, 
_version_=1422073042480136192}, {id=52669, _version_=1422073042276712448}, 
{id=52676, _version_=1422073042420367360}, {id=2204, 
_version_=1422073042912149504}, {id=2198, _version_=1422073042778980352}, 
{id=2207, _version_=1422073053454532608}]
{code}

So what looks to be the case is that the replica came up and successfully 
peersynced with the leader, but the leader hadn't noticed it yet, and so didn't 
forward it the new update that arrived in the meantime.
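
To make the window concrete, here's a rough sketch of the getVersions-style 
check (like the /get?getVersions=100 request in the log above). The class and 
method names are made up for illustration - this is not the actual PeerSync 
code - but it shows why an add the leader applies just after answering 
getVersions, and before it starts forwarding to the recovering replica, is 
invisible to the check:

{code}
// Hypothetical sketch, not Solr's PeerSync implementation.  The replica asks
// the leader for its most recent update versions and considers the sync
// successful once its own update log covers them.  An add that the leader
// applies after answering getVersions, but before it starts forwarding
// updates to this replica, is in neither list, so the check still passes.
import java.util.List;
import java.util.TreeSet;

public class PeerSyncSketch {

  /** Returns true if our update log already covers every version the leader reported. */
  static boolean syncLooksSuccessful(List<Long> leaderVersions, List<Long> myVersions) {
    TreeSet<Long> mine = new TreeSet<>(myVersions);
    for (long v : leaderVersions) {
      if (!mine.contains(v)) {
        // The real PeerSync would request the missing updates here; for the
        // purpose of this sketch we just report failure.
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Versions borrowed from the log above, for flavor only.
    List<Long> leaderReported = List.of(1422073056485965825L, 1422073056325533696L);
    List<Long> replicaHas     = List.of(1422073056485965825L, 1422073056325533696L);

    // id:52720 (version 1422073056559366144) lands on the leader just after it
    // answered getVersions, so neither list contains it and the sync still
    // "succeeds": the replica never learns about that doc.
    System.out.println("sync succeeded? " + syncLooksSuccessful(leaderReported, replicaHas));
  }
}
{code}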

If we don't already, we need to do some of the same things we do for 
replication recovery: the replica needs to ensure that the leader sees it (and 
hence will forward it future updates) before it peersyncs with the leader.
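
Roughly the following ordering, in other words (the interfaces here are 
hypothetical, not the actual recovery code - just the shape of it):

{code}
// Hypothetical interfaces, just to show the ordering: make sure the leader
// sees this replica before peersyncing, the same way replication recovery
// already arranges for updates not to be lost.
public class RecoveryOrderSketch {

  interface Leader {
    /** Block until the leader's view of the cluster includes this replica. */
    void waitUntilSees(String replicaUrl) throws InterruptedException;
  }

  interface PeerSync {
    /** Try to sync from the leader's recent versions; true on success. */
    boolean sync(String leaderUrl);
  }

  static boolean recover(Leader leader, PeerSync peerSync,
                         String replicaUrl, String leaderUrl) throws InterruptedException {
    // 1. Ensure the leader knows about this replica, so any update arriving
    //    from this point on gets forwarded to (or buffered for) us.
    leader.waitUntilSees(replicaUrl);

    // 2. Only now is a successful PeerSync meaningful: nothing can slip in
    //    between the leader answering getVersions and it starting to forward.
    if (peerSync.sync(leaderUrl)) {
      return true;   // register as Active
    }
    return false;    // fall back to full replication recovery
  }

  public static void main(String[] args) throws InterruptedException {
    Leader leader = replicaUrl -> { /* e.g. poll cluster state until the replica shows up */ };
    PeerSync sync = leaderUrl -> true;  // pretend the version lists matched
    System.out.println("recovered: " + recover(leader, sync,
        "http://127.0.0.1:39620/collection1", "http://127.0.0.1:37198/collection1"));
  }
}
{code}

The important part is just the ordering: get acknowledged by the leader first, 
then sync.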

                
> ChaosMonkey test failures
> -------------------------
>
>                 Key: SOLR-3180
>                 URL: https://issues.apache.org/jira/browse/SOLR-3180
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Yonik Seeley
>         Attachments: fail.inconsistent.txt, test_report_1.txt
>
>
> Handle intermittent failures in the ChaosMonkey tests.

