[ https://issues.apache.org/jira/browse/SOLR-10398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456993#comment-16456993 ]
Varun Thacker commented on SOLR-10398:
--------------------------------------

[~caomanhdat2] we should just mark this as closed as part of SOLR-11702, right?

> Multiple LIR requests can fail PeerSync even if it succeeds
> -----------------------------------------------------------
>
>                 Key: SOLR-10398
>                 URL: https://issues.apache.org/jira/browse/SOLR-10398
>             Project: Solr
>          Issue Type: Bug
>   Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Varun Thacker
>            Priority: Major
>
> I've seen a scenario where multiple LIRs happen around the same time.
> In this case, even if PeerSync succeeded we ended up failing, causing a full index fetch.
>
> Sequence of events:
> T1: Leader puts the replica in LIR and sets the replica's LIRState to DOWN
> T2: Replica begins PeerSync and its LIRState changes
> T3: Leader puts the replica in LIR again and the replica's LIRState is set to DOWN
> T4: PeerSync from T1 succeeds and examines its own LIRState, which is now DOWN, so it fails, triggering a full replication
>
> Log snippets:
>
> T1 from the leader logs:
> {code}
> solr.log.2:12779:2017-03-23 03:03:18.706 INFO (qtp1076677520-9812) [c:test s:shard73 r:core_node44 x:test_shard73_replica1] o.a.s.c.ZkController Put replica core=test_shard73_replica2 coreNodeName=core_node247 on server:8993_solr into leader-initiated recovery.
> {code}
>
> T2 from the replica logs:
> {code}
> solr.log.1:2017-03-23 03:03:26.724 INFO (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.c.RecoveryStrategy Attempting to PeerSync from http://server:8983/solr/test_shard73_replica1/ - recoveringAfterStartup=false
> {code}
>
> T3 from the leader logs:
> {code}
> solr.log.2:2017-03-23 03:03:43.268 INFO (qtp1076677520-9796) [c:test s:shard73 r:core_node44 x:test_shard73_replica1] o.a.s.c.ZkController Put replica core=test_shard73_replica2 coreNodeName=core_node247 on server:8993_solr into leader-initiated recovery.
> {code}
>
> T4 from the replica logs:
> {code}
> 2017-03-23 03:05:38.009 INFO (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.c.RecoveryStrategy PeerSync Recovery was successful - registering as Active.
> 2017-03-23 03:05:38.012 ERROR (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover.:org.apache.solr.common.SolrException: Cannot publish state of core 'test_shard73_replica2' as active without recovering first!
> 	at org.apache.solr.cloud.ZkController.publish(ZkController.java:1179)
> 	at org.apache.solr.cloud.ZkController.publish(ZkController.java:1135)
> 	at org.apache.solr.cloud.ZkController.publish(ZkController.java:1131)
> 	at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:415)
> 	at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
> 2017-03-23 03:05:47.014 INFO (RecoveryThread-test_shard73_replica2) [c:test s:shard73 r:core_node247 x:test_shard73_replica2] o.a.s.h.IndexFetcher Starting download to NRTCachingDirectory(MMapDirectory@/data4/test_shard73_replica2/data/index.20170323030546697 lockFactory=org.apache.lucene.store.NativeFSLockFactory@4aa1e5c0; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
> {code}
>
> I don't know what the best approach to tackle the problem is, but I'll post suggestions after doing some research.
> I wanted to create the Jira to track the issue.
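
Purely as an illustration of the race described in the quoted report, here is a minimal, self-contained Java sketch. The class and enum names are hypothetical, not Solr's actual API; in Solr the failing check happens when ZkController.publish() refuses to publish the core as active, as the T4 stack trace shows.

{code}
import java.util.concurrent.atomic.AtomicReference;

public class LirRaceSketch {

    // Hypothetical stand-in for the leader-initiated recovery (LIR) state
    // the leader records for a replica.
    enum LirState { DOWN, RECOVERING, ACTIVE }

    public static void main(String[] args) {
        // Shared LIR state, analogous to the state the leader writes for the replica.
        AtomicReference<LirState> lirState = new AtomicReference<>(LirState.ACTIVE);

        // T1: leader puts the replica into LIR; state becomes DOWN.
        lirState.set(LirState.DOWN);

        // T2: replica notices, begins PeerSync, and its LIR state changes.
        lirState.set(LirState.RECOVERING);
        boolean peerSyncSucceeded = true; // the PeerSync itself succeeds

        // T3: a second LIR request from the leader resets the state to DOWN
        // while the PeerSync started at T2 is still in flight.
        lirState.set(LirState.DOWN);

        // T4: the replica finishes PeerSync and re-checks its own LIR state.
        // Because T3 reset it to DOWN, the successful PeerSync is discarded
        // and a full index fetch is triggered instead.
        if (peerSyncSucceeded && lirState.get() != LirState.DOWN) {
            System.out.println("PeerSync Recovery was successful - registering as Active.");
        } else {
            System.out.println("Cannot publish as active; falling back to full replication.");
        }
    }
}
{code}

Run as-is, this prints the fallback branch: even though PeerSync succeeded, the second LIR request at T3 has already reset the state that the replica inspects at T4.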