Shawn,

Just wanted to follow up: I am still facing the issue of inconsistent search results on SolrCloud 4.10.1. On digging further into the logs I found a few exceptions; the most obvious were ZooKeeper connection timeouts, along with a few others. Please take a look.
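For reference, the relevant lines below were pulled out of catalina.out with a grep along these lines (just a sketch; the log path is from this setup, and the patterns are simply the exception names that appear in the output below):

```shell
# Sketch: pull replication-failure and ZooKeeper-session lines out of a
# Tomcat log. The patterns match the exceptions shown in the logs below.
show_zk_errors() {
  # $1 = path to catalina.out
  grep -E 'NoSuchFileException|SessionExpiredException|zkClient has disconnected' "$1"
}

# Example invocation on this setup:
# show_zk_errors /opt/tomcat1/logs/catalina.out
```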
*Logs*

/opt/tomcat1/logs/catalina.out:103651230 [http-bio-8081-exec-206] WARN org.apache.solr.handler.ReplicationHandler – Exception while writing response for params: file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException: /opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out:        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out:        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out:        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)

(the same WARN and stack trace for the missing file _68v.fnm repeats verbatim at 103651579, 103651586, 103651592, 103651600 and 103651611, the last on thread http-bio-8081-exec-203)

471640118 [localhost-startStop-1-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager – Watcher org.apache.solr.common.cloud.ConnectionManager@2a7dcd74 name:ZooKeeperConnection Watcher:server1.mydomain.com:2181,server2.mydomain.com:2181,server3.mydomain.com:2181 got event WatchedEvent state:Disconnected type:None path:null
471640120 [localhost-startStop-1-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager – zkClient has disconnected
471642457 [zkCallback-2-thread-8] INFO org.apache.solr.cloud.DistributedQueue – LatchChildWatcher fired on path: null state: Expired type None
471642458 [localhost-startStop-1-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager – Watcher org.apache.solr.common.cloud.ConnectionManager@2a7dcd74 name:ZooKeeperConnection Watcher:server1.mydomain.com:2181,server2.mydomain.com:2181,server3.mydomain.com:2181 got event WatchedEvent state:Expired type:None path:null
471642458 [localhost-startStop-1-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager – Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
471642458 [localhost-startStop-1-EventThread] INFO org.apache.solr.cloud.Overseer – Overseer (id=164669836745768960-server1.mydomain.com:8081_solr-n_0000000019) closing
471642693 [OverseerCollectionProcessor-164669836745768960-server1.mydomain.com:8081_solr-n_0000000019] INFO org.apache.solr.cloud.OverseerCollectionProcessor – According to ZK I (id=164669836745768960-server1.mydomain.com:8081_solr-n_0000000019) am no longer a leader.
471643178 [OverseerStateUpdate-164669836745768960-server1.mydomain.com:8081_solr-n_0000000019] INFO org.apache.solr.cloud.Overseer – Overseer Loop exiting : server1.mydomain.com:8081_solr
471643727 [localhost-startStop-1-EventThread] INFO org.apache.solr.common.cloud.DefaultConnectionStrategy – Connection expired - starting a new one...
471643963 [localhost-startStop-1-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager – Waiting for client to connect to ZooKeeper
471644368 [localhost-startStop-1-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager – Watcher org.apache.solr.common.cloud.ConnectionManager@2a7dcd74 name:ZooKeeperConnection Watcher:server1.mydomain.com:2181,server2.mydomain.com:2181,server3.mydomain.com:2181 got event WatchedEvent state:SyncConnected type:None path:null
471644463 [localhost-startStop-1-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager – Client is connected to ZooKeeper
471644464 [localhost-startStop-1-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager – Connection with ZooKeeper reestablished.
471644464 [localhost-startStop-1-EventThread] INFO org.apache.solr.common.cloud.DefaultConnectionStrategy – Reconnected to ZooKeeper
471644464 [localhost-startStop-1-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager – Connected:true
471644571 [OverseerExitThread] ERROR org.apache.solr.cloud.Overseer – could not read the data
*org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader*
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:307)
        at org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:304)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:74)
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:304)
        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:320)
        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:89)
        at org.apache.solr.cloud.Overseer$ClusterStateUpdater$1.run(Overseer.java:292)
471644603 [Thread-2343] INFO org.apache.solr.cloud.ZkController – publishing core=dyCollection1_shard2_replica1 state=down collection=dyCollection1
471644878 [Thread-2343] INFO org.apache.solr.cloud.ZkController – Replica core_node1 NOT in leader-initiated recovery, need to wait for leader to see down state.
471645717 [Thread-2343] INFO org.apache.solr.cloud.ElectionContext – canceling election /overseer_elect/election/164669836745768960-server1.mydomain.com:8081_solr-n_0000000019
471645742 [Thread-2343] WARN org.apache.solr.cloud.ElectionContext – cancelElection did not find election node to remove /overseer_elect/election/164669836745768960-server1.mydomain.com:8081_solr-n_0000000019
471645869 [Thread-2343] INFO org.apache.solr.common.cloud.ZkStateReader – Updating cluster state from ZooKeeper...
471646230 [Thread-2343] INFO org.apache.solr.cloud.ZkController – Register node as live in ZooKeeper:/live_nodes/server1.mydomain.com:8081_solr
471646277 [Thread-2343] INFO org.apache.solr.common.cloud.SolrZkClient – makePath: /live_nodes/server1.mydomain.com:8081_solr
471646508 [Thread-2343] INFO org.apache.solr.cloud.ZkController – Register replica - core:dyCollection1_shard2_replica1 address:http://server1.mydomain.com:8081/solr collection:dyCollection1 shard:shard2
471646678 [Thread-2343] INFO org.apache.solr.cloud.ElectionContext – canceling election /collections/dyCollection1/leader_elect/shard2/election/164669836745768960-core_node1-n_0000000002
471646932 [Thread-2343] WARN org.apache.solr.cloud.ElectionContext – cancelElection did not find election node to remove /collections/dyCollection1/leader_elect/shard2/election/164669836745768960-core_node1-n_0000000002
471646972 [Thread-2343] INFO org.apache.solr.cloud.ZkController – We are http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/ and leader is http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/
471646972 [Thread-2343] INFO org.apache.solr.cloud.ZkController – No LogReplay needed for core=dyCollection1_shard2_replica1 baseURL=http://server1.mydomain.com:8081/solr
471646972 [Thread-2343] INFO org.apache.solr.cloud.ZkController – Core needs to recover:dyCollection1_shard2_replica1
471646973 [Thread-2343] INFO org.apache.solr.update.DefaultSolrCoreState – Running recovery - first canceling any ongoing recovery
471647606 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – Starting recovery process. core=dyCollection1_shard2_replica1 recoveringAfterStartup=true
471648601 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – ####### Found new versions added after startup: num=33
471648628 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – ###### currentVersions=[1482267976600125440, 1482267976541405184, 1482267964838248448, 1482267962649870336, 1482267919451684864, 1482267919392964608, 1482267918793179136, 1482267918732361728, 1482267868830629888, 1482267868770861056, 1482267866553122816, 1482267866495451136, 1482267855821996032, 1482267854691631104, 1482267848546975744, 1482267848487206912, 1482267838120984576, 1482267838058070016, 1482267833656147968, 1482267833596379136, 1482267819169021952, 1482267819110301696, 1482267819050532864, 1482267818987618304, 1482267814068748288, 1482267800491786240, 1482267795263586304, 1482267795202768896, 1482267780293066752, 1482267759067791360, 1482267730781405184, 1482267699959562240, 1482267699897696256]
471648628 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – ###### startupVersions=[]
471648628 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – Publishing state of core dyCollection1_shard2_replica1 as recovering, leader is http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/ and I am http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/
471648628 [RecoveryThread] INFO org.apache.solr.cloud.ZkController – publishing core=dyCollection1_shard2_replica1 state=recovering collection=dyCollection1
471648793 [zkCallback-2-thread-11] INFO org.apache.solr.common.cloud.ZkStateReader – A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 6)
471649248 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – Sending prep recovery command to http://server3.mydomain.com:8081/solr; WaitForState: action=PREPRECOVERY&core=dyCollection1_shard2_replica2&nodeName=server1.mydomain.com%3A8081_solr&coreNodeName=core_node1&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
471651448 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – Attempting to PeerSync from http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/ core=dyCollection1_shard2_replica1 - recoveringAfterStartup=true
471651690 [RecoveryThread] INFO org.apache.solr.update.PeerSync – PeerSync: core=dyCollection1_shard2_replica1 url=http://server1.mydomain.com:8081/solr START replicas=[http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/] nUpdates=100
471652187 [RecoveryThread] WARN org.apache.solr.update.PeerSync – no frame of reference to tell if we've missed updates
471652187 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – *PeerSync Recovery was not successful - trying replication.* core=dyCollection1_shard2_replica1
471652187 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – Starting Replication Recovery. core=dyCollection1_shard2_replica1
471652187 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – Begin buffering updates. core=dyCollection1_shard2_replica1
471652471 [RecoveryThread] INFO org.apache.solr.update.UpdateLog – Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=null}
471652478 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – Attempting to replicate from http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/. core=dyCollection1_shard2_replica1
471653514 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – No value set for 'pollInterval'. Timer Task not started.
471653568 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – Master's generation: 10685
471653568 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – Slave's generation: 10713
471653569 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – Starting replication process
471653943 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – Number of files in latest index in master: 108
471653944 [RecoveryThread] INFO org.apache.solr.core.CachingDirectoryFactory – return new directory for /opt/solr/home1/dyCollection1_shard2_replica1/data/index.20141018111139463
471654573 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – Starting download to NRTCachingDirectory(MMapDirectory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index.20141018111139463 lockFactory=NativeFSLockFactory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index.20141018111139463; maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
471834454 [zkCallback-2-thread-12] INFO org.apache.solr.common.cloud.ZkStateReader – A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 6)
471897454 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – Total time taken for download : 243 secs
471898551 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – New index installed. Updating index properties... index=index.20141018111139463
471898932 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – removing old index directory NRTCachingDirectory(MMapDirectory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index lockFactory=NativeFSLockFactory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index; maxCacheMB=48.0 maxMergeSizeMB=4.0)
471898932 [RecoveryThread] INFO org.apache.solr.update.DefaultSolrCoreState – Creating new IndexWriter...
471898934 [RecoveryThread] INFO org.apache.solr.update.DefaultSolrCoreState – Waiting until IndexWriter is unused... core=dyCollection1_shard2_replica1
471898934 [RecoveryThread] INFO org.apache.solr.update.DefaultSolrCoreState – Rollback old IndexWriter... core=dyCollection1_shard2_replica1
471904192 [RecoveryThread] INFO org.apache.solr.core.SolrCore – New index directory detected: old=/opt/solr/home1/dyCollection1_shard2_replica1/data/index/ new=/opt/solr/home1/dyCollection1_shard2_replica1/data/index.20141018111139463
471904907 [RecoveryThread] INFO org.apache.solr.core.SolrCore – SolrDeletionPolicy.onInit: commits: num=1 commit{dir=NRTCachingDirectory(MMapDirectory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index.20141018111139463 lockFactory=NativeFSLockFactory@/opt/solr/home1/dyCollection1_shard2_replica1/data/index.20141018111139463; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_88t,generation=10685}
471904907 [RecoveryThread] INFO org.apache.solr.core.SolrCore – newest commit generation = 10685

On Fri, Oct 17, 2014 at 1:12 PM, S.L <simpleliving...@gmail.com> wrote:

> Shawn,
>
> Just wondering if you have any other suggestions on what the next steps
> should be? Thanks.
>
> On Thu, Oct 16, 2014 at 11:12 PM, S.L <simpleliving...@gmail.com> wrote:
>
>> Shawn,
>>
>> 1. I will upgrade to the _67 JVM shortly.
>> 2. This is a new collection; I was facing a similar issue on 4.7 and,
>> based on Erick's recommendation, I upgraded to 4.10.1 and created a new
>> collection.
>> 3. Yes, I am hitting the replicas of the same shard, and I see the
>> lists are completely non-overlapping. I am using CloudSolrServer to add
>> the documents.
>> 4. I have a 3-physical-node cluster, with each node having 16GB of
>> memory.
>> 5. I also have a custom request handler defined in my solrconfig.xml
>> as below. However, I am not using it; I am only using the default
>> select handler. The MyCustomHandler class has been added to the source
>> and included in the build, but it is not serving any requests yet.
>>
>> <requestHandler name="/mycustomselect" class="solr.MyCustomHandler" startup="lazy">
>>   <lst name="defaults">
>>     <str name="df">suggestAggregate</str>
>>
>>     <str name="spellcheck.dictionary">direct</str>
>>     <!--<str name="spellcheck.dictionary">wordbreak</str>-->
>>     <str name="spellcheck">on</str>
>>     <str name="spellcheck.extendedResults">true</str>
>>     <str name="spellcheck.count">10</str>
>>     <str name="spellcheck.alternativeTermCount">5</str>
>>     <str name="spellcheck.maxResultsForSuggest">5</str>
>>     <str name="spellcheck.collate">true</str>
>>     <str name="spellcheck.collateExtendedResults">true</str>
>>     <str name="spellcheck.maxCollationTries">10</str>
>>     <str name="spellcheck.maxCollations">5</str>
>>   </lst>
>>   <arr name="last-components">
>>     <str>spellcheck</str>
>>   </arr>
>> </requestHandler>
>>
>> 6. The clusterstate.json is copied below:
>>
>> {"dyCollection1":{
>>     "shards":{
>>       "shard1":{
>>         "range":"80000000-d554ffff",
>>         "state":"active",
>>         "replicas":{
>>           "core_node3":{
>>             "state":"active",
>>             "core":"dyCollection1_shard1_replica1",
>>             "node_name":"server3.mydomain.com:8082_solr",
>>             "base_url":"http://server3.mydomain.com:8082/solr"},
>>           "core_node4":{
>>             "state":"active",
>>             "core":"dyCollection1_shard1_replica2",
>>             "node_name":"server2.mydomain.com:8081_solr",
>>             "base_url":"http://server2.mydomain.com:8081/solr",
>>             "leader":"true"}}},
>>       "shard2":{
>>         "range":"d5550000-2aa9ffff",
>>         "state":"active",
>>         "replicas":{
>>           "core_node1":{
>>             "state":"active",
>>             "core":"dyCollection1_shard2_replica1",
>>             "node_name":"server1.mydomain.com:8081_solr",
>>             "base_url":"http://server1.mydomain.com:8081/solr",
>>             "leader":"true"},
>>           "core_node6":{
>>             "state":"active",
>>             "core":"dyCollection1_shard2_replica2",
>>             "node_name":"server3.mydomain.com:8081_solr",
>>             "base_url":"http://server3.mydomain.com:8081/solr"}}},
>>       "shard3":{
>>         "range":"2aaa0000-7fffffff",
>>         "state":"active",
>>         "replicas":{
>>           "core_node2":{
>>             "state":"active",
>>             "core":"dyCollection1_shard3_replica2",
>>             "node_name":"server1.mydomain.com:8082_solr",
>>             "base_url":"http://server1.mydomain.com:8082/solr",
>>             "leader":"true"},
>>           "core_node5":{
>>             "state":"active",
>>             "core":"dyCollection1_shard3_replica1",
>>             "node_name":"server2.mydomain.com:8082_solr",
>>             "base_url":"http://server2.mydomain.com:8082/solr"}}}},
>>     "maxShardsPerNode":"1",
>>     "router":{"name":"compositeId"},
>>     "replicationFactor":"2",
>>     "autoAddReplicas":"false"}}
>>
>> Thanks!
>>
>> On Thu, Oct 16, 2014 at 9:02 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>>
>>> On 10/16/2014 6:27 PM, S.L wrote:
>>>
>>>> 1. Java Version: java version "1.7.0_51"
>>>> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
>>>> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
>>>
>>> I believe that build 51 is one of those that is known to have bugs
>>> related to Lucene. If you can upgrade this to 67, that would be good, but
>>> I don't know that it's a pressing matter. It looks like the Oracle JVM,
>>> which is good.
>>>
>>>> 2. OS
>>>> CentOS Linux release 7.0.1406 (Core)
>>>>
>>>> 3. Everything is 64-bit: OS, Java, and CPU.
>>>>
>>>> 4. Java Args.
>>>> -Djava.io.tmpdir=/opt/tomcat1/temp
>>>> -Dcatalina.home=/opt/tomcat1
>>>> -Dcatalina.base=/opt/tomcat1
>>>> -Djava.endorsed.dirs=/opt/tomcat1/endorsed
>>>> -DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181,server3.mydomain.com:2181
>>>> -DzkClientTimeout=20000
>>>> -DhostContext=solr
>>>> -Dport=8081
>>>> -Dhost=server1.mydomain.com
>>>> -Dsolr.solr.home=/opt/solr/home1
>>>> -Dfile.encoding=UTF8
>>>> -Duser.timezone=UTC
>>>> -XX:+UseG1GC
>>>> -XX:MaxPermSize=128m
>>>> -XX:PermSize=64m
>>>> -Xmx2048m
>>>> -Xms128m
>>>> -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
>>>> -Djava.util.logging.config.file=/opt/tomcat1/conf/logging.properties
>>>
>>> I would not use the G1 collector myself, but with the heap at only 2GB,
>>> I don't know that it matters all that much. Even a worst-case collection
>>> probably is not going to take more than a few seconds, and you've already
>>> increased the zookeeper client timeout.
>>>
>>> http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
>>>
>>>> 5. The Zookeeper ensemble has 3 zookeeper instances, which are external
>>>> and not embedded.
>>>>
>>>> 6. Container: Apache Tomcat version 7.0.42
>>>>
>>>> *Additional Observations:*
>>>>
>>>> I queried all docs on both replicas with distrib=false&fl=id&sort=id+asc
>>>> and compared the two lists. Eyeballing the first few lines of ids in
>>>> each list, I can say that even though the lists hold an equal number of
>>>> documents (96309 each), the document ids in them seem to be *mutually
>>>> exclusive*: I did not find even a single common id, after trying at
>>>> least 15 manually. It looks to me like the replicas are disjoint sets.
>>>
>>> Are you sure you hit both replicas of the same shard number? If you
>>> are, then it sounds like something is going wrong with your document
>>> routing, or maybe your clusterstate is really messed up. Recreating the
>>> collection from scratch and doing a full reindex might be a good plan ...
>>> assuming this is possible for you. You could create a whole new
>>> collection, and then when you're ready to switch, delete the original
>>> collection and create an alias so your app can still use the old name.
>>>
>>> How much total RAM do you have on these systems, and how large are those
>>> index shards? With a shard having 96K documents, it sounds like your
>>> whole index is probably just shy of 300K documents.
>>>
>>> Thanks,
>>> Shawn
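For anyone wanting to repeat the replica comparison described in the observations above, a rough sketch follows. Assumptions: the replica URLs come from the clusterstate quoted earlier, the rows value just needs to exceed the reported 96309 doc count, and the CSV response writer is used so each id lands on its own line.

```shell
# Sketch of the per-replica id comparison. fetch_ids pulls the full id list
# from one replica core (distrib=false so only that core answers).
fetch_ids() {
  # $1 = base URL of one replica core
  curl -s "$1/select?q=*:*&distrib=false&fl=id&sort=id+asc&rows=200000&wt=csv" \
    | tail -n +2   # drop the CSV header line ("id")
}

# common_ids prints only the ids present in BOTH files. For a healthy shard
# this should be the whole list; for truly disjoint replicas it is empty.
common_ids() {
  sort "$1" > ids_a.sorted
  sort "$2" > ids_b.sorted
  comm -12 ids_a.sorted ids_b.sorted
  rm -f ids_a.sorted ids_b.sorted
}

# Example run against shard2 of this cluster:
# fetch_ids http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1 > r1.ids
# fetch_ids http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2 > r2.ids
# common_ids r1.ids r2.ids | wc -l   # 0 would confirm the disjoint-sets observation
```

This automates the eyeball check from the thread: instead of manually scanning 15 ids, `comm` reports the exact overlap between the two lists.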