Shawn,
1. I will upgrade to JVM build 67 shortly.

2. This is a new collection. I was facing a similar issue on 4.7, and based on Erick's recommendation I upgraded to 4.10.1 and created a new collection.

3. Yes, I am hitting the replicas of the same shard, and I see that the lists are completely non-overlapping. I am using CloudSolrServer to add the documents.

4. I have a 3-node physical cluster, with each node having 16GB of memory.

5. I also have a custom request handler defined in my solrconfig.xml, as below. I am not using it yet; I am only using the default /select handler. The MyCustomHandler class has been added to the source and included in the build, but it is not serving any requests.

<requestHandler name="/mycustomselect" class="solr.MyCustomHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">suggestAggregate</str>
    <str name="spellcheck.dictionary">direct</str>
    <!--<str name="spellcheck.dictionary">wordbreak</str>-->
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
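For what it's worth, the eyeball comparison of the two replica id lists can be made exact with a few lines of Python instead of checking ids manually. This is just a sketch: the ids below are placeholders, and in practice each set would be loaded (one id per line) from the saved distrib=false query responses.

```python
# Exact overlap check between the id lists returned by each replica of the
# same shard via /select?q=*:*&distrib=false&fl=id&sort=id+asc.
# Placeholder data; in practice, load each set from the saved query output
# of the corresponding replica.
replica1_ids = {"doc-001", "doc-002", "doc-003"}
replica2_ids = {"doc-101", "doc-102", "doc-103"}

common = replica1_ids & replica2_ids      # ids present on both replicas
only_r1 = replica1_ids - replica2_ids     # ids present only on replica 1

print(f"{len(common)} ids in common, {len(only_r1)} only on replica 1")
# Two healthy replicas of one shard should have every id in common;
# zero overlap means the cores hold disjoint document sets.
```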
The clusterstate.json is copied below:

{"dyCollection1":{
    "shards":{
      "shard1":{
        "range":"80000000-d554ffff",
        "state":"active",
        "replicas":{
          "core_node3":{
            "state":"active",
            "core":"dyCollection1_shard1_replica1",
            "node_name":"server3.mydomain.com:8082_solr",
            "base_url":"http://server3.mydomain.com:8082/solr"},
          "core_node4":{
            "state":"active",
            "core":"dyCollection1_shard1_replica2",
            "node_name":"server2.mydomain.com:8081_solr",
            "base_url":"http://server2.mydomain.com:8081/solr",
            "leader":"true"}}},
      "shard2":{
        "range":"d5550000-2aa9ffff",
        "state":"active",
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"dyCollection1_shard2_replica1",
            "node_name":"server1.mydomain.com:8081_solr",
            "base_url":"http://server1.mydomain.com:8081/solr",
            "leader":"true"},
          "core_node6":{
            "state":"active",
            "core":"dyCollection1_shard2_replica2",
            "node_name":"server3.mydomain.com:8081_solr",
            "base_url":"http://server3.mydomain.com:8081/solr"}}},
      "shard3":{
        "range":"2aaa0000-7fffffff",
        "state":"active",
        "replicas":{
          "core_node2":{
            "state":"active",
            "core":"dyCollection1_shard3_replica2",
            "node_name":"server1.mydomain.com:8082_solr",
            "base_url":"http://server1.mydomain.com:8082/solr",
            "leader":"true"},
          "core_node5":{
            "state":"active",
            "core":"dyCollection1_shard3_replica1",
            "node_name":"server2.mydomain.com:8082_solr",
            "base_url":"http://server2.mydomain.com:8082/solr"}}}},
    "maxShardsPerNode":"1",
    "router":{"name":"compositeId"},
    "replicationFactor":"2",
    "autoAddReplicas":"false"}}

Thanks!

On Thu, Oct 16, 2014 at 9:02 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 10/16/2014 6:27 PM, S.L wrote:
>
>> 1. Java Version: java version "1.7.0_51"
>> Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
>> Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
>>
>
> I believe that build 51 is one of those that is known to have bugs related
> to Lucene. If you can upgrade this to 67, that would be good, but I don't
> know that it's a pressing matter.
> It looks like the Oracle JVM, which is good.
>
>> 2. OS
>> CentOS Linux release 7.0.1406 (Core)
>>
>> 3. Everything is 64-bit: OS, Java, and CPU.
>>
>> 4. Java Args.
>> -Djava.io.tmpdir=/opt/tomcat1/temp
>> -Dcatalina.home=/opt/tomcat1
>> -Dcatalina.base=/opt/tomcat1
>> -Djava.endorsed.dirs=/opt/tomcat1/endorsed
>> -DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181,
>> server3.mydomain.com:2181
>> -DzkClientTimeout=20000
>> -DhostContext=solr
>> -Dport=8081
>> -Dhost=server1.mydomain.com
>> -Dsolr.solr.home=/opt/solr/home1
>> -Dfile.encoding=UTF8
>> -Duser.timezone=UTC
>> -XX:+UseG1GC
>> -XX:MaxPermSize=128m
>> -XX:PermSize=64m
>> -Xmx2048m
>> -Xms128m
>> -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
>> -Djava.util.logging.config.file=/opt/tomcat1/conf/logging.properties
>>
>
> I would not use the G1 collector myself, but with the heap at only 2GB, I
> don't know that it matters all that much. Even a worst-case collection
> probably is not going to take more than a few seconds, and you've already
> increased the zookeeper client timeout.
>
> http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
>
>> 5. Zookeeper ensemble has 3 zookeeper instances, which are external and
>> are not embedded.
>>
>> 6. Container: I am using Apache Tomcat Version 7.0.42
>>
>> *Additional Observations:*
>>
>> I queried all docs on both replicas with distrib=false&fl=id&sort=id+asc,
>> then compared the two lists. Eyeballing the first few lines of ids in
>> each list, I could see that even though both lists have an equal number
>> of documents (96309 each), the document ids in them seem to be *mutually
>> exclusive*. I did not find even a single common id in those lists (I
>> tried at least 15 manually); it looks to me like the replicas are
>> disjoint sets.
>>
>
> Are you sure you hit both replicas of the same shard number?
> If you are, then it sounds like something is going wrong with your
> document routing, or maybe your clusterstate is really messed up.
> Recreating the collection from scratch and doing a full reindex might be a
> good plan ... assuming this is possible for you. You could create a whole
> new collection, and then when you're ready to switch, delete the original
> collection and create an alias so your app can still use the old name.
>
> How much total RAM do you have on these systems, and how large are those
> index shards? With a shard having 96K documents, it sounds like your whole
> index is probably just shy of 300K documents.
>
> Thanks,
> Shawn
>