Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Shawn,

Just wanted to follow up. I still face this issue of inconsistent search results on SolrCloud 4.10.1. On looking further into the logs, I found a few exceptions; the most obvious were ZooKeeper connection timeout issues, along with other exceptions. Please take a look.

*Logs*

/opt/tomcat1/logs/catalina.out:103651230 [http-bio-8081-exec-206] WARN org.apache.solr.handler.ReplicationHandler – Exception while writing response for params: file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException: /opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651579 [http-bio-8081-exec-206] WARN org.apache.solr.handler.ReplicationHandler – Exception while writing response for params: file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException: /opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651586 [http-bio-8081-exec-206] WARN org.apache.solr.handler.ReplicationHandler – Exception while writing response for params: file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException: /opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651592 [http-bio-8081-exec-206] WARN org.apache.solr.handler.ReplicationHandler – Exception while writing response for params: file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException: /opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651600 [http-bio-8081-exec-206] WARN org.apache.solr.handler.ReplicationHandler – Exception while writing response for params: file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException: /opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651611 [http-bio-8081-exec-203] WARN org.apache.solr.handler.ReplicationHandler – Exception while writing response for params: file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException: /opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)

471640118 [localhost-startStop-1-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager – Watcher org.apache.solr.common.cloud.ConnectionManager@2a7dcd74 name:ZooKeeperConnection Watcher:server1.mydomain.com:2181,server2.mydomain.com:2181,server3.mydomain.com:2181 got event WatchedEvent state:Disconnected type:None path:null path:null type:None
471640120 [localhost-startStop-1-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager – zkClient has disconnected
471642457 [zkCallback-2-thread-8] INFO org.apache.solr.cloud.DistributedQueue – LatchChildWatcher fired on path: null state: Expired type None
471642458 [localhost-startStop-1-EventThread] INFO org.apache.solr.common.cloud.ConnectionManager –
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Shawn,

Just wondering if you have any other suggestions on what the next steps should be?

Thanks.

On Thu, Oct 16, 2014 at 11:12 PM, S.L simpleliving...@gmail.com wrote:

Shawn,

1. I will upgrade the JVM to build 67 shortly.
2. This is a new collection; I was facing a similar issue in 4.7, and based on Erick's recommendation I upgraded to 4.10.1 and created a new collection.
3. Yes, I am hitting the replicas of the same shard, and I see that the lists are completely non-overlapping. I am using CloudSolrServer to add the documents.
4. I have a cluster of 3 physical nodes, each with 16GB of memory.
5. I also have a custom request handler defined in my solrconfig.xml, as below. I am not using it; I am only using the default select handler. My MyCustomHandler class has been added to the source and included in the build, but it is not being used for any requests yet.

<requestHandler name="/mycustomselect" class="solr.MyCustomHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">suggestAggregate</str>
    <str name="spellcheck.dictionary">direct</str>
    <!--<str name="spellcheck.dictionary">wordbreak</str>-->
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

6.
The clusterstate.json is copied below {dyCollection1:{ shards:{ shard1:{ range:8000-d554, state:active, replicas:{ core_node3:{ state:active, core:dyCollection1_shard1_replica1, node_name:server3.mydomain.com:8082_solr, base_url:http://server3.mydomain.com:8082/solr}, core_node4:{ state:active, core:dyCollection1_shard1_replica2, node_name:server2.mydomain.com:8081_solr, base_url:http://server2.mydomain.com:8081/solr;, leader:true}}}, shard2:{ range:d555-2aa9, state:active, replicas:{ core_node1:{ state:active, core:dyCollection1_shard2_replica1, node_name:server1.mydomain.com:8081_solr, base_url:http://server1.mydomain.com:8081/solr;, leader:true}, core_node6:{ state:active, core:dyCollection1_shard2_replica2, node_name:server3.mydomain.com:8081_solr, base_url:http://server3.mydomain.com:8081/solr}}}, shard3:{ range:2aaa-7fff, state:active, replicas:{ core_node2:{ state:active, core:dyCollection1_shard3_replica2, node_name:server1.mydomain.com:8082_solr, base_url:http://server1.mydomain.com:8082/solr;, leader:true}, core_node5:{ state:active, core:dyCollection1_shard3_replica1, node_name:server2.mydomain.com:8082_solr, base_url:http://server2.mydomain.com:8082/solr, maxShardsPerNode:1, router:{name:compositeId}, replicationFactor:2, autoAddReplicas:false}} Thanks! On Thu, Oct 16, 2014 at 9:02 PM, Shawn Heisey apa...@elyograg.org wrote: On 10/16/2014 6:27 PM, S.L wrote: 1. Java Version :java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) I believe that build 51 is one of those that is known to have bugs related to Lucene. If you can upgrade this to 67, that would be good, but I don't know that it's a pressing matter. It looks like the Oracle JVM, which is good. 2.OS CentOS Linux release 7.0.1406 (Core) 3. Everything is 64 bit , OS , Java , and CPU. 4. Java Args. 
-Djava.io.tmpdir=/opt/tomcat1/temp -Dcatalina.home=/opt/tomcat1 -Dcatalina.base=/opt/tomcat1 -Djava.endorsed.dirs=/opt/tomcat1/endorsed -DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181, server3.mydomain.com:2181 -DzkClientTimeout=2 -DhostContext=solr -Dport=8081 -Dhost=server1.mydomain.com -Dsolr.solr.home=/opt/solr/home1 -Dfile.encoding=UTF8 -Duser.timezone=UTC -XX:+UseG1GC -XX:MaxPermSize=128m -XX:PermSize=64m -Xmx2048m -Xms128m -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.logging.config.file=/opt/tomcat1/conf/ logging.properties I would not use the G1 collector myself, but with the heap at
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Shawn,

Please find the answers to your questions.

1. Java version:
java version 1.7.0_51
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

2. OS: CentOS Linux release 7.0.1406 (Core)

3. Everything is 64-bit: OS, Java, and CPU.

4. Java args:
-Djava.io.tmpdir=/opt/tomcat1/temp
-Dcatalina.home=/opt/tomcat1
-Dcatalina.base=/opt/tomcat1
-Djava.endorsed.dirs=/opt/tomcat1/endorsed
-DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181,server3.mydomain.com:2181
-DzkClientTimeout=2
-DhostContext=solr
-Dport=8081
-Dhost=server1.mydomain.com
-Dsolr.solr.home=/opt/solr/home1
-Dfile.encoding=UTF8
-Duser.timezone=UTC
-XX:+UseG1GC
-XX:MaxPermSize=128m
-XX:PermSize=64m
-Xmx2048m
-Xms128m
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Djava.util.logging.config.file=/opt/tomcat1/conf/logging.properties

5. The ZooKeeper ensemble has 3 ZooKeeper instances, which are external and not embedded.

6. Container: I am using Apache Tomcat version 7.0.42.

*Additional Observations:*

I queried all docs on both replicas with distrib=false&fl=id&sort=id+asc, then compared the two lists. Eyeballing the first few lines of ids in both lists, I could see that even though each list has an equal number of documents (96309 each), the document ids in them seem to be *mutually exclusive*. I did not find even a single common id in those lists; I checked at least 15 manually. It looks to me like the replicas are disjoint sets.

Thanks.

On Thu, Oct 16, 2014 at 1:41 AM, Shawn Heisey apa...@elyograg.org wrote:

On 10/15/2014 10:24 PM, S.L wrote:

Yes, I tried those two queries with distrib=false. I get 0 results for the first query and 1 result for the second query (i.e. server 3 shard 2 replica 2), consistently. However, if I run the same second query (i.e.
server 3 shard 2 replica 2) with distrib=true, I sometimes get a result and sometimes not , should'nt this query always return a result when its pointing to a core that seems to have that document regardless of distrib=true or false ? Unfortunately I dont see anything particular in the logs to point to any information. BTW you asked me to replace the request handler , I use the select request handler ,so I cannot replace it with anything else , is that a problem ? If you send the query with distrib=true (which is the default value in SolrCloud), then it treats it just as if you had sent it to /solr/collection instead of /solr/collection_shardN_replicaN, so it's a full distributed query. The distrib=false is required to turn that behavior off and ONLY query the index on the actual core where you sent it. I only said to replace those things as appropriate. Since you are using /select, it's no problem that you left it that way. If I were to assume that you used /select, but you didn't, the URLs as I wrote them might not have worked. As discussed, this means that your replicas are truly out of sync. It's difficult to know what caused it, especially if you can't see anything in the log when you indexed the missing documents. We know you're on Solr 4.10.1. This means that your Java is a 1.7 version, since Java7 is required. Here's where I ask a whole lot of questions about your setup. What is the precise Java version, and which vendor's Java are you using? What operating system is it on? Is everything 64-bit, or is any piece (CPU, OS, Java) 32-bit? On the Solr admin UI dashboard, it lists all parameters used when starting Java, labelled as Args. Can you include those? Is zookeeper external, or embedded in Solr? Is it a 3-server (or more) ensemble? Are you using the example jetty, or did you provide your own servlet container? We recommend 64-bit Oracle Java, the latest 1.7 version. 
OpenJDK (since version 1.7.x) should be pretty safe as well, but IBM's Java should be avoided. IBM does very aggressive runtime optimizations. These can make programs run faster, but they are known to negatively affect Lucene/Solr. Thanks, Shawn
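
The difference between a distributed query and a core-only query, as described above, comes down to the distrib parameter on a core-specific URL. A minimal sketch in Python of how such test URLs could be built; the helper name core_query_url is hypothetical, and the host, core, and id are simply the ones quoted in this thread:

```python
from urllib.parse import urlencode


def core_query_url(base_url, core, doc_id, distrib):
    """Build a /select URL aimed at one specific core.

    With distrib=False the core should answer only from its own index;
    with distrib=True the request becomes a full distributed query.
    """
    params = {
        "q": "*:*",
        "fq": f"id:{doc_id}",
        "distrib": "true" if distrib else "false",
    }
    # Keep '*' and ':' readable instead of percent-encoding them.
    return f"{base_url}/{core}/select?{urlencode(params, safe='*:')}"


local_only = core_query_url(
    "http://server3.mydomain.com:8081/solr",
    "dyCollection1_shard2_replica2",
    "e8995da8-7d98-4010-93b4-8ff7dffb8bfb",
    distrib=False,
)
print(local_only)
```

Requesting the same URL with distrib=true would be expected to fan out to one replica of every shard, which is why its result can vary when replicas disagree.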
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
On 10/16/2014 6:27 PM, S.L wrote:

1. Java version:
java version 1.7.0_51
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

I believe that build 51 is one of those that is known to have bugs related to Lucene. If you can upgrade this to 67, that would be good, but I don't know that it's a pressing matter. It looks like the Oracle JVM, which is good.

2. OS: CentOS Linux release 7.0.1406 (Core)

3. Everything is 64-bit: OS, Java, and CPU.

4. Java args:
-Djava.io.tmpdir=/opt/tomcat1/temp
-Dcatalina.home=/opt/tomcat1
-Dcatalina.base=/opt/tomcat1
-Djava.endorsed.dirs=/opt/tomcat1/endorsed
-DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181,server3.mydomain.com:2181
-DzkClientTimeout=2
-DhostContext=solr
-Dport=8081
-Dhost=server1.mydomain.com
-Dsolr.solr.home=/opt/solr/home1
-Dfile.encoding=UTF8
-Duser.timezone=UTC
-XX:+UseG1GC
-XX:MaxPermSize=128m
-XX:PermSize=64m
-Xmx2048m
-Xms128m
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Djava.util.logging.config.file=/opt/tomcat1/conf/logging.properties

I would not use the G1 collector myself, but with the heap at only 2GB, I don't know that it matters all that much. Even a worst-case collection probably is not going to take more than a few seconds, and you've already increased the zookeeper client timeout.

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

5. The ZooKeeper ensemble has 3 ZooKeeper instances, which are external and not embedded.

6. Container: I am using Apache Tomcat version 7.0.42.

*Additional Observations:*

I queried all docs on both replicas with distrib=false&fl=id&sort=id+asc, then compared the two lists. Eyeballing the first few lines of ids in both lists, I could see that even though each list has an equal number of documents (96309 each), the document ids in them seem to be *mutually exclusive*. I did not find even a single common id in those lists; I checked at least 15 manually. It looks to me like the replicas are disjoint sets.

Are you sure you hit both replicas of the same shard number? If you are, then it sounds like something is going wrong with your document routing, or maybe your clusterstate is really messed up. Recreating the collection from scratch and doing a full reindex might be a good plan ... assuming this is possible for you. You could create a whole new collection, and then when you're ready to switch, delete the original collection and create an alias so your app can still use the old name.

How much total RAM do you have on these systems, and how large are those index shards? With a shard having 96K documents, it sounds like your whole index is probably just shy of 300K documents.

Thanks,
Shawn
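
The delete-then-alias switch suggested above goes through the Collections API. A minimal sketch in Python that only builds the two request URLs, assuming the rebuilt collection is called dyCollection2 (a hypothetical name, not one from this thread) and that the DELETE and CREATEALIAS actions are available on the Solr version in use; the helper collections_api_url is also hypothetical:

```python
from urllib.parse import urlencode


def collections_api_url(base_url, **params):
    """Build a Collections API URL from keyword parameters."""
    return f"{base_url}/admin/collections?{urlencode(params)}"


base = "http://server1.mydomain.com:8081/solr"

# Step 1: drop the original collection once the new one is fully indexed.
delete_url = collections_api_url(base, action="DELETE", name="dyCollection1")

# Step 2: point the old name at the new collection so the app keeps working.
alias_url = collections_api_url(
    base,
    action="CREATEALIAS",
    name="dyCollection1",
    collections="dyCollection2",  # hypothetical name of the rebuilt collection
)

print(delete_url)
print(alias_url)
```

The order matters: the alias reuses the old collection name, so the original collection would have to be gone before the alias is created.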
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Shawn,

1. I will upgrade the JVM to build 67 shortly.
2. This is a new collection; I was facing a similar issue in 4.7, and based on Erick's recommendation I upgraded to 4.10.1 and created a new collection.
3. Yes, I am hitting the replicas of the same shard, and I see that the lists are completely non-overlapping. I am using CloudSolrServer to add the documents.
4. I have a cluster of 3 physical nodes, each with 16GB of memory.
5. I also have a custom request handler defined in my solrconfig.xml, as below. I am not using it; I am only using the default select handler. My MyCustomHandler class has been added to the source and included in the build, but it is not being used for any requests yet.

<requestHandler name="/mycustomselect" class="solr.MyCustomHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">suggestAggregate</str>
    <str name="spellcheck.dictionary">direct</str>
    <!--<str name="spellcheck.dictionary">wordbreak</str>-->
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

6.
The clusterstate.json is copied below {dyCollection1:{ shards:{ shard1:{ range:8000-d554, state:active, replicas:{ core_node3:{ state:active, core:dyCollection1_shard1_replica1, node_name:server3.mydomain.com:8082_solr, base_url:http://server3.mydomain.com:8082/solr}, core_node4:{ state:active, core:dyCollection1_shard1_replica2, node_name:server2.mydomain.com:8081_solr, base_url:http://server2.mydomain.com:8081/solr;, leader:true}}}, shard2:{ range:d555-2aa9, state:active, replicas:{ core_node1:{ state:active, core:dyCollection1_shard2_replica1, node_name:server1.mydomain.com:8081_solr, base_url:http://server1.mydomain.com:8081/solr;, leader:true}, core_node6:{ state:active, core:dyCollection1_shard2_replica2, node_name:server3.mydomain.com:8081_solr, base_url:http://server3.mydomain.com:8081/solr}}}, shard3:{ range:2aaa-7fff, state:active, replicas:{ core_node2:{ state:active, core:dyCollection1_shard3_replica2, node_name:server1.mydomain.com:8082_solr, base_url:http://server1.mydomain.com:8082/solr;, leader:true}, core_node5:{ state:active, core:dyCollection1_shard3_replica1, node_name:server2.mydomain.com:8082_solr, base_url:http://server2.mydomain.com:8082/solr, maxShardsPerNode:1, router:{name:compositeId}, replicationFactor:2, autoAddReplicas:false}} Thanks! On Thu, Oct 16, 2014 at 9:02 PM, Shawn Heisey apa...@elyograg.org wrote: On 10/16/2014 6:27 PM, S.L wrote: 1. Java Version :java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) I believe that build 51 is one of those that is known to have bugs related to Lucene. If you can upgrade this to 67, that would be good, but I don't know that it's a pressing matter. It looks like the Oracle JVM, which is good. 2.OS CentOS Linux release 7.0.1406 (Core) 3. Everything is 64 bit , OS , Java , and CPU. 4. Java Args. 
-Djava.io.tmpdir=/opt/tomcat1/temp -Dcatalina.home=/opt/tomcat1 -Dcatalina.base=/opt/tomcat1 -Djava.endorsed.dirs=/opt/tomcat1/endorsed -DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181, server3.mydomain.com:2181 -DzkClientTimeout=2 -DhostContext=solr -Dport=8081 -Dhost=server1.mydomain.com -Dsolr.solr.home=/opt/solr/home1 -Dfile.encoding=UTF8 -Duser.timezone=UTC -XX:+UseG1GC -XX:MaxPermSize=128m -XX:PermSize=64m -Xmx2048m -Xms128m -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.logging.config.file=/opt/tomcat1/conf/logging.properties I would not use the G1 collector myself, but with the heap at only 2GB, I don't know that it matters all that much. Even a worst-case collection probably is not going to take more than a few seconds, and you've already increased the zookeeper client timeout. http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning 5.
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
surprised that this issue never got reported for 4.7 up until now. Thanks again for your help! On Mon, Oct 6, 2014 at 10:52 PM, Erick Erickson erickerick...@gmail.com wrote: I think there were some holes that would allow replicas and leaders to be out of synch that have been patched up in the last 3 releases. There shouldn't be anything you need to do to keep these in synch, so if you can capture what happened when things got out of synch we'll fix it. But a lot has changed in the last several months, so the first thing I'd do if possible is to upgrade to 4.10.1. Best, Erick On Mon, Oct 6, 2014 at 2:41 PM, S.L simpleliving...@gmail.com wrote: Hi Erick, Before I tried your suggestion of issung a commit=true update, I realized that for eaach shard there was atleast a node that had its index directory named like index.timestamp. I went ahead and deleted index directory that restarted that core and now the index directory got syched with the other node and is properly named as 'index' without any timestamp attached to it.This is now giving me consistent results for distrib=true using a load balancer.Also distrib=false returns expexted results for a given shard. The underlying issue appears to be that in every shard the leader and the replica(follower) were out of sych. How can I avoid this from happening again? Thanks for your help! Sent from my HTC - Reply message - From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Subject: SolrCloud 4.7 not doing distributed search when querying from a load balancer. Date: Fri, Oct 3, 2014 12:56 AM H. Assuming that you aren't re-indexing the doc you're searching for... Try issuing http://blah blah:8983/solr/collection/update?commit=true. That'll force all the docs to be searchable. Does 1 still hold for the document in question? Because this is exactly backwards of what I'd expect. 
I'd expect, if anything, the replica (I'm trying to call it the follower when a distinction needs to be made since the leader is a replica too) would be out of sync. This is still a Bad Thing, but the leader gets first crack at indexing thing. bq: only the replica of the shard that has this key returns the result , and the leader does not , Just to be sure we're talking about the same thing. When you say leader, you mean the shard leader, right? The filled-in circle on the graph view from the admin/cloud page. And let's see your soft and hard commit settings please. Best, Erick On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com wrote: Eirck, 0 Load balancer is out of the picture . 1When I query with *distrib=false* , I get consistent results as expected for those shards that dont have the key i.e I dont get the results back for those shards, however I just realized that while *distrib=false* is present in the query for the shard that is supposed to contain the key,only the replica of the shard that has this key returns the result , and the leader does not , looks like replica and the leader do not have the same data and replica seems to contain the key in the query for that shard. 2 By indexing I mean this collection is being populated by a web crawler. So looks like 1 above is pointing to leader and replica being out of synch for atleast one shard. On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson erickerick...@gmail.com wrote: bq: Also ,the collection is being actively indexed as I query this, could that be an issue too ? Not if the documents you're searching aren't being added as you search (and all your autocommit intervals have expired). I would turn off indexing for testing, it's just one more variable that can get in the way of understanding this. Do note that if the problem were endemic to Solr, there would probably be a _lot_ more noise out there. So to recap: 0 we can take the load balancer out of the picture all together. 
1 when you query each shard individually with distrib=true, every replica in a particular shard returns the same count. 2 when you query without distrib=true you get varying counts. This is very strange and not at all expected. Let's try it again without indexing going on And what do you mean by indexing anyway? How are documents being fed to your system? Best, Erick@PuzzledAsWell On Thu, Oct 2, 2014 at 7:32 PM, S.L
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
-bio-8081-exec-169] INFO org.apache.solr.core.SolrCore – [dyCollection1_shard2_replica1] webapp=/solr path=/select/ params={q=*:*&distrib=true&wt=json&fq=(id:220a8dce-3b31-4d46-8386-da8405595c47)} hits=1 status=0 QTime=7

*Autocommit and Soft commit settings.*

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>

On Tue, Oct 7, 2014 at 12:22 AM, Erick Erickson erickerick...@gmail.com wrote:

Note, I'm not guaranteeing that it'll actually cure the problem, just that enough has changed since 4.7 that it'd be a good place to start. Things have been reported off and on, but they're often pesky race conditions or something else that takes a long time to track down, you just are lucky perhaps ;)...

Erick

On Mon, Oct 6, 2014 at 8:04 PM, S.L simpleliving...@gmail.com wrote:

Erick,

Thanks for the suggestion. I am not sure I would be able to capture what went wrong, so upgrading to 4.10 seems easier, even though it means a day's work of effort :). I will go ahead and upgrade and let you know, although I am surprised that this issue never got reported for 4.7 up until now.

Thanks again for your help!

On Mon, Oct 6, 2014 at 10:52 PM, Erick Erickson erickerick...@gmail.com wrote:

I think there were some holes that would allow replicas and leaders to be out of synch that have been patched up in the last 3 releases. There shouldn't be anything you need to do to keep these in synch, so if you can capture what happened when things got out of synch we'll fix it. But a lot has changed in the last several months, so the first thing I'd do if possible is to upgrade to 4.10.1.

Best,
Erick

On Mon, Oct 6, 2014 at 2:41 PM, S.L simpleliving...@gmail.com wrote:

Hi Erick,

Before I tried your suggestion of issuing a commit=true update, I realized that for each shard there was at least one node that had its index directory named like index.timestamp.
I went ahead and deleted that index directory and restarted that core, and now the index directory got synced with the other node and is properly named 'index' without any timestamp attached to it. This is now giving me consistent results for distrib=true using a load balancer. Also, distrib=false returns the expected results for a given shard.

The underlying issue appears to be that in every shard the leader and the replica (follower) were out of sync. How can I avoid this happening again?

Thanks for your help!

Sent from my HTC

- Reply message -
From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org
Subject: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Date: Fri, Oct 3, 2014 12:56 AM

H. Assuming that you aren't re-indexing the doc you're searching for...

Try issuing http://blah blah:8983/solr/collection/update?commit=true. That'll force all the docs to be searchable. Does 1 still hold for the document in question? Because this is exactly backwards of what I'd expect. I'd expect, if anything, the replica (I'm trying to call it the follower when a distinction needs to be made since the leader is a replica too) would be out of sync. This is still a Bad Thing, but the leader gets first crack at indexing things.

bq: only the replica of the shard that has this key returns the result , and the leader does not ,

Just to be sure we're talking about the same thing. When you say leader, you mean the shard leader, right? The filled-in circle on the graph view from the admin/cloud page.

And let's see your soft and hard commit settings please.

Best,
Erick

On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com wrote:

Erick,

0 Load balancer is out of the picture.

1 When I query with *distrib=false*, I get consistent results as expected for those shards that don't have the key, i.e. I don't get results back for those shards. However, I just realized that when *distrib=false* is present in the query for the shard that is supposed to contain the key, only the replica of the shard that has this key returns the result, and the leader does not. It looks like the replica and the leader do not have the same data, and the replica seems to contain the key in the query for that shard.

2 By indexing I mean this collection is being populated by a web crawler.

So looks like 1 above
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
On 10/15/2014 9:26 PM, S.L wrote:

Look at the logging information I provided below. It looks like results are only returned from this SolrCloud cluster if the request goes to one of the two replicas of a shard.

I have verified that numDocs in the replicas for a given shard is the same, but there is a difference in maxDoc and deletedDocs. Does this signal that the replicas are out of sync? Even if numDocs is the same, how do we guarantee that those docs are identical and have the same uniqueKeys? Is there a way to verify this?

Since numDocs is the same across the replicas, and yet I get a result back only when the request goes to one of the replicas of the shard, I suspect that the documents within those replicas of a shard are not an exact replica set of each other.

I suspect the issue I am facing in 4.10.1 cloud is related to https://issues.apache.org/jira/browse/SOLR-4924 .

Can anyone please let me know how to solve this issue of intermittent no results for a query?

query with no results hits these cores:
server 2 shard 3 replica 1
server 3 shard 1 replica 1
server 1 shard 2 replica 1

query with 1 result hits these cores:
server 2 shard 1 replica 2
server 3 shard 2 replica 2 (found 1)
server 1 shard 3 replica 2

Here's some URLs for some testing. They are directed at specific shard replicas and are specifically NOT distributed queries:

http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/select?q=*:*&fq=id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb&distrib=false
http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/select?q=*:*&fq=id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb&distrib=false

If you run these queries (replacing server names and the /select request handler as appropriate), do you get 0 results on the first one and 1 result on the second one? If you do, then you've definitely got replicas out of sync. If you get 1 result on both queries, then something else is breaking. If by chance you have taken steps to fix this particular ID, pick another one that you know has a problem.

There is no automated way to detect replicas out of sync. You could request all docs on both replicas with distrib=false&fl=id&sort=id+asc, then compare the two lists. Depending on how many docs you have, those queries could take a while to run.

If the replicas are out of sync, are there any ERROR entries in the Solr log, especially at the time that the problem docs were indexed?

Thanks,
Shawn
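
The list comparison suggested above can be scripted once each replica's ids have been exported. A minimal sketch in Python, assuming the ids returned by the distrib=false id-only queries have already been collected into two lists; the function name compare_id_lists and the sample ids are hypothetical placeholders, not data from this cluster:

```python
def compare_id_lists(ids_a, ids_b):
    """Return (ids only in A, ids only in B, ids present in both), each sorted."""
    set_a, set_b = set(ids_a), set(ids_b)
    return sorted(set_a - set_b), sorted(set_b - set_a), sorted(set_a & set_b)


# Hypothetical sample ids standing in for the two exported lists.
replica1_ids = ["doc-1", "doc-2", "doc-3"]
replica2_ids = ["doc-2", "doc-3", "doc-4"]

only_1, only_2, common = compare_id_lists(replica1_ids, replica2_ids)
print("only on replica1:", only_1)
print("only on replica2:", only_2)
print("on both replicas:", len(common))
```

For replicas that are in sync, both "only on" lists should come back empty; equal document counts alone do not show that, since two disjoint sets can have the same size.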
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Shawn, Yes , I tried those two queries with distrib=false , I get 0 results for first and 1 result for the second query( (i.e. server 3 shard 2 replica 2) consistently. However if I run the same second query (i.e. server 3 shard 2 replica 2) with distrib=true, I sometimes get a result and sometimes not , should'nt this query always return a result when its pointing to a core that seems to have that document regardless of distrib=true or false ? Unfortunately I dont see anything particular in the logs to point to any information. BTW you asked me to replace the request handler , I use the select request handler ,so I cannot replace it with anything else , is that a problem ? Thanks. On Thu, Oct 16, 2014 at 12:05 AM, Shawn Heisey apa...@elyograg.org wrote: On 10/15/2014 9:26 PM, S.L wrote: Look at the logging information I provided below , looks like the results are only being returned back for this solrCloud cluster if the request goes to one of the two replicas of a shard. I have verified that numDocs in the replicas for a given shard is same but there is difference in the maxDoc and deletedDocs, does this signal the replicas being out of sync ? Even if the numDocs are same , how do we guarantee that those docs are identical and have the same uniquekeys , is there a way to verify this ? I am suspecting that as the numDocs is same across the replicas , and still only when the request goes to one of the replicas of the shard that I get a result back , the documents with in those replicas with in a shard are not an exact replica set of each other. I suspect the issue I am facing in 4.10.1 cloud is related to https://issues.apache.org/jira/browse/SOLR-4924 . Can anyone please let me know , how to solve this issue of intermittent no results for a query ? 
query with no results hits these cores:
server 2 shard 3 replica 1
server 3 shard 1 replica 1
server 1 shard 2 replica 1

query with 1 result hits these cores:
server 2 shard 1 replica 2
server 3 shard 2 replica 2 (found 1)
server 1 shard 3 replica 2

Here's some URLs for some testing. They are directed at specific shard replicas and are specifically NOT distributed queries:

http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/select?q=*:*&fq=id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb&distrib=false
http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/select?q=*:*&fq=id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb&distrib=false

If you run these queries (replacing server names and the /select request handler as appropriate), do you get 0 results on the first one and 1 result on the second one? If you do, then you've definitely got replicas out of sync. If you get 1 result on both queries, then something else is breaking.

If by chance you have taken steps to fix this particular ID, pick another one that you know has a problem.

There is no automated way to detect replicas out of sync. You could request all docs on both replicas with distrib=false&fl=id&sort=id+asc, then compare the two lists. Depending on how many docs you have, those queries could take a while to run.

If the replicas are out of sync, are there any ERROR entries in the Solr log, especially at the time that the problem docs were indexed?

Thanks, Shawn
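The "request all ids from both replicas, then compare the two lists" suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (replica_ids_url, fetch_ids, diff_ids) are hypothetical, the unique key field is assumed to be id, the /select handler is assumed, and plain start/rows paging is used, which may be slow on a large index.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def replica_ids_url(core_url, start=0, rows=1000):
    """Build a NON-distributed query URL that returns only ids, sorted by id."""
    params = {
        "q": "*:*",
        "fl": "id",            # assumes the unique key field is called "id"
        "sort": "id asc",
        "distrib": "false",    # query only the core we address
        "wt": "json",
        "start": start,
        "rows": rows,
    }
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def fetch_ids(core_url, rows=1000):
    """Page through one core and collect every id (network call, can be slow)."""
    ids, start = [], 0
    while True:
        with urlopen(replica_ids_url(core_url, start, rows)) as resp:
            docs = json.load(resp)["response"]["docs"]
        if not docs:
            return ids
        ids.extend(doc["id"] for doc in docs)
        start += len(docs)


def diff_ids(ids_a, ids_b):
    """Return (ids only in a, ids only in b), each as a sorted list."""
    a, b = set(ids_a), set(ids_b)
    return sorted(a - b), sorted(b - a)
```

To use it, one would call fetch_ids once per replica core URL of the same shard (for example the two shard2 URLs above, without the /select part) and pass both lists to diff_ids. Two empty lists would suggest the replicas hold the same set of ids; it says nothing about whether the document contents match.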
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
On 10/15/2014 10:24 PM, S.L wrote: Yes , I tried those two queries with distrib=false , I get 0 results for first and 1 result for the second query( (i.e. server 3 shard 2 replica 2) consistently. However if I run the same second query (i.e. server 3 shard 2 replica 2) with distrib=true, I sometimes get a result and sometimes not , should'nt this query always return a result when its pointing to a core that seems to have that document regardless of distrib=true or false ? Unfortunately I dont see anything particular in the logs to point to any information. BTW you asked me to replace the request handler , I use the select request handler ,so I cannot replace it with anything else , is that a problem ? If you send the query with distrib=true (which is the default value in SolrCloud), then it treats it just as if you had sent it to /solr/collection instead of /solr/collection_shardN_replicaN, so it's a full distributed query. The distrib=false is required to turn that behavior off and ONLY query the index on the actual core where you sent it. I only said to replace those things as appropriate. Since you are using /select, it's no problem that you left it that way. If I were to assume that you used /select, but you didn't, the URLs as I wrote them might not have worked. As discussed, this means that your replicas are truly out of sync. It's difficult to know what caused it, especially if you can't see anything in the log when you indexed the missing documents. We know you're on Solr 4.10.1. This means that your Java is a 1.7 version, since Java7 is required. Here's where I ask a whole lot of questions about your setup. What is the precise Java version, and which vendor's Java are you using? What operating system is it on? Is everything 64-bit, or is any piece (CPU, OS, Java) 32-bit? On the Solr admin UI dashboard, it lists all parameters used when starting Java, labelled as Args. Can you include those? Is zookeeper external, or embedded in Solr? 
Is it a 3-server (or more) ensemble? Are you using the example jetty, or did you provide your own servlet container? We recommend 64-bit Oracle Java, the latest 1.7 version. OpenJDK (since version 1.7.x) should be pretty safe as well, but IBM's Java should be avoided. IBM does very aggressive runtime optimizations. These can make programs run faster, but they are known to negatively affect Lucene/Solr. Thanks, Shawn
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Hi Erick,

Before I tried your suggestion of issuing a commit=true update, I realized that for each shard there was at least one node whose index directory was named like index.timestamp. I went ahead and deleted that index directory and restarted that core; the index directory then got synced with the other node and is now properly named 'index', without any timestamp attached to it. This is now giving me consistent results for distrib=true using a load balancer. Also, distrib=false returns the expected results for a given shard.

The underlying issue appears to be that in every shard the leader and the replica (follower) were out of sync. How can I avoid this happening again?

Thanks for your help!

Sent from my HTC

- Reply message -
From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org
Subject: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Date: Fri, Oct 3, 2014 12:56 AM

H. Assuming that you aren't re-indexing the doc you're searching for...

Try issuing http://blah blah:8983/solr/collection/update?commit=true. That'll force all the docs to be searchable. Does 1 still hold for the document in question? Because this is exactly backwards of what I'd expect. I'd expect, if anything, the replica (I'm trying to call it the follower when a distinction needs to be made, since the leader is a replica too) would be out of sync. This is still a Bad Thing, but the leader gets first crack at indexing things.

bq: only the replica of the shard that has this key returns the result, and the leader does not,

Just to be sure we're talking about the same thing. When you say leader, you mean the shard leader, right? The filled-in circle on the graph view from the admin/cloud page.

And let's see your soft and hard commit settings please.

Best, Erick

On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com wrote:

Erick,

0 Load balancer is out of the picture.
1 When I query with *distrib=false*, I get consistent results, as expected, for those shards that don't have the key, i.e. I don't get results back for those shards. However, I just realized that with *distrib=false* present in the query for the shard that is supposed to contain the key, only the replica of that shard returns the result and the leader does not. It looks like the replica and the leader do not have the same data, and the replica seems to contain the key in the query for that shard.

2 By indexing I mean this collection is being populated by a web crawler.

So it looks like 1 above points to the leader and replica being out of sync for at least one shard.

On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson erickerick...@gmail.com wrote:

bq: Also, the collection is being actively indexed as I query this, could that be an issue too?

Not if the documents you're searching aren't being added as you search (and all your autocommit intervals have expired). I would turn off indexing for testing; it's just one more variable that can get in the way of understanding this.

Do note that if the problem were endemic to Solr, there would probably be a _lot_ more noise out there.

So to recap:
0 we can take the load balancer out of the picture altogether.
1 when you query each shard individually with distrib=false, every replica in a particular shard returns the same count.
2 when you query without distrib=false you get varying counts.

This is very strange and not at all expected. Let's try it again without indexing going on.

And what do you mean by indexing anyway? How are documents being fed to your system?
Best, Erick@PuzzledAsWell

On Thu, Oct 2, 2014 at 7:32 PM, S.L simpleliving...@gmail.com wrote:

Erick, I would like to add that the interesting behavior, i.e. point #2 that I mentioned in my earlier reply, happens in all the shards. If this were a distributed search issue, it should not have manifested itself in the shard that contains the key that I am searching for; it looks like the search is just failing as a whole intermittently. Also, the collection is being actively indexed as I query this; could that be an issue too? Thanks.

On Thu, Oct 2, 2014 at 10:24 PM, S.L simpleliving...@gmail.com wrote:

Erick, Thanks for your reply, I tried your suggestions.
1. When not using the load balancer, if *I have distrib=false* I get consistent results across the replicas.
2. However, here's the interesting part: while not using the load balancer, if I *don't have distrib=false*, then when I query a particular node I get the same behaviour as if I were using a load balancer, meaning the distributed search from a node works intermittently. Does this give any clue?

On Thu, Oct 2, 2014 at 7:47 PM, Erick Erickson erickerick...@gmail.com wrote:

Hmmm, nothing quite makes sense here. Here are some experiments: 1 avoid the load balancer and issue queries like
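The index.timestamp observation reported in this message could be checked across cores with something like the sketch below. It assumes the on-disk layout <solr home>/<core name>/data/index seen in the log paths earlier in the thread; the function name is hypothetical. A timestamped directory is only what the poster happened to find next to the out-of-sync cores, not proof of a problem by itself.

```python
from pathlib import Path


def timestamped_index_dirs(solr_home):
    """List data/index.<suffix> directories under every core in solr_home.

    Assumes each core lives in <solr_home>/<core>/data/. Plain files that
    happen to match the pattern are skipped by the is_dir() check.
    """
    home = Path(solr_home)
    return sorted(p for p in home.glob("*/data/index.*") if p.is_dir())
```

For example, running timestamped_index_dirs("/opt/solr/home1") on each server would list any cores whose data directory holds an index.<suffix> directory, which are then candidates for the distrib=false comparison described elsewhere in the thread.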
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
I think there were some holes that would allow replicas and leaders to be out of synch that have been patched up in the last 3 releases. There shouldn't be anything you need to do to keep these in synch, so if you can capture what happened when things got out of synch we'll fix it. But a lot has changed in the last several months, so the first thing I'd do if possible is to upgrade to 4.10.1. Best, Erick On Mon, Oct 6, 2014 at 2:41 PM, S.L simpleliving...@gmail.com wrote: Hi Erick, Before I tried your suggestion of issung a commit=true update, I realized that for eaach shard there was atleast a node that had its index directory named like index.timestamp. I went ahead and deleted index directory that restarted that core and now the index directory got syched with the other node and is properly named as 'index' without any timestamp attached to it.This is now giving me consistent results for distrib=true using a load balancer.Also distrib=false returns expexted results for a given shard. The underlying issue appears to be that in every shard the leader and the replica(follower) were out of sych. How can I avoid this from happening again? Thanks for your help! Sent from my HTC - Reply message - From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Subject: SolrCloud 4.7 not doing distributed search when querying from a load balancer. Date: Fri, Oct 3, 2014 12:56 AM H. Assuming that you aren't re-indexing the doc you're searching for... Try issuing http://blah blah:8983/solr/collection/update?commit=true. That'll force all the docs to be searchable. Does 1 still hold for the document in question? Because this is exactly backwards of what I'd expect. I'd expect, if anything, the replica (I'm trying to call it the follower when a distinction needs to be made since the leader is a replica too) would be out of sync. This is still a Bad Thing, but the leader gets first crack at indexing thing. 
bq: only the replica of the shard that has this key returns the result , and the leader does not , Just to be sure we're talking about the same thing. When you say leader, you mean the shard leader, right? The filled-in circle on the graph view from the admin/cloud page. And let's see your soft and hard commit settings please. Best, Erick On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com wrote: Eirck, 0 Load balancer is out of the picture . 1When I query with *distrib=false* , I get consistent results as expected for those shards that dont have the key i.e I dont get the results back for those shards, however I just realized that while *distrib=false* is present in the query for the shard that is supposed to contain the key,only the replica of the shard that has this key returns the result , and the leader does not , looks like replica and the leader do not have the same data and replica seems to contain the key in the query for that shard. 2 By indexing I mean this collection is being populated by a web crawler. So looks like 1 above is pointing to leader and replica being out of synch for atleast one shard. On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson erickerick...@gmail.com wrote: bq: Also ,the collection is being actively indexed as I query this, could that be an issue too ? Not if the documents you're searching aren't being added as you search (and all your autocommit intervals have expired). I would turn off indexing for testing, it's just one more variable that can get in the way of understanding this. Do note that if the problem were endemic to Solr, there would probably be a _lot_ more noise out there. So to recap: 0 we can take the load balancer out of the picture all together. 1 when you query each shard individually with distrib=true, every replica in a particular shard returns the same count. 2 when you query without distrib=true you get varying counts. This is very strange and not at all expected. 
Let's try it again without indexing going on And what do you mean by indexing anyway? How are documents being fed to your system? Best, Erick@PuzzledAsWell On Thu, Oct 2, 2014 at 7:32 PM, S.L simpleliving...@gmail.com wrote: Erick, I would like to add that the interesting behavior i.e point #2 that I mentioned in my earlier reply happens in all the shards , if this were to be a distributed search issue this should have not manifested itself in the shard that contains the key that I am searching for , looks like the search is just failing as whole intermittently . Also ,the collection is being actively indexed as I query this, could that be an issue too ? Thanks. On Thu, Oct 2, 2014 at 10:24 PM, S.L simpleliving...@gmail.com wrote: Erick, Thanks for your reply, I tried your suggestions. 1 . When not using loadbalancer if *I have distrib=false* I get consistent results across the replicas. 2
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Erick, Thanks for the suggestion , I am not sure if I would be able to capture what went wrong , so upgrading to 4.10 seems easier even though it means , a days work of effort :) . I will go ahead and upgrade and let me know , although I am surprised that this issue never got reported for 4.7 up until now. Thanks again for your help! On Mon, Oct 6, 2014 at 10:52 PM, Erick Erickson erickerick...@gmail.com wrote: I think there were some holes that would allow replicas and leaders to be out of synch that have been patched up in the last 3 releases. There shouldn't be anything you need to do to keep these in synch, so if you can capture what happened when things got out of synch we'll fix it. But a lot has changed in the last several months, so the first thing I'd do if possible is to upgrade to 4.10.1. Best, Erick On Mon, Oct 6, 2014 at 2:41 PM, S.L simpleliving...@gmail.com wrote: Hi Erick, Before I tried your suggestion of issung a commit=true update, I realized that for eaach shard there was atleast a node that had its index directory named like index.timestamp. I went ahead and deleted index directory that restarted that core and now the index directory got syched with the other node and is properly named as 'index' without any timestamp attached to it.This is now giving me consistent results for distrib=true using a load balancer.Also distrib=false returns expexted results for a given shard. The underlying issue appears to be that in every shard the leader and the replica(follower) were out of sych. How can I avoid this from happening again? Thanks for your help! Sent from my HTC - Reply message - From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Subject: SolrCloud 4.7 not doing distributed search when querying from a load balancer. Date: Fri, Oct 3, 2014 12:56 AM H. Assuming that you aren't re-indexing the doc you're searching for... Try issuing http://blah blah:8983/solr/collection/update?commit=true. 
That'll force all the docs to be searchable. Does 1 still hold for the document in question? Because this is exactly backwards of what I'd expect. I'd expect, if anything, the replica (I'm trying to call it the follower when a distinction needs to be made since the leader is a replica too) would be out of sync. This is still a Bad Thing, but the leader gets first crack at indexing thing. bq: only the replica of the shard that has this key returns the result , and the leader does not , Just to be sure we're talking about the same thing. When you say leader, you mean the shard leader, right? The filled-in circle on the graph view from the admin/cloud page. And let's see your soft and hard commit settings please. Best, Erick On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com wrote: Eirck, 0 Load balancer is out of the picture . 1When I query with *distrib=false* , I get consistent results as expected for those shards that dont have the key i.e I dont get the results back for those shards, however I just realized that while *distrib=false* is present in the query for the shard that is supposed to contain the key,only the replica of the shard that has this key returns the result , and the leader does not , looks like replica and the leader do not have the same data and replica seems to contain the key in the query for that shard. 2 By indexing I mean this collection is being populated by a web crawler. So looks like 1 above is pointing to leader and replica being out of synch for atleast one shard. On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson erickerick...@gmail.com wrote: bq: Also ,the collection is being actively indexed as I query this, could that be an issue too ? Not if the documents you're searching aren't being added as you search (and all your autocommit intervals have expired). I would turn off indexing for testing, it's just one more variable that can get in the way of understanding this. 
Do note that if the problem were endemic to Solr, there would probably be a _lot_ more noise out there. So to recap: 0 we can take the load balancer out of the picture all together. 1 when you query each shard individually with distrib=true, every replica in a particular shard returns the same count. 2 when you query without distrib=true you get varying counts. This is very strange and not at all expected. Let's try it again without indexing going on And what do you mean by indexing anyway? How are documents being fed to your system? Best, Erick@PuzzledAsWell On Thu, Oct 2, 2014 at 7:32 PM, S.L simpleliving...@gmail.com wrote: Erick, I would like to add that the interesting behavior i.e point #2 that I mentioned in my earlier reply happens in all the shards , if this were to be a distributed search issue
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Not, I'm not guaranteeing that it'll actually cure the problem, just that enough has changed since 4.7 that it'd be a good place to start. Things have been reported off and on, but they're often pesky race conditions or something else that takes a long time to track down, you just are lucky perhaps ;)... Erick On Mon, Oct 6, 2014 at 8:04 PM, S.L simpleliving...@gmail.com wrote: Erick, Thanks for the suggestion , I am not sure if I would be able to capture what went wrong , so upgrading to 4.10 seems easier even though it means , a days work of effort :) . I will go ahead and upgrade and let me know , although I am surprised that this issue never got reported for 4.7 up until now. Thanks again for your help! On Mon, Oct 6, 2014 at 10:52 PM, Erick Erickson erickerick...@gmail.com wrote: I think there were some holes that would allow replicas and leaders to be out of synch that have been patched up in the last 3 releases. There shouldn't be anything you need to do to keep these in synch, so if you can capture what happened when things got out of synch we'll fix it. But a lot has changed in the last several months, so the first thing I'd do if possible is to upgrade to 4.10.1. Best, Erick On Mon, Oct 6, 2014 at 2:41 PM, S.L simpleliving...@gmail.com wrote: Hi Erick, Before I tried your suggestion of issung a commit=true update, I realized that for eaach shard there was atleast a node that had its index directory named like index.timestamp. I went ahead and deleted index directory that restarted that core and now the index directory got syched with the other node and is properly named as 'index' without any timestamp attached to it.This is now giving me consistent results for distrib=true using a load balancer.Also distrib=false returns expexted results for a given shard. The underlying issue appears to be that in every shard the leader and the replica(follower) were out of sych. How can I avoid this from happening again? Thanks for your help! 
Sent from my HTC - Reply message - From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Subject: SolrCloud 4.7 not doing distributed search when querying from a load balancer. Date: Fri, Oct 3, 2014 12:56 AM H. Assuming that you aren't re-indexing the doc you're searching for... Try issuing http://blah blah:8983/solr/collection/update?commit=true. That'll force all the docs to be searchable. Does 1 still hold for the document in question? Because this is exactly backwards of what I'd expect. I'd expect, if anything, the replica (I'm trying to call it the follower when a distinction needs to be made since the leader is a replica too) would be out of sync. This is still a Bad Thing, but the leader gets first crack at indexing thing. bq: only the replica of the shard that has this key returns the result , and the leader does not , Just to be sure we're talking about the same thing. When you say leader, you mean the shard leader, right? The filled-in circle on the graph view from the admin/cloud page. And let's see your soft and hard commit settings please. Best, Erick On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com wrote: Eirck, 0 Load balancer is out of the picture . 1When I query with *distrib=false* , I get consistent results as expected for those shards that dont have the key i.e I dont get the results back for those shards, however I just realized that while *distrib=false* is present in the query for the shard that is supposed to contain the key,only the replica of the shard that has this key returns the result , and the leader does not , looks like replica and the leader do not have the same data and replica seems to contain the key in the query for that shard. 2 By indexing I mean this collection is being populated by a web crawler. So looks like 1 above is pointing to leader and replica being out of synch for atleast one shard. 
On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson erickerick...@gmail.com wrote: bq: Also ,the collection is being actively indexed as I query this, could that be an issue too ? Not if the documents you're searching aren't being added as you search (and all your autocommit intervals have expired). I would turn off indexing for testing, it's just one more variable that can get in the way of understanding this. Do note that if the problem were endemic to Solr, there would probably be a _lot_ more noise out there. So to recap: 0 we can take the load balancer out of the picture all together. 1 when you query each shard individually with distrib=true, every replica in a particular shard returns the same count. 2 when you query without distrib=true you get varying counts. This is very strange and not at all expected. Let's try it again without indexing going
SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Hi All, I am trying to query a 6-node Solr 4.7 cluster with 3 shards and a replication factor of 2. I have fronted these 6 Solr nodes with a load balancer. What I notice is that every time I do a search of the form q=*:*&fq=(id:9e78c064-919f-4ef3-b236-dc66351b4acf), it gives me a result only once in every 3 tries, which tells me that the load balancer is distributing the requests between the 3 shards and SolrCloud only returns a result if the request goes to the core that has that id.

However, if I do a simple search like q=*:*, I consistently get the right aggregated results, with all the documents across all the shards, for every request from the load balancer. Can someone please let me know what this is symptomatic of? Somehow SolrCloud seems to be doing search query distribution and aggregation only for queries of type *:*. Thanks.
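For reference, a minimal Python sketch of how a query of this form could be built so that the q and fq parameters stay separated and encoded. The host name solr_server, the collection name collection and the helper build_select_url are placeholders for illustration, not names from the cluster described above.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, params):
    """Return a /select URL with encoded, '&'-separated query parameters."""
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


# Hypothetical host and collection; the id is the one quoted in the post.
url = build_select_url(
    "http://solr_server:8983/solr",
    "collection",
    {"q": "*:*", "fq": "id:9e78c064-919f-4ef3-b236-dc66351b4acf", "wt": "json"},
)
print(url)
```

Building the URL from a dict avoids the parameters running together when the request is pasted or copied around.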
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Hmmm, nothing quite makes sense here. Here are some experiments:

1 avoid the load balancer and issue queries like http://solr_server:8983/solr/collection/select?q=whatever&distrib=false. The distrib=false bit will keep SolrCloud from trying to send the queries anywhere; they'll be served only from the node you address them to. That'll help check whether the nodes are consistent. You should be getting back the same results from each replica in a shard (i.e. 2 of your 6 machines).

Next, try your failing query the same way.

Next, try your failing query from a browser, pointing it at successive nodes.

Where is the first place problems show up?

My _guess_ is that your load balancer isn't quite doing what you think, or your cluster isn't set up the way you think it is, but those are guesses.

Best, Erick
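The first experiment could be scripted roughly as follows. This is only a sketch under assumptions: it supposes each node answers /solr/<collection>/select for its local core, that wt=json is enabled, and that the JSON response carries response.numFound; the function names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def local_query_url(node_url, collection, query):
    """Build a select URL that is answered only by the addressed node."""
    params = {"q": query, "distrib": "false", "rows": 0, "wt": "json"}
    return f"{node_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


def http_get(url):
    """Fetch a URL and return the body as text (needs a reachable node)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def local_counts(node_urls, collection, query, fetch=http_get):
    """Return {node_url: numFound} for a non-distributed query on each node."""
    counts = {}
    for node_url in node_urls:
        body = fetch(local_query_url(node_url, collection, query))
        counts[node_url] = json.loads(body)["response"]["numFound"]
    return counts
```

Calling local_counts with the six node URLs and the failing query should show the same count for the two replicas of each shard; the fetch argument is injectable so the logic can be exercised without a live cluster.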
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Erick, thanks for your reply. I tried your suggestions.

1. When not using the load balancer, if I *have distrib=false*, I get consistent results across the replicas.

2. However, here's the interesting part: while not using the load balancer, if I *don't have distrib=false*, then when I query a particular node I get the same behaviour as if I were using the load balancer, meaning the distributed search from a node works intermittently. Does this give any clue?
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Erick, I would like to add that the interesting behavior, i.e. point #2 that I mentioned in my earlier reply, happens in all the shards. If this were a distributed search issue, it should not have manifested itself in the shard that contains the key I am searching for. It looks like the search is just failing as a whole, intermittently.

Also, the collection is being actively indexed as I query this. Could that be an issue too?

Thanks.
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
bq: Also, the collection is being actively indexed as I query this, could that be an issue too?

Not if the documents you're searching aren't being added as you search (and all your autocommit intervals have expired). I would turn off indexing for testing; it's just one more variable that can get in the way of understanding this.

Do note that if the problem were endemic to Solr, there would probably be a _lot_ more noise out there.

So to recap:
0 we can take the load balancer out of the picture altogether.
1 when you query each shard individually with distrib=false, every replica in a particular shard returns the same count.
2 when you query without distrib=false you get varying counts.

This is very strange and not at all expected. Let's try it again without indexing going on.

And what do you mean by indexing anyway? How are documents being fed to your system?

Best, Erick@PuzzledAsWell
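One way to put a number on the "varying counts" symptom is to send the same distributed query to a single node several times and tally the hit counts. The sketch below assumes a caller-supplied run_query callable that returns numFound for one request; both function names are hypothetical.

```python
from collections import Counter


def tally_results(run_query, attempts=10):
    """Run the same query several times and count each distinct hit count.

    run_query is any zero-argument callable returning numFound for one
    request, e.g. a wrapper around an HTTP call to a single node.
    """
    return Counter(run_query() for _ in range(attempts))


def is_stable(tally):
    """True when every attempt returned the same hit count."""
    return len(tally) <= 1
```

With indexing paused, a healthy collection would be expected to give a tally with a single key; more than one key means the answer depends on which replicas served the request.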
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Erick,

0. The load balancer is out of the picture.

1. When I query with *distrib=false*, I get consistent results, as expected, for those shards that don't have the key, i.e. I don't get results back for those shards. However, I just realized that with *distrib=false* in the query for the shard that is supposed to contain the key, only the replica of that shard returns the result and the leader does not. It looks like the replica and the leader do not have the same data, and the replica seems to contain the key in the query for that shard.

2. By indexing I mean this collection is being populated by a web crawler.

So it looks like 1 above points to the leader and replica being out of sync for at least one shard.
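The leader/replica mismatch described in point 1 can be checked mechanically once the per-replica counts from distrib=false queries are collected. A small sketch, with a hypothetical helper name and made-up numbers:

```python
def out_of_sync_shards(counts_by_shard):
    """Return the shards whose replicas disagree on the hit count.

    counts_by_shard maps shard name -> {replica name: numFound}, where each
    numFound came from a distrib=false query sent straight to that replica.
    """
    return {
        shard: counts
        for shard, counts in counts_by_shard.items()
        if len(set(counts.values())) > 1
    }


# Made-up numbers for illustration only, not taken from the cluster above.
example = {
    "shard1": {"leader": 0, "follower": 0},
    "shard2": {"leader": 0, "follower": 1},
    "shard3": {"leader": 0, "follower": 0},
}
print(out_of_sync_shards(example))
```

Any shard the function returns has replicas that hold different data for the query, which is the situation reported here for the shard that owns the id.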
Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.
Hmmm. Assuming that you aren't re-indexing the doc you're searching for... Try issuing http://blah blah:8983/solr/collection/update?commit=true. That'll force all the docs to be searchable. Does 1 still hold for the document in question? Because this is exactly backwards of what I'd expect. I'd expect, if anything, the replica (I'm trying to call it the follower when a distinction needs to be made, since the leader is a replica too) would be out of sync. This is still a Bad Thing, but the leader gets first crack at indexing things.

bq: only the replica of the shard that has this key returns the result, and the leader does not

Just to be sure we're talking about the same thing: when you say leader, you mean the shard leader, right? The filled-in circle on the graph view from the admin/cloud page.

And let's see your soft and hard commit settings please.

Best, Erick
On Thu, Oct 2, 2014 at 7:47 PM, Erick Erickson erickerick...@gmail.com wrote:

Hmmm, nothing quite makes sense here. Here are some experiments:

1 avoid the load balancer and issue queries like http://solr_server:8983/solr/collection/select?q=whatever&distrib=false. The distrib=false bit will keep SolrCloud from trying to send the queries anywhere; they'll be served only from the node you address them to. That'll help check whether the nodes are consistent. You should be getting back the same results from each replica in a shard (i.e. 2 of your 6 machines).

Next, try your failing query the same way.

Next, try your failing query from a browser, pointing it at successive nodes.

Where is the first place problems show up?

My _guess_ is that your load balancer isn't quite doing what you think, or your cluster isn't set up the way you think it is, but those are guesses.

Best, Erick

On Thu, Oct 2, 2014 at 2:51 PM, S.L simpleliving...@gmail.com wrote:

Hi All, I am trying to query a 6-node Solr 4.7 cluster with 3 shards and a replication factor of 2. I have fronted these 6 Solr