Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-23 Thread S.L
Shawn ,

Just wanted to follow up. I still face this issue of inconsistent search
results on SolrCloud 4.10.1. On looking further into the logs, I found
a few exceptions. The most obvious were ZooKeeper connection timeouts,
along with some other exceptions. Please take a look.

*Logs*

/opt/tomcat1/logs/catalina.out:103651230 [http-bio-8081-exec-206] WARN
org.apache.solr.handler.ReplicationHandler  – Exception while writing
response for params:
file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException:
/opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651579 [http-bio-8081-exec-206] WARN
org.apache.solr.handler.ReplicationHandler  – Exception while writing
response for params:
file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException:
/opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651586 [http-bio-8081-exec-206] WARN
org.apache.solr.handler.ReplicationHandler  – Exception while writing
response for params:
file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException:
/opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651592 [http-bio-8081-exec-206] WARN
org.apache.solr.handler.ReplicationHandler  – Exception while writing
response for params:
file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException:
/opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651600 [http-bio-8081-exec-206] WARN
org.apache.solr.handler.ReplicationHandler  – Exception while writing
response for params:
file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException:
/opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
/opt/tomcat1/logs/catalina.out:103651611 [http-bio-8081-exec-203] WARN
org.apache.solr.handler.ReplicationHandler  – Exception while writing
response for params:
file=_68v.fnm&command=filecontent&checksum=true&wt=filestream&qt=/replication&generation=2410
/opt/tomcat1/logs/catalina.out:java.nio.file.NoSuchFileException:
/opt/solr/home1/dyCollection1_shard2_replica1/data/index/_68v.fnm
/opt/tomcat1/logs/catalina.out: at
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
471640118 [localhost-startStop-1-EventThread] INFO
org.apache.solr.common.cloud.ConnectionManager  – Watcher
org.apache.solr.common.cloud.ConnectionManager@2a7dcd74
name:ZooKeeperConnection Watcher:server1.mydomain.com:2181,
server2.mydomain.com:2181,server3.mydomain.com:2181 got event WatchedEvent
state:Disconnected type:None path:null path:null type:None
471640120 [localhost-startStop-1-EventThread] INFO
org.apache.solr.common.cloud.ConnectionManager  – zkClient has disconnected
471642457 [zkCallback-2-thread-8] INFO
org.apache.solr.cloud.DistributedQueue  – LatchChildWatcher fired on path:
null state: Expired type None
471642458 [localhost-startStop-1-EventThread] INFO
org.apache.solr.common.cloud.ConnectionManager  – 

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-17 Thread S.L
Shawn,

Just wondering if you have any other suggestions on what the next steps
would be? Thanks.

On Thu, Oct 16, 2014 at 11:12 PM, S.L simpleliving...@gmail.com wrote:

 Shawn ,


1. I will upgrade to the build 67 JVM shortly.
2. This is a new collection. I was facing a similar issue in 4.7, and
based on Erick's recommendation I upgraded to 4.10.1 and created a new
collection.
3. Yes, I am hitting the replicas of the same shard, and I see that the
lists are completely non-overlapping. I am using CloudSolrServer to add the
documents.
4. I have a 3-node physical cluster, with 16GB of memory on each node.
5. I also have a custom request handler defined in my solrconfig.xml,
as below. My MyCustomHandler class has been added to the source and
included in the build, but it is not being used for any requests yet; I am
only using the default select handler.

 <requestHandler name="/mycustomselect" class="solr.MyCustomHandler"
                 startup="lazy">
   <lst name="defaults">
     <str name="df">suggestAggregate</str>

     <str name="spellcheck.dictionary">direct</str>
     <!--<str name="spellcheck.dictionary">wordbreak</str>-->
     <str name="spellcheck">on</str>
     <str name="spellcheck.extendedResults">true</str>
     <str name="spellcheck.count">10</str>
     <str name="spellcheck.alternativeTermCount">5</str>
     <str name="spellcheck.maxResultsForSuggest">5</str>
     <str name="spellcheck.collate">true</str>
     <str name="spellcheck.collateExtendedResults">true</str>
     <str name="spellcheck.maxCollationTries">10</str>
     <str name="spellcheck.maxCollations">5</str>
   </lst>
   <arr name="last-components">
     <str>spellcheck</str>
   </arr>
 </requestHandler>


 6. The clusterstate.json is copied below

 {"dyCollection1":{
     "shards":{
       "shard1":{
         "range":"8000-d554",
         "state":"active",
         "replicas":{
           "core_node3":{
             "state":"active",
             "core":"dyCollection1_shard1_replica1",
             "node_name":"server3.mydomain.com:8082_solr",
             "base_url":"http://server3.mydomain.com:8082/solr"},
           "core_node4":{
             "state":"active",
             "core":"dyCollection1_shard1_replica2",
             "node_name":"server2.mydomain.com:8081_solr",
             "base_url":"http://server2.mydomain.com:8081/solr",
             "leader":"true"}}},
       "shard2":{
         "range":"d555-2aa9",
         "state":"active",
         "replicas":{
           "core_node1":{
             "state":"active",
             "core":"dyCollection1_shard2_replica1",
             "node_name":"server1.mydomain.com:8081_solr",
             "base_url":"http://server1.mydomain.com:8081/solr",
             "leader":"true"},
           "core_node6":{
             "state":"active",
             "core":"dyCollection1_shard2_replica2",
             "node_name":"server3.mydomain.com:8081_solr",
             "base_url":"http://server3.mydomain.com:8081/solr"}}},
       "shard3":{
         "range":"2aaa-7fff",
         "state":"active",
         "replicas":{
           "core_node2":{
             "state":"active",
             "core":"dyCollection1_shard3_replica2",
             "node_name":"server1.mydomain.com:8082_solr",
             "base_url":"http://server1.mydomain.com:8082/solr",
             "leader":"true"},
           "core_node5":{
             "state":"active",
             "core":"dyCollection1_shard3_replica1",
             "node_name":"server2.mydomain.com:8082_solr",
             "base_url":"http://server2.mydomain.com:8082/solr"}}}},
     "maxShardsPerNode":"1",
     "router":{"name":"compositeId"},
     "replicationFactor":"2",
     "autoAddReplicas":"false"}}

   Thanks!

 On Thu, Oct 16, 2014 at 9:02 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 10/16/2014 6:27 PM, S.L wrote:

 1. Java Version :java version 1.7.0_51
 Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)


 I believe that build 51 is one of those that is known to have bugs
 related to Lucene.  If you can upgrade this to 67, that would be good, but
 I don't know that it's a pressing matter.  It looks like the Oracle JVM,
 which is good.

  2.OS
 CentOS Linux release 7.0.1406 (Core)

 3. Everything is 64 bit , OS , Java , and CPU.

 4. Java Args.
  -Djava.io.tmpdir=/opt/tomcat1/temp
  -Dcatalina.home=/opt/tomcat1
  -Dcatalina.base=/opt/tomcat1
  -Djava.endorsed.dirs=/opt/tomcat1/endorsed
  -DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181,
 server3.mydomain.com:2181
  -DzkClientTimeout=2
  -DhostContext=solr
  -Dport=8081
  -Dhost=server1.mydomain.com
  -Dsolr.solr.home=/opt/solr/home1
  -Dfile.encoding=UTF8
  -Duser.timezone=UTC
  -XX:+UseG1GC
  -XX:MaxPermSize=128m
  -XX:PermSize=64m
  -Xmx2048m
  -Xms128m
  -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
  -Djava.util.logging.config.file=/opt/tomcat1/conf/
 logging.properties


 I would not use the G1 collector myself, but with the heap at 

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-16 Thread S.L
Shawn,

Please find the answers to your questions.

1. Java Version :java version 1.7.0_51
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

2.OS
CentOS Linux release 7.0.1406 (Core)

3. Everything is 64 bit , OS , Java , and CPU.

4. Java Args.
-Djava.io.tmpdir=/opt/tomcat1/temp
-Dcatalina.home=/opt/tomcat1
-Dcatalina.base=/opt/tomcat1
-Djava.endorsed.dirs=/opt/tomcat1/endorsed
-DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181,
server3.mydomain.com:2181
-DzkClientTimeout=2
-DhostContext=solr
-Dport=8081
-Dhost=server1.mydomain.com
-Dsolr.solr.home=/opt/solr/home1
-Dfile.encoding=UTF8
-Duser.timezone=UTC
-XX:+UseG1GC
-XX:MaxPermSize=128m
-XX:PermSize=64m
-Xmx2048m
-Xms128m
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
-Djava.util.logging.config.file=/opt/tomcat1/conf/logging.properties

5. Zookeeper ensemble has 3 zookeeper instances , which are external and
are not embedded.


6. Container: I am using Apache Tomcat version 7.0.42

*Additional Observations:*

I queried all docs on both replicas with distrib=false&fl=id&sort=id+asc,
then compared the two lists. Each list has the same number of documents
(96309), but from eyeballing the first few lines of IDs in both lists, the
document IDs in them appear to be *mutually exclusive*. I checked at least
15 IDs manually and did not find a single common ID, so it looks to me like
the replicas are disjoint sets.

Thanks.



On Thu, Oct 16, 2014 at 1:41 AM, Shawn Heisey apa...@elyograg.org wrote:

 On 10/15/2014 10:24 PM, S.L wrote:

 Yes , I tried those two queries with distrib=false , I get 0 results for
 first and 1 result  for the second query( (i.e. server 3 shard 2 replica
 2)  consistently.

 However if I run the same second query (i.e. server 3 shard 2 replica 2)
 with distrib=true, I sometimes get a result and sometimes not , should'nt
 this query always return a result when its pointing to a core that seems
 to
 have that document regardless of distrib=true or false ?

 Unfortunately I dont see anything particular in the logs to point to any
 information.

 BTW you asked me to replace the request handler , I use the select request
 handler ,so I cannot replace it with anything else , is that  a problem ?


 If you send the query with distrib=true (which is the default value in
 SolrCloud), then it treats it just as if you had sent it to
 /solr/collection instead of /solr/collection_shardN_replicaN, so it's a
 full distributed query. The distrib=false is required to turn that behavior
 off and ONLY query the index on the actual core where you sent it.

 I only said to replace those things as appropriate.  Since you are using
 /select, it's no problem that you left it that way. If I were to assume
 that you used /select, but you didn't, the URLs as I wrote them might not
 have worked.

 As discussed, this means that your replicas are truly out of sync.  It's
 difficult to know what caused it, especially if you can't see anything in
 the log when you indexed the missing documents.

 We know you're on Solr 4.10.1.  This means that your Java is a 1.7
 version, since Java7 is required.

 Here's where I ask a whole lot of questions about your setup. What is the
 precise Java version, and which vendor's Java are you using?  What
 operating system is it on?  Is everything 64-bit, or is any piece (CPU, OS,
 Java) 32-bit?  On the Solr admin UI dashboard, it lists all parameters used
 when starting Java, labelled as Args.  Can you include those?  Is
 zookeeper external, or embedded in Solr?  Is it a 3-server (or more)
 ensemble?  Are you using the example jetty, or did you provide your own
 servlet container?

 We recommend 64-bit Oracle Java, the latest 1.7 version.  OpenJDK (since
 version 1.7.x) should be pretty safe as well, but IBM's Java should be
 avoided.  IBM does very aggressive runtime optimizations.  These can make
 programs run faster, but they are known to negatively affect Lucene/Solr.

 Thanks,
 Shawn




Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-16 Thread Shawn Heisey

On 10/16/2014 6:27 PM, S.L wrote:

1. Java Version :java version 1.7.0_51
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)


I believe that build 51 is one of those that is known to have bugs 
related to Lucene.  If you can upgrade this to 67, that would be good, 
but I don't know that it's a pressing matter.  It looks like the Oracle 
JVM, which is good.



2.OS
CentOS Linux release 7.0.1406 (Core)

3. Everything is 64 bit , OS , Java , and CPU.

4. Java Args.
 -Djava.io.tmpdir=/opt/tomcat1/temp
 -Dcatalina.home=/opt/tomcat1
 -Dcatalina.base=/opt/tomcat1
 -Djava.endorsed.dirs=/opt/tomcat1/endorsed
 -DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181,
server3.mydomain.com:2181
 -DzkClientTimeout=2
 -DhostContext=solr
 -Dport=8081
 -Dhost=server1.mydomain.com
 -Dsolr.solr.home=/opt/solr/home1
 -Dfile.encoding=UTF8
 -Duser.timezone=UTC
 -XX:+UseG1GC
 -XX:MaxPermSize=128m
 -XX:PermSize=64m
 -Xmx2048m
 -Xms128m
 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
 -Djava.util.logging.config.file=/opt/tomcat1/conf/logging.properties


I would not use the G1 collector myself, but with the heap at only 2GB, 
I don't know that it matters all that much.  Even a worst-case 
collection probably is not going to take more than a few seconds, and 
you've already increased the zookeeper client timeout.


http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning


5. Zookeeper ensemble has 3 zookeeper instances , which are external and
are not embedded.


6. Container : I am using Tomcat Apache Tomcat Version 7.0.42

*Additional Observations:*

I queried all docs on both replicas with distrib=false&fl=id&sort=id+asc,
then compared the two lists. Each list has the same number of documents
(96309), but from eyeballing the first few lines of IDs in both lists, the
document IDs in them appear to be *mutually exclusive*. I checked at least
15 IDs manually and did not find a single common ID, so it looks to me like
the replicas are disjoint sets.


Are you sure you hit both replicas of the same shard number?  If you 
are, then it sounds like something is going wrong with your document 
routing, or maybe your clusterstate is really messed up.  Recreating the 
collection from scratch and doing a full reindex might be a good plan 
... assuming this is possible for you.  You could create a whole new 
collection, and then when you're ready to switch, delete the original 
collection and create an alias so your app can still use the old name.
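For reference, the recreate-and-alias plan maps onto three Collections API calls. The following is only a rough sketch that builds the URLs without sending them. The new collection name (dyCollection2) and the choice of host are assumptions for illustration, and the shard and replica counts are copied from the clusterstate posted in this thread. The full reindex into the new collection would happen between the first and second call.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Collections API URL. Nothing is sent over the network."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


BASE = "http://server1.mydomain.com:8081/solr"  # any live node should do
OLD = "dyCollection1"
NEW = "dyCollection2"  # hypothetical name for the rebuilt collection

steps = [
    # 1. Create the replacement collection, then reindex everything into it.
    collections_api_url(BASE, "CREATE", name=NEW, numShards=3,
                        replicationFactor=2, maxShardsPerNode=1),
    # 2. Once the new collection is verified, drop the original.
    collections_api_url(BASE, "DELETE", name=OLD),
    # 3. Point the old name at the new collection so the app keeps working.
    collections_api_url(BASE, "CREATEALIAS", name=OLD, collections=NEW),
]

for url in steps:
    print(url)
```

The alias is only created after the original collection is deleted, matching the order described above, so the old name is free when the alias takes it over.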


How much total RAM do you have on these systems, and how large are those 
index shards?  With a shard having 96K documents, it sounds like your 
whole index is probably just shy of 300K documents.


Thanks,
Shawn



Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-16 Thread S.L
Shawn ,


   1. I will upgrade to the build 67 JVM shortly.
   2. This is a new collection. I was facing a similar issue in 4.7, and
   based on Erick's recommendation I upgraded to 4.10.1 and created a new
   collection.
   3. Yes, I am hitting the replicas of the same shard, and I see that the
   lists are completely non-overlapping. I am using CloudSolrServer to add
   the documents.
   4. I have a 3-node physical cluster, with 16GB of memory on each node.
   5. I also have a custom request handler defined in my solrconfig.xml, as
   below. My MyCustomHandler class has been added to the source and included
   in the build, but it is not being used for any requests yet; I am only
   using the default select handler.

<requestHandler name="/mycustomselect" class="solr.MyCustomHandler"
                startup="lazy">
  <lst name="defaults">
    <str name="df">suggestAggregate</str>

    <str name="spellcheck.dictionary">direct</str>
    <!--<str name="spellcheck.dictionary">wordbreak</str>-->
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">5</str>
    <str name="spellcheck.maxResultsForSuggest">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>


6. The clusterstate.json is copied below

{"dyCollection1":{
    "shards":{
      "shard1":{
        "range":"8000-d554",
        "state":"active",
        "replicas":{
          "core_node3":{
            "state":"active",
            "core":"dyCollection1_shard1_replica1",
            "node_name":"server3.mydomain.com:8082_solr",
            "base_url":"http://server3.mydomain.com:8082/solr"},
          "core_node4":{
            "state":"active",
            "core":"dyCollection1_shard1_replica2",
            "node_name":"server2.mydomain.com:8081_solr",
            "base_url":"http://server2.mydomain.com:8081/solr",
            "leader":"true"}}},
      "shard2":{
        "range":"d555-2aa9",
        "state":"active",
        "replicas":{
          "core_node1":{
            "state":"active",
            "core":"dyCollection1_shard2_replica1",
            "node_name":"server1.mydomain.com:8081_solr",
            "base_url":"http://server1.mydomain.com:8081/solr",
            "leader":"true"},
          "core_node6":{
            "state":"active",
            "core":"dyCollection1_shard2_replica2",
            "node_name":"server3.mydomain.com:8081_solr",
            "base_url":"http://server3.mydomain.com:8081/solr"}}},
      "shard3":{
        "range":"2aaa-7fff",
        "state":"active",
        "replicas":{
          "core_node2":{
            "state":"active",
            "core":"dyCollection1_shard3_replica2",
            "node_name":"server1.mydomain.com:8082_solr",
            "base_url":"http://server1.mydomain.com:8082/solr",
            "leader":"true"},
          "core_node5":{
            "state":"active",
            "core":"dyCollection1_shard3_replica1",
            "node_name":"server2.mydomain.com:8082_solr",
            "base_url":"http://server2.mydomain.com:8082/solr"}}}},
    "maxShardsPerNode":"1",
    "router":{"name":"compositeId"},
    "replicationFactor":"2",
    "autoAddReplicas":"false"}}

  Thanks!
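
To double-check which cores belong to which shard in a clusterstate like the one above, something along these lines could help. It is only a sketch: the embedded JSON is a trimmed copy of the shard2 entry, and it assumes the leader flag is stored as the string "true", which may differ between Solr versions.

```python
import json

# Trimmed sample: only shard2 from the clusterstate in this message.
CLUSTERSTATE = json.loads("""
{"dyCollection1": {"shards": {"shard2": {"replicas": {
  "core_node1": {"core": "dyCollection1_shard2_replica1",
                 "base_url": "http://server1.mydomain.com:8081/solr",
                 "leader": "true"},
  "core_node6": {"core": "dyCollection1_shard2_replica2",
                 "base_url": "http://server3.mydomain.com:8081/solr"}}}}}}
""")


def replicas_by_shard(state, collection):
    """Map each shard name to a list of (core URL, is_leader) pairs."""
    result = {}
    for shard, info in state[collection]["shards"].items():
        result[shard] = [
            (f'{replica["base_url"]}/{replica["core"]}',
             # Assumption: a missing "leader" key means "not the leader".
             str(replica.get("leader", "false")).lower() == "true")
            for replica in info["replicas"].values()
        ]
    return result


shard_map = replicas_by_shard(CLUSTERSTATE, "dyCollection1")
for shard, cores in shard_map.items():
    for url, is_leader in cores:
        print(shard, url, "(leader)" if is_leader else "")
```

The core URLs it prints are the ones to query with distrib=false when comparing the two replicas of a shard.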

On Thu, Oct 16, 2014 at 9:02 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 10/16/2014 6:27 PM, S.L wrote:

 1. Java Version :java version 1.7.0_51
 Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)


 I believe that build 51 is one of those that is known to have bugs related
 to Lucene.  If you can upgrade this to 67, that would be good, but I don't
 know that it's a pressing matter.  It looks like the Oracle JVM, which is
 good.

  2.OS
 CentOS Linux release 7.0.1406 (Core)

 3. Everything is 64 bit , OS , Java , and CPU.

 4. Java Args.
  -Djava.io.tmpdir=/opt/tomcat1/temp
  -Dcatalina.home=/opt/tomcat1
  -Dcatalina.base=/opt/tomcat1
  -Djava.endorsed.dirs=/opt/tomcat1/endorsed
  -DzkHost=server1.mydomain.com:2181,server2.mydomain.com:2181,
 server3.mydomain.com:2181
  -DzkClientTimeout=2
  -DhostContext=solr
  -Dport=8081
  -Dhost=server1.mydomain.com
  -Dsolr.solr.home=/opt/solr/home1
  -Dfile.encoding=UTF8
  -Duser.timezone=UTC
  -XX:+UseG1GC
  -XX:MaxPermSize=128m
  -XX:PermSize=64m
  -Xmx2048m
  -Xms128m
  -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager
  -Djava.util.logging.config.file=/opt/tomcat1/conf/logging.properties


 I would not use the G1 collector myself, but with the heap at only 2GB, I
 don't know that it matters all that much.  Even a worst-case collection
 probably is not going to take more than a few seconds, and you've already
 increased the zookeeper client timeout.

 http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

  5. 

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-15 Thread S.L
 surprised that this issue never got reported for 4.7 up
   until
now.
   
Thanks again for your help!
   
   
   
On Mon, Oct 6, 2014 at 10:52 PM, Erick Erickson 
  erickerick...@gmail.com
   
wrote:
   
I think there were some holes that would allow replicas and leaders
 to
be out of synch that have been patched up in the last 3 releases.
   
There shouldn't be anything you need to do to keep these in synch,
 so
if you can capture what happened when things got out of synch we'll
fix it. But a lot has changed in the last several months, so the
 first
thing I'd do if possible is to upgrade to 4.10.1.
   
   
Best,
Erick
   
On Mon, Oct 6, 2014 at 2:41 PM, S.L simpleliving...@gmail.com
  wrote:
 Hi Erick,

 Before I tried your suggestion of  issung a commit=true update, I
realized that for eaach shard there was atleast a node that had its
   index
directory named like index.timestamp.

 I went ahead and deleted index directory that restarted that core
  and
now the index directory got syched with the other node and is
 properly
named as 'index' without any timestamp attached to it.This is now
   giving me
consistent results for distrib=true using a load balancer.Also
distrib=false returns expexted results for a given shard.

 The underlying issue appears to be that in every shard the leader
  and
the replica(follower) were out of sych.

 How can I avoid this from happening again?

 Thanks for your help!

 Sent from my HTC

 - Reply message -
 From: Erick Erickson erickerick...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: SolrCloud 4.7 not doing distributed search when querying
   from a
load balancer.
 Date: Fri, Oct 3, 2014 12:56 AM

 H. Assuming that you aren't re-indexing the doc you're
 searching
for...

 Try issuing http://blah
  blah:8983/solr/collection/update?commit=true.
 That'll force all the docs to be searchable. Does 1 still hold
 for
 the document in question? Because this is exactly backwards of
 what
 I'd expect. I'd expect, if anything, the replica (I'm trying to
 call
 it the follower when a distinction needs to be made since the
  leader
 is a replica too) would be out of sync. This is still a Bad
 Thing, but the leader gets first crack at indexing thing.

 bq: only the replica of the shard that has this key returns the
  result
 , and the leader does not ,

 Just to be sure we're talking about the same thing. When you say
 leader, you mean the shard leader, right? The filled-in circle
 on
 the graph view from the admin/cloud page.

 And let's see your soft and hard commit settings please.

 Best,
 Erick

 On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com
   wrote:
 Eirck,

 0 Load balancer is out of the picture
 .
 1When I query with *distrib=false* , I get consistent results as
expected
 for those shards that dont have the key i.e I dont get the
 results
   back
for
 those shards, however I just realized that while *distrib=false*
 is
present
 in the query for the shard that is supposed to contain the
 key,only
   the
 replica of the shard that has this key returns the result , and
 the
leader
 does not , looks like replica and the leader do not have the same
   data
and
 replica seems to contain the key in the query for that shard.

 2 By indexing I mean this collection is being populated by a web
crawler.

 So looks like 1 above  is pointing to leader and replica being
 out
   of
 synch for atleast one shard.



 On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson 
erickerick...@gmail.com
 wrote:

 bq: Also ,the collection is being actively indexed as I query
  this,
could
 that
 be an issue too ?

 Not if the documents you're searching aren't being added as you
   search
 (and all your autocommit intervals have expired).

 I would turn off indexing for testing, it's just one more
 variable
 that can get in the way of understanding this.

 Do note that if the problem were endemic to Solr, there would
   probably
 be a _lot_ more noise out there.

 So to recap:
 0 we can take the load balancer out of the picture all
 together.

 1 when you query each shard individually with distrib=true,
  every
 replica in a particular shard returns the same count.

 2 when you query without distrib=true you get varying counts.

 This is very strange and not at all expected. Let's try it again
 without indexing going on

 And what do you mean by indexing anyway? How are documents
 being
   fed
 to your system?

 Best,
 Erick@PuzzledAsWell

 On Thu, Oct 2, 2014 at 7:32 PM, S.L

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-15 Thread S.L
-bio-8081-exec-169] INFO  org.apache.solr.core.SolrCore  –
  [dyCollection1_shard2_replica1] webapp=/solr path=/select/
 
 
 params={q=*:*&distrib=true&wt=json&fq=(id:220a8dce-3b31-4d46-8386-da8405595c47)}
  hits=1 status=0 QTime=7
 
 
  *Autocommit and Soft commit settings.*
 
   <autoSoftCommit>
     <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
   </autoSoftCommit>

   <autoCommit>
     <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>

     <openSearcher>true</openSearcher>
   </autoCommit>
 
 
 
  On Tue, Oct 7, 2014 at 12:22 AM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
 
   Not, I'm not guaranteeing that it'll actually cure the problem, just
   that enough has changed since 4.7 that it'd be a good place to start.
  
   Things have been reported off and on, but they're often pesky race
   conditions or something else that takes a long time to track down, you
   just are lucky perhaps ;)...
  
   Erick
  
   On Mon, Oct 6, 2014 at 8:04 PM, S.L simpleliving...@gmail.com
 wrote:
Erick,
   
Thanks for the suggestion , I am not sure if I would be able to
 capture
what went wrong , so upgrading to 4.10 seems easier even though it
  means
   ,
a days work of effort :) . I will go ahead and upgrade and let me
 know
  ,
although I am surprised that this issue never got reported for 4.7
 up
   until
now.
   
Thanks again for your help!
   
   
   
On Mon, Oct 6, 2014 at 10:52 PM, Erick Erickson 
  erickerick...@gmail.com
   
wrote:
   
I think there were some holes that would allow replicas and
 leaders to
be out of synch that have been patched up in the last 3 releases.
   
There shouldn't be anything you need to do to keep these in synch,
 so
if you can capture what happened when things got out of synch we'll
fix it. But a lot has changed in the last several months, so the
 first
thing I'd do if possible is to upgrade to 4.10.1.
   
   
Best,
Erick
   
On Mon, Oct 6, 2014 at 2:41 PM, S.L simpleliving...@gmail.com
  wrote:
 Hi Erick,

 Before I tried your suggestion of  issung a commit=true update, I
realized that for eaach shard there was atleast a node that had its
   index
directory named like index.timestamp.

 I went ahead and deleted index directory that restarted that core
  and
now the index directory got syched with the other node and is
 properly
named as 'index' without any timestamp attached to it.This is now
   giving me
consistent results for distrib=true using a load balancer.Also
distrib=false returns expexted results for a given shard.

 The underlying issue appears to be that in every shard the leader
  and
the replica(follower) were out of sych.

 How can I avoid this from happening again?

 Thanks for your help!

 Sent from my HTC

 - Reply message -
 From: Erick Erickson erickerick...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: SolrCloud 4.7 not doing distributed search when querying
   from a
load balancer.
 Date: Fri, Oct 3, 2014 12:56 AM

 H. Assuming that you aren't re-indexing the doc you're
 searching
for...

 Try issuing http://blah
  blah:8983/solr/collection/update?commit=true.
 That'll force all the docs to be searchable. Does 1 still hold
 for
 the document in question? Because this is exactly backwards of
 what
 I'd expect. I'd expect, if anything, the replica (I'm trying to
 call
 it the follower when a distinction needs to be made since the
  leader
 is a replica too) would be out of sync. This is still a Bad
 Thing, but the leader gets first crack at indexing thing.

 bq: only the replica of the shard that has this key returns the
  result
 , and the leader does not ,

 Just to be sure we're talking about the same thing. When you say
 leader, you mean the shard leader, right? The filled-in circle
 on
 the graph view from the admin/cloud page.

 And let's see your soft and hard commit settings please.

 Best,
 Erick

 On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com
   wrote:
 Eirck,

 0 Load balancer is out of the picture
 .
 1When I query with *distrib=false* , I get consistent results
 as
expected
 for those shards that dont have the key i.e I dont get the
 results
   back
for
 those shards, however I just realized that while
 *distrib=false* is
present
 in the query for the shard that is supposed to contain the
 key,only
   the
 replica of the shard that has this key returns the result , and
 the
leader
 does not , looks like replica and the leader do not have the
 same
   data
and
 replica seems to contain the key in the query for that shard.

 2 By indexing I mean this collection is being populated by a
 web
crawler.

 So looks like 1 above

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-15 Thread Shawn Heisey

On 10/15/2014 9:26 PM, S.L wrote:

Look at the logging information I provided below , looks like the results
are only being returned back for this solrCloud cluster  if the request
goes to one of the two replicas of a shard.

I have verified that numDocs in the replicas for a given shard is same but
there is difference in the maxDoc and deletedDocs, does this signal the
replicas being out of sync ?

Even if numDocs is the same, how do we guarantee that those docs are
identical and have the same unique keys? Is there a way to verify this?
Since numDocs is the same across the replicas, and yet I only get a result
back when the request goes to one particular replica of the shard, I suspect
that the documents within the replicas of a shard are not exact copies of
each other.

I suspect the issue I am facing in 4.10.1 cloud is related to
https://issues.apache.org/jira/browse/SOLR-4924  .

Can anyone please let me know , how to solve this issue of intermittent no
results for a query ?


query with no results hits these cores:
server 2 shard 3 replica1
server 3 shard 1 replica 1
server 1 shard 2 replica 1

query with 1 result hits these cores:
server 2 shard 1 replica 2
server 3 shard 2 replica 2 (found 1)
server 1 shard 3 replica 2

Here's some URLs for some testing.  They are directed at specific shard 
replicas and are specifically NOT distributed queries:


http://server1.mydomain.com:8081/solr/dyCollection1_shard2_replica1/select?q=*:*&fq=id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb&distrib=false

http://server3.mydomain.com:8081/solr/dyCollection1_shard2_replica2/select?q=*:*&fq=id:e8995da8-7d98-4010-93b4-8ff7dffb8bfb&distrib=false

If you run these queries (replacing server names and the /select request 
handler as appropriate), do you get 0 results on the first one and 1 
result on the second one?  If you do, then you've definitely got 
replicas out of sync.  If you get 1 result on both queries, then 
something else is breaking.  If by chance you have taken steps to fix 
this particular ID, pick another one that you know has a problem.


There is no automated way to detect replicas out of sync.  You could 
request all docs on both replicas with distrib=false&fl=id&sort=id+asc, 
then compare the two lists.  Depending on how many docs you have, those 
queries could take a while to run.
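
As a rough sketch of the list comparison suggested above (an illustration, not something from the original exchange): once the ID lists have been fetched from each replica with distrib=false, fl=id and a sort on id, a set difference shows which documents exist on only one side. The function name and the sample IDs below are hypothetical.

```python
def diff_replica_ids(ids_a, ids_b):
    """Return (only_in_a, only_in_b) as sorted lists of document IDs."""
    set_a, set_b = set(ids_a), set(ids_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Hypothetical ID lists, as if fetched from replica1 and replica2 of one shard.
replica1_ids = ["doc-1", "doc-2", "doc-4"]
replica2_ids = ["doc-1", "doc-2", "doc-3", "doc-4"]

only_on_1, only_on_2 = diff_replica_ids(replica1_ids, replica2_ids)
print("only on replica1:", only_on_1)
print("only on replica2:", only_on_2)
```

If both lists come back empty, the two replicas hold the same set of IDs, although this says nothing about whether the stored field values match.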


If the replicas are out of sync, are there any ERROR entries in the Solr 
log, especially at the time that the problem docs were indexed?


Thanks,
Shawn



Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-15 Thread S.L
Shawn,

Yes, I tried those two queries with distrib=false. I consistently get 0
results for the first and 1 result for the second query (i.e. server 3
shard 2 replica 2).

However, if I run the same second query (i.e. server 3 shard 2 replica 2)
with distrib=true, I sometimes get a result and sometimes not. Shouldn't
this query always return a result when it's pointing to a core that seems to
have that document, regardless of distrib=true or false?

Unfortunately I don't see anything in particular in the logs that points to
any useful information.

BTW, you asked me to replace the request handler. I use the select request
handler, so I cannot replace it with anything else. Is that a problem?

Thanks.





Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-15 Thread Shawn Heisey

On 10/15/2014 10:24 PM, S.L wrote:

Yes, I tried those two queries with distrib=false. I consistently get 0
results for the first and 1 result for the second query (i.e. server 3
shard 2 replica 2).

However, if I run the same second query (i.e. server 3 shard 2 replica 2)
with distrib=true, I sometimes get a result and sometimes not. Shouldn't
this query always return a result when it's pointing to a core that seems to
have that document, regardless of distrib=true or false?

Unfortunately I don't see anything in particular in the logs that points to
any useful information.

BTW, you asked me to replace the request handler. I use the select request
handler, so I cannot replace it with anything else. Is that a problem?


If you send the query with distrib=true (which is the default value in 
SolrCloud), then it treats it just as if you had sent it to 
/solr/collection instead of /solr/collection_shardN_replicaN, so it's a 
full distributed query. The distrib=false is required to turn that 
behavior off and ONLY query the index on the actual core where you sent it.
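
To make the difference concrete, here is a minimal sketch (not from the original messages) that builds the two kinds of request URL. The helper name select_url is hypothetical; the host, core name and document ID are the ones used earlier in this thread, and the collection name dyCollection1 is inferred from the core names.

```python
from urllib.parse import urlencode


def select_url(base, core_or_collection, doc_id, distrib):
    """Build a /select URL that filters on a single document ID."""
    params = {
        "q": "*:*",
        "fq": "id:" + doc_id,
        "distrib": "true" if distrib else "false",
    }
    # safe=":*" keeps q=*:* and the id: prefix readable in the URL.
    query = urlencode(params, safe=":*")
    return "%s/solr/%s/select?%s" % (base, core_or_collection, query)


doc_id = "e8995da8-7d98-4010-93b4-8ff7dffb8bfb"
base = "http://server1.mydomain.com:8081"

# Queries ONLY the index of this one core.
print(select_url(base, "dyCollection1_shard2_replica1", doc_id, distrib=False))

# Fans out across the whole collection, even though a core is named in the path.
print(select_url(base, "dyCollection1_shard2_replica1", doc_id, distrib=True))
```

Only the first URL is useful for checking what a specific replica holds; the second can be answered by any replica of each shard.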


I only said to replace those things as appropriate.  Since you are using 
/select, it's no problem that you left it that way. If I were to assume 
that you used /select, but you didn't, the URLs as I wrote them might 
not have worked.


As discussed, this means that your replicas are truly out of sync.  It's 
difficult to know what caused it, especially if you can't see anything 
in the log when you indexed the missing documents.


We know you're on Solr 4.10.1.  This means that your Java is a 1.7 
version, since Java7 is required.


Here's where I ask a whole lot of questions about your setup. What is 
the precise Java version, and which vendor's Java are you using?  What 
operating system is it on?  Is everything 64-bit, or is any piece (CPU, 
OS, Java) 32-bit?  On the Solr admin UI dashboard, it lists all 
parameters used when starting Java, labelled as Args.  Can you include 
those?  Is zookeeper external, or embedded in Solr?  Is it a 3-server 
(or more) ensemble?  Are you using the example jetty, or did you provide 
your own servlet container?


We recommend 64-bit Oracle Java, the latest 1.7 version.  OpenJDK (since 
version 1.7.x) should be pretty safe as well, but IBM's Java should be 
avoided.  IBM does very aggressive runtime optimizations.  These can 
make programs run faster, but they are known to negatively affect 
Lucene/Solr.


Thanks,
Shawn



Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-14 Thread Tim Potter

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-13 Thread S.L

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-06 Thread S.L
Hi Erick,

Before I tried your suggestion of issuing a commit=true update, I realized that 
for each shard there was at least one node whose index directory was named 
like index.timestamp.

I went ahead and deleted that index directory and restarted the core, and now the 
index directory has synced with the other node and is properly named 'index', 
without any timestamp attached to it. This now gives me consistent results 
for distrib=true using a load balancer. Also, distrib=false returns the expected 
results for a given shard.
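
A small sketch of how the check described above could be automated (an illustration only, not part of the original message): it lists subdirectories of a core's data directory whose names look like index.<timestamp>. The assumption that the suffix is purely numeric, and the function name, are hypothetical; the demonstration runs against a temporary directory rather than a real Solr home.

```python
import os
import re
import tempfile

# Matches directory names such as "index.20141006123456789" (assumed naming).
TIMESTAMPED_INDEX = re.compile(r"^index\.\d+$")


def find_timestamped_index_dirs(data_dir):
    """Return sorted names of index.<digits> subdirectories of data_dir."""
    return sorted(
        name
        for name in os.listdir(data_dir)
        if TIMESTAMPED_INDEX.match(name)
        and os.path.isdir(os.path.join(data_dir, name))
    )


# Demonstration against a throwaway directory layout.
with tempfile.TemporaryDirectory() as data_dir:
    os.mkdir(os.path.join(data_dir, "index.20141006123456789"))
    os.mkdir(os.path.join(data_dir, "tlog"))
    print(find_timestamped_index_dirs(data_dir))
```

A non-empty result only flags a core worth inspecting; it does not by itself prove that the replica is out of sync.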

The underlying issue appears to be that in every shard the leader and the 
replica (follower) were out of sync.

How can I prevent this from happening again?

Thanks for your help!

Sent from my HTC

- Reply message -
From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org
Subject: SolrCloud 4.7 not doing distributed search when querying from a load 
balancer.
Date: Fri, Oct 3, 2014 12:56 AM

H. Assuming that you aren't re-indexing the doc you're searching for...

Try issuing http://blah blah:8983/solr/collection/update?commit=true.
That'll force all the docs to be searchable. Does 1 still hold for
the document in question? Because this is exactly backwards of what
I'd expect. I'd expect, if anything, the replica (I'm trying to call
it the follower when a distinction needs to be made since the leader
is a replica too) would be out of sync. This is still a Bad
Thing, but the leader gets first crack at indexing things.

bq: only the replica of the shard that has this key returns the result
, and the leader does not ,

Just to be sure we're talking about the same thing. When you say
leader, you mean the shard leader, right? The filled-in circle on
the graph view from the admin/cloud page.

And let's see your soft and hard commit settings please.

Best,
Erick

On Thu, Oct 2, 2014 at 9:48 PM, S.L simpleliving...@gmail.com wrote:
 Erick,

 0. The load balancer is out of the picture.

 1. When I query with *distrib=false*, I get consistent results as expected
 for those shards that don't have the key, i.e. I don't get results back for
 those shards. However, I just realized that with *distrib=false* present
 in the query for the shard that is supposed to contain the key, only the
 replica of the shard that has this key returns the result, and the leader
 does not. It looks like the replica and the leader do not have the same
 data, and the replica seems to contain the key in the query for that shard.

 2. By indexing I mean this collection is being populated by a web crawler.

 So it looks like 1 above is pointing to the leader and replica being out of
 sync for at least one shard.



 On Thu, Oct 2, 2014 at 11:57 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 bq: Also, the collection is being actively indexed as I query this, could
 that be an issue too?

 Not if the documents you're searching aren't being added as you search
 (and all your autocommit intervals have expired).

 I would turn off indexing for testing, it's just one more variable
 that can get in the way of understanding this.

 Do note that if the problem were endemic to Solr, there would probably
 be a _lot_ more noise out there.

 So to recap:
 0 we can take the load balancer out of the picture altogether.

 1 when you query each shard individually with distrib=false, every
 replica in a particular shard returns the same count.

 2 when you query without distrib=false you get varying counts.

 This is very strange and not at all expected. Let's try it again
 without indexing going on.

 And what do you mean by indexing anyway? How are documents being fed
 to your system?

 Best,
 Erick@PuzzledAsWell

 On Thu, Oct 2, 2014 at 7:32 PM, S.L simpleliving...@gmail.com wrote:
  Erick,
 
  I would like to add that the interesting behavior, i.e. point #2 that I
  mentioned in my earlier reply, happens in all the shards. If this were
  a distributed search issue, it should not have manifested itself in the
  shard that contains the key that I am searching for. It looks like the
  search is just failing as a whole intermittently.
 
  Also, the collection is being actively indexed as I query this; could
  that be an issue too?
 
  Thanks.
 
  On Thu, Oct 2, 2014 at 10:24 PM, S.L simpleliving...@gmail.com wrote:
 
  Erick,
 
  Thanks for your reply, I tried your suggestions.
 
  1. When not using the load balancer, if *I have distrib=false* I get
  consistent results across the replicas.
 
  2. However, here's the interesting part: while not using the load balancer,
  if I *don't have distrib=false*, then when I query a particular node I get
  the same behaviour as if I were using a load balancer, meaning the
  distributed search from a node works intermittently. Does this give any
  clue?
 
 
 
  On Thu, Oct 2, 2014 at 7:47 PM, Erick Erickson erickerick...@gmail.com
 
  wrote:
 
  Hmmm, nothing quite makes sense here
 
  Here are some experiments:
  1 avoid the load balancer and issue queries like

Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-06 Thread Erick Erickson
I think there were some holes that would allow replicas and leaders to
be out of synch that have been patched up in the last 3 releases.

There shouldn't be anything you need to do to keep these in synch, so
if you can capture what happened when things got out of synch we'll
fix it. But a lot has changed in the last several months, so the first
thing I'd do if possible is to upgrade to 4.10.1.


Best,
Erick


Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-06 Thread S.L
Erick,

Thanks for the suggestion. I am not sure I would be able to capture
what went wrong, so upgrading to 4.10 seems easier, even though it means
a day's work :). I will go ahead and upgrade and let you know,
although I am surprised that this issue never got reported for 4.7 up until
now.

Thanks again for your help!




Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-06 Thread Erick Erickson
Note, I'm not guaranteeing that it'll actually cure the problem, just
that enough has changed since 4.7 that it'd be a good place to start.

Things have been reported off and on, but they're often pesky race
conditions or something else that takes a long time to track down;
you're just lucky, perhaps ;)...

Erick

On Mon, Oct 6, 2014 at 8:04 PM, S.L simpleliving...@gmail.com wrote:
 Erick,

 Thanks for the suggestion. I am not sure I would be able to capture
 what went wrong, so upgrading to 4.10 seems easier, even though it means
 a day's worth of effort :) . I will go ahead and upgrade and let you know,
 although I am surprised that this issue was never reported for 4.7 until
 now.

 Thanks again for your help!



 On Mon, Oct 6, 2014 at 10:52 PM, Erick Erickson erickerick...@gmail.com
 wrote:

 I think there were some holes that would allow replicas and leaders to
 be out of synch that have been patched up in the last 3 releases.

 There shouldn't be anything you need to do to keep these in synch, so
 if you can capture what happened when things got out of synch we'll
 fix it. But a lot has changed in the last several months, so the first
 thing I'd do if possible is to upgrade to 4.10.1.


 Best,
 Erick

 On Mon, Oct 6, 2014 at 2:41 PM, S.L simpleliving...@gmail.com wrote:
  Hi Erick,
 
  Before I tried your suggestion of issuing a commit=true update, I
 realized that for each shard there was at least one node whose index
 directory was named like index.timestamp.
 
  I went ahead and deleted that index directory and restarted the core, and
 now the index directory has synced with the other node and is properly
 named 'index' without any timestamp attached to it. This is now giving me
 consistent results for distrib=true using a load balancer. Also,
 distrib=false returns the expected results for a given shard.
 
  The underlying issue appears to be that in every shard the leader and
 the replica (follower) were out of sync.
 
  How can I prevent this from happening again?
 
  Thanks for your help!
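
A minimal sketch, in Python, of how one might look for the symptom described above: cores whose data directory holds a directory named like index.<timestamp>. The `<solr home>/<core>/data/` layout follows the paths in the logs quoted earlier in this thread; the all-digits timestamp pattern and both function names are assumptions made for illustration. Solr can legitimately use such a directory as its live index, so a match is only a hint to investigate, not proof of a problem.

```python
import re
from pathlib import Path

# Assumption: timestamped index directories look like "index.<digits>".
TIMESTAMPED = re.compile(r"^index\.\d+$")


def is_timestamped_index(name):
    """Return True for names like 'index.20141006120000000', False for plain 'index'."""
    return bool(TIMESTAMPED.match(name))


def find_timestamped_indexes(solr_home):
    """List directories under <solr_home>/<core>/data/ whose name is index.<digits>."""
    return sorted(
        str(p)
        for p in Path(solr_home).glob("*/data/*")
        if p.is_dir() and is_timestamped_index(p.name)
    )
```

Calling `find_timestamped_indexes("/opt/solr/home1")` on each node would list the candidate cores to inspect before deciding whether to resync them.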
 

SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-02 Thread S.L
Hi All,

I am trying to query a 6-node Solr 4.7 cluster with 3 shards and a
replication factor of 2.

I have fronted these 6 Solr nodes with a load balancer. What I notice is
that every time I do a search of the form
q=*:*&fq=(id:9e78c064-919f-4ef3-b236-dc66351b4acf) it gives me a result
only once in every 3 tries, suggesting that the load balancer is
distributing the requests between the 3 shards and SolrCloud only returns a
result if the request goes to the core that has that id.

However, if I do a simple search like q=*:*, I consistently get the right
aggregated results back for all the documents across all the shards on
every request from the load balancer. Can someone please let me know what
this is symptomatic of?

Somehow SolrCloud seems to be doing search query distribution and
aggregation for queries of type *:* only.
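
For reference, a small Python sketch of how the two queries above could be built with their parameters properly separated and escaped. The host `solr_server:8983`, the collection name `collection`, the `/select` handler path and the helper name `select_url` are assumptions for illustration, not details taken from this cluster.

```python
from urllib.parse import urlencode


def select_url(base, collection, **params):
    """Build a /select URL; urlencode joins parameters with '&' and escapes them."""
    return f"{base}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical host and collection name.
filtered = select_url(
    "http://solr_server:8983",
    "collection",
    q="*:*",
    fq="id:9e78c064-919f-4ef3-b236-dc66351b4acf",
    wt="json",
)
match_all = select_url("http://solr_server:8983", "collection", q="*:*", wt="json")
print(filtered)
print(match_all)
```

Building the URL this way rules out a malformed query string as one possible cause before looking at the cluster itself.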

Thanks.


Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-02 Thread Erick Erickson
Hmmm, nothing quite makes sense here

Here are some experiments:
1 avoid the load balancer and issue queries like
http://solr_server:8983/solr/collection/select?q=whatever&distrib=false

The distrib=false bit will keep SolrCloud from trying to send
the queries anywhere; they'll be served only from the node you address them to.
That'll help check whether the nodes are consistent. You should be
getting back the same results from each replica in a shard (i.e. 2 of
your 6 machines).

Next, try your failing query the same way.

Next, try your failing query from a browser, pointing it at successive
nodes.

Where is the first place problems show up?

My _guess_ is that your load balancer isn't quite doing what you think, or
your cluster isn't set up the way you think it is, but those are guesses.
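
The first experiment above could be scripted roughly as follows. This is only a sketch: the core URLs, the shard-to-core mapping and the function names are hypothetical, and it assumes the JSON response writer (`wt=json`), whose responses carry the hit count under `response.numFound`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def local_query_url(core_url, q):
    """URL that asks one core to answer from its own index only (distrib=false)."""
    params = {"q": q, "distrib": "false", "rows": 0, "wt": "json"}
    return f"{core_url}/select?{urlencode(params)}"


def num_found(core_url, q):
    """Fetch the hit count from a single core (performs an HTTP request)."""
    with urlopen(local_query_url(core_url, q)) as resp:
        return json.load(resp)["response"]["numFound"]


def inconsistent_shards(counts_by_shard):
    """Given {shard: {core: numFound}}, return shards whose cores disagree."""
    return sorted(
        shard
        for shard, counts in counts_by_shard.items()
        if len(set(counts.values())) > 1
    )
```

Collecting `num_found` for both cores of every shard and passing the result to `inconsistent_shards` shows at a glance where the replicas first diverge.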

Best,
Erick



Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-02 Thread S.L
Erick,

Thanks for your reply, I tried your suggestions.

1. When not using the load balancer, if *I have distrib=false* I get consistent
results across the replicas.

2. However, here's the interesting part: while not using the load balancer, if
I *don't have distrib=false*, then when I query a particular node I get
the same behaviour as if I were using a load balancer, meaning the
distributed search from a node works intermittently. Does this give any
clue?
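
One way to put a number on "works intermittently" is to repeat the same distributed query against a single node and tally the hit counts that come back. The Python sketch below only does the tallying; the list of observations is made up for illustration and is not data from this cluster.

```python
from collections import Counter


def summarize(num_found_values):
    """Tally repeated hit counts and flag the query as intermittent if they differ."""
    tally = Counter(num_found_values)
    return {"tally": dict(tally), "intermittent": len(tally) > 1}


# Hypothetical observations from six identical requests to one node.
observed = [1, 0, 0, 1, 0, 0]
print(summarize(observed))
```

A stable query yields a single distinct value; more than one suggests that the cores serving the sub-requests do not all hold the same documents.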






Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-02 Thread S.L
Erick,

I would like to add that the interesting behavior, i.e. point #2 that I
mentioned in my earlier reply, happens in all the shards. If this were
a distributed search issue, it should not have manifested itself in the
shard that contains the key that I am searching for; it looks like the search
is just failing as a whole intermittently.

Also, the collection is being actively indexed as I query this. Could that
be an issue too?

Thanks.






Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-02 Thread Erick Erickson
bq: Also, the collection is being actively indexed as I query this, could that
be an issue too?

Not if the documents you're searching aren't being added as you search
(and all your autocommit intervals have expired).

I would turn off indexing for testing, it's just one more variable
that can get in the way of understanding this.

Do note that if the problem were endemic to Solr, there would probably
be a _lot_ more noise out there.

So to recap:
0 we can take the load balancer out of the picture altogether.

1 when you query each shard individually with distrib=false, every
replica in a particular shard returns the same count.

2 when you query without distrib=false you get varying counts.

This is very strange and not at all expected. Let's try it again
without indexing going on.

And what do you mean by indexing anyway? How are documents being fed
to your system?

Best,
Erick@PuzzledAsWell






Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-02 Thread S.L
Erick,

0 The load balancer is out of the picture.

1 When I query with *distrib=false*, I get consistent results as expected
for those shards that don't have the key, i.e. I don't get results back for
those shards. However, I just realized that with *distrib=false* present
in the query for the shard that is supposed to contain the key, only the
replica of the shard that has this key returns the result, and the leader
does not. It looks like the replica and the leader do not have the same data,
and the replica seems to contain the key in the query for that shard.

2 By indexing I mean this collection is being populated by a web crawler.

So it looks like 1 above points to the leader and replica being out of
sync for at least one shard.
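
The observation in point 1 can be written down as a simple check. In this Python sketch, each core is recorded as True or False depending on whether it returned the document when queried with distrib=false; the data structure, the function name and the example values are hypothetical.

```python
def shards_with_stale_leader(presence):
    """presence maps shard -> {"leader": bool, "followers": [bool, ...]}.

    Returns the shards where the leader lacks the document but at least one
    follower has it, which is the situation described above.
    """
    stale = []
    for shard, cores in sorted(presence.items()):
        if not cores["leader"] and any(cores["followers"]):
            stale.append(shard)
    return stale


# Hypothetical results for one document id across a 3-shard collection.
example = {
    "shard1": {"leader": False, "followers": [False]},  # doc not routed here
    "shard2": {"leader": False, "followers": [True]},   # leader is missing the doc
    "shard3": {"leader": False, "followers": [False]},  # doc not routed here
}
print(shards_with_stale_leader(example))
```

A shard where neither core returns the document is expected, since each id lives in exactly one shard; only a disagreement within a shard points to replicas being out of sync.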



 
 
 



Re: SolrCloud 4.7 not doing distributed search when querying from a load balancer.

2014-10-02 Thread Erick Erickson
Hmmm. Assuming that you aren't re-indexing the doc you're searching for...

Try issuing http://blah blah:8983/solr/collection/update?commit=true.
That'll force all the docs to be searchable. Does 1 still hold for
the document in question? Because this is exactly backwards of what
I'd expect. I'd expect, if anything, the replica (I'm trying to call
it the follower when a distinction needs to be made since the leader
is a replica too) would be out of sync. This is still a Bad
Thing, but the leader gets first crack at indexing things.

bq: only the replica of the shard that has this key returns the result,
and the leader does not,
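
As a concrete form of the commit request above, here is a short Python sketch. The host and collection name are placeholders, and `commit_url` and `force_commit` are illustrative names rather than part of any Solr client library.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def commit_url(base, collection):
    """URL of an explicit commit on the collection's update handler."""
    return f"{base}/solr/{collection}/update?{urlencode({'commit': 'true'})}"


def force_commit(base, collection):
    """Send the commit request and return the HTTP status code."""
    with urlopen(commit_url(base, collection)) as resp:
        return resp.status


# Placeholder host and collection name.
print(commit_url("http://solr_server:8983", "collection"))
```

After the commit returns, repeating the per-core distrib=false lookups shows whether the difference between leader and follower was only a matter of uncommitted documents.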

Just to be sure we're talking about the same thing. When you say
leader, you mean the shard leader, right? The filled-in circle on
the graph view from the admin/cloud page.

And let's see your soft and hard commit settings please.

Best,
Erick

  On Thu, Oct 2, 2014 at 2:51 PM, S.L simpleliving...@gmail.com wrote:
   Hi All,
  
   I am trying to query a 6 node Solr4.7  cluster with 3 shards and  a
   replication factor of 2 .
  
   I have fronted these 6 Solr