SolrCloud 4.6.0 - leader election issue

2013-12-09 Thread Elodie Sannier

Hello,

I am using SolrCloud 4.6.0 with two shards, two replicas per shard, and two 
collections.

collection fr_blue:
- shard1 - server-01 (replica1), server-01 (replica2)
- shard2 - server-02 (replica1), server-02 (replica2)

collection fr_green:
- shard1 - server-01 (replica1), server-01 (replica2)
- shard2 - server-02 (replica1), server-02 (replica2)
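
For reference, the two collections can be created with the Collections API. A 
minimal SolrJ 4.6 sketch is below; the base URL reuses the port and webapp 
context seen in the logs further down, and the "fr_conf" config name is a 
placeholder, not our actual value:

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

import java.io.IOException;

public class CreateBlueGreenCollections {
    public static void main(String[] args) throws SolrServerException, IOException {
        // Any live node can receive Collections API calls; the host name is a placeholder.
        HttpSolrServer server = new HttpSolrServer("http://server-01:8080/searchsolrnodefr");
        for (String collection : new String[]{"fr_blue", "fr_green"}) {
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "CREATE");
            params.set("name", collection);
            params.set("numShards", "2");
            params.set("replicationFactor", "2");
            // 2 collections x 2 shards x 2 replicas over 4 instances = 2 cores per instance.
            params.set("maxShardsPerNode", "2");
            params.set("collection.configName", "fr_conf"); // placeholder config name
            QueryRequest request = new QueryRequest(params);
            request.setPath("/admin/collections");
            server.request(request);
        }
        server.shutdown();
    }
}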

If I start the four Solr instances without a delay between each start, it is 
not possible to connect to them or to access the Solr Admin page.

If I get the clusterstate.json with zkCli, the statuses are:
- active for the leaders of the first collection
- recovering for the other replicas of the first collection
- down for all replicas of the second collection (no leader)
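
The same per-replica states can also be read programmatically. A minimal 
SolrJ 4.6 sketch, assuming the ZooKeeper host and the /s6fr chroot that appear 
in the logs below:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class DumpClusterState {
    public static void main(String[] args) throws Exception {
        // ZooKeeper address and chroot taken from the log excerpts below.
        CloudSolrServer server =
                new CloudSolrServer("dc1-vt-dev-xen-06-vm-07.dev.dc1.kelkoo.net:2181/s6fr");
        server.connect(); // reads /clusterstate.json and registers watches
        ClusterState state = server.getZkStateReader().getClusterState();
        for (String collection : new String[]{"fr_blue", "fr_green"}) {
            for (Slice slice : state.getSlices(collection)) {
                Replica leader = slice.getLeader(); // null while no leader is elected
                System.out.println(collection + "/" + slice.getName()
                        + " leader=" + (leader == null ? "none" : leader.getName()));
                for (Replica replica : slice.getReplicas()) {
                    System.out.println("  " + replica.getName()
                            + " state=" + replica.getStr(ZkStateReader.STATE_PROP));
                }
            }
        }
        server.shutdown();
    }
}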

The logs loop on the following messages:

server-01:
2013-12-09 14:41:28,634 
[main-SendThread(dc1-vt-dev-xen-06-vm-07.dev.dc1.kelkoo.net:2181)] DEBUG 
org.apache.zookeeper.ClientCnxn:readResponse:815  - Reading reply 
sessionid:0x142d770be010063, packet:: clientPath:null serverPath:null 
finished:false header:: 568,4  replyHeader:: 568,483813,-101  request:: 
'/s6fr/collections/fr_green/leaders/shard1,F  response::
2013-12-09 14:41:28,635 
[main-SendThread(dc1-vt-dev-xen-06-vm-07.dev.dc1.kelkoo.net:2181)] DEBUG 
org.apache.zookeeper.ClientCnxn:readResponse:815  - Reading reply 
sessionid:0x142d770be010064, packet:: clientPath:null serverPath:null 
finished:false header:: 372,4  replyHeader:: 372,483813,-101  request:: 
'/s6fr/collections/fr_green/leaders/shard2,F  response::

server-02:
2013-12-09 14:41:51,381 
[main-SendThread(dc1-vt-dev-xen-06-vm-07.dev.dc1.kelkoo.net:2181)] DEBUG 
org.apache.zookeeper.ClientCnxn:readResponse:815  - Reading reply 
sessionid:0x142d770be01005e, packet:: clientPath:null serverPath:null 
finished:false header:: 1014,4  replyHeader:: 1014,483813,0  request:: 
'/s6fr/overseer_elect/leader,F  response:: 
#7b226964223a2239303837313832313732343837363839342d6463312d76742d6465762d78656e2d30362d766d2d30362e6465762e6463312e6b656c6b6f6f2e6e65743a383038375f736561726368736f6c726e6f646566722d6e5f30303030303030303634227d,s{483632,483632,1386599789203,1386599789203,0,0,0,90871821724876894,104,0,483632}
2013-12-09 14:41:51,383 
[main-SendThread(dc1-vt-dev-xen-06-vm-07.dev.dc1.kelkoo.net:2181)] DEBUG 
org.apache.zookeeper.ClientCnxn:readResponse:815  - Reading reply 
sessionid:0x142d770be01005e, packet:: clientPath:null serverPath:null 
finished:false header:: 1015,8  replyHeader:: 1015,483813,0  request:: 
'/s6fr/overseer/queue,F  response:: v{}
2013-12-09 14:41:51,385 
[main-SendThread(dc1-vt-dev-xen-06-vm-07.dev.dc1.kelkoo.net:2181)] DEBUG 
org.apache.zookeeper.ClientCnxn:readResponse:815  - Reading reply 
sessionid:0x142d770be01005e, packet:: clientPath:null serverPath:null 
finished:false header:: 1016,8  replyHeader:: 1016,483813,0  request:: 
'/s6fr/overseer/queue-work,F  response:: v{}

After about 10 minutes, there is a WARN message, a leader is found for the 
second collection, and it becomes possible to connect to the Solr instances:

2013-12-06 21:17:57,635 [main-EventThread] INFO  
org.apache.solr.common.cloud.ZkStateReader:process:212  - A cluster state 
change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
2013-12-06 21:27:58,719 [coreLoadExecutor-4-thread-2] WARN  
org.apache.solr.update.PeerSync:handleResponse:322  - PeerSync: core=fr_green 
url=http://dc1-vt-dev-xen-06-vm-07.dev.dc1.kelkoo.net:8080/searchsolrnodefr  
exception talking to 
http://dc1-vt-dev-xen-06-vm-06.dev.dc1.kelkoo.net:8080/searchsolrnodefr/fr_green/,
 failed
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: incref on 
a closed log: 
tlog{file=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/tlog/tlog.001
 refcount=1}
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at 
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:156)
at 
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:118)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
2013-12-06 21:27:58,730 [coreLoadExecutor-4-thread-2] INFO  
org.apache.solr.cloud.SyncStrategy:syncReplicas:134  - Leader's attempt to sync 
with shard failed, 
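
Until the election completes, one possible client-side workaround is to block 
until every shard reports a leader before sending queries. A minimal SolrJ 4.6 
sketch of such a check; the timeout value is an arbitrary choice for this 
sketch:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class WaitForLeaders {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server =
                new CloudSolrServer("dc1-vt-dev-xen-06-vm-07.dev.dc1.kelkoo.net:2181/s6fr");
        server.connect();
        ZkStateReader reader = server.getZkStateReader();
        for (String collection : new String[]{"fr_blue", "fr_green"}) {
            for (Slice slice : reader.getClusterState().getSlices(collection)) {
                // Blocks until a leader is registered for this shard in ZooKeeper,
                // or throws after the timeout (10 minutes here, chosen arbitrarily).
                reader.getLeaderRetry(collection, slice.getName(), 10 * 60 * 1000);
            }
        }
        // Only start querying once every shard of both collections has a leader.
        server.shutdown();
    }
}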

RE: SolrCloud 4.6.0 - leader election issue

2013-12-09 Thread Markus Jelsma
I can confirm I've seen this issue as well on trunk, with a very recent build.
 
-Original message-
 From: Elodie Sannier elodie.sann...@kelkoo.fr
 Sent: Monday 9th December 2013 16:43
 To: solr-user@lucene.apache.org
 Cc: search5t...@lists.kelkoo.com
 Subject: SolrCloud 4.6.0 - leader election issue
 