Hi Li

I ran into a very similar situation. GC was taking much longer than the 
configured ZooKeeper timeout. I had 3 nodes in the SolrCloud, and very often 
my entire cluster would end up totally messed up. Increasing the ZooKeeper 
timeout eventually helped. Before that, a temporary workaround was to run 
"rmr /solr/overseer/queue" in ZooKeeper (I don't remember whether I restarted 
Solr afterwards). I am not even sure this is the right thing to do, but it 
seemed to unblock me at the time, and there were no negative effects.
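
In case it's useful, this is roughly what I did -- just a sketch; the hosts, 
chroot, and timeout value are placeholders, and the -DzkClientTimeout property 
assumes your solr.xml still uses the default ${zkClientTimeout:...} 
substitution:

    # Start Solr with a larger ZK session timeout (value in ms)
    java -DzkClientTimeout=30000 -DzkHost=zk1:2181,zk2:2181,zk3:2181/solr -jar start.jar

    # Clear the overseer queue from the ZooKeeper CLI (interactive shell)
    ./zkCli.sh -server zk1:2181
    rmr /solr/overseer/queue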

Thanks

Bosco





On 4/29/16, 7:52 AM, "Erick Erickson" <erickerick...@gmail.com> wrote:

>Well, there have been lots of improvements since 4.6. You're right,
>logically when things come back up and are all reachable, it seems
>like it should be theoretically possible to bring a node back up.
>There have been situations where that doesn't happen, and fixes
>have gone in as those cases are identified....
>
>You might try reloading the core from the core admin (that's
>about the only thing you should try in SolrCloud from the
>core admin screen)....
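>
>Something like this against the CoreAdmin API, e.g. for the shard5 core
>you mentioned (the host is a placeholder):
>
>  curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=product2_shard5_replica1"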
>
>Best,
>Erick
>
>On Wed, Apr 27, 2016 at 10:58 AM, Li Ding <li.d...@bloomreach.com> wrote:
>> Hi Erick,
>>
>> I don't have the GC log.  But after the GC finishes, shouldn't the ZK ping
>> succeed and the core go back to a normal state?  From the log I posted,
>> the sequence is:
>>
>> 1) Solr detects it can't connect to ZK and reconnects to ZK
>> 2) Solr marks all cores as down
>> 3) Solr recovers each core; some succeed, some fail
>> 4) After 30 minutes, the cores that failed are still marked as down
>>
>> So my question is: if GC had taken too long during that 30-minute interval,
>> all cores should have failed.  And GC doesn't take longer than a minute,
>> since all requests served to the other cores succeed and the next ZK ping
>> should bring the core back to normal, right?  We have an active monitor
>> running at the same time that queries every core in distrib=false mode, and
>> every query succeeds.
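>>
>> For reference, the monitor hits each core with something like this (host
>> and core name are placeholders):
>>
>>   curl "http://localhost:8983/solr/product2_shard5_replica1/select?q=*:*&distrib=false&rows=0"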
>>
>> Thanks,
>>
>> Li
>>
>> On Tue, Apr 26, 2016 at 6:20 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>>> One of the reasons this happens is if you have very
>>> long GC cycles, longer than the Zookeeper "keep alive"
>>> timeout. During a full GC pause, Solr is unresponsive and
>>> if the ZK ping times out, ZK assumes the machine is
>>> gone and you get into this recovery state.
>>>
>>> So I'd collect GC logs and see if you have any
>>> stop-the-world GC pauses that take longer than the ZK
>>> timeout.
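>>>
>>> Something like these JVM options will get you the logs (Java 7-era flags;
>>> the log path is a placeholder):
>>>
>>>   -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
>>>   -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/solr/gc.log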
>>>
>>> see Mark Miller's primer on GC here:
>>> https://lucidworks.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/
>>>
>>> Best,
>>> Erick
>>>
>>> On Tue, Apr 26, 2016 at 2:13 PM, Li Ding <li.d...@bloomreach.com> wrote:
>>> > Thank you all for your help!
>>> >
>>> > The ZooKeeper log rolled over; this is from the Solr log:
>>> >
>>> > It looks like the Solr-to-ZK connection was lost for some reason:
>>> >
>>> > INFO  - 2016-04-21 12:37:57.536;
>>> > org.apache.solr.common.cloud.ConnectionManager; Watcher
>>> > org.apache.solr.common.cloud.ConnectionManager@19789a96
>>> > name:ZooKeeperConnection Watcher:{ZK HOSTS here} got event WatchedEvent
>>> > state:Disconnected type:None path:null path:null type:None
>>> >
>>> > INFO  - 2016-04-21 12:37:57.536;
>>> > org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
>>> >
>>> > INFO  - 2016-04-21 12:38:24.248;
>>> > org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection
>>> expired
>>> > - starting a new one...
>>> >
>>> > INFO  - 2016-04-21 12:38:24.262;
>>> > org.apache.solr.common.cloud.ConnectionManager; Waiting for client to
>>> > connect to ZooKeeper
>>> >
>>> > INFO  - 2016-04-21 12:38:24.269;
>>> > org.apache.solr.common.cloud.ConnectionManager; Connected:true
>>> >
>>> >
>>> > Then it publishes all cores on the hosts are down.  I just list three
>>> cores
>>> > here:
>>> >
>>> > INFO  - 2016-04-21 12:38:24.269; org.apache.solr.cloud.ZkController;
>>> > publishing core=product1_shard1_replica1 state=down
>>> >
>>> > INFO  - 2016-04-21 12:38:24.271; org.apache.solr.cloud.ZkController;
>>> > publishing core=collection1 state=down
>>> >
>>> > INFO  - 2016-04-21 12:38:24.272; org.apache.solr.cloud.ZkController;
>>> > numShards not found on descriptor - reading it from system property
>>> >
>>> > INFO  - 2016-04-21 12:38:24.289; org.apache.solr.cloud.ZkController;
>>> > publishing core=product2_shard5_replica1 state=down
>>> >
>>> > INFO  - 2016-04-21 12:38:24.292; org.apache.solr.cloud.ZkController;
>>> > publishing core=product2_shard13_replica1 state=down
>>> >
>>> >
>>> > product1 has only one shard with one replica, and it was able to become
>>> > active successfully:
>>> >
>>> > INFO  - 2016-04-21 12:38:26.383; org.apache.solr.cloud.ZkController;
>>> > Register replica - core:product1_shard1_replica1 address:http://
>>> > {internalIp}:8983/solr collection:product1 shard:shard1
>>> >
>>> > WARN  - 2016-04-21 12:38:26.385; org.apache.solr.cloud.ElectionContext;
>>> > cancelElection did not find election node to remove
>>> >
>>> > INFO  - 2016-04-21 12:38:26.393;
>>> > org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader
>>> > process for shard shard1
>>> >
>>> > INFO  - 2016-04-21 12:38:26.399;
>>> > org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found
>>> to
>>> > continue.
>>> >
>>> > INFO  - 2016-04-21 12:38:26.399;
>>> > org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new
>>> leader -
>>> > try and sync
>>> >
>>> > INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync
>>> > replicas to http://{internalIp}:8983/solr/product1_shard1_replica1/
>>> >
>>> > INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync
>>> > Success - now sync replicas to me
>>> >
>>> > INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy;
>>> > http://{internalIp}:8983/solr/product1_shard1_replica1/
>>> > has no replicas
>>> >
>>> > INFO  - 2016-04-21 12:38:26.399;
>>> > org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader:
>>> > http://{internalIp}:8983/solr/product1_shard1_replica1/ shard1
>>> >
>>> > INFO  - 2016-04-21 12:38:26.399;
>>> org.apache.solr.common.cloud.SolrZkClient;
>>> > makePath: /collections/product1/leaders/shard1
>>> >
>>> > INFO  - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; We
>>> are
>>> > http://{internalIp}:8983/solr/product1_shard1_replica1/ and leader is
>>> > http://{internalIp}:8983/solr/product1_shard1_replica1/
>>> >
>>> > INFO  - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; No
>>> > LogReplay needed for core=product1_replica1 baseURL=http://
>>> > {internalIp}:8983/solr
>>> >
>>> > INFO  - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; I am
>>> > the leader, no recovery necessary
>>> >
>>> > INFO  - 2016-04-21 12:38:26.413; org.apache.solr.cloud.ZkController;
>>> > publishing core=product1_shard1_replica1 state=active
>>> >
>>> >
>>> > product2 has 15 shards with one replica each, but only two of those
>>> > shards live on this machine.  This is one of the failed shards; I never
>>> > saw a message saying the core product2_shard5_replica1 became active:
>>> >
>>> > INFO  - 2016-04-21 12:38:26.616; org.apache.solr.cloud.ZkController;
>>> > Register replica - product2_shard5_replica1 address:http://
>>> > {internalIp}:8983/solr collection:product2 shard:shard5
>>> >
>>> > WARN  - 2016-04-21 12:38:26.618; org.apache.solr.cloud.ElectionContext;
>>> > cancelElection did not find election node to remove
>>> >
>>> > INFO  - 2016-04-21 12:38:26.625;
>>> > org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader
>>> > process for shard shard5
>>> >
>>> > INFO  - 2016-04-21 12:38:26.631;
>>> > org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found
>>> to
>>> > continue.
>>> >
>>> > INFO  - 2016-04-21 12:38:26.631;
>>> > org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new
>>> leader -
>>> > try and sync
>>> >
>>> > INFO  - 2016-04-21 12:38:26.631; org.apache.solr.cloud.SyncStrategy; Sync
>>> > replicas to http://
>>> > {internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/
>>> >
>>> > INFO  - 2016-04-21 12:38:26.631; org.apache.solr.cloud.SyncStrategy; Sync
>>> > Success - now sync replicas to me
>>> >
>>> > INFO  - 2016-04-21 12:38:26.632; org.apache.solr.cloud.SyncStrategy;
>>> > http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/
>>> > has no replicas
>>> >
>>> > INFO  - 2016-04-21 12:38:26.632;
>>> > org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader:
>>> > http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/
>>> > shard5
>>> >
>>> > INFO  - 2016-04-21 12:38:26.632;
>>> org.apache.solr.common.cloud.SolrZkClient;
>>> > makePath: /collections/product2_shard5_replica1/leaders/shard5
>>> >
>>> > INFO  - 2016-04-21 12:38:26.645; org.apache.solr.cloud.ZkController; We
>>> are
>>> > http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/
>>> and
>>> > leader is http://{internalIp}:8983/solr
>>> > product2_shard5_replica1_shard5_replica1/
>>> >
>>> > INFO  - 2016-04-21 12:38:26.646;
>>> > org.apache.solr.common.cloud.ZkStateReader; Updating cloud state from
>>> > ZooKeeper...
>>> >
>>> >
>>> > Before I restarted this server, a bunch of queries failed for the
>>> > collection product2.  But I don't think that would affect the core status.
>>> >
>>> >
>>> > Do you have any idea why this particular core was never published as
>>> > active?  From the log, most steps completed except the very last one,
>>> > publishing the state to ZK.
>>> >
>>> >
>>> > Thanks,
>>> >
>>> >
>>> > Li
>>> > On Thu, Apr 21, 2016 at 7:08 AM, Rajesh Hazari <rajeshhaz...@gmail.com>
>>> > wrote:
>>> >
>>> >> Hi Li,
>>> >>
>>> >> Do you see timeouts like "CLUSTERSTATUS the collection time out:180s"?
>>> >> If that's the case, this may be related to
>>> >> https://issues.apache.org/jira/browse/SOLR-7940,
>>> >> and I would say either apply the patch or upgrade.
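>>> >>
>>> >> A quick way to check is to grep the Solr log for that message (the log
>>> >> path is a placeholder):
>>> >>
>>> >>   grep "CLUSTERSTATUS the collection time out" /var/solr/logs/solr.log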
>>> >>
>>> >>
>>> >> Thanks,
>>> >> Rajesh,
>>> >> 8328789519,
>>> >> If I don't answer your call please leave a voicemail with your contact
>>> >> info, will return your call ASAP.
>>> >>
>>> >> On Thu, Apr 21, 2016 at 6:02 AM, YouPeng Yang <
>>> yypvsxf19870...@gmail.com>
>>> >> wrote:
>>> >>
>>> >> > Hi
>>> >> >    We have used Solr 4.6 for 2 years. If you post more logs, maybe we
>>> >> > can fix it.
>>> >> >
>>> >> > 2016-04-21 6:50 GMT+08:00 Li Ding <li.d...@bloomreach.com>:
>>> >> >
>>> >> > > Hi All,
>>> >> > >
>>> >> > > We are using SolrCloud 4.6.1.  We have observed the following
>>> >> > > behavior recently.  A Solr node in a SolrCloud cluster is up, but
>>> >> > > some of the cores on the node are marked as down in ZooKeeper.  If
>>> >> > > those cores are part of a multi-sharded collection with one replica,
>>> >> > > queries to that collection will fail.  However, when this happens,
>>> >> > > if we issue queries to the core directly, it returns 200 and correct
>>> >> > > results.  But once Solr gets into this state, the core stays marked
>>> >> > > as down forever unless we restart Solr.
>>> >> > >
>>> >> > > Has anyone seen this behavior before?  Is there any way for it to
>>> >> > > get out of this state on its own?
>>> >> > >
>>> >> > > Thanks,
>>> >> > >
>>> >> > > Li
>>> >> > >
>>> >> >
>>> >>
>>>
