This happened the second time I performed a restart, but after that, this collection gets stuck here every time. If I restart the leader node as well, the core can get out of the recovering state.
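For anyone trying to reproduce this, a quick way to spot replicas stuck in a non-active state is to pull clusterstate.json from ZooKeeper (in Solr 4.x all collections live in that single node) and scan the replica states. A minimal sketch, assuming that layout; the function name and sample data are illustrative, not from Solr itself:

```python
import json

def find_stuck_replicas(clusterstate_json):
    """Return (collection, shard, core, state) for every replica not 'active'.

    clusterstate_json is the raw JSON string stored at /clusterstate.json
    in ZooKeeper (Solr 4.x keeps all collections in this one node).
    """
    stuck = []
    state = json.loads(clusterstate_json)
    for coll_name, coll in state.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica in shard.get("replicas", {}).values():
                if replica.get("state") != "active":
                    stuck.append((coll_name, shard_name,
                                  replica.get("core"), replica.get("state")))
    return stuck

# Toy clusterstate mirroring the situation in the logs below.
sample = json.dumps({
    "test_collection_112": {
        "shards": {
            "shard1": {
                "replicas": {
                    "core_node1": {"core": "test_collection_112_shard1_replica1",
                                   "state": "active"},
                    "core_node2": {"core": "test_collection_112_shard1_replica2",
                                   "state": "recovering"},
                }
            }
        }
    }
})

print(find_stuck_replicas(sample))
```

Running this against a real clusterstate.json dump makes it easy to see which of the ~1000 collections is the stuck one after each restart.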
On Mon, May 16, 2016 at 5:00 PM, Li Ding <li.d...@bloomreach.com> wrote:
> Hi Anshum,
>
> This is for restarting Solr with 1000 collections. I created an environment
> with 1023 collections today; all collections are empty. During repeated
> restart tests, one of the cores is marked as "recovering" and stays stuck
> there forever. Solr is 4.6.1, and we have 3 ZK hosts and 8 Solr hosts.
> Here are the relevant logs.
>
> --- Logs for the core stuck at "recovering":
>
> INFO - 2016-05-16 22:47:04.984; org.apache.solr.cloud.ZkController; publishing core=test_collection_112_shard1_replica2 state=down
> INFO - 2016-05-16 22:47:05.999; org.apache.solr.core.SolrCore; [test_collection_112_shard1_replica2] CLOSING SolrCore org.apache.solr.core.SolrCore@1e48619
> INFO - 2016-05-16 22:47:06.001; org.apache.solr.core.SolrCore; [test_collection_112_shard1_replica2] Closing main searcher on request.
> INFO - 2016-05-16 22:47:06.001; org.apache.solr.core.CachingDirectoryFactory; looking to close /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/data/index [CachedDir<<refCount=0;path=/mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/data/index;done=false>>]...
> INFO - 2016-05-16 22:47:15.745; org.apache.solr.core.CorePropertiesLocator; Found core test_collection_112_shard1_replica2 in /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/
> INFO - 2016-05-16 22:47:15.906; org.apache.solr.cloud.ZkController; publishing core=test_collection_112_shard1_replica2 state=down
> INFO - 2016-05-16 22:47:15.973; org.apache.solr.cloud.ZkController; waiting to find shard id in clusterstate for test_collection_112_shard1_replica2
> INFO - 2016-05-16 22:47:15.974; org.apache.solr.core.CoreContainer; Creating SolrCore 'test_collection_112_shard1_replica2' using instanceDir: /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2
> INFO - 2016-05-16 22:47:15.975; org.apache.solr.cloud.ZkController; Check for collection zkNode:test_collection_112
> INFO - 2016-05-16 22:47:16.136; org.apache.solr.cloud.ZkController; Load collection config from:/collections/test_collection_112
> INFO - 2016-05-16 22:47:16.509; org.apache.solr.core.SolrResourceLoader; new SolrResourceLoader for directory: '/mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/'
> INFO - 2016-05-16 22:49:18.409; org.apache.solr.core.SolrCore; [test_collection_112_shard1_replica2] Opening new SolrCore at /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/, dataDir=/mnt/solrcloud_latest/solr//test_collection_112_shard1_replica2/data/
> INFO - 2016-05-16 22:49:54.860; org.apache.solr.cloud.ZkController; Register replica - core:test_collection_112_shard1_replica2 address: http://10.10.1.8:8983/solr collection:test_collection_112 shard:shard1
> INFO - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; We are http://10.10.1.8:8983/solr/test_collection_112_shard1_replica2/ and leader is http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/
> INFO - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; No LogReplay needed for core=test_collection_112_shard1_replica2 baseURL=http://10.10.1.8:8983/solr
> INFO - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; Core needs to recover:test_collection_112_shard1_replica2
> INFO - 2016-05-16 22:49:55.545; org.apache.solr.cloud.RecoveryStrategy; Starting recovery process. core=test_collection_112_shard1_replica2 recoveringAfterStartup=true
> INFO - 2016-05-16 22:49:55.546; org.apache.solr.cloud.ZkController; publishing core=test_collection_112_shard1_replica2 state=recovering
> INFO - 2016-05-16 22:50:01.562; org.apache.solr.cloud.RecoveryStrategy; Attempting to PeerSync from http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/ core=test_collection_112_shard1_replica2 - recoveringAfterStartup=true
> INFO - 2016-05-16 22:50:01.562; org.apache.solr.update.PeerSync; PeerSync: core=test_collection_112_shard1_replica2 url=http://10.10.1.8:8983/solr START replicas=[http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/] nUpdates=100
> INFO - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy; PeerSync Recovery was not successful - trying replication. core=test_collection_112_shard1_replica2
> INFO - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy; Starting Replication Recovery. core=test_collection_112_shard1_replica2
> INFO - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy; Begin buffering updates. core=test_collection_112_shard1_replica2
> INFO - 2016-05-16 22:50:01.577; org.apache.solr.cloud.RecoveryStrategy; Attempting to replicate from http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/. core=test_collection_112_shard1_replica2
>
> ----- After this line, there is no further info about the core, and its status stays stuck forever.
>
> On the leader node, there are no logs regarding test_collection_112 after these messages:
>
> INFO - 2016-05-16 22:47:07.572; org.apache.solr.cloud.SyncStrategy; Sync replicas to http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/
> INFO - 2016-05-16 22:47:07.572; org.apache.solr.cloud.SyncStrategy; http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/ has no replicas
> INFO - 2016-05-16 22:47:07.572; org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/ shard1
> INFO - 2016-05-16 22:47:07.573; org.apache.solr.common.cloud.SolrZkClient; makePath: /collections/test_collection_112/leaders/shard1
> INFO - 2016-05-16 22:49:59.554; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={coreNodeName=core_node2&onlyIfLeaderActive=true&state=recovering&nodeName=10.10.1.8:8983_solr&action=PREPRECOVERY&checkLive=true&core=test_collection_112_shard1_replica1&wt=javabin&onlyIfLeader=true&version=2} status=0 QTime=4001
>
> Is there any known bug? All collections are empty.
>
> Thanks,
>
> Li
>
> On Mon, May 16, 2016 at 12:50 PM, Anshum Gupta <ans...@anshumgupta.net> wrote:
>
>> I think you are approaching the problem all wrong. This seems to be what is
>> described as an X-Y problem (https://people.apache.org/~hossman/#xyproblem).
>> Can you tell us more about:
>> * What's your setup like? SolrCloud version, number of shards, any custom code, etc.
>> * Did you start seeing this recently? If so, what did you change?
>>
>> To answer your question directly: there is no way in SolrCloud to disable
>> or remove the concept of 'leaders'.
>> However, there may be other ways to fix your setup and get rid of the
>> issues you are facing, once you share more details.
>>
>> On Mon, May 16, 2016 at 12:33 PM, Li Ding <li.d...@bloomreach.com> wrote:
>>
>> > Hi all,
>> >
>> > We have a unique scenario where we don't need leaders in any collection
>> > to recover from failures; the index never changes. But we have faced
>> > problems where either ZK marks a core as down even though the core serves
>> > non-distributed queries fine, or the core never comes up during a restart.
>> > My question is: is there any simple way to disable leaders and leader
>> > election in SolrCloud? We do use multiple shards and distributed queries,
>> > but in our situation we don't need leaders to maintain the correct status
>> > of the index. If we could get rid of that part, our Solr restarts would
>> > be more robust.
>> >
>> > Any suggestions will be appreciated.
>> >
>> > Thanks,
>> >
>> > Li
>>
>> --
>> Anshum Gupta
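A note for readers who land on this thread with a replica hung in "recovering": Solr's CoreAdmin API has a REQUESTRECOVERY action that asks a specific core to retry the recovery process without restarting the whole node. A minimal sketch of building that request; the host and core name are taken from the logs above, and whether this actually unsticks the hang described in this thread on 4.6.1 is not guaranteed:

```python
from urllib.parse import urlencode

def request_recovery_url(base_url, core):
    """Build a CoreAdmin REQUESTRECOVERY request URL for one core.

    REQUESTRECOVERY tells the named core to re-enter recovery; it is part
    of the CoreAdmin API in Solr 4.x and later.
    """
    params = urlencode({"action": "REQUESTRECOVERY", "core": core, "wt": "json"})
    return f"{base_url}/admin/cores?{params}"

# Host and core name from the logs in this thread.
url = request_recovery_url("http://10.10.1.8:8983/solr",
                           "test_collection_112_shard1_replica2")
print(url)
# Fetch it against a live node with e.g. urllib.request.urlopen(url).
```

If recovery is wedged on the replica side rather than the leader side, this is cheaper to try than a full node restart; the leader restart described at the top of the thread remains the fallback.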