This happened the second time I performed a restart.  After that, the
collection gets stuck at this point every time.  If I restart the leader node
as well, the core gets out of the recovering state.
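
For anyone trying to spot the same symptom across many cores, here is a
minimal sketch (not part of Solr) that scans log text for the ZkController
"publishing core=... state=..." lines quoted in this thread and reports cores
whose last published state is still "recovering".  The function names are
hypothetical, and it assumes the log format matches the 4.6.1 excerpts below.

```python
import re

# Matches ZkController lines such as:
#   "...ZkController; publishing core=<name> state=<state>"
# \s+ tolerates the line wrapping that mail clients introduce.
PUBLISH_RE = re.compile(r"publishing\s+core=(\S+)\s+state=(\w+)")


def last_published_states(log_text):
    """Return {core_name: last state published}, following log order."""
    states = {}
    for core, state in PUBLISH_RE.findall(log_text):
        states[core] = state  # later lines overwrite earlier ones
    return states


def stuck_cores(log_text, stuck_state="recovering"):
    """List cores whose most recent published state equals stuck_state."""
    states = last_published_states(log_text)
    return sorted(core for core, state in states.items() if state == stuck_state)


if __name__ == "__main__":
    # The three "publishing" lines from the stuck core's log in this thread.
    sample = (
        "INFO  - 2016-05-16 22:47:04.984; org.apache.solr.cloud.ZkController; "
        "publishing core=test_collection_112_shard1_replica2 state=down\n"
        "INFO  - 2016-05-16 22:47:15.906; org.apache.solr.cloud.ZkController; "
        "publishing core=test_collection_112_shard1_replica2 state=down\n"
        "INFO  - 2016-05-16 22:49:55.546; org.apache.solr.cloud.ZkController; "
        "publishing core=test_collection_112_shard1_replica2 state=recovering\n"
    )
    print(stuck_cores(sample))  # ['test_collection_112_shard1_replica2']
```

This only tells you what the node last published, not what ZooKeeper
currently holds, so treat it as a first-pass filter over the logs.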

On Mon, May 16, 2016 at 5:00 PM, Li Ding <li.d...@bloomreach.com> wrote:

> Hi Anshum,
>
> This is about restarting Solr with 1000 collections.  I created an environment
> with 1023 collections today.  All collections are empty.  During a repeated
> restart test, one of the cores was marked as "recovering" and stayed stuck
> forever.  We run Solr 4.6.1 with 3 zk hosts and 8 solr hosts.  Here
> are the relevant logs:
>
> --- These are the logs for the core stuck in "recovering"
>
> INFO  - 2016-05-16 22:47:04.984; org.apache.solr.cloud.ZkController;
> publishing core=test_collection_112_shard1_replica2 state=down
>
> INFO  - 2016-05-16 22:47:05.999; org.apache.solr.core.SolrCore;
> [test_collection_112_shard1_replica2]  CLOSING SolrCore
> org.apache.solr.core.SolrCore@1e48619
>
> INFO  - 2016-05-16 22:47:06.001; org.apache.solr.core.SolrCore;
> [test_collection_112_shard1_replica2] Closing main searcher on request.
>
> INFO  - 2016-05-16 22:47:06.001;
> org.apache.solr.core.CachingDirectoryFactory; looking to close
> /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/data/index
> [CachedDir<<refCount=0;path=/mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/data/index;done=false>>]...
>
> INFO  - 2016-05-16 22:47:15.745;
> org.apache.solr.core.CorePropertiesLocator; Found core
> test_collection_112_shard1_replica2 in
> /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/
>
> INFO  - 2016-05-16 22:47:15.906; org.apache.solr.cloud.ZkController;
> publishing core=test_collection_112_shard1_replica2 state=down
>
> INFO  - 2016-05-16 22:47:15.973; org.apache.solr.cloud.ZkController;
> waiting to find shard id in clusterstate for
> test_collection_112_shard1_replica2
>
> INFO  - 2016-05-16 22:47:15.974; org.apache.solr.core.CoreContainer;
> Creating SolrCore 'test_collection_112_shard1_replica2' using instanceDir:
> /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2
>
> INFO  - 2016-05-16 22:47:15.975; org.apache.solr.cloud.ZkController; Check
> for collection zkNode:test_collection_112
>
> INFO  - 2016-05-16 22:47:16.136; org.apache.solr.cloud.ZkController; Load
> collection config from:/collections/test_collection_112
>
> INFO  - 2016-05-16 22:47:16.509; org.apache.solr.core.SolrResourceLoader;
> new SolrResourceLoader for directory:
> '/mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/'
>
> INFO  - 2016-05-16 22:49:18.409; org.apache.solr.core.SolrCore;
> [test_collection_112_shard1_replica2] Opening new SolrCore at
> /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/,
> dataDir=/mnt/solrcloud_latest/solr//test_collection_112_shard1_replica2/data/
>
> INFO  - 2016-05-16 22:49:54.860; org.apache.solr.cloud.ZkController;
> Register replica - core:test_collection_112_shard1_replica2 address:
> http://10.10.1.8:8983/solr collection:test_collection_112 shard:shard1
>
> INFO  - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; We
> are http://10.10.1.8:8983/solr/test_collection_112_shard1_replica2/ and
> leader is http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/
>
> INFO  - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; No
> LogReplay needed for core=test_collection_112_shard1_replica2 baseURL=
> http://10.10.1.8:8983/solr
>
> INFO  - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; Core
> needs to recover:test_collection_112_shard1_replica2
>
> INFO  - 2016-05-16 22:49:55.545; org.apache.solr.cloud.RecoveryStrategy;
> Starting recovery process.  core=test_collection_112_shard1_replica2
> recoveringAfterStartup=true
>
> INFO  - 2016-05-16 22:49:55.546; org.apache.solr.cloud.ZkController;
> publishing core=test_collection_112_shard1_replica2 state=recovering
>
> INFO  - 2016-05-16 22:50:01.562; org.apache.solr.cloud.RecoveryStrategy;
> Attempting to PeerSync from
> http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/
> core=test_collection_112_shard1_replica2 - recoveringAfterStartup=true
>
> INFO  - 2016-05-16 22:50:01.562; org.apache.solr.update.PeerSync;
> PeerSync: core=test_collection_112_shard1_replica2 url=
> http://10.10.1.8:8983/solr START replicas=[
> http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/]
> nUpdates=100
>
> INFO  - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy;
> PeerSync Recovery was not successful - trying replication.
> core=test_collection_112_shard1_replica2
>
> INFO  - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy;
> Starting Replication Recovery. core=test_collection_112_shard1_replica2
>
> INFO  - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy;
> Begin buffering updates. core=test_collection_112_shard1_replica2
>
> INFO  - 2016-05-16 22:50:01.577; org.apache.solr.cloud.RecoveryStrategy;
> Attempting to replicate from
> http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/.
> core=test_collection_112_shard1_replica2
>
> ----- After this line, there is no more log output for the core, and its
> status stays stuck forever.
>
>
> On the leader node, there are no further logs regarding
> test_collection_112 after these messages:
>
> INFO  - 2016-05-16 22:47:07.572; org.apache.solr.cloud.SyncStrategy; Sync
> replicas to
> http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/
>
> INFO  - 2016-05-16 22:47:07.572; org.apache.solr.cloud.SyncStrategy;
> http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/ has no
> replicas
>
> INFO  - 2016-05-16 22:47:07.572;
> org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader:
> http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/ shard1
>
> INFO  - 2016-05-16 22:47:07.573;
> org.apache.solr.common.cloud.SolrZkClient; makePath:
> /collections/test_collection_112/leaders/shard1
>
> INFO  - 2016-05-16 22:49:59.554;
> org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores
> params={coreNodeName=core_node2&onlyIfLeaderActive=true&state=recovering&nodeName=10.10.1.8:8983_solr&action=PREPRECOVERY&checkLive=true&core=test_collection_112_shard1_replica1&wt=javabin&onlyIfLeader=true&version=2}
> status=0 QTime=4001
>
>
> Are there any known bugs?  All collections are empty.
>
>
> Thanks,
>
>
> Li
>
> On Mon, May 16, 2016 at 12:50 PM, Anshum Gupta <ans...@anshumgupta.net>
> wrote:
>
>> I think you are approaching the problem all wrong. This seems to be what is
>> described as an x-y problem (
>> https://people.apache.org/~hossman/#xyproblem).
>> Can you tell us more about:
>> * What's your setup like? SolrCloud - Version, number of shards, is there
>> any custom code, etc.
>> * Did you start seeing this more recently? If so, what did you change?
>>
>> To answer your question directly: there is no way in SolrCloud to disable or
>> remove the concept of 'leaders'. However, there would be other ways to fix
>> your setup, and get rid of the issues you are facing once you share more
>> details.
>>
>>
>> On Mon, May 16, 2016 at 12:33 PM, Li Ding <li.d...@bloomreach.com> wrote:
>>
>> > Hi all,
>> >
>> > We have a unique scenario where we don't need leaders in every collection
>> > to recover from failures.  The index never changes.  But we have faced
>> > problems where either zk marked a core as down while the core works fine
>> > for non-distributed queries, or, during a restart, the core never comes up.
>> > My question is: is there any simple way to disable leaders and
>> > leader election in SolrCloud?  We do use multi-shard and distributed
>> > queries.  But in our situation, we don't need leaders to maintain
>> > the correct status of the index.  So if we can get rid of that part, our
>> > solr restarts will be more robust.
>> >
>> > Any suggestions will be appreciated.
>> >
>> > Thanks,
>> >
>> > Li
>> >
>>
>>
>>
>> --
>> Anshum Gupta
>>
>
>
