Re: Disable leaders in SolrCloud mode

2016-05-16 Thread Shawn Heisey
On 5/16/2016 6:29 PM, Li Ding wrote:
> This happened the second time I performed a restart.  Since then, this
> collection gets stuck here every time.  If I restart the leader node
> as well, the core can get out of the recovering state.
>
> On Mon, May 16, 2016 at 5:00 PM, Li Ding  wrote:
>> This is about restarting Solr with 1000 collections.  I created an
>> environment with 1023 collections today.  All collections are empty.
>> During a repeated restart test, one of the cores was marked as
>> "recovering" and stayed stuck there forever.  We run Solr 4.6.1 with
>> 3 zk hosts and 8 Solr hosts.  Here are the relevant logs:

SolrCloud does not handle that many collections very well, especially
with a lot of them per server.  After I did some experimentation with a
lot more collections than you have, I opened this issue:

https://issues.apache.org/jira/browse/SOLR-7191

The stability and scalability get a little bit better with each new
release, but when you push it too far, it does not work well.

How many Solr instances are in your cloud?  If you want good performance
and stability with a thousand collections, you'll probably need a lot of
servers, so each server is only handling a relatively small number of
cores.  I do not have any precise information about how many cores
(shard replicas) is too many for one server.  You should make that
number as small as you can.
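
As a rough illustration of the cores-per-server point, the arithmetic can be
sketched as below.  The collection and host counts come from this thread; one
shard and two replicas per collection are assumptions read off the core names
in the posted logs (shard1_replica1, shard1_replica2), and the function name
is hypothetical.

```python
import math


def cores_per_host(collections, shards_per_collection, replicas_per_shard, hosts):
    """Average number of cores (shard replicas) each Solr host has to load."""
    total_cores = collections * shards_per_collection * replicas_per_shard
    return total_cores / hosts


# 1023 collections on 8 Solr hosts are the numbers given in this thread.
# One shard and two replicas per collection are assumptions.
average = cores_per_host(1023, 1, 2, 8)
print(f"about {math.ceil(average)} cores per host")
```

Under those assumptions every host has to load a few hundred cores at
startup, which is the kind of per-server load described above.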

Upgrading Solr *might* help with this situation, but really I think
you'll need to either run fewer collections or run more instances.  You
might be able to run multiple Solr instances per server, but if you do
that, be sure that you don't give all your memory to java.  Enough
memory must be available to the operating system for caching the
important parts of your index.

Thanks,
Shawn



Re: Disable leaders in SolrCloud mode

2016-05-16 Thread Li Ding
This happened the second time I performed a restart.  Since then, this
collection gets stuck here every time.  If I restart the leader node
as well, the core can get out of the recovering state.
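
To see which replicas are stuck and which leader they depend on, a small
script along these lines might help.  It is only a sketch: it assumes the
cluster state has already been fetched from ZooKeeper and parsed as JSON, and
that it follows the layout collection -> "shards" -> "replicas", with each
replica carrying "state", "core" and an optional "leader": "true" flag.  The
sample data is hand-written for illustration (the core_node names are made
up), not real cluster output.

```python
import json


def stuck_replicas(clusterstate, state="recovering"):
    """List (collection, shard, core, leader_core) for replicas in `state`."""
    found = []
    for coll_name, coll in clusterstate.items():
        for shard_name, shard in coll.get("shards", {}).items():
            replicas = list(shard.get("replicas", {}).values())
            # Assumed layout: the leader replica carries "leader": "true".
            leader = next(
                (r.get("core") for r in replicas if r.get("leader") == "true"),
                None,
            )
            for replica in replicas:
                if replica.get("state") == state:
                    found.append(
                        (coll_name, shard_name, replica.get("core"), leader)
                    )
    return found


# Hand-written sample shaped like the cores named in this thread.
SAMPLE = json.loads("""
{
  "test_collection_112": {
    "shards": {
      "shard1": {
        "replicas": {
          "core_node1": {
            "state": "active",
            "core": "test_collection_112_shard1_replica1",
            "base_url": "http://10.10.1.6:8983/solr",
            "leader": "true"
          },
          "core_node2": {
            "state": "recovering",
            "core": "test_collection_112_shard1_replica2",
            "base_url": "http://10.10.1.8:8983/solr"
          }
        }
      }
    }
  }
}
""")

for row in stuck_replicas(SAMPLE):
    print(row)
```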

On Mon, May 16, 2016 at 5:00 PM, Li Ding  wrote:

> Hi Anshum,
>
> This is about restarting Solr with 1000 collections.  I created an
> environment with 1023 collections today.  All collections are empty.
> During a repeated restart test, one of the cores was marked as
> "recovering" and stayed stuck there forever.  We run Solr 4.6.1 with
> 3 zk hosts and 8 Solr hosts.  Here are the relevant logs:
>
> ---This is the logs for the core stuck at "recovering"
>
> INFO  - 2016-05-16 22:47:04.984; org.apache.solr.cloud.ZkController;
> publishing core=test_collection_112_shard1_replica2 state=down
>
> INFO  - 2016-05-16 22:47:05.999; org.apache.solr.core.SolrCore;
> [test_collection_112_shard1_replica2]  CLOSING SolrCore
> org.apache.solr.core.SolrCore@1e48619
>
> INFO  - 2016-05-16 22:47:06.001; org.apache.solr.core.SolrCore;
> [test_collection_112_shard1_replica2] Closing main searcher on request.
>
> INFO  - 2016-05-16 22:47:06.001;
> org.apache.solr.core.CachingDirectoryFactory; looking to close
> /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/data/index
> [CachedDir< /test_collection_112_shard1_replica2/data/index;done=false>>]...
>
> INFO  - 2016-05-16 22:47:15.745;
> org.apache.solr.core.CorePropertiesLocator; Found core
> test_collection_112_shard1_replica2 in
> /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/
>
> INFO  - 2016-05-16 22:47:15.906; org.apache.solr.cloud.ZkController;
> publishing core=test_collection_112_shard1_replica2 state=down
>
> INFO  - 2016-05-16 22:47:15.973; org.apache.solr.cloud.ZkController;
> waiting to find shard id in clusterstate for
> test_collection_112_shard1_replica2
>
> INFO  - 2016-05-16 22:47:15.974; org.apache.solr.core.CoreContainer;
> Creating SolrCore 'test_collection_112_shard1_replica2' using instanceDir:
> /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2
>
> INFO  - 2016-05-16 22:47:15.975; org.apache.solr.cloud.ZkController; Check
> for collection zkNode:test_collection_112
>
> INFO  - 2016-05-16 22:47:16.136; org.apache.solr.cloud.ZkController; Load
> collection config from:/collections/test_collection_112
>
> INFO  - 2016-05-16 22:47:16.509; org.apache.solr.core.SolrResourceLoader;
> new SolrResourceLoader for directory:
> '/mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/'
>
> INFO  - 2016-05-16 22:49:18.409; org.apache.solr.core.SolrCore;
> [test_collection_112_shard1_replica2] Opening new SolrCore at
> /mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/,
> dataDir=/mnt/solrcloud_latest/solr//test_collection_112_shard1_replica2/data/
>
> INFO  - 2016-05-16 22:49:54.860; org.apache.solr.cloud.ZkController;
> Register replica - core:test_collection_112_shard1_replica2 address:
> http://10.10.1.8:8983/solr collection:test_collection_112 shard:shard1
>
> INFO  - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; We
> are http://10.10.1.8:8983/solr/test_collection_112_shard1_replica2/ and
> leader is http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/
>
> INFO  - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; No
> LogReplay needed for core=test_collection_112_shard1_replica2 baseURL=
> http://10.10.1.8:8983/solr
>
> INFO  - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; Core
> needs to recover:test_collection_112_shard1_replica2
>
> INFO  - 2016-05-16 22:49:55.545; org.apache.solr.cloud.RecoveryStrategy;
> Starting recovery process.  core=test_collection_112_shard1_replica2
> recoveringAfterStartup=true
>
> INFO  - 2016-05-16 22:49:55.546; org.apache.solr.cloud.ZkController;
> publishing core=test_collection_112_shard1_replica2 state=recovering
>
> INFO  - 2016-05-16 22:50:01.562; org.apache.solr.cloud.RecoveryStrategy;
> Attempting to PeerSync from
> http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/
> core=test_collection_112_shard1_replica2 - recoveringAfterStartup=true
>
> INFO  - 2016-05-16 22:50:01.562; org.apache.solr.update.PeerSync;
> PeerSync: core=test_collection_112_shard1_replica2 url=
> http://10.10.1.8:8983/solr START replicas=[
> http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/]
> nUpdates=100
>
> INFO  - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy;
> PeerSync Recovery was not successful - trying replication.
> core=test_collection_112_shard1_replica2
>
> INFO  - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy;
> Starting Replication Recovery. core=test_collection_112_shard1_replica2
>
> INFO  - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy;
> Begin buffering updates. core=test_collection_112_shard1_replica2
>
> INFO  - 2016-05-16 22:50:01.577; org.apache.solr.cloud.RecoveryStrategy;
> Attempting to replicate from
> http://10.10.1.6:8983/solr/test

Re: Disable leaders in SolrCloud mode

2016-05-16 Thread Li Ding
Hi Anshum,

This is about restarting Solr with 1000 collections.  I created an
environment with 1023 collections today.  All collections are empty.
During a repeated restart test, one of the cores was marked as
"recovering" and stayed stuck there forever.  We run Solr 4.6.1 with
3 zk hosts and 8 Solr hosts.  Here are the relevant logs:

---This is the logs for the core stuck at "recovering"

INFO  - 2016-05-16 22:47:04.984; org.apache.solr.cloud.ZkController;
publishing core=test_collection_112_shard1_replica2 state=down

INFO  - 2016-05-16 22:47:05.999; org.apache.solr.core.SolrCore;
[test_collection_112_shard1_replica2]  CLOSING SolrCore
org.apache.solr.core.SolrCore@1e48619

INFO  - 2016-05-16 22:47:06.001; org.apache.solr.core.SolrCore;
[test_collection_112_shard1_replica2] Closing main searcher on request.

INFO  - 2016-05-16 22:47:06.001;
org.apache.solr.core.CachingDirectoryFactory; looking to close
/mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/data/index
[CachedDir<>]...

INFO  - 2016-05-16 22:47:15.745;
org.apache.solr.core.CorePropertiesLocator; Found core
test_collection_112_shard1_replica2 in
/mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/

INFO  - 2016-05-16 22:47:15.906; org.apache.solr.cloud.ZkController;
publishing core=test_collection_112_shard1_replica2 state=down

INFO  - 2016-05-16 22:47:15.973; org.apache.solr.cloud.ZkController;
waiting to find shard id in clusterstate for
test_collection_112_shard1_replica2

INFO  - 2016-05-16 22:47:15.974; org.apache.solr.core.CoreContainer;
Creating SolrCore 'test_collection_112_shard1_replica2' using instanceDir:
/mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2

INFO  - 2016-05-16 22:47:15.975; org.apache.solr.cloud.ZkController; Check
for collection zkNode:test_collection_112

INFO  - 2016-05-16 22:47:16.136; org.apache.solr.cloud.ZkController; Load
collection config from:/collections/test_collection_112

INFO  - 2016-05-16 22:47:16.509; org.apache.solr.core.SolrResourceLoader;
new SolrResourceLoader for directory:
'/mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/'

INFO  - 2016-05-16 22:49:18.409; org.apache.solr.core.SolrCore;
[test_collection_112_shard1_replica2] Opening new SolrCore at
/mnt/solrcloud_latest/solr/test_collection_112_shard1_replica2/,
dataDir=/mnt/solrcloud_latest/solr//test_collection_112_shard1_replica2/data/

INFO  - 2016-05-16 22:49:54.860; org.apache.solr.cloud.ZkController;
Register replica - core:test_collection_112_shard1_replica2 address:
http://10.10.1.8:8983/solr collection:test_collection_112 shard:shard1

INFO  - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; We are
http://10.10.1.8:8983/solr/test_collection_112_shard1_replica2/ and leader
is http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/

INFO  - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; No
LogReplay needed for core=test_collection_112_shard1_replica2 baseURL=
http://10.10.1.8:8983/solr

INFO  - 2016-05-16 22:49:55.324; org.apache.solr.cloud.ZkController; Core
needs to recover:test_collection_112_shard1_replica2

INFO  - 2016-05-16 22:49:55.545; org.apache.solr.cloud.RecoveryStrategy;
Starting recovery process.  core=test_collection_112_shard1_replica2
recoveringAfterStartup=true

INFO  - 2016-05-16 22:49:55.546; org.apache.solr.cloud.ZkController;
publishing core=test_collection_112_shard1_replica2 state=recovering

INFO  - 2016-05-16 22:50:01.562; org.apache.solr.cloud.RecoveryStrategy;
Attempting to PeerSync from
http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/
core=test_collection_112_shard1_replica2 - recoveringAfterStartup=true

INFO  - 2016-05-16 22:50:01.562; org.apache.solr.update.PeerSync; PeerSync:
core=test_collection_112_shard1_replica2 url=http://10.10.1.8:8983/solr
START replicas=[
http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/]
nUpdates=100

INFO  - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy;
PeerSync Recovery was not successful - trying replication.
core=test_collection_112_shard1_replica2

INFO  - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy;
Starting Replication Recovery. core=test_collection_112_shard1_replica2

INFO  - 2016-05-16 22:50:01.572; org.apache.solr.cloud.RecoveryStrategy;
Begin buffering updates. core=test_collection_112_shard1_replica2

INFO  - 2016-05-16 22:50:01.577; org.apache.solr.cloud.RecoveryStrategy;
Attempting to replicate from
http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/.
core=test_collection_112_shard1_replica2

After this line, there is no further log output about the core, and its
status stays stuck at "recovering" forever.
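
The timestamps in the excerpt above also show long pauses before recovery
even starts, for example roughly two minutes between the SolrResourceLoader
entry (22:47:16.509) and "Opening new SolrCore" (22:49:18.409).  A minimal
sketch for spotting such gaps in a log like this one follows.  The regular
expression only assumes the "INFO  - <timestamp>;" prefix seen above; the
function name and the 30-second threshold are hypothetical choices.

```python
import re
from datetime import datetime

# Matches the start of an entry such as "INFO  - 2016-05-16 22:47:16.509; ..."
ENTRY = re.compile(r"^INFO\s+-\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3});")


def find_gaps(lines, threshold_seconds=30.0):
    """Return (previous, current, seconds) for consecutive entries far apart."""
    stamps = []
    for line in lines:
        match = ENTRY.match(line)
        if match:  # wrapped continuation lines carry no timestamp and are skipped
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    gaps = []
    for previous, current in zip(stamps, stamps[1:]):
        seconds = (current - previous).total_seconds()
        if seconds >= threshold_seconds:
            gaps.append((previous, current, seconds))
    return gaps


# Three entries copied from the excerpt above (first line of each only).
LOG = [
    "INFO  - 2016-05-16 22:47:16.509; org.apache.solr.core.SolrResourceLoader;",
    "INFO  - 2016-05-16 22:49:18.409; org.apache.solr.core.SolrCore;",
    "INFO  - 2016-05-16 22:49:54.860; org.apache.solr.cloud.ZkController;",
]

for previous, current, seconds in find_gaps(LOG):
    print(f"{seconds:.1f}s between {previous.time()} and {current.time()}")
```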


On the leader node, there are no log entries regarding
test_collection_112 after these messages:

INFO  - 2016-05-16 22:47:07.572; org.apache.solr.cloud.SyncStrategy; Sync
replicas to http://10.10.1.6:8983/solr/test_collection_112_shard1_replica1/

INFO  - 2016-05-16 22:47:07.572; org.apache.solr.cloud.SyncStrategy;
htt

Re: Disable leaders in SolrCloud mode

2016-05-16 Thread Anshum Gupta
I think you are approaching the problem all wrong. This seems to be what is
described as an XY problem (https://people.apache.org/~hossman/#xyproblem).
Can you tell us more about:
* What's your setup like? SolrCloud - Version, number of shards, is there
any custom code, etc.
* Did you start seeing this more recently? If so, what did you change?

To answer your question up front: there is no way in SolrCloud to disable or
remove the concept of 'leaders'. However, there may be other ways to fix
your setup and get rid of the issues you are facing, once you share more
details.


On Mon, May 16, 2016 at 12:33 PM, Li Ding  wrote:

> Hi all,
>
> We have a unique scenario where we don't need leaders in every collection
> to recover from failures.  The indexing never changes.  But we have faced
> problems where either zk marks a core as down while the core works fine for
> non-distributed queries, or the core never comes up during a restart.  My
> question: is there any simple way to disable leaders and leader election
> in SolrCloud?  We do use multi-shard and distributed queries.  But in our
> situation, we don't need leaders to maintain the correct status of the
> index.  So if we can get rid of that part, our Solr restarts will be more
> robust.
>
> Any suggestions will be appreciated.
>
> Thanks,
>
> Li
>



-- 
Anshum Gupta


Disable leaders in SolrCloud mode

2016-05-16 Thread Li Ding
Hi all,

We have a unique scenario where we don't need leaders in every collection
to recover from failures.  The indexing never changes.  But we have faced
problems where either zk marks a core as down while the core works fine for
non-distributed queries, or the core never comes up during a restart.  My
question: is there any simple way to disable leaders and leader election
in SolrCloud?  We do use multi-shard and distributed queries.  But in our
situation, we don't need leaders to maintain the correct status of the
index.  So if we can get rid of that part, our Solr restarts will be more
robust.

Any suggestions will be appreciated.

Thanks,

Li