[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2017-03-15 Thread Varun Thacker (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926810#comment-15926810
 ] 

Varun Thacker commented on SOLR-7191:
-

> Restarting a single node takes over two minutes in good circumstance

I posted some numbers on SOLR-10265 which you might find helpful. You can run 
that test on your cluster to see the speed of the overseer. 

A dedicated overseer node can help you in the meantime as well:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-ADDROLE:AddaRole
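
For reference, assigning the overseer role is a single Collections API call
(action=ADDROLE with role=overseer). Below is a minimal sketch of building that
request URL in Python. The helper name, base URL and node name are placeholders
made up for illustration, not values from any cluster discussed here.

```python
from urllib.parse import urlencode


def addrole_url(base_url: str, node: str, role: str = "overseer") -> str:
    """Build a Collections API ADDROLE request URL.

    base_url and node are placeholders; node should be the node name as it
    appears in the cluster's live nodes (for example "host:8983_solr").
    """
    query = urlencode({"action": "ADDROLE", "role": role, "node": node})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


if __name__ == "__main__":
    # Hypothetical host and node name, for illustration only.
    print(addrole_url("http://localhost:8983/solr", "192.168.1.10:8983_solr"))
```

The URL can then be requested with any HTTP client; the node name is
URL-encoded because it contains a colon.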



> Improve stability and startup performance of SolrCloud with thousands of 
> collections
> 
>
> Key: SOLR-7191
> URL: https://issues.apache.org/jira/browse/SOLR-7191
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
>Reporter: Shawn Heisey
>Assignee: Noble Paul
>  Labels: performance, scalability
> Fix For: 6.3
>
> Attachments: lots-of-zkstatereader-updates-branch_5x.log, 
> SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, 
> SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 
> 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many 
> problems myself even before I was able to get 4000 collections created on a 
> 5.0 example cloud setup.  Restarting Solr takes a very long time, and it is 
> not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance 
> and scalability.  It doesn't help that I'm running both Solr nodes on one 
> machine (I started with 'bin/solr -e cloud') and that ZK is embedded.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2017-03-15 Thread Joshua Humphries (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926204#comment-15926204
 ] 

Joshua Humphries commented on SOLR-7191:


Our cluster has many thousands of collections, most of which have only a single 
shard and single replica. Restarting a single node takes over two minutes in 
good circumstances (expected restart, like during upgrades of Solr or 
deployment of new/updated plugins). In bad circumstances, like if machines 
appear wedged and leader election issues have already caused the overseer queue 
to grow large, restarting a server can take over 10 minutes!

While watching the overseer queue size in our latest observation of this 
slowness, I saw that the down node messages take *way* too long to process. I 
ended up tracking that to an issue where it results in a ZK write for *every* 
collection, not just the collections that had shard-replicas on that node. In 
our case, it was processing about 40 times too many collections, making a 
rolling restart of the whole cluster effectively O(n^2) instead of O(n) in 
terms of the writes to ZK.

See SOLR-10277.
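
To illustrate the O(n^2) versus O(n) point, here is a small counting sketch in
Python. It is not Solr code. The cluster size and collection placement are
made-up numbers, and the function only models "one write per collection touched
by a down-node message".

```python
def zk_writes_for_rolling_restart(placement, num_nodes, touch_all):
    """Count collection state writes caused by down-node messages.

    placement maps collection name -> set of node ids hosting a replica.
    touch_all=True models one write per collection for every down node;
    touch_all=False models writing only the collections that actually
    have a replica on the node that went down.
    """
    writes = 0
    for node in range(num_nodes):  # a rolling restart takes each node down once
        for nodes_with_replica in placement.values():
            if touch_all or node in nodes_with_replica:
                writes += 1
    return writes


if __name__ == "__main__":
    # Made-up cluster: 40 nodes, 4000 single-replica collections spread evenly.
    num_nodes = 40
    placement = {f"coll{i}": {i % num_nodes} for i in range(4000)}
    print(zk_writes_for_rolling_restart(placement, num_nodes, touch_all=True))
    print(zk_writes_for_rolling_restart(placement, num_nodes, touch_all=False))
```

With single-replica collections the second count equals the number of
collections, while the first grows with nodes times collections.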




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2017-03-15 Thread Tim Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925821#comment-15925821
 ] 

Tim Owen commented on SOLR-7191:


Admittedly not thousands of collections, but another anecdote. Each of our 
clusters is 12 hosts running 6 nodes each, with 165 collections of 16 shards 
each, 3x replication. So around 7900 cores spread over 72 nodes (roughly 100 
each).

To get stable restarts we throttle the recovery thread pool size; see the 
ticket I raised with our patch, SOLR-9936. Without that, the amount of recovery 
just kills the network and disks and the cluster status never settles.

Also, we avoid restarting all nodes at once: we bring up a few at a time and 
wait for their recovery to finish before starting more. We need to automate 
this, e.g. using a ZooKeeper lock pool so that nodes will wait to start up.
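
A rough sketch of the "few at a time" idea, in Python. This is not our actual
tooling. A local threading.BoundedSemaphore stands in for the lock pool, which
in a real deployment would have to be a distributed semaphore held in
ZooKeeper. The function and node names are hypothetical.

```python
import threading
import time


def start_nodes_throttled(node_names, max_concurrent, start_and_recover):
    """Start every node, letting at most max_concurrent recover at once.

    start_and_recover(name) is supplied by the caller and should block
    until that node's recovery has finished. The local semaphore is a
    stand-in for a lock pool that would really live in ZooKeeper.
    """
    pool = threading.BoundedSemaphore(max_concurrent)
    state_lock = threading.Lock()
    running = 0
    peak = 0

    def worker(name):
        nonlocal running, peak
        with pool:  # wait for a free slot in the pool
            with state_lock:
                running += 1
                peak = max(peak, running)
            try:
                start_and_recover(name)
            finally:
                with state_lock:
                    running -= 1

    threads = [threading.Thread(target=worker, args=(n,)) for n in node_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak  # highest number of nodes that were recovering together


if __name__ == "__main__":
    # Hypothetical node names; the sleep stands in for real recovery time.
    nodes = [f"node{i}" for i in range(12)]
    print(start_nodes_throttled(nodes, 3, lambda name: time.sleep(0.05)))
```

The returned peak never exceeds max_concurrent, which is the property the
lock pool is meant to guarantee.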




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2017-03-15 Thread Jerome Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925790#comment-15925790
 ] 

Jerome Yang commented on SOLR-7191:
---

I tested on Solr 6.4.2 in cloud mode: 32 nodes on 3 CentOS 6.7 hosts.
I created 100 collections with a replication factor of 3, 1800 cores in total.
After restarting the whole cluster (I wrote a script to start all nodes at the 
same time), only some of the collections are green, and sometimes all 
collections are marked as down.

It seems this is still not fixed in Solr 6.4.2. Please re-open this.
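
One way to put numbers on "only some collections are green" is to count
replicas per state from the Collections API CLUSTERSTATUS response. A minimal
Python sketch follows. It assumes the response was fetched with wt=json and
already parsed into a dict. The sample data is hand-written in that shape and
is not real output from this cluster.

```python
from collections import Counter


def replica_states(cluster_status):
    """Count replicas per state in a parsed CLUSTERSTATUS JSON response."""
    counts = Counter()
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for collection in collections.values():
        for shard in collection.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                counts[replica.get("state", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Hand-written sample in the CLUSTERSTATUS shape; names are made up.
    sample = {
        "cluster": {
            "collections": {
                "c1": {"shards": {"shard1": {"replicas": {
                    "core_node1": {"state": "active"},
                    "core_node2": {"state": "down"},
                }}}},
                "c2": {"shards": {"shard1": {"replicas": {
                    "core_node1": {"state": "recovering"},
                }}}},
            }
        }
    }
    print(dict(replica_states(sample)))
```

Running this before and after a full restart would show how many replicas
are left in the down or recovering state.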




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2017-03-02 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893410#comment-15893410
 ] 

Shawn Heisey commented on SOLR-7191:


Now that SOLR-10130 has been discovered, there's a possibility that the bigger 
problems I ran into with a repeat test on 6.4 might have been related to that.  
When I find some time, I will need to repeat with 6.4.2 or branch_6x.

I disagree with the status of this issue as FIXED.  No changes have been 
committed in relation to this issue.  At best, the problems I encountered with 
5.0 are the same, and may have gotten worse.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2017-01-22 Thread Damien Kamerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833804#comment-15833804
 ] 

Damien Kamerman commented on SOLR-7191:
---

Regarding the extra threads, I'm thinking the issue is that 
CoreContainer.load() now calls ZkContainer.registerInZk() with background 
'true'. Check how many 'coreZkRegister' threads there are.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2017-01-21 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833172#comment-15833172
 ] 

Shawn Heisey commented on SOLR-7191:


> Shawn where can I raise maxThreads of jetty?

server/etc/jetty.xml





[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2017-01-21 Thread Yago Riveiro (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832908#comment-15832908
 ] 

Yago Riveiro commented on SOLR-7191:


Restarting a node in 6.3 now takes forever ... I bumped coreLoadThreads from 4 
to 512 and restarting a node with 1500 collections takes 20 - 25 minutes. If I 
bump coreLoadThreads to 1024 or 2048 it is faster, but sometimes replicas stay 
in a wrong state and never come up.

Another thing that I see happening now is collections being created without 
replicas.


Shawn where can I raise maxThreads of jetty?




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2017-01-20 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832705#comment-15832705
 ] 

Mark Miller commented on SOLR-7191:
---

I know we have gotten pretty good at labeling and grouping threads, so 
hopefully a comparison between versions is not too difficult.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2017-01-20 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832668#comment-15832668
 ] 

Shawn Heisey commented on SOLR-7191:


With hard nproc at 61440, soft nproc at 40960, and maxThreads for each jetty at 
2, I am still not able to create enough threads to start both Solr 
instances with just under 1900 collections in the cloud.  The user in question 
is running a few other things on the system, but the number of threads involved 
there is less than 300.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2017-01-20 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832664#comment-15832664
 ] 

Shawn Heisey commented on SOLR-7191:


I figured I would try the test setup again on Solr 6.4, see whether the 
situation has improved with newer versions.

The system requirements for thousands of collections have NOT gotten better.  
They seem to have gotten considerably worse.  The time from node restart to 
stable operation MIGHT have improved, but since I haven't yet been able to 
create all 4000 collections, I cannot be sure about that.

I ran into serious trouble before I had even created 1000 collections.  I 
bumped the heap and proceeded to create more, but ran into more trouble.  With 
a 12g heap for the instance running zookeeper, I noticed that I was getting an 
OOME about not being able to create threads when I had gotten a little more 
than 1800 collections created.  I have changed nproc in 
/etc/security/limits.conf (was a soft limit of 4096 and a hard limit of 6144) 
and bumped maxThreads in the Jetty config, and once the cluster is stable after 
restart, I will try to make more collections.
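
For anyone repeating this, the effective nproc limits can be checked from
inside a process with a few lines of Python. This is a sketch for Unix-like
systems only. It reports the limits of the Python process itself, so it needs
to be run as the same user and from the same kind of session as Solr for the
numbers to be comparable. The helper names are made up.

```python
import resource


def nproc_limits():
    """Return (soft, hard) RLIMIT_NPROC for this process.

    Returns None if the platform does not expose RLIMIT_NPROC. On Linux,
    threads count against this per-user limit.
    """
    limit = getattr(resource, "RLIMIT_NPROC", None)
    if limit is None:
        return None
    return resource.getrlimit(limit)


def describe(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    limits = nproc_limits()
    if limits is None:
        print("RLIMIT_NPROC is not available on this platform")
    else:
        soft, hard = limits
        print("soft nproc:", describe(soft), "hard nproc:", describe(hard))
```

On Linux the limits of an already running process can also be read from
/proc/<pid>/limits.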





[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-11-28 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15702278#comment-15702278
 ] 

Erick Erickson commented on SOLR-7191:
--

[~noble.paul] Can we close this since SOLR-7280 has been committed?

> Improve stability and startup performance of SolrCloud with thousands of 
> collections
> 
>
> Key: SOLR-7191
> URL: https://issues.apache.org/jira/browse/SOLR-7191
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
>Reporter: Shawn Heisey
>Assignee: Shalin Shekhar Mangar
>  Labels: performance, scalability
> Attachments: SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, 
> SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, 
> lots-of-zkstatereader-updates-branch_5x.log
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 
> 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many 
> problems myself even before I was able to get 4000 collections created on a 
> 5.0 example cloud setup.  Restarting Solr takes a very long time, and it is 
> not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance 
> and scalability.  It doesn't help that I'm running both Solr nodes on one 
> machine (I started with 'bin/solr -e cloud') and that ZK is embedded.






[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-30 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357167#comment-15357167
 ] 

Noble Paul commented on SOLR-7191:
--

I have simplified this patch and moved it over to SOLR-7280. I plan to commit 
that soon.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-30 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357146#comment-15357146
 ] 

Noble Paul commented on SOLR-7191:
--

Yeah, normally you are fine. If there is a GC pause in the overseer node, a 
lot of messages can get stuck in the queue, and this will lead to even more 
threads waiting indefinitely (consuming more memory) and aggravating the 
situation.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-28 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353556#comment-15353556
 ] 

Erick Erickson commented on SOLR-7191:
--

Yeah, probably, as I'm out of my depth here. Since each replica goes through at 
least three state changes (down->recovering->active), not to mention leadership 
election and such, and each state change needs to get to ZK, I'm really not at 
all sure how to cut that number down.
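
As a back-of-the-envelope illustration of that floor, in Python: the function
below only multiplies replicas by the states each one passes through. The
replica count in the example is made up, and leader election, retries and
batching are deliberately left out.

```python
def min_state_updates(replicas, states=("down", "recovering", "active")):
    """Lower bound on state updates for a restart.

    Assumes every replica reports each listed state exactly once; leader
    election, retries and any batching by the overseer are not modelled.
    """
    return replicas * len(states)


if __name__ == "__main__":
    # Hypothetical cluster with 4000 replicas in total.
    print(min_state_updates(4000))
```

Even this optimistic count grows linearly with the number of replicas, so
the cost of each individual update matters a lot at this scale.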






[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-28 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353502#comment-15353502
 ] 

Scott Blum commented on SOLR-7191:
--

[~erickerickson] we may be talking about 2 different things?  I'm referring to 
the total number of Overseer state update operations that happen.  Is there a 
relationship between that and the watcher side that I'm unaware of?

Also, in my case, we only have 1 replica per shard, period, so leadership 
contention shouldn't be an issue at all.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-28 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353437#comment-15353437
 ] 

Erick Erickson commented on SOLR-7191:
--

FWIW, I tried both and had no trouble even when overseer was on one of the 
replicas with lots of cores. That said, I agree it's wise to put the overseer 
somewhere else in these cases. Certainly can't hurt!




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-28 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353362#comment-15353362
 ] 

Erick Erickson commented on SOLR-7191:
--

IIUC, the whole watcher thing is replica based. So N replicas for the same 
collection in the same JVM register N watchers.

If that's true, does it make sense to think about watchers being set per 
_collection_ in a JVM rather than per _replica_? I admit I'm completely 
ignorant of the nuances here. It also wouldn't make any difference in a 
collection where each instance hosted exactly one replica per collection, but 
practically I'm not sure there's anything we can do about that anyway.

Although it seems that each replica could be an Observer for a given collection 
(watcher at the JVM level?) without doing much violence to the current 
architecture. Or maybe it'd just be simpler to have the replicas get their 
state information from some kind of cache maintained at the JVM level where the 
cache was updated via watcher. I admit I'm talking through my hat here. Maybe 
there should be a JIRA to discuss this?
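A minimal sketch of the "cache maintained at the JVM level" idea floated above, written in Python for brevity. It is an illustration only, not how Solr's ZkStateReader is implemented: the class name, the `register_watch` hook and the listener callbacks are all hypothetical. The point it shows is that N local replicas of one collection can share a single upstream watch.

```python
from typing import Callable, Dict, List


class CollectionStateCache:
    """One cached state entry (and one upstream watch) per collection per JVM."""

    def __init__(self, register_watch: Callable[[str], None]):
        # register_watch is a hypothetical hook that would set a single
        # ZooKeeper watch for the named collection.
        self._register_watch = register_watch
        self._state: Dict[str, dict] = {}
        self._listeners: Dict[str, List[Callable[[dict], None]]] = {}

    def subscribe(self, collection: str, listener: Callable[[dict], None]) -> None:
        # Only the first local replica of a collection triggers a real watch.
        if collection not in self._listeners:
            self._listeners[collection] = []
            self._register_watch(collection)
        self._listeners[collection].append(listener)

    def on_watch_fired(self, collection: str, new_state: dict) -> None:
        # Called once per upstream change; fans out to every local replica.
        self._state[collection] = new_state
        for listener in self._listeners.get(collection, []):
            listener(new_state)

    def get(self, collection: str) -> dict:
        # Replicas read the cached copy instead of asking ZooKeeper again.
        return self._state.get(collection, {})
```

With four replicas of the same collection subscribed, `register_watch` is called once, while a single `on_watch_fired` call still notifies all four listeners.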




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-28 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353293#comment-15353293
 ] 

Noble Paul commented on SOLR-7191:
--

The cluster was very stable when I used a dedicated overseer, assigned with the 
{{ADDROLE}} command. I used the replica placement strategy to ensure that the 
overseer nodes did not have any replicas created. For any reasonably large 
cluster, I recommend using dedicated overseer nodes. Another observation was 
that the overseer nodes use very little memory. Heap usage never went beyond 
200MB. 
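For reference, a small sketch of how the ADDROLE request for the overseer role can be assembled. The `action`, `role` and `node` parameters come from the Collections API; the base URL and node name below are placeholders, not values from this cluster, and the helper function name is made up for illustration.

```python
from urllib.parse import urlencode


def addrole_url(base_url: str, node_name: str) -> str:
    """Build a Collections API ADDROLE request for the overseer role."""
    params = {"action": "ADDROLE", "role": "overseer", "node": node_name}
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Placeholder host and node name; substitute values from your own cluster.
print(addrole_url("http://localhost:8983/solr", "192.168.1.10:8983_solr"))
```

The node name is the one Solr shows under live_nodes; keeping replicas off that node is a separate placement decision and is not handled here.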




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-28 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353278#comment-15353278
 ] 

Erick Erickson commented on SOLR-7191:
--

Empirically it was fine, but that was on very few runs. I had 10 replicas in 
each JVM and 3 load threads in one variant. Of course I might just have gotten 
lucky.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-28 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353262#comment-15353262
 ] 

Scott Blum commented on SOLR-7191:
--

(for the record, this was on a 5.5.1 based build)




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-28 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353261#comment-15353261
 ] 

Scott Blum commented on SOLR-7191:
--

Paginated getChildren().. always wondered why that wasn't a thing.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-28 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353258#comment-15353258
 ] 

Shawn Heisey commented on SOLR-7191:


bq. That seems... insane.

I am glad that I am not the only one to think the number of updates in the 
overseer queue for node startup is insane.

When you get that many updates in the queue and haven't made a big change to 
jute.maxbuffer, zookeeper starts failing because the size of the znode will 
become too large.  I think it's crazy that zookeeper allows *writes* to a znode 
when the write will make the node too big.  See ZOOKEEPER-1162.
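A rough back-of-envelope sketch of where that limit might bite, under stated assumptions: the jute.maxbuffer default of 0xfffff is from the ZooKeeper documentation, but the per-name cost (a 4-byte length prefix plus the name bytes) and the queue child name shape ("qn-" plus a 10-digit sequence) are assumptions made for illustration. The function name is hypothetical.

```python
JUTE_MAXBUFFER_DEFAULT = 0xFFFFF  # documented ZooKeeper default, just under 1 MB


def max_children_listable(name_len: int, max_buffer: int = JUTE_MAXBUFFER_DEFAULT) -> int:
    """Rough count of child names that fit in one getChildren response.

    Assumes each name costs a 4-byte length prefix plus its bytes and
    ignores the rest of the response framing, so treat it as an upper bound.
    """
    bytes_per_child = 4 + name_len
    return max_buffer // bytes_per_child


# Assumed child name shape: "qn-" followed by a 10-digit sequence number.
print(max_children_listable(len("qn-0000000000")))  # 61680
```

Under those assumptions a queue holding more than roughly sixty thousand entries could no longer be listed with the default buffer, even though each individual write to it still succeeds.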




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-28 Thread Scott Blum (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353247#comment-15353247
 ] 

Scott Blum commented on SOLR-7191:
--

This may be unrelated to the current patch work, but seems relevant to the uber 
ticket:

I rebooted our solr cluster the other night to pick up an update, and I ran 
into what seemed to be pathological behavior around state updates.  My first 
attempt to bring up everything at once resulted in utter deadlock, so I shut 
everything down, manually nuked all the overseer queues/maps in ZK, and started 
bringing them up one at a time.  What I saw was kind of astounding.

I was monitoring OVERSEERSTATUS and tracking the number of outstanding overseer 
ops + the total number of update_state ops, and I noticed that every VM I 
brought up needed ~4000 update_state ops to stabilize, despite the fact that 
each VM only manages ~128 cores.  We have 32 vms with ~128 cores each, or ~4096 
cores in our entire cluster... it took over 100,000 update_state operations to 
bring the whole cluster up.  That seems... insane.  3 or 4 update_state ops per 
core would seem reasonable to me, but I saw over 30 ops per core loaded as I 
went.  This number was extremely consistent for every node I brought up.
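The arithmetic behind those figures, worked through as a short sketch. All inputs are the approximate numbers quoted above; the variable names are only for illustration, and the "expected" figure uses the upper end of the 3-or-4 ops per core estimate.

```python
vms = 32
cores_per_vm = 128
ops_per_vm_observed = 4000   # approximate, from watching OVERSEERSTATUS
ops_per_core_expected = 4    # upper end of the "3 or 4" estimate

total_cores = vms * cores_per_vm
ops_per_core_observed = ops_per_vm_observed / cores_per_vm
total_ops_observed = vms * ops_per_vm_observed
total_ops_expected = total_cores * ops_per_core_expected

print(total_cores)            # 4096
print(ops_per_core_observed)  # 31.25
print(total_ops_observed)     # 128000
print(total_ops_expected)     # 16384
```

So the observed load is roughly eight times what a 4-ops-per-core startup would produce.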




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-28 Thread damien kamerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352431#comment-15352431
 ] 

damien kamerman commented on SOLR-7191:
---

The concern is based on general principles, e.g. if a shard has all 4 replicas 
on the one JVM and there are 3 load threads, then registration will be based on 
the first three cores only.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-27 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352399#comment-15352399
 ] 

Noble Paul commented on SOLR-7191:
--

bq. if there is a collection that has more than coreLoadThreadCount

or a 'shard' has more replicas? 




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-27 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352349#comment-15352349
 ] 

Erick Erickson commented on SOLR-7191:
--


Hmmm, is this a concern based on general principles or on a code path that is 
expected to fail?

I tested a couple of scenarios. Both used 4 JVMs and 3 load threads. The "smoke 
test" was starting all the JVMs at once.

1> 100 collections, 4 shards x 4 replicas each
2> 10 collections, 4 shards x 40 replicas each.

Of course my testing could easily have missed the corner cases that trip this 
as it was pretty bare-bones.
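For scale, the core counts those two scenarios imply can be worked out as below. This is a sketch that assumes cores are spread evenly across the 4 JVMs, which the comment above does not state; the function name is illustrative.

```python
def cores_per_jvm(collections: int, shards: int, replicas: int, jvms: int) -> float:
    """Average core count per JVM, assuming an even spread across JVMs."""
    return collections * shards * replicas / jvms


print(cores_per_jvm(100, 4, 4, 4))  # 400.0
print(cores_per_jvm(10, 4, 40, 4))  # 400.0
```

Both layouts come to about 400 cores per JVM, but the second one puts far more replicas of the same shard in a single JVM than there are load threads.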



> Improve stability and startup performance of SolrCloud with thousands of 
> collections
> 
>
> Key: SOLR-7191
> URL: https://issues.apache.org/jira/browse/SOLR-7191
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
>Reporter: Shawn Heisey
>Assignee: Shalin Shekhar Mangar
>  Labels: performance, scalability
> Attachments: SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, 
> SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, 
> lots-of-zkstatereader-updates-branch_5x.log
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 
> 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many 
> problems myself even before I was able to get 4000 collections created on a 
> 5.0 example cloud setup.  Restarting Solr takes a very long time, and it is 
> not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance 
> and scalability.  It doesn't help that I'm running both Solr nodes on one 
> machine (I started with 'bin/solr -e cloud') and that ZK is embedded.






[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-27 Thread damien kamerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352228#comment-15352228
 ] 

damien kamerman commented on SOLR-7191:
---

Only coreLoadThreadCount cores are registering at a time on each JVM, so
the concern is if there is a collection that has more than
coreLoadThreadCount replicas on a JVM then registration could fail.
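A small sketch of how one might flag the layouts this concern applies to. It only checks the stated condition (more local replicas of a collection than load threads); whether registration actually fails in that case is not established in this thread. The function name and the (collection, core name) input shape are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def collections_over_limit(
    local_replicas: Iterable[Tuple[str, str]], core_load_thread_count: int
) -> List[str]:
    """Return collections with more local replicas than load threads.

    local_replicas holds (collection, core_name) pairs hosted on one JVM.
    """
    counts = Counter(collection for collection, _ in local_replicas)
    return sorted(c for c, n in counts.items() if n > core_load_thread_count)
```

For example, a JVM hosting four cores of one collection with coreLoadThreadCount set to 3 would be reported, while a collection with two local cores would not.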





-- 
Damien Kamerman





[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-27 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352216#comment-15352216
 ] 

Erick Erickson commented on SOLR-7191:
--

Noble:

A few comments:

> the stress setup I have is sailing right through the parts that were dying 
> last week, so this is looking good.

> I also tried back-porting this to 5x and it seems to be working equally well 
> there

> There's a comment in CoreContainer:
   // OK to limit the size of the executor in zk mode as cores are loaded in order.
   // This assumes replicaCount is less than coreLoadThreadCount?
I didn't read it before I started testing, so I didn't know enough to be 
scared... I'm starting 400 cores in each JVM with 10 coreLoadThreads so does 
the fact that the loading is in order preclude fewer threads than cores being a 
problem? I also experimented with 3 threads (4 shards, 4 replicas each) and saw 
no problems.







[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-20 Thread damien kamerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340698#comment-15340698
 ] 

damien kamerman commented on SOLR-7191:
---

The patch from March 2015 was against Solr 4.10.
The patch from Oct 2015 was against Solr trunk (1708905)





-- 
Damien Kamerman


> Improve stability and startup performance of SolrCloud with thousands of 
> collections
> 
>
> Key: SOLR-7191
> URL: https://issues.apache.org/jira/browse/SOLR-7191
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
>Reporter: Shawn Heisey
>Assignee: Shalin Shekhar Mangar
>  Labels: performance, scalability
> Attachments: SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, 
> SOLR-7191.patch, lots-of-zkstatereader-updates-branch_5x.log
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 
> 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many 
> problems myself even before I was able to get 4000 collections created on a 
> 5.0 example cloud setup.  Restarting Solr takes a very long time, and it is 
> not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance 
> and scalability.  It doesn't help that I'm running both Solr nodes on one 
> machine (I started with 'bin/solr -e cloud') and that ZK is embedded.






[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-20 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340685#comment-15340685
 ] 

Erick Erickson commented on SOLR-7191:
--

[~noble.paul] Not sure what version this patch was against. My testing and most 
recent notes were on stock Solr...

> Improve stability and startup performance of SolrCloud with thousands of 
> collections
> 
>
> Key: SOLR-7191
> URL: https://issues.apache.org/jira/browse/SOLR-7191
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
>Reporter: Shawn Heisey
>Assignee: Shalin Shekhar Mangar
>  Labels: performance, scalability
> Attachments: SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, 
> SOLR-7191.patch, lots-of-zkstatereader-updates-branch_5x.log
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 
> 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many 
> problems myself even before I was able to get 4000 collections created on a 
> 5.0 example cloud setup.  Restarting Solr takes a very long time, and it is 
> not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance 
> and scalability.  It doesn't help that I'm running both Solr nodes on one 
> machine (I started with 'bin/solr -e cloud') and that ZK is embedded.






[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-20 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340106#comment-15340106
 ] 

Noble Paul commented on SOLR-7191:
--

Which version is the patch built on?







[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-10 Thread damien kamerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325626#comment-15325626
 ] 

damien kamerman commented on SOLR-7191:
---

This fits with what I've seen on Solr 4 and 5: the cores register on an
unbounded thread pool. My patch limits the size of that pool and registers
the cores in order.
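
The idea of a bounded pool with ordered registration can be sketched roughly as follows. This is an illustration only, not the patch itself: `registerCore`, the class name, and the pool size passed in are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class BoundedRegistration {
    // Hypothetical stand-in for registering one core with ZooKeeper.
    static void registerCore(String coreName, List<String> registered) {
        registered.add(coreName);
    }

    // Queues the cores in sorted order on a fixed-size pool and returns the
    // names in the order their registration actually ran.
    static List<String> registerAll(List<String> coreNames, int maxThreads) {
        List<String> sorted = new ArrayList<>(coreNames);
        Collections.sort(sorted);
        List<String> registered = Collections.synchronizedList(new ArrayList<>());
        // At most maxThreads threads exist, however many cores there are.
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        for (String name : sorted) {
            pool.execute(() -> registerCore(name, registered));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("registration did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return registered;
    }

    public static void main(String[] args) {
        List<String> cores = Arrays.asList("c2_shard1", "c1_shard1", "c1_shard2");
        System.out.println(registerAll(cores, 2).size() + " cores registered");
    }
}
```

Tasks are queued in sorted order, but with more than one worker thread the completion order can still interleave; only a single-threaded pool guarantees strict ordering.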









[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2016-06-10 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325575#comment-15325575
 ] 

Erick Erickson commented on SOLR-7191:
--

I had to chase after this for a while, so I'm recording the results 
of some testing for posterity.

> Setup: 4 Solr JVMs, 8G each (64G total RAM on the machine).
> Create 100 4x4 collections (i.e. 4 shards with 4 replicas each), 1,600
  replicas in total.
  > Note that the cluster is fine at this point, everything's green.
> No data indexed at all.
> Shut all Solr instances down.
> Bring up a Solr on a different box. I did this to eliminate the chance
  that the Overseer was somehow involved, since it is now on a machine
  with no replicas. I don't think this matters much though.
> Bring up one JVM.
> Wait for all the replicas on that JVM to come up. Now every shard has a
  leader and the collections are all green. 3 of 4 replicas for each shard
  are "gone" of course, but it's a functioning cluster.
> Bring up the next JVM: Kabloooey. Very shortly you'll start to see OOM
  errors on the _second_ JVM but not the first.
  > The number of threads on the first JVM is about 1,200. On the second,
    it goes over 2,000. Whether this would drop back down or not
    is an open question.
  > So I tried playing with -Xss to reduce the stack size of the threads,
    and even cutting it in half didn't help.
  > Expanding the memory on the second JVM to 32G didn't help.
  > I tried raising the process limit (ulimit -u), on a hint that there
    was a wonky effect there somehow, to no avail.
  > Especially disconcerting is the fact that this node was running fine
    when the collections were _created_; it just can't get past restart.
  > Changing coreLoadThreads even down to 2 did not seem to help.
  > At no point does the memory consumption reported by jConsole or top
    even get close to the allocated JVM limits.
> I'd like to be able to just start all 4 JVMs at once, but didn't get
  that far.
> If one tries to start additional JVMs anyway, there's a lot of thrashing
  around: replicas go into recovery, go out of recovery, are permanently
  down, etc.
  Of course with OOMs it's unclear what _should_ happen.
> The OOM killer script apparently does NOT get triggered. I think the OOM
  is swallowed, perhaps in ZooKeeper client code. Note that if the OOM
  killer script _did_ get fired, the second and subsequent JVMs would
  just die.
> Error is OOM: unable to create new native thread.
> Here's a stack trace; there are a _lot_ of these...

ERROR - 2016-06-11 00:05:36.806; [   ] org.apache.zookeeper.ClientCnxn$EventThread; Error while calling watcher 
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:714)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:214)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
    at org.apache.solr.common.cloud.SolrZkClient$3.process(SolrZkClient.java:266)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
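
"unable to create new native thread" is raised when the JVM cannot start another OS thread, which is consistent with the heap figures staying well below their limits. A small probe like the one below, using the standard `ThreadMXBean`, is one way to watch the thread count from inside a JVM; the class name is made up for illustration and nothing here is Solr-specific.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class ThreadCountProbe {
    // Number of live threads (daemon and non-daemon) in this JVM right now.
    static int liveThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }

    // Highest live-thread count seen since JVM start (or since the last reset).
    static int peakThreads() {
        return ManagementFactory.getThreadMXBean().getPeakThreadCount();
    }

    public static void main(String[] args) {
        System.out.println("live=" + liveThreads() + " peak=" + peakThreads());
    }
}
```

Logging these two numbers periodically during startup would show whether the count plateaus or keeps climbing toward an OS limit.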






[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-10-22 Thread Damien Kamerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970255#comment-14970255
 ] 

Damien Kamerman commented on SOLR-7191:
---

After 2 minutes, around 100 collections were all green. This is with a 3-node 
ensemble. Ten minutes would be great, and I guess with 3K collections I would 
be close to that mark.







[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-10-22 Thread Damien Kamerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970253#comment-14970253
 ] 

Damien Kamerman commented on SOLR-7191:
---

1. Hmm, cancel that. Initially I noticed a very slow shutdown (around 60 min 
in total) in JmxMonitoredMap.clear(). I went back to test it and was unable to 
reproduce it. I did update trunk, though. A partial stack is all I've saved:
    at org.apache.solr.core.JmxMonitoredMap.clear(JmxMonitoredMap.java:144)
    at org.apache.solr.core.SolrCore.close(SolrCore.java:1263)
    at org.apache.solr.core.SolrCores.close(SolrCores.java:124)
    at org.apache.solr.core.CoreContainer.shutdown(CoreContainer.java:564)
    at org.apache.solr.servlet.SolrDispatchFilter.destroy(SolrDispatchFilter.java:172)

2. OK, will look into that.







[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-10-22 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968920#comment-14968920
 ] 

Shalin Shekhar Mangar commented on SOLR-7191:
-

Thanks Damien.

1. What is the purpose of fastClose in SolrCore.close()? It only disables 
clearing the JMX registry. Have you found that to be very slow in practice?
2. Your schema cache will cache schemas indefinitely and won't reload on changes 
made through the Schema API or manually. You need to use the znode version of 
the schema file as part of the key name to ensure that schemas can be reloaded.
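
The point about keying the cache on the znode version can be illustrated with a small sketch. It assumes the caller already knows the config set name and the current znode version of the schema file; the class and method names are hypothetical and are not taken from the patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class VersionedSchemaCache<S> {
    private final Map<String, S> cache = new ConcurrentHashMap<>();

    // The key combines the config set name with the znode version of the
    // schema file, so an edited schema (new version) gets a new key.
    static String key(String configSet, int znodeVersion) {
        return configSet + "@" + znodeVersion;
    }

    // Returns the cached schema for this exact version, loading it on a miss.
    S get(String configSet, int znodeVersion, Supplier<S> loader) {
        return cache.computeIfAbsent(key(configSet, znodeVersion), k -> loader.get());
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        VersionedSchemaCache<String> schemas = new VersionedSchemaCache<>();
        schemas.get("conf1", 3, () -> "parsed schema, version 3");
        schemas.get("conf1", 3, () -> "never loaded, version 3 is cached");
        schemas.get("conf1", 4, () -> "parsed schema, version 4");
        System.out.println(schemas.size() + " schema versions cached");
    }
}
```

This sketch never evicts entries for old versions; a real cache would need some eviction policy to avoid holding stale schemas forever.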

I'll have to test your change of moving the updateClusterState to CoreContainer.







[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-10-21 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14966893#comment-14966893
 ] 

Shawn Heisey commented on SOLR-7191:


bq. I created 6,000 collections (3 nodes; 2 x replicas) and re-started the 3 
nodes, and all green in 24min.

Very nice! That's an improvement, though still fairly slow.  How quickly did 
the cluster reach full usability -- at least one replica active on all 
collections?  That's more important than all-green.
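
To make the two milestones concrete, here is a rough sketch of how they differ. It assumes a simplified model in which each collection maps to a flat list of replica states and "active" is the healthy state; the class and method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ClusterReadiness {
    // Usable: every collection has at least one replica in the "active" state.
    static boolean usable(Map<String, List<String>> replicaStates) {
        return replicaStates.values().stream()
                .allMatch(states -> states.contains("active"));
    }

    // All green: every replica of every collection is "active".
    static boolean allGreen(Map<String, List<String>> replicaStates) {
        return replicaStates.values().stream()
                .allMatch(states -> !states.isEmpty()
                        && states.stream().allMatch("active"::equals));
    }

    public static void main(String[] args) {
        Map<String, List<String>> states = new HashMap<>();
        states.put("collection1", Arrays.asList("active", "recovering"));
        states.put("collection2", Arrays.asList("active", "active"));
        System.out.println("usable=" + usable(states) + " allGreen=" + allGreen(states));
    }
}
```

Timing how long each predicate takes to become true after a restart would give the two numbers separately.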

I would hope for full stability in a timeframe well under ten minutes, and 
under five minutes would be even better.  I don't want to discourage your 
efforts.

Was this with a single ZK node, or did you have a 3-node ensemble?  It's my 
understanding that as nodes are added to the ensemble, database updates get 
slower, because the write must be coordinated on more hosts. A redundant 
ensemble cannot update the database as fast as a single node, so verification 
of progress should be done with an ensemble.








[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-07-27 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642896#comment-14642896
 ] 

Shawn Heisey commented on SOLR-7191:


Another user on the mailing list brought up another scaling problem -- large 
numbers of collections using a large configuration.  A 6MB config across 1000 
collections requires several gigabytes of RAM.

> Improve stability and startup performance of SolrCloud with thousands of 
> collections
> 
>
> Key: SOLR-7191
> URL: https://issues.apache.org/jira/browse/SOLR-7191
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
>Reporter: Shawn Heisey
>Assignee: Shalin Shekhar Mangar
>  Labels: performance, scalability
> Attachments: SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, 
> lots-of-zkstatereader-updates-branch_5x.log
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 
> 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many 
> problems myself even before I was able to get 4000 collections created on a 
> 5.0 example cloud setup.  Restarting Solr takes a very long time, and it is 
> not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance 
> and scalability.  It doesn't help that I'm running both Solr nodes on one 
> machine (I started with 'bin/solr -e cloud') and that ZK is embedded.






[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-07-27 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642990#comment-14642990
 ] 

Shalin Shekhar Mangar commented on SOLR-7191:
-

bq. Another user on the mailing list brought up another scaling problem – large 
numbers of collections using a large configuration. A 6MB config across 1000 
collections requires several gigabytes of RAM.

Yup, that's why I opened SOLR-7282







[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-20 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372155#comment-14372155
 ] 

Shalin Shekhar Mangar commented on SOLR-7191:
-

That's awesome, Damien. I'll start next week on getting these improvements into 
Solr. I'm going to create some sub-tasks for individual changes. You can help 
by providing patches that apply to trunk.







[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-20 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372212#comment-14372212
 ] 

Shalin Shekhar Mangar commented on SOLR-7191:
-

bq. The biggest problem I noticed is that any little change on the cluster 
(even creating a new collection) seems to cause a flood of ZkStateReader 
Updating data for X to ver NN messages. Each time it must update every 
single collection it has ... when that happens, it doesn't take an extreme 
amount of time, but I've noticed that it does it repeatedly, especially on node 
startup.

[~elyograg] -- Sorry for not responding earlier. Yes, that is how it works 
right now. We didn't optimize this case because collections are created/deleted 
infrequently. However, this might be a problem when users have collections with 
stateFormat=1 and 2.
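
One possible way to avoid refreshing every collection on every change is to compare versions before doing any work. The sketch below is only an illustration of that idea under the assumption that each collection's state carries a monotonically increasing version; it is not how ZkStateReader is implemented, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class VersionedStateCache {
    private final Map<String, Integer> versions = new HashMap<>();

    // Records the version and returns true only when it is newer than the
    // cached one, so unchanged collections can skip the refresh entirely.
    synchronized boolean updateIfNewer(String collection, int version) {
        Integer cached = versions.get(collection);
        if (cached != null && cached >= version) {
            return false; // nothing changed for this collection
        }
        versions.put(collection, version);
        return true;
    }

    public static void main(String[] args) {
        VersionedStateCache cache = new VersionedStateCache();
        System.out.println(cache.updateIfNewer("collection1", 5)); // first sighting
        System.out.println(cache.updateIfNewer("collection1", 5)); // same version again
    }
}
```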







[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-18 Thread Damien Kamerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368557#comment-14368557
 ] 

Damien Kamerman commented on SOLR-7191:
---

Shalin, another change I made was to cache the result of 
ConfigSetService.createIndexSchema() in cloud mode. BTW, I have tested OK up to 
24K cores.







[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-15 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362511#comment-14362511
 ] 

Shalin Shekhar Mangar commented on SOLR-7191:
-

This is a very broad issue so there are likely to be multiple problems and 
their solutions. It's probably best to start splitting out individual changes 
into their own sub-tasks so that each can be reviewed and committed 
individually.







[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-12 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358685#comment-14358685
 ] 

Shawn Heisey commented on SOLR-7191:


[~dk]:

The first thing I thought when I saw that you were trying 10K cores was that 
you would run out of threads unless you change the servlet container config.  
There is another limit looming after that ... the number of processes that you 
can create.  Linux caps process IDs at kernel.pid_max, which typically defaults 
to 32768 (it can be raised on 64-bit kernels), so that is the upper limit on 
processes, including all OS-related processes, unless the setting is changed.  
On Linux (and likely other Unix/Unix-like systems), each thread takes up an ID 
from that same space, although threads are not visible to programs like top or 
ps without specific options.  I have no idea what the situation is on Windows.

On your patch:

The first patch section removes a null check.  This is never a good idea, 
because the fact that a null check exists tends to mean that the object 
identifier has the potential to be null, and presumably the first branch of the 
ternary operator will fail (NullPointerException) somehow if the checked object 
actually is null.
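
A generic illustration of the concern, unrelated to the actual code in the patch (the method names and the fallback string are made up):

```java
class NullCheckExample {
    // With the null check: a null reference falls through to the fallback.
    static String describe(String name) {
        return name != null ? name.trim() : "(unknown)";
    }

    // Same expression with the check removed: throws NullPointerException
    // as soon as a null reference reaches it.
    static String describeUnchecked(String name) {
        return name.trim();
    }

    public static void main(String[] args) {
        System.out.println(describe(null));
        System.out.println(describe("  core1  "));
    }
}
```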

On the last patch section: Imposing a limit in the code without giving the user 
the option of configuring that limit will eventually cause problems for 
somebody.  Also, someone who is really familiar with how the ZkContainer code 
works will need to let us know if reducing the number of threads might have 
unintended consequences.
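
One common way to keep such a limit configurable is to read it from a system property with a default. In this sketch the property name is hypothetical (it is not an existing Solr setting), and the default of 24 simply mirrors the thread count used in the patch under discussion.

```java
class RegisterThreadLimit {
    // Hypothetical property name, used here for illustration only.
    static final String PROP = "example.coreZkRegisterThreads";
    static final int DEFAULT_LIMIT = 24;

    // Reads the limit from the system property, falling back to the default
    // when the property is unset, unparsable, or not a positive number.
    static int limit() {
        int configured = Integer.getInteger(PROP, DEFAULT_LIMIT);
        return configured > 0 ? configured : DEFAULT_LIMIT;
    }

    public static void main(String[] args) {
        System.out.println("register threads: " + limit());
    }
}
```

Running with `-Dexample.coreZkRegisterThreads=8` would then change the limit without a code change.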

On LotsOfCores: SolrCloud brings a lot of complications to the situation, and 
when Erick did his work on that, he told all of us that trying to use transient 
cores in conjunction with SolrCloud would likely not work correctly.  I think 
that the goal is to eventually make the two features coexist, but a lot of 
thought and work needs to happen.

General observation:  A patch like this is not likely to be backported to the 
4.10 branch.  That branch is in maintenance mode, so only trivial fixes or 
patches for major bugs will be committed, and new releases from the maintenance 
mode branch are not common.


> Improve stability and startup performance of SolrCloud with thousands of 
> collections
> 
>
> Key: SOLR-7191
> URL: https://issues.apache.org/jira/browse/SOLR-7191
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 5.0
>Reporter: Shawn Heisey
>  Labels: performance, scalability
> Attachments: SOLR-7191.patch, 
> lots-of-zkstatereader-updates-branch_5x.log
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 
> 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many 
> problems myself even before I was able to get 4000 collections created on a 
> 5.0 example cloud setup.  Restarting Solr takes a very long time, and it is 
> not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance 
> and scalability.  It doesn't help that I'm running both Solr nodes on one 
> machine (I started with 'bin/solr -e cloud') and that ZK is embedded.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-12 Thread Damien Kamerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358149#comment-14358149
 ] 

Damien Kamerman commented on SOLR-7191:
---

With 10K cores I was still seeing many cores stuck in recovery.

I reduced the CoreContainer.load() coreLoadExecutor to 24 threads 
(cfg.coreLoadThreads), down from max int; I guess I'm now assuming replicas 
to be on other nodes.
I reduced ZkContainer.coreZkRegister to 24 threads (from max int) and added 
a sort to SolrCore.getCores(). The sort ensures replicas are available.
I tested with Solr 4.10.4; 2 nodes; 5K collections; 10K cores. All green in 
19 min.

Please review patch.

Would still like to see SOLR-6399 and a SolrCloud LotsOfCores as per Erick.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-10 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354875#comment-14354875
 ] 

Shawn Heisey commented on SOLR-7191:


Now that I look closer at that log snippet, it appears that was a state update 
for node2, so I'm not sure exactly how long it took for node1 to become stable 
... but since I started node2 at 07:01 UTC, it appears that it was nearly half 
an hour for full cluster stability, which is unacceptable in a production 
situation.

There are no docs in any of the indexes, and I did not think to try a query or 
an update to see whether Solr was actually functional.  That will be included 
on any test repeat.





[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-10 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354868#comment-14354868
 ] 

Shawn Heisey commented on SOLR-7191:


The branch_5x code is a lot more stable than 5.0, but full recovery after a 
cluster restart is very slow, and as Damien noted, creation of new collections 
is also very slow.  On my dev server, by the time it had reached 4000 
collections, each new collection was taking about 20 seconds to create.

The biggest problem I noticed is that any little change on the cluster (even 
creating a new collection) seems to cause a flood of ZkStateReader "Updating 
data for X to ver NN" messages.  Each time it must update every single 
collection it has ... when that happens, it doesn't take an extreme amount of 
time, but I've noticed that it does it repeatedly, especially on node startup.
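
For anyone who wants to quantify that flood, a small script along these lines can count the ZkStateReader update messages per minute. It is only a sketch: the line layout is an assumption modelled on the ReplicaMutator log line quoted further down in this comment (level, timestamp, logger class, message, separated by semicolons), and the function name is made up for the example.

```python
import re
from collections import Counter

# Assumed layout: LEVEL - YYYY-MM-DD HH:MM:SS.mmm; logger.class.Name; message
LINE_RE = re.compile(
    r"^\w+\s+-\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+; "
    r"([\w.$]+); (.*)$"
)


def updates_per_minute(lines):
    """Count ZkStateReader 'Updating data for ...' messages per minute."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # continuation lines and other layouts are skipped
        minute, logger, message = m.groups()
        if logger.endswith("ZkStateReader") and message.startswith("Updating data for"):
            counts[minute] += 1
    return counts
```

Because it reads line by line, passing an open file object (for example `updates_per_minute(open(path))`) avoids loading a multi-gigabyte log into memory.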

I have some logs from a cluster restart on a 2 node cluster, but they are far 
too large to attach here, even compressed.  I have loaded the zip onto my 
dropbox account.

https://www.dropbox.com/s/28mu7asdvbgwkqt/logs-restart1-branch_5x.zip?dl=0

These logs are also far too large to load into Notepad or Notepad++ on Windows, 
but gnu's less program can deal with them just fine.

To create these logs, I shut down the cluster, deleted all logfiles, and then 
restarted each node.

Looking only at the log for node1, node startup began at 06:54 UTC.  At the 
following point, half an hour later, it appears that the node was fully 
operational - every collection shard had become active:

{noformat}
INFO  - 2015-03-06 07:27:21.586; org.apache.solr.cloud.overseer.ReplicaMutator; 
Update state numShards=2 message={
  operation:state,
  core_node_name:core_node2,
  numShards:2,
  shard:shard2,
  roles:null,
  state:active,
  core:mycoll3892_shard2_replica1,
  collection:mycoll3892,
  node_name:10.100.1.39:7574_solr,
  base_url:http://10.100.1.39:7574/solr}
{noformat}

At 08:38 UTC, just before I went to bed, I sent a request to create a new 
collection, named lotsashards.  The http request for collection creation 
timed out three minutes later, at 08:41, but the request was in the overseer 
queue, so eventually it did happen.  It took until 15:29 for that collection 
creation to begin - nearly seven hours after I started it - because Solr was 
busy handling a very large number of those ZkStateReader updates.  It appears 
that it took another ten minutes for the create to finish, and there are other 
messages related to lotsashards right up through the end of the log at 16:54 
UTC.
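
The timeline above works out as follows; this is just the arithmetic on the timestamps quoted in this comment, written out in Python so it can be rechecked.

```python
from datetime import datetime

FMT = "%H:%M"
submitted = datetime.strptime("08:38", FMT)     # CREATE request sent
http_timeout = datetime.strptime("08:41", FMT)  # HTTP request timed out
create_began = datetime.strptime("15:29", FMT)  # overseer started the create

queue_wait = create_began - submitted
print(http_timeout - submitted)  # 0:03:00
print(queue_wait)                # 6:51:00
```

So the request sat in the overseer queue for six hours and 51 minutes before any work on it started.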





[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-10 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354965#comment-14354965
 ] 

Shalin Shekhar Mangar commented on SOLR-7191:
-

bq. I tested 4,000 cores on branch_5x and found better results.

Thanks, that is good to know.

bq. BTW: collection creation slows down the more collections you have in the 
cloud. Starts with qtimes of ~3s; ending with ~6s. Solr 4.x was always steady 
at ~3s.

Yes, that is expected with the current implementation. Each new collection 
touches the /clusterstate.json as a way to inform all nodes about the existence 
of a new collection (this is required for routing requests; this also happens 
on collection deletion). This invokes a reload of state information for all 
collections that have a core on that Solr node. As the number of collections 
increases (and because you have a lot of cores on the same node), the state 
refresh takes longer. The Collection Create API waits for the new collection to 
be visible in the cluster state and therefore takes longer as the number of 
collections increases.

The current implementation was made under the assumption that the system will 
have a large number of collections but a node will only have a few replicas. 
This isn't true in this case and therefore we're running into these bottlenecks.
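
A toy model makes the scaling visible. It is not Solr code, just a count of how many per-collection state reloads a single node performs while collections are created one at a time, under the simplifying assumption that the node hosts a core for one in every `hosted_one_in` existing collections (a made-up parameter).

```python
def total_refresh_work(num_collections, hosted_one_in=1):
    """Count per-collection state reloads on one node while creating
    num_collections collections one after another."""
    work = 0
    for existing in range(num_collections):
        # Each create touches the shared cluster state; the node then reloads
        # state for every collection it already hosts a core of.
        work += existing // hosted_one_in
    return work


all_on_node = total_refresh_work(4000)          # node hosts a core of everything
few_on_node = total_refresh_work(4000, 100)     # node hosts 1 in 100 collections
```

With every collection on the node the count is 7,998,000 reloads for 4000 collections; with one in a hundred it is 78,000. That factor of roughly one hundred is the difference between the assumed deployment and the one being tested here.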




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-09 Thread Damien Kamerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354379#comment-14354379
 ] 

Damien Kamerman commented on SOLR-7191:
---

I tested 4,000 cores on branch_5x and found better results. My setup:
3 nodes (32GB RAM each ; jdk1.8.0_40) running on a single server (256GB RAM).
2,000 collections (1 x shard ; 2 x replica)
1 x Zookeeper 3.4.6

Full restart (stop all nodes; start all nodes 1min staggered). Many cores on 
node1 are active, other cores are recovering. Lots of warnings 
'org.apache.solr.update.PeerSync; no frame of reference to tell if we've missed 
updates' on node2 and node3. But it is slowly recovering.

BTW: collection creation slows down the more collections you have in the cloud. 
Starts with qtimes of ~3s; ending with ~6s. Solr 4.x was always steady at ~3s.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-05 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349622#comment-14349622
 ] 

Shawn Heisey commented on SOLR-7191:


On the plus side ... in branch_5x, drawing the cloud graph is very fast, even 
if you increase the number of items per page to the point where it shows all 
4001 collections at once.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-05 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349849#comment-14349849
 ] 

Shawn Heisey commented on SOLR-7191:


Followup on the attached log, because I realized that I didn't quite understand 
what I was looking at.

The first line of the log is where the request for a collection create is 
logged ... but all the activity for that CREATE call is *before* that line, so 
you can't see it.

All of the rest of the log covers activity for the request that is logged as 
the last line.





[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-05 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349163#comment-14349163
 ] 

Shalin Shekhar Mangar commented on SOLR-7191:
-

Another issue that will help us achieve this is SOLR-6760 but that isn't 
implemented yet.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-05 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348836#comment-14348836
 ] 

Shawn Heisey commented on SOLR-7191:


As I said above, I started over with a fresh cloud example because of the 
problems I was having.  Late last night, the creation of 4000 collections 
completed.

This morning, I went to the admin UI page which I still had open, and did a 
shift-refresh on the main page.  Somehow, pulling the list of cores for the 
dropdown (there are 4001 of them) sent Solr into a tailspin where all the 
shards were marked down (orange) on the cloud graph, then they started coming 
back up.  Half an hour after this event started, only about 150 out of the 4000 
collections had fully recovered.  That projects full recovery to take half the 
day ... and this is just from basically opening the admin UI to the dashboard!

For situations where large numbers of cluster bits must change state, could 
there be an improvement made in the overseer queue messages to change the state 
of multiple bits?  If we could have a single message change the state of up to 
1000 at once (and perhaps have a special code to indicate all of the replicas 
on a node), that would go a long way towards keeping the queue tidy when there 
are major events.  (I wrote this paragraph before seeing Shalin's note about 
optimizing the node shutdown case)
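
To make the batching idea concrete, here is a sketch in Python. Nothing in it is an existing Solr message format: the "state_batch" operation name and the field layout are invented for the example, loosely following the fields visible in overseer state messages.

```python
def batch_state_messages(node_name, replicas, new_state, max_per_message=1000):
    """Group per-replica state changes for one node into messages that each
    carry at most max_per_message replicas."""
    messages = []
    for start in range(0, len(replicas), max_per_message):
        messages.append({
            "operation": "state_batch",  # hypothetical operation name
            "node_name": node_name,
            "state": new_state,
            "replicas": replicas[start:start + max_per_message],
        })
    return messages
```

For the 4001 cores on my test node, marking everything down would take five queue entries instead of 4001.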

[~shalinmangar], I will try again with branch_5x.





[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-04 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347689#comment-14347689
 ] 

Shawn Heisey commented on SOLR-7191:


At this time I do not know what the actual issues are.  I'm in the middle of 
another round of restarts right now because it became too unstable to add the 
last few hundred collections.  Once I have a complete log from that startup, I 
can attach it here.

I will add some information about how I have my test machine set up, and I can 
answer questions as needed.





[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-04 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347813#comment-14347813
 ] 

Shawn Heisey commented on SOLR-7191:


I have thought about this in terms of deciding whether we should even bother 
with it at all... but I figured it wouldn't hurt to open the issue so we can 
think about it, figure out whether any performance bottlenecks can be fixed, 
and at the very least come up with some best practice information for the docs.

One of the big problems that I had was that the overseer queue was getting 
REALLY filled up.  I gave up and started over when I realized (from looking at 
the logs) that there were over eight hundred thousand entries in the queue, all 
from attempts to restart Solr repeatedly.  That huge queue is what finally 
pushed the znode size too high for jute.maxbuffer.

Zookeeper seems to be the real bottleneck here, which is not all that 
surprising.





[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-04 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347834#comment-14347834
 ] 

Shawn Heisey commented on SOLR-7191:


Regarding jute.maxbuffer:  The new stateFormat ensures that the clusterstate 
won't break the default 1MB limit.

The giant queue I encountered was about 850000 entries, and resulted in a 
packet length of a little over 14 megabytes.  If I divide 850000 by 14, I know 
that I can have about 60000 overseer queue entries in one znode before 
jute.maxbuffer needs to be increased.
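
The estimate can be rechecked with a few lines of Python. The inputs are assumptions: a queue of roughly 850000 entries (in line with the "over eight hundred thousand entries" figure reported elsewhere in this thread) and a packet of 14 megabytes, compared against the default 1MB limit.

```python
queue_entries = 850_000   # assumption: "over eight hundred thousand" entries
packet_megabytes = 14     # "a little over 14 megabytes"

# How many queue entries fit in one megabyte of child-list response.
entries_per_megabyte = queue_entries // packet_megabytes

# Average bytes each entry contributes to the packet.
bytes_per_entry = (packet_megabytes * 1024 * 1024) / queue_entries

print(entries_per_megabyte)       # 60714
print(round(bytes_per_entry, 1))  # 17.3
```

About 17 bytes per entry is plausible for a listing of child node names, which is what the packet consists of.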

There are a couple of possibilities for managing really large overseer queues 
within the default buffer size.  One is to throttle the creation of new 
overseer entries when the number of existing entries exceeds a certain 
threshold ... I'm thinking 32768 ... so that hopefully the overseer can catch 
up.  Another is to change the structure of the queue so that it creates new 
nodes under /overseer/queue and then puts actual entries inside those nodes, 
limiting the number of queue entries in each one to 32768.  With 32768 nodes 
that each have 32768 entries, the queue could easily reach one billion entries 
without breaking the buffer... although a queue that size might take days or 
weeks to process.
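
Both ideas are simple enough to sketch. The following Python is only an illustration of the arithmetic; the function names are invented, and how the buckets would be named or laid out under /overseer/queue is left open.

```python
BUCKET_LIMIT = 32768


def should_throttle(queue_length, threshold=BUCKET_LIMIT):
    """First idea: hold back new overseer entries once the queue is too long."""
    return queue_length > threshold


def bucket_for(sequence_number, bucket_limit=BUCKET_LIMIT):
    """Second idea: map an entry's sequence number to
    (bucket node index, slot inside that bucket)."""
    return divmod(sequence_number, bucket_limit)


# Capacity of a two-level queue with 32768 buckets of 32768 entries each.
capacity = BUCKET_LIMIT * BUCKET_LIMIT
print(capacity)  # 1073741824
```

Neither znode in the two-level layout ever lists more than 32768 children, which stays comfortably within the default buffer.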

A similar problem would exist with the /collections node ... so a limit of 
32768 collections would also be prudent.  I expect that users will run into 
other problems (number of processes being the one that comes to mind) long 
before they reach 32768 collections, though.





[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-04 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347746#comment-14347746
 ] 

Erick Erickson commented on SOLR-7191:
--

Just throwing this out there, but tangentially related is the whole "Lots of 
cores" thing. It's really an open question to me how much effort to put into 
supporting a zillion collections (or even a zillion cores).

The LotsOfCores code was written without considering how to support it in 
SolrCloud. I'm quite sure it'd be interesting to support. I'm explicitly 
_not_ advocating whether it be supported in SolrCloud, it'd have to be thought 
out pretty carefully and my instinct is that it wouldn't be worth the effort.

FWIW




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-04 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348235#comment-14348235
 ] 

Shalin Shekhar Mangar commented on SOLR-7191:
-

Shawn, can you try with branch_5x instead? In particular, SOLR-6956 should 
help. There was unnecessary synchronization on the overseer node which caused 
OverseerCollectionProcessor and replicas on the overseer node to not see 
updated cluster state. Considering that you're running only 3(?) nodes, 1 of 
the nodes involved could be slowing down the entire cluster.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-04 Thread Damien Kamerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348132#comment-14348132
 ] 

Damien Kamerman commented on SOLR-7191:
---

I'm more concerned with stability than performance. I've created up to 10,000 
cores and the cloud is fine and well within memory limits. However, the cloud 
will never restart, and I see lots of warnings like 
'org.apache.solr.cloud.ZkController; Still seeing conflicting information about 
the leader of shard'.




[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

2015-03-04 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348240#comment-14348240
 ] 

Shalin Shekhar Mangar commented on SOLR-7191:
-

We can optimize the node shutdown case. Instead of each core publishing a 
'down' state individually, Solr can publish a single 'down' state for the 
entire node and the overseer can do the rest.
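
A rough sketch of the difference, not Solr's actual code: the toy model below 
(written in Python for brevity; every class, function, and message name is 
hypothetical) counts how many state messages an overseer would have to process 
when a node hosting many cores shuts down, first with one 'down' message per 
core and then with one message for the whole node.

```python
# Toy model only: class, function, and message names are hypothetical,
# not Solr's real overseer API.

class ToyOverseer:
    def __init__(self):
        self.replicas = {}   # core name -> [node name, state]
        self.processed = 0   # number of state messages handled

    def add_replica(self, core, node):
        self.replicas[core] = [node, "active"]

    def handle(self, msg):
        """Apply one queued state message to the toy cluster state."""
        self.processed += 1
        if msg["op"] == "core_down":
            self.replicas[msg["core"]][1] = "down"
        elif msg["op"] == "node_down":
            # Fan the single node-level message out to every core on that node.
            for entry in self.replicas.values():
                if entry[0] == msg["node"]:
                    entry[1] = "down"
        else:
            raise ValueError("unknown op: %r" % msg["op"])


def build_cluster(cores_per_node, nodes):
    """Create a toy cluster with the same number of cores on each node."""
    overseer = ToyOverseer()
    for node in nodes:
        for i in range(cores_per_node):
            overseer.add_replica("%s_core%d" % (node, i), node)
    return overseer


def shutdown_per_core(overseer, node):
    """Every core on the node publishes its own 'down' message."""
    cores = [c for c, (n, _) in overseer.replicas.items() if n == node]
    for core in cores:
        overseer.handle({"op": "core_down", "core": core})


def shutdown_per_node(overseer, node):
    """The node publishes one message; the overseer marks its cores down."""
    overseer.handle({"op": "node_down", "node": node})


def down_cores(overseer):
    """Return the sorted names of all cores currently marked 'down'."""
    return sorted(c for c, (_, s) in overseer.replicas.items() if s == "down")


if __name__ == "__main__":
    a = build_cluster(2000, ["node1", "node2"])
    b = build_cluster(2000, ["node1", "node2"])
    shutdown_per_core(a, "node1")
    shutdown_per_node(b, "node1")
    print("per-core messages:", a.processed)
    print("per-node messages:", b.processed)
    print("same final state:", down_cores(a) == down_cores(b))
```

In this model both approaches leave the cluster state identical, but the 
per-node variant puts one message on the queue instead of one per core, which 
is where the saving would come from on a node with thousands of cores.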
