[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926810#comment-15926810 ]

Varun Thacker commented on SOLR-7191:

> Restarting a single node takes over two minutes in good circumstance

I posted some numbers on SOLR-10265 which you might find helpful. You can run that test on your cluster to see the speed of the overseer.

A dedicated overseer node can help you in the meanwhile as well: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-ADDROLE:AddaRole

> Improve stability and startup performance of SolrCloud with thousands of collections
>
> Key: SOLR-7191
> URL: https://issues.apache.org/jira/browse/SOLR-7191
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 5.0
> Reporter: Shawn Heisey
> Assignee: Noble Paul
> Labels: performance, scalability
> Fix For: 6.3
> Attachments: lots-of-zkstatereader-updates-branch_5x.log, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many problems myself even before I was able to get 4000 collections created on a 5.0 example cloud setup. Restarting Solr takes a very long time, and it is not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance and scalability. It doesn't help that I'm running both Solr nodes on one machine (I started with 'bin/solr -e cloud') and that ZK is embedded.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
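For readers who want to try the dedicated-overseer suggestion, the following is a minimal sketch of how an ADDROLE request URL for the Collections API could be assembled. The base URL and the node name are placeholders (assumptions, not values from this thread); a real node name should be taken from the cluster's live nodes list. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode


def addrole_url(base_url: str, node_name: str, role: str = "overseer") -> str:
    """Build a Collections API ADDROLE request URL (nothing is sent)."""
    query = urlencode({"action": "ADDROLE", "role": role, "node": node_name})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


# Placeholder base URL and node name, chosen only for illustration.
url = addrole_url("http://localhost:8983/solr", "192.168.1.10:8983_solr")
print(url)
```

The node designated this way is preferred as overseer; it is still a regular Solr node, so keeping it free of replicas is a separate deployment decision.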
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15926204#comment-15926204 ]

Joshua Humphries commented on SOLR-7191:

Our cluster has many thousands of collections, most of which have only a single shard and single replica. Restarting a single node takes over two minutes in good circumstances (expected restart, like during upgrades of solr or deployment of new/updated plugins). In bad circumstances, like if machines appear wedged and leader election issues have already caused the overseer queue to grow large, restarting a server can take over 10 minutes!

While watching the overseer queue size in our latest observation of this slowness, I saw that the down node messages take *way* too long to process. I ended up tracking that to an issue where it results in a ZK write for *every* collection, not just the collections that had shard-replicas on that node. In our case, it was processing about 40 times too many collections, making a rolling restart of the whole cluster effectively O(n^2) instead of O(n) in terms of the writes to ZK. See SOLR-10277.
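The O(n^2) versus O(n) observation can be illustrated with a small counting model. This is only a sketch under stated assumptions: one down-node message per restarted node, one ZK write per collection touched, and collections spread evenly across nodes. The cluster size of 40 nodes and 4000 collections is hypothetical, picked so that the ratio matches the "about 40 times" figure in the comment.

```python
def rolling_restart_writes(num_nodes: int, num_collections: int,
                           collections_per_node: int, touch_all: bool) -> int:
    """Count ZK writes for a full rolling restart in a simplified model.

    Each restarted node produces one down-node message. The message either
    touches every collection (touch_all=True) or only the collections that
    have a replica on that node.
    """
    per_message = num_collections if touch_all else collections_per_node
    return num_nodes * per_message


# Hypothetical cluster: 40 nodes, 4000 single-replica collections, 100 per node.
all_collections = rolling_restart_writes(40, 4000, 100, touch_all=True)
only_hosted = rolling_restart_writes(40, 4000, 100, touch_all=False)
print(all_collections, only_hosted, all_collections // only_hosted)
```

Because the number of collections tends to grow with the number of nodes, the first variant grows with the product of the two, while the second stays proportional to the total number of replicas.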
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925821#comment-15925821 ]

Tim Owen commented on SOLR-7191:

Admittedly not thousands of collections, but another anecdote. Each of our clusters is 12 hosts running 6 nodes each, with 165 collections of 16 shards each, 3x replication. So around 7900 cores spread over 72 nodes (roughly 100 each).

To get stable restarts we throttle the recovery thread pool size; see the ticket I raised with our patch, SOLR-9936. Without that, the amount of recovery just kills the network and disks and the cluster status never settles. We also avoid restarting all nodes at once: we bring up a few at a time and wait for their recovery to finish before starting more. We need to automate this, e.g. using a ZooKeeper lock pool so that nodes will wait to start up.
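The "a few at a time" restart procedure described above can be sketched as a simple batching loop. This is not the ZooKeeper lock pool the comment proposes; it is a single-process approximation in which `restart` and `wait_for_recovery` are hypothetical callbacks that the operator would supply. The first lines also work through the core-count arithmetic from the comment.

```python
from typing import Callable, Iterator, List, Sequence

# Arithmetic from the comment: 165 collections x 16 shards x 3 replicas.
total_cores = 165 * 16 * 3          # 7920, "around 7900"
cores_per_node = total_cores / 72   # 110.0 on average


def batches(nodes: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` nodes."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(nodes), size):
        yield list(nodes[start:start + size])


def rolling_restart(nodes: Sequence[str], batch_size: int,
                    restart: Callable[[str], None],
                    wait_for_recovery: Callable[[str], None]) -> None:
    """Restart one batch, then wait for every node in it before continuing."""
    for group in batches(nodes, batch_size):
        for node in group:
            restart(node)             # hypothetical: start the Solr node
        for node in group:
            wait_for_recovery(node)   # hypothetical: block until replicas are active
```

A lock pool in ZooKeeper would achieve the same bound on concurrent recoveries without a central driver script, at the cost of more moving parts.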
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925790#comment-15925790 ]

Jerome Yang commented on SOLR-7191:

I tested on Solr 6.4.2 in cloud mode: 32 nodes on 3 CentOS 6.7 hosts, with 100 collections created at a replication factor of 3, 1800 cores in total. After restarting the whole cluster (I wrote a script that starts all nodes at the same time), only some of the collections are green, and sometimes all collections are marked as down.

It seems this is still not fixed in Solr 6.4.2. Please re-open this.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893410#comment-15893410 ]

Shawn Heisey commented on SOLR-7191:

Now that SOLR-10130 has been discovered, there's a possibility that the bigger problems I ran into with a repeat test on 6.4 might have been related to that. When I find some time, I will need to repeat with 6.4.2 or branch_6x.

I disagree with the status of this issue as FIXED. No changes have been committed in relation to this issue. At best, the problems I encountered with 5.0 are the same, and may have gotten worse.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833804#comment-15833804 ]

Damien Kamerman commented on SOLR-7191:

Regarding the extra threads, I'm thinking the issue is that CoreContainer.load() now calls ZkContainer.registerInZk() with background 'true'. Check how many 'coreZkRegister' threads there are.
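One way to check how many 'coreZkRegister' threads exist is to take a thread dump (for example with jstack) and count thread names by prefix. The sketch below assumes the usual dump layout in which each thread starts with a line holding its quoted name, and it groups names by the text before the first hyphen. The sample dump is made up for illustration and is not output from a real Solr node.

```python
import re
from collections import Counter

# A thread header line starts with the thread name in double quotes.
THREAD_HEADER = re.compile(r'^"([^"]+)"')


def count_thread_pools(dump: str) -> Counter:
    """Count threads per name prefix (text before the first '-')."""
    counts: Counter = Counter()
    for line in dump.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            counts[match.group(1).split("-")[0]] += 1
    return counts


# Made-up sample, shaped like a thread dump, for illustration only.
SAMPLE_DUMP = '''\
"coreZkRegister-1-thread-1" #40 prio=5 os_prio=0 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"coreZkRegister-1-thread-2" #41 prio=5 os_prio=0 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"qtp110456297-19" #19 prio=5 os_prio=0 runnable
   java.lang.Thread.State: RUNNABLE
'''

pool_counts = count_thread_pools(SAMPLE_DUMP)
print(pool_counts.most_common())
```

Running the same count against dumps from two Solr versions gives a quick per-pool comparison of where the extra threads come from.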
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833172#comment-15833172 ]

Shawn Heisey commented on SOLR-7191:

> Shawn where can I raise maxThreads of jetty?

server/etc/jetty.xml
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832908#comment-15832908 ]

Yago Riveiro commented on SOLR-7191:

Restarting a node in 6.3 now takes forever ... I bumped coreLoadThreads from 4 to 512 and restarting a node with 1500 collections takes 20 - 25 minutes. If I bump coreLoadThreads to 1024 or 2048 it is faster, but sometimes replicas stay in a wrong state and never come up.

Another thing I see happening now is collections created without replicas.

Shawn where can I raise maxThreads of jetty?
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832705#comment-15832705 ]

Mark Miller commented on SOLR-7191:

I know we have gotten pretty good at labeling and grouping threads, so hopefully a comparison between versions is not too difficult.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832668#comment-15832668 ]

Shawn Heisey commented on SOLR-7191:

With hard nproc at 61440, soft nproc at 40960, and maxThreads for each jetty at 2, I am still not able to create enough threads to start both Solr instances with just under 1900 collections in the cloud. The user in question is running a few other things on the system, but the number of threads involved there is less than 300.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832664#comment-15832664 ]

Shawn Heisey commented on SOLR-7191:

I figured I would try the test setup again on Solr 6.4 to see whether the situation has improved with newer versions.

The system requirements for thousands of collections have NOT gotten better. They seem to have gotten considerably worse. The time from node restart to stable operation MIGHT have improved, but since I haven't yet been able to create all 4000 collections, I cannot be sure about that.

I ran into serious trouble before I had even created 1000 collections. Bumped the heap and proceeded to create more ... but ran into more trouble. With a 12g heap for the instance running zookeeper, I noticed that I was getting an OOME about not being able to create threads when I had gotten a little more than 1800 collections created.

I have changed nproc in /etc/security/limits.conf (was a soft limit of 4096 and a hard limit of 6144) and bumped maxThreads in the Jetty config, and once the cluster is stable after restart, I will try to make more collections.
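Since the "unable to create threads" OOME is tied to the per-user process/thread limit, it can help to confirm which nproc limits a process actually sees after editing /etc/security/limits.conf. The following is a small sketch using Python's standard `resource` module on a Unix-like system; it reports the limits of the Python process itself, so it would need to run as the same user, in the same kind of session, as the Solr process for the numbers to be comparable.

```python
import resource


def describe(limit: int) -> str:
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


# RLIMIT_NPROC is the limit that the nproc entries in limits.conf control.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"nproc soft={describe(soft)} hard={describe(hard)}")
```

If the printed soft limit still shows the old value, the new limits.conf entry has probably not been picked up by the login session that starts Solr.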
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15702278#comment-15702278 ]

Erick Erickson commented on SOLR-7191:

[~noble.paul] Can we close this since SOLR-7280 has been committed?

> Improve stability and startup performance of SolrCloud with thousands of collections
>
> Key: SOLR-7191
> URL: https://issues.apache.org/jira/browse/SOLR-7191
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 5.0
> Reporter: Shawn Heisey
> Assignee: Shalin Shekhar Mangar
> Labels: performance, scalability
> Attachments: SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, lots-of-zkstatereader-updates-branch_5x.log
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many problems myself even before I was able to get 4000 collections created on a 5.0 example cloud setup. Restarting Solr takes a very long time, and it is not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance and scalability. It doesn't help that I'm running both Solr nodes on one machine (I started with 'bin/solr -e cloud') and that ZK is embedded.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357167#comment-15357167 ]

Noble Paul commented on SOLR-7191:

I have simplified this patch and moved it over to SOLR-7280. I plan to commit that soon.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15357146#comment-15357146 ]

Noble Paul commented on SOLR-7191:

Yeah, normally you are fine. If there is a GC pause in the overseer node, a lot of messages can get stuck in the queue, and this will lead to even more threads waiting indefinitely (consuming more memory), aggravating the situation.
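The effect of an overseer GC pause on the queue can be approximated with back-of-the-envelope arithmetic. The sketch below is a deliberately crude model, not a description of the overseer's actual behaviour: it assumes messages arrive at a constant rate, nothing is processed during the pause, and processing afterwards runs at a constant rate. All rates and durations are hypothetical.

```python
def backlog_after_pause(arrival_rate: float, pause_seconds: float) -> float:
    """Messages queued while the overseer processes nothing."""
    return arrival_rate * pause_seconds


def drain_seconds(backlog: float, arrival_rate: float, service_rate: float) -> float:
    """Time to clear the backlog while new messages keep arriving."""
    if service_rate <= arrival_rate:
        return float("inf")  # the queue never empties
    return backlog / (service_rate - arrival_rate)


# Hypothetical numbers: 200 msg/s arriving, 10 s pause, 250 msg/s processed.
backlog = backlog_after_pause(200, 10)
recovery = drain_seconds(backlog, 200, 250)
print(backlog, recovery)
```

The model shows why a short pause can take much longer to recover from when the overseer has little headroom: the drain time depends on the difference between the two rates, not on the service rate alone.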
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353556#comment-15353556 ]

Erick Erickson commented on SOLR-7191:

Yeah, probably, as I'm out of my depth here. Since each replica goes through at least three state changes (down->recovering->active), not to mention leadership election and such, and each state change needs to get to ZK, I'm really not at all sure how to cut that number down.
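The lower bound implied by that comment is simple to work out. The sketch below multiplies the replica count by the minimum number of state changes per replica stated above; it ignores leader election and any batching of updates, and the replica count used in the example is a hypothetical input.

```python
def min_state_updates(replicas: int, changes_per_replica: int = 3) -> int:
    """Lower bound on state-change messages for a full cluster restart."""
    return replicas * changes_per_replica


# Hypothetical cluster with 1800 replicas (cores).
updates = min_state_updates(1800)
print(updates)
```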
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353502#comment-15353502 ]

Scott Blum commented on SOLR-7191:

[~erickerickson] we may be talking about 2 different things? I'm referring to the total number of Overseer state update operations that happen. Is there a relationship between that and the watcher side that I'm unaware of? Also, in my case, we only have 1 replica per shard, period, so leadership contention shouldn't be an issue at all.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353437#comment-15353437 ] Erick Erickson commented on SOLR-7191:
FWIW, I tried both and had no trouble even when the overseer was on one of the replicas with lots of cores. That said, I agree it's wise to put the overseer somewhere else in these cases. Certainly can't hurt!
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353362#comment-15353362 ] Erick Erickson commented on SOLR-7191:
IIUC, the whole watcher thing is replica based. So N replicas for the same collection in the same JVM register N watchers. If that's true, does it make sense to think about watchers being set per _collection_ in a JVM rather than per _replica_? I admit I'm completely ignorant of the nuances here. It also wouldn't make any difference in a collection where each instance hosted exactly one replica per collection, but practically I'm not sure there's anything we can do about that anyway.
Although it seems that each replica could be an Observer for a given collection (watcher at the JVM level?) without doing much violence to the current architecture. Or maybe it'd just be simpler to have the replicas get their state information from some kind of cache maintained at the JVM level where the cache was updated via watcher.
I admit I'm talking through my hat here. Maybe there should be a JIRA to discuss this?
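The JVM-level cache idea floated above can be pictured with a small sketch. Nothing here is Solr or ZooKeeper code: `CollectionStateCache`, `FakeZk`, and the `register_watcher` hook are hypothetical names made up for illustration, and the fake watcher fires repeatedly, whereas real ZooKeeper watches are one-shot and must be re-registered. The sketch is also not thread-safe.

```python
class CollectionStateCache:
    """One watcher per collection per process; replicas read from the cache."""

    def __init__(self, register_watcher):
        # Hypothetical hook with the shape (collection, callback) -> None.
        self._register_watcher = register_watcher
        self._states = {}

    def get_state(self, collection):
        # Register a watcher only the first time a collection is requested.
        if collection not in self._states:
            self._states[collection] = None
            self._register_watcher(collection, self._on_change)
        return self._states[collection]

    def _on_change(self, collection, new_state):
        self._states[collection] = new_state


class FakeZk:
    """Stand-in for a coordination service, used only to count watchers."""

    def __init__(self):
        self.watchers = {}

    def watch(self, collection, callback):
        self.watchers.setdefault(collection, []).append(callback)

    def publish(self, collection, state):
        for callback in self.watchers.get(collection, []):
            callback(collection, state)


zk = FakeZk()
cache = CollectionStateCache(zk.watch)
for _replica in range(10):  # ten replicas of the same collection in one JVM
    cache.get_state("collection1")
zk.publish("collection1", "active")
print(len(zk.watchers["collection1"]), cache.get_state("collection1"))
```

With ten replicas asking for the same collection, only one watcher is registered, which is the reduction the comment is asking about.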
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353293#comment-15353293 ] Noble Paul commented on SOLR-7191:
The cluster was very stable when I used a dedicated overseer, assigned with the {{ADDROLE}} command. I used the replica placement strategy to ensure that the overseer nodes did not have any replicas created. For any reasonably large cluster, I recommend using dedicated overseer nodes. Another observation was that the overseer nodes use very little memory: heap usage never went beyond 200MB.
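For readers who want to try the dedicated-overseer setup, the role is assigned with a plain Collections API request. A minimal sketch that only builds the URL and does not send it; the base URL and node name are placeholders, not values from this thread:

```python
from urllib.parse import urlencode


def addrole_url(solr_base_url, node_name, role="overseer"):
    """Build a Collections API ADDROLE request URL (the request is not sent)."""
    params = {"action": "ADDROLE", "role": role, "node": node_name}
    return solr_base_url.rstrip("/") + "/admin/collections?" + urlencode(params)


# Placeholder host and node name; node names use the host:port_context form
# that appears under live_nodes in ZooKeeper.
print(addrole_url("http://localhost:8983/solr", "192.168.1.10:8983_solr"))
```

Keeping replicas off that node is a separate step (the comment above mentions the replica placement strategy) and is not covered by this call.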
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353278#comment-15353278 ] Erick Erickson commented on SOLR-7191:
Empirically it was fine, but that was on very few runs. I had 10 replicas in each JVM and 3 load threads in one variant. Of course I might just have gotten lucky.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353262#comment-15353262 ] Scott Blum commented on SOLR-7191:
(for the record, this was on a 5.5.1 based build)
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353261#comment-15353261 ] Scott Blum commented on SOLR-7191:
Paginated getChildren()... always wondered why that wasn't a thing.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353258#comment-15353258 ] Shawn Heisey commented on SOLR-7191:
bq. That seems... insane.
I am glad that I am not the only one to think the number of updates in the overseer queue for node startup is insane. When you get that many updates in the queue and haven't made a big change to jute.maxbuffer, ZooKeeper starts failing because the size of the znode will become too large. I think it's crazy that ZooKeeper allows *writes* to a znode when the write will make the node too big. See ZOOKEEPER-1162.
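To get a feel for how quickly a large queue can run into jute.maxbuffer, here is a rough back-of-the-envelope sketch. It assumes the limit is hit when listing the queue's children, that each child name costs a 4-byte length prefix plus its UTF-8 bytes, and that entries are named like `qn-0000000001`; the entry name is an assumption used only for its length, and reply headers are ignored, so treat the result as an order of magnitude rather than an exact threshold.

```python
JUTE_MAXBUFFER_DEFAULT = 0xFFFFF  # ZooKeeper's default jute.maxbuffer, in bytes


def max_children_listable(child_name, max_buffer=JUTE_MAXBUFFER_DEFAULT):
    """Rough upper bound on how many children fit in one child-list reply.

    Assumes all child names have the same length, each costing a 4-byte
    length prefix plus its UTF-8 bytes, with a 4-byte element count in
    front. Reply headers are ignored, so the real limit is slightly lower.
    """
    per_child = 4 + len(child_name.encode("utf-8"))
    return (max_buffer - 4) // per_child


# Assumed queue entry name, used only for its length.
print(max_children_listable("qn-0000000001"))
```

Under these assumptions the ceiling is in the tens of thousands of queued entries, which is the same order of magnitude as the update counts reported in this thread.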
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353247#comment-15353247 ] Scott Blum commented on SOLR-7191:
This may be unrelated to the current patch work, but seems relevant to the uber ticket: I rebooted our Solr cluster the other night to pick up an update, and I ran into what seemed to be pathological behavior around state updates.
My first attempt to bring up everything at once resulted in utter deadlock, so I shut everything down, manually nuked all the overseer queues/maps in ZK, and started bringing them up one at a time. What I saw was kind of astounding. I was monitoring OVERSEERSTATUS and tracking the number of outstanding overseer ops + the total number of update_state ops, and I noticed that every VM I brought up needed ~4000 update_state ops to stabilize, despite the fact that each VM only manages ~128 cores.
We have 32 VMs with ~128 cores each, or ~4096 cores in our entire cluster... it took over 100,000 update_state operations to bring the whole cluster up. That seems... insane. 3 or 4 update_state ops per core would seem reasonable to me, but I saw over 30 ops per core loaded as I went. This number was extremely consistent for every node I brought up.
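The figures in the comment above are easy to cross-check. The sketch below simply redoes the arithmetic from the approximate numbers quoted there (32 VMs, ~128 cores each, ~4000 update_state ops per VM) and compares them with the 4 ops per core the commenter considered reasonable; the function name is made up for this example.

```python
def update_state_summary(vms, cores_per_vm, ops_per_vm, expected_ops_per_core):
    """Compare observed update_state traffic with a per-core expectation."""
    total_cores = vms * cores_per_vm
    return {
        "total_cores": total_cores,
        "observed_ops_per_core": ops_per_vm / cores_per_vm,
        "observed_total_ops": vms * ops_per_vm,
        "expected_total_ops": total_cores * expected_ops_per_core,
    }


# Approximate figures quoted in the comment above.
summary = update_state_summary(
    vms=32, cores_per_vm=128, ops_per_vm=4000, expected_ops_per_core=4
)
for key, value in summary.items():
    print(key, value)
```

The observed per-core figure comes out a little above 30, roughly eight times the 4 ops per core used as the expectation.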
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352431#comment-15352431 ] damien kamerman commented on SOLR-7191:
Concern based on general principles, e.g. if a shard has all 4 replicas on the one JVM and there are 3 load threads, then registration will be based on the first three cores only.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352399#comment-15352399 ] Noble Paul commented on SOLR-7191:
bq. if there is a collection that has more than coreLoadThreadCount
or a 'shard' has more replicas?
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352349#comment-15352349 ] Erick Erickson commented on SOLR-7191:
Hmmm, is this a concern based on general principles or on a code path that is expected to fail? I tested a couple of scenarios. All of them used 4 JVMs and 3 load threads. The "smoke test" was starting all the JVMs at once.
1> 100 collections, 4 shards x 4 replicas each
2> 10 collections, 4 shards x 40 replicas each.
Of course my testing could easily have missed the corner cases that trip this as it was pretty bare-bones.
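To relate these two scenarios to the coreLoadThreadCount concern, the sketch below works out how many cores each JVM hosts and how many replicas of a single collection land on one JVM. It assumes replicas are spread evenly across the 4 JVMs, which the comment does not state, and the function name is invented for this example.

```python
def layout(collections, shards, replicas_per_shard, jvms):
    """Per-JVM core counts, assuming replicas are spread evenly across JVMs."""
    total_cores = collections * shards * replicas_per_shard
    return {
        "total_cores": total_cores,
        "cores_per_jvm": total_cores // jvms,
        "replicas_of_one_collection_per_jvm": shards * replicas_per_shard // jvms,
    }


scenario_1 = layout(collections=100, shards=4, replicas_per_shard=4, jvms=4)
scenario_2 = layout(collections=10, shards=4, replicas_per_shard=40, jvms=4)
print(scenario_1)
print(scenario_2)
```

Under the even-spread assumption, both scenarios place more replicas of one collection on a JVM than there are load threads (3), so the tests would have exercised the case being discussed.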
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352228#comment-15352228 ] damien kamerman commented on SOLR-7191:
Only coreLoadThreadCount cores are registering at a time on each JVM, so the concern is that if there is a collection that has more than coreLoadThreadCount replicas on a JVM, then registration could fail.
-- Damien Kamerman
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352216#comment-15352216 ] Erick Erickson commented on SOLR-7191:
Noble: A few comments:
> The stress setup I have is sailing right through the parts that were dying last week, so this is looking good.
> I also tried back-porting this to 5x and it seems to be working equally well there.
> There's a comment in CoreContainer:
// OK to limit the size of the executor in zk mode as cores are loaded in order.
// This assumes replicaCount is less than coreLoadThreadCount?
I didn't read it before I started testing, so I didn't know enough to be scared... I'm starting 400 cores in each JVM with 10 coreLoadThreads, so does the fact that the loading is in order preclude fewer threads than cores being a problem? I also experimented with 3 threads (4 shards, 4 replicas each) and saw no problems.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340698#comment-15340698 ] damien kamerman commented on SOLR-7191:
The patch from March 2015 was against Solr 4.10. The patch from Oct 2015 was against Solr trunk (1708905)
-- Damien Kamerman
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340685#comment-15340685 ] Erick Erickson commented on SOLR-7191:
[~noble.paul] Not sure what version this patch was against. My testing and most recent notes were on stock Solr...
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340106#comment-15340106 ] Noble Paul commented on SOLR-7191:
Which version is the patch built on?
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325626#comment-15325626 ] damien kamerman commented on SOLR-7191:
This fits with what I've seen on Solr 4/5. The cores register on an unlimited thread pool. The patch I did was to limit the thread pool and register in order.
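The general technique described above, a bounded pool fed in a fixed order, can be sketched as follows. This is not the patch itself and not Solr code: `load_cores_in_order`, `ConcurrencyTracker`, and the core names are hypothetical, and the "load" is a short sleep standing in for real work.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def load_cores_in_order(core_names, core_load_threads, load_core):
    """Start cores in sorted order on a pool of at most core_load_threads workers."""
    with ThreadPoolExecutor(max_workers=core_load_threads) as pool:
        futures = [pool.submit(load_core, name) for name in sorted(core_names)]
        # Results are collected in submission order; completion order may differ.
        return [future.result() for future in futures]


class ConcurrencyTracker:
    """Hypothetical loader that records how many loads ran at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __call__(self, name):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # stand-in for the real loading and registration work
        with self._lock:
            self.active -= 1
        return name


tracker = ConcurrencyTracker()
names = [f"collection{i}_shard1_replica1" for i in range(20)]
loaded = load_cores_in_order(names, core_load_threads=3, load_core=tracker)
print(len(loaded), tracker.peak)
```

Capping the pool bounds how many cores load and register at once, which is the behaviour the later comments about coreLoadThreadCount are questioning.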
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325575#comment-15325575 ]
Erick Erickson commented on SOLR-7191:
I had to chase after this for a while, so I'm recording results of some testing for posterity.
> Setup: 4 Solr JVMs, 8G each (64G total RAM on the machine).
> Create 100 4x4 collections (i.e. 4 replicas, 4 shards each). 1,600 total shards.
> Note that the cluster is fine at this point, everything's green.
> No data indexed at all.
> Shut all Solr instances down.
> Bring up a Solr on a different box. I did this to eliminate the chance that the Overseer was somehow involved since it is now on the machine with no replicas. I don't think this matters much though.
> Bring up one JVM.
> Wait for all the nodes on that JVM to come up. Now every shard has a leader, and the collections are all green. 3 of 4 replicas for each shard are "gone" of course, but it's a functioning cluster.
> Bring up the next JVM: Kabloooey. Very shortly you'll start to see OOM errors on the _second_ JVM but not the first.
> The number of threads on the first JVM is about 1,200. On the second, it goes over 2,000. Whether this would drop back down or not is an open question.
> So I tried playing with -Xss to drop the size of the stack on the threads, and even dropping it by half didn't help.
> Expanding the memory on the second JVM to 32G didn't help.
> I tried increasing the process limit (ulimit -u), on a hint that there was a wonky effect there somehow, to no avail.
> Especially disconcerting is the fact that this node was running fine when the collections were _created_, it just can't get past restart.
> Changing coreLoadThreads even down to 2 did not seem to help.
> At no point does the reported memory consumption via jConsole or top get even close to the allocated JVM limits.
> I'd like to be able to just start all 4 JVMs at once, but didn't get that far.
> If one tries to start additional JVMs anyway, there's a lot of thrashing around: replicas go into recovery, go out of recovery, are permanently down etc. Of course with OOMs it's unclear what _should_ happen.
> The OOM killer script apparently does NOT get triggered. I think the OOM is swallowed, perhaps in Zookeeper client code. Note that if the OOM killer script _did_ get fired, the second and later JVMs would just die.
> Error is OOM: Unable to create new native thread.
> Here's a stack trace, there are a _lot_ of these...
ERROR - 2016-06-11 00:05:36.806; [ ] org.apache.zookeeper.ClientCnxn$EventThread; Error while calling watcher
java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:714)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:214)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
    at org.apache.solr.common.cloud.SolrZkClient$3.process(SolrZkClient.java:266)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
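The thread counts quoted above (about 1,200 on the first JVM, over 2,000 on the second) can also be read from inside a JVM rather than from jConsole. What follows is only a minimal sketch using the standard JDK management beans; the class name ThreadCountProbe is made up for illustration and is not part of Solr.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not Solr code: reports thread counts for the current JVM.
public class ThreadCountProbe {

    /** Returns {live, peak} thread counts as seen by the JVM's ThreadMXBean. */
    static int[] snapshot() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return new int[] { threads.getThreadCount(), threads.getPeakThreadCount() };
    }

    public static void main(String[] args) {
        int[] counts = snapshot();
        System.out.println("live threads=" + counts[0] + ", peak=" + counts[1]);
    }
}
```

Logging such a snapshot periodically during startup would show whether the count on the second JVM ever drops back down, which the comment leaves as an open question.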
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970255#comment-14970255 ]
Damien Kamerman commented on SOLR-7191:
After 2 min, around 100 collections were all-green. This is with a 3-node ensemble. Ten minutes would be great, and I guess with 3K collections I would be close to that mark.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970253#comment-14970253 ]
Damien Kamerman commented on SOLR-7191:
1. hmmm cancel that. Initially I noticed very slow (around 60min total) shutdown in JmxMonitoredMap.clear(). I went back to test it and was unable to reproduce!? I did update the trunk. A partial stack is all I've saved:
    at org.apache.solr.core.JmxMonitoredMap.clear(JmxMonitoredMap.java:144)
    at org.apache.solr.core.SolrCore.close(SolrCore.java:1263)
    at org.apache.solr.core.SolrCores.close(SolrCores.java:124)
    at org.apache.solr.core.CoreContainer.shutdown(CoreContainer.java:564)
    at org.apache.solr.servlet.SolrDispatchFilter.destroy(SolrDispatchFilter.java:172)
2. OK, will look into that.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968920#comment-14968920 ]
Shalin Shekhar Mangar commented on SOLR-7191:
Thanks Damien.
# What is the purpose of the fastClose in SolrCore.close()? It only disables clearing the JMX registry. Have you found that to be very slow in practice?
# Your schema cache will cache the schema indefinitely and won't reload on changes made by the schema API or manually. You need to use the znode version of the schema file as part of the key name to ensure that you can reload schemas.
I'll have to test your change of moving the updateClusterState to CoreContainer.
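The second review point, making the znode version part of the cache key, can be sketched roughly as follows. This is not the patch's code: VersionedSchemaCache, its key format, and the loader callback are all hypothetical names chosen for illustration. The only assumption about ZooKeeper is that each znode carries an integer version that changes when its data is rewritten.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical cache, not Solr code: a schema is looked up by config set,
// file name AND znode version, so an edited schema file misses the cache.
public class VersionedSchemaCache<S> {

    private final Map<String, S> cache = new ConcurrentHashMap<>();

    /** Builds the cache key; the "@version" suffix is what forces a reload after an edit. */
    static String key(String configSet, String schemaFile, int znodeVersion) {
        return configSet + "/" + schemaFile + "@" + znodeVersion;
    }

    /** Returns the cached schema for this exact version, loading it at most once. */
    public S get(String configSet, String schemaFile, int znodeVersion, Supplier<S> loader) {
        return cache.computeIfAbsent(key(configSet, schemaFile, znodeVersion), k -> loader.get());
    }
}
```

A sketch this small never evicts entries for old versions; a real cache would presumably need to drop them, otherwise repeated schema edits would slowly grow the map.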
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14966893#comment-14966893 ]
Shawn Heisey commented on SOLR-7191:
bq. I created 6,000 collections (3 nodes; 2 x replicas) and re-started the 3 nodes, and all green in 24min.
Very nice! That's an improvement, though still fairly slow. How quickly did the cluster reach full usability -- at least one replica active on all collections? That's more important than all-green. I would hope for full stability in a timeframe well under ten minutes, and under five minutes would be even better. I don't want to discourage your efforts.
Was this with a single ZK node, or did you have a 3-node ensemble? It's my understanding that as nodes are added to the ensemble, database updates get slower, because the write must be coordinated on more hosts. A redundant ensemble cannot update the database as fast as a single node, so verification of progress should be done with an ensemble.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642896#comment-14642896 ]
Shawn Heisey commented on SOLR-7191:
Another user on the mailing list brought up another scaling problem -- large numbers of collections using a large configuration. A 6MB config across 1000 collections requires several gigabytes of RAM.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642990#comment-14642990 ]
Shalin Shekhar Mangar commented on SOLR-7191:
bq. Another user on the mailing list brought up another scaling problem – large numbers of collections using a large configuration. A 6MB config across 1000 collections requires several gigabytes of RAM.
Yup, that's why I opened SOLR-7282
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372155#comment-14372155 ]
Shalin Shekhar Mangar commented on SOLR-7191:
That's awesome, Damien. I'll start next week on getting these improvements into Solr. I'm going to create some sub-tasks for individual changes. You can help by providing patches that apply on trunk.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372212#comment-14372212 ]
Shalin Shekhar Mangar commented on SOLR-7191:
bq. The biggest problem I noticed is that any little change on the cluster (even creating a new collection) seems to cause a flood of ZkStateReader Updating data for X to ver NN messages. Each time it must update every single collection it has ... when that happens, it doesn't take an extreme amount of time, but I've noticed that it does it repeatedly, especially on node startup.
[~elyograg] -- Sorry for not responding earlier. Yes, that is how it works right now. We didn't optimize this case because collections are created/deleted infrequently. However, this might be a problem when users have collections with stateFormat=1 and 2.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368557#comment-14368557 ]
Damien Kamerman commented on SOLR-7191:
Shalin, Another change I made was to cache ConfigSetService.createIndexSchema() in cloud mode. BTW, have tested OK up to 24K cores.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14362511#comment-14362511 ]
Shalin Shekhar Mangar commented on SOLR-7191:
This is a very broad issue so there are likely to be multiple problems and their solutions. It's probably best to start splitting out individual changes into their own sub-tasks so that each can be reviewed and committed individually.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14358685#comment-14358685 ]
Shawn Heisey commented on SOLR-7191:
[~dk]: The first thing I thought when I saw that you were trying 10K cores was that you would run out of threads unless you change the servlet container config. There is another limit looming after that ... the number of processes that you can create. A Linux/Unix system uses a 16-bit identifier for process IDs, so the absolute upper limit of processes (including all OS-related processes) is 65535. On Linux (and likely other Unix/Unix-like systems), threads take up a PID, although they are not visible to programs like top or ps without specific options. I have no idea what the situation is on Windows.
On your patch: The first patch section removes a null check. This is never a good idea, because the fact that a null check exists tends to mean that the object identifier has the potential to be null, and presumably the first result on the ternary operator will fail (NullPointerException) somehow if the checked object actually is null.
On the last patch section: Imposing a limit in the code without giving the user the option of configuring that limit will eventually cause problems for somebody. Also, someone who is really familiar with how the ZkContainer code works will need to let us know if reducing the number of threads might have unintended consequences.
On LotsOfCores: SolrCloud brings a lot of complications to the situation, and when Erick did his work on that, he told all of us that trying to use transient cores in conjunction with SolrCloud would likely not work correctly. I think that the goal is to eventually make the two features coexist, but a lot of thought and work needs to happen.
General observation: A patch like this is not likely to be backported to the 4.10 branch. That branch is in maintenance mode, so only trivial fixes or patches for major bugs will be committed, and new releases from the maintenance mode branch are not common.
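Two of the review points above, keeping the null check and letting the user configure the thread limit, can be illustrated together. The sketch below is hypothetical and is not taken from the patch or from Solr: the class name, the property name example.coreZkRegisterThreads, and the fallback rules are assumptions. The default of 24 simply mirrors the value Damien mentions in his comment.

```java
// Hypothetical helper, not Solr code: resolves a thread limit from an optional setting.
public class RegisterThreadLimit {

    // Default mirrors the 24 threads mentioned in the thread; the fallback policy is an assumption.
    static final int DEFAULT_LIMIT = 24;

    /** Returns the configured limit, or DEFAULT_LIMIT when the value is null, blank, non-numeric or not positive. */
    static int resolve(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return DEFAULT_LIMIT; // the null check stays: a missing setting must not throw
        }
        try {
            int limit = Integer.parseInt(configured.trim());
            return limit > 0 ? limit : DEFAULT_LIMIT;
        } catch (NumberFormatException e) {
            return DEFAULT_LIMIT;
        }
    }

    public static void main(String[] args) {
        // "example.coreZkRegisterThreads" is a made-up property name used only for this sketch.
        System.out.println(resolve(System.getProperty("example.coreZkRegisterThreads")));
    }
}
```

Falling back silently is only one possible policy; rejecting an invalid value at startup would be another reasonable choice.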
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14358149#comment-14358149 ]
Damien Kamerman commented on SOLR-7191:
With 10K cores I was still seeing many cores stuck in recovery. I reduced the CoreContainer.load() coreLoadExecutor to 24 threads (cfg.coreLoadThreads), down from max int; I guess I'm now assuming replicas to be on other nodes. I reduced ZkContainer.coreZkRegister to 24 threads (from max int) and added a sort to SolrCore.getCores(). The sort ensures replicas are available.
I tested with solr 4.10.4; 2 x nodes; 5K collections; 10K cores. All green in 19min. Please review patch.
Would still like to see SOLR-6399 and a SolrCloud LotsOfCores as per Erick.
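The approach described above, a bounded pool plus a fixed registration order, might look roughly like the sketch below. It is not the attached patch: OrderedBoundedRegistration and its register callback are hypothetical, and sorting by plain core name is an assumption, since the comment does not say which key the patch sorts on.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch, not Solr code: registers cores in sorted order on a bounded pool.
public class OrderedBoundedRegistration {

    /** Submits one registration task per core, in sorted name order, to at most `threads` workers. */
    static void registerAll(List<String> coreNames, int threads, Consumer<String> register) {
        List<String> ordered = new ArrayList<>(coreNames);
        Collections.sort(ordered); // assumption: natural String order stands in for the patch's sort

        // A fixed pool caps thread creation, unlike a pool whose maximum size is Integer.MAX_VALUE.
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (String core : ordered) {
                pool.execute(() -> register.accept(core));
            }
        } finally {
            pool.shutdown(); // stop accepting tasks; queued ones still run
        }
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("registration did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for registration", e);
        }
    }
}
```

With more than one worker the tasks start in sorted order but may complete out of order, so the sort bounds when each registration begins rather than strictly serializing them.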
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354875#comment-14354875 ]
Shawn Heisey commented on SOLR-7191:
Now that I look closer at that log snippet, it appears that was a state update for node2, so I'm not sure exactly how long it took for node1 to become stable ... but since I started node2 at 07:01 UTC, it appears that it was nearly half an hour for full cluster stability, which is unacceptable in a production situation. There are no docs in any of the indexes, and I did not think to try a query or an update to see whether Solr was actually functional. That will be included on any test repeat.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354868#comment-14354868 ]

Shawn Heisey commented on SOLR-7191:

The branch_5x code is a lot more stable than 5.0, but full recovery after a cluster restart is very slow, and as Damien noted, creation of new collections is also very slow. On my dev server, by the time it had reached 4000 collections, each new collection was taking about 20 seconds to create.

The biggest problem I noticed is that any little change on the cluster (even creating a new collection) seems to cause a flood of "ZkStateReader Updating data for X to ver NN" messages. Each time it must update every single collection it has ... when that happens, it doesn't take an extreme amount of time, but I've noticed that it does it repeatedly, especially on node startup.

I have some logs from a cluster restart on a 2 node cluster, but they are far too large to attach here, even compressed. I have loaded the zip onto my dropbox account.

https://www.dropbox.com/s/28mu7asdvbgwkqt/logs-restart1-branch_5x.zip?dl=0

These logs are also far too large to load into Notepad or Notepad++ on Windows, but GNU's less program can deal with them just fine.

To create these logs, I shut down the cluster, deleted all logfiles, and then restarted each node. Looking only at the log for node1, node startup began at 06:54 UTC. At the following point, half an hour later, it appears that the node was fully operational - every collection shard had become active:

{noformat}
INFO - 2015-03-06 07:27:21.586; org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=2 message={ operation:state, core_node_name:core_node2, numShards:2, shard:shard2, roles:null, state:active, core:mycoll3892_shard2_replica1, collection:mycoll3892, node_name:10.100.1.39:7574_solr, base_url:http://10.100.1.39:7574/solr}
{noformat}

At 08:38 UTC, just before I went to bed, I sent a request to create a new collection, named lotsashards. The http request for collection creation timed out three minutes later, at 08:41, but the request was in the overseer queue, so eventually it did happen. It took until 15:29 for that collection creation to begin - nearly seven hours after I started it - because Solr was busy handling a very large number of those ZkStateReader updates. It appears that it took another ten minutes for the create to finish, and there are other messages related to lotsashards right up through the end of the log at 16:54 UTC.
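For anyone repeating this measurement, the time of the last "active" state update can be pulled out of the log mechanically. What follows is only a rough sketch: it assumes log lines shaped like the snippet above (level, timestamp, logger and message, with the logger set off by semicolons), the function name is made up, and the first sample line is invented purely for illustration.

```python
import re
from datetime import datetime

# Assumed line shape: "INFO - 2015-03-06 07:27:21.586; <logger>; <message>"
LINE_RE = re.compile(
    r"^\w+\s+-\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}); ([\w.$]+); (.*)$"
)
# Quotes are optional because the archived snippet shows the message without them.
ACTIVE_RE = re.compile(r'"?state"?:"?active"?')


def last_active_update(lines):
    """Return the timestamp of the last ReplicaMutator 'state:active' line, or None."""
    last = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not a regular log line
        stamp, logger, message = match.groups()
        if logger.endswith("ReplicaMutator") and ACTIVE_RE.search(message):
            last = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
    return last


sample = [
    # Invented line, only here so the sample has a non-matching entry.
    "INFO - 2015-03-06 06:54:00.000; org.apache.solr.core.SolrCore; made-up startup line",
    # The line quoted in the comment above.
    "INFO - 2015-03-06 07:27:21.586; org.apache.solr.cloud.overseer.ReplicaMutator; "
    "Update state numShards=2 message={ operation:state, core_node_name:core_node2, "
    "numShards:2, shard:shard2, roles:null, state:active, "
    "core:mycoll3892_shard2_replica1, collection:mycoll3892, "
    "node_name:10.100.1.39:7574_solr, base_url:http://10.100.1.39:7574/solr}",
]

started = datetime(2015, 3, 6, 6, 54)  # node1 startup time given in the comment
elapsed = last_active_update(sample) - started
print(elapsed)  # 0:33:21.586000
```

On a real log the list would be replaced by an open file handle, which is iterated line by line and so avoids loading the whole file into memory.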
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354965#comment-14354965 ]

Shalin Shekhar Mangar commented on SOLR-7191:

bq. I tested 4,000 cores on branch_5x and found better results.

Thanks, that is good to know.

bq. BTW: collection creation slows down the more collections you have in the cloud. Starts with qtimes of ~3s; ending with ~6s. Solr 4.x was always steady at ~3s.

Yes, that is expected with the current implementation. Each new collection touches /clusterstate.json as a way to inform all nodes about the existence of a new collection (this is required for routing requests; this also happens on collection deletion). This invokes a reload of state information for all collections that have a core on that Solr node. As the number of collections increases (and because you have a lot of cores on the same node), the state refresh takes longer. The Collection Create API waits for the new collection to be visible in the cluster state and therefore takes longer as the number of collections increases.

The current implementation was made under the assumption that the system will have a large number of collections but a node will only have a few replicas. This isn't true in this case and therefore we're running into these bottlenecks.
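To put a rough number on the scaling described above: if every create makes a node reload state for each collection it hosts, then building up N collections one at a time costs on the order of N^2/2 reloads per node. The following is a back-of-the-envelope sketch, not a model of the real code path. It assumes (as in the test setups in this issue) that every collection has a core on the node, it counts reloads rather than wall-clock time, and the function names are illustrative.

```python
def reloads_for_create(existing_on_node: int) -> int:
    """Per-collection state reloads on one node when one more collection is created.

    Assumption: the node reloads every collection it already hosts, plus the
    new one, each time /clusterstate.json is touched.
    """
    return existing_on_node + 1


def total_reloads(n_collections: int) -> int:
    """Reloads accumulated on one node while creating n_collections one by one."""
    return sum(reloads_for_create(k) for k in range(n_collections))


print(reloads_for_create(3999))  # 4000 reloads for the 4000th create alone
print(total_reloads(4000))       # 8002000 reloads to build up all 4000
```

Even if a single reload is cheap, the per-create cost grows linearly and the cumulative cost quadratically, which is consistent with create times creeping up as the cluster fills.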
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354379#comment-14354379 ]

Damien Kamerman commented on SOLR-7191:

I tested 4,000 cores on branch_5x and found better results.

My setup:
3 nodes (32GB RAM each ; jdk1.8.0_40) running on a single server (256GB RAM).
2,000 collections (1 x shard ; 2 x replica)
1 x Zookeeper 3.4.6

Full restart (stop all nodes; start all nodes 1min staggered).
Many cores on node1 are active, other cores are recovering. Lots of warnings 'org.apache.solr.update.PeerSync; no frame of reference to tell if we've missed updates' on node2 and node3. But it is slowly recovering.

BTW: collection creation slows down the more collections you have in the cloud. Starts with qtimes of ~3s; ending with ~6s. Solr 4.x was always steady at ~3s.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349622#comment-14349622 ]

Shawn Heisey commented on SOLR-7191:

On the plus side ... in branch_5x, drawing the cloud graph is very fast, even if you increase the number of items per page to the point where it shows all 4001 collections at once.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349849#comment-14349849 ]

Shawn Heisey commented on SOLR-7191:

Followup on the attached log, because I realized that I didn't quite understand what I was looking at. The first line of the log is where the request for a collection create is logged ... but all the activity for that CREATE call is *before* that line, so you can't see it. All of the rest of the log covers activity for the request that is logged as the last line.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349163#comment-14349163 ]

Shalin Shekhar Mangar commented on SOLR-7191:

Another issue that will help us achieve this is SOLR-6760 but that isn't implemented yet.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348836#comment-14348836 ]

Shawn Heisey commented on SOLR-7191:

As I said above, I started over with a fresh cloud example because of the problems I was having. Late last night, the creation of 4000 collections completed.

This morning, I went to the admin UI page which I still had open, and did a shift-refresh on the main page. Somehow, pulling the list of cores for the dropdown (there are 4001 of them) sent Solr into a tailspin where all the shards were marked down (orange) on the cloud graph, then they started coming back up. Half an hour after this event started, only about 150 out of the 4000 collections had fully recovered. That projects full recovery to take half the day ... and this is just from basically opening the admin UI to the dashboard!

For situations where large numbers of cluster bits must change state, could there be an improvement made in the overseer queue messages to change the state of multiple bits? If we could have a single message change the state of up to 1000 at once (and perhaps have a special code to indicate all of the replicas on a node), that would go a long way towards keeping the queue tidy when there are major events. (I wrote this paragraph before seeing Shalin's note about optimizing the node shutdown case)

[~shalinmangar], I will try again with branch_5x.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347689#comment-14347689 ]

Shawn Heisey commented on SOLR-7191:

At this time I do not know what the actual issues are. I'm in the middle of another round of restarts right now because it became too unstable to add the last few hundred collections. Once I have a complete log from that startup, I can attach it here. I will add some information about how I have my test machine set up, and I can answer questions as needed.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347813#comment-14347813 ]

Shawn Heisey commented on SOLR-7191:

I have thought about this in terms of deciding whether we should even bother with it at all... but I figured it wouldn't hurt to open the issue so we can think about it, figure out whether any performance bottlenecks can be fixed, and at the very least come up with some best practice information for the docs.

One of the big problems that I had was that the overseer queue was getting REALLY filled up. I gave up and started over when I realized (from looking at the logs) that there were over eight hundred thousand entries in the queue, all from attempts to restart Solr repeatedly. That huge queue is what finally pushed the znode size too high for jute.maxbuffer. Zookeeper seems to be the real bottleneck here, which is not all that surprising.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347834#comment-14347834 ]

Shawn Heisey commented on SOLR-7191:

Regarding jute.maxbuffer: The new stateFormat ensures that the clusterstate won't break the default 1MB limit.

The giant queue I encountered was about 850000 entries, and resulted in a packet length of a little over 14 megabytes. If I divide 850000 by 14, I know that I can have about 60000 overseer queue entries in one znode before jute.maxbuffer needs to be increased.

There are a couple of possibilities for managing really large overseer queues within the default buffer size. One is to throttle the creation of new overseer entries when the number of existing entries exceeds a certain threshold ... I'm thinking 32768 ... so that hopefully the overseer can catch up. Another is to change the structure of the queue so that it creates new nodes under /overseer/queue and then puts actual entries inside those nodes, limiting the number of queue entries in each one to 32768. With 32768 nodes that each have 32768 entries, the queue could easily reach one billion entries without breaking the buffer ... although a queue that size might take days or weeks to process.

A similar problem would exist with the /collections node ... so a limit of 32768 collections would also be prudent. I expect that users will run into other problems (number of processes being the one that comes to mind) long before they reach 32768 collections, though.
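The arithmetic behind those limits can be checked quickly. The sketch below takes the queue size as roughly 850,000 entries (the "over eight hundred thousand" reported in the earlier comment) and the packet as 14 MB; both are rounded figures from this thread, not measurements. The two-level path layout, including the `bucket-` and `qn-` name prefixes, is purely hypothetical and only illustrates the proposal.

```python
QUEUE_ENTRIES = 850_000   # rounded: "over eight hundred thousand" entries
PACKET_MEGABYTES = 14     # rounded: "a little over 14 megabytes"
BUCKET_LIMIT = 32_768     # proposed cap on children per znode

# Roughly how many queue entries fit in a 1 MB child listing.
entries_per_mb = QUEUE_ENTRIES // PACKET_MEGABYTES

# Capacity of a two-level queue: BUCKET_LIMIT buckets of BUCKET_LIMIT entries.
two_level_capacity = BUCKET_LIMIT * BUCKET_LIMIT


def bucket_path(seq: int) -> str:
    """Hypothetical znode path for queue entry number seq in a two-level layout."""
    if not 0 <= seq < two_level_capacity:
        raise ValueError("sequence number outside two-level capacity")
    bucket, slot = divmod(seq, BUCKET_LIMIT)
    return f"/overseer/queue/bucket-{bucket:05d}/qn-{slot:05d}"


print(entries_per_mb)       # 60714
print(two_level_capacity)   # 1073741824
print(bucket_path(40_000))  # /overseer/queue/bucket-00001/qn-07232
```

A cap of 32768 per znode sits at a bit over half of the roughly 60000 entries estimated to fit in 1 MB, so it leaves some headroom if entry names turn out longer than in this test.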
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347746#comment-14347746 ]

Erick Erickson commented on SOLR-7191:

Just throwing this out there, but tangentially related is the whole Lots of cores thing. It's really an open question to me how much effort to put into supporting either a zillion collections (or even a zillion cores).

The LotsOfCores code was written without considering how to support it in SolrCloud. I'm quite sure it'd be interesting to support. I'm explicitly _not_ advocating whether it be supported in SolrCloud, it'd have to be thought out pretty carefully and my instinct is that it wouldn't be worth the effort.

FWIW
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348235#comment-14348235 ]

Shalin Shekhar Mangar commented on SOLR-7191:

Shawn, can you try with branch_5x instead? In particular, SOLR-6956 should help. There was unnecessary synchronization on the overseer node which caused OverseerCollectionProcessor and replicas on the overseer node to not see updated cluster state. Considering that you're running only 3(?) nodes, 1 of the nodes involved could be slowing down the entire cluster.
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348132#comment-14348132 ]

Damien Kamerman commented on SOLR-7191:

I'm more concerned with stability than performance. I've created up to 10,000 cores and the cloud is fine and well within memory limits. However, the cloud will never restart. Lots of warnings 'org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard'
[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections
[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348240#comment-14348240 ]

Shalin Shekhar Mangar commented on SOLR-7191:

We can optimize the node shutdown case. Instead of each core publishing a 'down' state individually, Solr can publish a 'down' state for the entire node and overseer can do the rest.
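The effect of that optimization on queue size can be illustrated with a toy model. This is not Solr code: the message fields loosely mirror the state message quoted earlier in the thread, the operation name "downnode" is made up for this example, and the "overseer" here is just a function over a dictionary.

```python
def per_core_messages(node_name, cores):
    """Behaviour as described: one 'down' state message per core on the node."""
    return [
        {"operation": "state", "state": "down", "core": core, "node_name": node_name}
        for core in cores
    ]


def node_level_message(node_name):
    """Proposed behaviour: one message for the node ('downnode' is hypothetical)."""
    return [{"operation": "downnode", "node_name": node_name}]


def apply_messages(cluster_state, messages):
    """Toy overseer: return a new {core: (node, state)} map after the messages."""
    new_state = dict(cluster_state)
    for msg in messages:
        if msg["operation"] == "state":
            node, _ = new_state[msg["core"]]
            new_state[msg["core"]] = (node, msg["state"])
        elif msg["operation"] == "downnode":
            # The overseer expands the single message to every core on that node.
            for core, (node, _) in cluster_state.items():
                if node == msg["node_name"]:
                    new_state[core] = (node, "down")
    return new_state


node = "10.100.1.39:7574_solr"
cores = [f"mycoll{i}_shard1_replica1" for i in range(4000)]
state = {core: (node, "active") for core in cores}

old_style = per_core_messages(node, cores)
new_style = node_level_message(node)
print(len(old_style), len(new_style))  # 4000 1
```

Both message lists leave the toy cluster state identical; the difference is 4000 queue entries versus one for a single node shutdown.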