Re: solr cloud does not start with many collections
Didier, I'm starting to look at SOLR-6399.

> after the core was unloaded, it was absent from the collection list, as
> if it never existed. On the other hand, re-issuing a CREATE call with the
> same collection restored the collection, along with its data

The collection is still in ZK though?

> upon restart Solr tried to reload the previously-unloaded collection.

Looks like CoreContainer.load() uses the CoreDescriptor.isTransient() and CoreDescriptor.isLoadOnStartup() properties on startup.

On 7 March 2015 at 13:10, didier deshommes wrote:
> It would be a huge step forward if one could have several hundreds of Solr
> collections, but only have a small portion of them opened/loaded at the
> same time. This is similar to ElasticSearch's close index api, listed here:
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-open-close.html
> I've opened an issue to implement the same in Solr here a few months ago:
> https://issues.apache.org/jira/browse/SOLR-6399

--
Damien Kamerman
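For context, those two flags already exist for standalone (non-cloud) core discovery as plain core.properties entries, with an optional cache cap in solr.xml; whether SolrCloud honours them on startup is exactly what SOLR-6399 is probing. A minimal sketch of the existing standalone knobs (core name and cache size are illustrative, not from the thread):

    # core.properties for a core that should not be loaded eagerly
    name=DD-0001_shard1_replica1
    loadOnStartup=false
    transient=true

    # solr.xml: cap on how many transient cores are kept loaded at once
    <int name="transientCacheSize">500</int>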
Re: solr cloud does not start with many collections
It would be a huge step forward if one could have several hundreds of Solr collections, but only have a small portion of them opened/loaded at the same time. This is similar to ElasticSearch's close index api, listed here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-open-close.html . I've opened an issue to implement the same in Solr here a few months ago: https://issues.apache.org/jira/browse/SOLR-6399 On Thu, Mar 5, 2015 at 4:42 PM, Damien Kamerman wrote: > I've tried a few variations, with 3 x ZK, 6 X nodes, solr 4.10.3, solr 5.0 > without any success and no real difference. There is a tipping point at > around 3,000-4,000 cores (varies depending on hardware) from where I can > restart the cloud OK within ~4min, to the cloud not working and > continuous 'conflicting > information about the leader of shard' warnings. > > On 5 March 2015 at 14:15, Shawn Heisey wrote: > > > On 3/4/2015 5:37 PM, Damien Kamerman wrote: > > > I'm running on Solaris x86, I have plenty of memory and no real limits > > > # plimit 15560 > > > 15560: /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G > > > -XX:MaxMetasp > > >resource current maximum > > > time(seconds) unlimited unlimited > > > file(blocks) unlimited unlimited > > > data(kbytes) unlimited unlimited > > > stack(kbytes) unlimited unlimited > > > coredump(blocks) unlimited unlimited > > > nofiles(descriptors) 65536 65536 > > > vmemory(kbytes) unlimited unlimited > > > > > > I've been testing with 3 nodes, and that seems OK up to around 3,000 > > cores > > > total. I'm thinking of testing with more nodes. > > > > I have opened an issue for the problems I encountered while recreating a > > config similar to yours, which I have been doing on Linux. > > > > https://issues.apache.org/jira/browse/SOLR-7191 > > > > It's possible that the only thing the issue will lead to is improvements > > in the documentation, but I'm hopeful that there will be code > > improvements too. > > > > Thanks, > > Shawn > > > > > > > -- > Damien Kamerman >
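For reference, the Elasticsearch calls being pointed to are a single open/close per index, roughly as below (host and index name are placeholders, per the linked docs):

    curl -XPOST 'http://localhost:9200/my_index/_close'
    curl -XPOST 'http://localhost:9200/my_index/_open'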
Re: solr cloud does not start with many collections
I've tried a few variations, with 3 x ZK, 6 X nodes, solr 4.10.3, solr 5.0 without any success and no real difference. There is a tipping point at around 3,000-4,000 cores (varies depending on hardware) from where I can restart the cloud OK within ~4min, to the cloud not working and continuous 'conflicting information about the leader of shard' warnings. On 5 March 2015 at 14:15, Shawn Heisey wrote: > On 3/4/2015 5:37 PM, Damien Kamerman wrote: > > I'm running on Solaris x86, I have plenty of memory and no real limits > > # plimit 15560 > > 15560: /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G > > -XX:MaxMetasp > >resource current maximum > > time(seconds) unlimited unlimited > > file(blocks) unlimited unlimited > > data(kbytes) unlimited unlimited > > stack(kbytes) unlimited unlimited > > coredump(blocks) unlimited unlimited > > nofiles(descriptors) 65536 65536 > > vmemory(kbytes) unlimited unlimited > > > > I've been testing with 3 nodes, and that seems OK up to around 3,000 > cores > > total. I'm thinking of testing with more nodes. > > I have opened an issue for the problems I encountered while recreating a > config similar to yours, which I have been doing on Linux. > > https://issues.apache.org/jira/browse/SOLR-7191 > > It's possible that the only thing the issue will lead to is improvements > in the documentation, but I'm hopeful that there will be code > improvements too. > > Thanks, > Shawn > > -- Damien Kamerman
Re: solr cloud does not start with many collections
On 3/4/2015 5:37 PM, Damien Kamerman wrote: > I'm running on Solaris x86, I have plenty of memory and no real limits > # plimit 15560 > 15560: /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G > -XX:MaxMetasp >resource current maximum > time(seconds) unlimited unlimited > file(blocks) unlimited unlimited > data(kbytes) unlimited unlimited > stack(kbytes) unlimited unlimited > coredump(blocks) unlimited unlimited > nofiles(descriptors) 65536 65536 > vmemory(kbytes) unlimited unlimited > > I've been testing with 3 nodes, and that seems OK up to around 3,000 cores > total. I'm thinking of testing with more nodes. I have opened an issue for the problems I encountered while recreating a config similar to yours, which I have been doing on Linux. https://issues.apache.org/jira/browse/SOLR-7191 It's possible that the only thing the issue will lead to is improvements in the documentation, but I'm hopeful that there will be code improvements too. Thanks, Shawn
Re: solr cloud does not start with many collections
I'm running on Solaris x86, I have plenty of memory and no real limits:

# plimit 15560
15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G -XX:MaxMetasp
   resource              current         maximum
  time(seconds)          unlimited       unlimited
  file(blocks)           unlimited       unlimited
  data(kbytes)           unlimited       unlimited
  stack(kbytes)          unlimited       unlimited
  coredump(blocks)       unlimited       unlimited
  nofiles(descriptors)   65536           65536
  vmemory(kbytes)        unlimited       unlimited

I've been testing with 3 nodes, and that seems OK up to around 3,000 cores total. I'm thinking of testing with more nodes.

On 5 March 2015 at 05:28, Shawn Heisey wrote:
> On 3/4/2015 2:09 AM, Shawn Heisey wrote:
> > I've come to one major conclusion about this whole thing, even before
> > I reach the magic number of 4000 collections. Thousands of collections
> > is not at all practical with SolrCloud currently.
>
> I've now encountered a new problem. I may have been hasty in declaring
> that an increase of jute.maxbuffer is not required. There are now 3715
> collections, and I've seen a zookeeper exception that may indicate an
> increase actually is required. I have added that parameter to the
> startup and when I have some time to look deeper, I will see whether
> that helps.
>
> Before 5.0, the maxbuffer would have been exceeded by only a few hundred
> collections ... so this is definitely progress.
>
> Thanks,
> Shawn

--
Damien Kamerman
Re: solr cloud does not start with many collections
On 3/4/2015 2:09 AM, Shawn Heisey wrote: > I've come to one major conclusion about this whole thing, even before > I reach the magic number of 4000 collections. Thousands of collections > is not at all practical with SolrCloud currently. I've now encountered a new problem. I may have been hasty in declaring that an increase of jute.maxbuffer is not required. There are now 3715 collections, and I've seen a zookeeper exception that may indicate an increase actually is required. I have added that parameter to the startup and when I have some time to look deeper, I will see whether that helps. Before 5.0, the maxbuffer would have been exceeded by only a few hundred collections ... so this is definitely progress. Thanks, Shawn
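For anyone reproducing this, the parameter has to be raised on both the Solr and the ZooKeeper JVMs to have any effect. A minimal sketch, assuming the stock solr.in.sh and zkServer.sh start scripts and using the value Damien reported earlier in the thread:

    # Solr side: bin/solr.in.sh
    SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=67108864"

    # ZooKeeper side: conf/java.env (picked up by zkServer.sh)
    JVMFLAGS="$JVMFLAGS -Djute.maxbuffer=67108864"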
Re: solr cloud does not start with many collections
On 3/4/2015 1:02 AM, Shawn Heisey wrote:
> Even now, nearly three hours after startup, the Solr log is still
> spitting out thousands of lines that look like this, so I don't think I
> can call it stable:
>
> INFO - 2015-03-04 07:35:51.166; org.apache.solr.common.cloud.ZkStateReader; Updating data for mycoll1515 to ver 60
>
> I'm going to try bringing up the other Solr instance now, and if that
> stabilizes with all shards in the green, I will try to continue adding
> collections.

I've come to one major conclusion about this whole thing, even before I reach the magic number of 4000 collections. Thousands of collections is not at all practical with SolrCloud currently.

Some additional conclusions about this setup:

* Stopping and restarting the entire cluster will quite literally take hours for full stability. A rolling restart *might* go faster, but honestly I would not count on that.

* An external zookeeper ensemble is absolutely critical. Zookeeper stability is extremely important.

* A lot of heap memory is required, even if the indexes are completely empty and there is no query/index activity. Active indexes with data are going to push that even higher, and will very likely slow down recovery on server restart.

* Operating system limits for the max number of open files and max number of processes allowed will need to be reconfigured - these are settings that are NOT managed by Solr or Jetty. Configuration may vary widely between different operating systems.

* Thousands of collections *might* work OK if there are enough servers so that each one doesn't have more than a couple hundred cores. This would need to be tested, and I don't have the available hardware.

I'm not sure that the OP's problem can actually be called a bug ... it's more of a performance limitation. We should still file an issue and treat it like a bug, though.

Thanks,
Shawn
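On the open-files / max-processes point above, a minimal Linux sketch (the solr user name and the numbers are illustrative; Solaris installs like Damien's use resource controls/projects instead, and the new limits only apply to fresh login sessions):

    # /etc/security/limits.conf -- for whichever user runs Solr
    solr  soft  nofile  65536
    solr  hard  nofile  65536
    solr  soft  nproc   65536
    solr  hard  nproc   65536

    # quick check from that user's shell
    ulimit -n   # open files
    ulimit -u   # max user processes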
Re: solr cloud does not start with many collections
On 3/3/2015 9:22 PM, Damien Kamerman wrote: > I've done a similar thing to create the collections. You're going to need > more memory I think. > > OK, so maxThreads limit on jetty could be causing a distributed dead-lock? I don't know what the exact problems would be if maxThreads is reached. It's probably unpredictable. With 2674 collections added, 5GB wasn't enough heap. I started getting a ton of exceptions during collection creation. I had to shut down both Solr instances. When I brought up the first instance with a 7GB heap (the one with the embedded zk), it took exactly half an hour for jetty to start listening on port 8983, and about two hours total for it to stabilize to the point where everything for that node was green on the cloud graph. Even now, nearly three hours after startup, the Solr log is still spitting out thousands of lines that look like this, so I don't think I can call it stable: INFO - 2015-03-04 07:35:51.166; org.apache.solr.common.cloud.ZkStateReader; Updating data for mycoll1515 to ver 60 I'm going to try bringing up the other Solr instance now, and if that stabilizes with all shards in the green, I will try to continue adding collections. Side note: I have been able to confirm with these tests that version 5.0 no longer requires increasing jute.maxbuffer to run many collections. I'm still running with the default value and zookeeper has had no problems handling all the data. Thanks, Shawn
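For anyone following along with the same "-e cloud" layout, restarting a node with a bigger heap looks roughly like this (a sketch using the Solr 5.x bin/solr flags; the paths and ports come from the cloud example defaults and may differ slightly on your install):

    bin/solr stop -p 8983
    bin/solr start -c -p 8983 -s example/cloud/node1/solr -m 7g
    bin/solr start -c -p 7574 -s example/cloud/node2/solr -m 7g -z localhost:9983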
Re: solr cloud does not start with many collections
I've done a similar thing to create the collections. You're going to need more memory I think. OK, so maxThreads limit on jetty could be causing a distributed dead-lock? On 4 March 2015 at 13:18, Shawn Heisey wrote: > On 3/2/2015 12:54 AM, Damien Kamerman wrote: > > I still see the same cloud startup issue with Solr 5.0.0. I created 4,000 > > collections from scratch and then attempted to stop/start the cloud. > > I have been trying to duplicate your setup using the "-e cloud" example > included in the Solr 5.0 download and accepting all the defaults. This > sets up two Solr instances on one machine, one of which runs an embedded > zookeeper. > > I have been running into a LOT of issues just trying to get so many > collections created, to say nothing about restart problems. > > The first problem I ran into was heap size. The example starts each of > the Solr instances with a 512MB heap, which is WAY too small. It > allowed me to create 274 collections, in addition to the gettingstarted > collection that the example started with. One of the Solr instances > simply crashed. No OutOfMemoryException or anything else in the log ... > it just died. > > I bumped the heap on each Solr instance to 4GB. The next problem I ran > into was the operating system limit on the number of processes ... and I > had already bumped that up beyond the usual 1024 default, to 4096. Solr > was not able to create any more threads, because my user was not able to > fork any more processes. I got over 700 collections created before that > became a problem. My max open files had also been increased already -- > this is another place where a stock system will run into trouble > creating a lot of collections. > > I fixed that, and the next problem I ran into was total RAM on the > machine ... it turns out that with two Solr processes each using 4GB, I > was dipped 3GB deep into swap. This is odd, because I have 12GB of RAM > on that machine and it's not doing very much besides this SolrCloud > test. Swapping means that performance was completely unacceptable and > it would probably never finish. > > So ... I had to find a machine with more memory. I've got a dev server > with 32GB. I fired up the two SolrCloud processes on it with 5GB heap > each, with 32768 processes allowed. I am in the process of building > 4000 collections (numShards=2, replicationFactor=1), and so far, it is > working OK. I have almost 2700 collections now. > > If I can ever get it to actually build 4000 collections, then I can > attempt restarting the second Solr instance and see what happens. I > think I might hit another roadblock in the form of the > 1 maxThreads limit on Jetty. Running this all on one machine might > not be possible, but I'm giving it a try. > > Here's the script I am using to create all those collections: > > #!/bin/sh > > for i in `seq -f "%04.0f" 0 3999` > do > echo $i > coll=mycoll${i} > URL="http://localhost:8983/solr/admin/collections"; > URL="${URL}?action=CREATE&name=${coll}&numShards=2&replicationFactor=1" > URL="${URL}&collection.configName=gettingstarted" > curl "$URL" > done > > Thanks, > Shawn > -- Damien Kamerman
Re: solr cloud does not start with many collections
On 3/2/2015 12:54 AM, Damien Kamerman wrote:
> I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
> collections from scratch and then attempted to stop/start the cloud.

I have been trying to duplicate your setup using the "-e cloud" example included in the Solr 5.0 download and accepting all the defaults. This sets up two Solr instances on one machine, one of which runs an embedded zookeeper.

I have been running into a LOT of issues just trying to get so many collections created, to say nothing about restart problems.

The first problem I ran into was heap size. The example starts each of the Solr instances with a 512MB heap, which is WAY too small. It allowed me to create 274 collections, in addition to the gettingstarted collection that the example started with. One of the Solr instances simply crashed. No OutOfMemoryException or anything else in the log ... it just died.

I bumped the heap on each Solr instance to 4GB. The next problem I ran into was the operating system limit on the number of processes ... and I had already bumped that up beyond the usual 1024 default, to 4096. Solr was not able to create any more threads, because my user was not able to fork any more processes. I got over 700 collections created before that became a problem. My max open files had also been increased already -- this is another place where a stock system will run into trouble creating a lot of collections.

I fixed that, and the next problem I ran into was total RAM on the machine ... it turns out that with two Solr processes each using 4GB, I dipped 3GB deep into swap. This is odd, because I have 12GB of RAM on that machine and it's not doing very much besides this SolrCloud test. Swapping means that performance was completely unacceptable and it would probably never finish.

So ... I had to find a machine with more memory. I've got a dev server with 32GB. I fired up the two SolrCloud processes on it with 5GB heap each, with 32768 processes allowed. I am in the process of building 4000 collections (numShards=2, replicationFactor=1), and so far, it is working OK. I have almost 2700 collections now.

If I can ever get it to actually build 4000 collections, then I can attempt restarting the second Solr instance and see what happens. I think I might hit another roadblock in the form of the 10,000 maxThreads limit on Jetty. Running this all on one machine might not be possible, but I'm giving it a try.

Here's the script I am using to create all those collections:

#!/bin/sh

for i in `seq -f "%04.0f" 0 3999`
do
  echo $i
  coll=mycoll${i}
  URL="http://localhost:8983/solr/admin/collections"
  URL="${URL}?action=CREATE&name=${coll}&numShards=2&replicationFactor=1"
  URL="${URL}&collection.configName=gettingstarted"
  curl "$URL"
done

Thanks,
Shawn
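On the Jetty ceiling mentioned above: in the Solr 5.x download that limit is set in the bundled Jetty config rather than in Solr itself. From memory it looks roughly like the fragment below in server/etc/jetty.xml, so it can be raised there if thread exhaustion turns out to be the roadblock (treat the exact element names and value as approximate and check the shipped file):

    <New id="threadPool" class="org.eclipse.jetty.util.thread.QueuedThreadPool">
      <Set name="minThreads">10</Set>
      <Set name="maxThreads">10000</Set>
    </New>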
Re: solr cloud does not start with many collections
After one minute from startup I sometimes see the 'org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes published as DOWN in our cluster state.' And I see the 'Still seeing conflicting information about the leader of shard' after about 5 minutes. Thanks Shawn, I will create an issue. On 4 March 2015 at 01:10, Shawn Heisey wrote: > On 3/3/2015 6:55 AM, Shawn Heisey wrote: > > With a longer zkClientTimeout, does the failure happen on a later > > collection? I had hoped that it would solve the problem, but I'm > > curious about whether it was able to load more collections before it > > finally died, or whether it made no difference... and whether the > > message now indicates 40 seconds or if it still says 30. > > I have found the code that produces the message, and the wait for this > particular section is hardcoded to 30 seconds. That means the timeout > won't affect it. > > If you move the Solr log so it creates a new one from startup, how long > does it take after startup begins before you see the failure that > indicates the conflicting leader information hasn't resolved? > > This most likely is a bug ... our SolrCloud experts will need to > investigate to find it, so we need as much information as you can provide. > > Thanks, > Shawn > > -- Damien Kamerman
Re: solr cloud does not start with many collections
On 3/3/2015 6:55 AM, Shawn Heisey wrote: > With a longer zkClientTimeout, does the failure happen on a later > collection? I had hoped that it would solve the problem, but I'm > curious about whether it was able to load more collections before it > finally died, or whether it made no difference... and whether the > message now indicates 40 seconds or if it still says 30. I have found the code that produces the message, and the wait for this particular section is hardcoded to 30 seconds. That means the timeout won't affect it. If you move the Solr log so it creates a new one from startup, how long does it take after startup begins before you see the failure that indicates the conflicting leader information hasn't resolved? This most likely is a bug ... our SolrCloud experts will need to investigate to find it, so we need as much information as you can provide. Thanks, Shawn
Re: solr cloud does not start with many collections
On 3/3/2015 12:42 AM, Damien Kamerman wrote: > Still no luck starting solr with 40s zkClientTimeout. I'm not seeing any > expired sessions... > > There must be a way to start solr with many collections. It runs fine.. > until a restart is required. With a longer zkClientTimeout, does the failure happen on a later collection? I had hoped that it would solve the problem, but I'm curious about whether it was able to load more collections before it finally died, or whether it made no difference... and whether the message now indicates 40 seconds or if it still says 30. Either way, I think we have reached the point where filing an issue in Jira is appropriate. You have the best information, so you should create the issue. The main description should fully describe the problem, but not have exhaustive detail. You can include lots of supporting detail in attachments and in the comments. https://issues.apache.org/jira/browse/SOLR Thanks, Shawn
Re: solr cloud does not start with many collections
Still no luck starting solr with 40s zkClientTimeout. I'm not seeing any expired sessions... There must be a way to start solr with many collections. It runs fine.. until a restart is required. On 3 March 2015 at 03:33, Shawn Heisey wrote: > On 3/2/2015 12:54 AM, Damien Kamerman wrote: > > I still see the same cloud startup issue with Solr 5.0.0. I created 4,000 > > collections from scratch and then attempted to stop/start the cloud. > > > > node1: > > WARN - 2015-03-02 18:09:02.371; > > org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog > > WARN - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController; > Timed > > out waiting to see all nodes published as DOWN in our cluster state. > > WARN - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController; > Still > > seeing conflicting information about the leader of shard shard1 for > > collection DD-3219 after 30 seconds; our state says > > http://host:8002/solr/DD-3219_shard1_replica1/, but ZooKeeper says > > http://host:8000/solr/DD-3219_shard1_replica2/ > > > > node2: > > WARN - 2015-03-02 18:09:01.871; > > org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog > > WARN - 2015-03-02 18:17:04.458; > > org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered, > > but Solr cannot talk to ZK > > stop/start > > WARN - 2015-03-02 18:53:12.725; > > org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog > > WARN - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController; > Still > > seeing conflicting information about the leader of shard shard1 for > > collection DD-3581 after 30 seconds; our state says > > http://host:8001/solr/DD-3581_shard1_replica2/, but ZooKeeper says > > http://host:8002/solr/DD-3581_shard1_replica1/ > > > > node3: > > WARN - 2015-03-02 18:09:03.022; > > org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog > > WARN - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController; > Timed > > out waiting to see all nodes published as DOWN in our cluster state. > > WARN - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController; > Still > > seeing conflicting information about the leader of shard shard1 for > > collection DD-2707 after 30 seconds; our state says > > http://host:8002/solr/DD-2707_shard1_replica2/, but ZooKeeper says > > http://host:8000/solr/DD-2707_shard1_replica1/ > > I'm sorry to hear that 5.0 didn't fix the problem. I really hoped that > it would. > > There is one other thing I'd like to try before you file a bug -- > increasing zkClientTimeout to 40 seconds, to see whether it allows > changes the point at which it fails (or allows it to succeed). With the > default tickTime (2 seconds), the maximum time you can set > zkClientTimeout to is 40 seconds ... which in normal circumstances is a > VERY long time. In your situation, at least with the code in its > current state, 30 seconds (I'm pretty sure this is the default in 5.0) > may simply not be enough. > > > https://cwiki.apache.org/confluence/display/solr/Parameter+Reference#ParameterReference-SolrCloudInstanceZooKeeperParameters > > I think filing a bug, even if 40 seconds allows this to succeed, is a > good idea ... but you might want to wait for some of the cloud experts > to look at your logs to see if they have anything to add. > > Thanks, > Shawn > > -- Damien Kamerman
Re: solr cloud does not start with many collections
On 3/2/2015 12:54 AM, Damien Kamerman wrote:
> I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
> collections from scratch and then attempted to stop/start the cloud.
>
> node1:
> WARN - 2015-03-02 18:09:02.371; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> WARN - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes published as DOWN in our cluster state.
> WARN - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DD-3219 after 30 seconds; our state says http://host:8002/solr/DD-3219_shard1_replica1/, but ZooKeeper says http://host:8000/solr/DD-3219_shard1_replica2/
>
> node2:
> WARN - 2015-03-02 18:09:01.871; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> WARN - 2015-03-02 18:17:04.458; org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered, but Solr cannot talk to ZK
> stop/start
> WARN - 2015-03-02 18:53:12.725; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> WARN - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DD-3581 after 30 seconds; our state says http://host:8001/solr/DD-3581_shard1_replica2/, but ZooKeeper says http://host:8002/solr/DD-3581_shard1_replica1/
>
> node3:
> WARN - 2015-03-02 18:09:03.022; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> WARN - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes published as DOWN in our cluster state.
> WARN - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DD-2707 after 30 seconds; our state says http://host:8002/solr/DD-2707_shard1_replica2/, but ZooKeeper says http://host:8000/solr/DD-2707_shard1_replica1/

I'm sorry to hear that 5.0 didn't fix the problem. I really hoped that it would.

There is one other thing I'd like to try before you file a bug -- increasing zkClientTimeout to 40 seconds, to see whether it changes the point at which it fails (or allows it to succeed). With the default tickTime (2 seconds), the maximum time you can set zkClientTimeout to is 40 seconds ... which in normal circumstances is a VERY long time. In your situation, at least with the code in its current state, 30 seconds (I'm pretty sure this is the default in 5.0) may simply not be enough.

https://cwiki.apache.org/confluence/display/solr/Parameter+Reference#ParameterReference-SolrCloudInstanceZooKeeperParameters

I think filing a bug, even if 40 seconds allows this to succeed, is a good idea ... but you might want to wait for some of the cloud experts to look at your logs to see if they have anything to add.

Thanks,
Shawn
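The two knobs involved, for reference: the 40-second ceiling comes from ZooKeeper's default maxSessionTimeout of 20 x tickTime, and on the Solr side zkClientTimeout lives in solr.xml (or can be passed as a -DzkClientTimeout system property). A sketch with the values discussed above:

    # zoo.cfg (ZooKeeper): 20 x 2000ms tick = 40s maximum session
    tickTime=2000

    # solr.xml, inside the <solrcloud> section
    <int name="zkClientTimeout">${zkClientTimeout:40000}</int>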
Re: solr cloud does not start with many collections
I still see the same cloud startup issue with Solr 5.0.0. I created 4,000 collections from scratch and then attempted to stop/start the cloud. node1: WARN - 2015-03-02 18:09:02.371; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog WARN - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes published as DOWN in our cluster state. WARN - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DD-3219 after 30 seconds; our state says http://host:8002/solr/DD-3219_shard1_replica1/, but ZooKeeper says http://host:8000/solr/DD-3219_shard1_replica2/ node2: WARN - 2015-03-02 18:09:01.871; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog WARN - 2015-03-02 18:17:04.458; org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered, but Solr cannot talk to ZK stop/start WARN - 2015-03-02 18:53:12.725; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog WARN - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DD-3581 after 30 seconds; our state says http://host:8001/solr/DD-3581_shard1_replica2/, but ZooKeeper says http://host:8002/solr/DD-3581_shard1_replica1/ node3: WARN - 2015-03-02 18:09:03.022; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog WARN - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes published as DOWN in our cluster state. WARN - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DD-2707 after 30 seconds; our state says http://host:8002/solr/DD-2707_shard1_replica2/, but ZooKeeper says http://host:8000/solr/DD-2707_shard1_replica1/ On 27 February 2015 at 17:48, Shawn Heisey wrote: > On 2/26/2015 11:14 PM, Damien Kamerman wrote: > > I've run into an issue with starting my solr cloud with many collections. > > My setup is: > > 3 nodes (solr 4.10.3 ; 64GB RAM each ; jdk1.8.0_25) running on a single > > server (256GB RAM). > > 5,000 collections (1 x shard ; 2 x replica) = 10,000 cores > > 1 x Zookeeper 3.4.6 > > Java arg -Djute.maxbuffer=67108864 added to solr and ZK. > > > > Then I stop all nodes, then start all nodes. All replicas are in the down > > state, some have no leader. At times I have seen some (12 or so) leaders > in > > the active state. In the solr logs I see lots of: > > > > org.apache.solr.cloud.ZkController; Still seeing conflicting information > > about the leader of shard shard1 for collection DD-4351 after 30 > > seconds; our state says > http://ftea1:8001/solr/DD-4351_shard1_replica1/, > > but ZooKeeper says http://ftea1:8000/solr/DD-4351_shard1_replica2/ > > > > > I've tried staggering the starts (1min) but does not help. > > I've reproduced with zero documents. > > Restarts are OK up to around 3,000 cores. > > Should this work? > > This is going to push SolrCloud beyond its limits. Is this just an > exercise to see how far you can push Solr, or are you looking at setting > up a production install with several thousand collections? > > In Solr 4.x, the clusterstate is one giant JSON structure containing the > state of the entire cloud. With 5000 collections, the entire thing > would need to be downloaded and uploaded at least 5000 times during the > course of a successful full system startup ... 
and I think with > replicationFactor set to 2, that might actually be 1 times. The > best-case scenario is that it would take a VERY long time, the > worst-case scenario is that concurrency problems would lead to a > deadlock. A deadlock might be what is happening here. > > In Solr 5.x, the clusterstate is broken up so there's a separate state > structure for each collection. This setup allows for faster and safer > multi-threading and far less data transfer. Assuming I understand the > implications correctly, there might not be any need to increase > jute.maxbuffer with 5.x ... although I have to assume that I might be > wrong about that. > > I would very much recommend that you set your scenario up from scratch > in Solr 5.0.0, to see if the new clusterstate format can eliminate the > problem you're seeing. If it doesn't, then we can pursue it as a likely > bug in the 5.x branch and you can file an issue in Jira. > > Thanks, > Shawn > > -- Damien Kamerman
Re: solr cloud does not start with many collections
Oh, and I was wondering if 'leaderVoteWait' might help in Solr4. On 27 February 2015 at 18:04, Damien Kamerman wrote: > This is going to push SolrCloud beyond its limits. Is this just an >> exercise to see how far you can push Solr, or are you looking at setting >> up a production install with several thousand collections? >> >> > I'm looking towards production. > > >> In Solr 4.x, the clusterstate is one giant JSON structure containing the >> state of the entire cloud. With 5000 collections, the entire thing >> would need to be downloaded and uploaded at least 5000 times during the >> course of a successful full system startup ... and I think with >> replicationFactor set to 2, that might actually be 1 times. The >> best-case scenario is that it would take a VERY long time, the >> worst-case scenario is that concurrency problems would lead to a >> deadlock. A deadlock might be what is happening here. >> >> > Yes, clusterstate.json is 3.3M. At times on startup I think it does > deadlock; log shows after 1min: > org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes > published as DOWN in our cluster state. > > >> In Solr 5.x, the clusterstate is broken up so there's a separate state >> structure for each collection. This setup allows for faster and safer >> multi-threading and far less data transfer. Assuming I understand the >> implications correctly, there might not be any need to increase >> jute.maxbuffer with 5.x ... although I have to assume that I might be >> wrong about that. >> >> I would very much recommend that you set your scenario up from scratch >> in Solr 5.0.0, to see if the new clusterstate format can eliminate the >> problem you're seeing. If it doesn't, then we can pursue it as a likely >> bug in the 5.x branch and you can file an issue in Jira. >> >> > Thanks, will test in Solr 5.0.0. > -- Damien Kamerman
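For the record, leaderVoteWait is a solr.xml setting (in the <solrcloud> section); it is the number of milliseconds a restarting node waits for the other replicas of a shard to show up before going ahead with leader election, so raising it mainly trades longer startup for a lower chance of a stale replica grabbing leadership. A rough sketch from memory (value illustrative; check the solr.xml reference for the exact element):

    <solrcloud>
      <int name="leaderVoteWait">300000</int>
    </solrcloud>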
Re: solr cloud does not start with many collections
> > This is going to push SolrCloud beyond its limits. Is this just an > exercise to see how far you can push Solr, or are you looking at setting > up a production install with several thousand collections? > > I'm looking towards production. > In Solr 4.x, the clusterstate is one giant JSON structure containing the > state of the entire cloud. With 5000 collections, the entire thing > would need to be downloaded and uploaded at least 5000 times during the > course of a successful full system startup ... and I think with > replicationFactor set to 2, that might actually be 1 times. The > best-case scenario is that it would take a VERY long time, the > worst-case scenario is that concurrency problems would lead to a > deadlock. A deadlock might be what is happening here. > > Yes, clusterstate.json is 3.3M. At times on startup I think it does deadlock; log shows after 1min: org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes published as DOWN in our cluster state. > In Solr 5.x, the clusterstate is broken up so there's a separate state > structure for each collection. This setup allows for faster and safer > multi-threading and far less data transfer. Assuming I understand the > implications correctly, there might not be any need to increase > jute.maxbuffer with 5.x ... although I have to assume that I might be > wrong about that. > > I would very much recommend that you set your scenario up from scratch > in Solr 5.0.0, to see if the new clusterstate format can eliminate the > problem you're seeing. If it doesn't, then we can pursue it as a likely > bug in the 5.x branch and you can file an issue in Jira. > > Thanks, will test in Solr 5.0.0.
Re: solr cloud does not start with many collections
On 2/26/2015 11:14 PM, Damien Kamerman wrote:
> I've run into an issue with starting my solr cloud with many collections.
> My setup is:
> 3 nodes (solr 4.10.3 ; 64GB RAM each ; jdk1.8.0_25) running on a single server (256GB RAM).
> 5,000 collections (1 x shard ; 2 x replica) = 10,000 cores
> 1 x Zookeeper 3.4.6
> Java arg -Djute.maxbuffer=67108864 added to solr and ZK.
>
> Then I stop all nodes, then start all nodes. All replicas are in the down state, some have no leader. At times I have seen some (12 or so) leaders in the active state. In the solr logs I see lots of:
>
> org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DD-4351 after 30 seconds; our state says http://ftea1:8001/solr/DD-4351_shard1_replica1/, but ZooKeeper says http://ftea1:8000/solr/DD-4351_shard1_replica2/
>
> I've tried staggering the starts (1min) but does not help.
> I've reproduced with zero documents.
> Restarts are OK up to around 3,000 cores.
> Should this work?

This is going to push SolrCloud beyond its limits. Is this just an exercise to see how far you can push Solr, or are you looking at setting up a production install with several thousand collections?

In Solr 4.x, the clusterstate is one giant JSON structure containing the state of the entire cloud. With 5000 collections, the entire thing would need to be downloaded and uploaded at least 5000 times during the course of a successful full system startup ... and I think with replicationFactor set to 2, that might actually be 10,000 times. The best-case scenario is that it would take a VERY long time; the worst-case scenario is that concurrency problems would lead to a deadlock. A deadlock might be what is happening here.

In Solr 5.x, the clusterstate is broken up so there's a separate state structure for each collection. This setup allows for faster and safer multi-threading and far less data transfer. Assuming I understand the implications correctly, there might not be any need to increase jute.maxbuffer with 5.x ... although I have to assume that I might be wrong about that.

I would very much recommend that you set your scenario up from scratch in Solr 5.0.0, to see if the new clusterstate format can eliminate the problem you're seeing. If it doesn't, then we can pursue it as a likely bug in the 5.x branch and you can file an issue in Jira.

Thanks,
Shawn
solr cloud does not start with many collections
I've run into an issue with starting my solr cloud with many collections. My setup is:

3 nodes (solr 4.10.3 ; 64GB RAM each ; jdk1.8.0_25) running on a single server (256GB RAM).
5,000 collections (1 x shard ; 2 x replica) = 10,000 cores
1 x Zookeeper 3.4.6
Java arg -Djute.maxbuffer=67108864 added to solr and ZK.

Then I stop all nodes, then start all nodes. All replicas are in the down state, some have no leader. At times I have seen some (12 or so) leaders in the active state. In the solr logs I see lots of:

org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DD-4351 after 30 seconds; our state says http://ftea1:8001/solr/DD-4351_shard1_replica1/, but ZooKeeper says http://ftea1:8000/solr/DD-4351_shard1_replica2/

org.apache.solr.common.SolrException; :org.apache.solr.common.SolrException: Error getting leader from zk for shard shard1
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:910)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:822)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:770)
        at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:221)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: There is conflicting information about the leader of shard: shard1 our state says: http://ftea1:8001/solr/DD-1564_shard1_replica2/ but zookeeper says: http://ftea1:8000/solr/DD-1564_shard1_replica1/
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:889)
        ... 6 more

I've tried staggering the starts (1min) but does not help.
I've reproduced with zero documents.
Restarts are OK up to around 3,000 cores.
Should this work?

Damien.