Re: solr cloud does not start with many collections

2015-03-11 Thread Damien Kamerman
Didier, I'm starting to look at SOLR-6399.

 after the core was unloaded, it was absent from the collection list, as
 if it never existed. On the other hand, re-issuing a CREATE call with the
 same collection restored the collection, along with its data

The collection is still in ZK though?

 upon restart Solr tried to reload the previously-unloaded collection.

Looks like CoreContainer.load() uses the CoreDescriptor.isTransient() and
CoreDescriptor.isLoadOnStartup() properties on startup.
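
For context, those flags map to per-core settings in each core's
core.properties. A minimal sketch (standalone mode only -- transient cores
are not supported in SolrCloud, and the core name below is illustrative):

# core.properties (sketch)
name=mycoll0001
# allow the core to be unloaded when the transient-core cache fills up
transient=true
# do not open the core until the first request arrives
loadOnStartup=false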


On 7 March 2015 at 13:10, didier deshommes dfdes...@gmail.com wrote:

 It would be a huge step forward if one could have several hundred Solr
 collections but only have a small portion of them opened/loaded at the
 same time. This is similar to Elasticsearch's close index API, listed here:
 http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-open-close.html
 I opened an issue a few months ago to implement the same in Solr:
 https://issues.apache.org/jira/browse/SOLR-6399

 On Thu, Mar 5, 2015 at 4:42 PM, Damien Kamerman dami...@gmail.com wrote:

  I've tried a few variations, with 3 x ZK, 6 x nodes, Solr 4.10.3 and Solr
  5.0, without any success and no real difference. There is a tipping point
  at around 3,000-4,000 cores (varies depending on hardware): below it I can
  restart the cloud OK within ~4min; above it the cloud does not come back
  up and logs continuous 'conflicting information about the leader of
  shard' warnings.
 
  On 5 March 2015 at 14:15, Shawn Heisey apa...@elyograg.org wrote:
 
   On 3/4/2015 5:37 PM, Damien Kamerman wrote:
I'm running on Solaris x86, I have plenty of memory and no real limits
# plimit 15560
15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
-XX:MaxMetasp
   resource  current maximum
  time(seconds) unlimited   unlimited
  file(blocks)  unlimited   unlimited
  data(kbytes)  unlimited   unlimited
  stack(kbytes) unlimited   unlimited
  coredump(blocks)  unlimited   unlimited
  nofiles(descriptors)  65536   65536
  vmemory(kbytes)   unlimited   unlimited
   
I've been testing with 3 nodes, and that seems OK up to around 3,000 cores
total. I'm thinking of testing with more nodes.
  
   I have opened an issue for the problems I encountered while recreating a
   config similar to yours, which I have been doing on Linux.
  
   https://issues.apache.org/jira/browse/SOLR-7191
  
   It's possible that the only thing the issue will lead to is improvements
   in the documentation, but I'm hopeful that there will be code
   improvements too.
  
   Thanks,
   Shawn
  
  
 
 
  --
  Damien Kamerman
 




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-03-06 Thread didier deshommes
It would be a huge step forward if one could have several hundred Solr
collections but only have a small portion of them opened/loaded at the
same time. This is similar to Elasticsearch's close index API, listed here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-open-close.html
I opened an issue a few months ago to implement the same in Solr:
https://issues.apache.org/jira/browse/SOLR-6399
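
For reference, a minimal sketch of the Elasticsearch calls in question
(host and index name are illustrative):

# close the index: it stops consuming heap and file handles but stays on disk
curl -XPOST 'http://localhost:9200/myindex/_close'
# re-open it when it is needed again
curl -XPOST 'http://localhost:9200/myindex/_open'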

On Thu, Mar 5, 2015 at 4:42 PM, Damien Kamerman dami...@gmail.com wrote:

 I've tried a few variations, with 3 x ZK, 6 x nodes, Solr 4.10.3 and Solr
 5.0, without any success and no real difference. There is a tipping point
 at around 3,000-4,000 cores (varies depending on hardware): below it I can
 restart the cloud OK within ~4min; above it the cloud does not come back
 up and logs continuous 'conflicting information about the leader of
 shard' warnings.

 On 5 March 2015 at 14:15, Shawn Heisey apa...@elyograg.org wrote:

  On 3/4/2015 5:37 PM, Damien Kamerman wrote:
   I'm running on Solaris x86, I have plenty of memory and no real limits
   # plimit 15560
   15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
   -XX:MaxMetasp
  resource  current maximum
 time(seconds) unlimited   unlimited
 file(blocks)  unlimited   unlimited
 data(kbytes)  unlimited   unlimited
 stack(kbytes) unlimited   unlimited
 coredump(blocks)  unlimited   unlimited
 nofiles(descriptors)  65536   65536
 vmemory(kbytes)   unlimited   unlimited
  
   I've been testing with 3 nodes, and that seems OK up to around 3,000
   cores total. I'm thinking of testing with more nodes.
 
  I have opened an issue for the problems I encountered while recreating a
  config similar to yours, which I have been doing on Linux.
 
  https://issues.apache.org/jira/browse/SOLR-7191
 
  It's possible that the only thing the issue will lead to is improvements
  in the documentation, but I'm hopeful that there will be code
  improvements too.
 
  Thanks,
  Shawn
 
 


 --
 Damien Kamerman



Re: solr cloud does not start with many collections

2015-03-05 Thread Damien Kamerman
I've tried a few variations, with 3 x ZK, 6 x nodes, Solr 4.10.3 and Solr
5.0, without any success and no real difference. There is a tipping point
at around 3,000-4,000 cores (varies depending on hardware): below it I can
restart the cloud OK within ~4min; above it the cloud does not come back
up and logs continuous 'conflicting information about the leader of
shard' warnings.

On 5 March 2015 at 14:15, Shawn Heisey apa...@elyograg.org wrote:

 On 3/4/2015 5:37 PM, Damien Kamerman wrote:
  I'm running on Solaris x86, I have plenty of memory and no real limits
  # plimit 15560
  15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
  -XX:MaxMetasp
 resource  current maximum
time(seconds) unlimited   unlimited
file(blocks)  unlimited   unlimited
data(kbytes)  unlimited   unlimited
stack(kbytes) unlimited   unlimited
coredump(blocks)  unlimited   unlimited
nofiles(descriptors)  65536   65536
vmemory(kbytes)   unlimited   unlimited
 
  I've been testing with 3 nodes, and that seems OK up to around 3,000
  cores total. I'm thinking of testing with more nodes.

 I have opened an issue for the problems I encountered while recreating a
 config similar to yours, which I have been doing on Linux.

 https://issues.apache.org/jira/browse/SOLR-7191

 It's possible that the only thing the issue will lead to is improvements
 in the documentation, but I'm hopeful that there will be code
 improvements too.

 Thanks,
 Shawn




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-03-04 Thread Shawn Heisey
On 3/4/2015 2:09 AM, Shawn Heisey wrote:
 I've come to one major conclusion about this whole thing, even before
 I reach the magic number of 4000 collections. Thousands of collections
 is not at all practical with SolrCloud currently.

I've now encountered a new problem.  I may have been hasty in declaring
that an increase of jute.maxbuffer is not required.  There are now 3715
collections, and I've seen a zookeeper exception that may indicate an
increase actually is required.  I have added that parameter to the
startup and when I have some time to look deeper, I will see whether
that helps.

Before 5.0, the maxbuffer would have been exceeded by only a few hundred
collections ... so this is definitely progress.
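
For anyone following along: jute.maxbuffer is a JVM system property, and it
has to be set to the same value on the ZooKeeper servers and on every Solr
node. A sketch using the 64MB value from earlier in this thread:

# ZooKeeper side: e.g. in conf/java.env
SERVER_JVMFLAGS="-Djute.maxbuffer=67108864"

# Solr 5.x side: pass extra JVM args through bin/solr
bin/solr start -c -a "-Djute.maxbuffer=67108864"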

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-03-04 Thread Damien Kamerman
I'm running on Solaris x86, I have plenty of memory and no real limits
# plimit 15560
15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
-XX:MaxMetasp
   resource  current maximum
  time(seconds) unlimited   unlimited
  file(blocks)  unlimited   unlimited
  data(kbytes)  unlimited   unlimited
  stack(kbytes) unlimited   unlimited
  coredump(blocks)  unlimited   unlimited
  nofiles(descriptors)  65536   65536
  vmemory(kbytes)   unlimited   unlimited

I've been testing with 3 nodes, and that seems OK up to around 3,000 cores
total. I'm thinking of testing with more nodes.


On 5 March 2015 at 05:28, Shawn Heisey apa...@elyograg.org wrote:

 On 3/4/2015 2:09 AM, Shawn Heisey wrote:
  I've come to one major conclusion about this whole thing, even before
  I reach the magic number of 4000 collections. Thousands of collections
  is not at all practical with SolrCloud currently.

 I've now encountered a new problem.  I may have been hasty in declaring
 that an increase of jute.maxbuffer is not required.  There are now 3715
 collections, and I've seen a zookeeper exception that may indicate an
 increase actually is required.  I have added that parameter to the
 startup and when I have some time to look deeper, I will see whether
 that helps.

 Before 5.0, the maxbuffer would have been exceeded by only a few hundred
 collections ... so this is definitely progress.

 Thanks,
 Shawn




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-03-04 Thread Shawn Heisey
On 3/4/2015 5:37 PM, Damien Kamerman wrote:
 I'm running on Solaris x86, I have plenty of memory and no real limits
 # plimit 15560
 15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
 -XX:MaxMetasp
resource  current maximum
   time(seconds) unlimited   unlimited
   file(blocks)  unlimited   unlimited
   data(kbytes)  unlimited   unlimited
   stack(kbytes) unlimited   unlimited
   coredump(blocks)  unlimited   unlimited
   nofiles(descriptors)  65536   65536
   vmemory(kbytes)   unlimited   unlimited
 
 I've been testing with 3 nodes, and that seems OK up to around 3,000 cores
 total. I'm thinking of testing with more nodes.

I have opened an issue for the problems I encountered while recreating a
config similar to yours, which I have been doing on Linux.

https://issues.apache.org/jira/browse/SOLR-7191

It's possible that the only thing the issue will lead to is improvements
in the documentation, but I'm hopeful that there will be code
improvements too.

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-03-04 Thread Shawn Heisey
On 3/3/2015 9:22 PM, Damien Kamerman wrote:
 I've done a similar thing to create the collections. You're going to need
 more memory I think.
 
 OK, so the maxThreads limit on Jetty could be causing a distributed deadlock?

I don't know what the exact problems would be if maxThreads is reached.
It's probably unpredictable.

With 2674 collections added, 5GB wasn't enough heap.  I started getting
a ton of exceptions during collection creation.  I had to shut down both
Solr instances.  When I brought up the first instance with a 7GB heap
(the one with the embedded zk), it took exactly half an hour for jetty
to start listening on port 8983, and about two hours total for it to
stabilize to the point where everything for that node was green on the
cloud graph.

Even now, nearly three hours after startup, the Solr log is still
spitting out thousands of lines that look like this, so I don't think I
can call it stable:

INFO  - 2015-03-04 07:35:51.166;
org.apache.solr.common.cloud.ZkStateReader; Updating data for mycoll1515
to ver 60

I'm going to try bringing up the other Solr instance now, and if that
stabilizes with all shards in the green, I will try to continue adding
collections.

Side note: I have been able to confirm with these tests that version 5.0
no longer requires increasing jute.maxbuffer to run many collections.
I'm still running with the default value and zookeeper has had no
problems handling all the data.

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-03-04 Thread Shawn Heisey
On 3/4/2015 1:02 AM, Shawn Heisey wrote:
 Even now, nearly three hours after startup, the Solr log is still
 spitting out thousands of lines that look like this, so I don't think I
 can call it stable:
 
 INFO  - 2015-03-04 07:35:51.166;
 org.apache.solr.common.cloud.ZkStateReader; Updating data for mycoll1515
 to ver 60
 
 I'm going to try bringing up the other Solr instance now, and if that
 stabilizes with all shards in the green, I will try to continue adding
 collections.

I've come to one major conclusion about this whole thing, even before I
reach the magic number of 4000 collections.  Thousands of collections is
not at all practical with SolrCloud currently.  Some additional
conclusions about this setup:

* Stopping and restarting the entire cluster will quite literally take
hours for full stability.  A rolling restart *might* go faster, but
honestly I would not count on that.

* An external zookeeper ensemble is absolutely critical.  Zookeeper
stability is extremely important.

* A lot of heap memory is required, even if the indexes are completely
empty and there is no query/index activity.  Active indexes with data
are going to push that even higher, and will very likely slow down
recovery on server restart.

* Operating system limits for the max number of open files and max
number of processes allowed will need to be reconfigured - these are
settings that are NOT managed by Solr or Jetty.  Configuration may vary
widely between different operating systems; see the sketch after this list.

* Thousands of collections *might* work OK if there are enough servers
so that each one doesn't have more than a couple hundred cores.  This
would need to be tested, and I don't have the available hardware.
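
As a concrete illustration of the OS-limits point above, a Linux sketch
(values are illustrative starting points, not recommendations; Solaris uses
plimit/projects instead, as shown earlier in this thread):

# check the limits for the user that runs Solr
ulimit -n   # max open files
ulimit -u   # max user processes (this also caps the number of threads)

# raise them persistently in /etc/security/limits.conf, e.g.:
#   solr  soft  nofile  65536
#   solr  hard  nofile  65536
#   solr  soft  nproc   32768
#   solr  hard  nproc   32768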

I'm not sure that the OP's problem can actually be called a bug ... it's
more of a performance limitation.  We should still file an issue and
treat it like a bug, though.

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-03-03 Thread Damien Kamerman
About one minute after startup I sometimes see
'org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes
published as DOWN in our cluster state.'
And I see the 'Still seeing conflicting information about the leader of
shard' warnings after about 5 minutes.
Thanks Shawn, I will create an issue.

On 4 March 2015 at 01:10, Shawn Heisey apa...@elyograg.org wrote:

 On 3/3/2015 6:55 AM, Shawn Heisey wrote:
  With a longer zkClientTimeout, does the failure happen on a later
  collection?  I had hoped that it would solve the problem, but I'm
  curious about whether it was able to load more collections before it
  finally died, or whether it made no difference... and whether the
  message now indicates 40 seconds or if it still says 30.

 I have found the code that produces the message, and the wait for this
 particular section is hardcoded to 30 seconds.  That means the timeout
 won't affect it.

 If you move the Solr log so it creates a new one from startup, how long
 does it take after startup begins before you see the failure that
 indicates the conflicting leader information hasn't resolved?

 This most likely is a bug ... our SolrCloud experts will need to
 investigate to find it, so we need as much information as you can provide.

 Thanks,
 Shawn




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-03-03 Thread Shawn Heisey
On 3/2/2015 12:54 AM, Damien Kamerman wrote:
 I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
 collections from scratch and then attempted to stop/start the cloud.

I have been trying to duplicate your setup using the -e cloud example
included in the Solr 5.0 download and accepting all the defaults.  This
sets up two Solr instances on one machine, one of which runs an embedded
zookeeper.

I have been running into a LOT of issues just trying to get so many
collections created, to say nothing about restart problems.

The first problem I ran into was heap size.  The example starts each of
the Solr instances with a 512MB heap, which is WAY too small.  It
allowed me to create 274 collections, in addition to the gettingstarted
collection that the example started with.  One of the Solr instances
simply crashed.  No OutOfMemoryError or anything else in the log ...
it just died.

I bumped the heap on each Solr instance to 4GB.  The next problem I ran
into was the operating system limit on the number of processes ... and I
had already bumped that up beyond the usual 1024 default, to 4096.  Solr
was not able to create any more threads, because my user was not able to
fork any more processes.  I got over 700 collections created before that
became a problem.  My max open files had also been increased already --
this is another place where a stock system will run into trouble
creating a lot of collections.

I fixed that, and the next problem I ran into was total RAM on the
machine ... it turns out that with two Solr processes each using 4GB, I
was dipping 3GB into swap.  This is odd, because I have 12GB of RAM
on that machine and it's not doing very much besides this SolrCloud
test.  Swapping means that performance was completely unacceptable and
it would probably never finish.

So ... I had to find a machine with more memory.  I've got a dev server
with 32GB.  I fired up the two SolrCloud processes on it with 5GB heap
each, with 32768 processes allowed.  I am in the process of building
4000 collections (numShards=2, replicationFactor=1), and so far, it is
working OK.  I have almost 2700 collections now.

If I can ever get it to actually build 4000 collections, then I can
attempt restarting the second Solr instance and see what happens.  I
think I might hit another roadblock in the form of the 10,000 maxThreads
limit on Jetty.  Running this all on one machine might
not be possible, but I'm giving it a try.
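
For reference, that cap lives in the jetty.xml bundled with Solr; a sketch
of the stock thread-pool setting (the exact file layout varies between
Solr/Jetty versions, so treat this as illustrative):

<!-- server/etc/jetty.xml (sketch): Solr ships with a 10,000-thread cap -->
<New class="org.eclipse.jetty.util.thread.QueuedThreadPool">
  <Set name="minThreads">10</Set>
  <Set name="maxThreads">10000</Set>
</New>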

Here's the script I am using to create all those collections:

#!/bin/sh

# Create mycoll0000..mycoll3999: two shards each, one replica per shard,
# all sharing the gettingstarted configset already in ZooKeeper.
for i in `seq -f %04.0f 0 3999`
do
  echo $i
  coll=mycoll${i}
  URL="http://localhost:8983/solr/admin/collections"
  URL="${URL}?action=CREATE&name=${coll}&numShards=2&replicationFactor=1"
  URL="${URL}&collection.configName=gettingstarted"
  curl "$URL"
done

Thanks,
Shawn


Re: solr cloud does not start with many collections

2015-03-03 Thread Damien Kamerman
I've done a similar thing to create the collections. You're going to need
more memory I think.

OK, so the maxThreads limit on Jetty could be causing a distributed deadlock?


On 4 March 2015 at 13:18, Shawn Heisey apa...@elyograg.org wrote:

 On 3/2/2015 12:54 AM, Damien Kamerman wrote:
  I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
  collections from scratch and then attempted to stop/start the cloud.

 I have been trying to duplicate your setup using the -e cloud example
 included in the Solr 5.0 download and accepting all the defaults.  This
 sets up two Solr instances on one machine, one of which runs an embedded
 zookeeper.

 I have been running into a LOT of issues just trying to get so many
 collections created, to say nothing about restart problems.

 The first problem I ran into was heap size.  The example starts each of
 the Solr instances with a 512MB heap, which is WAY too small.  It
 allowed me to create 274 collections, in addition to the gettingstarted
 collection that the example started with.  One of the Solr instances
 simply crashed.  No OutOfMemoryError or anything else in the log ...
 it just died.

 I bumped the heap on each Solr instance to 4GB.  The next problem I ran
 into was the operating system limit on the number of processes ... and I
 had already bumped that up beyond the usual 1024 default, to 4096.  Solr
 was not able to create any more threads, because my user was not able to
 fork any more processes.  I got over 700 collections created before that
 became a problem.  My max open files had also been increased already --
 this is another place where a stock system will run into trouble
 creating a lot of collections.

 I fixed that, and the next problem I ran into was total RAM on the
 machine ... it turns out that with two Solr processes each using 4GB, I
 was dipping 3GB into swap.  This is odd, because I have 12GB of RAM
 on that machine and it's not doing very much besides this SolrCloud
 test.  Swapping means that performance was completely unacceptable and
 it would probably never finish.

 So ... I had to find a machine with more memory.  I've got a dev server
 with 32GB.  I fired up the two SolrCloud processes on it with 5GB heap
 each, with 32768 processes allowed.  I am in the process of building
 4000 collections (numShards=2, replicationFactor=1), and so far, it is
 working OK.  I have almost 2700 collections now.

 If I can ever get it to actually build 4000 collections, then I can
 attempt restarting the second Solr instance and see what happens.  I
 think I might hit another roadblock in the form of the 10,000 maxThreads
 limit on Jetty.  Running this all on one machine might
 not be possible, but I'm giving it a try.

 Here's the script I am using to create all those collections:

 #!/bin/sh

 # Create mycoll0000..mycoll3999: two shards each, one replica per shard,
 # all sharing the gettingstarted configset already in ZooKeeper.
 for i in `seq -f %04.0f 0 3999`
 do
   echo $i
   coll=mycoll${i}
   URL="http://localhost:8983/solr/admin/collections"
   URL="${URL}?action=CREATE&name=${coll}&numShards=2&replicationFactor=1"
   URL="${URL}&collection.configName=gettingstarted"
   curl "$URL"
 done

 Thanks,
 Shawn




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-03-03 Thread Shawn Heisey
On 3/3/2015 6:55 AM, Shawn Heisey wrote:
 With a longer zkClientTimeout, does the failure happen on a later
 collection?  I had hoped that it would solve the problem, but I'm
 curious about whether it was able to load more collections before it
 finally died, or whether it made no difference... and whether the
 message now indicates 40 seconds or if it still says 30.

I have found the code that produces the message, and the wait for this
particular section is hardcoded to 30 seconds.  That means the timeout
won't affect it.

If you move the Solr log so it creates a new one from startup, how long
does it take after startup begins before you see the failure that
indicates the conflicting leader information hasn't resolved?

This most likely is a bug ... our SolrCloud experts will need to
investigate to find it, so we need as much information as you can provide.

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-03-03 Thread Shawn Heisey
On 3/3/2015 12:42 AM, Damien Kamerman wrote:
 Still no luck starting Solr with a 40s zkClientTimeout. I'm not seeing any
 expired sessions...

 There must be a way to start Solr with many collections. It runs fine ...
 until a restart is required.

With a longer zkClientTimeout, does the failure happen on a later
collection?  I had hoped that it would solve the problem, but I'm
curious about whether it was able to load more collections before it
finally died, or whether it made no difference... and whether the
message now indicates 40 seconds or if it still says 30.

Either way, I think we have reached the point where filing an issue in
Jira is appropriate.  You have the best information, so you should
create the issue.  The main description should fully describe the
problem, but not have exhaustive detail.  You can include lots of
supporting detail in attachments and in the comments.

https://issues.apache.org/jira/browse/SOLR

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-03-02 Thread Damien Kamerman
Still no luck starting Solr with a 40s zkClientTimeout. I'm not seeing any
expired sessions...

There must be a way to start Solr with many collections. It runs fine ...
until a restart is required.

On 3 March 2015 at 03:33, Shawn Heisey apa...@elyograg.org wrote:

 On 3/2/2015 12:54 AM, Damien Kamerman wrote:
  I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
  collections from scratch and then attempted to stop/start the cloud.
 
  node1:
  WARN  - 2015-03-02 18:09:02.371;
  org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
  WARN  - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController;
  Timed out waiting to see all nodes published as DOWN in our cluster state.
  WARN  - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController;
  Still seeing conflicting information about the leader of shard shard1 for
  collection DD-3219 after 30 seconds; our state says
  http://host:8002/solr/DD-3219_shard1_replica1/, but ZooKeeper says
  http://host:8000/solr/DD-3219_shard1_replica2/

  node2:
  WARN  - 2015-03-02 18:09:01.871;
  org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
  WARN  - 2015-03-02 18:17:04.458;
  org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered,
  but Solr cannot talk to ZK
  [stop/start]
  WARN  - 2015-03-02 18:53:12.725;
  org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
  WARN  - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController;
  Still seeing conflicting information about the leader of shard shard1 for
  collection DD-3581 after 30 seconds; our state says
  http://host:8001/solr/DD-3581_shard1_replica2/, but ZooKeeper says
  http://host:8002/solr/DD-3581_shard1_replica1/

  node3:
  WARN  - 2015-03-02 18:09:03.022;
  org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
  WARN  - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController;
  Timed out waiting to see all nodes published as DOWN in our cluster state.
  WARN  - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController;
  Still seeing conflicting information about the leader of shard shard1 for
  collection DD-2707 after 30 seconds; our state says
  http://host:8002/solr/DD-2707_shard1_replica2/, but ZooKeeper says
  http://host:8000/solr/DD-2707_shard1_replica1/

 I'm sorry to hear that 5.0 didn't fix the problem.  I really hoped that
 it would.

 There is one other thing I'd like to try before you file a bug --
 increasing zkClientTimeout to 40 seconds, to see whether it changes the
 point at which it fails (or allows it to succeed).  With the
 default tickTime (2 seconds), the maximum time you can set
 zkClientTimeout to is 40 seconds ... which in normal circumstances is a
 VERY long time.  In your situation, at least with the code in its
 current state, 30 seconds (I'm pretty sure this is the default in 5.0)
 may simply not be enough.


 https://cwiki.apache.org/confluence/display/solr/Parameter+Reference#ParameterReference-SolrCloudInstanceZooKeeperParameters

 I think filing a bug, even if 40 seconds allows this to succeed, is a
 good idea ... but you might want to wait for some of the cloud experts
 to look at your logs to see if they have anything to add.

 Thanks,
 Shawn




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-03-02 Thread Shawn Heisey
On 3/2/2015 12:54 AM, Damien Kamerman wrote:
 I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
 collections from scratch and then attempted to stop/start the cloud.

 node1:
 WARN  - 2015-03-02 18:09:02.371;
 org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
 WARN  - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController;
 Timed out waiting to see all nodes published as DOWN in our cluster state.
 WARN  - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController;
 Still seeing conflicting information about the leader of shard shard1 for
 collection DD-3219 after 30 seconds; our state says
 http://host:8002/solr/DD-3219_shard1_replica1/, but ZooKeeper says
 http://host:8000/solr/DD-3219_shard1_replica2/

 node2:
 WARN  - 2015-03-02 18:09:01.871;
 org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
 WARN  - 2015-03-02 18:17:04.458;
 org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered,
 but Solr cannot talk to ZK
 [stop/start]
 WARN  - 2015-03-02 18:53:12.725;
 org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
 WARN  - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController;
 Still seeing conflicting information about the leader of shard shard1 for
 collection DD-3581 after 30 seconds; our state says
 http://host:8001/solr/DD-3581_shard1_replica2/, but ZooKeeper says
 http://host:8002/solr/DD-3581_shard1_replica1/

 node3:
 WARN  - 2015-03-02 18:09:03.022;
 org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
 WARN  - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController;
 Timed out waiting to see all nodes published as DOWN in our cluster state.
 WARN  - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController;
 Still seeing conflicting information about the leader of shard shard1 for
 collection DD-2707 after 30 seconds; our state says
 http://host:8002/solr/DD-2707_shard1_replica2/, but ZooKeeper says
 http://host:8000/solr/DD-2707_shard1_replica1/

I'm sorry to hear that 5.0 didn't fix the problem.  I really hoped that
it would.

There is one other thing I'd like to try before you file a bug --
increasing zkClientTimeout to 40 seconds, to see whether it changes the
point at which it fails (or allows it to succeed).  With the
default tickTime (2 seconds), the maximum time you can set
zkClientTimeout to is 40 seconds ... which in normal circumstances is a
VERY long time.  In your situation, at least with the code in its
current state, 30 seconds (I'm pretty sure this is the default in 5.0)
may simply not be enough.

https://cwiki.apache.org/confluence/display/solr/Parameter+Reference#ParameterReference-SolrCloudInstanceZooKeeperParameters
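
As a concrete sketch of the change (the stock solr.xml in 5.0 reads
zkClientTimeout from a system property, and ZooKeeper caps session timeouts
at 20 * tickTime, i.e. 40s with the default 2s tick; ZK hosts below are
illustrative):

# restart each node with a 40-second ZK session timeout (value in ms)
bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181 -a "-DzkClientTimeout=40000"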

I think filing a bug, even if 40 seconds allows this to succeed, is a
good idea ... but you might want to wait for some of the cloud experts
to look at your logs to see if they have anything to add.

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-03-01 Thread Damien Kamerman
I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
collections from scratch and then attempted to stop/start the cloud.

node1:
WARN  - 2015-03-02 18:09:02.371;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController;
Timed out waiting to see all nodes published as DOWN in our cluster state.
WARN  - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController;
Still seeing conflicting information about the leader of shard shard1 for
collection DD-3219 after 30 seconds; our state says
http://host:8002/solr/DD-3219_shard1_replica1/, but ZooKeeper says
http://host:8000/solr/DD-3219_shard1_replica2/

node2:
WARN  - 2015-03-02 18:09:01.871;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:17:04.458;
org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered,
but Solr cannot talk to ZK
[stop/start]
WARN  - 2015-03-02 18:53:12.725;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController;
Still seeing conflicting information about the leader of shard shard1 for
collection DD-3581 after 30 seconds; our state says
http://host:8001/solr/DD-3581_shard1_replica2/, but ZooKeeper says
http://host:8002/solr/DD-3581_shard1_replica1/

node3:
WARN  - 2015-03-02 18:09:03.022;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController;
Timed out waiting to see all nodes published as DOWN in our cluster state.
WARN  - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController;
Still seeing conflicting information about the leader of shard shard1 for
collection DD-2707 after 30 seconds; our state says
http://host:8002/solr/DD-2707_shard1_replica2/, but ZooKeeper says
http://host:8000/solr/DD-2707_shard1_replica1/



On 27 February 2015 at 17:48, Shawn Heisey apa...@elyograg.org wrote:

 On 2/26/2015 11:14 PM, Damien Kamerman wrote:
  I've run into an issue with starting my solr cloud with many collections.
  My setup is:
  3 nodes (solr 4.10.3 ; 64GB RAM each ; jdk1.8.0_25) running on a single
  server (256GB RAM).
  5,000 collections (1 x shard ; 2 x replica) = 10,000 cores
  1 x Zookeeper 3.4.6
  Java arg -Djute.maxbuffer=67108864 added to solr and ZK.
 
  Then I stop all nodes, then start all nodes. All replicas are in the down
  state, some have no leader. At times I have seen some (12 or so) leaders in
  the active state. In the solr logs I see lots of:
 
  org.apache.solr.cloud.ZkController; Still seeing conflicting information
  about the leader of shard shard1 for collection DD-4351 after 30
  seconds; our state says http://ftea1:8001/solr/DD-4351_shard1_replica1/,
  but ZooKeeper says http://ftea1:8000/solr/DD-4351_shard1_replica2/

 snip

  I've tried staggering the starts (1min) but it does not help.
  I've reproduced with zero documents.
  Restarts are OK up to around 3,000 cores.
  Should this work?

 This is going to push SolrCloud beyond its limits.  Is this just an
 exercise to see how far you can push Solr, or are you looking at setting
 up a production install with several thousand collections?

 In Solr 4.x, the clusterstate is one giant JSON structure containing the
 state of the entire cloud.  With 5000 collections, the entire thing
 would need to be downloaded and uploaded at least 5000 times during the
 course of a successful full system startup ... and I think with
 replicationFactor set to 2, that might actually be 10,000 times.  The
 best-case scenario is that it would take a VERY long time; the
 worst-case scenario is that concurrency problems would lead to a
 deadlock.  A deadlock might be what is happening here.

 In Solr 5.x, the clusterstate is broken up so there's a separate state
 structure for each collection.  This setup allows for faster and safer
 multi-threading and far less data transfer.  Assuming I understand the
 implications correctly, there might not be any need to increase
 jute.maxbuffer with 5.x ... although I have to assume that I might be
 wrong about that.

 I would very much recommend that you set your scenario up from scratch
 in Solr 5.0.0, to see if the new clusterstate format can eliminate the
 problem you're seeing.  If it doesn't, then we can pursue it as a likely
 bug in the 5.x branch and you can file an issue in Jira.

 Thanks,
 Shawn




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-02-26 Thread Shawn Heisey
On 2/26/2015 11:14 PM, Damien Kamerman wrote:
 I've run into an issue with starting my solr cloud with many collections.
 My setup is:
 3 nodes (solr 4.10.3 ; 64GB RAM each ; jdk1.8.0_25) running on a single
 server (256GB RAM).
 5,000 collections (1 x shard ; 2 x replica) = 10,000 cores
 1 x Zookeeper 3.4.6
 Java arg -Djute.maxbuffer=67108864 added to solr and ZK.
 
 Then I stop all nodes, then start all nodes. All replicas are in the down
 state, some have no leader. At times I have seen some (12 or so) leaders in
 the active state. In the solr logs I see lots of:
 
 org.apache.solr.cloud.ZkController; Still seeing conflicting information
 about the leader of shard shard1 for collection DD-4351 after 30
 seconds; our state says http://ftea1:8001/solr/DD-4351_shard1_replica1/,
 but ZooKeeper says http://ftea1:8000/solr/DD-4351_shard1_replica2/

snip

 I've tried staggering the starts (1min) but it does not help.
 I've reproduced with zero documents.
 Restarts are OK up to around 3,000 cores.
 Should this work?

This is going to push SolrCloud beyond its limits.  Is this just an
exercise to see how far you can push Solr, or are you looking at setting
up a production install with several thousand collections?

In Solr 4.x, the clusterstate is one giant JSON structure containing the
state of the entire cloud.  With 5000 collections, the entire thing
would need to be downloaded and uploaded at least 5000 times during the
course of a successful full system startup ... and I think with
replicationFactor set to 2, that might actually be 10,000 times.  The
best-case scenario is that it would take a VERY long time; the
worst-case scenario is that concurrency problems would lead to a
deadlock.  A deadlock might be what is happening here.

In Solr 5.x, the clusterstate is broken up so there's a separate state
structure for each collection.  This setup allows for faster and safer
multi-threading and far less data transfer.  Assuming I understand the
implications correctly, there might not be any need to increase
jute.maxbuffer with 5.x ... although I have to assume that I might be
wrong about that.
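
One way to see the difference is with the ZooKeeper CLI (a sketch; these
are the standard SolrCloud znode paths, and the collection name is
illustrative):

# Solr 4.x: one global blob that every node watches and re-reads
zkCli.sh -server localhost:2181 get /clusterstate.json

# Solr 5.x (state format 2): per-collection state
zkCli.sh -server localhost:2181 get /collections/mycoll0001/state.json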

I would very much recommend that you set your scenario up from scratch
in Solr 5.0.0, to see if the new clusterstate format can eliminate the
problem you're seeing.  If it doesn't, then we can pursue it as a likely
bug in the 5.x branch and you can file an issue in Jira.

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-02-26 Thread Damien Kamerman
Oh, and I was wondering if 'leaderVoteWait' might help in Solr 4.
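
For reference, leaderVoteWait is a <solrcloud> setting in solr.xml; a
sketch (the wait is in milliseconds, and the value shown is only
illustrative):

<solrcloud>
  <!-- how long a node waits to see other replicas of a shard come up
       before taking over as leader itself -->
  <int name="leaderVoteWait">${leaderVoteWait:180000}</int>
</solrcloud>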

On 27 February 2015 at 18:04, Damien Kamerman dami...@gmail.com wrote:

 This is going to push SolrCloud beyond its limits.  Is this just an
 exercise to see how far you can push Solr, or are you looking at setting
 up a production install with several thousand collections?


 I'm looking towards production.


 In Solr 4.x, the clusterstate is one giant JSON structure containing the
 state of the entire cloud.  With 5000 collections, the entire thing
 would need to be downloaded and uploaded at least 5000 times during the
 course of a successful full system startup ... and I think with
 replicationFactor set to 2, that might actually be 10,000 times.  The
 best-case scenario is that it would take a VERY long time; the
 worst-case scenario is that concurrency problems would lead to a
 deadlock.  A deadlock might be what is happening here.


 Yes, clusterstate.json is 3.3M. At times on startup I think it does
 deadlock; log shows after 1min:
 org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes
 published as DOWN in our cluster state.


 In Solr 5.x, the clusterstate is broken up so there's a separate state
 structure for each collection.  This setup allows for faster and safer
 multi-threading and far less data transfer.  Assuming I understand the
 implications correctly, there might not be any need to increase
 jute.maxbuffer with 5.x ... although I have to assume that I might be
 wrong about that.

 I would very much recommend that you set your scenario up from scratch
 in Solr 5.0.0, to see if the new clusterstate format can eliminate the
 problem you're seeing.  If it doesn't, then we can pursue it as a likely
 bug in the 5.x branch and you can file an issue in Jira.


 Thanks, will test in Solr 5.0.0.




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-02-26 Thread Damien Kamerman

 This is going to push SolrCloud beyond its limits.  Is this just an
 exercise to see how far you can push Solr, or are you looking at setting
 up a production install with several thousand collections?


I'm looking towards production.


 In Solr 4.x, the clusterstate is one giant JSON structure containing the
 state of the entire cloud.  With 5000 collections, the entire thing
 would need to be downloaded and uploaded at least 5000 times during the
 course of a successful full system startup ... and I think with
 replicationFactor set to 2, that might actually be 10,000 times.  The
 best-case scenario is that it would take a VERY long time; the
 worst-case scenario is that concurrency problems would lead to a
 deadlock.  A deadlock might be what is happening here.


Yes, clusterstate.json is 3.3M. At times on startup I think it does
deadlock; log shows after 1min:
org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes
published as DOWN in our cluster state.


 In Solr 5.x, the clusterstate is broken up so there's a separate state
 structure for each collection.  This setup allows for faster and safer
 multi-threading and far less data transfer.  Assuming I understand the
 implications correctly, there might not be any need to increase
 jute.maxbuffer with 5.x ... although I have to assume that I might be
 wrong about that.

 I would very much recommend that you set your scenario up from scratch
 in Solr 5.0.0, to see if the new clusterstate format can eliminate the
 problem you're seeing.  If it doesn't, then we can pursue it as a likely
 bug in the 5.x branch and you can file an issue in Jira.


Thanks, will test in Solr 5.0.0.


solr cloud does not start with many collections

2015-02-26 Thread Damien Kamerman
I've run into an issue with starting my solr cloud with many collections.
My setup is:
3 nodes (solr 4.10.3 ; 64GB RAM each ; jdk1.8.0_25) running on a single
server (256GB RAM).
5,000 collections (1 x shard ; 2 x replica) = 10,000 cores
1 x Zookeeper 3.4.6
Java arg -Djute.maxbuffer=67108864 added to solr and ZK.

Then I stop all nodes, then start all nodes. All replicas are in the down
state, some have no leader. At times I have seen some (12 or so) leaders in
the active state. In the solr logs I see lots of:

org.apache.solr.cloud.ZkController; Still seeing conflicting information
about the leader of shard shard1 for collection DD-4351 after 30
seconds; our state says http://ftea1:8001/solr/DD-4351_shard1_replica1/,
but ZooKeeper says http://ftea1:8000/solr/DD-4351_shard1_replica2/

org.apache.solr.common.SolrException;
:org.apache.solr.common.SolrException: Error getting leader from zk for
shard shard1
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:910)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:822)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:770)
        at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:221)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: There is conflicting
information about the leader of shard: shard1 our state says:
http://ftea1:8001/solr/DD-1564_shard1_replica2/ but zookeeper says:
http://ftea1:8000/solr/DD-1564_shard1_replica1/
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:889)
        ... 6 more

I've tried staggering the starts (1min) but it does not help.
I've reproduced with zero documents.
Restarts are OK up to around 3,000 cores.
Should this work?

Damien.