Re: solr cloud does not start with many collections

2015-03-11 Thread Damien Kamerman
Didier, I'm starting to look at SOLR-6399.

 after the core was unloaded, it was absent from the collection list, as
 if it never existed. On the other hand, re-issuing a CREATE call with the
 same collection restored the collection, along with its data

The collection is still in ZK though?

 upon restart Solr tried to reload the previously-unloaded collection.

Looks like CoreContainer.load() uses the CoreDescriptor.isTransient() and
CoreDescriptor.isLoadOnStartup() properties on startup.
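
For context, those flags map to per-core settings in each core's
core.properties. A minimal sketch (standalone mode only -- transient cores
are not supported in SolrCloud, and the core name below is illustrative):

# core.properties (sketch)
name=mycoll0001
# allow the core to be unloaded when the transient-core cache fills up
transient=true
# do not open the core until the first request arrives
loadOnStartup=false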


On 7 March 2015 at 13:10, didier deshommes dfdes...@gmail.com wrote:

 It would be a huge step forward if one could have several hundred Solr
 collections but only have a small portion of them opened/loaded at the
 same time. This is similar to Elasticsearch's close index API, listed here:
 http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-open-close.html
 I opened an issue a few months ago to implement the same in Solr:
 https://issues.apache.org/jira/browse/SOLR-6399

 On Thu, Mar 5, 2015 at 4:42 PM, Damien Kamerman dami...@gmail.com wrote:

  I've tried a few variations, with 3 x ZK, 6 x nodes, Solr 4.10.3 and Solr
  5.0, without any success and no real difference. There is a tipping point
  at around 3,000-4,000 cores (varies depending on hardware): below it I can
  restart the cloud OK within ~4min; above it the cloud does not come back
  up and logs continuous 'conflicting information about the leader of
  shard' warnings.
 
  On 5 March 2015 at 14:15, Shawn Heisey apa...@elyograg.org wrote:
 
   On 3/4/2015 5:37 PM, Damien Kamerman wrote:
I'm running on Solaris x86, I have plenty of memory and no real limits
# plimit 15560
15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
-XX:MaxMetasp
   resource  current maximum
  time(seconds) unlimited   unlimited
  file(blocks)  unlimited   unlimited
  data(kbytes)  unlimited   unlimited
  stack(kbytes) unlimited   unlimited
  coredump(blocks)  unlimited   unlimited
  nofiles(descriptors)  65536   65536
  vmemory(kbytes)   unlimited   unlimited
   
I've been testing with 3 nodes, and that seems OK up to around 3,000 cores
total. I'm thinking of testing with more nodes.
  
   I have opened an issue for the problems I encountered while recreating a
   config similar to yours, which I have been doing on Linux.
  
   https://issues.apache.org/jira/browse/SOLR-7191
  
   It's possible that the only thing the issue will lead to is improvements
   in the documentation, but I'm hopeful that there will be code
   improvements too.
  
   Thanks,
   Shawn
  
  
 
 
  --
  Damien Kamerman
 




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-03-06 Thread didier deshommes
It would be a huge step forward if one could have several hundred Solr
collections but only have a small portion of them opened/loaded at the
same time. This is similar to Elasticsearch's close index API, listed here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-open-close.html
I opened an issue a few months ago to implement the same in Solr:
https://issues.apache.org/jira/browse/SOLR-6399
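
For reference, a minimal sketch of the Elasticsearch calls in question
(host and index name are illustrative):

# close the index: it stops consuming heap and file handles but stays on disk
curl -XPOST 'http://localhost:9200/myindex/_close'
# re-open it when it is needed again
curl -XPOST 'http://localhost:9200/myindex/_open'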

On Thu, Mar 5, 2015 at 4:42 PM, Damien Kamerman dami...@gmail.com wrote:

 I've tried a few variations, with 3 x ZK, 6 x nodes, Solr 4.10.3 and Solr
 5.0, without any success and no real difference. There is a tipping point
 at around 3,000-4,000 cores (varies depending on hardware): below it I can
 restart the cloud OK within ~4min; above it the cloud does not come back
 up and logs continuous 'conflicting information about the leader of
 shard' warnings.

 On 5 March 2015 at 14:15, Shawn Heisey apa...@elyograg.org wrote:

  On 3/4/2015 5:37 PM, Damien Kamerman wrote:
   I'm running on Solaris x86, I have plenty of memory and no real limits
   # plimit 15560
   15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
   -XX:MaxMetasp
  resource  current maximum
 time(seconds) unlimited   unlimited
 file(blocks)  unlimited   unlimited
 data(kbytes)  unlimited   unlimited
 stack(kbytes) unlimited   unlimited
 coredump(blocks)  unlimited   unlimited
 nofiles(descriptors)  65536   65536
 vmemory(kbytes)   unlimited   unlimited
  
   I've been testing with 3 nodes, and that seems OK up to around 3,000
   cores total. I'm thinking of testing with more nodes.
 
  I have opened an issue for the problems I encountered while recreating a
  config similar to yours, which I have been doing on Linux.
 
  https://issues.apache.org/jira/browse/SOLR-7191
 
  It's possible that the only thing the issue will lead to is improvements
  in the documentation, but I'm hopeful that there will be code
  improvements too.
 
  Thanks,
  Shawn
 
 


 --
 Damien Kamerman



Re: solr cloud does not start with many collections

2015-03-05 Thread Damien Kamerman
I've tried a few variations, with 3 x ZK, 6 x nodes, Solr 4.10.3 and Solr
5.0, without any success and no real difference. There is a tipping point
at around 3,000-4,000 cores (varies depending on hardware): below it I can
restart the cloud OK within ~4min; above it the cloud does not come back
up and logs continuous 'conflicting information about the leader of
shard' warnings.

On 5 March 2015 at 14:15, Shawn Heisey apa...@elyograg.org wrote:

 On 3/4/2015 5:37 PM, Damien Kamerman wrote:
  I'm running on Solaris x86, I have plenty of memory and no real limits
  # plimit 15560
  15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
  -XX:MaxMetasp
 resource  current maximum
time(seconds) unlimited   unlimited
file(blocks)  unlimited   unlimited
data(kbytes)  unlimited   unlimited
stack(kbytes) unlimited   unlimited
coredump(blocks)  unlimited   unlimited
nofiles(descriptors)  65536   65536
vmemory(kbytes)   unlimited   unlimited
 
  I've been testing with 3 nodes, and that seems OK up to around 3,000
  cores total. I'm thinking of testing with more nodes.

 I have opened an issue for the problems I encountered while recreating a
 config similar to yours, which I have been doing on Linux.

 https://issues.apache.org/jira/browse/SOLR-7191

 It's possible that the only thing the issue will lead to is improvements
 in the documentation, but I'm hopeful that there will be code
 improvements too.

 Thanks,
 Shawn




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-03-04 Thread Shawn Heisey
On 3/4/2015 2:09 AM, Shawn Heisey wrote:
 I've come to one major conclusion about this whole thing, even before
 I reach the magic number of 4000 collections. Thousands of collections
 is not at all practical with SolrCloud currently.

I've now encountered a new problem.  I may have been hasty in declaring
that an increase of jute.maxbuffer is not required.  There are now 3715
collections, and I've seen a zookeeper exception that may indicate an
increase actually is required.  I have added that parameter to the
startup and when I have some time to look deeper, I will see whether
that helps.

Before 5.0, the maxbuffer would have been exceeded by only a few hundred
collections ... so this is definitely progress.
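
For anyone following along: jute.maxbuffer is a JVM system property, and it
has to be set to the same value on the ZooKeeper servers and on every Solr
node. A sketch using the 64MB value from earlier in this thread:

# ZooKeeper side: e.g. in conf/java.env
SERVER_JVMFLAGS="-Djute.maxbuffer=67108864"

# Solr 5.x side: pass extra JVM args through bin/solr
bin/solr start -c -a "-Djute.maxbuffer=67108864"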

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-03-04 Thread Damien Kamerman
I'm running on Solaris x86, I have plenty of memory and no real limits
# plimit 15560
15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
-XX:MaxMetasp
   resource  current maximum
  time(seconds) unlimited   unlimited
  file(blocks)  unlimited   unlimited
  data(kbytes)  unlimited   unlimited
  stack(kbytes) unlimited   unlimited
  coredump(blocks)  unlimited   unlimited
  nofiles(descriptors)  65536   65536
  vmemory(kbytes)   unlimited   unlimited

I've been testing with 3 nodes, and that seems OK up to around 3,000 cores
total. I'm thinking of testing with more nodes.


On 5 March 2015 at 05:28, Shawn Heisey apa...@elyograg.org wrote:

 On 3/4/2015 2:09 AM, Shawn Heisey wrote:
  I've come to one major conclusion about this whole thing, even before
  I reach the magic number of 4000 collections. Thousands of collections
  is not at all practical with SolrCloud currently.

 I've now encountered a new problem.  I may have been hasty in declaring
 that an increase of jute.maxbuffer is not required.  There are now 3715
 collections, and I've seen a zookeeper exception that may indicate an
 increase actually is required.  I have added that parameter to the
 startup and when I have some time to look deeper, I will see whether
 that helps.

 Before 5.0, the maxbuffer would have been exceeded by only a few hundred
 collections ... so this is definitely progress.

 Thanks,
 Shawn




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-03-04 Thread Shawn Heisey
On 3/4/2015 5:37 PM, Damien Kamerman wrote:
 I'm running on Solaris x86, I have plenty of memory and no real limits
 # plimit 15560
 15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
 -XX:MaxMetasp
resource  current maximum
   time(seconds) unlimited   unlimited
   file(blocks)  unlimited   unlimited
   data(kbytes)  unlimited   unlimited
   stack(kbytes) unlimited   unlimited
   coredump(blocks)  unlimited   unlimited
   nofiles(descriptors)  65536   65536
   vmemory(kbytes)   unlimited   unlimited
 
 I've been testing with 3 nodes, and that seems OK up to around 3,000 cores
 total. I'm thinking of testing with more nodes.

I have opened an issue for the problems I encountered while recreating a
config similar to yours, which I have been doing on Linux.

https://issues.apache.org/jira/browse/SOLR-7191

It's possible that the only thing the issue will lead to is improvements
in the documentation, but I'm hopeful that there will be code
improvements too.

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-03-04 Thread Shawn Heisey
On 3/3/2015 9:22 PM, Damien Kamerman wrote:
 I've done a similar thing to create the collections. You're going to need
 more memory I think.
 
 OK, so the maxThreads limit on Jetty could be causing a distributed deadlock?

I don't know what the exact problems would be if maxThreads is reached.
It's probably unpredictable.

With 2674 collections added, 5GB wasn't enough heap.  I started getting
a ton of exceptions during collection creation.  I had to shut down both
Solr instances.  When I brought up the first instance with a 7GB heap
(the one with the embedded zk), it took exactly half an hour for jetty
to start listening on port 8983, and about two hours total for it to
stabilize to the point where everything for that node was green on the
cloud graph.

Even now, nearly three hours after startup, the Solr log is still
spitting out thousands of lines that look like this, so I don't think I
can call it stable:

INFO  - 2015-03-04 07:35:51.166;
org.apache.solr.common.cloud.ZkStateReader; Updating data for mycoll1515
to ver 60

I'm going to try bringing up the other Solr instance now, and if that
stabilizes with all shards in the green, I will try to continue adding
collections.

Side note: I have been able to confirm with these tests that version 5.0
no longer requires increasing jute.maxbuffer to run many collections.
I'm still running with the default value and zookeeper has had no
problems handling all the data.

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-03-04 Thread Shawn Heisey
On 3/4/2015 1:02 AM, Shawn Heisey wrote:
 Even now, nearly three hours after startup, the Solr log is still
 spitting out thousands of lines that look like this, so I don't think I
 can call it stable:
 
 INFO  - 2015-03-04 07:35:51.166;
 org.apache.solr.common.cloud.ZkStateReader; Updating data for mycoll1515
 to ver 60
 
 I'm going to try bringing up the other Solr instance now, and if that
 stabilizes with all shards in the green, I will try to continue adding
 collections.

I've come to one major conclusion about this whole thing, even before I
reach the magic number of 4000 collections.  Thousands of collections is
not at all practical with SolrCloud currently.  Some additional
conclusions about this setup:

* Stopping and restarting the entire cluster will quite literally take
hours for full stability.  A rolling restart *might* go faster, but
honestly I would not count on that.

* An external zookeeper ensemble is absolutely critical.  Zookeeper
stability is extremely important.

* A lot of heap memory is required, even if the indexes are completely
empty and there is no query/index activity.  Active indexes with data
are going to push that even higher, and will very likely slow down
recovery on server restart.

* Operating system limits for the max number of open files and max
number of processes allowed will need to be reconfigured - these are
settings that are NOT managed by Solr or Jetty.  Configuration may vary
widely between different operating systems; see the sketch after this list.

* Thousands of collections *might* work OK if there are enough servers
so that each one doesn't have more than a couple hundred cores.  This
would need to be tested, and I don't have the available hardware.
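
As a concrete illustration of the OS-limits point above, a Linux sketch
(values are illustrative starting points, not recommendations; Solaris uses
plimit/projects instead, as shown earlier in this thread):

# check the limits for the user that runs Solr
ulimit -n   # max open files
ulimit -u   # max user processes (this also caps the number of threads)

# raise them persistently in /etc/security/limits.conf, e.g.:
#   solr  soft  nofile  65536
#   solr  hard  nofile  65536
#   solr  soft  nproc   32768
#   solr  hard  nproc   32768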

I'm not sure that the OP's problem can actually be called a bug ... it's
more of a performance limitation.  We should still file an issue and
treat it like a bug, though.

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-03-03 Thread Damien Kamerman
About one minute after startup I sometimes see
'org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes
published as DOWN in our cluster state.'
And I see the 'Still seeing conflicting information about the leader of
shard' warnings after about 5 minutes.
Thanks Shawn, I will create an issue.

On 4 March 2015 at 01:10, Shawn Heisey apa...@elyograg.org wrote:

 On 3/3/2015 6:55 AM, Shawn Heisey wrote:
  With a longer zkClientTimeout, does the failure happen on a later
  collection?  I had hoped that it would solve the problem, but I'm
  curious about whether it was able to load more collections before it
  finally died, or whether it made no difference... and whether the
  message now indicates 40 seconds or if it still says 30.

 I have found the code that produces the message, and the wait for this
 particular section is hardcoded to 30 seconds.  That means the timeout
 won't affect it.

 If you move the Solr log so it creates a new one from startup, how long
 does it take after startup begins before you see the failure that
 indicates the conflicting leader information hasn't resolved?

 This most likely is a bug ... our SolrCloud experts will need to
 investigate to find it, so we need as much information as you can provide.

 Thanks,
 Shawn




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-03-03 Thread Shawn Heisey
On 3/2/2015 12:54 AM, Damien Kamerman wrote:
 I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
 collections from scratch and then attempted to stop/start the cloud.

I have been trying to duplicate your setup using the -e cloud example
included in the Solr 5.0 download and accepting all the defaults.  This
sets up two Solr instances on one machine, one of which runs an embedded
zookeeper.

I have been running into a LOT of issues just trying to get so many
collections created, to say nothing about restart problems.

The first problem I ran into was heap size.  The example starts each of
the Solr instances with a 512MB heap, which is WAY too small.  It
allowed me to create 274 collections, in addition to the gettingstarted
collection that the example started with.  One of the Solr instances
simply crashed.  No OutOfMemoryError or anything else in the log ...
it just died.

I bumped the heap on each Solr instance to 4GB.  The next problem I ran
into was the operating system limit on the number of processes ... and I
had already bumped that up beyond the usual 1024 default, to 4096.  Solr
was not able to create any more threads, because my user was not able to
fork any more processes.  I got over 700 collections created before that
became a problem.  My max open files had also been increased already --
this is another place where a stock system will run into trouble
creating a lot of collections.

I fixed that, and the next problem I ran into was total RAM on the
machine ... it turns out that with two Solr processes each using 4GB, I
was dipping 3GB into swap.  This is odd, because I have 12GB of RAM
on that machine and it's not doing very much besides this SolrCloud
test.  Swapping means that performance was completely unacceptable and
it would probably never finish.

So ... I had to find a machine with more memory.  I've got a dev server
with 32GB.  I fired up the two SolrCloud processes on it with 5GB heap
each, with 32768 processes allowed.  I am in the process of building
4000 collections (numShards=2, replicationFactor=1), and so far, it is
working OK.  I have almost 2700 collections now.

If I can ever get it to actually build 4000 collections, then I can
attempt restarting the second Solr instance and see what happens.  I
think I might hit another roadblock in the form of the 10,000 maxThreads
limit on Jetty.  Running this all on one machine might
not be possible, but I'm giving it a try.
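
For reference, that cap lives in the jetty.xml bundled with Solr; a sketch
of the stock thread-pool setting (the exact file layout varies between
Solr/Jetty versions, so treat this as illustrative):

<!-- server/etc/jetty.xml (sketch): Solr ships with a 10,000-thread cap -->
<New class="org.eclipse.jetty.util.thread.QueuedThreadPool">
  <Set name="minThreads">10</Set>
  <Set name="maxThreads">10000</Set>
</New>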

Here's the script I am using to create all those collections:

#!/bin/sh

# Create mycoll0000..mycoll3999: two shards each, one replica per shard,
# all sharing the gettingstarted configset already in ZooKeeper.
for i in `seq -f %04.0f 0 3999`
do
  echo $i
  coll=mycoll${i}
  URL="http://localhost:8983/solr/admin/collections"
  URL="${URL}?action=CREATE&name=${coll}&numShards=2&replicationFactor=1"
  URL="${URL}&collection.configName=gettingstarted"
  curl "$URL"
done

Thanks,
Shawn


Re: solr cloud does not start with many collections

2015-03-03 Thread Damien Kamerman
I've done a similar thing to create the collections. You're going to need
more memory I think.

OK, so the maxThreads limit on Jetty could be causing a distributed deadlock?


On 4 March 2015 at 13:18, Shawn Heisey apa...@elyograg.org wrote:

 On 3/2/2015 12:54 AM, Damien Kamerman wrote:
  I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
  collections from scratch and then attempted to stop/start the cloud.

 I have been trying to duplicate your setup using the -e cloud example
 included in the Solr 5.0 download and accepting all the defaults.  This
 sets up two Solr instances on one machine, one of which runs an embedded
 zookeeper.

 I have been running into a LOT of issues just trying to get so many
 collections created, to say nothing about restart problems.

 The first problem I ran into was heap size.  The example starts each of
 the Solr instances with a 512MB heap, which is WAY too small.  It
 allowed me to create 274 collections, in addition to the gettingstarted
 collection that the example started with.  One of the Solr instances
 simply crashed.  No OutOfMemoryError or anything else in the log ...
 it just died.

 I bumped the heap on each Solr instance to 4GB.  The next problem I ran
 into was the operating system limit on the number of processes ... and I
 had already bumped that up beyond the usual 1024 default, to 4096.  Solr
 was not able to create any more threads, because my user was not able to
 fork any more processes.  I got over 700 collections created before that
 became a problem.  My max open files had also been increased already --
 this is another place where a stock system will run into trouble
 creating a lot of collections.

 I fixed that, and the next problem I ran into was total RAM on the
 machine ... it turns out that with two Solr processes each using 4GB, I
 was dipping 3GB into swap.  This is odd, because I have 12GB of RAM
 on that machine and it's not doing very much besides this SolrCloud
 test.  Swapping means that performance was completely unacceptable and
 it would probably never finish.

 So ... I had to find a machine with more memory.  I've got a dev server
 with 32GB.  I fired up the two SolrCloud processes on it with 5GB heap
 each, with 32768 processes allowed.  I am in the process of building
 4000 collections (numShards=2, replicationFactor=1), and so far, it is
 working OK.  I have almost 2700 collections now.

 If I can ever get it to actually build 4000 collections, then I can
 attempt restarting the second Solr instance and see what happens.  I
 think I might hit another roadblock in the form of the 10,000 maxThreads
 limit on Jetty.  Running this all on one machine might
 not be possible, but I'm giving it a try.

 Here's the script I am using to create all those collections:

 #!/bin/sh

 # Create mycoll0000..mycoll3999: two shards each, one replica per shard,
 # all sharing the gettingstarted configset already in ZooKeeper.
 for i in `seq -f %04.0f 0 3999`
 do
   echo $i
   coll=mycoll${i}
   URL="http://localhost:8983/solr/admin/collections"
   URL="${URL}?action=CREATE&name=${coll}&numShards=2&replicationFactor=1"
   URL="${URL}&collection.configName=gettingstarted"
   curl "$URL"
 done

 Thanks,
 Shawn




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-03-03 Thread Shawn Heisey
On 3/3/2015 6:55 AM, Shawn Heisey wrote:
 With a longer zkClientTimeout, does the failure happen on a later
 collection?  I had hoped that it would solve the problem, but I'm
 curious about whether it was able to load more collections before it
 finally died, or whether it made no difference... and whether the
 message now indicates 40 seconds or if it still says 30.

I have found the code that produces the message, and the wait for this
particular section is hardcoded to 30 seconds.  That means the timeout
won't affect it.

If you move the Solr log so it creates a new one from startup, how long
does it take after startup begins before you see the failure that
indicates the conflicting leader information hasn't resolved?

This most likely is a bug ... our SolrCloud experts will need to
investigate to find it, so we need as much information as you can provide.

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-03-03 Thread Shawn Heisey
On 3/3/2015 12:42 AM, Damien Kamerman wrote:
 Still no luck starting Solr with a 40s zkClientTimeout. I'm not seeing any
 expired sessions...

 There must be a way to start Solr with many collections. It runs fine ...
 until a restart is required.

With a longer zkClientTimeout, does the failure happen on a later
collection?  I had hoped that it would solve the problem, but I'm
curious about whether it was able to load more collections before it
finally died, or whether it made no difference... and whether the
message now indicates 40 seconds or if it still says 30.

Either way, I think we have reached the point where filing an issue in
Jira is appropriate.  You have the best information, so you should
create the issue.  The main description should fully describe the
problem, but not have exhaustive detail.  You can include lots of
supporting detail in attachments and in the comments.

https://issues.apache.org/jira/browse/SOLR

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-03-02 Thread Damien Kamerman
Still no luck starting Solr with a 40s zkClientTimeout. I'm not seeing any
expired sessions...

There must be a way to start Solr with many collections. It runs fine ...
until a restart is required.

On 3 March 2015 at 03:33, Shawn Heisey apa...@elyograg.org wrote:

 On 3/2/2015 12:54 AM, Damien Kamerman wrote:
  I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
  collections from scratch and then attempted to stop/start the cloud.
 
  node1:
  WARN  - 2015-03-02 18:09:02.371;
  org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
  WARN  - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController;
  Timed out waiting to see all nodes published as DOWN in our cluster state.
  WARN  - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController;
  Still seeing conflicting information about the leader of shard shard1 for
  collection DD-3219 after 30 seconds; our state says
  http://host:8002/solr/DD-3219_shard1_replica1/, but ZooKeeper says
  http://host:8000/solr/DD-3219_shard1_replica2/

  node2:
  WARN  - 2015-03-02 18:09:01.871;
  org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
  WARN  - 2015-03-02 18:17:04.458;
  org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered,
  but Solr cannot talk to ZK
  [stop/start]
  WARN  - 2015-03-02 18:53:12.725;
  org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
  WARN  - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController;
  Still seeing conflicting information about the leader of shard shard1 for
  collection DD-3581 after 30 seconds; our state says
  http://host:8001/solr/DD-3581_shard1_replica2/, but ZooKeeper says
  http://host:8002/solr/DD-3581_shard1_replica1/

  node3:
  WARN  - 2015-03-02 18:09:03.022;
  org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
  WARN  - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController;
  Timed out waiting to see all nodes published as DOWN in our cluster state.
  WARN  - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController;
  Still seeing conflicting information about the leader of shard shard1 for
  collection DD-2707 after 30 seconds; our state says
  http://host:8002/solr/DD-2707_shard1_replica2/, but ZooKeeper says
  http://host:8000/solr/DD-2707_shard1_replica1/

 I'm sorry to hear that 5.0 didn't fix the problem.  I really hoped that
 it would.

 There is one other thing I'd like to try before you file a bug --
 increasing zkClientTimeout to 40 seconds, to see whether it changes the
 point at which it fails (or allows it to succeed).  With the
 default tickTime (2 seconds), the maximum time you can set
 zkClientTimeout to is 40 seconds ... which in normal circumstances is a
 VERY long time.  In your situation, at least with the code in its
 current state, 30 seconds (I'm pretty sure this is the default in 5.0)
 may simply not be enough.


 https://cwiki.apache.org/confluence/display/solr/Parameter+Reference#ParameterReference-SolrCloudInstanceZooKeeperParameters

 I think filing a bug, even if 40 seconds allows this to succeed, is a
 good idea ... but you might want to wait for some of the cloud experts
 to look at your logs to see if they have anything to add.

 Thanks,
 Shawn




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-03-02 Thread Shawn Heisey
On 3/2/2015 12:54 AM, Damien Kamerman wrote:
 I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
 collections from scratch and then attempted to stop/start the cloud.

 node1:
 WARN  - 2015-03-02 18:09:02.371;
 org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
 WARN  - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController;
 Timed out waiting to see all nodes published as DOWN in our cluster state.
 WARN  - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController;
 Still seeing conflicting information about the leader of shard shard1 for
 collection DD-3219 after 30 seconds; our state says
 http://host:8002/solr/DD-3219_shard1_replica1/, but ZooKeeper says
 http://host:8000/solr/DD-3219_shard1_replica2/

 node2:
 WARN  - 2015-03-02 18:09:01.871;
 org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
 WARN  - 2015-03-02 18:17:04.458;
 org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered,
 but Solr cannot talk to ZK
 [stop/start]
 WARN  - 2015-03-02 18:53:12.725;
 org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
 WARN  - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController;
 Still seeing conflicting information about the leader of shard shard1 for
 collection DD-3581 after 30 seconds; our state says
 http://host:8001/solr/DD-3581_shard1_replica2/, but ZooKeeper says
 http://host:8002/solr/DD-3581_shard1_replica1/

 node3:
 WARN  - 2015-03-02 18:09:03.022;
 org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
 WARN  - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController;
 Timed out waiting to see all nodes published as DOWN in our cluster state.
 WARN  - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController;
 Still seeing conflicting information about the leader of shard shard1 for
 collection DD-2707 after 30 seconds; our state says
 http://host:8002/solr/DD-2707_shard1_replica2/, but ZooKeeper says
 http://host:8000/solr/DD-2707_shard1_replica1/

I'm sorry to hear that 5.0 didn't fix the problem.  I really hoped that
it would.

There is one other thing I'd like to try before you file a bug --
increasing zkClientTimeout to 40 seconds, to see whether it changes the
point at which it fails (or allows it to succeed).  With the
default tickTime (2 seconds), the maximum time you can set
zkClientTimeout to is 40 seconds ... which in normal circumstances is a
VERY long time.  In your situation, at least with the code in its
current state, 30 seconds (I'm pretty sure this is the default in 5.0)
may simply not be enough.

https://cwiki.apache.org/confluence/display/solr/Parameter+Reference#ParameterReference-SolrCloudInstanceZooKeeperParameters
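
As a concrete sketch of the change (the stock solr.xml in 5.0 reads
zkClientTimeout from a system property, and ZooKeeper caps session timeouts
at 20 * tickTime, i.e. 40s with the default 2s tick; ZK hosts below are
illustrative):

# restart each node with a 40-second ZK session timeout (value in ms)
bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181 -a "-DzkClientTimeout=40000"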

I think filing a bug, even if 40 seconds allows this to succeed, is a
good idea ... but you might want to wait for some of the cloud experts
to look at your logs to see if they have anything to add.

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-03-01 Thread Damien Kamerman
I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
collections from scratch and then attempted to stop/start the cloud.

node1:
WARN  - 2015-03-02 18:09:02.371;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController;
Timed out waiting to see all nodes published as DOWN in our cluster state.
WARN  - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController;
Still seeing conflicting information about the leader of shard shard1 for
collection DD-3219 after 30 seconds; our state says
http://host:8002/solr/DD-3219_shard1_replica1/, but ZooKeeper says
http://host:8000/solr/DD-3219_shard1_replica2/

node2:
WARN  - 2015-03-02 18:09:01.871;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:17:04.458;
org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered,
but Solr cannot talk to ZK
[stop/start]
WARN  - 2015-03-02 18:53:12.725;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController;
Still seeing conflicting information about the leader of shard shard1 for
collection DD-3581 after 30 seconds; our state says
http://host:8001/solr/DD-3581_shard1_replica2/, but ZooKeeper says
http://host:8002/solr/DD-3581_shard1_replica1/

node3:
WARN  - 2015-03-02 18:09:03.022;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController;
Timed out waiting to see all nodes published as DOWN in our cluster state.
WARN  - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController;
Still seeing conflicting information about the leader of shard shard1 for
collection DD-2707 after 30 seconds; our state says
http://host:8002/solr/DD-2707_shard1_replica2/, but ZooKeeper says
http://host:8000/solr/DD-2707_shard1_replica1/



On 27 February 2015 at 17:48, Shawn Heisey apa...@elyograg.org wrote:

 On 2/26/2015 11:14 PM, Damien Kamerman wrote:
  I've run into an issue with starting my solr cloud with many collections.
  My setup is:
  3 nodes (solr 4.10.3 ; 64GB RAM each ; jdk1.8.0_25) running on a single
  server (256GB RAM).
  5,000 collections (1 x shard ; 2 x replica) = 10,000 cores
  1 x Zookeeper 3.4.6
  Java arg -Djute.maxbuffer=67108864 added to solr and ZK.
 
  Then I stop all nodes, then start all nodes. All replicas are in the down
  state, some have no leader. At times I have seen some (12 or so) leaders in
  the active state. In the solr logs I see lots of:
 
  org.apache.solr.cloud.ZkController; Still seeing conflicting information
  about the leader of shard shard1 for collection DD-4351 after 30
  seconds; our state says http://ftea1:8001/solr/DD-4351_shard1_replica1/,
  but ZooKeeper says http://ftea1:8000/solr/DD-4351_shard1_replica2/

 snip

  I've tried staggering the starts (1min) but it does not help.
  I've reproduced with zero documents.
  Restarts are OK up to around 3,000 cores.
  Should this work?

 This is going to push SolrCloud beyond its limits.  Is this just an
 exercise to see how far you can push Solr, or are you looking at setting
 up a production install with several thousand collections?

 In Solr 4.x, the clusterstate is one giant JSON structure containing the
 state of the entire cloud.  With 5000 collections, the entire thing
 would need to be downloaded and uploaded at least 5000 times during the
 course of a successful full system startup ... and I think with
 replicationFactor set to 2, that might actually be 10,000 times.  The
 best-case scenario is that it would take a VERY long time; the
 worst-case scenario is that concurrency problems would lead to a
 deadlock.  A deadlock might be what is happening here.

 In Solr 5.x, the clusterstate is broken up so there's a separate state
 structure for each collection.  This setup allows for faster and safer
 multi-threading and far less data transfer.  Assuming I understand the
 implications correctly, there might not be any need to increase
 jute.maxbuffer with 5.x ... although I have to assume that I might be
 wrong about that.

 I would very much recommend that you set your scenario up from scratch
 in Solr 5.0.0, to see if the new clusterstate format can eliminate the
 problem you're seeing.  If it doesn't, then we can pursue it as a likely
 bug in the 5.x branch and you can file an issue in Jira.

 Thanks,
 Shawn




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-02-26 Thread Shawn Heisey
On 2/26/2015 11:14 PM, Damien Kamerman wrote:
 I've run into an issue with starting my solr cloud with many collections.
 My setup is:
 3 nodes (solr 4.10.3 ; 64GB RAM each ; jdk1.8.0_25) running on a single
 server (256GB RAM).
 5,000 collections (1 x shard ; 2 x replica) = 10,000 cores
 1 x Zookeeper 3.4.6
 Java arg -Djute.maxbuffer=67108864 added to solr and ZK.
 
 Then I stop all nodes, then start all nodes. All replicas are in the down
 state, some have no leader. At times I have seen some (12 or so) leaders in
 the active state. In the solr logs I see lots of:
 
 org.apache.solr.cloud.ZkController; Still seeing conflicting information
 about the leader of shard shard1 for collection DD-4351 after 30
 seconds; our state says http://ftea1:8001/solr/DD-4351_shard1_replica1/,
 but ZooKeeper says http://ftea1:8000/solr/DD-4351_shard1_replica2/

snip

 I've tried staggering the starts (1min) but it does not help.
 I've reproduced with zero documents.
 Restarts are OK up to around 3,000 cores.
 Should this work?

This is going to push SolrCloud beyond its limits.  Is this just an
exercise to see how far you can push Solr, or are you looking at setting
up a production install with several thousand collections?

In Solr 4.x, the clusterstate is one giant JSON structure containing the
state of the entire cloud.  With 5000 collections, the entire thing
would need to be downloaded and uploaded at least 5000 times during the
course of a successful full system startup ... and I think with
replicationFactor set to 2, that might actually be 10,000 times.  The
best-case scenario is that it would take a VERY long time; the
worst-case scenario is that concurrency problems would lead to a
deadlock.  A deadlock might be what is happening here.

In Solr 5.x, the clusterstate is broken up so there's a separate state
structure for each collection.  This setup allows for faster and safer
multi-threading and far less data transfer.  Assuming I understand the
implications correctly, there might not be any need to increase
jute.maxbuffer with 5.x ... although I have to assume that I might be
wrong about that.
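
One way to see the difference is with the ZooKeeper CLI (a sketch; these
are the standard SolrCloud znode paths, and the collection name is
illustrative):

# Solr 4.x: one global blob that every node watches and re-reads
zkCli.sh -server localhost:2181 get /clusterstate.json

# Solr 5.x (state format 2): per-collection state
zkCli.sh -server localhost:2181 get /collections/mycoll0001/state.json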

I would very much recommend that you set your scenario up from scratch
in Solr 5.0.0, to see if the new clusterstate format can eliminate the
problem you're seeing.  If it doesn't, then we can pursue it as a likely
bug in the 5.x branch and you can file an issue in Jira.

Thanks,
Shawn



Re: solr cloud does not start with many collections

2015-02-26 Thread Damien Kamerman
Oh, and I was wondering if 'leaderVoteWait' might help in Solr 4.
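
For reference, leaderVoteWait is a <solrcloud> setting in solr.xml; a
sketch (the wait is in milliseconds, and the value shown is only
illustrative):

<solrcloud>
  <!-- how long a node waits to see other replicas of a shard come up
       before taking over as leader itself -->
  <int name="leaderVoteWait">${leaderVoteWait:180000}</int>
</solrcloud>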

On 27 February 2015 at 18:04, Damien Kamerman dami...@gmail.com wrote:

 This is going to push SolrCloud beyond its limits.  Is this just an
 exercise to see how far you can push Solr, or are you looking at setting
 up a production install with several thousand collections?


 I'm looking towards production.


 In Solr 4.x, the clusterstate is one giant JSON structure containing the
 state of the entire cloud.  With 5000 collections, the entire thing
 would need to be downloaded and uploaded at least 5000 times during the
 course of a successful full system startup ... and I think with
 replicationFactor set to 2, that might actually be 10,000 times.  The
 best-case scenario is that it would take a VERY long time; the
 worst-case scenario is that concurrency problems would lead to a
 deadlock.  A deadlock might be what is happening here.


 Yes, clusterstate.json is 3.3M. At times on startup I think it does
 deadlock; log shows after 1min:
 org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes
 published as DOWN in our cluster state.


 In Solr 5.x, the clusterstate is broken up so there's a separate state
 structure for each collection.  This setup allows for faster and safer
 multi-threading and far less data transfer.  Assuming I understand the
 implications correctly, there might not be any need to increase
 jute.maxbuffer with 5.x ... although I have to assume that I might be
 wrong about that.

 I would very much recommend that you set your scenario up from scratch
 in Solr 5.0.0, to see if the new clusterstate format can eliminate the
 problem you're seeing.  If it doesn't, then we can pursue it as a likely
 bug in the 5.x branch and you can file an issue in Jira.


 Thanks, will test in Solr 5.0.0.




-- 
Damien Kamerman


Re: solr cloud does not start with many collections

2015-02-26 Thread Damien Kamerman

 This is going to push SolrCloud beyond its limits.  Is this just an
 exercise to see how far you can push Solr, or are you looking at setting
 up a production install with several thousand collections?


I'm looking towards production.


 In Solr 4.x, the clusterstate is one giant JSON structure containing the
 state of the entire cloud.  With 5000 collections, the entire thing
 would need to be downloaded and uploaded at least 5000 times during the
 course of a successful full system startup ... and I think with
 replicationFactor set to 2, that might actually be 10,000 times.  The
 best-case scenario is that it would take a VERY long time; the
 worst-case scenario is that concurrency problems would lead to a
 deadlock.  A deadlock might be what is happening here.


Yes, clusterstate.json is 3.3M. At times on startup I think it does
deadlock; log shows after 1min:
org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes
published as DOWN in our cluster state.


 In Solr 5.x, the clusterstate is broken up so there's a separate state
 structure for each collection.  This setup allows for faster and safer
 multi-threading and far less data transfer.  Assuming I understand the
 implications correctly, there might not be any need to increase
 jute.maxbuffer with 5.x ... although I have to assume that I might be
 wrong about that.

 I would very much recommend that you set your scenario up from scratch
 in Solr 5.0.0, to see if the new clusterstate format can eliminate the
 problem you're seeing.  If it doesn't, then we can pursue it as a likely
 bug in the 5.x branch and you can file an issue in Jira.


Thanks, will test in Solr 5.0.0.


solr cloud does not start with many collections

2015-02-26 Thread Damien Kamerman
I've run into an issue with starting my solr cloud with many collections.
My setup is:
3 nodes (solr 4.10.3 ; 64GB RAM each ; jdk1.8.0_25) running on a single
server (256GB RAM).
5,000 collections (1 x shard ; 2 x replica) = 10,000 cores
1 x Zookeeper 3.4.6
Java arg -Djute.maxbuffer=67108864 added to solr and ZK.

Then I stop all nodes, then start all nodes. All replicas are in the down
state, some have no leader. At times I have seen some (12 or so) leaders in
the active state. In the solr logs I see lots of:

org.apache.solr.cloud.ZkController; Still seeing conflicting information
about the leader of shard shard1 for collection DD-4351 after 30
seconds; our state says http://ftea1:8001/solr/DD-4351_shard1_replica1/,
but ZooKeeper says http://ftea1:8000/solr/DD-4351_shard1_replica2/

org.apache.solr.common.SolrException;
:org.apache.solr.common.SolrException: Error getting leader from zk for
shard shard1
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:910)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:822)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:770)
        at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:221)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: There is conflicting
information about the leader of shard: shard1 our state says:
http://ftea1:8001/solr/DD-1564_shard1_replica2/ but zookeeper says:
http://ftea1:8000/solr/DD-1564_shard1_replica1/
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:889)
        ... 6 more

I've tried staggering the starts (1min) but it does not help.
I've reproduced with zero documents.
Restarts are OK up to around 3,000 cores.
Should this work?

Damien.