Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-24 Thread Neil Prosser
Great. Thanks for your suggestions. I'll go through them and see what I can
come up with to try and tame my GC pauses. I'll also make sure I upgrade to
4.4 before I start. Then at least I know I've got all the latest changes.

In the meantime, does anyone have any idea why I am able to get leaders who
are marked as down? I've just had the situation where, of two nodes hosting
replicas of the same shard, the leader was alive and marked as down and the
other replica was gone. I could perform searches directly on the two nodes
(with distrib=false) and once I'd restarted the node which was down the
leader sprung into life. I assume that since there was a change in
clusterstate.json it forced the leader to reconsider what it was up to.

Does anyone know the hole my nodes are falling into? Is it likely to be
tied up in my GC woes?
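
For reference, the direct checks were nothing fancier than hitting a single
node with distrib=false. A rough sketch of that kind of check, using Python's
requests library (host, port and core name below are just placeholders):

import requests

# Placeholder host/core; point this at one specific replica to see only
# the documents that node holds locally.
CORE_URL = "http://server04:8983/solr/collection1"

params = {
    "q": "*:*",
    "rows": 0,           # only interested in numFound
    "distrib": "false",  # don't fan the query out to other shards/replicas
    "wt": "json",
}

resp = requests.get(CORE_URL + "/select", params=params, timeout=10)
resp.raise_for_status()
print("local numFound:", resp.json()["response"]["numFound"])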


On 23 July 2013 13:06, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

 Hi,

 On Tue, Jul 23, 2013 at 8:02 AM, Erick Erickson erickerick...@gmail.com
 wrote:
  Neil:
 
  Here's a must-read blog about why allocating more memory
  to the JVM than Solr requires is a Bad Thing:
  http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
 
  It turns out that you actually do yourself harm by allocating more
  memory to the JVM than it really needs. Of course the problem is
  figuring out how much it really needs, which is pretty tricky.
 
  Your long GC pauses _might_ be ameliorated by allocating _less_
  memory to the JVM, counterintuitive as that seems.

 or by using G1 :)

 See http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/

 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm


  On Mon, Jul 22, 2013 at 5:05 PM, Neil Prosser neil.pros...@gmail.com
 wrote:
  I just have a little python script which I run with cron (luckily that's
  the granularity we have in Graphite). It reads the same JSON the admin
 UI
  displays and dumps numeric values into Graphite.
 
  I can open source it if you like. I just need to make sure I remove any
  hacks/shortcuts that I've taken because I'm working with our cluster!
 
 
  On 22 July 2013 19:26, Lance Norskog goks...@gmail.com wrote:
 
  Are you feeding Graphite from Solr? If so, how?
 
 
  On 07/19/2013 01:02 AM, Neil Prosser wrote:
 
  That was overnight so I was unable to track exactly what happened (I'm
  going off our Graphite graphs here).
 
 
 



Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-24 Thread Timothy Potter
Log messages?

On Wed, Jul 24, 2013 at 1:37 AM, Neil Prosser neil.pros...@gmail.com wrote:
 Great. Thanks for your suggestions. I'll go through them and see what I can
 come up with to try and tame my GC pauses. I'll also make sure I upgrade to
 4.4 before I start. Then at least I know I've got all the latest changes.

 In the meantime, does anyone have any idea why I am able to get leaders who
 are marked as down? I've just had the situation where of two nodes hosting
 replicas of the same shard the leader was alive and marked as down and the
 other replica was gone. I could perform searches directly on the two nodes
 (with distrib=false) and once I'd restarted the node which was down the
 leader sprung into life. I assume that since there was a change in
 clusterstate.json it forced the leader to reconsider what it was up to.

 Does anyone know the hole my nodes are falling into? Is it likely to be
 tied up in my GC woes?


 On 23 July 2013 13:06, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

 Hi,

 On Tue, Jul 23, 2013 at 8:02 AM, Erick Erickson erickerick...@gmail.com
 wrote:
  Neil:
 
  Here's a must-read blog about why allocating more memory
  to the JVM than Solr requires is a Bad Thing:
  http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
 
  It turns out that you actually do yourself harm by allocating more
  memory to the JVM than it really needs. Of course the problem is
  figuring out how much it really needs, which is pretty tricky.
 
  Your long GC pauses _might_ be ameliorated by allocating _less_
  memory to the JVM, counterintuitive as that seems.

 or by using G1 :)

 See http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/

 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm


  On Mon, Jul 22, 2013 at 5:05 PM, Neil Prosser neil.pros...@gmail.com
 wrote:
  I just have a little python script which I run with cron (luckily that's
  the granularity we have in Graphite). It reads the same JSON the admin
 UI
  displays and dumps numeric values into Graphite.
 
  I can open source it if you like. I just need to make sure I remove any
  hacks/shortcuts that I've taken because I'm working with our cluster!
 
 
  On 22 July 2013 19:26, Lance Norskog goks...@gmail.com wrote:
 
  Are you feeding Graphite from Solr? If so, how?
 
 
  On 07/19/2013 01:02 AM, Neil Prosser wrote:
 
  That was overnight so I was unable to track exactly what happened (I'm
  going off our Graphite graphs here).
 
 
 



Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-24 Thread Neil Prosser
Sorry, good point...

https://gist.github.com/neilprosser/d75a13d9e4b7caba51ab

I've included the log files for two servers hosting the same shard for the
same time period. The logging settings exclude anything below WARN
for org.apache.zookeeper, org.apache.solr.core.SolrCore
and org.apache.solr.update.processor.LogUpdateProcessor. That said, there's
still a lot of spam there.

The log for server09 starts with it throwing OutOfMemoryErrors. At this
point I externally have it listed as recovering. Unfortunately I haven't
got the GC logs for either box in that time period.

The key times that I know of are:

2013-07-24 07:14:08,560 - server04 registers its state as down.
2013-07-24 07:17:38,462 - server04 says it's the new leader (this ties in
with my external Graphite script observing that at 07:17 server04 was both
leader and down).
2013-07-24 07:31:21,667 - I get involved and server09 is restarted.
2013-07-24 07:31:42,408 - server04 updates its cloud state from ZooKeeper
and realises that it's the leader.
2013-07-24 07:31:42,449 - server04 registers its state as active.

I'm sorry there's so much there. I'm still getting used to what's important
for people. Both servers were running 4.3.1. I've since upgraded to 4.4.0.

If you need any more information or want me to do any filtering let me know.



On 24 July 2013 15:50, Timothy Potter thelabd...@gmail.com wrote:

 Log messages?

 On Wed, Jul 24, 2013 at 1:37 AM, Neil Prosser neil.pros...@gmail.com
 wrote:
  Great. Thanks for your suggestions. I'll go through them and see what I
 can
  come up with to try and tame my GC pauses. I'll also make sure I upgrade
 to
  4.4 before I start. Then at least I know I've got all the latest changes.
 
  In the meantime, does anyone have any idea why I am able to get leaders
 who
  are marked as down? I've just had the situation where of two nodes
 hosting
  replicas of the same shard the leader was alive and marked as down and
 the
  other replica was gone. I could perform searches directly on the two
 nodes
  (with distrib=false) and once I'd restarted the node which was down the
  leader sprung into life. I assume that since there was a change in
  clusterstate.json it forced the leader to reconsider what it was up to.
 
  Does anyone know the hole my nodes are falling into? Is it likely to be
  tied up in my GC woes?
 
 
  On 23 July 2013 13:06, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:
 
  Hi,
 
  On Tue, Jul 23, 2013 at 8:02 AM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
   Neil:
  
   Here's a must-read blog about why allocating more memory
   to the JVM than Solr requires is a Bad Thing:
  
 http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
  
   It turns out that you actually do yourself harm by allocating more
   memory to the JVM than it really needs. Of course the problem is
   figuring out how much it really needs, which is pretty tricky.
  
   Your long GC pauses _might_ be ameliorated by allocating _less_
   memory to the JVM, counterintuitive as that seems.
 
  or by using G1 :)
 
  See http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/
 
  Otis
  --
  Solr & ElasticSearch Support -- http://sematext.com/
  Performance Monitoring -- http://sematext.com/spm
 
 
   On Mon, Jul 22, 2013 at 5:05 PM, Neil Prosser neil.pros...@gmail.com
 
  wrote:
   I just have a little python script which I run with cron (luckily
 that's
   the granularity we have in Graphite). It reads the same JSON the
 admin
  UI
   displays and dumps numeric values into Graphite.
  
   I can open source it if you like. I just need to make sure I remove
 any
   hacks/shortcuts that I've taken because I'm working with our cluster!
  
  
   On 22 July 2013 19:26, Lance Norskog goks...@gmail.com wrote:
  
   Are you feeding Graphite from Solr? If so, how?
  
  
   On 07/19/2013 01:02 AM, Neil Prosser wrote:
  
   That was overnight so I was unable to track exactly what happened
 (I'm
   going off our Graphite graphs here).
  
  
  
 



Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-24 Thread Timothy Potter
One thing I'm seeing in your logs is the leaderVoteWait safety
mechanism that I mentioned previously:


2013-07-24 07:06:19,856 INFO  o.a.s.c.ShardLeaderElectionContext -
Waiting until we see more replicas up: total=2 found=1 timeoutin=45792


From Mark M: This is a safety mechanism - you can turn it off by
configuring leaderVoteWait to 0 in solr.xml. This is meant to protect
the case where you stop a shard or it fails and then the first node to
get started back up has stale data - you don't want it to just become
the leader. So we wait to see everyone we know about in the shard up
to 3 or 5 min by default. Then we know all the shards participate in
the leader election and the leader will end up with all updates it
should have. You can lower that wait or turn it off with 0.

NOTE: I tried setting it to 0 and my cluster went haywire, so consider
just lowering it but not making it zero ;-)


From what I can tell, server04 heads into this leaderVoteWait loop
before it declares success as the leader, which is why it is registered
as down. Down is different than gone, so it's likely it can
respond to queries.



On Wed, Jul 24, 2013 at 10:33 AM, Neil Prosser neil.pros...@gmail.com wrote:
 Sorry, good point...

 https://gist.github.com/neilprosser/d75a13d9e4b7caba51ab

 I've included the log files for two servers hosting the same shard for the
 same time period. The logging settings exclude anything below WARN
 for org.apache.zookeeper, org.apache.solr.core.SolrCore
 and org.apache.solr.update.processor.LogUpdateProcessor. That said, there's
 still a lot of spam there.

 The log for server09 starts with it throwing OutOfMemoryErrors. At this
 point I externally have it listed as recovering. Unfortunately I haven't
 got the GC logs for either box in that time period.

 The key times that I know of are:

 2013-07-24 07:14:08,560 - server04 registers its state as down.
 2013-07-24 07:17:38,462 - server04 says it's the new leader (this ties in
 with my external Graphite script observing that at 07:17 server04 was both
 leader and down).
 2013-07-24 07:31:21,667 - I get involved and server09 is restarted.
 2013-07-24 07:31:42,408 - server04 updates its cloud state from ZooKeeper
 and realises that it's the leader.
 2013-07-24 07:31:42,449 - server04 registers its state as active.

 I'm sorry there's so much there. I'm still getting used to what's important
 for people. Both servers were running 4.3.1. I've since upgraded to 4.4.0.

 If you need any more information or want me to do any filtering let me know.



 On 24 July 2013 15:50, Timothy Potter thelabd...@gmail.com wrote:

 Log messages?

 On Wed, Jul 24, 2013 at 1:37 AM, Neil Prosser neil.pros...@gmail.com
 wrote:
  Great. Thanks for your suggestions. I'll go through them and see what I
 can
  come up with to try and tame my GC pauses. I'll also make sure I upgrade
 to
  4.4 before I start. Then at least I know I've got all the latest changes.
 
  In the meantime, does anyone have any idea why I am able to get leaders
 who
  are marked as down? I've just had the situation where of two nodes
 hosting
  replicas of the same shard the leader was alive and marked as down and
 the
  other replica was gone. I could perform searches directly on the two
 nodes
  (with distrib=false) and once I'd restarted the node which was down the
  leader sprung into life. I assume that since there was a change in
  clusterstate.json it forced the leader to reconsider what it was up to.
 
  Does anyone know the hole my nodes are falling into? Is it likely to be
  tied up in my GC woes?
 
 
  On 23 July 2013 13:06, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:
 
  Hi,
 
  On Tue, Jul 23, 2013 at 8:02 AM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
   Neil:
  
   Here's a must-read blog about why allocating more memory
   to the JVM than Solr requires is a Bad Thing:
  
 http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
  
   It turns out that you actually do yourself harm by allocating more
   memory to the JVM than it really needs. Of course the problem is
   figuring out how much it really needs, which is pretty tricky.
  
   Your long GC pauses _might_ be ameliorated by allocating _less_
   memory to the JVM, counterintuitive as that seems.
 
  or by using G1 :)
 
  See http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/
 
  Otis
  --
  Solr & ElasticSearch Support -- http://sematext.com/
  Performance Monitoring -- http://sematext.com/spm
 
 
   On Mon, Jul 22, 2013 at 5:05 PM, Neil Prosser neil.pros...@gmail.com
 
  wrote:
   I just have a little python script which I run with cron (luckily
 that's
   the granularity we have in Graphite). It reads the same JSON the
 admin
  UI
   displays and dumps numeric values into Graphite.
  
   I can open source it if you like. I just need to make sure I remove
 any
   hacks/shortcuts that I've taken because I'm working with our cluster!
  
  
   On 22 July 2013 19:26, Lance Norskog 

Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-24 Thread Shawn Heisey
On 7/24/2013 10:33 AM, Neil Prosser wrote:
 The log for server09 starts with it throwing OutOfMemoryErrors. At this
 point I externally have it listed as recovering. Unfortunately I haven't
 got the GC logs for either box in that time period.

There's a lot of messages in this thread, so I apologize if this has
been dealt with already by previous email messages.

All bets are off when you start throwing OOM errors.  The state of any
java program pretty much becomes completely undefined when you run out
of heap memory.

It just so happens that I just finished updating a wiki page about
reducing heap requirements for another message on the list.  GC pause
problems have already been mentioned, so increasing the heap may not be
a real option here.  Take a look at the following for ways to reduce
your heap requirements:

https://wiki.apache.org/solr/SolrPerformanceProblems#Reducing_heap_requirements

Thanks,
Shawn



Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-24 Thread Neil Prosser
That makes sense about all bets being off. I wanted to make sure that
people whose systems are behaving sensibly weren't going to have problems.

I think I need to tame the base amount of memory the field cache takes. We
currently do boosting on several fields during most queries. We boost by at
least 64 long fields (only one or two per query, but that many fields in
total) so with a maxDocs of around 10m on each shard we're talking about
nearly 5GB of heap just for the field cache. Then I still want to have a
filter cache which protects me from chewing too much CPU during requests.
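
The back-of-the-envelope arithmetic behind that figure, assuming roughly 8
bytes per document for each uninverted long field:

# ~64 long boost fields, ~10m maxDocs per shard, 8 bytes per long value.
fields = 64
max_docs = 10_000_000
bytes_per_long = 8

field_cache_bytes = fields * max_docs * bytes_per_long
print(f"{field_cache_bytes / 2**30:.1f} GiB")  # ~4.8 GiB, i.e. nearly 5GB of heap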

I've seen a lot of concurrent mode failures flying by while staring at the
GC log so I think I need to get GC to kick in a bit sooner and try to limit
those (is it possible to eliminate them completely?). Performance is
definitely suffering while that's happening.

I had a go with G1 and no special options and things slowed down a bit. I'm
beginning to understand what to look for in the CMS GC logs so I'll stick
with that and see where I get to.

Thanks for the link. I'll give it a proper read when I get into work
tomorrow.

I have a feeling that five shards might not be the sweet spot for the spec
of the machines I'm running on. Our goal was to replace five 96GB physicals
with 48GB heaps doing master/slave replication for an index which is at
least 120GB in size. At the moment we're using ten VMs with 24GB of RAM,
8GB heaps and around 10GB of index. These machines are managing to get the
whole index into the Linux OS cache. Hopefully the 5GB minimum for field
cache and 8GB heap is what's causing this trouble right now.


On 24 July 2013 19:06, Shawn Heisey s...@elyograg.org wrote:

 On 7/24/2013 10:33 AM, Neil Prosser wrote:
  The log for server09 starts with it throwing OutOfMemoryErrors. At this
  point I externally have it listed as recovering. Unfortunately I haven't
  got the GC logs for either box in that time period.

 There's a lot of messages in this thread, so I apologize if this has
 been dealt with already by previous email messages.

 All bets are off when you start throwing OOM errors.  The state of any
 java program pretty much becomes completely undefined when you run out
 of heap memory.

 It just so happens that I just finished updating a wiki page about
 reducing heap requirements for another message on the list.  GC pause
 problems have already been mentioned, so increasing the heap may not be
 a real option here.  Take a look at the following for ways to reduce
 your heap requirements:


 https://wiki.apache.org/solr/SolrPerformanceProblems#Reducing_heap_requirements

 Thanks,
 Shawn




Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-23 Thread Erick Erickson
Neil:

Here's a must-read blog about why allocating more memory
to the JVM than Solr requires is a Bad Thing:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

It turns out that you actually do yourself harm by allocating more
memory to the JVM than it really needs. Of course the problem is
figuring out how much it really needs, which is pretty tricky.

Your long GC pauses _might_ be ameliorated by allocating _less_
memory to the JVM, counterintuitive as that seems.

Best
Erick

On Mon, Jul 22, 2013 at 5:05 PM, Neil Prosser neil.pros...@gmail.com wrote:
 I just have a little python script which I run with cron (luckily that's
 the granularity we have in Graphite). It reads the same JSON the admin UI
 displays and dumps numeric values into Graphite.

 I can open source it if you like. I just need to make sure I remove any
 hacks/shortcuts that I've taken because I'm working with our cluster!


 On 22 July 2013 19:26, Lance Norskog goks...@gmail.com wrote:

 Are you feeding Graphite from Solr? If so, how?


 On 07/19/2013 01:02 AM, Neil Prosser wrote:

 That was overnight so I was unable to track exactly what happened (I'm
 going off our Graphite graphs here).





Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-23 Thread Otis Gospodnetic
Hi,

On Tue, Jul 23, 2013 at 8:02 AM, Erick Erickson erickerick...@gmail.com wrote:
 Neil:

 Here's a must-read blog about why allocating more memory
 to the JVM than Solr requires is a Bad Thing:
 http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

 It turns out that you actually do yourself harm by allocating more
 memory to the JVM than it really needs. Of course the problem is
 figuring out how much it really needs, which is pretty tricky.

 Your long GC pauses _might_ be ameliorated by allocating _less_
 memory to the JVM, counterintuitive as that seems.

or by using G1 :)

See http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm


 On Mon, Jul 22, 2013 at 5:05 PM, Neil Prosser neil.pros...@gmail.com wrote:
 I just have a little python script which I run with cron (luckily that's
 the granularity we have in Graphite). It reads the same JSON the admin UI
 displays and dumps numeric values into Graphite.

 I can open source it if you like. I just need to make sure I remove any
 hacks/shortcuts that I've taken because I'm working with our cluster!


 On 22 July 2013 19:26, Lance Norskog goks...@gmail.com wrote:

 Are you feeding Graphite from Solr? If so, how?


 On 07/19/2013 01:02 AM, Neil Prosser wrote:

 That was overnight so I was unable to track exactly what happened (I'm
 going off our Graphite graphs here).





Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Neil Prosser
Very true. I was impatient (I think less than three minutes impatient so
hopefully 4.4 will save me from myself) but I didn't realise it was doing
something rather than just hanging. Next time I have to restart a node I'll
just leave and go get a cup of coffee or something.

My configuration is set to auto hard-commit every 5 minutes. No auto
soft-commit time is set.

Over the course of the weekend, while left unattended the nodes have been
going up and down (I've got to solve the issue that is causing them to come
and go, but any suggestions on what is likely to be causing something like
that are welcome), at one point one of the nodes stopped taking updates.
After indexing properly for a few hours with that one shard not accepting
updates, the replica of that shard which contains all the correct documents
must have replicated from the broken node and dropped documents. Is there
any protection against this in Solr or should I be focusing on getting my
nodes to be more reliable? I've now got a situation where four of my five
shards have leaders who are marked as down and followers who are up.

I'm going to start grabbing information about the cluster state so I can
track which changes are happening and in what order. I can get hold of Solr
logs and garbage collection logs while these things are happening.

Is this all just down to my nodes being unreliable?
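
For the cluster state tracking I have in mind something along these lines (a
minimal sketch assuming the kazoo ZooKeeper client and the Solr 4.x
clusterstate.json layout; the ensemble address is a placeholder):

import json
import time

from kazoo.client import KazooClient  # assumes the kazoo library is available

ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"  # placeholder ensemble

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
try:
    while True:
        data, _stat = zk.get("/clusterstate.json")
        state = json.loads(data)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        # Record every replica's reported state so changes can be ordered later.
        for collection, coll in state.items():
            for shard, shard_info in coll.get("shards", {}).items():
                for replica, props in shard_info.get("replicas", {}).items():
                    leader = "leader" if props.get("leader") == "true" else ""
                    print(stamp, collection, shard, replica,
                          props.get("state"), leader)
        time.sleep(30)
finally:
    zk.stop()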


On 21 July 2013 13:52, Erick Erickson erickerick...@gmail.com wrote:

 Well, if I'm reading this right you had a node go out of circulation
 and then bounced nodes until that node became the leader. So of course
 it wouldn't have the documents (how could it?). Basically you shot
 yourself in the foot.

 Underlying here is why it took the machine you were re-starting so
 long to come up that you got impatient and started killing nodes.
 There has been quite a bit done to make that process better, so what
 version of Solr are you using? 4.4 is being voted on right now, so
 you might want to consider upgrading.

 There was, for instance, a situation where it would take 3 minutes for
 machines to start up. How impatient were you?

 Also, what are your hard commit parameters? All of the documents
 you're indexing will be in the transaction log between hard commits,
 and when a node comes up the leader will replay everything in the tlog
 to the new node, which might be a source of why it took so long for
 the new node to come back up. At the very least the new node you were
 bringing back online will need to do a full index replication (old
 style) to get caught up.

 Best
 Erick

 On Fri, Jul 19, 2013 at 4:02 AM, Neil Prosser neil.pros...@gmail.com
 wrote:
  While indexing some documents to a SolrCloud cluster (10 machines, 5
 shards
  and 2 replicas, so one replica on each machine) one of the replicas
 stopped
  receiving documents, while the other replica of the shard continued to
 grow.
 
  That was overnight so I was unable to track exactly what happened (I'm
  going off our Graphite graphs here). This morning when I was able to look
  at the cluster both replicas of that shard were marked as down (with one
  marked as leader). I attempted to restart the non-leader node but it
 took a
  long time to restart so I killed it and restarted the old leader, which
  also took a long time. I killed that one (I'm impatient) and left the
  non-leader node to restart, not realising it was missing approximately
 700k
  documents that the old leader had. Eventually it restarted and became
  leader. I restarted the old leader and it dropped the number of documents
  it had to match the previous non-leader.
 
  Is this expected behaviour when a replica with fewer documents is started
  before the other and elected leader? Should I have been paying more
  attention to the number of documents on the server before restarting
 nodes?
 
  I am still in the process of tuning the caches and warming for these
  servers but we are putting some load through the cluster so it is
 possible
  that the nodes are having to work quite hard when a new version of the
 core
  is made available. Is this likely to explain why I occasionally see
  nodes dropping out? Unfortunately in restarting the nodes I lost the GC
  logs to see whether that was likely to be the culprit. Is this the sort
 of
  situation where you raise the ZooKeeper timeout a bit? Currently the
  timeout for all nodes is 15 seconds.
 
  Are there any known issues which might explain what's happening? I'm just
  getting started with SolrCloud after using standard master/slave
  replication for an index which has got too big for one machine over the
  last few months.
 
  Also, is there any particular information that would help with these
  issues if it should happen again?



Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Erick Erickson
Wow, you really shouldn't be having nodes go up and down so
frequently, that's a big red flag. That said, SolrCloud should be
pretty robust so this is something to pursue...

But even a 5 minute hard commit can lead to a hefty transaction
log under load, you may want to reduce it substantially depending
on how fast you are sending docs to the index. I'm talking
15-30 seconds here. It's critical that openSearcher be set to false
or you'll invalidate your caches that often. All a hard commit
with openSearcher set to false does is close off the current segment
and open a new one. It does NOT open/warm new searchers etc.

The soft commits control visibility, so that's how you control
whether you can search the docs or not. Pardon me if I'm
repeating stuff you already know!

As far as your nodes coming and going, I've seen some people have
good results by upping the ZooKeeper timeout limit. So I guess
my first question is whether the nodes are actually going out of service
or whether it's just a timeout issue

Good luck!
Erick

On Mon, Jul 22, 2013 at 3:29 AM, Neil Prosser neil.pros...@gmail.com wrote:
 Very true. I was impatient (I think less than three minutes impatient so
 hopefully 4.4 will save me from myself) but I didn't realise it was doing
 something rather than just hanging. Next time I have to restart a node I'll
 just leave and go get a cup of coffee or something.

 My configuration is set to auto hard-commit every 5 minutes. No auto
 soft-commit time is set.

 Over the course of the weekend, while left unattended the nodes have been
 going up and down (I've got to solve the issue that is causing them to come
 and go, but any suggestions on what is likely to be causing something like
 that are welcome), at one point one of the nodes stopped taking updates.
 After indexing properly for a few hours with that one shard not accepting
 updates, the replica of that shard which contains all the correct documents
 must have replicated from the broken node and dropped documents. Is there
 any protection against this in Solr or should I be focusing on getting my
 nodes to be more reliable? I've now got a situation where four of my five
 shards have leaders who are marked as down and followers who are up.

 I'm going to start grabbing information about the cluster state so I can
 track which changes are happening and in what order. I can get hold of Solr
 logs and garbage collection logs while these things are happening.

 Is this all just down to my nodes being unreliable?


 On 21 July 2013 13:52, Erick Erickson erickerick...@gmail.com wrote:

 Well, if I'm reading this right you had a node go out of circulation
 and then bounced nodes until that node became the leader. So of course
 it wouldn't have the documents (how could it?). Basically you shot
 yourself in the foot.

 Underlying here is why it took the machine you were re-starting so
 long to come up that you got impatient and started killing nodes.
 There has been quite a bit done to make that process better, so what
 version of Solr are you using? 4.4 is being voted on right now, so
 you might want to consider upgrading.

 There was, for instance, a situation where it would take 3 minutes for
 machines to start up. How impatient were you?

 Also, what are your hard commit parameters? All of the documents
 you're indexing will be in the transaction log between hard commits,
 and when a node comes up the leader will replay everything in the tlog
 to the new node, which might be a source of why it took so long for
 the new node to come back up. At the very least the new node you were
 bringing back online will need to do a full index replication (old
 style) to get caught up.

 Best
 Erick

 On Fri, Jul 19, 2013 at 4:02 AM, Neil Prosser neil.pros...@gmail.com
 wrote:
  While indexing some documents to a SolrCloud cluster (10 machines, 5
 shards
  and 2 replicas, so one replica on each machine) one of the replicas
 stopped
  receiving documents, while the other replica of the shard continued to
 grow.
 
  That was overnight so I was unable to track exactly what happened (I'm
  going off our Graphite graphs here). This morning when I was able to look
  at the cluster both replicas of that shard were marked as down (with one
  marked as leader). I attempted to restart the non-leader node but it
 took a
  long time to restart so I killed it and restarted the old leader, which
  also took a long time. I killed that one (I'm impatient) and left the
  non-leader node to restart, not realising it was missing approximately
 700k
  documents that the old leader had. Eventually it restarted and became
  leader. I restarted the old leader and it dropped the number of documents
  it had to match the previous non-leader.
 
  Is this expected behaviour when a replica with fewer documents is started
  before the other and elected leader? Should I have been paying more
  attention to the number of documents on the server before restarting
 nodes?
 
  I am still 

Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Neil Prosser
No need to apologise. It's always good to have things like that reiterated
in case I've misunderstood along the way.

I have a feeling that it's related to garbage collection. I assume that if
the JVM heads into a stop-the-world GC Solr can't let ZooKeeper know it's
still alive and so gets marked as down. I've just taken a look at the GC
logs and can see a couple of full collections which took longer than my ZK
timeout of 15s. I'm still in the process of tuning the cache sizes and
have probably got it wrong (I'm coming from a Solr instance which runs on a
48G heap with ~40m documents and bringing it into five shards with 8G
heap). I thought I was being conservative with the cache sizes but I should
probably drop them right down and start again. The entire index is cached
by Linux so I should just need caches to help with things which eat CPU at
request time.

The indexing level is unusual because normally we wouldn't be indexing
everything sequentially, just making delta updates to the index as things
are changed in our MoR. However, it's handy to know how it reacts under the
most extreme load we could give it.

In the case that I set my hard commit time to 15-30 seconds with
openSearcher set to false, how do I control when I actually do invalidate
the caches and open a new searcher? Is this something that Solr can do
automatically, or will I need some sort of coordinator process to perform a
'proper' commit from outside Solr?

In our case the process of opening a new searcher is definitely a hefty
operation. We have a large number of boosts and filters which are used for
just about every query that is made against the index so we currently have
them warmed which can take upwards of a minute on our giant core.

Thanks for your help.


On 22 July 2013 13:00, Erick Erickson erickerick...@gmail.com wrote:

 Wow, you really shouldn't be having nodes go up and down so
 frequently, that's a big red flag. That said, SolrCloud should be
 pretty robust so this is something to pursue...

 But even a 5 minute hard commit can lead to a hefty transaction
 log under load, you may want to reduce it substantially depending
 on how fast you are sending docs to the index. I'm talking
 15-30 seconds here. It's critical that openSearcher be set to false
 or you'll invalidate your caches that often. All a hard commit
 with openSearcher set to false does is close off the current segment
 and open a new one. It does NOT open/warm new searchers etc.

 The soft commits control visibility, so that's how you control
 whether you can search the docs or not. Pardon me if I'm
 repeating stuff you already know!

 As far as your nodes coming and going, I've seen some people have
 good results by upping the ZooKeeper timeout limit. So I guess
 my first question is whether the nodes are actually going out of service
 or whether it's just a timeout issue

 Good luck!
 Erick

 On Mon, Jul 22, 2013 at 3:29 AM, Neil Prosser neil.pros...@gmail.com
 wrote:
  Very true. I was impatient (I think less than three minutes impatient so
  hopefully 4.4 will save me from myself) but I didn't realise it was doing
  something rather than just hanging. Next time I have to restart a node
 I'll
  just leave and go get a cup of coffee or something.
 
  My configuration is set to auto hard-commit every 5 minutes. No auto
  soft-commit time is set.
 
  Over the course of the weekend, while left unattended the nodes have been
  going up and down (I've got to solve the issue that is causing them to
 come
  and go, but any suggestions on what is likely to be causing something
 like
  that are welcome), at one point one of the nodes stopped taking updates.
  After indexing properly for a few hours with that one shard not accepting
  updates, the replica of that shard which contains all the correct
 documents
  must have replicated from the broken node and dropped documents. Is there
  any protection against this in Solr or should I be focusing on getting my
  nodes to be more reliable? I've now got a situation where four of my five
  shards have leaders who are marked as down and followers who are up.
 
  I'm going to start grabbing information about the cluster state so I can
  track which changes are happening and in what order. I can get hold of
 Solr
  logs and garbage collection logs while these things are happening.
 
  Is this all just down to my nodes being unreliable?
 
 
  On 21 July 2013 13:52, Erick Erickson erickerick...@gmail.com wrote:
 
  Well, if I'm reading this right you had a node go out of circulation
  and then bounced nodes until that node became the leader. So of course
  it wouldn't have the documents (how could it?). Basically you shot
  yourself in the foot.
 
  Underlying here is why it took the machine you were re-starting so
  long to come up that you got impatient and started killing nodes.
  There has been quite a bit done to make that process better, so what
  version of Solr are you using? 4.4 is being voted on right now, so 

Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Neil Prosser
Sorry, I should also mention that these leader nodes which are marked as
down can actually still be queried locally with distrib=false with no
problems. Is it possible that they've somehow got themselves out-of-sync?


On 22 July 2013 13:37, Neil Prosser neil.pros...@gmail.com wrote:

 No need to apologise. It's always good to have things like that reiterated
 in case I've misunderstood along the way.

 I have a feeling that it's related to garbage collection. I assume that if
 the JVM heads into a stop-the-world GC Solr can't let ZooKeeper know it's
 still alive and so gets marked as down. I've just taken a look at the GC
 logs and can see a couple of full collections which took longer than my ZK
 timeout of 15s. I'm still in the process of tuning the cache sizes and
 have probably got it wrong (I'm coming from a Solr instance which runs on a
 48G heap with ~40m documents and bringing it into five shards with 8G
 heap). I thought I was being conservative with the cache sizes but I should
 probably drop them right down and start again. The entire index is cached
 by Linux so I should just need caches to help with things which eat CPU at
 request time.

 The indexing level is unusual because normally we wouldn't be indexing
 everything sequentially, just making delta updates to the index as things
 are changed in our MoR. However, it's handy to know how it reacts under the
 most extreme load we could give it.

 In the case that I set my hard commit time to 15-30 seconds with
 openSearcher set to false, how do I control when I actually do invalidate
 the caches and open a new searcher? Is this something that Solr can do
 automatically, or will I need some sort of coordinator process to perform a
 'proper' commit from outside Solr?

 In our case the process of opening a new searcher is definitely a hefty
 operation. We have a large number of boosts and filters which are used for
 just about every query that is made against the index so we currently have
 them warmed which can take upwards of a minute on our giant core.

 Thanks for your help.


 On 22 July 2013 13:00, Erick Erickson erickerick...@gmail.com wrote:

 Wow, you really shouldn't be having nodes go up and down so
 frequently, that's a big red flag. That said, SolrCloud should be
 pretty robust so this is something to pursue...

 But even a 5 minute hard commit can lead to a hefty transaction
 log under load, you may want to reduce it substantially depending
 on how fast you are sending docs to the index. I'm talking
 15-30 seconds here. It's critical that openSearcher be set to false
 or you'll invalidate your caches that often. All a hard commit
 with openSearcher set to false does is close off the current segment
 and open a new one. It does NOT open/warm new searchers etc.

 The soft commits control visibility, so that's how you control
 whether you can search the docs or not. Pardon me if I'm
 repeating stuff you already know!

 As far as your nodes coming and going, I've seen some people have
 good results by upping the ZooKeeper timeout limit. So I guess
 my first question is whether the nodes are actually going out of service
 or whether it's just a timeout issue

 Good luck!
 Erick

 On Mon, Jul 22, 2013 at 3:29 AM, Neil Prosser neil.pros...@gmail.com
 wrote:
  Very true. I was impatient (I think less than three minutes impatient so
  hopefully 4.4 will save me from myself) but I didn't realise it was
 doing
  something rather than just hanging. Next time I have to restart a node
 I'll
  just leave and go get a cup of coffee or something.
 
  My configuration is set to auto hard-commit every 5 minutes. No auto
  soft-commit time is set.
 
  Over the course of the weekend, while left unattended the nodes have
 been
  going up and down (I've got to solve the issue that is causing them to
 come
  and go, but any suggestions on what is likely to be causing something
 like
  that are welcome), at one point one of the nodes stopped taking updates.
  After indexing properly for a few hours with that one shard not
 accepting
  updates, the replica of that shard which contains all the correct
 documents
  must have replicated from the broken node and dropped documents. Is
 there
  any protection against this in Solr or should I be focusing on getting
 my
  nodes to be more reliable? I've now got a situation where four of my
 five
  shards have leaders who are marked as down and followers who are up.
 
  I'm going to start grabbing information about the cluster state so I can
  track which changes are happening and in what order. I can get hold of
 Solr
  logs and garbage collection logs while these things are happening.
 
  Is this all just down to my nodes being unreliable?
 
 
  On 21 July 2013 13:52, Erick Erickson erickerick...@gmail.com wrote:
 
  Well, if I'm reading this right you had a node go out of circulation
  and then bounced nodes until that node became the leader. So of course
  it wouldn't have the documents (how could it?). 

RE: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Markus Jelsma
It is possible: https://issues.apache.org/jira/browse/SOLR-4260
I rarely see it and I cannot reliably reproduce it, but it just sometimes
happens. Nodes will not bring each other back in sync.

 
 
-Original message-
 From:Neil Prosser neil.pros...@gmail.com
 Sent: Monday 22nd July 2013 14:41
 To: solr-user@lucene.apache.org
 Subject: Re: Solr 4.3.1 - SolrCloud nodes down and lost documents
 
 Sorry, I should also mention that these leader nodes which are marked as
 down can actually still be queried locally with distrib=false with no
 problems. Is it possible that they've somehow got themselves out-of-sync?
 
 
 On 22 July 2013 13:37, Neil Prosser neil.pros...@gmail.com wrote:
 
  No need to apologise. It's always good to have things like that reiterated
  in case I've misunderstood along the way.
 
  I have a feeling that it's related to garbage collection. I assume that if
  the JVM heads into a stop-the-world GC Solr can't let ZooKeeper know it's
  still alive and so gets marked as down. I've just taken a look at the GC
  logs and can see a couple of full collections which took longer than my ZK
  timeout of 15s. I'm still in the process of tuning the cache sizes and
  have probably got it wrong (I'm coming from a Solr instance which runs on a
  48G heap with ~40m documents and bringing it into five shards with 8G
  heap). I thought I was being conservative with the cache sizes but I should
  probably drop them right down and start again. The entire index is cached
  by Linux so I should just need caches to help with things which eat CPU at
  request time.
 
  The indexing level is unusual because normally we wouldn't be indexing
  everything sequentially, just making delta updates to the index as things
  are changed in our MoR. However, it's handy to know how it reacts under the
  most extreme load we could give it.
 
  In the case that I set my hard commit time to 15-30 seconds with
  openSearcher set to false, how do I control when I actually do invalidate
  the caches and open a new searcher? Is this something that Solr can do
  automatically, or will I need some sort of coordinator process to perform a
  'proper' commit from outside Solr?
 
  In our case the process of opening a new searcher is definitely a hefty
  operation. We have a large number of boosts and filters which are used for
  just about every query that is made against the index so we currently have
  them warmed which can take upwards of a minute on our giant core.
 
  Thanks for your help.
 
 
  On 22 July 2013 13:00, Erick Erickson erickerick...@gmail.com wrote:
 
  Wow, you really shouldn't be having nodes go up and down so
  frequently, that's a big red flag. That said, SolrCloud should be
  pretty robust so this is something to pursue...
 
  But even a 5 minute hard commit can lead to a hefty transaction
  log under load, you may want to reduce it substantially depending
  on how fast you are sending docs to the index. I'm talking
  15-30 seconds here. It's critical that openSearcher be set to false
  or you'll invalidate your caches that often. All a hard commit
  with openSearcher set to false does is close off the current segment
  and open a new one. It does NOT open/warm new searchers etc.
 
  The soft commits control visibility, so that's how you control
  whether you can search the docs or not. Pardon me if I'm
  repeating stuff you already know!
 
  As far as your nodes coming and going, I've seen some people have
  good results by upping the ZooKeeper timeout limit. So I guess
  my first question is whether the nodes are actually going out of service
  or whether it's just a timeout issue
 
  Good luck!
  Erick
 
  On Mon, Jul 22, 2013 at 3:29 AM, Neil Prosser neil.pros...@gmail.com
  wrote:
   Very true. I was impatient (I think less than three minutes impatient so
   hopefully 4.4 will save me from myself) but I didn't realise it was
  doing
   something rather than just hanging. Next time I have to restart a node
  I'll
   just leave and go get a cup of coffee or something.
  
   My configuration is set to auto hard-commit every 5 minutes. No auto
   soft-commit time is set.
  
   Over the course of the weekend, while left unattended the nodes have
  been
   going up and down (I've got to solve the issue that is causing them to
  come
   and go, but any suggestions on what is likely to be causing something
  like
   that are welcome), at one point one of the nodes stopped taking updates.
   After indexing properly for a few hours with that one shard not
  accepting
   updates, the replica of that shard which contains all the correct
  documents
   must have replicated from the broken node and dropped documents. Is
  there
   any protection against this in Solr or should I be focusing on getting
  my
   nodes to be more reliable? I've now got a situation where four of my
  five
   shards have leaders who are marked as down and followers who are up.
  
   I'm going to start grabbing information about

RE: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Markus Jelsma
You should increase your ZK timeout; this may be the issue in your case. You
may also want to try the G1GC collector to keep STW pauses under the ZK timeout.
 
-Original message-
 From:Neil Prosser neil.pros...@gmail.com
 Sent: Monday 22nd July 2013 14:38
 To: solr-user@lucene.apache.org
 Subject: Re: Solr 4.3.1 - SolrCloud nodes down and lost documents
 
 No need to apologise. It's always good to have things like that reiterated
 in case I've misunderstood along the way.
 
 I have a feeling that it's related to garbage collection. I assume that if
 the JVM heads into a stop-the-world GC Solr can't let ZooKeeper know it's
 still alive and so gets marked as down. I've just taken a look at the GC
 logs and can see a couple of full collections which took longer than my ZK
 timeout of 15s. I'm still in the process of tuning the cache sizes and
 have probably got it wrong (I'm coming from a Solr instance which runs on a
 48G heap with ~40m documents and bringing it into five shards with 8G
 heap). I thought I was being conservative with the cache sizes but I should
 probably drop them right down and start again. The entire index is cached
 by Linux so I should just need caches to help with things which eat CPU at
 request time.
 
 The indexing level is unusual because normally we wouldn't be indexing
 everything sequentially, just making delta updates to the index as things
 are changed in our MoR. However, it's handy to know how it reacts under the
 most extreme load we could give it.
 
 In the case that I set my hard commit time to 15-30 seconds with
 openSearcher set to false, how do I control when I actually do invalidate
 the caches and open a new searcher? Is this something that Solr can do
 automatically, or will I need some sort of coordinator process to perform a
 'proper' commit from outside Solr?
 
 In our case the process of opening a new searcher is definitely a hefty
 operation. We have a large number of boosts and filters which are used for
 just about every query that is made against the index so we currently have
 them warmed which can take upwards of a minute on our giant core.
 
 Thanks for your help.
 
 
 On 22 July 2013 13:00, Erick Erickson erickerick...@gmail.com wrote:
 
  Wow, you really shouldn't be having nodes go up and down so
  frequently, that's a big red flag. That said, SolrCloud should be
  pretty robust so this is something to pursue...
 
  But even a 5 minute hard commit can lead to a hefty transaction
  log under load, you may want to reduce it substantially depending
  on how fast you are sending docs to the index. I'm talking
  15-30 seconds here. It's critical that openSearcher be set to false
  or you'll invalidate your caches that often. All a hard commit
  with openSearcher set to false does is close off the current segment
  and open a new one. It does NOT open/warm new searchers etc.
 
  The soft commits control visibility, so that's how you control
  whether you can search the docs or not. Pardon me if I'm
  repeating stuff you already know!
 
  As far as your nodes coming and going, I've seen some people have
  good results by upping the ZooKeeper timeout limit. So I guess
  my first question is whether the nodes are actually going out of service
  or whether it's just a timeout issue
 
  Good luck!
  Erick
 
  On Mon, Jul 22, 2013 at 3:29 AM, Neil Prosser neil.pros...@gmail.com
  wrote:
   Very true. I was impatient (I think less than three minutes impatient so
   hopefully 4.4 will save me from myself) but I didn't realise it was doing
   something rather than just hanging. Next time I have to restart a node
  I'll
   just leave and go get a cup of coffee or something.
  
   My configuration is set to auto hard-commit every 5 minutes. No auto
   soft-commit time is set.
  
   Over the course of the weekend, while left unattended the nodes have been
   going up and down (I've got to solve the issue that is causing them to
  come
   and go, but any suggestions on what is likely to be causing something
  like
   that are welcome), at one point one of the nodes stopped taking updates.
   After indexing properly for a few hours with that one shard not accepting
   updates, the replica of that shard which contains all the correct
  documents
   must have replicated from the broken node and dropped documents. Is there
   any protection against this in Solr or should I be focusing on getting my
   nodes to be more reliable? I've now got a situation where four of my five
   shards have leaders who are marked as down and followers who are up.
  
   I'm going to start grabbing information about the cluster state so I can
   track which changes are happening and in what order. I can get hold of
  Solr
   logs and garbage collection logs while these things are happening.
  
   Is this all just down to my nodes being unreliable?
  
  
   On 21 July 2013 13:52, Erick Erickson erickerick...@gmail.com wrote:
  
   Well, if I'm reading this right you had a node go out

Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Shawn Heisey
On 7/22/2013 6:45 AM, Markus Jelsma wrote:
 You should increase your ZK timeout; this may be the issue in your case. You
 may also want to try the G1GC collector to keep STW pauses under the ZK timeout.

When I tried G1, the occasional stop-the-world GC actually got worse.  I
tried G1 after trying CMS with no other tuning parameters.  The average
GC time went down, but when it got into a place where it had to do a
stop-the-world collection, it was worse.

Based on the GC statistics in jvisualvm and jstat, I didn't think I had
a problem.  The way I discovered that I had a problem was by looking at
my haproxy load balancer -- sometimes requests would be sent to a backup
server instead of my primary, because the ping request handler was
timing out on the LB health check.  The LB was set to time out after
five seconds.  When I went looking deeper with the GC log and some other
tools, I was seeing 8-10 second GC pauses.  G1 was showing me pauses of
12 seconds.

Now I use a heavily tuned CMS config, and there are no more LB switches
to a backup server.  I've put some of my own information about my GC
settings on my personal Solr wiki page:

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

I've got an 8GB heap on my systems running 3.5.0 (one copy of the index)
and a 6GB heap on those running 4.2.1 (the other copy of the index).

Summary: Just switching to the G1 collector won't solve GC pause
problems.  There's not a lot of G1 tuning information out there yet.  If
someone can come up with a good set of G1 tuning parameters, G1 might
become better than CMS.

Thanks,
Shawn



Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Timothy Potter
A couple of things I've learned along the way ...

I had a similar architecture where we used fairly low numbers for
auto-commits with openSearcher=false. This keeps the tlog to a
reasonable size. You'll need something on the client side to send in
the hard commit request to open a new searcher every N docs or M
minutes.
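
The client-side piece can be as simple as the sketch below, run from cron or
a scheduler (the collection URL and timings are placeholders; it just issues
an explicit commit that opens a new searcher):

import requests

# Placeholder URL; in SolrCloud the commit is forwarded to every shard/replica.
UPDATE_URL = "http://server01:8983/solr/collection1/update"

def open_new_searcher():
    """Explicit hard commit that opens (and warms) a new searcher, independent
    of the frequent autoCommits that run with openSearcher=false."""
    resp = requests.get(UPDATE_URL, params={
        "commit": "true",
        "openSearcher": "true",
        "waitSearcher": "true",
        "wt": "json",
    }, timeout=300)
    resp.raise_for_status()

if __name__ == "__main__":
    open_new_searcher()  # schedule this every N docs / M minutes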

Be careful with raising the Zk timeout as that also determines how
quickly Zk can detect a node has crashed (afaik). In other words, it
takes the zk client timeout seconds for Zk to consider an ephemeral
znode as gone, so I caution you against increasing this value too much.

The other thing to be aware of is this leaderVoteWait safety mechanism
... you might see log messages that look like:

2013-06-24 18:12:40,408 [coreLoadExecutor-4-thread-1] INFO
solr.cloud.ShardLeaderElectionContext  - Waiting until we see more
replicas up: total=2 found=1 timeoutin=139368

From Mark M: This is a safety mechanism - you can turn it off by
configuring leaderVoteWait to 0 in solr.xml. This is meant to protect
the case where you stop a shard or it fails and then the first node to
get started back up has stale data - you don't want it to just become
the leader. So we wait to see everyone we know about in the shard up
to 3 or 5 min by default. Then we know all the shards participate in
the leader election and the leader will end up with all updates it
should have. You can lower that wait or turn it off with 0.

NOTE: I tried setting it to 0 and my cluster went haywire, so consider
just lowering it but not making it zero ;-)

Max heap of 8GB seems overly large to me for 8M docs per shard esp.
since you're using MMapDirectory to cache the primary data structures
of your index in OS cache. I have run shards with 40M docs with 6GB
max heap and chose to have more aggressive cache eviction by using a
smallish LFU filter cache. This approach seems to spread the cost of
GC out over time vs. massive amounts of clean-up when a new searcher
is opened. With 8M docs, each cached filter will require about 1M of
memory, so it seems like you could run with a smaller heap. I'm not a
GC expert but found that having smaller heap and more aggressive cache
evictions reduced full GC's (and how long they run for) on my Solr
instances.
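
The rough sizing behind that: a cached filter is essentially a bitset over
maxDoc, i.e. one bit per document (the cache size below is just an
illustrative value):

max_docs = 8_000_000
bytes_per_filter = max_docs / 8   # bitset: one bit per doc, ~1MB here
filter_cache_size = 512           # hypothetical filterCache size setting
total_mb = bytes_per_filter * filter_cache_size / 2**20
print(f"{bytes_per_filter / 2**20:.2f} MB per entry, ~{total_mb:.0f} MB for "
      f"{filter_cache_size} entries")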

On Mon, Jul 22, 2013 at 8:09 AM, Shawn Heisey s...@elyograg.org wrote:
 On 7/22/2013 6:45 AM, Markus Jelsma wrote:
  You should increase your ZK timeout; this may be the issue in your case.
  You may also want to try the G1GC collector to keep STW pauses under the ZK timeout.

 When I tried G1, the occasional stop-the-world GC actually got worse.  I
 tried G1 after trying CMS with no other tuning parameters.  The average
 GC time went down, but when it got into a place where it had to do a
 stop-the-world collection, it was worse.

 Based on the GC statistics in jvisualvm and jstat, I didn't think I had
 a problem.  The way I discovered that I had a problem was by looking at
 my haproxy load balancer -- sometimes requests would be sent to a backup
 server instead of my primary, because the ping request handler was
 timing out on the LB health check.  The LB was set to time out after
 five seconds.  When I went looking deeper with the GC log and some other
 tools, I was seeing 8-10 second GC pauses.  G1 was showing me pauses of
 12 seconds.

 Now I use a heavily tuned CMS config, and there are no more LB switches
 to a backup server.  I've put some of my own information about my GC
 settings on my personal Solr wiki page:

 http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning

 I've got an 8GB heap on my systems running 3.5.0 (one copy of the index)
 and a 6GB heap on those running 4.2.1 (the other copy of the index).

 Summary: Just switching to the G1 collector won't solve GC pause
 problems.  There's not a lot of G1 tuning information out there yet.  If
 someone can come up with a good set of G1 tuning parameters, G1 might
 become better than CMS.

 Thanks,
 Shawn



Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Lance Norskog

Are you feeding Graphite from Solr? If so, how?

On 07/19/2013 01:02 AM, Neil Prosser wrote:

That was overnight so I was unable to track exactly what happened (I'm
going off our Graphite graphs here).




Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-22 Thread Neil Prosser
I just have a little python script which I run with cron (luckily that's
the granularity we have in Graphite). It reads the same JSON the admin UI
displays and dumps numeric values into Graphite.

I can open source it if you like. I just need to make sure I remove any
hacks/shortcuts that I've taken because I'm working with our cluster!
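
In the meantime, here's a stripped-down sketch of the idea (the Solr core URL,
Graphite host and metric prefix are placeholders; it assumes the admin mbeans
stats handler with json.nl=map and Graphite's plaintext protocol on port 2003):

import socket
import time

import requests

# Placeholders for illustration.
STATS_URL = ("http://localhost:8983/solr/collection1/admin/mbeans"
             "?stats=true&wt=json&json.nl=map")
GRAPHITE_HOST, GRAPHITE_PORT = "graphite.example.com", 2003
PREFIX = "solr.server01.collection1"

def flatten_numeric(obj, path, out):
    """Keep only the numeric leaves from the nested stats JSON."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            safe = str(key).replace(".", "_").replace("/", "_")
            flatten_numeric(value, path + [safe], out)
    elif isinstance(obj, (int, float)) and not isinstance(obj, bool):
        out[".".join(path)] = obj

def main():
    resp = requests.get(STATS_URL, timeout=10)
    resp.raise_for_status()
    metrics = {}
    flatten_numeric(resp.json().get("solr-mbeans", {}), [PREFIX], metrics)
    now = int(time.time())
    payload = "".join(f"{name} {value} {now}\n" for name, value in metrics.items())
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=10) as sock:
        sock.sendall(payload.encode("utf-8"))

if __name__ == "__main__":
    main()  # run from cron at the granularity Graphite expects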


On 22 July 2013 19:26, Lance Norskog goks...@gmail.com wrote:

 Are you feeding Graphite from Solr? If so, how?


 On 07/19/2013 01:02 AM, Neil Prosser wrote:

 That was overnight so I was unable to track exactly what happened (I'm
 going off our Graphite graphs here).





Re: Solr 4.3.1 - SolrCloud nodes down and lost documents

2013-07-21 Thread Erick Erickson
Well, if I'm reading this right you had a node go out of circulation
and then bounced nodes until that node became the leader. So of course
it wouldn't have the documents (how could it?). Basically you shot
yourself in the foot.

Underlying here is why it took the machine you were re-starting so
long to come up that you got impatient and started killing nodes.
There has been quite a bit done to make that process better, so what
version of Solr are you using? 4.4 is being voted on right now, so
you might want to consider upgrading.

There was, for instance, a situation where it would take 3 minutes for
machines to start up. How impatient were you?

Also, what are your hard commit parameters? All of the documents
you're indexing will be in the transaction log between hard commits,
and when a node comes up the leader will replay everything in the tlog
to the new node, which might be a source of why it took so long for
the new node to come back up. At the very least the new node you were
bringing back online will need to do a full index replication (old
style) to get caught up.

Best
Erick

On Fri, Jul 19, 2013 at 4:02 AM, Neil Prosser neil.pros...@gmail.com wrote:
 While indexing some documents to a SolrCloud cluster (10 machines, 5 shards
 and 2 replicas, so one replica on each machine) one of the replicas stopped
 receiving documents, while the other replica of the shard continued to grow.

 That was overnight so I was unable to track exactly what happened (I'm
 going off our Graphite graphs here). This morning when I was able to look
 at the cluster both replicas of that shard were marked as down (with one
 marked as leader). I attempted to restart the non-leader node but it took a
 long time to restart so I killed it and restarted the old leader, which
 also took a long time. I killed that one (I'm impatient) and left the
 non-leader node to restart, not realising it was missing approximately 700k
 documents that the old leader had. Eventually it restarted and became
 leader. I restarted the old leader and it dropped the number of documents
 it had to match the previous non-leader.

 Is this expected behaviour when a replica with fewer documents is started
 before the other and elected leader? Should I have been paying more
 attention to the number of documents on the server before restarting nodes?

 I am still in the process of tuning the caches and warming for these
 servers but we are putting some load through the cluster so it is possible
 that the nodes are having to work quite hard when a new version of the core
 is made available. Is this likely to explain why I occasionally see
 nodes dropping out? Unfortunately in restarting the nodes I lost the GC
 logs to see whether that was likely to be the culprit. Is this the sort of
 situation where you raise the ZooKeeper timeout a bit? Currently the
 timeout for all nodes is 15 seconds.

 Are there any known issues which might explain what's happening? I'm just
 getting started with SolrCloud after using standard master/slave
 replication for an index which has got too big for one machine over the
 last few months.

 Also, is there any particular information that would help with these issues
 if it should happen again?