bq. ClusterState says we are the leader, but locally we don't think so

Generally this is due to some bug. One bug that can lead to it was fixed
recently, in 4.10.3 I think. What version are you on?
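
If you're not sure, something like this should show it (host, port and
context path here are just a guess for a default Tomcat setup):

  curl "http://localhost:8080/solr/admin/info/system?wt=json"

The "lucene" section of the response includes the solr-spec-version.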

- Mark

On Mon Jan 12 2015 at 7:35:47 AM Thomas Lamy <t.l...@cytainment.de> wrote:

> Hi,
>
> I found no big/unusual GC pauses in the log (at least manually; I found
> no free solution for analyzing them that worked out of the box on a
> headless Debian wheezy box). Eventually I tried -Xmx8G (it was 64G
> before) on one of the nodes, after checking that heap usage after 1 hour
> of run time was only about 2-3GB. That didn't change the point at which
> a restart was needed, so I don't think Solr's JVM GC is the problem.
> We're now trying to get all of our nodes' logs (ZooKeeper and Solr) into
> Splunk, just to get a better sorted view of what's going on in the cloud
> once a problem occurs. We're also enabling GC logging for ZooKeeper;
> maybe we were missing problems there while focusing on the Solr logs.
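>
> For the ZooKeeper side we'll probably start with something like this in
> conf/java.env (flags and log path are just a first guess, not tested yet):
>
>   export SERVER_JVMFLAGS="-verbose:gc -Xloggc:/var/log/zookeeper/gc.log \
>     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
>     -XX:+PrintGCApplicationStoppedTime"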
>
> Thomas
>
>
> Am 08.01.15 um 16:33 schrieb Yonik Seeley:
> > It's worth noting that those messages alone don't necessarily signify
> > a problem with the system (and it wouldn't be called "split brain").
> > The async nature of updates (and thread scheduling), along with
> > stop-the-world GC pauses that can change leadership, causes these
> > little windows of inconsistency that we detect and log.
> >
> > -Yonik
> > http://heliosearch.org - native code faceting, facet functions,
> > sub-facets, off-heap data
> >
> >
> > On Wed, Jan 7, 2015 at 5:01 AM, Thomas Lamy <t.l...@cytainment.de> wrote:
> >> Hi there,
> >>
> >> we are running a 3-server cloud serving a dozen
> >> single-shard/replicate-everywhere collections. The 2 biggest
> >> collections are ~15M docs, and about 13GiB / 2.5GiB in size.
> >> Solr is 4.10.2, ZK 3.4.5, Tomcat 7.0.56, Oracle Java 1.7.0_72-b14.
> >>
> >> 10 of the 12 collections (the small ones) get filled by a DIH
> >> full-import once a day, starting at 1am. The second biggest collection
> >> is updated using a DIH delta-import every 10 minutes; the biggest one
> >> gets bulk JSON updates with commits every 5 minutes.
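> >> (The bulk updates are essentially something like
> >>    curl "http://localhost:8080/solr/<collection>/update" \
> >>      -H "Content-Type: application/json" --data-binary @batch.json
> >> with host, port and collection name just placeholders here, plus an
> >> explicit commit every 5 minutes.)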
> >>
> >> On a regular basis, we have a leader information mismatch:
> >> org.apache.solr.update.processor.DistributedUpdateProcessor; Request
> >> says it is coming from leader, but we are the leader
> >> or the opposite:
> >> org.apache.solr.update.processor.DistributedUpdateProcessor;
> >> ClusterState says we are the leader, but locally we don't think so
> >>
> >> One of these pops up once a day at around 8am, sending either some
> >> cores into "recovery failed" state, or all cores of at least one cloud
> >> node into state "gone".
> >> This started out of the blue about 2 weeks ago, without any changes to
> >> software, data, or client behaviour.
> >>
> >> Most of the time, we get things going again by restarting Solr on the
> >> current leader node, forcing a new election - can this be triggered
> >> while keeping Solr (and the caches) up?
> >> But sometimes this doesn't help. We had an incident last weekend where
> >> our admins didn't restart in time, which created millions of entries
> >> in /solr/overseer/queue, made ZK close the connection, and caused the
> >> leader re-election to fail. I had to flush ZK and re-upload the
> >> collection config to get Solr up again (just like in
> >> https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).
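> >> (The re-upload was essentially
> >>    zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 -cmd upconfig \
> >>      -confdir /path/to/collection/conf -confname <configname>
> >> with ZK hosts, paths and config name just placeholders here.)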
> >>
> >> We have a much bigger cloud (7 servers, ~50GiB of data in 8
> >> collections, 1500 requests/s) up and running, which has not had these
> >> problems since upgrading to 4.10.2.
> >>
> >>
> >> Any hints on where to look for a solution?
> >>
> >> Kind regards
> >> Thomas
> >>
> >> --
> >> Thomas Lamy
> >> Cytainment AG & Co KG
> >> Nordkanalstrasse 52
> >> 20097 Hamburg
> >>
> >> Tel.:     +49 (40) 23 706-747
> >> Fax:     +49 (40) 23 706-139
> >> Sitz und Registergericht Hamburg
> >> HRA 98121
> >> HRB 86068
> >> Ust-ID: DE213009476
> >>
>
>
> --
> Thomas Lamy
> Cytainment AG & Co KG
> Nordkanalstrasse 52
> 20097 Hamburg
>
> Tel.:     +49 (40) 23 706-747
> Fax:     +49 (40) 23 706-139
>
> Sitz und Registergericht Hamburg
> HRA 98121
> HRB 86068
> Ust-ID: DE213009476
>
>
