It is certainly 'more' possible. We have additional code that revolves
around reading clusterstate.json, and since Solr changed the format of
clusterstate.json between 4.0 and 4.1, the SolrJ lib from 4.0 isn't
compatible with anything after 4.0, so an upgrade requires code changes
to our service.  I can, however, run Java 7 with these GC settings in a
dev env under load to see if they blow up or if it's even possible, then
roll it out to the replica, and then to the leader.  I cannot do that
with a Solr upgrade without significant coding changes to our service,
which would require us to roll out new code for our service as well as
new Solr instances.

So, while a new version of Java may be 'just as risky' as you say, for us
it's 'less risky' than a new version of Solr, and it is possible to
implement without downtime.

It is actually something of a pain point that the upgrade path for
SolrCloud seems to frequently require downtime (the clusterstate.json
changes in 4.1, and then again this big change in 4.4 with no solr.xml).

So we'll do what we can quickly to see if we can 'band-aid' the problem
until we can upgrade to Solr 4.4.  Speaking of band-aids: does anyone know
of a way to change the socket timeout/connection timeout for distributed
updates?
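
For clarity, the two timeouts in question are the connection timeout (how
long to wait for the TCP connection to be established) and the socket
timeout (how long a read may block waiting for data).  In plain java.net a
value of 0 means "wait forever" for both, which would fit the 10-15 minute
hangs if the distributed update timeout really is 0.  A minimal sketch of
the distinction follows.  It uses only java.net and is not Solr's update
code; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration only: plain sockets, not Solr or SolrJ.
final class TimeoutSketch {

    /**
     * Connects with a connection timeout, then attempts one read with a
     * socket (read) timeout. Returns true if the read timed out.
     * A value of 0 for either timeout means "wait forever".
     */
    static boolean readTimesOut(String host, int port,
                                int connectTimeoutMs, int soTimeoutMs)
            throws IOException {
        try (Socket s = new Socket()) {
            // Connection timeout: bounds the TCP connect.
            s.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            // Socket timeout: bounds each blocking read.
            s.setSoTimeout(soTimeoutMs);
            try {
                s.getInputStream().read();
                return false; // got a byte or end-of-stream before the timeout
            } catch (SocketTimeoutException e) {
                return true;  // peer stayed silent longer than soTimeoutMs
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A listener that never calls accept() and never writes. On typical
        // OSes the connect still completes via the listen backlog, so this
        // stands in for a peer that has gone quiet.
        try (ServerSocket silent = new ServerSocket(0)) {
            boolean timedOut =
                readTimesOut("127.0.0.1", silent.getLocalPort(), 1000, 200);
            System.out.println("read timed out: " + timedOut);
        }
    }
}
```

With soTimeoutMs set to 0 instead of 200, the same read would block
indefinitely against a silent peer.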

Jed.

On 7/10/13 2:38 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:

>Jed:
>
>I'm not sure changing Java runtime is any less scary than upgrading
>Solr....
>
>Wait, I know! Ask your manager if you can do both at once <evil smirk>.
>I have a t-shirt that says "I don't test, but when I do it's in
>production"...
>
>Erick
>
>On Wed, Jul 10, 2013 at 8:08 AM, Jed Glazner <jglaz...@adobe.com> wrote:
>> Hey Daniel,
>>
>> Thanks for the response.  I think we'll give this a try to see if this
>> helps.
>>
>> Jed.
>>
>> On 7/10/13 10:48 AM, "Daniel Collins" <danwcoll...@gmail.com> wrote:
>>
>>>We had something similar in terms of update times suddenly spiking for
>>>no obvious reason.  We never got quite as bad as you in terms of the
>>>other knock-on effects, but we certainly saw updates jumping from 10ms
>>>up to 30000ms; all our external queues backed up and we rejected some
>>>updates, then after a while things quietened down.
>>>
>>>We were running Solr 4.3.0 but with Java 6 and the CMS GC.  We swapped
>>>to Java 7 and the G1 GC (and increased heap size from 8GB to 12GB), and
>>>the problem went away.
>>>
>>>Now, I admit it's not exactly the same as your case, since we never had
>>>the follow-on effects, but I'd consider Java 7 and the G1 GC; it has
>>>certainly reduced the "spikes" in our indexing times.
>>>
>>>We run the following settings now (the usual caveats apply, it might not
>>>work for you).
>>>
>>>    GC_OPTIONS="-XX:+AggressiveOpts -XX:+UseG1GC -XX:+UseStringCache
>>>-XX:+OptimizeStringConcat -XX:-UseSplitVerifier -XX:+UseNUMA
>>>-XX:MaxGCPauseMillis=50 -XX:GCPauseIntervalMillis=1000"
>>>
>>>I set MaxGCPauseMillis/GCPauseIntervalMillis to try to minimise
>>>application pauses; that's our goal.  If we have to use more memory in
>>>the short term then so be it, but we couldn't afford application pauses,
>>>because we are using NRT (soft commits every 1s, hard commits every 60s)
>>>and we get a lot of updates.
>>>
>>>I know there have been other discussions of G1 and it has received
>>>mixed results overall, but for us, it seems to be a winner.
>>>
>>>Hope that helps,
>>>
>>>
>>>On 10 July 2013 08:32, Jed Glazner <jglaz...@adobe.com> wrote:
>>>
>>>> We are planning an upgrade to 4.4 but it's still weeks out. We offer a
>>>> high availability search service and there are a number of changes in
>>>> 4.4 that are not backward compatible (i.e. clusterstate.json and no
>>>> solr.xml), so there must be lots of testing.  Additionally, this
>>>> upgrade cannot be performed without downtime.
>>>>
>>>> Regardless, I need to find a band-aid right now.  Does anyone know if
>>>> it's possible to set the timeout for distributed update requests
>>>> to/from the leader?  Currently we see it's set to 0.  Maybe via a -D
>>>> startup param, or something?
>>>>
>>>> Jed
>>>>
>>>> On 7/10/13 1:23 AM, "Otis Gospodnetic" <otis.gospodne...@gmail.com>
>>>>wrote:
>>>>
>>>> >Hi Jed,
>>>> >
>>>> >This is really with Solr 4.0?  If so, it may be wiser to jump to 4.4,
>>>> >which is about to be released.  We did not have fun working with 4.0
>>>> >in SolrCloud mode a few months ago.  You will save time, hair, and
>>>> >money if you convince your manager to let you use Solr 4.4. :)
>>>> >
>>>> >Otis
>>>> >--
>>>> >Solr & ElasticSearch Support -- http://sematext.com/
>>>> >Performance Monitoring -- http://sematext.com/spm
>>>> >
>>>> >
>>>> >
>>>> >On Tue, Jul 9, 2013 at 4:44 PM, Jed Glazner <jglaz...@adobe.com>
>>>>wrote:
>>>> >> Hi Shawn,
>>>> >>
>>>> >> I have been trying to duplicate this problem without success for the
>>>> >> last 2 weeks, which is one reason I'm getting flustered.  It seems
>>>> >> reasonable to be able to duplicate it but I can't.
>>>> >>
>>>> >> We do have a story to upgrade, but it is still weeks if not months
>>>> >> before that gets rolled out to production.
>>>> >>
>>>> >> We have another cluster running the same version, but with 8 shards
>>>> >> and 8 replicas, with each shard at 100GB, and with more load and more
>>>> >> indexing requests, without this problem.  There, however, we send docs
>>>> >> in batches and all fields are stored, whereas the trouble index has
>>>> >> only 1 or 2 stored fields and we only send docs 1 at a time.
>>>> >>
>>>> >> Could that have anything to do with it?
>>>> >>
>>>> >> Jed
>>>> >>
>>>> >>
>>>> >> Sent from Samsung Mobile
>>>> >>
>>>> >>
>>>> >>
>>>> >> -------- Original message --------
>>>> >> From: Shawn Heisey <s...@elyograg.org>
>>>> >> Date: 07.09.2013 18:33 (GMT+01:00)
>>>> >> To: solr-user@lucene.apache.org
>>>> >> Subject: Re: Solr Hangs During Updates for over 10 minutes
>>>> >>
>>>> >>
>>>> >> On 7/9/2013 9:50 AM, Jed Glazner wrote:
>>>> >>> I'll give you the high level before delving deep into setup etc. I
>>>> >>> have been struggling at work with a seemingly random problem where
>>>> >>> Solr will hang for 10-15 minutes during updates.  This outage always
>>>> >>> seems to be immediately preceded by an EOF exception on the replica.
>>>> >>> Then, 10-15 minutes later, we see an exception on the leader for a
>>>> >>> socket timeout to the replica.  The leader will then tell the replica
>>>> >>> to recover, which in most cases it does, and then the outage is over.
>>>> >>>
>>>> >>> Here are the setup details:
>>>> >>>
>>>> >>> We are currently using Solr 4.0.0 with an external ZK ensemble of 5
>>>> >>> machines.
>>>> >>
>>>> >> After 4.0.0 was released, a *lot* of problems with SolrCloud surfaced
>>>> >> and have since been fixed.  You're five releases and about nine months
>>>> >> behind what's current.  My recommendation: Upgrade to 4.3.1, ensure
>>>> >> your configuration is up to date with changes to the example config
>>>> >> between 4.0.0 and 4.3.1, and reindex.  Ideally, you should set up a
>>>> >> 4.0.0 testbed, duplicate your current problem, and upgrade the testbed
>>>> >> to see if the problem goes away.  A testbed will also give you
>>>> >> practice for a smooth upgrade of your production system.
>>>> >>
>>>> >> Thanks,
>>>> >> Shawn
>>>> >>
>>>>
>>>>
>>
