Odd - looks like it's stuck waiting to be notified that a new searcher is ready.
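The dump below bears that out: openNewWriterAndSearcher is parked in FutureTask.get, waiting on the future that completes once the new searcher has been opened and registered. Here is a minimal, self-contained sketch of that wait-on-a-future pattern (illustrative only, not the actual SnapPuller code; the class and variable names are made up):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch: a puller thread blocks on a Future that only completes once
    // a new searcher has been opened and registered. If that registration
    // signal never arrives, get() parks the thread forever, exactly like
    // the parked "snappuller" threads in the dump below.
    public class SearcherWaitSketch {

        public static void main(String[] args) throws Exception {
            ExecutorService searcherExecutor = Executors.newSingleThreadExecutor();

            // Stands in for the "new searcher is registered" notification.
            final CountDownLatch searcherRegistered = new CountDownLatch(1);

            // Stand-in for opening/warming/registering the new searcher.
            Future<?> waitSearcher = searcherExecutor.submit(new Runnable() {
                public void run() {
                    try {
                        searcherRegistered.await(); // completes only when notified
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });

            // The replication thread's side: files are pulled, commit is
            // done, now wait for the new searcher before finishing the
            // roll-over.
            searcherRegistered.countDown(); // comment out to reproduce the hang
            waitSearcher.get();             // parks in FutureTask.get() until done
            System.out.println("new searcher ready; roll-over can complete");

            searcherExecutor.shutdown();
        }
    }

If whatever is supposed to complete that future (the searcher warm/register step) never finishes, the calling thread parks indefinitely, which matches the sun.misc.Unsafe.park frames at the top of the dump.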
- Mark

On Jun 27, 2013, at 8:58 AM, Neal Ensor <nen...@gmail.com> wrote:

> Okay, I have done this (updated to 4.3.1 across master and four slaves;
> one of these is my own PC for experiments, and it is not being accessed
> by clients).
>
> Just had a minor replication this morning, and all three slaves are
> "stuck" again. Replication supposedly started at 8:40 and ended 30
> seconds or so later (on my local PC, set up identically to the other
> three slaves). The three slaves will NOT complete the roll-over to the
> new index. All three index folders have a write.lock, and the latest
> files are dated 8:40am (it is now 8:54am, with no further activity in
> the index folders). There is an "index.20130627084000061" (or some
> variation thereof) in all three slaves' data folders.
>
> The seemingly relevant thread dump of a "snappuller" thread on each of
> these slaves:
>
> - sun.misc.Unsafe.park(Native Method)
> - java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
> - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
> - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
> - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
> - java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
> - java.util.concurrent.FutureTask.get(FutureTask.java:83)
> - org.apache.solr.handler.SnapPuller.openNewWriterAndSearcher(SnapPuller.java:631)
> - org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:446)
> - org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:317)
> - org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:223)
> - java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> - java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> - java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> - java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> - java.lang.Thread.run(Thread.java:662)
>
> Here they sit. My local PC "slave" replicated very quickly and switched
> over to the new generation (206) immediately. I am not sure why the
> three slaves are dragging on this. If there are any configuration
> elements or other details you need, please let me know. I can manually
> "kick" them by reloading the core from the admin pages, but obviously I
> would like this to be a hands-off process. Any help is greatly
> appreciated; this has been bugging me for some time now.
>
>
> On Mon, Jun 24, 2013 at 9:34 AM, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
>
>> A bunch of replication-related issues were fixed in 4.2.1, so you're
>> better off upgrading to 4.2.1 or later (4.3.1 is the latest release).
>>
>> On Mon, Jun 24, 2013 at 6:55 PM, Neal Ensor <nen...@gmail.com> wrote:
>>> As a bit of background, we run a setup (having come from 3.6.1 to 4.2
>>> relatively recently) with a single master receiving updates and three
>>> slaves pulling changes in. Our index is around 5 million documents,
>>> around 26GB in size total.
>>>
>>> The situation I'm seeing is this: occasionally we update the master,
>>> and replication begins on the three slaves and seems to proceed
>>> normally until it hits the end. At that point it "sticks"; there are
>>> no messages in the logs, and nothing on the admin page seems to be
>>> happening. I sit there sometimes upwards of 30 minutes, seeing no
>>> further activity in the index folder(s). After a while, I go to the
>>> core admin page and manually reload the core, which "catches it up".
>>> It seems like the index readers/writers are not releasing the index
>>> otherwise? The configuration is set to reopen; very occasionally this
>>> situation actually fixes itself after a longish period of time, but
>>> it is very annoying.
>>>
>>> I had at first suspected this was due to our underlying shared (SAN)
>>> storage, so we installed SSDs in all three slave machines and moved
>>> the entire indexes to those. It did not seem to affect this issue at
>>> all (additionally, I didn't really see the expected performance
>>> boost, but that's a separate issue entirely).
>>>
>>> Any ideas? Any configuration details I might share/reconfigure? Any
>>> suggestions are appreciated. I could also upgrade to the later 4.3+
>>> versions, if that might help.
>>>
>>> Thanks!
>>>
>>> Neal Ensor
>>> nen...@gmail.com
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
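For what it's worth, the manual "kick" Neal describes (reloading the core from the admin page) can also be scripted against the CoreAdmin API while debugging; the host, port, and core name below are placeholders:

    curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1'

That only works around the symptom, of course; the slaves should complete the roll-over on their own once the stuck searcher wait is resolved.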