Okay, I have done this (updated to 4.3.1 across the master and four slaves; one
of these is my own PC for experiments and is not accessed by clients).

Just had a minor replication this morning, and all three slaves are "stuck"
again.  Replication supposedly started at 8:40 and ended 30 seconds or so
later (on my local PC, which is set up identically to the other three
slaves).  The three slaves will NOT complete the roll-over to the new
index.  All three index folders have a write.lock, and the latest files are
dated 8:40am (it is now 8:54am, with no further activity in the index
folders).  An "index.20130627084000061" directory (or some variation
thereof) exists in each of the three slaves' data folders.
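
If it helps, the slaves' view of things can also be checked over HTTP with
the ReplicationHandler's "details" command; a rough sketch of how that could
be polled is below (the host and core name are placeholders, not our real
ones):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Quick check of a slave's replication status while it is "stuck".
    // "slave-host" and "mycore" are placeholders; adjust to your setup.
    public class ReplicationDetails {
        public static void main(String[] args) throws Exception {
            String url = "http://slave-host:8983/solr/mycore/replication"
                    + "?command=details&wt=json";
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                // inspect the "slave" section for replication progress
                System.out.println(line);
            }
            in.close();
        }
    }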

The seemingly relevant portion of the thread dump (the "snappuller" thread)
on each of these slaves:

   - sun.misc.Unsafe.park(Native Method)
   - java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
   - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
   - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
   - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
   - java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
   - java.util.concurrent.FutureTask.get(FutureTask.java:83)
   - org.apache.solr.handler.SnapPuller.openNewWriterAndSearcher(SnapPuller.java:631)
   - org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:446)
   - org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:317)
   - org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:223)
   - java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
   - java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
   - java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
   - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
   - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
   - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
   - java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
   - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
   - java.lang.Thread.run(Thread.java:662)
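
If it would help to collect more of these from the stuck slaves without
bouncing them, a remote-JMX sketch along these lines could pull the same
stack.  It assumes JMX remoting is enabled on port 18983, which is only a
placeholder and not necessarily how anyone actually runs Solr:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Prints the stack of any "snappuller" thread in a remote Solr JVM.
    // Assumes the slave JVM exposes JMX on port 18983 (placeholder).
    public class SnapPullerStacks {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://slave-host:18983/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                        conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
                for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                    if (info.getThreadName().toLowerCase().contains("snappuller")) {
                        System.out.println(info.getThreadName());
                        for (StackTraceElement frame : info.getStackTrace()) {
                            System.out.println("    at " + frame);
                        }
                    }
                }
            } finally {
                connector.close();
            }
        }
    }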


Here they sit.  My local PC "slave" replicated very quickly and switched
over to the new generation (206) immediately.  I am not sure why the three
slaves are dragging on this.  If there are any configuration elements or
other details you need, please let me know.  I can manually "kick" them by
reloading the core from the admin pages (see the sketch below), but
obviously I would like this to be a hands-off process.  Any help is greatly
appreciated; this has been bugging me for some time now.
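
For completeness, the "kick" I mean is the same core RELOAD the admin page
issues; it can also be done over HTTP with the CoreAdmin API, roughly as
below, so it could be scripted as a stopgap (host and core name are
placeholders for our setup):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Issues the same core RELOAD the admin UI does, via the CoreAdmin API.
    // "slave-host" and "mycore" are placeholders; adjust to your own setup.
    public class KickCore {
        public static void main(String[] args) throws Exception {
            URL url = new URL(
                    "http://slave-host:8983/solr/admin/cores?action=RELOAD&core=mycore");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            System.out.println("RELOAD returned HTTP " + conn.getResponseCode());
            InputStream body = conn.getInputStream();
            // drain and close the response so the connection can be reused
            while (body.read() != -1) { /* ignore */ }
            body.close();
        }
    }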



On Mon, Jun 24, 2013 at 9:34 AM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

> A bunch of replication-related issues were fixed in 4.2.1, so you're
> better off upgrading to 4.2.1 or later (4.3.1 is the latest release).
>
> On Mon, Jun 24, 2013 at 6:55 PM, Neal Ensor <nen...@gmail.com> wrote:
> > As a bit of background, we run a setup (coming from 3.6.1 to 4.2
> > relatively recently) with a single master receiving updates with three
> > slaves pulling changes in.  Our index is around 5 million documents,
> > around 26GB in size total.
> >
> > The situation I'm seeing is this:  occasionally we update the master,
> > and replication begins on the three slaves, seems to proceed normally
> > until it hits the end.  At that point, it "sticks"; there's no messages
> > going on in the logs, nothing on the admin page seems to be happening.
> > I sit there for sometimes upwards of 30 minutes, seeing no further
> > activity in the index folder(s).  After a while, I go to the core admin
> > page and manually reload the core, which "catches it up".  It seems
> > like the index readers / writers are not releasing the index otherwise?
> > The configuration is set to reopen; very occasionally this situation
> > actually fixes itself after a longish period of time, but it seems very
> > annoying.
> >
> > I had at first suspected this to be due to our underlying shared (SAN)
> > storage, so we installed SSDs in all three slave machines, and moved
> > the entire indexes to those.  It did not seem to affect this issue at
> > all (additionally, I didn't really see the expected performance boost,
> > but that's a separate issue entirely).
> >
> > Any ideas?  Any configuration details I might share/reconfigure?  Any
> > suggestions are appreciated. I could also upgrade to the later 4.3+
> > versions, if that might help.
> >
> > Thanks!
> >
> > Neal Ensor
> > nen...@gmail.com
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
