Okay, I have done this (updated to 4.3.1 across the master and all four slaves; one of those slaves is my own PC for experiments and is not accessed by clients).
Just had a minor replication this morning, and all three slaves are "stuck" again. Replication supposedly started at 8:40 and ended roughly 30 seconds later (on my local PC, set up identically to the other three slaves). The three slaves will NOT complete the roll-over to the new index. All three index folders have a write.lock, and the latest files are dated 8:40am (it is now 8:54am, with no further activity in the index folders). There is an "index.20130627084000061" (or some variation thereof) in each of the three slaves' data folders.

The seemingly relevant thread dump of a "snappuller" thread on each of these slaves:

- sun.misc.Unsafe.park(Native Method)
- java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
- java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
- java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
- java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
- java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
- java.util.concurrent.FutureTask.get(FutureTask.java:83)
- org.apache.solr.handler.SnapPuller.openNewWriterAndSearcher(SnapPuller.java:631)
- org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:446)
- org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:317)
- org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:223)
- java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
- java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
- java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
- java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
- java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
- java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
- java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
- java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
- java.lang.Thread.run(Thread.java:662)

Here they sit. My local PC "slave" replicated very quickly and switched over to the new generation (206) immediately. I am not sure why the three slaves are dragging on this. If there are any configuration elements or other details you need, please let me know. I can manually "kick" them by reloading the core from the admin pages (the reload command is below), but obviously I would like this to be a hands-off process. Any help is greatly appreciated; this has been bugging me for some time now.
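For reference, the manual "kick" is just a core reload through the CoreAdmin API; from the command line it is something like this (host, port, and core name here are placeholders, not our real values):

    # reload the core on a stuck slave; "slave-host" and "mycore" are placeholders
    curl "http://slave-host:8983/solr/admin/cores?action=RELOAD&core=mycore"

And in case the configuration matters, the slave side of our replication handler in solrconfig.xml follows the standard master/slave pattern, roughly like this (the masterUrl and pollInterval values are illustrative, not our exact configuration):

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <!-- where this slave polls for new index generations (placeholder URL) -->
        <str name="masterUrl">http://master-host:8983/solr/mycore/replication</str>
        <!-- poll every 60 seconds (HH:MM:SS) -->
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>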
On Mon, Jun 24, 2013 at 9:34 AM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

> A bunch of replication-related issues were fixed in 4.2.1, so you're better off upgrading to 4.2.1 or later (4.3.1 is the latest release).
>
> On Mon, Jun 24, 2013 at 6:55 PM, Neal Ensor <nen...@gmail.com> wrote:
> > As a bit of background, we run a setup (having come from 3.6.1 to 4.2 relatively recently) with a single master receiving updates and three slaves pulling changes in. Our index is around 5 million documents, around 26GB in total.
> >
> > The situation I'm seeing is this: occasionally we update the master, and replication begins on the three slaves and seems to proceed normally until it hits the end. At that point, it "sticks"; there are no messages in the logs, and nothing on the admin page seems to be happening. I sit there, sometimes for upwards of 30 minutes, seeing no further activity in the index folder(s). After a while, I go to the core admin page and manually reload the core, which "catches it up". It seems like the index readers/writers are not releasing the index otherwise? The configuration is set to reopen; very occasionally this situation actually fixes itself after a longish period of time, but it is very annoying.
> >
> > I had at first suspected this was due to our underlying shared (SAN) storage, so we installed SSDs in all three slave machines and moved the entire indexes onto those. It did not seem to affect this issue at all (additionally, I didn't really see the expected performance boost, but that's a separate issue entirely).
> >
> > Any ideas? Any configuration details I might share/reconfigure? Any suggestions are appreciated. I could also upgrade to the later 4.3+ versions, if that might help.
> >
> > Thanks!
> >
> > Neal Ensor
> > nen...@gmail.com
>
> --
> Regards,
> Shalin Shekhar Mangar.