Odd - looks like it's stuck waiting to be notified that a new searcher is ready.
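The dump below bears that out: openNewWriterAndSearcher is parked in FutureTask.get, waiting on the future that completes once the new searcher has been opened and registered. Here is a minimal, self-contained sketch of that wait-on-a-future pattern (illustrative only, not the actual SnapPuller code; the class and variable names are made up):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Sketch: a puller thread blocks on a Future that only completes once
    // a new searcher has been opened and registered. If that registration
    // signal never arrives, get() parks the thread forever, exactly like
    // the parked "snappuller" threads in the dump below.
    public class SearcherWaitSketch {

        public static void main(String[] args) throws Exception {
            ExecutorService searcherExecutor = Executors.newSingleThreadExecutor();

            // Stands in for the "new searcher is registered" notification.
            final CountDownLatch searcherRegistered = new CountDownLatch(1);

            // Stand-in for opening/warming/registering the new searcher.
            Future<?> waitSearcher = searcherExecutor.submit(new Runnable() {
                public void run() {
                    try {
                        searcherRegistered.await(); // completes only when notified
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });

            // The replication thread's side: files are pulled, commit is
            // done, now wait for the new searcher before finishing the
            // roll-over.
            searcherRegistered.countDown(); // comment out to reproduce the hang
            waitSearcher.get();             // parks in FutureTask.get() until done
            System.out.println("new searcher ready; roll-over can complete");

            searcherExecutor.shutdown();
        }
    }

If whatever is supposed to complete that future (the searcher warm/register step) never finishes, the calling thread parks indefinitely, which matches the sun.misc.Unsafe.park frames at the top of the dump.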
- Mark

On Jun 27, 2013, at 8:58 AM, Neal Ensor <nen...@gmail.com> wrote:

> Okay, I have done this (updated to 4.3.1 across master and four slaves;
> one of these is my own PC for experiments, and it is not being accessed
> by clients).
>
> Just had a minor replication this morning, and all three slaves are
> "stuck" again. Replication supposedly started at 8:40 and ended 30
> seconds or so later (on my local PC, set up identically to the other
> three slaves). The three slaves will NOT complete the roll-over to the
> new index. All three index folders have a write.lock, and the latest
> files are dated 8:40am (it is now 8:54am, with no further activity in
> the index folders). There is an "index.20130627084000061" (or some
> variation thereof) in all three slaves' data folders.
>
> The seemingly relevant thread dump of a "snappuller" thread on each of
> these slaves:
>
> - sun.misc.Unsafe.park(Native Method)
> - java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
> - java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
> - java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
> - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
> - java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
> - java.util.concurrent.FutureTask.get(FutureTask.java:83)
> - org.apache.solr.handler.SnapPuller.openNewWriterAndSearcher(SnapPuller.java:631)
> - org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:446)
> - org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:317)
> - org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:223)
> - java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> - java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
> - java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
> - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
> - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
> - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
> - java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> - java.lang.Thread.run(Thread.java:662)
>
> Here they sit. My local PC "slave" replicated very quickly and switched
> over to the new generation (206) immediately. I am not sure why the
> three slaves are dragging on this. If there are any configuration
> elements or other details you need, please let me know. I can manually
> "kick" them by reloading the core from the admin pages, but obviously I
> would like this to be a hands-off process. Any help is greatly
> appreciated; this has been bugging me for some time now.
>
>
> On Mon, Jun 24, 2013 at 9:34 AM, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
>
>> A bunch of replication-related issues were fixed in 4.2.1, so you're
>> better off upgrading to 4.2.1 or later (4.3.1 is the latest release).
>>
>> On Mon, Jun 24, 2013 at 6:55 PM, Neal Ensor <nen...@gmail.com> wrote:
>>> As a bit of background, we run a setup (having come from 3.6.1 to 4.2
>>> relatively recently) with a single master receiving updates and three
>>> slaves pulling changes in. Our index is around 5 million documents,
>>> around 26GB in size total.
>>>
>>> The situation I'm seeing is this: occasionally we update the master,
>>> and replication begins on the three slaves and seems to proceed
>>> normally until it hits the end. At that point it "sticks"; there are
>>> no messages in the logs, and nothing on the admin page seems to be
>>> happening. I sit there sometimes upwards of 30 minutes, seeing no
>>> further activity in the index folder(s). After a while, I go to the
>>> core admin page and manually reload the core, which "catches it up".
>>> It seems like the index readers/writers are not releasing the index
>>> otherwise? The configuration is set to reopen; very occasionally this
>>> situation actually fixes itself after a longish period of time, but
>>> it is very annoying.
>>>
>>> I had at first suspected this was due to our underlying shared (SAN)
>>> storage, so we installed SSDs in all three slave machines and moved
>>> the entire indexes to those. It did not seem to affect this issue at
>>> all (additionally, I didn't really see the expected performance
>>> boost, but that's a separate issue entirely).
>>>
>>> Any ideas? Any configuration details I might share/reconfigure? Any
>>> suggestions are appreciated. I could also upgrade to the later 4.3+
>>> versions, if that might help.
>>>
>>> Thanks!
>>>
>>> Neal Ensor
>>> nen...@gmail.com
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
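For what it's worth, the manual "kick" Neal describes (reloading the core from the admin page) can also be scripted against the CoreAdmin API while debugging; the host, port, and core name below are placeholders:

    curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1'

That only works around the symptom, of course; the slaves should complete the roll-over on their own once the stuck searcher wait is resolved.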