That amount of RAM can easily be eaten up depending on your sorting, faceting, and data.
Do you have gc logging enabled? That should describe what is happening with the heap.

- Mark

On Tue, Oct 6, 2015 at 4:04 PM Rallavagu <rallav...@gmail.com> wrote:
> Mark - currently 5.3 is being evaluated for upgrade purposes, and hopefully
> we will get there soon. Meanwhile, the following exception is noted in the
> logs during updates:
>
> ERROR org.apache.solr.update.CommitTracker – auto commit
> error...:java.lang.IllegalStateException: this writer hit an
> OutOfMemoryError; cannot commit
>         at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2807)
>         at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2984)
>         at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:559)
>         at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:440)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:896)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)
>         at java.lang.Thread.run(Thread.java:682)
>
> Considering that the machine is configured with 48G (24G for the JVM, which
> will be reduced in the future), I am wondering how it could still go out of
> memory. For the memory-mapped index files, the remaining 24G, or whatever is
> available of it, should be usable. Looking at the lsof output, the
> memory-mapped files were around 10G.
>
> Thanks.
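GC logging is normally switched on with a JVM startup flag and is the most direct way to see whether the 24G heap is actually filling up before these OutOfMemoryErrors. As a rough illustration only (not code from this thread; the class name and polling interval are arbitrary), the same heap and collector counters can also be read from inside any JVM, JRockit included, through the standard java.lang.management beans:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    // Minimal sketch: periodically print heap usage plus collection counts/times.
    public class HeapWatcher {
        public static void main(String[] args) throws InterruptedException {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            while (true) {
                MemoryUsage heap = memory.getHeapMemoryUsage();
                System.out.printf("heap used=%dMB committed=%dMB max=%dMB%n",
                        heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    System.out.printf("  %s: collections=%d totalTimeMs=%d%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                Thread.sleep(10000);
            }
        }
    }

A steadily climbing "used" figure with growing collection times between commits would point at heap pressure (sorting, faceting, caches) rather than at the memory-mapped index files, which live outside the Java heap.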
> On 10/5/15 5:41 PM, Mark Miller wrote:
> > I'd make two guesses:
> >
> > Looks like you are using JRockit? I don't think that is common or well
> > tested at this point.
> >
> > There are a billion or so bug fixes from 4.6.1 to 5.3.2. Given the pace of
> > SolrCloud development, you are dealing with something fairly ancient, so it
> > will most likely be harder to find help with older issues.
> >
> > - Mark
> >
> > On Mon, Oct 5, 2015 at 12:46 PM Rallavagu <rallav...@gmail.com> wrote:
> >
> >> Any takers on this? Any kind of clue would help. Thanks.
> >>
> >> On 10/4/15 10:14 AM, Rallavagu wrote:
> >>> As there have been no responses so far, I assume this is not a very
> >>> common issue that folks come across. So I went into the source (4.6.1)
> >>> to see if I could figure out the cause.
> >>>
> >>> The thread that is holding the lock is in this block of code:
> >>>
> >>>     synchronized (recoveryLock) {
> >>>       // to be air tight we must also check after lock
> >>>       if (cc.isShutDown()) {
> >>>         log.warn("Skipping recovery because Solr is shutdown");
> >>>         return;
> >>>       }
> >>>       log.info("Running recovery - first canceling any ongoing recovery");
> >>>       cancelRecovery();
> >>>
> >>>       while (recoveryRunning) {
> >>>         try {
> >>>           recoveryLock.wait(1000);
> >>>         } catch (InterruptedException e) {
> >>>
> >>>         }
> >>>         // check again for those that were waiting
> >>>         if (cc.isShutDown()) {
> >>>           log.warn("Skipping recovery because Solr is shutdown");
> >>>           return;
> >>>         }
> >>>         if (closed) return;
> >>>       }
> >>>
> >>> Subsequently, the thread goes into the cancelRecovery method, shown below:
> >>>
> >>>     public void cancelRecovery() {
> >>>       synchronized (recoveryLock) {
> >>>         if (recoveryStrat != null && recoveryRunning) {
> >>>           recoveryStrat.close();
> >>>           while (true) {
> >>>             try {
> >>>               recoveryStrat.join();
> >>>             } catch (InterruptedException e) {
> >>>               // not interruptible - keep waiting
> >>>               continue;
> >>>             }
> >>>             break;
> >>>           }
> >>>
> >>>           recoveryRunning = false;
> >>>           recoveryLock.notifyAll();
> >>>         }
> >>>       }
> >>>     }
> >>>
> >>> As per the stack trace, "recoveryStrat.join()" is where things are
> >>> holding up. I wonder why/how cancelRecovery could take so long that
> >>> around 870 threads end up waiting on it. Is it possible that ZK is not
> >>> responding, or could something else, such as operating system resources,
> >>> cause this? Thanks.
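Why one stuck recovery is enough to strand hundreds of threads is easier to see in a stripped-down, standalone sketch (illustrative only, with hypothetical names; this is not the Solr implementation): doRecovery takes the recoveryLock monitor and, still holding it, cancelRecovery join()s the old recovery thread, so while that thread refuses to finish, every later recovery request blocks on the same monitor.

    // Standalone sketch of the pattern described above: the monitor is held
    // across a join() on a thread that is not finishing, so every later caller
    // of the synchronized block piles up BLOCKED on the same monitor, much like
    // the 870 CoreAdminHandler threads in the dumps below.
    public class RecoveryLockConvoy {
        private final Object recoveryLock = new Object();

        void doRecovery() {                    // loosely modeled on doRecovery
            synchronized (recoveryLock) {
                cancelRecovery();
            }
        }

        void cancelRecovery() {                // loosely modeled on cancelRecovery
            synchronized (recoveryLock) {      // re-entrant: same monitor, still held
                Thread recoveryStrat = new Thread(new Runnable() {
                    public void run() {
                        // stand-in for a recovery that does not complete promptly
                        try { Thread.sleep(60000); } catch (InterruptedException ignored) { }
                    }
                });
                recoveryStrat.start();
                try {
                    recoveryStrat.join();      // recoveryLock stays held for the whole join
                } catch (InterruptedException ignored) { }
            }
        }

        public static void main(String[] args) {
            final RecoveryLockConvoy demo = new RecoveryLockConvoy();
            for (int i = 0; i < 100; i++) {    // stand-in for the incoming recovery requests
                new Thread(new Runnable() {
                    public void run() { demo.doRecovery(); }
                }).start();
            }
            // A thread dump taken now shows one thread WAITING in join() while
            // holding the monitor, and 99 threads BLOCKED trying to enter it.
        }
    }

On JRockit the blocked threads first spin on the fat lock (the jrockit/vm/Locks.fatLockBlockOrSpin frames in the dumps below), which is consistent with the blocked threads also showing up as CPU load rather than sitting idle.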
> >>> On 10/2/15 4:17 PM, Rallavagu wrote:
> >>>> Here is the stack trace of the thread that is holding the lock:
> >>>>
> >>>> "Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting,
> >>>> native_blocked, daemon
> >>>>     -- Waiting for notification on:
> >>>>        org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
> >>>>     at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
> >>>>     at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
> >>>>     at syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
> >>>>     at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
> >>>>     at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
> >>>>     at RJNI_jrockit_vm_Threads_waitForNotifySignal+73(rnithreads.c:72)@0x7ff31351939a
> >>>>     at jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native Method)
> >>>>     at java/lang/Object.wait(J)V(Native Method)
> >>>>     at java/lang/Thread.join(Thread.java:1206)
> >>>>     ^-- Lock released while waiting: org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
> >>>>     at java/lang/Thread.join(Thread.java:1259)
> >>>>     at org/apache/solr/update/DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:331)
> >>>>     ^-- Holding lock: java/lang/Object@0x114d8dd00[recursive]
> >>>>     at org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:297)
> >>>>     ^-- Holding lock: java/lang/Object@0x114d8dd00[fat lock]
> >>>>     at org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
> >>>>     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
> >>>>
> >>>> Stack trace of one of the 870 threads that is waiting for the lock to be
> >>>> released:
> >>>>
> >>>> "Thread-55489" id=77520 idx=0xebc tid=1494 prio=5 alive, blocked,
> >>>> native_blocked, daemon
> >>>>     -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
> >>>>     at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
> >>>>     at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
> >>>>     at syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
> >>>>     at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
> >>>>     at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
> >>>>     at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
> >>>>     at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
> >>>>     at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
> >>>>     at jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
> >>>>     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
> >>>>     at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
> >>>>     at org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
> >>>>     at org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
> >>>>     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
> >>>>
> >>>> On 10/2/15 4:12 PM, Rallavagu wrote:
> >>>>> Solr 4.6.1 on Tomcat 7, a single-shard, four-node cloud with a
> >>>>> three-node ZooKeeper ensemble.
> >>>>>
> >>>>> During updates, some nodes go to very high CPU and become unavailable.
> >>>>> The thread dump shows the following blocked thread; there are 870
> >>>>> threads blocked like this, which explains the high CPU. Any clues on
> >>>>> where to look?
> >>>>>
> >>>>> "Thread-56848" id=79207 idx=0x38 tid=3169 prio=5 alive, blocked,
> >>>>> native_blocked, daemon
> >>>>>     -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
> >>>>>     at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
> >>>>>     at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
> >>>>>     at syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
> >>>>>     at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
> >>>>>     at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
> >>>>>     at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
> >>>>>     at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
> >>>>>     at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
> >>>>>     at jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
> >>>>>     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
> >>>>>     at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
> >>>>>     at org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
> >>>>>     at org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
> >>>>>     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
--
- Mark
about.me/markrmiller
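A quick way to quantify what these dumps show, without reading 870 individual stack traces, is the standard java.lang.management ThreadMXBean: group the BLOCKED threads by the monitor they are waiting for and report its owner. A minimal in-process sketch (illustrative only; the class name is hypothetical, and against a running Solr node you would reach the same bean over a remote JMX connection or simply keep taking thread dumps):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch: count BLOCKED threads per contended monitor and name the owner.
    public class BlockedThreadReport {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            // true, true: include locked monitors and ownable synchronizers in the dump
            ThreadInfo[] infos = threads.dumpAllThreads(true, true);
            Map<String, Integer> blockedPerLock = new HashMap<String, Integer>();
            for (ThreadInfo info : infos) {
                if (info.getThreadState() == Thread.State.BLOCKED && info.getLockInfo() != null) {
                    String key = info.getLockInfo() + " held by " + info.getLockOwnerName();
                    Integer count = blockedPerLock.get(key);
                    blockedPerLock.put(key, count == null ? 1 : count + 1);
                }
            }
            for (Map.Entry<String, Integer> entry : blockedPerLock.entrySet()) {
                System.out.println(entry.getValue() + " thread(s) blocked on " + entry.getKey());
            }
        }
    }

Output along the lines of "870 thread(s) blocked on java.lang.Object@... held by Thread-55266" would confirm directly that a single monitor, held across cancelRecovery, is behind the pile-up.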