Re: debugging long commits
Hi Brian, You might want to follow up on the Lucene list (java-u...@lucene.apache.org). Something was causing problems with the merging, so you ended up with too many segments (hence the slow commits). I doubt that you lost anything -- usually the merge doesn't modify the index until the merge is complete. But I am not familiar enough with this code in Lucene to be sure.

-Mike

On 2-Jan-09, at 10:17 AM, Brian Whitman wrote:

I think I'm getting close with this (sorry for the self-replies). I tried an optimize (which we never do); it took 30m and said this a lot:

Exception in thread "Lucene Merge Thread #4" org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 34950
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:314)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 34950
    at org.apache.lucene.util.BitVector.get(BitVector.java:91)
    at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:125)
    at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:98)
    at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:633)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:585)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:546)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:499)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:139)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4291)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3932)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)

Jan 2, 2009 6:05:49 PM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: background merge hit exception: _ks4:C2504982 _oaw:C514635 _tll:C827949 _tdx:C18372 _te8:C19929 _tej:C22201 _1agw:C1717926 into _1agy [optimize]
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2346)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2280)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:355)
    at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcesso ...

But then it finished, and now commits are OK again. Does anyone know what the merge hit exception means and whether I lost anything?
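Mike's point -- that a merge writes its output to a new segment and leaves the existing segments untouched until it succeeds -- is why an aborted merge normally loses nothing. The idea can be sketched as a generic write-then-replace pattern (a simplified analogy in Python, not Lucene's actual code; the function and file names are hypothetical):

```python
import os
import tempfile

def merge_segments(segment_paths, merged_path):
    """Merge several 'segment' files into one, never touching the
    originals until the merged result is fully written.
    (A simplified analogy of Lucene's merge behavior, not its real code.)"""
    # Write the merged output to a temporary file first.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(merged_path) or ".")
    try:
        with os.fdopen(fd, "wb") as out:
            for path in segment_paths:
                with open(path, "rb") as seg:
                    out.write(seg.read())
        # Only after the merge completed successfully is the result made
        # visible. A crash or exception before this point leaves the old
        # segments intact, which is why a failed merge usually doesn't
        # lose data.
        os.replace(tmp, merged_path)
    except BaseException:
        os.unlink(tmp)
        raise
```

If the `ArrayIndexOutOfBoundsException` above killed the merge thread partway through, the analogue here is an exception inside the loop: the temporary file is discarded and the original segments are still on disk.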
Re: debugging long commits
I think I'm getting close with this (sorry for the self-replies). I tried an optimize (which we never do); it took 30m and said this a lot:

Exception in thread "Lucene Merge Thread #4" org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 34950
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:314)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 34950
    at org.apache.lucene.util.BitVector.get(BitVector.java:91)
    at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:125)
    at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:98)
    at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:633)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:585)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:546)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:499)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:139)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4291)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3932)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)

Jan 2, 2009 6:05:49 PM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: background merge hit exception: _ks4:C2504982 _oaw:C514635 _tll:C827949 _tdx:C18372 _te8:C19929 _tej:C22201 _1agw:C1717926 into _1agy [optimize]
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2346)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2280)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:355)
    at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcesso ...

But then it finished, and now commits are OK again. Does anyone know what the merge hit exception means and whether I lost anything?
Re: debugging long commits
Not sure if these help. Here's the stack trace and jmap -histo output during a long (bad) commit:

Full thread dump Java HotSpot(TM) 64-Bit Server VM (11.0-b16 mixed mode):

"Attach Listener" daemon prio=10 tid=0x2aabf9954400 nid=0x5e1c runnable [0x..0x42048d20]
   java.lang.Thread.State: RUNNABLE

"Lucene Merge Thread #0" daemon prio=10 tid=0x2aabf9c7f400 nid=0x5e00 runnable [0x41f47000..0x41f47ba0]
   java.lang.Thread.State: RUNNABLE
    at org.apache.lucene.store.RAMOutputStream.flush(RAMOutputStream.java:147)
    at org.apache.lucene.store.RAMOutputStream.writeTo(RAMOutputStream.java:56)
    at org.apache.lucene.index.MultiLevelSkipListWriter.writeSkip(MultiLevelSkipListWriter.java:146)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:588)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:546)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:499)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:139)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4291)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3932)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)

"pool-2-thread-1" prio=10 tid=0x2aabf9b94c00 nid=0x5dc3 waiting on condition [0x41a42000..0x41a42aa0]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x2aaabecd12d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
    at java.util.concurrent.DelayQueue.take(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

"DestroyJavaVM" prio=10 tid=0x2aabfc006c00 nid=0x5da3 waiting on condition [0x..0x4022ad10]
   java.lang.Thread.State: RUNNABLE

"Timer-2" daemon prio=10 tid=0x2aabfc007c00 nid=0x5dc1 in Object.wait() [0x41e46000..0x41e46d20]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x2aaabef91338> (a java.util.TaskQueue)
    at java.util.TimerThread.mainLoop(Unknown Source)
    - locked <0x2aaabef91338> (a java.util.TaskQueue)
    at java.util.TimerThread.run(Unknown Source)

"Checkpointer" daemon prio=10 tid=0x2aabf995b800 nid=0x5dc0 in Object.wait() [0x41d45000..0x41d45da0]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x2aaabebf9700> (a java.lang.Object)
    at java.lang.Object.wait(Object.java:485)
    at com.sleepycat.je.utilint.DaemonThread.run(DaemonThread.java:152)
    - locked <0x2aaabebf9700> (a java.lang.Object)
    at java.lang.Thread.run(Unknown Source)

"Cleaner-1" daemon prio=10 tid=0x2aabf995b000 nid=0x5dbf in Object.wait() [0x41c44000..0x41c44c20]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x2aaabebee158> (a java.lang.Object)
    at java.lang.Object.wait(Object.java:485)
    at com.sleepycat.je.utilint.DaemonThread.run(DaemonThread.java:152)
    - locked <0x2aaabebee158> (a java.lang.Object)
    at java.lang.Thread.run(Unknown Source)

"INCompressor" daemon prio=10 tid=0x2aabf9a3cc00 nid=0x5dbe in Object.wait() [0x41b43000..0x41b43ca0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x2aaabeef3500> (a java.lang.Object)
    at com.sleepycat.je.utilint.DaemonThread.run(DaemonThread.java:154)
    - locked <0x2aaabeef3500> (a java.lang.Object)
    at java.lang.Thread.run(Unknown Source)

"pool-1-thread-1" prio=10 tid=0x2aabfa015800 nid=0x5dbc waiting on condition [0x41941000..0x41941ba0]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x2aaabef91a08> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(Unknown Source)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown Source)
    at java.util.concurrent.LinkedBlockingQueue.take(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

"Timer-1" daemo
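The jmap -histo output mentioned above can be long; a small script helps pull out the classes that dominate the heap during the slow commit. Here is a sketch (the sample rows below are made up, and real jmap column layouts vary slightly between JVM versions):

```python
def top_classes(histo_text, n=3):
    """Return the n classes with the largest #bytes from `jmap -histo`
    output. Assumes the usual 'rank: instances bytes classname' rows."""
    rows = []
    for line in histo_text.splitlines():
        parts = line.split()
        # Data rows look like: "   1:    500000   40000000  [C"
        if len(parts) == 4 and parts[0].rstrip(":").isdigit():
            _rank, instances, nbytes, cls = parts
            rows.append((int(nbytes), int(instances), cls))
    rows.sort(reverse=True)  # largest byte counts first
    return [(cls, nbytes, inst) for nbytes, inst, cls in rows[:n]]

# Made-up sample in the jmap -histo layout:
sample = """\
 num     #instances         #bytes  class name
----------------------------------------------
   1:        500000       40000000  [C
   2:        100000       24000000  [B
   3:        450000       10800000  java.lang.String
"""
```

During a merge-heavy commit you'd typically expect byte arrays (`[B`), char arrays (`[C`), and Lucene term/posting structures near the top; anything unexpected there is worth a closer look.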
debugging long commits
We have a distributed setup that has been experiencing glacially slow commit times on only some of the shards (10s on a good shard, 263s on a slow one). Each shard for this index holds about 10GB of Lucene index data, and documents are segregated by an MD5 hash, so the distribution of document/data types should be equal across all shards.

I've turned off our postcommit hooks to isolate the problem, so it's not a snapshot run amok or anything. I also moved the indexes over to new machines, and the same indexes that were slow in production are also slow on the test machines. During the slow commit, the Jetty process sits at 100% CPU / 50% RAM on an 8GB quad-core machine. The slow commit happens every time after I add at least one document (if I don't add any documents, the commit is immediate).

What can I do to look into this problem?
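One way to start narrowing this down is to time the commit in isolation, shard by shard, with no other indexing traffic. A minimal helper for that might look like the following (illustrative only; the base URL is a placeholder for your shard's update handler, and it assumes the standard `/update?commit=true` endpoint):

```python
import time
import urllib.request

def timed(fn, *args, **kwargs):
    """Run fn and return (elapsed_seconds, result)."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    return time.monotonic() - start, result

def solr_commit(base_url):
    """Issue an explicit commit to a Solr update handler.
    base_url is a placeholder, e.g. http://shard1:8983/solr"""
    with urllib.request.urlopen(base_url + "/update?commit=true") as resp:
        return resp.read()

# Example (requires a running Solr shard; URL is hypothetical):
# elapsed, _ = timed(solr_commit, "http://shard1:8983/solr")
# print("commit took %.1fs" % elapsed)
```

Running this against each shard before and after adding a single document would confirm whether the 263s is entirely inside the commit itself, and comparing segment counts between a fast and a slow shard's index directories would show whether the slow ones are simply carrying far more segments.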