Re: debugging long commits

2009-01-07 Thread Mike Klaas


Hi Brian,

You might want to follow up on the Lucene list (java-u...@lucene.apache.org). Something was causing problems with the merging, and thus you ended up with too many segments (hence the slow commits). I doubt that you lost anything--usually a merge doesn't modify the index until the merge is complete. But I am not familiar enough with this code in Lucene to be sure.
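
If you want to double-check, Lucene ships a CheckIndex tool that walks every segment and reports any it cannot read. A minimal sketch, assuming Lucene 2.4 and that the jar and index paths are adjusted for your install:

  java -cp lucene-core-2.4.0.jar org.apache.lucene.index.CheckIndex /path/to/index
  # re-running with -fix drops unreadable segments -- note this deletes the documents they contain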


-Mike

On 2-Jan-09, at 10:17 AM, Brian Whitman wrote:


I think I'm getting close with this (sorry for the self-replies)

I tried an optimize (which we never do) and it took 30m and said this a lot:


Exception in thread "Lucene Merge Thread #4"
org.apache.lucene.index.MergePolicy$MergeException:
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 34950
at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:314)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 34950
at org.apache.lucene.util.BitVector.get(BitVector.java:91)
at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:125)
at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:98)
at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:633)
at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:585)
at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:546)
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:499)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:139)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4291)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3932)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)
Jan 2, 2009 6:05:49 PM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: background merge hit exception: _ks4:C2504982
_oaw:C514635 _tll:C827949 _tdx:C18372 _te8:C19929 _tej:C22201 _1agw:C1717926
into _1agy [optimize]
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2346)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2280)
at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:355)
at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcesso
...


But then it finished. And now commits are OK again.

Anyone know what the merge hit exception means and if I lost anything?




Re: debugging long commits

2009-01-02 Thread Brian Whitman
I think I'm getting close with this (sorry for the self-replies)

I tried an optimize (which we never do) and it took 30m and said this a lot:

Exception in thread "Lucene Merge Thread #4"
org.apache.lucene.index.MergePolicy$MergeException:
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 34950
at
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:314)
at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Array index out of
range: 34950
at org.apache.lucene.util.BitVector.get(BitVector.java:91)
at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:125)
at
org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:98)
at
org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:633)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:585)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:546)
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:499)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:139)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4291)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3932)
at
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)
Jan 2, 2009 6:05:49 PM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: background merge hit exception: _ks4:C2504982
_oaw:C514635 _tll:C827949 _tdx:C18372 _te8:C19929 _tej:C22201 _1agw:C1717926
into _1agy [optimize]
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2346)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2280)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:355)
at
org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcesso
...


But then it finished. And now commits are OK again.

Anyone know what the merge hit exception means and if I lost anything?
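
For reference, the optimize above is the kind of request that can also be issued by hand; a minimal sketch against a stock Solr instance of that era, where the host, port, and handler path are assumptions to adjust for your setup:

  curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' -d '<optimize/>'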


Re: debugging long commits

2009-01-02 Thread Brian Whitman
Not sure if these help, but here's the stack trace and jmap -histo output during
a long (bad) commit:
Full thread dump Java HotSpot(TM) 64-Bit Server VM (11.0-b16 mixed mode):

"Attach Listener" daemon prio=10 tid=0x2aabf9954400 nid=0x5e1c runnable
[0x..0x42048d20]
   java.lang.Thread.State: RUNNABLE

"Lucene Merge Thread #0" daemon prio=10 tid=0x2aabf9c7f400 nid=0x5e00
runnable [0x41f47000..0x41f47ba0]
   java.lang.Thread.State: RUNNABLE
at org.apache.lucene.store.RAMOutputStream.flush(RAMOutputStream.java:147)
at org.apache.lucene.store.RAMOutputStream.writeTo(RAMOutputStream.java:56)
at
org.apache.lucene.index.MultiLevelSkipListWriter.writeSkip(MultiLevelSkipListWriter.java:146)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:588)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:546)
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:499)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:139)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4291)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3932)
at
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)

"pool-2-thread-1" prio=10 tid=0x2aabf9b94c00 nid=0x5dc3 waiting on
condition [0x41a42000..0x41a42aa0]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x2aaabecd12d0> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown
Source)
at java.util.concurrent.DelayQueue.take(Unknown Source)
at
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown
Source)
at
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown
Source)
at java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

"DestroyJavaVM" prio=10 tid=0x2aabfc006c00 nid=0x5da3 waiting on
condition [0x..0x4022ad10]
   java.lang.Thread.State: RUNNABLE

"Timer-2" daemon prio=10 tid=0x2aabfc007c00 nid=0x5dc1 in Object.wait()
[0x41e46000..0x41e46d20]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x2aaabef91338> (a java.util.TaskQueue)
at java.util.TimerThread.mainLoop(Unknown Source)
- locked <0x2aaabef91338> (a java.util.TaskQueue)
at java.util.TimerThread.run(Unknown Source)

"Checkpointer" daemon prio=10 tid=0x2aabf995b800 nid=0x5dc0 in
Object.wait() [0x41d45000..0x41d45da0]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x2aaabebf9700> (a java.lang.Object)
at java.lang.Object.wait(Object.java:485)
at com.sleepycat.je.utilint.DaemonThread.run(DaemonThread.java:152)
- locked <0x2aaabebf9700> (a java.lang.Object)
at java.lang.Thread.run(Unknown Source)

"Cleaner-1" daemon prio=10 tid=0x2aabf995b000 nid=0x5dbf in
Object.wait() [0x41c44000..0x41c44c20]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x2aaabebee158> (a java.lang.Object)
at java.lang.Object.wait(Object.java:485)
at com.sleepycat.je.utilint.DaemonThread.run(DaemonThread.java:152)
- locked <0x2aaabebee158> (a java.lang.Object)
at java.lang.Thread.run(Unknown Source)

"INCompressor" daemon prio=10 tid=0x2aabf9a3cc00 nid=0x5dbe in
Object.wait() [0x41b43000..0x41b43ca0]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x2aaabeef3500> (a java.lang.Object)
at com.sleepycat.je.utilint.DaemonThread.run(DaemonThread.java:154)
- locked <0x2aaabeef3500> (a java.lang.Object)
at java.lang.Thread.run(Unknown Source)

"pool-1-thread-1" prio=10 tid=0x2aabfa015800 nid=0x5dbc waiting on
condition [0x41941000..0x41941ba0]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x2aaabef91a08> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown
Source)
at java.util.concurrent.LinkedBlockingQueue.take(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

"Timer-1" daemo

debugging long commits

2009-01-02 Thread Brian Whitman
We have a distributed setup that has been experiencing glacially slow commit
times on only some of the shards (10s on a good shard, 263s on a slow shard).
Each shard for this index has about 10GB of Lucene index data, and the
documents are segregated by an MD5 hash, so the distribution of document/data
types should be equal across all shards. I've turned off our postcommit hooks
to isolate the problem, so it's not a snapshot run amok or anything. I also
moved the indexes over to new machines, and the same indexes that were slow in
production are also slow on the test machines.

During the slow commit, the Jetty process is at 100% CPU / 50% RAM on an 8GB
quad-core machine. The slow commit happens every time after I add at least
one document. (If I don't add any documents, the commit is immediate.)

What can I do to look into this problem?
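
One way to see where the time goes is to capture a thread dump and a heap histogram while a slow commit is in flight. A minimal sketch, assuming a Sun JDK install and that <pid> stands in for the Jetty process id:

  jstack <pid> > threads.txt     # full thread dump: where each thread is stuck
  jmap -histo <pid> > histo.txt  # heap histogram: which objects fill the heap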