[ 
https://issues.apache.org/jira/browse/LUCENE-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597919#action_12597919
 ] 

Michael McCandless commented on LUCENE-1282:
--------------------------------------------

Using the 19 GB index I have that consistently reproduces this hotspot bug, I 
boiled the bug down to a very small testcase that no longer involves Lucene.

However, this occurrence of the bug is slightly different: for me, the bug 
consistently happens when I specify -Xbatch on the java command line, and only 
rarely happens without it.  Nonetheless, I'm hopeful that if Sun fixes this one 
test case properly, it will fix all the odd exceptions we've been seeing from 
this code.

I opened the bug 4 days ago (5/15) with http://bugs.sun.com, but have yet to 
hear if it's been accepted as a real bug.

If others could try out the code below on their Linux boxes, using Sun's Java 
1.6.0_04/05 and specifying -Xbatch, to see if the bug can be reproduced, that'd 
be great.

Here's the bug I opened:

{code}
Date Created: Thu May 15 11:53:15 MDT 2008
Type:        bug
Customer Name:   Michael McCandless
Customer Email:  [EMAIL PROTECTED]
SDN ID:       [EMAIL PROTECTED]
status:      Waiting
Category:    hotspot
Subcategory: runtime_system
Company:     IBM
release:     6
hardware:    x86
OSversion:   linux
priority:    4
Synopsis:    Simple code runs incorrectly with -Xbatch
Description:
 FULL PRODUCT VERSION :
java version "1.6.0_06"
Java(TM) SE Runtime Environment (build 1.6.0_06-b02)
Java HotSpot(TM) Server VM (build 10.0-b22, mixed mode)



FULL OS VERSION :
Linux 2.6.22.1 #7 SMP PREEMPT Tue Mar 18 18:22:09 EDT 2008 i686 GNU/Linux

A DESCRIPTION OF THE PROBLEM :
On the Apache Lucene project, we've now had 4 users hit by an apparent
JRE bug.  When this bug strikes, it silently corrupts the search
index, which is very costly to the user (makes the index unusable).
Details are here:

  https://issues.apache.org/jira/browse/LUCENE-1282

I can reliably reproduce the bug, but only on a very large (19 GB)
search index.  However, I narrowed down one variant of the bug to the
attached test case.



THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: No

THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Compile and run the attached code (Crash.java) with -Xbatch and it should fail
(i.e., throw the RuntimeException, incorrectly).  It should pass without -Xbatch.





EXPECTED VERSUS ACTUAL BEHAVIOR :
Expected: no RuntimeException is thrown.  Actual: the RuntimeException is thrown.
REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
public class Crash {

  public static void main(String[] args) throws Throwable {
    new Crash().crash();
  }

  private Object alwaysNull;

  final void crash() throws Throwable {
    for (int r = 0; r < 3; r++) {
      for (int docNum = 0; docNum < 10000;) {
        if (r < 2) {
          for(int j=0;j<3000;j++)
            docNum++;
        } else {
          docNum++;
          doNothing(getNothing());
          if (alwaysNull != null) {
            throw new RuntimeException("BUG: checkAbort is always null: r=" + r
                + " of 3; docNum=" + docNum);
          }
        }
      }
    }
  }

  Object getNothing() {
    return this;
  }

  int x;
  void doNothing(Object o) {
    x++;
  }
}


---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
Don't specify -Xbatch.  You can also tweak the code to make it pass the test:
reducing the 10000 or 3000 far enough makes it pass, as does changing the
doNothing(...) line to first assign the result of getNothing() to an
intermediate variable (this is the approach we plan to use for Lucene), or
removing the x++.
workaround:  
comments:    (company - IBM , email - [EMAIL PROTECTED])
{code}
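
For anyone who wants to try the intermediate-variable workaround described in 
the bug report, here is a minimal sketch of what that variant could look like.  
The class name CrashWorkaround, the local variable name and the run() method 
that returns the call count are my own illustrative choices, not part of the 
code submitted to Sun, and I can't promise this exact variant avoids the bug on 
every affected JRE.

```java
public class CrashWorkaround {

  // Never assigned anywhere, so this must always be null.
  private Object alwaysNull;

  // Counts calls to doNothing.
  int x;

  Object getNothing() {
    return this;
  }

  void doNothing(Object o) {
    x++;
  }

  // Same loop structure as Crash.crash(); returns the number of doNothing calls.
  final int run() {
    for (int r = 0; r < 3; r++) {
      for (int docNum = 0; docNum < 10000;) {
        if (r < 2) {
          for (int j = 0; j < 3000; j++)
            docNum++;
        } else {
          docNum++;
          // Workaround: assign the result to an intermediate variable first,
          // instead of calling doNothing(getNothing()) directly.
          Object nothing = getNothing();
          doNothing(nothing);
          if (alwaysNull != null) {
            throw new RuntimeException("BUG: alwaysNull is not null: r=" + r
                + " of 3; docNum=" + docNum);
          }
        }
      }
    }
    return x;
  }

  public static void main(String[] args) {
    // On a correctly behaving JVM, doNothing runs once per docNum in the last pass.
    System.out.println("doNothing calls: " + new CrashWorkaround().run());
  }
}
```

Run it the same way as Crash.java (with and without -Xbatch) and compare.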

> Sun hotspot compiler bug in 1.6.0_04/05 affects Lucene
> ------------------------------------------------------
>
>                 Key: LUCENE-1282
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1282
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.3, 2.3.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: corrupt_merge_out15.txt
>
>
> This is not a Lucene bug.  It's an as-yet not fully characterized Sun
> JRE bug, as best I can tell.  I'm opening this to gather all things we
> know, and to work around it in Lucene if possible, and maybe open an
> issue with Sun if we can reduce it to a compact test case.
> It's hit at least 3 users:
>   
>   http://mail-archives.apache.org/mod_mbox/lucene-java-user/200803.mbox/[EMAIL PROTECTED]
>   http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200804.mbox/[EMAIL PROTECTED]
>   http://mail-archives.apache.org/mod_mbox/lucene-java-user/200805.mbox/[EMAIL PROTECTED]
> It affects Lucene on at least JRE 1.6.0_04 and 1.6.0_05, whereas
> 1.6.0_03 works OK and it's unknown whether 1.6.0_06 shows it.
> The bug affects bulk merging of stored fields.  When it strikes, the
> segment produced by a merge is corrupt because its fdx file (stored
> fields index file) is missing one document.  After iterating many
> times with the first user that hit this, adding diagnostics &
> assertions, it seems that a call to fieldsWriter.addDocument
> sometimes either fails to run entirely or fails to invoke its call to
> indexStream.writeLong.  It's as if when hotspot compiles a method,
> there's some sort of race condition in cutting over to the compiled
> code whereby a single method call fails to be invoked (speculation).
> Unfortunately, this corruption is silent when it occurs and only later
> detected when a merge tries to merge the bad segment, or an
> IndexReader tries to open it.  Here's a typical merge exception:
> {code}
> Exception in thread "Thread-10" org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.index.CorruptIndexException:
>     doc counts differ for segment _3gh: fieldsReader shows 15999 but segmentInfo shows 16000
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:271)
> Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _3gh: fieldsReader shows 15999 but segmentInfo shows 16000
>         at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
>         at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
>         at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:221)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3099)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2834)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)
> {code}
> and here's a typical exception hit when opening a searcher:
> {code}
> org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _kk: fieldsReader shows 72670 but segmentInfo shows 72671
>         at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
>         at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
>         at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:230)
>         at org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:73)
>         at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:636)
>         at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
>         at org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
>         at org.apache.lucene.index.IndexReader.open(IndexReader.java:173)
>         at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:48)
> {code}
> Sometimes, adding -Xbatch (forces up front compilation) or -Xint
> (disables compilation) to the java command line works around the
> issue.
> Here are some of the OSes we've seen the failure on:
> {code}
> SuSE 10.0
> Linux phoebe 2.6.13-15-smp #1 SMP Tue Sep 13 14:56:15 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux
> SuSE 8.2
> Linux phobos 2.4.20-64GB-SMP #1 SMP Mon Mar 17 17:56:03 UTC 2003 i686 unknown unknown GNU/Linux
> Red Hat Enterprise Linux Server release 5.1 (Tikanga)
> Linux lab8.betech.virginia.edu 2.6.18-53.1.14.el5 #1 SMP Tue Feb 19 07:18:21 EST 2008 i686 i686 i386 GNU/Linux
> {code}
> I've already added assertions to Lucene to detect when this bug
> strikes, but since assertions are not usually enabled, I plan to add a
> real check to catch when this bug strikes *before* we commit the merge
> to the index.  This way we can detect & quarantine the failure and
> prevent corruption from entering the index.
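
To make the planned pre-commit check concrete, here is a rough sketch of the 
idea, not the actual Lucene change.  It assumes, purely for illustration, that 
the stored fields index file holds exactly one 8-byte pointer (one writeLong) 
per document and nothing else; the real fdx layout may differ.  The class 
FdxCheckSketch and its methods are hypothetical names.

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class FdxCheckSketch {

  // Assumption: one 8-byte pointer per document and no file header.
  static final int BYTES_PER_DOC = 8;

  // Throws if the index file's length does not match the expected doc count.
  static void checkIndexFile(File fdx, int docCount) {
    long expected = (long) docCount * BYTES_PER_DOC;
    long actual = fdx.length();
    if (actual != expected) {
      throw new IllegalStateException("merge produced an invalid result: docCount is "
          + docCount + " but index file size is " + actual
          + " (expected " + expected + "); aborting before commit");
    }
  }

  // Test helper: writes the given number of 8-byte pointers to a temp file.
  static File writeIndexFile(int pointers) {
    try {
      File f = File.createTempFile("sketch", ".fdx");
      f.deleteOnExit();
      try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
        for (int i = 0; i < pointers; i++) {
          out.writeLong(i);
        }
      }
      return f;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // A healthy merge: one pointer per document, so the check passes silently.
    checkIndexFile(writeIndexFile(16000), 16000);

    // Simulate the bug: one writeLong was skipped, so the check must throw.
    try {
      checkIndexFile(writeIndexFile(15999), 16000);
    } catch (IllegalStateException e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```

Because it compares a file length against the expected count instead of relying 
on an assert, a check like this still runs when assertions are disabled.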


