The Java discussion that is cited is not valid (at least in terms of
the test case provided).
The javadoc for RandomAccessFile.seek states:
/**
 * Sets the file-pointer offset, measured from the beginning of this
 * file, at which the next read or write occurs. The offset may be
 * set beyond the end of the file. Setting the offset beyond the end
 * of the file does not change the file length. The file length will
 * change only by writing after the offset has been set beyond the end
 * of the file.
 */
so the seeking does not affect the file length, meaning that all of
the lengths should be 0.
But since both of these methods are native, there is a real
possibility that some JVM or OS combination is not adhering to the
specification.
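For reference, a minimal sketch of the documented behavior (class name and offsets are arbitrary, not from the original report): seek alone should leave length() at 0, and only the subsequent write should grow the file.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SeekLengthCheck {
  public static void main(String[] args) throws IOException {
    File f = File.createTempFile("seekcheck", ".bin");
    RandomAccessFile raf = new RandomAccessFile(f, "rw");
    raf.seek(1024);                    // move the file pointer past EOF
    System.out.println(raf.length());  // expected 0: seek alone does not grow the file
    raf.write(1);                      // write one byte at offset 1024
    System.out.println(raf.length());  // expected 1025: length changes only after the write
    raf.close();
    f.delete();
  }
}

If a JVM or OS printed a non-zero length before the write, that would be the kind of spec violation described above.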
On Jan 8, 2007, at 7:27 PM, Doron Cohen (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463176 ]
Doron Cohen commented on LUCENE-140:
------------------------------------
Amazed by this long-lasting bug report, I was going down similar routes
to Mike, and I noticed three things -
(1) The sequence of ops brought by Jason is wrong:
-a- Open an IndexReader (#1) over an existing index (this reader is used for searching while updating the index)
-b- Using this reader (#1), do a search for the document(s) that you would like to update; obtain their document ID numbers
-c- Create an IndexWriter and add several new documents to the index (for me, this writing is done in other threads) (*)
-d- Close the IndexWriter (*)
-e- Open another IndexReader (#2) over the index
-f- Delete the previously found documents by their document ID numbers using reader #2
-g- Close the #2 reader
-h- Create another IndexWriter (#2) and re-add the updated documents
-i- Close the IndexWriter #2
-j- Close the original IndexReader (#1) and open a new reader for general searching
The problem here is that the docIDs found in (b) may have been altered
by step (d), so step (f) would delete the wrong docs. In particular, it
might attempt to delete IDs that are out of range. This might expose
exactly the BitVector problem, and would explain the whole thing, but I
too cannot see how it explains the delete-by-term case.
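As an aside, here is a minimal sketch of the delete-by-term alternative to steps (b)/(f); the helper class, the analyzer parameter, and the stored "id" keyword field are assumptions for illustration, not part of Jason's code. Identifying the document by a stored field avoids relying on docIDs obtained from a reader that may since have gone stale.

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

class UpdateByTerm {
  // Hypothetical helper: delete the old version by its stored "id" field,
  // then re-add the updated document, never touching raw docIDs.
  static void update(Directory dir, Analyzer analyzer, String id, Document newDoc)
      throws IOException {
    IndexReader reader = IndexReader.open(dir);
    reader.deleteDocuments(new Term("id", id)); // delete by stored id, not by docID
    reader.close();

    IndexWriter writer = new IndexWriter(dir, analyzer, false);
    writer.addDocument(newDoc); // re-add the updated version
    writer.close();
  }
}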
(2) BitVector's silent ignoring of attempts to delete slightly-out-of-bound
docs that fall in the higher byte - this is the problem that Mike fixed.
I think the fix is okay - though some applications might now get
exceptions they did not get in the past - but I believe this is for
their own good.
However, when I first ran into this I didn't notice that
BitVector.size() would become wrong as a result of this - nice catch
Mike!
I think, however, that the test Mike added does not expose the
docs-out-of-order bug - I tried this test without the fix and it only
fails on the "gotException" assert - if you comment out this assert the
test passes.
The following test would expose the out-of-order bug - it would fail
with "docs out of order" before the fix, and would succeed with it.
public void testOutOfOrder () throws IOException {
  String tempDir = System.getProperty("java.io.tmpdir");
  if (tempDir == null) {
    throw new IOException("java.io.tmpdir undefined, cannot run test: " + getName());
  }
  File indexDir = new File(tempDir, "lucenetestindexTemp");
  Directory dir = FSDirectory.getDirectory(indexDir, true);

  boolean create = true;
  int numDocs = 0;
  int maxDoc = 0;
  while (numDocs < 100) {
    // anlzr is the Analyzer used by the test class (not shown here)
    IndexWriter iw = new IndexWriter(dir, anlzr, create);
    create = false;
    iw.setUseCompoundFile(false);
    for (int i = 0; i < 2; i++) {
      Document d = new Document();
      d.add(new Field("body", "body" + i, Store.NO, Index.UN_TOKENIZED));
      iw.addDocument(d);
    }
    iw.optimize();
    iw.close();

    IndexReader ir = IndexReader.open(dir);
    numDocs = ir.numDocs();
    maxDoc = ir.maxDoc();
    assertEquals(numDocs, maxDoc);
    // try to delete docs around and beyond maxDoc; out-of-range ids should throw
    for (int i = 7; i >= -1; i--) {
      try {
        ir.deleteDocument(maxDoc + i);
      } catch (ArrayIndexOutOfBoundsException e) {
        // expected for ids >= maxDoc
      }
    }
    ir.close();
  }
}
Mike, do you agree?
(3) maxDoc() computation in SegmentReader is based (on some paths) on
RandomAccessFile.length(). IIRC I saw cases (in a previous project)
where File.length() or RAF.length() (not sure which of the two) did
not always reflect the real length if the system was very busy
IO-wise, unless FD.sync() was called (with a performance hit).
This post seems relevant - RAF.length over 2GB in NFS -
http://forum.java.sun.com/thread.jspa?threadID=708670&messageID=4103657
Not sure if this can be the case here, but at least we can discuss
whether it is better to always store the length.
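For concreteness, a minimal sketch of what "sync before trusting the length" would look like; the helper name is hypothetical, and whether the sync is actually needed here is exactly the open question.

import java.io.FileDescriptor;
import java.io.IOException;
import java.io.RandomAccessFile;

class SyncedLength {
  // Hypothetical illustration: flush OS buffers via FileDescriptor.sync()
  // (at a performance cost) before reading RandomAccessFile.length().
  static long syncedLength(RandomAccessFile raf) throws IOException {
    FileDescriptor fd = raf.getFD();
    fd.sync();           // force buffered writes to the device
    return raf.length(); // length observed after the sync
  }
}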
docs out of order
-----------------
Key: LUCENE-140
URL: https://issues.apache.org/jira/browse/LUCENE-140
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: unspecified
Environment: Operating System: Linux
Platform: PC
Reporter: legez
Assigned To: Michael McCandless
Attachments: bug23650.txt, corrupted.part1.rar,
corrupted.part2.rar
Hello,
I cannot find out why (and what) it is happening all the time. I got an
exception:
java.lang.IllegalStateException: docs out of order
        at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
        at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
        at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
        at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
        at Optimize.main(Optimize.java:29)
It happens both in 1.2 and 1.3rc1 (by the way, what happened to
1.3rc1? I can find it neither in the downloads nor in the version list
in this form).
Everything seems OK: I can search through the index, but I cannot
optimize it. Even worse, after this exception, every time I add new
documents and close the IndexWriter a new segment is created! I think
it contains all the documents added before, judging by its size.
My index is quite big: 500,000 docs, about 5 GB of index directory.
It is _repeatable_. I drop the index and reindex everything. Afterwards
I add a few docs, try to optimize, and receive the above exception.
My documents' structure is:
static Document indexIt(String id_strony, Reader reader, String data_wydania,
                        String id_wydania, String id_gazety, String data_wstawienia)
{
  Document doc = new Document();
  doc.add(Field.Keyword("id", id_strony));
  doc.add(Field.Keyword("data_wydania", data_wydania));
  doc.add(Field.Keyword("id_wydania", id_wydania));
  doc.add(Field.Text("id_gazety", id_gazety));
  doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
  doc.add(Field.Text("tresc", reader));
  return doc;
}
Sincerely,
legez