Fwd: Re: possible bug with indexing with term vectors

Michael McCandless Sat, 29 Sep 2007 04:27:15 -0700

I forgot to CC java-dev in my response:

Mike


"Michael McCandless" <[EMAIL PROTECTED]> wrote:
> "Andi Vajda" <[EMAIL PROTECTED]> wrote:
> >
> > On Fri, 28 Sep 2007, Michael McCandless wrote:
> > 
> > >> I tried all morning to isolate the problem but I seem to be unable
> > >> to reproduce it in a simple unit test. In my application, I've been
> > >> able to get errors by doing even less: just creating a FSDirectory
> > >> and adding documents with fields with term vectors fails when
> > >> optimizing the index with the error below. I even tried to add the
> > >> same documents, in the same order, in the unit test but to no
> > >> avail. It just works.
> > >
> > > Are you trying your unit test first in Python (using PyLucene)?
> > 
> > No, I wrote it in Java to begin with. But it's a good idea. I should try it 
> > from PyLucene too.
> 
> Yeah if we can first repro it in Python that's at least one step of
> simplification.
> 
> You could also use MockRAMDirectory as your dir in the test: it
> behaves like windows and will catch certain bugs eg if something opens
> an index file that had not actually been closed.
> 
> > >> What is different about my environment ? Well, I'm running PyLucene,
> > >> but the new one, the one using a Apple's Java VM, the same VM I'm
> > >> using to run the unit test. And I'm not doing anything special like
> > >> calling back into Python or something, I'm just calling regular
> > >> Lucene APIs adding documents into an IndexWriter on an FSDirectory
> > >> using a StandardAnalyzer. If I stop using term vectors, all is
> > >> working fine.
> > >
> > > Are your documents irregular wrt term vectors?  (Ie some docs have
> > > none, others store the terms but not positions/offsets, etc?).  Any
> > > interesting changes to Lucene's defaults (autoCommit=false, etc)?
> > 
> > All default config values, no config changes. All documents follow the
> > same pattern of having five fields, one with term vectors.
> 
> OK.
> 
> > >> I'd like to get to the bottom of this but could use some help. Does
> > >> the stacktrace below ring a bell ? Is there a way to run the whole
> > >> indexing and optimizing in one single thread ?
> > >
> > > You can easily turn off the concurrent (background) merges by doing
> > > this:
> > >
> > >  writer.setMergeScheduler(new SerialMergeScheduler())
> > >
> > > though that probably isn't punched through to Python in PyLucene.  You
> > > can also build a Lucene JAR w/ a small change to IndexWriter.java to
> > > do the same thing.
> > 
> > The new PyLucene is built with a code generator and all public APIs and 
> > classes are made available to Python. SerialMergeScheduler is available.
> 
> Wild!  Does this mean PyLucene will track tightly to Lucene releases
> going forward?
> 
> > > That stacktrace is happening while merging term vectors during an
> > > optimize.  It's specifically occuring when loading the term vectors
> > > for a given doc X; we read a position from the index stream (tvx) just
> > > fine, but then when we try to read the first vInt from the document
> > > stream (tvd) we hit the EOF exception.  So that position was too large
> > > or the tvd file was somehow truncated.  Weird.
> > >
> > > Can you call "writer.setInfoStream(System.out)" and get the error to
> > > occur and then post the resulting log?  It may shed some light
> > > here....
> > 
> > I called writer.setMergeScheduler(SerialMergeScheduler()) just after 
> > creating 
> > the writer and called writer.setInfoStream(System.out) just before calling 
> > optimize(). Below is what I get:
> 
> What happened prior to this first optimize call?  Did you just create
> the writer, switch to SerialMergeScheduler, add N docs, then call
> setInfoStream(...) and writer.optimize()?
> 
> The debug output starts with an optimize() call, which first flushes
> 372 docs to segment _7f; this is the first segment in the index.  Had
> you opened this writer with create=true?
> 
> This optimize() does nothing because the index has only one segment
> (_7f) in compound file format, so it's already optimized.  Then the
> writer is closed.
> 
> Then this is printed:
> 
>   <DBRepositoryView: Lucene (1)> indexed 191 items in 0:00:00.413600
> 
> Which is odd because 191 != 372.  Can't explain that difference...
> 
> Then another index writer is opened, 5 docs are added, then optimize()
> is called, which flushes 5 docs to segment _7g and converts it to
> compound file format.
> 
> Finally we try to merge _7f and _7g for optimize, and we hit the EOF
> exception trying to read the term vector for a doc from one of these
> two segments.
> 
> Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Fwd: Re: possible bug with indexing with term vectors

Reply via email to