I forgot to CC java-dev in my response:

Mike
"Michael McCandless" <[EMAIL PROTECTED]> wrote: > "Andi Vajda" <[EMAIL PROTECTED]> wrote: > > > > On Fri, 28 Sep 2007, Michael McCandless wrote: > > > > >> I tried all morning to isolate the problem but I seem to be unable > > >> to reproduce it in a simple unit test. In my application, I've been > > >> able to get errors by doing even less: just creating a FSDirectory > > >> and adding documents with fields with term vectors fails when > > >> optimizing the index with the error below. I even tried to add the > > >> same documents, in the same order, in the unit test but to no > > >> avail. It just works. > > > > > > Are you trying your unit test first in Python (using PyLucene)? > > > > No, I wrote it in Java to begin with. But it's a good idea. I should try it > > from PyLucene too. > > Yeah if we can first repro it in Python that's at least one step of > simplification. > > You could also use MockRAMDirectory as your dir in the test: it > behaves like windows and will catch certain bugs eg if something opens > an index file that had not actually been closed. > > > >> What is different about my environment ? Well, I'm running PyLucene, > > >> but the new one, the one using a Apple's Java VM, the same VM I'm > > >> using to run the unit test. And I'm not doing anything special like > > >> calling back into Python or something, I'm just calling regular > > >> Lucene APIs adding documents into an IndexWriter on an FSDirectory > > >> using a StandardAnalyzer. If I stop using term vectors, all is > > >> working fine. > > > > > > Are your documents irregular wrt term vectors? (Ie some docs have > > > none, others store the terms but not positions/offsets, etc?). Any > > > interesting changes to Lucene's defaults (autoCommit=false, etc)? > > > > All default config values, no config changes. All documents follow the > > same pattern of having five fields, one with term vectors. > > OK. > > > >> I'd like to get to the bottom of this but could use some help. Does > > >> the stacktrace below ring a bell ? Is there a way to run the whole > > >> indexing and optimizing in one single thread ? > > > > > > You can easily turn off the concurrent (background) merges by doing > > > this: > > > > > > writer.setMergeScheduler(new SerialMergeScheduler()) > > > > > > though that probably isn't punched through to Python in PyLucene. You > > > can also build a Lucene JAR w/ a small change to IndexWriter.java to > > > do the same thing. > > > > The new PyLucene is built with a code generator and all public APIs and > > classes are made available to Python. SerialMergeScheduler is available. > > Wild! Does this mean PyLucene will track tightly to Lucene releases > going forward? > > > > That stacktrace is happening while merging term vectors during an > > > optimize. It's specifically occuring when loading the term vectors > > > for a given doc X; we read a position from the index stream (tvx) just > > > fine, but then when we try to read the first vInt from the document > > > stream (tvd) we hit the EOF exception. So that position was too large > > > or the tvd file was somehow truncated. Weird. > > > > > > Can you call "writer.setInfoStream(System.out)" and get the error to > > > occur and then post the resulting log? It may shed some light > > > here.... > > > > I called writer.setMergeScheduler(SerialMergeScheduler()) just after > > creating > > the writer and called writer.setInfoStream(System.out) just before calling > > optimize(). Below is what I get: > > What happened prior to this first optimize call? 
>
> > >> What is different about my environment? Well, I'm running PyLucene,
> > >> but the new one, the one using Apple's Java VM, the same VM I'm
> > >> using to run the unit test. And I'm not doing anything special like
> > >> calling back into Python or something; I'm just calling regular
> > >> Lucene APIs, adding documents into an IndexWriter on an FSDirectory
> > >> using a StandardAnalyzer. If I stop using term vectors, all is
> > >> working fine.
> > >
> > > Are your documents irregular wrt term vectors? (Ie some docs have
> > > none, others store the terms but not positions/offsets, etc?) Any
> > > interesting changes to Lucene's defaults (autoCommit=false, etc)?
> >
> > All default config values, no config changes. All documents follow the
> > same pattern of having five fields, one with term vectors.
>
> OK.
>
> > >> I'd like to get to the bottom of this but could use some help. Does
> > >> the stacktrace below ring a bell? Is there a way to run the whole
> > >> indexing and optimizing in one single thread?
> > >
> > > You can easily turn off the concurrent (background) merges by doing
> > > this:
> > >
> > >   writer.setMergeScheduler(new SerialMergeScheduler())
> > >
> > > though that probably isn't punched through to Python in PyLucene. You
> > > can also build a Lucene JAR w/ a small change to IndexWriter.java to
> > > do the same thing.
> >
> > The new PyLucene is built with a code generator, and all public APIs
> > and classes are made available to Python. SerialMergeScheduler is
> > available.
>
> Wild! Does this mean PyLucene will track tightly to Lucene releases
> going forward?
>
> > > That stacktrace is happening while merging term vectors during an
> > > optimize. It's specifically occurring when loading the term vectors
> > > for a given doc X: we read a position from the index stream (tvx)
> > > just fine, but then when we try to read the first vInt from the
> > > document stream (tvd) we hit the EOF exception. So that position
> > > was too large or the tvd file was somehow truncated. Weird.
> > >
> > > Can you call "writer.setInfoStream(System.out)" and get the error to
> > > occur and then post the resulting log? It may shed some light
> > > here....
> >
> > I called writer.setMergeScheduler(SerialMergeScheduler()) just after
> > creating the writer and called writer.setInfoStream(System.out) just
> > before calling optimize(). Below is what I get:
>
> What happened prior to this first optimize call? Did you just create
> the writer, switch to SerialMergeScheduler, add N docs, then call
> setInfoStream(...) and writer.optimize()?
>
> The debug output starts with an optimize() call, which first flushes
> 372 docs to segment _7f; this is the first segment in the index. Had
> you opened this writer with create=true?
>
> This optimize() does nothing because the index has only one segment
> (_7f) in compound file format, so it's already optimized. Then the
> writer is closed.
>
> Then this is printed:
>
>   <DBRepositoryView: Lucene (1)> indexed 191 items in 0:00:00.413600
>
> which is odd, because 191 != 372. I can't explain that difference...
>
> Then another index writer is opened, 5 docs are added, and then
> optimize() is called, which flushes the 5 docs to segment _7g and
> converts it to compound file format.
>
> Finally, we try to merge _7f and _7g for the optimize, and we hit the
> EOF exception trying to read the term vector for a doc from one of
> these two segments.
>
> Mike
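PS: to make the log walkthrough above concrete, here is roughly that
sequence as a standalone sketch (untested; the path, the field layout
and the doc counts are stand-ins for whatever the app actually does).
The point is that the failing optimize() is the one that must merge
term vectors from two segments written by two different writer
sessions:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.SerialMergeScheduler;
  import org.apache.lucene.store.FSDirectory;

  public class TwoSessionOptimize {

    // stand-in doc: one keyword field plus one tokenized field
    // carrying term vectors
    static Document makeDoc(int i) {
      Document doc = new Document();
      doc.add(new Field("id", "doc" + i,
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
      doc.add(new Field("body", "text for doc " + i,
                        Field.Store.YES, Field.Index.TOKENIZED,
                        Field.TermVector.YES));
      return doc;
    }

    public static void main(String[] args) throws Exception {
      FSDirectory dir = FSDirectory.getDirectory("/tmp/tvindex");

      // Session 1: flushes its docs to the first segment (_7f in the
      // log); optimize() is then a no-op since there is only one
      // segment.
      IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
      writer.setMergeScheduler(new SerialMergeScheduler());
      writer.setInfoStream(System.out);
      for (int i = 0; i < 372; i++)
        writer.addDocument(makeDoc(i));
      writer.optimize();
      writer.close();

      // Session 2: adds 5 more docs (_7g); this optimize() has to
      // merge _7f and _7g, which is where the term-vector EOF hit.
      writer = new IndexWriter(dir, new StandardAnalyzer(), false);
      writer.setMergeScheduler(new SerialMergeScheduler());
      writer.setInfoStream(System.out);
      for (int i = 372; i < 377; i++)
        writer.addDocument(makeDoc(i));
      writer.optimize();
      writer.close();
      dir.close();
    }
  }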