"Bill Janssen" <[EMAIL PROTECTED]> wrote: > > Hmmm ... how many chunks of "about 50 pages" do you do before > > hitting this? Roughly how many docs are in the index when it > > happens? > > Oh, gosh, not sure. I'm guessing it's about half done.
Ugh, OK.  If we could boil this down to a smaller set that is easily
reproducible (and transferable to me), then I could try to track it
down.

Do you have another PPC machine to reproduce this on?  (To rule out
bad RAM or a bad hard drive on the first one.)

Can you try running with the trunk version of Lucene (2.3-dev) and
see if the error still occurs?  EG you can download this AM's build
here:

  http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/288/artifact/artifacts

Another thing to try is turning on the infoStream
(IndexWriter.setInfoStream(...)) and capturing & posting the
resulting log.  It will be very large, since it takes quite a while
for the error to occur...

> So, I ran the same codebase with lucene-core-2.2.0.jar on an Intel
> Mac Pro, OS X 10.5.0, Java 1.5, and no exception is raised.
> Different corpus, about 50000 pages instead of 20000.  This is
> reinforcing my thinking that it's a big-endian issue.

That's a reasonable theory, but Lucene is endian independent: all
writes to files eventually boil down to writeByte/writeBytes calls
in o.a.l.store.IndexOutput, so the byte ordering is controlled by
Lucene, not by the underlying CPU architecture.

That said, it is clearly a difference in your test, so it seems like
a compelling lead... is it possible to run this different corpus
back on the PPC machine, to rule out a corpus difference as the
cause of the exception?

> I've got 1735 documents, 18969 pages -- average document size 10.9
> pages, maximum 1235 pages (a physics textbook), 578 one-page
> documents.  These are Web pages, PDFs, articles, photos, scanned
> stuff, technical papers, etc.  I index six documents at a time, so
> I guess I'm averaging about 65 pages per chunk.  For each document,
> I index the whole text of the document as a Lucene Document, and I
> index the text of each page separately as a Document.  I use the
> "contents" field and the "pagecontents" field for those two uses.
> I also add metadata information to each: "title", multiple "author"
> fields, "date", "abstract", etc.

OK, sounds like a nice rich corpus :)  Are you using term vectors,
stored fields, or payloads on any of these?

Mike
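A minimal sketch of the infoStream setup Mike suggests above, using
the 2.2-era API.  The index path, log file name, and analyzer here
are placeholders, not taken from the code under discussion:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.PrintStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class InfoStreamDemo {
      public static void main(String[] args) throws IOException {
        // Hypothetical paths; point these at the real index directory
        // and wherever the (large) diagnostic log should go.
        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        writer.setInfoStream(
            new PrintStream(new FileOutputStream("infostream.log")));
        // ... addDocument() calls as usual; IndexWriter now logs its
        // flush/merge activity to infostream.log ...
        writer.close();
      }
    }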
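On the endian-independence point, the byte order is pinned down by
arithmetic rather than by memory layout.  This is roughly (a
paraphrase, not a verbatim copy of the class) how
IndexOutput.writeInt reduces to writeByte calls:

    // The shifts fix the byte order (most significant byte first),
    // so the CPU's native endianness never affects what lands on
    // disk.
    public void writeInt(int i) throws IOException {
      writeByte((byte) (i >> 24));
      writeByte((byte) (i >> 16));
      writeByte((byte) (i >> 8));
      writeByte((byte) i);
    }

IndexInput.readInt reverses the same arithmetic on the way back in,
so a PPC vs. Intel difference should not be visible at this layer.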
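And a guess at the shape of the indexing scheme Bill describes --
his actual code isn't shown, so the helper below and its Store/Index
flags are assumptions; the Field.TermVector argument mentioned at
the end is what Mike's closing question is probing:

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class PagewiseIndexer {
      // One Document for the whole text plus one Document per page,
      // matching the "contents"/"pagecontents" split described above.
      static void indexDocument(IndexWriter writer, String fullText,
                                List<String> pageTexts, String title)
          throws IOException {
        Document whole = new Document();
        whole.add(new Field("contents", fullText,
                            Field.Store.NO, Field.Index.TOKENIZED));
        whole.add(new Field("title", title,
                            Field.Store.YES, Field.Index.TOKENIZED));
        // ... "author", "date", "abstract" fields added the same way ...
        writer.addDocument(whole);

        for (String pageText : pageTexts) {
          Document page = new Document();
          page.add(new Field("pagecontents", pageText,
                             Field.Store.NO, Field.Index.TOKENIZED));
          writer.addDocument(page);
        }
      }
    }

If term vectors were enabled, that would show up here as an extra
Field.TermVector.YES (or WITH_POSITIONS / WITH_OFFSETS) argument on
the Field constructors.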
