OK... when you exported the sources & built yourself, you didn't make
any changes, right?

It's really odd how many of the errors are due to the term
"literals:cfid196$", or some variation (one time with "on" appended,
another time with "microsoft").  Do you know what documents typically
contain that term, and what the context is around it?  Maybe try to
index only those documents and see if this happens?  (If this is some
weird bug, it could conceivably be triggered by bad data.)  One
question: does your content ever use the [invalid] Unicode character
U+FFFF?  (Lucene uses this internally to mark the end of the term.)
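
If it helps, here is a rough sketch of how you could scan your field
text for U+FFFF before handing it to the IndexWriter.  It assumes you
can get at the raw text as a String; the class and method names are
made up for illustration and are not part of Lucene.

```java
public final class TermSentinelCheck {

    /**
     * Returns the index of the first U+FFFF char in the text,
     * or -1 if the text does not contain one.
     */
    public static int indexOfTermSentinel(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\uFFFF') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Each argument stands in for one field value (hypothetical input).
        for (String value : args) {
            int pos = indexOfTermSentinel(value);
            if (pos >= 0) {
                System.out.println("U+FFFF found at offset " + pos);
            } else {
                System.out.println("no U+FFFF found");
            }
        }
    }
}
```

Calling indexOfTermSentinel on each field value as you build the
documents, and logging the document ID on a hit, should show whether
any of your content contains the character.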

Would it be possible to zip up all files starting with _1c (should be
~22 MB) and post them somewhere that I could download?  I think that's
the smallest of the broken segments.

I don't need the full IW output just yet, thanks.

Mike

On Wed, Oct 28, 2009 at 10:21 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
> Yes, I used JDK 1.6.0_16 when running CheckIndex and it reported the same
> problems when run multiple times.
>
>>Also, what does Lucene version "2.9 exported - 2009-10-27 15:31:52" mean?
> This appears to be something added by the ant build, since I built Lucene
> from the source code.
>
> I rebuilt the index using mergeFactor=50, ramBufferSize=200MB,
> maxBufferedDocs=1000000
> This produced 49 segments, 9 of which are broken. The broken segments are in
> the latter half, similar to my previous post with 3 segments. Do you think
> this could be caused by 'bad' data, for example bad unicode characters?
>
> Here is the output from CheckIndex:

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
