That's exactly the result I saw, FWIW.

On Wed, Oct 28, 2009 at 11:25 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> Right, I would expect Lucene would silently truncate the term at the
> U+FFFF, and not lead to this odd exception.
>
> Mike
>
> On Wed, Oct 28, 2009 at 11:23 AM, Robert Muir <rcm...@gmail.com> wrote:
> > I might be wrong about this, but recently I intentionally tried to create
> > an index with terms containing U+FFFF to see if it would cause a problem :)
> >
> > The U+FFFF seemed to be discarded completely (maybe at UTF-8 encode
> > time)... then again, I was using RAMDirectory.
> >
> > On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >
> >> The only change I made to the source code was the patch for
> >> PayloadNearQuery (LUCENE-1986).
> >> It's possible that our content contains U+FFFF. I will run in the debugger
> >> and see.
> >> The data is 'sensitive', so I may not be able to provide a bad segment,
> >> unfortunately.
> >>
> >> Peter
> >>
> >> On Wed, Oct 28, 2009 at 10:43 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> >>
> >> > OK... when you exported the sources & built yourself, you didn't make
> >> > any changes, right?
> >> >
> >> > It's really odd how many of the errors are due to the term
> >> > "literals:cfid196$", or some variation (one time with "on" appended,
> >> > another time with "microsoft"). Do you know what documents typically
> >> > contain that term, and what the context is around it? Maybe try to
> >> > index only those documents and see if this happens? (It could
> >> > conceivably be caused by bad data, if this is some weird bug.) One
> >> > question: does your content ever use the [invalid] Unicode character
> >> > U+FFFF? (Lucene uses this internally to mark the end of the term.)
> >> >
> >> > Would it be possible to zip up all files starting with _1c (should be
> >> > ~22 MB) and post them somewhere that I could download? That's the
> >> > smallest of the broken segments, I think.
> >> >
> >> > I don't need the full IW output just yet, thanks.
> >> >
> >> > Mike
> >> >
> >> > On Wed, Oct 28, 2009 at 10:21 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >> >
> >> > > Yes, I used JDK 1.6.0_16 when running CheckIndex and it reported the
> >> > > same problems when run multiple times.
> >> > >
> >> > > > Also, what does Lucene version "2.9 exported - 2009-10-27 15:31:52" mean?
> >> > > This appears to be something added by the ant build, since I built
> >> > > Lucene from the source code.
> >> > >
> >> > > I rebuilt the index using mergeFactor=50, ramBufferSize=200MB,
> >> > > maxBufferedDocs=1000000.
> >> > > This produced 49 segments, 9 of which are broken. The broken segments
> >> > > are in the latter half, similar to my previous post with 3 segments.
> >> > > Do you think this could be caused by 'bad' data, for example bad
> >> > > unicode characters?
> >> > >
> >> > > Here is the output from CheckIndex:
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
>

--
Robert Muir
rcm...@gmail.com
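[Editor's note] Regarding the speculation above that U+FFFF might be discarded "at UTF-8 encode time": a quick stdlib-only check (the `UffffCheck` class name is mine) shows what the JDK's own UTF-8 encoder does with the character. Note that Lucene 2.9 performs its own char-to-UTF-8 conversion internally, so this does not settle what Lucene itself does with U+FFFF; it only rules out the JDK encoder as the place it would vanish.

```java
import java.nio.charset.Charset;

public class UffffCheck {
    // Encode a string with the JDK's UTF-8 charset and return the raw bytes.
    static byte[] encode(String s) {
        return s.getBytes(Charset.forName("UTF-8"));
    }

    public static void main(String[] args) {
        byte[] bytes = encode("\uFFFF");
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b));
        }
        // The JDK encoder does not drop the noncharacter; it emits EF BF BF.
        System.out.println(bytes.length + " bytes: " + hex.toString().trim());
        // prints "3 bytes: EF BF BF"
    }
}
```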
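[Editor's note] If the content does turn out to contain U+FFFF, one defensive option is to sanitize field text before it reaches the analyzer. A minimal sketch follows; the `TermSanitizer` name and the choice of U+FFFD REPLACEMENT CHARACTER as the substitute are my own assumptions, not anything in Lucene's API.

```java
public class TermSanitizer {
    // Replace U+FFFF (which, per the thread above, Lucene reserves as an
    // internal end-of-term marker) with U+FFFD before indexing.
    // Hypothetical helper, not part of Lucene.
    static String sanitize(String s) {
        if (s.indexOf('\uFFFF') < 0) {
            return s; // fast path: most documents are clean
        }
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(c == '\uFFFF' ? '\uFFFD' : c);
        }
        return sb.toString();
    }
}
```

Calling `sanitize` on every field value before building the `Document` would guarantee the reserved character never reaches the index, at the cost of one scan per string.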