I might be wrong about this, but recently I intentionally tried to create an
index with terms containing U+FFFF to see if it would cause a problem :)

The U+FFFF seemed to be discarded completely (maybe at UTF-8 encode time)...
then again, I was using RAMDirectory.
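
Roughly what I did, reconstructed from memory against the 2.9 API (the class
name and field name are just placeholders):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.RAMDirectory;

public class FFFFTest {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    // index a single term containing the reserved char U+FFFF
    doc.add(new Field("f", "abc\uFFFFdef", Field.Store.NO,
        Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);
    writer.close();

    // dump the code points of every indexed term to see what survived
    IndexReader reader = IndexReader.open(dir, true);
    TermEnum terms = reader.terms();
    while (terms.next()) {
      String text = terms.term().text();
      for (int i = 0; i < text.length(); i++)
        System.out.printf("U+%04X ", (int) text.charAt(i));
      System.out.println();
    }
    terms.close();
    reader.close();
  }
}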

On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <peterlkee...@gmail.com> wrote:

> The only change I made to the source code was the patch for
> PayloadNearQuery (LUCENE-1986).
> It's possible that our content contains U+FFFF. I will run it in the
> debugger and see.
> The data is 'sensitive', so I may not be able to provide a bad segment,
> unfortunately.
>
> Peter
>
> On Wed, Oct 28, 2009 at 10:43 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
> > OK... when you exported the sources & built them yourself, you didn't
> > make any changes, right?
> >
> > It's really odd how many of the errors are due to the term
> > "literals:cfid196$", or some variation (one time with "on" appended,
> > another time with "microsoft").  Do you know what documents typically
> > contain that term, and what the context is around it?  Maybe try
> > indexing only those documents and see if this happens?  (It could
> > conceivably be caused by bad data, if this is some weird bug.)  One
> > question: does your content ever use the [invalid] Unicode character
> > U+FFFF?  (Lucene uses this internally to mark the end of a term.)
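> >
> > A quick way to check (just a sketch; call it on each raw field value
> > before adding the document):
> >
> >   // returns true if the text contains the reserved char U+FFFF
> >   static boolean containsFFFF(String s) {
> >     for (int i = 0; i < s.length(); i++) {
> >       if (s.charAt(i) == '\uFFFF') return true;
> >     }
> >     return false;
> >   }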
> >
> > Would it be possible to zip up all files starting with _1c (should be
> > ~22 MB) and post it somewhere that I could download?  That's the
> > smallest of the broken segments, I think.
> >
> > I don't need the full IW output just yet, thanks.
> >
> > Mike
> >
> > On Wed, Oct 28, 2009 at 10:21 AM, Peter Keegan <peterlkee...@gmail.com>
> > wrote:
> > > Yes, I used JDK 1.6.0_16 when running CheckIndex, and it reported
> > > the same problems when run multiple times.
> > >
> > >> Also, what does Lucene version "2.9 exported - 2009-10-27 15:31:52"
> > >> mean?
> > > This appears to be something added by the Ant build, since I built
> > > Lucene from the source code.
> > >
> > > I rebuilt the index using mergeFactor=50, ramBufferSize=200MB, and
> > > maxBufferedDocs=1000000.
> > > This produced 49 segments, 9 of which are broken. The broken segments
> > > are in the latter half, similar to my previous post with 3 segments.
> > > Do you think this could be caused by 'bad' data, for example bad
> > > Unicode characters?
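> > >
> > > The writer settings, as a sketch (dir and analyzer here stand in for
> > > the real ones):
> > >
> > >   IndexWriter writer = new IndexWriter(dir, analyzer, true,
> > >       IndexWriter.MaxFieldLength.UNLIMITED);
> > >   writer.setMergeFactor(50);          // merge segments in groups of 50
> > >   writer.setRAMBufferSizeMB(200.0);   // flush after ~200 MB buffered
> > >   writer.setMaxBufferedDocs(1000000); // or after 1,000,000 docs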
> > >
> > > Here is the output from CheckIndex:
> >
> >
> >
>



-- 
Robert Muir
rcm...@gmail.com
