That's exactly what oal.util.UnicodeUtils does when converting UTF-8 to
UTF-16 (which is Java's internal encoding).
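For illustration only, a minimal sketch (this is not the actual oal.util.UnicodeUtils source; the class name and method are hypothetical): the behavior discussed below, a term being silently cut off at U+FFFF because Lucene uses that character internally as an end-of-term marker, is equivalent to something like this.

```java
// Hypothetical sketch of the observed truncation behavior. Lucene (2.9-era)
// reserves U+FFFF as an internal end-of-term sentinel, so a term containing
// it effectively ends at that character.
public class SentinelTruncation {
    // Return the prefix of the term up to (not including) the first U+FFFF.
    static String truncateAtSentinel(String term) {
        int i = term.indexOf('\uFFFF');
        return i < 0 ? term : term.substring(0, i);
    }

    public static void main(String[] args) {
        // A term with an embedded U+FFFF loses everything after the marker.
        System.out.println(truncateAtSentinel("microsoft\uFFFFon")); // prints "microsoft"
        System.out.println(truncateAtSentinel("plain"));             // prints "plain"
    }
}
```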

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Wednesday, October 28, 2009 4:25 PM
> To: java-user@lucene.apache.org
> Subject: Re: IO exception during merge/optimize
> 
> Right, I would expect Lucene would silently truncate the term at the
> U+FFFF, and not lead to this odd exception.
> 
> Mike
> 
> On Wed, Oct 28, 2009 at 11:23 AM, Robert Muir <rcm...@gmail.com> wrote:
> > I might be wrong about this, but recently I intentionally tried to create
> > an index with terms containing U+FFFF to see if it would cause a problem :)
> >
> > The U+FFFF seemed to be discarded completely (maybe at UTF-8 encode time)...
> > then again, I was using RAMDirectory.
> >
> > On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >
> >> The only change I made to the source code was the patch for
> >> PayloadNearQuery
> >> (LUCENE-1986).
> >> It's possible that our content contains U+FFFF. I will run in debugger
> and
> >> see.
> >> The data is 'sensitive', so I may not be able to provide a bad segment,
> >> unfortunately.
> >>
> >> Peter
> >>
> >> On Wed, Oct 28, 2009 at 10:43 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> >>
> >> > OK... when you exported the sources & built yourself, you didn't make
> >> > any changes, right?
> >> >
> >> > It's really odd how many of the errors are due to the term
> >> > "literals:cfid196$", or some variation (one time with "on" appended,
> >> > another time with "microsoft").  Do you know what documents typically
> >> > contain that term, and what the context is around it?  Maybe try to
> >> > index only those documents and see if this happens?  (It could
> >> > conceivably be caused by bad data, if this is some weird bug).  One
> >> > question: does your content ever use the [invalid] unicode character
> >> > U+FFFF?  (Lucene uses this internally to mark the end of the term).
> >> >
> >> > Would it be possible to zip up all files starting with _1c (should be
> >> > ~22 MB) and post somewhere that I could download?  That's the smallest
> >> > of the broken segments, I think.
> >> >
> >> > I don't need the full IW output just yet, thanks.
> >> >
> >> > Mike
> >> >
> >> > On Wed, Oct 28, 2009 at 10:21 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
> >> > > Yes, I used JDK 1.6.0_16 when running CheckIndex, and it reported the
> >> > > same problems when run multiple times.
> >> > >
> >> > >> Also, what does Lucene version "2.9 exported - 2009-10-27 15:31:52" mean?
> >> > > This appears to be something added by the ant build, since I built
> >> > > Lucene from the source code.
> >> > >
> >> > > I rebuilt the index using mergeFactor=50, ramBufferSize=200MB,
> >> > > maxBufferedDocs=1000000
> >> > > This produced 49 segments, 9 of which are broken. The broken segments
> >> > > are in the latter half, similar to my previous post with 3 segments.
> >> > > Do you think this could be caused by 'bad' data, for example bad
> >> > > Unicode characters?
> >> > >
> >> > > Here is the output from CheckIndex:
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
> 


