Right, I would expect Lucene to silently truncate the term at the
U+FFFF, rather than lead to this odd exception.
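
For what it's worth, something like the standalone test below would show
exactly what happens (a rough sketch against the 2.9 API; the class name
and the field name are made up, and I haven't run it): it indexes a single
NOT_ANALYZED term containing U+FFFF into a RAMDirectory and then dumps
every indexed term as hex code points, so you can see directly whether the
character is truncated, dropped, or kept.

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.RAMDirectory;

public class FFFFTermCheck {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new KeywordAnalyzer(),
                                         IndexWriter.MaxFieldLength.UNLIMITED);

    // Index one untokenized term with an embedded U+FFFF.
    Document doc = new Document();
    doc.add(new Field("f", "abc\uFFFFdef",
                      Field.Store.NO, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);
    writer.close();

    // Enumerate the indexed terms and print each one as code points,
    // so it is obvious what happened to the U+FFFF.
    IndexReader reader = IndexReader.open(dir, true);
    TermEnum terms = reader.terms();
    while (terms.next()) {
      String text = terms.term().text();
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < text.length(); i++) {
        sb.append(String.format("U+%04X ", (int) text.charAt(i)));
      }
      System.out.println(terms.term().field() + " -> " + sb);
    }
    terms.close();
    reader.close();
  }
}

If Robert is right that the character is dropped, the printed term should
come out as just "abcdef"; if it is truncated, only the "abc" part should
be left.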

Mike

On Wed, Oct 28, 2009 at 11:23 AM, Robert Muir <rcm...@gmail.com> wrote:
> I might be wrong about this, but recently I intentionally tried to create an
> index with terms containing U+FFFF to see if it would cause a problem :)
>
> The U+FFFF seemed to be discarded completely (maybe at UTF-8 encode time)...
> then again, I was using RAMDirectory.
>
> On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
>
>> The only change I made to the source code was the patch for
>> PayloadNearQuery
>> (LUCENE-1986).
>> It's possible that our content contains U+FFFF. I will run it in the debugger
>> and see (something like the quick check sketched below).
>> The data is 'sensitive', so I may not be able to provide a bad segment,
>> unfortunately.
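>>
>> A minimal sketch of the check I have in mind (containsFFFF is just a
>> made-up helper name, not something already in our code) -- I'd call it
>> on each field value before the value goes into the Document, and log
>> the doc id whenever it fires:
>>
>> // Returns true if the value contains the invalid character U+FFFF.
>> static boolean containsFFFF(String s) {
>>   for (int i = 0; i < s.length(); i++) {
>>     if (s.charAt(i) == '\uFFFF') {
>>       return true;
>>     }
>>   }
>>   return false;
>> }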
>>
>> Peter
>>
>> On Wed, Oct 28, 2009 at 10:43 AM, Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>> > OK... when you exported the sources & built yourself, you didn't make
>> > any changes, right?
>> >
>> > It's really odd how many of the errors are due to the term
>> > "literals:cfid196$", or some variation (one time with "on" appended,
>> > another time with "microsoft").  Do you know what documents typically
>> > contain that term, and what the context is around it?  Maybe try to
>> > index only those documents and see if this happens?  (It could
>> > conceivably be caused by bad data, if this is some weird bug).  One
>> > question: does your content ever use the [invalid] Unicode character
>> > U+FFFF?  (Lucene uses this internally to mark the end of the term).
>> >
>> > Would it be possible to zip up all the files starting with _1c (should be
>> > ~22 MB) and post it somewhere that I could download it?  That's the smallest
>> > of the broken segments, I think.
>> >
>> > I don't need the full IW output just yet, thanks.
>> >
>> > Mike
>> >
>> > On Wed, Oct 28, 2009 at 10:21 AM, Peter Keegan <peterlkee...@gmail.com>
>> > wrote:
>> > > Yes, I used JDK 1.6.0_16 when running CheckIndex and it reported the same
>> > > problems when run multiple times.
>> > >
>> > > > Also, what does Lucene version "2.9 exported - 2009-10-27 15:31:52" mean?
>> > > This appears to be something added by the ant build, since I built Lucene
>> > > from the source code.
>> > >
>> > > I rebuilt the index using mergeFactor=50, ramBufferSize=200MB,
>> > > maxBufferedDocs=1000000.
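>> > >
>> > > Spelled out against the IndexWriter API (just a sketch; 'writer' is our
>> > > existing IndexWriter instance), that was:
>> > >
>> > > writer.setMergeFactor(50);           // merge once 50 segments collect at a level
>> > > writer.setRAMBufferSizeMB(200.0);    // flush when buffered docs use ~200 MB of RAM
>> > > writer.setMaxBufferedDocs(1000000);  // or when 1,000,000 docs are buffered, whichever comes first
>> > >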
>> > > This produced 49 segments, 9 of which are broken. The broken segments are
>> > > in the latter half, similar to my previous post with 3 segments. Do you
>> > > think this could be caused by 'bad' data, for example bad Unicode
>> > > characters?
>> > >
>> > > Here is the output from CheckIndex:
>> >
>>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>
