How does this patch work w.r.t. the length vint?

It looks like the length is still the number of 16-bit Java chars,
but the encoding is now correct UTF-8?
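
To make the question concrete, here is a minimal sketch of the layout being described: a VInt holding the count of UTF-16 code units, followed by standard UTF-8 bytes. It is an illustration under assumptions, not the patch itself. The class and method names are hypothetical, it uses the modern JDK's StandardCharsets rather than Lucene's own encoding loop, and it assumes well-formed input (String.getBytes substitutes '?' for unpaired surrogates).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Lucene's IndexOutput.
final class VIntStringSketch {

    // VInt: seven data bits per byte, low-order group first;
    // the high bit is set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // Length prefix counts Java chars (UTF-16 code units), not bytes
    // and not code points; the payload is plain UTF-8.
    static byte[] writeString(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, s.length());
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 'a' plus one supplementary character (a surrogate pair).
        byte[] bytes = writeString("a\uD834\uDD1E");
        System.out.println("prefix=" + bytes[0] + " total bytes=" + bytes.length);
    }
}
```

With this layout a reader cannot skip a string by byte count alone: it has to decode UTF-8 until it has produced the stated number of chars, with a 4-byte sequence counting as two.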


-Yonik
Now hiring -- http://tinyurl.com/7m67g

On 9/21/05, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>
> On Sep 20, 2005, at 11:53 PM, Chris Lamprecht wrote:
>
> > import java.util.Arrays;
> >
> > ...
> >
> > Arrays.equals(array1, array2);
>
> Great, thank you, Chris.
>
> The patch for IndexOutput.java is done. It will now write valid
> UTF-8. Older versions of Lucene will not be able to read indexes
> written using this class, as they will choke if they encounter a null
> byte or a 4-byte UTF-8 sequence.
>
> As an added bonus, this patch yields a speedup of a couple of
> percentage points (on my machine), made possible by simplified conditionals.
> For instance, the first if() clause...
>
> if (code >= 0x01 && code <= 0x7F)
>
> ...is now...
>
> if (code < 0x80)
>
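
For readers without the patch in front of them, the simplified first branch fits into a standard UTF-8 encoder roughly as sketched below. This is a generic illustration, not the code from IndexOutput.patch: the class name, method signature, and buffer handling are assumptions, and the input is assumed to be a valid, non-surrogate code point.

```java
// Hypothetical sketch of standard UTF-8 encoding for one code point.
final class Utf8EncodeSketch {

    /** Writes code as UTF-8 into buf starting at pos; returns the new position. */
    static int encode(int code, byte[] buf, int pos) {
        if (code < 0x80) {
            // One byte. U+0000 lands here too, so null is a single 0x00
            // byte rather than the two-byte 0xC0 0x80 form.
            buf[pos++] = (byte) code;
        } else if (code < 0x800) {
            buf[pos++] = (byte) (0xC0 | (code >> 6));
            buf[pos++] = (byte) (0x80 | (code & 0x3F));
        } else if (code < 0x10000) {
            buf[pos++] = (byte) (0xE0 | (code >> 12));
            buf[pos++] = (byte) (0x80 | ((code >> 6) & 0x3F));
            buf[pos++] = (byte) (0x80 | (code & 0x3F));
        } else {
            // Supplementary code points take four bytes.
            buf[pos++] = (byte) (0xF0 | (code >> 18));
            buf[pos++] = (byte) (0x80 | ((code >> 12) & 0x3F));
            buf[pos++] = (byte) (0x80 | ((code >> 6) & 0x3F));
            buf[pos++] = (byte) (0x80 | (code & 0x3F));
        }
        return pos;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[4];
        int n = encode(0x1D11E, buf, 0);
        System.out.println("bytes written: " + n);
    }
}
```

The null byte and the four-byte branch are exactly the two cases the message says older readers will choke on.
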
> The new TestIndexOutput.java class is sort of done. It has all the
> tests Ken suggested, though I think it could stand the addition of a
> randomized test to excite edge cases. The data mirrors the data from
> TestIndexInput.java, and that's by design, as I think with so much
> overlap the two ought to be merged. How does "TestIndexIO.java" grab
> you all?
>
> On Aug 29, 2005, at 11:49 AM, Ken Krugler wrote:
>
> > a. Single surrogate pair (two Java chars)
> > b. Surrogate pair at the beginning, followed by regular data.
> > c. Surrogate pair at the end, preceded by regular data.
> > d. Two surrogate pairs in a row.
> >
> > Then all of the above, but remove the second (low-order) surrogate
> > character (busted format).
> >
> > Then all of the above, but replace the first (high-order) surrogate
> > character.
>
> A minor wrinkle: each unpaired surrogate will have to be replaced by
> the Unicode replacement character U+FFFD, or the VInt count will be
> off. This means that a UTF-16LE sequence will grow by a code point,
> as the (mis-ordered) surrogate pair (representing a single code
> point) will get subbed out for two replacement characters. I don't
> think this is serious, though.
>
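
The replacement rule described above could look something like the following sketch. It is only an illustration of the idea, assuming a pre-pass over the string before encoding; the class and method names are hypothetical and the patch may well do this inline. Character.isSurrogate requires Java 7 or later.

```java
// Hypothetical sketch: swap each unpaired surrogate for U+FFFD so the
// char count written as the VInt still matches what a reader decodes.
final class SurrogateRepairSketch {

    static String replaceUnpairedSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Well-formed pair: keep both chars and skip the low half.
                sb.append(c).append(s.charAt(i + 1));
                i++;
            } else if (Character.isSurrogate(c)) {
                // Lone high or lone low surrogate: one replacement char each.
                sb.append((char) 0xFFFD);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Low surrogate first, then high: neither half is paired.
        String fixed = replaceUnpairedSurrogates("\uDD1E\uD834");
        System.out.println("chars=" + fixed.length()
                + " codePoints=" + fixed.codePointCount(0, fixed.length()));
    }
}
```

Because each bad char is replaced one-for-one, the number of Java chars never changes, which is what keeps the VInt count honest; the mis-ordered pair simply ends up as two replacement characters.
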
> > Then all of the above, but replace the surrogate pair with an xC0
> > x80 encoded null byte.
>
> I left this out of the test cases for IndexOutput (it's in there, and
> important, for IndexInput). The UTF-16 sequence "\u00C0\u0080"
> doesn't map to a null, so I used the regular UTF-16 null "\u0000".
> As before, I think this is what you intended.
>
> Files and patches can be found here:
>
> http://www.rectangular.com/downloads/IndexOutput.patch
> http://www.rectangular.com/downloads/MockIndexOutput.java
> http://www.rectangular.com/downloads/TestIndexOutput.java
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/