On Sep 20, 2005, at 11:53 PM, Chris Lamprecht wrote:

import java.util.Arrays;

...

Arrays.equals(array1, array2);

Great, thank you, Chris.

The patch for IndexOutput.java is done. It will now write valid UTF-8. Older versions of Lucene will not be able to read indexes written using this class, as they will choke if they encounter a null byte or a 4-byte UTF-8 sequence.

As an added bonus, this patch yields a speedup of a couple of percentage points (on my machine), made possible by simplified conditionals. For instance, the first if() clause...

    if (code >= 0x01 && code <= 0x7F)

...is now...

    if (code < 0x80)
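To show where that simplified test sits, here is a minimal sketch of a standard UTF-8 encoder. It is not the code from IndexOutput.patch: the class and method names (Utf8EncodeSketch, writeCodePoint, encode) are made up for illustration, it writes to a ByteArrayOutputStream rather than a Lucene stream, and it assumes well-formed input with no unpaired surrogates. Because the first branch is simply `code < 0x80`, U+0000 comes out as a single zero byte instead of falling through to the two-byte branch.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; not the code from IndexOutput.patch.
final class Utf8EncodeSketch {

    // Appends the standard UTF-8 encoding of one code point to out.
    // Assumes code is a valid scalar value (not an unpaired surrogate).
    static void writeCodePoint(ByteArrayOutputStream out, int code) {
        if (code < 0x80) {                           // 1 byte, U+0000 included
            out.write(code);
        } else if (code < 0x800) {                   // 2 bytes
            out.write(0xC0 | (code >> 6));
            out.write(0x80 | (code & 0x3F));
        } else if (code < 0x10000) {                 // 3 bytes
            out.write(0xE0 | (code >> 12));
            out.write(0x80 | ((code >> 6) & 0x3F));
            out.write(0x80 | (code & 0x3F));
        } else {                                     // 4 bytes, from a surrogate pair
            out.write(0xF0 | (code >> 18));
            out.write(0x80 | ((code >> 12) & 0x3F));
            out.write(0x80 | ((code >> 6) & 0x3F));
            out.write(0x80 | (code & 0x3F));
        }
    }

    // Encodes a whole string, combining each surrogate pair into one code point.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); ) {
            int code = s.codePointAt(i);
            writeCodePoint(out, code);
            i += Character.charCount(code);          // 1 or 2 Java chars consumed
        }
        return out.toByteArray();
    }
}
```

For well-formed input, the result of Utf8EncodeSketch.encode(s) can be checked against s.getBytes("UTF-8") with Arrays.equals, as Chris suggested.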

The new TestIndexOutput.java class is sort of done. It has all the tests Ken suggested, though I think it could stand the addition of a randomized test to exercise edge cases. The data mirrors the data from TestIndexInput.java, and that's by design, as I think with so much overlap the two ought to be merged. How does "TestIndexIO.java" grab you all?

On Aug 29, 2005, at 11:49 AM, Ken Krugler wrote:

a. Single surrogate pair (two Java chars)
b. Surrogate pair at the beginning, followed by regular data.
c. Surrogate pair at the end, preceded by regular data.
d. Two surrogate pairs in a row.

Then all of the above, but remove the second (low-order) surrogate character (busted format).

Then all of the above, but replace the first (high-order) surrogate character.

A minor wrinkle: each unpaired surrogate will have to be replaced by the Unicode replacement character U+FFFD, or the VInt count will be off. This means that a UTF-16LE sequence will grow by a code point, as the (mis-ordered) surrogate pair (representing a single code point) will get subbed out for two replacement characters. I don't think this is serious, though.
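A rough sketch of that substitution, for the record. This is an illustration and not the patch itself: the class and method names (SurrogateRepairSketch, repair) are hypothetical, and it assumes the VInt length prefix counts Java chars, so swapping each unpaired surrogate for exactly one U+FFFD keeps that count unchanged.

```java
// Hypothetical illustration only; not the code from IndexOutput.patch.
final class SurrogateRepairSketch {

    static final char REPLACEMENT = '\uFFFD';

    // Returns a copy of s in which every unpaired surrogate is replaced
    // by U+FFFD. Well-formed high+low pairs are copied through untouched.
    static String repair(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Valid pair: keep both halves and skip past the low half.
                sb.append(c).append(s.charAt(i + 1));
                i++;
            } else if (Character.isSurrogate(c)) {
                // Lone high surrogate, or a low surrogate with no high before it.
                sb.append(REPLACEMENT);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With this approach a mis-ordered pair (low surrogate first, then high) becomes two replacement characters, and the length in Java chars is the same before and after.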

Then all of the above, but replace the surrogate pair with an xC0 x80 encoded null byte.

I left this out of the test cases for IndexOutput (it's in there, and important, for IndexInput). The UTF-16 sequence "\u00C0\u0080" doesn't map to a null, so I used the regular UTF-16 null "\u0000". As before, I think this is what you intended.
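To make the distinction concrete, here is a small sketch comparing the byte sequences involved. It uses java.io.DataOutputStream.writeUTF as a stand-in for a "modified UTF-8" writer that emits C0 80 for a null; that is an assumption for illustration, not a claim about how Lucene's own classes are implemented. The class and method names (NullEncodingSketch, standardUtf8, modifiedUtf8) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not Lucene code.
final class NullEncodingSketch {

    // Standard UTF-8, as produced by the JDK's charset encoder.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Java's "modified UTF-8", as written by DataOutputStream.writeUTF,
    // with the 2-byte length prefix stripped off.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            // Not expected for an in-memory stream and short strings.
            throw new UncheckedIOException(e);
        }
    }
}
```

Under these assumptions, modifiedUtf8("\u0000") yields the two bytes C0 80, standardUtf8("\u0000") yields a single 00 byte, and standardUtf8("\u00C0\u0080") yields C3 80 C2 80, which is two ordinary characters and not a null.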

Files and patches can be found here:

http://www.rectangular.com/downloads/IndexOutput.patch
http://www.rectangular.com/downloads/MockIndexOutput.java
http://www.rectangular.com/downloads/TestIndexOutput.java

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

