On Sep 20, 2005, at 11:53 PM, Chris Lamprecht wrote:
import java.util.Arrays;
...
Arrays.equals(array1, array2);
Great, thank you, Chris.
The patch for IndexOutput.java is done. It will now write valid
UTF-8. Older versions of Lucene will not be able to read indexes
written using this class, as they will choke if they encounter a null
byte or a 4-byte UTF-8 sequence.
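To make the compatibility break concrete, here is a minimal sketch that compares standard UTF-8 with Java's "modified UTF-8" (the format written by DataOutputStream.writeUTF). Using modified UTF-8 as a stand-in for the older on-disk format is an assumption made for illustration only, and the class and method names are hypothetical, not part of the patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8FormatDemo {
    // Standard UTF-8, as produced by the JDK's built-in encoder.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Java's modified UTF-8 (assumed stand-in for the older format).
    // writeUTF prepends a 2-byte length, which is stripped off here.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = bytes.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        String pair = "\uD834\uDD1E"; // U+1D11E, one code point, two Java chars
        System.out.println(hex(standardUtf8(nul)));  // 00
        System.out.println(hex(modifiedUtf8(nul)));  // C0 80
        System.out.println(hex(standardUtf8(pair))); // F0 9D 84 9E
        System.out.println(hex(modifiedUtf8(pair))); // ED A0 B4 ED B4 9E
    }
}
```

A reader written for the second and fourth forms has no branch for a bare 00 byte or for a lead byte in the F0 range, which is where the choking would come from.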
As an added bonus, this patch yields a speedup of a couple of percentage
points (on my machine), made possible by simplified conditionals.
For instance, the first if() clause...
if (code >= 0x01 && code <= 0x7F)
...is now...
if (code < 0x80)
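For context, a rough sketch of how that simplified first branch sits inside a complete encoder. This is not the code from the patch: the class and method names are hypothetical, and it assumes the input contains only well-formed surrogate pairs.

```java
import java.io.ByteArrayOutputStream;

public class Utf8EncodeSketch {
    // Hypothetical helper: encodes a String as standard UTF-8.
    // Assumes every surrogate in the input is part of a valid pair.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); ) {
            int code = s.codePointAt(i);
            i += Character.charCount(code);
            if (code < 0x80) {
                // One byte. A code point is never negative, so no lower
                // bound is needed, and U+0000 lands here as a plain 00 byte.
                out.write(code);
            } else if (code < 0x800) {
                // Two bytes: 110xxxxx 10xxxxxx
                out.write(0xC0 | (code >> 6));
                out.write(0x80 | (code & 0x3F));
            } else if (code < 0x10000) {
                // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
                out.write(0xE0 | (code >> 12));
                out.write(0x80 | ((code >> 6) & 0x3F));
                out.write(0x80 | (code & 0x3F));
            } else {
                // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                out.write(0xF0 | (code >> 18));
                out.write(0x80 | ((code >> 12) & 0x3F));
                out.write(0x80 | ((code >> 6) & 0x3F));
                out.write(0x80 | (code & 0x3F));
            }
        }
        return out.toByteArray();
    }
}
```

Dropping the lower bound is what lets a null character be written as a single byte, and it also removes one comparison from the most common path.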
The new TestIndexOutput.java class is sort of done. It has all the
tests Ken suggested, though I think it could stand the addition of a
randomized test to exercise edge cases. The data mirrors the data from
TestIndexInput.java, and that's by design, as I think with so much
overlap the two ought to be merged. How does "TestIndexIO.java" grab
you all?
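One possible shape for such a randomized test, sketched under assumptions: the JDK's own UTF-8 encoder and decoder stand in for the IndexOutput/IndexInput pair, and all names here are hypothetical. Half of the generated code points are drawn from the boundaries between the 1-, 2-, 3- and 4-byte ranges.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

public class RandomizedUtf8Sketch {
    // Code points on either side of each encoding-length boundary.
    private static final int[] EDGES = {
        0x0000, 0x007F, 0x0080, 0x07FF, 0x0800,
        0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF
    };

    // Builds a string of well-formed code points, biased toward the edges.
    static String randomString(Random random, int codePoints) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < codePoints; i++) {
            int cp;
            if (random.nextBoolean()) {
                cp = EDGES[random.nextInt(EDGES.length)];
            } else {
                do {
                    cp = random.nextInt(0x110000);
                } while (cp >= 0xD800 && cp <= 0xDFFF); // skip lone surrogates
            }
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    // Stand-in round trip; a real test would write with the class under
    // test and read the bytes back with its counterpart.
    static boolean roundTrips(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        return s.equals(new String(encoded, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Random random = new Random(42L); // fixed seed keeps failures reproducible
        for (int i = 0; i < 1000; i++) {
            String s = randomString(random, 1 + random.nextInt(20));
            if (!roundTrips(s)) {
                throw new AssertionError("round trip failed on iteration " + i);
            }
        }
        System.out.println("all round trips passed");
    }
}
```

The fixed seed is a deliberate choice: a randomized test that cannot be replayed is hard to debug when it fails.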
On Aug 29, 2005, at 11:49 AM, Ken Krugler wrote:
a. Single surrogate pair (two Java chars)
b. Surrogate pair at the beginning, followed by regular data.
c. Surrogate pair at the end, preceded by regular data.
d. Two surrogate pairs in a row.
Then all of the above, but remove the second (low-order) surrogate
character (busted format).
Then all of the above, but replace the first (high-order) surrogate
character.
A minor wrinkle: each unpaired surrogate will have to be replaced by
the Unicode replacement character U+FFFD, or the VInt count will be
off. This means that a UTF-16LE sequence will grow by a code point,
as the (mis-ordered) surrogate pair (representing a single code
point) will get subbed out for two replacement characters. I don't
think this is serious, though.
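The replacement rule can be sketched as follows. This is an illustration rather than the patch itself, and the class and method names are hypothetical. Each unpaired surrogate is swapped one-for-one for U+FFFD, so the length in Java chars is unchanged, while a mis-ordered pair comes out as two replacement characters.

```java
public class SurrogateRepairSketch {
    // Replaces every surrogate that is not part of a high+low pair
    // with the Unicode replacement character U+FFFD.
    static String replaceUnpairedSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Well-formed pair: copy both chars and skip the low half.
                sb.append(c).append(s.charAt(i + 1));
                i++;
            } else if (Character.isSurrogate(c)) {
                // Lone high or lone low surrogate.
                sb.append('\uFFFD');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String misordered = "\uDD1E\uD834"; // low surrogate first, then high
        String repaired = replaceUnpairedSurrogates(misordered);
        System.out.println(repaired.length());                       // 2
        System.out.println(repaired.equals("\uFFFD\uFFFD"));         // true
    }
}
```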
Then all of the above, but replace the surrogate pair with a 0xC0
0x80 encoded null byte.
I left this out of the test cases for IndexOutput (it's in there, and
important, for IndexInput). The UTF-16 sequence "\u00C0\u0080"
doesn't map to a null, so I used the regular UTF-16 null "\u0000".
As before, I think this is what you intended.
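The distinction between the two chars U+00C0 U+0080 and the two bytes C0 80 can be checked with a short sketch. It again uses Java's modified UTF-8 (DataInputStream.readUTF) as an assumed stand-in for a reader that accepts the two-byte null; the names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class NullEncodingSketch {
    // The two Java chars U+00C0 and U+0080, encoded as standard UTF-8.
    // Each one needs two bytes, giving C3 80 C2 80.
    static byte[] encodeChars() {
        return "\u00C0\u0080".getBytes(StandardCharsets.UTF_8);
    }

    // The two raw bytes C0 80, decoded as Java's modified UTF-8.
    static String decodeModified() {
        // readUTF expects a 2-byte length prefix before the payload.
        byte[] framed = {0x00, 0x02, (byte) 0xC0, (byte) 0x80};
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeChars().length);               // 4
        System.out.println(decodeModified().equals("\u0000"));  // true
    }
}
```

So the C0 80 form only exists at the byte level, which is why it belongs in the input-side tests and a plain null character is the natural fixture on the output side.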
Files and patches can be found here:
http://www.rectangular.com/downloads/IndexOutput.patch
http://www.rectangular.com/downloads/MockIndexOutput.java
http://www.rectangular.com/downloads/TestIndexOutput.java
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/