How does this patch work w.r.t. the length VInt? It looks like the length is still the number of 16-bit Java chars, but the encoding is now correct UTF-8?
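To make the question concrete, here is a minimal sketch of the three lengths that can disagree for the same string. The class name StringLengths is hypothetical and not part of Lucene or the patch; it uses only standard JDK calls.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of Lucene or the patch).
final class StringLengths {
    // Number of 16-bit Java chars (UTF-16 code units).
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points; a valid surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the standard UTF-8 encoding of the string.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For "a\uD835\uDD0A" (the letter a followed by U+1D50A) these return 3, 2, and 5 respectively, so a reader has to know which of the three the VInt prefix holds.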
-Yonik
Now hiring -- http://tinyurl.com/7m67g

On 9/21/05, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
>
> On Sep 20, 2005, at 11:53 PM, Chris Lamprecht wrote:
>
> > import java.util.Arrays;
> >
> > ...
> >
> > Arrays.equals(array1, array2);
>
> Great, thank you, Chris.
>
> The patch for IndexOutput.java is done. It will now write valid
> UTF-8. Older versions of Lucene will not be able to read indexes
> written using this class, as they will choke if they encounter a null
> byte or a 4-byte UTF-8 sequence.
>
> As an added bonus, this patch yields a speedup of a couple percentage
> points (on my machine), made possible by simplified conditionals.
> For instance, the first if() clause...
>
>     if (code >= 0x01 && code <= 0x7F)
>
> ...is now...
>
>     if (code < 0x80)
>
> The new TestIndexOutput.java class is sort of done. It has all the
> tests Ken suggested, though I think it could stand the addition of a
> randomized test to excite edge cases. The data mirrors the data from
> TestIndexInput.java, and that's by design, as I think with so much
> overlap the two ought to be merged. How does "TestIndexIO.java" grab
> you all?
>
> On Aug 29, 2005, at 11:49 AM, Ken Krugler wrote:
>
> > a. Single surrogate pair (two Java chars)
> > b. Surrogate pair at the beginning, followed by regular data.
> > c. Surrogate pair at the end, followed by regular data.
> > d. Two surrogate pairs in a row.
> >
> > Then all of the above, but remove the second (low-order) surrogate
> > character (busted format).
> >
> > Then all of the above, but replace the first (high-order) surrogate
> > character.
>
> A minor wrinkle: each unpaired surrogate will have to be replaced by
> the Unicode replacement character U+FFFD, or the VInt count will be
> off. This means that a UTF-16LE sequence will grow by a code point,
> as the (mis-ordered) surrogate pair (representing a single code
> point), will get subbed out for two replacement characters. I don't
> think this is serious, though.
>
> > Then all of the above, but replace the surrogate pair with an xC0
> > x80 encoded null byte.
>
> I left this out of the test cases for IndexOutput (it's in there, and
> important, for IndexInput). The UTF-16 sequence "\u00C0\u0080"
> doesn't map to a null, so I used the regular UTF-16 null "\u0000".
> As before, I think this is what you intended.
>
> Files and patches can be found here:
>
> http://www.rectangular.com/downloads/IndexOutput.patch
> http://www.rectangular.com/downloads/MockIndexOutput.java
> http://www.rectangular.com/downloads/TestIndexOutput.java
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
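
For readers following the thread, the behaviour described in the quoted message can be sketched roughly as follows. This is not the code in IndexOutput.patch: the class Utf8Sketch and its encode method are hypothetical, and the sketch only assumes what the message states (standard UTF-8 including a plain 0x00 for null and 4-byte sequences for surrogate pairs, a first branch of `code < 0x80`, and U+FFFD substituted for each unpaired surrogate).

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch for illustration; NOT the code from IndexOutput.patch.
final class Utf8Sketch {
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int code = s.charAt(i);
            if (Character.isHighSurrogate((char) code)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Valid pair: combine both chars into one supplementary code point.
                code = Character.toCodePoint((char) code, s.charAt(++i));
            } else if (Character.isSurrogate((char) code)) {
                // Unpaired (or mis-ordered) surrogate: substitute U+FFFD.
                code = 0xFFFD;
            }
            if (code < 0x80) {
                // One byte; this branch also covers U+0000, written as 0x00.
                out.write(code);
            } else if (code < 0x800) {
                out.write(0xC0 | (code >> 6));
                out.write(0x80 | (code & 0x3F));
            } else if (code < 0x10000) {
                out.write(0xE0 | (code >> 12));
                out.write(0x80 | ((code >> 6) & 0x3F));
                out.write(0x80 | (code & 0x3F));
            } else {
                // Four bytes for code points above U+FFFF.
                out.write(0xF0 | (code >> 18));
                out.write(0x80 | ((code >> 12) & 0x3F));
                out.write(0x80 | ((code >> 6) & 0x3F));
                out.write(0x80 | (code & 0x3F));
            }
        }
        return out.toByteArray();
    }
}
```

With this sketch, the mis-ordered pair "\uDD0A\uD835" encodes as two U+FFFD sequences (EF BF BD twice), which is the growth by one code point mentioned in the quoted message, while the number of Java chars stays at two.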