I’m working on an encoding of numbers / data into indexed terms. In the past I limited the encoding to ASCII but now I’m doing it at a more raw/byte level. Do I have to be aware of UTF8 / sorting issues when I do this? I noticed the following code in NumericUtils.java, line 186:

    while (nChars > 0) {
      // Store 7 bits per byte for compatibility
      // with UTF-8 encoding of terms
      bytes.bytes[nChars--] = (byte)(sortableBits & 0x7f);
      sortableBits >>>= 7;
    }

It’s the comment more than anything that has my attention. Do I have to limit my bytes to only the low 7 bits? If so, why? I’ve already written a bunch of code that generates the terms without consideration for this, and I think a bug I’m looking at could be related to this.
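To make the concern concrete, here is a minimal standalone sketch of what I understand the comment to be getting at. It is my own illustration, not Lucene code: the class name SevenBitDemo and both helper methods are hypothetical. Bytes in the range 0x00-0x7F are each a valid one-byte UTF-8 sequence, so a term built only from them survives being decoded to a String and encoded back. Bytes with the high bit set are not valid on their own, and the decoder replaces them.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SevenBitDemo {
    // Hypothetical helper: pack a long into 10 bytes, 7 bits per byte,
    // most significant group first (10 = ceil(64 / 7)).
    static byte[] packSevenBits(long bits) {
        byte[] out = new byte[10];
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = (byte) (bits & 0x7f); // high bit is always clear
            bits >>>= 7;
        }
        return out;
    }

    // Hypothetical helper: true if the bytes are unchanged after being
    // decoded as UTF-8 into a String and encoded back to UTF-8.
    static boolean survivesUtf8RoundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return Arrays.equals(raw, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] sevenBit = packSevenBits(-1L);
        byte[] rawBytes = {(byte) 0x80, (byte) 0xFF}; // high bit set
        System.out.println(survivesUtf8RoundTrip(sevenBit)); // true
        System.out.println(survivesUtf8RoundTrip(rawBytes)); // false
    }
}
```

If any part of my pipeline treats term bytes as UTF-8 text, the second case is the kind of corruption I would expect to see. Whether the indexing path actually does that for raw byte terms is exactly what I am unsure about.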
~ David

P.S. Sorry to be CC’ing some folks directly, but the mailing list is having problems.