Encoding data in terms; UTF8 concerns?

david.w.smi...@gmail.com Sat, 10 May 2014 18:59:27 -0700

I’m working on encoding numbers / data into indexed terms.  In the
past I limited the encoding to ASCII, but now I’m doing it at a more
raw/byte level.  Do I have to be aware of UTF-8 / sorting issues when I do
this?  I noticed the following code in NumericUtils.java, line 186:
    while (nChars > 0) {
      // Store 7 bits per byte for compatibility
      // with UTF-8 encoding of terms
      bytes.bytes[nChars--] = (byte)(sortableBits & 0x7f);
      sortableBits >>>= 7;
    }
It’s the comment more than anything that has my attention. Do I have to
limit my bytes to only the low 7 bits?  If so, why?  I’ve already written a
bunch of code that generates the terms without consideration for this, and
I think a bug I’m looking at could be related to this.
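
To make the concern concrete, here is a minimal standalone sketch. It is an illustration only, not Lucene code: the class and method names are hypothetical, and it assumes non-negative values packed into 9 bytes. It shows two things that could go wrong once the high bit of a byte is used: a byte such as 0x80 sorts differently under signed and unsigned comparison, and a lone 0x80 does not survive a round trip through a UTF-8 String.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; none of these names exist in Lucene.
final class SevenBitTermSketch {

    // Packs a non-negative long into 9 bytes, 7 bits per byte, most
    // significant group first, so every byte is in the range 0x00..0x7f.
    static byte[] encode7(long value) {
        byte[] out = new byte[9];
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = (byte) (value & 0x7f);
            value >>>= 7;
        }
        return out;
    }

    // Lexicographic comparison treating each byte as unsigned (0..255).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    // Same comparison using Java's signed byte values (-128..127).
    static int compareSigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = a[i] - b[i];
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    // True if the bytes are unchanged after decoding to a String as UTF-8
    // and encoding back to bytes.
    static boolean roundTripsAsUtf8(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(bytes, s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] small = encode7(127L);
        byte[] large = encode7(128L);
        // With 7-bit bytes, signed and unsigned comparison agree.
        System.out.println(compareUnsigned(small, large) < 0);
        System.out.println(compareSigned(small, large) < 0);
        System.out.println(roundTripsAsUtf8(large));

        // With the high bit set, the two comparisons disagree and the
        // byte is not valid UTF-8 on its own.
        byte[] lo = {(byte) 0x7f};
        byte[] hi = {(byte) 0x80};
        System.out.println(compareUnsigned(lo, hi) < 0);
        System.out.println(compareSigned(lo, hi) < 0);
        System.out.println(roundTripsAsUtf8(hi));
    }
}
```

Whether either effect matters for terms that stay as raw bytes end to end is exactly the open question; the sketch only shows where full 8-bit bytes behave differently from 7-bit ones.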


~ David
P.S. Sorry for CC’ing some folks directly, but the mailing list is having
problems.
