Thank you for the background info, Uwe! It turns out my encoding was fine; I had some other bug.

-- David

On Sunday, May 11, 2014, Uwe Schindler <u...@thetaphi.de> wrote:

> Hi David,
>
> the reason why NumericUtils does the encoding in that way is just:
> NumericField encoding was introduced in Lucene 2.9, where all terms were
> char[], encoded in UTF-8 on the index side. Because of that, encoding each
> byte with full 8 bits would have been a large overhead in index size: Each
> term would get an additional byte, because java chars 128…255 would be
> encoded in 2 bytes because of UTF-8. Because of this NumericField uses 7
> bits only.
>
> Because we cannot easily change the numeric encoding (we won't be able to
> change it ever, unless we have information about the terms in Field
> metadata on the index side), this encoding stayed alive up to now; it's
> all about index backwards compatibility.
>
> If you introduce a new field for spatial, you don't need to take care
> about this. Since Lucene 4 all terms are byte[] and are sorted in binary
> order. The order of terms in the index is given by BytesRef.compareTo(),
> which is pure binary. The good thing for us: UTF-8 order for string terms
> (which is used in Lucene) is identical to byte[] order, but it is different
> from UTF-16 order (this is why we need a crazy backwards layer to read 3.x
> indexes: terms are sorted slightly differently). We do full 8 bit encoding
> already for Collation fields (see CollationKeyAttributeFactory, which
> encodes terms with their collation key instead of UTF-8).
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> *From:* david.w.smi...@gmail.com [mailto:david.w.smi...@gmail.com]
> *Sent:* Sunday, May 11, 2014 1:17 AM
> *To:* dev@lucene.apache.org
> *Cc:* Uwe Schindler; Michael McCandless
> *Subject:* Encoding data in terms; UTF8 concerns?
>
> I'm working on an encoding of numbers / data into indexed terms. In the
> past I limited the encoding to ASCII but now I'm doing it at a more
> raw/byte level. Do I have to be aware of UTF8 / sorting issues when I do
> this? I noticed the following code in NumericUtils.java, line 186:
>
>     while (nChars > 0) {
>       // Store 7 bits per byte for compatibility
>       // with UTF-8 encoding of terms
>       bytes.bytes[nChars--] = (byte)(sortableBits & 0x7f);
>       sortableBits >>>= 7;
>     }
>
> It's the comment more than anything that has my attention. Do I have to
> limit my bytes to only the low 7 bits? If so, why? I've already written a
> bunch of code that generates the terms without consideration for this, and
> I think a bug I'm looking at could be related to this.
>
> ~ David
>
> p.s. sorry to be CC'ing some folks directly but the mailing list is having
> problems

-- Sent from Gmail Mobile
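
For readers of the archive, the two points in this thread can be illustrated with a small standalone sketch. It does not use Lucene at all: the class TermEncodingSketch and all of its methods are hypothetical names made up for this illustration, and the sign-bit flip in toSortableBits is an assumption about how one might make signed values sortable, not a quote of the NumericUtils source. It shows that a char in the range 128…255 takes two bytes in UTF-8, and that both a 7-bit and a full 8-bit big-endian packing keep numeric order under an unsigned, byte-by-byte comparison.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration; not part of Lucene. */
final class TermEncodingSketch {

    /** Number of bytes UTF-8 needs for one (non-surrogate) Java char. */
    static int utf8Length(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8).length;
    }

    /** Assumption: flip the sign bit so signed order matches unsigned bit order. */
    static long toSortableBits(long value) {
        return value ^ 0x8000000000000000L;
    }

    /** Packs 64 bits into 10 bytes, 7 bits each, most significant group first. */
    static byte[] encode7Bit(long sortableBits) {
        byte[] out = new byte[10]; // ceil(64 / 7) = 10; out[0] carries a single bit
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = (byte) (sortableBits & 0x7f); // high bit always clear
            sortableBits >>>= 7;
        }
        return out;
    }

    /** Packs 64 bits into 8 bytes, using all 8 bits of each byte. */
    static byte[] encode8Bit(long sortableBits) {
        byte[] out = new byte[8];
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = (byte) sortableBits;
            sortableBits >>>= 8;
        }
        return out;
    }

    /** Unsigned lexicographic comparison, i.e. a pure binary ordering of byte[]. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 bytes for char 0x7f: " + utf8Length((char) 0x7f));
        System.out.println("UTF-8 bytes for char 0x80: " + utf8Length((char) 0x80));
        byte[] small = encode8Bit(toSortableBits(-1L));
        byte[] large = encode8Bit(toSortableBits(1L));
        System.out.println("-1 sorts before 1: " + (compareUnsigned(small, large) < 0));
    }
}
```

In this sketch the 7-bit form needs 10 bytes per 64-bit value and the 8-bit form needs 8. The 7-bit form only pays off when each byte would otherwise pass through a char-to-UTF-8 step, where any value from 128 to 255 would cost two bytes.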