RE: Encoding data in terms; UTF8 concerns?

Uwe Schindler Sun, 11 May 2014 02:31:17 -0700

Hi David,


the reason why NumericUtils encodes numbers this way is historical: the 
NumericField encoding was introduced in Lucene 2.9, where all terms were 
char[], encoded as UTF-8 on the index side. Using the full 8 bits of each byte 
would have caused a large overhead in index size, because every Java char in 
the range 128…255 takes 2 bytes in UTF-8, so terms could grow by up to one 
extra byte per char. For that reason NumericField uses only 7 bits per byte.
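
To illustrate the overhead, here is a minimal sketch in plain Java. It is not 
Lucene code: the class and method names are made up for this example, and 
pack7 only mirrors the 7-bit loop quoted below (it leaves out the shift prefix 
that the real NumericUtils terms carry).

```java
import java.nio.charset.StandardCharsets;

public class SevenBitDemo {
    // Number of UTF-8 bytes needed to store one char value (hypothetical helper).
    static int utf8Length(int c) {
        return String.valueOf((char) c).getBytes(StandardCharsets.UTF_8).length;
    }

    // Pack a 64-bit value into 10 groups of 7 bits, most significant group first.
    // Every resulting byte is in 0..127, so each one is a single byte in UTF-8.
    static byte[] pack7(long bits) {
        byte[] out = new byte[10]; // ceil(64 / 7) = 10
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = (byte) (bits & 0x7f);
            bits >>>= 7;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x7f)); // char 127 fits in one byte
        System.out.println(utf8Length(0x80)); // char 128 needs two bytes
        System.out.println(utf8Length(0xff)); // char 255 needs two bytes
        System.out.println(pack7(-1L).length);
    }
}
```

With 8 bits per char, roughly half of all possible values would fall into the 
two-byte range, which is the overhead the 7-bit scheme avoids.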

Because we cannot easily change the numeric encoding (we won’t be able to 
change it ever, unless we have information about the terms in Field metadata on 
the index side), this encoding stayed alive up to now – so it’s all about index 
backwards compatibility.

 

If you introduce a new field for spatial, you don’t need to worry about 
this. Since Lucene 4 all terms are byte[] and are sorted in binary order. The 
order of terms in the index is given by BytesRef.compareTo(), which is pure 
binary. The good thing for us: UTF-8 order for string terms (which is what 
Lucene uses) is identical to byte[] order, but it differs from UTF-16 order 
(this is why we need a crazy backwards layer to read 3.x indexes: terms are 
sorted slightly differently). We already do full 8 bit encoding for Collation 
fields (see CollationKeyAttributeFactory, which encodes terms with their 
collation key instead of UTF-8).
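
The ordering difference can be shown with a small sketch in plain Java. It 
does not use any Lucene classes: compareBytes is a hypothetical helper that 
compares byte[] as unsigned values, which is what "pure binary" order is 
assumed to mean here.

```java
import java.nio.charset.StandardCharsets;

public class TermOrderDemo {
    // Unsigned, byte-wise lexicographic comparison (hypothetical helper).
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Compare two strings by the binary order of their UTF-8 bytes.
    static int compareUtf8(String a, String b) {
        return compareBytes(a.getBytes(StandardCharsets.UTF_8),
                            b.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";          // U+FF5E, a single UTF-16 code unit
        String supp = "\uD83D\uDE00";   // U+1F600, a surrogate pair in UTF-16

        // UTF-16 order: 0xFF5E > 0xD83D, so bmp sorts after supp.
        System.out.println(bmp.compareTo(supp) > 0);
        // UTF-8 binary order: 0xEF < 0xF0, so bmp sorts before supp.
        System.out.println(compareUtf8(bmp, supp) < 0);
    }
}
```

The two orders only disagree when supplementary characters are compared with 
BMP characters above the surrogate range; for terms that are raw byte[] and 
never pass through a String, only the binary comparison matters.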

 

Uwe

 

-----

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de <http://www.thetaphi.de/> 

eMail: u...@thetaphi.de

 

From: david.w.smi...@gmail.com [mailto:david.w.smi...@gmail.com] 
Sent: Sunday, May 11, 2014 1:17 AM
To: dev@lucene.apache.org
Cc: Uwe Schindler; Michael McCandless
Subject: Encoding data in terms; UTF8 concerns?

 

I’m working on an encoding of numbers / data into indexed terms.  In the past I 
limited the encoding to ASCII but now I’m doing it at a more raw/byte level.  
Do I have to be aware of UTF8 / sorting issues when I do this?  I noticed the 
following code in NumericUtils.java, line 186:

    while (nChars > 0) {
      // Store 7 bits per byte for compatibility
      // with UTF-8 encoding of terms
      bytes.bytes[nChars--] = (byte)(sortableBits & 0x7f);
      sortableBits >>>= 7;
    }

It’s the comment more than anything that has my attention. Do I have to limit 
my bytes to only the low 7 bits?  If so, why?  I’ve already written a bunch of 
code that generates the terms without consideration for this, and I think a bug 
I’m looking at could be related to this.

 

~ David

p.s. sorry to be CC’ing some folks directly but the mailing list is having 
problems
