The benefits of a byte count are substantial, including:
1. Lazy fields can skip strings without reading them, as they do for
all other value types (see the sketch after this list).
2. The file format could be changed to standard UTF-8 without any
significant performance cost.
3. Any other index operation that relies on the index format will
have an easier time with a representation that is a) easy to
scan quickly and b) consistent (all value types start with a byte
count).
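To make point 1 concrete, here is a minimal sketch of what the skip
path could look like if stored strings were prefixed with a VInt byte
count. The helper class is hypothetical, not current API; readVInt,
getFilePointer, and seek are the real IndexInput methods:

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    class LazyStringSkipper {
        // With a byte-count prefix, skipping a stored String is one seek,
        // exactly as lazy fields already handle binary values.
        static void skipString(IndexInput in) throws IOException {
            int byteLength = in.readVInt();            // length prefix in bytes
            in.seek(in.getFilePointer() + byteLength); // jump past the value unread
        }
    }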
Re. 3, Jian is concerned about programs in other languages that
manipulate Lucene index files. I have such a program in Java and face
the same issue. My case is a robust, general implementation of
IndexUpdater that copies segments while transforming field values,
updating both stored values and postings (not yet term vectors). It is
optimized to skip (i.e., copy verbatim) or minimally process unchanged
regions, which typically constitute most of a segment. This process is
slowed when handling unchanged stored String values because of the
current char count representation -- it faces precisely the same issue
as the lazy fields mechanism, as the sketch below shows.
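For contrast, here is a sketch of why a char count forces a decode even
when the value is unchanged. It assumes the familiar layout of a VInt
char count followed by Java's modified UTF-8 bytes; the helper class
itself is hypothetical:

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;

    class CharCountSkipper {
        // The byte length is unknown until every char's leading byte has
        // been examined, so a "skip" still reads the whole value.
        static void skipString(IndexInput in) throws IOException {
            int charLength = in.readVInt();    // length prefix in Java chars
            for (int i = 0; i < charLength; i++) {
                byte b = in.readByte();        // leading byte gives char width
                if ((b & 0x80) == 0) {
                    // 1-byte char: nothing more to read
                } else if ((b & 0xE0) == 0xC0) {
                    in.readByte();             // 2-byte char
                } else {
                    in.readByte();             // 3-byte char (modified UTF-8
                    in.readByte();             // encodes surrogates separately)
                }
            }
        }
    }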
Re. the file format compatibility issue, if backward compatibility is
a requirement here, then it would seem necessary to have a
configuration option to choose the encoding of stored strings. It
seems easy to generalize the Lucene APIs to specify an interface for
any desired encoder/decoder.
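As a rough illustration of the shape such an interface might take (the
names are illustrative only, not a proposal for a specific API):

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;

    // Hypothetical pluggable codec for stored strings. A standard-UTF-8
    // implementation would write a VInt byte count plus UTF-8 bytes; a
    // legacy implementation could preserve the current char-count format.
    interface StoredStringCodec {
        void write(IndexOutput out, String value) throws IOException;
        String read(IndexInput in) throws IOException;
        void skip(IndexInput in) throws IOException; // fast path for lazy fields
    }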
Chuck
jian chen wrote on 05/02/2006 08:15 AM:
> Hi, Doug,
>
> I totally agree with what you said. Yeah, I think it is more of a file
> format issue, less of an API issue. It seems that we just need to add an
> extra constructor to Term.java that takes in a UTF-8 byte array.
>
> Lucene 2.0 is going to break backward compatibility anyway, right? So,
> maybe this change to standard UTF-8 could be a hot item on the Lucene
> 2.0 list?
>
> Cheers,
>
> Jian Chen
>
> On 5/2/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>>
>> Chuck Williams wrote:
>> > For lazy fields, there would be a substantial benefit to having the
>> > count on a String be an encoded byte count rather than a Java char
>> > count, but this has the same problem. If there is a way to beat this
>> > problem, then I'd start arguing for a byte count.
>>
>> I think the way to beat it is to keep things as bytes as long as
>> possible. For example, each term in a Query needs to be converted from
>> String to byte[], but after that all search computation could happen
>> comparing byte arrays. (Note that lexicographic comparisons of UTF-8
>> encoded bytes give the same results as lexicographic comparisons of
>> Unicode character strings.) And, when indexing, each Token would need
>> to be converted from String to byte[] just once.
>>
>> The Java API can easily be made back-compatible. The harder part would
>> be making the file format back-compatible.
>>
>> Doug