Re: "For dictionary encodings the dictionary is sorted"

Dain Sundstrom Mon, 05 Jun 2017 21:54:12 -0700

> On Dec 12, 2016, at 4:48 PM, Dain Sundstrom <d...@iq80.com> wrote:
> On Dec 12, 2016, at 4:36 PM, Owen O'Malley <omal...@apache.org> wrote:
>>> I think this should also be documented in the statistics section which
>>> also uses UTF-16 BE, which is at least consistent, but still annoying for
>>> everything other than Java.
>> 
>> Yes, it should be documented and we should replace it with UTF-8. (Although
>> changes to the serialized form are always painful.)
> 
> I think we can do something similar to the bloom filter code, where we add a 
> StringUtf8Stats object and have a transition period where we can produce both.


I was looking at the proto changes to TimestampStatistics, and I think 
the same thing could work here.  We add:

    optional string minimumUtf8 = 4;
    optional string maximumUtf8 = 5;

and then update the writer to write just the UTF-8 version (or both during a 
transition).
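
To make the difference concrete, here is a rough sketch (not ORC code) of why 
the two orderings can disagree and what a writer producing both during the 
transition might compute. The helper names and the legacy "minimum"/"maximum" 
keys are assumptions for illustration; only minimumUtf8 and maximumUtf8 come 
from the proposal above.

```python
def utf16_be_key(s):
    # Comparing big-endian UTF-16 bytes is equivalent to comparing
    # UTF-16 code units, which is how the existing stats are ordered.
    return s.encode("utf-16-be")


def utf8_key(s):
    # Comparing UTF-8 bytes orders strings by Unicode code point.
    return s.encode("utf-8")


def string_stats(values, write_legacy=True):
    """Hypothetical writer-side helper: always emit the UTF-8 ordered
    min/max, and optionally the legacy UTF-16 BE ordered pair too."""
    stats = {
        "minimumUtf8": min(values, key=utf8_key),
        "maximumUtf8": max(values, key=utf8_key),
    }
    if write_legacy:
        # Assumed names for the existing fields.
        stats["minimum"] = min(values, key=utf16_be_key)
        stats["maximum"] = max(values, key=utf16_be_key)
    return stats


# U+FF5E is one UTF-16 code unit (0xFF5E); U+1F600 is a surrogate pair
# starting with 0xD83D, so it sorts *before* U+FF5E in UTF-16 but
# *after* it in UTF-8.
values = ["a", "\uff5e", "\U0001f600"]
stats = string_stats(values)
```

With these values the two maxima differ (U+FF5E under UTF-16 BE, U+1F600 under 
UTF-8), so a reader has to know which pair of fields it is looking at before 
using them for predicate pushdown.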

-dain
