Re: "For dictionary encodings the dictionary is sorted"

Dain Sundstrom Tue, 06 Jun 2017 15:49:58 -0700

Sorry, I should have given a full example.  I mean we could change 
StringStatistics as follows:


message StringStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional sint64 sum = 3;
  optional string minimumUtf8 = 4;
  optional string maximumUtf8 = 5;
}

New writers would fill in both in the transition period, and then just the UTF8 
ones.  Then we update readers to the favor the new ones.

We can also make this backwards compatible in the transition period by 
truncating values to at the first surrogate pair.  This is how we do that in 
Presto:

public static Slice getMaxSlice(String maximum)
{
    if (maximum == null) {
        return null;
    }

    int index = firstSurrogateCharacter(maximum);
    if (index == -1) {
        return Slices.utf8Slice(maximum);
    }
    // Append 0xFF so that it is larger than maximum
    return concatSlices(Slices.utf8Slice(maximum.substring(0, index)), 
MAX_BYTE);
}

public static Slice getMinSlice(String minimum)
{
    if (minimum == null) {
        return null;
    }

    int index = firstSurrogateCharacter(minimum);
    if (index == -1) {
        return Slices.utf8Slice(minimum);
    }
    // truncate the string at the first surrogate character
    return Slices.utf8Slice(minimum.substring(0, index));
}

// returns index of first surrogateCharacter in the string -1 if no surrogate 
character is found
static int firstSurrogateCharacter(String value)
{
    char[] chars = value.toCharArray();
    for (int i = 0; i < chars.length; i++) {
        if (chars[i] >= MIN_SURROGATE) {
            return i;
        }
    }
    return -1;
}

The getMaxSliceCode might need to be adjusted depending on how it it used.  We 
are using UTF-8 binary so we can use the illegal 0xFF as the stop byte, but if 
you want to transition back to Java Strings, we would need to be a bit smarter.

-dain

> On Jun 6, 2017, at 3:39 PM, Owen O'Malley <[email protected]> wrote:
> 
> I'm confused. TimestampStatistics uses integers not strings.
> 
> .. Owen
> 
> On Mon, Jun 5, 2017 at 9:53 PM, Dain Sundstrom <[email protected]> wrote:
> 
>> 
>>> On Dec 12, 2016, at 4:48 PM, Dain Sundstrom <[email protected]> wrote:
>>> On Dec 12, 2016, at 4:36 PM, Owen O'Malley <[email protected]> wrote:
>>>>> I think this should also be documented in the statistics section which
>>>> also uses UTF-16 BE, which is at least consistent, but still annoying
>> for
>>>> everything other than Java.
>>>> 
>>>> Yes, it should be documented and we should replace it with UTF-8.
>> (Although
>>>> changes to the serialized form are always painful.)
>>> 
>>> I think we can do something similar to the bloom filter code, where we
>> add a StringUtf8Stats object and have a transition period where we can
>> produce both.
>> 
>> I was looking at the change proto changes to TimestampStatistics, and I
>> think the same thing could work here.  We add:
>> 
>>    optional string minimumUtf8 = 4;
>>    optional string maximumUtf8 = 5;
>> 
>> and the update the writer write just the UTF-8 version (or both during a
>> transition).
>> 
>> -dain

Re: "For dictionary encodings the dictionary is sorted"

Reply via email to