Ah I see.  I can’t believe I missed this fix :)  

Our reader was originally written in the 0.13 days, when stats used Strings.
This is the commit that changed everything to Text, and I believe it 
went out with Hive 0.14:

  
https://github.com/apache/hive/commit/6072e3aed88d9246e1130abadf3c15a88e975b4e#diff-340d190f994d92658b24aae1edf610b3

Is writer version "1 = HIVE-8732 fixed" after 0.14?  If so I can update my 
reader to detect this.
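
If the answer is yes, the check on my side could be roughly the sketch below. It assumes that writer version 1 carries the HIVE-8732 fix (as quoted above) and that every later version is numerically larger and also includes it. The class and method names are made up for illustration and are not part of any ORC or Hive API.

```java
// Hypothetical helper; not part of any ORC or Hive API.
final class StringStatsTrust {
    // Assumption from this thread: writer version 1 means "HIVE-8732 fixed".
    static final int HIVE_8732_FIXED = 1;

    private StringStatsTrust() {}

    /**
     * Decides whether string min/max statistics can be used for pruning.
     *
     * @param writerVersion the writer version recorded in the file,
     *                      or null if the file does not record one
     * @return true only when the version is known and at least HIVE_8732_FIXED
     */
    static boolean canTrustStringStats(Integer writerVersion) {
        // Files with no recorded version are treated as pre-fix (conservative).
        return writerVersion != null && writerVersion >= HIVE_8732_FIXED;
    }
}
```

A reader would call canTrustStringStats before using string min/max for predicate pushdown and fall back to scanning when it returns false.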

-dain

> On Jun 6, 2017, at 3:36 PM, Owen O'Malley <[email protected]> wrote:
> 
> On Tue, Jun 6, 2017 at 3:02 PM, Dain Sundstrom <[email protected]> wrote:
> 
>> Is it required that the StringStatistics min and max be the actual min and
>> max value for the column?  I ask for two reasons.  First, I’d like to be able to
>> “trim” values if the min or max is very large.  Second, as a workaround
>> for the UTF-16be sorting problem (bug?), I’d like to trim values at the
>> first surrogate pair, so the value is slightly smaller than the min or
>> larger than the max, and still a valid UTF-8 sequence.
>> 
> 
> I agree that we want to be able to trim the values. I've seen cases where
> the String is huge (~100k) and makes the StringStatistics huge. I'd propose
> that we do something like:
> 
> message StringStatistics {
>  optional string minimum = 1;
>  optional string maximum = 2;
>  // sum will store the total length of all strings in a stripe
>  optional sint64 sum = 3;
>  // if set, the minimum will not be set and the lowerBound <= all values
>  optional string lowerBound = 4;
>  // if set, the maximum will not be set and the upperBound >= all values
>  optional string upperBound = 5;
> }
> 
> We shouldn't have any UTF16 in ORC. Is there a case where we compare
> strings that way? In particular, the StringStatistics uses Text, which uses
> UTF-8 as its encoding.
> 
> .. Owen
> 
> 
>> Thoughts?
>> 
>> -dain
>> 
>> 
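
For the trimming idea in the quoted thread, here is a rough sketch of how a writer might derive the proposed lowerBound and upperBound values. It is only an illustration under assumptions: the limit is counted in Java chars (a real limit would more likely be in UTF-8 bytes), the class and method names are hypothetical, and the upper bound only bumps characters below U+D7FF so the result never lands in the surrogate range.

```java
// Hypothetical sketch; not taken from the ORC or Hive code base.
final class StatsBoundTrimmer {
    private StatsBoundTrimmer() {}

    // Longest prefix of at most maxChars chars that stops before the first
    // surrogate code unit, so the prefix contains only BMP characters.
    static String trimmedPrefix(String value, int maxChars) {
        int end = 0;
        while (end < value.length() && end < maxChars
                && !Character.isSurrogate(value.charAt(end))) {
            end++;
        }
        return value.substring(0, end);
    }

    // A prefix never sorts after the full string, so it is a valid lower bound.
    static String lowerBound(String value, int maxChars) {
        return trimmedPrefix(value, maxChars);
    }

    // Returns the value itself when nothing was trimmed; otherwise the prefix
    // with its last incrementable char bumped by one, which sorts after every
    // string sharing that prefix. Returns null when no bound can be built.
    static String upperBound(String value, int maxChars) {
        String prefix = trimmedPrefix(value, maxChars);
        if (prefix.length() == value.length()) {
            return value; // exact maximum, no trimming needed
        }
        for (int i = prefix.length() - 1; i >= 0; i--) {
            char c = prefix.charAt(i);
            if (c < 0xD7FF) { // conservative: stay below the surrogate range
                return prefix.substring(0, i) + (char) (c + 1);
            }
        }
        return null; // caller should leave the bound unset
    }
}
```

Because the bumped position holds a BMP character on both sides of the comparison, the bound compares the same way whether strings are ordered by UTF-8 bytes or by UTF-16 code units.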
