> I agree that we want to be able to trim the values. I've seen cases where
>  the String is huge (~100k) and makes the StringStatistics huge. I'd propose
>  that we do something like:

The only concrete consumer of this data outside of ORC readers is probably
partial scan computation of statistics from the footers.

In some cases, I find it better to avoid computing min-max ranges when the
strings exceed a useful length, as keeping them updated involves a comparison
for every new row.

Long JSON strings or URLs are usually slower to write simply because of this
comparison.

So this is a great idea, provided there is an appropriate indication to the
partial scan reader not to update stats for those columns.
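
To make the idea concrete, here is a rough sketch of what I have in mind. This
is not ORC code: the class name, the maxLength parameter and the isValid() flag
are all made up for illustration. Once a value exceeds the limit, min/max are
dropped and no further per-row comparisons are done; a partial scan reader
would check the flag before using the range.

```java
import java.util.Optional;

// Hypothetical helper, not part of the ORC API.
final class BoundedStringStats {
    private final int maxLength;   // assumed cutoff, in UTF-16 code units
    private String min;
    private String max;
    private boolean valid = true;  // false once any value exceeded maxLength

    BoundedStringStats(int maxLength) {
        this.maxLength = maxLength;
    }

    void update(String value) {
        // Once invalidated, skip the per-row comparison entirely.
        if (!valid || value == null) {
            return;
        }
        if (value.length() > maxLength) {
            // Too long to be useful: drop the range and stop tracking it.
            valid = false;
            min = null;
            max = null;
            return;
        }
        if (min == null || value.compareTo(min) < 0) {
            min = value;
        }
        if (max == null || value.compareTo(max) > 0) {
            max = value;
        }
    }

    // Readers should only trust getMin()/getMax() when this returns true.
    boolean isValid() {
        return valid;
    }

    Optional<String> getMin() {
        return Optional.ofNullable(min);
    }

    Optional<String> getMax() {
        return Optional.ofNullable(max);
    }
}
```

Whether the flag ends up in the footer as an explicit field or is implied by
the absence of min/max is a separate question; the sketch only shows the
writer-side short circuit.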

Cheers,
Gopal


