> I agree that we want to be able to trim the values. I've seen cases where
> the String is huge (~100k) and makes the StringStatistics huge. I'd propose
> that we do something like:

The only concrete consumer of this data outside of ORC readers is probably the partial scan computation of statistics from the footers.

In some cases, I find it better to avoid computing min-max ranges when the strings exceed a useful length, since keeping the range updated involves a comparison for every new row. Long JSON strings or URLs are usually slower to write simply because of this comparison.

So this is a great idea, as long as there is an appropriate indication to the partial scan reader not to update stats for those columns.

Cheers,
Gopal
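
To make the idea above concrete, here is a minimal sketch of a collector that gives up on min/max once a value exceeds a length cutoff and exposes a flag a partial scan reader could check. It is not ORC's actual StringStatistics code: the class name `BoundedStringStats`, the `maxTrackedLength` parameter, and the `isMinMaxValid()` flag are all hypothetical, and the cutoff is measured in Java chars rather than encoded bytes purely for simplicity.

```java
// Hypothetical sketch only; names and cutoff semantics are assumptions,
// not part of the ORC API.
public class BoundedStringStats {
    private final int maxTrackedLength; // assumed cutoff, in UTF-16 chars
    private String min;
    private String max;
    private long count;
    private boolean minMaxValid = true;

    public BoundedStringStats(int maxTrackedLength) {
        this.maxTrackedLength = maxTrackedLength;
    }

    public void update(String value) {
        count++;
        if (!minMaxValid) {
            return; // range already abandoned: no per-row comparison
        }
        if (value.length() > maxTrackedLength) {
            // One oversized value invalidates the range for this column.
            minMaxValid = false;
            min = null;
            max = null;
            return;
        }
        if (min == null || value.compareTo(min) < 0) {
            min = value;
        }
        if (max == null || value.compareTo(max) > 0) {
            max = value;
        }
    }

    // A reader merging footers would skip min/max when this is false.
    public boolean isMinMaxValid() { return minMaxValid; }
    public String getMin() { return min; }
    public String getMax() { return max; }
    public long getCount() { return count; }
}
```

Once the flag flips, later rows only bump the count, so the per-row comparison cost for long JSON or URL columns goes away.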
