cxzl25 commented on issue #1061: URL: https://github.com/apache/orc/issues/1061#issuecomment-3072224549
### Root Cause

It looks like when a string value exceeds 64 bytes, Presto and Trino drop it from the column statistics, so the ORC file is written without `StringColumnStatistics` (no minimum/maximum). These writers also do not use the `lowerBound`/`upperBound` fields introduced by [ORC-203](https://issues.apache.org/jira/browse/ORC-203).

https://github.com/prestodb/presto/commit/ea23a9070cfdc922060b594412b81c51ef223fd2

com.facebook.presto.orc.OrcWriterOptions
```java
public static final DataSize DEFAULT_MAX_STRING_STATISTICS_LIMIT = new DataSize(64, BYTE);
```

---

https://github.com/trinodb/trino/commit/ea23a9070cfdc922060b594412b81c51ef223fd2

io.trino.orc.OrcWriterOptions
```java
static final DataSize DEFAULT_MAX_STRING_STATISTICS_LIMIT = DataSize.ofBytes(64);
```

### Presto/Trino Config

Presto/Trino should support configuring `hive.orc.writer.string-statistics-limit`, so that the limit can be raised (e.g., toward the maximum int value) and this problem avoided.

### ORC Reader

I'm wondering whether the ORC reader could also be made tolerant of this behavior, similar to [ORC-1553](https://issues.apache.org/jira/browse/ORC-1553): recognize column statistics that are known to be incomplete and ignore them during predicate evaluation. I tried the following fix, and with it Spark queries return the correct results.

org.apache.orc.impl.RecordReaderImpl#evaluatePredicateProto
```java
// The deserialized statistics object is the plain base ColumnStatisticsImpl
// (no type-specific min/max was written) and the value range has no bounds,
// so the statistics cannot be used to eliminate the row group.
if (cs.getClass().getName().equals(ColumnStatisticsImpl.class.getName())
    && range.hasNulls && range.lower == null && range.upper == null) {
  return TruthValue.YES_NO_NULL;
}
```

<img width="853" height="186" alt="Image" src="https://github.com/user-attachments/assets/c9df1882-2510-4f41-b54d-c95937085dcc" />
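To check whether a given file is affected, the file-level statistics can be inspected with the ORC Java API. Below is a minimal sketch (the `InspectStringStats` class name and the command-line file path are placeholders of mine, and it assumes a recent ORC release where `Reader` is `Closeable`): it prints `getMinimum()`/`getMaximum()` plus the ORC-203 `getLowerBound()`/`getUpperBound()` when string statistics are present, and flags columns whose statistics object is not a `StringColumnStatistics` at all.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.ColumnStatistics;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.StringColumnStatistics;

public class InspectStringStats {
  public static void main(String[] args) throws Exception {
    // Path to an ORC file written by Presto/Trino (passed as the first argument).
    Path file = new Path(args[0]);
    try (Reader reader = OrcFile.createReader(file, OrcFile.readerOptions(new Configuration()))) {
      ColumnStatistics[] stats = reader.getStatistics();
      for (int i = 0; i < stats.length; i++) {
        ColumnStatistics cs = stats[i];
        if (cs instanceof StringColumnStatistics) {
          StringColumnStatistics scs = (StringColumnStatistics) cs;
          // getLowerBound()/getUpperBound() are the truncated bounds added by ORC-203.
          System.out.printf("col %d: min=%s max=%s lowerBound=%s upperBound=%s%n",
              i, scs.getMinimum(), scs.getMaximum(), scs.getLowerBound(), scs.getUpperBound());
        } else {
          // A string column whose statistics are not StringColumnStatistics indicates
          // the writer dropped the min/max entirely, matching the root cause above.
          System.out.printf("col %d: %s (no string statistics)%n",
              i, cs.getClass().getSimpleName());
        }
      }
    }
  }
}
```

If the root cause above is correct, a long-string column written by these writers should fall into the second branch, while a file written by the ORC Java writer should still show the truncated bounds.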
