cxzl25 commented on issue #1061: URL: https://github.com/apache/orc/issues/1061#issuecomment-3072224549
### Root Cause

It looks like when a string value exceeds 64 bytes, Presto and Trino drop it from the column statistics, so the ORC file is written without `StringColumnStatistics` (no minimum/maximum). These writers also do not use the `lowerBound`/`upperBound` fields introduced by [ORC-203](https://issues.apache.org/jira/browse/ORC-203).

https://github.com/prestodb/presto/commit/ea23a9070cfdc922060b594412b81c51ef223fd2

com.facebook.presto.orc.OrcWriterOptions
```java
public static final DataSize DEFAULT_MAX_STRING_STATISTICS_LIMIT = new DataSize(64, BYTE);
```

---

https://github.com/trinodb/trino/commit/ea23a9070cfdc922060b594412b81c51ef223fd2

io.trino.orc.OrcWriterOptions
```java
static final DataSize DEFAULT_MAX_STRING_STATISTICS_LIMIT = DataSize.ofBytes(64);
```

### Presto/Trino Config

Presto/Trino should support configuring `hive.orc.writer.string-statistics-limit`, so that the limit can be raised (e.g., toward the maximum int value) and this problem avoided.

### ORC Reader

I'm wondering whether the ORC reader could also be made tolerant of this behavior, similar to [ORC-1553](https://issues.apache.org/jira/browse/ORC-1553): recognize column statistics that are known to be incomplete and ignore them during predicate evaluation. I tried the following fix, and with it Spark queries return the correct results.

org.apache.orc.impl.RecordReaderImpl#evaluatePredicateProto
```java
// The deserialized statistics object is the plain base ColumnStatisticsImpl
// (no type-specific min/max was written) and the value range has no bounds,
// so the statistics cannot be used to eliminate the row group.
if (cs.getClass().getName().equals(ColumnStatisticsImpl.class.getName())
    && range.hasNulls && range.lower == null && range.upper == null) {
  return TruthValue.YES_NO_NULL;
}
```

<img width="853" height="186" alt="Image" src="https://github.com/user-attachments/assets/c9df1882-2510-4f41-b54d-c95937085dcc" />
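To check whether a given file is affected, the file-level statistics can be inspected with the ORC Java API. Below is a minimal sketch (the `InspectStringStats` class name and the command-line file path are placeholders of mine, and it assumes a recent ORC release where `Reader` is `Closeable`): it prints `getMinimum()`/`getMaximum()` plus the ORC-203 `getLowerBound()`/`getUpperBound()` when string statistics are present, and flags columns whose statistics object is not a `StringColumnStatistics` at all.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.ColumnStatistics;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.StringColumnStatistics;

public class InspectStringStats {
  public static void main(String[] args) throws Exception {
    // Path to an ORC file written by Presto/Trino (passed as the first argument).
    Path file = new Path(args[0]);
    try (Reader reader = OrcFile.createReader(file, OrcFile.readerOptions(new Configuration()))) {
      ColumnStatistics[] stats = reader.getStatistics();
      for (int i = 0; i < stats.length; i++) {
        ColumnStatistics cs = stats[i];
        if (cs instanceof StringColumnStatistics) {
          StringColumnStatistics scs = (StringColumnStatistics) cs;
          // getLowerBound()/getUpperBound() are the truncated bounds added by ORC-203.
          System.out.printf("col %d: min=%s max=%s lowerBound=%s upperBound=%s%n",
              i, scs.getMinimum(), scs.getMaximum(), scs.getLowerBound(), scs.getUpperBound());
        } else {
          // A string column whose statistics are not StringColumnStatistics indicates
          // the writer dropped the min/max entirely, matching the root cause above.
          System.out.printf("col %d: %s (no string statistics)%n",
              i, cs.getClass().getSimpleName());
        }
      }
    }
  }
}
```

If the root cause above is correct, a long-string column written by these writers should fall into the second branch, while a file written by the ORC Java writer should still show the truncated bounds.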
