[
https://issues.apache.org/jira/browse/ORC-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469652#comment-17469652
]
Yiqun Zhang edited comment on ORC-1075 at 1/6/22, 3:33 AM:
-----------------------------------------------------------
[~wbo4958] Thank you for your comment, which makes the issue clear.
The standard ORC library creates ColumnStatistics according to the schema
category, so it never creates a bare ColumnStatisticsImpl; the statistics
are always tied to a specific type.
However, we do support deserialization into a ColumnStatisticsImpl, which is
what happens for ORC files generated by unofficial tools.
So I support staying compatible with such files.
My idea is to modify ValueRange and add a hasValues field to replace the
current implementation of the method:
{code:java}
  boolean hasValues() {
-   return lower != null;
+   return hasValues; // Assigned from index.getNumberOfValues() != 0
  }
+ boolean hasComparableValue() {
+   return lower != null;
+ }
{code}
pickRowGroups would then return TruthValue.YES_NO_NULL when it encounters
custom statistics that only guarantee that values exist but cannot be compared.
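To make the proposal concrete, here is a minimal, self-contained sketch of the intended behaviour. It is not the actual ORC code: ValueRangeSketch, its nested ValueRange, the reduced Truth enum, evaluateEquals and isNeeded are hypothetical stand-ins written for illustration, and null handling is simplified (an empty range is assumed to evaluate to NULL).

```java
/** Hypothetical, simplified model of the proposal; these are NOT the real ORC classes. */
final class ValueRangeSketch {

    /** Reduced set of truth values; the names follow ORC's TruthValue, the set is an assumption. */
    enum Truth { YES, NO, NULL, YES_NO, YES_NO_NULL }

    /** Stand-in for ValueRange with the proposed hasValues field. */
    static final class ValueRange {
        final Long lower;        // null when the statistics carry no min/max
        final Long upper;        // assumed to be set together with lower
        final boolean hasValues; // proposed field: numberOfValues != 0

        ValueRange(Long lower, Long upper, long numberOfValues) {
            this.lower = lower;
            this.upper = upper;
            this.hasValues = numberOfValues != 0;
        }

        boolean hasValues() {
            return hasValues;
        }

        boolean hasComparableValue() {
            return lower != null;
        }
    }

    /** Evaluates "column = literal" against one row group's range. */
    static Truth evaluateEquals(ValueRange range, long literal) {
        if (!range.hasValues()) {
            // No non-null values at all: the predicate can only be NULL (assumption).
            return Truth.NULL;
        }
        if (!range.hasComparableValue()) {
            // Values exist but there is no min/max to compare against,
            // so the row group cannot be ruled out.
            return Truth.YES_NO_NULL;
        }
        if (literal < range.lower || literal > range.upper) {
            return Truth.NO;
        }
        return Truth.YES_NO;
    }

    /** A row group must be read unless the predicate can never be true. */
    static boolean isNeeded(Truth truth) {
        return truth != Truth.NO && truth != Truth.NULL;
    }

    public static void main(String[] args) {
        // Mirrors the attached file: 3 values, but no min/max in the row index.
        ValueRange noMinMax = new ValueRange(null, null, 3);
        Truth result = evaluateEquals(noMinMax, 2L);
        System.out.println(result + " needed=" + isNeeded(result));
    }
}
```

With the current behaviour (hasValues() returning lower != null), the same row group would be treated as empty and skipped, which matches the missing rows reported for 1.6.11.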
> Failed to read rows from the ORC file without statistics in RowIndex when
> filter is pushed down for 1.6.11
> ----------------------------------------------------------------------------------------------------------
>
> Key: ORC-1075
> URL: https://issues.apache.org/jira/browse/ORC-1075
> Project: ORC
> Issue Type: Bug
> Components: Java, Reader
> Affects Versions: 1.6.11
> Reporter: Bobby Wang
> Priority: Blocker
> Attachments: none-1.orc
>
>
> I have attached an ORC file that does not seem to include ColumnStatistics in
> its RowIndex.
> {color:#FF0000}From the ORC spec, it seems that the statistics field of
> RowIndexEntry is not a required field?{color}
>
> {code:java}
> message RowIndexEntry {
>   repeated uint64 positions = 1 [packed=true];
>   optional ColumnStatistics statistics = 2;
> }
> message RowIndex {
>   repeated RowIndexEntry entry = 1;
> }
> {code}
> The metadata of the ORC file:
>
> {code:java}
> $ orctools meta none.orc
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.util.Shell).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more
> info.
> Processing data file none.orc [length: 124]
> Structure for none.orc
> File Version: 0.12 with ORIGINAL
> Rows: 3
> Compression: NONE
> Calendar: Julian/Gregorian
> Type: struct<INT:int>
> Stripe Statistics:
> Stripe 1:
> Column 0: count: 3 hasNull: true
> Column 1: count: 3 hasNull: true min: 1 max: 3 sum: 6
> File Statistics:
> Stripes:
> Stripe: offset: 3 data: 4 rows: 3 tail: 32 index: 10
> Stream: column 0 section ROW_INDEX start: 3 length 4
> Stream: column 1 section ROW_INDEX start: 7 length 6
> Stream: column 1 section DATA start: 13 length 4
> Encoding column 0: DIRECT
> Encoding column 1: DIRECT_V2
> File length: 124 bytes
> Padding length: 0 bytes
> Padding ratio: 0%
> {code}
>
> The data of the ORC file:
> {code:java}
> $ orctools data none.orc
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.util.Shell).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more
> info.
> Processing data file none.orc [length: 124]
> {"INT":1}
> {"INT":2}
> {"INT":3}{code}
> I use the code below to try to read each row of the ORC file:
> {code:java}
> // Pick the schema we want to read using schema evolution
> TypeDescription readSchema = TypeDescription.fromString("struct<INT:int>");
> // Get the information from the file footer
> Reader reader = OrcFile.createReader(new Path("none.orc"),
>     OrcFile.readerOptions(new Configuration()));
> System.out.println("File schema: " + reader.getSchema());
> System.out.println("Row count: " + reader.getNumberOfRows());
> RecordReader rowIterator = reader.rows(
>     reader.options()
>         .schema(readSchema)
>         .searchArgument(SearchArgumentFactory.newBuilder()
>             .equals("INT", PredicateLeaf.Type.LONG, 2L)
>             .build(), new String[]{"INT"}) // predicate pushdown
> );
> // Read the row data
> VectorizedRowBatch batch = readSchema.createRowBatch();
> LongColumnVector x = (LongColumnVector) batch.cols[0];
> while (rowIterator.nextBatch(batch)) {
>   System.out.println(batch.size);
>   for (int row = 0; row < batch.size; ++row) {
>     // A repeating vector stores its single value at index 0
>     int xRow = x.isRepeating ? 0 : row;
>     System.out.println("INT: " + (x.noNulls || !x.isNull[xRow] ?
>         x.vector[xRow] : null));
>   }
> }
> rowIterator.close();{code}
>
> h2. output from 1.6.11
> File schema: struct<INT:int>
> Row count: 3
> h2. output from 1.5.10
> File schema: struct<INT:int>
> Row count: 3
> 3
> INT: 1
> INT: 2
> INT: 3
>
> I actually found this issue on Spark 3.2, which depends on ORC 1.6.11; the
> issue does not occur on Spark 3.0.x, which depends on ORC 1.5.10.
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)