chong created ORC-1083:
--------------------------
Summary: Failed to prune when converting Hybrid calendar to
Proleptic calendar
Key: ORC-1083
URL: https://issues.apache.org/jira/browse/ORC-1083
Project: ORC
Issue Type: Bug
Components: Java
Affects Versions: 1.6.11
Reporter: chong
Attachments: read-hybrid-as-proleptic.orc
The ORC file has a single date column and a single row, written in the hybrid
(Julian/Gregorian) calendar: 1582-10-03.
Row-group pruning fails for the filter "c1 = 1582-10-03" when the hybrid
calendar is converted to the proleptic calendar. The date "1582-10-03" in the
hybrid calendar is "1582-09-23" or "1582-10-13" in the proleptic calendar; I'm
not sure which, but it is clearly different from the hybrid date. The query
should therefore return no rows when filtering with "c1 = 1582-10-03".
"pickRowGroups" fails to prune when "orc.proleptic.gregorian" is set to
"true". This occurs on version 1.6.11 and later; version 1.5.10 is correct.
*The Orc file was attached: read-hybrid-as-proleptic.orc*
```
$ java -jar orc-tools-1.7.0-uber.jar meta read-hybrid-as-proleptic.orc
Processing data file read-hybrid-as-proleptic.orc [length: 246]
Structure for read-hybrid-as-proleptic.orc
File Version: 0.12 with ORC_14 by ORC Java 1.6.11
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<c1:date>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 8 min: Hybrid AD 1582-10-03 max: Hybrid AD 1582-10-03
File Statistics:
Column 0: count: 1 hasNull: false
Column 1: count: 1 hasNull: false bytesOnDisk: 8 min: Hybrid AD 1582-10-03 max: Hybrid AD 1582-10-03
Stripes:
Stripe: offset: 3 data: 8 rows: 1 tail: 35 index: 37
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 26
Stream: column 1 section DATA start: 40 length 8
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 246 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata:
org.apache.spark.version=3.2.0
________________________________________________________________________________________________________________________
$ java -jar orc-tools-1.7.0-uber.jar data read-hybrid-as-proleptic.orc
Processing data file read-hybrid-as-proleptic.orc [length: 246]
{"c1":"1582-10-03"}
________________________________________________________________________________________________________________________
```
*Code to reproduce this:*
```
import java.sql.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.DateColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

Configuration conf = new Configuration();
// Convert dates to the proleptic Gregorian calendar on read
conf.set("orc.proleptic.gregorian", "true");
Reader reader = OrcFile.createReader(
    new Path("<path to read-hybrid-as-proleptic.orc>"),
    OrcFile.readerOptions(conf));
System.out.println("File schema: " + reader.getSchema());
System.out.println("File row count: " + reader.getNumberOfRows());

Date dateForFilter = Date.valueOf("1582-10-03");
System.out.println("Filter is c1 == " + dateForFilter);
RecordReader rowIterator = reader.rows(
    reader.options()
        .searchArgument(SearchArgumentFactory.newBuilder()
            .equals("c1", PredicateLeaf.Type.DATE, dateForFilter)
            .build(), new String[]{"c1"}) // predicate pushdown
);

// Read the row data
VectorizedRowBatch batch = reader.getSchema().createRowBatch();
DateColumnVector x = (DateColumnVector) batch.cols[0];
System.out.println("-------------find-------------------------");
while (rowIterator.nextBatch(batch)) {
  for (int row = 0; row < batch.size; ++row) {
    int xRow = x.isRepeating ? 0 : row;
    // Print the raw value (days since 1970-01-01), or null
    System.out.println("c1: "
        + (x.noNulls || !x.isNull[xRow] ? x.vector[xRow] : null));
  }
}
rowIterator.close();
```
*Comparison between 1.5.10 and 1.6.11*
```
For Orc version 1.5.10
-------------find-------------------------
For Orc version 1.6.11
-------------find-------------------------
c1: -141439
```
*Other information*
Please try toggling conf.set("orc.proleptic.gregorian", "true") on 1.5.10 and
1.6.11 to see the difference.
The date "1582-10-03" in the hybrid calendar is "1582-09-23" or "1582-10-13" in
the proleptic calendar; which one is correct?
This was found on Spark 3.2.0; Spark 3.0.1 behaves correctly.
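On the question of which proleptic date matches hybrid "1582-10-03", the
following is a minimal JDK-only sketch, not ORC's own conversion code. It
assumes "the same day" means the same instant on the timeline, and relies on
java.util.GregorianCalendar using the hybrid calendar with its default cutover
of 1582-10-15. The class name HybridToProleptic and the method
sameInstantInProleptic are hypothetical names chosen for illustration.

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class HybridToProleptic {
  private static final long MILLIS_PER_DAY = 86_400_000L;

  // Interprets (year, month, day) in the hybrid Julian/Gregorian calendar
  // and returns the proleptic Gregorian date of the same instant.
  static LocalDate sameInstantInProleptic(int year, int month, int day) {
    // GregorianCalendar is hybrid: dates before 1582-10-15 are Julian.
    GregorianCalendar hybrid =
        new GregorianCalendar(TimeZone.getTimeZone("UTC"));
    hybrid.clear();
    hybrid.set(year, month - 1, day); // Calendar months are 0-based
    long epochDay = Math.floorDiv(hybrid.getTimeInMillis(), MILLIS_PER_DAY);
    // LocalDate uses the proleptic ISO (Gregorian) calendar.
    return LocalDate.ofEpochDay(epochDay);
  }

  public static void main(String[] args) {
    LocalDate proleptic = sameInstantInProleptic(1582, 10, 3);
    System.out.println("hybrid 1582-10-03 -> proleptic " + proleptic
        + " (epoch day " + proleptic.toEpochDay() + ")");
    // Decode the raw value printed by the 1.6.11 run above.
    System.out.println("-141439 -> proleptic " + LocalDate.ofEpochDay(-141439L));
  }
}
```

Under that assumption the answer is 1582-10-13 (epoch day -141429), since the
Julian calendar runs 10 days behind the Gregorian one in 1582. The value
-141439 returned by 1.6.11 decodes to proleptic 1582-10-03, i.e. the same
printed date rather than the same instant.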