chong created ORC-1083:
--------------------------

Summary: Failed to prune when converting Hybrid calendar to Proleptic calendar
Key: ORC-1083
URL: https://issues.apache.org/jira/browse/ORC-1083
Project: ORC
Issue Type: Bug
Components: Java
Affects Versions: 1.6.11
Reporter: chong
Attachments: read-hybrid-as-proleptic.orc

The ORC file has a single date column and a single row, written in the hybrid (Julian/Gregorian) calendar: 1582-10-03. Row-group pruning fails for the filter "c1 = 1582-10-03" when the reader converts the hybrid calendar to the proleptic calendar.

The date "1582-10-03" in the hybrid calendar is "1582-09-23" or "1582-10-13" in the proleptic calendar. I'm not sure which one, but either way it differs from the hybrid value, so the query should return no rows when filtering with "c1 = 1582-10-03".

"pickRowGroups" fails to prune when "orc.proleptic.gregorian" is set to "true". This occurs on version 1.6.11+; version 1.5.10 is correct.

*The ORC file is attached: read-hybrid-as-proleptic.orc*

```
$ java -jar orc-tools-1.7.0-uber.jar meta read-hybrid-as-proleptic.orc
Processing data file read-hybrid-as-proleptic.orc [length: 246]
Structure for read-hybrid-as-proleptic.orc
File Version: 0.12 with ORC_14 by ORC Java 1.6.11
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<c1:date>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 8 min: Hybrid AD 1582-10-03 max: Hybrid AD 1582-10-03

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 8 min: Hybrid AD 1582-10-03 max: Hybrid AD 1582-10-03

Stripes:
  Stripe: offset: 3 data: 8 rows: 1 tail: 35 index: 37
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 26
    Stream: column 1 section DATA start: 40 length 8
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 246 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.2.0
________________________________________________________________________________________________________________________

(base) [chong@chong-pc tools]$ java -jar orc-tools-1.7.0-uber.jar data read-hybrid-as-proleptic.orc
Processing data file read-hybrid-as-proleptic.orc [length: 246]
{"c1":"1582-10-03"}
________________________________________________________________________________________________________________________
```

*Code to reproduce this:*

```
import java.sql.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.DateColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

Configuration conf = new Configuration();
// Convert to the proleptic calendar on read
conf.set("orc.proleptic.gregorian", "true");

Reader reader = OrcFile.createReader(
    new Path("<path to read-hybrid-as-proleptic.orc>"),
    OrcFile.readerOptions(conf));
System.out.println("File schema: " + reader.getSchema());
System.out.println("File row count: " + reader.getNumberOfRows());

Date dateForFilter = Date.valueOf("1582-10-03");
System.out.println("Filter is c1 == " + dateForFilter);

RecordReader rowIterator = reader.rows(
    reader.options()
        // Predicate pushdown: c1 = 1582-10-03
        .searchArgument(SearchArgumentFactory.newBuilder()
            .equals("c1", PredicateLeaf.Type.DATE, dateForFilter)
            .build(), new String[]{"c1"})
);

// Read the row data
VectorizedRowBatch batch = reader.getSchema().createRowBatch();
DateColumnVector x = (DateColumnVector) batch.cols[0];

System.out.println("-------------find-------------------------");
while (rowIterator.nextBatch(batch)) {
  for (int row = 0; row < batch.size; ++row) {
    int xRow = x.isRepeating ? 0 : row;
    // Prints the raw day number (days since 1970-01-01), or null
    System.out.println("c1: " + (x.noNulls || !x.isNull[xRow] ? x.vector[xRow] : null));
  }
}
rowIterator.close();
```

*Comparison between 1.5.10 and 1.6.11*

```
For Orc version 1.5.10
-------------find-------------------------

For Orc version 1.6.11
-------------find-------------------------
c1: -141439
```

*Other information*

Please try switching conf.set("orc.proleptic.gregorian", "true") for 1.5.10 and 1.6.11 and see the difference.
The date "1582-10-03" in the hybrid calendar is "1582-09-23" or "1582-10-13" in the proleptic calendar; which one is correct?
This was found on Spark 3.2.0; Spark 3.0.1 is correct.
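
On the open question of which proleptic date corresponds to the hybrid "1582-10-03", the JDK calendar classes can be used as a cross-check. The following is a minimal sketch, assuming that java.util.GregorianCalendar with its default cutover (1582-10-15) models the hybrid calendar and that java.time.LocalDate models the proleptic Gregorian calendar. The class and method names are illustrative and are not part of ORC, and the sketch makes no claim about how ORC performs its own conversion.

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Illustrative helper (hypothetical, not part of ORC): compares day numbers
// for the same date label in the hybrid and proleptic Gregorian calendars.
class HybridVsProleptic {
    private static final long MILLIS_PER_DAY = 86_400_000L;

    // Days since 1970-01-01 for a date label interpreted in the hybrid
    // (Julian/Gregorian) calendar, using the JDK default cutover.
    static long hybridEpochDay(int year, int month, int day) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();                    // drop the current time, keep the UTC zone
        cal.set(year, month - 1, day);  // Calendar months are zero-based
        return Math.floorDiv(cal.getTimeInMillis(), MILLIS_PER_DAY);
    }

    // Proleptic Gregorian label of the same physical day.
    static LocalDate sameDayInProleptic(int year, int month, int day) {
        return LocalDate.ofEpochDay(hybridEpochDay(year, month, day));
    }

    public static void main(String[] args) {
        // Day number of hybrid 1582-10-03
        System.out.println(hybridEpochDay(1582, 10, 3));            // -141429
        // Same physical day, labelled in the proleptic calendar
        System.out.println(sameDayInProleptic(1582, 10, 3));        // 1582-10-13
        // Day number of the label 1582-10-03 read as a proleptic date
        System.out.println(LocalDate.of(1582, 10, 3).toEpochDay()); // -141439
    }
}
```

Under these assumptions, the hybrid date 1582-10-03 falls on the same physical day as proleptic 1582-10-13, not 1582-09-23. The value -141439 printed by the 1.6.11 reader equals the day number of the label 1582-10-03 read as a proleptic date, which suggests (but does not prove) that the reader keeps the date label and changes the underlying day number.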