[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946915#comment-15946915 ] Vitalii Diravka commented on DRILL-4980: [~rkins] No, it doesn't have any functional impact. It is just code refactoring and redesign of an approach of parquet date correctness status detection. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.9.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.10.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946260#comment-15946260 ] Rahul Challapalli commented on DRILL-4980: -- [~vitalii] Does this have any functional impact? Or can I come up with a query/data combination to reproduce the issue? I am asking as I want to automate the testcase in our functional regression runs > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.9.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.10.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946187#comment-15946187 ] Kunal Khatua commented on DRILL-4980: - [~rkins] Assigning this to you as you have reviewed DRILL-4203 as well. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.9.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.10.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691681#comment-15691681 ] ASF GitHub Bot commented on DRILL-4980: --- Github user asfgit closed the pull request at: https://github.com/apache/drill/pull/644 > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Parth Chandra > Fix For: Future > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691655#comment-15691655 ] ASF GitHub Bot commented on DRILL-4980: --- Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/644 +1 > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Parth Chandra > Fix For: Future > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15686374#comment-15686374 ] ASF GitHub Bot commented on DRILL-4980: --- Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r89088416 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -189,10 +195,16 @@ public static DateCorruptionStatus detectCorruptDates(ParquetMetadata footer, String createdBy = footer.getFileMetaData().getCreatedBy(); String drillVersion = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.DRILL_VERSION_PROPERTY); -String isDateCorrect = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.IS_DATE_CORRECT_PROPERTY); +String stringWriterVersion = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.WRITER_VERSION_PROPERTY); +// This flag can be present in parquet files which were generated with 1.9.0-SNAPSHOT drill version. +// If this flag is present it means that the version of the drill parquet writer is 2 +final String isDateCorrectFlag = "is.date.correct"; +String isDateCorrect = footer.getFileMetaData().getKeyValueMetaData().get(isDateCorrectFlag); if (drillVersion != null) { - return Boolean.valueOf(isDateCorrect) ? DateCorruptionStatus.META_SHOWS_NO_CORRUPTION - : DateCorruptionStatus.META_SHOWS_CORRUPTION; + int writerVersion = (stringWriterVersion != null) ? Integer.parseInt(stringWriterVersion) --- End diff -- Agree. It makes sense. I even found one redundant check in the return statement. Changes in a new commit. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Parth Chandra > Fix For: Future > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15685426#comment-15685426 ] ASF GitHub Bot commented on DRILL-4980: --- Github user parthchandra commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r89031035 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -189,10 +195,16 @@ public static DateCorruptionStatus detectCorruptDates(ParquetMetadata footer, String createdBy = footer.getFileMetaData().getCreatedBy(); String drillVersion = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.DRILL_VERSION_PROPERTY); -String isDateCorrect = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.IS_DATE_CORRECT_PROPERTY); +String stringWriterVersion = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.WRITER_VERSION_PROPERTY); +// This flag can be present in parquet files which were generated with 1.9.0-SNAPSHOT drill version. +// If this flag is present it means that the version of the drill parquet writer is 2 +final String isDateCorrectFlag = "is.date.correct"; +String isDateCorrect = footer.getFileMetaData().getKeyValueMetaData().get(isDateCorrectFlag); if (drillVersion != null) { - return Boolean.valueOf(isDateCorrect) ? DateCorruptionStatus.META_SHOWS_NO_CORRUPTION - : DateCorruptionStatus.META_SHOWS_CORRUPTION; + int writerVersion = (stringWriterVersion != null) ? Integer.parseInt(stringWriterVersion) --- End diff -- Use if-then-else to make this a little easier to read? > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Parth Chandra > Fix For: Future > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654351#comment-15654351 ] ASF GitHub Bot commented on DRILL-4980: --- Github user vdiravka commented on the issue: https://github.com/apache/drill/pull/644 @parthchandra Could you please review this PR? > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654323#comment-15654323 ] ASF GitHub Bot commented on DRILL-4980: --- Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r87416017 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -185,7 +185,8 @@ private Metadata(FileSystem fs, ParquetFormatConfig formatConfig) { childFiles.add(file); } } -ParquetTableMetadata_v3 parquetTableMetadata = new ParquetTableMetadata_v3(true); +ParquetTableMetadata_v3 parquetTableMetadata = new ParquetTableMetadata_v3(DrillVersionInfo.getVersion(), +ParquetWriter.WRITER_VERSION); --- End diff -- `is.date.correct` or `parquet-writer.version` were needed in metadata cache file for quick detection of date values correctness. Otherwise need to check `files.rowGroups.columns.mxValue` values from this cache file. But thought a little, I've understood that due to new added `ParquetTableMetadata_v3` we can check: If version of parquet metadata cache file is 3, the date values are definitely correct. Otherwise (when parquet metadata cache file was generated earlier) need to check date values from this file. So `writerVersion` is redundant in the `ParquetTableMetadataBase` now. I deleted it. Please approve does it make sense? > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654274#comment-15654274 ] ASF GitHub Bot commented on DRILL-4980: --- Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r87411842 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -59,19 +59,24 @@ */ public static final long JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH = 2440588; /** - * All old parquet files (which haven't "is.date.correct=true" property in metadata) have - * a corrupt date shift: {@value} days or 2 * {@value #JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH} + * All old parquet files (which haven't "is.date.correct=true" or "parquet-writer.version" properties + * in metadata) have a corrupt date shift: {@value} days or 2 * {@value #JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH} */ public static final long CORRECT_CORRUPT_DATE_SHIFT = 2 * JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH; - // The year 5000 (or 1106685 day from Unix epoch) is chosen as the threshold for auto-detecting date corruption. - // This balances two possible cases of bad auto-correction. External tools writing dates in the future will not - // be shifted unless they are past this threshold (and we cannot identify them as external files based on the metadata). - // On the other hand, historical dates written with Drill wouldn't risk being incorrectly shifted unless they were - // something like 10,000 years in the past. private static final Chronology UTC = org.joda.time.chrono.ISOChronology.getInstanceUTC(); + /** + * The year 5000 (or 1106685 day from Unix epoch) is chosen as the threshold for auto-detecting date corruption. + * This balances two possible cases of bad auto-correction. External tools writing dates in the future will not + * be shifted unless they are past this threshold (and we cannot identify them as external files based on the metadata). + * On the other hand, historical dates written with Drill wouldn't risk being incorrectly shifted unless they were + * something like 10,000 years in the past. + */ public static final int DATE_CORRUPTION_THRESHOLD = (int) (UTC.getDateTimeMillis(5000, 1, 1, 0) / DateTimeConstants.MILLIS_PER_DAY); - + /** + * The version of drill parquet writer with date values corruption fix --- End diff -- Done > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651436#comment-15651436 ] ASF GitHub Bot commented on DRILL-4980: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r87232002 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -59,19 +59,24 @@ */ public static final long JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH = 2440588; /** - * All old parquet files (which haven't "is.date.correct=true" property in metadata) have - * a corrupt date shift: {@value} days or 2 * {@value #JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH} + * All old parquet files (which haven't "is.date.correct=true" or "parquet-writer.version" properties + * in metadata) have a corrupt date shift: {@value} days or 2 * {@value #JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH} */ public static final long CORRECT_CORRUPT_DATE_SHIFT = 2 * JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH; - // The year 5000 (or 1106685 day from Unix epoch) is chosen as the threshold for auto-detecting date corruption. - // This balances two possible cases of bad auto-correction. External tools writing dates in the future will not - // be shifted unless they are past this threshold (and we cannot identify them as external files based on the metadata). - // On the other hand, historical dates written with Drill wouldn't risk being incorrectly shifted unless they were - // something like 10,000 years in the past. private static final Chronology UTC = org.joda.time.chrono.ISOChronology.getInstanceUTC(); + /** + * The year 5000 (or 1106685 day from Unix epoch) is chosen as the threshold for auto-detecting date corruption. + * This balances two possible cases of bad auto-correction. External tools writing dates in the future will not + * be shifted unless they are past this threshold (and we cannot identify them as external files based on the metadata). + * On the other hand, historical dates written with Drill wouldn't risk being incorrectly shifted unless they were + * something like 10,000 years in the past. + */ public static final int DATE_CORRUPTION_THRESHOLD = (int) (UTC.getDateTimeMillis(5000, 1, 1, 0) / DateTimeConstants.MILLIS_PER_DAY); - + /** + * The version of drill parquet writer with date values corruption fix --- End diff -- Maybe explain this a bit better. Something like: Version 2 (and later) of the Drill Parquet writer uses the date format described (in the Parquet spec? URL?). Prior versions had dates formatted (copy description from above.) > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651437#comment-15651437 ] ASF GitHub Bot commented on DRILL-4980: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r87231425 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -185,7 +185,8 @@ private Metadata(FileSystem fs, ParquetFormatConfig formatConfig) { childFiles.add(file); } } -ParquetTableMetadata_v3 parquetTableMetadata = new ParquetTableMetadata_v3(true); +ParquetTableMetadata_v3 parquetTableMetadata = new ParquetTableMetadata_v3(DrillVersionInfo.getVersion(), +ParquetWriter.WRITER_VERSION); --- End diff -- I'm a bit confused. The writer version applies to the Parquet files which Drill writes. (Or, at least, that was the intention.) Here, we're talking about metadata. There may well be a metadata writer, but that should be a different writer, with a different version. Not sure we want to initialize the metadata object with the current writer version: there seems to be no correlation between the metadata object and the writer version. On the other hand, the metadata can certainly hold the writer version, but it must be the value read from the Parquet file itself; not a value set by the code. Else, we have the difficult problem of making sure that the code-set version number agrees with the actual version number in the file. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651439#comment-15651439 ] ASF GitHub Bot commented on DRILL-4980: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r87232692 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -189,10 +194,14 @@ public static DateCorruptionStatus detectCorruptDates(ParquetMetadata footer, String createdBy = footer.getFileMetaData().getCreatedBy(); String drillVersion = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.DRILL_VERSION_PROPERTY); -String isDateCorrect = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.IS_DATE_CORRECT_PROPERTY); +String writerVersion = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.WRITER_VERSION_PROPERTY); +// This flag can be present in parquet files which were generated with 1.9.0-SNAPSHOT drill version +final String isDateCorrectFlag = "is.date.correct"; --- End diff -- Maybe here you want to special case the "is.date.correct" flag. 1) If the writer version is present, use it. 2) If "is.date.correct" is present, set the writer version to "2". 3) If neither are present, set the writer version to 1. That way we don't have to have (much) extra logic for the "is.date.correct" handling. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651438#comment-15651438 ] ASF GitHub Bot commented on DRILL-4980: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r87232227 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -59,19 +59,24 @@ */ public static final long JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH = 2440588; /** - * All old parquet files (which haven't "is.date.correct=true" property in metadata) have - * a corrupt date shift: {@value} days or 2 * {@value #JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH} + * All old parquet files (which haven't "is.date.correct=true" or "parquet-writer.version" properties + * in metadata) have a corrupt date shift: {@value} days or 2 * {@value #JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH} */ public static final long CORRECT_CORRUPT_DATE_SHIFT = 2 * JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH; - // The year 5000 (or 1106685 day from Unix epoch) is chosen as the threshold for auto-detecting date corruption. - // This balances two possible cases of bad auto-correction. External tools writing dates in the future will not - // be shifted unless they are past this threshold (and we cannot identify them as external files based on the metadata). - // On the other hand, historical dates written with Drill wouldn't risk being incorrectly shifted unless they were - // something like 10,000 years in the past. private static final Chronology UTC = org.joda.time.chrono.ISOChronology.getInstanceUTC(); + /** + * The year 5000 (or 1106685 day from Unix epoch) is chosen as the threshold for auto-detecting date corruption. + * This balances two possible cases of bad auto-correction. External tools writing dates in the future will not + * be shifted unless they are past this threshold (and we cannot identify them as external files based on the metadata). + * On the other hand, historical dates written with Drill wouldn't risk being incorrectly shifted unless they were + * something like 10,000 years in the past. + */ public static final int DATE_CORRUPTION_THRESHOLD = (int) (UTC.getDateTimeMillis(5000, 1, 1, 0) / DateTimeConstants.MILLIS_PER_DAY); - + /** + * The version of drill parquet writer with date values corruption fix + */ + public static final int DRILL_WRITER_VERSION_WITHOUT_CORRUPTION = 2; --- End diff -- Maybe call this DRILL_WRITER_VERSION_STD_DATE_FORMAT The old format was not "corrupted", it just used a date format that was non-standard. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651435#comment-15651435 ] ASF GitHub Bot commented on DRILL-4980: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r87232983 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetWriter.java --- @@ -40,6 +40,16 @@ public class ParquetWriter extends AbstractWriter { static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ParquetWriter.class); +/** Version of Drill's Parquet writer. Increment this version (by 1) any time we make any format change to the file. + * Format changes include + * 1) supporting new data types, + * 2) changes to the format of data fields, + * 3) adding new metadata to the file footer, etc. + * Newer readers must be able to read old files. The Writer version tells the Parquet reader how to interpret fields --- End diff -- This is Javadoc, so needs HTML formatting. For the list: Supporting new data types, ... > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647812#comment-15647812 ] ASF GitHub Bot commented on DRILL-4980: --- Github user vdiravka commented on the issue: https://github.com/apache/drill/pull/644 @paul-rogers Could you please review? I answered on the last comment and rebased the branch with resolving conflicts (ParquetTableMetadata_v3 was added) > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637390#comment-15637390 ] ASF GitHub Bot commented on DRILL-4980: --- Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r86615126 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -927,15 +927,11 @@ public void setMax(Object max) { @JsonProperty List files; @JsonProperty List directories; @JsonProperty String drillVersion; -@JsonProperty boolean isDateCorrect; +@JsonProperty int writerVersion; --- End diff -- I tried to figure out this and found that for reading parquet footer this class isn't used. For that purposes used `ParquetMetadata` from parquet-hadoop-1.8.1-drill-r0.jar. This class used for entire parquet directory with metadata cache file. When it is absent an empty of this class is created. I added some javadoc. Hope it will help. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634976#comment-15634976 ] ASF GitHub Bot commented on DRILL-4980: --- Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r86477598 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -927,15 +927,11 @@ public void setMax(Object max) { @JsonProperty List files; @JsonProperty List directories; @JsonProperty String drillVersion; -@JsonProperty boolean isDateCorrect; +@JsonProperty int writerVersion; --- End diff -- `ParquetTableMetadata_v2` is used mostly when the parquet meta cache file created. For reading the metadata cache file or metadta footer this class is not used except the case, when we read one parquet file or parquet folder, an empty instance of this class is created, but used only columnTypeInfo field. And the drillVersion or writerVersion weren't used. That's why everything worked but it was not right. I added an empty constructor (as I made earlier, actually master version). And added initializing this fields across constructors parameters. I think it is more right and looks like this is what you were talking about. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634974#comment-15634974 ] ASF GitHub Bot commented on DRILL-4980: --- Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r86477805 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java --- @@ -78,7 +78,7 @@ private static final int MAXIMUM_RECORD_COUNT_FOR_CHECK = 1; public static final String DRILL_VERSION_PROPERTY = "drill.version"; - public static final String IS_DATE_CORRECT_PROPERTY = "is.date.correct"; + public static final String WRITER_VERSION_PROPERTY = "parquet-writer.version"; --- End diff -- Done. Also parquet files for unit tests were regenerated as well. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634975#comment-15634975 ] ASF GitHub Bot commented on DRILL-4980: --- Github user vdiravka commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r86478118 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetWriter.java --- @@ -40,6 +40,8 @@ public class ParquetWriter extends AbstractWriter { static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ParquetWriter.class); + public static final int WRITER_VERSION = 2; --- End diff -- I copied this comment into the project. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633406#comment-15633406 ] ASF GitHub Bot commented on DRILL-4980: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r86388727 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetWriter.java --- @@ -40,6 +40,8 @@ public class ParquetWriter extends AbstractWriter { static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ParquetWriter.class); + public static final int WRITER_VERSION = 2; --- End diff -- This deserves an explanation. Something like: Version of Drill's Parquet writer. Increment this version (by 1) any time we make any format change to the file. Format changes include 1) supporting new data types, 2) changes to the format of data fields, 3) adding new metadata to the file footer, etc. Newer readers must be able to read old files. The Writer version tells the Parquet reader how to interpret fields or metadata when that data changes format from one writer version to another. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633400#comment-15633400 ] ASF GitHub Bot commented on DRILL-4980: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r86384226 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -927,15 +927,11 @@ public void setMax(Object max) { @JsonProperty List files; @JsonProperty List directories; @JsonProperty String drillVersion; -@JsonProperty boolean isDateCorrect; +@JsonProperty int writerVersion; --- End diff -- This property is Jackson-serialized. How does Jackson handle older files written without this version? According to this post (http://stackoverflow.com/questions/8320993/jackson-what-happens-if-a-property-is-missing), "Setter methods are only invoked for properties with explicit values." This means that older files without the writerVersion set won't call the function to deserialize the writer version, and the version will default to the newest version. I suspect that either A) I'm misunderstanding what this code does, or B) we have a backward-compatibility issue. Please explain which. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633407#comment-15633407 ] ASF GitHub Bot commented on DRILL-4980: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r86383165 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -927,15 +927,11 @@ public void setMax(Object max) { @JsonProperty List files; @JsonProperty List directories; @JsonProperty String drillVersion; -@JsonProperty boolean isDateCorrect; +@JsonProperty int writerVersion; public ParquetTableMetadata_v2() { - super(); -} - -public ParquetTableMetadata_v2(boolean isDateCorrect) { this.drillVersion = DrillVersionInfo.getVersion(); - this.isDateCorrect = isDateCorrect; + this.writerVersion = ParquetWriter.WRITER_VERSION; --- End diff -- We set the writer version to the current version when we create the metadata. Is this same metadata used for both read and write? If so, we have the potential for a nasty bug. A (new) reader fails to set the writerVersion value from actual file metadata. The value will default to the latest, even if the file itself happens to be older. I wonder if it makes sense to pass the version into the constructor. The Writer passes in the current writer version. The reader must pass in the value found in the file. Or, is this metadata used only for writing, but not reading? If that is true perhaps we can document that in the code somewhere. (I looked but did not anything.) Or, is this metadata cached from scanning actual files? If so, isn't defaulting the writer version simply asking for trouble if someone forgets to set this field based on actual file version? > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633402#comment-15633402 ] ASF GitHub Bot commented on DRILL-4980: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r86384912 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java --- @@ -944,14 +940,16 @@ public ParquetTableMetadata_v2(ParquetTableMetadataBase parquetTable, this.directories = directories; this.columnTypeInfo = ((ParquetTableMetadata_v2) parquetTable).columnTypeInfo; this.drillVersion = DrillVersionInfo.getVersion(); - this.isDateCorrect = true; + this.writerVersion = ParquetWriter.WRITER_VERSION; } public ParquetTableMetadata_v2(List files, List directories, ConcurrentHashMap columnTypeInfo) { this.files = files; this.directories = directories; this.columnTypeInfo = columnTypeInfo; + this.drillVersion = DrillVersionInfo.getVersion(); + this.writerVersion = ParquetWriter.WRITER_VERSION; --- End diff -- Given a set of files and directories, we're assuming the writer version? This is supposed to be the version of the Drill Parquet writer that wrote the actual Parquet files, right? Not the version of the metadata files that hold information about the Parquet files? Or, are we using the same writer version for both purposes? If we need two versions (one for Parquet, another for metadata), then we should introduce a separate Parquet metadata version. (Or, I'm misunderstanding the code...) > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633403#comment-15633403 ] ASF GitHub Bot commented on DRILL-4980: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r86387445 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -189,19 +189,23 @@ public static DateCorruptionStatus detectCorruptDates(ParquetMetadata footer, String createdBy = footer.getFileMetaData().getCreatedBy(); String drillVersion = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.DRILL_VERSION_PROPERTY); -String isDateCorrect = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.IS_DATE_CORRECT_PROPERTY); -if (drillVersion != null) { - return Boolean.valueOf(isDateCorrect) ? DateCorruptionStatus.META_SHOWS_NO_CORRUPTION - : DateCorruptionStatus.META_SHOWS_CORRUPTION; -} else { - // Possibly an old, un-migrated Drill file, check the column statistics to see if min/max values look corrupt - // only applies if there is a date column selected - if (createdBy.equals("parquet-mr")) { -// loop through parquet column metadata to find date columns, check for corrupt values -return checkForCorruptDateValuesInStatistics(footer, columns, autoCorrectCorruptDates); +String writerVersion = footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.WRITER_VERSION_PROPERTY); +// This flag can be present in parquet files which were generated with 1.9.0-SNAPSHOT drill version +final String isDateCorrectFlag = "is.date.correct"; +String isDateCorrect = footer.getFileMetaData().getKeyValueMetaData().get(isDateCorrectFlag); +try { + if (drillVersion != null) { +return (writerVersion != null && Integer.parseInt(writerVersion) >= 2) || Boolean.valueOf(isDateCorrect) +|| isDrillVersionHasCorrectDates(drillVersion) --- End diff -- Isn't the Drill version check redundant? We know that all Drill versions from 1.9.0 onwards will have a Drill Parquet writer version in the file. So, we only need check the writer version, if we have it. If we don't have it, then the only info we have is the Drill version. We might want to explain this logic in a comment. The purpose of adding the writer version is that all future format decisions can be made on the writer version independent of Drill version. For example, suppose we change something in Drill 1.10. Drill 1.10.0-SNAPSHOT will start with writer version 2. Later we'll make a change and 1.10.0-SNAPSHOT will writer files with writer version 3. The Drill version is ambiguous, the writer version is spot on. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633404#comment-15633404 ] ASF GitHub Bot commented on DRILL-4980: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r86386510 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java --- @@ -214,14 +218,24 @@ public static DateCorruptionStatus detectCorruptDates(ParquetMetadata footer, } // written by a tool that wasn't Drill, the dates are not corrupted return DateCorruptionStatus.META_SHOWS_NO_CORRUPTION; -} catch (VersionParser.VersionParseException e) { - // If we couldn't parse "created by" field, check column metadata of date columns - return checkForCorruptDateValuesInStatistics(footer, columns, autoCorrectCorruptDates); } } +} catch (VersionParser.VersionParseException e) { + // If we couldn't parse "created by" or "drill version", check column metadata of date columns + return checkForCorruptDateValuesInStatistics(footer, columns, autoCorrectCorruptDates); } } + public static boolean isDrillVersionHasCorrectDates(String drillVersion) throws VersionParser.VersionParseException { --- End diff -- Might flow better as "drillVersionHasCorrectDates" But, see comment above about whether we need this check. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633401#comment-15633401 ] ASF GitHub Bot commented on DRILL-4980: --- Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/644#discussion_r86388085 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java --- @@ -78,7 +78,7 @@ private static final int MAXIMUM_RECORD_COUNT_FOR_CHECK = 1; public static final String DRILL_VERSION_PROPERTY = "drill.version"; - public static final String IS_DATE_CORRECT_PROPERTY = "is.date.correct"; + public static final String WRITER_VERSION_PROPERTY = "parquet-writer.version"; --- End diff -- We know we are Drill, but others might be confused. Perhaps "drill-writer.version" or "drill.writer-version" (to be consistent with "drill.version".) > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15632363#comment-15632363 ] ASF GitHub Bot commented on DRILL-4980: --- Github user vdiravka commented on the issue: https://github.com/apache/drill/pull/644 @parthchandra @paul-rogers @jaltekruse coud you please review? > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection
[ https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15632333#comment-15632333 ] ASF GitHub Bot commented on DRILL-4980: --- GitHub user vdiravka opened a pull request: https://github.com/apache/drill/pull/644 DRILL-4980: Upgrading of the approach of parquet date correctness status detection - Parquet writer version is added; - Updated the detection method of parquet date correctness. You can merge this pull request into a Git repository by running: $ git pull https://github.com/vdiravka/drill DRILL-4980 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/644.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #644 commit ca447255b1a8137a25b01e051053b8f354c75ad0 Author: Vitalii Diravka Date: 2016-10-26T12:22:06Z DRILL-4980: Upgrading of the approach of parquet date correctness status detection - Parquet writer version is added; - Updated the detection method of parquet date correctness. > Upgrading of the approach of parquet date correctness status detection > -- > > Key: DRILL-4980 > URL: https://issues.apache.org/jira/browse/DRILL-4980 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.8.0 >Reporter: Vitalii Diravka >Assignee: Vitalii Diravka > Fix For: 1.9.0 > > > This jira is an addition for the > [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203]. > The date correctness label for the new generated parquet files should be > upgraded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)