[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2017-03-29 Thread Vitalii Diravka (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946915#comment-15946915
 ] 

Vitalii Diravka commented on DRILL-4980:


[~rkins] No, it doesn't have any functional impact. It is just code refactoring 
and redesign of an approach of parquet date correctness status detection.

> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.9.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.10.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2017-03-28 Thread Rahul Challapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946260#comment-15946260
 ] 

Rahul Challapalli commented on DRILL-4980:
--

[~vitalii] Does this have any functional impact? Or can I come up with a 
query/data combination to reproduce the issue? I am asking as I want to 
automate the testcase in our functional regression runs

> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.9.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.10.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2017-03-28 Thread Kunal Khatua (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946187#comment-15946187
 ] 

Kunal Khatua commented on DRILL-4980:
-

[~rkins] Assigning this to you as you have reviewed DRILL-4203 as well.

> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.9.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.10.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691681#comment-15691681
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/644


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Parth Chandra
> Fix For: Future
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691655#comment-15691655
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/644
  
+1


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Parth Chandra
> Fix For: Future
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15686374#comment-15686374
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r89088416
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -189,10 +195,16 @@ public static DateCorruptionStatus 
detectCorruptDates(ParquetMetadata footer,
 
 String createdBy = footer.getFileMetaData().getCreatedBy();
 String drillVersion = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.DRILL_VERSION_PROPERTY);
-String isDateCorrect = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.IS_DATE_CORRECT_PROPERTY);
+String stringWriterVersion = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.WRITER_VERSION_PROPERTY);
+// This flag can be present in parquet files which were generated with 
1.9.0-SNAPSHOT drill version.
+// If this flag is present it means that the version of the drill 
parquet writer is 2
+final String isDateCorrectFlag = "is.date.correct";
+String isDateCorrect = 
footer.getFileMetaData().getKeyValueMetaData().get(isDateCorrectFlag);
 if (drillVersion != null) {
-  return Boolean.valueOf(isDateCorrect) ? 
DateCorruptionStatus.META_SHOWS_NO_CORRUPTION
-  : DateCorruptionStatus.META_SHOWS_CORRUPTION;
+  int writerVersion = (stringWriterVersion != null) ? 
Integer.parseInt(stringWriterVersion)
--- End diff --

Agree. It makes sense. 
I even found one redundant check in the return statement.
Changes in a new commit.


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Parth Chandra
> Fix For: Future
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15685426#comment-15685426
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r89031035
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -189,10 +195,16 @@ public static DateCorruptionStatus 
detectCorruptDates(ParquetMetadata footer,
 
 String createdBy = footer.getFileMetaData().getCreatedBy();
 String drillVersion = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.DRILL_VERSION_PROPERTY);
-String isDateCorrect = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.IS_DATE_CORRECT_PROPERTY);
+String stringWriterVersion = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.WRITER_VERSION_PROPERTY);
+// This flag can be present in parquet files which were generated with 
1.9.0-SNAPSHOT drill version.
+// If this flag is present it means that the version of the drill 
parquet writer is 2
+final String isDateCorrectFlag = "is.date.correct";
+String isDateCorrect = 
footer.getFileMetaData().getKeyValueMetaData().get(isDateCorrectFlag);
 if (drillVersion != null) {
-  return Boolean.valueOf(isDateCorrect) ? 
DateCorruptionStatus.META_SHOWS_NO_CORRUPTION
-  : DateCorruptionStatus.META_SHOWS_CORRUPTION;
+  int writerVersion = (stringWriterVersion != null) ? 
Integer.parseInt(stringWriterVersion)
--- End diff --

Use if-then-else to make this a little easier to read?


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Parth Chandra
> Fix For: Future
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654351#comment-15654351
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user vdiravka commented on the issue:

https://github.com/apache/drill/pull/644
  
@parthchandra Could you please review this PR?


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654323#comment-15654323
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r87416017
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -185,7 +185,8 @@ private Metadata(FileSystem fs, ParquetFormatConfig 
formatConfig) {
 childFiles.add(file);
   }
 }
-ParquetTableMetadata_v3 parquetTableMetadata = new 
ParquetTableMetadata_v3(true);
+ParquetTableMetadata_v3 parquetTableMetadata = new 
ParquetTableMetadata_v3(DrillVersionInfo.getVersion(),
+ParquetWriter.WRITER_VERSION);
--- End diff --

`is.date.correct` or `parquet-writer.version` were needed in metadata cache 
file for quick detection of date values correctness. Otherwise need to check 
`files.rowGroups.columns.mxValue` values from this cache file. 
But thought a little, I've understood that due to new added 
`ParquetTableMetadata_v3` we can check:
If version of parquet metadata cache file is 3, the date values are 
definitely correct. Otherwise (when parquet metadata cache file was generated 
earlier) need to check date values from this file. 
So `writerVersion` is redundant in the `ParquetTableMetadataBase` now. I 
deleted it. Please approve does it make sense?


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654274#comment-15654274
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r87411842
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -59,19 +59,24 @@
*/
   public static final long JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH = 2440588;
   /**
-   * All old parquet files (which haven't "is.date.correct=true" property 
in metadata) have
-   * a corrupt date shift: {@value} days or 2 * {@value 
#JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH}
+   * All old parquet files (which haven't "is.date.correct=true" or 
"parquet-writer.version" properties
+   * in metadata) have a corrupt date shift: {@value} days or 2 * {@value 
#JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH}
*/
   public static final long CORRECT_CORRUPT_DATE_SHIFT = 2 * 
JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH;
-  // The year 5000 (or 1106685 day from Unix epoch) is chosen as the 
threshold for auto-detecting date corruption.
-  // This balances two possible cases of bad auto-correction. External 
tools writing dates in the future will not
-  // be shifted unless they are past this threshold (and we cannot 
identify them as external files based on the metadata).
-  // On the other hand, historical dates written with Drill wouldn't risk 
being incorrectly shifted unless they were
-  // something like 10,000 years in the past.
   private static final Chronology UTC = 
org.joda.time.chrono.ISOChronology.getInstanceUTC();
+  /**
+   * The year 5000 (or 1106685 day from Unix epoch) is chosen as the 
threshold for auto-detecting date corruption.
+   * This balances two possible cases of bad auto-correction. External 
tools writing dates in the future will not
+   * be shifted unless they are past this threshold (and we cannot 
identify them as external files based on the metadata).
+   * On the other hand, historical dates written with Drill wouldn't risk 
being incorrectly shifted unless they were
+   * something like 10,000 years in the past.
+   */
   public static final int DATE_CORRUPTION_THRESHOLD =
   (int) (UTC.getDateTimeMillis(5000, 1, 1, 0) / 
DateTimeConstants.MILLIS_PER_DAY);
-
+  /**
+   * The version of drill parquet writer with date values corruption fix
--- End diff --

Done


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651436#comment-15651436
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r87232002
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -59,19 +59,24 @@
*/
   public static final long JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH = 2440588;
   /**
-   * All old parquet files (which haven't "is.date.correct=true" property 
in metadata) have
-   * a corrupt date shift: {@value} days or 2 * {@value 
#JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH}
+   * All old parquet files (which haven't "is.date.correct=true" or 
"parquet-writer.version" properties
+   * in metadata) have a corrupt date shift: {@value} days or 2 * {@value 
#JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH}
*/
   public static final long CORRECT_CORRUPT_DATE_SHIFT = 2 * 
JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH;
-  // The year 5000 (or 1106685 day from Unix epoch) is chosen as the 
threshold for auto-detecting date corruption.
-  // This balances two possible cases of bad auto-correction. External 
tools writing dates in the future will not
-  // be shifted unless they are past this threshold (and we cannot 
identify them as external files based on the metadata).
-  // On the other hand, historical dates written with Drill wouldn't risk 
being incorrectly shifted unless they were
-  // something like 10,000 years in the past.
   private static final Chronology UTC = 
org.joda.time.chrono.ISOChronology.getInstanceUTC();
+  /**
+   * The year 5000 (or 1106685 day from Unix epoch) is chosen as the 
threshold for auto-detecting date corruption.
+   * This balances two possible cases of bad auto-correction. External 
tools writing dates in the future will not
+   * be shifted unless they are past this threshold (and we cannot 
identify them as external files based on the metadata).
+   * On the other hand, historical dates written with Drill wouldn't risk 
being incorrectly shifted unless they were
+   * something like 10,000 years in the past.
+   */
   public static final int DATE_CORRUPTION_THRESHOLD =
   (int) (UTC.getDateTimeMillis(5000, 1, 1, 0) / 
DateTimeConstants.MILLIS_PER_DAY);
-
+  /**
+   * The version of drill parquet writer with date values corruption fix
--- End diff --

Maybe explain this a bit better. Something like:

Version 2 (and later) of the Drill Parquet writer uses the date format 
described (in the Parquet spec? URL?). Prior versions had dates formatted (copy 
description from above.)


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651437#comment-15651437
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r87231425
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -185,7 +185,8 @@ private Metadata(FileSystem fs, ParquetFormatConfig 
formatConfig) {
 childFiles.add(file);
   }
 }
-ParquetTableMetadata_v3 parquetTableMetadata = new 
ParquetTableMetadata_v3(true);
+ParquetTableMetadata_v3 parquetTableMetadata = new 
ParquetTableMetadata_v3(DrillVersionInfo.getVersion(),
+ParquetWriter.WRITER_VERSION);
--- End diff --

I'm a bit confused. The writer version applies to the Parquet files which 
Drill writes. (Or, at least, that was the intention.)

Here, we're talking about metadata. There may well be a metadata writer, 
but that should be a different writer, with a different version.

Not sure we want to initialize the metadata object with the current writer 
version: there seems to be no correlation between the metadata object and the 
writer version.

On the other hand, the metadata can certainly hold the writer version, but 
it must be the value read from the Parquet file itself; not a value set by the 
code. Else, we have the difficult problem of making sure that the code-set 
version number agrees with the actual version number in the file.


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651439#comment-15651439
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r87232692
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -189,10 +194,14 @@ public static DateCorruptionStatus 
detectCorruptDates(ParquetMetadata footer,
 
 String createdBy = footer.getFileMetaData().getCreatedBy();
 String drillVersion = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.DRILL_VERSION_PROPERTY);
-String isDateCorrect = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.IS_DATE_CORRECT_PROPERTY);
+String writerVersion = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.WRITER_VERSION_PROPERTY);
+// This flag can be present in parquet files which were generated with 
1.9.0-SNAPSHOT drill version
+final String isDateCorrectFlag = "is.date.correct";
--- End diff --

Maybe here you want to special case the "is.date.correct" flag.

1) If the writer version is present, use it.
2) If "is.date.correct" is present, set the writer version to "2".
3) If neither are present, set the writer version to 1.

That way we don't have to have (much) extra logic for the "is.date.correct" 
handling.


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651438#comment-15651438
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r87232227
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -59,19 +59,24 @@
*/
   public static final long JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH = 2440588;
   /**
-   * All old parquet files (which haven't "is.date.correct=true" property 
in metadata) have
-   * a corrupt date shift: {@value} days or 2 * {@value 
#JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH}
+   * All old parquet files (which haven't "is.date.correct=true" or 
"parquet-writer.version" properties
+   * in metadata) have a corrupt date shift: {@value} days or 2 * {@value 
#JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH}
*/
   public static final long CORRECT_CORRUPT_DATE_SHIFT = 2 * 
JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH;
-  // The year 5000 (or 1106685 day from Unix epoch) is chosen as the 
threshold for auto-detecting date corruption.
-  // This balances two possible cases of bad auto-correction. External 
tools writing dates in the future will not
-  // be shifted unless they are past this threshold (and we cannot 
identify them as external files based on the metadata).
-  // On the other hand, historical dates written with Drill wouldn't risk 
being incorrectly shifted unless they were
-  // something like 10,000 years in the past.
   private static final Chronology UTC = 
org.joda.time.chrono.ISOChronology.getInstanceUTC();
+  /**
+   * The year 5000 (or 1106685 day from Unix epoch) is chosen as the 
threshold for auto-detecting date corruption.
+   * This balances two possible cases of bad auto-correction. External 
tools writing dates in the future will not
+   * be shifted unless they are past this threshold (and we cannot 
identify them as external files based on the metadata).
+   * On the other hand, historical dates written with Drill wouldn't risk 
being incorrectly shifted unless they were
+   * something like 10,000 years in the past.
+   */
   public static final int DATE_CORRUPTION_THRESHOLD =
   (int) (UTC.getDateTimeMillis(5000, 1, 1, 0) / 
DateTimeConstants.MILLIS_PER_DAY);
-
+  /**
+   * The version of drill parquet writer with date values corruption fix
+   */
+  public static final int DRILL_WRITER_VERSION_WITHOUT_CORRUPTION = 2;
--- End diff --

Maybe call this DRILL_WRITER_VERSION_STD_DATE_FORMAT

The old format was not "corrupted", it just used a date format that was 
non-standard.


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651435#comment-15651435
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r87232983
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetWriter.java
 ---
@@ -40,6 +40,16 @@
 public class ParquetWriter extends AbstractWriter {
   static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(ParquetWriter.class);
 
+/** Version of Drill's Parquet writer. Increment this version (by 1) any 
time we make any format change to the file.
+ * Format changes include
+ * 1) supporting new data types,
+ * 2) changes to the format of data fields,
+ * 3) adding new metadata to the file footer, etc.
+ * Newer readers must be able to read old files. The Writer version tells 
the Parquet reader how to interpret fields
--- End diff --

This is Javadoc, so needs HTML formatting. For the list:


Supporting new data types,
...



> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647812#comment-15647812
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user vdiravka commented on the issue:

https://github.com/apache/drill/pull/644
  
@paul-rogers Could you please review? I answered on the last comment and 
rebased the branch with resolving conflicts (ParquetTableMetadata_v3 was added)


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637390#comment-15637390
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86615126
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -927,15 +927,11 @@ public void setMax(Object max) {
 @JsonProperty List files;
 @JsonProperty List directories;
 @JsonProperty String drillVersion;
-@JsonProperty boolean isDateCorrect;
+@JsonProperty int writerVersion;
--- End diff --

I tried to figure out this and found that for reading parquet footer this 
class isn't used. For that purposes used `ParquetMetadata` from 
parquet-hadoop-1.8.1-drill-r0.jar.
This class used for entire parquet directory with metadata cache file. When 
it is absent an empty of this class is created. 
I added some javadoc. Hope it will help.


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634976#comment-15634976
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86477598
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -927,15 +927,11 @@ public void setMax(Object max) {
 @JsonProperty List files;
 @JsonProperty List directories;
 @JsonProperty String drillVersion;
-@JsonProperty boolean isDateCorrect;
+@JsonProperty int writerVersion;
--- End diff --

`ParquetTableMetadata_v2` is used mostly when the parquet meta cache file 
created. 
For reading the metadata cache file or metadta footer this class is not 
used except the case, when we read one parquet file or parquet folder, an empty 
instance of this class is created, but used only columnTypeInfo field. And the 
drillVersion or writerVersion weren't used. That's why everything worked but it 
was not right. I added an empty constructor (as I made earlier, actually master 
version). 
And added initializing this fields across constructors parameters. I think 
it is more right and looks like this is what you were talking about.


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634974#comment-15634974
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86477805
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
 ---
@@ -78,7 +78,7 @@
   private static final int MAXIMUM_RECORD_COUNT_FOR_CHECK = 1;
 
   public static final String DRILL_VERSION_PROPERTY = "drill.version";
-  public static final String IS_DATE_CORRECT_PROPERTY = "is.date.correct";
+  public static final String WRITER_VERSION_PROPERTY = 
"parquet-writer.version";
--- End diff --

Done. Also parquet files for unit tests were regenerated as well.


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634975#comment-15634975
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user vdiravka commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86478118
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetWriter.java
 ---
@@ -40,6 +40,8 @@
 public class ParquetWriter extends AbstractWriter {
   static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(ParquetWriter.class);
 
+  public static final int WRITER_VERSION = 2;
--- End diff --

I copied this comment into the project.


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633406#comment-15633406
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86388727
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetWriter.java
 ---
@@ -40,6 +40,8 @@
 public class ParquetWriter extends AbstractWriter {
   static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(ParquetWriter.class);
 
+  public static final int WRITER_VERSION = 2;
--- End diff --

This deserves an explanation. Something like:

Version of Drill's Parquet writer. Increment this version (by 1) any time 
we make any format change to the file. Format changes include 1) supporting new 
data types, 2) changes to the format of data fields, 3) adding new metadata to 
the file footer, etc.

Newer readers must be able to read old files. The Writer version tells the 
Parquet reader how to interpret fields or metadata when that data changes 
format from one writer version to another.


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633400#comment-15633400
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86384226
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -927,15 +927,11 @@ public void setMax(Object max) {
 @JsonProperty List files;
 @JsonProperty List directories;
 @JsonProperty String drillVersion;
-@JsonProperty boolean isDateCorrect;
+@JsonProperty int writerVersion;
--- End diff --

This property is Jackson-serialized. How does Jackson handle older files 
written without this version? According to this post 
(http://stackoverflow.com/questions/8320993/jackson-what-happens-if-a-property-is-missing),
 "Setter methods are only invoked for properties with explicit values." This 
means that older files without the writerVersion set won't call the function to 
deserialize the writer version, and the version will default to the newest 
version. I suspect that either A) I'm misunderstanding what this code does, or 
B) we have a backward-compatibility issue. Please explain which.


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633407#comment-15633407
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86383165
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -927,15 +927,11 @@ public void setMax(Object max) {
 @JsonProperty List files;
 @JsonProperty List directories;
 @JsonProperty String drillVersion;
-@JsonProperty boolean isDateCorrect;
+@JsonProperty int writerVersion;
 
 public ParquetTableMetadata_v2() {
-  super();
-}
-
-public ParquetTableMetadata_v2(boolean isDateCorrect) {
   this.drillVersion = DrillVersionInfo.getVersion();
-  this.isDateCorrect = isDateCorrect;
+  this.writerVersion = ParquetWriter.WRITER_VERSION;
--- End diff --

We set the writer version to the current version when we create the 
metadata. Is this same metadata used for both read and write? If so, we have 
the potential for a nasty bug. A (new) reader fails to set the writerVersion 
value from actual file metadata. The value will default to the latest, even if 
the file itself happens to be older.

I wonder if it makes sense to pass the version into the constructor. The 
Writer passes in the current writer version. The reader must pass in the value 
found in the file.

Or, is this metadata used only for writing, but not reading? If that is 
true perhaps we can document that in the code somewhere. (I looked but did not 
anything.)

Or, is this metadata cached from scanning actual files? If so, isn't 
defaulting the writer version simply asking for trouble if someone forgets to 
set this field based on actual file version?


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633402#comment-15633402
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86384912
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java 
---
@@ -944,14 +940,16 @@ public 
ParquetTableMetadata_v2(ParquetTableMetadataBase parquetTable,
   this.directories = directories;
   this.columnTypeInfo = ((ParquetTableMetadata_v2) 
parquetTable).columnTypeInfo;
   this.drillVersion = DrillVersionInfo.getVersion();
-  this.isDateCorrect = true;
+  this.writerVersion = ParquetWriter.WRITER_VERSION;
 }
 
 public ParquetTableMetadata_v2(List files, 
List directories,
 ConcurrentHashMap columnTypeInfo) {
   this.files = files;
   this.directories = directories;
   this.columnTypeInfo = columnTypeInfo;
+  this.drillVersion = DrillVersionInfo.getVersion();
+  this.writerVersion = ParquetWriter.WRITER_VERSION;
--- End diff --

Given a set of files and directories, we're assuming the writer version? 
This is supposed to be the version of the Drill Parquet writer that wrote the 
actual Parquet files, right? Not the version of the metadata files that hold 
information about the Parquet files? Or, are we using the same writer version 
for both purposes? If we need two versions (one for Parquet, another for 
metadata), then we should introduce a separate Parquet metadata version. (Or, 
I'm misunderstanding the code...)


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633403#comment-15633403
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86387445
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -189,19 +189,23 @@ public static DateCorruptionStatus 
detectCorruptDates(ParquetMetadata footer,
 
 String createdBy = footer.getFileMetaData().getCreatedBy();
 String drillVersion = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.DRILL_VERSION_PROPERTY);
-String isDateCorrect = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.IS_DATE_CORRECT_PROPERTY);
-if (drillVersion != null) {
-  return Boolean.valueOf(isDateCorrect) ? 
DateCorruptionStatus.META_SHOWS_NO_CORRUPTION
-  : DateCorruptionStatus.META_SHOWS_CORRUPTION;
-} else {
-  // Possibly an old, un-migrated Drill file, check the column 
statistics to see if min/max values look corrupt
-  // only applies if there is a date column selected
-  if (createdBy.equals("parquet-mr")) {
-// loop through parquet column metadata to find date columns, 
check for corrupt values
-return checkForCorruptDateValuesInStatistics(footer, columns, 
autoCorrectCorruptDates);
+String writerVersion = 
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.WRITER_VERSION_PROPERTY);
+// This flag can be present in parquet files which were generated with 
1.9.0-SNAPSHOT drill version
+final String isDateCorrectFlag = "is.date.correct";
+String isDateCorrect = 
footer.getFileMetaData().getKeyValueMetaData().get(isDateCorrectFlag);
+try {
+  if (drillVersion != null) {
+return (writerVersion != null && Integer.parseInt(writerVersion) 
>= 2) || Boolean.valueOf(isDateCorrect)
+|| isDrillVersionHasCorrectDates(drillVersion)
--- End diff --

Isn't the Drill version check redundant? We know that all Drill versions 
from 1.9.0 onwards will have a Drill Parquet writer version in the file. So, we 
only need check the writer version, if we have it. If we don't have it, then 
the only info we have is the Drill version.

We might want to explain this logic in a comment.

The purpose of adding the writer version is that all future format 
decisions can be made on the writer version independent of Drill version. For 
example, suppose we change something in Drill 1.10. Drill 1.10.0-SNAPSHOT will 
start with writer version 2. Later we'll make a change and 1.10.0-SNAPSHOT will 
writer files with writer version 3. The Drill version is ambiguous, the writer 
version is spot on.


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633404#comment-15633404
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86386510
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -214,14 +218,24 @@ public static DateCorruptionStatus 
detectCorruptDates(ParquetMetadata footer,
   }
   // written by a tool that wasn't Drill, the dates are not 
corrupted
   return DateCorruptionStatus.META_SHOWS_NO_CORRUPTION;
-} catch (VersionParser.VersionParseException e) {
-  // If we couldn't parse "created by" field, check column 
metadata of date columns
-  return checkForCorruptDateValuesInStatistics(footer, columns, 
autoCorrectCorruptDates);
 }
   }
+} catch (VersionParser.VersionParseException e) {
+  // If we couldn't parse "created by" or "drill version", check 
column metadata of date columns
+  return checkForCorruptDateValuesInStatistics(footer, columns, 
autoCorrectCorruptDates);
 }
   }
 
+  public static boolean isDrillVersionHasCorrectDates(String drillVersion) 
throws VersionParser.VersionParseException {
--- End diff --

Might flow better as "drillVersionHasCorrectDates"

But, see comment above about whether we need this check.


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633401#comment-15633401
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/644#discussion_r86388085
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
 ---
@@ -78,7 +78,7 @@
   private static final int MAXIMUM_RECORD_COUNT_FOR_CHECK = 1;
 
   public static final String DRILL_VERSION_PROPERTY = "drill.version";
-  public static final String IS_DATE_CORRECT_PROPERTY = "is.date.correct";
+  public static final String WRITER_VERSION_PROPERTY = 
"parquet-writer.version";
--- End diff --

We know we are Drill, but others might be confused. Perhaps 
"drill-writer.version" or "drill.writer-version" (to be consistent with 
"drill.version".)


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15632363#comment-15632363
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

Github user vdiravka commented on the issue:

https://github.com/apache/drill/pull/644
  
@parthchandra @paul-rogers @jaltekruse coud you please review?


> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4980) Upgrading of the approach of parquet date correctness status detection

2016-11-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15632333#comment-15632333
 ] 

ASF GitHub Bot commented on DRILL-4980:
---

GitHub user vdiravka opened a pull request:

https://github.com/apache/drill/pull/644

DRILL-4980: Upgrading of the approach of parquet date correctness status 
detection

- Parquet writer version is added;
- Updated the detection method of parquet date correctness.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vdiravka/drill DRILL-4980

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/644.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #644


commit ca447255b1a8137a25b01e051053b8f354c75ad0
Author: Vitalii Diravka 
Date:   2016-10-26T12:22:06Z

DRILL-4980: Upgrading of the approach of parquet date correctness status 
detection
- Parquet writer version is added;
- Updated the detection method of parquet date correctness.




> Upgrading of the approach of parquet date correctness status detection
> --
>
> Key: DRILL-4980
> URL: https://issues.apache.org/jira/browse/DRILL-4980
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
> Fix For: 1.9.0
>
>
> This jira is an addition for the 
> [DRILL-4203|https://issues.apache.org/jira/browse/DRILL-4203].
> The date correctness label for the new generated parquet files should be 
> upgraded. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)