GitHub user vdiravka opened a pull request:
https://github.com/apache/drill/pull/595
DRILL-4203: Parquet File. Date is stored wrongly
Drill was writing non-standard dates into parquet files for all releases
before this commit. The values have been read correctly by Drill, but
external tools like Spark reading the files will see corrupted values for
all dates that have been written by Drill.
This change fixes the Drill parquet writer so that it stores dates in the
format given in the parquet specification.
To maintain compatibility with old files, the parquet reader code has been
updated to check for the old format and automatically shift the corrupted
values into corrected ones.
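As a rough illustration of the reader-side correction described above, the
following is a minimal sketch, not the code from this pull request. It
assumes that the old writer shifted every date by twice the Julian day number
of the Unix epoch (2440588), and that any value past the year 5000 can be
treated as corrupt. Both the shift and the threshold are assumptions that are
not stated in this message, and all names are hypothetical.

```python
from datetime import date, timedelta

UNIX_EPOCH = date(1970, 1, 1)

# Julian day number of 1970-01-01.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588

# Assumption: the old writer added this many days to every date value.
ASSUMED_SHIFT_DAYS = 2 * JULIAN_DAY_OF_UNIX_EPOCH

# Assumption: anything later than 5000-01-01 is treated as a shifted value.
ASSUMED_THRESHOLD_DAYS = (date(5000, 1, 1) - UNIX_EPOCH).days


def days_since_epoch(d: date) -> int:
    """Encode a date as days since the Unix epoch."""
    return (d - UNIX_EPOCH).days


def to_date(days: int) -> date:
    """Decode days since the Unix epoch (only valid for in-range values)."""
    return UNIX_EPOCH + timedelta(days=days)


def correct_if_corrupt(days: int) -> int:
    """Shift a value back if it falls in the implausible far-future range."""
    if days > ASSUMED_THRESHOLD_DAYS:
        return days - ASSUMED_SHIFT_DAYS
    return days


# A value as an old writer might have stored it under the assumed shift.
stored = days_since_epoch(date(2016, 9, 22)) + ASSUMED_SHIFT_DAYS
print(to_date(correct_if_corrupt(stored)))  # 2016-09-22
```

Values that are already in the specification-compliant range pass through
unchanged, which is why the same function can be applied to both old and new
files in this sketch.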
The test cases included here should ensure that all files produced by
historical versions of Drill will continue to return the same values
they had in previous releases. For compatibility with external tools, any
old files with corrupted dates can be re-written using the CREATE TABLE AS
command, as the writer will now only produce the specification-compliant
values, even when reading out of older corrupt files. One new extra field,
"is.date.correct = true", will be included in the parquet meta information
of files and in Drill metadata cache files.
While the old behavior was a consistent shift into a range that is unlikely
to be used in a modern database (over 10,000 years in the future), these are
still valid date values. In the case where these may have been written
into files intentionally, and we cannot be certain from the metadata if
Drill produced the files, an option is included to turn off the
auto-correction. Use of this option is assumed to be extremely unlikely, but
it is included for completeness.
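One way to picture how the new metadata field and the opt-out option could
interact is the small sketch below. The key name "is.date.correct" comes from
the description above. The function name, the `auto_correct` parameter, and
the exact precedence between the two are assumptions made for illustration,
not the logic in this pull request.

```python
from typing import Mapping

# Key name quoted in the pull request description.
DATE_CORRECT_KEY = "is.date.correct"


def should_check_for_old_format(
    file_metadata: Mapping[str, str], auto_correct: bool = True
) -> bool:
    """Decide whether a reader should look for shifted date values.

    Hypothetical precedence: a file marked as correct is never touched;
    otherwise the (hypothetical) auto_correct flag decides.
    """
    if file_metadata.get(DATE_CORRECT_KEY, "").lower() == "true":
        # Written by a fixed writer, so the dates are already compliant.
        return False
    # No marker: the file may be old, unless the user has opted out.
    return auto_correct


print(should_check_for_old_format({"is.date.correct": "true"}))  # False
print(should_check_for_old_format({}))                           # True
print(should_check_for_old_format({}, auto_correct=False))       # False
```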
One small fix in the ParquetGroupScan to accommodate changes in master that
changed when metadata is read.
Added new tests for bugs revealed by the regression suite, along with old and
new parquet (binary) files for those tests, and updated the metadata cache
files accordingly.
Removed an unnecessary double conversion of the value with the Julian day.
Added the ability to correct corrupted dates for parquet files that have an
old, second-version metadata cache file as well.
Fixed DrillVersionInfo so that it provides a valid version number even during
the unit tests. This is now a build-time generated class, rather than one
that looks on the classpath for META-INF files. (This pattern for file
generation with parameters passed from the POM files was borrowed from
parquet-mr.)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vdiravka/drill DRILL-4203
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/595.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #595
----
commit 6f816742d773a1696b5329472c2465a79e35140c
Author: Vitalii Diravka <[email protected]>
Date: 2016-09-22T13:44:37Z
DRILL-4203: Parquet File. Date is stored wrongly
----