Peter Franzen created DRILL-8492:
------------------------------------
Summary: Allow Parquet TIME_MICROS and TIMESTAMP_MICROS columns
to be read as 64-bit integer values
Key: DRILL-8492
URL: https://issues.apache.org/jira/browse/DRILL-8492
Project: Apache Drill
Issue Type: Improvement
Components: Storage - Parquet
Affects Versions: 1.21.1
Reporter: Peter Franzen
When reading Parquet columns of type {{time_micros}} and
{{timestamp_micros}}, Drill truncates the microsecond values to
milliseconds in order to convert them to SQL timestamps.
It is currently not possible to read the original microsecond values (as 64-bit
values, not SQL timestamps) through Drill.
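The precision loss described above can be illustrated with a small Java sketch. It does not reproduce Drill's reader code: the class and method names are hypothetical, and it only assumes that the truncation amounts to an integer division by 1000.

```java
public class MicrosTruncation {
    // Assumed model of the truncation: integer division drops the
    // sub-millisecond digits of the stored microsecond value.
    static long microsToMillis(long micros) {
        return Math.floorDiv(micros, 1000L);
    }

    // The microsecond remainder that can no longer be recovered afterwards.
    static long lostMicros(long micros) {
        return Math.floorMod(micros, 1000L);
    }

    public static void main(String[] args) {
        long micros = 1705914906694751L; // value used in the filter example of this issue
        System.out.println(microsToMillis(micros)); // prints 1705914906694
        System.out.println(lostMicros(micros));     // prints 751
    }
}
```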
One way to make the original 64-bit values readable is to add two options,
similar to "store.parquet.reader.int96_as_timestamp", that control whether
microsecond times and timestamps are truncated to millisecond timestamps or
read as non-truncated 64-bit values.
These options would be added to {{org.apache.drill.exec.ExecConstants}} and
{{org.apache.drill.exec.server.options.SystemOptionManager}}.
They would also be added to "drill-module.conf":
{{ store.parquet.reader.time_micros_as_int64: false,}}
{{ store.parquet.reader.timestamp_micros_as_int64: false,}}
These options would then be used in the same places as
{{store.parquet.reader.int96_as_timestamp}}:
* org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
* org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
* org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter
to create an int64 reader instead of a time/timestamp reader when the
corresponding option is set to true.
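The intended selection logic could look roughly like the following sketch. It is not taken from the classes listed above: the enums, the class name and the method signature are hypothetical stand-ins, and only the two option semantics come from this proposal.

```java
public class MicrosReaderChoice {
    // Hypothetical stand-ins for the Parquet logical type and the kind of reader to create.
    enum LogicalType { TIME_MICROS, TIMESTAMP_MICROS, OTHER }
    enum ReaderKind { TIME, TIMESTAMP, INT64, DEFAULT }

    // Picks a plain int64 reader for a microsecond column only when the
    // option that corresponds to that column type is enabled.
    static ReaderKind choose(LogicalType type,
                             boolean timeMicrosAsInt64,
                             boolean timestampMicrosAsInt64) {
        switch (type) {
            case TIME_MICROS:
                return timeMicrosAsInt64 ? ReaderKind.INT64 : ReaderKind.TIME;
            case TIMESTAMP_MICROS:
                return timestampMicrosAsInt64 ? ReaderKind.INT64 : ReaderKind.TIMESTAMP;
            default:
                return ReaderKind.DEFAULT;
        }
    }

    public static void main(String[] args) {
        // Both options off reproduces the current behaviour.
        System.out.println(choose(LogicalType.TIMESTAMP_MICROS, false, false));
        // Enabling only the timestamp option leaves TIME_MICROS columns untouched.
        System.out.println(choose(LogicalType.TIMESTAMP_MICROS, false, true));
        System.out.println(choose(LogicalType.TIME_MICROS, false, true));
    }
}
```

Keeping the two flags independent means a query can read raw timestamps while still getting SQL TIME values for time columns, or the other way around.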
In addition to this,
{{org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector}} must be
altered to _not_ truncate the min and max values for
time_micros/timestamp_micros if the corresponding option is true. This class
doesn’t have a reference to an {{OptionManager}}, so the two new options
must be extracted from the {{OptionManager}} when the {{ParquetReaderConfig}}
instance is created.
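A minimal sketch of that idea follows, assuming the option values can be copied once into an immutable config object. A plain Map stands in for the {{OptionManager}}, and the class and method names are hypothetical; this is not the actual {{ParquetReaderConfig}} API.

```java
import java.util.Map;

public class MicrosReaderConfigSketch {
    static final String TIME_KEY = "store.parquet.reader.time_micros_as_int64";
    static final String TIMESTAMP_KEY = "store.parquet.reader.timestamp_micros_as_int64";

    final boolean timeMicrosAsInt64;
    final boolean timestampMicrosAsInt64;

    MicrosReaderConfigSketch(boolean timeMicrosAsInt64, boolean timestampMicrosAsInt64) {
        this.timeMicrosAsInt64 = timeMicrosAsInt64;
        this.timestampMicrosAsInt64 = timestampMicrosAsInt64;
    }

    // Copies the two flags out of the option source once, so that later code
    // (such as a metadata collector) needs only the config object.
    static MicrosReaderConfigSketch fromOptions(Map<String, Boolean> options) {
        return new MicrosReaderConfigSketch(
            options.getOrDefault(TIME_KEY, false),
            options.getOrDefault(TIMESTAMP_KEY, false));
    }

    // Min/max statistic for a timestamp_micros column: kept as-is when the
    // option is on, truncated to milliseconds (assumed: division by 1000) otherwise.
    long timestampStat(long micros) {
        return timestampMicrosAsInt64 ? micros : Math.floorDiv(micros, 1000L);
    }

    public static void main(String[] args) {
        MicrosReaderConfigSketch on = fromOptions(Map.of(TIMESTAMP_KEY, true));
        MicrosReaderConfigSketch off = fromOptions(Map.of());
        System.out.println(on.timestampStat(1705914906694751L));  // prints 1705914906694751
        System.out.println(off.timestampStat(1705914906694751L)); // prints 1705914906694
    }
}
```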
Filtering on microsecond columns would be done using 64-bit values rather than
TIME/TIMESTAMP values when the new options are true, e.g.
{{SELECT * FROM <file> WHERE <timestamp_micros_column> = 1705914906694751;}}
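A client that wants to filter this way needs the microsecond epoch value for a given point in time. The following Java sketch shows one way to compute it with java.time; the class and method names are hypothetical, and it assumes the column stores microseconds since the Unix epoch in UTC.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class MicrosLiteral {
    // Microseconds elapsed between the Unix epoch and the given instant.
    static long toEpochMicros(Instant instant) {
        return ChronoUnit.MICROS.between(Instant.EPOCH, instant);
    }

    // Inverse conversion, useful for displaying values read as int64.
    static Instant fromEpochMicros(long micros) {
        return Instant.EPOCH.plus(micros, ChronoUnit.MICROS);
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2024-01-22T09:15:06.694751Z");
        System.out.println(toEpochMicros(ts));                    // prints 1705914906694751
        System.out.println(fromEpochMicros(1705914906694751L));   // prints 2024-01-22T09:15:06.694751Z
    }
}
```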
--
This message was sent by Atlassian Jira
(v8.20.10#820010)