Peter Franzen created DRILL-8492:
------------------------------------

             Summary: Allow Parquet TIME_MICROS and TIMESTAMP_MICROS columns 
to be read as 64-bit integer values
                 Key: DRILL-8492
                 URL: https://issues.apache.org/jira/browse/DRILL-8492
             Project: Apache Drill
          Issue Type: Improvement
          Components: Storage - Parquet
    Affects Versions: 1.21.1
            Reporter: Peter Franzen


When reading Parquet columns of type {{time_micros}} and 
{{timestamp_micros}}, Drill truncates the microsecond values to 
milliseconds in order to convert them to SQL timestamps.
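
To make the loss concrete, here is a small Java illustration. It assumes the truncation amounts to an integer division by 1000; the class and method names are made up for this example and are not Drill code.

```java
// Illustration only: assumes the truncation is a plain integer division by 1000.
// MicrosTruncationExample and truncateToMillis are hypothetical names, not Drill code.
final class MicrosTruncationExample {

    /** Drops the sub-millisecond part of a microsecond value. */
    static long truncateToMillis(long micros) {
        return micros / 1000L;
    }

    public static void main(String[] args) {
        long micros = 1705914906694751L;           // value stored in the Parquet file
        long millis = truncateToMillis(micros);    // what a millisecond timestamp can hold
        long lostMicros = micros - millis * 1000L; // part that cannot be recovered
        System.out.println(millis);     // 1705914906694
        System.out.println(lostMicros); // 751
    }
}
```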

It is currently not possible to read the original microsecond values (as 64-bit 
values, not SQL timestamps) through Drill.

One way to allow reading the original 64-bit values is to add two options, 
similar to {{store.parquet.reader.int96_as_timestamp}}, that control whether 
microsecond times and timestamps are truncated to millisecond timestamps or 
read as non-truncated 64-bit values.

These options would be added to {{org.apache.drill.exec.ExecConstants}} and
{{org.apache.drill.exec.server.options.SystemOptionManager}}.

They would also be added to "drill-module.conf":

{{   store.parquet.reader.time_micros_as_int64: false,}}
{{   store.parquet.reader.timestamp_micros_as_int64: false,}}

These options would then be used in the same places as 
{{store.parquet.reader.int96_as_timestamp}}:


 * org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
 * org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
 * org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter



to create an int64 reader instead of a time/timestamp reader when the 
corresponding option is set to true.
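
The intended selection logic can be sketched roughly as follows. This is a simplified model rather than the actual code in the classes listed above: the enums and the chooseReader method are hypothetical, and the two boolean parameters stand in for the proposed options.

```java
// Simplified model of the proposed reader selection. All names here are
// hypothetical; the real logic would live in the Drill classes listed above.
final class MicrosReaderChoiceExample {

    /** Logical annotation on a Parquet INT64 column (NONE = plain INT64). */
    enum Int64Annotation { TIME_MICROS, TIMESTAMP_MICROS, NONE }

    /** Kind of column reader that would be created. */
    enum ReaderKind { TIME, TIMESTAMP, INT64 }

    static ReaderKind chooseReader(Int64Annotation annotation,
                                   boolean timeMicrosAsInt64,
                                   boolean timestampMicrosAsInt64) {
        switch (annotation) {
            case TIME_MICROS:
                // Option set: keep the raw 64-bit value; otherwise truncate to TIME.
                return timeMicrosAsInt64 ? ReaderKind.INT64 : ReaderKind.TIME;
            case TIMESTAMP_MICROS:
                return timestampMicrosAsInt64 ? ReaderKind.INT64 : ReaderKind.TIMESTAMP;
            default:
                // Unannotated INT64 columns are unaffected by the options.
                return ReaderKind.INT64;
        }
    }

    public static void main(String[] args) {
        System.out.println(chooseReader(Int64Annotation.TIMESTAMP_MICROS, false, true));
        System.out.println(chooseReader(Int64Annotation.TIME_MICROS, false, true));
    }
}
```

With both options left at their proposed default of false, the sketch returns the time/timestamp readers, which matches the current behavior described above.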

In addition to this, 
{{org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector}} must be 
altered to _not_ truncate the min and max values for 
time_micros/timestamp_micros if the corresponding option is true. This class 
doesn’t have a reference to an {{OptionManager}}, so the two new options 
must be extracted from the {{OptionManager}} when the {{ParquetReaderConfig}} 
instance is created.
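
A rough sketch of the intended min/max handling, again as a simplified model: MicrosFlags is a hypothetical holder for the two values that would be copied out of the option manager when the reader configuration is built, and the truncation is assumed to be an integer division by 1000.

```java
// Simplified model of min/max handling in the metadata collector.
// MicrosFlags and the two methods are hypothetical names, not Drill code.
final class MicrosStatisticsExample {

    /** Stand-in for the two option values captured when the reader config is created. */
    static final class MicrosFlags {
        final boolean timeMicrosAsInt64;
        final boolean timestampMicrosAsInt64;

        MicrosFlags(boolean timeMicrosAsInt64, boolean timestampMicrosAsInt64) {
            this.timeMicrosAsInt64 = timeMicrosAsInt64;
            this.timestampMicrosAsInt64 = timestampMicrosAsInt64;
        }
    }

    /** Min or max of a time_micros column as it would be stored in the metadata. */
    static long timeStatistic(long rawMicros, MicrosFlags flags) {
        return flags.timeMicrosAsInt64 ? rawMicros : rawMicros / 1000L;
    }

    /** Min or max of a timestamp_micros column as it would be stored in the metadata. */
    static long timestampStatistic(long rawMicros, MicrosFlags flags) {
        return flags.timestampMicrosAsInt64 ? rawMicros : rawMicros / 1000L;
    }

    public static void main(String[] args) {
        MicrosFlags flags = new MicrosFlags(false, true);
        System.out.println(timestampStatistic(1705914906694751L, flags)); // kept as microseconds
        System.out.println(timeStatistic(33306694751L, flags));           // truncated to milliseconds
    }
}
```

Keeping the statistics in the same unit as the values the reader returns matters for filter pruning: a microsecond literal compared against millisecond min/max values would never fall inside the range.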

Filtering on microsecond columns would be done using 64-bit values rather than 
TIME/TIMESTAMP values when the new options are true, e.g.

{{SELECT * FROM <file> WHERE <timestamp_micros_column> = 1705914906694751;}}
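
A client that wants to filter this way needs the literal as microseconds since the Unix epoch. One possible way to compute it in Java is shown below; the class and method names are made up for this example, and the instant used is the UTC reading of the value in the query above.

```java
import java.time.Instant;

// Hypothetical helper for building a microsecond literal; not part of Drill.
final class EpochMicrosExample {

    /** Converts an Instant to microseconds since the Unix epoch, failing on overflow. */
    static long toEpochMicros(Instant instant) {
        long wholeSeconds = Math.multiplyExact(instant.getEpochSecond(), 1_000_000L);
        return Math.addExact(wholeSeconds, instant.getNano() / 1_000L);
    }

    public static void main(String[] args) {
        Instant instant = Instant.parse("2024-01-22T09:15:06.694751Z");
        System.out.println(toEpochMicros(instant)); // 1705914906694751
    }
}
```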



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
