Hi, I am using Drill to query Parquet files that have fields of type timestamp_micros. By default, Drill truncates those microsecond values to milliseconds when reading the Parquet files in order to convert them to SQL timestamps.
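To make the loss of precision concrete, here is a minimal sketch in plain Java. It is not Drill code: the class and method names are made up for illustration, and it assumes the conversion amounts to an integer division by 1000, which may not match Drill's exact behaviour for pre-epoch (negative) values.

```java
class MicrosTruncationDemo {

    // Hypothetical helper: microseconds since epoch -> milliseconds since epoch,
    // dropping the sub-millisecond part (assumed to be plain integer division).
    static long truncateToMillis(long micros) {
        return micros / 1000L;
    }

    // Hypothetical helper: the microsecond digits that are lost by the truncation.
    static long lostMicros(long micros) {
        return micros % 1000L;
    }

    public static void main(String[] args) {
        long micros = 1705914906694751L; // example timestamp_micros value

        System.out.println("original micros : " + micros);
        System.out.println("as millis       : " + truncateToMillis(micros));
        System.out.println("lost micros     : " + lostMicros(micros));
    }
}
```

Once a value has been reduced to milliseconds, the last three digits cannot be recovered, which is why the raw 64-bit value would have to be exposed instead.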
In some of my use cases I need to read the original microsecond values (as 64-bit values, not SQL timestamps) through Drill, but this doesn’t seem to be possible (unless I’ve missed something). I have explored a possible solution to this, and would like to run it by some developers more experienced with the Drill code base before I create a pull request.

My idea is to add two options similar to "store.parquet.reader.int96_as_timestamp" to control whether or not microsecond times and timestamps are truncated to milliseconds. These options would be added to "org.apache.drill.exec.ExecConstants" and "org.apache.drill.exec.server.options.SystemOptionManager", and to drill-module.conf:

    store.parquet.reader.time_micros_as_int64: false,
    store.parquet.reader.timestamp_micros_as_int64: false,

These options would then be used in the same places as "store.parquet.reader.int96_as_timestamp":

    org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
    org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
    org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter

to create an int64 reader instead of a time/timestamp reader when the corresponding option is set to true.

In addition to this, "org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector" must be altered to _not_ truncate the min and max values for time_micros/timestamp_micros if the corresponding option is true. This class doesn’t have a reference to an OptionManager, so my guess is that the two new options must be extracted from the OptionManager when the ParquetReaderConfig instance is created.

Filtering on microsecond columns would be done using 64-bit values rather than TIME/TIMESTAMP values, e.g.

    select * from <file> where <timestamp_micros_column> = 1705914906694751;

I’ve tested the solution outlined above, and it seems to work when using sqlline and with the JDBC driver, but not with the web-based interface.
Any pointers to the relevant code for that would be appreciated.

An alternative solution to the above could be to intercept all reading of the Parquet schemas and modify the schema to report the microsecond columns as int64 columns, i.e. to completely discard the information that the columns contain time/timestamp values. This could potentially make parts of the code where it is not obvious that the time/timestamp properties of columns are used behave as expected. However, this variant would not align with how INT96 timestamps are handled.

Any thoughts on this idea for how to access microsecond values would be highly appreciated.

Thanks,
/Peter