Hi Peter,

It sounds like you are on the right track: the new option is the quick
short-term solution. The best long-term solution is to generalize Drill's
date/time type, but that would take much more work. (Drill also has a bug
where the treatment of timezones is incorrect, which forces Drill to run in
the UTC time zone -- something that will also require difficult work.)

Given that JDBC works, the problem must be in the web interface, not in
your Parquet implementation. You've solved the problem with a new session
option. The web interface, however, has no sessions: if you set an option
in one call, and do your query in another, Drill will have "forgotten" your
option. Instead, there is a way to attach options to each query. Are you
using that feature?

As I recall, the JSON message to submit a query has an additional field to
hold session options. I do not recall, however, if the web UI added that
feature. Does anyone else know? In the meantime, there are two workarounds.
First, use your favorite JSON request tool to submit a query with the
option set. Second, set your option as a system option so it is available
to all sessions: ALTER SYSTEM SET...
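
Here is a rough sketch of the first workaround in Python. It only builds
the request, so it can be inspected without a running Drill. Several things
are assumptions: the endpoint is taken to be /query.json on the default web
port 8047, the per-query settings are taken to travel in an "options"
object (that is a recollection, not something verified here), the option
name is the one from the proposal below and is not in a stock build, and
the file path and column name are made up.

```python
import json
import urllib.request

# Assumption: Drill's web server listens on its default port, 8047.
DRILL_URL = "http://localhost:8047/query.json"


def build_query_request(sql, options, url=DRILL_URL):
    """Build (but do not send) a POST request carrying per-query options."""
    payload = {
        "queryType": "SQL",
        "query": sql,
        # Assumption: the REST API accepts an "options" object whose entries
        # apply to this query only. Check this against your Drill version.
        "options": options,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request(
    # Hypothetical file path and column name, for illustration only.
    "SELECT * FROM dfs.`/tmp/events.parquet` WHERE ts = 1705914906694751",
    # Option name from the proposal in this thread. The value is sent as a
    # string; whether a JSON boolean is also accepted is not verified here.
    {"store.parquet.reader.timestamp_micros_as_int64": "true"},
)

# To run the query against a live Drill instance:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Because the option rides along with the query, it does not matter that the
web interface keeps no session between calls.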

Thanks,

- Paul

On Mon, Jan 22, 2024 at 1:38 AM Peter Franzen <pe...@myire.org> wrote:

> Hi,
>
> I am using Drill to query Parquet files that have fields of type
> timestamp_micros. By default, Drill truncates those microsecond
> values to milliseconds when reading the Parquet files in order to convert
> them to SQL timestamps.
>
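
To make the loss concrete, here is a small Python sketch of what
truncating a microsecond value to milliseconds throws away. It is only
arithmetic on the sample value used further down in this message, not
Drill's reader code, and the helper name is made up for the illustration.

```python
MICROS_PER_MILLI = 1_000


def micros_to_millis(micros):
    """Truncate a microsecond epoch value to whole milliseconds."""
    return micros // MICROS_PER_MILLI


original_micros = 1705914906694751  # sample value from the query below
millis = micros_to_millis(original_micros)
# Sub-millisecond part that can no longer be recovered after truncation.
lost_micros = original_micros - millis * MICROS_PER_MILLI

print(millis)       # 1705914906694
print(lost_micros)  # 751
```
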
> In some of my use cases I need to read the original microsecond values (as
> 64-bit values, not SQL timestamps) through Drill, but
> this doesn’t seem to be possible (unless I’ve missed something).
>
> I have explored a possible solution to this, and would like to run it by
> some developers more experienced with the Drill code base
> before I create a pull request.
>
> My idea is to add two options similar to
> "store.parquet.reader.int96_as_timestamp" to control whether or not
> microsecond times and timestamps are truncated to milliseconds. These
> options would be added to "org.apache.drill.exec.ExecConstants" and
> "org.apache.drill.exec.server.options.SystemOptionManager", and to
> drill-module.conf:
>
>     store.parquet.reader.time_micros_as_int64: false,
>     store.parquet.reader.timestamp_micros_as_int64: false,
>
> These options would then be used in the same places as
> “store.parquet.reader.int96_as_timestamp”:
>
> org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
> org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
> org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter
>
> to create an int64 reader instead of a time/timestamp reader when the
> corresponding option is set to true.
>
> In addition to this,
> “org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector” must
> be altered to _not_ truncate the min and max
> values for time_micros/timestamp_micros if the corresponding option is
> true. This class doesn’t have a reference to an OptionManager, so
> my guess is that the two new options must be extracted from the
> OptionManager when the ParquetReaderConfig instance is created.
>
> Filtering on microsecond columns would be done using 64-bit values rather
> than TIME/TIMESTAMP values, e.g.
>
> select * from <file> where <timestamp_micros_column> = 1705914906694751;
>
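
A filter like this needs the 64-bit literal for a given instant. A minimal
Python sketch of that conversion follows, assuming the column holds
microseconds since the Unix epoch in UTC. The helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_micros(moment):
    """Return whole microseconds since the Unix epoch for an aware datetime."""
    # Dividing one timedelta by another yields an int, so no float rounding.
    return (moment - EPOCH) // timedelta(microseconds=1)


moment = datetime(2024, 1, 22, 9, 15, 6, 694751, tzinfo=timezone.utc)
literal = to_epoch_micros(moment)
print(literal)  # 1705914906694751
```
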
> I’ve tested the solution outlined above, and it seems to work when using
> sqlline and with the JDBC driver, but not with the web-based interface.
> Any pointers to the relevant code for that would be appreciated.
>
> An alternative solution to the above could be to intercept all reading of
> the Parquet schemas and modify the schema to report the
> microsecond columns as int64 columns, i.e. to completely discard the
> information that the columns contain time/timestamp values.
> This could make parts of the code that use the time/timestamp properties
> of columns in non-obvious ways behave as expected. However, this variant
> would not align with how INT96 timestamps are handled.
>
> Any thoughts on this idea for how to access microsecond values would be
> highly appreciated.
>
> Thanks,
>
> /Peter
>
>
