My experience has been that if you've switched on authentication in Drill then the web UI /does/ sustain a session. If not, it doesn't.

On 2024/01/23 09:48, Peter Franzen wrote:
Hi Paul,

Thanks for your comments.

I wasn’t aware that the Web UI doesn’t have sessions; when setting the option
at the system level, the Web UI behaves as expected.

I’ll go ahead and create a pull request within the next few days.

/Peter

On 22 Jan 2024, at 21:40, Paul Rogers <par0...@gmail.com> wrote:

Hi Peter,

It sounds like you are on the right track: the new option is the quick
short-term solution. The best long-term solution is to generalize Drill's
date/time type, but that would take much more work. (Drill also has a bug
where the treatment of timezones is incorrect, which forces Drill to run in
the UTC time zone -- something that will also require difficult work.)

Given that JDBC works, the problem must be in the web interface, not in
your Parquet implementation. You've solved the problem with a new session
option. The web interface, however, has no sessions: if you set an option
in one call, and do your query in another, Drill will have "forgotten" your
option. Instead, there is a way to attach options to each query. Are you
using that feature?

As I recall, the JSON message to submit a query has an additional field to
hold session options. I do not recall, however, if the web UI added that
feature. Does anyone else know? Two workarounds. First, use your favorite
JSON request tool to submit a query with the option set. Second, set your
option as a system option so it is available to all sessions: ALTER SYSTEM
SET...
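
For illustration, a minimal sketch of the first workaround using Java 11's
built-in HttpClient is below. It assumes the default web port 8047, one of the
option names from your original message, and a hypothetical Parquet path; whether
your Drill version accepts an "options" field in the query request is exactly the
open question above, so treat that field as something to verify. The second
workaround is simply running ALTER SYSTEM SET on the option from any client.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class DrillRestQuerySketch {
      public static void main(String[] args) throws Exception {
        // Per-query options ride along with the request itself, so no session is needed.
        // The "options" field is assumed here; verify that your Drill version supports it.
        String body = "{"
            + "\"queryType\": \"SQL\","
            + "\"query\": \"SELECT * FROM dfs.`/data/events.parquet` LIMIT 10\","  // hypothetical path
            + "\"options\": {\"store.parquet.reader.timestamp_micros_as_int64\": true}"
            + "}";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8047/query.json"))  // default Drill web port
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
      }
    }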

Thanks,

- Paul

On Mon, Jan 22, 2024 at 1:38 AM Peter Franzen <pe...@myire.org> wrote:

Hi,

I am using Drill to query Parquet files that have fields of type
timestamp_micros. By default, Drill truncates those microsecond
values to milliseconds when reading the Parquet files in order to convert
them to SQL timestamps.

In some of my use cases I need to read the original microsecond values (as
64-bit values, not SQL timestamps) through Drill, but
this doesn’t seem to be possible (unless I’ve missed something).

I have explored a possible solution to this, and would like to run it by
some developers more experienced with the Drill code base
before I create a pull request.

My idea is to add two options similar to
"store.parquet.reader.int96_as_timestamp" to control whether or not microsecond
times and timestamps are truncated to milliseconds. These options would be
added to "org.apache.drill.exec.ExecConstants" and
"org.apache.drill.exec.server.options.SystemOptionManager", and to
drill-module.conf:

    store.parquet.reader.time_micros_as_int64: false,
    store.parquet.reader.timestamp_micros_as_int64: false,
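
For reference, the declarations in ExecConstants might look roughly like the
following. This is only a sketch mirroring how the existing int96 option is
declared; the validator and description classes (BooleanValidator,
OptionDescription) and their constructor shapes are assumptions to be checked
against the current code.

    // Sketch only: fragment intended for org.apache.drill.exec.ExecConstants,
    // following the pattern of store.parquet.reader.int96_as_timestamp.
    // Class names and constructors are assumptions, not verified signatures.
    public static final String PARQUET_READER_TIME_MICROS_AS_INT64 =
        "store.parquet.reader.time_micros_as_int64";
    public static final OptionValidator PARQUET_READER_TIME_MICROS_AS_INT64_VALIDATOR =
        new BooleanValidator(PARQUET_READER_TIME_MICROS_AS_INT64,
            new OptionDescription("Reads Parquet TIME(MICROS) columns as 64-bit integers "
                + "instead of truncating them to millisecond TIME values."));

    public static final String PARQUET_READER_TIMESTAMP_MICROS_AS_INT64 =
        "store.parquet.reader.timestamp_micros_as_int64";
    public static final OptionValidator PARQUET_READER_TIMESTAMP_MICROS_AS_INT64_VALIDATOR =
        new BooleanValidator(PARQUET_READER_TIMESTAMP_MICROS_AS_INT64,
            new OptionDescription("Reads Parquet TIMESTAMP(MICROS) columns as 64-bit integers "
                + "instead of truncating them to millisecond TIMESTAMP values."));

The two validators would then also be registered in SystemOptionManager, next to
the existing int96 entry.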

These options would then be used in the same places as
“store.parquet.reader.int96_as_timestamp”:

org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory

org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter

to create an int64 reader instead of a time/timestamp reader when the
corresponding option is set to true.
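
To make the intended branch concrete, here is a small self-contained
illustration of the selection logic; the enum and helper names are made up for
the example and are not Drill types.

    // Hypothetical, self-contained illustration of the proposed selection:
    // when the matching option is true, surface the raw 64-bit value as BIGINT
    // instead of converting (and truncating) it to TIME/TIMESTAMP.
    enum ParquetMicrosType { TIME_MICROS, TIMESTAMP_MICROS }
    enum DrillMinorType { TIME, TIMESTAMP, BIGINT }

    final class MicrosTypeSelection {
      static DrillMinorType select(ParquetMicrosType type,
                                   boolean timeMicrosAsInt64,
                                   boolean timestampMicrosAsInt64) {
        switch (type) {
          case TIME_MICROS:
            return timeMicrosAsInt64 ? DrillMinorType.BIGINT : DrillMinorType.TIME;
          case TIMESTAMP_MICROS:
            return timestampMicrosAsInt64 ? DrillMinorType.BIGINT : DrillMinorType.TIMESTAMP;
          default:
            throw new IllegalArgumentException("Unexpected type: " + type);
        }
      }
    }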

In addition to this,
"org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector" must
be altered to _not_ truncate the min and max values for
time_micros/timestamp_micros if the corresponding option is true. This class
doesn’t have a reference to an OptionManager, so my guess is that the two new
options must be extracted from the OptionManager when the ParquetReaderConfig
instance is created.
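
The reason the min/max statistics matter can be shown with plain arithmetic:
if the metadata keeps millisecond values while filter literals are raw
microseconds, range pruning compares incompatible units. A tiny standalone
example (no Drill classes involved, and assuming truncation is a divide by 1000):

    public class MicrosTruncationDemo {
      public static void main(String[] args) {
        long micros = 1_705_914_906_694_751L;  // the literal used in the filter example below
        long millis = micros / 1_000L;         // millisecond truncation
        System.out.println(millis);            // prints 1705914906694
        // If the metadata min/max were stored as 1705914906694 while the filter
        // compares against 1705914906694751, row group pruning would be wrong,
        // so the min/max must be kept as the original 64-bit microsecond values
        // when the new option is enabled.
      }
    }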

Filtering on microsecond columns would be done using 64-bit values rather
than TIME/TIMESTAMP values, e.g.

select * from <file> where <timestamp_micros_column> = 1705914906694751;

I’ve tested the solution outlined above, and it seems to work when using
sqlline and with the JDBC driver, but not with the web-based interface.
Any pointers to the relevant code for that would be appreciated.
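
For completeness, the session-scoped flow that works over JDBC looks roughly
like this; the drillbit host, file path, and column name are hypothetical, and
the option name is the one proposed above.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class MicrosOverJdbc {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement()) {
          // The session option lives for the lifetime of this connection,
          // which is exactly what the web UI lacks.
          stmt.execute("ALTER SESSION SET `store.parquet.reader.timestamp_micros_as_int64` = true");
          try (ResultSet rs = stmt.executeQuery(
              "SELECT * FROM dfs.`/data/events.parquet` "      // hypothetical path
              + "WHERE `event_time_us` = 1705914906694751")) { // hypothetical column
            while (rs.next()) {
              System.out.println(rs.getLong("event_time_us")); // raw microsecond value
            }
          }
        }
      }
    }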

An alternative solution to the above could be to intercept all reading of
the Parquet schemas and modify the schema to report the microsecond columns
as int64 columns, i.e. to completely discard the information that the columns
contain time/timestamp values. This could potentially make parts of the code
behave as expected even where it is not obvious that the time/timestamp
properties of the columns are being used. However, this variant would not
align with how INT96 timestamps are handled.

Any thoughts on this idea for how to access microsecond values would be
highly appreciated.

Thanks,

/Peter

