[jira] [Created] (DRILL-8492) Allow Parquet TIME_MICROS and TIMESTAMP_MICROS columns to be read as 64-bit integer values

2024-04-26 Thread Peter Franzen (Jira)
Peter Franzen created DRILL-8492:


 Summary: Allow Parquet TIME_MICROS and TIMESTAMP_MICROS columns 
to be read as 64-bit integer values
 Key: DRILL-8492
 URL: https://issues.apache.org/jira/browse/DRILL-8492
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - Parquet
Affects Versions: 1.21.1
Reporter: Peter Franzen


When reading Parquet columns of type {{time_micros}} and 
{{timestamp_micros}}, Drill truncates the microsecond values to 
milliseconds in order to convert them to SQL timestamps.

It is currently not possible to read the original microsecond values (as 64-bit 
values, not SQL timestamps) through Drill.

One way to allow reading of the original 64-bit values is to add two 
options, similar to {{store.parquet.reader.int96_as_timestamp}}, to control 
whether microsecond times and timestamps are truncated to millisecond 
timestamps or read as non-truncated 64-bit values.

These options would be added to {{org.apache.drill.exec.ExecConstants}} and
{{org.apache.drill.exec.server.options.SystemOptionManager}}.

They would also be added to "drill-module.conf":

{{   store.parquet.reader.time_micros_as_int64: false,}}
{{   store.parquet.reader.timestamp_micros_as_int64: false,}}
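
A sketch of how the option definitions might look in {{ExecConstants}}, 
modeled on the existing {{int96_as_timestamp}} option (the names and 
description texts are proposals only, not committed API):

{code:java}
// Proposed options, mirroring PARQUET_READER_INT96_AS_TIMESTAMP.
public static final String PARQUET_READER_TIME_MICROS_AS_INT64 =
    "store.parquet.reader.time_micros_as_int64";
public static final OptionValidator PARQUET_READER_TIME_MICROS_AS_INT64_VALIDATOR =
    new BooleanValidator(PARQUET_READER_TIME_MICROS_AS_INT64,
        new OptionDescription("Reads Parquet TIME_MICROS columns as 64-bit integers instead of truncated TIME values."));

public static final String PARQUET_READER_TIMESTAMP_MICROS_AS_INT64 =
    "store.parquet.reader.timestamp_micros_as_int64";
public static final OptionValidator PARQUET_READER_TIMESTAMP_MICROS_AS_INT64_VALIDATOR =
    new BooleanValidator(PARQUET_READER_TIMESTAMP_MICROS_AS_INT64,
        new OptionDescription("Reads Parquet TIMESTAMP_MICROS columns as 64-bit integers instead of truncated TIMESTAMP values."));
{code}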

These options would then be used in the same places as 
{{store.parquet.reader.int96_as_timestamp}}:


 * org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
 * org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
 * org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter



to create an int64 reader instead of a time/timestamp reader when the 
corresponding option is set to true.
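
A hypothetical sketch of that selection in {{ParquetToDrillTypeConverter}} 
(the surrounding switch and option plumbing are elided; only the fall-through 
to BIGINT is the point):

{code:java}
// Sketch only: when the new option is set, map the Parquet logical type
// to BIGINT so the raw 64-bit microsecond value is surfaced unchanged.
case TIMESTAMP_MICROS:
  if (options.getOption(ExecConstants.PARQUET_READER_TIMESTAMP_MICROS_AS_INT64).bool_val) {
    return TypeProtos.MinorType.BIGINT;   // non-truncated 64-bit value
  }
  return TypeProtos.MinorType.TIMESTAMP;  // current behavior: truncated to millis
{code}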

In addition to this, 
{{org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector}} must be 
altered to _not_ truncate the min and max values for 
time_micros/timestamp_micros if the corresponding option is true. This class 
doesn’t have a reference to an {{OptionManager}}, so the two new options 
must be extracted from the {{OptionManager}} when the {{ParquetReaderConfig}} 
instance is created.
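
A minimal sketch of that extraction, assuming a builder-style 
{{ParquetReaderConfig}} and the proposed option names above (field and method 
names are illustrative):

{code:java}
// Illustrative sketch: capture the new options while the OptionManager is
// in scope, so FileMetadataCollector can later consult plain booleans.
public Builder withOptions(OptionManager options) {
  readerConfig.readTimeMicrosAsInt64 =
      options.getBoolean(ExecConstants.PARQUET_READER_TIME_MICROS_AS_INT64);
  readerConfig.readTimestampMicrosAsInt64 =
      options.getBoolean(ExecConstants.PARQUET_READER_TIMESTAMP_MICROS_AS_INT64);
  return this;
}
{code}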

Filtering on microsecond columns would be done using 64-bit values rather than 
TIME/TIMESTAMP values when the new options are true, e.g.

{{SELECT * FROM <table> WHERE <microsecond_column> = 1705914906694751;}}
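
The option would be enabled for a session in the usual way, e.g.:

{{ALTER SESSION SET `store.parquet.reader.timestamp_micros_as_int64` = true;}}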





Re: Parquet files with microsecond columns

2024-01-24 Thread Peter Franzen
Hi,

Thanks for the feedback. The first variant seems to work, so I’ll go with that.

/Peter

> On 24 Jan 2024, at 05:20, James Turton  wrote:
> 
> A reply on your actual topic now. I think that following the implementation 
> of the int96_as_timestamp option will result in the type conversion being done 
> "deeply enough" for Drill. I sympathise a lot with the design thinking in 
> your second option but I'd personally go the first route and only consider 
> the second route if something wasn't working.
> 
> On 2024/01/22 11:36, Peter Franzen wrote:
>> Hi,
>> 
>> I am using Drill to query Parquet files that have fields of type 
>> timestamp_micros. By default, Drill truncates those microsecond
>> values to milliseconds when reading the Parquet files in order to convert 
>> them to SQL timestamps.
>> 
>> In some of my use cases I need to read the original microsecond values (as 
>> 64-bit values, not SQL timestamps) through Drill, but
>> this doesn’t seem to be possible (unless I’ve missed something).
>> 
>> I have explored a possible solution to this, and would like to run it by 
>> some developers more experienced with the Drill code base
>> before I create a pull request.
>> 
>> My idea is to add two options similar to 
>> "store.parquet.reader.int96_as_timestamp" to control whether or not 
>> microsecond
>> times and timestamps are truncated to milliseconds. These options would be 
>> added to “org.apache.drill.exec.ExecConstants" and
>> "org.apache.drill.exec.server.options.SystemOptionManager", and to 
>> drill-module.conf:
>> 
>> store.parquet.reader.time_micros_as_int64: false,
>> store.parquet.reader.timestamp_micros_as_int64: false,
>> 
>> These options would then be used in the same places as 
>> “store.parquet.reader.int96_as_timestamp”:
>> 
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
>> org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter
>> 
>> to create an int64 reader instead of a time/timestamp reader when the 
>> corresponding option is set to true.
>> 
>> In addition to this, 
>> “org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector” must be 
>> altered to _not_ truncate the min and max
>> values for time_micros/timestamp_micros if the corresponding option is true. 
>> This class doesn’t have a reference to an OptionManager, so
>> my guess is that the two new options must be extracted from the 
>> OptionManager when the ParquetReaderConfig instance is created.
>> 
>> Filtering on microsecond columns would be done using 64-bit values rather 
>> than TIME/TIMESTAMP values, e.g.
>> 
>> select * from <table> where <microsecond_column> = 1705914906694751;
>> 
>> I’ve tested the solution outlined above, and it seems to work when using 
>> sqlline and with the JDBC driver, but not with the web based interface.
>> Any pointers to the relevant code for that would be appreciated.
>> 
>> An alternative solution to the above could be to intercept all reading of 
>> the Parquet schemas and modifying the schema to report the
>> microsecond columns as int64 columns, i.e. to completely discard the 
>> information that the columns contain time/timestamp values.
>> This could potentially make even the parts of the code where it is not 
>> obvious that the time/timestamp properties of columns are used behave 
>> as expected. However, this variant would not align with how INT96 timestamps 
>> are handled.
>> 
>> Any thoughts on this idea for how to access microsecond values would be 
>> highly appreciated.
>> 
>> Thanks,
>> 
>> /Peter
>> 
> 



Re: Parquet files with microsecond columns

2024-01-23 Thread Peter Franzen
Hi Paul,

Thanks for your comments.

I wasn’t aware that the Web UI doesn’t have sessions; when setting the option 
at the system
level the Web UI behaves as expected.
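(Using a statement along the lines of: ALTER SYSTEM SET 
`store.parquet.reader.timestamp_micros_as_int64` = true;)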

I’ll go ahead and create a pull request within the next few days.

/Peter

> On 22 Jan 2024, at 21:40, Paul Rogers  wrote:
> 
> Hi Peter,
> 
> It sounds like you are on the right track: the new option is the quick
> short-term solution. The best long-term solution is to generalize Drill's
> date/time type, but that would take much more work. (Drill also has a bug
> where the treatment of timezones is incorrect, which forces Drill to run in
> the UTC time zone -- something that will also require difficult work.)
> 
> Given that JDBC works, the problem must be in the web interface, not in
> your Parquet implementation. You've solved the problem with a new session
> option. The web interface, however, has no sessions: if you set an option
> in one call, and do your query in another, Drill will have "forgotten" your
> option. Instead, there is a way to attach options to each query. Are you
> using that feature?
> 
> As I recall, the JSON message to submit a query has an additional field to
> hold session options. I do not recall, however, if the web UI added that
> feature. Does anyone else know? Two workarounds. First, use your favorite
> JSON request tool to submit a query with the option set. Second, set your
> option as a system option so it is available to all sessions: ALTER SYSTEM
> SET...
> 
> Thanks,
> 
> - Paul
> 
> On Mon, Jan 22, 2024 at 1:38 AM Peter Franzen  wrote:
> 
>> Hi,
>> 
>> I am using Drill to query Parquet files that have fields of type
>> timestamp_micros. By default, Drill truncates those microsecond
>> values to milliseconds when reading the Parquet files in order to convert
>> them to SQL timestamps.
>> 
>> In some of my use cases I need to read the original microsecond values (as
>> 64-bit values, not SQL timestamps) through Drill, but
>> this doesn’t seem to be possible (unless I’ve missed something).
>> 
>> I have explored a possible solution to this, and would like to run it by
>> some developers more experienced with the Drill code base
>> before I create a pull request.
>> 
>> My idea is to add two options similar to
>> “store.parquet.reader.int96_as_timestamp" to control whether or not
>> microsecond
>> times and timestamps are truncated to milliseconds. These options would be
>> added to “org.apache.drill.exec.ExecConstants" and
>> "org.apache.drill.exec.server.options.SystemOptionManager", and to
>> drill-module.conf:
>> 
>>store.parquet.reader.time_micros_as_int64: false,
>>store.parquet.reader.timestamp_micros_as_int64: false,
>> 
>> These options would then be used in the same places as
>> “store.parquet.reader.int96_as_timestamp”:
>> 
>> org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
>> 
>> org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
>> org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter
>> 
>> to create an int64 reader instead of a time/timestamp reader when the
>> corresponding option is set to true.
>> 
>> In addition to this,
>> “org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector” must
>> be altered to _not_ truncate the min and max
>> values for time_micros/timestamp_micros if the corresponding option is
>> true. This class doesn’t have a reference to an OptionManager, so
>> my guess is that the two new options must be extracted from the
>> OptionManager when the ParquetReaderConfig instance is created.
>> 
>> Filtering on microsecond columns would be done using 64-bit values rather
>> than TIME/TIMESTAMP values, e.g.
>> 
>> select * from <table> where <microsecond_column> = 1705914906694751;
>> 
>> I’ve tested the solution outlined above, and it seems to work when using
>> sqlline and with the JDBC driver, but not with the web based interface.
>> Any pointers to the relevant code for that would be appreciated.
>> 
>> An alternative solution to the above could be to intercept all reading of
>> the Parquet schemas and modifying the schema to report the
>> microsecond columns as int64 columns, i.e. to completely discard the
>> information that the columns contain time/timestamp values.
>> This could potentially make even the parts of the code where it is not
>> obvious that the time/timestamp properties of columns are used behave
>> as expected. However, this variant would not align with how INT96
>> timestamps are handled.
>> 
>> Any thoughts on this idea for how to access microsecond values would be
>> highly appreciated.
>> 
>> Thanks,
>> 
>> /Peter
>> 
>> 



Parquet files with microsecond columns

2024-01-22 Thread Peter Franzen
Hi,

I am using Drill to query Parquet files that have fields of type 
timestamp_micros. By default, Drill truncates those microsecond
values to milliseconds when reading the Parquet files in order to convert them 
to SQL timestamps.

In some of my use cases I need to read the original microsecond values (as 
64-bit values, not SQL timestamps) through Drill, but
this doesn’t seem to be possible (unless I’ve missed something).

I have explored a possible solution to this, and would like to run it by some 
developers more experienced with the Drill code base
before I create a pull request.

My idea is to add two options similar to 
"store.parquet.reader.int96_as_timestamp" to control whether or not microsecond
times and timestamps are truncated to milliseconds. These options would be 
added to "org.apache.drill.exec.ExecConstants" and
"org.apache.drill.exec.server.options.SystemOptionManager", and to 
drill-module.conf:

store.parquet.reader.time_micros_as_int64: false,
store.parquet.reader.timestamp_micros_as_int64: false,

These options would then be used in the same places as 
“store.parquet.reader.int96_as_timestamp”:

org.apache.drill.exec.store.parquet.columnreaders.ColumnReaderFactory
org.apache.drill.exec.store.parquet.columnreaders.ParquetToDrillTypeConverter
org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter

to create an int64 reader instead of a time/timestamp reader when the 
corresponding option is set to true.

In addition to this, 
“org.apache.drill.exec.store.parquet.metadata.FileMetadataCollector” must be 
altered to _not_ truncate the min and max
values for time_micros/timestamp_micros if the corresponding option is true. 
This class doesn’t have a reference to an OptionManager, so
my guess is that the two new options must be extracted from the OptionManager 
when the ParquetReaderConfig instance is created.

Filtering on microsecond columns would be done using 64-bit values rather than 
TIME/TIMESTAMP values, e.g.

select * from <table> where <microsecond_column> = 1705914906694751;

I’ve tested the solution outlined above, and it seems to work when using 
sqlline and with the JDBC driver, but not with the web based interface.
Any pointers to the relevant code for that would be appreciated.

An alternative solution to the above could be to intercept all reading of the 
Parquet schemas and modifying the schema to report the
microsecond columns as int64 columns, i.e. to completely discard the 
information that the columns contain time/timestamp values.
This could potentially make even the parts of the code where it is not obvious 
that the time/timestamp properties of columns are used behave as expected. 
However, this variant would not align with how INT96 timestamps 
are handled.

Any thoughts on this idea for how to access microsecond values would be highly 
appreciated.

Thanks,

/Peter



[jira] [Created] (DRILL-8458) Reading Parquet data page with repetition levels larger than column data throws IllegalArgumentException

2023-10-23 Thread Peter Franzen (Jira)
Peter Franzen created DRILL-8458:


 Summary: Reading Parquet data page with repetition levels larger 
than column data throws IllegalArgumentException
 Key: DRILL-8458
 URL: https://issues.apache.org/jira/browse/DRILL-8458
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Parquet
Affects Versions: 1.21.1
Reporter: Peter Franzen


When the size of the repetition level bytes in a Parquet data page is larger 
than the size of the column data bytes, 
{{org.apache.parquet.hadoop.ColumnChunkIncReadStore$ColumnChunkIncPageReader::readPage}}
 throws an {{IllegalArgumentException}}. This is caused by trying to set the 
limit of a ByteBuffer to a value larger than its capacity.
 
The offending code is at line 226 in {{ColumnChunkIncReadStore.java}}:
 
{code:java}
217 int pageBufOffset = 0;
218 ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
219 BytesInput repLevelBytes = BytesInput.from(
220   (ByteBuffer) bb.slice().limit(pageBufOffset + repLevelSize)
221 );
222 pageBufOffset += repLevelSize;
223
224 bb = (ByteBuffer) pageBuf.position(pageBufOffset);
225 final BytesInput defLevelBytes = BytesInput.from(
226   (ByteBuffer) bb.slice().limit(pageBufOffset + defLevelSize)
227 );
228 pageBufOffset += defLevelSize;
{code}
 
The buffer {{pageBuf}} contains the repetition level bytes followed by the 
definition level bytes followed by the column data bytes.
 
The code at lines 217-221 reads the repetition level bytes, and then updates 
the position of the {{pageBuf}} buffer to the start of the definition level 
bytes (lines 222 and 224).
 
The code at lines 225-227 reads the definition level bytes, and when creating a 
slice of the {{pageBuf}} buffer containing the definition level bytes, the 
slice's limit is set as if the position were still at the beginning of the 
repetition level bytes (line 226), i.e. as if it had not been updated.
 
This means that if the capacity of the pageBuf buffer (which is the size of the 
repetition level bytes + the size of the definition level bytes + the size of 
the column data bytes) is less than (repLevelSize + repLevelSize + 
defLevelSize), the call to limit() will throw.
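
A minimal, self-contained illustration of the underlying ByteBuffer behavior 
(the sizes are made up, chosen so that the column data is smaller than the 
repetition levels): {{slice()}} starts at the buffer's current position, so a 
limit computed from the absolute {{pageBufOffset}} can exceed the slice's 
capacity:

{code:java}
import java.nio.ByteBuffer;

public class SliceLimitDemo {
  public static void main(String[] args) {
    int repLevelSize = 6, defLevelSize = 4, colDataSize = 3;
    // capacity = 13, which is less than 2*repLevelSize + defLevelSize = 16
    ByteBuffer pageBuf = ByteBuffer.allocate(repLevelSize + defLevelSize + colDataSize);

    int pageBufOffset = repLevelSize;            // past the repetition levels
    ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);

    ByteBuffer slice = bb.slice();               // slice capacity = 13 - 6 = 7
    slice.limit(defLevelSize);                   // OK: 4 <= 7 (the fixed code)
    try {
      slice.limit(pageBufOffset + defLevelSize); // 10 > 7 (the buggy code)
    } catch (IllegalArgumentException e) {
      System.out.println("limit exceeds slice capacity: " + e);
    }
  }
}
{code}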
 
The fix is to change line 226 to
{code:java}
  (ByteBuffer) bb.slice().limit(defLevelSize)
{code}
 
For symmetry, line 220 could also be changed to
{code:java}
  (ByteBuffer) bb.slice().limit(repLevelSize)
{code}
 
although {{pageBufOffset}} is always 0 there and will not cause the limit to 
exceed the capacity.
 





[jira] [Created] (DRILL-8423) Parquet TIME_MICROS columns with values > Integer.MAX_VALUE are not displayed correctly

2023-04-13 Thread Peter Franzen (Jira)
Peter Franzen created DRILL-8423:


 Summary: Parquet TIME_MICROS columns with values > 
Integer.MAX_VALUE are not displayed correctly
 Key: DRILL-8423
 URL: https://issues.apache.org/jira/browse/DRILL-8423
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Parquet
Affects Versions: 1.20.3
Reporter: Peter Franzen


Assume a Parquet file in a directory "Test" with a column _timeCol_ having the 
type {{org.apache.parquet.schema.OriginalType.TIME_MICROS}}.

Assume there are two records with the values 2147483647 and 2147483648, 
respectively, in that column (i.e. the times 00:35:47.483647 and 
00:35:47.483648).

Executing the query
{code:java}
SELECT timeCol FROM dfs.Test;
{code}
produces the result
{code:java}
timeCol
---
00:35:47.483
23:24:12.517
{code}
i.e. the microsecond value of Integer.MAX_VALUE + 1 has wrapped around when 
read from the Parquet file (it is displayed as the same number of milliseconds 
before midnight as the time represented by Integer.MAX_VALUE is after 
midnight).
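
The observed value is consistent with the 64-bit microsecond value being 
narrowed to a 32-bit int before conversion (this mechanism is inferred from 
the symptom, not confirmed from the code):

{code:java}
long micros = 2147483648L;                        // Integer.MAX_VALUE + 1
int wrapped = (int) micros;                       // -2147483648 (32-bit overflow)
long millis = wrapped / 1000;                     // -2147483 ms
long inDay = Math.floorMod(millis, 86_400_000L);  // 84252517 ms into the day
// 84252517 ms == 23:24:12.517, exactly the value Drill displays
{code}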

 





[jira] [Created] (DRILL-8422) Parquet TIMESTAMP_MICROS columns in

2023-04-13 Thread Peter Franzen (Jira)
Peter Franzen created DRILL-8422:


 Summary: Parquet TIMESTAMP_MICROS columns in 
 Key: DRILL-8422
 URL: https://issues.apache.org/jira/browse/DRILL-8422
 Project: Apache Drill
  Issue Type: Bug
Reporter: Peter Franzen








[jira] [Created] (DRILL-8421) Parquet TIMESTAMP_MICROS columns in WHERE clauses are not converted to milliseconds before filtering

2023-04-07 Thread Peter Franzen (Jira)
Peter Franzen created DRILL-8421:


 Summary: Parquet TIMESTAMP_MICROS columns in WHERE clauses are not 
converted to milliseconds before filtering
 Key: DRILL-8421
 URL: https://issues.apache.org/jira/browse/DRILL-8421
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Parquet
Affects Versions: 1.21.0
Reporter: Peter Franzen


When using Drill with Parquet files whose timestamp columns are in 
microseconds, Drill converts the microsecond values to milliseconds for 
display. However, when a timestamp column is used in a WHERE clause, it 
appears that the original microsecond value, rather than the adjusted 
millisecond value, is used when filtering records.

*To Reproduce*
Assume a Parquet file in a directory "Test" with a column _timestampCol_ having 
the type {{org.apache.parquet.schema.OriginalType.TIMESTAMP_MICROS}}.

Assume there are two records with the values 1673981999806149 and 
1674759597743552, respectively, in that column (i.e. the UTC dates 
2023-01-17T18:59:59.806149 and 2023-01-26T18:59:57.743552).
# Execute the query
{{SELECT timestampCol FROM dfs.Test;}}
The result includes both records, as expected.
# Execute the query
{{SELECT timestampCol FROM dfs.Test WHERE timestampCol < 
TO_TIMESTAMP('2023-02-01 00:00:00', 'yyyy-MM-dd HH:mm:ss')}}
This produces an empty result although both records have a value less than the 
argument.
# Execute the query
{{SELECT timestampCol FROM dfs.Test WHERE timestampCol > 
TO_TIMESTAMP('2023-02-01 00:00:00', 'yyyy-MM-dd HH:mm:ss')}}
The result includes both records although neither has a value greater than the 
argument.

*Expected behavior*
The query in 2) above should produce a result with both records, and the query 
in 3) should produce an empty result.

*Additional context*
Even timestamps long into the future produce results with both records, e.g.:
{{SELECT timestampCol FROM dfs.Test WHERE timestampCol > 
TO_TIMESTAMP('2502-04-04 00:00:00', 'yyyy-MM-dd HH:mm:ss')}}

Manually converting the timestamp column to milliseconds produces the expected 
result:
{{SELECT timestampCol FROM dfs.Test WHERE 
TO_TIMESTAMP(CONVERT_FROM(CONVERT_TO(timestampCol, 'TIMESTAMP_EPOCH'), 
'BIGINT')/1000) < TO_TIMESTAMP('2023-02-01 00:00:00', 'yyyy-MM-dd HH:mm:ss')}}
returns a result with both records.
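
These symptoms are consistent with the pushed-down filter comparing the raw 
microsecond value against the millisecond epoch of the literal (an inference 
from the behavior above, not confirmed from the code):

{code:java}
long microsValue  = 1673981999806149L; // 2023-01-17T18:59:59.806149 as microseconds
long filterMillis = 1675209600000L;    // 2023-02-01T00:00:00 UTC as milliseconds
System.out.println(microsValue < filterMillis); // false: "<" drops both records
System.out.println(microsValue > filterMillis); // true:  ">" keeps both records
{code}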


