[
https://issues.apache.org/jira/browse/DRILL-8421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17712433#comment-17712433
]
ASF GitHub Bot commented on DRILL-8421:
---------------------------------------
handmadecode opened a new pull request, #2793:
URL: https://github.com/apache/drill/pull/2793
# [DRILL-8421](https://issues.apache.org/jira/browse/DRILL-8421): Truncate
parquet microsecond columns
## Description
The metadata min and max values of parquet microsecond columns are truncated
to milliseconds, which is the time unit expected by the initial file pruning
during filtering. In addition, `TIME_MICROS` columns are now read as 64-bit
values before being truncated to 32-bit millisecond values. Previously they
were read directly as 32-bit values, causing values greater than
`Integer.MAX_VALUE` to be incorrect.
The second fix also addresses
[DRILL-8423](https://issues.apache.org/jira/browse/DRILL-8423).
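The two conversions described above can be sketched as follows. This is a minimal illustration with hypothetical helper names, not Drill's actual reader code: `TIME_MICROS` (microseconds since midnight, stored as INT64 in parquet) must be read as a 64-bit value and only then truncated to the 32-bit millisecond value Drill's TIME type uses, because the microsecond count for a late-in-the-day time exceeds `Integer.MAX_VALUE`.

```java
// Hypothetical sketch of the microsecond-to-millisecond truncation
// described in this PR; method names are illustrative, not Drill's API.
public class MicrosTruncation {

    // TIME_MICROS: microseconds since midnight (INT64 in parquet).
    // Read as 64-bit first, then truncate to 32-bit milliseconds.
    static int timeMicrosToMillis(long micros) {
        return (int) (micros / 1000L);
    }

    // TIMESTAMP_MICROS: microseconds since epoch, truncated to epoch millis.
    static long timestampMicrosToMillis(long micros) {
        return micros / 1000L;
    }

    public static void main(String[] args) {
        // 23:59:59 as microseconds since midnight: 86,399,000,000,
        // which is far above Integer.MAX_VALUE (2,147,483,647).
        // Casting it straight to int would wrap modulo 2^32 and yield
        // a wrong time of day; dividing first keeps it in int range.
        long lateEvening = 86_399_000_000L;
        System.out.println(timeMicrosToMillis(lateEvening));        // 86399000

        // One of the timestamps from the issue report.
        System.out.println(timestampMicrosToMillis(1673981999806149L)); // 1673981999806
    }
}
```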
## Documentation
Bug fix only; no documentation changes.
## Testing
Unit tests added in new test class
`org.apache.drill.exec.store.parquet.TestMicrosecondColumns`.
> Parquet TIMESTAMP_MICROS columns in WHERE clauses are not converted to
> milliseconds before filtering
> ----------------------------------------------------------------------------------------------------
>
> Key: DRILL-8421
> URL: https://issues.apache.org/jira/browse/DRILL-8421
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 1.21.0
> Reporter: Peter Franzen
> Priority: Major
> Fix For: 1.21.1
>
>
> When using Drill with parquet files where the timestamp columns are in
> microseconds, Drill converts the microsecond values to milliseconds when
> displayed. However, when using a timestamp column in WHERE clauses it looks
> like the original microsecond value is used instead of the adjusted
> millisecond value when filtering records.
> *To Reproduce*
> Assume a parquet file in a directory "Test" with a column _timestampCol_
> having the type
> {{org.apache.parquet.schema.OriginalType.TIMESTAMP_MICROS}}.
> Assume there are two records with the values 1673981999806149 and
> 1674759597743552, respectively, in that column (i.e. the UTC dates
> 2023-01-17T18:59:59.806149 and 2023-01-26T18:59:57.743552).
> # Execute the query
> {{SELECT timestampCol FROM dfs.Test;}}
> The result includes both records, as expected.
> # Execute the query
> {{SELECT timestampCol FROM dfs.Test WHERE timestampCol <
> TO_TIMESTAMP('2023-02-01 00:00:00', 'yyyy-MM-dd HH:mm:ss')}}
> This produces an empty result although both records have a value less than
> the argument.
> # Execute
> {{SELECT timestampCol FROM dfs.Test WHERE timestampCol >
> TO_TIMESTAMP('2023-02-01 00:00:00', 'yyyy-MM-dd HH:mm:ss')}}
> The result includes both records although neither has a value greater than
> the argument.
> *Expected behavior*
> The query in 2) above should produce a result with both records, and the
> query in 3) should produce an empty result.
> *Additional context*
> Even timestamps long into the future produce results with both records, e.g.:
> {{SELECT timestampCol FROM dfs.Test WHERE timestampCol >
> TO_TIMESTAMP('2502-04-04 00:00:00', 'yyyy-MM-dd HH:mm:ss')}}
> Manually converting the timestamp column to milliseconds produces the
> expected result:
> {{SELECT timestampCol FROM dfs.Test WHERE
> TO_TIMESTAMP(CONVERT_FROM(CONVERT_TO(timestampCol, 'TIMESTAMP_EPOCH'),
> 'BIGINT')/1000) < TO_TIMESTAMP('2023-02-01 00:00:00', 'yyyy-MM-dd HH:mm:ss')}}
> produces a result with both records.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)