[ https://issues.apache.org/jira/browse/IMPALA-8077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong resolved IMPALA-8077.
-----------------------------------
Resolution: Won't Fix
We should focus on IMPALA-2017 instead.
> Avoid converting timestamps in dropped rows during Parquet scanning
> -------------------------------------------------------------------
>
> Key: IMPALA-8077
> URL: https://issues.apache.org/jira/browse/IMPALA-8077
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Csaba Ringhofer
> Priority: Major
> Labels: parquet, performance, timestamp
>
> If the flag convert_legacy_hive_parquet_utc_timestamps is true, then every
> TIMESTAMP value is converted from UTC to local time during Parquet scanning.
> This is done during column decoding, and Impala materializes every column
> before evaluating the WHERE predicate, so if a timestamp column is not used
> in the predicate, the conversion is done unnecessarily for rows that fail
> the predicate.
> Example:
> CREATE TABLE t (id INT, ts TIMESTAMP) STORED AS PARQUET;
> SELECT * FROM t WHERE id = 1;
> Timezone conversion will be done for every 'ts', even if the predicate
> matches only a single row (let's ignore statistics and dictionary filtering).
> The CPU time of the query above is likely to be dominated by timezone
> conversion, especially if the query is very selective.
> Note that the same overhead is "normal" if the predicate uses the timestamp
> column, e.g. in
> SELECT * FROM t WHERE ts = "2019.01.14 16:00:00"
> It would be possible to avoid the conversion in that case as well, but doing
> so would be very hacky, so it is out of the scope of this issue.
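
The difference described above, between converting every timestamp during column decoding and converting only the rows that pass the predicate, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Impala's scanner code: CountingConverter, scan_eager, scan_late and the fixed +01:00 LOCAL_TZ are hypothetical names invented for the example, and a real conversion would apply time zone database rules rather than a fixed offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed-offset "local" zone stands in for the real time zone rules.
LOCAL_TZ = timezone(timedelta(hours=1))


class CountingConverter:
    """Converts naive UTC timestamps to naive local time and counts the calls."""

    def __init__(self):
        self.calls = 0

    def utc_to_local(self, ts_utc):
        self.calls += 1
        aware = ts_utc.replace(tzinfo=timezone.utc)
        return aware.astimezone(LOCAL_TZ).replace(tzinfo=None)


def scan_eager(ids, ts_utc, predicate, conv):
    # Behaviour described in the issue: convert while decoding the column,
    # then evaluate the predicate on fully materialized rows.
    converted = [conv.utc_to_local(ts) for ts in ts_utc]
    return [(i, ts) for i, ts in zip(ids, converted) if predicate(i)]


def scan_late(ids, ts_utc, predicate, conv):
    # Proposed behaviour: evaluate the predicate first and convert only
    # the timestamps of rows that survive it.
    return [(i, conv.utc_to_local(ts)) for i, ts in zip(ids, ts_utc) if predicate(i)]


if __name__ == "__main__":
    ids = list(range(1000))
    ts = [datetime(2019, 1, 14, 16, 0, 0) + timedelta(seconds=i) for i in ids]
    eager, late = CountingConverter(), CountingConverter()
    rows_eager = scan_eager(ids, ts, lambda i: i == 1, eager)
    rows_late = scan_late(ids, ts, lambda i: i == 1, late)
    print(rows_eager == rows_late, eager.calls, late.calls)
```

Both functions return the same rows; only the number of conversions differs, which is why the saving grows with the selectivity of the predicate.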
--
This message was sent by Atlassian Jira
(v8.3.4#803005)