[jira] [Resolved] (IMPALA-7568) Implement timezone aware parquet stat filtering for timestamp columns

Csaba Ringhofer (JIRA) Tue, 20 Nov 2018 07:15:18 -0800


     [ 
https://issues.apache.org/jira/browse/IMPALA-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Csaba Ringhofer resolved IMPALA-7568.
-------------------------------------
       Resolution: Implemented
    Fix Version/s: Impala 3.2.0

> Implement timezone aware parquet stat filtering for timestamp columns
> ---------------------------------------------------------------------
>
>                 Key: IMPALA-7568
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7568
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Assignee: Csaba Ringhofer
>            Priority: Major
>              Labels: parquet, timestamp
>             Fix For: Impala 3.2.0
>
>
> Parquet timestamp columns can contain UTC-normalized data, which means that
> the data is stored in UTC but is expected to be shown in local time (to
> be consistent with Hive). This is done by converting these timestamps from
> UTC to local time during scanning.
> This conversion has to be considered during min/max stat filtering, otherwise 
> some row groups can be incorrectly skipped. For this reason IMPALA-7559 
> disables stat filtering on UTC normalized timestamp columns. 
> This ticket deals with creating a correct implementation so that stat
> filtering can be re-enabled for these columns.
> DST and historical rule changes add some complexity to this. The UTC->local
> mapping can be non-monotonic, and the local->UTC mapping can be ambiguous.
> Non-monotonic means that tMin <= t <= tMax being true in UTC does not imply
> that the same is true in local time.
> The solution I see is to convert the min/max of the predicate from local to
> UTC and resolve ambiguity by choosing the earlier time in the case of min,
> and the later time in the case of max. These UTC values can be compared with
> stats safely.
> Note the timezone rules can be different in Hive and Impala (especially 
> historical ones), so we cannot ensure that Impala gives exactly the same 
> results as Hive. The goal is to ensure that Impala returns the same rows with 
> and without stat filtering.
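
The bound conversion described above can be sketched in Python. This is only
an illustration of the idea, not Impala's actual backend code: the function
names (local_to_utc_candidates, predicate_bounds_to_utc, row_group_may_match)
are hypothetical, and the sketch assumes Python 3.9+ with the standard
zoneinfo module and an installed timezone database. Both possible UTC
interpretations of each local bound are computed, then the earlier one is
taken for min and the later one for max.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_to_utc_candidates(local_naive, tz):
    """Return both UTC interpretations (fold=0 and fold=1) of a naive local time.

    For an unambiguous local time the two candidates are identical.
    For a time repeated at a DST fall-back they differ.
    """
    candidates = []
    for fold in (0, 1):
        aware = local_naive.replace(tzinfo=tz, fold=fold)
        # Convert to UTC, then drop tzinfo so results compare with naive UTC stats.
        candidates.append(aware.astimezone(timezone.utc).replace(tzinfo=None))
    return candidates


def predicate_bounds_to_utc(local_min, local_max, tz):
    """Widen a local-time predicate range into a conservative UTC range.

    Hypothetical helper: earlier candidate for min, later candidate for max.
    """
    utc_min = min(local_to_utc_candidates(local_min, tz))
    utc_max = max(local_to_utc_candidates(local_max, tz))
    return utc_min, utc_max


def row_group_may_match(stat_min_utc, stat_max_utc, utc_min, utc_max):
    """A row group can be skipped only if its UTC stats miss the UTC range."""
    return stat_max_utc >= utc_min and stat_min_utc <= utc_max
```

For example, in Europe/Budapest the local time 2018-10-28 02:30 occurs twice
(00:30 UTC and 01:30 UTC). Calling predicate_bounds_to_utc with that value as
both bounds yields the UTC range [00:30, 01:30], so a row group whose stats
are [01:15, 01:45] UTC is kept, while one whose stats are [02:00, 03:00] UTC
can be skipped.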



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
