[ https://issues.apache.org/jira/browse/ORC-135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15838699#comment-15838699 ]
Owen O'Malley commented on ORC-135:
-----------------------------------
I'd propose extending the file format like:
{code:title=orc_proto.proto}
message TimestampStatistics {
  // min,max values saved as milliseconds since epoch
  optional sint64 minimum = 1;
  optional sint64 maximum = 2;
  optional sint64 minimumUtc = 3;
  optional sint64 maximumUtc = 4;
}
{code}
where minimumUtc and maximumUtc are defined as <time in utc> - <epoch in utc>
in milliseconds. New writers stop setting minimum and maximum and set only
minimumUtc and maximumUtc. Old readers will not see the new min and max, and
new readers will ignore the old values.
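
To make the compatibility rule concrete, here is a rough Python sketch of the writer and reader sides. It is only an illustration, not the ORC implementation: the dataclass mirrors the proposed message, the helper names (utc_millis, write_stats, usable_range) are hypothetical, and reading "<time in utc> - <epoch in utc>" as "treat the wall-clock value as if it were UTC" is my assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


@dataclass
class TimestampStatistics:
    """Hypothetical stand-in for the proposed protobuf message."""
    minimum: Optional[int] = None      # legacy field, no longer written
    maximum: Optional[int] = None      # legacy field, no longer written
    minimum_utc: Optional[int] = None  # proposed field 3
    maximum_utc: Optional[int] = None  # proposed field 4


EPOCH = datetime(1970, 1, 1)


def utc_millis(wall_clock: datetime) -> int:
    """Milliseconds between a naive wall-clock value and the epoch.

    Assumption: the wall-clock value is interpreted as UTC, so the result
    does not depend on the writer's or reader's timezone.
    """
    return (wall_clock - EPOCH) // timedelta(milliseconds=1)


def write_stats(values: Iterable[datetime]) -> TimestampStatistics:
    """New writer: sets only the UTC fields and leaves the legacy ones unset."""
    millis = [utc_millis(v) for v in values]
    return TimestampStatistics(minimum_utc=min(millis), maximum_utc=max(millis))


def usable_range(stats: TimestampStatistics) -> Optional[Tuple[int, int]]:
    """New reader: uses the UTC fields only and ignores legacy values."""
    if stats.minimum_utc is not None and stats.maximum_utc is not None:
        return stats.minimum_utc, stats.maximum_utc
    return None  # old file: no trustworthy range, so nothing can be skipped


new_file = write_stats([datetime(2007, 8, 1), datetime(2007, 8, 2)])
old_file = TimestampStatistics(minimum=1, maximum=2)
print(usable_range(new_file))
print(usable_range(old_file))
```

With this shape, a file written by an old writer yields no range for a new reader, so predicate pushdown is skipped for it rather than evaluated against values in an unknown timezone.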
> PPD for timestamp is wrong when reader and writer timezones are different
> -------------------------------------------------------------------------
>
> Key: ORC-135
> URL: https://issues.apache.org/jira/browse/ORC-135
> Project: Orc
> Issue Type: Bug
> Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0
> Reporter: Prasanth Jayachandran
> Assignee: Prasanth Jayachandran
> Priority: Critical
>
> When reader and writer timezones are different, PPD evaluation does not
> offset the timezone when reading the min and max values. This can result in
> wrong PPD evaluation and hence incorrect results.
> Example:
> Table written in US/Eastern timezone. All values in this table are
> "2007-08-01 00:00:00.0".
> {code:title=PPD disabled}
> hive> set hive.optimize.index.filter=false;
> hive> select ORDER_DATE from ORDER_FACT_small where ORDER_DATE='2007-08-01 00:00:00.0' limit 1;
> 2007-08-01 00:00:00.0
> OK
> {code}
> {code:title=PPD enabled}
> set hive.optimize.index.filter=true;
> select ORDER_DATE from ORDER_FACT_small where ORDER_DATE='2007-08-01 00:00:00.0' limit 1;
> OK
> {code}
> No rows are returned when PPD is enabled (reader timezone is UTC).
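
The failure in the example above can be reproduced with a simplified model in Python. This is a sketch under stated assumptions, not the actual ORC or Hive code path: it assumes the stored min/max are the writer's wall-clock value converted to epoch milliseconds in the writer's zone, that the reader converts the predicate literal in its own zone, and it uses a fixed UTC-4 offset for US/Eastern on 2007-08-01 (daylight time) to stay self-contained. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: US/Eastern on 2007-08-01 is on daylight time, i.e. UTC-4.
WRITER_TZ = timezone(timedelta(hours=-4))
READER_TZ = timezone.utc
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

VALUE = "2007-08-01 00:00:00.0"


def to_epoch_millis(wall_clock: str, tz: timezone) -> int:
    """Interpret a wall-clock string in the given zone; return epoch millis."""
    naive = datetime.strptime(wall_clock, "%Y-%m-%d %H:%M:%S.%f")
    return (naive.replace(tzinfo=tz) - EPOCH) // timedelta(milliseconds=1)


def may_contain(minimum: int, maximum: int, literal: int) -> bool:
    """Equality-predicate check: can this range contain the literal?"""
    return minimum <= literal <= maximum


# Every row holds the same value, so min == max in the writer's statistics.
stat_min = stat_max = to_epoch_millis(VALUE, WRITER_TZ)

# Reader in UTC converts the literal without offsetting to the writer zone.
literal_no_offset = to_epoch_millis(VALUE, READER_TZ)
# Reader that offsets the literal to the writer's zone before comparing.
literal_with_offset = to_epoch_millis(VALUE, WRITER_TZ)

print(stat_min - literal_no_offset)                        # gap in millis
print(may_contain(stat_min, stat_max, literal_no_offset))   # data is skipped
print(may_contain(stat_min, stat_max, literal_with_offset))
```

In this model the literal falls four hours outside the stored range when the offset is not applied, so the data is skipped and the query returns no rows, which matches the behavior reported above.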