[
https://issues.apache.org/jira/browse/HIVE-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Remus Rusanu updated HIVE-5527:
-------------------------------
Description:
A query like
SELECT ctimestamp2 from alltypesorc WHERE ctimestamp2 > -10669;
returns rows in row mode, but not in vector mode *when running in GMT+2
timezone*.
I know what causes this, but I don’t know exactly whether is a bug or not.
The reading of the TIMESTAMP types is done in TimeStampTreeReader class,
long ms = (result.vector[result.isRepeating ? 0 : i] +
WriterImpl.BASE_TIMESTAMP)
* WriterImpl.MILLIS_PER_SECOND;
long ns = parseNanos(nanoVector.vector[nanoVector.isRepeating ? 0 :
i]);
// the rounding error exists because java always rounds up when
dividing integers
// -42001/1000 = -42; and -42001 % 1000 = -1 (+ 1000)
// to get the correct value we need
// (-42 - 1)*1000 + 999 = -42001
// (42)*1000 + 1 = 42001
if(ms < 0 && ns != 0) {
ms -= 1000;
}
// Convert millis into nanos and add the nano vector value to it
result.vector[i] = (ms * 1000000) + ns;
As you see this relies on the ORC WriterImpl.BASE_TIMESTAMP, which is declared
as:
static final long BASE_TIMESTAMP =
Timestamp.valueOf("2015-01-01 00:00:00").getTime() / MILLIS_PER_SECOND;
On US/Pacific time, this will be 1420099200
On EEST (GMT+2) time is 1420063200
The first row in alltypesorc for ctimestamp2 reads -1420099192 as data[0] and
7005 as nanos[0]. On US/Pacific, with a LONG vector timestamp value of
8875000000. On EEST it ends up with -35992125000000. (Note how the abs(data[0])
value is smaller than the US/Pacific basetime, but bigger than the EEST, so it
goes negative on EEST and just cascades to a huge negative number).
The vector filter simply compares this with -10669 (the query WHERE clause) and
it qualifies the row on US/Pacific, but fails on EEST.
I’m not sure what the right solution is, the whole of Hive code appears to be
riddled with Timezone problems. As a side node, the build-common.xml sets an
environment variable TZ to US/Pacific, but this has no effect in running tests
on Windows.
But the gist of it is this: in row mode the results are consistent on any time
zone. In vector mode the results vary (rows qualify for WHERE clause) depending
on the timezone.
was:I did not yet identify the root cause, but the vectorization_regress.q
returns different results depending on the local timezone settings
> Use of localtime Calendar in vectorized Timestamp arithmetic results in data
> corruption (depends on localtime)
> --------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-5527
> URL: https://issues.apache.org/jira/browse/HIVE-5527
> Project: Hive
> Issue Type: Bug
> Reporter: Remus Rusanu
>
> A query like
> SELECT ctimestamp2 from alltypesorc WHERE ctimestamp2 > -10669;
> returns rows in row mode, but not in vector mode *when running in GMT+2
> timezone*.
> I know what causes this, but I don’t know exactly whether is a bug or not.
> The reading of the TIMESTAMP types is done in TimeStampTreeReader class,
> long ms = (result.vector[result.isRepeating ? 0 : i] +
> WriterImpl.BASE_TIMESTAMP)
> * WriterImpl.MILLIS_PER_SECOND;
> long ns = parseNanos(nanoVector.vector[nanoVector.isRepeating ? 0 :
> i]);
> // the rounding error exists because java always rounds up when
> dividing integers
> // -42001/1000 = -42; and -42001 % 1000 = -1 (+ 1000)
> // to get the correct value we need
> // (-42 - 1)*1000 + 999 = -42001
> // (42)*1000 + 1 = 42001
> if(ms < 0 && ns != 0) {
> ms -= 1000;
> }
> // Convert millis into nanos and add the nano vector value to it
> result.vector[i] = (ms * 1000000) + ns;
> As you see this relies on the ORC WriterImpl.BASE_TIMESTAMP, which is
> declared as:
> static final long BASE_TIMESTAMP =
> Timestamp.valueOf("2015-01-01 00:00:00").getTime() / MILLIS_PER_SECOND;
> On US/Pacific time, this will be 1420099200
> On EEST (GMT+2) time is 1420063200
> The first row in alltypesorc for ctimestamp2 reads -1420099192 as data[0] and
> 7005 as nanos[0]. On US/Pacific, with a LONG vector timestamp value of
> 8875000000. On EEST it ends up with -35992125000000. (Note how the
> abs(data[0]) value is smaller than the US/Pacific basetime, but bigger than
> the EEST, so it goes negative on EEST and just cascades to a huge negative
> number).
> The vector filter simply compares this with -10669 (the query WHERE clause)
> and it qualifies the row on US/Pacific, but fails on EEST.
> I’m not sure what the right solution is, the whole of Hive code appears to be
> riddled with Timezone problems. As a side node, the build-common.xml sets an
> environment variable TZ to US/Pacific, but this has no effect in running
> tests on Windows.
> But the gist of it is this: in row mode the results are consistent on any
> time zone. In vector mode the results vary (rows qualify for WHERE clause)
> depending on the timezone.
--
This message was sent by Atlassian JIRA
(v6.1#6144)