[ https://issues.apache.org/jira/browse/HIVE-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Remus Rusanu updated HIVE-5527: ------------------------------- Description: A query like SELECT ctimestamp2 from alltypesorc WHERE ctimestamp2 > -10669; returns rows in row mode, but not in vector mode *when running in GMT+2 timezone*. I know what causes this, but I don’t know exactly whether is a bug or not. The reading of the TIMESTAMP types is done in TimeStampTreeReader class, long ms = (result.vector[result.isRepeating ? 0 : i] + WriterImpl.BASE_TIMESTAMP) * WriterImpl.MILLIS_PER_SECOND; long ns = parseNanos(nanoVector.vector[nanoVector.isRepeating ? 0 : i]); // the rounding error exists because java always rounds up when dividing integers // -42001/1000 = -42; and -42001 % 1000 = -1 (+ 1000) // to get the correct value we need // (-42 - 1)*1000 + 999 = -42001 // (42)*1000 + 1 = 42001 if(ms < 0 && ns != 0) { ms -= 1000; } // Convert millis into nanos and add the nano vector value to it result.vector[i] = (ms * 1000000) + ns; As you see this relies on the ORC WriterImpl.BASE_TIMESTAMP, which is declared as: static final long BASE_TIMESTAMP = Timestamp.valueOf("2015-01-01 00:00:00").getTime() / MILLIS_PER_SECOND; On US/Pacific time, this will be 1420099200 On EEST (GMT+2) time is 1420063200 The first row in alltypesorc for ctimestamp2 reads -1420099192 as data[0] and 7005 as nanos[0]. On US/Pacific, with a LONG vector timestamp value of 8875000000. On EEST it ends up with -35992125000000. (Note how the abs(data[0]) value is smaller than the US/Pacific basetime, but bigger than the EEST, so it goes negative on EEST and just cascades to a huge negative number). The vector filter simply compares this with -10669 (the query WHERE clause) and it qualifies the row on US/Pacific, but fails on EEST. I’m not sure what the right solution is, the whole of Hive code appears to be riddled with Timezone problems. As a side node, the build-common.xml sets an environment variable TZ to US/Pacific, but this has no effect in running tests on Windows. But the gist of it is this: in row mode the results are consistent on any time zone. In vector mode the results vary (rows qualify for WHERE clause) depending on the timezone. was:I did not yet identify the root cause, but the vectorization_regress.q returns different results depending on the local timezone settings > Use of localtime Calendar in vectorized Timestamp arithmetic results in data > corruption (depends on localtime) > -------------------------------------------------------------------------------------------------------------- > > Key: HIVE-5527 > URL: https://issues.apache.org/jira/browse/HIVE-5527 > Project: Hive > Issue Type: Bug > Reporter: Remus Rusanu > > A query like > SELECT ctimestamp2 from alltypesorc WHERE ctimestamp2 > -10669; > returns rows in row mode, but not in vector mode *when running in GMT+2 > timezone*. > I know what causes this, but I don’t know exactly whether is a bug or not. > The reading of the TIMESTAMP types is done in TimeStampTreeReader class, > long ms = (result.vector[result.isRepeating ? 0 : i] + > WriterImpl.BASE_TIMESTAMP) > * WriterImpl.MILLIS_PER_SECOND; > long ns = parseNanos(nanoVector.vector[nanoVector.isRepeating ? 0 : > i]); > // the rounding error exists because java always rounds up when > dividing integers > // -42001/1000 = -42; and -42001 % 1000 = -1 (+ 1000) > // to get the correct value we need > // (-42 - 1)*1000 + 999 = -42001 > // (42)*1000 + 1 = 42001 > if(ms < 0 && ns != 0) { > ms -= 1000; > } > // Convert millis into nanos and add the nano vector value to it > result.vector[i] = (ms * 1000000) + ns; > As you see this relies on the ORC WriterImpl.BASE_TIMESTAMP, which is > declared as: > static final long BASE_TIMESTAMP = > Timestamp.valueOf("2015-01-01 00:00:00").getTime() / MILLIS_PER_SECOND; > On US/Pacific time, this will be 1420099200 > On EEST (GMT+2) time is 1420063200 > The first row in alltypesorc for ctimestamp2 reads -1420099192 as data[0] and > 7005 as nanos[0]. On US/Pacific, with a LONG vector timestamp value of > 8875000000. On EEST it ends up with -35992125000000. (Note how the > abs(data[0]) value is smaller than the US/Pacific basetime, but bigger than > the EEST, so it goes negative on EEST and just cascades to a huge negative > number). > The vector filter simply compares this with -10669 (the query WHERE clause) > and it qualifies the row on US/Pacific, but fails on EEST. > I’m not sure what the right solution is, the whole of Hive code appears to be > riddled with Timezone problems. As a side node, the build-common.xml sets an > environment variable TZ to US/Pacific, but this has no effect in running > tests on Windows. > But the gist of it is this: in row mode the results are consistent on any > time zone. In vector mode the results vary (rows qualify for WHERE clause) > depending on the timezone. -- This message was sent by Atlassian JIRA (v6.1#6144)