[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-04-11 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/9986


Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.
- Introduces a new statup flag (--hdfs_zoneinfo_dir) to impalad to
  specify an HDFS/S3/ADLS location that contains the shared compiled
  IANA time-zone database. If the startup flag is set, impalad will
  use the specified time-zone database. Otherwise, impalad will use
  the default /usr/share/zoneinfo time-zone database.
- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.
- Introduces a new startup flag (--hdfs_zoneabbrev_config) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindCctz.cmake
M common/thrift/ImpalaInternalService.thrift
M common/thrift/metrics.json
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/data/timezoneverification.csv
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
40 files changed, 1,919 insertions(+), 1,045 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/1
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Jeges 


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-04-11 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 1:

(2 comments)

Just skimmed through. Will do several further passes

http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG@34
PS1, Line 34: --hdfs_zoneinfo_dir
Why call it hdfs_zone_dir when it can also refer to S3 and ADLS?
Maybe it could be just --zoneinfo_dir and the given URI would specify the 
storage.


http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG@45
PS1, Line 45: --hdfs_zoneabbrev_config
Same as above.



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 11 Apr 2018 16:23:39 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-04-12 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 1:

(25 comments)

http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG@34
PS1, Line 34: statup
typo: startup


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc
File be/src/benchmarks/convert-timestamp-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@134
PS1, Line 134: val
I don't know what RAND_MAX is here, but I think that it can be 32K, which would 
mean that the time part would move very slowly from the starting time. Is this 
by design?

In order to have "really random" time, I would use cpp11 random classes, or put 
together the time from several rand() calls, each with a modulo smaller than 
32k.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@166
PS1, Line 166:  d
I am a bit concerned about writing to the same buffer from every thread - maybe 
it does not hurt performance, but it is not how these functions are "normally" 
used.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@709
PS1, Line 709: m1 = 
measure_multithreaded_elapsed_time(glibc_test_utc_to_unix, 
num_of_threads,BATCH_SIZE,
nit: long line


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@711
PS1, Line 711: m2 = 
measure_multithreaded_elapsed_time(cctz_test_utc_to_unix, num_of_threads, 
BATCH_SIZE,
nit: long line


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@725
PS1, Line 725: m1 = measure_multithreaded_elapsed_time(boost_test_from_utc, 
num_of_threads, BATCH_SIZE,
nit: long line


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@727
PS1, Line 727: m2 = measure_multithreaded_elapsed_time(cctz_test_from_utc, 
num_of_threads, BATCH_SIZE,
nit: long line


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@743
PS1, Line 743: m2 = 
measure_multithreaded_elapsed_time(cctz_test_utc_to_local, num_of_threads, 
BATCH_SIZE,
nit: long line


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions-ir.cc
File be/src/exprs/timestamp-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions-ir.cc@523
PS1, Line 523: /
nit: missing spaces


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@63
PS1, Line 63:   if (timezone == nullptr) {
This could be UNLIKELY.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@90
PS1, Line 90: t.fractional_seconds()
You have explained in person that 't' is used instead of 'to_cs'for 
sub-seconds, because cctz't time type does not support nano seconds, and no 
timezone rule affects sub-seconds. Could you add a comment about this?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@105
PS1, Line 105:   if (timezone == nullptr) {
This could be UNLIKELY.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@142
PS1, Line 142: t.fractional_seconds()
Same as in line 90.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@28
PS1, Line 28: /// Functions to load and access the time-zone database.
Please add some comments about thread-safety (e.g. "Initialize() should be 
called at startup to load every timezone rule to memory. After it returned 
without error, other functions can be safely called from multiple threads.").


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@160
PS1, Line 160: bool IsSymbolicLink(const string& path, string* real_path) {
Maybe this could be moved to class FileSystemUtil.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@412
PS1, Line 412:   char buffer[64*1024];
I am a bit concerned about this - is it ok to keep buffers of this size on 
stack in Impala?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@438
PS1, Line 438: // Load custom time-zone abbreviations from 'is' and add them to 
'tz_name_map_'.
In most of Impala, the comments for private functions are in the .h file and 
start with "///". Can you move them to the header for the sake of consistency?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@439
PS1

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-04-13 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 1:

(7 comments)

Spent only a limited amount of time on this review. Will continue next week.

http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG@7
PS1, Line 7: time-zone
Fun fact:
I was curious whether "timezone", "time zone" or "time-zone" is correct. 
Apparently all of them are, however I read that "time-zone" is a bit outdated.  
"time zone" is widely used and "timezone" is only in the US.
Can someone native confirm? :)


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc
File be/src/benchmarks/convert-timestamp-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@1
PS1, Line 1: #include 
Missing Apache header


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@38
PS1, Line 38: UtcToUnixTime: Function  iters/ms   10%ile   50%ile   
90%ile 10%ile 50%ile 90%ile
Do you think we should force the 90col limit on the following comment as well?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@136
PS1, Line 136: ss << to_simple_string(start);
to_simple_string() return a string already, no need to use a stringstream for 
this purpose.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/decimal-operators.h
File be/src/exprs/decimal-operators.h:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/decimal-operators.h@168
PS1, Line 168:   /// instead of truncating if 'round' is true.
Should we mention the new parameter in the comment?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/runtime/runtime-state.h
File be/src/runtime/runtime-state.h:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/runtime/runtime-state.h@321
PS1, Line 321: global local
I see the intent but "global local timezone" sounds strange :)


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/runtime/runtime-state.h@321
PS1, Line 321: -
nit: typo



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 13 Apr 2018 12:20:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-04-18 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 1:

(31 comments)

http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG@34
PS1, Line 34: statup
> typo: startup
Done


http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG@34
PS1, Line 34: --hdfs_zoneinfo_dir
> Why call it hdfs_zone_dir when it can also refer to S3 and ADLS?
The name reflects that this flag should be used to specify a location in a 
"shared" filesystem that can be accessed through the HDFS API.

I think, naming the flag "--zoneinfo_dir" might be misleading as it should not 
be used to specify a directory in a local filesystem.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc
File be/src/benchmarks/convert-timestamp-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@1
PS1, Line 1: #include 
> Missing Apache header
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@38
PS1, Line 38: UtcToUnixTime: Function  iters/ms   10%ile   50%ile   
90%ile 10%ile 50%ile 90%ile
> Do you think we should force the 90col limit on the following comment as we
Probably it's better to leave it like this. Some of the other benchmark 
programs have longer lines as well (e.g. 
./be/src/benchmarks/int-hash-benchmark.cc)


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@134
PS1, Line 134: val
> I don't know what RAND_MAX is here, but I think that it can be 32K, which w
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@136
PS1, Line 136: ss << to_simple_string(start);
> to_simple_string() return a string already, no need to use a stringstream f
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@166
PS1, Line 166:  d
> I am a bit concerned about writing to the same buffer from every thread - m
Fixed it. It required some major refactoring too.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@709
PS1, Line 709: m1 = 
measure_multithreaded_elapsed_time(glibc_test_utc_to_unix, 
num_of_threads,BATCH_SIZE,
> nit: long line
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@711
PS1, Line 711: m2 = 
measure_multithreaded_elapsed_time(cctz_test_utc_to_unix, num_of_threads, 
BATCH_SIZE,
> nit: long line
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@725
PS1, Line 725: m1 = measure_multithreaded_elapsed_time(boost_test_from_utc, 
num_of_threads, BATCH_SIZE,
> nit: long line
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@727
PS1, Line 727: m2 = measure_multithreaded_elapsed_time(cctz_test_from_utc, 
num_of_threads, BATCH_SIZE,
> nit: long line
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@743
PS1, Line 743: m2 = 
measure_multithreaded_elapsed_time(cctz_test_utc_to_local, num_of_threads, 
BATCH_SIZE,
> nit: long line
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/decimal-operators.h
File be/src/exprs/decimal-operators.h:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/decimal-operators.h@168
PS1, Line 168:   /// instead of truncating if 'round' is true.
> Should we mention the new parameter in the comment?
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions-ir.cc
File be/src/exprs/timestamp-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions-ir.cc@523
PS1, Line 523: /
> nit: missing spaces
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@90
PS1, Line 90: t.fractional_seconds()
> You have explained in person that 't' is used instead of 'to_cs'for sub-sec
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@105
PS1, Line 105:   if (timezone == nullptr) {
> This could be UNLIKELY.
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@142
PS1, Line 142: t.fractional_seconds()
> Same as in line 90.
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@28
PS1, Line 28: /// Functions to load and access the time-zone database.
> Please add some comments about thread-safety (e.g. "Initialize() should be
Done


http:/

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-04-18 Thread Attila Jeges (Code Review)
Hello Gabor Kaszab, Zoltan Borok-Nagy, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9986

to look at the new patch set (#2).

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.
- Introduces a new startup flag (--hdfs_zoneinfo_dir) to impalad to
  specify an HDFS/S3/ADLS location that contains the shared compiled
  IANA time-zone database. If the startup flag is set, impalad will
  use the specified time-zone database. Otherwise, impalad will use
  the default /usr/share/zoneinfo time-zone database.
- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.
- Introduces a new startup flag (--hdfs_zoneabbrev_config) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindCctz.cmake
M common/thrift/ImpalaInternalService.thrift
M common/thrift/metrics.json
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/data/timezoneverification.csv
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
43 files changed, 1,972 insertions(+), 1,051 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/2
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-04-18 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 2:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc
File be/src/benchmarks/convert-timestamp-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@172
PS2, Line 172: TestData
This looks like a fairly general class to me that could move to 
util/benchmark.h or a similar file. We can create a follow up ticket if you 
don't want to deal with this now.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@444
PS2, Line 444: time_t utc =
We could replace this with boost_utc_to_unix_time. This conversion should be 
fast, as we want to measure the speed of localtime_r, not the "glue" code.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@530
PS2, Line 530: //
 : // Test UnixTimeToUtcPtime (boost is expected to be faster than 
CCTZ)
 : //
 :
 : // boost
 : boost::pos
I think that this a bit misleading, as boost_unix_time_to_utc_ptime never uses 
its slow branch with this test data. We could measure gmtime_r separately, and 
add some comments that tells which functions were used in Impala before/after 
the change.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/decimal-operators.h
File be/src/exprs/decimal-operators.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/decimal-operators.h@168
PS2, Line 168:   /// local time in 'local_tz' time-zone). Rounds instead of 
truncating if 'round' is true.
nit: long line


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timestamp-functions.cc@71
PS2, Line 71:   const boost::gregorian::date& d = ts_value.date();
:   const boost::posix_time::time_duration& t = ts_value.time();
:   const cctz::civil_second from_cs(d.year(), d.month(), d.day(), 
t.hours(), t.minutes(),
:   t.seconds());
:
:   auto from_tp = cctz::convert(from_cs, 
TimezoneDatabase::GetUtcTimezone());
Does cctz offer a function to create timepoint from unix time_t? If yes, then 
it should be faster to convert ts_val to a time_t, and convert that to from_tp.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@43
PS2, Line 43: DEFINE_string(hdfs_zoneinfo_dir, "",
: "HDFS/S3A/ADLS path to load IANA time-zone database from.");
: DEFINE_string(hdfs_zoneabbrev_config, "",
: "HDFS/S3A/ADLS path to config file defining non-standard 
time-zone abbreviations.");
It will be a bit tricky, but these should be tested somehow. Playing with these 
configurations could go to a custom cluster test. These tests can be probably 
created at a later stage of the review, after the architecture have benn 
accepted.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@419
PS2, Line 419: Status TimezoneDatabase::LoadZoneAbbreviations(istream &is,
At least basic testing should be added to check that fix and non-fix 
abbreviations are parsed correctly, but it would be the best to also create 
tests for error messages.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.cc@135
PS2, Line 135:   auto from_tp = FromUnixSeconds(unix_time);
 :   auto to_cs = cctz::convert(from_tp, *local_tz
This would be a big change, but I would think about "auto" types - do they make 
the code more readable, or not? I think that if type name is long and the 
reader more-or-less knows what kind of type to expect, then "auto" is very 
nice. But if the reader does not know context/library well, then "auto" makes 
it harder to look-up the classes. If you want to keep them, then I would prefer 
longer variable names, like spelling out civil_seconds.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.inline.h@98
PS2, Line 98:   if (UNLIKELY(!HasDateAndTime())) return false;
:   cctz::civil_second cs(date_.year(), date_.month(), date_.day(), 
time_.hours(),
:   time_.minutes(), time_.seconds());
:   auto tp = cctz::convert(cs,
:   FLAGS_use_local_tz_for_unix_timestamp_conversions ? 
*local_tz :
:   Timezone

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-04-19 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 2:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc
File be/src/benchmarks/convert-timestamp-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@150
PS2, Line 150: void AddTestDataDateTimes(vector& data, int n, 
const string& startstr) {
Since it's in a benchmark it doesn't really matter, anyway, some nit comments:
- output parameters should be listed last and they should be pointers
- in this particular case I think it would be better to return 
vector


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@174
PS2, Line 174: > >
Nit: since C++11 you don't need to put spaces between right angle brackets.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@228
PS2, Line 228: const shared_ptr > data_
Nit: I think using 'const vector&' would be simpler. I don't really see 
why we need shared_ptr here.

Also, in 'const shared_ptr' only the pointer is const, the pointed object is 
not.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@319
PS2, Line 319: boost_throw_if_date_out_of_range(local_time.date());
Is this function call really needed here? Don't we trust boost that it 
validates the date correctly?


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@608
PS2, Line 608:   if (cctz_utc_to_unix_data.get_result() != 
glibc_utc_to_unix_data.get_result()) {
 : cerr << "cctz/glibc utc_to_unix results do not match!" << 
endl;
 : return 1;
 :   }
 :   if (boost_utc_to_unix_data.get_result() != 
glibc_utc_to_unix_data.get_result()) {
 : cerr << "boost/glibc utc_to_unix results do not match!" << 
endl;
 : return 1;
 :   }
The other benchmarks don't need this validity check?

It could be implemented in a helper function that takes a vector of TestData.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timestamp-functions.cc@122
PS2, Line 122: or
Nit: of


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h@21
PS2, Line 21: boost
I think we should use the unordered_map class from the C++ STL.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h@56
PS2, Line 56: TZ_MAP
TimezoneMap? I don't think a typename should be all capitals.



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 19 Apr 2018 14:20:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-04-20 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 2:

(29 comments)

Dumping another batch of comments :)

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exec/data-source-scan-node.h
File be/src/exec/data-source-scan-node.h:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exec/data-source-scan-node.h@100
PS1, Line 100:   Status MaterializeNextRow(RuntimeState* state, MemPool* 
mem_pool, Tuple* tuple);
What do you think about passing cctz::time_zone* instead of RuntimeState*?

Can you add a comment for the  new parameter and it's purpose?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/expr-test.cc@6398
PS1, Line 6398: const char* local_tz_name = "PST8PDT";
  : ScopedTimeZoneOverride time_zone(local_tz_name);
  : const cctz::time_zone* local_tz = 
TimezoneDatabase::FindTimezone(local_tz_name);
  : DCHECK(local_tz != nullptr);
Have you considered moving this to a function or macro or such? As I see you do 
this same thing 3 times.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@a197
PS1, Line 197:
Nice that we can get rid of this magic :)


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@73
PS1, Line 73: from_cs
Might be just me but I don't really find the names from_cs, to_cs and from_tp 
too self-descriptive.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@76
PS1, Line 76: auto from_tp = cctz::convert(from_cs, 
TimezoneDatabase::GetUtcTimezone());
:   auto to_cs = cctz::convert(from_tp, *timezone);
I think it would worth writing a comment why the two cctz::convert() calls are 
needed.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@79
PS1, Line 79:   // Check if resulting timestamp is within range
In my opinion this comment doesn't add extra value as the name of the function 
states the same.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@113
PS1, Line 113:   context->AddWarning(ss.str().c_str());
 : return ts_val;
 :   }
 :
this seems duplicate code (TimestampFunctions::FromUtc)


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@140
PS1, Line 140:   context->AddWarning(msg.c_str());
 : return TimestampVal::null();
 :   }
 :
 :   // Create 'return_date' and 'return_time' from 'to_cs'.
I think this conversion could go to a function as it seems duplicate for me 
with the same in TimestampFunctions::FromUtc.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@21
PS1, Line 21: #include 
shouldn't we use the one from std?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@35
PS1, Line 35:
string& GetPath() ?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@47
PS1, Line 47:   }
FindTimezone() returns pointer while GetUtcTimezone() returns reference to a 
timezone. Can these be unified?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@54
PS1, Line 54:   static const cctz::time_zone UTC_TIMEZONE_;
Could you add a comment what the string param is used for?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@62
PS1, Line 62:   /// location.
As I see you wrote comments for these function in the .cc file. Could you move 
them to the header?


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@70
PS1, Line 70: db.
zone_abbrev


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@10
PS1, Line 10: //
Unrelated to your change but shouldn't we replace this to the Apache header 
other files use?
Same goes for the header.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@103
PS1, Line 103: // Returns 'true' if path 'a' starts with path 'b'. If 
'relative' is not nullptr, it will
FileSystemUtil would be a better place for this, I think.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@105
PS1, Line 105: PathStartsWith
I have the feeling that this function should only decide if path 'b' is the 
prefix of path 'a'. Returning the remainder after 'b' is something that I think 
should be out of this functions scope.
I think this mix of responsibilities is the reason the name of

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-04-20 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h@31
PS2, Line 31: class TimezoneDatabase {
My general feeling about this class is that a bit more transparency would add 
great value. What I mean is that some additional class level comments would 
really help e.g. what is the exact format of the inputs to this class, what 
sources it can have, in nutshell what transformation is done on that data, and 
how it uses the processed data later on.

What do you think?


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@420
PS2, Line 420: /* = nullptr */
drop this comment


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@495
PS2, Line 495: ZONEINFO_DIR
nit: ZONE_INFO_DIR?


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.h@101
PS2, Line 101:   static TimestampValue FromUnixTime(time_t unix_time, const 
cctz::time_zone* local_tz) {
Do you think that mentioning the new param for these functions would add extra 
value?



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 20 Apr 2018 12:00:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-04-20 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/filesystem-util.h
File be/src/util/filesystem-util.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/filesystem-util.h@57
PS2, Line 57:   static Status GetRealPath(
Shouldn't you mentioned that this should be called on sym links? (or do I 
misunderstand something?)
I think that what a "real path" is should need some further explanation.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/filesystem-util.h@60
PS2, Line 60: Is it is
nit: if it is


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/hdfs-util.h
File be/src/util/hdfs-util.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/hdfs-util.h@59
PS2, Line 59: /// Returns basename of 'path'.
Could you add an example to the comment?


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.h
File be/src/util/time.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.h@123
PS2, Line 123: /// Converts input microseconds-since-epoch to date-time string 
in 'tz' time zone.
Could you mention 'p' as well?


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.cc
File be/src/util/time.cc:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.cc@167
PS2, Line 167:   const char* fmt = (p == TimePrecision::Millisecond) ? 
fmt_millisec :
For me this 3 layers of embedded ternary operators isn't that readable. What 
about a switch?



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 20 Apr 2018 14:28:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-03 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 2:

(55 comments)

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc
File be/src/benchmarks/convert-timestamp-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@150
PS2, Line 150: void AddTestDataDateTimes(vector& data, int n, 
const string& startstr) {
> Since it's in a benchmark it doesn't really matter, anyway, some nit commen
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@172
PS2, Line 172: TestData
> This looks like a fairly general class to me that could move to util/benchm
'measure_multithreaded_elapsed_time' function is not that general, it is used 
here as a quick and dirty way to verify that glibc calls are executed in a 
serial fashion even in a multithreaded environment.  Because of that I would 
like to keep this class here.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@174
PS2, Line 174: > >
> Nit: since C++11 you don't need to put spaces between right angle brackets.
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@228
PS2, Line 228: const shared_ptr > data_
> Nit: I think using 'const vector&' would be simpler. I don't really s
True, I rewrote this portion of the code so many times that I lost track of 
where I was going with it :)


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@319
PS2, Line 319: boost_throw_if_date_out_of_range(local_time.date());
> Is this function call really needed here? Don't we trust boost that it vali
This is how the function was implemented originally. Since the point of this 
benchmark program is to compare the old implementation with the new one, I 
figured I shouldn't change the old code.

I think the call is necessary, because boost might not validate the date range 
until you call the gregorian::date accessors. It is confusing.


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@444
PS2, Line 444: time_t utc =
> We could replace this with boost_utc_to_unix_time. This conversion should b
This is how UtcToLocal was implemented originally. The goal of this benchmark 
program is to compare the original implementation with the new one (including 
the glue code).


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@530
PS2, Line 530: //
 : // Test UnixTimeToUtcPtime (boost is expected to be faster than 
CCTZ)
 : //
 :
 : // boost
 : boost::pos
> I think that this a bit misleading, as boost_unix_time_to_utc_ptime never u
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@608
PS2, Line 608:   if (cctz_utc_to_unix_data.get_result() != 
glibc_utc_to_unix_data.get_result()) {
 : cerr << "cctz/glibc utc_to_unix results do not match!" << 
endl;
 : return 1;
 :   }
 :   if (boost_utc_to_unix_data.get_result() != 
glibc_utc_to_unix_data.get_result()) {
 : cerr << "boost/glibc utc_to_unix results do not match!" << 
endl;
 : return 1;
 :   }
> The other benchmarks don't need this validity check?
Done (although passing a vector of TestData to the helper function was not 
feasible as the different TestData classes are instantiated with a different 
converter functions, therefore they are not "compatible")


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exec/data-source-scan-node.h
File be/src/exec/data-source-scan-node.h:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exec/data-source-scan-node.h@100
PS1, Line 100:   Status MaterializeNextRow(RuntimeState* state, MemPool* 
mem_pool, Tuple* tuple);
> What do you think about passing cctz::time_zone* instead of RuntimeState*?
Done


http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/decimal-operators.h
File be/src/exprs/decimal-operators.h:

http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/decimal-operators.h@168
PS2, Line 168:   /// local time in 'local_tz' time-zone). Rounds instead of 
truncating if 'round' is true.
> nit: long line
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/expr-test.cc@6398
PS1, Line 6398: const char* local_tz_name = "PST8PDT";
  : ScopedTimeZoneOverride time_zone(local_tz_name);
  : const cctz::time_zone* local_tz = 
TimezoneDatabase::FindTimezone(local_tz_name);
  : DCHECK(local_tz != nullptr);
> Have you c

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-03 Thread Attila Jeges (Code Review)
Hello Gabor Kaszab, Zoltan Borok-Nagy, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9986

to look at the new patch set (#3).

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to
  specify an HDFS/S3/ADLS location that contains the shared compiled
  IANA time-zone database. If the startup flag is set, impalad will
  use the specified time-zone database. Otherwise, impalad will use
  the default /usr/share/zoneinfo time-zone database.
- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.
- Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindCctz.cmake
M common/thrift/ImpalaInternalService.thrift
M common/thrift/metrics.json
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/abbrev.conf
A testdata/tzdb/zoneinfo/AmerICA/ArgeNTINA/MendOZA
A testdata/tzdb/zoneinfo/AmerICA/CancUN
A testdata/tzdb/zoneinfo/UTC
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
A tests/custom_cluster/custom_tzdb.py
53 files changed, 2,542 insertions(+), 1,093 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/3
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 3
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-04 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 3:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/exprs/timestamp-functions.cc@95
PS3, Line 95:   time_t unix_time;
:   if (UNLIKELY(!ts_value.UtcToUnixTime(&unix_time))) return 
TimestampVal::null();
:   cctz::time_point from_tp = 
UnixSecondsToTimePoint(unix_time);
:
:   // Convert 'from_tp' time_point to civil_second assuming 
'timezone' time-zone.
:   cctz::civil_second to_cs = cctz::convert(from_tp, *timezone);
:
:   if (UNLIKELY(CheckIfDateOutOfRange(cctz::civil_day(to_cs {
: const string& msg = Substitute(
: "Timestamp '$0' did not convert to a valid local time in 
timezone '$1'",
: ts_value.ToString(), tz_string_value.DebugString());
: context->AddWarning(msg.c_str());
: return TimestampVal::null();
:   }
:
:   // Note that 'to_cs' has second granularity. Since time-zone 
rules do not affect
:   // fractional seconds, the fractional second part of the 
returned TimestampVal should be
:   // equal to ts_value.time().fractional_seconds().
:   return CivilSecondToTimestampVal(to_cs, 
ts_value.time().fractional_seconds());
This logic is the same as TimestampValue::UtcToLocal(), plus warning if the 
resulting timestamp is not valid. As local time and generic timezone 
conversions are the same now, it would make sense to keep them at one place.


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/exprs/timestamp-functions.cc@144
PS3, Line 144:   cctz::time_point from_tp = from_cl.pre;
 :
 :   // Convert 'from_tp' time_point to civil_second assuming 'UTC' 
time-zone.
 :   cctz::civil_second to_cs = cctz::convert(from_tp, 
TimezoneDatabase::GetUtcTimezone());
 :
 :   if (UNLIKELY(CheckIfDateOutOfRange(cctz::civil_day(to_cs {
 : const string& msg =
 : Substitute("Timestamp '$0' in timezone '$1' could not be 
converted to UTC",
 : ts_value.ToString(), tz_string_value.DebugString());
 : context->AddWarning(msg.c_str());
 : return TimestampVal::null();
 :   }
 :
 :   // Note that 'to_cs' has second granularity. Since time-zone 
rules do not affect
 :   // fractional seconds, the fractional second part of the 
returned TimestampVal should be
 :   // equal to ts_value.time().fractional_seconds().
 :   return CivilSecondToTimestampVal(to_cs, 
ts_value.time().fractional_seconds());
Similarly to my comment at line 113, this logic could be moved to a 
TimestampValue function. Removing these calculations from this file would mean 
that the helper functions (line 47-74) could be removed too.


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/runtime/timestamp-value.cc@117
PS3, Line 117:   time_ = boost::posix_time::time_duration(to_cs.hour(), 
to_cs.minute(), to_cs.second(),
nit: long line


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h
File be/src/util/filesystem-util.h:

http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@66
PS3, Line 66: iff
typo: if


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@70
PS3, Line 70: iff
typo: if


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@71
PS3, Line 71: path
Maybe writing "string" instead of "path" express better that no file system is 
accessed.


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@79
PS3, Line 79: path
Same as line 71.


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc
File be/src/util/time.cc:

http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc@165
PS3, Line 165:
nit: extra space


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc@168
PS3, Line 168:
nit: extra space


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc@171
PS3, Line 171: 
nit: extra space



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 3
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-08 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 3:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/exprs/timestamp-functions.cc@95
PS3, Line 95:   time_t unix_time;
:   if (UNLIKELY(!ts_value.UtcToUnixTime(&unix_time))) return 
TimestampVal::null();
:   cctz::time_point from_tp = 
UnixSecondsToTimePoint(unix_time);
:
:   // Convert 'from_tp' time_point to civil_second assuming 
'timezone' time-zone.
:   cctz::civil_second to_cs = cctz::convert(from_tp, *timezone);
:
:   if (UNLIKELY(CheckIfDateOutOfRange(cctz::civil_day(to_cs {
: const string& msg = Substitute(
: "Timestamp '$0' did not convert to a valid local time in 
timezone '$1'",
: ts_value.ToString(), tz_string_value.DebugString());
: context->AddWarning(msg.c_str());
: return TimestampVal::null();
:   }
:
:   // Note that 'to_cs' has second granularity. Since time-zone 
rules do not affect
:   // fractional seconds, the fractional second part of the 
returned TimestampVal should be
:   // equal to ts_value.time().fractional_seconds().
:   return CivilSecondToTimestampVal(to_cs, 
ts_value.time().fractional_seconds());
> This logic is the same as TimestampValue::UtcToLocal(), plus warning if the
Done


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/exprs/timestamp-functions.cc@144
PS3, Line 144:   cctz::time_point from_tp = from_cl.pre;
 :
 :   // Convert 'from_tp' time_point to civil_second assuming 'UTC' 
time-zone.
 :   cctz::civil_second to_cs = cctz::convert(from_tp, 
TimezoneDatabase::GetUtcTimezone());
 :
 :   if (UNLIKELY(CheckIfDateOutOfRange(cctz::civil_day(to_cs {
 : const string& msg =
 : Substitute("Timestamp '$0' in timezone '$1' could not be 
converted to UTC",
 : ts_value.ToString(), tz_string_value.DebugString());
 : context->AddWarning(msg.c_str());
 : return TimestampVal::null();
 :   }
 :
 :   // Note that 'to_cs' has second granularity. Since time-zone 
rules do not affect
 :   // fractional seconds, the fractional second part of the 
returned TimestampVal should be
 :   // equal to ts_value.time().fractional_seconds().
 :   return CivilSecondToTimestampVal(to_cs, 
ts_value.time().fractional_seconds());
> Similarly to my comment at line 113, this logic could be moved to a Timesta
Done


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/runtime/timestamp-value.cc@117
PS3, Line 117:   time_ = boost::posix_time::time_duration(to_cs.hour(), 
to_cs.minute(), to_cs.second(),
> nit: long line
Done


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h
File be/src/util/filesystem-util.h:

http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@66
PS3, Line 66: iff
> typo: if
"iff" stands for "if and only if".


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@70
PS3, Line 70: iff
> typo: if
Same as above


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@71
PS3, Line 71: path
> Maybe writing "string" instead of "path" express better that no file system
Done


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@79
PS3, Line 79: path
> Same as line 71.
Done


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc
File be/src/util/time.cc:

http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc@165
PS3, Line 165:
> nit: extra space
Done


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc@168
PS3, Line 168:
> nit: extra space
Done


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc@171
PS3, Line 171:
> nit: extra space
Done



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 3
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 08 May 2018 19:35:52 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-08 Thread Attila Jeges (Code Review)
Hello Gabor Kaszab, Zoltan Borok-Nagy, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9986

to look at the new patch set (#4).

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to
  specify an HDFS/S3/ADLS location that contains the shared compiled
  IANA time-zone database. If the startup flag is set, impalad will
  use the specified time-zone database. Otherwise, impalad will use
  the default /usr/share/zoneinfo time-zone database.
- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.
- Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindCctz.cmake
M common/thrift/ImpalaInternalService.thrift
M common/thrift/metrics.json
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/abbrev.conf
A testdata/tzdb/zoneinfo/AmerICA/ArgeNTINA/MendOZA
A testdata/tzdb/zoneinfo/AmerICA/CancUN
A testdata/tzdb/zoneinfo/UTC
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
A tests/custom_cluster/custom_tzdb.py
53 files changed, 2,531 insertions(+), 1,096 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/4
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 4
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-09 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 4: Code-Review+1

(6 comments)

Few nits otherwise looks good to me. The LocalToUtc performance part is 
optional - as it does not affect other parts of the code, it can be easily done 
later when general structure is already accepted by other reviewers.

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc
File be/src/exprs/timestamp-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc@526
PS4, Line 526:   const string& tz_name = (start_lookup.abbr != nullptr) ? 
start_lookup.abbr :
 :   context->impl()->state()->local_time_zone()->name();
What is the goal of this logic? To print timezone abbreviations instead of the 
full names, or to distinguish between summer/winter time, or both? A comment 
would be nice, and maybe the logic could be moved to a TimestampValue member 
function like string GetTimezoneName(Timezone*).


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@93
PS4, Line 93: inline bool CheckIfDateOutOfRange(const cctz::civil_day& date) {
:   static const cctz::civil_day 
max_date(TimestampFunctions::MAX_YEAR, 12, 31);
:   static const cctz::civil_day 
min_date(TimestampFunctions::MIN_YEAR, 1, 1);
:   return date < min_date || date > max_date;
: }
This could be simpler and possibly faster by expecting a cctz::civil_second 
argument and check if 1400<=year<1.


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@128
PS4, Line 128:
cctz explains pretty well the handling of dst boundaries, maybe we could add a 
link to it, for example to 
https://github.com/google/cctz/blob/a2dd3d0fbc811fe0a1d4d2dbb0341f1a3d28cb2a/include/cctz/time_zone.h#L147


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@129
PS4, Line 129:   // In case of ambiguity invalidate TimestampValue.
 :   const cctz::time_zone::civil_lookup from_cl = 
local_tz->lookup(from_cs);
 :   if (UNLIKELY(from_cl.kind != 
cctz::time_zone::civil_lookup::UNIQUE)) {
 : SetToInvalidDateTime();
 :   } else {
 : cctz::time_point from_tp = from_cl.pre;
 :
 : // Convert 'from_tp' time_point to civil_second assuming 
'UTC' time-zone.
 : cctz::civil_second to_cs = cctz::convert(from_tp, 
TimezoneDatabase::GetUtcTimezone());
 :
 : // boost::gregorian::date() throws 
boost::gregorian::bad_year if year is not in the
 : // 1400.. range. Need to check validity before creating 
the date object.
 : if (UNLIKELY(CheckIfDateOutOfRange(cctz::civil_day(to_cs 
{
I may be possible to get TimestampValue from cctz::time_zone::civil_lookup in a 
faster way - splitting from_tp to a day part (since a constant date) and the 
remainder seconds part is enough for us and should be faster then getting 
cctz::civil_second (which contains year/month/day split).

The code could look something like this:
int64 secs_since_base = from_tp - BASETIME_AS_CCTZ_SYS_SEC;
time_=sec_since_base%(24*60*60)+time_.fractional_seconds();
int32 days_since_base = sec_since_base/(24*60*60);
if(out_of_range(days_since_base)) SetToInvalidDateTime();
date_ = days_since_base - BASEDATE_AS_BOOST_GREG_DATE;


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@146
PS4, Line 146:   time_ = boost::posix_time::time_duration(to_cs.hour(), 
to_cs.minute(), to_cs.second(),
nit: long line


http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h
File be/src/util/filesystem-util.h:

http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@66
PS3, Line 66: iff
> "iff" stands for "if and only if".
Thanks for the explanation!



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 4
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 09 May 2018 11:47:13 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-10 Thread Attila Jeges (Code Review)
Hello Gabor Kaszab, Zoltan Borok-Nagy, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9986

to look at the new patch set (#5).

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to
  specify an HDFS/S3/ADLS location that contains the shared compiled
  IANA time-zone database. If the startup flag is set, impalad will
  use the specified time-zone database. Otherwise, impalad will use
  the default /usr/share/zoneinfo time-zone database.
- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.
- Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindCctz.cmake
M common/thrift/ImpalaInternalService.thrift
M common/thrift/metrics.json
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/abbrev.conf
A testdata/tzdb/zoneinfo/AmerICA/ArgeNTINA/MendOZA
A testdata/tzdb/zoneinfo/AmerICA/CancUN
A testdata/tzdb/zoneinfo/UTC
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
A tests/custom_cluster/test_custom_tzdb.py
53 files changed, 2,556 insertions(+), 1,096 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/5
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-10 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc
File be/src/exprs/timestamp-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc@526
PS4, Line 526:   const string& tz_name = (start_lookup.abbr != nullptr) ? 
start_lookup.abbr :
 :   context->impl()->state()->local_time_zone()->name();
> What is the goal of this logic? To print timezone abbreviations instead of
Both, I guess. I just replicated the original behavior: localtime_r() sets 
tzone.tm_zone to the time-zone abbreviation. Added a comment.

I don't think that the TimestampValue class would be a good place for this 
function. The timezone abbreviation corresponds to 'start_unix_millis', which 
is not a TimestampValue. I'll keep this logic here for now.


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@93
PS4, Line 93: inline bool CheckIfDateOutOfRange(const cctz::civil_day& date) {
:   static const cctz::civil_day 
max_date(TimestampFunctions::MAX_YEAR, 12, 31);
:   static const cctz::civil_day 
min_date(TimestampFunctions::MIN_YEAR, 1, 1);
:   return date < min_date || date > max_date;
: }
> This could be simpler and possibly faster by expecting a cctz::civil_second
Done


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@128
PS4, Line 128:
> cctz explains pretty well the handling of dst boundaries, maybe we could ad
Done


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@129
PS4, Line 129:   // In case of ambiguity invalidate TimestampValue.
 :   const cctz::time_zone::civil_lookup from_cl = 
local_tz->lookup(from_cs);
 :   if (UNLIKELY(from_cl.kind != 
cctz::time_zone::civil_lookup::UNIQUE)) {
 : SetToInvalidDateTime();
 :   } else {
 : cctz::time_point from_tp = from_cl.pre;
 :
 : // Convert 'from_tp' time_point to civil_second assuming 
'UTC' time-zone.
 : cctz::civil_second to_cs = cctz::convert(from_tp, 
TimezoneDatabase::GetUtcTimezone());
 :
 : // boost::gregorian::date() throws 
boost::gregorian::bad_year if year is not in the
 : // 1400.. range. Need to check validity before creating 
the date object.
 : if (UNLIKELY(CheckIfDateOutOfRange(cctz::civil_day(to_cs 
{
> I may be possible to get TimestampValue from cctz::time_zone::civil_lookup
Done


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@146
PS4, Line 146:   time_ = boost::posix_time::time_duration(to_cs.hour(), 
to_cs.minute(), to_cs.second(),
> nit: long line
Done



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 4
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 10 May 2018 15:28:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-10 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 5:

(19 comments)

Thanks for dealing with my previous comments, Attila! I filed some more but 
basically I'm fine with the changes and feel free to start involving someone 
with more experience on this topic.

http://gerrit.cloudera.org:8080/#/c/9986/4/CMakeLists.txt
File CMakeLists.txt:

http://gerrit.cloudera.org:8080/#/c/9986/4/CMakeLists.txt@281
PS4, Line 281: Cctz
nit: CCTZ


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@139
PS4, Line 139: new_time_zone_(time_zone), new_tz_
>From reading the names of these 2 variables it's not clear what de difference 
>is. Can you have a new_time_zone_ and a new_time_zone_name_ or such?


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@140
PS4, Line 140: /*overwrite*/
Do you need this comment?


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@153
PS4, Line 153: Timezone *
nit: Timezone* Expect..()


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@155
PS4, Line 155: new_tz_ = TimezoneDatabase::FindTimezone(new_time_zone_);
I wonder if it makes sense to do this assignment in the constructor and then 
this function can be changed to something like "getTimezone" that simply 
returns the corresponding member.


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc
File be/src/exprs/timestamp-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc@503
PS4, Line 503: namespace
Why did you need this namespace?


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc@504
PS4, Line 504: nline
What is the plan to get rid of the "Duplicate code" TODOs in this review?


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions.cc@55
PS4, Line 55: // This should raise some sort of error or at least return 
null. Hive just ignores it.
Shouldn't this be a TODO?


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions.cc@87
PS4, Line 87: // This should raise some sort of error or at least return 
null. Hive just ignores it.
Same as above


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc
File be/src/exprs/timezone_db-test.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc@57
PS4, Line 57: TzAbbev
nit: TzAbbrev?


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc@68
PS4, Line 68:   // Abbreviations must start with an uppercase letter.
If it has to start with an uppercase letter, can we add a test this with an 
input that would be valid if the first letter was uppercase?
e.g. "pST", "pst", "singapore" etc.


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc@105
PS4, Line 105:   // Misformatted time-zone names.
Can you again play around with upper vs lower case letters here?


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h@1
PS4, Line 1: // with the License.  You may obtain a copy of the License at
Hmm, is the top of the Apache comment missing?


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h@23
PS4, Line 23: /// 'TimezoneDatabase' class contains functions to load and 
access the IANA time-zone
Nice description, thx :)


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h@91
PS4, Line 91: tz_seg
nit: might be just my preference but tz_segment is still short and I think 
better to read. Up to you.


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@147
PS1, Line 147:
> Returning a pointer just to be able to signal failure with nullptr seems co
What I meant is that you could get rid of the offset_sec paramater if you 
changed the return value to int64_t* and you would still be able to detect 
failure if the return value is null. Not a big deal, though. Your choice.


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.cc@365
PS4, Line 365:   hdfsFile hdfs_file = hdfsOpenFile(
Just for my information: LoadZoneInfoFromHdfs() copies the zone info file to a 
local dir, meanwhile this function uses hdfsOpenFile to get the abbrev data. Is 
there a reason that they don't follow the same approac

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-11 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 4:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/9986/4/CMakeLists.txt
File CMakeLists.txt:

http://gerrit.cloudera.org:8080/#/c/9986/4/CMakeLists.txt@281
PS4, Line 281: Cctz
> nit: CCTZ
Done


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@139
PS4, Line 139: new_time_zone_(time_zone), new_tz_
> From reading the names of these 2 variables it's not clear what de differen
Done


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@140
PS4, Line 140: /*overwrite*/
> Do you need this comment?
Removed it


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@153
PS4, Line 153: Timezone *
> nit: Timezone* Expect..()
Done


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@155
PS4, Line 155: new_tz_ = TimezoneDatabase::FindTimezone(new_time_zone_);
> I wonder if it makes sense to do this assignment in the constructor and the
The main reason I implemented the class this way was that some tests that use 
'ScopedTimeZoneOverride' don't need the actual Timezone pointer.

On the other hand, checking always that the timezone name is valid doesn't hurt 
anyone. Done.


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc
File be/src/exprs/timestamp-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc@503
PS4, Line 503: namespace
> Why did you need this namespace?
Putting  stuff into an unnamed namespace is a common C++ idiom: it says that 
everything in the unnamed namespace is "local" to this translation unit. 
They're not visible from the outside, and their names won't clash with names in 
other translation units.

Impala uses unnamed namespaces elsewhere too.


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc@504
PS4, Line 504: / TODO
> What is the plan to get rid of the "Duplicate code" TODOs in this review?
I removed the TODO comment.

I cannot think of a better place for this function. We can create a shared 
class for this function only but that would be awkward.


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions.cc@55
PS4, Line 55: // This should raise some sort of error or at least return 
null. Hive just ignores it.
> Shouldn't this be a TODO?
Fixed the comment.


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions.cc@87
PS4, Line 87: // This should raise some sort of error or at least return 
null. Hive just ignores it.
> Same as above
Fixed the comment.


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc
File be/src/exprs/timezone_db-test.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc@57
PS4, Line 57: TzAbbev
> nit: TzAbbrev?
Done


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc@68
PS4, Line 68:   // Abbreviations must start with an uppercase letter.
> If it has to start with an uppercase letter, can we add a test this with an
Done


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc@105
PS4, Line 105:   // Misformatted time-zone names.
> Can you again play around with upper vs lower case letters here?
Done


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h@1
PS4, Line 1: // with the License.  You may obtain a copy of the License at
> Hmm, is the top of the Apache comment missing?
Not sure what happened there. I probably inadvertently executed some mysterious 
vim command :)


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h@91
PS4, Line 91: tz_seg
> nit: might be just my preference but tz_segment is still short and I think
Done


http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@147
PS1, Line 147:
> What I meant is that you could get rid of the offset_sec paramater if you c
I understand.

The problem is that then we would have to allocate an int64_t in the function 
and return a pointer to it wrapped in unique_ptr to avoid memory 
leaks. Seems more pain then gain.


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.cc@365
PS4, Line 365:   hdfsFile hdfs_file = hdfsOpenFile(
> Just for my information: LoadZoneInf

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-11 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/runtime/timestamp-value.cc@139
PS5, Line 139:   // In case the resulting 'time_point' is ambiguous, we have to 
invalidate
 :   // TimestampValue.
 :   // 'civil_lookup' members and the details of handling 
ambiguity are described at:
 :   // 
https://github.com/google/cctz/blob/a2dd3d0fbc811fe0a1d4d2dbb0341f1a3d28cb2a/
 :   // include/cctz/time_zone.h#L106
 :   if (UNLIKELY(from_cl.kind != 
cctz::time_zone::civil_lookup::UNIQUE)
I have investigated a bit about this:

- there is a Jira that complains about this behavior: 
https://issues.apache.org/jira/browse/IMPALA-3169

- Hive does not work like this, it returns a "valid" timestamp for 
repeated/skipped hours:

select
 to_utc_timestamp(cast("2011-03-13 02:15:00" as timestamp), 
"America/Los_Angeles"),
 to_utc_timestamp(cast("2011-11-06 01:15:00" as timestamp), 
"America/Los_Angeles")
result: 2011-03-13 10:15:00.0   2011-11-06 09:15:00.0

I think that we should do the same, at least  for repeated values. I can 
imagine several valid queries where this would be the correct behavior, for 
example when we filter for a time interval.

So I vote for solving IMPALA-3169 in this patch by choosing pre or post time in 
non UNIQUE cases too. If there are no test cases yet for skipped/repeated 
hours, then we should create some and expect the same results that Hive returns.



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 11 May 2018 14:04:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-11 Thread Attila Jeges (Code Review)
Hello Gabor Kaszab, Zoltan Borok-Nagy, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9986

to look at the new patch set (#6).

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to
  specify an HDFS/S3/ADLS location that contains the shared compiled
  IANA time-zone database. If the startup flag is set, impalad will
  use the specified time-zone database. Otherwise, impalad will use
  the default /usr/share/zoneinfo time-zone database.
- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.
- Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindCctz.cmake
M common/thrift/ImpalaInternalService.thrift
M common/thrift/metrics.json
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/abbrev.conf
A testdata/tzdb/zoneinfo/AmerICA/ArgeNTINA/MendOZA
A testdata/tzdb/zoneinfo/AmerICA/CancUN
A testdata/tzdb/zoneinfo/UTC
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
A tests/custom_cluster/test_custom_tzdb.py
53 files changed, 2,569 insertions(+), 1,092 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/6
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 6
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-11 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/runtime/runtime-state.h
File be/src/runtime/runtime-state.h:

http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/runtime/runtime-state.h@317
PS5, Line 317:   /// Query-global timezone used as local timezone when 
executing the query.
who owns it?  Let's at least say "Not owned."


http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/util/filesystem-util.h
File be/src/util/filesystem-util.h:

http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/util/filesystem-util.h@58
PS5, Line 58: Real
Should we call it GetCanonicalPath()? I assume it's related to 
IsCanonicalPath(), and so would be nice to name them similarly (and group them 
together).  Is it the case that IsCanonicalPath() always returns true for the 
*real_path returned by GetRealPath()?



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 11 May 2018 21:07:49 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-15 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#7). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to
  specify an HDFS/S3/ADLS location that contains the shared compiled
  IANA time-zone database. If the startup flag is set, impalad will
  use the specified time-zone database. Otherwise, impalad will use
  the default /usr/share/zoneinfo time-zone database.
- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.
- Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindCctz.cmake
M common/thrift/ImpalaInternalService.thrift
M common/thrift/metrics.json
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/abbrev.conf
A testdata/tzdb/zoneinfo/AmerICA/ArgeNTINA/MendOZA
A testdata/tzdb/zoneinfo/AmerICA/CancUN
A testdata/tzdb/zoneinfo/UTC
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
A tests/custom_cluster/test_custom_tzdb.py
53 files changed, 2,575 insertions(+), 1,092 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/7
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 7
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-15 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 7:

(2 comments)

I focused mostly on the non-timestamp/timezone/time and test/infra parts. It 
looks fine to me. Would be good to get Gabor's to finish his review to +1 and 
Tim can do the final +2.

Just a heads up regarding exceptions: in the past we've had a lot of issues 
with timestamp boost routines throwing exceptions for out of range values. You 
should make sure you exercise any path that can do that with tests. We 
generally either have to reason about why the boost function can't throw an 
exception (maybe we check the range before hand) or we wrap the boost call with 
try/catch so we don't expose the exception. Also IIRC, something to keep in 
mind is that codegen code can't properly handle try/catch, so in cases we 
needed to use that, we factored the try/catch code into native code and call 
out to it from the IR. Again, just a heads up, not sure if your change 
introduced any problem in this regard or not.

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/timestamp-functions-ir.cc
File be/src/exprs/timestamp-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/timestamp-functions-ir.cc@502
PS7, Line 502:
 : namespace {
 : inline cctz::time_point 
UnixTimeToTimePoint(time_t t) {
 :   return std::chrono::time_point_cast(
 :   std::chrono::system_clock::from_time_t(0)) + 
cctz::sys_seconds(t);
 : }
 :
 : }
 :
 : StringVal TimestampFunctions::TimeOfDay(FunctionContext* 
context) {
 :   const TimestampVal curr = Now(context);
 :   if (curr.is_null) return StringVal::null();
 :   const string& day = ShortDayName(context, curr);
 :   const string& month = ShortMonthName(context, curr);
 :   IntVal dayofmonth = DayOfMonth(context, curr);
 :   IntVal hour = Hour(context, curr);
 :   IntVal min = Minute(context, curr);
 :   IntVal sec = Second(context, curr);
 :   IntVal year = Year(context, curr);
 :
 :   // Calculate 'start' time point at which query execution 
started.
 :   cctz::time_point start = 
UnixTimeToTimePoint(
 :   context->impl()->state()->query_ctx().start_unix_millis / 
MILLIS_PER_SEC);
 :   // Find 'tz_name' time-zone abbreviation that corresponds to 
'local_time_zone' at
 :   // 'start' time point.
 :   cctz::time_zone::absolute_lookup start_lookup =
 :   context->impl()->state()->local_time_zone()->lookup(start);
 :   const string& tz_name = (start_lookup.abbr != nullptr) ? 
start_lookup.abbr :
 :   context->impl()->state()->local_time_zone()->name();
any chance that can throw an exception? I believe our IR code can't properly 
handle try/catch, so if this can indeed throw an exception and needs to be 
wrapped in try/catch, it may need to be refactored so that this code lives in 
the native code and we call out to it from the IR. (Just a heads up, this may 
not be a problem here).


http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/timestamp-value.inline.h@59
PS7, Line 59: .days()
in the past we've had issues where the boost date library can throw exceptions. 
I don't remember the details off hand and it may be that you are okay here 
given you've already checked HasDateAndTime() and if we ensure date_ is within 
range, but just wanted to mention it.



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 7
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 15 May 2018 17:34:57 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-15 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 7: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 7
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 15 May 2018 17:35:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-16 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 7: Code-Review+1

(4 comments)

Thanks Attila for addressing my comments! I'm fine with the change.

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/timezone_db-test.cc
File be/src/exprs/timezone_db-test.cc:

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/timezone_db-test.cc@119
PS7, Line 119:   TestInvalidTimezoneAbbrevName("pST");
Four of these are already tested by TimezoneDbNamesTest. No need to test the 
here as well, I think.


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h
File be/src/exprs/timezone_db.h:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h@1
PS4, Line 1: // Licensed to the Apache Software Foundation (ASF) under one
> Not sure what happened there. I probably inadvertently executed some myster
:D


http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.cc@365
PS4, Line 365:   RETURN_IF_ERROR(
> LoadZoneInfoFromHdfs() uses cctz::load_time_zone() to load time-zone files
Thx!


http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/runtime/runtime-state.cc
File be/src/runtime/runtime-state.cc:

http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/runtime/runtime-state.cc@136
PS5, Line 136: local_time_zone_ = &TimezoneDatabase::GetUtcTimezone();
> True, but I wanted to be more explicit here.
thanks for the explanation.



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 7
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 16 May 2018 14:43:24 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-16 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 7:

(15 comments)

This is going to be a big improvement. Did a pass, mainly had comments about 
clarifying internal interfaces.

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exec/data-source-scan-node.h
File be/src/exec/data-source-scan-node.h:

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exec/data-source-scan-node.h@101
PS7, Line 101:   /// local time-zone for materializing 'TYPE_TIMESTAMP' slots.
Can local_tz be NULL? Maybe make it const& if not.


http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exec/parquet-column-readers.cc@603
PS7, Line 603:   if (dst_ts->HasDateAndTime()) 
dst_ts->UtcToLocal(parent_->state_->local_time_zone());
Would it make sense to cache the timezone locally in the ScalarColumnReader? 
That would save at least 2 pointer indirections per value, which could be 
meaningful in this part of the code.


http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/decimal-operators.h
File be/src/exprs/decimal-operators.h:

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/decimal-operators.h@172
PS7, Line 172:   const T& decimal_value, int scale, bool round, const 
Timezone* local_tz);
Is it nullable?


http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/expr-test.cc@161
PS7, Line 161:   const Timezone *new_tz_;
nit: Timezone* new_tz_


http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/timezone_db.cc@48
PS7, Line 48: "HDFS/S3A/ADLS path to load IANA time-zone database from.");
nn]


http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/runtime-state.h
File be/src/runtime/runtime-state.h:

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/runtime-state.h@317
PS7, Line 317:   /// Query-global timezone used as local timezone when 
executing the query.
Can this be NULL? Would be good to document. We should maybe return a const& 
above if it can't be NULL so that it's self-documenting.


http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/runtime-state.cc
File be/src/runtime/runtime-state.cc:

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/runtime-state.cc@131
PS7, Line 131: LIKELY
LIKELY won't make a measurable difference outside of perf-critical code, I find 
it adds noise in cases like this. The codebase isn't very consistent about it 
but I'm trying to stop the pattern from spreading :)


http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/runtime-state.cc@134
PS7, Line 134: LOG(ERROR) << "Failed to find local timezone " << 
query_ctx().local_time_zone
I think this should be a WARNING-level log. We should reserve ERROR for really 
severe errors, whereas this might flood logs.

I think we should also add a warning to the query warnings so that it's 
surfaced to the user, not just the admin.


http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/timestamp-value.h@98
PS7, Line 98:   static TimestampValue FromUnixTime(time_t unix_time, const 
Timezone* local_tz) {
Here and below it isn't clear if local_tz is allowed to be NULL. If it can be 
NULL, can we extend comments to explain what happens if that case. If it can't, 
we could make it self-documenting by making it a const& instead of a const*.


http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/timestamp-value.cc
File be/src/runtime/timestamp-value.cc:

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/timestamp-value.cc@100
PS7, Line 100: CheckIfDateOutOfRange
Maybe IsDateOutOfRange(). With "Check" it isn't clear whether returning true 
means that it's out of range or if it means that the timestamp passed the check.


http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/util/time.h
File be/src/util/time.h:

http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/util/time.h@122
PS7, Line 122: Timezone
Maybe const& if it's not allowed to be NULL.


http://gerrit.cloudera.org:8080/#/c/9986/7/cmake_modules/FindCctz.cmake
File cmake_modules/FindCctz.cmake:

http://gerrit.cloudera.org:8080/#/c/9986/7/cmake_modules/FindCctz.cmake@27
PS7, Line 27:   $ENV{IMPALA_HOME}/thirdparty/cctz-$ENV{IMPALA_CCTZ_VERSION}/src)
We can get rid of the thirdparty/ stuff. That's just left over from when Impala 
stored vendored versions of these dependencies in thirdpart/


http://gerrit.cloudera.org:8080/#/c/9986/7/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-25 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#8). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to
  specify an HDFS/S3/ADLS location that contains the shared compiled
  IANA time-zone database. If the startup flag is set, impalad will
  use the specified time-zone database. Otherwise, impalad will use
  the default /usr/share/zoneinfo time-zone database.
- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.
- Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindCctz.cmake
M common/thrift/ImpalaInternalService.thrift
M common/thrift/metrics.json
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/abbrev.conf
A testdata/tzdb/zoneinfo/AmerICA/ArgeNTINA/MendOZA
A testdata/tzdb/zoneinfo/AmerICA/CancUN
A testdata/tzdb/zoneinfo/UTC
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
A tests/custom_cluster/test_custom_tzdb.py
53 files changed, 2,603 insertions(+), 1,095 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/8
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 8
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-25 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#9). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to
  specify an HDFS/S3/ADLS location that contains the shared compiled
  IANA time-zone database. If the startup flag is set, impalad will
  use the specified time-zone database. Otherwise, impalad will use
  the default /usr/share/zoneinfo time-zone database.
- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.
- Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindCctz.cmake
M common/thrift/ImpalaInternalService.thrift
M common/thrift/metrics.json
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/2017c/Africa/Abidjan
A testdata/tzdb/2017c/Africa/Accra
A testdata/tzdb/2017c/Africa/Addis_Ababa
A testdata/tzdb/2017c/Africa/Algiers
A testdata/tzdb/2017c/Africa/Asmara
A testdata/tzdb/2017c/Africa/Asmera
A testdata/tzdb/2017c/Africa/Bamako
A testdata/tzdb/2017c/Africa/Bangui
A testdata/tzdb/2017c/Africa/Banjul
A testdata/tzdb/2017c/Africa/Bissau
A testdata/tzdb/2017c/Africa/Blantyre
A testdata/tzdb/2017c/Africa/Brazzaville
A testdata/tzdb/2017c/Africa/Bujumbura
A testdata/tzdb/2017c/Africa/Cairo
A testdata/tzdb/2017c/Africa/Casablanca
A testdata/tzdb/2017c/Africa/Ceuta
A testdata/tzdb/2017c/Africa/Conakry
A testdata/tzdb/2017c/Africa/Dakar
A testdata/tzdb/2017c/Africa/Dar_es_Salaam
A testdata/tzdb/2017c/Africa/Djibouti
A testdata/tzdb/2017c/Africa/Douala
A testdata/tzdb/2017c/Africa/El_Aaiun
A testdata/tzdb/2017c/Africa/Freetown
A testdata/tzdb/2017c/Africa/Gaborone
A testdata/tzdb/2017c/Africa/Harare
A testdata/tzdb/2017c/Africa/Johannesburg
A testdata/tzdb/2017c/Africa/Juba
A testdata/tzdb/2017c/Africa/Kampala
A testdata/tzdb/2017c/Africa/Khartoum
A testdata/tzdb/2017c/Africa/Kigali
A testdata/tzdb/2017c/Africa/Kinshasa
A testdata/tzdb/2017c/Africa/Lagos
A testd

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-25 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 9:

> > Uploaded patch set 9.
 >
 > Patch -set 9 contains the following changes:
 > - Added a full timezone db to testdata/tzdb.
 > - End-to-end tests and BE-tests were changed to use this timezone
 > db. This was necessary because some timezone-tests were failing on
 > older jenkins workers that had an older tzdata package installed.

It might be a good idea to store the timezone-db files in one .tar file and 
extract them before running the tests. What do you think?


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 9
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 25 May 2018 16:23:16 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-25 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 9:

> > > Uploaded patch set 9.
 > >
 > > Patch -set 9 contains the following changes:
 > > - Added a full timezone db to testdata/tzdb.
 > > - End-to-end tests and BE-tests were changed to use this timezone
 > > db. This was necessary because some timezone-tests were failing
 > on
 > > older jenkins workers that had an older tzdata package installed.
 >
 > It might be a good idea to store the timezone-db files in one .tar
 > file and extract them before running the tests. What do you think?

I agree, .taring or compressing the tz db would be much better, if it does not 
make the code too complicated. Having less file would make the review more 
readable, and would also make the tz db consume much less space on hdfs, as the 
many small files will be rounded up to hdfs block size.


-- 
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 9
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 25 May 2018 16:33:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-25 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#10). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to
  specify an HDFS/S3/ADLS location that contains the shared compiled
  IANA time-zone database. If the startup flag is set, impalad will
  use the specified time-zone database. Otherwise, impalad will use
  the default /usr/share/zoneinfo time-zone database.
- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.
- Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindCctz.cmake
M common/thrift/ImpalaInternalService.thrift
M common/thrift/metrics.json
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/2017c/Africa/Abidjan
A testdata/tzdb/2017c/Africa/Accra
A testdata/tzdb/2017c/Africa/Addis_Ababa
A testdata/tzdb/2017c/Africa/Algiers
A testdata/tzdb/2017c/Africa/Asmara
A testdata/tzdb/2017c/Africa/Asmera
A testdata/tzdb/2017c/Africa/Bamako
A testdata/tzdb/2017c/Africa/Bangui
A testdata/tzdb/2017c/Africa/Banjul
A testdata/tzdb/2017c/Africa/Bissau
A testdata/tzdb/2017c/Africa/Blantyre
A testdata/tzdb/2017c/Africa/Brazzaville
A testdata/tzdb/2017c/Africa/Bujumbura
A testdata/tzdb/2017c/Africa/Cairo
A testdata/tzdb/2017c/Africa/Casablanca
A testdata/tzdb/2017c/Africa/Ceuta
A testdata/tzdb/2017c/Africa/Conakry
A testdata/tzdb/2017c/Africa/Dakar
A testdata/tzdb/2017c/Africa/Dar_es_Salaam
A testdata/tzdb/2017c/Africa/Djibouti
A testdata/tzdb/2017c/Africa/Douala
A testdata/tzdb/2017c/Africa/El_Aaiun
A testdata/tzdb/2017c/Africa/Freetown
A testdata/tzdb/2017c/Africa/Gaborone
A testdata/tzdb/2017c/Africa/Harare
A testdata/tzdb/2017c/Africa/Johannesburg
A testdata/tzdb/2017c/Africa/Juba
A testdata/tzdb/2017c/Africa/Kampala
A testdata/tzdb/2017c/Africa/Khartoum
A testdata/tzdb/2017c/Africa/Kigali
A testdata/tzdb/2017c/Africa/Kinshasa
A testdata/tzdb/2017c/Africa/Lagos
A test

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-28 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 10:

I do not see any more low hanging fruits for performance improvement. Some 
overhead could be removed by modifying CCTZ, but this is out of the scope of 
this change, so I created a follow up Jira:
IMPALA-7085:
Consider patching Google/CCTZ for Impala's need


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 10
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 28 May 2018 16:20:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-30 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 10:

> > > > Uploaded patch set 9.
 > > >
 > > > Patch -set 9 contains the following changes:
 > > > - Added a full timezone db to testdata/tzdb.
 > > > - End-to-end tests and BE-tests were changed to use this
 > timezone
 > > > db. This was necessary because some timezone-tests were failing
 > > on
 > > > older jenkins workers that had an older tzdata package
 > installed.
 > >
 > > It might be a good idea to store the timezone-db files in one
 > .tar
 > > file and extract them before running the tests. What do you
 > think?
 >
 > I agree, .taring or compressing the tz db would be much better, if
 > it does not make the code too complicated. Having less file would
 > make the review more readable, and would also make the tz db
 > consume much less space on hdfs, as the many small files will be
 > rounded up to hdfs block size.

Extracting files from a .tar file can be tricky. Probably we would have to add 
libtar library to the native-toolchain to handle .tar files.

Alternatively we can store timezone files in a JAR archive instead. The BE can 
call into the java FE to extract files from it.


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 10
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 30 May 2018 15:25:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-30 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 10:

> > > > > Uploaded patch set 9.
 > > > >
 > > > > Patch -set 9 contains the following changes:
 > > > > - Added a full timezone db to testdata/tzdb.
 > > > > - End-to-end tests and BE-tests were changed to use this
 > > timezone
 > > > > db. This was necessary because some timezone-tests were
 > failing
 > > > on
 > > > > older jenkins workers that had an older tzdata package
 > > installed.
 > > >
 > > > It might be a good idea to store the timezone-db files in one
 > > .tar
 > > > file and extract them before running the tests. What do you
 > > think?
 > >
 > > I agree, .taring or compressing the tz db would be much better,
 > if
 > > it does not make the code too complicated. Having less file would
 > > make the review more readable, and would also make the tz db
 > > consume much less space on hdfs, as the many small files will be
 > > rounded up to hdfs block size.
 >
 > Extracting files from a .tar file can be tricky. Probably we would
 > have to add libtar library to the native-toolchain to handle .tar
 > files.
 >
 > Alternatively we can store timezone files in a JAR archive instead.
 > The BE can call into the java FE to extract files from it.

Tim, Dan, what do you think?


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 10
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 30 May 2018 15:34:20 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-30 Thread Philip Zeyliger (Code Review)
Philip Zeyliger has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 10:

(5 comments)

> > > > > > Uploaded patch set 9.
 > > > > >
 > > > > > Patch -set 9 contains the following changes:
 > > > > > - Added a full timezone db to testdata/tzdb.
 > > > > > - End-to-end tests and BE-tests were changed to use this
 > > > timezone
 > > > > > db. This was necessary because some timezone-tests were
 > > failing
 > > > > on
 > > > > > older jenkins workers that had an older tzdata package
 > > > installed.
 > > > >
 > > > > It might be a good idea to store the timezone-db files in one
 > > > .tar
 > > > > file and extract them before running the tests. What do you
 > > > think?
 > > >
 > > > I agree, .taring or compressing the tz db would be much better,
 > > if
 > > > it does not make the code too complicated. Having less file
 > would
 > > > make the review more readable,

>From a review ability perspective, there's absolutely no need for this commit 
>to have 221 timezone files. You can test it with ~3, or do a separate commit. 
>i.e., it's neither here nor there.

 > >
 > > Alternatively we can store timezone files in a JAR archive
 > instead.
 > > The BE can call into the java FE to extract files from it.
 >
 > Tim, Dan, what do you think?

The Yarn equivalent here has this notion of a "distributed cache" which is to 
say it stores the files locally and re-uses them across jobs. I can't tell if 
we should be worried that all impalads, at boot time, will slam HDFS with 
reading the timezonedb. I think reading ~200 files per impalad times 200 
impalad daemons may be a lot of HDFS metadata load, but maybe it's comparable 
to what we do for queries anyway. I certainly think that tar or jar is a better 
way to go. Since this is happening at boot, we can probably still fork to tar, 
which we can assume is available, or use Java, which has tar libraries and 
native support for zip.

http://gerrit.cloudera.org:8080/#/c/9986/10//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9986/10//COMMIT_MSG@35
PS10, Line 35:   specify an HDFS/S3/ADLS location that contains the shared 
compiled
In what format? Does it add to the host one or override it?

My /usr/share/zoneinfo is full of symlinks which HDFS doesn't support in some 
configurations (and S3 certainly doesn't).


http://gerrit.cloudera.org:8080/#/c/9986/10//COMMIT_MSG@41
PS10, Line 41: - The name of the coordinator node’s local time-zone is saved to 
the
Is it easy to tell what the local time zone is of an impalad node? (E.g., do we 
log it?)


http://gerrit.cloudera.org:8080/#/c/9986/10//COMMIT_MSG@45
PS10, Line 45: - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to 
impalad
What's the distinction between this and --hdfs_zone_info_dir? Do we need both?


http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan
File testdata/tzdb/2017c/Africa/Abidjan:

http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan@1
PS10, Line 1: ../Atlantic/St_Helena
We're adding a ton of files. Do we need such a big database for our testing 
purposes?


http://gerrit.cloudera.org:8080/#/c/9986/10/tests/custom_cluster/test_shared_tzdb.py
File tests/custom_cluster/test_shared_tzdb.py:

http://gerrit.cloudera.org:8080/#/c/9986/10/tests/custom_cluster/test_shared_tzdb.py@38
PS10, Line 38: cls.ImpalaTestMatrix.add_constraint(lambda v:
Add a comment about what this is trying to do?



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 10
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 30 May 2018 15:59:54 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-05-30 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 10:

(1 comment)

Yeah I agree with Phil that doing JNI to java to unzip a file may be the least 
of all evils - we wouldn't be adding another dependency and there's no risk of 
DOSing the NameNode. I doubt it would take the NameNode down, but it could 
disrupt other things happening on the cluster unnecessarily.

http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan
File testdata/tzdb/2017c/Africa/Abidjan:

http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan@1
PS10, Line 1: ../Atlantic/St_Helena
> We're adding a ton of files. Do we need such a big database for our testing
We'll also need to exclude these files from RAT checks (and document how they 
are licensed).



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 10
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 31 May 2018 00:30:03 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-04 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9986/10/tests/custom_cluster/test_shared_tzdb.py
File tests/custom_cluster/test_shared_tzdb.py:

http://gerrit.cloudera.org:8080/#/c/9986/10/tests/custom_cluster/test_shared_tzdb.py@59
PS10, Line 59: for abbrev in ['PST', 'JST', 'ACT', 'VST']:
I have copied and extended this test in 
https://gerrit.cloudera.org/#/c/10486/3..4/tests/query_test/test_timezones.py 
to also checks the warning and the to_utc_timestamp function - maybe these 
check could be added to this test too.



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 10
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 04 Jun 2018 17:59:28 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-05 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#11). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.
- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.
- Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/common/init.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/frontend.cc
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/CMakeLists.txt
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
A be/src/util/zip-util.cc
A be/src/util/zip-util.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
M bin/rat_exclude_files.txt
A cmake_modules/FindCctz.cmake
M common/thrift/CMakeLists.txt
M common/thrift/ImpalaInternalService.thrift
A common/thrift/Zip.thrift
M common/thrift/metrics.json
A fe/src/main/java/org/apache/impala/util/ZipUtil.java
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/2017c.zip
A testdata/tzdb/abbrev.conf
A testdata/tzdb_tiny/America/New_York
A testdata/tzdb_tiny/Etc/GMT+4
A testdata/tzdb_tiny/US/Eastern
A testdata/tzdb_tiny/UTC
A testdata/tzdb_tiny/Zulu
A testdata/tzdb_tiny/posix/UTC
A testdata/tzdb_tiny/posixrules
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
A tests/custom_cluster/test_shared_tzdb.py
D tests/query_test/test_timezones.py
70 files changed, 2,995 insertions(+), 1,168 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/11
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-05 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 11:

> (1 comment)
 >
 > Yeah I agree with Phil that doing JNI to java to unzip a file may
 > be the least of all evils - we wouldn't be adding another
 > dependency and there's no risk of DOSing the NameNode. I doubt it
 > would take the NameNode down, but it could disrupt other things
 > happening on the cluster unnecessarily.

Changed the patch to take the shared timezone db in a zip archive.


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 11
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 05 Jun 2018 16:12:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-06 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#12). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.
- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.
- Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/common/init.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/frontend.cc
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/CMakeLists.txt
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
A be/src/util/zip-util.cc
A be/src/util/zip-util.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
M bin/rat_exclude_files.txt
A cmake_modules/FindCctz.cmake
M common/thrift/CMakeLists.txt
M common/thrift/ImpalaInternalService.thrift
A common/thrift/Zip.thrift
M common/thrift/metrics.json
A fe/src/main/java/org/apache/impala/util/ZipUtil.java
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/2017c.zip
A testdata/tzdb/abbrev.conf
A testdata/tzdb_tiny/America/New_York
A testdata/tzdb_tiny/Etc/GMT+4
A testdata/tzdb_tiny/US/Eastern
A testdata/tzdb_tiny/UTC
A testdata/tzdb_tiny/Zulu
A testdata/tzdb_tiny/posix/UTC
A testdata/tzdb_tiny/posixrules
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
A tests/custom_cluster/test_shared_tzdb.py
D tests/query_test/test_timezones.py
70 files changed, 2,994 insertions(+), 1,167 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/12
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-06 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 11:

(2 comments)

Thanks for adding zip support!
We should add some tests for zip_util, especially for error handling, which is 
an untested path at the moment if didn't miss something. I am ok with moving 
this (and dealing with my other comments) to a later commit.

http://gerrit.cloudera.org:8080/#/c/9986/11/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/11/be/src/exprs/timezone_db.cc@198
PS11, Line 198: GetNextDirectoryEntry
This is subjective, but I do not like this interface too much. I would prefer 
to wrap dir_stream to a class/struct, or create a function like this: static 
STATUS ListDirEntries(string path, vector& result, int max_result_num = 
0). Both could be moved to util/filesystem_util.


http://gerrit.cloudera.org:8080/#/c/9986/11/be/src/exprs/timezone_db.cc@213
PS11, Line 213: readdir_r
There was a discussion about readdir_r() vs readdir() in 
https://gerrit.cloudera.org/#/c/8546/8 , and readdir() was preferred in the end.



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 11
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 06 Jun 2018 18:35:35 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-07 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 11:

(2 comments)

Added some tests for extracting files from a non-existent zip archive and from 
a corrupt zip archive.

http://gerrit.cloudera.org:8080/#/c/9986/11/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/11/be/src/exprs/timezone_db.cc@198
PS11, Line 198: GetNextDirectoryEntry
> This is subjective, but I do not like this interface too much. I would pref
Done


http://gerrit.cloudera.org:8080/#/c/9986/11/be/src/exprs/timezone_db.cc@213
PS11, Line 213: readdir_r
> There was a discussion about readdir_r() vs readdir() in https://gerrit.clo
Done



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 11
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 07 Jun 2018 19:49:52 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-07 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#13). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.
- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.
- Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/common/init.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/frontend.cc
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/CMakeLists.txt
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
A be/src/util/zip-util-test.cc
A be/src/util/zip-util.cc
A be/src/util/zip-util.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
M bin/rat_exclude_files.txt
A cmake_modules/FindCctz.cmake
M common/thrift/CMakeLists.txt
M common/thrift/ImpalaInternalService.thrift
A common/thrift/Zip.thrift
M common/thrift/metrics.json
A fe/src/main/java/org/apache/impala/util/ZipUtil.java
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/2017c-corrupt.zip
A testdata/tzdb/2017c.zip
A testdata/tzdb/abbrev.conf
A testdata/tzdb_tiny/America/New_York
A testdata/tzdb_tiny/Etc/GMT+4
A testdata/tzdb_tiny/US/Eastern
A testdata/tzdb_tiny/UTC
A testdata/tzdb_tiny/Zulu
A testdata/tzdb_tiny/posix/UTC
A testdata/tzdb_tiny/posixrules
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
A tests/custom_cluster/test_shared_tzdb.py
D tests/query_test/test_timezones.py
72 files changed, 3,098 insertions(+), 1,167 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/13
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Message

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-08 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#14). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.
- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.
- Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/common/init.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/frontend.cc
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/CMakeLists.txt
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
A be/src/util/zip-util-test.cc
A be/src/util/zip-util.cc
A be/src/util/zip-util.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
M bin/rat_exclude_files.txt
A cmake_modules/FindCctz.cmake
M common/thrift/CMakeLists.txt
M common/thrift/ImpalaInternalService.thrift
A common/thrift/Zip.thrift
M common/thrift/metrics.json
A fe/src/main/java/org/apache/impala/util/ZipUtil.java
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/2017c-corrupt.zip
A testdata/tzdb/2017c.zip
A testdata/tzdb/abbrev.conf
A testdata/tzdb_tiny/America/New_York
A testdata/tzdb_tiny/Etc/GMT+4
A testdata/tzdb_tiny/US/Eastern
A testdata/tzdb_tiny/UTC
A testdata/tzdb_tiny/Zulu
A testdata/tzdb_tiny/posix/UTC
A testdata/tzdb_tiny/posixrules
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
A tests/custom_cluster/test_shared_tzdb.py
D tests/query_test/test_timezones.py
72 files changed, 3,117 insertions(+), 1,167 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/14
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Message

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-08 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 14:

> Uploaded patch set 14.

Added one more BE test for extracting files from a zip archive to a 
non-writable destination directory.

Fixed zip-slip vulnerability in ZipUtil.java.


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 14
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 08 Jun 2018 13:43:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-08 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 14: Code-Review+2

(5 comments)

Thanks for thinking about Zip-Slip!
I have left a few optional comments about the usability of the interfaces.

http://gerrit.cloudera.org:8080/#/c/9986/14//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9986/14//COMMIT_MSG@34
PS14, Line 34: - Introduces a new startup flag (--hdfs_zone_info_zip) to 
impalad to
The Zip slip safe zip-util could be also mentioned in the commit message.


http://gerrit.cloudera.org:8080/#/c/9986/14/be/src/util/filesystem-util.h
File be/src/util/filesystem-util.h:

http://gerrit.cloudera.org:8080/#/c/9986/14/be/src/util/filesystem-util.h@92
PS14, Line 92: Directory(const string& path, bool skip_hidden_entries = 
true);
I thought a bit about usability and I vote for removing this parameter and skip 
only "." and ".." - I can't imagine any use case when I would be interested in 
those.


http://gerrit.cloudera.org:8080/#/c/9986/14/be/src/util/filesystem-util.h@109
PS14, Line 109: static Status GetEntryNames(const string& path,
I would prefer max_result_size to be the last parameter, and give it a default 
value of 0.


http://gerrit.cloudera.org:8080/#/c/9986/14/be/src/util/zip-util-test.cc
File be/src/util/zip-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/9986/14/be/src/util/zip-util-test.cc@69
PS14, Line 69:   EXPECT_FALSE(filesystem::exists(dest_dir3));
I guess that this is only true if zip decoding failed at the start, and some 
files may be already decompressed before reaching an error in the zip. I am not 
sure what to do with this, probably nothing. It would be possible add some kind 
of cleanup logic to the java util, but I am not sure if this worth the effort.


http://gerrit.cloudera.org:8080/#/c/9986/14/fe/src/main/java/org/apache/impala/util/ZipUtil.java
File fe/src/main/java/org/apache/impala/util/ZipUtil.java:

http://gerrit.cloudera.org:8080/#/c/9986/14/fe/src/main/java/org/apache/impala/util/ZipUtil.java@45
PS14, Line 45: try (ZipFile zip = new ZipFile(params.archive_file)) {
I would move this block to a similar function with (String archiveFile, String 
destDir) parameters to make this util usable from Java too. This would be 
minimal extra effort and I think that it can be handy to have an easily usable 
Zip-Slip safe extract function.



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 14
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 08 Jun 2018 16:15:15 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-11 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#15). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.

- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.

- Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.

- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.

- Adds a new ZipUtil class to extract files from a zip archive. The
  implementation is not vulnerable to Zip Slip.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/common/init.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/frontend.cc
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/CMakeLists.txt
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
A be/src/util/zip-util-test.cc
A be/src/util/zip-util.cc
A be/src/util/zip-util.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
M bin/rat_exclude_files.txt
A cmake_modules/FindCctz.cmake
M common/thrift/CMakeLists.txt
M common/thrift/ImpalaInternalService.thrift
A common/thrift/Zip.thrift
M common/thrift/metrics.json
A fe/src/main/java/org/apache/impala/util/ZipUtil.java
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/2017c-corrupt.zip
A testdata/tzdb/2017c.zip
A testdata/tzdb/abbrev.conf
A testdata/tzdb_tiny/America/New_York
A testdata/tzdb_tiny/Etc/GMT+4
A testdata/tzdb_tiny/US/Eastern
A testdata/tzdb_tiny/UTC
A testdata/tzdb_tiny/Zulu
A testdata/tzdb_tiny/posix/UTC
A testdata/tzdb_tiny/posixrules
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
A tests/custom_cluster/test_shared_tzdb.py
D tests/query_test/test_timezones.py
72 files changed, 3,119 insertions(+), 1,167 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/15
--
To view, visit http://gerrit.cloudera.org:8080/9986
To uns

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-11 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#16). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.

- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.

- Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad
  to specify an HDFS/S3/ADLS path to a shared config file that
  contains definitions for non-standard time-zone abbreviations.

- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.

- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.

- Adds a new ZipUtil class to extract files from a zip archive. The
  implementation is not vulnerable to Zip Slip.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/common/init.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/frontend.cc
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/CMakeLists.txt
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
A be/src/util/zip-util-test.cc
A be/src/util/zip-util.cc
A be/src/util/zip-util.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
M bin/rat_exclude_files.txt
A cmake_modules/FindCctz.cmake
M common/thrift/CMakeLists.txt
M common/thrift/ImpalaInternalService.thrift
A common/thrift/Zip.thrift
M common/thrift/metrics.json
A fe/src/main/java/org/apache/impala/util/ZipUtil.java
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/2017c-corrupt.zip
A testdata/tzdb/2017c.zip
A testdata/tzdb/alias.conf
A testdata/tzdb_tiny/America/New_York
A testdata/tzdb_tiny/Etc/GMT+4
A testdata/tzdb_tiny/US/Eastern
A testdata/tzdb_tiny/UTC
A testdata/tzdb_tiny/Zulu
A testdata/tzdb_tiny/posix/UTC
A testdata/tzdb_tiny/posixrules
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
A tests/custom_cluster/test_shared_tzdb.py
D tests/query_test/test_timezones.py
72 files changed, 3,089 insertions(+), 1,167 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/16
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsu

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-11 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#17). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.

- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.

- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
  specify an HDFS/S3/ADLS path to a shared config file that contains
  definitions for non-standard time-zone aliases.

- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.

- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.

- Adds a new ZipUtil class to extract files from a zip archive. The
  implementation is not vulnerable to Zip Slip.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/common/init.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/frontend.cc
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/CMakeLists.txt
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
A be/src/util/zip-util-test.cc
A be/src/util/zip-util.cc
A be/src/util/zip-util.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
M bin/rat_exclude_files.txt
A cmake_modules/FindCctz.cmake
M common/thrift/CMakeLists.txt
M common/thrift/ImpalaInternalService.thrift
A common/thrift/Zip.thrift
M common/thrift/metrics.json
A fe/src/main/java/org/apache/impala/util/ZipUtil.java
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/2017c-corrupt.zip
A testdata/tzdb/2017c.zip
A testdata/tzdb/alias.conf
A testdata/tzdb_tiny/America/New_York
A testdata/tzdb_tiny/Etc/GMT+4
A testdata/tzdb_tiny/US/Eastern
A testdata/tzdb_tiny/UTC
A testdata/tzdb_tiny/Zulu
A testdata/tzdb_tiny/posix/UTC
A testdata/tzdb_tiny/posixrules
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
A tests/custom_cluster/test_shared_tzdb.py
D tests/query_test/test_timezones.py
72 files changed, 3,089 insertions(+), 1,167 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/17
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-11 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 17:

> Uploaded patch set 17: Commit message was updated.

Added support for configurable timezone aliases (instead of just abbreviations).


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 17
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 11 Jun 2018 17:37:46 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-11 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#18). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.

- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.

- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
  specify an HDFS/S3/ADLS path to a shared config file that contains
  definitions for non-standard time-zone aliases.

- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.

- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.

- Adds a new ZipUtil class to extract files from a zip archive. The
  implementation is not vulnerable to Zip Slip.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/common/init.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/frontend.cc
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/CMakeLists.txt
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
A be/src/util/zip-util-test.cc
A be/src/util/zip-util.cc
A be/src/util/zip-util.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
M bin/rat_exclude_files.txt
A cmake_modules/FindCctz.cmake
M common/thrift/CMakeLists.txt
M common/thrift/ImpalaInternalService.thrift
A common/thrift/Zip.thrift
M common/thrift/metrics.json
A fe/src/main/java/org/apache/impala/util/ZipUtil.java
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/2017c-corrupt.zip
A testdata/tzdb/2017c.zip
A testdata/tzdb/alias.conf
A testdata/tzdb_tiny/America/New_York
A testdata/tzdb_tiny/Etc/GMT+4
A testdata/tzdb_tiny/US/Eastern
A testdata/tzdb_tiny/UTC
A testdata/tzdb_tiny/Zulu
A testdata/tzdb_tiny/posix/UTC
A testdata/tzdb_tiny/posixrules
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
A tests/custom_cluster/test_shared_tzdb.py
D tests/query_test/test_timezones.py
72 files changed, 3,089 insertions(+), 1,167 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/18
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-12 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 18: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9986/17/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/17/be/src/exprs/timezone_db.cc@367
PS17, Line 367: Status TimezoneDatabase::LoadZoneAliasesFromHdfs(
  : const string& hdfs_zone_alias_conf) {
nit: this could fit in one line



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 18
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 12 Jun 2018 13:24:38 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-13 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 18: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9986/18/be/src/util/zip-util.h
File be/src/util/zip-util.h:

http://gerrit.cloudera.org:8080/#/c/9986/18/be/src/util/zip-util.h@33
PS18, Line 33:   //Extract files from a zip archive to a destination directory 
in local filesystem.
nit: comment formatting - should start with ///


http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan
File testdata/tzdb/2017c/Africa/Abidjan:

http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan@1
PS10, Line 1: 
> These are binary data files created from text data files that are in the pu
How does they pass the rat checks? We run those precommit. I expected to see 
something in bin/rat_exclude_files.txt.



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 18
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 14 Jun 2018 01:44:49 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-14 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 10:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/9986/17/be/src/exprs/timezone_db.cc
File be/src/exprs/timezone_db.cc:

http://gerrit.cloudera.org:8080/#/c/9986/17/be/src/exprs/timezone_db.cc@367
PS17, Line 367:   while (true) {
  : current_bytes_read = hdfsRead(hdfs_co
> nit: this could fit in one line
Done


http://gerrit.cloudera.org:8080/#/c/9986/18/be/src/util/zip-util.h
File be/src/util/zip-util.h:

http://gerrit.cloudera.org:8080/#/c/9986/18/be/src/util/zip-util.h@33
PS18, Line 33:
> nit: comment formatting - should start with ///
Done


http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan
File testdata/tzdb/2017c/Africa/Abidjan:

http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan@1
PS10, Line 1: ../Atlantic/St_Helena
> How does they pass the rat checks? We run those precommit. I expected to se
I've updated rat_exclude_files.txt in patch-set #11 and #13. I ran the rat 
checks with that and they passed.



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 10
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 14 Jun 2018 15:27:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-14 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#19). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.

- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.

- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
  specify an HDFS/S3/ADLS path to a shared config file that contains
  definitions for non-standard time-zone aliases.

- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.

- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.

- Adds a new ZipUtil class to extract files from a zip archive. The
  implementation is not vulnerable to Zip Slip.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/common/init.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/frontend.cc
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/CMakeLists.txt
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
A be/src/util/zip-util-test.cc
A be/src/util/zip-util.cc
A be/src/util/zip-util.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
M bin/rat_exclude_files.txt
A cmake_modules/FindCctz.cmake
M common/thrift/CMakeLists.txt
M common/thrift/ImpalaInternalService.thrift
A common/thrift/Zip.thrift
M common/thrift/metrics.json
A fe/src/main/java/org/apache/impala/util/ZipUtil.java
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/2017c-corrupt.zip
A testdata/tzdb/2017c.zip
A testdata/tzdb/alias.conf
A testdata/tzdb_tiny/America/New_York
A testdata/tzdb_tiny/Etc/GMT+4
A testdata/tzdb_tiny/US/Eastern
A testdata/tzdb_tiny/UTC
A testdata/tzdb_tiny/Zulu
A testdata/tzdb_tiny/posix/UTC
A testdata/tzdb_tiny/posixrules
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
A tests/custom_cluster/test_shared_tzdb.py
D tests/query_test/test_timezones.py
72 files changed, 3,088 insertions(+), 1,167 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/19
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-14 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan
File testdata/tzdb/2017c/Africa/Abidjan:

http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan@1
PS10, Line 1: ../Atlantic/St_Helena
> I've updated rat_exclude_files.txt in patch-set #11 and #13. I ran the rat
Hmm, not sure how I missed that - sorry!



--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 10
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 14 Jun 2018 16:43:03 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-18 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 19:

Are we ready to go ahead and merge? Would be good to run exhaustive tests + 
ASAN before merging just to be sure we aren't going to break anything.


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 19
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 18 Jun 2018 22:35:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-20 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 19:

> Are we ready to go ahead and merge? Would be good to run exhaustive
 > tests + ASAN before merging just to be sure we aren't going to
 > break anything.

Exhaustive and ASAN failed because of a flaky test (IMPALA-7181).
Exhaustive: 
https://master-02.jenkins.cloudera.com/job/impala-private-parameterized/2336/
ASAN: 
https://master-02.jenkins.cloudera.com/job/impala-private-parameterized/2337/

Since IMPALA-7181 was resolved yesterday, I've rebased the patch-set and 
restarted the tests this morning.
Exhaustive: 
https://master-02.jenkins.cloudera.com/job/impala-private-parameterized/2345/
ASAN: 
https://master-02.jenkins.cloudera.com/job/impala-private-parameterized/2346/

Hopefully they will pass this time.


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 19
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 20 Jun 2018 14:00:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-20 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 19: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 19
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 20 Jun 2018 17:46:54 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-20 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 19:

Thanks for running those tests!


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 19
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 20 Jun 2018 17:47:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-21 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#20). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.

- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.

- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
  specify an HDFS/S3/ADLS path to a shared config file that contains
  definitions for non-standard time-zone aliases.

- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.

- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.

- Adds a new ZipUtil class to extract files from a zip archive. The
  implementation is not vulnerable to Zip Slip.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/common/init.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/frontend.cc
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/CMakeLists.txt
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
A be/src/util/zip-util-test.cc
A be/src/util/zip-util.cc
A be/src/util/zip-util.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
M bin/rat_exclude_files.txt
A cmake_modules/FindCctz.cmake
M common/thrift/CMakeLists.txt
M common/thrift/ImpalaInternalService.thrift
A common/thrift/Zip.thrift
M common/thrift/metrics.json
A fe/src/main/java/org/apache/impala/util/ZipUtil.java
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/2017c-corrupt.zip
A testdata/tzdb/2017c.zip
A testdata/tzdb/alias.conf
A testdata/tzdb_tiny/America/New_York
A testdata/tzdb_tiny/Etc/GMT+4
A testdata/tzdb_tiny/US/Eastern
A testdata/tzdb_tiny/UTC
A testdata/tzdb_tiny/Zulu
A testdata/tzdb_tiny/posix/UTC
A testdata/tzdb_tiny/posixrules
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
A tests/custom_cluster/test_shared_tzdb.py
D tests/query_test/test_timezones.py
72 files changed, 3,087 insertions(+), 1,167 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/20
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-21 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 20: Code-Review+2

Exhaustive, ASAN tests passed. Rebased the patch-set. Carry +2


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 20
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jun 2018 12:31:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 20:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2721/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 20
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jun 2018 12:35:17 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 20: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2721/


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 20
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jun 2018 16:05:46 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-21 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 20:

clang-tidy job failed. Investigating.


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 20
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jun 2018 16:35:15 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-21 Thread Attila Jeges (Code Review)
Attila Jeges has uploaded a new patch set (#21). ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.

- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.

- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
  specify an HDFS/S3/ADLS path to a shared config file that contains
  definitions for non-standard time-zone aliases.

- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.

- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.

- Adds a new ZipUtil class to extract files from a zip archive. The
  implementation is not vulnerable to Zip Slip.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/common/init.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/frontend.cc
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/CMakeLists.txt
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
A be/src/util/zip-util-test.cc
A be/src/util/zip-util.cc
A be/src/util/zip-util.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
M bin/rat_exclude_files.txt
A cmake_modules/FindCctz.cmake
M common/thrift/CMakeLists.txt
M common/thrift/ImpalaInternalService.thrift
A common/thrift/Zip.thrift
M common/thrift/metrics.json
A fe/src/main/java/org/apache/impala/util/ZipUtil.java
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/2017c-corrupt.zip
A testdata/tzdb/2017c.zip
A testdata/tzdb/alias.conf
A testdata/tzdb_tiny/America/New_York
A testdata/tzdb_tiny/Etc/GMT+4
A testdata/tzdb_tiny/US/Eastern
A testdata/tzdb_tiny/UTC
A testdata/tzdb_tiny/Zulu
A testdata/tzdb_tiny/posix/UTC
A testdata/tzdb_tiny/posixrules
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
A tests/custom_cluster/test_shared_tzdb.py
D tests/query_test/test_timezones.py
72 files changed, 3,086 insertions(+), 1,176 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/21
--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 21:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2723/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 21
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jun 2018 16:46:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-21 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 21: Code-Review+2

Carry +2


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 21
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jun 2018 17:34:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 21: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2723/


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 21
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 21 Jun 2018 20:14:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 22: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 22
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 22 Jun 2018 09:50:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 22:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2725/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 22
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 22 Jun 2018 09:50:37 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 22: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2725/


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 22
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 22 Jun 2018 09:51:31 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-22 Thread Attila Jeges (Code Review)
Attila Jeges has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 22: Code-Review+2

Carry +2


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 22
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 22 Jun 2018 09:52:24 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 22:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2726/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 22
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 22 Jun 2018 09:53:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.

- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.

- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
  specify an HDFS/S3/ADLS path to a shared config file that contains
  definitions for non-standard time-zone aliases.

- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.

- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred afterwards in an
  execution node.

- Adds a new ZipUtil class to extract files from a zip archive. The
  implementation is not vulnerable to Zip Slip.

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Reviewed-on: http://gerrit.cloudera.org:8080/9986
Reviewed-by: Impala Public Jenkins 
Reviewed-by: Attila Jeges 
Tested-by: Impala Public Jenkins 
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/generated-sources/gen-cpp/CMakeLists.txt
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/common/global-types.h
M be/src/common/init.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/data-source-scan-node.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/decimal-operators.h
M be/src/exprs/expr-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
A be/src/exprs/timezone_db-test.cc
M be/src/exprs/timezone_db.cc
M be/src/exprs/timezone_db.h
M be/src/runtime/raw-value-test.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/frontend.cc
M be/src/service/impala-server.cc
M be/src/service/impalad-main.cc
M be/src/util/CMakeLists.txt
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/hdfs-util-test.cc
M be/src/util/hdfs-util.cc
M be/src/util/hdfs-util.h
M be/src/util/time-test.cc
M be/src/util/time.cc
M be/src/util/time.h
A be/src/util/zip-util-test.cc
A be/src/util/zip-util.cc
A be/src/util/zip-util.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
M bin/rat_exclude_files.txt
A cmake_modules/FindCctz.cmake
M common/thrift/CMakeLists.txt
M common/thrift/ImpalaInternalService.thrift
A common/thrift/Zip.thrift
M common/thrift/metrics.json
A fe/src/main/java/org/apache/impala/util/ZipUtil.java
M fe/src/test/java/org/apache/impala/testutil/TestUtils.java
M testdata/bin/create-load-data.sh
M testdata/data/timezoneverification.csv
A testdata/tzdb/2017c-corrupt.zip
A testdata/tzdb/2017c.zip
A testdata/tzdb/alias.conf
A testdata/tzdb_tiny/America/New_York
A testdata/tzdb_tiny/Etc/GMT+4
A testdata/tzdb_tiny/US/Eastern
A testdata/tzdb_tiny/UTC
A testdata/tzdb_tiny/Zulu
A testdata/tzdb_tiny/posix/UTC
A testdata/tzdb_tiny/posixrules
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
A tests/custom_cluster/test_shared_tzdb.py
D tests/query_test/test_timezones.py
72 files changed, 3,086 insertions(+), 1,1

[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db

2018-06-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9986 )

Change subject: IMPALA-3307: Add support for IANA time-zone db
..


Patch Set 22: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/9986
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Gerrit-Change-Number: 9986
Gerrit-PatchSet: 22
Gerrit-Owner: Attila Jeges 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Philip Zeyliger 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 22 Jun 2018 13:18:56 +
Gerrit-HasComments: No