[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9986 Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new statup flag (--hdfs_zoneinfo_dir) to impalad to specify an HDFS/S3/ADLS location that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Introduces a new startup flag (--hdfs_zoneabbrev_config) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/parquet-column-readers.cc M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h M bin/bootstrap_toolchain.py M bin/impala-config.sh A cmake_modules/FindCctz.cmake M common/thrift/ImpalaInternalService.thrift M common/thrift/metrics.json M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/data/timezoneverification.csv M testdata/workloads/functional-query/queries/QueryTest/exprs.test 40 files changed, 1,919 insertions(+), 1,045 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/1 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 1 Gerrit-Owner: Attila Jeges
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 1: (2 comments) Just skimmed through. Will do several further passes http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG@34 PS1, Line 34: --hdfs_zoneinfo_dir Why call it hdfs_zone_dir when it can also refer to S3 and ADLS? Maybe it could be just --zoneinfo_dir and the given URI would specify the storage. http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG@45 PS1, Line 45: --hdfs_zoneabbrev_config Same as above. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 1 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 11 Apr 2018 16:23:39 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 1: (25 comments) http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG@34 PS1, Line 34: statup typo: startup http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc File be/src/benchmarks/convert-timestamp-benchmark.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@134 PS1, Line 134: val I don't know what RAND_MAX is here, but I think that it can be 32K, which would mean that the time part would move very slowly from the starting time. Is this by design? In order to have "really random" time, I would use cpp11 random classes, or put together the time from several rand() calls, each with a modulo smaller than 32k. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@166 PS1, Line 166: d I am a bit concerned about writing to the same buffer from every thread - maybe it does not hurt performance, but it is not how these functions are "normally" used. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@709 PS1, Line 709: m1 = measure_multithreaded_elapsed_time(glibc_test_utc_to_unix, num_of_threads,BATCH_SIZE, nit: long line http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@711 PS1, Line 711: m2 = measure_multithreaded_elapsed_time(cctz_test_utc_to_unix, num_of_threads, BATCH_SIZE, nit: long line http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@725 PS1, Line 725: m1 = measure_multithreaded_elapsed_time(boost_test_from_utc, num_of_threads, BATCH_SIZE, nit: long line http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@727 PS1, Line 727: m2 = measure_multithreaded_elapsed_time(cctz_test_from_utc, num_of_threads, BATCH_SIZE, nit: long line http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@743 PS1, Line 743: m2 = measure_multithreaded_elapsed_time(cctz_test_utc_to_local, num_of_threads, BATCH_SIZE, nit: long line http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions-ir.cc File be/src/exprs/timestamp-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions-ir.cc@523 PS1, Line 523: / nit: missing spaces http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@63 PS1, Line 63: if (timezone == nullptr) { This could be UNLIKELY. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@90 PS1, Line 90: t.fractional_seconds() You have explained in person that 't' is used instead of 'to_cs'for sub-seconds, because cctz't time type does not support nano seconds, and no timezone rule affects sub-seconds. Could you add a comment about this? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@105 PS1, Line 105: if (timezone == nullptr) { This could be UNLIKELY. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@142 PS1, Line 142: t.fractional_seconds() Same as in line 90. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@28 PS1, Line 28: /// Functions to load and access the time-zone database. Please add some comments about thread-safety (e.g. "Initialize() should be called at startup to load every timezone rule to memory. After it returned without error, other functions can be safely called from multiple threads."). http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@160 PS1, Line 160: bool IsSymbolicLink(const string& path, string* real_path) { Maybe this could be moved to class FileSystemUtil. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@412 PS1, Line 412: char buffer[64*1024]; I am a bit concerned about this - is it ok to keep buffers of this size on stack in Impala? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@438 PS1, Line 438: // Load custom time-zone abbreviations from 'is' and add them to 'tz_name_map_'. In most of Impala, the comments for private functions are in the .h file and start with "///". Can you move them to the header for the sake of consistency? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@439 PS1
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 1: (7 comments) Spent only a limited amount of time on this review. Will continue next week. http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG@7 PS1, Line 7: time-zone Fun fact: I was curious whether "timezone", "time zone" or "time-zone" is correct. Apparently all of them are, however I read that "time-zone" is a bit outdated. "time zone" is widely used and "timezone" is only in the US. Can someone native confirm? :) http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc File be/src/benchmarks/convert-timestamp-benchmark.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@1 PS1, Line 1: #include Missing Apache header http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@38 PS1, Line 38: UtcToUnixTime: Function iters/ms 10%ile 50%ile 90%ile 10%ile 50%ile 90%ile Do you think we should force the 90col limit on the following comment as well? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@136 PS1, Line 136: ss << to_simple_string(start); to_simple_string() return a string already, no need to use a stringstream for this purpose. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/decimal-operators.h File be/src/exprs/decimal-operators.h: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/decimal-operators.h@168 PS1, Line 168: /// instead of truncating if 'round' is true. Should we mention the new parameter in the comment? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/runtime/runtime-state.h File be/src/runtime/runtime-state.h: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/runtime/runtime-state.h@321 PS1, Line 321: global local I see the intent but "global local timezone" sounds strange :) http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/runtime/runtime-state.h@321 PS1, Line 321: - nit: typo -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 1 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 13 Apr 2018 12:20:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 1: (31 comments) http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG@34 PS1, Line 34: statup > typo: startup Done http://gerrit.cloudera.org:8080/#/c/9986/1//COMMIT_MSG@34 PS1, Line 34: --hdfs_zoneinfo_dir > Why call it hdfs_zone_dir when it can also refer to S3 and ADLS? The name reflects that this flag should be used to specify a location in a "shared" filesystem that can be accessed through the HDFS API. I think, naming the flag "--zoneinfo_dir" might be misleading as it should not be used to specify a directory in a local filesystem. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc File be/src/benchmarks/convert-timestamp-benchmark.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@1 PS1, Line 1: #include > Missing Apache header Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@38 PS1, Line 38: UtcToUnixTime: Function iters/ms 10%ile 50%ile 90%ile 10%ile 50%ile 90%ile > Do you think we should force the 90col limit on the following comment as we Probably it's better to leave it like this. Some of the other benchmark programs have longer lines as well (e.g. ./be/src/benchmarks/int-hash-benchmark.cc) http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@134 PS1, Line 134: val > I don't know what RAND_MAX is here, but I think that it can be 32K, which w Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@136 PS1, Line 136: ss << to_simple_string(start); > to_simple_string() return a string already, no need to use a stringstream f Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@166 PS1, Line 166: d > I am a bit concerned about writing to the same buffer from every thread - m Fixed it. It required some major refactoring too. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@709 PS1, Line 709: m1 = measure_multithreaded_elapsed_time(glibc_test_utc_to_unix, num_of_threads,BATCH_SIZE, > nit: long line Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@711 PS1, Line 711: m2 = measure_multithreaded_elapsed_time(cctz_test_utc_to_unix, num_of_threads, BATCH_SIZE, > nit: long line Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@725 PS1, Line 725: m1 = measure_multithreaded_elapsed_time(boost_test_from_utc, num_of_threads, BATCH_SIZE, > nit: long line Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@727 PS1, Line 727: m2 = measure_multithreaded_elapsed_time(cctz_test_from_utc, num_of_threads, BATCH_SIZE, > nit: long line Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/benchmarks/convert-timestamp-benchmark.cc@743 PS1, Line 743: m2 = measure_multithreaded_elapsed_time(cctz_test_utc_to_local, num_of_threads, BATCH_SIZE, > nit: long line Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/decimal-operators.h File be/src/exprs/decimal-operators.h: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/decimal-operators.h@168 PS1, Line 168: /// instead of truncating if 'round' is true. > Should we mention the new parameter in the comment? Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions-ir.cc File be/src/exprs/timestamp-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions-ir.cc@523 PS1, Line 523: / > nit: missing spaces Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@90 PS1, Line 90: t.fractional_seconds() > You have explained in person that 't' is used instead of 'to_cs'for sub-sec Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@105 PS1, Line 105: if (timezone == nullptr) { > This could be UNLIKELY. Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@142 PS1, Line 142: t.fractional_seconds() > Same as in line 90. Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@28 PS1, Line 28: /// Functions to load and access the time-zone database. > Please add some comments about thread-safety (e.g. "Initialize() should be Done http:/
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Hello Gabor Kaszab, Zoltan Borok-Nagy, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9986 to look at the new patch set (#2). Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zoneinfo_dir) to impalad to specify an HDFS/S3/ADLS location that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Introduces a new startup flag (--hdfs_zoneabbrev_config) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h M bin/bootstrap_toolchain.py M bin/impala-config.sh A cmake_modules/FindCctz.cmake M common/thrift/ImpalaInternalService.thrift M common/thrift/metrics.json M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/data/timezoneverification.csv M testdata/workloads/functional-query/queries/QueryTest/exprs.test 43 files changed, 1,972 insertions(+), 1,051 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/2 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 2 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 2: (12 comments) http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc File be/src/benchmarks/convert-timestamp-benchmark.cc: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@172 PS2, Line 172: TestData This looks like a fairly general class to me that could move to util/benchmark.h or a similar file. We can create a follow up ticket if you don't want to deal with this now. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@444 PS2, Line 444: time_t utc = We could replace this with boost_utc_to_unix_time. This conversion should be fast, as we want to measure the speed of localtime_r, not the "glue" code. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@530 PS2, Line 530: // : // Test UnixTimeToUtcPtime (boost is expected to be faster than CCTZ) : // : : // boost : boost::pos I think that this a bit misleading, as boost_unix_time_to_utc_ptime never uses its slow branch with this test data. We could measure gmtime_r separately, and add some comments that tells which functions were used in Impala before/after the change. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/decimal-operators.h File be/src/exprs/decimal-operators.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/decimal-operators.h@168 PS2, Line 168: /// local time in 'local_tz' time-zone). Rounds instead of truncating if 'round' is true. nit: long line http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timestamp-functions.cc@71 PS2, Line 71: const boost::gregorian::date& d = ts_value.date(); : const boost::posix_time::time_duration& t = ts_value.time(); : const cctz::civil_second from_cs(d.year(), d.month(), d.day(), t.hours(), t.minutes(), : t.seconds()); : : auto from_tp = cctz::convert(from_cs, TimezoneDatabase::GetUtcTimezone()); Does cctz offer a function to create timepoint from unix time_t? If yes, then it should be faster to convert ts_val to a time_t, and convert that to from_tp. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@43 PS2, Line 43: DEFINE_string(hdfs_zoneinfo_dir, "", : "HDFS/S3A/ADLS path to load IANA time-zone database from."); : DEFINE_string(hdfs_zoneabbrev_config, "", : "HDFS/S3A/ADLS path to config file defining non-standard time-zone abbreviations."); It will be a bit tricky, but these should be tested somehow. Playing with these configurations could go to a custom cluster test. These tests can be probably created at a later stage of the review, after the architecture have benn accepted. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@419 PS2, Line 419: Status TimezoneDatabase::LoadZoneAbbreviations(istream &is, At least basic testing should be added to check that fix and non-fix abbreviations are parsed correctly, but it would be the best to also create tests for error messages. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.cc@135 PS2, Line 135: auto from_tp = FromUnixSeconds(unix_time); : auto to_cs = cctz::convert(from_tp, *local_tz This would be a big change, but I would think about "auto" types - do they make the code more readable, or not? I think that if type name is long and the reader more-or-less knows what kind of type to expect, then "auto" is very nice. But if the reader does not know context/library well, then "auto" makes it harder to look-up the classes. If you want to keep them, then I would prefer longer variable names, like spelling out civil_seconds. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.inline.h@98 PS2, Line 98: if (UNLIKELY(!HasDateAndTime())) return false; : cctz::civil_second cs(date_.year(), date_.month(), date_.day(), time_.hours(), : time_.minutes(), time_.seconds()); : auto tp = cctz::convert(cs, : FLAGS_use_local_tz_for_unix_timestamp_conversions ? *local_tz : : Timezone
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 2: (8 comments) http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc File be/src/benchmarks/convert-timestamp-benchmark.cc: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@150 PS2, Line 150: void AddTestDataDateTimes(vector& data, int n, const string& startstr) { Since it's in a benchmark it doesn't really matter, anyway, some nit comments: - output parameters should be listed last and they should be pointers - in this particular case I think it would be better to return vector http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@174 PS2, Line 174: > > Nit: since C++11 you don't need to put spaces between right angle brackets. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@228 PS2, Line 228: const shared_ptr > data_ Nit: I think using 'const vector&' would be simpler. I don't really see why we need shared_ptr here. Also, in 'const shared_ptr' only the pointer is const, the pointed object is not. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@319 PS2, Line 319: boost_throw_if_date_out_of_range(local_time.date()); Is this function call really needed here? Don't we trust boost that it validates the date correctly? http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@608 PS2, Line 608: if (cctz_utc_to_unix_data.get_result() != glibc_utc_to_unix_data.get_result()) { : cerr << "cctz/glibc utc_to_unix results do not match!" << endl; : return 1; : } : if (boost_utc_to_unix_data.get_result() != glibc_utc_to_unix_data.get_result()) { : cerr << "boost/glibc utc_to_unix results do not match!" << endl; : return 1; : } The other benchmarks don't need this validity check? It could be implemented in a helper function that takes a vector of TestData. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timestamp-functions.cc@122 PS2, Line 122: or Nit: of http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h@21 PS2, Line 21: boost I think we should use the unordered_map class from the C++ STL. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h@56 PS2, Line 56: TZ_MAP TimezoneMap? I don't think a typename should be all capitals. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 2 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 19 Apr 2018 14:20:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 2: (29 comments) Dumping another batch of comments :) http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exec/data-source-scan-node.h File be/src/exec/data-source-scan-node.h: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exec/data-source-scan-node.h@100 PS1, Line 100: Status MaterializeNextRow(RuntimeState* state, MemPool* mem_pool, Tuple* tuple); What do you think about passing cctz::time_zone* instead of RuntimeState*? Can you add a comment for the new parameter and it's purpose? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/expr-test.cc@6398 PS1, Line 6398: const char* local_tz_name = "PST8PDT"; : ScopedTimeZoneOverride time_zone(local_tz_name); : const cctz::time_zone* local_tz = TimezoneDatabase::FindTimezone(local_tz_name); : DCHECK(local_tz != nullptr); Have you considered moving this to a function or macro or such? As I see you do this same thing 3 times. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@a197 PS1, Line 197: Nice that we can get rid of this magic :) http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@73 PS1, Line 73: from_cs Might be just me but I don't really find the names from_cs, to_cs and from_tp too self-descriptive. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@76 PS1, Line 76: auto from_tp = cctz::convert(from_cs, TimezoneDatabase::GetUtcTimezone()); : auto to_cs = cctz::convert(from_tp, *timezone); I think it would worth writing a comment why the two cctz::convert() calls are needed. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@79 PS1, Line 79: // Check if resulting timestamp is within range In my opinion this comment doesn't add extra value as the name of the function states the same. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@113 PS1, Line 113: context->AddWarning(ss.str().c_str()); : return ts_val; : } : this seems duplicate code (TimestampFunctions::FromUtc) http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timestamp-functions.cc@140 PS1, Line 140: context->AddWarning(msg.c_str()); : return TimestampVal::null(); : } : : // Create 'return_date' and 'return_time' from 'to_cs'. I think this conversion could go to a function as it seems duplicate for me with the same in TimestampFunctions::FromUtc. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@21 PS1, Line 21: #include shouldn't we use the one from std? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@35 PS1, Line 35: string& GetPath() ? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@47 PS1, Line 47: } FindTimezone() returns pointer while GetUtcTimezone() returns reference to a timezone. Can these be unified? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@54 PS1, Line 54: static const cctz::time_zone UTC_TIMEZONE_; Could you add a comment what the string param is used for? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@62 PS1, Line 62: /// location. As I see you wrote comments for these function in the .cc file. Could you move them to the header? http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.h@70 PS1, Line 70: db. zone_abbrev http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@10 PS1, Line 10: // Unrelated to your change but shouldn't we replace this to the Apache header other files use? Same goes for the header. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@103 PS1, Line 103: // Returns 'true' if path 'a' starts with path 'b'. If 'relative' is not nullptr, it will FileSystemUtil would be a better place for this, I think. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@105 PS1, Line 105: PathStartsWith I have the feeling that this function should only decide if path 'b' is the prefix of path 'a'. Returning the remainder after 'b' is something that I think should be out of this functions scope. I think this mix of responsibilities is the reason the name of
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 2: (4 comments) http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.h@31 PS2, Line 31: class TimezoneDatabase { My general feeling about this class is that a bit more transparency would add great value. What I mean is that some additional class level comments would really help e.g. what is the exact format of the inputs to this class, what sources it can have, in nutshell what transformation is done on that data, and how it uses the processed data later on. What do you think? http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@420 PS2, Line 420: /* = nullptr */ drop this comment http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/timezone_db.cc@495 PS2, Line 495: ZONEINFO_DIR nit: ZONE_INFO_DIR? http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/runtime/timestamp-value.h@101 PS2, Line 101: static TimestampValue FromUnixTime(time_t unix_time, const cctz::time_zone* local_tz) { Do you think that mentioning the new param for these functions would add extra value? -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 2 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 20 Apr 2018 12:00:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 2: (5 comments) http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/filesystem-util.h File be/src/util/filesystem-util.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/filesystem-util.h@57 PS2, Line 57: static Status GetRealPath( Shouldn't you mentioned that this should be called on sym links? (or do I misunderstand something?) I think that what a "real path" is should need some further explanation. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/filesystem-util.h@60 PS2, Line 60: Is it is nit: if it is http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/hdfs-util.h File be/src/util/hdfs-util.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/hdfs-util.h@59 PS2, Line 59: /// Returns basename of 'path'. Could you add an example to the comment? http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.h File be/src/util/time.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.h@123 PS2, Line 123: /// Converts input microseconds-since-epoch to date-time string in 'tz' time zone. Could you mention 'p' as well? http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.cc File be/src/util/time.cc: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/util/time.cc@167 PS2, Line 167: const char* fmt = (p == TimePrecision::Millisecond) ? fmt_millisec : For me this 3 layers of embedded ternary operators isn't that readable. What about a switch? -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 2 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 20 Apr 2018 14:28:51 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 2: (55 comments) http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc File be/src/benchmarks/convert-timestamp-benchmark.cc: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@150 PS2, Line 150: void AddTestDataDateTimes(vector& data, int n, const string& startstr) { > Since it's in a benchmark it doesn't really matter, anyway, some nit commen Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@172 PS2, Line 172: TestData > This looks like a fairly general class to me that could move to util/benchm 'measure_multithreaded_elapsed_time' function is not that general, it is used here as a quick and dirty way to verify that glibc calls are executed in a serial fashion even in a multithreaded environment. Because of that I would like to keep this class here. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@174 PS2, Line 174: > > > Nit: since C++11 you don't need to put spaces between right angle brackets. Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@228 PS2, Line 228: const shared_ptr > data_ > Nit: I think using 'const vector&' would be simpler. I don't really s True, I rewrote this portion of the code so many times that I lost track of where I was going with it :) http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@319 PS2, Line 319: boost_throw_if_date_out_of_range(local_time.date()); > Is this function call really needed here? Don't we trust boost that it vali This is how the function was implemented originally. Since the point of this benchmark program is to compare the old implementation with the new one, I figured I shouldn't change the old code. I think the call is necessary, because boost might not validate the date range until you call the gregorian::date accessors. It is confusing. http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@444 PS2, Line 444: time_t utc = > We could replace this with boost_utc_to_unix_time. This conversion should b This is how UtcToLocal was implemented originally. The goal of this benchmark program is to compare the original implementation with the new one (including the glue code). http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@530 PS2, Line 530: // : // Test UnixTimeToUtcPtime (boost is expected to be faster than CCTZ) : // : : // boost : boost::pos > I think that this a bit misleading, as boost_unix_time_to_utc_ptime never u Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/benchmarks/convert-timestamp-benchmark.cc@608 PS2, Line 608: if (cctz_utc_to_unix_data.get_result() != glibc_utc_to_unix_data.get_result()) { : cerr << "cctz/glibc utc_to_unix results do not match!" << endl; : return 1; : } : if (boost_utc_to_unix_data.get_result() != glibc_utc_to_unix_data.get_result()) { : cerr << "boost/glibc utc_to_unix results do not match!" << endl; : return 1; : } > The other benchmarks don't need this validity check? Done (although passing a vector of TestData to the helper function was not feasible as the different TestData classes are instantiated with a different converter functions, therefore they are not "compatible") http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exec/data-source-scan-node.h File be/src/exec/data-source-scan-node.h: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exec/data-source-scan-node.h@100 PS1, Line 100: Status MaterializeNextRow(RuntimeState* state, MemPool* mem_pool, Tuple* tuple); > What do you think about passing cctz::time_zone* instead of RuntimeState*? Done http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/decimal-operators.h File be/src/exprs/decimal-operators.h: http://gerrit.cloudera.org:8080/#/c/9986/2/be/src/exprs/decimal-operators.h@168 PS2, Line 168: /// local time in 'local_tz' time-zone). Rounds instead of truncating if 'round' is true. > nit: long line Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/expr-test.cc@6398 PS1, Line 6398: const char* local_tz_name = "PST8PDT"; : ScopedTimeZoneOverride time_zone(local_tz_name); : const cctz::time_zone* local_tz = TimezoneDatabase::FindTimezone(local_tz_name); : DCHECK(local_tz != nullptr); > Have you c
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Hello Gabor Kaszab, Zoltan Borok-Nagy, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9986 to look at the new patch set (#3). Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to specify an HDFS/S3/ADLS location that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h M bin/bootstrap_toolchain.py M bin/impala-config.sh A cmake_modules/FindCctz.cmake M common/thrift/ImpalaInternalService.thrift M common/thrift/metrics.json M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/abbrev.conf A testdata/tzdb/zoneinfo/AmerICA/ArgeNTINA/MendOZA A testdata/tzdb/zoneinfo/AmerICA/CancUN A testdata/tzdb/zoneinfo/UTC M testdata/workloads/functional-query/queries/QueryTest/exprs.test A tests/custom_cluster/custom_tzdb.py 53 files changed, 2,542 insertions(+), 1,093 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/3 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 3 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 3: (10 comments) http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/exprs/timestamp-functions.cc@95 PS3, Line 95: time_t unix_time; : if (UNLIKELY(!ts_value.UtcToUnixTime(&unix_time))) return TimestampVal::null(); : cctz::time_point from_tp = UnixSecondsToTimePoint(unix_time); : : // Convert 'from_tp' time_point to civil_second assuming 'timezone' time-zone. : cctz::civil_second to_cs = cctz::convert(from_tp, *timezone); : : if (UNLIKELY(CheckIfDateOutOfRange(cctz::civil_day(to_cs { : const string& msg = Substitute( : "Timestamp '$0' did not convert to a valid local time in timezone '$1'", : ts_value.ToString(), tz_string_value.DebugString()); : context->AddWarning(msg.c_str()); : return TimestampVal::null(); : } : : // Note that 'to_cs' has second granularity. Since time-zone rules do not affect : // fractional seconds, the fractional second part of the returned TimestampVal should be : // equal to ts_value.time().fractional_seconds(). : return CivilSecondToTimestampVal(to_cs, ts_value.time().fractional_seconds()); This logic is the same as TimestampValue::UtcToLocal(), plus warning if the resulting timestamp is not valid. As local time and generic timezone conversions are the same now, it would make sense to keep them at one place. http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/exprs/timestamp-functions.cc@144 PS3, Line 144: cctz::time_point from_tp = from_cl.pre; : : // Convert 'from_tp' time_point to civil_second assuming 'UTC' time-zone. : cctz::civil_second to_cs = cctz::convert(from_tp, TimezoneDatabase::GetUtcTimezone()); : : if (UNLIKELY(CheckIfDateOutOfRange(cctz::civil_day(to_cs { : const string& msg = : Substitute("Timestamp '$0' in timezone '$1' could not be converted to UTC", : ts_value.ToString(), tz_string_value.DebugString()); : context->AddWarning(msg.c_str()); : return TimestampVal::null(); : } : : // Note that 'to_cs' has second granularity. Since time-zone rules do not affect : // fractional seconds, the fractional second part of the returned TimestampVal should be : // equal to ts_value.time().fractional_seconds(). : return CivilSecondToTimestampVal(to_cs, ts_value.time().fractional_seconds()); Similarly to my comment at line 113, this logic could be moved to a TimestampValue function. Removing these calculations from this file would mean that the helper functions (line 47-74) could be removed too. http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/runtime/timestamp-value.cc@117 PS3, Line 117: time_ = boost::posix_time::time_duration(to_cs.hour(), to_cs.minute(), to_cs.second(), nit: long line http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h File be/src/util/filesystem-util.h: http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@66 PS3, Line 66: iff typo: if http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@70 PS3, Line 70: iff typo: if http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@71 PS3, Line 71: path Maybe writing "string" instead of "path" express better that no file system is accessed. http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@79 PS3, Line 79: path Same as line 71. http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc File be/src/util/time.cc: http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc@165 PS3, Line 165: nit: extra space http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc@168 PS3, Line 168: nit: extra space http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc@171 PS3, Line 171: nit: extra space -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 3 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 3: (10 comments) http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/exprs/timestamp-functions.cc@95 PS3, Line 95: time_t unix_time; : if (UNLIKELY(!ts_value.UtcToUnixTime(&unix_time))) return TimestampVal::null(); : cctz::time_point from_tp = UnixSecondsToTimePoint(unix_time); : : // Convert 'from_tp' time_point to civil_second assuming 'timezone' time-zone. : cctz::civil_second to_cs = cctz::convert(from_tp, *timezone); : : if (UNLIKELY(CheckIfDateOutOfRange(cctz::civil_day(to_cs { : const string& msg = Substitute( : "Timestamp '$0' did not convert to a valid local time in timezone '$1'", : ts_value.ToString(), tz_string_value.DebugString()); : context->AddWarning(msg.c_str()); : return TimestampVal::null(); : } : : // Note that 'to_cs' has second granularity. Since time-zone rules do not affect : // fractional seconds, the fractional second part of the returned TimestampVal should be : // equal to ts_value.time().fractional_seconds(). : return CivilSecondToTimestampVal(to_cs, ts_value.time().fractional_seconds()); > This logic is the same as TimestampValue::UtcToLocal(), plus warning if the Done http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/exprs/timestamp-functions.cc@144 PS3, Line 144: cctz::time_point from_tp = from_cl.pre; : : // Convert 'from_tp' time_point to civil_second assuming 'UTC' time-zone. : cctz::civil_second to_cs = cctz::convert(from_tp, TimezoneDatabase::GetUtcTimezone()); : : if (UNLIKELY(CheckIfDateOutOfRange(cctz::civil_day(to_cs { : const string& msg = : Substitute("Timestamp '$0' in timezone '$1' could not be converted to UTC", : ts_value.ToString(), tz_string_value.DebugString()); : context->AddWarning(msg.c_str()); : return TimestampVal::null(); : } : : // Note that 'to_cs' has second granularity. Since time-zone rules do not affect : // fractional seconds, the fractional second part of the returned TimestampVal should be : // equal to ts_value.time().fractional_seconds(). : return CivilSecondToTimestampVal(to_cs, ts_value.time().fractional_seconds()); > Similarly to my comment at line 113, this logic could be moved to a Timesta Done http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/runtime/timestamp-value.cc@117 PS3, Line 117: time_ = boost::posix_time::time_duration(to_cs.hour(), to_cs.minute(), to_cs.second(), > nit: long line Done http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h File be/src/util/filesystem-util.h: http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@66 PS3, Line 66: iff > typo: if "iff" stands for "if and only if". http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@70 PS3, Line 70: iff > typo: if Same as above http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@71 PS3, Line 71: path > Maybe writing "string" instead of "path" express better that no file system Done http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@79 PS3, Line 79: path > Same as line 71. Done http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc File be/src/util/time.cc: http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc@165 PS3, Line 165: > nit: extra space Done http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc@168 PS3, Line 168: > nit: extra space Done http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/time.cc@171 PS3, Line 171: > nit: extra space Done -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 3 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 08 May 2018 19:35:52 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Hello Gabor Kaszab, Zoltan Borok-Nagy, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9986 to look at the new patch set (#4). Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to specify an HDFS/S3/ADLS location that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h M bin/bootstrap_toolchain.py M bin/impala-config.sh A cmake_modules/FindCctz.cmake M common/thrift/ImpalaInternalService.thrift M common/thrift/metrics.json M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/abbrev.conf A testdata/tzdb/zoneinfo/AmerICA/ArgeNTINA/MendOZA A testdata/tzdb/zoneinfo/AmerICA/CancUN A testdata/tzdb/zoneinfo/UTC M testdata/workloads/functional-query/queries/QueryTest/exprs.test A tests/custom_cluster/custom_tzdb.py 53 files changed, 2,531 insertions(+), 1,096 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/4 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 4 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 4: Code-Review+1 (6 comments) Few nits otherwise looks good to me. The LocalToUtc performance part is optional - as it does not affect other parts of the code, it can be easily done later when general structure is already accepted by other reviewers. http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc File be/src/exprs/timestamp-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc@526 PS4, Line 526: const string& tz_name = (start_lookup.abbr != nullptr) ? start_lookup.abbr : : context->impl()->state()->local_time_zone()->name(); What is the goal of this logic? To print timezone abbreviations instead of the full names, or to distinguish between summer/winter time, or both? A comment would be nice, and maybe the logic could be moved to a TimestampValue member function like string GetTimezoneName(Timezone*). http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@93 PS4, Line 93: inline bool CheckIfDateOutOfRange(const cctz::civil_day& date) { : static const cctz::civil_day max_date(TimestampFunctions::MAX_YEAR, 12, 31); : static const cctz::civil_day min_date(TimestampFunctions::MIN_YEAR, 1, 1); : return date < min_date || date > max_date; : } This could be simpler and possibly faster by expecting a cctz::civil_second argument and check if 1400<=year<1. http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@128 PS4, Line 128: cctz explains pretty well the handling of dst boundaries, maybe we could add a link to it, for example to https://github.com/google/cctz/blob/a2dd3d0fbc811fe0a1d4d2dbb0341f1a3d28cb2a/include/cctz/time_zone.h#L147 http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@129 PS4, Line 129: // In case of ambiguity invalidate TimestampValue. : const cctz::time_zone::civil_lookup from_cl = local_tz->lookup(from_cs); : if (UNLIKELY(from_cl.kind != cctz::time_zone::civil_lookup::UNIQUE)) { : SetToInvalidDateTime(); : } else { : cctz::time_point from_tp = from_cl.pre; : : // Convert 'from_tp' time_point to civil_second assuming 'UTC' time-zone. : cctz::civil_second to_cs = cctz::convert(from_tp, TimezoneDatabase::GetUtcTimezone()); : : // boost::gregorian::date() throws boost::gregorian::bad_year if year is not in the : // 1400.. range. Need to check validity before creating the date object. : if (UNLIKELY(CheckIfDateOutOfRange(cctz::civil_day(to_cs { I may be possible to get TimestampValue from cctz::time_zone::civil_lookup in a faster way - splitting from_tp to a day part (since a constant date) and the remainder seconds part is enough for us and should be faster then getting cctz::civil_second (which contains year/month/day split). The code could look something like this: int64 secs_since_base = from_tp - BASETIME_AS_CCTZ_SYS_SEC; time_=sec_since_base%(24*60*60)+time_.fractional_seconds(); int32 days_since_base = sec_since_base/(24*60*60); if(out_of_range(days_since_base)) SetToInvalidDateTime(); date_ = days_since_base - BASEDATE_AS_BOOST_GREG_DATE; http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@146 PS4, Line 146: time_ = boost::posix_time::time_duration(to_cs.hour(), to_cs.minute(), to_cs.second(), nit: long line http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h File be/src/util/filesystem-util.h: http://gerrit.cloudera.org:8080/#/c/9986/3/be/src/util/filesystem-util.h@66 PS3, Line 66: iff > "iff" stands for "if and only if". Thanks for the explanation! -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 4 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 09 May 2018 11:47:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Hello Gabor Kaszab, Zoltan Borok-Nagy, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9986 to look at the new patch set (#5). Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to specify an HDFS/S3/ADLS location that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h M bin/bootstrap_toolchain.py M bin/impala-config.sh A cmake_modules/FindCctz.cmake M common/thrift/ImpalaInternalService.thrift M common/thrift/metrics.json M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/abbrev.conf A testdata/tzdb/zoneinfo/AmerICA/ArgeNTINA/MendOZA A testdata/tzdb/zoneinfo/AmerICA/CancUN A testdata/tzdb/zoneinfo/UTC M testdata/workloads/functional-query/queries/QueryTest/exprs.test A tests/custom_cluster/test_custom_tzdb.py 53 files changed, 2,556 insertions(+), 1,096 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/5 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 5 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 4: (5 comments) http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc File be/src/exprs/timestamp-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc@526 PS4, Line 526: const string& tz_name = (start_lookup.abbr != nullptr) ? start_lookup.abbr : : context->impl()->state()->local_time_zone()->name(); > What is the goal of this logic? To print timezone abbreviations instead of Both, I guess. I just replicated the original behavior: localtime_r() sets tzone.tm_zone to the time-zone abbreviation. Added a comment. I don't think that the TimestampValue class would be a good place for this function. The timezone abbreviation corresponds to 'start_unix_millis', which is not a TimestampValue. I'll keep this logic here for now. http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@93 PS4, Line 93: inline bool CheckIfDateOutOfRange(const cctz::civil_day& date) { : static const cctz::civil_day max_date(TimestampFunctions::MAX_YEAR, 12, 31); : static const cctz::civil_day min_date(TimestampFunctions::MIN_YEAR, 1, 1); : return date < min_date || date > max_date; : } > This could be simpler and possibly faster by expecting a cctz::civil_second Done http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@128 PS4, Line 128: > cctz explains pretty well the handling of dst boundaries, maybe we could ad Done http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@129 PS4, Line 129: // In case of ambiguity invalidate TimestampValue. : const cctz::time_zone::civil_lookup from_cl = local_tz->lookup(from_cs); : if (UNLIKELY(from_cl.kind != cctz::time_zone::civil_lookup::UNIQUE)) { : SetToInvalidDateTime(); : } else { : cctz::time_point from_tp = from_cl.pre; : : // Convert 'from_tp' time_point to civil_second assuming 'UTC' time-zone. : cctz::civil_second to_cs = cctz::convert(from_tp, TimezoneDatabase::GetUtcTimezone()); : : // boost::gregorian::date() throws boost::gregorian::bad_year if year is not in the : // 1400.. range. Need to check validity before creating the date object. : if (UNLIKELY(CheckIfDateOutOfRange(cctz::civil_day(to_cs { > I may be possible to get TimestampValue from cctz::time_zone::civil_lookup Done http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/runtime/timestamp-value.cc@146 PS4, Line 146: time_ = boost::posix_time::time_duration(to_cs.hour(), to_cs.minute(), to_cs.second(), > nit: long line Done -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 4 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 10 May 2018 15:28:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 5: (19 comments) Thanks for dealing with my previous comments, Attila! I filed some more but basically I'm fine with the changes and feel free to start involving someone with more experience on this topic. http://gerrit.cloudera.org:8080/#/c/9986/4/CMakeLists.txt File CMakeLists.txt: http://gerrit.cloudera.org:8080/#/c/9986/4/CMakeLists.txt@281 PS4, Line 281: Cctz nit: CCTZ http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@139 PS4, Line 139: new_time_zone_(time_zone), new_tz_ >From reading the names of these 2 variables it's not clear what de difference >is. Can you have a new_time_zone_ and a new_time_zone_name_ or such? http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@140 PS4, Line 140: /*overwrite*/ Do you need this comment? http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@153 PS4, Line 153: Timezone * nit: Timezone* Expect..() http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@155 PS4, Line 155: new_tz_ = TimezoneDatabase::FindTimezone(new_time_zone_); I wonder if it makes sense to do this assignment in the constructor and then this function can be changed to something like "getTimezone" that simply returns the corresponding member. http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc File be/src/exprs/timestamp-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc@503 PS4, Line 503: namespace Why did you need this namespace? http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc@504 PS4, Line 504: nline What is the plan to get rid of the "Duplicate code" TODOs in this review? http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions.cc@55 PS4, Line 55: // This should raise some sort of error or at least return null. Hive just ignores it. Shouldn't this be a TODO? http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions.cc@87 PS4, Line 87: // This should raise some sort of error or at least return null. Hive just ignores it. Same as above http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc File be/src/exprs/timezone_db-test.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc@57 PS4, Line 57: TzAbbev nit: TzAbbrev? http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc@68 PS4, Line 68: // Abbreviations must start with an uppercase letter. If it has to start with an uppercase letter, can we add a test this with an input that would be valid if the first letter was uppercase? e.g. "pST", "pst", "singapore" etc. http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc@105 PS4, Line 105: // Misformatted time-zone names. Can you again play around with upper vs lower case letters here? http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h@1 PS4, Line 1: // with the License. You may obtain a copy of the License at Hmm, is the top of the Apache comment missing? http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h@23 PS4, Line 23: /// 'TimezoneDatabase' class contains functions to load and access the IANA time-zone Nice description, thx :) http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h@91 PS4, Line 91: tz_seg nit: might be just my preference but tz_segment is still short and I think better to read. Up to you. http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@147 PS1, Line 147: > Returning a pointer just to be able to signal failure with nullptr seems co What I meant is that you could get rid of the offset_sec paramater if you changed the return value to int64_t* and you would still be able to detect failure if the return value is null. Not a big deal, though. Your choice. http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.cc@365 PS4, Line 365: hdfsFile hdfs_file = hdfsOpenFile( Just for my information: LoadZoneInfoFromHdfs() copies the zone info file to a local dir, meanwhile this function uses hdfsOpenFile to get the abbrev data. Is there a reason that they don't follow the same approac
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 4: (18 comments) http://gerrit.cloudera.org:8080/#/c/9986/4/CMakeLists.txt File CMakeLists.txt: http://gerrit.cloudera.org:8080/#/c/9986/4/CMakeLists.txt@281 PS4, Line 281: Cctz > nit: CCTZ Done http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@139 PS4, Line 139: new_time_zone_(time_zone), new_tz_ > From reading the names of these 2 variables it's not clear what de differen Done http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@140 PS4, Line 140: /*overwrite*/ > Do you need this comment? Removed it http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@153 PS4, Line 153: Timezone * > nit: Timezone* Expect..() Done http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/expr-test.cc@155 PS4, Line 155: new_tz_ = TimezoneDatabase::FindTimezone(new_time_zone_); > I wonder if it makes sense to do this assignment in the constructor and the The main reason I implemented the class this way was that some tests that use 'ScopedTimeZoneOverride' don't need the actual Timezone pointer. On the other hand, checking always that the timezone name is valid doesn't hurt anyone. Done. http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc File be/src/exprs/timestamp-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc@503 PS4, Line 503: namespace > Why did you need this namespace? Putting stuff into an unnamed namespace is a common C++ idiom: it says that everything in the unnamed namespace is "local" to this translation unit. They're not visible from the outside, and their names won't clash with names in other translation units. Impala uses unnamed namespaces elsewhere too. http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions-ir.cc@504 PS4, Line 504: / TODO > What is the plan to get rid of the "Duplicate code" TODOs in this review? I removed the TODO comment. I cannot think of a better place for this function. We can create a shared class for this function only but that would be awkward. http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions.cc@55 PS4, Line 55: // This should raise some sort of error or at least return null. Hive just ignores it. > Shouldn't this be a TODO? Fixed the comment. http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timestamp-functions.cc@87 PS4, Line 87: // This should raise some sort of error or at least return null. Hive just ignores it. > Same as above Fixed the comment. http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc File be/src/exprs/timezone_db-test.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc@57 PS4, Line 57: TzAbbev > nit: TzAbbrev? Done http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc@68 PS4, Line 68: // Abbreviations must start with an uppercase letter. > If it has to start with an uppercase letter, can we add a test this with an Done http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db-test.cc@105 PS4, Line 105: // Misformatted time-zone names. > Can you again play around with upper vs lower case letters here? Done http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h@1 PS4, Line 1: // with the License. You may obtain a copy of the License at > Hmm, is the top of the Apache comment missing? Not sure what happened there. I probably inadvertently executed some mysterious vim command :) http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h@91 PS4, Line 91: tz_seg > nit: might be just my preference but tz_segment is still short and I think Done http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/1/be/src/exprs/timezone_db.cc@147 PS1, Line 147: > What I meant is that you could get rid of the offset_sec paramater if you c I understand. The problem is that then we would have to allocate an int64_t in the function and return a pointer to it wrapped in unique_ptr to avoid memory leaks. Seems more pain then gain. http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.cc@365 PS4, Line 365: hdfsFile hdfs_file = hdfsOpenFile( > Just for my information: LoadZoneInf
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/runtime/timestamp-value.cc@139 PS5, Line 139: // In case the resulting 'time_point' is ambiguous, we have to invalidate : // TimestampValue. : // 'civil_lookup' members and the details of handling ambiguity are described at: : // https://github.com/google/cctz/blob/a2dd3d0fbc811fe0a1d4d2dbb0341f1a3d28cb2a/ : // include/cctz/time_zone.h#L106 : if (UNLIKELY(from_cl.kind != cctz::time_zone::civil_lookup::UNIQUE) I have investigated a bit about this: - there is a Jira that complains about this behavior: https://issues.apache.org/jira/browse/IMPALA-3169 - Hive does not work like this, it returns a "valid" timestamp for repeated/skipped hours: select to_utc_timestamp(cast("2011-03-13 02:15:00" as timestamp), "America/Los_Angeles"), to_utc_timestamp(cast("2011-11-06 01:15:00" as timestamp), "America/Los_Angeles") result: 2011-03-13 10:15:00.0 2011-11-06 09:15:00.0 I think that we should do the same, at least for repeated values. I can imagine several valid queries where this would be the correct behavior, for example when we filter for a time interval. So I vote for solving IMPALA-3169 in this patch by choosing pre or post time in non UNIQUE cases too. If there are no test cases yet for skipped/repeated hours, then we should create some and expect the same results that Hive returns. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 5 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 11 May 2018 14:04:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Hello Gabor Kaszab, Zoltan Borok-Nagy, Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9986 to look at the new patch set (#6). Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to specify an HDFS/S3/ADLS location that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h M bin/bootstrap_toolchain.py M bin/impala-config.sh A cmake_modules/FindCctz.cmake M common/thrift/ImpalaInternalService.thrift M common/thrift/metrics.json M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/abbrev.conf A testdata/tzdb/zoneinfo/AmerICA/ArgeNTINA/MendOZA A testdata/tzdb/zoneinfo/AmerICA/CancUN A testdata/tzdb/zoneinfo/UTC M testdata/workloads/functional-query/queries/QueryTest/exprs.test A tests/custom_cluster/test_custom_tzdb.py 53 files changed, 2,569 insertions(+), 1,092 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/6 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 6 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 5: (2 comments) http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/runtime/runtime-state.h File be/src/runtime/runtime-state.h: http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/runtime/runtime-state.h@317 PS5, Line 317: /// Query-global timezone used as local timezone when executing the query. who owns it? Let's at least say "Not owned." http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/util/filesystem-util.h File be/src/util/filesystem-util.h: http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/util/filesystem-util.h@58 PS5, Line 58: Real Should we call it GetCanonicalPath()? I assume it's related to IsCanonicalPath(), and so would be nice to name them similarly (and group them together). Is it the case that IsCanonicalPath() always returns true for the *real_path returned by GetRealPath()? -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 5 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 11 May 2018 21:07:49 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to specify an HDFS/S3/ADLS location that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h M bin/bootstrap_toolchain.py M bin/impala-config.sh A cmake_modules/FindCctz.cmake M common/thrift/ImpalaInternalService.thrift M common/thrift/metrics.json M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/abbrev.conf A testdata/tzdb/zoneinfo/AmerICA/ArgeNTINA/MendOZA A testdata/tzdb/zoneinfo/AmerICA/CancUN A testdata/tzdb/zoneinfo/UTC M testdata/workloads/functional-query/queries/QueryTest/exprs.test A tests/custom_cluster/test_custom_tzdb.py 53 files changed, 2,575 insertions(+), 1,092 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/7 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 7 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 7: (2 comments) I focused mostly on the non-timestamp/timezone/time and test/infra parts. It looks fine to me. Would be good to get Gabor's to finish his review to +1 and Tim can do the final +2. Just a heads up regarding exceptions: in the past we've had a lot of issues with timestamp boost routines throwing exceptions for out of range values. You should make sure you exercise any path that can do that with tests. We generally either have to reason about why the boost function can't throw an exception (maybe we check the range before hand) or we wrap the boost call with try/catch so we don't expose the exception. Also IIRC, something to keep in mind is that codegen code can't properly handle try/catch, so in cases we needed to use that, we factored the try/catch code into native code and call out to it from the IR. Again, just a heads up, not sure if your change introduced any problem in this regard or not. http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/timestamp-functions-ir.cc File be/src/exprs/timestamp-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/timestamp-functions-ir.cc@502 PS7, Line 502: : namespace { : inline cctz::time_point UnixTimeToTimePoint(time_t t) { : return std::chrono::time_point_cast( : std::chrono::system_clock::from_time_t(0)) + cctz::sys_seconds(t); : } : : } : : StringVal TimestampFunctions::TimeOfDay(FunctionContext* context) { : const TimestampVal curr = Now(context); : if (curr.is_null) return StringVal::null(); : const string& day = ShortDayName(context, curr); : const string& month = ShortMonthName(context, curr); : IntVal dayofmonth = DayOfMonth(context, curr); : IntVal hour = Hour(context, curr); : IntVal min = Minute(context, curr); : IntVal sec = Second(context, curr); : IntVal year = Year(context, curr); : : // Calculate 'start' time point at which query execution started. : cctz::time_point start = UnixTimeToTimePoint( : context->impl()->state()->query_ctx().start_unix_millis / MILLIS_PER_SEC); : // Find 'tz_name' time-zone abbreviation that corresponds to 'local_time_zone' at : // 'start' time point. : cctz::time_zone::absolute_lookup start_lookup = : context->impl()->state()->local_time_zone()->lookup(start); : const string& tz_name = (start_lookup.abbr != nullptr) ? start_lookup.abbr : : context->impl()->state()->local_time_zone()->name(); any chance that can throw an exception? I believe our IR code can't properly handle try/catch, so if this can indeed throw an exception and needs to be wrapped in try/catch, it may need to be refactored so that this code lives in the native code and we call out to it from the IR. (Just a heads up, this may not be a problem here). http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/timestamp-value.inline.h File be/src/runtime/timestamp-value.inline.h: http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/timestamp-value.inline.h@59 PS7, Line 59: .days() in the past we've had issues where the boost date library can throw exceptions. I don't remember the details off hand and it may be that you are okay here given you've already checked HasDateAndTime() and if we ensure date_ is within range, but just wanted to mention it. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 7 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 15 May 2018 17:34:57 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 7: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 7 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 15 May 2018 17:35:33 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 7: Code-Review+1 (4 comments) Thanks Attila for addressing my comments! I'm fine with the change. http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/timezone_db-test.cc File be/src/exprs/timezone_db-test.cc: http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/timezone_db-test.cc@119 PS7, Line 119: TestInvalidTimezoneAbbrevName("pST"); Four of these are already tested by TimezoneDbNamesTest. No need to test the here as well, I think. http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h File be/src/exprs/timezone_db.h: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.h@1 PS4, Line 1: // Licensed to the Apache Software Foundation (ASF) under one > Not sure what happened there. I probably inadvertently executed some myster :D http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/4/be/src/exprs/timezone_db.cc@365 PS4, Line 365: RETURN_IF_ERROR( > LoadZoneInfoFromHdfs() uses cctz::load_time_zone() to load time-zone files Thx! http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/runtime/runtime-state.cc File be/src/runtime/runtime-state.cc: http://gerrit.cloudera.org:8080/#/c/9986/5/be/src/runtime/runtime-state.cc@136 PS5, Line 136: local_time_zone_ = &TimezoneDatabase::GetUtcTimezone(); > True, but I wanted to be more explicit here. thanks for the explanation. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 7 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 16 May 2018 14:43:24 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 7: (15 comments) This is going to be a big improvement. Did a pass, mainly had comments about clarifying internal interfaces. http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exec/data-source-scan-node.h File be/src/exec/data-source-scan-node.h: http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exec/data-source-scan-node.h@101 PS7, Line 101: /// local time-zone for materializing 'TYPE_TIMESTAMP' slots. Can local_tz be NULL? Maybe make it const& if not. http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exec/parquet-column-readers.cc@603 PS7, Line 603: if (dst_ts->HasDateAndTime()) dst_ts->UtcToLocal(parent_->state_->local_time_zone()); Would it make sense to cache the timezone locally in the ScalarColumnReader? That would save at least 2 pointer indirections per value, which could be meaningful in this part of the code. http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/decimal-operators.h File be/src/exprs/decimal-operators.h: http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/decimal-operators.h@172 PS7, Line 172: const T& decimal_value, int scale, bool round, const Timezone* local_tz); Is it nullable? http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/expr-test.cc@161 PS7, Line 161: const Timezone *new_tz_; nit: Timezone* new_tz_ http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/exprs/timezone_db.cc@48 PS7, Line 48: "HDFS/S3A/ADLS path to load IANA time-zone database from."); nn] http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/runtime-state.h File be/src/runtime/runtime-state.h: http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/runtime-state.h@317 PS7, Line 317: /// Query-global timezone used as local timezone when executing the query. Can this be NULL? Would be good to document. We should maybe return a const& above if it can't be NULL so that it's self-documenting. http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/runtime-state.cc File be/src/runtime/runtime-state.cc: http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/runtime-state.cc@131 PS7, Line 131: LIKELY LIKELY won't make a measurable difference outside of perf-critical code, I find it adds noise in cases like this. The codebase isn't very consistent about it but I'm trying to stop the pattern from spreading :) http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/runtime-state.cc@134 PS7, Line 134: LOG(ERROR) << "Failed to find local timezone " << query_ctx().local_time_zone I think this should be a WARNING-level log. We should reserve ERROR for really severe errors, whereas this might flood logs. I think we should also add a warning to the query warnings so that it's surfaced to the user, not just the admin. http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/timestamp-value.h@98 PS7, Line 98: static TimestampValue FromUnixTime(time_t unix_time, const Timezone* local_tz) { Here and below it isn't clear if local_tz is allowed to be NULL. If it can be NULL, can we extend comments to explain what happens if that case. If it can't, we could make it self-documenting by making it a const& instead of a const*. http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/runtime/timestamp-value.cc@100 PS7, Line 100: CheckIfDateOutOfRange Maybe IsDateOutOfRange(). With "Check" it isn't clear whether returning true means that it's out of range or if it means that the timestamp passed the check. http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/util/time.h File be/src/util/time.h: http://gerrit.cloudera.org:8080/#/c/9986/7/be/src/util/time.h@122 PS7, Line 122: Timezone Maybe const& if it's not allowed to be NULL. http://gerrit.cloudera.org:8080/#/c/9986/7/cmake_modules/FindCctz.cmake File cmake_modules/FindCctz.cmake: http://gerrit.cloudera.org:8080/#/c/9986/7/cmake_modules/FindCctz.cmake@27 PS7, Line 27: $ENV{IMPALA_HOME}/thirdparty/cctz-$ENV{IMPALA_CCTZ_VERSION}/src) We can get rid of the thirdparty/ stuff. That's just left over from when Impala stored vendored versions of these dependencies in thirdpart/ http://gerrit.cloudera.org:8080/#/c/9986/7/common/thrift/ImpalaInternalService.thrift File common/thrift/ImpalaInternalService.thrift:
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#8). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to specify an HDFS/S3/ADLS location that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h M bin/bootstrap_toolchain.py M bin/impala-config.sh A cmake_modules/FindCctz.cmake M common/thrift/ImpalaInternalService.thrift M common/thrift/metrics.json M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/abbrev.conf A testdata/tzdb/zoneinfo/AmerICA/ArgeNTINA/MendOZA A testdata/tzdb/zoneinfo/AmerICA/CancUN A testdata/tzdb/zoneinfo/UTC M testdata/workloads/functional-query/queries/QueryTest/exprs.test A tests/custom_cluster/test_custom_tzdb.py 53 files changed, 2,603 insertions(+), 1,095 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/8 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 8 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#9). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to specify an HDFS/S3/ADLS location that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h M bin/bootstrap_toolchain.py M bin/impala-config.sh A cmake_modules/FindCctz.cmake M common/thrift/ImpalaInternalService.thrift M common/thrift/metrics.json M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/2017c/Africa/Abidjan A testdata/tzdb/2017c/Africa/Accra A testdata/tzdb/2017c/Africa/Addis_Ababa A testdata/tzdb/2017c/Africa/Algiers A testdata/tzdb/2017c/Africa/Asmara A testdata/tzdb/2017c/Africa/Asmera A testdata/tzdb/2017c/Africa/Bamako A testdata/tzdb/2017c/Africa/Bangui A testdata/tzdb/2017c/Africa/Banjul A testdata/tzdb/2017c/Africa/Bissau A testdata/tzdb/2017c/Africa/Blantyre A testdata/tzdb/2017c/Africa/Brazzaville A testdata/tzdb/2017c/Africa/Bujumbura A testdata/tzdb/2017c/Africa/Cairo A testdata/tzdb/2017c/Africa/Casablanca A testdata/tzdb/2017c/Africa/Ceuta A testdata/tzdb/2017c/Africa/Conakry A testdata/tzdb/2017c/Africa/Dakar A testdata/tzdb/2017c/Africa/Dar_es_Salaam A testdata/tzdb/2017c/Africa/Djibouti A testdata/tzdb/2017c/Africa/Douala A testdata/tzdb/2017c/Africa/El_Aaiun A testdata/tzdb/2017c/Africa/Freetown A testdata/tzdb/2017c/Africa/Gaborone A testdata/tzdb/2017c/Africa/Harare A testdata/tzdb/2017c/Africa/Johannesburg A testdata/tzdb/2017c/Africa/Juba A testdata/tzdb/2017c/Africa/Kampala A testdata/tzdb/2017c/Africa/Khartoum A testdata/tzdb/2017c/Africa/Kigali A testdata/tzdb/2017c/Africa/Kinshasa A testdata/tzdb/2017c/Africa/Lagos A testd
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 9: > > Uploaded patch set 9. > > Patch -set 9 contains the following changes: > - Added a full timezone db to testdata/tzdb. > - End-to-end tests and BE-tests were changed to use this timezone > db. This was necessary because some timezone-tests were failing on > older jenkins workers that had an older tzdata package installed. It might be a good idea to store the timezone-db files in one .tar file and extract them before running the tests. What do you think? -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 9 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 25 May 2018 16:23:16 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 9: > > > Uploaded patch set 9. > > > > Patch -set 9 contains the following changes: > > - Added a full timezone db to testdata/tzdb. > > - End-to-end tests and BE-tests were changed to use this timezone > > db. This was necessary because some timezone-tests were failing > on > > older jenkins workers that had an older tzdata package installed. > > It might be a good idea to store the timezone-db files in one .tar > file and extract them before running the tests. What do you think? I agree, .taring or compressing the tz db would be much better, if it does not make the code too complicated. Having less file would make the review more readable, and would also make the tz db consume much less space on hdfs, as the many small files will be rounded up to hdfs block size. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 9 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 25 May 2018 16:33:01 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#10). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_dir) to impalad to specify an HDFS/S3/ADLS location that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h M bin/bootstrap_toolchain.py M bin/impala-config.sh A cmake_modules/FindCctz.cmake M common/thrift/ImpalaInternalService.thrift M common/thrift/metrics.json M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/2017c/Africa/Abidjan A testdata/tzdb/2017c/Africa/Accra A testdata/tzdb/2017c/Africa/Addis_Ababa A testdata/tzdb/2017c/Africa/Algiers A testdata/tzdb/2017c/Africa/Asmara A testdata/tzdb/2017c/Africa/Asmera A testdata/tzdb/2017c/Africa/Bamako A testdata/tzdb/2017c/Africa/Bangui A testdata/tzdb/2017c/Africa/Banjul A testdata/tzdb/2017c/Africa/Bissau A testdata/tzdb/2017c/Africa/Blantyre A testdata/tzdb/2017c/Africa/Brazzaville A testdata/tzdb/2017c/Africa/Bujumbura A testdata/tzdb/2017c/Africa/Cairo A testdata/tzdb/2017c/Africa/Casablanca A testdata/tzdb/2017c/Africa/Ceuta A testdata/tzdb/2017c/Africa/Conakry A testdata/tzdb/2017c/Africa/Dakar A testdata/tzdb/2017c/Africa/Dar_es_Salaam A testdata/tzdb/2017c/Africa/Djibouti A testdata/tzdb/2017c/Africa/Douala A testdata/tzdb/2017c/Africa/El_Aaiun A testdata/tzdb/2017c/Africa/Freetown A testdata/tzdb/2017c/Africa/Gaborone A testdata/tzdb/2017c/Africa/Harare A testdata/tzdb/2017c/Africa/Johannesburg A testdata/tzdb/2017c/Africa/Juba A testdata/tzdb/2017c/Africa/Kampala A testdata/tzdb/2017c/Africa/Khartoum A testdata/tzdb/2017c/Africa/Kigali A testdata/tzdb/2017c/Africa/Kinshasa A testdata/tzdb/2017c/Africa/Lagos A test
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 10: I do not see any more low hanging fruits for performance improvement. Some overhead could be removed by modifying CCTZ, but this is out of the scope of this change, so I created a follow up Jira: IMPALA-7085: Consider patching Google/CCTZ for Impala's need -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 10 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 28 May 2018 16:20:05 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 10: > > > > Uploaded patch set 9. > > > > > > Patch -set 9 contains the following changes: > > > - Added a full timezone db to testdata/tzdb. > > > - End-to-end tests and BE-tests were changed to use this > timezone > > > db. This was necessary because some timezone-tests were failing > > on > > > older jenkins workers that had an older tzdata package > installed. > > > > It might be a good idea to store the timezone-db files in one > .tar > > file and extract them before running the tests. What do you > think? > > I agree, .taring or compressing the tz db would be much better, if > it does not make the code too complicated. Having less file would > make the review more readable, and would also make the tz db > consume much less space on hdfs, as the many small files will be > rounded up to hdfs block size. Extracting files from a .tar file can be tricky. Probably we would have to add libtar library to the native-toolchain to handle .tar files. Alternatively we can store timezone files in a JAR archive instead. The BE can call into the java FE to extract files from it. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 10 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 30 May 2018 15:25:00 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 10: > > > > > Uploaded patch set 9. > > > > > > > > Patch -set 9 contains the following changes: > > > > - Added a full timezone db to testdata/tzdb. > > > > - End-to-end tests and BE-tests were changed to use this > > timezone > > > > db. This was necessary because some timezone-tests were > failing > > > on > > > > older jenkins workers that had an older tzdata package > > installed. > > > > > > It might be a good idea to store the timezone-db files in one > > .tar > > > file and extract them before running the tests. What do you > > think? > > > > I agree, .taring or compressing the tz db would be much better, > if > > it does not make the code too complicated. Having less file would > > make the review more readable, and would also make the tz db > > consume much less space on hdfs, as the many small files will be > > rounded up to hdfs block size. > > Extracting files from a .tar file can be tricky. Probably we would > have to add libtar library to the native-toolchain to handle .tar > files. > > Alternatively we can store timezone files in a JAR archive instead. > The BE can call into the java FE to extract files from it. Tim, Dan, what do you think? -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 10 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 30 May 2018 15:34:20 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 10: (5 comments) > > > > > > Uploaded patch set 9. > > > > > > > > > > Patch -set 9 contains the following changes: > > > > > - Added a full timezone db to testdata/tzdb. > > > > > - End-to-end tests and BE-tests were changed to use this > > > timezone > > > > > db. This was necessary because some timezone-tests were > > failing > > > > on > > > > > older jenkins workers that had an older tzdata package > > > installed. > > > > > > > > It might be a good idea to store the timezone-db files in one > > > .tar > > > > file and extract them before running the tests. What do you > > > think? > > > > > > I agree, .taring or compressing the tz db would be much better, > > if > > > it does not make the code too complicated. Having less file > would > > > make the review more readable, >From a review ability perspective, there's absolutely no need for this commit >to have 221 timezone files. You can test it with ~3, or do a separate commit. >i.e., it's neither here nor there. > > > > Alternatively we can store timezone files in a JAR archive > instead. > > The BE can call into the java FE to extract files from it. > > Tim, Dan, what do you think? The Yarn equivalent here has this notion of a "distributed cache" which is to say it stores the files locally and re-uses them across jobs. I can't tell if we should be worried that all impalads, at boot time, will slam HDFS with reading the timezonedb. I think reading ~200 files per impalad times 200 impalad daemons may be a lot of HDFS metadata load, but maybe it's comparable to what we do for queries anyway. I certainly think that tar or jar is a better way to go. Since this is happening at boot, we can probably still fork to tar, which we can assume is available, or use Java, which has tar libraries and native support for zip. http://gerrit.cloudera.org:8080/#/c/9986/10//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9986/10//COMMIT_MSG@35 PS10, Line 35: specify an HDFS/S3/ADLS location that contains the shared compiled In what format? Does it add to the host one or override it? My /usr/share/zoneinfo is full of symlinks which HDFS doesn't support in some configurations (and S3 certainly doesn't). http://gerrit.cloudera.org:8080/#/c/9986/10//COMMIT_MSG@41 PS10, Line 41: - The name of the coordinator node’s local time-zone is saved to the Is it easy to tell what the local time zone is of an impalad node? (E.g., do we log it?) http://gerrit.cloudera.org:8080/#/c/9986/10//COMMIT_MSG@45 PS10, Line 45: - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad What's the distinction between this and --hdfs_zone_info_dir? Do we need both? http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan File testdata/tzdb/2017c/Africa/Abidjan: http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan@1 PS10, Line 1: ../Atlantic/St_Helena We're adding a ton of files. Do we need such a big database for our testing purposes? http://gerrit.cloudera.org:8080/#/c/9986/10/tests/custom_cluster/test_shared_tzdb.py File tests/custom_cluster/test_shared_tzdb.py: http://gerrit.cloudera.org:8080/#/c/9986/10/tests/custom_cluster/test_shared_tzdb.py@38 PS10, Line 38: cls.ImpalaTestMatrix.add_constraint(lambda v: Add a comment about what this is trying to do? -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 10 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 30 May 2018 15:59:54 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 10: (1 comment) Yeah I agree with Phil that doing JNI to java to unzip a file may be the least of all evils - we wouldn't be adding another dependency and there's no risk of DOSing the NameNode. I doubt it would take the NameNode down, but it could disrupt other things happening on the cluster unnecessarily. http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan File testdata/tzdb/2017c/Africa/Abidjan: http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan@1 PS10, Line 1: ../Atlantic/St_Helena > We're adding a ton of files. Do we need such a big database for our testing We'll also need to exclude these files from RAT checks (and document how they are licensed). -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 10 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 31 May 2018 00:30:03 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/9986/10/tests/custom_cluster/test_shared_tzdb.py File tests/custom_cluster/test_shared_tzdb.py: http://gerrit.cloudera.org:8080/#/c/9986/10/tests/custom_cluster/test_shared_tzdb.py@59 PS10, Line 59: for abbrev in ['PST', 'JST', 'ACT', 'VST']: I have copied and extended this test in https://gerrit.cloudera.org/#/c/10486/3..4/tests/query_test/test_timezones.py to also checks the warning and the to_utc_timestamp function - maybe these check could be added to this test too. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 10 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 04 Jun 2018 17:59:28 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#11). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to specify an HDFS/S3/ADLS path to a zip archive that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/generated-sources/gen-cpp/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/common/init.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/frontend.cc M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/CMakeLists.txt M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h A be/src/util/zip-util.cc A be/src/util/zip-util.h M bin/bootstrap_toolchain.py M bin/impala-config.sh M bin/rat_exclude_files.txt A cmake_modules/FindCctz.cmake M common/thrift/CMakeLists.txt M common/thrift/ImpalaInternalService.thrift A common/thrift/Zip.thrift M common/thrift/metrics.json A fe/src/main/java/org/apache/impala/util/ZipUtil.java M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/2017c.zip A testdata/tzdb/abbrev.conf A testdata/tzdb_tiny/America/New_York A testdata/tzdb_tiny/Etc/GMT+4 A testdata/tzdb_tiny/US/Eastern A testdata/tzdb_tiny/UTC A testdata/tzdb_tiny/Zulu A testdata/tzdb_tiny/posix/UTC A testdata/tzdb_tiny/posixrules M testdata/workloads/functional-query/queries/QueryTest/exprs.test M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py A tests/custom_cluster/test_shared_tzdb.py D tests/query_test/test_timezones.py 70 files changed, 2,995 insertions(+), 1,168 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/11 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 11: > (1 comment) > > Yeah I agree with Phil that doing JNI to java to unzip a file may > be the least of all evils - we wouldn't be adding another > dependency and there's no risk of DOSing the NameNode. I doubt it > would take the NameNode down, but it could disrupt other things > happening on the cluster unnecessarily. Changed the patch to take the shared timezone db in a zip archive. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 11 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 05 Jun 2018 16:12:05 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#12). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to specify an HDFS/S3/ADLS path to a zip archive that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/generated-sources/gen-cpp/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/common/init.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/frontend.cc M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/CMakeLists.txt M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h A be/src/util/zip-util.cc A be/src/util/zip-util.h M bin/bootstrap_toolchain.py M bin/impala-config.sh M bin/rat_exclude_files.txt A cmake_modules/FindCctz.cmake M common/thrift/CMakeLists.txt M common/thrift/ImpalaInternalService.thrift A common/thrift/Zip.thrift M common/thrift/metrics.json A fe/src/main/java/org/apache/impala/util/ZipUtil.java M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/2017c.zip A testdata/tzdb/abbrev.conf A testdata/tzdb_tiny/America/New_York A testdata/tzdb_tiny/Etc/GMT+4 A testdata/tzdb_tiny/US/Eastern A testdata/tzdb_tiny/UTC A testdata/tzdb_tiny/Zulu A testdata/tzdb_tiny/posix/UTC A testdata/tzdb_tiny/posixrules M testdata/workloads/functional-query/queries/QueryTest/exprs.test M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py A tests/custom_cluster/test_shared_tzdb.py D tests/query_test/test_timezones.py 70 files changed, 2,994 insertions(+), 1,167 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/12 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 11: (2 comments) Thanks for adding zip support! We should add some tests for zip_util, especially for error handling, which is an untested path at the moment if didn't miss something. I am ok with moving this (and dealing with my other comments) to a later commit. http://gerrit.cloudera.org:8080/#/c/9986/11/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/11/be/src/exprs/timezone_db.cc@198 PS11, Line 198: GetNextDirectoryEntry This is subjective, but I do not like this interface too much. I would prefer to wrap dir_stream to a class/struct, or create a function like this: static STATUS ListDirEntries(string path, vector& result, int max_result_num = 0). Both could be moved to util/filesystem_util. http://gerrit.cloudera.org:8080/#/c/9986/11/be/src/exprs/timezone_db.cc@213 PS11, Line 213: readdir_r There was a discussion about readdir_r() vs readdir() in https://gerrit.cloudera.org/#/c/8546/8 , and readdir() was preferred in the end. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 11 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 06 Jun 2018 18:35:35 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 11: (2 comments) Added some tests for extracting files from a non-existent zip archive and from a corrupt zip archive. http://gerrit.cloudera.org:8080/#/c/9986/11/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/11/be/src/exprs/timezone_db.cc@198 PS11, Line 198: GetNextDirectoryEntry > This is subjective, but I do not like this interface too much. I would pref Done http://gerrit.cloudera.org:8080/#/c/9986/11/be/src/exprs/timezone_db.cc@213 PS11, Line 213: readdir_r > There was a discussion about readdir_r() vs readdir() in https://gerrit.clo Done -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 11 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 07 Jun 2018 19:49:52 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#13). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to specify an HDFS/S3/ADLS path to a zip archive that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/generated-sources/gen-cpp/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/common/init.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/frontend.cc M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/CMakeLists.txt M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h A be/src/util/zip-util-test.cc A be/src/util/zip-util.cc A be/src/util/zip-util.h M bin/bootstrap_toolchain.py M bin/impala-config.sh M bin/rat_exclude_files.txt A cmake_modules/FindCctz.cmake M common/thrift/CMakeLists.txt M common/thrift/ImpalaInternalService.thrift A common/thrift/Zip.thrift M common/thrift/metrics.json A fe/src/main/java/org/apache/impala/util/ZipUtil.java M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/2017c-corrupt.zip A testdata/tzdb/2017c.zip A testdata/tzdb/abbrev.conf A testdata/tzdb_tiny/America/New_York A testdata/tzdb_tiny/Etc/GMT+4 A testdata/tzdb_tiny/US/Eastern A testdata/tzdb_tiny/UTC A testdata/tzdb_tiny/Zulu A testdata/tzdb_tiny/posix/UTC A testdata/tzdb_tiny/posixrules M testdata/workloads/functional-query/queries/QueryTest/exprs.test M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py A tests/custom_cluster/test_shared_tzdb.py D tests/query_test/test_timezones.py 72 files changed, 3,098 insertions(+), 1,167 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/13 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Message
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#14). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to specify an HDFS/S3/ADLS path to a zip archive that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/generated-sources/gen-cpp/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/common/init.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/frontend.cc M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/CMakeLists.txt M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h A be/src/util/zip-util-test.cc A be/src/util/zip-util.cc A be/src/util/zip-util.h M bin/bootstrap_toolchain.py M bin/impala-config.sh M bin/rat_exclude_files.txt A cmake_modules/FindCctz.cmake M common/thrift/CMakeLists.txt M common/thrift/ImpalaInternalService.thrift A common/thrift/Zip.thrift M common/thrift/metrics.json A fe/src/main/java/org/apache/impala/util/ZipUtil.java M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/2017c-corrupt.zip A testdata/tzdb/2017c.zip A testdata/tzdb/abbrev.conf A testdata/tzdb_tiny/America/New_York A testdata/tzdb_tiny/Etc/GMT+4 A testdata/tzdb_tiny/US/Eastern A testdata/tzdb_tiny/UTC A testdata/tzdb_tiny/Zulu A testdata/tzdb_tiny/posix/UTC A testdata/tzdb_tiny/posixrules M testdata/workloads/functional-query/queries/QueryTest/exprs.test M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py A tests/custom_cluster/test_shared_tzdb.py D tests/query_test/test_timezones.py 72 files changed, 3,117 insertions(+), 1,167 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/14 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Message
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 14: > Uploaded patch set 14. Added one more BE test for extracting files from a zip archive to a non-writable destination directory. Fixed zip-slip vulnerability in ZipUtil.java. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 14 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 08 Jun 2018 13:43:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 14: Code-Review+2 (5 comments) Thanks for thinking about Zip-Slip! I have left a few optional comments about the usability of the interfaces. http://gerrit.cloudera.org:8080/#/c/9986/14//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9986/14//COMMIT_MSG@34 PS14, Line 34: - Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to The Zip slip safe zip-util could be also mentioned in the commit message. http://gerrit.cloudera.org:8080/#/c/9986/14/be/src/util/filesystem-util.h File be/src/util/filesystem-util.h: http://gerrit.cloudera.org:8080/#/c/9986/14/be/src/util/filesystem-util.h@92 PS14, Line 92: Directory(const string& path, bool skip_hidden_entries = true); I thought a bit about usability and I vote for removing this parameter and skip only "." and ".." - I can't imagine any use case when I would be interested in those. http://gerrit.cloudera.org:8080/#/c/9986/14/be/src/util/filesystem-util.h@109 PS14, Line 109: static Status GetEntryNames(const string& path, I would prefer max_result_size to be the last parameter, and give it a default value of 0. http://gerrit.cloudera.org:8080/#/c/9986/14/be/src/util/zip-util-test.cc File be/src/util/zip-util-test.cc: http://gerrit.cloudera.org:8080/#/c/9986/14/be/src/util/zip-util-test.cc@69 PS14, Line 69: EXPECT_FALSE(filesystem::exists(dest_dir3)); I guess that this is only true if zip decoding failed at the start, and some files may be already decompressed before reaching an error in the zip. I am not sure what to do with this, probably nothing. It would be possible add some kind of cleanup logic to the java util, but I am not sure if this worth the effort. http://gerrit.cloudera.org:8080/#/c/9986/14/fe/src/main/java/org/apache/impala/util/ZipUtil.java File fe/src/main/java/org/apache/impala/util/ZipUtil.java: http://gerrit.cloudera.org:8080/#/c/9986/14/fe/src/main/java/org/apache/impala/util/ZipUtil.java@45 PS14, Line 45: try (ZipFile zip = new ZipFile(params.archive_file)) { I would move this block to a similar function with (String archiveFile, String destDir) parameters to make this util usable from Java too. This would be minimal extra effort and I think that it can be handy to have an easily usable Zip-Slip safe extract function. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 14 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 08 Jun 2018 16:15:15 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#15). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to specify an HDFS/S3/ADLS path to a zip archive that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Adds a new ZipUtil class to extract files from a zip archive. The implementation is not vulnerable to Zip Slip. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/generated-sources/gen-cpp/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/common/init.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/frontend.cc M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/CMakeLists.txt M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h A be/src/util/zip-util-test.cc A be/src/util/zip-util.cc A be/src/util/zip-util.h M bin/bootstrap_toolchain.py M bin/impala-config.sh M bin/rat_exclude_files.txt A cmake_modules/FindCctz.cmake M common/thrift/CMakeLists.txt M common/thrift/ImpalaInternalService.thrift A common/thrift/Zip.thrift M common/thrift/metrics.json A fe/src/main/java/org/apache/impala/util/ZipUtil.java M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/2017c-corrupt.zip A testdata/tzdb/2017c.zip A testdata/tzdb/abbrev.conf A testdata/tzdb_tiny/America/New_York A testdata/tzdb_tiny/Etc/GMT+4 A testdata/tzdb_tiny/US/Eastern A testdata/tzdb_tiny/UTC A testdata/tzdb_tiny/Zulu A testdata/tzdb_tiny/posix/UTC A testdata/tzdb_tiny/posixrules M testdata/workloads/functional-query/queries/QueryTest/exprs.test M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py A tests/custom_cluster/test_shared_tzdb.py D tests/query_test/test_timezones.py 72 files changed, 3,119 insertions(+), 1,167 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/15 -- To view, visit http://gerrit.cloudera.org:8080/9986 To uns
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#16). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to specify an HDFS/S3/ADLS path to a zip archive that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - Introduces a new startup flag (--hdfs_zone_abbrev_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone abbreviations. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Adds a new ZipUtil class to extract files from a zip archive. The implementation is not vulnerable to Zip Slip. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/generated-sources/gen-cpp/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/common/init.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/frontend.cc M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/CMakeLists.txt M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h A be/src/util/zip-util-test.cc A be/src/util/zip-util.cc A be/src/util/zip-util.h M bin/bootstrap_toolchain.py M bin/impala-config.sh M bin/rat_exclude_files.txt A cmake_modules/FindCctz.cmake M common/thrift/CMakeLists.txt M common/thrift/ImpalaInternalService.thrift A common/thrift/Zip.thrift M common/thrift/metrics.json A fe/src/main/java/org/apache/impala/util/ZipUtil.java M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/2017c-corrupt.zip A testdata/tzdb/2017c.zip A testdata/tzdb/alias.conf A testdata/tzdb_tiny/America/New_York A testdata/tzdb_tiny/Etc/GMT+4 A testdata/tzdb_tiny/US/Eastern A testdata/tzdb_tiny/UTC A testdata/tzdb_tiny/Zulu A testdata/tzdb_tiny/posix/UTC A testdata/tzdb_tiny/posixrules M testdata/workloads/functional-query/queries/QueryTest/exprs.test M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py A tests/custom_cluster/test_shared_tzdb.py D tests/query_test/test_timezones.py 72 files changed, 3,089 insertions(+), 1,167 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/16 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsu
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#17). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to specify an HDFS/S3/ADLS path to a zip archive that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone aliases. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Adds a new ZipUtil class to extract files from a zip archive. The implementation is not vulnerable to Zip Slip. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/generated-sources/gen-cpp/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/common/init.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/frontend.cc M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/CMakeLists.txt M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h A be/src/util/zip-util-test.cc A be/src/util/zip-util.cc A be/src/util/zip-util.h M bin/bootstrap_toolchain.py M bin/impala-config.sh M bin/rat_exclude_files.txt A cmake_modules/FindCctz.cmake M common/thrift/CMakeLists.txt M common/thrift/ImpalaInternalService.thrift A common/thrift/Zip.thrift M common/thrift/metrics.json A fe/src/main/java/org/apache/impala/util/ZipUtil.java M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/2017c-corrupt.zip A testdata/tzdb/2017c.zip A testdata/tzdb/alias.conf A testdata/tzdb_tiny/America/New_York A testdata/tzdb_tiny/Etc/GMT+4 A testdata/tzdb_tiny/US/Eastern A testdata/tzdb_tiny/UTC A testdata/tzdb_tiny/Zulu A testdata/tzdb_tiny/posix/UTC A testdata/tzdb_tiny/posixrules M testdata/workloads/functional-query/queries/QueryTest/exprs.test M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py A tests/custom_cluster/test_shared_tzdb.py D tests/query_test/test_timezones.py 72 files changed, 3,089 insertions(+), 1,167 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/17 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 17: > Uploaded patch set 17: Commit message was updated. Added support for configurable timezone aliases (instead of just abbreviations). -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 17 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 11 Jun 2018 17:37:46 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#18). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to specify an HDFS/S3/ADLS path to a zip archive that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone aliases. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Adds a new ZipUtil class to extract files from a zip archive. The implementation is not vulnerable to Zip Slip. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/generated-sources/gen-cpp/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/common/init.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/frontend.cc M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/CMakeLists.txt M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h A be/src/util/zip-util-test.cc A be/src/util/zip-util.cc A be/src/util/zip-util.h M bin/bootstrap_toolchain.py M bin/impala-config.sh M bin/rat_exclude_files.txt A cmake_modules/FindCctz.cmake M common/thrift/CMakeLists.txt M common/thrift/ImpalaInternalService.thrift A common/thrift/Zip.thrift M common/thrift/metrics.json A fe/src/main/java/org/apache/impala/util/ZipUtil.java M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/2017c-corrupt.zip A testdata/tzdb/2017c.zip A testdata/tzdb/alias.conf A testdata/tzdb_tiny/America/New_York A testdata/tzdb_tiny/Etc/GMT+4 A testdata/tzdb_tiny/US/Eastern A testdata/tzdb_tiny/UTC A testdata/tzdb_tiny/Zulu A testdata/tzdb_tiny/posix/UTC A testdata/tzdb_tiny/posixrules M testdata/workloads/functional-query/queries/QueryTest/exprs.test M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py A tests/custom_cluster/test_shared_tzdb.py D tests/query_test/test_timezones.py 72 files changed, 3,089 insertions(+), 1,167 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/18 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 18: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/9986/17/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/17/be/src/exprs/timezone_db.cc@367 PS17, Line 367: Status TimezoneDatabase::LoadZoneAliasesFromHdfs( : const string& hdfs_zone_alias_conf) { nit: this could fit in one line -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 18 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 12 Jun 2018 13:24:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 18: Code-Review+1 (2 comments) http://gerrit.cloudera.org:8080/#/c/9986/18/be/src/util/zip-util.h File be/src/util/zip-util.h: http://gerrit.cloudera.org:8080/#/c/9986/18/be/src/util/zip-util.h@33 PS18, Line 33: //Extract files from a zip archive to a destination directory in local filesystem. nit: comment formatting - should start with /// http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan File testdata/tzdb/2017c/Africa/Abidjan: http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan@1 PS10, Line 1: > These are binary data files created from text data files that are in the pu How does they pass the rat checks? We run those precommit. I expected to see something in bin/rat_exclude_files.txt. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 18 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 14 Jun 2018 01:44:49 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 10: (3 comments) http://gerrit.cloudera.org:8080/#/c/9986/17/be/src/exprs/timezone_db.cc File be/src/exprs/timezone_db.cc: http://gerrit.cloudera.org:8080/#/c/9986/17/be/src/exprs/timezone_db.cc@367 PS17, Line 367: while (true) { : current_bytes_read = hdfsRead(hdfs_co > nit: this could fit in one line Done http://gerrit.cloudera.org:8080/#/c/9986/18/be/src/util/zip-util.h File be/src/util/zip-util.h: http://gerrit.cloudera.org:8080/#/c/9986/18/be/src/util/zip-util.h@33 PS18, Line 33: > nit: comment formatting - should start with /// Done http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan File testdata/tzdb/2017c/Africa/Abidjan: http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan@1 PS10, Line 1: ../Atlantic/St_Helena > How does they pass the rat checks? We run those precommit. I expected to se I've updated rat_exclude_files.txt in patch-set #11 and #13. I ran the rat checks with that and they passed. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 10 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 14 Jun 2018 15:27:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#19). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to specify an HDFS/S3/ADLS path to a zip archive that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone aliases. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Adds a new ZipUtil class to extract files from a zip archive. The implementation is not vulnerable to Zip Slip. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/generated-sources/gen-cpp/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/common/init.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/frontend.cc M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/CMakeLists.txt M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h A be/src/util/zip-util-test.cc A be/src/util/zip-util.cc A be/src/util/zip-util.h M bin/bootstrap_toolchain.py M bin/impala-config.sh M bin/rat_exclude_files.txt A cmake_modules/FindCctz.cmake M common/thrift/CMakeLists.txt M common/thrift/ImpalaInternalService.thrift A common/thrift/Zip.thrift M common/thrift/metrics.json A fe/src/main/java/org/apache/impala/util/ZipUtil.java M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/2017c-corrupt.zip A testdata/tzdb/2017c.zip A testdata/tzdb/alias.conf A testdata/tzdb_tiny/America/New_York A testdata/tzdb_tiny/Etc/GMT+4 A testdata/tzdb_tiny/US/Eastern A testdata/tzdb_tiny/UTC A testdata/tzdb_tiny/Zulu A testdata/tzdb_tiny/posix/UTC A testdata/tzdb_tiny/posixrules M testdata/workloads/functional-query/queries/QueryTest/exprs.test M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py A tests/custom_cluster/test_shared_tzdb.py D tests/query_test/test_timezones.py 72 files changed, 3,088 insertions(+), 1,167 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/19 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan File testdata/tzdb/2017c/Africa/Abidjan: http://gerrit.cloudera.org:8080/#/c/9986/10/testdata/tzdb/2017c/Africa/Abidjan@1 PS10, Line 1: ../Atlantic/St_Helena > I've updated rat_exclude_files.txt in patch-set #11 and #13. I ran the rat Hmm, not sure how I missed that - sorry! -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 10 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 14 Jun 2018 16:43:03 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 19: Are we ready to go ahead and merge? Would be good to run exhaustive tests + ASAN before merging just to be sure we aren't going to break anything. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 19 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 18 Jun 2018 22:35:55 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 19: > Are we ready to go ahead and merge? Would be good to run exhaustive > tests + ASAN before merging just to be sure we aren't going to > break anything. Exhaustive and ASAN failed because of a flaky test (IMPALA-7181). Exhaustive: https://master-02.jenkins.cloudera.com/job/impala-private-parameterized/2336/ ASAN: https://master-02.jenkins.cloudera.com/job/impala-private-parameterized/2337/ Since IMPALA-7181 was resolved yesterday, I've rebased the patch-set and restarted the tests this morning. Exhaustive: https://master-02.jenkins.cloudera.com/job/impala-private-parameterized/2345/ ASAN: https://master-02.jenkins.cloudera.com/job/impala-private-parameterized/2346/ Hopefully they will pass this time. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 19 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 20 Jun 2018 14:00:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 19: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 19 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 20 Jun 2018 17:46:54 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 19: Thanks for running those tests! -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 19 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 20 Jun 2018 17:47:42 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#20). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to specify an HDFS/S3/ADLS path to a zip archive that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone aliases. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Adds a new ZipUtil class to extract files from a zip archive. The implementation is not vulnerable to Zip Slip. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/generated-sources/gen-cpp/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/common/init.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/frontend.cc M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/CMakeLists.txt M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h A be/src/util/zip-util-test.cc A be/src/util/zip-util.cc A be/src/util/zip-util.h M bin/bootstrap_toolchain.py M bin/impala-config.sh M bin/rat_exclude_files.txt A cmake_modules/FindCctz.cmake M common/thrift/CMakeLists.txt M common/thrift/ImpalaInternalService.thrift A common/thrift/Zip.thrift M common/thrift/metrics.json A fe/src/main/java/org/apache/impala/util/ZipUtil.java M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/2017c-corrupt.zip A testdata/tzdb/2017c.zip A testdata/tzdb/alias.conf A testdata/tzdb_tiny/America/New_York A testdata/tzdb_tiny/Etc/GMT+4 A testdata/tzdb_tiny/US/Eastern A testdata/tzdb_tiny/UTC A testdata/tzdb_tiny/Zulu A testdata/tzdb_tiny/posix/UTC A testdata/tzdb_tiny/posixrules M testdata/workloads/functional-query/queries/QueryTest/exprs.test M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py A tests/custom_cluster/test_shared_tzdb.py D tests/query_test/test_timezones.py 72 files changed, 3,087 insertions(+), 1,167 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/20 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 20: Code-Review+2 Exhaustive, ASAN tests passed. Rebased the patch-set. Carry +2 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 20 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 21 Jun 2018 12:31:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 20: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2721/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 20 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 21 Jun 2018 12:35:17 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 20: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2721/ -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 20 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 21 Jun 2018 16:05:46 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 20: clang-tidy job failed. Investigating. -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 20 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 21 Jun 2018 16:35:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has uploaded a new patch set (#21). ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to specify an HDFS/S3/ADLS path to a zip archive that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone aliases. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Adds a new ZipUtil class to extract files from a zip archive. The implementation is not vulnerable to Zip Slip. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 --- M CMakeLists.txt M be/CMakeLists.txt M be/generated-sources/gen-cpp/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/common/init.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/frontend.cc M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/CMakeLists.txt M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h A be/src/util/zip-util-test.cc A be/src/util/zip-util.cc A be/src/util/zip-util.h M bin/bootstrap_toolchain.py M bin/impala-config.sh M bin/rat_exclude_files.txt A cmake_modules/FindCctz.cmake M common/thrift/CMakeLists.txt M common/thrift/ImpalaInternalService.thrift A common/thrift/Zip.thrift M common/thrift/metrics.json A fe/src/main/java/org/apache/impala/util/ZipUtil.java M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/2017c-corrupt.zip A testdata/tzdb/2017c.zip A testdata/tzdb/alias.conf A testdata/tzdb_tiny/America/New_York A testdata/tzdb_tiny/Etc/GMT+4 A testdata/tzdb_tiny/US/Eastern A testdata/tzdb_tiny/UTC A testdata/tzdb_tiny/Zulu A testdata/tzdb_tiny/posix/UTC A testdata/tzdb_tiny/posixrules M testdata/workloads/functional-query/queries/QueryTest/exprs.test M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py A tests/custom_cluster/test_shared_tzdb.py D tests/query_test/test_timezones.py 72 files changed, 3,086 insertions(+), 1,176 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/9986/21 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 21: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2723/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 21 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 21 Jun 2018 16:46:02 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 21: Code-Review+2 Carry +2 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 21 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 21 Jun 2018 17:34:50 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 21: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2723/ -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 21 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 21 Jun 2018 20:14:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 22: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 22 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 22 Jun 2018 09:50:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 22: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2725/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 22 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 22 Jun 2018 09:50:37 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 22: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2725/ -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 22 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 22 Jun 2018 09:51:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 22: Code-Review+2 Carry +2 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 22 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 22 Jun 2018 09:52:24 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 22: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2726/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 22 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 22 Jun 2018 09:53:33 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. IMPALA-3307: Add support for IANA time-zone db Impala currently uses two different libraries for timestamp manipulations: boost and glibc. Issues with boost: - Time-zone database is currently hard coded in timezone_db.cc. Impala admins cannot update it without upgrading Impala. - Time-zone database is flat, therefore can’t track year-to-year changes. - Time-zone database is not updated on a regular basis. Issues with glibc: - Uses /usr/share/zoneinfo/ database which could be out of sync on some of the nodes in the Impala cluster. - Uses the host system’s local time-zone. Different nodes in the Impala cluster might use a different local time-zone. - Conversion functions take a global lock, which causes severe performance degradation. In addition to the issues above, the fact that /usr/share/zoneinfo/ and the hard-coded boost time-zone database are both in use is a source of inconsistency in itself. This patch makes the following changes: - Instead of boost and glibc, impalad uses Google's CCTZ to implement time-zone conversions. - Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to specify an HDFS/S3/ADLS path to a zip archive that contains the shared compiled IANA time-zone database. If the startup flag is set, impalad will use the specified time-zone database. Otherwise, impalad will use the default /usr/share/zoneinfo time-zone database. - Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to specify an HDFS/S3/ADLS path to a shared config file that contains definitions for non-standard time-zone aliases. - impalad reads the entire time-zone database into an in-memory map on startup for fast lookups. - The name of the coordinator node’s local time-zone is saved to the query context when preparing query execution. This time-zone is used whenever the current time-zone is referred afterwards in an execution node. - Adds a new ZipUtil class to extract files from a zip archive. The implementation is not vulnerable to Zip Slip. Cherry-picks: not for 2.x. Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Reviewed-on: http://gerrit.cloudera.org:8080/9986 Reviewed-by: Impala Public Jenkins Reviewed-by: Attila Jeges Tested-by: Impala Public Jenkins --- M CMakeLists.txt M be/CMakeLists.txt M be/generated-sources/gen-cpp/CMakeLists.txt M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/common/global-types.h M be/src/common/init.cc M be/src/exec/data-source-scan-node.cc M be/src/exec/data-source-scan-node.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/cast-functions-ir.cc M be/src/exprs/decimal-operators-ir.cc M be/src/exprs/decimal-operators.h M be/src/exprs/expr-test.cc M be/src/exprs/literal.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc A be/src/exprs/timezone_db-test.cc M be/src/exprs/timezone_db.cc M be/src/exprs/timezone_db.h M be/src/runtime/raw-value-test.cc M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M be/src/runtime/timestamp-value.inline.h M be/src/service/frontend.cc M be/src/service/impala-server.cc M be/src/service/impalad-main.cc M be/src/util/CMakeLists.txt M be/src/util/filesystem-util-test.cc M be/src/util/filesystem-util.cc M be/src/util/filesystem-util.h M be/src/util/hdfs-util-test.cc M be/src/util/hdfs-util.cc M be/src/util/hdfs-util.h M be/src/util/time-test.cc M be/src/util/time.cc M be/src/util/time.h A be/src/util/zip-util-test.cc A be/src/util/zip-util.cc A be/src/util/zip-util.h M bin/bootstrap_toolchain.py M bin/impala-config.sh M bin/rat_exclude_files.txt A cmake_modules/FindCctz.cmake M common/thrift/CMakeLists.txt M common/thrift/ImpalaInternalService.thrift A common/thrift/Zip.thrift M common/thrift/metrics.json A fe/src/main/java/org/apache/impala/util/ZipUtil.java M fe/src/test/java/org/apache/impala/testutil/TestUtils.java M testdata/bin/create-load-data.sh M testdata/data/timezoneverification.csv A testdata/tzdb/2017c-corrupt.zip A testdata/tzdb/2017c.zip A testdata/tzdb/alias.conf A testdata/tzdb_tiny/America/New_York A testdata/tzdb_tiny/Etc/GMT+4 A testdata/tzdb_tiny/US/Eastern A testdata/tzdb_tiny/UTC A testdata/tzdb_tiny/Zulu A testdata/tzdb_tiny/posix/UTC A testdata/tzdb_tiny/posixrules M testdata/workloads/functional-query/queries/QueryTest/exprs.test M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py A tests/custom_cluster/test_shared_tzdb.py D tests/query_test/test_timezones.py 72 files changed, 3,086 insertions(+), 1,1
[Impala-ASF-CR] IMPALA-3307: Add support for IANA time-zone db
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9986 ) Change subject: IMPALA-3307: Add support for IANA time-zone db .. Patch Set 22: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/9986 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77 Gerrit-Change-Number: 9986 Gerrit-PatchSet: 22 Gerrit-Owner: Attila Jeges Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Philip Zeyliger Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 22 Jun 2018 13:18:56 + Gerrit-HasComments: No