[ 
https://issues.apache.org/jira/browse/IMPALA-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16522047#comment-16522047
 ] 

Attila Jeges edited comment on IMPALA-3307 at 6/25/18 9:21 AM:
---------------------------------------------------------------

https://github.com/apache/impala/commit/17749dbcfc51ebe67c269ce812749d1845e47e7a

IMPALA-3307: Add support for IANA time-zone db

Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.

Issues with boost:
- Time-zone database is currently hard-coded in timezone_db.cc.
  Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, so it cannot track year-to-year
  changes.
- Time-zone database is not updated on a regular basis.
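
To illustrate the "flat" limitation, here is a minimal sketch that
contrasts a table holding one offset per zone with an IANA-style table
holding a list of transitions that is searched by instant. The data is
hand-written and simplified (it covers only the Moscow offset changes of
2011 and 2014), and the names FLAT_OFFSETS, TRANSITIONS and offset_at are
hypothetical; none of this is taken from Impala or boost.

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Flat table: exactly one offset per zone, regardless of the year.
FLAT_OFFSETS = {"Europe/Moscow": timedelta(hours=3)}

# Transition table: (UTC instant at which the offset changes, new offset),
# sorted by instant. Illustrative data only, not read from a tz database.
TRANSITIONS = {
    "Europe/Moscow": [
        (datetime(2011, 3, 26, 23, 0, tzinfo=UTC), timedelta(hours=4)),
        (datetime(2014, 10, 25, 22, 0, tzinfo=UTC), timedelta(hours=3)),
    ],
}


def offset_at(zone, instant_utc):
    """Return the UTC offset in effect for `zone` at `instant_utc`."""
    transitions = TRANSITIONS[zone]
    starts = [start for start, _ in transitions]
    # Index of the last transition that is not after the instant.
    index = bisect_right(starts, instant_utc) - 1
    if index < 0:
        raise ValueError("instant is earlier than the first known transition")
    return transitions[index][1]


# Same zone, different years: only the transition table sees the change.
summer_2012 = offset_at("Europe/Moscow", datetime(2012, 7, 1, tzinfo=UTC))
summer_2015 = offset_at("Europe/Moscow", datetime(2015, 7, 1, tzinfo=UTC))
flat_any_year = FLAT_OFFSETS["Europe/Moscow"]
```

With the flat table, a timestamp from 2012 would be converted with the
2015 offset, which is the kind of error a year-aware database avoids.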

Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
  some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
  Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
  performance degradation.
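
The host-local dependence can be shown with a small sketch. It assumes two
hypothetical nodes whose OS-level zones differ and uses fixed offsets in
place of real time-zone rules. QueryContext and the render_* helpers are
made-up names for illustration, not Impala classes. Carrying the zone name
with the query, as the patch description further down does with the
coordinator's zone, makes every node produce the same result.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical shared zone table; fixed offsets stand in for real rules.
ZONES = {
    "UTC+1": timezone(timedelta(hours=1), "UTC+1"),
    "UTC-8": timezone(timedelta(hours=-8), "UTC-8"),
}

# Hypothetical nodes whose host-level local zones are out of agreement.
HOST_LOCAL_ZONE = {"node-a": "UTC+1", "node-b": "UTC-8"}


@dataclass(frozen=True)
class QueryContext:
    """Hypothetical per-query state sent from the coordinator."""
    local_zone_name: str


def render_with_host_zone(instant_utc, host):
    # Result depends on which node happens to run the conversion.
    return instant_utc.astimezone(ZONES[HOST_LOCAL_ZONE[host]]).isoformat()


def render_with_query_zone(instant_utc, ctx):
    # Result depends only on the zone name carried by the query.
    return instant_utc.astimezone(ZONES[ctx.local_zone_name]).isoformat()


instant = datetime(2018, 6, 25, 9, 21, tzinfo=timezone.utc)
ctx = QueryContext(local_zone_name=HOST_LOCAL_ZONE["node-a"])

host_results = {h: render_with_host_zone(instant, h) for h in HOST_LOCAL_ZONE}
query_results = {h: render_with_query_zone(instant, ctx) for h in HOST_LOCAL_ZONE}
```

Here host_results differs between the two nodes, while query_results is
identical on both.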

In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.

This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
  time-zone conversions.

- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
  specify an HDFS/S3/ADLS path to a zip archive that contains the
  shared compiled IANA time-zone database. If the startup flag is set,
  impalad will use the specified time-zone database. Otherwise,
  impalad will use the default /usr/share/zoneinfo time-zone database.

- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
  specify an HDFS/S3/ADLS path to a shared config file that contains
  definitions for non-standard time-zone aliases.

- impalad reads the entire time-zone database into an in-memory
  map on startup for fast lookups.

- The name of the coordinator node’s local time-zone is saved to the
  query context when preparing query execution. This time-zone is used
  whenever the current time-zone is referred to afterwards on an
  execution node.

- Adds a new ZipUtil class to extract files from a zip archive. The
  implementation is not vulnerable to Zip Slip.
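
For readers unfamiliar with Zip Slip: a crafted archive can contain entry
names such as "../evil.txt" that escape the extraction directory. The
sketch below shows one common defence, which is to resolve every target
path and reject the archive if any entry would land outside the
destination. It is a Python illustration of the general technique, not the
ZipUtil implementation from this patch, and safe_extract is a hypothetical
name.

```python
import os
import shutil
import zipfile


def safe_extract(zip_path, dest_dir):
    """Extract zip_path into dest_dir, refusing entries that escape it."""
    dest_root = os.path.realpath(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        # First pass: validate every entry before writing anything.
        planned = []
        for info in archive.infolist():
            target = os.path.realpath(os.path.join(dest_root, info.filename))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {info.filename!r}")
            planned.append((info, target))

        # Second pass: all targets are known to be inside dest_root.
        extracted = []
        for info, target in planned:
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return extracted
```

Validating all entries up front means a malicious archive is rejected
before any file is written, rather than being left half extracted.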

Cherry-picks: not for 2.x.

Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Reviewed-on: http://gerrit.cloudera.org:8080/9986
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Reviewed-by: Attila Jeges <atti...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> add support for IANA time zone database
> ---------------------------------------
>
>                 Key: IMPALA-3307
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3307
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: impala 2.3
>            Reporter: Marcell Szabo
>            Assignee: Attila Jeges
>            Priority: Major
>              Labels: supportability
>             Fix For: Impala 3.1.0
>
>
> Currently the time zones are hard-coded in timezone_db.cc and they do not
> take into account that time-zone definitions change from year to year
> (except for Moscow, CDH-19918).
> I suggest moving the time-zone info into a separate config file, so that
> admins can update it if necessary, plus providing tools for updating it
> from well-known sources.
> 1) Define an Impala-friendly file format for time-zone data (preferably
> human-editable as well, and even more preferably a format that other
> similar systems already use).
> 2) Create a tool to extract time-zone data from the IANA tzdata database
> or /usr/share/zoneinfo into the format specified.
> 3) The file (path, HDFS path) should be part of the configuration.
> 4) Backends should load the tzinfo into a quick in-memory structure (quick
> lookup by id + date), maybe loading/caching each time zone on demand,
> since most of them will never be used.
> 5) All date functions should use this generic tzinfo from memory.
> Regarding 2), similar tools:
> http://www.oracle.com/technetwork/java/javase/tzupdater-readme-136440.html
> http://dev.mysql.com/doc/refman/5.7/en/mysql-tzinfo-to-sql.html
> Regarding 3), some reasons to make this configurable and to make 2) a
> manual step:
> * tzinfo is not perfectly standardised, so automatic solutions might fail
> on some OSes.
> * tzinfo on different hosts might be out of sync. Good luck with debugging
> such cases...
> * We wouldn't want query results to change automagically/unexpectedly on
> an OS upgrade.
> * We should give admins the possibility to override / fine-tune tz data
> if the applications require it.



