[ https://issues.apache.org/jira/browse/PHOENIX-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695231#comment-17695231 ]
Istvan Toth commented on PHOENIX-5066: -------------------------------------- Copying here the email I sent to dev list requesting review: We already have an old rambling thread on fixing Phoenix time zone handling, but I am starting another one. Please if you have interest in fixing date handling then take the time to read it, I promise it is much less messy than the old approach. *Background:* The SQL standard defines two groups of Date types. One is the WITHOUT TIME ZONE types, which are analogous to the java.time.Local* types, the other is the WITH TIME ZONE types, which are analogous to java.time.Instant. Types without qualifiers are interpreted as WITHOUT TIME ZONE types. *Current state of Phoenix* Phoenix follows this in most regards, as it stores temporal types as UTC epoch values, interprets string literals as UTC, and formats them as UTC. Phoenix date functions also operate in UTC, which is consistent with the parse/format code. This results in the behaviour required for WITHOUT TIME ZONE types. However, this breaks down when we begin to use the java.sql. temporal types, where Phoenix expects and returns the raw UTC epoch timestamp values. a brief illustration, for selects, but the situation is the same for upsert with PreparedStatment.setTimestamp() and friends: _INSERT INTO X (DATE) VALUES ('2000-01-01 10:10:10')_ stores the UTC epoch for '2000-01-01 10:10:10', regardless of the client timezone. _SELECT * from DATE; rs.getString("DATE")_ returns '2000-01-01 10:10:10' , regardless of the client timezone. So far, so good. _SELECT * from DATE; rs.getTimestamp("DATE").toString()_ returns the UTC epoch timestamp interpreted in the local timezone, i.e a different string for every TZ. This is the root of all of our problems. This problem is built into the ancient JDBC API, which has no real concept of WITH TIMEZONE and WITHOUT TIMEZONE types. *My previous attempt* Previously, I have changed Phoenix to treat all Temporal data as WITH TIMEZONE types, which is not wrong, but 1.) Goes against the SQL standard with defines unqualified temporal types as WITHOUT timezone 2.) Goes against several assumptions built into the Phoenix codebase, and requires pushing the client TZ to the server side, and - either using ThreadLocals per connection to store this information, - or adding the TZ as a parameter to many/most methods in the parser and compiler code - or some even deeper refactoring, that I haven't figured out yet. Implementing real WITH TIMEZONE types in Phoenix in the future is not off the table, but is a much more involved task. (As evidenced by the time I have already wasted on it) *The new proposal* Instead of re-imagining Phoenix to treat all types as WITH TIMEZONE types, we can keep the current interpretation as WITHOUT TIMEZONE types, and fix the get / functions very easily. The code change required for this is only a few hundred lines (without tests). The solution is as simple as applying the Timezone offset as a displacement in PreparedStatement and Resultset when setting or returning java.sql.Date/Time/Timestamp. This is what most DBs do, and also how the SQL standard defines conversion between WITH TIME ZONE and WITHOUT TIME ZONE types. In this case, the java.sql epoch based types transiting the JDBC interface are treated as WITH TIMEZONE types. This ensures that non timezone-aware clients get timestamps that, if interpreted in their local timezone, resolve to the same timezone-less date that the parser and formatter would return, and any java.sql.temporal types set in PreparedStatement will be interpreted as the local time in the client TZ. The only other change in functionality is the CURRENT_DATE() and CURRENT_TIME() functions, which now apply the displacement to the generated values. As usual, we hide this change behind a property, and default to the old behaviour to maintain backwards (bug) compatibility. *Complications* The SQL standard really talks about a Timezone Offset, and is not talking about DST. Many databases take DST into account, as does my patch. Without that the internal representation would not be really UTC, and the date functions (i.e. dayofweek()) wouldn't work. However, the standard DST problems also apply, with one hour each year not existing, and one hour existing twice. The proper way to handle WITHOUT TIMEZONE types from JDBC is to use strings, or use java.time.Local* functions with get/setObject methods. Phoenix doesn't yet have those, I plan to add them in the near future (PHOENIX-6881). When using ROW_TIMESTAMP the displacement rules can easily cause problems if the user is not expecting it, and expects the java.sql.Timestamp value to be applied directly. i.e when executing _preparedStmt.setTimestamp('ROW_TS_COL', new java.sql.Timestamp(new java.sql.Date()));_ from multiple timezones, there is no guarantee that the chronologically last insert will have the latest timestamp. This was already the case when setting the column via a String. This requires extra attention, and should be mentioned in the documentation. *Feedback* Please check out the new PR at [https://github.com/apache/phoenix/pull/1567], as I said it is very small. Any and all comments and questions are welcome, either with the general approach or with the specific implementation. I am also collecting some related issues in PHOENIX-6882, those will be also needed to get Phoenix date handling into a better place. *References/further reading* https://issues.apache.org/jira/browse/PHOENIX-5066 https://issues.apache.org/jira/browse/PHOENIX-6882 [https://github.com/julianhyde/sqlline/issues/440] [https://stackoverflow.com/questions/14070572/is-java-sql-timestamp-timezone-specific] [https://stackoverflow.com/questions/44574205/does-the-timezone-matter-when-parsing-a-timestamp] [https://medium.com/@williampuk/sql-timestamp-and-jdbc-timestamp-deep-dive-7ae0ea91e237] Also search for "displacement" in the SQL11 standard. (No public link available) > The TimeZone is incorrectly used during writing or reading data > --------------------------------------------------------------- > > Key: PHOENIX-5066 > URL: https://issues.apache.org/jira/browse/PHOENIX-5066 > Project: Phoenix > Issue Type: Bug > Affects Versions: 5.0.0, 4.14.1 > Reporter: Jaanai Zhang > Assignee: Istvan Toth > Priority: Critical > Fix For: 5.3.0 > > Attachments: DateTest.java, PHOENIX-5066.4x.v1.patch, > PHOENIX-5066.4x.v2.patch, PHOENIX-5066.4x.v3.patch, > PHOENIX-5066.master.v1.patch, PHOENIX-5066.master.v2.patch, > PHOENIX-5066.master.v3.patch, PHOENIX-5066.master.v4.patch, > PHOENIX-5066.master.v5.patch, PHOENIX-5066.master.v6.patch > > Time Spent: 20m > Remaining Estimate: 0h > > We have two methods to write data when uses JDBC API. > #1. Uses _the exceuteUpdate_ method to execute a string that is an upsert SQL. > #2. Uses the _prepareStatement_ method to set some objects and execute. > The _string_ data needs to convert to a new object by the schema information > of tables. we'll use some date formatters to convert string data to object > for Date/Time/Timestamp types when writes data and the formatters are used > when reads data as well. > > *Uses default timezone test* > Writing 3 records by the different ways. > {code:java} > UPSERT INTO date_test VALUES (1,'2018-12-10 15:40:47','2018-12-10 > 15:40:47','2018-12-10 15:40:47') > UPSERT INTO date_test VALUES (2,to_date('2018-12-10 > 15:40:47'),to_time('2018-12-10 15:40:47'),to_timestamp('2018-12-10 15:40:47')) > stmt.setInt(1, 3);stmt.setDate(2, date);stmt.setTime(3, > time);stmt.setTimestamp(4, ts); > {code} > Reading the table by the getObject(getDate/getTime/getTimestamp) methods. > {code:java} > 1 | 2018-12-10 | 23:45:07 | 2018-12-10 23:45:07.0 > 2 | 2018-12-10 | 23:45:07 | 2018-12-10 23:45:07.0 > 3 | 2018-12-10 | 15:45:07 | 2018-12-10 15:45:07.66 > {code} > Reading the table by the getString methods > {code:java} > 1 | 2018-12-10 15:45:07.000 | 2018-12-10 15:45:07.000 | 2018-12-10 > 15:45:07.000 > 2 | 2018-12-10 15:45:07.000 | 2018-12-10 15:45:07.000 | 2018-12-10 > 15:45:07.000 > 3 | 2018-12-10 07:45:07.660 | 2018-12-10 07:45:07.660 | 2018-12-10 > 07:45:07.660 > {code} > *Uses GMT+8 test* > Writing 3 records by the different ways. > {code:java} > UPSERT INTO date_test VALUES (1,'2018-12-10 15:40:47','2018-12-10 > 15:40:47','2018-12-10 15:40:47') > UPSERT INTO date_test VALUES (2,to_date('2018-12-10 > 15:40:47'),to_time('2018-12-10 15:40:47'),to_timestamp('2018-12-10 15:40:47')) > stmt.setInt(1, 3);stmt.setDate(2, date);stmt.setTime(3, > time);stmt.setTimestamp(4, ts); > {code} > Reading the table by the getObject(getDate/getTime/getTimestamp) methods. > {code:java} > 1 | 2018-12-10 | 23:40:47 | 2018-12-10 23:40:47.0 > 2 | 2018-12-10 | 15:40:47 | 2018-12-10 15:40:47.0 > 3 | 2018-12-10 | 15:40:47 | 2018-12-10 15:40:47.106 {code} > Reading the table by the getString methods > {code:java} > 1 | 2018-12-10 23:40:47.000 | 2018-12-10 23:40:47.000 | 2018-12-10 > 23:40:47.000 > 2 | 2018-12-10 15:40:47.000 | 2018-12-10 15:40:47.000 | 2018-12-10 > 15:40:47.000 > 3 | 2018-12-10 15:40:47.106 | 2018-12-10 15:40:47.106 | 2018-12-10 > 15:40:47.106 > {code} > > _We_ have a historical problem, we'll parse the string to > Date/Time/Timestamp objects with timezone in #1, which means the actual data > is going to be changed when stored in HBase table。 -- This message was sent by Atlassian Jira (v8.20.10#820010)