[jira] [Commented] (PHOENIX-5066) The TimeZone is incorrectly used during writing or reading data

Istvan Toth (Jira) Wed, 01 Mar 2023 09:52:12 -0800


    [ 
https://issues.apache.org/jira/browse/PHOENIX-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695231#comment-17695231
 ]


Istvan Toth commented on PHOENIX-5066:
--------------------------------------

Copying here the email I sent to dev list requesting review:

We already have an old rambling thread on fixing Phoenix time zone handling, 
but I am starting another one.
Please if you have interest in fixing date handling then take the time to read 
it, I promise it is much less messy than the old approach.

*Background:*

The SQL standard defines two groups of Date types.
One is the WITHOUT TIME ZONE types, which are analogous to the java.time.Local* 
types, the other is the WITH TIME ZONE types, which are analogous to 
java.time.Instant.
Types without qualifiers are interpreted as WITHOUT TIME ZONE types.

*Current state of Phoenix*

Phoenix follows this in most regards, as it stores temporal types as UTC epoch 
values, interprets string literals as UTC, and formats them as UTC.
Phoenix date functions also operate in UTC, which is consistent with the 
parse/format code. This results in the behaviour required for WITHOUT TIME ZONE 
types.
However, this breaks down when we begin to use the java.sql. temporal types, 
where Phoenix expects and returns the raw UTC epoch timestamp values.

a brief illustration, for selects, but the situation is the same for upsert 
with PreparedStatment.setTimestamp() and friends:

_INSERT INTO X  (DATE) VALUES ('2000-01-01 10:10:10')_
stores  the UTC epoch for '2000-01-01 10:10:10', regardless of the client 
timezone.

_SELECT * from DATE; rs.getString("DATE")_ 
returns '2000-01-01 10:10:10' , regardless of the client timezone.

So far, so good.

_SELECT * from DATE; rs.getTimestamp("DATE").toString()_
returns the UTC epoch timestamp interpreted in the local timezone, i.e a 
different string for every TZ.

This is the root of all of our problems.

This problem is built into the ancient JDBC API, which has no real concept of 
WITH TIMEZONE and WITHOUT TIMEZONE types.

*My previous attempt*

Previously, I have changed Phoenix to treat all Temporal data as WITH TIMEZONE 
types, which is not wrong, but
1.) Goes against the SQL standard with defines unqualified temporal types as 
WITHOUT timezone
2.) Goes against several assumptions built into the Phoenix codebase, and 
requires pushing the client TZ to the server side, and
- either using ThreadLocals per connection to store this information,
- or adding the TZ as a parameter to many/most methods in the parser and 
compiler code
- or some even deeper refactoring, that I haven't figured out yet.

Implementing real WITH TIMEZONE types in Phoenix in the future is not off the 
table, but is a much more involved task.
(As evidenced by the time I have already wasted on it)

*The new proposal*

Instead of re-imagining Phoenix to treat all types as WITH TIMEZONE types, we 
can keep the current interpretation as WITHOUT TIMEZONE types, and  fix the get 
/ functions very easily.
The  code change required for this is only a few hundred lines (without tests).

The solution is as simple as applying the Timezone offset as a displacement in 
PreparedStatement and Resultset when setting or returning 
java.sql.Date/Time/Timestamp.

This is what most DBs do, and also how the SQL standard defines conversion 
between WITH TIME ZONE and WITHOUT TIME ZONE types. In this case, the java.sql 
epoch based types transiting the JDBC interface are treated as WITH TIMEZONE 
types.

This ensures that non timezone-aware clients get timestamps that, if 
interpreted in their local timezone, resolve to the same timezone-less date 
that the parser and formatter would return, and any java.sql.temporal types set 
in PreparedStatement will be interpreted as the local time in the client TZ.

The only other change in functionality is the CURRENT_DATE() and CURRENT_TIME() 
functions, which now apply the displacement to the generated values.

As usual, we hide this change behind a property, and default to the old 
behaviour to maintain backwards (bug) compatibility.

*Complications*

The SQL standard really talks about a Timezone Offset, and is not talking about 
DST.
Many databases take DST into account, as does my patch.
Without that the internal representation would not be really UTC, and the date 
functions (i.e. dayofweek()) wouldn't work.

However, the standard DST problems also apply, with one hour each year not 
existing, and one hour existing twice. 

The proper way to handle WITHOUT TIMEZONE types from JDBC is to use strings, or 
use java.time.Local* functions with get/setObject methods. Phoenix doesn't yet 
have those, I plan to add them in the near future (PHOENIX-6881).

When using ROW_TIMESTAMP the displacement rules can easily cause problems if 
the user is not expecting it, and expects the java.sql.Timestamp value to be 
applied directly.
i.e when executing _preparedStmt.setTimestamp('ROW_TS_COL', new 
java.sql.Timestamp(new java.sql.Date()));_ from multiple timezones, there is no 
guarantee that the chronologically last insert will have the latest timestamp.
This was already the case when setting the column via a String.
This requires extra attention, and should be mentioned in the documentation.

*Feedback*

Please check out the new PR at [https://github.com/apache/phoenix/pull/1567], 
as I said it is very small.

Any and all comments and questions are welcome, either with the general 
approach or with the specific implementation.

I am also collecting some related issues in PHOENIX-6882, those will be also 
needed to get Phoenix date handling into a better place.

*References/further reading*

https://issues.apache.org/jira/browse/PHOENIX-5066
https://issues.apache.org/jira/browse/PHOENIX-6882
[https://github.com/julianhyde/sqlline/issues/440]
[https://stackoverflow.com/questions/14070572/is-java-sql-timestamp-timezone-specific]
[https://stackoverflow.com/questions/44574205/does-the-timezone-matter-when-parsing-a-timestamp]
[https://medium.com/@williampuk/sql-timestamp-and-jdbc-timestamp-deep-dive-7ae0ea91e237]
Also search for "displacement" in the SQL11 standard. (No public link available)

> The TimeZone is incorrectly used during writing or reading data
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-5066
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5066
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 5.0.0, 4.14.1
>            Reporter: Jaanai Zhang
>            Assignee: Istvan Toth
>            Priority: Critical
>             Fix For: 5.3.0
>
>         Attachments: DateTest.java, PHOENIX-5066.4x.v1.patch, 
> PHOENIX-5066.4x.v2.patch, PHOENIX-5066.4x.v3.patch, 
> PHOENIX-5066.master.v1.patch, PHOENIX-5066.master.v2.patch, 
> PHOENIX-5066.master.v3.patch, PHOENIX-5066.master.v4.patch, 
> PHOENIX-5066.master.v5.patch, PHOENIX-5066.master.v6.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We have two methods to write data when uses JDBC API.
> #1. Uses _the exceuteUpdate_ method to execute a string that is an upsert SQL.
> #2. Uses the _prepareStatement_ method to set some objects and execute.
> The _string_ data needs to convert to a new object by the schema information 
> of tables. we'll use some date formatters to convert string data to object 
> for Date/Time/Timestamp types when writes data and the formatters are used 
> when reads data as well.
>  
> *Uses default timezone test*
>  Writing 3 records by the different ways.
> {code:java}
> UPSERT INTO date_test VALUES (1,'2018-12-10 15:40:47','2018-12-10 
> 15:40:47','2018-12-10 15:40:47') 
> UPSERT INTO date_test VALUES (2,to_date('2018-12-10 
> 15:40:47'),to_time('2018-12-10 15:40:47'),to_timestamp('2018-12-10 15:40:47'))
> stmt.setInt(1, 3);stmt.setDate(2, date);stmt.setTime(3, 
> time);stmt.setTimestamp(4, ts);
> {code}
> Reading the table by the getObject(getDate/getTime/getTimestamp) methods.
> {code:java}
> 1 | 2018-12-10 | 23:45:07 | 2018-12-10 23:45:07.0 
> 2 | 2018-12-10 | 23:45:07 | 2018-12-10 23:45:07.0 
> 3 | 2018-12-10 | 15:45:07 | 2018-12-10 15:45:07.66 
> {code}
> Reading the table by the getString methods 
> {code:java}
> 1 | 2018-12-10 15:45:07.000 | 2018-12-10 15:45:07.000 | 2018-12-10 
> 15:45:07.000 
> 2 | 2018-12-10 15:45:07.000 | 2018-12-10 15:45:07.000 | 2018-12-10 
> 15:45:07.000 
> 3 | 2018-12-10 07:45:07.660 | 2018-12-10 07:45:07.660 | 2018-12-10 
> 07:45:07.660
> {code}
>  *Uses GMT+8 test*
>  Writing 3 records by the different ways.
> {code:java}
> UPSERT INTO date_test VALUES (1,'2018-12-10 15:40:47','2018-12-10 
> 15:40:47','2018-12-10 15:40:47')
> UPSERT INTO date_test VALUES (2,to_date('2018-12-10 
> 15:40:47'),to_time('2018-12-10 15:40:47'),to_timestamp('2018-12-10 15:40:47'))
> stmt.setInt(1, 3);stmt.setDate(2, date);stmt.setTime(3, 
> time);stmt.setTimestamp(4, ts);
> {code}
> Reading the table by the getObject(getDate/getTime/getTimestamp) methods.
> {code:java}
> 1 | 2018-12-10 | 23:40:47 | 2018-12-10 23:40:47.0 
> 2 | 2018-12-10 | 15:40:47 | 2018-12-10 15:40:47.0 
> 3 | 2018-12-10 | 15:40:47 | 2018-12-10 15:40:47.106 {code}
> Reading the table by the getString methods
> {code:java}
>  1 | 2018-12-10 23:40:47.000 | 2018-12-10 23:40:47.000 | 2018-12-10 
> 23:40:47.000
> 2 | 2018-12-10 15:40:47.000 | 2018-12-10 15:40:47.000 | 2018-12-10 
> 15:40:47.000
> 3 | 2018-12-10 15:40:47.106 | 2018-12-10 15:40:47.106 | 2018-12-10 
> 15:40:47.106
> {code}
>  
> _We_ have a historical problem,  we'll parse the string to 
> Date/Time/Timestamp objects with timezone in #1, which means the actual data 
> is going to be changed when stored in HBase table。



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PHOENIX-5066) The TimeZone is incorrectly used during writing or reading data

Reply via email to