[
https://issues.apache.org/jira/browse/PHOENIX-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636601#comment-17636601
]
Istvan Toth commented on PHOENIX-5066:
--------------------------------------
Thank you for looking into this, [~dbwong] .
Replying on the ticket in the hopes of attracting more participants:
Currently we're starting new threads in very few places in the client code,
and those are easy to handle.
I think we could even push the ThreadLocal handling code down into a custom
Executor, which would make it mostly transparent to the rest of the threading
code.
So I think that the potentially increased complexity for multithreaded code is
not a major problem.
Would pushing the logic into a custom executor alleviate your concerns?
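To make the idea concrete, here is a minimal sketch of such an executor. This is not Phoenix code; the class and field names are invented, and a String stands in for the real Context. The point is only that the wrapper captures the caller's ThreadLocal on submit and installs/restores it on the worker thread:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: an ExecutorService wrapper that copies the caller's
// ThreadLocal context into each submitted task, making TL propagation
// transparent to the rest of the threading code.
public class ContextPropagatingExecutor {
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    private final ExecutorService delegate;

    public ContextPropagatingExecutor(ExecutorService delegate) {
        this.delegate = delegate;
    }

    public <T> Future<T> submit(Callable<T> task) {
        final String captured = CONTEXT.get(); // capture on the caller thread
        return delegate.submit(() -> {
            String previous = CONTEXT.get();
            CONTEXT.set(captured);             // install on the worker thread
            try {
                return task.call();
            } finally {
                CONTEXT.set(previous);         // restore, so pooled threads don't leak state
            }
        });
    }

    public void shutdown() { delegate.shutdown(); }

    // demo helper: submits a task that reads the TL on the worker thread
    public static String demo(String tz) {
        ContextPropagatingExecutor ex =
                new ContextPropagatingExecutor(Executors.newSingleThreadExecutor());
        CONTEXT.set(tz);
        try {
            return ex.submit(() -> CONTEXT.get()).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            ex.shutdown();
        }
    }
}
```

Any code submitting through this wrapper would see the caller's context without touching the TL itself.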
I'll go into a bit more detail on the problems stemming from the current code
that led me to using a TL:
The problems that I encounter do not stem from a lack of structure, but
rather from invalidating a fundamental assumption that the code makes.
The current code considers the expression tree to be fully self-contained, and
the type system to be a purely functional
interface that is implemented by global singleton objects.
1.
The parsing and formatting of dates to/from strings is done by methods of the
type singletons. With the new TZ handling, we cannot use a single TZ for every
thread in a JVM. Even if we were to remove the feature where we can set the TZ
on a per connection basis on the client side, the server side still has to
handle requests coming from multiple timezones correctly, so that wouldn't
help. Forcing the whole installation to be in a single timezone would make most
of the features in the patch useless.
So we somehow have to have access to the Context containing the current TZ when
doing the parsing/formatting. The solutions that I thought of are:
1a. Taking the context from a TL (what we have now)
It is simple to implement, but there is overhead in accessing the TL, and we
need to manage the TL.
Another problem is that unlike the Nodes, where we can cache the context taken
from the TL, we cannot cache it in the singleton Type classes.
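A rough sketch of what 1a looks like in practice (not the actual Phoenix PDate code; names are illustrative). The singleton is shared by all threads, so the time zone has to be re-read from the TL on every conversion and cannot be cached on the type object:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical sketch of option 1a: a global type singleton that takes the
// time zone from a ThreadLocal on each parse call.
public class PDateSketch {
    static final ThreadLocal<TimeZone> TZ_CONTEXT =
            ThreadLocal.withInitial(() -> TimeZone.getTimeZone("UTC"));

    public static final PDateSketch INSTANCE = new PDateSketch();

    public Date parse(String s) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(TZ_CONTEXT.get()); // the per-call TL read is the overhead noted above
        try {
            return fmt.parse(s);
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

The same literal parsed under two different TL values yields epoch instants 8 hours apart for UTC vs GMT+8, which is exactly the per-thread behavior the singleton cannot provide without the TL.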
1b. Adding the context as a parameter to the methods where conversion can
happen.
This is trivial to do, but we push the problem one level up, where we need to
have the context in the callers.
1b1. We could simply pass the context in every method call on the code paths
calling the conversions. However, this is such a fundamental operation that we
would need to pass the context through most methods in the expression
evaluation code paths, which would probably have a higher perf impact than
using a TL, and would hurt code readability.
1b2. We could push the context into every Expression node, which would be a
moderate memory hit (a single extra pointer per node). This also requires
adding the context to the node constructors / factories, but not to the
evaluation methods.
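A minimal sketch of what 1b2 could look like. The class names are made up and stand in for the real Phoenix Context and Expression types; the point is that the node carries one extra reference, set at construction, so evaluation needs neither a TL lookup nor an extra parameter:

```java
// Hypothetical sketch of option 1b2: per-node context storage.
public class LiteralDateExpression {
    // stand-in for the real Context; carries the connection's time zone
    public static final class Context {
        final java.util.TimeZone tz;
        public Context(java.util.TimeZone tz) { this.tz = tz; }
    }

    private final Context context;  // the single extra pointer per node
    private final String literal;

    public LiteralDateExpression(Context context, String literal) {
        this.context = context;     // injected by the node constructor / factory
        this.literal = literal;
    }

    public java.util.Date evaluate() {
        java.text.SimpleDateFormat fmt =
                new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(context.tz); // context available without a TL lookup
        try {
            return fmt.parse(literal);
        } catch (java.text.ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```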
1c. Using customized Type instances (instead of singletons) that include the
Context
This is basically the same case as 1b, but instead of handling the context,
we'd need to handle the references/factory for the type instances containing
the Context in the same way, so it doesn't sound like a good bargain. (It would
save passing a parameter when calling the conversions, though.)
1d. Moving out the conversion code from the type classes.
This is not really different from 1b and 1c, but would additionally break the
type encapsulation.
2.
The current code considers the protobuf serialized expression tree to be a
fully self-contained entity, that contains everything needed for evaluation.
The need for a context breaks this assumption as well.
Furthermore, HBase performs the filter deserialization before calling any
coprocessor hooks, so we can't use our usual scan attribute mechanism to make
the Context available during deserialization.
We also have the exact same problem with type singletons as on the client side,
but the solutions are also the same.
Possible solutions:
2a. Using a TL (what we have now)
In this case we set the context in the scan thread from a coprocessor hook.
This is simple to implement, as we can use the same code that we use on the
client side to read the TL to perform lazy evaluation of the context during
expression evaluation, and sidestep the problem of not having the Context
available when deserializing.
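The lazy-resolution trick in 2a can be sketched as follows (names invented; this is not the actual coprocessor code). The hook installs the Context into a TL before the scan runs, and a deserialized expression picks it up on first evaluation rather than at deserialization time:

```java
// Hypothetical sketch of option 2a on the server side: a coprocessor-style
// hook installs the Context into a ThreadLocal, and expressions resolve it
// lazily, sidestepping the fact that HBase deserializes the filter before
// any coprocessor hook fires.
public class ScanContextHolder {
    static final ThreadLocal<String> SCAN_TZ = new ThreadLocal<>();

    // would be called from something like preScannerOpen(), before evaluation
    public static void install(String tz) { SCAN_TZ.set(tz); }
    public static void clear() { SCAN_TZ.remove(); }

    // an expression deserialized without a Context resolves it lazily
    public static class LazyExpr {
        private String tz; // null until first evaluation

        public String evaluate() {
            if (tz == null) {
                tz = SCAN_TZ.get(); // lazy pickup on the scan thread
            }
            return tz;
        }
    }
}
```

After the first evaluation the node has cached the context, so later evaluations no longer touch the TL.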
2b. Adding a synthetic root node to the expression tree that sets the Context.
This would solve the problem of not having the Context available during
deserialization. I'm not sure if we can store and access this information
during deserialization via the HBase API; we may have to use another TL just
for deserialization purposes.
2c. Using a placeholder Context object for deserializing the expression tree,
and updating it from the coprocessor hook.
This is similar to 2b, but we initialize the object later. I haven't checked if
we can pass the placeholder via the HBase API; we may have to use a TL for
this purpose again.
Both 2b and 2c assume that we have somehow solved the Type context without TLs,
for example by using method 1b or 1c.
3.
We can access the Context objects from the parser quite easily; however, when
we are generating synthetic nodes in the Compiler while optimizing, we have
quite deep call trees where the Context is not available.
The possible solutions are similar to the other cases:
3a. Using a TL (what we have now)
Even though we don't have the context directly available when creating
synthetic nodes, we don't even need it if the created nodes initialize
their contexts from the TL during evaluation.
3b. Pushing the context everywhere as needed.
This is simply tracing the call tree and adding the context parameter to
every call until we reach an object where it is available. Ugly, and a bit of
a perf hit, but certainly possible.
3c. Taking the context from the existing nodes.
If we already have the context in the original nodes, as in the 1b solution,
we should be able to copy the context from those.
4.
Generating synthetic ConnectionlessQueryServicesImpl objects.
On the server side, for handling default values in MetaDataEndpointImpl (and
possibly other purposes), we create ConnectionlessQueryServicesImpl objects,
which we use to parse and compile Expressions. We also need a Context for that.
This is just another issue that we need to keep track of, whatever solution we
use for the other issues determines how we solve this.
4a. We currently set the TL before we call these functions (in addColumn and
dropColumn)
4b. If we had a mechanism where the context is pushed to the nodes during
parsing/compiling, or during evaluation, then we could just set it when
creating the CLQSI / PhoenixConnection, and avoid the TL mechanism.
Other problems with the current TL solution:
5.
We need to update the TL on the public JDBC entry points.
Apart from some code bloat, this is not much of a problem for calls like
execute(), where the cycles used for setting and clearing the TL are
insignificant, but it can have a relevant perf impact on the ResultSet.get...
methods, where less processing happens.
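The pattern issue 5 describes looks roughly like this (names invented; a String stands in for the real Context). Every public entry point brackets its work with a TL set/clear, which is negligible next to execute() but is proportionally expensive on a thin getter:

```java
// Hypothetical sketch: per-call ThreadLocal bracketing at a JDBC entry point.
public class EntryPointSketch {
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    public static String getDateLike(String connectionTz) {
        CONTEXT.set(connectionTz);   // per-call TL write on entry
        try {
            return CONTEXT.get();    // stands in for the thin getter work
        } finally {
            CONTEXT.remove();        // per-call TL clear on exit
        }
    }
}
```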
6.
We also need to set the TL for a lot of UTs, which are not using the public
JDBC entry points.
This is one more thing to keep track of when writing tests.
Currently this is done in a quite haphazard manner; we need to clean this up
if we keep the current solution.
I am really looking for feedback on the issues above.
Should I work on refining the current TL-based solution (and, for example, run
some benchmarks on the perf impacts),
or should I look into alternatives?
Of the alternatives, I am leaning towards storing the context in every
Expression (solutions 1b2 and 2c above).
This would trade some memory, but would save having to pass the context along
long call trees, and would have the least impact on the rest of the code (apart
from the existing solution).
Also note that major parts of the current patch are not even related to the TZ
issue, but are fixes and tests for the currently badly broken temporal rounding
functions (PHOENIX-6823 and friends).
Those could be handled in a separate patch, but we need to rewrite that code
anyway, so I haven't taken the time to split them out and rebuild the patch on
top of that.
> The TimeZone is incorrectly used during writing or reading data
> ---------------------------------------------------------------
>
> Key: PHOENIX-5066
> URL: https://issues.apache.org/jira/browse/PHOENIX-5066
> Project: Phoenix
> Issue Type: Bug
> Affects Versions: 5.0.0, 4.14.1
> Reporter: Jaanai Zhang
> Assignee: Istvan Toth
> Priority: Critical
> Fix For: 5.3.0
>
> Attachments: DateTest.java, PHOENIX-5066.4x.v1.patch,
> PHOENIX-5066.4x.v2.patch, PHOENIX-5066.4x.v3.patch,
> PHOENIX-5066.master.v1.patch, PHOENIX-5066.master.v2.patch,
> PHOENIX-5066.master.v3.patch, PHOENIX-5066.master.v4.patch,
> PHOENIX-5066.master.v5.patch, PHOENIX-5066.master.v6.patch
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> We have two methods to write data when using the JDBC API.
> #1. Uses the _executeUpdate_ method to execute a string that is an upsert SQL.
> #2. Uses the _prepareStatement_ method to set some objects and execute.
> The _string_ data needs to be converted to a new object based on the schema
> information of the tables. We'll use some date formatters to convert string
> data to objects for Date/Time/Timestamp types when writing data, and the
> formatters are used when reading data as well.
>
> *Uses default timezone test*
> Writing 3 records in the different ways.
> {code:java}
> UPSERT INTO date_test VALUES (1,'2018-12-10 15:40:47','2018-12-10
> 15:40:47','2018-12-10 15:40:47')
> UPSERT INTO date_test VALUES (2,to_date('2018-12-10
> 15:40:47'),to_time('2018-12-10 15:40:47'),to_timestamp('2018-12-10 15:40:47'))
> stmt.setInt(1, 3);stmt.setDate(2, date);stmt.setTime(3,
> time);stmt.setTimestamp(4, ts);
> {code}
> Reading the table by the getObject(getDate/getTime/getTimestamp) methods.
> {code:java}
> 1 | 2018-12-10 | 23:45:07 | 2018-12-10 23:45:07.0
> 2 | 2018-12-10 | 23:45:07 | 2018-12-10 23:45:07.0
> 3 | 2018-12-10 | 15:45:07 | 2018-12-10 15:45:07.66
> {code}
> Reading the table by the getString methods
> {code:java}
> 1 | 2018-12-10 15:45:07.000 | 2018-12-10 15:45:07.000 | 2018-12-10
> 15:45:07.000
> 2 | 2018-12-10 15:45:07.000 | 2018-12-10 15:45:07.000 | 2018-12-10
> 15:45:07.000
> 3 | 2018-12-10 07:45:07.660 | 2018-12-10 07:45:07.660 | 2018-12-10
> 07:45:07.660
> {code}
> *Uses GMT+8 test*
> Writing 3 records in the different ways.
> {code:java}
> UPSERT INTO date_test VALUES (1,'2018-12-10 15:40:47','2018-12-10
> 15:40:47','2018-12-10 15:40:47')
> UPSERT INTO date_test VALUES (2,to_date('2018-12-10
> 15:40:47'),to_time('2018-12-10 15:40:47'),to_timestamp('2018-12-10 15:40:47'))
> stmt.setInt(1, 3);stmt.setDate(2, date);stmt.setTime(3,
> time);stmt.setTimestamp(4, ts);
> {code}
> Reading the table by the getObject(getDate/getTime/getTimestamp) methods.
> {code:java}
> 1 | 2018-12-10 | 23:40:47 | 2018-12-10 23:40:47.0
> 2 | 2018-12-10 | 15:40:47 | 2018-12-10 15:40:47.0
> 3 | 2018-12-10 | 15:40:47 | 2018-12-10 15:40:47.106 {code}
> Reading the table by the getString methods
> {code:java}
> 1 | 2018-12-10 23:40:47.000 | 2018-12-10 23:40:47.000 | 2018-12-10
> 23:40:47.000
> 2 | 2018-12-10 15:40:47.000 | 2018-12-10 15:40:47.000 | 2018-12-10
> 15:40:47.000
> 3 | 2018-12-10 15:40:47.106 | 2018-12-10 15:40:47.106 | 2018-12-10
> 15:40:47.106
> {code}
>
> _We_ have a historical problem: we'll parse the string to
> Date/Time/Timestamp objects with a timezone in #1, which means the actual
> data is going to be changed when stored in the HBase table.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)