[
https://issues.apache.org/jira/browse/PHOENIX-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636601#comment-17636601
]
Istvan Toth commented on PHOENIX-5066:
--------------------------------------
Thank you for looking into this, [~dbwong] .
Replying on the ticket in the hopes of attracting more participants:
Currently we're starting new threads in very few places in the client code,
and those are easy to handle.
I think we could even push the ThreadLocal handling code down into a custom
Executor, which would make it mostly transparent to the rest of the threading
code.
So I think that the potentially increased complexity for multithreaded code is
not a major problem.
Would pushing the logic into a custom executor alleviate your concerns?
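To make the idea concrete, here is a minimal sketch of such an executor. This is not Phoenix code; the class and field names are invented, and a String stands in for the real Context. The point is only that the wrapper captures the caller's ThreadLocal on submit and installs/restores it on the worker thread:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: an ExecutorService wrapper that copies the caller's
// ThreadLocal context into each submitted task, making TL propagation
// transparent to the rest of the threading code.
public class ContextPropagatingExecutor {
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    private final ExecutorService delegate;

    public ContextPropagatingExecutor(ExecutorService delegate) {
        this.delegate = delegate;
    }

    public <T> Future<T> submit(Callable<T> task) {
        final String captured = CONTEXT.get(); // capture on the caller thread
        return delegate.submit(() -> {
            String previous = CONTEXT.get();
            CONTEXT.set(captured);             // install on the worker thread
            try {
                return task.call();
            } finally {
                CONTEXT.set(previous);         // restore, so pooled threads don't leak state
            }
        });
    }

    public void shutdown() { delegate.shutdown(); }

    // demo helper: submits a task that reads the TL on the worker thread
    public static String demo(String tz) {
        ContextPropagatingExecutor ex =
                new ContextPropagatingExecutor(Executors.newSingleThreadExecutor());
        CONTEXT.set(tz);
        try {
            return ex.submit(() -> CONTEXT.get()).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            ex.shutdown();
        }
    }
}
```

Any code submitting through this wrapper would see the caller's context without touching the TL itself.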
I'll go into a bit more detail on the problems stemming from the current code
that led me to using a TL:
The problems that I encounter do not stem from a lack of structure, but
rather from invalidating a fundamental assumption that the code makes.
The current code considers the expression tree to be fully self-contained, and
the type system to be a purely functional
interface that is implemented by global singleton objects.
1.
The parsing and formatting of dates to/from strings is done by methods of the
type singletons. With the new TZ handling, we cannot use a single TZ for every
thread in a JVM. Even if we were to remove the feature where we can set the TZ
on a per connection basis on the client side, the server side still has to
handle requests coming from multiple timezones correctly, so that wouldn't
help. Forcing the whole installation to be in a single timezone would make most
of the features in the patch useless.
So we somehow have to have access to the Context containing the current TZ when
doing the parsing/formatting. The solutions that I thought of are:
1a. Taking the context from a TL (what we have now)
It is simple to implement, but there is overhead in accessing the TL, and we
need to manage the TL.
Another problem is that unlike the Nodes, where we can cache the context taken
from the TL, we cannot cache it in the singleton Type classes.
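A rough sketch of what 1a looks like in practice (not the actual Phoenix PDate code; names are illustrative). The singleton is shared by all threads, so the time zone has to be re-read from the TL on every conversion and cannot be cached on the type object:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical sketch of option 1a: a global type singleton that takes the
// time zone from a ThreadLocal on each parse call.
public class PDateSketch {
    static final ThreadLocal<TimeZone> TZ_CONTEXT =
            ThreadLocal.withInitial(() -> TimeZone.getTimeZone("UTC"));

    public static final PDateSketch INSTANCE = new PDateSketch();

    public Date parse(String s) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(TZ_CONTEXT.get()); // the per-call TL read is the overhead noted above
        try {
            return fmt.parse(s);
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

The same literal parsed under two different TL values yields epoch instants 8 hours apart for UTC vs GMT+8, which is exactly the per-thread behavior the singleton cannot provide without the TL.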
1b. Adding the context as a parameter to the methods where conversion can
happen.
This is trivial to do, but we push the problem one level up, where we need to
have the context in the callers.
1b1. We could simply pass the context in every method call on the code paths
calling the conversions. However, this is such a fundamental operation that we
would need to pass the context through most methods in the expression
evaluation code paths, which would probably have a higher perf impact than
using a TL, and would hurt code readability.
1b2. We could push the context into every Expression node, which would be a
moderate memory hit (a single extra pointer per node). This also requires
adding the context to the node constructors / factories, but not to the
evaluation methods.
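A minimal sketch of what 1b2 could look like. The class names are made up and stand in for the real Phoenix Context and Expression types; the point is that the node carries one extra reference, set at construction, so evaluation needs neither a TL lookup nor an extra parameter:

```java
// Hypothetical sketch of option 1b2: per-node context storage.
public class LiteralDateExpression {
    // stand-in for the real Context; carries the connection's time zone
    public static final class Context {
        final java.util.TimeZone tz;
        public Context(java.util.TimeZone tz) { this.tz = tz; }
    }

    private final Context context;  // the single extra pointer per node
    private final String literal;

    public LiteralDateExpression(Context context, String literal) {
        this.context = context;     // injected by the node constructor / factory
        this.literal = literal;
    }

    public java.util.Date evaluate() {
        java.text.SimpleDateFormat fmt =
                new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(context.tz); // context available without a TL lookup
        try {
            return fmt.parse(literal);
        } catch (java.text.ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```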
1c. Using customized Type instances (instead of singletons) that include the
Context
This is basically the same case as 1b, but instead of handling the context,
we'd need to handle the references/factory for the type instances containing
the Context in the same way, so it doesn't sound like a good bargain. (It would
save passing a parameter when calling the conversions, though.)
1d. Moving out the conversion code from the type classes.
This is not really different from 1b and 1c, but would additionally break the
type encapsulation.
2.
The current code considers the protobuf serialized expression tree to be a
fully self-contained entity, that contains everything needed for evaluation.
The need for a context breaks this assumption as well.
Furthermore, HBase performs the filter deserialization before calling any
coprocessor hooks, so we can't use our usual scan attribute mechanism to make
the Context available during deserialization.
We also have the exact same problem with type singletons as on the client side,
but the solutions are also the same.
Possible solutions:
2a. Using a TL (what we have now)
In this case we set the context in the scan thread from a coprocessor hook.
This is simple to implement, as we can use the same code that we use on the
client side to read the TL to perform lazy evaluation of the context during
expression evaluation, and sidestep the problem of not having the Context
available when deserializing.
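The lazy-resolution trick in 2a can be sketched as follows (names invented; this is not the actual coprocessor code). The hook installs the Context into a TL before the scan runs, and a deserialized expression picks it up on first evaluation rather than at deserialization time:

```java
// Hypothetical sketch of option 2a on the server side: a coprocessor-style
// hook installs the Context into a ThreadLocal, and expressions resolve it
// lazily, sidestepping the fact that HBase deserializes the filter before
// any coprocessor hook fires.
public class ScanContextHolder {
    static final ThreadLocal<String> SCAN_TZ = new ThreadLocal<>();

    // would be called from something like preScannerOpen(), before evaluation
    public static void install(String tz) { SCAN_TZ.set(tz); }
    public static void clear() { SCAN_TZ.remove(); }

    // an expression deserialized without a Context resolves it lazily
    public static class LazyExpr {
        private String tz; // null until first evaluation

        public String evaluate() {
            if (tz == null) {
                tz = SCAN_TZ.get(); // lazy pickup on the scan thread
            }
            return tz;
        }
    }
}
```

After the first evaluation the node has cached the context, so later evaluations no longer touch the TL.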
2b. Adding a synthetic root node to the expression tree that sets the Context.
This would solve the problem of not having the Context available during
deserialization. I'm not sure if we can store and access this information
during deserialization via the HBase API; we may have to use another TL just
for deserialization purposes.
2c. Using a placeholder Context object for deserializing the expression tree,
and updating it from the coprocessor hook.
This is similar to 2b, but we initialize the object later. I haven't checked if
we can pass the placeholder via the HBase API; we may have to use a TL for
this purpose again.
Both 2b and 2c assume that we have somehow solved the Type context without TLs,
for example by using method 1b or 1c.
3.
We can access the Context objects from the parser quite easily; however, when
we are generating synthetic nodes in the Compiler while optimizing, we have
quite deep call trees where the Context is not available.
The possible solutions are similar to the other cases:
3a. Using a TL (what we have now)
Even though we don't have the context directly available when creating
synthetic nodes, we don't even need it if the created nodes initialize
their contexts from the TL during evaluation.
3b. Pushing the context everywhere as needed.
This is simply tracing the call tree and adding the context parameter to
every call until we reach an object where it is available. Ugly, and a bit of
a perf hit, but certainly possible.
3c. Taking the context from the existing nodes.
If we already have the context in the original nodes, as in the 1b solution,
we should be able to copy the context from those.
4.
Generating synthetic ConnectionlessQueryServicesImpl objects.
On the server side, for handling default values in MetaDataEndpointImpl (and
possibly other purposes), we create ConnectionlessQueryServicesImpl objects,
which we use to parse and compile Expressions. We also need a Context for that.
This is just another issue that we need to keep track of, whatever solution we
use for the other issues determines how we solve this.
4a. We currently set the TL before we call these functions (in addColumn and
dropColumn)
4b. If we had a mechanism where the context is pushed to the nodes during
parsing/compiling, or during evaluation, then we could just set it when
creating the CLQSI / PhoenixConnection, and avoid the TL mechanism.
Other problems with the current TL solution:
5.
We need to update the TL on the public JDBC entry points.
Apart from some code bloat, this is not much of a problem for calls like
execute(), where the cycles used for setting and clearing the TL are
insignificant, but it can have a relevant perf impact on the ResultSet.get...
methods, where less processing happens.
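The pattern issue 5 describes looks roughly like this (names invented; a String stands in for the real Context). Every public entry point brackets its work with a TL set/clear, which is negligible next to execute() but is proportionally expensive on a thin getter:

```java
// Hypothetical sketch: per-call ThreadLocal bracketing at a JDBC entry point.
public class EntryPointSketch {
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    public static String getDateLike(String connectionTz) {
        CONTEXT.set(connectionTz);   // per-call TL write on entry
        try {
            return CONTEXT.get();    // stands in for the thin getter work
        } finally {
            CONTEXT.remove();        // per-call TL clear on exit
        }
    }
}
```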
6.
We also need to set the TL for a lot of UTs, which are not using the public
JDBC entry points.
This is one more thing to keep track of when writing tests.
Currently this is done in a quite haphazard manner; we need to clean this up
if we keep the current solution.
I am really looking for feedback on the issues above.
Should I work on refining the current TL-based solution (and, for example, run
some benchmarks on the perf impacts),
or should I look into alternatives?
Of the alternatives, I am leaning towards storing the context in every
Expression (solutions 1b2 and 2c above).
This would trade some memory, but would save having to pass the context along
long call trees, and would have the least impact on the rest of the code (apart
from the existing solution).
Also note that major parts of the current patch are not even related to the TZ
issue, but are fixes and tests for the currently badly broken temporal rounding
functions (PHOENIX-6823 and friends).
Those could be handled in a separate patch, but we need to rewrite that code
anyway, so I haven't taken the time to split them out and rebuild the patch on
top of that.
> The TimeZone is incorrectly used during writing or reading data
> ---------------------------------------------------------------
>
> Key: PHOENIX-5066
> URL: https://issues.apache.org/jira/browse/PHOENIX-5066
> Project: Phoenix
> Issue Type: Bug
> Affects Versions: 5.0.0, 4.14.1
> Reporter: Jaanai Zhang
> Assignee: Istvan Toth
> Priority: Critical
> Fix For: 5.3.0
>
> Attachments: DateTest.java, PHOENIX-5066.4x.v1.patch,
> PHOENIX-5066.4x.v2.patch, PHOENIX-5066.4x.v3.patch,
> PHOENIX-5066.master.v1.patch, PHOENIX-5066.master.v2.patch,
> PHOENIX-5066.master.v3.patch, PHOENIX-5066.master.v4.patch,
> PHOENIX-5066.master.v5.patch, PHOENIX-5066.master.v6.patch
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> We have two methods to write data when using the JDBC API.
> #1. Uses the _executeUpdate_ method to execute a string that is an upsert SQL.
> #2. Uses the _prepareStatement_ method to set some objects and execute.
> The _string_ data needs to be converted to a new object based on the schema
> information of the tables. We'll use some date formatters to convert string
> data to objects for Date/Time/Timestamp types when writing data, and the
> formatters are used when reading data as well.
>
> *Uses default timezone test*
> Writing 3 records in the different ways.
> {code:java}
> UPSERT INTO date_test VALUES (1,'2018-12-10 15:40:47','2018-12-10
> 15:40:47','2018-12-10 15:40:47')
> UPSERT INTO date_test VALUES (2,to_date('2018-12-10
> 15:40:47'),to_time('2018-12-10 15:40:47'),to_timestamp('2018-12-10 15:40:47'))
> stmt.setInt(1, 3);stmt.setDate(2, date);stmt.setTime(3,
> time);stmt.setTimestamp(4, ts);
> {code}
> Reading the table by the getObject(getDate/getTime/getTimestamp) methods.
> {code:java}
> 1 | 2018-12-10 | 23:45:07 | 2018-12-10 23:45:07.0
> 2 | 2018-12-10 | 23:45:07 | 2018-12-10 23:45:07.0
> 3 | 2018-12-10 | 15:45:07 | 2018-12-10 15:45:07.66
> {code}
> Reading the table by the getString methods
> {code:java}
> 1 | 2018-12-10 15:45:07.000 | 2018-12-10 15:45:07.000 | 2018-12-10
> 15:45:07.000
> 2 | 2018-12-10 15:45:07.000 | 2018-12-10 15:45:07.000 | 2018-12-10
> 15:45:07.000
> 3 | 2018-12-10 07:45:07.660 | 2018-12-10 07:45:07.660 | 2018-12-10
> 07:45:07.660
> {code}
> *Uses GMT+8 test*
> Writing 3 records in the different ways.
> {code:java}
> UPSERT INTO date_test VALUES (1,'2018-12-10 15:40:47','2018-12-10
> 15:40:47','2018-12-10 15:40:47')
> UPSERT INTO date_test VALUES (2,to_date('2018-12-10
> 15:40:47'),to_time('2018-12-10 15:40:47'),to_timestamp('2018-12-10 15:40:47'))
> stmt.setInt(1, 3);stmt.setDate(2, date);stmt.setTime(3,
> time);stmt.setTimestamp(4, ts);
> {code}
> Reading the table by the getObject(getDate/getTime/getTimestamp) methods.
> {code:java}
> 1 | 2018-12-10 | 23:40:47 | 2018-12-10 23:40:47.0
> 2 | 2018-12-10 | 15:40:47 | 2018-12-10 15:40:47.0
> 3 | 2018-12-10 | 15:40:47 | 2018-12-10 15:40:47.106 {code}
> Reading the table by the getString methods
> {code:java}
> 1 | 2018-12-10 23:40:47.000 | 2018-12-10 23:40:47.000 | 2018-12-10
> 23:40:47.000
> 2 | 2018-12-10 15:40:47.000 | 2018-12-10 15:40:47.000 | 2018-12-10
> 15:40:47.000
> 3 | 2018-12-10 15:40:47.106 | 2018-12-10 15:40:47.106 | 2018-12-10
> 15:40:47.106
> {code}
>
> _We_ have a historical problem: we'll parse the string to
> Date/Time/Timestamp objects with a timezone in #1, which means the actual
> data is going to be changed when stored in the HBase table.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)