Copying the conversation here in the hopes of getting more eyes:

Daniel has responded on the PR:

I had a high-level question on your approach. Your email used a thread-local
variable for somewhat global access; I worry that if our execution engine
becomes more multi-threaded, you will have to copy this to all the threads,
and it will become more and more cumbersome. I'm hesitant about such patterns
for things that are not semaphore-like. I assume the resistance in the code
is from our lack of structure and not wanting to pass a parameter everywhere.
I think another approach is possible with a provider pattern, given the
connection and some JVM-level singleton map. Did you consider this approach?
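For reference, my reading of Daniel's suggestion is roughly the following
sketch: a JVM-level singleton map keyed by the connection, consulted wherever
the context is needed. The class names here are made up for illustration, not
actual Phoenix classes.

```java
import java.util.Map;
import java.util.TimeZone;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for the per-connection evaluation context.
final class ExpressionContext {
    final TimeZone timeZone;
    ExpressionContext(TimeZone tz) { this.timeZone = tz; }
}

// Hypothetical provider: a JVM-level singleton map from connection identity
// to context, registered when the connection is created.
final class ContextProvider {
    private static final Map<Object, ExpressionContext> CONTEXTS =
            new ConcurrentHashMap<>();

    static void register(Object connection, ExpressionContext ctx) {
        CONTEXTS.put(connection, ctx);
    }

    static ExpressionContext forConnection(Object connection) {
        return CONTEXTS.get(connection);
    }

    static void unregister(Object connection) {
        CONTEXTS.remove(connection);
    }
}
```

The catch, as I try to explain below, is that the type singletons and the
server-side deserialization code have no reference to the connection, so they
have nothing to key this lookup on.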


and this was my response on the ticket:

Currently we're starting new threads in only a few places in the client
code, and those are easy to handle.

I think we could even push down the ThreadLocal handling code into a custom
Executor, which would make it mostly transparent to the rest of the
threading code.

So I think that the potentially increased complexity for multithreaded code
is not a major problem.

Would pushing the logic into a custom executor alleviate your concerns?
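A rough sketch of what I mean by a custom executor (names are illustrative,
not actual Phoenix classes): wrap each submitted task so that the ThreadLocal
value captured at submission time is installed on the worker thread and
restored afterwards.

```java
import java.util.concurrent.Executor;

// Illustrative holder for the thread-local context.
final class ContextHolder {
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();
}

// An Executor decorator that propagates the caller's ThreadLocal context
// to whichever thread ends up running the task.
final class ContextPropagatingExecutor implements Executor {
    private final Executor delegate;

    ContextPropagatingExecutor(Executor delegate) { this.delegate = delegate; }

    @Override
    public void execute(Runnable task) {
        // Capture the submitter's context...
        final String captured = ContextHolder.CONTEXT.get();
        delegate.execute(() -> {
            final String previous = ContextHolder.CONTEXT.get();
            ContextHolder.CONTEXT.set(captured);     // ...install it on the worker
            try {
                task.run();
            } finally {
                ContextHolder.CONTEXT.set(previous); // ...and restore on exit
            }
        });
    }
}
```

The same wrapping could be applied to ExecutorService.submit() variants, which
would make the TL propagation invisible to the rest of the threading code.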

I'll go into a bit more detail on the problems stemming from the current
code that led me to using a TL:

The problems that I encounter do not stem from a lack of structure, but
rather from invalidating a fundamental assumption that the code makes.

The current code considers the expression tree to be fully self-contained,
and the type system to be a purely functional interface that is implemented
by global singleton objects.

1.

The parsing and formatting of dates to/from strings is done by methods of
the type singletons. With the new TZ handling, we cannot use a single TZ
for every thread in a JVM. Even if we were to remove the feature where we
can set the TZ on a per-connection basis on the client side, the server
side still has to handle requests coming from multiple timezones correctly,
so that wouldn't help. Forcing the whole installation to be in a single
timezone would make most of the features in the patch useless.

So we somehow have to have access to the Context containing the current TZ
when doing the parsing/formatting. The solutions that I thought of were:

1a. Taking the context from a TL (what we have now)

It is simple to implement, but there is an overhead in accessing the TL, and
we need to manage the TL.

Another problem is that unlike the Nodes, where we can cache the context
taken from the TL, we cannot cache it in the singleton Type classes.
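To illustrate the caching difference (with made-up names, not actual Phoenix
classes): a node belongs to one statement, so it can lazily cache the TL
lookup, whereas the shared type singleton must re-read the TL on every call,
because caching there would leak one thread's context to every other thread.

```java
// Illustrative thread-local carrying the current context.
final class TLContext {
    static final ThreadLocal<String> CURRENT = new ThreadLocal<>();
}

// A per-statement expression node: safe to cache the TL lookup, since the
// node is only used by the statement that created it.
final class DateNode {
    private String cachedContext;

    String context() {
        if (cachedContext == null) {
            cachedContext = TLContext.CURRENT.get(); // one TL hit, then cached
        }
        return cachedContext;
    }
}

// A global type singleton: shared by every thread in the JVM, so it must pay
// the ThreadLocal lookup on every single call.
final class DateType {
    static final DateType INSTANCE = new DateType();

    String context() {
        return TLContext.CURRENT.get();
    }
}
```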

1b. Adding the context as a parameter to the methods where conversion can
happen.

This is trivial to do, but we push the problem one level up, where we need
to have the context in the callers.

1b1. We could simply pass the context in every method call on the code
paths calling the conversions. However, this is such a fundamental
operation that we would need to pass the context through most methods in
the expression evaluation code paths, which would probably have a higher
perf impact than using a TL, and would hurt code readability.

1b2. We could push the context into every Expression node, which would be a
moderate memory hit (a single extra pointer per node). This also requires
adding the context to the node constructors / factories, but not to the
evaluation methods.
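A minimal sketch of what 1b2 could look like (illustrative names only; this
is not the actual Phoenix Expression hierarchy):

```java
import java.util.TimeZone;

// Illustrative stand-in for the evaluation context.
final class ExprContext {
    final TimeZone timeZone;
    ExprContext(TimeZone tz) { this.timeZone = tz; }
}

// Every node carries one extra pointer to the context, set at construction,
// so evaluate() needs no extra parameter.
abstract class Expression {
    protected final ExprContext context;

    protected Expression(ExprContext context) { this.context = context; }

    abstract String evaluate();
}

// Example node: the context is available during evaluation without any TL.
final class CurrentTimeZoneExpression extends Expression {
    CurrentTimeZoneExpression(ExprContext context) { super(context); }

    @Override
    String evaluate() {
        return context.timeZone.getID();
    }
}
```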

1c. Using customized Type instances (instead of singletons) that include
the Context

This is basically the same case as 1b, but instead of handling the context,
we'd need to handle the references/factory for the Type instances
containing the Context in the same way, so it doesn't sound like a good
bargain. (It would save passing a parameter when calling the conversions,
though.)

1d. Moving out the conversion code from the type classes.

This is not really different from 1b and 1c, but would additionally break
the type encapsulation.

2.

The current code considers the protobuf-serialized expression tree to be a
fully self-contained entity that contains everything needed for
evaluation. The need for a context breaks this assumption as well.

Furthermore, HBase performs the filter deserialization before calling any
coprocessor hooks, so we can't use our usual scan parameter method to push
the Context and make it available during deserialization.

We also have the exact same problem with the type singletons as on the
client side, but the solutions are also the same.

Possible solutions:

2a. Using a TL (what we have now)

In this case we set the context in the scan thread from a coprocessor hook.

This is simple to implement, as we can use the same code that we use on the
client side to read the TL to perform lazy evaluation of the context during
expression evaluation, and sidestep the problem of not having the Context
available when deserializing.

2b. Adding a synthetic root node to the expression tree that sets the Context.

This would solve the problem of not having the Context available during
deserialization. I'm not sure if we can store and access this information
during deserialization via the HBase API; we may have to use another TL just
for deserialization purposes.
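A rough sketch of the 2b idea (illustrative names; not how the actual
HBase/Phoenix wiring would look): a synthetic root node carries the context
and installs it around the evaluation of the real subtree.

```java
// Illustrative thread-local used only while the subtree evaluates.
final class RootTL {
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();
}

// Minimal stand-in for an expression node.
interface Expr {
    String eval();
}

// Synthetic root: reconstructed during deserialization with the context,
// which it publishes to the subtree for the duration of evaluation.
final class ContextRootNode implements Expr {
    private final String context;
    private final Expr child;

    ContextRootNode(String context, Expr child) {
        this.context = context;
        this.child = child;
    }

    @Override
    public String eval() {
        RootTL.CONTEXT.set(context);   // make the context visible below
        try {
            return child.eval();
        } finally {
            RootTL.CONTEXT.remove();   // clean up after evaluation
        }
    }
}
```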

2c. Using a placeholder Context object for deserializing the expression
tree, and updating it from the coprocessor hook.

This is similar to 2b, but we initialize the object later. I haven't checked
whether we can pass the placeholder from the HBase API; we may have to use a
TL for this purpose again.
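A rough sketch of the 2c placeholder idea (again with made-up names): the
tree is deserialized against an empty, mutable context object, which the
coprocessor hook fills in before any evaluation happens.

```java
import java.util.TimeZone;
import java.util.concurrent.atomic.AtomicReference;

// Mutable placeholder: empty at deserialization time, populated later from
// the coprocessor hook once the scan attributes are available.
final class PlaceholderContext {
    private final AtomicReference<TimeZone> timeZone = new AtomicReference<>();

    void initialize(TimeZone tz) { timeZone.set(tz); }

    TimeZone timeZone() {
        TimeZone tz = timeZone.get();
        if (tz == null) {
            throw new IllegalStateException("context used before initialization");
        }
        return tz;
    }
}

// A node deserialized against the still-empty placeholder; it only
// dereferences the context at evaluation time.
final class DeserializedNode {
    private final PlaceholderContext context;

    DeserializedNode(PlaceholderContext context) { this.context = context; }

    String evaluate() { return context.timeZone().getID(); }
}
```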

Both 2b and 2c assume that we have somehow solved the Type context without
TLs, e.g. by using method 1b or 1c.

3.

We can access the Context objects from the parser quite easily; however,
when we are generating synthetic nodes in the Compiler while optimizing, we
have quite deep call trees where the Context is not available.

The possible solutions are similar to the other cases:

3a. Using a TL (what we have now)

Even though we don't have the context directly available when creating
synthetic nodes, we don't even need it if the created nodes initialize
their contexts from the TL during evaluation.

3b. Pushing the context everywhere as needed.

This simply means tracking the call tree and adding the context parameter
to every call until we reach an object where it is available. Ugly, and a
bit of a perf hit, but certainly possible.

3c. Taking the context from the existing nodes.

If we already have the context in the original nodes, as in the 1b
solution, we should be able to copy the context from those.

4.

Generating synthetic ConnectionlessQueryServicesImpl objects.

On the server side, for handling default values in MetaDataEndpointImpl (and
possibly other purposes), we create ConnectionlessQueryServicesImpl
objects, which we use to parse and compile Expressions. We also need a
Context for that.

This is just another issue that we need to keep track of, whatever solution
we use for the other issues determines how we solve this.

4a. We currently set the TL before we call these functions (in addColumn
and dropColumn).

4b. If we had a mechanism where the context is pushed to the nodes during
parsing/compiling, or during evaluation, then we could just set it when
creating the CLQSI / PhoenixConnection, and avoid the TL mechanism.

Other problems with the current TL solution:

5.

We need to update the TL on the public JDBC entry points.

Apart from some code bloat, this is not much of a problem for calls like
execute(), where the cycles used for setting and clearing the TL are
insignificant, but it can have a relevant perf impact on the
ResultSet.get... methods, where less processing happens.

6.

We also need to set the TL for a lot of UTs, which are not using the public
JDBC entry points.

This is one more thing to keep track of when writing tests.

Currently this is done in a rather haphazard manner; we need to clean this
up if we keep the current solution.

I am really looking for feedback on the issues above.

Should I work on refining the current TL-based solution (and, for example,
run some benchmarks on the perf impact), or should I look into alternatives?

Of the alternatives, I am leaning towards storing the context in every
Expression (1b2 and 2c solutions above).

This would trade some memory, but would save having to pass the context
down long call trees, and would have the least impact on the rest of the
code (apart from the existing solution).

Also note that major parts of the current patch are not even related to the
TZ issue, but are fixes and tests for the currently badly broken temporal
rounding functions (PHOENIX-6823
<https://issues.apache.org/jira/browse/PHOENIX-6823> and friends).

Those could be handled in a separate patch, but we need to rewrite that
code anyway, so I haven't taken the time to split them out and rebuild the
patch on top of that.


On Wed, Nov 9, 2022 at 1:05 PM Istvan Toth <[email protected]> wrote:

> Hi!
>
> I now have a preliminary patch for TimeZone sensitive date/time handling
> in Phoenix.
> While the patch is not ready for inclusion, it is functionally complete (I
> think), and can form a basis for discussing the implementation of the new
> feature.
>
> Please look at it, and let me know if you think that this approach is
> appropriate, or if you have suggestions for changes in the design and / or
> implementation.
>
> See a quick summary from the ticket below:
>
> I now have a functionally complete (still needs optimizations and
> cleanups) WIP patch at https://github.com/apache/phoenix/pull/1504
>
> At the heart of the solution is the new ExpressionContext object. This
> object encapsulates the context necessary for evaluating an expression. It
> is tied to a PhoenixConnection object, and can be configured by properties.
> At the moment it stores the TimeZone and the format strings.
>
> We have two implementations, one of which is GMTExpressionContext, which
> implements the old date handling behaviour and treats dates (mostly, at
> least on the server side) as GMT timestamps.
>
> The new implementation is CompliantExpressionContext, which is initialized
> by the client TimeZone (can be overridden), and uses it for String
> parsing/printing, and for implementing the date functions.
>
> The current Phoenix expression code, especially the singleton type system,
> is very resistant to adding this context to it, so my implementation uses a
> ThreadLocal variable to store and access the ExpressionContext.
>
> The big challenge is making sure that we propagate the ExpressionContext
> through all the components of a statement execution.
>
>
>    - We're creating new threads for processing scan, and we need to make
>       sure that ExpressionContext ThreadLocal is propagated.
>       - We're also manipulating the classloader, we need to make sure
>       that we copy the Expression TL to the new Classloader.
>       - We need to push the context to the coprocessors. We do this by
>       encoding the Context into Scan properties, and processing those in the
>       coprocessors.
>       - The Expression reconstruction happens BEFORE we rebuild the
>       ExpressionContext TL in the coprocessor hook, so we need to lazily 
> evaluate
>       the context when executing the expression.
>    - There are a few hacky places where we create a
>    ConnectionlessQueryServices object on the server side (to handle default
>    expressions). I have added new fields to the relevant RPC definitions in
>    MetaDataService.proto to push the ExpressionContext into the operations.
>
>
> regards
> Istvan
>
