After wasting way too much time on the above approach, I realized that this
would move Phoenix's time handling further away from the SQL standard.
Phoenix already ALMOST does the SQL-compliant thing by treating all
temporals as GMT; we just need to fix the JDBC API parts where we receive
or return Java temporal types.

See my last entry on the ticket:
https://issues.apache.org/jira/browse/PHOENIX-5066?focusedCommentId=17689728&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17689728

And a discussion on the missing steps when temporal Java types are used:

https://stackoverflow.com/questions/14070572/is-java-sql-timestamp-timezone-specific
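The core point from that discussion, as a minimal illustration: a
java.sql.Timestamp carries only an epoch instant, no time zone, so the same
instant renders differently depending on which zone you format it in. (The
class and zone choices below are just for demonstration.)

```java
import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class TimestampTzDemo {
    public static void main(String[] args) {
        // java.sql.Timestamp stores an epoch instant; it has no time zone of its own.
        Timestamp ts = new Timestamp(0L); // 1970-01-01T00:00:00Z

        SimpleDateFormat gmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        gmt.setTimeZone(TimeZone.getTimeZone("GMT"));

        SimpleDateFormat pst = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        pst.setTimeZone(TimeZone.getTimeZone("America/Los_Angeles"));

        // Same instant, two textual representations depending on the zone
        System.out.println(gmt.format(ts)); // 1970-01-01 00:00:00
        System.out.println(pst.format(ts)); // 1969-12-31 16:00:00
    }
}
```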

On Mon, Nov 21, 2022 at 11:05 AM Istvan Toth <[email protected]> wrote:

> Copying the conversation here in the hopes of getting more eyes:
>
> Daniel has responded on the PR:
>
> I had a high-level question on your approach. Your email used a
> thread-local variable for somewhat global access; I worry that if our
> execution engine becomes more multithreaded, you will have to copy this to
> all the threads, and it will become more and more cumbersome. I'm hesitant
> about such patterns for things that are not semaphore-like. I assume the
> resistance in the code is from our lack of structure and not wanting to
> pass a parameter everywhere. I think another approach is possible with a
> provider pattern given the connection and some JVM-level singleton map.
> Did you consider this approach?
>
>
> and this was my response on the ticket:
>
> Currently we're starting new threads in a very few places in the client
> code, and those are easy to handle.
>
> I think we could even push down the ThreadLocal handling code into a
> custom Executor, which would make it mostly transparent to the rest of the
> threading code.
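To sketch what I mean by a custom Executor (names here are illustrative,
not Phoenix's actual classes): the wrapper captures the submitting thread's
ThreadLocal value and installs it in the worker thread for the duration of
the task, so the rest of the threading code never touches the TL directly.

```java
import java.util.concurrent.Executor;

// Hypothetical sketch: an Executor that propagates a ThreadLocal context
// from the submitting thread to the worker thread.
public class ContextPropagatingExecutor implements Executor {
    // Stand-in for the ExpressionContext ThreadLocal discussed in this thread
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    private final Executor delegate;

    public ContextPropagatingExecutor(Executor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void execute(Runnable task) {
        final String captured = CONTEXT.get(); // capture on the submitting thread
        delegate.execute(() -> {
            String previous = CONTEXT.get();
            CONTEXT.set(captured); // install in the worker thread
            try {
                task.run();
            } finally {
                CONTEXT.set(previous); // restore, so nothing leaks across tasks
            }
        });
    }
}
```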
>
> So I think that the potentially increased complexity for multithreaded
> code is not a major problem.
>
> Would pushing the logic into a custom executor alleviate your concerns?
>
> I'll go into a bit more detail on the problems stemming from the current
> code that led me to using a TL:
>
> The problems that I encounter do not stem from a lack of structure, but
> rather from invalidating a fundamental assumption that the code makes.
>
> The current code considers the expression tree to be fully self-contained,
> and the type system to be a purely functional interface that is implemented
> by global singleton objects.
>
> 1.
>
> The parsing and formatting of dates to/from strings is done by methods of
> the type singletons. With the new TZ handling, we cannot use a single TZ
> for every thread in a JVM. Even if we were to remove the feature where we
> can set the TZ on a per connection basis on the client side, the server
> side still has to handle requests coming from multiple timezones correctly,
> so that wouldn't help. Forcing the whole installation to be in a single
> timezone would make most of the features in the patch useless.
>
> So we somehow have to have access to the Context containing the current TZ
> when doing the parsing/formatting. The solutions that I thought of are:
>
> 1a. Taking the context from a TL (what we have now)
>
> It is simple to implement, but there is an overhead accessing the TL, and
> we need to manage the TL.
>
> Another problem is that unlike the Nodes, where we can cache the context
> taken from the TL, we cannot cache it in the singleton Type classes.
>
> 1b. Adding the context as a parameter to the methods where conversion can
> happen.
>
> This is trivial to do, but we push the problem one level up, where we need
> to have the context in the callers.
>
> 1b1. We could simply push the context in every method call on the code
> paths calling the conversions. However, this is such a fundamental
> operation, that we would need to push the context via most methods in the
> expression evaluation code paths, which would probably have a higher perf
> impact than using a TL, and would hurt code readability.
>
> 1b2. We could push the context into every Expression node, which would be
> a moderate memory hit (a single extra pointer per node). This also requires
> adding the context to the node constructors / factories, but not to the
> evaluation methods.
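A hypothetical sketch of 1b2 (all names illustrative): each node holds one
reference to a shared context, injected via the constructor/factory, so the
evaluation methods need no extra parameter.

```java
import java.util.TimeZone;

// Hypothetical sketch of 1b2: the context travels with the node itself.
public class ExpressionNodeSketch {
    // Stand-in for Phoenix's ExpressionContext
    public static class Context {
        final TimeZone tz;
        public Context(TimeZone tz) { this.tz = tz; }
    }

    private final Context context; // a single extra pointer per node

    // Context is injected at construction time (by the node factory)
    public ExpressionNodeSketch(Context context) {
        this.context = context;
    }

    // Evaluation needs no context parameter threaded through the call tree
    public TimeZone evaluateTimeZone() {
        return context.tz;
    }
}
```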
>
> 1c. Using customized Type instances (instead of singletons) that include
> the Context
>
> This is basically the same case as 1b, but instead of handling the context,
> we'd need to handle the references/factory for the type instances
> containing the Context in the same way, so it doesn't sound like a good
> bargain. (It would save passing a parameter when calling the conversions,
> though.)
>
> 1d. Moving out the conversion code from the type classes.
>
> This is not really different from 1b and 1c, but would additionally break
> the type encapsulation.
>
> 2.
>
> The current code considers the protobuf serialized expression tree to be a
> fully self-contained entity, that contains everything needed for
> evaluation. The need for a context breaks this assumption as well.
>
> Furthermore, HBase performs the filter deserialization before calling any
> coprocessor hooks, so we can't use our usual scan parameter mechanism to
> make the Context available during deserialization.
>
> We also have the exact same problem with type singletons as on the client
> side, but the solutions are also the same.
>
> Possible solutions:
>
> 2a. Using a TL (what we have now)
>
> In this case we set the context in the scan thread from a coprocessor hook.
>
> This is simple to implement, as we can use the same code that we use on
> the client side to read the TL to perform lazy evaluation of the context
> during expression evaluation, and sidestep the problem of not having the
> Context available when deserializing.
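To make the lazy-evaluation idea in 2a concrete, a hypothetical sketch (the
class and field names are made up): the node is deserialized before the
coprocessor hook has populated the ThreadLocal, so it resolves its context
on first use rather than at construction.

```java
// Hypothetical sketch of 2a's lazy evaluation: deserialization happens
// before the coprocessor hook sets the TL, so the node defers the lookup.
public class LazyContextNode {
    // Stand-in for the per-scan context TL set by the coprocessor hook
    public static final ThreadLocal<String> SCAN_CONTEXT = new ThreadLocal<>();

    private String context; // left null at (de)serialization time

    private String context() {
        if (context == null) {
            // First evaluation runs after the hook has set the TL
            context = SCAN_CONTEXT.get();
        }
        return context;
    }

    public String evaluate() {
        return "evaluated with context: " + context();
    }
}
```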
>
> 2b. Adding a synthetic root node to the expression tree that sets the
> Context.
>
> This would solve the problem of not having the Context available during
> deserialization. I'm not sure if we can store and access this information
> during deserialization via the HBase API; we may have to use another TL
> just for deserialization purposes.
>
> 2c. Using a placeholder Context object for deserializing the expression
> tree, and updating it from the coprocessor hook.
>
> This is similar to 2b, but we initialize the object later. I haven't
> checked if we can pass the placeholder via the HBase API; we may have to
> use a TL for this purpose again.
>
> Both 2b and 2c assume that we have somehow solved the Type context without
> TLs, e.g. using methods 1b or 1c.
>
> 3.
>
> We can access the Context objects from the parser quite easily; however,
> when we are generating synthetic nodes in the Compiler while optimizing, we
> have quite deep call trees where the Context is not available.
>
> The possible solutions are similar to the other cases.
>
> 3a. Using a TL (what we have now)
>
> Even though we don't have the context directly available when creating
> synthetic nodes, we don't even need it if the created nodes are
> initializing their contexts from the TL during evaluation.
>
> 3b. Pushing the context everywhere as needed.
>
> This is simply tracking the call tree, and adding the context parameter in
> every call until we reach an object where this is available. Ugly, and a
> bit of a perf hit, but certainly possible.
>
> 3c. Taking the context from the existing nodes.
>
> If we already have the context in the original nodes, as in the 1b
> solution, we should be able to copy the context from those.
>
> 4.
>
> Generating synthetic ConnectionlessQueryServicesImpl objects.
>
> On the server side, for handling default values in MetaDataEndpointImpl
> (and possibly other purposes), we create ConnectionlessQueryServicesImpl
> objects, which we use to parse and compile Expressions. We also need a
> Context for that.
>
> This is just another issue that we need to keep track of, whatever
> solution we use for the other issues determines how we solve this.
>
> 4a. We currently set the TL before we call these functions (in addColumn
> and dropColumn)
>
> 4b. If we had a mechanism where the context is pushed to the nodes during
> parsing/compiling, or during evaluation, then we could just set it when
> creating the CLQSI / PhoenixConnection, and avoid the TL mechanism.
>
> Other problems with the current TL solution:
>
> 5.
>
> We need to update the TL on the public JDBC entry points.
>
> Apart from some code bloat, this is not much of a problem for calls like
> execute(), where the cycles used for setting and clearing the TL are
> insignificant, but it can have a relevant perf impact on the
> ResultSet.get... methods, where less processing happens.
>
> 6.
>
> We also need to set the TL for a lot of UTs, which are not using the
> public JDBC entry points.
>
> This is one more thing to keep track of when writing tests.
>
> Currently this is done in a quite random manner; we need to clean this up
> if we keep the current solution.
>
>
> I am really looking for feedback on the issues above.
>
> Should I work on refining the current TL-based solution (and, for example,
> run some benchmarks on the perf impacts), or should I look into
> alternatives?
>
> Of the alternatives, I am leaning towards storing the context in every
> Expression (1b2 and 2c solutions above).
>
> This would trade some memory, but would save on having to pass the context
> on long call trees, and would have the least impact on the rest of the code
> (apart from the existing solution).
>
> Also note that major parts of the current patch are not even related to
> the TZ issue, but are fixes and tests for the currently badly broken
> temporal rounding functions (PHOENIX-6823
> <https://issues.apache.org/jira/browse/PHOENIX-6823> and friends).
>
> Those could be handled in a separate patch, but we need to rewrite that
> code anyway, so I haven't taken the time to split them out and rebuild the
> patch on top of that.
>
>
> On Wed, Nov 9, 2022 at 1:05 PM Istvan Toth <[email protected]> wrote:
>
>> Hi!
>>
>> I now have a preliminary patch for TimeZone sensitive date/time handling
>> in Phoenix.
>> While the patch is not ready for inclusion, it is functionally complete
>> (I think), and can form a basis for discussing the implementation of the
>> new feature.
>>
>> Please look at it, and let me know if you think that this approach is
>> appropriate, or if you have suggestions for changes in the design and / or
>> implementation.
>>
>> See a quick summary from the ticket below:
>>
>> I now have a functionally complete (still needs optimizations and
>> cleanups) WIP patch at https://github.com/apache/phoenix/pull/1504
>>
>> At the heart of the solution is the new ExpressionContext object. This
>> object encapsulates the context necessary for evaluating an expression. It
>> is tied to a PhoenixConnection object, and can be configured by properties.
>> At the moment it stores the TimeZone and the format strings.
>>
>> We have two implementations. One is GMTExpressionContext, which
>> implements the old date handling behaviour and treats dates (mostly, at
>> least on the server side) as GMT timestamps.
>>
>> The new implementation is CompliantExpressionContext, which is
>> initialized by the client TimeZone (can be overridden), and uses it for
>> String parsing/printing, and for implementing the date functions.
>>
>> The current Phoenix expression code, especially the singleton type
>> system, is very resistant to adding this context to it, so my
>> implementation uses a ThreadLocal variable to store and access the
>> ExpressionContext.
>>
>> The big challenge is making sure that we propagate the ExpressionContext
>> through all the components of a statement execution.
>>
>>
>>    - We're creating new threads for processing scans, and we need to
>>      make sure that the ExpressionContext ThreadLocal is propagated.
>>    - We're also manipulating the classloader; we need to make sure that
>>      we copy the Expression TL to the new classloader.
>>    - We need to push the context to the coprocessors. We do this by
>>      encoding the Context into Scan properties, and processing those in
>>      the coprocessors.
>>    - The Expression reconstruction happens BEFORE we rebuild the
>>      ExpressionContext TL in the coprocessor hook, so we need to lazily
>>      evaluate the context when executing the expression.
>>    - There are a few hacky places where we create a
>>      ConnectionlessQueryServices object on the server side (to handle
>>      default expressions). I have added new fields to the relevant RPC
>>      definitions in MetaDataService.proto to push the ExpressionContext
>>      into the operations.
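The encode/decode round trip for the Scan properties can be sketched like
this (hypothetical: a Map stands in for HBase's Scan attributes, and the
attribute key is illustrative, not the one used in the patch):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: pushing the context to coprocessors by encoding it
// into scan attributes on the client and decoding it in the hook.
public class ScanContextDemo {
    static final String TZ_ATTR = "phoenix.context.tz"; // illustrative key

    // Client side: encode the context into the scan before sending it
    static void encode(Map<String, byte[]> scanAttributes, String timeZoneId) {
        scanAttributes.put(TZ_ATTR, timeZoneId.getBytes(StandardCharsets.UTF_8));
    }

    // Coprocessor side: decode it in the hook, e.g. to populate the TL
    static String decode(Map<String, byte[]> scanAttributes) {
        byte[] raw = scanAttributes.get(TZ_ATTR);
        return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, byte[]> attrs = new HashMap<>();
        encode(attrs, "Europe/Budapest");
        System.out.println(decode(attrs)); // Europe/Budapest
    }
}
```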
>>
>>
>> regards
>> Istvan
>>
>
