[
https://issues.apache.org/jira/browse/DRILL-5510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010989#comment-16010989
]
Paul Rogers commented on DRILL-5510:
------------------------------------
More details. The Hive client in the Hive storage plugin is not designed to
recover from connection failures when security is enabled.
* When we start the Hive storage plugin, we create a single instance of the
{{HiveSchemaFactory}}.
* {{HiveSchemaFactory}} holds on to a {{DrillHiveMetaStoreClient}} connection.
In the secure case, this connection is used to get security certificates for use
in creating secure connections.
* {{HiveSchemaFactory}} has a Guava loading cache of user-specific, secure
connections.
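The per-user cache described above can be sketched as follows. This is a hedged illustration, not Drill's actual code: Drill uses a Guava {{LoadingCache}}, approximated here with the JDK's {{ConcurrentHashMap.computeIfAbsent}}, and the class and method names ({{UserConnectionCache}}, {{MetaStoreConn}}, {{get}}) are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a per-user cache of secure metastore connections.
// Stand-in for the Guava LoadingCache held by HiveSchemaFactory.
class UserConnectionCache {
  // Hypothetical stand-in for a secure metastore client connection.
  static class MetaStoreConn {
    final String user;
    MetaStoreConn(String user) { this.user = user; }
  }

  private final Map<String, MetaStoreConn> cache = new ConcurrentHashMap<>();

  // One secure connection per user, created lazily on first access.
  MetaStoreConn get(String user) {
    return cache.computeIfAbsent(user, MetaStoreConn::new);
  }
}
```

The key point for this ticket is that the cached value is the connection object itself, which is what makes wholesale replacement after a metastore restart awkward.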
When the Hive metastore goes down, all connections become invalid, both the
non-secure connection and all the secure connections. We try to handle the
problem as follows.
If a secure connection times out:
* Use the (now-invalid) insecure connection to get another ticket. But, since
that connection is itself no longer valid, we can't reconnect and so always fail.
If we try to use a cached secure connection before timeout, then this happens:
* Try to send a message.
* When that fails, try to reconnect (using the old certificate from the prior
session).
* When that fails, give up.
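The failing retry path above can be sketched in miniature. All names here ({{SecureClient}}, {{StaleCredentialsException}}, {{markRestarted}}) are hypothetical; the point is only that {{reconnect()}} reuses the credentials captured when the session was created, so once the metastore restarts, every retry fails the same way.

```java
// Sketch of the broken retry path: send fails, reconnect with stale
// credentials, give up.
class SecureClient {
  static class StaleCredentialsException extends RuntimeException {}

  private final String credentials;   // fixed at session creation
  private boolean metastoreRestarted; // simulates the server bouncing

  SecureClient(String credentials) { this.credentials = credentials; }

  void markRestarted() { metastoreRestarted = true; }

  void send() {
    if (metastoreRestarted) {
      // Step 1: the send fails, so...
      // Step 2: ...try to reconnect, still holding the old credentials.
      reconnect();
    }
  }

  private void reconnect() {
    // Step 3: the old credentials are no longer honored; give up.
    throw new StaleCredentialsException();
  }
}
```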
What we really need to do is:
* Recreate both the insecure *and* secure connections.
But, since the secure connection cache is held on the insecure connection, we
can't easily recreate that connection: we'd get a new object.
So, we have to make some changes.
* Hold the secure connection cache on an object other than a connection.
* Use a connection proxy instead of the connection as key to the cache. The
proxy allows maintaining the cache entry, but replacing the secure connection
with a new one. (The proxy is just a wrapper around a replaceable secure
connection.)
* Similarly, provide a thread-safe way to reconnect the non-secure connection
used to get tickets for the secure connection.
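The proxy idea above could be sketched like this. This is a hedged illustration of the proposal, not existing Drill code: the names ({{ConnProxy}}, {{SecureConn}}, {{replace}}) are hypothetical, and {{AtomicReference}} stands in for whatever thread-safety mechanism the real fix would use. The cache would hold the proxy, so the underlying secure connection can be swapped on reconnect without invalidating the cache entry.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a connection proxy: a stable cache value wrapping a
// replaceable secure connection.
class ConnProxy {
  // Hypothetical stand-in for a secure connection; generation tracks
  // which session it belongs to.
  static class SecureConn {
    final int generation;
    SecureConn(int generation) { this.generation = generation; }
  }

  private final AtomicReference<SecureConn> delegate;

  ConnProxy(SecureConn initial) {
    this.delegate = new AtomicReference<>(initial);
  }

  SecureConn current() { return delegate.get(); }

  // Thread-safe replacement: if several threads detect the same stale
  // connection and race to reconnect, only one swap succeeds.
  boolean replace(SecureConn stale, SecureConn fresh) {
    return delegate.compareAndSet(stale, fresh);
  }
}
```

The compare-and-set also addresses the last bullet: it gives a thread-safe way to reconnect without two threads both tearing down and rebuilding the same connection.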
All this is not a huge project, but it is more than can be done in the context
of a simple bug fix for DRILL-5496. So, for that ticket, I used a hack: just
throw away the entire schema builder and create a new one. But, that solution
requires synchronizing all requests and is far from ideal. This ticket is a
request to create a better long-term solution.
> Revisit connection failure recovery in Hive storage plugin
> ----------------------------------------------------------
>
> Key: DRILL-5510
> URL: https://issues.apache.org/jira/browse/DRILL-5510
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.11.0
> Reporter: Paul Rogers
>
> DRILL-5496 describes a problem which occurs when the Hive metastore server is
> restarted while Drill runs. The solution in that ticket is a work-around: we
> discard all cached Hive metastore data and rebuild the metadata cache.
> The original code tried to be more subtle: detect that the connection had
> failed, reconnect, but preserve the cache. DRILL-5496 describes the flaws in
> that approach for the secure connection case.
> This ticket asks to spend the time to understand the Hive metadata code and
> restructure it to preserve the cache across connection failures.
> Note a subtle issue: if the Hive metastore goes down, when it comes back up,
> it may contain different data; anything could happen while the server is
> down: upgrade schemas, replace one schema with another, etc. So, the caching
> mechanism, if it is to preserve data across reconnects, must handle such
> changes.
> Of course, such changes could occur even within a single connection, so the
> code should handle such cases already.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)