[ 
https://issues.apache.org/jira/browse/DRILL-5510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16010989#comment-16010989
 ] 

Paul Rogers commented on DRILL-5510:
------------------------------------

More details. The Hive client in the Hive storage plugin is not designed to 
handle security.

* When we start the Hive storage plugin, we create a single instance of the 
{{HiveSchemaFactory}}.
* {{HiveSchemaFactory}} holds on to a {{DrillHiveMetaStoreClient}} connection. 
In the secure case, this connection is used to obtain the tickets needed to 
create per-user secure connections.
* {{HiveSchemaFactory}} has a Guava loading cache of user-specific, secure 
connections.
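The structure above can be sketched roughly as follows. This is illustrative 
only (plain {{ConcurrentHashMap}} standing in for the Guava loading cache; 
class names are simplified, not Drill's actual API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for DrillHiveMetaStoreClient.
class MetaStoreClient {
    final String owner;          // "" for the shared, non-secure client
    boolean valid = true;        // becomes false when the metastore goes down
    MetaStoreClient(String owner) { this.owner = owner; }
}

// Simplified stand-in for HiveSchemaFactory.
class SchemaFactorySketch {
    // Single shared (non-secure) client, created once at plugin startup.
    final MetaStoreClient sharedClient = new MetaStoreClient("");

    // Per-user secure clients, loaded on demand
    // (stand-in for the Guava loading cache).
    final Map<String, MetaStoreClient> secureClients = new ConcurrentHashMap<>();

    MetaStoreClient clientFor(String user) {
        return secureClients.computeIfAbsent(user, MetaStoreClient::new);
    }
}
```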

When the Hive metastore goes down, all connections become invalid: the 
non-secure connection and all the secure connections. Currently, we try to 
handle the problem as follows.

If a secure connection times out:

* Use the (now-invalid) non-secure connection to get another ticket. But since 
that connection is itself invalid, we can't reconnect, so the attempt always 
fails.

If we try to use a cached secure connection before timeout, then this happens:

* Try to send a message.
* When that fails, try to reconnect (using the stale ticket from the prior 
session).
* When that fails, give up.

What we really need to do is:

* Recreate both the insecure *and* secure connections.

But, since the secure connection cache is held on the non-secure connection 
object, we can't easily recreate that connection: recreating it would give us 
a new object and strand the cache.
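The "new object" problem can be seen in a toy example. This is hypothetical 
code, not Drill's, but it shows why a cache keyed by the connection instance 
itself is stranded once the connection is recreated:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a cache keyed by the connection object itself. After a
// reconnect, the fresh connection is a different key, so the old entry can
// no longer be reached through it.
class KeyedByConnection {
    static final Map<Object, String> cache = new HashMap<>();

    static boolean reachableAfterReconnect() {
        Object oldConn = new Object();
        cache.put(oldConn, "per-user secure connections");
        Object newConn = new Object();       // reconnect creates a new object
        return cache.get(newConn) != null;   // false: entry is stranded
    }
}
```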

So, we have to make some changes.

* Hold the secure connection cache on an object other than a connection.
* Use a connection proxy, rather than the connection itself, as the key to the 
cache. The proxy keeps the cache entry stable while allowing the secure 
connection to be replaced with a new one. (The proxy is just a wrapper around 
a replaceable secure connection.)
* Similarly, provide a thread-safe way to reconnect the non-secure connection 
used to get tickets for the secure connection.
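The proxy idea above might look something like this. Again, names are 
illustrative, not Drill's actual API; the point is that the cache holds a 
stable proxy while the underlying connection can be swapped atomically on 
reconnect:

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified stand-in for a secure metastore connection.
class Connection {
    final String ticket;
    Connection(String ticket) { this.ticket = ticket; }
}

// Stable cache key/handle wrapping a replaceable connection.
class ConnectionProxy {
    private final AtomicReference<Connection> delegate;

    ConnectionProxy(Connection initial) {
        this.delegate = new AtomicReference<>(initial);
    }

    Connection get() { return delegate.get(); }

    // Thread-safe replacement: the proxy (and any cache entry keyed on it)
    // survives, but callers now see the fresh connection.
    void reconnect(Connection fresh) { delegate.set(fresh); }
}
```

The same {{AtomicReference}} pattern would give a thread-safe way to replace 
the non-secure connection as well.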

All this is not a huge project, but it is more than can be done in the context 
of a simple bug fix for DRILL-5496. So, for that ticket, I used a hack: just 
throw away the entire schema builder and create a new one. But that solution 
requires synchronizing all requests and is far from ideal. This ticket is a 
request to create a better long-term solution.

> Revisit connection failure recovery in Hive storage plugin
> ----------------------------------------------------------
>
>                 Key: DRILL-5510
>                 URL: https://issues.apache.org/jira/browse/DRILL-5510
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.11.0
>            Reporter: Paul Rogers
>
> DRILL-5496 describes a problem which occurs when the Hive metastore server is 
> restarted while Drill runs. The solution in that ticket is a work-around: we 
> discard all cached Hive metastore data and rebuild the metadata cache.
> The original code tried to be more subtle: detect that the connection has 
> failed, reconnect, but preserve the cache. DRILL-5496 describes the flaws in 
> that approach for the secure connection case.
> This ticket asks to spend the time to understand the Hive metadata code and 
> restructure it to preserve the cache across connection failures.
> Note a subtle issue: if the Hive metastore goes down, when it comes back up, 
> it may contain different data; anything could happen while the server is 
> down: upgrade schemas, replace one schema with another, etc. So, the caching 
> mechanism, if it is to preserve data across reconnects, must handle such 
> changes.
> Of course, such changes could occur even within a single connection, so the 
> code should handle such cases already.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
