[
https://issues.apache.org/jira/browse/HIVE-21265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Istvan Fajth updated HIVE-21265:
--------------------------------
Issue Type: Improvement (was: Bug)
> Hive miss-uses HBase HConnection object and that puts high load on Zookeeper
> ----------------------------------------------------------------------------
>
> Key: HIVE-21265
> URL: https://issues.apache.org/jira/browse/HIVE-21265
> Project: Hive
> Issue Type: Improvement
> Components: HBase Handler
> Reporter: Istvan Fajth
> Priority: Major
>
> When there is a table in Hive backed by an HBase table, then the following
> access pattern is shown multiple times in Zookeeper even for a simple query
> like "SELECT * FROM table":
> - A client is connecting to Zookeeper
> - Checks whether the /hbase ZNode exists
> - Reads /hbase/hbaseid
> - Client closes the connection.
> The amount of these accesses are depending on the amount of data most likely
> it is correlating to the number of HBase regions.
> The same access pattern one can see in ZK when one runs the following Java
> code:
> {code}import org.apache.hadoop.hbase.client.*;
> public class Test {
> public static void main(String args[]) throws Exception {
> Connection c = ConnectionFactory.createConnection();
> c.close();
> }
> }{code}
> The problem with this is that for large tables this creates an enormous
> amount of session creation which is expensive in ZK, and if the amount of
> queries to this table is high, then the ZK transaction log is heavily
> written, and there are way more snapshots created then otherwise due to the
> amount of createSession closeSession transaction in Zookeeper. In this
> particular case the Zookeeper data directory was filled with about 24GB of
> data and caused the device to almost fill under the Zookeeper data directory.
> ~90% of the data written was createSession and closeSession transactions.
> I am not sure what logs I should provide, but reproducing the behaviour is
> easy enough. In Zookeeper if one enables DEBUG level logging, the logs are
> showing what is being read by sessions. These sessions live for 1-5ms tops.
> I imagine that the solution is to somehow share the connection object between
> the mappers if possible, and use one connection according to the suggestion
> in the API documentation of ConnectionFactory and request table/admin/any
> object from the one connection, or at least use only one connection object
> per map/reduce, and make it a longer living connection that is there for the
> whole map/reduce lifetime.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)