[ 
https://issues.apache.org/jira/browse/HIVE-21265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Istvan Fajth updated HIVE-21265:
--------------------------------
    Issue Type: Improvement  (was: Bug)

> Hive misuses HBase HConnection object and that puts high load on Zookeeper
> --------------------------------------------------------------------------
>
>                 Key: HIVE-21265
>                 URL: https://issues.apache.org/jira/browse/HIVE-21265
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>            Reporter: Istvan Fajth
>            Priority: Major
>
> When a Hive table is backed by an HBase table, the following access pattern 
> appears multiple times in ZooKeeper, even for a simple query like 
> "SELECT * FROM table":
> - A client connects to ZooKeeper
> - It checks whether the /hbase znode exists
> - It reads /hbase/hbaseid
> - The client closes the connection
> The number of these accesses depends on the amount of data; most likely it 
> correlates with the number of HBase regions.
> The same access pattern can be seen in ZK when running the following Java 
> code:
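> {code}import org.apache.hadoop.hbase.client.*;
> public class Test {
>       public static void main(String[] args) throws Exception {
>               // Opening the connection produces the ZooKeeper access pattern above.
>               Connection c = ConnectionFactory.createConnection();
>               // Closing it right away ends the short-lived session.
>               c.close();
>       }
> }{code}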
> The problem is that for large tables this creates an enormous number of 
> sessions, which is expensive in ZK. If the number of queries against the 
> table is high, the ZK transaction log is written heavily and far more 
> snapshots are created than otherwise, because of the number of createSession 
> and closeSession transactions in ZooKeeper. In this particular case about 
> 24GB of data was written to the ZooKeeper data directory, which almost 
> filled the device holding it. ~90% of the data written was createSession and 
> closeSession transactions.
> I am not sure which logs I should provide, but reproducing the behaviour is 
> easy enough. With DEBUG level logging enabled in ZooKeeper, the logs show 
> what each session reads. These sessions live for 1-5ms at most.
> I imagine the solution is to share the connection object between the 
> mappers if possible: use one connection, as the API documentation of 
> ConnectionFactory suggests, and request table/admin/any other object from 
> that one connection. Failing that, use at least only one connection object 
> per map/reduce task and keep it alive for the whole lifetime of that task.
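
The sharing idea proposed above could look roughly like the sketch below. It is not Hive's actual code: `SharedConnection`, `FakeConnection` and `SharedConnectionDemo` are hypothetical names, and `FakeConnection` only stands in for an HBase `Connection` so that the example runs without a cluster. In a real task the factory would presumably wrap `ConnectionFactory.createConnection(conf)`.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical holder that opens one connection per task and reuses it. */
final class SharedConnection<C extends AutoCloseable> implements AutoCloseable {
    // Assumption: in Hive this would wrap ConnectionFactory.createConnection(conf),
    // with the checked IOException rethrown as an unchecked exception.
    private final Supplier<C> factory;
    private C connection; // created lazily, at most once

    SharedConnection(Supplier<C> factory) {
        this.factory = factory;
    }

    /** Returns the single shared connection, opening it on first use. */
    synchronized C get() {
        if (connection == null) {
            connection = factory.get();
        }
        return connection;
    }

    /** Closes the shared connection once, at the end of the task. */
    @Override
    public synchronized void close() throws Exception {
        if (connection != null) {
            connection.close();
            connection = null;
        }
    }
}

/** Stand-in for an expensive connection; counts sessions opened and closed. */
final class FakeConnection implements AutoCloseable {
    static final AtomicInteger OPENED = new AtomicInteger();
    static final AtomicInteger CLOSED = new AtomicInteger();

    FakeConnection() {
        OPENED.incrementAndGet();
    }

    @Override
    public void close() {
        CLOSED.incrementAndGet();
    }
}

class SharedConnectionDemo {
    /** Simulates reading the given number of regions through one shared connection. */
    static int sessionsOpenedFor(int regions) throws Exception {
        FakeConnection.OPENED.set(0);
        FakeConnection.CLOSED.set(0);
        try (SharedConnection<FakeConnection> shared =
                 new SharedConnection<>(FakeConnection::new)) {
            for (int i = 0; i < regions; i++) {
                FakeConnection c = shared.get(); // same instance on every iteration
                // ... request a table from c and scan region i here ...
            }
        }
        return FakeConnection.OPENED.get();
    }

    public static void main(String[] args) throws Exception {
        int opened = sessionsOpenedFor(100);
        System.out.println("sessions opened: " + opened
            + ", closed: " + FakeConnection.CLOSED.get());
    }
}
```

Running `main` prints `sessions opened: 1, closed: 1`: a hundred simulated region reads share a single session instead of opening one each, which is the effect the proposal aims for on the ZooKeeper side.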



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
