Re: HBase-Hive integration performance issues

Hao Ren Tue, 27 Aug 2013 08:36:53 -0700

Matt,

Thank you for the lightning reply.

I will try out what you suggested over the next few days, so that I can report back on the issue in detail.


Thank you again. Your suggestions show me the way. =)

Hao

On 27/08/2013 16:13, Matt Davies wrote:

Hao,

A couple thoughts here.

This could be related to many things.
1. Did you pre-split your regions? If not, you could be hot-spotting on a
single server and then waiting for the region to split. If that is the
case, you could actually only be using a single server for much of your
load (if not all, depending on the region size you have configured). While
running, did you see one system take the full load (via top, ganglia, or
some other tool)?

2.  The memory on each of these systems is quite low - 1.7 or 3.7 GB
depending on whether it is compute or memory - either way, it is way low, and I'd
expect you to be doing a lot of swapping.  You'll need 1 GB for each
daemon, which leaves you very little room for the OS (at 3.7 GB).  Do you
see swapping?  What are your JVM parameters?

3. Do these same 4 servers run your Hadoop infrastructure and the hive
query? If so, the system is woefully underpowered if you expect to see
production-like speed.  Running a Hive query on top of an HBase cluster
with so few resources will just not work out well in the end ;)
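
To illustrate the pre-splitting idea in point 1, here is a minimal sketch that computes evenly spaced split points for a table. It assumes row keys are uniformly distributed 8-digit hexadecimal strings, which the thread does not confirm; the function name `even_split_keys` and the region count are hypothetical. The resulting keys would be supplied as split points when the HBase table is created, so that writes spread across region servers from the start.

```python
def even_split_keys(num_regions, key_space=2**32, width=8):
    """Return num_regions - 1 evenly spaced split points as zero-padded hex strings.

    Assumption: row keys are uniformly distributed hex strings of `width` digits
    covering the range [0, key_space).
    """
    if num_regions < 2:
        return []  # a single region needs no split points
    step = key_space // num_regions
    return [format(i * step, "0{}x".format(width)) for i in range(1, num_regions)]


# Hypothetical example: 3 region servers with 4 regions each.
split_keys = even_split_keys(12)
print(len(split_keys), split_keys[:3])
```

If the real row keys are not uniformly distributed (for example, sequential IDs or timestamps), evenly spaced split points will not balance the load, and the keys would need to be salted or the split points derived from a sample of the data.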


-Matt


On Tue, Aug 27, 2013 at 7:51 AM, Hao Ren <h....@claravista.fr> wrote:

Hi,

I am running Hive and HBase on Amazon EC2. By following the tutorial
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration,
I managed to create an HBase table from Hive and insert data into it.

It works, but performance is low. To be specific, inserting 1.3 GB (50
M rows, 3 columns) takes 30 minutes. That is far from what I expected, say 100 s.
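
As a rough sanity check on those figures, the observed and expected write rates can be worked out as below. This is only a back-of-the-envelope sketch: it treats 1.3 GB as 1.3 * 1024 MB and assumes the load is spread evenly over the 30 minutes; the helper name `rate` is hypothetical.

```python
ROWS = 50_000_000          # 50 M rows, from the figures above
SIZE_MB = 1.3 * 1024       # 1.3 GB expressed in MB (assumes 1 GB = 1024 MB)
OBSERVED_S = 30 * 60       # observed load time: 30 minutes
TARGET_S = 100             # hoped-for load time: 100 seconds


def rate(amount, seconds):
    """Average throughput: amount per second."""
    return amount / seconds


observed_rows_per_s = rate(ROWS, OBSERVED_S)
target_rows_per_s = rate(ROWS, TARGET_S)
speedup_needed = OBSERVED_S / TARGET_S

print("observed: %.0f rows/s, %.2f MB/s" % (observed_rows_per_s, rate(SIZE_MB, OBSERVED_S)))
print("target:   %.0f rows/s, %.2f MB/s" % (target_rows_per_s, rate(SIZE_MB, TARGET_S)))
print("speedup needed: %.0fx" % speedup_needed)
```

In other words, the cluster currently sustains under 1 MB/s of inserts, and the 100 s target would require roughly an 18-fold improvement.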

Actually, my EC2 cluster contains 3 slaves and 1 master, whose instance
type is medium (http://aws.amazon.com/ec2/instance-types/#instance-type).

Hadoop 1.0.4 is installed on my cluster. HBase is in pseudo-distributed
mode. A region server is running on the master. HDFS is used as storage.

Here are some configuration files:

*// hive-site.xml*

<configuration>

     <property>
         <name>hbase.zookeeper.quorum</name>
         <value>ip-10-178-13-39.ec2.internal</value>
     </property>

     <property>
         <name>hive.aux.jars.path</name>
         <value>/root/hive/build/dist/lib/hive-hbase-handler-0.9.0-amplab-4.jar,/root/hive/build/dist/lib/hbase-0.92.0.jar,/root/hive/build/dist/lib/zookeeper-3.4.3.jar,/root/hive/build/dist/lib/guava-r09.jar</value>
     </property>

     <property>
         <name>hbase.client.scanner.caching</name>
         <value>10000</value>
     </property>

</configuration>

*// hbase-site.xml*

<configuration>

     <property>
         <name>hbase.rootdir</name>
         <value>hdfs://ec2-54-226-206-28.compute-1.amazonaws.com:9010/hbase</value>
     </property>

     <property>
         <name>hbase.cluster.distributed</name>
         <value>true</value>
     </property>

     <property>
         <name>hbase.zookeeper.quorum</name>
         <value>ip-10-178-13-39.ec2.internal</value>
     </property>

     <property>
         <name>hbase.client.scanner.caching</name>
         <value>10000</value>
     </property>

</configuration>

*To understand this better, I have some questions:*
1) In order to improve read performance, I have set
hbase.client.scanner.caching to 10000, but I don't know how to improve
write performance. Is there some basic configuration to apply?
2) Does the distributed mode matter? Does fully-distributed mode have
better write performance than pseudo-distributed mode?
3) If the number of region servers is increased, will write performance
improve?
4) In pseudo-distributed mode (one HBase daemon on the master), when writing
data from Hive to an HBase table, is the master the only entry point to HBase? I
don't think passing all data through the master is efficient. I wonder
whether it is possible to write data in parallel from Hive to HBase directly
using MapReduce.
5) Will HBase bulk loading help a lot?

I am new to HBase, but I really want to use HBase in production.

Any help is highly appreciated ! =)

Hao

--
Hao Ren
ClaraVista
www.claravista.fr



