Hi,

I am trying to tune the performance of my clusters, which are hosted on 
Amazon EMR. There are 2 separate clusters: 1 for HBase and 1 for Hive. The 
Hive cluster holds no persistent data; it provides only processing power.
The test setup from which I generated the metrics has 4 large nodes for 
HBase (1 master + 3 core) and 4 large nodes for Hive (1 master + 3 core).
The process being monitored does this:

1. New data files are received by the Hive cluster.

2. Hive inserts the new data into HBase.

3. The Hive cluster then executes a batch of HiveQL queries against the 
HBase table to generate analytics.

Size of data: the HBase table has 10 million rows of about 1K each.

I have attached Ganglia snapshots for this process from both the Hive and 
HBase clusters. What is puzzling is:

1. On the Cluster Network graph for HBase, the In and Out lines follow each 
other closely. This is strange: after the initial insert, Hive only selects 
data from the HBase table, so I would expect a lot of Out but nothing on In.

2. The Cluster Network graph for HBase shows 80MB/s Out at the peaks, but the 
corresponding peaks on Hive's Cluster Network graph show only 10MB/s In. Why 
is there such a significant difference between the data sent out by HBase 
and the data received by Hive? Shouldn't they match?
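
For scale, here is a rough back-of-the-envelope sketch of what the two peak 
rates would mean for one full pass over the table. It is only an estimate: 
it assumes "about 1K" means 1024 bytes per row, treats Ganglia's MB/s as 
10^6 bytes per second, and ignores HBase storage overhead, compression and 
replication. The function names are made up for illustration.

```python
# Back-of-the-envelope estimate using the approximate figures quoted above.
ROWS = 10_000_000   # rows in the HBase table
ROW_BYTES = 1024    # assumption: "about 1K" taken as 1024 bytes per row
MB = 1_000_000      # assumption: Ganglia's MB/s treated as 10^6 bytes/s


def table_bytes(rows=ROWS, row_bytes=ROW_BYTES):
    """Approximate raw size of the table in bytes (no overhead counted)."""
    return rows * row_bytes


def scan_seconds(rate_mb_per_s, total_bytes=None):
    """Seconds needed to move the whole table once at a sustained rate."""
    if total_bytes is None:
        total_bytes = table_bytes()
    return total_bytes / (rate_mb_per_s * MB)


if __name__ == "__main__":
    print(f"approximate table size: {table_bytes() / 1e9:.2f} GB")
    # 80 MB/s is the HBase Out peak, 10 MB/s is the Hive In peak.
    for rate in (80, 10):
        print(f"one full pass at {rate} MB/s: {scan_seconds(rate):.0f} s")
```

Under these assumptions the table is roughly 10 GB, so a single full scan 
would take about two minutes at the HBase rate and about seventeen minutes 
at the Hive rate, which is why the mismatch stands out to me.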

Any help or pointers are highly appreciated.

I have also uploaded the metrics graphs here:
http://s15.postimg.org/qrrma4asb/hbase.png
http://s22.postimg.org/x443guzsh/hive.png

Thanks
Rupinder