Hi Ramiya,

I think the biggest problem is that your replication factor is 1. You should
set it to 3. With only one copy of each block, Hadoop has far fewer chances
to run a task on a node that already holds the data, so performance suffers.
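
As a rough back-of-envelope illustration of the locality point (a sketch, not
a measurement), the snippet below computes what share of the nodes could read
a given block from local disk. It assumes replicas are spread uniformly over
distinct nodes and uses the 4-node cluster size from your mail; the function
name is made up for this example.

```python
from fractions import Fraction


def local_copy_fraction(replication: int, nodes: int) -> Fraction:
    """Fraction of nodes that hold a replica of any given block.

    Assumption: replicas of a block sit on distinct nodes chosen uniformly.
    """
    # HDFS never stores two replicas of the same block on one node,
    # so the number of copies is capped by the number of nodes.
    copies = min(replication, nodes)
    return Fraction(copies, nodes)


NODES = 4  # cluster size from the original question

for replication in (1, 3):
    share = local_copy_fraction(replication, NODES)
    print(f"replication={replication}: {share} of nodes can read a block locally")
```

With replication 1 only one node in four can process a block without pulling
it over the network; with replication 3 it is three in four. Real scheduling
is more involved, so treat this only as intuition.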

And BTW, lots of small files will also hurt performance, because each file
needs at least one map task of its own.
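
To give a feel for the small-files effect, here is a rough estimate of how
many map tasks 45GB turns into for different file sizes. It is only a sketch:
it assumes equal-sized files, a 64MB block size (check dfs.block.size on your
cluster), and one input split per block with at least one split per file. The
function and constant names are invented for this example.

```python
import math

BLOCK_SIZE_MB = 64          # assumed HDFS block size; verify on your cluster
TOTAL_DATA_MB = 45 * 1024   # 45GB, from the original question


def estimated_map_tasks(file_size_mb: float,
                        total_mb: float = TOTAL_DATA_MB,
                        block_mb: int = BLOCK_SIZE_MB) -> int:
    """Rough number of input splits when the data is stored as equal-sized files."""
    file_count = math.ceil(total_mb / file_size_mb)
    # A file is never merged with another one: it yields at least one split.
    splits_per_file = max(1, math.ceil(file_size_mb / block_mb))
    return file_count * splits_per_file


for size_mb in (1, 64, 1024):
    print(f"{size_mb}MB files -> about {estimated_map_tasks(size_mb)} map tasks")
```

Under these assumptions, 1MB files give 46080 map tasks, while 64MB or 1024MB
files give 720. Each task has a fixed startup cost, so the first case spends
most of its time on overhead rather than on reading data.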



-----Original Message-----
From: Ramiya V [mailto:ramiy...@persistent.co.in] 
Sent: September 1, 2009 23:47
To: common-user@hadoop.apache.org
Subject: FW: Issues with performance on Hadoop/Hive

Hi,

I just wanted to add that I know 45GB of data is really too little to test
the performance of Hadoop/Hive, as it is meant for data in terabytes.
However, I have to implement a POC and it requires me to test with only 45GB
of data. Please let me know if the performance can be improved.

Thanks,
Ramya 

________________________________________
From: Ramiya V
Sent: Wednesday, September 02, 2009 10:38 AM
To: common-user@hadoop.apache.org
Subject: Issues with performance on Hadoop/Hive

Hi,

I have set up a Hadoop cluster of 4 (physical) nodes, with 2GB RAM on each
machine. I am currently using the Hive sub-project to run queries on 45GB of
data. I have a few questions:

1) The performance I am getting with the above setup is quite bad. It takes
approximately 39 minutes for a simple select query (with a where clause). I
have set mapred.map.tasks=13 and mapred.reduce.tasks=7. Are these settings
good enough for the above setup? Are there any significant configuration
parameters I need to set to get better performance from Hive?

2) Does anybody know exactly how the data on HDFS is distributed across the
nodes in a cluster? Also, when we load tables in Hive (by running the Load
command on the master node), how and where is the data placed on HDFS in the
cluster?

3) How and when does the data replication for HDFS take place in a cluster?
Currently I have set the dfs replication factor=1. How does this affect the
performance?

4) Does adding a Virtual Machine to a physical machine cluster bring about
significant degradation in the performance?

Please let me know asap.

Thanks,
Ramya




