Hi,

I have set up a Hadoop cluster of 4 physical nodes, each with 2GB of RAM. I am 
currently using the Hive sub-project to run queries on 45GB of data. I have a 
few questions:

1) The performance I am getting with the above setup is quite poor. A simple 
SELECT query (with a WHERE clause) takes approximately 39 minutes. I have set 
mapred.map.tasks=13 and mapred.reduce.tasks=7. Are these settings appropriate 
for the above setup? Are there any other significant configuration parameters 
I should set to get better performance from Hive?
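
To put the figures above in perspective, here is a rough back-of-envelope 
sketch. The data size, node count and query time come from this post; the 
64 MB block size is only an assumption (a common HDFS default) and may not 
match the dfs.block.size actually configured on this cluster. The function 
name is made up for illustration.

```python
import math

# Figures taken from the post.
DATA_GB = 45          # total data queried through Hive
NODES = 4             # physical nodes in the cluster
QUERY_MINUTES = 39    # observed time for the simple SELECT ... WHERE query

# ASSUMPTION: HDFS block size in MB; check dfs.block.size on the real cluster.
BLOCK_MB = 64


def estimate_scan(data_gb, nodes, minutes, block_mb):
    """Return block count and implied scan throughput for a full-table scan."""
    data_mb = data_gb * 1024
    blocks = math.ceil(data_mb / block_mb)       # number of HDFS blocks
    cluster_mb_s = data_mb / (minutes * 60)      # aggregate MB/s over the run
    node_mb_s = cluster_mb_s / nodes             # average MB/s per node
    return blocks, cluster_mb_s, node_mb_s


if __name__ == "__main__":
    blocks, cluster_mb_s, node_mb_s = estimate_scan(
        DATA_GB, NODES, QUERY_MINUTES, BLOCK_MB
    )
    print(f"HDFS blocks (assuming {BLOCK_MB} MB each): {blocks}")
    print(f"Implied cluster throughput: {cluster_mb_s:.2f} MB/s")
    print(f"Implied per-node throughput: {node_mb_s:.2f} MB/s")
```

Under that block-size assumption the data occupies 720 blocks, and a 
39-minute full scan works out to roughly 19.7 MB/s for the whole cluster, or 
about 4.9 MB/s per node. If each block becomes one input split, as is typical 
for file-based input formats, the number of map tasks would be closer to the 
block count than to 13.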

2) How exactly is data on HDFS distributed across the nodes in a cluster? 
Also, when we load tables in Hive (by running the LOAD command on the master 
node), how and where is the data placed on HDFS?

3) How and when does data replication take place in HDFS? I have currently 
set the DFS replication factor to 1. How does this affect performance?
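
As a small illustration of what the replication factor means in storage 
terms, the sketch below computes the raw disk footprint of the 45GB data set 
for a given factor. It assumes blocks are spread evenly across the 4 nodes, 
which a real cluster only approximates; the function name is hypothetical.

```python
def replication_footprint(data_gb, replication, nodes):
    """Raw storage used when every block is stored `replication` times."""
    raw_gb = data_gb * replication
    return {
        "raw_gb": raw_gb,
        # ASSUMPTION: replicas are spread perfectly evenly across nodes.
        "per_node_gb": raw_gb / nodes,
        # A block stays readable as long as at least one replica survives.
        "replica_losses_tolerated": replication - 1,
    }


if __name__ == "__main__":
    for factor in (1, 3):
        print(factor, replication_footprint(45, factor, 4))
```

With a factor of 1 the cluster stores 45GB in total and no block has a spare 
copy, so losing the one node that holds a block makes that block unavailable. 
With a factor of 3 the same data would take 135GB of raw disk.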

4) Does adding a virtual machine to a cluster of physical machines cause 
significant performance degradation?

Please let me know asap.

Thanks,
Ramya



