RE: Issues with performance on Hadoop/Hive

Ramiya V Thu, 03 Sep 2009 21:53:38 -0700

Hi,

Thanks Amandeep and Ashish!

@Ashish: I have set the hive.metastore.warehouse.dir parameter to 
/home/hive/warehouse. This warehouse directory exists on the local filesystem. 
Will the tables now be stored on the local filesystem or on HDFS? In other 
words, do we need to say so explicitly when we want the path to refer to a 
location on the local filesystem?

@Amandeep: Actually, I need to know how the data gets distributed across the 
cluster. The master machine has 70GB of free space and the 3 slaves have 140GB, 
50GB and 140GB of free space, all on Ubuntu. When I load a table of 45GB into 
Hive, how will the data be distributed across the 4 nodes? Since the master has 
70GB of free space, will it store all the data on the master itself? (This is 
what I observed: when I loaded the table, the 45GB of data was stored on the 
master and some blocks were replicated on the other 3 slaves.) I have set 
dfs.replication to 2. I would like to know exactly how Hadoop decides where to 
place data so that the free space on the cluster is used effectively.
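
The observation above is consistent with HDFS's default placement rule, under 
which the first replica of each block is written to the datanode the client 
runs on (when the client is itself a datanode) and further replicas go to other 
nodes. The Python below is only a simplified sketch of that rule, not Hadoop 
code: the host names and place_blocks() are made up for illustration, rack 
awareness is ignored, and the extra replica is picked at random without 
looking at free space.

```python
import random
from collections import Counter

BLOCK_MB = 64
FILE_MB = 45 * 1024
REPLICATION = 2
NODES = ["master", "slave1", "slave2", "slave3"]  # hypothetical host names


def place_blocks(writer, nodes, n_blocks, replication, rng):
    """Count block replicas per node under a simplified placement rule."""
    counts = Counter({n: 0 for n in nodes})
    others = [n for n in nodes if n != writer]
    for _ in range(n_blocks):
        counts[writer] += 1  # first replica: the writer's own datanode
        # remaining replicas: distinct nodes other than the writer (assumption)
        for target in rng.sample(others, replication - 1):
            counts[target] += 1
    return counts


n_blocks = -(-FILE_MB // BLOCK_MB)  # ceiling division
counts = place_blocks("master", NODES, n_blocks, REPLICATION, random.Random(0))
for node in NODES:
    gb = counts[node] * BLOCK_MB / 1024
    print(f"{node}: {counts[node]} block replicas, ~{gb:.1f} GB")
```

In this sketch the master ends up holding one replica of every block (the full 
45GB) while the second replicas are spread over the slaves. Running the load 
from a machine that is not a datanode would, under the same assumptions, 
spread the first replicas as well.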

 -Ramya
________________________________________
From: Ashish Thusoo [athu...@facebook.com]
Sent: Wednesday, September 02, 2009 11:02 PM
To: common-user@hadoop.apache.org
Subject: RE: Issues with performance on Hadoop/Hive

Hi Ramya,

If you are using the hive-default.xml and have not overwritten the 
hive.metastore.warehouse.dir parameter then by default the tables will get 
placed in

/user/hive/warehouse in hdfs.
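
As a rough illustration of why a bare path ends up in HDFS: a warehouse path 
without a scheme is, as far as the Hadoop path rules go, resolved against the 
cluster's default filesystem (fs.default.name), while a path with an explicit 
file:// scheme stays local. The Python below is only a sketch of that rule. 
qualify() and the hdfs://master:9000 address are assumptions made up for the 
example; they are not Hive code or settings taken from this cluster.

```python
from urllib.parse import urlparse


def qualify(path: str, default_fs: str) -> str:
    """Return a fully qualified URI for `path`.

    A path that already names a scheme (hdfs://, file://) is returned as-is;
    a bare path is resolved against the default filesystem.
    """
    if urlparse(path).scheme:
        return path
    return default_fs.rstrip("/") + "/" + path.lstrip("/")


DEFAULT_FS = "hdfs://master:9000"  # hypothetical fs.default.name value

print(qualify("/user/hive/warehouse", DEFAULT_FS))
print(qualify("/home/hive/warehouse", DEFAULT_FS))
print(qualify("file:///home/hive/warehouse", DEFAULT_FS))
```

Under these assumptions /home/hive/warehouse would name a directory in HDFS, 
not the local directory of the same name.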

Ashish

-----Original Message-----
From: Amandeep Khurana [mailto:ama...@gmail.com]
Sent: Wednesday, September 02, 2009 12:52 AM
To: common-user@hadoop.apache.org
Subject: Re: Issues with performance on Hadoop/Hive

Answers inline

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Tue, Sep 1, 2009 at 10:08 PM, Ramiya V <ramiy...@persistent.co.in> wrote:

> Hi,
>
> I have set up a Hadoop cluster of 4 physical nodes. Configuration: 2GB
> RAM on each machine. I am currently using the sub-project Hive to run
> queries on 45GB of data. I have a few questions that need to be
> resolved:
>
> 1) The performance that I am getting with the above setup is quite
> bad. It takes approx. 39 minutes for a simple select query (with a where
> clause). I have set mapred.map.tasks=13 and mapred.reduce.tasks=7.
> Is this setting good enough for the above setup? Are there any
> significant configuration parameters I need to set to get better
> performance on Hive?
>
>
Check the resource utilization. I think you shouldn't be running more than
3 mappers + 1 reducer on each node at any time (given the hardware you are 
using). But then that mostly depends on the amount of work being done in the 
mappers and reducers.
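
To put the slot limit in perspective, here is a back-of-the-envelope estimate 
in Python. It is only a sketch under stated assumptions: one map task per 64MB 
of input (the actual number of map tasks is decided by the input format, and 
mapred.map.tasks is only a hint), and all 4 nodes, master included, running 
tasks with 3 map slots each.

```python
import math

FILE_MB = 45 * 1024
SPLIT_MB = 64            # assumes one input split per 64MB block
NODES = 4                # assumes every node, master included, runs tasks
MAP_SLOTS_PER_NODE = 3   # the per-node limit suggested above

map_tasks = math.ceil(FILE_MB / SPLIT_MB)
map_slots = NODES * MAP_SLOTS_PER_NODE
waves = math.ceil(map_tasks / map_slots)

print(f"map tasks: {map_tasks}")
print(f"concurrent map slots: {map_slots}")
print(f"waves of map tasks: {waves}")
```

With these numbers a full scan of the 45GB needs many successive waves of 
mappers, so per-task overhead and disk throughput matter more than the 
mapred.map.tasks setting.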

> 2) Does anybody know how exactly the data on HDFS is distributed across
> nodes in a cluster? Also when we load the tables in Hive (by firing
> Load command on master node), how and where is the data placed on HDFS
> in a cluster?
>

Files are divided into blocks and the blocks are stored on the datanodes.
Each block is 64MB by default. I'm not sure how the blocks are distributed 
among the datanodes.
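
For a sense of scale, the block arithmetic for the 45GB data set can be worked 
out as below. This is a small Python sketch, not Hadoop code: the host names 
and footprint() are made up for illustration, and the free-space figures are 
the ones reported in this thread.

```python
import math

BLOCK_MB = 64
FILE_GB = 45
# Free space per node as reported in this thread; host names are placeholders.
FREE_GB = {"master": 70, "slave1": 140, "slave2": 50, "slave3": 140}

blocks = math.ceil(FILE_GB * 1024 / BLOCK_MB)
total_free = sum(FREE_GB.values())


def footprint(replication):
    """Return (block replicas, raw GB stored) for a replication factor."""
    return blocks * replication, FILE_GB * replication


for replication in (1, 2):
    replicas, raw_gb = footprint(replication)
    print(f"replication={replication}: {replicas} block replicas, "
          f"{raw_gb} GB raw out of {total_free} GB free")
```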

> 3) How and when does the data replication for HDFS take place in a cluster?
> Currently I have set the dfs replication factor=1. How does this
> affect the performance?
>

Once you put the data into HDFS, it starts replicating the blocks.
However, the put is successful as soon as one block gets created.

>
> 4) Does adding a Virtual Machine to a physical machine cluster bring
> about significant degradation in the performance?
>

I don't have numbers for this, but it does impact the performance. Moreover, 
your hardware resources are low and there is really no added value in using 
virtual machines on top of them.

> Please let me know asap.
>
> Thanks,
> Ramya
>
>
>
>
> DISCLAIMER
> ==========
> This e-mail may contain privileged and confidential information which
> is the property of Persistent Systems Ltd. It is intended only for the
> use of the individual or entity to which it is addressed. If you are
> not the intended recipient, you are not authorized to read, retain,
> copy, print, distribute or use this message. If you have received this
> communication in error, please notify the sender and delete all copies of 
> this message.
> Persistent Systems Ltd. does not accept any liability for virus
> infected mails.
>

