Hi Ramya,

Yes, you have to give the HDFS path explicitly, so

hdfs://<namenode>:<port>/home/hive/warehouse

should work if you want to keep the same path in HDFS.

Ashish

-----Original Message-----
From: Brian Bockelman [mailto:bbock...@cse.unl.edu]
Sent: Friday, September 04, 2009 5:43 AM
To: common-user@hadoop.apache.org
Subject: Re: Issues with performance on Hadoop/Hive

On Sep 3, 2009, at 11:53 PM, Ramiya V wrote:

> Hi,
>
> Thanks Amandeep and Ashish!
>
> @Ashish: I have set the hive.metastore.warehouse.dir parameter to
> /home/hive/warehouse. This warehouse directory is on the local
> filesystem. So will the tables now get stored on the local filesystem
> or on HDFS? I mean, do we need to specify explicitly if we want the
> path to refer to a location on the local filesystem?
>
> @Amandeep: Actually I need to know how the data gets distributed
> across the cluster. The master machine has 70GB free space and the 3
> slaves have 140GB, 50GB and 140GB free space on Ubuntu. So when I load
> a table on Hive with 45GB of data, how will it get distributed across
> the 4 nodes? I mean, since the master has 70GB free space, will it
> store all the data on the master itself? (This is what I observed:
> when I loaded the table, 45GB of data was stored on the master and
> some blocks were replicated on the other 3 slaves.) I have set the
> dfs.replication factor to 2. I wanted to know how exactly Hadoop uses
> its intelligence to utilize the free space effectively to store data
> on HDFS on a cluster.

Hey Ramya,

I believe this is thoroughly described in the system architecture document.
Let us know if there's something you feel is missing:

http://hadoop.apache.org/common/docs/current/hdfs_design.html

Brian

>
> -Ramya
> ________________________________________
> From: Ashish Thusoo [athu...@facebook.com]
> Sent: Wednesday, September 02, 2009 11:02 PM
> To: common-user@hadoop.apache.org
> Subject: RE: Issues with performance on Hadoop/Hive
>
> Hi Ramya,
>
> If you are using the hive-default.xml and have not overridden the
> hive.metastore.warehouse.dir parameter, then by default the tables
> will get placed in
>
> /user/hive/warehouse in hdfs.
>
> Ashish
>
> -----Original Message-----
> From: Amandeep Khurana [mailto:ama...@gmail.com]
> Sent: Wednesday, September 02, 2009 12:52 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Issues with performance on Hadoop/Hive
>
> Answers inline
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Tue, Sep 1, 2009 at 10:08 PM, Ramiya V <ramiy...@persistent.co.in>
> wrote:
>
>> Hi,
>>
>> I have set up a Hadoop cluster of 4 (physical) nodes. Configuration:
>> 2GB RAM on each machine. Currently I am using the sub-project Hive
>> for firing queries on 45GB of data. I have certain queries that need
>> to be resolved:
>>
>> 1) The performance that I am getting with the above setup is quite
>> bad. It takes approx. 39 minutes for a simple select query (with a
>> where clause). I have set mapred.map.tasks=13 and
>> mapred.reduce.tasks=7.
>> Is this setting good enough for the above setup? Are there any
>> significant configuration parameters I need to set for getting
>> better performance on Hive?
>>
>
> Check on the resource utilization. I think you shouldn't be running
> more than 3 mappers + 1 reducer on each node at any time (given the
> hardware you are using). But then that mostly depends on the amount
> of work being done in the mappers and reducers.
>
>> 2) Does anybody know how exactly the data on HDFS is distributed
>> across nodes in a cluster?
>> Also, when we load the tables in Hive (by firing the Load command on
>> the master node), how and where is the data placed on HDFS in a
>> cluster?
>>
>
> Files are divided into blocks and the blocks are stored on the
> Datanodes. Each block is 64MB by default. I'm not sure how the blocks
> are distributed among the datanodes.
>
>> 3) How and when does the data replication for HDFS take place in a
>> cluster? Currently I have set the dfs replication factor=1. How does
>> this affect the performance?
>>
>
> Once you put the data into HDFS, it starts replicating the blocks.
> However, the put is successful as soon as one block gets created...
>
>> 4) Does adding a virtual machine to a physical machine cluster bring
>> about significant degradation in the performance?
>>
>
> Don't have numbers for this, but it does impact the performance.
> Moreover, your hardware resources are low and there is really no
> value add in using virtual machines on top of them.
>
>> Please let me know asap.
>>
>> Thanks,
>> Ramya
>>
>> DISCLAIMER
>> ==========
>> This e-mail may contain privileged and confidential information which
>> is the property of Persistent Systems Ltd. It is intended only for
>> the use of the individual or entity to which it is addressed. If you
>> are not the intended recipient, you are not authorized to read,
>> retain, copy, print, distribute or use this message. If you have
>> received this communication in error, please notify the sender and
>> delete all copies of this message.
>> Persistent Systems Ltd. does not accept any liability for virus
>> infected mails.
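
For a rough feel of the numbers mentioned in the thread (64MB default block size, 45GB of data, a replication factor of 2, and 70/140/50/140 GB of free space), here is a small back-of-the-envelope sketch. It is only arithmetic, not how HDFS itself decides placement; the function name hdfs_footprint is made up for illustration, and it assumes the 45GB is stored as one large file (many smaller files would produce somewhat more blocks).

```python
import math

BLOCK_SIZE_MB = 64  # default block size quoted in the thread


def hdfs_footprint(data_gb, replication, block_size_mb=BLOCK_SIZE_MB):
    """Estimate blocks, block replicas and raw GB used on the cluster.

    Hypothetical helper: assumes the data is a single file, so the
    block count is simply ceil(size / block size).
    """
    data_mb = data_gb * 1024
    blocks = math.ceil(data_mb / block_size_mb)
    block_replicas = blocks * replication  # copies stored across datanodes
    raw_gb = data_gb * replication         # disk space consumed in total
    return blocks, block_replicas, raw_gb


# Figures taken from the thread: 45GB table, dfs.replication = 2
blocks, block_replicas, raw_gb = hdfs_footprint(45, 2)

# Free space reported for the master and the 3 slaves, in GB
free_gb = 70 + 140 + 50 + 140

print(f"{blocks} blocks, {block_replicas} block replicas, "
      f"{raw_gb} GB raw out of {free_gb} GB free")
```

Under these assumptions the table occupies roughly a quarter of the free space reported for the four machines. Which datanodes actually receive each replica is decided by the placement policy described in the HDFS design document linked above.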