Re: mapred.tasktracker.map.tasks.maximum is not taking effect

2011-07-01 Thread praveen.peddi
Are you sure? AFAIK all mapred.xxx properties can be set via the job config. I also read in the Yahoo tutorial that this property can be set either in hadoop-site.xml or in the job config. Maybe someone who has actually used this property can confirm. Praveen On Jul 1, 2011, at 4:46 PM, "ext Anthony Urs

mapred.tasktracker.map.tasks.maximum is not taking effect

2011-07-01 Thread praveen.peddi
Hi all, I am using Hadoop 0.20.2. I am setting the property mapred.tasktracker.map.tasks.maximum = 4 (and the same for reduce) in my job conf, but I am still seeing a max of only 2 map and 2 reduce tasks on each node. I know my machine can run 4 map and 4 reduce tasks in parallel. Is this a bug in 0.2
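
For reference, in 0.20.x this property is read by each TaskTracker from its own mapred-site.xml when the daemon starts, so setting it in the job conf has no effect on a running cluster; the nodes stay at their configured default of 2 slots. A minimal sketch with the old JobConf API (the class name SlotConfigExample is hypothetical):

    import org.apache.hadoop.mapred.JobConf;

    public class SlotConfigExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf(SlotConfigExample.class);
        // Honoured per job: the number of reduce tasks.
        conf.setNumReduceTasks(4);
        // NOT honoured per job: each TaskTracker reads its map/reduce slot
        // counts from its local config at daemon startup, so the line below
        // does not change a running cluster.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
      }
    }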

RE: controlling no. of mapper tasks

2011-06-20 Thread praveen.peddi
Hi David, Thanks for the response. I didn't specify anything for the no. of concurrent mappers, but I do see that it shows as 10 on port 50030 (for a 5-node cluster). So I believe Hadoop is defaulting to the no. of cores in the cluster, which is 10. That is why I want to choose the map tasks to also be the same as the no. of

RE: controlling no. of mapper tasks

2011-06-20 Thread praveen.peddi
Hi David, I think Hadoop is looking at the data size, not the no. of input files. If I pass in .gz files, then yes, Hadoop chooses 1 map task per file, but if I pass in a HUGE text file, or the same file split into 10 files, it chooses the same no. of map tasks (191 in my case). Thanks Praveen -

controlling no. of mapper tasks

2011-06-20 Thread praveen.peddi
Hi there, I know the client can set "mapred.reduce.tasks" to specify the no. of reduce tasks and Hadoop honours it, but "mapred.map.tasks" is not honoured by Hadoop. Is there any way to control the number of map tasks? What I noticed is that Hadoop is choosing too many mappers and there is an extra overhead
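
For reference, in 0.20.x the number of map tasks comes from the InputFormat's splits, and mapred.map.tasks is only a hint. With the old FileInputFormat the split size works out to max(minSplitSize, min(totalSize/hint, blockSize)), so raising the minimum split size is one way to get fewer, larger maps. A hedged sketch (the 256MB figure is an assumption for illustration):

    import org.apache.hadoop.mapred.JobConf;

    public class SplitSizeExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf(SplitSizeExample.class);
        // Only a hint to FileInputFormat; it cannot force fewer splits
        // than one per block on its own.
        conf.setNumMapTasks(30);
        // Forcing fewer maps: with a 10GB input and a 256MB minimum split,
        // FileInputFormat produces ~40 splits instead of one per 64MB block.
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
      }
    }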

Printing the job status on the client side

2011-05-23 Thread praveen.peddi
Hello all, How do I print the status of each job on the client, with the % complete? I am invoking the Hadoop jobs using the Java client (not the Hadoop CLI) and I am not seeing the map and reduce job status on the command line. Is there a property that I can set in the Configuration? Praveen
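
One way to get the % complete on the client with the 0.20.x API is to submit the job and poll RunningJob; the blocking JobClient.runJob() also prints progress through its logger, so log4j configuration on the client is another angle to check. A minimal sketch:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class ProgressExample {
      public static void runWithProgress(JobConf conf) throws Exception {
        // submitJob() returns immediately, leaving the caller in control of
        // how and where progress is printed.
        RunningJob job = new JobClient(conf).submitJob(conf);
        while (!job.isComplete()) {
          System.out.printf("map %3.0f%%  reduce %3.0f%%%n",
              job.mapProgress() * 100, job.reduceProgress() * 100);
          Thread.sleep(5000);
        }
      }
    }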

RE: How to run or submit MapReduce Job to Hadoop in my own program?

2011-05-17 Thread praveen.peddi
Hi there, I think you got the run(String[] args) method right, but in the main method you are not calling your run method but ToolRunner.run. You need to invoke your method in order to point to localhost:54310; otherwise it will read those properties from the default Hadoop conf. Praveen
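
For reference, the usual shape of this pattern: ToolRunner.run() parses the generic options and then calls the Tool's run() method with the remaining args, and the cluster addresses can be set on the conf inside run(). A minimal sketch, assuming the localhost:54310 namenode from the thread (the 54311 jobtracker port and the MyJob class name are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJob extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), MyJob.class);
        conf.set("fs.default.name", "hdfs://localhost:54310");
        conf.set("mapred.job.tracker", "localhost:54311"); // assumed port
        // input/output paths and mapper/reducer classes would be set here
        JobClient.runJob(conf);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        // ToolRunner.run() invokes MyJob.run() after handling -D/-conf options.
        System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
      }
    }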

Unable to use hadoop cluster on the cloud

2011-03-03 Thread praveen.peddi
Hello all, I installed Hadoop 0.20.2 on physical machines and everything works like a charm. Now I installed Hadoop using the same hadoop-install gz file on the cloud. The installation seems fine. I can even copy files to HDFS from the master machine. But when I try to do it from another "non-hadoop" mac

Catching mapred exceptions on the client

2011-02-25 Thread praveen.peddi
Hello all, I have a few MapReduce jobs that I am calling from a Java driver. The problem I am facing is that when there is an exception in a mapred job, the exception is not propagated to the client, so even if the first job failed, it goes on to the second job and so on. Is there another way of catching e
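
One way to stop the driver chain on failure with the 0.20.x API: the blocking JobClient.runJob() throws an IOException when the job fails, and RunningJob exposes isSuccessful() for the non-blocking path. A minimal sketch:

    import java.io.IOException;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class DriverExample {
      public static void runChain(JobConf first, JobConf second) throws IOException {
        // Blocking: throws IOException if the job fails, so the second
        // job is never started.
        JobClient.runJob(first);

        // Non-blocking alternative: check the outcome explicitly.
        RunningJob job = new JobClient(second).submitJob(second);
        job.waitForCompletion();
        if (!job.isSuccessful()) {
          throw new IOException("Job " + job.getJobName() + " failed");
        }
      }
    }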

RE: Hadoop on physical machine Vs Cloud

2011-02-15 Thread praveen.peddi
I got this working when I bumped up the memory on the cloud to 8GB instead of 4GB. I guess with 4GB it was running out of resources. Praveen From: ext praveen.pe...@nokia.com [praveen.pe...@nokia.com] Sent: Thursday, February 10, 2011 4:40 PM To: common-u...@hadoo

Hadoop on physical machine Vs Cloud

2011-02-10 Thread praveen.peddi
Hello all, I have been using Hadoop on physical machines for some time now. But recently I tried to run the same Hadoop jobs on the Rackspace cloud and I have not yet been successful. My input file has 150M transactions and all Hadoop jobs finish in less than 90 minutes on a 4-node 4GB Hadoop cluster on

Hadoop Version

2011-01-28 Thread praveen.peddi
Hello all, I am having issues with accessing HDFS and I figured it's due to a version mismatch. I know my jar files have multiple copies of Hadoop (Pig has its own, I have Hadoop 0.20.2, and Whirr has its own Hadoop copy). My question is how to find the right version of Hadoop that matches the one
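
For reference, two quick ways to see which Hadoop actually got loaded on the client side (the cluster side is "hadoop version" on a node); the jar-location trick helps when several copies are on the classpath. A sketch:

    import org.apache.hadoop.util.VersionInfo;

    public class WhichHadoop {
      public static void main(String[] args) {
        // Version compiled into the hadoop jar the classloader picked.
        System.out.println("version: " + VersionInfo.getVersion());
        // Which of the multiple hadoop copies that class came from.
        System.out.println("loaded from: " + VersionInfo.class
            .getProtectionDomain().getCodeSource().getLocation());
      }
    }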

Hadoop environment variable

2011-01-25 Thread praveen.peddi
Hello all, I have set the Hadoop environment variable HADOOP_CONF_DIR and am trying to run a Hadoop job from a Java application, but the job is not picking up the Hadoop config from this HADOOP_CONF_DIR folder. If I copy the xml files from this folder onto the Java application classpath, it works fine. Sinc
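
This matches how Configuration works: it loads core-site.xml etc. from the Java classpath and ignores HADOOP_CONF_DIR, which only the hadoop shell scripts read (they prepend that directory to the classpath). A hedged workaround when the config must stay outside the app is to read the variable yourself:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class ConfDirExample {
      public static Configuration load() {
        Configuration conf = new Configuration();
        String dir = System.getenv("HADOOP_CONF_DIR");
        if (dir != null) {
          // Explicitly add the files the shell scripts would have put on
          // the classpath.
          conf.addResource(new Path(dir, "core-site.xml"));
          conf.addResource(new Path(dir, "mapred-site.xml"));
        }
        return conf;
      }
    }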

RE: Hadoop on Rackspace cloud

2011-01-06 Thread praveen.peddi
I found the solution. The problem was with ports. It looks like the cloud doesn't have these ports open. For now I shut down iptables on all Hadoop machines and things worked magically. Thanks Praveen From: ext praveen.pe...@nokia.com [mailto:praveen.pe...@nokia.com] S

Hadoop on Rackspace cloud

2011-01-06 Thread praveen.peddi
Hello all, I am trying to run Hadoop on Rackspace and I am having issues with starting up the servers. I have configured everything on the cloud exactly the same as my local Hadoop (which is working) but I can't start the servers. HDFS fails to start. Has anyone had any luck installing and starting Hadoop o

External jar dependency of map reduce jobs

2010-12-29 Thread praveen.peddi
Hello all, I have a few MapReduce jobs that I am invoking from a GlassFish container. I am not using the "hadoop" command line tool but calling the jobs directly from GlassFish programmatically. I have a driver that runs on GlassFish and calls these jobs sequentially. I was able to run the jobs as long
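
When the job is submitted programmatically rather than via "hadoop jar -libjars", one common approach from that era is to push the dependency jars to HDFS and add them to the task classpath through the DistributedCache. A sketch (both jar paths are hypothetical):

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class LibJarsExample {
      public static void addLibJar(JobConf conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path hdfsJar = new Path("/apps/lib/mylib.jar");            // hypothetical
        fs.copyFromLocalFile(new Path("/opt/app/lib/mylib.jar"),   // hypothetical
            hdfsJar);
        // Tasks on every node get this jar on their classpath.
        DistributedCache.addFileToClassPath(hdfsJar, conf);
      }
    }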

RE: Starting a Hadoop job programmatically

2010-11-24 Thread praveen.peddi
Hi Henning, Thanks again. Let me explain my scenario first so you can make better sense of my question. I have a web application running on a GlassFish server. Every 24 hours a Quartz job runs on the server, and I need to call a set of Hadoop jobs one after the other, read the final output, and stor
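
For the "read the final output" step, the driver can open the part files straight from HDFS once the last job succeeds. A minimal sketch (the output path is hypothetical):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadOutputExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path part = new Path("/jobs/final/part-00000");   // hypothetical path
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(fs.open(part)));
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line);   // e.g. store into the web app's DB here
        }
        reader.close();
      }
    }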

RE: Starting a Hadoop job programmatically

2010-11-23 Thread praveen.peddi
Hi Henning, Putting core-site.xml on the classpath worked. Thanks for the help. I need to figure out how to submit a job as a different user than the one Hadoop is configured for. I have one more question related to job submission: did anyone face a problem running a job that involves multiple jar files? I am

RE: Starting a Hadoop job programmatically

2010-11-23 Thread praveen.peddi
Hi Henning, Adding Hadoop's conf folder didn't fix the issue, but when I added the two properties below, I was able to access the file system, though I cannot write anything because of the different user. I have the following questions based on my experiments. 1. How can I access HDFS or submit jobs as a different
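
On the pre-security 0.20.x line, HDFS trusts the identity the client presents, and the commonly used (insecure) workaround from that era for acting as another user was the hadoop.job.ugi property. A hedged sketch (the user/group values are assumptions):

    import org.apache.hadoop.conf.Configuration;

    public class UgiExample {
      public static Configuration asHadoopUser() {
        Configuration conf = new Configuration();
        // Pre-Kerberos Hadoop takes the client's word for its identity:
        // the format is "user,group1[,group2...]" - assumed values below.
        conf.set("hadoop.job.ugi", "hadoop,hadoop");
        return conf;
      }
    }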

RE: Starting a Hadoop job programmatically

2010-11-22 Thread praveen.peddi
Hi, Thanks for your reply. In my case I have a driver that calls multiple jobs one after the other. I am using the following code to submit each job, but it uses the local Hadoop jar files that are in the classpath. It's not submitting the job to the Hadoop cluster. I thought I would need to specify where t
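
The usual cause of that symptom is the default conf resolving to the local runner; pointing the conf at the cluster and naming the jar to ship usually fixes it. A hedged sketch (host names and jar path are assumptions):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class RemoteSubmitExample {
      public static void submit() throws Exception {
        JobConf conf = new JobConf();
        // Without these, mapred.job.tracker defaults to "local" and the job
        // runs inside the client JVM instead of on the cluster.
        conf.set("fs.default.name", "hdfs://master:54310");   // assumed host
        conf.set("mapred.job.tracker", "master:54311");       // assumed host
        // The jar holding the map/reduce classes must be named explicitly
        // when the job is not launched via "hadoop jar".
        conf.setJar("/opt/app/lib/myjob.jar");                 // assumed path
        JobClient.runJob(conf);
      }
    }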

Starting a Hadoop job programmatically

2010-11-22 Thread praveen.peddi
Hi all, I am trying to figure out how I can start a Hadoop job programmatically from my Java application running in an app server. I was able to run my MapReduce job using the hadoop command from the Hadoop master machine, but my goal is to run the same job from my Java program (running on a different machin

RE: Mapper runs only on one machine

2010-11-16 Thread praveen.peddi
That's a good point. I was indeed using a gzip file that has a csv file in it. I uncompressed it and used the csv file, and now I can see many mappers running concurrently. Thanks for the suggestion. This is an important piece of information many people will miss, since compressed format is a more logic

Mapper runs only on one machine

2010-11-16 Thread praveen.peddi
Hi all, I have been trying to figure out why all mappers run on only one machine when I have a 4-node cluster. The reduce part is running fine on all 4 nodes. I am using 0.20.2. My input file is a large single file (10GB). Here is my config in mapred-site.xml. I specified map.tasks as 30 but
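
For reference, this is the expected 0.20.x behaviour for a single 10GB .gz input: gzip is not a splittable codec, so the whole file becomes one split and one mapper, whatever mapred.map.tasks says. A sketch of the check TextInputFormat effectively performs (the file name is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class SplittableCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory codecs = new CompressionCodecFactory(conf);
        // In 0.20.x, TextInputFormat treats a file as splittable only when
        // no compression codec matches its suffix - false for a .gz file,
        // so the whole 10GB file goes to a single mapper.
        Path input = new Path("data.csv.gz");   // hypothetical file name
        boolean splittable = (codecs.getCodec(input) == null);
        System.out.println(input + " splittable: " + splittable);
      }
    }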