Re: program running faster on single node than cluster

2010-11-17 Thread Alex Baranau
These config settings depend on the nature of your MR job and the resources available on the node. Since increasing the heap size affected the time dramatically, I assume that your jobs "like" memory. Can you describe your machines? Also, make sure you don't have any network issues (slow network can cause slowness

Re: program running faster on single node than cluster

2010-11-17 Thread Cornelio Iñigo
Hi, the cluster has 12 nodes plus the master node. I ran a new test, increasing the child nodes' memory to 2000m and HADOOP_HEAPSIZE to 2000m, with mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum both set to 2 (the default), and now the time is 6 minutes, but I think it is v
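A rough way to sanity-check the figures quoted in this message is to work out how many tasks the cluster can run at once and how much heap the task JVMs may request per node. The sketch below is only arithmetic on the numbers from the message; it assumes the 2000m value is the per-task JVM heap, which the message does not state explicitly, and it ignores the memory used by the Hadoop daemons themselves.

```python
# Figures quoted in the message; CHILD_HEAP_MB is an assumption (see lead-in).
WORKER_NODES = 12
MAP_SLOTS_PER_NODE = 2      # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS_PER_NODE = 2   # mapred.tasktracker.reduce.tasks.maximum
CHILD_HEAP_MB = 2000        # assumed to be the heap given to each task JVM


def cluster_slots(nodes, slots_per_node):
    """Total number of tasks of one kind that can run at the same time."""
    return nodes * slots_per_node


def child_heap_per_node_mb(map_slots, reduce_slots, heap_mb):
    """Worst-case heap requested by task JVMs on one node (daemons excluded)."""
    return (map_slots + reduce_slots) * heap_mb


if __name__ == "__main__":
    print("concurrent map tasks:", cluster_slots(WORKER_NODES, MAP_SLOTS_PER_NODE))
    print("concurrent reduce tasks:", cluster_slots(WORKER_NODES, REDUCE_SLOTS_PER_NODE))
    print("task heap per node (MB):",
          child_heap_per_node_mb(MAP_SLOTS_PER_NODE, REDUCE_SLOTS_PER_NODE, CHILD_HEAP_MB))
```

If a small dataset produces fewer map tasks than there are map slots, the extra nodes sit idle, and if the per-node heap total exceeds the physical RAM, the node may start swapping. Either case could make the cluster look slower than a single machine.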

Re: repeat a job for different files

2010-11-17 Thread Alex Baranau
If you need to process the files separately, use one MR job for each file. You can add a single file as input. I believe you'll need to iterate over all files in the input directory and start a job instance for each file. You can do this in Java code or in a script or... depending on your case. Alex Barana
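The "one job per file" approach described above could be scripted roughly as follows. This is a minimal sketch, not the poster's method: the jar name `myjob.jar`, the driver class `MyJob`, and the convention that the driver takes an input path and an output path as its two arguments are all hypothetical. Only the `hadoop jar <jar> <class> <args>` command form is standard.

```python
import posixpath
import shlex


def per_file_job_commands(input_files, output_root,
                          jar="myjob.jar", main_class="MyJob"):
    """Build one `hadoop jar` command per input file.

    jar and main_class are hypothetical placeholders; the driver is assumed
    to accept <input path> <output path> as its arguments.
    """
    commands = []
    for path in input_files:
        # Name each output directory after the input file (extension dropped)
        # so every job writes to its own, not-yet-existing directory.
        stem = posixpath.splitext(posixpath.basename(path))[0]
        out_dir = posixpath.join(output_root, stem)
        commands.append(["hadoop", "jar", jar, main_class, path, out_dir])
    return commands


if __name__ == "__main__":
    files = ["/user/demo/input/a.txt", "/user/demo/input/b.txt"]  # example paths
    for cmd in per_file_job_commands(files, "/user/demo/output"):
        # Print the commands; pass each list to subprocess.run to launch them.
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Giving each job its own output directory matters because a MapReduce job normally refuses to start if its output directory already exists.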

Re: program running faster on single node than cluster

2010-11-17 Thread Alex Baranau
How many nodes do you use for your "fully distributed" cluster? Alex Baranau Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase On Wed, Nov 17, 2010 at 5:44 AM, Cornelio Iñigo wrote: > Hi > > I have a question to you: > > I developed a program using Hadoop, it has one

repeat a job for different files

2010-11-17 Thread maha
Hi, When I set my inputFileFormat to take an input directory containing three files, the job is processed on all three and produces a single output containing the results from all of them. Instead, I want the job to be repeated separately for each inputFile, with a different output for each. Eg.

Re: AWS Hadoop 20.2 AMIs

2010-11-17 Thread Gangl, Michael E (388K)
FYI, I commented out the kernel version in the hadoop-ec2-env.sh script for the c1.xlarge if statements (at the bottom). Before, it was using aki-427d952b; now it's using aki-b51cf9dc, and I'm able to connect. Turns out the problem was a hang during the boot. This should probably be changed in th

AWS Hadoop 20.2 AMIs

2010-11-17 Thread Gangl, Michael E (388K)
I've been running into an issue today. I'm trying to procure 5 c1.xlarge instances on Amazon EC2. I was able to use the 453820947548/bixolabs-hadoop-0.20.2-i386 AMI for my previous m1.large instances, so I figured I could use the c1.xlarge instances with the x86_64 versions. When I start these

Re: Problem identifying cause of a failed job

2010-11-17 Thread Matt Pouttu-Clarke
We were getting SIGSEGV and fixed it by upgrading the JVM. We are using 1.6.0_21 currently. On Nov 16, 2010, at 3:50 PM, "Greg Langmead" wrote: Newbie alert. I have a Pig script I tested on small data and am now running it on a larger data set (85GB). My cluster is two machines right

program running faster on single node than cluster

2010-11-17 Thread Cornelio Iñigo
Hi, I have a question for you: I developed a program using Hadoop. It has one map function and one reduce function (like WordCount), and the map function does all the processing of my data. When I run this program on a single-node machine it takes about 7 minutes (it's a small dataset), in a pseudo-dis

Re: Tweaking the File write in HDFS

2010-11-17 Thread Zooni79
Hi, As an extension to the problem statement... Is it possible to fuse steps 1 and 2 into one step? I.e., can the map task pick its input from an external filesystem instead of HDFS? Can FTPFileSystem/RawLocalFileSystem be of any help here? ./zahoor On 15-Nov-2010, at 3:10 PM, Seb

Re: Retrieving information of submitted job

2010-11-17 Thread Harsh J
Hi, On Wed, Nov 17, 2010 at 5:11 PM, Jaydeep Ayachit wrote: > > - Configuration associated with this job > > - Job completion time JobHistory.JobInfo is the class you're looking for perhaps. -- Harsh J www.harshj.com

Retrieving information of submitted job

2010-11-17 Thread Jaydeep Ayachit
Hello, I need to retrieve some information from the submitted job. I can call JobClient.getJob(JOBID), which returns a RunningJob. I need to get - Configuration associated with this job - Job completion time I could not see any APIs in RunningJob that can get this data. Any pointe

Re: program running faster on single node than cluster

2010-11-17 Thread Hari Sreekumar
Are all the nodes being used? Go to :50030 on the web interface after starting the job, and check whether the tasks are progressing together on all nodes or not. hari On Wed, Nov 17, 2010 at 9:14 AM, Cornelio Iñigo wrote: > Hi > > I have a question to you: > > I developed a program using Hadoop,

Re: Generic Performance Tuning of MapReduce

2010-11-17 Thread Alex Baranau
Have you checked the suggestions/examples here: http://hadoop.apache.org/common/docs/current/cluster_setup.html? You probably did, but just in case. There are a lot of configuration options explained with real-world examples. Also useful: http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-pe