Re: knowing the nodes on which reduce tasks will run

2012-09-04 Thread Narasingu Ramesh
Hi Abhay, The NameNode has the addresses of all the data nodes. MapReduce does all the data processing. The data set is first put into the HDFS filesystem, and then you run the hadoop jar file. Map tasks handle the input files, whose output is shuffled, sorted and grouped together. Map task is completed and then

knowing the nodes on which reduce tasks will run

2012-09-03 Thread Abhay Ratnaparkhi
Hello, How can one find out the nodes on which reduce tasks will run? One of my jobs is running and it's completing all the map tasks. My map tasks write lots of intermediate data. The intermediate directory is getting full on all the nodes. If the reduce tasks take any node from the cluster then

Re: knowing the nodes on which reduce tasks will run

2012-09-03 Thread Bejoy Ks
Hi Abhay, The TaskTrackers on which the reduce tasks are triggered are chosen at random based on reduce slot availability. So if you don't want the reduce tasks to be scheduled on some particular nodes, you need to set 'mapred.tasktracker.reduce.tasks.maximum' on those nodes to 0. The bottleneck
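
As a rough illustration of the slot-based placement described above, here is a toy Python model. It is not Hadoop code: the node names, the slot counts and the assign_reduce_tasks helper are hypothetical, and the real JobTracker assigns tasks as TaskTrackers report free slots rather than by a single random draw. It only shows that a node whose reduce slot maximum is 0 never receives a reduce task.

```python
import random


def assign_reduce_tasks(reduce_slots, num_tasks, seed=0):
    """Toy model: place each reduce task on a random node with a free reduce slot.

    reduce_slots maps a (hypothetical) node name to its reduce slot maximum,
    i.e. the per-node value of mapred.tasktracker.reduce.tasks.maximum.
    """
    rng = random.Random(seed)
    free = dict(reduce_slots)  # remaining free reduce slots per node
    placement = []
    for _ in range(num_tasks):
        candidates = sorted(node for node, slots in free.items() if slots > 0)
        if not candidates:
            break  # no free slot anywhere: remaining tasks would have to wait
        node = rng.choice(candidates)
        free[node] -= 1
        placement.append(node)
    return placement


# node2 is configured with 0 reduce slots, so it is never a candidate.
slots = {"node1": 2, "node2": 0, "node3": 2}
placement = assign_reduce_tasks(slots, num_tasks=4)
print(placement)
```

In this model the nodes with zero reduce slots still run map tasks and still serve their map output; only reduce placement is affected.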

Re: knowing the nodes on which reduce tasks will run

2012-09-03 Thread Bertrand Dechoux
Hi, The reducer runs wherever a slot is available; the location is not related to where the data is located, and it is not possible to choose where the reducer will run (except by tweaking the tasktracker...). Regards Bertrand On Mon, Sep 3, 2012 at 4:19 PM, Abhay Ratnaparkhi

Re: knowing the nodes on which reduce tasks will run

2012-09-03 Thread Abhay Ratnaparkhi
How can I set 'mapred.tasktracker.reduce.tasks.maximum' to 0 in a running tasktracker? It seems that I need to restart the tasktracker, and in that case I'll lose the output of the map tasks run by that particular tasktracker. Can I change 'mapred.tasktracker.reduce.tasks.maximum' to 0 without restarting

Re: knowing the nodes on which reduce tasks will run

2012-09-03 Thread Bejoy Ks
Hi Abhay, This value needs to be changed, and the TT restarted, before you submit your job. Modifying this value midway won't affect the running jobs. On Mon, Sep 3, 2012 at 9:06 PM, Abhay Ratnaparkhi abhay.ratnapar...@gmail.com wrote: How can I set 'mapred.tasktracker.reduce.tasks.maximum'

Re: knowing the nodes on which reduce tasks will run

2012-09-03 Thread Hemanth Yamijala
Hi, You are right that a change to mapred.tasktracker.reduce.tasks.maximum will require a restart of the tasktrackers. AFAIK, there is no way of modifying this property without restarting. On a different note, could you see if the amount of intermediate data can be reduced using a combiner, or
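
To make the combiner suggestion concrete, here is a minimal Python sketch of what a combiner does to map output before it is written out and shuffled. It is a stand-alone simulation, not the Hadoop API; the word-count style map_words and combine functions are hypothetical and assume the reduce operation is associative and commutative (such as a sum), which may or may not hold for the job discussed in this thread.

```python
from collections import Counter


def map_words(line):
    """Hypothetical map step: emit a (word, 1) pair for every word."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Hypothetical combiner: sum the values per key on the map side."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


map_output = map_words("a b a a b c a")
combined = combine(map_output)

# Fewer records reach the intermediate directory and the shuffle.
print(len(map_output), "records before combining")
print(len(combined), "records after combining:", combined)
```

The saving depends on how often keys repeat within a single map task's output; if every key is unique, a combiner does not reduce the intermediate data at all.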

Re: knowing the nodes on which reduce tasks will run

2012-09-03 Thread Michael Segel
The short answer is no. The longer answer is that you can attempt to force data locality; however, even then, if an open slot becomes available, it's used regardless of what you want to do... On Sep 3, 2012, at 9:19 AM, Abhay Ratnaparkhi abhay.ratnapar...@gmail.com wrote: Hello, How can

Re: knowing the nodes on which reduce tasks will run

2012-09-03 Thread Abhay Ratnaparkhi
All of my map tasks are about to complete and there is not much processing to be done in the reducer. The job has been running for a week, so I don't want it to fail. Any other suggestion to tackle this is welcome. ~Abhay On Mon, Sep 3, 2012 at 9:26 PM, Hemanth Yamijala