By default you get at least one task per file; if any file is bigger than a block, then that file is broken up into N tasks, each one block long. I'm not sure what you mean by "properly calculate" -- as long as you have more tasks than you have cores, you'll definitely have work for every core to do. Having more, finer-grained tasks also lets the nodes that get "small" tasks complete many of them while other cores are stuck with the "heavier" ones.
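(For what it's worth, here's a quick back-of-the-envelope sketch of that split count. The input path, the class name, and the flat 64 MB block size are just placeholders for illustration; the real split calculation also honors the min/max split size settings.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SplitEstimate {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    long blockSize = 64L * 1024 * 1024;  // assuming the default 64 MB HDFS block size
    long numTasks = 0;

    // One split per block, and at least one split per file.
    for (FileStatus stat : fs.listStatus(new Path("/user/hadoop/input"))) {
      long fullBlocks = (stat.getLen() + blockSize - 1) / blockSize;  // ceiling division
      numTasks += Math.max(1L, fullBlocks);
    }
    System.out.println("Expected map tasks: " + numTasks);
  }
}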
If you call setNumMapTasks() with a higher number of tasks than the InputFormat creates (via the algorithm above), then it should create additional tasks by dividing files up into smaller chunks (which may be sub-block-sized).

As for where you should run your computation: I don't know that the "map" and "reduce" phases are really "optimized" for computation in any particular way. It's just a data motion thing. (At the end of the day, it's your code doing the processing on either side of the fence, and that should dominate the execution time.)

If you use an identity mapper with a pseudo-random key to spray the data into a bunch of reduce partitions, then you'll get a bunch of reducers each working on a hopefully evenly-sized slice of the data. The map tasks will quickly read from the original source data and forward the workload along to the reducers, which do the actual heavy lifting. The cost of this approach is that you have to pay for the time taken to transfer the data from the mapper nodes to the reducer nodes and to sort it by key when it gets there. If you're only working with 600 MB of data, this is probably negligible. (I've pasted a rough sketch of this approach below the quoted message.)

The advantages of doing your computation in the reducers are:
1) You can directly control the number of reduce tasks and set this equal to the number of cores in your cluster.
2) You can tune your partitioning algorithm so that all reducers get roughly equal workload assignments, if there appears to be some sort of skew in the dataset.

The tradeoff is that you have to ship all the data to the reducers before computation starts, which sacrifices data locality and involves an "intermediate" data set of the same size as the input data set. If this is in the range of hundreds of GB or north, it can be very time-consuming -- so it doesn't scale terribly well. Of course, by the time you've got several hundred GB of data to work with, your current workload-imbalance issues should be moot anyway.

- Aaron

On Fri, Nov 27, 2009 at 4:33 PM, CubicDesign <cubicdes...@gmail.com> wrote:
>
> Aaron Kimball wrote:
>
>> (Note: this is a tasktracker setting, not a job setting. You'll need to
>> set this on every node, then restart the mapreduce cluster to take effect.)
>>
> Ok. And here is my mistake. I set this to 16 only on the main node, not also
> on the data nodes. Thanks a lot!!!!!!
>
>> Of course, you need to have enough RAM to make sure that all these tasks
>> can run concurrently without swapping.
>>
> No problem!
>
>> If your individual records require around a minute each to process as you
>> claimed earlier, you're nowhere near in danger of hitting that particular
>> performance bottleneck.
>>
> I was thinking that if I am under the recommended value of 64MB, Hadoop
> cannot properly calculate the number of tasks.
>
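As promised above, here's a rough sketch of the spray-and-compute approach using the old mapred API. The class names, the reducer count of 16, and the placeholder reducer body are all made up for illustration -- the real per-record work would go where the comment indicates.

import java.io.IOException;
import java.util.Iterator;
import java.util.Random;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SprayJob {

  // (Near-)identity mapper: passes each record through unchanged, keyed by a
  // pseudo-random reducer number so records spread roughly evenly over reducers.
  public static class SprayMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, IntWritable, Text> {
    private final Random rand = new Random();
    private int numReducers;

    @Override
    public void configure(JobConf conf) {
      numReducers = conf.getNumReduceTasks();
    }

    public void map(LongWritable offset, Text record,
                    OutputCollector<IntWritable, Text> output, Reporter reporter)
        throws IOException {
      output.collect(new IntWritable(rand.nextInt(numReducers)), record);
    }
  }

  // Placeholder reducer: the expensive per-record computation would go here.
  public static class HeavyComputeReducer extends MapReduceBase
      implements Reducer<IntWritable, Text, Text, Text> {
    public void reduce(IntWritable key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        Text record = values.next();
        // ... do the minutes-per-record work here, then emit whatever result you need ...
        output.collect(record, new Text(""));
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(SprayJob.class);
    conf.setJobName("spray-and-compute");

    conf.setMapperClass(SprayMapper.class);
    conf.setReducerClass(HeavyComputeReducer.class);
    conf.setNumReduceTasks(16);        // e.g. total number of cores in the cluster
    // conf.setNumMapTasks(32);        // only a hint; the InputFormat has the final say

    conf.setMapOutputKeyClass(IntWritable.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Since the map output keys are already small integers in [0, numReduceTasks), the default HashPartitioner sends key k to reducer k, which is what gives you the roughly even spread; if you do see skew, that's the spot where you'd plug in a custom Partitioner instead.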