What is your use case? Why would you want to use only 5 mappers and not
all 10 task trackers?
"If an individual file is so large that it will affect seek time it will be
split to several Splits" (http://wiki.apache.org/hadoop/HadoopMapReduce)
"if a split span over more than one dfs blo
What you may be looking for is a workflow system such as Oozie
(yahoo.github.com/oozie/) or Azkaban
(http://sna-projects.com/azkaban/).
If your needs are simple (2-3 jobs, not too many conditions, etc. per
workflow), you can check out the JobControl API
(http://hadoop.apache.org/common/docs/r0.20.2
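For the simple chained case, a driver using the 0.20.x JobControl classes
could look roughly like the sketch below; the paths, job names, and the use
of IdentityMapper are placeholders for illustration, not anything taken from
this thread:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class ChainedJobsDriver {
  public static void main(String[] args) throws Exception {
    // First job: reads the raw input and writes an intermediate directory.
    JobConf confA = new JobConf(ChainedJobsDriver.class);
    confA.setJobName("step-1");
    confA.setMapperClass(IdentityMapper.class);   // stand-in for the real mapper
    confA.setOutputKeyClass(LongWritable.class);
    confA.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(confA, new Path("/input"));
    FileOutputFormat.setOutputPath(confA, new Path("/intermediate"));

    // Second job: consumes the first job's output.
    JobConf confB = new JobConf(ChainedJobsDriver.class);
    confB.setJobName("step-2");
    confB.setMapperClass(IdentityMapper.class);   // stand-in for the real mapper
    confB.setOutputKeyClass(LongWritable.class);
    confB.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(confB, new Path("/intermediate"));
    FileOutputFormat.setOutputPath(confB, new Path("/output"));

    Job jobA = new Job(confA);
    Job jobB = new Job(confB);
    jobB.addDependingJob(jobA);   // jobB is submitted only after jobA succeeds

    JobControl control = new JobControl("two-step-workflow");
    control.addJob(jobA);
    control.addJob(jobB);

    // JobControl is a Runnable; run it in a thread and poll until done.
    Thread runner = new Thread(control);
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    control.stop();
  }
}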
I have a general question about how the number of mapper tasks is
calculated. As far as I know, the number is primarily based on the number of
splits: if I have 5 splits and 10 tasktrackers running in the
cluster, I will have 5 mapper tasks running in my MR job, right?
But what I found i
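One way to check that assumption is to ask the InputFormat directly; a
minimal sketch with the old 0.20.x API (the input path is a placeholder)
would be:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SplitCount.class);
    FileInputFormat.setInputPaths(conf, new Path("/input"));  // placeholder path

    TextInputFormat inputFormat = new TextInputFormat();
    inputFormat.configure(conf);

    // The numSplits argument is only a hint; FileInputFormat still splits
    // each file along block and min/max split-size boundaries.
    InputSplit[] splits = inputFormat.getSplits(conf, 1);
    System.out.println("Map tasks the job would launch: " + splits.length);
  }
}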
Hello all,
I am trying to write an MR program where the output from the mappers is
dependent on the previous map processes. I understand that a job scheduler
exists to control such processes. Would anyone be able to give some sample
code of a working implementation of this in Hadoop 0.20.2?
Thanks everyone.
After setting HADOOP_CLIENT_OPTS, the error changed: now the number
of tasks my job was launching was more than 100,000, which I believe is the
maximum set on my cluster.
This was because I had more than 100,000 files input to my job. I merged
some files so that the total nu
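As an aside, on newer releases (2.x) that ship
org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat (it is not part
of 0.20.2), many small files can be packed into fewer splits instead of being
merged by hand; a rough map-only sketch, with placeholder paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "combine-small-files");
    job.setJarByClass(CombineSmallFiles.class);

    // Pack many small files into combined splits instead of one split per file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    // Cap each combined split at 128 MB so tens of thousands of tiny files
    // collapse into a manageable number of map tasks.
    CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    // Identity map-only job just to illustrate the input-format wiring.
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path("/many-small-files"));   // placeholder
    FileOutputFormat.setOutputPath(job, new Path("/combined-output"));  // placeholder
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}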
Hi Praveen,
Can you please look at the RM logs and check whether there are any
errors/exceptions? I ran into a similar issue when my RM was down.
Also, the defaults are present at
http://svn.apache.org/repos/asf/hadoop/common/branches/MR-279/mapreduce/yarn/yarn-server/yarn-server-common/src/main/resou