http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/

"Computes the input splits for the job. If the splits cannot be computed, because the input paths don't exist, for example, then the job is not submitted and an error is thrown to the MapReduce program.

Copies the resources needed to run the job, including the job JAR file, the configuration file and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3)."

1. My first question: who is responsible for computing the input splits? Is it the JobClient's work or the jobtracker's work? From the passage above it sounds like the JobClient's, but I don't understand how the JobClient can compute this, because it doesn't seem to hold enough information to do so. To compute the input splits, the party must at least know how many blocks the target input spans, AFAIK, and the JobClient doesn't appear to have that information. (My guess at how it might get it is in the first sketch below.)

2. My second question is about split sizes. Here is my understanding of splits, using an example: a 256 MB file stored in 4 blocks in HDFS can be split into 4 splits if it is the target input for the MR job. Is the minimal split a block, or can a split be smaller than that? How exactly is a split size computed? (My mental model of the arithmetic is in the second sketch below.)

--Sean
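First sketch: my guess at how the JobClient could learn the block layout on its own. Any HDFS client can ask the namenode for a file's block locations through the FileSystem API, which would explain why the jobtracker isn't needed for this step. The input path below is made up, and I'm going from my reading of the org.apache.hadoop.fs javadocs, not from the JobClient sources, so please correct me if this isn't what actually happens:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfoSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path input = new Path("/user/sean/input/part-00000"); // hypothetical path
            FileStatus status = fs.getFileStatus(input);

            // Ask the namenode for the file's block layout; any client can do
            // this, so (I assume) the JobClient can too when computing splits.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                        + " length=" + b.getLength()
                        + " hosts=" + String.join(",", b.getHosts()));
            }
        }
    }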
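Second sketch: my mental model of the split-size arithmetic, as plain Java with no Hadoop dependency. I believe FileInputFormat computes something like max(minSize, min(maxSize, blockSize)); the method name computeSplitSize and the minSize/maxSize defaults below are my assumptions about the mapred.min.split.size / mapred.max.split.size defaults, not something I've verified in the sources:

    public class SplitSizeSketch {

        // My paraphrase of FileInputFormat's split sizing:
        // max(minSize, min(maxSize, blockSize)).
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024;  // 64 MB HDFS block, as in my example
            long minSize   = 1L;                 // assumed mapred.min.split.size default
            long maxSize   = Long.MAX_VALUE;     // assumed mapred.max.split.size default
            long fileSize  = 256L * 1024 * 1024; // the 256 MB file from my example

            long splitSize = computeSplitSize(blockSize, minSize, maxSize);
            long numSplits = (fileSize + splitSize - 1) / splitSize; // ceiling division

            System.out.println("splitSize = " + (splitSize / (1024 * 1024)) + " MB");
            System.out.println("numSplits = " + numSplits); // 4, matching my 4-block guess
        }
    }

If that formula is right, then with the defaults a split equals a block (so my 256 MB file gives 4 splits of 64 MB), and setting the maximum split size below the block size would produce splits smaller than a block, which is part of what I'm trying to confirm.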