There would indeed be thousands of map tasks, but they are not all launched at the same time. The number of tasks that run in parallel is configurable, typically one per core on each data node.
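To make the split-to-mapper arithmetic concrete, here is a minimal Java sketch of the split-size rule that Hadoop's FileInputFormat applies (splitSize = max(minSize, min(maxSize, blockSize))). The 1 TB / 128 MB numbers and the helper names are illustrative, not taken from the thread:

    // Rough sketch of how FileInputFormat-style split sizing drives the mapper count.
    public class SplitCountSketch {
        // Mirrors the split-size rule used by Hadoop's FileInputFormat:
        // splitSize = max(minSize, min(maxSize, blockSize))
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024;        // 128 MB HDFS block (illustrative)
            long minSize = 1L;                          // default minimum split size
            long maxSize = Long.MAX_VALUE;              // default maximum split size
            long fileSize = 1024L * 1024 * 1024 * 1024; // 1 TB input file (illustrative)

            long splitSize = computeSplitSize(blockSize, minSize, maxSize);
            long numSplits = (fileSize + splitSize - 1) / splitSize; // ceiling division

            // ~8192 splits => ~8192 map tasks, scheduled a few at a time per node
            System.out.println("split size = " + splitSize + " bytes, map tasks = " + numSplits);
        }
    }

So a 1 TB file with the default settings yields roughly 8192 splits, and the cluster works through the corresponding map tasks in waves rather than all at once.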
On Wed, Dec 17, 2014 at 6:31 PM, bit1...@163.com <bit1...@163.com> wrote:

> Thanks Mark and Dieter for the reply.
>
> Actually, I have another question in mind. What is the relationship between
> an input split and a mapper task? Is it a one-to-one relation, or can a
> mapper task handle more than one input split?
>
> If a mapper task can handle only one input split, and there are many input
> splits (say, the original file is 1TB or larger, so there may be thousands
> of input splits), then thousands of mapper tasks would be created.
>
> ------------------------------
> bit1...@163.com
>
>
> *From:* mark charts <mcha...@yahoo.com>
> *Date:* 2014-12-18 00:15
> *To:* user@hadoop.apache.org
> *Subject:* Re: How many blocks does one input split have?
> Hello.
>
>
> FYI.
>
> "The way HDFS has been set up, it breaks down very large files into large
> blocks (for example, measuring 128MB), and stores three copies of these
> blocks on different nodes in the cluster. HDFS has no awareness of the
> content of these files.
>
> In YARN, when a MapReduce job is started, the Resource Manager (the cluster
> resource management and job scheduling facility) creates an Application
> Master daemon to look after the lifecycle of the job. (In Hadoop 1, the
> JobTracker monitored individual jobs as well as handling job scheduling and
> cluster resource management.) One of the first things the Application
> Master does is determine which file blocks are needed for processing. The
> Application Master requests details from the NameNode on where the replicas
> of the needed data blocks are stored. Using the location data for the file
> blocks, the Application Master makes requests to the Resource Manager to
> have map tasks process specific blocks on the slave nodes where they’re
> stored.
>
> The key to efficient MapReduce processing is that, wherever possible, data
> is processed locally — on the slave node where it’s stored.
>
> Before looking at how the data blocks are processed, you need to look more
> closely at how Hadoop stores data. In Hadoop, files are composed of
> individual records, which are ultimately processed one-by-one by mapper
> tasks. For example, the sample data set we use in this book contains
> information about completed flights within the United States between 1987
> and 2008. We have one large file for each year, and within every file, each
> individual line represents a single flight. In other words, one line
> represents one record. Now, remember that the block size for the Hadoop
> cluster is 64MB, which means that the flight data files are broken into
> chunks of exactly 64MB.
>
> Do you see the problem? If each map task processes all records in a
> specific data block, what happens to those records that span block
> boundaries? File blocks are exactly 64MB (or whatever you set the block
> size to be), and because HDFS has no conception of what’s inside the file
> blocks, it can’t gauge when a record might spill over into another block.
> To solve this problem, Hadoop uses a logical representation of the data
> stored in file blocks, known as input splits. When a MapReduce job client
> calculates the input splits, it figures out where the first whole record in
> a block begins and where the last record in the block ends. In cases where
> the last record in a block is incomplete, the input split includes location
> information for the next block and the byte offset of the data needed to
> complete the record.
> You can configure the Application Master daemon (or JobTracker, if you’re
> in Hadoop 1) to calculate the input splits instead of the job client, which
> would be faster for jobs processing a large number of data blocks.
>
> MapReduce data processing is driven by this concept of input splits. The
> number of input splits that are calculated for a specific application
> determines the number of mapper tasks. Each of these mapper tasks is
> assigned, where possible, to a slave node where the input split is stored.
> The Resource Manager (or JobTracker, if you’re in Hadoop 1) does its best
> to ensure that input splits are processed locally." *sic*
>
> Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown, Rafael Coss, and
> Roman B. Melnyk
>
>
> Mark Charts
>
>
> On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <
> drdwi...@gmail.com> wrote:
>
> Hi,
>
> Check this post:
> http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size
>
> Regards, D
>
>
> 2014-12-17 15:16 GMT+01:00 Todd <bit1...@163.com>:
>
> Hi Hadoopers,
>
> I have a question: how many blocks does one input split have? Is the number
> random, can it be configured, or is it fixed (can't be changed)?
> Thanks!
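As a practical footnote to the quoted explanation: with the standard FileInputFormat, each input split is handed to exactly one mapper task, so adjusting the split size is the usual way to influence the number of mappers. A minimal, untested sketch against the Hadoop 2 MapReduce API; the input path and sizes are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeConfigExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo");

            // Input path is a placeholder; point it at your own data.
            FileInputFormat.addInputPath(job, new Path("/data/flights/1987.csv"));

            // Raising the minimum split size above the block size yields fewer,
            // larger splits (and therefore fewer map tasks); lowering the
            // maximum below the block size yields more, smaller splits.
            FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
            FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024); // 512 MB

            // Mapper/reducer classes, output path, etc. would be set here as usual.
        }
    }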