Hello, Harsh!

Thank you for your quick response. I have two more questions:

1. You are saying that each map task takes one file as input, but when the files are smaller than the block size, isn't it possible for a single map task to take more than one file?
2. In that case, will the same behavior apply (meaning each file is processed to the end before the next one starts)?

Regards,
Florin

--- On Wed, 7/20/11, Harsh J <ha...@cloudera.com> wrote:

> From: Harsh J <ha...@cloudera.com>
> Subject: Re: Order of files in Map class
> To: hdfs-user@hadoop.apache.org
> Date: Wednesday, July 20, 2011, 3:44 AM
>
> Florin,
>
> Your second example is how it happens in Hadoop, but there's more here
> to understand.
>
> To start with, your InputFormat (input splitter) computes and
> publishes a set amount of InputSplits. The total number of input
> splits is going to be your total number of 'Map Tasks' in Hadoop as the
> job proceeds. The input splits are generally block splits, i.e.,
> start-and-stop lengths over the same file.
>
> Each 'MapTask' is designated one split from this list of splits. So
> every map task would initialize separately, in its own JVM (no shared
> resources -- again, it's a different instance of mappers per file or
> block!) and read the input split alone, into its map(key, value,
> context) function.
>
> So to summarize, your second example is what will happen, but it would
> be in parallel instead, such as:
>
> map1  | map2  | …
> file1 | file2 | …
> row1  | row1  | …
> row2  | row2  | …
>
> P.s. What I've explained here is the default behavior. Of course
> things can be highly tweaked to achieve other things, like your first
> example, but those probably come with greater read costs attached. The
> 'hadoop' way is data local, and one-file-per-task.
>
> On Wed, Jul 20, 2011 at 12:11 PM, Florin P <florinp...@yahoo.com> wrote:
> > Hello!
> > Suppose that we have the files F1, F2, .. Fk given by the input
> > splitter to the map class. What is the order in which they will
> > arrive when the map function is applied?
> > What interests me is whether mixed key-value pairs from different
> > files can arrive in the map function. Will the keys arrive grouped
> > by file, until no more keys are left from the source file, or can
> > they arrive interleaved: one key from F1, one key from Fk, and so on?
> >
> > Example:
> > Mixed key-value pairs at the map function:
> > K1 from F1
> > K5 from F5
> > K7 from F8
> > etc
> >
> > Ordered key-value pairs:
> > K1 from F1
> > ..
> > K_end_F1 from F1
> > K5 from F5
> > ..
> > K_end_F5 from F5
> > and so on.
> >
> > I look forward to your answer.
> > Regards,
> > Florin
>
> --
> Harsh J
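
To make the split-per-task model from Harsh's reply concrete, here is a minimal plain-Java sketch. It does not call any Hadoop API: the class `SplitSketch`, the nested `Split` type, and the method `computeSplits` are hypothetical names invented for illustration. It assumes the default behavior described above, where a split is a start-and-length range over a single file, and it ignores details a real InputFormat handles (rounding of the last chunk, empty files, data locality).

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: plain Java, no Hadoop classes involved.
class SplitSketch {

    // Hypothetical stand-in for an input split: a byte range inside ONE file.
    static final class Split {
        final String file;
        final long start;
        final long length;

        Split(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return file + " [start=" + start + ", length=" + length + "]";
        }
    }

    // Cut each file into block-sized ranges. A range never crosses a file
    // boundary, so a file smaller than the block size still gets its own split.
    static List<Split> computeSplits(String[] files, long[] sizes, long blockSize) {
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < files.length; i++) {
            for (long offset = 0; offset < sizes[i]; offset += blockSize) {
                long length = Math.min(blockSize, sizes[i] - offset);
                splits.add(new Split(files[i], offset, length));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed example sizes: F1 is smaller than one block, F2 spans several.
        String[] files = {"F1", "F2"};
        long[] sizes = {10, 150};
        List<Split> splits = computeSplits(files, sizes, 64);

        // One map task per split; each task reads only its own range, in order.
        for (int task = 0; task < splits.size(); task++) {
            System.out.println("map" + (task + 1) + " <- " + splits.get(task));
        }
    }
}
```

In this sketch the small file F1 ends up alone in its own split rather than being merged with F2, which matches the one-file-per-task default Harsh mentions. Packing several small files into one task would be one of the "tweaked" setups from his postscript, not the default.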