I wrote a post on how to use CombineFileInputFormat: http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/

In the RecordReader constructor you can find out which file the reader has been handed. In my example, I created FileLineWritable to carry the filename in the mapper input key.
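The post has the complete code; condensed, the key type looks roughly like this (see the post for the full version -- hashCode and a few other overrides are omitted here):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// A mapper input key that carries (fileName, offset) instead of the usual
// byte-offset LongWritable, so every record knows which file it came from.
public class FileLineWritable implements WritableComparable<FileLineWritable> {
  public String fileName;
  public long offset;

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(offset);
    Text.writeString(out, fileName);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    offset = in.readLong();
    fileName = Text.readString(in);
  }

  @Override
  public int compareTo(FileLineWritable that) {
    int cmp = fileName.compareTo(that.fileName);
    return cmp != 0 ? cmp : Long.compare(offset, that.offset);
  }
}

The per-file reader is otherwise an ordinary line reader. The CombineFileInputFormat machinery instantiates it through a (CombineFileSplit, TaskAttemptContext, Integer index) constructor, and split.getPath(index) is how the reader learns which file it is reading -- that is where fileName comes from. Then you can use the input key as: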
public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable> {
  private Text txt = new Text();
  private IntWritable count = new IntWritable(1);

  public void map(FileLineWritable key, Text val, Context context)
      throws IOException, InterruptedException {
    StringTokenizer st = new StringTokenizer(val.toString());
    while (st.hasMoreTokens()) {
      txt.set(key.fileName + st.nextToken());  // tag each token with its source file
      context.write(txt, count);
    }
  }
}

Cheers,
Felix

On Aug 20, 2014, at 8:19 AM, rab ra <rab...@gmail.com> wrote:

> Thanks for the response.
>
> Yes, I know about WholeFileInputFormat, but I am not sure the filename reaches
> the map process as either key or value. As far as I can tell, this format reads
> the contents of the file. I want an input format that just gives the filename,
> or a list of filenames.
>
> Also, the files are very small. WholeFileInputFormat spawns one map process
> per file and thus results in a huge number of map processes. I want to spawn
> a single map process per group of files.
>
> I think I need to tweak CombineFileInputFormat's RecordReader so that it
> does not read the entire file, just the filename.
>
> regards
> rab
>
> regards
> Bala
>
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <shahab.yu...@gmail.com> wrote:
> Have you looked at the WholeFileInputFormat implementations? There are quite
> a few if you search for them...
>
> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>
> Regards,
> Shahab
>
> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <rab...@gmail.com> wrote:
> Hello,
>
> I have a use case wherein I need to process a huge set of files stored in HDFS.
> The files are non-splittable and need to be processed as a whole. I have the
> following questions, whose answers I need before I can proceed.
>
> 1. I wish to schedule each map process on a task tracker where the data is
> already available. How can I do that? Currently, I have a file that contains a
> list of filenames. Each map gets one line of it via NLineInputFormat, then
> accesses the file via FSDataInputStream and works with it. Is there a way to
> ensure a map process runs on the node where its file is available?
>
> 2. The files are not large and would be called 'small' files by Hadoop
> standards. I came across CombineFileInputFormat, which can process more than
> one file in a single map process. What I need here is a format that can
> process more than one file in a single map but does not have to read the
> files; either the key or the value should carry the filenames. In the map
> process I can then run a loop to process these files. Any help?
>
> 3. Any other alternatives?
>
> regards
> rab
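P.S. On questions 1 and 2 in the original mail above: you don't have to tweak the reader to skip the file contents -- you can write one that never opens the file at all. The per-file reader's constructor records the path from the CombineFileSplit, and nextKeyValue() hands it to the mapper exactly once. A rough, untested sketch (the class names here are mine, not anything that ships with Hadoop):

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Packs many small files into each split and hands the mapper one record
// per file: key = full path, value = nothing. The framework never touches
// the file contents.
public class FileNameInputFormat extends CombineFileInputFormat<Text, NullWritable> {

  @Override
  public RecordReader<Text, NullWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    return new CombineFileRecordReader<Text, NullWritable>(
        (CombineFileSplit) split, context, FileNameRecordReader.class);
  }

  // One of these is created per file in the combined split.
  public static class FileNameRecordReader extends RecordReader<Text, NullWritable> {
    private final Text key = new Text();
    private boolean emitted = false;

    // CombineFileRecordReader instantiates this reflectively with exactly
    // this three-argument signature; 'index' says which of the split's
    // files this reader is responsible for.
    public FileNameRecordReader(CombineFileSplit split,
                                TaskAttemptContext context, Integer index) {
      key.set(split.getPath(index).toString());  // the file is never opened
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) { }

    @Override
    public boolean nextKeyValue() {
      if (emitted) return false;
      emitted = true;   // emit exactly one record: the file name
      return true;
    }

    @Override public Text getCurrentKey() { return key; }
    @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
    @Override public float getProgress() { return emitted ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}

In the driver you'd set job.setInputFormatClass(FileNameInputFormat.class) and cap the split size (mapreduce.input.fileinputformat.split.maxsize) to control how many files each map gets; in the mapper, open each path with FileSystem.open() and process it whole, as you described. This should also go some way toward question 1: when CombineFileInputFormat builds its splits it groups blocks by node and rack, so the combined map tasks tend to be scheduled close to the data.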