Thanks for the response. Yes, I know WholeFileInputFormat, but I am not sure whether the filename reaches the map process as either the key or the value. As far as I can tell, this input format reads the contents of the file. I would like an input format that gives just the filename, or a list of filenames.
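To make the request concrete, here is a rough sketch in plain Java of the grouping I have in mind. It uses no Hadoop classes: FileNameGrouper, group and maxGroupBytes are hypothetical names for illustration only, and the size-capped greedy packing is just my assumption about how files could be grouped, not how any existing input format does it. The point is that each map would receive only a list of filenames, and nothing here opens a file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: packs file names into groups by size.
class FileNameGrouper {

    /**
     * Greedily packs file names into groups whose total size stays within
     * maxGroupBytes. A file larger than the cap ends up in a group of its own.
     * Only names and sizes are used; file contents are never read.
     * The iteration order of fileSizes determines the packing order.
     */
    static List<List<String>> group(Map<String, Long> fileSizes, long maxGroupBytes) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (Map.Entry<String, Long> entry : fileSizes.entrySet()) {
            long size = entry.getValue();
            // Close the current group if adding this file would exceed the cap.
            if (!current.isEmpty() && currentBytes + size > maxGroupBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(entry.getKey());
            currentBytes += size;
        }
        // Flush the last, partially filled group.
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

With a 100-byte cap and files of 40, 50, 30, 200 and 10 bytes (in that order), this yields four groups: the first two files together, then one group for each of the remaining files. Each group would then be handed to one map process as a list of filenames.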
Also, the files are very small. WholeFileInputFormat spawns one map process per file and thus results in a huge number of map processes. I would like to spawn a single map process per group of files. I think I need to tweak CombineFileInputFormat's record reader so that it does not read the entire file but only the filename.

regards
rab

regards
Bala

On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <shahab.yu...@gmail.com> wrote:

> Have you looked at the WholeFileInputFormat implementations? There are
> quite a few if you search for them...
>
> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>
> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>
> Regards,
> Shahab
>
> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <rab...@gmail.com> wrote:
>
>> Hello,
>>
>> I have a use case wherein I need to process a huge set of files stored in
>> HDFS. These files are non-splittable and need to be processed as a
>> whole. I have the following questions, for which I need answers to
>> proceed further.
>>
>> 1. I wish to schedule the map process on the task tracker where the data is
>> already available. How can I do it? Currently, I have a file that contains
>> a list of filenames. Each map gets one line of it via NLineInputFormat. The
>> map process then accesses the file via FSDataInputStream and works with it.
>> Is there a way to ensure this map process runs on the node where the
>> file is available?
>>
>> 2. The files are not large and can be called 'small' files by Hadoop
>> standards. I came across CombineFileInputFormat, which can process more
>> than one file in a single map process. What I need here is a format that
>> can process more than one file in a single map but does not have to read
>> the files, and has the filenames in either the key or the value. In the
>> map process, I can then run a loop to process these files. Any help?
>>
>> 3. Any other alternatives?
>>
>> regards
>> rab