Hi,

I tried to use your CombineFileInputFormat implementation. However, I get the following exception:

    ‘not org.apache.hadoop.mapred.InputFormat’

I am using Hadoop 2.4.1, and it looks like it expects the older interface, as it does not accept ‘org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat’. May I know which version of Hadoop you used?

It looks like I need to use the older ‘org.apache.hadoop.mapred.lib.CombineFileInputFormat’?

Thanks and regards,
rab

On 20 Aug 2014 22:59, "Felix Chern" <idry...@gmail.com> wrote:
> I wrote a post on how to use CombineFileInputFormat:
>
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
>
> In the RecordReader constructor, you can get the context of which file you
> are reading. In my example, I created FileLineWritable to include the
> filename in the mapper input key. Then you can use the input key as:
>
>   public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable> {
>     private Text txt = new Text();
>     private IntWritable count = new IntWritable(1);
>
>     public void map(FileLineWritable key, Text val, Context context)
>         throws IOException, InterruptedException {
>       StringTokenizer st = new StringTokenizer(val.toString());
>       while (st.hasMoreTokens()) {
>         txt.set(key.fileName + st.nextToken());
>         context.write(txt, count);
>       }
>     }
>   }
>
> Cheers,
> Felix
>
> On Aug 20, 2014, at 8:19 AM, rab ra <rab...@gmail.com> wrote:
>
> Thanks for the response.
>
> Yes, I know WholeFileInputFormat, but I am not sure the filename comes to
> the map process as either key or value; I think this format reads the
> contents of the file. I wish to have an InputFormat that just gives the
> filename, or a list of filenames.
>
> Also, the files are very small. WholeFileInputFormat spawns one map
> process per file and thus results in a huge number of map processes. I
> wish to spawn a single map process per group of files.
>
> I think I need to tweak CombineFileInputFormat's RecordReader so that it
> does not read the entire file but just the filename.
>
> regards
> rab
>
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <shahab.yu...@gmail.com> wrote:
>
>> Have you looked at the WholeFileInputFormat implementations? There are
>> quite a few if you search for them...
>>
>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>>
>> Regards,
>> Shahab
>>
>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <rab...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a use case wherein I need to process a huge set of files stored
>>> in HDFS. The files are non-splittable and need to be processed as a
>>> whole. I have the following questions, whose answers I need before I
>>> can proceed:
>>>
>>> 1. I wish to schedule the map process on the task tracker where the
>>> data is already available. How can I do that? Currently, I have a file
>>> that contains a list of filenames. Each map gets one line of it via
>>> NLineInputFormat. The map process then accesses the file via
>>> FSDataInputStream and works with it. Is there a way to ensure this map
>>> process runs on the node where the file is available?
>>>
>>> 2. The files are not large, and can be called 'small' files by Hadoop
>>> standards. I came across CombineFileInputFormat, which can process more
>>> than one file in a single map process. What I need here is a format
>>> that can process more than one file in a single map but does not have
>>> to read the files, and carries the filenames in either the key or the
>>> value. In the map process, I can then run a loop to process these
>>> files. Any help?
>>>
>>> 3. Any other alternatives?
>>>
>>> regards
>>> rab
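
On the error at the top of the thread: the old org.apache.hadoop.mapred.InputFormat and the new org.apache.hadoop.mapreduce.InputFormat are unrelated types, so a job client built against one family simply rejects a format class built against the other. A stand-alone illustration with stand-in types (these are NOT Hadoop's real classes, just same-shaped placeholders):

```java
// Stand-ins for the two unrelated Hadoop types:
interface OldInputFormat {}       // plays the role of org.apache.hadoop.mapred.InputFormat
abstract class NewInputFormat {}  // plays the role of org.apache.hadoop.mapreduce.InputFormat

// A format written against the new API (hypothetical name).
class MyCombineFormat extends NewInputFormat {}

public class ApiMismatch {
    public static void main(String[] args) {
        Object fmt = new MyCombineFormat();
        // The old-API job client effectively performs this check and fails,
        // which is where "not org.apache.hadoop.mapred.InputFormat" comes from.
        System.out.println(fmt instanceof OldInputFormat); // false
    }
}
```

So the fix is to keep the whole job on one API family: either drive the job with org.apache.hadoop.mapreduce.Job and the mapreduce.lib.input classes, or use the mapred.lib classes throughout.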
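
Felix's FileLineWritable itself is not shown in the thread; a minimal sketch of what such a key class might look like (the field names are assumptions, and the org.apache.hadoop.io.WritableComparable declaration plus compareTo/hashCode are left out here so the example compiles without Hadoop on the classpath):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch only: in a real job this would implement
// org.apache.hadoop.io.WritableComparable<FileLineWritable>.
public class FileLineWritable {
    public String fileName; // which small file this record came from
    public long offset;     // byte offset of the line within that file

    // Same shape as Writable.write(DataOutput).
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fileName);
        out.writeLong(offset);
    }

    // Same shape as Writable.readFields(DataInput).
    public void readFields(DataInput in) throws IOException {
        fileName = in.readUTF();
        offset = in.readLong();
    }

    public static void main(String[] args) throws IOException {
        FileLineWritable k = new FileLineWritable();
        k.fileName = "part-00001";
        k.offset = 128L;

        // Round-trip through the serialized form, as the framework would.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        k.write(new DataOutputStream(bytes));
        FileLineWritable back = new FileLineWritable();
        back.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(back.fileName + ":" + back.offset); // part-00001:128
    }
}
```

With the filename carried in the key like this, the TestMapper above can tag every token with the file it came from.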
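
For rab's question 2, the essential trick is a record reader that never opens the files. A pure-Java sketch of that logic (in a real job it would live in an org.apache.hadoop.mapreduce.RecordReader returned by a CombineFileInputFormat subclass, emitting e.g. Text keys; the class and names here are hypothetical):

```java
import java.util.Iterator;
import java.util.List;

// Sketch: emits one record per file in a combined split, reading zero bytes.
public class FileNameOnlyReader {
    private final Iterator<String> paths; // paths of the files in one combined split
    private String currentKey;

    public FileNameOnlyReader(List<String> pathsInSplit) {
        this.paths = pathsInSplit.iterator();
    }

    // Mirrors RecordReader.nextKeyValue(): advance to the next file, never open it.
    public boolean nextKeyValue() {
        if (!paths.hasNext()) return false;
        currentKey = paths.next();
        return true;
    }

    // Mirrors RecordReader.getCurrentKey(): the filename is the whole record.
    public String getCurrentKey() {
        return currentKey;
    }

    public static void main(String[] args) {
        FileNameOnlyReader r = new FileNameOnlyReader(List.of("/data/a.bin", "/data/b.bin"));
        while (r.nextKeyValue()) {
            System.out.println(r.getCurrentKey());
        }
    }
}
```

This also helps with question 1: because CombineFileInputFormat builds its combined splits with the blocks' host locations attached, the framework can schedule each map near the grouped files, and the map's loop then opens them via FSDataInputStream as before.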