Hi,

Is it not a good idea to model the key as Text type?

I have a large number of sequence files, each holding a bunch of key-value
pairs. I will read these sequence files inside the map, so the map needs
only the filenames. I believe that with CombineFileInputFormat the map will
run on the nodes where the data is already available, and hence my explicit
HDFS read will be faster. I do not want the contents in the map, as not all
key-value pairs are needed.

Regards
Rab
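P.S. To make the idea concrete, here is a rough, untested sketch of the kind
of input format I mean. The class names (FilenamesOnlyInputFormat,
FilenameRecordReader) are placeholders of mine; the record reader just emits
each file's path as a Text key and never opens the file.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class FilenamesOnlyInputFormat extends CombineFileInputFormat<Text, NullWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // keep each file whole inside the combined split
  }

  @Override
  public RecordReader<Text, NullWritable> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException {
    return new FilenameRecordReader();
  }

  // Emits one record per file in the combined split: key = file path,
  // value = nothing. The file itself is never opened.
  public static class FilenameRecordReader extends RecordReader<Text, NullWritable> {
    private CombineFileSplit split;
    private int index = -1;
    private final Text currentKey = new Text();

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context) {
      this.split = (CombineFileSplit) genericSplit;
    }

    @Override
    public boolean nextKeyValue() {
      if (++index >= split.getNumPaths()) {
        return false;
      }
      currentKey.set(split.getPath(index).toString());
      return true;
    }

    @Override
    public Text getCurrentKey() { return currentKey; }

    @Override
    public NullWritable getCurrentValue() { return NullWritable.get(); }

    @Override
    public float getProgress() {
      return split.getNumPaths() == 0 ? 1.0f : (float) (index + 1) / split.getNumPaths();
    }

    @Override
    public void close() { }
  }
}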
On 20 Aug 2014 22:59, "Felix Chern" <idry...@gmail.com> wrote:

> I wrote a post on how to use CombineFileInputFormat:
>
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
>
> In the RecordReader constructor, you can get from the context which file
> you are reading. In my example, I created FileLineWritable to include the
> filename in the mapper input key. Then you can use the input key like this:
>
> public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable> {
>   private Text txt = new Text();
>   private IntWritable count = new IntWritable(1);
>
>   public void map(FileLineWritable key, Text val, Context context)
>       throws IOException, InterruptedException {
>     StringTokenizer st = new StringTokenizer(val.toString());
>     while (st.hasMoreTokens()) {
>       txt.set(key.fileName + st.nextToken());
>       context.write(txt, count);
>     }
>   }
> }
>
> Cheers,
> Felix
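Wiring such a format into a job driver might then look roughly like this.
Again, only a sketch: FilenamesOnlyInputFormat is the placeholder class from
the P.S. above, SeqFileMapper is sketched at the end of this mail, and the
128 MB cap on the combined split size is an arbitrary choice.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "process small files by name");
    job.setJarByClass(SmallFilesDriver.class);

    // Group many small files into few splits; file contents are not read here.
    job.setInputFormatClass(FilenamesOnlyInputFormat.class);
    // Cap the combined split size so each map gets a group of files
    // rather than the whole input.
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    job.setMapperClass(SeqFileMapper.class); // placeholder mapper, see below
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}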
> On Aug 20, 2014, at 8:19 AM, rab ra <rab...@gmail.com> wrote:
>
> Thanks for the response.
>
> Yes, I know about WholeFileInputFormat, but I am not sure the filename
> comes to the map process either as key or as value. Also, as I understand
> it, this input format reads the contents of the file. I wish to have an
> input format that just gives the filename, or a list of filenames.
>
> Also, the files are very small. WholeFileInputFormat spawns one map
> process per file and thus results in a huge number of map processes. I
> wish to spawn a single map process per group of files.
>
> I think I need to tweak CombineFileInputFormat's record reader so that it
> does not read the entire file but gives just the filename.
>
> regards
> rab
>
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <shahab.yu...@gmail.com> wrote:
>
>> Have you looked at the WholeFileInputFormat implementations? There are
>> quite a few if you search for them...
>>
>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>>
>> Regards,
>> Shahab
>>
>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <rab...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a use case wherein I need to process a huge set of files stored
>>> in HDFS. Those files are non-splittable and they need to be processed
>>> as a whole. I have the following questions, for which I need answers
>>> before I can proceed:
>>>
>>> 1. I wish to schedule the map process on a task tracker where the data
>>> is already available. How can I do that? Currently, I have a file that
>>> contains a list of filenames. Each map gets one line of it via
>>> NLineInputFormat. The map process then accesses the file via
>>> FSDataInputStream and works with it. Is there a way to ensure this map
>>> process runs on the node where the file is available?
>>>
>>> 2. The files are not large and would be called 'small' files by Hadoop
>>> standards. I came across CombineFileInputFormat, which can process more
>>> than one file in a single map process. What I need here is a format that
>>> can process more than one file in a single map but does not have to read
>>> the files; it should carry the filenames either in the key or in the
>>> value. In the map process, I can then run a loop over these files.
>>> Any help?
>>>
>>> 3. Any other alternatives?
>>>
>>> regards
>>> rab
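And inside the map, the loop I mentioned could open each sequence file by
name and pick out only the pairs that are needed. One more untested sketch:
it assumes the sequence files hold Text keys and Text values, and the filter
condition is just a placeholder.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SeqFileMapper extends Mapper<Text, NullWritable, Text, Text> {

  @Override
  protected void map(Text fileName, NullWritable ignored, Context context)
      throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    Path path = new Path(fileName.toString());
    FileSystem fs = path.getFileSystem(conf);

    SequenceFile.Reader reader = null;
    try {
      // Open the sequence file named by the input key; with a combined
      // split the framework should have scheduled this map near its blocks.
      reader = new SequenceFile.Reader(fs, path, conf);
      Text key = new Text();   // assumes Text/Text records in the files
      Text value = new Text();
      while (reader.next(key, value)) {
        // Placeholder filter: emit only the pairs that are needed.
        if (key.getLength() > 0) {
          context.write(key, value);
        }
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}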