I wrote a post on how to use CombineFileInputFormat:
http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
In the RecordReader constructor, you can find out which file you are reading 
from.
In my example, I created FileLineWritable to include the filename in the mapper 
input key.
Then you can use the input key like this:

  
  public static class TestMapper
      extends Mapper<FileLineWritable, Text, Text, IntWritable> {
    private Text txt = new Text();
    private IntWritable count = new IntWritable(1);

    @Override
    public void map(FileLineWritable key, Text val, Context context)
        throws IOException, InterruptedException {
      StringTokenizer st = new StringTokenizer(val.toString());
      while (st.hasMoreTokens()) {
        // key.fileName tells us which file this line came from
        txt.set(key.fileName + st.nextToken());
        context.write(txt, count);
      }
    }
  }
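
For reference, here is a minimal sketch of what FileLineWritable can look like 
(the field names follow the post linked above; take it as a sketch rather than 
the exact code from the post):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;

  // Key type carrying the source file name plus the line offset, so every
  // mapper input key identifies the file the line came from.
  public class FileLineWritable implements WritableComparable<FileLineWritable> {
    public long offset;      // byte offset of the line within its file
    public String fileName;  // name of the file the line came from

    public void write(DataOutput out) throws IOException {
      out.writeLong(offset);
      Text.writeString(out, fileName);
    }

    public void readFields(DataInput in) throws IOException {
      offset = in.readLong();
      fileName = Text.readString(in);
    }

    public int compareTo(FileLineWritable that) {
      int cmp = this.fileName.compareTo(that.fileName);
      if (cmp != 0) return cmp;
      return this.offset < that.offset ? -1 : (this.offset == that.offset ? 0 : 1);
    }

    // hashCode()/equals() should be kept consistent with compareTo() so the
    // default partitioner groups keys sensibly; omitted here for brevity.
  }

The fileName field is filled in by the custom per-file RecordReader: 
CombineFileRecordReader builds one reader per file in the combined split and 
passes it (CombineFileSplit split, TaskAttemptContext context, Integer index), 
so the reader's constructor can call split.getPath(index).getName() to learn 
which file it is reading.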


Cheers,
Felix


On Aug 20, 2014, at 8:19 AM, rab ra <rab...@gmail.com> wrote:

> Thanks for the response.
> 
> Yes, I know WholeFileInputFormat. But I am not sure the filename comes to the 
> map process, either as the key or as the value. Also, I think this input 
> format reads the contents of the file. I wish to have an input format that 
> just gives the filename, or a list of filenames.
> 
> Also, the files are very small. WholeFileInputFormat spawns one map process 
> per file and thus results in a huge number of map processes. I wish to spawn 
> a single map process per group of files. 
> 
> I think I need to tweak CombineFileInputFormat's RecordReader so that it 
> does not read the entire file but just the filename.
> 
> 
> regards
> rab
> 
> regards
> Bala
> 
> 
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <shahab.yu...@gmail.com> wrote:
> Have you looked at the WholeFileInputFormat implementations? There are quite 
> a few if you search for them...
> 
> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
> 
> Regards,
> Shahab
> 
> 
> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <rab...@gmail.com> wrote:
> Hello,
> 
> I have a use case wherein I need to process a huge set of files stored in HDFS. 
> Those files are non-splittable and they need to be processed as a whole. 
> I have the following questions, for which I need answers to proceed 
> further.
> 
> 1.  I wish to schedule the map process on the task tracker where the data is 
> already available. How can I do it? Currently, I have a file that contains a 
> list of filenames. Each map gets one line of it via NLineInputFormat. The map 
> process then accesses the file via FSDataInputStream and works with it. Is 
> there a way to ensure this map process runs on the node where the file is 
> available? 
> 
> 2.  The files are not large, and they would be called 'small' files by Hadoop 
> standards. I came across CombineFileInputFormat, which can process 
> more than one file in a single map process.  What I need here is a format 
> that can process more than one file in a single map but does not have to 
> read the files, and that provides the filenames either as the key or the 
> value. In the map process, I can then run a loop to process these files. Any help?
> 
> 3. Any other alternatives?
> 
> 
> 
> regards
> rab
> 
> 
> 
