Hi,

Is it not a good idea to model the key as Text type?

I have a large number of sequence files, each holding a bunch of key-value
pairs. I will read these sequence files inside the map, so the map needs
only the filenames. I believe that with CombineFileInputFormat the map will
run on the nodes where the data is already available, and hence my explicit
HDFS read will be faster. I do not want the contents in the map, as not all
key-value pairs are needed.

Regards
Rab
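P.S. To make the idea concrete, here is a rough, untested sketch of the kind
of input format I mean. The class names (FilenamesOnlyInputFormat,
FilenameRecordReader) are placeholders of mine; the record reader just emits
each file's path as a Text key and never opens the file.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class FilenamesOnlyInputFormat extends CombineFileInputFormat<Text, NullWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false; // keep each file whole inside the combined split
  }

  @Override
  public RecordReader<Text, NullWritable> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException {
    return new FilenameRecordReader();
  }

  // Emits one record per file in the combined split: key = file path,
  // value = nothing. The file itself is never opened.
  public static class FilenameRecordReader extends RecordReader<Text, NullWritable> {
    private CombineFileSplit split;
    private int index = -1;
    private final Text currentKey = new Text();

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context) {
      this.split = (CombineFileSplit) genericSplit;
    }

    @Override
    public boolean nextKeyValue() {
      if (++index >= split.getNumPaths()) {
        return false;
      }
      currentKey.set(split.getPath(index).toString());
      return true;
    }

    @Override
    public Text getCurrentKey() { return currentKey; }

    @Override
    public NullWritable getCurrentValue() { return NullWritable.get(); }

    @Override
    public float getProgress() {
      return split.getNumPaths() == 0 ? 1.0f : (float) (index + 1) / split.getNumPaths();
    }

    @Override
    public void close() { }
  }
}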
On 20 Aug 2014 22:59, "Felix Chern" <idry...@gmail.com> wrote:

> I wrote a post on how to use CombineFileInputFormat:
>
> http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
>
> In the RecordReader constructor, you can get from the context which file
> you are reading. In my example, I created FileLineWritable to include the
> filename in the mapper input key. Then you can use the input key like this:
>
> public static class TestMapper extends Mapper<FileLineWritable, Text, Text, IntWritable> {
>   private Text txt = new Text();
>   private IntWritable count = new IntWritable(1);
>
>   public void map(FileLineWritable key, Text val, Context context)
>       throws IOException, InterruptedException {
>     StringTokenizer st = new StringTokenizer(val.toString());
>     while (st.hasMoreTokens()) {
>       txt.set(key.fileName + st.nextToken());
>       context.write(txt, count);
>     }
>   }
> }
>
> Cheers,
> Felix
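Wiring such a format into a job driver might then look roughly like this.
Again, only a sketch: FilenamesOnlyInputFormat is the placeholder class from
the P.S. above, SeqFileMapper is sketched at the end of this mail, and the
128 MB cap on the combined split size is an arbitrary choice.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "process small files by name");
    job.setJarByClass(SmallFilesDriver.class);

    // Group many small files into few splits; file contents are not read here.
    job.setInputFormatClass(FilenamesOnlyInputFormat.class);
    // Cap the combined split size so each map gets a group of files
    // rather than the whole input.
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    job.setMapperClass(SeqFileMapper.class); // placeholder mapper, see below
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}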
> On Aug 20, 2014, at 8:19 AM, rab ra <rab...@gmail.com> wrote:
>
> Thanks for the response.
>
> Yes, I know about WholeFileInputFormat, but I am not sure the filename
> comes to the map process either as key or as value. Also, as I understand
> it, this input format reads the contents of the file. I wish to have an
> input format that just gives the filename, or a list of filenames.
>
> Also, the files are very small. WholeFileInputFormat spawns one map
> process per file and thus results in a huge number of map processes. I
> wish to spawn a single map process per group of files.
>
> I think I need to tweak CombineFileInputFormat's record reader so that it
> does not read the entire file but gives just the filename.
>
> regards
> rab
>
> On Wed, Aug 20, 2014 at 6:48 PM, Shahab Yunus <shahab.yu...@gmail.com> wrote:
>
>> Have you looked at the WholeFileInputFormat implementations? There are
>> quite a few if you search for them...
>>
>> http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html
>> https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java
>>
>> Regards,
>> Shahab
>>
>> On Wed, Aug 20, 2014 at 1:46 AM, rab ra <rab...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a use case wherein I need to process a huge set of files stored
>>> in HDFS. Those files are non-splittable and they need to be processed
>>> as a whole. I have the following questions, for which I need answers
>>> before I can proceed:
>>>
>>> 1. I wish to schedule the map process on a task tracker where the data
>>> is already available. How can I do that? Currently, I have a file that
>>> contains a list of filenames. Each map gets one line of it via
>>> NLineInputFormat. The map process then accesses the file via
>>> FSDataInputStream and works with it. Is there a way to ensure this map
>>> process runs on the node where the file is available?
>>>
>>> 2. The files are not large and would be called 'small' files by Hadoop
>>> standards. I came across CombineFileInputFormat, which can process more
>>> than one file in a single map process. What I need here is a format that
>>> can process more than one file in a single map but does not have to read
>>> the files; it should carry the filenames either in the key or in the
>>> value. In the map process, I can then run a loop over these files.
>>> Any help?
>>>
>>> 3. Any other alternatives?
>>>
>>> regards
>>> rab
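And inside the map, the loop I mentioned could open each sequence file by
name and pick out only the pairs that are needed. One more untested sketch:
it assumes the sequence files hold Text keys and Text values, and the filter
condition is just a placeholder.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SeqFileMapper extends Mapper<Text, NullWritable, Text, Text> {

  @Override
  protected void map(Text fileName, NullWritable ignored, Context context)
      throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    Path path = new Path(fileName.toString());
    FileSystem fs = path.getFileSystem(conf);

    SequenceFile.Reader reader = null;
    try {
      // Open the sequence file named by the input key; with a combined
      // split the framework should have scheduled this map near its blocks.
      reader = new SequenceFile.Reader(fs, path, conf);
      Text key = new Text();   // assumes Text/Text records in the files
      Text value = new Text();
      while (reader.next(key, value)) {
        // Placeholder filter: emit only the pairs that are needed.
        if (key.getLength() > 0) {
          context.write(key, value);
        }
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}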