Re: Hadoop InputFormat - Processing large number of small files

2014-09-01 Thread rab ra
Hi, I tried to use your CombineFileInputFormat implementation. However, I get the following exception: ‘not org.apache.hadoop.mapred.InputFormat’. I am using Hadoop 2.4.1 and it looks like it expects the older interface, as it does not accept ‘org.apache.hadoop.mapreduce.lib.input.Co…
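That error typically means the job driver is still configured through the old org.apache.hadoop.mapred API (JobConf/JobClient), which cannot accept an input format written against the new org.apache.hadoop.mapreduce API. A minimal new-API driver sketch, where MyCombineFileInputFormat is a placeholder name for the implementation under discussion:

```java
// Minimal new-API driver sketch. MyCombineFileInputFormat is hypothetical and
// stands in for a class extending
// org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "small-files");
    job.setJarByClass(SmallFilesDriver.class);
    // Configuring through org.apache.hadoop.mapred.JobConf instead of Job is
    // what triggers the "not org.apache.hadoop.mapred.InputFormat" complaint.
    job.setInputFormatClass(MyCombineFileInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```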

Re: Hadoop InputFormat - Processing large number of small files

2014-08-26 Thread rab ra
Hi, Is it not a good idea to model the key as Text type? I have a large number of sequence files that hold a bunch of key-value pairs. I will read these seq files inside the map. Hence my map needs only the filenames. I believe that with CombineFileInputFormat the map will run on nodes where the data is already available.
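A sketch of that map side, assuming the file names reach the mapper one per record as the Text value (as with a line-oriented listing input) and the sequence files hold Text/Text pairs; both assumptions would need adjusting to the real job:

```java
// Mapper sketch: treat each input value as an HDFS path to a SequenceFile
// and read that file's key/value pairs inside the map.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SeqFileMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text fileName, Context context)
      throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    Path path = new Path(fileName.toString());
    // Assumes the sequence files store Text keys and Text values.
    try (SequenceFile.Reader reader =
            new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
      Text key = new Text();
      Text value = new Text();
      while (reader.next(key, value)) {
        context.write(key, value);  // or any per-pair processing
      }
    }
  }
}
```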

RE: Hadoop InputFormat - Processing large number of small files

2014-08-21 Thread java8964
…examples how to do that. Yong

Re: Hadoop InputFormat - Processing large number of small files

2014-08-21 Thread rab ra
Hello, This means that a file with the names of all the files that need to be processed is fed to Hadoop with NLineInputFormat? If this is the case, then how can we ensure that map processes are scheduled on the nodes where the blocks containing the files are already stored? regards rab
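For reference, the NLineInputFormat wiring being discussed is only a few driver lines; the locality concern stands, because splits are computed over the listing file rather than over the files it names. The listing path below is hypothetical:

```java
// Driver sketch for the NLineInputFormat approach: each mapper receives N
// lines of the listing file. Split placement follows the blocks of the
// listing file itself, not the files it names.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class FileListDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "file-list");
    job.setJarByClass(FileListDriver.class);
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.addInputPath(job, new Path("/lists/term_file"));
    NLineInputFormat.setNumLinesPerSplit(job, 1); // one file name per map task
    // ... set mapper, output types, and output path as in any other job ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```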

Re: Hadoop InputFormat - Processing large number of small files

2014-08-21 Thread Felix Chern
If I were you, I’d first generate a file with those file names: hadoop fs -ls > term_file Then run the normal MapReduce job. Felix
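One caveat with piping hadoop fs -ls straight to a file is that the listing includes permission, owner, and size columns, so the path column has to be extracted before a job can consume it. Generating the list through the FileSystem API avoids that; a sketch with hypothetical paths:

```java
// Write one HDFS path per line, suitable as NLineInputFormat input.
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListToFile {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
            fs.create(new Path("/lists/term_file"), true)))) {
      for (FileStatus st : fs.listStatus(new Path("/input/small-files"))) {
        if (st.isFile()) {           // skip subdirectories
          out.write(st.getPath().toString());
          out.newLine();
        }
      }
    }
  }
}
```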

Re: Hadoop InputFormat - Processing large number of small files

2014-08-21 Thread rab ra
Thanks for the link. If it is not required for CombineFileInputFormat to have the contents of the files in the map process but only the filename, what changes need to be done in the code? rab.

Re: Hadoop InputFormat - Processing large number of small files

2014-08-20 Thread Felix Chern
I wrote a post on how to use CombineFileInputFormat: http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/ In the RecordReader constructor, you can get the context of which file you are reading. In my example, I created FileLineWritable to include the…
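Roughly, the hook is that CombineFileRecordReader constructs one per-file reader for each file in a combined split, passing it the CombineFileSplit and the file's index, so the constructor can capture the file's Path. The post pairs this with a FileLineWritable carrying the file name; as a variation answering the question above, a reader can emit just the name and never open the file. A sketch, not the post's code:

```java
// Input format that emits one record per file: (file name, nothing).
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class FileNameInputFormat
    extends CombineFileInputFormat<Text, NullWritable> {
  @Override
  public RecordReader<Text, NullWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException {
    // CombineFileRecordReader instantiates one FileNameRecordReader per file
    // in the combined split, passing (split, context, fileIndex).
    return new CombineFileRecordReader<>(
        (CombineFileSplit) split, context, FileNameRecordReader.class);
  }
}

class FileNameRecordReader extends RecordReader<Text, NullWritable> {
  private final Path path;        // the one file this reader covers
  private boolean emitted = false;

  public FileNameRecordReader(CombineFileSplit split,
                              TaskAttemptContext context, Integer index) {
    this.path = split.getPath(index);  // file identity known up front
  }

  @Override public void initialize(InputSplit s, TaskAttemptContext c) { }

  @Override public boolean nextKeyValue() {  // one record per file: its name
    if (emitted) return false;
    emitted = true;
    return true;
  }

  @Override public Text getCurrentKey() { return new Text(path.toString()); }
  @Override public NullWritable getCurrentValue() { return NullWritable.get(); }
  @Override public float getProgress() { return emitted ? 1.0f : 0.0f; }
  @Override public void close() { }
}
```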

Re: Hadoop InputFormat - Processing large number of small files

2014-08-20 Thread rab ra
Thanks for the response. Yes, I know about WholeFileInputFormat, but I am not sure the filename comes to the map process as either key or value. I think this input format reads the contents of the file, whereas I wish to have an InputFormat that just gives the filename or a list of filenames. Also, the files are very small.

Re: Hadoop InputFormat - Processing large number of small files

2014-08-20 Thread Shahab Yunus
Have you looked at the WholeFileInputFormat implementations? There are quite a few if you search for them... http://hadoop-sandy.blogspot.com/2013/02/wholefileinputformat-in-java-hadoop.html https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java Regards, Shahab
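For reference, a condensed sketch of the pattern those links implement (written here from scratch, not copied from either): the format marks files non-splittable, and its reader emits one record per file with the whole contents as the value.

```java
// Each small file becomes a single (NullWritable, BytesWritable) record.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // each file is one indivisible record
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new RecordReader<NullWritable, BytesWritable>() {
      private FileSplit fileSplit;
      private TaskAttemptContext ctx;
      private boolean processed = false;
      private final BytesWritable value = new BytesWritable();

      @Override
      public void initialize(InputSplit s, TaskAttemptContext c) {
        fileSplit = (FileSplit) s;
        ctx = c;
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        if (processed) return false;
        // Files are small, so reading the whole file into memory is fine.
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(ctx.getConfiguration());
        try (FSDataInputStream in = fs.open(file)) {
          IOUtils.readFully(in, contents, 0, contents.length);
        }
        value.set(contents, 0, contents.length);
        processed = true;
        return true;
      }

      @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
      @Override public BytesWritable getCurrentValue() { return value; }
      @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
      @Override public void close() { }
    };
  }
}
```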

Hadoop InputFormat - Processing large number of small files

2014-08-19 Thread rab ra
Hello, I have a use case wherein I need to process a huge set of files stored in HDFS. Those files are non-splittable and they need to be processed as a whole. Here, I have the following questions, for which I need answers to proceed further. 1. I wish to schedule the map process on the task trackers…