Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-15 Thread Alfonso Olias Sanz
It's addInputPath, which adds a Path object to the list of inputs.
So the idea is to do the filtering first, then add the paths in a loop.
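For reference, a minimal sketch of that loop against the old mapred API
(the /data/input path and the Elementary prefix are made-up examples, and
very old releases list a directory with listPaths() rather than
listStatus()):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class AddFilteredInputs {
      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(AddFilteredInputs.class);
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path("/data/input");  // hypothetical folder
        // Filter outside Hadoop, then add each surviving file as an input.
        for (FileStatus stat : fs.listStatus(inputDir)) {
          if (stat.getPath().getName().startsWith("Elementary")) {
            FileInputFormat.addInputPath(conf, stat.getPath());
          }
        }
        // ... then set mapper/reducer/output path and JobClient.runJob(conf)
      }
    }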

But I need a custom InputFormat anyway, because I have my own RecordReader.
In the end I would just be putting the same logic in a different place.
From my point of view it is better to put the filtering logic there,
because my InputFormat is also a RecordReader factory: it instantiates a
different RecordReader based on the filter.
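To illustrate the factory idea, a rough sketch against the old mapred API
(not the actual code from this thread): PrefixDispatchInputFormat is a
made-up name, and LineRecordReader stands in for the real per-type readers.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class PrefixDispatchInputFormat
        extends FileInputFormat<LongWritable, Text> {

      public RecordReader<LongWritable, Text> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter)
          throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        String name = fileSplit.getPath().getName();
        if (name.startsWith("Elementary")) {
          // Substitute the RecordReader for Elementary files here.
          return new LineRecordReader(job, fileSplit);
        }
        // Substitute the RecordReader for the other file types here.
        return new LineRecordReader(job, fileSplit);
      }
    }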

cheers

On 14/04/2008, Ted Dunning [EMAIL PROTECTED] wrote:

  You don't really need a custom input format, I don't think.

  You should be able to just add multiple inputs, one at a time, after
  filtering them outside Hadoop.


  On 4/14/08 10:59 AM, Alfonso Olias Sanz [EMAIL PROTECTED]
  wrote:


   ok thanks for the info :)
  
   On 11/04/2008, Arun C Murthy [EMAIL PROTECTED] wrote:
  
On Apr 11, 2008, at 10:21 AM, Amar Kamat wrote:
  
  
   A simpler way is to use
   FileInputFormat.setInputPathFilter(JobConf, PathFilter).
   Look at org.apache.hadoop.fs.PathFilter for details on the
   PathFilter interface.
  
  
+1, although FileInputFormat.setInputPathFilter is
   available only in hadoop-0.17 and above... like Amar mentioned previously,
   you'd have to have a custom InputFormat prior to hadoop-0.17.
  
Arun
  
   Amar
   Alfonso Olias Sanz wrote:
  
   Hi
   I have a general-purpose input folder that is used as input to a
   Map/Reduce task. That folder contains files grouped by name.

   I want to configure the JobConf so that I can filter the files to be
   processed in that pass (i.e. files whose names start with Elementary,
   Source, etc.), so the task will only process those files. For example,
   if the folder contains 1000 files and only 50 start with Elementary,
   only those 50 will be processed by my task.

   I could set up separate input folders, each containing one group of
   files, but I cannot do that.
  
  
   Any idea?
  
   thanks
  
[HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Alfonso Olias Sanz
Hi
I have a general-purpose input folder that is used as input to a
Map/Reduce task. That folder contains files grouped by name.

I want to configure the JobConf so that I can filter the files to be
processed in that pass (i.e. files whose names start with Elementary,
Source, etc.), so the task will only process those files. For example,
if the folder contains 1000 files and only 50 start with Elementary,
only those 50 will be processed by my task.

I could set up separate input folders, each containing one group of
files, but I cannot do that.


Any idea?

thanks


Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Amar Kamat
One way to do this is to write your own (file) input format. See
src/java/org/apache/hadoop/mapred/FileInputFormat.java. You need to
override listPaths() in order to select amongst the files in the
input folder.
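
A sketch of what that might look like, assuming the pre-0.17 API where
listPaths() is the hook (later releases replaced it with listStatus());
ElementaryInputFormat, the Elementary prefix, and the LineRecordReader
stand-in are all made up for illustration:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class ElementaryInputFormat
        extends FileInputFormat<LongWritable, Text> {

      // Keep only the files whose names start with "Elementary".
      protected Path[] listPaths(JobConf job) throws IOException {
        Path[] all = super.listPaths(job);
        List<Path> kept = new ArrayList<Path>();
        for (Path p : all) {
          if (p.getName().startsWith("Elementary")) {
            kept.add(p);
          }
        }
        return kept.toArray(new Path[kept.size()]);
      }

      public RecordReader<LongWritable, Text> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter)
          throws IOException {
        // Stand-in reader; substitute your own RecordReader here.
        return new LineRecordReader(job, (FileSplit) split);
      }
    }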

Amar
Alfonso Olias Sanz wrote:

Hi
I have a general-purpose input folder that is used as input to a
Map/Reduce task. That folder contains files grouped by name.

I want to configure the JobConf so that I can filter the files to be
processed in that pass (i.e. files whose names start with Elementary,
Source, etc.), so the task will only process those files. For example,
if the folder contains 1000 files and only 50 start with Elementary,
only those 50 will be processed by my task.

I could set up separate input folders, each containing one group of
files, but I cannot do that.


Any idea?

thanks

Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Amar Kamat
A simpler way is to use FileInputFormat.setInputPathFilter(JobConf,
PathFilter). Look at org.apache.hadoop.fs.PathFilter for details on
the PathFilter interface.
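
A small sketch of that approach; ElementaryFilter and the Elementary
prefix are made up, and note that the released method takes the
PathFilter class rather than an instance:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class ElementaryFilter implements PathFilter {
      // Accept only files whose names start with "Elementary".
      public boolean accept(Path path) {
        return path.getName().startsWith("Elementary");
      }

      public static void main(String[] args) {
        JobConf conf = new JobConf(ElementaryFilter.class);
        // setInputPathFilter takes the PathFilter class, not an instance.
        FileInputFormat.setInputPathFilter(conf, ElementaryFilter.class);
        FileInputFormat.addInputPath(conf, new Path("/data/input")); // hypothetical
      }
    }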

Amar
Alfonso Olias Sanz wrote:

Hi
I have a general-purpose input folder that is used as input to a
Map/Reduce task. That folder contains files grouped by name.

I want to configure the JobConf so that I can filter the files to be
processed in that pass (i.e. files whose names start with Elementary,
Source, etc.), so the task will only process those files. For example,
if the folder contains 1000 files and only 50 start with Elementary,
only those 50 will be processed by my task.

I could set up separate input folders, each containing one group of
files, but I cannot do that.


Any idea?

thanks