Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-15 Thread Alfonso Olias Sanz
It's addInputPath; it adds a Path object to the list of inputs.
So I do the filtering first and then add the paths in a loop.
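
Something like this sketch, untested, assuming your release has
FileSystem#listStatus (older ones listed paths directly) and that conf
is the job's JobConf; the input directory and the Elementary prefix are
just the examples from this thread:

    // List the folder, filter by file name outside Hadoop, then add
    // each surviving file as its own input path.
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus f : fs.listStatus(new Path("/data/input"))) {
        if (f.getPath().getName().startsWith("Elementary")) {
            // In older releases this was conf.addInputPath(path).
            FileInputFormat.addInputPath(conf, f.getPath());
        }
    }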

But I need a custom InputFormat anyway, because I have my own
RecordReader. In the end I would just be putting the same logic in a
different place. From my point of view it is better to put the
filtering logic there, because my InputFormat is also a RecordReader
factory: it will instantiate a different RecordReader based on the
filter.
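
A rough, untested sketch of that idea against the old mapred API;
ElementaryRecordReader and SourceRecordReader are hypothetical names
for my own readers:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // An InputFormat that doubles as a RecordReader factory: it picks
    // the reader implementation from the name of the file being read.
    public class DispatchingInputFormat
            extends FileInputFormat<LongWritable, Text> {

        protected boolean isSplitable(FileSystem fs, Path file) {
            return false; // assumption: one reader per whole file
        }

        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter)
                throws IOException {
            FileSplit fileSplit = (FileSplit) split;
            if (fileSplit.getPath().getName().startsWith("Elementary")) {
                return new ElementaryRecordReader(job, fileSplit); // hypothetical
            }
            return new SourceRecordReader(job, fileSplit);         // hypothetical
        }
    }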

cheers

On 14/04/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>  You don't really need a custom input format, I don't think.
>
>  You should be able to just add multiple inputs, one at a time after
>  filtering them outside hadoop.


Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-14 Thread Ted Dunning

You don't really need a custom input format, I don't think.

You should be able to just add multiple inputs, one at a time after
filtering them outside hadoop.


On 4/14/08 10:59 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
wrote:

> ok thanks for the info :)



Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-14 Thread Alfonso Olias Sanz
ok thanks for the info :)



Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Ted Dunning

Just call addInputFile multiple times after filtering.  (or is it
addInputPath... Don't have documentation handy)


On 4/11/08 6:33 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]>
wrote:

> Hi
> I have a general-purpose input folder that is used as input for a
> Map/Reduce task. That folder contains files grouped by name.
> 
> I want to configure the JobConf so that I can filter the files to be
> processed in that pass (i.e. files whose names start with Elementary,
> Source, etc.), so the task will only process those files. If the
> folder contains 1000 files and only 50 start with Elementary, only
> those 50 will be processed by my task.
> 
> I could set up different input folders, each containing one group of
> files, but I cannot do that.
> 
> Any idea?
> 
> thanks



Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Arun C Murthy


On Apr 11, 2008, at 10:21 AM, Amar Kamat wrote:

A simpler way is to use FileInputFormat.setInputPathFilter(JobConf,
PathFilter). Look at org.apache.hadoop.fs.PathFilter for details on
the PathFilter interface.


+1, although FileInputFormat.setInputPathFilter is available only in  
hadoop-0.17 and above... like Amar mentioned previously, you'd have  
to have a custom InputFormat prior to hadoop-0.17.


Arun


Amar


Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Amar Kamat
A simpler way is to use FileInputFormat.setInputPathFilter(JobConf,
PathFilter). Look at org.apache.hadoop.fs.PathFilter for details on
the PathFilter interface.
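
For example (untested; if I read the 0.17 API right, the filter is
registered by class, so it needs a no-arg constructor):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    // Accepts only files whose names start with "Elementary"
    // (the prefix from the original question).
    public class ElementaryFilter implements PathFilter {
        public boolean accept(Path path) {
            return path.getName().startsWith("Elementary");
        }
    }

    // Job setup:
    //   FileInputFormat.setInputPathFilter(conf, ElementaryFilter.class);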

Amar
Alfonso Olias Sanz wrote:

Hi
I have a general-purpose input folder that is used as input for a
Map/Reduce task. That folder contains files grouped by name.

I want to configure the JobConf so that I can filter the files to be
processed in that pass (i.e. files whose names start with Elementary,
Source, etc.), so the task will only process those files. If the
folder contains 1000 files and only 50 start with Elementary, only
those 50 will be processed by my task.

I could set up different input folders, each containing one group of
files, but I cannot do that.

Any idea?

thanks




Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-11 Thread Amar Kamat
One way to do this is to write your own (file) input format. See
src/java/org/apache/hadoop/mapred/FileInputFormat.java. You need to
override listPaths() in order to be selective about which files in the
input folder get processed.
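
An untested sketch, assuming the listPaths(JobConf) signature of the
pre-0.17 API; ElementaryInputFormat is just an illustrative name:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Keeps only the "Elementary" files out of whatever the input
    // folder contains; everything else is dropped before splitting.
    public class ElementaryInputFormat extends TextInputFormat {

        protected Path[] listPaths(JobConf job) throws IOException {
            List<Path> kept = new ArrayList<Path>();
            for (Path p : super.listPaths(job)) { // every file in the input folder(s)
                if (p.getName().startsWith("Elementary")) {
                    kept.add(p);
                }
            }
            return kept.toArray(new Path[kept.size()]);
        }
    }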

Amar
Alfonso Olias Sanz wrote:

Hi
I have a general-purpose input folder that is used as input for a
Map/Reduce task. That folder contains files grouped by name.

I want to configure the JobConf so that I can filter the files to be
processed in that pass (i.e. files whose names start with Elementary,
Source, etc.), so the task will only process those files. If the
folder contains 1000 files and only 50 start with Elementary, only
those 50 will be processed by my task.

I could set up different input folders, each containing one group of
files, but I cannot do that.

Any idea?

thanks