Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder
It's addInputPath, which adds a Path object to the list of inputs. So I would do the filtering first and then add the paths in a loop. But I need a custom InputFormat anyway, because I have my own RecordReader, so either way the same logic ends up somewhere. From my point of view it is better to put the filtering logic in the InputFormat, because my InputFormat is also a RecordReader factory: it will instantiate a different RecordReader based on the filter.

cheers

On 14/04/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
> You don't really need a custom input format, I don't think.
>
> You should be able to just add multiple inputs, one at a time after
> filtering them outside hadoop.
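The "InputFormat as RecordReader factory" idea above can be sketched in plain Java. This is a stand-in, not real Hadoop code: all class names (`ReaderFactory`, `ElementaryReader`, `SourceReader`, `LineReader`) are hypothetical, and a real InputFormat would return RecordReader instances from its getRecordReader method instead.

```java
// Sketch of an InputFormat acting as a RecordReader factory:
// choose a reader type from the file-name prefix the filter matched.
// All names here are hypothetical stand-ins, not Hadoop classes.
public class ReaderFactory {
    interface LineReader { String describe(); }

    static class ElementaryReader implements LineReader {
        public String describe() { return "elementary"; }
    }

    static class SourceReader implements LineReader {
        public String describe() { return "source"; }
    }

    // Map a file name to the reader that should parse it.
    static LineReader readerFor(String fileName) {
        if (fileName.startsWith("Elementary")) return new ElementaryReader();
        if (fileName.startsWith("Source"))     return new SourceReader();
        throw new IllegalArgumentException("no reader for " + fileName);
    }
}
```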
Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder
You don't really need a custom input format, I don't think.

You should be able to just add multiple inputs, one at a time, after filtering them outside Hadoop.

On 4/14/08 10:59 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]> wrote:
> ok thanks for the info :)
Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder
ok thanks for the info :)

On 11/04/2008, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> +1, although FileInputFormat.setInputPathFilter is available only in
> hadoop-0.17 and above... like Amar mentioned previously, you'd have to
> have a custom InputFormat prior to hadoop-0.17.
>
> Arun
Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder
Just call addInputFile multiple times after filtering. (Or is it addInputPath... don't have documentation handy.)

On 4/11/08 6:33 AM, "Alfonso Olias Sanz" <[EMAIL PROTECTED]> wrote:
> Hi
> I have a general purpose input folder that is used as input in a
> Map/Reduce task. That folder contains files grouped by name.
>
> I want to configure the JobConf in a way that I can filter the files
> that have to be processed in that pass (i.e. files whose names start
> with Elementary, Source, etc.), so the task will only process those
> files. If the folder contains 1000 files and only 50 start with
> Elementary, only those 50 will be processed by my task.
>
> I could set up different input folders, each containing the different
> files, but I cannot do that.
>
> Any idea?
>
> thanks
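The suggestion above amounts to listing the folder yourself, filtering, and calling FileInputFormat.addInputPath(JobConf, Path) once per surviving file. A self-contained sketch of that loop, with a List&lt;String&gt; standing in for the job's input-path list (the class and method names here are illustrative, not Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the "filter outside Hadoop, then addInputPath in a loop"
// pattern. In real code each accepted path would be passed to
// FileInputFormat.addInputPath(conf, new Path(path)).
public class AddInputsAfterFiltering {
    static List<String> selectInputs(List<String> folderListing, String prefix) {
        List<String> inputs = new ArrayList<>();
        for (String path : folderListing) {
            // Test only the final path component, not the directory part.
            String name = path.substring(path.lastIndexOf('/') + 1);
            if (name.startsWith(prefix)) {
                inputs.add(path); // real code: addInputPath(conf, ...)
            }
        }
        return inputs;
    }
}
```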
Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder
On Apr 11, 2008, at 10:21 AM, Amar Kamat wrote:
> A simpler way is to use FileInputFormat.setInputPathFilter(JobConf,
> PathFilter). Look at org.apache.hadoop.fs.PathFilter for details on
> the PathFilter interface.

+1, although FileInputFormat.setInputPathFilter is available only in hadoop-0.17 and above... like Amar mentioned previously, you'd have to have a custom InputFormat prior to hadoop-0.17.

Arun
Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder
A simpler way is to use FileInputFormat.setInputPathFilter(JobConf, PathFilter). Look at org.apache.hadoop.fs.PathFilter for details on the PathFilter interface.

Amar

Alfonso Olias Sanz wrote:
> I want to configure the JobConf in a way that I can filter the files
> that have to be processed in that pass (i.e. files whose names start
> with Elementary, Source, etc.), so the task will only process those
> files. [...]
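A PathFilter boils down to a single accept(Path) predicate. The prefix test it would carry can be sketched stand-alone; a real filter implements org.apache.hadoop.fs.PathFilter and is registered via FileInputFormat.setInputPathFilter (hadoop-0.17 and above, per Arun's note), whereas here a plain String stands in for Path:

```java
// Stand-alone sketch of the predicate a PathFilter would implement.
// A real implementation would be:
//   class PrefixPathFilter implements org.apache.hadoop.fs.PathFilter {
//       public boolean accept(Path path) { ... path.getName() ... }
//   }
public class PrefixPathFilter {
    private final String prefix;

    PrefixPathFilter(String prefix) { this.prefix = prefix; }

    // Mirrors PathFilter.accept(Path): test only the final path
    // component against the prefix, ignoring directory components.
    boolean accept(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.startsWith(prefix);
    }
}
```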
Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder
One way to do this is to write your own (file) input format. See src/java/org/apache/hadoop/mapred/FileInputFormat.java. You need to override listPaths() in order to have selectivity amongst the files in the input folder.

Amar

Alfonso Olias Sanz wrote:
> Hi
> I have a general purpose input folder that is used as input in a
> Map/Reduce task. That folder contains files grouped by name.
>
> I want to configure the JobConf in a way that I can filter the files
> that have to be processed in that pass (i.e. files whose names start
> with Elementary, Source, etc.), so the task will only process those
> files. If the folder contains 1000 files and only 50 start with
> Elementary, only those 50 will be processed by my task.
>
> I could set up different input folders, each containing the different
> files, but I cannot do that.
>
> Any idea?
>
> thanks
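The pre-0.17 route above — overriding listPaths() in a FileInputFormat subclass — has this shape: take the full folder listing the superclass would return, keep only the paths that pass the filter. A self-contained sketch, with Strings standing in for Path[] and a hard-coded list standing in for the FileSystem listing (everything except the listPaths name is a hypothetical stand-in):

```java
import java.util.ArrayList;
import java.util.List;

// Shape of a listPaths() override in a custom FileInputFormat:
// filter the folder listing before the job ever sees it.
public class FilteringInputFormat {
    // Stand-in for super.listPaths(job): every file in the input folder.
    static List<String> listAllPaths() {
        return List.of("/in/Elementary_1.dat",
                       "/in/Source_1.dat",
                       "/in/Elementary_2.dat");
    }

    // The override: return only the paths matching the prefix.
    static List<String> listPaths(String prefix) {
        List<String> kept = new ArrayList<>();
        for (String p : listAllPaths()) {
            String name = p.substring(p.lastIndexOf('/') + 1);
            if (name.startsWith(prefix)) kept.add(p);
        }
        return kept;
    }
}
```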