Re: multiple file input

Tarandeep Singh Fri, 19 Jun 2009 15:19:41 -0700

On Fri, Jun 19, 2009 at 2:41 PM, pmg <[email protected]> wrote:

>
> For the sake of simplicity, I have reduced my input to two files: FileA and
> FileB.
>
> As I said earlier, I want to compare every record of FileA against every
> record in FileB. I know this is O(n^2), but that is what the process requires.
> I wrote a simple InputFormat and RecordReader, but it seems each file is read
> serially, one after another. How can my record reader hold a reference to
> both files at the same time, so that I can create a cross list of FileA and
> FileB for the mapper?
>
> Basically, the way I see it, the mapper should get one record from FileA and
> all records from FileB, so that the mappers can do the n^2 comparisons and
> forward the results to the reducer.

It will be hard (and inefficient) to do this in the Mapper using a custom
input format. What you can do instead is use the semi-join technique:

Since FileA is smaller, run a MapReduce job over it that outputs key/value
pairs, where the key is the field or set of fields on which you want to do the
comparison and the value is the whole line.

The reducer is simply an identity reducer that writes out the files, so your
FileA ends up partitioned on the field(s). You can also create a Bloom filter
on this field and store it in the Distributed Cache.
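
As a rough illustration of this first job, here is a minimal plain-Python
sketch (not Hadoop code). It assumes the comparison field is the first
tab-separated column; map_line, partition and run_job are hypothetical names
invented for the example, and the CRC32-based partitioner only stands in for
whatever partitioner the real job would use.

```python
import zlib
from collections import defaultdict


def map_line(line, key_columns=(0,), sep="\t"):
    """Map step: emit (key, whole line). Assumes tab-separated fields."""
    fields = line.split(sep)
    key = sep.join(fields[i] for i in key_columns)
    return key, line


def partition(key, num_reducers):
    """Stable hash partitioner: the same key always lands on the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(lines, num_reducers):
    """Simulate map, shuffle/sort and an identity reduce for one input file."""
    partitions = defaultdict(list)
    for line in lines:
        key, value = map_line(line)
        partitions[partition(key, num_reducers)].append((key, value))
    # The identity reducer writes records unchanged; the framework's sort
    # leaves each partition ordered by key.
    return {p: sorted(records) for p, records in partitions.items()}
```

Running the same function over FileB with the same num_reducers puts matching
keys from both files into partitions with the same number.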

Now read FileB, load the Bloom filter into memory, and check whether the field
from each line of FileB is present in the Bloom filter. If it is, emit the
key/value pair; otherwise skip the line.
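
The filtering step could look roughly like the sketch below. It is plain
Python rather than Hadoop code, and the BloomFilter class, its sizes, and the
helpers join_key, build_filter and map_file_b are all hypothetical names made
up for illustration. It again assumes the comparison field is the first
tab-separated column. A Bloom filter can report false positives but never
false negatives, so a few non-matching FileB lines may still get through; the
reducer-side comparison takes care of those.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def join_key(line):
    # Assumption: the comparison field is the first tab-separated column.
    return line.split("\t", 1)[0]


def build_filter(file_a_lines):
    """Build the filter from FileA's keys (done once, then shipped to mappers)."""
    bloom = BloomFilter()
    for line in file_a_lines:
        bloom.add(join_key(line))
    return bloom


def map_file_b(file_b_lines, bloom):
    """Emit (key, line) only for FileB lines whose key may occur in FileA."""
    for line in file_b_lines:
        key = join_key(line)
        if key in bloom:
            yield key, line
```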

At the reducers, you get the contents of FileB partitioned just as the contents
of FileA were partitioned, and at any particular reducer the lines arrive
sorted on the field you want to compare. At this point, read the contents of
FileA that reached this reducer; since they were sorted as well, you can
quickly go over the two lists.
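
Going over the two sorted lists is a standard sort-merge pass. A minimal
plain-Python sketch follows. It assumes both inputs are lists of (key, value)
tuples already sorted by key, and merge_join is a hypothetical name, not a
Hadoop API.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(sorted_a, sorted_b):
    """Yield (key, a_value, b_value) for every pair of records sharing a key.

    Both inputs must be sequences of (key, value) tuples sorted by key.
    """
    groups_a = groupby(sorted_a, key=itemgetter(0))
    groups_b = groupby(sorted_b, key=itemgetter(0))
    key_a, grp_a = next(groups_a, (None, None))
    key_b, grp_b = next(groups_b, (None, None))
    while grp_a is not None and grp_b is not None:
        if key_a < key_b:
            # FileA is behind: skip ahead to its next key.
            key_a, grp_a = next(groups_a, (None, None))
        elif key_a > key_b:
            # FileB is behind: skip ahead to its next key.
            key_b, grp_b = next(groups_b, (None, None))
        else:
            # Same key on both sides: compare every A record with every B record.
            b_values = [value for _, value in grp_b]
            for _, a_value in grp_a:
                for b_value in b_values:
                    yield key_a, a_value, b_value
            key_a, grp_a = next(groups_a, (None, None))
            key_b, grp_b = next(groups_b, (None, None))
```

Each list is read only once, so the cost at a reducer is linear in the input
size plus the number of matching pairs it emits.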

-Tarandeep

>
>
> thanks
>
>
>
> pmg wrote:
> >
> > Thanks owen. Are there any examples that I can look at?
> >
> >
> >
> > owen.omalley wrote:
> >>
> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
> >>
> > >>> Each line from FileA gets compared with every line from FileB1,
> > >>> FileB2, etc. FileB1, FileB2, etc. are in a different input directory.
> >>
> >> In the general case, I'd define an InputFormat that takes two
> >> directories, computes the input splits for each directory and
> >> generates a new list of InputSplits that is the cross-product of the
> >> two lists. So instead of FileSplit, it would use a FileSplitPair that
> >> gives the FileSplit for dir1 and the FileSplit for dir2 and the record
> > >> reader would return a TextPair with left and right records (i.e.
> >> lines). Clearly, you read the first line of split1 and cross it by
> >> each line from split2, then move to the second line of split1 and
> >> process each line from split2, etc.
> >>
> > >> You'll need to ensure that you don't overwhelm the system with
> > >> too many input splits (i.e. maps). Also don't forget that N^2/M grows
> >> much faster with the size of the input (N) than the M machines can
> >> handle in a fixed amount of time.
> >>
> >>> Two input directories
> >>>
> >>> 1. input1 directory with a single file of 600K records - FileA
> > >>> 2. input2 directory segmented into different files with 2 million
> > >>> records - FileB1, FileB2, etc.
> >>
> >> In this particular case, it would be right to load all of FileA into
> >> memory and process the chunks of FileB/part-*. Then it would be much
> >> faster than needing to re-read the file over and over again, but
> >> otherwise it would be the same.
> >>
> >> -- Owen
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/multiple-file-input-tp24095358p24119228.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
