Thanks Tarandeep. Correct me if I am wrong: when I map FileA, the mapper creates key/value pairs and sends them across to the reducer. If so, then how can I compare when FileB is not even mapped yet?
Tarandeep wrote:

> On Fri, Jun 19, 2009 at 2:41 PM, pmg <parmod.me...@gmail.com> wrote:
>
>> For the sake of simplification, I have simplified my input into two files:
>> 1. FileA
>> 2. FileB
>>
>> As I said earlier, I want to compare every record of FileA against every
>> record in FileB. I know this is n^2, but this is the process. I wrote a
>> simple InputFormat and RecordReader. It seems each file is read serially,
>> one after another. How can my record reader have a reference to both
>> files at the same time, so that I can create a cross list of FileA and
>> FileB for the mapper?
>>
>> Basically, the way I see it, the mapper should get one record from FileA
>> and all records from FileB, so that the mapper can do the n^2 comparison
>> and forward the results to the reducer.
>
> It will be hard (and inefficient) to do this in the Mapper using a custom
> input format. What you can do instead is use the semi-join technique:
>
> Since FileA is smaller, run a MapReduce job that outputs key/value pairs
> where the key is the field (or set of fields) on which you want to do the
> comparison and the value is the whole line.
>
> The reducer is simply an identity reducer that writes out the files. So
> your FileA has been partitioned on the field(s). You can also create a
> Bloom filter on this field and store it in the DistributedCache.
>
> Now read FileB, load the Bloom filter into memory, and check whether the
> field from each line of FileB is present in the Bloom filter. If yes,
> emit the key/value pair; otherwise skip it.
>
> At the reducers, you get the contents of FileB partitioned just like the
> contents of FileA were, and at a particular reducer you get the lines
> sorted on the field you want to do the comparison on. At this point you
> read the contents of FileA that reached this reducer, and since its
> contents were sorted as well, you can quickly go over the two lists.
>
> -Tarandeep
>
>> thanks
>>
>> pmg wrote:
>> > Thanks Owen. Are there any examples that I can look at?
>> > owen.omalley wrote:
>> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>> >>
>> >>> Each line from FileA gets compared with every line from FileB1,
>> >>> FileB2, etc. FileB1, FileB2, etc. are in a different input directory.
>> >>
>> >> In the general case, I'd define an InputFormat that takes two
>> >> directories, computes the input splits for each directory, and
>> >> generates a new list of InputSplits that is the cross product of the
>> >> two lists. So instead of FileSplit, it would use a FileSplitPair that
>> >> gives the FileSplit for dir1 and the FileSplit for dir2, and the
>> >> record reader would return a TextPair with left and right records
>> >> (i.e., lines). You read the first line of split1 and cross it with
>> >> each line from split2, then move to the second line of split1 and
>> >> process each line from split2, and so on.
>> >>
>> >> You'll need to ensure that you don't overwhelm the system with too
>> >> many input splits (i.e., maps). Also don't forget that N^2/M grows
>> >> much faster with the size of the input (N) than the M machines can
>> >> handle in a fixed amount of time.
>> >>
>> >>> Two input directories:
>> >>>
>> >>> 1. input1 directory with a single file of 600K records - FileA
>> >>> 2. input2 directory segmented into different files with 2 million
>> >>>    records - FileB1, FileB2, etc.
>> >>
>> >> In this particular case, it would be right to load all of FileA into
>> >> memory and process the chunks of FileB/part-*. That would be much
>> >> faster than needing to re-read the file over and over again, but
>> >> otherwise it would be the same.
>> >>
>> >> -- Owen
>>
>> --
>> View this message in context:
>> http://www.nabble.com/multiple-file-input-tp24095358p24119228.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
--
View this message in context: http://www.nabble.com/multiple-file-input-tp24095358p24119864.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
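Tarandeep's semi-join steps can be sketched in plain Java with no Hadoop APIs. Everything here is invented for illustration — the toy Bloom filter, the bit/hash sizes, and the assumption that the join key is the first tab-separated column; a real job would use Hadoop's own org.apache.hadoop.util.bloom.BloomFilter and push it through the DistributedCache.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy Bloom-filter semi-join sketch (not actual Hadoop code).
public class BloomSemiJoin {

    static final int BITS = 1 << 16;   // filter size in bits (assumption)
    static final int HASHES = 3;       // number of hash functions (assumption)

    // Set the k bit positions derived from the key.
    static void add(BitSet filter, String key) {
        for (int i = 0; i < HASHES; i++) {
            filter.set(Math.floorMod(key.hashCode() * 31 + i * 0x9e3779b9, BITS));
        }
    }

    // True means the key *might* be present; false means definitely not.
    static boolean mightContain(BitSet filter, String key) {
        for (int i = 0; i < HASHES; i++) {
            if (!filter.get(Math.floorMod(key.hashCode() * 31 + i * 0x9e3779b9, BITS)))
                return false;
        }
        return true;
    }

    // Hypothetical record layout: tab-separated, join field in column 1.
    static String joinKey(String line) {
        return line.split("\t", 2)[0];
    }

    public static void main(String[] args) {
        // Stand-ins for FileA and FileB records.
        List<String> fileA = List.of("k1\talpha", "k2\tbeta");
        List<String> fileB = List.of("k1\tone", "k3\tthree", "k2\ttwo");

        // Job 1 (the FileA pass): build the filter on the join field.
        BitSet filter = new BitSet(BITS);
        for (String line : fileA) add(filter, joinKey(line));

        // Job 2 (the FileB pass): emit only lines whose key might match;
        // non-matching lines (k3 here, barring a false positive) are skipped
        // before the shuffle, which is the whole point of the semi-join.
        List<String> emitted = new ArrayList<>();
        for (String line : fileB) {
            if (mightContain(filter, joinKey(line))) emitted.add(line);
        }
        System.out.println(emitted);
    }
}
```

Note the one-sided guarantee: a Bloom filter can return false positives but never false negatives, so every FileB line that actually joins with FileA is guaranteed to reach the reducers; a few non-joining lines may slip through and are discarded during the final merge.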
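Owen's suggestion for this particular data shape — load the small FileA into memory, then stream the FileB chunks past it — can likewise be sketched standalone. In a real job the outer loop would be the Mapper's map() calls with fileA loaded once up front (configure() in the old API, setup() in the new one); the matches() rule below is a hypothetical stand-in for whatever record comparison the job actually needs.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the in-memory cross comparison (not actual Hadoop code).
public class CrossCompare {

    // Hypothetical rule: records match when their first tab-separated
    // field is equal. Substitute the real comparison here.
    static boolean matches(String a, String b) {
        return a.split("\t", 2)[0].equals(b.split("\t", 2)[0]);
    }

    // Cross every streamed FileB line with the in-memory FileA list:
    // one pass over FileB, a full scan of FileA per line (the n^2 part).
    static List<String> cross(List<String> fileA, Iterable<String> fileBLines) {
        List<String> out = new ArrayList<>();
        for (String b : fileBLines) {
            for (String a : fileA) {
                if (matches(a, b)) out.add(a + " | " + b);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> fileA = List.of("k1\talpha", "k2\tbeta");
        List<String> fileB = List.of("k2\ttwo", "k3\tthree");
        System.out.println(cross(fileA, fileB)); // one matching pair, on k2
    }
}
```

This avoids re-reading FileA from disk for every FileB split, but only works because FileA (600K records) fits in a mapper's heap; the cross-product InputFormat Owen describes is the general fallback when neither side fits in memory.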