Thanks Tarandeep. Correct me if I am wrong: when I map FileA, the mapper creates key/value pairs and sends them across to the reducer. If so, then how can I compare when FileB is not even mapped yet?
Tarandeep wrote:

> On Fri, Jun 19, 2009 at 2:41 PM, pmg <parmod.me...@gmail.com> wrote:
>
>> For the sake of simplification, I have simplified my input into two files:
>> 1. FileA
>> 2. FileB
>>
>> As I said earlier, I want to compare every record of FileA against every
>> record in FileB. I know this is n^2, but this is the process. I wrote a
>> simple InputFormat and RecordReader. It seems each file is read serially,
>> one after another. How can my record reader have a reference to both
>> files at the same time, so that I can create a cross list of FileA and
>> FileB for the mapper?
>>
>> Basically, the way I see it, the mapper should get one record from FileA
>> and all records from FileB, so that the mapper can do the n^2 comparison
>> and forward the results to the reducer.
>
> It will be hard (and inefficient) to do this in the Mapper using a custom
> input format. What you can do instead is use the semi-join technique:
>
> Since FileA is smaller, run a MapReduce job that outputs key/value pairs
> where the key is the field (or set of fields) on which you want to do the
> comparison and the value is the whole line.
>
> The reducer is simply an identity reducer that writes out the files. So
> your FileA has been partitioned on the field(s). You can also create a
> Bloom filter on this field and store it in the DistributedCache.
>
> Now read FileB, load the Bloom filter into memory, and check whether the
> field from each line of FileB is present in the Bloom filter. If yes,
> emit the key/value pair; otherwise skip it.
>
> At the reducers, you get the contents of FileB partitioned just like the
> contents of FileA were, and at a particular reducer you get the lines
> sorted on the field you want to do the comparison on. At this point you
> read the contents of FileA that reached this reducer, and since its
> contents were sorted as well, you can quickly go over the two lists.
>
> -Tarandeep
>
>> thanks
>>
>> pmg wrote:
>> > Thanks Owen. Are there any examples that I can look at?
>> > owen.omalley wrote:
>> >> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>> >>
>> >>> Each line from FileA gets compared with every line from FileB1,
>> >>> FileB2, etc. FileB1, FileB2, etc. are in a different input directory.
>> >>
>> >> In the general case, I'd define an InputFormat that takes two
>> >> directories, computes the input splits for each directory, and
>> >> generates a new list of InputSplits that is the cross product of the
>> >> two lists. So instead of FileSplit, it would use a FileSplitPair that
>> >> gives the FileSplit for dir1 and the FileSplit for dir2, and the
>> >> record reader would return a TextPair with left and right records
>> >> (i.e., lines). You read the first line of split1 and cross it with
>> >> each line from split2, then move to the second line of split1 and
>> >> process each line from split2, and so on.
>> >>
>> >> You'll need to ensure that you don't overwhelm the system with too
>> >> many input splits (i.e., maps). Also don't forget that N^2/M grows
>> >> much faster with the size of the input (N) than the M machines can
>> >> handle in a fixed amount of time.
>> >>
>> >>> Two input directories:
>> >>>
>> >>> 1. input1 directory with a single file of 600K records - FileA
>> >>> 2. input2 directory segmented into different files with 2 million
>> >>>    records - FileB1, FileB2, etc.
>> >>
>> >> In this particular case, it would be right to load all of FileA into
>> >> memory and process the chunks of FileB/part-*. That would be much
>> >> faster than needing to re-read the file over and over again, but
>> >> otherwise it would be the same.
>> >>
>> >> -- Owen
>>
>> --
>> View this message in context:
>> http://www.nabble.com/multiple-file-input-tp24095358p24119228.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
--
View this message in context: http://www.nabble.com/multiple-file-input-tp24095358p24119864.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
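Tarandeep's semi-join steps can be sketched in plain Java with no Hadoop APIs. Everything here is invented for illustration — the toy Bloom filter, the bit/hash sizes, and the assumption that the join key is the first tab-separated column; a real job would use Hadoop's own org.apache.hadoop.util.bloom.BloomFilter and push it through the DistributedCache.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy Bloom-filter semi-join sketch (not actual Hadoop code).
public class BloomSemiJoin {

    static final int BITS = 1 << 16;   // filter size in bits (assumption)
    static final int HASHES = 3;       // number of hash functions (assumption)

    // Set the k bit positions derived from the key.
    static void add(BitSet filter, String key) {
        for (int i = 0; i < HASHES; i++) {
            filter.set(Math.floorMod(key.hashCode() * 31 + i * 0x9e3779b9, BITS));
        }
    }

    // True means the key *might* be present; false means definitely not.
    static boolean mightContain(BitSet filter, String key) {
        for (int i = 0; i < HASHES; i++) {
            if (!filter.get(Math.floorMod(key.hashCode() * 31 + i * 0x9e3779b9, BITS)))
                return false;
        }
        return true;
    }

    // Hypothetical record layout: tab-separated, join field in column 1.
    static String joinKey(String line) {
        return line.split("\t", 2)[0];
    }

    public static void main(String[] args) {
        // Stand-ins for FileA and FileB records.
        List<String> fileA = List.of("k1\talpha", "k2\tbeta");
        List<String> fileB = List.of("k1\tone", "k3\tthree", "k2\ttwo");

        // Job 1 (the FileA pass): build the filter on the join field.
        BitSet filter = new BitSet(BITS);
        for (String line : fileA) add(filter, joinKey(line));

        // Job 2 (the FileB pass): emit only lines whose key might match;
        // non-matching lines (k3 here, barring a false positive) are skipped
        // before the shuffle, which is the whole point of the semi-join.
        List<String> emitted = new ArrayList<>();
        for (String line : fileB) {
            if (mightContain(filter, joinKey(line))) emitted.add(line);
        }
        System.out.println(emitted);
    }
}
```

Note the one-sided guarantee: a Bloom filter can return false positives but never false negatives, so every FileB line that actually joins with FileA is guaranteed to reach the reducers; a few non-joining lines may slip through and are discarded during the final merge.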
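Owen's suggestion for this particular data shape — load the small FileA into memory, then stream the FileB chunks past it — can likewise be sketched standalone. In a real job the outer loop would be the Mapper's map() calls with fileA loaded once up front (configure() in the old API, setup() in the new one); the matches() rule below is a hypothetical stand-in for whatever record comparison the job actually needs.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the in-memory cross comparison (not actual Hadoop code).
public class CrossCompare {

    // Hypothetical rule: records match when their first tab-separated
    // field is equal. Substitute the real comparison here.
    static boolean matches(String a, String b) {
        return a.split("\t", 2)[0].equals(b.split("\t", 2)[0]);
    }

    // Cross every streamed FileB line with the in-memory FileA list:
    // one pass over FileB, a full scan of FileA per line (the n^2 part).
    static List<String> cross(List<String> fileA, Iterable<String> fileBLines) {
        List<String> out = new ArrayList<>();
        for (String b : fileBLines) {
            for (String a : fileA) {
                if (matches(a, b)) out.add(a + " | " + b);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> fileA = List.of("k1\talpha", "k2\tbeta");
        List<String> fileB = List.of("k2\ttwo", "k3\tthree");
        System.out.println(cross(fileA, fileB)); // one matching pair, on k2
    }
}
```

This avoids re-reading FileA from disk for every FileB split, but only works because FileA (600K records) fits in a mapper's heap; the cross-product InputFormat Owen describes is the general fallback when neither side fits in memory.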