To simplify things, I have reduced my input to two files: FileA and FileB.
As I said earlier, I want to compare every record of FileA against every record in FileB. I know this is n^2, but that is the process. I wrote a simple InputFormat and RecordReader, but it seems each file is read serially, one after the other. How can my RecordReader hold references to both files at the same time, so that I can build the cross product of FileA and FileB for the mapper? Basically, the way I see it, the mapper should receive one record from FileA together with all records from FileB, so that it can do the n^2 comparison and forward the results to the reducer.

thanks

pmg wrote:
> Thanks owen. Are there any examples that I can look at?
>
> owen.omalley wrote:
>> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>>
>>> Each line from FileA gets compared with every line from FileB1,
>>> FileB2, etc. FileB1, FileB2, etc. are in a different input directory.
>>
>> In the general case, I'd define an InputFormat that takes two
>> directories, computes the input splits for each directory, and
>> generates a new list of InputSplits that is the cross product of the
>> two lists. So instead of FileSplit, it would use a FileSplitPair that
>> gives the FileSplit for dir1 and the FileSplit for dir2, and the
>> record reader would return a TextPair with left and right records
>> (i.e. lines). Concretely, you read the first line of split1 and cross
>> it with each line from split2, then move to the second line of split1
>> and process each line from split2, and so on.
>>
>> You'll need to ensure that you don't overwhelm the system with too
>> many input splits (i.e. maps). Also don't forget that N^2/M grows
>> much faster with the size of the input (N) than the M machines can
>> handle in a fixed amount of time.
>>
>>> Two input directories:
>>>
>>> 1. input1 directory with a single file of 600K records - FileA
>>> 2. input2 directory segmented into different files with 2 million
>>>    records - FileB1, FileB2, etc.
>>
>> In this particular case, it would be right to load all of FileA into
>> memory and process the chunks of FileB/part-*. That would be much
>> faster than needing to re-read the file over and over again, but
>> otherwise it would be the same.
>>
>> -- Owen

--
View this message in context: http://www.nabble.com/multiple-file-input-tp24095358p24119228.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
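Owen's general-case suggestion above (pair every split of dir1 with every split of dir2) can be sketched in plain Java. This is only an illustration of the cross-product step, not Hadoop API: `SplitPair` stands in for the suggested `FileSplitPair`, and plain `String`s stand in for real `FileSplit` objects.

```java
import java.util.ArrayList;
import java.util.List;

public class CrossSplits {
    // Hypothetical stand-in for the FileSplitPair Owen describes:
    // one split from each input directory.
    static final class SplitPair {
        final String left;   // split from dir1 (FileA)
        final String right;  // split from dir2 (FileB1, FileB2, ...)
        SplitPair(String left, String right) {
            this.left = left;
            this.right = right;
        }
    }

    // Build the cross product of the two directories' split lists.
    // In a real InputFormat, getSplits() would return this list, and
    // each pair would become one map task whose RecordReader emits a
    // TextPair of (left line, right line).
    static List<SplitPair> crossProduct(List<String> dir1Splits,
                                        List<String> dir2Splits) {
        List<SplitPair> pairs = new ArrayList<>();
        for (String a : dir1Splits)
            for (String b : dir2Splits)
                pairs.add(new SplitPair(a, b));
        return pairs;
    }
}
```

Note the warning in the thread still applies: with S1 splits in dir1 and S2 in dir2 this produces S1*S2 map tasks, so keep the split counts small.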
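For the specific sizes in this thread (FileA is small enough to fit in memory), Owen's second suggestion is simpler: cache all of FileA once per mapper and stream each FileB chunk past it. The sketch below shows only that inner loop; the class name and the `matches` callback are illustrative, not part of Hadoop, and in a real mapper you would emit each hit to the reducer instead of collecting it in a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

public class InMemoryCross {
    // fileA: all 600K FileA records, loaded once (e.g. in Mapper.setup()).
    // fileBChunk: the lines of the FileB part this mapper is processing.
    // matches: the comparison the job actually cares about.
    static List<String[]> crossMatch(List<String> fileA,
                                     List<String> fileBChunk,
                                     BiPredicate<String, String> matches) {
        List<String[]> hits = new ArrayList<>();
        for (String b : fileBChunk)        // one pass over the B chunk
            for (String a : fileA)         // all of A is already in memory
                if (matches.test(a, b))
                    hits.add(new String[]{a, b});
        return hits;
    }
}
```

This avoids re-reading FileA for every FileB chunk, which is why Owen says it would be much faster in this particular case while otherwise doing the same n^2 comparison.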