You are right, Dieter. The "linux diff" treats a file as an ordered list of lines, but I only want to treat it as a set of lines. Sorry I didn't make that clear at the beginning.
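
To make the set-style comparison concrete, here is a minimal sketch in plain Python that only simulates the map, shuffle and reduce steps in memory. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `set_diff`) and the file ids "A" and "B" are made up for illustration. The idea is to use the line content as the key and the file id as the value, so that a line appearing in only one file ends up with a single file id in the reducer.

```python
from collections import defaultdict


def map_phase(file_id, lines):
    """Mapper: emit (line content, file id) for every record."""
    for line in lines:
        yield line.rstrip("\n"), file_id


def shuffle(pairs):
    """Group the file ids by line content, as the framework would do
    between the map and reduce steps. A set drops repeated lines within
    one file, which is what treating a file as a set means."""
    groups = defaultdict(set)
    for key, value in pairs:
        groups[key].add(value)
    return groups


def reduce_phase(groups):
    """Reducer: keep only the lines that were seen in exactly one file."""
    for line, file_ids in groups.items():
        if len(file_ids) == 1:
            yield line, next(iter(file_ids))


def set_diff(lines_a, lines_b):
    """Return sorted (line, file id) pairs for lines unique to one file.
    The ids "A" and "B" are hypothetical labels for the two inputs."""
    pairs = list(map_phase("A", lines_a)) + list(map_phase("B", lines_b))
    return sorted(reduce_phase(shuffle(pairs)))
```

For example, `set_diff(["apple", "banana", "cherry"], ["cherry", "apple", "durian"])` returns `[("banana", "A"), ("durian", "B")]`: the order of lines inside each file does not matter, only membership.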

On Tue, Mar 20, 2012 at 7:33 PM, Dieter Plaetinck <die...@plaetinck.be> wrote:

> The "diff command on linux" (i.e. GNU diffutils) is way more involved than
> this.
> It can compare sections on different line numbers. (For example, if you
> copy a text file to another, then delete or add some lines in arbitrary
> places and compare them, it will detect just that, whereas this crude
> logic will give a lot of false positives.)
> The diff logic is hard to map on (and hence IMHO doesn't fit) the M/R
> paradigm.
> But what's the bigger picture here? Usually you would run diff on files
> created by humans (source code, notes, etc.), i.e. files that can easily be
> diff'ed on a single machine.
> If you have files that are so huge, they are probably generated by
> software, which means you can do more appropriate things than diffing
> output files.
>
> Dieter
>
>
> On Tue, 20 Mar 2012 16:43:06 +0530
> Bejoy Ks <bejoy.had...@gmail.com> wrote:
>
> > Yes, if you have more than 2 files to be compared, then the file
> > name/id is required from the mapper. If it is just two files and you
> > just want to know which lines are not unique, then just the line no
> > would be good, but if you are looking for more granular info, like the
> > exact changes and which files they are in, then the value from the
> > mapper could be prefixed with some value like the file name.
> >
> > Regards
> > Bejoy KS
> >
> > 2012/3/20 botma lin <linj...@gmail.com>
> >
> > > Thanks Bejoy, that makes sense.
> > >
> > > If I want to know the original file of a differing record, I need to
> > > put an extra file id into the mapper's output value, then get it in
> > > the reducer.
> > >
> > > Do you have any other ideas?
> > >
> > > Thanks!
> > >
> > >
> > > On Tue, Mar 20, 2012 at 6:09 PM, Bejoy Ks <bejoy.had...@gmail.com> wrote:
> > >
> > > > Hi Lin
> > > >       In your mapper, make the line no the key and the line contents
> > > > the value. In your reducer, check whether the two values for a key
> > > > match, i.e. if you are comparing two files then there would be two
> > > > values for a line number. If non-matching patterns are found,
> > > > increment a counter to determine the number of non-matching patterns
> > > > and write those patterns to the output file. If the values match for
> > > > a key, do nothing; there is no need to even write to the output dir.
> > > >
> > > > Regards
> > > > Bejoy KS
> > > >
> > > > On Tue, Mar 20, 2012 at 2:01 PM, botma lin <linj...@gmail.com> wrote:
> > > >
> > > > > Hi, all
> > > > >
> > > > > I'm a newbie to hadoop.
> > > > >
> > > > > I'm trying to compare two large files and get the difference
> > > > > between them, like the diff cmd in linux.
> > > > > However, the mapred api can only get one record at a time, so how
> > > > > can I get the related records in the two files and compare them
> > > > > using the mapred api?
> > > > >
> > > > > Thanks!
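
For comparison, the line-number approach described in the quoted messages can be sketched the same way. This is again plain Python that only imitates the mapper and reducer in memory, not Hadoop code; the names `line_mapper`, `line_reducer` and `positional_diff` are hypothetical, and the sketch assumes every record already carries its line number. The file id is prefixed to the value, as suggested in the thread, so the reducer can report where each mismatching line came from.

```python
from collections import defaultdict


def line_mapper(file_id, lines):
    """Mapper: key is the 1-based line number, value is (file id, content)."""
    for line_no, line in enumerate(lines, start=1):
        yield line_no, (file_id, line.rstrip("\n"))


def line_reducer(line_no, values):
    """Reducer: return None when both files agree on this line number,
    otherwise return the line number with the values that disagree.
    A line number present in only one file also counts as a mismatch."""
    contents = {content for _, content in values}
    if len(values) == 2 and len(contents) == 1:
        return None
    return line_no, sorted(values)


def positional_diff(lines_a, lines_b):
    """Compare two inputs line number by line number."""
    groups = defaultdict(list)
    for file_id, lines in (("A", lines_a), ("B", lines_b)):
        for key, value in line_mapper(file_id, lines):
            groups[key].append(value)
    mismatches = []
    for line_no in sorted(groups):
        result = line_reducer(line_no, groups[line_no])
        if result is not None:
            mismatches.append(result)
    return mismatches
```

`len(positional_diff(...))` plays the role of the mismatch counter. The sketch also shows the false positives Dieter mentions: `positional_diff(["a", "b"], ["x", "a", "b"])` reports all three line numbers as different, although only one line was inserted.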