Re: how to implements the 'diff' cmd in hadoop

2012-03-20 Thread botma lin
Thanks Bejoy, that makes sense. If I want to know which file a differing record originally came from, I need to put an extra file id into the mapper's output value, then read it in the reducer. Do you have any other ideas? Thanks! On Tue, Mar 20, 2012 at 6:09 PM, Bejoy Ks

Re: how to implements the 'diff' cmd in hadoop

2012-03-20 Thread Bejoy Ks
Yes, if you have more than two files to compare, the file name/id is required from the mapper. If it is just two files and you only want to know which lines are not unique, then just the line number would be enough, but if you are looking for more granular info like the exact changes in
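
A minimal sketch of the idea discussed above, written as plain Python that simulates the map, shuffle and reduce steps in memory rather than using the Hadoop API. The function names (`map_lines`, `reduce_line`, `set_diff`) and the file ids are hypothetical, chosen only for illustration: the mapper emits each line as the key and the file id as the value, and the reducer reports a line only when it is missing from at least one input file.

```python
from collections import defaultdict


def map_lines(file_id, lines):
    """Mapper: emit (line_content, file_id) for every line of one input file."""
    for line in lines:
        yield line.rstrip("\n"), file_id


def reduce_line(line, file_ids, all_file_ids):
    """Reducer: return (line, files containing it) if some file lacks the line."""
    present = set(file_ids)
    if present != all_file_ids:
        return line, sorted(present)
    return None  # line occurs in every file, so it is not a difference


def set_diff(files):
    """Simulate shuffle + reduce in memory; `files` maps file id -> list of lines."""
    grouped = defaultdict(list)
    for file_id, lines in files.items():
        for key, value in map_lines(file_id, lines):
            grouped[key].append(value)

    all_ids = set(files)
    results = []
    for line in sorted(grouped):  # sorted keys, similar to reducer input order
        out = reduce_line(line, grouped[line], all_ids)
        if out is not None:
            results.append(out)
    return results
```

With two inputs such as `{"a.txt": ["x", "y", "z"], "b.txt": ["y", "z", "w"]}`, this reports `w` as present only in `b.txt` and `x` as present only in `a.txt`. In a real job the file id would have to come from the input split, which this sketch does not attempt to show.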

Re: how to implements the 'diff' cmd in hadoop

2012-03-20 Thread botma lin
Thanks a lot! On Tue, Mar 20, 2012 at 7:13, Bejoy Ks bejoy.had...@gmail.com wrote: Yes, if you are having more than 2 files to be compared against then, the file name/ id is required from mapper. If it is just two files and you just want to know which lines are not unique then just the line

Re: how to implements the 'diff' cmd in hadoop

2012-03-20 Thread botma lin
You are right, Dieter. The Linux diff treats a file as a list, but I only want to treat it as a set. Sorry I didn't make that clear at the beginning. On Tue, Mar 20, 2012 at 7:33 PM, Dieter Plaetinck die...@plaetinck.be wrote: the diff command on linux (i.e. gnu diffutils) is way more involved than
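
To illustrate the list-versus-set distinction raised here, a small Python sketch follows. It is only an illustration, not how GNU diffutils works internally: `set_difference` and `ordered_changes` are hypothetical helper names, and the order-aware comparison uses the standard library's `difflib.ndiff` as a stand-in for a line-oriented diff.

```python
import difflib


def set_difference(a_lines, b_lines):
    """Set view: order and duplicates are ignored, only membership matters."""
    only_a = sorted(set(a_lines) - set(b_lines))
    only_b = sorted(set(b_lines) - set(a_lines))
    return only_a, only_b


def ordered_changes(a_lines, b_lines):
    """List view: keep only the removed ('- ') and added ('+ ') lines from ndiff."""
    return [d for d in difflib.ndiff(a_lines, b_lines) if d[:2] in ("- ", "+ ")]


a = ["x", "y"]
b = ["y", "x"]  # same records, different order

# As sets the two files are identical, so nothing is reported.
print(set_difference(a, b))

# As lists the reordering shows up as removed and added lines.
print(ordered_changes(a, b))
```

The set view is the one that maps naturally onto a key-grouping job, because grouping by line content discards the original ordering anyway.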