Re: how to implements the 'diff' cmd in hadoop

botma lin Tue, 20 Mar 2012 19:53:56 -0700

You are right, Dieter. The "linux diff" treats a file as an ordered list of
lines, but I only want to treat it as a set of lines. Sorry I didn't make that
clear at the beginning.
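
To make the set-based comparison concrete, here is a minimal sketch in plain Python that imitates the map, shuffle and reduce steps on a single machine. It is not Hadoop code: the function names (map_lines, shuffle, reduce_diff, set_diff) and the file ids "A" and "B" are made up for illustration. The mapper emits the line content as the key and the file id as the value, and the reducer keeps only the lines that are missing from at least one file.

```python
from collections import defaultdict


def map_lines(file_id, lines):
    """Mapper stand-in: emit (line content, file id) for every record."""
    for line in lines:
        yield line.rstrip("\n"), file_id


def shuffle(pairs):
    """Stand-in for the shuffle phase: group the file ids by line content."""
    grouped = defaultdict(set)
    for key, value in pairs:
        grouped[key].add(value)
    return grouped


def reduce_diff(grouped, all_ids):
    """Reducer stand-in: keep lines that do not appear in every file."""
    for line, ids in grouped.items():
        if ids != all_ids:
            yield line, sorted(ids)


def set_diff(files):
    """Run the three phases over a dict of {file id: list of lines}."""
    pairs = [p for fid, lines in files.items() for p in map_lines(fid, lines)]
    return dict(reduce_diff(shuffle(pairs), set(files)))


if __name__ == "__main__":
    # Hypothetical sample data; real input would come from the two large files.
    files = {"A": ["x", "y", "z"], "B": ["y", "z", "w"]}
    for line, ids in sorted(set_diff(files).items()):
        print(line, "found only in", ",".join(ids))
```

Because the key is the line content and not the line number, lines that merely moved to another position are not reported, and repeated lines collapse into one entry, which is what treating the file as a set implies.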


On Tue, Mar 20, 2012 at 7:33 PM, Dieter Plaetinck <die...@plaetinck.be>
wrote:

> The "diff command on Linux" (i.e. GNU diffutils) is far more involved than
> this.
> It can match sections that sit at different line numbers. (For example, if
> you copy a text file, then delete or add some lines in arbitrary places and
> compare the two, it will detect exactly those changes, whereas this crude
> line-number logic will give a lot of false positives.)
> The diff logic is hard to map onto (and hence IMHO doesn't fit) the M/R
> paradigm.
> But what's the bigger picture here? Usually you would run diff on files
> created by humans (source code, notes, etc.), i.e. files that can easily be
> diffed on a single machine.
> If your files are so huge, they are probably generated by software, which
> means you can do more appropriate things than diffing output files.
>
> Dieter
>
>
> On Tue, 20 Mar 2012 16:43:06 +0530
> Bejoy Ks <bejoy.had...@gmail.com> wrote:
>
> > Yes, if you have more than two files to compare, then the file name/ID
> > is required from the mapper. If it is just two files and you only want to
> > know which lines do not match, then the line number alone would be
> > enough. But if you want more granular info, like the exact changes and
> > which files they are in, then the value from the mapper could be prefixed
> > with something like the file name.
> >
> > Regards
> > Bejoy KS
> >
> > 2012/3/20 botma lin <linj...@gmail.com>
> >
> > > Thanks Bejoy, that makes sense.
> > >
> > > If I want to know which file a differing record originally came from,
> > > I need to put an extra file ID into the mapper's output value, then
> > > read it in the reducer.
> > >
> > > Do you have any other ideas?
> > >
> > > Thanks!
> > >
> > >
> > > On Tue, Mar 20, 2012 at 6:09 PM, Bejoy Ks <bejoy.had...@gmail.com> wrote:
> > >
> > > > Hi Lin
> > > >        In your mapper, make the line number the key and the line
> > > > contents the value. In your reducer, check whether the two values for
> > > > a key match, i.e. if you are comparing two files, there will be two
> > > > values for each line number. If they do not match, increment a
> > > > counter to track the number of mismatches and write those lines to
> > > > the output file. If the values for a key match, do nothing; there is
> > > > no need even to write to the output dir.
> > > >
> > > > Regards
> > > > Bejoy KS
> > > >
> > > > On Tue, Mar 20, 2012 at 2:01 PM, botma lin <linj...@gmail.com> wrote:
> > > >
> > > > > Hi, all
> > > > >
> > > > >      I'm a newbie to Hadoop.
> > > > >
> > > > >      I'm trying to compare two large files and get the differences
> > > > > between them, like the diff command in Linux.
> > > > > However, the mapred API can only get one record at a time. So how
> > > > > can I get the related records in the two files and compare them
> > > > > using the mapred API?
> > > > >
> > > > >     Thanks!
> > > > >
> > > >
> > >
>
>
