On 4-Sep-08, at 6:51 AM, Andrey Pankov wrote:

Hello,

Does anyone know whether it is possible to compare data on HDFS without
copying it to a local box? I mean, if I'd like to find the difference
between local text files, I can use the diff command. If the files are on
HDFS, then I have to get them from HDFS to the local box and only then do
the diff. Copying files to the local fs is a bit annoying and could be
problematic when the files are huge, say 2-5 GB.

You could always do this as a MapReduce task. "diff --brief" is trivial; actually finding the diffs is left as an exercise for the reader :) I'm currently doing a line-oriented diff of two files where the order of the lines is unimportant, so I just have my reducer flag lines that show up an odd number of times.
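
A rough sketch of that odd-count idea, in plain Python rather than actual Hadoop code. The function names (map_lines, reduce_odd, unordered_diff) are made up for illustration, and the shuffle step is simulated in memory with a Counter; a real job would feed both files to the same mapper and let the framework group the keys.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def map_lines(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Mapper (hypothetical name): emit (line, 1) for every input line."""
    for line in lines:
        # Strip only the trailing newline so other whitespace still counts.
        yield line.rstrip("\n"), 1


def reduce_odd(pairs: Iterable[Tuple[str, int]]) -> List[str]:
    """Reducer (hypothetical name): flag lines seen an odd number of times."""
    totals: Counter = Counter()
    for line, count in pairs:
        totals[line] += count  # stands in for the framework's group-by-key
    return sorted(line for line, total in totals.items() if total % 2 == 1)


def unordered_diff(file_a: Iterable[str], file_b: Iterable[str]) -> List[str]:
    """Run the mapper over both inputs, then reduce the combined pairs."""
    pairs = list(map_lines(file_a)) + list(map_lines(file_b))
    return reduce_odd(pairs)


if __name__ == "__main__":
    a = ["apple\n", "banana\n", "cherry\n"]
    b = ["cherry\n", "apple\n", "date\n"]
    print(unordered_diff(a, b))  # prints ['banana', 'date']
```

Note that this only counts total occurrences across both inputs, so it assumes lines are not repeated within a file: a line that appears twice in one file and never in the other comes out even and is not flagged.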


Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra


