On 4-Sep-08, at 6:51 AM, Andrey Pankov wrote:
Hello, does anyone know whether it is possible to compare data on HDFS without copying it to the local box? If I want to find the differences between local text files, I can use the diff command. If the files are on HDFS, I have to get them from HDFS to the local box first and only then run diff. Copying files to the local fs is a bit annoying and can be problematic when the files are huge, say 2-5 GB.
You could always do this as a mapreduce task. "diff --brief" is trivial, actually finding the diffs is left as an exercise for the reader :) I'm currently doing a line-oriented diff of two files where the order of the lines is unimportant, so I just have my reducer flag lines that show up an odd number of times.
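For illustration, the odd-count idea can be sketched in plain Python as a local simulation of the map and reduce steps. This is not the actual job described above: the function names (`map_lines`, `reduce_odd`, `unordered_diff`) are made up for the example, and a real job would run the same logic through Hadoop rather than in one process.

```python
from collections import defaultdict
from typing import Iterable, Iterator, List, Tuple


def map_lines(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit (line, 1) for every input line, from either file."""
    for line in lines:
        yield line.rstrip("\n"), 1


def reduce_odd(pairs: Iterable[Tuple[str, int]]) -> List[str]:
    """Reduce step: total the counts per line, keep lines with an odd total."""
    counts = defaultdict(int)
    for line, n in pairs:
        counts[line] += n
    return sorted(line for line, total in counts.items() if total % 2 == 1)


def unordered_diff(file_a: Iterable[str], file_b: Iterable[str]) -> List[str]:
    """Hypothetical driver: feed both inputs through map, then reduce."""
    pairs = list(map_lines(file_a)) + list(map_lines(file_b))
    return reduce_odd(pairs)


if __name__ == "__main__":
    a = ["alpha\n", "beta\n", "gamma\n"]
    b = ["gamma\n", "alpha\n", "delta\n"]
    for line in unordered_diff(a, b):
        print(line)
```

One caveat with the sketch: a line that appears twice in one file and not at all in the other has an even total and goes unflagged, so this only works as a diff when lines are unique within each file.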
Karl Anderson [EMAIL PROTECTED] http://monkey.org/~kra