Re: Compare data on HDFS side
On 4-Sep-08, at 6:51 AM, Andrey Pankov wrote:

> Hello,
>
> Does anyone know if it is possible to compare data on HDFS without copying it to a local box? To find the difference between local text files I can use the diff command. If the files are on HDFS, I have to get them from HDFS to the local box and only then run diff. Copying files to the local fs is a bit annoying and could be problematic when the files are huge, say 2-5 GB.

You could always do this as a mapreduce task. diff --brief is trivial; actually finding the diffs is left as an exercise for the reader :)

I'm currently doing a line-oriented diff of two files where the order of the lines is unimportant, so I just have my reducer flag lines that show up an odd number of times.

Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra
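For readers who want to see the odd-count idea concretely, here is a minimal sketch that simulates the map and reduce steps in plain Python instead of running on a cluster. The function names (map_lines, reduce_odd, unordered_diff) are hypothetical and not part of any Hadoop API; a real job would feed both files into one mapreduce run and apply the same logic in the reducer.

```python
from collections import Counter
from itertools import chain
from typing import Iterable, Iterator, List, Tuple


def map_lines(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit (line, 1) for every input line, newline stripped."""
    for line in lines:
        yield line.rstrip("\n"), 1


def reduce_odd(pairs: Iterable[Tuple[str, int]]) -> List[str]:
    """Reduce step: sum the counts per line and keep lines seen an odd number of times."""
    counts: Counter = Counter()
    for key, value in pairs:
        counts[key] += value
    return sorted(key for key, total in counts.items() if total % 2 == 1)


def unordered_diff(file_a: Iterable[str], file_b: Iterable[str]) -> List[str]:
    """Feed both inputs through the same map and reduce, as one job would."""
    return reduce_odd(chain(map_lines(file_a), map_lines(file_b)))


if __name__ == "__main__":
    a = ["alpha\n", "beta\n", "gamma\n"]
    b = ["gamma\n", "alpha\n", "delta\n"]
    print(unordered_diff(a, b))
```

Note the limitation of the odd-count test: a line that appears twice in one file and not at all in the other has an even total and is not flagged, so this only works as a diff when lines are unique within each file.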
Re: Compare data on HDFS side
One way is to write a small program which does the diff at block level. Open both files, read the data at the same offset from each, and compare. This tells you which blocks differ, at the granularity of your offset boundary, and is useful for checking whether two files differ at all. There is also an open JIRA for getting the checksum of a file, which would make this task trivial.

Lohit

On Sep 4, 2008, at 6:51 AM, Andrey Pankov [EMAIL PROTECTED] wrote:

> Hello,
>
> Does anyone know if it is possible to compare data on HDFS without copying it to a local box? To find the difference between local text files I can use the diff command. If the files are on HDFS, I have to get them from HDFS to the local box and only then run diff. Copying files to the local fs is a bit annoying and could be problematic when the files are huge, say 2-5 GB.
>
> Thanks in advance.
>
> --
> Andrey Pankov
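A rough sketch of the block-level comparison described above, written against generic binary streams so it can be tried locally. It assumes you can obtain a readable stream for each HDFS file through whatever client you use; that part is not shown. The names differing_block_offsets and _read_block are hypothetical, and the 64 KiB default block size is an arbitrary choice.

```python
import io
from typing import BinaryIO, Iterator


def _read_block(stream: BinaryIO, size: int) -> bytes:
    """Read up to `size` bytes, looping because a stream may return short reads."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # end of stream
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def differing_block_offsets(
    a: BinaryIO, b: BinaryIO, block_size: int = 64 * 1024
) -> Iterator[int]:
    """Yield the starting offset of every block where the two streams differ."""
    offset = 0
    while True:
        block_a = _read_block(a, block_size)
        block_b = _read_block(b, block_size)
        if not block_a and not block_b:
            return  # both streams are exhausted
        if block_a != block_b:
            yield offset
        offset += block_size


if __name__ == "__main__":
    left = io.BytesIO(b"aaaabbbbcccc")
    right = io.BytesIO(b"aaaabxbbcccc")
    print(list(differing_block_offsets(left, right, block_size=4)))
```

To answer only "do these files differ?", stop at the first yielded offset, for example with next(differing_block_offsets(a, b), None). An inserted or deleted byte shifts everything after it, so every later block will be reported as different; this is a same-offset comparison, not a real diff.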