Re: Compare data on HDFS side

2008-09-05 Thread Karl Anderson


On 4-Sep-08, at 6:51 AM, Andrey Pankov wrote:


Hello,

Does anyone know whether it is possible to compare data on HDFS without
copying it to a local box? To find the differences between local text
files I can use the diff command. If the files are on HDFS, I have to
fetch them to the local box first and only then run diff. Copying files
to the local fs is a bit annoying and could be problematic when the
files are huge, say 2-5 GB.


You could always do this as a MapReduce job.  The equivalent of
diff --brief is trivial; actually finding the diffs is left as an
exercise for the reader :)  I'm currently doing a line-oriented diff of
two files where the order of the lines is unimportant, so I just have my
reducer flag lines that show up an odd number of times.
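
A minimal sketch of the odd-count idea, written as plain Python functions
rather than as a real Hadoop job. The function names (map_lines, reduce_odd,
unordered_diff) are made up for illustration, and the map and reduce phases
are simulated in memory; an actual job would have the framework group the
keys between the two phases.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def map_lines(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (line, 1) for every input line, whichever file it came from."""
    for line in lines:
        yield line.rstrip("\n"), 1


def reduce_odd(pairs: Iterable[Tuple[str, int]]) -> List[str]:
    """Reducer: sum the counts per line and keep lines seen an odd number of times."""
    totals: Counter = Counter()
    for line, count in pairs:
        totals[line] += count
    return sorted(line for line, total in totals.items() if total % 2 == 1)


def unordered_diff(file_a: Iterable[str], file_b: Iterable[str]) -> List[str]:
    """Feed both inputs through the same mapper, then reduce (simulated shuffle)."""
    pairs = list(map_lines(file_a)) + list(map_lines(file_b))
    return reduce_odd(pairs)
```

Note the limitation of the parity test: a line that appears twice in one file
and never in the other has an even total and is not flagged, so this only
works when lines are unique within each file.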



Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra





Re: Compare data on HDFS side

2008-09-04 Thread Lohit Vijayarenu


One way is to write a small program that does the diff at the block level. Open both 
files, read data at the same offset from each, and compare. This will tell you at which 
offset boundary the files differ, and it is useful for checking whether two files differ 
at all. There is also an open JIRA that would give you a checksum of a file, which 
would make this task trivial.
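
A rough sketch of the block-level comparison, assuming both files are
available as binary file-like objects. How those streams are opened against
HDFS is left out here; the names read_full and first_differing_block and the
64 KB default block size are illustrative choices, not anything Hadoop
provides.

```python
import io
from typing import BinaryIO, Optional


def read_full(stream: BinaryIO, size: int) -> bytes:
    """Read exactly `size` bytes unless the stream ends first (guards against short reads)."""
    buf = bytearray()
    while len(buf) < size:
        piece = stream.read(size - len(buf))
        if not piece:
            break
        buf.extend(piece)
    return bytes(buf)


def first_differing_block(a: BinaryIO, b: BinaryIO,
                          block_size: int = 64 * 1024) -> Optional[int]:
    """Return the starting offset of the first block that differs, or None if identical."""
    offset = 0
    while True:
        chunk_a = read_full(a, block_size)
        chunk_b = read_full(b, block_size)
        if chunk_a != chunk_b:
            return offset
        if not chunk_a:  # both streams ended at the same offset
            return None
        offset += block_size


# Example with in-memory streams standing in for the two files.
same = first_differing_block(io.BytesIO(b"abcdefgh"), io.BytesIO(b"abcdefgh"), 4)
diff = first_differing_block(io.BytesIO(b"abcdefgh"), io.BytesIO(b"abcdXfgh"), 4)
```

The result only narrows the difference down to a block boundary; finding the
exact byte or line would need a second pass over that block.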
Lohit

On Sep 4, 2008, at 6:51 AM, Andrey Pankov [EMAIL PROTECTED] wrote:

Hello,

Does anyone know whether it is possible to compare data on HDFS without
copying it to a local box? To find the differences between local text
files I can use the diff command. If the files are on HDFS, I have to
fetch them to the local box first and only then run diff. Copying files
to the local fs is a bit annoying and could be problematic when the
files are huge, say 2-5 GB.

Thanks in advance.

-- 
Andrey Pankov