I was going through this link:
http://stackoverflow.com/questions/9406477/data-integrity-in-hdfs-which-data-nodes-verifies-the-checksum
It says that in recent versions of Hadoop only the last data node in the 
pipeline verifies the checksum, because the write happens in a pipelined fashion. 
Now I have a question.
Assume my cluster has two data nodes, A and B, and I have a file. Half of the 
file's content is written to data node A and the remaining half is written to 
data node B, to take advantage of parallelism. My question is: will data node A 
not store the checksums for the blocks stored on it?

Going by the statement "only the last data node verifies the checksum", it looks 
like only the last data node, which in my case is data node B, will generate the 
checksum. But if only data node B generates checksums, it will generate them 
only for the blocks stored on data node B. What about the checksums for the 
data blocks on data node A?
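
To make the scenario concrete, here is a toy Python sketch. It is not Hadoop 
code, and it rests on assumptions that should be checked against the HDFS 
documentation: that the writing client computes one CRC per fixed-size chunk, 
that each block has its own pipeline made up of the data nodes holding that 
block's replicas, that every node in a pipeline stores the checksums next to 
the data, and that only the last node of that pipeline verifies them. The names 
DataNode, write_block, chunk_checksums and BYTES_PER_CHECKSUM are hypothetical.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumed chunk size, chosen for this sketch


def chunk_checksums(data):
    """CRC32 of each fixed-size chunk, as the writing client is assumed to compute."""
    return [zlib.crc32(data[i:i + BYTES_PER_CHECKSUM])
            for i in range(0, len(data), BYTES_PER_CHECKSUM)]


class DataNode:
    """Hypothetical stand-in for a data node; not a Hadoop class."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}    # block_id -> (data, checksums) kept on this node
        self.verified = []  # block ids this node verified while writing

    def receive(self, block_id, data, checksums, is_last):
        # Assumption: only the last node in the block's pipeline recomputes
        # the checksums and compares them with what the client sent.
        if is_last:
            if chunk_checksums(data) != checksums:
                raise IOError("checksum mismatch on " + self.name)
            self.verified.append(block_id)
        # Assumption: every node stores the checksums alongside the data.
        self.blocks[block_id] = (data, checksums)


def write_block(block_id, data, pipeline):
    """Send one block through its own pipeline of replica holders."""
    checksums = chunk_checksums(data)  # computed once, by the client
    for i, node in enumerate(pipeline):
        node.receive(block_id, data, checksums, is_last=(i == len(pipeline) - 1))


a, b = DataNode("A"), DataNode("B")
content = bytes(2048)

# The scenario from the question: one replica per block, halves on A and B.
write_block("blk_1", content[:1024], [a])  # pipeline is just A
write_block("blk_2", content[1024:], [b])  # pipeline is just B

# A replicated block for comparison: the pipeline is A then B.
write_block("blk_3", b"x" * 600, [a, b])
```

Under these assumptions, A is the last (and only) node in the pipeline for 
blk_1, so it both verifies and stores that block's checksums. For blk_3, both 
nodes store the checksums but only B verifies them.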
                                          
