I was going through this link: http://stackoverflow.com/questions/9406477/data-integrity-in-hdfs-which-data-nodes-verifies-the-checksum . It says that in recent versions of Hadoop only the last data node verifies the checksum, because the write happens in a pipeline fashion. Now I have a question. Assume my cluster has two data nodes, A and B, and I have a file: half of the file's content is written to data node A and the remaining half is written to data node B, to take advantage of parallelism. Will data node A not store the checksum for the blocks stored on it?
Going by the statement "only the last data node verifies the checksum", it looks like only the last data node, which in my case is data node B, will generate the checksum. But if only data node B generates the checksum, then it will generate it only for the blocks stored on data node B. What about the checksums for the data blocks on data node A?