Hi Gera,

Thanks for your input. I have a fairly large amount of data, and if I go the -cat route followed by an md5sum calculation, it becomes a time-consuming process.
I could understand from the code that the Hadoop checksum is essentially an MD5 of MD5s of CRC32Cs, which is then returned as the output. I would be more curious to know: if I have to create the checksum manually, the way Hadoop does it internally, how do I do that? Is there any document or link available that explains how this checksum calculation works behind the scenes? I have sketched my current understanding below the quoted thread; please let me know if it is on the right track.

Thanks
Shashi

On Sat, Aug 8, 2015 at 8:00 AM, Gera Shegalov <g...@apache.org> wrote:

> The fs checksum output has more info, like bytes per CRC and CRCs per block.
> See e.g.:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MD5MD5CRC32FileChecksum.java
>
> In order to avoid dealing with different formatting or byte order, you
> could use md5sum for the remote file as well, if the file is reasonably small:
>
> hadoop fs -cat /abc.txt | md5sum
>
> On Fri, Aug 7, 2015 at 3:35 AM Shashi Vishwakarma <
> shashi.vish...@gmail.com> wrote:
>
>> Hi
>>
>> I have a small confusion regarding checksum verification. Let's say I
>> have a file abc.txt and I transferred this file to HDFS. How do I ensure
>> data integrity?
>>
>> I followed the steps below to check that the file was correctly transferred.
>>
>> *On Local File System:*
>>
>> md5sum abc.txt
>>
>> 276fb620d097728ba1983928935d6121 TestFile
>>
>> *On Hadoop Cluster:*
>>
>> hadoop fs -checksum /abc.txt
>>
>> /abc.txt MD5-of-0MD5-of-512CRC32C
>> 000002000000000000000000911156a9cf0d906c56db7c8141320df0
>>
>> Both outputs look different to me. Let me know if I am doing anything
>> wrong.
>>
>> How do I verify whether my file was transferred properly into HDFS?
>>
>> Thanks
>> Shashi
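
For reference, here is my current (possibly wrong) reading of MD5MD5CRC32FileChecksum: compute a CRC32C for every 512-byte chunk, take the MD5 of the concatenated per-chunk CRCs within each block, and then take the MD5 of the concatenated per-block MD5s. The sketch below tries to reproduce that on a local file. The class name, the 512 bytesPerCrc, the 128 MB block size, and the 4-byte big-endian CRC encoding are all my assumptions, not something I have confirmed in the Hadoop code, so please correct anything that is off.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32C;

public class ManualHdfsChecksum {
    // Assumed defaults; the real values come from the cluster configuration.
    private static final int BYTES_PER_CRC = 512;
    private static final long BLOCK_SIZE = 128L * 1024 * 1024;

    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        MessageDigest fileMd5 = MessageDigest.getInstance("MD5");   // MD5 over the per-block MD5s
        MessageDigest blockMd5 = MessageDigest.getInstance("MD5");  // MD5 over this block's chunk CRCs
        byte[] chunk = new byte[BYTES_PER_CRC];
        long bytesInBlock = 0;

        try (InputStream in = new FileInputStream(args[0])) {
            int n;
            while ((n = in.readNBytes(chunk, 0, chunk.length)) > 0) {
                CRC32C crc = new CRC32C();                // CRC32C per 512-byte chunk (Java 9+)
                crc.update(chunk, 0, n);
                // Assumption: each CRC is fed to MD5 as 4 big-endian bytes.
                blockMd5.update(ByteBuffer.allocate(4).putInt((int) crc.getValue()).array());
                bytesInBlock += n;
                if (bytesInBlock >= BLOCK_SIZE) {         // assumed end of an HDFS block
                    fileMd5.update(blockMd5.digest());
                    blockMd5.reset();
                    bytesInBlock = 0;
                }
            }
        }
        if (bytesInBlock > 0) {                           // partial last block
            fileMd5.update(blockMd5.digest());
        }

        StringBuilder hex = new StringBuilder();
        for (byte b : fileMd5.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);                          // hoping to compare this with fs -checksum
    }
}

If I read the fs -checksum output from my original mail correctly, the leading 00000200 looks like bytesPerCrc = 512 and the trailing 32 hex characters look like the final MD5, so that trailing part is what I would try to match against, but I have not been able to verify the exact byte layout yet.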