[ https://issues.apache.org/jira/browse/HADOOP-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107772#comment-13107772 ]
Steve Loughran commented on HADOOP-7550:
----------------------------------------

I've just noticed that the paper cited in the description is [Stone 1998], not the one I was referring to. The other paper to read is [Stone 2000], "When The CRC and TCP Checksum Disagree", ACM.

The 1998 paper looks at the quality of the TCP checksum, which is pretty weak. The 2000 paper collects real-world statistics on the problem. It looks at cases where the Ethernet CRC and the TCP checksum are inconsistent, by monitoring TCP resend requests and the like, so detecting situations where the Ethernet CRC somehow missed a problem but the TCP counter detected it.

Where the 1998 paper is interesting is that:
# It shows that other checksum algorithms (here, Fletcher-256) can be better
# It shows that trailing checksums are better than leading checksums (which has implications for HTTP content-length and content-MD5 headers too)
# It shows that errors in compressed data are more likely to be picked up, which is good news for compressed HDFS data, but not for messages about that data.

> Need for Integrity Validation of RPC
> ------------------------------------
>
>                 Key: HADOOP-7550
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7550
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>            Reporter: Dave Thompson
>            Assignee: Dave Thompson
>
> Some recent investigation of network packet corruption has shown a need for
> hadoop RPC integrity validation beyond assurances already provided by 802.3
> link layer and TCP 16-bit CRC.
> During an unusual occurrence on a 4k node cluster, we've seen as high as 4
> TCP anomalies per second on a single node, sustained over an hour (14k per
> hour). A TCP anomaly would be an escaped link layer packet that resulted
> in a TCP CRC failure, TCP packet out of sequence
> or TCP packet size error.
> According to this paper[*]: http://tinyurl.com/3aue72r
> TCP's 16-bit CRC has an effective detection rate of 2^10.
> 1 in 1024 errors
> may escape detection, and in fact what originally alerted us to this issue
> was seeing failures due to bit-errors in hadoop traffic. Extrapolating from
> that paper, one might expect 14 escaped packet errors per hour for that
> single node of a 4k cluster. While the above error rate
> was unusually high due to a broadband aggregate switch issue, hadoop not
> having an integrity check on RPC makes it problematic to discover, and limit
> any potential data damage due to
> acting on a corrupt RPC message.
> ------
> [*] In case this jira outlives that tinyurl, the IEEE paper cited is:
> "Performance of Checksums and CRCs over Real Data" by Jonathan Stone, Michael
> Greenwald, Craig Partridge, Jim Hughes.
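
As a rough illustration of the numbers in the description, the sketch below reproduces the back-of-the-envelope extrapolation (4 anomalies per second, 1 in 1024 escaping detection) and shows one well-known blind spot of TCP's 16-bit check, which is a ones'-complement sum as described in RFC 1071 rather than a true CRC: reordering 16-bit words leaves the sum unchanged. The function names `escaped_per_hour` and `internet_checksum` are hypothetical, and `zlib.crc32` is used only as a convenient stand-in for a stronger check. It is not a statement about what Hadoop RPC uses or should use.

```python
import zlib


def escaped_per_hour(anomalies_per_sec: float,
                     escape_fraction: float = 1 / 1024) -> float:
    """Anomalies per hour times the assumed fraction that escapes detection."""
    return anomalies_per_sec * 3600 * escape_fraction


def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement sum of big-endian 16-bit words (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    # 4 anomalies/s is the figure quoted in the description.
    print(escaped_per_hour(4))

    original = b"\x12\x34\x56\x78"
    swapped = b"\x56\x78\x12\x34"  # same 16-bit words, different order

    # The ones'-complement sum is order-independent, so the swap goes unnoticed.
    print(internet_checksum(original) == internet_checksum(swapped))
    # A CRC-32 over the same bytes does distinguish the two messages.
    print(zlib.crc32(original) == zlib.crc32(swapped))
```

The estimate comes out at roughly 14 per hour, matching the description. The swap example is only one class of error the 16-bit sum cannot see; the cited papers measure how often such cases arise on real data.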