[ 
https://issues.apache.org/jira/browse/HADOOP-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107772#comment-13107772
 ] 

Steve Loughran commented on HADOOP-7550:
----------------------------------------

I've just noticed that the paper cited in the description is [Stone 1998], 
not the one I was referring to.

The other paper to read is [Stone 2000], "When the CRC and TCP Checksum 
Disagree", ACM SIGCOMM 2000.

The 1998 paper looks at the quality of the TCP checksum, which is pretty weak. 
The 2000 paper collects real-world statistics on the problem, looking at cases 
where the Ethernet CRC and the TCP checksum disagree, by monitoring TCP resend 
requests and the like, so detecting situations where the Ethernet CRC somehow 
missed a problem but the TCP checksum caught it.

Where the 1998 paper is interesting is that:
# It shows that other checksum algorithms (here, Fletcher-256) can be better
# It shows that trailing checksums are better than leading checksums (which has 
implications for HTTP content-length and content-MD5 headers too)
# It shows that errors in compressed data are more likely to be picked up, 
which is good news for compressed HDFS data, but not for messages about that 
data.
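
To make the first point concrete, here is a minimal sketch comparing the RFC 1071 ones'-complement sum (the one TCP uses) with a Fletcher checksum. It is an illustration only: it uses Fletcher-16 (mod 255), which is not necessarily the exact variant measured in the paper, and the function names and sample payload are made up for this example. It shows one well-known weakness: the TCP-style sum cannot see 16-bit words being reordered, while the position-dependent Fletcher sum can.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def fletcher16(data: bytes) -> int:
    """Fletcher-16: two running sums mod 255, so byte position matters."""
    s1 = s2 = 0
    for b in data:
        s1 = (s1 + b) % 255
        s2 = (s2 + s1) % 255
    return (s2 << 8) | s1


# Hypothetical payload; the second copy has its first two 16-bit words swapped.
original = b"ABCDEFGH"
swapped = b"CDABEFGH"

# The ones'-complement sum is order-independent, so the swap goes unnoticed.
print(internet_checksum(original) == internet_checksum(swapped))
# Fletcher's second sum weights each byte by its position, so the swap shows up.
print(fletcher16(original) == fletcher16(swapped))
```

This does not reproduce the paper's measurements over real data; it only shows why a plain sum of words is a weak integrity check.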


> Need for Integrity Validation of RPC
> ------------------------------------
>
>                 Key: HADOOP-7550
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7550
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>            Reporter: Dave Thompson
>            Assignee: Dave Thompson
>
> Some recent investigation of network packet corruption has shown a need for 
> Hadoop RPC integrity validation beyond the assurances already provided by 
> the 802.3 link layer CRC and the 16-bit TCP checksum.
> During an unusual occurrence on a 4k node cluster, we've seen as many as 4 
> TCP anomalies per second on a single node, sustained over an hour (14k per 
> hour). A TCP anomaly here is an escaped link layer packet that resulted in a 
> TCP checksum failure, a TCP packet out of sequence, or a TCP packet size 
> error.
> According to this paper[*]: http://tinyurl.com/3aue72r
> TCP's 16-bit checksum has an effective miss rate of 1 in 2^10: 1 in 1024 
> errors may escape detection, and in fact what originally alerted us to this 
> issue was seeing failures due to bit errors in Hadoop traffic. Extrapolating 
> from that paper, one might expect 14 escaped packet errors per hour for that 
> single node of a 4k cluster. While the above error rate was unusually high 
> due to a broadband aggregate switch issue, Hadoop not having an integrity 
> check on RPC makes it hard to discover and limit any potential data damage 
> caused by acting on a corrupt RPC message.
> ------
> [*] In case this jira outlives that tinyurl, the IEEE paper cited is:  
> "Performance of Checksums and CRCs over Real Data" by Jonathan Stone, Michael 
> Greenwald, Craig Partridge, Jim Hughes.
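
The extrapolation in the description can be checked with a few lines of arithmetic. This sketch takes the description's figures (4 anomalies per second, a 1-in-1024 escape rate) at face value; the variable names are made up for the example.

```python
anomalies_per_second = 4      # peak rate reported for the single node
seconds_per_hour = 3600
escape_rate = 1 / 2**10       # description's figure: 1 in 1024 errors escapes

# 4 * 3600 anomalies in the hour, of which 1 in 1024 may go undetected.
anomalies_per_hour = anomalies_per_second * seconds_per_hour
escaped_per_hour = anomalies_per_hour * escape_rate

print(anomalies_per_hour, escaped_per_hour)
```

That gives 14,400 anomalies per hour and about 14 escaped errors per hour, matching the "14k per hour" and "14 escaped packet errors per hour" figures quoted above.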

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
