[ https://issues.apache.org/jira/browse/HADOOP-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107967#comment-13107967 ]

Dave Thompson commented on HADOOP-7550:
---------------------------------------

Yes, for performance reasons the ideal solution for this use case would not 
be a cryptographically secure checksum, such as the one specified in RFC 1964 
for the Kerberos GSS-API mechanism used by Sun's SASL implementation, but 
rather something along the lines of a CRC-32 that encapsulates the entire 
RPC.  I agree that an on/off mechanism should be included in the 
implementation, both for the performance considerations you mentioned and 
because such a data integrity check would be wastefully redundant if someone 
needed to deploy with a secure SASL option such as QoP auth-conf or auth-int.
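
To make the idea concrete, below is a minimal sketch of what framing an RPC 
payload with a trailing CRC-32 might look like.  It uses the standard 
java.util.zip.CRC32; everything else (the RpcCrcFraming class, wrapWithCrc, 
verifyAndStrip, and the crcEnabled switch standing in for a config key) is a 
hypothetical illustration, not the actual Hadoop IPC code:

    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;

    /**
     * Hypothetical sketch: frame a serialized RPC payload with a trailing
     * CRC-32 and verify it on receipt.  Not the actual Hadoop IPC code.
     */
    public class RpcCrcFraming {

        // Hypothetical on/off switch, e.g. backed by a config key.
        static boolean crcEnabled = true;

        /** Append a 4-byte CRC-32 computed over the whole payload. */
        static byte[] wrapWithCrc(byte[] payload) {
            if (!crcEnabled) {
                return payload;
            }
            CRC32 crc = new CRC32();
            crc.update(payload, 0, payload.length);
            ByteBuffer out = ByteBuffer.allocate(payload.length + 4);
            out.put(payload);
            out.putInt((int) crc.getValue());   // low 32 bits of the CRC
            return out.array();
        }

        /** Verify the trailing CRC-32 and return the original payload. */
        static byte[] verifyAndStrip(byte[] framed) {
            if (!crcEnabled) {
                return framed;
            }
            ByteBuffer in = ByteBuffer.wrap(framed);
            byte[] payload = new byte[framed.length - 4];
            in.get(payload);
            int expected = in.getInt();
            CRC32 crc = new CRC32();
            crc.update(payload, 0, payload.length);
            if ((int) crc.getValue() != expected) {
                throw new IllegalStateException(
                    "RPC CRC-32 mismatch; message corrupt");
            }
            return payload;
        }
    }

In a deployment where the SASL QoP already provides integrity (auth-int) or 
confidentiality (auth-conf), the switch would simply stay off, avoiding the 
redundant per-message check.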

> Need for Integrity Validation of RPC
> ------------------------------------
>
>                 Key: HADOOP-7550
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7550
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>            Reporter: Dave Thompson
>            Assignee: Dave Thompson
>
> Some recent investigation of network packet corruption has shown a need for 
> Hadoop RPC integrity validation beyond the assurances already provided by 
> the 802.3 link-layer CRC and the TCP 16-bit checksum.
> During an unusual occurrence on a 4k-node cluster, we've seen as many as 4 
> TCP anomalies per second on a single node, sustained over an hour (about 
> 14k per hour).  A TCP anomaly here means a corrupt packet that escaped 
> link-layer detection and surfaced as a TCP checksum failure, an 
> out-of-sequence TCP packet, or a TCP packet size error.
> According to this paper[*]:  http://tinyurl.com/3aue72r
> the TCP 16-bit checksum has an effective error-escape rate of about 1 in 
> 2^10; that is, 1 in 1024 errors may escape detection.  In fact, what 
> originally alerted us to this issue was seeing failures due to bit errors 
> in Hadoop traffic.  Extrapolating from that paper, one might expect about 
> 14 escaped packet errors per hour for that single node of a 4k cluster.  
> While the above error rate was unusually high due to a broadband aggregate 
> switch issue, the absence of an integrity check on Hadoop RPC makes it 
> difficult to discover, and to limit, any data damage that results from 
> acting on a corrupt RPC message.
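> Making that extrapolation explicit, using the figures above:
> 4 anomalies/sec x 3600 sec/hour = 14,400 anomalies/hour, and
> 14,400 x (1/1024) is roughly 14 escaped errors per hour on that one node.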
> ------
> [*] In case this JIRA outlives that tinyurl, the IEEE paper cited is:  
> "Performance of Checksums and CRCs over Real Data" by Jonathan Stone, 
> Michael Greenwald, Craig Partridge, and Jim Hughes.
