Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the 
following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11602

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[EMAIL PROTECTED]


(In reply to comment #0)
>   For some time now we've noticed that with Lustre checksums enabled (they
> are always on at LLNL), the fsx test will cause spurious checksum failure
> messages.  The failures are harmless, but they generate some very
> scary-looking log messages which we need to resolve.
> 
>   Now Lustre should be locking these pages before they're being sent, but
> something is still modifying them.

Actually, I think the issue is that the VM isn't locking the pages being
written, so the checksum the client calculates becomes invalid when fsx makes
a small write into a page that is currently being sent.

There is already some work in progress at CFS to change the client IO model to
fit the standard Linux form better, and also to address this same problem for
LAID (where it _would_ cause data corruption, because the RAID checksum would
be incorrect and a lost disk would then be reconstructed from bad parity).
Also, in 1.8 the current checksumming code is replaced by Kerberos data
authentication and encryption; the encryption needs separate buffers for the
IO, so this problem disappears as a result.

_______________________________________________
Lustre-devel mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-devel
