Very interesting link. I don't suppose there is any data available separating 4K and 512-byte sectored drives?

On 2013-10-16 18:43, Tim Bell wrote:
At CERN, we have had cases in the past of silent corruptions. It is
good to be able to identify the devices causing them and swap them
out.

It's an old presentation but the concepts are still relevant today
... http://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf

Tim


-----Original Message-----
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of ja...@peacon.co.uk
Sent: 16 October 2013 18:54
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bit correctness and checksumming


Does Ceph log corrected (or caught) silent corruptions anywhere? It would be interesting to know how much of a problem this is in a large-scale deployment. Perhaps something to gather in the league table mentioned at the London Ceph Day?

Just thinking out loud (please shout me down...) - if the FS performs its own ECC, the ATA streaming command set might be useful for
avoiding performance degradation from drive-level error recovery altogether.


On 2013-10-16 17:12, Sage Weil wrote:
> On Wed, 16 Oct 2013, Dan Van Der Ster wrote:
>> Hi all,
>> There has been some confusion the past couple of days at the CHEP
>> conference during conversations about Ceph and protection from bit
>> flips or other subtle data corruption. Can someone please summarise
>> the current state of data integrity protection in Ceph, assuming we
>> have an XFS backend filesystem? I.e., don't rely on the protection
>> offered by btrfs. I saw in the docs that wire messages and journal
>> writes are CRC'd, but nothing explicit about the objects themselves.
>
> - Everything that passes over the wire is checksummed (crc32c). This
> is mainly because the TCP checksum is so weak.
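[Editor's note: for readers unfamiliar with crc32c (Castagnoli), a minimal bitwise sketch follows; real implementations use lookup tables or the SSE4.2 crc32 instruction, but the result is the same.]

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    # Bitwise CRC-32C. 0x82F63B78 is the bit-reflected form of the
    # Castagnoli polynomial 0x1EDC6F41; init and final XOR are both
    # 0xFFFFFFFF, with reflected input/output.
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Standard CRC-32C check value:
assert crc32c(b"123456789") == 0xE3069283
```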
>
> - The journal entries have a crc.
>
> - During deep scrub, we read the objects and metadata, calculate a
> crc32c, and compare across replicas. This detects missing objects,
> bitrot, failing disks, or any other source of inconsistency.
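[Editor's note: a toy sketch of that cross-replica comparison; function names are hypothetical, the real scrub walks objects in a placement group across the acting set, and zlib.crc32 stands in for crc32c here.]

```python
import zlib

def scrub_digest(data: bytes) -> int:
    # Each OSD hashes its local copy of the object.
    return zlib.crc32(data)

def deep_scrub(primary: bytes, replicas: dict) -> list:
    # replicas: osd_id -> that OSD's copy of the object.
    # Report every replica whose digest disagrees with the primary's;
    # a mismatch flags the object as inconsistent.
    want = scrub_digest(primary)
    return [osd for osd, data in replicas.items()
            if scrub_digest(data) != want]
```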
>
> - Ceph does not calculate and store a per-object checksum. Doing so
> is difficult because rados allows arbitrary overwrites of parts of an
> object.
>
> - Ceph *does* have a new opportunistic checksum feature, which is
> currently only enabled in QA. It calculates and stores checksums on
> whatever block size you configure (e.g., 64k) if/when we
> write/overwrite a complete block, and will verify any complete block
> read against the stored crc, if one happens to be available. This can
> help catch some but not all sources of corruption.
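[Editor's note: the opportunistic scheme described above can be sketched as follows; the class and method names are hypothetical, and zlib.crc32 stands in for crc32c.]

```python
import zlib

BLOCK = 64 * 1024  # configured checksum granularity, e.g. 64k

class OpportunisticCrc:
    # Keep a crc per block, but only when a write covered the whole
    # block; a partial overwrite invalidates the stored crc rather
    # than forcing a read-modify-write.
    def __init__(self):
        self.crcs = {}  # block index -> crc

    def on_write(self, offset: int, data: bytes) -> None:
        first = offset // BLOCK
        last = (offset + len(data) - 1) // BLOCK
        for b in range(first, last + 1):
            start, end = b * BLOCK, (b + 1) * BLOCK
            if offset <= start and offset + len(data) >= end:
                # Full-block write: store a fresh crc for this block.
                chunk = data[start - offset:end - offset]
                self.crcs[b] = zlib.crc32(chunk)
            else:
                # Partial overwrite: the stored crc is now stale.
                self.crcs.pop(b, None)

    def verify_block(self, b: int, data: bytes) -> bool:
        # Verify only if a crc happens to be available.
        crc = self.crcs.get(b)
        return crc is None or crc == zlib.crc32(data)
```

Blocks that were never written whole simply have no crc, which is why this catches some but not all corruption.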
>
>> We also have some specific questions:
>>
>> 1. Is an object checksum stored on the OSD somewhere? Is this in
>> user.ceph._, because it wasn't obvious when looking at the code?
>
> No (except for the new/experimental opportunistic crc I mention
> above).
>
>> 2. When is the checksum verified? Surely it is checked during the
>> deep scrub, but what about during an object read?
>
> For non-btrfs, no crc to verify. For btrfs, the fs has its own crc
> and verifies it.
>
>> 2b. Can a user read corrupted data if the master replica has a bit
>> flip but this hasn't yet been found by a deep scrub?
>
> Yes.
>
>> 3. During deep scrub of an object with 2 replicas, suppose the
>> checksum is different for the two objects -- which object wins? (I.e.
>> if you store the checksum locally, this is trivial since the
>> consistency of objects can be evaluated locally. Without the local
>> checksum, you can have conflicts.)
>
> In this case we normally choose the primary.  The repair has to be
> explicitly triggered by the admin, however, and there are some options
> to control that choice.
>
>> 4. If the checksum is already stored per object in the OSD, is this
>> retrievable by librados? We have some applications which also need to
>> know the checksum of the data and this would be handy if it was
>> already calculated by Ceph.
>
> It would! It may be that the way to get there is to build an API to
> expose the opportunistic checksums, and/or to extend that feature to
> maintain full checksums (by re-reading partially overwritten blocks on
> write). (Note, however, that even this wouldn't cover xattrs and omap
> content; really this is something that "should" be handled by the
> backend storage/file system.)
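[Editor's note: "re-reading partially overwritten blocks on write" means a read-modify-write so the crc always covers the complete block; a tiny standalone sketch, with hypothetical names and a deliberately small block size.]

```python
import zlib

BLOCK = 4  # deliberately tiny block size for illustration

blocks = {0: b"abcd"}            # backing store: block index -> bytes
crcs = {0: zlib.crc32(b"abcd")}  # full per-block crc map

def partial_write(b: int, off: int, data: bytes) -> None:
    # Re-read the existing block, splice in the new bytes, and
    # recompute the crc over the whole block, so the stored crc
    # stays valid instead of being dropped.
    buf = bytearray(blocks[b])
    buf[off:off + len(data)] = data
    blocks[b] = bytes(buf)
    crcs[b] = zlib.crc32(blocks[b])
```

The cost is the extra read on every partial overwrite, which is exactly the trade-off the opportunistic scheme avoids.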
>
> sage
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
