I looked at the code.  The automatic repair should handle getting an EIO 
during a read of an object replica.  Contrary to what I said before, it does 
NOT require removing the object, so it doesn’t matter which copy has the bad 
sectors.  It will copy from a good replica to the primary, if necessary.  By 
default a deep-scrub, which would catch this case, is performed weekly.  A 
repair must be initiated by administrative action.
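
For example, to run those checks by hand (the pgid 2.5 below is 
illustrative; "ceph health detail" reports the real ones):

    # list PGs that scrubbing has flagged as inconsistent
    ceph health detail
    # force an immediate deep-scrub of one PG
    ceph pg deep-scrub 2.5
    # initiate the repair administratively
    ceph pg repair 2.5

The weekly default comes from the “osd deep scrub interval” option 
(604800 seconds).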

When a checksum comparison shows that replicas differ, we currently don’t 
have a way to determine which copy (or copies) is corrupt.  This is where 
manual intervention may be necessary, if the administrator can determine 
which copies are bad.
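
Roughly, that manual check amounts to locating each replica on disk and 
comparing checksums yourself (the pool/object names and paths below are 
illustrative, and the on-disk filename encoding varies by version):

    # find the PG and the OSDs holding the object
    ceph osd map rbd myobject
    # on each of those OSD hosts, locate the replica in the PG directory
    find /var/lib/ceph/osd/ceph-0/current/2.5_head -name 'myobject*'
    # checksum each copy and compare across hosts
    md5sum /var/lib/ceph/osd/ceph-0/current/2.5_head/myobject*

A copy that disagrees with the majority is the one to treat as bad.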

David Zafman
Senior Developer
http://www.inktank.com




On Nov 18, 2013, at 1:11 PM, Chris Dunlop <ch...@onthe.net.au> wrote:

> OK, that's good (as far as it goes, being a manual process).
> 
> So then, back to what I think was Mihály's original issue:
> 
>> pg repair or deep-scrub cannot fix this issue. But if I
>> understand correctly, the osd has to know it cannot retrieve
>> the object from osd.0, and the object needs to be replicated to
>> another osd, because there are no longer 3 working replicas.
> 
> Given that a bad checksum and/or a read error tells ceph that an
> object is corrupt, it would seem a natural step to have ceph
> automatically use another copy with a good checksum, and even rewrite
> the corrupt object, either in normal operation or under a scrub
> or repair.
> 
> Is there a reason this isn't done, apart from lack of tuits?
> 
> Cheers,
> 
> Chris
> 
> 
> On Mon, Nov 18, 2013 at 11:43:46AM -0800, David Zafman wrote:
>> 
>> No, you wouldn’t need to re-replicate the whole disk for a single bad 
>> sector.  The way to deal with that, if the object is on the primary, is to 
>> remove the file manually from the OSD’s filesystem and perform a repair of 
>> the PG that holds that object.  This will copy the object back from one of 
>> the replicas.
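>> 
>> A minimal sketch of that procedure (the pgid 2.5 and the path are 
>> illustrative; "ceph osd map <pool> <object>" reports the real ones):
>> 
>>     # on the primary OSD’s host, delete the corrupt copy of the object
>>     rm /var/lib/ceph/osd/ceph-0/current/2.5_head/<object-file>
>>     # then repair the PG; the object is copied back from a replica
>>     ceph pg repair 2.5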
>> 
>> David
>> 
>> On Nov 17, 2013, at 10:46 PM, Chris Dunlop <ch...@onthe.net.au> wrote:
>> 
>>> Hi David,
>>> 
>>> On Fri, Nov 15, 2013 at 10:00:37AM -0800, David Zafman wrote:
>>>> 
>>>> Replication does not occur until the OSD is “out.”  This creates a new 
>>>> mapping in the cluster of where the PGs should be, and thus data begins to 
>>>> move and/or create sufficient copies.  This scheme lets you control how 
>>>> and when you want the replication to occur.  If you have plenty of space 
>>>> and you aren’t going to replace the drive immediately, just mark the OSD 
>>>> “down” AND “out.”  If you are going to replace the drive immediately, set 
>>>> the “noout” flag, then take the OSD “down” and replace the drive.  
>>>> Assuming the replacement is mounted in the same place as the bad drive, 
>>>> bring the OSD back up.  This will replicate back to the replacement drive 
>>>> exactly the same PGs the bad drive held.  As was stated before, don’t 
>>>> forget to “ceph osd unset noout”.
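>>>> 
>>>> For the replace-immediately case the sequence is roughly as follows 
>>>> (osd.3 is an example id, and the service commands assume a sysvinit 
>>>> install):
>>>> 
>>>>     ceph osd set noout        # keep the cluster from marking OSDs out
>>>>     service ceph stop osd.3   # take the OSD down; swap the drive
>>>>     service ceph start osd.3  # bring it back up on the new drive
>>>>     ceph osd unset noout      # restore normal failure handling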
>>>> 
>>>> Keep in mind that when a machine has a hardware failure that takes OSD(s) 
>>>> down, there is an automatic timeout which will mark them “out” for 
>>>> unattended operation.  Unless you are monitoring the cluster 24/7, you 
>>>> should have enough disk space available to handle failures.
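>>>> 
>>>> That timeout is the “mon osd down out interval” option, 300 seconds by 
>>>> default, e.g. in ceph.conf:
>>>> 
>>>>     [mon]
>>>>     mon osd down out interval = 300  # seconds a down OSD waits before out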
>>>> 
>>>> Related info in:
>>>> 
>>>> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
>>>> 
>>>> David Zafman
>>>> Senior Developer
>>>> http://www.inktank.com
>>> 
>>> 
>>> Are you saying that if a disk suffers a bad sector in an object
>>> for which it's the primary, and for which good data exists on other
>>> replica PGs, there's no way for ceph to recover other than by
>>> (re-)replicating the whole disk?
>>> 
>>> I.e., even if the disk is able to remap the bad sector using a
>>> spare, so the disk is ok (albeit missing a sector's worth of
>>> object data), the only way to recover is to basically blow away
>>> all the data on that disk and start again, replicating
>>> everything back to the disk (or to other disks)?
>>> 
>>> Cheers,
>>> 
>>> Chris.
