Re: [ceph-users] How safe is ceph pg repair these days?
On Mon, Feb 20, 2017 at 02:12:52PM PST, Gregory Farnum spake thusly:
> Hmm, I went digging in and sadly this isn't quite right.

Thanks for looking into this! This is the answer I was afraid of. Aren't all
of those blog entries which talk about using repair, and the ceph docs
themselves, putting people's data at risk? It seems like the only responsible
way to deal with inconsistent PGs is to dig into the OSD log, look at the
reason for the inconsistency, examine the data on disk, determine which copy
is good and which is bad, and delete the bad one.

> The code has a lot of internal plumbing to allow more smarts than were
> previously feasible and the erasure-coded pools make use of them for
> noticing stuff like local corruption. Replicated pools make an attempt
> but it's not as reliable as one would like and it still doesn't
> involve any kind of voting mechanism.

This is pretty surprising. I would have thought a best-two-out-of-three
voting mechanism in a triple-replicated setup would be the obvious way to go.
It must be more difficult to implement than I suppose.

> A self-inconsistent replicated primary won't get chosen. A primary is
> self-inconsistent when its digest doesn't match the data, which
> happens when:
> 1) the object hasn't been written since it was last scrubbed, or
> 2) the object was written in full, or
> 3) the object has only been appended to since the last time its digest
> was recorded, or
> 4) something has gone terribly wrong in/under LevelDB and the omap
> entries don't match what the digest says should be there.

At least there's some sort of basic heuristic which attempts to do the right
thing, even if the whole process isn't as thorough as it could be.

> David knows more and can correct me if I'm missing something. He's also
> working on interfaces for scrub that are more friendly in general and
> allow administrators to make more fine-grained decisions about
> recovery in ways that cooperate with RADOS.

These will be very welcome improvements!
--
Tracy Reed

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
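The two-out-of-three vote discussed above can be sketched as follows. This is
illustrative only: Ceph's replicated-pool repair does *not* do this (which is
the point of the thread); it is the decision an administrator makes by hand,
with hypothetical digests standing in for the checksums reported in the OSD
logs.

```python
# Majority vote over per-replica digests. Illustrative sketch only; not how
# "ceph pg repair" actually chooses a source for replicated pools.
from collections import Counter

def pick_authoritative(digests):
    """Given {osd_id: digest} for all replicas, return (good_digest, bad_osds)
    when a strict majority agrees, or None when no safe choice exists."""
    counts = Counter(digests.values())
    digest, votes = counts.most_common(1)[0]
    if len(digests) // 2 < votes < len(digests):
        # A real majority exists and at least one replica disagrees with it.
        bad = [osd for osd, d in digests.items() if d != digest]
        return digest, bad
    if votes == len(digests):
        return digest, []   # all replicas agree; nothing to repair
    return None             # 2x pool with a mismatch, or a three-way split

# Example: OSDs 2 and 5 agree, OSD 7 holds the odd copy out.
print(pick_authoritative({2: "0xa1b2", 5: "0xa1b2", 7: "0xdead"}))
```

Note the 2x-pool case falls straight through to `None`, matching Nick's "in a
2x pool you're out of luck" observation: with one vote each, there is no
defensible automatic winner.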
Re: [ceph-users] How safe is ceph pg repair these days?
Nick,

Yes, as you would expect, a read error would not be used as a source for
repair, no matter which OSD(s) are getting read errors.

David

On 2/21/17 12:38 AM, Nick Fisk wrote:
> Thanks for the correction, Greg. So I'm guessing that the probability of
> overwriting with an incorrect primary is reduced in later releases, but
> it can still happen.
>
> Quick question, and maybe this is a #5 for your list: what about objects
> that are marked inconsistent on the primary due to a read error? I would
> say 90% of my inconsistent PGs are caused by a read error and an
> associated smartctl error. "rados list-inconsistent-obj" shows that it
> knows the primary had a read error, so I assume a "pg repair" wouldn't
> try to read from the primary again?
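The triage Nick and David discuss, i.e. checking whether a shard's
inconsistency is a plain read error before trusting any copy, can be sketched
against the JSON that `rados list-inconsistent-obj <pgid> --format=json`
emits. The sample below is an abridged, hypothetical reconstruction of the
Jewel-era output shape, not verbatim output; field names can vary by release,
so treat it as illustrative.

```python
# Sketch: find shards flagged with a read error in (hypothetical, abridged)
# "rados list-inconsistent-obj" JSON output, before deciding on "pg repair".
import json

sample = """{
  "epoch": 1234,
  "inconsistents": [{
    "object": {"name": "rbd_data.123.0000000000000042"},
    "union_shard_errors": ["read_error"],
    "shards": [
      {"osd": 3,  "errors": ["read_error"], "size": 4194304},
      {"osd": 11, "errors": [],             "size": 4194304},
      {"osd": 20, "errors": [],             "size": 4194304}
    ]
  }]
}"""

def shards_with_read_errors(report):
    """Return [(object_name, osd)] for every shard that hit a read error."""
    bad = []
    for inc in report.get("inconsistents", []):
        for shard in inc.get("shards", []):
            if "read_error" in shard.get("errors", []):
                bad.append((inc["object"]["name"], shard["osd"]))
    return bad

print(shards_with_read_errors(json.loads(sample)))
```

In this sample only OSD 3 saw a read error, the common read-error-plus-
smartctl pattern Nick describes, and per David's answer that shard would not
be used as a repair source.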
Re: [ceph-users] How safe is ceph pg repair these days?
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Gregory Farnum
> Sent: 20 February 2017 22:13
> To: Nick Fisk <n...@fisk.me.uk>; David Zafman <dzaf...@redhat.com>
> Cc: ceph-users <ceph-us...@ceph.com>
> Subject: Re: [ceph-users] How safe is ceph pg repair these days?
>
> On Sat, Feb 18, 2017 at 12:39 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> > From what I understand, in Jewel+ Ceph has the concept of an
> > authoritative shard, so in the case of a 3x replica pool it will notice
> > that 2 replicas match and one doesn't and use one of the good replicas.
> > However, in a 2x pool you're out of luck.
> >
> > However, if someone could confirm my suspicions that would be good as
> > well.
>
> Hmm, I went digging in and sadly this isn't quite right. The code has a
> lot of internal plumbing to allow more smarts than were previously
> feasible and the erasure-coded pools make use of them for noticing stuff
> like local corruption. Replicated pools make an attempt but it's not as
> reliable as one would like and it still doesn't involve any kind of
> voting mechanism.
> A self-inconsistent replicated primary won't get chosen. A primary is
> self-inconsistent when its digest doesn't match the data, which happens
> when:
> 1) the object hasn't been written since it was last scrubbed, or
> 2) the object was written in full, or
> 3) the object has only been appended to since the last time its digest
> was recorded, or
> 4) something has gone terribly wrong in/under LevelDB and the omap
> entries don't match what the digest says should be there.

Thanks for the correction, Greg. So I'm guessing that the probability of
overwriting with an incorrect primary is reduced in later releases, but it
can still happen.

Quick question, and maybe this is a #5 for your list: what about objects
that are marked inconsistent on the primary due to a read error? I would
say 90% of my inconsistent PGs are caused by a read error and an associated
smartctl error. "rados list-inconsistent-obj" shows that it knows the
primary had a read error, so I assume a "pg repair" wouldn't try to read
from the primary again?

> David knows more and can correct me if I'm missing something. He's also
> working on interfaces for scrub that are more friendly in general and
> allow administrators to make more fine-grained decisions about recovery
> in ways that cooperate with RADOS.
> -Greg
Re: [ceph-users] How safe is ceph pg repair these days?
Hello,

On Mon, 20 Feb 2017 17:15:59 -0800 Gregory Farnum wrote:

> On Mon, Feb 20, 2017 at 4:24 PM, Christian Balzer <ch...@gol.com> wrote:
> > I seem to recall a lot of that plumbing going/being talked about, but
> > never going into full action. Good to know that I didn't miss anything
> > and that my memory is still reliable. ^o^
>
> Yeah. Mixed in with the subtlety are some good use cases, though. For
> instance, anything written with RGW is always going to fit into the
> cases where it will detect a bad primary. RBD is a lot less likely to
> (unless you've done something crazy like set 4K objects, and the VM
> always sends down 4K writes), but since scrubbing fills in the data
> you can count on your snapshots and golden images being
> well-protected. Etc. etc.
>
> > That is certainly appreciated, especially if it gets backported to
> > versions where people are stuck with FS-based OSDs.
> >
> > However, I presume that the main goal and focus is still BlueStore,
> > with live internal checksums that make scrubbing obsolete, right?
>
> I'm not sure what you mean. BlueStore certainly has a ton of work
> going on, and we have plans to update scrub/repair to play nicely and
> handle more of the use cases that BlueStore is likely to expose and
> which FileStore did not. But just about all the scrub/repair
> enhancements we're aiming at will work on both systems, and making
> them handle the BlueStore cases may do a lot more proportionally for
> FileStore.

I'm talking about the various discussions here (google for "bluestore
checksum", which also shows talks on devel, unsurprisingly) as well as
Sage's various slides about BlueStore and checksums.

From those I take away that:

1. All BlueStore reads are checksum-verified, all the time, completely
preventing the silent data corruption that is possible now. Similar to
ZFS/BTRFS, with in-flight delivery of a good replica and repair of the
broken one.

2. Deep scrubbing becomes something of a "feel good" exercise that can be
done at much longer intervals (depending on the quality of your storage and
your replication size) and with much lower priority, as its main (only?)
benefit will be to detect and correct corruption before all replicas of
very infrequently read data have become affected.

Christian

> -Greg
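Christian's point 1 above can be illustrated with a toy verify-on-read store,
assuming a simple name-to-block mapping. BlueStore itself keeps a crc32c per
block; plain zlib crc32 is used here only to keep the sketch stdlib-only.

```python
# Toy sketch of verify-on-read: keep a checksum next to each block and check
# it on every read, so corruption is caught in-flight rather than waiting
# for a deep scrub. Not BlueStore's actual implementation.
import zlib

class ChecksummedStore:
    def __init__(self):
        self._blocks = {}  # name -> (data, crc)

    def write(self, name, data):
        self._blocks[name] = (data, zlib.crc32(data))

    def read(self, name):
        data, crc = self._blocks[name]
        if zlib.crc32(data) != crc:      # verify on every read
            raise IOError("checksum mismatch on %r" % name)
        return data

store = ChecksummedStore()
store.write("obj1", b"hello")
assert store.read("obj1") == b"hello"

# Simulate silent bit rot under the store's feet: flip the data, keep the crc.
data, crc = store._blocks["obj1"]
store._blocks["obj1"] = (b"hellO", crc)
try:
    store.read("obj1")
except IOError:
    print("corruption detected on read")
```

In the replicated case the analogue of the `except` branch is serving the
read from a good replica and repairing the bad one, which is exactly why
scrubbing then shifts to the background role point 2 describes.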
Re: [ceph-users] How safe is ceph pg repair these days?
Hello,

On Mon, 20 Feb 2017 14:12:52 -0800 Gregory Farnum wrote:

> On Sat, Feb 18, 2017 at 12:39 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> > From what I understand, in Jewel+ Ceph has the concept of an
> > authoritative shard, so in the case of a 3x replica pool it will notice
> > that 2 replicas match and one doesn't and use one of the good replicas.
> > However, in a 2x pool you're out of luck.
> >
> > However, if someone could confirm my suspicions that would be good as
> > well.
>
> Hmm, I went digging in and sadly this isn't quite right. The code has
> a lot of internal plumbing to allow more smarts than were previously
> feasible and the erasure-coded pools make use of them for noticing
> stuff like local corruption. Replicated pools make an attempt but it's
> not as reliable as one would like and it still doesn't involve any
> kind of voting mechanism.

I seem to recall a lot of that plumbing going/being talked about, but never
going into full action. Good to know that I didn't miss anything and that
my memory is still reliable. ^o^

> A self-inconsistent replicated primary won't get chosen. A primary is
> self-inconsistent when its digest doesn't match the data, which
> happens when:
> 1) the object hasn't been written since it was last scrubbed, or
> 2) the object was written in full, or
> 3) the object has only been appended to since the last time its digest
> was recorded, or
> 4) something has gone terribly wrong in/under LevelDB and the omap
> entries don't match what the digest says should be there.
>
> David knows more and can correct me if I'm missing something. He's also
> working on interfaces for scrub that are more friendly in general and
> allow administrators to make more fine-grained decisions about
> recovery in ways that cooperate with RADOS.

That is certainly appreciated, especially if it gets backported to versions
where people are stuck with FS-based OSDs.

However, I presume that the main goal and focus is still BlueStore, with
live internal checksums that make scrubbing obsolete, right?

Christian

> -Greg
Re: [ceph-users] How safe is ceph pg repair these days?
From what I understand, in Jewel+ Ceph has the concept of an authoritative
shard, so in the case of a 3x replica pool it will notice that 2 replicas
match and one doesn't and use one of the good replicas. However, in a 2x
pool you're out of luck.

However, if someone could confirm my suspicions that would be good as well.

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Tracy Reed
> Sent: 18 February 2017 03:06
> To: Shinobu Kinjo <ski...@redhat.com>
> Cc: ceph-users <ceph-us...@ceph.com>
> Subject: Re: [ceph-users] How safe is ceph pg repair these days?
>
> Well, that's the question... is that safe? Because the link to the
> mailing list post (possibly outdated) says that what you just suggested
> is definitely NOT safe. Is the mailing list post wrong? Has the situation
> changed? Exactly what does ceph repair do now? I suppose I could go dig
> into the code, but I'm not an expert and would hate to get it wrong and
> post possibly bogus info to the list for other newbies to find and worry
> about and possibly lose their data.
Re: [ceph-users] How safe is ceph pg repair these days?
Well, that's the question... is that safe? Because the link to the mailing
list post (possibly outdated) says that what you just suggested is
definitely NOT safe. Is the mailing list post wrong? Has the situation
changed? Exactly what does ceph repair do now? I suppose I could go dig
into the code, but I'm not an expert and would hate to get it wrong and
post possibly bogus info to the list for other newbies to find and worry
about and possibly lose their data.

On Fri, Feb 17, 2017 at 06:08:39PM PST, Shinobu Kinjo spake thusly:
> if ``ceph pg deep-scrub <pgid>`` does not work then do
> ``ceph pg repair <pgid>``
>
> On Sat, Feb 18, 2017 at 10:02 AM, Tracy Reed <tr...@ultraviolet.org> wrote:
> > I have a 3 replica cluster. A couple times I have run into inconsistent
> > PGs. I googled it and the ceph docs and various blogs say run a repair
> > first. But a couple people on IRC and a mailing list thread from 2015
> > say that ceph blindly copies the primary over the secondaries and calls
> > it good.

--
Tracy Reed
Re: [ceph-users] How safe is ceph pg repair these days?
if ``ceph pg deep-scrub <pgid>`` does not work then do
``ceph pg repair <pgid>``

On Sat, Feb 18, 2017 at 10:02 AM, Tracy Reed <tr...@ultraviolet.org> wrote:
> I have a 3 replica cluster. A couple times I have run into inconsistent
> PGs. I googled it and the ceph docs and various blogs say run a repair
> first. But a couple people on IRC and a mailing list thread from 2015
> say that ceph blindly copies the primary over the secondaries and calls
> it good.
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001370.html
>
> I sure hope that isn't the case. If so it would seem highly
> irresponsible to implement such a naive command called "repair".
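The order of operations Shinobu suggests (scrub first, only then repair) can
be written down as a plan; given the caveats elsewhere in this thread, the
inspection step in the middle is the important one. This sketch only builds
the command lines and does not contact a cluster, so it carries no opinion on
whether `pg repair` is actually safe to run.

```python
# Sketch of the scrub-then-inspect-then-repair flow as argv lists.
# Commands are constructed, not executed; run them only after inspection.
def repair_plan(pgid):
    """Return the CLI invocations, in order, for the scrub-first workflow."""
    return [
        ["ceph", "pg", "deep-scrub", pgid],        # re-check all replicas
        ["rados", "list-inconsistent-obj", pgid],  # inspect before acting
        ["ceph", "pg", "repair", pgid],            # only if inspection says so
    ]

for cmd in repair_plan("2.1f"):
    print(" ".join(cmd))
```

The PG id `2.1f` is a made-up example; substitute the PG reported as
inconsistent by `ceph health detail`.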
[ceph-users] How safe is ceph pg repair these days?
I have a 3 replica cluster. A couple times I have run into inconsistent
PGs. I googled it and the ceph docs and various blogs say run a repair
first. But a couple people on IRC and a mailing list thread from 2015 say
that ceph blindly copies the primary over the secondaries and calls it
good.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001370.html

I sure hope that isn't the case. If so, it would seem highly irresponsible
to implement such a naive command called "repair". I have recently learned
how to properly analyze the OSD logs and manually fix these things, but
not before having run repair on a dozen inconsistent PGs. Now I'm worried
about what sort of corruption I may have introduced.

Repairing things by hand is a simple heuristic based on comparing the size
or checksum (as indicated by the logs) for each of the 3 copies and
figuring out which is correct. Presumably matching two out of three should
win and the odd object out should be deleted, since having the exact same
kind of error on two different OSDs is highly improbable. I don't
understand why ceph repair wouldn't have done this all along.

What is the current best practice in the use of ceph repair?

Thanks!

--
Tracy Reed