Alright, I've tried a few suggestions for repairing this state, but I don't
seem to have any PG replicas that hold good copies of the missing / zero-length
shards. What do I do now? Telling the PGs to repair doesn't seem to help
anything. I can deal with data loss if I can figure out which images might be
damaged; I just need to get the cluster consistent enough that the things which
aren't damaged are usable.
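
For context, the repair attempts have been along these lines (a rough sketch;
3.69d is just one of the PGs that "ceph health detail" flags as inconsistent):

  # list the PGs currently flagged inconsistent
  ceph health detail | grep inconsistent

  # ask a specific PG to repair, then check what it reports afterwards
  ceph pg repair 3.69d
  ceph pg 3.69d query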

I'm also seeing the following similar, but not quite identical, error
messages. I assume they refer to the same root problem:

-1> 2015-03-07 03:12:49.217295 7fc8ab343700  0 log [ERR] : 3.69d shard 22:
soid dd85669d/rbd_data.3f7a2ae8944a.00000000000019a5/7//3 size 0 != known
size 4194304
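
In case it helps anyone hitting this later: the "rbd_data.3f7a2ae8944a" part of
the object name is an image's block name prefix, so damaged objects can be
mapped back to images by comparing prefixes. A rough sketch (the pool name
"mypool" is a placeholder for whichever pool these PGs belong to):

  # block_name_prefix in "rbd info" ties an image to its rbd_data.* objects
  for img in $(rbd ls mypool); do
      rbd info mypool/$img | grep -q 'rbd_data.3f7a2ae8944a' \
          && echo "$img owns the damaged objects"
  done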



On Fri, Mar 6, 2015 at 7:54 PM, Quentin Hartman <
qhart...@direwolfdigital.com> wrote:

> Finally found an error that seems to provide some direction:
>
> -1> 2015-03-07 02:52:19.378808 7f175b1cf700  0 log [ERR] : scrub 3.18e
> e08a418e/rbd_data.3f7a2ae8944a.00000000000016c8/7//3 on disk size (0) does
> not match object info size (4120576) ajusted for ondisk to (4120576)
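>
> To pin down where that object lives (a rough sketch; "mypool" is a
> placeholder for whichever pool has ID 3 here, and this assumes the default
> OSD data paths):
>
>   # which OSDs hold the flagged object, and in which PG
>   ceph osd map mypool rbd_data.3f7a2ae8944a.00000000000016c8
>
>   # then, on those OSDs, look at the on-disk copy under the PG directory
>   find /var/lib/ceph/osd/ceph-*/current/3.18e_head -name '*16c8*' -ls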
>
> I'm diving into Google now and hoping for something useful. If anyone has
> a suggestion, I'm all ears!
>
> QH
>
> On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
>> Thanks for the suggestion, but that doesn't seem to have made a
>> difference.
>>
>> I've shut the entire cluster down and brought it back up, and my config
>> management system seems to have upgraded Ceph to 0.80.8 during the reboot.
>> Everything appears to have come back up, but I'm still seeing the crash
>> loops, which indicates this is definitely something persistent, probably
>> tied to the OSD data rather than some weird transient state.
>>
>>
>> On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil <s...@newdream.net> wrote:
>>
>>> It looks like you may be able to work around the issue for the moment
>>> with
>>>
>>>  ceph osd set nodeep-scrub
>>>
>>> as it looks like it is scrub that is getting stuck?
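>>>
>>> (The flag can be cleared again later, once the scrub problem is sorted out,
>>> with
>>>
>>>  ceph osd unset nodeep-scrub
>>>
>>> and it shows up in "ceph status" while it is set.)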
>>>
>>> sage
>>>
>>>
>>> On Fri, 6 Mar 2015, Quentin Hartman wrote:
>>>
>>> > Ceph health detail - http://pastebin.com/5URX9SsQ
>>> > pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
>>> > an OSD crash log (in a GitHub gist because it was too big for pastebin) -
>>> > https://gist.github.com/qhartman/cb0e290df373d284cfb5
>>> >
>>> > And now I've got four OSDs that are looping.....
>>> >
>>> > On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
>>> > <qhart...@direwolfdigital.com> wrote:
>>> >       So I'm in the middle of trying to triage a problem with my Ceph
>>> >       cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
>>> >       The cluster has been running happily for about a year. This last
>>> >       weekend, something caused the box running the MDS to seize hard,
>>> >       and when we came in on Monday, several OSDs were down or
>>> >       unresponsive. I brought the MDS and the OSDs back online, and
>>> >       managed to get things running again with minimal data loss. I had
>>> >       to mark a few objects as lost, but things were apparently
>>> >       running fine at the end of the day on Monday.
>>> > This afternoon, I noticed that one of the OSDs was apparently stuck in
>>> > a crash/restart loop, and the cluster was unhappy. Performance was in
>>> > the tank and "ceph status" was reporting all manner of problems, as one
>>> > would expect if an OSD is misbehaving. I marked the offending OSD out,
>>> > and the cluster started rebalancing as expected. However, a short while
>>> > later I noticed that another OSD had started into a crash/restart loop.
>>> > So I repeated the process, and it happened again. At this point I
>>> > noticed that there were actually two at a time in this state.
>>> >
>>> > It's as if there's some toxic chunk of data being passed around, and
>>> > when it lands on an OSD it kills it. Counter to that theory, though, I
>>> > tried just stopping an OSD while it was in a bad state, and once the
>>> > cluster started rebalancing with that OSD down and not previously marked
>>> > out, another OSD began crash-looping.
>>> >
>>> > I've investigated the disk of the first OSD I found with this problem,
>>> > and it has no apparent corruption on the file system.
>>> >
>>> > I'll follow up to this shortly with links to pastes of log snippets.
>>> > Any input would be appreciated. This is turning into a real cascade
>>> > failure, and I haven't any idea how to stop it.
>>> >
>>> > QH
>>> >
>>> >
>>> >
>>> >
>>>
>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
