So I'm in the middle of trying to triage a problem with my ceph cluster
running 0.80.5. I have 24 OSDs spread across 8 machines. The cluster has
been running happily for about a year. This last weekend, something caused
the box running the MDS to seize hard, and when we came in on Monday,
several OSDs were down or unresponsive. I brought the MDS and the OSDs back
online, and managed to get things running again with minimal data loss.
I had to mark a few objects as lost, but things were apparently running fine
at the end of the day on Monday.

This afternoon, I noticed that one of the OSDs was apparently stuck in a
crash/restart loop, and the cluster was unhappy. Performance was in the
tank and "ceph status" was reporting all manner of problems, as one would
expect if an OSD is misbehaving. I marked the offending OSD out, and the
cluster started rebalancing as expected. However, a short while later I
noticed that another OSD had started into a crash/restart loop. So I
repeated the process, and it happened again. At that point I noticed that
there are actually two OSDs at a time in this state.
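
For anyone who wants to confirm which OSDs are flapping rather than
eyeballing it, here is a rough sketch of how the up/down transitions could
be counted. It assumes you already have a series of periodic snapshots
mapping each OSD id to its up/down state; how those snapshots are collected
is left out, and the function names and the threshold are made up for
illustration, not part of any Ceph tool.

```python
from collections import Counter


def count_flaps(snapshots):
    """Count up/down transitions per OSD across successive snapshots.

    `snapshots` is a list of dicts mapping OSD id -> True (up) / False (down),
    ordered oldest to newest. Collecting them is outside this sketch.
    """
    flaps = Counter()
    for prev, curr in zip(snapshots, snapshots[1:]):
        for osd_id, is_up in curr.items():
            # Only count a transition if the OSD appears in both snapshots.
            if osd_id in prev and prev[osd_id] != is_up:
                flaps[osd_id] += 1
    return flaps


def crash_looping(snapshots, threshold=3):
    """Return sorted OSD ids with at least `threshold` state transitions.

    The threshold is an arbitrary illustrative value.
    """
    counts = count_flaps(snapshots)
    return sorted(osd for osd, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    # Hypothetical data: osd 0 bounces on every poll, osd 1 stays up.
    history = [
        {0: True, 1: True},
        {0: False, 1: True},
        {0: True, 1: True},
        {0: False, 1: True},
    ]
    print(crash_looping(history))
```

Running this over a window before and after marking an OSD out would show
whether the set of flapping OSDs really moves from one disk to another.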

It's as if there's some toxic chunk of data that is getting passed around,
and when it lands on an OSD it kills it. Contrary to that theory, however,
I tried just stopping an OSD while it was in a bad state, and once the
cluster started rebalancing with that OSD down but not previously marked
out, another OSD started crash-looping.

I've investigated the disk of the first OSD I found with this problem, and
it has no apparent corruption on the file system.

I'll follow up to this shortly with links to pastes of log snippets. Any
input would be appreciated. This is turning into a real cascade failure,
and I haven't any idea how to stop it.

ceph-users mailing list
