Alright, I've tried a few suggestions for repairing this state, but I don't seem to have any PG replicas that hold good copies of the missing / zero-length shards. What do I do now? Telling the PGs to repair doesn't seem to accomplish anything. I can deal with data loss if I can figure out which images might be damaged; I just need to get the cluster consistent enough that the things which aren't damaged are usable.
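
For what it's worth, my rough plan for figuring out which images are affected is to match the prefix in the damaged object names (rbd_data.3f7a2ae8944a.*) against the block_name_prefix that "rbd info" reports for each image. Something along these lines is what I had in mind ("rbd" below is just a placeholder for whichever pool actually holds the images):

    # list the PGs the cluster is flagging as inconsistent
    ceph health detail | grep inconsistent

    # find which image owns objects with a given rbd_data prefix
    # ("rbd" is a placeholder pool name, adjust as needed)
    for img in $(rbd -p rbd ls); do
        if rbd -p rbd info "$img" | grep -q 'block_name_prefix: rbd_data.3f7a2ae8944a'; then
            echo "damaged object belongs to image: $img"
        fi
    done

Is that a sane way to map objects back to images, or is there a better tool for it?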
Also, I'm seeing these similar, but not quite identical, error messages as well. I assume they are referring to the same root problem:

-1> 2015-03-07 03:12:49.217295 7fc8ab343700 0 log [ERR] : 3.69d shard 22: soid dd85669d/rbd_data.3f7a2ae8944a.00000000000019a5/7//3 size 0 != known size 4194304

On Fri, Mar 6, 2015 at 7:54 PM, Quentin Hartman <qhart...@direwolfdigital.com> wrote:

> Finally found an error that seems to provide some direction:
>
> -1> 2015-03-07 02:52:19.378808 7f175b1cf700 0 log [ERR] : scrub 3.18e e08a418e/rbd_data.3f7a2ae8944a.00000000000016c8/7//3 on disk size (0) does not match object info size (4120576) ajusted for ondisk to (4120576)
>
> I'm diving into Google now and hoping for something useful. If anyone has a suggestion, I'm all ears!
>
> QH
>
> On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman <qhart...@direwolfdigital.com> wrote:
>
>> Thanks for the suggestion, but that doesn't seem to have made a difference.
>>
>> I've shut the entire cluster down and brought it back up, and my config management system seems to have upgraded Ceph to 0.80.8 during the reboot. Everything seems to have come back up, but I am still seeing the crash loops, so that seems to indicate that this is definitely something persistent, probably tied to the OSD data, rather than some weird transient state.
>>
>> On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil <s...@newdream.net> wrote:
>>
>>> It looks like you may be able to work around the issue for the moment with
>>>
>>>     ceph osd set nodeep-scrub
>>>
>>> as it looks like it is scrub that is getting stuck?
>>>
>>> sage
>>>
>>> On Fri, 6 Mar 2015, Quentin Hartman wrote:
>>>
>>> > Ceph health detail - http://pastebin.com/5URX9SsQ
>>> > pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
>>> > an OSD crash log (in a GitHub gist because it was too big for pastebin) - https://gist.github.com/qhartman/cb0e290df373d284cfb5
>>> >
>>> > And now I've got four OSDs that are looping.....
>>> >
>>> > On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman <qhart...@direwolfdigital.com> wrote:
>>> > So I'm in the middle of trying to triage a problem with my Ceph cluster running 0.80.5. I have 24 OSDs spread across 8 machines. The cluster has been running happily for about a year. This last weekend, something caused the box running the MDS to seize hard, and when we came in on Monday, several OSDs were down or unresponsive. I brought the MDS and the OSDs back online, and managed to get things running again with minimal data loss. I had to mark a few objects as lost, but things were apparently running fine at the end of the day on Monday.
>>> >
>>> > This afternoon, I noticed that one of the OSDs was apparently stuck in a crash/restart loop, and the cluster was unhappy. Performance was in the tank and "ceph status" was reporting all manner of problems, as one would expect if an OSD is misbehaving. I marked the offending OSD out, and the cluster started rebalancing as expected. However, I noticed a short while later that another OSD had started into a crash/restart loop. So I repeated the process. And it happened again. At this point I noticed that there are actually two at a time which are in this state.
>>> >
>>> > It's as if there's some toxic chunk of data that is getting passed around, and when it lands on an OSD it kills it.
>>> > Contrary to that, however, I tried just stopping an OSD when it's in a bad state, and once the cluster starts to try rebalancing with that OSD down and not previously marked out, another OSD will start crash-looping.
>>> >
>>> > I've investigated the disk of the first OSD I found with this problem, and it has no apparent corruption on the file system.
>>> >
>>> > I'll follow up to this shortly with links to pastes of log snippets. Any input would be appreciated. This is turning into a real cascade failure, and I haven't any idea how to stop it.
>>> >
>>> > QH
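
For the record, while I keep digging I've been thinking of setting the following flags to try to stop the cascade, on the assumption that temporarily pausing scrub and rebalancing can't make things any worse (please correct me if that assumption is wrong):

    # stop the scrubbing that appears to trigger the crashes
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # keep the cluster from shuffling data onto the next victim
    ceph osd set noout
    ceph osd set nobackfill
    ceph osd set norecover

My understanding is that each of these needs to be cleared again with "ceph osd unset <flag>" once things settle down.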