Can you check the capacitor reading on the S3700 with smartctl? This drive has a non-volatile cache which *should* get flushed when power is lost; depending on what the hardware does on reboot, it might get flushed even when rebooting. I just got this drive for testing yesterday and it’s a beast, but some things were peculiar. For example, my fio benchmark slowed down (35K IOPS -> 5K IOPS) after several GB written (a random amount, 5-40 GB), and then it would creep back up over time even under load. Disabling the write cache helps, no idea why.
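For reference, a minimal sketch of pulling that reading out of `smartctl -A` output. It assumes smartmontools is installed and that the drive reports its power-loss capacitor self-test under the attribute name `Power_Loss_Cap_Test` (ID 175), which is how smartmontools usually labels it on these Intel drives. The attribute name and the helper names `smart_attributes` / `read_attributes` are assumptions for illustration, so check the actual output on your box.

```python
import subprocess


def smart_attributes(smartctl_output: str) -> dict:
    """Parse the attribute table printed by `smartctl -A` into {name: raw_value}."""
    attrs = {}
    in_table = False
    for line in smartctl_output.splitlines():
        fields = line.split()
        # The table starts after the "ID# ATTRIBUTE_NAME ..." header row.
        if fields[:2] == ["ID#", "ATTRIBUTE_NAME"]:
            in_table = True
            continue
        # Rows have 10 columns; RAW_VALUE is the last and may contain spaces.
        if in_table and len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = " ".join(fields[9:])
    return attrs


def read_attributes(device: str) -> dict:
    """Run `smartctl -A` on a device (usually needs root) and parse the result."""
    # check=False: smartctl uses its exit status as a bitmask, so non-zero
    # does not necessarily mean the command failed outright.
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return smart_attributes(result.stdout)
```

Something like `read_attributes("/dev/sdX").get("Power_Loss_Cap_Test")` (the device path is a placeholder) would then give the raw value string, or `None` if the drive does not expose an attribute under that name.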
Z.

> On 28 May 2015, at 09:22, Christian Balzer <ch...@gol.com> wrote:
>
>
> Hello Greg,
>
> On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote:
>
>> The description of the logging abruptly ending and the journal being
>> bad really sounds like part of the disk is going back in time. I'm not
>> sure if XFS internally is set up in such a way that something like
>> losing part of its journal would allow that?
>>
> I'm special. ^o^
> No XFS, EXT4. As stated in the original thread, below.
> And the (OSD) journal is a raw partition on a DC S3700.
>
> And since there was at least a 30 second pause between the completion of
> the "/etc/init.d/ceph stop" and issuing of the shutdown command, the
> logging abruptly ending seems unlikely to be related to the shutdown at
> all.
>
>> If any of the OSD developers have the time it's conceivable a copy of
>> the OSD journal would be enlightening (if e.g. the header offsets are
>> wrong but there are a bunch of valid journal entries), but this is two
>> reports of this issue from you and none very similar from anybody
>> else. I'm still betting on something in the software or hardware stack
>> misbehaving. (There aren't that many people running Debian; there are
>> lots of people running Ubuntu and we find bad XFS kernels there not
>> infrequently; I think you're hitting something like that.)
>>
> There should be no file system involved with the raw partition SSD
> journal, n'est-ce pas?
>
> The hardware is vastly different, the previous case was on an AMD
> system with onboard SATA (SP5100), this one is a SM storage goat with LSI
> 3008.
>
> The only thing they have in common is the Ceph version 0.80.7 (via the
> Debian repository, not Ceph) and Debian Jessie as OS with kernel 3.16
> (though there were minor updates on that between those incidents,
> backported fixes)
>
> A copy of the journal would consist of the entire 10GB partition, since we
> don't know where in the loop it was at the time, right?
>
> Christian
>>
>> On Sun, May 24, 2015 at 7:26 PM, Christian Balzer <ch...@gol.com> wrote:
>>>
>>> Hello again (marvel at my elephantine memory and thread necromancy)
>>>
>>> Firstly, this happened again, details below.
>>> Secondly, I changed things to sysv-init AND did a "/etc/init.d/ceph
>>> stop", which dutifully listed all OSDs as being killed/stopped, BEFORE
>>> rebooting the node.
>>>
>>> This is a completely new node with significantly different HW than the
>>> example below.
>>> But the same SW versions as before (Debian Jessie, Ceph 0.80.7).
>>> And just like below/before the logs for that OSD have nothing in them
>>> indicating it did shut down properly (no "journal flush done") and when
>>> coming back on reboot we get the dreaded:
>>> ---
>>> 2015-05-25 10:32:55.439492 7f568aa157c0 1 journal _open /var/lib/ceph/osd/ceph-30/journal fd 23: 10000269312 bytes, block size 4096 bytes, directio = 1, aio = 1
>>> 2015-05-25 10:32:55.439859 7f568aa157c0 -1 journal read_header error decoding journal header
>>> 2015-05-25 10:32:55.439905 7f568aa157c0 -1 filestore(/var/lib/ceph/osd/ceph-30) mount failed to open journal /var/lib/ceph/osd/ceph-30/journal: (22) Invalid argument
>>> 2015-05-25 10:32:55.936975 7f568aa157c0 -1 osd.30 0 OSD:init: unable to mount object store
>>> ---
>>>
>>> I see nothing in the changelogs for 0.80.8 and .9 that seems related to
>>> this, never mind that from the looks of it the repository at Ceph has
>>> only Wheezy (bpo70) packages and Debian Jessie is still stuck at
>>> 0.80.7 (Sid just went to .9 last week)
>>>
>>> I'm preserving the state of things as they are for a few days, so if
>>> any developer would like a peek or more details, speak up now.
>>>
>>> I'd open an issue, but I don't have a reliable way to reproduce this
>>> and even less desire to do so on this production cluster. ^_-
>>>
>>> Christian
>>>
>>> On Sat, 6 Dec 2014 12:48:25 +0900 Christian Balzer wrote:
>>>
>>>> On Fri, 5 Dec 2014 11:23:19 -0800 Gregory Farnum wrote:
>>>>
>>>>> On Thu, Dec 4, 2014 at 7:03 PM, Christian Balzer <ch...@gol.com>
>>>>> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> This morning I decided to reboot a storage node (Debian Jessie,
>>>>>> thus 3.16 kernel and Ceph 0.80.7, HDD OSDs with SSD journals)
>>>>>> after applying some changes.
>>>>>>
>>>>>> It came back up one OSD short, the last log lines before the
>>>>>> reboot are:
>>>>>> ---
>>>>>> 2014-12-05 09:35:27.700330 7f87e789c700 2 -- 10.0.8.21:6823/29520 >> 10.0.8.22:0/5161 pipe(0x7f881b772580 sd=247 :6823 s=2 pgs=21 cs=1 l=1 c=0x7f881f469020).fault (0) Success
>>>>>> 2014-12-05 09:35:27.700350 7f87f011d700 10 osd.4 pg_epoch: 293 pg[3.316( v 289'1347 (0'0,289'1347] local-les=289 n=8 ec=5 les/c 289/289 288/288/288) [8,4,16] r=1 lpr=288 pi=276-287/1 luod=0'0 crt=289'1345 lcod 289'1346 active] cancel_copy_ops
>>>>>> ---
>>>>>>
>>>>>> Quite obviously it didn't complete its shutdown, so
>>>>>> unsurprisingly we get:
>>>>>> ---
>>>>>> 2014-12-05 09:37:40.278128 7f218a7037c0 1 journal _open /var/lib/ceph/osd/ceph-4/journal fd 24: 10000269312 bytes, block size 4096 bytes, directio = 1, aio = 1
>>>>>> 2014-12-05 09:37:40.278427 7f218a7037c0 -1 journal read_header error decoding journal header
>>>>>> 2014-12-05 09:37:40.278479 7f218a7037c0 -1 filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal /var/lib/ceph/osd/ceph-4/journal: (22) Invalid argument
>>>>>> 2014-12-05 09:37:40.776203 7f218a7037c0 -1 osd.4 0 OSD:init: unable to mount object store
>>>>>> 2014-12-05 09:37:40.776223 7f218a7037c0 -1 ** ERROR: osd init failed: (22) Invalid argument
>>>>>> ---
>>>>>>
>>>>>> Thankfully this isn't production yet and I was eventually able to
>>>>>> recover the OSD by re-creating the journal ("ceph-osd -i 4
>>>>>> --mkjournal"), but it leaves me with a rather bad taste in my
>>>>>> mouth.
>>>>>>
>>>>>> So the pertinent questions would be:
>>>>>>
>>>>>> 1. What caused this?
>>>>>> My bet is on the evil systemd just pulling the plug before the
>>>>>> poor OSD had finished its shutdown job.
>>>>>>
>>>>>> 2. How to prevent it from happening again?
>>>>>> Is there something the Ceph developers can do with regards to init
>>>>>> scripts? Or is this something to be brought up with the Debian
>>>>>> maintainer? Debian is transitioning from sysv-init to systemd (booo!)
>>>>>> with Jessie, but the OSDs still have a sysvinit magic file in
>>>>>> their top directory. Could this have an effect on things?
>>>>>>
>>>>>> 3. Is it really that easy to trash your OSDs?
>>>>>> In the case a storage node crashes, am I to expect most if not all
>>>>>> OSDs or at least their journals to require manual loving?
>>>>>
>>>>> So this "can't happen".
>>>>
>>>> Good thing you quoted that, as it clearly did. ^o^
>>>>
>>>> Now the question of how exactly remains to be answered.
>>>>
>>>>> Being force killed definitely can't kill the
>>>>> OSD's disk state; that's the whole point of the journaling.
>>>>
>>>> The other OSDs got to the point where they logged "journal flush
>>>> done", this one didn't. Coincidence? I think not.
>>>>
>>>> Totally agree about the point of journaling being to prevent this
>>>> kind of situation of course.
>>>>
>>>>> The error
>>>>> message indicates that the header written on disk is nonsense to the
>>>>> OSD, which means that the local filesystem or disk lost something
>>>>> somehow (assuming you haven't done something silly like downgrading
>>>>> the software version it's running) and doesn't know it (if there had
>>>>> been a read error the output would be different).
>>>>
>>>> The journal is on an SSD, as stated.
>>>> And before you ask it's on an Intel DC S3700.
>>>>
>>>> This was created on 0.80.7 just a day before, so no version games.
>>>>
>>>>> I'd double-check
>>>>> your disk settings etc just to be sure, and check for known issues
>>>>> with xfs on Jessie.
>>>>>
>>>> I'm using ext4, but that shouldn't be an issue here to begin with, as
>>>> the journal is a raw SSD partition.
>>>>
>>>> Christian
>>>
>>>
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> ch...@gol.com           Global OnLine Japan/Fusion Communications
>>> http://www.gol.com/
>>
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com