Re: [ceph-users] OSD trashed by simple reboot (Debian Jessie, systemd?)

Jan Schermer Thu, 28 May 2015 01:32:36 -0700

Can you check the capacitor reading on the S3700 with smartctl? This drive has 
a non-volatile cache which *should* get flushed when power is lost; depending on 
what the hardware does on reboot, it might get flushed even when rebooting.
I just got this drive for testing yesterday and it’s a beast, but some things 
were peculiar. For example, my fio benchmark slowed down (35K IOPS -> 5K IOPS) 
after several GB written (the amount varied randomly, 5-40 GB), and then it 
would creep back up over time even under load. Disabling the write cache helps, 
no idea why.
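
A minimal sketch of how the capacitor reading could be pulled out of
"smartctl -A /dev/sdX" output. The attribute name Power_Loss_Cap_Test is an
assumption (it is what smartmontools reports for some Intel DC drives, check
your own output), and the sample text is illustrative, not from a real drive:

```python
def find_smart_attribute(smartctl_output, name):
    """Return (normalized_value, raw_value) for the named SMART attribute, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] == name:
            # Column 4 is the normalized VALUE; everything from column 10 on is RAW_VALUE.
            return int(fields[3]), " ".join(fields[9:])
    return None


# Illustrative sample only (hypothetical numbers), shaped like "smartctl -A" output.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1234
175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always       -       600 (10 1000)
"""

if __name__ == "__main__":
    print(find_smart_attribute(SAMPLE, "Power_Loss_Cap_Test"))
```

A normalized value at or below the threshold column would be the thing to
worry about; how to read the raw value is vendor-specific.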


Z.


> On 28 May 2015, at 09:22, Christian Balzer <ch...@gol.com> wrote:
> 
> 
> Hello Greg,
> 
> On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote:
> 
>> The description of the logging abruptly ending and the journal being
>> bad really sounds like part of the disk is going back in time. I'm not
>> sure if XFS internally is set up in such a way that something like
>> losing part of its journal would allow that?
>> 
> I'm special. ^o^
> No XFS, EXT4. As stated in the original thread, below.
> And the (OSD) journal is a raw partition on a DC S3700.
> 
> And since there was at least a 30-second pause between the completion of
> the "/etc/init.d/ceph stop" and the issuing of the shutdown command, the
> abrupt end of the logging seems unlikely to be related to the shutdown at
> all.
> 
>> If any of the OSD developers have the time it's conceivable a copy of
>> the OSD journal would be enlightening (if e.g. the header offsets are
>> wrong but there are a bunch of valid journal entries), but this is two
>> reports of this issue from you and none very similar from anybody
>> else. I'm still betting on something in the software or hardware stack
>> misbehaving. (There aren't that many people running Debian; there are
>> lots of people running Ubuntu and we find bad XFS kernels there not
>> infrequently; I think you're hitting something like that.)
>> 
> There should be no file system involved with the raw partition SSD
> journal, n'est-ce pas?
> 
> The hardware is vastly different, the previous case was on an AMD
> system with onboard SATA (SP5100), this one is a SM storage goat with LSI
> 3008.
> 
> The only thing they have in common is the Ceph version 0.80.7 (via the
> Debian repository, not Ceph) and Debian Jessie as OS with kernel 3.16
> (though there were minor updates on that between those incidents,
> backported fixes)
> 
> A copy of the journal would consist of the entire 10GB partition, since we
> don't know where in the loop it was at the time, right?
> 
> Christian
>> 
>> On Sun, May 24, 2015 at 7:26 PM, Christian Balzer <ch...@gol.com> wrote:
>>> 
>>> Hello again (marvel at my elephantine memory and thread necromancy)
>>> 
>>> Firstly, this happened again, details below.
>>> Secondly, this time I had changed things to sysv-init AND did a
>>> "/etc/init.d/ceph stop", which dutifully listed all OSDs as being
>>> killed/stopped, BEFORE rebooting the node.
>>> 
>>> This is a completely new node with significantly different HW than the
>>> example below.
>>> But the same SW versions as before (Debian Jessie, Ceph 0.80.7).
>>> And just like below/before the logs for that OSD have nothing in them
>>> indicating it did shut down properly (no "journal flush done") and when
>>> coming back on reboot we get the dreaded:
>>> ---
>>> 2015-05-25 10:32:55.439492 7f568aa157c0  1 journal _open /var/lib/ceph/osd/ceph-30/journal fd 23: 10000269312 bytes, block size 4096 bytes, directio = 1, aio = 1
>>> 2015-05-25 10:32:55.439859 7f568aa157c0 -1 journal read_header error decoding journal header
>>> 2015-05-25 10:32:55.439905 7f568aa157c0 -1 filestore(/var/lib/ceph/osd/ceph-30) mount failed to open journal /var/lib/ceph/osd/ceph-30/journal: (22) Invalid argument
>>> 2015-05-25 10:32:55.936975 7f568aa157c0 -1 osd.30 0 OSD:init: unable to mount object store
>>> ---
>>> 
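The "journal flush done" check described above can be sketched as a small
script. This is only an illustration: it assumes you pass in just the part of
each OSD log written before the reboot, and the function name and sample log
text are made up for the example (only the marker string comes from this
thread):

```python
def osds_missing_flush(logs, marker="journal flush done"):
    """Return the sorted names of OSDs whose pre-reboot log lacks the marker.

    `logs` maps an OSD name to the text of its log up to the reboot.
    """
    return sorted(osd for osd, text in logs.items() if marker not in text)


# Hypothetical log excerpts, shortened for the example.
SAMPLE_LOGS = {
    "osd.3": "... 0 journal flush done\n",
    "osd.4": "... active] cancel_copy_ops\n",
    "osd.5": "... 0 journal flush done\n",
}

if __name__ == "__main__":
    print(osds_missing_flush(SAMPLE_LOGS))
```

An OSD showing up in that list only means its log never reached the marker,
it does not by itself say why.
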
>>> I see nothing in the changelogs for 0.80.8 and .9 that seems related to
>>> this, never mind that from the looks of it the repository at Ceph has
>>> only Wheezy (bpo70) packages and Debian Jessie is still stuck at
>>> 0.80.7 (Sid just went to .9 last week)
>>> 
>>> I'm preserving the state of things as they are for a few days, so if
>>> any developer would like a peek or more details, speak up now.
>>> 
>>> I'd open an issue, but I don't have a reliable way to reproduce this
>>> and even less desire to do so on this production cluster. ^_-
>>> 
>>> Christian
>>> 
>>> On Sat, 6 Dec 2014 12:48:25 +0900 Christian Balzer wrote:
>>> 
>>>> On Fri, 5 Dec 2014 11:23:19 -0800 Gregory Farnum wrote:
>>>> 
>>>>> On Thu, Dec 4, 2014 at 7:03 PM, Christian Balzer <ch...@gol.com>
>>>>> wrote:
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> This morning I decided to reboot a storage node (Debian Jessie,
>>>>>> thus 3.16 kernel and Ceph 0.80.7, HDD OSDs with SSD journals)
>>>>>> after applying some changes.
>>>>>> 
>>>>>> It came back up one OSD short, the last log lines before the
>>>>>> reboot are:
>>>>>> ---
>>>>>> 2014-12-05 09:35:27.700330 7f87e789c700  2 -- 10.0.8.21:6823/29520 >> 10.0.8.22:0/5161 pipe(0x7f881b772580 sd=247 :6823 s=2 pgs=21 cs=1 l=1 c=0x7f881f469020).fault (0) Success
>>>>>> 2014-12-05 09:35:27.700350 7f87f011d700 10 osd.4 pg_epoch: 293 pg[3.316( v 289'1347 (0'0,289'1347] local-les=289 n=8 ec=5 les/c 289/289 288/288/288) [8,4,16] r=1 lpr=288 pi=276-287/1 luod=0'0 crt=289'1345 lcod 289'1346 active] cancel_copy_ops
>>>>>> ---
>>>>>> 
>>>>>> Quite obviously it didn't complete its shutdown, so
>>>>>> unsurprisingly we get:
>>>>>> ---
>>>>>> 2014-12-05 09:37:40.278128 7f218a7037c0  1 journal _open /var/lib/ceph/osd/ceph-4/journal fd 24: 10000269312 bytes, block size 4096 bytes, directio = 1, aio = 1
>>>>>> 2014-12-05 09:37:40.278427 7f218a7037c0 -1 journal read_header error decoding journal header
>>>>>> 2014-12-05 09:37:40.278479 7f218a7037c0 -1 filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal /var/lib/ceph/osd/ceph-4/journal: (22) Invalid argument
>>>>>> 2014-12-05 09:37:40.776203 7f218a7037c0 -1 osd.4 0 OSD:init: unable to mount object store
>>>>>> 2014-12-05 09:37:40.776223 7f218a7037c0 -1 ** ERROR: osd init failed: (22) Invalid argument
>>>>>> ---
>>>>>> 
>>>>>> Thankfully this isn't production yet and I was eventually able to
>>>>>> recover the OSD by re-creating the journal ("ceph-osd -i 4
>>>>>> --mkjournal"), but it leaves me with a rather bad taste in my
>>>>>> mouth.
>>>>>> 
>>>>>> So the pertinent questions would be:
>>>>>> 
>>>>>> 1. What caused this?
>>>>>> My bet is on the evil systemd just pulling the plug before the
>>>>>> poor OSD had finished its shutdown job.
>>>>>> 
>>>>>> 2. How to prevent it from happening again?
>>>>>> Is there something the Ceph developers can do with regards to init
>>>>>> scripts? Or is this something to be brought up with the Debian
>>>>>> maintainer? Debian is transitioning from sysv-init to systemd
>>>>>> (booo!) with Jessie, but the OSDs still have a sysvinit magic file
>>>>>> in their top directory. Could this have an effect on things?
>>>>>> 
>>>>>> 3. Is it really that easy to trash your OSDs?
>>>>>> If a storage node crashes, am I to expect most if not all
>>>>>> OSDs, or at least their journals, to require manual loving?
>>>>> 
>>>>> So this "can't happen".
>>>> 
>>>> Good thing you quoted that, as it clearly did. ^o^
>>>> 
>>>> Now the question of how exactly remains to be answered.
>>>> 
>>>>> Being force killed definitely can't kill the
>>>>> OSD's disk state; that's the whole point of the journaling.
>>>> 
>>>> The other OSDs got to the point where they logged "journal flush
>>>> done", this one didn't. Coincidence? I think not.
>>>> 
>>>> Totally agree about the point of journaling being to prevent this
>>>> kind of situation of course.
>>>> 
>>>>> The error
>>>>> message indicates that the header written on disk is nonsense to the
>>>>> OSD, which means that the local filesystem or disk lost something
>>>>> somehow (assuming you haven't done something silly like downgrading
>>>>> the software version it's running) and doesn't know it (if there had
>>>>> been a read error the output would be different).
>>>> 
>>>> The journal is on an SSD, as stated.
>>>> And before you ask it's on an Intel DC S3700.
>>>> 
>>>> This was created on 0.80.7 just a day before, so no version games.
>>>> 
>>>>> I'd double-check
>>>>> your disk settings etc just to be sure, and check for known issues
>>>>> with xfs on Jessie.
>>>>> 
>>>> I'm using ext4, but that shouldn't be an issue here to begin with, as
>>>> the journal is a raw SSD partition.
>>>> 
>>>> Christian
>>> 
>>> 
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> ch...@gol.com           Global OnLine Japan/Fusion Communications
>>> http://www.gol.com/
>> 
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer                
> ch...@gol.com         Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
