Hello Greg,

On Wed, 27 May 2015 22:53:43 -0700 Gregory Farnum wrote:

> The description of the logging abruptly ending and the journal being
> bad really sounds like part of the disk is going back in time. I'm not
> sure if XFS internally is set up in such a way that something like
> losing part of its journal would allow that?
> 
I'm special. ^o^
No XFS, ext4, as stated in the original thread below.
And the (OSD) journal is a raw partition on a DC S3700.

And since there was at least a 30-second pause between the completion of
the "/etc/init.d/ceph stop" and the issuing of the shutdown command, it
seems unlikely that the abrupt end of the logging is related to the
shutdown at all.
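For the next incident I'll also capture something like the below right
before issuing the reboot, just to prove there are no stragglers (a quick
sketch, the device path is only an example):
---
# any ceph-osd processes still alive? this should print nothing
pgrep -a ceph-osd
# any open file handles on the raw journal partition?
lsof /dev/sdb1
---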

> If any of the OSD developers have the time it's conceivable a copy of
> the OSD journal would be enlightening (if e.g. the header offsets are
> wrong but there are a bunch of valid journal entries), but this is two
> reports of this issue from you and none very similar from anybody
> else. I'm still betting on something in the software or hardware stack
> misbehaving. (There aren't that many people running Debian; there are
> lots of people running Ubuntu and we find bad XFS kernels there not
> infrequently; I think you're hitting something like that.)
> 
There should be no file system involved with the raw partition SSD
journal, n'est-ce pas?
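If it helps with the "going back in time" theory, I can dump the raw
header region for inspection; something along these lines (device path is
an example, block size per the journal _open line in the logs):
---
# peek at the first 4KiB of the raw journal partition
dd if=/dev/sdb1 bs=4096 count=1 2>/dev/null | hexdump -C | head -n 20
---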

The hardware is vastly different: the previous case was on an AMD
system with onboard SATA (SP5100), this one is an SM storage goat with an
LSI 3008.

The only thing they have in common is the Ceph version, 0.80.7 (via the
Debian repository, not Ceph's), and Debian Jessie as the OS with kernel
3.16 (though there were minor updates to that between the incidents,
backported fixes).
 
A copy of the journal would consist of the entire 10GB partition, since we
don't know where in the loop it was at the time, right?
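If anybody does want it, I'd image it roughly like so (again an example
device path) and compress it before shipping it anywhere:
---
dd if=/dev/sdb1 of=/tmp/osd-30.journal bs=4M conv=noerror,sync
gzip /tmp/osd-30.journal
---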

Christian
> 
> On Sun, May 24, 2015 at 7:26 PM, Christian Balzer <ch...@gol.com> wrote:
> >
> > Hello again (marvel at my elephantine memory and thread necromancy)
> >
> > Firstly, this happened again, details below.
> > Secondly, this time I had changed things to sysv-init AND did a
> > "/etc/init.d/ceph stop", which dutifully listed all OSDs as being
> > killed/stopped, BEFORE rebooting the node.
> >
> > This is a completely new node with significantly different HW from the
> > example below, but the same SW versions as before (Debian Jessie, Ceph
> > 0.80.7).
> > And just like below/before, the logs for that OSD have nothing in them
> > indicating it shut down properly (no "journal flush done"), and when
> > coming back on reboot we get the dreaded:
> > ---
> > 2015-05-25 10:32:55.439492 7f568aa157c0  1 journal _open /var/lib/ceph/osd/ceph-30/journal fd 23: 10000269312 bytes, block size 4096 bytes, directio = 1, aio = 1
> > 2015-05-25 10:32:55.439859 7f568aa157c0 -1 journal read_header error decoding journal header
> > 2015-05-25 10:32:55.439905 7f568aa157c0 -1 filestore(/var/lib/ceph/osd/ceph-30) mount failed to open journal /var/lib/ceph/osd/ceph-30/journal: (22) Invalid argument
> > 2015-05-25 10:32:55.936975 7f568aa157c0 -1 osd.30 0 OSD:init: unable to mount object store
> > ---
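> >
> > A quick way to see which OSDs did log the flush and which didn't
> > (Debian's default log location):
> > ---
> > grep -l "journal flush done" /var/log/ceph/ceph-osd.*.log
> > ---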
> >
> > I see nothing in the changelogs for 0.80.8 and .9 that seems related to
> > this. Never mind that, from the looks of it, the Ceph repository has
> > only Wheezy (bpo70) packages and Debian Jessie is still stuck at
> > 0.80.7 (Sid just went to .9 last week).
> >
> > I'm preserving the state of things as they are for a few days, so if
> > any developer would like a peek or more details, speak up now.
> >
> > I'd open an issue, but I don't have a reliable way to reproduce this
> > and even less desire to do so on this production cluster. ^_-
> >
> > Christian
> >
> > On Sat, 6 Dec 2014 12:48:25 +0900 Christian Balzer wrote:
> >
> >> On Fri, 5 Dec 2014 11:23:19 -0800 Gregory Farnum wrote:
> >>
> >> > On Thu, Dec 4, 2014 at 7:03 PM, Christian Balzer <ch...@gol.com>
> >> > wrote:
> >> > >
> >> > > Hello,
> >> > >
> >> > > This morning I decided to reboot a storage node (Debian Jessie,
> >> > > thus 3.16 kernel and Ceph 0.80.7, HDD OSDs with SSD journals)
> >> > > after applying some changes.
> >> > >
> >> > > It came back up one OSD short, the last log lines before the
> >> > > reboot are:
> >> > > ---
> >> > > 2014-12-05 09:35:27.700330 7f87e789c700  2 -- 10.0.8.21:6823/29520 >> 10.0.8.22:0/5161 pipe(0x7f881b772580 sd=247 :6823 s=2 pgs=21 cs=1 l=1 c=0x7f881f469020).fault (0) Success
> >> > > 2014-12-05 09:35:27.700350 7f87f011d700 10 osd.4 pg_epoch: 293 pg[3.316( v 289'1347 (0'0,289'1347] local-les=289 n=8 ec=5 les/c 289/289 288/288/288) [8,4,16] r=1 lpr=288 pi=276-287/1 luod=0'0 crt=289'1345 lcod 289'1346 active] cancel_copy_ops
> >> > > ---
> >> > >
> >> > > Quite obviously it didn't complete its shutdown, so
> >> > > unsurprisingly we get:
> >> > > ---
> >> > > 2014-12-05 09:37:40.278128 7f218a7037c0  1 journal _open /var/lib/ceph/osd/ceph-4/journal fd 24: 10000269312 bytes, block size 4096 bytes, directio = 1, aio = 1
> >> > > 2014-12-05 09:37:40.278427 7f218a7037c0 -1 journal read_header error decoding journal header
> >> > > 2014-12-05 09:37:40.278479 7f218a7037c0 -1 filestore(/var/lib/ceph/osd/ceph-4) mount failed to open journal /var/lib/ceph/osd/ceph-4/journal: (22) Invalid argument
> >> > > 2014-12-05 09:37:40.776203 7f218a7037c0 -1 osd.4 0 OSD:init: unable to mount object store
> >> > > 2014-12-05 09:37:40.776223 7f218a7037c0 -1 ** ERROR: osd init failed: (22) Invalid argument
> >> > > ---
> >> > >
> >> > > Thankfully this isn't production yet and I was eventually able to
> >> > > recover the OSD by re-creating the journal ("ceph-osd -i 4
> >> > > --mkjournal"), but it leaves me with a rather bad taste in my
> >> > > mouth.
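> >> > >
> >> > > For the record, the recovery sequence was roughly this (OSD id 4,
> >> > > the usual sysvinit script):
> >> > > ---
> >> > > /etc/init.d/ceph stop osd.4    # make sure it is really down
> >> > > ceph-osd -i 4 --mkjournal      # re-create the journal header
> >> > > /etc/init.d/ceph start osd.4
> >> > > ---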
> >> > >
> >> > > So the pertinent questions would be:
> >> > >
> >> > > 1. What caused this?
> >> > > My bet is on the evil systemd just pulling the plug before the
> >> > > poor OSD had finished its shutdown job.
> >> > >
> >> > > 2. How to prevent it from happening again?
> >> > > Is there something the Ceph developers can do with regard to init
> >> > > scripts? Or is this something to be brought up with the Debian
> >> > > maintainer? Debian is transitioning from sysv-init to systemd
> >> > > (booo!) with Jessie, but the OSDs still have a sysvinit magic file
> >> > > in their top directory (see the note after these questions). Could
> >> > > this have an effect on things?
> >> > >
> >> > > 3. Is it really that easy to trash your OSDs?
> >> > > If a storage node crashes, am I to expect most if not all
> >> > > OSDs, or at least their journals, to require manual loving?
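> >> > >
> >> > > For reference, the sysvinit magic file I mean above is the per-OSD
> >> > > marker in the data dir (default layout assumed):
> >> > > ---
> >> > > ls /var/lib/ceph/osd/ceph-*/sysvinit
> >> > > ---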
> >> >
> >> > So this "can't happen".
> >>
> >> Good thing you quoted that, as it clearly did. ^o^
> >>
> >> Now the question of how exactly remains to be answered.
> >>
> >> > Being force killed definitely can't kill the
> >> > OSD's disk state; that's the whole point of the journaling.
> >>
> >> The other OSDs got to the point where they logged "journal flush
> >> done", this one didn't. Coincidence? I think not.
> >>
> >> Totally agree about the point of journaling being to prevent this
> >> kind of situation of course.
> >>
> >> > The error
> >> > message indicates that the header written on disk is nonsense to the
> >> > OSD, which means that the local filesystem or disk lost something
> >> > somehow (assuming you haven't done something silly like downgrading
> >> > the software version it's running) and doesn't know it (if there had
> >> > been a read error the output would be different).
> >>
> >> The journal is on an SSD, as stated.
> >> And before you ask, it's on an Intel DC S3700.
> >>
> >> This was created on 0.80.7 just a day before, so no version games.
> >>
> >> > I'd double-check
> >> > your disk settings etc just to be sure, and check for known issues
> >> > with xfs on Jessie.
> >> >
> >> I'm using ext4, but that shouldn't be an issue here to begin with, as
> >> the journal is a raw SSD partition.
> >>
> >> Christian
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> 


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
