On Fri, 5 Sep 2014 09:42:02 +0000 Dan Van Der Ster wrote:

> 
> > On 05 Sep 2014, at 11:04, Christian Balzer <ch...@gol.com> wrote:
> > 
> > On Fri, 5 Sep 2014 07:46:12 +0000 Dan Van Der Ster wrote:
> >> 
> >>> On 05 Sep 2014, at 03:09, Christian Balzer <ch...@gol.com> wrote:
> >>> 
> >>> On Thu, 4 Sep 2014 14:49:39 -0700 Craig Lewis wrote:
> >>> 
> >>>> On Thu, Sep 4, 2014 at 9:21 AM, Dan Van Der Ster
> >>>> <daniel.vanders...@cern.ch> wrote:
> >>>> 
[snip]
> >>>>> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful
> >>>>> is the backfilling which results from an SSD failure? Have you
> >>>>> considered tricks like increasing the down out interval so
> >>>>> backfilling doesn’t happen in this case (leaving time for the SSD
> >>>>> to be replaced)?
> >>>>> 
> >>>> 
> >>>> Replacing a failed SSD won't help your backfill.  I haven't actually
> >>>> tested it, but I'm pretty sure that losing the journal effectively
> >>>> corrupts your OSDs.  I don't know what steps are required to
> >>>> complete this operation, but it wouldn't surprise me if you need to
> >>>> re-format the OSD.
> >>>> 
> >>> This.
> >>> All the threads I've read about this indicate that journal loss
> >>> during operation means OSD loss. Total OSD loss, no recovery.
> >>> From what I gathered the developers are aware of this and it might be
> >>> addressed in the future.
> >>> 
> >> 
> >> I suppose I need to try it then. I don’t understand why you can't just
> >> use ceph-osd -i 10 --mkjournal to rebuild osd 10’s journal, for
> >> example.
> >> 
> > I think the logic is that if you shut down an OSD cleanly beforehand you
> > can just do that.
> > However, from what I gathered there is no logic to re-issue transactions
> > that made it to the journal but not the filestore.
> > So a journal SSD failing mid-operation on a busy OSD would certainly be in
> > that state.
> > 
> 
> I had thought that the journal write and the buffered filestore write
> happen at the same time. 

Nope, definitely not.

That's why we have tunables like the ones at:
http://ceph.com/docs/master/rados/configuration/filestore-config-ref/#synchronization-intervals

And people (me included) tend to crank that up (to eleven ^o^).

The write-out to the filestore may start roughly at the same time as the
journal gets things, but it can and will fall behind.

> So all the previous journal writes that
> succeeded are already on their way to the filestore. My (could be
> incorrect) understanding is that the real purpose of the journal is to
> be able to replay writes after a power outage (since the buffered
> filestore writes would be lost in that case). If there is no power
> outage, then filestore writes are still good regardless of a journal
> failure.
> 
From Ceph's perspective a write is successful once it is in the journals of
all replicas (per the pool's replica size).
I think (and hope) that what you wrote up there is true, but that doesn't
change the fact that journal data which hasn't even started its way to the
filestore yet is the crux here.
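
(Coming back to --mkjournal: for the clean-shutdown case the sequence would
presumably be something along these lines; osd.10, the init commands and the
journal device path are just placeholders for whatever your setup uses:)

    # stop the OSD cleanly and flush whatever is still only in the journal
    service ceph stop osd.10
    ceph-osd -i 10 --flush-journal
    # point the OSD at the new journal partition, then create a fresh journal
    ln -sf /dev/disk/by-partlabel/journal-10 /var/lib/ceph/osd/ceph-10/journal
    ceph-osd -i 10 --mkjournal
    service ceph start osd.10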

> 
> > I'm sure (hope) somebody from the Ceph team will pipe up about this.
> 
> Ditto!
> 
Guess it will be next week...

> 
> >>> Now 200GB DC 3700s can write close to 400MB/s so a 1:4 or even 1:5
> >>> ratio is sensible. However these will be the ones limiting your max
> >>> sequential write speed if that is of importance to you. In nearly all
> >>> use cases you run out of IOPS (on your HDDs) long before that becomes
> >>> an issue, though.
> >> 
> >> IOPS is definitely the main limit, but we also only have a single
> >> 10Gig-E NIC on these servers, so 4 drives that can write (even only
> >> 200MB/s each) would be good enough.
> >> 
> > Fair enough. ^o^
> > 
> >> Also, we’ll put the SSDs in the first four ports of an SAS2008 HBA
> >> which is shared with the other 20 spinning disks. Counting the double
> >> writes, the HBA will run out of bandwidth before these SSDs, I expect.
> >> 
> > Depends on what PCIe slot it is and so forth. A 2008 should give you
> > 4GB/s, enough to keep the SSDs happy at least. ^o^
> > 
> > A 2008 has only 8 SAS/SATA ports, so are you using port expanders on
> > your case backplane? 
> > In that case you might want to spread the SSDs out over channels, as in
> > have 3 HDDs sharing one channel with one SSD.
> 
> We use a Promise VTrak J830sS, and now I’ll go ask our hardware team if
> there would be any benefit to storing the SSDs row- or column-wise.
>
Ah, a storage pod. So you have that and a real OSD head server, something
like a 1U machine or Supermicro Twin? 
Looking at its specs I would assume 3 drives per expander, so having one SSD
mixed with 2 HDDs should definitely be beneficial.

> With the current config, when I dd to all drives in parallel I can write
> at 24*74MB/s = 1776MB/s.
> 
That's surprisingly low. As I wrote up there, a 2008 has 8 PCIe 2.0 lanes,
so as far as that bus goes, it can do 4GB/s.
And given your storage pod I assume it is connected with 2 mini-SAS
cables, 4 lanes each at 6Gb/s, making for 4x6x2 = 48Gb/s SATA bandwidth. 

How fast can your "eco 5900rpm" drives write individually? 
If it is significantly more than 74MB/s (I couldn't find any specs or
reviews of those drives on the net), I would really want to know where
that bottleneck is.
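
(Something like the following against a single, otherwise idle drive should
show its raw sequential write speed; /dev/sdX is a placeholder and this
overwrites data on that device:)

    dd if=/dev/zero of=/dev/sdX bs=4M count=2048 oflag=direct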

> > 
> >>> Raiding the journal SSDs seems wasteful given the cost and quality of
> >>> the DC 3700s. 
> >>> Configure your cluster in a way that re-balancing doesn't happen
> >>> unless you want to (when the load is low) by:
> >>> a) Setting the "mon osd downout subtree limit" so that a host going
> >>> down doesn't result in a full re-balancing and the resulting IO shit
> >>> storm. In nearly all cases nodes are recoverable, and if a node isn't,
> >>> its OSDs may be. And even if that fails, you get to pick the time for the
> >>> recovery.
> >> 
> >> This is a good point — I have it set at the rack level now. The whole
> >> node failure we experienced manifested as a device removal of all 24
> >> drives followed quickly by a hot-insert. Restarting the daemons
> >> brought those OSDs back online (though it was outside of working
> >> hours, so backfilling kicked in before anyone noticed).
> >> 
> > Lucky! ^o^
> > 
> >> 
> >>> b) As you mentioned and others have before, set the out interval so
> >>> you can react to things. 
> >> 
> >> We use 15 minutes, which is so we can reboot a host without
> >> backfilling. What do you use?
> >> 
> > I'm not using it right now, but for the cluster I'm currently deploying I
> > will go with something like 4 hours (as others here do), or more if I feel
> > that I might not be able to set the cluster to "noout" in time when
> > warranted.
> 
> Hmm, not sure I’d be comfortable with 4 hours. According to the rados
> reliability tool that would drop your durability from ~11-nines to
> ~9-nines (assuming 3 replicas, consumer drives).
> 
I'm not convinced that this tool considers all the variables; see also the
recent "Best practice K/M-parameters EC pool" thread.

But of course delaying recovery indeed increases risk. 
The art here is to find the point where indiscriminate backfilling becomes
more of a problem than delaying action until a knowledgeable human can
take a look at things. 
And that point will vary wildly depending on the hardware, cluster size
and availability of said humans. 
Since you said that single OSD failures have basically no impact, you might
want to leave it at that then.

If your cluster can indeed recover (backfill) the loss of a single node
faster than somebody can be on the scene and either fix it or declare it a
lost cause for now, you might want to leave the "mon osd downout subtree
limit" at rack level. 
Or change it via cron for times when nobody is onsite. ^o^
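
For reference, the knobs being discussed look roughly like this in ceph.conf
(the values are examples of what is mentioned above, not a recommendation,
and the cron entries assume an /etc/cron.d style file):

    [mon]
    # how long to wait before automatically marking a down OSD "out"
    # (4 hours here; the upstream default is 300 seconds)
    mon osd down out interval = 14400
    # failures of a whole subtree at or above this CRUSH level are not
    # auto-marked out (default is "rack")
    mon osd downout subtree limit = rack

    # via cron, e.g.: suppress marking-out overnight, lift it in the morning
    0 20 * * * root ceph osd set noout
    0 7  * * * root ceph osd unset noout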

> auto mark-out:     15 minutes
>     storage            durability    PL(site)  PL(copies)     PL(NRE)     PL(rep)    loss/PiB
>     ----------         ----------  ----------  ----------  ----------  ----------  ----------
>     RADOS: 3 cp          11-nines   0.000e+00   4.644e-10   0.000020%   0.000e+00   3.813e+02
> 
> 
> auto mark-out:    240 minutes
>     storage            durability    PL(site)  PL(copies)     PL(NRE)     PL(rep)    loss/PiB
>     ----------         ----------  ----------  ----------  ----------  ----------  ----------
>     RADOS: 3 cp           9-nines   0.000e+00   7.836e-08   0.000254%   0.000e+00   5.016e+04
> 
> 
> > 
> >>> c) Configure the various backfill options to have only a small
> >>> impact. Journal SSDs will improve things compared to your current
> >>> situation. And if I recall correctly, you're using a replica size of
> >>> 3 to 4, so you can afford a more sedate recovery.
> >> 
> >> It’s already at 1 backfill, 1 recovery, and the lowest queue priority
> >> (1/63) for recovery IOs.
> >> 
> > So how long does it take you then to recover 1TB in the case of a single
> > OSD failure?
> 
> Single OSD failures take us ~1 hour to backfill. The 24 OSD failure took
> ~2 hours to backfill.
> 
Impressive, even given your huge cluster with 1128 OSDs.
However, that doesn't really answer my question: how much data is on an
average OSD and thus gets backfilled in that hour?
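
(For anyone following along, the throttling Dan describes corresponds to
something like this in ceph.conf, section placement assumed:)

    [osd]
    osd max backfills = 1
    osd recovery max active = 1
    # recovery IO priority, on a scale of 1 (lowest) to 63
    osd recovery op priority = 1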

Regards,

Christian
> > And is that setting still impacting your performance more than you'd
> > like?
> 
> Single failures are transparent, but the 24-OSD failure was noticeable.
> Journal SSDs will improve the situation, like you said, so a 5-OSD failure
> would probably be close to transparent.
> 
> 
> > 
> >>> Journals on a filesystem go against KISS. 
> >>> Not only do you add one more layer of complexity that can fail (and
> >>> filesystems do have bugs as people were reminded when Firefly came
> >>> out), you're also wasting CPU cycles that might be needed over in the
> >>> less-than-optimal OSD code. ^o^
> >>> And you gain nothing from putting journals on a filesystem.
> >> 
> >> Well the gains that I had in mind resulted from my assumption that you
> >> can create a new empty journal on another device, then restart the
> >> OSD. If that’s not possible, then I agree there are no gains to speak
> >> of.
> >> 
> > Can always create a new partition as well, if there is enough space.
> 
> True..
> 
> Cheers, Dan
> 
> 
> > 
> > Regards,
> > 
> > Christian
> > 
> >> 
> >>> You might want to look into cache pools (and dedicated SSD servers
> >>> with fast controllers and CPUs) in your test cluster and for the
> >>> future. Right now my impression is that there is quite a bit more
> >>> polishing to be done (retention of hot objects, etc) and there have
> >>> been stability concerns raised here.
> >> 
> >> Right, Greg already said publicly not to use the cache tiers for RBD.
> >> 
> >> Thanks for your thorough response… you’ve provided a lot of confidence
> >> that the traditional journal deployment is still a good or even the
> >> best option.
> >> 
> >> Cheers, Dan
> >> 
> >> 
> >>> 
> >>> Regards,
> >>> 
> >>> Christian
> >> 
> > 
> > 
> 


-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
