On 07/02/15 19:13, Shane Gibson wrote:
>
> Lionel - thanks for the feedback ... inline below ... 
>
> On 7/2/15, 9:58 AM, "Lionel Bouton" <lionel+c...@bouton.name
> <mailto:lionel+c...@bouton.name>> wrote:
>
>
>     Ouch. These spinning disks are probably a bottleneck: the regular
>     advice on this list is to use one DC SSD for 4 OSDs. You would
>     probably be better off with a dedicated partition at the beginning
>     of each OSD disk, or worse, one file on the filesystem, but either
>     should still be better than a shared spinning disk.
>
>
> I understand the benefit of journals on SSDs - but if you don't have
> them, you don't have them.  With that in mind, I'm completely open to
> any ideas on the "best structuring" of using 7200 rpm disks with
> journal/osd device types.    I'm open to playing around with
> performance testing various scenarios.  Again - we realize this is
> "less than optimal", but I would like to explore tweaking and tuning
> this setup for "the best possible performance" you can get out of it. 

It's choosing between bad and worse. To keep it simple: you write roughly
as much to the journals as to the filestores. If the device you use for
multiple journals is no better than the ones you use for filestores, you
introduce a bottleneck (each additional journal divides the available
bandwidth). To remove the bottleneck you either put one device per
journal (twice the bandwidth, but half the storage space) or use devices
with more bandwidth (both sequential and random): SSDs.
If you don't have access to SSDs, you have to reach a compromise between
available space (journal stored on the same disk as the filestore) and
performance (journal on a dedicated disk).
If you put several journals on the same disk in your current
configuration, you most probably restrict both performance and available
space. I wouldn't even try to put 2 journals on the same disk: you would
already be at the performance level of an OSD with filestore and journal
on the same disk, but you would have sacrificed one third of your storage
space.
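To put rough numbers on it (a back-of-envelope only; the ~120 MB/s per
7200 rpm spindle below is an assumed figure, measure your own drives):

    # Back-of-envelope sketch, assumed numbers, not measurements.
    disk_mb_s = 120.0   # assumed sequential throughput of one 7200 rpm disk

    # Journal and filestore co-located on each OSD disk: every client byte
    # is written twice to the same spindle, so at best half the raw rate
    # (less in practice because of seeks between the two regions).
    colocated_per_osd = disk_mb_s / 2       # ~60 MB/s, every disk stores data

    # Two OSDs sharing one dedicated 7200 rpm journal disk (3 spindles):
    # the journal disk carries both write streams, so each OSD is again
    # capped near disk_mb_s / 2, but one disk in three holds no data.
    shared_journal_per_osd = disk_mb_s / 2  # ~60 MB/s, 1/3 of capacity lost

    print(colocated_per_osd, shared_journal_per_osd)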

>
>
>     Anyway, given that you get to use 720 disks (12 disks on 60
>     servers), I'd still prefer your setup to mine (24 OSDs): even with
>     what I consider a bottleneck, your setup probably has far more
>     bandwidth ;-)
>
>
> My understanding from reading the Ceph docs was that putting the journal
> on the OSD disks was strongly considered a "very bad idea", due to the
> IO operations between the journal and the OSD disk itself creating
> contention.

Yes, this is true. But if you create even more contention elsewhere, you
are going from bad to worse.

>  Like I said - I'm open to testing this configuration ... and probably
> will.  We're finalizing our build/deployment harness right now to be
> able to modify the architecture of the OSDs with a fresh build fairly
> easily. 
>
>
>     A reaction to one of your earlier mails:
>     You said you are going to 8TB drives. The problem isn't so much
>     with the time needed to create new replicas when an OSD fails, but
>     the time to fill a freshly installed one. The rebalancing is much
>     faster when you add 4 x 2TB drives than 1 x 8TB drive.
>
>
> Why should it matter how long it takes a single drive to "fill"??

This depends. Let's assume you keep a stack of new drives as spares and
use them to replace faulty drives. With a huge number of disks it becomes
more and more likely that a replacement happens while rebalancing is
already going on elsewhere (ie: pgs trying to get back to "size"
replicas), because there is less and less time during which the whole
cluster isn't repairing something somewhere. The bigger the disks, the
more contention each replacement brings and the longer your cluster stays
in a repairing state: in extreme cases you might fall below min_size for
some pgs, which would block some IOs or even lose data.
You'll have to compute the probabilities yourself given the likely
scenario for your cluster (the risk might very well be negligible), but
larger drives are not necessarily safer.
Another issue is performance: you'll get 4x more IOPS with 4 x 2TB drives
than with a single 8TB drive. So if you have a performance target, your
money might be better spent on smaller drives.
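A rough illustration (assumed figures of ~80 IOPS and ~120 MB/s per
7200 rpm spindle; real backfill will be slower than this best case
because it is throttled and partly random IO):

    iops_per_spindle = 80
    mb_s_per_spindle = 120.0

    iops_one_8tb  = 1 * iops_per_spindle    # 80 IOPS
    iops_four_2tb = 4 * iops_per_spindle    # 320 IOPS

    # Best-case time to refill a full replacement drive if the new drive
    # itself is the only bottleneck:
    hours_8tb = 8e12 / (mb_s_per_spindle * 1e6) / 3600   # ~18.5 hours
    hours_2tb = 2e12 / (mb_s_per_spindle * 1e6) / 3600   # ~4.6 hours

    print(iops_one_8tb, iops_four_2tb, hours_8tb, hours_2tb)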

>  Please note that I'm very very new to operating Ceph, so am working
> to understand these details - and I'm certain my understanding is
> still a bit ... simplistic ... :-) 
>
> If a drive fails, wouldn't the replica copies on that drive be
> replicated across "other OSD" devices when appropriate timers/triggers
> cause those data migration/re-replications to kick off?

Yes.

>
> Subsequently, you add a new OSD and bring it online.

With 720 disks, "subsequently" might well turn into "concurrently" (see
above). Say the average practical MTBF is 3 years: you will get one
failure every day and a half on average, with the occasional run of
several failures in the same day. Will you still be able to time your OSD
creation to avoid contention while a repair is going on?
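That day-and-a-half figure is just this (very rough: it assumes
independent failures, which real drives are not):

    drives = 720
    mtbf_days = 3 * 365.0                       # ~3 year practical MTBF

    failures_per_day = drives / mtbf_days       # ~0.66
    days_between_failures = mtbf_days / drives  # ~1.5

    print(failures_per_day, days_between_failures)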

>  It's now ready to be used - and depending on your CRUSH map policies,
> will "start to fill" - yes, this process ... to "fill an entire 8TB
> drive" certainly would take a while, but that shouldn't block or
> degrade the entire cluster - since we have a replica copy set of 3 ...
> there are "two other replica copies" to service read requests.

In fact not by default: each read always goes to the primary OSD, so your
new disk is a bottleneck (unless you configure it initially to prevent it
from becoming primary).
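For what it's worth, and assuming you run Firefly or later (older
releases also need "mon osd allow primary affinity = true" on the
monitors), you can lower the primary affinity of the fresh OSD so it
isn't picked as primary while it fills, something like:

    ceph osd primary-affinity osd.123 0   # osd.123 is just a placeholder
    # ... and once backfill has finished:
    ceph osd primary-affinity osd.123 1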

>  If a replica copy is updated, which is currently in flight with the
> rebalancing to that new OSD, yes, I can see where there would be
> latency/delays/issues.   As the drive is rebalanced, is it marked
> "available" for new writes?  That would certainly cause significant
> latency with a new write request - I'd hope that during "rebalance"
> operation, that OSD disk is not marked available for new writes.

It is, for every pg already placed on it. I'm not sure what happens to a
pg currently being moved, but I assume writes are streamed to it after
the initial sync (concurrent writes would probably not make sense).

>
> Which brings me to a question ... 
>
> Are there any good documents out there that detail (preferably via a
> flow chart/diagram or similar) how the various failure/recovery
> scenarios cause "change" or "impact" to the cluster?   I've seen very
> little in regards to this, but may be digging in the wrong places?  
>
> Thank you for any follow up information that helps illuminate my
> understanding (or lack thereof) how Ceph and failure/recovery
> situations should impact a cluster... 
>
> ~~shane 
>
>
>
