Re: [ceph-users] Cache tier experiences (for ample sized caches ^o^)

2015-10-07 Thread Christian Balzer

Hello,

On Wed, 07 Oct 2015 07:34:16 +0200 Loic Dachary wrote:

> Hi Christian,
> 
> Interesting use case :-) How many OSDs / hosts do you have ? And how are
> they connected together ?
>
If you look far back in the archives you'd find that design.

And of course there will be a lot of "I told you so" comments, but it
worked just as planned while it was operated within the design
specifications.

For example one of the first things I did was to have 64 VMs install
themselves automatically from a virtual CD-ROM in parallel. 
This Ceph cluster handled that w/o any slow requests and in decent time. 

To answer your question: just 2 storage nodes with 2 OSDs each (each OSD a
RAID6 array behind an Areca controller with 4GB of cache), and a replication
of 2, obviously.
Initially 3, now 6 compute nodes.
All interconnected via redundant 40Gb/s Infiniband (IPoIB), 2 ports per
server and 2 switches. 
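
For completeness, the replication of 2 is just the usual pool-level setting;
the pool name "rbd" below is only an example:
---
# Pool-level replication of 2; "rbd" is just an example pool name.
# min_size 1 lets I/O continue with a single surviving copy.
ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1
---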

While the low number of OSDs is obviously part of the problem here, this is
masked by the journal SSDs and the large HW cache in the steady state.
My revised design is 6 RAID10 OSDs per node; the change to RAID10 is mostly
to accommodate the type of VMs this cluster wasn't designed for in the
first place.

My main suspects for the excessive slowness are actually the Toshiba DT
type drives used.
We only found out after deployment that these can go into a zombie mode
(20% of their usual performance for ~8 hours, if not permanently until
power cycled) after about a week of uptime.
Again, the HW cache is likely masking this for the steady state, but
asking a sick DT drive to seek (for reads) is just asking for trouble.

To illustrate this:
---
DSK |  sdd | busy 86% | read    0 | write  99 | avio 43.6 ms |
DSK |  sda | busy 12% | read    0 | write 151 | avio 4.13 ms |
DSK |  sdc | busy  8% | read    0 | write 139 | avio 2.82 ms |
DSK |  sdb | busy  7% | read    0 | write 132 | avio 2.70 ms |
---
The above is a snippet from atop on another machine here; the 4 disks are
in a RAID10.
I'm sure you can guess which one is the DT01ACA200 drive: sdb and sdc are
Hitachi HDS723020BLA642 drives and sda is a Toshiba MG03ACA200.
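
If you want to check for this on your own HW, per-device latency from
iostat (or atop, as above) gives it away quickly; something along these
lines, with the device names of course being examples:
---
# Watch per-device latency/utilization; a "zombie" drive shows await and
# %util far above its otherwise identical RAID peers under the same load.
iostat -x sda sdb sdc sdd 5

# SMART data may or may not show anything for this failure mode, but it
# is worth a look.
smartctl -a /dev/sdd
---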

I have another production cluster that originally had just 3 nodes with 8
OSDs each.
It performed much better with its MG drives.

So the new node I'm trying to phase in has these MG HDDs, and the older
ones will be replaced eventually.

Christian

[snip]

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier experiences (for ample sized caches ^o^)

2015-10-07 Thread Udo Lembke
Hi Christian,

On 07.10.2015 09:04, Christian Balzer wrote:
> 
> ...
> 
> My main suspects for the excessive slowness are actually the Toshiba DT
> type drives used.
> We only found out after deployment that these can go into a zombie mode
> (20% of their usual performance for ~8 hours, if not permanently until
> power cycled) after about a week of uptime.
> Again, the HW cache is likely masking this for the steady state, but
> asking a sick DT drive to seek (for reads) is just asking for trouble.
> 
> ...
does this mean you can reboot your OSD nodes one after the other, and your
cluster should then be fast enough for approximately one week to bring the
additional node in?

Udo


Re: [ceph-users] Cache tier experiences (for ample sized caches ^o^)

2015-10-07 Thread Christian Balzer

Hello Udo,

On Wed, 07 Oct 2015 11:40:11 +0200 Udo Lembke wrote:

> Hi Christian,
> 
> On 07.10.2015 09:04, Christian Balzer wrote:
> > 
> > ...
> > 
> > My main suspects for the excessive slowness are actually the Toshiba DT
> > type drives used.
> > We only found out after deployment that these can go into a zombie mode
> > (20% of their usual performance for ~8 hours, if not permanently until
> > power cycled) after about a week of uptime.
> > Again, the HW cache is likely masking this for the steady state, but
> > asking a sick DT drive to seek (for reads) is just asking for trouble.
> > 
> > ...
> does this mean you can reboot your OSD nodes one after the other, and your
> cluster should then be fast enough for approximately one week to bring the
> additional node in?
> 
Actually a full shutdown (power cycle); a reboot won't "fix" that state, as
the power to the backplane stays on.
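
What I'd have to do per node is more along these lines (assuming noout is
acceptable and that the usual init scripts manage the OSDs; the exact
service commands of course depend on the setup):
---
# Keep CRUSH from re-balancing data while the node is down.
ceph osd set noout

# Stop the OSDs on this node, then power it off completely; a plain
# reboot leaves the backplane (and thus the drives) powered.
service ceph stop osd       # init commands vary with the distro/setup
poweroff

# Once the node is back up and its OSDs have rejoined:
ceph osd unset noout
---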

And even if the drives were at full speed, at this point in time (2x over
the planned capacity) I'm not sure that would be enough.

6 months and 140 VMs earlier I might have just tried that; now I'm looking
for something that is going to work 100%, no ifs or whens.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Cache tier experiences (for ample sized caches ^o^)

2015-10-06 Thread Loic Dachary
Hi Christian,

Interesting use case :-) How many OSDs / hosts do you have ? And how are they 
connected together ?

Cheers

On 07/10/2015 04:58, Christian Balzer wrote:
> 
> Hello,
> 
> a bit of back story first; it may prove educational for others and future
> generations.
> 
> As some may recall, I have a firefly production cluster with a storage node
> design that was optimized for the use case at the time, with an estimated
> capacity to support 140 VMs (all running the same OS/application, thus the
> predictable usage pattern).
> Alas, people started running different VMs and my request for new HW was
> delayed.
> 
> So now there are 280 VMs doing nearly exclusively writes (8MB/s, 1000 ceph
> ops), and the ceph cluster handles this steady state w/o breaking a sweat
> (avio is less than 0.01ms and "disks" are less than 5% busy), basically
> nicely validating my design for this use case. ^o^
> 
> It becomes slightly irritated when asked to do reads (like VM reboots).
> Those will drive utilization up to 100% at times, though avio is still
> reasonable at less than 5ms.
> This is also why I disabled scrubbing 9 months ago, when the cluster hit my
> expected capacity limit (and I asked for more HW).
> 
> However, when trying to add the new node (a much faster design in several
> ways) to the cluster, the resulting backfilling (triggered by merely adding
> the first OSD to the CRUSH map, without even starting it) totally kills
> things, with avio frequently over 100ms and thus VMs croaking left and
> right.
> This was of course with all the recently discussed backfill and recovery
> parameters tuned all the way down.
> 
> There simply is no maintenance window long enough to phase in that 3rd
> node. This finally got the attention of the people who approve HW orders
> and now the tack seems to be "fix it whatever it takes" ^o^
> 
> So the least invasive plan I've come up with so far is to create an
> SSD-backed cache tier pool, wait until most (hot) objects have made it in
> there and the old (now backing) pool has gone mostly quiescent, and then
> add that additional node and rebuild the older ones as planned.
> 
> The size of that SSD cache pool would be at least 80% of the total current
> data (which of course isn't all hot), so do people who have actual
> experience with cache tiers under firefly that aren't under constant
> pressure to evict things think this is feasible?
> 
> Again, I think that based on the cache size I can tune things to avoid
> evictions and flushes, but if it should start flushing things, for example,
> is that an asynchronous operation or will it impede performance of the
> cache tier?
> As in, does the flushing of an object have to be finished before it can be
> written to again?
> 
> Obviously I can't do anything about slow reads from the backing pool for
> objects that somehow didn't make it into cache yet. But while slow reads
> are not nice, it is slow WRITES that really upset the VMs and the
> application they run.
> 
> Clearly what I'm worried about here is that the old pool
> backfilling/recovering will be quite comatose (as mentioned above) during
> that time.
> 
> Regards,
> 
> Christian
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





[ceph-users] Cache tier experiences (for ample sized caches ^o^)

2015-10-06 Thread Christian Balzer

Hello,

a bit of back story first; it may prove educational for others and future
generations.

As some may recall, I have a firefly production cluster with a storage node
design that was optimized for the use case at the time, with an estimated
capacity to support 140 VMs (all running the same OS/application, thus the
predictable usage pattern).
Alas, people started running different VMs and my request for new HW was
delayed.

So now there are 280 VMs doing nearly exclusively writes (8MB/s, 1000 ceph
ops), and the ceph cluster handles this steady state w/o breaking a sweat
(avio is less than 0.01ms and "disks" are less than 5% busy), basically
nicely validating my design for this use case. ^o^

It becomes slightly irritated when asked to do reads (like VM reboots).
Those will drive utilization up to 100% at times, though avio is still
reasonable at less than 5ms.
This is also why I disabled scrubbing 9 months ago, when the cluster hit my
expected capacity limit (and I asked for more HW).
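
For reference, "disabled scrubbing" here simply means setting the
cluster-wide flags:
---
# Disable (deep-)scrubbing cluster-wide; unset the flags to re-enable.
ceph osd set noscrub
ceph osd set nodeep-scrub
---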

However, when trying to add the new node (a much faster design in several
ways) to the cluster, the resulting backfilling (triggered by merely adding
the first OSD to the CRUSH map, without even starting it) totally kills
things, with avio frequently over 100ms and thus VMs croaking left and
right.
This was of course with all the recently discussed backfill and recovery
parameters tuned all the way down.
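
For context, "tuned all the way down" means roughly the following (the
exact values are of course debatable):
---
# Throttle backfill/recovery as far down as it goes, on all OSDs.
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'

# The same settings in the [osd] section of ceph.conf, to survive restarts:
#   osd max backfills = 1
#   osd recovery max active = 1
#   osd recovery op priority = 1
---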

There simply is no maintenance window long enough to phase in that 3rd
node. This finally got the attention of the people who approve HW orders
and now the tack seems to be "fix it whatever it takes" ^o^

So the least invasive plan I've come up with so far is to create an
SSD-backed cache tier pool, wait until most (hot) objects have made it in
there and the old (now backing) pool has gone mostly quiescent, and then
add that additional node and rebuild the older ones as planned.
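
In terms of commands, the plan would look roughly like this under firefly
(the pool names and PG count are placeholders, and the cache pool would
naturally need a CRUSH ruleset putting it on the SSDs):
---
# Create the SSD pool and attach it as a writeback cache tier in front
# of the existing backing pool ("rbd" here is a placeholder).
ceph osd pool create cache-pool 1024
ceph osd tier add rbd cache-pool
ceph osd tier cache-mode cache-pool writeback
ceph osd tier set-overlay rbd cache-pool

# Firefly needs a hit set configured for the tiering agent to do anything.
ceph osd pool set cache-pool hit_set_type bloom
ceph osd pool set cache-pool hit_set_count 1
ceph osd pool set cache-pool hit_set_period 3600
---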

The size of that SSD cache pool would be at least 80% of the total current
data (which of course isn't all hot), so do people who have actual
experience with cache tiers under firefly that aren't under constant
pressure to evict things think this is feasible?

Again, I think that based on the cache size I can tune things to avoid
evictions and flushes, but if it should start flushing things, for example,
is that an asynchronous operation or will it impede performance of the
cache tier?
As in, does the flushing of an object have to be finished before it can be
written to again?
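
The knobs I have in mind for that are along these lines (the values are
purely illustrative):
---
# Tell the tiering agent how big the cache is and keep the flush/evict
# thresholds high, so they should rarely (if ever) trigger.
ceph osd pool set cache-pool target_max_bytes 4398046511104  # 4 TiB, illustrative
ceph osd pool set cache-pool cache_target_dirty_ratio 0.8
ceph osd pool set cache-pool cache_target_full_ratio 0.95
---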

Obviously I can't do anything about slow reads from the backing pool for
objects that somehow didn't make it into cache yet. But while slow reads
are not nice, it is slow WRITES that really upset the VMs and the
application they run.

Clearly what I'm worried about here is that the old pool
backfilling/recovering will be quite comatose (as mentioned above) during
that time.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/