
On Wed, 27 May 2015 12:54:04 +0200 Xavier Serrano wrote:

> Hello,
> Slow requests, blocked requests and blocked ops occur quite often
> in our cluster; too often, I'd say: several times during one day.
> I must say we are running some tests, but we are far from pushing
> the cluster to the limit (or at least, that's what I believe).
> Every time a blocked request/operation happened, restarting the
> affected OSD solved the problem.
You should open a bug with that description and a way to reproduce things,
even if only sometimes. 
Slow disks, as opposed to an overloaded network, definitely shouldn't
cause permanently blocked requests.

> Yesterday, we wanted to see if it was possible to minimize the impact
> that backfills and recovery have over normal cluster performance.
> In our case, performance dropped from 1000 cluster IOPS (approx)
> to 10 IOPS (approx) when doing some kind of recovery.
> Thus, we reduced the parameters "osd max backfills" and "osd recovery
> max active" to 1 (defaults are 10 and 15, respectively). Cluster
> performance during recovery improved to 500-600 IOPS (approx),
> and overall recovery time stayed approximately the same (surprisingly).
There are some "sleep" values for recovery and scrub as well; these help a
LOT with loaded clusters, too.
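
As a rough sketch of the throttling discussed above (Python, purely
illustrative): the helper below only builds a "ceph tell ... injectargs"
command line and a matching ceph.conf fragment, it does not run anything.
The two throttle values are the ones quoted above. The sleep option names
and values are assumptions and should be checked against the documentation
for the Ceph version in use; the helper names are hypothetical.

```python
import shlex

# Values reported in the quoted message above.
THROTTLES = {
    "osd-max-backfills": 1,
    "osd-recovery-max-active": 1,
}

# ASSUMPTION: option names and values are placeholders, not recommendations.
# Verify that they exist in your Ceph release before using them.
SLEEPS = {
    "osd-recovery-sleep": 0.1,
    "osd-scrub-sleep": 0.1,
}


def injectargs_command(options, target="osd.*"):
    """Build (but do not run) an argv list for `ceph tell <target> injectargs`."""
    args = " ".join(f"--{name} {value}" for name, value in options.items())
    return ["ceph", "tell", target, "injectargs", args]


def ceph_conf_section(options, section="osd"):
    """Render the same options as a ceph.conf fragment so they survive restarts."""
    lines = [f"[{section}]"]
    lines += [f"{name.replace('-', ' ')} = {value}" for name, value in options.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    # Print a copy-pasteable shell command and the persistent config fragment.
    print(shlex.join(injectargs_command(THROTTLES)))
    print(ceph_conf_section({**THROTTLES, **SLEEPS}))
```

Injected values are lost when an OSD restarts, which is why the sketch
also prints the ceph.conf form.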

> Since then, we have had no more slow/blocked requests/ops
> (and our tests are still running). It is too soon to say this, but
> my guess is that osds/disks in our cluster cannot cope with
> all I/O: network bandwidth is not an issue (10 GbE interconnection,
> graphs show network usage is under control all the time), but
> spindles are not high-performance (WD Green). Eventually, this might
> lead to slow/blocked requests/ops (which shouldn't occur that often).
Ah yes, I was going to comment on your HDDs earlier.
As Dan van der Ster at CERN will happily admit, using green, slow HDDs
with Ceph (and no SSD journals) is a bad idea.

You're likely to see a VAST improvement with even just 1 journal SSD (of
sufficient speed and durability) for 10 of your HDDs; a 1:5 ratio would of
course be better.
However, with 20 OSDs per node, you're likely to go from being
bottlenecked by your HDDs to being CPU limited (when dealing with lots of
small IOPS at least).
Still, better than now for sure.
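
To put numbers on those ratios, here is a minimal sizing sketch in Python.
The 20 OSDs per node and the 1:10 and 1:5 ratios come from the text above.
The per-HDD write rate is an assumed placeholder, not a measurement of
these disks, and the function names are hypothetical.

```python
import math

OSDS_PER_NODE = 20

# ASSUMPTION: placeholder sustained write rate per HDD, in MB/s.
# Measure your own disks; this number is only here to make the sketch run.
ASSUMED_HDD_WRITE_MB_S = 100


def journal_ssds_needed(osds_per_node, hdds_per_ssd):
    """Number of journal SSDs per node for a given HDD-to-SSD ratio."""
    return math.ceil(osds_per_node / hdds_per_ssd)


def ssd_write_mb_s_needed(hdds_per_ssd, hdd_write_mb_s):
    """Write rate one journal SSD must sustain so it does not become the
    bottleneck: every write to the HDDs behind it passes through it first."""
    return hdds_per_ssd * hdd_write_mb_s


if __name__ == "__main__":
    for ratio in (10, 5):
        ssds = journal_ssds_needed(OSDS_PER_NODE, ratio)
        rate = ssd_write_mb_s_needed(ratio, ASSUMED_HDD_WRITE_MB_S)
        print(f"1:{ratio} -> {ssds} SSDs per node, "
              f"each absorbing up to ~{rate} MB/s of journal writes")
```

The lower ratio costs more SSDs per node but halves the write load (and
wear) each one has to absorb, which is where "sufficient speed and
durability" comes in.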

BTW, if your monitors are just used for that function, 128GB is total and
utter overkill.
They will be fine with 16-32GB; your storage nodes will be much better
served (pagecache for hot read objects) with more RAM.
And with 20 OSDs per node 32GB is pretty close to the minimum I'd
recommend anyway.
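
For scale, a tiny Python sketch of what those figures mean per OSD. The
32GB, 128GB and 20 OSDs per node are the numbers mentioned above; the
function name is hypothetical, and the result is a simple average across
OSD daemons and pagecache, not a Ceph sizing rule.

```python
def ram_per_osd_gb(node_ram_gb, osds_per_node):
    """Average RAM available per OSD on a node, in GB (daemon plus pagecache)."""
    return node_ram_gb / osds_per_node


if __name__ == "__main__":
    for node_ram_gb in (32, 128):
        share = ram_per_osd_gb(node_ram_gb, 20)
        print(f"{node_ram_gb}GB across 20 OSDs: {share:.1f}GB per OSD")
```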

> Reducing I/O pressure caused by recovery and backfill undoubtedly
> helped on improving cluster performance during recovery, that was
> expected. But we did not expect that recovery time stayed the same...
> The only explanation for this is that, during recovery, there are
> lots of operations that fail due to a timeout, are retried several
> times, etc.
> So if disks are the bottleneck, reducing such values may help as
> well in normal cluster operation (when propagating the replicas,
> for instance). And slow/blocked requests/ops do not occur (or at
> least, occur less frequently).
> Does this make sense to you? Any other thoughts?
Very much so, see above for more thoughts.


> Thank you very much again for your time.
> - Xavier Serrano
> - LCAC, Laboratori de Càlcul
> - Departament d'Arquitectura de Computadors, UPC

Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Fusion Communications
ceph-users mailing list
