Hi Nick,

I didn't test with a colocated journal. I figure Ceph knows what it's
doing with the journal device, and since the journal is a raw partition
with no filesystem, there's no XFS journal, file metadata, etc.
producing the small random sync writes that make caching worthwhile.

I tested bcache and journals together on some SAS SSDs (rados bench was
okay, but real clients saw really low bandwidth), then journals on NVMe
(P3700) with bcache on the SAS SSDs, and also both on the NVMe. The
performance seems slightly better with everything on the NVMe (the HDDs
being the bottleneck... tests in VMs show the same, but rados bench
looks a tiny bit better). The bcache cache partition is shared by the
OSDs, and the journals are separate partitions.
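
In case the layout part is unclear, the bcache side is basically just
this (device names and partition numbers here are made up, not my exact
ones):

    # nvme0n1p1, nvme0n1p2, ... = raw journal partitions, one per osd
    # nvme0n1p5                 = one big cache partition shared by all osds
    make-bcache -C /dev/nvme0n1p5 -B /dev/sdb /dev/sdc     # creating cache and backing devices together auto-attaches them
    echo writeback > /sys/block/bcache0/bcache/cache_mode  # default is writethrough
    echo writeback > /sys/block/bcache1/bcache/cache_mode
    # the osd filestore then goes on /dev/bcache0, /dev/bcache1, ...
    # and each osd's journal points at its raw NVMe partition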

I'm not sure it's really triple write overhead. bcache doesn't write
all your data to the writeback cache; it only caches the small sync
writes, and only as long as the cache doesn't fill up or get too busy
(based on await). And the bcache device flushes to the HDD very slowly,
without overloading it (unless the cache is full). When I make it flush
faster, it seems to finish more quickly than the same writes would
without bcache (as if it writes more sequentially, or without sync; I
didn't really measure it... I just watched, e.g., 400 MB of dirty data
flush in about 20 seconds). And if you overwrite the same data a few
times (like a filesystem journal, or some fs metadata), it shouldn't
have to be written to the HDD more than once in the end. Maybe that
means something small like leveldb rarely gets written to the HDD at all.
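
If you want to watch or poke that yourself, the relevant bcache knobs
live in sysfs; something like this (bcache0 here is just an example
device name):

    cat /sys/block/bcache0/bcache/dirty_data              # how much is waiting to be flushed to the hdd
    cat /sys/block/bcache0/bcache/writeback_percent       # target % of the cache kept dirty (default 10)
    echo 0 > /sys/block/bcache0/bcache/writeback_percent  # flush everything as fast as it can
    cat /sys/block/bcache0/bcache/state                   # goes from "dirty" to "clean" when it's done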

And it's not just a write cache. The default writeback_percent is 10,
which means only about 10% of the cache is kept dirty and the rest acts
as a read cache. It keeps read statistics so it knows which data is the
most popular; my nodes right now show 33-44% cache hits (the cache is
too small, I think). bcache also reorders writes on the cache device so
they are sequential, and it can write to the cache and the backing
device at the same time, so in specific situations (mixed sequential
and random, and only until the cache fills) it can actually be faster
than a pure SSD.
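
The hit numbers come straight from sysfs too, e.g. (again, bcache0 is
just an example device):

    cat /sys/block/bcache0/bcache/stats_total/cache_hit_ratio      # also stats_five_minute, stats_hour, stats_day
    cat /sys/block/bcache0/bcache/stats_total/cache_bypass_hits    # large/sequential IO that skipped the cache
    cat /sys/block/bcache0/bcache/stats_total/cache_bypass_misses
    cat /sys/block/bcache0/bcache/sequential_cutoff                # IO bigger than this bypasses the cache (default 4M)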

I think I owe you another graph later, once I put all my VMs on there
(I probably finally fixed my issue with RBD snapshots hanging VMs...
worked around it by disabling exclusive-lock, object-map and
fast-diff). The bandwidth-hungry VMs (which hung the most often) were
moved shortly after the bcache change, so it's hard to say how much of
what the graphs show is due to bcache... it was easier to see with
iostat, while switching things over and still having a mix of cached
and uncached OSDs, than in ganglia afterwards.
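
For reference, the snapshot workaround was just turning those features
off on the affected images, something like this (the pool/image name is
only an example):

    rbd info rbd/vm-disk-1                           # shows which features are enabled
    rbd feature disable rbd/vm-disk-1 fast-diff      # fast-diff depends on object-map, so it goes first
    rbd feature disable rbd/vm-disk-1 object-map     # and object-map depends on exclusive-lock
    rbd feature disable rbd/vm-disk-1 exclusive-lock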

Peter

On 03/23/17 21:18, Nick Fisk wrote:
>
> Hi Peter,
>
>  
>
> Interesting graph. Out of interest, when you use bcache, do you then
> just leave the journal collocated on the combined bcache device and
> rely on the writeback to provide journal performance, or do you still
> create a separate partition on whatever SSD/NVME you use, effectively
> giving triple write overhead?
>
>  
>
> Nick
>
>  
>
> *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
> Behalf Of *Peter Maloney
> *Sent:* 22 March 2017 10:06
> *To:* Alex Gorbachev <a...@iss-integration.com>; ceph-users
> <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Preconditioning an RBD image
>
>  
>
> Does iostat (eg.  iostat -xmy 1 /dev/sd[a-z]) show high util% or await
> during these problems?
>
> Ceph filestore requires lots of metadata writing (directory splitting,
> for example), xattrs, leveldb, etc., which are small sync writes that
> HDDs are bad at (100-300 iops) and SSDs are good at (a cheap one does
> maybe 6k iops, and a not-so-crazy DC or NVMe one 20-200k iops or more).
> So in theory these things are mitigated by using an SSD, like bcache
> under your OSD device. You could also try something like that, at least to test.
>
> I have tested bcache in writeback mode and saw hugely obvious
> differences in iostat. For example, here's my before and after (the
> heavier load is from converting around week 49-50 or so, and the
> highest spikes are the scrub infinite-loop bug in 10.2.3):
>
> http://www.brockmann-consult.de/ganglia/graph.php?cs=10%2F25%2F2016+10%3A27&ce=03%2F09%2F2017+17%3A26&z=xlarge&hreg[]=ceph.*&mreg[]=sd[c-z]_await&glegend=show&aggregate=1&x=100
>
> But when you share a cache device, you get a single point of failure
> (and bcache, like all software, can be assumed to have bugs too). And
> I recommend vanilla kernel 4.9 or later which has many bcache fixes,
> or Ubuntu's 4.4 kernel which has the specific fixes I checked for.
>
> On 03/21/17 23:22, Alex Gorbachev wrote:
>
>     I wanted to share the recent experience, in which a few RBD
>     volumes, formatted as XFS and exported via Ubuntu
>     NFS-kernel-server performed poorly and even generated "out of
>     space" warnings on a nearly empty filesystem.  I tried a variety
>     of hacks and fixes to no effect, until things started magically
>     working just after some dd write testing.
>
>      
>
>     The only explanation I can come up with is that preconditioning,
>     or thickening, the images with this benchmarking is what caused
>     the improvement.
>
>      
>
>     Ceph is Hammer 0.94.7 running on Ubuntu 14.04, kernel 4.10 on OSD
>     nodes and 4.4 on NFS nodes.
>
>      
>
>     Regards,
>
>     Alex
>
>     Storcium
>
>     -- 
>
>     Alex Gorbachev
>
>     Storcium
>
>
>
>
>     _______________________________________________
>
>     ceph-users mailing list
>
>     ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>
>     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>  
>
>  
>
> -- 
>  
> --------------------------------------------
> Peter Maloney
> Brockmann Consult
> Max-Planck-Str. 2
> 21502 Geesthacht
> Germany
> Tel: +49 4152 889 300
> Fax: +49 4152 889 333
> E-mail: peter.malo...@brockmann-consult.de
> <mailto:peter.malo...@brockmann-consult.de>
> Internet: http://www.brockmann-consult.de
> --------------------------------------------
>

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
