Dear Ceph-users,

just to make sure nobody makes the same mistake, I would like to share my
experience with Ceph on ZFS in our test lab.
ZFS is a Copy-on-Write filesystem and is suitable IMHO where data
resilience has high priority.
I work for a mid-sized datacenter in Germany and we set up a cluster running
Ceph Hammer -> Infernalis -> Jewel 10.2.3 (upgraded during 24/7 usage).
We initially chose ZFS for its great cache (ARC) and thought it would be a
good idea to use it instead of XFS (or EXT4, back when that was still
supported). Before that, we had been using ZFS for backup-storage JBODs with
good results (performance is great!).

We then assumed that ZFS would also be a good choice for distributed / high
availability scenarios.
Since the end of 2015 I have been running OpenStack Liberty / Mitaka on top of
this cluster, and our use case was all sorts of VMs (20/80 split Windows /
Linux).
We have been running this cluster setup for over a year now.

Details:

   - 80x Disks (56x 500GB SATA via FC, 24x 1TB SATA via SAS) JBOD
   - All nodes (OpenStack and Ceph) on CentOS 7
   - Everything on kernel 3.10.0-x, switched to 4.4.30+ (elrepo) during the
   upgrade to Jewel
   - ZFSonLinux latest
   - 5x Ceph node equipped with 32GB RAM, Intel i5, Intel DC P3700 NVMe
   journal, Emulex Fiber PCIe for JBOD
   - 2x 1GBit bond per node with balance-alb (balancing via different
   MAC addresses during ARP) across two switches
   - 2x HP 2920 using 20G interconnect, then switched to 2x HP Comware 5130
   using IRF-stack with 20G interconnect
   - Nodes had a RAIDZ2 (RAID6) configuration over 14x 500GB disks (= 1 OSD
   per node) and the 24-disk JBOD had 4x RAIDZ2 (RAID6) using 6 disks each
   (= 4 OSDs, only 2 in production).
   - 90x VMs in total at the time we ended our evaluation
   - 6 OSDs in total
   - pg_num 128 x 4 pools, 512 PGs total, size 2 and min_size 1 (a rough
   sketch of this layout follows after this list)
   - OSDs filled 30-40%, low fragmentation
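
For anyone who wants to reproduce this layout: it boils down to one RAIDZ2
pool per JBOD shelf carrying a single filestore OSD, plus a handful of small
Ceph pools. A minimal sketch (pool names, device names and the xattr setting
are examples I am adding here, not our exact commands):

    # one RAIDZ2 vdev over the 14 disks of a shelf = one large OSD per node
    zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn
    zfs create -o xattr=sa tank/ceph-osd0

    # one of the four pools: pg_num/pgp_num 128, size 2, min_size 1
    ceph osd pool create volumes 128 128
    ceph osd pool set volumes size 2
    ceph osd pool set volumes min_size 1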

We were not using 10GBit NICs because our VM traffic would not exceed 2x 1GBit
per node in normal operation; we expected mostly small 4k blocks from Windows
Remote Services (known as "terminal server").
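
For reference, the balance-alb bonds mentioned above are just the stock
CentOS 7 network-scripts setup, roughly like this (interface names and
addresses are examples, not our actual config):

    # /etc/sysconfig/network-scripts/ifcfg-bond0 (example)
    DEVICE=bond0
    TYPE=Bond
    BONDING_MASTER=yes
    BONDING_OPTS="mode=balance-alb miimon=100"
    BOOTPROTO=none
    ONBOOT=yes
    IPADDR=192.168.10.11
    PREFIX=24

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (same for eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
    ONBOOT=yes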

Pros:

   - Survived two outages without a single lost object (just had to run "ceph
   pg repair <num>" on 4 PGs; see the commands after this list). KVM VMs were
   frozen and the guest OS kept resetting the SCSI bus until the cluster was
   back online - no broken databases (we were running MySQL, MSSQL and
   Exchange)
   - Read-Cache using normal Samsung PRO SSDs works very well
   - Together with multipathd, we got optimal redundancy and performance
   - Deep-scrub is not needed as ZFS can scrub itself in RAIDZ1 and RAIDZ2,
   backed by checksums
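
To put commands to the first and last points above: the PG repairs and the
ZFS-side scrubbing were nothing more than the standard calls (the PG id and
pool name here are examples; the real ids came from ceph health detail):

    # find and repair the inconsistent PGs reported by Ceph
    ceph health detail
    ceph pg repair 0.1a

    # let ZFS verify its own checksums instead of relying on Ceph deep-scrub
    zpool scrub tank
    zpool status tank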

Cons:

   - Performance degrades more and more with ongoing usage (we added the 1TB
   disk JBOD to mitigate this issue) but lately we hit it again.
   - Disks run at 100% utilization all the time in the 14x 500G JBODs, 30% on
   the SAS JBOD - mostly related to COW
   - Even a little bit of fragmentation results in slowdowns
   - If deep-scrub is enabled, I/O gets stuck very often
   - The noout flag needs to be set to stop recovery storms (which is bad, as
   recovery of a single 500GB OSD is quick while 6 TB takes a very long
   time); see the example after this list
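
Regarding that last point, the noout handling is just the usual flag around
maintenance windows:

    # keep OSDs from being marked out (and avoid a recovery storm) during maintenance
    ceph osd set noout
    # ... do the maintenance / reboot ...
    ceph osd unset noout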


We moved from Hammer in 2015 to Infernalis in early 2016 and to Jewel in
Oct 2016. During the upgrade to Jewel, we moved to the elrepo.org kernel-lt
package and went from kernel 3.10.0 to 4.4.30+.
The migration from Infernalis to Jewel was noticeable: most VMs ran a lot
faster, but we also saw a large increase in stuck requests. I am not
completely sure, but I do not recall any on Infernalis.

We experienced a lot of blocked I/O ("X requests blocked > 32 sec") when a lot
of data was changed in cloned RBDs (images imported via OpenStack Glance,
cloned during instance creation by Cinder).
If a disk had been cloned some months earlier and large software updates were
applied (a lot of small files combined with a lot of syncs), we often had a
node hit the suicide timeout.
Most likely this is a problem with the op thread count, as it is easy to block
threads on RAIDZ2 (RAID6) when many small operations are written to disk
(again, COW is not optimal here).
When recovery took place (0.020% degraded), cluster performance was very
bad - remote service VMs (Windows) were unusable. Recovery itself ran at
70-200 MB/s, which was okay.
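
For anyone who wants to experiment with the thread and recovery settings
mentioned above, these are the knobs I mean on Jewel/filestore (values are
examples only, not a recommendation, and some may need an OSD restart to
really take effect):

    # more OSD / filestore worker threads
    ceph tell osd.* injectargs '--osd_op_threads 4 --filestore_op_threads 4'

    # throttle recovery so client I/O stays usable
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'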

Reads did not cause any problems. We made a lot of backups of the running
VMs during the day and performance in other VMs was only slightly lowered -
nothing we really worried about.
All in all, read performance was okay, while write performance was awful as
soon as the filestore flush kicked in (= after a few seconds of downloading
data via GBit into a VM).
Scrub and deep-scrub needed to be disabled to maintain "normal operation"
(see the flags below) - this is the worst point about this setup.
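
Disabling scrubbing cluster-wide is done with the two flags:

    ceph osd set noscrub
    ceph osd set nodeep-scrub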

In terms of data resilience we were very satisfied. We had one node crashing
regularly on Infernalis (we never found the reason after 3 days) before we
upgraded to Jewel, and no data was corrupted when this happened (especially
MS Exchange did not complain!).
After we upgraded to Jewel, it did not crash again. In all cases, VMs were
fully functional.

Currently we are migrating most VMs out of the cluster to shut it down (we
had some semi-productive VMs on it to get real world usage stats).

I just wanted to let you know which problems we had with Ceph on ZFS. No
doubt we made a lot of mistakes (this was our first Ceph cluster), but we
ran a lot of tests on it and would not recommend using ZFS as the
backend.

And for those interested in monitoring this type of cluster: do not use
Munin. As the disks were busy at 100% and each disk is seen three times
(2 paths combined into one mpath device), I caused a deadlock resulting in
3/4 of the nodes going offline (one of the disasters where we had Ceph
repair everything).

I hope this helps all Ceph users who are interested in the idea of running
Ceph on ZFS.

Kind regards,
Kevin Olbrich.