Re: [ceph-users] Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack

2017-01-11 Thread Willem Jan Withagen
On 11-1-2017 08:06, Adrian Saul wrote:
> 
> I would concur having spent a lot of time on ZFS on Solaris.
> 
> ZIL will reduce the fragmentation problem a lot (because it is not
> doing intent logging into the filesystem itself which fragments the
> block allocations) and write response will be a lot better.  I would
> use different devices for L2ARC and ZIL - ZIL needs to be small and
> fast for writes (and mirrored - we have used some HGST 16G devices
> which are designed as ZILs - pricey but highly recommended) - L2ARC just
> needs to be faster for reads than your data disks, most SSDs would be
> fine for this.

I have been using ZFS on FreeBSD ever since 2006, and I really like it;
my only complaint is that it does not scale horizontally.

Ceph does a lot of sync()-type calls.
If you do not have a ZIL on SSD, then ZFS creates a ZIL on the HDDs for
the sync() writes.
Most of the documentation talks about using a separate ZIL to reliably
speed up NFS, but it actually helps ANY sync() operation.
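For illustration, moving the ZIL to a mirrored SSD log device (and adding
an L2ARC device) is roughly this; pool and device names are just
placeholders:

    zpool add tank log mirror /dev/ada1 /dev/ada2   # mirrored SLOG for sync() writes
    zpool add tank cache /dev/ada3                  # optional L2ARC read cache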

> A 14 disk RAIDZ2 is also going to be very poor for writes especially
> with SATA - you are effectively only getting one disk worth of IOPS
> for write as each write needs to hit all disks.  Without a ZIL you
> are also losing out on write IOPS for ZIL and metadata operations.

I would definitely not have used a RAIDZ2 if speed is of the utmost
importance. It has its advantages, but now you are using both ZFS's
redundancy AND the redundancy that is in Ceph.
So 2 extra HDDs in ZFS, and on top of that the Ceph redundancy.

I haven't tried a large cluster yet, but if money allows it my choice
would be 2-disk mirrors per OSD in a vdev pool, and to use that with a
ZIL on SSD. That gives you 2x the write IOPS of the disks.
Using the RAIDZ types does not give you much extra speed when there are
more spindles.
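A minimal sketch of such a pool (one mirror vdev per OSD, SLOG on an SSD
partition; all names are placeholders):

    zpool create osd0 mirror /dev/da0 /dev/da1 log /dev/gpt/zil0
    zfs create osd0/ceph    # dataset on which the OSD filestore would live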

One of the things that would be tempting is to even have only 1 disk per
vdev, and let Ceph do the rest. The problem is that you will need to
ZFS-scrub more often and repair manually, because errors will be detected
but cannot be repaired by ZFS itself.
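With single-disk vdevs you would then at least want a regular scrub plus a
check for files that came out damaged, along these lines (pool name again
a placeholder):

    zpool scrub osd0
    zpool status -v osd0   # lists files with unrecoverable checksum errors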

We have not even discussed compression in ZFS, which again is a great way
of getting more speed out of the system...
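Enabling it is a one-liner per dataset, e.g. with lz4 (dataset name is a
placeholder):

    zfs set compression=lz4 osd0/ceph
    zfs get compressratio osd0/ceph   # see what it actually buys you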

There are also some questions that I'm wondering about:
 - L2ARC uses (lots of) core memory, so do the OSDs, and then there is
the buffer cache. All of these interact and compete for free RAM.
   What mix is sensible and gets the most out of the memory you have?
   (A sketch of capping the ARC follows below.)
 - If you have a fast ZIL, would you still need a journal in Ceph?
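On the ARC question: the usual knob is to cap the ARC, e.g. (values are
placeholders, not a recommendation):

    # FreeBSD, in /boot/loader.conf:
    vfs.zfs.arc_max="8G"
    # Linux (ZFS on Linux), in /etc/modprobe.d/zfs.conf:
    options zfs zfs_arc_max=8589934592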

Just my 2cts,
--WjW


>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Patrick Donnelly
>> Sent: Wednesday, 11 January 2017 5:24 PM
>> To: Kevin Olbrich
>> Cc: Ceph Users
>> Subject: Re: [ceph-users] Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack
>> 
>> Hello Kevin,
>> 
>> On Tue, Jan 10, 2017 at 4:21 PM, Kevin Olbrich  wrote:
>>> 5x Ceph node equipped with 32GB RAM, Intel i5, Intel DC P3700
>>> NVMe journal,
>> 
>> Is the "journal" used as a ZIL?
>> 
>>> We experienced a lot of io blocks (X requests blocked > 32 sec)
>>> when a lot of data is changed in cloned RBDs (disk imported via
>>> OpenStack Glance, cloned during instance creation by Cinder). If
>>> the disk was cloned some months ago and large software updates
>>> are applied (a lot of small files) combined with a lot of syncs,
>>> we often had a node hit suicide timeout. Most likely this is a
>>> problem with op thread count, as it is easy to block threads with
>>> RAIDZ2 (RAID6) if many small operations are written to disk
>>> (again, COW is not optimal here). When recovery took place
>>> (0.020% degraded) the cluster performance was very bad - remote
>>> service VMs (Windows) were unusable. Recovery itself was using 70
>>> - 200 mb/s which was okay.
>> 
>> I would think having an SSD ZIL here would make a very large
>> difference. Probably a ZIL may have a much larger performance
>> impact than an L2ARC device. [You may even partition it and have
>> both but I'm not sure if that's normally recommended.]
>> 
>> Thanks for your writeup!
>> 
>> -- Patrick Donnelly 

Re: [ceph-users] Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack

2017-01-10 Thread Adrian Saul

I would concur having spent a lot of time on ZFS on Solaris.

ZIL will reduce the fragmentation problem a lot (because it is not doing intent 
logging into the filesystem itself which fragments the block allocations) and 
write response will be a lot better.  I would use different devices for L2ARC 
and ZIL - ZIL needs to be small and fast for writes (and mirrored - we have 
used some HGST 16G devices which are designed as ZILs - pricey but highly
recommended) - L2ARC just needs to be faster for reads than your data disks, most
SSDs would be fine for this.

A 14 disk RAIDZ2 is also going to be very poor for writes especially with SATA 
- you are effectively only getting one disk worth of IOPS for write as each 
write needs to hit all disks.  Without a ZIL you are also losing out on write 
IOPS for ZIL and metadata operations.



> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Patrick Donnelly
> Sent: Wednesday, 11 January 2017 5:24 PM
> To: Kevin Olbrich
> Cc: Ceph Users
> Subject: Re: [ceph-users] Review of Ceph on ZFS - or how not to deploy Ceph
> for RBD + OpenStack
>
> Hello Kevin,
>
> On Tue, Jan 10, 2017 at 4:21 PM, Kevin Olbrich  wrote:
> > 5x Ceph node equipped with 32GB RAM, Intel i5, Intel DC P3700 NVMe
> > journal,
>
> Is the "journal" used as a ZIL?
>
> > We experienced a lot of io blocks (X requests blocked > 32 sec) when a
> > lot of data is changed in cloned RBDs (disk imported via OpenStack
> > Glance, cloned during instance creation by Cinder).
> > If the disk was cloned some months ago and large software updates are
> > applied (a lot of small files) combined with a lot of syncs, we often
> > had a node hit suicide timeout.
> > Most likely this is a problem with op thread count, as it is easy to
> > block threads with RAIDZ2 (RAID6) if many small operations are written
> > to disk (again, COW is not optimal here).
> > When recovery took place (0.020% degraded) the cluster performance was
> > very bad - remote service VMs (Windows) were unusable. Recovery itself
> > was using
> > 70 - 200 mb/s which was okay.
>
> I would think having an SSD ZIL here would make a very large difference.
> Probably a ZIL may have a much larger performance impact than an L2ARC
> device. [You may even partition it and have both but I'm not sure if that's
> normally recommended.]
>
> Thanks for your writeup!
>
> --
> Patrick Donnelly


Re: [ceph-users] Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack

2017-01-10 Thread Patrick Donnelly
Hello Kevin,

On Tue, Jan 10, 2017 at 4:21 PM, Kevin Olbrich  wrote:
> 5x Ceph node equipped with 32GB RAM, Intel i5, Intel DC P3700 NVMe journal,

Is the "journal" used as a ZIL?

> We experienced a lot of io blocks (X requests blocked > 32 sec) when a lot
> of data is changed in cloned RBDs (disk imported via OpenStack Glance,
> cloned during instance creation by Cinder).
> If the disk was cloned some months ago and large software updates are
> applied (a lot of small files) combined with a lot of syncs, we often had a
> node hit suicide timeout.
> Most likely this is a problem with op thread count, as it is easy to block
> threads with RAIDZ2 (RAID6) if many small operations are written to disk
> (again, COW is not optimal here).
> When recovery took place (0.020% degraded) the cluster performance was very
> bad - remote service VMs (Windows) were unusable. Recovery itself was using
> 70 - 200 mb/s which was okay.

I would think having an SSD ZIL here would make a very large
difference. Probably a ZIL may have a much larger performance impact
than an L2ARC device. [You may even partition it and have both but I'm
not sure if that's normally recommended.]

Thanks for your writeup!

-- 
Patrick Donnelly


Re: [ceph-users] Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack

2017-01-10 Thread Lindsay Mathieson

On 11/01/2017 7:21 AM, Kevin Olbrich wrote:
> Read-Cache using normal Samsung PRO SSDs works very well

How did you implement the cache and measure the results?

A ZFS SSD cache will perform very badly with VM hosting and/or
distributed filesystems; the random nature of the I/O and the ARC cache
essentially render it useless. I never saw better than a 6% hit rate with
L2ARC.



Also, if used as journals or SSD tiers, Samsung PROs have shocking write
performance.


ZFS is probably not optimal for Ceph, but regardless of the underlying 
file system, with a 5 Node, 2G, Replica 3 setup you are going to see 
pretty bad write performance.


POOMA U - but I believe that linked clones, especially old ones, are
going to be pretty slow.


--
Lindsay Mathieson



[ceph-users] Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack

2017-01-10 Thread Kevin Olbrich
Dear Ceph-users,

just to make sure nobody makes the same mistake, I would like to share my
experience with Ceph on ZFS in our test lab.
ZFS is a Copy-on-Write filesystem and is IMHO suitable where data
resilience has high priority.
I work for a mid-sized datacenter in Germany and we set up a cluster using
Ceph hammer -> infernalis -> jewel 10.2.3 (upgrades during 24/7 usage).
We initially chose ZFS for its great cache (ARC) and thought it would be a
great idea to use it instead of XFS (or EXT4 when it was still supported).
Before that we were using ZFS for backup-storage JBODs with good results
(performance is great!).

We then assumed that ZFS would be a good choice for distributed / high
availability scenarios.
Since the end of 2015 I have been running OpenStack Liberty / Mitaka on top
of this cluster, and our use case was all sorts of VMs (20/80 split Win /
Linux). We have been running this cluster setup for over a year now.

Details:

   - 80x Disks (56x 500GB SATA via FC, 24x 1TB SATA via SAS) JBOD
   - All nodes (OpenStack and Ceph) on CentOS 7
   - Everything Kernel 3.10.0-x, switched to 4.4.30+ (elrepo) while upgrade
   to jewel
   - ZFSonLinux latest
   - 5x Ceph node equipped with 32GB RAM, Intel i5, Intel DC P3700 NVMe
   journal, Emulex Fiber PCIe for JBOD
   - 2x 1GBit bond per node with balance-alb (balancing via different
   MAC addresses during ARP) across two switches
   - 2x HP 2920 using 20G interconnect, then switched to 2x HP Comware 5130
   using IRF-stack with 20G interconnect
   - Nodes had a RAIDZ2 (RAID6) configuration of 14x 500GB disks (= 1 OSD
   per node) and the 24-disk JBOD had 4x RAIDZ2 (RAID6) using 6 disks each
   (= 4 OSDs, only 2 in production).
   - 90x VMs in total at the time we ended our evaluation
   - 6 OSDs in total
   - pgnum 128 x 4 pools, 512 PGs total, size 2 and min_size 1
   - OSD filled 30 - 40%, low fragmentation

We were not using 10GBit NICs because our VM traffic would not exceed 2x
1GBit per node in normal operation, as we expected a lot of 4k blocks from
Windows Remote Services (known as "terminal server").

Pros:

   - Survived two outages without a single lost object (just had to do "pg
   repair num" on 4 PGs, see the example after this list).
   KVM VMs were frozen and the guest OS kept resetting the SCSI bus until
   the cluster was back online - no broken databases (we were running MySQL,
   MSSQL and Exchange)
   - Read-Cache using normal Samsung PRO SSDs works very well
   - Together with multipathd optimal redundancy and performance
   - Deep-scrub is not needed as ZFS can scrub itself in RAIDZ1 and RAIDZ2,
   backed by checksums
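For reference, repairing an inconsistent PG is the standard procedure; the
PG id below is only an example:

    ceph health detail      # shows which PGs are inconsistent
    ceph pg repair 1.28f    # replace with the actual inconsistent PG id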

Cons:

   - Performance degrades more and more with ongoing usage (we added the
   1TB-disk JBOD to mitigate this issue) but lately hit it again.
   - Disks are at 100% utilization all the time in the 14x 500G JBODs, 30%
   at the SAS JBOD - mostly related to COW
   - Even a little bit of fragmentation results in slowdowns
   - If deep-scrub is enabled, IO gets stuck very often
   - The noout flag needs to be set to stop a recovery storm (which is bad,
   as recovery of a single 500GB OSD is fine while 6 TB takes a very long
   time); the commands are shown below
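Setting and clearing that flag is the standard command pair:

    ceph osd set noout      # prevent OSDs from being marked out (no rebalancing storm)
    ceph osd unset noout    # re-enable normal behaviour afterwards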


We moved from Hammer in 2015 to Infernalis in early 2016 and to Jewel in
Oct 2016. During the upgrade to Jewel, we moved to the elrepo.org kernel-lt
package and upgraded from kernel 3.10.0 to 4.4.30+.
The migration from Infernalis to Jewel was noticeable: most VMs were running
a lot faster, but we also saw a large increase in stuck requests. I am not
sure, but I did not notice any on Infernalis.

We experienced a lot of IO blocks (X requests blocked > 32 sec) when a lot
of data is changed in cloned RBDs (disks imported via OpenStack Glance,
cloned during instance creation by Cinder).
If the disk was cloned some months ago and large software updates are
applied (a lot of small files) combined with a lot of syncs, we often had a
node hit the suicide timeout.
Most likely this is a problem with the op thread count, as it is easy to
block threads with RAIDZ2 (RAID6) if many small operations are written to
disk (again, COW is not optimal here).
When recovery took place (0.020% degraded) the cluster performance was very
bad - remote service VMs (Windows) were unusable. Recovery itself was
running at 70-200 MB/s, which was okay.

Reads did not cause any problems. We made a lot of backups of the running
VMs during the day and performance in other VMs was only slightly lowered -
nothing we really worried about.
All in all, read performance was okay while write performance was awful as
soon as the filestore flush kicked in (= after some seconds when downloading
stuff via GBit to a VM).
Scrub and deep-scrub needed to be disabled to maintain "normal operation" -
this is the worst point about this setup.
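The cluster-wide way to disable those, for anyone wondering, is the usual
pair of flags (which of course have to be unset again at some point):

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # later: ceph osd unset noscrub / ceph osd unset nodeep-scrub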

In terms of data resilience we were very satisfied. We had one node crashing
regularly with Infernalis (we never found the reason after 3 days) before we
upgraded to Jewel, and no data was corrupted when this happened (especially
MS Exchange did not complain!).
After we upgraded to Jewel, it did not crash again. In all cases, VMs
remained fully functional.