Re: [ceph-users] ceph , VMWare , NFS-ganesha

2018-05-29 Thread Steven Vacaroaia
Thank you all

My goal is to have an SSD-based Ceph (NVMe + SSD) cluster, so I need to
consider performance as well as reliability
(although I do realize that a performant cluster that breaks my VMware is
not ideal ;-))

It appears that NFS is the safe way to do it, but will it be the bottleneck
from a performance perspective?

Has anyone done a comparison between iSCSI and NFS?

Would the network be a bottleneck?

Many thanks

Steven

On Tue, 29 May 2018 at 11:04, Dennis Benndorf <
dennis.bennd...@googlemail.com> wrote:

> Hi,
>
> we use PetaSAN for our VMWare-Cluster. It provides an webinterface for
> management and does clustered active-active ISCSI. For us the easy
> management was the point to choose this, so we need not to think about how
> to configure ISCSI...
> Regards,
> Dennis
>
> Am 28.05.2018 um 21:42 schrieb Steven Vacaroaia:
>
> Hi,
>
> I need to design and build a storage platform that will be "consumed"
> mainly by VMWare
>
> CEPH is my first choice
>
> As far as I can see, there are 3 ways CEPH storage can be made available
> to VMWare
>
> 1. iSCSI
> 2. NFS-Ganesha
> 3. mounted rbd on a Linux NFS server
>
> Any suggestions / advice as to which one is better (and why) as well as
> links to documentation/best practices will be truly appreciated
>
> Thanks
> Steven
>
>


Re: [ceph-users] ceph , VMWare , NFS-ganesha

2018-05-29 Thread Dennis Benndorf

Hi,

we use PetaSAN for our VMware cluster. It provides a web interface for 
management and does clustered active-active iSCSI. For us the easy 
management was the deciding factor, so we don't need to think about 
how to configure iSCSI...


Regards,
Dennis

Am 28.05.2018 um 21:42 schrieb Steven Vacaroaia:

Hi,

I need to design and build a storage platform that will be "consumed" 
mainly by VMWare


CEPH is my first choice

As far as I can see, there are 3 ways CEPH storage can be made 
available to VMWare


1. iSCSI
2. NFS-Ganesha
3. mounted rbd on a Linux NFS server

Any suggestions / advice as to which one is better (and why) as well 
as links to documentation/best practices will be truly appreciated


Thanks
Steven




Re: [ceph-users] ceph , VMWare , NFS-ganesha

2018-05-29 Thread Alex Gorbachev
On Mon, May 28, 2018 at 3:42 PM, Steven Vacaroaia  wrote:
> Hi,
>
> I need to design and build a storage platform that will be "consumed" mainly
> by VMWare
>
> CEPH is my first choice
>
> As far as I can see, there are 3 ways CEPH storage can be made available to
> VMWare
>
> 1. iSCSI
> 2. NFS-Ganesha
> 3. mounted rbd on a Linux NFS server
>
> Any suggestions / advice as to which one is better (and why) as well as
> links to documentation/best practices will be truly appreciated

We use NFS with Pacemaker quite successfully, re-exporting kRBD devices
formatted with XFS.  I tried rbd-nbd as well, but performance is not good when
running sync.
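A minimal sketch of that kind of stack, assuming hypothetical device, pool, and
export names (not Alex's actual configuration):

    # Map the image with the kernel RBD client and put XFS on top
    rbd map rbd/vmware-ds1            # returns e.g. /dev/rbd0
    mkfs.xfs /dev/rbd0
    mkdir -p /export/vmware-ds1
    mount /dev/rbd0 /export/vmware-ds1

    # /etc/exports entry for the ESXi subnet, exported synchronously
    # /export/vmware-ds1  192.168.10.0/24(rw,sync,no_root_squash,no_subtree_check)
    exportfs -ra

In an HA setup, Pacemaker would manage the rbd map, the mount, the NFS server
and a floating IP as one ordered group so the whole stack fails over together.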
--
Alex Gorbachev
Storcium

>
> Thanks
> Steven
>


Re: [ceph-users] ceph , VMWare , NFS-ganesha

2018-05-29 Thread Heðin Ejdesgaard Møller
We are using the iSCSI gateway in ceph-12.2 with vsphere-6.5 as the client.
It's an active/passive setup, per LUN.
We chose this solution because that's what we could get RH support for and it 
sticks to the "no SPOF" philosophy.

Performance is ~25-30% slower than krbd mounting the same rbd image directly. 
This is based on the following:
We spun up a FC27 VM within the vmware cluster, attached a vdisk from the vmware 
datastore, and ran various fio tests.
Then we mapped the same rbd image directly and ran the same tests (of course, we 
removed the iSCSI exposure first).
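A hedged example of the kind of fio run used for such a comparison (the exact
parameters Heðin used are not given in the thread; these are illustrative, and
writing to a raw device is destructive):

    # Random-write test, run once against the VMware-attached vdisk inside the
    # guest and once against the directly mapped /dev/rbdX
    fio --name=randwrite --filename=/dev/sdb --ioengine=libaio --direct=1 \
        --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --runtime=60 --time_based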

Regards
Heðin Ejdesgaard


On mán, 2018-05-28 at 15:47 -0500, Brady Deetz wrote:
> You might look into Open vStorage as a gateway into Ceph. 
> 
> On Mon, May 28, 2018, 2:42 PM Steven Vacaroaia  wrote:
> > Hi,
> > 
> > I need to design and build a storage platform that will be "consumed" 
> > mainly by VMWare 
> > 
> > CEPH is my first choice 
> > 
> > As far as I can see, there are 3 ways CEPH storage can be made available to 
> > VMWare 
> > 
> > 1. iSCSI
> > 2. NFS-Ganesha
> > 3. mounted rbd on a Linux NFS server
> > 
> > Any suggestions / advice as to which one is better (and why) as well as 
> > links to documentation/best practices will
> > be truly appreciated 
> > 
> > Thanks
> > Steven


[ceph-users] ceph , VMWare , NFS-ganesha

2018-05-28 Thread Steven Vacaroaia
Hi,

I need to design and build a storage platform that will be "consumed"
mainly by VMWare

CEPH is my first choice

As far as I can see, there are 3 ways CEPH storage can be made available to
VMWare

1. iSCSI
2. NFS-Ganesha
3. mounted rbd on a Linux NFS server

Any suggestions / advice as to which one is better (and why) as well as
links to documentation/best practices will be truly appreciated

Thanks
Steven


Re: [ceph-users] ceph , VMWare , NFS-ganesha

2018-05-28 Thread Brady Deetz
You might look into Open vStorage as a gateway into Ceph.

On Mon, May 28, 2018, 2:42 PM Steven Vacaroaia  wrote:

> Hi,
>
> I need to design and build a storage platform that will be "consumed"
> mainly by VMWare
>
> CEPH is my first choice
>
> As far as I can see, there are 3 ways CEPH storage can be made available
> to VMWare
>
> 1. iSCSI
> 2. NFS-Ganesha
> 3. mounted rbd on a Linux NFS server
>
> Any suggestions / advice as to which one is better (and why) as well as
> links to documentation/best practices will be truly appreciated
>
> Thanks
> Steven


Re: [ceph-users] Ceph + VMWare

2016-10-18 Thread Alex Gorbachev
On Tuesday, October 18, 2016, Frédéric Nass 
wrote:

> Hi Alex,
>
> Just to know, what kind of backstore are you using within Storcium? vdisk_fileio
> or vdisk_blockio?
>
> I see your agents can handle both : http://www.spinics.net/lists/
> ceph-users/msg27817.html
>
Hi Frédéric,

We use all of them, and NFS as well, which has been performing quite well.
vdisk_fileio is a bit dangerous in write-cache mode.  Also, for some
reason, an object size of 16 MB for RBD does better with VMWare.

Storcium gives you a choice for each LUN.  The challenge has been figuring
out optimal workloads under highly varied use cases.  I see better results
with NVMe journals and write-combining HBAs, e.g. Areca.
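For reference, a hedged example of creating an RBD image with a 16 MB object
size as mentioned above (pool and image names are made up; on Hammer-era
releases the object size is given as a power-of-two order):

    # --order 24 means 2^24-byte (16 MB) objects; newer releases also accept --object-size 16M
    # size is given in MB here (2 TB)
    rbd create vmware/lun01 --size 2097152 --order 24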

Regards,
Alex

> Regards,
>
> Frédéric.
>
> Le 06/10/2016 à 16:01, Alex Gorbachev a écrit :
>
> On Wed, Oct 5, 2016 at 2:32 PM, Patrick McGarry  
>  wrote:
>
> Hey guys,
>
> Starting to buckle down a bit in looking at how we can better set up
> Ceph for VMWare integration, but I need a little info/help from you
> folks.
>
> If you currently are using Ceph+VMWare, or are exploring the option,
> I'd like some simple info from you:
>
> 1) Company
> 2) Current deployment size
> 3) Expected deployment growth
> 4) Integration method (or desired method) ex: iscsi, native, etc
>
> Just casting the net so we know who is interested and might want to
> help us shape and/or test things in the future if we can make it
> better. Thanks.
>
>
> Hi Patrick,
>
> We have Storcium certified with VMWare, and we use it ourselves:
>
> Ceph Hammer latest
>
> SCST redundant Pacemaker based delivery front ends - our agents are
> published on github
>
> EnhanceIO for read caching at delivery layer
>
> NFS v3, and iSCSI and FC delivery
>
> Our deployment size we use ourselves is 700 TB raw.
>
> Challenges are as others described, but HA and multi host access works
> fine courtesy of SCST.  Write amplification is a challenge on spinning
> disks.
>
> Happy to share more.
>
> Alex
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
>
>
>

-- 
Alex Gorbachev
Storcium


Re: [ceph-users] Ceph + VMWare

2016-10-18 Thread Frédéric Nass


Hi Alex,

Just to know, what kind of backstore are you using within Storcium? 
vdisk_fileio or vdisk_blockio?


I see your agents can handle both : 
http://www.spinics.net/lists/ceph-users/msg27817.html


Regards,

Frédéric.

Le 06/10/2016 à 16:01, Alex Gorbachev a écrit :

On Wed, Oct 5, 2016 at 2:32 PM, Patrick McGarry  wrote:

Hey guys,

Starting to buckle down a bit in looking at how we can better set up
Ceph for VMWare integration, but I need a little info/help from you
folks.

If you currently are using Ceph+VMWare, or are exploring the option,
I'd like some simple info from you:

1) Company
2) Current deployment size
3) Expected deployment growth
4) Integration method (or desired method) ex: iscsi, native, etc

Just casting the net so we know who is interested and might want to
help us shape and/or test things in the future if we can make it
better. Thanks.


Hi Patrick,

We have Storcium certified with VMWare, and we use it ourselves:

Ceph Hammer latest

SCST redundant Pacemaker based delivery front ends - our agents are
published on github

EnhanceIO for read caching at delivery layer

NFS v3, and iSCSI and FC delivery

Our deployment size we use ourselves is 700 TB raw.

Challenges are as others described, but HA and multi host access works
fine courtesy of SCST.  Write amplification is a challenge on spinning
disks.

Happy to share more.

Alex


--

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph



Re: [ceph-users] Ceph + VMWare

2016-10-11 Thread Frédéric Nass

Hi Patrick,

1) Université de Lorraine (7,000 researchers and staff members, 60,000 
students, 42 schools and education structures, 60 research labs).


2) RHCS cluster: 144 OSDs on 12 nodes for 520 TB raw capacity.
VMware clusters: 7 VMware clusters (40 ESXi hosts). The first need is 
to provide capacity-oriented storage (Ceph) to VMs running in a VMware vRA IaaS 
cluster (6 ESXi hosts).


3) Deployment growth?
RHCS cluster: the initial need was 750 TB of usable storage, and a 4x 
growth over the next 3 years is expected, to reach 1 PB of usable storage.
VMware clusters: we just started to offer an IaaS service to 
research laboratories and education structures within our university.
We can expect to host several hundred VMs in the next 2 years 
(~600-800).


4) Integration method? Clearly native.
I spent some of the last 6 months working on building an HA gateway 
cluster (iSCSI and NFS) to provide RHCS Ceph storage to our VMware IaaS 
cluster. Here are my findings:


* iSCSI ?

Gives better performance than NFS, we know that. BUT, we cannot go 
into production with iSCSI because of ESXi hosts entering a never-ending 
iSCSI 'Abort Task' loop when the Ceph cluster fails to acknowledge a 4 MB 
IO in less than 5s, resulting in VMs crashing. I've been told by a 
VMware engineer that this 5s limit cannot be raised as it's hardcoded in 
the ESXi iSCSI software initiator.
Why would an IO take more than 5s? In case of heavy load on 
the Ceph cluster, a Ceph failure scenario (network isolation, OSD 
crash), deep-scrubbing interfering with client IOs, or any combination of 
these or others I didn't think about...


What I have tested:
iSCSI active/active HA cluster. Each ESXi host sees the same datastore 
through both targets but only accesses it through one target at a time via a 
statically defined preferred path.
3 ESXi hosts work on one target, 3 ESXi hosts work on the other. If a target 
goes down, the other paths are used.


- LIO iSCSI targets with kernel RBD mapping (no cache). VAAI 
methods. Easy to configure. Delivers good performance with eager-zeroed 
virtual disks. The 'Abort Task' loop makes the ESXi hosts disconnect from the 
vCenter Server.
Restarting the target gets them back in, but some VMs certainly crashed.
- FreeBSD / FreeNAS running in KVM (on top of CentOS) mapping RBD 
images through librbd. Found that the fileio backstore was used. Found it hard 
to make HA with librbd cache. And still the 'Abort Task' loop...
- SCST ESOS targets with kernel RBD mapping (no cache). VAAI 
methods, ALUA. Easy to configure too. 'Abort Task' still happens, but the 
ESXi does not get disconnected from the vCenter Server. Still, targets 
have to be restarted to fix the situation.


* NFS ?

Gives lower performance than iSCSI, we know that too. BUT, it's 
probably the best option right now. It's very easy to make it HA with 
Pacemaker/Corosync as VMware doesn't make use of the NFS lock manager. 
Here is a good start: 
https://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/
We're still benchmarking IOPS to decide whether we can go into 
production with this infrastructure, but we're actually very satisfied 
with the HA mechanism.
Running synchronous writes on multiple VMs (on virtual disks hosted 
on NFS datastores with 'sync' exports of RBD images) while Storage 
vMotioning those disks between NFS RBD datastores and flapping the 
VIP (and thus the NFS exports) from one server to the other at the same time 
never kills any VM nor makes any datastore unavailable.
And every Storage vMotion task completes! These are excellent 
results. Note that it's important to run VMware Tools in the VMs, as the VMware 
Tools installation extends the write delay timeout on the guests' local SCSI devices.
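For illustration, a minimal sketch of that HA NFS arrangement, with hypothetical
resource names, network, and paths (a real cluster also needs fencing plus
ordering/colocation constraints for the RBD map and mount):

    # /etc/exports on both NFS heads: the RBD-backed XFS mount, exported in sync mode
    # /export/rbd-ds1  10.0.0.0/24(rw,sync,no_root_squash)

    # Floating IP the ESXi hosts mount the datastore from, managed by Pacemaker
    pcs resource create nfs_vip ocf:heartbeat:IPaddr2 ip=10.0.0.100 cidr_netmask=24 \
        op monitor interval=10s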


What I have tested:
- NFS exports in async mode sharing RBD images with XFS on top. 
Gives the best performance but, obviously, no one will want to 
use this mode in production.
- NFS exports in sync mode sharing RBD images with XFS on top. 
Gives mixed performance. We would clearly announce this type of 
storage as capacity-oriented and not performance-oriented through our IaaS service.
  As VMs cache writes, IOPS might be good enough for tier 2 or 3 
applications. We would probably be able to increase the number of IOPS 
by using more RBD images and NFS shares.
- NFS exports in sync mode sharing RBD images with ZFS (with 
compression) on top. The idea is to provide better performance by 
putting the SLOG (write journal) on fast SSD drives.
  See this real-life (love) story: 
https://virtualexistenz.wordpress.com/2013/02/01/using-zfs-storage-as-vmware-nfs-datastores-a-real-life-love-story/
  Each NFS server has 2 mirrored SSDs (RAID 1). Each NFS server 
exports partitions of this SSD volume through iSCSI.
  Each NFS server is a client of the local and the remote iSCSI target. 
The SLOG device is then made of a ZFS mirror of 2 disks: the local iSCSI 
device and the remote iSCSI device.

Re: [ceph-users] Ceph + VMWare

2016-10-07 Thread Jake Young
Hey Patrick,

I work for Cisco.

We have a 200TB cluster (108 OSDs on 12 OSD Nodes) and use the cluster for
both OpenStack and VMware deployments.

We are using iSCSI now, but it really would be much better if VMware did
support RBD natively.

We present a 1-2TB Volume that is shared between 4-8 ESXi hosts.

I have been looking for an optimal solution for a few years now, and I have
finally found something that works pretty well:

We are installing FreeNAS on a KVM hypervisor and passing through rbd
volumes as disks on a SCSI bus. We are able to add volumes dynamically (no
need to reboot FreeNAS to recognize new drives).  In FreeNAS, we are
passing the disks through directly as iSCSI targets; we are not putting the
disks into a ZFS volume.
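A hedged sketch of how an RBD image can be handed to such a KVM guest through
librbd (pool, image, and option values are made up; the thread does not show
Jake's exact QEMU/libvirt invocation):

    # Present an RBD image to the FreeNAS guest on a virtio-scsi bus via librbd
    # (the usual memory/CPU/network options for the VM are omitted)
    qemu-system-x86_64 \
        -device virtio-scsi-pci,id=scsi0 \
        -drive if=none,id=lun01,format=raw,cache=writeback,file=rbd:vmware/freenas-lun01 \
        -device scsi-hd,bus=scsi0.0,drive=lun01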

The biggest benefit to this is that VMware really likes the FreeBSD target
and all the VAAI stuff works reliably. We also get the benefit of the stability
of RBD in the QEMU client.

My next step is to create a redundant KVM host with a redundant FreeNAS VM
and see how iscsi multipath works with the ESXi hosts.

We have tried many different things and have run into all the same issues
as others have posted on this list. The general theme seems to be that most
(all?) Linux iSCSI target software and Linux NFS solutions are not very
good. The BSD OSes (FreeBSD, Solaris derivatives, etc.) do these things a
lot better, but typically lack Ceph support as well as having poor HW
compatibility (compared to Linux).

Our goal has always been to replace FC SAN with something comparable in
performance, reliability and redundancy.

Again, the best thing in the world would be for ESXi to mount rbd volumes
natively using librbd. I'm not sure if VMware is interested in this though.

Jake


On Wednesday, October 5, 2016, Patrick McGarry  wrote:

> Hey guys,
>
> Starting to buckle down a bit in looking at how we can better set up
> Ceph for VMWare integration, but I need a little info/help from you
> folks.
>
> If you currently are using Ceph+VMWare, or are exploring the option,
> I'd like some simple info from you:
>
> 1) Company
> 2) Current deployment size
> 3) Expected deployment growth
> 4) Integration method (or desired method) ex: iscsi, native, etc
>
> Just casting the net so we know who is interested and might want to
> help us shape and/or test things in the future if we can make it
> better. Thanks.
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph


Re: [ceph-users] Ceph + VMWare

2016-10-06 Thread Alex Gorbachev
On Wed, Oct 5, 2016 at 2:32 PM, Patrick McGarry  wrote:
> Hey guys,
>
> Starting to buckle down a bit in looking at how we can better set up
> Ceph for VMWare integration, but I need a little info/help from you
> folks.
>
> If you currently are using Ceph+VMWare, or are exploring the option,
> I'd like some simple info from you:
>
> 1) Company
> 2) Current deployment size
> 3) Expected deployment growth
> 4) Integration method (or desired method) ex: iscsi, native, etc
>
> Just casting the net so we know who is interested and might want to
> help us shape and/or test things in the future if we can make it
> better. Thanks.
>

Hi Patrick,

We have Storcium certified with VMWare, and we use it ourselves:

Ceph Hammer latest

SCST redundant Pacemaker based delivery front ends - our agents are
published on github

EnhanceIO for read caching at delivery layer

NFS v3, and iSCSI and FC delivery

Our deployment size we use ourselves is 700 TB raw.

Challenges are as others described, but HA and multi host access works
fine courtesy of SCST.  Write amplification is a challenge on spinning
disks.

Happy to share more.

Alex

>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph


Re: [ceph-users] Ceph + VMWare

2016-10-06 Thread Oliver Dzombic
Hi,

maybe, in fact, a clean iSCSI implementation would be better, because it is
more usable in general.

So the MS Hyper-V people could use it too.



For me, when it comes to iSCSI (we tested the tgtd module so far), the
problem is mostly the reliability part when it comes to resilience in
case the Ceph cluster changes from OK to whatever else.

So the iSCSI implementation could use some work, so that even if PGs
are changing into the backfilling/degraded/... state, things will just
continue to work. That's currently not the case.

Even more evil: the tgtd module currently does not seem to support having
ONE iSCSI target mounted by MULTIPLE VMware ESXi nodes.

So in fact you can't use it as shared storage, because you very quickly get
read locks which are never released, preventing other nodes from
using the same LUN.

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 06.10.2016 um 08:13 schrieb Daniel Schwager:
> Hi all,
> 
> we are using Ceph (jewel 10.2.2, 10GBit Ceph frontend/backend, 3 nodes, each 
> 8 OSDs and 2 journal SSDs) 
> in our VMware environment, especially for test environments and templates - 
> but currently 
> not for productive machines (because of missing FC redundancy & performance).
> 
> On our Linux-based SCST 4GBit Fibre Channel proxy, 16 ceph-rbd devices 
> (non-caching, 10 TB in total) 
> form a striped LVM volume which is published as an FC target to our 
> VMware cluster. 
> Looks fine, works stably. But currently the proxy is not redundant (only one 
> head).
> Performance is ok (a), but not as good as our IBM Storwize 3700 SAN (16 
> HDDs).
> Especially for small IOs (4k), the IBM is twice as fast as Ceph. 
> 
> Native ceph integration to VMware would be great (-:
> 
> Best regards
> Daniel
> 
> (a) Atto Benchmark screenshots - IBM Storwize 3700 vs. Ceph
> https://dtnet.storage.dtnetcloud.com/d/684b330eea/
> 
> ---
> DT Netsolution GmbH   -   Taläckerstr. 30-D-70437 Stuttgart
> Geschäftsführer: Daniel Schwager, Stefan Hörz - HRB Stuttgart 19870
> Tel: +49-711-849910-32, Fax: -932 - Mailto:daniel.schwa...@dtnet.de
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> Patrick McGarry
>> Sent: Wednesday, October 05, 2016 8:33 PM
>> To: Ceph-User; Ceph Devel
>> Subject: [ceph-users] Ceph + VMWare
>>
>> Hey guys,
>>
>> Starting to buckle down a bit in looking at how we can better set up
>> Ceph for VMWare integration, but I need a little info/help from you
>> folks.
>>
>> If you currently are using Ceph+VMWare, or are exploring the option,
>> I'd like some simple info from you:
>>
>> 1) Company
>> 2) Current deployment size
>> 3) Expected deployment growth
>> 4) Integration method (or desired method) ex: iscsi, native, etc
>>
>> Just casting the net so we know who is interested and might want to
>> help us shape and/or test things in the future if we can make it
>> better. Thanks.
>>
>>
>>


Re: [ceph-users] Ceph + VMWare

2016-10-05 Thread Daniel Schwager
Hi all,

we are using Ceph (jewel 10.2.2, 10GBit Ceph frontend/backend, 3 nodes, each 8 
OSDs and 2 journal SSDs) 
in our VMware environment, especially for test environments and templates - but 
currently 
not for productive machines (because of missing FC redundancy & performance).

On our Linux-based SCST 4GBit Fibre Channel proxy, 16 ceph-rbd devices 
(non-caching, 10 TB in total) 
form a striped LVM volume which is published as an FC target to our 
VMware cluster. 
Looks fine, works stably. But currently the proxy is not redundant (only one 
head).
Performance is ok (a), but not as good as our IBM Storwize 3700 SAN (16 
HDDs).
Especially for small IOs (4k), the IBM is twice as fast as Ceph. 
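A hedged sketch of that striped-LVM-over-kRBD stacking (image names, device
numbers, and stripe parameters are illustrative, not Daniel's exact values):

    # Map the RBD images with the kernel client
    for i in $(seq 1 16); do rbd map rbd/scst-lun$i; done

    # Build one striped logical volume across all mapped devices
    pvcreate /dev/rbd{0..15}
    vgcreate vg_scst /dev/rbd{0..15}
    lvcreate -i 16 -I 4M -l 100%FREE -n lv_vmware vg_scst   # -i stripes, -I stripe size

    # /dev/vg_scst/lv_vmware is then published as an FC target through SCST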

Native ceph integration to VMware would be great (-:

Best regards
Daniel

(a) Atto Benchmark screenshots - IBM Storwize 3700 vs. Ceph
https://dtnet.storage.dtnetcloud.com/d/684b330eea/

---
DT Netsolution GmbH   -   Taläckerstr. 30-D-70437 Stuttgart
Geschäftsführer: Daniel Schwager, Stefan Hörz - HRB Stuttgart 19870
Tel: +49-711-849910-32, Fax: -932 - Mailto:daniel.schwa...@dtnet.de

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Patrick McGarry
> Sent: Wednesday, October 05, 2016 8:33 PM
> To: Ceph-User; Ceph Devel
> Subject: [ceph-users] Ceph + VMWare
> 
> Hey guys,
> 
> Starting to buckle down a bit in looking at how we can better set up
> Ceph for VMWare integration, but I need a little info/help from you
> folks.
> 
> If you currently are using Ceph+VMWare, or are exploring the option,
> I'd like some simple info from you:
> 
> 1) Company
> 2) Current deployment size
> 3) Expected deployment growth
> 4) Integration method (or desired method) ex: iscsi, native, etc
> 
> Just casting the net so we know who is interested and might want to
> help us shape and/or test things in the future if we can make it
> better. Thanks.
> 




Re: [ceph-users] Ceph + VMWare

2016-10-05 Thread Oliver Dzombic
Hi Patrick,

we are currently trying to get Ceph running with it for a customer
(meaning our stuff = CephFS, customer stuff = VMware, on ONE Ceph cluster).

Unluckily iSCSI sucks (one OSD fails = iSCSI lock -> need to restart the
iSCSI daemon on the Ceph servers).

NFS sucks (no natural HA).

So if you can get it to run with a VMware plugin (just like, for example,
ScaleIO), there are some people out there who might want to marry you :-)

--

To your questions:

1) See below

2) 10 TB for VMware

3) 10 TB each year; impossible to give clear numbers here, since there
is currently no clean way for VMware + Ceph. If it worked (reliably),
the numbers would explode for sure.

4) native = perfect, iSCSI = OK


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 05.10.2016 um 20:32 schrieb Patrick McGarry:
> Hey guys,
> 
> Starting to buckle down a bit in looking at how we can better set up
> Ceph for VMWare integration, but I need a little info/help from you
> folks.
> 
> If you currently are using Ceph+VMWare, or are exploring the option,
> I'd like some simple info from you:
> 
> 1) Company
> 2) Current deployment size
> 3) Expected deployment growth
> 4) Integration method (or desired method) ex: iscsi, native, etc
> 
> Just casting the net so we know who is interested and might want to
> help us shape and/or test things in the future if we can make it
> better. Thanks.
> 
> 


[ceph-users] Ceph + VMWare

2016-10-05 Thread Patrick McGarry
Hey guys,

Starting to buckle down a bit in looking at how we can better set up
Ceph for VMWare integration, but I need a little info/help from you
folks.

If you currently are using Ceph+VMWare, or are exploring the option,
I'd like some simple info from you:

1) Company
2) Current deployment size
3) Expected deployment growth
4) Integration method (or desired method) ex: iscsi, native, etc

Just casting the net so we know who is interested and might want to
help us shape and/or test things in the future if we can make it
better. Thanks.


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-11 Thread Nick Fisk
> -Original Message-
> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> Sent: 11 September 2016 03:17
> To: Nick Fisk 
> Cc: Wilhelm Redbrake ; Horace Ng ; 
> ceph-users 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Confirming again much better performance with ESXi and NFS on RBD using the 
> XFS hint Nick uses, below.

Cool, I never experimented with different extent sizes, so I don't know if 
there is any performance/fragmentation benefit with larger/smaller values. I 
think storage vmotions might benefit from using striped RBDs with rbd-nbd, as 
this might get around the PG contention issue with 32 concurrent writes to the 
same PG. I want to test this out at some point.
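For reference, a hedged example of creating such a striped image (stripe
parameters are illustrative; whether they actually ease the PG contention is
exactly what Nick says is untested):

    # Spread each stripe across 16 objects in 256 KB (262144-byte) units;
    # non-default striping is a librbd feature, hence rbd-nbd rather than krbd.
    # Size is in MB (2 TB).
    rbd create vmware/striped-lun01 --size 2097152 --stripe-unit 262144 --stripe-count 16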

> 
> I saw high load averages on the NFS server nodes, corresponding to iowait, 
> does not seem to cause too much trouble so far.

Yeah I get this as well, but I think this is just a side effect of having a 
storage backend that can support a high queue depth. Every IO in flight will 
increase the load by 1. However, despite what it looks like in top, it doesn't 
actually consume any CPU, so it shouldn't cause any problems.

> 
> Here are HDtune Pro testing results from some recent runs.  The puzzling part 
> is better random IO performance with a 16 MB object size
> on both iSCSI and NFS.  In my thinking this should be slower; however, this 
> has been confirmed by the timed vmotion tests and more
> random IO tests by my coworker as well:
> 
> Test_type   read MB/s  write MB/s  read iops  write iops  read multi iops  write multi iops
> NFS 1mb     460        103         8753       66          47466            1616
> NFS 4mb     441        147         8863       82          47556            764
> iSCSI 1mb   117        76          326        90          672              938
> iSCSI 4mb   275        60          205        24          2015             1212
> NFS 16mb    455        177         7761       119         36403            3175
> iSCSI 16mb  300        65          1117       237         12389            1826
> 
> ( prettier view at
> http://storcium.blogspot.com/2016/09/latest-tests-on-nfs-vs.html )

Interesting. Are you pre-conditioning the RBDs before these tests? The only 
logical thing I can think of is that if you are writing to a new area of the 
RBD, it will be having to create the objects as it goes; larger objects would 
therefore need fewer object creates per MB.
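A hedged example of the kind of pre-conditioning being asked about, i.e. one
sequential pass so every backing object already exists before the random-IO
runs (device name is hypothetical, and the write is destructive):

    # Fill the whole device once so all RADOS objects are allocated up front
    fio --name=precondition --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
        --rw=write --bs=4M --iodepth=16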

> 
> Alex
> 
> >
> > From: Alex Gorbachev [mailto:a...@iss-integration.com]
> > Sent: 04 September 2016 04:45
> > To: Nick Fisk 
> > Cc: Wilhelm Redbrake ; Horace Ng ;
> > ceph-users 
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >
> >
> >
> >
> >
> > On Saturday, September 3, 2016, Alex Gorbachev  
> > wrote:
> >
> > HI Nick,
> >
> > On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk  wrote:
> >
> > From: Alex Gorbachev [mailto:a...@iss-integration.com]
> > Sent: 21 August 2016 15:27
> > To: Wilhelm Redbrake 
> > Cc: n...@fisk.me.uk; Horace Ng ; ceph-users
> > 
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >
> >
> >
> >
> >
> > On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:
> >
> > Hi Nick,
> > i understand all of your technical improvements.
> > But: why do you Not use a simple for example Areca Raid Controller with 8 
> > gb Cache and Bbu ontop in every ceph node.
> > Configure n Times RAID 0 on the Controller and enable Write back Cache.
> > That must be a latency "Killer" like in all the prop. Storage arrays or Not 
> > ??
> >
> > Best Regards !!
> >
> >
> >
> > What we saw specifically with Areca cards is that performance is excellent 
> > in benchmarking and for bursty loads. However, once we
> started loading with more constant workloads (we replicate databases and 
> files to our Ceph cluster), this looks to have saturated the
> relatively small Areca NVDIMM caches and we went back to pure drive based 
> performance.
> >
> >
> >
> > Yes, I think that is a valid point. Although low latency, you are still 
> > having to write to the disks twice (journal+data), so once the
> cache’s on the cards start filling up, you are going to hit problems.
> >
> >
> >
> >
> >
> > So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
> > HDDs) in hopes that it would help reduce the noisy
> neighbor impact. That worked, but now the overall latency is really high at 
> times, not always. Red Hat engineer suggested this is due to
> loading the 7200 rpm NL-SAS drives with too many IOPS, which get their 
> latency sky high. Overall we are functioning fine, but I sure
> would like storage vmotion and other large operations faster.
> >
> >
> >
> >
> >
> Yeah this is the biggest pain point I think.

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-11 Thread Alex Gorbachev
--
Alex Gorbachev
Storcium

On Sun, Sep 11, 2016 at 12:54 PM, Nick Fisk  wrote:

>
>
>
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 11 September 2016 16:14
>
> *To:* Nick Fisk 
> *Cc:* Wilhelm Redbrake ; Horace Ng ;
> ceph-users 
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sun, Sep 4, 2016 at 4:48 PM, Nick Fisk  wrote:
>
>
>
>
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 04 September 2016 04:45
> *To:* Nick Fisk 
> *Cc:* Wilhelm Redbrake ; Horace Ng ;
> ceph-users 
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
>
> On Saturday, September 3, 2016, Alex Gorbachev 
> wrote:
>
> HI Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk  wrote:
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 21 August 2016 15:27
> *To:* Wilhelm Redbrake 
> *Cc:* n...@fisk.me.uk; Horace Ng ; ceph-users <
> ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:
>
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8
> gb Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or
> Not ??
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent
> in benchmarking and for bursty loads. However, once we started loading with
> more constant workloads (we replicate databases and files to our Ceph
> cluster), this looks to have saturated the relatively small Areca NVDIMM
> caches and we went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still
> having to write to the disks twice (journal+data), so once the cache’s on
> the cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
> worked, but now the overall latency is really high at times, not always.
> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
> drives with too many IOPS, which get their latency sky high. Overall we are
> functioning fine, but I sure would like storage vmotion and other large
> operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
> if you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
> then performance is actually quite good, as the block sizes used for the
> copy are a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I
> found that using iscsi you have no control over the fragmentation of the
> vmdk’s and so the read performance is then what suffers (certainly with
> 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with
> the updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead
> settings to see if we can improve this by parallelizing reads. Also will
> test NFS, but need to determine whether to do krbd/knfsd or something more
> interesting like CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
> lot less sensitive to making config adjustments without suddenly everything
> dropping offline. The fact that you can specify the extent size on XFS
> helps massively with using thin vmdks/snapshots to avoid fragmentation.
> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
> contention when esxi tries to write 32 copy threads to the same object.
> There is probably some tuning that could be done here (RBD striping???) but
> this is the best it’s been for a long time and I’m reluctant to fiddle any
> further.
>
>
>
> We have moved ahead and added NFS support to Storcium, and are now able to run
> NFS servers with Pacemaker in HA mode (all agents are public at
> https://github.com/akurz/resource-agents/tree/master/heartbeat

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-11 Thread Nick Fisk
 

 

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 11 September 2016 16:14
To: Nick Fisk 
Cc: Wilhelm Redbrake ; Horace Ng ; ceph-users 

Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

 

On Sun, Sep 4, 2016 at 4:48 PM, Nick Fisk <n...@fisk.me.uk> wrote:

 

 

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 04 September 2016 04:45
To: Nick Fisk <n...@fisk.me.uk>
Cc: Wilhelm Redbrake <w...@globe.de>; Horace Ng <hor...@hkisl.net>; ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

 


On Saturday, September 3, 2016, Alex Gorbachev <a...@iss-integration.com> wrote:

HI Nick,

On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk  wrote:

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 21 August 2016 15:27
To: Wilhelm Redbrake 
Cc: n...@fisk.me.uk; Horace Ng ; ceph-users 

Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 



On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:

Hi Nick,
i understand all of your technical improvements.
But: why do you Not use a simple for example Areca Raid Controller with 8 gb 
Cache and Bbu ontop in every ceph node.
Configure n Times RAID 0 on the Controller and enable Write back Cache.
That must be a latency "Killer" like in all the prop. Storage arrays or Not ??

Best Regards !!

 

What we saw specifically with Areca cards is that performance is excellent in 
benchmarking and for bursty loads. However, once we started loading with more 
constant workloads (we replicate databases and files to our Ceph cluster), this 
looks to have saturated the relatively small Areca NVDIMM caches and we went 
back to pure drive based performance. 

 

Yes, I think that is a valid point. Although low latency, you are still having 
to write to the disks twice (journal+data), so once the cache’s on the cards 
start filling up, you are going to hit problems.

 

 

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
worked, but now the overall latency is really high at times, not always. Red 
Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
too many IOPS, which get their latency sky high. Overall we are functioning 
fine, but I sure would like storage vmotion and other large operations faster. 

 

 

Yeah this is the biggest pain point I think. Normal VM ops are fine, but if you 
ever have to move a multi-TB VM, it’s just too slow. 

 

If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
performance is actually quite good, as the block sizes used for the copy are a 
lot bigger. 

 

However, my use case required thin provisioned VM’s + snapshots and I found 
that using iscsi you have no control over the fragmentation of the vmdk’s and 
so the read performance is then what suffers (certainly with 7.2k disks)

 

Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
updating of the VMFS metadata, although I can’t be sure.

 

 

I am thinking I will test a few different schedulers and readahead settings to 
see if we can improve this by parallelizing reads. Also will test NFS, but need 
to determine whether to do krbd/knfsd or something more interesting like 
CephFS/Ganesha. 

 

As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
less sensitive to making config adjustments without suddenly everything 
dropping offline. The fact that you can specify the extent size on XFS helps 
massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
when esxi tries to write 32 copy threads to the same object. There is probably 
some tuning that could be done here (RBD striping???) but this is the best it’s 
been for a long time and I’m reluctant to fiddle any further.

 

We have moved ahead and added NFS support to Storcium, and are now able to run NFS 
servers with Pacemaker in HA mode (all agents are public at 
https://github.com/akurz/resource-agents/tree/master/heartbeat 
 ).  I can confirm that VM performance is definitely better and benchmarks are 
more smooth (in Windows we can see a lot of choppiness with iSCSI, NFS is 
choppy on writes, but smooth on reads, likely due to the bursty nature of OSD 
filesystems when dealing with that small IO size).

 

Were you using extsz=16384 at creation time for the filesystem?  I saw kernel 
memory deadlock messages during vmotion, such as:

 

 XFS: nfsd(102545) possible memory allocation deadlock size 40320 in kmem_alloc 
(mode:0x2400240)

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-11 Thread Alex Gorbachev
On Sun, Sep 4, 2016 at 4:48 PM, Nick Fisk  wrote:

>
>
>
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 04 September 2016 04:45
> *To:* Nick Fisk 
> *Cc:* Wilhelm Redbrake ; Horace Ng ;
> ceph-users 
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Saturday, September 3, 2016, Alex Gorbachev 
> wrote:
>
> HI Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk  wrote:
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 21 August 2016 15:27
> *To:* Wilhelm Redbrake 
> *Cc:* n...@fisk.me.uk; Horace Ng ; ceph-users <
> ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:
>
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8
> gb Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or
> Not ??
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent
> in benchmarking and for bursty loads. However, once we started loading with
> more constant workloads (we replicate databases and files to our Ceph
> cluster), this looks to have saturated the relatively small Areca NVDIMM
> caches and we went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still
> having to write to the disks twice (journal+data), so once the cache’s on
> the cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
> worked, but now the overall latency is really high at times, not always.
> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
> drives with too many IOPS, which get their latency sky high. Overall we are
> functioning fine, but I sure would like storage vmotion and other large
> operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
> if you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
> then performance is actually quite good, as the block sizes used for the
> copy are a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I
> found that using iscsi you have no control over the fragmentation of the
> vmdk’s and so the read performance is then what suffers (certainly with
> 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with
> the updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead
> settings to see if we can improve this by parallelizing reads. Also will
> test NFS, but need to determine whether to do krbd/knfsd or something more
> interesting like CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
> lot less sensitive to making config adjustments without suddenly everything
> dropping offline. The fact that you can specify the extent size on XFS
> helps massively with using thin vmdks/snapshots to avoid fragmentation.
> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
> contention when esxi tries to write 32 copy threads to the same object.
> There is probably some tuning that could be done here (RBD striping???) but
> this is the best it’s been for a long time and I’m reluctant to fiddle any
> further.
>
>
>
> We have moved ahead and added NFS support to Storcium, and are now able to run
> NFS servers with Pacemaker in HA mode (all agents are public at
> https://github.com/akurz/resource-agents/tree/master/heartbeat
> ).
> I can confirm that VM performance is definitely better and benchmarks are
> more smooth (in Windows we can see a lot of choppiness with iSCSI, NFS is
> choppy on writes, but smooth on reads, likely due to the bursty nature of
> OSD filesystems when dealing with that small IO size).
>
>
>
> Were you using extsz=16384 at creation time for the filesystem?

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-10 Thread Alex Gorbachev
Confirming again much better performance with ESXi and NFS on RBD
using the XFS hint Nick uses, below.

I saw high load averages on the NFS server nodes, corresponding to
iowait, does not seem to cause too much trouble so far.

Here are HDtune Pro testing results from some recent runs.  The
puzzling part is better random IO performance with a 16 MB object size
on both iSCSI and NFS.  In my thinking this should be slower; however,
this has been confirmed by the timed vmotion tests and more random IO
tests by my coworker as well:

Test_type read MB/s write MB/s read iops write iops read multi iops
write multi iops
NFS 1mb 460 103 8753 66 47466 1616
NFS 4mb 441 147 8863 82 47556 764
iSCSI 1mb 117 76 326 90 672 938
iSCSI 4mb 275 60 205 24 2015 1212
NFS 16mb 455 177 7761 119 36403 3175
iSCSI 16mb 300 65 1117 237 12389 1826

( prettier view at
http://storcium.blogspot.com/2016/09/latest-tests-on-nfs-vs.html )

Alex

>
> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> Sent: 04 September 2016 04:45
> To: Nick Fisk 
> Cc: Wilhelm Redbrake ; Horace Ng ; 
> ceph-users 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Saturday, September 3, 2016, Alex Gorbachev  
> wrote:
>
> HI Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk  wrote:
>
> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> Sent: 21 August 2016 15:27
> To: Wilhelm Redbrake 
> Cc: n...@fisk.me.uk; Horace Ng ; ceph-users 
> 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:
>
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8 gb 
> Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or Not ??
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent in 
> benchmarking and for bursty loads. However, once we started loading with more 
> constant workloads (we replicate databases and files to our Ceph cluster), 
> this looks to have saturated the relatively small Areca NVDIMM caches and we 
> went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still 
> having to write to the disks twice (journal+data), so once the cache’s on the 
> cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
> HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
> worked, but now the overall latency is really high at times, not always. Red 
> Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
> too many IOPS, which get their latency sky high. Overall we are functioning 
> fine, but I sure would like storage vmotion and other large operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but if 
> you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
> performance is actually quite good, as the block sizes used for the copy are 
> a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I found 
> that using iscsi you have no control over the fragmentation of the vmdk’s and 
> so the read performance is then what suffers (certainly with 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
> updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead settings 
> to see if we can improve this by parallelizing reads. Also will test NFS, but 
> need to determine whether to do krbd/knfsd or something more interesting like 
> CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
> less sensitive to making config adjustments without suddenly everything 
> dropping offline. The fact that you can specify the extent size on XFS helps 
> massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
> v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
> when esxi tries to write 32 copy threads to the same object. There is 
> probably some tuning that could be done here (RBD striping???) but this is 
> the best it’s been for a long time and I’m reluctant to fiddle any further.
>
>
>
> W

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-04 Thread Nick Fisk
 

 

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 04 September 2016 04:45
To: Nick Fisk 
Cc: Wilhelm Redbrake ; Horace Ng ; ceph-users 

Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 



On Saturday, September 3, 2016, Alex Gorbachev <a...@iss-integration.com> wrote:

HI Nick,

On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk wrote:

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 21 August 2016 15:27
To: Wilhelm Redbrake
Cc: n...@fisk.me.uk; Horace Ng; ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 



On Sunday, August 21, 2016, Wilhelm Redbrake wrote:

Hi Nick,
i understand all of your technical improvements.
But: why do you Not use a simple for example Areca Raid Controller with 8 gb 
Cache and Bbu ontop in every ceph node.
Configure n Times RAID 0 on the Controller and enable Write back Cache.
That must be a latency "Killer" like in all the prop. Storage arrays or Not ??

Best Regards !!

 

What we saw specifically with Areca cards is that performance is excellent in 
benchmarking and for bursty loads. However, once we started loading with more 
constant workloads (we replicate databases and files to our Ceph cluster), this 
looks to have saturated the relatively small Areca NVDIMM caches and we went 
back to pure drive based performance. 

 

Yes, I think that is a valid point. Although low latency, you are still having 
to write to the disks twice (journal+data), so once the cache’s on the cards 
start filling up, you are going to hit problems.

 

 

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
worked, but now the overall latency is really high at times, not always. Red 
Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
too many IOPS, which get their latency sky high. Overall we are functioning 
fine, but I sure would like storage vmotion and other large operations faster. 

 

 

Yeah this is the biggest pain point I think. Normal VM ops are fine, but if you 
ever have to move a multi-TB VM, it’s just too slow. 

 

If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
performance is actually quite good, as the block sizes used for the copy are a 
lot bigger. 

 

However, my use case required thin provisioned VM’s + snapshots and I found 
that using iscsi you have no control over the fragmentation of the vmdk’s and 
so the read performance is then what suffers (certainly with 7.2k disks)

 

Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
updating of the VMFS metadata, although I can’t be sure.

 

 

I am thinking I will test a few different schedulers and readahead settings to 
see if we can improve this by parallelizing reads. Also will test NFS, but need 
to determine whether to do krbd/knfsd or something more interesting like 
CephFS/Ganesha. 

 

As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
less sensitive to making config adjustments without suddenly everything 
dropping offline. The fact that you can specify the extent size on XFS helps 
massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
when esxi tries to write 32 copy threads to the same object. There is probably 
some tuning that could be done here (RBD striping???) but this is the best it’s 
been for a long time and I’m reluctant to fiddle any further.

 

We have moved ahead and added NFS support to Storcium, and are now able to run NFS 
servers with Pacemaker in HA mode (all agents are public at 
https://github.com/akurz/resource-agents/tree/master/heartbeat).  I can confirm 
that VM performance is definitely better and benchmarks are more smooth (in 
Windows we can see a lot of choppiness with iSCSI, NFS is choppy on writes, but 
smooth on reads, likely due to the bursty nature of OSD filesystems when 
dealing with that small IO size).
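
For anyone wanting to see what the generic half of such a setup looks like, here is a rough sketch using the stock ocf:heartbeat agents (device path, export directory, client subnet and VIP below are made-up examples; presumably the Storcium agents in the repository above handle the Ceph-specific part such as mapping the RBD device):

# rough sketch: one failover group = filesystem + NFS export + floating IP
pcs resource create ds1-fs ocf:heartbeat:Filesystem \
    device=/dev/rbd/rbd/nfs-ds1 directory=/srv/nfs/ds1 fstype=xfs --group nfs-ds1
pcs resource create ds1-export ocf:heartbeat:exportfs \
    directory=/srv/nfs/ds1 clientspec=10.0.0.0/24 \
    options=rw,no_root_squash,sync fsid=1 --group nfs-ds1
pcs resource create ds1-vip ocf:heartbeat:IPaddr2 \
    ip=10.0.0.50 cidr_netmask=24 --group nfs-ds1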

 

Were you using extsz=16384 at creation time for the filesystem?  I saw kernel 
memory deadlock messages during vmotion, such as:

 

 XFS: nfsd(102545) possible memory allocation deadlock size 40320 in kmem_alloc 
(mode:0x2400240)

 

And analyzing fragmentation:

 

root@roc-5r-scd218:~# xfs_db -r /dev/rbd21

xfs_db> frag -d

actual 0, ideal 0, fragmentation factor 0.00%

xfs_db> frag -f

actual 1863960, ideal 74, fragmentation factor 100.00%

 

Just from two vmotions.

 

Are you seeing anything similar?

 

Found your post on setting XFS extent size hint for sparse files:

 

xfs_io -c extsize 16M /mountpoint

Will test - fragmentation definitely present without this.  
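
As a minimal sketch of that (the mount point is a made-up example; /dev/rbd21 and the 16M hint are taken from the messages above):

# set a 16M extent size hint on the directory backing the NFS datastore;
# new sparse vmdk files created under it inherit the hint and allocate in
# large extents instead of tiny fragments
xfs_io -c "extsize 16m" /srv/vmware-ds1
xfs_io -c extsize /srv/vmware-ds1          # read the hint back to verify

# re-check per-file fragmentation after the next storage vMotion
xfs_db -r -c "frag -f" /dev/rbd21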

 

Yeah, I got bit by that when I first set it up; I then created another datastore 
with that extent size hint and moved everything across. Haven’t seen any kmem al

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-03 Thread Alex Gorbachev
On Saturday, September 3, 2016, Alex Gorbachev wrote:

> Hi Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk  > wrote:
>
>> *From:* Alex Gorbachev [mailto:a...@iss-integration.com
>> ]
>> *Sent:* 21 August 2016 15:27
>> *To:* Wilhelm Redbrake > >
>> *Cc:* n...@fisk.me.uk ;
>> Horace Ng > >; ceph-users <
>> ceph-users@lists.ceph.com
>> >
>> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>>
>>
>>
>>
>> On Sunday, August 21, 2016, Wilhelm Redbrake > > wrote:
>>
>> Hi Nick,
>> i understand all of your technical improvements.
>> But: why do you Not use a simple for example Areca Raid Controller with 8
>> gb Cache and Bbu ontop in every ceph node.
>> Configure n Times RAID 0 on the Controller and enable Write back Cache.
>> That must be a latency "Killer" like in all the prop. Storage arrays or
>> Not ??
>>
>> Best Regards !!
>>
>>
>>
>> What we saw specifically with Areca cards is that performance is
>> excellent in benchmarking and for bursty loads. However, once we started
>> loading with more constant workloads (we replicate databases and files to
>> our Ceph cluster), this looks to have saturated the relatively small Areca
>> NVDIMM caches and we went back to pure drive based performance.
>>
>>
>>
>> Yes, I think that is a valid point. Although low latency, you are still
>> having to write to the disks twice (journal+data), so once the cache’s on
>> the cards start filling up, you are going to hit problems.
>>
>>
>>
>>
>>
>> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
>> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
>> worked, but now the overall latency is really high at times, not always.
>> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
>> drives with too many IOPS, which get their latency sky high. Overall we are
>> functioning fine, but I sure would like storage vmotion and other large
>> operations faster.
>>
>>
>>
>>
>>
>> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
>> if you ever have to move a multi-TB VM, it’s just too slow.
>>
>>
>>
>> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
>> then performance is actually quite good, as the block sizes used for the
>> copy are a lot bigger.
>>
>>
>>
>> However, my use case required thin provisioned VM’s + snapshots and I
>> found that using iscsi you have no control over the fragmentation of the
>> vmdk’s and so the read performance is then what suffers (certainly with
>> 7.2k disks)
>>
>>
>>
>> Also with thin provisioned vmdk’s I think I was seeing PG contention with
>> the updating of the VMFS metadata, although I can’t be sure.
>>
>>
>>
>>
>>
>> I am thinking I will test a few different schedulers and readahead
>> settings to see if we can improve this by parallelizing reads. Also will
>> test NFS, but need to determine whether to do krbd/knfsd or something more
>> interesting like CephFS/Ganesha.
>>
>>
>>
>> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
>> lot less sensitive to making config adjustments without suddenly everything
>> dropping offline. The fact that you can specify the extent size on XFS
>> helps massively with using thin vmdks/snapshots to avoid fragmentation.
>> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
>> contention when esxi tries to write 32 copy threads to the same object.
>> There is probably some tuning that could be done here (RBD striping???) but
>> this is the best it’s been for a long time and I’m reluctant to fiddle any
>> further.
>>
>
> We have moved ahead and added NFS support to Storcium, and are now able to run
> NFS servers with Pacemaker in HA mode (all agents are public at
> https://github.com/akurz/resource-agents/tree/master/heartbeat).  I can
> confirm that VM performance is definitely better and benchmarks are more
> smooth (in Windows we can see a lot of choppiness with iSCSI, NFS is choppy
> on writes, but smooth on reads, likely due to the bursty nature of OSD
> filesystems when dealing with that small IO size).
>
> Were you using extsz=16384 at creation time for the filesystem?  I saw
> kernel memory deadlock messages during vmotion, such as:
>
>  XFS: nfsd(102545) possible memory allocation deadlock size 40320 in
> 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-03 Thread Alex Gorbachev
Hi Nick,

On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk  wrote:

> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 21 August 2016 15:27
> *To:* Wilhelm Redbrake 
> *Cc:* n...@fisk.me.uk; Horace Ng ; ceph-users <
> ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:
>
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8
> gb Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or
> Not ??
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent
> in benchmarking and for bursty loads. However, once we started loading with
> more constant workloads (we replicate databases and files to our Ceph
> cluster), this looks to have saturated the relatively small Areca NVDIMM
> caches and we went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still
> having to write to the disks twice (journal+data), so once the cache’s on
> the cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
> worked, but now the overall latency is really high at times, not always.
> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
> drives with too many IOPS, which get their latency sky high. Overall we are
> functioning fine, but I sure would like storage vmotion and other large
> operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
> if you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
> then performance is actually quite good, as the block sizes used for the
> copy are a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I
> found that using iscsi you have no control over the fragmentation of the
> vmdk’s and so the read performance is then what suffers (certainly with
> 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with
> the updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead
> settings to see if we can improve this by parallelizing reads. Also will
> test NFS, but need to determine whether to do krbd/knfsd or something more
> interesting like CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
> lot less sensitive to making config adjustments without suddenly everything
> dropping offline. The fact that you can specify the extent size on XFS
> helps massively with using thin vmdks/snapshots to avoid fragmentation.
> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
> contention when esxi tries to write 32 copy threads to the same object.
> There is probably some tuning that could be done here (RBD striping???) but
> this is the best it’s been for a long time and I’m reluctant to fiddle any
> further.
>

We have moved ahead and added NFS support to Storcium, and are now able to run
NFS servers with Pacemaker in HA mode (all agents are public at
https://github.com/akurz/resource-agents/tree/master/heartbeat).  I can
confirm that VM performance is definitely better and benchmarks are more
smooth (in Windows we can see a lot of choppiness with iSCSI, NFS is choppy
on writes, but smooth on reads, likely due to the bursty nature of OSD
filesystems when dealing with that small IO size).

Were you using extsz=16384 at creation time for the filesystem?  I saw
kernel memory deadlock messages during vmotion, such as:

 XFS: nfsd(102545) possible memory allocation deadlock size 40320 in
kmem_alloc (mode:0x2400240)

And analyzing fragmentation:

root@roc-5r-scd218:~# xfs_db -r /dev/rbd21
xfs_db> frag -d
actual 0, ideal 0, fragmentation factor 0.00%
xfs_db> frag -f
actual 1863960, ideal 74, fragmentation factor 100.00%

Just from two vmotions.

Are you seeing anything similar?

Thank you,
Alex


>
>
> But as mentioned above, thick vmdk’s with vaai might be a really good fit.
>
>
>
> Thanks for your very valuable info on analysis and hw build.
>
>
>
> Alex
>
>
>
>
>
>
> On 21.08.2016 at 09:31, Nick Fisk wrote:

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-31 Thread Nick Fisk
From: w...@globe.de [mailto:w...@globe.de] 
Sent: 31 August 2016 08:56
To: n...@fisk.me.uk; 'Alex Gorbachev' ; 'Horace Ng' 

Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Nick,

what do you think about Infiniband?

I have read that with InfiniBand the latency is around 1.2 µs

It’s great, but I don’t believe the Ceph support for RDMA is finished yet, so 
you are stuck using IPoIB, which has similar performance to 10G Ethernet.

For now concentrate on removing latency where you easily can (3.5+ Ghz CPU’s, 
NVME journals) and then when stuff like RDMA comes along, you will be in a 
better place to take advantage of it.

 

Kind Regards!

 

On 31.08.16 at 09:51, Nick Fisk wrote:

 

 

From: w...@globe.de <mailto:w...@globe.de>  [mailto:w...@globe.de] 
Sent: 30 August 2016 18:40
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Alex Gorbachev'  
<mailto:a...@iss-integration.com> 
Cc: 'Horace Ng'  <mailto:hor...@hkisl.net> 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick,

here are my answers and questions...

 

On 30.08.16 at 19:05, Nick Fisk wrote:

 

 

From: w...@globe.de <mailto:w...@globe.de>  [mailto:w...@globe.de] 
Sent: 30 August 2016 08:48
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Alex Gorbachev'  
<mailto:a...@iss-integration.com> 
Cc: 'Horace Ng'  <mailto:hor...@hkisl.net> 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick, hi Alex,

Nick: I've got my 600GB SAS HP drives.

Performance is not good, so I won't paste the results here...

 

On another note: I've put Samsung SM863 enterprise SSDs into the Ceph cluster.

If I run a 4k test on the SSD directly, without a filesystem, I get 

(See Sebastien's Han Tests)

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
 
 

 

dd if=/dev/zero of=/dev/sdd bs=4k count=1000000 oflag=direct,dsync
1000000+0 records in
1000000+0 records out
4096000000 bytes (4,1 GB, 3,8 GiB) copied, 52,7139 s, 77,7 MB/s

77000/4 = ~20000 IOPs

 

If I format the device with XFS, I get:

mkfs.xfs -f /dev/sdd

mount /dev/sdd /mnt

cd /mnt

dd if=/dev/zero of=/mnt/test.txt bs=4k count=100000 oflag=direct,dsync
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 21,1856 s, 19,3 MB/s

19300/4 = ~5000 IOPs
I know once you have a FS on the device it will slow down due to the extra 
journal writes, maybe this is a little more than expected here…but still 
reasonably fast. Can you see in iostat how many IO’s the device is doing during 
this test?



watch iostat -dmx -t -y 1 1 /dev/sde

Device:  rrqm/s  wrqm/s  r/s   w/s      rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sde      0,00    0,00    0,00  9625,00  0,00   25,85  5,50      0,60      0,06   0,00     0,06     0,06   59,60




So there seems to be an extra delay somewhere when writing via a FS instead of 
raw device. You are still getting around 10,000 iops though, so not too bad.







If I use the SSD in the Ceph cluster and do the test again with rados bench, 
bs=4K and -t 1 (one thread), I get only 2-3 MByte/s

2500/4 = ~600 IOPs

My question is: how can the raw device performance be so much higher than the 
XFS and Ceph RBD performance?

Ceph will be a lot slower as you are replacing a 30cm SAS/SATA cable with 
networking, software and also doing replication. You have at least 2 network 
hops with Ceph. For a slightly fairer test, set replication to 1x.
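
A quick sketch of that kind of comparison (pool name and PG count are made-up examples):

# throw-away pool with replication 1x, purely for benchmarking
ceph osd pool create bench-r1 128 128
ceph osd pool set bench-r1 size 1
ceph osd pool set bench-r1 min_size 1

# repeat the single-threaded 4K write test against the 1x pool, then clean up
rados bench -p bench-r1 60 write -b 4K -t 1
rados -p bench-r1 cleanup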


Replication 3x:
rados bench -p rbd 60 write -b 4k -t 1
Invalid value for block-size: The option value '4k' seems to be invalid
root@ceph-mon-1:~# rados bench -p rbd 60 write -b 4K -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 
60 seconds or 0 objects
Object prefix: benchmark_data_ceph-mon-1_30407
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1   1   402   4011.5661   1.56641  0.00226091  0.00248929
2   1   775   774   1.51142   1.45703   0.0021945  0.00258187
3   1  1110  1109   1.44374   1.30859  0.00278291  0.00270182
4   1  1421  1420   1.38647   1.21484  0.001

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-31 Thread Nick Fisk
 

 

From: w...@globe.de [mailto:w...@globe.de] 
Sent: 30 August 2016 18:40
To: n...@fisk.me.uk; 'Alex Gorbachev' 
Cc: 'Horace Ng' 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick,

here are my answers and questions...

 

On 30.08.16 at 19:05, Nick Fisk wrote:

 

 

From: w...@globe.de <mailto:w...@globe.de>  [mailto:w...@globe.de] 
Sent: 30 August 2016 08:48
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Alex Gorbachev'  
<mailto:a...@iss-integration.com> 
Cc: 'Horace Ng'  <mailto:hor...@hkisl.net> 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick, hi Alex,

Nick: I've got my 600GB SAS HP drives.

Performance is not good, so I won't paste the results here...

 

On another note: I've put Samsung SM863 enterprise SSDs into the Ceph cluster.

If I run a 4k test on the SSD directly, without a filesystem, I get 

(See Sebastien's Han Tests)

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
 
 

 

dd if=/dev/zero of=/dev/sdd bs=4k count=1000000 oflag=direct,dsync
1000000+0 records in
1000000+0 records out
4096000000 bytes (4,1 GB, 3,8 GiB) copied, 52,7139 s, 77,7 MB/s

77000/4 = ~20000 IOPs

 

If I format the device with XFS, I get:

mkfs.xfs -f /dev/sdd

mount /dev/sdd /mnt

cd /mnt

dd if=/dev/zero of=/mnt/test.txt bs=4k count=100000 oflag=direct,dsync
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 21,1856 s, 19,3 MB/s

19300/4 = ~5000 IOPs
I know once you have a FS on the device it will slow down due to the extra 
journal writes, maybe this is a little more than expected here…but still 
reasonably fast. Can you see in iostat how many IO’s the device is doing during 
this test?



watch iostat -dmx -t -y 1 1 /dev/sde

Device:  rrqm/s  wrqm/s  r/s   w/s      rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sde      0,00    0,00    0,00  9625,00  0,00   25,85  5,50      0,60      0,06   0,00     0,06     0,06   59,60



So there seems to be an extra delay somewhere when writing via a FS instead of 
raw device. You are still getting around 10,000 iops though, so not too bad.






If I use the SSD in the Ceph cluster and do the test again with rados bench, 
bs=4K and -t 1 (one thread), I get only 2-3 MByte/s

2500/4 = ~600 IOPs

My question is: how can the raw device performance be so much higher than the 
XFS and Ceph RBD performance?

Ceph will be a lot slower as you are replacing a 30cm SAS/SATA cable with 
networking, software and also doing replication. You have at least 2 network 
hops with Ceph. For a slightly fairer test, set replication to 1x.


Replication 3x:
rados bench -p rbd 60 write -b 4k -t 1
Invalid value for block-size: The option value '4k' seems to be invalid
root@ceph-mon-1:~# rados bench -p rbd 60 write -b 4K -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 
60 seconds or 0 objects
Object prefix: benchmark_data_ceph-mon-1_30407
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1   1   402   4011.5661   1.56641  0.00226091  0.00248929
2   1   775   774   1.51142   1.45703   0.0021945  0.00258187
3   1  1110  1109   1.44374   1.30859  0.00278291  0.00270182
4   1  1421  1420   1.38647   1.21484  0.00199578  0.00281537
5   1  1731  1730   1.35132   1.21094  0.00219136  0.00288843
6   1  2044  2043   1.32985   1.22266   0.0023981  0.00293468
7   1  2351  2350   1.31116   1.19922  0.00258856  0.00296963
8   1  2703  2702   1.31911 1.375   0.0224678  0.00295862
9   1  2955  2954   1.28191  0.984375  0.00841621  0.00304526
   10   1  3228  3227   1.26034   1.06641  0.00261023  0.00309665
   11   1  3501  35001.2427   1.06641  0.00659853  0.00313985
   12   1  3791  3790   1.23353   1.13281   0.0027244  0.00316168
   13   1  4150  4149   1.24649   1.40234  0.00262242  0.00313177
   14   1  4460  4459   1.24394   1.21094  0.00262075  0.00313735
   15   1  4721  4720   1.22897   1.01953  0.00239961  0.00317357
   16   1  4983  4982   1.21611   1.02344  0.00290526  0.00321005
   17   1  5279  5278   1.21258   1.15625  0.00252002   0.0032196
   18   1  5605  5604   1.21595   1.27344

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-22 Thread Christian Balzer


Hello,

On Mon, 22 Aug 2016 20:34:54 +0100 Nick Fisk wrote:

> > -Original Message-
> > From: Christian Balzer [mailto:ch...@gol.com]
> > Sent: 22 August 2016 03:00
> > To: 'ceph-users' 
> > Cc: Nick Fisk 
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > 
> > 
> > Hello,
> > 
> > On Sun, 21 Aug 2016 09:57:40 +0100 Nick Fisk wrote:
> > 
> > >
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > Behalf Of Christian Balzer
> > > > Sent: 21 August 2016 09:32
> > > > To: ceph-users 
> > > > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > > >
> > > >
> > > > Hello,
> > > >
> > > > On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> > > >
> > > > > Hi Nick
> > > > >
> > > > > Interested in this comment - "-Dual sockets are probably bad and
> > > > > will impact performance."
> > > > >
> > > > > Have you got real world experience of this being the case?
> > > > >
> > > > Well, Nick wrote "probably".
> > > >
> > > > Dual sockets and thus NUMA, the need for CPUs to talk to each other
> > > > and share information certainly can impact things that are
> > > very
> > > > time critical.
> > > > How much though is a question of design, both HW and SW.
> > >
> > > There was a guy from Redhat (sorry his name escapes me now) a few
> > > months ago on the performance weekly meeting. He was analysing the CPU
> > > cache miss effects with Ceph and it looked like a NUMA setup was
> > > having quite a severe impact on some things. To be honest a lot of it
> > > went over my head, but I came away from it with a general feeling that
> > > if you can get the required performance from 1 socket, then that is 
> > > probably a better bet. This includes only populating a single
> > socket in a dual socket system. There was also a Ceph tech talk at the 
> > start of the year (High perf databases on Ceph) where the guy
> > presenting was also recommending only populating 1 socket for latency 
> > reasons.
> > >
> > I wonder how complete their testing was and how much manual tuning they 
> > tried.
> > As in:
> > 
> > 1. Was irqbalance running?
> > Because it and the normal kernel strategies clash beautifully.
> > Irqbalance moves stuff around, the kernel tries to move things close to 
> > where the IRQs are, cat and mouse.
> > 
> > 2. Did they try with manual IRQ pinning?
> > I do, not that it's critical with my Ceph nodes, but on other machines it 
> > can make a LOT of difference.
> > Like keeping the cores near (or at least on the same NUMA node) as the 
> > network IRQs reserved for KVM vhost processes.
> > 
> > 3. Did they try pinning Ceph OSD processes?
> > While this may certainly help (and make things more predictable when the 
> > load gets high), as I said above the kernel normally does a
> > pretty good job of NOT moving things around and keeping processes close to 
> > the resources they need.
> > 
> 
> From what I remember I think they went to pretty long lengths to tune things. 
> I think one point was that if you have a 40G NIC on one socket and an NVMe on 
> another, no matter where the process runs, you are going to have a lot of 
> traffic crossing between the sockets.

Traffic yes, complete process migrations hopefully not.
But anyway, yes, that's to be expected.

And also unavoidable if you want/need to utilize the whole capabilities
and PCIe lanes of a dual socket motherboard.
And in some cases (usually not with Ceph/OSDs), the IRQ load really will
benefit from more cores to play with.

> 
> Here is the DB on Ceph one
> 
> http://ceph-users.ceph.narkive.com/1sj4VI4U/ceph-tech-talk-high-performance-production-databases-on-ceph

Thanks!
Yeah, basically confirms what I know/said.

> 
> I don't think the recordings are available for the performance meeting one, 
> but it was something to do with certain C++ string functions causing issue 
> with CPU cache. Honestly can't remember much else.
> 
> > > Both of those, coupled with the fact that Xeon E3's are the cheapest way 
> > > to get high clock speeds, sort of made my decision.
> > >
> > Totally agreed, my current HDD node design is based on the single CPU 
> >

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-22 Thread Nick Fisk
 

 

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 22 August 2016 20:30
To: Nick Fisk 
Cc: Wilhelm Redbrake ; Horace Ng ; ceph-users 

Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 


On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:

Hi Nick,
i understand all of your technical improvements.
But: why do you Not use a simple for example Areca Raid Controller with 8 gb 
Cache and Bbu ontop in every ceph node.
Configure n Times RAID 0 on the Controller and enable Write back Cache.
That must be a latency "Killer" like in all the prop. Storage arrays or Not ??

Best Regards !!

 

What we saw specifically with Areca cards is that performance is excellent in 
benchmarking and for bursty loads. However, once we started loading with more 
constant workloads (we replicate databases and files to our Ceph cluster), this 
looks to have saturated the relatively small Areca NVDIMM caches and we went 
back to pure drive based performance. 

 

Yes, I think that is a valid point. Although low latency, you are still having 
to write to the disks twice (journal+data), so once the cache’s on the cards 
start filling up, you are going to hit problems.

 

 

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
worked, but now the overall latency is really high at times, not always. Red 
Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
too many IOPS, which get their latency sky high. Overall we are functioning 
fine, but I sure would like storage vmotion and other large operations faster. 

 

 

Yeah this is the biggest pain point I think. Normal VM ops are fine, but if you 
ever have to move a multi-TB VM, it’s just too slow. 

 

If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
performance is actually quite good, as the block sizes used for the copy are a 
lot bigger. 

 

However, my use case required thin provisioned VM’s + snapshots and I found 
that using iscsi you have no control over the fragmentation of the vmdk’s and 
so the read performance is then what suffers (certainly with 7.2k disks)

 

Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
updating of the VMFS metadata, although I can’t be sure.

 

 

I am thinking I will test a few different schedulers and readahead settings to 
see if we can improve this by parallelizing reads. Also will test NFS, but need 
to determine whether to do krbd/knfsd or something more interesting like 
CephFS/Ganesha. 

 

As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
less sensitive to making config adjustments without suddenly everything 
dropping offline. The fact that you can specify the extent size on XFS helps 
massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
when esxi tries to write 32 copy threads to the same object. There is probably 
some tuning that could be done here (RBD striping???) but this is the best it’s 
been for a long time and I’m reluctant to fiddle any further.
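
If anyone wants to experiment with that, a hedged sketch (pool/image name and striping values are made-up examples, not something tested in this thread; as far as I know the kernel RBD client of that era refuses to map images with non-default striping, so this mainly applies to librbd-backed targets):

# spread adjacent 1 MiB chunks across 8 objects, so parallel copy threads
# are less likely to pile onto the same PG
rbd create rbd/esxi-ds1 --size 102400 --stripe-unit 1048576 --stripe-count 8
rbd info rbd/esxi-ds1      # reports stripe unit/count alongside order and features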

 

But as mentioned above, thick vmdk’s with vaai might be a really good fit.

 

Any chance thin vs. thick difference could be related to discards?  I saw 
zillions of them in recent testing.

 

 

I was using FILEIO and so discards weren’t working for me. I know fragmentation 
was definitely the cause of the small reads. The VMFS metadata I’m less sure 
of, but it seemed the most likely cause as it only affected write performance 
the first time round.

 

 

 

Thanks for your very valuable info on analysis and hw build. 

 

Alex

 




On 21.08.2016 at 09:31, Nick Fisk wrote:

>> -Original Message-
>> From: Alex Gorbachev [mailto:a...@iss-integration.com]
>> Sent: 21 August 2016 04:15
>> To: Nick Fisk 
>> Cc: w...@globe.de; Horace Ng ; ceph-users 
>> 
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Hi Nick,
>>
>> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
>>>> -Original Message-
>>>> From: w...@globe.de [mailto:w...@globe.de]
>>>> Sent: 21 July 2016 13:23
>>>> To: n...@fisk.me.uk; 'Horace Ng' 
>>>> Cc: ceph-users@lists.ceph.com
>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>
>>>> Okay and what is your plan now to speed up ?
>>>
>>> Now I have come up with a lower latency hardware design, there is not much 
>>> further improvement until persistent RBD caching is
>> implemented, as you will be moving the SSD/NVME closer to the client. But 
>> I'm happy with what I can achieve at the moment. You
>> coul

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-22 Thread Nick Fisk
> -Original Message-
> From: Christian Balzer [mailto:ch...@gol.com]
> Sent: 22 August 2016 03:00
> To: 'ceph-users' 
> Cc: Nick Fisk 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> 
> Hello,
> 
> On Sun, 21 Aug 2016 09:57:40 +0100 Nick Fisk wrote:
> 
> >
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Christian Balzer
> > > Sent: 21 August 2016 09:32
> > > To: ceph-users 
> > > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > >
> > >
> > > Hello,
> > >
> > > On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> > >
> > > > Hi Nick
> > > >
> > > > Interested in this comment - "-Dual sockets are probably bad and
> > > > will impact performance."
> > > >
> > > > Have you got real world experience of this being the case?
> > > >
> > > Well, Nick wrote "probably".
> > >
> > > Dual sockets and thus NUMA, the need for CPUs to talk to each other
> > > and share information certainly can impact things that are
> > very
> > > time critical.
> > > How much though is a question of design, both HW and SW.
> >
> > There was a guy from Redhat (sorry his name escapes me now) a few
> > months ago on the performance weekly meeting. He was analysing the CPU
> > cache miss effects with Ceph and it looked like a NUMA setup was
> > having quite a severe impact on some things. To be honest a lot of it
> > went over my head, but I came away from it with a general feeling that
> > if you can get the required performance from 1 socket, then that is 
> > probably a better bet. This includes only populating a single
> socket in a dual socket system. There was also a Ceph tech talk at the start 
> of the year (High perf databases on Ceph) where the guy
> presenting was also recommending only populating 1 socket for latency reasons.
> >
> I wonder how complete their testing was and how much manual tuning they tried.
> As in:
> 
> 1. Was irqbalance running?
> Because it and the normal kernel strategies clash beautifully.
> Irqbalance moves stuff around, the kernel tries to move things close to where 
> the IRQs are, cat and mouse.
> 
> 2. Did they try with manual IRQ pinning?
> I do, not that it's critical with my Ceph nodes, but on other machines it can 
> make a LOT of difference.
> Like keeping the cores near (or at least on the same NUMA node) as the 
> network IRQs reserved for KVM vhost processes.
> 
> 3. Did they try pinning Ceph OSD processes?
> While this may certainly help (and make things more predictable when the load 
> gets high), as I said above the kernel normally does a
> pretty good job of NOT moving things around and keeping processes close to 
> the resources they need.
> 

From what I remember I think they went to pretty long lengths to tune things. 
I think one point was that if you have a 40G NIC on one socket and an NVMe on 
another, no matter where the process runs, you are going to have a lot of 
traffic crossing between the sockets.

Here is the DB on Ceph one

http://ceph-users.ceph.narkive.com/1sj4VI4U/ceph-tech-talk-high-performance-production-databases-on-ceph

I don't think the recordings are available for the performance meeting one, but 
it was something to do with certain C++ string functions causing issue with CPU 
cache. Honestly can't remember much else.

> > Both of those, coupled with the fact that Xeon E3's are the cheapest way to 
> > get high clock speeds, sort of made my decision.
> >
> Totally agreed, my current HDD node design is based on the single CPU 
> Supermicro 5028R-E1CR12L barebone, with an E5-1650 v3
> (3.50GHz) CPU.

Nice. Any ideas how they compare to the E3's?

> 
> > >
> > > We're looking here at a case where he's trying to reduce latency by
> > > all means and where the actual CPU needs for the HDDs are negligible.
> > > The idea being that a "Ceph IOPS" stays on one core which is hopefully 
> > > also not being shared at that time.
> > >
> > > If you're looking at full SSD nodes OTOH a singe CPU may very well
> > > not be able to saturate a sensible amount of SSDs per node, so
> > a
> > > slight penalty but better utilization and overall IOPS with 2 CPUs may be 
> > > the forward.
> >
> > Definitely, as always work out what your requirements are and design around 
> > them.
> >
> On my cache ti

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-22 Thread Alex Gorbachev
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:
>
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8
> gb Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or
> Not ??
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent
> in benchmarking and for bursty loads. However, once we started loading with
> more constant workloads (we replicate databases and files to our Ceph
> cluster), this looks to have saturated the relatively small Areca NVDIMM
> caches and we went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still
> having to write to the disks twice (journal+data), so once the cache’s on
> the cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
> worked, but now the overall latency is really high at times, not always.
> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
> drives with too many IOPS, which get their latency sky high. Overall we are
> functioning fine, but I sure would like storage vmotion and other large
> operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
> if you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
> then performance is actually quite good, as the block sizes used for the
> copy are a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I
> found that using iscsi you have no control over the fragmentation of the
> vmdk’s and so the read performance is then what suffers (certainly with
> 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with
> the updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead
> settings to see if we can improve this by parallelizing reads. Also will
> test NFS, but need to determine whether to do krbd/knfsd or something more
> interesting like CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
> lot less sensitive to making config adjustments without suddenly everything
> dropping offline. The fact that you can specify the extent size on XFS
> helps massively with using thin vmdks/snapshots to avoid fragmentation.
> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
> contention when esxi tries to write 32 copy threads to the same object.
> There is probably some tuning that could be done here (RBD striping???) but
> this is the best it’s been for a long time and I’m reluctant to fiddle any
> further.
>
>
>
> But as mentioned above, thick vmdk’s with vaai might be a really good fit.
>

Any chance thin vs. thick difference could be related to discards?  I saw
zillions of them in recent testing.


>
>
> Thanks for your very valuable info on analysis and hw build.
>
>
>
> Alex
>
>
>
>
>
>
> On 21.08.2016 at 09:31, Nick Fisk wrote:
>
> >> -Original Message-
> >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> >> Sent: 21 August 2016 04:15
> >> To: Nick Fisk 
> >> Cc: w...@globe.de; Horace Ng ; ceph-users <
> ceph-users@lists.ceph.com>
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi Nick,
> >>
> >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
> >>>> -Original Message-
> >>>> From: w...@globe.de [mailto:w...@globe.de]
> >>>> Sent: 21 July 2016 13:23
> >>>> To: n...@fisk.me.uk; 'Horace Ng' 
> >>>> Cc: ceph-users@lists.ceph.com
> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Okay and what is your plan now to speed up ?
> >>>
> >>> Now I have come up with a lower latency hardware design, there is not
> much further improvement until persistent RBD caching is
> >> implemented, as you will be moving the SSD/NVME closer to the client.
> But I'm happy with what I can achieve at the mo

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Christian Balzer

Hello,

On Sun, 21 Aug 2016 09:57:40 +0100 Nick Fisk wrote:

> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> > Christian Balzer
> > Sent: 21 August 2016 09:32
> > To: ceph-users 
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > 
> > 
> > Hello,
> > 
> > On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> > 
> > > Hi Nick
> > >
> > > Interested in this comment - "-Dual sockets are probably bad and will
> > > impact performance."
> > >
> > > Have you got real world experience of this being the case?
> > >
> > Well, Nick wrote "probably".
> > 
> > Dual sockets and thus NUMA, the need for CPUs to talk to each other and 
> > share information certainly can impact things that are
> very
> > time critical.
> > How much though is a question of design, both HW and SW.
> 
> There was a guy from Redhat (sorry his name escapes me now) a few months ago 
> on the performance weekly meeting. He was analysing the
> CPU cache miss effects with Ceph and it looked like a NUMA setup was having 
> quite a severe impact on some things. To be honest a lot
> of it went over my head, but I came away from it with a general feeling that 
> if you can get the required performance from 1 socket,
> then that is probably a better bet. This includes only populating a single 
> socket in a dual socket system. There was also a Ceph
> tech talk at the start of the year (High perf databases on Ceph) where the 
> guy presenting was also recommending only populating 1
> socket for latency reasons.
> 
I wonder how complete their testing was and how much manual tuning they
tried.
As in:

1. Was irqbalance running? 
Because it and the normal kernel strategies clash beautifully.
Irqbalance moves stuff around, the kernel tries to move things close to
where the IRQs are, cat and mouse.

2. Did they try with manual IRQ pinning?
I do, not that it's critical with my Ceph nodes, but on other machines it
can make a LOT of difference. 
Like keeping the cores near (or at least on the same NUMA node) as the
network IRQs reserved for KVM vhost processes. 

3. Did they try pinning Ceph OSD processes?
While this may certainly help (and make things more predictable when the
load gets high), as I said above the kernel normally does a pretty good job
of NOT moving things around and keeping processes close to the resources
they need.
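
For reference, a rough sketch of the manual pinning in 2. and 3. (interface name, IRQ number and core lists are examples only):

systemctl stop irqbalance                   # stop it fighting the manual layout
cat /sys/class/net/eth0/device/numa_node    # which NUMA node the NIC hangs off
grep eth0 /proc/interrupts                  # find the NIC queue IRQ numbers
echo 0-5 > /proc/irq/64/smp_affinity_list   # pin IRQ 64 to cores 0-5

# and, if wanted, keep an OSD on the same node's cores
taskset -cp 0-5 $(pidof -s ceph-osd)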

> Both of those, coupled with the fact that Xeon E3's are the cheapest way to 
> get high clock speeds, sort of made my decision.
> 
Totally agreed, my current HDD node design is based on the single CPU
Supermicro 5028R-E1CR12L barebone, with an E5-1650 v3 (3.50GHz) CPU.

> > 
> > We're looking here at a case where he's trying to reduce latency by all 
> > means and where the actual CPU needs for the HDDs are
> > negligible.
> > The idea being that a "Ceph IOPS" stays on one core which is hopefully also 
> > not being shared at that time.
> > 
> > If you're looking at full SSD nodes OTOH a singe CPU may very well not be 
> > able to saturate a sensible amount of SSDs per node, so
> a
> > slight penalty but better utilization and overall IOPS with 2 CPUs may be 
> > the forward.
> 
> Definitely, as always work out what your requirements are and design around 
> them.  
> 
On my cache tier nodes with 2x E5-2623 v3 (3.00GHz) and currently 4x 800GB
DC S3610 SSDs I can already saturate all but 2 "cores", with the "right"
extreme test cases.
Normal load is of course just around 4 (out of 16) "cores".

And for the people who like it fast(er) but don't have to deal with VMware
or the likes, instead of forcing the c-state to 1 just setting the governor
to "performance" was enough in my case to halve latency (from about 2 to
1ms).

This still does save some power at times and (as Nick speculated) indeed
allows some cores to use their turbo speeds.

So the 4-5 busy cores on my cache tier nodes tend to hover around 3.3GHz,
instead of the 3.0GHz baseline for their CPUs.
And the less loaded cores don't tend to go below 2.6GHz, as opposed to the
1.2GHz that the "powersave" governor would default to.
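
A rough sketch of both knobs discussed here (the C-state parameters are the usual Intel ones; exact values depend on distro and CPU, so treat these as examples):

# switch every core's cpufreq governor to "performance"
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# cap C-states at C1 via the kernel command line (needs a reboot), e.g.:
#   intel_idle.max_cstate=1 processor.max_cstate=1
grep -o 'max_cstate=[0-9]*' /proc/cmdline   # verify after reboot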

Christian

> > 
> > Christian
> > 
> > > Thanks - B
> > >
> > > On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk  wrote:
> > > >> -Original Message-
> > > >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> > > >> Sent: 21 August 2016 04:15
> > > >> To: Nick Fisk 
> > > >> Cc: w...@globe.de; Horace Ng ; ceph-u

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk
From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 21 August 2016 15:27
To: Wilhelm Redbrake 
Cc: n...@fisk.me.uk; Horace Ng ; ceph-users 

Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 



On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:

Hi Nick,
i understand all of your technical improvements.
But: why do you Not use a simple for example Areca Raid Controller with 8 gb 
Cache and Bbu ontop in every ceph node.
Configure n Times RAID 0 on the Controller and enable Write back Cache.
That must be a latency "Killer" like in all the prop. Storage arrays or Not ??

Best Regards !!

 

What we saw specifically with Areca cards is that performance is excellent in 
benchmarking and for bursty loads. However, once we started loading with more 
constant workloads (we replicate databases and files to our Ceph cluster), this 
looks to have saturated the relatively small Areca NVDIMM caches and we went 
back to pure drive based performance. 

 

Yes, I think that is a valid point. Although low latency, you are still having 
to write to the disks twice (journal+data), so once the cache’s on the cards 
start filling up, you are going to hit problems.

 

 

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
worked, but now the overall latency is really high at times, not always. Red 
Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
too many IOPS, which get their latency sky high. Overall we are functioning 
fine, but I sure would like storage vmotion and other large operations faster. 

 

 

Yeah this is the biggest pain point I think. Normal VM ops are fine, but if you 
ever have to move a multi-TB VM, it’s just too slow. 

 

If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
performance is actually quite good, as the block sizes used for the copy are a 
lot bigger. 

 

However, my use case required thin provisioned VM’s + snapshots and I found 
that using iscsi you have no control over the fragmentation of the vmdk’s and 
so the read performance is then what suffers (certainly with 7.2k disks)

 

Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
updating of the VMFS metadata, although I can’t be sure.

 

 

I am thinking I will test a few different schedulers and readahead settings to 
see if we can improve this by parallelizing reads. Also will test NFS, but need 
to determine whether to do krbd/knfsd or something more interesting like 
CephFS/Ganesha. 

 

As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
less sensitive to making config adjustments without suddenly everything 
dropping offline. The fact that you can specify the extent size on XFS helps 
massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
when esxi tries to write 32 copy threads to the same object. There is probably 
some tuning that could be done here (RBD striping???) but this is the best it’s 
been for a long time and I’m reluctant to fiddle any further.

 

But as mentioned above, thick vmdk’s with vaai might be a really good fit.

 

Thanks for your very valuable info on analysis and hw build. 

 

Alex

 




On 21.08.2016 at 09:31, Nick Fisk wrote:

>> -Original Message-
>> From: Alex Gorbachev [mailto:a...@iss-integration.com  ]
>> Sent: 21 August 2016 04:15
>> To: Nick Fisk  >
>> Cc: w...@globe.de  ; Horace Ng >  >; ceph-users  >
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Hi Nick,
>>
>> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  > 
>> wrote:
>>>> -Original Message-
>>>> From: w...@globe.de   [mailto:w...@globe.de  ]
>>>> Sent: 21 July 2016 13:23
>>>> To: n...@fisk.me.uk  ; 'Horace Ng' >>>  >
>>>> Cc: ceph-users@lists.ceph.com  
>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>
>>>> Okay and what is your plan now to speed up ?
>>>
>>> Now I have come up with a lower latency hardware design, there is not much 
>>> further improvement until persistent RBD caching is
>> implemented, as you will be moving the SSD/NVME closer to the client. But 
>> I'm happy with what I can achieve at the moment. You
>> could also experiment with bcache on the RBD.
>>
>> Reviving this thread, would you be willing to share the details of the low 
>> latency hardware design?  Are you optimizing for NFS or
>> iSCSI?
>
> Both really, just trying to get the write latency as low as possible, as you 
> know, vmware 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Alex Gorbachev
On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:

> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8
> gb Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or
> Not ??
>
> Best Regards !!


What we saw specifically with Areca cards is that performance is excellent
in benchmarking and for bursty loads. However, once we started loading with
more constant workloads (we replicate databases and files to our Ceph
cluster), this looks to have saturated the relatively small Areca NVDIMM
caches and we went back to pure drive based performance.

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3
HDDs) in hopes that it would help reduce the noisy neighbor impact. That
worked, but now the overall latency is really high at times, not always.
Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
drives with too many IOPS, which get their latency sky high. Overall we are
functioning fine, but I sure would like storage vmotion and other large
operations faster.

I am thinking I will test a few different schedulers and readahead settings
to see if we can improve this by parallelizing reads. Also will test NFS,
but need to determine whether to do krbd/knfsd or something more
interesting like CephFS/Ganesha.

Thanks for your very valuable info on analysis and hw build.

Alex


>
>
>
> On 21.08.2016 at 09:31, Nick Fisk wrote:
>
> >> -Original Message-
> >> From: Alex Gorbachev [mailto:a...@iss-integration.com ]
> >> Sent: 21 August 2016 04:15
> >> To: Nick Fisk >
> >> Cc: w...@globe.de ; Horace Ng  >; ceph-users >
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi Nick,
> >>
> >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  > wrote:
> >>>> -Original Message-
> >>>> From: w...@globe.de  [mailto:w...@globe.de ]
> >>>> Sent: 21 July 2016 13:23
> >>>> To: n...@fisk.me.uk ; 'Horace Ng'  >
> >>>> Cc: ceph-users@lists.ceph.com 
> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Okay and what is your plan now to speed up ?
> >>>
> >>> Now I have come up with a lower latency hardware design, there is not
> much further improvement until persistent RBD caching is
> >> implemented, as you will be moving the SSD/NVME closer to the client.
> But I'm happy with what I can achieve at the moment. You
> >> could also experiment with bcache on the RBD.
> >>
> >> Reviving this thread, would you be willing to share the details of the
> low latency hardware design?  Are you optimizing for NFS or
> >> iSCSI?
> >
> > Both really, just trying to get the write latency as low as possible, as
> you know, vmware does everything with lots of unbuffered small io's. Eg
> when you migrate a VM or as thin vmdk's grow.
> >
> > Even storage vmotions which might kick off 32 threads, as they all
> roughly fall on the same PG, there still appears to be a bottleneck with
> contention on the PG itself.
> >
> > These were the sort of things I was trying to optimise for, to make the
> time spent in Ceph as minimal as possible for each IO.
> >
> > So onto the hardware. Through reading various threads and experiments on
> my own I came to the following conclusions.
> >
> > -You need highest possible frequency on the CPU cores, which normally
> also means less of them.
> > -Dual sockets are probably bad and will impact performance.
> > -Use NVME's for journals to minimise latency
> >
> > The end result was OSD nodes based off of a 3.5Ghz Xeon E3v5 with an
> Intel P3700 for a journal. I used the SuperMicro X11SSH-CTF board which has
> 10G-T onboard as well as 8SATA and 8SAS, so no expansion cards required.
> Actually this design as well as being very performant for Ceph, also works
> out very cheap as you are using low end server parts. The whole lot +
> 12x7.2k disks all goes into a 1U case.
> >
> > During testing I noticed that by default c-states and p-states slaughter
> performance. After forcing max cstate to 1 and forcing the CPU frequency up
> to max, I was seeing 600us latency for a 4kb write to a 3xreplica pool, or
> around 1600IOPs, this is at QD=1.
> >
> > Few other observations:
> > 1. Power usage is around 150-200W for this config with 12x7.2k disks
> > 2. CPU u

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk


> -Original Message-
> From: Wilhelm Redbrake [mailto:w...@globe.de]
> Sent: 21 August 2016 09:34
> To: n...@fisk.me.uk
> Cc: Alex Gorbachev ; Horace Ng ; 
> ceph-users 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8 gb 
> Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or Not ??

Possibly, the latency of the NVME is very low, to the point that the "latency" 
in Ceph dwarfs it. So I'm not sure how much more improvement can be got from 
lowering journal latency further. But you are certainly correct it would help.

The other thing: if you don't use an SSD for a journal but rely on the RAID WBC, 
do you still see half the MB/s on the hard disks due to the co-located journal? 
Maybe someone can confirm?

Oh and I just looked at the price of that thing. The 16 port version is nearly 
double the price of what I paid for the 400GB NVME and that’s without adding on 
the 8GB ram and BBU. Maybe it's more suited for a full SSD cluster rather than 
spinning disks?

> 
> Best Regards !!
> 
> 
> 
> On 21.08.2016 at 09:31, Nick Fisk wrote:
> 
> >> -Original Message-
> >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> >> Sent: 21 August 2016 04:15
> >> To: Nick Fisk 
> >> Cc: w...@globe.de; Horace Ng ; ceph-users
> >> 
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi Nick,
> >>
> >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
> >>>> -Original Message-
> >>>> From: w...@globe.de [mailto:w...@globe.de]
> >>>> Sent: 21 July 2016 13:23
> >>>> To: n...@fisk.me.uk; 'Horace Ng' 
> >>>> Cc: ceph-users@lists.ceph.com
> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Okay and what is your plan now to speed up ?
> >>>
> >>> Now I have come up with a lower latency hardware design, there is
> >>> not much further improvement until persistent RBD caching is
> >> implemented, as you will be moving the SSD/NVME closer to the client.
> >> But I'm happy with what I can achieve at the moment. You could also 
> >> experiment with bcache on the RBD.
> >>
> >> Reviving this thread, would you be willing to share the details of
> >> the low latency hardware design?  Are you optimizing for NFS or iSCSI?
> >
> > Both really, just trying to get the write latency as low as possible, as 
> > you know, vmware does everything with lots of unbuffered
> small io's. Eg when you migrate a VM or as thin vmdk's grow.
> >
> > Even storage vmotions which might kick off 32 threads, as they all roughly 
> > fall on the same PG, there still appears to be a bottleneck
> with contention on the PG itself.
> >
> > These were the sort of things I was trying to optimise for, to make the 
> > time spent in Ceph as minimal as possible for each IO.
> >
> > So onto the hardware. Through reading various threads and experiments on my 
> > own I came to the following conclusions.
> >
> > -You need highest possible frequency on the CPU cores, which normally also 
> > means less of them.
> > -Dual sockets are probably bad and will impact performance.
> > -Use NVME's for journals to minimise latency
> >
> > The end result was OSD nodes based off of a 3.5Ghz Xeon E3v5 with an Intel 
> > P3700 for a journal. I used the SuperMicro X11SSH-CTF
> board which has 10G-T onboard as well as 8SATA and 8SAS, so no expansion 
> cards required. Actually this design as well as being very
> performant for Ceph, also works out very cheap as you are using low end 
> server parts. The whole lot + 12x7.2k disks all goes into a 1U
> case.
> >
> > During testing I noticed that by default c-states and p-states slaughter 
> > performance. After forcing max cstate to 1 and forcing the
> CPU frequency up to max, I was seeing 600us latency for a 4kb write to a 
> 3xreplica pool, or around 1600IOPs, this is at QD=1.
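
For anyone wanting to reproduce that kind of QD=1 number against their own pool, a
minimal fio sketch is below (it assumes fio was built with the rbd engine and that a
throwaway image called "testimg" exists in a pool called "rbd" - adjust the names,
none of them come from the setup described above):

  # 4k writes, one job, queue depth 1, straight through librbd
  fio --name=qd1-4k-write --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=testimg --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
      --time_based --runtime=60
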
> >
> > Few other observations:
> > 1. Power usage is around 150-200W for this config with 12x7.2k disks
> > 2. CPU usage maxing out disks, is only around 10-15%, so plenty of headroom 
> > for more disks.
> > 3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Christian Balzer
> Sent: 21 August 2016 09:32
> To: ceph-users 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> 
> Hello,
> 
> On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> 
> > Hi Nick
> >
> > Interested in this comment - "-Dual sockets are probably bad and will
> > impact performance."
> >
> > Have you got real world experience of this being the case?
> >
> Well, Nick wrote "probably".
> 
> Dual sockets and thus NUMA, the need for CPUs to talk to each other and share 
> information, certainly can impact things that are very time critical.
> How much though is a question of design, both HW and SW.

There was a guy from Redhat (sorry, his name escapes me now) a few months ago on 
the performance weekly meeting. He was analysing the CPU cache miss effects with 
Ceph, and it looked like a NUMA setup was having quite a severe impact on some 
things. To be honest a lot of it went over my head, but I came away from it with 
a general feeling that if you can get the required performance from one socket, 
then that is probably a better bet. This includes only populating a single socket 
in a dual-socket system. There was also a Ceph tech talk at the start of the year 
(high-performance databases on Ceph) where the presenter was also recommending 
only populating one socket for latency reasons.

Both of those, coupled with the fact that Xeon E3's are the cheapest way to get 
high clock speeds, sort of made my decision.
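
If you do end up on a dual-socket box, one way to approximate the single-socket
behaviour is to pin each OSD to one NUMA node. A rough sketch only (the OSD id and
node number are made up, and in practice you would override the systemd unit rather
than start the daemon by hand):

  # keep osd.3's threads and memory allocations on NUMA node 0
  numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -f --cluster ceph \
      --id 3 --setuser ceph --setgroup ceph
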

> 
> We're looking here at a case where he's trying to reduce latency by all means 
> and where the actual CPU needs for the HDDs are
> negligible.
> The idea being that a "Ceph IOPS" stays on one core which is hopefully also 
> not being shared at that time.
> 
> If you're looking at full SSD nodes OTOH a single CPU may very well not be 
> able to saturate a sensible number of SSDs per node, so a slight penalty but 
> better utilization and overall IOPS with 2 CPUs may be the way forward.

Definitely, as always work out what your requirements are and design around 
them.  

> 
> Christian
> 
> > Thanks - B
> >
> > On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk  wrote:
> > >> -Original Message-----
> > >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> > >> Sent: 21 August 2016 04:15
> > >> To: Nick Fisk 
> > >> Cc: w...@globe.de; Horace Ng ; ceph-users
> > >> 
> > >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > >>
> > >> Hi Nick,
> > >>
> > >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
> > >> >> -Original Message-
> > >> >> From: w...@globe.de [mailto:w...@globe.de]
> > >> >> Sent: 21 July 2016 13:23
> > >> >> To: n...@fisk.me.uk; 'Horace Ng' 
> > >> >> Cc: ceph-users@lists.ceph.com
> > >> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread
> > >> >> Performance
> > >> >>
> > >> >> Okay and what is your plan now to speed up ?
> > >> >
> > >> > Now I have come up with a lower latency hardware design, there is
> > >> > not much further improvement until persistent RBD caching is
> > >> implemented, as you will be moving the SSD/NVME closer to the
> > >> client. But I'm happy with what I can achieve at the moment. You could 
> > >> also experiment with bcache on the RBD.
> > >>
> > >> Reviving this thread, would you be willing to share the details of
> > >> the low latency hardware design?  Are you optimizing for NFS or iSCSI?
> > >
> > > Both really, just trying to get the write latency as low as possible, as 
> > > you know, vmware does everything with lots of
unbuffered
> small io's. Eg when you migrate a VM or as thin vmdk's grow.
> > >
> > > Even storage vmotions which might kick off 32 threads, as they all 
> > > roughly fall on the same PG, there still appears to be a
> bottleneck with contention on the PG itself.
> > >
> > > These were the sort of things I was trying to optimise for, to make the 
> > > time spent in Ceph as minimal as possible for each IO.
> > >
> > > So onto the hardware. Through reading various threads and experiments on 
> > > my own I came to the following conclusions.
> > >
> > > -You need highest possible frequency on the CPU cores, which normally also 
> > > means less of them.

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Christian Balzer

Hello,

On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:

> Hi Nick
> 
> Interested in this comment - "-Dual sockets are probably bad and will
> impact performance."
> 
> Have you got real world experience of this being the case?
> 
Well, Nick wrote "probably".

Dual sockets and thus NUMA, the need for CPUs to talk to each other and
share information certainly can impact things that are very time critical.
How much though is a question of design, both HW and SW.

We're looking here at a case where he's trying to reduce latency by all
means and where the actual CPU needs for the HDDs are negligible.
The idea being that a "Ceph IOPS" stays on one core which is hopefully
also not being shared at that time.

If you're looking at full SSD nodes OTOH a single CPU may very well not be
able to saturate a sensible number of SSDs per node, so a slight penalty
but better utilization and overall IOPS with 2 CPUs may be the way forward.

Christian

> Thanks - B
> 
> On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk  wrote:
> >> -Original Message-
> >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> >> Sent: 21 August 2016 04:15
> >> To: Nick Fisk 
> >> Cc: w...@globe.de; Horace Ng ; ceph-users 
> >> 
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi Nick,
> >>
> >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
> >> >> -Original Message-----
> >> >> From: w...@globe.de [mailto:w...@globe.de]
> >> >> Sent: 21 July 2016 13:23
> >> >> To: n...@fisk.me.uk; 'Horace Ng' 
> >> >> Cc: ceph-users@lists.ceph.com
> >> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >> >>
> >> >> Okay and what is your plan now to speed up ?
> >> >
> >> > Now I have come up with a lower latency hardware design, there is not 
> >> > much further improvement until persistent RBD caching is
> >> implemented, as you will be moving the SSD/NVME closer to the client. But 
> >> I'm happy with what I can achieve at the moment. You
> >> could also experiment with bcache on the RBD.
> >>
> >> Reviving this thread, would you be willing to share the details of the low 
> >> latency hardware design?  Are you optimizing for NFS or
> >> iSCSI?
> >
> > Both really, just trying to get the write latency as low as possible, as 
> > you know, vmware does everything with lots of unbuffered small io's. Eg 
> > when you migrate a VM or as thin vmdk's grow.
> >
> > Even storage vmotions which might kick off 32 threads, as they all roughly 
> > fall on the same PG, there still appears to be a bottleneck with contention 
> > on the PG itself.
> >
> > These were the sort of things I was trying to optimise for, to make the 
> > time spent in Ceph as minimal as possible for each IO.
> >
> > So onto the hardware. Through reading various threads and experiments on my 
> > own I came to the following conclusions.
> >
> > -You need highest possible frequency on the CPU cores, which normally also 
> > means less of them.
> > -Dual sockets are probably bad and will impact performance.
> > -Use NVME's for journals to minimise latency
> >
> > The end result was OSD nodes based off of a 3.5Ghz Xeon E3v5 with an Intel 
> > P3700 for a journal. I used the SuperMicro X11SSH-CTF board which has 10G-T 
> > onboard as well as 8SATA and 8SAS, so no expansion cards required. Actually 
> > this design as well as being very performant for Ceph, also works out very 
> > cheap as you are using low end server parts. The whole lot + 12x7.2k disks 
> > all goes into a 1U case.
> >
> > During testing I noticed that by default c-states and p-states slaughter 
> > performance. After forcing max cstate to 1 and forcing the CPU frequency up 
> > to max, I was seeing 600us latency for a 4kb write to a 3xreplica pool, or 
> > around 1600IOPs, this is at QD=1.
> >
> > Few other observations:
> > 1. Power usage is around 150-200W for this config with 12x7.2k disks
> > 2. CPU usage maxing out disks, is only around 10-15%, so plenty of headroom 
> > for more disks.
> > 3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage
> > 4. No idea about CPU load for pure SSD nodes, but based on the current 
> > disks, you could maybe expect ~1iops per node, before maxing out CPU's
> > 5. Single NVME seems to be able to journal 12 disks with no problem during 
> > normal operation, no doubt a specific benchmark could max it out though.

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Brian ::
Hi Nick

Interested in this comment - "-Dual sockets are probably bad and will
impact performance."

Have you got real world experience of this being the case?

Thanks - B

On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk  wrote:
>> -Original Message-
>> From: Alex Gorbachev [mailto:a...@iss-integration.com]
>> Sent: 21 August 2016 04:15
>> To: Nick Fisk 
>> Cc: w...@globe.de; Horace Ng ; ceph-users 
>> 
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Hi Nick,
>>
>> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
>> >> -Original Message-
>> >> From: w...@globe.de [mailto:w...@globe.de]
>> >> Sent: 21 July 2016 13:23
>> >> To: n...@fisk.me.uk; 'Horace Ng' 
>> >> Cc: ceph-users@lists.ceph.com
>> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>> >>
>> >> Okay and what is your plan now to speed up ?
>> >
>> > Now I have come up with a lower latency hardware design, there is not much 
>> > further improvement until persistent RBD caching is
>> implemented, as you will be moving the SSD/NVME closer to the client. But 
>> I'm happy with what I can achieve at the moment. You
>> could also experiment with bcache on the RBD.
>>
>> Reviving this thread, would you be willing to share the details of the low 
>> latency hardware design?  Are you optimizing for NFS or
>> iSCSI?
>
> Both really, just trying to get the write latency as low as possible, as you 
> know, vmware does everything with lots of unbuffered small io's. Eg when you 
> migrate a VM or as thin vmdk's grow.
>
> Even storage vmotions which might kick off 32 threads, as they all roughly 
> fall on the same PG, there still appears to be a bottleneck with contention 
> on the PG itself.
>
> These were the sort of things I was trying to optimise for, to make the time 
> spent in Ceph as minimal as possible for each IO.
>
> So onto the hardware. Through reading various threads and experiments on my 
> own I came to the following conclusions.
>
> -You need highest possible frequency on the CPU cores, which normally also 
> means less of them.
> -Dual sockets are probably bad and will impact performance.
> -Use NVME's for journals to minimise latency
>
> The end result was OSD nodes based off of a 3.5Ghz Xeon E3v5 with an Intel 
> P3700 for a journal. I used the SuperMicro X11SSH-CTF board which has 10G-T 
> onboard as well as 8SATA and 8SAS, so no expansion cards required. Actually 
> this design as well as being very performant for Ceph, also works out very 
> cheap as you are using low end server parts. The whole lot + 12x7.2k disks 
> all goes into a 1U case.
>
> During testing I noticed that by default c-states and p-states slaughter 
> performance. After forcing max cstate to 1 and forcing the CPU frequency up 
> to max, I was seeing 600us latency for a 4kb write to a 3xreplica pool, or 
> around 1600IOPs, this is at QD=1.
>
> Few other observations:
> 1. Power usage is around 150-200W for this config with 12x7.2k disks
> 2. CPU usage maxing out disks, is only around 10-15%, so plenty of headroom 
> for more disks.
> 3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage
> 4. No idea about CPU load for pure SSD nodes, but based on the current disks, 
> you could maybe expect ~1iops per node, before maxing out CPU's
> 5. Single NVME seems to be able to journal 12 disks with no problem during 
> normal operation, no doubt a specific benchmark could max it out though.
> 6. There are slightly faster Xeon E3's, but price/performance = diminishing 
> returns
>
> Hope that answers all your questions.
> Nick
>
>>
>> Thank you,
>> Alex
>>
>> >
>> >>
>> >> Would it help to put in multiple P3700 per OSD Node to improve 
>> >> performance for a single Thread (example Storage VMotion) ?
>> >
>> > Most likely not, it's all the other parts of the puzzle which are causing 
>> > the latency. ESXi was designed for storage arrays that service
>> IO's in 100us-1ms range, Ceph is probably about 10x slower than this, hence 
>> the problem. Disable the BBWC on a RAID controller or
>> SAN and you will see the same behaviour.
>> >
>> >>
>> >> Regards
>> >>
>> >>
>> >> On 21.07.16 at 14:17, Nick Fisk wrote:
>> >> >> -Original Message-
>> >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>> >> >> Behalf Of w...

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk
> -Original Message-
> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> Sent: 21 August 2016 04:15
> To: Nick Fisk 
> Cc: w...@globe.de; Horace Ng ; ceph-users 
> 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi Nick,
> 
> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
> >> -Original Message-
> >> From: w...@globe.de [mailto:w...@globe.de]
> >> Sent: 21 July 2016 13:23
> >> To: n...@fisk.me.uk; 'Horace Ng' 
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Okay and what is your plan now to speed up ?
> >
> > Now I have come up with a lower latency hardware design, there is not much 
> > further improvement until persistent RBD caching is
> implemented, as you will be moving the SSD/NVME closer to the client. But I'm 
> happy with what I can achieve at the moment. You
> could also experiment with bcache on the RBD.
> 
> Reviving this thread, would you be willing to share the details of the low 
> latency hardware design?  Are you optimizing for NFS or
> iSCSI?

Both really; just trying to get the write latency as low as possible. As you 
know, VMware does everything with lots of unbuffered small IOs, e.g. when you 
migrate a VM or as thin VMDKs grow.

Even with storage vMotions, which might kick off 32 threads, since they all roughly 
fall on the same PG there still appears to be a bottleneck with contention on the 
PG itself. 

These were the sort of things I was trying to optimise for, to make the time 
spent in Ceph as minimal as possible for each IO.

So onto the hardware. Through reading various threads and experiments on my own 
I came to the following conclusions. 

-You need highest possible frequency on the CPU cores, which normally also 
means less of them. 
-Dual sockets are probably bad and will impact performance.
-Use NVME's for journals to minimise latency

The end result was OSD nodes based on a 3.5 GHz Xeon E3v5 with an Intel 
P3700 for a journal. I used the SuperMicro X11SSH-CTF board, which has 10G-T 
onboard as well as 8x SATA and 8x SAS, so no expansion cards are required. As well 
as being very performant for Ceph, this design also works out very cheap, as you 
are using low-end server parts. The whole lot + 12x 7.2k disks all goes into a 1U 
case.

During testing I noticed that by default c-states and p-states slaughter 
performance. After forcing the maximum C-state to 1 and forcing the CPU frequency up 
to max, I was seeing 600us latency for a 4kB write to a 3x-replica pool, or around 
1600 IOPS; this is at QD=1.
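
As a rough illustration of that tuning (a minimal sketch only; it assumes the
cpupower utility is installed, and the exact knobs vary by distro and kernel):

  # run every core at its maximum frequency
  cpupower frequency-set -g performance

  # disable idle states with an exit latency above the given number of microseconds
  cpupower idle-set -D 2

  # for a persistent "max C-state = 1", the equivalent kernel command line is:
  #   intel_idle.max_cstate=1 processor.max_cstate=1
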

Few other observations:
1. Power usage is around 150-200W for this config with 12x7.2k disks
2. CPU usage when maxing out the disks is only around 10-15%, so there is plenty of 
headroom for more disks.
3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage
4. No idea about CPU load for pure SSD nodes, but based on the current disks, 
you could maybe expect ~1iops per node, before maxing out CPU's
5. Single NVME seems to be able to journal 12 disks with no problem during 
normal operation, no doubt a specific benchmark could max it out though.
6. There are slightly faster Xeon E3's, but price/performance = diminishing 
returns

Hope that answers all your questions.
Nick

> 
> Thank you,
> Alex
> 
> >
> >>
> >> Would it help to put in multiple P3700 per OSD Node to improve performance 
> >> for a single Thread (example Storage VMotion) ?
> >
> > Most likely not, it's all the other parts of the puzzle which are causing 
> > the latency. ESXi was designed for storage arrays that service
> IO's in 100us-1ms range, Ceph is probably about 10x slower than this, hence 
> the problem. Disable the BBWC on a RAID controller or
> SAN and you will see the same behaviour.
> >
> >>
> >> Regards
> >>
> >>
> >> On 21.07.16 at 14:17, Nick Fisk wrote:
> >> >> -Original Message-
> >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >> >> Behalf Of w...@globe.de
> >> >> Sent: 21 July 2016 13:04
> >> >> To: n...@fisk.me.uk; 'Horace Ng' 
> >> >> Cc: ceph-users@lists.ceph.com
> >> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread
> >> >> Performance
> >> >>
> >> >> Hi,
> >> >>
> >> >> hmm i think 200 MByte/s is really bad. Is your Cluster in production 
> >> >> right now?
> >> > It's just been built, not running yet.
> >> >
> >> >> So if you start a storage migration you get only 200 MByte/s right?

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-20 Thread Alex Gorbachev
Hi Nick,

On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
>> -Original Message-
>> From: w...@globe.de [mailto:w...@globe.de]
>> Sent: 21 July 2016 13:23
>> To: n...@fisk.me.uk; 'Horace Ng' 
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Okay and what is your plan now to speed up ?
>
> Now I have come up with a lower latency hardware design, there is not much 
> further improvement until persistent RBD caching is implemented, as you will 
> be moving the SSD/NVME closer to the client. But I'm happy with what I can 
> achieve at the moment. You could also experiment with bcache on the RBD.

Reviving this thread, would you be willing to share the details of the
low latency hardware design?  Are you optimizing for NFS or iSCSI?

Thank you,
Alex

>
>>
>> Would it help to put in multiple P3700 per OSD Node to improve performance 
>> for a single Thread (example Storage VMotion) ?
>
> Most likely not, it's all the other parts of the puzzle which are causing the 
> latency. ESXi was designed for storage arrays that service IO's in 100us-1ms 
> range, Ceph is probably about 10x slower than this, hence the problem. 
> Disable the BBWC on a RAID controller or SAN and you will see the same behaviour.
>
>>
>> Regards
>>
>>
>> On 21.07.16 at 14:17, Nick Fisk wrote:
>> >> -Original Message-
>> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> >> Of w...@globe.de
>> >> Sent: 21 July 2016 13:04
>> >> To: n...@fisk.me.uk; 'Horace Ng' 
>> >> Cc: ceph-users@lists.ceph.com
>> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>> >>
>> >> Hi,
>> >>
>> >> Hmm, I think 200 MByte/s is really bad. Is your cluster in production 
>> >> right now?
>> > It's just been built, not running yet.
>> >
>> >> So if you start a storage migration you get only 200 MByte/s right?
>> > I wish. My current cluster (not this new one) would storage migrate at
>> > ~10-15MB/s. Serial latency is the problem: without being able to
>> > buffer, ESXi waits on an ack for each IO before sending the next. Also it 
>> > submits the migrations in 64kb chunks, unless you get VAAI
>> > working. I think ESXi will try and do them in parallel, which will help as 
>> > well.
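
For what it's worth, you can check from the ESXi side whether the VAAI offloads are
actually in effect for a given LUN; the device identifier below is only a placeholder:

  # reports ATS, Clone/XCOPY, Zero and Delete status per device
  esxcli storage core device vaai status get -d naa.xxxxxxxxxxxxxxxx
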
>> >
>> >> I think it would be awesome if you get 1000 MByte/s
>> >>
>> >> Where is the Bottleneck?
>> > Latency serialisation, without a buffer, you can't drive the devices
>> > to 100%. With buffered IO (or high queue depths) I can max out the 
>> > journals.
>> >
>> >> A FIO test from Sebastien Han gives us 400 MByte/s raw performance from 
>> >> the P3700.
>> >>
>> >> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
>> >> -ssd-is-suitable-as-a-journal-device/
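
The test from that post boils down to a single-job sync write run along these lines
(reconstructed from memory, so check the post itself). Note that it writes directly
to the device, so only point it at a disk or partition whose contents you can
destroy, and raise numjobs to see how the device scales:

  fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
      --name=journal-test
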
>> >>
>> >> How could it be that the rbd client performance is 50% slower?
>> >>
>> >> Regards
>> >>
>> >>
>> >> On 21.07.16 at 12:15, Nick Fisk wrote:
>> >>> I've had a lot of pain with this, smaller block sizes are even worse.
>> >>> You want to try and minimize latency at every point as there is no
>> >>> buffering happening in the iSCSI stack. This means:-
>> >>>
>> >>> 1. Fast journals (NVME or NVRAM)
>> >>> 2. 10GB or better networking
>> >>> 3. Fast CPU's (Ghz)
>> >>> 4. Fix CPU c-state's to C1
>> >>> 5. Fix CPU's Freq to max
>> >>>
>> >>> Also I can't be sure, but I think there is a metadata update
>> >>> happening with VMFS, particularly if you are using thin VMDK's, this
>> >>> can also be a major bottleneck. For my use case, I've switched over to 
>> >>> NFS as it has given much more performance at scale and
>> less headache.
>> >>>
>> >>> For the RADOS Run, here you go (400GB P3700):
>> >>>
>> >>> Total time run: 60.026491
>> >>> Total writes made:  3104
>> >>> Write size: 4194304
>> >>> Object size:4194304
>> >>> Bandwidth (MB/sec): 206.842
>> >>> Stddev Bandwidth:   8.10412
>> >>> Max bandwidth (MB/sec): 224
>> >>> M
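
For reference, output in that form comes from rados bench. A run against a test pool
looks something like the one below; the concurrency behind the numbers above isn't
shown, so -t 1 here is only an assumption to match the single-thread theme:

  rados bench -p rbd 60 write -t 1 --no-cleanup
  # remove the benchmark objects afterwards
  rados -p rbd cleanup
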

Re: [ceph-users] ceph + vmware

2016-07-26 Thread Jake Young
On Thursday, July 21, 2016, Mike Christie  wrote:

> On 07/21/2016 11:41 AM, Mike Christie wrote:
> > On 07/20/2016 02:20 PM, Jake Young wrote:
> >>
> >> For starters, STGT doesn't implement VAAI properly and you will need to
> >> disable VAAI in ESXi.
> >>
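
Disabling the VAAI primitives host-wide on ESXi is done through the advanced
settings, for example:

  esxcli system settings advanced set -o /DataMover/HardwareAcceleratedMove -i 0
  esxcli system settings advanced set -o /DataMover/HardwareAcceleratedInit -i 0
  esxcli system settings advanced set -o /VMFS3/HardwareAcceleratedLocking -i 0
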
> >> LIO does seem to implement VAAI properly, but performance is not nearly
> >> as good as STGT even with VAAI's benefits. The assumption for the cause
> >> is that LIO currently uses kernel rbd mapping and kernel rbd performance
> >> is not as good as librbd.
> >>
> >> I recently did a simple test of creating an 80GB eager zeroed disk with
> >> STGT (VAAI disabled, no rbd client cache) and LIO (VAAI enabled) and
> >> found that STGT was actually slightly faster.
> >>
> >> I think we're all holding our breath waiting for LIO librbd support via
> >> TCMU, which seems to be right around the corner. That solution will
> >
> > Is there a thread for that?


Not a thread, but it has come up a few times...  Maybe I'm getting ahead of
myself. I can't wait for this solution to be available.


> >
> >> combine the performance benefits of librbd with the more feature-full
> >> LIO iSCSI interface. The lrbd configuration tool for LIO from SUSE is
> >> pretty cool and it makes configuring LIO easier than STGT.
> >>
> >
> > I wrote a tcmu rbd driver a while back. It is based on gpl2 code, so
> > Andy could not take it into tcmu. I attached it here if you want to play
> > with it.
> >
>
> Here it is attached in patch form built against the current tcmu code.
>
> I have not tested it since March, so if there have been major changes to
> the tcmu code there might be issues.
>
> You should only use this for testing. I wrote it up in a night. I have
> done very little testing.
>
> It only supports READ, WRITE, DISCARD/UNMAP, TUR, MODE_SENSE/SELECT, and
> SYNC_CACHE.
>

Thanks for this!  I was able to patch and compile without errors.

I'm having trouble using it though. Does it require targetcli-fb?  This
should show up as a "User: rbd" backstore, right?
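
If it helps anyone else experimenting: with targetcli-fb and a tcmu-runner build that
includes the rbd handler, creating the backstore should look roughly like the lines
below. The pool and image names are made up, and the exact cfgstring format has
varied between tcmu-runner versions, so treat this purely as a sketch:

  targetcli /backstores/user:rbd create name=vmware01 size=100G cfgstring=rbd/vmware01
  targetcli /backstores/user:rbd ls
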
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr] 
Sent: 22 July 2016 15:13
To: n...@fisk.me.uk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

On 22/07/2016 14:10, Nick Fisk wrote:

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 11:19
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Jake Young'  
<mailto:jak3...@gmail.com> ; 'Jan Schermer'  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

On 22/07/2016 11:48, Nick Fisk wrote:

 

 

From: Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr] 
Sent: 22 July 2016 10:40
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Jake Young'  
<mailto:jak3...@gmail.com> ; 'Jan Schermer'  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

On 22/07/2016 10:23, Nick Fisk wrote:

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 09:10
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Jake Young'  
<mailto:jak3...@gmail.com> ; 'Jan Schermer'  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

On 22/07/2016 09:47, Nick Fisk wrote:

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young  <mailto:jak3...@gmail.com> ; Jan Schermer  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

On 20/07/2016 21:20, Jake Young wrote:



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz> > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie  <mailto:mchri...@redhat.com> > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access the same LUNs through both targets,
>> but in a failover manner, so that each ESXi always accesses the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
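
On the ESXi side, that failover-style layout is typically achieved by leaving the LUN
on the fixed path selection policy and pinning a preferred path per host, along the
lines of the following (the device and path identifiers are placeholders):

  esxcli storage nmp path list --device naa.xxxxxxxxxxxxxxxx
  esxcli storage nmp device set --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_FIXED
  esxcli storage nmp psp fixed deviceconfig set --device naa.xxxxxxxxxxxxxxxx --path vmhba64:C0:T0:L0
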
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use an application that uses SCSI persistent reservations then you
> could run into trouble, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>
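
The fencing step described above maps onto the monitor blacklist commands, roughly as
follows; the address is only an example, and it is worth checking how your Ceph
release matches blacklist entries before relying on this:

  # fence the old active target node so stuck IO from it can never be applied
  ceph osd blacklist add 192.0.2.10:0/0
  ceph osd blacklist ls
  # only once IO on that node has been flushed and failed:
  ceph osd blacklist rm 192.0.2.10:0/0
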

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware does only use synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as needing to be synchronous. 

 

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't know what in-flight 
IO happened before the outage
and which didn't. You could end up with only part of the "transaction" written on 
persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover 

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



Le 22/07/2016 14:10, Nick Fisk a écrit :


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *Frédéric Nass

*Sent:* 22 July 2016 11:19
*To:* n...@fisk.me.uk; 'Jake Young' ; 'Jan 
Schermer' 

*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 11:48, Nick Fisk a écrit :

*From:*Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr]
*Sent:* 22 July 2016 10:40
*To:* n...@fisk.me.uk <mailto:n...@fisk.me.uk>; 'Jake Young'
 <mailto:jak3...@gmail.com>; 'Jan Schermer'
 <mailto:j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 10:23, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
*On Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 09:10
*To:* n...@fisk.me.uk <mailto:n...@fisk.me.uk>; 'Jake Young'
 <mailto:jak3...@gmail.com>; 'Jan Schermer'
 <mailto:j...@schermer.cz>
    *Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 09:47, Nick Fisk a écrit :

*From:*ceph-users
[mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of
*Frédéric Nass
*Sent:* 22 July 2016 08:11
*To:* Jake Young 
<mailto:jak3...@gmail.com>; Jan Schermer 
<mailto:j...@schermer.cz>
        *Cc:* ceph-users@lists.ceph.com
<mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer
mailto:j...@schermer.cz>> wrote:


> On 20 Jul 2016, at 18:38, Mike Christie
mailto:mchri...@redhat.com>>
wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with
VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non
HA support though.
>
>>
>> Knowing that HA iSCSI target was on the
roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS
targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD
images. Each iSCSI target
>> has all VAAI primitives enabled and run the
same configuration.
>> - RBD images are mapped on each target using
the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs
through both targets,
>> but in a failover manner so that each ESXi
always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives
are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
configuration ?
>
> If you use a application that uses scsi
persistent reservations then you
> could run into troubles, because some apps
expect the reservation info
> to be on the failover nodes as well as the
active ones.
>
> Depending on the how you do failover and the
issue that caused the
> failover, IO could be stuck on the old active
node and cause data
> corruption. If the initial active node looses
its network connectivity
> and you failover, you have to make sure that the
in

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 11:19
To: n...@fisk.me.uk; 'Jake Young' ; 'Jan Schermer' 

Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 11:48, Nick Fisk a écrit :

 

 

From: Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr] 
Sent: 22 July 2016 10:40
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Jake Young'  
<mailto:jak3...@gmail.com> ; 'Jan Schermer'  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 10:23, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 09:10
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Jake Young'  
<mailto:jak3...@gmail.com> ; 'Jan Schermer'  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 09:47, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young  <mailto:jak3...@gmail.com> ; Jan Schermer  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz> > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie  <mailto:mchri...@redhat.com> > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>> but in a failover manner so that each ESXi always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use a application that uses scsi persistent reservations then you
> could run into troubles, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node looses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware does only use synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as needing to be synchronous. 

 

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't know what in-flight 
IO happened before the outage
and which didn't. You could end with only part of the "transaction" written on 
persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images on a gallery 
site could be fine but corrupting
a large database tablespace is no fun at all.

 

No, it's not. VMFS corruption is pretty bad too and there is no fsck for VMFS...

 


Some (non opensource) solutions exist, Solaris supposedly does this in some(?) 
way, maybe some iSCSI guru
can chime tell us what ma

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



Le 22/07/2016 11:48, Nick Fisk a écrit :


*From:*Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr]
*Sent:* 22 July 2016 10:40
*To:* n...@fisk.me.uk; 'Jake Young' ; 'Jan 
Schermer' 

*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 10:23, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 09:10
*To:* n...@fisk.me.uk <mailto:n...@fisk.me.uk>; 'Jake Young'
 <mailto:jak3...@gmail.com>; 'Jan Schermer'
 <mailto:j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 09:47, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
*On Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 08:11
*To:* Jake Young 
<mailto:jak3...@gmail.com>; Jan Schermer 
<mailto:j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz>> wrote:


> On 20 Jul 2016, at 18:38, Mike Christie
mailto:mchri...@redhat.com>> wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare
ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA
support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we
chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets
when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD
images. Each iSCSI target
>> has all VAAI primitives enabled and run the same
configuration.
>> - RBD images are mapped on each target using the
kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs
through both targets,
>> but in a failover manner so that each ESXi always
access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are
enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use a application that uses scsi persistent
reservations then you
> could run into troubles, because some apps expect
the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue
that caused the
> failover, IO could be stuck on the old active node
and cause data
> corruption. If the initial active node looses its
network connectivity
> and you failover, you have to make sure that the
initial active node is
> fenced off and IO stuck on that node will never be
executed. So do
> something like add it to the ceph monitor blacklist
and make sure IO on
> that node is flushed and failed before
unblacklisting it.
>

With iSCSI you can't really do hot failover unless you
only use synchronous IO.

VMware does only use synchronous IO. Since the hypervisor
can't tell what type of data the VMs are writing, all IO
is treated as needing to be synchronous.

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you
don't know what in-flight IO happened before the outage
and which di

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr] 
Sent: 22 July 2016 10:40
To: n...@fisk.me.uk; 'Jake Young' ; 'Jan Schermer' 

Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 10:23, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 09:10
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Jake Young'  
<mailto:jak3...@gmail.com> ; 'Jan Schermer'  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 09:47, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young  <mailto:jak3...@gmail.com> ; Jan Schermer  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz> > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie  <mailto:mchri...@redhat.com> > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>> but in a failover manner so that each ESXi always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use a application that uses scsi persistent reservations then you
> could run into troubles, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node looses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware does only use synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as needing to be synchronous. 

 

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't know what in-flight 
IO happened before the outage
and which didn't. You could end with only part of the "transaction" written on 
persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images on a gallery 
site could be fine but corrupting
a large database tablespace is no fun at all.

 

No, it's not. VMFS corruption is pretty bad too and there is no fsck for VMFS...

 


Some (non opensource) solutions exist, Solaris supposedly does this in some(?) 
way, maybe some iSCSI guru
can chime tell us what magic they do, but I don't think it's possible without 
client support
(you essentialy have to do something like transactions and replay the last 
transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO synchronous or make it 
at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring between the targets) 
without making it synchronous all the way.

 

This is why th

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



Le 22/07/2016 10:23, Nick Fisk a écrit :


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *Frédéric Nass

*Sent:* 22 July 2016 09:10
*To:* n...@fisk.me.uk; 'Jake Young' ; 'Jan 
Schermer' 

*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 09:47, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 08:11
*To:* Jake Young  <mailto:jak3...@gmail.com>;
Jan Schermer  <mailto:j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
    *Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz>> wrote:


> On 20 Jul 2016, at 18:38, Mike Christie
mailto:mchri...@redhat.com>> wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare
ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA
support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we
chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when
it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images.
Each iSCSI target
>> has all VAAI primitives enabled and run the same
configuration.
>> - RBD images are mapped on each target using the kernel
client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through
both targets,
>> but in a failover manner so that each ESXi always
access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are
enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use a application that uses scsi persistent
reservations then you
> could run into troubles, because some apps expect the
reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that
caused the
> failover, IO could be stuck on the old active node and
cause data
> corruption. If the initial active node looses its
network connectivity
> and you failover, you have to make sure that the initial
active node is
> fenced off and IO stuck on that node will never be
executed. So do
> something like add it to the ceph monitor blacklist and
make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you
only use synchronous IO.

VMware does only use synchronous IO. Since the hypervisor
can't tell what type of data the VMs are writing, all IO is
treated as needing to be synchronous.

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't
know what in-flight IO happened before the outage
and which didn't. You could end with only part of the
"transaction" written on persistent storage.

If you only use synchronous IO all the way from client to
the persistent storage shared between
iSCSI target then all should be fine, otherwise YMMV -
some people run it like that without realizing
the dangers and have never had a problem, so it may be
strictly theoretical, and it all depends on how often you
need to do the
failover and what data you are storing - corrupting a few
images on a gallery site could be fine but corrupting
a large database tablespace is no fun at all.

No, it's not. VMFS corruption is pretty bad too and there i

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 09:10
To: n...@fisk.me.uk; 'Jake Young' ; 'Jan Schermer' 

Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 09:47, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young  <mailto:jak3...@gmail.com> ; Jan Schermer  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz> > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie  <mailto:mchri...@redhat.com> > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>> but in a failover manner so that each ESXi always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use a application that uses scsi persistent reservations then you
> could run into troubles, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node looses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware does only use synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as
needing to be synchronous. 

 

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't know what in-flight 
IO happened before the outage
and which didn't. You could end with only part of the "transaction" written on 
persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images on a gallery 
site could be fine but corrupting
a large database tablespace is no fun at all.

 

No, it's not. VMFS corruption is pretty bad too and there is no fsck for VMFS...

 


Some (non opensource) solutions exist, Solaris supposedly does this in some(?) 
way, maybe some iSCSI guru
can chime tell us what magic they do, but I don't think it's possible without 
client support
(you essentialy have to do something like transactions and replay the last 
transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO synchronous or make it 
at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring between the targets) 
without making it synchronous all the way.

 

This is why the SAN vendors wrote their own clients and drivers. It is not 
possible to dynamically make all OS's do what your iSCSI
target expects. 

 

Something like VMware does the right thing pretty much all the time (there are 
some iSCSI initiator bugs in earlier ESXi 5.x).  If
you have control of your ESXi hosts then attempting to set up HA iSCSI targets 
is possible. 

 

If you have a mixed client environment with v

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



Le 22/07/2016 09:47, Nick Fisk a écrit :


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *Frédéric Nass

*Sent:* 22 July 2016 08:11
*To:* Jake Young ; Jan Schermer 
*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz>> wrote:


> On 20 Jul 2016, at 18:38, Mike Christie mailto:mchri...@redhat.com>> wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi
client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support
though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose
iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's
available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each
iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel
client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through
both targets,
>> but in a failover manner so that each ESXi always access
the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled
client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use a application that uses scsi persistent
reservations then you
> could run into troubles, because some apps expect the
reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that
caused the
> failover, IO could be stuck on the old active node and cause
data
> corruption. If the initial active node looses its network
connectivity
> and you failover, you have to make sure that the initial
active node is
> fenced off and IO stuck on that node will never be executed.
So do
> something like add it to the ceph monitor blacklist and make
sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only
use synchronous IO.

VMware does only use synchronous IO. Since the hypervisor can't
tell what type of data the VMs are writing, all IO is treated as
needing to be synchronous.

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't
know what in-flight IO happened before the outage
and which didn't. You could end with only part of the
"transaction" written on persistent storage.

If you only use synchronous IO all the way from client to the
persistent storage shared between
iSCSI target then all should be fine, otherwise YMMV - some
people run it like that without realizing
the dangers and have never had a problem, so it may be
strictly theoretical, and it all depends on how often you need
to do the
failover and what data you are storing - corrupting a few
images on a gallery site could be fine but corrupting
a large database tablespace is no fun at all.

No, it's not. VMFS corruption is pretty bad too and there is no
fsck for VMFS...


Some (non opensource) solutions exist, Solaris supposedly does
this in some(?) way, maybe some iSCSI guru
can chime in and tell us what magic they do, but I don't think it's
possible without client support
(you essentially have to do something like transactions and
replay the last transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO
synchronous or make it at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring
between the targets) without making it synchronous all the way.

This is why the SAN vendors wrote their own clients and drivers.
It is not possible to dynamically make all OS's do what your iSCSI target expects.
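
The blacklist-based fencing Mike describes above boils down to something like the following on the Ceph side (the client address/nonce here is a made-up placeholder for the old active gateway; a sketch, not a full failover procedure):

ceph osd blacklist add 192.168.1.10:0/3418211770    # fence the failed target node's RBD client
ceph osd blacklist ls                               # verify the entry is in place before failing over
# ...fail over, make sure stuck IO on the old node is flushed/failed...
ceph osd blacklist rm 192.168.1.10:0/3418211770     # only once it is safe to let that node back in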

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young ; Jan Schermer 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

On 20/07/2016 21:20, Jake Young wrote:



On Wednesday, July 20, 2016, Jan Schermer wrote:


> On 20 Jul 2016, at 18:38, Mike Christie wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>> but in a failover manner so that each ESXi always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use an application that uses scsi persistent reservations then you
> could run into troubles, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware only uses synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as
needing to be synchronous. 

 

(With any of the open-source target software available.)
Flushing the buffers doesn't really help because you don't know what in-flight 
IO happened before the outage
and which didn't. You could end up with only part of the "transaction" written on 
persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images on a gallery 
site could be fine but corrupting
a large database tablespace is no fun at all.

 

No, it's not. VMFS corruption is pretty bad too and there is no fsck for VMFS...

 


Some (non opensource) solutions exist, Solaris supposedly does this in some(?) 
way, maybe some iSCSI guru
can chime in and tell us what magic they do, but I don't think it's possible without 
client support
(you essentially have to do something like transactions and replay the last 
transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO synchronous or make it 
at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring between the targets) 
without making it synchronous all the way.

 

This is why the SAN vendors wrote their own clients and drivers. It is not 
possible to dynamically make all OS's do what your iSCSI
target expects. 

 

Something like VMware does the right thing pretty much all the time (there are 
some iSCSI initiator bugs in earlier ESXi 5.x).  If
you have control of your ESXi hosts then attempting to set up HA iSCSI targets 
is possible. 

 

If you have a mixed client environment with various versions of Windows 
connecting to the target, you may be better off buying some
SAN appliances.

 


The one time I had to use it I resorted to simply mirroring in via mdraid on 
the client side over two targets sharing the same
DAS, and this worked fine during testing but never went to production in the 
end.

Jan

>
>>
>> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
>> clients ?
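
The kernel-rbd-plus-LIO export described in this thread can be sketched roughly as below; the pool, image and IQN names are assumptions, and in practice tools like targetcli or SUSE's lrbd wrap these steps:

rbd map rbd/vmware-lun0                                   # kernel client mapping, so no librbd cache
targetcli /backstores/block create name=vmware-lun0 dev=/dev/rbd/rbd/vmware-lun0
targetcli /iscsi create iqn.2016-07.com.example:vmware-lun0
targetcli /iscsi/iqn.2016-07.com.example:vmware-lun0/tpg1/luns create /backstores/block/vmware-lun0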

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



On 20/07/2016 21:20, Jake Young wrote:



On Wednesday, July 20, 2016, Jan Schermer wrote:



> On 20 Jul 2016, at 18:38, Mike Christie wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client
? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI
over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's
available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI
target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client
(so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both
targets,
>> but in a failover manner so that each ESXi always access the
same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled
client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use an application that uses scsi persistent reservations
then you
> could run into troubles, because some apps expect the
reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network
connectivity
> and you failover, you have to make sure that the initial active
node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make
sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use
synchronous IO.


VMware only uses synchronous IO. Since the hypervisor can't tell 
what type of data the VMs are writing, all IO is treated as needing to 
be synchronous.


(With any of the open-source target software available.)
Flushing the buffers doesn't really help because you don't know
what in-flight IO happened before the outage
and which didn't. You could end up with only part of the
"transaction" written on persistent storage.

If you only use synchronous IO all the way from client to the
persistent storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people
run it like that without realizing
the dangers and have never had a problem, so it may be strictly
theoretical, and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images
on a gallery site could be fine but corrupting
a large database tablespace is no fun at all.


No, it's not. VMFS corruption is pretty bad too and there is no fsck 
for VMFS...



Some (non opensource) solutions exist, Solaris supposedly does
this in some(?) way, maybe some iSCSI guru
can chime in and tell us what magic they do, but I don't think it's
possible without client support
(you essentially have to do something like transactions and replay
the last transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO
synchronous or make it at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring between
the targets) without making it synchronous all the way.


This is why the SAN vendors wrote their own clients and drivers. It is 
not possible to dynamically make all OS's do what your iSCSI target 
expects.


Something like VMware does the right thing pretty much all the time 
(there are some iSCSI initiator bugs in earlier ESXi 5.x).  If you 
have control of your ESXi hosts then attempting to set up HA iSCSI 
targets is possible.


If you have a mixed client environment with various versions of 
Windows connecting to the target, you may be better off buying some 
SAN appliances.



The one time I had to use it I resorted to simply mirroring in via
mdraid on the client side over two targets sharing the same
DAS, and this worked fine during testing but never went to
production in the end.

Jan

>
>>
>> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
>> clients ?
>
> I can't say, because I have not used stgt with rbd bs-type
support enough.


For starters, STGT doesn't implement VAAI properly and you will need 
to disable VAAI in ESXi.


LIO does seem to implement VAAI properly, but performance is not nearly 
as good as STGT even with VAAI's benefits.

Re: [ceph-users] ceph + vmware

2016-07-21 Thread Mike Christie
On 07/21/2016 11:41 AM, Mike Christie wrote:
> On 07/20/2016 02:20 PM, Jake Young wrote:
>>
>> For starters, STGT doesn't implement VAAI properly and you will need to
>> disable VAAI in ESXi.
>>
>> LIO does seem to implement VAAI properly, but performance is not nearly
>> as good as STGT even with VAAI's benefits. The assumption for the cause
>> is that LIO currently uses kernel rbd mapping and kernel rbd performance
>> is not as good as librbd. 
>>
>> I recently did a simple test of creating an 80GB eager zeroed disk with
>> STGT (VAAI disabled, no rbd client cache) and LIO (VAAI enabled) and
>> found that STGT was actually slightly faster.
>>
>> I think we're all holding our breath waiting for LIO librbd support via
>> TCMU, which seems to be right around the corner. That solution will
> 
> Is there a thread for that?
> 
>> combine the performance benefits of librbd with the more feature-full
>> LIO iSCSI interface. The lrbd configuration tool for LIO from SUSE is
>> pretty cool and it makes configuring LIO easier than STGT. 
>>
> 
> I wrote a tcmu rbd driver a while back. It is based on gpl2 code, so
> Andy could not take it into tcmu. I attached it here if you want to play
> with it.
> 

Here it is attached in patch form built against the current tcmu code.

I have not tested it since March, so if there have been major changes to
the tcmu code there might be issues.

You should only use this for testing. I wrote it up in a night. I have
done very little testing.

It only supports READ, WRITE, DISCARD/UNMAP, TUR, MODE_SENSE/SELECT, and
SYNC_CACHE.
commit 90846c4f94c3c51d608bd79eb1304a9106ba67c1
Author: Mike Christie 
Date:   Thu Jul 21 12:41:48 2016 -0500

tcmu: add rbd support

Add basic tcmu rbd support.

This does READ, WRITE, DISCARD and FLUSH.

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 507188a..ac8f4b2 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -137,6 +137,24 @@ add_executable(consumer
   )
 target_link_libraries(consumer tcmu)
 
+if (with-rbd)
+	find_library(LIBRBD rbd)
+
+	# Stuff for building the rbd handler
+	add_library(handler_rbd
+	  SHARED
+	  rbd.c
+	  )
+	set_target_properties(handler_rbd
+	  PROPERTIES
+	  PREFIX ""
+	  )
+	target_link_libraries(handler_rbd
+	  ${LIBRBD}
+	  )
+	install(TARGETS handler_rbd DESTINATION ${CMAKE_INSTALL_LIBDIR}/tcmu-runner)
+endif (with-rbd)
+
 if (with-glfs)
 	find_library(GFAPI gfapi)
 
diff --git a/rbd.c b/rbd.c
new file mode 100644
index 000..2dc3b98
--- /dev/null
+++ b/rbd.c
@@ -0,0 +1,818 @@
+/*
+ * Code from QEMU Block driver for RADOS (Ceph) ported to a TCMU handler
+ * by Mike Christie.
+ *
+ * Copyright (C) 2010-2011 Christian Brunner ,
+ * Josh Durgin 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Contributions after 2012-01-13 are licensed under the terms of the
+ * GNU GPL, version 2 or (at your option) any later version.
+ */
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "tcmu-runner.h"
+#include "libtcmu.h"
+
+#include <rbd/librbd.h>
+
+/* rbd_aio_discard added in 0.1.2 */
+#if LIBRBD_VERSION_CODE >= LIBRBD_VERSION(0, 1, 2)
+#define LIBRBD_SUPPORTS_DISCARD
+#else
+#undef LIBRBD_SUPPORTS_DISCARD
+#endif
+
+#define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
+
+#define RBD_MAX_CONF_NAME_SIZE 128
+#define RBD_MAX_CONF_VAL_SIZE 512
+#define RBD_MAX_CONF_SIZE 1024
+#define RBD_MAX_POOL_NAME_SIZE 128
+#define RBD_MAX_SNAP_NAME_SIZE 128
+#define RBD_MAX_SNAPS 100
+
+struct tcmu_rbd_state {
+	rados_t cluster;
+	rados_ioctx_t io_ctx;
+	rbd_image_t image;
+	char name[RBD_MAX_IMAGE_NAME_SIZE];
+	char *snap;
+	uint64_t num_lbas;
+	unsigned int block_size;
+};
+
+enum {
+	RBD_AIO_READ,
+	RBD_AIO_WRITE,
+	RBD_AIO_DISCARD,
+	RBD_AIO_FLUSH,
+};
+
+struct rbd_aio_cb {
+	struct tcmu_device *dev;
+	struct tcmulib_cmd *tcmulib_cmd;
+	int64_t ret;
+	char *bounce;
+	int rbd_aio_cmd;
+	int error;
+	int64_t length;
+};
+
+static int tcmu_rbd_next_tok(char *dst, int dst_len, char *src, char delim,
+			 const char *name, char **p)
+{
+	int l;
+	char *end;
+
+	*p = NULL;
+
+	if (delim != '\0') {
+	for (end = src; *end; ++end) {
+			if (*end == delim) {
+break;
+			}
+			if (*end == '\\' && end[1] != '\0') {
+end++;
+			}
+		}
+		if (*end == delim) {
+			*p = end + 1;
+			*end = '\0';
+		}
+	}
+	l = strlen(src);
+	if (l >= dst_len) {
+		errp("%s too long", name);
+		return -EINVAL;
+	} else if (l == 0) {
+		errp("%s too short", name);
+		return -EINVAL;
+	}
+
+	strncpy(dst, src, dst_len);
+
+	return 0;
+}
+
+static void tcmu_rbd_unescape(char *src)
+{   
+	char *p;
+
+	for (p = src; *src; ++src, ++p) { 
+		if (*src == '\\' && src[1] != '\0') {
+			src++;
+		}
+		*p = *src;
+	}
+	*p = '\0';
+}
+
+static int tcmu_rbd_parsename(const char *config,
+			  char *pool, int

Re: [ceph-users] ceph + vmware

2016-07-21 Thread Mike Christie
On 07/20/2016 02:20 PM, Jake Young wrote:
> 
> For starters, STGT doesn't implement VAAI properly and you will need to
> disable VAAI in ESXi.
> 
> LIO does seem to implement VAAI properly, but performance is not nearly
> as good as STGT even with VAAI's benefits. The assumption for the cause
> is that LIO currently uses kernel rbd mapping and kernel rbd performance
> is not as good as librbd. 
> 
> I recently did a simple test of creating an 80GB eager zeroed disk with
> STGT (VAAI disabled, no rbd client cache) and LIO (VAAI enabled) and
> found that STGT was actually slightly faster.
> 
> I think we're all holding our breath waiting for LIO librbd support via
> TCMU, which seems to be right around the corner. That solution will

Is there a thread for that?

> combine the performance benefits of librbd with the more feature-full
> LIO iSCSI interface. The lrbd configuration tool for LIO from SUSE is
> pretty cool and it makes configuring LIO easier than STGT. 
> 

I wrote a tcmu rbd driver a while back. It is based on gpl2 code, so
Andy could not take it into tcmu. I attached it here if you want to play
with it.
/*
 * Code from QEMU Block driver for RADOS (Ceph) ported to a TCMU handler
 * by Mike Christie.
 *
 * Copyright (C) 2010-2011 Christian Brunner ,
 * Josh Durgin 
 *
 * This work is licensed under the terms of the GNU GPL, version 2.  See
 * the COPYING file in the top-level directory.
 *
 * Contributions after 2012-01-13 are licensed under the terms of the
 * GNU GPL, version 2 or (at your option) any later version.
 */
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include "tcmu-runner.h"
#include "libtcmu.h"

#include <rbd/librbd.h>

/* rbd_aio_discard added in 0.1.2 */
#if LIBRBD_VERSION_CODE >= LIBRBD_VERSION(0, 1, 2)
#define LIBRBD_SUPPORTS_DISCARD
#else
#undef LIBRBD_SUPPORTS_DISCARD
#endif

#define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)

#define RBD_MAX_CONF_NAME_SIZE 128
#define RBD_MAX_CONF_VAL_SIZE 512
#define RBD_MAX_CONF_SIZE 1024
#define RBD_MAX_POOL_NAME_SIZE 128
#define RBD_MAX_SNAP_NAME_SIZE 128
#define RBD_MAX_SNAPS 100

struct tcmu_rbd_state {
rados_t cluster;
rados_ioctx_t io_ctx;
rbd_image_t image;
char name[RBD_MAX_IMAGE_NAME_SIZE];
char *snap;
uint64_t num_lbas;
unsigned int block_size;
};

enum {
RBD_AIO_READ,
RBD_AIO_WRITE,
RBD_AIO_DISCARD,
RBD_AIO_FLUSH,
};

struct rbd_aio_cb {
struct tcmu_device *dev;
struct tcmulib_cmd *tcmulib_cmd;
int64_t ret;
char *bounce;
int rbd_aio_cmd;
int error;
int64_t length;
};

static int tcmu_rbd_next_tok(char *dst, int dst_len, char *src, char delim,
 const char *name, char **p)
{
int l;
char *end;

*p = NULL;

if (delim != '\0') {
for (end = src; *end; ++end) {
if (*end == delim) {
break;
}
if (*end == '\\' && end[1] != '\0') {
end++;
}
}
if (*end == delim) {
*p = end + 1;
*end = '\0';
}
}
l = strlen(src);
if (l >= dst_len) {
errp("%s too long", name);
return -EINVAL;
} else if (l == 0) {
errp("%s too short", name);
return -EINVAL;
}

strncpy(dst, src, dst_len);

return 0;
}

static void tcmu_rbd_unescape(char *src)
{   
char *p;

for (p = src; *src; ++src, ++p) { 
if (*src == '\\' && src[1] != '\0') {
src++;
}
*p = *src;
}
*p = '\0';
}

static int tcmu_rbd_parsename(const char *config,
  char *pool, int pool_len,
  char *snap, int snap_len,
  char *name, int name_len,
  char *conf, int conf_len)
{
char *p, *buf;
int ret;

buf = strdup(config);
p = buf;
*snap = '\0';
*conf = '\0';

ret = tcmu_rbd_next_tok(pool, pool_len, p, '/', "pool name", &p);
if (ret < 0 || !p) {
ret = -EINVAL;
goto done;
}
tcmu_rbd_unescape(pool);

if (strchr(p, '@')) {
ret = tcmu_rbd_next_tok(name, name_len, p, '@', "object name",
&p);
if (ret < 0) {
goto done;
}
ret = tcmu_rbd_next_tok(snap, snap_len, p, ':', "snap name",
&p);
tcmu_rbd_un

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
Yes awesome, as long as you fully test bcache and you are happy with it.

 

Also, if you intend to do HA, you will have to use dual port SAS SSD’s instead 
of NVME and make sure you create your resource agent scripts correctly, 
otherwise bye bye data.

 

If you enable writeback caching in TGT and you have power failure, then 
anything in the cache is lost. This will either mean holes in your data, or 
sections that are out of date. Basically that LUN will most likely be toast and 
you will have to reformat.

 

From: w...@globe.de [mailto:w...@globe.de] 
Sent: 21 July 2016 15:04
To: n...@fisk.me.uk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Okay that should be the answer... 

I think it would be great to use Intel P3700 1.6TB as bcache in the iscsi rbd 
client gateway nodes.

caching device: Intel P3700 1.6TB

backing device: RBD from Ceph Cluster
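
A rough sketch of what that bcache pairing could look like is below (device names are assumptions, and note the warnings elsewhere in this thread that anything sitting in a writeback cache on a gateway is lost on power failure):

make-bcache -C /dev/nvme0n1 -B /dev/rbd0              # P3700 as the cache set, mapped RBD as backing device
echo <cset-uuid> > /sys/block/bcache0/bcache/attach   # attach the backing device to the cache set
echo writeback > /sys/block/bcache0/bcache/cache_mode # the step that makes a gateway crash dangerous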

What do you think? This setup should improve the performance 
dramatically, shouldn't it?

If I enable writeback in these nodes and use tgt for VMware, what happens if 
iSCSI node 1 goes offline? Power loss... or a Linux kernel crash.

 

 

On 21.07.16 at 15:57, Nick Fisk wrote:

What you are seeing is probably averaged over 1 second or something like that. 
So yes in 1 second IO would have run on all OSD’s. But for any 1 point in time 
a single thread will only run on 1 OSD (+2 replicas) assuming the IO size isn’t 
bigger than the object size. 

 

For RBD, If data is striped in 4MB chunks, then you will have to read/write 
more than 4MB at a time to cross over to the next object. You get exactly the 
same problems with reading when you don’t set the readahead above 4MB.

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
w...@globe.de <mailto:w...@globe.de> 
Sent: 21 July 2016 14:05
To: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

That can not be correct.

Check it at your cluster with dstat as i said...

You will see at every node parallel IO on every OSD and journal

 

On 21.07.16 at 15:02, Jake Young wrote:

I think the answer is that with 1 thread you can only ever write to one journal 
at a time. Theoretically, you would need 10 threads to be able to write to 10 
nodes at the same time.  

 

Jake

On Thursday, July 21, 2016, w...@globe.de wrote:

What I don't really understand is:

Lets say the Intel P3700 works with 200 MByte/s rados bench one thread... See 
Nicks results below...

If we have multiple OSD Nodes. For example 10 Nodes.

Every Node has exactly 1x P3700 NVMe built in.

Why is the single Thread performance exactly at 200 MByte/s on the rbd client 
with 10 OSD Node Cluster???

I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.

 

Everyone look yourself at your cluster. 

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the cluster if you 
test at the client side with rados bench...

rados bench -p rbd 60 write -b 4M -t 1

 

 

On 21.07.16 at 14:38, w...@globe.de wrote:

Is there not a way to enable the Linux page cache? So do not use D_Sync... 

Then the performance would improve dramatically. 


On 21.07.16 at 14:33, Nick Fisk wrote:




-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance 

Okay and what is your plan now to speed up ? 

Now I have come up with a lower latency hardware design, there is not much 
further improvement until persistent RBD caching is implemented, as you will be 
moving the SSD/NVME closer to the client. But I'm happy with what I can achieve 
at the moment. You could also experiment with bcache on the RBD. 





Would it help to put in multiple P3700 per OSD Node to improve performance for 
a single Thread (example Storage VMotion) ? 

Most likely not, it's all the other parts of the puzzle which are causing the 
latency. ESXi was designed for storage arrays that service IO's in 100us-1ms 
range, Ceph is probably about 10x slower than this, hence the problem. Disable 
the BBWC on a RAID controller or SAN and you will see the same behaviour. 





Regards 


On 21.07.16 at 14:17, Nick Fisk wrote:




-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance 

Hi, 

hmm i think 200 MByte/s is really bad. Is your Cluster in production right now? 

It's just been built, not running yet. 





So if you start a storage migration you get only 200 MByte/s

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
What you are seeing is probably averaged over 1 second or something like that. 
So yes in 1 second IO would have run on all OSD’s. But for any 1 point in time 
a single thread will only run on 1 OSD (+2 replicas) assuming the IO size isn’t 
bigger than the object size. 

 

For RBD, If data is striped in 4MB chunks, then you will have to read/write 
more than 4MB at a time to cross over to the next object. You get exactly the 
same problems with reading when you don’t set the readahead above 4MB.
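
For example, on a krbd-backed gateway the readahead can be raised per device, or images can be created with larger objects so a single stream crosses object boundaries less often (device and image names below are assumptions):

echo 16384 > /sys/block/rbd0/queue/read_ahead_kb    # 16 MB readahead on the mapped device
rbd create rbd/bigobj --size 102400 --order 24      # order 24 = 16 MB objects instead of the 4 MB default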

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
w...@globe.de
Sent: 21 July 2016 14:05
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

That can not be correct.

Check it at your cluster with dstat as i said...

You will see at every node parallel IO on every OSD and journal

 

On 21.07.16 at 15:02, Jake Young wrote:

I think the answer is that with 1 thread you can only ever write to one journal 
at a time. Theoretically, you would need 10 threads to be able to write to 10 
nodes at the same time.  

 

Jake

On Thursday, July 21, 2016, w...@globe.de wrote:

What I don't really understand is:

Lets say the Intel P3700 works with 200 MByte/s rados bench one thread... See 
Nicks results below...

If we have multiple OSD Nodes. For example 10 Nodes.

Every Node has exactly 1x P3700 NVMe built in.

Why is the single Thread performance exactly at 200 MByte/s on the rbd client 
with 10 OSD Node Cluster???

I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.

 

Everyone look yourself at your cluster. 

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the cluster if you 
test at the client side with rados bench...

rados bench -p rbd 60 write -b 4M -t 1

 

 

On 21.07.16 at 14:38, w...@globe.de wrote:

Is there not a way to enable the Linux page cache? So do not use D_Sync... 

Then the performance would improve dramatically. 


On 21.07.16 at 14:33, Nick Fisk wrote:



-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance 

Okay and what is your plan now to speed up ? 

Now I have come up with a lower latency hardware design, there is not much 
further improvement until persistent RBD caching is implemented, as you will be 
moving the SSD/NVME closer to the client. But I'm happy with what I can achieve 
at the moment. You could also experiment with bcache on the RBD. 




Would it help to put in multiple P3700 per OSD Node to improve performance for 
a single Thread (example Storage VMotion) ? 

Most likely not, it's all the other parts of the puzzle which are causing the 
latency. ESXi was designed for storage arrays that service IO's in 100us-1ms 
range, Ceph is probably about 10x slower than this, hence the problem. Disable 
the BBWC on a RAID controller or SAN and you will see the same behaviour. 




Regards 


On 21.07.16 at 14:17, Nick Fisk wrote:



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance 

Hi, 

hmm i think 200 MByte/s is really bad. Is your Cluster in production right now? 

It's just been built, not running yet. 




So if you start a storage migration you get only 200 MByte/s right? 

I wish. My current cluster (not this new one) would storage migrate at 
~10-15MB/s. Serial latency is the problem, without being able to 
buffer, ESXi waits on an ack for each IO before sending the next. Also it 
submits the migrations in 64kb chunks, unless you get VAAI 

working. I think esxi will try and do them in parallel, which will help as 
well. 



I think it would be awesome if you get 1000 MByte/s 

Where is the Bottleneck? 

Latency serialisation, without a buffer, you can't drive the devices 
to 100%. With buffered IO (or high queue depths) I can max out the journals. 




A FIO Test from Sebastien Han give us 400 MByte/s raw performance from the 
P3700. 

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your 
-ssd-is-suitable-as-a-journal-device/ 

How could it be that the rbd client performance is 50% slower? 

Regards 


On 21.07.16 at 12:15, Nick Fisk wrote:



I've had a lot of pain with this, smaller block sizes are even worse. 
You want to try and minimize latency at every point as there is no 
buffering happening in the iSCSI stack. This means:- 

1. Fast journals (NVME or NVRAM) 
2. 10GB or better networking 
3. Fast CPU's (Ghz) 
4. Fix CPU c-state's to C1 
5. Fix CPU's Freq to max 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

Okay that should be the answer...

I think it would be great to use Intel P3700 1.6TB as bcache in the 
iscsi rbd client gateway nodes.


caching device: Intel P3700 1.6TB

backing device: RBD from Ceph Cluster

What do you think? This setup should improve the performance 
dramatically, shouldn't it?


If I enable writeback in these nodes and use tgt for VMware, what 
happens if iSCSI node 1 goes offline? Power loss... or a Linux kernel crash.




On 21.07.16 at 15:57, Nick Fisk wrote:


What you are seeing is probably averaged over 1 second or something 
like that. So yes in 1 second IO would have run on all OSD’s. But for 
any 1 point in time a single thread will only run on 1 OSD (+2 
replicas) assuming the IO size isn’t bigger than the object size.


For RBD, If data is striped in 4MB chunks, then you will have to 
read/write more than 4MB at a time to cross over to the next object. 
You get exactly the same problems with reading when you don’t set the 
readahead above 4MB.


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *w...@globe.de

*Sent:* 21 July 2016 14:05
*To:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance

That can not be correct.

Check it at your cluster with dstat as i said...

You will see at every node parallel IO on every OSD and journal

On 21.07.16 at 15:02, Jake Young wrote:

I think the answer is that with 1 thread you can only ever write
to one journal at a time. Theoretically, you would need 10 threads
to be able to write to 10 nodes at the same time.

Jake

On Thursday, July 21, 2016, w...@globe.de wrote:

What I don't really understand is:

Lets say the Intel P3700 works with 200 MByte/s rados bench
one thread... See Nicks results below...

If we have multiple OSD Nodes. For example 10 Nodes.

Every Node has exactly 1x P3700 NVMe built in.

Why is the single Thread performance exactly at 200 MByte/s on
the rbd client with 10 OSD Node Cluster???

I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.

Everyone look yourself at your cluster.

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the
cluster if you test at the client side with rados bench...

*rados bench -p rbd 60 write -b 4M -t 1*

On 21.07.16 at 14:38, w...@globe.de wrote:

Is there not a way to enable the Linux page cache? So do not
use D_Sync...

Then the performance would improve dramatically.


On 21.07.16 at 14:33, Nick Fisk wrote:

-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Okay and what is your plan now to speed up ?

Now I have come up with a lower latency hardware
design, there is not much further improvement until
persistent RBD caching is implemented, as you will be
moving the SSD/NVME closer to the client. But I'm
happy with what I can achieve at the moment. You could
also experiment with bcache on the RBD.


Would it help to put in multiple P3700 per OSD
Node to improve performance for a single Thread
(example Storage VMotion) ?

Most likely not, it's all the other parts of the
puzzle which are causing the latency. ESXi was
designed for storage arrays that service IO's in
100us-1ms range, Ceph is probably about 10x slower
than this, hence the problem. Disable the BBWC on a
RAID controller or SAN and you will see the same behaviour.


Regards


On 21.07.16 at 14:17, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng'


 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

That can not be correct.

Check it at your cluster with dstat as i said...

You will see at every node parallel IO on every OSD and journal


On 21.07.16 at 15:02, Jake Young wrote:
I think the answer is that with 1 thread you can only ever write to 
one journal at a time. Theoretically, you would need 10 threads to be 
able to write to 10 nodes at the same time.


Jake

On Thursday, July 21, 2016, w...@globe.de wrote:


What I don't really understand is:

Lets say the Intel P3700 works with 200 MByte/s rados bench one
thread... See Nicks results below...

If we have multiple OSD Nodes. For example 10 Nodes.

Every Node has exactly 1x P3700 NVMe built in.

Why is the single Thread performance exactly at 200 MByte/s on the
rbd client with 10 OSD Node Cluster???

I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.


Everyone look yourself at your cluster.

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the
cluster if you test at the client side with rados bench...

*rados bench -p rbd 60 write -b 4M -t 1*



On 21.07.16 at 14:38, w...@globe.de wrote:

Is there not a way to enable the Linux page cache? So do not use
D_Sync...

Then the performance would improve dramatically.


On 21.07.16 at 14:33, Nick Fisk wrote:

-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Okay and what is your plan now to speed up ?

Now I have come up with a lower latency hardware design, there
is not much further improvement until persistent RBD caching is
implemented, as you will be moving the SSD/NVME closer to the
client. But I'm happy with what I can achieve at the moment. You
could also experiment with bcache on the RBD.


Would it help to put in multiple P3700 per OSD Node to improve
performance for a single Thread (example Storage VMotion) ?

Most likely not, it's all the other parts of the puzzle which
are causing the latency. ESXi was designed for storage arrays
that service IO's in 100us-1ms range, Ceph is probably about 10x
slower than this, hence the problem. Disable the BBWC on a RAID
controller or SAN and you will see the same behaviour.


Regards


On 21.07.16 at 14:17, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

hmm i think 200 MByte/s is really bad. Is your Cluster in
production right now?

It's just been built, not running yet.


So if you start a storage migration you get only 200 MByte/s
right?

I wish. My current cluster (not this new one) would storage
migrate at
~10-15MB/s. Serial latency is the problem, without being able to
buffer, ESXi waits on an ack for each IO before sending the
next. Also it submits the migrations in 64kb chunks, unless
you get VAAI

working. I think esxi will try and do them in parallel, which
will help as well.

I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the
devices
to 100%. With buffered IO (or high queue depths) I can max out
the journals.


A FIO Test from Sebastien Han give us 400 MByte/s raw
performance from the P3700.

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your

-ssd-is-suitable-as-a-journal-device/

How could it be that the rbd client performance is 50% slower?

Regards


On 21.07.16 at 12:15, Nick Fisk wrote:

I've had a lot of pain with this, smaller block sizes are
even worse.
You want to try and minimize latency at every point as there
is no
buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update
happening with VMFS, particularly if you are using thin
VMDK's, this
can also be a major bottleneck. For my use case, I've
switched over to NFS as it has given much more performance
at scale and

less headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Jake Young
I think the answer is that with 1 thread you can only ever write to one
journal at a time. Theoretically, you would need 10 threads to be able to
write to 10 nodes at the same time.
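
That is easy to see by repeating the same bench with more in-flight IO (the same commands used elsewhere in this thread):

rados bench -p rbd 60 write -b 4M -t 1     # one outstanding write: bound by a single OSD/journal at a time
rados bench -p rbd 60 write -b 4M -t 16    # many outstanding writes: IO spreads across OSDs in parallel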

Jake

On Thursday, July 21, 2016, w...@globe.de  wrote:

> What I don't really understand is:
>
> Lets say the Intel P3700 works with 200 MByte/s rados bench one thread...
> See Nicks results below...
>
> If we have multiple OSD Nodes. For example 10 Nodes.
>
> Every Node has exactly 1x P3700 NVMe built in.
>
> Why is the single Thread performance exactly at 200 MByte/s on the rbd
> client with 10 OSD Node Cluster???
>
> I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.
>
>
> Everyone look yourself at your cluster.
>
> dstat -D sdb,sdc,sdd,sdX 
>
> You will see that Ceph stripes the data over all OSD's in the cluster if
> you test at the client side with rados bench...
>
> *rados bench -p rbd 60 write -b 4M -t 1*
>
>
>
> On 21.07.16 at 14:38, w...@globe.de wrote:
>
> Is there not a way to enable the Linux page cache? So do not use D_Sync...
>
> Then the performance would improve dramatically.
>
>
> On 21.07.16 at 14:33, Nick Fisk wrote:
>
> -Original Message-
> From: w...@globe.de [mailto:w...@globe.de]
> Sent: 21 July 2016 13:23
> To: n...@fisk.me.uk; 'Horace Ng'
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
> Okay and what is your plan now to speed up ?
>
> Now I have come up with a lower latency hardware design, there is not much
> further improvement until persistent RBD caching is implemented, as you
> will be moving the SSD/NVME closer to the client. But I'm happy with what I
> can achieve at the moment. You could also experiment with bcache on the
> RBD.
>
> Would it help to put in multiple P3700 per OSD Node to improve performance
> for a single Thread (example Storage VMotion) ?
>
> Most likely not, it's all the other parts of the puzzle which are causing
> the latency. ESXi was designed for storage arrays that service IO's in
> 100us-1ms range, Ceph is probably about 10x slower than this, hence the
> problem. Disable the BBWC on a RAID controller or SAN and you will see the same
> behaviour.
>
> Regards
>
>
> On 21.07.16 at 14:17, Nick Fisk wrote:
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
> Sent: 21 July 2016 13:04
> To: n...@fisk.me.uk; 'Horace Ng'
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
> Hi,
>
> hmm i think 200 MByte/s is really bad. Is your Cluster in production right
> now?
>
> It's just been built, not running yet.
>
> So if you start a storage migration you get only 200 MByte/s right?
>
> I wish. My current cluster (not this new one) would storage migrate at
> ~10-15MB/s. Serial latency is the problem, without being able to
> buffer, ESXi waits on an ack for each IO before sending the next. Also it
> submits the migrations in 64kb chunks, unless you get VAAI
>
> working. I think esxi will try and do them in parallel, which will help as
> well.
>
> I think it would be awesome if you get 1000 MByte/s
>
> Where is the Bottleneck?
>
> Latency serialisation, without a buffer, you can't drive the devices
> to 100%. With buffered IO (or high queue depths) I can max out the
> journals.
>
> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from the
> P3700.
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
> -ssd-is-suitable-as-a-journal-device/
>
> How could it be that the rbd client performance is 50% slower?
>
> Regards
>
>
> On 21.07.16 at 12:15, Nick Fisk wrote:
>
> I've had a lot of pain with this, smaller block sizes are even worse.
> You want to try and minimize latency at every point as there is no
> buffering happening in the iSCSI stack. This means:-
>
> 1. Fast journals (NVME or NVRAM)
> 2. 10GB or better networking
> 3. Fast CPU's (Ghz)
> 4. Fix CPU c-state's to C1
> 5. Fix CPU's Freq to max
>
> Also I can't be sure, but I think there is a metadata update
> happening with VMFS, particularly if you are using thin VMDK's, this
> can also be a major bottleneck. For my use case, I've switched over to NFS
> as it has given much more performance at scale and
>
> less headache.
>
> For the RADOS Run, here you go (400GB P3700):
>
> Total time run: 60.026491
> Total writes made:  3104
> Write size: 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
Yes, but not if you are using iSCSI and don't want data loss. If the data is in 
a cache somewhere and you lose power or crash, it's game over. That's why you want 
to cache to a non volatile device close to the source.

If you use something like FIO and use buffered IO, you will see that you will 
get really high numbers, unfortunately you can't do this with iSCSI though.

> -Original Message-
> From: w...@globe.de [mailto:w...@globe.de]
> Sent: 21 July 2016 13:39
> To: n...@fisk.me.uk; 'Horace Ng' 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Is there not a way to enable the Linux page cache? So do not use D_Sync...
> 
> Then the performance would improve dramatically.
> 
> 
> On 21.07.16 at 14:33, Nick Fisk wrote:
> >> -Original Message-
> >> From: w...@globe.de [mailto:w...@globe.de]
> >> Sent: 21 July 2016 13:23
> >> To: n...@fisk.me.uk; 'Horace Ng' 
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Okay and what is your plan now to speed up ?
> > Now I have come up with a lower latency hardware design, there is not much 
> > further improvement until persistent RBD caching is
> implemented, as you will be moving the SSD/NVME closer to the client. But I'm 
> happy with what I can achieve at the moment. You
> could also experiment with bcache on the RBD.
> >
> >> Would it help to put in multiple P3700 per OSD Node to improve performance 
> >> for a single Thread (example Storage VMotion) ?
> > Most likely not, it's all the other parts of the puzzle which are causing 
> > the latency. ESXi was designed for storage arrays that service
> IO's in 100us-1ms range, Ceph is probably about 10x slower than this, hence 
> the problem. Disable the BBWC on a RAID controller or
> SAN and you will see the same behaviour.
> >
> >> Regards
> >>
> >>
> >> On 21.07.16 at 14:17, Nick Fisk wrote:
> >>>> -Original Message-
> >>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >>>> Behalf Of w...@globe.de
> >>>> Sent: 21 July 2016 13:04
> >>>> To: n...@fisk.me.uk; 'Horace Ng' 
> >>>> Cc: ceph-users@lists.ceph.com
> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Hi,
> >>>>
> >>>> hmm i think 200 MByte/s is really bad. Is your Cluster in production 
> >>>> right now?
> >>> It's just been built, not running yet.
> >>>
> >>>> So if you start a storage migration you get only 200 MByte/s right?
> >>> I wish. My current cluster (not this new one) would storage migrate
> >>> at ~10-15MB/s. Serial latency is the problem, without being able to
> >>> buffer, ESXi waits on an ack for each IO before sending the next.
> >>> Also it submits the migrations in 64kb chunks, unless you get VAAI
> >> working. I think esxi will try and do them in parallel, which will help as 
> >> well.
> >>>> I think it would be awesome if you get 1000 MByte/s
> >>>>
> >>>> Where is the Bottleneck?
> >>> Latency serialisation, without a buffer, you can't drive the devices
> >>> to 100%. With buffered IO (or high queue depths) I can max out the 
> >>> journals.
> >>>
> >>>> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from 
> >>>> the P3700.
> >>>>
> >>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-yo
> >>>> ur -ssd-is-suitable-as-a-journal-device/
> >>>>
> >>>> How could it be that the rbd client performance is 50% slower?
> >>>>
> >>>> Regards
> >>>>
> >>>>
> >>>> On 21.07.16 at 12:15, Nick Fisk wrote:
> >>>>> I've had a lot of pain with this, smaller block sizes are even worse.
> >>>>> You want to try and minimize latency at every point as there is no
> >>>>> buffering happening in the iSCSI stack. This means:-
> >>>>>
> >>>>> 1. Fast journals (NVME or NVRAM)
> >>>>> 2. 10GB or better networking
> >>>>> 3. Fast CPU's (Ghz)
> >>>>> 4. Fix CPU c-state's to C1
> >>>>> 5. Fix CPU's Freq to

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

What I don't really understand is:

Lets say the Intel P3700 works with 200 MByte/s rados bench one 
thread... See Nicks results below...


If we have multiple OSD Nodes. For example 10 Nodes.

Every Node has exactly 1x P3700 NVMe built in.

Why is the single Thread performance exactly at 200 MByte/s on the rbd 
client with 10 OSD Node Cluster???


I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.


Everyone look yourself at your cluster.

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the cluster if 
you test at the client side with rados bench...


*rados bench -p rbd 60 write -b 4M -t 1*



On 21.07.16 at 14:38, w...@globe.de wrote:

Is there not a way to enable the Linux page cache? So do not use D_Sync...

Then the performance would improve dramatically.


On 21.07.16 at 14:33, Nick Fisk wrote:

-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng' 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Okay and what is your plan now to speed up ?
Now I have come up with a lower latency hardware design, there is not 
much further improvement until persistent RBD caching is implemented, 
as you will be moving the SSD/NVME closer to the client. But I'm 
happy with what I can achieve at the moment. You could also 
experiment with bcache on the RBD.


Would it help to put in multiple P3700 per OSD Node to improve 
performance for a single Thread (example Storage VMotion) ?
Most likely not, it's all the other parts of the puzzle which are 
causing the latency. ESXi was designed for storage arrays that 
service IO's in 100us-1ms range, Ceph is probably about 10x slower 
than this, hence the problem. Disable the BBWC on a RAID controller 
or SAN and you will see the same behaviour.



Regards


On 21.07.16 at 14:17, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng' 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

hmm i think 200 MByte/s is really bad. Is your Cluster in 
production right now?

It's just been built, not running yet.


So if you start a storage migration you get only 200 MByte/s right?

I wish. My current cluster (not this new one) would storage migrate at
~10-15MB/s. Serial latency is the problem, without being able to
buffer, ESXi waits on an ack for each IO before sending the next. 
Also it submits the migrations in 64kb chunks, unless you get VAAI
working. I think esxi will try and do them in parallel, which will 
help as well.

I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the devices
to 100%. With buffered IO (or high queue depths) I can max out the 
journals.


A FIO Test from Sebastien Han give us 400 MByte/s raw performance 
from the P3700.


https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
-ssd-is-suitable-as-a-journal-device/

How could it be that the rbd client performance is 50% slower?

Regards


On 21.07.16 at 12:15, Nick Fisk wrote:
I've had a lot of pain with this, smaller block sizes are even 
worse.

You want to try and minimize latency at every point as there is no
buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update
happening with VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched 
over to NFS as it has given much more performance at scale and

less headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
Behalf Of Horace
Sent: 21 July 2016 10:26
To: w...@globe.de
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Same here, I've read some blog saying that vmware will frequently
verify the locking on VMFS over iSCSI, hence it will have much 
slower performance than NFS (with different locking mechanism).


Regards,
Horace Ng

---

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

Is there not a way to enable the Linux page cache? So do not use D_Sync...

Then the performance would improve dramatically.


On 21.07.16 at 14:33, Nick Fisk wrote:

-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng' 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Okay and what is your plan now to speed up ?

Now I have come up with a lower latency hardware design, there is not much 
further improvement until persistent RBD caching is implemented, as you will be 
moving the SSD/NVME closer to the client. But I'm happy with what I can achieve 
at the moment. You could also experiment with bcache on the RBD.


Would it help to put in multiple P3700 per OSD Node to improve performance for 
a single Thread (example Storage VMotion) ?

Most likely not, it's all the other parts of the puzzle which are causing the 
latency. ESXi was designed for storage arrays that service IO's in 100us-1ms 
range, Ceph is probably about 10x slower than this, hence the problem. Disable 
the BBWC on a RAID controller or SAN and you will see the same behaviour.


Regards


On 21.07.16 at 14:17, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng' 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

hmm i think 200 MByte/s is really bad. Is your Cluster in production right now?

It's just been built, not running yet.


So if you start a storage migration you get only 200 MByte/s right?

I wish. My current cluster (not this new one) would storage migrate at
~10-15MB/s. Serial latency is the problem, without being able to
buffer, ESXi waits on an ack for each IO before sending the next. Also it 
submits the migrations in 64kb chunks, unless you get VAAI

working. I think esxi will try and do them in parallel, which will help as well.

I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the devices
to 100%. With buffered IO (or high queue depths) I can max out the journals.


A FIO test from Sebastien Han gives us 400 MByte/s raw performance from the 
P3700.

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
-ssd-is-suitable-as-a-journal-device/
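
For reference, the kind of single-job O_DSYNC write test that post describes looks roughly like this (the device name is an assumption, and the test overwrites data on that device):

fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test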

How could it be that the rbd client performance is 50% slower?

Regards


On 21.07.16 at 12:15, Nick Fisk wrote:

I've had a lot of pain with this, smaller block sizes are even worse.
You want to try and minimize latency at every point as there is no
buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update
happening with VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched over to NFS as 
it has given much more performance at scale and

less headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
Behalf Of Horace
Sent: 21 July 2016 10:26
To: w...@globe.de
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Same here, I've read some blog saying that vmware will frequently
verify the locking on VMFS over iSCSI, hence it will have much slower 
performance than NFS (with different locking mechanism).

Regards,
Horace Ng

- Original Message -
From: w...@globe.de
To: ceph-users@lists.ceph.com
Sent: Thursday, July 21, 2016 5:11:21 PM
Subject: [ceph-users] Ceph + VMware + Single Thread Performance

Hi everyone,

we see at our cluster relatively slow Single Thread Performance on the iscsi 
Nodes.


Our setup:

3 Racks:

18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).

2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x
WD Red 1TB per Data Node as OSD.

Replication = 3

chooseleaf = 3 type Rack in the crush map
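
For reference, one way to express "3 replicas, each in a different rack" with the ceph CLI is sketched below (rule and root names are assumptions; the resulting rule id then gets assigned to the pool):

ceph osd crush rule create-simple replicated-rack default rack firstn
ceph osd pool set rbd size 3
ceph osd pool set rbd crush_ruleset <rule-id>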


We get only ca. 90 MByte/s on the iscsi Gateway Servers with:

rados bench -p rbd 60 write -b 4M -t 1


If we test with:

rados bench -p rbd 60 write -b 4M -t 32

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
 

From: Jake Young [mailto:jak3...@gmail.com] 
Sent: 21 July 2016 13:24
To: n...@fisk.me.uk; w...@globe.de
Cc: Horace Ng ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

My workaround to your single threaded performance issue was to increase the 
thread count of the tgtd process (I added --nr_iothreads=128 as an argument to 
tgtd).  This does help my workload.  
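
For example (the sysconfig path below is an assumption and varies by distribution):

tgtd --nr_iothreads=128                      # start tgtd with a larger IO thread pool
# on RHEL/CentOS this can usually go into /etc/sysconfig/tgtd:
# TGTD_OPTIONS="--nr_iothreads=128"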

 

FWIW below are my rados bench numbers from my cluster with 1 thread:

 

This first one is a "cold" run. This is a test pool, and it's not in use.  This 
is the first time I've written to it in a week (but I have written to it 
before). 

 

Total time run: 60.049311

Total writes made:  1196

Write size: 4194304

Bandwidth (MB/sec): 79.668

 

Stddev Bandwidth:   80.3998

Max bandwidth (MB/sec): 208

Min bandwidth (MB/sec): 0

Average Latency:0.0502066

Stddev Latency: 0.47209

Max latency:12.9035

Min latency:0.013051

 

This next one is the 6th run. I honestly don't understand why there is such a 
huge performance difference. 

 

Total time run: 60.042933

Total writes made:  2980

Write size: 4194304

Bandwidth (MB/sec): 198.525

 

Stddev Bandwidth:   32.129

Max bandwidth (MB/sec): 224

Min bandwidth (MB/sec): 0

Average Latency:0.0201471

Stddev Latency: 0.0126896

Max latency:0.265931

Min latency:0.013211

 

 

75 OSDs, all 2TB SAS spinners.  There are 9 OSD servers, each with a 2GB BBU RAID cache.

 

I have tuned my CPU c-states and freq to max. I have 8x 2.5GHz cores, so just about one core per OSD. I have 40G networking.  I don't use journals, but I have the RAID cache enabled.

 

 

Nick,

 

What NFS server are you using?

 

The kernel one. It seems to be working really well so far after I got past the XFS fragmentation issues; I had to set an extent size hint of 16MB at the root.
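
For reference, that kind of extent size hint can be set with xfs_io; a minimal sketch, assuming the NFS export sits on an XFS filesystem mounted at /srv/nfs (path is a placeholder):

# Set a 16 MB extent size hint on the root of the export;
# new files created below it pick up the hint.
xfs_io -c "extsize 16m" /srv/nfs

# Verify (value is reported in bytes)
xfs_io -c "extsize" /srv/nfs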

 

 

Jake 

 


On Thursday, July 21, 2016, Nick Fisk <n...@fisk.me.uk> wrote:

I've had a lot of pain with this, smaller block sizes are even worse. You want 
to try and minimize latency at every point as there
is no buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update happening with 
VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched over to NFS as 
it has given much more performance at scale and less
headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com  ] 
> On Behalf Of Horace
> Sent: 21 July 2016 10:26
> To: w...@globe.de  
> Cc: ceph-users@lists.ceph.com  
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
> Hi,
>
> Same here, I've read some blog saying that vmware will frequently verify the 
> locking on VMFS over iSCSI, hence it will have much
> slower performance than NFS (with different locking mechanism).
>
> Regards,
> Horace Ng
>
> - Original Message -
> From: w...@globe.de  
> To: ceph-users@lists.ceph.com  
> Sent: Thursday, July 21, 2016 5:11:21 PM
> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
>
> Hi everyone,
>
> we see at our cluster relatively slow Single Thread Performance on the iscsi 
> Nodes.
>
>
> Our setup:
>
> 3 Racks:
>
> 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).
>
> 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
> Red 1TB per Data Node as OSD.
>
> Replication = 3
>
> chooseleaf = 3 type Rack in the crush map
>
>
> We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
>
> rados bench -p rbd 60 write -b 4M -t 1
>
>
> If we test with:
>
> rados bench -p rbd 60 write -b 4M -t 32
>
> we get ca. 600 - 700 MByte/s
>
>
> We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for
> the Journal to get better Single Thread Performance.
>
> Is anyone of you out there who has an Intel P3700 for Journal an can
&

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
> -Original Message-
> From: w...@globe.de [mailto:w...@globe.de]
> Sent: 21 July 2016 13:23
> To: n...@fisk.me.uk; 'Horace Ng' 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Okay and what is your plan now to speed up ?

Now that I have come up with a lower-latency hardware design, there is not much further improvement to be had until persistent RBD caching is implemented, as that will move the SSD/NVMe closer to the client. But I'm happy with what I can achieve at the moment. You could also experiment with bcache on the RBD.
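
A rough sketch of what that bcache experiment could look like, assuming a spare NVMe partition as the cache device and an already-mapped RBD as the backing device (device names and the cache-set UUID are placeholders):

# Format the backing (RBD) and cache (NVMe) devices for bcache
make-bcache -B /dev/rbd0
make-bcache -C /dev/nvme0n1p1

# Attach the cache set to the resulting bcache device
# (UUID comes from 'bcache-super-show /dev/nvme0n1p1')
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

# Writeback mode so small sync writes are acknowledged from the local NVMe
echo writeback > /sys/block/bcache0/bcache/cache_mode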

> 
> Would it help to put in multiple P3700 per OSD Node to improve performance 
> for a single Thread (example Storage VMotion) ?

Most likely not; it's all the other parts of the puzzle that are causing the latency. ESXi was designed for storage arrays that service IOs in the 100us-1ms range; Ceph is probably about 10x slower than this, hence the problem. Disable the BBWC on a RAID controller or SAN and you will see the same behaviour.

> 
> Regards
> 
> 
> Am 21.07.16 um 14:17 schrieb Nick Fisk:
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of w...@globe.de
> >> Sent: 21 July 2016 13:04
> >> To: n...@fisk.me.uk; 'Horace Ng' 
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi,
> >>
> >> hmm i think 200 MByte/s is really bad. Is your Cluster in production right 
> >> now?
> > It's just been built, not running yet.
> >
> >> So if you start a storage migration you get only 200 MByte/s right?
> > I wish. My current cluster (not this new one) would storage migrate at
> > ~10-15MB/s. Serial latency is the problem, without being able to
> > buffer, ESXi waits on an ack for each IO before sending the next. Also it 
> > submits the migrations in 64kb chunks, unless you get VAAI
> working. I think esxi will try and do them in parallel, which will help as 
> well.
> >
> >> I think it would be awesome if you get 1000 MByte/s
> >>
> >> Where is the Bottleneck?
> > Latency serialisation, without a buffer, you can't drive the devices
> > to 100%. With buffered IO (or high queue depths) I can max out the journals.
> >
> >> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from the 
> >> P3700.
> >>
> >> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
> >> -ssd-is-suitable-as-a-journal-device/
> >>
> >> How could it be that the rbd client performance is 50% slower?
> >>
> >> Regards
> >>
> >>
> >> Am 21.07.16 um 12:15 schrieb Nick Fisk:
> >>> I've had a lot of pain with this, smaller block sizes are even worse.
> >>> You want to try and minimize latency at every point as there is no
> >>> buffering happening in the iSCSI stack. This means:-
> >>>
> >>> 1. Fast journals (NVME or NVRAM)
> >>> 2. 10GB or better networking
> >>> 3. Fast CPU's (Ghz)
> >>> 4. Fix CPU c-state's to C1
> >>> 5. Fix CPU's Freq to max
> >>>
> >>> Also I can't be sure, but I think there is a metadata update
> >>> happening with VMFS, particularly if you are using thin VMDK's, this
> >>> can also be a major bottleneck. For my use case, I've switched over to 
> >>> NFS as it has given much more performance at scale and
> less headache.
> >>>
> >>> For the RADOS Run, here you go (400GB P3700):
> >>>
> >>> Total time run: 60.026491
> >>> Total writes made:  3104
> >>> Write size: 4194304
> >>> Object size:4194304
> >>> Bandwidth (MB/sec): 206.842
> >>> Stddev Bandwidth:   8.10412
> >>> Max bandwidth (MB/sec): 224
> >>> Min bandwidth (MB/sec): 180
> >>> Average IOPS:   51
> >>> Stddev IOPS:2
> >>> Max IOPS:   56
> >>> Min IOPS:   45
> >>> Average Latency(s): 0.0193366
> >>> Stddev Latency(s):  0.00148039
> >>> Max latency(s):     0.0377946
> >>> Min latency(s): 0.015909
> >>>
> >>> Nick
> >>>
> >>>> -Original Message-
> >>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> &g

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Jake Young
My workaround to your single threaded performance issue was to increase the
thread count of the tgtd process (I added --nr_iothreads=128 as an argument
to tgtd).  This does help my workload.

FWIW below are my rados bench numbers from my cluster with 1 thread:

This first one is a "cold" run. This is a test pool, and it's not in use.
This is the first time I've written to it in a week (but I have written to
it before).

Total time run: 60.049311
Total writes made:  1196
Write size: 4194304
Bandwidth (MB/sec): 79.668

Stddev Bandwidth:   80.3998
Max bandwidth (MB/sec): 208
Min bandwidth (MB/sec): 0
Average Latency:0.0502066
Stddev Latency: 0.47209
Max latency:12.9035
Min latency:0.013051

This next one is the 6th run. I honestly don't understand why there is such
a huge performance difference.

Total time run: 60.042933
Total writes made:  2980
Write size: 4194304
Bandwidth (MB/sec): 198.525

Stddev Bandwidth:   32.129
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 0
Average Latency:0.0201471
Stddev Latency: 0.0126896
Max latency:0.265931
Min latency:0.013211


75 OSDs, all 2TB SAS spinners.  There are 9 OSD servers, each with a 2GB BBU RAID cache.

I have tuned my CPU c-states and freq to max. I have 8x 2.5GHz cores, so just about one core per OSD. I have 40G networking.  I don't use journals, but I have the RAID cache enabled.


Nick,

What NFS server are you using?

Jake


On Thursday, July 21, 2016, Nick Fisk  wrote:

> I've had a lot of pain with this, smaller block sizes are even worse. You
> want to try and minimize latency at every point as there
> is no buffering happening in the iSCSI stack. This means:-
>
> 1. Fast journals (NVME or NVRAM)
> 2. 10GB or better networking
> 3. Fast CPU's (Ghz)
> 4. Fix CPU c-state's to C1
> 5. Fix CPU's Freq to max
>
> Also I can't be sure, but I think there is a metadata update happening
> with VMFS, particularly if you are using thin VMDK's, this
> can also be a major bottleneck. For my use case, I've switched over to NFS
> as it has given much more performance at scale and less
> headache.
>
> For the RADOS Run, here you go (400GB P3700):
>
> Total time run: 60.026491
> Total writes made:  3104
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 206.842
> Stddev Bandwidth:   8.10412
> Max bandwidth (MB/sec): 224
> Min bandwidth (MB/sec): 180
> Average IOPS:   51
> Stddev IOPS:2
> Max IOPS:   56
> Min IOPS:   45
> Average Latency(s): 0.0193366
> Stddev Latency(s):  0.00148039
> Max latency(s): 0.0377946
> Min latency(s): 0.015909
>
> Nick
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> ] On Behalf Of Horace
> > Sent: 21 July 2016 10:26
> > To: w...@globe.de 
> > Cc: ceph-users@lists.ceph.com 
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >
> > Hi,
> >
> > Same here, I've read some blog saying that vmware will frequently verify
> the locking on VMFS over iSCSI, hence it will have much
> > slower performance than NFS (with different locking mechanism).
> >
> > Regards,
> > Horace Ng
> >
> > - Original Message -
> > From: w...@globe.de 
> > To: ceph-users@lists.ceph.com 
> > Sent: Thursday, July 21, 2016 5:11:21 PM
> > Subject: [ceph-users] Ceph + VMware + Single Thread Performance
> >
> > Hi everyone,
> >
> > we see at our cluster relatively slow Single Thread Performance on the
> iscsi Nodes.
> >
> >
> > Our setup:
> >
> > 3 Racks:
> >
> > 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache
> off).
> >
> > 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
> > Red 1TB per Data Node as OSD.
> >
> > Replication = 3
> >
> > chooseleaf = 3 type Rack in the crush map
> >
> >
> > We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
> >
> > rados bench -p rbd 60 write -b 4M -t 1
> >
> >
> > If we test with:
> >
> > rados bench -p rbd 60 write -b 4M -t 32
> >
> > we get ca. 600 - 700 MByte/s
> >
> >
> > We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for
> > the Journal to get better Single Thread Performance.
> >
> > Is anyone of you out there who has an Intel P3700 for Journal an can
> > give me back test results with:
>

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

Okay and what is your plan now to speed up ?

Would it help to put in multiple P3700 per OSD Node to improve 
performance for a single Thread (example Storage VMotion) ?


Regards


Am 21.07.16 um 14:17 schrieb Nick Fisk:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng' 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

hmm i think 200 MByte/s is really bad. Is your Cluster in production right now?

It's just been built, not running yet.


So if you start a storage migration you get only 200 MByte/s right?

I wish. My current cluster (not this new one) would storage migrate at 
~10-15MB/s. Serial latency is the problem, without being able
to buffer, ESXi waits on an ack for each IO before sending the next. Also it 
submits the migrations in 64kb chunks, unless you get
VAAI working. I think esxi will try and do them in parallel, which will help as 
well.


I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the devices to 100%. 
With buffered IO (or high queue depths) I can max out
the journals.


A FIO Test from Sebastien Han give us 400 MByte/s raw performance from the 
P3700.

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

How could it be that the rbd client performance is 50% slower?

Regards


Am 21.07.16 um 12:15 schrieb Nick Fisk:

I've had a lot of pain with this, smaller block sizes are even worse.
You want to try and minimize latency at every point as there is no
buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update happening
with VMFS, particularly if you are using thin VMDK's, this can also be
a major bottleneck. For my use case, I've switched over to NFS as it has given 
much more performance at scale and less headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of Horace
Sent: 21 July 2016 10:26
To: w...@globe.de
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Same here, I've read some blog saying that vmware will frequently
verify the locking on VMFS over iSCSI, hence it will have much slower 
performance than NFS (with different locking mechanism).

Regards,
Horace Ng

- Original Message -
From: w...@globe.de
To: ceph-users@lists.ceph.com
Sent: Thursday, July 21, 2016 5:11:21 PM
Subject: [ceph-users] Ceph + VMware + Single Thread Performance

Hi everyone,

we see at our cluster relatively slow Single Thread Performance on the iscsi 
Nodes.


Our setup:

3 Racks:

18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).

2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
Red 1TB per Data Node as OSD.

Replication = 3

chooseleaf = 3 type Rack in the crush map


We get only ca. 90 MByte/s on the iscsi Gateway Servers with:

rados bench -p rbd 60 write -b 4M -t 1


If we test with:

rados bench -p rbd 60 write -b 4M -t 32

we get ca. 600 - 700 MByte/s


We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for
the Journal to get better Single Thread Performance.

Is anyone of you out there who has an Intel P3700 for Journal an can
give me back test results with:


rados bench -p rbd 60 write -b 4M -t 1


Thank you very much !!

Kind Regards !!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> w...@globe.de
> Sent: 21 July 2016 13:04
> To: n...@fisk.me.uk; 'Horace Ng' 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi,
> 
> hmm i think 200 MByte/s is really bad. Is your Cluster in production right 
> now?

It's just been built, not running yet.

> 
> So if you start a storage migration you get only 200 MByte/s right?

I wish. My current cluster (not this new one) would storage migrate at 
~10-15MB/s. Serial latency is the problem, without being able
to buffer, ESXi waits on an ack for each IO before sending the next. Also it 
submits the migrations in 64kb chunks, unless you get
VAAI working. I think esxi will try and do them in parallel, which will help as 
well.

> 
> I think it would be awesome if you get 1000 MByte/s
> 
> Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the devices to 100%. 
With buffered IO (or high queue depths) I can max out
the journals.

> 
> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from the 
> P3700.
> 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> 
> How could it be that the rbd client performance is 50% slower?
> 
> Regards
> 
> 
> Am 21.07.16 um 12:15 schrieb Nick Fisk:
> > I've had a lot of pain with this, smaller block sizes are even worse.
> > You want to try and minimize latency at every point as there is no
> > buffering happening in the iSCSI stack. This means:-
> >
> > 1. Fast journals (NVME or NVRAM)
> > 2. 10GB or better networking
> > 3. Fast CPU's (Ghz)
> > 4. Fix CPU c-state's to C1
> > 5. Fix CPU's Freq to max
> >
> > Also I can't be sure, but I think there is a metadata update happening
> > with VMFS, particularly if you are using thin VMDK's, this can also be
> > a major bottleneck. For my use case, I've switched over to NFS as it has 
> > given much more performance at scale and less headache.
> >
> > For the RADOS Run, here you go (400GB P3700):
> >
> > Total time run: 60.026491
> > Total writes made:  3104
> > Write size: 4194304
> > Object size:4194304
> > Bandwidth (MB/sec): 206.842
> > Stddev Bandwidth:   8.10412
> > Max bandwidth (MB/sec): 224
> > Min bandwidth (MB/sec): 180
> > Average IOPS:   51
> > Stddev IOPS:2
> > Max IOPS:   56
> > Min IOPS:   45
> > Average Latency(s): 0.0193366
> > Stddev Latency(s):  0.00148039
> > Max latency(s): 0.0377946
> > Min latency(s): 0.015909
> >
> > Nick
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Horace
> >> Sent: 21 July 2016 10:26
> >> To: w...@globe.de
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi,
> >>
> >> Same here, I've read some blog saying that vmware will frequently
> >> verify the locking on VMFS over iSCSI, hence it will have much slower 
> >> performance than NFS (with different locking mechanism).
> >>
> >> Regards,
> >> Horace Ng
> >>
> >> - Original Message -
> >> From: w...@globe.de
> >> To: ceph-users@lists.ceph.com
> >> Sent: Thursday, July 21, 2016 5:11:21 PM
> >> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi everyone,
> >>
> >> we see at our cluster relatively slow Single Thread Performance on the 
> >> iscsi Nodes.
> >>
> >>
> >> Our setup:
> >>
> >> 3 Racks:
> >>
> >> 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache 
> >> off).
> >>
> >> 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
> >> Red 1TB per Data Node as OSD.
> >>
> >> Replication = 3
> >>
> >> chooseleaf = 3 type Rack in the crush map
> >>
> >>
> >> We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
> >>
> >> rados bench -p rbd 60 write -b 4M -t 1
> >>
> >>
> >> If we test with:
> >>
> >> rados bench -p rbd 60 write -b 4M -t 32
> >>
> >> we get ca. 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

Hi,

hmm i think 200 MByte/s is really bad. Is your Cluster in production 
right now?


So if you start a storage migration you get only 200 MByte/s right?

I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

A FIO Test from Sebastien Han give us 400 MByte/s raw performance from 
the P3700.


https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
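
The journal test from that post is roughly the following fio invocation (a sketch; the device name is a placeholder and the test writes straight to the device, so only run it against a disk you can wipe):

fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test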

How could it be that the rbd client performance is 50% slower?

Regards


Am 21.07.16 um 12:15 schrieb Nick Fisk:

I've had a lot of pain with this, smaller block sizes are even worse. You want 
to try and minimize latency at every point as there
is no buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update happening with 
VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched over to NFS as 
it has given much more performance at scale and less
headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Horace
Sent: 21 July 2016 10:26
To: w...@globe.de
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Same here, I've read some blog saying that vmware will frequently verify the 
locking on VMFS over iSCSI, hence it will have much
slower performance than NFS (with different locking mechanism).

Regards,
Horace Ng

- Original Message -
From: w...@globe.de
To: ceph-users@lists.ceph.com
Sent: Thursday, July 21, 2016 5:11:21 PM
Subject: [ceph-users] Ceph + VMware + Single Thread Performance

Hi everyone,

we see at our cluster relatively slow Single Thread Performance on the iscsi 
Nodes.


Our setup:

3 Racks:

18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).

2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
Red 1TB per Data Node as OSD.

Replication = 3

chooseleaf = 3 type Rack in the crush map


We get only ca. 90 MByte/s on the iscsi Gateway Servers with:

rados bench -p rbd 60 write -b 4M -t 1


If we test with:

rados bench -p rbd 60 write -b 4M -t 32

we get ca. 600 - 700 MByte/s


We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for
the Journal to get better Single Thread Performance.

Is anyone of you out there who has an Intel P3700 for Journal an can
give me back test results with:


rados bench -p rbd 60 write -b 4M -t 1


Thank you very much !!

Kind Regards !!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
I've had a lot of pain with this, smaller block sizes are even worse. You want 
to try and minimize latency at every point as there
is no buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max
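
For points 4 and 5, one rough way to do it on an Intel box (kernel parameters plus the cpupower tool; exact package names and flags may vary by distro):

# Kernel command line: cap the deepest C-state at C1, then reboot
#   intel_idle.max_cstate=1 processor.max_cstate=1

# Keep cores at their maximum frequency
cpupower frequency-set -g performance

# Check the result
cpupower frequency-info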

Also I can't be sure, but I think there is a metadata update happening with 
VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched over to NFS as 
it has given much more performance at scale and less
headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Horace
> Sent: 21 July 2016 10:26
> To: w...@globe.de
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi,
> 
> Same here, I've read some blog saying that vmware will frequently verify the 
> locking on VMFS over iSCSI, hence it will have much
> slower performance than NFS (with different locking mechanism).
> 
> Regards,
> Horace Ng
> 
> - Original Message -
> From: w...@globe.de
> To: ceph-users@lists.ceph.com
> Sent: Thursday, July 21, 2016 5:11:21 PM
> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi everyone,
> 
> we see at our cluster relatively slow Single Thread Performance on the iscsi 
> Nodes.
> 
> 
> Our setup:
> 
> 3 Racks:
> 
> 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).
> 
> 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
> Red 1TB per Data Node as OSD.
> 
> Replication = 3
> 
> chooseleaf = 3 type Rack in the crush map
> 
> 
> We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
> 
> rados bench -p rbd 60 write -b 4M -t 1
> 
> 
> If we test with:
> 
> rados bench -p rbd 60 write -b 4M -t 32
> 
> we get ca. 600 - 700 MByte/s
> 
> 
> We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for
> the Journal to get better Single Thread Performance.
> 
> Is anyone of you out there who has an Intel P3700 for Journal an can
> give me back test results with:
> 
> 
> rados bench -p rbd 60 write -b 4M -t 1
> 
> 
> Thank you very much !!
> 
> Kind Regards !!
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Horace
Hi,

Same here, I've read some blog saying that vmware will frequently verify the 
locking on VMFS over iSCSI, hence it will have much slower performance than NFS 
(with different locking mechanism).

Regards,
Horace Ng

- Original Message -
From: w...@globe.de
To: ceph-users@lists.ceph.com
Sent: Thursday, July 21, 2016 5:11:21 PM
Subject: [ceph-users] Ceph + VMware + Single Thread Performance

Hi everyone,

we see at our cluster relatively slow Single Thread Performance on the 
iscsi Nodes.


Our setup:

3 Racks:

18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).

2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD 
Red 1TB per Data Node as OSD.

Replication = 3

chooseleaf = 3 type Rack in the crush map


We get only ca. 90 MByte/s on the iscsi Gateway Servers with:

rados bench -p rbd 60 write -b 4M -t 1


If we test with:

rados bench -p rbd 60 write -b 4M -t 32

we get ca. 600 - 700 MByte/s


We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for 
the Journal to get better Single Thread Performance.

Is anyone of you out there who has an Intel P3700 for Journal an can 
give me back test results with:


rados bench -p rbd 60 write -b 4M -t 1


Thank you very much !!

Kind Regards !!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

Hi everyone,

we see at our cluster relatively slow Single Thread Performance on the 
iscsi Nodes.



Our setup:

3 Racks:

18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).

2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD 
Red 1TB per Data Node as OSD.


Replication = 3

chooseleaf = 3 type Rack in the crush map


We get only ca. 90 MByte/s on the iscsi Gateway Servers with:

rados bench -p rbd 60 write -b 4M -t 1


If we test with:

rados bench -p rbd 60 write -b 4M -t 32

we get ca. 600 - 700 MByte/s


We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for 
the Journal to get better Single Thread Performance.


Is anyone of you out there who has an Intel P3700 for Journal and can
give me back test results with:



rados bench -p rbd 60 write -b 4M -t 1


Thank you very much !!

Kind Regards !!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-20 Thread Mike Christie
On 07/20/2016 11:52 AM, Jan Schermer wrote:
> 
>> On 20 Jul 2016, at 18:38, Mike Christie  wrote:
>>
>> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>>
>>> Hi Mike,
>>>
>>> Thanks for the update on the RHCS iSCSI target.
>>>
>>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>>> it too early to say / announce).
>>
>> No HA support for sure. We are looking into non HA support though.
>>
>>>
>>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>>
>>> So we're currently running :
>>>
>>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>>> has all VAAI primitives enabled and run the same configuration.
>>> - RBD images are mapped on each target using the kernel client (so no
>>> RBD cache).
>>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>>> but in a failover manner so that each ESXi always access the same LUN
>>> through one target at a time.
>>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>>> (except UNMAP as per default).
>>>
>>> Do you see anthing risky regarding this configuration ?
>>
>> If you use a application that uses scsi persistent reservations then you
>> could run into troubles, because some apps expect the reservation info
>> to be on the failover nodes as well as the active ones.
>>
>> Depending on the how you do failover and the issue that caused the
>> failover, IO could be stuck on the old active node and cause data
>> corruption. If the initial active node looses its network connectivity
>> and you failover, you have to make sure that the initial active node is
>> fenced off and IO stuck on that node will never be executed. So do
>> something like add it to the ceph monitor blacklist and make sure IO on
>> that node is flushed and failed before unblacklisting it.
>>
> 
> With iSCSI you can't really do hot failover unless you only use synchronous 
> IO.
> (With any of opensource target softwares available).

That is what we are working on adding.

Why did you only say iSCSI though?

> Flushing the buffers doesn't really help because you don't know what 
> in-flight IO happened before the outage

To be clear, when I wrote flush I did not mean cache buffers. I only meant the target's list of commands.

And, regarding the unblacklist comment: it is best to unmap images that are under a blacklist and then remap them. The osd blacklist remove command would leave some krbd structs in a bad state.

> and which didn't. You could end with only part of the "transaction" written 
> on persistent storage.
> 

Maybe I am not sure what you mean by hot failover.

If you are failing over for the case where one node just goes
unreachable, then if you blacklist it before making another node active
you know IO that had not been sent will be failed and never execute,
partially sent IO will be failed and not execute. IO that was sent to
the OSD and is executing will be completed by the OSD before new IO to the
same sectors, so you would not end up with what looks like partial
transactions if you later did a read.

If the OSDs die mid write you could end up with part of a command written, but that could happen with any SCSI based protocol.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-20 Thread Jake Young
On Wednesday, July 20, 2016, Jan Schermer  wrote:

>
> > On 20 Jul 2016, at 18:38, Mike Christie  > wrote:
> >
> > On 07/20/2016 03:50 AM, Frédéric Nass wrote:
> >>
> >> Hi Mike,
> >>
> >> Thanks for the update on the RHCS iSCSI target.
> >>
> >> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
> >> it too early to say / announce).
> >
> > No HA support for sure. We are looking into non HA support though.
> >
> >>
> >> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
> >> so we'll just have to remap RBDs to RHCS targets when it's available.
> >>
> >> So we're currently running :
> >>
> >> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
> >> has all VAAI primitives enabled and run the same configuration.
> >> - RBD images are mapped on each target using the kernel client (so no
> >> RBD cache).
> >> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
> >> but in a failover manner so that each ESXi always access the same LUN
> >> through one target at a time.
> >> - LUNs are VMFS datastores and VAAI primitives are enabled client side
> >> (except UNMAP as per default).
> >>
> >> Do you see anthing risky regarding this configuration ?
> >
> > If you use a application that uses scsi persistent reservations then you
> > could run into troubles, because some apps expect the reservation info
> > to be on the failover nodes as well as the active ones.
> >
> > Depending on the how you do failover and the issue that caused the
> > failover, IO could be stuck on the old active node and cause data
> > corruption. If the initial active node looses its network connectivity
> > and you failover, you have to make sure that the initial active node is
> > fenced off and IO stuck on that node will never be executed. So do
> > something like add it to the ceph monitor blacklist and make sure IO on
> > that node is flushed and failed before unblacklisting it.
> >
>
> With iSCSI you can't really do hot failover unless you only use
> synchronous IO.


VMware does only use synchronous IO. Since the hypervisor can't tell what
type of data the VMs are writing, all IO is treated as needing to be
synchronous.

(With any of opensource target softwares available).
> Flushing the buffers doesn't really help because you don't know what
> in-flight IO happened before the outage
> and which didn't. You could end with only part of the "transaction"
> written on persistent storage.
>
> If you only use synchronous IO all the way from client to the persistent
> storage shared between
> iSCSI target then all should be fine, otherwise YMMV - some people run it
> like that without realizing
> the dangers and have never had a problem, so it may be strictly
> theoretical, and it all depends on how often you need to do the
> failover and what data you are storing - corrupting a few images on a
> gallery site could be fine but corrupting
> a large database tablespace is no fun at all.


No, it's not. VMFS corruption is pretty bad too and there is no fsck for
VMFS...


>
> Some (non opensource) solutions exist, Solaris supposedly does this in
> some(?) way, maybe some iSCSI guru
> can chime tell us what magic they do, but I don't think it's possible
> without client support
> (you essentialy have to do something like transactions and replay the last
> transaction on failover). Maybe
> something can be enabled in protocol to do the iSCSI IO synchronous or
> make it at least wait for some sort of ACK from the
> server (which would require some sort of cache mirroring between the
> targets) without making it synchronous all the way.


This is why the SAN vendors wrote their own clients and drivers. It is not
possible to dynamically make all OS's do what your iSCSI target expects.

Something like VMware does the right thing pretty much all the time (there
are some iSCSI initiator bugs in earlier ESXi 5.x).  If you have control of
your ESXi hosts then attempting to set up HA iSCSI targets is possible.

If you have a mixed client environment with various versions of Windows
connecting to the target, you may be better off buying some SAN appliances.


> The one time I had to use it I resorted to simply mirroring in via mdraid
> on the client side over two targets sharing the same
> DAS, and this worked fine during testing but never went to production in
> the end.
>
> Jan
>
> >
> >>
> >> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
> >> clients ?
> >
> > I can't say, because I have not used stgt with rbd bs-type support
> enough.


For starters, STGT doesn't implement VAAI properly and you will need to
disable VAAI in ESXi.

LIO does seem to implement VAAI properly, but performance is not nearly as
good as STGT, even with VAAI's benefits. The assumed cause is that LIO currently uses kernel rbd mapping, and kernel rbd performance is not as good as librbd.

I recently did a simple test of creating an 80GB eager zeroed disk with
STGT (VAAI dis

Re: [ceph-users] ceph + vmware

2016-07-20 Thread Jan Schermer

> On 20 Jul 2016, at 18:38, Mike Christie  wrote:
> 
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>> 
>> Hi Mike,
>> 
>> Thanks for the update on the RHCS iSCSI target.
>> 
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
> 
> No HA support for sure. We are looking into non HA support though.
> 
>> 
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>> 
>> So we're currently running :
>> 
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>> but in a failover manner so that each ESXi always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>> 
>> Do you see anthing risky regarding this configuration ?
> 
> If you use a application that uses scsi persistent reservations then you
> could run into troubles, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
> 
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node looses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
> 

With iSCSI you can't really do hot failover unless you only use synchronous IO (with any of the open-source target softwares available). Flushing the buffers doesn't really help because you don't know what in-flight IO happened before the outage and which didn't. You could end up with only part of the "transaction" written on persistent storage.

If you only use synchronous IO all the way from the client to the persistent storage shared between the iSCSI targets, then all should be fine; otherwise YMMV. Some people run it like that without realizing the dangers and have never had a problem, so it may be strictly theoretical. It all depends on how often you need to do the failover and what data you are storing - corrupting a few images on a gallery site could be fine, but corrupting a large database tablespace is no fun at all.

Some (non open-source) solutions exist; Solaris supposedly does this in some(?) way, and maybe some iSCSI guru can chime in and tell us what magic they do, but I don't think it's possible without client support (you essentially have to do something like transactions and replay the last transaction on failover). Maybe something can be enabled in the protocol to make the iSCSI IO synchronous, or at least make it wait for some sort of ACK from the server (which would require some sort of cache mirroring between the targets), without making it synchronous all the way.

The one time I had to use it I resorted to simply mirroring via mdraid on the client side over two targets sharing the same DAS, and this worked fine during testing but never went to production in the end.
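
A sketch of what such a client-side mdraid mirror over two targets could look like; /dev/sdb and /dev/sdc stand in for the same LUN as seen through the two iSCSI targets, and the bitmap option is just a sensible addition rather than necessarily what was used:

# RAID1 across the two paths/targets
mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal /dev/sdb /dev/sdc

# Put the filesystem on /dev/md0; the internal bitmap keeps resyncs short after a path drops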

Jan

> 
>> 
>> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
>> clients ?
> 
> I can't say, because I have not used stgt with rbd bs-type support enough.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-20 Thread Mike Christie
On 07/20/2016 03:50 AM, Frédéric Nass wrote:
> 
> Hi Mike,
> 
> Thanks for the update on the RHCS iSCSI target.
> 
> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
> it too early to say / announce).

No HA support for sure. We are looking into non HA support though.

> 
> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
> so we'll just have to remap RBDs to RHCS targets when it's available.
> 
> So we're currently running :
> 
> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
> has all VAAI primitives enabled and run the same configuration.
> - RBD images are mapped on each target using the kernel client (so no
> RBD cache).
> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
> but in a failover manner so that each ESXi always access the same LUN
> through one target at a time.
> - LUNs are VMFS datastores and VAAI primitives are enabled client side
> (except UNMAP as per default).
> 
> Do you see anthing risky regarding this configuration ?

If you use an application that uses scsi persistent reservations then you
could run into troubles, because some apps expect the reservation info
to be on the failover nodes as well as the active ones.

Depending on how you do failover and the issue that caused the
failover, IO could be stuck on the old active node and cause data
corruption. If the initial active node loses its network connectivity
and you failover, you have to make sure that the initial active node is
fenced off and IO stuck on that node will never be executed. So do
something like add it to the ceph monitor blacklist and make sure IO on
that node is flushed and failed before unblacklisting it.
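
A sketch of that fencing flow using the ceph CLI of that era (the client address is a placeholder; take the real addr/nonce from 'ceph osd dump' or the client session):

# Fence the old gateway so its in-flight IO can never land
ceph osd blacklist add 192.168.1.10:0/0

# ... fail the target over, unmap/remap the images on the old node ...

# Once the node is known clean, remove the entry again
ceph osd blacklist ls
ceph osd blacklist rm 192.168.1.10:0/0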


> 
> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
> clients ?

I can't say, because I have not used stgt with rbd bs-type support enough.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-20 Thread Frédéric Nass


Hi Mike,

Thanks for the update on the RHCS iSCSI target.

Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is 
it too early to say / announce).


Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS 
so we'll just have to remap RBDs to RHCS targets when it's available.


So we're currently running :

- 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target 
has all VAAI primitives enabled and run the same configuration.
- RBD images are mapped on each target using the kernel client (so no 
RBD cache).
- 6 ESXi. Each ESXi can access the same LUNs through both targets, but in a failover manner so that each ESXi always accesses the same LUN through one target at a time.
- LUNs are VMFS datastores and VAAI primitives are enabled client side 
(except UNMAP as per default).


Do you see anything risky regarding this configuration?

Would you recommend LIO or STGT (with rbd bs-type) target for ESXi clients ?

Best regards,

Frederic.

--

Frédéric Nass

Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine

Tél : +33 3 72 74 11 35



Le 11/07/2016 17:45, Mike Christie a écrit :

On 07/08/2016 02:22 PM, Oliver Dzombic wrote:

Hi,

does anyone have experience how to connect vmware with ceph smart ?

iSCSI multipath does not really worked well.

Are you trying to export rbd images from multiple iscsi targets at the
same time or just one target?

For the HA/multiple target setup, I am working on this for Red Hat. We
plan to release it in RHEL 7.3/RHCS 2.1. SUSE ships something already as
someone mentioned.

We just got a large chunk of code in the upstream kernel (it is in the
block layer maintainer's tree for the next kernel) so it should be
simple to add COMPARE_AND_WRITE support now. We should be posting krbd
exclusive lock support in the next couple weeks.



NFS could be, but i think thats just too much layers in between to have
some useable performance.

Systems like ScaleIO have developed a vmware addon to talk with it.

Is there something similar out there for ceph ?

What are you using ?

Thank you !


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-16 Thread Jake Young
On Saturday, July 16, 2016, Oliver Dzombic  wrote:

> Hi Jake,
>
> thank you very much both was needed, MTU and VAAI deactivated ( i hope
> that wont interfere with vmotion or other features ).
>
> I changed now the MTU of vmkernel and vswitch. That solved this problem.


Try turning VAAI back on at some point.


>
> So i could make an ext4 filesystem and mount it.
>
> Running
>
> dd if=/dev/zero of=/mnt/8G_test bs=4k count=2M conv=fdatasync
>
> Something is strange to me:
>
> The network gets streight 1 Gbit ( maximum connection ) of iscsi bandwidth.
>
> But inside the vm i can only see 40-50MB/s.
>
> I mean replicationsize is 2. So it would be easy to say 1/2 of 1 Gbit =
> 500 Mbit = 40-50MB/s.
>
> But should this reduction not be inside of the ceph cluster ? Which is
> going with 10G network ?
>
> I mean the data are hitting with 1 Gbit the ceph iscsi server. So now
> this is transported to RBD internally by tgt.
> And there its multiplied by 2 ( over the  cluster network which is 10G )
> before the ACK is sended back to iscsi. So the cluster will internally
> duplicate it via 10G. So my expected bandwidth inside the vm should be
> higher than half of the maximum speed.
>
> Is this a wrong understanding of the mechanism ?


The delay is most likely just having to wait for 2 disks to actually do the
write.


>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de 
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> Am 16.07.2016 um 02:18 schrieb Jake Young:
> > I had some odd issues like that due to MTU mismatch.
> >
> > Keep in mind that the vSwitch and vmkernel port have independent MTU
> > settings.  Verify you can ping with large size packets without
> > fragmentation between your host and iscsi target.
> >
> > If that's not it, you can try to disable VAAI options to see if one of
> > them is causing issues. I haven't used ESXi 6.0 yet.
> >
> > Jake
> >
> >
> > On Friday, July 15, 2016, Oliver Dzombic  
> > > wrote:
> >
> > Hi,
> >
> > i am currently trying out the stuff.
> >
> > My tgt config:
> >
> > # cat tgtd.conf
> > # The default config file
> > include /etc/tgt/targets.conf
> >
> > # Config files from other packages etc.
> > include /etc/tgt/conf.d/*.conf
> >
> > nr_iothreads=128
> >
> >
> > -
> >
> > # cat iqn.2016-07.tgt.esxi-test.conf
> > 
> >   initiator-address ALL
> >   scsi_sn esxi-test
> >   #vendor_id CEPH
> >   #controller_tid 1
> >   write-cache on
> >   read-cache on
> >   driver iscsi
> >   bs-type rbd
> >   
> >   lun 1
> >   scsi_id cf1c4a71e700506357
> >   
> >   
> >
> >
> > --
> >
> >
> > If i create a vm inside esxi 6 and try to format the virtual hdd, i
> see
> > in logs:
> >
> > sd:2:0:0:0: [sda] CDB:
> > Write(10): 2a 00 0f 86 a8 80 00 01 40 00
> > mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=880068aa5e00)
> > mptscsih: ioc0: attempting task abort! ( sc=880068aa4a80)
> >
> > With the LSI HDD emulation. With the vmware paravirtualization
> > everything just freeze.
> >
> > Any idea with that issue ?
> >
> > --
> > Mit freundlichen Gruessen / Best regards
> >
> > Oliver Dzombic
> > IP-Interactive
> >
> > mailto:i...@ip-interactive.de 
> >
> > Anschrift:
> >
> > IP Interactive UG ( haftungsbeschraenkt )
> > Zum Sonnenberg 1-3
> > 63571 Gelnhausen
> >
> > HRB 93402 beim Amtsgericht Hanau
> > Geschäftsführung: Oliver Dzombic
> >
> > Steuer Nr.: 35 236 3622 1
> > UST ID: DE274086107
> >
> >
> > Am 11.07.2016 um 22:24 schrieb Jake Young:
> > > I'm using this setup with ESXi 5.1 and I get very good
> performance.  I
> > > suspect you have other issues.  Reliability is another story (see
> > Nick's
> > > posts on tgt and HA to get an idea of the awful problems you can
> > have),
> > > but for my test labs the risk is acceptable.
> > >
> > >
> > > One change I found helpful is to run tgtd with 128 threads.  I'm
> > running
> > > Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and changed
> the
> > > line that read:
> > >
> > > exec tgtd
> > >
> > > to
> > >
> > > exec tgtd --nr_iothreads=128
> > >
> > >
> > > If you're not concerned with reliability, you can enhance
> throughput
> > > even more by enabling rbd client write-back cache in your tgt VM's
> > > ceph.conf file (you'll need to restart tgtd for this to take
> effect):
> > >
> > > [client]
> > > rbd_cache = true
> > > rbd_cache_size = 67108864 # (64MB)
> > > rbd_cache_max_dirty = 50331648 # (48MB)
> > > rbd_ca

Re: [ceph-users] ceph + vmware

2016-07-16 Thread Oliver Dzombic
Hi Jake,

thank you very much, both were needed: MTU and VAAI deactivated (I hope that won't interfere with vMotion or other features).

I changed now the MTU of vmkernel and vswitch. That solved this problem.

So i could make an ext4 filesystem and mount it.

Running

dd if=/dev/zero of=/mnt/8G_test bs=4k count=2M conv=fdatasync

Something is strange to me:

The network gets a straight 1 Gbit (the maximum connection) of iscsi bandwidth.

But inside the vm i can only see 40-50MB/s.

I mean the replication size is 2. So it would be easy to say 1/2 of 1 Gbit =
500 Mbit = 40-50MB/s.

But should this reduction not be inside of the ceph cluster ? Which is
going with 10G network ?

I mean the data hits the ceph iscsi server at 1 Gbit. So now this is transported to RBD internally by tgt, and there it is multiplied by 2 (over the cluster network, which is 10G) before the ACK is sent back to iscsi. So the cluster will internally duplicate it via 10G, and my expected bandwidth inside the vm should be higher than half of the maximum speed.

Is this a wrong understanding of the mechanism ?


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 16.07.2016 um 02:18 schrieb Jake Young:
> I had some odd issues like that due to MTU mismatch. 
> 
> Keep in mind that the vSwitch and vmkernel port have independent MTU
> settings.  Verify you can ping with large size packets without
> fragmentation between your host and iscsi target. 
> 
> If that's not it, you can try to disable VAAI options to see if one of
> them is causing issues. I haven't used ESXi 6.0 yet. 
> 
> Jake
> 
> 
> On Friday, July 15, 2016, Oliver Dzombic  > wrote:
> 
> Hi,
> 
> i am currently trying out the stuff.
> 
> My tgt config:
> 
> # cat tgtd.conf
> # The default config file
> include /etc/tgt/targets.conf
> 
> # Config files from other packages etc.
> include /etc/tgt/conf.d/*.conf
> 
> nr_iothreads=128
> 
> 
> -
> 
> # cat iqn.2016-07.tgt.esxi-test.conf
> 
>   initiator-address ALL
>   scsi_sn esxi-test
>   #vendor_id CEPH
>   #controller_tid 1
>   write-cache on
>   read-cache on
>   driver iscsi
>   bs-type rbd
>   
>   lun 1
>   scsi_id cf1c4a71e700506357
>   
>   
> 
> 
> --
> 
> 
> If i create a vm inside esxi 6 and try to format the virtual hdd, i see
> in logs:
> 
> sd:2:0:0:0: [sda] CDB:
> Write(10): 2a 00 0f 86 a8 80 00 01 40 00
> mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=880068aa5e00)
> mptscsih: ioc0: attempting task abort! ( sc=880068aa4a80)
> 
> With the LSI HDD emulation. With the vmware paravirtualization
> everything just freeze.
> 
> Any idea with that issue ?
> 
> --
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 11.07.2016 um 22:24 schrieb Jake Young:
> > I'm using this setup with ESXi 5.1 and I get very good performance.  I
> > suspect you have other issues.  Reliability is another story (see
> Nick's
> > posts on tgt and HA to get an idea of the awful problems you can
> have),
> > but for my test labs the risk is acceptable.
> >
> >
> > One change I found helpful is to run tgtd with 128 threads.  I'm
> running
> > Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and changed the
> > line that read:
> >
> > exec tgtd
> >
> > to
> >
> > exec tgtd --nr_iothreads=128
> >
> >
> > If you're not concerned with reliability, you can enhance throughput
> > even more by enabling rbd client write-back cache in your tgt VM's
> > ceph.conf file (you'll need to restart tgtd for this to take effect):
> >
> > [client]
> > rbd_cache = true
> > rbd_cache_size = 67108864 # (64MB)
> > rbd_cache_max_dirty = 50331648 # (48MB)
> > rbd_cache_target_dirty = 33554432 # (32MB)
> > rbd_cache_max_dirty_age = 2
> > rbd_cache_writethrough_until_flush = false
> >
> >
> >
> >
> > Here's a sample targets.conf:
> >
> >   
> >   initiator-address ALL
> >   scsi_sn Charter
> >   #vendor_id CEPH
> >   #controller_tid 1
> >   write-cache on
> >   read-cache on
> >   driver iscsi
> >   bs-type rbd
> >   
> >   lun 5
> >   scsi_id cfe1000c4a71e700506357
> 

Re: [ceph-users] ceph + vmware

2016-07-15 Thread Jake Young
I had some odd issues like that due to MTU mismatch.

Keep in mind that the vSwitch and vmkernel port have independent MTU
settings.  Verify you can ping with large size packets without
fragmentation between your host and iscsi target.
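
A quick way to check that (sizes assume a 9000-byte MTU end to end, i.e. payload = MTU minus 28 bytes of IP/ICMP headers; addresses are placeholders):

# From the ESXi host: don't-fragment with a jumbo-sized payload
vmkping -d -s 8972 <iscsi-target-ip>

# From the Linux iSCSI gateway back towards the vmkernel port
ping -M do -s 8972 <esxi-vmkernel-ip>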

If that's not it, you can try to disable VAAI options to see if one of them
is causing issues. I haven't used ESXi 6.0 yet.
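
If you do try that, the VAAI primitives are just advanced host settings; a sketch using ESXi 5.x-era esxcli syntax (set the values back to 1 to re-enable):

esxcli system settings advanced set -o /DataMover/HardwareAcceleratedMove -i 0
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedInit -i 0
esxcli system settings advanced set -o /VMFS3/HardwareAcceleratedLocking -i 0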

Jake


On Friday, July 15, 2016, Oliver Dzombic  wrote:

> Hi,
>
> i am currently trying out the stuff.
>
> My tgt config:
>
> # cat tgtd.conf
> # The default config file
> include /etc/tgt/targets.conf
>
> # Config files from other packages etc.
> include /etc/tgt/conf.d/*.conf
>
> nr_iothreads=128
>
>
> -
>
> # cat iqn.2016-07.tgt.esxi-test.conf
> 
>   initiator-address ALL
>   scsi_sn esxi-test
>   #vendor_id CEPH
>   #controller_tid 1
>   write-cache on
>   read-cache on
>   driver iscsi
>   bs-type rbd
>   
>   lun 1
>   scsi_id cf1c4a71e700506357
>   
>   
>
>
> --
>
>
> If i create a vm inside esxi 6 and try to format the virtual hdd, i see
> in logs:
>
> sd:2:0:0:0: [sda] CDB:
> Write(10): 2a 00 0f 86 a8 80 00 01 40 00
> mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=880068aa5e00)
> mptscsih: ioc0: attempting task abort! ( sc=880068aa4a80)
>
> With the LSI HDD emulation. With the vmware paravirtualization
> everything just freeze.
>
> Any idea with that issue ?
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de 
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> Am 11.07.2016 um 22:24 schrieb Jake Young:
> > I'm using this setup with ESXi 5.1 and I get very good performance.  I
> > suspect you have other issues.  Reliability is another story (see Nick's
> > posts on tgt and HA to get an idea of the awful problems you can have),
> > but for my test labs the risk is acceptable.
> >
> >
> > One change I found helpful is to run tgtd with 128 threads.  I'm running
> > Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and changed the
> > line that read:
> >
> > exec tgtd
> >
> > to
> >
> > exec tgtd --nr_iothreads=128
> >
> >
> > If you're not concerned with reliability, you can enhance throughput
> > even more by enabling rbd client write-back cache in your tgt VM's
> > ceph.conf file (you'll need to restart tgtd for this to take effect):
> >
> > [client]
> > rbd_cache = true
> > rbd_cache_size = 67108864 # (64MB)
> > rbd_cache_max_dirty = 50331648 # (48MB)
> > rbd_cache_target_dirty = 33554432 # (32MB)
> > rbd_cache_max_dirty_age = 2
> > rbd_cache_writethrough_until_flush = false
> >
> >
> >
> >
> > Here's a sample targets.conf:
> >
> >   
> >   initiator-address ALL
> >   scsi_sn Charter
> >   #vendor_id CEPH
> >   #controller_tid 1
> >   write-cache on
> >   read-cache on
> >   driver iscsi
> >   bs-type rbd
> >   
> >   lun 5
> >   scsi_id cfe1000c4a71e700506357
> >   
> >   
> >   lun 6
> >   scsi_id cfe1000c4a71e700507157
> >   
> >   
> >   lun 7
> >   scsi_id cfe1000c4a71e70050da7a
> >   
> >   
> >   lun 8
> >   scsi_id cfe1000c4a71e70050bac0
> >   
> >   
> >
> >
> >
> > I don't have FIO numbers handy, but I have some oracle calibrate io
> > output.
> >
> > We're running Oracle RAC database servers in linux VMs on ESXi 5.1,
> > which use iSCSI to connect to the tgt service.  I only have a single
> > connection setup in ESXi for each LUN.  I tested using multipathing and
> > two tgt VMs presenting identical LUNs/RBD disks, but found that there
> > wasn't a significant performance gain by doing this, even with
> > round-robin path selecting in VMware.
> >
> >
> > These tests were run from two RAC VMs, each on a different host, with
> > both hosts connected to the same tgt instance.  The way we have oracle
> > configured, it would have been using two of the LUNs heavily during this
> > calibrate IO test.
> >
> >
> > This output is with 128 threads in tgtd and rbd client cache enabled:
> >
> > START_TIME   END_TIME   MAX_IOPS   MAX_MBPS
> MAX_PMBPS   LATENCY   DISKS
> >   -- --
> -- -- --
> > 28-JUN-016 15:10:50  28-JUN-016 15:20:04   14153658
> 412   14  75
> >
> >
> > This output is with the same configuration, but with rbd client cache
> > disabled:
> >
> > START_TIME END_TIMEMAX_IOPS   MAX_MBPS  MAX_PMBPS
> LATENCY   DISKS
> >   -- --
> -- -- --
> > 28-JUN-016 22:44:29  28-JUN-016 22:49:057449161219
>  20  75
> >
> > This output is from a directly connected EMC VNX5100 FC SAN with 25
> > disks using dual 8Gb FC links on a different lab system:
> >
> > START_TIME END_TIMEMAX_IOPS   

Re: [ceph-users] ceph + vmware

2016-07-15 Thread Oliver Dzombic
Hi,

I am currently trying this out.

My tgt config:

# cat tgtd.conf
# The default config file
include /etc/tgt/targets.conf

# Config files from other packages etc.
include /etc/tgt/conf.d/*.conf

nr_iothreads=128


-

# cat iqn.2016-07.tgt.esxi-test.conf

  initiator-address ALL
  scsi_sn esxi-test
  #vendor_id CEPH
  #controller_tid 1
  write-cache on
  read-cache on
  driver iscsi
  bs-type rbd
  
  lun 1
  scsi_id cf1c4a71e700506357
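
(For reference, targets.conf wraps these directives in XML-style <target ...>
and <backing-store ...> sections; a complete stanza for this target would look
roughly like the following -- the rbd image name rbd/esxi-test is only a
placeholder:

<target iqn.2016-07.tgt.esxi-test>
    initiator-address ALL
    scsi_sn esxi-test
    write-cache on
    read-cache on
    driver iscsi
    bs-type rbd
    <backing-store rbd/esxi-test>
        lun 1
        scsi_id cf1c4a71e700506357
    </backing-store>
</target>
)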
  
  


--


If I create a VM inside ESXi 6 and try to format the virtual HDD, I see
this in the logs:

sd:2:0:0:0: [sda] CDB:
Write(10): 2a 00 0f 86 a8 80 00 01 40 00
mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=880068aa5e00)
mptscsih: ioc0: attempting task abort! ( sc=880068aa4a80)

That is with the LSI HDD emulation; with the VMware paravirtual SCSI adapter
everything just freezes.

Any idea what is causing that issue?

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 11.07.2016 um 22:24 schrieb Jake Young:
> I'm using this setup with ESXi 5.1 and I get very good performance.  I
> suspect you have other issues.  Reliability is another story (see Nick's
> posts on tgt and HA to get an idea of the awful problems you can have),
> but for my test labs the risk is acceptable.
> 
> 
> One change I found helpful is to run tgtd with 128 threads.  I'm running
> Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and changed the
> line that read:
> 
> exec tgtd
> 
> to 
> 
> exec tgtd --nr_iothreads=128
> 
> 
> If you're not concerned with reliability, you can enhance throughput
> even more by enabling rbd client write-back cache in your tgt VM's
> ceph.conf file (you'll need to restart tgtd for this to take effect):
> 
> [client]
> rbd_cache = true
> rbd_cache_size = 67108864 # (64MB)
> rbd_cache_max_dirty = 50331648 # (48MB)
> rbd_cache_target_dirty = 33554432 # (32MB)
> rbd_cache_max_dirty_age = 2
> rbd_cache_writethrough_until_flush = false
> 
> 
> 
> 
> Here's a sample targets.conf:
> 
>   
>   initiator-address ALL
>   scsi_sn Charter
>   #vendor_id CEPH
>   #controller_tid 1
>   write-cache on
>   read-cache on
>   driver iscsi
>   bs-type rbd
>   
>   lun 5
>   scsi_id cfe1000c4a71e700506357
>   
>   
>   lun 6
>   scsi_id cfe1000c4a71e700507157
>   
>   
>   lun 7
>   scsi_id cfe1000c4a71e70050da7a
>   
>   
>   lun 8
>   scsi_id cfe1000c4a71e70050bac0
>   
>   
> 
> 
> 
> I don't have FIO numbers handy, but I have some oracle calibrate io
> output.  
> 
> We're running Oracle RAC database servers in linux VMs on ESXi 5.1,
> which use iSCSI to connect to the tgt service.  I only have a single
> connection setup in ESXi for each LUN.  I tested using multipathing and
> two tgt VMs presenting identical LUNs/RBD disks, but found that there
> wasn't a significant performance gain by doing this, even with
> round-robin path selecting in VMware.
> 
> 
> These tests were run from two RAC VMs, each on a different host, with
> both hosts connected to the same tgt instance.  The way we have oracle
> configured, it would have been using two of the LUNs heavily during this
> calibrate IO test.
> 
> 
> This output is with 128 threads in tgtd and rbd client cache enabled:
> 
> START_TIME   END_TIME   MAX_IOPS   MAX_MBPS  MAX_PMBPS   
> LATENCY   DISKS
>   -- -- -- 
> -- --
> 28-JUN-016 15:10:50  28-JUN-016 15:20:04   14153658412
>14  75
> 
> 
> This output is with the same configuration, but with rbd client cache
> disabled:
> 
> START_TIME END_TIMEMAX_IOPS   MAX_MBPS  MAX_PMBPS
> LATENCY   DISKS
>   -- -- -- 
> -- --
> 28-JUN-016 22:44:29  28-JUN-016 22:49:057449161219   
> 20  75
> 
> This output is from a directly connected EMC VNX5100 FC SAN with 25
> disks using dual 8Gb FC links on a different lab system:
> 
> START_TIME END_TIMEMAX_IOPS   MAX_MBPS  MAX_PMBPS
> LATENCY   DISKS
>   -- -- -- 
> -- --
> 28-JUN-016 22:11:25  28-JUN-016 22:18:486487299224   
> 19  75
> 
> 
> One of our goals for our Ceph cluster is to replace the EMC SANs.  We've
> accomplished this performance wise, the next step is to get a plausible
> iSCSI HA solution working.  I'm very interested in what Mike Christie is
> putting together.  I'm in the process of vetting the SUSE solution now.
> 
> BTW - The tests were run when we had 75 OSDs, which are all 7200RPM 2TB
> HDs, across 9 OSD hosts.  We have no SSD journals, instead we

Re: [ceph-users] ceph + vmware

2016-07-15 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Oliver Dzombic
> Sent: 15 July 2016 08:35
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] ceph + vmware
> 
> Hi Nick,
> 
> yeah i understand the point and message, i wont do it :-)
> 
> I just asked me recently how do i test if cache is enabled or not ?
> 
> What i found requires a client to be connected to an rbd device. But we dont 
> have that.

The cache is per client, so it is only relevant to clients that are currently 
connected to an RBD. If you want to confirm, you need to enable the admin 
socket for clients and then view the config from the admin socket.

For example:

http://ceph.com/planet/ceph-validate-that-the-rbd-cache-is-active/
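
A minimal sketch of that, with the socket path and client name only as
examples: add an admin socket to the [client] section of ceph.conf on the tgt
host,

[client]
admin socket = /var/run/ceph/$cluster-$name.$pid.asok

restart tgtd so librbd creates the socket, and then dump the running
configuration through it:

ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok config show | grep rbd_cache

The directory has to be writable by the process linking librbd (tgtd here),
otherwise no socket will appear.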


> 
> Is there any way to ask ceph server if cache is enabled or not ? Its disabled 
> by config. But by config the default size and min size of
> newly created pool are different from what ceph really does.
> 
> --
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 15.07.2016 um 09:32 schrieb Nick Fisk:
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Oliver Dzombic
> >> Sent: 12 July 2016 20:59
> >> To: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] ceph + vmware
> >>
> >> Hi Jack,
> >>
> >> thank you!
> >>
> >> What has reliability to do with rbd_cache = true ?
> >>
> >> I mean aside of the fact, that if a host powers down, the "flying" data 
> >> are lost.
> >
> > Not reliability, but consistency. As you have touched on the cache is in 
> > volatile memory and you have told tgt that your cache is non-
> volatile, now if you have a crash/power outageetc, then all the data in 
> the cache will be lost. This will likely leave your RBD full of
> holes or out of date data.
> >
> > If you plan to run HA then this is even more important as you could do a 
> > write on 1 iscsi target and read the data from another
> before the cache has flushed. Again corruption, especially if the initiator 
> is doing round robin over the paths.
> >
> > Also when you run HA the chance that TGT will failover to the other node 
> > because of some timeout you normally don't notice, this
> will also likely cause serious corruption.
> >
> >>
> >> Are there any special limitations / issues with rbd_cache = true and iscsi 
> >> tgt ?
> >
> > I just wouldn't do it.
> >
> > You can almost guarantee data corruption if you do. When librbd gets 
> > persistent cache to SSD, this will probably be safe and as long
> as you can present the cache device to both nodes (eg dual path SAS), HA 
> should be safe as well.
> >
> >>
> >> --
> >> Mit freundlichen Gruessen / Best regards
> >>
> >> Oliver Dzombic
> >> IP-Interactive
> >>
> >> mailto:i...@ip-interactive.de
> >>
> >> Anschrift:
> >>
> >> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> >> 63571 Gelnhausen
> >>
> >> HRB 93402 beim Amtsgericht Hanau
> >> Geschäftsführung: Oliver Dzombic
> >>
> >> Steuer Nr.: 35 236 3622 1
> >> UST ID: DE274086107
> >>
> >>
> >> Am 11.07.2016 um 22:24 schrieb Jake Young:
> >>> I'm using this setup with ESXi 5.1 and I get very good performance.
> >>> I suspect you have other issues.  Reliability is another story (see
> >>> Nick's posts on tgt and HA to get an idea of the awful problems you
> >>> can have), but for my test labs the risk is acceptable.
> >>>
> >>>
> >>> One change I found helpful is to run tgtd with 128 threads.  I'm
> >>> running Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and
> >>> changed the line that read:
> >>>
> >>> exec tgtd
> >>>
> >>> to
> >>>
> >>> exec tgtd --nr_iothreads=128
> >>>
> >>>
> >>> If you're not concerned with reliability, you can enhance throughput
> >>> even more by enabling rbd client write-back c

Re: [ceph-users] ceph + vmware

2016-07-15 Thread Oliver Dzombic
Hi Nick,

yeah, I understand the point and the message, I won't do it :-)

I just asked myself recently: how do I test whether the cache is enabled or not?

What I found requires a client to be connected to an RBD device. But we
don't have that.

Is there any way to ask the Ceph server whether the cache is enabled or not? It
is disabled in the config. But then again, the configured defaults for size and
min_size of newly created pools also differ from what Ceph actually does.

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 15.07.2016 um 09:32 schrieb Nick Fisk:
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> Oliver Dzombic
>> Sent: 12 July 2016 20:59
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] ceph + vmware
>>
>> Hi Jack,
>>
>> thank you!
>>
>> What has reliability to do with rbd_cache = true ?
>>
>> I mean aside of the fact, that if a host powers down, the "flying" data are 
>> lost.
> 
> Not reliability, but consistency. As you have touched on the cache is in 
> volatile memory and you have told tgt that your cache is non-volatile, now if 
> you have a crash/power outageetc, then all the data in the cache will be 
> lost. This will likely leave your RBD full of holes or out of date data.
> 
> If you plan to run HA then this is even more important as you could do a 
> write on 1 iscsi target and read the data from another before the cache has 
> flushed. Again corruption, especially if the initiator is doing round robin 
> over the paths.
> 
> Also when you run HA the chance that TGT will failover to the other node 
> because of some timeout you normally don't notice, this will also likely 
> cause serious corruption. 
> 
>>
>> Are there any special limitations / issues with rbd_cache = true and iscsi 
>> tgt ?
> 
> I just wouldn't do it. 
> 
> You can almost guarantee data corruption if you do. When librbd gets 
> persistent cache to SSD, this will probably be safe and as long as you can 
> present the cache device to both nodes (eg dual path SAS), HA should be safe 
> as well.
> 
>>
>> --
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:i...@ip-interactive.de
>>
>> Anschrift:
>>
>> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402 beim Amtsgericht Hanau
>> Geschäftsführung: Oliver Dzombic
>>
>> Steuer Nr.: 35 236 3622 1
>> UST ID: DE274086107
>>
>>
>> Am 11.07.2016 um 22:24 schrieb Jake Young:
>>> I'm using this setup with ESXi 5.1 and I get very good performance.  I
>>> suspect you have other issues.  Reliability is another story (see
>>> Nick's posts on tgt and HA to get an idea of the awful problems you
>>> can have), but for my test labs the risk is acceptable.
>>>
>>>
>>> One change I found helpful is to run tgtd with 128 threads.  I'm
>>> running Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and
>>> changed the line that read:
>>>
>>> exec tgtd
>>>
>>> to
>>>
>>> exec tgtd --nr_iothreads=128
>>>
>>>
>>> If you're not concerned with reliability, you can enhance throughput
>>> even more by enabling rbd client write-back cache in your tgt VM's
>>> ceph.conf file (you'll need to restart tgtd for this to take effect):
>>>
>>> [client]
>>> rbd_cache = true
>>> rbd_cache_size = 67108864 # (64MB)
>>> rbd_cache_max_dirty = 50331648 # (48MB) rbd_cache_target_dirty =
>>> 33554432 # (32MB) rbd_cache_max_dirty_age = 2
>>> rbd_cache_writethrough_until_flush = false
>>>
>>>
>>>
>>>
>>> Here's a sample targets.conf:
>>>
>>>   
>>>   initiator-address ALL
>>>   scsi_sn Charter
>>>   #vendor_id CEPH
>>>   #controller_tid 1
>>>   write-cache on
>>>   read-cache on
>>>   driver iscsi
>>>   bs-type rbd
>>>   
>>>   lun 5
>>>   scsi_id cfe1000c4a71e700506357
>>>   
>>>   
>>>   lun 6
>>>   scsi_id cfe1000c4a71e700507157
>>>   
>>>   
>>>   lun 7
>>>   scsi_id cfe1000c4a71e70050da7a
>>>   
>

Re: [ceph-users] ceph + vmware

2016-07-15 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Oliver Dzombic
> Sent: 12 July 2016 20:59
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] ceph + vmware
> 
> Hi Jack,
> 
> thank you!
> 
> What has reliability to do with rbd_cache = true ?
> 
> I mean aside of the fact, that if a host powers down, the "flying" data are 
> lost.

Not reliability, but consistency. As you have touched on, the cache is in 
volatile memory and you have told tgt that your cache is non-volatile; now if 
you have a crash/power outage etc., then all the data in the cache will be 
lost. This will likely leave your RBD full of holes or out-of-date data.

If you plan to run HA then this is even more important as you could do a write 
on 1 iscsi target and read the data from another before the cache has flushed. 
Again corruption, especially if the initiator is doing round robin over the 
paths.

Also, when you run HA there is the chance that TGT will fail over to the other node 
because of some timeout you normally wouldn't notice; this will also likely cause 
serious corruption. 

> 
> Are there any special limitations / issues with rbd_cache = true and iscsi 
> tgt ?

I just wouldn't do it. 

You can almost guarantee data corruption if you do. When librbd gets persistent 
cache on SSD, this will probably be safe, and as long as you can present the 
cache device to both nodes (e.g. dual-path SAS), HA should be safe as well.

> 
> --
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 11.07.2016 um 22:24 schrieb Jake Young:
> > I'm using this setup with ESXi 5.1 and I get very good performance.  I
> > suspect you have other issues.  Reliability is another story (see
> > Nick's posts on tgt and HA to get an idea of the awful problems you
> > can have), but for my test labs the risk is acceptable.
> >
> >
> > One change I found helpful is to run tgtd with 128 threads.  I'm
> > running Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and
> > changed the line that read:
> >
> > exec tgtd
> >
> > to
> >
> > exec tgtd --nr_iothreads=128
> >
> >
> > If you're not concerned with reliability, you can enhance throughput
> > even more by enabling rbd client write-back cache in your tgt VM's
> > ceph.conf file (you'll need to restart tgtd for this to take effect):
> >
> > [client]
> > rbd_cache = true
> > rbd_cache_size = 67108864 # (64MB)
> > rbd_cache_max_dirty = 50331648 # (48MB) rbd_cache_target_dirty =
> > 33554432 # (32MB) rbd_cache_max_dirty_age = 2
> > rbd_cache_writethrough_until_flush = false
> >
> >
> >
> >
> > Here's a sample targets.conf:
> >
> >   
> >   initiator-address ALL
> >   scsi_sn Charter
> >   #vendor_id CEPH
> >   #controller_tid 1
> >   write-cache on
> >   read-cache on
> >   driver iscsi
> >   bs-type rbd
> >   
> >   lun 5
> >   scsi_id cfe1000c4a71e700506357
> >   
> >   
> >   lun 6
> >   scsi_id cfe1000c4a71e700507157
> >   
> >   
> >   lun 7
> >   scsi_id cfe1000c4a71e70050da7a
> >   
> >   
> >   lun 8
> >   scsi_id cfe1000c4a71e70050bac0
> >   
> >   
> >
> >
> >
> > I don't have FIO numbers handy, but I have some oracle calibrate io
> > output.
> >
> > We're running Oracle RAC database servers in linux VMs on ESXi 5.1,
> > which use iSCSI to connect to the tgt service.  I only have a single
> > connection setup in ESXi for each LUN.  I tested using multipathing
> > and two tgt VMs presenting identical LUNs/RBD disks, but found that
> > there wasn't a significant performance gain by doing this, even with
> > round-robin path selecting in VMware.
> >
> >
> > These tests were run from two RAC VMs, each on a different host, with
> > both hosts connected to the same tgt instance.  The way we have oracle
> > configured, it would have been using two of the LUNs heavily during
> > this calibrate IO test.
> >
> >
> > This output is with 128 threads in tgtd and rbd client cache enabled:
> >
> > START_TIME   END_TIME   MAX_IOPS   MAX_MBPS  MAX_PMBPS  
> >  LATENCY   D

Re: [ceph-users] ceph + vmware

2016-07-12 Thread Oliver Dzombic
Hi Jack,

thank you!

What does reliability have to do with rbd_cache = true ?

I mean, aside from the fact that if a host powers down, the in-flight
("flying") data are lost.

Are there any special limitations / issues with rbd_cache = true and
iscsi tgt ?

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 11.07.2016 um 22:24 schrieb Jake Young:
> I'm using this setup with ESXi 5.1 and I get very good performance.  I
> suspect you have other issues.  Reliability is another story (see Nick's
> posts on tgt and HA to get an idea of the awful problems you can have),
> but for my test labs the risk is acceptable.
> 
> 
> One change I found helpful is to run tgtd with 128 threads.  I'm running
> Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and changed the
> line that read:
> 
> exec tgtd
> 
> to 
> 
> exec tgtd --nr_iothreads=128
> 
> 
> If you're not concerned with reliability, you can enhance throughput
> even more by enabling rbd client write-back cache in your tgt VM's
> ceph.conf file (you'll need to restart tgtd for this to take effect):
> 
> [client]
> rbd_cache = true
> rbd_cache_size = 67108864 # (64MB)
> rbd_cache_max_dirty = 50331648 # (48MB)
> rbd_cache_target_dirty = 33554432 # (32MB)
> rbd_cache_max_dirty_age = 2
> rbd_cache_writethrough_until_flush = false
> 
> 
> 
> 
> Here's a sample targets.conf:
> 
>   
>   initiator-address ALL
>   scsi_sn Charter
>   #vendor_id CEPH
>   #controller_tid 1
>   write-cache on
>   read-cache on
>   driver iscsi
>   bs-type rbd
>   
>   lun 5
>   scsi_id cfe1000c4a71e700506357
>   
>   
>   lun 6
>   scsi_id cfe1000c4a71e700507157
>   
>   
>   lun 7
>   scsi_id cfe1000c4a71e70050da7a
>   
>   
>   lun 8
>   scsi_id cfe1000c4a71e70050bac0
>   
>   
> 
> 
> 
> I don't have FIO numbers handy, but I have some oracle calibrate io
> output.  
> 
> We're running Oracle RAC database servers in linux VMs on ESXi 5.1,
> which use iSCSI to connect to the tgt service.  I only have a single
> connection setup in ESXi for each LUN.  I tested using multipathing and
> two tgt VMs presenting identical LUNs/RBD disks, but found that there
> wasn't a significant performance gain by doing this, even with
> round-robin path selecting in VMware.
> 
> 
> These tests were run from two RAC VMs, each on a different host, with
> both hosts connected to the same tgt instance.  The way we have oracle
> configured, it would have been using two of the LUNs heavily during this
> calibrate IO test.
> 
> 
> This output is with 128 threads in tgtd and rbd client cache enabled:
> 
> START_TIME   END_TIME   MAX_IOPS   MAX_MBPS  MAX_PMBPS   
> LATENCY   DISKS
>   -- -- -- 
> -- --
> 28-JUN-016 15:10:50  28-JUN-016 15:20:04   14153658412
>14  75
> 
> 
> This output is with the same configuration, but with rbd client cache
> disabled:
> 
> START_TIME END_TIMEMAX_IOPS   MAX_MBPS  MAX_PMBPS
> LATENCY   DISKS
>   -- -- -- 
> -- --
> 28-JUN-016 22:44:29  28-JUN-016 22:49:057449161219   
> 20  75
> 
> This output is from a directly connected EMC VNX5100 FC SAN with 25
> disks using dual 8Gb FC links on a different lab system:
> 
> START_TIME END_TIMEMAX_IOPS   MAX_MBPS  MAX_PMBPS
> LATENCY   DISKS
>   -- -- -- 
> -- --
> 28-JUN-016 22:11:25  28-JUN-016 22:18:486487299224   
> 19  75
> 
> 
> One of our goals for our Ceph cluster is to replace the EMC SANs.  We've
> accomplished this performance wise, the next step is to get a plausible
> iSCSI HA solution working.  I'm very interested in what Mike Christie is
> putting together.  I'm in the process of vetting the SUSE solution now.
> 
> BTW - The tests were run when we had 75 OSDs, which are all 7200RPM 2TB
> HDs, across 9 OSD hosts.  We have no SSD journals, instead we have all
> the disks setup as single disk RAID1 disk groups with WB cache with
> BBU.  All OSD hosts have 40Gb networking and the ESXi hosts have 10G.
> 
> Jake
> 
> 
> On Mon, Jul 11, 2016 at 12:06 PM, Oliver Dzombic  > wrote:
> 
> Hi Mike,
> 
> i was trying:
> 
> https://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/
> 
> ONE target, from different OSD servers directly, to multiple vmware esxi
> servers.
> 
> A config looked like:
> 
> #cat iqn.ceph-cluster_netzlaboranten-storage.conf
> 
> 
> driver iscsi
> bs-type rbd
> backing-store rbd/vmware-s

Re: [ceph-users] ceph + vmware

2016-07-11 Thread Alex Gorbachev
Hi Oliver,

On Friday, July 8, 2016, Oliver Dzombic  wrote:

> Hi,
>
> does anyone have experience how to connect vmware with ceph smart ?
>
> iSCSI multipath does not really worked well.
> NFS could be, but i think thats just too much layers in between to have
> some useable performance.
>
> Systems like ScaleIO have developed a vmware addon to talk with it.
>
> Is there something similar out there for ceph ?
>
> What are you using ?


We use RBD with SCST, Pacemaker and EnhanceIO (for read-only SSD caching).
The HA agents are open source; there are several options for those.
We are currently running 3 VMware clusters with 15 hosts in total, and things are
quite decent.

Regards,
Alex Gorbachev
Storcium


>
> Thank you !
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de 
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
--
Alex Gorbachev
Storcium
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-11 Thread Jake Young
I'm using this setup with ESXi 5.1 and I get very good performance.  I
suspect you have other issues.  Reliability is another story (see Nick's
posts on tgt and HA to get an idea of the awful problems you can have), but
for my test labs the risk is acceptable.


One change I found helpful is to run tgtd with 128 threads.  I'm running
Ubuntu 14.04, so I edited my /etc/init/tgt.conf file and changed the line
that read:

exec tgtd

to

exec tgtd --nr_iothreads=128


If you're not concerned with reliability, you can enhance throughput even
more by enabling rbd client write-back cache in your tgt VM's ceph.conf
file (you'll need to restart tgtd for this to take effect):

[client]
rbd_cache = true
rbd_cache_size = 67108864 # (64MB)
rbd_cache_max_dirty = 50331648 # (48MB)
rbd_cache_target_dirty = 33554432 # (32MB)
rbd_cache_max_dirty_age = 2
rbd_cache_writethrough_until_flush = false




Here's a sample targets.conf:

  
  initiator-address ALL
  scsi_sn Charter
  #vendor_id CEPH
  #controller_tid 1
  write-cache on
  read-cache on
  driver iscsi
  bs-type rbd
  
  lun 5
  scsi_id cfe1000c4a71e700506357
  
  
  lun 6
  scsi_id cfe1000c4a71e700507157
  
  
  lun 7
  scsi_id cfe1000c4a71e70050da7a
  
  
  lun 8
  scsi_id cfe1000c4a71e70050bac0
  
  



I don't have FIO numbers handy, but I have some oracle calibrate io output.


We're running Oracle RAC database servers in Linux VMs on ESXi 5.1, which
use iSCSI to connect to the tgt service.  I only have a single connection
set up in ESXi for each LUN.  I tested using multipathing and two tgt VMs
presenting identical LUNs/RBD disks, but found that there wasn't a
significant performance gain by doing this, even with round-robin path
selection in VMware.


These tests were run from two RAC VMs, each on a different host, with both
hosts connected to the same tgt instance.  The way we have oracle
configured, it would have been using two of the LUNs heavily during this
calibrate IO test.


This output is with 128 threads in tgtd and rbd client cache enabled:

START_TIME           END_TIME             MAX_IOPS  MAX_MBPS  MAX_PMBPS  LATENCY  DISKS
-------------------  -------------------  --------  --------  ---------  -------  -----
28-JUN-016 15:10:50  28-JUN-016 15:20:04     14153       658        412       14     75


This output is with the same configuration, but with rbd client cache
disabled:

START_TIME           END_TIME             MAX_IOPS  MAX_MBPS  MAX_PMBPS  LATENCY  DISKS
-------------------  -------------------  --------  --------  ---------  -------  -----
28-JUN-016 22:44:29  28-JUN-016 22:49:05      7449       161        219       20     75

This output is from a directly connected EMC VNX5100 FC SAN with 25 disks
using dual 8Gb FC links on a different lab system:

START_TIME           END_TIME             MAX_IOPS  MAX_MBPS  MAX_PMBPS  LATENCY  DISKS
-------------------  -------------------  --------  --------  ---------  -------  -----
28-JUN-016 22:11:25  28-JUN-016 22:18:48      6487       299        224       19     75


One of our goals for our Ceph cluster is to replace the EMC SANs.  We've
accomplished this performance-wise; the next step is to get a plausible
iSCSI HA solution working.  I'm very interested in what Mike Christie is
putting together.  I'm in the process of vetting the SUSE solution now.

BTW - The tests were run when we had 75 OSDs, which are all 7200RPM 2TB
HDs, across 9 OSD hosts.  We have no SSD journals, instead we have all the
disks setup as single disk RAID1 disk groups with WB cache with BBU.  All
OSD hosts have 40Gb networking and the ESXi hosts have 10G.

Jake


On Mon, Jul 11, 2016 at 12:06 PM, Oliver Dzombic 
wrote:

> Hi Mike,
>
> i was trying:
>
> https://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/
>
> ONE target, from different OSD servers directly, to multiple vmware esxi
> servers.
>
> A config looked like:
>
> #cat iqn.ceph-cluster_netzlaboranten-storage.conf
>
> 
> driver iscsi
> bs-type rbd
> backing-store rbd/vmware-storage
> initiator-address 10.0.0.9
> initiator-address 10.0.0.10
> incominguser vmwaren-storage RPb18P0xAqkAw4M1
> 
>
>
> We had 4 OSD servers. Everyone had this config running.
> We had 2 vmware servers ( esxi ).
>
> So we had 4 paths to this vmware-storage RBD object.
>
> VMware, in the very end, had 8 paths ( 4 path's directly connected to
> the specific vmware server ) + 4 paths this specific vmware servers saw
> via the other vmware server ).
>
> There were very big problems with performance. I am talking about < 10
> MB/s. So the customer was not able to use it, so good old nfs is serving.
>
> At that time we used ceph hammer, and i think esxi 5.5 the customer was
> using, or maybe esxi 6, was somewhere last year the testing.
>
> 
>
> We will make a new attempt now with ceph jewel and esxi 6 and this time
> we will manage the vmware servers.
>
> As soon as we fixed this
>
> "ceph mon Segmentation fault after set

Re: [ceph-users] ceph + vmware

2016-07-11 Thread Oliver Dzombic
Hi Mike,

I was trying:

https://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/

ONE target, exported from several OSD servers directly, to multiple VMware ESXi
servers.

A config looked like:

#cat iqn.ceph-cluster_netzlaboranten-storage.conf


driver iscsi
bs-type rbd
backing-store rbd/vmware-storage
initiator-address 10.0.0.9
initiator-address 10.0.0.10
incominguser vmwaren-storage RPb18P0xAqkAw4M1



We had 4 OSD servers. Each of them had this config running.
We had 2 VMware servers (ESXi).

So we had 4 paths to this vmware-storage RBD object.

VMware, in the end, saw 8 paths (4 paths directly connected to the
specific VMware server, plus 4 paths that this VMware server saw via the
other VMware server).

There were very big problems with performance. I am talking about < 10
MB/s. So the customer was not able to use it, and good old NFS is serving instead.

At that time we used Ceph Hammer, and I think the customer was using ESXi 5.5,
or maybe ESXi 6; the testing was somewhere around last year.



We will make a new attempt now with Ceph Jewel and ESXi 6, and this time
we will manage the VMware servers ourselves.

As soon as the

"ceph mon Segmentation fault after set crush_ruleset ceph 10.2.2"

issue that I already mailed to this list is fixed, we can start the testing.


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 11.07.2016 um 17:45 schrieb Mike Christie:
> On 07/08/2016 02:22 PM, Oliver Dzombic wrote:
>> Hi,
>>
>> does anyone have experience how to connect vmware with ceph smart ?
>>
>> iSCSI multipath does not really worked well.
> 
> Are you trying to export rbd images from multiple iscsi targets at the
> same time or just one target?
> 
> For the HA/multiple target setup, I am working on this for Red Hat. We
> plan to release it in RHEL 7.3/RHCS 2.1. SUSE ships something already as
> someone mentioned.
> 
> We just got a large chunk of code in the upstream kernel (it is in the
> block layer maintainer's tree for the next kernel) so it should be
> simple to add COMPARE_AND_WRITE support now. We should be posting krbd
> exclusive lock support in the next couple weeks.
> 
> 
>> NFS could be, but i think thats just too much layers in between to have
>> some useable performance.
>>
>> Systems like ScaleIO have developed a vmware addon to talk with it.
>>
>> Is there something similar out there for ceph ?
>>
>> What are you using ?
>>
>> Thank you !
>>
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-11 Thread Mike Christie
On 07/08/2016 02:22 PM, Oliver Dzombic wrote:
> Hi,
> 
> does anyone have experience how to connect vmware with ceph smart ?
> 
> iSCSI multipath does not really worked well.

Are you trying to export rbd images from multiple iscsi targets at the
same time or just one target?

For the HA/multiple target setup, I am working on this for Red Hat. We
plan to release it in RHEL 7.3/RHCS 2.1. SUSE ships something already as
someone mentioned.

We just got a large chunk of code in the upstream kernel (it is in the
block layer maintainer's tree for the next kernel) so it should be
simple to add COMPARE_AND_WRITE support now. We should be posting krbd
exclusive lock support in the next couple weeks.


> NFS could be, but i think thats just too much layers in between to have
> some useable performance.
> 
> Systems like ScaleIO have developed a vmware addon to talk with it.
> 
> Is there something similar out there for ceph ?
> 
> What are you using ?
> 
> Thank you !
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-09 Thread Nick Fisk
See my post from a few days ago. If you value your sanity and free time, use 
NFS. Otherwise SCST is probably your best bet at the moment, or maybe try out 
the SUSE implementation.
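
For the NFS route, a minimal sketch on a Linux gateway (image name, size and
export network are placeholders; on Jewel you may need --image-feature layering
so the kernel client can map the image):

rbd create rbd/vmware-nfs --size 1048576 --image-feature layering
rbd map rbd/vmware-nfs                     # appears as e.g. /dev/rbd0
mkfs.xfs /dev/rbd0
mkdir -p /export/vmware-nfs && mount /dev/rbd0 /export/vmware-nfs
echo '/export/vmware-nfs 10.0.0.0/24(rw,no_root_squash,sync)' >> /etc/exports
exportfs -ra

The export is then added in vSphere as an NFS datastore; for anything beyond a
lab you would still want Pacemaker or similar in front of it for failover, as
discussed elsewhere in this thread.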

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
> Schermer
> Sent: 08 July 2016 20:53
> To: Oliver Dzombic 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] ceph + vmware
> 
> There is no Ceph plugin for VMware (and I think you need at least an 
> Enterprise license for storage plugins, much $$$).
> The "VMware" way to do this without the plugin would be to have a VM running 
> on every host serving RBD devices over iSCSI to the
> other VMs (the way their storage applicances work, maybe you could even 
> re-use them somehow? I haven't used VMware in a while,
> so not sure if one can login to the appliance and customize it...).
> Nevertheless I think it's ugly, messy and is going to be even slower than 
> Ceph by itself.
> 
> But you can always just use RBD client (kernel/userspace) in the VMs 
> themselves, VMware has pretty fast networking so the
> overhead wouldn't be that large.
> 
> Jan
> 
> 
> > On 08 Jul 2016, at 21:22, Oliver Dzombic  wrote:
> >
> > Hi,
> >
> > does anyone have experience how to connect vmware with ceph smart ?
> >
> > iSCSI multipath does not really worked well.
> > NFS could be, but i think thats just too much layers in between to
> > have some useable performance.
> >
> > Systems like ScaleIO have developed a vmware addon to talk with it.
> >
> > Is there something similar out there for ceph ?
> >
> > What are you using ?
> >
> > Thank you !
> >
> > --
> > Mit freundlichen Gruessen / Best regards
> >
> > Oliver Dzombic
> > IP-Interactive
> >
> > mailto:i...@ip-interactive.de
> >
> > Anschrift:
> >
> > IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> > 63571 Gelnhausen
> >
> > HRB 93402 beim Amtsgericht Hanau
> > Geschäftsführung: Oliver Dzombic
> >
> > Steuer Nr.: 35 236 3622 1
> > UST ID: DE274086107
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-08 Thread Jan Schermer
There is no Ceph plugin for VMware (and I think you need at least an Enterprise 
license for storage plugins, much $$$).
The "VMware" way to do this without the plugin would be to have a VM running on 
every host serving RBD devices over iSCSI to the other VMs (the way their 
storage appliances work; maybe you could even re-use them somehow? I haven't 
used VMware in a while, so I am not sure whether one can log in to the appliance 
and customize it...).
Nevertheless I think it's ugly, messy and is going to be even slower than Ceph 
by itself.

But you can always just use the RBD client (kernel or userspace) in the VMs 
themselves; VMware has pretty fast networking, so the overhead wouldn't be that 
large.
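
A rough sketch of that in-guest variant (client name, pool and image are
placeholders): install ceph-common in the guest, give it /etc/ceph/ceph.conf
with the monitor addresses plus a keyring for a cephx user that can access the
pool, and then map the image directly:

# run where you have admin credentials, then copy the keyring into the guest
ceph auth get-or-create client.vmguest mon 'allow r' osd 'allow rwx pool=rbd' > /etc/ceph/ceph.client.vmguest.keyring

# inside the guest
rbd map --id vmguest rbd/vm-data           # shows up as /dev/rbdX, then mkfs/mount as usual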

Jan


> On 08 Jul 2016, at 21:22, Oliver Dzombic  wrote:
> 
> Hi,
> 
> does anyone have experience how to connect vmware with ceph smart ?
> 
> iSCSI multipath does not really worked well.
> NFS could be, but i think thats just too much layers in between to have
> some useable performance.
> 
> Systems like ScaleIO have developed a vmware addon to talk with it.
> 
> Is there something similar out there for ceph ?
> 
> What are you using ?
> 
> Thank you !
> 
> -- 
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph + vmware

2016-07-08 Thread Oliver Dzombic
Hi,

does anyone have experience with how to connect VMware with Ceph in a smart way?

iSCSI multipath did not really work well.
NFS could be an option, but I think that's just too many layers in between to get
some usable performance.

Systems like ScaleIO have developed a vmware addon to talk with it.

Is there something similar out there for ceph ?

What are you using ?

Thank you !

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com