Re: [ceph-users] ceph , VMWare , NFS-ganesha

2018-05-29 Thread Steven Vacaroaia
Thank you all

My goal is to have an SSD-based Ceph (NVMe + SSD) cluster, so I need to
consider performance as well as reliability
(although I do realize that a performant cluster that breaks my VMware is
not ideal ;-))

It appears that NFS is the safe way to do it, but will it be the bottleneck
from a performance perspective?

Has anyone done a comparison between iSCSI and NFS?

Would the network be a bottleneck?

Many thanks

Steven

On Tue, 29 May 2018 at 11:04, Dennis Benndorf <
dennis.bennd...@googlemail.com> wrote:

> Hi,
>
> we use PetaSAN for our VMWare-Cluster. It provides an webinterface for
> management and does clustered active-active ISCSI. For us the easy
> management was the point to choose this, so we need not to think about how
> to configure ISCSI...
> Regards,
> Dennis
>
> Am 28.05.2018 um 21:42 schrieb Steven Vacaroaia:
>
> Hi,
>
> I need to design and build a storage platform that will be "consumed"
> mainly by VMWare
>
> CEPH is my first choice
>
> As far as I can see, there are 3 ways CEPH storage can be made available
> to VMWare
>
> 1. iSCSI
> 2. NFS-Ganesha
> 3. mounted rbd on a Linux NFS server
>
> Any suggestions / advice as to which one is better (and why) as well as
> links to documentation/best practices will be truly appreciated
>
> Thanks
> Steven
>
>


Re: [ceph-users] ceph , VMWare , NFS-ganesha

2018-05-29 Thread Dennis Benndorf

Hi,

we use PetaSAN for our VMware cluster. It provides a web interface for 
management and does clustered active-active iSCSI. For us the easy 
management was the deciding factor, so we don't need to think about 
how to configure iSCSI...


Regards,
Dennis

Am 28.05.2018 um 21:42 schrieb Steven Vacaroaia:

Hi,

I need to design and build a storage platform that will be "consumed" 
mainly by VMWare


CEPH is my first choice

As far as I can see, there are 3 ways CEPH storage can be made 
available to VMWare


1. iSCSI
2. NFS-Ganesha
3. mounted rbd on a Linux NFS server

Any suggestions / advice as to which one is better (and why) as well 
as links to documentation/best practices will be truly appreciated


Thanks
Steven




Re: [ceph-users] ceph , VMWare , NFS-ganesha

2018-05-29 Thread Alex Gorbachev
On Mon, May 28, 2018 at 3:42 PM, Steven Vacaroaia  wrote:
> Hi,
>
> I need to design and build a storage platform that will be "consumed" mainly
> by VMWare
>
> CEPH is my first choice
>
> As far as I can see, there are 3 ways CEPH storage can be made available to
> VMWare
>
> 1. iSCSI
> 2. NFS-Ganesha
> 3. mounted rbd on a Linux NFS server
>
> Any suggestions / advice as to which one is better (and why) as well as
> links to documentation/best practices will be truly appreciated

We use NFS with Pacemaker quite successfully, re-exporting kRBD devices
formatted with XFS.  I tried rbd-nbd as well, but performance is not good when
running sync.
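A minimal sketch of that kind of stack, assuming hypothetical device, pool, and
export names (not Alex's actual configuration):

    # Map the image with the kernel RBD client and put XFS on top
    rbd map rbd/vmware-ds1            # returns e.g. /dev/rbd0
    mkfs.xfs /dev/rbd0
    mkdir -p /export/vmware-ds1
    mount /dev/rbd0 /export/vmware-ds1

    # /etc/exports entry for the ESXi subnet, exported synchronously
    # /export/vmware-ds1  192.168.10.0/24(rw,sync,no_root_squash,no_subtree_check)
    exportfs -ra

In an HA setup, Pacemaker would manage the rbd map, the mount, the NFS server
and a floating IP as one ordered group so the whole stack fails over together.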
--
Alex Gorbachev
Storcium

>
> Thanks
> Steven
>


Re: [ceph-users] ceph , VMWare , NFS-ganesha

2018-05-29 Thread Heðin Ejdesgaard Møller
We are using the iSCSI gateway in ceph-12.2 with vsphere-6.5 as the client.
It's an active/passive setup, per LUN.
We chose this solution because that's what we could get RH support for and it 
sticks to the "no SPOF" philosophy.

Performance is ~25-30% slower than krbd mounting the same rbd image directly. 
This is based on the following:
We spun up a FC27 VM within the vmware cluster, attached a vdisk from the vmware 
datastore, and ran various fio tests.
Then we mapped the same rbd image directly and ran the same tests (of course, we 
removed the iSCSI exposure first).
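A hedged example of the kind of fio run used for such a comparison (the exact
parameters Heðin used are not given in the thread; these are illustrative, and
writing to a raw device is destructive):

    # Random-write test, run once against the VMware-attached vdisk inside the
    # guest and once against the directly mapped /dev/rbdX
    fio --name=randwrite --filename=/dev/sdb --ioengine=libaio --direct=1 \
        --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --runtime=60 --time_based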

Regards
Heðin Ejdesgaard


On mán, 2018-05-28 at 15:47 -0500, Brady Deetz wrote:
> You might look into Open vStorage as a gateway into Ceph. 
> 
> On Mon, May 28, 2018, 2:42 PM Steven Vacaroaia  wrote:
> > Hi,
> > 
> > I need to design and build a storage platform that will be "consumed" 
> > mainly by VMWare 
> > 
> > CEPH is my first choice 
> > 
> > As far as I can see, there are 3 ways CEPH storage can be made available to 
> > VMWare 
> > 
> > 1. iSCSI
> > 2. NFS-Ganesha
> > 3. mounted rbd on a Linux NFS server
> > 
> > Any suggestions / advice as to which one is better (and why) as well as 
> > links to documentation/best practices will
> > be truly appreciated 
> > 
> > Thanks
> > Steven


[ceph-users] ceph , VMWare , NFS-ganesha

2018-05-28 Thread Steven Vacaroaia
Hi,

I need to design and build a storage platform that will be "consumed"
mainly by VMWare

CEPH is my first choice

As far as I can see, there are 3 ways CEPH storage can be made available to
VMWare

1. iSCSI
2. NFS-Ganesha
3. mounted rbd on a Linux NFS server

Any suggestions / advice as to which one is better (and why) as well as
links to documentation/best practices will be truly appreciated

Thanks
Steven


Re: [ceph-users] ceph , VMWare , NFS-ganesha

2018-05-28 Thread Brady Deetz
You might look into Open vStorage as a gateway into Ceph.

On Mon, May 28, 2018, 2:42 PM Steven Vacaroaia  wrote:

> Hi,
>
> I need to design and build a storage platform that will be "consumed"
> mainly by VMWare
>
> CEPH is my first choice
>
> As far as I can see, there are 3 ways CEPH storage can be made available
> to VMWare
>
> 1. iSCSI
> 2. NFS-Ganesha
> 3. mounted rbd on a Linux NFS server
>
> Any suggestions / advice as to which one is better (and why) as well as
> links to documentation/best practices will be truly appreciated
>
> Thanks
> Steven


Re: [ceph-users] Ceph + VMWare

2016-10-18 Thread Alex Gorbachev
On Tuesday, October 18, 2016, Frédéric Nass 
wrote:

> Hi Alex,
>
> Just to know, what kind of backstore are you using within Storcium? vdisk_fileio
> or vdisk_blockio?
>
> I see your agents can handle both : http://www.spinics.net/lists/
> ceph-users/msg27817.html
>
Hi Frédéric,

We use all of them, and NFS as well, which has been performing quite well.
vdisk_fileio is a bit dangerous in write-cache mode.  Also, for some
reason, an object size of 16 MB for RBD does better with VMWare.

Storcium gives you a choice for each LUN.  The challenge has been figuring
out optimal workloads under highly varied use cases.  I see better results
with NVMe journals and write-combining HBAs, e.g. Areca.
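For reference, a hedged example of creating an RBD image with a 16 MB object
size as mentioned above (pool and image names are made up; on Hammer-era
releases the object size is given as a power-of-two order):

    # --order 24 means 2^24-byte (16 MB) objects; newer releases also accept --object-size 16M
    # size is given in MB here (2 TB)
    rbd create vmware/lun01 --size 2097152 --order 24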

Regards,
Alex

> Regards,
>
> Frédéric.
>
> Le 06/10/2016 à 16:01, Alex Gorbachev a écrit :
>
> On Wed, Oct 5, 2016 at 2:32 PM, Patrick McGarry  
>  wrote:
>
> Hey guys,
>
> Starting to buckle down a bit in looking at how we can better set up
> Ceph for VMWare integration, but I need a little info/help from you
> folks.
>
> If you currently are using Ceph+VMWare, or are exploring the option,
> I'd like some simple info from you:
>
> 1) Company
> 2) Current deployment size
> 3) Expected deployment growth
> 4) Integration method (or desired method) ex: iscsi, native, etc
>
> Just casting the net so we know who is interested and might want to
> help us shape and/or test things in the future if we can make it
> better. Thanks.
>
>
> Hi Patrick,
>
> We have Storcium certified with VMWare, and we use it ourselves:
>
> Ceph Hammer latest
>
> SCST redundant Pacemaker based delivery front ends - our agents are
> published on github
>
> EnhanceIO for read caching at delivery layer
>
> NFS v3, and iSCSI and FC delivery
>
> Our deployment size we use ourselves is 700 TB raw.
>
> Challenges are as others described, but HA and multi host access works
> fine courtesy of SCST.  Write amplification is a challenge on spinning
> disks.
>
> Happy to share more.
>
> Alex
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph
>
>
>

-- 
Alex Gorbachev
Storcium


Re: [ceph-users] Ceph + VMWare

2016-10-18 Thread Frédéric Nass


Hi Alex,

Just to know, what kind of backstore are you using within Storcium? 
vdisk_fileio or vdisk_blockio?


I see your agents can handle both : 
http://www.spinics.net/lists/ceph-users/msg27817.html


Regards,

Frédéric.

Le 06/10/2016 à 16:01, Alex Gorbachev a écrit :

On Wed, Oct 5, 2016 at 2:32 PM, Patrick McGarry  wrote:

Hey guys,

Starting to buckle down a bit in looking at how we can better set up
Ceph for VMWare integration, but I need a little info/help from you
folks.

If you currently are using Ceph+VMWare, or are exploring the option,
I'd like some simple info from you:

1) Company
2) Current deployment size
3) Expected deployment growth
4) Integration method (or desired method) ex: iscsi, native, etc

Just casting the net so we know who is interested and might want to
help us shape and/or test things in the future if we can make it
better. Thanks.


Hi Patrick,

We have Storcium certified with VMWare, and we use it ourselves:

Ceph Hammer latest

SCST redundant Pacemaker based delivery front ends - our agents are
published on github

EnhanceIO for read caching at delivery layer

NFS v3, and iSCSI and FC delivery

Our deployment size we use ourselves is 700 TB raw.

Challenges are as others described, but HA and multi host access works
fine courtesy of SCST.  Write amplification is a challenge on spinning
disks.

Happy to share more.

Alex


--

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph



Re: [ceph-users] Ceph + VMWare

2016-10-11 Thread Frédéric Nass

Hi Patrick,

1) Université de Lorraine (7,000 researchers and staff members, 60,000 
students, 42 schools and education structures, 60 research labs).


2) RHCS cluster: 144 OSDs on 12 nodes for 520 TB raw capacity.
VMware clusters: 7 VMware clusters (40 ESXi hosts). The first need is 
to provide capacity-oriented storage (Ceph) to VMs running in a VMware vRA IaaS 
cluster (6 ESXi hosts).


3) Deployment growth?
RHCS cluster: the initial need was 750 TB of usable storage, and a 4x 
growth over the next 3 years is expected, to reach 1 PB of usable storage.
VMware clusters: we just started to offer an IaaS service to 
research laboratories and education structures within our university.
We can expect to host several hundred VMs in the next 2 years 
(~600-800).


4) Integration method? Clearly native.
I spent some of the last 6 months working on building an HA gateway 
cluster (iSCSI and NFS) to provide RHCS Ceph storage to our VMware IaaS 
cluster. Here are my findings:


* iSCSI ?

Gives better performance than NFS, we know that. BUT, we cannot go 
into production with iSCSI because of ESXi hosts entering a never-ending 
iSCSI 'Abort Task' loop when the Ceph cluster fails to acknowledge a 4 MB 
IO in less than 5s, resulting in VMs crashing. I've been told by a 
VMware engineer that this 5s limit cannot be raised as it's hardcoded in 
the ESXi iSCSI software initiator.
Why would an IO take more than 5s? In case of heavy load on 
the Ceph cluster, a Ceph failure scenario (network isolation, OSD 
crash), deep-scrubbing interfering with client IOs, or any combination of 
these or others I didn't think about...


What I have tested:
iSCSI active/active HA cluster. Each ESXi host sees the same datastore 
through both targets but only accesses it through one target at a time via a 
statically defined preferred path.
3 ESXi hosts work on one target, 3 ESXi hosts work on the other. If a target 
goes down, the other paths are used.


- LIO iSCSI targets with kernel RBD mapping (no cache). VAAI 
methods. Easy to configure. Delivers good performance with eager-zeroed 
virtual disks. The 'Abort Task' loop makes the ESXi hosts disconnect from the 
vCenter Server.
Restarting the target gets them back in, but some VMs certainly crashed.
- FreeBSD / FreeNAS running in KVM (on top of CentOS) mapping RBD 
images through librbd. Found that the fileio backstore was used. Found it hard 
to make HA with librbd cache. And still the 'Abort Task' loop...
- SCST ESOS targets with kernel RBD mapping (no cache). VAAI 
methods, ALUA. Easy to configure too. 'Abort Task' still happens, but the 
ESXi does not get disconnected from the vCenter Server. Still, targets 
have to be restarted to fix the situation.


* NFS ?

Gives lower performance than iSCSI, we know that too. BUT, it's 
probably the best option right now. It's very easy to make it HA with 
Pacemaker/Corosync as VMware doesn't make use of the NFS lock manager. 
Here is a good start: 
https://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/
We're still benchmarking IOPS to decide whether we can go into 
production with this infrastructure, but we're actually very satisfied 
with the HA mechanism.
Running synchronous writes on multiple VMs (on virtual disks hosted 
on NFS datastores with 'sync' exports of RBD images) while Storage 
vMotioning those disks between NFS RBD datastores and flapping the 
VIP (and thus the NFS exports) from one server to the other at the same time 
never kills any VM nor makes any datastore unavailable.
And every Storage vMotion task completes! These are excellent 
results. Note that it's important to run VMware Tools in the VMs, as the VMware 
Tools installation extends the write delay timeout on the guests' local SCSI devices.
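For illustration, a minimal sketch of that HA NFS arrangement, with hypothetical
resource names, network, and paths (a real cluster also needs fencing plus
ordering/colocation constraints for the RBD map and mount):

    # /etc/exports on both NFS heads: the RBD-backed XFS mount, exported in sync mode
    # /export/rbd-ds1  10.0.0.0/24(rw,sync,no_root_squash)

    # Floating IP the ESXi hosts mount the datastore from, managed by Pacemaker
    pcs resource create nfs_vip ocf:heartbeat:IPaddr2 ip=10.0.0.100 cidr_netmask=24 \
        op monitor interval=10s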


What I have tested:
- NFS exports in async mode sharing RBD images with XFS on top. 
Gives the best performance but, obviously, no one will want to 
use this mode in production.
- NFS exports in sync mode sharing RBD images with XFS on top. 
Gives mixed performance. We would clearly announce this type of 
storage as capacity-oriented and not performance-oriented through our IaaS service.
  As VMs cache writes, IOPS might be good enough for tier 2 or 3 
applications. We would probably be able to increase the number of IOPS 
by using more RBD images and NFS shares.
- NFS exports in sync mode sharing RBD images with ZFS (with 
compression) on top. The idea is to provide better performance by 
putting the SLOG (write journal) on fast SSD drives.
  See this real-life (love) story: 
https://virtualexistenz.wordpress.com/2013/02/01/using-zfs-storage-as-vmware-nfs-datastores-a-real-life-love-story/
  Each NFS server has 2 mirrored SSDs (RAID 1). Each NFS server 
exports partitions of this SSD volume through iSCSI.
  Each NFS server is a client of the local and the remote iSCSI target. 
The SLOG device is then made of a ZFS mirror of 2 disks: the local iSCSI 
device and the remote iSCSI device.

Re: [ceph-users] Ceph + VMWare

2016-10-07 Thread Jake Young
Hey Patrick,

I work for Cisco.

We have a 200TB cluster (108 OSDs on 12 OSD Nodes) and use the cluster for
both OpenStack and VMware deployments.

We are using iSCSI now, but it really would be much better if VMware did
support RBD natively.

We present a 1-2TB Volume that is shared between 4-8 ESXi hosts.

I have been looking for an optimal solution for a few years now, and I have
finally found something that works pretty well:

We are installing FreeNAS on a KVM hypervisor and passing through rbd
volumes as disks on a SCSI bus. We are able to add volumes dynamically (no
need to reboot FreeNAS to recognize new drives).  In FreeNAS, we are
passing the disks through directly as iSCSI targets; we are not putting the
disks into a ZFS volume.
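A hedged sketch of how an RBD image can be handed to such a KVM guest through
librbd (pool, image, and option values are made up; the thread does not show
Jake's exact QEMU/libvirt invocation):

    # Present an RBD image to the FreeNAS guest on a virtio-scsi bus via librbd
    # (the usual memory/CPU/network options for the VM are omitted)
    qemu-system-x86_64 \
        -device virtio-scsi-pci,id=scsi0 \
        -drive if=none,id=lun01,format=raw,cache=writeback,file=rbd:vmware/freenas-lun01 \
        -device scsi-hd,bus=scsi0.0,drive=lun01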

The biggest benefit to this is that VMware really likes the FreeBSD target
and all the VAAI stuff works reliably. We also get the benefit of the stability
of RBD in the QEMU client.

My next step is to create a redundant KVM host with a redundant FreeNAS VM
and see how iscsi multipath works with the ESXi hosts.

We have tried many different things and have run into all the same issues
as others have posted on this list. The general theme seems to be that most
(all?) Linux iSCSI target software and Linux NFS solutions are not very
good. The BSD OSes (FreeBSD, Solaris derivatives, etc.) do these things a
lot better, but typically lack Ceph support as well as having poor HW
compatibility (compared to Linux).

Our goal has always been to replace FC SAN with something comparable in
performance, reliability and redundancy.

Again, the best thing in the world would be for ESXi to mount rbd volumes
natively using librbd. I'm not sure if VMware is interested in this though.

Jake


On Wednesday, October 5, 2016, Patrick McGarry  wrote:

> Hey guys,
>
> Starting to buckle down a bit in looking at how we can better set up
> Ceph for VMWare integration, but I need a little info/help from you
> folks.
>
> If you currently are using Ceph+VMWare, or are exploring the option,
> I'd like some simple info from you:
>
> 1) Company
> 2) Current deployment size
> 3) Expected deployment growth
> 4) Integration method (or desired method) ex: iscsi, native, etc
>
> Just casting the net so we know who is interested and might want to
> help us shape and/or test things in the future if we can make it
> better. Thanks.
>
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph


Re: [ceph-users] Ceph + VMWare

2016-10-06 Thread Alex Gorbachev
On Wed, Oct 5, 2016 at 2:32 PM, Patrick McGarry  wrote:
> Hey guys,
>
> Starting to buckle down a bit in looking at how we can better set up
> Ceph for VMWare integration, but I need a little info/help from you
> folks.
>
> If you currently are using Ceph+VMWare, or are exploring the option,
> I'd like some simple info from you:
>
> 1) Company
> 2) Current deployment size
> 3) Expected deployment growth
> 4) Integration method (or desired method) ex: iscsi, native, etc
>
> Just casting the net so we know who is interested and might want to
> help us shape and/or test things in the future if we can make it
> better. Thanks.
>

Hi Patrick,

We have Storcium certified with VMWare, and we use it ourselves:

Ceph Hammer latest

SCST redundant Pacemaker based delivery front ends - our agents are
published on github

EnhanceIO for read caching at delivery layer

NFS v3, and iSCSI and FC delivery

Our deployment size we use ourselves is 700 TB raw.

Challenges are as others described, but HA and multi host access works
fine courtesy of SCST.  Write amplification is a challenge on spinning
disks.

Happy to share more.

Alex

>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com  ||  http://community.redhat.com
> @scuttlemonkey || @ceph


Re: [ceph-users] Ceph + VMWare

2016-10-06 Thread Oliver Dzombic
Hi,

maybe, in fact, a clean iSCSI implementation would be better, because it is
more usable in general.

So the MS Hyper-V people could use it too.



For me, when it comes to iSCSI (we tested the tgtd module so far), the
problem is mostly the reliability part when it comes to resilience in
case the Ceph cluster changes from OK to whatever else.

So the iSCSI implementation could use some work, so that even if PGs
are changing into the backfilling/degraded/... state, things will just
continue to work. That's currently not the case.

Even more evil: the tgtd module currently does not seem to support having
ONE iSCSI target mounted by MULTIPLE VMware ESXi nodes.

So in fact you can't use it as shared storage, because you very quickly get
read locks which are never released, preventing other nodes from
using the same LUN.

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 06.10.2016 um 08:13 schrieb Daniel Schwager:
> Hi all,
> 
> we are using Ceph (jewel 10.2.2, 10GBit Ceph frontend/backend, 3 nodes, each 
> 8 OSDs and 2 journal SSDs) 
> in our VMware environment, especially for test environments and templates - 
> but currently 
> not for productive machines (because of missing FC redundancy & performance).
> 
> On our Linux-based SCST 4GBit Fibre Channel proxy, 16 ceph-rbd devices 
> (non-caching, 10 TB in total) 
> form a striped LVM volume which is published as an FC target to our 
> VMware cluster. 
> Looks fine, works stably. But currently the proxy is not redundant (only one 
> head).
> Performance is ok (a), but not as good as our IBM Storwize 3700 SAN (16 
> HDDs).
> Especially for small IOs (4k), the IBM is twice as fast as Ceph. 
> 
> Native ceph integration to VMware would be great (-:
> 
> Best regards
> Daniel
> 
> (a) Atto Benchmark screenshots - IBM Storwize 3700 vs. Ceph
> https://dtnet.storage.dtnetcloud.com/d/684b330eea/
> 
> ---
> DT Netsolution GmbH   -   Taläckerstr. 30-D-70437 Stuttgart
> Geschäftsführer: Daniel Schwager, Stefan Hörz - HRB Stuttgart 19870
> Tel: +49-711-849910-32, Fax: -932 - Mailto:daniel.schwa...@dtnet.de
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> Patrick McGarry
>> Sent: Wednesday, October 05, 2016 8:33 PM
>> To: Ceph-User; Ceph Devel
>> Subject: [ceph-users] Ceph + VMWare
>>
>> Hey guys,
>>
>> Starting to buckle down a bit in looking at how we can better set up
>> Ceph for VMWare integration, but I need a little info/help from you
>> folks.
>>
>> If you currently are using Ceph+VMWare, or are exploring the option,
>> I'd like some simple info from you:
>>
>> 1) Company
>> 2) Current deployment size
>> 3) Expected deployment growth
>> 4) Integration method (or desired method) ex: iscsi, native, etc
>>
>> Just casting the net so we know who is interested and might want to
>> help us shape and/or test things in the future if we can make it
>> better. Thanks.
>>
>>
>>


Re: [ceph-users] Ceph + VMWare

2016-10-05 Thread Daniel Schwager
Hi all,

we are using Ceph (jewel 10.2.2, 10GBit Ceph frontend/backend, 3 nodes, each 8 
OSDs and 2 journal SSDs) 
in our VMware environment, especially for test environments and templates - but 
currently 
not for productive machines (because of missing FC redundancy & performance).

On our Linux-based SCST 4GBit Fibre Channel proxy, 16 ceph-rbd devices 
(non-caching, 10 TB in total) 
form a striped LVM volume which is published as an FC target to our 
VMware cluster. 
Looks fine, works stably. But currently the proxy is not redundant (only one 
head).
Performance is ok (a), but not as good as our IBM Storwize 3700 SAN (16 
HDDs).
Especially for small IOs (4k), the IBM is twice as fast as Ceph. 
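A hedged sketch of that striped-LVM-over-kRBD stacking (image names, device
numbers, and stripe parameters are illustrative, not Daniel's exact values):

    # Map the RBD images with the kernel client
    for i in $(seq 1 16); do rbd map rbd/scst-lun$i; done

    # Build one striped logical volume across all mapped devices
    pvcreate /dev/rbd{0..15}
    vgcreate vg_scst /dev/rbd{0..15}
    lvcreate -i 16 -I 4M -l 100%FREE -n lv_vmware vg_scst   # -i stripes, -I stripe size

    # /dev/vg_scst/lv_vmware is then published as an FC target through SCST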

Native ceph integration to VMware would be great (-:

Best regards
Daniel

(a) Atto Benchmark screenshots - IBM Storwize 3700 vs. Ceph
https://dtnet.storage.dtnetcloud.com/d/684b330eea/

---
DT Netsolution GmbH   -   Taläckerstr. 30-D-70437 Stuttgart
Geschäftsführer: Daniel Schwager, Stefan Hörz - HRB Stuttgart 19870
Tel: +49-711-849910-32, Fax: -932 - Mailto:daniel.schwa...@dtnet.de

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Patrick McGarry
> Sent: Wednesday, October 05, 2016 8:33 PM
> To: Ceph-User; Ceph Devel
> Subject: [ceph-users] Ceph + VMWare
> 
> Hey guys,
> 
> Starting to buckle down a bit in looking at how we can better set up
> Ceph for VMWare integration, but I need a little info/help from you
> folks.
> 
> If you currently are using Ceph+VMWare, or are exploring the option,
> I'd like some simple info from you:
> 
> 1) Company
> 2) Current deployment size
> 3) Expected deployment growth
> 4) Integration method (or desired method) ex: iscsi, native, etc
> 
> Just casting the net so we know who is interested and might want to
> help us shape and/or test things in the future if we can make it
> better. Thanks.
> 




Re: [ceph-users] Ceph + VMWare

2016-10-05 Thread Oliver Dzombic
Hi Patrick,

we are currently trying to get Ceph running with it for a customer
(meaning our stuff = CephFS, customer stuff = VMware, on ONE Ceph cluster).

Unluckily iSCSI sucks (one OSD fails = iSCSI lock -> need to restart the
iSCSI daemon on the Ceph servers).

NFS sucks (no natural HA).

So if you can get it to run with a VMware plugin (just like, for example,
ScaleIO), there are some people out there who might want to marry you :-)

--

To your questions:

1) See below

2) 10 TB for VMware

3) 10 TB each year; impossible to give clear numbers here, since there
is currently no clean way for VMware + Ceph. If it worked (reliably),
the numbers would explode for sure.

4) native = perfect, iSCSI = OK


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 05.10.2016 um 20:32 schrieb Patrick McGarry:
> Hey guys,
> 
> Starting to buckle down a bit in looking at how we can better set up
> Ceph for VMWare integration, but I need a little info/help from you
> folks.
> 
> If you currently are using Ceph+VMWare, or are exploring the option,
> I'd like some simple info from you:
> 
> 1) Company
> 2) Current deployment size
> 3) Expected deployment growth
> 4) Integration method (or desired method) ex: iscsi, native, etc
> 
> Just casting the net so we know who is interested and might want to
> help us shape and/or test things in the future if we can make it
> better. Thanks.
> 
> 


[ceph-users] Ceph + VMWare

2016-10-05 Thread Patrick McGarry
Hey guys,

Starting to buckle down a bit in looking at how we can better set up
Ceph for VMWare integration, but I need a little info/help from you
folks.

If you currently are using Ceph+VMWare, or are exploring the option,
I'd like some simple info from you:

1) Company
2) Current deployment size
3) Expected deployment growth
4) Integration method (or desired method) ex: iscsi, native, etc

Just casting the net so we know who is interested and might want to
help us shape and/or test things in the future if we can make it
better. Thanks.


-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-11 Thread Nick Fisk
> -Original Message-
> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> Sent: 11 September 2016 03:17
> To: Nick Fisk 
> Cc: Wilhelm Redbrake ; Horace Ng ; 
> ceph-users 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Confirming again much better performance with ESXi and NFS on RBD using the 
> XFS hint Nick uses, below.

Cool, I never experimented with different extent sizes, so I don't know if 
there is any performance/fragmentation benefit with larger/smaller values. I 
think storage vmotions might benefit from using striped RBDs with rbd-nbd, as 
this might get around the PG contention issue with 32 concurrent writes to the 
same PG. I want to test this out at some point.
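For reference, a hedged example of creating such a striped image (stripe
parameters are illustrative; whether they actually ease the PG contention is
exactly what Nick says is untested):

    # Spread each stripe across 16 objects in 256 KB (262144-byte) units;
    # non-default striping is a librbd feature, hence rbd-nbd rather than krbd.
    # Size is in MB (2 TB).
    rbd create vmware/striped-lun01 --size 2097152 --stripe-unit 262144 --stripe-count 16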

> 
> I saw high load averages on the NFS server nodes, corresponding to iowait, 
> does not seem to cause too much trouble so far.

Yeah I get this as well, but I think this is just a side effect of having a 
storage backend that can support a high queue depth. Every IO in flight will 
increase the load by 1. However, despite what it looks like in top, it doesn't 
actually consume any CPU, so it shouldn't cause any problems.

> 
> Here are HDtune Pro testing results from some recent runs.  The puzzling part 
> is better random IO performance with a 16 MB object size
> on both iSCSI and NFS.  In my thinking this should be slower; however, this 
> has been confirmed by the timed vmotion tests and more
> random IO tests by my coworker as well:
> 
> Test_type   read MB/s  write MB/s  read iops  write iops  read multi iops  write multi iops
> NFS 1mb     460        103         8753       66          47466            1616
> NFS 4mb     441        147         8863       82          47556            764
> iSCSI 1mb   117        76          326        90          672              938
> iSCSI 4mb   275        60          205        24          2015             1212
> NFS 16mb    455        177         7761       119         36403            3175
> iSCSI 16mb  300        65          1117       237         12389            1826
> 
> ( prettier view at
> http://storcium.blogspot.com/2016/09/latest-tests-on-nfs-vs.html )

Interesting. Are you pre-conditioning the RBDs before these tests? The only 
logical thing I can think of is that if you are writing to a new area of the 
RBD, it will be having to create the objects as it goes; larger objects would 
therefore need fewer object creates per MB.
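A hedged example of the kind of pre-conditioning being asked about, i.e. one
sequential pass so every backing object already exists before the random-IO
runs (device name is hypothetical, and the write is destructive):

    # Fill the whole device once so all RADOS objects are allocated up front
    fio --name=precondition --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
        --rw=write --bs=4M --iodepth=16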

> 
> Alex
> 
> >
> > From: Alex Gorbachev [mailto:a...@iss-integration.com]
> > Sent: 04 September 2016 04:45
> > To: Nick Fisk 
> > Cc: Wilhelm Redbrake ; Horace Ng ;
> > ceph-users 
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >
> >
> >
> >
> >
> > On Saturday, September 3, 2016, Alex Gorbachev  
> > wrote:
> >
> > HI Nick,
> >
> > On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk  wrote:
> >
> > From: Alex Gorbachev [mailto:a...@iss-integration.com]
> > Sent: 21 August 2016 15:27
> > To: Wilhelm Redbrake 
> > Cc: n...@fisk.me.uk; Horace Ng ; ceph-users
> > 
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >
> >
> >
> >
> >
> > On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:
> >
> > Hi Nick,
> > i understand all of your technical improvements.
> > But: why do you Not use a simple for example Areca Raid Controller with 8 
> > gb Cache and Bbu ontop in every ceph node.
> > Configure n Times RAID 0 on the Controller and enable Write back Cache.
> > That must be a latency "Killer" like in all the prop. Storage arrays or Not 
> > ??
> >
> > Best Regards !!
> >
> >
> >
> > What we saw specifically with Areca cards is that performance is excellent 
> > in benchmarking and for bursty loads. However, once we
> started loading with more constant workloads (we replicate databases and 
> files to our Ceph cluster), this looks to have saturated the
> relatively small Areca NVDIMM caches and we went back to pure drive based 
> performance.
> >
> >
> >
> > Yes, I think that is a valid point. Although low latency, you are still 
> > having to write to the disks twice (journal+data), so once the
> cache’s on the cards start filling up, you are going to hit problems.
> >
> >
> >
> >
> >
> > So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
> > HDDs) in hopes that it would help reduce the noisy
> neighbor impact. That worked, but now the overall latency is really high at 
> times, not always. Red Hat engineer suggested this is due to
> loading the 7200 rpm NL-SAS drives with too many IOPS, which get their 
> latency sky high. Overall we are functioning fine, but I sure
> would like storage vmotion and other large operations faster.
> >
> >
> >
> >
> >
> Yeah this is the biggest pain point I think.

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-11 Thread Alex Gorbachev
--
Alex Gorbachev
Storcium

On Sun, Sep 11, 2016 at 12:54 PM, Nick Fisk  wrote:

>
>
>
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 11 September 2016 16:14
>
> *To:* Nick Fisk 
> *Cc:* Wilhelm Redbrake ; Horace Ng ;
> ceph-users 
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sun, Sep 4, 2016 at 4:48 PM, Nick Fisk  wrote:
>
>
>
>
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 04 September 2016 04:45
> *To:* Nick Fisk 
> *Cc:* Wilhelm Redbrake ; Horace Ng ;
> ceph-users 
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
>
> On Saturday, September 3, 2016, Alex Gorbachev 
> wrote:
>
> HI Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk  wrote:
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 21 August 2016 15:27
> *To:* Wilhelm Redbrake 
> *Cc:* n...@fisk.me.uk; Horace Ng ; ceph-users <
> ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:
>
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8
> gb Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or
> Not ??
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent
> in benchmarking and for bursty loads. However, once we started loading with
> more constant workloads (we replicate databases and files to our Ceph
> cluster), this looks to have saturated the relatively small Areca NVDIMM
> caches and we went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still
> having to write to the disks twice (journal+data), so once the cache’s on
> the cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
> worked, but now the overall latency is really high at times, not always.
> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
> drives with too many IOPS, which get their latency sky high. Overall we are
> functioning fine, but I sure would like storage vmotion and other large
> operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
> if you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
> then performance is actually quite good, as the block sizes used for the
> copy are a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I
> found that using iscsi you have no control over the fragmentation of the
> vmdk’s and so the read performance is then what suffers (certainly with
> 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with
> the updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead
> settings to see if we can improve this by parallelizing reads. Also will
> test NFS, but need to determine whether to do krbd/knfsd or something more
> interesting like CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
> lot less sensitive to making config adjustments without suddenly everything
> dropping offline. The fact that you can specify the extent size on XFS
> helps massively with using thin vmdks/snapshots to avoid fragmentation.
> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
> contention when esxi tries to write 32 copy threads to the same object.
> There is probably some tuning that could be done here (RBD striping???) but
> this is the best it’s been for a long time and I’m reluctant to fiddle any
> further.
>
>
>
> We have moved ahead and added NFS support to Storcium, and are now able to run
> NFS servers with Pacemaker in HA mode (all agents are public at
> https://github.com/akurz/resource-agents/tree/master/heartbeat

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-11 Thread Nick Fisk
 

 

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 11 September 2016 16:14
To: Nick Fisk 
Cc: Wilhelm Redbrake ; Horace Ng ; ceph-users 

Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

 

On Sun, Sep 4, 2016 at 4:48 PM, Nick Fisk <n...@fisk.me.uk> wrote:

 

 

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 04 September 2016 04:45
To: Nick Fisk <n...@fisk.me.uk>
Cc: Wilhelm Redbrake <w...@globe.de>; Horace Ng <hor...@hkisl.net>; ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

 


On Saturday, September 3, 2016, Alex Gorbachev <a...@iss-integration.com> wrote:

HI Nick,

On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk  wrote:

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 21 August 2016 15:27
To: Wilhelm Redbrake 
Cc: n...@fisk.me.uk; Horace Ng ; ceph-users 

Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 



On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:

Hi Nick,
i understand all of your technical improvements.
But: why do you Not use a simple for example Areca Raid Controller with 8 gb 
Cache and Bbu ontop in every ceph node.
Configure n Times RAID 0 on the Controller and enable Write back Cache.
That must be a latency "Killer" like in all the prop. Storage arrays or Not ??

Best Regards !!

 

What we saw specifically with Areca cards is that performance is excellent in 
benchmarking and for bursty loads. However, once we started loading with more 
constant workloads (we replicate databases and files to our Ceph cluster), this 
looks to have saturated the relatively small Areca NVDIMM caches and we went 
back to pure drive based performance. 

 

Yes, I think that is a valid point. Although low latency, you are still having 
to write to the disks twice (journal+data), so once the cache’s on the cards 
start filling up, you are going to hit problems.

 

 

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
worked, but now the overall latency is really high at times, not always. Red 
Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
too many IOPS, which get their latency sky high. Overall we are functioning 
fine, but I sure would like storage vmotion and other large operations faster. 

 

 

Yeah this is the biggest pain point I think. Normal VM ops are fine, but if you 
ever have to move a multi-TB VM, it’s just too slow. 

 

If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
performance is actually quite good, as the block sizes used for the copy are a 
lot bigger. 

 

However, my use case required thin provisioned VM’s + snapshots and I found 
that using iscsi you have no control over the fragmentation of the vmdk’s and 
so the read performance is then what suffers (certainly with 7.2k disks)

 

Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
updating of the VMFS metadata, although I can’t be sure.

 

 

I am thinking I will test a few different schedulers and readahead settings to 
see if we can improve this by parallelizing reads. Also will test NFS, but need 
to determine whether to do krbd/knfsd or something more interesting like 
CephFS/Ganesha. 

 

As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
less sensitive to making config adjustments without suddenly everything 
dropping offline. The fact that you can specify the extent size on XFS helps 
massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
when esxi tries to write 32 copy threads to the same object. There is probably 
some tuning that could be done here (RBD striping???) but this is the best it’s 
been for a long time and I’m reluctant to fiddle any further.

 

We have moved ahead and added NFS support to Storcium, and are now able to run NFS 
servers with Pacemaker in HA mode (all agents are public at 
https://github.com/akurz/resource-agents/tree/master/heartbeat 
 ).  I can confirm that VM performance is definitely better and benchmarks are 
more smooth (in Windows we can see a lot of choppiness with iSCSI, NFS is 
choppy on writes, but smooth on reads, likely due to the bursty nature of OSD 
filesystems when dealing with that small IO size).

 

Were you using extsz=16384 at creation time for the filesystem?  I saw kernel 
memory deadlock messages during vmotion, such as:

 

 XFS: nfsd(102545) possible memory allocation deadlock size 40320 in kmem_alloc 
(mode:0x2400240)

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-11 Thread Alex Gorbachev
On Sun, Sep 4, 2016 at 4:48 PM, Nick Fisk  wrote:

>
>
>
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 04 September 2016 04:45
> *To:* Nick Fisk 
> *Cc:* Wilhelm Redbrake ; Horace Ng ;
> ceph-users 
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Saturday, September 3, 2016, Alex Gorbachev 
> wrote:
>
> HI Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk  wrote:
>
> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 21 August 2016 15:27
> *To:* Wilhelm Redbrake 
> *Cc:* n...@fisk.me.uk; Horace Ng ; ceph-users <
> ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:
>
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8
> gb Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or
> Not ??
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent
> in benchmarking and for bursty loads. However, once we started loading with
> more constant workloads (we replicate databases and files to our Ceph
> cluster), this looks to have saturated the relatively small Areca NVDIMM
> caches and we went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still
> having to write to the disks twice (journal+data), so once the cache’s on
> the cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
> worked, but now the overall latency is really high at times, not always.
> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
> drives with too many IOPS, which get their latency sky high. Overall we are
> functioning fine, but I sure would like storage vmotion and other large
> operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
> if you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
> then performance is actually quite good, as the block sizes used for the
> copy are a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I
> found that using iscsi you have no control over the fragmentation of the
> vmdk’s and so the read performance is then what suffers (certainly with
> 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with
> the updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead
> settings to see if we can improve this by parallelizing reads. Also will
> test NFS, but need to determine whether to do krbd/knfsd or something more
> interesting like CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
> lot less sensitive to making config adjustments without suddenly everything
> dropping offline. The fact that you can specify the extent size on XFS
> helps massively with using thin vmdks/snapshots to avoid fragmentation.
> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
> contention when esxi tries to write 32 copy threads to the same object.
> There is probably some tuning that could be done here (RBD striping???) but
> this is the best it’s been for a long time and I’m reluctant to fiddle any
> further.
>
>
>
> We have moved ahead and added NFS support to Storcium, and are now able to run
> NFS servers with Pacemaker in HA mode (all agents are public at
> https://github.com/akurz/resource-agents/tree/master/heartbeat
> ).
> I can confirm that VM performance is definitely better and benchmarks are
> more smooth (in Windows we can see a lot of choppiness with iSCSI, NFS is
> choppy on writes, but smooth on reads, likely due to the bursty nature of
> OSD filesystems when dealing with that small IO size).
>
>
>
> Were you using extsz=16384 at creation time for the filesystem?

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-10 Thread Alex Gorbachev
Confirming again much better performance with ESXi and NFS on RBD
using the XFS hint Nick uses, below.

I saw high load averages on the NFS server nodes, corresponding to
iowait, does not seem to cause too much trouble so far.

Here are HDtune Pro testing results from some recent runs.  The
puzzling part is better random IO performance with a 16 MB object size
on both iSCSI and NFS.  In my thinking this should be slower; however,
this has been confirmed by the timed vmotion tests and more random IO
tests by my coworker as well:

Test_type read MB/s write MB/s read iops write iops read multi iops
write multi iops
NFS 1mb 460 103 8753 66 47466 1616
NFS 4mb 441 147 8863 82 47556 764
iSCSI 1mb 117 76 326 90 672 938
iSCSI 4mb 275 60 205 24 2015 1212
NFS 16mb 455 177 7761 119 36403 3175
iSCSI 16mb 300 65 1117 237 12389 1826

( prettier view at
http://storcium.blogspot.com/2016/09/latest-tests-on-nfs-vs.html )

Alex

>
> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> Sent: 04 September 2016 04:45
> To: Nick Fisk 
> Cc: Wilhelm Redbrake ; Horace Ng ; 
> ceph-users 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Saturday, September 3, 2016, Alex Gorbachev  
> wrote:
>
> HI Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk  wrote:
>
> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> Sent: 21 August 2016 15:27
> To: Wilhelm Redbrake 
> Cc: n...@fisk.me.uk; Horace Ng ; ceph-users 
> 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:
>
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8 gb 
> Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or Not ??
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent in 
> benchmarking and for bursty loads. However, once we started loading with more 
> constant workloads (we replicate databases and files to our Ceph cluster), 
> this looks to have saturated the relatively small Areca NVDIMM caches and we 
> went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still 
> having to write to the disks twice (journal+data), so once the cache’s on the 
> cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
> HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
> worked, but now the overall latency is really high at times, not always. Red 
> Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
> too many IOPS, which get their latency sky high. Overall we are functioning 
> fine, but I sure would like storage vmotion and other large operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but if 
> you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
> performance is actually quite good, as the block sizes used for the copy are 
> a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I found 
> that using iscsi you have no control over the fragmentation of the vmdk’s and 
> so the read performance is then what suffers (certainly with 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
> updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead settings 
> to see if we can improve this by parallelizing reads. Also will test NFS, but 
> need to determine whether to do krbd/knfsd or something more interesting like 
> CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
> less sensitive to making config adjustments without suddenly everything 
> dropping offline. The fact that you can specify the extent size on XFS helps 
> massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
> v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
> when esxi tries to write 32 copy threads to the same object. There is 
> probably some tuning that could be done here (RBD striping???) but this is 
> the best it’s been for a long time and I’m reluctant to fiddle any further.
>
>
>
> W

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-04 Thread Nick Fisk
 

 

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 04 September 2016 04:45
To: Nick Fisk 
Cc: Wilhelm Redbrake ; Horace Ng ; ceph-users 

Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 



On Saturday, September 3, 2016, Alex Gorbachev <a...@iss-integration.com> wrote:

HI Nick,

On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk wrote:

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 21 August 2016 15:27
To: Wilhelm Redbrake
Cc: n...@fisk.me.uk; Horace Ng; ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 



On Sunday, August 21, 2016, Wilhelm Redbrake wrote:

Hi Nick,
i understand all of your technical improvements.
But: why do you Not use a simple for example Areca Raid Controller with 8 gb 
Cache and Bbu ontop in every ceph node.
Configure n Times RAID 0 on the Controller and enable Write back Cache.
That must be a latency "Killer" like in all the prop. Storage arrays or Not ??

Best Regards !!

 

What we saw specifically with Areca cards is that performance is excellent in 
benchmarking and for bursty loads. However, once we started loading with more 
constant workloads (we replicate databases and files to our Ceph cluster), this 
looks to have saturated the relatively small Areca NVDIMM caches and we went 
back to pure drive based performance. 

 

Yes, I think that is a valid point. Although low latency, you are still having 
to write to the disks twice (journal+data), so once the cache’s on the cards 
start filling up, you are going to hit problems.

 

 

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
worked, but now the overall latency is really high at times, not always. Red 
Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
too many IOPS, which get their latency sky high. Overall we are functioning 
fine, but I sure would like storage vmotion and other large operations faster. 

 

 

Yeah this is the biggest pain point I think. Normal VM ops are fine, but if you 
ever have to move a multi-TB VM, it’s just too slow. 

 

If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
performance is actually quite good, as the block sizes used for the copy are a 
lot bigger. 

 

However, my use case required thin provisioned VM’s + snapshots and I found 
that using iscsi you have no control over the fragmentation of the vmdk’s and 
so the read performance is then what suffers (certainly with 7.2k disks)

 

Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
updating of the VMFS metadata, although I can’t be sure.

 

 

I am thinking I will test a few different schedulers and readahead settings to 
see if we can improve this by parallelizing reads. Also will test NFS, but need 
to determine whether to do krbd/knfsd or something more interesting like 
CephFS/Ganesha. 

 

As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
less sensitive to making config adjustments without suddenly everything 
dropping offline. The fact that you can specify the extent size on XFS helps 
massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
when esxi tries to write 32 copy threads to the same object. There is probably 
some tuning that could be done here (RBD striping???) but this is the best it’s 
been for a long time and I’m reluctant to fiddle any further.

 

We have moved ahead and added NFS support to Storcium, and are now able to run NFS 
servers with Pacemaker in HA mode (all agents are public at 
https://github.com/akurz/resource-agents/tree/master/heartbeat).  I can confirm 
that VM performance is definitely better and benchmarks are more smooth (in 
Windows we can see a lot of choppiness with iSCSI, NFS is choppy on writes, but 
smooth on reads, likely due to the bursty nature of OSD filesystems when 
dealing with that small IO size).
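
For anyone wanting to see what the generic half of such a setup looks like, here is a rough sketch using the stock ocf:heartbeat agents (device path, export directory, client subnet and VIP below are made-up examples; presumably the Storcium agents in the repository above handle the Ceph-specific part such as mapping the RBD device):

# rough sketch: one failover group = filesystem + NFS export + floating IP
pcs resource create ds1-fs ocf:heartbeat:Filesystem \
    device=/dev/rbd/rbd/nfs-ds1 directory=/srv/nfs/ds1 fstype=xfs --group nfs-ds1
pcs resource create ds1-export ocf:heartbeat:exportfs \
    directory=/srv/nfs/ds1 clientspec=10.0.0.0/24 \
    options=rw,no_root_squash,sync fsid=1 --group nfs-ds1
pcs resource create ds1-vip ocf:heartbeat:IPaddr2 \
    ip=10.0.0.50 cidr_netmask=24 --group nfs-ds1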

 

Were you using extsz=16384 at creation time for the filesystem?  I saw kernel 
memory deadlock messages during vmotion, such as:

 

 XFS: nfsd(102545) possible memory allocation deadlock size 40320 in kmem_alloc 
(mode:0x2400240)

 

And analyzing fragmentation:

 

root@roc-5r-scd218:~# xfs_db -r /dev/rbd21

xfs_db> frag -d

actual 0, ideal 0, fragmentation factor 0.00%

xfs_db> frag -f

actual 1863960, ideal 74, fragmentation factor 100.00%

 

Just from two vmotions.

 

Are you seeing anything similar?

 

Found your post on setting XFS extent size hint for sparse files:

 

xfs_io -c extsize 16M /mountpoint

Will test - fragmentation definitely present without this.  
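
As a minimal sketch of that (the mount point is a made-up example; /dev/rbd21 and the 16M hint are taken from the messages above):

# set a 16M extent size hint on the directory backing the NFS datastore;
# new sparse vmdk files created under it inherit the hint and allocate in
# large extents instead of tiny fragments
xfs_io -c "extsize 16m" /srv/vmware-ds1
xfs_io -c extsize /srv/vmware-ds1          # read the hint back to verify

# re-check per-file fragmentation after the next storage vMotion
xfs_db -r -c "frag -f" /dev/rbd21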

 

Yeah, I got bit by that when I first set it up; I then created another datastore 
with that extent size hint and moved everything across. Haven’t seen any kmem al

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-03 Thread Alex Gorbachev
On Saturday, September 3, 2016, Alex Gorbachev wrote:

> Hi Nick,
>
> On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk  > wrote:
>
>> *From:* Alex Gorbachev [mailto:a...@iss-integration.com
>> ]
>> *Sent:* 21 August 2016 15:27
>> *To:* Wilhelm Redbrake > >
>> *Cc:* n...@fisk.me.uk ;
>> Horace Ng > >; ceph-users <
>> ceph-users@lists.ceph.com
>> >
>> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>>
>>
>>
>>
>> On Sunday, August 21, 2016, Wilhelm Redbrake > > wrote:
>>
>> Hi Nick,
>> i understand all of your technical improvements.
>> But: why do you Not use a simple for example Areca Raid Controller with 8
>> gb Cache and Bbu ontop in every ceph node.
>> Configure n Times RAID 0 on the Controller and enable Write back Cache.
>> That must be a latency "Killer" like in all the prop. Storage arrays or
>> Not ??
>>
>> Best Regards !!
>>
>>
>>
>> What we saw specifically with Areca cards is that performance is
>> excellent in benchmarking and for bursty loads. However, once we started
>> loading with more constant workloads (we replicate databases and files to
>> our Ceph cluster), this looks to have saturated the relatively small Areca
>> NVDIMM caches and we went back to pure drive based performance.
>>
>>
>>
>> Yes, I think that is a valid point. Although low latency, you are still
>> having to write to the disks twice (journal+data), so once the cache’s on
>> the cards start filling up, you are going to hit problems.
>>
>>
>>
>>
>>
>> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
>> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
>> worked, but now the overall latency is really high at times, not always.
>> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
>> drives with too many IOPS, which get their latency sky high. Overall we are
>> functioning fine, but I sure would like storage vmotion and other large
>> operations faster.
>>
>>
>>
>>
>>
>> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
>> if you ever have to move a multi-TB VM, it’s just too slow.
>>
>>
>>
>> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
>> then performance is actually quite good, as the block sizes used for the
>> copy are a lot bigger.
>>
>>
>>
>> However, my use case required thin provisioned VM’s + snapshots and I
>> found that using iscsi you have no control over the fragmentation of the
>> vmdk’s and so the read performance is then what suffers (certainly with
>> 7.2k disks)
>>
>>
>>
>> Also with thin provisioned vmdk’s I think I was seeing PG contention with
>> the updating of the VMFS metadata, although I can’t be sure.
>>
>>
>>
>>
>>
>> I am thinking I will test a few different schedulers and readahead
>> settings to see if we can improve this by parallelizing reads. Also will
>> test NFS, but need to determine whether to do krbd/knfsd or something more
>> interesting like CephFS/Ganesha.
>>
>>
>>
>> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
>> lot less sensitive to making config adjustments without suddenly everything
>> dropping offline. The fact that you can specify the extent size on XFS
>> helps massively with using thin vmdks/snapshots to avoid fragmentation.
>> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
>> contention when esxi tries to write 32 copy threads to the same object.
>> There is probably some tuning that could be done here (RBD striping???) but
>> this is the best it’s been for a long time and I’m reluctant to fiddle any
>> further.
>>
>
> We have moved ahead and added NFS support to Storcium, and are now able to run
> NFS servers with Pacemaker in HA mode (all agents are public at
> https://github.com/akurz/resource-agents/tree/master/heartbeat).  I can
> confirm that VM performance is definitely better and benchmarks are more
> smooth (in Windows we can see a lot of choppiness with iSCSI, NFS is choppy
> on writes, but smooth on reads, likely due to the bursty nature of OSD
> filesystems when dealing with that small IO size).
>
> Were you using extsz=16384 at creation time for the filesystem?  I saw
> kernel memory deadlock messages during vmotion, such as:
>
>  XFS: nfsd(102545) possible memory allocation deadlock size 40320 in
> 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-09-03 Thread Alex Gorbachev
Hi Nick,

On Sun, Aug 21, 2016 at 3:19 PM, Nick Fisk  wrote:

> *From:* Alex Gorbachev [mailto:a...@iss-integration.com]
> *Sent:* 21 August 2016 15:27
> *To:* Wilhelm Redbrake 
> *Cc:* n...@fisk.me.uk; Horace Ng ; ceph-users <
> ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
>
>
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:
>
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8
> gb Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or
> Not ??
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent
> in benchmarking and for bursty loads. However, once we started loading with
> more constant workloads (we replicate databases and files to our Ceph
> cluster), this looks to have saturated the relatively small Areca NVDIMM
> caches and we went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still
> having to write to the disks twice (journal+data), so once the cache’s on
> the cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
> worked, but now the overall latency is really high at times, not always.
> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
> drives with too many IOPS, which get their latency sky high. Overall we are
> functioning fine, but I sure would like storage vmotion and other large
> operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
> if you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
> then performance is actually quite good, as the block sizes used for the
> copy are a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I
> found that using iscsi you have no control over the fragmentation of the
> vmdk’s and so the read performance is then what suffers (certainly with
> 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with
> the updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead
> settings to see if we can improve this by parallelizing reads. Also will
> test NFS, but need to determine whether to do krbd/knfsd or something more
> interesting like CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
> lot less sensitive to making config adjustments without suddenly everything
> dropping offline. The fact that you can specify the extent size on XFS
> helps massively with using thin vmdks/snapshots to avoid fragmentation.
> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
> contention when esxi tries to write 32 copy threads to the same object.
> There is probably some tuning that could be done here (RBD striping???) but
> this is the best it’s been for a long time and I’m reluctant to fiddle any
> further.
>

We have moved ahead and added NFS support to Storcium, and are now able to run
NFS servers with Pacemaker in HA mode (all agents are public at
https://github.com/akurz/resource-agents/tree/master/heartbeat).  I can
confirm that VM performance is definitely better and benchmarks are more
smooth (in Windows we can see a lot of choppiness with iSCSI, NFS is choppy
on writes, but smooth on reads, likely due to the bursty nature of OSD
filesystems when dealing with that small IO size).

Were you using extsz=16384 at creation time for the filesystem?  I saw
kernel memory deadlock messages during vmotion, such as:

 XFS: nfsd(102545) possible memory allocation deadlock size 40320 in
kmem_alloc (mode:0x2400240)

And analyzing fragmentation:

root@roc-5r-scd218:~# xfs_db -r /dev/rbd21
xfs_db> frag -d
actual 0, ideal 0, fragmentation factor 0.00%
xfs_db> frag -f
actual 1863960, ideal 74, fragmentation factor 100.00%

Just from two vmotions.

Are you seeing anything similar?

Thank you,
Alex


>
>
> But as mentioned above, thick vmdk’s with vaai might be a really good fit.
>
>
>
> Thanks for your very valuable info on analysis and hw build.
>
>
>
> Alex
>
>
>
>
>
>
> On 21.08.2016 at 09:31, Nick Fisk wrote:

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-31 Thread Nick Fisk
From: w...@globe.de [mailto:w...@globe.de] 
Sent: 31 August 2016 08:56
To: n...@fisk.me.uk; 'Alex Gorbachev' ; 'Horace Ng' 

Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Nick,

what do you think about Infiniband?

I have read that with InfiniBand the latency is around 1.2 µs

It’s great, but I don’t believe the Ceph support for RDMA is finished yet, so 
you are stuck using IPoIB, which has similar performance to 10G Ethernet.

For now concentrate on removing latency where you easily can (3.5+ Ghz CPU’s, 
NVME journals) and then when stuff like RDMA comes along, you will be in a 
better place to take advantage of it.

 

Kind Regards!

 

On 31.08.16 at 09:51, Nick Fisk wrote:

 

 

From: w...@globe.de <mailto:w...@globe.de>  [mailto:w...@globe.de] 
Sent: 30 August 2016 18:40
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Alex Gorbachev'  
<mailto:a...@iss-integration.com> 
Cc: 'Horace Ng'  <mailto:hor...@hkisl.net> 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick,

here are my answers and questions...

 

On 30.08.16 at 19:05, Nick Fisk wrote:

 

 

From: w...@globe.de <mailto:w...@globe.de>  [mailto:w...@globe.de] 
Sent: 30 August 2016 08:48
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Alex Gorbachev'  
<mailto:a...@iss-integration.com> 
Cc: 'Horace Ng'  <mailto:hor...@hkisl.net> 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick, hi Alex,

Nick: I've got my 600GB SAS HP drives.

Performance is not good, so I won't paste the results here...

 

On another note: I've put Samsung SM863 enterprise SSDs into the Ceph cluster.

If I run a 4k test on the SSD directly, without a filesystem, I get 

(See Sebastien's Han Tests)

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
 
 

 

dd if=/dev/zero of=/dev/sdd bs=4k count=1000000 oflag=direct,dsync
1000000+0 records in
1000000+0 records out
4096000000 bytes (4,1 GB, 3,8 GiB) copied, 52,7139 s, 77,7 MB/s

77000/4 = ~20000 IOPs

 

If I format the device with XFS, I get:

mkfs.xfs -f /dev/sdd

mount /dev/sdd /mnt

cd /mnt

dd if=/dev/zero of=/mnt/test.txt bs=4k count=100000 oflag=direct,dsync
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 21,1856 s, 19,3 MB/s

19300/4 = ~5000 IOPs
I know once you have a FS on the device it will slow down due to the extra 
journal writes, maybe this is a little more than expected here…but still 
reasonably fast. Can you see in iostat how many IO’s the device is doing during 
this test?



watch iostat -dmx -t -y 1 1 /dev/sde

Device:  rrqm/s  wrqm/s  r/s   w/s      rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sde      0,00    0,00    0,00  9625,00  0,00   25,85  5,50      0,60      0,06   0,00     0,06     0,06   59,60




So there seems to be an extra delay somewhere when writing via a FS instead of 
raw device. You are still getting around 10,000 iops though, so not too bad.







If I use the SSD in the Ceph cluster and do the test again with rados bench, 
bs=4K and -t 1 (one thread), I get only 2-3 MByte/s

2500/4 = ~600 IOPs

My question is: how can the raw device performance be so much higher than the 
XFS and Ceph RBD performance?

Ceph will be a lot slower as you are replacing a 30cm SAS/SATA cable with 
networking, software and also doing replication. You have at least 2 network 
hops with Ceph. For a slightly fairer test, set replication to 1x.
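
A quick sketch of that kind of comparison (pool name and PG count are made-up examples):

# throw-away pool with replication 1x, purely for benchmarking
ceph osd pool create bench-r1 128 128
ceph osd pool set bench-r1 size 1
ceph osd pool set bench-r1 min_size 1

# repeat the single-threaded 4K write test against the 1x pool, then clean up
rados bench -p bench-r1 60 write -b 4K -t 1
rados -p bench-r1 cleanup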


Replication 3x:
rados bench -p rbd 60 write -b 4k -t 1
Invalid value for block-size: The option value '4k' seems to be invalid
root@ceph-mon-1:~# rados bench -p rbd 60 write -b 4K -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 
60 seconds or 0 objects
Object prefix: benchmark_data_ceph-mon-1_30407
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1   1   402   4011.5661   1.56641  0.00226091  0.00248929
2   1   775   774   1.51142   1.45703   0.0021945  0.00258187
3   1  1110  1109   1.44374   1.30859  0.00278291  0.00270182
4   1  1421  1420   1.38647   1.21484  0.001

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-31 Thread Nick Fisk
 

 

From: w...@globe.de [mailto:w...@globe.de] 
Sent: 30 August 2016 18:40
To: n...@fisk.me.uk; 'Alex Gorbachev' 
Cc: 'Horace Ng' 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick,

here are my answers and questions...

 

On 30.08.16 at 19:05, Nick Fisk wrote:

 

 

From: w...@globe.de <mailto:w...@globe.de>  [mailto:w...@globe.de] 
Sent: 30 August 2016 08:48
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Alex Gorbachev'  
<mailto:a...@iss-integration.com> 
Cc: 'Horace Ng'  <mailto:hor...@hkisl.net> 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Hi Nick, hi Alex,

Nick: I've got my 600GB SAS HP drives.

Performance is not good, so I won't paste the results here...

 

On another note: I've put Samsung SM863 enterprise SSDs into the Ceph cluster.

If I run a 4k test on the SSD directly, without a filesystem, I get 

(See Sebastien's Han Tests)

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
 
 

 

dd if=/dev/zero of=/dev/sdd bs=4k count=1000000 oflag=direct,dsync
1000000+0 records in
1000000+0 records out
4096000000 bytes (4,1 GB, 3,8 GiB) copied, 52,7139 s, 77,7 MB/s

77000/4 = ~20000 IOPs

 

If I format the device with XFS, I get:

mkfs.xfs -f /dev/sdd

mount /dev/sdd /mnt

cd /mnt

dd if=/dev/zero of=/mnt/test.txt bs=4k count=100000 oflag=direct,dsync
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 21,1856 s, 19,3 MB/s

19300/4 = ~5000 IOPs
I know once you have a FS on the device it will slow down due to the extra 
journal writes, maybe this is a little more than expected here…but still 
reasonably fast. Can you see in iostat how many IO’s the device is doing during 
this test?



watch iostat -dmx -t -y 1 1 /dev/sde

Device:  rrqm/s  wrqm/s  r/s   w/s      rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sde      0,00    0,00    0,00  9625,00  0,00   25,85  5,50      0,60      0,06   0,00     0,06     0,06   59,60



So there seems to be an extra delay somewhere when writing via a FS instead of 
raw device. You are still getting around 10,000 iops though, so not too bad.






If I use the SSD in the Ceph cluster and do the test again with rados bench, 
bs=4K and -t 1 (one thread), I get only 2-3 MByte/s

2500/4 = ~600 IOPs

My question is: how can the raw device performance be so much higher than the 
XFS and Ceph RBD performance?

Ceph will be a lot slower as you are replacing a 30cm SAS/SATA cable with 
networking, software and also doing replication. You have at least 2 network 
hops with Ceph. For a slightly fairer test, set replication to 1x.


Replication 3x:
rados bench -p rbd 60 write -b 4k -t 1
Invalid value for block-size: The option value '4k' seems to be invalid
root@ceph-mon-1:~# rados bench -p rbd 60 write -b 4K -t 1
Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 
60 seconds or 0 objects
Object prefix: benchmark_data_ceph-mon-1_30407
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
0   0 0 0 0 0   -   0
1   1   402   4011.5661   1.56641  0.00226091  0.00248929
2   1   775   774   1.51142   1.45703   0.0021945  0.00258187
3   1  1110  1109   1.44374   1.30859  0.00278291  0.00270182
4   1  1421  1420   1.38647   1.21484  0.00199578  0.00281537
5   1  1731  1730   1.35132   1.21094  0.00219136  0.00288843
6   1  2044  2043   1.32985   1.22266   0.0023981  0.00293468
7   1  2351  2350   1.31116   1.19922  0.00258856  0.00296963
8   1  2703  2702   1.31911 1.375   0.0224678  0.00295862
9   1  2955  2954   1.28191  0.984375  0.00841621  0.00304526
   10   1  3228  3227   1.26034   1.06641  0.00261023  0.00309665
   11   1  3501  35001.2427   1.06641  0.00659853  0.00313985
   12   1  3791  3790   1.23353   1.13281   0.0027244  0.00316168
   13   1  4150  4149   1.24649   1.40234  0.00262242  0.00313177
   14   1  4460  4459   1.24394   1.21094  0.00262075  0.00313735
   15   1  4721  4720   1.22897   1.01953  0.00239961  0.00317357
   16   1  4983  4982   1.21611   1.02344  0.00290526  0.00321005
   17   1  5279  5278   1.21258   1.15625  0.00252002   0.0032196
   18   1  5605  5604   1.21595   1.27344

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-22 Thread Christian Balzer


Hello,

On Mon, 22 Aug 2016 20:34:54 +0100 Nick Fisk wrote:

> > -Original Message-
> > From: Christian Balzer [mailto:ch...@gol.com]
> > Sent: 22 August 2016 03:00
> > To: 'ceph-users' 
> > Cc: Nick Fisk 
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > 
> > 
> > Hello,
> > 
> > On Sun, 21 Aug 2016 09:57:40 +0100 Nick Fisk wrote:
> > 
> > >
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > Behalf Of Christian Balzer
> > > > Sent: 21 August 2016 09:32
> > > > To: ceph-users 
> > > > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > > >
> > > >
> > > > Hello,
> > > >
> > > > On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> > > >
> > > > > Hi Nick
> > > > >
> > > > > Interested in this comment - "-Dual sockets are probably bad and
> > > > > will impact performance."
> > > > >
> > > > > Have you got real world experience of this being the case?
> > > > >
> > > > Well, Nick wrote "probably".
> > > >
> > > > Dual sockets and thus NUMA, the need for CPUs to talk to each other
> > > > and share information certainly can impact things that are
> > > very
> > > > time critical.
> > > > How much though is a question of design, both HW and SW.
> > >
> > > There was a guy from Redhat (sorry his name escapes me now) a few
> > > months ago on the performance weekly meeting. He was analysing the CPU
> > > cache miss effects with Ceph and it looked like a NUMA setup was
> > > having quite a severe impact on some things. To be honest a lot of it
> > > went over my head, but I came away from it with a general feeling that
> > > if you can get the required performance from 1 socket, then that is 
> > > probably a better bet. This includes only populating a single
> > socket in a dual socket system. There was also a Ceph tech talk at the 
> > start of the year (High perf databases on Ceph) where the guy
> > presenting was also recommending only populating 1 socket for latency 
> > reasons.
> > >
> > I wonder how complete their testing was and how much manual tuning they 
> > tried.
> > As in:
> > 
> > 1. Was irqbalance running?
> > Because it and the normal kernel strategies clash beautifully.
> > Irqbalance moves stuff around, the kernel tries to move things close to 
> > where the IRQs are, cat and mouse.
> > 
> > 2. Did they try with manual IRQ pinning?
> > I do, not that it's critical with my Ceph nodes, but on other machines it 
> > can make a LOT of difference.
> > Like keeping the cores near (or at least on the same NUMA node) as the 
> > network IRQs reserved for KVM vhost processes.
> > 
> > 3. Did they try pinning Ceph OSD processes?
> > While this may certainly help (and make things more predictable when the 
> > load gets high), as I said above the kernel normally does a
> > pretty good job of NOT moving things around and keeping processes close to 
> > the resources they need.
> > 
> 
> From what I remember I think they went to pretty long lengths to tune things. 
> I think one point was that if you have a 40G NIC on one socket and an NVMe on 
> another, no matter where the process runs, you are going to have a lot of 
> traffic crossing between the sockets.

Traffic yes, complete process migrations hopefully not.
But anyway, yes, that's to be expected.

And also unavoidable if you want/need to utilize the whole capabilities
and PCIe lanes of a dual socket motherboard.
And in some cases (usually not with Ceph/OSDs), the IRQ load really will
benefit from more cores to play with.

> 
> Here is the DB on Ceph one
> 
> http://ceph-users.ceph.narkive.com/1sj4VI4U/ceph-tech-talk-high-performance-production-databases-on-ceph

Thanks!
Yeah, basically confirms what I know/said.

> 
> I don't think the recordings are available for the performance meeting one, 
> but it was something to do with certain C++ string functions causing issue 
> with CPU cache. Honestly can't remember much else.
> 
> > > Both of those, coupled with the fact that Xeon E3's are the cheapest way 
> > > to get high clock speeds, sort of made my decision.
> > >
> > Totally agreed, my current HDD node design is based on the single CPU 
> >

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-22 Thread Nick Fisk
 

 

From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 22 August 2016 20:30
To: Nick Fisk 
Cc: Wilhelm Redbrake ; Horace Ng ; ceph-users 

Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 


On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:

Hi Nick,
i understand all of your technical improvements.
But: why do you Not use a simple for example Areca Raid Controller with 8 gb 
Cache and Bbu ontop in every ceph node.
Configure n Times RAID 0 on the Controller and enable Write back Cache.
That must be a latency "Killer" like in all the prop. Storage arrays or Not ??

Best Regards !!

 

What we saw specifically with Areca cards is that performance is excellent in 
benchmarking and for bursty loads. However, once we started loading with more 
constant workloads (we replicate databases and files to our Ceph cluster), this 
looks to have saturated the relatively small Areca NVDIMM caches and we went 
back to pure drive based performance. 

 

Yes, I think that is a valid point. Although low latency, you are still having 
to write to the disks twice (journal+data), so once the cache’s on the cards 
start filling up, you are going to hit problems.

 

 

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
worked, but now the overall latency is really high at times, not always. Red 
Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
too many IOPS, which get their latency sky high. Overall we are functioning 
fine, but I sure would like storage vmotion and other large operations faster. 

 

 

Yeah this is the biggest pain point I think. Normal VM ops are fine, but if you 
ever have to move a multi-TB VM, it’s just too slow. 

 

If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
performance is actually quite good, as the block sizes used for the copy are a 
lot bigger. 

 

However, my use case required thin provisioned VM’s + snapshots and I found 
that using iscsi you have no control over the fragmentation of the vmdk’s and 
so the read performance is then what suffers (certainly with 7.2k disks)

 

Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
updating of the VMFS metadata, although I can’t be sure.

 

 

I am thinking I will test a few different schedulers and readahead settings to 
see if we can improve this by parallelizing reads. Also will test NFS, but need 
to determine whether to do krbd/knfsd or something more interesting like 
CephFS/Ganesha. 

 

As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
less sensitive to making config adjustments without suddenly everything 
dropping offline. The fact that you can specify the extent size on XFS helps 
massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
when esxi tries to write 32 copy threads to the same object. There is probably 
some tuning that could be done here (RBD striping???) but this is the best it’s 
been for a long time and I’m reluctant to fiddle any further.
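
If anyone wants to experiment with that, a hedged sketch (pool/image name and striping values are made-up examples, not something tested in this thread; as far as I know the kernel RBD client of that era refuses to map images with non-default striping, so this mainly applies to librbd-backed targets):

# spread adjacent 1 MiB chunks across 8 objects, so parallel copy threads
# are less likely to pile onto the same PG
rbd create rbd/esxi-ds1 --size 102400 --stripe-unit 1048576 --stripe-count 8
rbd info rbd/esxi-ds1      # reports stripe unit/count alongside order and features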

 

But as mentioned above, thick vmdk’s with vaai might be a really good fit.

 

Any chance thin vs. thick difference could be related to discards?  I saw 
zillions of them in recent testing.

 

 

I was using FILEIO and so discards weren’t working for me. I know fragmentation 
was definitely the cause of the small reads. The VMFS metadata I’m less sure 
of, but it seemed the most likely cause as it only affected write performance 
the first time round.

 

 

 

Thanks for your very valuable info on analysis and hw build. 

 

Alex

 




On 21.08.2016 at 09:31, Nick Fisk wrote:

>> -Original Message-
>> From: Alex Gorbachev [mailto:a...@iss-integration.com]
>> Sent: 21 August 2016 04:15
>> To: Nick Fisk 
>> Cc: w...@globe.de; Horace Ng ; ceph-users 
>> 
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Hi Nick,
>>
>> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
>>>> -Original Message-
>>>> From: w...@globe.de [mailto:w...@globe.de]
>>>> Sent: 21 July 2016 13:23
>>>> To: n...@fisk.me.uk; 'Horace Ng' 
>>>> Cc: ceph-users@lists.ceph.com
>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>
>>>> Okay and what is your plan now to speed up ?
>>>
>>> Now I have come up with a lower latency hardware design, there is not much 
>>> further improvement until persistent RBD caching is
>> implemented, as you will be moving the SSD/NVME closer to the client. But 
>> I'm happy with what I can achieve at the moment. You
>> coul

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-22 Thread Nick Fisk
> -Original Message-
> From: Christian Balzer [mailto:ch...@gol.com]
> Sent: 22 August 2016 03:00
> To: 'ceph-users' 
> Cc: Nick Fisk 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> 
> Hello,
> 
> On Sun, 21 Aug 2016 09:57:40 +0100 Nick Fisk wrote:
> 
> >
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > Behalf Of Christian Balzer
> > > Sent: 21 August 2016 09:32
> > > To: ceph-users 
> > > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > >
> > >
> > > Hello,
> > >
> > > On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> > >
> > > > Hi Nick
> > > >
> > > > Interested in this comment - "-Dual sockets are probably bad and
> > > > will impact performance."
> > > >
> > > > Have you got real world experience of this being the case?
> > > >
> > > Well, Nick wrote "probably".
> > >
> > > Dual sockets and thus NUMA, the need for CPUs to talk to each other
> > > and share information certainly can impact things that are
> > very
> > > time critical.
> > > How much though is a question of design, both HW and SW.
> >
> > There was a guy from Redhat (sorry his name escapes me now) a few
> > months ago on the performance weekly meeting. He was analysing the CPU
> > cache miss effects with Ceph and it looked like a NUMA setup was
> > having quite a severe impact on some things. To be honest a lot of it
> > went over my head, but I came away from it with a general feeling that
> > if you can get the required performance from 1 socket, then that is 
> > probably a better bet. This includes only populating a single
> socket in a dual socket system. There was also a Ceph tech talk at the start 
> of the year (High perf databases on Ceph) where the guy
> presenting was also recommending only populating 1 socket for latency reasons.
> >
> I wonder how complete their testing was and how much manual tuning they tried.
> As in:
> 
> 1. Was irqbalance running?
> Because it and the normal kernel strategies clash beautifully.
> Irqbalance moves stuff around, the kernel tries to move things close to where 
> the IRQs are, cat and mouse.
> 
> 2. Did they try with manual IRQ pinning?
> I do, not that it's critical with my Ceph nodes, but on other machines it can 
> make a LOT of difference.
> Like keeping the cores near (or at least on the same NUMA node) as the 
> network IRQs reserved for KVM vhost processes.
> 
> 3. Did they try pinning Ceph OSD processes?
> While this may certainly help (and make things more predictable when the load 
> gets high), as I said above the kernel normally does a
> pretty good job of NOT moving things around and keeping processes close to 
> the resources they need.
> 

From what I remember I think they went to pretty long lengths to tune things. 
I think one point was that if you have a 40G NIC on one socket and an NVMe on 
another, no matter where the process runs, you are going to have a lot of 
traffic crossing between the sockets.

Here is the DB on Ceph one

http://ceph-users.ceph.narkive.com/1sj4VI4U/ceph-tech-talk-high-performance-production-databases-on-ceph

I don't think the recordings are available for the performance meeting one, but 
it was something to do with certain C++ string functions causing issue with CPU 
cache. Honestly can't remember much else.

> > Both of those, coupled with the fact that Xeon E3's are the cheapest way to 
> > get high clock speeds, sort of made my decision.
> >
> Totally agreed, my current HDD node design is based on the single CPU 
> Supermicro 5028R-E1CR12L barebone, with an E5-1650 v3
> (3.50GHz) CPU.

Nice. Any ideas how they compare to the E3's?

> 
> > >
> > > We're looking here at a case where he's trying to reduce latency by
> > > all means and where the actual CPU needs for the HDDs are negligible.
> > > The idea being that a "Ceph IOPS" stays on one core which is hopefully 
> > > also not being shared at that time.
> > >
> > > If you're looking at full SSD nodes OTOH a singe CPU may very well
> > > not be able to saturate a sensible amount of SSDs per node, so
> > a
> > > slight penalty but better utilization and overall IOPS with 2 CPUs may be 
> > > the forward.
> >
> > Definitely, as always work out what your requirements are and design around 
> > them.
> >
> On my cache ti

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-22 Thread Alex Gorbachev
>
>
> On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:
>
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8
> gb Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or
> Not ??
>
> Best Regards !!
>
>
>
> What we saw specifically with Areca cards is that performance is excellent
> in benchmarking and for bursty loads. However, once we started loading with
> more constant workloads (we replicate databases and files to our Ceph
> cluster), this looks to have saturated the relatively small Areca NVDIMM
> caches and we went back to pure drive based performance.
>
>
>
> Yes, I think that is a valid point. Although low latency, you are still
> having to write to the disks twice (journal+data), so once the cache’s on
> the cards start filling up, you are going to hit problems.
>
>
>
>
>
> So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per
> 3 HDDs) in hopes that it would help reduce the noisy neighbor impact. That
> worked, but now the overall latency is really high at times, not always.
> Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
> drives with too many IOPS, which get their latency sky high. Overall we are
> functioning fine, but I sure would like storage vmotion and other large
> operations faster.
>
>
>
>
>
> Yeah this is the biggest pain point I think. Normal VM ops are fine, but
> if you ever have to move a multi-TB VM, it’s just too slow.
>
>
>
> If you use iscsi with vaai and are migrating a thick provisioned vmdk,
> then performance is actually quite good, as the block sizes used for the
> copy are a lot bigger.
>
>
>
> However, my use case required thin provisioned VM’s + snapshots and I
> found that using iscsi you have no control over the fragmentation of the
> vmdk’s and so the read performance is then what suffers (certainly with
> 7.2k disks)
>
>
>
> Also with thin provisioned vmdk’s I think I was seeing PG contention with
> the updating of the VMFS metadata, although I can’t be sure.
>
>
>
>
>
> I am thinking I will test a few different schedulers and readahead
> settings to see if we can improve this by parallelizing reads. Also will
> test NFS, but need to determine whether to do krbd/knfsd or something more
> interesting like CephFS/Ganesha.
>
>
>
> As you know I’m on NFS now. I’ve found it a lot easier to get going and a
> lot less sensitive to making config adjustments without suddenly everything
> dropping offline. The fact that you can specify the extent size on XFS
> helps massively with using thin vmdks/snapshots to avoid fragmentation.
> Storage v-motions are a bit faster than iscsi, but I think I am hitting PG
> contention when esxi tries to write 32 copy threads to the same object.
> There is probably some tuning that could be done here (RBD striping???) but
> this is the best it’s been for a long time and I’m reluctant to fiddle any
> further.
>
>
>
> But as mentioned above, thick vmdk’s with vaai might be a really good fit.
>

Any chance thin vs. thick difference could be related to discards?  I saw
zillions of them in recent testing.


>
>
> Thanks for your very valuable info on analysis and hw build.
>
>
>
> Alex
>
>
>
>
>
>
> On 21.08.2016 at 09:31, Nick Fisk wrote:
>
> >> -Original Message-
> >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> >> Sent: 21 August 2016 04:15
> >> To: Nick Fisk 
> >> Cc: w...@globe.de; Horace Ng ; ceph-users <
> ceph-users@lists.ceph.com>
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi Nick,
> >>
> >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
> >>>> -Original Message-
> >>>> From: w...@globe.de [mailto:w...@globe.de]
> >>>> Sent: 21 July 2016 13:23
> >>>> To: n...@fisk.me.uk; 'Horace Ng' 
> >>>> Cc: ceph-users@lists.ceph.com
> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Okay and what is your plan now to speed up ?
> >>>
> >>> Now I have come up with a lower latency hardware design, there is not
> much further improvement until persistent RBD caching is
> >> implemented, as you will be moving the SSD/NVME closer to the client.
> But I'm happy with what I can achieve at the mo

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Christian Balzer

Hello,

On Sun, 21 Aug 2016 09:57:40 +0100 Nick Fisk wrote:

> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> > Christian Balzer
> > Sent: 21 August 2016 09:32
> > To: ceph-users 
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > 
> > 
> > Hello,
> > 
> > On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> > 
> > > Hi Nick
> > >
> > > Interested in this comment - "-Dual sockets are probably bad and will
> > > impact performance."
> > >
> > > Have you got real world experience of this being the case?
> > >
> > Well, Nick wrote "probably".
> > 
> > Dual sockets and thus NUMA, the need for CPUs to talk to each other and 
> > share information certainly can impact things that are
> very
> > time critical.
> > How much though is a question of design, both HW and SW.
> 
> There was a guy from Redhat (sorry his name escapes me now) a few months ago 
> on the performance weekly meeting. He was analysing the
> CPU cache miss effects with Ceph and it looked like a NUMA setup was having 
> quite a severe impact on some things. To be honest a lot
> of it went over my head, but I came away from it with a general feeling that 
> if you can get the required performance from 1 socket,
> then that is probably a better bet. This includes only populating a single 
> socket in a dual socket system. There was also a Ceph
> tech talk at the start of the year (High perf databases on Ceph) where the 
> guy presenting was also recommending only populating 1
> socket for latency reasons.
> 
I wonder how complete their testing was and how much manual tuning they
tried.
As in:

1. Was irqbalance running? 
Because it and the normal kernel strategies clash beautifully.
Irqbalance moves stuff around, the kernel tries to move things close to
where the IRQs are, cat and mouse.

2. Did they try with manual IRQ pinning?
I do, not that it's critical with my Ceph nodes, but on other machines it
can make a LOT of difference. 
Like keeping the cores near (or at least on the same NUMA node) as the
network IRQs reserved for KVM vhost processes. 

3. Did they try pinning Ceph OSD processes?
While this may certainly help (and make things more predictable when the
load gets high), as I said above the kernel normally does a pretty good job
of NOT moving things around and keeping processes close to the resources
they need.
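
For reference, a rough sketch of the manual pinning in 2. and 3. (interface name, IRQ number and core lists are examples only):

systemctl stop irqbalance                   # stop it fighting the manual layout
cat /sys/class/net/eth0/device/numa_node    # which NUMA node the NIC hangs off
grep eth0 /proc/interrupts                  # find the NIC queue IRQ numbers
echo 0-5 > /proc/irq/64/smp_affinity_list   # pin IRQ 64 to cores 0-5

# and, if wanted, keep an OSD on the same node's cores
taskset -cp 0-5 $(pidof -s ceph-osd)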

> Both of those, coupled with the fact that Xeon E3's are the cheapest way to 
> get high clock speeds, sort of made my decision.
> 
Totally agreed, my current HDD node design is based on the single CPU
Supermicro 5028R-E1CR12L barebone, with an E5-1650 v3 (3.50GHz) CPU.

> > 
> > We're looking here at a case where he's trying to reduce latency by all 
> > means and where the actual CPU needs for the HDDs are
> > negligible.
> > The idea being that a "Ceph IOPS" stays on one core which is hopefully also 
> > not being shared at that time.
> > 
> > If you're looking at full SSD nodes OTOH a singe CPU may very well not be 
> > able to saturate a sensible amount of SSDs per node, so
> a
> > slight penalty but better utilization and overall IOPS with 2 CPUs may be 
> > the forward.
> 
> Definitely, as always work out what your requirements are and design around 
> them.  
> 
On my cache tier nodes with 2x E5-2623 v3 (3.00GHz) and currently 4x 800GB
DC S3610 SSDs I can already saturate all but 2 "cores", with the "right"
extreme test cases.
Normal load is of course just around 4 (out of 16) "cores".

And for the people who like it fast(er) but don't have to deal with VMware
or the likes, instead of forcing the c-state to 1 just setting the governor
to "performance" was enough in my case to halve latency (from about 2 to
1ms).

This still does save some power at times and (as Nick speculated) indeed
allows some cores to use their turbo speeds.

So the 4-5 busy cores on my cache tier nodes tend to hover around 3.3GHz,
instead of the 3.0GHz baseline for their CPUs.
And the less loaded cores don't tend to go below 2.6GHz, as opposed to the
1.2GHz that the "powersave" governor would default to.
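
A rough sketch of both knobs discussed here (the C-state parameters are the usual Intel ones; exact values depend on distro and CPU, so treat these as examples):

# switch every core's cpufreq governor to "performance"
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# cap C-states at C1 via the kernel command line (needs a reboot), e.g.:
#   intel_idle.max_cstate=1 processor.max_cstate=1
grep -o 'max_cstate=[0-9]*' /proc/cmdline   # verify after reboot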

Christian

> > 
> > Christian
> > 
> > > Thanks - B
> > >
> > > On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk  wrote:
> > > >> -Original Message-
> > > >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> > > >> Sent: 21 August 2016 04:15
> > > >> To: Nick Fisk 
> > > >> Cc: w...@globe.de; Horace Ng ; ceph-u

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk
From: Alex Gorbachev [mailto:a...@iss-integration.com] 
Sent: 21 August 2016 15:27
To: Wilhelm Redbrake 
Cc: n...@fisk.me.uk; Horace Ng ; ceph-users 

Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 



On Sunday, August 21, 2016, Wilhelm Redbrake <w...@globe.de> wrote:

Hi Nick,
i understand all of your technical improvements.
But: why do you Not use a simple for example Areca Raid Controller with 8 gb 
Cache and Bbu ontop in every ceph node.
Configure n Times RAID 0 on the Controller and enable Write back Cache.
That must be a latency "Killer" like in all the prop. Storage arrays or Not ??

Best Regards !!

 

What we saw specifically with Areca cards is that performance is excellent in 
benchmarking and for bursty loads. However, once we started loading with more 
constant workloads (we replicate databases and files to our Ceph cluster), this 
looks to have saturated the relatively small Areca NVDIMM caches and we went 
back to pure drive based performance. 

 

Yes, I think that is a valid point. Although low latency, you are still having 
to write to the disks twice (journal+data), so once the cache’s on the cards 
start filling up, you are going to hit problems.

 

 

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3 
HDDs) in hopes that it would help reduce the noisy neighbor impact. That 
worked, but now the overall latency is really high at times, not always. Red 
Hat engineer suggested this is due to loading the 7200 rpm NL-SAS drives with 
too many IOPS, which get their latency sky high. Overall we are functioning 
fine, but I sure would like storage vmotion and other large operations faster. 

 

 

Yeah this is the biggest pain point I think. Normal VM ops are fine, but if you 
ever have to move a multi-TB VM, it’s just too slow. 

 

If you use iscsi with vaai and are migrating a thick provisioned vmdk, then 
performance is actually quite good, as the block sizes used for the copy are a 
lot bigger. 

 

However, my use case required thin provisioned VM’s + snapshots and I found 
that using iscsi you have no control over the fragmentation of the vmdk’s and 
so the read performance is then what suffers (certainly with 7.2k disks)

 

Also with thin provisioned vmdk’s I think I was seeing PG contention with the 
updating of the VMFS metadata, although I can’t be sure.

 

 

I am thinking I will test a few different schedulers and readahead settings to 
see if we can improve this by parallelizing reads. Also will test NFS, but need 
to determine whether to do krbd/knfsd or something more interesting like 
CephFS/Ganesha. 

 

As you know I’m on NFS now. I’ve found it a lot easier to get going and a lot 
less sensitive to making config adjustments without suddenly everything 
dropping offline. The fact that you can specify the extent size on XFS helps 
massively with using thin vmdks/snapshots to avoid fragmentation. Storage 
v-motions are a bit faster than iscsi, but I think I am hitting PG contention 
when esxi tries to write 32 copy threads to the same object. There is probably 
some tuning that could be done here (RBD striping???) but this is the best it’s 
been for a long time and I’m reluctant to fiddle any further.

 

But as mentioned above, thick vmdk’s with vaai might be a really good fit.

 

Thanks for your very valuable info on analysis and hw build. 

 

Alex

 




On 21.08.2016 at 09:31, Nick Fisk wrote:

>> -Original Message-
>> From: Alex Gorbachev [mailto:a...@iss-integration.com  ]
>> Sent: 21 August 2016 04:15
>> To: Nick Fisk  >
>> Cc: w...@globe.de  ; Horace Ng >  >; ceph-users  >
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Hi Nick,
>>
>> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  > 
>> wrote:
>>>> -Original Message-
>>>> From: w...@globe.de   [mailto:w...@globe.de  ]
>>>> Sent: 21 July 2016 13:23
>>>> To: n...@fisk.me.uk  ; 'Horace Ng' >>>  >
>>>> Cc: ceph-users@lists.ceph.com  
>>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>>>
>>>> Okay and what is your plan now to speed up ?
>>>
>>> Now I have come up with a lower latency hardware design, there is not much 
>>> further improvement until persistent RBD caching is
>> implemented, as you will be moving the SSD/NVME closer to the client. But 
>> I'm happy with what I can achieve at the moment. You
>> could also experiment with bcache on the RBD.
>>
>> Reviving this thread, would you be willing to share the details of the low 
>> latency hardware design?  Are you optimizing for NFS or
>> iSCSI?
>
> Both really, just trying to get the write latency as low as possible, as you 
> know, vmware 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Alex Gorbachev
On Sunday, August 21, 2016, Wilhelm Redbrake  wrote:

> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8
> gb Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or
> Not ??
>
> Best Regards !!


What we saw specifically with Areca cards is that performance is excellent
in benchmarking and for bursty loads. However, once we started loading with
more constant workloads (we replicate databases and files to our Ceph
cluster), this looks to have saturated the relatively small Areca NVDIMM
caches and we went back to pure drive based performance.

So we built 8 new nodes with no Arecas, M500 SSDs for journals (1 SSD per 3
HDDs) in hopes that it would help reduce the noisy neighbor impact. That
worked, but now the overall latency is really high at times, not always.
Red Hat engineer suggested this is due to loading the 7200 rpm NL-SAS
drives with too many IOPS, which get their latency sky high. Overall we are
functioning fine, but I sure would like storage vmotion and other large
operations faster.

I am thinking I will test a few different schedulers and readahead settings
to see if we can improve this by parallelizing reads. Also will test NFS,
but need to determine whether to do krbd/knfsd or something more
interesting like CephFS/Ganesha.

Thanks for your very valuable info on analysis and hw build.

Alex


>
>
>
> On 21.08.2016 at 09:31, Nick Fisk wrote:
>
> >> -Original Message-
> >> From: Alex Gorbachev [mailto:a...@iss-integration.com ]
> >> Sent: 21 August 2016 04:15
> >> To: Nick Fisk >
> >> Cc: w...@globe.de ; Horace Ng  >; ceph-users >
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi Nick,
> >>
> >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  > wrote:
> >>>> -Original Message-
> >>>> From: w...@globe.de  [mailto:w...@globe.de ]
> >>>> Sent: 21 July 2016 13:23
> >>>> To: n...@fisk.me.uk ; 'Horace Ng'  >
> >>>> Cc: ceph-users@lists.ceph.com 
> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Okay and what is your plan now to speed up ?
> >>>
> >>> Now I have come up with a lower latency hardware design, there is not
> much further improvement until persistent RBD caching is
> >> implemented, as you will be moving the SSD/NVME closer to the client.
> But I'm happy with what I can achieve at the moment. You
> >> could also experiment with bcache on the RBD.
> >>
> >> Reviving this thread, would you be willing to share the details of the
> low latency hardware design?  Are you optimizing for NFS or
> >> iSCSI?
> >
> > Both really, just trying to get the write latency as low as possible, as
> you know, vmware does everything with lots of unbuffered small io's. Eg
> when you migrate a VM or as thin vmdk's grow.
> >
> > Even storage vmotions which might kick off 32 threads, as they all
> roughly fall on the same PG, there still appears to be a bottleneck with
> contention on the PG itself.
> >
> > These were the sort of things I was trying to optimise for, to make the
> time spent in Ceph as minimal as possible for each IO.
> >
> > So onto the hardware. Through reading various threads and experiments on
> my own I came to the following conclusions.
> >
> > -You need highest possible frequency on the CPU cores, which normally
> also means less of them.
> > -Dual sockets are probably bad and will impact performance.
> > -Use NVME's for journals to minimise latency
> >
> > The end result was OSD nodes based off of a 3.5Ghz Xeon E3v5 with an
> Intel P3700 for a journal. I used the SuperMicro X11SSH-CTF board which has
> 10G-T onboard as well as 8SATA and 8SAS, so no expansion cards required.
> Actually this design as well as being very performant for Ceph, also works
> out very cheap as you are using low end server parts. The whole lot +
> 12x7.2k disks all goes into a 1U case.
> >
> > During testing I noticed that by default c-states and p-states slaughter
> performance. After forcing max cstate to 1 and forcing the CPU frequency up
> to max, I was seeing 600us latency for a 4kb write to a 3xreplica pool, or
> around 1600IOPs, this is at QD=1.
> >
> > Few other observations:
> > 1. Power usage is around 150-200W for this config with 12x7.2k disks
> > 2. CPU u

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk


> -Original Message-
> From: Wilhelm Redbrake [mailto:w...@globe.de]
> Sent: 21 August 2016 09:34
> To: n...@fisk.me.uk
> Cc: Alex Gorbachev ; Horace Ng ; 
> ceph-users 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi Nick,
> i understand all of your technical improvements.
> But: why do you Not use a simple for example Areca Raid Controller with 8 gb 
> Cache and Bbu ontop in every ceph node.
> Configure n Times RAID 0 on the Controller and enable Write back Cache.
> That must be a latency "Killer" like in all the prop. Storage arrays or Not ??

Possibly, the latency of the NVME is very low, to the point that the "latency" 
in Ceph dwarfs it. So I'm not sure how much more improvement can be got from 
lowering journal latency further. But you are certainly correct it would help.

The other thing: if you don't use an SSD for a journal but rely on the RAID WBC, 
do you still see half the MB/s on the hard disks due to the co-located journal? 
Maybe someone can confirm?

Oh and I just looked at the price of that thing. The 16 port version is nearly 
double the price of what I paid for the 400GB NVME and that’s without adding on 
the 8GB ram and BBU. Maybe it's more suited for a full SSD cluster rather than 
spinning disks?

> 
> Best Regards !!
> 
> 
> 
> On 21.08.2016 at 09:31, Nick Fisk wrote:
> 
> >> -Original Message-
> >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> >> Sent: 21 August 2016 04:15
> >> To: Nick Fisk 
> >> Cc: w...@globe.de; Horace Ng ; ceph-users
> >> 
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi Nick,
> >>
> >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
> >>>> -Original Message-
> >>>> From: w...@globe.de [mailto:w...@globe.de]
> >>>> Sent: 21 July 2016 13:23
> >>>> To: n...@fisk.me.uk; 'Horace Ng' 
> >>>> Cc: ceph-users@lists.ceph.com
> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Okay and what is your plan now to speed up ?
> >>>
> >>> Now I have come up with a lower latency hardware design, there is
> >>> not much further improvement until persistent RBD caching is
> >> implemented, as you will be moving the SSD/NVME closer to the client.
> >> But I'm happy with what I can achieve at the moment. You could also 
> >> experiment with bcache on the RBD.
> >>
> >> Reviving this thread, would you be willing to share the details of
> >> the low latency hardware design?  Are you optimizing for NFS or iSCSI?
> >
> > Both really, just trying to get the write latency as low as possible, as 
> > you know, vmware does everything with lots of unbuffered
> small io's. Eg when you migrate a VM or as thin vmdk's grow.
> >
> > Even storage vmotions which might kick off 32 threads, as they all roughly 
> > fall on the same PG, there still appears to be a bottleneck
> with contention on the PG itself.
> >
> > These were the sort of things I was trying to optimise for, to make the 
> > time spent in Ceph as minimal as possible for each IO.
> >
> > So onto the hardware. Through reading various threads and experiments on my 
> > own I came to the following conclusions.
> >
> > -You need highest possible frequency on the CPU cores, which normally also 
> > means less of them.
> > -Dual sockets are probably bad and will impact performance.
> > -Use NVME's for journals to minimise latency
> >
> > The end result was OSD nodes based off of a 3.5Ghz Xeon E3v5 with an Intel 
> > P3700 for a journal. I used the SuperMicro X11SSH-CTF
> board which has 10G-T onboard as well as 8SATA and 8SAS, so no expansion 
> cards required. Actually this design as well as being very
> performant for Ceph, also works out very cheap as you are using low end 
> server parts. The whole lot + 12x7.2k disks all goes into a 1U
> case.
> >
> > During testing I noticed that by default c-states and p-states slaughter 
> > performance. After forcing max cstate to 1 and forcing the
> CPU frequency up to max, I was seeing 600us latency for a 4kb write to a 
> 3xreplica pool, or around 1600IOPs, this is at QD=1.
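
For anyone wanting to reproduce that kind of QD=1 number against their own pool, a
minimal fio sketch is below (it assumes fio was built with the rbd engine and that a
throwaway image called "testimg" exists in a pool called "rbd" - adjust the names,
none of them come from the setup described above):

  # 4k writes, one job, queue depth 1, straight through librbd
  fio --name=qd1-4k-write --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=testimg --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
      --time_based --runtime=60
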
> >
> > Few other observations:
> > 1. Power usage is around 150-200W for this config with 12x7.2k disks
> > 2. CPU usage maxing out disks, is only around 10-15%, so plenty of headroom 
> > for more disks.
> > 3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Christian Balzer
> Sent: 21 August 2016 09:32
> To: ceph-users 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> 
> Hello,
> 
> On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:
> 
> > Hi Nick
> >
> > Interested in this comment - "-Dual sockets are probably bad and will
> > impact performance."
> >
> > Have you got real world experience of this being the case?
> >
> Well, Nick wrote "probably".
> 
> Dual sockets and thus NUMA, the need for CPUs to talk to each other and share 
> information, certainly can impact things that are very time critical.
> How much though is a question of design, both HW and SW.

There was a guy from Redhat (sorry, his name escapes me now) a few months ago on 
the performance weekly meeting. He was analysing the CPU cache miss effects with 
Ceph, and it looked like a NUMA setup was having quite a severe impact on some 
things. To be honest a lot of it went over my head, but I came away from it with 
a general feeling that if you can get the required performance from one socket, 
then that is probably a better bet. This includes only populating a single socket 
in a dual-socket system. There was also a Ceph tech talk at the start of the year 
(high-performance databases on Ceph) where the presenter was also recommending 
only populating one socket for latency reasons.

Both of those, coupled with the fact that Xeon E3's are the cheapest way to get 
high clock speeds, sort of made my decision.
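
If you do end up on a dual-socket box, one way to approximate the single-socket
behaviour is to pin each OSD to one NUMA node. A rough sketch only (the OSD id and
node number are made up, and in practice you would override the systemd unit rather
than start the daemon by hand):

  # keep osd.3's threads and memory allocations on NUMA node 0
  numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -f --cluster ceph \
      --id 3 --setuser ceph --setgroup ceph
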

> 
> We're looking here at a case where he's trying to reduce latency by all means 
> and where the actual CPU needs for the HDDs are
> negligible.
> The idea being that a "Ceph IOPS" stays on one core which is hopefully also 
> not being shared at that time.
> 
> If you're looking at full SSD nodes OTOH a single CPU may very well not be 
> able to saturate a sensible number of SSDs per node, so a slight penalty but 
> better utilization and overall IOPS with 2 CPUs may be the way forward.

Definitely, as always work out what your requirements are and design around 
them.  

> 
> Christian
> 
> > Thanks - B
> >
> > On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk  wrote:
> > >> -Original Message-----
> > >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> > >> Sent: 21 August 2016 04:15
> > >> To: Nick Fisk 
> > >> Cc: w...@globe.de; Horace Ng ; ceph-users
> > >> 
> > >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> > >>
> > >> Hi Nick,
> > >>
> > >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
> > >> >> -Original Message-
> > >> >> From: w...@globe.de [mailto:w...@globe.de]
> > >> >> Sent: 21 July 2016 13:23
> > >> >> To: n...@fisk.me.uk; 'Horace Ng' 
> > >> >> Cc: ceph-users@lists.ceph.com
> > >> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread
> > >> >> Performance
> > >> >>
> > >> >> Okay and what is your plan now to speed up ?
> > >> >
> > >> > Now I have come up with a lower latency hardware design, there is
> > >> > not much further improvement until persistent RBD caching is
> > >> implemented, as you will be moving the SSD/NVME closer to the
> > >> client. But I'm happy with what I can achieve at the moment. You could 
> > >> also experiment with bcache on the RBD.
> > >>
> > >> Reviving this thread, would you be willing to share the details of
> > >> the low latency hardware design?  Are you optimizing for NFS or iSCSI?
> > >
> > > Both really, just trying to get the write latency as low as possible, as 
> > > you know, vmware does everything with lots of
unbuffered
> small io's. Eg when you migrate a VM or as thin vmdk's grow.
> > >
> > > Even storage vmotions which might kick off 32 threads, as they all 
> > > roughly fall on the same PG, there still appears to be a
> bottleneck with contention on the PG itself.
> > >
> > > These were the sort of things I was trying to optimise for, to make the 
> > > time spent in Ceph as minimal as possible for each IO.
> > >
> > > So onto the hardware. Through reading various threads and experiments on 
> > > my own I came to the following conclusions.
> > >
> > > -You need highest possible frequency on the CPU cores, which normally also 
> > > means less of them.

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Christian Balzer

Hello,

On Sun, 21 Aug 2016 09:16:33 +0100 Brian :: wrote:

> Hi Nick
> 
> Interested in this comment - "-Dual sockets are probably bad and will
> impact performance."
> 
> Have you got real world experience of this being the case?
> 
Well, Nick wrote "probably".

Dual sockets and thus NUMA, the need for CPUs to talk to each other and
share information certainly can impact things that are very time critical.
How much though is a question of design, both HW and SW.

We're looking here at a case where he's trying to reduce latency by all
means and where the actual CPU needs for the HDDs are negligible.
The idea being that a "Ceph IOPS" stays on one core which is hopefully
also not being shared at that time.

If you're looking at full SSD nodes OTOH a single CPU may very well not be
able to saturate a sensible number of SSDs per node, so a slight penalty
but better utilization and overall IOPS with 2 CPUs may be the way forward.

Christian

> Thanks - B
> 
> On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk  wrote:
> >> -Original Message-
> >> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> >> Sent: 21 August 2016 04:15
> >> To: Nick Fisk 
> >> Cc: w...@globe.de; Horace Ng ; ceph-users 
> >> 
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi Nick,
> >>
> >> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
> >> >> -Original Message-----
> >> >> From: w...@globe.de [mailto:w...@globe.de]
> >> >> Sent: 21 July 2016 13:23
> >> >> To: n...@fisk.me.uk; 'Horace Ng' 
> >> >> Cc: ceph-users@lists.ceph.com
> >> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >> >>
> >> >> Okay and what is your plan now to speed up ?
> >> >
> >> > Now I have come up with a lower latency hardware design, there is not 
> >> > much further improvement until persistent RBD caching is
> >> implemented, as you will be moving the SSD/NVME closer to the client. But 
> >> I'm happy with what I can achieve at the moment. You
> >> could also experiment with bcache on the RBD.
> >>
> >> Reviving this thread, would you be willing to share the details of the low 
> >> latency hardware design?  Are you optimizing for NFS or
> >> iSCSI?
> >
> > Both really, just trying to get the write latency as low as possible, as 
> > you know, vmware does everything with lots of unbuffered small io's. Eg 
> > when you migrate a VM or as thin vmdk's grow.
> >
> > Even storage vmotions which might kick off 32 threads, as they all roughly 
> > fall on the same PG, there still appears to be a bottleneck with contention 
> > on the PG itself.
> >
> > These were the sort of things I was trying to optimise for, to make the 
> > time spent in Ceph as minimal as possible for each IO.
> >
> > So onto the hardware. Through reading various threads and experiments on my 
> > own I came to the following conclusions.
> >
> > -You need highest possible frequency on the CPU cores, which normally also 
> > means less of them.
> > -Dual sockets are probably bad and will impact performance.
> > -Use NVME's for journals to minimise latency
> >
> > The end result was OSD nodes based off of a 3.5Ghz Xeon E3v5 with an Intel 
> > P3700 for a journal. I used the SuperMicro X11SSH-CTF board which has 10G-T 
> > onboard as well as 8SATA and 8SAS, so no expansion cards required. Actually 
> > this design as well as being very performant for Ceph, also works out very 
> > cheap as you are using low end server parts. The whole lot + 12x7.2k disks 
> > all goes into a 1U case.
> >
> > During testing I noticed that by default c-states and p-states slaughter 
> > performance. After forcing max cstate to 1 and forcing the CPU frequency up 
> > to max, I was seeing 600us latency for a 4kb write to a 3xreplica pool, or 
> > around 1600IOPs, this is at QD=1.
> >
> > Few other observations:
> > 1. Power usage is around 150-200W for this config with 12x7.2k disks
> > 2. CPU usage maxing out disks, is only around 10-15%, so plenty of headroom 
> > for more disks.
> > 3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage
> > 4. No idea about CPU load for pure SSD nodes, but based on the current 
> > disks, you could maybe expect ~1iops per node, before maxing out CPU's
> > 5. Single NVME seems to be able to journal 12 disks with no problem during 
> > normal operation, no doubt a specific benchmark could max it out though.

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Brian ::
Hi Nick

Interested in this comment - "-Dual sockets are probably bad and will
impact performance."

Have you got real world experience of this being the case?

Thanks - B

On Sun, Aug 21, 2016 at 8:31 AM, Nick Fisk  wrote:
>> -Original Message-
>> From: Alex Gorbachev [mailto:a...@iss-integration.com]
>> Sent: 21 August 2016 04:15
>> To: Nick Fisk 
>> Cc: w...@globe.de; Horace Ng ; ceph-users 
>> 
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Hi Nick,
>>
>> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
>> >> -Original Message-
>> >> From: w...@globe.de [mailto:w...@globe.de]
>> >> Sent: 21 July 2016 13:23
>> >> To: n...@fisk.me.uk; 'Horace Ng' 
>> >> Cc: ceph-users@lists.ceph.com
>> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>> >>
>> >> Okay and what is your plan now to speed up ?
>> >
>> > Now I have come up with a lower latency hardware design, there is not much 
>> > further improvement until persistent RBD caching is
>> implemented, as you will be moving the SSD/NVME closer to the client. But 
>> I'm happy with what I can achieve at the moment. You
>> could also experiment with bcache on the RBD.
>>
>> Reviving this thread, would you be willing to share the details of the low 
>> latency hardware design?  Are you optimizing for NFS or
>> iSCSI?
>
> Both really, just trying to get the write latency as low as possible, as you 
> know, vmware does everything with lots of unbuffered small io's. Eg when you 
> migrate a VM or as thin vmdk's grow.
>
> Even storage vmotions which might kick off 32 threads, as they all roughly 
> fall on the same PG, there still appears to be a bottleneck with contention 
> on the PG itself.
>
> These were the sort of things I was trying to optimise for, to make the time 
> spent in Ceph as minimal as possible for each IO.
>
> So onto the hardware. Through reading various threads and experiments on my 
> own I came to the following conclusions.
>
> -You need highest possible frequency on the CPU cores, which normally also 
> means less of them.
> -Dual sockets are probably bad and will impact performance.
> -Use NVME's for journals to minimise latency
>
> The end result was OSD nodes based off of a 3.5Ghz Xeon E3v5 with an Intel 
> P3700 for a journal. I used the SuperMicro X11SSH-CTF board which has 10G-T 
> onboard as well as 8SATA and 8SAS, so no expansion cards required. Actually 
> this design as well as being very performant for Ceph, also works out very 
> cheap as you are using low end server parts. The whole lot + 12x7.2k disks 
> all goes into a 1U case.
>
> During testing I noticed that by default c-states and p-states slaughter 
> performance. After forcing max cstate to 1 and forcing the CPU frequency up 
> to max, I was seeing 600us latency for a 4kb write to a 3xreplica pool, or 
> around 1600IOPs, this is at QD=1.
>
> Few other observations:
> 1. Power usage is around 150-200W for this config with 12x7.2k disks
> 2. CPU usage maxing out disks, is only around 10-15%, so plenty of headroom 
> for more disks.
> 3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage
> 4. No idea about CPU load for pure SSD nodes, but based on the current disks, 
> you could maybe expect ~1iops per node, before maxing out CPU's
> 5. Single NVME seems to be able to journal 12 disks with no problem during 
> normal operation, no doubt a specific benchmark could max it out though.
> 6. There are slightly faster Xeon E3's, but price/performance = diminishing 
> returns
>
> Hope that answers all your questions.
> Nick
>
>>
>> Thank you,
>> Alex
>>
>> >
>> >>
>> >> Would it help to put in multiple P3700 per OSD Node to improve 
>> >> performance for a single Thread (example Storage VMotion) ?
>> >
>> > Most likely not, it's all the other parts of the puzzle which are causing 
>> > the latency. ESXi was designed for storage arrays that service
>> IO's in 100us-1ms range, Ceph is probably about 10x slower than this, hence 
>> the problem. Disable the BBWC on a RAID controller or
>> SAN and you will see the same behaviour.
>> >
>> >>
>> >> Regards
>> >>
>> >>
>> >> On 21.07.16 at 14:17, Nick Fisk wrote:
>> >> >> -Original Message-
>> >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>> >> >> Behalf Of w...

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-21 Thread Nick Fisk
> -Original Message-
> From: Alex Gorbachev [mailto:a...@iss-integration.com]
> Sent: 21 August 2016 04:15
> To: Nick Fisk 
> Cc: w...@globe.de; Horace Ng ; ceph-users 
> 
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi Nick,
> 
> On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
> >> -Original Message-
> >> From: w...@globe.de [mailto:w...@globe.de]
> >> Sent: 21 July 2016 13:23
> >> To: n...@fisk.me.uk; 'Horace Ng' 
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Okay and what is your plan now to speed up ?
> >
> > Now I have come up with a lower latency hardware design, there is not much 
> > further improvement until persistent RBD caching is
> implemented, as you will be moving the SSD/NVME closer to the client. But I'm 
> happy with what I can achieve at the moment. You
> could also experiment with bcache on the RBD.
> 
> Reviving this thread, would you be willing to share the details of the low 
> latency hardware design?  Are you optimizing for NFS or
> iSCSI?

Both really; just trying to get the write latency as low as possible. As you 
know, VMware does everything with lots of unbuffered small IOs, e.g. when you 
migrate a VM or as thin VMDKs grow.

Even with storage vMotions, which might kick off 32 threads, since they all roughly 
fall on the same PG there still appears to be a bottleneck with contention on the 
PG itself. 

These were the sort of things I was trying to optimise for, to make the time 
spent in Ceph as minimal as possible for each IO.

So onto the hardware. Through reading various threads and experiments on my own 
I came to the following conclusions. 

-You need highest possible frequency on the CPU cores, which normally also 
means less of them. 
-Dual sockets are probably bad and will impact performance.
-Use NVME's for journals to minimise latency

The end result was OSD nodes based on a 3.5 GHz Xeon E3v5 with an Intel 
P3700 for a journal. I used the SuperMicro X11SSH-CTF board, which has 10G-T 
onboard as well as 8x SATA and 8x SAS, so no expansion cards are required. As well 
as being very performant for Ceph, this design also works out very cheap, as you 
are using low-end server parts. The whole lot + 12x 7.2k disks all goes into a 1U 
case.

During testing I noticed that by default c-states and p-states slaughter 
performance. After forcing the maximum C-state to 1 and forcing the CPU frequency up 
to max, I was seeing 600us latency for a 4kB write to a 3x-replica pool, or around 
1600 IOPS; this is at QD=1.
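
As a rough illustration of that tuning (a minimal sketch only; it assumes the
cpupower utility is installed, and the exact knobs vary by distro and kernel):

  # run every core at its maximum frequency
  cpupower frequency-set -g performance

  # disable idle states with an exit latency above the given number of microseconds
  cpupower idle-set -D 2

  # for a persistent "max C-state = 1", the equivalent kernel command line is:
  #   intel_idle.max_cstate=1 processor.max_cstate=1
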

Few other observations:
1. Power usage is around 150-200W for this config with 12x7.2k disks
2. CPU usage when maxing out the disks is only around 10-15%, so there is plenty of 
headroom for more disks.
3. NOTE FOR ABOVE: Don't include iowait when looking at CPU usage
4. No idea about CPU load for pure SSD nodes, but based on the current disks, 
you could maybe expect ~1iops per node, before maxing out CPU's
5. Single NVME seems to be able to journal 12 disks with no problem during 
normal operation, no doubt a specific benchmark could max it out though.
6. There are slightly faster Xeon E3's, but price/performance = diminishing 
returns

Hope that answers all your questions.
Nick

> 
> Thank you,
> Alex
> 
> >
> >>
> >> Would it help to put in multiple P3700 per OSD Node to improve performance 
> >> for a single Thread (example Storage VMotion) ?
> >
> > Most likely not, it's all the other parts of the puzzle which are causing 
> > the latency. ESXi was designed for storage arrays that service
> IO's in 100us-1ms range, Ceph is probably about 10x slower than this, hence 
> the problem. Disable the BBWC on a RAID controller or
> SAN and you will see the same behaviour.
> >
> >>
> >> Regards
> >>
> >>
> >> On 21.07.16 at 14:17, Nick Fisk wrote:
> >> >> -Original Message-
> >> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >> >> Behalf Of w...@globe.de
> >> >> Sent: 21 July 2016 13:04
> >> >> To: n...@fisk.me.uk; 'Horace Ng' 
> >> >> Cc: ceph-users@lists.ceph.com
> >> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread
> >> >> Performance
> >> >>
> >> >> Hi,
> >> >>
> >> >> hmm i think 200 MByte/s is really bad. Is your Cluster in production 
> >> >> right now?
> >> > It's just been built, not running yet.
> >> >
> >> >> So if you start a storage migration you get only 200 MByte/s right?

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-08-20 Thread Alex Gorbachev
Hi Nick,

On Thu, Jul 21, 2016 at 8:33 AM, Nick Fisk  wrote:
>> -Original Message-
>> From: w...@globe.de [mailto:w...@globe.de]
>> Sent: 21 July 2016 13:23
>> To: n...@fisk.me.uk; 'Horace Ng' 
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>>
>> Okay and what is your plan now to speed up ?
>
> Now I have come up with a lower latency hardware design, there is not much 
> further improvement until persistent RBD caching is implemented, as you will 
> be moving the SSD/NVME closer to the client. But I'm happy with what I can 
> achieve at the moment. You could also experiment with bcache on the RBD.

Reviving this thread, would you be willing to share the details of the
low latency hardware design?  Are you optimizing for NFS or iSCSI?

Thank you,
Alex

>
>>
>> Would it help to put in multiple P3700 per OSD Node to improve performance 
>> for a single Thread (example Storage VMotion) ?
>
> Most likely not, it's all the other parts of the puzzle which are causing the 
> latency. ESXi was designed for storage arrays that service IO's in 100us-1ms 
> range, Ceph is probably about 10x slower than this, hence the problem. 
> Disable the BBWC on a RAID controller or SAN and you will see the same behaviour.
>
>>
>> Regards
>>
>>
>> On 21.07.16 at 14:17, Nick Fisk wrote:
>> >> -Original Message-
>> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> >> Of w...@globe.de
>> >> Sent: 21 July 2016 13:04
>> >> To: n...@fisk.me.uk; 'Horace Ng' 
>> >> Cc: ceph-users@lists.ceph.com
>> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>> >>
>> >> Hi,
>> >>
>> >> Hmm, I think 200 MByte/s is really bad. Is your cluster in production 
>> >> right now?
>> > It's just been built, not running yet.
>> >
>> >> So if you start a storage migration you get only 200 MByte/s right?
>> > I wish. My current cluster (not this new one) would storage migrate at
>> > ~10-15MB/s. Serial latency is the problem: without being able to
>> > buffer, ESXi waits on an ack for each IO before sending the next. Also it 
>> > submits the migrations in 64kb chunks, unless you get VAAI
>> > working. I think ESXi will try and do them in parallel, which will help as 
>> > well.
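
For what it's worth, you can check from the ESXi side whether the VAAI offloads are
actually in effect for a given LUN; the device identifier below is only a placeholder:

  # reports ATS, Clone/XCOPY, Zero and Delete status per device
  esxcli storage core device vaai status get -d naa.xxxxxxxxxxxxxxxx
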
>> >
>> >> I think it would be awesome if you get 1000 MByte/s
>> >>
>> >> Where is the Bottleneck?
>> > Latency serialisation, without a buffer, you can't drive the devices
>> > to 100%. With buffered IO (or high queue depths) I can max out the 
>> > journals.
>> >
>> >> A FIO test from Sebastien Han gives us 400 MByte/s raw performance from 
>> >> the P3700.
>> >>
>> >> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
>> >> -ssd-is-suitable-as-a-journal-device/
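
The test from that post boils down to a single-job sync write run along these lines
(reconstructed from memory, so check the post itself). Note that it writes directly
to the device, so only point it at a disk or partition whose contents you can
destroy, and raise numjobs to see how the device scales:

  fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
      --name=journal-test
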
>> >>
>> >> How could it be that the rbd client performance is 50% slower?
>> >>
>> >> Regards
>> >>
>> >>
>> >> On 21.07.16 at 12:15, Nick Fisk wrote:
>> >>> I've had a lot of pain with this, smaller block sizes are even worse.
>> >>> You want to try and minimize latency at every point as there is no
>> >>> buffering happening in the iSCSI stack. This means:-
>> >>>
>> >>> 1. Fast journals (NVME or NVRAM)
>> >>> 2. 10GB or better networking
>> >>> 3. Fast CPU's (Ghz)
>> >>> 4. Fix CPU c-state's to C1
>> >>> 5. Fix CPU's Freq to max
>> >>>
>> >>> Also I can't be sure, but I think there is a metadata update
>> >>> happening with VMFS, particularly if you are using thin VMDK's, this
>> >>> can also be a major bottleneck. For my use case, I've switched over to 
>> >>> NFS as it has given much more performance at scale and
>> less headache.
>> >>>
>> >>> For the RADOS Run, here you go (400GB P3700):
>> >>>
>> >>> Total time run: 60.026491
>> >>> Total writes made:  3104
>> >>> Write size: 4194304
>> >>> Object size:4194304
>> >>> Bandwidth (MB/sec): 206.842
>> >>> Stddev Bandwidth:   8.10412
>> >>> Max bandwidth (MB/sec): 224
>> >>> M
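
For reference, output in that form comes from rados bench. A run against a test pool
looks something like the one below; the concurrency behind the numbers above isn't
shown, so -t 1 here is only an assumption to match the single-thread theme:

  rados bench -p rbd 60 write -t 1 --no-cleanup
  # remove the benchmark objects afterwards
  rados -p rbd cleanup
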

Re: [ceph-users] ceph + vmware

2016-07-26 Thread Jake Young
On Thursday, July 21, 2016, Mike Christie  wrote:

> On 07/21/2016 11:41 AM, Mike Christie wrote:
> > On 07/20/2016 02:20 PM, Jake Young wrote:
> >>
> >> For starters, STGT doesn't implement VAAI properly and you will need to
> >> disable VAAI in ESXi.
> >>
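
Disabling the VAAI primitives host-wide on ESXi is done through the advanced
settings, for example:

  esxcli system settings advanced set -o /DataMover/HardwareAcceleratedMove -i 0
  esxcli system settings advanced set -o /DataMover/HardwareAcceleratedInit -i 0
  esxcli system settings advanced set -o /VMFS3/HardwareAcceleratedLocking -i 0
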
> >> LIO does seem to implement VAAI properly, but performance is not nearly
> >> as good as STGT even with VAAI's benefits. The assumption for the cause
> >> is that LIO currently uses kernel rbd mapping and kernel rbd performance
> >> is not as good as librbd.
> >>
> >> I recently did a simple test of creating an 80GB eager zeroed disk with
> >> STGT (VAAI disabled, no rbd client cache) and LIO (VAAI enabled) and
> >> found that STGT was actually slightly faster.
> >>
> >> I think we're all holding our breath waiting for LIO librbd support via
> >> TCMU, which seems to be right around the corner. That solution will
> >
> > Is there a thread for that?


Not a thread, but it has come up a few times...  Maybe I'm getting ahead of
myself. I can't wait for this solution to be available.


> >
> >> combine the performance benefits of librbd with the more feature-full
> >> LIO iSCSI interface. The lrbd configuration tool for LIO from SUSE is
> >> pretty cool and it makes configuring LIO easier than STGT.
> >>
> >
> > I wrote a tcmu rbd driver a while back. It is based on gpl2 code, so
> > Andy could not take it into tcmu. I attached it here if you want to play
> > with it.
> >
>
> Here it is attached in patch form built against the current tcmu code.
>
> I have not tested it since March, so if there have been major changes to
> the tcmu code there might be issues.
>
> You should only use this for testing. I wrote it up in a night. I have
> done very little testing.
>
> It only supports READ, WRITE, DISCARD/UNMAP, TUR, MODE_SENSE/SELECT, and
> SYNC_CACHE.
>

Thanks for this!  I was able to patch and compile without errors.

I'm having trouble using it though. Does it require targetcli-fb?  This
should show up as a "User: rbd" backstore, right?
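
If it helps anyone else experimenting: with targetcli-fb and a tcmu-runner build that
includes the rbd handler, creating the backstore should look roughly like the lines
below. The pool and image names are made up, and the exact cfgstring format has
varied between tcmu-runner versions, so treat this purely as a sketch:

  targetcli /backstores/user:rbd create name=vmware01 size=100G cfgstring=rbd/vmware01
  targetcli /backstores/user:rbd ls
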
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr] 
Sent: 22 July 2016 15:13
To: n...@fisk.me.uk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

On 22/07/2016 14:10, Nick Fisk wrote:

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 11:19
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Jake Young'  
<mailto:jak3...@gmail.com> ; 'Jan Schermer'  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

On 22/07/2016 11:48, Nick Fisk wrote:

 

 

From: Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr] 
Sent: 22 July 2016 10:40
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Jake Young'  
<mailto:jak3...@gmail.com> ; 'Jan Schermer'  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

On 22/07/2016 10:23, Nick Fisk wrote:

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 09:10
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Jake Young'  
<mailto:jak3...@gmail.com> ; 'Jan Schermer'  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

On 22/07/2016 09:47, Nick Fisk wrote:

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young  <mailto:jak3...@gmail.com> ; Jan Schermer  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

On 20/07/2016 21:20, Jake Young wrote:



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz> > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie  <mailto:mchri...@redhat.com> > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access the same LUNs through both targets,
>> but in a failover manner, so that each ESXi always accesses the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
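
On the ESXi side, that failover-style layout is typically achieved by leaving the LUN
on the fixed path selection policy and pinning a preferred path per host, along the
lines of the following (the device and path identifiers are placeholders):

  esxcli storage nmp path list --device naa.xxxxxxxxxxxxxxxx
  esxcli storage nmp device set --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_FIXED
  esxcli storage nmp psp fixed deviceconfig set --device naa.xxxxxxxxxxxxxxxx --path vmhba64:C0:T0:L0
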
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use an application that uses SCSI persistent reservations then you
> could run into trouble, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>
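
The fencing step described above maps onto the monitor blacklist commands, roughly as
follows; the address is only an example, and it is worth checking how your Ceph
release matches blacklist entries before relying on this:

  # fence the old active target node so stuck IO from it can never be applied
  ceph osd blacklist add 192.0.2.10:0/0
  ceph osd blacklist ls
  # only once IO on that node has been flushed and failed:
  ceph osd blacklist rm 192.0.2.10:0/0
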

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware does only use synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as needing to be synchronous. 

 

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't know what in-flight 
IO happened before the outage
and which didn't. You could end up with only part of the "transaction" written on 
persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover 

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



Le 22/07/2016 14:10, Nick Fisk a écrit :


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *Frédéric Nass

*Sent:* 22 July 2016 11:19
*To:* n...@fisk.me.uk; 'Jake Young' ; 'Jan 
Schermer' 

*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 11:48, Nick Fisk a écrit :

*From:*Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr]
*Sent:* 22 July 2016 10:40
*To:* n...@fisk.me.uk <mailto:n...@fisk.me.uk>; 'Jake Young'
 <mailto:jak3...@gmail.com>; 'Jan Schermer'
 <mailto:j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 10:23, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
*On Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 09:10
*To:* n...@fisk.me.uk <mailto:n...@fisk.me.uk>; 'Jake Young'
 <mailto:jak3...@gmail.com>; 'Jan Schermer'
 <mailto:j...@schermer.cz>
    *Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 09:47, Nick Fisk a écrit :

*From:*ceph-users
[mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of
*Frédéric Nass
*Sent:* 22 July 2016 08:11
*To:* Jake Young 
<mailto:jak3...@gmail.com>; Jan Schermer 
<mailto:j...@schermer.cz>
        *Cc:* ceph-users@lists.ceph.com
<mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer
mailto:j...@schermer.cz>> wrote:


> On 20 Jul 2016, at 18:38, Mike Christie
mailto:mchri...@redhat.com>>
wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with
VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non
HA support though.
>
>>
>> Knowing that HA iSCSI target was on the
roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS
targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD
images. Each iSCSI target
>> has all VAAI primitives enabled and run the
same configuration.
>> - RBD images are mapped on each target using
the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs
through both targets,
>> but in a failover manner so that each ESXi
always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives
are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
configuration ?
>
> If you use a application that uses scsi
persistent reservations then you
> could run into troubles, because some apps
expect the reservation info
> to be on the failover nodes as well as the
active ones.
>
> Depending on the how you do failover and the
issue that caused the
> failover, IO could be stuck on the old active
node and cause data
> corruption. If the initial active node looses
its network connectivity
> and you failover, you have to make sure that the
in

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 11:19
To: n...@fisk.me.uk; 'Jake Young' ; 'Jan Schermer' 

Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 11:48, Nick Fisk a écrit :

 

 

From: Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr] 
Sent: 22 July 2016 10:40
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Jake Young'  
<mailto:jak3...@gmail.com> ; 'Jan Schermer'  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 10:23, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 09:10
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Jake Young'  
<mailto:jak3...@gmail.com> ; 'Jan Schermer'  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 09:47, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young  <mailto:jak3...@gmail.com> ; Jan Schermer  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz> > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie  <mailto:mchri...@redhat.com> > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>> but in a failover manner so that each ESXi always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use a application that uses scsi persistent reservations then you
> could run into troubles, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node looses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware does only use synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as needing to be synchronous. 

 

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't know what in-flight 
IO happened before the outage
and which didn't. You could end with only part of the "transaction" written on 
persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images on a gallery 
site could be fine but corrupting
a large database tablespace is no fun at all.

 

No, it's not. VMFS corruption is pretty bad too and there is no fsck for VMFS...

 


Some (non opensource) solutions exist, Solaris supposedly does this in some(?) 
way, maybe some iSCSI guru
can chime tell us what ma

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



Le 22/07/2016 11:48, Nick Fisk a écrit :


*From:*Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr]
*Sent:* 22 July 2016 10:40
*To:* n...@fisk.me.uk; 'Jake Young' ; 'Jan 
Schermer' 

*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 10:23, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 09:10
*To:* n...@fisk.me.uk <mailto:n...@fisk.me.uk>; 'Jake Young'
 <mailto:jak3...@gmail.com>; 'Jan Schermer'
 <mailto:j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 09:47, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
*On Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 08:11
*To:* Jake Young 
<mailto:jak3...@gmail.com>; Jan Schermer 
<mailto:j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
*Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz>> wrote:


> On 20 Jul 2016, at 18:38, Mike Christie
mailto:mchri...@redhat.com>> wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare
ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA
support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we
chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets
when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD
images. Each iSCSI target
>> has all VAAI primitives enabled and run the same
configuration.
>> - RBD images are mapped on each target using the
kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs
through both targets,
>> but in a failover manner so that each ESXi always
access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are
enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use a application that uses scsi persistent
reservations then you
> could run into troubles, because some apps expect
the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue
that caused the
> failover, IO could be stuck on the old active node
and cause data
> corruption. If the initial active node looses its
network connectivity
> and you failover, you have to make sure that the
initial active node is
> fenced off and IO stuck on that node will never be
executed. So do
> something like add it to the ceph monitor blacklist
and make sure IO on
> that node is flushed and failed before
unblacklisting it.
>

With iSCSI you can't really do hot failover unless you
only use synchronous IO.

VMware does only use synchronous IO. Since the hypervisor
can't tell what type of data the VMs are writing, all IO
is treated as needing to be synchronous.

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you
don't know what in-flight IO happened before the outage
and which di

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: Frédéric Nass [mailto:frederic.n...@univ-lorraine.fr] 
Sent: 22 July 2016 10:40
To: n...@fisk.me.uk; 'Jake Young' ; 'Jan Schermer' 

Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 10:23, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 09:10
To: n...@fisk.me.uk <mailto:n...@fisk.me.uk> ; 'Jake Young'  
<mailto:jak3...@gmail.com> ; 'Jan Schermer'  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 09:47, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young  <mailto:jak3...@gmail.com> ; Jan Schermer  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz> > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie  <mailto:mchri...@redhat.com> > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>> but in a failover manner so that each ESXi always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use a application that uses scsi persistent reservations then you
> could run into troubles, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node looses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware does only use synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as needing to be synchronous. 

 

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't know what in-flight 
IO happened before the outage
and which didn't. You could end with only part of the "transaction" written on 
persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images on a gallery 
site could be fine but corrupting
a large database tablespace is no fun at all.

 

No, it's not. VMFS corruption is pretty bad too and there is no fsck for VMFS...

 


Some (non opensource) solutions exist, Solaris supposedly does this in some(?) 
way, maybe some iSCSI guru
can chime tell us what magic they do, but I don't think it's possible without 
client support
(you essentialy have to do something like transactions and replay the last 
transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO synchronous or make it 
at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring between the targets) 
without making it synchronous all the way.

 

This is why th

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



Le 22/07/2016 10:23, Nick Fisk a écrit :


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *Frédéric Nass

*Sent:* 22 July 2016 09:10
*To:* n...@fisk.me.uk; 'Jake Young' ; 'Jan 
Schermer' 

*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 22/07/2016 09:47, Nick Fisk a écrit :

*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
Behalf Of *Frédéric Nass
*Sent:* 22 July 2016 08:11
*To:* Jake Young  <mailto:jak3...@gmail.com>;
Jan Schermer  <mailto:j...@schermer.cz>
*Cc:* ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
    *Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz>> wrote:


> On 20 Jul 2016, at 18:38, Mike Christie
mailto:mchri...@redhat.com>> wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare
ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA
support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we
chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when
it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images.
Each iSCSI target
>> has all VAAI primitives enabled and run the same
configuration.
>> - RBD images are mapped on each target using the kernel
client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through
both targets,
>> but in a failover manner so that each ESXi always
access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are
enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use a application that uses scsi persistent
reservations then you
> could run into troubles, because some apps expect the
reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that
caused the
> failover, IO could be stuck on the old active node and
cause data
> corruption. If the initial active node looses its
network connectivity
> and you failover, you have to make sure that the initial
active node is
> fenced off and IO stuck on that node will never be
executed. So do
> something like add it to the ceph monitor blacklist and
make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you
only use synchronous IO.

VMware does only use synchronous IO. Since the hypervisor
can't tell what type of data the VMs are writing, all IO is
treated as needing to be synchronous.

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't
know what in-flight IO happened before the outage
and which didn't. You could end with only part of the
"transaction" written on persistent storage.

If you only use synchronous IO all the way from client to
the persistent storage shared between
iSCSI target then all should be fine, otherwise YMMV -
some people run it like that without realizing
the dangers and have never had a problem, so it may be
strictly theoretical, and it all depends on how often you
need to do the
failover and what data you are storing - corrupting a few
images on a gallery site could be fine but corrupting
a large database tablespace is no fun at all.

No, it's not. VMFS corruption is pretty bad too and there i

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 09:10
To: n...@fisk.me.uk; 'Jake Young' ; 'Jan Schermer' 

Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 22/07/2016 09:47, Nick Fisk a écrit :

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young  <mailto:jak3...@gmail.com> ; Jan Schermer  
<mailto:j...@schermer.cz> 
Cc: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] ceph + vmware

 

 

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz> > wrote:


> On 20 Jul 2016, at 18:38, Mike Christie  <mailto:mchri...@redhat.com> > wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>> but in a failover manner so that each ESXi always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use a application that uses scsi persistent reservations then you
> could run into troubles, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node looses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware does only use synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as
needing to be synchronous. 

 

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't know what in-flight 
IO happened before the outage
and which didn't. You could end with only part of the "transaction" written on 
persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images on a gallery 
site could be fine but corrupting
a large database tablespace is no fun at all.

 

No, it's not. VMFS corruption is pretty bad too and there is no fsck for VMFS...

 


Some (non opensource) solutions exist, Solaris supposedly does this in some(?) 
way, maybe some iSCSI guru
can chime tell us what magic they do, but I don't think it's possible without 
client support
(you essentialy have to do something like transactions and replay the last 
transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO synchronous or make it 
at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring between the targets) 
without making it synchronous all the way.

 

This is why the SAN vendors wrote their own clients and drivers. It is not 
possible to dynamically make all OS's do what your iSCSI
target expects. 

 

Something like VMware does the right thing pretty much all the time (there are 
some iSCSI initiator bugs in earlier ESXi 5.x).  If
you have control of your ESXi hosts then attempting to set up HA iSCSI targets 
is possible. 

 

If you have a mixed client environment with v

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



Le 22/07/2016 09:47, Nick Fisk a écrit :


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *Frédéric Nass

*Sent:* 22 July 2016 08:11
*To:* Jake Young ; Jan Schermer 
*Cc:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] ceph + vmware

Le 20/07/2016 21:20, Jake Young a écrit :



On Wednesday, July 20, 2016, Jan Schermer mailto:j...@schermer.cz>> wrote:


> On 20 Jul 2016, at 18:38, Mike Christie mailto:mchri...@redhat.com>> wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi
client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support
though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose
iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's
available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each
iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel
client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through
both targets,
>> but in a failover manner so that each ESXi always access
the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled
client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use a application that uses scsi persistent
reservations then you
> could run into troubles, because some apps expect the
reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that
caused the
> failover, IO could be stuck on the old active node and cause
data
> corruption. If the initial active node looses its network
connectivity
> and you failover, you have to make sure that the initial
active node is
> fenced off and IO stuck on that node will never be executed.
So do
> something like add it to the ceph monitor blacklist and make
sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only
use synchronous IO.

VMware does only use synchronous IO. Since the hypervisor can't
tell what type of data the VMs are writing, all IO is treated as
needing to be synchronous.

(With any of opensource target softwares available).
Flushing the buffers doesn't really help because you don't
know what in-flight IO happened before the outage
and which didn't. You could end with only part of the
"transaction" written on persistent storage.

If you only use synchronous IO all the way from client to the
persistent storage shared between
iSCSI target then all should be fine, otherwise YMMV - some
people run it like that without realizing
the dangers and have never had a problem, so it may be
strictly theoretical, and it all depends on how often you need
to do the
failover and what data you are storing - corrupting a few
images on a gallery site could be fine but corrupting
a large database tablespace is no fun at all.

No, it's not. VMFS corruption is pretty bad too and there is no
fsck for VMFS...


Some (non opensource) solutions exist, Solaris supposedly does
this in some(?) way, maybe some iSCSI guru
can chime in and tell us what magic they do, but I don't think it's
possible without client support
(you essentially have to do something like transactions and
replay the last transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO
synchronous or make it at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring
between the targets) without making it synchronous all the way.

This is why the SAN vendors wrote their own clients and drivers.
It is not possible to dynamically make all OS's do what your iSCSI target expects.
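
The blacklist-based fencing Mike describes above boils down to something like the following on the Ceph side (the client address/nonce here is a made-up placeholder for the old active gateway; a sketch, not a full failover procedure):

ceph osd blacklist add 192.168.1.10:0/3418211770    # fence the failed target node's RBD client
ceph osd blacklist ls                               # verify the entry is in place before failing over
# ...fail over, make sure stuck IO on the old node is flushed/failed...
ceph osd blacklist rm 192.168.1.10:0/3418211770     # only once it is safe to let that node back in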

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Nick Fisk
 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Frédéric Nass
Sent: 22 July 2016 08:11
To: Jake Young ; Jan Schermer 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph + vmware

 

 

On 20/07/2016 21:20, Jake Young wrote:



On Wednesday, July 20, 2016, Jan Schermer wrote:


> On 20 Jul 2016, at 18:38, Mike Christie wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>> but in a failover manner so that each ESXi always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use an application that uses scsi persistent reservations then you
> could run into troubles, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use synchronous IO.

 

VMware only uses synchronous IO. Since the hypervisor can't tell what type 
of data the VMs are writing, all IO is treated as
needing to be synchronous. 

 

(With any of the open-source target software available.)
Flushing the buffers doesn't really help because you don't know what in-flight 
IO happened before the outage
and which didn't. You could end up with only part of the "transaction" written on 
persistent storage.

If you only use synchronous IO all the way from client to the persistent 
storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people run it like 
that without realizing
the dangers and have never had a problem, so it may be strictly theoretical, 
and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images on a gallery 
site could be fine but corrupting
a large database tablespace is no fun at all.

 

No, it's not. VMFS corruption is pretty bad too and there is no fsck for VMFS...

 


Some (non opensource) solutions exist, Solaris supposedly does this in some(?) 
way, maybe some iSCSI guru
can chime in and tell us what magic they do, but I don't think it's possible without 
client support
(you essentially have to do something like transactions and replay the last 
transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO synchronous or make it 
at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring between the targets) 
without making it synchronous all the way.

 

This is why the SAN vendors wrote their own clients and drivers. It is not 
possible to dynamically make all OS's do what your iSCSI
target expects. 

 

Something like VMware does the right thing pretty much all the time (there are 
some iSCSI initiator bugs in earlier ESXi 5.x).  If
you have control of your ESXi hosts then attempting to set up HA iSCSI targets 
is possible. 

 

If you have a mixed client environment with various versions of Windows 
connecting to the target, you may be better off buying some
SAN appliances.

 


The one time I had to use it I resorted to simply mirroring in via mdraid on 
the client side over two targets sharing the same
DAS, and this worked fine during testing but never went to production in the 
end.

Jan

>
>>
>> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
>> clients ?
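
The kernel-rbd-plus-LIO export described in this thread can be sketched roughly as below; the pool, image and IQN names are assumptions, and in practice tools like targetcli or SUSE's lrbd wrap these steps:

rbd map rbd/vmware-lun0                                   # kernel client mapping, so no librbd cache
targetcli /backstores/block create name=vmware-lun0 dev=/dev/rbd/rbd/vmware-lun0
targetcli /iscsi create iqn.2016-07.com.example:vmware-lun0
targetcli /iscsi/iqn.2016-07.com.example:vmware-lun0/tpg1/luns create /backstores/block/vmware-lun0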

Re: [ceph-users] ceph + vmware

2016-07-22 Thread Frédéric Nass



On 20/07/2016 21:20, Jake Young wrote:



On Wednesday, July 20, 2016, Jan Schermer wrote:



> On 20 Jul 2016, at 18:38, Mike Christie wrote:
>
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>
>> Hi Mike,
>>
>> Thanks for the update on the RHCS iSCSI target.
>>
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client
? (or is
>> it too early to say / announce).
>
> No HA support for sure. We are looking into non HA support though.
>
>>
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI
over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's
available.
>>
>> So we're currently running :
>>
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI
target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client
(so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both
targets,
>> but in a failover manner so that each ESXi always access the
same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled
client side
>> (except UNMAP as per default).
>>
>> Do you see anything risky regarding this configuration ?
>
> If you use an application that uses scsi persistent reservations
then you
> could run into troubles, because some apps expect the
reservation info
> to be on the failover nodes as well as the active ones.
>
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node loses its network
connectivity
> and you failover, you have to make sure that the initial active
node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make
sure IO on
> that node is flushed and failed before unblacklisting it.
>

With iSCSI you can't really do hot failover unless you only use
synchronous IO.


VMware only uses synchronous IO. Since the hypervisor can't tell 
what type of data the VMs are writing, all IO is treated as needing to 
be synchronous.


(With any of the open-source target software available.)
Flushing the buffers doesn't really help because you don't know
what in-flight IO happened before the outage
and which didn't. You could end up with only part of the
"transaction" written on persistent storage.

If you only use synchronous IO all the way from client to the
persistent storage shared between
iSCSI target then all should be fine, otherwise YMMV - some people
run it like that without realizing
the dangers and have never had a problem, so it may be strictly
theoretical, and it all depends on how often you need to do the
failover and what data you are storing - corrupting a few images
on a gallery site could be fine but corrupting
a large database tablespace is no fun at all.


No, it's not. VMFS corruption is pretty bad too and there is no fsck 
for VMFS...



Some (non opensource) solutions exist, Solaris supposedly does
this in some(?) way, maybe some iSCSI guru
can chime in and tell us what magic they do, but I don't think it's
possible without client support
(you essentially have to do something like transactions and replay
the last transaction on failover). Maybe
something can be enabled in protocol to do the iSCSI IO
synchronous or make it at least wait for some sort of ACK from the
server (which would require some sort of cache mirroring between
the targets) without making it synchronous all the way.


This is why the SAN vendors wrote their own clients and drivers. It is 
not possible to dynamically make all OS's do what your iSCSI target 
expects.


Something like VMware does the right thing pretty much all the time 
(there are some iSCSI initiator bugs in earlier ESXi 5.x).  If you 
have control of your ESXi hosts then attempting to set up HA iSCSI 
targets is possible.


If you have a mixed client environment with various versions of 
Windows connecting to the target, you may be better off buying some 
SAN appliances.



The one time I had to use it I resorted to simply mirroring in via
mdraid on the client side over two targets sharing the same
DAS, and this worked fine during testing but never went to
production in the end.

Jan

>
>>
>> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
>> clients ?
>
> I can't say, because I have not used stgt with rbd bs-type
support enough.


For starters, STGT doesn't implement VAAI properly and you will need 
to disable VAAI in ESXi.


LIO does seem to implement VAAI properly, but performance is not nearly 
as good as STGT even with VAAI's benefits.

Re: [ceph-users] ceph + vmware

2016-07-21 Thread Mike Christie
On 07/21/2016 11:41 AM, Mike Christie wrote:
> On 07/20/2016 02:20 PM, Jake Young wrote:
>>
>> For starters, STGT doesn't implement VAAI properly and you will need to
>> disable VAAI in ESXi.
>>
>> LIO does seem to implement VAAI properly, but performance is not nearly
>> as good as STGT even with VAAI's benefits. The assumption for the cause
>> is that LIO currently uses kernel rbd mapping and kernel rbd performance
>> is not as good as librbd. 
>>
>> I recently did a simple test of creating an 80GB eager zeroed disk with
>> STGT (VAAI disabled, no rbd client cache) and LIO (VAAI enabled) and
>> found that STGT was actually slightly faster.
>>
>> I think we're all holding our breath waiting for LIO librbd support via
>> TCMU, which seems to be right around the corner. That solution will
> 
> Is there a thread for that?
> 
>> combine the performance benefits of librbd with the more feature-full
>> LIO iSCSI interface. The lrbd configuration tool for LIO from SUSE is
>> pretty cool and it makes configuring LIO easier than STGT. 
>>
> 
> I wrote a tcmu rbd driver a while back. It is based on gpl2 code, so
> Andy could not take it into tcmu. I attached it here if you want to play
> with it.
> 

Here it is attached in patch form built against the current tcmu code.

I have not tested it since March, so if there have been major changes to
the tcmu code there might be issues.

You should only use this for testing. I wrote it up in a night. I have
done very little testing.

It only supports READ, WRITE, DISCARD/UNMAP, TUR, MODE_SENSE/SELECT, and
SYNC_CACHE.
commit 90846c4f94c3c51d608bd79eb1304a9106ba67c1
Author: Mike Christie 
Date:   Thu Jul 21 12:41:48 2016 -0500

tcmu: add rbd support

Add basic tcmu rbd support.

This does READ, WRITE, DISCARD and FLUSH.

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 507188a..ac8f4b2 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -137,6 +137,24 @@ add_executable(consumer
   )
 target_link_libraries(consumer tcmu)
 
+if (with-rbd)
+	find_library(LIBRBD rbd)
+
+	# Stuff for building the rbd handler
+	add_library(handler_rbd
+	  SHARED
+	  rbd.c
+	  )
+	set_target_properties(handler_rbd
+	  PROPERTIES
+	  PREFIX ""
+	  )
+	target_link_libraries(handler_rbd
+	  ${LIBRBD}
+	  )
+	install(TARGETS handler_rbd DESTINATION ${CMAKE_INSTALL_LIBDIR}/tcmu-runner)
+endif (with-rbd)
+
 if (with-glfs)
 	find_library(GFAPI gfapi)
 
diff --git a/rbd.c b/rbd.c
new file mode 100644
index 000..2dc3b98
--- /dev/null
+++ b/rbd.c
@@ -0,0 +1,818 @@
+/*
+ * Code from QEMU Block driver for RADOS (Ceph) ported to a TCMU handler
+ * by Mike Christie.
+ *
+ * Copyright (C) 2010-2011 Christian Brunner ,
+ * Josh Durgin 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Contributions after 2012-01-13 are licensed under the terms of the
+ * GNU GPL, version 2 or (at your option) any later version.
+ */
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "tcmu-runner.h"
+#include "libtcmu.h"
+
+#include <rbd/librbd.h>
+
+/* rbd_aio_discard added in 0.1.2 */
+#if LIBRBD_VERSION_CODE >= LIBRBD_VERSION(0, 1, 2)
+#define LIBRBD_SUPPORTS_DISCARD
+#else
+#undef LIBRBD_SUPPORTS_DISCARD
+#endif
+
+#define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
+
+#define RBD_MAX_CONF_NAME_SIZE 128
+#define RBD_MAX_CONF_VAL_SIZE 512
+#define RBD_MAX_CONF_SIZE 1024
+#define RBD_MAX_POOL_NAME_SIZE 128
+#define RBD_MAX_SNAP_NAME_SIZE 128
+#define RBD_MAX_SNAPS 100
+
+struct tcmu_rbd_state {
+	rados_t cluster;
+	rados_ioctx_t io_ctx;
+	rbd_image_t image;
+	char name[RBD_MAX_IMAGE_NAME_SIZE];
+	char *snap;
+	uint64_t num_lbas;
+	unsigned int block_size;
+};
+
+enum {
+	RBD_AIO_READ,
+	RBD_AIO_WRITE,
+	RBD_AIO_DISCARD,
+	RBD_AIO_FLUSH,
+};
+
+struct rbd_aio_cb {
+	struct tcmu_device *dev;
+	struct tcmulib_cmd *tcmulib_cmd;
+	int64_t ret;
+	char *bounce;
+	int rbd_aio_cmd;
+	int error;
+	int64_t length;
+};
+
+static int tcmu_rbd_next_tok(char *dst, int dst_len, char *src, char delim,
+			 const char *name, char **p)
+{
+	int l;
+	char *end;
+
+	*p = NULL;
+
+	if (delim != '\0') {
+	for (end = src; *end; ++end) {
+			if (*end == delim) {
+break;
+			}
+			if (*end == '\\' && end[1] != '\0') {
+end++;
+			}
+		}
+		if (*end == delim) {
+			*p = end + 1;
+			*end = '\0';
+		}
+	}
+	l = strlen(src);
+	if (l >= dst_len) {
+		errp("%s too long", name);
+		return -EINVAL;
+	} else if (l == 0) {
+		errp("%s too short", name);
+		return -EINVAL;
+	}
+
+	strncpy(dst, src, dst_len);
+
+	return 0;
+}
+
+static void tcmu_rbd_unescape(char *src)
+{   
+	char *p;
+
+	for (p = src; *src; ++src, ++p) { 
+		if (*src == '\\' && src[1] != '\0') {
+			src++;
+		}
+		*p = *src;
+	}
+	*p = '\0';
+}
+
+static int tcmu_rbd_parsename(const char *config,
+			  char *pool, int

Re: [ceph-users] ceph + vmware

2016-07-21 Thread Mike Christie
On 07/20/2016 02:20 PM, Jake Young wrote:
> 
> For starters, STGT doesn't implement VAAI properly and you will need to
> disable VAAI in ESXi.
> 
> LIO does seem to implement VAAI properly, but performance is not nearly
> as good as STGT even with VAAI's benefits. The assumption for the cause
> is that LIO currently uses kernel rbd mapping and kernel rbd performance
> is not as good as librbd. 
> 
> I recently did a simple test of creating an 80GB eager zeroed disk with
> STGT (VAAI disabled, no rbd client cache) and LIO (VAAI enabled) and
> found that STGT was actually slightly faster.
> 
> I think we're all holding our breath waiting for LIO librbd support via
> TCMU, which seems to be right around the corner. That solution will

Is there a thread for that?

> combine the performance benefits of librbd with the more feature-full
> LIO iSCSI interface. The lrbd configuration tool for LIO from SUSE is
> pretty cool and it makes configuring LIO easier than STGT. 
> 

I wrote a tcmu rbd driver a while back. It is based on gpl2 code, so
Andy could not take it into tcmu. I attached it here if you want to play
with it.
/*
 * Code from QEMU Block driver for RADOS (Ceph) ported to a TCMU handler
 * by Mike Christie.
 *
 * Copyright (C) 2010-2011 Christian Brunner ,
 * Josh Durgin 
 *
 * This work is licensed under the terms of the GNU GPL, version 2.  See
 * the COPYING file in the top-level directory.
 *
 * Contributions after 2012-01-13 are licensed under the terms of the
 * GNU GPL, version 2 or (at your option) any later version.
 */
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include "tcmu-runner.h"
#include "libtcmu.h"

#include <rbd/librbd.h>

/* rbd_aio_discard added in 0.1.2 */
#if LIBRBD_VERSION_CODE >= LIBRBD_VERSION(0, 1, 2)
#define LIBRBD_SUPPORTS_DISCARD
#else
#undef LIBRBD_SUPPORTS_DISCARD
#endif

#define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)

#define RBD_MAX_CONF_NAME_SIZE 128
#define RBD_MAX_CONF_VAL_SIZE 512
#define RBD_MAX_CONF_SIZE 1024
#define RBD_MAX_POOL_NAME_SIZE 128
#define RBD_MAX_SNAP_NAME_SIZE 128
#define RBD_MAX_SNAPS 100

struct tcmu_rbd_state {
rados_t cluster;
rados_ioctx_t io_ctx;
rbd_image_t image;
char name[RBD_MAX_IMAGE_NAME_SIZE];
char *snap;
uint64_t num_lbas;
unsigned int block_size;
};

enum {
RBD_AIO_READ,
RBD_AIO_WRITE,
RBD_AIO_DISCARD,
RBD_AIO_FLUSH,
};

struct rbd_aio_cb {
struct tcmu_device *dev;
struct tcmulib_cmd *tcmulib_cmd;
int64_t ret;
char *bounce;
int rbd_aio_cmd;
int error;
int64_t length;
};

static int tcmu_rbd_next_tok(char *dst, int dst_len, char *src, char delim,
 const char *name, char **p)
{
int l;
char *end;

*p = NULL;

if (delim != '\0') {
for (end = src; *end; ++end) {
if (*end == delim) {
break;
}
if (*end == '\\' && end[1] != '\0') {
end++;
}
}
if (*end == delim) {
*p = end + 1;
*end = '\0';
}
}
l = strlen(src);
if (l >= dst_len) {
errp("%s too long", name);
return -EINVAL;
} else if (l == 0) {
errp("%s too short", name);
return -EINVAL;
}

strncpy(dst, src, dst_len);

return 0;
}

static void tcmu_rbd_unescape(char *src)
{   
char *p;

for (p = src; *src; ++src, ++p) { 
if (*src == '\\' && src[1] != '\0') {
src++;
}
*p = *src;
}
*p = '\0';
}

static int tcmu_rbd_parsename(const char *config,
  char *pool, int pool_len,
  char *snap, int snap_len,
  char *name, int name_len,
  char *conf, int conf_len)
{
char *p, *buf;
int ret;

buf = strdup(config);
p = buf;
*snap = '\0';
*conf = '\0';

ret = tcmu_rbd_next_tok(pool, pool_len, p, '/', "pool name", &p);
if (ret < 0 || !p) {
ret = -EINVAL;
goto done;
}
tcmu_rbd_unescape(pool);

if (strchr(p, '@')) {
ret = tcmu_rbd_next_tok(name, name_len, p, '@', "object name",
&p);
if (ret < 0) {
goto done;
}
ret = tcmu_rbd_next_tok(snap, snap_len, p, ':', "snap name",
&p);
tcmu_rbd_un

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
Yes awesome, as long as you fully test bcache and you are happy with it.

 

Also, if you intend to do HA, you will have to use dual port SAS SSD’s instead 
of NVME and make sure you create your resource agent scripts correctly, 
otherwise bye bye data.

 

If you enable writeback caching in TGT and you have power failure, then 
anything in the cache is lost. This will either mean holes in your data, or 
sections that are out of date. Basically that LUN will most likely be toast and 
you will have to reformat.

 

From: w...@globe.de [mailto:w...@globe.de] 
Sent: 21 July 2016 15:04
To: n...@fisk.me.uk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

Okay that should be the answer... 

I think it would be great to use Intel P3700 1.6TB as bcache in the iscsi rbd 
client gateway nodes.

caching device: Intel P3700 1.6TB

backing device: RBD from Ceph Cluster
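
A rough sketch of what that bcache pairing could look like is below (device names are assumptions, and note the warnings elsewhere in this thread that anything sitting in a writeback cache on a gateway is lost on power failure):

make-bcache -C /dev/nvme0n1 -B /dev/rbd0              # P3700 as the cache set, mapped RBD as backing device
echo <cset-uuid> > /sys/block/bcache0/bcache/attach   # attach the backing device to the cache set
echo writeback > /sys/block/bcache0/bcache/cache_mode # the step that makes a gateway crash dangerous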

What do you think? This setup should improve the performance 
dramatically, shouldn't it?

If I enable writeback in these nodes and use tgt for VMware, what happens if 
iSCSI node 1 goes offline? Power loss... or a Linux kernel crash.

 

 

On 21.07.16 at 15:57, Nick Fisk wrote:

What you are seeing is probably averaged over 1 second or something like that. 
So yes in 1 second IO would have run on all OSD’s. But for any 1 point in time 
a single thread will only run on 1 OSD (+2 replicas) assuming the IO size isn’t 
bigger than the object size. 

 

For RBD, If data is striped in 4MB chunks, then you will have to read/write 
more than 4MB at a time to cross over to the next object. You get exactly the 
same problems with reading when you don’t set the readahead above 4MB.

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
w...@globe.de <mailto:w...@globe.de> 
Sent: 21 July 2016 14:05
To: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

That can not be correct.

Check it at your cluster with dstat as i said...

You will see at every node parallel IO on every OSD and journal

 

On 21.07.16 at 15:02, Jake Young wrote:

I think the answer is that with 1 thread you can only ever write to one journal 
at a time. Theoretically, you would need 10 threads to be able to write to 10 
nodes at the same time.  

 

Jake

On Thursday, July 21, 2016, w...@globe.de wrote:

What I don't really understand is:

Lets say the Intel P3700 works with 200 MByte/s rados bench one thread... See 
Nicks results below...

If we have multiple OSD Nodes. For example 10 Nodes.

Every Node has exactly 1x P3700 NVMe built in.

Why is the single Thread performance exactly at 200 MByte/s on the rbd client 
with 10 OSD Node Cluster???

I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.

 

Everyone look yourself at your cluster. 

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the cluster if you 
test at the client side with rados bench...

rados bench -p rbd 60 write -b 4M -t 1

 

 

On 21.07.16 at 14:38, w...@globe.de wrote:

Is there not a way to enable the Linux page cache? So do not use D_Sync... 

Then the performance would improve dramatically. 


On 21.07.16 at 14:33, Nick Fisk wrote:




-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance 

Okay and what is your plan now to speed up ? 

Now I have come up with a lower latency hardware design, there is not much 
further improvement until persistent RBD caching is implemented, as you will be 
moving the SSD/NVME closer to the client. But I'm happy with what I can achieve 
at the moment. You could also experiment with bcache on the RBD. 





Would it help to put in multiple P3700 per OSD Node to improve performance for 
a single Thread (example Storage VMotion) ? 

Most likely not, it's all the other parts of the puzzle which are causing the 
latency. ESXi was designed for storage arrays that service IO's in 100us-1ms 
range, Ceph is probably about 10x slower than this, hence the problem. Disable 
the BBWC on a RAID controller or SAN and you will see the same behaviour. 





Regards 


On 21.07.16 at 14:17, Nick Fisk wrote:




-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance 

Hi, 

hmm i think 200 MByte/s is really bad. Is your Cluster in production right now? 

It's just been built, not running yet. 





So if you start a storage migration you get only 200 MByte/s

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
What you are seeing is probably averaged over 1 second or something like that. 
So yes in 1 second IO would have run on all OSD’s. But for any 1 point in time 
a single thread will only run on 1 OSD (+2 replicas) assuming the IO size isn’t 
bigger than the object size. 

 

For RBD, If data is striped in 4MB chunks, then you will have to read/write 
more than 4MB at a time to cross over to the next object. You get exactly the 
same problems with reading when you don’t set the readahead above 4MB.
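
For example, on a krbd-backed gateway the readahead can be raised per device, or images can be created with larger objects so a single stream crosses object boundaries less often (device and image names below are assumptions):

echo 16384 > /sys/block/rbd0/queue/read_ahead_kb    # 16 MB readahead on the mapped device
rbd create rbd/bigobj --size 102400 --order 24      # order 24 = 16 MB objects instead of the 4 MB default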

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
w...@globe.de
Sent: 21 July 2016 14:05
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

That can not be correct.

Check it at your cluster with dstat as i said...

You will see at every node parallel IO on every OSD and journal

 

On 21.07.16 at 15:02, Jake Young wrote:

I think the answer is that with 1 thread you can only ever write to one journal 
at a time. Theoretically, you would need 10 threads to be able to write to 10 
nodes at the same time.  

 

Jake

On Thursday, July 21, 2016, w...@globe.de wrote:

What I don't really understand is:

Lets say the Intel P3700 works with 200 MByte/s rados bench one thread... See 
Nicks results below...

If we have multiple OSD Nodes. For example 10 Nodes.

Every Node has exactly 1x P3700 NVMe built in.

Why is the single Thread performance exactly at 200 MByte/s on the rbd client 
with 10 OSD Node Cluster???

I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.

 

Everyone look yourself at your cluster. 

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the cluster if you 
test at the client side with rados bench...

rados bench -p rbd 60 write -b 4M -t 1

 

 

On 21.07.16 at 14:38, w...@globe.de wrote:

Is there not a way to enable the Linux page cache? So do not use D_Sync... 

Then the performance would improve dramatically. 


On 21.07.16 at 14:33, Nick Fisk wrote:



-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance 

Okay and what is your plan now to speed up ? 

Now I have come up with a lower latency hardware design, there is not much 
further improvement until persistent RBD caching is implemented, as you will be 
moving the SSD/NVME closer to the client. But I'm happy with what I can achieve 
at the moment. You could also experiment with bcache on the RBD. 




Would it help to put in multiple P3700 per OSD Node to improve performance for 
a single Thread (example Storage VMotion) ? 

Most likely not, it's all the other parts of the puzzle which are causing the 
latency. ESXi was designed for storage arrays that service IO's in 100us-1ms 
range, Ceph is probably about 10x slower than this, hence the problem. Disable 
the BBWC on a RAID controller or SAN and you will see the same behaviour. 




Regards 


On 21.07.16 at 14:17, Nick Fisk wrote:



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance 

Hi, 

hmm i think 200 MByte/s is really bad. Is your Cluster in production right now? 

It's just been built, not running yet. 




So if you start a storage migration you get only 200 MByte/s right? 

I wish. My current cluster (not this new one) would storage migrate at 
~10-15MB/s. Serial latency is the problem, without being able to 
buffer, ESXi waits on an ack for each IO before sending the next. Also it 
submits the migrations in 64kb chunks, unless you get VAAI 

working. I think esxi will try and do them in parallel, which will help as 
well. 



I think it would be awesome if you get 1000 MByte/s 

Where is the Bottleneck? 

Latency serialisation, without a buffer, you can't drive the devices 
to 100%. With buffered IO (or high queue depths) I can max out the journals. 




A FIO Test from Sebastien Han give us 400 MByte/s raw performance from the 
P3700. 

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your 
-ssd-is-suitable-as-a-journal-device/ 

How could it be that the rbd client performance is 50% slower? 

Regards 


On 21.07.16 at 12:15, Nick Fisk wrote:



I've had a lot of pain with this, smaller block sizes are even worse. 
You want to try and minimize latency at every point as there is no 
buffering happening in the iSCSI stack. This means:- 

1. Fast journals (NVME or NVRAM) 
2. 10GB or better networking 
3. Fast CPU's (Ghz) 
4. Fix CPU c-state's to C1 
5. Fix CPU's Freq to max 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

Okay that should be the answer...

I think it would be great to use Intel P3700 1.6TB as bcache in the 
iscsi rbd client gateway nodes.


caching device: Intel P3700 1.6TB

backing device: RBD from Ceph Cluster

What do you think? This setup should improve the performance 
dramatically, shouldn't it?


If I enable writeback in these nodes and use tgt for VMware, what 
happens if iSCSI node 1 goes offline? Power loss... or a Linux kernel crash.




On 21.07.16 at 15:57, Nick Fisk wrote:


What you are seeing is probably averaged over 1 second or something 
like that. So yes in 1 second IO would have run on all OSD’s. But for 
any 1 point in time a single thread will only run on 1 OSD (+2 
replicas) assuming the IO size isn’t bigger than the object size.


For RBD, If data is striped in 4MB chunks, then you will have to 
read/write more than 4MB at a time to cross over to the next object. 
You get exactly the same problems with reading when you don’t set the 
readahead above 4MB.


*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
Behalf Of *w...@globe.de

*Sent:* 21 July 2016 14:05
*To:* ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] Ceph + VMware + Single Thread Performance

That can not be correct.

Check it at your cluster with dstat as i said...

You will see at every node parallel IO on every OSD and journal

On 21.07.16 at 15:02, Jake Young wrote:

I think the answer is that with 1 thread you can only ever write
to one journal at a time. Theoretically, you would need 10 threads
to be able to write to 10 nodes at the same time.

Jake

On Thursday, July 21, 2016, w...@globe.de wrote:

What I don't really understand is:

Lets say the Intel P3700 works with 200 MByte/s rados bench
one thread... See Nicks results below...

If we have multiple OSD Nodes. For example 10 Nodes.

Every Node has exactly 1x P3700 NVMe built in.

Why is the single Thread performance exactly at 200 MByte/s on
the rbd client with 10 OSD Node Cluster???

I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.

Everyone look yourself at your cluster.

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the
cluster if you test at the client side with rados bench...

*rados bench -p rbd 60 write -b 4M -t 1*

On 21.07.16 at 14:38, w...@globe.de wrote:

Is there not a way to enable the Linux page cache? So do not
use D_Sync...

Then the performance would improve dramatically.


On 21.07.16 at 14:33, Nick Fisk wrote:

-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Okay and what is your plan now to speed up ?

Now I have come up with a lower latency hardware
design, there is not much further improvement until
persistent RBD caching is implemented, as you will be
moving the SSD/NVME closer to the client. But I'm
happy with what I can achieve at the moment. You could
also experiment with bcache on the RBD.


Would it help to put in multiple P3700 per OSD
Node to improve performance for a single Thread
(example Storage VMotion) ?

Most likely not, it's all the other parts of the
puzzle which are causing the latency. ESXi was
designed for storage arrays that service IO's in
100us-1ms range, Ceph is probably about 10x slower
than this, hence the problem. Disable the BBWC on a
RAID controller or SAN and you will see the same behaviour.


Regards


On 21.07.16 at 14:17, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng'


 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

That can not be correct.

Check it at your cluster with dstat as i said...

You will see at every node parallel IO on every OSD and journal


On 21.07.16 at 15:02, Jake Young wrote:
I think the answer is that with 1 thread you can only ever write to 
one journal at a time. Theoretically, you would need 10 threads to be 
able to write to 10 nodes at the same time.


Jake

On Thursday, July 21, 2016, w...@globe.de wrote:


What I don't really understand is:

Lets say the Intel P3700 works with 200 MByte/s rados bench one
thread... See Nicks results below...

If we have multiple OSD Nodes. For example 10 Nodes.

Every Node has exactly 1x P3700 NVMe built in.

Why is the single Thread performance exactly at 200 MByte/s on the
rbd client with 10 OSD Node Cluster???

I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.


Everyone look yourself at your cluster.

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the
cluster if you test at the client side with rados bench...

*rados bench -p rbd 60 write -b 4M -t 1*



On 21.07.16 at 14:38, w...@globe.de wrote:

Is there not a way to enable the Linux page cache? So do not use
D_Sync...

Then the performance would improve dramatically.


On 21.07.16 at 14:33, Nick Fisk wrote:

-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Okay and what is your plan now to speed up ?

Now I have come up with a lower latency hardware design, there
is not much further improvement until persistent RBD caching is
implemented, as you will be moving the SSD/NVME closer to the
client. But I'm happy with what I can achieve at the moment. You
could also experiment with bcache on the RBD.


Would it help to put in multiple P3700 per OSD Node to improve
performance for a single Thread (example Storage VMotion) ?

Most likely not, it's all the other parts of the puzzle which
are causing the latency. ESXi was designed for storage arrays
that service IO's in 100us-1ms range, Ceph is probably about 10x
slower than this, hence the problem. Disable the BBWC on a RAID
controller or SAN and you will see the same behaviour.


Regards


On 21.07.16 at 14:17, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

hmm i think 200 MByte/s is really bad. Is your Cluster in
production right now?

It's just been built, not running yet.


So if you start a storage migration you get only 200 MByte/s
right?

I wish. My current cluster (not this new one) would storage
migrate at
~10-15MB/s. Serial latency is the problem, without being able to
buffer, ESXi waits on an ack for each IO before sending the
next. Also it submits the migrations in 64kb chunks, unless
you get VAAI

working. I think esxi will try and do them in parallel, which
will help as well.

I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the
devices
to 100%. With buffered IO (or high queue depths) I can max out
the journals.


A FIO Test from Sebastien Han give us 400 MByte/s raw
performance from the P3700.

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your

-ssd-is-suitable-as-a-journal-device/

How could it be that the rbd client performance is 50% slower?

Regards


On 21.07.16 at 12:15, Nick Fisk wrote:

I've had a lot of pain with this, smaller block sizes are
even worse.
You want to try and minimize latency at every point as there
is no
buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update
happening with VMFS, particularly if you are using thin
VMDK's, this
can also be a major bottleneck. For my use case, I've
switched over to NFS as it has given much more performance
at scale and

less headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Jake Young
I think the answer is that with 1 thread you can only ever write to one
journal at a time. Theoretically, you would need 10 threads to be able to
write to 10 nodes at the same time.
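
That is easy to see by repeating the same bench with more in-flight IO (the same commands used elsewhere in this thread):

rados bench -p rbd 60 write -b 4M -t 1     # one outstanding write: bound by a single OSD/journal at a time
rados bench -p rbd 60 write -b 4M -t 16    # many outstanding writes: IO spreads across OSDs in parallel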

Jake

On Thursday, July 21, 2016, w...@globe.de  wrote:

> What I don't really understand is:
>
> Lets say the Intel P3700 works with 200 MByte/s rados bench one thread...
> See Nicks results below...
>
> If we have multiple OSD Nodes. For example 10 Nodes.
>
> Every Node has exactly 1x P3700 NVMe built in.
>
> Why is the single Thread performance exactly at 200 MByte/s on the rbd
> client with 10 OSD Node Cluster???
>
> I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.
>
>
> Everyone look yourself at your cluster.
>
> dstat -D sdb,sdc,sdd,sdX 
>
> You will see that Ceph stripes the data over all OSD's in the cluster if
> you test at the client side with rados bench...
>
> *rados bench -p rbd 60 write -b 4M -t 1*
>
>
>
> On 21.07.16 at 14:38, w...@globe.de wrote:
>
> Is there not a way to enable the Linux page cache? So do not use D_Sync...
>
> Then the performance would improve dramatically.
>
>
> On 21.07.16 at 14:33, Nick Fisk wrote:
>
> -Original Message-
> From: w...@globe.de [mailto:w...@globe.de]
> Sent: 21 July 2016 13:23
> To: n...@fisk.me.uk; 'Horace Ng'
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
> Okay and what is your plan now to speed up ?
>
> Now I have come up with a lower latency hardware design, there is not much
> further improvement until persistent RBD caching is implemented, as you
> will be moving the SSD/NVME closer to the client. But I'm happy with what I
> can achieve at the moment. You could also experiment with bcache on the
> RBD.
>
> Would it help to put in multiple P3700 per OSD Node to improve performance
> for a single Thread (example Storage VMotion) ?
>
> Most likely not, it's all the other parts of the puzzle which are causing
> the latency. ESXi was designed for storage arrays that service IO's in
> 100us-1ms range, Ceph is probably about 10x slower than this, hence the
> problem. Disable the BBWC on a RAID controller or SAN and you will see the same
> behaviour.
>
> Regards
>
>
> On 21.07.16 at 14:17, Nick Fisk wrote:
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of w...@globe.de
> Sent: 21 July 2016 13:04
> To: n...@fisk.me.uk; 'Horace Ng'
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
> Hi,
>
> hmm i think 200 MByte/s is really bad. Is your Cluster in production right
> now?
>
> It's just been built, not running yet.
>
> So if you start a storage migration you get only 200 MByte/s right?
>
> I wish. My current cluster (not this new one) would storage migrate at
> ~10-15MB/s. Serial latency is the problem, without being able to
> buffer, ESXi waits on an ack for each IO before sending the next. Also it
> submits the migrations in 64kb chunks, unless you get VAAI
>
> working. I think esxi will try and do them in parallel, which will help as
> well.
>
> I think it would be awesome if you get 1000 MByte/s
>
> Where is the Bottleneck?
>
> Latency serialisation, without a buffer, you can't drive the devices
> to 100%. With buffered IO (or high queue depths) I can max out the
> journals.
>
> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from the
> P3700.
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
> -ssd-is-suitable-as-a-journal-device/
>
> How could it be that the rbd client performance is 50% slower?
>
> Regards
>
>
> On 21.07.16 at 12:15, Nick Fisk wrote:
>
> I've had a lot of pain with this, smaller block sizes are even worse.
> You want to try and minimize latency at every point as there is no
> buffering happening in the iSCSI stack. This means:-
>
> 1. Fast journals (NVME or NVRAM)
> 2. 10GB or better networking
> 3. Fast CPU's (Ghz)
> 4. Fix CPU c-state's to C1
> 5. Fix CPU's Freq to max
>
> Also I can't be sure, but I think there is a metadata update
> happening with VMFS, particularly if you are using thin VMDK's, this
> can also be a major bottleneck. For my use case, I've switched over to NFS
> as it has given much more performance at scale and
>
> less headache.
>
> For the RADOS Run, here you go (400GB P3700):
>
> Total time run: 60.026491
> Total writes made:  3104
> Write size: 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
Yes, but not if you are using iSCSI and don't want data loss. If the data is in 
a cache somewhere and you lose power or crash, it's game over. That's why you want 
to cache to a non volatile device close to the source.

If you use something like FIO and use buffered IO, you will see that you will 
get really high numbers, unfortunately you can't do this with iSCSI though.

> -Original Message-
> From: w...@globe.de [mailto:w...@globe.de]
> Sent: 21 July 2016 13:39
> To: n...@fisk.me.uk; 'Horace Ng' 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Is there not a way to enable the Linux page cache? So do not use D_Sync...
> 
> Then the performance would improve dramatically.
> 
> 
> On 21.07.16 at 14:33, Nick Fisk wrote:
> >> -Original Message-
> >> From: w...@globe.de [mailto:w...@globe.de]
> >> Sent: 21 July 2016 13:23
> >> To: n...@fisk.me.uk; 'Horace Ng' 
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Okay and what is your plan now to speed up ?
> > Now I have come up with a lower latency hardware design, there is not much 
> > further improvement until persistent RBD caching is
> implemented, as you will be moving the SSD/NVME closer to the client. But I'm 
> happy with what I can achieve at the moment. You
> could also experiment with bcache on the RBD.
> >
> >> Would it help to put in multiple P3700 per OSD Node to improve performance 
> >> for a single Thread (example Storage VMotion) ?
> > Most likely not, it's all the other parts of the puzzle which are causing 
> > the latency. ESXi was designed for storage arrays that service
> IO's in 100us-1ms range, Ceph is probably about 10x slower than this, hence 
> the problem. Disable the BBWC on a RAID controller or
> SAN and you will see the same behaviour.
> >
> >> Regards
> >>
> >>
> >> On 21.07.16 at 14:17, Nick Fisk wrote:
> >>>> -Original Message-
> >>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> >>>> Behalf Of w...@globe.de
> >>>> Sent: 21 July 2016 13:04
> >>>> To: n...@fisk.me.uk; 'Horace Ng' 
> >>>> Cc: ceph-users@lists.ceph.com
> >>>> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>>>
> >>>> Hi,
> >>>>
> >>>> hmm i think 200 MByte/s is really bad. Is your Cluster in production 
> >>>> right now?
> >>> It's just been built, not running yet.
> >>>
> >>>> So if you start a storage migration you get only 200 MByte/s right?
> >>> I wish. My current cluster (not this new one) would storage migrate
> >>> at ~10-15MB/s. Serial latency is the problem, without being able to
> >>> buffer, ESXi waits on an ack for each IO before sending the next.
> >>> Also it submits the migrations in 64kb chunks, unless you get VAAI
> >> working. I think esxi will try and do them in parallel, which will help as 
> >> well.
> >>>> I think it would be awesome if you get 1000 MByte/s
> >>>>
> >>>> Where is the Bottleneck?
> >>> Latency serialisation, without a buffer, you can't drive the devices
> >>> to 100%. With buffered IO (or high queue depths) I can max out the 
> >>> journals.
> >>>
> >>>> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from 
> >>>> the P3700.
> >>>>
> >>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-yo
> >>>> ur -ssd-is-suitable-as-a-journal-device/
> >>>>
> >>>> How could it be that the rbd client performance is 50% slower?
> >>>>
> >>>> Regards
> >>>>
> >>>>
> >>>> On 21.07.16 at 12:15, Nick Fisk wrote:
> >>>>> I've had a lot of pain with this, smaller block sizes are even worse.
> >>>>> You want to try and minimize latency at every point as there is no
> >>>>> buffering happening in the iSCSI stack. This means:-
> >>>>>
> >>>>> 1. Fast journals (NVME or NVRAM)
> >>>>> 2. 10GB or better networking
> >>>>> 3. Fast CPU's (Ghz)
> >>>>> 4. Fix CPU c-state's to C1
> >>>>> 5. Fix CPU's Freq to

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

What I don't really understand is:

Lets say the Intel P3700 works with 200 MByte/s rados bench one 
thread... See Nicks results below...


If we have multiple OSD Nodes. For example 10 Nodes.

Every Node has exactly 1x P3700 NVMe built in.

Why is the single Thread performance exactly at 200 MByte/s on the rbd 
client with 10 OSD Node Cluster???


I think it must be at 10 Nodes * 200 MByte/s = 2000 MByte/s.


Everyone look yourself at your cluster.

dstat -D sdb,sdc,sdd,sdX 

You will see that Ceph stripes the data over all OSD's in the cluster if 
you test at the client side with rados bench...


*rados bench -p rbd 60 write -b 4M -t 1*



On 21.07.16 at 14:38, w...@globe.de wrote:

Is there not a way to enable the Linux page cache? So do not use D_Sync...

Then the performance would improve dramatically.


On 21.07.16 at 14:33, Nick Fisk wrote:

-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng' 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Okay and what is your plan now to speed up ?
Now I have come up with a lower latency hardware design, there is not 
much further improvement until persistent RBD caching is implemented, 
as you will be moving the SSD/NVME closer to the client. But I'm 
happy with what I can achieve at the moment. You could also 
experiment with bcache on the RBD.


Would it help to put in multiple P3700 per OSD Node to improve 
performance for a single Thread (example Storage VMotion) ?
Most likely not, it's all the other parts of the puzzle which are 
causing the latency. ESXi was designed for storage arrays that 
service IO's in 100us-1ms range, Ceph is probably about 10x slower 
than this, hence the problem. Disable the BBWC on a RAID controller 
or SAN and you will see the same behaviour.



Regards


On 21.07.16 at 14:17, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng' 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

hmm i think 200 MByte/s is really bad. Is your Cluster in 
production right now?

It's just been built, not running yet.


So if you start a storage migration you get only 200 MByte/s right?

I wish. My current cluster (not this new one) would storage migrate at
~10-15MB/s. Serial latency is the problem, without being able to
buffer, ESXi waits on an ack for each IO before sending the next. 
Also it submits the migrations in 64kb chunks, unless you get VAAI
working. I think esxi will try and do them in parallel, which will 
help as well.

I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the devices
to 100%. With buffered IO (or high queue depths) I can max out the 
journals.


A FIO Test from Sebastien Han give us 400 MByte/s raw performance 
from the P3700.


https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
-ssd-is-suitable-as-a-journal-device/

How could it be that the rbd client performance is 50% slower?

Regards


On 21.07.16 at 12:15, Nick Fisk wrote:
I've had a lot of pain with this, smaller block sizes are even 
worse.

You want to try and minimize latency at every point as there is no
buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update
happening with VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched 
over to NFS as it has given much more performance at scale and

less headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
Behalf Of Horace
Sent: 21 July 2016 10:26
To: w...@globe.de
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Same here, I've read some blog saying that vmware will frequently
verify the locking on VMFS over iSCSI, hence it will have much 
slower performance than NFS (with different locking mechanism).


Regards,
Horace Ng

---

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

Is there not a way to enable the Linux page cache? So do not use D_Sync...

Then the performance would improve dramatically.


On 21.07.16 at 14:33, Nick Fisk wrote:

-Original Message-
From: w...@globe.de [mailto:w...@globe.de]
Sent: 21 July 2016 13:23
To: n...@fisk.me.uk; 'Horace Ng' 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Okay and what is your plan now to speed up ?

Now I have come up with a lower latency hardware design, there is not much 
further improvement until persistent RBD caching is implemented, as you will be 
moving the SSD/NVME closer to the client. But I'm happy with what I can achieve 
at the moment. You could also experiment with bcache on the RBD.


Would it help to put in multiple P3700 per OSD Node to improve performance for 
a single Thread (example Storage VMotion) ?

Most likely not, it's all the other parts of the puzzle which are causing the 
latency. ESXi was designed for storage arrays that service IO's in 100us-1ms 
range, Ceph is probably about 10x slower than this, hence the problem. Disable 
the BBWC on a RAID controller or SAN and you will see the same behaviour.


Regards


On 21.07.16 at 14:17, Nick Fisk wrote:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng' 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

hmm i think 200 MByte/s is really bad. Is your Cluster in production right now?

It's just been built, not running yet.


So if you start a storage migration you get only 200 MByte/s right?

I wish. My current cluster (not this new one) would storage migrate at
~10-15MB/s. Serial latency is the problem, without being able to
buffer, ESXi waits on an ack for each IO before sending the next. Also it 
submits the migrations in 64kb chunks, unless you get VAAI

working. I think esxi will try and do them in parallel, which will help as well.

I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the devices
to 100%. With buffered IO (or high queue depths) I can max out the journals.


A FIO test from Sebastien Han gives us 400 MByte/s raw performance from the 
P3700.

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
-ssd-is-suitable-as-a-journal-device/
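
For reference, the kind of single-job O_DSYNC write test that post describes looks roughly like this (the device name is an assumption, and the test overwrites data on that device):

fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test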

How could it be that the rbd client performance is 50% slower?

Regards


On 21.07.16 at 12:15, Nick Fisk wrote:

I've had a lot of pain with this, smaller block sizes are even worse.
You want to try and minimize latency at every point as there is no
buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update
happening with VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched over to NFS as 
it has given much more performance at scale and

less headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
Behalf Of Horace
Sent: 21 July 2016 10:26
To: w...@globe.de
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Same here, I've read some blog saying that vmware will frequently
verify the locking on VMFS over iSCSI, hence it will have much slower 
performance than NFS (with different locking mechanism).

Regards,
Horace Ng

- Original Message -
From: w...@globe.de
To: ceph-users@lists.ceph.com
Sent: Thursday, July 21, 2016 5:11:21 PM
Subject: [ceph-users] Ceph + VMware + Single Thread Performance

Hi everyone,

we see at our cluster relatively slow Single Thread Performance on the iscsi 
Nodes.


Our setup:

3 Racks:

18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).

2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x
WD Red 1TB per Data Node as OSD.

Replication = 3

chooseleaf = 3 type Rack in the crush map
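
For reference, one way to express "3 replicas, each in a different rack" with the ceph CLI is sketched below (rule and root names are assumptions; the resulting rule id then gets assigned to the pool):

ceph osd crush rule create-simple replicated-rack default rack firstn
ceph osd pool set rbd size 3
ceph osd pool set rbd crush_ruleset <rule-id>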


We get only ca. 90 MByte/s on the iscsi Gateway Servers with:

rados bench -p rbd 60 write -b 4M -t 1


If we test with:

rados bench -p rbd 60 write -b 4M -t 32

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
 

From: Jake Young [mailto:jak3...@gmail.com] 
Sent: 21 July 2016 13:24
To: n...@fisk.me.uk; w...@globe.de
Cc: Horace Ng ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

 

My workaround to your single threaded performance issue was to increase the 
thread count of the tgtd process (I added --nr_iothreads=128 as an argument to 
tgtd).  This does help my workload.  
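
For example (the sysconfig path below is an assumption and varies by distribution):

tgtd --nr_iothreads=128                      # start tgtd with a larger IO thread pool
# on RHEL/CentOS this can usually go into /etc/sysconfig/tgtd:
# TGTD_OPTIONS="--nr_iothreads=128"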

 

FWIW below are my rados bench numbers from my cluster with 1 thread:

 

This first one is a "cold" run. This is a test pool, and it's not in use.  This 
is the first time I've written to it in a week (but I have written to it 
before). 

 

Total time run: 60.049311

Total writes made:  1196

Write size: 4194304

Bandwidth (MB/sec): 79.668

 

Stddev Bandwidth:   80.3998

Max bandwidth (MB/sec): 208

Min bandwidth (MB/sec): 0

Average Latency:0.0502066

Stddev Latency: 0.47209

Max latency:12.9035

Min latency:0.013051

 

This next one is the 6th run. I honestly don't understand why there is such a 
huge performance difference. 

 

Total time run: 60.042933

Total writes made:  2980

Write size: 4194304

Bandwidth (MB/sec): 198.525

 

Stddev Bandwidth:   32.129

Max bandwidth (MB/sec): 224

Min bandwidth (MB/sec): 0

Average Latency:0.0201471

Stddev Latency: 0.0126896

Max latency:0.265931

Min latency:0.013211

 

 

75 OSDs, all 2TB SAS spinners.  There are 9 OSD servers, each with a 2GB BBU RAID cache.

 

I have tuned my CPU c-states and freq to max. I have 8x 2.5GHz cores, so just about one core per OSD. I have 40G networking.  I don't use journals, but I have the RAID cache enabled.

 

 

Nick,

 

What NFS server are you using?

 

The kernel one. It seems to be working really well so far after I got past the XFS fragmentation issues; I had to set an extent size hint of 16MB at the root.
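
For reference, that kind of extent size hint can be set with xfs_io; a minimal sketch, assuming the NFS export sits on an XFS filesystem mounted at /srv/nfs (path is a placeholder):

# Set a 16 MB extent size hint on the root of the export;
# new files created below it pick up the hint.
xfs_io -c "extsize 16m" /srv/nfs

# Verify (value is reported in bytes)
xfs_io -c "extsize" /srv/nfs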

 

 

Jake 

 


On Thursday, July 21, 2016, Nick Fisk <n...@fisk.me.uk> wrote:

I've had a lot of pain with this, smaller block sizes are even worse. You want 
to try and minimize latency at every point as there
is no buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update happening with 
VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched over to NFS as 
it has given much more performance at scale and less
headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com  ] 
> On Behalf Of Horace
> Sent: 21 July 2016 10:26
> To: w...@globe.de  
> Cc: ceph-users@lists.ceph.com  
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
>
> Hi,
>
> Same here, I've read some blog saying that vmware will frequently verify the 
> locking on VMFS over iSCSI, hence it will have much
> slower performance than NFS (with different locking mechanism).
>
> Regards,
> Horace Ng
>
> - Original Message -
> From: w...@globe.de  
> To: ceph-users@lists.ceph.com  
> Sent: Thursday, July 21, 2016 5:11:21 PM
> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
>
> Hi everyone,
>
> we see at our cluster relatively slow Single Thread Performance on the iscsi 
> Nodes.
>
>
> Our setup:
>
> 3 Racks:
>
> 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).
>
> 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
> Red 1TB per Data Node as OSD.
>
> Replication = 3
>
> chooseleaf = 3 type Rack in the crush map
>
>
> We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
>
> rados bench -p rbd 60 write -b 4M -t 1
>
>
> If we test with:
>
> rados bench -p rbd 60 write -b 4M -t 32
>
> we get ca. 600 - 700 MByte/s
>
>
> We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for
> the Journal to get better Single Thread Performance.
>
> Is anyone of you out there who has an Intel P3700 for Journal an can
&

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
> -Original Message-
> From: w...@globe.de [mailto:w...@globe.de]
> Sent: 21 July 2016 13:23
> To: n...@fisk.me.uk; 'Horace Ng' 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Okay and what is your plan now to speed up ?

Now that I have come up with a lower-latency hardware design, there is not much further improvement to be had until persistent RBD caching is implemented, as that will move the SSD/NVMe closer to the client. But I'm happy with what I can achieve at the moment. You could also experiment with bcache on the RBD.
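
A rough sketch of what that bcache experiment could look like, assuming a spare NVMe partition as the cache device and an already-mapped RBD as the backing device (device names and the cache-set UUID are placeholders):

# Format the backing (RBD) and cache (NVMe) devices for bcache
make-bcache -B /dev/rbd0
make-bcache -C /dev/nvme0n1p1

# Attach the cache set to the resulting bcache device
# (UUID comes from 'bcache-super-show /dev/nvme0n1p1')
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

# Writeback mode so small sync writes are acknowledged from the local NVMe
echo writeback > /sys/block/bcache0/bcache/cache_mode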

> 
> Would it help to put in multiple P3700 per OSD Node to improve performance 
> for a single Thread (example Storage VMotion) ?

Most likely not; it's all the other parts of the puzzle that are causing the latency. ESXi was designed for storage arrays that service IOs in the 100us-1ms range; Ceph is probably about 10x slower than this, hence the problem. Disable the BBWC on a RAID controller or SAN and you will see the same behaviour.

> 
> Regards
> 
> 
> Am 21.07.16 um 14:17 schrieb Nick Fisk:
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of w...@globe.de
> >> Sent: 21 July 2016 13:04
> >> To: n...@fisk.me.uk; 'Horace Ng' 
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi,
> >>
> >> hmm i think 200 MByte/s is really bad. Is your Cluster in production right 
> >> now?
> > It's just been built, not running yet.
> >
> >> So if you start a storage migration you get only 200 MByte/s right?
> > I wish. My current cluster (not this new one) would storage migrate at
> > ~10-15MB/s. Serial latency is the problem, without being able to
> > buffer, ESXi waits on an ack for each IO before sending the next. Also it 
> > submits the migrations in 64kb chunks, unless you get VAAI
> working. I think esxi will try and do them in parallel, which will help as 
> well.
> >
> >> I think it would be awesome if you get 1000 MByte/s
> >>
> >> Where is the Bottleneck?
> > Latency serialisation, without a buffer, you can't drive the devices
> > to 100%. With buffered IO (or high queue depths) I can max out the journals.
> >
> >> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from the 
> >> P3700.
> >>
> >> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your
> >> -ssd-is-suitable-as-a-journal-device/
> >>
> >> How could it be that the rbd client performance is 50% slower?
> >>
> >> Regards
> >>
> >>
> >> Am 21.07.16 um 12:15 schrieb Nick Fisk:
> >>> I've had a lot of pain with this, smaller block sizes are even worse.
> >>> You want to try and minimize latency at every point as there is no
> >>> buffering happening in the iSCSI stack. This means:-
> >>>
> >>> 1. Fast journals (NVME or NVRAM)
> >>> 2. 10GB or better networking
> >>> 3. Fast CPU's (Ghz)
> >>> 4. Fix CPU c-state's to C1
> >>> 5. Fix CPU's Freq to max
> >>>
> >>> Also I can't be sure, but I think there is a metadata update
> >>> happening with VMFS, particularly if you are using thin VMDK's, this
> >>> can also be a major bottleneck. For my use case, I've switched over to 
> >>> NFS as it has given much more performance at scale and
> less headache.
> >>>
> >>> For the RADOS Run, here you go (400GB P3700):
> >>>
> >>> Total time run: 60.026491
> >>> Total writes made:  3104
> >>> Write size: 4194304
> >>> Object size:4194304
> >>> Bandwidth (MB/sec): 206.842
> >>> Stddev Bandwidth:   8.10412
> >>> Max bandwidth (MB/sec): 224
> >>> Min bandwidth (MB/sec): 180
> >>> Average IOPS:   51
> >>> Stddev IOPS:2
> >>> Max IOPS:   56
> >>> Min IOPS:   45
> >>> Average Latency(s): 0.0193366
> >>> Stddev Latency(s):  0.00148039
> >>> Max latency(s):     0.0377946
> >>> Min latency(s): 0.015909
> >>>
> >>> Nick
> >>>
> >>>> -Original Message-
> >>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> &g

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Jake Young
My workaround to your single threaded performance issue was to increase the
thread count of the tgtd process (I added --nr_iothreads=128 as an argument
to tgtd).  This does help my workload.

FWIW below are my rados bench numbers from my cluster with 1 thread:

This first one is a "cold" run. This is a test pool, and it's not in use.
This is the first time I've written to it in a week (but I have written to
it before).

Total time run: 60.049311
Total writes made:  1196
Write size: 4194304
Bandwidth (MB/sec): 79.668

Stddev Bandwidth:   80.3998
Max bandwidth (MB/sec): 208
Min bandwidth (MB/sec): 0
Average Latency:0.0502066
Stddev Latency: 0.47209
Max latency:12.9035
Min latency:0.013051

This next one is the 6th run. I honestly don't understand why there is such
a huge performance difference.

Total time run: 60.042933
Total writes made:  2980
Write size: 4194304
Bandwidth (MB/sec): 198.525

Stddev Bandwidth:   32.129
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 0
Average Latency:0.0201471
Stddev Latency: 0.0126896
Max latency:0.265931
Min latency:0.013211


75 OSDs, all 2TB SAS spinners.  There are 9 OSD servers, each with a 2GB BBU RAID cache.

I have tuned my CPU c-states and freq to max. I have 8x 2.5GHz cores, so just about one core per OSD. I have 40G networking.  I don't use journals, but I have the RAID cache enabled.


Nick,

What NFS server are you using?

Jake


On Thursday, July 21, 2016, Nick Fisk  wrote:

> I've had a lot of pain with this, smaller block sizes are even worse. You
> want to try and minimize latency at every point as there
> is no buffering happening in the iSCSI stack. This means:-
>
> 1. Fast journals (NVME or NVRAM)
> 2. 10GB or better networking
> 3. Fast CPU's (Ghz)
> 4. Fix CPU c-state's to C1
> 5. Fix CPU's Freq to max
>
> Also I can't be sure, but I think there is a metadata update happening
> with VMFS, particularly if you are using thin VMDK's, this
> can also be a major bottleneck. For my use case, I've switched over to NFS
> as it has given much more performance at scale and less
> headache.
>
> For the RADOS Run, here you go (400GB P3700):
>
> Total time run: 60.026491
> Total writes made:  3104
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 206.842
> Stddev Bandwidth:   8.10412
> Max bandwidth (MB/sec): 224
> Min bandwidth (MB/sec): 180
> Average IOPS:   51
> Stddev IOPS:2
> Max IOPS:   56
> Min IOPS:   45
> Average Latency(s): 0.0193366
> Stddev Latency(s):  0.00148039
> Max latency(s): 0.0377946
> Min latency(s): 0.015909
>
> Nick
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> ] On Behalf Of Horace
> > Sent: 21 July 2016 10:26
> > To: w...@globe.de 
> > Cc: ceph-users@lists.ceph.com 
> > Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >
> > Hi,
> >
> > Same here, I've read some blog saying that vmware will frequently verify
> the locking on VMFS over iSCSI, hence it will have much
> > slower performance than NFS (with different locking mechanism).
> >
> > Regards,
> > Horace Ng
> >
> > - Original Message -
> > From: w...@globe.de 
> > To: ceph-users@lists.ceph.com 
> > Sent: Thursday, July 21, 2016 5:11:21 PM
> > Subject: [ceph-users] Ceph + VMware + Single Thread Performance
> >
> > Hi everyone,
> >
> > we see at our cluster relatively slow Single Thread Performance on the
> iscsi Nodes.
> >
> >
> > Our setup:
> >
> > 3 Racks:
> >
> > 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache
> off).
> >
> > 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
> > Red 1TB per Data Node as OSD.
> >
> > Replication = 3
> >
> > chooseleaf = 3 type Rack in the crush map
> >
> >
> > We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
> >
> > rados bench -p rbd 60 write -b 4M -t 1
> >
> >
> > If we test with:
> >
> > rados bench -p rbd 60 write -b 4M -t 32
> >
> > we get ca. 600 - 700 MByte/s
> >
> >
> > We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for
> > the Journal to get better Single Thread Performance.
> >
> > Is anyone of you out there who has an Intel P3700 for Journal an can
> > give me back test results with:
>

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

Okay and what is your plan now to speed up ?

Would it help to put in multiple P3700 per OSD Node to improve 
performance for a single Thread (example Storage VMotion) ?


Regards


Am 21.07.16 um 14:17 schrieb Nick Fisk:

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
w...@globe.de
Sent: 21 July 2016 13:04
To: n...@fisk.me.uk; 'Horace Ng' 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

hmm i think 200 MByte/s is really bad. Is your Cluster in production right now?

It's just been built, not running yet.


So if you start a storage migration you get only 200 MByte/s right?

I wish. My current cluster (not this new one) would storage migrate at 
~10-15MB/s. Serial latency is the problem, without being able
to buffer, ESXi waits on an ack for each IO before sending the next. Also it 
submits the migrations in 64kb chunks, unless you get
VAAI working. I think esxi will try and do them in parallel, which will help as 
well.


I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the devices to 100%. 
With buffered IO (or high queue depths) I can max out
the journals.


A FIO Test from Sebastien Han give us 400 MByte/s raw performance from the 
P3700.

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

How could it be that the rbd client performance is 50% slower?

Regards


Am 21.07.16 um 12:15 schrieb Nick Fisk:

I've had a lot of pain with this, smaller block sizes are even worse.
You want to try and minimize latency at every point as there is no
buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update happening
with VMFS, particularly if you are using thin VMDK's, this can also be
a major bottleneck. For my use case, I've switched over to NFS as it has given 
much more performance at scale and less headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of Horace
Sent: 21 July 2016 10:26
To: w...@globe.de
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Same here, I've read some blog saying that vmware will frequently
verify the locking on VMFS over iSCSI, hence it will have much slower 
performance than NFS (with different locking mechanism).

Regards,
Horace Ng

- Original Message -
From: w...@globe.de
To: ceph-users@lists.ceph.com
Sent: Thursday, July 21, 2016 5:11:21 PM
Subject: [ceph-users] Ceph + VMware + Single Thread Performance

Hi everyone,

we see at our cluster relatively slow Single Thread Performance on the iscsi 
Nodes.


Our setup:

3 Racks:

18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).

2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
Red 1TB per Data Node as OSD.

Replication = 3

chooseleaf = 3 type Rack in the crush map


We get only ca. 90 MByte/s on the iscsi Gateway Servers with:

rados bench -p rbd 60 write -b 4M -t 1


If we test with:

rados bench -p rbd 60 write -b 4M -t 32

we get ca. 600 - 700 MByte/s


We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for
the Journal to get better Single Thread Performance.

Is anyone of you out there who has an Intel P3700 for Journal an can
give me back test results with:


rados bench -p rbd 60 write -b 4M -t 1


Thank you very much !!

Kind Regards !!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> w...@globe.de
> Sent: 21 July 2016 13:04
> To: n...@fisk.me.uk; 'Horace Ng' 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi,
> 
> hmm i think 200 MByte/s is really bad. Is your Cluster in production right 
> now?

It's just been built, not running yet.

> 
> So if you start a storage migration you get only 200 MByte/s right?

I wish. My current cluster (not this new one) would storage migrate at 
~10-15MB/s. Serial latency is the problem, without being able
to buffer, ESXi waits on an ack for each IO before sending the next. Also it 
submits the migrations in 64kb chunks, unless you get
VAAI working. I think esxi will try and do them in parallel, which will help as 
well.

> 
> I think it would be awesome if you get 1000 MByte/s
> 
> Where is the Bottleneck?

Latency serialisation, without a buffer, you can't drive the devices to 100%. 
With buffered IO (or high queue depths) I can max out
the journals.

> 
> A FIO Test from Sebastien Han give us 400 MByte/s raw performance from the 
> P3700.
> 
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> 
> How could it be that the rbd client performance is 50% slower?
> 
> Regards
> 
> 
> Am 21.07.16 um 12:15 schrieb Nick Fisk:
> > I've had a lot of pain with this, smaller block sizes are even worse.
> > You want to try and minimize latency at every point as there is no
> > buffering happening in the iSCSI stack. This means:-
> >
> > 1. Fast journals (NVME or NVRAM)
> > 2. 10GB or better networking
> > 3. Fast CPU's (Ghz)
> > 4. Fix CPU c-state's to C1
> > 5. Fix CPU's Freq to max
> >
> > Also I can't be sure, but I think there is a metadata update happening
> > with VMFS, particularly if you are using thin VMDK's, this can also be
> > a major bottleneck. For my use case, I've switched over to NFS as it has 
> > given much more performance at scale and less headache.
> >
> > For the RADOS Run, here you go (400GB P3700):
> >
> > Total time run: 60.026491
> > Total writes made:  3104
> > Write size: 4194304
> > Object size:4194304
> > Bandwidth (MB/sec): 206.842
> > Stddev Bandwidth:   8.10412
> > Max bandwidth (MB/sec): 224
> > Min bandwidth (MB/sec): 180
> > Average IOPS:   51
> > Stddev IOPS:2
> > Max IOPS:   56
> > Min IOPS:   45
> > Average Latency(s): 0.0193366
> > Stddev Latency(s):  0.00148039
> > Max latency(s): 0.0377946
> > Min latency(s): 0.015909
> >
> > Nick
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Horace
> >> Sent: 21 July 2016 10:26
> >> To: w...@globe.de
> >> Cc: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi,
> >>
> >> Same here, I've read some blog saying that vmware will frequently
> >> verify the locking on VMFS over iSCSI, hence it will have much slower 
> >> performance than NFS (with different locking mechanism).
> >>
> >> Regards,
> >> Horace Ng
> >>
> >> - Original Message -
> >> From: w...@globe.de
> >> To: ceph-users@lists.ceph.com
> >> Sent: Thursday, July 21, 2016 5:11:21 PM
> >> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
> >>
> >> Hi everyone,
> >>
> >> we see at our cluster relatively slow Single Thread Performance on the 
> >> iscsi Nodes.
> >>
> >>
> >> Our setup:
> >>
> >> 3 Racks:
> >>
> >> 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache 
> >> off).
> >>
> >> 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
> >> Red 1TB per Data Node as OSD.
> >>
> >> Replication = 3
> >>
> >> chooseleaf = 3 type Rack in the crush map
> >>
> >>
> >> We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
> >>
> >> rados bench -p rbd 60 write -b 4M -t 1
> >>
> >>
> >> If we test with:
> >>
> >> rados bench -p rbd 60 write -b 4M -t 32
> >>
> >> we get ca. 

Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

Hi,

hmm i think 200 MByte/s is really bad. Is your Cluster in production 
right now?


So if you start a storage migration you get only 200 MByte/s right?

I think it would be awesome if you get 1000 MByte/s

Where is the Bottleneck?

A FIO Test from Sebastien Han give us 400 MByte/s raw performance from 
the P3700.


https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
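
The journal test from that post is roughly the following fio invocation (a sketch; the device name is a placeholder and the test writes straight to the device, so only run it against a disk you can wipe):

fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test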

How could it be that the rbd client performance is 50% slower?

Regards


Am 21.07.16 um 12:15 schrieb Nick Fisk:

I've had a lot of pain with this, smaller block sizes are even worse. You want 
to try and minimize latency at every point as there
is no buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max

Also I can't be sure, but I think there is a metadata update happening with 
VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched over to NFS as 
it has given much more performance at scale and less
headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Horace
Sent: 21 July 2016 10:26
To: w...@globe.de
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance

Hi,

Same here, I've read some blog saying that vmware will frequently verify the 
locking on VMFS over iSCSI, hence it will have much
slower performance than NFS (with different locking mechanism).

Regards,
Horace Ng

- Original Message -
From: w...@globe.de
To: ceph-users@lists.ceph.com
Sent: Thursday, July 21, 2016 5:11:21 PM
Subject: [ceph-users] Ceph + VMware + Single Thread Performance

Hi everyone,

we see at our cluster relatively slow Single Thread Performance on the iscsi 
Nodes.


Our setup:

3 Racks:

18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).

2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
Red 1TB per Data Node as OSD.

Replication = 3

chooseleaf = 3 type Rack in the crush map


We get only ca. 90 MByte/s on the iscsi Gateway Servers with:

rados bench -p rbd 60 write -b 4M -t 1


If we test with:

rados bench -p rbd 60 write -b 4M -t 32

we get ca. 600 - 700 MByte/s


We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for
the Journal to get better Single Thread Performance.

Is anyone of you out there who has an Intel P3700 for Journal an can
give me back test results with:


rados bench -p rbd 60 write -b 4M -t 1


Thank you very much !!

Kind Regards !!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Nick Fisk
I've had a lot of pain with this, smaller block sizes are even worse. You want 
to try and minimize latency at every point as there
is no buffering happening in the iSCSI stack. This means:-

1. Fast journals (NVME or NVRAM)
2. 10GB or better networking
3. Fast CPU's (Ghz)
4. Fix CPU c-state's to C1
5. Fix CPU's Freq to max
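
For points 4 and 5, one rough way to do it on an Intel box (kernel parameters plus the cpupower tool; exact package names and flags may vary by distro):

# Kernel command line: cap the deepest C-state at C1, then reboot
#   intel_idle.max_cstate=1 processor.max_cstate=1

# Keep cores at their maximum frequency
cpupower frequency-set -g performance

# Check the result
cpupower frequency-info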

Also I can't be sure, but I think there is a metadata update happening with 
VMFS, particularly if you are using thin VMDK's, this
can also be a major bottleneck. For my use case, I've switched over to NFS as 
it has given much more performance at scale and less
headache.

For the RADOS Run, here you go (400GB P3700):

Total time run: 60.026491
Total writes made:  3104
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 206.842
Stddev Bandwidth:   8.10412
Max bandwidth (MB/sec): 224
Min bandwidth (MB/sec): 180
Average IOPS:   51
Stddev IOPS:2
Max IOPS:   56
Min IOPS:   45
Average Latency(s): 0.0193366
Stddev Latency(s):  0.00148039
Max latency(s): 0.0377946
Min latency(s): 0.015909

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Horace
> Sent: 21 July 2016 10:26
> To: w...@globe.de
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi,
> 
> Same here, I've read some blog saying that vmware will frequently verify the 
> locking on VMFS over iSCSI, hence it will have much
> slower performance than NFS (with different locking mechanism).
> 
> Regards,
> Horace Ng
> 
> - Original Message -
> From: w...@globe.de
> To: ceph-users@lists.ceph.com
> Sent: Thursday, July 21, 2016 5:11:21 PM
> Subject: [ceph-users] Ceph + VMware + Single Thread Performance
> 
> Hi everyone,
> 
> we see at our cluster relatively slow Single Thread Performance on the iscsi 
> Nodes.
> 
> 
> Our setup:
> 
> 3 Racks:
> 
> 18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).
> 
> 2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD
> Red 1TB per Data Node as OSD.
> 
> Replication = 3
> 
> chooseleaf = 3 type Rack in the crush map
> 
> 
> We get only ca. 90 MByte/s on the iscsi Gateway Servers with:
> 
> rados bench -p rbd 60 write -b 4M -t 1
> 
> 
> If we test with:
> 
> rados bench -p rbd 60 write -b 4M -t 32
> 
> we get ca. 600 - 700 MByte/s
> 
> 
> We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for
> the Journal to get better Single Thread Performance.
> 
> Is anyone of you out there who has an Intel P3700 for Journal an can
> give me back test results with:
> 
> 
> rados bench -p rbd 60 write -b 4M -t 1
> 
> 
> Thank you very much !!
> 
> Kind Regards !!
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread Horace
Hi,

Same here, I've read some blog saying that vmware will frequently verify the 
locking on VMFS over iSCSI, hence it will have much slower performance than NFS 
(with different locking mechanism).

Regards,
Horace Ng

- Original Message -
From: w...@globe.de
To: ceph-users@lists.ceph.com
Sent: Thursday, July 21, 2016 5:11:21 PM
Subject: [ceph-users] Ceph + VMware + Single Thread Performance

Hi everyone,

we see at our cluster relatively slow Single Thread Performance on the 
iscsi Nodes.


Our setup:

3 Racks:

18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).

2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD 
Red 1TB per Data Node as OSD.

Replication = 3

chooseleaf = 3 type Rack in the crush map


We get only ca. 90 MByte/s on the iscsi Gateway Servers with:

rados bench -p rbd 60 write -b 4M -t 1


If we test with:

rados bench -p rbd 60 write -b 4M -t 32

we get ca. 600 - 700 MByte/s


We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for 
the Journal to get better Single Thread Performance.

Is anyone of you out there who has an Intel P3700 for Journal an can 
give me back test results with:


rados bench -p rbd 60 write -b 4M -t 1


Thank you very much !!

Kind Regards !!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph + VMware + Single Thread Performance

2016-07-21 Thread w...@globe.de

Hi everyone,

we see at our cluster relatively slow Single Thread Performance on the 
iscsi Nodes.



Our setup:

3 Racks:

18x Data Nodes, 3 Mon Nodes, 3 iscsi Gateway Nodes with tgt (rbd cache off).

2x Samsung SM863 Enterprise SSD for Journal (3 OSD per SSD) and 6x WD 
Red 1TB per Data Node as OSD.


Replication = 3

chooseleaf = 3 type Rack in the crush map


We get only ca. 90 MByte/s on the iscsi Gateway Servers with:

rados bench -p rbd 60 write -b 4M -t 1


If we test with:

rados bench -p rbd 60 write -b 4M -t 32

we get ca. 600 - 700 MByte/s


We plan to replace the Samsung SSD with Intel DC P3700 PCIe NVM'e for 
the Journal to get better Single Thread Performance.


Is anyone of you out there who has an Intel P3700 for Journal and can
give me back test results with:



rados bench -p rbd 60 write -b 4M -t 1


Thank you very much !!

Kind Regards !!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-20 Thread Mike Christie
On 07/20/2016 11:52 AM, Jan Schermer wrote:
> 
>> On 20 Jul 2016, at 18:38, Mike Christie  wrote:
>>
>> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>>>
>>> Hi Mike,
>>>
>>> Thanks for the update on the RHCS iSCSI target.
>>>
>>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>>> it too early to say / announce).
>>
>> No HA support for sure. We are looking into non HA support though.
>>
>>>
>>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>>> so we'll just have to remap RBDs to RHCS targets when it's available.
>>>
>>> So we're currently running :
>>>
>>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>>> has all VAAI primitives enabled and run the same configuration.
>>> - RBD images are mapped on each target using the kernel client (so no
>>> RBD cache).
>>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>>> but in a failover manner so that each ESXi always access the same LUN
>>> through one target at a time.
>>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>>> (except UNMAP as per default).
>>>
>>> Do you see anthing risky regarding this configuration ?
>>
>> If you use a application that uses scsi persistent reservations then you
>> could run into troubles, because some apps expect the reservation info
>> to be on the failover nodes as well as the active ones.
>>
>> Depending on the how you do failover and the issue that caused the
>> failover, IO could be stuck on the old active node and cause data
>> corruption. If the initial active node looses its network connectivity
>> and you failover, you have to make sure that the initial active node is
>> fenced off and IO stuck on that node will never be executed. So do
>> something like add it to the ceph monitor blacklist and make sure IO on
>> that node is flushed and failed before unblacklisting it.
>>
> 
> With iSCSI you can't really do hot failover unless you only use synchronous 
> IO.
> (With any of opensource target softwares available).

That is what we are working on adding.

Why did you only say iSCSI though?

> Flushing the buffers doesn't really help because you don't know what 
> in-flight IO happened before the outage

To be clear, when I wrote flush I did not mean cache buffers. I only meant the target's list of commands.

And, regarding the unblacklist comment: it is best to unmap images that are under a blacklist and then remap them. The osd blacklist remove command would leave some krbd structs in a bad state.

> and which didn't. You could end with only part of the "transaction" written 
> on persistent storage.
> 

Maybe I am not sure what you mean by hot failover.

If you are failing over for the case where one node just goes
unreachable, then if you blacklist it before making another node active
you know IO that had not been sent will be failed and never execute,
partially sent IO will be failed and not execute. IO that was sent to
the OSD and is executing will be completed by the OSD before new IO to the
same sectors, so you would not end up with what looks like partial
transactions if you later did a read.

If the OSDs die mid write you could end up with part of a command written, but that could happen with any SCSI based protocol.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-20 Thread Jake Young
On Wednesday, July 20, 2016, Jan Schermer  wrote:

>
> > On 20 Jul 2016, at 18:38, Mike Christie  > wrote:
> >
> > On 07/20/2016 03:50 AM, Frédéric Nass wrote:
> >>
> >> Hi Mike,
> >>
> >> Thanks for the update on the RHCS iSCSI target.
> >>
> >> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
> >> it too early to say / announce).
> >
> > No HA support for sure. We are looking into non HA support though.
> >
> >>
> >> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
> >> so we'll just have to remap RBDs to RHCS targets when it's available.
> >>
> >> So we're currently running :
> >>
> >> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
> >> has all VAAI primitives enabled and run the same configuration.
> >> - RBD images are mapped on each target using the kernel client (so no
> >> RBD cache).
> >> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
> >> but in a failover manner so that each ESXi always access the same LUN
> >> through one target at a time.
> >> - LUNs are VMFS datastores and VAAI primitives are enabled client side
> >> (except UNMAP as per default).
> >>
> >> Do you see anthing risky regarding this configuration ?
> >
> > If you use a application that uses scsi persistent reservations then you
> > could run into troubles, because some apps expect the reservation info
> > to be on the failover nodes as well as the active ones.
> >
> > Depending on the how you do failover and the issue that caused the
> > failover, IO could be stuck on the old active node and cause data
> > corruption. If the initial active node looses its network connectivity
> > and you failover, you have to make sure that the initial active node is
> > fenced off and IO stuck on that node will never be executed. So do
> > something like add it to the ceph monitor blacklist and make sure IO on
> > that node is flushed and failed before unblacklisting it.
> >
>
> With iSCSI you can't really do hot failover unless you only use
> synchronous IO.


VMware does only use synchronous IO. Since the hypervisor can't tell what
type of data the VMs are writing, all IO is treated as needing to be
synchronous.

(With any of opensource target softwares available).
> Flushing the buffers doesn't really help because you don't know what
> in-flight IO happened before the outage
> and which didn't. You could end with only part of the "transaction"
> written on persistent storage.
>
> If you only use synchronous IO all the way from client to the persistent
> storage shared between
> iSCSI target then all should be fine, otherwise YMMV - some people run it
> like that without realizing
> the dangers and have never had a problem, so it may be strictly
> theoretical, and it all depends on how often you need to do the
> failover and what data you are storing - corrupting a few images on a
> gallery site could be fine but corrupting
> a large database tablespace is no fun at all.


No, it's not. VMFS corruption is pretty bad too and there is no fsck for
VMFS...


>
> Some (non opensource) solutions exist, Solaris supposedly does this in
> some(?) way, maybe some iSCSI guru
> can chime tell us what magic they do, but I don't think it's possible
> without client support
> (you essentialy have to do something like transactions and replay the last
> transaction on failover). Maybe
> something can be enabled in protocol to do the iSCSI IO synchronous or
> make it at least wait for some sort of ACK from the
> server (which would require some sort of cache mirroring between the
> targets) without making it synchronous all the way.


This is why the SAN vendors wrote their own clients and drivers. It is not
possible to dynamically make all OS's do what your iSCSI target expects.

Something like VMware does the right thing pretty much all the time (there
are some iSCSI initiator bugs in earlier ESXi 5.x).  If you have control of
your ESXi hosts then attempting to set up HA iSCSI targets is possible.

If you have a mixed client environment with various versions of Windows
connecting to the target, you may be better off buying some SAN appliances.


> The one time I had to use it I resorted to simply mirroring in via mdraid
> on the client side over two targets sharing the same
> DAS, and this worked fine during testing but never went to production in
> the end.
>
> Jan
>
> >
> >>
> >> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
> >> clients ?
> >
> > I can't say, because I have not used stgt with rbd bs-type support
> enough.


For starters, STGT doesn't implement VAAI properly and you will need to
disable VAAI in ESXi.

LIO does seem to implement VAAI properly, but performance is not nearly as
good as STGT, even with VAAI's benefits. The assumed cause is that LIO currently uses kernel rbd mapping, and kernel rbd performance is not as good as librbd.

I recently did a simple test of creating an 80GB eager zeroed disk with
STGT (VAAI dis

Re: [ceph-users] ceph + vmware

2016-07-20 Thread Jan Schermer

> On 20 Jul 2016, at 18:38, Mike Christie  wrote:
> 
> On 07/20/2016 03:50 AM, Frédéric Nass wrote:
>> 
>> Hi Mike,
>> 
>> Thanks for the update on the RHCS iSCSI target.
>> 
>> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
>> it too early to say / announce).
> 
> No HA support for sure. We are looking into non HA support though.
> 
>> 
>> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
>> so we'll just have to remap RBDs to RHCS targets when it's available.
>> 
>> So we're currently running :
>> 
>> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
>> has all VAAI primitives enabled and run the same configuration.
>> - RBD images are mapped on each target using the kernel client (so no
>> RBD cache).
>> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
>> but in a failover manner so that each ESXi always access the same LUN
>> through one target at a time.
>> - LUNs are VMFS datastores and VAAI primitives are enabled client side
>> (except UNMAP as per default).
>> 
>> Do you see anthing risky regarding this configuration ?
> 
> If you use a application that uses scsi persistent reservations then you
> could run into troubles, because some apps expect the reservation info
> to be on the failover nodes as well as the active ones.
> 
> Depending on the how you do failover and the issue that caused the
> failover, IO could be stuck on the old active node and cause data
> corruption. If the initial active node looses its network connectivity
> and you failover, you have to make sure that the initial active node is
> fenced off and IO stuck on that node will never be executed. So do
> something like add it to the ceph monitor blacklist and make sure IO on
> that node is flushed and failed before unblacklisting it.
> 

With iSCSI you can't really do hot failover unless you only use synchronous IO (with any of the open-source target softwares available). Flushing the buffers doesn't really help because you don't know what in-flight IO happened before the outage and which didn't. You could end up with only part of the "transaction" written on persistent storage.

If you only use synchronous IO all the way from the client to the persistent storage shared between the iSCSI targets, then all should be fine; otherwise YMMV. Some people run it like that without realizing the dangers and have never had a problem, so it may be strictly theoretical. It all depends on how often you need to do the failover and what data you are storing - corrupting a few images on a gallery site could be fine, but corrupting a large database tablespace is no fun at all.

Some (non open-source) solutions exist; Solaris supposedly does this in some(?) way, and maybe some iSCSI guru can chime in and tell us what magic they do, but I don't think it's possible without client support (you essentially have to do something like transactions and replay the last transaction on failover). Maybe something can be enabled in the protocol to make the iSCSI IO synchronous, or at least make it wait for some sort of ACK from the server (which would require some sort of cache mirroring between the targets), without making it synchronous all the way.

The one time I had to use it I resorted to simply mirroring via mdraid on the client side over two targets sharing the same DAS, and this worked fine during testing but never went to production in the end.
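
A sketch of what such a client-side mdraid mirror over two targets could look like; /dev/sdb and /dev/sdc stand in for the same LUN as seen through the two iSCSI targets, and the bitmap option is just a sensible addition rather than necessarily what was used:

# RAID1 across the two paths/targets
mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal /dev/sdb /dev/sdc

# Put the filesystem on /dev/md0; the internal bitmap keeps resyncs short after a path drops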

Jan

> 
>> 
>> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
>> clients ?
> 
> I can't say, because I have not used stgt with rbd bs-type support enough.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-20 Thread Mike Christie
On 07/20/2016 03:50 AM, Frédéric Nass wrote:
> 
> Hi Mike,
> 
> Thanks for the update on the RHCS iSCSI target.
> 
> Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is
> it too early to say / announce).

No HA support for sure. We are looking into non HA support though.

> 
> Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS
> so we'll just have to remap RBDs to RHCS targets when it's available.
> 
> So we're currently running :
> 
> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
> has all VAAI primitives enabled and run the same configuration.
> - RBD images are mapped on each target using the kernel client (so no
> RBD cache).
> - 6 ESXi. Each ESXi can access to the same LUNs through both targets,
> but in a failover manner so that each ESXi always access the same LUN
> through one target at a time.
> - LUNs are VMFS datastores and VAAI primitives are enabled client side
> (except UNMAP as per default).
> 
> Do you see anthing risky regarding this configuration ?

If you use an application that uses scsi persistent reservations then you
could run into troubles, because some apps expect the reservation info
to be on the failover nodes as well as the active ones.

Depending on how you do failover and the issue that caused the
failover, IO could be stuck on the old active node and cause data
corruption. If the initial active node loses its network connectivity
and you failover, you have to make sure that the initial active node is
fenced off and IO stuck on that node will never be executed. So do
something like add it to the ceph monitor blacklist and make sure IO on
that node is flushed and failed before unblacklisting it.
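
A sketch of that fencing flow using the ceph CLI of that era (the client address is a placeholder; take the real addr/nonce from 'ceph osd dump' or the client session):

# Fence the old gateway so its in-flight IO can never land
ceph osd blacklist add 192.168.1.10:0/0

# ... fail the target over, unmap/remap the images on the old node ...

# Once the node is known clean, remove the entry again
ceph osd blacklist ls
ceph osd blacklist rm 192.168.1.10:0/0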


> 
> Would you recommend LIO or STGT (with rbd bs-type) target for ESXi
> clients ?

I can't say, because I have not used stgt with rbd bs-type support enough.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-20 Thread Frédéric Nass


Hi Mike,

Thanks for the update on the RHCS iSCSI target.

Will RHCS 2.1 iSCSI target be compliant with VMWare ESXi client ? (or is 
it too early to say / announce).


Knowing that HA iSCSI target was on the roadmap, we chose iSCSI over NFS 
so we'll just have to remap RBDs to RHCS targets when it's available.


So we're currently running :

- 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target 
has all VAAI primitives enabled and run the same configuration.
- RBD images are mapped on each target using the kernel client (so no 
RBD cache).
- 6 ESXi. Each ESXi can access the same LUNs through both targets, but in a failover manner so that each ESXi always accesses the same LUN through one target at a time.
- LUNs are VMFS datastores and VAAI primitives are enabled client side 
(except UNMAP as per default).


Do you see anything risky regarding this configuration?

Would you recommend LIO or STGT (with rbd bs-type) target for ESXi clients ?

Best regards,

Frederic.

--

Frédéric Nass

Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine

Tél : +33 3 72 74 11 35



Le 11/07/2016 17:45, Mike Christie a écrit :

On 07/08/2016 02:22 PM, Oliver Dzombic wrote:

Hi,

does anyone have experience how to connect vmware with ceph smart ?

iSCSI multipath does not really worked well.

Are you trying to export rbd images from multiple iscsi targets at the
same time or just one target?

For the HA/multiple target setup, I am working on this for Red Hat. We
plan to release it in RHEL 7.3/RHCS 2.1. SUSE ships something already as
someone mentioned.

We just got a large chunk of code in the upstream kernel (it is in the
block layer maintainer's tree for the next kernel) so it should be
simple to add COMPARE_AND_WRITE support now. We should be posting krbd
exclusive lock support in the next couple weeks.



NFS could be, but i think thats just too much layers in between to have
some useable performance.

Systems like ScaleIO have developed a vmware addon to talk with it.

Is there something similar out there for ceph ?

What are you using ?

Thank you !


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-16 Thread Jake Young
On Saturday, July 16, 2016, Oliver Dzombic  wrote:

> Hi Jake,
>
> thank you very much both was needed, MTU and VAAI deactivated ( i hope
> that wont interfere with vmotion or other features ).
>
> I changed now the MTU of vmkernel and vswitch. That solved this problem.


Try turning VAAI back on at some point.


>
> So i could make an ext4 filesystem and mount it.
>
> Running
>
> dd if=/dev/zero of=/mnt/8G_test bs=4k count=2M conv=fdatasync
>
> Something is strange to me:
>
> The network gets streight 1 Gbit ( maximum connection ) of iscsi bandwidth.
>
> But inside the vm i can only see 40-50MB/s.
>
> I mean replicationsize is 2. So it would be easy to say 1/2 of 1 Gbit =
> 500 Mbit = 40-50MB/s.
>
> But should this reduction not be inside of the ceph cluster ? Which is
> going with 10G network ?
>
> I mean the data are hitting with 1 Gbit the ceph iscsi server. So now
> this is transported to RBD internally by tgt.
> And there its multiplied by 2 ( over the  cluster network which is 10G )
> before the ACK is sended back to iscsi. So the cluster will internally
> duplicate it via 10G. So my expected bandwidth inside the vm should be
> higher than half of the maximum speed.
>
> Is this a wrong understanding of the mechanism ?


The delay is most likely just having to wait for 2 disks to actually do the
write.


>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de 
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> Am 16.07.2016 um 02:18 schrieb Jake Young:
> > I had some odd issues like that due to MTU mismatch.
> >
> > Keep in mind that the vSwitch and vmkernel port have independent MTU
> > settings.  Verify you can ping with large size packets without
> > fragmentation between your host and iscsi target.
> >
> > If that's not it, you can try to disable VAAI options to see if one of
> > them is causing issues. I haven't used ESXi 6.0 yet.
> >
> > Jake
> >
> >
> > On Friday, July 15, 2016, Oliver Dzombic  
> > > wrote:
> >
> > Hi,
> >
> > i am currently trying out the stuff.
> >
> > My tgt config:
> >
> > # cat tgtd.conf
> > # The default config file
> > include /etc/tgt/targets.conf
> >
> > # Config files from other packages etc.
> > include /etc/tgt/conf.d/*.conf
> >
> > nr_iothreads=128
> >
> >
> > -
> >
> > # cat iqn.2016-07.tgt.esxi-test.conf
> > 
> >   initiator-address ALL
> >   scsi_sn esxi-test
> >   #vendor_id CEPH
> >   #controller_tid 1
> >   write-cache on
> >   read-cache on
> >   driver iscsi
> >   bs-type rbd
> >   
> >   lun 1
> >   scsi_id cf1c4a71e700506357
> >   
> >   
> >
> >
> > --
> >
> >
> > If i create a vm inside esxi 6 and try to format the virtual hdd, i
> see
> > in logs:
> >
> > sd:2:0:0:0: [sda] CDB:
> > Write(10): 2a 00 0f 86 a8 80 00 01 40 00
> > mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=880068aa5e00)
> > mptscsih: ioc0: attempting task abort! ( sc=880068aa4a80)
> >
> > With the LSI HDD emulation. With the vmware paravirtualization
> > everything just freeze.
> >
> > Any idea with that issue ?
> >
> > --
> > Mit freundlichen Gruessen / Best regards
> >
> > Oliver Dzombic
> > IP-Interactive
> >
> > mailto:i...@ip-interactive.de 
> >
> > Anschrift:
> >
> > IP Interactive UG ( haftungsbeschraenkt )
> > Zum Sonnenberg 1-3
> > 63571 Gelnhausen
> >
> > HRB 93402 beim Amtsgericht Hanau
> > Geschäftsführung: Oliver Dzombic
> >
> > Steuer Nr.: 35 236 3622 1
> > UST ID: DE274086107
> >
> >
> > Am 11.07.2016 um 22:24 schrieb Jake Young:
> > > I'm using this setup with ESXi 5.1 and I get very good
> performance.  I
> > > suspect you have other issues.  Reliability is another story (see
> > Nick's
> > > posts on tgt and HA to get an idea of the awful problems you can
> > have),
> > > but for my test labs the risk is acceptable.
> > >
> > >
> > > One change I found helpful is to run tgtd with 128 threads.  I'm
> > running
> > > Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and changed
> the
> > > line that read:
> > >
> > > exec tgtd
> > >
> > > to
> > >
> > > exec tgtd --nr_iothreads=128
> > >
> > >
> > > If you're not concerned with reliability, you can enhance
> throughput
> > > even more by enabling rbd client write-back cache in your tgt VM's
> > > ceph.conf file (you'll need to restart tgtd for this to take
> effect):
> > >
> > > [client]
> > > rbd_cache = true
> > > rbd_cache_size = 67108864 # (64MB)
> > > rbd_cache_max_dirty = 50331648 # (48MB)
> > > rbd_ca

Re: [ceph-users] ceph + vmware

2016-07-16 Thread Oliver Dzombic
Hi Jake,

thank you very much, both were needed: MTU and VAAI deactivated (I hope that won't interfere with vMotion or other features).

I changed now the MTU of vmkernel and vswitch. That solved this problem.

So i could make an ext4 filesystem and mount it.

Running

dd if=/dev/zero of=/mnt/8G_test bs=4k count=2M conv=fdatasync

Something is strange to me:

The network gets a straight 1 Gbit (the maximum connection) of iscsi bandwidth.

But inside the vm i can only see 40-50MB/s.

I mean the replication size is 2. So it would be easy to say 1/2 of 1 Gbit =
500 Mbit = 40-50MB/s.

But should this reduction not be inside of the ceph cluster ? Which is
going with 10G network ?

I mean the data hits the ceph iscsi server at 1 Gbit. So now this is transported to RBD internally by tgt, and there it is multiplied by 2 (over the cluster network, which is 10G) before the ACK is sent back to iscsi. So the cluster will internally duplicate it via 10G, and my expected bandwidth inside the vm should be higher than half of the maximum speed.

Is this a wrong understanding of the mechanism ?


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 16.07.2016 um 02:18 schrieb Jake Young:
> I had some odd issues like that due to MTU mismatch. 
> 
> Keep in mind that the vSwitch and vmkernel port have independent MTU
> settings.  Verify you can ping with large size packets without
> fragmentation between your host and iscsi target. 
> 
> If that's not it, you can try to disable VAAI options to see if one of
> them is causing issues. I haven't used ESXi 6.0 yet. 
> 
> Jake
> 
> 
> On Friday, July 15, 2016, Oliver Dzombic  > wrote:
> 
> Hi,
> 
> i am currently trying out the stuff.
> 
> My tgt config:
> 
> # cat tgtd.conf
> # The default config file
> include /etc/tgt/targets.conf
> 
> # Config files from other packages etc.
> include /etc/tgt/conf.d/*.conf
> 
> nr_iothreads=128
> 
> 
> -
> 
> # cat iqn.2016-07.tgt.esxi-test.conf
> 
>   initiator-address ALL
>   scsi_sn esxi-test
>   #vendor_id CEPH
>   #controller_tid 1
>   write-cache on
>   read-cache on
>   driver iscsi
>   bs-type rbd
>   
>   lun 1
>   scsi_id cf1c4a71e700506357
>   
>   
> 
> 
> --
> 
> 
> If i create a vm inside esxi 6 and try to format the virtual hdd, i see
> in logs:
> 
> sd:2:0:0:0: [sda] CDB:
> Write(10): 2a 00 0f 86 a8 80 00 01 40 00
> mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=880068aa5e00)
> mptscsih: ioc0: attempting task abort! ( sc=880068aa4a80)
> 
> With the LSI HDD emulation. With the vmware paravirtualization
> everything just freeze.
> 
> Any idea with that issue ?
> 
> --
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 11.07.2016 um 22:24 schrieb Jake Young:
> > I'm using this setup with ESXi 5.1 and I get very good performance.  I
> > suspect you have other issues.  Reliability is another story (see
> Nick's
> > posts on tgt and HA to get an idea of the awful problems you can
> have),
> > but for my test labs the risk is acceptable.
> >
> >
> > One change I found helpful is to run tgtd with 128 threads.  I'm
> running
> > Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and changed the
> > line that read:
> >
> > exec tgtd
> >
> > to
> >
> > exec tgtd --nr_iothreads=128
> >
> >
> > If you're not concerned with reliability, you can enhance throughput
> > even more by enabling rbd client write-back cache in your tgt VM's
> > ceph.conf file (you'll need to restart tgtd for this to take effect):
> >
> > [client]
> > rbd_cache = true
> > rbd_cache_size = 67108864 # (64MB)
> > rbd_cache_max_dirty = 50331648 # (48MB)
> > rbd_cache_target_dirty = 33554432 # (32MB)
> > rbd_cache_max_dirty_age = 2
> > rbd_cache_writethrough_until_flush = false
> >
> >
> >
> >
> > Here's a sample targets.conf:
> >
> >   
> >   initiator-address ALL
> >   scsi_sn Charter
> >   #vendor_id CEPH
> >   #controller_tid 1
> >   write-cache on
> >   read-cache on
> >   driver iscsi
> >   bs-type rbd
> >   
> >   lun 5
> >   scsi_id cfe1000c4a71e700506357
> 

Re: [ceph-users] ceph + vmware

2016-07-15 Thread Jake Young
I had some odd issues like that due to MTU mismatch.

Keep in mind that the vSwitch and vmkernel port have independent MTU
settings.  Verify you can ping with large size packets without
fragmentation between your host and iscsi target.
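
A quick way to check that (sizes assume a 9000-byte MTU end to end, i.e. payload = MTU minus 28 bytes of IP/ICMP headers; addresses are placeholders):

# From the ESXi host: don't-fragment with a jumbo-sized payload
vmkping -d -s 8972 <iscsi-target-ip>

# From the Linux iSCSI gateway back towards the vmkernel port
ping -M do -s 8972 <esxi-vmkernel-ip>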

If that's not it, you can try to disable VAAI options to see if one of them
is causing issues. I haven't used ESXi 6.0 yet.
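
If you do try that, the VAAI primitives are just advanced host settings; a sketch using ESXi 5.x-era esxcli syntax (set the values back to 1 to re-enable):

esxcli system settings advanced set -o /DataMover/HardwareAcceleratedMove -i 0
esxcli system settings advanced set -o /DataMover/HardwareAcceleratedInit -i 0
esxcli system settings advanced set -o /VMFS3/HardwareAcceleratedLocking -i 0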

Jake


On Friday, July 15, 2016, Oliver Dzombic  wrote:

> Hi,
>
> i am currently trying out the stuff.
>
> My tgt config:
>
> # cat tgtd.conf
> # The default config file
> include /etc/tgt/targets.conf
>
> # Config files from other packages etc.
> include /etc/tgt/conf.d/*.conf
>
> nr_iothreads=128
>
>
> -
>
> # cat iqn.2016-07.tgt.esxi-test.conf
> 
>   initiator-address ALL
>   scsi_sn esxi-test
>   #vendor_id CEPH
>   #controller_tid 1
>   write-cache on
>   read-cache on
>   driver iscsi
>   bs-type rbd
>   
>   lun 1
>   scsi_id cf1c4a71e700506357
>   
>   
>
>
> --
>
>
> If i create a vm inside esxi 6 and try to format the virtual hdd, i see
> in logs:
>
> sd:2:0:0:0: [sda] CDB:
> Write(10): 2a 00 0f 86 a8 80 00 01 40 00
> mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=880068aa5e00)
> mptscsih: ioc0: attempting task abort! ( sc=880068aa4a80)
>
> With the LSI HDD emulation. With the vmware paravirtualization
> everything just freeze.
>
> Any idea with that issue ?
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de 
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> Am 11.07.2016 um 22:24 schrieb Jake Young:
> > I'm using this setup with ESXi 5.1 and I get very good performance.  I
> > suspect you have other issues.  Reliability is another story (see Nick's
> > posts on tgt and HA to get an idea of the awful problems you can have),
> > but for my test labs the risk is acceptable.
> >
> >
> > One change I found helpful is to run tgtd with 128 threads.  I'm running
> > Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and changed the
> > line that read:
> >
> > exec tgtd
> >
> > to
> >
> > exec tgtd --nr_iothreads=128
> >
> >
> > If you're not concerned with reliability, you can enhance throughput
> > even more by enabling rbd client write-back cache in your tgt VM's
> > ceph.conf file (you'll need to restart tgtd for this to take effect):
> >
> > [client]
> > rbd_cache = true
> > rbd_cache_size = 67108864 # (64MB)
> > rbd_cache_max_dirty = 50331648 # (48MB)
> > rbd_cache_target_dirty = 33554432 # (32MB)
> > rbd_cache_max_dirty_age = 2
> > rbd_cache_writethrough_until_flush = false
> >
> >
> >
> >
> > Here's a sample targets.conf:
> >
> >   
> >   initiator-address ALL
> >   scsi_sn Charter
> >   #vendor_id CEPH
> >   #controller_tid 1
> >   write-cache on
> >   read-cache on
> >   driver iscsi
> >   bs-type rbd
> >   
> >   lun 5
> >   scsi_id cfe1000c4a71e700506357
> >   
> >   
> >   lun 6
> >   scsi_id cfe1000c4a71e700507157
> >   
> >   
> >   lun 7
> >   scsi_id cfe1000c4a71e70050da7a
> >   
> >   
> >   lun 8
> >   scsi_id cfe1000c4a71e70050bac0
> >   
> >   
> >
> >
> >
> > I don't have FIO numbers handy, but I have some oracle calibrate io
> > output.
> >
> > We're running Oracle RAC database servers in linux VMs on ESXi 5.1,
> > which use iSCSI to connect to the tgt service.  I only have a single
> > connection setup in ESXi for each LUN.  I tested using multipathing and
> > two tgt VMs presenting identical LUNs/RBD disks, but found that there
> > wasn't a significant performance gain by doing this, even with
> > round-robin path selecting in VMware.
> >
> >
> > These tests were run from two RAC VMs, each on a different host, with
> > both hosts connected to the same tgt instance.  The way we have oracle
> > configured, it would have been using two of the LUNs heavily during this
> > calibrate IO test.
> >
> >
> > This output is with 128 threads in tgtd and rbd client cache enabled:
> >
> > START_TIME   END_TIME   MAX_IOPS   MAX_MBPS
> MAX_PMBPS   LATENCY   DISKS
> >   -- --
> -- -- --
> > 28-JUN-016 15:10:50  28-JUN-016 15:20:04   14153658
> 412   14  75
> >
> >
> > This output is with the same configuration, but with rbd client cache
> > disabled:
> >
> > START_TIME END_TIMEMAX_IOPS   MAX_MBPS  MAX_PMBPS
> LATENCY   DISKS
> >   -- --
> -- -- --
> > 28-JUN-016 22:44:29  28-JUN-016 22:49:057449161219
>  20  75
> >
> > This output is from a directly connected EMC VNX5100 FC SAN with 25
> > disks using dual 8Gb FC links on a different lab system:
> >
> > START_TIME END_TIMEMAX_IOPS   

Re: [ceph-users] ceph + vmware

2016-07-15 Thread Oliver Dzombic
Hi,

I am currently trying this out.

My tgt config:

# cat tgtd.conf
# The default config file
include /etc/tgt/targets.conf

# Config files from other packages etc.
include /etc/tgt/conf.d/*.conf

nr_iothreads=128


-

# cat iqn.2016-07.tgt.esxi-test.conf

  initiator-address ALL
  scsi_sn esxi-test
  #vendor_id CEPH
  #controller_tid 1
  write-cache on
  read-cache on
  driver iscsi
  bs-type rbd
  
  lun 1
  scsi_id cf1c4a71e700506357
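
(For reference, targets.conf wraps these directives in XML-style <target ...>
and <backing-store ...> sections; a complete stanza for this target would look
roughly like the following -- the rbd image name rbd/esxi-test is only a
placeholder:

<target iqn.2016-07.tgt.esxi-test>
    initiator-address ALL
    scsi_sn esxi-test
    write-cache on
    read-cache on
    driver iscsi
    bs-type rbd
    <backing-store rbd/esxi-test>
        lun 1
        scsi_id cf1c4a71e700506357
    </backing-store>
</target>
)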
  
  


--


If I create a VM inside ESXi 6 and try to format the virtual HDD, I see
this in the logs:

sd:2:0:0:0: [sda] CDB:
Write(10): 2a 00 0f 86 a8 80 00 01 40 00
mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=880068aa5e00)
mptscsih: ioc0: attempting task abort! ( sc=880068aa4a80)

That is with the LSI HDD emulation; with the VMware paravirtual SCSI adapter
everything just freezes.

Any idea what is causing that issue?

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 11.07.2016 um 22:24 schrieb Jake Young:
> I'm using this setup with ESXi 5.1 and I get very good performance.  I
> suspect you have other issues.  Reliability is another story (see Nick's
> posts on tgt and HA to get an idea of the awful problems you can have),
> but for my test labs the risk is acceptable.
> 
> 
> One change I found helpful is to run tgtd with 128 threads.  I'm running
> Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and changed the
> line that read:
> 
> exec tgtd
> 
> to 
> 
> exec tgtd --nr_iothreads=128
> 
> 
> If you're not concerned with reliability, you can enhance throughput
> even more by enabling rbd client write-back cache in your tgt VM's
> ceph.conf file (you'll need to restart tgtd for this to take effect):
> 
> [client]
> rbd_cache = true
> rbd_cache_size = 67108864 # (64MB)
> rbd_cache_max_dirty = 50331648 # (48MB)
> rbd_cache_target_dirty = 33554432 # (32MB)
> rbd_cache_max_dirty_age = 2
> rbd_cache_writethrough_until_flush = false
> 
> 
> 
> 
> Here's a sample targets.conf:
> 
>   
>   initiator-address ALL
>   scsi_sn Charter
>   #vendor_id CEPH
>   #controller_tid 1
>   write-cache on
>   read-cache on
>   driver iscsi
>   bs-type rbd
>   
>   lun 5
>   scsi_id cfe1000c4a71e700506357
>   
>   
>   lun 6
>   scsi_id cfe1000c4a71e700507157
>   
>   
>   lun 7
>   scsi_id cfe1000c4a71e70050da7a
>   
>   
>   lun 8
>   scsi_id cfe1000c4a71e70050bac0
>   
>   
> 
> 
> 
> I don't have FIO numbers handy, but I have some oracle calibrate io
> output.  
> 
> We're running Oracle RAC database servers in linux VMs on ESXi 5.1,
> which use iSCSI to connect to the tgt service.  I only have a single
> connection setup in ESXi for each LUN.  I tested using multipathing and
> two tgt VMs presenting identical LUNs/RBD disks, but found that there
> wasn't a significant performance gain by doing this, even with
> round-robin path selecting in VMware.
> 
> 
> These tests were run from two RAC VMs, each on a different host, with
> both hosts connected to the same tgt instance.  The way we have oracle
> configured, it would have been using two of the LUNs heavily during this
> calibrate IO test.
> 
> 
> This output is with 128 threads in tgtd and rbd client cache enabled:
> 
> START_TIME   END_TIME   MAX_IOPS   MAX_MBPS  MAX_PMBPS   
> LATENCY   DISKS
>   -- -- -- 
> -- --
> 28-JUN-016 15:10:50  28-JUN-016 15:20:04   14153658412
>14  75
> 
> 
> This output is with the same configuration, but with rbd client cache
> disabled:
> 
> START_TIME END_TIMEMAX_IOPS   MAX_MBPS  MAX_PMBPS
> LATENCY   DISKS
>   -- -- -- 
> -- --
> 28-JUN-016 22:44:29  28-JUN-016 22:49:057449161219   
> 20  75
> 
> This output is from a directly connected EMC VNX5100 FC SAN with 25
> disks using dual 8Gb FC links on a different lab system:
> 
> START_TIME END_TIMEMAX_IOPS   MAX_MBPS  MAX_PMBPS
> LATENCY   DISKS
>   -- -- -- 
> -- --
> 28-JUN-016 22:11:25  28-JUN-016 22:18:486487299224   
> 19  75
> 
> 
> One of our goals for our Ceph cluster is to replace the EMC SANs.  We've
> accomplished this performance wise, the next step is to get a plausible
> iSCSI HA solution working.  I'm very interested in what Mike Christie is
> putting together.  I'm in the process of vetting the SUSE solution now.
> 
> BTW - The tests were run when we had 75 OSDs, which are all 7200RPM 2TB
> HDs, across 9 OSD hosts.  We have no SSD journals, instead we

Re: [ceph-users] ceph + vmware

2016-07-15 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Oliver Dzombic
> Sent: 15 July 2016 08:35
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] ceph + vmware
> 
> Hi Nick,
> 
> yeah i understand the point and message, i wont do it :-)
> 
> I just asked me recently how do i test if cache is enabled or not ?
> 
> What i found requires a client to be connected to an rbd device. But we dont 
> have that.

The cache is per client, so it is only relevant to clients that are currently 
connected to an RBD. If you want to confirm, you need to enable the admin 
socket for clients and then view the config from the admin socket.

For example:

http://ceph.com/planet/ceph-validate-that-the-rbd-cache-is-active/
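
A minimal sketch of that, with the socket path and client name only as
examples: add an admin socket to the [client] section of ceph.conf on the tgt
host,

[client]
admin socket = /var/run/ceph/$cluster-$name.$pid.asok

restart tgtd so librbd creates the socket, and then dump the running
configuration through it:

ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok config show | grep rbd_cache

The directory has to be writable by the process linking librbd (tgtd here),
otherwise no socket will appear.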


> 
> Is there any way to ask ceph server if cache is enabled or not ? Its disabled 
> by config. But by config the default size and min size of
> newly created pool are different from what ceph really does.
> 
> --
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 15.07.2016 um 09:32 schrieb Nick Fisk:
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Oliver Dzombic
> >> Sent: 12 July 2016 20:59
> >> To: ceph-users@lists.ceph.com
> >> Subject: Re: [ceph-users] ceph + vmware
> >>
> >> Hi Jack,
> >>
> >> thank you!
> >>
> >> What has reliability to do with rbd_cache = true ?
> >>
> >> I mean aside of the fact, that if a host powers down, the "flying" data 
> >> are lost.
> >
> > Not reliability, but consistency. As you have touched on the cache is in 
> > volatile memory and you have told tgt that your cache is non-
> volatile, now if you have a crash/power outageetc, then all the data in 
> the cache will be lost. This will likely leave your RBD full of
> holes or out of date data.
> >
> > If you plan to run HA then this is even more important as you could do a 
> > write on 1 iscsi target and read the data from another
> before the cache has flushed. Again corruption, especially if the initiator 
> is doing round robin over the paths.
> >
> > Also when you run HA the chance that TGT will failover to the other node 
> > because of some timeout you normally don't notice, this
> will also likely cause serious corruption.
> >
> >>
> >> Are there any special limitations / issues with rbd_cache = true and iscsi 
> >> tgt ?
> >
> > I just wouldn't do it.
> >
> > You can almost guarantee data corruption if you do. When librbd gets 
> > persistent cache to SSD, this will probably be safe and as long
> as you can present the cache device to both nodes (eg dual path SAS), HA 
> should be safe as well.
> >
> >>
> >> --
> >> Mit freundlichen Gruessen / Best regards
> >>
> >> Oliver Dzombic
> >> IP-Interactive
> >>
> >> mailto:i...@ip-interactive.de
> >>
> >> Anschrift:
> >>
> >> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> >> 63571 Gelnhausen
> >>
> >> HRB 93402 beim Amtsgericht Hanau
> >> Geschäftsführung: Oliver Dzombic
> >>
> >> Steuer Nr.: 35 236 3622 1
> >> UST ID: DE274086107
> >>
> >>
> >> Am 11.07.2016 um 22:24 schrieb Jake Young:
> >>> I'm using this setup with ESXi 5.1 and I get very good performance.
> >>> I suspect you have other issues.  Reliability is another story (see
> >>> Nick's posts on tgt and HA to get an idea of the awful problems you
> >>> can have), but for my test labs the risk is acceptable.
> >>>
> >>>
> >>> One change I found helpful is to run tgtd with 128 threads.  I'm
> >>> running Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and
> >>> changed the line that read:
> >>>
> >>> exec tgtd
> >>>
> >>> to
> >>>
> >>> exec tgtd --nr_iothreads=128
> >>>
> >>>
> >>> If you're not concerned with reliability, you can enhance throughput
> >>> even more by enabling rbd client write-back c

Re: [ceph-users] ceph + vmware

2016-07-15 Thread Oliver Dzombic
Hi Nick,

yeah, I understand the point and the message, I won't do it :-)

I just asked myself recently: how do I test whether the cache is enabled or not?

What I found requires a client to be connected to an RBD device. But we
don't have that.

Is there any way to ask the Ceph server whether the cache is enabled or not? It
is disabled in the config. But then again, the configured defaults for size and
min_size of newly created pools also differ from what Ceph actually does.

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 15.07.2016 um 09:32 schrieb Nick Fisk:
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> Oliver Dzombic
>> Sent: 12 July 2016 20:59
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] ceph + vmware
>>
>> Hi Jack,
>>
>> thank you!
>>
>> What has reliability to do with rbd_cache = true ?
>>
>> I mean aside of the fact, that if a host powers down, the "flying" data are 
>> lost.
> 
> Not reliability, but consistency. As you have touched on the cache is in 
> volatile memory and you have told tgt that your cache is non-volatile, now if 
> you have a crash/power outageetc, then all the data in the cache will be 
> lost. This will likely leave your RBD full of holes or out of date data.
> 
> If you plan to run HA then this is even more important as you could do a 
> write on 1 iscsi target and read the data from another before the cache has 
> flushed. Again corruption, especially if the initiator is doing round robin 
> over the paths.
> 
> Also when you run HA the chance that TGT will failover to the other node 
> because of some timeout you normally don't notice, this will also likely 
> cause serious corruption. 
> 
>>
>> Are there any special limitations / issues with rbd_cache = true and iscsi 
>> tgt ?
> 
> I just wouldn't do it. 
> 
> You can almost guarantee data corruption if you do. When librbd gets 
> persistent cache to SSD, this will probably be safe and as long as you can 
> present the cache device to both nodes (eg dual path SAS), HA should be safe 
> as well.
> 
>>
>> --
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:i...@ip-interactive.de
>>
>> Anschrift:
>>
>> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402 beim Amtsgericht Hanau
>> Geschäftsführung: Oliver Dzombic
>>
>> Steuer Nr.: 35 236 3622 1
>> UST ID: DE274086107
>>
>>
>> Am 11.07.2016 um 22:24 schrieb Jake Young:
>>> I'm using this setup with ESXi 5.1 and I get very good performance.  I
>>> suspect you have other issues.  Reliability is another story (see
>>> Nick's posts on tgt and HA to get an idea of the awful problems you
>>> can have), but for my test labs the risk is acceptable.
>>>
>>>
>>> One change I found helpful is to run tgtd with 128 threads.  I'm
>>> running Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and
>>> changed the line that read:
>>>
>>> exec tgtd
>>>
>>> to
>>>
>>> exec tgtd --nr_iothreads=128
>>>
>>>
>>> If you're not concerned with reliability, you can enhance throughput
>>> even more by enabling rbd client write-back cache in your tgt VM's
>>> ceph.conf file (you'll need to restart tgtd for this to take effect):
>>>
>>> [client]
>>> rbd_cache = true
>>> rbd_cache_size = 67108864 # (64MB)
>>> rbd_cache_max_dirty = 50331648 # (48MB) rbd_cache_target_dirty =
>>> 33554432 # (32MB) rbd_cache_max_dirty_age = 2
>>> rbd_cache_writethrough_until_flush = false
>>>
>>>
>>>
>>>
>>> Here's a sample targets.conf:
>>>
>>>   
>>>   initiator-address ALL
>>>   scsi_sn Charter
>>>   #vendor_id CEPH
>>>   #controller_tid 1
>>>   write-cache on
>>>   read-cache on
>>>   driver iscsi
>>>   bs-type rbd
>>>   
>>>   lun 5
>>>   scsi_id cfe1000c4a71e700506357
>>>   
>>>   
>>>   lun 6
>>>   scsi_id cfe1000c4a71e700507157
>>>   
>>>   
>>>   lun 7
>>>   scsi_id cfe1000c4a71e70050da7a
>>>   
>

Re: [ceph-users] ceph + vmware

2016-07-15 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Oliver Dzombic
> Sent: 12 July 2016 20:59
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] ceph + vmware
> 
> Hi Jack,
> 
> thank you!
> 
> What has reliability to do with rbd_cache = true ?
> 
> I mean aside of the fact, that if a host powers down, the "flying" data are 
> lost.

Not reliability, but consistency. As you have touched on, the cache is in 
volatile memory and you have told tgt that your cache is non-volatile; now if 
you have a crash/power outage etc., then all the data in the cache will be 
lost. This will likely leave your RBD full of holes or out-of-date data.

If you plan to run HA then this is even more important as you could do a write 
on 1 iscsi target and read the data from another before the cache has flushed. 
Again corruption, especially if the initiator is doing round robin over the 
paths.

Also, when you run HA there is the chance that TGT will fail over to the other node 
because of some timeout you normally wouldn't notice; this will also likely cause 
serious corruption. 

> 
> Are there any special limitations / issues with rbd_cache = true and iscsi 
> tgt ?

I just wouldn't do it. 

You can almost guarantee data corruption if you do. When librbd gets persistent 
cache on SSD, this will probably be safe, and as long as you can present the 
cache device to both nodes (e.g. dual-path SAS), HA should be safe as well.

> 
> --
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> 
> Am 11.07.2016 um 22:24 schrieb Jake Young:
> > I'm using this setup with ESXi 5.1 and I get very good performance.  I
> > suspect you have other issues.  Reliability is another story (see
> > Nick's posts on tgt and HA to get an idea of the awful problems you
> > can have), but for my test labs the risk is acceptable.
> >
> >
> > One change I found helpful is to run tgtd with 128 threads.  I'm
> > running Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and
> > changed the line that read:
> >
> > exec tgtd
> >
> > to
> >
> > exec tgtd --nr_iothreads=128
> >
> >
> > If you're not concerned with reliability, you can enhance throughput
> > even more by enabling rbd client write-back cache in your tgt VM's
> > ceph.conf file (you'll need to restart tgtd for this to take effect):
> >
> > [client]
> > rbd_cache = true
> > rbd_cache_size = 67108864 # (64MB)
> > rbd_cache_max_dirty = 50331648 # (48MB) rbd_cache_target_dirty =
> > 33554432 # (32MB) rbd_cache_max_dirty_age = 2
> > rbd_cache_writethrough_until_flush = false
> >
> >
> >
> >
> > Here's a sample targets.conf:
> >
> >   
> >   initiator-address ALL
> >   scsi_sn Charter
> >   #vendor_id CEPH
> >   #controller_tid 1
> >   write-cache on
> >   read-cache on
> >   driver iscsi
> >   bs-type rbd
> >   
> >   lun 5
> >   scsi_id cfe1000c4a71e700506357
> >   
> >   
> >   lun 6
> >   scsi_id cfe1000c4a71e700507157
> >   
> >   
> >   lun 7
> >   scsi_id cfe1000c4a71e70050da7a
> >   
> >   
> >   lun 8
> >   scsi_id cfe1000c4a71e70050bac0
> >   
> >   
> >
> >
> >
> > I don't have FIO numbers handy, but I have some oracle calibrate io
> > output.
> >
> > We're running Oracle RAC database servers in linux VMs on ESXi 5.1,
> > which use iSCSI to connect to the tgt service.  I only have a single
> > connection setup in ESXi for each LUN.  I tested using multipathing
> > and two tgt VMs presenting identical LUNs/RBD disks, but found that
> > there wasn't a significant performance gain by doing this, even with
> > round-robin path selecting in VMware.
> >
> >
> > These tests were run from two RAC VMs, each on a different host, with
> > both hosts connected to the same tgt instance.  The way we have oracle
> > configured, it would have been using two of the LUNs heavily during
> > this calibrate IO test.
> >
> >
> > This output is with 128 threads in tgtd and rbd client cache enabled:
> >
> > START_TIME   END_TIME   MAX_IOPS   MAX_MBPS  MAX_PMBPS  
> >  LATENCY   D

Re: [ceph-users] ceph + vmware

2016-07-12 Thread Oliver Dzombic
Hi Jack,

thank you!

What does reliability have to do with rbd_cache = true ?

I mean, aside from the fact that if a host powers down, the in-flight
("flying") data are lost.

Are there any special limitations / issues with rbd_cache = true and
iscsi tgt ?

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 11.07.2016 um 22:24 schrieb Jake Young:
> I'm using this setup with ESXi 5.1 and I get very good performance.  I
> suspect you have other issues.  Reliability is another story (see Nick's
> posts on tgt and HA to get an idea of the awful problems you can have),
> but for my test labs the risk is acceptable.
> 
> 
> One change I found helpful is to run tgtd with 128 threads.  I'm running
> Ubuntu 14.04, so I editted my /etc/init.tgt.conf file and changed the
> line that read:
> 
> exec tgtd
> 
> to 
> 
> exec tgtd --nr_iothreads=128
> 
> 
> If you're not concerned with reliability, you can enhance throughput
> even more by enabling rbd client write-back cache in your tgt VM's
> ceph.conf file (you'll need to restart tgtd for this to take effect):
> 
> [client]
> rbd_cache = true
> rbd_cache_size = 67108864 # (64MB)
> rbd_cache_max_dirty = 50331648 # (48MB)
> rbd_cache_target_dirty = 33554432 # (32MB)
> rbd_cache_max_dirty_age = 2
> rbd_cache_writethrough_until_flush = false
> 
> 
> 
> 
> Here's a sample targets.conf:
> 
>   
>   initiator-address ALL
>   scsi_sn Charter
>   #vendor_id CEPH
>   #controller_tid 1
>   write-cache on
>   read-cache on
>   driver iscsi
>   bs-type rbd
>   
>   lun 5
>   scsi_id cfe1000c4a71e700506357
>   
>   
>   lun 6
>   scsi_id cfe1000c4a71e700507157
>   
>   
>   lun 7
>   scsi_id cfe1000c4a71e70050da7a
>   
>   
>   lun 8
>   scsi_id cfe1000c4a71e70050bac0
>   
>   
> 
> 
> 
> I don't have FIO numbers handy, but I have some oracle calibrate io
> output.  
> 
> We're running Oracle RAC database servers in linux VMs on ESXi 5.1,
> which use iSCSI to connect to the tgt service.  I only have a single
> connection setup in ESXi for each LUN.  I tested using multipathing and
> two tgt VMs presenting identical LUNs/RBD disks, but found that there
> wasn't a significant performance gain by doing this, even with
> round-robin path selecting in VMware.
> 
> 
> These tests were run from two RAC VMs, each on a different host, with
> both hosts connected to the same tgt instance.  The way we have oracle
> configured, it would have been using two of the LUNs heavily during this
> calibrate IO test.
> 
> 
> This output is with 128 threads in tgtd and rbd client cache enabled:
> 
> START_TIME   END_TIME   MAX_IOPS   MAX_MBPS  MAX_PMBPS   
> LATENCY   DISKS
>   -- -- -- 
> -- --
> 28-JUN-016 15:10:50  28-JUN-016 15:20:04   14153658412
>14  75
> 
> 
> This output is with the same configuration, but with rbd client cache
> disabled:
> 
> START_TIME END_TIMEMAX_IOPS   MAX_MBPS  MAX_PMBPS
> LATENCY   DISKS
>   -- -- -- 
> -- --
> 28-JUN-016 22:44:29  28-JUN-016 22:49:057449161219   
> 20  75
> 
> This output is from a directly connected EMC VNX5100 FC SAN with 25
> disks using dual 8Gb FC links on a different lab system:
> 
> START_TIME END_TIMEMAX_IOPS   MAX_MBPS  MAX_PMBPS
> LATENCY   DISKS
>   -- -- -- 
> -- --
> 28-JUN-016 22:11:25  28-JUN-016 22:18:486487299224   
> 19  75
> 
> 
> One of our goals for our Ceph cluster is to replace the EMC SANs.  We've
> accomplished this performance wise, the next step is to get a plausible
> iSCSI HA solution working.  I'm very interested in what Mike Christie is
> putting together.  I'm in the process of vetting the SUSE solution now.
> 
> BTW - The tests were run when we had 75 OSDs, which are all 7200RPM 2TB
> HDs, across 9 OSD hosts.  We have no SSD journals, instead we have all
> the disks setup as single disk RAID1 disk groups with WB cache with
> BBU.  All OSD hosts have 40Gb networking and the ESXi hosts have 10G.
> 
> Jake
> 
> 
> On Mon, Jul 11, 2016 at 12:06 PM, Oliver Dzombic  > wrote:
> 
> Hi Mike,
> 
> i was trying:
> 
> https://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/
> 
> ONE target, from different OSD servers directly, to multiple vmware esxi
> servers.
> 
> A config looked like:
> 
> #cat iqn.ceph-cluster_netzlaboranten-storage.conf
> 
> 
> driver iscsi
> bs-type rbd
> backing-store rbd/vmware-s

Re: [ceph-users] ceph + vmware

2016-07-11 Thread Alex Gorbachev
Hi Oliver,

On Friday, July 8, 2016, Oliver Dzombic  wrote:

> Hi,
>
> does anyone have experience how to connect vmware with ceph smart ?
>
> iSCSI multipath does not really worked well.
> NFS could be, but i think thats just too much layers in between to have
> some useable performance.
>
> Systems like ScaleIO have developed a vmware addon to talk with it.
>
> Is there something similar out there for ceph ?
>
> What are you using ?


We use RBD with SCST, Pacemaker and EnhanceIO (for read-only SSD caching).
The HA agents are open source; there are several options for those.
We are currently running 3 VMware clusters with 15 hosts in total, and things are
quite decent.

Regards,
Alex Gorbachev
Storcium


>
> Thank you !
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de 
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
--
Alex Gorbachev
Storcium
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-11 Thread Jake Young
I'm using this setup with ESXi 5.1 and I get very good performance.  I
suspect you have other issues.  Reliability is another story (see Nick's
posts on tgt and HA to get an idea of the awful problems you can have), but
for my test labs the risk is acceptable.


One change I found helpful is to run tgtd with 128 threads.  I'm running
Ubuntu 14.04, so I edited my /etc/init/tgt.conf file and changed the line
that read:

exec tgtd

to

exec tgtd --nr_iothreads=128


If you're not concerned with reliability, you can enhance throughput even
more by enabling rbd client write-back cache in your tgt VM's ceph.conf
file (you'll need to restart tgtd for this to take effect):

[client]
rbd_cache = true
rbd_cache_size = 67108864 # (64MB)
rbd_cache_max_dirty = 50331648 # (48MB)
rbd_cache_target_dirty = 33554432 # (32MB)
rbd_cache_max_dirty_age = 2
rbd_cache_writethrough_until_flush = false




Here's a sample targets.conf:

  
  initiator-address ALL
  scsi_sn Charter
  #vendor_id CEPH
  #controller_tid 1
  write-cache on
  read-cache on
  driver iscsi
  bs-type rbd
  
  lun 5
  scsi_id cfe1000c4a71e700506357
  
  
  lun 6
  scsi_id cfe1000c4a71e700507157
  
  
  lun 7
  scsi_id cfe1000c4a71e70050da7a
  
  
  lun 8
  scsi_id cfe1000c4a71e70050bac0
  
  



I don't have FIO numbers handy, but I have some oracle calibrate io output.


We're running Oracle RAC database servers in Linux VMs on ESXi 5.1, which
use iSCSI to connect to the tgt service.  I only have a single connection
set up in ESXi for each LUN.  I tested using multipathing and two tgt VMs
presenting identical LUNs/RBD disks, but found that there wasn't a
significant performance gain by doing this, even with round-robin path
selection in VMware.


These tests were run from two RAC VMs, each on a different host, with both
hosts connected to the same tgt instance.  The way we have oracle
configured, it would have been using two of the LUNs heavily during this
calibrate IO test.


This output is with 128 threads in tgtd and rbd client cache enabled:

START_TIME           END_TIME             MAX_IOPS  MAX_MBPS  MAX_PMBPS  LATENCY  DISKS
-------------------  -------------------  --------  --------  ---------  -------  -----
28-JUN-016 15:10:50  28-JUN-016 15:20:04     14153       658        412       14     75


This output is with the same configuration, but with rbd client cache
disabled:

START_TIME           END_TIME             MAX_IOPS  MAX_MBPS  MAX_PMBPS  LATENCY  DISKS
-------------------  -------------------  --------  --------  ---------  -------  -----
28-JUN-016 22:44:29  28-JUN-016 22:49:05      7449       161        219       20     75

This output is from a directly connected EMC VNX5100 FC SAN with 25 disks
using dual 8Gb FC links on a different lab system:

START_TIME           END_TIME             MAX_IOPS  MAX_MBPS  MAX_PMBPS  LATENCY  DISKS
-------------------  -------------------  --------  --------  ---------  -------  -----
28-JUN-016 22:11:25  28-JUN-016 22:18:48      6487       299        224       19     75


One of our goals for our Ceph cluster is to replace the EMC SANs.  We've
accomplished this performance-wise; the next step is to get a plausible
iSCSI HA solution working.  I'm very interested in what Mike Christie is
putting together.  I'm in the process of vetting the SUSE solution now.

BTW - The tests were run when we had 75 OSDs, which are all 7200RPM 2TB
HDs, across 9 OSD hosts.  We have no SSD journals, instead we have all the
disks setup as single disk RAID1 disk groups with WB cache with BBU.  All
OSD hosts have 40Gb networking and the ESXi hosts have 10G.

Jake


On Mon, Jul 11, 2016 at 12:06 PM, Oliver Dzombic 
wrote:

> Hi Mike,
>
> i was trying:
>
> https://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/
>
> ONE target, from different OSD servers directly, to multiple vmware esxi
> servers.
>
> A config looked like:
>
> #cat iqn.ceph-cluster_netzlaboranten-storage.conf
>
> 
> driver iscsi
> bs-type rbd
> backing-store rbd/vmware-storage
> initiator-address 10.0.0.9
> initiator-address 10.0.0.10
> incominguser vmwaren-storage RPb18P0xAqkAw4M1
> 
>
>
> We had 4 OSD servers. Everyone had this config running.
> We had 2 vmware servers ( esxi ).
>
> So we had 4 paths to this vmware-storage RBD object.
>
> VMware, in the very end, had 8 paths ( 4 path's directly connected to
> the specific vmware server ) + 4 paths this specific vmware servers saw
> via the other vmware server ).
>
> There were very big problems with performance. I am talking about < 10
> MB/s. So the customer was not able to use it, so good old nfs is serving.
>
> At that time we used ceph hammer, and i think esxi 5.5 the customer was
> using, or maybe esxi 6, was somewhere last year the testing.
>
> 
>
> We will make a new attempt now with ceph jewel and esxi 6 and this time
> we will manage the vmware servers.
>
> As soon as we fixed this
>
> "ceph mon Segmentation fault after set

Re: [ceph-users] ceph + vmware

2016-07-11 Thread Oliver Dzombic
Hi Mike,

I was trying:

https://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/

ONE target, exported from several OSD servers directly, to multiple VMware ESXi
servers.

A config looked like:

#cat iqn.ceph-cluster_netzlaboranten-storage.conf


driver iscsi
bs-type rbd
backing-store rbd/vmware-storage
initiator-address 10.0.0.9
initiator-address 10.0.0.10
incominguser vmwaren-storage RPb18P0xAqkAw4M1



We had 4 OSD servers. Each of them had this config running.
We had 2 VMware servers (ESXi).

So we had 4 paths to this vmware-storage RBD object.

VMware, in the end, saw 8 paths (4 paths directly connected to the
specific VMware server, plus 4 paths that this VMware server saw via the
other VMware server).

There were very big problems with performance. I am talking about < 10
MB/s. So the customer was not able to use it, and good old NFS is serving instead.

At that time we used Ceph Hammer, and I think the customer was using ESXi 5.5,
or maybe ESXi 6; the testing was somewhere around last year.



We will make a new attempt now with Ceph Jewel and ESXi 6, and this time
we will manage the VMware servers ourselves.

As soon as the

"ceph mon Segmentation fault after set crush_ruleset ceph 10.2.2"

issue that I already mailed to this list is fixed, we can start the testing.


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


Am 11.07.2016 um 17:45 schrieb Mike Christie:
> On 07/08/2016 02:22 PM, Oliver Dzombic wrote:
>> Hi,
>>
>> does anyone have experience how to connect vmware with ceph smart ?
>>
>> iSCSI multipath does not really worked well.
> 
> Are you trying to export rbd images from multiple iscsi targets at the
> same time or just one target?
> 
> For the HA/multiple target setup, I am working on this for Red Hat. We
> plan to release it in RHEL 7.3/RHCS 2.1. SUSE ships something already as
> someone mentioned.
> 
> We just got a large chunk of code in the upstream kernel (it is in the
> block layer maintainer's tree for the next kernel) so it should be
> simple to add COMPARE_AND_WRITE support now. We should be posting krbd
> exclusive lock support in the next couple weeks.
> 
> 
>> NFS could be, but i think thats just too much layers in between to have
>> some useable performance.
>>
>> Systems like ScaleIO have developed a vmware addon to talk with it.
>>
>> Is there something similar out there for ceph ?
>>
>> What are you using ?
>>
>> Thank you !
>>
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-11 Thread Mike Christie
On 07/08/2016 02:22 PM, Oliver Dzombic wrote:
> Hi,
> 
> does anyone have experience how to connect vmware with ceph smart ?
> 
> iSCSI multipath does not really worked well.

Are you trying to export rbd images from multiple iscsi targets at the
same time or just one target?

For the HA/multiple target setup, I am working on this for Red Hat. We
plan to release it in RHEL 7.3/RHCS 2.1. SUSE ships something already as
someone mentioned.

We just got a large chunk of code in the upstream kernel (it is in the
block layer maintainer's tree for the next kernel) so it should be
simple to add COMPARE_AND_WRITE support now. We should be posting krbd
exclusive lock support in the next couple weeks.


> NFS could be, but i think thats just too much layers in between to have
> some useable performance.
> 
> Systems like ScaleIO have developed a vmware addon to talk with it.
> 
> Is there something similar out there for ceph ?
> 
> What are you using ?
> 
> Thank you !
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-09 Thread Nick Fisk
See my post from a few days ago. If you value your sanity and free time, use 
NFS. Otherwise SCST is probably your best bet at the moment, or maybe try out 
the SUSE implementation.
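
For the NFS route, a minimal sketch on a Linux gateway (image name, size and
export network are placeholders; on Jewel you may need --image-feature layering
so the kernel client can map the image):

rbd create rbd/vmware-nfs --size 1048576 --image-feature layering
rbd map rbd/vmware-nfs                     # appears as e.g. /dev/rbd0
mkfs.xfs /dev/rbd0
mkdir -p /export/vmware-nfs && mount /dev/rbd0 /export/vmware-nfs
echo '/export/vmware-nfs 10.0.0.0/24(rw,no_root_squash,sync)' >> /etc/exports
exportfs -ra

The export is then added in vSphere as an NFS datastore; for anything beyond a
lab you would still want Pacemaker or similar in front of it for failover, as
discussed elsewhere in this thread.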

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
> Schermer
> Sent: 08 July 2016 20:53
> To: Oliver Dzombic 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] ceph + vmware
> 
> There is no Ceph plugin for VMware (and I think you need at least an 
> Enterprise license for storage plugins, much $$$).
> The "VMware" way to do this without the plugin would be to have a VM running 
> on every host serving RBD devices over iSCSI to the
> other VMs (the way their storage applicances work, maybe you could even 
> re-use them somehow? I haven't used VMware in a while,
> so not sure if one can login to the appliance and customize it...).
> Nevertheless I think it's ugly, messy and is going to be even slower than 
> Ceph by itself.
> 
> But you can always just use RBD client (kernel/userspace) in the VMs 
> themselves, VMware has pretty fast networking so the
> overhead wouldn't be that large.
> 
> Jan
> 
> 
> > On 08 Jul 2016, at 21:22, Oliver Dzombic  wrote:
> >
> > Hi,
> >
> > does anyone have experience how to connect vmware with ceph smart ?
> >
> > iSCSI multipath does not really worked well.
> > NFS could be, but i think thats just too much layers in between to
> > have some useable performance.
> >
> > Systems like ScaleIO have developed a vmware addon to talk with it.
> >
> > Is there something similar out there for ceph ?
> >
> > What are you using ?
> >
> > Thank you !
> >
> > --
> > Mit freundlichen Gruessen / Best regards
> >
> > Oliver Dzombic
> > IP-Interactive
> >
> > mailto:i...@ip-interactive.de
> >
> > Anschrift:
> >
> > IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> > 63571 Gelnhausen
> >
> > HRB 93402 beim Amtsgericht Hanau
> > Geschäftsführung: Oliver Dzombic
> >
> > Steuer Nr.: 35 236 3622 1
> > UST ID: DE274086107
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph + vmware

2016-07-08 Thread Jan Schermer
There is no Ceph plugin for VMware (and I think you need at least an Enterprise 
license for storage plugins, much $$$).
The "VMware" way to do this without the plugin would be to have a VM running on 
every host serving RBD devices over iSCSI to the other VMs (the way their 
storage appliances work; maybe you could even re-use them somehow? I haven't 
used VMware in a while, so I am not sure whether one can log in to the appliance 
and customize it...).
Nevertheless I think it's ugly, messy and is going to be even slower than Ceph 
by itself.

But you can always just use the RBD client (kernel or userspace) in the VMs 
themselves; VMware has pretty fast networking, so the overhead wouldn't be that 
large.
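
A rough sketch of that in-guest variant (client name, pool and image are
placeholders): install ceph-common in the guest, give it /etc/ceph/ceph.conf
with the monitor addresses plus a keyring for a cephx user that can access the
pool, and then map the image directly:

# run where you have admin credentials, then copy the keyring into the guest
ceph auth get-or-create client.vmguest mon 'allow r' osd 'allow rwx pool=rbd' > /etc/ceph/ceph.client.vmguest.keyring

# inside the guest
rbd map --id vmguest rbd/vm-data           # shows up as /dev/rbdX, then mkfs/mount as usual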

Jan


> On 08 Jul 2016, at 21:22, Oliver Dzombic  wrote:
> 
> Hi,
> 
> does anyone have experience how to connect vmware with ceph smart ?
> 
> iSCSI multipath does not really worked well.
> NFS could be, but i think thats just too much layers in between to have
> some useable performance.
> 
> Systems like ScaleIO have developed a vmware addon to talk with it.
> 
> Is there something similar out there for ceph ?
> 
> What are you using ?
> 
> Thank you !
> 
> -- 
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph + vmware

2016-07-08 Thread Oliver Dzombic
Hi,

does anyone have experience with how to connect VMware with Ceph in a smart way?

iSCSI multipath did not really work well.
NFS could be an option, but I think that's just too many layers in between to get
some usable performance.

Systems like ScaleIO have developed a vmware addon to talk with it.

Is there something similar out there for ceph ?

What are you using ?

Thank you !

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com