Re: [ceph-users] suse_enterprise_storage3_rbd_LIO_vmware_performance_bad

Nick Fisk Fri, 01 Jul 2016 11:21:40 -0700

To summarise,

LIO is not working very well at the moment because of the ABORT Tasks problem;
this will hopefully be fixed at some point. I'm not sure whether SUSE works
around this, but see below for other pain points with RBD + ESXi + iSCSI.


TGT is easy to get going, but performance isn't the best and failover is an
absolute pain, as TGT won't stop if it has ongoing IO. You normally end up in a
complete mess if you try to do HA, unless you can cover a number of different
failure scenarios.

SCST probably works the best at the moment. Yes, you have to compile it into a
new kernel, but it performs well, doesn't fall over, supports the VAAI
extensions and can be configured for HA in either ALUA or VIP failover mode.
There might be a couple of corner cases in ALUA mode with Active/Standby paths,
with possible data corruption, that need to be tested/explored.

However, there are a number of pain points with iSCSI + ESXi + RBD and they all 
mainly centre on write latency. It seems VMFS was designed around the fact that 
Enterprise storage arrays service writes in 10-100us, whereas Ceph will service 
them in 2-10ms.
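
As a rough illustration of why that gap matters (a back-of-envelope sketch, not
a measurement): if small writes are issued one at a time, each waiting for the
previous one to complete, the ceiling is simply 1/latency. The latencies are
the ranges quoted above; the function name is only for illustration.

```python
def serial_iops(latency_s: float) -> float:
    """Upper bound on IOPS when each IO waits for the previous one (queue depth 1)."""
    return 1.0 / latency_s


# Latency ranges quoted above: 10-100us for an enterprise array, 2-10ms for Ceph.
for label, latency_s in [
    ("array, 10us", 10e-6),
    ("array, 100us", 100e-6),
    ("ceph, 2ms", 2e-3),
    ("ceph, 10ms", 10e-3),
]:
    print(f"{label:>13}: {serial_iops(latency_s):>9,.0f} IOPS max at queue depth 1")
```

Anything in VMFS that behaves like a serial stream of small writes is therefore
capped at a few hundred IOPS on Ceph, where an array would allow tens of
thousands.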

1. Thin Provisioning makes things slow. I believe the main cause is that when
a disk grows, metadata needs to be updated and the new blocks zeroed. Both
generate small IOs, which would normally not be a problem, but with Ceph they
become a bottleneck for all IO on the datastore.

2. Snapshots effectively turn all IO into 64KB IOs. Again, a traditional SAN
will coalesce these back into a stream of larger IOs before committing to
disk. With Ceph, however, each IO takes 2-10ms and so everything seems slow. The
future persistent RBD cache feature may go a long way towards helping with this.

3. >2TB VMDKs with snapshots use a different allocation mode, which works in
4KB chunks instead of 64KB ones. This makes the problem 16 times worse than
above.

4. Any of the above will also apply when migrating machines around, so VMs can
take hours/days to move.

5. If you use FILEIO, you can't use thin provisioning. If you use BLOCKIO, you
get thin provisioning, but no pagecache or readahead, so performance can
nosedive if these are needed.

6. iSCSI is very complicated (especially ALUA) and sensitive. Get used to 
seeing APD/PDL even when you think you have finally got everything working 
great.
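
To put rough numbers on points 2 and 3, here is a sketch that assumes the worst
case: the split IOs are issued one after another and each costs a full Ceph
round trip. The 1MB guest write and the 5ms latency are illustrative
assumptions, not measurements.

```python
def split_write_ms(write_bytes: int, chunk_bytes: int, latency_ms: float) -> float:
    """Time for one guest write split into chunk-sized IOs issued serially."""
    n_ios = -(-write_bytes // chunk_bytes)  # ceiling division
    return n_ios * latency_ms


MB = 1024 * 1024
KB = 1024
LATENCY_MS = 5.0  # assumed, mid-range of the 2-10ms quoted above

t64 = split_write_ms(MB, 64 * KB, LATENCY_MS)
t4 = split_write_ms(MB, 4 * KB, LATENCY_MS)
print(f"64KB chunks: {MB // (64 * KB)} IOs, {t64:.0f} ms")
print(f" 4KB chunks: {MB // (4 * KB)} IOs, {t4:.0f} ms ({t4 / t64:.0f}x worse)")
```

In practice some of these IOs will overlap, so real numbers should be better
than this, but the ratio between the two chunk sizes is the point.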


Normal IO from eager-zeroed VMs with no snapshots should, however, perform OK,
so it depends on what your workload is.


And then comes NFS. It's very easy to set up, very easy to configure for HA, and
works pretty well overall. You don't seem to get any of the IO size penalties
when using snapshots. If you mount with discard, thin provisioning is done by
Ceph. You can defragment the FS on the proxy node and do several other things
that you can't do with VMFS. Just make sure you run the server in sync mode to
avoid data loss.

The only downside is that every IO causes an IO to the FS and one to the FS 
journal, so you effectively double your IO. But if your Ceph backend can 
support it, then it shouldn't be too much of a problem.

Now to the original poster: assuming the iSCSI node is just kernel mounting the
RBD, I would run iostat on it to see what sort of latency you are seeing at
that point. Do the same with esxtop (press u for the disk device view) and
look at the write latency there, both whilst running the fio in the VM. This
should hopefully let you see whether there is just a gradual increase as you go
from hop to hop or whether there is an obvious culprit.
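
For the gateway side, the write latency that iostat -x shows as w_await is
derived from counters in /proc/diskstats. A minimal sketch of the same
calculation follows, in case you want to log it alongside the fio run. The
device name rbd0 and the sample counter values are made up for illustration.

```python
def write_counters(diskstats_text: str, device: str) -> tuple[int, int]:
    """Return (writes completed, ms spent writing) for a device from /proc/diskstats text."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields: major minor name, then reads (4 counters), then writes (4 counters), ...
        if len(fields) >= 11 and fields[2] == device:
            return int(fields[7]), int(fields[10])
    raise ValueError(f"device {device!r} not found")


def write_await_ms(before: str, after: str, device: str) -> float:
    """Average write latency in ms between two snapshots (same idea as iostat's w_await)."""
    w0, ms0 = write_counters(before, device)
    w1, ms1 = write_counters(after, device)
    writes = w1 - w0
    return (ms1 - ms0) / writes if writes else 0.0


# Made-up sample snapshots; on a real gateway, read /proc/diskstats twice a few seconds apart.
SAMPLE_BEFORE = " 252 0 rbd0 83 0 664 60 1000 0 8000 20000 0 19000 20060"
SAMPLE_AFTER = " 252 0 rbd0 83 0 664 60 1640 0 13120 39200 0 29000 39260"
print(f"rbd0 w_await: {write_await_ms(SAMPLE_BEFORE, SAMPLE_AFTER, 'rbd0'):.1f} ms")
```

If that figure is close to what esxtop reports, most of the latency is in Ceph
itself; a large gap between the two would point at the iSCSI layer or the
network instead.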

Can you also confirm your kernel version? 

With 1Gb networking I think you will struggle to get your write latency much
below 10-15ms, but from your example ~30ms is still a bit high. I wonder if the
default queue depths on your iSCSI target are too low as well?
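
As a sanity check on the fio numbers quoted below, Little's law ties them
together: with a fixed number of IOs in flight, IOPS is roughly queue depth
divided by average latency. The sketch uses iodepth=32 and the average lat of
495.04ms from the original post; the 30ms and 15ms rows are hypothetical
targets, not results.

```python
def iops_from_latency(queue_depth: int, avg_latency_ms: float) -> float:
    """Little's law: throughput = IOs in flight / average time per IO."""
    return queue_depth / (avg_latency_ms / 1000.0)


QUEUE_DEPTH = 32  # fio -iodepth=32 from the original post

for label, lat_ms in [
    ("measured avg lat (495.04ms)", 495.04),
    ("hypothetical 30ms", 30.0),
    ("hypothetical 15ms", 15.0),
]:
    print(f"{label:>28}: {iops_from_latency(QUEUE_DEPTH, lat_ms):7.0f} IOPS")
```

The first row comes out at roughly the 64 IOPS that fio reported, which
suggests the low IOPS is explained by latency alone rather than by fio failing
to keep the queue full.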

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Oliver Dzombic
> Sent: 01 July 2016 09:27
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users]
> suse_enterprise_storage3_rbd_LIO_vmware_performance_bad
> 
> Hi,
> 
> my experience:
> 
> ceph + iscsi ( multipath ) + vmware == worst
> 
> You would be better off looking for another solution.
> 
> ceph + nfs + vmware might perform much better.
> 
> --------
> 
> If you do get vmware running with iscsi and ceph, I would be >>very<<
> interested in what you did and how.
> 
> --
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Address:
> 
> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402, Amtsgericht Hanau (district court)
> Management: Oliver Dzombic
> 
> Tax no.: 35 236 3622 1
> VAT ID: DE274086107
> 
> 
> On 01.07.2016 at 07:04, mq wrote:
> > Hi list
> > I have tested SUSE Enterprise Storage 3 using 2 iSCSI gateways attached
> > to VMware. The performance is bad. I have turned off VAAI following
> > https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665
> >
> > My cluster
> > 3 Ceph nodes: 2*E5-2620, 64G mem, 2*1Gbps (3*10K SAS, 1*480G SSD) per
> > node, SSD as journal
> > 1 VMware node: 2*E5-2620, 64G mem, 2*1Gbps
> >
> > # ceph -s
> >     cluster 0199f68d-a745-4da3-9670-15f2981e7a15
> >      health HEALTH_OK
> >      monmap e1: 3 mons at {node1=192.168.50.91:6789/0,node2=192.168.50.92:6789/0,node3=192.168.50.93:6789/0}
> >             election epoch 22, quorum 0,1,2 node1,node2,node3
> >      osdmap e200: 9 osds: 9 up, 9 in
> >             flags sortbitwise
> >       pgmap v1162: 448 pgs, 1 pools, 14337 MB data, 4935 objects
> >             18339 MB used, 5005 GB / 5023 GB avail
> >                  448 active+clean
> >   client io 87438 kB/s wr, 0 op/s rd, 213 op/s wr
> >
> > sudo ceph osd tree
> > ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > -1 4.90581 root default
> > -2 1.63527     host node1
> > 0 0.54509         osd.0       up  1.00000          1.00000
> > 1 0.54509         osd.1       up  1.00000          1.00000
> > 2 0.54509         osd.2       up  1.00000          1.00000
> > -3 1.63527     host node2
> > 3 0.54509         osd.3       up  1.00000          1.00000
> > 4 0.54509         osd.4       up  1.00000          1.00000
> > 5 0.54509         osd.5       up  1.00000          1.00000
> > -4 1.63527     host node3
> > 6 0.54509         osd.6       up  1.00000          1.00000
> > 7 0.54509         osd.7       up  1.00000          1.00000
> > 8 0.54509         osd.8       up  1.00000          1.00000
> >
> >
> >
> > A Linux VM in VMware, running fio: the 4k randwrite result is just 64 IOPS,
> > latency is high, and a dd test gets just 11MB/s.
> >
> > fio -ioengine=libaio -bs=4k -direct=1 -thread -rw=randwrite -size=100G
> > -filename=/dev/sdb  -name="EBS 4KB randwrite test" -iodepth=32
> > -runtime=60
> > EBS 4KB randwrite test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> > fio-2.0.13
> > Starting 1 thread
> > Jobs: 1 (f=1): [w] [100.0% done] [0K/131K/0K /s] [0 /32 /0  iops] [eta 00m:00s]
> > EBS 4KB randwrite test: (groupid=0, jobs=1): err= 0: pid=6766: Wed Jun 29 21:28:06 2016
> >   write: io=15696KB, bw=264627 B/s, iops=64 , runt= 60737msec
> >     slat (usec): min=10 , max=213 , avg=35.54, stdev=16.41
> >     clat (msec): min=1 , max=31368 , avg=495.01, stdev=1862.52
> >      lat (msec): min=2 , max=31368 , avg=495.04, stdev=1862.52
> >     clat percentiles (msec):
> >      |  1.00th=[    7],  5.00th=[    8], 10.00th=[    8], 20.00th=[    9],
> >      | 30.00th=[    9], 40.00th=[   10], 50.00th=[  198], 60.00th=[  204],
> >      | 70.00th=[  208], 80.00th=[  217], 90.00th=[  799], 95.00th=[ 1795],
> >      | 99.00th=[ 7177], 99.50th=[12649], 99.90th=[16712], 99.95th=[16712],
> >      | 99.99th=[16712]
> >     bw (KB/s)  : min=   36, max=11960, per=100.00%, avg=264.77, stdev=1110.81
> >     lat (msec) : 2=0.03%, 4=0.23%, 10=40.93%, 20=0.48%, 50=0.03%
> >     lat (msec) : 100=0.08%, 250=39.55%, 500=5.63%, 750=2.91%, 1000=1.35%
> >     lat (msec) : 2000=4.03%, >=2000=4.77%
> >   cpu          : usr=0.02%, sys=0.22%, ctx=2973, majf=0, minf=18446744073709538907
> >   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.2%, 16=0.4%, 32=99.2%, >=64=0.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> >      issued    : total=r=0/w=3924/d=0, short=r=0/w=0/d=0
> >
> > Run status group 0 (all jobs):
> >   WRITE: io=15696KB, aggrb=258KB/s, minb=258KB/s, maxb=258KB/s, mint=60737msec, maxt=60737msec
> >
> > Disk stats (read/write):
> >   sdb: ios=83/3921, merge=0/0, ticks=60/1903085, in_queue=1931694, util=100.00%
> >
> > Can anyone give me some suggestions to improve the performance?
> >
> > Regards
> >
> > MQ
> >
> >
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >