The numbers are very low. I would first benchmark the system without the VM
client, using an rbd 4k write test such as:

rbd bench-write image01 --pool=rbd --io-threads=32 --io-size 4096 --io-pattern rand --rbd_cache=false


-------- Original message --------
From: kevin parrikar <kevin.parker...@gmail.com> 
Date: 07/01/2017  05:48  (GMT+02:00) 
To: Christian Balzer <ch...@gol.com> 
Cc: ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe 
NIC and 2 replicas -Hammer release 

i really need some help here :(

replaced all 7.2k rpm SAS disks with new Samsung 840 EVO 512GB SSDs, with no
separate journal disk. Both OSD nodes now have 2 SSD disks each, with a replica
count of 2; the total number of OSD processes in the cluster is 4, all on SSD.

But throughput has gone down from 1.4 MB/s to 1.3 MB/s for 4k writes, and for 4M
writes it has gone down from 140MB/s to 126MB/s.

Now atop no longer shows the OSD devices as 100% busy; however, I can see both
ceph-osd processes in atop with 53% and 47% disk utilization.

 PID    RDDSK   WRDSK    WCANCL   DSK   CMD
 20771  0K      648.8M   0K       53%   ceph-osd
 19547  0K      576.7M   0K       47%   ceph-osd


OSD disks(ssd) utilization from atop

DSK | sdc | busy 6% | read 0 | write 517 | KiB/r 0 | KiB/w 293 | MBr/s 0.00 | MBw/s 148.18 | avq 9.44 | avio 0.12 ms |

DSK | sdd | busy 5% | read 0 | write 336 | KiB/r 0 | KiB/w 292 | MBr/s 0.00 | MBw/s 96.12 | avq 7.62 | avio 0.15 ms |


Queue depth of OSD disks:

 cat /sys/block/sdd/device/queue_depth
 256

atop inside the virtual machine [4 CPU / 3GB RAM]:

DSK | vdc | busy 96% | read 0 | write 256 | KiB/r 0 | KiB/w 512 | MBr/s 0.00 | MBw/s 128.00 | avq 7.96 | avio 3.77 ms |


Both guest and host are using the deadline I/O scheduler.

Virtual Machine Configuration:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <auth username='compute'>
    <secret type='ceph' uuid='a5d0dd94-57c4-ae55-ffe0-7e3732a24455'/>
  </auth>
  <source protocol='rbd' name='volumes/volume-449da0e7-6223-457c-b2c6-b5e112099212'>
    <host name='172.16.1.8' port='6789'/>
    <host name='172.16.1.11' port='6789'/>
    <host name='172.16.1.12' port='6789'/>
  </source>
  <target dev='vdb' bus='virtio'/>
  <serial>449da0e7-6223-457c-b2c6-b5e112099212</serial>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>



ceph.conf

 cat /etc/ceph/ceph.conf
[global]
fsid = c4e1a523-9017-492e-9c30-8350eba1bd51
mon_initial_members = node-16 node-30 node-31
mon_host = 172.16.1.11 172.16.1.12 172.16.1.8
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
log_to_syslog_level = info
log_to_syslog = True
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 64
public_network = 172.16.1.0/24
log_to_syslog_facility = LOG_LOCAL0
osd_journal_size = 2048
auth_supported = cephx
osd_pool_default_pgp_num = 64
osd_mkfs_type = xfs
cluster_network = 172.16.1.0/24
osd_recovery_max_active = 1
osd_max_backfills = 1

[client]
rbd_cache_writethrough_until_flush = True
rbd_cache = True

[client.radosgw.gateway]
rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_frontends = fastcgi socket_port=9000 socket_host=127.0.0.1
rgw_socket_path = /tmp/radosgw.sock
rgw_keystone_revocation_interval = 1000000
Any guidance on where to look for issues would be appreciated.

Regards,
Kevin
On Fri, Jan 6, 2017 at 4:42 PM, kevin parrikar <kevin.parker...@gmail.com> 
wrote:
Thanks Christian for your valuable comments; each comment is a new learning for
me.
Please see inline

On Fri, Jan 6, 2017 at 9:32 AM, Christian Balzer <ch...@gol.com> wrote:


Hello,



On Fri, 6 Jan 2017 08:40:36 +0530 kevin parrikar wrote:



> Hello All,

>

> I have setup a ceph cluster based on 0.94.6 release in  2 servers each with

> 80Gb intel s3510 and 2x3 Tb 7.2 SATA disks,16 CPU,24G RAM

> which is connected to a 10G switch with a replica of 2 [ i will add 3 more

> servers to the cluster] and 3 seperate monitor nodes which are vms.

>

I'd go to the latest Hammer; this version has a lethal cache-tier bug, should you decide to try that feature.



80GB Intel DC S3510s are a) slow and b) rated for only 0.3 DWPD.

You're going to wear those out quickly and, if they're not replaced in time, lose data.



2 HDDs give you a theoretical speed of something like 300MB/s sustained;

when used as OSDs I'd expect the usual 50-60MB/s per OSD due to seeks, journal (file system) and leveldb overheads.

Which perfectly matches your results.

Hmmmm, that makes sense; it's hitting the 7.2k rpm OSDs' peak write speed. I was
under the assumption that the SSD journal would flush to the OSDs slowly at a
later time, and hence I could use slower, cheaper disks for the OSDs. But in
practice, the many articles on the internet about pairing fast journals with
slow OSDs don't seem to hold up.
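The arithmetic behind Christian's estimate is simple enough to sketch (assuming roughly 55 MB/s sustained per 7.2k rpm filestore OSD, a typical figure rather than one measured here):

```shell
# Rough sustained-write ceiling of the original cluster: 4 HDD OSDs,
# replica 2, ~55 MB/s per OSD once seeks and filestore overhead are paid.
osds=4; per_osd_mbs=55; replicas=2
echo "ceiling: $(( osds * per_osd_mbs / replicas )) MB/s"
# prints: ceiling: 110 MB/s
```

which lands right around the ~104-140 MB/s the cluster actually delivered for large writes.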

Will adding more OSD disks per node improve the overall performance?

I can add 4 more disks to each node, but all are 7.2k rpm disks. I am expecting
some kind of parallel writes across these disks to magically improve
performance :D
This is my second experiment with Ceph; last time I gave up and purchased
another costly solution from a vendor. But this time I am determined to fix all
issues and bring up a solid cluster.
Last time the cluster was giving a throughput of around 900 KB/s for 1G writes
from a virtual machine; now things have improved to 1.4 MB/s, but that is still
far slower than the target of 24 MB/s.
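The 4k side of this is latency-bound rather than bandwidth-bound: with oflag=direct, dd issues one write at a time, so throughput is simply one write per round trip. A rough sketch of what the observed 1.4 MB/s implies, assuming fully serialized writes:

```shell
# What 1.4 MB/s of serialized 4k direct writes implies per operation.
awk 'BEGIN { mbs = 1.4; bs_kb = 4; iops = mbs * 1024 / bs_kb; printf "iops=%.0f latency_ms=%.2f\n", iops, 1000 / iops }'
# prints: iops=358 latency_ms=2.79
```

Roughly 2.8 ms per write is consistent with each operation paying network round trips plus a replicated sync commit, which would explain why swapping in faster disks barely moved this number.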

Expecting to make some progress with the help of experts here :)


> rbd_cache is enabled in the configuration, XFS filesystem, LSI 92465-4i raid

> card with 512Mb cache [ssd is in writeback mode with BBU]

>

>

> Before installing ceph, i tried to check the max throughput of the intel 3500 80G

> SSD using a block size of 4M [i read somewhere that ceph uses 4M objects] and

> it was giving 220MB/s {dd if=/dev/zero of=/dev/sdb bs=4M count=1000

> oflag=direct}

>

Irrelevant, sustained sequential writes will be limited by what your OSDs

(HDDs) can sustain.



> *Observation:*

> Now the cluster is up and running, and from the vm i am trying to write a 4g

> file to its volume using dd if=/dev/zero of=/dev/sdb bs=4M count=1000

> oflag=direct. It takes around 39 seconds to write.

>

>  during this time ssd journal was showing disk write of 104M on both the

> ceph servers (dstat sdb) and compute node a network transfer rate of ~110M

> on its 10G storage interface(dstat -nN eth2]

>

As I said, sounds about right.



>

> my questions are:

>

>

>    - Is this the best throughput ceph can offer, or can anything in my

>    environment be optimised to get more performance? [iperf shows a max

>    throughput of 9.8Gbits/s]

>

Not your network.



Watch your nodes with atop and you will note that your HDDs are maxed out.



>

>

>    - I guess the network/SSD is underutilized and can handle more writes;

>    how can this be improved to send more data over the network to the ssd?

>

As jiajia wrote, a cache-tier might give you some speed boosts.

But with those SSDs I'd advise against it, both too small and too low

endurance.



>

>

>    - rbd kernel module wasn't loaded on the compute node; i loaded it manually

>    using "modprobe" and later destroyed/re-created the vms, but this does not

>    give any performance boost. So librbd and RBD are equally fast?

>

Irrelevant and confusing.

Your VMs will use one or the other depending on how they are configured.



>

>

>    - Samsung 840 EVO 512GB shows a throughput of 500MB/s for 4M writes [dd

>    if=/dev/zero of=/dev/sdb bs=4M count=1000 oflag=direct] and for 4k it was

>    equally fast as the Intel S3500 80GB. Does changing my SSD from the Intel

>    S3500 to the Samsung 840 500GB make any performance difference here, just

>    because the 840 EVO is faster for 4M writes? Can Ceph utilize this extra

>    speed?

>

Those SSDs would be an even worse choice for endurance/reliability

reasons, though their larger size offsets that a bit.



Unless you have a VERY good understanding and data on how much your

cluster is going to write, pick at the very least SSDs with 3+ DWPD

endurance like the DC S3610s.

In very lightly loaded cases a DC S3520 with 1 DWPD may be OK, but again, you

need to know what you're doing here.
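The DWPD arithmetic behind that advice can be spelled out (a sketch, assuming the usual 5-year warranty window; the 0.3 DWPD figure is the S3510 rating quoted above):

```shell
# Endurance budget of an 80 GB, 0.3 DWPD SSD. As a journal device it sees
# every byte written to its OSDs, and replica 2 doubles writes cluster-wide.
awk -v size_gb=80 -v dwpd=0.3 'BEGIN { per_day = size_gb * dwpd; printf "budget: %.0f GB/day, %.1f TB total over 5 years\n", per_day, per_day * 365 * 5 / 1024 }'
# prints: budget: 24 GB/day, 42.8 TB total over 5 years
```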



Christian

>

> Can somebody help me understand this better.

>

> Regards,

> Kevin





--

Christian Balzer        Network/Systems Engineer

ch...@gol.com           Global OnLine Japan/Rakuten Communications

http://www.gol.com/





_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
