Re: [ceph-users] Did maximum performance reached?
Hi,

What type of clients do you have?
- Are they physical Linux machines or VMs mounting Ceph RBD or CephFS?
- Or are they simply OpenStack / cloud instances using Ceph as Cinder volumes or something like that?

- Karan

> On 28 Jul 2015, at 11:53, Shneur Zalman Mattern shz...@eimsys.co.il wrote:
> [original message and fio results quoted in full; see the message below]
[ceph-users] Did maximum performance reached?
We've built a Ceph cluster:

- 3 mon nodes (one of them combined with the MDS)
- 3 OSD nodes (each one has 10 OSDs + 2 SSDs for journaling)
- switch: 24 ports x 10G
- 10 gigabit for the public network
- 20 gigabit bonding between the OSDs
- Ubuntu 12.04.5, Ceph 0.87.2

Clients have:

- 10 gigabit for the Ceph connection
- CentOS 6.6 with kernel 3.19.8, with the cephfs kernel module

== fio-2.0.13: seqwrite, bs=1M, filesize=10G, parallel-jobs=16 ==

=== Single client ===

Starting 16 processes (below is just 1 job's info):

trivial-readwrite-grid01: (groupid=0, jobs=1): err= 0: pid=10484: Tue Jul 28 13:26:24 2015
  write: io=10240MB, bw=78656KB/s, iops=76, runt=133312msec
    slat (msec): min=1, max=117, avg=13.01, stdev=12.57
    clat (usec): min=1, max=68, avg=3.61, stdev=1.99
    lat (msec): min=1, max=117, avg=13.01, stdev=12.57
    clat percentiles (usec):
     |  1.00th=[ 1],  5.00th=[ 2], 10.00th=[ 2], 20.00th=[ 2],
     | 30.00th=[ 3], 40.00th=[ 3], 50.00th=[ 3], 60.00th=[ 4],
     | 70.00th=[ 4], 80.00th=[ 5], 90.00th=[ 5], 95.00th=[ 6],
     | 99.00th=[ 9], 99.50th=[10], 99.90th=[23], 99.95th=[28],
     | 99.99th=[62]
    bw (KB/s): min=35790, max=318215, per=6.31%, avg=78816.91, stdev=26397.76
    lat (usec): 2=1.33%, 4=54.43%, 10=43.54%, 20=0.56%, 50=0.11%
    lat (usec): 100=0.03%
  cpu: usr=0.89%, sys=12.85%, ctx=58248, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0

...what's above repeats 16 times...

Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=1219.8MB/s, minb=78060KB/s, maxb=78655KB/s, mint=133312msec, maxt=134329msec

=== Two clients ===

(below is just 1 job's info)

trivial-readwrite-gridsrv: (groupid=0, jobs=1): err= 0: pid=10605: Tue Jul 28 14:05:59 2015
  write: io=10240MB, bw=43154KB/s, iops=42, runt=242984msec
    slat (usec): min=991, max=285653, avg=23716.12, stdev=23960.60
    clat (usec): min=1, max=65, avg=3.67, stdev=2.02
    lat (usec): min=994, max=285664, avg=23723.39, stdev=23962.22
    clat percentiles (usec):
     |  1.00th=[ 2],  5.00th=[ 2], 10.00th=[ 2], 20.00th=[ 2],
     | 30.00th=[ 3], 40.00th=[ 3], 50.00th=[ 3], 60.00th=[ 4],
     | 70.00th=[ 4], 80.00th=[ 5], 90.00th=[ 5], 95.00th=[ 6],
     | 99.00th=[ 8], 99.50th=[10], 99.90th=[28], 99.95th=[37],
     | 99.99th=[56]
    bw (KB/s): min=20630, max=276480, per=6.30%, avg=43328.34, stdev=21905.92
    lat (usec): 2=0.84%, 4=49.45%, 10=49.13%, 20=0.37%, 50=0.18%
    lat (usec): 100=0.03%
  cpu: usr=0.49%, sys=5.68%, ctx=31428, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
    submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0

...what's above repeats 16 times...

Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=687960KB/s, minb=42997KB/s, maxb=43270KB/s, mint=242331msec, maxt=243869msec

And the aggregated result from the second client is almost the same(?!):

Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=679401KB/s, minb=42462KB/s, maxb=42852KB/s, mint=244697msec, maxt=246941msec

If I summarize:

aggrb1 + aggrb2 = 687960KB/s + 679401KB/s = 1367MB/s

It looks like the same bandwidth as from just one client (aggrb=1219.8MB/s), and it was simply divided between the two. Why?

Question: if I connect 12 client nodes, will each one be able to write at just ~100MB/s? Perhaps I need to scale our Ceph out to 15 (how many?) OSD nodes, so that it can serve 2 clients at 1.3GB/s each (the bandwidth of a 10-gig NIC), or not?

     health HEALTH_OK
     monmap e1: 3 mons at {mon1=192.168.56.251:6789/0,mon2=192.168.56.252:6789/0,mon3=192.168.56.253:6789/0}, election epoch 140, quorum 0,1,2 mon1,mon2,mon3
     mdsmap e12: 1/1/1 up {0=mon3=up:active}
     osdmap e832: 31 osds: 30 up, 30 in
     pgmap v106186: 6144 pgs, 3 pools, 2306 GB data, 1379 kobjects
           4624 GB used, 104 TB / 109 TB avail
           6144 active+clean

Perhaps I don't understand something in the Ceph architecture? I thought that each spindle disk can write ~100MB/s, and we have 10 SAS disks in each node, so the aggregate write speed per node is ~900MB/s (because of striping etc.). And we have 3 OSD nodes, and objects are striped across all 30 OSDs - I thought that's also aggregatable and we'd get something around 2.5 GB/s, but no...
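For reference, a fio job file reproducing the workload described above might look like the sketch below. The parameters (sequential write, bs=1M, 10G per file, 16 jobs) are taken from this message; the mount-point directory and the job-section name are assumptions, not from the original run.

    ; seqwrite.fio -- 16 parallel sequential writers, 1MB blocks, 10GB each
    [global]
    rw=write              ; sequential write
    bs=1M                 ; 1MB block size
    size=10G              ; 10GB file per job
    numjobs=16            ; 16 parallel jobs
    directory=/mnt/cephfs ; assumed CephFS mount point

    [trivial-readwrite]

Run with: fio seqwrite.fio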
[ceph-users] Did maximum performance reached?
Hi,

But my question is: why is the speed divided between the clients? And how many OSD nodes, OSD daemons, and PGs do I have to add to (or remove from) Ceph so that each CephFS client can write at its full network speed (10Gbit/s ~ 1.2GB/s)?

From: Johannes Formann mlm...@formann.de
Sent: Tuesday, July 28, 2015 12:46 PM
To: Shneur Zalman Mattern
Subject: Re: [ceph-users] Did maximum performance reached?

Hi,

size=3 would decrease your performance. But with size=2 your results are not bad either.

Math: size=2 means each write is written 4 times (2 copies, each first to the journal, later to disk). Calculating with 1300MB/s „Client“ bandwidth, that means:

2 (size) * 1300 MB/s / 6 (SSD) = 433 MB/s per SSD
2 (size) * 1300 MB/s / 30 (HDD) = 87 MB/s per HDD

greetings
Johannes

> Am 28.07.2015 um 11:41 schrieb Shneur Zalman Mattern shz...@eimsys.co.il:
> [earlier messages and the quoted fio results trimmed]
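A rough back-of-envelope answer to "how many OSD nodes" can be sketched in shell arithmetic. The ~100MB/s-per-HDD figure and size=2 come from this thread; the client count and per-client target are illustrative, and this ignores whether the journal SSDs scale along with the HDDs.

    # Sketch: OSD nodes needed for N clients at ~1200MB/s each, size=2,
    # SSD journals absorbing the journal copy, ~100MB/s per data HDD.
    clients=12; per_client=1200          # MB/s target per client
    size=2;     per_hdd=100              # replica count, MB/s per HDD
    hdds_per_node=10
    total=$((clients * per_client * size))        # MB/s hitting the HDDs
    hdds=$(( (total + per_hdd - 1) / per_hdd ))   # HDDs needed, rounded up
    nodes=$(( (hdds + hdds_per_node - 1) / hdds_per_node ))
    echo "$hdds HDDs across $nodes OSD nodes"     # -> 288 HDDs, 29 nodes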
[ceph-users] Did maximum performance reached?
Hi!

And so, by your math, I would need size = number of OSDs - 30 replicas - for my 120TB cluster to meet my demands? That would leave 4TB of real storage capacity, at a price of $3000 per TB. Is this a joke?

All the best,
Shneur

> From: Johannes Formann mlm...@formann.de
> Sent: Tuesday, July 28, 2015 12:46 PM
> Subject: Re: [ceph-users] Did maximum performance reached?
> [the size=2 math quoted above; trimmed]
[ceph-users] Did maximum performance reached?
Hi, Karan!

These are physical CentOS clients mounting CephFS with the kernel module (kernel 4.1.3).

Thanks

> Hi
> What type of clients do you have?
> - Are they physical Linux machines or VMs mounting Ceph RBD or CephFS?
> - Or are they simply OpenStack / cloud instances using Ceph as Cinder volumes or something like that?
> - Karan
> [quoted original message trimmed]
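For reference, a kernel-client CephFS mount of the kind described here looks roughly like this. The monitor address is taken from the monmap earlier in the thread; the mount point, client name, and secretfile path are assumptions.

    # Mount CephFS via the kernel module (assumed paths and credentials):
    mount -t ceph 192.168.56.251:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret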
[ceph-users] Did maximum performance reached?
Oh, now I have to cry :-) - not because they're not SSDs... they're SAS2 HDDs. Because I need to build something for 140 clients... 4200 OSDs :-(

It looks like I could pick up performance with SSDs, but I need a huge capacity, ~2PB. Perhaps a tiering cache pool can save my money, but I've read here that it's slower than people think... :-(

Why is Lustre more performant? It has the same HDDs, no?
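For what it's worth, a cache tier of the kind mentioned above is configured roughly as below on a Ceph release of that era. The pool names are placeholders and the sizing value is illustrative only; whether writeback tiering actually helps a streaming-write workload is exactly the open question raised here.

    # Sketch: put an SSD pool in front of an HDD pool as a writeback cache.
    # "ssd-cache" and "data" are placeholder pool names.
    ceph osd tier add data ssd-cache
    ceph osd tier cache-mode ssd-cache writeback
    ceph osd tier set-overlay data ssd-cache
    # Cap the cache size (value is illustrative: 1 TB):
    ceph osd pool set ssd-cache target_max_bytes 1099511627776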
[ceph-users] Did maximum performance reached?
Hi, Johannes (that's my grandpa's name),

The size is 2. Do you really think that the number of replicas can increase performance? On http://ceph.com/docs/master/architecture/ it is written:

    Note: Striping is independent of object replicas. Since CRUSH replicates
    objects across OSDs, stripes get replicated automatically.

OK, I'll check it.

Regards,
Shneur

From: Johannes Formann mlm...@formann.de
Sent: Tuesday, July 28, 2015 12:09 PM
To: Shneur Zalman Mattern
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Did maximum performance reached?

> [Johannes' size=3 math and the quoted fio results trimmed; his full message follows below]
Re: [ceph-users] Did maximum performance reached?
Hello,

what is the „size“ parameter of your pool? Some math to show the impact:

size=3 means each write is written 6 times (3 copies, each first to the journal, later to disk). Calculating with 1300MB/s „Client“ bandwidth, that means:

3 (size) * 1300 MB/s / 6 (SSD) = 650 MB/s per SSD
3 (size) * 1300 MB/s / 30 (HDD) = 130 MB/s per HDD

If you use size=3, the results are as good as one can expect. (Even with size=2 the results won't be bad.)

greetings
Johannes

> Am 28.07.2015 um 10:53 schrieb Shneur Zalman Mattern shz...@eimsys.co.il:
> [original message and fio results trimmed]
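For reference, the „size“ parameter Johannes is asking about can be inspected and changed like this ("data" is a placeholder pool name; changing size triggers re-replication, and min_size usually wants adjusting alongside it):

    # Inspect the replica count of a pool:
    ceph osd pool get data size
    # Change it (here to 2 replicas):
    ceph osd pool set data size 2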
Re: [ceph-users] Did maximum performance reached?
The speed is divided because it's fair :)

You have reached the limit your hardware (I guess the SSDs) can deliver. For 2 clients each doing 1200 MB/s you'll basically have to double the number of OSDs.

greetings
Johannes

> Am 28.07.2015 um 11:56 schrieb Shneur Zalman Mattern shz...@eimsys.co.il:
> Hi,
> But my question is: why is the speed divided between the clients? And how many OSD nodes, OSD daemons, and PGs do I have to add to Ceph so that each CephFS client can write at its full network speed (10Gbit/s ~ 1.2GB/s)?
> [rest of the quoted thread and fio results trimmed]
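Johannes' bottleneck math, spelled out as a calculation. All figures (1300MB/s aggregate, size=2, 6 journal SSDs, 30 HDDs) come from this thread; the shell arithmetic is just an illustration.

    # With size=2 each byte is written 4x: 2 copies, each journaled on an
    # SSD and then written to an HDD.
    aggr=1300; size=2; ssds=6; hdds=30
    echo "per-SSD: $(( aggr * size / ssds )) MB/s"   # -> 433 MB/s
    echo "per-HDD: $(( aggr * size / hdds )) MB/s"   # -> 86 MB/s (~87)
    # ~433MB/s is close to the sequential-write ceiling of many SAS/SATA
    # SSDs, which is why the journals, not the network, cap the aggregate.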
Re: [ceph-users] Did maximum performance reached?
On 28/07/15 11:17, Shneur Zalman Mattern wrote:
> Oh, now I have to cry :-) - not because they're not SSDs... they're SAS2 HDDs.
> Because I need to build something for 140 clients... 4200 OSDs :-(
> It looks like I could pick up performance with SSDs, but I need a huge capacity, ~2PB.
> Perhaps a tiering cache pool can save my money, but I've read here that it's slower than people think... :-(
> Why is Lustre more performant? It has the same HDDs, no?

Lustre isn't (A) creating two copies of your data, and it's (B) not executing disk writes as atomic transactions (i.e. no data writeahead log).

The tradeoff for A is that while a Lustre system typically requires an expensive dual-ported RAID controller, Ceph doesn't. You take the money you saved on RAID controllers and spend it on a larger number of cheaper hosts and drives. If you've already bought the Lustre-oriented hardware, then my advice would be to run Lustre on it :-)

The efficient way of handling B is to use SSD journals for your OSDs. Typical Ceph servers have one SSD per approximately 4 OSDs.

John
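For reference, the one-SSD-per-few-OSDs journal layout John describes was typically provisioned like this in the ceph-deploy era. Host and device names are assumptions; ceph-deploy partitions the shared journal device itself.

    # Four OSDs on one host sharing /dev/sdg for their journals (assumed names):
    ceph-deploy osd prepare osdnode1:sdb:/dev/sdg
    ceph-deploy osd prepare osdnode1:sdc:/dev/sdg
    ceph-deploy osd prepare osdnode1:sdd:/dev/sdg
    ceph-deploy osd prepare osdnode1:sde:/dev/sdg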
Re: [ceph-users] Did maximum performance reached?
On 28/07/15 11:53, John Spray wrote:
> [previous message trimmed]

Oh, I've just re-read the original message in this thread, and you're already using SSD journals. So I think the only point of confusion was that you weren't dividing your expected bandwidth number by the number of replicas, right?

> Each spindle disk can write ~100MB/s, and we have 10 SAS disks in each node = aggregate write speed ~900MB/s (because of striping etc.). And we have 3 OSD nodes, and objects are striped across all 30 OSDs - I thought that's also aggregatable and we'd get something around 2.5 GB/s, but no...

Your expected bandwidth (with size=2 replicas) will be (900MB/s * 3)/2 = 1350MB/s -- so I think you're actually doing pretty well with your 1367MB/s number.

John
Re: [ceph-users] Did maximum performance reached?
As I understand now, in this case (30 disks) the 10Gbit network is not the bottleneck!

With another HW config (+5 OSD nodes = +50 disks) I'd get ~3400 MB/s, and 3 clients could work at full bandwidth, yes? OK, let's try!

Perhaps somebody has more suggestions for increasing performance:
1. NVMe journals
2. btrfs on the OSDs
3. SSD-based OSDs
4. 15K HDDs
5. RAID 10 on each OSD node

Everybody - brainstorm!!!

John:
> Your expected bandwidth (with size=2 replicas) will be (900MB/s * 3)/2 = 1350MB/s -- so I think you're actually doing pretty well with your 1367MB/s number.
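Before trying any of those changes, a useful baseline is to benchmark the pool directly with rados bench, which takes CephFS and the client kernel out of the picture. The pool name is a placeholder and the duration/concurrency are arbitrary.

    # 60 seconds of 16-way object writes straight to the pool:
    rados bench -p data 60 write -t 16
    # Run the same command from a second client at the same time; if the
    # aggregate stays flat, the ceiling is in the OSDs/journals, not CephFS.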
Re: [ceph-users] Did maximum performance reached?
Hi,

On 28.07.2015 12:02, Shneur Zalman Mattern wrote:
> Hi!
> And so, by your math, I would need size = number of OSDs - 30 replicas - for my 120TB cluster to meet my demands?

30 replicas is the wrong math! Fewer replicas = more speed (because of less writing); more replicas = less speed. For data safety, a replica count of 3 is recommended.

Udo