[ceph-users] Elastic-sized RBD planned?
Hi to all! Perhaps somebody has already thought about this, but my Googling turned up no results. How can I create an RBD that grows on demand with the VM/client's disk usage? Are there options in Ceph for this? Is it planned? Is it a utopian idea? Or does such a client already need CephFS? Thanks, Shneur
[ceph-users] Has maximum performance been reached?
We've built a Ceph cluster:
  3 mon nodes (one of them combined with the mds)
  3 osd nodes (each one has 10 osd + 2 ssd for journaling)
  switch, 24 ports x 10G
  10 gigabit - for the public network
  20 gigabit bonding - between the osds
  Ubuntu 12.04.05, Ceph 0.87.2

Clients have:
  10 gigabit for the ceph connection
  CentOS 6.6 with kernel 3.19.8, equipped with the cephfs kernel module

fio-2.0.13: seqwrite, bs=1M, filesize=10G, parallel jobs=16

=== Single client ===
Starting 16 processes (below is just 1 job's info)

trivial-readwrite-grid01: (groupid=0, jobs=1): err= 0: pid=10484: Tue Jul 28 13:26:24 2015
  write: io=10240MB, bw=78656KB/s, iops=76, runt=133312msec
    slat (msec): min=1, max=117, avg=13.01, stdev=12.57
    clat (usec): min=1, max=68, avg=3.61, stdev=1.99
    lat (msec): min=1, max=117, avg=13.01, stdev=12.57
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[    2], 10.00th=[    2], 20.00th=[    2],
     | 30.00th=[    3], 40.00th=[    3], 50.00th=[    3], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    6],
     | 99.00th=[    9], 99.50th=[   10], 99.90th=[   23], 99.95th=[   28],
     | 99.99th=[   62]
    bw (KB/s): min=35790, max=318215, per=6.31%, avg=78816.91, stdev=26397.76
    lat (usec): 2=1.33%, 4=54.43%, 10=43.54%, 20=0.56%, 50=0.11%
    lat (usec): 100=0.03%
  cpu: usr=0.89%, sys=12.85%, ctx=58248, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0
...what's above repeats 16 times...

Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=1219.8MB/s, minb=78060KB/s, maxb=78655KB/s, mint=133312msec, maxt=134329msec

=== Two clients ===
(below is just 1 job's info)

trivial-readwrite-gridsrv: (groupid=0, jobs=1): err= 0: pid=10605: Tue Jul 28 14:05:59 2015
  write: io=10240MB, bw=43154KB/s, iops=42, runt=242984msec
    slat (usec): min=991, max=285653, avg=23716.12, stdev=23960.60
    clat (usec): min=1, max=65, avg=3.67, stdev=2.02
    lat (usec): min=994, max=285664, avg=23723.39, stdev=23962.22
    clat percentiles (usec):
     |  1.00th=[    2],  5.00th=[    2], 10.00th=[    2], 20.00th=[    2],
     | 30.00th=[    3], 40.00th=[    3], 50.00th=[    3], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    6],
     | 99.00th=[    8], 99.50th=[   10], 99.90th=[   28], 99.95th=[   37],
     | 99.99th=[   56]
    bw (KB/s): min=20630, max=276480, per=6.30%, avg=43328.34, stdev=21905.92
    lat (usec): 2=0.84%, 4=49.45%, 10=49.13%, 20=0.37%, 50=0.18%
    lat (usec): 100=0.03%
  cpu: usr=0.49%, sys=5.68%, ctx=31428, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0
...what's above repeats 16 times...

Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=687960KB/s, minb=42997KB/s, maxb=43270KB/s, mint=242331msec, maxt=243869msec

And almost the same(?!) aggregated result from the second client:

Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=679401KB/s, minb=42462KB/s, maxb=42852KB/s, mint=244697msec, maxt=246941msec

To summarize: aggrb1 + aggrb2 = 687960KB/s + 679401KB/s = 1367MB/s. It looks like the same bandwidth as from just one client (aggrb=1219.8MB/s), only divided between the two - why?

Question: if I connect 12 client nodes, will each one be able to write at just 100MB/s? Perhaps I need to scale our Ceph out to 15 (how many?) OSD nodes - and then it would serve 2 clients at 1.3GB/s each (the bandwidth of a 10gig nic), or not?

  health HEALTH_OK
  monmap e1: 3 mons at {mon1=192.168.56.251:6789/0,mon2=192.168.56.252:6789/0,mon3=192.168.56.253:6789/0}, election epoch 140, quorum 0,1,2 mon1,mon2,mon3
  mdsmap e12: 1/1/1 up {0=mon3=up:active}
  osdmap e832: 31 osds: 30 up, 30 in
  pgmap v106186: 6144 pgs, 3 pools, 2306 GB data, 1379 kobjects
        4624 GB used, 104 TB / 109 TB avail
        6144 active+clean

Perhaps I don't understand something in the Ceph architecture? I thought that each spindle disk can write ~100MB/s, and we have 10 SAS disks in each node, so the aggregated write speed is ~900MB/s per node (because of striping etc.). And we have 3 OSD nodes, and objects are also striped across all 30 osds - I thought it's also
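A back-of-the-envelope model may make the "divided bandwidth" effect concrete. This is a minimal Python sketch, not a measurement: the per-spindle figure and the replica size are assumptions taken from this thread.

    # Assumed model: with a replicated pool, every client byte is
    # eventually written to "size" HDDs, so the spindles form one
    # shared pool of write bandwidth that all clients divide.
    PER_HDD_MBPS = 100      # assumed sustained sequential write per SAS disk
    NUM_HDDS = 30           # 3 OSD nodes x 10 OSDs
    REPLICA_SIZE = 2        # pool "size" parameter, per the thread
    CLIENT_NIC_MBPS = 1200  # ~10 Gbit/s per client

    cluster_limit = NUM_HDDS * PER_HDD_MBPS / REPLICA_SIZE  # ~1500 MB/s

    for clients in (1, 2, 12):
        per_client = min(cluster_limit / clients, CLIENT_NIC_MBPS)
        print(f"{clients} client(s): ~{per_client:.0f} MB/s each")
    # 1 client  -> ~1200 MB/s (NIC-bound; the thread measured 1219.8 MB/s)
    # 2 clients -> ~750 MB/s each (the thread measured ~680 MB/s each)
    # 12 clients -> ~125 MB/s each, close to the feared 100 MB/s

Under these assumptions the cluster-wide limit is fixed by the disks, which is why adding clients divides it rather than adding to it.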
[ceph-users] Has maximum performance been reached?
Hi,

But my question is: why is the speed divided between the clients? And how many OSD nodes, OSD daemons, or PGs do I have to add to (or remove from) Ceph so that each cephfs client can write at its full network speed (10Gbit/s ~ 1.2GB/s)???

From: Johannes Formann mlm...@formann.de
Sent: Tuesday, July 28, 2015 12:46 PM
To: Shneur Zalman Mattern
Subject: Re: [ceph-users] Has maximum performance been reached?

Hi,

size=3 would decrease your performance. But with size=2 your results are not bad either. Math: size=2 means each write is written 4 times (2 copies, first to the journal, later to disk). Calculating with 1300MB/s "client" bandwidth, that means:

2 (size) * 1300 MB/s / 6 (SSD) = 433MB/s on each SSD
2 (size) * 1300 MB/s / 30 (HDD) = 87MB/s on each HDD

greetings
Johannes

[the rest of the quoted thread, including the full fio output, snipped - see the earlier messages above]
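Johannes' write-amplification arithmetic, restated as a minimal Python sketch (the device counts and the 1300 MB/s aggregate client figure come from the thread; nothing else is added):

    # With replicated pools and on-SSD journals, each client byte is
    # written "size" times to the journal SSDs and "size" times again
    # to the data HDDs.
    CLIENT_MBPS = 1300   # aggregate client bandwidth from the thread
    NUM_SSD = 6          # journal SSDs (3 nodes x 2)
    NUM_HDD = 30         # data HDDs (3 nodes x 10)

    for size in (2, 3):
        ssd_load = size * CLIENT_MBPS / NUM_SSD
        hdd_load = size * CLIENT_MBPS / NUM_HDD
        print(f"size={size}: ~{ssd_load:.0f} MB/s per SSD, "
              f"~{hdd_load:.0f} MB/s per HDD")
    # size=2: ~433 MB/s per SSD, ~87 MB/s per HDD (Johannes' numbers)
    # size=3: ~650 MB/s per SSD, ~130 MB/s per HDD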
[ceph-users] Has maximum performance been reached?
Hi!

And so, by your math, to meet my demands I'd need size = number of OSDs - 30 replicas - for my 120TB cluster? And get 4TB of real storage capacity at a price of $3000 per TB? Is that a joke?

All the best, Shneur

From: Johannes Formann mlm...@formann.de
Sent: Tuesday, July 28, 2015 12:46 PM
To: Shneur Zalman Mattern
Subject: Re: [ceph-users] Has maximum performance been reached?

Hi,

size=3 would decrease your performance. But with size=2 your results are not bad either. Math: size=2 means each write is written 4 times (2 copies, first to the journal, later to disk). Calculating with 1300MB/s "client" bandwidth, that means:

2 (size) * 1300 MB/s / 6 (SSD) = 433MB/s on each SSD
2 (size) * 1300 MB/s / 30 (HDD) = 87MB/s on each HDD

greetings
Johannes
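For what it's worth, the capacity side of this can be sketched in two lines of Python (the 120TB raw figure is from the mail above; the point is that replica count divides usable capacity, it does not multiply client bandwidth):

    RAW_TB = 120
    for size in (2, 3, 30):
        print(f"size={size}: usable ~ {RAW_TB / size:.0f} TB")
    # size=30 would indeed leave ~4 TB usable -- hence the "joke" above.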
[ceph-users] Has maximum performance been reached?
Hi, Karan!

These are physical CentOS clients of CephFS, mounted via the kernel module (kernel 4.1.3).
Thanks

Hi
What type of clients do you have?
- Are they physical Linux hosts or VMs, mounting Ceph RBD or CephFS?
- Or are they simply OpenStack / cloud instances using Ceph as Cinder volumes or something like that?
- Karan -

On 28 Jul 2015, at 11:53, Shneur Zalman Mattern shz...@eimsys.co.il wrote:

We've built a Ceph cluster:
  3 mon nodes (one of them combined with the mds)
  3 osd nodes (each one has 10 osd + 2 ssd for journaling)
  switch, 24 ports x 10G
  10 gigabit - for the public network
  20 gigabit bonding - between the osds
  Ubuntu 12.04.05, Ceph 0.87.2

Clients have:
  10 gigabit for the ceph connection
  CentOS 6.6 with kernel 4.1.3, equipped with the cephfs kernel module
[ceph-users] Has maximum performance been reached?
Oh, now I have to cry :-) Not because they're not SSDs... they're SAS2 HDDs. Because I need to build something for 140 clients... 4200 OSDs :-(
It looks like I could pick up performance with SSDs, but I need a huge capacity, ~2PB.
Perhaps a cache tiering pool could save my money, but I've read here that it's slower than everyone thinks... :-(
Why is Lustre more performant? It has the same HDDs, no?
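The 4200-OSD figure can be sanity-checked with rough sizing arithmetic. A minimal Python sketch, where every input is an assumption (per-spindle throughput, replica size, per-client NIC speed), not a recommendation:

    # How many HDD-backed OSDs would let 140 clients each sustain a
    # full 10GbE link, given the effective per-HDD math from above?
    CLIENTS = 140
    PER_CLIENT_MBPS = 1200   # ~10 Gbit/s per client NIC
    REPLICA_SIZE = 2
    PER_HDD_MBPS = 100       # assumed raw sequential write per spindle

    raw_needed = CLIENTS * PER_CLIENT_MBPS * REPLICA_SIZE  # MB/s on HDDs
    osds = raw_needed / PER_HDD_MBPS
    print(f"~{osds:.0f} OSDs")  # ~3360 -- the same ballpark as 4200 above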
[ceph-users] Has maximum performance been reached?
Hi, Johannes (that's my grandpa's name),

The size is 2 - do you really think the number of replicas can increase performance? On http://ceph.com/docs/master/architecture/ it is written: "Note: Striping is independent of object replicas. Since CRUSH replicates objects across OSDs, stripes get replicated automatically."

OK, I'll check it.

Regards, Shneur

From: Johannes Formann mlm...@formann.de
Sent: Tuesday, July 28, 2015 12:09 PM
To: Shneur Zalman Mattern
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Has maximum performance been reached?

Hello,

what is the "size" parameter of your pool? Some math to show the impact: size=3 means each write is written 6 times (3 copies, first to the journal, later to disk). Calculating with 1300MB/s "client" bandwidth, that means:

3 (size) * 1300 MB/s / 6 (SSD) = 650MB/s per SSD
3 (size) * 1300 MB/s / 30 (HDD) = 130MB/s per HDD

If you use size=3, the results are as good as one can expect. (Even with size=2 the results won't be bad.)

greetings
Johannes

Am 28.07.2015 um 10:53 schrieb Shneur Zalman Mattern shz...@eimsys.co.il:

[cluster description and full fio output quoted here in the original - snipped, see the first post in this thread]
Re: [ceph-users] Has maximum performance been reached?
As I understand it now, in this case (30 disks) the 10Gbit network is not the bottleneck!

With another HW config (+5 OSD nodes = +50 disks) I'd get 3400 MB/s, and 3 clients could work at full bandwidth, yes? OK, let's try ! ! ! ! ! ! !

Perhaps somebody has more suggestions for increasing performance - everybody, brainstorm!!!
1. NVMe journals
2. btrfs on the osds
3. ssd-based osds
4. 15K hdds
5. RAID 10 on each OSD node

John: Your expected bandwidth (with size=2 replicas) will be (900MB/s * 3)/2 = 1300MB/s -- so I think you're actually doing pretty well with your 1367MB/s number.
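John's rule of thumb generalizes directly. A minimal Python sketch under the same assumptions (900 MB/s of raw disk bandwidth per OSD node, size=2, and ignoring the clients' NIC limits):

    # Expected aggregate write bandwidth
    #   ~ (per-node disk bandwidth * OSD nodes) / replica size
    PER_NODE_MBPS = 900   # ~10 HDDs x ~90 MB/s per OSD node
    REPLICA_SIZE = 2

    def expected_mbps(nodes: int) -> float:
        return PER_NODE_MBPS * nodes / REPLICA_SIZE

    print(expected_mbps(3))  # 1350.0 -- John rounds this to ~1300 MB/s,
                             # matching the measured 1367 MB/s
    print(expected_mbps(8))  # 3600.0 -- near the 3400 MB/s hoped for above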
[ceph-users] Client connections for concurrent access to Ceph
Workaround... We're now building a huge computing cluster: 140 diskless computing nodes, all pulling a lot of computing data from storage concurrently. A user who submits a job to the cluster also needs access to the same storage location (to watch progress and results).

We've built a Ceph cluster:
  3 mon nodes (one of them combined with the mds)
  3 osd nodes (each one has 10 osd + ssd for journaling)
  switch, 24 ports x 10G
  10 gigabit - for the public network
  20 gigabit bonding - between the osds
  Ubuntu 12.04.05, Ceph 0.87.2 - giant

Clients have:
  10 gigabit for the ceph connection
  CentOS 6.6 with upgraded kernel 3.19.8 (the already-running computing cluster)

Naturally, all nodes, switches and clients were configured for jumbo frames.

First test: I thought to make one big RBD and share it, but:
- RBD supports multiple clients mapping/mounting it, but not parallel writes...

Second test: NFS over RBD - it works pretty well, but:
1. The NFS gateway is a single point of failure.
2. There's no performance scaling of the scale-out storage, i.e. a bottleneck (limited by the bandwidth of the NFS gateway - see the sketch after this message).

Third test: We wanted to try CephFS, because our client is familiar with Lustre, which is very close to CephFS in capabilities:
1. I used my Ceph nodes in the client role: I mounted CephFS on one of the nodes and ran dd with bs=1M...
   - I got wonderful write performance, ~1.1 GBytes/s (really close to the 10Gbit network throughput).
2. I connected a CentOS client to the 10gig public network and mounted CephFS, but...
   - it was just ~250 MBytes/s.
3. I connected an Ubuntu client (not a ceph member) to the 10gig public network and mounted CephFS, and...
   - it was also ~260 MBytes/s.

Now I have to know: perhaps ceph member nodes have privileged access??? I'm sure you have more Ceph deployment experience - have you seen such CephFS performance deviations?

Thanks, Shneur
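On the second test's bottleneck point: the difference between a gateway architecture and direct CephFS access can be sketched in a few lines of Python. All numbers here are assumptions (one 10GbE gateway, the disk-limited cluster figure from the other thread), purely to illustrate the shape of the scaling:

    # NFS-over-RBD: every byte funnels through one gateway NIC.
    # CephFS: clients stripe writes across all OSD nodes directly,
    # so the ceiling is the cluster's disk capability instead.
    CLIENT_NIC_MBPS = 1200    # ~10 Gbit/s per client
    GATEWAY_NIC_MBPS = 1200   # single NFS gateway, one 10 GbE link
    CLUSTER_MBPS = 1500       # assumed disk-limited cluster capability

    def nfs_aggregate(clients: int) -> int:
        return min(clients * CLIENT_NIC_MBPS, GATEWAY_NIC_MBPS)

    def cephfs_aggregate(clients: int) -> int:
        return min(clients * CLIENT_NIC_MBPS, CLUSTER_MBPS)

    for n in (1, 2, 4):
        print(f"{n} clients: NFS ~{nfs_aggregate(n)} MB/s total, "
              f"CephFS ~{cephfs_aggregate(n)} MB/s total")

The NFS total never rises above the gateway's single link no matter how many clients are added, while the CephFS ceiling grows with the cluster itself.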