Re: [ceph-users] Did maximum performance reached?
Hi, what type of clients do you have? Are they physical Linux machines or VMs mounting Ceph RBD or CephFS? Or are they simply OpenStack / cloud instances using Ceph as Cinder volumes or something like that? - Karan -

On 28 Jul 2015, at 11:53, Shneur Zalman Mattern shz...@eimsys.co.il wrote: *snipsnap*
[ceph-users] Did maximum performance reached?
We've built a Ceph cluster:
3 mon nodes (one of them combined with the MDS)
3 OSD nodes (each has 10 OSDs + 2 SSDs for journaling)
24-port 10G switch: 10 gigabit for the public network, 20 gigabit bonding between the OSDs
Ubuntu 12.04.05, Ceph 0.87.2

Clients have: 10 gigabit for the Ceph connection, CentOS 6.6 with kernel 3.19.8, equipped with the cephfs kernel module.

=== fio-2.0.13, seqwrite, bs=1M, filesize=10G, parallel-jobs=16 ===

Single client: Starting 16 processes (below is just 1 job's info)

trivial-readwrite-grid01: (groupid=0, jobs=1): err= 0: pid=10484: Tue Jul 28 13:26:24 2015
  write: io=10240MB, bw=78656KB/s, iops=76, runt=133312msec
  slat (msec): min=1, max=117, avg=13.01, stdev=12.57
  clat (usec): min=1, max=68, avg=3.61, stdev=1.99
  lat (msec): min=1, max=117, avg=13.01, stdev=12.57
  clat percentiles (usec):
   |  1.00th=[1],  5.00th=[2], 10.00th=[2], 20.00th=[2],
   | 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
   | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
   | 99.00th=[9], 99.50th=[10], 99.90th=[23], 99.95th=[28],
   | 99.99th=[62]
  bw (KB/s): min=35790, max=318215, per=6.31%, avg=78816.91, stdev=26397.76
  lat (usec): 2=1.33%, 4=54.43%, 10=43.54%, 20=0.56%, 50=0.11%
  lat (usec): 100=0.03%
  cpu: usr=0.89%, sys=12.85%, ctx=58248, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0
...what's above repeats 16 times...
Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=1219.8MB/s, minb=78060KB/s, maxb=78655KB/s, mint=133312msec, maxt=134329msec

Two clients (below is just 1 job's info):

trivial-readwrite-gridsrv: (groupid=0, jobs=1): err= 0: pid=10605: Tue Jul 28 14:05:59 2015
  write: io=10240MB, bw=43154KB/s, iops=42, runt=242984msec
  slat (usec): min=991, max=285653, avg=23716.12, stdev=23960.60
  clat (usec): min=1, max=65, avg=3.67, stdev=2.02
  lat (usec): min=994, max=285664, avg=23723.39, stdev=23962.22
  clat percentiles (usec):
   |  1.00th=[2],  5.00th=[2], 10.00th=[2], 20.00th=[2],
   | 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
   | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
   | 99.00th=[8], 99.50th=[10], 99.90th=[28], 99.95th=[37],
   | 99.99th=[56]
  bw (KB/s): min=20630, max=276480, per=6.30%, avg=43328.34, stdev=21905.92
  lat (usec): 2=0.84%, 4=49.45%, 10=49.13%, 20=0.37%, 50=0.18%
  lat (usec): 100=0.03%
  cpu: usr=0.49%, sys=5.68%, ctx=31428, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0
...what's above repeats 16 times...
Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=687960KB/s, minb=42997KB/s, maxb=43270KB/s, mint=242331msec, maxt=243869msec

And almost the same(?!) aggregated result from the second client:
Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=679401KB/s, minb=42462KB/s, maxb=42852KB/s, mint=244697msec, maxt=246941msec

If I summarize: aggrb1 + aggrb2 = 687960KB/s + 679401KB/s = 1367MB/s. That looks like the same bandwidth as from just one client (aggrb=1219.8MB/s), only divided between them. Why?

Question: if I connect 12 client nodes, will each one be able to write at only ~100MB/s? Perhaps I need to scale our Ceph out to 15 (how many?) OSD nodes so that it will serve 2 clients at 1.3GB/s each (the bandwidth of a 10-gig NIC), or not?

health HEALTH_OK
monmap e1: 3 mons at {mon1=192.168.56.251:6789/0,mon2=192.168.56.252:6789/0,mon3=192.168.56.253:6789/0}, election epoch 140, quorum 0,1,2 mon1,mon2,mon3
mdsmap e12: 1/1/1 up {0=mon3=up:active}
osdmap e832: 31 osds: 30 up, 30 in
pgmap v106186: 6144 pgs, 3 pools, 2306 GB data, 1379 kobjects
  4624 GB used, 104 TB / 109 TB avail
  6144 active+clean

Perhaps I don't understand something in the Ceph architecture? I thought that each spindle disk can write ~100MB/s, and we have 10 SAS disks in each node, so the aggregated write speed is ~900MB/s (because of striping etc.). And we have 3 OSD nodes, and objects are also striped across 30 OSDs - I thought it's also
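For reference, a fio job file along these lines should reproduce the workload described above (sequential writes, bs=1M, 10G per job, 16 parallel jobs). The directory is only a placeholder for wherever CephFS is mounted on the client; group_reporting is left out so that each job reports separately, as in the output above.

# seqwrite.fio - sketch of the benchmark job, path is a placeholder
[seqwrite]
directory=/mnt/cephfs
rw=write
bs=1M
size=10G
numjobs=16

Run with: fio seqwrite.fio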
Re: [ceph-users] State of nfs-ganesha CEPH fsal
Hi, On 07/28/2015 11:08 AM, Haomai Wang wrote: On Tue, Jul 28, 2015 at 4:47 PM, Gregory Farnum g...@gregs42.com wrote: On Tue, Jul 28, 2015 at 8:01 AM, Burkhard Linke burkhard.li...@computational.bio.uni-giessen.de wrote: *snipsnap* Can you give some details on that issues? I'm currently looking for a way to provide NFS based access to CephFS to our desktop machines. Ummm...sadly I can't; we don't appear to have any tracker tickets and I'm not sure where the report went to. :( I think it was from Haomai... My fault, I should report this to ticket. I have forgotten the details about the problem, I submit the infos to IRC :-( It related to the ls output. It will print the wrong user/group owner as -1, maybe related to root squash? Are you sure this problem is related to the CephFS FSAL? I also had a hard time setting up ganesha correctly, especially with respect to user and group mappings, especially with a kerberized setup. I'm currently running a small test setup with one server and one client to single out the last kerberos related problems (nfs-ganesha 2.2.0 / Ceph Hammer 0.94.2 / Ubuntu 14.04). User/group listings have been OK so far. Do you remember whether the problem occurs every time or just arbitrarily? Best regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
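For anyone else wiring this up, a minimal nfs-ganesha export for the Ceph FSAL looks roughly like the block below. The Export_Id, Path and Pseudo values are placeholders, and the Ganesha host still needs a working /etc/ceph/ceph.conf and client keyring; the Squash setting is called out only because the "-1 owner" report above may be squash-related.

EXPORT
{
    # all values below are example placeholders
    Export_Id = 1;
    Path = "/";
    Pseudo = "/cephfs";
    Access_Type = RW;
    Squash = No_Root_Squash;
    FSAL {
        Name = CEPH;
    }
}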
[ceph-users] Did maximum performance reached?
Hi, but my question is: why is the speed divided between the clients? And how many OSD nodes, OSD daemons and PGs do I have to add to (or remove from) Ceph so that each CephFS client can write at its maximum network speed (10Gbit/s ~ 1.2GB/s)?

From: Johannes Formann mlm...@formann.de Sent: Tuesday, July 28, 2015 12:46 PM To: Shneur Zalman Mattern Subject: Re: [ceph-users] Did maximum performance reached?
Hi, size=3 would decrease your performance. But with size=2 your results are not bad either. Math: size=2 means each write is written 4 times (2 copies, first to the journal, later to disk). Calculating with 1300 MB/s "client" bandwidth that means:
2 (size) * 1300 MB/s / 6 (SSD) = 433MB/s per SSD
2 (size) * 1300 MB/s / 30 (HDD) = 87MB/s per HDD
greetings Johannes
Am 28.07.2015 um 11:41 schrieb Shneur Zalman Mattern shz...@eimsys.co.il: *snipsnap*
[ceph-users] Did maximum performance reached?
Hi! So by your math I would need to build size = number of OSDs, i.e. 30 replicas, for my cluster of 120TB to get the performance I need - and end up with 4TB of real storage capacity at a price of $3000 per 1TB? Is that a joke? All the best, Shneur

From: Johannes Formann mlm...@formann.de Sent: Tuesday, July 28, 2015 12:46 PM To: Shneur Zalman Mattern Subject: Re: [ceph-users] Did maximum performance reached?
Hi, size=3 would decrease your performance. But with size=2 your results are not bad either. Math: size=2 means each write is written 4 times (2 copies, first to the journal, later to disk). Calculating with 1300 MB/s "client" bandwidth that means:
2 (size) * 1300 MB/s / 6 (SSD) = 433MB/s per SSD
2 (size) * 1300 MB/s / 30 (HDD) = 87MB/s per HDD
greetings Johannes
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
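To make the arithmetic in Johannes' replies explicit: with a replicated pool every client byte is written size times, and each copy hits the journal SSD first and the data disk afterwards, so the per-device load is roughly size * client_bandwidth / device_count. A rough sketch with the numbers from this thread (size=2, ~1300 MB/s of client writes, 6 journal SSDs, 30 data HDDs):

awk 'BEGIN {
  size = 2; client_bw = 1300; ssds = 6; hdds = 30
  # per-device load = replicas * client bandwidth / number of devices
  printf "per-SSD journal load: %.0f MB/s\n", size * client_bw / ssds
  printf "per-HDD data load:    %.0f MB/s\n", size * client_bw / hdds
}'

This prints roughly 433 MB/s per SSD and 87 MB/s per HDD, matching the figures quoted above.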
Re: [ceph-users] State of nfs-ganesha CEPH fsal
Hi, On 07/27/2015 05:42 PM, Gregory Farnum wrote: On Mon, Jul 27, 2015 at 4:33 PM, Burkhard Linke burkhard.li...@computational.bio.uni-giessen.de wrote: Hi, the nfs-ganesha documentation states: ... This FSAL links to a modified version of the CEPH library that has been extended to expose its distributed cluster and replication facilities to the pNFS operations in the FSAL. ... The CEPH library modifications have not been merged into the upstream yet. (https://github.com/nfs-ganesha/nfs-ganesha/wiki/Fsalsupport#ceph) Is this still the case with the hammer release? The FSAL has been upstream for quite a while, but it's not part of our regular testing yet and I'm not sure what it gets from the Ganesha side. I'd encourage you to test it, but be wary — we had a recent report of some issues we haven't been able to set up to reproduce yet. Can you give some details on that issues? I'm currently looking for a way to provide NFS based access to CephFS to our desktop machines. The kernel NFS implementation in Ubuntu had some problems with CephFS in our setup, which I was not able to resolve yet. Ganesha seems to be more promising, since it uses libcephfs directly and does not need a mountpoint of its own. Best regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] State of nfs-ganesha CEPH fsal
On Tue, Jul 28, 2015 at 5:28 PM, Burkhard Linke burkhard.li...@computational.bio.uni-giessen.de wrote: Hi, *snipsnap* Are you sure this problem is related to the CephFS FSAL? I also had a hard time setting up ganesha correctly, especially with respect to user and group mappings, especially with a kerberized setup. I'm currently running a small test setup with one server and one client to single out the last kerberos-related problems (nfs-ganesha 2.2.0 / Ceph Hammer 0.94.2 / Ubuntu 14.04). User/group listings have been OK so far. Do you remember whether the problem occurs every time or just arbitrarily?
Great! I'm not sure of the reason. I guess it may be related to the nfs-ganesha version or the client distro version.
Best regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Best Regards, Wheat ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Did maximum performance reached?
Hi, Karan! Those are physical CentOS clients, with CephFS mounted via the kernel module (kernel 4.1.3). Thanks

Hi, what type of clients do you have? Are they physical Linux machines or VMs mounting Ceph RBD or CephFS? Or are they simply OpenStack / cloud instances using Ceph as Cinder volumes or something like that? - Karan - On 28 Jul 2015, at 11:53, Shneur Zalman Mattern shz...@eimsys.co.il wrote: We've built a Ceph cluster: 3 mon nodes (one of them combined with the MDS), 3 OSD nodes (each has 10 OSDs + 2 SSDs for journaling), 24-port 10G switch: 10 gigabit for the public network, 20 gigabit bonding between the OSDs, Ubuntu 12.04.05, Ceph 0.87.2. Clients have: 10 gigabit for the Ceph connection, CentOS 6.6 with kernel 4.1.3, equipped with the cephfs kernel module.
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Did maximum performance reached?
Oh, now I have to cry :-) Not because they're not SSDs... they're SAS2 HDDs. Because I need to build something for 140 clients... 4200 OSDs :-( It looks like I could pick up performance with SSDs, but I need a huge capacity, ~2PB. Perhaps a cache tiering pool can save my money, but I've read here that it's slower than most people think... :-( Why is Lustre more performant? It uses the same HDDs, doesn't it?
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] which kernel version can help avoid kernel client deadlock
Hi, Ilya, Thanks for your quick reply. Here is the link http://ceph.com/docs/cuttlefish/faq/ http://ceph.com/docs/cuttlefish/faq/ , under the HOW CAN I GIVE CEPH A TRY?” section which talk about the old kernel stuff. By the way, what’s the main reason of using kernel 4.1, is there a lot of critical bugs fixed in that version despite perf improvements? I am worrying kernel 4.1 is too new that may introduce other problems. And if I’m using the librdb API, is the kernel version matters? In my tests, I built a 2-nodes cluster, each with only one OSD with os centos 7.1, kernel version 3.10.0.229 and ceph v0.94.2. I created several rbds and mkfs.xfs on those rbds to create filesystems. (kernel client were running on the ceph cluster) I performed heavy IO tests on those filesystems and found some fio got hung and turned into D state forever (uninterruptible sleep). I suspect it’s the deadlock that make the fio process hung. However the ceph-osd are stil responsive, and I can operate rbd via librbd API. Does this mean it’s not the loopback mount deadlock that cause the fio process hung? Or it is also a deadlock phnonmenon, only one thread is blocked in memory allocation and other threads are still possible to receive API requests, so the ceph-osd are still responsive? What worth mentioning is that after I restart the ceph-osd daemon, all processes in D state come back into normal state. Below is related log in kernel: Jul 7 02:25:39 node0 kernel: INFO: task xfsaild/rbd1:24795 blocked for more than 120 seconds. Jul 7 02:25:39 node0 kernel: echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. Jul 7 02:25:39 node0 kernel: xfsaild/rbd1D 880c2fc13680 0 24795 2 0x0080 Jul 7 02:25:39 node0 kernel: 8801d6343d40 0046 8801d6343fd8 00013680 Jul 7 02:25:39 node0 kernel: 8801d6343fd8 00013680 880c0c0b 880c0c0b Jul 7 02:25:39 node0 kernel: 880c2fc14340 0001 8805bace2528 Jul 7 02:25:39 node0 kernel: Call Trace: Jul 7 02:25:39 node0 kernel: [81609e39] schedule+0x29/0x70 Jul 7 02:25:39 node0 kernel: [a03a1890] _xfs_log_force+0x230/0x290 [xfs] Jul 7 02:25:39 node0 kernel: [810a9620] ? wake_up_state+0x20/0x20 Jul 7 02:25:39 node0 kernel: [a03a1916] xfs_log_force+0x26/0x80 [xfs] Jul 7 02:25:39 node0 kernel: [a03a6390] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs] Jul 7 02:25:39 node0 kernel: [a03a64e1] xfsaild+0x151/0x5e0 [xfs] Jul 7 02:25:39 node0 kernel: [a03a6390] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs] Jul 7 02:25:39 node0 kernel: [8109739f] kthread+0xcf/0xe0 Jul 7 02:25:39 node0 kernel: [810972d0] ? kthread_create_on_node+0x140/0x140 Jul 7 02:25:39 node0 kernel: [8161497c] ret_from_fork+0x7c/0xb0 Jul 7 02:25:39 node0 kernel: [810972d0] ? kthread_create_on_node+0x140/0x140 Jul 7 02:25:39 node0 kernel: INFO: task xfsaild/rbd5:2914 blocked for more than 120 seconds. Does anyone encounter the same problem or could help with this? Thanks. On Jul 28, 2015, at 3:01 PM, Ilya Dryomov idryo...@gmail.com wrote: On Tue, Jul 28, 2015 at 9:17 AM, van chaofa...@owtware.com wrote: Hi, list, I found on the ceph FAQ that, ceph kernel client should not run on machines belong to ceph cluster. As ceph FAQ metioned, “In older kernels, Ceph can deadlock if you try to mount CephFS or RBD client services on the same host that runs your test Ceph cluster. This is not a Ceph-related issue.” Here it says that there will be deadlock if using old kernel version. I wonder if anyone knows which new kernel version solve this loopback mount deadlock. 
It will be a great help since I do need to use rbd kernel client on the ceph cluster. Note that doing this is *not* recommended. That said, if you don't push your system to its knees too hard, it should work. I'm not sure what exactly constitutes and older kernel as per that FAQ (as you haven't even linked it), but even if I knew, I'd still suggest 4.1. As I search more informations, I found two articals https://lwn.net/Articles/595652/ and https://lwn.net/Articles/596618/ talk about supporting nfs loopback mount,it seems they do effort not on memory management only, but also on nfs related codes, I wonder if ceph has also so some effort on kernel client to solve this problem. If ceph did, could anyone help provide the kernel version with the patch? There wasn't any specific effort on the ceph side, but we do try not to break it: sometime around 3.18 a ceph patch was merged that made it impossible to do co-locate kernel client with OSDs; once we realized that, the culprit patch was reverted and the revert was backported. So the bottom line is we don't recommend it, but we try not to break your ability to do it ;) Thanks, Ilya
[ceph-users] wrong documentation in add or rm mons
I followed the documentation below to add monitors to my already existing cluster with 1 mon: http://ceph.com/docs/master/rados/operations/add-or-rm-mons/ When I follow this documentation, the new monitor assimilates the old monitor, so my monitor status is gone. But when I skip the 'ceph mon add <mon-id> <ip>[:<port>]' part, it adds the monitor and everything works well. This issue also happens with 'ceph-deploy mon add', so I think the documentation is not correct. Can someone confirm this? greetz Ramonskie
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
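For comparison, the manual procedure from that documentation page boils down to something like the sketch below (mon id "mon2" and address 192.168.0.2 are placeholders). The step being questioned is the explicit "ceph mon add" before the new daemon is started.

# placeholders: new mon id = mon2, address = 192.168.0.2
mkdir /var/lib/ceph/mon/ceph-mon2
ceph auth get mon. -o /tmp/mon.keyring
ceph mon getmap -o /tmp/monmap
ceph-mon -i mon2 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
ceph mon add mon2 192.168.0.2:6789
ceph-mon -i mon2 --public-addr 192.168.0.2:6789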
Re: [ceph-users] Weird behaviour of cephfs with samba
On Mon, Jul 27, 2015 at 6:25 PM, Jörg Henne henn...@gmail.com wrote: Gregory Farnum greg@... writes: Yeah, I think there were some directory listing bugs in that version that Samba is probably running into. They're fixed in a newer kernel release (I'm not sure which one exactly, sorry). Ok, thanks, good to know! and then detaches itself but the mountpoint stays empty no matter what. /var/log/ceph/ceph-client.admin.log isn't enlighting as well. I've never used a FUSE before, though, so I might be overlooking something. Uh, that's odd. What do you mean it's empty no matter what? Is the ceph-fuse process actually still running? Yes, e.g. 8525 pts/0Sl 0:00 ceph-fuse -m 10.208.66.1:6789 /mnt/regtest2 But root@gru:/mnt# ls /mnt/regtest2 | wc -l 0 With the kernel module I mount just a subpath of the cephfs space like in /etc/fstab: my_monhost:/regression-test /mnt/regtest ... which ceph-fuse doesn't seem to support, but then I would expect regression-test to simply be a sub-directory of /mnt/regtest2. You can mount subtrees with the -r option to ceph-fuse. Once you've started it up you should find a file like client.admin.[0-9]*.asok in (I think?) /var/run/ceph. You can run ceph --admin-daemon /var/run/ceph/{client_asok} status and provide the output to see if it's doing anything useful. Or set debug client = 20 in the config and then upload the client log file either publicly or with ceph-post-file and I'll take a quick look to see what's going on. -Greg (You should also be able to talk to Ceph directly via the Samba daemon; the bindings are in upstream Samba although you probably need to install one of the Ceph packages to make it work. That's the way we test in our nightlies.) Indeed, it seems like something is missing: [2015/07/27 19:21:40.080572, 0] ../lib/util/modules.c:48(load_module) Error loading module '/usr/lib/x86_64-linux-gnu/samba/vfs/ceph.so': /usr/lib/x86_64-linux-gnu/samba/vfs/ceph.so: cannot open shared object file: No such file or directory Mmm, that looks like a Samba config issue which unfortunately I don't know much about. Perhaps you need to install these modules individually? It looks like our nightly tests are just getting the Ceph VFS installed by default. :/ -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
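For reference, the in-Samba Ceph VFS that Greg mentions is normally enabled with a share definition roughly like the one below, assuming the distribution actually ships the ceph.so VFS module. The share name, path and cephx user are placeholders; the path is interpreted inside CephFS, so no local mountpoint is needed.

[cephfs]
    ; placeholders: share path inside CephFS and cephx user id
    path = /regression-test
    vfs objects = ceph
    ceph:config_file = /etc/ceph/ceph.conf
    ceph:user_id = samba
    read only = no
    kernel share modes = no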
Re: [ceph-users] hadoop on ceph
On Mon, Jul 27, 2015 at 6:34 PM, Patrick McGarry pmcga...@redhat.com wrote: Moving this to the ceph-user list where it has a better chance of being answered. On Mon, Jul 27, 2015 at 5:35 AM, jingxia@baifendian.com jingxia@baifendian.com wrote: Dear , I have questions to ask. The doc says hadoop on ceph but requires Hadoop 1.1.X stable series I want to know if CephFS Hadoop plugin can be used by Hadoop 2.6.0 now or it is not support Hadoop2.6.0 and still being developed? If Ceph can not be used by Hadoop2.6.0,then i want to know when it will can be used and is there a team to developing it? I use Hadoop 1.1.2 on ceph is ok, but when hadoop 2.6.0 use ceph,there is something wrong and hdfs is still on. The current Hadoop plugin we test with should run against Hadoop 2. There are a couple of different versions floating around so maybe you managed to grab the old one? But in any case the Ceph plugin has very little to do with whether HDFS gets started or not; that's all in your configuration steps and scripts. Development on the Hadoop integration is pretty sporadic but it runs in our nightlies so we notice if it breaks. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
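For what it's worth, pointing Hadoop 2.x at CephFS is usually a matter of putting the cephfs-hadoop and libcephfs JARs on the Hadoop classpath and adding core-site.xml properties roughly like the ones below; the monitor address is a placeholder and the exact property set may differ between plugin versions.

<!-- sketch only: monitor address and paths are placeholders -->
<property>
  <name>fs.defaultFS</name>
  <value>ceph://mon1:6789/</value>
</property>
<property>
  <name>fs.ceph.impl</name>
  <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
</property>
<property>
  <name>ceph.conf.file</name>
  <value>/etc/ceph/ceph.conf</value>
</property>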
Re: [ceph-users] OSD RAM usage values
On 07/17/2015 02:50 PM, Gregory Farnum wrote: On Fri, Jul 17, 2015 at 1:13 PM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: Hi all, I've read in the documentation that OSDs use around 512MB on a healthy cluster.(http://ceph.com/docs/master/start/hardware-recommendations/#ram) Now, our OSD's are all using around 2GB of RAM memory while the cluster is healthy. PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 29784 root 20 0 6081276 2.535g 4740 S 0.7 8.1 1346:55 ceph-osd 32818 root 20 0 5417212 2.164g 24780 S 16.2 6.9 1238:55 ceph-osd 25053 root 20 0 5386604 2.159g 27864 S 0.7 6.9 1192:08 ceph-osd 33875 root 20 0 5345288 2.092g 3544 S 0.7 6.7 1188:53 ceph-osd 30779 root 20 0 5474832 2.090g 28892 S 1.0 6.7 1142:29 ceph-osd 22068 root 20 0 5191516 2.000g 28664 S 0.7 6.4 31:56.72 ceph-osd 34932 root 20 0 5242656 1.994g 4536 S 0.3 6.4 1144:48 ceph-osd 26883 root 20 0 5178164 1.938g 6164 S 0.3 6.2 1173:01 ceph-osd 31796 root 20 0 5193308 1.916g 27000 S 16.2 6.1 923:14.87 ceph-osd 25958 root 20 0 5193436 1.901g 2900 S 0.7 6.1 1039:53 ceph-osd 27826 root 20 0 5225764 1.845g 5576 S 1.0 5.9 1031:15 ceph-osd 36011 root 20 0 5111660 1.823g 20512 S 15.9 5.8 1093:01 ceph-osd 19736 root 20 0 2134680 0.994g 0 S 0.3 3.2 46:13.47 ceph-osd [root@osd003 ~]# ceph status 2015-07-17 14:03:13.865063 7f1fde5f0700 -1 WARNING: the following dangerous and experimental features are enabled: keyvaluestore 2015-07-17 14:03:13.887087 7f1fde5f0700 -1 WARNING: the following dangerous and experimental features are enabled: keyvaluestore cluster 92bfcf0a-1d39-43b3-b60f-44f01b630e47 health HEALTH_OK monmap e1: 3 mons at {mds01=10.141.16.1:6789/0,mds02=10.141.16.2:6789/0,mds03=10.141.16.3:6789/0} election epoch 58, quorum 0,1,2 mds01,mds02,mds03 mdsmap e17218: 1/1/1 up {0=mds03=up:active}, 1 up:standby osdmap e25542: 258 osds: 258 up, 258 in pgmap v2460163: 4160 pgs, 4 pools, 228 TB data, 154 Mobjects 270 TB used, 549 TB / 819 TB avail 4152 active+clean 8 active+clean+scrubbing+deep We are using erasure code on most of our OSDs, so maybe that is a reason. But also the cache-pool filestore OSDS on 200GB SSDs are using 2GB of RAM. Our erasure code pool (16*14 osds) have a pg_num of 2048; our cache pool (2*14 OSDS) has a pg_num of 1024. Are these normal values for this configuration, and is the documentation a bit outdated, or should we look into something else? 2GB of RSS is larger than I would have expected, but not unreasonable. In particular I don't think we've gathered numbers on either EC pools or on the effects of the caching processes. Which data is actually in memory of the OSDS? Is this mostly cached data? We are short on memory on these servers, can we have influence on this? Thanks again! Kenneth -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Did maximum performance reached?
Hi, Johannes (that's my grandpa's name). The size is 2. Do you really think that the number of replicas can increase performance? On http://ceph.com/docs/master/architecture/ it is written: "Note: Striping is independent of object replicas. Since CRUSH replicates objects across OSDs, stripes get replicated automatically." OK, I'll check it. Regards, Shneur

From: Johannes Formann mlm...@formann.de Sent: Tuesday, July 28, 2015 12:09 PM To: Shneur Zalman Mattern Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Did maximum performance reached?
Hello, what is the "size" parameter of your pool? Some math to show the impact: size=3 means each write is written 6 times (3 copies, first to the journal, later to disk). Calculating with 1300 MB/s "client" bandwidth that means:
3 (size) * 1300 MB/s / 6 (SSD) = 650MB/s per SSD
3 (size) * 1300 MB/s / 30 (HDD) = 130MB/s per HDD
If you use size=3, the results are as good as one can expect. (Even with size=2 the results won't be bad.)
greetings Johannes
Am 28.07.2015 um 10:53 schrieb Shneur Zalman Mattern shz...@eimsys.co.il: *snipsnap*
Re: [ceph-users] Did maximum performance reached?
Hello, what is the "size" parameter of your pool? Some math to show the impact: size=3 means each write is written 6 times (3 copies, first to the journal, later to disk). Calculating with 1300 MB/s "client" bandwidth that means:
3 (size) * 1300 MB/s / 6 (SSD) = 650MB/s per SSD
3 (size) * 1300 MB/s / 30 (HDD) = 130MB/s per HDD
If you use size=3, the results are as good as one can expect. (Even with size=2 the results won't be bad.)
greetings Johannes
Am 28.07.2015 um 10:53 schrieb Shneur Zalman Mattern shz...@eimsys.co.il: *snipsnap*
Re: [ceph-users] OSD RAM usage values
On Tue, Jul 28, 2015 at 11:00 AM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: On 07/17/2015 02:50 PM, Gregory Farnum wrote: On Fri, Jul 17, 2015 at 1:13 PM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: Hi all, I've read in the documentation that OSDs use around 512MB on a healthy cluster.(http://ceph.com/docs/master/start/hardware-recommendations/#ram) Now, our OSD's are all using around 2GB of RAM memory while the cluster is healthy. PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 29784 root 20 0 6081276 2.535g 4740 S 0.7 8.1 1346:55 ceph-osd 32818 root 20 0 5417212 2.164g 24780 S 16.2 6.9 1238:55 ceph-osd 25053 root 20 0 5386604 2.159g 27864 S 0.7 6.9 1192:08 ceph-osd 33875 root 20 0 5345288 2.092g 3544 S 0.7 6.7 1188:53 ceph-osd 30779 root 20 0 5474832 2.090g 28892 S 1.0 6.7 1142:29 ceph-osd 22068 root 20 0 5191516 2.000g 28664 S 0.7 6.4 31:56.72 ceph-osd 34932 root 20 0 5242656 1.994g 4536 S 0.3 6.4 1144:48 ceph-osd 26883 root 20 0 5178164 1.938g 6164 S 0.3 6.2 1173:01 ceph-osd 31796 root 20 0 5193308 1.916g 27000 S 16.2 6.1 923:14.87 ceph-osd 25958 root 20 0 5193436 1.901g 2900 S 0.7 6.1 1039:53 ceph-osd 27826 root 20 0 5225764 1.845g 5576 S 1.0 5.9 1031:15 ceph-osd 36011 root 20 0 5111660 1.823g 20512 S 15.9 5.8 1093:01 ceph-osd 19736 root 20 0 2134680 0.994g 0 S 0.3 3.2 46:13.47 ceph-osd [root@osd003 ~]# ceph status 2015-07-17 14:03:13.865063 7f1fde5f0700 -1 WARNING: the following dangerous and experimental features are enabled: keyvaluestore 2015-07-17 14:03:13.887087 7f1fde5f0700 -1 WARNING: the following dangerous and experimental features are enabled: keyvaluestore cluster 92bfcf0a-1d39-43b3-b60f-44f01b630e47 health HEALTH_OK monmap e1: 3 mons at {mds01=10.141.16.1:6789/0,mds02=10.141.16.2:6789/0,mds03=10.141.16.3:6789/0} election epoch 58, quorum 0,1,2 mds01,mds02,mds03 mdsmap e17218: 1/1/1 up {0=mds03=up:active}, 1 up:standby osdmap e25542: 258 osds: 258 up, 258 in pgmap v2460163: 4160 pgs, 4 pools, 228 TB data, 154 Mobjects 270 TB used, 549 TB / 819 TB avail 4152 active+clean 8 active+clean+scrubbing+deep We are using erasure code on most of our OSDs, so maybe that is a reason. But also the cache-pool filestore OSDS on 200GB SSDs are using 2GB of RAM. Our erasure code pool (16*14 osds) have a pg_num of 2048; our cache pool (2*14 OSDS) has a pg_num of 1024. Are these normal values for this configuration, and is the documentation a bit outdated, or should we look into something else? 2GB of RSS is larger than I would have expected, but not unreasonable. In particular I don't think we've gathered numbers on either EC pools or on the effects of the caching processes. Which data is actually in memory of the OSDS? Is this mostly cached data? We are short on memory on these servers, can we have influence on this? Mmm, we've discussed this a few times on the mailing list. The CERN guys published a document on experimenting with a very large cluster and not enough RAM, but there's nothing I would really recommend changing for a production system, especially an EC one, if you aren't intimately familiar with what's going on. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
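If you want to see where the resident memory actually goes before tuning anything, the tcmalloc heap introspection built into the OSDs is a low-risk starting point; with osd.0 as an example:

ceph tell osd.0 heap stats      # print tcmalloc heap usage for this OSD
ceph tell osd.0 heap release    # hand freed-but-unreturned pages back to the OS

Both work against a running daemon and do not require a restart.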
Re: [ceph-users] Configuring MemStore in Ceph
On Wed, Jul 29, 2015 at 10:21 AM, Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com wrote: Hello Haomai, I am using v0.94.2. Thanks, Aakanksha
-----Original Message----- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Tuesday, July 28, 2015 7:20 PM To: Aakanksha Pudipeddi-SSI Cc: ceph-us...@ceph.com Subject: Re: [ceph-users] Configuring MemStore in Ceph
Which version do you use? https://github.com/ceph/ceph/commit/c60f88ba8a6624099f576eaa5f1225c2fcaab41a should fix your problem
On Wed, Jul 29, 2015 at 5:44 AM, Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com wrote: Hello, I am trying to set up a ceph cluster with a memstore backend. The problem is, it is always created with a fixed size (1GB). I made changes to the ceph.conf file as follows: osd_objectstore = memstore memstore_device_bytes = 5*1024*1024*1024 The resultant cluster still has 1GB allocated to it. Could anybody point out what I am doing wrong here?
What do you mean by "The resultant cluster still has 1GB allocated to it"? Do you mean that you can't write more than 1GB of data?
Thanks, Aakanksha ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Best Regards, Wheat -- Best Regards, Wheat ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
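One thing that may be worth ruling out (an assumption on my part, not something confirmed in this thread): the value may be read as a plain number rather than an arithmetic expression, so spelling the size out in bytes avoids any ambiguity, e.g. in ceph.conf:

[osd]
osd objectstore = memstore
# 5 GiB written out explicitly (5*1024*1024*1024 bytes)
memstore device bytes = 5368709120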
Re: [ceph-users] Updating OSD Parameters
I believe you can use ceph tell to inject it in a running cluster. From your admin node you should be able to run: ceph tell osd.* injectargs '--osd_recovery_max_active 1 --osd_max_backfills 1' Regards, Nikhil Mitra
From: ceph-users ceph-users-boun...@lists.ceph.com on behalf of Noah Mehl noahm...@combinedpublic.com Date: Tuesday, July 28, 2015 at 7:53 AM To: ceph-users@lists.ceph.com Subject: [ceph-users] Updating OSD Parameters
When we update the following in ceph.conf: [osd] osd_recovery_max_active = 1 osd_max_backfills = 1 How do we make sure it takes effect? Do we have to restart all of the ceph OSDs and mons? Thanks! ~Noah
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
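To confirm the injected values actually took effect, you can read them back through each OSD's admin socket on the node that hosts it; with osd.0 as an example:

ceph tell osd.* injectargs '--osd_recovery_max_active 1 --osd_max_backfills 1'
ceph daemon osd.0 config get osd_max_backfills         # run on the host of osd.0
ceph daemon osd.0 config get osd_recovery_max_active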
Re: [ceph-users] Configuring MemStore in Ceph
Which version do you use? https://github.com/ceph/ceph/commit/c60f88ba8a6624099f576eaa5f1225c2fcaab41a should fix your problem On Wed, Jul 29, 2015 at 5:44 AM, Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com wrote: Hello, I am trying to setup a ceph cluster with a memstore backend. The problem is, it is always created with a fixed size (1GB). I made changes to the ceph.conf file as follows: osd_objectstore = memstore memstore_device_bytes = 5*1024*1024*1024 The resultant cluster still has 1GB allocated to it. Could anybody point out what I am doing wrong here? Thanks, Aakanksha ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Best Regards, Wheat ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Configuring MemStore in Ceph
Hello Haomai, I am using v0.94.2. Thanks, Aakanksha -Original Message- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Tuesday, July 28, 2015 7:20 PM To: Aakanksha Pudipeddi-SSI Cc: ceph-us...@ceph.com Subject: Re: [ceph-users] Configuring MemStore in Ceph Which version do you use? https://github.com/ceph/ceph/commit/c60f88ba8a6624099f576eaa5f1225c2fcaab41a should fix your problem On Wed, Jul 29, 2015 at 5:44 AM, Aakanksha Pudipeddi-SSI aakanksha...@ssi.samsung.com wrote: Hello, I am trying to setup a ceph cluster with a memstore backend. The problem is, it is always created with a fixed size (1GB). I made changes to the ceph.conf file as follows: osd_objectstore = memstore memstore_device_bytes = 5*1024*1024*1024 The resultant cluster still has 1GB allocated to it. Could anybody point out what I am doing wrong here? Thanks, Aakanksha ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Best Regards, Wheat ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] which kernel version can help avoid kernel client deadlock
Hi, Ilya, In the dmesg, there is also a lot of libceph socket error, which I think may be caused by my stopping ceph service without unmap rbd. Here is a more than 1 lines log contains more info, http://jmp.sh/NcokrfT http://jmp.sh/NcokrfT Thanks for willing to help. van chaofa...@owtware.com On Jul 28, 2015, at 7:11 PM, Ilya Dryomov idryo...@gmail.com wrote: On Tue, Jul 28, 2015 at 11:19 AM, van chaofa...@owtware.com mailto:chaofa...@owtware.com wrote: Hi, Ilya, Thanks for your quick reply. Here is the link http://ceph.com/docs/cuttlefish/faq/ http://ceph.com/docs/cuttlefish/faq/ , under the HOW CAN I GIVE CEPH A TRY?” section which talk about the old kernel stuff. By the way, what’s the main reason of using kernel 4.1, is there a lot of critical bugs fixed in that version despite perf improvements? I am worrying kernel 4.1 is too new that may introduce other problems. Well, I'm not sure what exactly is in 3.10.0.229, so I can't tell you off hand. I can think of one important memory pressure related fix that's probably not in there. I'm suggesting the latest stable version of 4.1 (currently 4.1.3), because if you hit a deadlock (remember, this is a configuration that is neither recommended nor guaranteed to work), it'll be easier to debug and fix if the fix turns out to be worth it. If 4.1 is not acceptable for you, try the latest stable version of 3.18 (that is 3.18.19). It's an LTS kernel, so that should mitigate some of your concerns. And if I’m using the librdb API, is the kernel version matters? No, not so much. In my tests, I built a 2-nodes cluster, each with only one OSD with os centos 7.1, kernel version 3.10.0.229 and ceph v0.94.2. I created several rbds and mkfs.xfs on those rbds to create filesystems. (kernel client were running on the ceph cluster) I performed heavy IO tests on those filesystems and found some fio got hung and turned into D state forever (uninterruptible sleep). I suspect it’s the deadlock that make the fio process hung. However the ceph-osd are stil responsive, and I can operate rbd via librbd API. Does this mean it’s not the loopback mount deadlock that cause the fio process hung? Or it is also a deadlock phnonmenon, only one thread is blocked in memory allocation and other threads are still possible to receive API requests, so the ceph-osd are still responsive? What worth mentioning is that after I restart the ceph-osd daemon, all processes in D state come back into normal state. Below is related log in kernel: Jul 7 02:25:39 node0 kernel: INFO: task xfsaild/rbd1:24795 blocked for more than 120 seconds. Jul 7 02:25:39 node0 kernel: echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. Jul 7 02:25:39 node0 kernel: xfsaild/rbd1D 880c2fc13680 0 24795 2 0x0080 Jul 7 02:25:39 node0 kernel: 8801d6343d40 0046 8801d6343fd8 00013680 Jul 7 02:25:39 node0 kernel: 8801d6343fd8 00013680 880c0c0b 880c0c0b Jul 7 02:25:39 node0 kernel: 880c2fc14340 0001 8805bace2528 Jul 7 02:25:39 node0 kernel: Call Trace: Jul 7 02:25:39 node0 kernel: [81609e39] schedule+0x29/0x70 Jul 7 02:25:39 node0 kernel: [a03a1890] _xfs_log_force+0x230/0x290 [xfs] Jul 7 02:25:39 node0 kernel: [810a9620] ? wake_up_state+0x20/0x20 Jul 7 02:25:39 node0 kernel: [a03a1916] xfs_log_force+0x26/0x80 [xfs] Jul 7 02:25:39 node0 kernel: [a03a6390] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs] Jul 7 02:25:39 node0 kernel: [a03a64e1] xfsaild+0x151/0x5e0 [xfs] Jul 7 02:25:39 node0 kernel: [a03a6390] ? 
xfs_trans_ail_cursor_first+0x90/0x90 [xfs] Jul 7 02:25:39 node0 kernel: [8109739f] kthread+0xcf/0xe0 Jul 7 02:25:39 node0 kernel: [810972d0] ? kthread_create_on_node+0x140/0x140 Jul 7 02:25:39 node0 kernel: [8161497c] ret_from_fork+0x7c/0xb0 Jul 7 02:25:39 node0 kernel: [810972d0] ? kthread_create_on_node+0x140/0x140 Jul 7 02:25:39 node0 kernel: INFO: task xfsaild/rbd5:2914 blocked for more than 120 seconds. Is that all there is in dmesg? Can you paste the entire dmesg? Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] which kernel version can help avoid kernel client deadlock
On Tue, Jul 28, 2015 at 9:17 AM, van chaofa...@owtware.com wrote: Hi, list, I found on the ceph FAQ that, ceph kernel client should not run on machines belong to ceph cluster. As ceph FAQ metioned, “In older kernels, Ceph can deadlock if you try to mount CephFS or RBD client services on the same host that runs your test Ceph cluster. This is not a Ceph-related issue.” Here it says that there will be deadlock if using old kernel version. I wonder if anyone knows which new kernel version solve this loopback mount deadlock. It will be a great help since I do need to use rbd kernel client on the ceph cluster. Note that doing this is *not* recommended. That said, if you don't push your system to its knees too hard, it should work. I'm not sure what exactly constitutes and older kernel as per that FAQ (as you haven't even linked it), but even if I knew, I'd still suggest 4.1. As I search more informations, I found two articals https://lwn.net/Articles/595652/ and https://lwn.net/Articles/596618/ talk about supporting nfs loopback mount,it seems they do effort not on memory management only, but also on nfs related codes, I wonder if ceph has also so some effort on kernel client to solve this problem. If ceph did, could anyone help provide the kernel version with the patch? There wasn't any specific effort on the ceph side, but we do try not to break it: sometime around 3.18 a ceph patch was merged that made it impossible to do co-locate kernel client with OSDs; once we realized that, the culprit patch was reverted and the revert was backported. So the bottom line is we don't recommend it, but we try not to break your ability to do it ;) Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
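Not a fix for the underlying problem, but when co-locating a kernel client with OSDs it is sometimes suggested to keep a larger memory reserve and start writeback earlier, so the client's dirty pages are less likely to pile up while the local OSDs are themselves short on memory. The sysctls below are the usual knobs; the values are purely illustrative, not recommendations from this thread.

# /etc/sysctl.d/99-colocated-client.conf  (example values only)
vm.min_free_kbytes = 262144
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10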
Re: [ceph-users] State of nfs-ganesha CEPH fsal
On Tue, Jul 28, 2015 at 8:01 AM, Burkhard Linke burkhard.li...@computational.bio.uni-giessen.de wrote: Hi, On 07/27/2015 05:42 PM, Gregory Farnum wrote: On Mon, Jul 27, 2015 at 4:33 PM, Burkhard Linke burkhard.li...@computational.bio.uni-giessen.de wrote: Hi, the nfs-ganesha documentation states: ... This FSAL links to a modified version of the CEPH library that has been extended to expose its distributed cluster and replication facilities to the pNFS operations in the FSAL. ... The CEPH library modifications have not been merged into the upstream yet. (https://github.com/nfs-ganesha/nfs-ganesha/wiki/Fsalsupport#ceph) Is this still the case with the hammer release? The FSAL has been upstream for quite a while, but it's not part of our regular testing yet and I'm not sure what it gets from the Ganesha side. I'd encourage you to test it, but be wary — we had a recent report of some issues we haven't been able to set up to reproduce yet. Can you give some details on that issues? I'm currently looking for a way to provide NFS based access to CephFS to our desktop machines. Ummm...sadly I can't; we don't appear to have any tracker tickets and I'm not sure where the report went to. :( I think it was from Haomai... -Greg The kernel NFS implementation in Ubuntu had some problems with CephFS in our setup, which I was not able to resolve yet. Ganesha seems to be more promising, since it uses libcephfs directly and does not need a mountpoint of its own. Best regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] State of nfs-ganesha CEPH fsal
On Tue, Jul 28, 2015 at 4:47 PM, Gregory Farnum g...@gregs42.com wrote: On Tue, Jul 28, 2015 at 8:01 AM, Burkhard Linke burkhard.li...@computational.bio.uni-giessen.de wrote: Hi, On 07/27/2015 05:42 PM, Gregory Farnum wrote: On Mon, Jul 27, 2015 at 4:33 PM, Burkhard Linke burkhard.li...@computational.bio.uni-giessen.de wrote: Hi, the nfs-ganesha documentation states: ... This FSAL links to a modified version of the CEPH library that has been extended to expose its distributed cluster and replication facilities to the pNFS operations in the FSAL. ... The CEPH library modifications have not been merged into the upstream yet. (https://github.com/nfs-ganesha/nfs-ganesha/wiki/Fsalsupport#ceph) Is this still the case with the hammer release? The FSAL has been upstream for quite a while, but it's not part of our regular testing yet and I'm not sure what it gets from the Ganesha side. I'd encourage you to test it, but be wary — we had a recent report of some issues we haven't been able to set up to reproduce yet. Can you give some details on that issues? I'm currently looking for a way to provide NFS based access to CephFS to our desktop machines. Ummm...sadly I can't; we don't appear to have any tracker tickets and I'm not sure where the report went to. :( I think it was from Haomai... My fault, I should report this to ticket. I have forgotten the details about the problem, I submit the infos to IRC :-( It related to the ls output. It will print the wrong user/group owner as -1, maybe related to root squash? -Greg The kernel NFS implementation in Ubuntu had some problems with CephFS in our setup, which I was not able to resolve yet. Ganesha seems to be more promising, since it uses libcephfs directly and does not need a mountpoint of its own. Best regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Best Regards, Wheat ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Did maximum performance reached?
The speed is divided because it's fair :) You have reached the limit your hardware (I guess the SSDs) can deliver. For 2 clients each doing 1200 MB/s you'll basically have to double the number of OSDs.

greetings Johannes

Am 28.07.2015 um 11:56 schrieb Shneur Zalman Mattern shz...@eimsys.co.il:
Hi, but my question is: why is the speed divided between clients? And how many OSD nodes, OSD daemons and PGs do I have to add to the cluster so that each cephfs client can write at its full network speed (10Gbit/s ~ 1.2GB/s)?

From: Johannes Formann mlm...@formann.de
Sent: Tuesday, July 28, 2015 12:46 PM
To: Shneur Zalman Mattern
Subject: Re: [ceph-users] Did maximum performance reached?
Hi, size=3 would decrease your performance. But with size=2 your results are not bad either. Math: size=2 means each write is written 4 times (2 copies, each first to the journal, later to disk). Calculating with 1300MB/s „Client“ bandwidth that means:
2 (size) * 1300 MB/s / 6 (SSD) = 433MB/s per SSD
2 (size) * 1300 MB/s / 30 (HDD) = 87MB/s per HDD
greetings Johannes

Am 28.07.2015 um 11:41 schrieb Shneur Zalman Mattern shz...@eimsys.co.il:
Hi, Johannes (that's my grandpa's name). The size is 2. Do you really think that the number of replicas can increase performance? On http://ceph.com/docs/master/architecture/ it is written: "Note: Striping is independent of object replicas. Since CRUSH replicates objects across OSDs, stripes get replicated automatically." OK, I'll check it. Regards, Shneur

From: Johannes Formann mlm...@formann.de
Sent: Tuesday, July 28, 2015 12:09 PM
To: Shneur Zalman Mattern
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Did maximum performance reached?
Hello, what is the „size“ parameter of your pool? Some math to show the impact: size=3 means each write is written 6 times (3 copies, each first to the journal, later to disk). Calculating with 1300MB/s „Client“ bandwidth that means:
3 (size) * 1300 MB/s / 6 (SSD) = 650MB/s per SSD
3 (size) * 1300 MB/s / 30 (HDD) = 130MB/s per HDD
If you use size=3, the results are as good as one can expect.
(Even with size=2 the results won’t be bad)

greetings Johannes

Am 28.07.2015 um 10:53 schrieb Shneur Zalman Mattern shz...@eimsys.co.il:
[cluster description and fio results quoted in full; see the original message at the top of this thread]
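For readers who want to redo Johannes' arithmetic against their own hardware, here is a rough back-of-the-envelope sketch. The device counts and the 1300 MB/s aggregate client figure are taken from this thread; size=2 and journal-on-SSD are assumptions that match the poster's setup.

    # rough write-amplification estimate, plain shell arithmetic
    SIZE=2          # pool replica count
    CLIENT_MBS=1300 # aggregate client write bandwidth in MB/s
    SSDS=6          # journal devices
    HDDS=30         # filestore devices
    echo "per-SSD journal write: ~$(( SIZE * CLIENT_MBS / SSDS )) MB/s"
    echo "per-HDD data write:    ~$(( SIZE * CLIENT_MBS / HDDS )) MB/s"

With the numbers above this prints roughly 433 MB/s per SSD and 86 MB/s per HDD, i.e. the same conclusion Johannes reaches: the spinning disks and journals are already close to their limits.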
Re: [ceph-users] OSD RAM usage values
On 07/17/2015 07:50 AM, Gregory Farnum wrote: On Fri, Jul 17, 2015 at 1:13 PM, Kenneth Waegeman kenneth.waege...@ugent.be wrote: Hi all, I've read in the documentation that OSDs use around 512MB on a healthy cluster.(http://ceph.com/docs/master/start/hardware-recommendations/#ram) Now, our OSD's are all using around 2GB of RAM memory while the cluster is healthy. PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 29784 root 20 0 6081276 2.535g 4740 S 0.7 8.1 1346:55 ceph-osd 32818 root 20 0 5417212 2.164g 24780 S 16.2 6.9 1238:55 ceph-osd 25053 root 20 0 5386604 2.159g 27864 S 0.7 6.9 1192:08 ceph-osd 33875 root 20 0 5345288 2.092g 3544 S 0.7 6.7 1188:53 ceph-osd 30779 root 20 0 5474832 2.090g 28892 S 1.0 6.7 1142:29 ceph-osd 22068 root 20 0 5191516 2.000g 28664 S 0.7 6.4 31:56.72 ceph-osd 34932 root 20 0 5242656 1.994g 4536 S 0.3 6.4 1144:48 ceph-osd 26883 root 20 0 5178164 1.938g 6164 S 0.3 6.2 1173:01 ceph-osd 31796 root 20 0 5193308 1.916g 27000 S 16.2 6.1 923:14.87 ceph-osd 25958 root 20 0 5193436 1.901g 2900 S 0.7 6.1 1039:53 ceph-osd 27826 root 20 0 5225764 1.845g 5576 S 1.0 5.9 1031:15 ceph-osd 36011 root 20 0 5111660 1.823g 20512 S 15.9 5.8 1093:01 ceph-osd 19736 root 20 0 2134680 0.994g 0 S 0.3 3.2 46:13.47 ceph-osd [root@osd003 ~]# ceph status 2015-07-17 14:03:13.865063 7f1fde5f0700 -1 WARNING: the following dangerous and experimental features are enabled: keyvaluestore 2015-07-17 14:03:13.887087 7f1fde5f0700 -1 WARNING: the following dangerous and experimental features are enabled: keyvaluestore cluster 92bfcf0a-1d39-43b3-b60f-44f01b630e47 health HEALTH_OK monmap e1: 3 mons at {mds01=10.141.16.1:6789/0,mds02=10.141.16.2:6789/0,mds03=10.141.16.3:6789/0} election epoch 58, quorum 0,1,2 mds01,mds02,mds03 mdsmap e17218: 1/1/1 up {0=mds03=up:active}, 1 up:standby osdmap e25542: 258 osds: 258 up, 258 in pgmap v2460163: 4160 pgs, 4 pools, 228 TB data, 154 Mobjects 270 TB used, 549 TB / 819 TB avail 4152 active+clean 8 active+clean+scrubbing+deep We are using erasure code on most of our OSDs, so maybe that is a reason. But also the cache-pool filestore OSDS on 200GB SSDs are using 2GB of RAM. Our erasure code pool (16*14 osds) have a pg_num of 2048; our cache pool (2*14 OSDS) has a pg_num of 1024. Are these normal values for this configuration, and is the documentation a bit outdated, or should we look into something else? 2GB of RSS is larger than I would have expected, but not unreasonable. In particular I don't think we've gathered numbers on either EC pools or on the effects of the caching processes. FWIW, here's statistics for ~36 ceph-osds on the wip-promote-prob branch after several hours of cache tiering tests (30 OSD base, 6 OS cache tier) using an EC6+2 pool. At the time of this test, 4K random read/writes were being performed. The cache tier OSDs specifically use quite a bit more memory than the base tier. Interestingly in this test major pagefaults are showing up for the cache tier OSDs which is annoying. I may need to tweak kernel VM settings on this box. 
# PROCESS SUMMARY (counters are /sec) #Time PID User PR PPID THRD S VSZ RSS CP SysT UsrT Pct AccuTime RKB WKB MajF MinF Command 09:58:48 715 root 20 1 424 S1G 271M 8 0.19 0.43 6 30:12.64000 2502 /usr/local/bin/ceph-osd 09:58:48 1363 root 20 1 424 S1G 325M 8 0.14 0.33 4 26:50.54000 68 /usr/local/bin/ceph-osd 09:58:48 2080 root 20 1 420 S1G 276M 1 0.21 0.49 7 23:49.36000 2848 /usr/local/bin/ceph-osd 09:58:48 2747 root 20 1 424 S1G 283M 8 0.25 0.68 9 25:16.63000 1391 /usr/local/bin/ceph-osd 09:58:48 3451 root 20 1 424 S1G 331M 6 0.13 0.14 2 27:36.71000 148 /usr/local/bin/ceph-osd 09:58:48 4172 root 20 1 424 S1G 301M 6 0.19 0.43 6 29:44.56000 2165 /usr/local/bin/ceph-osd 09:58:48 4935 root 20 1 420 S1G 310M 9 0.18 0.28 4 29:09.78000 2042 /usr/local/bin/ceph-osd 09:58:48 5750 root 20 1 420 S1G 267M 2 0.11 0.14 2 26:55.31000 866 /usr/local/bin/ceph-osd 09:58:48 6544 root 20 1 424 S1G 299M 7 0.22 0.62 8 26:46.35000 3468 /usr/local/bin/ceph-osd 09:58:48 7379 root 20 1 424 S1G 283M 8 0.16 0.47 6 25:47.86000 538 /usr/local/bin/ceph-osd 09:58:48 8183 root 20 1 424 S1G 269M 4 0.25 0.67 9 35:09.85000 2968 /usr/local/bin/ceph-osd 09:58:48 9026 root 20 1 424 S1G 261M 1 0.19 0.46 6 26:27.36000
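If you just want a single comparable number per host rather than a full collectl or top dump, a small one-liner like the following (a sketch, not something used in the original thread) sums the resident set size of all ceph-osd processes:

    # total and average RSS of all ceph-osd processes on this host, in MB
    ps -C ceph-osd -o rss= | awk '{sum+=$1; n++} END {if (n) printf "total %.0f MB, avg %.0f MB across %d ceph-osd processes\n", sum/1024, sum/1024/n, n}'

Running it on each OSD node makes it easy to spot hosts or roles (e.g. cache-tier OSDs) whose memory footprint stands out.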
Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??
Hi again,

So I have tried:
- changing the CPU frequency: either 1.6GHz or 2.4GHz on all cores
- changing the memory configuration from advanced ECC mode to performance mode, boosting the memory bandwidth from 35GB/s to 40GB/s
- plugging in a second 10Gb/s link and setting up a Ceph internal network
- various tuned-adm profiles such as throughput-performance

This changed about nothing. If
- the CPUs are not maxed out, and lowering the frequency doesn't change a thing
- the network is not maxed out
- the memory doesn't seem to have an impact
- network interrupts are spread across all 8 CPU cores and receive queues are OK
- disks are not used at their maximum potential (iostat shows my dd commands produce many more tps than the 4MB Ceph transfers...)

Where can I possibly find a bottleneck? I'm (almost) out of ideas... :'(

Regards

-----Original message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of SCHAER Frederic
Sent: Friday, 24 July 2015 16:04
To: Christian Balzer; ceph-users@lists.ceph.com
Subject: [PROVENANCE INTERNET] Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

Hi, thanks. I did not know about atop, nice tool... and I don't seem to be IRQ overloaded - I can reach 100% CPU for IRQs, but that is shared across all 8 physical cores. I also discovered turbostat, which showed me the R510s were not configured for performance in the BIOS (but for dbpm - demand based power management), and were not bumping the CPU frequency to 2.4GHz as they should, apparently remaining at 1.6GHz. But changing that did not improve things unfortunately: I now have CPUs using their Xeon turbo frequency, but no throughput improvement. Looking at RPS/RSS, it looks like our Broadcom cards are configured correctly according to Red Hat, i.e. one receive queue per physical core, spreading the IRQ load everywhere. One thing I noticed though is that the Dell BIOS allows changing IRQs... but once you change the network card IRQ, it also changes the RAID card IRQ as well as many others, all sharing the same BIOS IRQ (so that is apparently a useless option). Weird. Still attempting to determine the bottleneck ;)

Regards
Frederic

-----Original message-----
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Thursday, 23 July 2015 14:18
To: ceph-users@lists.ceph.com
Cc: Gregory Farnum; SCHAER Frederic
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

On Thu, 23 Jul 2015 11:14:22 +0100 Gregory Farnum wrote:
Your note that dd can do 2GB/s without networking makes me think that you should explore that. As you say, network interrupts can be problematic in some systems. The only thing I can think of that's been really bad in the past is that some systems process all network interrupts on cpu 0, and you probably want to make sure that it's splitting them across CPUs.

An IRQ overload would be very visible with atop. Splitting the IRQs will help, but it is likely to need some smarts. As in, irqbalance may spread things across NUMA nodes. A card with just one IRQ line will need RPS (Receive Packet Steering); irqbalance can't help it. For example, I have a compute node with such a single-line card and quad Opterons (64 cores, 8 NUMA nodes). The default is all interrupt handling on CPU0 and that is very little, except for eth2. So this gets special treatment:
---
echo 4 > /proc/irq/106/smp_affinity_list
---
Pinning the IRQ for eth2 to CPU 4 by default
---
echo f0 > /sys/class/net/eth2/queues/rx-0/rps_cpus
---
giving RPS CPUs 4-7 to work with.
At peak times it needs more than 2 cores, otherwise with this architecture just using 4 and 5 (same L2 cache) would be better.

Regards,
Christian
--
Christian Balzer    Network/Systems Engineer
ch...@gol.com    Global OnLine Japan/Fusion Communications
http://www.gol.com/
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
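Before tuning anything along the lines Christian describes, it can help to see how interrupts and RPS masks are currently laid out. The read-only commands below are a sketch; eth2 is only an example interface name, substitute your own.

    # which CPUs are servicing the NIC's interrupts
    grep -E 'CPU|eth2' /proc/interrupts
    # current RPS mask for each receive queue (all zeros means RPS is not in use)
    cat /sys/class/net/eth2/queues/rx-*/rps_cpus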
Re: [ceph-users] Updating OSD Parameters
On 28-07-15 16:53, Noah Mehl wrote:
When we update the following in ceph.conf:
[osd]
osd_recovery_max_active = 1
osd_max_backfills = 1
How do we make sure it takes effect? Do we have to restart all of the ceph OSDs and mons?

On a client with the client.admin keyring you execute:

ceph tell osd.* injectargs '--osd_recovery_max_active=1'

It will take effect immediately. Keep in mind though that PGs which are currently recovering are not affected. So if an OSD is currently doing 10 backfills, it will keep doing that; it just won't accept any new backfills. So it slowly goes down to 9, 8, 7, etc., until you see only 1 backfill active. The same goes for recovery.

Wido

Thanks! ~Noah
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
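If you want to confirm that the injected value is what a daemon is actually running with, a quick check against one OSD's admin socket should show it (run on the host carrying that OSD; osd.0 is just an example):

    ceph daemon osd.0 config get osd_recovery_max_active
    ceph daemon osd.0 config get osd_max_backfills

Remember that injectargs only changes the running processes; keeping the values in ceph.conf, as in the original question, is still needed so they survive a restart.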
[ceph-users] Updating OSD Parameters
When we update the following in ceph.conf:
[osd]
osd_recovery_max_active = 1
osd_max_backfills = 1
How do we make sure it takes effect? Do we have to restart all of the ceph OSDs and mons?
Thanks! ~Noah
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Updating OSD Parameters
Wido, That’s awesome, I will look at this right now. Thanks! ~Noah On Jul 28, 2015, at 11:02 AM, Wido den Hollander w...@42on.com wrote: On 28-07-15 16:53, Noah Mehl wrote: When we update the following in ceph.conf: [osd] osd_recovery_max_active = 1 osd_max_backfills = 1 How do we make sure it takes affect? Do we have to restart all of the ceph osd’s and mon’s? On a client with client.admin keyring you execute: ceph tell osd.* injectargs '--osd_recovery_max_active=1' It will take effect immediately. Keep in mind though that PGs which are currently recovering are not affected. So if a OSD is currently doing 10 backfills, it will keep doing that. It however won't accept any new backfills. So it slowly goes down to 9, 8, 7, etc, until you see only 1 backfill active. Same goes for recovery. Wido Thanks! ~Noah ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] why are there degraded PGs when adding OSDs?
If it wouldn't be too much trouble, I'd actually like the binary osdmap as well (it contains the crushmap, but also a bunch of other stuff). There is a command that lets you get old osdmaps from the mon by epoch as long as they haven't been trimmed. -Sam - Original Message - From: Chad William Seys cws...@physics.wisc.edu To: Samuel Just sj...@redhat.com Cc: ceph-users ceph-us...@ceph.com Sent: Tuesday, July 28, 2015 7:40:31 AM Subject: Re: [ceph-users] why are there degraded PGs when adding OSDs? Hi Sam, Trying again today with crush tunables set to firefly. Degraded peaked around 46.8%. I've attached the ceph pg dump and the crushmap (same as osdmap) from before and after the OSD additions. 3 osds were added on host osd03. This added 5TB to about 17TB for a total of around 22TB. 5TB/22TB = 22.7% Is it expected for 46.8% of PGs to be degraded after adding 22% of the storage? Another weird thing is that the kernel RBD clients froze up after the OSDs were added, but worked fine after reboot. (Debian kernel 3.16.7) Thanks for checking! C. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
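For reference, the command Sam is alluding to looks roughly like the following; the epoch number is only an illustration, substitute the epoch you need, and note that it works only as long as the monitors have not trimmed that epoch yet.

    # fetch a historical binary osdmap from the monitors
    ceph osd getmap 832 -o osdmap.832
    # extract and decompile the crushmap embedded in it
    osdmaptool osdmap.832 --export-crush crushmap.832
    crushtool -d crushmap.832 -o crushmap.832.txt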
Re: [ceph-users] which kernel version can help avoid kernel client deadlock
On Jul 28, 2015, at 7:57 PM, Ilya Dryomov idryo...@gmail.com wrote:
On Tue, Jul 28, 2015 at 2:46 PM, van chaofa...@owtware.com wrote:
Hi, Ilya, in the dmesg there are also a lot of libceph socket errors, which I think may be caused by my stopping the ceph service without unmapping the rbd.

Well, sure enough, if you kill all OSDs, the filesystem mounted on top of the rbd device will get stuck.

Sure, it will get stuck if the OSDs are stopped. And since rados requests have a retry policy, the stuck requests recover after I start the daemons again. But in my case the OSDs are running in a normal state and the librbd API can read/write normally, while a heavy fio test against the filesystem mounted on top of the rbd device gets stuck. I wonder if this is triggered by running the rbd kernel client on machines that also run ceph daemons, i.e. the annoying loopback mount deadlock issue. In my opinion, if it were due to the loopback mount deadlock, the OSDs would become unresponsive, no matter whether the requests come from user space (like the API) or from the kernel client. Am I right? If so, my case seems to be triggered by another bug. Anyway, it seems that I should at least separate clients and daemons. Thanks.

Thanks, Ilya
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Did maximum performance reached?
On 28/07/15 11:17, Shneur Zalman Mattern wrote:
Oh, now I have to cry :-) not because it's not SSDs... it's SAS2 HDDs. Because I need to build something for 140 clients... 4200 OSDs :-( It looks like I can pick up my performance with SSDs, but I need a huge capacity ~ 2PB. Perhaps a cache tier pool can save my money, but I've read here that it's slower than most people think... :-( Why does Lustre perform better on the same HDDs?

Lustre isn't (A) creating two copies of your data, and it's (B) not executing disk writes as atomic transactions (i.e. no data write-ahead log). The tradeoff for A is that while a Lustre system typically requires an expensive dual-ported RAID controller, Ceph doesn't: you take the money you saved on RAID controllers and spend it on a larger number of cheaper hosts and drives. If you've already bought the Lustre-oriented hardware then my advice would be to run Lustre on it :-) The efficient way of handling B is to use SSD journals for your OSDs. Typical Ceph servers have one SSD per approx. 4 OSDs.

John
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Did maximum performance reached?
On 28/07/15 11:53, John Spray wrote:
On 28/07/15 11:17, Shneur Zalman Mattern wrote:
[earlier exchange about Lustre vs. Ceph quoted above]

Oh, I've just re-read the original message in this thread, and you're already using SSD journals. So I think the only point of confusion was that you weren't dividing your expected bandwidth number by the number of replicas, right?

Each spindle disk can write ~100MB/s, and we have 10 SAS disks on each node = aggregated write speed is ~900MB/s (because of striping etc.). And we have 3 OSD nodes, and objects are striped across all 30 OSDs - I thought that would also aggregate and we'd get something around 2.5GB/s, but no...

Your expected bandwidth (with size=2 replicas) will be (900MB/s * 3)/2 = 1300MB/s -- so I think you're actually doing pretty well with your 1367MB/s number.

John
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Weird behaviour of cephfs with samba
I use cephfs over the samba vfs and have some issues.

1) If I use a stacked vfs (ceph + scannedonly) I have problems with file ordering, but they are solved by the dirsort vfs (vfs objects = scannedonly dirsort ceph). A single ceph vfs looks good too (and I use it on its own for fast internal shares), but you can try adding dirsort (vfs objects = dirsort ceph).

2) I use 2 patches of my own: https://github.com/mahatma-kaganovich/raw/tree/master/app-portage/ppatch/files/extensions/net-fs/samba/compile - to support "max disk size" and to secure chown. About chown: I am unsure about strictly following the standard system behaviour, but it works for me; without it, a user can chown() even to root. I put the first (max disk size) patch into the samba bugzilla a while ago; the second patch I did not, as I am unsure about its correctness, but I am sure about the security hole.

Jörg Henne writes:
Hi all, the faq at http://ceph.com/docs/cuttlefish/faq/ mentions the possibility to export a mounted cephfs via samba. This combination exhibits a very weird behaviour, though. We have a directory on cephfs with many small xml snippets. If I repeatedly ls the directory on Unix, I get the same answer each and every time:

root@gru:/mnt/regtest/regressiontestdata2/assets# while true; do ls|wc -l; sleep 1; done
851 851 851 ... and so on

If I do the same on the directory exported and mounted via SMB under Windows, the result looks like this (output generated under cygwin, but the effect is present with Windows Explorer as well):

$ while true; do ls|wc -l; sleep 1; done
380 380 380 380 380 1451 362 851 851 851 851 851 851 851 851 1451 362 851 851 851 ...

The problem does not seem to be related to Samba. If I copy the files to an XFS volume and export that, things look fine.

Thanks
Joerg Henne
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
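For anyone trying to reproduce this kind of setup, a minimal share definition using the ceph VFS module might look like the sketch below. The share name, path and cephx user are placeholders; the ceph:* option names are taken from the vfs_ceph manpage, so verify them against the Samba version in use.

    [cephfs-share]
        path = /some/dir/inside/cephfs
        vfs objects = dirsort ceph
        ceph:config_file = /etc/ceph/ceph.conf
        ceph:user_id = samba
        read only = no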
Re: [ceph-users] which kernel version can help avoid kernel client deadlock
On Tue, Jul 28, 2015 at 11:19 AM, van chaofa...@owtware.com wrote: Hi, Ilya, Thanks for your quick reply. Here is the link http://ceph.com/docs/cuttlefish/faq/ , under the HOW CAN I GIVE CEPH A TRY?” section which talk about the old kernel stuff. By the way, what’s the main reason of using kernel 4.1, is there a lot of critical bugs fixed in that version despite perf improvements? I am worrying kernel 4.1 is too new that may introduce other problems. Well, I'm not sure what exactly is in 3.10.0.229, so I can't tell you off hand. I can think of one important memory pressure related fix that's probably not in there. I'm suggesting the latest stable version of 4.1 (currently 4.1.3), because if you hit a deadlock (remember, this is a configuration that is neither recommended nor guaranteed to work), it'll be easier to debug and fix if the fix turns out to be worth it. If 4.1 is not acceptable for you, try the latest stable version of 3.18 (that is 3.18.19). It's an LTS kernel, so that should mitigate some of your concerns. And if I’m using the librdb API, is the kernel version matters? No, not so much. In my tests, I built a 2-nodes cluster, each with only one OSD with os centos 7.1, kernel version 3.10.0.229 and ceph v0.94.2. I created several rbds and mkfs.xfs on those rbds to create filesystems. (kernel client were running on the ceph cluster) I performed heavy IO tests on those filesystems and found some fio got hung and turned into D state forever (uninterruptible sleep). I suspect it’s the deadlock that make the fio process hung. However the ceph-osd are stil responsive, and I can operate rbd via librbd API. Does this mean it’s not the loopback mount deadlock that cause the fio process hung? Or it is also a deadlock phnonmenon, only one thread is blocked in memory allocation and other threads are still possible to receive API requests, so the ceph-osd are still responsive? What worth mentioning is that after I restart the ceph-osd daemon, all processes in D state come back into normal state. Below is related log in kernel: Jul 7 02:25:39 node0 kernel: INFO: task xfsaild/rbd1:24795 blocked for more than 120 seconds. Jul 7 02:25:39 node0 kernel: echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. Jul 7 02:25:39 node0 kernel: xfsaild/rbd1D 880c2fc13680 0 24795 2 0x0080 Jul 7 02:25:39 node0 kernel: 8801d6343d40 0046 8801d6343fd8 00013680 Jul 7 02:25:39 node0 kernel: 8801d6343fd8 00013680 880c0c0b 880c0c0b Jul 7 02:25:39 node0 kernel: 880c2fc14340 0001 8805bace2528 Jul 7 02:25:39 node0 kernel: Call Trace: Jul 7 02:25:39 node0 kernel: [81609e39] schedule+0x29/0x70 Jul 7 02:25:39 node0 kernel: [a03a1890] _xfs_log_force+0x230/0x290 [xfs] Jul 7 02:25:39 node0 kernel: [810a9620] ? wake_up_state+0x20/0x20 Jul 7 02:25:39 node0 kernel: [a03a1916] xfs_log_force+0x26/0x80 [xfs] Jul 7 02:25:39 node0 kernel: [a03a6390] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs] Jul 7 02:25:39 node0 kernel: [a03a64e1] xfsaild+0x151/0x5e0 [xfs] Jul 7 02:25:39 node0 kernel: [a03a6390] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs] Jul 7 02:25:39 node0 kernel: [8109739f] kthread+0xcf/0xe0 Jul 7 02:25:39 node0 kernel: [810972d0] ? kthread_create_on_node+0x140/0x140 Jul 7 02:25:39 node0 kernel: [8161497c] ret_from_fork+0x7c/0xb0 Jul 7 02:25:39 node0 kernel: [810972d0] ? kthread_create_on_node+0x140/0x140 Jul 7 02:25:39 node0 kernel: INFO: task xfsaild/rbd5:2914 blocked for more than 120 seconds. Is that all there is in dmesg? Can you paste the entire dmesg? 
Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Did maximum performance reached?
As I understand it now, in this case (30 disks) the 10Gbit network is not the bottleneck! With another HW config (+5 OSD nodes = +50 disks) I'd get 3400 MB/s, and 3 clients could work at full bandwidth, yes? OK, let's try!

Perhaps somebody has more suggestions for increasing performance:
1. NVMe journals
2. btrfs on the OSDs
3. SSD-based OSDs
4. 15K HDDs
5. RAID 10 on each OSD node
... everybody - brainstorm!!!

John: Your expected bandwidth (with size=2 replicas) will be (900MB/s * 3)/2 = 1300MB/s -- so I think you're actually doing pretty well with your 1367MB/s number.
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Did maximum performance reached?
Hi,

On 28.07.2015 12:02, Shneur Zalman Mattern wrote:
Hi! And so, by your math I would need to build size = number of OSDs, i.e. 30 replicas for my 120TB cluster, to meet my demands?

30 replicas is the wrong math! Fewer replicas = more speed (because of less writing); more replicas = less speed. For data safety a replica count of 3 is recommended.

Udo
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Weird behaviour of cephfs with samba
PS: I started using these patches with samba 4.1. IMHO some of the problems may (or must) be solved not inside the vfs code but outside, in samba itself, but I still use both patches with samba 4.2.3 without having verified that.

Dzianis Kahanovich writes:
[previous message quoted in full above]

--
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Unable to create new pool in cluster
Dear Kefu,
Thanks, it worked. Appreciate your help.
TC

On Sun, Jul 26, 2015 at 8:06 AM, kefu chai tchai...@gmail.com wrote:
On Sat, Jul 25, 2015 at 9:43 PM, Daleep Bais daleepb...@gmail.com wrote:
Hi All, I am unable to create a new pool in my cluster. I do have some existing pools. I get this error:

ceph osd pool create fullpool 128 128
Error EINVAL: crushtool: exec failed: (2) No such file or directory

The existing pools are:
cluster# ceph osd lspools
0 rbd,1 data,3 pspl,
Please suggest.

Daleep, it seems your crushtool was not in $PATH when the monitor started. You might want to make sure you have crushtool installed somewhere, and:

$ ceph --admin-daemon path-to-your-admin-socket config show | grep crushtool   ## check the path to crushtool
$ ceph tell mon.* injectargs --crushtool path-to-your-crushtool   ## point it to your crushtool

HTH.
--
Regards
Kefu Chai
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
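If the injected path should also survive monitor restarts, the same setting can presumably be placed in ceph.conf on the monitor hosts. The option name below matches the injectargs flag used above, but treat both the section and the path as assumptions and verify them against your release:

    [mon]
        ; example path - point this at wherever crushtool is actually installed
        crushtool = /usr/bin/crushtool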
Re: [ceph-users] which kernel version can help avoid kernel client deadlock
On Tue, Jul 28, 2015 at 2:46 PM, van chaofa...@owtware.com wrote: Hi, Ilya, In the dmesg, there is also a lot of libceph socket error, which I think may be caused by my stopping ceph service without unmap rbd. Well, sure enough, if you kill all OSDs, the filesystem mounted on top of rbd device will get stuck. Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] which kernel version can help avoid kernel client deadlock
On Tue, Jul 28, 2015 at 7:20 PM, van chaofa...@owtware.com wrote: On Jul 28, 2015, at 7:57 PM, Ilya Dryomov idryo...@gmail.com wrote: On Tue, Jul 28, 2015 at 2:46 PM, van chaofa...@owtware.com wrote: Hi, Ilya, In the dmesg, there is also a lot of libceph socket error, which I think may be caused by my stopping ceph service without unmap rbd. Well, sure enough, if you kill all OSDs, the filesystem mounted on top of rbd device will get stuck. Sure it will get stuck if osds are stopped. And since rados requests have retry policy, the stucked requests will recover after I start the daemon again. But in my case, the osds are running in normal state and librbd API can read/write normally. Meanwhile, heavy fio test for the filesystem mounted on top of rbd device will get stuck. I wonder if this phenomenon is triggered by running rbd kernel client on machines have ceph daemons, i.e. the annoying loopback mount deadlock issue. In my opinion, if it’s due to the loopback mount deadlock, the OSDs will become unresponsive. No matter the requests are from user space requests (like API) or from kernel client. Am I right? Not necessarily. If so, my case seems to be triggered by another bug. Anyway, it seems that I should separate client and daemons at least. Try 3.18.19 if you can. I'd be interested in your results. Thanks, Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] which kernel version can help avoid kernel client deadlock
Hi, list,

I found in the ceph FAQ that the ceph kernel client should not run on machines belonging to the ceph cluster. As the FAQ mentions: “In older kernels, Ceph can deadlock if you try to mount CephFS or RBD client services on the same host that runs your test Ceph cluster. This is not a Ceph-related issue.” So it says there can be a deadlock when using an old kernel version. I wonder if anyone knows which newer kernel version solves this loopback mount deadlock. It would be a great help, since I do need to use the rbd kernel client on the ceph cluster nodes.

Searching for more information, I found two articles, https://lwn.net/Articles/595652/ and https://lwn.net/Articles/596618/, that talk about supporting NFS loopback mounts. It seems the effort there is not only in memory management but also in the NFS-related code, so I wonder if ceph has made a similar effort in the kernel client to solve this problem. If so, could anyone point me to the kernel version with the patch?

Thanks.
van chaofa...@owtware.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RadosGW - radosgw-agent start error
Hello everyone, I’m setting up a federated configuration of radosgw but when I start a radosgw-agent I face with the error bellow and I’d like to know if I’m doing something wrong…? See the error: root@cephgw0001:~# radosgw-agent -v -c /etc/ceph/radosgw-agent/default.conf 2015-07-28 17:02:03,103 3600 [radosgw_agent][INFO ] ____ __ ___ ___ 2015-07-28 17:02:03,103 3600 [radosgw_agent][INFO ] /__` \ / |\ | / ` /\ / _` |__ |\ | | 2015-07-28 17:02:03,104 3600 [radosgw_agent][INFO ] .__/ | | \| \__,/~~\ \__ |___ | \| | 2015-07-28 17:02:03,104 3600 [radosgw_agent][INFO ] v1.2.3 2015-07-28 17:02:03,105 3600 [radosgw_agent][INFO ] agent options: 2015-07-28 17:02:03,105 3600 [radosgw_agent][INFO ] args: 2015-07-28 17:02:03,106 3600 [radosgw_agent][INFO ]conf : None 2015-07-28 17:02:03,106 3600 [radosgw_agent][INFO ]dest_access_key : 2015-07-28 17:02:03,107 3600 [radosgw_agent][INFO ]dest_secret_key : 2015-07-28 17:02:03,108 3600 [radosgw_agent][INFO ]destination : http://tmk.object-storage.local:80 2015-07-28 17:02:03,108 3600 [radosgw_agent][INFO ]incremental_sync_delay : 30 2015-07-28 17:02:03,109 3600 [radosgw_agent][INFO ]lock_timeout : 60 2015-07-28 17:02:03,109 3600 [radosgw_agent][INFO ]log_file : /var/log/radosgw/radosgw-sync.log 2015-07-28 17:02:03,110 3600 [radosgw_agent][INFO ]log_lock_time : 20 2015-07-28 17:02:03,110 3600 [radosgw_agent][INFO ]max_entries : 1000 2015-07-28 17:02:03,111 3600 [radosgw_agent][INFO ]metadata_only : False 2015-07-28 17:02:03,111 3600 [radosgw_agent][INFO ]num_workers : 1 2015-07-28 17:02:03,112 3600 [radosgw_agent][INFO ]object_sync_timeout : 216000 2015-07-28 17:02:03,112 3600 [radosgw_agent][INFO ]prepare_error_delay : 10 2015-07-28 17:02:03,113 3600 [radosgw_agent][INFO ]quiet : False 2015-07-28 17:02:03,113 3600 [radosgw_agent][INFO ]rgw_data_log_window : 30 2015-07-28 17:02:03,114 3600 [radosgw_agent][INFO ]source : None 2015-07-28 17:02:03,114 3600 [radosgw_agent][INFO ]src_access_key : 2015-07-28 17:02:03,115 3600 [radosgw_agent][INFO ]src_secret_key : 2015-07-28 17:02:03,115 3600 [radosgw_agent][INFO ]src_zone : None 2015-07-28 17:02:03,116 3600 [radosgw_agent][INFO ]sync_scope : incremental 2015-07-28 17:02:03,116 3600 [radosgw_agent][INFO ]test_server_host : None 2015-07-28 17:02:03,117 3600 [radosgw_agent][INFO ]test_server_port : 8080 2015-07-28 17:02:03,118 3600 [radosgw_agent][INFO ]verbose : True 2015-07-28 17:02:03,118 3600 [radosgw_agent][INFO ]versioned : False 2015-07-28 17:02:03,118 3600 [radosgw_agent.client][INFO ] creating connection to endpoint: http://tmk.object-storage.local:80 2015-07-28 17:02:03,120 3600 [radosgw_agent][ERROR ] RegionMapError: Could not retrieve region map from destination: make_request() got an unexpected keyword argument 'params' Regards. Italo Santos http://italosantos.com.br/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
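The option dump above shows source and src_zone as None, so for comparison here is a rough sketch of what a radosgw-agent config file usually contains. It is a YAML file; the zone names, endpoints and keys below are placeholders (only the destination URL is taken from the log above). The 'params' error itself may well be unrelated, e.g. a version mismatch between radosgw-agent and its libraries, but it is still worth checking that the source side is filled in:

    # /etc/ceph/radosgw-agent/default.conf -- illustrative values only
    src_zone: us-east
    source: http://source-gateway.example.com:80
    src_access_key: SOURCE_SYSTEM_USER_ACCESS_KEY
    src_secret_key: SOURCE_SYSTEM_USER_SECRET_KEY
    destination: http://tmk.object-storage.local:80
    dest_access_key: DEST_SYSTEM_USER_ACCESS_KEY
    dest_secret_key: DEST_SYSTEM_USER_SECRET_KEY
    log_file: /var/log/radosgw/radosgw-sync.log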
[ceph-users] Configuring MemStore in Ceph
Hello, I am trying to set up a ceph cluster with a memstore backend. The problem is, it is always created with a fixed size (1GB). I made changes to the ceph.conf file as follows:

osd_objectstore = memstore
memstore_device_bytes = 5*1024*1024*1024

The resulting cluster still has 1GB allocated to it. Could anybody point out what I am doing wrong here?

Thanks,
Aakanksha
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
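One thing to check (an educated guess, not something confirmed in this thread): the ceph.conf parser does not evaluate arithmetic expressions, and memstore_device_bytes defaults to 1GB, which would explain the size you are seeing. Writing the value out as a plain integer is the safe form:

    [osd]
    osd objectstore = memstore
    ; 5 GB spelled out in bytes (5 * 1024 * 1024 * 1024 = 5368709120)
    memstore device bytes = 5368709120

After changing it, the OSDs need to be restarted for the new device size to take effect.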