[ceph-users] Rados gateway / RBD access restrictions
Hi, I've been playing around with the rados gateway and RBD and have some questions about user access restrictions. I'd like to be able to set up a cluster that would be shared among different clients without any conflicts... Is there a way to limit S3/Swift clients to writing data only to one bucket? Currently S3 users can create their own buckets, as many as they want - it would be good to have some kind of control over what a user can and can't do. I found this thread about namespaces: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/033451.html ..but it's old and I was wondering if maybe the namespace feature is better documented somewhere now? Another problem: is there a way to limit RBD clients to a certain bandwidth and/or iops they can use, so that one client can't disrupt another client's VMs, for example? Cheers, J -- Jacek Jarosiewicz IT Systems Administrator SUPERMEDIA Sp. z o.o., Warsaw ul. Senatorska 13/15, 00-075 Warszawa District Court for the Capital City of Warsaw, 12th Commercial Division of the National Court Register, KRS 029537; share capital 42,756,000 PLN; NIP: 957-05-49-503 Mailing address: ul. Jubilerska 10, 04-190 Warszawa SUPERMEDIA - http://www.supermedia.pl internet access - hosting - colocation - links - telephony ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
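A hedged pointer on the bucket question: radosgw has a per-user max-buckets attribute, so bucket creation can at least be capped. A minimal sketch, assuming these flags exist in your radosgw-admin version:

    radosgw-admin user modify --uid=johndoe --max-buckets=1
    radosgw-admin user info --uid=johndoe | grep max_buckets

This only caps how many buckets the user may create; it does not by itself restrict writes to one specific existing bucket.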
Re: [ceph-users] file/directory invisible through ceph-fuse
On 2015-07-01 16:11, Gregory Farnum wrote: On Wed, Jul 1, 2015 at 9:02 AM, flisky yinjif...@lianjia.com wrote: Hi list, I've run into a strange problem: sometimes I cannot see a file/directory created by another ceph-fuse client. It becomes visible after I touch/mkdir the same name. Any thoughts? What version are you running? We've seen a few things like this with older releases, although usually it's in the kernel... -Greg ceph-fuse: 0.94.1 kernel version: 2.6.32-431.el6.x86_64 FUSE library version: 2.8.3 FUSE kernel interface version: 7.12 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] bucket owner vs S3 ACL?
Hi Valery, With the old account, did you try to give FULL access to the new user ID? The process should be: from the OLD account, add FULL access for the NEW account (S3 ACL, with CloudBerry for example); with radosgw-admin, update the link from the OLD account to the NEW account (the link allows the user to see the bucket with a bucket list command); from the NEW account, remove the FULL access for the old account (S3 ACL, with CloudBerry for example). Thanks On Jun 29, 2015, at 11:46 AM, Valery Tschopp valery.tsch...@switch.ch wrote: Hi guys, We use the radosgw (v0.80.9) with the Openstack Keystone integration. One project has been deleted, so now I have to transfer the ownership of all its buckets to another user/project. Using radosgw-admin I have changed the owner:

    radosgw-admin bucket link --uid NEW_USER_ID --bucket BUCKET_NAME

And the owner has been updated:

    radosgw-admin bucket stats --bucket BUCKET_NAME
    { "bucket": "BUCKET_NAME",
      "pool": ".rgw.buckets",
      "index_pool": ".rgw.buckets.index",
      "id": "default.4063334.17",
      "marker": "default.4063334.17",
      "owner": "NEW_USER_ID",
      "ver": 66301,
      "master_ver": 0,
      "mtime": 1435583681,
      "max_marker": "",
      "usage": { "rgw.main": { "size_kb": 189433890, "size_kb_actual": 189473684, "num_objects": 19043},
                 "rgw.multimeta": { "size_kb": 0, "size_kb_actual": 0, "num_objects": 0}},
      "bucket_quota": { "enabled": false, "max_size_kb": -1, "max_objects": -1}
    }

But the S3 ACL of this bucket still references the old user/project (from radosgw.log) when I try to access it with the new owner:

    2015-06-29 17:08:33.236265 7f40d8a76700 15 Read AccessControlPolicy
    <AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>OLD_USER_ID</ID><DisplayName>OLD_PROJECT_NAME</DisplayName></Owner><AccessControlList><Grant><Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="CanonicalUser"><ID>OLD_USER_ID</ID><DisplayName>OLD_PROJECT_NAME</DisplayName></Grantee><Permission>FULL_CONTROL</Permission></Grant></AccessControlList></AccessControlPolicy>

Therefore I get a 403, because the S3 ACL still enforces the old owner, not the new one. How can I update these S3 ACLs and fully transfer the ownership to the new owner/project? Cheers, Valery -- SWITCH -- Valery Tschopp, Software Engineer, Peta Solutions Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland email: valery.tsch...@switch.ch phone: +41 44 268 1544 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
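A sketch of how the re-grant step could be done from the S3 side with s3cmd (assuming its --acl-grant/--acl-revoke options accept a canonical user ID, as its help text describes; bucket and IDs as above):

    # as the old owner: grant the new user full control
    s3cmd setacl --acl-grant=full_control:NEW_USER_ID s3://BUCKET_NAME
    # relink the bucket to the new owner (as already done above)
    radosgw-admin bucket link --uid NEW_USER_ID --bucket BUCKET_NAME
    # as the new owner: drop the old grant
    s3cmd setacl --acl-revoke=full_control:OLD_USER_ID s3://BUCKET_NAME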
Re: [ceph-users] Very low 4k randread performance ~1000iops
Thanks Mark Are there any plans for ZFS like L2ARC to CEPH or is the cache tiering what should work like this in the future? I have tested cache tier + EC pool, and that created too much load on our servers, so it was not viable to be used. I was also wondering if EnhanceIO would be a good solution for getting more random iops. I've read some Sébastien's writings. Br, Tuomas -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: 1. heinäkuuta 2015 20:29 To: Tuomas Juntunen; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops On 07/01/2015 12:13 PM, Tuomas Juntunen wrote: Hi Yes, the OSD's are on spinning disks and we have 18 SSD's for journal, one SSD for two OSD's The OSD's are: Model Family: Seagate Barracuda 7200.14 (AF) Device Model: ST2000DM001-1CH164 What I've understood the journals are not used as read cache at all, just for writing. Would SSD based cache pool be viable solution here? Ok, so that makes more sense. The performance is still lower than expected but maybe 3-4x rather than several orders of magnitude. My guess is that cache tiering in it's current form probably won't help you much unless you have a workload that fits mostly into the cache. The promotion penalty is really high though so we likely will have to promote much more slowly than we currently do. Mark Br, T -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: 1. heinäkuuta 2015 13:58 To: Tuomas Juntunen; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops On 06/30/2015 10:42 PM, Tuomas Juntunen wrote: Hi For seq reads here's the latencies: lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.02%, 100=0.03% lat (usec) : 250=1.02%, 500=87.09%, 750=7.47%, 1000=1.50% lat (msec) : 2=0.76%, 4=1.72%, 10=0.19%, 20=0.19% Random reads: lat (usec) : 10=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.03%, 50=0.55% lat (msec) : 100=99.31%, 250=0.08% 100msecs seems a lot to me. It is, but what's more interesting imho is that it's so consistent. You don't have some ops completing fast and other ones completing slowly holding everything up. It's like the OSDs are simply overloaded with concurrent IOs and everything is waiting. Maybe I'm confused, are your OSDs on SSDs? Are there spinning disks involved? If so, what model(s)? You might want to use collectl -sD -oT on one of the OSD nodes during the test and see what the IO to the disk looks like during random reads and the especially with the svctime for the disks is like. Mark Br,T -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: 30. kesäkuuta 2015 22:01 To: Tuomas Juntunen; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops Seems reasonable. What's the latency distribution look like in your fio output file? Would be useful to know if it's universally slow or if some ops are taking much longer to complete than others. Mark On 06/30/2015 01:27 PM, Tuomas Juntunen wrote: I created a file which has the following parameters [random-read] rw=randread size=128m directory=/root/asd ioengine=libaio bs=4k #numjobs=8 iodepth=64 Br,T -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 30. kesäkuuta 2015 20:55 To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops Hi Tuomos, Can you paste the command you ran to do the test? 
Thanks, Mark On 06/30/2015 12:18 PM, Tuomas Juntunen wrote: Hi It's probably not hitting the disks, but that really doesn't matter. The point is we have very responsive VMs while writing and that is what the users will see. The iops we get with sequential read is good, but the random read is way too low. Is using SSDs as OSDs the only way to get it up? Or is there some tunable which would enhance it? I would assume Linux caches reads in memory and serves them from there, but at least now we don't see it. Br, Tuomas From: Somnath Roy [mailto:somnath@sandisk.com] Sent: 30. kesäkuuta 2015 19:24 To: Tuomas Juntunen; 'ceph-users' Subject: RE: [ceph-users] Very low 4k randread performance ~1000iops Break it down, try fio-rbd to see what performance you are getting.. But, I am really surprised you are getting 100k iops for write, did you check it is hitting the disks? Thanks Regards Somnath From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tuomas Juntunen Sent: Tuesday, June 30, 2015 8:33 AM To: 'ceph-users' Subject: [ceph-users] Very low 4k randread performance ~1000iops Hi I have been trying to figure out why our 4k random reads in VMs are so bad. I am using fio to test this. Write : 170k iops Random write : 109k iops Read :
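To take the VM/qemu layer out of the picture, fio can also drive RBD directly through its rbd ioengine; a minimal sketch, assuming an existing test image named fio-test in pool rbd and an fio build with the rbd engine compiled in:

    [rbd-randread]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=fio-test
    rw=randread
    bs=4k
    iodepth=64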
Re: [ceph-users] Very low 4k randread performance ~1000iops
On 07/01/2015 01:39 PM, Tuomas Juntunen wrote: Thanks Mark Are there any plans for ZFS like L2ARC to CEPH or is the cache tiering what should work like this in the future? I have tested cache tier + EC pool, and that created too much load on our servers, so it was not viable to be used. We are doing a lot of work in this space right now. Hopefully we'll see improvements coming in the coming releases. I was also wondering if EnhanceIO would be a good solution for getting more random iops. I've read some Sébastien's writings. Possibly! Try it and let us know. ;) Br, Tuomas -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: 1. heinäkuuta 2015 20:29 To: Tuomas Juntunen; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops On 07/01/2015 12:13 PM, Tuomas Juntunen wrote: Hi Yes, the OSD's are on spinning disks and we have 18 SSD's for journal, one SSD for two OSD's The OSD's are: Model Family: Seagate Barracuda 7200.14 (AF) Device Model: ST2000DM001-1CH164 What I've understood the journals are not used as read cache at all, just for writing. Would SSD based cache pool be viable solution here? Ok, so that makes more sense. The performance is still lower than expected but maybe 3-4x rather than several orders of magnitude. My guess is that cache tiering in it's current form probably won't help you much unless you have a workload that fits mostly into the cache. The promotion penalty is really high though so we likely will have to promote much more slowly than we currently do. Mark Br, T -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: 1. heinäkuuta 2015 13:58 To: Tuomas Juntunen; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops On 06/30/2015 10:42 PM, Tuomas Juntunen wrote: Hi For seq reads here's the latencies: lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.02%, 100=0.03% lat (usec) : 250=1.02%, 500=87.09%, 750=7.47%, 1000=1.50% lat (msec) : 2=0.76%, 4=1.72%, 10=0.19%, 20=0.19% Random reads: lat (usec) : 10=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.03%, 50=0.55% lat (msec) : 100=99.31%, 250=0.08% 100msecs seems a lot to me. It is, but what's more interesting imho is that it's so consistent. You don't have some ops completing fast and other ones completing slowly holding everything up. It's like the OSDs are simply overloaded with concurrent IOs and everything is waiting. Maybe I'm confused, are your OSDs on SSDs? Are there spinning disks involved? If so, what model(s)? You might want to use collectl -sD -oT on one of the OSD nodes during the test and see what the IO to the disk looks like during random reads and the especially with the svctime for the disks is like. Mark Br,T -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: 30. kesäkuuta 2015 22:01 To: Tuomas Juntunen; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops Seems reasonable. What's the latency distribution look like in your fio output file? Would be useful to know if it's universally slow or if some ops are taking much longer to complete than others. Mark On 06/30/2015 01:27 PM, Tuomas Juntunen wrote: I created a file which has the following parameters [random-read] rw=randread size=128m directory=/root/asd ioengine=libaio bs=4k #numjobs=8 iodepth=64 Br,T -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 30. 
kesäkuuta 2015 20:55 To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops Hi Tuomos, Can you paste the command you ran to do the test? Thanks, Mark On 06/30/2015 12:18 PM, Tuomas Juntunen wrote: Hi It’s not probably hitting the disks, but that really doesn’t matter. The point is we have very responsive VM’s while writing and that is what the users will see. The iops we get with sequential read is good, but the random read is way too low. Is using SSD’s as OSD’s the only way to get it up? or is there some tunable which would enhance it? I would assume Linux caches reads in memory and serves them from there, but atleast now we don’t see it. Br, Tuomas *From:*Somnath Roy [mailto:somnath@sandisk.com] *Sent:* 30. kesäkuuta 2015 19:24 *To:* Tuomas Juntunen; 'ceph-users' *Subject:* RE: [ceph-users] Very low 4k randread performance ~1000iops Break it down, try fio-rbd to see what is the performance you getting.. But, I am really surprised you are getting 100k iops for write, did you check it is hitting the disks ? Thanks Regards Somnath *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Tuomas Juntunen *Sent:* Tuesday, June 30, 2015 8:33 AM *To:* 'ceph-users' *Subject:* [ceph-users] Very low 4k randread performance ~1000iops Hi I have been trying
Re: [ceph-users] Error create subuser
Hi, I think it's because the secret key for the swift subuser is not generated:

    radosgw-admin key create --subuser=johndoe:swift --key-type=swift --gen-secret

Mikaël On 01/07/2015 14:50, Jimmy Goffaux wrote: radosgw-agent = 1.2.1, trusty Ubuntu 14.04 Hello, According to the documentation here: http://ceph.com/docs/master/radosgw/admin/ I followed the documentation to the letter and the result is totally different:

    root@ih-prd-rgw01:~# radosgw-admin user create --uid=johndoe --display-name="John Doe" --email=m...@email.com
    { "user_id": "johndoe",
      [...]
      "subusers": [],
      "keys": [
        { "user": "johndoe",
          "access_key": "SO4FYX3VXA8TO9D9AAM9",
          "secret_key": "tnOIYOuPztmWcnYfP5fGfPAWb5+thPqqdML0+Fmc"}],
      "swift_keys": [],
      [...]

    root@ih-prd-rgw01:~# radosgw-admin subuser create --uid=johndoe --subuser=johndoe:swift --access=full
    { "user_id": "johndoe",
      [...]
      "subusers": [
        { "id": "johndoe:swift",
          "permissions": "full-control"}],
      "keys": [
        { "user": "johndoe:swift",
          "access_key": "6ENTC5V4OD15A3UO9B11",
          "secret_key": ""},
        { "user": "johndoe",
          "access_key": "SO4FYX3VXA8TO9D9AAM9",
          "secret_key": "tnOIYOuPztmWcnYfP5fGfPAWb5+thPqqdML0+Fmc"}],
      "swift_keys": [],
      [...]

Would you have any idea why the SWIFT user is not in the swift_keys section? Thanks... ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Freezes on VM's after upgrade from Giant to Hammer, app is not responding
Hi Cephers, On Sunday evening we upgraded Ceph from 0.87 to 0.94. After the upgrade, VMs running on Proxmox freeze for 3-4 s roughly every 10 minutes (applications stop responding on Windows). Before the upgrade everything was working fine. In /proc/diskstats, field 7 (time spent reading (ms)) and field 11 (time spent writing (ms)) show peaks from 90 ms to 2000 ms on the OSD disks. Were any default settings changed? Every server has one SSD for journal and 4-6 OSDs. BBUs are OK on the controllers. Our ceph.conf below.

[global]
fsid = {UUID}
mon initial members = ceph35, ceph30, ceph20, ceph15, ceph10
mon host = 10.20.8.35, 10.20.8.30, 10.20.8.20, 10.20.8.15, 10.20.8.10
public network = 10.20.8.0/22
cluster network = 10.20.4.0/22
filestore xattr use omap = true
filestore max sync interval = 30
osd journal size = 10240
osd mount options xfs = rw,noatime,inode64,allocsize=4M
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 2048
osd pool default pgp num = 2048
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
osd crush chooseleaf type = 1
osd recovery max active = 1
osd recovery op priority = 1
osd max backfills = 1
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
rbd default format = 2

## ceph35 osds
[osd.0]
cluster addr = 10.20.4.35
[osd.1]
cluster addr = 10.20.4.35
[osd.2]
cluster addr = 10.20.4.35
[osd.3]
cluster addr = 10.20.4.35
[osd.4]
cluster addr = 10.20.4.35
[osd.5]
cluster addr = 10.20.4.35

## ceph25 osds
[osd.6]
cluster addr = 10.20.4.25
[osd.7]
cluster addr = 10.20.4.25
[osd.8]
cluster addr = 10.20.4.25
[osd.9]
cluster addr = 10.20.4.25
[osd.10]
cluster addr = 10.20.4.25
[osd.11]
cluster addr = 10.20.4.25

## ceph15 osds
[osd.12]
cluster addr = 10.20.4.15
[osd.13]
cluster addr = 10.20.4.15
[osd.14]
cluster addr = 10.20.4.15
[osd.15]
cluster addr = 10.20.4.15

## ceph30 osds
[osd.16]
cluster addr = 10.20.4.30
[osd.17]
cluster addr = 10.20.4.30
[osd.18]
cluster addr = 10.20.4.30
[osd.19]
cluster addr = 10.20.4.30
[osd.20]
cluster addr = 10.20.4.30
[osd.21]
cluster addr = 10.20.4.30

## ceph20 osds
[osd.22]
cluster addr = 10.20.4.20
[osd.23]
cluster addr = 10.20.4.20
[osd.24]
cluster addr = 10.20.4.20
[osd.25]
cluster addr = 10.20.4.20
[osd.26]
cluster addr = 10.20.4.20
[osd.27]
cluster addr = 10.20.4.20

## ceph10 osd
[osd.28]
cluster addr = 10.20.4.10
[osd.29]
cluster addr = 10.20.4.10
[osd.30]
cluster addr = 10.20.4.10
[osd.31]
cluster addr = 10.20.4.10

# monitor addresses
[mon.ceph35]
host = ceph35
mon addr = 10.20.8.35:6789
[mon.ceph30]
host = ceph30
mon addr = 10.20.8.30:6789
[mon.ceph20]
host = ceph20
mon addr = 10.20.8.20:6789
[mon.ceph15]
host = ceph15
mon addr = 10.20.8.15:6789
[mon.ceph10]
host = ceph10
mon addr = 10.20.8.10:6789

Thanks for help. Regards Mateusz ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
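One way to check whether a default changed between Giant and Hammer is to dump the running configuration over the admin socket on an upgraded node and compare it against a node (or an archived dump) from the old version; a sketch, assuming the default admin socket paths:

    ceph daemon osd.0 config show > /tmp/osd0-hammer.txt
    ceph daemon osd.0 config get osd_disk_thread_ioprio_class   # spot-check a single option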
[ceph-users] Ceph references
Hi community, Do you know if there is a page listing all the official Ceph clusters deployed, with number of nodes, volumetry, and protocol (block / file / object)? If not, would you agree to creating such a list on the Ceph site? Thanks Sent from my iPhone ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Error create subuser
It's not really a problem: the swift user johndoe works as long as it has a record in swift_keys. The s3 secret key of the johndoe user is here:

    "keys": [
      { "user": "johndoe",
        "access_key": "91KC4JI5BRO39A22JY9I",
        "secret_key": "Z5kLaBtg870xBhYtb4RKY82qGsbiqRpGs\/KQUXKF"},

I tested swift and s3 access with the same configuration as yours and it works for me. Mikaël On 01/07/2015 15:09, Jimmy Goffaux wrote: { "user": "johndoe:swift", "access_key": "UFSCBO5JXROB8641XF52", "secret_key": ""}], ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Error create subuser
hi, Thank you for your reply, but I just regenerated the user completely and I can confirm that I have a problem :(

    radosgw-admin user create --uid=johndoe --display-name="John Doe" --email=m...@email.com
      "subusers": [],
      "keys": [
        { "user": "johndoe",
          "access_key": "91KC4JI5BRO39A22JY9I",
          "secret_key": "Z5kLaBtg870xBhYtb4RKY82qGsbiqRpGs\/KQUXKF"}],
      "swift_keys": []

    radosgw-admin key create --subuser=johndoe:swift --key-type=swift --gen-secret
      "subusers": [],
      "keys": [
        { "user": "johndoe",
          "access_key": "91KC4JI5BRO39A22JY9I",
          "secret_key": "Z5kLaBtg870xBhYtb4RKY82qGsbiqRpGs\/KQUXKF"}],
      "swift_keys": [
        { "user": "johndoe:swift",
          "secret_key": "04ZuTaKP8Eq8WBW9fMZJItzkeSOpc9jJkAdSe4pO"}],

    radosgw-admin subuser create --uid=johndoe --subuser=johndoe:swift --access=full
      "subusers": [
        { "id": "johndoe:swift",
          "permissions": "full-control"}],
      "keys": [
        { "user": "johndoe",
          "access_key": "91KC4JI5BRO39A22JY9I",
          "secret_key": "Z5kLaBtg870xBhYtb4RKY82qGsbiqRpGs\/KQUXKF"},
        { "user": "johndoe:swift",
          "access_key": "UFSCBO5JXROB8641XF52",
          "secret_key": ""}],
      "swift_keys": [
        { "user": "johndoe:swift",
          "secret_key": "04ZuTaKP8Eq8WBW9fMZJItzkeSOpc9jJkAdSe4pO"}],

My subuser has the permissions; what I don't understand is why in keys I end up with:

    { "user": "johndoe:swift",
      "access_key": "UFSCBO5JXROB8641XF52",
      "secret_key": ""}],

Thanks On Wed, 01 Jul 2015 15:03:33 +0200, Mikaël Guichard wrote: Hi, I think it's because the secret key for the swift subuser is not generated: radosgw-admin key create --subuser=johndoe:swift --key-type=swift --gen-secret Mikaël On 01/07/2015 14:50, Jimmy Goffaux wrote: radosgw-agent = 1.2.1, trusty Ubuntu 14.04 Hello, According to the documentation here: http://ceph.com/docs/master/radosgw/admin/ I followed the documentation to the letter and the result is totally different: root@ih-prd-rgw01:~# radosgw-admin user create --uid=johndoe --display-name="John Doe" --email=m...@email.com { "user_id": "johndoe", [...] "subusers": [], "keys": [ { "user": "johndoe", "access_key": "SO4FYX3VXA8TO9D9AAM9", "secret_key": "tnOIYOuPztmWcnYfP5fGfPAWb5+thPqqdML0+Fmc"}], "swift_keys": [], [...] root@ih-prd-rgw01:~# radosgw-admin subuser create --uid=johndoe --subuser=johndoe:swift --access=full { "user_id": "johndoe", [...] "subusers": [ { "id": "johndoe:swift", "permissions": "full-control"}], "keys": [ { "user": "johndoe:swift", "access_key": "6ENTC5V4OD15A3UO9B11", "secret_key": ""}, { "user": "johndoe", "access_key": "SO4FYX3VXA8TO9D9AAM9", "secret_key": "tnOIYOuPztmWcnYfP5fGfPAWb5+thPqqdML0+Fmc"}], "swift_keys": [], [...] Would you have any idea why the SWIFT user is not in the swift_keys section? Thanks ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Jimmy Goffaux ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
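For reference, the sequence from this thread that reliably yields a working Swift subuser is to create the subuser first and generate its Swift secret second (the empty secret_key entry under keys can apparently be ignored; whether subuser create itself accepts --gen-secret may vary by release):

    radosgw-admin subuser create --uid=johndoe --subuser=johndoe:swift --access=full
    radosgw-admin key create --subuser=johndoe:swift --key-type=swift --gen-secret
    radosgw-admin user info --uid=johndoe   # verify swift_keys now contains a secret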
Re: [ceph-users] Rados gateway / RBD access restrictions
ok, I think I found the answer to the second question: http://wiki.ceph.com/Planning/Blueprints/Giant/Add_QoS_capacity_to_librbd ..librbd doesn't support any QoS for now.. Can anyone shed some light on the namespaces and on limiting S3 users to one bucket? J On 07/01/2015 10:31 AM, Jacek Jarosiewicz wrote: Hi, I've been playing around with the rados gateway and RBD and have some questions about user access restrictions. I'd like to be able to set up a cluster that would be shared among different clients without any conflicts... Is there a way to limit S3/Swift clients to writing data only to one bucket? Currently S3 users can create their own buckets, as many as they want - it would be good to have some kind of control over what a user can and can't do. I found this thread about namespaces: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-August/033451.html ..but it's old and I was wondering if maybe the namespace feature is better documented somewhere now? Another problem: is there a way to limit RBD clients to a certain bandwidth and/or iops they can use, so that one client can't disrupt another client's VMs, for example? Cheers, J -- Jacek Jarosiewicz IT Systems Administrator SUPERMEDIA Sp. z o.o., Warsaw ul. Senatorska 13/15, 00-075 Warszawa District Court for the Capital City of Warsaw, 12th Commercial Division of the National Court Register, KRS 029537; share capital 42,756,000 PLN; NIP: 957-05-49-503 Mailing address: ul. Jubilerska 10, 04-190 Warszawa SUPERMEDIA - http://www.supermedia.pl internet access - hosting - colocation - links - telephony ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Error create subuser
Yes, it works too... It's just that I didn't expect a johndoe:swift element in keys. Thank you for the answers. On Wed, 01 Jul 2015 15:28:15 +0200, Mikaël Guichard wrote: It's not really a problem: the swift user johndoe works as long as it has a record in swift_keys. The s3 secret key of the johndoe user is here: "keys": [ { "user": "johndoe", "access_key": "91KC4JI5BRO39A22JY9I", "secret_key": "Z5kLaBtg870xBhYtb4RKY82qGsbiqRpGs\/KQUXKF"}, I tested swift and s3 access with the same configuration as yours and it works for me. Mikaël On 01/07/2015 15:09, Jimmy Goffaux wrote: { "user": "johndoe:swift", "access_key": "UFSCBO5JXROB8641XF52", "secret_key": ""}], ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Jimmy Goffaux ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Perfomance issue.
On Tue, 16 Jun 2015 10:04:26 +0200, Marcus Forness pixel...@gmail.com wrote: hi! Is anyone able to provide some tips on a performance issue on a newly installed all-flash ceph cluster? When we do write tests we get 900 MB/s write, but read tests are only 200 MB/s. All servers are on 10Gbit connections.

[global]
fsid = 453d2db9-c764-4921-8f3c-ee0f75412e19
mon_initial_members = ceph02, ceph03, ceph04
mon_host = 10.129.23.202,10.129.23.203,10.129.23.204
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.129.0.0/16

This is the ceph conf file. Did you test the local filesystem performance of your servers? -- Emmanuel Florac | Direction technique | Intellique | eflo...@intellique.com | +33 1 78 94 84 02 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
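A quick way to answer that last question is to benchmark one of the SSDs directly, bypassing Ceph; a read-only sketch (replace /dev/sdX with an actual OSD device):

    fio --name=raw-read --filename=/dev/sdX --readonly --ioengine=libaio \
        --rw=randread --bs=4k --iodepth=32 --direct=1 --runtime=30 --time_based

If the raw devices read fast but the cluster doesn't, the bottleneck is in the Ceph or network layer rather than the disks.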
[ceph-users] Ceph erasure code benchmark failing
Hi, I am new to the ceph project. I am trying to benchmark erasure coding on Intel and I am getting the following error.

    [root@nitin ceph]# CEPH_ERASURE_CODE_BENCHMARK=src/ceph_erasure_code_benchmark PLUGIN_DIRECTORY=src/.libs qa/workunits/erasure-code/bench.sh
    seconds KB plugin k m work. iter. size eras. command. serie encode_vandermonde_isa
    load dlopen(src/.libs/libec_isa.so): src/.libs/libec_isa.so: cannot open shared object file: No such file or directory

I have checked out the master branch and compiled ceph with the following steps: ./autogen.sh ; ./configure --with-debug --without-tcmalloc --without-fuse; make Am I missing something here? Thanks in advance Nitin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph erasure code benchmark failing
Hi Nitin, Have you installed the YASM compiler? David On 07/01/2015 01:46 PM, Nitin Saxena wrote: Hi, I am new to the ceph project. I am trying to benchmark erasure coding on Intel and I am getting the following error. [root@nitin ceph]# CEPH_ERASURE_CODE_BENCHMARK=src/ceph_erasure_code_benchmark PLUGIN_DIRECTORY=src/.libs qa/workunits/erasure-code/bench.sh seconds KB plugin k m work. iter. size eras. command. serie encode_vandermonde_isa load dlopen(src/.libs/libec_isa.so): src/.libs/libec_isa.so: cannot open shared object file: No such file or directory I have checked out the master branch and compiled ceph with the following steps: ./autogen.sh ; ./configure --with-debug --without-tcmalloc --without-fuse; make Am I missing something here? Thanks in advance Nitin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Regards, David Casier Direct line: 06 65 19 66 84 Email: david.cas...@aevoo.fr ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Very low 4k randread performance ~1000iops
On 06/30/2015 10:42 PM, Tuomas Juntunen wrote: Hi For seq reads here's the latencies: lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.02%, 100=0.03% lat (usec) : 250=1.02%, 500=87.09%, 750=7.47%, 1000=1.50% lat (msec) : 2=0.76%, 4=1.72%, 10=0.19%, 20=0.19% Random reads: lat (usec) : 10=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.03%, 50=0.55% lat (msec) : 100=99.31%, 250=0.08% 100msecs seems a lot to me. It is, but what's more interesting imho is that it's so consistent. You don't have some ops completing fast and other ones completing slowly holding everything up. It's like the OSDs are simply overloaded with concurrent IOs and everything is waiting. Maybe I'm confused, are your OSDs on SSDs? Are there spinning disks involved? If so, what model(s)? You might want to use collectl -sD -oT on one of the OSD nodes during the test and see what the IO to the disk looks like during random reads and the especially with the svctime for the disks is like. Mark Br,T -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: 30. kesäkuuta 2015 22:01 To: Tuomas Juntunen; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops Seems reasonable. What's the latency distribution look like in your fio output file? Would be useful to know if it's universally slow or if some ops are taking much longer to complete than others. Mark On 06/30/2015 01:27 PM, Tuomas Juntunen wrote: I created a file which has the following parameters [random-read] rw=randread size=128m directory=/root/asd ioengine=libaio bs=4k #numjobs=8 iodepth=64 Br,T -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 30. kesäkuuta 2015 20:55 To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops Hi Tuomos, Can you paste the command you ran to do the test? Thanks, Mark On 06/30/2015 12:18 PM, Tuomas Juntunen wrote: Hi It’s not probably hitting the disks, but that really doesn’t matter. The point is we have very responsive VM’s while writing and that is what the users will see. The iops we get with sequential read is good, but the random read is way too low. Is using SSD’s as OSD’s the only way to get it up? or is there some tunable which would enhance it? I would assume Linux caches reads in memory and serves them from there, but atleast now we don’t see it. Br, Tuomas *From:*Somnath Roy [mailto:somnath@sandisk.com] *Sent:* 30. kesäkuuta 2015 19:24 *To:* Tuomas Juntunen; 'ceph-users' *Subject:* RE: [ceph-users] Very low 4k randread performance ~1000iops Break it down, try fio-rbd to see what is the performance you getting.. But, I am really surprised you are getting 100k iops for write, did you check it is hitting the disks ? Thanks Regards Somnath *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Tuomas Juntunen *Sent:* Tuesday, June 30, 2015 8:33 AM *To:* 'ceph-users' *Subject:* [ceph-users] Very low 4k randread performance ~1000iops Hi I have been trying to figure out why our 4k random reads in VM’s are so bad. I am using fio to test this. Write : 170k iops Random write : 109k iops Read : 64k iops Random read : 1k iops Our setup is: 3 nodes with 36 OSDs, 18 SSD’s one SSD for two OSD’s, each node has 64gb mem 2x6core cpu’s 4 monitors running on other servers 40gbit infiniband with IPoIB Openstack : Qemu-kvm for virtuals Any help would be appreciated Thank you in advance. 
Br, Tuomas ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rados gateway / RBD access restrictions
On Wed, Jul 1, 2015 at 3:10 PM, Jacek Jarosiewicz jjarosiew...@supermedia.pl wrote: ok, I think I found the answer to the second question: http://wiki.ceph.com/Planning/Blueprints/Giant/Add_QoS_capacity_to_librbd ..librbd doesn't support any QoS for now.. But libvirt/qemu can do QoS: see iotune in https://libvirt.org/formatdomain.html Cheers, Dan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
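For example, in a libvirt domain definition a Ceph-backed disk can be capped roughly like this (element names per the formatdomain docs; the numbers and the elided disk source are placeholders):

    <disk type='network' device='disk'>
      ...
      <iotune>
        <total_bytes_sec>104857600</total_bytes_sec>
        <total_iops_sec>1000</total_iops_sec>
      </iotune>
    </disk>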
[ceph-users] xattrs vs omap
Hello all, I've got a coworker who put filestore_xattr_use_omap = true in the ceph.conf when we first started building the cluster. Now he can't remember why. He thinks it may be a holdover from our first Ceph cluster (running dumpling on ext4, iirc). In the newly built cluster, we are using XFS with 2048 byte inodes, running Ceph 0.94.2. It currently has production data in it. From my reading of other threads, it looks like this is probably not something you want set to true (at least on XFS), due to performance implications. Is this something you can change on a running cluster? Is it worth the hassle? Thanks, Adam ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph erasure code benchmark failing
Hi, Like David said: the most probable cause is that there is no recent yasm installed. You can run ./install-deps.sh to ensure the necessary dependencies are installed. Cheers On 01/07/2015 13:46, Nitin Saxena wrote: Hi, I am new to the ceph project. I am trying to benchmark erasure coding on Intel and I am getting the following error. [root@nitin ceph]# CEPH_ERASURE_CODE_BENCHMARK=src/ceph_erasure_code_benchmark PLUGIN_DIRECTORY=src/.libs qa/workunits/erasure-code/bench.sh seconds KB plugin k m work. iter. size eras. command. serie encode_vandermonde_isa load dlopen(src/.libs/libec_isa.so): src/.libs/libec_isa.so: cannot open shared object file: No such file or directory I have checked out the master branch and compiled ceph with the following steps: ./autogen.sh ; ./configure --with-debug --without-tcmalloc --without-fuse; make Am I missing something here? Thanks in advance Nitin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Loïc Dachary, Artisan Logiciel Libre ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
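A quick way to verify that guess before rebuilding (assuming a git checkout of ceph):

    yasm --version || echo "yasm not installed"
    ./install-deps.sh
    ./autogen.sh && ./configure --with-debug && make

The ISA plugin's assembly is only built when yasm is found at configure time, so a rebuild after installing it should produce src/.libs/libec_isa.so.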
Re: [ceph-users] Node reboot -- OSDs not logging off from cluster
On Tue, Jun 30, 2015 at 10:36 AM, Daniel Schneller daniel.schnel...@centerdevice.com wrote: Hi! We are seeing a strange - and problematic - behavior in our 0.94.1 cluster on Ubuntu 14.04.1. We have 5 nodes, 4 OSDs each. When rebooting one of the nodes (e.g. for a kernel upgrade) the OSDs do not seem to shut down correctly. Clients hang and ceph osd tree shows the OSDs of that node still up. Repeated runs of ceph osd tree show them going down after a while. For instance, here osd.7 is still up, even though the machine is in the middle of the reboot cycle.

[C|root@control01] ~ ➜ ceph osd tree
# id    weight  type name               up/down reweight
-1      36.2    root default
-2      7.24            host node01
0       1.81                    osd.0   up      1
5       1.81                    osd.5   up      1
10      1.81                    osd.10  up      1
15      1.81                    osd.15  up      1
-3      7.24            host node02
1       1.81                    osd.1   up      1
6       1.81                    osd.6   up      1
11      1.81                    osd.11  up      1
16      1.81                    osd.16  up      1
-4      7.24            host node03
2       1.81                    osd.2   down    1
7       1.81                    osd.7   up      1
12      1.81                    osd.12  down    1
17      1.81                    osd.17  down    1
-5      7.24            host node04
3       1.81                    osd.3   up      1
8       1.81                    osd.8   up      1
13      1.81                    osd.13  up      1
18      1.81                    osd.18  up      1
-6      7.24            host node05
4       1.81                    osd.4   up      1
9       1.81                    osd.9   up      1
14      1.81                    osd.14  up      1
19      1.81                    osd.19  up      1

So it seems the services are either not shut down correctly when the reboot begins, or they do not get enough time to actually let the cluster know they are going away. If I stop the OSDs on that node manually before the reboot, everything works as expected and clients don't notice any interruptions.

[C|root@node03] ~ ➜ service ceph-osd stop id=2
ceph-osd stop/waiting
[C|root@node03] ~ ➜ service ceph-osd stop id=7
ceph-osd stop/waiting
[C|root@node03] ~ ➜ service ceph-osd stop id=12
ceph-osd stop/waiting
[C|root@node03] ~ ➜ service ceph-osd stop id=17
ceph-osd stop/waiting
[C|root@node03] ~ ➜ reboot

The upstart file was not changed from the packaged version. Interestingly, the same Ceph version on a different cluster does _not_ show this behaviour. Any ideas as to what is causing this or how to diagnose this? I'm not sure why it would be happening, but: * The OSDs send out shutdown messages to the monitor indicating they're going away whenever they get shut down politely. There's a short timeout to make sure they don't hang on you. * The only way the OSD doesn't get marked down during reboot is if the monitor doesn't get this message. * If the monitor isn't getting the message, the OSD either isn't sending the message or it's getting blocked. My guess is that for some reason the OSDs are getting the shutdown signal after the networking goes away. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
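Independent of the root cause, planned reboots are usually wrapped in the noout flag so the affected OSDs are not marked out and no backfill starts while the node is away (standard practice, not specific to this issue):

    ceph osd set noout
    # stop the OSDs / reboot the node, then once its OSDs are back up:
    ceph osd unset noout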
Re: [ceph-users] xattrs vs omap
It doesn't matter; I think filestore_xattr_use_omap is a no-op and not used in Hammer. Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adam Tygart Sent: Wednesday, July 01, 2015 8:20 AM To: Ceph Users Subject: [ceph-users] xattrs vs omap Hello all, I've got a coworker who put filestore_xattr_use_omap = true in the ceph.conf when we first started building the cluster. Now he can't remember why. He thinks it may be a holdover from our first Ceph cluster (running dumpling on ext4, iirc). In the newly built cluster, we are using XFS with 2048-byte inodes, running Ceph 0.94.2. It currently has production data in it. From my reading of other threads, it looks like this is probably not something you want set to true (at least on XFS), due to performance implications. Is this something you can change on a running cluster? Is it worth the hassle? Thanks, Adam ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
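To see what a running OSD actually uses for this option, it can be queried over the admin socket; a sketch assuming osd.0 runs locally:

    ceph daemon osd.0 config get filestore_xattr_use_omap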
Re: [ceph-users] Very low 4k randread performance ~1000iops
Hi I'll check the possibility on testing EnhanceIO. I'll report back on this. Thanks Br,T -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: 1. heinäkuuta 2015 21:51 To: Tuomas Juntunen; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops On 07/01/2015 01:39 PM, Tuomas Juntunen wrote: Thanks Mark Are there any plans for ZFS like L2ARC to CEPH or is the cache tiering what should work like this in the future? I have tested cache tier + EC pool, and that created too much load on our servers, so it was not viable to be used. We are doing a lot of work in this space right now. Hopefully we'll see improvements coming in the coming releases. I was also wondering if EnhanceIO would be a good solution for getting more random iops. I've read some Sébastien's writings. Possibly! Try it and let us know. ;) Br, Tuomas -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: 1. heinäkuuta 2015 20:29 To: Tuomas Juntunen; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops On 07/01/2015 12:13 PM, Tuomas Juntunen wrote: Hi Yes, the OSD's are on spinning disks and we have 18 SSD's for journal, one SSD for two OSD's The OSD's are: Model Family: Seagate Barracuda 7200.14 (AF) Device Model: ST2000DM001-1CH164 What I've understood the journals are not used as read cache at all, just for writing. Would SSD based cache pool be viable solution here? Ok, so that makes more sense. The performance is still lower than expected but maybe 3-4x rather than several orders of magnitude. My guess is that cache tiering in it's current form probably won't help you much unless you have a workload that fits mostly into the cache. The promotion penalty is really high though so we likely will have to promote much more slowly than we currently do. Mark Br, T -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: 1. heinäkuuta 2015 13:58 To: Tuomas Juntunen; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops On 06/30/2015 10:42 PM, Tuomas Juntunen wrote: Hi For seq reads here's the latencies: lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.02%, 100=0.03% lat (usec) : 250=1.02%, 500=87.09%, 750=7.47%, 1000=1.50% lat (msec) : 2=0.76%, 4=1.72%, 10=0.19%, 20=0.19% Random reads: lat (usec) : 10=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.03%, 50=0.55% lat (msec) : 100=99.31%, 250=0.08% 100msecs seems a lot to me. It is, but what's more interesting imho is that it's so consistent. You don't have some ops completing fast and other ones completing slowly holding everything up. It's like the OSDs are simply overloaded with concurrent IOs and everything is waiting. Maybe I'm confused, are your OSDs on SSDs? Are there spinning disks involved? If so, what model(s)? You might want to use collectl -sD -oT on one of the OSD nodes during the test and see what the IO to the disk looks like during random reads and the especially with the svctime for the disks is like. Mark Br,T -Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: 30. kesäkuuta 2015 22:01 To: Tuomas Juntunen; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops Seems reasonable. What's the latency distribution look like in your fio output file? Would be useful to know if it's universally slow or if some ops are taking much longer to complete than others. 
Mark On 06/30/2015 01:27 PM, Tuomas Juntunen wrote: I created a file which has the following parameters:

[random-read]
rw=randread
size=128m
directory=/root/asd
ioengine=libaio
bs=4k
#numjobs=8
iodepth=64

Br,T -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 30. kesäkuuta 2015 20:55 To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops Hi Tuomos, Can you paste the command you ran to do the test? Thanks, Mark On 06/30/2015 12:18 PM, Tuomas Juntunen wrote: Hi It's probably not hitting the disks, but that really doesn't matter. The point is we have very responsive VMs while writing and that is what the users will see. The iops we get with sequential read is good, but the random read is way too low. Is using SSDs as OSDs the only way to get it up? Or is there some tunable which would enhance it? I would assume Linux caches reads in memory and serves them from there, but at least now we don't see it. Br, Tuomas From: Somnath Roy [mailto:somnath@sandisk.com] Sent: 30. kesäkuuta 2015 19:24 To: Tuomas Juntunen; 'ceph-users' Subject: RE: [ceph-users] Very low 4k randread performance ~1000iops Break it down,
[ceph-users] any recommendation of using EnhanceIO?
Hi cephers, Has anyone out there implemented EnhanceIO in a production environment? Any recommendations? Any perf output to share showing the difference between using it and not? Thanks in advance, German ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Redhat Storage Ceph Storage 1.3 released
On 07/01/2015 03:02 PM, Vickey Singh wrote: - What's the exact version of open-source Ceph provided with this product? It is Hammer, specifically 0.94.1 with several critical bugfixes on top as the product went through QE. All of the bugfixes have been proposed or merged to Hammer upstream, IIRC, so the product has many of the serious bug fixes that were in 0.94.2 or the upcoming 0.94.3. - Will the RHCS 1.3 features mentioned in the blog all be present in open-source Ceph? Yep! That blog post describes many of the changes from Firefly to Hammer. - Ken ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] One of our nodes has logs saying: wrongly marked me down
This can happen if your OSDs are flapping.. Hope your network is stable. Thanks Regards Somnath From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tuomas Juntunen Sent: Wednesday, July 01, 2015 2:24 PM To: 'ceph-users' Subject: [ceph-users] One of our nodes has logs saying: wrongly marked me down Hi One of our nodes has OSD logs that say "wrongly marked me down" for every OSD at some point. What could be the reason for this? Anyone have any similar experiences? The other nodes work totally fine and they are all identical. Br,T ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
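A hedged starting point for chasing this down: find out which peers reported the OSDs down, and if the network is healthy but briefly saturated, the heartbeat grace can be raised at runtime (the default is 20 s; injectargs changes do not persist across restarts):

    grep -i "wrongly marked me down" /var/log/ceph/ceph-osd.*.log
    ceph tell osd.\* injectargs '--osd_heartbeat_grace 30'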
Re: [ceph-users] any recommendation of using EnhanceIO?
Hi, I asked the same question a week or so ago (just search the mailing list archives for EnhanceIO :) and got some interesting answers. It looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links regarding EnhanceIO. I'm keen to try flashcache or bcache (bcache has been in the mainline kernel for some time). Dominik On Wed, Jul 1, 2015 at 9:13 PM, German Anders gand...@despegar.com wrote: Hi cephers, Has anyone out there implemented EnhanceIO in a production environment? Any recommendations? Any perf output to share showing the difference between using it and not? Thanks in advance, German ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
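For anyone wanting to try bcache under an OSD, a minimal sketch (assumes bcache-tools, a blank backing disk /dev/sdb and a cache SSD partition /dev/nvme0n1p1; this wipes both devices):

    make-bcache -B /dev/sdb -C /dev/nvme0n1p1
    # once udev creates /dev/bcache0, build the OSD filesystem on it
    mkfs.xfs /dev/bcache0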
Re: [ceph-users] Node reboot -- OSDs not logging off from cluster
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: 01 July 2015 16:56 To: Daniel Schneller Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Node reboot -- OSDs not logging off from cluster On Tue, Jun 30, 2015 at 10:36 AM, Daniel Schneller daniel.schnel...@centerdevice.com wrote: Hi! We are seeing a strange - and problematic - behavior in our 0.94.1 cluster on Ubuntu 14.04.1. We have 5 nodes, 4 OSDs each. When rebooting one of the nodes (e.g. for a kernel upgrade) the OSDs do not seem to shut down correctly. Clients hang and ceph osd tree shows the OSDs of that node still up. Repeated runs of ceph osd tree show them going down after a while. For instance, here osd.7 is still up, even though the machine is in the middle of the reboot cycle.

[C|root@control01] ~ ➜ ceph osd tree
# id    weight  type name               up/down reweight
-1      36.2    root default
-2      7.24            host node01
0       1.81                    osd.0   up      1
5       1.81                    osd.5   up      1
10      1.81                    osd.10  up      1
15      1.81                    osd.15  up      1
-3      7.24            host node02
1       1.81                    osd.1   up      1
6       1.81                    osd.6   up      1
11      1.81                    osd.11  up      1
16      1.81                    osd.16  up      1
-4      7.24            host node03
2       1.81                    osd.2   down    1
7       1.81                    osd.7   up      1
12      1.81                    osd.12  down    1
17      1.81                    osd.17  down    1
-5      7.24            host node04
3       1.81                    osd.3   up      1
8       1.81                    osd.8   up      1
13      1.81                    osd.13  up      1
18      1.81                    osd.18  up      1
-6      7.24            host node05
4       1.81                    osd.4   up      1
9       1.81                    osd.9   up      1
14      1.81                    osd.14  up      1
19      1.81                    osd.19  up      1

So it seems the services are either not shut down correctly when the reboot begins, or they do not get enough time to actually let the cluster know they are going away. If I stop the OSDs on that node manually before the reboot, everything works as expected and clients don't notice any interruptions.

[C|root@node03] ~ ➜ service ceph-osd stop id=2
ceph-osd stop/waiting
[C|root@node03] ~ ➜ service ceph-osd stop id=7
ceph-osd stop/waiting
[C|root@node03] ~ ➜ service ceph-osd stop id=12
ceph-osd stop/waiting
[C|root@node03] ~ ➜ service ceph-osd stop id=17
ceph-osd stop/waiting
[C|root@node03] ~ ➜ reboot

The upstart file was not changed from the packaged version. Interestingly, the same Ceph version on a different cluster does _not_ show this behaviour. Any ideas as to what is causing this or how to diagnose this? Do you have the OSDs running on the same boxes as the monitors? I'm not sure why it would be happening, but: * The OSDs send out shutdown messages to the monitor indicating they're going away whenever they get shut down politely. There's a short timeout to make sure they don't hang on you. * The only way the OSD doesn't get marked down during reboot is if the monitor doesn't get this message. * If the monitor isn't getting the message, the OSD either isn't sending the message or it's getting blocked. My guess is that for some reason the OSDs are getting the shutdown signal after the networking goes away. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Redhat Storage Ceph Storage 1.3 released
Hello Ceph lovers, You may have noticed that RedHat recently released RedHat Ceph Storage 1.3 http://redhatstorage.redhat.com/2015/06/25/announcing-red-hat-ceph-storage-1-3/ My questions are: - What's the exact version of open-source Ceph provided with this product? - Will the RHCS 1.3 features mentioned in the blog all be present in open-source Ceph? Regards Vickey ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] One of our nodes has logs saying: wrongly marked me down
Hi One of our nodes has OSD logs that say "wrongly marked me down" for every OSD at some point. What could be the reason for this? Anyone have any similar experiences? The other nodes work totally fine and they are all identical. Br,T ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Redhat Storage Ceph Storage 1.3 released
Hi, The details of the differences between the Hammer point releases and RedHat Ceph Storage 1.3 can be listed as described at http://www.spinics.net/lists/ceph-devel/msg24489.html (a reconciliation between hammer and v0.94.1.2). The same analysis should be done for https://github.com/ceph/ceph/releases/tag/v0.94.1.3 which presumably matches RedHat Ceph Storage 1.3. Cheers On 01/07/2015 23:02, Vickey Singh wrote: Hello Ceph lovers, You may have noticed that RedHat recently released RedHat Ceph Storage 1.3 http://redhatstorage.redhat.com/2015/06/25/announcing-red-hat-ceph-storage-1-3/ My questions are: - What's the exact version of open-source Ceph provided with this product? - Will the RHCS 1.3 features mentioned in the blog all be present in open-source Ceph? Regards Vickey ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Loïc Dachary, Artisan Logiciel Libre ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph Journal Disk Size
I would like to get some clarification on the size of the journal disks that I should get for the new Ceph cluster I am planning. I read about the journal settings at http://ceph.com/docs/master/rados/configuration/osd-config-ref/#journal-settings but that didn't really clarify it for me, or I just didn't get it. The Learning Ceph book (Packt) states that you should have one journal disk for every 4 OSDs. Using that as a reference, I was planning on getting multiple systems with 8 x 6TB nearline SAS drives for OSDs, two SSDs for journaling per host, 2 hot spares for the 6TB drives, and 2 drives for the OS. I was thinking of 400GB SSD drives but am wondering if that is too much. Any informed opinions would be appreciated. Thanks, Nate Curry ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
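For scale, the documented sizing rule is osd journal size = 2 * (expected throughput * filestore max sync interval). A worked example, assuming ~150 MB/s per spinning OSD and the default 5 s sync interval:

    journal size = 2 * 150 MB/s * 5 s = 1500 MB per OSD

So four OSDs sharing one SSD need on the order of 6-8 GB of journal space; a 400GB SSD is far more capacity than required, and its sustained write throughput and endurance matter much more than its size.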
[ceph-users] Mon performance impact on OSDs?
I've been wrestling with IO performance in my cluster, and one area I have not yet explored thoroughly is whether performance constraints on mon hosts are likely to have any impact on OSDs. My mons are quite small, and one in particular has rather high IO waits (frequently 30% or more) due to the other work it performs, notably hosting postgres for Openstack, which is quite chatty for some reason. Is this likely to trickle down into OSD performance? Everything I've seen online indicates that the performance of MONs and OSDs should be decoupled, but I'd like to hear some real-world experiences. Thanks! QH ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Very low 4k randread performance ~1000iops
On 07/01/2015 12:13 PM, Tuomas Juntunen wrote: Hi, Yes, the OSDs are on spinning disks and we have 18 SSDs for journal, one SSD for two OSDs. The OSDs are: Model Family: Seagate Barracuda 7200.14 (AF), Device Model: ST2000DM001-1CH164. As I've understood it, the journals are not used as read cache at all, just for writing. Would an SSD-based cache pool be a viable solution here?

Ok, so that makes more sense. The performance is still lower than expected, but maybe 3-4x rather than several orders of magnitude. My guess is that cache tiering in its current form probably won't help you much unless you have a workload that fits mostly into the cache. The promotion penalty is really high, so we likely will have to promote much more slowly than we currently do. Mark

Br, T

-Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: July 1, 2015 13:58 To: Tuomas Juntunen; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

On 06/30/2015 10:42 PM, Tuomas Juntunen wrote: Hi, For seq reads here are the latencies:

lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.02%, 100=0.03%
lat (usec) : 250=1.02%, 500=87.09%, 750=7.47%, 1000=1.50%
lat (msec) : 2=0.76%, 4=1.72%, 10=0.19%, 20=0.19%

Random reads:

lat (usec) : 10=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.03%, 50=0.55%
lat (msec) : 100=99.31%, 250=0.08%

100 msecs seems a lot to me.

It is, but what's more interesting imho is that it's so consistent. You don't have some ops completing fast and other ones completing slowly holding everything up. It's like the OSDs are simply overloaded with concurrent IOs and everything is waiting. Maybe I'm confused, are your OSDs on SSDs? Are there spinning disks involved? If so, what model(s)? You might want to use collectl -sD -oT on one of the OSD nodes during the test and see what the IO to the disk looks like during random reads, and especially what the svctime for the disks looks like. Mark

Br, T

-Original Message- From: Mark Nelson [mailto:mnel...@redhat.com] Sent: June 30, 2015 22:01 To: Tuomas Juntunen; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

Seems reasonable. What does the latency distribution look like in your fio output file? Would be useful to know if it's universally slow or if some ops are taking much longer to complete than others. Mark

On 06/30/2015 01:27 PM, Tuomas Juntunen wrote: I created a file which has the following parameters:

[random-read]
rw=randread
size=128m
directory=/root/asd
ioengine=libaio
bs=4k
#numjobs=8
iodepth=64

Br, T

-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: June 30, 2015 20:55 To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Very low 4k randread performance ~1000iops

Hi Tuomas, Can you paste the command you ran to do the test? Thanks, Mark

On 06/30/2015 12:18 PM, Tuomas Juntunen wrote: Hi, It's probably not hitting the disks, but that really doesn't matter. The point is we have very responsive VMs while writing, and that is what the users will see. The iops we get with sequential read are good, but the random read is way too low. Is using SSDs as OSDs the only way to get it up, or is there some tunable which would enhance it? I would assume Linux caches reads in memory and serves them from there, but at least for now we don't see it. Br, Tuomas

From: Somnath Roy [mailto:somnath@sandisk.com] Sent: June 30, 2015 19:24 To: Tuomas Juntunen; 'ceph-users' Subject: RE: [ceph-users] Very low 4k randread performance ~1000iops

Break it down: try fio-rbd to see what performance you are getting.. But I am really surprised you are getting 100k iops for write; did you check it is hitting the disks? Thanks & Regards Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tuomas Juntunen Sent: Tuesday, June 30, 2015 8:33 AM To: 'ceph-users' Subject: [ceph-users] Very low 4k randread performance ~1000iops

Hi, I have been trying to figure out why our 4k random reads in VMs are so bad. I am using fio to test this.

Write : 170k iops
Random write : 109k iops
Read : 64k iops
Random read : 1k iops

Our setup is: 3 nodes with 36 OSDs, 18 SSDs (one SSD for two OSDs), each node has 64 GB mem and 2x 6-core CPUs; 4 monitors running on other servers; 40 Gbit InfiniBand with IPoIB; Openstack: Qemu-kvm for virtuals. Any help would be appreciated. Thank you in advance. Br, Tuomas
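Since fio-rbd came up: for anyone who wants to isolate librbd from the guest stack, here is a hedged example job file. It assumes fio was built with rbd support, and the pool, image and client names are placeholders to replace with your own (the image must exist beforehand):

[rbd-4k-randread]
ioengine=rbd
clientname=admin        # cephx user, i.e. client.admin
pool=rbd                # placeholder pool name
rbdname=fio-test        # placeholder, pre-created test image
rw=randread
bs=4k
iodepth=64
size=1g

If this shows the same ~1k iops, the bottleneck is in the cluster itself rather than in QEMU/KVM or the guest.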
[ceph-users] file/directory invisible through ceph-fuse
Hi list, I've met a strange problem: sometimes I cannot see a file/directory created by another ceph-fuse client. It becomes visible after I touch/mkdir the same name. Any thoughts? Thanks! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Round-trip time for monitors
Hi everybody, We have 3 monitors in our ceph cluster: 2 on one local site (2 data centers a few km away from each other), and the 3rd one on a remote site, with a maximum round-trip time (RTT) of 30 ms between the local site and the remote site. All OSDs run on the local site. The reason for the remote monitor is to keep the cluster running if any one DC fails. Is that a valid configuration? What is the maximum valid RTT in such a Ceph cluster? Here are some details about our running cluster. Current monmap:

---
epoch 4
fsid ...
last_changed 2015-05-12 08:39:35.600843
created 0.00
0: IP addr local0:6789/0 mon.local0
1: IP addr local1:6789/0 mon.local1
2: IP addr remote:6789/0 mon.remote
---

In our running cluster, the mon logs show that the leader monitor is on the local site, while the other two are peons. Being curious, I increased runtime log-level debug settings for a few subsystems (ms, mon, paxos...) to see if there was some kind of heartbeat between the monitors. I noticed messages such as these ones...

--
2015-07-01 07:01:05.840845 7fd569bbe700 1 -- IP local1:6789/0 -- mon.0 IP local0:6789/0 -- mon_health( service 1 op tell e 0 r 0 ) v1 -- ?+0 0x3b9b200
2015-07-01 07:01:05.840871 7fd569bbe700 20 -- IP local1:6789/0 submit_message mon_health( service 1 op tell e 0 r 0 ) v1 remote, IP local0:6789/0, have pipe.
2015-07-01 07:01:05.840885 7fd569bbe700 1 -- IP local1:6789/0 -- mon.2 IP remote:6789/0 -- mon_health( service 1 op tell e 0 r 0 ) v1 -- ?+0 0x3b98a00
2015-07-01 07:01:05.840894 7fd569bbe700 20 -- IP local1:6789/0 submit_message mon_health( service 1 op tell e 0 r 0 ) v1 remote, IP remote:6789/0, have pipe.
--

... but none which tell me what I want: the idea was to see if anything would complain about a high RTT, and to monitor that value. Any idea on how to do it? Thank you. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Round-trip time for monitors
On 07/01/2015 09:38 AM, - - wrote: Hi everybody, We have 3 monitors in our ceph cluster: 2 on one local site (2 data centers a few km away from each other), and the 3rd one on a remote site, with a maximum round-trip time (RTT) of 30 ms between the local site and the remote site. All OSDs run on the local site. The reason for the remote monitor is to keep the cluster running if any one DC fails. Is that a valid configuration? What is the maximum valid RTT in such a Ceph cluster?

Well, I think that 30 ms is a bit high. You'll get into a clock-drift situation a lot earlier. The leading monitor uses its local time and sends out the packet, which then arrives 15 ms later at the other mon. For the monitors that is a clock drift of 15 ms at least. Also, it could be that the monitor on the remote site is elected as leader, and that will cause all your OSD traffic to go via that monitor. During big changes in the cluster it will add at least 30 ms of latency to certain requests, which slows down the cluster.

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant
Phone: +31 (0)20 700 9902
Skype: contact42on

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CDS Jewel Wed/Thurs
Hey Patrick, Looks like the GMT+8 time for the 1st day is wrong, should be 10:00 pm - 7:30 am? -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Patrick McGarry Sent: Tuesday, June 30, 2015 11:28 PM To: Ceph Devel; Ceph-User Subject: CDS Jewel Wed/Thurs Hey cephers, Just a friendly reminder that our Ceph Developer Summit for Jewel planning is set to run tomorrow and Thursday. The schedule and dial in information is available on the new wiki: http://tracker.ceph.com/projects/ceph/wiki/CDS_Jewel Please let me know if you have any questions. Thanks! -- Best Regards, Patrick McGarry Director Ceph Community || Red Hat http://ceph.com || http://community.redhat.com @scuttlemonkey || @ceph -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Round-trip time for monitors
On Wed, Jul 1, 2015 at 8:38 AM, - - francois.pe...@san-services.com wrote: Hi everybody, We have 3 monitors in our ceph cluster: 2 on one local site (2 data centers a few km away from each other), and the 3rd one on a remote site, with a maximum round-trip time (RTT) of 30 ms between the local site and the remote site. All OSDs run on the local site. The reason for the remote monitor is to keep the cluster running if any one DC fails. Is that a valid configuration? What is the maximum valid RTT in such a Ceph cluster? [...] The idea was to see if anything would complain about a high RTT, and to monitor that value. Any idea on how to do it?

I don't think there's anything that monitors RTT directly, but 30 ms shouldn't be a problem; that's an order of magnitude or more below all the various timeout thresholds. The clock skew detection might need to be loosened up, but I'm not very familiar with how that bit works, and it's not quite as crucial anyway. :) -Greg

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
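On the clock-skew point: the check itself is tunable. A hedged example; the option exists, but the value here is only an illustration, and it only loosens the health warning rather than changing any election behavior:

[mon]
# default is 0.05 s; raise it if a 30 ms RTT plus NTP jitter trips the warning
mon clock drift allowed = 0.1

For the RTT itself, absent a built-in counter, something as crude as a periodic ping from each local mon host to the remote one, graphed by your monitoring system, covers the "monitor that value" part.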
Re: [ceph-users] file/directory invisible through ceph-fuse
On Wed, Jul 1, 2015 at 9:02 AM, flisky yinjif...@lianjia.com wrote: Hi list, I've met a strange problem: sometimes I cannot see a file/directory created by another ceph-fuse client. It becomes visible after I touch/mkdir the same name. Any thoughts?

What version are you running? We've seen a few things like this with older releases, although usually it's in the kernel... -Greg

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] One of our nodes has logs saying: wrongly marked me down
I've checked the network; we use IPoIB and all nodes are connected to the same switch. There are no breaks in connectivity while this happens. My constant ping says 0.03 - 0.1 ms. I would say this is ok. This happens almost every time when deep scrubbing is running. Our load on this particular server goes to 300+ and OSDs are marked down. Any suggestions on settings? I now have the following settings that might affect this:

[global]
osd_op_threads = 6
osd_op_num_threads_per_shard = 1
osd_op_num_shards = 25
#osd_op_num_sharded_pool_threads = 25
filestore_op_threads = 6
ms_nocrc = true
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false

[osd]
osd scrub load threshold = 0.1
osd max backfills = 1
osd recovery max active = 1
osd scrub sleep = .1
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
osd scrub chunk max = 5
osd deep scrub stride = 1048576
filestore queue max ops = 1
filestore max sync interval = 30
filestore min sync interval = 29
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false

Br, T

From: Somnath Roy [mailto:somnath@sandisk.com] Sent: July 2, 2015 0:30 To: Tuomas Juntunen; 'ceph-users' Subject: RE: [ceph-users] One of our nodes has logs saying: wrongly marked me down

This can happen if your OSDs are flapping.. Hope your network is stable. Thanks & Regards Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tuomas Juntunen Sent: Wednesday, July 01, 2015 2:24 PM To: 'ceph-users' Subject: [ceph-users] One of our nodes has logs saying: wrongly marked me down

Hi, One of our nodes has OSD logs that say wrongly marked me down for every OSD at some point. What could be the reason for this? Anyone have any similar experiences? Other nodes work totally fine and they are all identical. Br, T

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] One of our nodes has logs saying: wrongly marked me down
Yeah, this can happen during deep_scrub and also during rebalancing; I forgot to mention that. Generally, it is a good idea to throttle those. For deep scrub, you can try using (got it from an old post, I never used it):

osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 1
osd_scrub_sleep = 0.1

For rebalancing I think you are already using proper values.. But I don't think this will eliminate the scenario altogether; it should alleviate it a bit. Also, why are you using so many shards? How many OSDs are you running in a box? 25 shards should be good if you are running a single OSD; if you have a lot of OSDs in a box, try to reduce it to ~5 or so. Thanks & Regards Somnath

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
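For trying those scrub throttles without restarting anything, runtime injection is one hedged option; whether a given option takes effect live varies by option and release, so verify afterwards:

# push the suggested scrub throttles to all OSDs at runtime
ceph tell osd.* injectargs '--osd_scrub_chunk_min 1 --osd_scrub_chunk_max 1 --osd_scrub_sleep 0.1'

# confirm on one daemon (run on the OSD host; id 0 is a placeholder)
ceph daemon osd.0 config show | grep scrub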
Re: [ceph-users] Ceph Journal Disk Size
I would probably go with smaller OSD disks; 4 TB is too much to lose in case of a broken disk, so maybe more OSD daemons of a smaller size, maybe 1 TB or 2 TB. A 4:1 relationship is good enough, and I also think that a 200 GB disk for the journals would be OK, so you can save some money there. The OSDs should of course be configured as JBOD; don't use any RAID under them, and use two different networks for the public and cluster net. German

2015-07-01 18:49 GMT-03:00 Nate Curry cu...@mosaicatm.com: I would like to get some clarification on the size of the journal disks that I should get for my new Ceph cluster I am planning. I read about the journal settings on http://ceph.com/docs/master/rados/configuration/osd-config-ref/#journal-settings but that didn't really clarify it for me; either that or I just didn't get it. I found that the Learning Ceph Packt book states you should have one disk for journalling for every 4 OSDs. Using that as a reference, I was planning on getting multiple systems with 8 x 6TB inline SAS drives for OSDs, with two SSDs for journalling per host, as well as 2 hot spares for the 6TB drives and 2 drives for the OS. I was thinking of 400 GB SSD drives but am wondering if that is too much. Any informed opinions would be appreciated. Thanks, Nate Curry

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
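For sizing the journal itself, the Ceph docs give a rule of thumb: osd journal size = 2 * (expected throughput * filestore max sync interval). A hedged back-of-the-envelope for the hardware described above, assuming ~150 MB/s sustained per 6 TB SAS drive and the default 5 s sync interval:

per-OSD journal = 2 * 150 MB/s * 5 s = 1.5 GB
4 OSDs per SSD -> ~6 GB of journal capacity per SSD

So even a 200 GB SSD is vastly oversized for journal capacity alone; what actually matters when picking the SSD is sustained sync-write throughput and endurance, since it absorbs every byte written to its 4 OSDs.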
Re: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu core?
Re: your previous question, I will not elaborate on this much more; I hope some of you will try it if you have NUMA systems and see for yourself. But I can recommend some docs:

http://globalsp.ts.fujitsu.com/dmsp/Publications/public/wp-ivy-bridge-ep-memory-performance-ww-en.pdf
http://events.linuxfoundation.org/sites/events/files/eeus13_shelton.pdf

RHEL also has some nice documentation on the issue. If you don't use ancient (like RHEL6) systems, then your OS+kernel should do the "right thing" by default and take NUMA locality into account when scheduling and migrating. Jan

On 01 Jul 2015, at 03:02, Ray Sun xiaoq...@gmail.com wrote: Jan, Thanks a lot. I can do my contribution to this project if I can. Best Regards -- Ray

On Tue, Jun 30, 2015 at 11:50 PM, Jan Schermer j...@schermer.cz wrote: Hi all, our script is available on GitHub: https://github.com/prozeta/pincpus I haven't had much time to do a proper README, but I hope the configuration is self-explanatory enough for now. What it does is pin each OSD into the most "empty" cgroup assigned to a NUMA node. Let me know how it works for you! Jan

On 30 Jun 2015, at 10:50, Huang Zhiteng winsto...@gmail.com wrote: On Tue, Jun 30, 2015 at 4:25 PM, Jan Schermer j...@schermer.cz wrote: Not having OSDs and KVMs compete against each other is one thing, but there are more reasons to do this:

1) not moving the processes and threads between cores that much (better cache utilization)
2) aligning the processes with memory on NUMA systems (that means all modern dual-socket systems) - you don't want your OSD running on CPU1 with memory allocated to CPU2
3) the same goes for other resources like NICs or storage controllers - but that's less important and not always practical to do
4) you can limit the scheduling domain on Linux if you limit the cpuset for your OSDs (I'm not sure how important this is, just best practice)
5) you can easily limit memory or CPU usage, or set priority, with much greater granularity than without cgroups
6) if you have HyperThreading enabled you get the most gain when the workloads on the threads are dissimilar - so to get the highest throughput you have to pin the OSD to thread1 and KVM to thread2 on the same core. We're not doing that because latency and performance of the core can vary depending on what the other thread is doing, but it might be useful to someone.

Some workloads exhibit a 100% performance gain when everything aligns in a NUMA system, compared to SMP mode on the same hardware. You likely won't notice it on light workloads, as the interconnects (QPI) are very fast and there's a lot of bandwidth, but for stuff like big OLAP databases or other data-manipulation workloads there's a huge difference. And with Ceph being CPU hungry and memory intensive, we're seeing some big gains here just by co-locating the memory with the processes...

Could you elaborate a bit on this? I'm interested to learn in what situations memory locality helps Ceph, and to what extent. Jan

On 30 Jun 2015, at 08:12, Ray Sun xiaoq...@gmail.com wrote: Sounds great, any update please let me know. Best Regards -- Ray

On Tue, Jun 30, 2015 at 1:46 AM, Jan Schermer j...@schermer.cz wrote: I promised you all our scripts for automatic cgroup assignment - they are in our production already and I just need to put them on GitHub, stay tuned tomorrow :-) Jan

On 29 Jun 2015, at 19:41, Somnath Roy somnath@sandisk.com wrote: Presently, you have to do it by using a tool like 'taskset' or 'numactl'... Thanks & Regards Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ray Sun Sent: Monday, June 29, 2015 9:19 AM To: ceph-users@lists.ceph.com Subject: [ceph-users] How to use cgroup to bind ceph-osd to a specific cpu core?

Cephers, I want to bind each of my ceph-osd processes to a specific cpu core, but I didn't find any document explaining that; could anyone provide me some detailed information? Thanks. Currently, my ceph is running like this:

root 28692 1 0 Jun23 ? 00:37:26 /usr/bin/ceph-mon -i seed.econe.com --pid-file /var/run/ceph/mon.seed.econe.com.pid -c /etc/ceph/ceph.conf --cluster ceph
root 40063 1 1 Jun23 ? 02:13:31 /usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf --cluster ceph
root 42096 1 0 Jun23 ? 01:33:42 /usr/bin/ceph-osd -i
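Until the README lands, here is a hedged manual sketch of the same idea using raw cgroup-v1 cpusets; the paths match typical RHEL/CentOS mounts, and the core/node numbers and pid are placeholders to adapt to your own topology:

# create a cpuset pinned to NUMA node 0 and move one ceph-osd into it
mkdir -p /sys/fs/cgroup/cpuset/osd0
echo 0-5 > /sys/fs/cgroup/cpuset/osd0/cpuset.cpus   # cores local to node 0
echo 0 > /sys/fs/cgroup/cpuset/osd0/cpuset.mems     # allocate memory from node 0 only
echo <osd-pid> > /sys/fs/cgroup/cpuset/osd0/tasks   # pid of the target ceph-osd

# or the numactl route Somnath mentioned, at daemon start time:
numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 0 -c /etc/ceph/ceph.conf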
Re: [ceph-users] xattrs vs omap
Hello,

On Wed, 1 Jul 2015 15:24:13 +0000 Somnath Roy wrote: It doesn't matter; I think filestore_xattr_use_omap is a 'noop' and not used in Hammer.

Then what was this functionality replaced with, esp. considering EXT4-based OSDs?

Chibi

Thanks & Regards Somnath

-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adam Tygart Sent: Wednesday, July 01, 2015 8:20 AM To: Ceph Users Subject: [ceph-users] xattrs vs omap

Hello all, I've got a coworker who put filestore_xattr_use_omap = true in the ceph.conf when we first started building the cluster. Now he can't remember why; he thinks it may be a holdover from our first Ceph cluster (running dumpling on ext4, iirc). In the newly built cluster, we are using XFS with 2048-byte inodes, running Ceph 0.94.2. It currently has production data in it. From my reading of other threads, it looks like this is probably not something you want set to true (at least on XFS), due to performance implications. Is this something you can change on a running cluster? Is it worth the hassle? Thanks, Adam

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
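As far as I can tell, the tunables that took over that job are the inline-xattr limits, which control when xattrs spill from the backing filesystem into omap. Hedged, since the defaults are per-filesystem and may differ by release; the XFS values below are illustrative:

[osd]
# max size of one xattr kept inline in the fs before spilling to omap
filestore max inline xattr size xfs = 65536
# max number of inline xattrs per object before spilling to omap
filestore max inline xattrs xfs = 10

With 2048-byte XFS inodes and defaults in this ballpark, ordinary RBD/RGW workloads should rarely push xattrs into omap at all.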
Re: [ceph-users] Ceph Journal Disk Size
It also depends a lot on the size of your cluster. I have a test cluster I'm standing up right now with 60 nodes, a total of 600 OSDs, each at 4 TB. If I lose 4 TB, that's a very small fraction of the data. My replicas are going to be spread out across a lot of spindles, and replicating that missing 4 TB isn't much of an issue across 3 racks, each with 80 gbit/sec ToR uplinks to spine. Each node has 20 gbit/sec to ToR in a bond.

On the other hand, if you only have 4... or 8... or 10 servers, and a smaller number of OSDs, you have fewer spindles replicating that loss, and it might be more of an issue. It just depends on the size/scale of your environment. We're going to 8 TB drives, and that will ultimately be spread over 100 or more physical servers w/ 10 OSD disks per server. This will be across 7 to 10 racks (same network topology), so an 8 TB drive loss isn't too big of an issue. Now, that assumes that replication actually works well in a cluster of that size. We're still sussing out this part of the PoC engagement.

~~shane

On 7/1/15, 5:05 PM, ceph-users on behalf of German Anders gand...@despegar.com wrote: ask the other guys on the list, but for me losing 4 TB of data is too much. The cluster will still run fine, but at some point you need to recover that disk, and if you lose one server with all the 4 TB disks, in that case, yeah, it will hurt the cluster. Also take into account that with that kind of disk you will get no more than 100-110 iops per disk. German Anders | Storage System Engineer Leader | Despegar IT Team

2015-07-01 20:54 GMT-03:00 Nate Curry cu...@mosaicatm.com: 4TB is too much to lose? Why would it matter if you lost one 4TB with the redundancy? Won't it auto recover from the disk failure? Nate Curry

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Journal Disk Size
I'm interested in such a configuration; can you share some performance tests/numbers? Thanks in advance. Best regards, German

___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com