Re: [ceph-users] Ceph for home use
On 16 March 2016 at 04:34, Edward Wingate wrote:
> Given my resources, I'd still only run a single node with 3 OSDs and
> replica count of 2. I'd then have a VM mount a Ceph RBD to serve
> Samba/NFS shares.

Fun and instructive to play with Ceph that way, but not really a good use of it -- Ceph's main thing is to provide replication across nodes for redundancy and failover, which you aren't going to get with one node :)

I really recommend setting up your NAS using ZFS (under BSD or Linux); it's an excellent use case for your setup. You can configure mirrored disks for redundancy and extend the storage indefinitely by adding extra disks. Plus you get all the ZFS goodies: excellent command-line tools, snapshots, checksums, SSD caches and more. You can share ZFS datasets directly via SMB, NFS or iSCSI, or via your VM. And you will get much better performance.

--
Lindsay

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
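A minimal sketch of Lindsay's suggestion, using standard ZFS-on-Linux commands (the pool name and device names here are made up for illustration):

```shell
# Hypothetical devices: create a mirrored pool for redundancy.
zpool create tank mirror /dev/sda /dev/sdb

# Extend the storage later by adding another mirrored pair.
zpool add tank mirror /dev/sdc /dev/sdd

# The "ZFS goodies": datasets, snapshots, and built-in sharing.
zfs create tank/media
zfs set sharenfs=on tank/media     # export over NFS
zfs set sharesmb=on tank/media     # export over SMB (needs Samba installed)
zfs snapshot tank/media@nightly

# Optional SSD read cache (L2ARC) on a spare SSD partition.
zpool add tank cache /dev/sde1
```

These commands need a real pool and root privileges, so treat them as a starting point rather than a copy-paste recipe.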
Re: [ceph-users] Local SSD cache for ceph on each compute node.
I am using OpenStack, so I need this to be fully automated and to apply to all my VMs. If I could do what you mention at the hypervisor level, that would be much easier. The options you mention are, I guess, for very specific use cases and need to be configured on a per-VM basis, whilst I am looking for a general "Ceph on steroids" approach for all my VMs without any maintenance.

Thanks again :)

-Daniel

-----Original Message-----
From: Jason Dillaman [mailto:dilla...@redhat.com]
Sent: 16 March 2016 01:42
To: Daniel Niasoff
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.

[snip]
Re: [ceph-users] Local SSD cache for ceph on each compute node.
Indeed, well understood. As a shorter-term workaround, if you have control over the VMs, you could always just slice out an LVM volume from local SSD/NVMe and pass it through to the guest. Within the guest, use dm-cache (or similar) to add a cache front-end to your RBD volume.

Others have also reported improvements by using the QEMU x-data-plane option and RAIDing several RBD images together within the VM.

--
Jason Dillaman

----- Original Message -----
> From: "Daniel Niasoff"
> To: "Jason Dillaman"
> Cc: ceph-users@lists.ceph.com
> Sent: Tuesday, March 15, 2016 9:32:50 PM
> Subject: RE: [ceph-users] Local SSD cache for ceph on each compute node.
>
> [snip]
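Inside the guest, Jason's LVM + dm-cache workaround could look roughly like this (a hedged sketch: the device names /dev/vdb and /dev/vdc, the VG/LV names, and the sizing are all illustrative):

```shell
# /dev/vdb: the RBD-backed disk; /dev/vdc: the local SSD/NVMe slice
# passed through from the hypervisor (both names illustrative).
vgcreate vg_cached /dev/vdb /dev/vdc

# Origin LV on the RBD device, cache pool on the SSD slice.
lvcreate -n data -l 100%PVS vg_cached /dev/vdb
lvcreate --type cache-pool -n ssdcache -l 100%PVS vg_cached /dev/vdc

# Attach the cache pool; writeback mode also absorbs Ceph writes, at the
# cost that unflushed data lives only on the local SSD until written back.
lvconvert --type cache --cachepool vg_cached/ssdcache \
          --cachemode writeback vg_cached/data
```

Note the trade-off writeback implies: if the hypervisor's SSD dies before dirty blocks are flushed, those writes never reach the Ceph cluster.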
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
Ah, actually, I think there will be duplicates only around half the time -- either the old link or the new link could be orphaned depending on which xfs decides to list first. Only if the old link is orphaned will it match the name of the object once it's recreated.

I should be able to find time to put together a branch in the next week or two if you want to wait. It's still probably worth trying removing that object in 70.459.
-Sam

On Tue, Mar 15, 2016 at 6:03 PM, Samuel Just wrote:
> [snip]
Re: [ceph-users] cephx capabilities to forbid rbd creation
Perhaps something like this?

    mon 'allow r'
    osd 'allow class-read object_prefix rbd_children, allow r class-read object_prefix rbd_directory, allow rwx object_prefix rbd_header., allow rwx object_prefix rbd_data., allow rwx object_prefix rbd_id.'

As Greg mentioned, this won't stop you from just creating random objects in the pool that match the substrings listed in the cap, but it will prevent you from creating new images.

--
Jason Dillaman
Red Hat Ceph Storage Engineering
dilla...@redhat.com
http://www.redhat.com

----- Original Message -----
> From: "Gregory Farnum"
> To: "Loris Cuoghi"
> Cc: ceph-users@lists.ceph.com
> Sent: Tuesday, March 15, 2016 5:54:43 PM
> Subject: Re: [ceph-users] cephx capabilities to forbid rbd creation
>
> On Tue, Mar 15, 2016 at 2:44 PM, Loris Cuoghi wrote:
> > So, one key per RBD.
> > Or, dynamically enable/disable access to each RBD in each hypervisor's key.
> > Uhm, something doesn't scale here. :P
> > (I wonder if there's any limit to a key's capabilities string...)
> >
> > But, as it appears, I share your view that it is the only available
> > approach right now.
> >
> > Anyone would like to prove us wrong? :)
>
> The OSD capabilities aren't fine-grained enough to prevent you from
> creating objects, except by specifying that you only get access to
> certain prefixes or namespaces. So, either you lock down a key to a
> specific set of RBD volumes, or you let it create RBD volumes
> arbitrarily.
> ...unless, maybe, you can keep it from writing to the RBD index
> objects? But that doesn't prevent the user from scribbling across your
> cluster, just registering it. ;)
>
> That said, it is *possible* (although probably *unwise*) to give
> hypervisor keys access to all of the RBD volumes they host. cephx keys
> can have an arbitrary number of "allow" clauses, although I imagine if
> you get them large enough it could cause trouble (or maybe not?) in
> terms of memory usage or just plain old permission parsing time.
> And you'd likely run into issues with newly-created or newly-migrated
> instances ending up on a hypervisor which has an old version of its
> keyring cached. I'm not certain if there's a way to refresh those
> on-demand from the monitor.
> -Greg
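Applying Jason's capability string with the standard cephx tooling would look something like the following (the client name is made up; the caps are verbatim from his message):

```shell
# Create (or fetch) a restricted key that allows RBD I/O on existing
# images but blocks registering new ones. "client.qemu-host1" is an
# illustrative name, not from the thread.
ceph auth get-or-create client.qemu-host1 \
    mon 'allow r' \
    osd 'allow class-read object_prefix rbd_children, allow r class-read object_prefix rbd_directory, allow rwx object_prefix rbd_header., allow rwx object_prefix rbd_data., allow rwx object_prefix rbd_id.'
```

This needs a running cluster and an admin keyring, so it is a sketch of the mechanism rather than something testable in isolation.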
Re: [ceph-users] Local SSD cache for ceph on each compute node.
Thanks.

Reassuring, but I could do with something today :)

-----Original Message-----
From: Jason Dillaman [mailto:dilla...@redhat.com]
Sent: 16 March 2016 01:25
To: Daniel Niasoff
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Local SSD cache for ceph on each compute node.

[snip]
Re: [ceph-users] Local SSD cache for ceph on each compute node.
The good news is such a feature is in the early stage of design [1]. Hopefully this is a feature that will land in the Kraken release timeframe.

[1] http://tracker.ceph.com/projects/ceph/wiki/Rbd_-_ordered_crash-consistent_write-back_caching_extension

--
Jason Dillaman

----- Original Message -----
> From: "Daniel Niasoff"
> To: ceph-users@lists.ceph.com
> Sent: Tuesday, March 15, 2016 8:47:04 PM
> Subject: [ceph-users] Local SSD cache for ceph on each compute node.
>
> [snip]
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
The bug is entirely independent of hardware issues -- entirely a ceph bug.

xfs doesn't let us specify an ordering when reading a directory, so we have to keep directory sizes small. That means that when one of those pg collection subfolders has 320 files in it, we split it into up to 16 smaller directories. Overwriting or removing an ec object requires us to rename the old version out of the way in case we need to roll back (that's the generation number I mentioned above). For crash safety, this involves first creating a link to the new name, then removing the old one. Both the old and new link will be in the same subdirectory. If creating the new link pushes the directory to 320 files then we do a split while both links are present. If the file in question is using the special long filename handling, then a bug in the resulting link juggling causes us to orphan the old version of the file. Your cluster seems to have an unusual number of objects with very long names, which is why it is so visible on your cluster.

There are critical pool sizes where the PGs will all be close to one of those limits. It's possible you are not close to one of those limits. It's also possible you are nearing one now. In any case, the remapping gave the orphaned files an opportunity to cause trouble, but they don't appear due to remapping.
-Sam

On Tue, Mar 15, 2016 at 5:41 PM, Jeffrey McDonald wrote:
> [snip]
[ceph-users] Local SSD cache for ceph on each compute node.
Hi,

Let me start. Ceph is amazing, no it really is!

But a hypervisor reading and writing all its data off the network will add some latency to reads and writes. So the hypervisor could do with a local cache, possibly SSD or even NVMe. Spent a while looking into this, but it seems really strange that few people see the value of this.

Basically the cache would be used in two ways:

a) cache hot data
b) writeback cache for ceph writes

There is the RBD cache, but that isn't disk based, and on a hypervisor memory is at a premium.

A simple solution would be to put a journal on each compute node and get each hypervisor to use its own journal. Would this work? Something like this:
http://sebastien-han.fr/images/ceph-cache-pool-compute-design.png

Can this be achieved? A better explanation of what I am trying to achieve is here:
http://opennebula.org/cached-ssd-storage-infrastructure-for-vms/

This talk, if it gets voted in, looks interesting:
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/6827

Can anyone help?

Thanks

Daniel
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
One more question... did we hit the bug because we had hardware issues during the remapping, or would it have happened regardless of the hardware issues? E.g. I'm not planning to add any additional hardware soon, but would the bug pop up again on an (unpatched) system not subject to any remapping?

thanks,
jeff

On Tue, Mar 15, 2016 at 7:27 PM, Samuel Just wrote:
> [snip]
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
[back on list] ceph-objectstore-tool has a whole bunch of machinery for modifying an offline objectstore. It would be the easiest place to put it -- you could add a ceph-objectstore-tool --op filestore-repair-orphan-links [--dry-run] ... command which would mount the filestore in a special mode and iterate over all collections and repair them. If you want to go that route, we'd be happy to help you get it written. Once it fixes your cluster, we'd then be able to merge and backport it in case anyone else hits it. You'd probably be fine doing it while the OSD is live...but as a rule I usually prefer to do my osd surgery offline. Journal doesn't matter here, the orphaned files are basically invisible to the filestore (except when doing a collection scan for scrub) since they are in the wrong directory. I don't think the orphans are necessarily going to be 0 size. There might be quirk of how radosgw creates these objects that always causes them to be created 0 size than then overwritten with a writefull -- if that's true it might be the case that you would only see 0 size ones. -Sam On Tue, Mar 15, 2016 at 4:02 PM, Jeffrey McDonaldwrote: > Thanks, I can try to write a tool to do this. Does ceph-objectstore-tool > provide a framework? > > Can I safely delete the files while the OSD is alive or should I take it > offline? Any concerns about the journal? > > Are there any other properties of the orphans, e.g. will the orphans always > be size 0? > > Thanks! > Jeff > > On Tue, Mar 15, 2016 at 5:35 PM, Samuel Just wrote: >> >> Ok, a branch merged to master which should fix this >> (https://github.com/ceph/ceph/pull/8136). It'll be backported in due >> course. The problem is that that patch won't clean orphaned files >> that already exist. >> >> Let me explain a bit about what the orphaned files look like. The >> problem is files with object names that result in escaped filenames >> longer than the max filename ceph will create (~250 iirc). 
Normally, >> the name of the file is an escaped and sanitized version of the object >> name: >> >> >> ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/default.325674.107\u\ushadow\u.KePEE8heghHVnlb1\uEIupG0I5eROwRn\u77__head_C1DCD459__46__0 >> >> corresponds to an object like >> >> >> c1dcd459/default.325674.107__shadow_.KePEE8heghHVnlb1_EIupG0I5eROwRn_77/head//70 >> >> the DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ path is derived from the hash >> starting with the last value: cd459 -> DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ >> >> It ends up in DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ because that's the >> longest path that exists (DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/DIR_D does not >> exist -- if DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ ever gets too full, >> DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/DIR_D would be created and this file >> would be moved into it). >> >> When the escaped filename gets too long, we truncate the filename, and >> then append a hash and a number yielding a name like: >> >> >> ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long >> >> The _long at the end is always present with files like this. >> fa202ec9b4b3b217275a is the hash of the filename. The 0 indicates >> that it's the 0th file with this prefix and this hash -- if there are >> hash collisions with the same prefix, you'll see _1_ and _2_ and so on >> to distinguish them (very very unlikely). 
When the filename has been >> truncated as with this one, you will find the full file name in the >> attrs (attr user.cephosd.lfn3): >> >> >> ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long: >> user.cephos.lfn3: >> >> default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46__0 >> >> Let's look at one of the orphaned files (the one with the same >> file-name as the previous one, actually): >> >> >> ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long: >> user.cephos.lfn3: >> >>
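If the repair op suggested above ever lands, an offline invocation might look like the following. This is purely hypothetical: the --op name comes from Sam's suggestion in this thread and does not exist; --data-path and --journal-path are real ceph-objectstore-tool flags, but the OSD id and paths are examples.

```shell
# Hypothetical: --op filestore-repair-orphan-links is only proposed in this
# thread, not implemented. Run with the OSD stopped, as Sam advises.
systemctl stop ceph-osd@307
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-307 \
    --journal-path /var/lib/ceph/osd/ceph-307/journal \
    --op filestore-repair-orphan-links --dry-run
systemctl start ceph-osd@307
```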
[ceph-users] mon create-initial failed after installation (ceph-deploy: 1.5.31 / ceph: 10.0.2)
Hello, I've tried to install ceph using ceph-deploy as usual. [ceph@octopus conf]$ ceph-deploy install --mon --mds --testing octopus *install* completed without any surprises. But *mon create-initial* failed: ### Take1 ### Log [ceph@octopus conf]$ ceph-deploy mon create-initial [ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf [ceph_deploy.cli][INFO ] Invoked (1.5.31): /usr/bin/ceph-deploy mon create-initial [ceph_deploy.cli][INFO ] ceph-deploy options: [ceph_deploy.cli][INFO ] username : None [ceph_deploy.cli][INFO ] verbose : False [ceph_deploy.cli][INFO ] overwrite_conf: False [ceph_deploy.cli][INFO ] subcommand: create-initial [ceph_deploy.cli][INFO ] quiet : False [ceph_deploy.cli][INFO ] cd_conf : [ceph_deploy.cli][INFO ] cluster : ceph [ceph_deploy.cli][INFO ] func : [ceph_deploy.cli][INFO ] ceph_conf : None [ceph_deploy.cli][INFO ] default_release : False [ceph_deploy.cli][INFO ] keyrings : None [ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts octopus [ceph_deploy.mon][DEBUG ] detecting platform for host octopus ... 
[octopus][DEBUG ] connection detected need for sudo [octopus][DEBUG ] connected to host: octopus [octopus][DEBUG ] detect platform information from remote host [octopus][DEBUG ] detect machine type [octopus][DEBUG ] find the location of an executable [ceph_deploy.mon][INFO ] distro info: CentOS Linux 7.2.1511 Core [octopus][DEBUG ] determining if provided host has same hostname in remote [octopus][DEBUG ] get remote short hostname [octopus][DEBUG ] deploying mon to octopus [octopus][DEBUG ] get remote short hostname [octopus][DEBUG ] remote hostname: octopus [octopus][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf [octopus][DEBUG ] create the mon path if it does not exist [octopus][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-octopus/done [octopus][DEBUG ] done path does not exist: /var/lib/ceph/mon/ceph-octopus/done [octopus][INFO ] creating keyring file: /var/lib/ceph/tmp/ceph-octopus.mon.keyring [octopus][DEBUG ] create the monitor keyring file [octopus][INFO ] Running command: sudo ceph-mon --cluster ceph --mkfs -i octopus --keyring /var/lib/ceph/tmp/ceph-octopus.mon.keyring --setuser 1000 --setgroup 1000 [octopus][DEBUG ] ceph-mon: mon.noname-a 172.16.0.2:6789/0 is local, renaming to mon.octopus [octopus][DEBUG ] ceph-mon: set fsid to 53dab59e-3e16-4fdf-a029-b92c32aabde8 [octopus][DEBUG ] ceph-mon: created monfs at /var/lib/ceph/mon/ceph-octopus for mon.octopus [octopus][INFO ] unlinking keyring file /var/lib/ceph/tmp/ceph-octopus.mon.keyring [octopus][DEBUG ] create a done file to avoid re-doing the mon deployment [octopus][DEBUG ] create the init path if it does not exist [octopus][INFO ] Running command: sudo systemctl enable ceph.target [octopus][WARNIN] Failed to execute operation: Access denied [octopus][ERROR ] RuntimeError: command returned non-zero exit status: 1 [ceph_deploy.mon][ERROR ] Failed to execute command: systemctl enable ceph.target [ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors Log Interestingly 
after rebooting the host, it's completed -; ### Take2 ### Log [ceph@octopus conf]$ ceph-deploy mon create-initial [ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf [ceph_deploy.cli][INFO ] Invoked (1.5.31): /usr/bin/ceph-deploy mon create-initial [ceph_deploy.cli][INFO ] ceph-deploy options: [ceph_deploy.cli][INFO ] username : None [ceph_deploy.cli][INFO ] verbose : False [ceph_deploy.cli][INFO ] overwrite_conf: False [ceph_deploy.cli][INFO ] subcommand: create-initial [ceph_deploy.cli][INFO ] quiet : False [ceph_deploy.cli][INFO ] cd_conf : [ceph_deploy.cli][INFO ] cluster : ceph [ceph_deploy.cli][INFO ] func : [ceph_deploy.cli][INFO ] ceph_conf : None [ceph_deploy.cli][INFO ] default_release : False [ceph_deploy.cli][INFO ] keyrings : None [ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts octopus [ceph_deploy.mon][DEBUG ] detecting platform for host octopus ... [octopus][DEBUG ] connection detected need for sudo [octopus][DEBUG ] connected to host: octopus [octopus][DEBUG ] detect platform information from remote host [octopus][DEBUG ] detect machine type [octopus][DEBUG ] find the location of an executable [ceph_deploy.mon][INFO ] distro info: CentOS Linux 7.2.1511 Core [octopus][DEBUG ] determining if provided host has same hostname in remote [octopus][DEBUG ] get remote short
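For what it's worth, "Failed to execute operation: Access denied" from systemctl right after a fresh package install is often stale systemd manager state rather than anything ceph-specific; re-executing systemd is a commonly suggested alternative to the reboot that worked here. This is a hedged suggestion, not a confirmed diagnosis from the thread:

```shell
# Assumption: stale systemd state after package install on CentOS 7.2;
# re-exec the manager and retry instead of rebooting the whole host.
sudo systemctl daemon-reexec
sudo systemctl enable ceph.target
```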
Re: [ceph-users] inconsistent PG -> unfound objects on an erasure coded system
Ok, a branch merged to master which should fix this (https://github.com/ceph/ceph/pull/8136). It'll be backported in due course. The problem is that that patch won't clean orphaned files that already exist. Let me explain a bit about what the orphaned files look like. The problem is files with object names that result in escaped filenames longer than the max filename ceph will create (~250 iirc). Normally, the name of the file is an escaped and sanitized version of the object name: ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/default.325674.107\u\ushadow\u.KePEE8heghHVnlb1\uEIupG0I5eROwRn\u77__head_C1DCD459__46__0 corresponds to an object like c1dcd459/default.325674.107__shadow_.KePEE8heghHVnlb1_EIupG0I5eROwRn_77/head//70 the DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ path is derived from the hash starting with the last value: cd459 -> DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ It ends up in DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ because that's the longest path that exists (DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/DIR_D does not exist -- if DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/ ever gets too full, DIR_9/DIR_5/DIR_4/DIR_D/DIR_C/DIR_D would be created and this file would be moved into it). When the escaped filename gets too long, we truncate the filename, and then append a hash and a number yielding a name like: ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long The _long at the end is always present with files like this. fa202ec9b4b3b217275a is the hash of the filename. The 0 indicates that it's the 0th file with this prefix and this hash -- if there are hash collisions with the same prefix, you'll see _1_ and _2_ and so on to distinguish them (very very unlikely). 
When the filename has been truncated as with this one, you will find the full file name in the attrs (attr user.cephos.lfn3): ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long: user.cephos.lfn3: default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46__0 Let's look at one of the orphaned files (the one with the same file-name as the previous one, actually): ./DIR_9/DIR_5/DIR_4/DIR_D/default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkL_fa202ec9b4b3b217275a_0_long: user.cephos.lfn3: default.724733.17\u\ushadow\uprostate\srnaseq\s8e5da6e8-8881-4813-a4e3-327df57fd1b7\sUNCID\u2409283.304a95c1-2180-4a81-a85a-880427e97d67.140304\uUNC14-SN744\u0400\uAC3LWGACXX\u7\uGAGTGG.tar.gz.2~\u1r\uFGidmpEP8GRsJkNLfAh9CokxkLf.4\u156__head_79CED459__46_3189d_0 This one has the same filename as the previous object, but is an orphan. What makes it an orphan is that it has hash 79CED459, but is in ./DIR_9/DIR_5/DIR_4/DIR_D even though ./DIR_9/DIR_5/DIR_4/DIR_D/DIR_E exists (object files are always at the farthest directory from the root matching their hash). All of the orphans will be long-file-name objects (but most long-file-name objects are fine and are neither orphans nor have duplicates -- it's a fairly low-occurrence bug). 
In your case, I think *all* of the orphans will probably happen to have files with duplicate names in the correct directory -- though some might not, if the object had actually been deleted since the bug happened. When there are duplicates, the full object names will either be the same or differ by the generation number at the end (_0 vs 3189d_0, as in this case). Once the orphaned files are cleaned up, your cluster should be back to normal. If you want to wait, someone might get time to build a patch for ceph-objectstore-tool to automate this. You can try removing the orphan we identified in pg 70.459 and re-scrubbing to confirm that that fixes the pg. -Sam On Wed, Mar 9, 2016 at 6:58 AM, Jeffrey McDonald wrote: > Hi, I went back to the mon logs to see if I could elicit any additional > information about this PG. > Prior to 1/27/16, the deep-scrub on this OSD passes (then I see obsolete > rollback objects found): > > ceph.log.4.gz:2016-01-20 09:43:36.195640 osd.307 10.31.0.67:6848/127170 538 > : cluster [INF] 70.459 deep-scrub ok > ceph.log.4.gz:2016-01-27 09:51:49.952459 osd.307 10.31.0.67:6848/127170 583 > : cluster [INF] 70.459 deep-scrub starts > ceph.log.4.gz:2016-01-27
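Sam's placement rule above -- descend DIR_&lt;digit&gt; directories, taking the hash's hex digits from the last one, only as deep as they exist -- can be sketched as a small shell helper for spotting orphan candidates. The helper name is invented here; this only illustrates the rule and is not a vetted repair tool:

```shell
#!/bin/sh
# Sketch of the placement rule described above -- NOT a ceph tool; the
# helper name is invented for illustration. Given a collection root and
# an object hash (e.g. 79CED459), walk the hash's hex digits from the
# last one, descending only while the matching DIR_<digit> exists.
deepest_dir_for_hash() {
    path="$1"; s="$2"
    while [ -n "$s" ]; do
        n="${s#"${s%?}"}"     # last remaining character of the hash
        s="${s%?}"
        [ -d "$path/DIR_$n" ] || break
        path="$path/DIR_$n"
    done
    printf '%s\n' "$path"
}

# A *_long file is then an orphan candidate when its actual directory is
# shallower than the deepest existing directory for its hash, e.g.:
#   [ "$(dirname "$f")" != "$(deepest_dir_for_hash "$root" "$hash")" ]
```

On the thread's example, a file with hash 79CED459 sitting in .../DIR_9/DIR_5/DIR_4/DIR_D while .../DIR_9/DIR_5/DIR_4/DIR_D/DIR_E exists would be flagged, matching Sam's description of the orphan.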
Re: [ceph-users] cephx capabilities to forbid rbd creation
On Tue, Mar 15, 2016 at 2:44 PM, Loris Cuoghi wrote: > So, one key per RBD. > Or, dynamically enable/disable access to each RBD in each hypervisor's key. > Uhm, something doesn't scale here. :P > (I wonder if there's any limit to a key's capabilities string...) > > But, as it appears, I share your view that it is the only available > approach right now. > > Anyone would like to prove us wrong? :) The OSD capabilities aren't fine-grained enough to prevent you from creating objects, except by specifying that you only get access to certain prefixes or namespaces. So, either you lock down a key to a specific set of RBD volumes, or you let it create RBD volumes arbitrarily. ...unless, maybe, you can keep it from writing to the RBD index objects? But that doesn't prevent the user from scribbling across your cluster, just registering it. ;) That said, it is *possible* (although probably *unwise*) to give hypervisor keys access to all of the RBD volumes they host. cephx keys can have an arbitrary number of "allow" clauses, although I imagine if you get them large enough it could cause trouble (or maybe not?) in terms of memory usage or just plain old permission parsing time. And you'd likely run into issues with newly-created or newly-migrated instances ending up on a hypervisor which has an old version of its keyring cached. I'm not certain if there's a way to refresh those on-demand from the monitor. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
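Greg's "lock down a key to a specific set of RBD volumes" could look roughly like the following for one image, narrowing each prefix from the thread's rbd-user caps. Treat it as a hedged sketch: the client name and image name are invented, and rbd_header./rbd_data. objects are named by the image's internal id rather than its name, so &lt;image-id&gt; must be looked up first (e.g. from rbd info); prefix caps like these are easy to get subtly wrong.

```shell
# Hypothetical per-image key: every prefix below follows the caps format
# already shown in this thread; <image-id> is a placeholder that must be
# replaced by the image's internal id, not its name.
ceph auth caps client.vm01 \
    mon 'allow r' \
    osd "allow x object_prefix rbd_children, \
         allow rwx object_prefix rbd_header.<image-id>, \
         allow rwx object_prefix rbd_id.vm01, \
         allow rw object_prefix rbd_data.<image-id>"
```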
Re: [ceph-users] cephx capabilities to forbid rbd creation
So, one key per RBD. Or, dynamically enable/disable access to each RBD in each hypervisor's key. Uhm, something doesn't scale here. :P (I wonder if there's any limit to a key's capabilities string...) But, as it appears, I share your view that it is the only available approach right now. Anyone would like to prove us wrong? :) On 15/03/2016 22:33, David Casier wrote: > Hi, > Maybe (not tested) : > [osd ]allow * object_prefix ? > > > > 2016-03-15 22:18 GMT+01:00 Loris Cuoghi: >> Hi David, >> >> One pool per virtualization host would make it impossible to live >> migrate a VM. :) >> >> Thanks, >> >> Loris >> >> >> On 15/03/2016 22:11, David Casier wrote: >>> Hi Loris, >>> If i'm not mistaken, there are no rbd ACL in cephx. >>> Why not 1 pool/client and pool quota ? >>> >>> David. >>> >>> 2016-02-12 3:34 GMT+01:00 Loris Cuoghi: Hi! We are on version 9.2.0, 5 mons and 80 OSDS distributed on 10 hosts. How could we twist cephx capabilities so to forbid our KVM+QEMU+libvirt hosts any RBD creation capability ? We currently have an rbd-user key like so : caps: [mon] allow r caps: [osd] allow x object_prefix rbd_children, allow rwx object_prefix rbd_header., allow rwx object_prefix rbd_id., allow rw object_prefix rbd_data. And another rbd-manager key like the one suggested in the documentation, which is used in a central machine which is the only one allowed to create RBD images: caps: [mon] allow r caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=rbd Now, the libvirt hosts all share the same "rbd-user" secret. Our intention is to permit the QEMU processes to take full advantage of any single RBD functionality, but to forbid any new RBD creation with this same key. In the eventuality of a stolen key, or other hellish scenarios. What cephx capabilities did you guys configure for your virtualization hosts? 
Thanks, Loris
Re: [ceph-users] cephx capabilities to forbid rbd creation
Hi, Maybe (not tested) : [osd ]allow * object_prefix ? 2016-03-15 22:18 GMT+01:00 Loris Cuoghi: > > Hi David, > > One pool per virtualization host would make it impossible to live > migrate a VM. :) > > Thanks, > > Loris > > > On 15/03/2016 22:11, David Casier wrote: >> Hi Loris, >> If i'm not mistaken, there are no rbd ACL in cephx. >> Why not 1 pool/client and pool quota ? >> >> David. >> >> 2016-02-12 3:34 GMT+01:00 Loris Cuoghi : >>> Hi! >>> >>> We are on version 9.2.0, 5 mons and 80 OSDS distributed on 10 hosts. >>> >>> How could we twist cephx capabilities so to forbid our KVM+QEMU+libvirt >>> hosts any RBD creation capability ? >>> >>> We currently have an rbd-user key like so : >>> >>> caps: [mon] allow r >>> caps: [osd] allow x object_prefix rbd_children, allow rwx >>> object_prefix rbd_header., allow rwx object_prefix rbd_id., allow rw >>> object_prefix rbd_data. >>> >>> >>> And another rbd-manager key like the one suggested in the documentation, >>> which is used in a central machine which is the only one allowed to create >>> RBD images: >>> >>> caps: [mon] allow r >>> caps: [osd] allow class-read object_prefix rbd_children, allow rwx >>> pool=rbd >>> >>> Now, the libvirt hosts all share the same "rbd-user" secret. >>> Our intention is to permit the QEMU processes to take full advantage of any >>> single RBD functionality, but to forbid any new RBD creation with this same >>> key. In the eventuality of a stolen key, or other hellish scenarios. >>> >>> What cephx capabilities did you guys configure for your virtualization >>> hosts? >>> >>> Thanks, >>> >>> Loris >> >> -- Cordialement, David CASIER 3B Rue Taylor, CS20004 75481 PARIS Cedex 10 Paris Ligne directe: 01 75 98 53 85 Email: david.cas...@aevoo.fr
Re: [ceph-users] cephx capabilities to forbid rbd creation
Hi David, One pool per virtualization host would make it impossible to live migrate a VM. :) Thanks, Loris On 15/03/2016 22:11, David Casier wrote: > Hi Loris, > If i'm not mistaken, there are no rbd ACL in cephx. > Why not 1 pool/client and pool quota ? > > David. > > 2016-02-12 3:34 GMT+01:00 Loris Cuoghi: >> Hi! >> >> We are on version 9.2.0, 5 mons and 80 OSDS distributed on 10 hosts. >> >> How could we twist cephx capabilities so to forbid our KVM+QEMU+libvirt >> hosts any RBD creation capability ? >> >> We currently have an rbd-user key like so : >> >> caps: [mon] allow r >> caps: [osd] allow x object_prefix rbd_children, allow rwx >> object_prefix rbd_header., allow rwx object_prefix rbd_id., allow rw >> object_prefix rbd_data. >> >> >> And another rbd-manager key like the one suggested in the documentation, >> which is used in a central machine which is the only one allowed to create >> RBD images: >> >> caps: [mon] allow r >> caps: [osd] allow class-read object_prefix rbd_children, allow rwx >> pool=rbd >> >> Now, the libvirt hosts all share the same "rbd-user" secret. >> Our intention is to permit the QEMU processes to take full advantage of any >> single RBD functionality, but to forbid any new RBD creation with this same >> key. In the eventuality of a stolen key, or other hellish scenarios. >> >> What cephx capabilities did you guys configure for your virtualization >> hosts? >> >> Thanks, >> >> Loris > >
Re: [ceph-users] Disable cephx authentication ?
Interesting! Is it safe to do this? Perhaps "rados" is considered an internal command while rbd is a client using librados? In MonClient.cc:

  if (!cct->_conf->auth_supported.empty())
    method = cct->_conf->auth_supported;
  else if (entity_name.get_type() == CEPH_ENTITY_TYPE_OSD ||
           entity_name.get_type() == CEPH_ENTITY_TYPE_MDS ||
           entity_name.get_type() == CEPH_ENTITY_TYPE_MON)
    method = cct->_conf->auth_cluster_required;
  else
    method = cct->_conf->auth_client_required;

2016-03-15 9:35 GMT+01:00 Nguyen Hoang Nam: > Hi there, > > I set up a ceph cluster with cluster cephx authentication disabled and client cephx authentication enabled, as follows: > > auth_cluster_required = none > > auth_service_required = cephx > > auth_client_required = cephx > > I can run commands such as `ceph -s` and `rados -p rbd put`, but I cannot run `rbd ls`, `rbd create` ... The output of those commands is always: > > 2016-03-15 10:49:30.659194 7f1a6eda0700 0 cephx: verify_reply couldn't > decrypt with error: error decoding block for decryption > > 2016-03-15 10:49:30.659211 7f1a6eda0700 0 -- 172.30.6.101:0/954989888 >> > 172.30.6.103:6804/23638 pipe(0x7f1a8119f7f0 sd=4 :45067 s=1 pgs=0 cs=0 l=1 > c=0x7f1a8 1197000).failed verifying authorize reply > > Can you explain why RBD fails in this case? Thank you in advance
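Given the selection logic above, a client like rbd ends up using auth_client_required while the daemons authenticate each other via auth_cluster_required; mixing the settings is what appears to trip up rbd here. A hedged ceph.conf sketch, assuming the goal is simply to avoid the mixed-mode decrypt errors rather than a confirmed fix from this thread:

```ini
# Sketch: keep all three settings consistent (all cephx, or all none);
# the mixed configuration quoted above is what produced the
# "verify_reply couldn't decrypt" errors.
[global]
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
```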
Re: [ceph-users] cephx capabilities to forbid rbd creation
Hi Loris, If i'm not mistaken, there are no rbd ACL in cephx. Why not 1 pool/client and pool quota ? David. 2016-02-12 3:34 GMT+01:00 Loris Cuoghi: > Hi! > > We are on version 9.2.0, 5 mons and 80 OSDS distributed on 10 hosts. > > How could we twist cephx capabilities so to forbid our KVM+QEMU+libvirt > hosts any RBD creation capability ? > > We currently have an rbd-user key like so : > > caps: [mon] allow r > caps: [osd] allow x object_prefix rbd_children, allow rwx > object_prefix rbd_header., allow rwx object_prefix rbd_id., allow rw > object_prefix rbd_data. > > > And another rbd-manager key like the one suggested in the documentation, > which is used in a central machine which is the only one allowed to create > RBD images: > > caps: [mon] allow r > caps: [osd] allow class-read object_prefix rbd_children, allow rwx > pool=rbd > > Now, the libvirt hosts all share the same "rbd-user" secret. > Our intention is to permit the QEMU processes to take full advantage of any > single RBD functionality, but to forbid any new RBD creation with this same > key. In the eventuality of a stolen key, or other hellish scenarios. > > What cephx capabilities did you guys configure for your virtualization > hosts? > > Thanks, > > Loris > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
My ceph-deploy came from the download.ceph.com site and it is 1.5.31-0. This code is in ceph itself though, the deploy logic is where the code appears to do the right thing ;-) Steve > On Mar 15, 2016, at 2:38 PM, Vasu Kulkarni wrote: > > Thanks for the steps; that should be enough to test it out. I hope you got the > latest ceph-deploy, either from pip or through GitHub. > > On Tue, Mar 15, 2016 at 12:29 PM, Stephen Lord wrote: > I would have to nuke my cluster right now, and I do not have a spare one.. > > The procedure though is literally this, given a 3 node redhat 7.2 cluster, > ceph00, ceph01 and ceph02 > > ceph-deploy install --testing ceph00 ceph01 ceph02 > ceph-deploy new ceph00 ceph01 ceph02 > > ceph-deploy mon create ceph00 ceph01 ceph02 > ceph-deploy gatherkeys ceph00 > > ceph-deploy osd create ceph00:sdb:/dev/sdi > ceph-deploy osd create ceph00:sdc:/dev/sdi > > All devices have their partition tables wiped before this. They are all just > SATA devices, no special devices in the way. > > sdi is an ssd and it is being carved up for journals. The first osd create > works, the second one gets stuck in a loop in the update_partition call in > ceph_disk for the 5 iterations before it gives up. When I look in > /sys/block/sdi the partition for the first osd is visible, the one for the > second is not. However looking at /proc/partitions it sees the correct thing. > So something about partprobe is not kicking udev into doing the right thing > when the second partition is added I suspect. > > If I do not use the separate journal device then it usually works, but > occasionally I see a single retry in that same loop. > > There is code in ceph_deploy which uses partprobe or partx depending on which > distro it detects, that is how I worked out what to change here. > > If I have to tear things down again I will reproduce and post here. 
> > Steve > > > On Mar 15, 2016, at 2:12 PM, Vasu Kulkarni wrote: > > > > Do you mind giving the full failed logs somewhere in fpaste.org along with > > some os version details? > > There are some known issues on RHEL, If you use 'osd prepare' and 'osd > > activate'(specifying just the journal partition here) it might work better. > > > > On Tue, Mar 15, 2016 at 12:05 PM, Stephen Lord > > wrote: > > Not multipath if you mean using the multipath driver, just trying to setup > > OSDs which use a data disk and a journal ssd. If I run just a disk based > > OSD and only specify one device to ceph-deploy then it usually works > > although sometimes has to retry. In the case where I am using it to carve > > an SSD into several partitions for journals it fails on the second one. > > > > Steve > > > > > -- > The information contained in this transmission may be confidential. Any > disclosure, copying, or further distribution of confidential information is > not permitted unless such privilege is explicitly granted in writing by > Quantum. Quantum reserves the right to have electronic communications, > including email and attachments, sent across its networks filtered through > anti virus and spam software programs and retain such messages in order to > comply with applicable data security and retention requirements. Quantum is > not responsible for the proper and complete transmission of the substance of > this communication or for any delay in its receipt. > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
Thanks for the steps; that should be enough to test it out. I hope you got the latest ceph-deploy, either from pip or through GitHub. On Tue, Mar 15, 2016 at 12:29 PM, Stephen Lord wrote: > I would have to nuke my cluster right now, and I do not have a spare one.. > > The procedure though is literally this, given a 3 node redhat 7.2 cluster, > ceph00, ceph01 and ceph02 > > ceph-deploy install --testing ceph00 ceph01 ceph02 > ceph-deploy new ceph00 ceph01 ceph02 > > ceph-deploy mon create ceph00 ceph01 ceph02 > ceph-deploy gatherkeys ceph00 > > ceph-deploy osd create ceph00:sdb:/dev/sdi > ceph-deploy osd create ceph00:sdc:/dev/sdi > > All devices have their partition tables wiped before this. They are all > just SATA devices, no special devices in the way. > > sdi is an ssd and it is being carved up for journals. The first osd create > works, the second one gets stuck in a loop in the update_partition call in > ceph_disk for the 5 iterations before it gives up. When I look in > /sys/block/sdi the partition for the first osd is visible, the one for the > second is not. However looking at /proc/partitions it sees the correct > thing. So something about partprobe is not kicking udev into doing the > right thing when the second partition is added I suspect. > > If I do not use the separate journal device then it usually works, but > occasionally I see a single retry in that same loop. > > There is code in ceph_deploy which uses partprobe or partx depending on > which distro it detects, that is how I worked out what to change here. > > If I have to tear things down again I will reproduce and post here. > > Steve > > > On Mar 15, 2016, at 2:12 PM, Vasu Kulkarni wrote: > > > > Do you mind giving the full failed logs somewhere in fpaste.org along > with some os version details? > > There are some known issues on RHEL; if you use 'osd prepare' and 'osd > activate' (specifying just the journal partition here) it might work better. 
> > > > On Tue, Mar 15, 2016 at 12:05 PM, Stephen Lord > wrote: > > Not multipath if you mean using the multipath driver, just trying to > setup OSDs which use a data disk and a journal ssd. If I run just a disk > based OSD and only specify one device to ceph-deploy then it usually works > although sometimes has to retry. In the case where I am using it to carve > an SSD into several partitions for journals it fails on the second one. > > > > Steve > > > > > -- > The information contained in this transmission may be confidential. Any > disclosure, copying, or further distribution of confidential information is > not permitted unless such privilege is explicitly granted in writing by > Quantum. Quantum reserves the right to have electronic communications, > including email and attachments, sent across its networks filtered through > anti virus and spam software programs and retain such messages in order to > comply with applicable data security and retention requirements. Quantum is > not responsible for the proper and complete transmission of the substance > of this communication or for any delay in its receipt. > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
I would have to nuke my cluster right now, and I do not have a spare one.. The procedure though is literally this, given a 3 node redhat 7.2 cluster, ceph00, ceph01 and ceph02 ceph-deploy install --testing ceph00 ceph01 ceph02 ceph-deploy new ceph00 ceph01 ceph02 ceph-deploy mon create ceph00 ceph01 ceph02 ceph-deploy gatherkeys ceph00 ceph-deploy osd create ceph00:sdb:/dev/sdi ceph-deploy osd create ceph00:sdc:/dev/sdi All devices have their partition tables wiped before this. They are all just SATA devices, no special devices in the way. sdi is an ssd and it is being carved up for journals. The first osd create works, the second one gets stuck in a loop in the update_partition call in ceph_disk for the 5 iterations before it gives up. When I look in /sys/block/sdi the partition for the first osd is visible, the one for the second is not. However looking at /proc/partitions it sees the correct thing. So something about partprobe is not kicking udev into doing the right thing when the second partition is added I suspect. If I do not use the separate journal device then it usually works, but occasionally I see a single retry in that same loop. There is code in ceph_deploy which uses partprobe or partx depending on which distro it detects, that is how I worked out what to change here. If I have to tear things down again I will reproduce and post here. Steve > On Mar 15, 2016, at 2:12 PM, Vasu Kulkarni wrote: > > Do you mind giving the full failed logs somewhere in fpaste.org along with > some os version details? > There are some known issues on RHEL; if you use 'osd prepare' and 'osd > activate' (specifying just the journal partition here) it might work better. > > On Tue, Mar 15, 2016 at 12:05 PM, Stephen Lord wrote: > Not multipath if you mean using the multipath driver, just trying to setup > OSDs which use a data disk and a journal ssd. 
If I run just a disk based OSD > and only specify one device to ceph-deploy then it usually works although > sometimes has to retry. In the case where I am using it to carve an SSD into > several partitions for journals it fails on the second one. > > Steve > -- The information contained in this transmission may be confidential. Any disclosure, copying, or further distribution of confidential information is not permitted unless such privilege is explicitly granted in writing by Quantum. Quantum reserves the right to have electronic communications, including email and attachments, sent across its networks filtered through anti virus and spam software programs and retain such messages in order to comply with applicable data security and retention requirements. Quantum is not responsible for the proper and complete transmission of the substance of this communication or for any delay in its receipt. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
Do you mind giving the full failed logs somewhere on fpaste.org along with some OS version details? There are some known issues on RHEL. If you use 'osd prepare' and 'osd activate' (specifying just the journal partition here) it might work better.

On Tue, Mar 15, 2016 at 12:05 PM, Stephen Lord wrote:
> Not multipath, if you mean the multipath driver; just trying to set up
> OSDs which use a data disk and a journal SSD. If I run just a disk-based
> OSD and only specify one device to ceph-deploy then it usually works,
> although sometimes it has to retry. In the case where I am using it to
> carve an SSD into several partitions for journals, it fails on the
> second one.
>
> Steve
>
> [earlier messages in the thread trimmed]
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
Not multipath, if you mean the multipath driver; just trying to set up OSDs which use a data disk and a journal SSD. If I run just a disk-based OSD and only specify one device to ceph-deploy then it usually works, although sometimes it has to retry. In the case where I am using it to carve an SSD into several partitions for journals, it fails on the second one.

Steve

> On Mar 15, 2016, at 1:45 PM, Vasu Kulkarni wrote:
>
> The ceph-deploy suite and also the selinux suite (which isn't merged yet)
> indirectly test ceph-disk and have been run on Jewel as well. I guess the
> issue Stephen is seeing is on a multipath device, which I believe is a
> known issue.
>
> [earlier messages in the thread trimmed]
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
The ceph-deploy suite and also the selinux suite (which isn't merged yet) indirectly test ceph-disk and have been run on Jewel as well. I guess the issue Stephen is seeing is on a multipath device, which I believe is a known issue.

On Tue, Mar 15, 2016 at 11:42 AM, Gregory Farnum wrote:
> There's a ceph-disk suite from last August that Loïc set up, but based
> on the qa list it wasn't running for a while and isn't in great shape.
> :/ I know there are some CentOS7 boxes in the sepia lab but it might
> not be enough for a small and infrequently-run test to reliably get
> tested against them.
> -Greg
>
> [earlier messages in the thread trimmed]
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
There's a ceph-disk suite from last August that Loïc set up, but based on the qa list it wasn't running for a while and isn't in great shape. :/ I know there are some CentOS7 boxes in the sepia lab, but it might not be enough for a small and infrequently-run test to reliably get tested against them.
-Greg

On Tue, Mar 15, 2016 at 11:04 AM, Ben Hines wrote:
> It seems like ceph-disk is often breaking on CentOS/Red Hat systems. Does it
> have automated tests in the ceph release structure?
>
> -Ben
>
> [earlier messages in the thread trimmed]
[ceph-users] Ceph for home use
Wanting to play around with Ceph, I have a single-node Ceph cluster with 1 monitor and 3 OSDs running on a VM. I am loving the flexibility that Ceph provides (and perhaps just the novelty of it). I've been planning for some time to build a NAS for home use and am seriously thinking about running Ceph on real hardware (Core2 Quad Q9550 with 8-16GB RAM and 3x 4TB HDDs) as the backing store. Given my resources, I'd still only run a single node with 3 OSDs and a replica count of 2. I'd then have a VM mount a Ceph RBD to serve Samba/NFS shares.

I realize mine is not the usual or ideal use case for Ceph, but do you see any reason not to do this? I don't need high performance, just good enough to serve 2 movie streams, which my test VM is already able to do, and it will be one of my backup data stores, not my main store (yet). I just like Ceph's ability to add storage as needed with minimal fuss.

Also, I have some older HDDs (500-750GB, 5+ years old) that are still chugging along fine. I don't want to entrust main data to them, but feel I could use them for temporary backfill purposes. If I add them to another node, can Ceph be configured to use them only for backfill purposes, should the need arise?

Anyway, just wanted a sanity check, in case the novelty of running Ceph is clouding my judgement. Feel free to set me straight if this is just a silly idea for such relatively small-scale storage!
Re: [ceph-users] ceph-disk from jewel has issues on redhat 7
It seems like ceph-disk is often breaking on CentOS/Red Hat systems. Does it have automated tests in the ceph release structure?

-Ben

On Tue, Mar 15, 2016 at 8:52 AM, Stephen Lord wrote:
>
> Hi,
>
> The ceph-disk (10.0.4 version) command seems to have problems operating on
> a Red Hat 7 system. It uses the partprobe command unconditionally to update
> the partition table; I had to change this to partx -u to get past this.
>
> @@ -1321,13 +1321,13 @@
>      processed, i.e. the 95-ceph-osd.rules actions and mode changes,
>      group changes etc. are complete.
>      """
> -    LOG.debug('Calling partprobe on %s device %s', description, dev)
> +    LOG.debug('Calling partx on %s device %s', description, dev)
>      partprobe_ok = False
>      error = 'unknown error'
>      for i in (1, 2, 3, 4, 5):
>          command_check_call(['udevadm', 'settle', '--timeout=600'])
>          try:
> -            _check_output(['partprobe', dev])
> +            _check_output(['partx', '-u', dev])
>              partprobe_ok = True
>              break
>          except subprocess.CalledProcessError as e:
>
> It really needs to be doing that conditional on the operating system
> version.
>
> Steve
Re: [ceph-users] Calculating PG in an mixed environment
Thank you both for the quick reply, and I found my answer:

"Number of OSDs which this Pool will have PGs in. Typically, this is the entire Cluster OSD count, but could be less based on CRUSH rules. (e.g. Separate SSD and SATA disk sets)"

@Michael: So the ratio of PGs per OSD should be between 100 and 200. This means that if I calculate the PGs of a pool with the first formula and get, let's say, 8192, and I have 4 pools, I'm way overcommitting the PG-per-OSD ratio when each of the 4 pools uses 8192 PGs. Right? So I should lower the PGs on the pools?

Best,
Martin

On Tue, Mar 15, 2016 at 4:47 PM, Michael Kidd wrote:
> Hello Martin,
> The proper way is to perform the following process.
>
> For all pools utilizing the same bucket of OSDs:
>
>   (Pool1_pg_num * Pool1_size) + (Pool2_pg_num * Pool2_size) + ... + (Pool(n)_pg_num * Pool(n)_size)
>   -------------------------------------------------------------------------------------------------
>                                             OSD count
>
> This value should be between 100 and 200 PGs and is the actual ratio of PGs
> per OSD in that bucket of OSDs.
>
> [earlier messages in the thread trimmed]
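Michael's ratio formula makes the overcommit worry easy to check numerically. A minimal Python sketch, where the 100-OSD count and identical replica sizes are illustrative assumptions rather than figures from this thread:

```python
def pg_per_osd_ratio(pools, osd_count):
    """pools: (pg_num, replica_size) pairs sharing one bucket of OSDs.

    Returns the actual PG-per-OSD ratio; the healthy range quoted in
    the thread is roughly 100-200.
    """
    return sum(pg_num * size for pg_num, size in pools) / osd_count

# Hypothetical cluster: 4 pools of 8192 PGs each, replica size 3, on 100 OSDs.
ratio = pg_per_osd_ratio([(8192, 3)] * 4, osd_count=100)
print(ratio)  # 983.04 -- far above 200, so pg_num should indeed be lowered
```

So yes: sizing each of four pools from the single-pool formula overcommits badly, and pg_num per pool has to shrink until the combined ratio lands in range.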
Re: [ceph-users] data corruption with hammer
There are no monitors on the new node. It doesn't look like there has been any new corruption since we stopped changing the cache modes. Upon closer inspection, some files have been changed such that binary files are now ASCII files and vice versa. These are readable ASCII files, things like PHP or script files, or C files where ASCII files should be.

I've seen this type of corruption before when a SAN node misbehaved and both controllers were writing concurrently to the backend disks. The volume was only mounted by one host, but the writes were split between the controllers when it should have been active/passive.

We have killed off the OSDs on the new node as a precaution and will try to replicate this in our lab. My suspicion is that it has to do with the cache promotion code update, but I'm not sure how it would have caused this.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Mon, Mar 14, 2016 at 9:35 PM, Christian Balzer wrote:
> [earlier messages in the thread trimmed]
Re: [ceph-users] data corruption with hammer
there are not any monitors running on the new nodes. the monitors are on separate nodes and running the 0.94.5 release. i spent some time thinking about this last night as well, and my thoughts went to the recency patches. i wouldn't think that caused this, but it's the only thing that seems close.

mike

On Mon, Mar 14, 2016 at 9:35 PM, Christian Balzer wrote:
>
> Hello,
>
> On Mon, 14 Mar 2016 20:51:04 -0600 Mike Lovell wrote:
>
> > something weird happened on one of the ceph clusters that i administer
> > tonight which resulted in virtual machines using rbd volumes seeing
> > corruption in multiple forms.
> >
> > when everything was fine earlier in the day, the cluster was a number of
> > storage nodes spread across 3 different roots in the crush map. the first
> > bunch of storage nodes have both hard drives and ssds in them, with the
> > hard drives in one root and the ssds in another. there is a pool for
> > each, and the pool for the ssds is a cache tier for the hard drives. the
> > last set of storage nodes were in a separate root with their own pool
> > that is being used for burn-in testing.
> >
> > these nodes had run for a while with test traffic and we decided to move
> > them to the main root and pools. the main cluster is running 0.94.5 and
> > the new nodes got 0.94.6 due to them getting configured after that was
> > released. i removed the test pool and did a ceph osd crush move to move
> > the first node into the main cluster, the hard drives into the root for
> > that tier of storage and the ssds into the root and pool for the cache
> > tier. each set was done about 45 minutes apart and they ran for a couple
> > of hours while performing backfill without any issue other than high
> > load on the cluster.
> >
> Since I glanced what your setup looks like from Robert's posts and yours I
> won't say the obvious thing, as you aren't using EC pools.
>
> > we normally run the ssd tier in the forward cache-mode due to the ssds we
> > have not being able to keep up with the io of writeback. this results in
> > io on the hard drives slowly going up and performance of the cluster
> > starting to suffer. about once a week, i change the cache-mode between
> > writeback and forward for short periods of time to promote actively used
> > data to the cache tier. this moves io load from the hard drive tier to
> > the ssd tier and has been done multiple times without issue. i normally
> > don't do this while there are backfills or recoveries happening on the
> > cluster but decided to go ahead while backfill was happening due to the
> > high load.
> >
> As you might recall, I managed to have "rados bench" break (I/O error) when
> doing these switches with Firefly on my crappy test cluster, but not with
> Hammer.
> However I haven't done any such switches on my production cluster with a
> cache tier, both because the cache pool hasn't even reached 50% capacity
> after 2 weeks of pounding and because I'm sure that everything will hold
> up when it comes to the first flushing.
>
> Maybe the extreme load (as opposed to normal VM ops) of your cluster
> during the backfilling triggered the same or a similar bug.
>
> > i tried this procedure to change the ssd cache-tier between writeback and
> > forward cache-mode and things seemed okay from the ceph cluster. about 10
> > minutes after the first attempt at changing the mode, vms using the ceph
> > cluster for their storage started seeing corruption in multiple forms.
> > the mode was flipped back and forth multiple times in that time frame,
> > and it's unknown if the corruption was noticed with the first change or
> > subsequent changes. the vms were having issues of filesystems having
> > errors and getting remounted RO, and mysql databases seeing corruption
> > (both myisam and innodb). some of this was recoverable, but on some
> > filesystems there was corruption that led to things like lots of data
> > ending up in lost+found, and some of the databases were
> > un-recoverable (backups are helping there).
> >
> > i'm not sure what would have happened to cause this corruption. the
> > libvirt logs for the qemu processes for the vms did not provide any
> > output of problems from the ceph client code. it doesn't look like any
> > of the qemu processes had crashed. also, it has now been several hours
> > since this happened with no additional corruption noticed by the vms. it
> > doesn't appear that we had any corruption happen before i attempted the
> > flipping of the ssd tier cache-mode.
> >
> > the only thing i can think of that is different between this time doing
> > this procedure vs previous attempts was that there was the one storage
> > node running 0.94.6 where the remainder were running 0.94.5. is it
> > possible that something changed between these two releases that would
> > have caused problems with data consistency related to the cache tier? or
> > otherwise? any other thoughts or suggestions?
> >
> What comes to mind in terms of these
[ceph-users] ceph-disk from jewel has issues on redhat 7
Hi,

The ceph-disk (10.0.4 version) command seems to have problems operating on a Red Hat 7 system. It uses the partprobe command unconditionally to update the partition table; I had to change this to partx -u to get past this.

@@ -1321,13 +1321,13 @@
     processed, i.e. the 95-ceph-osd.rules actions and mode changes,
     group changes etc. are complete.
     """
-    LOG.debug('Calling partprobe on %s device %s', description, dev)
+    LOG.debug('Calling partx on %s device %s', description, dev)
     partprobe_ok = False
     error = 'unknown error'
     for i in (1, 2, 3, 4, 5):
         command_check_call(['udevadm', 'settle', '--timeout=600'])
         try:
-            _check_output(['partprobe', dev])
+            _check_output(['partx', '-u', dev])
             partprobe_ok = True
             break
         except subprocess.CalledProcessError as e:

It really needs to be doing that conditional on the operating system version.

Steve
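A distro-conditional version of that call, as Steve suggests, might look roughly like the sketch below. This is illustrative only, not the actual ceph-disk fix: the function names and the string-matching distro check are assumptions for the sake of the example.

```python
import platform
import subprocess

def partition_update_command(dev, distro):
    """Pick the partition-table re-read command for this distro.

    On RHEL/CentOS 7 the shipped partprobe does not reliably kick udev
    when a new partition is added, so fall back to 'partx -u' there.
    """
    if 'red hat' in distro.lower() or 'centos' in distro.lower():
        return ['partx', '-u', dev]
    return ['partprobe', dev]

def update_partition(dev, distro=None):
    """Settle udev, then ask the kernel to re-read dev's partition table."""
    if distro is None:
        distro = platform.platform()  # crude detection, for the sketch only
    subprocess.check_call(['udevadm', 'settle', '--timeout=600'])
    subprocess.check_output(partition_update_command(dev, distro))
```

The point is only that the command choice becomes a function of the detected OS, the same way ceph-deploy already switches between partprobe and partx.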
Re: [ceph-users] Calculating PG in an mixed environment
Hello Martin,

The proper way is to perform the following process. For all pools utilizing the same bucket of OSDs:

  (Pool1_pg_num * Pool1_size) + (Pool2_pg_num * Pool2_size) + ... + (Pool(n)_pg_num * Pool(n)_size)
  -------------------------------------------------------------------------------------------------
                                            OSD count

This value should be between 100 and 200 PGs and is the actual ratio of PGs per OSD in that bucket of OSDs.

For the actual recommendation from the Ceph devs (and written by myself), please see:
http://ceph.com/pgcalc/

NOTE: The tool is partially broken, but the explanation at the top/bottom is sound. I'll work to get the tool fully functional again.

Thanks,

Michael J. Kidd
Sr. Software Maintenance Engineer
Red Hat Ceph Storage
+1 919-442-8878

On Tue, Mar 15, 2016 at 11:41 AM, Martin Palma wrote:
> Hi all,
>
> The documentation [0] gives us the following formula for calculating
> the number of PGs if the cluster is bigger than 50 OSDs:
>
>                (OSDs * 100)
>   Total PGs = --------------
>                  pool size
>
> When we have mixed storage servers (HDD disks and SSD disks) and we
> have defined different roots in our crush map to map some pools only
> to HDD disks and some to SSD disks, as described by Sebastien Han [1]:
> in the above formula, what number of OSDs should be used to calculate
> the PGs for a pool only on the HDD disks? The total number of OSDs in
> the cluster, or only the number of OSDs which have an HDD disk as
> backend?
>
> Best,
> Martin
>
> [0] http://docs.ceph.com/docs/master/rados/operations/placement-groups/#choosing-the-number-of-placement-groups
> [1] http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
Re: [ceph-users] Calculating PG in an mixed environment
You can find it at http://ceph.com/pgcalc/

2016-03-15 23:41 GMT+08:00 Martin Palma:
> Hi all,
>
> The documentation [0] gives us the following formula for calculating
> the number of PGs if the cluster is bigger than 50 OSDs:
>
>   Total PGs = (OSDs * 100) / pool size
>
> We have mixed storage servers (HDD disks and SSD disks) and we have
> defined different roots in our crush map to map some pools only to
> HDD disks and some to SSD disks, as described by Sebastien Han [1].
>
> In the above formula, what number of OSDs should be used to calculate
> the PGs for a pool only on the HDD disks? The total number of OSDs in
> the cluster, or only the number of OSDs which have an HDD disk as
> backend?
>
> Best,
> Martin
>
> [0] http://docs.ceph.com/docs/master/rados/operations/placement-groups/#choosing-the-number-of-placement-groups
> [1] http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/

--
thanks
huangjun
[ceph-users] Calculating PG in an mixed environment
Hi all,

The documentation [0] gives us the following formula for calculating the number of PGs if the cluster is bigger than 50 OSDs:

  Total PGs = (OSDs * 100) / pool size

We have mixed storage servers (HDD disks and SSD disks) and we have defined different roots in our crush map to map some pools only to HDD disks and some to SSD disks, as described by Sebastien Han [1].

In the above formula, what number of OSDs should be used to calculate the PGs for a pool only on the HDD disks? The total number of OSDs in the cluster, or only the number of OSDs which have an HDD disk as backend?

Best,
Martin

[0] http://docs.ceph.com/docs/master/rados/operations/placement-groups/#choosing-the-number-of-placement-groups
[1] http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
Re: [ceph-users] SSD and Journal
Yes, if you can manage the *cost*, separating the journal onto a different device should improve write performance. But you need to evaluate how many OSD journals you can dedicate to a single journal device, as at some point it will be bottlenecked by that journal device's bandwidth.

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Yair Magnezi
Sent: Tuesday, March 15, 2016 6:44 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] SSD and Journal

Hi Guys,

On a full-SSD cluster, is it meaningful to put the journal on a different drive? Does it have any impact on performance?

Thanks

Yair Magnezi
Storage & Data Protection // Kenshoo
Office +972 7 32862423 // Mobile +972 50 575-2955
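Somnath's point about the journal device becoming the bottleneck is simple arithmetic. A rough sketch -- the bandwidth figures are assumptions for illustration, not measured numbers:

```python
def max_osds_per_journal(journal_write_mb_s, per_osd_write_mb_s):
    """Every write hits the journal first, so a shared journal device
    saturates once the OSDs behind it can jointly exceed its
    sequential write bandwidth.  Returns the largest OSD count the
    journal can feed without becoming the bottleneck."""
    return int(journal_write_mb_s // per_osd_write_mb_s)

# e.g. a journal SSD sustaining ~500 MB/s of writes, in front of OSDs
# that each sustain ~110 MB/s:
max_osds_per_journal(500, 110)  # -> 4
```

Beyond that count, adding more OSDs behind the same journal device only divides its bandwidth further.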
[ceph-users] ceph client lost connection to primary osd
Hi,

Can the ceph client (via librbd) continue I/O if the connection to the primary OSD is lost?

Thanks
[ceph-users] SSD and Journal
Hi Guys,

On a full-SSD cluster, is it meaningful to put the journal on a different drive? Does it have any impact on performance?

Thanks

Yair Magnezi
Storage & Data Protection // Kenshoo
Office +972 7 32862423 // Mobile +972 50 575-2955
Re: [ceph-users] Understanding "ceph -w" output - cluster monitoring
On Tue, Mar 15, 2016 at 6:38 AM, Blade Doyle wrote:
> On Mon, Mar 14, 2016 at 3:48 PM, Christian Balzer wrote:
>>
>> Hello,
>>
>> On Mon, 14 Mar 2016 09:16:13 -0700 Blade Doyle wrote:
>>
>> > Hi Ceph Community,
>> >
>> > I am trying to use "ceph -w" output to monitor my ceph cluster. The
>> > basic setup is:
>> >
>> > A python script runs ceph -w and processes each line of output. It
>> > finds the data it wants and reports it to InfluxDB. I view the data
>> > using Grafana, and Ceph Dashboard.
>>
>> A much richer and more precise source of information would be the various
>> performance counters, using collectd to feed them into graphite and
>> friends.
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/039953.html
>>
>> I'm using the DWM one, YMMV.
>
> Thanks much for your reply, Christian.
>
> Ugh. Ok, then it looks like the key info here is to get the data from the
> osd/mon sockets. Forgive me for not digging too deep yet, but it looks like
> I would do something like:
>
> ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok perf dump

Only if you want per-daemon stats.

> * which of that data is read/write bytes?
> * Is that data for the entire cluster, or just that osd? (would I need to
> read data from each individual osd sock in the cluster?)

Please have a look at the link I posted. There is an existing piece of code there for doing stats collection, and it supports both gathering stats from every daemon (you can sum them yourself) and gathering the already-summed stats from the mon (much simpler if you don't need more detail). Remember that the diamond code is free software: even if you don't want to use diamond, you're completely free to just copy what it does.

As for the meaning of the stats, you'll mostly find that they're either obvious ("num_read_kb", "num_write_kb", etc.) or completely obscure ("num_evict_mode_some"). As long as you only want the obvious ones, you'll be fine :-)

John
Re: [ceph-users] rbd cache on full ssd cluster
Thanks Christian.

Still -- "So yes, your numbers are normal for single client, low depth reads, as many threads in this ML confirm" -- we're facing very high latency (I expect much less latency from an SSD cluster):

    clat percentiles (usec):
     |  1.00th=[  350],  5.00th=[  390], 10.00th=[  414], 20.00th=[  454],
     | 30.00th=[  494], 40.00th=[  540], 50.00th=[  612], 60.00th=[  732],
     | 70.00th=[ 1064], 80.00th=[10304], 90.00th=[37632], 95.00th=[38656],
     | 99.00th=[40192], 99.50th=[41216], 99.90th=[43264], 99.95th=[43776],

Thanks

Yair Magnezi
Storage & Data Protection TL // Kenshoo
Office +972 7 32862423 // Mobile +972 50 575-2955

On Tue, Mar 15, 2016 at 2:28 AM, Christian Balzer wrote:
>
> Hello,
>
> On Mon, 14 Mar 2016 15:51:11 +0200 Yair Magnezi wrote:
>
> > On Fri, Mar 11, 2016 at 2:01 AM, Christian Balzer wrote:
> > >
> > > Hello,
> > >
> > > As always there are many similar threads in here; googling and reading
> > > up on stuff are good for you.
> > >
> > > On Thu, 10 Mar 2016 16:55:03 +0200 Yair Magnezi wrote:
> > >
> > > > Hello Cephers.
> > > >
> > > > I wonder if anyone has some experience with a full-SSD cluster.
> > > > We're testing ceph ("firefly") with 4 nodes (supermicro
> > > > SYS-F628R3-R72BPT) * 1TB SSD, total of 12 osds.
> > > > Our network is 10 gig.
> > > Much more, relevant details, from SW versions (kernel, OS, Ceph) and
> > > configuration (replica size of your pool) to precise HW info.
> >
> > H/W --> 4 nodes supermicro (SYS-F628R3-R72BPT), every node has 64 GB mem,
> > MegaRAID SAS 2208: RAID0, 4 * 1 TB ssd (SAMSUNG MZ7KM960HAHP-5)
>
> SM863, they should be fine.
> However I've never seen any results of them with sync writes; if you have
> the time, something to test.
>
> > Cluster --> 4 nodes, 12 OSDs, replica size = 2, ubuntu 14.04.1 LTS
>
> Otherwise similar to my cache pool against which I tested below,
> 2 nodes with 4x 800GB Intel DC S3610 each, replica of 2, thus 8 OSDs.
> 2 E5-2623 (3GHz base speed) per node. > Network is QDR Infiniband, IPoIB. > > Debian Jessie and Ceph Hammer, though. > > > > > > > In particular your SSDs, exact maker/version/size. > > > Where are your journals? > > > > > > SAMSUNG MZ7KM960HAHP-5 , 893.752 GB > > Journals on the same drive data ( all SSD as mentioned ) > > > Again, should be fine but test these with sync writes. > And of course monitor their wearout over time. > > > > > > Also Firefly is EOL, Hammer and even more so the upcoming Jewel have > > > significant improvements with SSDs. > > > > > > > We used the ceph_deploy for installation with all defaults > > > > ( followed ceph documentation for integration with open-stack ) > > > > As much as we understand there is no need to enable the rbd cache as > > > > we're running on full ssd. > > > RBD cache as in the client side librbd cache is always very helpful, > > > fast backing storage or not. > > > It can significantly reduce the number of small writes, something Ceph > > > has to do a lot of heavy lifting for. > > > > > > > bench marking the cluster shows very poor performance write but > > > > mostly read ( clients are open-stack but also vmware instances ) . > > > > > > Benchmarking how (exact command line for fio for example) and with what > > > results? > > > You say poor, but that might be "normal" for your situation, we can't > > > really tell w/o hard data. > > > > > > > > > > >fio --name=randread --ioengine=libaio --iodepth=1 --rw=randread > > --bs=4k --direct=1 --size=256M --numjobs=10 --runtime=120 > > --group_reporting --directory=/ceph_test2 > > > > Just to make sure, this is run inside your VM? > > >root@open-compute1:~# fio --name=randread --ioengine=libaio > > --iodepth=1 --rw=randread --bs=4k --direct=1 --size=256M --numjobs=10 > > --runtime=120 --group_reporting --directory=/ceph_test2 > > randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, > > iodepth=1 > > ... 
> > randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, > > iodepth=1 > > fio-2.1.3 > > Starting 10 processes > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > randread: Laying out IO file(s) (1 file(s) / 256MB) > > Jobs: 10 (f=10): [rr] [100.0% done] [4616KB/0KB/0KB /s] [1154/0/0 > > iops] [eta 00m:00s] > > randread: (groupid=0, jobs=10): err= 0: pid=25393: Mon Mar 14 09:17:24 > > 2016 read : io=597360KB,
[ceph-users] TR: CEPH nightmare or not
Hi,

We have 3 ceph clusters (Hammer 0.94.5) on the same physical nodes, using LXC on Debian Wheezy. Each physical node has 12 x 4 TB 7200 RPM hard drives, 2 x 200 GB MLC SSDs, and 2 x 10 Gb Ethernet. On each physical drive we have an LXC container for one OSD, and the journal is on an SSD partition. One of our ceph clusters has 96 OSDs with pgp_num = 1024.

Last week we raised pgp_num from 1024 to 2048 in one pass. Bad idea :(. You need to read the fucking manual before changing this kind of parameter. Ceph was a bit stressed and couldn't return to normal; a few OSDs (~10%) were flapping.

On our physical nodes, we noticed some network problems:

    64 bytes from 127.0.0.1: icmp_req=1258 ttl=64 time=0.146 ms
    ping: sendmsg: Invalid argument
    64 bytes from 127.0.0.1: icmp_req=1260 ttl=64 time=0.023 ms
    ping: sendmsg: Invalid argument
    64 bytes from 127.0.0.1: icmp_req=1262 ttl=64 time=0.028 ms
    ping: sendmsg: Invalid argument
    ping: sendmsg: Invalid argument
    ping: sendmsg: Invalid argument
    64 bytes from 127.0.0.1: icmp_req=1266 ttl=64 time=0.026 ms
    64 bytes from 127.0.0.1: icmp_req=1267 ttl=64 time=0.142 ms
    ping: sendmsg: Invalid argument
    ping: sendmsg: Invalid argument
    64 bytes from 127.0.0.1: icmp_req=1270 ttl=64 time=0.137 ms
    ping: sendmsg: Invalid argument

With our kernel (3.16) there was nothing in the logs. After a few days of research, we tried upgrading the kernel to a newer version (4.4.4). Not so easy to backport to Debian Wheezy, but after a few hours it worked. The problem hadn't gone away, but we noticed a new message in the logs:

    arp_cache: Neighbour table overflow

In Debian, the level-1 ARP cache holds only 128 records! We added this to sysctl.conf on every physical node:

    net.ipv4.neigh.default.gc_thresh1 = 4096
    net.ipv4.neigh.default.gc_thresh2 = 8192
    net.ipv4.neigh.default.gc_thresh3 = 8192
    net.ipv4.neigh.default.gc_stale_time = 86400

Immediately the network problems disappeared, and our cluster came back to a better state in a few hours: HEALTH_OK :)

To sum up:
- Do not raise your pgp_num in one pass!
- Look at your kernel parameters; you may need some tweaks to be fine.

Regards
Pierre DOUCET
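One way to avoid the "one pass" mistake is to script the increase in small steps, waiting for the cluster to settle between each. A minimal sketch -- the step size and the pool-name placeholder are assumptions, and pgp_num should follow pg_num once the new PGs have been created:

```python
def pg_num_steps(current, target, step=128):
    """Plan a gradual pg_num increase instead of one big jump.
    pg_num can only grow, so we only ever step upward, capping the
    final step at the target."""
    steps = []
    while current < target:
        current = min(current + step, target)
        steps.append(current)
    return steps

# Print the commands we would run, waiting for the cluster to settle
# between each one:
for n in pg_num_steps(1024, 2048):
    print('ceph osd pool set <pool> pg_num %d' % n)
```

Each intermediate value keeps the amount of data movement per step small, instead of triggering one massive rebalance.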
[ceph-users] Disable cephx authentication ?
Hi there,

I set up a ceph cluster with cephx cluster authentication disabled and cephx client authentication enabled, as follows:

    auth_cluster_required = none
    auth_service_required = cephx
    auth_client_required = cephx

I can run commands such as `ceph -s` and `rados -p rbd put`, but I cannot run `rbd ls`, `rbd create`, etc. The output of those commands is always:

    2016-03-15 10:49:30.659194 7f1a6eda0700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
    2016-03-15 10:49:30.659211 7f1a6eda0700 0 -- 172.30.6.101:0/954989888 >> 172.30.6.103:6804/23638 pipe(0x7f1a8119f7f0 sd=4 :45067 s=1 pgs=0 cs=0 l=1 c=0x7f1a81197000).failed verifying authorize reply

Can you explain why RBD fails in this case?

Thank you in advance
Re: [ceph-users] Understanding "ceph -w" output - cluster monitoring
On Mon, 14 Mar 2016 23:38:24 -0700 Blade Doyle wrote:

> On Mon, Mar 14, 2016 at 3:48 PM, Christian Balzer wrote:
> >
> > Hello,
> >
> > On Mon, 14 Mar 2016 09:16:13 -0700 Blade Doyle wrote:
> >
> > > Hi Ceph Community,
> > >
> > > I am trying to use "ceph -w" output to monitor my ceph cluster. The
> > > basic setup is:
> > >
> > > A python script runs ceph -w and processes each line of output. It
> > > finds the data it wants and reports it to InfluxDB. I view the data
> > > using Grafana, and Ceph Dashboard.
> >
> > A much richer and more precise source of information would be the
> > various performance counters, using collectd to feed them into
> > graphite and friends.
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/039953.html
> >
> > I'm using the DWM one, YMMV.
>
> Thanks much for your reply, Christian.
>
> Ugh. Ok, then it looks like the key info here is to get the data from
> the osd/mon sockets. Forgive me for not digging too deep yet, but it
> looks like I would do something like:
>
> ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok perf dump

Correct.

> * which of that data is read/write bytes?

More than one choice; the obvious ones are "counter-osd_op_out_bytes" and "counter-osd_op_in_bytes". This is why collectd with graphite is so much fun: you just click and drool until the data (graph) makes sense.

> * Is that data for the entire cluster, or just that osd? (would I need
> to read data from each individual osd sock in the cluster?)

Indeed the latter; the mons don't keep track of this.

Christian

> Thanks,
> Blade.

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
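If you do collect per-OSD dumps and sum them yourself, the parsing is straightforward. A minimal sketch -- it assumes the "osd" section of perf dump exposes op_in_bytes / op_out_bytes, as on Hammer-era OSDs; check your version's actual output:

```python
import json

def total_op_bytes(perf_dumps):
    """Sum client I/O byte counters across per-OSD 'perf dump' JSON
    strings, e.g. one per OSD collected via:
        ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok perf dump
    Returns (bytes_in, bytes_out) totalled over all the dumps;
    counters missing from a dump are treated as zero."""
    total_in = total_out = 0
    for dump in perf_dumps:
        osd = json.loads(dump).get('osd', {})
        total_in += osd.get('op_in_bytes', 0)
        total_out += osd.get('op_out_bytes', 0)
    return total_in, total_out
```

The summed pair can then be pushed to InfluxDB on each polling interval, replacing the `ceph -w` scraping described above.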
Re: [ceph-users] Understanding "ceph -w" output - cluster monitoring
On Mon, Mar 14, 2016 at 3:48 PM, Christian Balzer wrote:
>
> Hello,
>
> On Mon, 14 Mar 2016 09:16:13 -0700 Blade Doyle wrote:
>
> > Hi Ceph Community,
> >
> > I am trying to use "ceph -w" output to monitor my ceph cluster. The
> > basic setup is:
> >
> > A python script runs ceph -w and processes each line of output. It finds
> > the data it wants and reports it to InfluxDB. I view the data using
> > Grafana, and Ceph Dashboard.
>
> A much richer and more precise source of information would be the various
> performance counters, using collectd to feed them into graphite and
> friends.
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/039953.html
>
> I'm using the DWM one, YMMV.

Thanks much for your reply, Christian.

Ugh. Ok, then it looks like the key info here is to get the data from the osd/mon sockets. Forgive me for not digging too deep yet, but it looks like I would do something like:

    ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok perf dump

* which of that data is read/write bytes?
* Is that data for the entire cluster, or just that osd? (would I need to read data from each individual osd sock in the cluster?)

Thanks,
Blade.