Re: [ceph-users] the state of cephfs in giant
Hi Sage, sorry to be late to this thread; I just caught this one as I was reviewing the Giant release notes. A few questions below: On Mon, Oct 13, 2014 at 8:16 PM, Sage Weil s...@newdream.net wrote: [...] * ACLs: implemented, tested for kernel client. not implemented for ceph-fuse. [...] * samba VFS integration: implemented, limited test coverage. ACLs are kind of a must-have feature for most Samba admins. The Samba Ceph VFS builds on userspace libcephfs directly, using neither the kernel client nor ceph-fuse, so I'm trying to understand whether ACLs are available to Samba users or not. Can you clarify, please? * ganesha NFS integration: implemented, no test coverage. I understood from a conversation I had with John in London that flock() and fcntl() support had recently been added to ceph-fuse; can this be expected to Just Work™ in Ganesha as well? Also, can you make a general statement as to the stability of flock() and fcntl() support in the kernel client and in libcephfs/ceph-fuse? This too is particularly interesting for Samba admins who rely on byte-range locking for Samba CTDB support. * kernel NFS reexport: implemented. limited test coverage. no known issues. In this scenario, is there any specific magic that the kernel client does to avoid producing deadlocks under memory pressure? Or are you referring to FUSE-mounted CephFS reexported via kernel NFS? Cheers, Florian
Re: v0.86 released (Giant release candidate)
Hi Sage, On Tue, Oct 7, 2014 at 9:20 PM, Sage Weil s...@inktank.com wrote: This is a release candidate for Giant, which will hopefully be out in another week or two. We did a feature freeze about a month ago and since then have been doing only stabilization and bug fixing (and a handful of low-risk enhancements). A fair bit of new functionality went into the final sprint, but it has baked for quite a while now and we're feeling pretty good about it. Major items include: * librados locking refactor to improve scaling and client performance * local recovery code (LRC) erasure code plugin to trade some additional storage overhead for improved recovery performance * LTTNG tracing framework, with initial tracepoints in librados, librbd, and the OSD FileStore backend * separate monitor audit log for all administrative commands * asynchronous monitor transaction commits to reduce the impact on monitor read requests while processing updates * low-level tool for working with individual OSD data stores for debugging, recovery, and testing * many MDS improvements (bug fixes, health reporting) There are still a handful of known bugs in this release, but nothing severe enough to prevent a release. By and large we are pretty pleased with the stability and expect the final Giant release to be quite reliable. Please try this out on your non-production clusters for a preview. Thanks for the summary! Since you mentioned MDS improvements, and just so it doesn't get lost: as you hinted at in off-list email, please do provide a write-up of CephFS features expected to work in Giant at the time of the release (broken down by kernel client vs. ceph-fuse, if necessary). Not in the sense that anyone is offering commercial support, but in the sense of "if you use this limited feature set, we are confident that it at least won't eat your data." I think that would be beneficial to a large portion of the user base, and clear up a lot of the present confusion about the maturity and stability of the filesystem. Cheers, Florian
Re: [ceph-users] Status of snapshots in CephFS
On Fri, Sep 19, 2014 at 5:25 PM, Sage Weil sw...@redhat.com wrote: On Fri, 19 Sep 2014, Florian Haas wrote: Hello everyone, Just thought I'd circle back on some discussions I've had with people earlier in the year: Shortly before firefly, snapshot support for CephFS clients was effectively disabled by default at the MDS level, and can only be enabled after accepting a scary warning that your filesystem is highly likely to break if snapshot support is enabled. Has any progress been made on this in the interim? With libcephfs support slowly maturing in Ganesha, the option of deploying a Ceph-backed userspace NFS server is becoming more attractive -- and it's probably a better use of resources than mapping a boatload of RBDs on an NFS head node and then exporting all the data from there. Recent snapshot trimming issues notwithstanding, RBD snapshot support is reasonably stable, but even so, making snapshot data available via NFS, that way, is rather ugly. In addition, the libcephfs/Ganesha approach would obviously include much better horizontal scalability. We haven't done any work on snapshot stability. It is probably moderately stable if snapshots are only done at the root or at a consistent point in the hierarchy (as opposed to random directories), but there are still some basic problems that need to be resolved. I would not suggest deploying this in production! But some stress testing would as always be very welcome. :) OK, on a semi-related note: is there any reasonably current authoritative list of features that are supported and unsupported in either ceph-fuse or kernel cephfs, and if so, at what minimal version? The most comprehensive overview that seems to be available is one from Greg, which however is a year and a half old: http://ceph.com/dev-notes/cephfs-mds-status-discussion/ In addition, https://github.com/nfs-ganesha/nfs-ganesha/wiki/ReleaseNotes_2.0#CEPH states: The current requirement to build and use the Ceph FSAL is a Ceph build environment which includes Ceph client enhancements staged on the libwipcephfs development branch. These changes are expected to be part of the Ceph Firefly release. ... though it's not clear whether they ever did make it into firefly. Could someone in the know comment on that? I think this is referring to the libcephfs API changes that the cohortfs folks did. That all merged shortly before firefly. Great, thanks for the clarification. By the way, we have some basic samba integration tests in our regular regression tests, but nothing based on ganesha. If you really want this to work, the most valuable thing you could do would be to help get the tests written and integrated into ceph-qa-suite.git. Probably the biggest piece of work there is creating a task/ganesha.py that installs and configures ganesha with the ceph FSAL. Hmmm, given the excellent writeup that Niels de Vos of Gluster fame wrote about this topic, I might actually be able to cargo-cult some of what's in the Samba task and adapt it for ganesha. Sorry if I'm being ignorant about Teuthology: what platform does it normally run on? I ask because I understand most of your testing is done on Ubuntu, and Ubuntu currently doesn't ship a Ganesha package, which would make the install task a bit more complex. Cheers, Florian
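For the archives: the snapshot switch Sage refers to is an MDS-level setting, and flipping it looks roughly like this (a minimal sketch, assuming the firefly-era command syntax; expect the scary warning to make you confirm):

    ceph mds set allow_new_snaps true --yes-i-really-mean-it

Leaving out the --yes-i-really-mean-it flag should make the monitor refuse the change, which is the warning doing its job.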
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Wed, Sep 24, 2014 at 1:05 AM, Sage Weil sw...@redhat.com wrote: Sam and I discussed this on IRC and we think we have two simpler patches that solve the problem more directly. See wip-9487. So I understand this makes Dan's patch (and the config parameter that it introduces) unnecessary, but is it correct to assume that, just like Dan's patch, yours too will not be effective unless osd snap trim sleep > 0? Queued for testing now. Once that passes we can backport and test for firefly and dumpling too. Note that this won't make the next dumpling or firefly point releases (which are imminent). Should be in the next ones, though. OK, just in case anyone else runs into problems after removing tons of snapshots with <= 0.67.11, what's the plan to get them going again until 0.67.12 comes out? Install the autobuild package from the wip branch? Cheers, Florian
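For the archives, installing a wip-branch autobuild generally meant pulling packages from gitbuilder; a sketch under the assumption that the usual gitbuilder URL scheme applies (distro codename and branch name below are examples):

    echo deb http://gitbuilder.ceph.com/ceph-deb-precise-x86_64-basic/ref/wip-9487 precise main | \
      sudo tee /etc/apt/sources.list.d/ceph-wip.list
    sudo apt-get update && sudo apt-get install ceph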
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas flor...@hastexo.com wrote: On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil sw...@redhat.com wrote: On Sun, 21 Sep 2014, Florian Haas wrote: So yes, I think your patch absolutely still has merit, as would any means of reducing the number of snapshots an OSD will trim in one go. As it is, the situation looks really really bad, specifically considering that RBD and RADOS are meant to be super rock solid, as opposed to say CephFS which is in an experimental state. And contrary to CephFS snapshots, I can't recall any documentation saying that RBD snapshots will break your system. Yeah, it sounds like a separate issue, and no, the limit is not documented because it's definitely not the intended behavior. :) ...and I see you already have a log attached to #9503. Will take a look. I've already updated that issue in Redmine, but for the list archives I should also add this here: Dan's patch for #9503, together with Sage's for #9487, makes the problem go away in an instant. I've already pointed out that I owe Dan dinner, and Sage, well I already owe Sage pretty much lifelong full board. :) Looks like I was a bit too eager: while the cluster is behaving nicely with these patches as long as nothing happens to any OSDs, it does flag PGs as incomplete when an OSD goes down. Once the mon osd down out interval expires things seem to recover/backfill normally, but it's still disturbing to see this in the interim. I've updated http://tracker.ceph.com/issues/9503 with a pg query from one of the affected PGs, within the mon osd down out interval, while it was marked incomplete. Dan or Sage, any ideas as to what might be causing this? Cheers, Florian
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil sw...@redhat.com wrote: On Sun, 21 Sep 2014, Florian Haas wrote: So yes, I think your patch absolutely still has merit, as would any means of reducing the number of snapshots an OSD will trim in one go. As it is, the situation looks really really bad, specifically considering that RBD and RADOS are meant to be super rock solid, as opposed to say CephFS which is in an experimental state. And contrary to CephFS snapshots, I can't recall any documentation saying that RBD snapshots will break your system. Yeah, it sounds like a separate issue, and no, the limit is not documented because it's definitely not the intended behavior. :) ...and I see you already have a log attached to #9503. Will take a look. I've already updated that issue in Redmine, but for the list archives I should also add this here: Dan's patch for #9503, together with Sage's for #9487, makes the problem go away in an instant. I've already pointed out that I owe Dan dinner, and Sage, well I already owe Sage pretty much lifelong full board. :) Everyone with a ton of snapshots in their clusters (not sure where the threshold is, but it gets nasty somewhere between 1,000 and 10,000 I imagine) should probably update to 0.67.11 and 0.80.6 as soon as they come out, otherwise Terrible Things Will Happen™ if you're ever forced to delete a large number of snaps at once. Thanks again to Dan and Sage, Florian
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Sat, Sep 20, 2014 at 9:08 PM, Alphe Salas asa...@kepler.cl wrote: Real field testings and proof workout are better than any unit testing ... I would follow Dan's notice of resolution because it based on real problem and not fony style test ground. That statement is almost an insult to the authors and maintainers of the testing framework around Ceph. Therefore, I'm taking the liberty to register my objection. That said, I'm not sure that wip-9487-dumpling is the final fix to the issue. On the system where I am seeing the issue, even with the fix deployed, OSDs still not only go crazy snap trimming (which by itself would be understandable, as the system has indeed recently had thousands of snapshots removed), but they also still produce the previously seen ENOENT messages indicating they're trying to trim snaps that aren't there. That system, however, has PGs marked as recovering, not backfilling as in Dan's system. Not sure if wip-9487 falls short of fixing the issue at its root. Sage, whenever you have time, would you mind commenting? Cheers, Florian
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Sun, Sep 21, 2014 at 4:26 PM, Dan van der Ster daniel.vanders...@cern.ch wrote: Hi Florian, September 21 2014 3:33 PM, Florian Haas flor...@hastexo.com wrote: That said, I'm not sure that wip-9487-dumpling is the final fix to the issue. On the system where I am seeing the issue, even with the fix deployed, OSDs still not only go crazy snap trimming (which by itself would be understandable, as the system has indeed recently had thousands of snapshots removed), but they also still produce the previously seen ENOENT messages indicating they're trying to trim snaps that aren't there. You should be able to tell exactly how many snaps need to be trimmed. Check the current purged_snaps with ceph pg x.y query and also check the snap_trimq from debug_osd=10. The problem fixed in wip-9487 is the (mis)communication of purged_snaps to a new OSD. But if in your cluster purged_snaps is correct (which it should be after the fix from Sage), and it still has lots of snaps to trim, then I believe the only thing to do is let those snaps all get trimmed. (my other patch linked sometime earlier in this thread might help by breaking up all that trimming work into smaller pieces, but that was never tested). Yes, it does indeed look like the system does have thousands of snapshots left to trim. That said, since the PGs are locked during this time, this creates a situation where the cluster is becoming unusable with no way for the user to recover. Entering the realm of speculation, I wonder if your OSDs are getting interrupted, marked down, out, or crashing before they have the opportunity to persist purged_snaps? purged_snaps is updated in ReplicatedPG::WaitingOnReplicas::react, but if the primary is too busy to actually send that transaction to its peers, then eventually it or the new primary needs to start again, and no progress is ever made. If this is what is happening on your cluster, then again, perhaps my osd_snap_trim_max patch could be a solution. Since the snap trimmer immediately jacks the affected OSDs up to 100% CPU utilization, and they stop even responding to heartbeats, yes, they do get marked down and that makes the issue much worse. Even when setting nodown, though, that doesn't change the fact that the affected OSDs just spin practically indefinitely. So, even with the patch for 9487, which fixes *your* issue of the cluster trying to trim tons of snaps when in fact it should be trimming only a handful, the user is still in a world of pain when they do indeed have tons of snaps to trim. And obviously, neither osd max backfills nor osd recovery max active helps here, because even a single backfill/recovery makes the OSD go nuts. There is the silly option of setting osd_snap_trim_sleep to say 61 minutes, and restarting the ceph-osd daemons before the snap trim can kick in, i.e. hourly, via a cron job. Of course, while this prevents the OSD from going into a death spin, it only perpetuates the problem until a patch for this issue is available, because snap trimming never even runs, let alone completes. This is particularly bad because a user can get themselves a non-functional cluster simply by trying to delete thousands of snapshots at once. If you consider a tiny virtualization cluster of just 100 persistent VMs, out of which you take one snapshot an hour, then deleting the snapshots taken in one month puts you well above that limit. So we're not talking about outrageous numbers here. I don't think anyone can fault any user for attempting this. 
What makes the situation even worse is that there is no cluster-wide limit to the number of snapshots, or even say snapshots per RBD volume, or snapshots per PG, nor any limit on the number of snapshots deleted concurrently. So yes, I think your patch absolutely still has merit, as would any means of reducing the number of snapshots an OSD will trim in one go. As it is, the situation looks really really bad, specifically considering that RBD and RADOS are meant to be super rock solid, as opposed to say CephFS which is in an experimental state. And contrary to CephFS snapshots, I can't recall any documentation saying that RBD snapshots will break your system. Cheers, Florian
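To make Dan's inspection recipe from earlier in the thread concrete, a minimal sketch (the PG and OSD ids below are placeholders):

    # what does this PG believe it has already purged?
    ceph pg 5.3f query | grep purged_snaps
    # raise OSD debugging so the snap_trimq lines show up in the OSD log
    ceph tell osd.12 injectargs '--debug-osd 10'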
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil sw...@redhat.com wrote: On Fri, 19 Sep 2014, Florian Haas wrote: Hi Sage, was the off-list reply intentional? Whoops! Nope :) On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil sw...@redhat.com wrote: So, disaster is a pretty good description. Would anyone from the core team like to suggest another course of action or workaround, or are Dan and I generally on the right track to make the best out of a pretty bad situation? The short term fix would probably be to just prevent backfill for the time being until the bug is fixed. As in, osd max backfills = 0? Yeah :) Just managed to reproduce the problem... sage Saw the wip branch. Color me freakishly impressed on the turnaround. :) Thanks! Cheers, Florian
Re: snap_trimming + backfilling is inefficient with many purged_snaps
Hi Dan, saw the pull request, and can confirm your observations, at least partially. Comments inline. On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: Do I understand your issue report correctly in that you have found setting osd_snap_trim_sleep to be ineffective, because it's being applied when iterating from PG to PG, rather than from snap to snap? If so, then I'm guessing that that can hardly be intentional… I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs. Hmm. I'm actually seeing this in a system where the problematic snaps could *only* have been RBD snaps. We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep. To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516 Breaking out of the trimmer like that should allow IOs to the trimming PG to get through. The second aspect of this issue is why are the purged_snaps being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd=10, you will see "adding snap 1 to purged_snaps", which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(1). Hmmm, I'm not sure I can confirm that. I see "adding snap X to purged_snaps", but only after the snap has been purged. See https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the fact that the OSD tries to trim a snap only to get an ENOENT is probably indicative of something being fishy with the snaptrimq and/or the purged_snaps list as well. Looking forward to any ideas someone might have. So am I. :) Cheers, Florian
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Thu, Sep 18, 2014 at 8:56 PM, Mango Thirtyfour daniel.vanders...@cern.ch wrote: Hi Florian, On Sep 18, 2014 7:03 PM, Florian Haas flor...@hastexo.com wrote: Hi Dan, saw the pull request, and can confirm your observations, at least partially. Comments inline. On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: Do I understand your issue report correctly in that you have found setting osd_snap_trim_sleep to be ineffective, because it's being applied when iterating from PG to PG, rather than from snap to snap? If so, then I'm guessing that that can hardly be intentional… I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs. Hmm. I'm actually seeing this in a system where the problematic snaps could *only* have been RBD snaps. True, as am I. The current sleep is useful in this case, but since we'd normally only expect up to ~100 of these PGs per OSD, the trimming of 1 snap across all of those PGs would finish rather quickly anyway. Latency would surely be increased momentarily, but I wouldn't expect 90s slow requests like I have with the 3 snap_trimq single PG. Possibly the sleep is useful in both places. We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep. To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516 Breaking out of the trimmer like that should allow IOs to the trimming PG to get through. The second aspect of this issue is why are the purged_snaps being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd=10, you will see "adding snap 1 to purged_snaps", which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(1). Hmmm, I'm not sure I can confirm that. I see "adding snap X to purged_snaps", but only after the snap has been purged. See https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the fact that the OSD tries to trim a snap only to get an ENOENT is probably indicative of something being fishy with the snaptrimq and/or the purged_snaps list as well. With such a long snap_trimq there in your log, I suspect you're seeing the exact same behavior as I am. In my case the first snap trimmed is snap 1, of course because that is the first rm'd snap, and the contents of your pool are surely different. I also see the ENOENT messages... again confirming those snaps were already trimmed. Anyway, what I've observed is that a large snap_trimq like that will block the OSD until they are all re-trimmed. That's... a mess. So what is your workaround for recovery? 
My hunch would be to:
- stop all access to the cluster;
- set nodown and noout so that other OSDs don't mark spinning OSDs down (which would cause all sorts of primary and PG reassignments, useless backfill/recovery when mon osd down out interval expires, etc.);
- set osd_snap_trim_sleep to a ridiculously high value like 10 or 30 so that at least *between* PGs, the OSD has a chance to respond to heartbeats and do whatever else it needs to do;
- let the snap trim play itself out over several hours (days?).
That sounds utterly awful, but if anyone has a better idea (other than wait until the patch is merged), I'd be all ears. Cheers, Florian
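In command form, the hunch above would look roughly like this; an untested sketch (the sleep value is deliberately extreme, and injectargs changes are runtime-only):

    ceph osd set nodown
    ceph osd set noout
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 30'
    # ...let trimming play itself out, then revert:
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0'
    ceph osd unset nodown
    ceph osd unset noout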
Re: snap_trimming + backfilling is inefficient with many purged_snaps
On Thu, Sep 18, 2014 at 9:12 PM, Dan van der Ster daniel.vanders...@cern.ch wrote: Hi, September 18 2014 9:03 PM, Florian Haas flor...@hastexo.com wrote: On Thu, Sep 18, 2014 at 8:56 PM, Dan van der Ster daniel.vanders...@cern.ch wrote: Hi Florian, On Sep 18, 2014 7:03 PM, Florian Haas flor...@hastexo.com wrote: Hi Dan, saw the pull request, and can confirm your observations, at least partially. Comments inline. On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster daniel.vanders...@cern.ch wrote: Do I understand your issue report correctly in that you have found setting osd_snap_trim_sleep to be ineffective, because it's being applied when iterating from PG to PG, rather than from snap to snap? If so, then I'm guessing that that can hardly be intentional… I’m beginning to agree with you on that guess. AFAICT, the normal behavior of the snap trimmer is to trim one single snap, the one which is in the snap_trimq but not yet in purged_snaps. So the only time the current sleep implementation could be useful is if we rm’d a snap across many PGs at once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway since you’d at most need to trim O(100) PGs. Hmm. I'm actually seeing this in a system where the problematic snaps could *only* have been RBD snaps. True, as am I. The current sleep is useful in this case, but since we'd normally only expect up to ~100 of these PGs per OSD, the trimming of 1 snap across all of those PGs would finish rather quickly anyway. Latency would surely be increased momentarily, but I wouldn't expect 90s slow requests like I have with the 3 snap_trimq single PG. Possibly the sleep is useful in both places. We could move the snap trim sleep into the SnapTrimmer state machine, for example in ReplicatedPG::NotTrimming::react. This should allow other IOs to get through to the OSD, but of course the trimming PG would remain locked. And it would be locked for even longer now due to the sleep. To solve that we could limit the number of trims per instance of the SnapTrimmer, like I’ve done in this pull req: https://github.com/ceph/ceph/pull/2516 Breaking out of the trimmer like that should allow IOs to the trimming PG to get through. The second aspect of this issue is why are the purged_snaps being lost to begin with. I’ve managed to reproduce that on my test cluster. All you have to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap all those snapshots. Then use crush reweight to move the PGs around. With debug_osd=10, you will see "adding snap 1 to purged_snaps", which is one signature of this lost purged_snaps issue. To reproduce slow requests the number of snaps purged needs to be O(1). Hmmm, I'm not sure I can confirm that. I see "adding snap X to purged_snaps", but only after the snap has been purged. See https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the fact that the OSD tries to trim a snap only to get an ENOENT is probably indicative of something being fishy with the snaptrimq and/or the purged_snaps list as well. With such a long snap_trimq there in your log, I suspect you're seeing the exact same behavior as I am. In my case the first snap trimmed is snap 1, of course because that is the first rm'd snap, and the contents of your pool are surely different. I also see the ENOENT messages... again confirming those snaps were already trimmed. Anyway, what I've observed is that a large snap_trimq like that will block the OSD until they are all re-trimmed. That's... a mess. So what is your workaround for recovery? 
My hunch would be to:
- stop all access to the cluster;
- set nodown and noout so that other OSDs don't mark spinning OSDs down (which would cause all sorts of primary and PG reassignments, useless backfill/recovery when mon osd down out interval expires, etc.);
- set osd_snap_trim_sleep to a ridiculously high value like 10 or 30 so that at least *between* PGs, the OSD has a chance to respond to heartbeats and do whatever else it needs to do;
- let the snap trim play itself out over several hours (days?).
What I've been doing is I just continue draining my OSDs, two at a time. Each time, 1-2 other OSDs become blocked for a couple minutes (out of the ~1 hour it takes to drain) while a single PG re-trims, leading to ~100 slow requests. The OSD must still be responding to the peer pings, since other OSDs do not mark it down. Luckily this doesn't happen with every single movement of our pool 5 PGs, otherwise it would be a disaster like you said. So just to clarify, what you're doing is out of the OSDs that are spinning, you mark 2 out and wait for them to go empty? What I'm seeing in my environment is that the OSDs *do* go down. Marking them out seems not to help much as the problem then promptly pops up elsewhere. So, disaster is a pretty good description. Would anyone from the core team like to suggest another course of action or workaround, or are Dan and I generally on the right track to make the best out of a pretty bad situation?
Ceph Puppet modules (again)
Hi, Somehow I'm thinking I'm opening a can of worms, but here goes anyway. I saw some discussion about this here on this list last (Northern Hemisphere) autumn, but not much since. I'd like to ask for some clarification on the current state of the Ceph Puppet modules. Currently there are several: one on StackForge (http://git.openstack.org/cgit/stackforge/puppet-ceph/), primarily written by Loïc Dachary, and one on the eNovance GitHub repo (https://github.com/enovance/puppet-ceph), written by Sébastien Han and François Charlier. The eNovance repo is AGPL licensed, which I find rather incomprehensible — the only thing this would make sense for would be to force providers of *public* Puppet hosts to contribute back upstream, but that's a really far-fetched use case. The StackForge repo is ASL licensed, which looks a bit saner. Then there is a TelekomCloud fork of the eNovance repo at https://github.com/TelekomCloud/puppet-ceph/tree/rc/eisbrecher, with 55 unmerged patches. Also AGPL, as far as I can tell. There's also puppet-cephdeploy (https://github.com/dontalton/puppet-cephdeploy), where I like that it builds upon ceph-deploy, but rather dislike that it's rather closely interwoven with OpenStack. ASL. Finally, after the discussion that Loïc kicked off in https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg16673.html, there's https://github.com/ceph/puppet-ceph, which hasn't seen any updates in 2 months. This is a mirror of the StackForge module, as far as I can tell, is ASL licensed and has seen neither the eNovance work nor the TelekomCloud updates, presumably on account of the license issue. Neither repo seems to be universally accepted and fully complete (StackForge only supports mon deployment; eNovance doesn't do radosgw, for example), so I'm trying to understand where people should best direct their efforts to get things to a working state. All thoughts and comments appreciated. Thanks! Cheers, Florian
Re: Ceph Puppet modules (again)
On Mon, Mar 10, 2014 at 7:27 PM, Loic Dachary l...@dachary.org wrote: Hi Florian, New efforts should be directed to https://github.com/stackforge/puppet-ceph (mirrored at https://github.com/ceph/puppet-ceph) and evolving at https://review.openstack.org/#/q/status:open+project:stackforge/puppet-ceph,n,z I'm happily developing it with Andrew Woodward and David Simard and quite happy about how its future looks. It will eventually unite all other modules and benefit from a proper integration test environment. I've been distressed far too often by the lack of integration tests when writing puppet modules. It makes all the difference in the world to me, although it's not currently popular among puppet module developers, to the point that the official tool (beaker) can't allocate a disk when creating an instance (!). I wrote a tiny tool https://pypi.python.org/pypi/gerritexec to listen to gerrit events, a simple one-liner that actually runs the puppet modules for osd/mon on cuttlefish/dumpling/emperor in various situations. When working on puppet-ceph, my efforts are often directed to patching Ceph itself to make it more amicable to configuration management systems. ceph-disk: prepare should be idempotent is one example at http://tracker.ceph.com/issues/7475. But you will find a number of patches in Firefly oriented toward this goal. I believe this will also help reduce the complexity of the Chef, Ansible, Salt, ... playcookbooks (;-), in the same way hiding osd ids from them resolved a number of unnecessary problems for all of them (although some of them did not evolve to take advantage of it and are still overcomplex). From the point of view of someone in a hurry and with no time to develop, what I'm doing is not useable at the moment. I'm under no pressure to rush anything and won't commit to any deadline ;-) However, if a developer is willing to help out, I'd be happy to spare the time to get her/him on board and speed up the process. I'm not sure if I'm getting this correctly, but it sounds a bit like "please send patches my way, but don't expect to get anything useable anytime soon." That doesn't seem like a very powerful argument for contribution, I'm sad to say. But maybe I'm getting something wrong? Cheers, Florian
Re: github pull requests
On Fri, Mar 22, 2013 at 12:15 AM, Gregory Farnum g...@inktank.com wrote: I'm not sure that we handle enough incoming yet that the extra process weight of something like Gerrit or Launchpad is necessary over Github. What are you looking for in that system which Github doesn't provide? -Greg Automated regression tests and gated commits come to mind. Gerrit alone of course doesn't help with that, you'd probably want to consider either running Jenkins, or hook the master merges up with automatic teuthology runs. Just my two cents, though. Cheers, Florian
OSD nodes with >=8 spinners, SSD-backed journals, and their performance impact
Hi everyone, we ran into an interesting performance issue on Friday that we were able to troubleshoot with some help from Greg and Sam (thanks guys), and in the process realized that there's little guidance around for how to optimize performance in OSD nodes with lots of spinning disks (and hence, hosting a relatively large number of OSDs). In that type of hardware configuration, the usual mantra of "put your OSD journals on an SSD" doesn't always hold up. So we wrote up some recommendations, and I'd ask everyone interested to critique this or provide feedback: http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals It's probably easiest to comment directly on that page, but if you prefer instead to just respond in this thread, that's perfectly fine too. For some background of the discussion, please refer to the LogBot log from #ceph: http://irclogs.ceph.widodh.nl/index.php?date=2013-01-12 Hope this is useful. Cheers, Florian
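For readers who haven't wired this up before, the knob under discussion is the per-OSD journal location in ceph.conf; a minimal sketch (the device paths are made-up examples):

    [osd.0]
        osd journal = /dev/disk/by-partlabel/journal-osd0  # partition on the SSD
    [osd.1]
        osd journal = /dev/disk/by-partlabel/journal-osd1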
Re: OSD nodes with >=8 spinners, SSD-backed journals, and their performance impact
Hi Tom, On Mon, Jan 14, 2013 at 2:28 PM, Tom Lanyon t...@netspot.com.au wrote: On 14/01/2013, at 10:47 PM, Florian Haas flor...@hastexo.com wrote: snip http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals It's probably easiest to comment directly on that page, but if you prefer instead to just respond in this thread, that's perfectly fine too. snip Hi Florian, Thanks for putting this together. Pleasure! A couple of minor questions/comments: * One of the conclusions is to use the SSDs (assuming 2) un-RAIDed, but the article doesn't actually explain why using them in a RAID-1 is a poor idea. Added a paragraph starting with "putting your journal SSDs in a RAID set looks like a good idea at first"; does that explain the situation better? * Should the end of this sentence: Another option is to use, say, one partition on each of your SSD in a RAID for the operating system installation, and then chop up the rest of your SSDs an non-RAIDed Ceph OSDs. ...instead read: Another option is to use, say, one partition on each of your SSD in a RAID for the operating system installation, and then chop up the rest of your SSDs an non-RAIDed Ceph **OSD journals**. ? Sure. Fixed. Btw: coming to LCA? If you are, please find me and say hello. :) Cheers, Florian
Re: OSD nodes with >=8 spinners, SSD-backed journals, and their performance impact
Hi Mark, thanks for the comments. On Mon, Jan 14, 2013 at 2:46 PM, Mark Nelson mark.nel...@inktank.com wrote: Hi Florian, Couple of comments: OSDs use a write-ahead mode for local operations: a write hits the journal first, and from there is then being copied into the backing filestore. It's probably important to mention that this is true by default only for non-btrfs file systems. See: http://ceph.com/wiki/OSD_journal I am well aware of that, but I've yet to find a customer (or user) that's actually willing to entrust a production cluster with several hundred terabytes of data to btrfs. :) Besides, the whole post is about whether or not to use dedicated SSD block devices for OSD journals, and if you're tossing everything into btrfs you've already made the decision to use in-filestore journals. Thus, for best cluster performance it is crucial that the journal is fast, whereas the filestore can be comparatively slow. This is a bit misleading. Having a faster journal is helpful when there are short bursts of traffic. So long as the journal doesn't fill up and there are periods of inactivity for the data to get flushed, having a slow filestore disk may be ok. With lots of traffic, reality eventually catches up with you and you've gotta get all of that data flushed out to the backing file system. I agree that the wording is non-optimal. What I meant was to equate fast with SSDs, and comparatively slow with spinners. And to combine spinners with SSDs is one of the most interesting points about Ceph in terms of cost effectiveness. Pretty much every other storage technology would require you to either go all-SSD or to look into rather sophisticated HSM in order to achieve similar performance at a comparable scale. Suggestions for better wording? Have you ever seen ceph performance bouncing around with periods of really high throughput followed by periods of really low (or no!) throughput? That's usually the result of having a very fast journal paired with a slow data disk. The journal writes out data very quickly, hits its max ops or max bytes limit, then writes are stalled for a period while data in the journal gets flushed out to the data disk. Sure, essentially the equivalent, on a different level, of an NFS server with lots of RAM and a high vm.dirty_ratio suddenly doing a massive writeout. Another thing to remember is that writes to the journal happen without causing a lot of seeks. Ceph doesn't have to do metadata or dentry lookups/writes to write data to the journal. Because of this, it's been my experience that journals are primarily throughput bound rather than being random IOPS bound. Just putting the journals on any old SSD isn't enough, you need to choose ones that get really high throughput like the Intel S3700s or other high performance models. Yup. By and large, try to go for a relatively small number of OSDs per node, ideally not more than 8. This combined with SSD journals is likely to give you the best overall performance. The advice that I usually give people is that if performance is a big concern, try to make sure filestore disk and journal performance are nearly matched. In my test setup, I use 1 intel 520 SSD to host 3 journals for 7200rpm enterprise SATA disks. A 1:4 ratio or even 1:6 ratio may also work fine depending on various factors. So far the limits I've hit with very minimal tuning seem to be around 15 spinning disks and 5 SSDs for around 1.4GB/s (2.8GB/s including journal writes) to one node. Yes, I realize that there's no hard number here. 
I could also have put ideally not more than 6. The point I was trying to make is that people need to get off their thinking of what an ideal storage box is, and that more disks per host isn't necessarily better. We had a user in #ceph last week thinking that an OSD node with 36 spinners was a stellar idea. It probably isn't. If you do go with OSD nodes with a very high number of disks, consider dropping the idea of an SSD-based journal. Yes, in this kind of setup you might actually do better with journals on the spinners. If your SSD(s) is/are slow you very well may be better off with putting the journals on the same spinning disks as the OSD data. It's all a giant balancing act between write throughput, read throughput, and capacity. And people generally prefer simple heuristics (a.k.a. rules of thumb) over giant balancing acts. So I think if we tell them something like:
Got more than 8 spinners?
* No? Toss your journals on SSDs.
* Yes? At least consider not to.
... then I am hoping that will lead more people on the right path, than when we tell them: * Here's two dozen performance graphs, a pivot table, and a crystal ball. I am obviously jesting and exaggerating, but you get my point. :) Cheers, Florian
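The "max ops or max bytes limit" Mark mentions maps to a handful of journal throttle options; a hedged ceph.conf sketch with placeholder values (defaults differ between releases, so verify before copying anything):

    [osd]
        journal max write bytes = 10485760    # largest single journal write, bytes
        journal max write entries = 100       # largest single journal write, entries
        journal queue max bytes = 33554432    # max bytes queued for the journal
        journal queue max ops = 300           # max ops queued for the journal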
Re: OSD nodes with >=8 spinners, SSD-backed journals, and their performance impact
On 01/14/2013 06:34 PM, Gregory Farnum wrote: On Mon, Jan 14, 2013 at 6:09 AM, Florian Haas flor...@hastexo.com wrote: Hi Mark, thanks for the comments. On Mon, Jan 14, 2013 at 2:46 PM, Mark Nelson mark.nel...@inktank.com wrote: Hi Florian, Couple of comments: OSDs use a write-ahead mode for local operations: a write hits the journal first, and from there is then being copied into the backing filestore. It's probably important to mention that this is true by default only for non-btrfs file systems. See: http://ceph.com/wiki/OSD_journal I am well aware of that, but I've yet to find a customer (or user) that's actually willing to entrust a production cluster with several hundred terabytes of data to btrfs. :) Besides, the whole post is about whether or not to use dedicated SSD block devices for OSD journals, and if you're tossing everything into btrfs you've already made the decision to use in-filestore journals. That is absolutely not the case. btrfs works just fine with an external journal on SSD or whatever else; what made you think otherwise? A misunderstanding on my part. Also, I was overly broad in my comment. What I really meant to say was that if I'm using a btrfs filestore, and a separate dedicated block device for the journal, then the journaling mode is write-ahead and not parallel. Which was a wrong assumption on my part, as an external journal combined with a btrfs filestore seems to support parallel journaling mode just fine. For some reason I had supposed the journal had to be in the same btrfs as the filestore for this to work. Sorry for the confusion. Cheers, Florian
Re: Windows port
On Tue, Jan 8, 2013 at 3:00 PM, Dino Yancey dino2...@gmail.com wrote: Hi, I am also curious if a Windows port, specifically the client-side, is on the roadmap. This is somewhat OT from the original post, but if all you're interested in is using RBD block storage from Windows, you can already do that by going through an iSCSI or FC head node. Proof-of-concept configuration outlined here: http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices Not sure if this helps, but just thought I'd mention it. Cheers, Florian
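For a rough idea of what such a head node does, a sketch using the stock tgt userspace iSCSI target (the target name, image name, and wide-open ACL are examples; the linked post has a real configuration):

    rbd map winvol --pool rbd
    tgtadm --lld iscsi --mode target --op new --tid 1 \
      --targetname iqn.2013-01.com.example:winvol
    tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 \
      --backing-store /dev/rbd0
    tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL

The Windows box then connects with its native iSCSI initiator, none the wiser that the LUN behind it is an RBD image.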
Re: Integration work
On 08/28/2012 11:32 AM, Plaetinck, Dieter wrote: On Tue, 28 Aug 2012 11:12:16 -0700 Ross Turk r...@inktank.com wrote: Hi, ceph-devel! It's me, your friendly community guy. Inktank has an engineering team dedicated to Ceph, and we want to work on the right stuff. From time to time, I'd like to check in with you to make sure that we are. Over the past several months, Inktank's engineers have focused on core stability, radosgw, and feature expansion for RBD. At the same time, they have been regularly allocating cycles to integration work. Recently, this has consisted of improvements to the way Ceph works within OpenStack (even though OpenStack isn't the only technology that we think Ceph should play nicely with). What other sorts of integrations would you like to see Inktank engineers work on? are we only supposed to give answers wrt. integration with other software? if not, I would suggest to write documentation. If I may say so, the amount of work that John has poured into this in recent weeks has been incredible (http://www.ceph.com/docs/master/). So while it's definitely not complete nor perfect, I'm sure he would appreciate a little more specific information as to where you believe documentation is lacking. I for my part, in the documentation space, would love for the admin tools to become self-documenting. For example, I would love a help subcommand at any level of the ceph shell, listing the supported subcommands at that level. As in ceph help, ceph mon help, ceph osd getmap help (see the sketch below). Even better, the ceph shell could support a general-purpose hook that bash-completion can use (kind of like hg does in Mercurial), and this and the above-conjectured help facility could arguably share quite a bit of code. and also integration with CM like puppet/chef +1, although people are already working on both. So maybe this is just about the need to tell more people about that. :) Cheers, Florian
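Purely to illustrate the wish, here is what such a self-documenting level could look like; this output is entirely hypothetical, since no such subcommand existed at the time of writing:

    $ ceph osd help
    osd subcommands:
      crush       manipulate the CRUSH map
      getmap      dump the current OSD map
      getmaxosd   show the current max_osd value
      setmaxosd   set the max_osd value
      ...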
Re: wip-crush
On 08/22/2012 03:10 AM, Sage Weil wrote: I pushed a branch that changes some of the crush terminology. Instead of having a crush type called pool that requires you to say things like pool=default in the ceph osd crush set ... command, it uses root instead. That hopefully reinforces that it is a tree/hierarchy. There is also a patch that changes bucket to node throughout, since bucket is a term also used by radosgw. Thoughts? I think the main pain in making this transition is that old clusters have maps that have a type 'pool' and new ones won't, and the docs will need to walk people through both... pool in a crushmap being completely unrelated to a RADOS pool is something that I've heard customers/users report as confusing, as well. So changing that is probably a good thing. Naming it root is probably a good choice as well, as it happens to match http://ceph.com/wiki/Custom_data_placement_with_CRUSH. As for changing bucket to node... a node is normally simply a physical server (at least in HA terminology, which many potential Ceph users will be familiar with), and CRUSH uses host for that. So that's another recipe for confusion. How about using something super-generic, like element or item? Cheers, Florian
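For reference, the user-visible difference boils down to one keyword in commands like ceph osd crush set; a sketch assuming the argonaut-era argument order (ids, names, and weights are examples):

    # before the rename:
    ceph osd crush set 0 osd.0 1.0 pool=default rack=unknownrack host=node1
    # after wip-crush:
    ceph osd crush set 0 osd.0 1.0 root=default rack=unknownrack host=node1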
Re: Ceph Benchmark HowTo
On Tue, Jul 24, 2012 at 6:19 PM, Tommi Virtanen t...@inktank.com wrote: On Tue, Jul 24, 2012 at 8:55 AM, Mark Nelson mark.nel...@inktank.com wrote: personally I think it's fine to have it on the wiki. I do want to stress that performance is going to be (hopefully!) improving over the next couple of months so we will probably want to have updated results (or at least remove old results!) as things improve. Also, I'm not sure if we will be keeping the wiki around in its current form. There was some talk about migrating to something else, but I don't really remember the details. Sounds like a job for doc/dev/benchmark/index.rst! (It, or parts of it, can move out from under Internal if/when it gets user-friendly enough to not need as much skill to use.) If John is currently busy (which I assume he always is :) ), I should be able to take care of that. In that case, would someone please open a documentation bug and assign it to me? Cheers, Florian
Re: Ceph Benchmark HowTo
Hi Mehdi, great work! A few questions (for you, Mark, and anyone else watching this thread) regarding the content of that wiki page: For the OSD tests, which OSD filesystem are you testing on? Are you using a separate journal device? If yes, what type? For the RADOS benchmarks:

# rados bench -p pbench 900 seq
...
  611      16     17010     16994   111.241    104   1.05852   0.574897
  612      16     17037     17021   111.236    108   1.17321   0.574932
  613      16     17056     17040   111.178     76   1.01611   0.574903
Total time run:        613.339616
Total reads made:      17056
Read size:             4194304
Bandwidth (MB/sec):    111.234
Average Latency:       0.575252
Max latency:           1.65182
Min latency:           0.07418

How meaningful is it to use an (arithmetic) average here, considering that the min and max differ by a factor of 22? Aren't we being bitten by outliers pretty severely here, and wouldn't, say, a median be much more useful? (Actually, would the max latency include the initial hunt for a mon and the mon/osdmap exchange?)

seekwatcher -t rbd-latency-write.trace -o rbd-latency-write.png -p 'dd if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct' -d /dev/rbd0

Just making sure: are you getting the same numbers just with dd, rather than dd invoked by seekwatcher? Also, for your dd latency test of 4M direct I/O reads and writes, you seem to be getting 39 and 300 ms average latency, yet further down it says RBD latency read/write: 28ms and 114.5ms. Any explanation for the write latency being cut in half on what was apparently a different test run? Also, were read and write caches cleared between tests? (echo 3 > /proc/sys/vm/drop_caches) Cheers, Florian
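On the median point: rados bench doesn't dump per-request latencies directly, but if one had them one per line in a file (latencies.txt is hypothetical here), the median is a shell one-liner:

    sort -n latencies.txt | \
      awk '{ a[NR] = $1 } END { print (NR % 2) ? a[(NR + 1) / 2] : (a[NR / 2] + a[NR / 2 + 1]) / 2 }'

With a 22x min/max spread, the median and a high percentile together would say far more than the mean does.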
Re: Tuning placement group
On Fri, Jul 20, 2012 at 9:33 AM, François Charlier francois.charl...@enovance.com wrote: Hello, Reading http://ceph.com/docs/master/ops/manage/grow/placement-groups/ and thinking to build a ceph cluster with potentially 1000 OSDs. Using the recommendations on the previously cited link, it would require pg_num being set between 10,000 and 30,000. Okay with that. Let's use the recommended value of 16,384; this is already about 160 placement groups per OSD. What if, for a start, we choose to reach this number of 1000 OSDs slowly, starting with 100 OSDs? It's now 1600 placement groups per OSD. What if we chose 30,000 (or 32,768) placement groups to keep room for expansion? My question is: How will a Ceph pool behave with 1,000, 5,000 or even 10,000 placement groups per OSD? Will this impact performance? How badly? Can it be worked around? Is this a problem of RAM size? CPU usage? Any hint about this would be much appreciated. If I may, I'd like to add an additional point of consideration, specifically for radosgw setups: What's the recommended way to set the number of PGs for the half-dozen pools that radosgw normally creates on its own (.rgw, .rgw.users, .rgw.buckets and so on)? I *think* wanting to set a custom number of PGs would require pre-creating these pools manually, but there may be a way -- undocumented? -- to instruct radosgw to set a user-configured number of PGs on pool creation. Insight on that would be much appreciated. Cheers, Florian
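For what it's worth, pre-creating a pool with an explicit PG count is the easy half; a sketch (the pool name and counts are examples, loosely following the usual rule of thumb of roughly 100 PGs per OSD divided by the replica count, rounded up to a power of two):

    ceph osd pool create .rgw.buckets 8192 8192

Whether radosgw then adopts the pre-created pool rather than creating its own is exactly the part that could use confirmation from someone in the know.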
Ceph BoF at OSCON
Hi everyone, For those of you attending OSCON in Portland next week, there will be a birds-of-a-feather session on Ceph Monday night. All OSCON attendees interested in Ceph are very welcome. Details about the BoF are in this blog post: http://www.hastexo.com/blogs/florian/2012/07/12/openstack-high-availability-and-ceph-oscon Looking forward to meeting you there! Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: specifying secret in rbd map command
On Mon, Jul 9, 2012 at 4:57 PM, Travis Rhoden trho...@gmail.com wrote: Hey folks, I had a bit of unexpected trouble today using the rbd map command to map an RBD to a kernel object. I had previously been using the echo ... > /sys/bus/rbd... method of manipulating RBDs. I was looking at the instructions here: http://ceph.com/docs/master/rbd/rbd-ko/ When I tried to use the given syntax, sudo rbd map {image-name} --pool {pool-name} --name {client-name} --secret {client-secret}, I found the following:

1. {client-secret} is really supposed to be a file, not the actual secret. An strace on the command shows an attempt to open a file with the secret as its name.

2. If I give a keyring file as the client-secret, the command does not parse out the key for the given client-name. In other words, I gave the name as client.admin, then gave it the keyring file which contained merely

[client.admin]
	key = AQB67+BPGNX0NhAA9iK7Epcj72Jck1wOAQBetA==

But the command wouldn't parse out the key.

3. I had to create a new file, containing only the text of the key, and pass that to the command instead. Then everything is happy.

I'm happy to update the docs to make this process clear. But I wonder if there might be any plans to modify the command behavior to accept a keyring file and pull out the key belonging to the specified client name. Either way, I can update the docs to make it clear that you are specifying a file, not the key string itself.

I agree. This confuses quite a few people. Specifically because the Ceph filesystem client supports secret and secretfile as mount options, and expects a file only in the latter case. rbd acting differently does violate POLA in that way. Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
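Until rbd learns to read keyrings directly, a small workaround sketch (a hypothetical helper, written against the keyring format shown above) that pulls the bare key out of a keyring and prints it, suitable for redirecting into a secret file:

    # Extract the "key = ..." value for a given client from a ceph
    # keyring file; "rbd map --secret" wants a file containing only
    # this bare key.
    import re
    import sys

    def keyring_secret(keyring_path, client_name):
        section = None
        with open(keyring_path) as f:
            for line in f:
                line = line.strip()
                header = re.match(r'\[(.+)\]$', line)
                if header:
                    section = header.group(1)
                elif section == client_name:
                    key = re.match(r'key\s*=\s*(\S+)', line)
                    if key:
                        return key.group(1)
        raise KeyError('no key for %s in %s' % (client_name, keyring_path))

    if __name__ == '__main__':
        # e.g.: python keyring-secret.py /etc/ceph/keyring client.admin > rbd.secret
        print(keyring_secret(sys.argv[1], sys.argv[2]))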
Re: librbd: error finding header
On 07/09/12 12:29, Vladimir Bashkirtsev wrote: On 09/07/12 18:33, Dan Mick wrote: Vladimir: you can do some investigation with the rados command. What does rados -p rbd ls show you? Rather long list of: rb.0.11.2786 rb.0.d.54a2 rb.0.6.2eb5 rb.0.d.8294 rb.0.13.0377 rb.0.e.0629 rb.0.6.2756 rb.0.d.6156 rb.0.d.9b82 rb.0.5.0c9e rb.0.d.80ba rb.0.f.0e75 rb.0.6.ab4f rb.0.d.48e4 rb.0.d.5f67 rb.0.13.14ad rb.0.d.e074 rb.0.f.1a4b rb.0.13.04a3 ... How to find out to which image these objects belong? rbd info would tell you the block prefix for the image you're looking at. Or does that command give you an error opening image message as well? Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
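Going the other way -- from a pile of objects back to their images -- is mostly a matter of grouping by that block prefix; a rough sketch (assuming the rb.<n>.<id> naming visible in the listing above, with the output of rados -p rbd ls piped in):

    # Group "rados -p rbd ls" output by RBD block-name prefix
    # (e.g. rb.0.11), to see how many data objects sit under each
    # prefix; compare the prefixes against "rbd info <image>".
    import sys
    from collections import defaultdict

    counts = defaultdict(int)
    for line in sys.stdin:
        name = line.strip()
        parts = name.split('.')
        # rb.<n>.<id>.<object suffix> -> prefix is the first three parts
        if name.startswith('rb.') and len(parts) >= 4:
            counts['.'.join(parts[:3])] += 1

    for prefix in sorted(counts):
        print('%s: %d objects' % (prefix, counts[prefix]))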
Setting a big maxosd kills all mons
Hi guys, Someone I worked with today pointed me to a quick and easy way to bring down an entire cluster, by making all mons kill themselves in mass suicide:

ceph osd setmaxosd 2147483647
2012-07-05 16:29:41.893862 b5962b70  0 monclient: hunting for new mon

I don't know what the actual threshold is, but setting your maxosd to any sufficiently big number should do it. I had hoped 2^31-1 would be fine, but evidently it's not. This is what's in the mon log -- the first line is obviously only on the leader at the time of the command, the others are on all mons.

-1 2012-07-05 16:29:41.829470 b41a1b70 0 mon.daisy@0(leader) e1 handle_command mon_command(osd setmaxosd 2147483647 v 0) v1
 0 2012-07-05 16:29:41.887590 b41a1b70 -1 *** Caught signal (Aborted) ** in thread b41a1b70
ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
1: /usr/bin/ceph-mon() [0x816f461]
2: [0xb7738400]
3: [0xb7738424]
4: (gsignal()+0x51) [0xb731a781]
5: (abort()+0x182) [0xb731dbb2]
6: (__gnu_cxx::__verbose_terminate_handler()+0x14f) [0xb753b53f]
7: (()+0xbd405) [0xb7539405]
8: (()+0xbd442) [0xb7539442]
9: (()+0xbd581) [0xb7539581]
10: (()+0x11dea) [0xb7582dea]
11: (tc_new()+0x26) [0xb75a1636]
12: (std::vector<unsigned char, std::allocator<unsigned char> >::_M_fill_insert(__gnu_cxx::__normal_iterator<unsigned char*, std::vector<unsigned char, std::allocator<unsigned char> > >, unsigned int, unsigned char const&)+0x79) [0x8185629]
13: (OSDMap::set_max_osd(int)+0x497) [0x817c6b7]

From src/mon/OSDMonitor.cc:

int newmax = atoi(m->cmd[2].c_str());
if (newmax < osdmap.crush->get_max_devices()) {
  err = -ERANGE;
  ss << "cannot set max_osd to " << newmax << " which is < crush max_devices "
     << osdmap.crush->get_max_devices();
  goto out;
}

I think that counts as unchecked user input, or has cmd[2] been sanitized at any time before it gets here? Also, is there a way to recover from this, short of reinitializing all mons? Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Writes to mounted Ceph FS fail silently if client has no write capability on data pool
Hi everyone, please enlighten me if I'm misinterpreting something, but I think the Ceph FS layer could handle the following situation better. How to reproduce (this is on a 3.2.0 kernel):

1. Create a client, mine is named test, with the following capabilities:

client.test
	key: <key>
	caps: [mds] allow
	caps: [mon] allow r
	caps: [osd] allow rw pool=testpool

Note the client only has access to a single pool, testpool.

2. Export the client's secret and mount a Ceph FS.

mount -t ceph -o name=test,secretfile=/etc/ceph/test.secret daisy,eric,frank:/ /mnt

This succeeds, despite us not even having read access to the data pool.

3. Write something to a file.

root@alice:/mnt# echo hello world > hello.txt
root@alice:/mnt# cat hello.txt
hello world

This too succeeds.

4. Sync and clear caches.

root@alice:/mnt# sync
root@alice:/mnt# echo 3 > /proc/sys/vm/drop_caches

5. Check file size and contents.

root@alice:/mnt# ls -la
total 5
drwxr-xr-x  1 root root    0 Jul  5 17:15 .
drwxr-xr-x 21 root root 4096 Jun 11 09:03 ..
-rw-r--r--  1 root root   12 Jul  5 17:15 hello.txt
root@alice:/mnt# cat hello.txt
root@alice:/mnt#

Note the reported file size is unchanged, but the file is empty. Checking the data pool with client.admin credentials obviously shows that that pool is empty, so objects are never written. Interestingly, cephfs hello.txt show_location does list an object_name, identifying an object which doesn't exist. Is there any way to make the client fail with -EIO, -EPERM, -EOPNOTSUPP or whatever else is appropriate, rather than pretending to write when it can't? Also, going down the rabbit hole, how would this behavior change if I used cephfs to set the default layout on some directory to use a different pool? All thoughts appreciated. Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
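In the meantime, a defensive sketch (using the python-rados bindings with the client and pool names from the example above; rados_id is the part after "client.") to verify that a client can actually write to the data pool before trusting its mount:

    # Probe whether a client can really write to a pool by
    # round-tripping a tiny object through librados. If this fails,
    # CephFS writes through that client's mount would be lost
    # silently, as described above.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf', rados_id='test')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('data')  # the filesystem's data pool
        try:
            ioctx.write_full('cephfs-write-probe', 'ok')
            ioctx.remove_object('cephfs-write-probe')
            print('client.test can write to the data pool')
        finally:
            ioctx.close()
    except rados.Error as e:
        print('data pool is NOT writable for client.test: %s' % e)
    finally:
        cluster.shutdown()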
cephfs show_location produces kernel divide error: 0000 [#1] when run against a directory that is not the filesystem root
And one more issue report for today... :) Really easy to reproduce on my 3.2.0 Debian squeeze-backports kernel: mount a Ceph FS, create a directory in it. Then run cephfs dir show_location. dmesg stacktrace: [ 7153.714260] libceph: mon2 192.168.42.116:6789 session established [ 7308.584193] divide error: [#1] SMP [ 7308.584936] Modules linked in: cryptd aes_i586 aes_generic cbc ceph libceph nfsd lockd nfs_acl auth_rpcgss sunrpc fuse joydev usbhid hid snd_pcm snd_timer snd processor soundcore snd_page_alloc thermal_sys button tpm_tis tpm tpm_bios psmouse i2c_piix4 evdev serio_raw i2c_core virtio_balloon pcspkr ext3 jbd mbcache btrfs zlib_deflate crc32c libcrc32c sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix uhci_hcd ehci_hcd libata usbcore floppy scsi_mod virtio_pci usb_common [last unloaded: scsi_wait_scan] [ 7308.588013] [ 7308.588013] Pid: 1444, comm: cephfs Not tainted 3.2.0-0.bpo.2-686-pae #1 Bochs Bochs [ 7308.588013] EIP: 0060:[f848c6c2] EFLAGS: 00010246 CPU: 0 [ 7308.588013] EIP is at ceph_calc_file_object_mapping+0x44/0xe8 [libceph] [ 7308.588013] EAX: EBX: ECX: EDX: [ 7308.588013] ESI: EDI: EBP: ESP: f7495ce4 [ 7308.588013] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 [ 7308.588013] Process cephfs (pid: 1444, ti=f7494000 task=f7266a60 task.ti=f7494000) [ 7308.588013] Stack: [ 7308.588013] 0001b053 f5f20624 f5f203f0 f749a800 f5f20420 [ 7308.588013] f84ca6a7 f7495d40 f7495d58 f7495d50 f7495d38 0001 0246 f5f20420 [ 7308.588013] f749a90c bff6ff70 c14203a4 fffba978 000a0050 f79f0298 0001 [ 7308.588013] Call Trace: [ 7308.588013] [f84ca6a7] ? ceph_ioctl_get_dataloc+0x9e/0x213 [ceph] [ 7308.588013] [c10b6781] ? __do_fault+0x3ee/0x42b [ 7308.588013] [c10b75f3] ? handle_pte_fault+0x3aa/0xa67 [ 7308.588013] [c10e0844] ? path_openat+0x27f/0x294 [ 7308.588013] [f84cac16] ? ceph_ioctl+0x3fa/0x460 [ceph] [ 7308.588013] [c10d9fdb] ? cp_new_stat64+0xee/0x100 [ 7308.588013] [c10b7ebe] ? handle_mm_fault+0x20e/0x224 [ 7308.588013] [f84ca81c] ? ceph_ioctl_get_dataloc+0x213/0x213 [ceph] I unfortunately don't have a more recent kernel to test with, so if this has been fixed upstream feel free to ignore me. Otherwise, perhaps something that could go into the 3.5-rc cycle. Doing show_location on a file, and on the root directory of the fs, both work fine. Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: cephfs show_location produces kernel divide error: 0000 [#1] when run against a directory that is not the filesystem root
On Thu, Jul 5, 2012 at 10:04 PM, Gregory Farnum g...@inktank.com wrote: But I have a few more queries while this is fresh. If you create a directory, unmount and remount, and get the location, does that work? Nope, same error. (actually, just flushing caches would probably do it.) Idem. If you create a directory on one node, and then go look at it on another node and try to get the location from there, does that work? No. Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Writes to mounted Ceph FS fail silently if client has no write capability on data pool
On Thu, Jul 5, 2012 at 10:01 PM, Gregory Farnum g...@inktank.com wrote: Also, going down the rabbit hole, how would this behavior change if I used cephfs to set the default layout on some directory to use a different pool? I'm not sure what you're asking here — if you have access to the metadata server, you can change the pool that new files go into, and I think you can set the pool to be whatever you like (and we should probably harden all this, too). So you can fix it if it's a problem, but you can also turn it into a problem. I am aware that I would be able to do this. My question was more along the lines of: if the pool that data is written to can be set on a per-file or per-directory basis, and we can also set read and write permissions per pool, what would correct filesystem behavior even look like? Hide files the mounting user doesn't have read access to? Return -EIO or -EPERM on writes to files stored in pools we can't write to? Fail the mount if we're missing some permission on any file or directory in the fs? All of these sound painful in one way or another, so I'm having trouble envisioning what the correct behavior would look like. Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: URL-safe base64 encoding for keys
On Tue, Jul 3, 2012 at 2:22 PM, Wido den Hollander w...@widodh.nl wrote: Hi, With my CloudStack integration I'm running into a problem with the cephx keys due to '/' being possible in the cephx keys. CloudStack's API expects a URI to be passed when adding a storage pool, e.g.: addStoragePool?uri=rbd://user:cephx...@monitor.addr/poolname If 'cephxkey' contains a / the URI parser in Java fails (java.net.URI) and splits the URI in the wrong place. For base64 there is a specification [0] that describes the usage of - and _ instead of + and / Is there a way that we change the bits in src/common/armor.c and replace the + and / for - and _? FWIW (only semi-related), some S3 clients -- s3cmd from s3tools, for example -- seem to choke on the forward slash in radosgw auto-generated secret keys, as well. Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
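For reference, RFC 4648's two alphabets differ only in those last two characters, which is easy to see with the standard library (plain Python, nothing Ceph-specific):

    # Standard base64 uses '+' and '/'; the URL-safe variant from
    # RFC 4648 substitutes '-' and '_', so encoded keys can pass
    # through URI parsers unharmed.
    import base64
    import os

    raw = os.urandom(30)
    print(base64.b64encode(raw))          # may contain '+' and '/'
    print(base64.urlsafe_b64encode(raw))  # uses '-' and '_' instead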
Re: URL-safe base64 encoding for keys
On Tue, Jul 3, 2012 at 5:04 PM, Yehuda Sadeh yeh...@inktank.com wrote: FWIW (only semi-related), some S3 clients -- s3cmd from s3tools, for example -- seem to choke on the forward slash in radosgw auto-generated secret keys, as well. With radosgw we actually switched a while back to using the alternative encoding. If you still have some old access keys, just replace them. Is a while back after 0.47.3? Because I was definitely getting keys with / from that version. Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
rbd rm allows removal of mapped device, nukes data, then returns -EBUSY
Hi everyone, just wanted to check if this was the expected behavior -- it doesn't look like it would be, to me. What I do is create a 1G RBD, and just for the heck of it, make an XFS on it:

root@alice:~# rbd create xfsdev --size 1024
root@alice:~# rbd map xfsdev
root@alice:~# rbd showmapped
id   pool   image    snap   device
0    rbd    xfsdev   -      /dev/rbd0
root@alice:~# mkfs -t xfs /dev/rbd/rbd/xfsdev
log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/rbd/rbd/xfsdev    isize=256    agcount=9, agsize=31744 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=1024   swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

I double check to see if there's an XFS signature on the device:

root@alice:~# xxd /dev/rbd/rbd/xfsdev | head
000: 5846 5342 1000 0004  XFSB
010:
020: 17bb f4df b1f3 444b bc01 3b3e f827 8fef  ..DK..;.'..
030: 0002 0008 4000  ..@.
040: 4001 4002  ..@...@.
050: 0001 7c00 0009  ..|.
060: 0a00 b5a4 0200 0100 0010
070: 0c09 0804 0f00 0019
080: 0040 003d  ...@...=
090: 0003 f5d8

Now, I try to remove the device while it's mapped:

root@alice:~# rbd rm xfsdev
Removing image: 99% complete...2012-07-02 06:52:57.386040 b6c8d710 -1 librbd: error removing header: (16) Device or resource busy
Removing image: 99% complete...failed.
delete error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout.

That sounds reasonable, except that the data has already been nuked:

root@alice:~# xxd /dev/rbd/rbd/xfsdev | head
000:
010:
020:
030:
040:
050:
060:
070:
080:
090:

After unmapping, the device removal proceeds just fine.

root@alice:~# rbd unmap /dev/rbd0
root@alice:~# rbd rm xfsdev
Removing image: 100% complete...done.

Now if the RBD is capable of detecting that it's being watched, why not fail the removal _before_ wiping data, potentially with an override with a --force flag? Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Radosgw installation and administration docs
On Sun, Jul 1, 2012 at 10:22 PM, Chuanyu chua...@cs.nctu.edu.tw wrote: Hi Yehuda, Florian, I followed the wiki, and the steps you discussed, and constructed my ceph system with the rados gateway, and I can use libs3 to upload files via radosgw (thanks a lot!), but I got 405 Method Not Allowed when I use swift:

$ swift -v -A http://s3.paca.tw:80/auth -U paca:paca1 -K UoJO4nFgdAoX+9nEftElIY+AMmDIkcrUBkycNKPA stat
Auth GET failed: http://s3.paca.tw:80/auth/tokens 405 Method Not Allowed

(Because there is no test step on the wiki, I followed Florian's question, and guessed that the test command is the above.) My radosgw-admin config:

$ radosgw-admin user info --uid=paca
{ "user_id": "paca",
  "rados_uid": 0,
  "display_name": "chuanyu",
  "email": "chua...@cs.nctu.edu.tw",
  "suspended": 0,
  "subusers": [
        { "id": "paca:paca1",
          "permissions": "none"}],

This is most likely your problem. You're being bitten by http://tracker.newdream.net/issues/2650. Try

radosgw-admin subuser modify --subuser=paca:paca1 --access=full

and see if that improves things. Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Does radosgw really need to talk to an MDS?
Hi everyone, radosgw(8) states that the following capabilities must be granted to the user that radosgw uses to connect to RADOS. ceph-authtool -n client.radosgw.gateway --cap mon 'allow r' --cap osd 'allow rwx' --cap mds 'allow' /etc/ceph/keyring.radosgw.gateway Could someone explain why we need an mds 'allow' in here? I thought only CephFS clients talked to MDSs, and at first glance configuring client.radosgw.gateway without any MDS capability seems not to break anything (at least with my limited S3 tests). Am I missing something? Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Assertion failure when radosgw can't authenticate
Hi, in cephx enabled clusters (0.47.x), authentication failures from radosgw seem to lead to an uncaught assertion failure:

2012-07-02 11:26:46.559830 b69c5730  0 librados: client.radosgw.charlie authentication error (1) Operation not permitted
2012-07-02 11:26:46.560093 b69c5730 -1 Couldn't init storage provider (RADOS)
2012-07-02 11:26:46.560401 b69c5730 -1 common/Timer.cc: In function 'SafeTimer::~SafeTimer()' thread b69c5730 time 2012-07-02 11:26:46.560110
common/Timer.cc: 57: FAILED assert(thread == __null)
ceph version 0.47.3 (commit:c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
1: (SafeTimer::~SafeTimer()+0x96) [0x80a5c76]
2: (main()+0x56f) [0x809708f]
3: (__libc_start_main()+0xe6) [0xb6cefca6]
4: /usr/bin/radosgw() [0x807f4a1]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
-2 2012-07-02 11:26:46.559830 b69c5730  0 librados: client.radosgw.charlie authentication error (1) Operation not permitted
-1 2012-07-02 11:26:46.560093 b69c5730 -1 Couldn't init storage provider (RADOS)
 0 2012-07-02 11:26:46.560401 b69c5730 -1 common/Timer.cc: In function 'SafeTimer::~SafeTimer()' thread b69c5730 time 2012-07-02 11:26:46.560110
common/Timer.cc: 57: FAILED assert(thread == __null)

Kinda ugly. Maybe this could be fixed in the pending 0.48 release. The issue obviously goes away immediately after correcting the auth credentials. Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Does radosgw really need to talk to an MDS?
On Mon, Jul 2, 2012 at 1:44 PM, Wido den Hollander w...@widodh.nl wrote: You are not allowing the RADOS Gateway to do anything on the MDS. There is no 'r', 'w' or 'x' permission which you are allowing. So there is nothing the rgw has access to on the MDS. Yep, so we might as well leave off --cap mds 'allow'? Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
radosgw forgetting subuser permissions when creating a fresh key
Hi everyone, I wonder if this is intentional: when I create a new Swift key for an existing subuser, which has previously been assigned full control permissions, those permissions appear to get lost upon key creation.

# radosgw-admin subuser create --uid=johndoe --subuser=johndoe:swift --access=full
{ "user_id": "johndoe",
  "rados_uid": 0,
  "display_name": "John Doe",
  "email": "j...@example.com",
  "suspended": 0,
  "subusers": [
        { "id": "johndoe:swift",
          "permissions": "full-control"}],
  "keys": [
        { "user": "johndoe",
          "access_key": "QFAMEDSJP5DEKJO0DDXY",
          "secret_key": "iaSFLDVvDdQt6lkNzHyW4fPLZugBAI1g17LO0+87"}],
  "swift_keys": []}

Note "permissions": "full-control".

# radosgw-admin key create --subuser=johndoe:swift --key-type=swift
{ "user_id": "johndoe",
  "rados_uid": 0,
  "display_name": "John Doe",
  "email": "j...@example.com",
  "suspended": 0,
  "subusers": [
        { "id": "johndoe:swift",
          "permissions": "none"}],
  "keys": [
        { "user": "johndoe",
          "access_key": "QFAMEDSJP5DEKJO0DDXY",
          "secret_key": "iaSFLDVvDdQt6lkNzHyW4fPLZugBAI1g17LO0+87"}],
  "swift_keys": [
        { "user": "johndoe:swift",
          "secret_key": "E9T2rUZNu2gxUjcwUBO8n\/Ev4KX6\/GprEuH4qhu1"}]}

Note that while there is now a key, the permissions are gone. Is this meant to be a security feature of sorts, or is this a bug? subuser modify can obviously restore the permissions, but it seems to be less than desirable to have to do that. Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
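To demonstrate the regression (or test for it after a fix), one can diff the reported permissions around the key creation -- a sketch assuming radosgw-admin is in $PATH and emits the JSON shown above:

    # Capture subuser permissions before and after "key create" and
    # flag the permission reset described above. Assumes the JSON
    # layout shown in this mail.
    import json
    import subprocess

    def subuser_perms(uid):
        out = subprocess.check_output(
            ['radosgw-admin', 'user', 'info', '--uid=%s' % uid])
        info = json.loads(out)
        return dict((s['id'], s['permissions']) for s in info['subusers'])

    before = subuser_perms('johndoe')
    subprocess.check_call(['radosgw-admin', 'key', 'create',
                           '--subuser=johndoe:swift', '--key-type=swift'])
    after = subuser_perms('johndoe')

    if before != after:
        print('permissions changed: %r -> %r' % (before, after))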
Re: Ceph as a NOVA-INST-DIR/instances/ storage backend
On Mon, Jun 25, 2012 at 6:03 PM, Tommi Virtanen t...@inktank.com wrote: On Sat, Jun 23, 2012 at 11:42 AM, Igor Laskovy igor.lask...@gmail.com wrote: Hi all from hot Kiev)) Does anybody use Ceph as a backend storage for NOVA-INST-DIR/instances/ ? Yes. http://www.sebastien-han.fr/blog/2012/06/10/introducing-ceph-to-openstack/ Look at the Live Migration with CephFS part. Is it in production use? Production use would require CephFS to be production ready, which at this point it isn't. Live migration is still possible? Yes. I kindly ask any advice of best practices point of view. That's the shared NFS mount style for storing images, right? While you could use the Ceph Distributed File System for that, there's a better answer (for both Nova and Glance): RBD. ... which sort of goes hand-in-hand with boot from volume, which was just recently documented in the Nova admin guide, so you may want to take a look: http://docs.openstack.org/trunk/openstack-compute/admin/content/boot-from-volume.html That being said, volume attachment persistence across live migrations hasn't always been stellar in Nova, and I'm not 100% sure how well trunk currently deals with that. Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Openstack] Ceph/OpenStack integration on Ubuntu precise: horribly broken, or am I doing something wrong?
On Fri, Jun 22, 2012 at 7:43 AM, James Page james.p...@ubuntu.com wrote: You can type faster than I can... I'm working on getting this resolved in the current dev release of Ubuntu in the next few days after which it will go through the normal SRU process for Ubuntu 12.04. Sweet, thanks! The SRU to resolve the install-ability of python-ceph has just completed verification and should be available in Ubuntu 12.04 updates in the next few hours (depending on which mirror you use). Excellent. Thanks a lot! Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rbd locking and handling broken clients
On Thu, Jun 14, 2012 at 1:41 AM, Greg Farnum g...@inktank.com wrote: On Wednesday, June 13, 2012 at 1:37 PM, Florian Haas wrote: Greg, My understanding of Ceph code internals is far too limited to comment on your specific points, but allow me to ask a naive question. Couldn't you be stealing a lot of ideas from SCSI-3 Persistent Reservations? If you had server-side (OSD) persistence of information of the "this device is in use by X" type (where anything other than X would get an I/O error when attempting to access data), and you had a manual, authenticated override akin to SCSI PR preemption, plus key registration/exchange for that authentication, then you would at least have to have the combination of a misbehaving OSD plus a malicious client for data corruption. A non-malicious but just broken client probably won't do. Clearly I may be totally misguided, as Ceph is fundamentally decentralized and SCSI isn't, but if PR-ish behavior comes even close to what you're looking for, grabbing those ideas would look better to me than designing your own wheel. Yeah, the problem here is exactly that Ceph (and RBD) are fundamentally decentralized. :) True, but as a general comment I do posit that to say "X is not exactly like Y, thus nothing applicable to X applies to Y" is a fallacy. :) I'm not familiar with the SCSI PR mechanism either, but it looks to me like it deals in entirely local information — the equivalent with RBD would require performing a locking operation on every object in the RBD image before you accessed it. We could do that, but then opening an image would take time linear in its size… :( Well you would make this configurable and optional, wouldn't you? Kind of like no-one forces people to use PRs on SCSI LUs. When this is being used, however, taking a performance hit on open sounds like a reasonable price to pay for not shredding data. TANSTAAFL. Again, this is just my poorly informed two cents. :) Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Building documentation offline?
Hi everyone, it occurred to me this afternoon that admin/build-doc unconditionally tries to fetch some updates from GitHub, which breaks building docs when you don't have a network connection. Would there be any reasonably simple way to make it support offline build, provided the various pip bits have previously been downloaded and installed? Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rbd locking and handling broken clients
Greg, My understanding of Ceph code internals is far too limited to comment on your specific points, but allow me to ask a naive question. Couldn't you be stealing a lot of ideas from SCSI-3 Persistent Reservations? If you had server-side (OSD) persistence of information of the "this device is in use by X" type (where anything other than X would get an I/O error when attempting to access data), and you had a manual, authenticated override akin to SCSI PR preemption, plus key registration/exchange for that authentication, then you would at least have to have the combination of a misbehaving OSD plus a malicious client for data corruption. A non-malicious but just broken client probably won't do. Clearly I may be totally misguided, as Ceph is fundamentally decentralized and SCSI isn't, but if PR-ish behavior comes even close to what you're looking for, grabbing those ideas would look better to me than designing your own wheel. Just my $.02, of course. Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Radosgw installation and administration docs
Hi everyone, I have a long flight ahead of me later this week and plan to be spending some time on http://ceph.com/docs/master/ops/radosgw/ -- which currently happens to be a bit, ahem, sparse. There's currently not a lot of documentation on radosgw, and some of it is inconsistent, so if one of the devs could answer the following questions, I can put them in a more comprehensive document that should make radosgw easier to set up and run.

1. Apache rewrite rule

Is the Apache configuration example listed in the man page correct and authoritative? Specifically, it seems unclear to me whether the rewrite engine rule:

RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*) /s3gw.fcgi?page=$1&params=$2&%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]

... is expected to work only for compatibility with S3 clients, or whether this rewrite rule is also for Swift clients.

2. FastCGI wrapper

The radosgw man page says it should be exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway, whereas the Wiki (http://ceph.com/wiki/RADOS_Gateway) omits the -n option. I didn't get it to work without the -n option, so is it safe to say that it is required?

3. Apache/radosgw daemon/FastCGI wrapper interaction

Is it safe to say that we always need all three of these? The man page indicates so, the Wiki makes no mention of the daemon started by the init script.

4. FastCGI configuration directives

The man page mentions:

FastCgiExternalServer /var/www/s3gw.fcgi -socket /tmp/radosgw.sock

The Wiki says:

FastCgiWrapper /var/www/s3gw.fcgi
FastCgiServer /usr/bin/radosgw

https://github.com/ceph/teuthology/blob/master/teuthology/task/apache.conf (which was mentioned as an additional reference on IRC at some point) says:

FastCgiIPCDir /tmp/cephtest/apache/tmp/fastcgi_sock
FastCgiExternalServer /tmp/cephtest/apache/htdocs/rgw.fcgi -socket rgw_sock

Which of these is required/preferred? -socket option or not? Wrapper, Server or ExternalServer? IPCDir?

5. Logging

What's the preferred way of adding debug logging for radosgw? https://github.com/ceph/teuthology/blob/master/teuthology/task/apache.conf mentions:

SetEnv RGW_LOG_LEVEL 20
SetEnv RGW_PRINT_CONTINUE yes
SetEnv RGW_SHOULD_LOG yes

... but it's unclear to me whether this is still current (I found no trace of those envars in the source, but maybe I was looking in the wrong place). https://github.com/ceph/ceph/commit/452b1248a68f743ad55641722da80e3fd5ad2ae9 touched the debug rgw option. If that is the preferred way of doing things now, where should you set this? In ceph.conf, in the [client.radosgw.name] section? Also, for each of these, where would the logging output end up? /var/log/ceph? Apache error log? If so, only if the Apache LogLevel is more verbose than info? Syslog?

6. Swift API: Keys

Is it correct to assume that for any Swift client to work, we must set a Swift key for the user, like so?

radosgw-admin key create --key-type=swift --uid=<user>

If so, is the secret_key that that creates for the user:

"swift_keys": [
      { "user": "<user>",
        "secret_key": "<longbase64hash>"}]}

... the same key that the swift command line client expects to be set with the -K option?

7. Swift API: swift user name

When we call swift -U <user>, is that the verbatim user_id that we've defined with radosgw-admin user create --uid=<user_id>? Or do we need to set a prefix? Or define a separate Swift user ID?

8. Swift API: authentication version

When radosgw acts as the auth server for a Swift request, is it correct to say that only v1.0 Swift authentication is supported, not v2.0?

9. Swift API: authentication URL

What's the correct Swift authentication URL for swift -A <url>? It seems like it's http://<rgw hostname>:<port>/auth, but confirmation would help.

10. radosgw OpenStack user information

From the radosgw-admin man page:

--os-user=group:name  The OpenStack user (only needed for use with OpenStack)
--os-secret=key       The OpenStack key

What's this meant to be used for? Keystone authentication? If so, is there anything else that needs to be done for Keystone to work with this, such as add an endpoint URI? Please feel free to point me to existing documentation where it exists. Your help is much appreciated. Thanks! Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Radosgw installation and administration docs
Hi Yehuda, thanks, that resolved a lot of questions for me. A few follow-up comments below: On 06/12/12 18:47, Yehuda Sadeh wrote: On Tue, Jun 12, 2012 at 3:44 AM, Florian Haas flor...@hastexo.com wrote: Hi everyone, I have a long flight ahead of me later this week and plan to be spending some time on http://ceph.com/docs/master/ops/radosgw/ -- which currently happens to be a bit, ahem, sparse. There's currently not a lot of documentation on radosgw, and some of it is inconsistent, so if one of the devs could answer the following questions, I can put them in a more comprehensive document that should make radosgw easier to set up and run.

1. Apache rewrite rule

Is the Apache configuration example listed in the man page correct and authoritative? Specifically, it seems unclear to me whether the rewrite engine rule:

RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*) /s3gw.fcgi?page=$1&params=$2&%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]

We currently use a slightly different rule:

RewriteRule ^/(.*) /radosgw.fcgi?params=$1&%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]

Could you explain what happened to page?

... is expected to work only for compatibility with S3 clients, or whether this rewrite rule is also for Swift clients.

Not really needed for Swift. It's required for passing in the HTTP_AUTHORIZATION env, however, Swift uses a different field which is not filtered out by apache. OK.

2. FastCGI wrapper

The radosgw man page says it should be exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway, whereas the Wiki (http://ceph.com/wiki/RADOS_Gateway) omits the -n option. I didn't get it to work without the -n option, so is it safe to say that it is required?

-n is required for specifying the ceph user that the gateway would use. Without it, it'd use client.admin, which is the default. OK.

3. Apache/radosgw daemon/FastCGI wrapper interaction

Is it safe to say that we always need all three of these? The man page indicates so, the Wiki makes no mention of the daemon started by the init script.

The wrapper is not needed if not using apache for spawning the radosgw processes. E.g., when using the FastCgiExternalServer param:

FastCgiExternalServer /var/www/radosgw.fcgi -socket /var/run/ceph/radosgw.client.radosgw

4. FastCGI configuration directives

The man page mentions:

FastCgiExternalServer /var/www/s3gw.fcgi -socket /tmp/radosgw.sock

The Wiki says:

FastCgiWrapper /var/www/s3gw.fcgi
FastCgiServer /usr/bin/radosgw

https://github.com/ceph/teuthology/blob/master/teuthology/task/apache.conf (which was mentioned as an additional reference on IRC at some point) says:

FastCgiIPCDir /tmp/cephtest/apache/tmp/fastcgi_sock
FastCgiExternalServer /tmp/cephtest/apache/htdocs/rgw.fcgi -socket rgw_sock

Which of these is required/preferred? -socket option or not? Wrapper, Server or ExternalServer? IPCDir?

Either one is required. We prefer using the external server option. We found out that letting apache (or the fastcgi process manager) manage them was sub-optimal and introduced high latencies. OK, I'm sticking to FastCgiExternalServer then.

5. Logging

What's the preferred way of adding debug logging for radosgw?
https://github.com/ceph/teuthology/blob/master/teuthology/task/apache.conf mentions:

SetEnv RGW_LOG_LEVEL 20
SetEnv RGW_PRINT_CONTINUE yes
SetEnv RGW_SHOULD_LOG yes

All are obsolete and defunct, and each has a corresponding ceph.conf option:

debug rgw = 20
rgw print continue = true
rgw should log = true

the latter will be replaced soon by:

rgw enable usage log = true

Note that only the 'debug rgw' option is really related to debug logs. The 'rgw print continue' option is a badly named option to control the use of 100-continue (should the radosgw 'print' -- as in FCGX_FPrintF -- the 100-continue when it should?). This can only work with a modified mod_fastcgi that supports that. The 'rgw should log' option sets whether we log each user operation to the dedicated pool (so that it can be analyzed later on for billing, etc.) Yep. I was really only looking for what debug rgw does, and got confused by the FastCGI envars. ... but it's unclear to me whether this is still current (I found no trace of those envars in the source, but maybe I was looking in the wrong place). https://github.com/ceph/ceph/commit/452b1248a68f743ad55641722da80e3fd5ad2ae9 touched the debug rgw option. If that is the preferred way of doing things now, where should you set this? In ceph.conf, in the [client.radosgw.name] section? Either under the global section, or [client], or [client.radosgw.name]. Depends on how you organize your conf. OK. Also, for each of these, where would the logging output end up? /var/log/ceph? Apache error log? If so, only if the Apache LogLevel is more verbose than info? Syslog? The debug log would end up wherever you specified
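Pulling the above together, a minimal vhost sketch (server name, paths and socket location are illustrative placeholders, not canonical values) for the external-server setup discussed in this thread:

    # Hypothetical radosgw vhost, combining the pieces from this
    # thread: apache rewrites into a FastCGI stub, and radosgw runs
    # externally (started by its init script), reachable via a socket.
    <VirtualHost *:80>
        ServerName rgw.example.com
        DocumentRoot /var/www

        RewriteEngine On
        RewriteRule ^/(.*) /radosgw.fcgi?params=$1&%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]

        FastCgiExternalServer /var/www/radosgw.fcgi -socket /var/run/ceph/radosgw.client.radosgw
    </VirtualHost>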
radosgw-admin: mildly confusing man page and usage message
Hi, just noticed that radosgw-admin comes with a bit of confusing content in its man page and usage message:

EXAMPLES
Generate a new user:
$ radosgw-admin user gen --display-name="johnny rotten" --email=joh...@rotten.com

As far as I remember user gen is gone, and it's now user create. However:

radosgw-admin user create --display-name=test --email=test@demo
user_id was not specified, aborting

... is followed by a usage message that doesn't mention user_id anywhere (the option string is --uid). So conceivably the example could also use a mention of --uid. Also, is there a way to retrieve the next available user_id or just tell radosgw-admin to use max(user_id)+1? If one of the Ceph guys could provide a quick comment on this, I can send a patch to the man page RST. Thanks. Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: radosgw-admin: mildly confusing man page and usage message
On 06/11/12 23:39, Yehuda Sadeh wrote: If one of the Ceph guys could provide a quick comment on this, I can send a patch to the man page RST. Thanks. Minimum required to create a user: radosgw-admin user create --uid=user id --display-name=display name The user id is actually a user 'account' name, not necessarily a numeric value. The email param is optional. Thanks. https://github.com/ceph/ceph/pull/13 Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Openstack] Ceph + OpenStack [HOW-TO]
On 06/10/12 23:32, Sébastien Han wrote: Hello everyone, I recently posted on my website an introduction to ceph and the integration of Ceph in OpenStack. It could be really helpful since the OpenStack documentation has not dealt with it so far. Feel free to comment, express your opinions and share your personal experience about both of them. This is mighty comprehensive. :) Thanks! Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ceph/OpenStack integration on Ubuntu precise: horribly broken, or am I doing something wrong?
Hi everyone, apologies for the cross-post, and not sure if this is new information. I did do a cursory check of both list archives and didn't find anything pertinent, so here goes. Feel free to point me to an existing thread if I'm merely regurgitating something that's already known. Either I'm doing something terribly wrong, or the current state of OpenStack/Ceph integration in Ubuntu precise is somewhat suboptimal. At least as far as sticking to packages available in Ubuntu repos is concerned. Steps to reproduce:

1. On an installation using 12.04 with current updates, create a RADOS pool.

2. Configure glance to use rbd as its backend storage.

3. Attempt to upload an image.

glance add name="Ubuntu 12.04 cloudimg amd64" is_public=true container_format=ovf disk_format=qcow2 < precise-server-cloudimg-amd64-disk1.img
Uploading image 'Ubuntu 12.04 cloudimg amd64'
==[ 92%] 222.109521M/s, ETA 0h 0m 0s
Failed to add image. Got error: Data supplied was not valid. Details: 400 Bad Request The server could not comply with the request since it is either malformed or otherwise incorrect. Error uploading image: (NameError): global name 'rados' is not defined

Digging around in glance/store/rbd.py yields this:

try:
    import rados
    import rbd
except ImportError:
    pass

I will go so far as to say the error handling here could be improved -- however, the Swift store implementation seems to do the same. Now, in Ubuntu rados.py and rbd.py ship in the python-ceph package, which is available in universe, but has an unresolvable dependency on librgw1. librgw1 apparently had been in Ubuntu for quite a while, but was dropped just before the Essex release: https://launchpad.net/ubuntu/precise/amd64/radosgw ... and if I read http://changelogs.ubuntu.com/changelogs/pool/main/c/ceph/ceph_0.41-1ubuntu2/changelog correctly, then the rationale for it was to drop radosgw since libfcgi is not in main and the code may not be suitable for LTS. (I wonder why this wasn't factored out into a separate package then, as was apparently the case for ceph-mds). Does this really mean that radosgw functionality was dropped from Ubuntu because it wasn't considered ready for main, and now it completely breaks a package in universe that's essential for Ceph/OpenStack integration? AFAICT the only thing that would be unaffected by this would be nova-volume (now cinder) which rather than using the Ceph Python bindings just calls out to the rados and rbd binaries. But both the glance RBD store and the RADOS Swift and S3 frontends (via radosgw) would be affected. This can of course all be fixed by using upstream packages from the Ceph guys (thanks to Sébastien Han for pointing that out to me). Anyone able to confirm or refute these findings? Should there be an Ubuntu bug for this? If so, against what package? Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
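For what it's worth, the failure mode could be made friendlier without changing the packaging at all -- a sketch of the idea (not the actual glance code) that keeps the ImportError around and reports it at use time instead of the bare NameError above:

    # Sketch: remember why the bindings failed to import, and surface
    # that when the store is actually used, rather than a NameError.
    try:
        import rados
        import rbd
        IMPORT_ERROR = None
    except ImportError as e:
        rados = rbd = None
        IMPORT_ERROR = e

    def require_rbd_bindings():
        if rados is None or rbd is None:
            raise RuntimeError(
                'the rbd store is configured, but the ceph python '
                'bindings are unusable (install python-ceph): %s'
                % IMPORT_ERROR)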
Re: ceph rbd crashes/stalls while random write 4k blocks
On Fri, May 25, 2012 at 8:47 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Am 24.05.2012 16:19, schrieb Florian Haas: On Thu, May 24, 2012 at 4:09 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Take a look at these to see if anything looks familiar: http://oss.sgi.com/bugzilla/show_bug.cgi?id=922 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498 http://oss.sgi.com/archives/xfs/2011-11/msg00400.html These are solved by using 3.0.20. ... or so Christoph says, but comment #4 in bug 922 seems to indicate otherwise. I'm sorry, you're absolutely right. BUT XFS had some regressions with xlog_grant_log_space since 2.6.28 which were fixed in 3.0.X by reverting back to a kernel thread instead of workers. I was working with Christoph and Dave on this problem and it took me nearly a whole month to track that down (git commit c7eead1e118fb7e34ee8f5063c3c090c054c3820). In this case (#922) it seems it is really related to a too small log. But I don't have a too small log in my ceph case ;-) Hmmm. So what's Chinner saying about this one? Should we move this discussion to an XFS list? Cheers, Florian -- Need help with High Availability? http://www.hastexo.com/now -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ceph rbd crashes/stalls while random write 4k blocks
Stefan, On 05/24/12 13:07, Stefan Priebe - Profihost AG wrote: Hi list, i'm still testing ceph rbd with kvm. Right now i'm testing a rbd block device within a network booted kvm. Sequential write/reads and random reads are fine. No problems so far. But when i trigger lots of 4k random writes all of them stall after short time and i get 0 iops and 0 transfer. used command:

fio --filename=/dev/vda --direct=1 --rw=randwrite --bs=4k --size=20G --numjobs=50 --runtime=30 --group_reporting --name=file1

Then some time later i see this call trace:

INFO: task ceph-osd:3065 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ceph-osd      D 8803b0e61d88     0  3065      1 0x0004
 88032f3ab7f8 0086 8803bffdac08 8803
 8803b0e61820 00010800 88032f3abfd8 88032f3aa010
 88032f3abfd8 00010800 81a0b020 8803b0e61820
Call Trace:
[815e0e1a] schedule+0x3a/0x60
[815e127d] schedule_timeout+0x1fd/0x2e0
[812696c4] ? xfs_iext_bno_to_ext+0x84/0x160
[81074db1] ? down_trylock+0x31/0x50
[812696c4] ? xfs_iext_bno_to_ext+0x84/0x160
[815e20b9] __down+0x69/0xb0
[8128c4a6] ? _xfs_buf_find+0xf6/0x280
[81074e6b] down+0x3b/0x50

sorry I'm coming a bit late to the various threads you've posted recently, but on this particular issue: what kernel are your OSDs running on, and do these hung tasks occur if you're using a local filesystem other than XFS? As of late XFS has occasionally been producing seemingly random kernel hangs. Your call trace doesn't have the signature entries from xfssyncd that identify a particular problem that I've been struggling with lately, but you just might be affected by some other effect of the same root issue. Take a look at these to see if anything looks familiar: http://oss.sgi.com/bugzilla/show_bug.cgi?id=922 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498 http://oss.sgi.com/archives/xfs/2011-11/msg00400.html Not sure if this helps at all; just thought I might pitch that in. Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ceph rbd crashes/stalls while random write 4k blocks
On Thu, May 24, 2012 at 4:09 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Take a look at these to see if anything looks familiar: http://oss.sgi.com/bugzilla/show_bug.cgi?id=922 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498 http://oss.sgi.com/archives/xfs/2011-11/msg00400.html These are solved by using 3.0.20. ... or so Christoph says, but comment #4 in bug 922 seems to indicate otherwise. Florian -- Need help with High Availability? http://www.hastexo.com/now -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] doc: fix snapshot creation/deletion syntax in rbd man page (trivial)
Creating a snapshot requires using rbd snap create, as opposed to just rbd create. Also for purposes of clarification, add note that removing a snapshot similarly requires rbd snap rm. Thanks to Josh Durgin for the explanation on IRC.
---
 man/rbd.8 |   10 +-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/man/rbd.8 b/man/rbd.8
index 0278137..b59c2f6 100644
--- a/man/rbd.8
+++ b/man/rbd.8
@@ -194,7 +194,15 @@ To create a new snapshot:
 .sp
 .nf
 .ft C
-rbd create mypool/myimage@mysnap
+rbd snap create mypool/myimage@mysnap
+.ft P
+.fi
+.sp
+To delete a snapshot:
+.sp
+.nf
+.ft C
+rbd snap rm mypool/myimage@mysnap
 .ft P
 .fi
 .sp
--
1.7.5.4
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Nova on RBD Device
On Tue, Feb 7, 2012 at 8:01 PM, Mandell Degerness mand...@pistoncloud.com wrote: Can anyone point me in the right direction for setting up Nova so that it allocates disk space on RBD device(s) rather than on local disk as defined in the --instances_path flag? I've already got nova-volume working with RBD. I suspect that I need to modify _cache_image and _create_image in nova/virt/libvirt/connection.py. Hmmm. Wouldn't that most likely be a question for the openstack list? http://wiki.openstack.org/MailingLists for details on how to subscribe, if you haven't already. Cheers, Florian -- Need help with High Availability? http://www.hastexo.com/now -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: interesting point on btrfs, xfs, ext4
On Wed, Jan 25, 2012 at 10:15 AM, Tomasz Paszkowski ss7...@gmail.com wrote: http://www.youtube.com/watch?v=FegjLbCnoBw I sat in that talk at LCA and can highly recommend it. Jon Corbet wrote a piece on LWN about it too (currently subscribers only): https://lwn.net/Articles/476263/ Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] Add resource agents to debian build, trivial CP error
Hi, please consider two follow-up patches to the OCF resource agents: the first adds them to the Debian build, as a separate package ceph-resource-agents that depends on resource-agents, the second fixes a trivial (and embarrassing, however harmless) cut and paste error. Thanks! Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] debian: build ceph-resource-agents
---
 debian/ceph-resource-agents.install |    1 +
 debian/control                      |   13 +
 debian/rules                        |    2 ++
 3 files changed, 16 insertions(+), 0 deletions(-)
 create mode 100644 debian/ceph-resource-agents.install

diff --git a/debian/ceph-resource-agents.install b/debian/ceph-resource-agents.install
new file mode 100644
index 0000000..30843f6
--- /dev/null
+++ b/debian/ceph-resource-agents.install
@@ -0,0 +1 @@
+usr/lib/ocf/resource.d/ceph/*

diff --git a/debian/control b/debian/control
index e8c4d30..0f57ad3 100644
--- a/debian/control
+++ b/debian/control
@@ -112,6 +112,19 @@ Description: debugging symbols for ceph-common
 .
 This package contains the debugging symbols for ceph-common.
 
+Package: ceph-resource-agents
+Architecture: linux-any
+Recommends: pacemaker
+Priority: extra
+Depends: ceph (= ${binary:Version}), ${misc:Depends}, resource-agents
+Description: OCF-compliant resource agents for Ceph
+ Ceph is a distributed storage and network file system designed to provide
+ excellent performance, reliability, and scalability.
+ .
+ This package contains the resource agents (RAs) which integrate
+ Ceph with OCF-compliant cluster resource managers,
+ such as Pacemaker.
+
 Package: librados2
 Conflicts: librados, librados1
 Replaces: librados, librados1

diff --git a/debian/rules b/debian/rules
index 4f3fe62..0bc594a 100755
--- a/debian/rules
+++ b/debian/rules
@@ -20,6 +20,8 @@ endif
 
 export DEB_HOST_ARCH ?= $(shell dpkg-architecture -qDEB_HOST_ARCH)
 
+extraopts += --with-ocf
+
 ifeq ($(DEB_HOST_ARCH), armel)
   # armel supports ARMv4t or above instructions sets.
   # libatomic-ops is only usable with Ceph for ARMv6 or above.
--
1.7.5.4
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] Add Ceph integration with OCF-compliant HA resource managers
Hi everyone, please consider reviewing the following patches. These add OCF-compliant cluster resource agent functionality to Ceph, allowing MDS, OSD and MON to run as cluster resources under compliant managers (such as Pacemaker, http://www.clusterlabs.org). This new stuff does not build nor install by default; you must enable it with the --with-ocf flag. That same flag maps to a new RPM build conditional (--with ocf) which rolls the resource agents into a separate subpackage, ceph-resource-agents. These patches require the tiny patch to the init script that I posted here a few days ago. Just in case you're interested, all the above changes (including the init script patch) since commit e18b1c9734e88e3b779ba2d70cdd54f8fb94743d: rgw: removing swift user index when removing user (2011-12-28 17:00:19 -0800) are also available in my GitHub repo at:

git://github.com/fghaas/ceph ocf-ra

Florian Haas (3):
      init script: be LSB compliant for exit code on status
      Add OCF-compliant resource agent for Ceph daemons
      Spec: conditionally build ceph-resource-agents package

 ceph.spec.in        |   22 ++
 configure.ac        |    8 ++
 src/Makefile.am     |    4 +-
 src/init-ceph.in    |    7 ++-
 src/ocf/Makefile.am |   23 +++
 src/ocf/ceph.in     |  177 +++
 6 files changed, 238 insertions(+), 3 deletions(-)
 create mode 100644 src/ocf/Makefile.am
 create mode 100644 src/ocf/ceph.in

Hope this is useful. All feedback is much appreciated. Thanks! Cheers, Florian -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
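For anyone wondering what this buys you in practice: once built and installed, the agents show up as ocf:ceph:mon, ocf:ceph:osd and ocf:ceph:mds, and a Pacemaker configuration along these lines (a crm shell sketch; resource names and the clone layout are illustrative, the operation timeouts match the actions advertised by the RA) would have Pacemaker monitor and restart the daemons:

    # Hypothetical Pacemaker resources using the new agents:
    primitive p_ceph_mon ocf:ceph:mon \
        op start timeout="20s" \
        op stop timeout="20s" \
        op monitor interval="10s" timeout="20s"
    primitive p_ceph_osd ocf:ceph:osd \
        op monitor interval="10s" timeout="20s"
    # run the OSD resource on every node that carries OSDs:
    clone cl_ceph_osd p_ceph_osd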
[PATCH 2/2] Spec: conditionally build ceph-resource-agents package
Put OCF resource agents in a separate subpackage, to be enabled with a separate build conditional (--with ocf). Make the subpackage depend on the resource-agents package, which provides the ocf-shellfuncs library that the Ceph RAs use.

Signed-off-by: Florian Haas flor...@hastexo.com
---
 ceph.spec.in |   22 ++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/ceph.spec.in b/ceph.spec.in
index b0f3c3a..3950fd1 100644
--- a/ceph.spec.in
+++ b/ceph.spec.in
@@ -1,5 +1,6 @@
 %define with_gtk2 %{?_with_gtk2: 1} %{!?_with_gtk2: 0}
+%bcond_with ocf
 # it seems there is no usable tcmalloc rpm for x86_64; parts of
 # google-perftools don't compile on x86_64, and apparently the
 # decision was to not build the package at all, even if tcmalloc
@@ -130,6 +131,19 @@ gcephtool is a graphical monitor for the clusters running the Ceph
 distributed file system.
 %endif
 
+%if %{with ocf}
+%package resource-agents
+Summary:	OCF-compliant resource agents for Ceph daemons
+Group:		System Environment/Base
+License:	LGPLv2
+Requires:	%{name} = %{version}
+Requires:	resource-agents
+%description resource-agents
+Resource agents for monitoring and managing Ceph daemons
+under Open Cluster Framework (OCF) compliant resource
+managers such as Pacemaker.
+%endif
+
 %package -n librados2
 Summary:	RADOS distributed object store client library
 Group:		System Environment/Libraries
@@ -211,6 +225,7 @@ MY_CONF_OPT="$MY_CONF_OPT --without-gtk2"
 		--docdir=%{_docdir}/ceph \
 		--without-hadoop \
 		$MY_CONF_OPT \
+		%{?_with_ocf} \
 		%{?with_tcmalloc:--with-tcmalloc} %{!?with_tcmalloc:--without-tcmalloc}
 
 # fix bug in specific version of libedit-devel
@@ -415,6 +430,13 @@ fi
 %endif
 
 #
+%if %{with ocf}
+%files resource-agents
+%defattr(0755,root,root,-)
+/usr/lib/ocf/resource.d/%{name}/*
+%endif
+
+#
 %files -n librados2
 %defattr(-,root,root,-)
 %{_libdir}/librados.so.*
--
1.7.5.4
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] Add OCF-compliant resource agent for Ceph daemons
Add a wrapper around the ceph init script that makes MDS, OSD and MON configurable as Open Cluster Framework (OCF) compliant cluster resources. Allows Ceph daemons to tie in with cluster resource managers that support OCF, such as Pacemaker (http://www.clusterlabs.org). Disabled by default, configure --with-ocf to enable.

Signed-off-by: Florian Haas flor...@hastexo.com
---
 configure.ac        |    8 ++
 src/Makefile.am     |    4 +-
 src/ocf/Makefile.am |   23 +++
 src/ocf/ceph.in     |  177 +++
 4 files changed, 210 insertions(+), 2 deletions(-)
 create mode 100644 src/ocf/Makefile.am
 create mode 100644 src/ocf/ceph.in

diff --git a/configure.ac b/configure.ac
index 60f998c..e334a24 100644
--- a/configure.ac
+++ b/configure.ac
@@ -277,6 +277,12 @@ AM_CONDITIONAL(WITH_LIBATOMIC, [test $HAVE_ATOMIC_OPS = 1])
 #[],
 #[with_newsyn=no])
 
+AC_ARG_WITH([ocf],
+[AS_HELP_STRING([--with-ocf], [build OCF-compliant cluster resource agent])],
+,
+[with_ocf=no])
+AM_CONDITIONAL(WITH_OCF, [ test $with_ocf = yes ])
+
 # Checks for header files.
 AC_HEADER_DIRENT
 AC_HEADER_STDC
@@ -375,6 +381,8 @@ AM_PATH_PYTHON([2.4],
 AC_CONFIG_HEADERS([src/acconfig.h])
 AC_CONFIG_FILES([Makefile
 	src/Makefile
+	src/ocf/Makefile
+	src/ocf/ceph
 	man/Makefile
 	ceph.spec])
 AC_OUTPUT

diff --git a/src/Makefile.am b/src/Makefile.am
index 748425e..8026e17 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -1,6 +1,6 @@
 AUTOMAKE_OPTIONS = gnu
-SUBDIRS =
-DIST_SUBDIRS = gtest
+SUBDIRS = ocf
+DIST_SUBDIRS = gtest ocf
 CLEANFILES =
 bin_PROGRAMS =
 # like bin_PROGRAMS, but these targets are only built for debug builds

diff --git a/src/ocf/Makefile.am b/src/ocf/Makefile.am
new file mode 100644
index 0000000..9be40ec
--- /dev/null
+++ b/src/ocf/Makefile.am
@@ -0,0 +1,23 @@
+EXTRA_DIST = ceph.in Makefile.in
+
+if WITH_OCF
+# The root of the OCF resource agent hierarchy
+# Per the OCF standard, it's always lib,
+# not lib64 (even on 64-bit platforms).
+ocfdir = $(prefix)/lib/ocf
+
+# The ceph provider directory
+radir = $(ocfdir)/resource.d/$(PACKAGE_NAME)
+
+ra_SCRIPTS = ceph
+
+install-data-hook:
+	$(LN_S) ceph $(DESTDIR)$(radir)/osd
+	$(LN_S) ceph $(DESTDIR)$(radir)/mds
+	$(LN_S) ceph $(DESTDIR)$(radir)/mon
+
+uninstall-hook:
+	rm -f $(DESTDIR)$(radir)/osd
+	rm -f $(DESTDIR)$(radir)/mds
+	rm -f $(DESTDIR)$(radir)/mon
+endif

diff --git a/src/ocf/ceph.in b/src/ocf/ceph.in
new file mode 100644
index 0000000..9db1bc9
--- /dev/null
+++ b/src/ocf/ceph.in
@@ -0,0 +1,177 @@
+#!/bin/sh
+
+# Initialization:
+: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
+. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
+
+# Convenience variables
+# When sysconfdir isn't passed in as a configure flag,
+# it's defined in terms of prefix
+prefix=@prefix@
+CEPH_INIT=@sysconfdir@/init.d/ceph
+
+ceph_meta_data() {
+    local longdesc
+    local shortdesc
+    case $__SCRIPT_NAME in
+	osd)
+	    longdesc="Wraps the ceph init script to provide an OCF resource agent that manages and monitors the Ceph OSD service."
+	    longdesc="Manages a Ceph OSD instance."
+	    ;;
+	mds)
+	    longdesc="Wraps the ceph init script to provide an OCF resource agent that manages and monitors the Ceph MDS service."
+	    longdesc="Manages a Ceph MDS instance."
+	    ;;
+	mon)
+	    longdesc="Wraps the ceph init script to provide an OCF resource agent that manages and monitors the Ceph MON service."
+	    longdesc="Manages a Ceph MON instance."
+	    ;;
+    esac
+
+    cat <<EOF
+<?xml version="1.0"?>
+<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
+<resource-agent name="${__SCRIPT_NAME}" version="0.1">
+  <version>0.1</version>
+  <longdesc lang="en">${longdesc}</longdesc>
+  <shortdesc lang="en">${shortdesc}</shortdesc>
+  <parameters/>
+  <actions>
+    <action name="start"        timeout="20" />
+    <action name="stop"         timeout="20" />
+    <action name="monitor"      timeout="20"
+                                interval="10" />
+    <action name="meta-data"    timeout="5" />
+    <action name="validate-all" timeout="20" />
+  </actions>
+</resource-agent>
+EOF
+}
+
+ceph_action() {
+    local init_action
+    init_action=$1
+
+    case ${__SCRIPT_NAME} in
+	osd|mds|mon)
+	    ocf_run $CEPH_INIT $init_action ${__SCRIPT_NAME}
+	    ;;
+	*)
+	    ocf_run $CEPH_INIT $init_action
+	    ;;
+    esac
+}
+
+ceph_validate_all() {
+    # Do we have the ceph init script?
+    check_binary @sysconfdir@/init.d/ceph
+
+    # Do we have a configuration file?
+    [ -e @sysconfdir@/ceph/ceph.conf ] || exit $OCF_ERR_INSTALLED
+}
+
+ceph_monitor() {
+    local rc
+
+    ceph_action status
+
+    # 0: running, and fully caught up with master
+    # 3: gracefully stopped
+    # any other: error
+    case $? in
+	0)
+	    rc=$OCF_SUCCESS
+	    ocf_log debug "Resource is running"
Trivial patch to fix init script LSB compliance
Hi everyone, please consider merging the following trivial patch that makes the ceph init script return the proper LSB exit code (3) for the status action if the service is gracefully stopped, and only return 1 (as before) if the service has died and left its PID file hanging around. Who cares about the exit code? Pacemaker does (http://www.clusterlabs.org). Pacemaker is a high-availability resource manager that can be used for monitoring and recovering resources in-place, and as per a brief discussion in #ceph that I had with Greg before the holidays, no such in-place recovery currently exists within Ceph itself. Integrating init script based services with Pacemaker is trivial if the script complies with the exit codes that the LSB spec specifies. Feedback on this is much appreciated. Hope this is useful. Cheers, Florian [PATCH] init script: be LSB compliant for exit code on status -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
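For reference, the LSB convention the patch implements for the status action: exit 0 when the service runs, 1 when it died leaving a stale PID file, 3 when it was stopped cleanly. A standalone sketch of that logic (illustrative Python, not the actual init script patch):

    # LSB "status" exit codes: 0 = running, 1 = dead but PID file
    # remains, 3 = not running. Pacemaker needs exactly this
    # distinction to tell "cleanly stopped" from "failed".
    import os
    import sys

    def lsb_status(pidfile):
        if not os.path.exists(pidfile):
            return 3                      # gracefully stopped
        with open(pidfile) as f:
            pid = int(f.read().strip())
        try:
            os.kill(pid, 0)               # signal 0 just probes the PID
            return 0                      # running
        except OSError:
            return 1                      # died, stale PID file left over

    if __name__ == '__main__':
        sys.exit(lsb_status(sys.argv[1]))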