Re: [ceph-users] the state of cephfs in giant

2014-10-30 Thread Florian Haas
Hi Sage,

sorry to be late to this thread; I just caught this one as I was
reviewing the Giant release notes. A few questions below:

On Mon, Oct 13, 2014 at 8:16 PM, Sage Weil s...@newdream.net wrote:
 [...]
 * ACLs: implemented, tested for kernel client. not implemented for
   ceph-fuse.
 [...]
 * samba VFS integration: implemented, limited test coverage.

ACLs are kind of a must-have feature for most Samba admins. The Samba
Ceph VFS builds directly on userspace libcephfs, using neither the
kernel client nor ceph-fuse, so I'm trying to understand whether ACLs
are available to Samba users or not. Can you clarify, please?

 * ganesha NFS integration: implemented, no test coverage.

I understood from a conversation I had with John in London that
flock() and fcntl() support had recently been added to ceph-fuse, can
this be expected to Just Work™ in Ganesha as well?

Also, can you make a general statement as to the stability of flock()
and fcntl() support in the kernel client and in libcephfs/ceph-fuse?
This too is particularly interesting for Samba admins who rely on
byte-range locking for Samba CTDB support.
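For what it's worth, the byte-range locking that Samba/CTDB depends on can be probed from userspace with plain POSIX fcntl record locks. A minimal sketch (to be meaningful, point it at a file on a CephFS mount; the temporary file here is just a placeholder):

```python
# Sketch: take and release an advisory byte-range (fcntl) lock, the same
# primitive CTDB uses. The temp file stands in for a file on a CephFS mount.
import fcntl
import os
import tempfile

def lock_byte_range(f, start, length, exclusive=True):
    """Take an advisory lock on [start, start+length) of an open file."""
    op = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    # fcntl.lockf() is implemented via fcntl(F_SETLKW) record locking
    fcntl.lockf(f, op, length, start, os.SEEK_SET)
    return (start, length)

def unlock_byte_range(f, start, length):
    fcntl.lockf(f, fcntl.LOCK_UN, length, start, os.SEEK_SET)

with tempfile.NamedTemporaryFile() as f:
    f.write(b"\0" * 1024)
    f.flush()
    locked = lock_byte_range(f, 0, 512)   # lock the first 512 bytes
    unlock_byte_range(f, 0, 512)
```

If this succeeds against the kernel client but fails (or blocks) via a libcephfs-based consumer, that would answer the question empirically.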

 * kernel NFS reexport: implemented. limited test coverage. no known
   issues.

In this scenario, is there any specific magic that the kernel client
does to avoid producing deadlocks under memory pressure? Or are you
referring to FUSE-mounted CephFS reexported via kernel NFS?

Cheers,
Florian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html


Re: v0.86 released (Giant release candidate)

2014-10-10 Thread Florian Haas
Hi Sage,

On Tue, Oct 7, 2014 at 9:20 PM, Sage Weil s...@inktank.com wrote:
 This is a release candidate for Giant, which will hopefully be out in
 another week or two (s v0.86).  We did a feature freeze about a month ago
 and since then have been doing only stabilization and bug fixing (and a
 handful of low-risk enhancements).  A fair bit of new functionality went
 into the final sprint, but it's baked for quite a while now and we're
 feeling pretty good about it.

 Major items include:

 * librados locking refactor to improve scaling and client performance
 * local recovery code (LRC) erasure code plugin to trade some
   additional storage overhead for improved recovery performance
 * LTTNG tracing framework, with initial tracepoints in librados,
   librbd, and the OSD FileStore backend
 * separate monitor audit log for all administrative commands
 * asynchronous monitor transaction commits to reduce the impact on
   monitor read requests while processing updates
 * low-level tool for working with individual OSD data stores for
   debugging, recovery, and testing
 * many MDS improvements (bug fixes, health reporting)

 There are still a handful of known bugs in this release, but nothing
 severe enough to prevent a release.  By and large we are pretty
 pleased with the stability and expect the final Giant release to be
 quite reliable.

 Please try this out on your non-production clusters for a preview.

Thanks for the summary! Since you mentioned MDS improvements, and just
so it doesn't get lost: as you hinted at in off-list email, please do
provide a write-up of CephFS features expected to work in Giant at the
time of the release (broken down by kernel client vs. Ceph-FUSE, if
necessary). Not in the sense that anyone is offering commercial
support, but in the sense of "if you use this limited feature set, we
are confident that it at least won't eat your data." I think that
would be beneficial to a large portion of the user base, and clear up
a lot of the present confusion about the maturity and stability of the
filesystem.

Cheers,
Florian


Re: [ceph-users] Status of snapshots in CephFS

2014-09-24 Thread Florian Haas
On Fri, Sep 19, 2014 at 5:25 PM, Sage Weil sw...@redhat.com wrote:
 On Fri, 19 Sep 2014, Florian Haas wrote:
 Hello everyone,

 Just thought I'd circle back on some discussions I've had with people
 earlier in the year:

 Shortly before firefly, snapshot support for CephFS clients was
 effectively disabled by default at the MDS level, and can only be
 enabled after accepting a scary warning that your filesystem is highly
 likely to break if snapshot support is enabled. Has any progress been
 made on this in the interim?

 With libcephfs support slowly maturing in Ganesha, the option of
 deploying a Ceph-backed userspace NFS server is becoming more
 attractive -- and it's probably a better use of resources than mapping
 a boatload of RBDs on an NFS head node and then exporting all the data
 from there. Recent snapshot trimming issues notwithstanding, RBD
 snapshot support is reasonably stable, but even so, making snapshot
 data available via NFS, that way, is rather ugly. In addition, the
 libcephfs/Ganesha approach would obviously include much better
 horizontal scalability.

 We haven't done any work on snapshot stability.  It is probably moderately
 stable if snapshots are only done at the root or at a consistent point in
 the hierarchy (as opposed to random directories), but there are still some
 basic problems that need to be resolved.  I would not suggest deploying
 this in production!  But some stress testing would as always be very
 welcome.  :)

OK, on a semi-related note: is there any reasonably current
authoritative list of features that are supported and unsupported in
either ceph-fuse or kernel cephfs, and if so, at what minimum version?

The most comprehensive overview that seems to be available is one from
Greg, which however is a year and a half old:

http://ceph.com/dev-notes/cephfs-mds-status-discussion/

 In addition, 
 https://github.com/nfs-ganesha/nfs-ganesha/wiki/ReleaseNotes_2.0#CEPH
 states:

 The current requirement to build and use the Ceph FSAL is a Ceph
 build environment which includes Ceph client enhancements staged on
 the libwipcephfs development branch. These changes are expected to be
 part of the Ceph Firefly release.

 ... though it's not clear whether they ever did make it into firefly.
 Could someone in the know comment on that?

 I think this is referring to the libcephfs API changes that the cohortfs
 folks did.  That all merged shortly before firefly.

Great, thanks for the clarification.

 By the way, we have some basic samba integration tests in our regular
 regression tests, but nothing based on ganesha.  If you really want this
 to work, the most valuable thing you could do would be to help
 get the tests written and integrated into ceph-qa-suite.git.  Probably the
 biggest piece of work there is creating a task/ganesha.py that installs
 and configures ganesha with the ceph FSAL.

Hmmm, given the excellent writeup that Niels de Vos of Gluster fame
wrote about this topic, I might actually be able to cargo-cult some of
what's in the Samba task and adapt it for ganesha.

Sorry for my ignorance about Teuthology: what platform does it
normally run on? I ask because I understand most of your testing is done
on Ubuntu, and Ubuntu currently doesn't ship a Ganesha package, which
would make the install task a bit more complex.

Cheers,
Florian





Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-24 Thread Florian Haas
On Wed, Sep 24, 2014 at 1:05 AM, Sage Weil sw...@redhat.com wrote:
 Sam and I discussed this on IRC and we think we have two simpler patches that
 solve the problem more directly.  See wip-9487.

So I understand this makes Dan's patch (and the config parameter that
it introduces) unnecessary, but is it correct to assume that, just like
Dan's patch, yours too will not be effective unless osd snap trim sleep
> 0?

 Queued for testing now.
 Once that passes we can backport and test for firefly and dumpling too.

 Note that this won't make the next dumpling or firefly point releases
 (which are imminent).  Should be in the next ones, though.

OK, just in case anyone else runs into problems after removing tons of
snapshots with <= 0.67.11, what's the plan to get them going again
until 0.67.12 comes out? Install the autobuild package from the wip
branch?

Cheers,
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-23 Thread Florian Haas
On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas flor...@hastexo.com wrote:
 On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil sw...@redhat.com wrote:
 On Sun, 21 Sep 2014, Florian Haas wrote:
 So yes, I think your patch absolutely still has merit, as would any
 means of reducing the number of snapshots an OSD will trim in one go.
 As it is, the situation looks really really bad, specifically
 considering that RBD and RADOS are meant to be super rock solid, as
 opposed to say CephFS which is in an experimental state. And contrary
 to CephFS snapshots, I can't recall any documentation saying that RBD
 snapshots will break your system.

 Yeah, it sounds like a separate issue, and no, the limit is not
 documented because it's definitely not the intended behavior. :)

 ...and I see you already have a log attached to #9503.  Will take a look.

 I've already updated that issue in Redmine, but for the list archives
 I should also add this here: Dan's patch for #9503, together with
 Sage's for #9487, makes the problem go away in an instant. I've
 already pointed out that I owe Dan dinner, and Sage, well I already
 owe Sage pretty much lifelong full board. :)

Looks like I was a bit too eager: while the cluster is behaving nicely
with these patches while nothing happens to any OSDs, it does flag PGs
as incomplete when an OSD goes down. Once the mon osd down out
interval expires things seem to recover/backfill normally, but it's
still disturbing to see this in the interim.

I've updated http://tracker.ceph.com/issues/9503 with a pg query from
one of the affected PGs, within the mon osd down out interval, while
it was marked incomplete.

Dan or Sage, any ideas as to what might be causing this?

Cheers,
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-22 Thread Florian Haas
On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil sw...@redhat.com wrote:
 On Sun, 21 Sep 2014, Florian Haas wrote:
 So yes, I think your patch absolutely still has merit, as would any
 means of reducing the number of snapshots an OSD will trim in one go.
 As it is, the situation looks really really bad, specifically
 considering that RBD and RADOS are meant to be super rock solid, as
 opposed to say CephFS which is in an experimental state. And contrary
 to CephFS snapshots, I can't recall any documentation saying that RBD
 snapshots will break your system.

 Yeah, it sounds like a separate issue, and no, the limit is not
 documented because it's definitely not the intended behavior. :)

 ...and I see you already have a log attached to #9503.  Will take a look.

I've already updated that issue in Redmine, but for the list archives
I should also add this here: Dan's patch for #9503, together with
Sage's for #9487, makes the problem go away in an instant. I've
already pointed out that I owe Dan dinner, and Sage, well I already
owe Sage pretty much lifelong full board. :)

Everyone with a ton of snapshots in their clusters (not sure where the
threshold is, but it gets nasty somewhere between 1,000 and 10,000 I
imagine) should probably update to 0.67.11 and 0.80.6 as soon as they
come out, otherwise Terrible Things Will Happen™ if you're ever forced
to delete a large number of snaps at once.

Thanks again to Dan and Sage,
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-21 Thread Florian Haas
On Sat, Sep 20, 2014 at 9:08 PM, Alphe Salas asa...@kepler.cl wrote:
  Real field testing and proven workouts are better than any unit testing ... I
  would follow Dan's notice of resolution because it is based on a real problem
  and not a phony-style test ground.

That statement is almost an insult to the authors and maintainers of
the testing framework around Ceph. Therefore, I'm taking the liberty
to register my objection.

That said, I'm not sure that wip-9487-dumpling is the final fix to the
issue. On the system where I am seeing the issue, even with the fix
deployed, OSDs still not only go crazy snap trimming (which by itself
would be understandable, as the system has indeed recently had
thousands of snapshots removed), but they also still produce the
previously seen ENOENT messages indicating they're trying to trim
snaps that aren't there.

That system, however, has PGs marked as recovering, not backfilling as
in Dan's system. Not sure if wip-9487 falls short of fixing the issue
at its root. Sage, whenever you have time, would you mind commenting?

Cheers,
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-21 Thread Florian Haas
On Sun, Sep 21, 2014 at 4:26 PM, Dan van der Ster
daniel.vanders...@cern.ch wrote:
 Hi Florian,

 September 21 2014 3:33 PM, Florian Haas flor...@hastexo.com wrote:
 That said, I'm not sure that wip-9487-dumpling is the final fix to the
 issue. On the system where I am seeing the issue, even with the fix
 deployed, osd's still not only go crazy snap trimming (which by itself
 would be understandable, as the system has indeed recently had
 thousands of snapshots removed), but they also still produce the
 previously seen ENOENT messages indicating they're trying to trim
 snaps that aren't there.


 You should be able to tell exactly how many snaps need to be trimmed. Check 
 the current purged_snaps with

 ceph pg x.y query

 and also check the snap_trimq from debug_osd=10. The problem fixed in 
 wip-9487 is the (mis)communication of purged_snaps to a new OSD. But if in 
 your cluster purged_snaps is correct (which it should be after the fix from 
 Sage), and it still has lots of snaps to trim, then I believe the only thing 
 to do is let those snaps all get trimmed. (my other patch linked sometime 
 earlier in this thread might help by breaking up all that trimming work into 
 smaller pieces, but that was never tested).
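To make "how many snaps are left to trim" concrete: a small sketch that diffs the two interval sets, assuming the "[start~length,...]" notation (with hex snap ids) that pg query and the OSD logs print. Pulling the fields out of the actual `ceph pg x.y query` JSON is left out, since the exact paths vary by version.

```python
# Sketch: count snaps still queued for trimming by diffing snap_trimq
# against purged_snaps, both in Ceph's interval-set notation
# "[start~length,...]" with hex snap ids (an assumption to verify against
# your version's output).
def parse_interval_set(s):
    """Parse e.g. '[1~3,a~2]' into the set of snap ids {1, 2, 3, 10, 11}."""
    snaps = set()
    body = s.strip().strip("[]")
    if not body:
        return snaps
    for part in body.split(","):
        start, length = (int(x, 16) for x in part.split("~"))
        snaps.update(range(start, start + length))
    return snaps

def snaps_left_to_trim(snap_trimq, purged_snaps):
    """Snap ids queued for trimming but not yet recorded as purged."""
    return sorted(parse_interval_set(snap_trimq) - parse_interval_set(purged_snaps))

# e.g. snaps_left_to_trim("[1~a]", "[1~4]") -> snaps 5 through 10 (0x10 == 16 not included)
remaining = snaps_left_to_trim("[1~a]", "[1~4]")
```

A large result here would distinguish "genuinely lots left to trim" from the miscommunicated-purged_snaps case wip-9487 addresses.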

Yes, it does indeed look like the system does have thousands of
snapshots left to trim. That said, since the PGs are locked during
this time, this creates a situation where the cluster becomes
unusable with no way for the user to recover.

 Entering the realm of speculation, I wonder if your OSDs are getting 
 interrupted, marked down, out, or crashing before they have the opportunity 
 to persist purged_snaps? purged_snaps is updated in 
 ReplicatedPG::WaitingOnReplicas::react, but if the primary is too busy to 
 actually send that transaction to its peers, so then eventually it or the new 
 primary needs to start again, and no progress is ever made. If this is what 
 is happening on your cluster, then again, perhaps my osd_snap_trim_max patch 
 could be a solution.

Since the snap trimmer immediately jacks the affected OSDs up to 100%
CPU utilization, and they stop even responding to heartbeats, yes they
do get marked down and that makes the issue much worse. Even when
setting nodown, though, then that doesn't change the fact that the
affected OSDs just spin practically indefinitely.

So, even with the patch for 9487, which fixes *your* issue of the
cluster trying to trim tons of snaps when in fact it should be
trimming only a handful, the user is still in a world of pain when
they do indeed have tons of snaps to trim. And obviously, neither of
osd max backfills nor osd recovery max active help here, because even
a single backfill/recovery makes the OSD go nuts.

There is the silly option of setting osd_snap_trim_sleep to say 61
minutes, and restarting the ceph-osd daemons before the snap trim can
kick in, i.e. hourly, via a cron job. Of course, while this prevents
the OSD from going into a death spin, it only perpetuates the problem
until a patch for this issue is available, because snap trimming never
even runs, let alone completes.
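In ceph.conf terms, the stop-gap above would look something like this (a sketch, not a recommendation; the value is in seconds, so 61 minutes is 3660, paired with an hourly cron restart of the ceph-osd daemons):

```ini
[osd]
; 61 minutes: longer than the hourly restart interval, so the
; snap trimmer never actually gets a chance to run
osd snap trim sleep = 3660
```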

This is particularly bad because a user can get themselves a
non-functional cluster simply by trying to delete thousands of
snapshots at once. If you consider a tiny virtualization cluster of
just 100 persistent VMs, out of which you take one snapshot an hour,
then deleting the snapshots taken in one month puts you well above
that limit. So we're not talking about outrageous numbers here. I
don't think anyone can fault any user for attempting this.

What makes the situation even worse is that there is no cluster-wide
limit to the number of snapshots, or even say snapshots per RBD
volume, or snapshots per PG, nor any limit on the number of snapshots
deleted concurrently.
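Until such limits exist, the deletion rate can only be capped client-side. A hypothetical sketch of pacing the removals rather than issuing thousands of `rbd snap rm` calls back to back (the batch size, pause, and the injected `run` callable are all assumptions for illustration, not Ceph features):

```python
# Sketch: remove RBD snapshots in small batches with pauses in between,
# so the OSD snap trimmer is handed work gradually instead of all at once.
import itertools
import time

def batched(iterable, n):
    """Yield successive lists of at most n items."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk

def remove_snaps_gently(image, snaps, run, batch_size=10, pause=60.0):
    """Remove RBD snapshots in batches, pausing between batches.

    `run` is injected (e.g. subprocess.check_call) so the pacing logic
    stays testable; the command is the usual `rbd snap rm` CLI form.
    """
    for batch in batched(snaps, batch_size):
        for snap in batch:
            run(["rbd", "snap", "rm", "{}@{}".format(image, snap)])
        time.sleep(pause)

# Demo with a recording stub instead of actually invoking the CLI:
calls = []
remove_snaps_gently("rbd/vm-disk", ["s%d" % i for i in range(25)],
                    run=calls.append, batch_size=10, pause=0.0)
```

This only spreads the load; it does not fix the underlying per-PG trimming cost.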

So yes, I think your patch absolutely still has merit, as would any
means of reducing the number of snapshots an OSD will trim in one go.
As it is, the situation looks really really bad, specifically
considering that RBD and RADOS are meant to be super rock solid, as
opposed to say CephFS which is in an experimental state. And contrary
to CephFS snapshots, I can't recall any documentation saying that RBD
snapshots will break your system.

Cheers,
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-19 Thread Florian Haas
On Fri, Sep 19, 2014 at 12:27 AM, Sage Weil sw...@redhat.com wrote:
 On Fri, 19 Sep 2014, Florian Haas wrote:
 Hi Sage,

 was the off-list reply intentional?

 Whoops!  Nope :)

 On Thu, Sep 18, 2014 at 11:47 PM, Sage Weil sw...@redhat.com wrote:
  So, disaster is a pretty good description. Would anyone from the core
  team like to suggest another course of action or workaround, or are
  Dan and I generally on the right track to make the best out of a
  pretty bad situation?
 
  The short term fix would probably be to just prevent backfill for the time
  being until the bug is fixed.

 As in, osd max backfills = 0?

 Yeah :)

 Just managed to reproduce the problem...

 sage

Saw the wip branch. Color me freakishly impressed on the turnaround. :) Thanks!

Cheers,
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-18 Thread Florian Haas
Hi Dan,

saw the pull request, and can confirm your observations, at least
partially. Comments inline.

On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster
daniel.vanders...@cern.ch wrote:
 Do I understand your issue report correctly in that you have found
 setting osd_snap_trim_sleep to be ineffective, because it's being
 applied when iterating from PG to PG, rather than from snap to snap?
 If so, then I'm guessing that that can hardly be intentional…


 I’m beginning to agree with you on that guess. AFAICT, the normal behavior of 
 the snap trimmer is to trim one single snap, the one which is in the 
 snap_trimq but not yet in purged_snaps. So the only time the current sleep 
 implementation could be useful is if we rm’d a snap across many PGs at once, 
 e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem anyway 
 since you’d at most need to trim O(100) PGs.

Hmm. I'm actually seeing this in a system where the problematic snaps
could *only* have been RBD snaps.

 We could move the snap trim sleep into the SnapTrimmer state machine, for 
 example in ReplicatedPG::NotTrimming::react. This should allow other IOs to 
 get through to the OSD, but of course the trimming PG would remain locked. 
 And it would be locked for even longer now due to the sleep.

 To solve that we could limit the number of trims per instance of the 
 SnapTrimmer, like I’ve done in this pull req: 
 https://github.com/ceph/ceph/pull/2516
 Breaking out of the trimmer like that should allow IOs to the trimming PG to 
 get through.

 The second aspect of this issue is why are the purged_snaps being lost to 
 begin with. I’ve managed to reproduce that on my test cluster. All you have 
 to do is create many pool snaps (e.g. of a nearly empty pool), then rmsnap 
 all those snapshots. Then use crush reweight to move the PGs around. With 
 debug_osd=10, you will see "adding snap 1 to purged_snaps", which is one 
 signature of this lost purged_snaps issue. To reproduce slow requests the 
 number of snaps purged needs to be O(1).
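Mechanically, that reproduction recipe is just a long sequence of CLI calls; a sketch that builds them (the pool name, snapshot naming, and counts are arbitrary placeholders):

```python
# Sketch: generate the CLI command sequence for the repro described above:
# create many pool snaps, rmsnap them all, then crush reweight to move PGs.
def repro_cmds(pool="rbd", n_snaps=1000, osd_id=0, weight=0.0):
    cmds = [["ceph", "osd", "pool", "mksnap", pool, "snap-%d" % i]
            for i in range(n_snaps)]
    cmds += [["ceph", "osd", "pool", "rmsnap", pool, "snap-%d" % i]
             for i in range(n_snaps)]
    # moving the PGs triggers the backfill path where purged_snaps gets lost
    cmds.append(["ceph", "osd", "crush", "reweight",
                 "osd.%d" % osd_id, str(weight)])
    return cmds

cmds = repro_cmds(n_snaps=5)  # tiny run just to show the shape
```

Needless to say, this belongs on a throwaway test cluster only.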

Hmmm, I'm not sure I can confirm that. I see "adding snap X to
purged_snaps", but only after the snap has been purged. See
https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
fact that the OSD tries to trim a snap only to get an ENOENT is
probably indicative of something being fishy with the snaptrimq and/or
the purged_snaps list as well.

 Looking forward to any ideas someone might have.

So am I. :)

Cheers,
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-18 Thread Florian Haas
On Thu, Sep 18, 2014 at 8:56 PM, Mango Thirtyfour
daniel.vanders...@cern.ch wrote:
 Hi Florian,

 On Sep 18, 2014 7:03 PM, Florian Haas flor...@hastexo.com wrote:

 Hi Dan,

 saw the pull request, and can confirm your observations, at least
 partially. Comments inline.

 On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster
 daniel.vanders...@cern.ch wrote:
  Do I understand your issue report correctly in that you have found
  setting osd_snap_trim_sleep to be ineffective, because it's being
  applied when iterating from PG to PG, rather than from snap to snap?
  If so, then I'm guessing that that can hardly be intentional…
 
 
  I’m beginning to agree with you on that guess. AFAICT, the normal behavior 
  of the snap trimmer is to trim one single snap, the one which is in the 
  snap_trimq but not yet in purged_snaps. So the only time the current sleep 
  implementation could be useful is if we rm’d a snap across many PGs at 
  once, e.g. rm a pool snap or an rbd snap. But those aren’t a huge problem 
  anyway since you’d at most need to trim O(100) PGs.

 Hmm. I'm actually seeing this in a system where the problematic snaps
 could *only* have been RBD snaps.


 True, as am I. The current sleep is useful in this case, but since we'd 
 normally only expect up to ~100 of these PGs per OSD, the trimming of 1 snap 
 across all of those PGs would finish rather quickly anyway. Latency would 
 surely be increased momentarily, but I wouldn't expect 90s slow requests like 
 I have with the 3 snap_trimq single PG.

 Possibly the sleep is useful in both places.

  We could move the snap trim sleep into the SnapTrimmer state machine, for 
  example in ReplicatedPG::NotTrimming::react. This should allow other IOs 
  to get through to the OSD, but of course the trimming PG would remain 
  locked. And it would be locked for even longer now due to the sleep.
 
  To solve that we could limit the number of trims per instance of the 
  SnapTrimmer, like I’ve done in this pull req: 
  https://github.com/ceph/ceph/pull/2516
  Breaking out of the trimmer like that should allow IOs to the trimming PG 
  to get through.
 
  The second aspect of this issue is why are the purged_snaps being lost to 
  begin with. I’ve managed to reproduce that on my test cluster. All you 
  have to do is create many pool snaps (e.g. of a nearly empty pool), then 
  rmsnap all those snapshots. Then use crush reweight to move the PGs 
  around. With debug_osd=10, you will see "adding snap 1 to purged_snaps", 
  which is one signature of this lost purged_snaps issue. To reproduce slow 
  requests the number of snaps purged needs to be O(1).

 Hmmm, I'm not sure I can confirm that. I see "adding snap X to
 purged_snaps", but only after the snap has been purged. See
 https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
 fact that the OSD tries to trim a snap only to get an ENOENT is
 probably indicative of something being fishy with the snaptrimq and/or
 the purged_snaps list as well.


 With such a long snap_trimq there in your log, I suspect you're seeing the 
 exact same behavior as I am. In my case the first snap trimmed is snap 1, of 
 course because that is the first rm'd snap, and the contents of your pool are 
 surely different. I also see the ENOENT messages... again confirming those 
 snaps were already trimmed. Anyway, what I've observed is that a large 
 snap_trimq like that will block the OSD until they are all re-trimmed.

That's... a mess.

So what is your workaround for recovery? My hunch would be to

- stop all access to the cluster;
- set nodown and noout so that other OSDs don't mark spinning OSDs
down (which would cause all sorts of primary and PG reassignments,
useless backfill/recovery when mon osd down out interval expires,
etc.);
- set osd_snap_trim_sleep to a ridiculously high value like 10 or 30
so that at least *between* PGs, the OSD has a chance to respond to
heartbeats and do whatever else it needs to do;
- let the snap trim play itself out over several hours (days?).
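In CLI terms, the first three steps would be something like the following (a sketch; the injectargs spelling and whether a restart is needed should be checked against your Ceph version):

```python
# Sketch: the ceph CLI invocations for the recovery hunch described above.
# Flag and option names reflect dumpling/firefly-era Ceph; verify before use.
def recovery_workaround_cmds(snap_trim_sleep=30.0):
    return [
        ["ceph", "osd", "set", "nodown"],
        ["ceph", "osd", "set", "noout"],
        # raise the sleep so the OSD can answer heartbeats between PGs
        ["ceph", "tell", "osd.*", "injectargs",
         "--osd_snap_trim_sleep {}".format(snap_trim_sleep)],
    ]

cmds = recovery_workaround_cmds(30.0)
```

The corresponding `ceph osd unset nodown` / `unset noout` would follow once trimming has played itself out.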

That sounds utterly awful, but if anyone has a better idea (other than
wait until the patch is merged), I'd be all ears.

Cheers
Florian


Re: snap_trimming + backfilling is inefficient with many purged_snaps

2014-09-18 Thread Florian Haas
On Thu, Sep 18, 2014 at 9:12 PM, Dan van der Ster
daniel.vanders...@cern.ch wrote:
 Hi,

 September 18 2014 9:03 PM, Florian Haas flor...@hastexo.com wrote:
 On Thu, Sep 18, 2014 at 8:56 PM, Dan van der Ster 
 daniel.vanders...@cern.ch wrote:

 Hi Florian,

 On Sep 18, 2014 7:03 PM, Florian Haas flor...@hastexo.com wrote:
 Hi Dan,

 saw the pull request, and can confirm your observations, at least
 partially. Comments inline.

 On Thu, Sep 18, 2014 at 2:50 PM, Dan Van Der Ster
 daniel.vanders...@cern.ch wrote:
 Do I understand your issue report correctly in that you have found
 setting osd_snap_trim_sleep to be ineffective, because it's being
 applied when iterating from PG to PG, rather than from snap to snap?
 If so, then I'm guessing that that can hardly be intentional…


 I’m beginning to agree with you on that guess. AFAICT, the normal
 behavior of the snap trimmer is to trim one single snap, the one which
 is in the snap_trimq but not yet in purged_snaps. So the only time the
 current sleep implementation could be useful is if we rm’d a snap
 across many PGs at once, e.g. rm a pool snap or an rbd snap. But those
 aren’t a huge problem anyway since you’d at most need to trim O(100)
 PGs.

 Hmm. I'm actually seeing this in a system where the problematic snaps
 could *only* have been RBD snaps.

 True, as am I. The current sleep is useful in this case, but since
 we'd normally only expect up to ~100 of these PGs per OSD, the
 trimming of 1 snap across all of those PGs would finish rather quickly
 anyway. Latency would surely be increased momentarily, but I wouldn't
 expect 90s slow requests like I have with the 3 snap_trimq single PG.

 Possibly the sleep is useful in both places.

 We could move the snap trim sleep into the SnapTrimmer state machine,
 for example in ReplicatedPG::NotTrimming::react. This should allow
 other IOs to get through to the OSD, but of course the trimming PG
 would remain locked. And it would be locked for even longer now due to
 the sleep.

 To solve that we could limit the number of trims per instance of the
 SnapTrimmer, like I’ve done in this pull req:
 https://github.com/ceph/ceph/pull/2516
 Breaking out of the trimmer like that should allow IOs to the trimming
 PG to get through.

 The second aspect of this issue is why are the purged_snaps being lost
 to begin with. I’ve managed to reproduce that on my test cluster. All
 you have to do is create many pool snaps (e.g. of a nearly empty
 pool), then rmsnap all those snapshots. Then use crush reweight to
 move the PGs around. With debug_osd=10, you will see “adding snap 1 to
 purged_snaps”, which is one signature of this lost purged_snaps issue.
 To reproduce slow requests the number of snaps purged needs to be
 O(1).

 Hmmm, I'm not sure if I confirm that. I see adding snap X to
 purged_snaps, but only after the snap has been purged. See
 https://gist.github.com/fghaas/88db3cd548983a92aa35. Of course, the
 fact that the OSD tries to trim a snap only to get an ENOENT is
 probably indicative of something being fishy with the snaptrimq and/or
 the purged_snaps list as well.

 With such a long snap_trimq there in your log, I suspect you're seeing
 the exact same behavior as I am. In my case the first snap trimmed is
 snap 1, of course because that is the first rm'd snap, and the
 contents of your pool are surely different. I also see the ENOENT
 messages... again confirming those snaps were already trimmed. Anyway,
 what I've observed is that a large snap_trimq like that will block the
 OSD until they are all re-trimmed.

 That's... a mess.

 So what is your workaround for recovery? My hunch would be to

 - stop all access to the cluster;
 - set nodown and noout so that other OSDs don't mark spinning OSDs
 down (which would cause all sorts of primary and PG reassignments,
 useless backfill/recovery when mon osd down out interval expires,
 etc.);
 - set osd_snap_trim_sleep to a ridiculously high value like 10 or 30
 so that at least *between* PGs, the OSD has a chance to respond to
 heartbeats and do whatever else it needs to do;
 - let the snap trim play itself out over several hours (days?).
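The workaround steps above would translate into roughly these commands. Again a dry-run sketch (the `run` wrapper only prints the commands), and the sleep value is deliberately extreme, per the text:

```shell
# Dry-run wrapper: prints commands instead of executing them.
# On a real cluster, change to: run() { "$@"; }
run() { echo "+ $*"; }

# Keep other OSDs from marking the busy ones down/out.
run ceph osd set nodown
run ceph osd set noout

# Ridiculously high inter-PG trim sleep, so the OSD can still heartbeat.
run ceph tell osd.\* injectargs '--osd_snap_trim_sleep 30'

# ... let the trimming play itself out, then revert:
run ceph tell osd.\* injectargs '--osd_snap_trim_sleep 0'
run ceph osd unset noout
run ceph osd unset nodown
```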


 What I've been doing is I just continue draining my OSDs, two at a time. Each 
 time, 1-2 other OSDs become blocked for a couple minutes (out of the ~1 hour 
 it takes to drain) while a single PG re-trims, leading to ~100 slow requests. 
 The OSD must still be responding to the peer pings, since other OSDs do not 
 mark it down. Luckily this doesn't happen with every single movement of our 
 pool 5 PGs, otherwise it would be a disaster like you said.

So just to clarify, what you're doing is out of the OSDs that are
spinning, you mark 2 out and wait for them to go empty?

What I'm seeing in my environment is that the OSDs *do* go down.
Marking them out seems not to help much as the problem then promptly
pops up elsewhere.

So, "disaster" is a pretty good description. Would anyone from

Ceph Puppet modules (again)

2014-03-10 Thread Florian Haas
Hi,

Somehow I'm thinking I'm opening a can of worms, but here it goes
anyway. I saw some discussion about this here on this list last
(Northern Hemisphere) autumn, but not much since.

I'd like to ask for some clarification on the current state of the
Ceph Puppet modules. Currently there are several: one on StackForge
(http://git.openstack.org/cgit/stackforge/puppet-ceph/), primarily
written by Loïc Dachary, and one on the eNovance GitHub repo
(https://github.com/enovance/puppet-ceph), written by Sébastien Han
and François Charlier. The eNovance repo is AGPL licensed, which I
find rather incomprehensible — the only thing this would make sense
for would be to force providers of *public* Puppet hosts to contribute
back upstream, but that's a really far fetched use case. The
StackForge repo is ASL licensed, which looks a bit saner.

Then there is a TelekomCloud fork of the eNovance repo at
https://github.com/TelekomCloud/puppet-ceph/tree/rc/eisbrecher, with
55 unmerged patches. Also AGPL, as far as I can tell.

And finally there's puppet-cephdeploy
(https://github.com/dontalton/puppet-cephdeploy) where I like that it
builds upon ceph-deploy, but dislike that it's rather closely
interwoven with OpenStack. ASL.

Finally, after the discussion that Loïc kicked off in
https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg16673.html,
there's https://github.com/ceph/puppet-ceph which hasn't seen any
updates in 2 months. This is a mirror of the StackForge module, as far
as I can tell, is ASL licensed and has seen neither the eNovance work
nor the TelekomCloud updates, presumably on account of the license
issue.

None of these repos seems to be universally accepted and fully complete
(StackForge only supports mon deployment; eNovance doesn't do radosgw,
for example), so I'm trying to understand where people should best
direct their efforts to get things to a working state.

All thoughts and comments appreciated. Thanks!

Cheers,
Florian


Re: Ceph Puppet modules (again)

2014-03-10 Thread Florian Haas
On Mon, Mar 10, 2014 at 7:27 PM, Loic Dachary l...@dachary.org wrote:
 Hi Florian,

 New efforts should be directed to

 https://github.com/stackforge/puppet-ceph (mirrored at 
 https://github.com/ceph/puppet-ceph) and evolving at 
 https://review.openstack.org/#/q/status:open+project:stackforge/puppet-ceph,n,z

 I'm happily developing it with Andrew Woodward and David Simard and quite 
 happy about how its future looks. It will eventually unite all other 
 modules and benefit from a proper integration test environment. I've been 
 distressed far too often by the lack of integration tests when writing puppet 
 modules. It makes all the difference in the world to me, although it's not 
 currently popular among puppet module developers, to the point that the 
 official tool (beaker) can't allocate a disk when creating an instance (!). I 
 wrote a tiny tool https://pypi.python.org/pypi/gerritexec to listen to gerrit 
 events, a simple one liner that actually runs the puppet modules for osd/mon 
 on cuttlefish/dumpling/emperor in various situations.

 When working on puppet-ceph, my efforts are often directed to patching Ceph 
 itself to make it more amenable to configuration management systems. 
 "ceph-disk: prepare should be idempotent" is one example at 
 http://tracker.ceph.com/issues/7475. But you will find a number of patches in 
 Firefly oriented toward this goal. I believe this will also help reduce the 
 complexity of the Chef, Ansible, Salt, ... playcookbooks (;-), in the same 
 way hiding osd ids from them resolved a number of unnecessary problems for 
 all of them (although some of them did not evolve to take advantage of it and 
 are still overcomplex).

 From the point of view of someone in a hurry with no time to develop, what I'm 
 doing is not usable at the moment. I'm under no pressure to rush anything 
 and won't commit to any deadline ;-) However, if a developer is willing to 
 help out, I'd be happy to spare the time to get her/him on board and speed up 
 the process.

I'm not sure if I'm getting this correctly, but it sounds a bit like
"please send patches my way, but don't expect to get anything usable
anytime soon". That doesn't seem like a very powerful argument for
contribution, I'm sad to say.

But maybe I'm getting something wrong?

Cheers,
Florian


Re: github pull requests

2013-03-22 Thread Florian Haas
On Fri, Mar 22, 2013 at 12:15 AM, Gregory Farnum g...@inktank.com wrote:
 I'm not sure that we handle enough incoming yet that the extra process
 weight of something like Gerrit or Launchpad is necessary over Github.
 What are you looking for in that system which Github doesn't provide?
 -Greg

Automated regression tests and gated commits come to mind. Gerrit
alone of course doesn't help with that, you'd probably want to
consider either running Jenkins, or hook the master merges up with
automatic teuthology runs.

Just my two cents, though.

Cheers,
Florian


OSD nodes with >=8 spinners, SSD-backed journals, and their performance impact

2013-01-14 Thread Florian Haas
Hi everyone,

we ran into an interesting performance issue on Friday that we were
able to troubleshoot with some help from Greg and Sam (thanks guys),
and in the process realized that there's little guidance around for
how to optimize performance in OSD nodes with lots of spinning disks
(and hence, hosting a relatively large number of OSDs). In that type
of hardware configuration, the usual mantra of "put your OSD journals
on an SSD" doesn't always hold up. So we wrote up some
recommendations, and I'd ask everyone interested to critique this or
provide feedback:

http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals

It's probably easiest to comment directly on that page, but if you
prefer instead to just respond in this thread, that's perfectly fine
too.

For some background of the discussion, please refer to the LogBot log
from #ceph:
http://irclogs.ceph.widodh.nl/index.php?date=2013-01-12

Hope this is useful.

Cheers,
Florian

-- 
Helpful information? Let us know!
http://www.hastexo.com/shoutbox


Re: OSD nodes with >=8 spinners, SSD-backed journals, and their performance impact

2013-01-14 Thread Florian Haas
Hi Tom,

On Mon, Jan 14, 2013 at 2:28 PM, Tom Lanyon t...@netspot.com.au wrote:
 On 14/01/2013, at 10:47 PM, Florian Haas flor...@hastexo.com wrote:
 <snip>
 http://www.hastexo.com/resources/hints-and-kinks/solid-state-drives-and-ceph-osd-journals

 It's probably easiest to comment directly on that page, but if you
 prefer instead to just respond in this thread, that's perfectly fine
 too.
 <snip>


 Hi Florian,

 Thanks for putting this together.

Pleasure!

 A couple of minor questions/comments:

 * One of the conclusions is to use the SSDs (assuming 2) un-RAIDed, but the 
 article doesn't actually explain why using them in a RAID-1 is a poor idea.

Added a paragraph starting with "putting your journal SSDs in a RAID
set looks like a good idea at first"; does that explain the situation
better?

 * Should the end of this sentence:
 Another option is to use, say, one partition on each of your SSD in 
 a RAID for the operating system installation, and then chop up the rest of 
 your SSDs an non-RAIDed Ceph OSDs.

 ...instead read:

 Another option is to use, say, one partition on each of your SSD in 
 a RAID for the operating system installation, and then chop up the rest of 
 your SSDs an non-RAIDed Ceph **OSD journals**. ?

Sure. Fixed.

Btw: coming to LCA? If you are, please find me and say hello. :)

Cheers,
Florian



Re: OSD nodes with >=8 spinners, SSD-backed journals, and their performance impact

2013-01-14 Thread Florian Haas
Hi Mark,

thanks for the comments.

On Mon, Jan 14, 2013 at 2:46 PM, Mark Nelson mark.nel...@inktank.com wrote:
 Hi Florian,

 Couple of comments:

 OSDs use a write-ahead mode for local operations: a write hits the journal
 first, and from there is then being copied into the backing filestore.

 It's probably important to mention that this is true by default only for
 non-btrfs file systems.  See:

 http://ceph.com/wiki/OSD_journal

I am well aware of that, but I've yet to find a customer (or user)
that's actually willing to entrust a production cluster with several
hundred terabytes of data to btrfs. :) Besides, the whole post is
about whether or not to use dedicated SSD block devices for OSD
journals, and if you're tossing everything into btrfs you've already
made the decision to use in-filestore journals.

 Thus, for best cluster performance it is crucial that the journal is fast,
 whereas the filestore can be comparatively slow.

 This is a bit misleading.  Having a faster journal is helpful when there are
 short bursts of traffic.  So long as the journal doesn't fill up and there
 are periods of inactivity for the data to get flushed, having slow filestore
 disk may be ok.  With lots of traffic, reality eventually catches up with
 you and you've gotta get all of that data flushed out to the backing file
 system.

I agree that the wording is non-optimal. What I meant was to equate
"fast" with SSDs, and "comparatively slow" with spinners. And to
combine spinners with SSDs is one of the most interesting points about
Ceph in terms of cost effectiveness. Pretty much every other storage
technology would require you to either go all-SSD or to look into
rather sophisticated HSM in order to achieve similar performance at a
comparable scale.

Suggestions for better wording?

 Have you ever seen ceph performance bouncing around with periods of really
 high throughput followed by periods of really low (or no!) throughput?
 That's usually the result of having a very fast journal paired with a slow
 data disk.  The journal writes out data very quickly, hits its max ops or
 max bytes limit, then writes are stalled for a period while data in the
 journal gets flushed out to the data disk.

Sure, essentially the equivalent, on a different level, of an NFS
server with lots of RAM and a high vm.dirty_ratio suddenly doing a
massive writeout.

 Another thing to remember is that writes to the journal happen without
 causing a lot of seeks.  Ceph doesn't have to do metadata or dentry
 lookups/writes to write data to the journal.  Because of this, it's been my
 experience that journals are primarily throughput bound rather than being
 random IOPS bound.  Just putting the journals on any old SSD isn't enough,
 you need to choose ones that get really high throughput like the Intel
 S3700s or other high performance models.

Yup.

 By and large, try to go for a relatively small number of OSDs per node,
 ideally not more than 8. This combined with SSD journals is likely to give
 you the best overall performance.

 The advice that I usually give people is that if performance is a big
  concern, try to ensure that filestore disk and journal performance are
  nearly matched.  In my test setup, I use 1 intel 520 SSD to host 3 journals for
 7200rpm enterprise SATA disks.  A 1:4 ratio or even 1:6 ratio may also work
 fine depending on various factors.  So far the limits I've hit with very
 minimal tuning seem to be around 15 spinning disks and 5 SSDs for around
 1.4GB/s (2.8GB/s including journal writes) to one node.

Yes, I realize that there's no hard number here. I could also have put
"ideally not more than 6". The point I was trying to make is that
people need to get off their thinking of what an ideal storage box is,
and that more disks per host isn't necessarily better. We had a user
in #ceph last week thinking that an OSD node with 36 spinners was a
stellar idea. It probably isn't.

 If you do go with OSD nodes with a very high number of disks, consider
 dropping the idea of an SSD-based journal. Yes, in this kind of setup you
 might actually do better with journals on the spinners.

 If your SSD(s) is/are slow you very well may be better off with putting the
 journals on the same spinning disks as the OSD data.  It's all a giant
 balancing act between write throughput, read throughput, and capacity.

And people generally prefer simple heuristics (a.k.a. rules of thumb)
over giant balancing acts. So I think if we tell them something like,

Got more than 8 spinners?
* No? Toss your journals on SSDs,
* Yes? At least consider not to.

... then I am hoping that will lead more people on the right path,
than when we tell them:

* Here's two dozen performance graphs, a pivot table, and a crystal ball.

I am obviously jesting and exaggerating, but you get my point. :)

Cheers,
Florian


Re: OSD nodes with >=8 spinners, SSD-backed journals, and their performance impact

2013-01-14 Thread Florian Haas
On 01/14/2013 06:34 PM, Gregory Farnum wrote:
 On Mon, Jan 14, 2013 at 6:09 AM, Florian Haas flor...@hastexo.com wrote:
 Hi Mark,

 thanks for the comments.

 On Mon, Jan 14, 2013 at 2:46 PM, Mark Nelson mark.nel...@inktank.com wrote:
 Hi Florian,

 Couple of comments:

 OSDs use a write-ahead mode for local operations: a write hits the journal
 first, and from there is then being copied into the backing filestore.

 It's probably important to mention that this is true by default only for
 non-btrfs file systems.  See:

 http://ceph.com/wiki/OSD_journal

 I am well aware of that, but I've yet to find a customer (or user)
 that's actually willing to entrust a production cluster with several
 hundred terabytes of data to btrfs. :) Besides, the whole post is
 about whether or not to use dedicated SSD block devices for OSD
 journals, and if you're tossing everything into btrfs you've already
 made the decision to use in-filestore journals.
 
 That is absolutely not the case. btrfs works just fine with an
 external journal on SSD or whatever else; what made you think
 otherwise?

A misunderstanding on my part. Also, I was overly broad in my comment.
What I really meant to say was that if I'm using a btrfs filestore, and
a separate dedicated block device for the journal, then the journaling
mode is write-ahead and not parallel.

Which was a wrong assumption on my part, as an external journal combined
with a btrfs filestore seems to support parallel journaling mode just
fine. For some reason I had supposed the journal had to be in the same
btrfs as the filestore for this to work.

Sorry for the confusion.

Cheers,
Florian


Re: Windows port

2013-01-09 Thread Florian Haas
On Tue, Jan 8, 2013 at 3:00 PM, Dino Yancey dino2...@gmail.com wrote:
 Hi,

 I am also curious if a Windows port, specifically the client-side, is
 on the roadmap.

This is somewhat OT from the original post, but if all you're
interested in is using RBD block storage from Windows, you can already do
that by going through an iSCSI or FC head node. Proof-of-concept
configuration outlined here:

http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices

Not sure if this helps, but just thought I'd mention it.

Cheers,
Florian



Re: Integration work

2012-08-28 Thread Florian Haas
On 08/28/2012 11:32 AM, Plaetinck, Dieter wrote:
 On Tue, 28 Aug 2012 11:12:16 -0700
 Ross Turk r...@inktank.com wrote:
 

 Hi, ceph-devel! It's me, your friendly community guy.

 Inktank has an engineering team dedicated to Ceph, and we want to work 
 on the right stuff. From time to time, I'd like to check in with you to 
 make sure that we are.

 Over the past several months, Inktank's engineers have focused on core 
 stability, radosgw, and feature expansion for RBD. At the same time, 
 they have been regularly allocating cycles to integration work. 
 Recently, this has consisted of improvements to the way Ceph works 
 within OpenStack (even though OpenStack isn't the only technology that 
 we think Ceph should play nicely with).

 What other sorts of integrations would you like to see Inktank engineers 
 work on?
 
 are we only supposed to give answers wrt. integration with other software?
 if not, I would suggest to write documentation.

If I may say so, the amount of work that John has poured into this in
recent weeks has been incredible (http://www.ceph.com/docs/master/). So
while it's definitely not complete nor perfect, I'm sure he would
appreciate a little more specific information as to where you believe
documentation is lacking.

I for my part, in the documentation space, would love for the admin
tools to become self-documenting. For example, I would love a "help"
subcommand at any level of the ceph shell, listing the supported
subcommands in that level. As in "ceph help", "ceph mon help", "ceph
osd getmap help".

Even better, the ceph shell could support a general-purpose hook that
bash-completion can use (kind of like hg does in Mercurial), and this
and the above-conjectured help facility could arguably share quite a bit
of code.
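To illustrate the kind of hook meant here, a tiny sketch of the completion side: a function that lists hypothetical ceph subcommands matching a prefix, the sort of output a bash-completion hook would consume. Entirely hypothetical; ceph does not ship this, and the subcommand list is made up for the example.

```shell
# Toy sketch: print hypothetical "ceph" subcommands matching a prefix.
# In the proposal, the ceph tool itself would emit this list via a
# "help" hook, so completion never goes stale.
ceph_complete_sketch() {
  prefix=$1
  for c in auth health mon osd mds pg; do
    case $c in
      "$prefix"*) echo "$c" ;;
    esac
  done
}

ceph_complete_sketch os   # prints: osd
```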

 and also integration with CM like puppet/chef

+1, although people are already working on both. So maybe this is just
about the need to tell more people about that. :)

Cheers,
Florian


Re: wip-crush

2012-08-22 Thread Florian Haas
On 08/22/2012 03:10 AM, Sage Weil wrote:
 I pushed a branch that changes some of the crush terminology.  Instead of 
 having a crush type called "pool" that requires you to say things like 
 pool=default in the "ceph osd crush set ..." command, it uses "root" 
 instead.  That hopefully reinforces that it is a tree/hierarchy.
 
 There is also a patch that changes "bucket" to "node" throughout, since 
 "bucket" is a term also used by radosgw.
 
 Thoughts?  I think the main pain in making this transition is that old 
 clusters have maps that have a type 'pool' and new ones won't, and the 
 docs will need to walk people through both...

"pool" in a crushmap being completely unrelated to a RADOS pool is
something that I've heard customers/users report as confusing, as well.
So changing that is probably a good thing. Naming it "root" is probably
a good choice as well, as it happens to match
http://ceph.com/wiki/Custom_data_placement_with_CRUSH.

As for changing "bucket" to "node"... a node is normally simply a
physical server (at least in HA terminology, which many potential Ceph
users will be familiar with), and CRUSH uses "host" for that. So that's
another recipe for confusion. How about using something super-generic,
like "element" or "item"?

Cheers,
Florian



Re: Ceph Benchmark HowTo

2012-07-25 Thread Florian Haas
On Tue, Jul 24, 2012 at 6:19 PM, Tommi Virtanen t...@inktank.com wrote:
 On Tue, Jul 24, 2012 at 8:55 AM, Mark Nelson mark.nel...@inktank.com wrote:
 personally I think it's fine to have it on the wiki.  I do want to stress
 that performance is going to be (hopefully!) improving over the next couple
 of months so we will probably want to have updated results (or at least
 remove old results!) as things improve.  Also, I'm not sure if we will be
 keeping the wiki around in it's current form. There was some talk about
 migrating to something else, but I don't really remember the details.

 Sounds like a job for doc/dev/benchmark/index.rst!  (It, or parts of
 it, can move out from under Internal if/when it gets user friendly
 enough to not need as much skill to use.)

If John is currently busy (which I assume he always is :) ), I should
be able to take care of that. In that case, would someone please open
a documentation bug and assign that to me?

Cheers,
Florian


Re: Ceph Benchmark HowTo

2012-07-25 Thread Florian Haas
Hi Mehdi,

great work! A few questions (for you, Mark, and anyone else watching
this thread) regarding the content of that wiki page:

For the OSD tests, which OSD filesystem are you testing on? Are you
using a separate journal device? If yes, what type?

For the RADOS benchmarks:

# rados bench -p pbench 900 seq
...
   611      16     17010     16994   111.241       104   1.05852  0.574897
   612      16     17037     17021   111.236       108   1.17321  0.574932
   613      16     17056     17040   111.178        76   1.01611  0.574903
 Total time run:        613.339616
Total reads made:      17056
Read size:             4194304
Bandwidth (MB/sec):    111.234

Average Latency:       0.575252
Max latency:           1.65182
Min latency:           0.07418

How meaningful is it to use an (arithmetic) average here, considering
the min and max differ by a factor of 22? Aren't we being bitten by
outliers pretty severely here, and wouldn't, say, a median be much
more useful? (Actually, would the max latency include the initial
hunt for a mon and the mon/osdmap exchange?)
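To make the outlier point concrete, here is a quick way to compare mean against median over a set of per-op latencies. The sample numbers are made up (one outlier, in the spirit of the max latency above), not taken from the benchmark run:

```shell
# Five sample latencies (seconds), one outlier -- illustrative only.
printf '%s\n' 0.50 0.55 0.58 0.60 1.65 > latencies.txt

# Sort, then compute mean and median in awk.
sort -n latencies.txt | awk '
  { a[NR] = $1; sum += $1 }
  END {
    mean   = sum / NR
    median = (NR % 2) ? a[(NR + 1) / 2] : (a[NR / 2] + a[NR / 2 + 1]) / 2
    printf "mean=%.3f median=%.3f\n", mean, median
  }'
# prints: mean=0.776 median=0.580
```

The single outlier drags the mean well above the median, which is the point being made about the rados bench summary.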



seekwatcher -t rbd-latency-write.trace -o rbd-latency-write.png -p 'dd
if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct' -d /dev/rbd0

Just making sure: are you getting the same numbers just with dd,
rather than dd invoked by seekwatcher?

Also, for your dd latency test of 4M direct I/O reads and writes, you
seem to be getting 39 and 300 ms average latency, yet further down it
says "RBD latency read/write: 28ms and 114.5ms". Any explanation for
the write latency being cut in half on what was apparently a different
test run?

Also, were read and write caches cleared between tests? (echo 3 >
/proc/sys/vm/drop_caches)

Cheers,
Florian


Re: Tuning placement group

2012-07-20 Thread Florian Haas
On Fri, Jul 20, 2012 at 9:33 AM, François Charlier
francois.charl...@enovance.com wrote:
 Hello,

 Reading http://ceph.com/docs/master/ops/manage/grow/placement-groups/
 and thinking to build a ceph cluster with potentially 1000 OSDs.

 Using the recommendations on the previously cited link, it would require
 pg_num being set between 10,000 and 30,000. Okay with that. Let's use the
 recommended value of 16,384; this is already about 160 placement groups
 per OSD.

 What if, for a start, we choose to reach this number of 1000 OSDs
 slowly, starting with 100 OSDs? It's now 1600 placement groups per OSD.

 What if we chose 30,000 (or 32,768) placement groups to keep room for
 expansion?

 My question is: how will a Ceph pool behave with 1000, 5000 or even
 10,000 placement groups per OSD? Will this impact performance? How
 bad? Can it be worked around? Is this a problem of RAM size? CPU
 usage?

If I may, I'd like to add an additional point of consideration,
specifically for radosgw setups:

What's the recommended way to set the number of PGs for the half-dozen
pools that radosgw normally creates on its own (.rgw, .rgw.users,
.rgw.buckets and so on)? I *think* wanting to set a custom number of
PGs would require pre-creating these pools manually, but there may be
a way -- undocumented? -- to instruct radosgw to set a user-configured
number of PGs on pool creation. Insight on that would be much
appreciated.
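If pre-creating the pools is indeed the way to go, the sketch would be along these lines. Dry-run wrapper again (prints instead of executes); the pool list uses the names mentioned above, and the pg_num value is an arbitrary example, not a recommendation:

```shell
# Dry-run wrapper: prints commands instead of executing them.
# On a real cluster, change to: run() { "$@"; }
run() { echo "+ $*"; }

# Pre-create radosgw's pools with an explicit pg_num before radosgw
# gets a chance to create them with the defaults.
PG_NUM=128   # example value; size it for your OSD count
for pool in .rgw .rgw.users .rgw.buckets; do
  run ceph osd pool create "$pool" "$PG_NUM"
done
```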

Cheers,
Florian


Ceph BoF at OSCON

2012-07-13 Thread Florian Haas
Hi everyone,

For those of you attending OSCON in Portland next week, there will be
a birds-of-a-feather session on Ceph Monday night. All OSCON attendees
interested in Ceph are very welcome.

Details about the BoF are in this blog post:

http://www.hastexo.com/blogs/florian/2012/07/12/openstack-high-availability-and-ceph-oscon

Looking forward to meeting you there!
Florian


Re: specifying secret in rbd map command

2012-07-09 Thread Florian Haas
On Mon, Jul 9, 2012 at 4:57 PM, Travis Rhoden trho...@gmail.com wrote:
 Hey folks,

 I had a bit of unexpected trouble today using the rbd map command to
 map an RBD to a kernel object.  I had previously been using the echo
 ... > /sys/bus/rbd... method of manipulating RBDs.

 I was looking at the instructions here:
 http://ceph.com/docs/master/rbd/rbd-ko/

 When I tried to use the given syntax,  sudo rbd map {image-name}
 --pool {pool-name} --name {client-name} --secret {client-secret}, I
 found the following:

 1. {client-secret} is really supposed to be a file, not the actual
 secret.  An strace on the command shows an attempt to open a file with
 the secret as its name
 2. If I give a keyring file as the client-secret, the command does not
 parse out the key for the given client-name.  In other words, I gave
 the name as client.admin, then gave it the keyring file which
 contained merely

 [client.admin]
 key = AQB67+BPGNX0NhAA9iK7Epcj72Jck1wOAQBetA==

 But the command wouldn't parse out the key.

 3. I had to create a new file, containing only the text of the key,
 and pass that to the command instead.  Then everything is happy.


 I'm happy to update the docs to make this process clear.  But I wonder
 if there might be any plans to modify the command behavior to accept a
 keyring file and pull out the key belonging to specified client name.
 Either way, I can update the docs to make it clear that you are
 specifying a file, not the key string itself.

I agree. This confuses quite a few people. Specifically because the
Ceph filesystem client supports "secret" and "secretfile" as mount
options, and expects a file only in the latter case. rbd acting
differently does violate POLA in that way.
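Until rbd learns to parse keyrings itself, a workaround sketch: pull the bare key out of the keyring with awk and hand the resulting file to rbd map. The keyring content below reuses the example entry from Travis's mail, and the rbd invocation is left commented since it needs a live cluster:

```shell
# Sample keyring, using the example entry from the mail above.
cat > keyring <<'EOF'
[client.admin]
    key = AQB67+BPGNX0NhAA9iK7Epcj72Jck1wOAQBetA==
EOF

# Extract the bare key for client.admin into its own file:
# set a flag on the right section header, clear it on the next one,
# and print the last field of the key line.
awk '/^\[client.admin\]/ { f = 1; next }
     /^\[/              { f = 0 }
     f && /key[ \t]*=/  { print $NF; exit }' keyring > adminkey

cat adminkey   # prints: AQB67+BPGNX0NhAA9iK7Epcj72Jck1wOAQBetA==

# Then, on a real cluster (image/pool names are placeholders):
#   sudo rbd map myimage --pool rbd --name client.admin --secret adminkey
```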

Cheers,
Florian


Re: librbd: error finding header

2012-07-09 Thread Florian Haas
On 07/09/12 12:29, Vladimir Bashkirtsev wrote:
 On 09/07/12 18:33, Dan Mick wrote:
 Vladimir: you can do some investigation with the rados command.  What
 does "rados -p rbd ls" show you?
 Rather long list of:
 rb.0.11.2786
 rb.0.d.54a2
 rb.0.6.2eb5
 rb.0.d.8294
 rb.0.13.0377
 rb.0.e.0629
 rb.0.6.2756
 rb.0.d.6156
 rb.0.d.9b82
 rb.0.5.0c9e
 rb.0.d.80ba
 rb.0.f.0e75
 rb.0.6.ab4f
 rb.0.d.48e4
 rb.0.d.5f67
 rb.0.13.14ad
 rb.0.d.e074
 rb.0.f.1a4b
 rb.0.13.04a3
 ...
 
 How to find out to which image these objects belong?

"rbd info" would tell you the block prefix for the image you're looking
at. Or does that command give you an "error opening image" message as well?

Cheers,
Florian


Setting a big maxosd kills all mons

2012-07-05 Thread Florian Haas
Hi guys,

Someone I worked with today pointed me to a quick and easy way to
bring down an entire cluster, by making all mons kill themselves in
mass suicide:

ceph osd setmaxosd 2147483647
2012-07-05 16:29:41.893862 b5962b70  0 monclient: hunting for new mon

I don't know what the actual threshold is, but setting your maxosd to
any sufficiently big number should do it. I had hoped 2^31-1 would be
fine, but evidently it's not.

This is what's in the mon log -- the first line is obviously only on
the leader at the time of the command, the others are on all mons.

-1 2012-07-05 16:29:41.829470 b41a1b70  0 mon.daisy@0(leader) e1
handle_command mon_command(osd setmaxosd 2147483647 v 0) v1
 0 2012-07-05 16:29:41.887590 b41a1b70 -1 *** Caught signal (Aborted) **
 in thread b41a1b70

 ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
 1: /usr/bin/ceph-mon() [0x816f461]
 2: [0xb7738400]
 3: [0xb7738424]
 4: (gsignal()+0x51) [0xb731a781]
 5: (abort()+0x182) [0xb731dbb2]
 6: (__gnu_cxx::__verbose_terminate_handler()+0x14f) [0xb753b53f]
 7: (()+0xbd405) [0xb7539405]
 8: (()+0xbd442) [0xb7539442]
 9: (()+0xbd581) [0xb7539581]
 10: (()+0x11dea) [0xb7582dea]
 11: (tc_new()+0x26) [0xb75a1636]
 12: (std::vector<unsigned char, std::allocator<unsigned char> >::_M_fill_insert(__gnu_cxx::__normal_iterator<unsigned char*, std::vector<unsigned char, std::allocator<unsigned char> > >, unsigned int, unsigned char const&)+0x79) [0x8185629]
 13: (OSDMap::set_max_osd(int)+0x497) [0x817c6b7]

From src/mon/OSDMonitor.cc:

  int newmax = atoi(m->cmd[2].c_str());
  if (newmax < osdmap.crush->get_max_devices()) {
    err = -ERANGE;
    ss << "cannot set max_osd to " << newmax << " which is < crush max_devices "
       << osdmap.crush->get_max_devices();
    goto out;
  }
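A back-of-envelope sketch of why a huge max_osd is fatal: set_max_osd resizes per-OSD vectors in the osdmap, so the allocation grows linearly with the requested maximum. The per-OSD byte count below is an illustrative assumption, not the real per-entry size from the OSDMap source:

```python
# Illustrative only: assumed_bytes_per_osd is a guess to show the order
# of magnitude, not a figure taken from OSDMap::set_max_osd.
max_osd = 2**31 - 1
assumed_bytes_per_osd = 16
total_gib = max_osd * assumed_bytes_per_osd / 2.0**30
print("%.0f GiB" % total_gib)  # roughly 32 GiB -- hopeless for a 32-bit ceph-mon
```

Even a much smaller per-entry size leaves the requested allocation in the tens of gigabytes, which explains the tc_new() abort in the trace above.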

I think that counts as unchecked user input, or has cmd[2] been
sanitized at any time before it gets here?

Also, is there a way to recover from this, short of reinitializing all mons?

Cheers,
Florian


Writes to mounted Ceph FS fail silently if client has no write capability on data pool

2012-07-05 Thread Florian Haas
Hi everyone,

please enlighten me if I'm misinterpreting something, but I think the
Ceph FS layer could handle the following situation better.

How to reproduce (this is on a 3.2.0 kernel):

1. Create a client, mine is named test, with the following capabilities:

client.test
key: key
caps: [mds] allow
caps: [mon] allow r
caps: [osd] allow rw pool=testpool

Note the client only has access to a single pool, testpool.

2. Export the client's secret and mount a Ceph FS.

mount -t ceph -o name=test,secretfile=/etc/ceph/test.secret
daisy,eric,frank:/ /mnt

This succeeds, despite us not even having read access to the data pool.

3. Write something to a file.

root@alice:/mnt# echo hello world  hello.txt
root@alice:/mnt# cat hello.txt

This too succeeds.

4. Sync and clear caches.

root@alice:/mnt# sync
root@alice:/mnt# echo 3  /proc/sys/vm/drop_caches

5. Check file size and contents.

root@alice:/mnt# ls -la
total 5
drwxr-xr-x  1 root root0 Jul  5 17:15 .
drwxr-xr-x 21 root root 4096 Jun 11 09:03 ..
-rw-r--r--  1 root root   12 Jul  5 17:15 hello.txt
root@alice:/mnt# cat hello.txt
root@alice:/mnt#

Note the reported file size in unchanged, but the file is empty.

Checking the data pool with client.admin credentials obviously shows
that that pool is empty, so objects are never written. Interestingly,
cephfs hello.txt show_location does list an object_name, identifying
an object which doesn't exist.

Is there any way to make the client fail with -EIO, -EPERM,
-EOPNOTSUPP or whatever else is appropriate, rather than pretending to
write when it can't?

Also, going down the rabbit hole, how would this behavior change if I
used cephfs to set the default layout on some directory to use a
different pool?

All thoughts appreciated.

Cheers,
Florian


cephfs show_location produces kernel divide error: 0000 [#1] when run against a directory that is not the filesystem root

2012-07-05 Thread Florian Haas
And one more issue report for today... :)

Really easy to reproduce on my 3.2.0 Debian squeeze-backports kernel:
mount a Ceph FS, create a directory in it. Then run cephfs dir
show_location.

dmesg stacktrace:

[ 7153.714260] libceph: mon2 192.168.42.116:6789 session established
[ 7308.584193] divide error: 0000 [#1] SMP
[ 7308.584936] Modules linked in: cryptd aes_i586 aes_generic cbc ceph
libceph nfsd lockd nfs_acl auth_rpcgss sunrpc fuse joydev usbhid hid
snd_pcm snd_timer snd processor soundcore snd_page_alloc thermal_sys
button tpm_tis tpm tpm_bios psmouse i2c_piix4 evdev serio_raw i2c_core
virtio_balloon pcspkr ext3 jbd mbcache btrfs zlib_deflate crc32c
libcrc32c sg sr_mod cdrom ata_generic virtio_net virtio_blk ata_piix
uhci_hcd ehci_hcd libata usbcore floppy scsi_mod virtio_pci usb_common
[last unloaded: scsi_wait_scan]
[ 7308.588013]
[ 7308.588013] Pid: 1444, comm: cephfs Not tainted
3.2.0-0.bpo.2-686-pae #1 Bochs Bochs
[ 7308.588013] EIP: 0060:[f848c6c2] EFLAGS: 00010246 CPU: 0
[ 7308.588013] EIP is at ceph_calc_file_object_mapping+0x44/0xe8 [libceph]
[ 7308.588013] EAX:  EBX:  ECX:  EDX: 
[ 7308.588013] ESI:  EDI:  EBP:  ESP: f7495ce4
[ 7308.588013]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 7308.588013] Process cephfs (pid: 1444, ti=f7494000 task=f7266a60
task.ti=f7494000)
[ 7308.588013] Stack:
[ 7308.588013]     0001b053 f5f20624 f5f203f0
f749a800 f5f20420
[ 7308.588013]  f84ca6a7 f7495d40 f7495d58 f7495d50 f7495d38 0001
0246 f5f20420
[ 7308.588013]  f749a90c bff6ff70 c14203a4 fffba978 000a0050 
f79f0298 0001
[ 7308.588013] Call Trace:
[ 7308.588013]  [f84ca6a7] ? ceph_ioctl_get_dataloc+0x9e/0x213 [ceph]
[ 7308.588013]  [c10b6781] ? __do_fault+0x3ee/0x42b
[ 7308.588013]  [c10b75f3] ? handle_pte_fault+0x3aa/0xa67
[ 7308.588013]  [c10e0844] ? path_openat+0x27f/0x294
[ 7308.588013]  [f84cac16] ? ceph_ioctl+0x3fa/0x460 [ceph]
[ 7308.588013]  [c10d9fdb] ? cp_new_stat64+0xee/0x100
[ 7308.588013]  [c10b7ebe] ? handle_mm_fault+0x20e/0x224
[ 7308.588013]  [f84ca81c] ? ceph_ioctl_get_dataloc+0x213/0x213 [ceph]

I unfortunately don't have a more recent kernel to test with, so if
this has been fixed upstream feel free to ignore me. Otherwise,
perhaps something that could go into the 3.5-rc cycle.

Doing show_location on a file, and on the root directory of the fs,
both work fine.

Cheers,
Florian


Re: cephfs show_location produces kernel divide error: 0000 [#1] when run against a directory that is not the filesystem root

2012-07-05 Thread Florian Haas
On Thu, Jul 5, 2012 at 10:04 PM, Gregory Farnum g...@inktank.com wrote:
 But I have a few more queries while this is fresh. If you create a
 directory, unmount and remount, and get the location, does that work?

Nope, same error.

 (actually, just flushing caches would probably do it.)

Idem.

 If you create a
 directory on one node, and then go look at it on another node and try
 to get the location from there, does that work?

No.

Cheers,
Florian


Re: Writes to mounted Ceph FS fail silently if client has no write capability on data pool

2012-07-05 Thread Florian Haas
On Thu, Jul 5, 2012 at 10:01 PM, Gregory Farnum g...@inktank.com wrote:
 Also, going down the rabbit hole, how would this behavior change if I
 used cephfs to set the default layout on some directory to use a
 different pool?

 I'm not sure what you're asking here — if you have access to the
 metadata server, you can change the pool that new files go into, and I
 think you can set the pool to be whatever you like (and we should
 probably harden all this, too). So you can fix it if it's a problem,
 but you can also turn it into a problem.

I am aware that I would be able to do this.

My question was more along the lines of: if the pool that data is
written to can be set on a per-file or per-directory basis, and we can
also set read and write permissions per pool, how would the filesystem
behave properly? Hide files the mounting user doesn't have read access
to? Return -EIO or -EPERM on writes to files stored in pools we can't
write to? Failing a mount if we're missing some permission on any file
or directory in the fs? All of these sound painful in one way or
another, so I'm having trouble envisioning what the correct behavior
would look like.

Florian


Re: URL-safe base64 encoding for keys

2012-07-03 Thread Florian Haas
On Tue, Jul 3, 2012 at 2:22 PM, Wido den Hollander w...@widodh.nl wrote:
 Hi,

 With my CloudStack integration I'm running into a problem with the cephx
 keys due to '/' being possible in the cephx keys.

 CloudStack's API expects a URI to be passed when adding a storage pool,
 e.g.:

 addStoragePool?uri=rbd://user:cephx...@monitor.addr/poolname

 If 'cephxkey' contains a / the URI parser in Java fails (java.net.URI) and
 splits the URI in the wrong place.

 For base64 there is a specification [0] that describes the usage of - and _
 instead of +  and /

 Is there a way that we change the bits in src/common/armor.c and replace the
 + and / for - and _?

FWIW (only semi-related), some S3 clients -- s3cmd from s3tools, for
example -- seem to choke on the forward slash in radosgw
auto-generated secret keys, as well.
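For reference, the base64url alphabet from the RFC Wido mentions differs from the standard one only in those two characters. A quick Python illustration (the key string below is made up, not a real secret):

```python
import base64

raw = b"\xfb\xff\xbf"  # bytes chosen so the standard alphabet emits '+' and '/'
print(base64.b64encode(raw))          # b'+/+/'
print(base64.urlsafe_b64encode(raw))  # b'-_-_'

# An existing standard-alphabet key can be made URI-safe the same way:
key = "Uo/O4nFgdAoX+9nEftEl"  # made-up example key
safe = key.replace("+", "-").replace("/", "_")
print(safe)
```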

Cheers,
Florian


Re: URL-safe base64 encoding for keys

2012-07-03 Thread Florian Haas
On Tue, Jul 3, 2012 at 5:04 PM, Yehuda Sadeh yeh...@inktank.com wrote:
 FWIW (only semi-related), some S3 clients -- s3cmd from s3tools, for
 example -- seem to choke on the forward slash in radosgw
 auto-generated secret keys, as well.


 With radosgw we actually switch a while back to use the alternative
 encoding. If you still have some old access keys, just replace them.

Is "a while back" after 0.47.3? Because I was definitely getting keys
with "/" from that version.

Cheers,
Florian


rbd rm allows removal of mapped device, nukes data, then returns -EBUSY

2012-07-02 Thread Florian Haas
Hi everyone,

just wanted to check if this was the expected behavior -- it doesn't
look like it would be, to me.

What I do is create a 1G RBD, and just for the heck of it, make an XFS on it:

root@alice:~# rbd create xfsdev --size 1024
root@alice:~# rbd map xfsdev
root@alice:~# rbd showmapped
id  poolimage   snapdevice
0   rbd xfsdev  -   /dev/rbd0
root@alice:~# mkfs -t xfs /dev/rbd/rbd/xfsdev
log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/rbd/rbd/xfsdevisize=256agcount=9, agsize=31744 blks
 =   sectsz=512   attr=2, projid32bit=0
data =   bsize=4096   blocks=262144, imaxpct=25
 =   sunit=1024   swidth=1024 blks
naming   =version 2  bsize=4096   ascii-ci=0
log  =internal log   bsize=4096   blocks=2560, version=2
 =   sectsz=512   sunit=8 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0

I double check to see if there's an XFS signature on the device:

root@alice:~# xxd /dev/rbd/rbd/xfsdev | head
000: 5846 5342  1000   0004   XFSB
010:          
020: 17bb f4df b1f3 444b bc01 3b3e f827 8fef  ..DK..;.'..
030:   0002 0008    4000  ..@.
040:    4001    4002  ..@...@.
050:  0001  7c00  0009    ..|.
060:  0a00 b5a4 0200 0100 0010    
070:     0c09 0804 0f00 0019  
080:    0040    003d  ...@...=
090:   0003 f5d8      

Now, I try to remove the device while it's mapped:

root@alice:~# rbd rm xfsdev
Removing image: 99% complete...2012-07-02 06:52:57.386040 b6c8d710 -1
librbd: error removing header: (16) Device or resource busy
Removing image: 99% complete...failed.
delete error: image still has watchers
This means the image is still open or the client using it crashed. Try
again after closing/unmapping it or waiting 30s for the crashed client
to timeout.

That sounds reasonable, except that the data has already been nuked:

root@alice:~# xxd /dev/rbd/rbd/xfsdev | head
000:          
010:          
020:          
030:          
040:          
050:          
060:          
070:          
080:          
090:          

After unmapping, the device removal proceeds just fine.

root@alice:~# rbd unmap /dev/rbd0
root@alice:~# rbd rm xfsdev
Removing image: 100% complete...done.

Now if the RBD is capable of detecting that it's being watched, why
not fail the removal _before_ wiping data, potentially with an
override with a --force flag?

Cheers,
Florian


Re: Radosgw installation and administration docs

2012-07-02 Thread Florian Haas
On Sun, Jul 1, 2012 at 10:22 PM, Chuanyu chua...@cs.nctu.edu.tw wrote:
 Hi Yehuda, Florian,

 I follow the wiki, and steps which you discussed,
 construct my ceph system with rados gateway,
 and I can use libs3 to upload file via radosgw, (thanks a lot!)
 but got 405 Method Not Allowed when I use swift,

 $ swift -v -A http://s3.paca.tw:80/auth -U paca:paca1 -K
 UoJO4nFgdAoX+9nEftElIY+AMmDIkcrUBkycNKPA stat
 Auth GET failed: http://s3.paca.tw:80/auth/tokens 405 Method Not Allowed

 ( Because there is no test step on the wiki,
  I followed Florian's question, and guessed the test command is the above ?!)

 my radosgw-admin config:
 $ radosgw-admin user info --uid=paca
 { "user_id": "paca",
   "rados_uid": 0,
   "display_name": "chuanyu",
   "email": "chua...@cs.nctu.edu.tw",
   "suspended": 0,
   "subusers": [
     { "id": "paca:paca1",
       "permissions": "none"}],

This is most likely your problem. You're being bitten by
http://tracker.newdream.net/issues/2650.

Try radosgw-admin subuser modify --subuser=paca:paca1 --access=full
and see if that improves things.

Cheers,
Florian


Does radosgw really need to talk to an MDS?

2012-07-02 Thread Florian Haas
Hi everyone,

radosgw(8) states that the following capabilities must be granted to
the user that radosgw uses to connect to RADOS.

ceph-authtool -n client.radosgw.gateway --cap mon 'allow r' --cap osd
'allow rwx' --cap mds 'allow' /etc/ceph/keyring.radosgw.gateway

Could someone explain why we need an mds 'allow' in here? I thought
only CephFS clients talked to MDSs, and at first glance configuring
client.radosgw.gateway without any MDS capability seems not to break
anything (at least with my limited S3 tests). Am I missing something?

Cheers,
Florian


Assertion failure when radosgw can't authenticate

2012-07-02 Thread Florian Haas
Hi,

in cephx enabled clusters (0.47.x), authentication failures from
radosgw seem to lead to an uncaught assertion failure:

2012-07-02 11:26:46.559830 b69c5730  0 librados:
client.radosgw.charlie authentication error (1) Operation not
permitted
2012-07-02 11:26:46.560093 b69c5730 -1 Couldn't init storage provider (RADOS)
2012-07-02 11:26:46.560401 b69c5730 -1 common/Timer.cc: In function
'SafeTimer::~SafeTimer()' thread b69c5730 time 2012-07-02
11:26:46.5601
10
common/Timer.cc: 57: FAILED assert(thread == __null)

 ceph version 0.47.3 (commit:c467d9d1b2eac9d3d4706b8e044979aa63b009f8)
 1: (SafeTimer::~SafeTimer()+0x96) [0x80a5c76]
 2: (main()+0x56f) [0x809708f]
 3: (__libc_start_main()+0xe6) [0xb6cefca6]
 4: /usr/bin/radosgw() [0x807f4a1]
 NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.

--- begin dump of recent events ---
-2 2012-07-02 11:26:46.559830 b69c5730  0 librados:
client.radosgw.charlie authentication error (1) Operation not
permitted
-1 2012-07-02 11:26:46.560093 b69c5730 -1 Couldn't init storage
provider (RADOS)
 0 2012-07-02 11:26:46.560401 b69c5730 -1 common/Timer.cc: In
function 'SafeTimer::~SafeTimer()' thread b69c5730 time 2012-07-02
11:26
:46.560110
common/Timer.cc: 57: FAILED assert(thread == __null)

Kinda ugly. Maybe this could be fixed in the pending 0.48 release.

The issue obviously goes away immediately after correcting the auth credentials.

Cheers,
Florian


Re: Does radosgw really need to talk to an MDS?

2012-07-02 Thread Florian Haas
On Mon, Jul 2, 2012 at 1:44 PM, Wido den Hollander w...@widodh.nl wrote:
 You are not allowing the RADOS Gateway to do anything on the MDS.

 There is no 'r',  'w' or 'x' permission which you are allowing. So there is
 nothing the rgw has access to on the MDS.

Yep, so we might as well leave off --cap mds 'allow'?

Florian


radosgw forgetting subuser permissions when creating a fresh key

2012-06-25 Thread Florian Haas
Hi everyone,

I wonder if this is intentional: when I create a new Swift key for an
existing subuser, which has previously been assigned full control
permissions, those permissions appear to get lost upon key creation.

# radosgw-admin subuser create --uid=johndoe --subuser=johndoe:swift
--access=full
{ "user_id": "johndoe",
  "rados_uid": 0,
  "display_name": "John Doe",
  "email": "j...@example.com",
  "suspended": 0,
  "subusers": [
    { "id": "johndoe:swift",
      "permissions": "full-control"}],
  "keys": [
    { "user": "johndoe",
      "access_key": "QFAMEDSJP5DEKJO0DDXY",
      "secret_key": "iaSFLDVvDdQt6lkNzHyW4fPLZugBAI1g17LO0+87"}],
  "swift_keys": []}

Note "permissions": "full-control".

# radosgw-admin key create --subuser=johndoe:swift --key-type=swift
{ "user_id": "johndoe",
  "rados_uid": 0,
  "display_name": "John Doe",
  "email": "j...@example.com",
  "suspended": 0,
  "subusers": [
    { "id": "johndoe:swift",
      "permissions": "none"}],
  "keys": [
    { "user": "johndoe",
      "access_key": "QFAMEDSJP5DEKJO0DDXY",
      "secret_key": "iaSFLDVvDdQt6lkNzHyW4fPLZugBAI1g17LO0+87"}],
  "swift_keys": [
    { "user": "johndoe:swift",
      "secret_key": "E9T2rUZNu2gxUjcwUBO8n\/Ev4KX6\/GprEuH4qhu1"}]}

Note that while there is now a key, the permissions are gone. Is this
meant to be a security feature of sorts, or is this a bug? subuser
modify can obviously restore the permissions, but it seems to be less
than desirable to have to do that.

Cheers,
Florian


Re: Ceph as a NOVA-INST-DIR/instances/ storage backend

2012-06-25 Thread Florian Haas
On Mon, Jun 25, 2012 at 6:03 PM, Tommi Virtanen t...@inktank.com wrote:
 On Sat, Jun 23, 2012 at 11:42 AM, Igor Laskovy igor.lask...@gmail.com wrote:
 Hi all from hot Kiev))

 Does anybody use Ceph as a backend storage for NOVA-INST-DIR/instances/ ?

Yes. http://www.sebastien-han.fr/blog/2012/06/10/introducing-ceph-to-openstack/
Look at the Live Migration with CephFS part.

 Is it in production use?

Production use would require CephFS to be production ready, which at
this point it isn't.

 Live migration is still possible?

Yes.

 I kindly ask any advice of best practices point of view.

 That's the shared NFS mount style for storing images, right? While you
 could use the Ceph Distributed File System for that, there's a better
 answer (for both Nova and Glance): RBD.

... which sort of goes hand-in-hand with boot from volume, which was
just recently documented in the Nova admin guide, so you may want to
take a look: 
http://docs.openstack.org/trunk/openstack-compute/admin/content/boot-from-volume.html

That being said, volume attachment persistence across live migrations
hasn't always been stellar in Nova, and I'm not 100% sure how well
trunk currently deals with that.

Cheers,
Florian


Re: [Openstack] Ceph/OpenStack integration on Ubuntu precise: horribly broken, or am I doing something wrong?

2012-06-21 Thread Florian Haas
On Fri, Jun 22, 2012 at 7:43 AM, James Page james.p...@ubuntu.com wrote:
 You can type faster than I can... I'm working on getting this
 resolved in the current dev release of Ubuntu in the next few
 days after which it will go through the normal SRU process for
 Ubuntu 12.04.
 Sweet, thanks!

 The SRU to resolve the install-ability of python-ceph have just
 complete verification and should be available in Ubuntu 12.04 updates
 in the next few hours (depending on which mirror you use).

Excellent. Thanks a lot!

Cheers,
Florian


Re: rbd locking and handling broken clients

2012-06-14 Thread Florian Haas
On Thu, Jun 14, 2012 at 1:41 AM, Greg Farnum g...@inktank.com wrote:
 On Wednesday, June 13, 2012 at 1:37 PM, Florian Haas wrote:
 Greg,

 My understanding of Ceph code internals is far too limited to comment on
 your specific points, but allow me to ask a naive question.

 Couldn't you be stealing a lot of ideas from SCSI-3 Persistent
 Reservations? If you had server-side (OSD) persistence of information of
 the this device is in use by X type (where anything other than X would
 get an I/O error when attempting to access data), and you had a manual,
 authenticated override akin to SCSI PR preemption, plus key
 registration/exchange for that authentication, then you would at least
 have to have the combination of a misbehaving OSD plus a malicious
 client for data corruption. A non-malicious but just broken client
 probably won't do.

 Clearly I may be totally misguided, as Ceph is fundamentally
 decentralized and SCSI isn't, but if PR-ish behavior comes even close to
 what you're looking for, grabbing those ideas would look better to me
 than designing your own wheel.

 Yeah, the problem here is exactly that Ceph (and RBD) are fundamentally 
 decentralized. :)

True, but as a general comment I do posit that to say "X is not
exactly like Y, thus nothing applicable to X applies to Y" is a
fallacy. :)

 I'm not familiar with the SCSI PR mechanism either, but it looks to me like 
 it deals in entirely local information — the equivalent with RBD would 
 require performing a locking operation on every object in the RBD image 
 before you accessed it. We could do that, but then opening an image would 
 take time linear in its size… :(

Well you would make this configurable and optional, wouldn't you? Kind
of like no-one forces people to use PRs on SCSI LUs. When this is
being used, however, taking a performance hit on open sounds like a
reasonable price to pay for not shredding data. TANSTAAFL.

Again, this is just my poorly informed two cents. :)

Cheers,
Florian


Building documentation offline?

2012-06-14 Thread Florian Haas
Hi everyone,

it occurred to me this afternoon that admin/build-doc unconditionally
tries to fetch some updates from GitHub, which breaks building docs when
you don't have a network connection. Would there be any reasonably
simple way to make it support offline build, provided the various pip
bits have previously been downloaded and installed?

Cheers,
Florian


Re: rbd locking and handling broken clients

2012-06-13 Thread Florian Haas
Greg,

My understanding of Ceph code internals is far too limited to comment on
your specific points, but allow me to ask a naive question.

Couldn't you be stealing a lot of ideas from SCSI-3 Persistent
Reservations? If you had server-side (OSD) persistence of information of
the this device is in use by X type (where anything other than X would
get an I/O error when attempting to access data), and you had a manual,
authenticated override akin to SCSI PR preemption, plus key
registration/exchange for that authentication, then you would at least
have to have the combination of a misbehaving OSD plus a malicious
client for data corruption. A non-malicious but just broken client
probably won't do.

Clearly I may be totally misguided, as Ceph is fundamentally
decentralized and SCSI isn't, but if PR-ish behavior comes even close to
what you're looking for, grabbing those ideas would look better to me
than designing your own wheel.

Just my $.02, of course.

Cheers,
Florian


Radosgw installation and administration docs

2012-06-12 Thread Florian Haas
Hi everyone,

I have a long flight ahead of me later this week and plan to be
spending some time on http://ceph.com/docs/master/ops/radosgw/ -- which
currently happens to be a bit, ahem, sparse.

There's currently not a lot of documentation on radosgw, and some of it
is inconsistent, so if one of the devs could answer the following
questions, I can put them in a more comprehensive document that should
make radosgw easier to set up and run.

1. Apache rewrite rule

Is the Apache configuration example listed in the man page correct and
authoritative? Specifically, it seems unclear to me whether the
rewrite engine rule:

(RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*)
/s3gw.fcgi?page=$1&params=$2&%{QUERY_STRING}
[E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L])

... is expected to work only for compatibility with S3 clients, or
whether this rewrite rule is also for Swift clients.
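To see what the rule's two capture groups actually produce, the regex can be exercised on its own. This only tests the pattern from the man page, not Apache's rewrite handling, and the bucket and object names are made up:

```python
import re

# The capture pattern from the man page's RewriteRule, applied to a
# request path: group 1 becomes page=$1, group 2 becomes params=$2.
rule = re.compile(r"^/([a-zA-Z0-9-_.]*)([/]?.*)")
m = rule.match("/mybucket/myobject")
print(m.groups())  # ('mybucket', '/myobject')
```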


2. FastCGI wrapper

The radosgw man page says it should be exec /usr/bin/radosgw -c
/etc/ceph/ceph.conf -n client.radosgw.gateway, whereas the Wiki
(http://ceph.com/wiki/RADOS_Gateway) omits the -n option. I didn't get
it to work without the -n option, so is it safe to say that it is required?


3. Apache/radosgw daemon/FastCGI wrapper interaction

Is it safe to say that we always need all three of these? The man page indicates
so, the Wiki makes no mention of the daemon started by the init script.


4. FastCGI configuration directives

The man page mentions:
FastCgiExternalServer /var/www/s3gw.fcgi -socket /tmp/radosgw.sock

The Wiki says:
FastCgiWrapper /var/www/s3gw.fcgi
FastCgiServer /usr/bin/radosgw

https://github.com/ceph/teuthology/blob/master/teuthology/task/apache.conf
(which was mentioned as an additional reference on IRC at some point) says:
FastCgiIPCDir /tmp/cephtest/apache/tmp/fastcgi_sock
FastCgiExternalServer /tmp/cephtest/apache/htdocs/rgw.fcgi -socket rgw_sock

Which of these is required/preferred? -socket option or not? Wrapper,
Server or ExternalServer? IPCDir?


5. Logging

What's the preferred way of adding debug logging for radosgw?

https://github.com/ceph/teuthology/blob/master/teuthology/task/apache.conf
mentions:

SetEnv RGW_LOG_LEVEL 20
SetEnv RGW_PRINT_CONTINUE yes
SetEnv RGW_SHOULD_LOG yes

... but it's unclear to me whether this is still current (I found no
trace of those envars in the source, but maybe I was looking in the
wrong place).

https://github.com/ceph/ceph/commit/452b1248a68f743ad55641722da80e3fd5ad2ae9
touched the debug rgw option. If that is the preferred way of doing
things now, where should you set this? In ceph.conf, in the
[client.radosgw.name] section?

Also, for each of these, where would the logging output end up?
/var/log/ceph? Apache error log? If so, only if the Apache LogLevel is
more verbose than info? Syslog?


6. Swift API: Keys

Is it correct to assume that for any Swift client to work, we must set a
Swift key for the user, like so?

radosgw-admin key create --key-type=swift --uid=user

If so, is the secret_key that that creates for the user:

  "swift_keys": [
    { "user": "user",
      "secret_key": "longbase64hash"}]}


... the same key that the swift command line client expects to be set
with th -K option?


7. Swift API: swift user name

When we call swift -U user, is that the verbatim user_id that we've
defined with radosgw-admin user create --uid=user_id? Or do we need
to set a prefix? Or define a separate Swift user ID?


8. Swift API: authentication version

When radosgw acts as the auth server for a Swift request, is it correct
to say that only v1.0 Swift authentication is supported, not v2.0?


9. Swift API: authentication URL

What's the correct Swift authentication URL for swift -A url? It
seems like it's http://rgw hostname:port/auth, but confirmation
would help.

10. radosgw OpenStack user information

From the radosgw-admin man page:
   --os-user=group:name
  The OpenStack user (only needed for use with OpenStack)
   --os-secret=key
  The OpenStack key

What's this meant to be used for? Keystone authentication? If so, is
there anything else that needs to be done for Keystone to work with
this, such as add an endpoint URI?

Please feel free to point me to existing documentation where it
exists. Your help is much appreciated. Thanks!

Cheers,
Florian


Re: Radosgw installation and administration docs

2012-06-12 Thread Florian Haas
Hi Yehuda,

thanks, that resolved a lot of questions for me. A few follow-up
comments below:

On 06/12/12 18:47, Yehuda Sadeh wrote:
 On Tue, Jun 12, 2012 at 3:44 AM, Florian Haas flor...@hastexo.com wrote:
 Hi everyone,

 I have a long flight ahead of me later this week and plan to be
 spending some time on http://ceph.com/docs/master/ops/radosgw/ -- which
 currently happens to be a bit, ahem, sparse.

 There's currently not a lot of documentation on radosgw, and some of it
 is inconsistent, so if one of the devs could answer the following
 questions, I can put them in a more comprehensive document that should
 make radosgw easier to set up and run.

 1. Apache rewrite rule

 Is the Apache configuration example listed in the man page correct and
 authoritative? Specifically, it seems unclear to me whether the
 rewrite engine rule:

 (RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*)
 /s3gw.fcgi?page=$1&params=$2&%{QUERY_STRING}
 
 We currently use a slightly different rule:
 
   RewriteRule ^/(.*)
 /radosgw.fcgi?params=$1&%{QUERY_STRING}
 [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]

Could you explain what happened to page?

 [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L])

 ... is expected to work only for compatibility with S3 clients, or
 whether this rewrite rule is also for Swift clients.
 
 Not really needed for Swift. It's required for passing in the
 HTTP_AUTHORIZATION env, however, Swift uses a different field which is
 not filtered out by apache.

OK.
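To make the capture groups concrete, here's a small illustrative sketch (my own, not from the thread) that emulates the older rule's two groups with sed. The character class is reordered to [A-Za-z0-9_.-] so the hyphen is unambiguously literal in POSIX ERE:

```shell
#!/bin/sh
# Emulate the capture groups of the old radosgw rewrite rule with sed,
# to show what "page" and "params" receive for an S3-style request.
url="/mybucket/myobject"
echo "$url" | sed -E 's|^/([A-Za-z0-9_.-]*)([/]?.*)$|page=\1 params=\2|'
```

The first group stops at the first slash, so for an S3-style request the bucket lands in page and the object path in params; the newer rule simply forwards the whole path in a single params value.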

 2. FastCGI wrapper

 The radosgw man page says it should be exec /usr/bin/radosgw -c
 /etc/ceph/ceph.conf -n client.radosgw.gateway, whereas the Wiki
 (http://ceph.com/wiki/RADOS_Gateway) omits the -n option. I didn't get
 it to work without the -n option, so is it safe to say that it is required?
 
 -n is required for specifying the ceph user that the gateway would
 use. Without it, it'd use client.admin, which is the default.

OK.

 3. Apache/radosgw daemon/FastCGI wrapper interaction

 Is it safe to say that we always need all three of these? The man page 
 indicates
 so, the Wiki makes no mention of the daemon started by the init script.
 
 The wrapper is not needed if not using apache for spawning the radosgw
 processes. E.g.,  when using the FastCgiExternalServer param:
 
 FastCgiExternalServer /var/www/radosgw.fcgi -socket
 /var/run/ceph/radosgw.client.radosgw

 4. FastCGI configuration directives

 The man page mentions:
 FastCgiExternalServer /var/www/s3gw.fcgi -socket /tmp/radosgw.sock

 The Wiki says:
 FastCgiWrapper /var/www/s3gw.fcgi
 FastCgiServer /usr/bin/radosgw

 https://github.com/ceph/teuthology/blob/master/teuthology/task/apache.conf
 (which was mentioned as an additional reference on IRC at some point) says:
 FastCgiIPCDir /tmp/cephtest/apache/tmp/fastcgi_sock
 FastCgiExternalServer /tmp/cephtest/apache/htdocs/rgw.fcgi -socket rgw_sock

 Which of these is required/preferred? -socket option or not? Wrapper,
 Server or ExternalServer? IPCDir?

 
 Either one is required. We prefer using the external server option. We
 found out that letting apache (or the fastcgi process manager) manage
 the processes was sub-optimal and introduced high latencies.

OK, I'm sticking to FastCgiExternalServer then.
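For reference, the pieces discussed so far could be combined into a virtual host along these lines; this is a sketch under assumptions (server name, socket path, and document root are placeholders, and directive details vary with the mod_fastcgi version), not a configuration taken from the thread:

```apache
<VirtualHost *:80>
    ServerName rgw.example.com

    # Hand requests to an externally managed radosgw process
    # (started via the init script), per the preference stated above.
    FastCgiExternalServer /var/www/radosgw.fcgi -socket /var/run/ceph/radosgw.client.radosgw

    RewriteEngine On
    RewriteRule ^/(.*) /radosgw.fcgi?params=$1&%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]

    <Directory /var/www>
        Options +ExecCGI
        AllowOverride All
    </Directory>
</VirtualHost>
```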


 5. Logging

 What's the preferred way of adding debug logging for radosgw?

 https://github.com/ceph/teuthology/blob/master/teuthology/task/apache.conf
 mentions:

 SetEnv RGW_LOG_LEVEL 20
 SetEnv RGW_PRINT_CONTINUE yes
 SetEnv RGW_SHOULD_LOG yes
 
 All are obsolete and defunct, and have corresponding ceph.conf options:
 
 debug rgw = 20
 rgw print continue = true
 rgw should log = true
 
 the latter will be replaced soon by:
 
 rgw enable usage log = true
 
 Note that only the 'debug rgw' option is really related to debug logs.
 The 'rgw print continue' option is a badly named option to control the
 use of 100-continue (should the radosgw 'print' -- as in FCGX_FPrintF
 -- the 100-continue when it should?). This can only work with a
 modified mod_fastcgi that supports that.
 The 'rgw should log' option sets whether we log each user operation to
 the dedicated pool (so that it can be analyzed later on for billing,
 etc.)

Yep. I was really only looking for what debug rgw does, and got
confused by the FastCGI envars.

 ... but it's unclear to me whether this is still current (I found no
 trace of those envars in the source, but maybe I was looking in the
 wrong place).

 https://github.com/ceph/ceph/commit/452b1248a68f743ad55641722da80e3fd5ad2ae9
 touched the debug rgw option. If that is the preferred way of doing
 things now, where should you set this? In ceph.conf, in the
 [client.radosgw.name] section?
 
 Either under the global section, or [client], or
 [client.radosgw.name]. Depends on how you organize your conf.

OK.
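Pulling those options together, the relevant ceph.conf fragment might look like this; the instance name client.radosgw.gateway is an assumption carried over from the man page example, not from Yehuda's reply:

```ini
[client.radosgw.gateway]
    ; verbose gateway debug logging
    debug rgw = 20
    ; only useful with a mod_fastcgi patched for 100-continue support
    rgw print continue = true
    ; log each user operation to the dedicated pool (usage/billing)
    rgw should log = true
```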

 Also, for each of these, where would the logging output end up?
 /var/log/ceph? Apache error log? If so, only if the Apache LogLevel is
 more verbose than info? Syslog?
 
 
 The debug log would end up wherever you specified

radosgw-admin: mildly confusing man page and usage message

2012-06-11 Thread Florian Haas
Hi,

just noticed that radosgw-admin comes with a bit of confusing content in
its man page and usage message:

EXAMPLES
   Generate a new user:

    $ radosgw-admin user gen --display-name="johnny rotten"
--email=joh...@rotten.com

As far as I remember user gen is gone, and it's now user create.
However:

radosgw-admin user create --display-name=test --email=test@demo
user_id was not specified, aborting

... is followed by a usage message that doesn't mention user_id anywhere
(the option string is --uid). So conceivably the example could also use
a mention of --uid.

Also, is there a way to retrieve the next available user_id or just
tell radosgw-admin to use max(user_id)+1?

If one of the Ceph guys could provide a quick comment on this, I can
send a patch to the man page RST. Thanks.

Cheers,
Florian


Re: radosgw-admin: mildly confusing man page and usage message

2012-06-11 Thread Florian Haas
On 06/11/12 23:39, Yehuda Sadeh wrote:
 If one of the Ceph guys could provide a quick comment on this, I can
 send a patch to the man page RST. Thanks.

 
 Minimum required to create a user:
 
 radosgw-admin user create --uid=<user id> --display-name=<display name>
 
 The user id is actually a user 'account' name, not necessarily a
 numeric value. The email param is optional.

Thanks. https://github.com/ceph/ceph/pull/13

Cheers,
Florian


Re: [Openstack] Ceph + OpenStack [HOW-TO]

2012-06-10 Thread Florian Haas
On 06/10/12 23:32, Sébastien Han wrote:
 Hello everyone,
 
 I recently posted on my website an introduction to ceph and the
 integration of Ceph in OpenStack.
 It could be really helpful since the OpenStack documentation has not
 dealt with it so far.
 
 Feel free to comment, express your opinions and share your personal
 experience about both of them.

This is mighty comprehensive. :) Thanks!

Cheers,
Florian


Ceph/OpenStack integration on Ubuntu precise: horribly broken, or am I doing something wrong?

2012-06-08 Thread Florian Haas
Hi everyone,

apologies for the cross-post, and not sure if this is new information.
I did do a cursory check of both list archives and didn't find
anything pertinent, so here goes. Feel free to point me to an existing
thread if I'm merely regurgitating something that's already known.

Either I'm doing something terribly wrong, or the current state of
OpenStack/Ceph integration in Ubuntu precise is somewhat suboptimal.
At least as far as sticking to packages available in Ubuntu repos is
concerned.

Steps to reproduce:

1. On an installation using 12.04 with current updates, create a RADOS pool.
2. Configure glance to use rbd as its backend storage.
3. Attempt to upload an image.

glance add name="Ubuntu 12.04 cloudimg amd64" is_public=true \
container_format=ovf disk_format=qcow2 < precise-server-cloudimg-amd64-disk1.img
Uploading image 'Ubuntu 12.04 cloudimg amd64'
=========================================[ 92%] 222.109521M/s, ETA  0h  0m  0s
Failed to add image. Got error:
Data supplied was not valid.
Details: 400 Bad Request

The server could not comply with the request since it is either
malformed or otherwise incorrect.

 Error uploading image: (NameError): global name 'rados' is not defined

Digging around in glance/store/rbd.py yields this:

try:
import rados
import rbd
except ImportError:
pass

I will go so far as to say that the error handling here could be
improved -- however, the Swift store implementation seems to do the same.

Now, in Ubuntu rados.py and rbd.py ship in the python-ceph package,
which is available in universe, but has an unresolvable dependency on
librgw1. librgw1 apparently had been in Ubuntu for quite a while, but
was dropped just before the Essex release:

https://launchpad.net/ubuntu/precise/amd64/radosgw

... and if I read
http://changelogs.ubuntu.com/changelogs/pool/main/c/ceph/ceph_0.41-1ubuntu2/changelog
correctly, then the rationale for it was to drop radosgw since
libfcgi is not in main and the code may not be suitable for LTS. (I
wonder why this wasn't factored out into a separate package then, as
was apparently the case for ceph-mds).

Does this really mean that radosgw functionality was dropped from
Ubuntu because it wasn't considered ready for main, and now it
completely breaks a package in universe that's essential for
Ceph/OpenStack integration? AFAICT the only thing that would be
unaffected by this would be nova-volume (now cinder) which rather than
using the Ceph Python bindings just calls out to the rados and rbd
binaries. But both the glance RBD store and the RADOS Swift and S3
frontends (via radosgw) would be affected.

This can of course all be fixed by using upstream packages from the
Ceph guys (thanks to Sébastien Han for pointing that out to me).

Anyone able to confirm or refute these findings? Should there be an
Ubuntu bug for this? If so, against what package?

Cheers,
Florian


Re: ceph rbd crashes/stalls while random write 4k blocks

2012-05-25 Thread Florian Haas
On Fri, May 25, 2012 at 8:47 AM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
 Am 24.05.2012 16:19, schrieb Florian Haas:
 On Thu, May 24, 2012 at 4:09 PM, Stefan Priebe - Profihost AG
 s.pri...@profihost.ag wrote:
 Take a look at these to see if anything looks familiar:

 http://oss.sgi.com/bugzilla/show_bug.cgi?id=922
 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498
 http://oss.sgi.com/archives/xfs/2011-11/msg00400.html

 These are solved by using 3.0.20.

 ... or so Christoph says, but comment #4 in bug 922 seems to indicate 
 otherwise.

 I'm sorry, you're absolutely right. BUT XFS had some regressions with
 xlog_grant_log_space since 2.6.28 which were fixed in 3.0.X by reverting
 back to a kernel thread instead of workers. I was working with Christoph
 and Dave on this problem and it took me nearly a whole month to track
 that down (git commit c7eead1e118fb7e34ee8f5063c3c090c054c3820). In this
 case (#922) it seems it is really related to a too-small log. But I
 don't have a too-small log in my ceph case ;-)

Hmmm. So what's Chinner saying about this one? Should we move this
discussion to an XFS list?

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now


Re: ceph rbd crashes/stalls while random write 4k blocks

2012-05-24 Thread Florian Haas
Stefan,

On 05/24/12 13:07, Stefan Priebe - Profihost AG wrote:
 Hi list,

 i'm still testing ceph rbd with kvm. Right now i'm testing a rbd block
 device within a network booted kvm.

 Sequential write/reads and random reads are fine. No problems so far.

 But when i trigger lots of 4k random writes all of them stall after
 short time and i get 0 iops and 0 transfer.

 used command:
 fio --filename=/dev/vda --direct=1 --rw=randwrite --bs=4k --size=20G
 --numjobs=50 --runtime=30 --group_reporting --name=file1

 Then some time later i see this call trace:

 INFO: task ceph-osd:3065 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 ceph-osdD 8803b0e61d88 0  3065  1 0x0004
  88032f3ab7f8 0086 8803bffdac08 8803
  8803b0e61820 00010800 88032f3abfd8 88032f3aa010
  88032f3abfd8 00010800 81a0b020 8803b0e61820
 Call Trace:
  [815e0e1a] schedule+0x3a/0x60
  [815e127d] schedule_timeout+0x1fd/0x2e0
  [812696c4] ? xfs_iext_bno_to_ext+0x84/0x160
  [81074db1] ? down_trylock+0x31/0x50
  [812696c4] ? xfs_iext_bno_to_ext+0x84/0x160
  [815e20b9] __down+0x69/0xb0
  [8128c4a6] ? _xfs_buf_find+0xf6/0x280
  [81074e6b] down+0x3b/0x50

sorry I'm coming a bit late to the various threads you've posted
recently, but on this particular issue: what kernel are your OSDs
running on, and do these hung tasks occur if you're using a local
filesystem other than XFS?

As of late XFS has occasionally been producing seemingly random kernel
hangs. Your call trace doesn't have the signature entries from xfssyncd
that identify a particular problem that I've been struggling with
lately, but you just might be affected by some other effect of the same
root issue.

Take a look at these to see if anything looks familiar:

http://oss.sgi.com/bugzilla/show_bug.cgi?id=922
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498
http://oss.sgi.com/archives/xfs/2011-11/msg00400.html

Not sure if this helps at all; just thought I might pitch that in.

Cheers,
Florian


Re: ceph rbd crashes/stalls while random write 4k blocks

2012-05-24 Thread Florian Haas
On Thu, May 24, 2012 at 4:09 PM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
 Take a look at these to see if anything looks familiar:

 http://oss.sgi.com/bugzilla/show_bug.cgi?id=922
 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/979498
 http://oss.sgi.com/archives/xfs/2011-11/msg00400.html

 These are solved by using 3.0.20.

... or so Christoph says, but comment #4 in bug 922 seems to indicate otherwise.

Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now


[PATCH] doc: fix snapshot creation/deletion syntax in rbd man page (trivial)

2012-02-17 Thread Florian Haas
Creating a snapshot requires using rbd snap create,
as opposed to just rbd create. Also for purposes of
clarification, add note that removing a snapshot similarly
requires rbd snap rm.

Thanks to Josh Durgin for the explanation on IRC.
---
 man/rbd.8 |   10 +-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/man/rbd.8 b/man/rbd.8
index 0278137..b59c2f6 100644
--- a/man/rbd.8
+++ b/man/rbd.8
@@ -194,7 +194,15 @@ To create a new snapshot:
 .sp
 .nf
 .ft C
-rbd create mypool/myimage@mysnap
+rbd snap create mypool/myimage@mysnap
+.ft P
+.fi
+.sp
+To delete a snapshot:
+.sp
+.nf
+.ft C
+rbd snap rm mypool/myimage@mysnap
 .ft P
 .fi
 .sp
-- 
1.7.5.4



Re: Nova on RBD Device

2012-02-07 Thread Florian Haas
On Tue, Feb 7, 2012 at 8:01 PM, Mandell Degerness
mand...@pistoncloud.com wrote:
 Can anyone point me in the right direction for setting up Nova so that
 it allocates disk space on RBD device(s) rather than on local disk as
 defined in the --instances_path flag?

 I've already got nova-volume working with RBD.

 I suspect that I need to modify _cache_image and _create_image in
 nova/virt/libvirt/connection.py.

Hmmm. Wouldn't that most likely be a question for the openstack list?
http://wiki.openstack.org/MailingLists for details on how to
subscribe, if you haven't already.

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now


Re: interesting point on btrfs, xfs, ext4

2012-01-25 Thread Florian Haas
On Wed, Jan 25, 2012 at 10:15 AM, Tomasz Paszkowski ss7...@gmail.com wrote:
 http://www.youtube.com/watch?v=FegjLbCnoBw

I sat in that talk at LCA and can highly recommend it. Jon Corbet
wrote a piece on LWN about it too (currently subscribers only):
https://lwn.net/Articles/476263/

Cheers,
Florian


[PATCH 0/2] Add resource agents to debian build, trivial CP error

2012-01-05 Thread Florian Haas
Hi,

please consider two follow-up patches to the OCF resource agents: the
first adds them to the Debian build, as a separate package
ceph-resource-agents that depends on resource-agents, the second
fixes a trivial (and embarassing, however harmless) cut and paste
error. Thanks!

Cheers,
Florian



[PATCH 1/2] debian: build ceph-resource-agents

2012-01-05 Thread Florian Haas
---
 debian/ceph-resource-agents.install |1 +
 debian/control  |   13 +
 debian/rules|2 ++
 3 files changed, 16 insertions(+), 0 deletions(-)
 create mode 100644 debian/ceph-resource-agents.install

diff --git a/debian/ceph-resource-agents.install b/debian/ceph-resource-agents.install
new file mode 100644
index 000..30843f6
--- /dev/null
+++ b/debian/ceph-resource-agents.install
@@ -0,0 +1 @@
+usr/lib/ocf/resource.d/ceph/*
diff --git a/debian/control b/debian/control
index e8c4d30..0f57ad3 100644
--- a/debian/control
+++ b/debian/control
@@ -112,6 +112,19 @@ Description: debugging symbols for ceph-common
  .
  This package contains the debugging symbols for ceph-common.
 
+Package: ceph-resource-agents
+Architecture: linux-any
+Recommends: pacemaker
+Priority: extra
+Depends: ceph (= ${binary:Version}), ${misc:Depends}, resource-agents
+Description: OCF-compliant resource agents for Ceph
+ Ceph is a distributed storage and network file system designed to provide
+ excellent performance, reliability, and scalability.
+ .
+ This package contains the resource agents (RAs) which integrate
+ Ceph with OCF-compliant cluster resource managers,
+ such as Pacemaker.
+
 Package: librados2
 Conflicts: librados, librados1
 Replaces: librados, librados1
diff --git a/debian/rules b/debian/rules
index 4f3fe62..0bc594a 100755
--- a/debian/rules
+++ b/debian/rules
@@ -20,6 +20,8 @@ endif
 
 export DEB_HOST_ARCH  ?= $(shell dpkg-architecture -qDEB_HOST_ARCH)
 
+extraopts += --with-ocf
+
 ifeq ($(DEB_HOST_ARCH), armel)
   # armel supports ARMv4t or above instructions sets.
   # libatomic-ops is only usable with Ceph for ARMv6 or above.
-- 
1.7.5.4



[PATCH 0/2] Add Ceph integration with OCF-compliant HA resource managers

2011-12-29 Thread Florian Haas
Hi everyone,

please consider reviewing the following patches. These add
OCF-compliant cluster resource agent functionality to Ceph, allowing
MDS, OSD and MON to run as cluster resources under compliant managers
(such as Pacemaker, http://www.clusterlabs.org).

This new stuff does not build nor install by default; you must enable
with the --with-ocf flag. That same flag maps to a new RPM build
conditional (--with ocf) which rolls the resource agents into a
separate subpackage, ceph-resource-agents.

These patches require the tiny patch to the init script that I posted
here a few days ago.
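For illustration, once installed the agents could be configured under Pacemaker roughly like this (crm shell syntax; the resource names and operation timeouts are my own, not part of the patch series):

```
primitive p_ceph-mon ocf:ceph:mon \
    op monitor interval="10s" timeout="20s"
primitive p_ceph-osd ocf:ceph:osd \
    op monitor interval="10s" timeout="20s"
```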

Just in case you're interested, all the above changes (including the
init script patch) since commit e18b1c9734e88e3b779ba2d70cdd54f8fb94743d:

  rgw: removing swift user index when removing user (2011-12-28 17:00:19 -0800)

are also available in my GitHub repo at:
  git://github.com/fghaas/ceph ocf-ra

Florian Haas (3):
  init script: be LSB compliant for exit code on status
  Add OCF-compliant resource agent for Ceph daemons
  Spec: conditionally build ceph-resource-agents package

 ceph.spec.in|   22 ++
 configure.ac|8 ++
 src/Makefile.am |4 +-
 src/init-ceph.in|7 ++-
 src/ocf/Makefile.am |   23 +++
 src/ocf/ceph.in |  177 +++
 6 files changed, 238 insertions(+), 3 deletions(-)
 create mode 100644 src/ocf/Makefile.am
 create mode 100644 src/ocf/ceph.in

Hope this is useful. All feedback is much appreciated. Thanks!

Cheers,
Florian


[PATCH 2/2] Spec: conditionally build ceph-resource-agents package

2011-12-29 Thread Florian Haas
Put OCF resource agents in a separate subpackage,
to be enabled with a separate build conditional
(--with ocf).

Make the subpackage depend on the resource-agents
package, which provides the ocf-shellfuncs library
that the Ceph RAs use.

Signed-off-by: Florian Haas flor...@hastexo.com
---
 ceph.spec.in |   22 ++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/ceph.spec.in b/ceph.spec.in
index b0f3c3a..3950fd1 100644
--- a/ceph.spec.in
+++ b/ceph.spec.in
@@ -1,5 +1,6 @@
 %define with_gtk2 %{?_with_gtk2: 1} %{!?_with_gtk2: 0}
 
+%bcond_with ocf
 # it seems there is no usable tcmalloc rpm for x86_64; parts of
 # google-perftools don't compile on x86_64, and apparently the
 # decision was to not build the package at all, even if tcmalloc
@@ -130,6 +131,19 @@ gcephtool is a graphical monitor for the clusters running the Ceph distributed
 file system.
 %endif
 
+%if %{with ocf}
+%package resource-agents
+Summary:   OCF-compliant resource agents for Ceph daemons
+Group: System Environment/Base
+License:   LGPLv2
+Requires:  %{name} = %{version}
+Requires:  resource-agents
+%description resource-agents
+Resource agents for monitoring and managing Ceph daemons
+under Open Cluster Framework (OCF) compliant resource
+managers such as Pacemaker.
+%endif
+
 %package -n librados2
 Summary:   RADOS distributed object store client library
 Group: System Environment/Libraries
@@ -211,6 +225,7 @@ MY_CONF_OPT=$MY_CONF_OPT --without-gtk2
--docdir=%{_docdir}/ceph \
--without-hadoop \
$MY_CONF_OPT \
+   %{?_with_ocf} \
	%{?with_tcmalloc:--with-tcmalloc} %{!?with_tcmalloc:--without-tcmalloc}
 
 # fix bug in specific version of libedit-devel
@@ -415,6 +430,13 @@ fi
 %endif
 
 
#
+%if %{with ocf}
+%files resource-agents
+%defattr(0755,root,root,-)
+/usr/lib/ocf/resource.d/%{name}/*
+%endif
+
+#
 %files -n librados2
 %defattr(-,root,root,-)
 %{_libdir}/librados.so.*
-- 
1.7.5.4



[PATCH 1/2] Add OCF-compliant resource agent for Ceph daemons

2011-12-29 Thread Florian Haas
Add a wrapper around the ceph init script that makes
MDS, OSD and MON configurable as Open Cluster Framework
(OCF) compliant cluster resources. Allows Ceph
daemons to tie in with cluster resource managers that
support OCF, such as Pacemaker (http://www.clusterlabs.org).

Disabled by default, configure --with-ocf to enable.

Signed-off-by: Florian Haas flor...@hastexo.com
---
 configure.ac|8 ++
 src/Makefile.am |4 +-
 src/ocf/Makefile.am |   23 +++
 src/ocf/ceph.in |  177 +++
 4 files changed, 210 insertions(+), 2 deletions(-)
 create mode 100644 src/ocf/Makefile.am
 create mode 100644 src/ocf/ceph.in

diff --git a/configure.ac b/configure.ac
index 60f998c..e334a24 100644
--- a/configure.ac
+++ b/configure.ac
@@ -277,6 +277,12 @@ AM_CONDITIONAL(WITH_LIBATOMIC, [test $HAVE_ATOMIC_OPS = 1])
 #[],
 #[with_newsyn=no])
 
+AC_ARG_WITH([ocf],
+[AS_HELP_STRING([--with-ocf], [build OCF-compliant cluster resource agent])],
+,
+[with_ocf=no])
+AM_CONDITIONAL(WITH_OCF, [ test $with_ocf = yes ])
+
 # Checks for header files.
 AC_HEADER_DIRENT
 AC_HEADER_STDC
@@ -375,6 +381,8 @@ AM_PATH_PYTHON([2.4],
 AC_CONFIG_HEADERS([src/acconfig.h])
 AC_CONFIG_FILES([Makefile
src/Makefile
+   src/ocf/Makefile
+   src/ocf/ceph
man/Makefile
ceph.spec])
 AC_OUTPUT
diff --git a/src/Makefile.am b/src/Makefile.am
index 748425e..8026e17 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -1,6 +1,6 @@
 AUTOMAKE_OPTIONS = gnu
-SUBDIRS =
-DIST_SUBDIRS = gtest
+SUBDIRS = ocf
+DIST_SUBDIRS = gtest ocf
 CLEANFILES =
 bin_PROGRAMS =
 # like bin_PROGRAMS, but these targets are only built for debug builds
diff --git a/src/ocf/Makefile.am b/src/ocf/Makefile.am
new file mode 100644
index 000..9be40ec
--- /dev/null
+++ b/src/ocf/Makefile.am
@@ -0,0 +1,23 @@
+EXTRA_DIST = ceph.in Makefile.in
+
+if WITH_OCF
+# The root of the OCF resource agent hierarchy
+# Per the OCF standard, it's always lib,
+# not lib64 (even on 64-bit platforms).
+ocfdir = $(prefix)/lib/ocf
+
+# The ceph provider directory
+radir = $(ocfdir)/resource.d/$(PACKAGE_NAME)
+
+ra_SCRIPTS = ceph
+
+install-data-hook:
+   $(LN_S) ceph $(DESTDIR)$(radir)/osd
+   $(LN_S) ceph $(DESTDIR)$(radir)/mds
+   $(LN_S) ceph $(DESTDIR)$(radir)/mon
+
+uninstall-hook:
+   rm -f $(DESTDIR)$(radir)/osd
+   rm -f $(DESTDIR)$(radir)/mds
+   rm -f $(DESTDIR)$(radir)/mon
+endif
diff --git a/src/ocf/ceph.in b/src/ocf/ceph.in
new file mode 100644
index 000..9db1bc9
--- /dev/null
+++ b/src/ocf/ceph.in
@@ -0,0 +1,177 @@
+#!/bin/sh
+
+# Initialization:
+: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
+. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
+
+# Convenience variables
+# When sysconfdir isn't passed in as a configure flag,
+# it's defined in terms of prefix
+prefix=@prefix@
+CEPH_INIT=@sysconfdir@/init.d/ceph
+
+ceph_meta_data() {
+    local longdesc
+    local shortdesc
+    case $__SCRIPT_NAME in
+	osd)
+	    longdesc="Wraps the ceph init script to provide an OCF resource agent that manages and monitors the Ceph OSD service."
+	    longdesc="Manages a Ceph OSD instance."
+	    ;;
+	mds)
+	    longdesc="Wraps the ceph init script to provide an OCF resource agent that manages and monitors the Ceph MDS service."
+	    longdesc="Manages a Ceph MDS instance."
+	    ;;
+	mon)
+	    longdesc="Wraps the ceph init script to provide an OCF resource agent that manages and monitors the Ceph MON service."
+	    longdesc="Manages a Ceph MON instance."
+	    ;;
+    esac
+
+    cat <<EOF
+<?xml version="1.0"?>
+<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
+<resource-agent name="${__SCRIPT_NAME}" version="0.1">
+  <version>0.1</version>
+  <longdesc lang="en">${longdesc}</longdesc>
+  <shortdesc lang="en">${shortdesc}</shortdesc>
+  <parameters/>
+  <actions>
+    <action name="start"        timeout="20" />
+    <action name="stop"         timeout="20" />
+    <action name="monitor"      timeout="20"
+                                interval="10" />
+    <action name="meta-data"    timeout="5" />
+    <action name="validate-all" timeout="20" />
+  </actions>
+</resource-agent>
+EOF
+}
+
+ceph_action() {
+    local init_action
+    init_action=$1
+
+    case ${__SCRIPT_NAME} in
+	osd|mds|mon)
+	    ocf_run $CEPH_INIT $init_action ${__SCRIPT_NAME}
+	    ;;
+	*)
+	    ocf_run $CEPH_INIT $init_action
+	    ;;
+    esac
+}
+
+ceph_validate_all() {
+    # Do we have the ceph init script?
+    check_binary @sysconfdir@/init.d/ceph
+
+    # Do we have a configuration file?
+    [ -e @sysconfdir@/ceph/ceph.conf ] || exit $OCF_ERR_INSTALLED
+}
+
+ceph_monitor() {
+    local rc
+
+    ceph_action status
+
+    # 0: running, and fully caught up with master
+    # 3: gracefully stopped
+    # any other: error
+    case $? in
+	0)
+	    rc=$OCF_SUCCESS
+	    ocf_log debug "Resource is running"
Trivial patch to fix init script LSB compliance

2011-12-27 Thread Florian Haas
Hi everyone,

please consider merging the following trivial patch that makes the
ceph init script return the proper LSB exit code (3) for the status
action if the service is gracefully stopped, and only return 1 (as
before) if the service has died and left its PID file hanging around.

Who cares about the exit code? Pacemaker does
(http://www.clusterlabs.org). Pacemaker is a high-availability
resource manager that can be used for monitoring and recovering
resources in-place, and as per a brief discussion in #ceph that I had
with Greg before the holidays, no such in-place recovery currently
exists within Ceph itself. Integrating init script based services with
Pacemaker is trivial if the script complies with the exit codes that
the LSB spec specifies.
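The behavior described above can be sketched as follows; this is an illustrative reimplementation of the status logic, not the actual patch, and the PID file path is a placeholder:

```shell
#!/bin/sh
# Illustrative sketch of LSB-compliant "status" exit codes (not the
# actual patch). PIDFILE is a placeholder path, not Ceph's real one.
PIDFILE="${PIDFILE:-/var/run/ceph/osd.pid}"

lsb_status() {
    if [ -e "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        return 0    # LSB: program is running
    elif [ -e "$PIDFILE" ]; then
        return 1    # LSB: program is dead, but PID file exists
    else
        return 3    # LSB: program is not running (gracefully stopped)
    fi
}
```

Exit code 3 tells an LSB-aware resource manager like Pacemaker that the service is cleanly stopped, whereas a blanket 1 would be interpreted as a failure requiring recovery.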

Feedback on this is much appreciated. Hope this is useful.

Cheers,
Florian

[PATCH] init script: be LSB compliant for exit code on status