Re: [PATCH 3/3] ceph: fix vmtruncate deadlock

2013-02-26 Thread Yan, Zheng
On 02/26/2013 01:00 PM, Sage Weil wrote:
 On Tue, 26 Feb 2013, Yan, Zheng wrote:
 It looks to me like truncates can get queued for later, so that's not the 
 case?
 And how could the client receive a truncate while in the middle of
 writing? Either it's got the write caps (in which case nobody else can
 truncate), or it shouldn't be writing...I suppose if somebody enables
 lazyIO that would do it, but again in that case I imagine we'd rather
 do the truncate before writing:

 Commit 22cddde104 makes buffered writes get/put caps for individual pages, so
 the MDS can revoke the write caps in the middle of a write.


 I think it's better to do vmtruncate after write finishes.
 What if you're writing past the truncate point? Then your written data
 would be truncated off even though it should have been a file
 extension.

 I was wrong; I thought the order of write and truncate was not important.
 Now I'm worried about the correctness of commit 22cddde104; we probably
 should revert that commit.
 
 I'm sorry I haven't had time to dig into this thread.  Do you think we 
 should revert that commit for this merge window?

No, that commit fixes an i_size update bug, which is more important than the
deadlock. But I think the eventual fix is to revert that commit and also remove
the optimization that drops Fw caps early (to avoid slow cap revocation caused
by balance_dirty_pages).

 
 My gut feeling is that the whole locking scheme here needs to be reworked.  
 I suspect where we'll end up is something more like XFS, where there is an 
 explicit truncate (or other) mutex that is separate from i_mutex.  The 
 fact that we were relying on i_mutex serialization from the VFS was 
 fragile and (as we can see) ultimately a flawed approach.  And the locking 
 and races between writes, mmap, and truncation in the page cache code are 
 fragile and annoying.  The only good thing going for us is that fsx is 
 pretty good at stress testing the code.  :)

If we want to keep the write operation atomic, write and vmtruncate need to be
protected by a mutex. I don't think introducing a new mutex makes things simpler.
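
Roughly, the ordering that has to hold (a sketch only, not the actual fs/ceph
code; the helper names here are made up):

	mutex_lock(&inode->i_mutex);
	apply_pending_vmtruncate(inode);          /* hypothetical: apply any queued truncate first */
	do_buffered_write(inode, buf, pos, len);  /* hypothetical: data written now survives the truncate */
	mutex_unlock(&inode->i_mutex);

Whichever lock is used, a truncate applied after the write could otherwise chop
off data that was written past the truncation point.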

 
 Oh, you're right about the locking (I think, anyway?). However, i_size
 doesn't protect us — we might for instance have truncated to zero and
 then back up to the original file size. :( But that doesn't look like
 it's handled correctly anyway (racy) — am I misreading that badly or
 is the infrastructure in fact broken in that case?

 You are right. Probably we can fix the race by disallowing both read and
 write when there is a pending truncate.
 
 FWIW, I think the high-level strategy from before should still be basically
 sound: we can queue a truncation without blocking, and any write path will 
 apply a pending truncation under the appropriate lock.  But I think it's 
 going to mean carefully documenting the locking requirements for the 
 various paths and working out a new locking structure.
 
 Yan, is this something you are able to put some time into?  I would like 
 to work on it, but it's going to be hard for me to find the time in the 
 next several weeks to get to it.

OK, I will work on it.

Regards
Yan, Zheng



maintenance on osd host

2013-02-26 Thread Stefan Priebe - Profihost AG
Hi list,

how can I do a short maintenance, like a kernel upgrade, on an OSD host?
Right now Ceph starts to backfill immediately if I say:
ceph osd out 41
...

Without the ceph osd out command, all clients hang for as long as Ceph does not
know that the host was rebooted.

I tried
ceph osd set nodown and ceph osd set noout
but this doesn't make any difference

Stefan


Re: maintenance on osd host

2013-02-26 Thread Andrey Korolyov
On Tue, Feb 26, 2013 at 6:56 PM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
 Hi list,

 how can I do a short maintenance, like a kernel upgrade, on an OSD host?
 Right now Ceph starts to backfill immediately if I say:
 ceph osd out 41
 ...

 Without the ceph osd out command, all clients hang for as long as Ceph does not
 know that the host was rebooted.

 I tried
 ceph osd set nodown and ceph osd set noout
 but this doesn't make any difference


Hi Stefan,

in my practice, nodown will freeze all I/O until the OSD returns. Killing the
osd process and setting ``mon osd down out interval'' large enough will do the
trick - you'll only get two small freezes from peering, one at the start and
one at the end. It is also very strange that your clients hang for a long time -
I set non-optimal values on purpose and was not able to observe a re-peering
process longer than a minute.
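
For reference, that interval is a [mon] option in ceph.conf; something along
these lines (the value is just an example, pick whatever comfortably covers a
maintenance window):

	[mon]
		mon osd down out interval = 3600   ; seconds before a down osd gets marked out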


Re: OpenStack summit : Ceph design session

2013-02-26 Thread Neil Levine
It's been an embryonic internal Inktank conversation and Nick Barcet
at eNovance mentioned some ideas when we last met. Will try and put
together a blueprint soon.

Neil

On Mon, Feb 25, 2013 at 2:04 AM, Loic Dachary l...@dachary.org wrote:
 Hi Neil,

 I've added RBD backups to secondary clusters within Openstack to the list of
 blueprints. Do you have links to mail threads / chat logs related to this 
 topic ?

 I moved the content of the session to an etherpad for collaborative editing

 https://etherpad.openstack.org/roadmap-for-ceph-integration-with-openstack

 and it is now linked from

 http://summit.openstack.org/cfp/details/38

 Cheers

 On 02/25/2013 07:12 AM, Neil Levine wrote:
 Thanks for taking the lead on this Loic.

 As a blueprint, I'd like to look at RBD backups to secondary clusters
 within Openstack. Nick Barcet and others have mentioned ideas for this
 now that Cinder is multi-cluster aware.

 Neil

 On Sun, Feb 24, 2013 at 3:16 PM, Josh Durgin josh.dur...@inktank.com wrote:
 On 02/23/2013 02:33 AM, Loic Dachary wrote:

 Hi,

 In anticipation of the next OpenStack summit
 http://www.openstack.org/summit/portland-2013/, I proposed a session to
 discuss OpenStack and Ceph integration. Our meeting during FOSDEM earlier
 this month was a great experience although it was planned at the last
 minute. I hope we can organize something even better for the summit.

 For developers and contributors to both Ceph and OpenStack such as myself,
 it would be a great opportunity to figure out a sensible roadmap for the
 next six months. I realize this roadmap is already clear to Josh Durgin and
 other Ceph / OpenStack developers who have been passionately invested in both
 projects for a long time. However, I am new to both projects and such a
 session would be a valuable guide and highly motivating.

 http://summit.openstack.org/cfp/details/38

 What do you think ?


 Sounds like a great idea!
 Thanks for putting together the session!

 Josh


 --
 Loïc Dachary, Artisan Logiciel Libre



Re: maintenance on osd host

2013-02-26 Thread Sage Weil
On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote:
 Hi list,
 
 how can I do a short maintenance, like a kernel upgrade, on an OSD host?
 Right now Ceph starts to backfill immediately if I say:
 ceph osd out 41
 ...

 Without the ceph osd out command, all clients hang for as long as Ceph does not
 know that the host was rebooted.

 I tried
 ceph osd set nodown and ceph osd set noout
 but this doesn't make any difference

For a temporary event like this, you want the osd to be down (so that io 
can continue with remaining replicas) but NOT to mark it out (so that data 
doesn't get rebalanced).  The simplest way to do that is

 ceph osd set noout
 killall ceph-osd
 .. reboot ..

Just remember to do

 ceph osd unset noout

when you are done so that future osds that fail will get marked out on 
their own after the 5 minute (default) interval.
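
Putting it together, a maintenance pass might look something like this (a
sketch; adjust for however your osds are started and stopped):

	ceph osd set noout         # keep down osds from being marked out
	killall ceph-osd           # or stop them via the init script
	# ... reboot / kernel upgrade ...
	# the osds restart and rejoin on boot; wait for 'ceph -s' to settle
	ceph osd unset noout       # restore normal mark-out behaviour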

sage



Re: maintenance on osd host

2013-02-26 Thread Stefan Priebe - Profihost AG
But that results in a 1-3s hiccup for all KVM VMs. This is not what I want.

Stefan

Am 26.02.2013 um 18:06 schrieb Sage Weil s...@inktank.com:

 On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote:
 Hi list,
 
 how can I do a short maintenance, like a kernel upgrade, on an OSD host?
 Right now Ceph starts to backfill immediately if I say:
 ceph osd out 41
 ...

 Without the ceph osd out command, all clients hang for as long as Ceph does not
 know that the host was rebooted.

 I tried
 ceph osd set nodown and ceph osd set noout
 but this doesn't make any difference
 
 For a temporary event like this, you want the osd to be down (so that io 
 can continue with remaining replicas) but NOT to mark it out (so that data 
 doesn't get rebalanced).  The simplest way to do that is
 
 ceph osd set noout
 killall ceph-osd
 .. reboot ..
 
 Just remember to do
 
 ceph osd unset noout
 
 when you are done so that future osds that fail will get marked out on 
 their own after the 5 minute (default) interval.
 
 sage
 


Re: maintenance on osd host

2013-02-26 Thread Sage Weil
On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote:
 But that results in a 1-3s hiccup for all KVM VMs. This is not what I want.

You can do

 kill $pid
 ceph osd down $osdid

(or even reverse the order, if the sequence is quick enough) to avoid 
waiting for the failure detection delay.  But if the OSDs are going down, 
then the peering has to happen one way or another.

sage


 
 Stefan
 
 Am 26.02.2013 um 18:06 schrieb Sage Weil s...@inktank.com:
 
  On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote:
  Hi list,
  
  how can I do a short maintenance, like a kernel upgrade, on an OSD host?
  Right now Ceph starts to backfill immediately if I say:
  ceph osd out 41
  ...

  Without the ceph osd out command, all clients hang for as long as Ceph does not
  know that the host was rebooted.

  I tried
  ceph osd set nodown and ceph osd set noout
  but this doesn't make any difference
  
  For a temporary event like this, you want the osd to be down (so that io 
  can continue with remaining replicas) but NOT to mark it out (so that data 
  doesn't get rebalanced).  The simplest way to do that is
  
  ceph osd set noout
  killall ceph-osd
  .. reboot ..
  
  Just remember to do
  
  ceph osd unset noout
  
  when you are done so that future osds that fail will get marked out on 
  their own after the 5 minute (default) interval.
  
  sage
  


Re: Crash and strange things on MDS

2013-02-26 Thread Kevin Decherf
On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote:
 On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote:
  On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote:
  Looks like you've got ~424k dentries pinned, and it's trying to keep
  400k inodes in cache. So you're still a bit oversubscribed, yes. This
  might just be the issue where your clients are keeping a bunch of
  inodes cached for the VFS (http://tracker.ceph.com/issues/3289).
 
  Thanks for the analysis. We use only one ceph-fuse client at this time,
  which runs all the high-load commands like rsync, tar and cp on a huge
  number of files. Well, I will replace it with the kernel client.
 
 Oh, that bug is just an explanation of what's happening; I believe it
 exists in the kernel client as well.

After setting the mds cache size to 900k, storms are gone.
However we continue to observe high latency on some clients (always the
same clients): each IO takes between 40 and 90ms (for example with
Wordpress, it takes ~20 seconds to load all needed files...).
With a non-laggy client, IO requests take less than 1ms.
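
For reference, the cache bump mentioned above is an [mds] setting in ceph.conf,
roughly like this (placement under [mds] is the usual spot; the value is the
one from this thread):

	[mds]
		mds cache size = 900000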

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: Crash and strange things on MDS

2013-02-26 Thread Gregory Farnum
On Tue, Feb 26, 2013 at 9:57 AM, Kevin Decherf ke...@kdecherf.com wrote:
 On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote:
 On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote:
  On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote:
  Looks like you've got ~424k dentries pinned, and it's trying to keep
  400k inodes in cache. So you're still a bit oversubscribed, yes. This
  might just be the issue where your clients are keeping a bunch of
  inodes cached for the VFS (http://tracker.ceph.com/issues/3289).
 
   Thanks for the analysis. We use only one ceph-fuse client at this time,
   which runs all the high-load commands like rsync, tar and cp on a huge
   number of files. Well, I will replace it with the kernel client.

 Oh, that bug is just an explanation of what's happening; I believe it
 exists in the kernel client as well.

 After setting the mds cache size to 900k, storms are gone.
 However we continue to observe high latency on some clients (always the
 same clients): each IO takes between 40 and 90ms (for example with
 Wordpress, it takes ~20 seconds to load all needed files...).
 With a non-laggy client, IO requests take less than 1ms.

I can't be sure from that description, but it sounds like you've got
one client which is generally holding the RW caps on the files, and
then another client which comes in occasionally to read those same
files. That requires the first client to drop its caps, and involves a
couple round-trip messages and is going to take some time — this is an
unavoidable consequence if you have clients sharing files, although
there's probably still room for us to optimize.

Can you describe your client workload in a bit more detail?
-Greg


Re: [PATCH] libceph: fix a osd request memory leak

2013-02-26 Thread Josh Durgin

Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 02/25/2013 02:36 PM, Alex Elder wrote:

If an invalid layout is provided to ceph_osdc_new_request(), its
call to calc_layout() might return an error.  At that point in the
function we've already allocated an osd request structure, so we
need to free it (drop a reference) in the event such an error
occurs.

The only other value calc_layout() will return is 0, so make that
explicit in the successful case.

This resolves:
 http://tracker.ceph.com/issues/4240

Signed-off-by: Alex Elder el...@inktank.com
---
  net/ceph/osd_client.c |6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 39629b6..5daced2 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -109,7 +109,7 @@ static int calc_layout(struct ceph_vino vino,
 	snprintf(req->r_oid, sizeof(req->r_oid), "%llx.%08llx", vino.ino, bno);
 	req->r_oid_len = strlen(req->r_oid);

-	return r;
+	return 0;
 }

  /*
@@ -437,8 +437,10 @@ struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *osdc,

 	/* calculate max write size */
 	r = calc_layout(vino, layout, off, plen, req, ops);
-	if (r < 0)
+	if (r < 0) {
+		ceph_osdc_put_request(req);
 		return ERR_PTR(r);
+	}
 	req->r_file_layout = *layout;	/* keep a copy */

/* in case it differs from natural (file) alignment that





Re: [PATCH] libceph: make ceph_msg-bio_seg be unsigned

2013-02-26 Thread Josh Durgin

Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 02/25/2013 02:40 PM, Alex Elder wrote:

The bio_seg field is used by the ceph messenger in iterating through
a bio.  It should never have a negative value, so make it an
unsigned.

Change variables used to hold bio_seg values to all be unsigned as
well.  Change two variable names in init_bio_iter() to match the
convention used everywhere else.

Signed-off-by: Alex Elder el...@inktank.com
---
  include/linux/ceph/messenger.h |2 +-
  net/ceph/messenger.c   |   16 +---
  2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 60903e0..8297288 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -86,7 +86,7 @@ struct ceph_msg {
  #ifdef CONFIG_BLOCK
struct bio  *bio;   /* instead of pages/pagelist */
struct bio  *bio_iter;  /* bio iterator */
-   int bio_seg;/* current bio segment */
+   unsigned int bio_seg;   /* current bio segment */
  #endif /* CONFIG_BLOCK */
struct ceph_pagelist *trail;/* the trailing part of the data */
bool front_is_vmalloc;
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 2c0669f..c06f940 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -697,18 +697,19 @@ static void con_out_kvec_add(struct ceph_connection *con,
  }

  #ifdef CONFIG_BLOCK
-static void init_bio_iter(struct bio *bio, struct bio **iter, int *seg)
+static void init_bio_iter(struct bio *bio, struct bio **bio_iter,
+   unsigned int *bio_seg)
  {
if (!bio) {
-   *iter = NULL;
-   *seg = 0;
+   *bio_iter = NULL;
+   *bio_seg = 0;
return;
}
-   *iter = bio;
-	*seg = bio->bi_idx;
+	*bio_iter = bio;
+	*bio_seg = (unsigned int) bio->bi_idx;
  }

-static void iter_bio_next(struct bio **bio_iter, int *seg)
+static void iter_bio_next(struct bio **bio_iter, unsigned int *seg)
  {
if (*bio_iter == NULL)
return;
@@ -1818,7 +1819,8 @@ static int read_partial_message_pages(struct ceph_connection *con,

  #ifdef CONFIG_BLOCK
  static int read_partial_message_bio(struct ceph_connection *con,
-   struct bio **bio_iter, int *bio_seg,
+   struct bio **bio_iter,
+   unsigned int *bio_seg,
unsigned int data_len, bool do_datacrc)
  {
struct bio_vec *bv = bio_iovec_idx(*bio_iter, *bio_seg);





Re: [PATCH 3/3] ceph: fix vmtruncate deadlock

2013-02-26 Thread Gregory Farnum
On Mon, Feb 25, 2013 at 4:01 PM, Gregory Farnum g...@inktank.com wrote:
 On Fri, Feb 22, 2013 at 8:31 PM, Yan, Zheng zheng.z@intel.com wrote:
 On 02/23/2013 02:54 AM, Gregory Farnum wrote:
 I haven't spent that much time in the kernel client, but this patch
 isn't working out for me. In particular, I'm pretty sure we need to
 preserve this:

 diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
 index 5d5c32b..b9d8417 100644
 --- a/fs/ceph/caps.c
 +++ b/fs/ceph/caps.c
 @@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
 	}
 	have = __ceph_caps_issued(ci, &implemented);

 -	/*
 -	 * disallow writes while a truncate is pending
 -	 */
 -	if (ci->i_truncate_pending)
 -		have &= ~CEPH_CAP_FILE_WR;
 -
 	if ((have & need) == need) {
 		/*
 		 * Look at (implemented & ~have & not) so that we keep waiting

 Because if there's a pending truncate, we really can't write. You do
 handle it in the particular case of doing buffered file writes, but
 these caps are a marker of permission, and the client shouldn't have
 write permission to a file until it's up to date on the truncates. Or
 am I misunderstanding something?

 A pending vmtruncate is only relevant to the buffered write case. If the client
 doesn't have the 'b' cap, the page cache is empty and
 __ceph_do_pending_vmtruncate is a no-op. For buffered writes, this patch only
 affects the situation where the client receives a truncate request from the
 MDS in the middle of a write.
 It looks to me like truncates can get queued for later, so that's not the 
 case?
 And how could the client receive a truncate while in the middle of
 writing? Either it's got the write caps (in which case nobody else can
 truncate), or it shouldn't be writing...I suppose if somebody enables
 lazyIO that would do it, but again in that case I imagine we'd rather
 do the truncate before writing:

 I think it's better to do vmtruncate after write finishes.
 What if you're writing past the truncate point? Then your written data
 would be truncated off even though it should have been a file
 extension.


 diff --git a/fs/ceph/file.c b/fs/ceph/file.c
 index a1e5b81..bf7849a 100644
 --- a/fs/ceph/file.c
 +++ b/fs/ceph/file.c
 @@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
 	dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
  inode, ceph_vinop(inode), pos, (unsigned)len, inode);
  again:
 -   __ceph_do_pending_vmtruncate(inode);

 There doesn't seem to be any kind of replacement for this? We need to
 do any pending truncates before reading or we might get stale data
 back.

 generic_file_aio_read checks i_size when copying data to the user buffer, so
 the user program can't get stale data. This __ceph_do_pending_vmtruncate is
 not protected by i_mutex, which is a potential bug; that's the reason I
 removed it.

 Oh, you're right about the locking (I think, anyway?). However, i_size
 doesn't protect us — we might for instance have truncated to zero and
 then back up to the original file size. :( But that doesn't look like
 it's handled correctly anyway (racy) — am I misreading that badly or
 is the infrastructure in fact broken in that case?

Sage pointed out in standup today that we only send a truncate message
for truncate downs, and he thinks that the more complicated cases
(truncate to zero, write out 2KB, truncate to 1KB) are okay thanks to
the capabilities bits. I haven't thought it through but that seems
plausible to me.
-Greg


Re: [PATCH 0/3] libceph: focus calc_layout() on filling in the osd op

2013-02-26 Thread Josh Durgin

On 02/25/2013 03:09 PM, Alex Elder wrote:

This series refactors the code involved with identifying the
details of the name, offset, and length of an object involved
with an osd request based on a file layout.  It makes the focus
of calc_layout() be filling in an osd op structure based on the
file layout it is provided.  The caller (ceph_osdc_new_request())
is then responsible for filling in fields related to the request,
such as the name of the target object.

-Alex

[PATCH 1/3] libceph: pass object number back to calc_layout() caller
[PATCH 2/3] libceph: format target object name in caller
[PATCH 3/3] libceph: don't pass request to calc_layout()


These all look good.
Reviewed-by: Josh Durgin josh.dur...@inktank.com



Re: [PATCH 0/4] libceph: abstract setting message data info

2013-02-26 Thread Josh Durgin

On 02/25/2013 03:40 PM, Alex Elder wrote:

This series makes the fields related to the data portion of
a ceph message not get manipulated by code outside the ceph
messenger.  It implements some interface functions that can
be used to assign data-related fields.  Doing this will allow
the way message data is managed to be changed independent
of the users of the messenger module.

-Alex

[PATCH 1/4] libceph: distinguish page array and pagelist count
[PATCH 2/4] libceph: set page alignment in start_request()
[PATCH 3/4] libceph: isolate message page field manipulation
[PATCH 4/4] libceph: isolate other message data fields


These all look good.
Reviewed-by: Josh Durgin josh.dur...@inktank.com



Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2013-02-26 Thread Jim Schutt
Hi Sage,

On 02/20/2013 05:12 PM, Sage Weil wrote:
 Hi Jim,
 
 I'm resurrecting an ancient thread here, but: we've just observed this on 
 another big cluster and remembered that this hasn't actually been fixed.

Sorry for the delayed reply - I missed this in a backlog
of unread email...

 
 I think the right solution is to make an option that will setsockopt on
 SO_RCVBUF to some value (say, 256KB).  I pushed a branch that does this,
 wip-tcp.  Do you mind checking to see if this addresses the issue (without 
 manually adjusting things in /proc)?

I'll be happy to test it out...

 
 And perhaps we should consider making this default to 256KB...

That's the value I've been using with my /proc adjustments
since I figured out what was going on.  My servers use
a 10 GbE port for each of the cluster and public networks,
with cephfs clients using 1 GbE, and I've not detected any
issues resulting from that value.  So, it seems like a decent
starting point for a default...
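
For anyone following along, the option being discussed boils down to a call
like this on each messenger socket (a sketch, not the actual wip-tcp code;
the fd variable is a placeholder), instead of tuning net.core.rmem_* or
net.ipv4.tcp_rmem globally through /proc:

	/* needs <sys/socket.h> and <stdio.h> */
	int rcvbuf = 256 * 1024;	/* 256 KB */
	if (setsockopt(sock_fd, SOL_SOCKET, SO_RCVBUF,
		       &rcvbuf, sizeof(rcvbuf)) < 0)
		perror("setsockopt(SO_RCVBUF)");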

-- Jim

 
 Thanks!
 sage
 




Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2013-02-26 Thread Sage Weil
On Tue, 26 Feb 2013, Jim Schutt wrote:
  I think the right solution is to make an option that will setsockopt on
  SO_RCVBUF to some value (say, 256KB).  I pushed a branch that does this,
  wip-tcp.  Do you mind checking to see if this addresses the issue (without 
  manually adjusting things in /proc)?
 
 I'll be happy to test it out...

That would be great!  It's branch wip-tcp, and the setting is 'ms tcp 
rcvbuf'.
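
In ceph.conf that would look something like this (section placement and the
exact value are up to you; 256 KB matches what was discussed above):

	[global]
		ms tcp rcvbuf = 262144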

Thanks!
sage


Re: maintenance on osd host

2013-02-26 Thread Stefan Priebe

Hi Sage,

Am 26.02.2013 18:24, schrieb Sage Weil:

On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote:

But that results in a 1-3s hiccup for all KVM VMs. This is not what I want.


You can do

  kill $pid
  ceph osd down $osdid

(or even reverse the order, if the sequence is quick enough) to avoid
waiting for the failure detection delay.  But if the OSDs are going down,
then the peering has to happen one way or another.


But exactly this results in backfilling starting immediately. My idea was to
first mark the osd down so the mon knows about this fact, so no I/O is
stalled, and then reboot the whole host. But exactly this does not work as
expected, as backfilling starts immediately after setting the osd to down ;-(


Greets,
Stefan


Re: Crash and strange things on MDS

2013-02-26 Thread Kevin Decherf
On Tue, Feb 26, 2013 at 10:10:06AM -0800, Gregory Farnum wrote:
 On Tue, Feb 26, 2013 at 9:57 AM, Kevin Decherf ke...@kdecherf.com wrote:
  On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote:
  On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote:
   On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote:
   Looks like you've got ~424k dentries pinned, and it's trying to keep
   400k inodes in cache. So you're still a bit oversubscribed, yes. This
   might just be the issue where your clients are keeping a bunch of
   inodes cached for the VFS (http://tracker.ceph.com/issues/3289).
  
    Thanks for the analysis. We use only one ceph-fuse client at this time,
    which runs all the high-load commands like rsync, tar and cp on a huge
    number of files. Well, I will replace it with the kernel client.
 
  Oh, that bug is just an explanation of what's happening; I believe it
  exists in the kernel client as well.
 
  After setting the mds cache size to 900k, storms are gone.
  However we continue to observe high latency on some clients (always the
  same clients): each IO takes between 40 and 90ms (for example with
  Wordpress, it takes ~20 seconds to load all needed files...).
  With a non-laggy client, IO requests take less than 1ms.
 
 I can't be sure from that description, but it sounds like you've got
 one client which is generally holding the RW caps on the files, and
 then another client which comes in occasionally to read those same
 files. That requires the first client to drop its caps, and involves a
 couple round-trip messages and is going to take some time — this is an
 unavoidable consequence if you have clients sharing files, although
 there's probably still room for us to optimize.
 
 Can you describe your client workload in a bit more detail?

We have one folder per application (php, java, ruby). Every application has
small (<1M) files. The folder is mounted by only one client by default.

In case of overload, other clients spawn to mount the same folder and
access the same files.

In the following test, only one client was used to serve the
application (a website using wordpress).

I made the test with strace to see the time of each IO request (strace -T
-e trace=file) and I noticed the same pattern:

...
[pid  4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
st_size=28622, ...}) = 0 0.033409
[pid  4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
st_size=28622, ...}) = 0 0.081642
[pid  4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138
[pid  4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
st_size=10896, ...}) = 0 0.082303
[pid  4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
st_size=10896, ...}) = 0 0.004090
[pid  4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929
...

~250 files were accessed for only one request (thanks Wordpress.).

The fs is mounted with these options: 
rw,noatime,name=hidden,secret=hidden,nodcache.

I have a debug (debug_mds=20) log of the active mds during this test if you 
want.
-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: maintenance on osd host

2013-02-26 Thread Sage Weil
On Tue, 26 Feb 2013, Stefan Priebe wrote:
 Hi Sage,
 
 Am 26.02.2013 18:24, schrieb Sage Weil:
  On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote:
    But that results in a 1-3s hiccup for all KVM VMs. This is not what I
    want.
  
  You can do
  
kill $pid
ceph osd down $osdid
  
  (or even reverse the order, if the sequence is quick enough) to avoid
  waiting for the failure detection delay.  But if the OSDs are going down,
  then the peering has to happen one way or another.
 
  But exactly this results in backfilling starting immediately. My idea was to
  first mark the osd down so the mon knows about this fact, so no I/O is
  stalled, and then reboot the whole host. But exactly this does not work as
  expected, as backfilling starts immediately after setting the osd to down ;-(

Backfilling should not happen on down, unless you have reconfigured 'mon
osd down out interval = 0' or something along those lines.  Setting the
'noout' flag will also prevent the osds from getting marked out.

As for limiting the IO stall: you could also do 'ceph osd set noup', 
then mark them down, then kill the daemon, and you won't have to worry 
about racing with the daemon marking itself back up (as it normally does).
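
So, roughly (a sketch; osd 41 stands in for whichever osds live on the host):

	ceph osd set noup         # don't let the daemons mark themselves back up
	ceph osd down 41          # tell the monitors right away, no failure-detection wait
	killall ceph-osd          # or kill the specific pids
	# ... reboot / maintenance ...
	ceph osd unset noup       # let the osds rejoin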

sage


Re: Crash and strange things on MDS

2013-02-26 Thread Gregory Farnum
On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote:
 We have one folder per application (php, java, ruby). Every application has
 small (<1M) files. The folder is mounted by only one client by default.

 In case of overload, other clients spawn to mount the same folder and
 access the same files.

 In the following test, only one client was used to serve the
 application (a website using wordpress).

 I made the test with strace to see the time of each IO request (strace -T
 -e trace=file) and I noticed the same pattern:

 ...
 [pid  4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
 st_size=28622, ...}) = 0 0.033409
 [pid  4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
 st_size=28622, ...}) = 0 0.081642
 [pid  4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138
 [pid  4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
 st_size=10896, ...}) = 0 0.082303
 [pid  4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
 st_size=10896, ...}) = 0 0.004090
 [pid  4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929
 ...

 ~250 files were accessed for only one request (thanks Wordpress.).

Okay, that is slower than I'd expect, even for an across-the-wire request...

 The fs is mounted with these options: 
 rw,noatime,name=hidden,secret=hidden,nodcache.

What kernel and why are you using nodcache? Did you have problems
without that mount option? That's forcing an MDS access for most
operations, rather than using local data.

 I have a debug (debug_mds=20) log of the active mds during this test if you 
 want.

Yeah, can you post it somewhere?
-Greg


Re: maintenance on osd host

2013-02-26 Thread Gregory Farnum
On Tue, Feb 26, 2013 at 11:44 AM, Stefan Priebe s.pri...@profihost.ag wrote:
 Hi Sage,

 Am 26.02.2013 18:24, schrieb Sage Weil:

 On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote:

  But that results in a 1-3s hiccup for all KVM VMs. This is not what I
  want.


 You can do

   kill $pid
   ceph osd down $osdid

 (or even reverse the order, if the sequence is quick enough) to avoid
 waiting for the failure detection delay.  But if the OSDs are going down,
 then the peering has to happen one way or another.


  But exactly this results in backfilling starting immediately. My idea was to
  first mark the osd down so the mon knows about this fact, so no I/O is
  stalled, and then reboot the whole host. But exactly this does not work as
  expected, as backfilling starts immediately after setting the osd to down
  ;-(

"out" and "down" are quite different — are you sure you tried "down"
and not "out"? (You reference "out" in your first email, rather than
"down".)
-Greg


Re: Crash and strange things on MDS

2013-02-26 Thread Kevin Decherf
On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote:
 On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote:
  We have one folder per application (php, java, ruby). Every application has
  small (<1M) files. The folder is mounted by only one client by default.

  In case of overload, other clients spawn to mount the same folder and
  access the same files.
 
  In the following test, only one client was used to serve the
  application (a website using wordpress).
 
  I made the test with strace to see the time of each IO request (strace -T
  -e trace=file) and I noticed the same pattern:
 
  ...
  [pid  4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
  st_size=28622, ...}) = 0 0.033409
  [pid  4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
  st_size=28622, ...}) = 0 0.081642
  [pid  4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138
  [pid  4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
  st_size=10896, ...}) = 0 0.082303
  [pid  4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
  st_size=10896, ...}) = 0 0.004090
  [pid  4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929
  ...
 
  ~250 files were accessed for only one request (thanks Wordpress.).
 
 Okay, that is slower than I'd expect, even for an across-the-wire request...
 
  The fs is mounted with these options: 
  rw,noatime,name=hidden,secret=hidden,nodcache.
 
 What kernel and why are you using nodcache?

We use kernel 3.7.0. nodcache is enabled by default (we only specify user and
secretfile as mount options) and I didn't find it in the documentation of
mount.ceph.

 Did you have problems
 without that mount option? That's forcing an MDS access for most
 operations, rather than using local data.

Good question, I will try it (-o dcache?).
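
Something like this, presumably (monitor address and paths below are
placeholders for our real ones):

	mount -t ceph 10.0.0.1:6789:/ /data \
		-o name=hidden,secretfile=/etc/ceph/secret,noatime,dcache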

  I have a debug (debug_mds=20) log of the active mds during this test if you 
  want.
 
 Yeah, can you post it somewhere?

Upload in progress :-)

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: Crash and strange things on MDS

2013-02-26 Thread Gregory Farnum
On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf ke...@kdecherf.com wrote:
 On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote:
 On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote:
  We have one folder per application (php, java, ruby). Every application has
  small (<1M) files. The folder is mounted by only one client by default.

  In case of overload, other clients spawn to mount the same folder and
  access the same files.
 
  In the following test, only one client was used to serve the
  application (a website using wordpress).
 
  I made the test with strace to see the time of each IO request (strace -T
  -e trace=file) and I noticed the same pattern:
 
  ...
  [pid  4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
  st_size=28622, ...}) = 0 0.033409
  [pid  4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
  st_size=28622, ...}) = 0 0.081642
  [pid  4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138
  [pid  4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
  st_size=10896, ...}) = 0 0.082303
  [pid  4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
  st_size=10896, ...}) = 0 0.004090
  [pid  4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929
  ...
 
  ~250 files were accessed for only one request (thanks Wordpress.).

 Okay, that is slower than I'd expect, even for an across-the-wire request...

  The fs is mounted with these options: 
  rw,noatime,name=hidden,secret=hidden,nodcache.

 What kernel and why are you using nodcache?

 We use kernel 3.7.0. nodcache is enabled by default (we only specify user and
 secretfile as mount options) and I didn't find it in the documentation of
 mount.ceph.

 Did you have problems
 without that mount option? That's forcing an MDS access for most
 operations, rather than using local data.

 Good question, I will try it (-o dcache?).

Oh right — I forgot Sage had enabled that by default; I don't recall
how necessary it is. (Sage?)


  I have a debug (debug_mds=20) log of the active mds during this test if 
  you want.

 Yeah, can you post it somewhere?

 Upload in progress :-)

Looking forward to it. ;)


Re: Crash and strange things on MDS

2013-02-26 Thread Yan, Zheng
On Wed, Feb 27, 2013 at 5:58 AM, Gregory Farnum g...@inktank.com wrote:
 On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf ke...@kdecherf.com wrote:
 On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote:
 On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote:
  We have one folder per application (php, java, ruby). Every application has
  small (<1M) files. The folder is mounted by only one client by default.

  In case of overload, other clients spawn to mount the same folder and
  access the same files.
 
  In the following test, only one client was used to serve the
  application (a website using wordpress).
 
  I made the test with strace to see the time of each IO request (strace -T
  -e trace=file) and I noticed the same pattern:
 
  ...
  [pid  4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
  st_size=28622, ...}) = 0 0.033409
  [pid  4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
  st_size=28622, ...}) = 0 0.081642
  [pid  4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138
  [pid  4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
  st_size=10896, ...}) = 0 0.082303
  [pid  4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
  st_size=10896, ...}) = 0 0.004090
  [pid  4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929
  ...
 
  ~250 files were accessed for only one request (thanks Wordpress.).

 Okay, that is slower than I'd expect, even for an across-the-wire request...

  The fs is mounted with these options: 
  rw,noatime,name=hidden,secret=hidden,nodcache.

 What kernel and why are you using nodcache?

 We use kernel 3.7.0. nodcache is enabled by default (we only specify user and
 secretfile as mount options) and I didn't find it in the documentation of
 mount.ceph.

 Did you have problems
 without that mount option? That's forcing an MDS access for most
 operations, rather than using local data.

 Good question, I will try it (-o dcache?).

 Oh right — I forgot Sage had enabled that by default; I don't recall
 how necessary it is. (Sage?)


That code is buggy; see ceph_dir_test_complete(), it always returns false.

Yan, Zheng


  I have a debug (debug_mds=20) log of the active mds during this test if 
  you want.

 Yeah, can you post it somewhere?

 Upload in progress :-)

 Looking forward to it. ;)


Re: Crash and strange things on MDS

2013-02-26 Thread Sage Weil
On Wed, 27 Feb 2013, Yan, Zheng  wrote:
 On Wed, Feb 27, 2013 at 5:58 AM, Gregory Farnum g...@inktank.com wrote:
  On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf ke...@kdecherf.com wrote:
  On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote:
  On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com 
  wrote:
   We have one folder per application (php, java, ruby). Every application has
   small (<1M) files. The folder is mounted by only one client by default.

   In case of overload, other clients spawn to mount the same folder and
   access the same files.
  
   In the following test, only one client was used to serve the
   application (a website using wordpress).
  
   I made the test with strace to see the time of each IO request (strace 
   -T
   -e trace=file) and I noticed the same pattern:
  
   ...
   [pid  4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
   st_size=28622, ...}) = 0 0.033409
   [pid  4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
   st_size=28622, ...}) = 0 0.081642
   [pid  4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138
   [pid  4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
   st_size=10896, ...}) = 0 0.082303
   [pid  4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
   st_size=10896, ...}) = 0 0.004090
   [pid  4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929
   ...
  
   ~250 files were accessed for only one request (thanks Wordpress.).
 
  Okay, that is slower than I'd expect, even for an across-the-wire 
  request...
 
   The fs is mounted with these options: 
   rw,noatime,name=hidden,secret=hidden,nodcache.
 
  What kernel and why are you using nodcache?
 
  We use kernel 3.7.0. nodcache is enabled by default (we only specify user 
  and
  secretfile as mount options) and I didn't find it in the documentation of
  mount.ceph.
 
  Did you have problems
  without that mount option? That's forcing an MDS access for most
  operations, rather than using local data.
 
  Good question, I will try it (-o dcache?).
 
  Oh right — I forgot Sage had enabled that by default; I don't recall
  how necessary it is. (Sage?)
 
 
 That code is buggy; see ceph_dir_test_complete(), it always returns false.

FWIW, I think #4023 may be the root of those original bugs.  I think that
should be up next (as far as fs/ceph goes) after this i_mutex stuff
is sorted out.

sage


 
 Yan, Zheng
 
 
   I have a debug (debug_mds=20) log of the active mds during this test if 
   you want.
 
  Yeah, can you post it somewhere?
 
  Upload in progress :-)
 
  Looking forward to it. ;)


[GIT PULL] Ceph updates for 3.9-rc1

2013-02-26 Thread Sage Weil
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

A few groups of patches here.  Alex has been hard at work improving the
RBD code, laying the groundwork for understanding the new formats and doing
layering.  Most of the infrastructure is now in place for the final bits
that will come with the next window.

There are a few changes to the data layout.  Jim Schutt's patch fixes some 
non-ideal CRUSH behavior, and a set of patches from me updates the client 
to speak a newer version of the protocol and implement an improved hashing 
strategy across storage nodes (when the server side supports it too).

A pair of patches from Sam Lang fix the atomicity of open+create 
operations.  Several patches from Yan, Zheng fix various mds/client issues 
that turned up during multi-mds torture tests.

A final set of patches exposes file layouts via virtual xattrs, and allows
the policies to be set on directories via xattrs as well (avoiding the
awkward ioctl interface and providing a consistent interface for both
kernel-mount and ceph-fuse users).
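
For example, roughly (the paths are illustrative, and the attribute names are
my reading of the interface those patches add):

	getfattr -n ceph.file.layout /mnt/ceph/somefile
	setfattr -n ceph.dir.layout.stripe_unit -v 4194304 /mnt/ceph/somedir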

Thanks!
sage




Alex Elder (118):
  libceph: reformat __reset_osd()
  rbd: document rbd_spec structure
  rbd: kill rbd_spec-image_name_len
  rbd: kill rbd_spec-image_id_len
  rbd: use kmemdup()
  ceph: define ceph_encode_8_safe()
  rbd: define and use rbd_warn()
  rbd: add warning messages for missing arguments
  rbd: add a warning in bio_chain_clone_range()
  rbd: add warnings to rbd_dev_probe_update_spec()
  rbd: standardize rbd_request variable names
  rbd: standardize ceph_osd_request variable names
  rbd: be picky about osd request status type
  rbd: encapsulate handling for a single request
  rbd: end request on error in rbd_do_request() caller
  rbd: a little more cleanup of rbd_rq_fn()
  rbd: make exists flag atomic
  rbd: only get snap context for write requests
  rbd: separate layout init
  rbd: drop oid parameters from ceph_osdc_build_request()
  rbd: drop snapid parameter from rbd_req_sync_read()
  rbd: drop flags parameter from rbd_req_sync_exec()
  rbd: kill rbd_req_sync_op() snapc and snapid parameters
  rbd: don't bother setting snapid in rbd_do_request()
  libceph: always allow trail in osd request
  libceph: kill op_needs_trail()
  libceph: pass length to ceph_osdc_build_request()
  libceph: pass length to ceph_calc_file_object_mapping()
  libceph: drop snapid in ceph_calc_raw_layout()
  libceph: drop osdc from ceph_calc_raw_layout()
  libceph: don't set flags in ceph_osdc_alloc_request()
  libceph: don't set pages or bio in ceph_osdc_alloc_request()
  rbd: pass num_op with ops array
  libceph: pass num_op with ops
  rbd: there is really only one op
  rbd: assume single op in a request
  rbd: kill ceph_osd_req_op-flags
  rbd: pull in ceph_calc_raw_layout()
  rbd: open code rbd_calc_raw_layout()
  rbd: don't bother calculating file mapping
  rbd: use a common layout for each device
  rbd: combine rbd sync watch/unwatch functions
  rbd: don't leak rbd_req on synchronous requests
  rbd: don't leak rbd_req for rbd_req_sync_notify_ack()
  rbd: don't assign extent info in rbd_do_request()
  rbd: don't assign extent info in rbd_req_sync_op()
  rbd: move call osd op setup into rbd_osd_req_op_create()
  rbd: move remaining osd op setup into rbd_osd_req_op_create()
  rbd: assign watch request more directly
  rbd: fix type of snap_id in rbd_dev_v2_snap_info()
  rbd: small changes
  rbd: check for overflow in rbd_get_num_segments()
  rbd: don't retry setting up header watch
  Merge branch 'testing' of github.com:ceph/ceph-client into v3.8-rc5-testing
  libceph: fix messenger CONFIG_BLOCK dependencies
  rbd: new request tracking code
  rbd: kill rbd_rq_fn() and all other related code
  rbd: kill rbd_req_coll and rbd_request
  rbd: implement sync object read with new code
  rbd: get rid of rbd_req_sync_read()
  rbd: implement watch/unwatch with new code
  rbd: get rid of rbd_req_sync_watch()
  rbd: use new code for notify ack
  rbd: get rid of rbd_req_sync_notify_ack()
  rbd: send notify ack asynchronously
  rbd: implement sync method with new code
  rbd: get rid of rbd_req_sync_exec()
  rbd: unregister linger in watch sync routine
  rbd: track object rather than osd request for watch
  rbd: decrement obj request count when deleting
  rbd: don't drop watch requests on completion
  rbd: define flags field, use it for exists flag
  rbd: prevent open for image being removed
  libceph: add a compatibility check interface
  rbd: don't take extra bio reference for osd client
  libceph: don't require r_num_pages for bio requests

Re: maintenance on osd host

2013-02-26 Thread Stefan Priebe - Profihost AG
Hi Greg,
  Hi Sage,

Am 26.02.2013 21:27, schrieb Gregory Farnum:
 On Tue, Feb 26, 2013 at 11:44 AM, Stefan Priebe s.pri...@profihost.ag wrote:
 "out" and "down" are quite different — are you sure you tried "down"
 and not "out"? (You reference "out" in your first email, rather than
 "down".)
 -Greg

Sorry, that's it, I misread down / out. Wouldn't it make sense to
mark the osd down automatically when shutting down via the init script?
It doesn't seem to make sense to rely on the automatic detection when
somebody uses the init script.

Stefan