Re: [ceph] locking fun with d_materialise_unique()

2013-01-29 Thread Al Viro
On Mon, Jan 28, 2013 at 09:20:34PM -0800, Sage Weil wrote:

 Yep, that is indeed a problem.  I think we just need to do the r_aborted 
 and/or r_locked_dir check in the else if condition...
 
  I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
  get to its splice_dentry() and d_delete() calls in similar situations -
  I hadn't checked that one yet.  If it isn't guaranteed, we have a problem
  there as well.
 
 ...and the condition guarding readdir_prepopulate().  :)

 I think you're reading it correctly.  The main thing to keep in mind here 
 is that we *do* need to call fill_inode() for the inode metadata on these 
 requests to keep the mds and client state in sync.  The dentry state is 
 safe to ignore.

You mean the parts under
if (rinfo->head->is_dentry) {
and
if (rinfo->head->is_target) {
in there?  Because there's fill_inode() called from readdir_prepopulate()
and it's a lot more problematic than those two...

 It would be great to have the dir i_mutex rules summarized somewhere, even 
 if it is just a copy of the below.  It took a fair bit of trial and error 
 to infer what was going on when writing this code.  :)

Directory ->i_mutex rules are partially documented - the "what VFS guarantees
to hold" side is in Documentation/filesystems/directory-locking.  It's
the other side (what locks are expected to be held by callers of dcache.c
functions) that is badly missing...

 Ping me when you've pushed that branch and I'll take a look...

To gitol...@ra.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git
   01a88fa..4056362  master -> master

with tentative ceph patch in the very end.  Should be on git.kernel.org
shortly...


Re: [PATCH 01/12, v2] rbd: new request tracking code

2013-01-29 Thread Josh Durgin

On 01/24/2013 06:08 AM, Alex Elder wrote:

This is an update of the first patch in my request tracking
code series.  After posting it the other day I identified some
problems related to reference counting of image and object
requests.  I also am starting to look at the details of
implementing layered reads, and ended up making some
substantive changes.  Since I have not seen any review
feedback I thought the best thing would be to just
re-post the updated patch.

The remaining patches in the series have changed accordingly,
but they have not really changed substantively, so I am
not re-posting those (but will if it's requested).

The main functional change is that an image request no longer
maintains an array of object request pointers, it maintains
a list of object requests.  This simplifies some things, and
makes the image request structure fixed size.

A few other functional changes:
- Reference counting of object and image requests is now
   done sensibly.
- Image requests now support a callback when complete,
   which will be used for layered I/O requests.
- There are a few new helper functions that encapsulate
   tying an object request to an image request.
- A distinct value is now used for the "which" field
   for object requests not associated with an image request
   (mainly used for validation/assertions).

Other changes:
- Everything that was named image_request now uses
   img_request instead.
- A few blocks and lines of code have been rearranged.

The updated series is available on the ceph-client git
repository in the branch wip-rbd-review-v2.

-Alex

This patch fully implements the new request tracking code for rbd
I/O requests.

Each I/O request to an rbd image will get an rbd_image_request
structure allocated to track it.  This provides access to all
information about the original request, as well as access to the
set of one or more object requests that are initiated as a result
of the image request.

An rbd_obj_request structure defines a request sent to a single osd
object (possibly) as part of an rbd image request.  An rbd object
request refers to a ceph_osd_request structure built up to represent
the request; for now it will contain a single osd operation.  It
also provides space to hold the result status and the version of the
object when the osd request completes.

An rbd_obj_request structure can also stand on its own.  This will
be used for reading the version 1 header object, for issuing
acknowledgements to event notifications, and for making object
method calls.

All rbd object requests now complete asynchronously with respect
to the osd client--they supply a common callback routine.

This resolves:
 http://tracker.newdream.net/issues/3741

Signed-off-by: Alex Elder el...@inktank.com
---
v2: - fixed reference counting
 - image request callback support
 - image/object connection helper functions
 - distinct BAD_WHICH value for non-image object requests

  drivers/block/rbd.c |  622
++-
  1 file changed, 620 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6689363..46a61dd 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -181,6 +181,67 @@ struct rbd_req_coll {
struct rbd_req_status   status[0];
  };

+struct rbd_img_request;
+typedef void (*rbd_img_callback_t)(struct rbd_img_request *);
+
+#define BAD_WHICH   U32_MAX /* Good which or bad which, which? */
+
+struct rbd_obj_request;
+typedef void (*rbd_obj_callback_t)(struct rbd_obj_request *);
+
+enum obj_req_type { obj_req_bio }; /* More types to come */


enum labels should be capitalized.
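For example, something along these lines (just a sketch; exact naming is up
to you):

enum obj_req_type { OBJ_REQ_BIO };  /* More types to come */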


+struct rbd_obj_request {
+   const char  *object_name;
+   u64 offset; /* object start byte */
+   u64 length; /* bytes from offset */
+
+   struct rbd_img_request  *img_request;
+   struct list_headlinks;
+   u32 which;  /* posn image request list */
+
+   enum obj_req_type   type;
+   struct bio  *bio_list;
+
+   struct ceph_osd_request *osd_req;
+
+   u64 xferred;/* bytes transferred */
+   u64 version;


This version is only used (uselessly) for the watch operation. It
should be removed in a future patch (along with the obj_ver in the
header).


+   s32 result;
+   atomic_tdone;
+
+   rbd_obj_callback_t  callback;
+
+   struct kref kref;
+};
+
+struct rbd_img_request {
+   struct request  *rq;
+   struct rbd_device   *rbd_dev;
+   u64 offset; /* starting image byte offset */
+   u64 length; /* byte count from offset */
+   bool write_request;  /* false for read */
+   union {
+ 

Re: [PATCH 02/12, v2] rbd: kill rbd_rq_fn() and all other related code

2013-01-29 Thread Josh Durgin

Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 01/24/2013 08:32 AM, Alex Elder wrote:

Now that the request function has been replaced by one using the new
request management data structures the old one can go away.
Deleting it makes rbd_dev_do_request() no longer needed, and
deleting that makes other functions unneeded, and so on.

Signed-off-by: Alex Elder el...@inktank.com
---
  drivers/block/rbd.c |  319
---
  1 file changed, 319 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 46a61dd..7caddea 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -621,18 +621,6 @@ static void rbd_put_client(struct rbd_client *rbdc)
kref_put(rbdc-kref, rbd_client_release);
  }

-/*
- * Destroy requests collection
- */
-static void rbd_coll_release(struct kref *kref)
-{
-   struct rbd_req_coll *coll =
-   container_of(kref, struct rbd_req_coll, kref);
-
-   dout(rbd_coll_release %p\n, coll);
-   kfree(coll);
-}
-
  static bool rbd_image_format_valid(u32 image_format)
  {
return image_format == 1 || image_format == 2;
@@ -876,28 +864,6 @@ static u64 rbd_segment_length(struct rbd_device
*rbd_dev,
return length;
  }

-static int rbd_get_num_segments(struct rbd_image_header *header,
-   u64 ofs, u64 len)
-{
-   u64 start_seg;
-   u64 end_seg;
-   u64 result;
-
-   if (!len)
-   return 0;
-   if (len - 1  U64_MAX - ofs)
-   return -ERANGE;
-
-   start_seg = ofs  header-obj_order;
-   end_seg = (ofs + len - 1)  header-obj_order;
-
-   result = end_seg - start_seg + 1;
-   if (result  (u64) INT_MAX)
-   return -ERANGE;
-
-   return (int) result;
-}
-
  /*
   * returns the size of an object in the image
   */
@@ -1216,52 +1182,6 @@ static void rbd_osd_req_op_destroy(struct
ceph_osd_req_op *op)
kfree(op);
  }

-static void rbd_coll_end_req_index(struct request *rq,
-  struct rbd_req_coll *coll,
-  int index,
-  s32 ret, u64 len)
-{
-   struct request_queue *q;
-   int min, max, i;
-
-   dout(rbd_coll_end_req_index %p index %d ret %d len %llu\n,
-coll, index, (int)ret, (unsigned long long)len);
-
-   if (!rq)
-   return;
-
-   if (!coll) {
-   blk_end_request(rq, ret, len);
-   return;
-   }
-
-   q = rq-q;
-
-   spin_lock_irq(q-queue_lock);
-   coll-status[index].done = 1;
-   coll-status[index].rc = ret;
-   coll-status[index].bytes = len;
-   max = min = coll-num_done;
-   while (max  coll-total  coll-status[max].done)
-   max++;
-
-   for (i = min; imax; i++) {
-   __blk_end_request(rq, (int)coll-status[i].rc,
- coll-status[i].bytes);
-   coll-num_done++;
-   kref_put(coll-kref, rbd_coll_release);
-   }
-   spin_unlock_irq(q-queue_lock);
-}
-
-static void rbd_coll_end_req(struct rbd_request *rbd_req,
-s32 ret, u64 len)
-{
-   rbd_coll_end_req_index(rbd_req-rq,
-   rbd_req-coll, rbd_req-coll_index,
-   ret, len);
-}
-
  /*
   * Send ceph osd request
   */
@@ -1361,46 +1281,6 @@ done_osd_req:
return ret;
  }

-/*
- * Ceph osd op callback
- */
-static void rbd_req_cb(struct ceph_osd_request *osd_req, struct
ceph_msg *msg)
-{
-   struct rbd_request *rbd_req = osd_req-r_priv;
-   struct ceph_osd_reply_head *replyhead;
-   struct ceph_osd_op *op;
-   s32 rc;
-   u64 bytes;
-   int read_op;
-
-   /* parse reply */
-   replyhead = msg-front.iov_base;
-   WARN_ON(le32_to_cpu(replyhead-num_ops) == 0);
-   op = (void *)(replyhead + 1);
-   rc = (s32)le32_to_cpu(replyhead-result);
-   bytes = le64_to_cpu(op-extent.length);
-   read_op = (le16_to_cpu(op-op) == CEPH_OSD_OP_READ);
-
-   dout(rbd_req_cb bytes=%llu readop=%d rc=%d\n,
-   (unsigned long long) bytes, read_op, (int) rc);
-
-   if (rc == (s32)-ENOENT  read_op) {
-   zero_bio_chain(rbd_req-bio, 0);
-   rc = 0;
-   } else if (rc == 0  read_op  bytes  rbd_req-len) {
-   zero_bio_chain(rbd_req-bio, bytes);
-   bytes = rbd_req-len;
-   }
-
-   rbd_coll_end_req(rbd_req, rc, bytes);
-
-   if (rbd_req-bio)
-   bio_chain_put(rbd_req-bio);
-
-   ceph_osdc_put_request(osd_req);
-   kfree(rbd_req);
-}
-
  static void rbd_simple_req_cb(struct ceph_osd_request *osd_req,
struct ceph_msg *msg)
  {
@@ -1448,70 +1328,6 @@ done:
return ret;
  }

-/*
- * Do an asynchronous ceph osd operation
- */
-static int rbd_do_op(struct request *rq,
-struct rbd_device 

Re: [PATCH 03/12, v2] rbd: kill rbd_req_coll and rbd_request

2013-01-29 Thread Josh Durgin

Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 01/24/2013 08:32 AM, Alex Elder wrote:

The two remaining callers of rbd_do_request() always pass a null
collection pointer, so the coll and coll_index parameters are
not needed.  There is no other use of that data structure, so it
can be eliminated.

Deleting them means there is no need to allocate a rbd_request
structure for the callback function.  And since that's the only use
of *that* structure, it too can be eliminated.

Signed-off-by: Alex Elder el...@inktank.com
---
  drivers/block/rbd.c |   58
+++
  1 file changed, 3 insertions(+), 55 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 7caddea..3302cea 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -162,25 +162,6 @@ struct rbd_client {
struct list_headnode;
  };

-/*
- * a request completion status
- */
-struct rbd_req_status {
-   int done;
-   s32 rc;
-   u64 bytes;
-};
-
-/*
- * a collection of requests
- */
-struct rbd_req_coll {
-   int total;
-   int num_done;
-   struct kref kref;
-   struct rbd_req_status   status[0];
-};
-
  struct rbd_img_request;
  typedef void (*rbd_img_callback_t)(struct rbd_img_request *);

@@ -242,18 +223,6 @@ struct rbd_img_request {
  #define for_each_obj_request_safe(ireq, oreq, n) \
list_for_each_entry_safe_reverse(oreq, n, ireq-obj_requests, links)

-/*
- * a single io request
- */
-struct rbd_request {
-   struct request  *rq;/* blk layer request */
-   struct bio  *bio;   /* cloned bio */
-   struct page **pages;/* list of used pages */
-   u64 len;
-   int coll_index;
-   struct rbd_req_coll *coll;
-};
-
  struct rbd_snap {
struct  device  dev;
const char  *name;
@@ -1195,21 +1164,18 @@ static int rbd_do_request(struct request *rq,
  int num_pages,
  int flags,
  struct ceph_osd_req_op *op,
- struct rbd_req_coll *coll,
- int coll_index,
  void (*rbd_cb)(struct ceph_osd_request *,
 struct ceph_msg *),
  u64 *ver)
  {
struct ceph_osd_client *osdc;
struct ceph_osd_request *osd_req;
-   struct rbd_request *rbd_req = NULL;
struct timespec mtime = CURRENT_TIME;
int ret;

-   dout(rbd_do_request object_name=%s ofs=%llu len=%llu coll=%p[%d]\n,
+   dout(rbd_do_request object_name=%s ofs=%llu len=%llu\n,
object_name, (unsigned long long) ofs,
-   (unsigned long long) len, coll, coll_index);
+   (unsigned long long) len);

osdc = rbd_dev-rbd_client-client-osdc;
osd_req = ceph_osdc_alloc_request(osdc, snapc, 1, false, GFP_NOIO);
@@ -1223,22 +1189,8 @@ static int rbd_do_request(struct request *rq,
bio_get(osd_req-r_bio);
}

-   if (coll) {
-   ret = -ENOMEM;
-   rbd_req = kmalloc(sizeof(*rbd_req), GFP_NOIO);
-   if (!rbd_req)
-   goto done_osd_req;
-
-   rbd_req-rq = rq;
-   rbd_req-bio = bio;
-   rbd_req-pages = pages;
-   rbd_req-len = len;
-   rbd_req-coll = coll;
-   rbd_req-coll_index = coll_index;
-   }
-
osd_req-r_callback = rbd_cb;
-   osd_req-r_priv = rbd_req;
+   osd_req-r_priv = NULL;

strncpy(osd_req-r_oid, object_name, sizeof(osd_req-r_oid));
osd_req-r_oid_len = strlen(osd_req-r_oid);
@@ -1274,8 +1226,6 @@ static int rbd_do_request(struct request *rq,
  done_err:
if (bio)
bio_chain_put(osd_req-r_bio);
-   kfree(rbd_req);
-done_osd_req:
ceph_osdc_put_request(osd_req);

return ret;
@@ -1314,7 +1264,6 @@ static int rbd_req_sync_op(struct rbd_device *rbd_dev,
  pages, num_pages,
  flags,
  op,
- NULL, 0,
  NULL,
  ver);
if (ret  0)
@@ -1390,7 +1339,6 @@ static int rbd_req_sync_notify_ack(struct
rbd_device *rbd_dev,
  NULL, 0,
  CEPH_OSD_FLAG_READ,
  op,
- NULL, 0,
  rbd_simple_req_cb, NULL);

rbd_osd_req_op_destroy(op);





Re: [PATCH 04/12, v2] rbd: implement sync object read with new code

2013-01-29 Thread Josh Durgin

A couple small comments, but looks good.

Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 01/24/2013 08:33 AM, Alex Elder wrote:

Reimplement the synchronous read operation used for reading a
version 1 header using the new request tracking code.  Name the
resulting function rbd_obj_read_sync() to better reflect that
it's a full object operation, not an object request.  To do this,
implement a new obj_req_pages object request type.

This implements a new mechanism to allow the caller to wait for
completion for an rbd_obj_request by calling rbd_obj_request_wait().

This partially resolves:
 http://tracker.newdream.net/issues/3755

Signed-off-by: Alex Elder el...@inktank.com
---
  drivers/block/rbd.c |   96
---
  1 file changed, 92 insertions(+), 4 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 3302cea..742236b 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -170,7 +170,7 @@ typedef void (*rbd_img_callback_t)(struct
rbd_img_request *);
  struct rbd_obj_request;
  typedef void (*rbd_obj_callback_t)(struct rbd_obj_request *);

-enum obj_req_type { obj_req_bio }; /* More types to come */
+enum obj_req_type { obj_req_bio, obj_req_pages };


Should be capitalized


  struct rbd_obj_request {
const char  *object_name;
@@ -182,7 +182,13 @@ struct rbd_obj_request {
u32 which;  /* posn image request list */

enum obj_req_type   type;
-   struct bio  *bio_list;
+   union {
+   struct bio  *bio_list;
+   struct {
+   struct page **pages;
+   u32 page_count;
+   };
+   };

struct ceph_osd_request *osd_req;

@@ -192,6 +198,7 @@ struct rbd_obj_request {
atomic_tdone;

rbd_obj_callback_t  callback;
+   struct completion   completion;

struct kref kref;
  };
@@ -1077,6 +1084,7 @@ static bool obj_req_type_valid(enum obj_req_type type)
  {
switch (type) {
case obj_req_bio:
+   case obj_req_pages:
return true;
default:
return false;
@@ -1291,14 +1299,23 @@ static void rbd_img_request_complete(struct
rbd_img_request *img_request)
rbd_img_request_put(img_request);
  }

+/* Caller is responsible for rbd_obj_request_destroy(obj_request) */
+
+static int rbd_obj_request_wait(struct rbd_obj_request *obj_request)
+{
+   return wait_for_completion_interruptible(obj_request-completion);
+}
+
  static void rbd_obj_request_complete(struct rbd_obj_request *obj_request)
  {
if (obj_request-callback)
obj_request-callback(obj_request);
+   else
+   complete_all(obj_request-completion);
  }

  /*
- * Request sync osd read
+ * Synchronously read a range from an object into a provided buffer
   */
  static int rbd_req_sync_read(struct rbd_device *rbd_dev,
  const char *object_name,
@@ -1556,6 +1573,11 @@ static struct ceph_osd_request *rbd_osd_req_create(
/* osd client requires num pages even for bio */
osd_req-r_num_pages = calc_pages_for(offset, length);
break;
+   case obj_req_pages:
+   osd_req-r_pages = obj_request-pages;
+   osd_req-r_num_pages = obj_request-page_count;
+   osd_req-r_page_alignment = offset  ~PAGE_MASK;
+   break;
}

if (write_request) {
@@ -1618,6 +1640,7 @@ static struct rbd_obj_request
*rbd_obj_request_create(const char *object_name,
obj_request-type = type;
INIT_LIST_HEAD(obj_request-links);
atomic_set(obj_request-done, 0);
+   init_completion(obj_request-completion);
kref_init(obj_request-kref);

return obj_request;
@@ -1641,6 +1664,11 @@ static void rbd_obj_request_destroy(struct kref
*kref)
if (obj_request-bio_list)
bio_chain_put(obj_request-bio_list);
break;
+   case obj_req_pages:
+   if (obj_request-pages)
+   ceph_release_page_vector(obj_request-pages,
+   obj_request-page_count);
+   break;
}

kfree(obj_request);
@@ -1988,6 +2016,65 @@ static void rbd_free_disk(struct rbd_device *rbd_dev)
put_disk(disk);
  }

+static int rbd_obj_read_sync(struct rbd_device *rbd_dev,
+   const char *object_name,
+   u64 offset, u64 length,
+   char *buf, u64 *version)
+
+{
+   struct ceph_osd_req_op *op;
+   struct rbd_obj_request *obj_request;
+   struct ceph_osd_client *osdc;
+   struct page **pages = NULL;
+   u32 page_count;
+   int ret;
+
+   page_count = (u32) calc_pages_for(offset, 

Re: [PATCH 05/12, v2] rbd: get rid of rbd_req_sync_read()

2013-01-29 Thread Josh Durgin

Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 01/24/2013 08:33 AM, Alex Elder wrote:

rbd_req_sync_read() is no longer used, so get rid of it.

Signed-off-by: Alex Elder el...@inktank.com
---
  drivers/block/rbd.c |   24 
  1 file changed, 24 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 742236b..750fc73 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1315,29 +1315,6 @@ static void rbd_obj_request_complete(struct
rbd_obj_request *obj_request)
  }

  /*
- * Synchronously read a range from an object into a provided buffer
- */
-static int rbd_req_sync_read(struct rbd_device *rbd_dev,
- const char *object_name,
- u64 ofs, u64 len,
- char *buf,
- u64 *ver)
-{
-   struct ceph_osd_req_op *op;
-   int ret;
-
-   op = rbd_osd_req_op_create(CEPH_OSD_OP_READ, ofs, len);
-   if (!op)
-   return -ENOMEM;
-
-   ret = rbd_req_sync_op(rbd_dev, CEPH_OSD_FLAG_READ,
-  op, object_name, ofs, len, buf, ver);
-   rbd_osd_req_op_destroy(op);
-
-   return ret;
-}
-
-/*
   * Request sync osd watch
   */
  static int rbd_req_sync_notify_ack(struct rbd_device *rbd_dev,
@@ -2113,7 +2090,6 @@ rbd_dev_v1_header_read(struct rbd_device *rbd_dev,
u64 *version)
if (!ondisk)
return ERR_PTR(-ENOMEM);

-   (void) rbd_req_sync_read;   /* avoid a warning */
ret = rbd_obj_read_sync(rbd_dev, rbd_dev-header_name,
   0, size,
   (char *) ondisk, version);





Re: [PATCH 06/12, v2] rbd: implement watch/unwatch with new code

2013-01-29 Thread Josh Durgin

Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 01/24/2013 08:33 AM, Alex Elder wrote:

Implement a new function to set up or tear down a watch event
for a mapped rbd image header using the new request code.

Create a new object request type nodata to handle this.  And
define rbd_osd_trivial_callback() which simply marks a request done.

Signed-off-by: Alex Elder el...@inktank.com
---
  drivers/block/rbd.c |   87
+--
  1 file changed, 84 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 750fc73..7b1eddc 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -170,7 +170,7 @@ typedef void (*rbd_img_callback_t)(struct
rbd_img_request *);
  struct rbd_obj_request;
  typedef void (*rbd_obj_callback_t)(struct rbd_obj_request *);

-enum obj_req_type { obj_req_bio, obj_req_pages };
+enum obj_req_type { obj_req_nodata, obj_req_bio, obj_req_pages };

  struct rbd_obj_request {
const char  *object_name;
@@ -1083,6 +1083,7 @@ static inline void rbd_img_obj_request_del(struct
rbd_img_request *img_request,
  static bool obj_req_type_valid(enum obj_req_type type)
  {
switch (type) {
+   case obj_req_nodata:
case obj_req_bio:
case obj_req_pages:
return true;
@@ -1306,6 +1307,12 @@ static int rbd_obj_request_wait(struct
rbd_obj_request *obj_request)
return wait_for_completion_interruptible(obj_request-completion);
  }

+static void rbd_osd_trivial_callback(struct rbd_obj_request *obj_request,
+   struct ceph_osd_op *op)
+{
+   atomic_set(obj_request-done, 1);
+}
+
  static void rbd_obj_request_complete(struct rbd_obj_request *obj_request)
  {
if (obj_request-callback)
@@ -1500,6 +1507,9 @@ static void rbd_osd_req_callback(struct
ceph_osd_request *osd_req,
case CEPH_OSD_OP_WRITE:
rbd_osd_write_callback(obj_request, op);
break;
+   case CEPH_OSD_OP_WATCH:
+   rbd_osd_trivial_callback(obj_request, op);
+   break;
default:
rbd_warn(NULL, %s: unsupported op %hu\n,
obj_request-object_name, (unsigned short) opcode);
@@ -1543,6 +1553,8 @@ static struct ceph_osd_request *rbd_osd_req_create(

rbd_assert(obj_req_type_valid(obj_request-type));
switch (obj_request-type) {
+   case obj_req_nodata:
+   break;  /* Nothing to do */
case obj_req_bio:
rbd_assert(obj_request-bio_list != NULL);
osd_req-r_bio = obj_request-bio_list;
@@ -1637,6 +1649,8 @@ static void rbd_obj_request_destroy(struct kref *kref)

rbd_assert(obj_req_type_valid(obj_request-type));
switch (obj_request-type) {
+   case obj_req_nodata:
+   break;  /* Nothing to do */
case obj_req_bio:
if (obj_request-bio_list)
bio_chain_put(obj_request-bio_list);
@@ -1863,6 +1877,72 @@ static int rbd_img_request_submit(struct
rbd_img_request *img_request)
return 0;
  }

+/*
+ * Request sync osd watch/unwatch.  The value of start determines
+ * whether a watch request is being initiated or torn down.
+ */
+static int rbd_dev_header_watch_sync(struct rbd_device *rbd_dev, int start)
+{
+   struct ceph_osd_client *osdc = rbd_dev-rbd_client-client-osdc;
+   struct rbd_obj_request *obj_request;
+   struct ceph_osd_req_op *op;
+   int ret;
+
+   rbd_assert(start ^ !!rbd_dev-watch_event);
+   rbd_assert(start ^ !!rbd_dev-watch_request);
+
+   if (start) {
+   ret = ceph_osdc_create_event(osdc, rbd_watch_cb, 0, rbd_dev,
+   rbd_dev-watch_event);
+   if (ret  0)
+   return ret;
+   }
+
+   ret = -ENOMEM;
+   obj_request = rbd_obj_request_create(rbd_dev-header_name,
+   0, 0, obj_req_nodata);
+   if (!obj_request)
+   goto out_cancel;
+
+   op = rbd_osd_req_op_create(CEPH_OSD_OP_WATCH,
+   rbd_dev-watch_event-cookie,
+   rbd_dev-header.obj_version, start);
+   if (!op)
+   goto out_cancel;
+   obj_request-osd_req = rbd_osd_req_create(rbd_dev, true,
+   obj_request, op);
+   rbd_osd_req_op_destroy(op);
+   if (!obj_request-osd_req)
+   goto out_cancel;
+
+   if (start) {
+   rbd_dev-watch_request = obj_request-osd_req;
+   ceph_osdc_set_request_linger(osdc, rbd_dev-watch_request);
+   }
+   ret = rbd_obj_request_submit(osdc, obj_request);
+   if (ret)
+   goto out_cancel;
+   ret = rbd_obj_request_wait(obj_request);
+   if (ret)
+   goto out_cancel;
+
+   ret = obj_request-result;
+   if 

Re: [PATCH 07/12, v2] rbd: get rid of rbd_req_sync_watch()

2013-01-29 Thread Josh Durgin

Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 01/24/2013 08:34 AM, Alex Elder wrote:

Get rid of rbd_req_sync_watch(), because it is no longer used.

Signed-off-by: Alex Elder el...@inktank.com
---
  drivers/block/rbd.c |   42 --
  1 file changed, 42 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 7b1eddc..8f659f3 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1367,47 +1367,6 @@ static void rbd_watch_cb(u64 ver, u64 notify_id,
u8 opcode, void *data)
rbd_req_sync_notify_ack(rbd_dev, hver, notify_id);
  }

-/*
- * Request sync osd watch/unwatch.  The value of start determines
- * whether a watch request is being initiated or torn down.
- */
-static int rbd_req_sync_watch(struct rbd_device *rbd_dev, int start)
-{
-   struct ceph_osd_req_op *op;
-   int ret = 0;
-
-   rbd_assert(start ^ !!rbd_dev-watch_event);
-   rbd_assert(start ^ !!rbd_dev-watch_request);
-
-   if (start) {
-   struct ceph_osd_client *osdc;
-
-   osdc = rbd_dev-rbd_client-client-osdc;
-   ret = ceph_osdc_create_event(osdc, rbd_watch_cb, 0, rbd_dev,
-   rbd_dev-watch_event);
-   if (ret  0)
-   return ret;
-   }
-
-   op = rbd_osd_req_op_create(CEPH_OSD_OP_WATCH,
-   rbd_dev-watch_event-cookie,
-   rbd_dev-header.obj_version, start);
-   if (op)
-   ret = rbd_req_sync_op(rbd_dev,
- CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK,
- op, rbd_dev-header_name,
- 0, 0, NULL, NULL);
-
-   /* Cancel the event if we're tearing down, or on error */
-
-   if (!start || !op || ret  0) {
-   ceph_osdc_cancel_event(rbd_dev-watch_event);
-   rbd_dev-watch_event = NULL;
-   }
-   rbd_osd_req_op_destroy(op);
-
-   return ret;
-}

  /*
   * Synchronous osd object method call
@@ -3960,7 +3919,6 @@ static int rbd_dev_probe_finish(struct rbd_device
*rbd_dev)
if (ret)
goto err_out_bus;

-   (void) rbd_req_sync_watch;  /* avoid a warning */
ret = rbd_dev_header_watch_sync(rbd_dev, 1);
if (ret)
goto err_out_bus;





Re: [PATCH 08/12, v2] rbd: use new code for notify ack

2013-01-29 Thread Josh Durgin

Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 01/24/2013 08:34 AM, Alex Elder wrote:

Use the new object request tracking mechanism for handling a
notify_ack request.

Move the callback function below the definition of this so we don't
have to do a pre-declaration.

This resolves:
 http://tracker.newdream.net/issues/3754

Signed-off-by: Alex Elder el...@inktank.com
---
  drivers/block/rbd.c |   76
+--
  1 file changed, 55 insertions(+), 21 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 8f659f3..e2b6230 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1347,27 +1347,6 @@ static int rbd_req_sync_notify_ack(struct
rbd_device *rbd_dev,
return ret;
  }

-static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data)
-{
-   struct rbd_device *rbd_dev = (struct rbd_device *)data;
-   u64 hver;
-   int rc;
-
-   if (!rbd_dev)
-   return;
-
-   dout(rbd_watch_cb %s notify_id=%llu opcode=%u\n,
-   rbd_dev-header_name, (unsigned long long) notify_id,
-   (unsigned int) opcode);
-   rc = rbd_dev_refresh(rbd_dev, hver);
-   if (rc)
-   rbd_warn(rbd_dev, got notification but failed to 
-   update snaps: %d\n, rc);
-
-   rbd_req_sync_notify_ack(rbd_dev, hver, notify_id);
-}
-
-
  /*
   * Synchronous osd object method call
   */
@@ -1466,6 +1445,7 @@ static void rbd_osd_req_callback(struct
ceph_osd_request *osd_req,
case CEPH_OSD_OP_WRITE:
rbd_osd_write_callback(obj_request, op);
break;
+   case CEPH_OSD_OP_NOTIFY_ACK:
case CEPH_OSD_OP_WATCH:
rbd_osd_trivial_callback(obj_request, op);
break;
@@ -1836,6 +1816,60 @@ static int rbd_img_request_submit(struct
rbd_img_request *img_request)
return 0;
  }

+static int rbd_obj_notify_ack_sync(struct rbd_device *rbd_dev,
+  u64 ver, u64 notify_id)
+{
+   struct rbd_obj_request *obj_request;
+   struct ceph_osd_req_op *op;
+   struct ceph_osd_client *osdc;
+   int ret;
+
+   obj_request = rbd_obj_request_create(rbd_dev-header_name, 0, 0,
+   obj_req_nodata);
+   if (!obj_request)
+   return -ENOMEM;
+
+   ret = -ENOMEM;
+   op = rbd_osd_req_op_create(CEPH_OSD_OP_NOTIFY_ACK, notify_id, ver);
+   if (!op)
+   goto out;
+   obj_request-osd_req = rbd_osd_req_create(rbd_dev, false,
+   obj_request, op);
+   rbd_osd_req_op_destroy(op);
+   if (!obj_request-osd_req)
+   goto out;
+
+   osdc = rbd_dev-rbd_client-client-osdc;
+   ret = rbd_obj_request_submit(osdc, obj_request);
+   if (!ret)
+   ret = rbd_obj_request_wait(obj_request);
+out:
+   rbd_obj_request_put(obj_request);
+
+   return ret;
+}
+
+static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data)
+{
+   struct rbd_device *rbd_dev = (struct rbd_device *)data;
+   u64 hver;
+   int rc;
+
+   if (!rbd_dev)
+   return;
+
+   dout(rbd_watch_cb %s notify_id=%llu opcode=%u\n,
+   rbd_dev-header_name, (unsigned long long) notify_id,
+   (unsigned int) opcode);
+   rc = rbd_dev_refresh(rbd_dev, hver);
+   if (rc)
+   rbd_warn(rbd_dev, got notification but failed to 
+   update snaps: %d\n, rc);
+
+   (void) rbd_req_sync_notify_ack; /* avoid a warning */
+   rbd_obj_notify_ack_sync(rbd_dev, hver, notify_id);
+}
+
  /*
   * Request sync osd watch/unwatch.  The value of start determines
   * whether a watch request is being initiated or torn down.





Re: [PATCH 09/12, v2] rbd: get rid of rbd_req_sync_notify_ack()

2013-01-29 Thread Josh Durgin

Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 01/24/2013 08:35 AM, Alex Elder wrote:

Get rid of rbd_req_sync_notify_ack() because it is no longer used.
As a result rbd_simple_req_cb() becomes unreferenced, so get rid
of that too.

Signed-off-by: Alex Elder el...@inktank.com
---
  drivers/block/rbd.c |   33 -
  1 file changed, 33 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index e2b6230..b952b2f 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1240,12 +1240,6 @@ done_err:
return ret;
  }

-static void rbd_simple_req_cb(struct ceph_osd_request *osd_req,
-   struct ceph_msg *msg)
-{
-   ceph_osdc_put_request(osd_req);
-}
-
  /*
   * Do a synchronous ceph osd operation
   */
@@ -1322,32 +1316,6 @@ static void rbd_obj_request_complete(struct
rbd_obj_request *obj_request)
  }

  /*
- * Request sync osd watch
- */
-static int rbd_req_sync_notify_ack(struct rbd_device *rbd_dev,
-  u64 ver,
-  u64 notify_id)
-{
-   struct ceph_osd_req_op *op;
-   int ret;
-
-   op = rbd_osd_req_op_create(CEPH_OSD_OP_NOTIFY_ACK, notify_id, ver);
-   if (!op)
-   return -ENOMEM;
-
-   ret = rbd_do_request(NULL, rbd_dev, NULL, CEPH_NOSNAP,
- rbd_dev-header_name, 0, 0, NULL,
- NULL, 0,
- CEPH_OSD_FLAG_READ,
- op,
- rbd_simple_req_cb, NULL);
-
-   rbd_osd_req_op_destroy(op);
-
-   return ret;
-}
-
-/*
   * Synchronous osd object method call
   */
  static int rbd_req_sync_exec(struct rbd_device *rbd_dev,
@@ -1866,7 +1834,6 @@ static void rbd_watch_cb(u64 ver, u64 notify_id,
u8 opcode, void *data)
rbd_warn(rbd_dev, got notification but failed to 
update snaps: %d\n, rc);

-   (void) rbd_req_sync_notify_ack; /* avoid a warning */
rbd_obj_notify_ack_sync(rbd_dev, hver, notify_id);
  }





Re: [PATCH 10/12, v2] rbd: send notify ack asynchronously

2013-01-29 Thread Josh Durgin

Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 01/24/2013 08:35 AM, Alex Elder wrote:

When we receive notification of a change to an rbd image's header
object we need to refresh our information about the image (its
size and snapshot context).  Once we have refreshed our rbd image
we need to acknowledge the notification.

This acknowledgement was previously done synchronously, but there's
really no need to wait for it to complete.

Change it so the caller doesn't wait for the notify acknowledgement
request to complete.  And change the name to reflect that it's no longer
synchronous.

This resolves:
 http://tracker.newdream.net/issues/3877

Signed-off-by: Alex Elder el...@inktank.com
---
  drivers/block/rbd.c |   10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b952b2f..48650d1 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1784,7 +1784,7 @@ static int rbd_img_request_submit(struct
rbd_img_request *img_request)
return 0;
  }

-static int rbd_obj_notify_ack_sync(struct rbd_device *rbd_dev,
+static int rbd_obj_notify_ack(struct rbd_device *rbd_dev,
   u64 ver, u64 notify_id)
  {
struct rbd_obj_request *obj_request;
@@ -1808,11 +1808,11 @@ static int rbd_obj_notify_ack_sync(struct
rbd_device *rbd_dev,
goto out;

osdc = rbd_dev-rbd_client-client-osdc;
+   obj_request-callback = rbd_obj_request_put;
ret = rbd_obj_request_submit(osdc, obj_request);
-   if (!ret)
-   ret = rbd_obj_request_wait(obj_request);
  out:
-   rbd_obj_request_put(obj_request);
+   if (ret)
+   rbd_obj_request_put(obj_request);

return ret;
  }
@@ -1834,7 +1834,7 @@ static void rbd_watch_cb(u64 ver, u64 notify_id,
u8 opcode, void *data)
rbd_warn(rbd_dev, got notification but failed to 
update snaps: %d\n, rc);

-   rbd_obj_notify_ack_sync(rbd_dev, hver, notify_id);
+   rbd_obj_notify_ack(rbd_dev, hver, notify_id);
  }

  /*





Re: [PATCH 11/12, v2] rbd: implement sync method with new code

2013-01-29 Thread Josh Durgin

With the version parameter removed now or in a later patch:

Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 01/24/2013 08:36 AM, Alex Elder wrote:

Reimplement synchronous object method calls using the new request
tracking code.  Use the name rbd_obj_method_sync().

Signed-off-by: Alex Elder el...@inktank.com
---
  drivers/block/rbd.c |  111
+++
  1 file changed, 94 insertions(+), 17 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 48650d1..5ad2ac2 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1413,6 +1413,7 @@ static void rbd_osd_req_callback(struct
ceph_osd_request *osd_req,
case CEPH_OSD_OP_WRITE:
rbd_osd_write_callback(obj_request, op);
break;
+   case CEPH_OSD_OP_CALL:
case CEPH_OSD_OP_NOTIFY_ACK:
case CEPH_OSD_OP_WATCH:
rbd_osd_trivial_callback(obj_request, op);
@@ -1903,6 +1904,81 @@ done:
return ret;
  }

+/*
+ * Synchronous osd object method call
+ */
+static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
+const char *object_name,
+const char *class_name,
+const char *method_name,
+const char *outbound,
+size_t outbound_size,
+char *inbound,
+size_t inbound_size,
+u64 *version)


You can get rid of the version parameter. It's not useful to any
current or planned rbd code that I can think of.


+{
+   struct rbd_obj_request *obj_request;
+   struct ceph_osd_client *osdc;
+   struct ceph_osd_req_op *op;
+   struct page **pages;
+   u32 page_count;
+   int ret;
+
+   /*
+* Method calls are ultimately read operations but they
+* don't involve object data (so no offset or length).
+* The result should be placed into the inbound buffer
+* provided.  They also supply outbound data--parameters for
+* the object method.  Currently if this is present it will
+* be a snapshot id.
+*/
+   page_count = (u32) calc_pages_for(0, inbound_size);
+   pages = ceph_alloc_page_vector(page_count, GFP_KERNEL);
+   if (IS_ERR(pages))
+   return PTR_ERR(pages);
+
+   ret = -ENOMEM;
+   obj_request = rbd_obj_request_create(object_name, 0, 0, obj_req_pages);
+   if (!obj_request)
+   goto out;
+
+   obj_request-pages = pages;
+   obj_request-page_count = page_count;
+
+   op = rbd_osd_req_op_create(CEPH_OSD_OP_CALL, class_name,
+   method_name, outbound, outbound_size);
+   if (!op)
+   goto out;
+   obj_request-osd_req = rbd_osd_req_create(rbd_dev, false,
+   obj_request, op);
+   rbd_osd_req_op_destroy(op);
+   if (!obj_request-osd_req)
+   goto out;
+
+   osdc = rbd_dev-rbd_client-client-osdc;
+   ret = rbd_obj_request_submit(osdc, obj_request);
+   if (ret)
+   goto out;
+   ret = rbd_obj_request_wait(obj_request);
+   if (ret)
+   goto out;
+
+   ret = obj_request-result;
+   if (ret  0)
+   goto out;
+   ret = ceph_copy_from_page_vector(pages, inbound, 0,
+   obj_request-xferred);
+   if (version)
+   *version = obj_request-version;


Here too


+out:
+   if (obj_request)
+   rbd_obj_request_put(obj_request);
+   else
+   ceph_release_page_vector(pages, page_count);
+
+   return ret;
+}
+
  static void rbd_request_fn(struct request_queue *q)
  {
struct rbd_device *rbd_dev = q-queuedata;
@@ -2753,11 +2829,12 @@ static int _rbd_dev_v2_snap_size(struct
rbd_device *rbd_dev, u64 snap_id,
__le64 size;
} __attribute__ ((packed)) size_buf = { 0 };

-   ret = rbd_req_sync_exec(rbd_dev, rbd_dev-header_name,
+   (void) rbd_req_sync_exec;   /* Avoid a warning */
+   ret = rbd_obj_method_sync(rbd_dev, rbd_dev-header_name,
rbd, get_size,
(char *) snapid, sizeof (snapid),
(char *) size_buf, sizeof (size_buf), NULL);
-   dout(%s: rbd_req_sync_exec returned %d\n, __func__, ret);
+   dout(%s: rbd_obj_method_sync returned %d\n, __func__, ret);
if (ret  0)
return ret;

@@ -2788,14 +2865,14 @@ static int rbd_dev_v2_object_prefix(struct
rbd_device *rbd_dev)
if (!reply_buf)
return -ENOMEM;

-   ret = rbd_req_sync_exec(rbd_dev, rbd_dev-header_name,
+   ret = rbd_obj_method_sync(rbd_dev, rbd_dev-header_name,
rbd, get_object_prefix,
NULL, 0,
 

Re: [PATCH 12/12, v2] rbd: get rid of rbd_req_sync_exec()

2013-01-29 Thread Josh Durgin

Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 01/24/2013 08:36 AM, Alex Elder wrote:

Get rid of rbd_req_sync_exec() because it is no longer used.  That
eliminates the last use of rbd_req_sync_op(), so get rid of that
too.  And finally, that leaves rbd_do_request() unreferenced, so get
rid of that.

Signed-off-by: Alex Elder el...@inktank.com
---
  drivers/block/rbd.c |  160
---
  1 file changed, 160 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 5ad2ac2..c807a4c 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1160,126 +1160,6 @@ static void rbd_osd_req_op_destroy(struct
ceph_osd_req_op *op)
kfree(op);
  }

-/*
- * Send ceph osd request
- */
-static int rbd_do_request(struct request *rq,
- struct rbd_device *rbd_dev,
- struct ceph_snap_context *snapc,
- u64 snapid,
- const char *object_name, u64 ofs, u64 len,
- struct bio *bio,
- struct page **pages,
- int num_pages,
- int flags,
- struct ceph_osd_req_op *op,
- void (*rbd_cb)(struct ceph_osd_request *,
-struct ceph_msg *),
- u64 *ver)
-{
-   struct ceph_osd_client *osdc;
-   struct ceph_osd_request *osd_req;
-   struct timespec mtime = CURRENT_TIME;
-   int ret;
-
-   dout(rbd_do_request object_name=%s ofs=%llu len=%llu\n,
-   object_name, (unsigned long long) ofs,
-   (unsigned long long) len);
-
-   osdc = rbd_dev-rbd_client-client-osdc;
-   osd_req = ceph_osdc_alloc_request(osdc, snapc, 1, false, GFP_NOIO);
-   if (!osd_req)
-   return -ENOMEM;
-
-   osd_req-r_flags = flags;
-   osd_req-r_pages = pages;
-   if (bio) {
-   osd_req-r_bio = bio;
-   bio_get(osd_req-r_bio);
-   }
-
-   osd_req-r_callback = rbd_cb;
-   osd_req-r_priv = NULL;
-
-   strncpy(osd_req-r_oid, object_name, sizeof(osd_req-r_oid));
-   osd_req-r_oid_len = strlen(osd_req-r_oid);
-
-   osd_req-r_file_layout = rbd_dev-layout; /* struct */
-   osd_req-r_num_pages = calc_pages_for(ofs, len);
-   osd_req-r_page_alignment = ofs  ~PAGE_MASK;
-
-   ceph_osdc_build_request(osd_req, ofs, len, 1, op,
-   snapc, snapid, mtime);
-
-   if (op-op == CEPH_OSD_OP_WATCH  op-watch.flag) {
-   ceph_osdc_set_request_linger(osdc, osd_req);
-   rbd_dev-watch_request = osd_req;
-   }
-
-   ret = ceph_osdc_start_request(osdc, osd_req, false);
-   if (ret  0)
-   goto done_err;
-
-   if (!rbd_cb) {
-   u64 version;
-
-   ret = ceph_osdc_wait_request(osdc, osd_req);
-   version = le64_to_cpu(osd_req-r_reassert_version.version);
-   if (ver)
-   *ver = version;
-   dout(reassert_ver=%llu\n, (unsigned long long) version);
-   ceph_osdc_put_request(osd_req);
-   }
-   return ret;
-
-done_err:
-   if (bio)
-   bio_chain_put(osd_req-r_bio);
-   ceph_osdc_put_request(osd_req);
-
-   return ret;
-}
-
-/*
- * Do a synchronous ceph osd operation
- */
-static int rbd_req_sync_op(struct rbd_device *rbd_dev,
-  int flags,
-  struct ceph_osd_req_op *op,
-  const char *object_name,
-  u64 ofs, u64 inbound_size,
-  char *inbound,
-  u64 *ver)
-{
-   int ret;
-   struct page **pages;
-   int num_pages;
-
-   rbd_assert(op != NULL);
-
-   num_pages = calc_pages_for(ofs, inbound_size);
-   pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL);
-   if (IS_ERR(pages))
-   return PTR_ERR(pages);
-
-   ret = rbd_do_request(NULL, rbd_dev, NULL, CEPH_NOSNAP,
- object_name, ofs, inbound_size, NULL,
- pages, num_pages,
- flags,
- op,
- NULL,
- ver);
-   if (ret  0)
-   goto done;
-
-   if ((flags  CEPH_OSD_FLAG_READ)  inbound)
-   ret = ceph_copy_from_page_vector(pages, inbound, ofs, ret);
-
-done:
-   ceph_release_page_vector(pages, num_pages);
-   return ret;
-}
-
  static int rbd_obj_request_submit(struct ceph_osd_client *osdc,
struct rbd_obj_request *obj_request)
  {
@@ -1315,45 +1195,6 @@ static void rbd_obj_request_complete(struct
rbd_obj_request *obj_request)
complete_all(obj_request-completion);
  }

-/*
- * Synchronous osd object method call
- */
-static int 

RE: Ceph Production Environment Setup and Configurations?

2013-01-29 Thread Chen, Xiaoxi
[The following views are only on behalf of myself, not related to Intel.]
Looking forward to the performance data on Atom.
Atom performs badly with Swift, but since Ceph is slightly more efficient than
Swift, it should do better.
I have some concern about whether Atom can support such high throughput (you
have 16 disks; assuming 50MB/s per disk, you would need 50*16
*2 (including journal writes) = 1600MB/s, together with the corresponding network
throughput, say 10GbE). The total IO throughput seems too high for Atom. Baidu
(www.baidu.com) uses Arm-based storage nodes (not using Ceph, but their own DFS): a
node has a quad-core Arm together with 4 disks, a 2U box can hold up to 6
nodes, and the 6 nodes share a 10Gb NIC. Although Atom is different from Arm,
you can take Baidu as a reference.


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Gandalf Corvotempesta
Sent: 29 January 2013 17:25
To: femi anjorin
Cc: ceph-devel@vger.kernel.org; Ross Turk
Subject: Re: Ceph Production Environment Setup and Configurations?

2013/1/29 femi anjorin femi.anjo...@gmail.com:
 CPU - Intel(R) Atom(TM) CPU D525   @ 1.80GHz

Atom? For which kind of role ?


Re: Ceph Production Environment Setup and Configurations?

2013-01-29 Thread Joao Eduardo Luis
On 01/29/2013 11:25 AM, Chen, Xiaoxi wrote:
 [The following views only behalf of myself, not relate with Intel..]
 Looking forward for the performance data on Atom.
 Atom perform badly in Swift, but since Ceph is slightly efficient than Swift, 
 it must be better.
 I have some concern about weather Atom can support such high throughput( you 
 have 16 disks, assuming 50MB/s per disk, you would like to have 50*16 
 *2(include journal write)=1600MB/s ,together with corresponding network 
 throughput, say 10GbE. The total IO throughput seems too high for Atom. Baidu 
 (www.baidu.com) use Arm based storage node(not using ceph,but their own DFS), 
 a node with a quad-core Arm together with 4 disks, and a 2U box can put up to 
 6 nodes, the 6 nodes share a 10Gb NIC. Although Atom is different with Arm, 
 but you can take Baidu as a reference.
 
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Gandalf Corvotempesta
 Sent: 29 January 2013 17:25
 To: femi anjorin
 Cc: ceph-devel@vger.kernel.org; Ross Turk
 Subject: Re: Ceph Production Environment Setup and Configurations?
 
 2013/1/29 femi anjorin femi.anjo...@gmail.com:
 CPU - Intel(R) Atom(TM) CPU D525   @ 1.80GHz
 
 Atom? For which kind of role ?
 


FWIW, Wido had been testing Ceph on Atoms a while back.  I don't know
what conclusions he reached, or even if he has already reached any
conclusions at all.  I, for one, would be really interested in knowing
how well (or how poorly) Atoms deal with Ceph.

Maybe Wido can share some thoughts on this one? :-)

  -Joao


Re: [PATCH 1/2] utime: fix narrowing conversion compiler warning in sleep()

2013-01-29 Thread Danny Al-Gaaf
Hi,

could someone cherry-pick this patch to the bobtail branch? I see the
same issue there.

Thanks!

Danny


On 27.01.2013 21:57, Danny Al-Gaaf wrote:
 Fix compiler warning:
 ./include/utime.h: In member function 'void utime_t::sleep()':
 ./include/utime.h:139:50: warning: narrowing conversion of
 '((utime_t*)this)->utime_t::tv.utime_t::<anonymous struct>::tv_sec' from
  '__u32 {aka unsigned int}' to '__time_t {aka long int}' inside { } is
  ill-formed in C++11 [-Wnarrowing]
 ./include/utime.h:139:50: warning: narrowing conversion of
 '((utime_t*)this)->utime_t::tv.utime_t::<anonymous struct>::tv_nsec' from
  '__u32 {aka unsigned int}' to 'long int' inside { } is
  ill-formed in C++11 [-Wnarrowing]
 
 Signed-off-by: Danny Al-Gaaf danny.al-g...@bisect.de
 ---
  src/include/utime.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/src/include/utime.h b/src/include/utime.h
 index 526dec5..f433fff 100644
 --- a/src/include/utime.h
 +++ b/src/include/utime.h
 @@ -136,7 +136,7 @@ public:
}
  
void sleep() {
 -struct timespec ts = { tv.tv_sec, tv.tv_nsec };
 +struct timespec ts = { (__time_t)tv.tv_sec, (long)tv.tv_nsec };
  nanosleep(&ts, &ts);
}
  
 



Re: Ceph Production Environment Setup?

2013-01-29 Thread Joao Eduardo Luis

On 01/29/2013 02:56 AM, femi anjorin wrote:

Please can anyone  an advise  on how exactly a CEPH production
environment should look like? and what the configuration files should
be. My hardwares include the following:

Server A, B, C configuration
CPU - Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz
RAM - 16GB
Hard drive -  500GB
SSD - 120GB

Server D,E,F,G,H,J configuration
CPU - Intel(R) Atom(TM) CPU D525   @ 1.80GHz
RAM - 4 GB
Boot drive -  320gb
SSD - 120 GB
Storage drives - 16 X 2 TB

I am thinking of these configurations but i am not sure.
Server A - MDS and MON
Server B - MON
Server C - MON
Server D, E,F,G,H,J - OSD



Those 16GB RAM on the monitor nodes vs the 4GB RAM on the osd nodes seem 
to be a bit wrong to me.  The OSDs tend to require much more RAM, for 
instance for recovery, while the monitor is not as heavy on the memory 
-- if a cluster grows significantly large, the in-memory maps may grow a 
lot too, but that reason alone shouldn't be the reason you would give a 
monitor 16GB RAM and 4GB for an osd.


Furthermore, I see you have 16x2TB storage drives.  Is that per OSD 
node?  I'm assuming that's what you're aiming to do, so how many OSDs 
were you thinking of running on the same host?  Usually we go for 1 OSD 
per drive, but you might have something else on your mind.  I am not an 
expert on server configuration, but my point is that, if you are going 
to have more than one OSD on the same host, your RAM sure looks smaller 
than what I would envision.


BTW, not sure if you're placing SSDs on the monitor/mds nodes with the 
same intent as when you place it on the OSD nodes (keeping the osd 
journal, maybe?), but if you indeed intend to keep the daemons journals 
in them I think you should know that the monitor and the mds don't keep 
a journal.  The monitors do keep a store on disk, but the mds don't even 
do that, instead keeping its data directly on the osds and whatever it 
needs in-memory.


  -Joao







Re: Ceph Production Environment Setup?

2013-01-29 Thread Martin B Nielsen
There is also the hardware recommendation page in the ceph docs (
http://ceph.com/docs/master/install/hardware-recommendations/ )

Basically they recommend something like ~1GHz of CPU (or 1 core) and
500MB-1GB RAM per OSD daemon. Also, most run with 1 OSD daemon per disk
(so if you put 16 disks per node you'll vastly overload your Atom
CPU).

Overall, while the cluster chugs along happily the hw specs are
relatively modest; as soon as it starts to recover you'll see high
cpu/mem usage.

Cheers,
Martin


On Tue, Jan 29, 2013 at 3:56 AM, femi anjorin femi.anjo...@gmail.com wrote:

 Please can anyone  an advise  on how exactly a CEPH production
 environment should look like? and what the configuration files should
 be. My hardwares include the following:

 Server A, B, C configuration
 CPU - Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz
 RAM - 16GB
 Hard drive -  500GB
 SSD - 120GB

 Server D,E,F,G,H,J configuration
 CPU - Intel(R) Atom(TM) CPU D525   @ 1.80GHz
 RAM - 4 GB
 Boot drive -  320gb
 SSD - 120 GB
 Storage drives - 16 X 2 TB

 I am thinking of these configurations but i am not sure.
 Server A - MDS and MON
 Server B - MON
 Server C - MON
 Server D, E,F,G,H,J - OSD

 Regards.


Re: Ceph Production Environment Setup and Configurations?

2013-01-29 Thread Mark Nelson
On 01/29/2013 05:34 AM, Joao Eduardo Luis wrote:
 On 01/29/2013 11:25 AM, Chen, Xiaoxi wrote:
 [The following views only behalf of myself, not relate with Intel..]
 Looking forward for the performance data on Atom.
 Atom perform badly in Swift, but since Ceph is slightly efficient than 
 Swift, it must be better.
 I have some concern about weather Atom can support such high throughput( you 
 have 16 disks, assuming 50MB/s per disk, you would like to have 50*16 
 *2(include journal write)=1600MB/s ,together with corresponding network 
 throughput, say 10GbE. The total IO throughput seems too high for Atom. 
 Baidu (www.baidu.com) use Arm based storage node(not using ceph,but their 
 own DFS), a node with a quad-core Arm together with 4 disks, and a 2U box 
 can put up to 6 nodes, the 6 nodes share a 10Gb NIC. Although Atom is 
 different with Arm, but you can take Baidu as a reference.


 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Gandalf Corvotempesta
 Sent: 29 January 2013 17:25
 To: femi anjorin
 Cc: ceph-devel@vger.kernel.org; Ross Turk
 Subject: Re: Ceph Production Environment Setup and Configurations?

 2013/1/29 femi anjorin femi.anjo...@gmail.com:
 CPU - Intel(R) Atom(TM) CPU D525   @ 1.80GHz

 Atom? For which kind of role ?

 
 
 FWIW, Wido had been testing Ceph on Atoms a while back.  I don't know
 what conclusions he reached, or even if he has already reached any
 conclusions at all.  I, for one, would be really interested in knowing
 how well (or how poorly) Atoms deal with Ceph.
 
 Maybe Wido can share some thoughts on this one? :-)
 
-Joao
 

Just FYI, I don't expect you'll be able to get anywhere close to 10GbE
performance on the Atom.  You'll almost certainly be CPU bound long
before that.  I actually would be curious how much throughput you could
push to those disks just doing straight fio tests and how much CPU
overhead you'd see.  I imagine you could max out the CPU without even
having Ceph involved.

I'm guessing you probably are going to be limited to about 0.75-1 OSD
per atom core, and that's assuming that your network and SAS controllers
aren't CPU hogs.

Mark


Re: Ceph Production Environment Setup?

2013-01-29 Thread femi anjorin
Thanks.  I have upgraded all the systems to quad-core processor
machines with 32 GB RAM, although I still have 16 hard drives on each of
the storage nodes.

16 hard drives means I should have 16 OSD daemons, but I don't know what
the OSD configuration should look like in ceph.conf.

I mounted the disk under the OSD data directory according to
http://ceph.com/docs/master/rados/deployment/mkcephfs/
the mount looks like this:
/dev/sda1 on /var/lib/ceph/osd/ceph-0
/dev/sdb1 on /var/lib/ceph/osd/ceph-1
/dev/sdc1 on /var/lib/ceph/osd/ceph-2
/dev/sdd1 on /var/lib/ceph/osd/ceph-3
/dev/sde1 on /var/lib/ceph/osd/ceph-4
/dev/sdf1 on /var/lib/ceph/osd/ceph-5
/dev/sdg1 on /var/lib/ceph/osd/ceph-6
/dev/sdh1 on /var/lib/ceph/osd/ceph-7
/dev/sdi1 on /var/lib/ceph/osd/ceph-8
/dev/sdj1 on /var/lib/ceph/osd/ceph-9
/dev/sdk1 on /var/lib/ceph/osd/ceph-10
/dev/sdl1 on /var/lib/ceph/osd/ceph-11
/dev/sdm1 on /var/lib/ceph/osd/ceph-12
/dev/sdn1 on /var/lib/ceph/osd/ceph-13
/dev/sdo1 on /var/lib/ceph/osd/ceph-14
/dev/sdp1 on /var/lib/ceph/osd/ceph-15

But I don't know what the OSD configuration should look like.  I see the
following in this ceph reference:
http://ceph.com/docs/master/rados/deployment/mkcephfs/

For each [osd.n] section of your configuration file, specify the
storage device. For example:
[osd.1]
devs = /dev/sda
[osd.2]
devs = /dev/sdb 

I guess this is a configuration for one hard drive. What should the
OSD config be with 16 drives in one host?
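Would it simply be one [osd.N] section per drive, something like the sketch
below (just my guess; the hostname is hypothetical)?

[osd.0]
        host = serverD
        devs = /dev/sda1
[osd.1]
        host = serverD
        devs = /dev/sdb1
; ... one [osd.N] section per drive, continuing through ...
[osd.15]
        host = serverD
        devs = /dev/sdp1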



Regards,
Femi.


On Tue, Jan 29, 2013 at 1:39 PM, Martin B Nielsen mar...@unity3d.com wrote:
 There is also the hardware recommendation page in the ceph docs (
 http://ceph.com/docs/master/install/hardware-recommendations/ )

 Basically they recommend something like ~ 1GHz CPU (or 1 core/osd),
 500M-1GB RAM pr OSD daemon. Also most run with 1 OSD daemon pr. disk
 (so if you put 16x disk pr. node you'll vastly overpower your atom
 cpu)

 Overall, while the cluster chugs along happily the hw specs are
 relatively modest; as soon as it starts to recover you'll see high
 cpu/mem usage.

 Cheers,
 Martin


 On Tue, Jan 29, 2013 at 3:56 AM, femi anjorin femi.anjo...@gmail.com wrote:

 Please can anyone advise on how exactly a Ceph production
 environment should look, and what the configuration files should
 be?  My hardware includes the following:

 Server A, B, C configuration
 CPU - Intel(R) Core(TM)2 Quad  CPU   Q9550  @ 2.83GHz
 RAM - 16GB
 Hard drive -  500GB
 SSD - 120GB

 Server D,E,F,G,H,J configuration
 CPU - Intel(R) Atom(TM) CPU D525   @ 1.80GHz
 RAM - 4 GB
 Boot drive -  320gb
 SSD - 120 GB
 Storage drives - 16 X 2 TB

 I am thinking of these configurations but I am not sure.
 Server A - MDS and MON
 Server B - MON
 Server C - MON
 Server D, E,F,G,H,J - OSD

 Regards.


Testing nonsubscriber posts

2013-01-29 Thread Dan Mick

ignore


Re: page allocation failures on osd nodes

2013-01-29 Thread Andrey Korolyov
On Mon, Jan 28, 2013 at 8:55 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Mon, Jan 28, 2013 at 5:48 PM, Sam Lang sam.l...@inktank.com wrote:
 On Sun, Jan 27, 2013 at 2:52 PM, Andrey Korolyov and...@xdel.ru wrote:

 Ahem, once on an almost empty node the same trace was produced by a qemu
 process (which was actually pinned to a specific NUMA node), so it seems
 this is generally a scheduler/mm bug, not directly related to
 the osd processes. In other words, the lower the percentage of memory
 that is actually RSS, the higher the probability of such an allocation
 failure.

 This might be a known bug in xen for your kernel?  The xen users list
 might be able to help.
 -sam

 It is vanilla-3.4; I really wonder where the paravirt bits in the
 trace come from.

The bug shows up only in 3.4 and is really harmless, at least in the ways
I have tested it. Ceph-osd's memory allocation behavior is more likely to
trigger those messages than that of most other applications under the
``same'' conditions.


Re: page allocation failures on osd nodes

2013-01-29 Thread Sam Lang
On Tue, Jan 29, 2013 at 2:21 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Mon, Jan 28, 2013 at 8:55 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Mon, Jan 28, 2013 at 5:48 PM, Sam Lang sam.l...@inktank.com wrote:
 On Sun, Jan 27, 2013 at 2:52 PM, Andrey Korolyov and...@xdel.ru wrote:

 Ahem, once on an almost empty node the same trace was produced by a qemu
 process (which was actually pinned to a specific NUMA node), so it seems
 this is generally a scheduler/mm bug, not directly related to
 the osd processes. In other words, the lower the percentage of memory
 that is actually RSS, the higher the probability of such an allocation
 failure.

 This might be a known bug in xen for your kernel?  The xen users list
 might be able to help.
 -sam

 It is vanilla-3.4; I really wonder where the paravirt bits in the
 trace come from.

 The bug shows up only in 3.4 and is really harmless, at least in the ways
 I have tested it. Ceph-osd's memory allocation behavior is more likely to
 trigger those messages than that of most other applications under the
 ``same'' conditions.

Good to know.  Thanks!
-sam


Re: [PATCH 1/2] utime: fix narrowing conversion compiler warning in sleep()

2013-01-29 Thread Sage Weil
On Tue, 29 Jan 2013, Danny Al-Gaaf wrote:
 Hi,
 
 could someone cherry-pick this patch to the bobtail branch? I see the
 same issue there.

Cherry-picked, thanks!

s

 
 Thanks!
 
 Danny
 
 
  On 27.01.2013 21:57, Danny Al-Gaaf wrote:
  Fix compiler warning:
  ./include/utime.h: In member function 'void utime_t::sleep()':
  ./include/utime.h:139:50: warning: narrowing conversion of
   '((utime_t*)this)->utime_t::tv.utime_t::<anonymous struct>::tv_sec' from
   '__u32 {aka unsigned int}' to '__time_t {aka long int}' inside { } is
   ill-formed in C++11 [-Wnarrowing]
  ./include/utime.h:139:50: warning: narrowing conversion of
   '((utime_t*)this)->utime_t::tv.utime_t::<anonymous struct>::tv_nsec' from
   '__u32 {aka unsigned int}' to 'long int' inside { } is
   ill-formed in C++11 [-Wnarrowing]
  
  Signed-off-by: Danny Al-Gaaf danny.al-g...@bisect.de
  ---
   src/include/utime.h | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)
  
  diff --git a/src/include/utime.h b/src/include/utime.h
  index 526dec5..f433fff 100644
  --- a/src/include/utime.h
  +++ b/src/include/utime.h
  @@ -136,7 +136,7 @@ public:
 }
   
 void sleep() {
  -struct timespec ts = { tv.tv_sec, tv.tv_nsec };
  +struct timespec ts = { (__time_t)tv.tv_sec, (long)tv.tv_nsec };
    nanosleep(&ts, &ts);
 }
   
  
 
 


Re: [ceph] locking fun with d_materialise_unique()

2013-01-29 Thread Sage Weil
Hi Al,

On Tue, 29 Jan 2013, Al Viro wrote:

 On Mon, Jan 28, 2013 at 09:20:34PM -0800, Sage Weil wrote:
 
  Yep, that is indeed a problem.  I think we just need to do the r_aborted 
  and/or r_locked_dir check in the else if condition...
  
   I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
   get to its splice_dentry() and d_delete() calls in similar situations -
   I hadn't checked that one yet.  If it isn't guaranteed, we have a problem
   there as well.
  
  ...and the condition guarding readdir_prepopulate().  :)
 
  I think you're reading it correctly.  The main thing to keep in mind here 
  is that we *do* need to call fill_inode() for the inode metadata on these 
  requests to keep the mds and client state in sync.  The dentry state is 
  safe to ignore.
 
 You mean the parts under
 if (rinfo->head->is_dentry) {
 and
 if (rinfo->head->is_target) {
 in there?  Because there's fill_inode() called from readdir_prepopulate()
 and it's a lot more problematic than those two...

Yep, patch below.
 
  It would be great to have the dir i_mutex rules summarized somewhere, even 
  if it is just a copy of the below.  It took a fair bit of trial and error 
  to infer what was going on when writing this code.  :)
 
 Directory -i_mutex rules are in part documented - what VFS guarantees
 to hold side is in Documentation/filesystems/directory-locking.  It's
 the other side (what locks are expected to be held by callers of dcache.c
 functions) that is badly missing...
 
  Ping me when you've pushed that branch and I'll take a look...
 
 To gitol...@ra.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git
   01a88fa..4056362  master -> master
 
 with tentative ceph patch in the very end.  Should be on git.kernel.org
 shortly...

We should drop the mds_client.c hunk from your patch, and then do 
something like the below.  I'll put it in the ceph tree so we can do some 
basic testing.  Unfortunately, the abort case is a bit annoying to 
trigger... we'll need to write a new test for it.

Thanks-
sage



From 80d70ef02d5277af8e1355db74fffe71f7217e76 Mon Sep 17 00:00:00 2001
From: Sage Weil s...@inktank.com
Date: Tue, 29 Jan 2013 13:01:15 -0800
Subject: [PATCH] ceph: prepopulate inodes only when request is aborted

If r_aborted is true, we do not hold the dir i_mutex, and cannot touch
the dcache.  However, we still need to update the inodes with the state
returned by the MDS.

Reported-by: Al Viro v...@zeniv.linux.org.uk
Signed-off-by: Sage Weil s...@inktank.com
---
 fs/ceph/inode.c |   36 
 1 file changed, 36 insertions(+)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 77b1b18..b51653e 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1195,6 +1195,39 @@ done:
 /*
  * Prepopulate our cache with readdir results, leases, etc.
  */
+static int readdir_prepopulate_inodes_only(struct ceph_mds_request *req,
+  struct ceph_mds_session *session)
+{
+   struct ceph_mds_reply_info_parsed *rinfo = &req->r_reply_info;
+   int i, err = 0;
+
+   for (i = 0; i < rinfo->dir_nr; i++) {
+   struct ceph_vino vino;
+   struct inode *in;
+   int rc;
+
+   vino.ino = le64_to_cpu(rinfo->dir_in[i].in->ino);
+   vino.snap = le64_to_cpu(rinfo->dir_in[i].in->snapid);
+
+   in = ceph_get_inode(req->r_dentry->d_sb, vino);
+   if (IS_ERR(in)) {
+   err = PTR_ERR(in);
+   dout("new_inode badness got %d\n", err);
+   continue;
+   }
+   rc = fill_inode(in, &rinfo->dir_in[i], NULL, session,
+   req->r_request_started, -1,
+   &req->r_caps_reservation);
+   if (rc < 0) {
+   pr_err("fill_inode badness on %p got %d\n", in, rc);
+   err = rc;
+   continue;
+   }
+   }
+
+   return err;
+}
+
 int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 struct ceph_mds_session *session)
 {
@@ -1209,6 +1242,9 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
u64 frag = le32_to_cpu(rhead->args.readdir.frag);
struct ceph_dentry_info *di;
 
+   if (req->r_aborted)
+   return readdir_prepopulate_inodes_only(req, session);
+
if (le32_to_cpu(rinfo->head->op) == CEPH_MDS_OP_LSSNAP) {
snapdir = ceph_get_snapdir(parent->d_inode);
parent = d_find_alias(snapdir);
-- 
1.7.9.5



Re: [PATCH 01/12, v2] rbd: new request tracking code

2013-01-29 Thread Alex Elder
On 01/29/2013 04:43 AM, Josh Durgin wrote:
 On 01/24/2013 06:08 AM, Alex Elder wrote:
 This is an update of the first patch in my request tracking
 code series.  After posting it the other day I identified some
 problems related to reference counting of image and object
 requests.  I also am starting to look at the details of
 implementing layered reads, and ended up making some
 substantive changes.  Since I have not seen any review
 feedback I thought the best thing would be to just
 re-post the updated patch.

 The remaining patches in the series have changed accordingly,
 but they have not really changed substantively, so I am
 not re-posting those (but will if it's requested).

 The main functional change is that an image request no longer
 maintains an array of object request pointers, it maintains
 a list of object requests.  This simplifies some things, and
 makes the image request structure fixed size.

 A few other functional changes:
 - Reference counting of object and image requests is now
done sensibly.
 - Image requests now support a callback when complete,
which will be used for layered I/O requests.
 - There are a few new helper functions that encapsulate
tying an object request to an image request.
 - A distinct value is now used for the which field
    for object requests not associated with an image request
    (mainly used for validation/assertions).

 Other changes:
 - Everything that was named image_request now uses
img_request instead.
 - A few blocks and lines of code have been rearranged.

 The updated series is available on the ceph-client git
 repository in the branch wip-rbd-review-v2.

 -Alex

 This patch fully implements the new request tracking code for rbd
 I/O requests.

Responses to your review comments are below.

Thank you very much for careful, thorough, and thoughtful
work on this.  I believe it's very important and you do
a good job of it.

-Alex

 Each I/O request to an rbd image will get an rbd_image_request
 structure allocated to track it.  This provides access to all
 information about the original request, as well as access to the
 set of one or more object requests that are initiated as a result
 of the image request.

 An rbd_obj_request structure defines a request sent to a single osd
 object (possibly) as part of an rbd image request.  An rbd object
 request refers to a ceph_osd_request structure built up to represent
 the request; for now it will contain a single osd operation.  It
 also provides space to hold the result status and the version of the
 object when the osd request completes.

 An rbd_obj_request structure can also stand on its own.  This will
 be used for reading the version 1 header object, for issuing
 acknowledgements to event notifications, and for making object
 method calls.

 All rbd object requests now complete asynchronously with respect
 to the osd client--they supply a common callback routine.

 This resolves:
  http://tracker.newdream.net/issues/3741

 Signed-off-by: Alex Elder el...@inktank.com
 ---
 v2: - fixed reference counting
  - image request callback support
  - image/object connection helper functions
  - distinct BAD_WHICH value for non-image object requests

   drivers/block/rbd.c |  622
 ++-
   1 file changed, 620 insertions(+), 2 deletions(-)

 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 6689363..46a61dd 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -181,6 +181,67 @@ struct rbd_req_coll {
    struct rbd_req_status   status[0];
   };

 +struct rbd_img_request;
 +typedef void (*rbd_img_callback_t)(struct rbd_img_request *);
 +
 +#define BAD_WHICH       U32_MAX         /* Good which or bad which, which? */
 +
 +struct rbd_obj_request;
 +typedef void (*rbd_obj_callback_t)(struct rbd_obj_request *);
 +
 +enum obj_req_type { obj_req_bio };      /* More types to come */
 
 enum labels should be capitalized.

Not where I come from (at least not always), but I'll
convert these to be all caps.

 +struct rbd_obj_request {
 +        const char              *object_name;
 +        u64                     offset;         /* object start byte */
 +        u64                     length;         /* bytes from offset */
 +
 +        struct rbd_img_request  *img_request;
 +        struct list_head        links;
 +        u32                     which;          /* posn image request list */
 +
 +        enum obj_req_type       type;
 +        struct bio              *bio_list;
 +
 +        struct ceph_osd_request *osd_req;
 +
 +        u64                     xferred;        /* bytes transferred */
 +        u64                     version;
 
 This version is only used (uselessly) for the watch operation. It
 should be removed in a future patch (along with the obj_ver in the
 header).

I'm not sure whether the description is quite right, but:
   http://tracker.ceph.com/issues/3952

 +        s32                     result;
 +        atomic_t                done;
 +
 +rbd_obj_callback_t