Re: [PATCH] ceph: convert comma to semicolon

2014-07-24 Thread Ilya Dryomov
On Wed, Jul 23, 2014 at 6:41 PM, Himangi Saraogi himangi...@gmail.com wrote:
 Replace a comma between expression statements by a semicolon. This changes
 the semantics of the code, but given the current indentation appears to be
 what is intended.

 A simplified version of the Coccinelle semantic patch that performs this
 transformation is as follows:
 // smpl
 @r@
 expression e1,e2;
 @@

  e1
 -,
 +;
  e2;
 // /smpl

 Signed-off-by: Himangi Saraogi himangi...@gmail.com
 Acked-by: Julia Lawall julia.law...@lip6.fr

Applied.

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] rbd: Use rbd_segment_name_free

2014-07-24 Thread Ilya Dryomov
On Thu, Jul 24, 2014 at 1:47 AM, Himangi Saraogi himangi...@gmail.com wrote:
 Free memory allocated using kmem_cache_zalloc using kmem_cache_free
 rather than kfree. The helper rbd_segment_name_free does the job here.
 Its position is shifted above the calling function.

 The Coccinelle semantic patch that detects this change is as follows:

 // smpl
 @@
 expression x,E,c;
 @@

  x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
  ... when != x = E
  when != x
 ?-kfree(x)
 +kmem_cache_free(c,x)
 // /smpl

 Signed-off-by: Himangi Saraogi himangi...@gmail.com
 Acked-by: Julia Lawall julia.law...@lip6.fr

Applied.

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/8] wip-overlap

2014-07-24 Thread Ilya Dryomov
Hi,

This series fixes a bug reported by Sandisk in the Read from clones
thread about a week ago.  The description along with a simplified
reproducer are in 7/8 and 8/8.  1/8 touches on our sysfs interface.
The rest are cleanups.

Thanks,

Ilya


Ilya Dryomov (8):
  rbd: show the entire chain of parent images
  rbd: introduce rbd_dev_header_info()
  rbd: remove unnecessary asserts in rbd_dev_image_probe()
  rbd: split rbd_dev_spec_update() into two functions
  rbd: harden rbd_dev_refresh() caller
  rbd: update mapping size only on refresh
  rbd: do not read in parent info before snap context
  rbd: take snap_id into account when reading in parent info

 Documentation/ABI/testing/sysfs-bus-rbd |4 +-
 drivers/block/rbd.c |  261 ---
 2 files changed, 137 insertions(+), 128 deletions(-)

-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/8] rbd: introduce rbd_dev_header_info()

2014-07-24 Thread Ilya Dryomov
A wrapper around rbd_dev_v{1,2}_header_info() to reduce duplication.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
 drivers/block/rbd.c |   24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 7847fbb949ff..0d3be608f16f 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -514,7 +514,7 @@ static void rbd_dev_remove_parent(struct rbd_device 
*rbd_dev);
 
 static int rbd_dev_refresh(struct rbd_device *rbd_dev);
 static int rbd_dev_v2_header_onetime(struct rbd_device *rbd_dev);
-static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev);
+static int rbd_dev_header_info(struct rbd_device *rbd_dev);
 static const char *rbd_dev_v2_snap_name(struct rbd_device *rbd_dev,
u64 snap_id);
 static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id,
@@ -3506,13 +3506,10 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
u64 mapping_size;
int ret;
 
-   rbd_assert(rbd_image_format_valid(rbd_dev-image_format));
down_write(rbd_dev-header_rwsem);
mapping_size = rbd_dev-mapping.size;
-   if (rbd_dev-image_format == 1)
-   ret = rbd_dev_v1_header_info(rbd_dev);
-   else
-   ret = rbd_dev_v2_header_info(rbd_dev);
+
+   ret = rbd_dev_header_info(rbd_dev);
 
/* If it's a mapped snapshot, validate its EXISTS flag */
 
@@ -4501,6 +4498,16 @@ static int rbd_dev_v2_header_info(struct rbd_device 
*rbd_dev)
return ret;
 }
 
+static int rbd_dev_header_info(struct rbd_device *rbd_dev)
+{
+   rbd_assert(rbd_image_format_valid(rbd_dev-image_format));
+
+   if (rbd_dev-image_format == 1)
+   return rbd_dev_v1_header_info(rbd_dev);
+
+   return rbd_dev_v2_header_info(rbd_dev);
+}
+
 static int rbd_bus_add_dev(struct rbd_device *rbd_dev)
 {
struct device *dev;
@@ -5149,10 +5156,7 @@ static int rbd_dev_image_probe(struct rbd_device 
*rbd_dev, bool mapping)
goto out_header_name;
}
 
-   if (rbd_dev-image_format == 1)
-   ret = rbd_dev_v1_header_info(rbd_dev);
-   else
-   ret = rbd_dev_v2_header_info(rbd_dev);
+   ret = rbd_dev_header_info(rbd_dev);
if (ret)
goto err_out_watch;
 
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/8] rbd: show the entire chain of parent images

2014-07-24 Thread Ilya Dryomov
Make /sys/bus/rbd/devices/id/parent show the entire chain of parent
images.  While at it, kernel sprintf() doesn't return negative values,
casting to unsigned long long is no longer necessary and there is no
good reason to split into multiple sprintf() calls.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
 Documentation/ABI/testing/sysfs-bus-rbd |4 +--
 drivers/block/rbd.c |   56 +--
 2 files changed, 25 insertions(+), 35 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-rbd 
b/Documentation/ABI/testing/sysfs-bus-rbd
index 501adc2a9ec7..2ddd680929d8 100644
--- a/Documentation/ABI/testing/sysfs-bus-rbd
+++ b/Documentation/ABI/testing/sysfs-bus-rbd
@@ -94,5 +94,5 @@ current_snap
 
 parent
 
-   Information identifying the pool, image, and snapshot id for
-   the parent image in a layered rbd image (format 2 only).
+   Information identifying the chain of parent images in a layered rbd
+   image.  Entries are separated by empty lines.
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 703b728e05fa..7847fbb949ff 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3685,46 +3685,36 @@ static ssize_t rbd_snap_show(struct device *dev,
 }
 
 /*
- * For an rbd v2 image, shows the pool id, image id, and snapshot id
- * for the parent image.  If there is no parent, simply shows
- * (no parent image).
+ * For a v2 image, shows the chain of parent images, separated by empty
+ * lines.  For v1 images or if there is no parent, shows (no parent
+ * image).
  */
 static ssize_t rbd_parent_show(struct device *dev,
-struct device_attribute *attr,
-char *buf)
+  struct device_attribute *attr,
+  char *buf)
 {
struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
-   struct rbd_spec *spec = rbd_dev-parent_spec;
-   int count;
-   char *bufp = buf;
+   ssize_t count = 0;
 
-   if (!spec)
+   if (!rbd_dev-parent)
return sprintf(buf, (no parent image)\n);
 
-   count = sprintf(bufp, pool_id %llu\npool_name %s\n,
-   (unsigned long long) spec-pool_id, spec-pool_name);
-   if (count  0)
-   return count;
-   bufp += count;
-
-   count = sprintf(bufp, image_id %s\nimage_name %s\n, spec-image_id,
-   spec-image_name ? spec-image_name : (unknown));
-   if (count  0)
-   return count;
-   bufp += count;
-
-   count = sprintf(bufp, snap_id %llu\nsnap_name %s\n,
-   (unsigned long long) spec-snap_id, spec-snap_name);
-   if (count  0)
-   return count;
-   bufp += count;
-
-   count = sprintf(bufp, overlap %llu\n, rbd_dev-parent_overlap);
-   if (count  0)
-   return count;
-   bufp += count;
-
-   return (ssize_t) (bufp - buf);
+   for ( ; rbd_dev-parent; rbd_dev = rbd_dev-parent) {
+   struct rbd_spec *spec = rbd_dev-parent_spec;
+
+   count += sprintf(buf[count], %s
+   pool_id %llu\npool_name %s\n
+   image_id %s\nimage_name %s\n
+   snap_id %llu\nsnap_name %s\n
+   overlap %llu\n,
+   !count ?  : \n, /* first? */
+   spec-pool_id, spec-pool_name,
+   spec-image_id, spec-image_name ?: (unknown),
+   spec-snap_id, spec-snap_name,
+   rbd_dev-parent_overlap);
+   }
+
+   return count;
 }
 
 static ssize_t rbd_image_refresh(struct device *dev,
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/8] rbd: harden rbd_dev_refresh() caller

2014-07-24 Thread Ilya Dryomov
Recently discovered watch/notify problems showed that we really can't
ignore errors in anything refresh related.  Alas, currently there is
not much we can do in response to those errors, except print warnings.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
 drivers/block/rbd.c |   17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 23df1773ef77..5dd6f5057ef4 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2963,11 +2963,20 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 
opcode, void *data)
dout(%s: \%s\ notify_id %llu opcode %u\n, __func__,
rbd_dev-header_name, (unsigned long long)notify_id,
(unsigned int)opcode);
+
+   /*
+* Until adequate refresh error handling is in place, there is
+* not much we can do here, except warn.
+*
+* See http://tracker.ceph.com/issues/5040
+*/
ret = rbd_dev_refresh(rbd_dev);
if (ret)
-   rbd_warn(rbd_dev, header refresh error (%d)\n, ret);
+   rbd_warn(rbd_dev, refresh failed: %d\n, ret);
 
-   rbd_obj_notify_ack_sync(rbd_dev, notify_id);
+   ret = rbd_obj_notify_ack_sync(rbd_dev, notify_id);
+   if (ret)
+   rbd_warn(rbd_dev, notify_ack ret %d\n, ret);
 }
 
 /*
@@ -3724,9 +3733,9 @@ static ssize_t rbd_image_refresh(struct device *dev,
 
ret = rbd_dev_refresh(rbd_dev);
if (ret)
-   rbd_warn(rbd_dev, : manual header refresh error (%d)\n, ret);
+   return ret;
 
-   return ret  0 ? ret : size;
+   return size;
 }
 
 static DEVICE_ATTR(size, S_IRUGO, rbd_size_show, NULL);
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/8] rbd: update mapping size only on refresh

2014-07-24 Thread Ilya Dryomov
There is no sense in trying to update the mapping size before it's even
been set.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
 drivers/block/rbd.c |   19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 5dd6f5057ef4..16f388f2960b 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -971,12 +971,6 @@ static int rbd_header_from_disk(struct rbd_device *rbd_dev,
header-snap_names = snap_names;
header-snap_sizes = snap_sizes;
 
-   /* Make sure mapping size is consistent with header info */
-
-   if (rbd_dev-spec-snap_id == CEPH_NOSNAP || first_time)
-   if (rbd_dev-mapping.size != header-image_size)
-   rbd_dev-mapping.size = header-image_size;
-
return 0;
 out_2big:
ret = -EIO;
@@ -3520,9 +3514,14 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 
ret = rbd_dev_header_info(rbd_dev);
 
-   /* If it's a mapped snapshot, validate its EXISTS flag */
+   if (rbd_dev-spec-snap_id == CEPH_NOSNAP) {
+   if (rbd_dev-mapping.size != rbd_dev-header.image_size)
+   rbd_dev-mapping.size = rbd_dev-header.image_size;
+   } else {
+   /* validate mapped snapshot's EXISTS flag */
+   rbd_exists_validate(rbd_dev);
+   }
 
-   rbd_exists_validate(rbd_dev);
up_write(rbd_dev-header_rwsem);
 
if (mapping_size != rbd_dev-mapping.size) {
@@ -4505,10 +4504,6 @@ static int rbd_dev_v2_header_info(struct rbd_device 
*rbd_dev)
is EXPERIMENTAL!);
}
 
-   if (rbd_dev-spec-snap_id == CEPH_NOSNAP)
-   if (rbd_dev-mapping.size != rbd_dev-header.image_size)
-   rbd_dev-mapping.size = rbd_dev-header.image_size;
-
ret = rbd_dev_v2_snap_context(rbd_dev);
dout(rbd_dev_v2_snap_context returned %d\n, ret);
 
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 8/8] rbd: take snap_id into account when reading in parent info

2014-07-24 Thread Ilya Dryomov
If we are mapping a snapshot, we must read in the parent_overlap value
of that snapshot instead of that of the base image.  Not doing so may
in particular result in us returning zeros instead of user data:

# cat overlap-snap.sh
#!/bin/bash
rbd create --size 10 --image-format 2 foo
FOO_DEV=$(rbd map foo)
dd if=/dev/urandom of=/dev/rbd0 bs=1M /dev/null
echo Base image
dd if=$FOO_DEV bs=1 count=16 skip=$(((4  20) - 8)) 2/dev/null | xxd
rbd snap create foo@snap
rbd snap protect foo@snap
rbd clone foo@snap bar
rbd snap create bar@snap
BAR_DEV=$(rbd map bar@snap)
echo Snapshot
dd if=$BAR_DEV bs=1 count=16 skip=$(((4  20) - 8)) 2/dev/null | xxd
rbd resize --allow-shrink --size 4 bar
echo Snapshot after base image resize
dd if=$BAR_DEV bs=1 count=16 skip=$(((4  20) - 8)) 2/dev/null | xxd

# ./overlap-snap.sh
Base image
000: e781 e33b d34b 2225 6034 2845 a2e3 36ed  ...;.K%`4(E..6.
Snapshot
000: e781 e33b d34b 2225 6034 2845 a2e3 36ed  ...;.K%`4(E..6.
Resizing image: 100% complete...done.
Snapshot after base image resize
000: e781 e33b d34b 2225      ...;.K%

Even though bar@snap was taken with the old bar parent_overlap (8M),
reads from bar@snap beyond the new bar parent_overlap (4M) return
zeroes.  Fix it.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
 drivers/block/rbd.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index c4606987e9d1..cbc89fa9a677 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4020,7 +4020,7 @@ static int rbd_dev_v2_parent_info(struct rbd_device 
*rbd_dev)
goto out_err;
}
 
-   snapid = cpu_to_le64(CEPH_NOSNAP);
+   snapid = cpu_to_le64(rbd_dev-spec-snap_id);
ret = rbd_obj_method_sync(rbd_dev, rbd_dev-header_name,
rbd, get_parent,
snapid, sizeof (snapid),
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/8] rbd: split rbd_dev_spec_update() into two functions

2014-07-24 Thread Ilya Dryomov
rbd_dev_spec_update() has two modes of operation, with nothing in
common between them.  Split it into two functions, one for each mode
and make our expectations more clear.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
 drivers/block/rbd.c |   79 +++
 1 file changed, 48 insertions(+), 31 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 4541f6027e4a..23df1773ef77 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3798,6 +3798,9 @@ static struct rbd_spec *rbd_spec_alloc(void)
spec = kzalloc(sizeof (*spec), GFP_KERNEL);
if (!spec)
return NULL;
+
+   spec-pool_id = CEPH_NOPOOL;
+   spec-snap_id = CEPH_NOSNAP;
kref_init(spec-kref);
 
return spec;
@@ -4257,18 +4260,38 @@ static u64 rbd_snap_id_by_name(struct rbd_device 
*rbd_dev, const char *name)
 }
 
 /*
- * When an rbd image has a parent image, it is identified by the
- * pool, image, and snapshot ids (not names).  This function fills
- * in the names for those ids.  (It's OK if we can't figure out the
- * name for an image id, but the pool and snapshot ids should always
- * exist and have names.)  All names in an rbd spec are dynamically
- * allocated.
+ * An image being mapped will have everything but the snap id.
+ */
+static int rbd_spec_fill_snap_id(struct rbd_device *rbd_dev)
+{
+   struct rbd_spec *spec = rbd_dev-spec;
+
+   rbd_assert(spec-pool_id != CEPH_NOPOOL  spec-pool_name);
+   rbd_assert(spec-image_id  spec-image_name);
+   rbd_assert(spec-snap_name);
+
+   if (strcmp(spec-snap_name, RBD_SNAP_HEAD_NAME)) {
+   u64 snap_id;
+
+   snap_id = rbd_snap_id_by_name(rbd_dev, spec-snap_name);
+   if (snap_id == CEPH_NOSNAP)
+   return -ENOENT;
+
+   spec-snap_id = snap_id;
+   } else {
+   spec-snap_id = CEPH_NOSNAP;
+   }
+
+   return 0;
+}
+
+/*
+ * A parent image will have all ids but none of the names.
  *
- * When an image being mapped (not a parent) is probed, we have the
- * pool name and pool id, image name and image id, and the snapshot
- * name.  The only thing we're missing is the snapshot id.
+ * All names in an rbd spec are dynamically allocated.  It's OK if we
+ * can't figure out the name for an image id.
  */
-static int rbd_dev_spec_update(struct rbd_device *rbd_dev)
+static int rbd_spec_fill_names(struct rbd_device *rbd_dev)
 {
struct ceph_osd_client *osdc = rbd_dev-rbd_client-client-osdc;
struct rbd_spec *spec = rbd_dev-spec;
@@ -4277,24 +4300,9 @@ static int rbd_dev_spec_update(struct rbd_device 
*rbd_dev)
const char *snap_name;
int ret;
 
-   /*
-* An image being mapped will have the pool name (etc.), but
-* we need to look up the snapshot id.
-*/
-   if (spec-pool_name) {
-   if (strcmp(spec-snap_name, RBD_SNAP_HEAD_NAME)) {
-   u64 snap_id;
-
-   snap_id = rbd_snap_id_by_name(rbd_dev, spec-snap_name);
-   if (snap_id == CEPH_NOSNAP)
-   return -ENOENT;
-   spec-snap_id = snap_id;
-   } else {
-   spec-snap_id = CEPH_NOSNAP;
-   }
-
-   return 0;
-   }
+   rbd_assert(spec-pool_id != CEPH_NOPOOL);
+   rbd_assert(spec-image_id);
+   rbd_assert(spec-snap_id != CEPH_NOSNAP);
 
/* Get the pool name; we have to make our own copy of this */
 
@@ -4313,7 +4321,7 @@ static int rbd_dev_spec_update(struct rbd_device *rbd_dev)
if (!image_name)
rbd_warn(rbd_dev, unable to get image name);
 
-   /* Look up the snapshot name, and make a copy */
+   /* Fetch the snapshot name */
 
snap_name = rbd_snap_name(rbd_dev, spec-snap_id);
if (IS_ERR(snap_name)) {
@@ -4326,10 +4334,10 @@ static int rbd_dev_spec_update(struct rbd_device 
*rbd_dev)
spec-snap_name = snap_name;
 
return 0;
+
 out_err:
kfree(image_name);
kfree(pool_name);
-
return ret;
 }
 
@@ -5158,7 +5166,16 @@ static int rbd_dev_image_probe(struct rbd_device 
*rbd_dev, bool mapping)
if (ret)
goto err_out_watch;
 
-   ret = rbd_dev_spec_update(rbd_dev);
+   /*
+* If this image is the one being mapped, we have pool name and
+* id, image name and id, and snap name - need to fill snap id.
+* Otherwise this is a parent image, identified by pool, image
+* and snap ids - need to fill in names for those ids.
+*/
+   if (mapping)
+   ret = rbd_spec_fill_snap_id(rbd_dev);
+   else
+   ret = rbd_spec_fill_names(rbd_dev);
if (ret)
goto err_out_probe;
 
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to 

[PATCH 7/8] rbd: do not read in parent info before snap context

2014-07-24 Thread Ilya Dryomov
Currently rbd_dev_v2_header_info() reads in parent info before the snap
context is read in.  This is wrong, because we may need to look at the
the parent_overlap value of the snapshot instead of that of the base
image, for example when mapping a snapshot - see next commit.  (When
mapping a snapshot, all we got is its name and we need the snap context
to translate that name into an id to know which parent info to look
for).

The approach taken here is to make sure rbd_dev_v2_parent_info() is
called after the snap context has been read in.  The other approach
would be to add a parent_overlap field to struct rbd_mapping and
maintain it the same way rbd_mapping::size is maintained.  The reason
I chose the first approach is that the value of keeping around both
base image values and the actual mapping values is unclear to me.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
 drivers/block/rbd.c |   64 ---
 1 file changed, 30 insertions(+), 34 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 16f388f2960b..c4606987e9d1 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -515,6 +515,7 @@ static void rbd_dev_remove_parent(struct rbd_device 
*rbd_dev);
 static int rbd_dev_refresh(struct rbd_device *rbd_dev);
 static int rbd_dev_v2_header_onetime(struct rbd_device *rbd_dev);
 static int rbd_dev_header_info(struct rbd_device *rbd_dev);
+static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev);
 static const char *rbd_dev_v2_snap_name(struct rbd_device *rbd_dev,
u64 snap_id);
 static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id,
@@ -3513,6 +3514,18 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
mapping_size = rbd_dev-mapping.size;
 
ret = rbd_dev_header_info(rbd_dev);
+   if (ret)
+   return ret;
+
+   /*
+* If there is a parent, see if it has disappeared due to the
+* mapped image getting flattened.
+*/
+   if (rbd_dev-parent) {
+   ret = rbd_dev_v2_parent_info(rbd_dev);
+   if (ret)
+   return ret;
+   }
 
if (rbd_dev-spec-snap_id == CEPH_NOSNAP) {
if (rbd_dev-mapping.size != rbd_dev-header.image_size)
@@ -3524,11 +3537,10 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 
up_write(rbd_dev-header_rwsem);
 
-   if (mapping_size != rbd_dev-mapping.size) {
+   if (mapping_size != rbd_dev-mapping.size)
rbd_dev_update_size(rbd_dev);
-   }
 
-   return ret;
+   return 0;
 }
 
 static int rbd_init_disk(struct rbd_device *rbd_dev)
@@ -4477,33 +4489,6 @@ static int rbd_dev_v2_header_info(struct rbd_device 
*rbd_dev)
return ret;
}
 
-   /*
-* If the image supports layering, get the parent info.  We
-* need to probe the first time regardless.  Thereafter we
-* only need to if there's a parent, to see if it has
-* disappeared due to the mapped image getting flattened.
-*/
-   if (rbd_dev-header.features  RBD_FEATURE_LAYERING 
-   (first_time || rbd_dev-parent_spec)) {
-   bool warn;
-
-   ret = rbd_dev_v2_parent_info(rbd_dev);
-   if (ret)
-   return ret;
-
-   /*
-* Print a warning if this is the initial probe and
-* the image has a parent.  Don't print it if the
-* image now being probed is itself a parent.  We
-* can tell at this point because we won't know its
-* pool name yet (just its pool id).
-*/
-   warn = rbd_dev-parent_spec  rbd_dev-spec-pool_name;
-   if (first_time  warn)
-   rbd_warn(rbd_dev, WARNING: kernel layering 
-   is EXPERIMENTAL!);
-   }
-
ret = rbd_dev_v2_snap_context(rbd_dev);
dout(rbd_dev_v2_snap_context returned %d\n, ret);
 
@@ -5183,14 +5168,28 @@ static int rbd_dev_image_probe(struct rbd_device 
*rbd_dev, bool mapping)
if (ret)
goto err_out_probe;
 
+   if (rbd_dev-header.features  RBD_FEATURE_LAYERING) {
+   ret = rbd_dev_v2_parent_info(rbd_dev);
+   if (ret)
+   goto err_out_probe;
+
+   /*
+* Need to warn users if this image is the one being
+* mapped and has a parent.
+*/
+   if (mapping  rbd_dev-parent_spec)
+   rbd_warn(rbd_dev,
+WARNING: kernel layering is EXPERIMENTAL!);
+   }
+
ret = rbd_dev_probe_parent(rbd_dev);
if (ret)
goto err_out_probe;
 
dout(discovered format %u image, header name is %s\n,
rbd_dev-image_format, 

[PATCH 3/8] rbd: remove unnecessary asserts in rbd_dev_image_probe()

2014-07-24 Thread Ilya Dryomov
spec-image_id assert doesn't buy us much and image_format is asserted
in rbd_dev_header_name() and rbd_dev_header_info() anyway.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
 drivers/block/rbd.c |2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 0d3be608f16f..4541f6027e4a 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5143,8 +5143,6 @@ static int rbd_dev_image_probe(struct rbd_device 
*rbd_dev, bool mapping)
ret = rbd_dev_image_id(rbd_dev);
if (ret)
return ret;
-   rbd_assert(rbd_dev-spec-image_id);
-   rbd_assert(rbd_image_format_valid(rbd_dev-image_format));
 
ret = rbd_dev_header_name(rbd_dev);
if (ret)
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Read from clones

2014-07-24 Thread Ilya Dryomov
On Tue, Jul 15, 2014 at 9:44 AM, Lakshminarayana Mavunduri
lakshminarayana.mavund...@sandisk.com wrote:
 Hi,

 Following the below set of steps, we are seeing data loss while reading from 
 clones.

 1)Create an image with image format 2 (in this case we have made the size to 
 be 1024MB).
 rbd create image1 --size 1024 --image-format 2
 2)Map the image and write 1024MB worth of data to it.
 3)Create a snapshot for the image created in step 1)
 rbd snap create image1@snap1
 4)Create a clone for the snapshot created in step 3)
 rbd clone image1@snap1 clone1
 5)Create a snapshot for the clone created in step 4)
 rbd snap create clone1@snap2
 6)Create a clone for the snapshot created in step 5)
 rbd clone clone1@snap2 clone2
 7)Shrink the size of the clone created in step 4) (in this case we have made 
 it to half of its size)
 rbd resize -s 512 --allow-shrink clone1
 8)Map the clone created in step 6) and try reading 1024MB worth of data.
 9)Our observation is that, only the first 512MB worth data is intact, the 
 rest is not copied over. (In fact, it's only the parent overlap worth data of 
 clone1 that is always copied over!)

 After the above set of steps, the parent overlap for clone2 would be 1024MB, 
 whereas the parent overlap for clone1 would be 512MB. Our understanding is, 
 since clone2's parent snapshot is taken before a shrink is performed on 
 clone1, any reads worth parent overlap data on clone2 should be serviced from 
 it's parent (at least, as long as there are no overwrites done on clone2, 
 which is the case here), and we are not finding that to be true in this case.

 To augment our theory, if the parent image (a base RBD image, which is not a 
 clone) is shrinked, any reads on the clones that are created before the 
 shrink, are as expected to our understanding.

 Wanted to check if this is indeed a bug or if we are missing anything here. 
 The tests are run across ceph version 0.80.

FWIW, I just pushed a fix for this - wip-overlap in ceph-client.git.

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Read from clones

2014-07-24 Thread Pavan Rallabhandi
Thanks Ilya!

-Original Message-
From: Ilya Dryomov [mailto:ilya.dryo...@inktank.com]
Sent: Thursday, July 24, 2014 2:27 PM
To: Lakshminarayana Mavunduri
Cc: ceph-devel@vger.kernel.org; Pavan Rallabhandi
Subject: Re: Read from clones

On Tue, Jul 15, 2014 at 9:44 AM, Lakshminarayana Mavunduri 
lakshminarayana.mavund...@sandisk.com wrote:
 Hi,

 Following the below set of steps, we are seeing data loss while reading from 
 clones.

 1)Create an image with image format 2 (in this case we have made the size to 
 be 1024MB).
 rbd create image1 --size 1024 --image-format 2 2)Map the image
 and write 1024MB worth of data to it.
 3)Create a snapshot for the image created in step 1)
 rbd snap create image1@snap1
 4)Create a clone for the snapshot created in step 3)
 rbd clone image1@snap1 clone1
 5)Create a snapshot for the clone created in step 4)
 rbd snap create clone1@snap2
 6)Create a clone for the snapshot created in step 5)
 rbd clone clone1@snap2 clone2
 7)Shrink the size of the clone created in step 4) (in this case we have made 
 it to half of its size)
 rbd resize -s 512 --allow-shrink clone1 8)Map the clone
 created in step 6) and try reading 1024MB worth of data.
 9)Our observation is that, only the first 512MB worth data is intact,
 the rest is not copied over. (In fact, it's only the parent overlap
 worth data of clone1 that is always copied over!)

 After the above set of steps, the parent overlap for clone2 would be 1024MB, 
 whereas the parent overlap for clone1 would be 512MB. Our understanding is, 
 since clone2's parent snapshot is taken before a shrink is performed on 
 clone1, any reads worth parent overlap data on clone2 should be serviced from 
 it's parent (at least, as long as there are no overwrites done on clone2, 
 which is the case here), and we are not finding that to be true in this case.

 To augment our theory, if the parent image (a base RBD image, which is not a 
 clone) is shrinked, any reads on the clones that are created before the 
 shrink, are as expected to our understanding.

 Wanted to check if this is indeed a bug or if we are missing anything here. 
 The tests are run across ceph version 0.80.

FWIW, I just pushed a fix for this - wip-overlap in ceph-client.git.

Thanks,

Ilya



PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).



Re: First attempt at rocksdb monitor store stress testing

2014-07-24 Thread Mark Nelson

Hi Xinxin,

Thanks! I wonder as well if it might be interesting to expose the 
options related to universal compaction?  It looks like rocksdb provides 
a lot of interesting knobs you can adjust!


Mark

On 07/24/2014 12:08 AM, Shu, Xinxin wrote:

Hi mark,

I think this maybe related to 'verify_checksums' config option ,when 
ReadOptions is initialized, default this option is  true , all data read from 
underlying storage will be verified against corresponding checksums,  however,  
this option cannot be configured in wip-rocksdb branch. I will modify code to 
make this option configurable .

Cheers,
xinxin

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Thursday, July 24, 2014 7:14 AM
To: ceph-devel@vger.kernel.org
Subject: First attempt at rocksdb monitor store stress testing

Hi Guys,

So I've been interested lately in leveldb 99th percentile latency (and the 
amount of write amplification we are seeing) with leveldb. Joao mentioned he 
has written a tool called mon-store-stress in wip-leveldb-misc to try to 
provide a means to roughly guess at what's happening on the mons under heavy 
load.  I cherry-picked it over to wip-rocksdb and after a couple of hacks was 
able to get everything built and running with some basic tests.  There was 
little tuning done and I don't know how realistic this workload is, so don't 
assume this means anything yet, but some initial results are here:

http://nhm.ceph.com/mon-store-stress/First%20Attempt.pdf

Command that was used to run the tests:

./ceph-test-mon-store-stress --mon-keyvaluedb leveldb|rocksdb 
--write-min-size 50K --write-max-size 2M --percent-write 70 --percent-read 30 
--keep-state --test-seed 1406137270 --stop-at 5000 foo

The most interesting bit right now is that rocksdb seems to be hanging in the 
middle of the test (left it running for several hours).  CPU usage on one core 
was quite high during the hang.  Profiling using perf with dwarf symbols I see:

-  49.14%  ceph-test-mon-s  ceph-test-mon-store-stress  [.] unsigned int 
rocksdb::crc32c::ExtendImplrocksdb::crc32c::Fast_CRC32(unsigned int, char 
const*, unsigned long)
 - unsigned int
rocksdb::crc32c::ExtendImplrocksdb::crc32c::Fast_CRC32(unsigned int, char 
const*, unsigned long)
  51.70% rocksdb::ReadBlockContents(rocksdb::RandomAccessFile*,
rocksdb::Footer const, rocksdb::ReadOptions const, rocksdb::BlockHandle 
const, rocksdb::BlockContents*, rocksdb::Env*, bool)
  48.30%
rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const, 
rocksdb::CompressionType, rocksdb::BlockHandle*)

Not sure what's going on yet, may need to try to enable logging/debugging in 
rocksdb.  Thoughts/Suggestions welcome. :)

Mark
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in the 
body of a message to majord...@vger.kernel.org More majordomo info at  
http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8] rbd: show the entire chain of parent images

2014-07-24 Thread Alex Elder
On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 Make /sys/bus/rbd/devices/id/parent show the entire chain of parent
 images.  While at it, kernel sprintf() doesn't return negative values,
 casting to unsigned long long is no longer necessary and there is no
 good reason to split into multiple sprintf() calls.

I like the use of a single snprintf() call, it is much less
cumbersome.

The reason I always cast u64 values to (unsigned long long)
in printf() calls is because on some 64-bit architectures,
u64 is defined as simply (unsigned long), so without the
cast they spit out warnings.  I hate this, but it is reality
(see include/asm-generic/{int-l64.h,int-ll64.h}).  You can
alternatively use %PRIu64 rather than %llu in your format
strings, but I think I hate that more.  Anyway, if you want
to avoid warnings on all architectures you should fix that.

As another aside, I've been too timid to use the ?: conditional
expression without its middle operand.  I have no objection
to it at all, I like it.  I bring it up because I recently
got burned for using a gcc feature that wasn't supported
on older compilers (unnamed struct/union fields)--specifically
a version newer than gcc 3.2, which is the minimum supported
version for the kernel (see Documentation/Changes).

But fear not!  That extension is supported in gcc 3.2:

https://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Conditionals.html#Conditionals
Just FYI...

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  Documentation/ABI/testing/sysfs-bus-rbd |4 +--
  drivers/block/rbd.c |   56 
 +--
  2 files changed, 25 insertions(+), 35 deletions(-)
 
 diff --git a/Documentation/ABI/testing/sysfs-bus-rbd 
 b/Documentation/ABI/testing/sysfs-bus-rbd
 index 501adc2a9ec7..2ddd680929d8 100644
 --- a/Documentation/ABI/testing/sysfs-bus-rbd
 +++ b/Documentation/ABI/testing/sysfs-bus-rbd
 @@ -94,5 +94,5 @@ current_snap
  
  parent
  
 - Information identifying the pool, image, and snapshot id for
 - the parent image in a layered rbd image (format 2 only).
 + Information identifying the chain of parent images in a layered rbd
 + image.  Entries are separated by empty lines.
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 703b728e05fa..7847fbb949ff 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -3685,46 +3685,36 @@ static ssize_t rbd_snap_show(struct device *dev,
  }
  
  /*
 - * For an rbd v2 image, shows the pool id, image id, and snapshot id
 - * for the parent image.  If there is no parent, simply shows
 - * (no parent image).
 + * For a v2 image, shows the chain of parent images, separated by empty
 + * lines.  For v1 images or if there is no parent, shows (no parent
 + * image).
   */
  static ssize_t rbd_parent_show(struct device *dev,
 -  struct device_attribute *attr,
 -  char *buf)
 +struct device_attribute *attr,
 +char *buf)
  {
   struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
 - struct rbd_spec *spec = rbd_dev-parent_spec;
 - int count;
 - char *bufp = buf;
 + ssize_t count = 0;
  
 - if (!spec)
 + if (!rbd_dev-parent)
   return sprintf(buf, (no parent image)\n);
  
 - count = sprintf(bufp, pool_id %llu\npool_name %s\n,
 - (unsigned long long) spec-pool_id, spec-pool_name);
 - if (count  0)
 - return count;
 - bufp += count;
 -
 - count = sprintf(bufp, image_id %s\nimage_name %s\n, spec-image_id,
 - spec-image_name ? spec-image_name : (unknown));
 - if (count  0)
 - return count;
 - bufp += count;
 -
 - count = sprintf(bufp, snap_id %llu\nsnap_name %s\n,
 - (unsigned long long) spec-snap_id, spec-snap_name);
 - if (count  0)
 - return count;
 - bufp += count;
 -
 - count = sprintf(bufp, overlap %llu\n, rbd_dev-parent_overlap);
 - if (count  0)
 - return count;
 - bufp += count;
 -
 - return (ssize_t) (bufp - buf);
 + for ( ; rbd_dev-parent; rbd_dev = rbd_dev-parent) {
 + struct rbd_spec *spec = rbd_dev-parent_spec;
 +
 + count += sprintf(buf[count], %s
 + pool_id %llu\npool_name %s\n
 + image_id %s\nimage_name %s\n
 + snap_id %llu\nsnap_name %s\n
 + overlap %llu\n,
 + !count ?  : \n, /* first? */
 + spec-pool_id, spec-pool_name,
 + spec-image_id, spec-image_name ?: (unknown),
 + spec-snap_id, spec-snap_name,
 + rbd_dev-parent_overlap);
 + }
 +
 + return count;
  }
  
  static ssize_t rbd_image_refresh(struct device *dev,
 

--
To unsubscribe from this list: send the line 

Re: [PATCH 2/8] rbd: introduce rbd_dev_header_info()

2014-07-24 Thread Alex Elder
On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 A wrapper around rbd_dev_v{1,2}_header_info() to reduce duplication.

Looks good.

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |   24 ++--
  1 file changed, 14 insertions(+), 10 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 7847fbb949ff..0d3be608f16f 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -514,7 +514,7 @@ static void rbd_dev_remove_parent(struct rbd_device 
 *rbd_dev);
  
  static int rbd_dev_refresh(struct rbd_device *rbd_dev);
  static int rbd_dev_v2_header_onetime(struct rbd_device *rbd_dev);
 -static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev);
 +static int rbd_dev_header_info(struct rbd_device *rbd_dev);
  static const char *rbd_dev_v2_snap_name(struct rbd_device *rbd_dev,
   u64 snap_id);
  static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id,
 @@ -3506,13 +3506,10 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
   u64 mapping_size;
   int ret;
  
 - rbd_assert(rbd_image_format_valid(rbd_dev-image_format));
   down_write(rbd_dev-header_rwsem);
   mapping_size = rbd_dev-mapping.size;
 - if (rbd_dev-image_format == 1)
 - ret = rbd_dev_v1_header_info(rbd_dev);
 - else
 - ret = rbd_dev_v2_header_info(rbd_dev);
 +
 + ret = rbd_dev_header_info(rbd_dev);
  
   /* If it's a mapped snapshot, validate its EXISTS flag */
  
 @@ -4501,6 +4498,16 @@ static int rbd_dev_v2_header_info(struct rbd_device 
 *rbd_dev)
   return ret;
  }
  
 +static int rbd_dev_header_info(struct rbd_device *rbd_dev)
 +{
 + rbd_assert(rbd_image_format_valid(rbd_dev-image_format));
 +
 + if (rbd_dev-image_format == 1)
 + return rbd_dev_v1_header_info(rbd_dev);
 +
 + return rbd_dev_v2_header_info(rbd_dev);
 +}
 +
  static int rbd_bus_add_dev(struct rbd_device *rbd_dev)
  {
   struct device *dev;
 @@ -5149,10 +5156,7 @@ static int rbd_dev_image_probe(struct rbd_device 
 *rbd_dev, bool mapping)
   goto out_header_name;
   }
  
 - if (rbd_dev-image_format == 1)
 - ret = rbd_dev_v1_header_info(rbd_dev);
 - else
 - ret = rbd_dev_v2_header_info(rbd_dev);
 + ret = rbd_dev_header_info(rbd_dev);
   if (ret)
   goto err_out_watch;
  
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/8] rbd: remove unnecessary asserts in rbd_dev_image_probe()

2014-07-24 Thread Alex Elder
On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 spec-image_id assert doesn't buy us much and image_format is asserted
 in rbd_dev_header_name() and rbd_dev_header_info() anyway.

Looks good.  Over time I'm sure there are more assertions
that can go away; they were most useful during rapid development
when confidence in certain things was lower.

Reviewed-by: Alex Elder el...@linaro.org


 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |2 --
  1 file changed, 2 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 0d3be608f16f..4541f6027e4a 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -5143,8 +5143,6 @@ static int rbd_dev_image_probe(struct rbd_device 
 *rbd_dev, bool mapping)
   ret = rbd_dev_image_id(rbd_dev);
   if (ret)
   return ret;
 - rbd_assert(rbd_dev-spec-image_id);
 - rbd_assert(rbd_image_format_valid(rbd_dev-image_format));
  
   ret = rbd_dev_header_name(rbd_dev);
   if (ret)
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8] rbd: show the entire chain of parent images

2014-07-24 Thread Ilya Dryomov
On Thu, Jul 24, 2014 at 4:31 PM, Alex Elder el...@ieee.org wrote:
 On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 Make /sys/bus/rbd/devices/id/parent show the entire chain of parent
 images.  While at it, kernel sprintf() doesn't return negative values,
 casting to unsigned long long is no longer necessary and there is no
 good reason to split into multiple sprintf() calls.

 I like the use of a single snprintf() call, it is much less
 cumbersome.

 The reason I always cast u64 values to (unsigned long long)
 in printf() calls is because on some 64-bit architectures,
 u64 is defined as simply (unsigned long), so without the
 cast they spit out warnings.  I hate this, but it is reality
 (see include/asm-generic/{int-l64.h,int-ll64.h}).  You can
 alternatively use %PRIu64 rather than %llu in your format
 strings, but I think I hate that more.  Anyway, if you want
 to avoid warnings on all architectures you should fix that.

IIRC all architectures nowadays use int-ll64.h.  int-l64.h is used only
in userspace for compatibility.

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/8] rbd: split rbd_dev_spec_update() into two functions

2014-07-24 Thread Alex Elder
On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 rbd_dev_spec_update() has two modes of operation, with nothing in
 common between them.  Split it into two functions, one for each mode
 and make our expectations more clear.
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

Looks good.

Reviewed-by: Alex Elder el...@linaro.org

 ---
  drivers/block/rbd.c |   79 
 +++
  1 file changed, 48 insertions(+), 31 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 4541f6027e4a..23df1773ef77 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -3798,6 +3798,9 @@ static struct rbd_spec *rbd_spec_alloc(void)
   spec = kzalloc(sizeof (*spec), GFP_KERNEL);
   if (!spec)
   return NULL;
 +
 + spec-pool_id = CEPH_NOPOOL;
 + spec-snap_id = CEPH_NOSNAP;
   kref_init(spec-kref);
  
   return spec;
 @@ -4257,18 +4260,38 @@ static u64 rbd_snap_id_by_name(struct rbd_device 
 *rbd_dev, const char *name)
  }
  
  /*
 - * When an rbd image has a parent image, it is identified by the
 - * pool, image, and snapshot ids (not names).  This function fills
 - * in the names for those ids.  (It's OK if we can't figure out the
 - * name for an image id, but the pool and snapshot ids should always
 - * exist and have names.)  All names in an rbd spec are dynamically
 - * allocated.
 + * An image being mapped will have everything but the snap id.
 + */
 +static int rbd_spec_fill_snap_id(struct rbd_device *rbd_dev)
 +{
 + struct rbd_spec *spec = rbd_dev-spec;
 +
 + rbd_assert(spec-pool_id != CEPH_NOPOOL  spec-pool_name);
 + rbd_assert(spec-image_id  spec-image_name);
 + rbd_assert(spec-snap_name);
 +
 + if (strcmp(spec-snap_name, RBD_SNAP_HEAD_NAME)) {
 + u64 snap_id;
 +
 + snap_id = rbd_snap_id_by_name(rbd_dev, spec-snap_name);
 + if (snap_id == CEPH_NOSNAP)
 + return -ENOENT;
 +
 + spec-snap_id = snap_id;
 + } else {
 + spec-snap_id = CEPH_NOSNAP;
 + }
 +
 + return 0;
 +}
 +
 +/*
 + * A parent image will have all ids but none of the names.
   *
 - * When an image being mapped (not a parent) is probed, we have the
 - * pool name and pool id, image name and image id, and the snapshot
 - * name.  The only thing we're missing is the snapshot id.
 + * All names in an rbd spec are dynamically allocated.  It's OK if we
 + * can't figure out the name for an image id.
   */
 -static int rbd_dev_spec_update(struct rbd_device *rbd_dev)
 +static int rbd_spec_fill_names(struct rbd_device *rbd_dev)
  {
   struct ceph_osd_client *osdc = rbd_dev-rbd_client-client-osdc;
   struct rbd_spec *spec = rbd_dev-spec;
 @@ -4277,24 +4300,9 @@ static int rbd_dev_spec_update(struct rbd_device 
 *rbd_dev)
   const char *snap_name;
   int ret;
  
 - /*
 -  * An image being mapped will have the pool name (etc.), but
 -  * we need to look up the snapshot id.
 -  */
 - if (spec-pool_name) {
 - if (strcmp(spec-snap_name, RBD_SNAP_HEAD_NAME)) {
 - u64 snap_id;
 -
 - snap_id = rbd_snap_id_by_name(rbd_dev, spec-snap_name);
 - if (snap_id == CEPH_NOSNAP)
 - return -ENOENT;
 - spec-snap_id = snap_id;
 - } else {
 - spec-snap_id = CEPH_NOSNAP;
 - }
 -
 - return 0;
 - }
 + rbd_assert(spec-pool_id != CEPH_NOPOOL);
 + rbd_assert(spec-image_id);
 + rbd_assert(spec-snap_id != CEPH_NOSNAP);
  
   /* Get the pool name; we have to make our own copy of this */
  
 @@ -4313,7 +4321,7 @@ static int rbd_dev_spec_update(struct rbd_device 
 *rbd_dev)
   if (!image_name)
   rbd_warn(rbd_dev, unable to get image name);
  
 - /* Look up the snapshot name, and make a copy */
 + /* Fetch the snapshot name */
  
   snap_name = rbd_snap_name(rbd_dev, spec-snap_id);
   if (IS_ERR(snap_name)) {
 @@ -4326,10 +4334,10 @@ static int rbd_dev_spec_update(struct rbd_device 
 *rbd_dev)
   spec-snap_name = snap_name;
  
   return 0;
 +
  out_err:
   kfree(image_name);
   kfree(pool_name);
 -
   return ret;
  }
  
 @@ -5158,7 +5166,16 @@ static int rbd_dev_image_probe(struct rbd_device 
 *rbd_dev, bool mapping)
   if (ret)
   goto err_out_watch;
  
 - ret = rbd_dev_spec_update(rbd_dev);
 + /*
 +  * If this image is the one being mapped, we have pool name and
 +  * id, image name and id, and snap name - need to fill snap id.
 +  * Otherwise this is a parent image, identified by pool, image
 +  * and snap ids - need to fill in names for those ids.
 +  */
 + if (mapping)
 + ret = rbd_spec_fill_snap_id(rbd_dev);
 + else
 + ret = rbd_spec_fill_names(rbd_dev);
   if (ret)
   goto err_out_probe;
  
 

--
To 

Re: [PATCH 5/8] rbd: harden rbd_dev_refresh() caller

2014-07-24 Thread Alex Elder
On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 Recently discovered watch/notify problems showed that we really can't
 ignore errors in anything refresh related.  Alas, currently there is
 not much we can do in response to those errors, except print warnings.

Looks good.

Reviewed-by: Alex Elder el...@linaro.org


 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |   17 +
  1 file changed, 13 insertions(+), 4 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 23df1773ef77..5dd6f5057ef4 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -2963,11 +2963,20 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 
 opcode, void *data)
   dout(%s: \%s\ notify_id %llu opcode %u\n, __func__,
   rbd_dev-header_name, (unsigned long long)notify_id,
   (unsigned int)opcode);
 +
 + /*
 +  * Until adequate refresh error handling is in place, there is
 +  * not much we can do here, except warn.
 +  *
 +  * See http://tracker.ceph.com/issues/5040
 +  */
   ret = rbd_dev_refresh(rbd_dev);
   if (ret)
 - rbd_warn(rbd_dev, header refresh error (%d)\n, ret);
 + rbd_warn(rbd_dev, refresh failed: %d\n, ret);
  
 - rbd_obj_notify_ack_sync(rbd_dev, notify_id);
 + ret = rbd_obj_notify_ack_sync(rbd_dev, notify_id);
 + if (ret)
 + rbd_warn(rbd_dev, notify_ack ret %d\n, ret);
  }
  
  /*
 @@ -3724,9 +3733,9 @@ static ssize_t rbd_image_refresh(struct device *dev,
  
   ret = rbd_dev_refresh(rbd_dev);
   if (ret)
 - rbd_warn(rbd_dev, : manual header refresh error (%d)\n, ret);
 + return ret;
  
 - return ret  0 ? ret : size;
 + return size;
  }
  
  static DEVICE_ATTR(size, S_IRUGO, rbd_size_show, NULL);
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/8] rbd: update mapping size only on refresh

2014-07-24 Thread Alex Elder
On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 There is no sense in trying to update the mapping size before it's even
 been set.

It took me a bit to follow this.  But basically there is
no mapping unless it's mapped.  So previously this was
updating the mapping information even for unmapped
parent (or other) images.  There's no need to update
the mapping size for a snapshot--it'll never change.

Is that right?  If not, please advise; otherwise:

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |   19 +++
  1 file changed, 7 insertions(+), 12 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 5dd6f5057ef4..16f388f2960b 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -971,12 +971,6 @@ static int rbd_header_from_disk(struct rbd_device 
 *rbd_dev,
   header-snap_names = snap_names;
   header-snap_sizes = snap_sizes;
  
 - /* Make sure mapping size is consistent with header info */
 -
 - if (rbd_dev-spec-snap_id == CEPH_NOSNAP || first_time)
 - if (rbd_dev-mapping.size != header-image_size)
 - rbd_dev-mapping.size = header-image_size;
 -
   return 0;
  out_2big:
   ret = -EIO;
 @@ -3520,9 +3514,14 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
  
   ret = rbd_dev_header_info(rbd_dev);
  
 - /* If it's a mapped snapshot, validate its EXISTS flag */
 + if (rbd_dev-spec-snap_id == CEPH_NOSNAP) {
 + if (rbd_dev-mapping.size != rbd_dev-header.image_size)
 + rbd_dev-mapping.size = rbd_dev-header.image_size;
 + } else {
 + /* validate mapped snapshot's EXISTS flag */
 + rbd_exists_validate(rbd_dev);
 + }
  
 - rbd_exists_validate(rbd_dev);
   up_write(rbd_dev-header_rwsem);
  
   if (mapping_size != rbd_dev-mapping.size) {
 @@ -4505,10 +4504,6 @@ static int rbd_dev_v2_header_info(struct rbd_device 
 *rbd_dev)
   is EXPERIMENTAL!);
   }
  
 - if (rbd_dev-spec-snap_id == CEPH_NOSNAP)
 - if (rbd_dev-mapping.size != rbd_dev-header.image_size)
 - rbd_dev-mapping.size = rbd_dev-header.image_size;
 -
   ret = rbd_dev_v2_snap_context(rbd_dev);
   dout(rbd_dev_v2_snap_context returned %d\n, ret);
  
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/8] rbd: update mapping size only on refresh

2014-07-24 Thread Ilya Dryomov
On Thu, Jul 24, 2014 at 5:25 PM, Alex Elder el...@ieee.org wrote:
 On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 There is no sense in trying to update the mapping size before it's even
 been set.

 It took me a bit to follow this.  But basically there is
 no mapping unless it's mapped.  So previously this was
 updating the mapping information even for unmapped
 parent (or other) images.  There's no need to update
 the mapping size for a snapshot--it'll never change.

 Is that right?  If not, please advise; otherwise:

No.  Previously it was updating the mapping size both on the inital map
and on refresh (of the base image).  Doing that on the inital map
doesn't make sense: not only struct rbd_mapping fields aren't properly
initialized at that point - rbd_dev_mapping_set() is called much later
in the map sequence, snap_id isn't initialized either (at least in the
format 2 case, I haven't looked too closely at the format 1 case).  And
just in general, trying to update something before it had a chance to
change is bogus..

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] rgw: list all available options during help()

2014-07-24 Thread Abhishek Lekshmanan
Adding the available help arguments from the man page

Fixes: #8112
Signed-off-by: Abhishek Lekshmanan abhishek.lekshma...@gmail.com
---
 src/rgw/rgw_main.cc | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/src/rgw/rgw_main.cc b/src/rgw/rgw_main.cc
index 51cfebe..29d6577 100644
--- a/src/rgw/rgw_main.cc
+++ b/src/rgw/rgw_main.cc
@@ -726,7 +726,13 @@ int usage()
   cerr  options:\n;
   cerr--rgw-region=region region in which radosgw runs\n;
   cerr--rgw-zone=zone zone in which radosgw runs\n;
+  cerr--rgw-socket-path=path  specify a unix domain socket path\n;
   generic_server_usage();
+  cerr-m monaddress[:port]  connect to specified monitor\n;
+  cerr--keyring=path  path to radosgw keyring\n;
+  cerr--logfile=logfile   file to log debug output\n;
+  cerr--debug-rgw=log-level/memory-level  Set radosgw debug 
level\n;
+
   return 0;
 }
 
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] rgw: format help options to align with the rest

2014-07-24 Thread Abhishek Lekshmanan
Whitespace removal to make all help options align in a similar fashion

Signed-off-by: Abhishek Lekshmanan abhishek.lekshma...@gmail.com
---
 src/rgw/rgw_main.cc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/rgw/rgw_main.cc b/src/rgw/rgw_main.cc
index f6257d7..51cfebe 100644
--- a/src/rgw/rgw_main.cc
+++ b/src/rgw/rgw_main.cc
@@ -724,8 +724,8 @@ int usage()
 {
   cerr  usage: radosgw [options...]  std::endl;
   cerr  options:\n;
-  cerr --rgw-region=region region in which radosgw runs\n;
-  cerr --rgw-zone=zone zone in which radosgw runs\n;
+  cerr--rgw-region=region region in which radosgw runs\n;
+  cerr--rgw-zone=zone zone in which radosgw runs\n;
   generic_server_usage();
   return 0;
 }
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] WIP-8112 radosgw usage man update

2014-07-24 Thread Abhishek Lekshmanan
Hi,

This is a series of patches to update radosgw usage and man pages with
all the available help options. 1/3 is a minor format fix to display
the options in the same line.

Abhishek Lekshmanan (3):
  rgw: format help options to align with the rest
  rgw: list all available options during help()
  doc: update radosgw man page with available opts

 doc/man/8/radosgw.rst | 28 
 src/rgw/rgw_main.cc   | 10 --
 2 files changed, 36 insertions(+), 2 deletions(-)

-- 
1.9.1

Thanks,
 Abhishek
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] doc: update radosgw man page with available opts

2014-07-24 Thread Abhishek Lekshmanan
Fixes:#8112
Signed-off-by: Abhishek Lekshmanan abhishek.lekshma...@gmail.com
---
 doc/man/8/radosgw.rst | 28 
 1 file changed, 28 insertions(+)

diff --git a/doc/man/8/radosgw.rst b/doc/man/8/radosgw.rst
index 5e5e3c0..7aa9655 100644
--- a/doc/man/8/radosgw.rst
+++ b/doc/man/8/radosgw.rst
@@ -32,10 +32,38 @@ Options
Connect to specified monitor (instead of looking through
``ceph.conf``).
 
+.. option:: -i ID, --id ID
+
+   Set the ID portion of name for radosgw
+
+.. option:: -n TYPE.ID, --name TYPE.ID
+
+   Set the name for rados gateway (eg. client.radosgw.gateway)
+
+.. option:: --cluster NAME
+
+   Set the cluster name (default: ceph)
+
+.. option:: -d
+
+   Run in foreground, log to stderr
+
+.. option:: -f
+
+   Run in foreground, log to usual location
+
 .. option:: --rgw-socket-path=path
 
Specify a unix domain socket path.
 
+.. option:: --rgw-region=region
+
+   The region where radosgw runs
+
+.. option:: --rgw-zone=zone
+
+   The zone where radosgw runs
+
 
 Configuration
 =
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: help: how to get ceph daemon debug info from ceph-rest-api?

2014-07-24 Thread David Moreau Simard
Hi Zhu Qiang,

Unless I am mistaken, it looks like the command “dump_ops_in_flight” is only 
available as an admin-socket command:
https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L1371-L1373

The commands available to ceph-rest-api are, to my knowledge, usually declared 
like so: https://github.com/ceph/ceph/blob/master/src/mon/MonCommands.h

I put the ceph-devel mailing list in copy, perhaps someone else knows if there 
is another way to retrieve that information.

--
David Moreau Simard

On Jul 24, 2014, at 4:48 AM, zhu qiang zhu_qiang...@foxmail.com wrote:

 Hello Mr. Simard
 
 
 I want to use ceph-rest-api to view some debug details from ceph daemons.
 On linux shell I can get this message from below:
 # ceph daemon osd.0 dump_ops_in_flight | python -m json.tool
   { num_ops: 0,
   ops: []}
 
 This is my question:
 Can I get this output from ceph-rest-api ?
 
 Until now I tried some method : curl, python-cephclient ,but I did not get 
 the right respone.
 
 Best regarding.
 
 
 

N�r��yb�X��ǧv�^�)޺{.n�+���z�]z���{ay�ʇڙ�,j��f���h���z��w���
���j:+v���w�j�mzZ+�ݢj��!�i

Re: [PATCH 6/8] rbd: update mapping size only on refresh

2014-07-24 Thread Ilya Dryomov
On Thu, Jul 24, 2014 at 5:46 PM, Ilya Dryomov ilya.dryo...@inktank.com wrote:
 On Thu, Jul 24, 2014 at 5:25 PM, Alex Elder el...@ieee.org wrote:
 On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 There is no sense in trying to update the mapping size before it's even
 been set.

 It took me a bit to follow this.  But basically there is
 no mapping unless it's mapped.  So previously this was
 updating the mapping information even for unmapped
 parent (or other) images.  There's no need to update
 the mapping size for a snapshot--it'll never change.

 Is that right?  If not, please advise; otherwise:

 No.  Previously it was updating the mapping size both on the inital map
 and on refresh (of the base image).  Doing that on the inital map
 doesn't make sense: not only struct rbd_mapping fields aren't properly
 initialized at that point - rbd_dev_mapping_set() is called much later
 in the map sequence, snap_id isn't initialized either (at least in the
 format 2 case, I haven't looked too closely at the format 1 case).  And
 just in general, trying to update something before it had a chance to
 change is bogus..

BTW, while we are on the subject of struct rbd_mapping, is there any
particular reason to keep around both the base image values and actual
mapping values?  I am tempted to change the mapping sequence so that 1)
snap context is read in immediately after watch setup, before anything
else is done, 2) supplied snap name is resolved into an id and 3) the
actual (i.e. based on snap_id) mapping size, features, parent_overlap,
etc are read in directly into struct rbd_image_header.  That would
allow to rip struct rbd_mapping entirely and make the code more clear.

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/8] rbd: update mapping size only on refresh

2014-07-24 Thread Alex Elder
On 07/24/2014 08:46 AM, Ilya Dryomov wrote:
 On Thu, Jul 24, 2014 at 5:25 PM, Alex Elder el...@ieee.org wrote:
 On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 There is no sense in trying to update the mapping size before it's even
 been set.

 It took me a bit to follow this.  But basically there is
 no mapping unless it's mapped.  So previously this was
 updating the mapping information even for unmapped
 parent (or other) images.  There's no need to update
 the mapping size for a snapshot--it'll never change.

 Is that right?  If not, please advise; otherwise:
 
 No.  Previously it was updating the mapping size both on the inital map
 and on refresh (of the base image).  Doing that on the inital map
 doesn't make sense: not only struct rbd_mapping fields aren't properly
 initialized at that point - rbd_dev_mapping_set() is called much later
 in the map sequence, snap_id isn't initialized either (at least in the
 format 2 case, I haven't looked too closely at the format 1 case).  And
 just in general, trying to update something before it had a chance to
 change is bogus..

OK, now I see it.  Hopefully this makes sense.  Here is how it
was previously structured:

  rbd_add()
do_rbd_add()
 |rbd_dev_image_probe()
 |  rbd_dev_header_info()
 |rbd_dev_v1_header_info()orrbd_dev_v2_header_info()
 |  rbd_header_from_disk()update mapping
 |update mapping
 |  . . .
 |rbd_dev_device_setup()
 |  rbd_dev_mapping_set()

So neither rbd_header_from_disk() nor rbd_dev_v2_header_info()
should be using the values in the rbd_dev-mapping field during
initial image probe, because they have not yet been initialized
at that point.

Meanwhile, the only reason we need to *update* a mapping size
is if we learn it may have changed, which will be covered by a
refresh, so doing so in rbd_dev_refresh() makes sense.  And
as long as you know whether it's mapping the base image you
might as well do the rbd_exists_validate() call conditionally.

OK, this all looks good and makes good sense to me.
Thank you for explaining it.

Reviewed-by: Alex Elder el...@linaro.org


 
 Thanks,
 
 Ilya
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


sizeof (struct tYpO *) : it is just a typo or rather a bug ?

2014-07-24 Thread Toralf Förster
Inspired by this typo fix 
http://article.gmane.org/gmane.linux.kernel/1754640
I grep'ed the current git tree of linus for similar issues.

For these 4 places I'm wondering where the appropriate struct definition is 
located :

arch/ia64/sn/kernel/io_acpi_init.c: sizeof(struct pci_devdev_info *)) {
tools/perf/builtin-sched.c: sched-tasks = realloc(sched-tasks, 
sched-nr_tasks * sizeof(struct task_task *));
fs/ceph/xattr.c:xattrs = kcalloc(numattr, sizeof(struct 
ceph_xattr *),
fs/ceph/xattr.c:memset(xattrs, 0, numattr*sizeof(struct 
ceph_xattr *));

Nevertheless I was told, that gcc doesn't complain about such things due to 
eventually evaluating it to sizeof(null). I'm however curious if at least a 
warning should be emitted in such a case, or?

-- 
Toralf

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sizeof (struct tYpO *) : it is just a typo or rather a bug ?

2014-07-24 Thread Ilya Dryomov
On Thu, Jul 24, 2014 at 10:20 PM, Toralf Förster toralf.foers...@gmx.de wrote:
 Inspired by this typo fix
 http://article.gmane.org/gmane.linux.kernel/1754640
 I grep'ed the current git tree of linus for similar issues.

 For these 4 places I'm wondering where the appropriate struct definition is 
 located :

 arch/ia64/sn/kernel/io_acpi_init.c: sizeof(struct pci_devdev_info *)) 
 {
 tools/perf/builtin-sched.c: sched-tasks = realloc(sched-tasks, 
 sched-nr_tasks * sizeof(struct task_task *));
 fs/ceph/xattr.c:xattrs = kcalloc(numattr, sizeof(struct 
 ceph_xattr *),
 fs/ceph/xattr.c:memset(xattrs, 0, numattr*sizeof(struct 
 ceph_xattr *));

Heh, the ceph one is a five year old typo..  Looks like it should be
struct ceph_inode_xattr, I'll fix it up.  I'm curious though, how did
you grep for these?

Thanks,

Ilya
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sizeof (struct tYpO *) : it is just a typo or rather a bug ?

2014-07-24 Thread Toralf Förster
On 07/24/2014 08:33 PM, Ilya Dryomov wrote:
 On Thu, Jul 24, 2014 at 10:20 PM, Toralf Förster toralf.foers...@gmx.de 
 wrote:
 Inspired by this typo fix
 http://article.gmane.org/gmane.linux.kernel/1754640
 I grep'ed the current git tree of linus for similar issues.

 For these 4 places I'm wondering where the appropriate struct definition is 
 located :

 arch/ia64/sn/kernel/io_acpi_init.c: sizeof(struct pci_devdev_info 
 *)) {
 tools/perf/builtin-sched.c: sched-tasks = realloc(sched-tasks, 
 sched-nr_tasks * sizeof(struct task_task *));
 fs/ceph/xattr.c:xattrs = kcalloc(numattr, sizeof(struct 
 ceph_xattr *),
 fs/ceph/xattr.c:memset(xattrs, 0, numattr*sizeof(struct 
 ceph_xattr *));
 
 Heh, the ceph one is a five year old typo..  Looks like it should be
 struct ceph_inode_xattr, I'll fix it up.  I'm curious though, how did
 you grep for these?
 
 Thanks,
 
 Ilya
 

1:
grep -Hr sizeof[ *(]struct .* \*.)  | cut -f2- -d':' | tee ~/tmp/out

2:
cat ~/tmp/out | perl -wane 'chomp(); my ($left, $right) = split 
(/sizeof\(/); print $right, \n;' | cut -f2 -d' ' | sort -u | cut -f1 -d')' | 
grep -v '^+' | while read i; do echo $i; git grep -q struct $i { || echo 
error; echo; done

3:
ignore false positives

-- 
Toralf

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] rbd: take snap_id into account when reading in parent info

2014-07-24 Thread Alex Elder
On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 If we are mapping a snapshot, we must read in the parent_overlap value
 of that snapshot instead of that of the base image.  Not doing so may
 in particular result in us returning zeros instead of user data:
 
 # cat overlap-snap.sh
 #!/bin/bash
 rbd create --size 10 --image-format 2 foo
 FOO_DEV=$(rbd map foo)
 dd if=/dev/urandom of=/dev/rbd0 bs=1M /dev/null
 echo Base image
 dd if=$FOO_DEV bs=1 count=16 skip=$(((4  20) - 8)) 2/dev/null | xxd
 rbd snap create foo@snap
 rbd snap protect foo@snap
 rbd clone foo@snap bar
 rbd snap create bar@snap
 BAR_DEV=$(rbd map bar@snap)
 echo Snapshot
 dd if=$BAR_DEV bs=1 count=16 skip=$(((4  20) - 8)) 2/dev/null | xxd
 rbd resize --allow-shrink --size 4 bar
 echo Snapshot after base image resize
 dd if=$BAR_DEV bs=1 count=16 skip=$(((4  20) - 8)) 2/dev/null | xxd
 
 # ./overlap-snap.sh
 Base image
 000: e781 e33b d34b 2225 6034 2845 a2e3 36ed  ...;.K%`4(E..6.
 Snapshot
 000: e781 e33b d34b 2225 6034 2845 a2e3 36ed  ...;.K%`4(E..6.
 Resizing image: 100% complete...done.
 Snapshot after base image resize
 000: e781 e33b d34b 2225      ...;.K%
 
 Even though bar@snap was taken with the old bar parent_overlap (8M),
 reads from bar@snap beyond the new bar parent_overlap (4M) return
 zeroes.  Fix it.
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index c4606987e9d1..cbc89fa9a677 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -4020,7 +4020,7 @@ static int rbd_dev_v2_parent_info(struct rbd_device 
 *rbd_dev)
   goto out_err;
   }
  
 - snapid = cpu_to_le64(CEPH_NOSNAP);
 + snapid = cpu_to_le64(rbd_dev-spec-snap_id);

Well that's just an outright bug.  It's been there since the
original commit that added parent support:
86b00e0 rbd: get parent spec for version 2 images
Parent images *must* be snapshots, so this was never
right.

I bet that was hard to figure out...

Looks good.

Reviewed-by: Alex Elder el...@linaro.org


   ret = rbd_obj_method_sync(rbd_dev, rbd_dev-header_name,
   rbd, get_parent,
   snapid, sizeof (snapid),
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] WIP-8112 radosgw usage man update

2014-07-24 Thread Yehuda Sadeh
Applied (with very slight modifications).

Thanks,
Yehuda

On Thu, Jul 24, 2014 at 8:00 AM, Abhishek Lekshmanan
abhishek.lekshma...@gmail.com wrote:
 Hi,

 This is a series of patches to update radosgw usage and man pages with
 all the available help options. 1/3 is a minor format fix to display
 the options in the same line.

 Abhishek Lekshmanan (3):
   rgw: format help options to align with the rest
   rgw: list all available options during help()
   doc: update radosgw man page with available opts

  doc/man/8/radosgw.rst | 28 
  src/rgw/rgw_main.cc   | 10 --
  2 files changed, 36 insertions(+), 2 deletions(-)

 --
 1.9.1

 Thanks,
  Abhishek
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: First attempt at rocksdb monitor store stress testing

2014-07-24 Thread Mark Nelson
Earlier today I modified the rocksdb options so I could enable universal 
compaction.  Over all performance is lower but I don't see the 
hang/stall in the middle of the test either.  Instead the disk is 
basically pegged with 100% writes.  I suspect average latency is higher 
than leveldb, but the highest latency is about 5-6s while we were seeing 
30s spikes for leveldb with levelled (heh) compaction.


I haven't done much tuning either way yet.  It may be that if we keep 
level 0 and level 1 roughly the same size we can reduce stalls in the 
levelled setups.  It will also be interesting to see what happens in 
these tests on SSDs.


Mark

On 07/24/2014 06:13 AM, Mark Nelson wrote:

Hi Xinxin,

Thanks! I wonder as well if it might be interesting to expose the
options related to universal compaction?  It looks like rocksdb provides
a lot of interesting knobs you can adjust!

Mark

On 07/24/2014 12:08 AM, Shu, Xinxin wrote:

Hi mark,

I think this maybe related to 'verify_checksums' config option ,when
ReadOptions is initialized, default this option is  true , all data
read from underlying storage will be verified against corresponding
checksums,  however,  this option cannot be configured in wip-rocksdb
branch. I will modify code to make this option configurable .

Cheers,
xinxin

-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Thursday, July 24, 2014 7:14 AM
To: ceph-devel@vger.kernel.org
Subject: First attempt at rocksdb monitor store stress testing

Hi Guys,

So I've been interested lately in leveldb 99th percentile latency (and
the amount of write amplification we are seeing) with leveldb. Joao
mentioned he has written a tool called mon-store-stress in
wip-leveldb-misc to try to provide a means to roughly guess at what's
happening on the mons under heavy load.  I cherry-picked it over to
wip-rocksdb and after a couple of hacks was able to get everything
built and running with some basic tests.  There was little tuning done
and I don't know how realistic this workload is, so don't assume this
means anything yet, but some initial results are here:

http://nhm.ceph.com/mon-store-stress/First%20Attempt.pdf

Command that was used to run the tests:

./ceph-test-mon-store-stress --mon-keyvaluedb leveldb|rocksdb
--write-min-size 50K --write-max-size 2M --percent-write 70
--percent-read 30 --keep-state --test-seed 1406137270 --stop-at 5000 foo

The most interesting bit right now is that rocksdb seems to be hanging
in the middle of the test (left it running for several hours).  CPU
usage on one core was quite high during the hang.  Profiling using
perf with dwarf symbols I see:

-  49.14%  ceph-test-mon-s  ceph-test-mon-store-stress  [.] unsigned
int rocksdb::crc32c::ExtendImplrocksdb::crc32c::Fast_CRC32(unsigned
int, char const*, unsigned long)
 - unsigned int
rocksdb::crc32c::ExtendImplrocksdb::crc32c::Fast_CRC32(unsigned
int, char const*, unsigned long)
  51.70% rocksdb::ReadBlockContents(rocksdb::RandomAccessFile*,
rocksdb::Footer const, rocksdb::ReadOptions const,
rocksdb::BlockHandle const, rocksdb::BlockContents*, rocksdb::Env*,
bool)
  48.30%
rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const,
rocksdb::CompressionType, rocksdb::BlockHandle*)

Not sure what's going on yet, may need to try to enable
logging/debugging in rocksdb.  Thoughts/Suggestions welcome. :)

Mark
--
To unsubscribe from this list: send the line unsubscribe ceph-devel
in the body of a message to majord...@vger.kernel.org More majordomo
info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: First attempt at rocksdb monitor store stress testing

2014-07-24 Thread Shu, Xinxin
Hi mark, 

I am looking forward to your results on SSDs .
rocksdb generates a crc of data to be written , this cannot be switch off (but 
can be subsititued with xxhash),  there are two options , Option. 
verify_checksums_in_compaction and ReadOptions. verify_checksums,  If we 
disable these two options , i think cpu usage will goes down . If we use 
universal compaction , this is not friendly with read operation.

Btw , can you list your rocksdb configuration?

Cheers,
xinxin

-Original Message-
From: Mark Nelson [mailto:mark.nel...@inktank.com] 
Sent: Friday, July 25, 2014 7:45 AM
To: Shu, Xinxin; ceph-devel@vger.kernel.org
Subject: Re: First attempt at rocksdb monitor store stress testing

Earlier today I modified the rocksdb options so I could enable universal 
compaction.  Over all performance is lower but I don't see the hang/stall in 
the middle of the test either.  Instead the disk is basically pegged with 100% 
writes.  I suspect average latency is higher than leveldb, but the highest 
latency is about 5-6s while we were seeing 30s spikes for leveldb with levelled 
(heh) compaction.

I haven't done much tuning either way yet.  It may be that if we keep level 0 
and level 1 roughly the same size we can reduce stalls in the levelled setups.  
It will also be interesting to see what happens in these tests on SSDs.

Mark

On 07/24/2014 06:13 AM, Mark Nelson wrote:
 Hi Xinxin,

 Thanks! I wonder as well if it might be interesting to expose the 
 options related to universal compaction?  It looks like rocksdb 
 provides a lot of interesting knobs you can adjust!

 Mark

 On 07/24/2014 12:08 AM, Shu, Xinxin wrote:
 Hi mark,

 I think this maybe related to 'verify_checksums' config option ,when 
 ReadOptions is initialized, default this option is  true , all data 
 read from underlying storage will be verified against corresponding 
 checksums,  however,  this option cannot be configured in wip-rocksdb 
 branch. I will modify code to make this option configurable .

 Cheers,
 xinxin

 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
 Sent: Thursday, July 24, 2014 7:14 AM
 To: ceph-devel@vger.kernel.org
 Subject: First attempt at rocksdb monitor store stress testing

 Hi Guys,

 So I've been interested lately in leveldb 99th percentile latency 
 (and the amount of write amplification we are seeing) with leveldb. 
 Joao mentioned he has written a tool called mon-store-stress in 
 wip-leveldb-misc to try to provide a means to roughly guess at what's 
 happening on the mons under heavy load.  I cherry-picked it over to 
 wip-rocksdb and after a couple of hacks was able to get everything 
 built and running with some basic tests.  There was little tuning 
 done and I don't know how realistic this workload is, so don't assume 
 this means anything yet, but some initial results are here:

 http://nhm.ceph.com/mon-store-stress/First%20Attempt.pdf

 Command that was used to run the tests:

 ./ceph-test-mon-store-stress --mon-keyvaluedb leveldb|rocksdb 
 --write-min-size 50K --write-max-size 2M --percent-write 70 
 --percent-read 30 --keep-state --test-seed 1406137270 --stop-at 5000 
 foo

 The most interesting bit right now is that rocksdb seems to be 
 hanging in the middle of the test (left it running for several 
 hours).  CPU usage on one core was quite high during the hang.  
 Profiling using perf with dwarf symbols I see:

 -  49.14%  ceph-test-mon-s  ceph-test-mon-store-stress  [.] unsigned 
 int 
 rocksdb::crc32c::ExtendImplrocksdb::crc32c::Fast_CRC32(unsigned
 int, char const*, unsigned long)
  - unsigned int
 rocksdb::crc32c::ExtendImplrocksdb::crc32c::Fast_CRC32(unsigned
 int, char const*, unsigned long)
   51.70% 
 rocksdb::ReadBlockContents(rocksdb::RandomAccessFile*,
 rocksdb::Footer const, rocksdb::ReadOptions const, 
 rocksdb::BlockHandle const, rocksdb::BlockContents*, rocksdb::Env*,
 bool)
   48.30%
 rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const, 
 rocksdb::CompressionType, rocksdb::BlockHandle*)

 Not sure what's going on yet, may need to try to enable 
 logging/debugging in rocksdb.  Thoughts/Suggestions welcome. :)

 Mark
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel
 in the body of a message to majord...@vger.kernel.org More majordomo 
 info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html