Upgrade ceph to 0.45 version

2012-04-18 Thread Madhusudhana U
Hi all,
I am currently running Ceph v0.41 in my cluster and I would like to
upgrade it to v0.45. Can someone shed light on the procedure for
upgrading the cluster with/without destroying data?

Thanks
__Madhusudhana



re: ceph: move encode_fh to new API

2012-04-18 Thread Dan Carpenter
Hello Sage Weil,

This is a semi-automatic email about new static checker warnings.

The patch f59919a07e03: "ceph: move encode_fh to new API" from Apr 5, 
2012, leads to the following Smatch complaint:

fs/ceph/export.c:85 ceph_encode_fh()
 error: we previously assumed 'dentry' could be null (see line 67)

fs/ceph/export.c
66  /* if we found an alias, generate a connectable fh */
67  if (*max_len >= connected_handle_length && dentry) {
   ^^
New check.

68  dout("encode_fh %p connectable\n", dentry);
69  spin_lock(&dentry->d_lock);
70  parent = dentry->d_parent;
71  cfh->ino = ceph_ino(inode);
72  cfh->parent_ino = ceph_ino(parent->d_inode);
73  cfh->parent_name_hash = 
ceph_dentry_hash(parent->d_inode,
74   dentry);
75  *max_len = connected_handle_length;
76  type = 2;
77  spin_unlock(&dentry->d_lock);
78  } else if (*max_len >= handle_length) {
79  if (parent_inode) {
80  /* nfsd wants connectable */
81  *max_len = connected_handle_length;
82  type = 255;
83  } else {
84  dout("encode_fh %p\n", dentry);
85  fh->ino = ceph_ino(dentry->d_inode);
   ^^^
Old dereference.

86  *max_len = handle_length;
87  type = 1;

These emails really are mostly automated...  So if it's a false positive
then I blame the script.  Hope it's not too much spam.

regards,
dan carpenter



Re: wip-librbd-caching

2012-04-18 Thread Martin Mailand

On 12.04.2012 21:45, Sage Weil wrote:

The config options you'll want to look at are client_oc_* (in case you
didn't see that already :).  "oc" is short for objectcacher, and it isn't
only used for client (libcephfs), so it might be worth renaming these
options before people start using them.


Hi,

I changed the values and the performance is still very good and the 
memory footprint is much smaller.


OPTION(client_oc_size, OPT_INT, 1024*1024* 50)         // MB * n
OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 25)    // MB * n  (dirty OR tx.. bigish)
OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8)  // target dirty (keep this smallish)

// note: the max amount of "in flight" dirty data is roughly (max - target)

But I am not quite sure about the meaning of these values.
client_oc_size: max size of the cache?
client_oc_max_dirty: max dirty value before writeback starts?
client_oc_target_dirty: ???


-martin


rbd snapshot in qemu and libvirt

2012-04-18 Thread Martin Mailand

Hi List,

Does anyone know the current status of the rbd snapshot feature
integration into qemu and libvirt?


-martin


Re: rbd snapshot in qemu and libvirt

2012-04-18 Thread Wido den Hollander

Hi,

On 04/18/2012 03:08 PM, Martin Mailand wrote:

Hi List,

does anyone know the actual progress of the rbd snapshot feature
integration into qemu and libvirt?


I tested this about a year ago and that worked fine.

Anything in particular you are looking for?

Wido



-martin


Re: rbd snapshot in qemu and libvirt

2012-04-18 Thread Martin Mailand

Hi Wido,

I am looking to handle the snapshots via libvirt: create, delete,
rollback and list.


-martin

On 18.04.2012 15:10, Wido den Hollander wrote:

I tested this about a year ago and that worked fine.

Anything in particular you are looking for?




Re: [PATCH 2/3] ceph: fix bounds check macros ceph_decode_need and ceph_encode_need

2012-04-18 Thread Alex Elder

On 12/14/2011 02:24 PM, Xi Wang wrote:

Given a large n, the bounds check (*p + n > end) can be bypassed due to
pointer wraparound.  A safer check is (n > end - *p).

Signed-off-by: Xi Wang


I noticed this proposed change never got committed.

It looks good, but I don't like the name "ceph_need()".

I am planning to pull this in soon, modified like this:

static inline int ceph_need_ok(void **p, void *end, size_t n)
{
   return end >= *p && n <= end - *p;
}

And then used like this:

   if (!likely(ceph_need_ok(p, end, n)))

If you have an objection to that, please say so soon
(and if you have no objection, please ACK).

Reviewed-by: Alex Elder 


---
  include/linux/ceph/decode.h |9 +++--
  1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/ceph/decode.h b/include/linux/ceph/decode.h
index c5b6939..ea6db7b 100644
--- a/include/linux/ceph/decode.h
+++ b/include/linux/ceph/decode.h
@@ -12,6 +12,11 @@
 *   void *end      pointer to end of buffer (last byte + 1)
   */

+static inline int ceph_need(void **p, void *end, size_t n)
+{
+   return ((end < *p) || (n > end - *p));
+}
+
  static inline u64 ceph_decode_64(void **p)
  {
u64 v = get_unaligned_le64(*p);
@@ -47,7 +52,7 @@ static inline void ceph_decode_copy(void **p, void *pv, 
size_t n)
   */
  #define ceph_decode_need(p, end, n, bad)  \
do {\
-   if (unlikely(*(p) + (n) > (end)))   \
+   if (unlikely(ceph_need(p, end, n))) \
goto bad;   \
} while (0)

@@ -166,7 +171,7 @@ static inline void ceph_encode_string(void **p, void *end,

  #define ceph_encode_need(p, end, n, bad)  \
do {\
-   if (unlikely(*(p) + (n) > (end)))   \
+   if (unlikely(ceph_need(p, end, n))) \
goto bad;   \
} while (0)
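
To make the failure mode concrete, here is a minimal user-space sketch
(hypothetical names, not from the kernel tree; the pointer math is modeled
with uintptr_t so the wraparound is well-defined in the demo):

#include <stdio.h>
#include <stdint.h>

/* Old check: *p + n > end.  For a huge n the addition wraps past the
 * top of the address space, yielding a "small" value that compares
 * <= end, so the bad length is silently accepted. */
static int old_rejects(const char *p, const char *end, size_t n)
{
	return (uintptr_t)p + n > (uintptr_t)end;  /* wraps for huge n */
}

/* New check: compare n against the space actually remaining.
 * end - p cannot overflow once end >= p has been established. */
static int new_rejects(const char *p, const char *end, size_t n)
{
	return end < p || n > (size_t)(end - p);
}

int main(void)
{
	char buf[16];
	const char *p = buf, *end = buf + sizeof(buf);
	size_t huge = SIZE_MAX - 8;  /* e.g. attacker-supplied length */

	printf("old check rejects huge n: %d\n", old_rejects(p, end, huge));
	printf("new check rejects huge n: %d\n", new_rejects(p, end, huge));
	return 0;
}

On typical platforms the old form prints 0 (the bogus length slips
through) while the end-relative form prints 1.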





Re: [patch] rbd: use gfp_flags parameter in rbd_header_from_disk()

2012-04-18 Thread Alex Elder

On 03/22/2012 01:33 AM, Dan Carpenter wrote:

We should use the gfp_flags that the caller specified instead of
GFP_KERNEL here.

There is only one caller and it uses GFP_KERNEL, so this change is just
a cleanup and doesn't change how the code works.

Signed-off-by: Dan Carpenter


This looks good.  Sorry it didn't get committed earlier.

Reviewed-by: Alex Elder 


diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index a6278e7..fc9341f 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -474,11 +474,11 @@ static int rbd_header_from_disk(struct rbd_image_header 
*header,
return -ENOMEM;
if (snap_count) {
header->snap_names = kmalloc(header->snap_names_len,
-GFP_KERNEL);
+gfp_flags);
if (!header->snap_names)
goto err_snapc;
header->snap_sizes = kmalloc(snap_count * sizeof(u64),
-GFP_KERNEL);
+gfp_flags);
if (!header->snap_sizes)
goto err_names;
} else {


Re: [PATCH] rbd: fix integer overflow in rbd_header_from_disk()

2012-04-18 Thread Alex Elder

On 04/09/2012 04:52 PM, Xi Wang wrote:

ondisk->snap_count is read from disk via rbd_req_sync_read() and thus
needs validation.  Otherwise, a bogus `snap_count' could overflow the
kmalloc() size, leading to memory corruption.

Also use `u32' consistently for `snap_count'.


This looks good, however I have changed it to use UINT_MAX
rather than ULONG_MAX, because on some architectures size_t
(__kernel_size_t) is defined as type (unsigned int).  It
is the more conservative value, and even on architectures
where __BITS_PER_LONG is 64, it still offers a sane upper
bound on the number of snapshots for an rbd device.

Reviewed-by: Alex Elder 


Signed-off-by: Xi Wang
---
  drivers/block/rbd.c |   12 +++-
  1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 013c7a5..d47f7e6 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -487,18 +487,20 @@ static void rbd_coll_release(struct kref *kref)
   */
  static int rbd_header_from_disk(struct rbd_image_header *header,
 struct rbd_image_header_ondisk *ondisk,
-int allocated_snaps,
+u32 allocated_snaps,
 gfp_t gfp_flags)
  {
-   int i;
-   u32 snap_count;
+   u32 i, snap_count;

if (memcmp(ondisk, RBD_HEADER_TEXT, sizeof(RBD_HEADER_TEXT)))
return -ENXIO;

snap_count = le32_to_cpu(ondisk->snap_count);
+   if (snap_count > (ULONG_MAX - sizeof(struct ceph_snap_context))
+/ sizeof(*ondisk))
+   return -EINVAL;
header->snapc = kmalloc(sizeof(struct ceph_snap_context) +
-   snap_count * sizeof (*ondisk),
+   snap_count * sizeof(*ondisk),
gfp_flags);
if (!header->snapc)
return -ENOMEM;
@@ -1592,7 +1594,7 @@ static int rbd_read_header(struct rbd_device *rbd_dev,
  {
ssize_t rc;
struct rbd_image_header_ondisk *dh;
-   int snap_count = 0;
+   u32 snap_count = 0;
u64 ver;
size_t len;





[PATCH 0/6] rbd: updates from Josh Durgin

2012-04-18 Thread Alex Elder

I am posting this series on behalf of Josh Durgin, who originally
created these at the end of last year.  I've updated them to apply
to the current rbd code.

The first two fix bugs, the last is a documentation fix, and the
others are cleanups.

-Alex


Re: rbd snapshot in qemu and libvirt

2012-04-18 Thread Andrey Korolyov
I tested all of them about a week ago and everything works fine. It
would also be very nice if rbd could list the actual allocated size of
every image or snapshot in the future.

On Wed, Apr 18, 2012 at 5:22 PM, Martin Mailand  wrote:
> Hi Wido,
>
> I am looking for doing the snapshots via libvirt, create, delete, rollback
> and list of the snapshot.
>
> -martin
>
> On 18.04.2012 15:10, Wido den Hollander wrote:
>
>> I tested this about a year ago and that worked fine.
>>
>> Anything in particular you are looking for?
>
>


[PATCH 1/6] rbd: protect read of snapshot sequence number

2012-04-18 Thread Alex Elder

The snapshot sequence number is updated whenever a snapshot is added
or deleted, and the snapc pointer is changed with every refresh of the
header.

Signed-off-by: Josh Durgin 
Reviewed-by: Alex Elder 
---
 drivers/block/rbd.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: b/drivers/block/rbd.c
===
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1685,7 +1685,9 @@
if (ret < 0)
return ret;

-   dev->header.snapc->seq =  new_snapid;
+   down_write(&dev->header_rwsem);
+   dev->header.snapc->seq = new_snapid;
+   up_write(&dev->header_rwsem);

return 0;
 bad:
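
The write lock above only does its job if readers of snapc->seq take the
matching shared lock; a kernel-style sketch of a hypothetical reader
(illustrative, not part of this patch) would be:

/* Hypothetical reader: any path that consults the snapshot sequence
 * holds header_rwsem for read, so it cannot race with the seq update
 * above or with a header refresh swapping out snapc. */
static u64 rbd_get_snap_seq(struct rbd_device *dev)
{
	u64 seq;

	down_read(&dev->header_rwsem);
	seq = dev->header.snapc->seq;
	up_read(&dev->header_rwsem);

	return seq;
}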


[PATCH 2/6] rbd: store snapshot id instead of index

2012-04-18 Thread Alex Elder

When a device was open at a snapshot, and snapshots were deleted or
added, data from the wrong snapshot could be read. Instead of
assuming the snap context is constant, store the actual snap id when
the device is initialized, and rely on the OSDs to signal an error
if we try reading from a snapshot that was deleted.

Signed-off-by: Josh Durgin 
Reviewed-by: Alex Elder 
---
 drivers/block/rbd.c |   27 +--
 1 file changed, 5 insertions(+), 22 deletions(-)

Index: b/drivers/block/rbd.c
===
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -175,8 +175,7 @@
/* protects updating the header */
struct rw_semaphore header_rwsem;
charsnap_name[RBD_MAX_SNAP_NAME_LEN];
-   u32 cur_snap;   /* index+1 of current snapshot within snap context
-  0 - for the head */
+   u64 snap_id;/* current snapshot id */
int read_only;

struct list_headnode;
@@ -554,21 +553,6 @@
return -ENOMEM;
 }

-static int snap_index(struct rbd_image_header *header, int snap_num)
-{
-   return header->total_snaps - snap_num;
-}
-
-static u64 cur_snap_id(struct rbd_device *rbd_dev)
-{
-   struct rbd_image_header *header = &rbd_dev->header;
-
-   if (!rbd_dev->cur_snap)
-   return 0;
-
-   return header->snapc->snaps[snap_index(header, rbd_dev->cur_snap)];
-}
-
 static int snap_by_name(struct rbd_image_header *header, const char 
*snap_name,

u64 *seq, u64 *size)
 {
@@ -607,7 +591,7 @@
snapc->seq = header->snap_seq;
else
snapc->seq = 0;
-   dev->cur_snap = 0;
+   dev->snap_id = CEPH_NOSNAP;
dev->read_only = 0;
if (size)
*size = header->image_size;
@@ -615,8 +599,7 @@
ret = snap_by_name(header, dev->snap_name, &snapc->seq, size);
if (ret < 0)
goto done;
-
-   dev->cur_snap = header->total_snaps - ret;
+   dev->snap_id = snapc->seq;
dev->read_only = 1;
}

@@ -1523,7 +1506,7 @@
  coll, cur_seg);
else
rbd_req_read(rq, rbd_dev,
-cur_snap_id(rbd_dev),
+rbd_dev->snap_id,
 ofs,
 op_size, bio,
 coll, cur_seg);
@@ -1658,7 +1641,7 @@
struct ceph_mon_client *monc;

/* we should create a snapshot only if we're pointing at the head */
-   if (dev->cur_snap)
+   if (dev->snap_id != CEPH_NOSNAP)
return -EINVAL;

monc = &dev->rbd_client->client->monc;


[PATCH 3/6] rbd: remove conditional snapid parameters

2012-04-18 Thread Alex Elder

The snapid parameters passed to rbd_do_op() and rbd_req_sync_op()
are now always either a valid snapid or an explicit CEPH_NOSNAP.

[el...@dreamhost.com: Rephrased the description]

Signed-off-by: Josh Durgin 
Reviewed-by: Alex Elder 
---
 drivers/block/rbd.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: b/drivers/block/rbd.c
===
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1153,7 +1153,7 @@
 int coll_index)
 {
return rbd_do_op(rq, rbd_dev, NULL,
-(snapid ? snapid : CEPH_NOSNAP),
+snapid,
 CEPH_OSD_OP_READ,
 CEPH_OSD_FLAG_READ,
 2,
@@ -1172,7 +1172,7 @@
  u64 *ver)
 {
return rbd_req_sync_op(dev, NULL,
-  (snapid ? snapid : CEPH_NOSNAP),
+  snapid,
   CEPH_OSD_OP_READ,
   CEPH_OSD_FLAG_READ,
   NULL,


[PATCH 4/6] rbd: fix snapshot size type

2012-04-18 Thread Alex Elder

Snapshot sizes should be the same type as regular image sizes. This
only affects their displayed size in sysfs, not the reported size of
the actual block device.

Signed-off-by: Josh Durgin 
Reviewed-by: Alex Elder 
---
 drivers/block/rbd.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: b/drivers/block/rbd.c
===
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -141,7 +141,7 @@
 struct rbd_snap {
struct  device  dev;
const char  *name;
-   size_t  size;
+   u64 size;
struct list_headnode;
u64 id;
 };
@@ -1936,7 +1936,7 @@
 {
struct rbd_snap *snap = container_of(dev, struct rbd_snap, dev);

-   return sprintf(buf, "%zd\n", snap->size);
+   return sprintf(buf, "%llu\n", (unsigned long long)snap->size);
 }

 static ssize_t rbd_snap_id_show(struct device *dev,
@@ -1945,7 +1945,7 @@
 {
struct rbd_snap *snap = container_of(dev, struct rbd_snap, dev);

-   return sprintf(buf, "%llu\n", (unsigned long long) snap->id);
+   return sprintf(buf, "%llu\n", (unsigned long long)snap->id);
 }

 static DEVICE_ATTR(snap_size, S_IRUGO, rbd_snap_size_show, NULL);


[PATCH 5/6] rbd: rename __rbd_update_snaps to __rbd_refresh_header

2012-04-18 Thread Alex Elder

This function rereads the entire header and handles any changes in
it, not just changes in snapshots.

Signed-off-by: Josh Durgin 
Reviewed-by: Alex Elder 
---
 drivers/block/rbd.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: b/drivers/block/rbd.c
===
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -240,7 +240,7 @@
put_device(&rbd_dev->dev);
 }

-static int __rbd_update_snaps(struct rbd_device *rbd_dev);
+static int __rbd_refresh_header(struct rbd_device *rbd_dev);

 static int rbd_open(struct block_device *bdev, fmode_t mode)
 {
@@ -1223,7 +1223,7 @@
dout("rbd_watch_cb %s notify_id=%lld opcode=%d\n", dev->obj_md_name,
notify_id, (int)opcode);
mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
-   rc = __rbd_update_snaps(dev);
+   rc = __rbd_refresh_header(dev);
mutex_unlock(&ctl_mutex);
if (rc)
pr_warning(RBD_DRV_NAME "%d got notification but failed to "
@@ -1690,7 +1690,7 @@
 /*
  * only read the first part of the ondisk header, without the snaps info
  */
-static int __rbd_update_snaps(struct rbd_device *rbd_dev)
+static int __rbd_refresh_header(struct rbd_device *rbd_dev)
 {
int ret;
struct rbd_image_header h;
@@ -1877,7 +1877,7 @@

mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);

-   rc = __rbd_update_snaps(rbd_dev);
+   rc = __rbd_refresh_header(rbd_dev);
if (rc < 0)
ret = rc;

@@ -2160,7 +2160,7 @@
 rbd_dev->header.obj_version);
if (ret == -ERANGE) {
mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
-   rc = __rbd_update_snaps(rbd_dev);
+   rc = __rbd_refresh_header(rbd_dev);
mutex_unlock(&ctl_mutex);
if (rc < 0)
return rc;
@@ -2545,7 +2545,7 @@
if (ret < 0)
goto err_unlock;

-   ret = __rbd_update_snaps(rbd_dev);
+   ret = __rbd_refresh_header(rbd_dev);
if (ret < 0)
goto err_unlock;



[PATCH 6/6] rbd: correct sysfs snap attribute documentation

2012-04-18 Thread Alex Elder

Each attribute is prefixed with "snap_".

Signed-off-by: Josh Durgin 
Reviewed-by: Alex Elder 
---
 Documentation/ABI/testing/sysfs-bus-rbd |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: b/Documentation/ABI/testing/sysfs-bus-rbd
===
--- a/Documentation/ABI/testing/sysfs-bus-rbd
+++ b/Documentation/ABI/testing/sysfs-bus-rbd
@@ -65,11 +65,11 @@
 Entries under /sys/bus/rbd/devices//snap_
 -

-id
+snap_id

The rados internal snapshot id assigned for this snapshot

-size
+snap_size

The size of the image when this snapshot was taken.



Re: rbd snapshot in qemu and libvirt

2012-04-18 Thread Martin Mailand

Hi Andrey,

if I try it I get this error.

virsh snapshot-create linux1
error: Requested operation is not valid: Disk 
'rbd/vm1:rbd_cache_enabled=1' does not support snapshotting


maybe the rbd_cache option is the problem?


-martin


On 18.04.2012 16:39, Andrey Korolyov wrote:

I have tested all of them about a week ago, all works fine. Also it
will be very nice if rbd can list an actual allocated size of every
image or snapshot in future.

On Wed, Apr 18, 2012 at 5:22 PM, Martin Mailand  wrote:

Hi Wido,

I am looking for doing the snapshots via libvirt, create, delete, rollback
and list of the snapshot.

-martin

On 18.04.2012 15:10, Wido den Hollander wrote:


I tested this about a year ago and that worked fine.

Anything in particular you are looking for?





Re: localized pgs

2012-04-18 Thread Noah Watkins

On Apr 17, 2012, at 9:36 AM, Sage Weil wrote:

> I guess the question is, is there a compelling use case, and if there is, 
> is there a better approach?

I am failing to find a reference to the paper, but I recall that the
one localized write to HDFS for MR applications isn't enormously
beneficial. Then again, it does save a lot of network traffic,
especially with triplication.

As more people start looking at the Object Class handler mechanism, I
could see it being desirable to stick data into objects (to grab later
via rendezvous) rather than return data to the rados_exec caller.
Making that efficient would be important.

Could you piggyback on the recovery logic to accomplish this? Say,
write two objects as normal and one local "misplaced" object, letting
RADOS asynchronously move that object to its intended location.

- Noah


[PATCH] ceph: osd_client: fix endianness bug in osd_req_encode_op()

2012-04-18 Thread Alex Elder

From Al Viro 

Al Viro noticed that we were using a non-cpu-encoded value in
a switch statement in osd_req_encode_op().  The result would
clearly not work correctly on a big-endian machine.

Signed-off-by: Alex Elder 

---
 net/ceph/osd_client.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: b/net/ceph/osd_client.c
===
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -278,7 +278,7 @@ static void osd_req_encode_op(struct cep
 {
dst->op = cpu_to_le16(src->op);

-   switch (dst->op) {
+   switch (src->op) {
case CEPH_OSD_OP_READ:
case CEPH_OSD_OP_WRITE:
dst->extent.offset =
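
A tiny user-space model of the bug (the op value is illustrative, not the
real CEPH_OSD_OP_READ encoding): on a big-endian host the le16-encoded
copy no longer matches the case labels.

#include <stdio.h>
#include <stdint.h>

#define OP_READ 1  /* illustrative op code */

/* Model of cpu_to_le16 on a big-endian host: a byte swap. */
static uint16_t cpu_to_le16_bigendian(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

int main(void)
{
	uint16_t src_op = OP_READ;
	uint16_t dst_op = cpu_to_le16_bigendian(src_op); /* 0x0100 */

	/* Switching on the wire-encoded value, as the old code did,
	 * compares 0x0100 against 1 and falls through to default,
	 * so the extent fields are never filled in. */
	switch (dst_op) {
	case OP_READ:
		printf("matched\n");
		break;
	default:
		printf("missed: dst_op=0x%04x\n", dst_op);
	}
	return 0;
}

Switching on src->op, as the patch does, keeps the comparison in CPU byte
order on every architecture.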


Re: rbd snapshot in qemu and libvirt

2012-04-18 Thread Andrey Korolyov
Oh, I forgot to mention a patch:

https://www.redhat.com/archives/libvir-list/2012-March/msg01369.html

On Wed, Apr 18, 2012 at 6:55 PM, Martin Mailand  wrote:
> Hi Andrey,
>
> if I try it I get this error.
>
> virsh snapshot-create linux1
> error: Requested operation is not valid: Disk 'rbd/vm1:rbd_cache_enabled=1'
> does not support snapshotting
>
> maybe the rbd_cache option is the problem?
>
>
> -martin
>
>
> Am 18.04.2012 16:39, schrieb Andrey Korolyov:
>
>> I have tested all of them about a week ago, all works fine. Also it
>> will be very nice if rbd can list an actual allocated size of every
>> image or snapshot in future.
>>
>> On Wed, Apr 18, 2012 at 5:22 PM, Martin Mailand
>>  wrote:
>>>
>>> Hi Wido,
>>>
>>> I am looking for doing the snapshots via libvirt, create, delete,
>>> rollback
>>> and list of the snapshot.
>>>
>>> -martin
>>>
>>> On 18.04.2012 15:10, Wido den Hollander wrote:
>>>
 I tested this about a year ago and that worked fine.

 Anything in particular you are looking for?
>>>
>>>
>>>


Re: wip-librbd-caching

2012-04-18 Thread Greg Farnum
On Wednesday, April 18, 2012 at 5:50 AM, Martin Mailand wrote:
> Hi,
>  
> I changed the values and the performance is still very good and the  
> memory footprint is much smaller.
>  
> OPTION(client_oc_size, OPT_INT, 1024*1024* 50) // MB * n
> OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 25) // MB * n (dirty  
> OR tx.. bigish)
> OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty  
> (keep this smallish)
> // note: the max amount of "in flight" dirty data is roughly (max - target)
>  
> But I am not quite sure about the meaning of the values.
> client_oc_size Max size of the cache?
> client_oc_max_dirty max dirty value before the writeback starts?
> client_oc_target_dirty ???
>  

Right now the cache writeout algorithms are based on the amount of dirty
data, rather than on something like how long the data has been dirty.  
client_oc_size is the max (and therefore typical) size of the cache.
client_oc_max_dirty is the largest amount of dirty data in the cache — if this 
much is dirty and you try to dirty more, the dirtier (a write of some kind) 
will block until some of the other dirty data has been committed.
client_oc_target_dirty is the amount of dirty data that will trigger the cache 
to start flushing data out.
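
In other words, the knobs fit together roughly like this (a C sketch of
the policy as described, with hypothetical helper and field names, not
the actual objectcacher source):

/* client_oc_size additionally bounds the total cache, clean + dirty. */
struct oc_limits {
	long long max_dirty;     /* client_oc_max_dirty: hard ceiling   */
	long long target_dirty;  /* client_oc_target_dirty: flush start */
};

/* hypothetical helpers, assumed to exist for this sketch */
void wait_for_writeback(void);
void kick_flusher(void);

static void account_dirty(struct oc_limits *lim, long long *dirty,
			  long long len)
{
	/* Hard ceiling: the dirtier blocks until writeback catches up. */
	if (*dirty + len > lim->max_dirty)
		wait_for_writeback();

	*dirty += len;

	/* Soft threshold: crossing it starts background flushing, so
	 * roughly (max_dirty - target_dirty) can be in flight at once. */
	if (*dirty > lim->target_dirty)
		kick_flusher();
}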



Re: rbd snapshot in qemu and libvirt

2012-04-18 Thread Martin Mailand

Hi,

On 18.04.2012 17:52, Andrey Korolyov wrote:

Oh, I forgot to say about a patch:


perfect, now it works.

Thanks.

-martin


re: ceph: move encode_fh to new API

2012-04-18 Thread Sage Weil
On Wed, 18 Apr 2012, Dan Carpenter wrote:
> Hello Sage Weil,
> 
> This is a semi-automatic email about new static checker warnings.
> 
> The patch f59919a07e03: "ceph: move encode_fh to new API" from Apr 5, 
> 2012, leads to the following Smatch complaint:
> 
> fs/ceph/export.c:85 ceph_encode_fh()
>error: we previously assumed 'dentry' could be null (see line 67)

Thanks, Dan!  This caught a minor bug.

Al, can you update f59919a07e0335358d9470289cf46ffb83b7d2f9 in your 
for-next with the following?

diff --git a/fs/ceph/export.c b/fs/ceph/export.c
index 869c554..8e1b60e 100644
--- a/fs/ceph/export.c
+++ b/fs/ceph/export.c
@@ -82,7 +82,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, 
int *max_len,
type = 255;
} else {
dout("encode_fh %p\n", dentry);
-   fh->ino = ceph_ino(dentry->d_inode);
+   fh->ino = ceph_ino(inode);
*max_len = handle_length;
type = 1;
}

...or the complete patch is below.

Thanks!
sage


From 21d32da4f1731a583a471b24308b5f8c4e065ab6 Mon Sep 17 00:00:00 2001
From: Sage Weil 
Date: Wed, 18 Apr 2012 10:39:14 -0700
Subject: [PATCH] ceph: move encode_fh to new API

Use parent_inode as a flag for whether nfsd wants a connectable fh, but
generate one opportunistically so that we can take advantage of the
additional info in there.

Signed-off-by: Sage Weil 
Signed-off-by: Al Viro 
---
 fs/ceph/export.c |   34 +-
 1 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/fs/ceph/export.c b/fs/ceph/export.c
index 4f9234c..8e1b60e 100644
--- a/fs/ceph/export.c
+++ b/fs/ceph/export.c
@@ -40,38 +40,49 @@ struct ceph_nfs_confh {
u32 parent_name_hash;
 } __attribute__ ((packed));
 
-static int ceph_encode_fh(struct dentry *dentry, u32 *rawfh, int *max_len,
- int connectable)
+/*
+ * The presence of @parent_inode here tells us whether NFS wants a
+ * connectable file handle.  However, we want to make a connectable
+ * file handle unconditionally so that the MDS gets as much of a hint
+ * as possible.  That means we only use @parent_inode to indicate
+ * whether nfsd wants a connectable fh, and whether we should indicate
+ * failure from a too-small @max_len.
+ */
+static int ceph_encode_fh(struct inode *inode, u32 *rawfh, int *max_len,
+ struct inode *parent_inode)
 {
int type;
struct ceph_nfs_fh *fh = (void *)rawfh;
struct ceph_nfs_confh *cfh = (void *)rawfh;
-   struct dentry *parent;
-   struct inode *inode = dentry->d_inode;
int connected_handle_length = sizeof(*cfh)/4;
int handle_length = sizeof(*fh)/4;
+   struct dentry *dentry = d_find_alias(inode);
+   struct dentry *parent;
 
/* don't re-export snaps */
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EINVAL;
 
-   spin_lock(&dentry->d_lock);
-   parent = dentry->d_parent;
-   if (*max_len >= connected_handle_length) {
+   /* if we found an alias, generate a connectable fh */
+   if (*max_len >= connected_handle_length && dentry) {
dout("encode_fh %p connectable\n", dentry);
-   cfh->ino = ceph_ino(dentry->d_inode);
+   spin_lock(&dentry->d_lock);
+   parent = dentry->d_parent;
+   cfh->ino = ceph_ino(inode);
cfh->parent_ino = ceph_ino(parent->d_inode);
cfh->parent_name_hash = ceph_dentry_hash(parent->d_inode,
 dentry);
*max_len = connected_handle_length;
type = 2;
+   spin_unlock(&dentry->d_lock);
} else if (*max_len >= handle_length) {
-   if (connectable) {
+   if (parent_inode) {
+   /* nfsd wants connectable */
*max_len = connected_handle_length;
type = 255;
} else {
dout("encode_fh %p\n", dentry);
-   fh->ino = ceph_ino(dentry->d_inode);
+   fh->ino = ceph_ino(inode);
*max_len = handle_length;
type = 1;
}
@@ -79,7 +90,6 @@ static int ceph_encode_fh(struct dentry *dentry, u32 *rawfh, 
int *max_len,
*max_len = handle_length;
type = 255;
}
-   spin_unlock(&dentry->d_lock);
return type;
 }
 
@@ -247,9 +257,7 @@ static struct dentry *ceph_fh_to_parent(struct super_block 
*sb,
 }
 
 const struct export_operations ceph_export_ops = {
-#ifdef CEPH_BREAKAGE_FIXED
.encode_fh = ceph_encode_fh,
-#endif
.fh_to_dentry = ceph_fh_to_dentry,
.fh_to_parent = ceph_fh_to_parent,
 };
-- 
1.7.9


Re: wip-librbd-caching

2012-04-18 Thread Sage Weil
On Wed, 18 Apr 2012, Martin Mailand wrote:
> On 12.04.2012 21:45, Sage Weil wrote:
> > The config options you'll want to look at are client_oc_* (in case you
> > didn't see that already :).  "oc" is short for objectcacher, and it isn't
> > only used for client (libcephfs), so it might be worth renaming these
> > options before people start using them.
> 
> Hi,
> 
> I changed the values and the performance is still very good and the memory
> footprint is much smaller.
> 
> OPTION(client_oc_size, OPT_INT, 1024*1024* 50)// MB * n
> OPTION(client_oc_max_dirty, OPT_INT, 1024*1024* 25)// MB * n  (dirty OR
> tx.. bigish)
> OPTION(client_oc_target_dirty, OPT_INT, 1024*1024* 8) // target dirty (keep
> this smallish)
> // note: the max amount of "in flight" dirty data is roughly (max - target)
> 
> But I am not quite sure about the meaning of the values.
> client_oc_size Max size of the cache?

yes

> client_oc_max_dirty max dirty value before the writeback starts?

before writes block and wait for writeback to bring the dirty level down

> client_oc_target_dirty ???

before writeback starts

BTW I renamed 'rbd cache enabled' -> 'rbd cache'.  I'd like to rename the 
objectcacher settings too so they aren't nested under client_ (which is 
the fs client code).

objectcacher_*?

sage


Re: [PATCH 2/3] ceph: fix bounds check macros ceph_decode_need and ceph_encode_need

2012-04-18 Thread Sage Weil
On Wed, 18 Apr 2012, Alex Elder wrote:
> On 12/14/2011 02:24 PM, Xi Wang wrote:
> > Given a large n, the bounds check (*p + n > end) can be bypassed due to
> > pointer wraparound.  A safer check is (n > end - *p).
> > 
> > Signed-off-by: Xi Wang
> 
> I noticed this proposed change never got committed.
> 
> It looks good, but I don't like the name "ceph_need()".
> 
> I am planning to pull this in soon, modified like this:
> 
> static inline int ceph_need_ok(void **p, void *end, size_t n)

Would something like ceph_has_room() make more sense?

> {
>return end >= *p && n <= end - *p;
> }
> 
> And then used like this:
> 
>if (!likely(ceph_need_ok(p, end, n)))
> 
> If you have an objection to that, please say so soon
> (and if you have no objection, please ACK).
> 
> Reviewed-by: Alex Elder 
> 
> > ---
> >   include/linux/ceph/decode.h |9 +++--
> >   1 files changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/ceph/decode.h b/include/linux/ceph/decode.h
> > index c5b6939..ea6db7b 100644
> > --- a/include/linux/ceph/decode.h
> > +++ b/include/linux/ceph/decode.h
> > @@ -12,6 +12,11 @@
> >*   void *endpointer to end of buffer (last byte + 1)
> >*/
> > 
> > +static inline int ceph_need(void **p, void *end, size_t n)
> > +{
> > +   return ((end < *p) || (n > end - *p));
> > +}
> > +
> >   static inline u64 ceph_decode_64(void **p)
> >   {
> > u64 v = get_unaligned_le64(*p);
> > @@ -47,7 +52,7 @@ static inline void ceph_decode_copy(void **p, void *pv,
> > size_t n)
> >*/
> >   #define ceph_decode_need(p, end, n, bad)  \
> > do {\
> > -   if (unlikely(*(p) + (n) > (end)))   \
> > +   if (unlikely(ceph_need(p, end, n))) \
> > goto bad;   \
> > } while (0)
> > 
> > @@ -166,7 +171,7 @@ static inline void ceph_encode_string(void **p, void
> > *end,
> > 
> >   #define ceph_encode_need(p, end, n, bad)  \
> > do {\
> > -   if (unlikely(*(p) + (n) > (end)))   \
> > +   if (unlikely(ceph_need(p, end, n))) \
> > goto bad;   \
> > } while (0)
> > 
> 


Re: [PATCH] rbd: fix integer overflow in rbd_header_from_disk()

2012-04-18 Thread Xi Wang
On Apr 18, 2012, at 10:21 AM, Alex Elder wrote:
> 
> This looks good, however I have changed it to use UINT_MAX
> rather than ULONG_MAX, because on some architectures size_t
> (__kernel_size_t) is defined as type (unsigned int).  It
> is the more conservative value, and even on architectures
> where __BITS_PER_LONG is 64, it still offers a sane upper
> bound on the number of snapshots for a rbd device.

Looks good to me.

BTW, on which arch is size_t 32-bit while long is 64?
Several commits still use ULONG_MAX, such as

http://git.kernel.org/linus/64486697

- xi


Re: [PATCH 2/3] ceph: fix bounds check macros ceph_decode_need and ceph_encode_need

2012-04-18 Thread Alex Elder

On 04/18/2012 12:53 PM, Sage Weil wrote:

Would something like ceph_has_room() make more sense?


Yes, and standing by itself it reads a lot better too.

The "need_ok" came from the fact that it was only ever
called by ceph_need_*() macros.

Without objection I'll change it to ceph_has_room()
before I commit it for testing.

-Alex


Re: pgs stuck inactive

2012-04-18 Thread Greg Farnum
On Tuesday, April 17, 2012 at 11:41 PM, Damien Churchill wrote:
> On 17 April 2012 17:49, Greg Farnum <gregory.far...@dreamhost.com> wrote:
> > Do you know what version this was created with, and what upgrades you've 
> > been through? My best guess right now is that there's a problem with the 
> > encoding and decoding that I'm going to have to track down, and more 
> > context will make it a lot easier. :)
> 
> 
> 
> Hmmm, that's testing my memory; I'd say that cluster has been alive at
> least since 0.34. I think the occasional version was skipped,
> not sure if that could cause any issues?

Okay. So the good news is that we can see what's broken now and have a kludge 
to prevent it happening to others; the bad news is we still have no idea how it 
actually occurred. :( But I don't think it's worth investing the time given 
what we have available, so all we can do is repair your cluster. 

Are you building your binaries from source, and can you run a patched version 
of the monitors? If you can I'll give you a patch to enable a simple command 
that should make things work; otherwise we'll need to start editing things by 
hand. (Yucky)
-Greg



Re: [PATCH] rbd: fix integer overflow in rbd_header_from_disk()

2012-04-18 Thread Alex Elder

On 04/18/2012 01:09 PM, Xi Wang wrote:

On Apr 18, 2012, at 10:21 AM, Alex Elder wrote:


This looks good, however I have changed it to use UINT_MAX
rather than ULONG_MAX, because on some architectures size_t
(__kernel_size_t) is defined as type (unsigned int).  It
is the more conservative value, and even on architectures
where __BITS_PER_LONG is 64, it still offers a sane upper
bound on the number of snapshots for a rbd device.


Looks good to me.

BTW, on which arch is size_t 32-bit while long is 64?
Several commits still use ULONG_MAX, such as


In my tree, these at least *can* be:  arm, m68k, mips,
powerpc, sparc, x86, and others (including generic).

It may not matter in practice, but I prefer to be very
careful.  As long as you're going out of your way to
avoid overflow might as well make sure it works on all
architectures in the process.

-Alex


http://git.kernel.org/linus/64486697

- xi


Re: pgs stuck inactive

2012-04-18 Thread Damien Churchill
On 18 April 2012 19:41, Greg Farnum  wrote:
>
> Are you building your binaries from source, and can you run a patched version 
> of the monitors? If you can I'll give you a patch to enable a simple command 
> that should make things work; otherwise we'll need to start editing things by 
> hand. (Yucky)
> -Greg
>

I was using the Ubuntu packages but I can quite happily build my own
packages if you give me the patch :-)

I agree it's a waste of time if it's not obvious what caused it; it
could be some obscure issue that occurred while upgrading between older
versions.


Re: pgs stuck inactive

2012-04-18 Thread Greg Farnum
On Wednesday, April 18, 2012 at 12:04 PM, Damien Churchill wrote:
> On 18 April 2012 19:41, Greg Farnum <gregory.far...@dreamhost.com> wrote:
> > 
> > Are you building your binaries from source, and can you run a patched 
> > version of the monitors? If you can I'll give you a patch to enable a 
> > simple command that should make things work; otherwise we'll need to start 
> > editing things by hand. (Yucky)
> > -Greg
> 
> 
> 
> I was using the Ubuntu packages but I can quite happily build my own
> packages if you give me the patch :-)
> 
> I agree it's a waste of time if it's not obvious what's caused it,
> could be some obscure cause occurred due to upgrading between older
> versions.

Okay, assuming you're still on 0.41.1, can you checkout the git branch 
"for-damien" and build it? 
Then shut down your monitors, replace their executables with the freshly-built 
ones, and run
"ceph osd pool set vmimages pg_num 320 --a-dev-told-me-to"
and
"ceph osd pool set vmimages pgp_num 320"



That should get everything back up and running. The one sour note is that due 
to the bug in the past, your data (ie, filesystem) and vmimages pools have 
gotten conflated. It shouldn't cause any issues (they use very different naming 
schemes), but they're tied together in terms of replication and the raw pool 
statistics.
(If that's important you can create a new pool and move the rbd images to it.)
-Greg



Re: ceph: move encode_fh to new API

2012-04-18 Thread Al Viro
On Wed, Apr 18, 2012 at 10:39:08AM -0700, Sage Weil wrote:
> On Wed, 18 Apr 2012, Dan Carpenter wrote:
> > Hello Sage Weil,
> > 
> > This is a semi-automatic email about new static checker warnings.
> > 
> > The patch f59919a07e03: "ceph: move encode_fh to new API" from Apr 5, 
> > 2012, leads to the following Smatch complaint:
> > 
> > fs/ceph/export.c:85 ceph_encode_fh()
> >  error: we previously assumed 'dentry' could be null (see line 67)
> 
> Thanks, Dan!  This caught a minor bug.
> 
> Al, can you update f59919a07e0335358d9470289cf46ffb83b7d2f9 in your 
> for-next with the following?

Folded and pushed...


Re: [PATCH] rbd: fix integer overflow in rbd_header_from_disk()

2012-04-18 Thread Xi Wang
On Apr 18, 2012, at 2:47 PM, Alex Elder wrote:

> It may not matter in practice, but I prefer to be very
> careful.  As long as you're going out of your way to
> avoid overflow might as well make sure it works on all
> architectures in the process.

I agree.  Probably a cleaner way is to define SIZE_MAX
or use KMALLOC_MAX_SIZE.

- xi
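
For reference, the SIZE_MAX form of the guard Xi suggests would look like
this in user space (a sketch; SIZE_MAX comes from stdint.h, which is why
the kernel discussion above falls back to UINT_MAX/ULONG_MAX):

#include <stdint.h>
#include <stdlib.h>

/* Overflow-safe "allocate count elements": reject any count whose
 * total size would wrap size_t before it reaches the allocator. */
static void *alloc_array(size_t count, size_t entry_size)
{
	if (entry_size != 0 && count > SIZE_MAX / entry_size)
		return NULL;
	return malloc(count * entry_size);
}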