Re: ceph status reporting non-existing osd

2012-07-19 Thread Andrey Korolyov
On Thu, Jul 19, 2012 at 1:28 AM, Gregory Farnum g...@inktank.com wrote:
 On Wed, Jul 18, 2012 at 12:07 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Wed, Jul 18, 2012 at 10:30 PM, Gregory Farnum g...@inktank.com wrote:
 On Wed, Jul 18, 2012 at 12:47 AM, Andrey Korolyov and...@xdel.ru wrote:
 On Wed, Jul 18, 2012 at 11:18 AM, Gregory Farnum g...@inktank.com wrote:
 On Tuesday, July 17, 2012 at 11:22 PM, Andrey Korolyov wrote:
 On Wed, Jul 18, 2012 at 10:09 AM, Gregory Farnum g...@inktank.com wrote:
  Hrm. That shouldn't be possible if the OSD has been removed. How did 
  you take it out? It sounds like maybe you just marked it in the OUT 
  state (and turned it off quite quickly) without actually taking it out 
  of the cluster?
  -Greg



 When I did the removal, it was definitely not like that - first I
 marked the osds (4 and 5 on the same host) out, then rebuilt the
 crushmap, and then killed the osd processes. As I mentioned before, osd.4
 does not exist in the crushmap and therefore it shouldn't (theoretically)
 be reported at all.

 Okay, that's what happened — marking an OSD out in the CRUSH map means 
 all the data gets moved off it, but that doesn't remove it from all the 
 places where it's registered in the monitor and in the map, for a couple 
 reasons:
 1) You might want to mark an OSD out before taking it down, to allow for 
 more orderly data movement.
 2) OSDs can get marked out automatically, but the system shouldn't be 
 able to forget about them on its own.
 3) You might want to remove an OSD from the CRUSH map in the process of 
 placing it somewhere else (perhaps you moved the physical machine to a 
 new location).
 etc.

 You want to run ceph osd rm 4 5 and that should unregister both of them 
 from everything[1]. :)
 -Greg
 [1]: Except for the full lists, which have a bug in the version of code 
 you're running — remove the OSDs, then adjust the full ratios again, and 
 all will be well.


 $ ceph osd rm 4
 osd.4 does not exist
 $ ceph -s
health HEALTH_WARN 1 near full osd(s)
monmap e3: 3 mons at
 {0=192.168.10.129:6789/0,1=192.168.10.128:6789/0,2=192.168.10.127:6789/0},
 election epoch 58, quorum 0,1,2 0,1,2
osdmap e2198: 4 osds: 4 up, 4 in
 pgmap v586056: 464 pgs: 464 active+clean; 66645 MB data, 231 GB
 used, 95877 MB / 324 GB avail
mdsmap e207: 1/1/1 up {0=a=up:active}

 $ ceph health detail
 HEALTH_WARN 1 near full osd(s)
 osd.4 is near full at 89%

 $ ceph osd dump
 
 max_osd 4
 osd.0 up   in  weight 1 up_from 2183 up_thru 2187 down_at 2172
 last_clean_interval [2136,2171) 192.168.10.128:6800/4030
 192.168.10.128:6801/4030 192.168.10.128:6802/4030 exists,up
 68b3deec-e80a-48b7-9c29-1b98f5de4f62
 osd.1 up   in  weight 1 up_from 2136 up_thru 2186 down_at 2135
 last_clean_interval [2115,2134) 192.168.10.129:6800/2980
 192.168.10.129:6801/2980 192.168.10.129:6802/2980 exists,up
 b2a26fe9-aaa8-445f-be1f-fa7d2a283b57
 osd.2 up   in  weight 1 up_from 2181 up_thru 2187 down_at 2172
 last_clean_interval [2136,2171) 192.168.10.128:6803/4128
 192.168.10.128:6804/4128 192.168.10.128:6805/4128 exists,up
 378d367a-f7fb-4892-9ec9-db8ffdd2eb20
 osd.3 up   in  weight 1 up_from 2136 up_thru 2186 down_at 2135
 last_clean_interval [2115,2134) 192.168.10.129:6803/3069
 192.168.10.129:6804/3069 192.168.10.129:6805/3069 exists,up
 faf8eda8-55fc-4a0e-899f-47dbd32b81b8
 

 Hrm. How did you create your new crush map? All the normal avenues of
 removing an OSD from the map set a flag which the PGMap uses to delete
 its records (which would prevent it reappearing in the full list), and
 I can't see how setcrushmap would remove an OSD from the map (although
 there might be a code path I haven't found).

 Manually, by deleting the osd.4 and osd.5 entries and reweighting the remaining nodes.

 So you extracted the CRUSH map, edited it, and injected it using ceph
 osd setcrushmap?

Yep, exactly.
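
For reference, the complete removal sequence Greg describes would look roughly like the sketch below (argonaut-era command forms; the init-script invocation and the ratio values are illustrative assumptions, not taken from this thread):

# mark the OSDs out and let the data drain off them
ceph osd out 4
ceph osd out 5
# once migration finishes, stop the daemons
service ceph stop osd.4
service ceph stop osd.5
# remove them from the CRUSH map, then unregister them
ceph osd crush remove osd.4
ceph osd crush remove osd.5
ceph osd rm 4 5
# per Greg's footnote: re-set the full ratios to flush the stale full lists
ceph pg set_full_ratio 0.95
ceph pg set_nearfull_ratio 0.85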


Re: Poor read performance in KVM

2012-07-19 Thread Vladimir Bashkirtsev



It's actually the sum of the latencies of all 3971 asynchronous reads,
in seconds, so the average latency was ~200ms, which is still pretty
high.
OK. I realized that later that day, when I noticed that the sum only ever 
goes up. So sum is the number of seconds spent, and dividing it by avgcount 
gives an idea of the average latency. I guess the same scheme is used everywhere else.


That's latency between KVM and the OSDs. The extra latency could be from
the callback to qemu or an artifact of this workload on the osds.
You can use the admin socket on the osds for 'perf dump' as well, and
check the 'op_r_latency' values, which is the latency between receiving
a request on the OSD and replying to it.
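
A quick sketch of pulling those counters, assuming the default admin socket path for osd.0:

# dump all perf counters from the osd's admin socket
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
# average latency is sum / avgcount, e.g. 3037.78 / 25018 = ~0.121 s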

osd.0 latency 0.121423775
"op_r_latency": {
      "avgcount": 25018,
      "sum": 3037.78
},
osd.3 latency 0.075055975
"op_r_latency": {
      "avgcount": 3597595,
      "sum": 270021
},

osd.0 was restarted recently; osd.3 has been running for a few days.


It may still sit in some network queues before being sent, so you may
want to enable debug_ms=1 for rbd. This will show each osd_op and the
corresponding osd_op_reply with a leading timestamp. You can tell
which ones go together by their transaction id, i.e.:

2012-07-17 17:39:21.813009 7f9174a30760  1 -- 
192.168.106.222:0/1009333 --> 192.168.106.222:6800/9134 -- 
osd_op(client.4104.0:20  [write 4096~512] 3.bd49d10d) v4 -- ?+0 
0x235ac90 con 0x234e530
2012-07-17 17:39:21.869671 7f9171093710  1 -- 
192.168.106.222:0/1009333 <== osd.0 192.168.106.222:6800/9134 20 ====
osd_op_reply(20  [write 4096~512] ondisk = 0) v4 ==== 99+0+0 
(1120882424 0 0) 0x235ac90 con 0x234e530

The transaction id is 20 here (seen in 'osd_op(client.4104.0:20' and 
'osd_op_reply(20').
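
A sketch for pulling a single transaction out of such a log (the log path and transaction id are examples):

# show the request and its reply for transaction 20 of client.4104
grep -e 'osd_op(client.4104.0:20 ' -e 'osd_op_reply(20 ' client.log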


Done that... Log excerpt below. It appears to be in line with what rados 
bench says: latency is in the millisecond range. A manual glance over roughly 
4000 requests at random spots confirms that latency is low on the OSD.


To run this test I have:
1. Migrated all VMs off the host except the one I am testing
2. Put debug_ms=1 in ceph.conf and restarted ceph (mon, mds, osd - all 
running on this host).


I am not sure that ceph-osd.0.log contains the information we are looking 
for: i.e. it appears reversed relative to your example. Maybe I need to look 
somewhere else? My thinking is that the osd log is written by the osd and 
thus does not include any network latency.
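
If the aim is to capture the client side (qemu/librbd) rather than the osd, a sketch of a ceph.conf stanza for that (the log path is an example, not from this thread):

[client]
    debug ms = 1
    log file = /var/log/ceph/client.$pid.log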


ceph-osd.0.log:2012-07-19 19:09:52.899706 7fb261ffb700  1 -- 
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 376 
==== osd_op(client.159051.0:1427 rb.0.7. [read 1282048~4096] 
2.c8eaa480) v4 ==== 158+0+0 (4291909772 0 0) 0x7fb16c001000 con 
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.899912 7fb25e7f4700  1 -- 
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- 
osd_op_reply(1427 rb.0.7. [read 1282048~4096] = 0) v4 -- ?+0 
0x7fb2408e2000 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.900254 7fb261ffb700  1 -- 
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 377 
==== osd_op(client.159051.0:1428 rb.0.7. [read 1286144~4096] 
2.c8eaa480) v4 ==== 158+0+0 (1278691322 0 0) 0x7fb16c001400 con 
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.900451 7fb25eff5700  1 -- 
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- 
osd_op_reply(1428 rb.0.7. [read 1286144~4096] = 0) v4 -- ?+0 
0x7fb24dbd9570 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.900815 7fb261ffb700  1 -- 
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 378 
==== osd_op(client.159051.0:1429 rb.0.7. [read 1290240~4096] 
2.c8eaa480) v4 ==== 158+0+0 (372927 0 0) 0x7fb16c001000 con 
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901016 7fb25e7f4700  1 -- 
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- 
osd_op_reply(1429 rb.0.7. [read 1290240~4096] = 0) v4 -- ?+0 
0x7fb2403d14e0 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901359 7fb261ffb700  1 -- 
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 379 
==== osd_op(client.159051.0:1430 rb.0.7. [read 1294336~4096] 
2.c8eaa480) v4 ==== 158+0+0 (773263335 0 0) 0x7fb16c001400 con 
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901557 7fb25eff5700  1 -- 
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- 
osd_op_reply(1430 rb.0.7. [read 1294336~4096] = 0) v4 -- ?+0 
0x7fb24e311050 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901917 7fb261ffb700  1 -- 
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 380 
==== osd_op(client.159051.0:1431 rb.0.7. [read 1298432~4096] 
2.c8eaa480) v4 ==== 158+0+0 (3155241114 0 0) 0x7fb16c001000 con 
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.902120 7fb25e7f4700  1 -- 
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- 
osd_op_reply(1431 rb.0.7. [read 1298432~4096] = 0) v4 -- ?+0 
0x7fb2414137d0 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 

Re: Slow request warnings on 0.48

2012-07-19 Thread Matthew Richardson
I'd just like to report the same behaviour on my test cluster with 0.48.

I've set up a single box (SL6.1 - 2.6.32-220.23.1 kernel) with 1 mds,
mon and osd, and replication set to '1' for both data and metadata.

Having mounted using ceph-fuse, I'm running a simple fio job to create load:

[global]
directory=/mnt/ceph
size=500M
rw=read
ioengine=libaio

[simple]

I'm then watching the latency with ioping.
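
For reference, a sketch of the ioping side (the count is arbitrary):

# issue periodic small reads against the mounted fs and report latency
ioping -c 100 /mnt/ceph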

With rw=read, rw=randread (random reads) or rw=write (sequential writes)
I see no problems and the latency sits around 1-2ms.  However, with
rw=randwrite (random writes) I see the latency jump to between 5 and 60
seconds, and the following types of warning lines appear:

2012-07-19 10:29:39.417625 osd.0 [WRN] 11 slow requests, 6 included
below; oldest blocked for > 54.425766 secs
[WRN] slow request 54.420958 seconds old, received at 2012-07-19
10:28:44.996584: osd_op(client.4113.0:9153 10003ed.003b [write
847872~4096] 0.dc4b476f snapc 1=[]) v4 currently started
2012-07-19 10:29:39.417641 osd.0 [WRN] slow request 54.420587 seconds
old, received at 2012-07-19 10:28:44.996955: osd_op(client.4113.0:9154
10003ed. [write 1175552~4096] 0.44a7cb80 snapc 1=[]) v4
currently started
[...snip...]


Let me know if there's any more information that I can provide that
might help with diagnosing the problem (also bearing in mind that I'm
new to ceph so might need extra notes on generating tests, dumps etc :) )

Thanks,

Matthew


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.





Re: Poor read performance in KVM

2012-07-19 Thread Vladimir Bashkirtsev



Try to determine how much of the 200ms avg latency comes from osds vs
the qemu block driver.
It looks like osd.0 performs with low latency but osd.1's latency is way 
too high, so on average it appears as 200ms. The osds are backed by btrfs over 
LVM2. Maybe the issue lies in the choice of backing fs? All four osds run a 
similar setup (btrfs over LVM2), so I have some doubts that this is the 
reason, since osd.0 performs well.


I have read the full log between the osd_op for 3670 and its osd_op_reply, 
and there are a number of pings from other osds (which were responded to 
quickly) and a good number of osd_op_reply writes (the osd_sub_op for these 
writes came 10 seconds before). So it appears 3670 was delayed by a backlog 
of operations.


Once the latency is under control, you might look into changing guest 
settings to send larger requests and readahead more.


Josh
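
As an aside, a sketch of the guest-side readahead knobs Josh mentions (the device name and values are examples only):

# inside the guest: raise readahead on the virtual disk
blockdev --setra 4096 /dev/vda                   # value in 512-byte sectors
echo 2048 > /sys/block/vda/queue/read_ahead_kb   # same knob, value in KB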





[PATCH 0/6] rbd: old patches from Josh

2012-07-19 Thread Alex Elder
Late last year Josh Durgin had put together a series of
fixes for rbd that never got committed.  I told him I
would get them in, and this series represents the last
six that remain.

Here's a summary:
[PATCH 1/6] rbd: return errors for mapped but deleted snapshot
This adds code to distinguish the result of attempting
to read data from a deleted snapshot from the result
of reading a hole in a snapshot.  The former now produces
ENXIO.
[PATCH 2/6] rbd: only reset capacity when pointing to head
When an rbd header is refreshed, its capacity is set in
case it has been changed.  This should not happen for
mapped snapshots.
[PATCH 3/6] rbd: expose the correct size of the device in sysfs
An rbd_dev--even one mapping a snapshot--holds the size of
its base image in its header's image_size field.  The sysfs
entry for the snapshot size was showing the wrong value.
[PATCH 4/6] rbd: set image size when header is updated
The rbd image size was not getting updated when a header
was refreshed.
[PATCH 5/6] rbd: use reference counting for the snap context
This makes sure the rbd code takes a reference to its
snapshot context while a request related to that context
is underway.
[PATCH 6/6] rbd: send header version when notifying
This ensures the version that gets sent back on a watch
notify acknowledgement is the one that got read as
a result of refreshing the header.

I've reviewed them all, but am posting them for a chance for
others to comment before I commit them.

-Alex


[PATCH 1/6] rbd: return errors for mapped but deleted snapshot

2012-07-19 Thread Alex Elder
When a snapshot is deleted, the OSD will return ENOENT when reading
from it. This is normally interpreted as a hole by rbd, which will
return zeroes. To minimize the time in which this can happen, stop
requests early when we are notified that our snapshot no longer
exists.

[el...@inktank.com: updated __rbd_init_snaps_header() logic]

Signed-off-by: Josh Durgin josh.dur...@inktank.com
Reviewed-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   32 ++--
 1 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b124442..730d0ce 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -172,9 +172,13 @@ struct rbd_device {
 
     /* protects updating the header */
     struct rw_semaphore header_rwsem;
+    /* name of the snapshot this device reads from */
     char    *snap_name;
+    /* id of the snapshot this device reads from */
     u64 snap_id;    /* current snapshot id */
-    int read_only;
+    /* whether the snap_id this device reads from still exists */
+    bool    snap_exists;
+    int read_only;
 
     struct list_head    node;
 
@@ -597,6 +601,7 @@ static int rbd_header_set_snap(struct rbd_device
*rbd_dev, u64 *size)
     else
         snapc->seq = 0;
     rbd_dev->snap_id = CEPH_NOSNAP;
+    rbd_dev->snap_exists = false;
     rbd_dev->read_only = 0;
     if (size)
         *size = header->image_size;
@@ -606,6 +611,7 @@ static int rbd_header_set_snap(struct rbd_device
*rbd_dev, u64 *size)
     if (ret < 0)
         goto done;
     rbd_dev->snap_id = snapc->seq;
+    rbd_dev->snap_exists = true;
     rbd_dev->read_only = 1;
     }
 
@@ -1468,6 +1474,21 @@ static void rbd_rq_fn(struct request_queue *q)
 
         spin_unlock_irq(q->queue_lock);
 
+        if (rbd_dev->snap_id != CEPH_NOSNAP) {
+            bool snap_exists;
+
+            down_read(&rbd_dev->header_rwsem);
+            snap_exists = rbd_dev->snap_exists;
+            up_read(&rbd_dev->header_rwsem);
+
+            if (!snap_exists) {
+                dout("request for non-existent snapshot");
+                spin_lock_irq(q->queue_lock);
+                __blk_end_request_all(rq, -ENXIO);
+                continue;
+            }
+        }
+
         dout("%s 0x%x bytes at 0x%llx\n",
              do_write ? "write" : "read",
              size, blk_rq_pos(rq) * SECTOR_SIZE);
@@ -2088,7 +2109,14 @@ static int __rbd_init_snaps_header(struct
rbd_device *rbd_dev)
             cur_id = rbd_dev->header.snapc->snaps[i - 1];
 
         if (!i || old_snap->id < cur_id) {
-            /* old_snap->id was skipped, thus was removed */
+            /*
+             * old_snap->id was skipped, thus was
+             * removed.  If this rbd_dev is mapped to
+             * the removed snapshot, record that it no
+             * longer exists, to prevent further I/O.
+             */
+            if (rbd_dev->snap_id == old_snap->id)
+                rbd_dev->snap_exists = false;
             __rbd_remove_snap_dev(rbd_dev, old_snap);
             continue;
         }
-- 
1.7.5.4



[PATCH 2/6] rbd: only reset capacity when pointing to head

2012-07-19 Thread Alex Elder
Snapshots cannot be resized, and the new capacity of head should not
be reflected by the snapshot.

Signed-off-by: Josh Durgin josh.dur...@inktank.com
Reviewed-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 730d0ce..f171ceb 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1725,7 +1725,12 @@ static int __rbd_refresh_header(struct rbd_device
*rbd_dev)
return ret;

/* resized? */
-    set_capacity(rbd_dev->disk, h.image_size / SECTOR_SIZE);
+    if (rbd_dev->snap_id == CEPH_NOSNAP) {
+        sector_t size = (sector_t) h.image_size / SECTOR_SIZE;
+
+        dout("setting size to %llu sectors", (unsigned long long) size);
+        set_capacity(rbd_dev->disk, size);
+    }
 
     down_write(&rbd_dev->header_rwsem);

-- 
1.7.5.4



[PATCH 3/6] rbd: expose the correct size of the device in sysfs

2012-07-19 Thread Alex Elder
If an image was mapped to a snapshot, the size of the head version
would be shown. Protect capacity with header_rwsem, since it may
change.

Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
Reviewed-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   11 ---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index f171ceb..9c3a1db 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1724,6 +1724,8 @@ static int __rbd_refresh_header(struct rbd_device
*rbd_dev)
     if (ret < 0)
         return ret;
 
+    down_write(&rbd_dev->header_rwsem);
+
     /* resized? */
     if (rbd_dev->snap_id == CEPH_NOSNAP) {
         sector_t size = (sector_t) h.image_size / SECTOR_SIZE;
@@ -1732,8 +1734,6 @@ static int __rbd_refresh_header(struct rbd_device
*rbd_dev)
         set_capacity(rbd_dev->disk, size);
     }
 
-    down_write(&rbd_dev->header_rwsem);
-
     snap_seq = rbd_dev->header.snapc->seq;
     if (rbd_dev->header.total_snaps &&
         rbd_dev->header.snapc->snaps[0] == snap_seq)
@@ -1853,8 +1853,13 @@ static ssize_t rbd_size_show(struct device *dev,
                  struct device_attribute *attr, char *buf)
 {
     struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
+    sector_t size;
+
+    down_read(&rbd_dev->header_rwsem);
+    size = get_capacity(rbd_dev->disk);
+    up_read(&rbd_dev->header_rwsem);
 
-    return sprintf(buf, "%llu\n", (unsigned long
long)rbd_dev->header.image_size);
+    return sprintf(buf, "%llu\n", (unsigned long long) size * SECTOR_SIZE);
 }

 static ssize_t rbd_major_show(struct device *dev,
-- 
1.7.5.4



[PATCH 4/6] rbd: set image size when header is updated

2012-07-19 Thread Alex Elder
The image may have been resized.

Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
Reviewed-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 9c3a1db..a6bbda2 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1746,6 +1746,7 @@ static int __rbd_refresh_header(struct rbd_device
*rbd_dev)
     kfree(rbd_dev->header.snap_names);
     kfree(rbd_dev->header.snapc);
 
+    rbd_dev->header.image_size = h.image_size;
     rbd_dev->header.total_snaps = h.total_snaps;
     rbd_dev->header.snapc = h.snapc;
     rbd_dev->header.snap_names = h.snap_names;
-- 
1.7.5.4



[PATCH 5/6] rbd: use reference counting for the snap context

2012-07-19 Thread Alex Elder
This prevents a race between requests with a given snap context and
header updates that free it. The osd client was already expecting the
snap context to be reference counted, since it get()s it in
ceph_osdc_build_request and put()s it when the request completes.

Also remove the second down_read()/up_read() on header_rwsem in
rbd_do_request, which wasn't actually preventing this race or
protecting any other data.

Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
Reviewed-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   36 ++--
 1 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index a6bbda2..988f944 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -626,7 +626,7 @@ static void rbd_header_free(struct rbd_image_header
*header)
     kfree(header->object_prefix);
     kfree(header->snap_sizes);
     kfree(header->snap_names);
-    kfree(header->snapc);
+    ceph_put_snap_context(header->snapc);
 }
 
 /*
@@ -902,13 +902,10 @@ static int rbd_do_request(struct request *rq,
     dout("rbd_do_request object_name=%s ofs=%lld len=%lld\n",
         object_name, len, ofs);
 
-    down_read(&rbd_dev->header_rwsem);
-
     osdc = rbd_dev->rbd_client->client->osdc;
     req = ceph_osdc_alloc_request(osdc, flags, snapc, ops,
         false, GFP_NOIO, pages, bio);
     if (!req) {
-        up_read(&rbd_dev->header_rwsem);
         ret = -ENOMEM;
         goto done_pages;
     }
@@ -942,7 +939,6 @@ static int rbd_do_request(struct request *rq,
                 snapc,
                 &mtime,
                 req->r_oid, req->r_oid_len);
-    up_read(&rbd_dev->header_rwsem);
 
     if (linger_req) {
         ceph_osdc_set_request_linger(osdc, req);
@@ -1448,6 +1444,7 @@ static void rbd_rq_fn(struct request_queue *q)
         u64 ofs;
         int num_segs, cur_seg = 0;
         struct rbd_req_coll *coll;
+        struct ceph_snap_context *snapc;
 
         /* peek at request from block layer */
         if (!rq)
@@ -1474,21 +1471,20 @@ static void rbd_rq_fn(struct request_queue *q)
 
         spin_unlock_irq(q->queue_lock);
 
-        if (rbd_dev->snap_id != CEPH_NOSNAP) {
-            bool snap_exists;
+        down_read(&rbd_dev->header_rwsem);
 
-            down_read(&rbd_dev->header_rwsem);
-            snap_exists = rbd_dev->snap_exists;
+        if (rbd_dev->snap_id != CEPH_NOSNAP && !rbd_dev->snap_exists) {
             up_read(&rbd_dev->header_rwsem);
-
-            if (!snap_exists) {
-                dout("request for non-existent snapshot");
-                spin_lock_irq(q->queue_lock);
-                __blk_end_request_all(rq, -ENXIO);
-                continue;
-            }
+            dout("request for non-existent snapshot");
+            spin_lock_irq(q->queue_lock);
+            __blk_end_request_all(rq, -ENXIO);
+            continue;
         }
 
+        snapc = ceph_get_snap_context(rbd_dev->header.snapc);
+
+        up_read(&rbd_dev->header_rwsem);
+
         dout("%s 0x%x bytes at 0x%llx\n",
              do_write ? "write" : "read",
              size, blk_rq_pos(rq) * SECTOR_SIZE);
@@ -1498,6 +1494,7 @@ static void rbd_rq_fn(struct request_queue *q)
         if (!coll) {
             spin_lock_irq(q->queue_lock);
             __blk_end_request_all(rq, -ENOMEM);
+            ceph_put_snap_context(snapc);
             continue;
         }
 
@@ -1521,7 +1518,7 @@ static void rbd_rq_fn(struct request_queue *q)
             /* init OSD command: write or read */
             if (do_write)
                 rbd_req_write(rq, rbd_dev,
-                      rbd_dev->header.snapc,
+                      snapc,
                       ofs,
                       op_size, bio,
                       coll, cur_seg);
@@ -1544,6 +1541,8 @@ next_seg:
         if (bp)
             bio_pair_release(bp);
         spin_lock_irq(q->queue_lock);
+
+        ceph_put_snap_context(snapc);
     }
 }
 
@@ -1744,7 +1743,8 @@ static int __rbd_refresh_header(struct rbd_device
*rbd_dev)
     /* rbd_dev->header.object_prefix shouldn't change */
     kfree(rbd_dev->header.snap_sizes);
     kfree(rbd_dev->header.snap_names);
-    kfree(rbd_dev->header.snapc);
+    /* osd requests may still refer to snapc */
+    ceph_put_snap_context(rbd_dev->header.snapc);
 
     rbd_dev->header.image_size = h.image_size;

[PATCH 6/6] rbd: send header version when notifying

2012-07-19 Thread Alex Elder
Previously the original header version was sent. Now, we update it
when the header changes.

Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
Reviewed-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |7 +--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 988f944..4d3a1e0 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1197,7 +1197,7 @@ static int rbd_req_sync_notify_ack(struct
rbd_device *rbd_dev,
     if (ret < 0)
         return ret;
 
-    ops[0].watch.ver = cpu_to_le64(rbd_dev->header.obj_version);
+    ops[0].watch.ver = cpu_to_le64(ver);
     ops[0].watch.cookie = notify_id;
     ops[0].watch.flag = 0;
 
@@ -1216,6 +1216,7 @@ static int rbd_req_sync_notify_ack(struct
rbd_device *rbd_dev,
 static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data)
 {
     struct rbd_device *rbd_dev = (struct rbd_device *)data;
+    u64 hver;
     int rc;
 
     if (!rbd_dev)
@@ -1225,12 +1226,13 @@ static void rbd_watch_cb(u64 ver, u64 notify_id,
u8 opcode, void *data)
         rbd_dev->header_name, notify_id, (int) opcode);
     mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
     rc = __rbd_refresh_header(rbd_dev);
+    hver = rbd_dev->header.obj_version;
     mutex_unlock(&ctl_mutex);
     if (rc)
         pr_warning(RBD_DRV_NAME "%d got notification but failed to "
                "update snaps: %d\n", rbd_dev->major, rc);
 
-    rbd_req_sync_notify_ack(rbd_dev, ver, notify_id, rbd_dev->header_name);
+    rbd_req_sync_notify_ack(rbd_dev, hver, notify_id, rbd_dev->header_name);
 }
 
 /*
@@ -1746,6 +1748,7 @@ static int __rbd_refresh_header(struct rbd_device
*rbd_dev)
     /* osd requests may still refer to snapc */
     ceph_put_snap_context(rbd_dev->header.snapc);
 
+    rbd_dev->header.obj_version = h.obj_version;
     rbd_dev->header.image_size = h.image_size;
     rbd_dev->header.total_snaps = h.total_snaps;
     rbd_dev->header.snapc = h.snapc;
-- 
1.7.5.4



Re: [PATCH] rbd: fix the memory leak of bio_chain_clone

2012-07-19 Thread Guangliang Zhao
On Tue, Jul 17, 2012 at 01:18:50PM -0700, Yehuda Sadeh wrote:
 On Wed, Jul 11, 2012 at 5:34 AM, Guangliang Zhao gz...@suse.com wrote:
 
  The bio_pair allocated in bio_chain_clone would not be freed;
  this causes a memory leak. It can actually be freed only after
  three releases, because the reference count of a bio_pair is
  initialized to 3 by bio_split, and bio_pair_release only drops
  the reference count.
 
  The function bio_pair_release must be called three times to
  release a bio_pair, and the callback functions of the bios on
  the requests are invoked on the last release in bio_pair_release;
  however, these functions will also be called in rbd_req_cb. In
  other words, they will be called twice, which may have serious
  consequences.
 
  This patch clones the bio chain from the original directly and
  does not use bio_split (so no bio_pair). The new bio chain can be
  released whenever we don't need it.
 
  Signed-off-by: Guangliang Zhao gz...@suse.com
  ---
   drivers/block/rbd.c |   66
  ---
   1 files changed, 26 insertions(+), 40 deletions(-)
 
  diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
  index 013c7a5..6a12040 100644
  --- a/drivers/block/rbd.c
  +++ b/drivers/block/rbd.c
  @@ -712,51 +712,43 @@ static void zero_bio_chain(struct bio *chain, int
  start_ofs)
  }
   }
 
  -/*
  - * bio_chain_clone - clone a chain of bios up to a certain length.
  - * might return a bio_pair that will need to be released.
  +/**
  + *  bio_chain_clone - clone a chain of bios up to a certain length.
  + *  @old: bio to clone
  + *  @offset: start point for bio clone
  + *  @len: length of bio chain
  + *  @gfp_mask: allocation priority
  + *
  + *  RETURNS:
  + *  Pointer to new bio chain on success, NULL on failure.
*/
  -static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
  -  struct bio_pair **bp,
  +static struct bio *bio_chain_clone(struct bio **old, int *offset,
 int len, gfp_t gfpmask)
   {
  struct bio *tmp, *old_chain = *old, *new_chain = NULL, *tail =
  NULL;
  int total = 0;
 
  -   if (*bp) {
  -   bio_pair_release(*bp);
  -   *bp = NULL;
  -   }
  -
   while (old_chain && (total < len)) {
  +        int need = len - total;
  +
           tmp = bio_kmalloc(gfpmask, old_chain->bi_max_vecs);
           if (!tmp)
               goto err_out;
 
  -        if (total + old_chain->bi_size > len) {
  -            struct bio_pair *bp;
  -
  +        __bio_clone(tmp, old_chain);
  +        if (total + (tmp->bi_size - *offset) > len) {
 
 can change this to:
   if (tmp->bi_size - *offset > need)
 
 I think it'll be slightly more readable

Yes, I will modify it in the next version.

 
  +            tmp->bi_sector += *offset >> SECTOR_SHIFT;
  +            tmp->bi_io_vec->bv_offset += *offset >>
  SECTOR_SHIFT;
 
 Shouldn't these two lines be outside this if?

Excellent, the else branch needs it too.

 
 
 
           /*
  -         * this split can only happen with a single paged bio,
  -         * split_bio will BUG_ON if this is not the case
  +         * This can only happen with a single paged bio,
  +         * rbd_merge_bvec would guarantee it.
            */
  -            dout("bio_chain_clone split! total=%d remaining=%d"
  -                 "bi_size=%d\n",
  -                 (int)total, (int)len-total,
  -                 (int)old_chain->bi_size);
  -
  -            /* split the bio. We'll release it either in the next
  -               call, or it will have to be released outside */
  -            bp = bio_split(old_chain, (len - total) / SECTOR_SIZE);
  -            if (!bp)
  -                goto err_out;
  -
  -            __bio_clone(tmp, bp->bio1);
  -
  -            *next = bp->bio2;
  +            tmp->bi_size = need;
  +            tmp->bi_io_vec->bv_len = need;
  +            *offset += need;
           } else {
  -            __bio_clone(tmp, old_chain);
  -            *next = old_chain->bi_next;
  +            old_chain = old_chain->bi_next;
  +            *offset = 0;

We also need to adjust the length of the tmp bios in the else branch, so the 
following two lines should be added:

tmp->bi_size -= *offset;
tmp->bi_io_vec->bv_offset -= *offset;

  }
 
   tmp->bi_bdev = NULL;
  @@ -769,7 +761,6 @@ static struct bio *bio_chain_clone(struct bio **old,
  struct bio **next,
   tail->bi_next = tmp;
  tail = tmp;

[PATCH] rbd: fix the repeat initialization of semaphore

2012-07-19 Thread Guangliang Zhao
The header_rwsem of rbd_dev is initialized twice in
function rbd_add.

Signed-off-by: Guangliang Zhao gz...@suse.com
---
 drivers/block/rbd.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 013c7a5..50117dd 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2381,8 +2381,6 @@ static ssize_t rbd_add(struct bus_type *bus,
     INIT_LIST_HEAD(&rbd_dev->snaps);
     init_rwsem(&rbd_dev->header_rwsem);
 
-    init_rwsem(&rbd_dev->header_rwsem);
-
/* generate unique id: find highest unique id, add one */
rbd_id_get(rbd_dev);
 
-- 
1.7.3.4



Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Sébastien Han
Hi Cephers!

I'm working with rbd mapping. I figured out that the block device size
of the rbd device is not updated while the device is mounted. Here are my
tests:

1. Pick up a device and check its size

# rbd ls
size

# rbd info test
rbd image 'test':
size 10000 MB in 2500 objects
order 22 (4096 KB objects)
block_name_prefix: rb.0.6
parent:  (pool -1)

2. Map the device

# rbd map --secret /etc/ceph/secret test
# rbd showmapped
id pool image snap device
1 rbd test - /dev/rbd1

3. Put a fs on it and check the block device size

# mkfs.ext4 /dev/rbd1
...
...

# fdisk -l /dev/rbd1

Disk /dev/rbd1: 10.5 GB, 10485760000 bytes

4. Mount it

# mount /dev/rbd1 /mnt
# df -h
/dev/rbd1   9.8G  277M  9.0G   3% /mnt


5. Change the image size

# rbd resize --size 11000 test
Resizing image: 100% complete...done.

# rbd info test
rbd image 'test':
size 11000 MB in 2750 objects
order 22 (4096 KB objects)
block_name_prefix: rb.0.6
parent:  (pool -1)

At this point, if you run fdisk -l /dev/rbd1, the
block device size will remain the same.

6. Unmount the device:

# umount /media

# fdisk -l /dev/rbd1
Disk /dev/rbd1: 11.5 GB, 11534336000 bytes

Unmounting the directory did update the block device size.

Of course you can do something really fast like:

# umount /media  mount /dev/rbd1 /media

That works; it's a valid solution as long as there are no open
files. But I won't use this trick in production...

I also tried to mount -o remount and it didn't work.

7. Resize the fs (this can be performed while the fs is mounted):

# e2fsck -f /dev/rbd1
e2fsck 1.42 (29-Nov-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/rbd1: 11/644640 files (0.0% non-contiguous), 77173/2560000 blocks

# resize2fs /dev/rbd1
resize2fs 1.42 (29-Nov-2011)
Resizing the filesystem on /dev/rbd1 to 2816000 (4k) blocks.
The filesystem on /dev/rbd1 is now 2816000 blocks long.


Did I miss something?
Is this feature coming?

Thank you in advance :)
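
As an aside, a sketch of querying the size the kernel currently believes, without going through fdisk:

# size in bytes as seen by the block layer
blockdev --getsize64 /dev/rbd1
cat /sys/block/rbd1/size   # same thing in 512-byte sectors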


Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Wido den Hollander

Hi,

On 19-07-12 16:55, Sébastien Han wrote:

Hi Cephers!

I'm working with rbd mapping. I figured out that the block device size
of the rbd device is not update while the device is mounted. Here my
tests:



iirc this is not something RBD specific, but since the device is in use 
it can't be re-read by the kernel.


So when you unmount it the kernel can re-read the header and resize the 
device.


Wido


[...snip...]


Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Sébastien Han
OK, I get your point, it seems logical, but why is this possible with LVM, for example?

You can easily do this with LVM without unmounting the device.

Cheers.


On Thu, Jul 19, 2012 at 5:15 PM, Wido den Hollander w...@widodh.nl wrote:
 Hi,


 On 19-07-12 16:55, Sébastien Han wrote:

 Hi Cephers!

 I'm working with rbd mapping. I figured out that the block device size
 of the rbd device is not update while the device is mounted. Here my
 tests:


 iirc this is not something RBD specific, but since the device is in use it
 can't be re-read by the kernel.

 So when you unmount it the kernel can re-read the header and resize the
 device.

 Wido

[...snip...]


Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Tommi Virtanen
On Thu, Jul 19, 2012 at 8:26 AM, Sébastien Han han.sebast...@gmail.com wrote:
 Ok I got your point seems logic, but why is this possible with LVM for 
 example?

 You can easily do this with LVM without un-mounting the device.

Do your LVM volumes have partition tables inside them? That might be
the difference.

Of course, you can put your filesystem straight on the RBD; that would
be a good experiment to run.


Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Tommi Virtanen
On Thu, Jul 19, 2012 at 8:38 AM, Tommi Virtanen t...@inktank.com wrote:
 Do your LVM volumes have partition tables inside them? That might be
 the difference.

 Of course, you can put your filesystem straight on the RBD; that would
 be a good experiment to run.

Oops, I see you did put your fs straight on the RBD already; seeing
fdisk tripped my brain.


Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Wido den Hollander



On 19-07-12 17:26, Sébastien Han wrote:

Ok I got your point seems logic, but why is this possible with LVM for example?

You can easily do this with LVM without un-mounting the device.




LVM volumes run through the device mapper and are not regular block devices.

If you resize the disk underneath LVM you won't see an increased VG or 
PV size unless you change the availability of the VG to unavailable and 
back to available again.


I'm not 100% sure what the exact root cause is, but the kernel won't 
read the new size of a block device as long as it is in use.


Wido



[...snip...]


Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Sébastien Han
Hum ok, I see. Thanks!

But if you have any clue how to force the kernel to re-read it without
unmounting/mounting :)

On Thu, Jul 19, 2012 at 5:47 PM, Wido den Hollander w...@widodh.nl wrote:


[...snip...]


Re: Poor read performance in KVM

2012-07-19 Thread Tommi Virtanen
On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev
vladi...@bashkirtsev.com wrote:
 It looks like osd.0 performs with low latency but osd.1's latency is way too
 high, so on average it appears as 200ms. The osds are backed by btrfs over LVM2.
 Maybe the issue lies in the choice of backing fs? All four osds run a similar

Our typical experience with btrfs right now seems to be that it works
fast when the filesystem is fresh, but as it ages, it starts to have
higher and higher delays on syncing writes. This does not seem to be
completely deterministic, that is, if you run many btrfs'es, the
symptoms start to appear at different times on different instances.


[PATCH 0/4] rbd: use snapc-seq the way server does

2012-07-19 Thread Alex Elder
This series of patches changes the way the snap context seq field
is used.  Currently it is used in a way that isn't really useful, and
as such is a bit confusing.  This behavior seems to be a hold over
from a time when there was no snap_id field maintained for an rbd_dev.

Summary:
[PATCH 1/4] rbd: don't use snapc->seq that way
Removes special handling in __rbd_refresh_header() that ensured
the seq field was updated to point to the head if it had been
at the start of the function.
[PATCH 2/4] rbd: preserve snapc->seq in rbd_header_set_snap()
Changes rbd_header_set_snap() so it doesn't set the seq field
to the snapshot id (for a snapshot mapping) or the highest
snapshot id (for the base image).
[PATCH 3/4] rbd: set snapc->seq only when refreshing header
	Assigns snapc->seq whenever an updated rbd image header is
received rather than when a new snapshot id has been
assigned.
[PATCH 4/4] rbd: kill rbd_image_header->snap_seq
	Gets rid of the rbd_image_header->snap_seq field, which
	previously kept the same information now maintained in
	the snapc->seq field.

-Alex


[PATCH 1/4] rbd: don't use snapc->seq that way

2012-07-19 Thread Alex Elder
In what appears to be an artifact of a different way of encoding
whether an rbd image maps a snapshot, __rbd_refresh_header() has
code that arranges to update the seq value in an rbd image's
snapshot context to point to the first entry in its snapshot
array if that's where it was pointing initially.

We now use rbd_dev->snap_id to record the snapshot id--using
the special value SNAP_NONE to indicate the rbd_dev is not mapping
a snapshot at all.

There is therefore no need to check for this case, nor to update the
seq value, in __rbd_refresh_header().  Just preserve the seq value
that rbd_read_header() provides (which, at the moment, is nothing).

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   14 --
 1 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 4d3a1e0..8a46599 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1718,8 +1718,6 @@ static int __rbd_refresh_header(struct rbd_device
*rbd_dev)
 {
int ret;
struct rbd_image_header h;
-   u64 snap_seq;
-   int follow_seq = 0;

     ret = rbd_read_header(rbd_dev, &h);
     if (ret < 0)
@@ -1735,13 +1733,6 @@ static int __rbd_refresh_header(struct rbd_device
*rbd_dev)
         set_capacity(rbd_dev->disk, size);
}

-    snap_seq = rbd_dev->header.snapc->seq;
-    if (rbd_dev->header.total_snaps &&
-        rbd_dev->header.snapc->snaps[0] == snap_seq)
-   /* pointing at the head, will need to follow that
-  if head moves */
-   follow_seq = 1;
-
     /* rbd_dev->header.object_prefix shouldn't change */
     kfree(rbd_dev->header.snap_sizes);
     kfree(rbd_dev->header.snap_names);
@@ -1759,11 +1750,6 @@ static int __rbd_refresh_header(struct rbd_device
*rbd_dev)
     WARN_ON(strcmp(rbd_dev->header.object_prefix, h.object_prefix));
kfree(h.object_prefix);

-   if (follow_seq)
-        rbd_dev->header.snapc->seq = rbd_dev->header.snapc->snaps[0];
-    else
-        rbd_dev->header.snapc->seq = snap_seq;
-
ret = __rbd_init_snaps_header(rbd_dev);

     up_write(&rbd_dev->header_rwsem);
-- 
1.7.5.4



[PATCH 3/4] rbd: set snapc->seq only when refreshing header

2012-07-19 Thread Alex Elder
In rbd_header_add_snap() there is code to set snapc->seq to the
just-added snapshot id.  This is the only remnant left of the
use of that field for recording which snapshot an rbd_dev was
associated with.  That functionality is no longer supported,
so get rid of that final bit of code.

Doing so means we never actually set snapc->seq any more.  On the
server, the snapshot context's sequence value represents the highest
snapshot id ever issued for a particular rbd image.  So we'll make
it have that meaning here as well.  To do so, set this value
whenever the rbd header is (re-)read.  That way it will always be
consistent with the rest of the snapshot context we maintain.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   10 ++
 1 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index ac8a83f..c299a55 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -537,6 +537,7 @@ static int rbd_header_from_disk(struct
rbd_image_header *header,

     atomic_set(&header->snapc->nref, 1);
     header->snap_seq = le64_to_cpu(ondisk->snap_seq);
+    header->snapc->seq = le64_to_cpu(ondisk->snap_seq);
     header->snapc->num_snaps = snap_count;
     header->total_snaps = snap_count;

@@ -1685,14 +1686,7 @@ static int rbd_header_add_snap(struct rbd_device
*rbd_dev,

kfree(data);

-    if (ret < 0)
-        return ret;
-
-    down_write(&rbd_dev->header_rwsem);
-    rbd_dev->header.snapc->seq = new_snapid;
-    up_write(&rbd_dev->header_rwsem);
-
-    return 0;
+    return ret < 0 ? ret : 0;
 bad:
return -ERANGE;
 }
-- 
1.7.5.4



[PATCH 4/4] rbd: kill rbd_image_header->snap_seq

2012-07-19 Thread Alex Elder
The snap_seq field in an rbd_image_header structure held the value
from the rbd image header when it was last refreshed.  We now
maintain this value in the snapc->seq field.  So get rid of the
other one.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index c299a55..6df8c62 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -82,7 +82,6 @@ struct rbd_image_header {
__u8 comp_type;
struct ceph_snap_context *snapc;
size_t snap_names_len;
-   u64 snap_seq;
u32 total_snaps;

char *snap_names;
@@ -536,7 +535,6 @@ static int rbd_header_from_disk(struct
rbd_image_header *header,
     header->comp_type = ondisk->options.comp_type;

     atomic_set(&header->snapc->nref, 1);
-    header->snap_seq = le64_to_cpu(ondisk->snap_seq);
     header->snapc->seq = le64_to_cpu(ondisk->snap_seq);
     header->snapc->num_snaps = snap_count;
     header->total_snaps = snap_count;
-- 
1.7.5.4



Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Calvin Morrow
I've had a little more luck using cfdisk than vanilla fdisk when it
comes to detecting changes.  You might try running partprobe and then
cfdisk and seeing if you get anything different.

Calvin
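
A sketch of that sequence (device name as in the thread; blockdev may simply fail with EBUSY while the device is mounted):

# ask the kernel to re-read the device's partition/size information
partprobe /dev/rbd1
blockdev --rereadpt /dev/rbd1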

On Thu, Jul 19, 2012 at 9:50 AM, Sébastien Han han.sebast...@gmail.com wrote:
 Hum ok, I see. Thanks!

 But if you have any clue to force the kernel to re-read without
 unmont/mounting :)

 On Thu, Jul 19, 2012 at 5:47 PM, Wido den Hollander w...@widodh.nl wrote:


 On 19-07-12 17:26, Sébastien Han wrote:

 Ok I got your point seems logic, but why is this possible with LVM for
 example?

 You can easily do this with LVM without un-mounting the device.



 LVM runs through the device mapper and are not regular block devices.

 If you resize the disk underneath LVM you won't see an increased VG or PV
 size unless you change the availability of the VG to unavailable and back to
 available again.

 I'm not a 100% sure what the exact root cause is, but the kernel won't read
 the new size of a block device as long as it is in use.

 Wido



 Cheers.


 On Thu, Jul 19, 2012 at 5:15 PM, Wido den Hollander w...@widodh.nl
 wrote:

 Hi,


 On 19-07-12 16:55, Sébastien Han wrote:


 Hi Cephers!

 I'm working with rbd mapping. I figured out that the block device size
 of the rbd device is not updated while the device is mounted. Here are my
 tests:


 iirc this is not something RBD specific, but since the device is in use
 it can't be re-read by the kernel.

 So when you unmount it the kernel can re-read the header and resize the
 device.

 Wido

 1. Pick a device and check its size

 # rbd ls
 size

 # rbd info test
 rbd image 'test':
 size 10000 MB in 2500 objects
 order 22 (4096 KB objects)
 block_name_prefix: rb.0.6
 parent:  (pool -1)

 2. Map the device

 # rbd map --secret /etc/ceph/secret test
 # rbd showmapped
 id pool image snap device
 1 rbd test - /dev/rbd1

 3. Put a fs on it and check the block device size

 # mkfs.ext4 /dev/rbd1
 ...
 ...

 # fdisk -l /dev/rbd1

 Disk /dev/rbd1: 10.5 GB, 10485760000 bytes

 4. Mount it

 # mount /dev/rbd1 /mnt
 # df -h
 /dev/rbd1   9.8G  277M  9.0G   3% /mnt


 5. Change the image size

 # rbd resize --size 11000 test
 Resizing image: 100% complete...done.

 # rbd info test
 rbd image 'test':
 size 11000 MB in 2750 objects
 order 22 (4096 KB objects)
 block_name_prefix: rb.0.6
 parent:  (pool -1)

 At this point in time, if you run fdisk -l /dev/rbd1, the
 block device size will remain the same.

 6. Unmount the device:

 # umount /mnt

 # fdisk -l /dev/rbd1
 Disk /dev/rbd1: 11.5 GB, 11534336000 bytes

 Unmounting the directory did update the block device size.

 Of course you can do something really fast like:

 # umount /mnt && mount /dev/rbd1 /mnt

 That will work, it's a valid solution as long as there are no open
 files. I won't use this trick in production...

 I also tried to mount -o remount and it didn't work.

 7. Resize the fs (this can be performed while the fs is mounted):

 # e2fsck -f /dev/rbd1
 e2fsck 1.42 (29-Nov-2011)
 Pass 1: Checking inodes, blocks, and sizes
 Pass 2: Checking directory structure
 Pass 3: Checking directory connectivity
 Pass 4: Checking reference counts
 Pass 5: Checking group summary information
 /dev/rbd1: 11/644640 files (0.0% non-contiguous), 77173/2560000 blocks

 # resize2fs /dev/rbd1
 resize2fs 1.42 (29-Nov-2011)
 Resizing the filesystem on /dev/rbd1 to 2816000 (4k) blocks.
 The filesystem on /dev/rbd1 is now 2816000 blocks long.


 Did I miss something?
 Is this feature coming?

 Thank you in advance :)


Re: Poor read performance in KVM

2012-07-19 Thread Calvin Morrow
On Thu, Jul 19, 2012 at 9:52 AM, Tommi Virtanen t...@inktank.com wrote:

 On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev
 vladi...@bashkirtsev.com wrote:
  It looks like osd.0 performs with low latency but osd.1 latency is way too
  high; on average it appears as 200ms. The osd is backed by btrfs over LVM2.
  Maybe the issue lies in the backing fs selection? All four osds running similar

 Our typical experience with btrfs right now seems to be that it works
 fast when the filesystem is fresh, but as it ages, it starts to have
 higher and higher delays on syncing writes. This does not seem to be
 completely deterministic, that is, if you run many btrfs'es, the
 symptoms start to appear at different times on different instances.

Is Inktank seeing the slowdown on btrfs volumes with large metadata
(32k / 64k) node/leaf sizes as well, or only on default (4k) sizes?

Vladimir,

What node/leaf size are you using on your btrfs volume?
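
For reference, node and leaf sizes are chosen at mkfs time, e.g. (with
/dev/sdX as a placeholder):

# mkfs.btrfs -l 32768 -n 32768 /dev/sdX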



Re: Poor read performance in KVM

2012-07-19 Thread Mark Nelson

On 07/19/2012 01:06 PM, Calvin Morrow wrote:

Is Inktank seeing the slowdown on btrfs volumes with large metadata
(32k / 64k) node/leaf sizes as well, or only on default (4k) sizes?

Vladimir,

What node/leaf size are you using on your btrfs volume?


Hi Vladimir,

We are seeing degradation at 64k node/leaf sizes as well.  So far the 
degradation is most obvious with small writes.  It affects XFS as well, 
though not as severely.  We are vigorously looking into it. :)
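
A rough way to exercise that small synchronous write pattern directly on an
osd's backing filesystem, for anyone who wants to compare (the path is a
placeholder):

# dd if=/dev/zero of=/srv/osd.1/ddbench bs=4k count=10000 oflag=direct conv=fsync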


Mark




--
Mark Nelson
Performance Engineer
Inktank


Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Sébastien Han
With LVM, you can re-scan the scsi bus to extend a physical drive and
then run pvresize.
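
For instance, something along these lines (the SCSI address, device and VG/LV
names below are placeholders):

# echo 1 > /sys/class/scsi_device/2:0:0:0/device/rescan   # re-read the disk size
# pvresize /dev/sdb                                       # grow the PV to match
# lvextend -L +10G /dev/vg0/lv0                           # then grow the LV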

@Calvin: I tried your solution

# partprobe /dev/rbd1

Unfortunately nothing changed.

Did you get it working?

Cheers!




Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Calvin Morrow
I haven't tried resizing an rbd yet, but I was changing partitions on
a non-ceph two-node cluster with shared storage yesterday while
certain partitions were in use (partitions 1,2,5 were mounted,
deleting partition ids 6+, adding new ones) and fdisk wasn't
re-reading disk changes.  Partprobe followed by cfdisk seemed to
update the kernel with the on-disk changes.

I realize it's not exactly what you are doing, but I figured it might
be close enough to be worth a shot.  cfdisk seems to make a system
call to refresh disk information that vanilla fdisk doesn't.

Calvin



Re: Ceph doesn't update the block device size while a rbd image is mounted

2012-07-19 Thread Andreas Kurz
On 07/19/2012 09:44 PM, Sébastien Han wrote:
 With LVM, you can re-scan the scsi bus to extend a physical drive and
 then run pvresize.
 
 @Calvin: I tried your solution
 
 # partprobe /dev/rbd1

Did you try blockdev?

# blockdev --rereadpt /dev/rbd1

Regards,
Andreas
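
Note that --rereadpt re-reads the partition table rather than the device
capacity itself; to compare the kernel's idea of the size with rbd's, one
could check (same placeholder device as above):

# blockdev --getsize64 /dev/rbd1
# rbd info test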



[PATCH 00/12] rbd: cleanup series

2012-07-19 Thread Alex Elder
This series includes a bunch of relatively small cleanups.
They're grouped a bit below, but they apply together in
this sequence and the later ones may have dependencies on
those earlier in the series.

Summaries:
[PATCH 01/12] rbd: drop extra header_rwsem init
[PATCH 02/12] rbd: simplify __rbd_remove_all_snaps()
[PATCH 03/12] rbd: clean up a few dout() calls
[PATCH 04/12] ceph: define snap counts as u32 everywhere
These four are very simple and straightforward cleanups.

[PATCH 05/12] rbd: snapc is unused in rbd_req_sync_read()
[PATCH 06/12] rbd: drop rbd_header_from_disk() gfp_flags parameter
[PATCH 07/12] rbd: drop rbd_dev parameter in snap functions
[PATCH 08/12] rbd: drop rbd_req_sync_exec() ver parameter
These four each drop an unused argument from a function.

[PATCH 09/12] rbd: have __rbd_add_snap_dev() return a pointer
[PATCH 10/12] rbd: make rbd_create_rw_ops() return a pointer
These two each change a function so it returns a pointer
rather than filling in the address of a provided pointer.

[PATCH 11/12] rbd: always pass ops array to rbd_req_sync_op()
[PATCH 12/12] rbd: fixes in rbd_header_from_disk()
These comprise slightly more involved refactoring of the
code, but as with the rest of the patches in this series,
there should be no functional difference as a result.

-Alex


Re: [PATCH 1/4] rbd: don't use snapc->seq that way

2012-07-19 Thread Josh Durgin

On 07/19/2012 10:11 AM, Alex Elder wrote:

We now use rbd_dev->snap_id to record the snapshot id--using
the special value SNAP_NONE to indicate the rbd_dev is not mapping
a snapshot at all.


That's CEPH_NOSNAP, not SNAP_NONE, right? In any case,

Reviewed-by: Josh Durgin josh.dur...@inktank.com


[PATCH 01/12] rbd: drop extra header_rwsem init

2012-07-19 Thread Alex Elder
In commit c01a an extra initialization of rbd_dev->header_rwsem
was inadvertently added.  This gets rid of the duplicate.

(Guangliang Zhao also offered up the same fix.)

Reported-by: Guangliang Zhao gz...@suse.com
Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6df8c62..b9895fe 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2458,8 +2458,6 @@ static ssize_t rbd_add(struct bus_type *bus,
 	INIT_LIST_HEAD(&rbd_dev->snaps);
 	init_rwsem(&rbd_dev->header_rwsem);

-	init_rwsem(&rbd_dev->header_rwsem);
-
 	/* generate unique id: find highest unique id, add one */
 	rbd_id_get(rbd_dev);

-- 
1.7.5.4



[PATCH 02/12] rbd: simplify __rbd_remove_all_snaps()

2012-07-19 Thread Alex Elder
This just replaces a while loop with list_for_each_entry_safe()
in __rbd_remove_all_snaps().

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b9895fe..74e6a33 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1692,11 +1692,10 @@ bad:
 static void __rbd_remove_all_snaps(struct rbd_device *rbd_dev)
 {
 	struct rbd_snap *snap;
+	struct rbd_snap *next;

-	while (!list_empty(&rbd_dev->snaps)) {
-		snap = list_first_entry(&rbd_dev->snaps, struct rbd_snap, node);
+	list_for_each_entry_safe(snap, next, &rbd_dev->snaps, node)
 		__rbd_remove_snap_dev(rbd_dev, snap);
-	}
 }

 /*
-- 
1.7.5.4



[PATCH 03/12] rbd: clean up a few dout() calls

2012-07-19 Thread Alex Elder
There was a dout() call in rbd_do_request() that was reporting
the offset as the length and vice versa.  While
fixing that I did a quick scan of other dout() calls and fixed
a couple of other minor things.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 74e6a33..93b2447 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -895,7 +895,7 @@ static int rbd_do_request(struct request *rq,
 	}

 	dout("rbd_do_request object_name=%s ofs=%lld len=%lld\n",
-		object_name, len, ofs);
+		object_name, ofs, len);

 	osdc = &rbd_dev->rbd_client->client->osdc;
req = ceph_osdc_alloc_request(osdc, flags, snapc, ops,
@@ -1315,8 +1315,7 @@ static void rbd_notify_cb(u64 ver, u64 notify_id, u8 opcode, void *data)
 		return;

 	dout("rbd_notify_cb %s notify_id=%lld opcode=%d\n",
-		rbd_dev->header_name,
-		notify_id, (int)opcode);
+		rbd_dev->header_name, notify_id, (int) opcode);
 }

 /*
@@ -1664,7 +1663,7 @@ static int rbd_header_add_snap(struct rbd_device *rbd_dev,

 	monc = &rbd_dev->rbd_client->client->monc;
 	ret = ceph_monc_create_snapid(monc, rbd_dev->pool_id, &new_snapid);
-	dout("created snapid=%lld\n", new_snapid);
+	dout("created snapid=%llu\n", new_snapid);
 	if (ret < 0)
 		return ret;

-- 
1.7.5.4



[PATCH 04/12] ceph: define snap counts as u32 everywhere

2012-07-19 Thread Alex Elder
There are two structures in which a count of snapshots are
maintained:

struct ceph_snap_context {
...
u32 num_snaps;
...
}
and
struct ceph_snap_realm {
...
u32 num_prior_parent_snaps;   /*  had prior to parent_since */
...
u32 num_snaps;
...
}

These fields never take on negative values (e.g., to hold special
meaning), and so are really inherently unsigned.  Furthermore they
take their value from over-the-wire or on-disk formatted 32-bit
values.

So change their definition to have type u32, and change some spots
elsewhere in the code to account for this change.

Signed-off-by: Alex Elder el...@inktank.com
---
 fs/ceph/snap.c               |   18 ++++++++++--------
 fs/ceph/super.h              |    4 ++--
 include/linux/ceph/libceph.h |    2 +-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index e5206fc..cbb2f54 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -296,8 +296,7 @@ static int build_snap_context(struct ceph_snap_realm *realm)
 	struct ceph_snap_realm *parent = realm->parent;
 	struct ceph_snap_context *snapc;
 	int err = 0;
-	int i;
-	int num = realm->num_prior_parent_snaps + realm->num_snaps;
+	u32 num = realm->num_prior_parent_snaps + realm->num_snaps;

/*
 * build parent context, if it hasn't been built.
@@ -321,11 +320,11 @@ static int build_snap_context(struct ceph_snap_realm *realm)
 	    realm->cached_context->seq == realm->seq &&
 	    (!parent ||
 	     realm->cached_context->seq >= parent->cached_context->seq)) {
-		dout("build_snap_context %llx %p: %p seq %lld (%d snaps)"
+		dout("build_snap_context %llx %p: %p seq %lld (%u snaps)"
 		     " (unchanged)\n",
 		     realm->ino, realm, realm->cached_context,
 		     realm->cached_context->seq,
-		     realm->cached_context->num_snaps);
+		     (unsigned int) realm->cached_context->num_snaps);
 		return 0;
 	}

@@ -342,6 +341,8 @@ static int build_snap_context(struct ceph_snap_realm *realm)
 	num = 0;
 	snapc->seq = realm->seq;
 	if (parent) {
+		u32 i;
+
 		/* include any of parent's snaps occurring _after_ my
 		   parent became my parent */
 		for (i = 0; i < parent->cached_context->num_snaps; i++)
@@ -361,8 +362,9 @@ static int build_snap_context(struct ceph_snap_realm *realm)

 	sort(snapc->snaps, num, sizeof(u64), cmpu64_rev, NULL);
 	snapc->num_snaps = num;
-	dout("build_snap_context %llx %p: %p seq %lld (%d snaps)\n",
-	     realm->ino, realm, snapc, snapc->seq, snapc->num_snaps);
+	dout("build_snap_context %llx %p: %p seq %lld (%u snaps)\n",
+	     realm->ino, realm, snapc, snapc->seq,
+	     (unsigned int) snapc->num_snaps);

 	if (realm->cached_context)
 		ceph_put_snap_context(realm->cached_context);
@@ -402,9 +404,9 @@ static void rebuild_snap_realms(struct ceph_snap_realm *realm)
  * helper to allocate and decode an array of snapids.  free prior
  * instance, if any.
  */
-static int dup_array(u64 **dst, __le64 *src, int num)
+static int dup_array(u64 **dst, __le64 *src, u32 num)
 {
-   int i;
+   u32 i;

kfree(*dst);
if (num) {
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index fc35036..3ea48b7 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -612,9 +612,9 @@ struct ceph_snap_realm {
u64 parent_since;   /* snapid when our current parent became so */

u64 *prior_parent_snaps;  /* snaps inherited from any parents we */
-   int num_prior_parent_snaps;   /*  had prior to parent_since */
+   u32 num_prior_parent_snaps;   /*  had prior to parent_since */
u64 *snaps;   /* snaps specific to this realm */
-   int num_snaps;
+   u32 num_snaps;

struct ceph_snap_realm *parent;
struct list_head children;   /* list of child realms */
diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
index 98ec36a..0b72295 100644
--- a/include/linux/ceph/libceph.h
+++ b/include/linux/ceph/libceph.h
@@ -160,7 +160,7 @@ struct ceph_client {
 struct ceph_snap_context {
atomic_t nref;
u64 seq;
-   int num_snaps;
+   u32 num_snaps;
u64 snaps[];
 };

-- 
1.7.5.4



[PATCH 05/12] rbd: snapc is unused in rbd_req_sync_read()

2012-07-19 Thread Alex Elder
The snapc parameter to rbd_req_sync_read() is not used, so
get rid of it.

Reported-by: Josh Durgin josh.dur...@inktank.com
Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 93b2447..06e022d 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1162,7 +1162,6 @@ static int rbd_req_read(struct request *rq,
  * Request sync osd read
  */
 static int rbd_req_sync_read(struct rbd_device *rbd_dev,
- struct ceph_snap_context *snapc,
  u64 snapid,
  const char *object_name,
  u64 ofs, u64 len,
@@ -1609,7 +1608,7 @@ static int rbd_read_header(struct rbd_device *rbd_dev,
 		return -ENOMEM;

 	rc = rbd_req_sync_read(rbd_dev,
-			       NULL, CEPH_NOSNAP,
+			       CEPH_NOSNAP,
 			       rbd_dev->header_name,
 			       0, len,
 			       (char *)dh, &ver);
-- 
1.7.5.4



[PATCH 06/12] rbd: drop rbd_header_from_disk() gfp_flags parameter

2012-07-19 Thread Alex Elder
The function rbd_header_from_disk() is only called in one spot, and
it passes GFP_KERNEL as its value for the gfp_flags parameter.

Just drop that parameter and substitute GFP_KERNEL everywhere within
that function it had been used.  (If we find we need the parameter
again in the future it's easy enough to add back again.)

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   13 ++++++-------
 1 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 06e022d..3b4b4d2 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -487,8 +487,7 @@ static void rbd_coll_release(struct kref *kref)
  */
 static int rbd_header_from_disk(struct rbd_image_header *header,
 struct rbd_image_header_ondisk *ondisk,
-u32 allocated_snaps,
-gfp_t gfp_flags)
+u32 allocated_snaps)
 {
u32 i, snap_count;

@@ -501,18 +500,18 @@ static int rbd_header_from_disk(struct rbd_image_header *header,
 		return -EINVAL;
 	header->snapc = kmalloc(sizeof(struct ceph_snap_context) +
 				snap_count * sizeof(u64),
-				gfp_flags);
+				GFP_KERNEL);
 	if (!header->snapc)
 		return -ENOMEM;

 	header->snap_names_len = le64_to_cpu(ondisk->snap_names_len);
 	if (snap_count) {
 		header->snap_names = kmalloc(header->snap_names_len,
-					     gfp_flags);
+					     GFP_KERNEL);
 		if (!header->snap_names)
 			goto err_snapc;
 		header->snap_sizes = kmalloc(snap_count * sizeof(u64),
-					     gfp_flags);
+					     GFP_KERNEL);
 		if (!header->snap_sizes)
 			goto err_names;
 	} else {
@@ -521,7 +520,7 @@ static int rbd_header_from_disk(struct rbd_image_header *header,
 	}

 	header->object_prefix = kmalloc(sizeof (ondisk->block_name) + 1,
-					gfp_flags);
+					GFP_KERNEL);
 	if (!header->object_prefix)
 		goto err_sizes;

@@ -1615,7 +1614,7 @@ static int rbd_read_header(struct rbd_device *rbd_dev,
 	if (rc < 0)
 		goto out_dh;

-	rc = rbd_header_from_disk(header, dh, snap_count, GFP_KERNEL);
+	rc = rbd_header_from_disk(header, dh, snap_count);
 	if (rc < 0) {
 		if (rc == -ENXIO)
 			pr_warning("unrecognized header format
-- 
1.7.5.4



[PATCH 07/12] rbd: drop rbd_dev parameter in snap functions

2012-07-19 Thread Alex Elder
Both rbd_register_snap_dev() and __rbd_remove_snap_dev() have
rbd_dev parameters that are unused.  Remove them.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   19 +++++++++------------
 1 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 3b4b4d2..955d75d 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -202,8 +202,7 @@ static ssize_t rbd_snap_add(struct device *dev,
struct device_attribute *attr,
const char *buf,
size_t count);
-static void __rbd_remove_snap_dev(struct rbd_device *rbd_dev,
- struct rbd_snap *snap);
+static void __rbd_remove_snap_dev(struct rbd_snap *snap);

 static ssize_t rbd_add(struct bus_type *bus, const char *buf,
   size_t count);
@@ -1692,7 +1691,7 @@ static void __rbd_remove_all_snaps(struct rbd_device *rbd_dev)
 	struct rbd_snap *next;

 	list_for_each_entry_safe(snap, next, &rbd_dev->snaps, node)
-		__rbd_remove_snap_dev(rbd_dev, snap);
+		__rbd_remove_snap_dev(snap);
 }

 /*
@@ -2000,15 +1999,13 @@ static struct device_type rbd_snap_device_type = {
.release= rbd_snap_dev_release,
 };

-static void __rbd_remove_snap_dev(struct rbd_device *rbd_dev,
- struct rbd_snap *snap)
+static void __rbd_remove_snap_dev(struct rbd_snap *snap)
 {
 	list_del(&snap->node);
 	device_unregister(&snap->dev);
 }

-static int rbd_register_snap_dev(struct rbd_device *rbd_dev,
- struct rbd_snap *snap,
+static int rbd_register_snap_dev(struct rbd_snap *snap,
  struct device *parent)
 {
 	struct device *dev = &snap->dev;
@@ -2035,8 +2032,7 @@ static int __rbd_add_snap_dev(struct rbd_device *rbd_dev,
 	snap->size = rbd_dev->header.snap_sizes[i];
 	snap->id = rbd_dev->header.snapc->snaps[i];
 	if (device_is_registered(&rbd_dev->dev)) {
-		ret = rbd_register_snap_dev(rbd_dev, snap,
-					    &rbd_dev->dev);
+		ret = rbd_register_snap_dev(snap, &rbd_dev->dev);
 		if (ret < 0)
 			goto err;
 	}
@@ -2101,7 +2097,7 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev)
 			 */
 			if (rbd_dev->snap_id == old_snap->id)
 				rbd_dev->snap_exists = false;
-			__rbd_remove_snap_dev(rbd_dev, old_snap);
+			__rbd_remove_snap_dev(old_snap);
 			continue;
 		}
 		if (old_snap->id == cur_id) {
@@ -2165,8 +2161,7 @@ static int rbd_bus_add_dev(struct rbd_device *rbd_dev)
goto out;

 	list_for_each_entry(snap, &rbd_dev->snaps, node) {
-		ret = rbd_register_snap_dev(rbd_dev, snap,
-					    &rbd_dev->dev);
+		ret = rbd_register_snap_dev(snap, &rbd_dev->dev);
 		if (ret < 0)
break;
}
-- 
1.7.5.4



[PATCH 08/12] rbd: drop rbd_req_sync_exec() ver parameter

2012-07-19 Thread Alex Elder
The only place that passes a version pointer to rbd_req_sync_exec()
is in rbd_header_add_snap(), and that spot ignores the result.

The only thing rbd_req_sync_exec() does with its ver parameter is
pass it directly to rbd_req_sync_op().  So we can just use a null
pointer there, and drop the ver parameter to rbd_req_sync_exec().

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 955d75d..d4a8d9e 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1374,8 +1374,7 @@ static int rbd_req_sync_exec(struct rbd_device *rbd_dev,
 const char *class_name,
 const char *method_name,
 const char *data,
-int len,
-u64 *ver)
+int len)
 {
struct ceph_osd_req_op *ops;
int class_name_len = strlen(class_name);
@@ -1398,7 +1397,7 @@ static int rbd_req_sync_exec(struct rbd_device *rbd_dev,
 			   0,
 			   CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK,
 			   ops,
-			   object_name, 0, 0, NULL, NULL, ver);
+			   object_name, 0, 0, NULL, NULL, NULL);

rbd_destroy_ops(ops);

@@ -1651,7 +1650,6 @@ static int rbd_header_add_snap(struct rbd_device *rbd_dev,
u64 new_snapid;
int ret;
void *data, *p, *e;
-   u64 ver;
struct ceph_mon_client *monc;

/* we should create a snapshot only if we're pointing at the head */
@@ -1676,7 +1674,7 @@ static int rbd_header_add_snap(struct rbd_device *rbd_dev,

 	ret = rbd_req_sync_exec(rbd_dev, rbd_dev->header_name,
 				"rbd", "snap_add",
-				data, p - data, &ver);
+				data, p - data);

kfree(data);

-- 
1.7.5.4



[PATCH 09/12] rbd: have __rbd_add_snap_dev() return a pointer

2012-07-19 Thread Alex Elder
It's not obvious whether the snapshot pointer whose address is
provided to __rbd_add_snap_dev() will be assigned by that function.
Change it to return the snapshot, or a pointer-coded errno in the
event of a failure.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   37 ++++++++++++++++++++++---------------
 1 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index d4a8d9e..79b0762 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2018,15 +2018,21 @@ static int rbd_register_snap_dev(struct rbd_snap *snap,
return ret;
 }

-static int __rbd_add_snap_dev(struct rbd_device *rbd_dev,
- int i, const char *name,
- struct rbd_snap **snapp)
+static struct rbd_snap *__rbd_add_snap_dev(struct rbd_device *rbd_dev,
+ int i, const char *name)
 {
+   struct rbd_snap *snap;
int ret;
-   struct rbd_snap *snap = kzalloc(sizeof(*snap), GFP_KERNEL);
+
+   snap = kzalloc(sizeof (*snap), GFP_KERNEL);
if (!snap)
-   return -ENOMEM;
+   return ERR_PTR(-ENOMEM);
+
+   ret = -ENOMEM;
 	snap->name = kstrdup(name, GFP_KERNEL);
+	if (!snap->name)
+		goto err;
+
 	snap->size = rbd_dev->header.snap_sizes[i];
 	snap->id = rbd_dev->header.snapc->snaps[i];
 	if (device_is_registered(&rbd_dev->dev)) {
@@ -2034,12 +2040,14 @@ static int __rbd_add_snap_dev(struct rbd_device *rbd_dev,
 		if (ret < 0)
goto err;
}
-   *snapp = snap;
-   return 0;
+
+   return snap;
+
 err:
 	kfree(snap->name);
kfree(snap);
-   return ret;
+
+   return ERR_PTR(ret);
 }

 /*
@@ -2072,7 +2080,6 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev)
const char *name, *first_name;
 	int i = rbd_dev->header.total_snaps;
struct rbd_snap *snap, *old_snap = NULL;
-   int ret;
struct list_head *p, *n;

 	first_name = rbd_dev->header.snap_names;
@@ -2115,9 +2122,9 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev)
 			if (cur_id >= old_snap->id)
 				break;
 			/* a new snapshot */
-			ret = __rbd_add_snap_dev(rbd_dev, i - 1, name, &snap);
-			if (ret < 0)
-				return ret;
+			snap = __rbd_add_snap_dev(rbd_dev, i - 1, name);
+			if (IS_ERR(snap))
+				return PTR_ERR(snap);

 			/* note that we add it backward so using n and not p */
 			list_add(&snap->node, n);
@@ -2131,9 +2138,9 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev)
 			WARN_ON(1);
 			return -EINVAL;
 		}
-		ret = __rbd_add_snap_dev(rbd_dev, i - 1, name, &snap);
-		if (ret < 0)
-			return ret;
+		snap = __rbd_add_snap_dev(rbd_dev, i - 1, name);
+		if (IS_ERR(snap))
+			return PTR_ERR(snap);
 		list_add(&snap->node, &rbd_dev->snaps);
 	}

-- 
1.7.5.4



[PATCH 10/12] rbd: make rbd_create_rw_ops() return a pointer

2012-07-19 Thread Alex Elder
Either rbd_create_rw_ops() will succeed, or it will fail because a
memory allocation failed.  Have it just return a valid pointer or
null rather than stuffing a pointer into a provided address and
returning an errno.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   68 ++++++++++++++++++++++++++++++++++++++------------------------------
 1 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 79b0762..0535612 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -783,22 +783,24 @@ err_out:
 /*
  * helpers for osd request op vectors.
  */
-static int rbd_create_rw_ops(struct ceph_osd_req_op **ops,
-   int num_ops,
-   int opcode,
-   u32 payload_len)
-{
-   *ops = kzalloc(sizeof(struct ceph_osd_req_op) * (num_ops + 1),
-  GFP_NOIO);
-   if (!*ops)
-   return -ENOMEM;
-   (*ops)[0].op = opcode;
+static struct ceph_osd_req_op *rbd_create_rw_ops(int num_ops,
+   int opcode, u32 payload_len)
+{
+   struct ceph_osd_req_op *ops;
+
+   ops = kzalloc(sizeof (*ops) * (num_ops + 1), GFP_NOIO);
+   if (!ops)
+   return NULL;
+
+   ops[0].op = opcode;
+
/*
 * op extent offset and length will be set later on
 * in calc_raw_layout()
 */
-   (*ops)[0].payload_len = payload_len;
-   return 0;
+   ops[0].payload_len = payload_len;
+
+   return ops;
 }

 static void rbd_destroy_ops(struct ceph_osd_req_op *ops)
@@ -1033,8 +1035,9 @@ static int rbd_req_sync_op(struct rbd_device *rbd_dev,

 	if (!orig_ops) {
 		payload_len = (flags & CEPH_OSD_FLAG_WRITE ? len : 0);
-		ret = rbd_create_rw_ops(&ops, 1, opcode, payload_len);
-		if (ret < 0)
+		ret = -ENOMEM;
+		ops = rbd_create_rw_ops(1, opcode, payload_len);
+		if (!ops)
 			goto done;

 		if ((flags & CEPH_OSD_FLAG_WRITE) && buf) {
@@ -1097,8 +1100,9 @@ static int rbd_do_op(struct request *rq,

 	payload_len = (flags & CEPH_OSD_FLAG_WRITE ? seg_len : 0);

-	ret = rbd_create_rw_ops(&ops, 1, opcode, payload_len);
-	if (ret < 0)
+	ret = -ENOMEM;
+	ops = rbd_create_rw_ops(1, opcode, payload_len);
+	if (!ops)
 		goto done;

/* we've taken care of segment sizes earlier when we
@@ -1185,9 +1189,9 @@ static int rbd_req_sync_notify_ack(struct rbd_device *rbd_dev,
 	struct ceph_osd_req_op *ops;
 	int ret;

-	ret = rbd_create_rw_ops(&ops, 1, CEPH_OSD_OP_NOTIFY_ACK, 0);
-	if (ret < 0)
-		return ret;
+	ops = rbd_create_rw_ops(1, CEPH_OSD_OP_NOTIFY_ACK, 0);
+	if (!ops)
+		return -ENOMEM;

ops[0].watch.ver = cpu_to_le64(ver);
ops[0].watch.cookie = notify_id;
@@ -1236,10 +1240,11 @@ static int rbd_req_sync_watch(struct rbd_device *rbd_dev,
 {
 	struct ceph_osd_req_op *ops;
 	struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
+	int ret;

-	int ret = rbd_create_rw_ops(&ops, 1, CEPH_OSD_OP_WATCH, 0);
-	if (ret < 0)
-		return ret;
+	ops = rbd_create_rw_ops(1, CEPH_OSD_OP_WATCH, 0);
+	if (!ops)
+		return -ENOMEM;

 	ret = ceph_osdc_create_event(osdc, rbd_watch_cb, 0,
 				     (void *)rbd_dev, &rbd_dev->watch_event);
@@ -1279,10 +1284,11 @@ static int rbd_req_sync_unwatch(struct rbd_device *rbd_dev,
 				const char *object_name)
 {
 	struct ceph_osd_req_op *ops;
+	int ret;

-	int ret = rbd_create_rw_ops(&ops, 1, CEPH_OSD_OP_WATCH, 0);
-	if (ret < 0)
-		return ret;
+	ops = rbd_create_rw_ops(1, CEPH_OSD_OP_WATCH, 0);
+	if (!ops)
+		return -ENOMEM;

ops[0].watch.ver = 0;
 	ops[0].watch.cookie = cpu_to_le64(rbd_dev->watch_event->cookie);
@@ -1328,9 +1334,9 @@ static int rbd_req_sync_notify(struct rbd_device *rbd_dev,
 	int payload_len = sizeof(u32) + sizeof(u32);
 	int ret;

-	ret = rbd_create_rw_ops(&ops, 1, CEPH_OSD_OP_NOTIFY, payload_len);
-	if (ret < 0)
-		return ret;
+	ops = rbd_create_rw_ops(1, CEPH_OSD_OP_NOTIFY, payload_len);
+	if (!ops)
+		return -ENOMEM;

info.rbd_dev = rbd_dev;

@@ -1379,10 +1385,12 @@ static int rbd_req_sync_exec(struct rbd_device *rbd_dev,
 	struct ceph_osd_req_op *ops;
 	int class_name_len = strlen(class_name);
 	int method_name_len = strlen(method_name);
-	int ret = rbd_create_rw_ops(&ops, 1, CEPH_OSD_OP_CALL,
+	int ret;
+
+	ops = rbd_create_rw_ops(1, CEPH_OSD_OP_CALL,
 				class_name_len + method_name_len + len);
-	if (ret < 0)
-		return ret;
+	if (!ops)
+		return -ENOMEM;

[PATCH 11/12] rbd: always pass ops array to rbd_req_sync_op()

2012-07-19 Thread Alex Elder
All of the callers of rbd_req_sync_op() except one pass a non-null
ops pointer.  The only one that does not is rbd_req_sync_read(),
which passes CEPH_OSD_OP_READ as its opcode and CEPH_OSD_FLAG_READ
for flags.

By allocating the ops array in rbd_req_sync_read() and moving the
special case code for the null ops pointer into it, it becomes
clear that much of that code is not even necessary.

In addition, the opcode argument to rbd_req_sync_op() is never
actually used, so get rid of that.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   46 ++++++++++++++++++------------------------------
 1 files changed, 16 insertions(+), 30 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 0535612..6279186 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1013,9 +1013,8 @@ static void rbd_simple_req_cb(struct ceph_osd_request *req, struct ceph_msg *msg
 static int rbd_req_sync_op(struct rbd_device *rbd_dev,
   struct ceph_snap_context *snapc,
   u64 snapid,
-  int opcode,
   int flags,
-  struct ceph_osd_req_op *orig_ops,
+  struct ceph_osd_req_op *ops,
   const char *object_name,
   u64 ofs, u64 len,
   char *buf,
@@ -1025,28 +1024,14 @@ static int rbd_req_sync_op(struct rbd_device *rbd_dev,
 	int ret;
 	struct page **pages;
 	int num_pages;
-	struct ceph_osd_req_op *ops = orig_ops;
-	u32 payload_len;
+
+	BUG_ON(ops == NULL);

 	num_pages = calc_pages_for(ofs, len);
pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL);
if (IS_ERR(pages))
return PTR_ERR(pages);

-	if (!orig_ops) {
-		payload_len = (flags & CEPH_OSD_FLAG_WRITE ? len : 0);
-		ret = -ENOMEM;
-		ops = rbd_create_rw_ops(1, opcode, payload_len);
-		if (!ops)
-			goto done;
-
-		if ((flags & CEPH_OSD_FLAG_WRITE) && buf) {
-			ret = ceph_copy_to_page_vector(pages, buf, ofs, len);
-			if (ret < 0)
-				goto done_ops;
-		}
-	}
-
ret = rbd_do_request(NULL, rbd_dev, snapc, snapid,
  object_name, ofs, len, NULL,
  pages, num_pages,
@@ -1056,14 +1041,11 @@ static int rbd_req_sync_op(struct rbd_device *rbd_dev,
 			  NULL,
 			  linger_req, ver);
 	if (ret < 0)
-		goto done_ops;
+		goto done;

 	if ((flags & CEPH_OSD_FLAG_READ) && buf)
 		ret = ceph_copy_from_page_vector(pages, buf, ofs, ret);

-done_ops:
-   if (!orig_ops)
-   rbd_destroy_ops(ops);
 done:
ceph_release_page_vector(pages, num_pages);
return ret;
@@ -1170,12 +1152,20 @@ static int rbd_req_sync_read(struct rbd_device *rbd_dev,
  char *buf,
  u64 *ver)
 {
-   return rbd_req_sync_op(rbd_dev, NULL,
+   struct ceph_osd_req_op *ops;
+   int ret;
+
+   ops = rbd_create_rw_ops(1, CEPH_OSD_OP_READ, 0);
+   if (!ops)
+   return -ENOMEM;
+
+   ret = rbd_req_sync_op(rbd_dev, NULL,
   snapid,
-  CEPH_OSD_OP_READ,
   CEPH_OSD_FLAG_READ,
-  NULL,
-  object_name, ofs, len, buf, NULL, ver);
+  ops, object_name, ofs, len, buf, NULL, ver);
+   rbd_destroy_ops(ops);
+
+   return ret;
 }

 /*
@@ -1257,7 +1247,6 @@ static int rbd_req_sync_watch(struct rbd_device *rbd_dev,

ret = rbd_req_sync_op(rbd_dev, NULL,
  CEPH_NOSNAP,
- 0,
  CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK,
  ops,
  object_name, 0, 0, NULL,
@@ -1296,7 +1285,6 @@ static int rbd_req_sync_unwatch(struct rbd_device *rbd_dev,

ret = rbd_req_sync_op(rbd_dev, NULL,
  CEPH_NOSNAP,
- 0,
  CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK,
  ops,
  object_name, 0, 0, NULL, NULL, NULL);
@@ -1353,7 +1341,6 @@ static int rbd_req_sync_notify(struct rbd_device *rbd_dev,

ret = rbd_req_sync_op(rbd_dev, NULL,
   CEPH_NOSNAP,
-  0,
   CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK,
   ops,
   object_name, 0, 0, NULL, NULL, NULL);
@@ -1402,7 +1389,6 @@ static int rbd_req_sync_exec(struct rbd_device *rbd_dev,


[PATCH 12/12] rbd: fixes in rbd_header_from_disk()

2012-07-19 Thread Alex Elder
This fixes a few issues in rbd_header_from_disk():
- The memcmp() call at the beginning of the function is really
  looking at the text field of struct rbd_image_header_ondisk.
  While it does lie at the beginning of the structure, the
  comparison should be done against the field, not the structure
  as a whole.
- There is a check intended to catch overflow, but it's wrong in
  two ways.
- First, the type we don't want to overflow is size_t, not
  unsigned int, and there is now a SIZE_MAX we can use
  with that type.
- Second, we're allocating the snapshot ids and snapshot
  image sizes separately (each has type u64; on disk they are
  grouped together in an rbd_image_header_ondisk structure).
  So we can use the size of u64 in this overflow check.
- If there are no snapshots, then there should be no snapshot
  names.  Enforce this, and issue a warning if we encounter a
  header with no snapshots but a non-zero snap_names_len.
- When saving the snapshot names into the header, be more direct
  in defining the offset in the on-disk structure from which
  they're being copied by using snap_count rather than i
  in the array index.
- If an error occurs, the snapc and snap_names fields are
  freed at the end of the function.  Make those fields be null
  pointers after they're freed, to be explicit that they are
  no longer valid.
- Finally, move the definition of the local variable i to the
  innermost scope in which it's needed.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   19 +++++++++++++------
 1 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6279186..573202a 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -488,14 +488,14 @@ static int rbd_header_from_disk(struct rbd_image_header *header,
 				struct rbd_image_header_ondisk *ondisk,
 				u32 allocated_snaps)
 {
-	u32 i, snap_count;
+	u32 snap_count;

-	if (memcmp(ondisk, RBD_HEADER_TEXT, sizeof(RBD_HEADER_TEXT)))
+	if (memcmp(ondisk->text, RBD_HEADER_TEXT, sizeof(RBD_HEADER_TEXT)))
 		return -ENXIO;

 	snap_count = le32_to_cpu(ondisk->snap_count);
-	if (snap_count > (UINT_MAX - sizeof(struct ceph_snap_context))
-			 / sizeof (*ondisk))
+	if (snap_count > (SIZE_MAX - sizeof(struct ceph_snap_context))
+			 / sizeof (u64))
 		return -EINVAL;
 	header->snapc = kmalloc(sizeof(struct ceph_snap_context) +
 				snap_count * sizeof(u64),
@@ -503,8 +503,8 @@ static int rbd_header_from_disk(struct rbd_image_header *header,
 	if (!header->snapc)
 		return -ENOMEM;

-	header->snap_names_len = le64_to_cpu(ondisk->snap_names_len);
 	if (snap_count) {
+		header->snap_names_len = le64_to_cpu(ondisk->snap_names_len);
 		header->snap_names = kmalloc(header->snap_names_len,
 					     GFP_KERNEL);
 		if (!header->snap_names)
@@ -514,6 +514,8 @@ static int rbd_header_from_disk(struct rbd_image_header *header,
 		if (!header->snap_sizes)
 			goto err_names;
 	} else {
+		WARN_ON(ondisk->snap_names_len);
+		header->snap_names_len = 0;
 		header->snap_names = NULL;
 		header->snap_sizes = NULL;
 	}
@@ -538,6 +540,8 @@ static int rbd_header_from_disk(struct rbd_image_header *header,
 	header->total_snaps = snap_count;

 	if (snap_count && allocated_snaps == snap_count) {
+		int i;
+
 		for (i = 0; i < snap_count; i++) {
 			header->snapc->snaps[i] =
 				le64_to_cpu(ondisk->snaps[i].id);
@@ -546,7 +550,7 @@ static int rbd_header_from_disk(struct rbd_image_header *header,
 		}

 		/* copy snapshot names */
-		memcpy(header->snap_names, &ondisk->snaps[i],
+		memcpy(header->snap_names, &ondisk->snaps[snap_count],
 		       header->snap_names_len);
 	}

@@ -556,8 +560,11 @@ err_sizes:
 	kfree(header->snap_sizes);
 err_names:
 	kfree(header->snap_names);
+	header->snap_names = NULL;
 err_snapc:
 	kfree(header->snapc);
+	header->snapc = NULL;
+
 	return -ENOMEM;
 }

-- 
1.7.5.4



Re: [PATCH 1/4] rbd: don't use snapc->seq that way

2012-07-19 Thread Alex Elder
On 07/19/2012 04:02 PM, Josh Durgin wrote:
 On 07/19/2012 10:11 AM, Alex Elder wrote:
 We now use rbd_dev->snap_id to record the snapshot id--using
 the special value SNAP_NONE to indicate the rbd_dev is not mapping
 a snapshot at all.
 
 That's CEPH_NOSNAP, not SNAP_NONE, right? In any case,

Yes.  I'll fix the description before I commit it...

 Reviewed-by: Josh Durgin josh.dur...@inktank.com
 
 



Re: [PATCH 0/4] rbd: use snapc->seq the way server does

2012-07-19 Thread Josh Durgin

On 07/19/2012 10:09 AM, Alex Elder wrote:

This series of patches changes the way the snap context seq field
is used.  Currently it is used in a way that isn't really useful, and
as such is a bit confusing.  This behavior seems to be a holdover
from a time when there was no snap_id field maintained for an rbd_dev.

Summary:
[PATCH 1/4] rbd: don't use snapc->seq that way
 Removes special handling in __rbd_refresh_header() that ensured
 the seq field was updated to point to the head if it had been
 at the start of the function.
[PATCH 2/4] rbd: preserve snapc->seq in rbd_header_set_snap()
 Changes rbd_header_set_snap() so it doesn't set the seq field
 to the snapshot id (for a snapshot mapping) or the highest
 snapshot id (for the base image).
[PATCH 3/4] rbd: set snapc->seq only when refreshing header
 Assigns snapc->seq whenever an updated rbd image header is
 received rather than when a new snapshot id has been
 assigned.
[PATCH 4/4] rbd: kill rbd_image_header->snap_seq
 Gets rid of the rbd_image_header->snap_seq field, which
 previously kept the same information now maintained in
 the snapc->seq field.

-Alex


The rest of the series looks good too.

Reviewed-by: Josh Durgin josh.dur...@inktank.com


[GIT PULL] Ceph fixes for 3.5

2012-07-19 Thread Sage Weil
Hi Linus,

Please pull these last minute fixes for Ceph from:

 git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

The important one fixes a bug in the socket failure handling behavior that 
was turned up in some recent failure injection testing.  The other two are 
minor bug fixes.

Thanks!
sage


Dan Carpenter (1):
  rbd: endian bug in rbd_req_cb()

Sage Weil (1):
  libceph: fix messenger retry

Yan, Zheng (1):
  rbd: Fix ceph_snap_context size calculation

 drivers/block/rbd.c|4 ++--
 include/linux/ceph/messenger.h |   12 ++--
 net/ceph/messenger.c   |   12 ++--
 3 files changed, 10 insertions(+), 18 deletions(-)


making objdump useful

2012-07-19 Thread Sage Weil
I finally figured out how to make objdump interleave the source code in 
the .ko file dumps on our qa machines.  The problem is that the debug info 
references the path where the kernel was compiled (which is non-obvious 
since the info is compressed).  For our environment, this is a quick 
workaround:

 sudo mkdir -p /srv/autobuild-ceph/gitbuilder.git
 sudo ln -s /path/to/your/kernel/tree /srv/autobuild-ceph/gitbuilder.git/build
 objdump -rdS whatever.ko | less
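
To see which compile directory the DWARF info actually references (and hence
where the symlink needs to point), something like this should work:

 readelf --debug-dump=info whatever.ko | grep -m1 DW_AT_comp_dir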

Now, to figure out why kdb freezes on the 'dmesg' command...

sage



mkcephfs problem

2012-07-19 Thread Tim Flavin
I am trying to get Ceph running on an ARM system, currently one quad
core node, running Ubuntu 12.04.

It compiles fine, currently without tcmalloc and google perf tools, but
I am running into a problem with mkcephfs.  'mkcephfs -a -c ceph.conf'
didn't work so I did it piece by piece until I got this:

root@ubuntu:/home/tim# mkcephfs -d /tmp/foo --prepare-mon
Building generic osdmap from /tmp/foo/conf
/usr/local/bin/osdmaptool: osdmap file '/tmp/foo/osdmap'
/usr/local/bin/osdmaptool: writing epoch 1 to /tmp/foo/osdmap
Generating admin key at /tmp/foo/keyring.admin
creating /tmp/foo/keyring.admin
Building initial monitor keyring
cat: /tmp/foo/key.*: No such file or directory

There are no key.* files in /tmp/foo

# ls /tmp/foo
conf  keyring.admin  keyring.mon  monmap  osdmap


If you are running everything on one node, is it OK to use localhost and
127.0.0.1?


Here is my config file:

[osd]
osd journal size = 1000
filestore xattr use omap = true

[mon.a]
host = localhost
mon addr = 127.0.0.1:6789

[osd.0]
host = localhost


[mds.a]
host = localhost


Here are the other configuration commands:

root@ubuntu:/home/tim# mkcephfs -c ceph.conf -d /tmp/foo --prepare-monmap
preparing monmap in /tmp/foo/monmap
/usr/local/bin/monmaptool --create --clobber --add a 127.0.0.1:6789
--print /tmp/foo/monmap
/usr/local/bin/monmaptool: monmap file /tmp/foo/monmap
/usr/local/bin/monmaptool: generated fsid 8e4f846b-1829-46e4-9902-5d8316c3fa4f
epoch 0
fsid 8e4f846b-1829-46e4-9902-5d8316c3fa4f
last_changed 2012-07-19 22:40:38.651045
created 2012-07-19 22:40:38.651045
0: 127.0.0.1:6789/0 mon.a
/usr/local/bin/monmaptool: writing epoch 0 to /tmp/foo/monmap (1 monitors)
root@ubuntu:/home/tim# mkcephfs -d /tmp/foo --init-local-daemons mds
root@ubuntu:/home/tim# mkcephfs -d /tmp/foo --init-local-daemons osd
root@ubuntu:/home/tim# mkcephfs -d /tmp/foo --prepare-mon
Building generic osdmap from /tmp/foo/conf
/usr/local/bin/osdmaptool: osdmap file '/tmp/foo/osdmap'
/usr/local/bin/osdmaptool: writing epoch 1 to /tmp/foo/osdmap
Generating admin key at /tmp/foo/keyring.admin
creating /tmp/foo/keyring.admin
Building initial monitor keyring
cat: /tmp/foo/key.*: No such file or directory
root@ubuntu:/home/tim#


Any ideas?  Which command was supposed to generate the key.* files?
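
In case it helps anyone hitting the same wall: the per-daemon keys can
be generated by hand with ceph-authtool.  This is only a sketch; the
key.* file names mkcephfs expects here are an assumption about its
internals, not documented behavior:

 # one key per daemon defined in ceph.conf
 ceph-authtool --create-keyring --gen-key -n mds.a /tmp/foo/key.mds.a
 ceph-authtool --create-keyring --gen-key -n osd.0 /tmp/foo/key.osd.0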

Tim


Re: Poor read performance in KVM

2012-07-19 Thread Vladimir Bashkirtsev

On 20/07/2012 1:22 AM, Tommi Virtanen wrote:

On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev
vladi...@bashkirtsev.com wrote:

Looks like osd.0 performs with low latency but osd.1 latency is way too
high; on average it appears as 200ms. The osd is backed by btrfs over LVM2.
Maybe the issue lies in the backing fs selection? All four osds are running similar

Our typical experience with btrfs right now seems to be that it works
fast when the filesystem is fresh, but as it ages, it starts to have
higher and higher delays on syncing writes. This does not seem to be
completely deterministic, that is, if you run many btrfs'es, the
symptoms start to appear at different times on different instances.
I would suspect that as well, but all of our btrfs filesystems are
relatively fresh, as we have moved ceph from the test bed to a
production environment.
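
One way to compare the two osds directly is to watch their perf
counters through the admin socket.  A sketch, assuming the default
socket path; on builds of this vintage the command may be spelled
perfcounters_dump rather than perf dump:

 # journal/filestore latency counters for osd.1
 ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok perf dump | less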



Re: How to compile Java-Rados.

2012-07-19 Thread ramu
Hi Noah,

Thank you for the fixes and suggestions on compiling java-rados;
it's working fine now.

Thanks,
Ramu.



Re: Poor read performance in KVM

2012-07-19 Thread Vladimir Bashkirtsev




We are seeing degradation at 64k node/leaf sizes as well.  So far the
degradation is most obvious with small writes.  It affects XFS as
well, though not as severely.  We are vigorously looking into it. :)


Just confirming that one of our clients has run a fair amount (on the
gigabyte scale) of mysql updates in the preceding few days. So it appears
all of the performance degradation stems from random writes. Perhaps I
will rebuild each osd to check whether that helps it recover.



Re: Poor read performance in KVM

2012-07-19 Thread Vladimir Bashkirtsev



What node/leaf size are you using on your btrfs volume?

Default 4K.
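
For anyone experimenting, the node and leaf sizes are fixed at mkfs
time, so changing them means recreating the filesystem.  A sketch,
assuming this era's btrfs-progs flags and a placeholder device (and
note the report above that 64k sizes degrade as well):

 # -l sets the leaf size, -n the node size
 mkfs.btrfs -l 64k -n 64k /dev/sdX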



Re: Poor read performance in KVM

2012-07-19 Thread Vladimir Bashkirtsev
Yes, they can hold up reads to the same object. Depending on where
they're stuck, they may be blocking other requests as well if they're
e.g. taking up all the filestore threads. Waiting for subops means
they're waiting for replicas to acknowledge the write and commit it to
disk. The real cause for slowness of those ops is the replicas. If you
enable 'debug osd = 25', 'debug filestore = 25', and 'debug journal = 20'
you can trace through the logs to see exactly what's happening with the
subops for those requests.
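
In ceph.conf form those levels would look like the following; a sketch,
with 'debug filestore' assumed to be the intended spelling of the
second option:

 [osd]
 debug osd = 25
 debug filestore = 25
 debug journal = 20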
Looks like I hit exactly the same issue as described in the 'Slow request
warnings on 0.48' thread, but from a different angle. As our client ran
mysql updates, performance started to degrade across the cluster, bringing
the rest of the VMs to a standstill and producing incredible latency. At
some point slow request warnings started to pop up, and now it seems I
cannot get rid of them at all: I have shut down all clients and all ceph
subsystems and restarted everything, and it is back to the same
behaviour - slow request warnings.


Before rebuilding the osds I will enable debugging as you suggested, in
an attempt to find the underlying issue. Then I will rebuild the osds as
a measure of last resort, to make sure the osds are indeed causing it.
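
For the rebuild itself, redoing one osd in place would look roughly
like this. Only a sketch under assumptions: osd.1, the device, and the
mount point are placeholders, and this is not the documented
replacement procedure (with cephx enabled the osd key would also need
to be recreated):

 ceph osd out 1                      # drain data off the osd first
 /etc/init.d/ceph stop osd.1         # once recovery completes
 mkfs.btrfs /dev/sdX                 # recreate the backing fs
 mount /dev/sdX /var/lib/ceph/osd/ceph-1
 ceph-osd -i 1 --mkfs --mkjournal    # reinitialize the osd data dir
 ceph osd in 1
 /etc/init.d/ceph start osd.1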
