Re: ceph status reporting non-existing osd
On Thu, Jul 19, 2012 at 1:28 AM, Gregory Farnum g...@inktank.com wrote:
On Wed, Jul 18, 2012 at 12:07 PM, Andrey Korolyov and...@xdel.ru wrote:
On Wed, Jul 18, 2012 at 10:30 PM, Gregory Farnum g...@inktank.com wrote:
On Wed, Jul 18, 2012 at 12:47 AM, Andrey Korolyov and...@xdel.ru wrote:
On Wed, Jul 18, 2012 at 11:18 AM, Gregory Farnum g...@inktank.com wrote:
On Tuesday, July 17, 2012 at 11:22 PM, Andrey Korolyov wrote:
On Wed, Jul 18, 2012 at 10:09 AM, Gregory Farnum g...@inktank.com wrote:

Hrm. That shouldn't be possible if the OSD has been removed. How did you take it out? It sounds like maybe you just marked it in the OUT state (and turned it off quite quickly) without actually taking it out of the cluster?
-Greg

When I did the removal, it was definitely not like that: first I marked the osds (4 and 5 on the same host) out, then rebuilt the crushmap, and then killed the osd processes. As I mentioned before, osd.4 does not exist in the crushmap, so theoretically it shouldn't be reported at all.

Okay, that's what happened. Marking an OSD out in the CRUSH map means all the data gets moved off it, but that doesn't remove it from all the places where it's registered in the monitor and in the map, for a couple of reasons:
1) You might want to mark an OSD out before taking it down, to allow for more orderly data movement.
2) OSDs can get marked out automatically, but the system shouldn't be able to forget about them on its own.
3) You might want to remove an OSD from the CRUSH map in the process of placing it somewhere else (perhaps you moved the physical machine to a new location).
etc.
You want to run "ceph osd rm 4 5" and that should unregister both of them from everything[1]. :)
-Greg
[1]: Except for the full lists, which have a bug in the version of code you're running: remove the OSDs, then adjust the full ratios again, and all will be well.
$ ceph osd rm 4
osd.4 does not exist

$ ceph -s
   health HEALTH_WARN 1 near full osd(s)
   monmap e3: 3 mons at {0=192.168.10.129:6789/0,1=192.168.10.128:6789/0,2=192.168.10.127:6789/0}, election epoch 58, quorum 0,1,2 0,1,2
   osdmap e2198: 4 osds: 4 up, 4 in
   pgmap v586056: 464 pgs: 464 active+clean; 66645 MB data, 231 GB used, 95877 MB / 324 GB avail
   mdsmap e207: 1/1/1 up {0=a=up:active}

$ ceph health detail
HEALTH_WARN 1 near full osd(s)
osd.4 is near full at 89%

$ ceph osd dump
max_osd 4
osd.0 up in weight 1 up_from 2183 up_thru 2187 down_at 2172 last_clean_interval [2136,2171) 192.168.10.128:6800/4030 192.168.10.128:6801/4030 192.168.10.128:6802/4030 exists,up 68b3deec-e80a-48b7-9c29-1b98f5de4f62
osd.1 up in weight 1 up_from 2136 up_thru 2186 down_at 2135 last_clean_interval [2115,2134) 192.168.10.129:6800/2980 192.168.10.129:6801/2980 192.168.10.129:6802/2980 exists,up b2a26fe9-aaa8-445f-be1f-fa7d2a283b57
osd.2 up in weight 1 up_from 2181 up_thru 2187 down_at 2172 last_clean_interval [2136,2171) 192.168.10.128:6803/4128 192.168.10.128:6804/4128 192.168.10.128:6805/4128 exists,up 378d367a-f7fb-4892-9ec9-db8ffdd2eb20
osd.3 up in weight 1 up_from 2136 up_thru 2186 down_at 2135 last_clean_interval [2115,2134) 192.168.10.129:6803/3069 192.168.10.129:6804/3069 192.168.10.129:6805/3069 exists,up faf8eda8-55fc-4a0e-899f-47dbd32b81b8

Hrm. How did you create your new crush map? All the normal avenues of removing an OSD from the map set a flag which the PGMap uses to delete its records (which would prevent it reappearing in the full list), and I can't see how setcrushmap would remove an OSD from the map (although there might be a code path I haven't found).

Manually, by deleting the osd.4 and osd.5 entries and reweighting the remaining nodes.

So you extracted the CRUSH map, edited it, and injected it using "ceph osd setcrushmap"?

Yep, exactly.
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Poor read performance in KVM
It's actually the sum of the latencies of all 3971 asynchronous reads, in seconds, so the average latency was ~200ms, which is still pretty high.

OK. I realized that later that day, when I noticed that the sum only ever goes up. So "sum" is the number of seconds spent, and dividing it by "avgcount" gives an idea of the average latency. I guess the same scheme is used everywhere else.

That's the latency between KVM and the OSDs. The extra latency could be from the callback to qemu or an artifact of this workload on the osds. You can use the admin socket on the osds for 'perf dump' as well, and check the 'op_r_latency' values, which is the latency between receiving a request on the OSD and replying to it.

osd.0 latency 0.121423775
"op_r_latency": { "avgcount": 25018, "sum": 3037.78 },

osd.3 latency 0.075055975
"op_r_latency": { "avgcount": 3597595, "sum": 270021 },

osd.0 was restarted recently; osd.3 has been running for a few days.

It may still sit in some network queues before being sent, so you may want to enable debug_ms=1 for rbd. This will show each osd_op and the corresponding osd_op_reply with a leading timestamp. You can tell which ones go together by their transaction id, i.e.:

2012-07-17 17:39:21.813009 7f9174a30760 1 -- 192.168.106.222:0/1009333 --> 192.168.106.222:6800/9134 -- osd_op(client.4104.0:20 [write 4096~512] 3.bd49d10d) v4 -- ?+0 0x235ac90 con 0x234e530
2012-07-17 17:39:21.869671 7f9171093710 1 -- 192.168.106.222:0/1009333 <== osd.0 192.168.106.222:6800/9134 20 ==== osd_op_reply(20 [write 4096~512] ondisk = 0) v4 ==== 99+0+0 (1120882424 0 0) 0x235ac90 con 0x234e530

The transaction id here is 20 (seen in 'osd_op(client.4104.0:20' and 'osd_op_reply(20').

Done that... Log excerpt below. It appears to be in line with what rados bench says: latency is in the millisecond range. A manual glance over roughly 4000 requests at random spots confirms that latency is low on the OSD. To run this test I have:
1. Migrated all VMs off the host except the one I am testing.
2. Put debug_ms=1 in ceph.conf and restarted ceph (mon, mds, osd - all running on this host).

I am not sure that ceph-osd.0.log contains the information we are looking for: i.e. it appears in reverse order relative to your example. Maybe I need to look somewhere else? My thinking is that the osd log is written by the osd and thus does not include any network latency.

ceph-osd.0.log:2012-07-19 19:09:52.899706 7fb261ffb700 1 -- 172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 376 ==== osd_op(client.159051.0:1427 rb.0.7. [read 1282048~4096] 2.c8eaa480) v4 ==== 158+0+0 (4291909772 0 0) 0x7fb16c001000 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.899912 7fb25e7f4700 1 -- 172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- osd_op_reply(1427 rb.0.7. [read 1282048~4096] = 0) v4 -- ?+0 0x7fb2408e2000 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.900254 7fb261ffb700 1 -- 172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 377 ==== osd_op(client.159051.0:1428 rb.0.7. [read 1286144~4096] 2.c8eaa480) v4 ==== 158+0+0 (1278691322 0 0) 0x7fb16c001400 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.900451 7fb25eff5700 1 -- 172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- osd_op_reply(1428 rb.0.7. [read 1286144~4096] = 0) v4 -- ?+0 0x7fb24dbd9570 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.900815 7fb261ffb700 1 -- 172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 378 ==== osd_op(client.159051.0:1429 rb.0.7. [read 1290240~4096] 2.c8eaa480) v4 ==== 158+0+0 (372927 0 0) 0x7fb16c001000 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901016 7fb25e7f4700 1 -- 172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- osd_op_reply(1429 rb.0.7. [read 1290240~4096] = 0) v4 -- ?+0 0x7fb2403d14e0 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901359 7fb261ffb700 1 -- 172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 379 ==== osd_op(client.159051.0:1430 rb.0.7. [read 1294336~4096] 2.c8eaa480) v4 ==== 158+0+0 (773263335 0 0) 0x7fb16c001400 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901557 7fb25eff5700 1 -- 172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- osd_op_reply(1430 rb.0.7. [read 1294336~4096] = 0) v4 -- ?+0 0x7fb24e311050 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901917 7fb261ffb700 1 -- 172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 380 ==== osd_op(client.159051.0:1431 rb.0.7. [read 1298432~4096] 2.c8eaa480) v4 ==== 158+0+0 (3155241114 0 0) 0x7fb16c001000 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.902120 7fb25e7f4700 1 -- 172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- osd_op_reply(1431 rb.0.7. [read 1298432~4096] = 0) v4 -- ?+0 0x7fb2414137d0 con 0x7fb22c004b10
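The sum/avgcount arithmetic discussed above is easy to check by hand. A minimal sketch, using the JSON fragments quoted in this thread (a real dump would come from the OSD admin socket via "ceph --admin-daemon /var/run/ceph/ceph-osd.N.asok perf dump"):

```python
import json

# The op_r_latency fragments quoted earlier in this thread.
dumps = {
    "osd.0": '{"op_r_latency": {"avgcount": 25018, "sum": 3037.78}}',
    "osd.3": '{"op_r_latency": {"avgcount": 3597595, "sum": 270021}}',
}

def avg_latency(perf_json: str, counter: str = "op_r_latency") -> float:
    """'sum' is the total seconds spent on the op; dividing by 'avgcount'
    gives the average per-op latency in seconds."""
    c = json.loads(perf_json)[counter]
    return c["sum"] / c["avgcount"]

for name, raw in dumps.items():
    print(f"{name}: {avg_latency(raw):.6f} s")
# prints:
# osd.0: 0.121424 s
# osd.3: 0.075056 s
```

These match the 0.121423775 and 0.075055975 figures posted above, which is what you would expect if those numbers were computed the same way.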
Re: Slow request warnings on 0.48
I'd just like to report the same behaviour on my test cluster with 0.48. I've set up a single box (SL6.1, 2.6.32-220.23.1 kernel) with 1 mds, mon and osd, and replication set to '1' for both data and metadata. Having mounted using ceph-fuse, I'm running a simple fio job to create load:

[global]
directory=/mnt/ceph
size=500M
rw=read
ioengine=libaio
[simple]

I'm then watching the latency with ioping. With rw=read, rw=randread (random reads) or rw=write (sequential writes) I see no problems and the latency sits around 1-2ms. However, with rw=randwrite (random writes) I see the latency jump to between 5 and 60 seconds, and the following types of warning lines appear:

2012-07-19 10:29:39.417625 osd.0 [WRN] 11 slow requests, 6 included below; oldest blocked for 54.425766 secs
[WRN] slow request 54.420958 seconds old, received at 2012-07-19 10:28:44.996584: osd_op(client.4113.0:9153 10003ed.003b [write 847872~4096] 0.dc4b476f snapc 1=[]) v4 currently started
2012-07-19 10:29:39.417641 osd.0 [WRN] slow request 54.420587 seconds old, received at 2012-07-19 10:28:44.996955: osd_op(client.4113.0:9154 10003ed. [write 1175552~4096] 0.44a7cb80 snapc 1=[]) v4 currently started
[...snip...]

Let me know if there's any more information I can provide that might help with diagnosing the problem (bearing in mind that I'm new to ceph, so I might need extra notes on generating tests, dumps etc. :) )

Thanks,
Matthew

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
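For what it's worth, warning lines like these are easy to summarize with a throwaway script. This sketch assumes the "slow request N seconds old" wording shown above; the sample lines are abbreviated from the excerpt (the osd_op payloads are elided):

```python
import re

# Abbreviated sample lines from the cluster log excerpt above.
log = """\
2012-07-19 10:29:39.417625 osd.0 [WRN] 11 slow requests, 6 included below; oldest blocked for 54.425766 secs
[WRN] slow request 54.420958 seconds old, received at 2012-07-19 10:28:44.996584: osd_op(...) currently started
2012-07-19 10:29:39.417641 osd.0 [WRN] slow request 54.420587 seconds old, received at 2012-07-19 10:28:44.996955: osd_op(...) currently started
"""

# Pull out the age of each individual slow request.
ages = [float(m) for m in re.findall(r"slow request (\d+\.\d+) seconds old", log)]
print(f"{len(ages)} slow requests, oldest {max(ages):.3f}s, mean {sum(ages)/len(ages):.3f}s")
```

Running it over the full log (e.g. piped from /var/log/ceph/ceph.log) gives a quick feel for how many requests stall and for how long.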
Re: Poor read performance in KVM
Try to determine how much of the 200ms avg latency comes from the osds vs the qemu block driver.

It looks like osd.0 performs with low latency but osd.1 latency is way too high, and on average it appears as 200ms. The osd is backed by btrfs over LVM2. Maybe the issue lies in the choice of backing fs? All four osds run a similar setup (btrfs over LVM2), so I have some doubts that this is the reason, since osd.0 performs well. I have read the full log between the osd_op for 3670 and its osd_op_reply, and there are a number of pings from other osds (which were responded to quickly) and a good number of osd_op_reply writes (the osd_sub_op for these writes came 10 seconds before). So it appears 3670 was delayed by a backlog of operations.

Once the latency is under control, you might look into changing guest settings to send larger requests and readahead more.

Josh
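To illustrate how a single slow OSD can dominate the client-side figure: the blended average is just an op-count-weighted mean, so if most reads land on the slow OSD the overall latency sits near its value. The op counts and latencies below are made-up numbers for illustration, not measurements from this thread:

```python
# Hypothetical (ops, avg latency in seconds) per OSD -- illustrative only.
osds = {
    "osd.0 (fast)": (1000, 0.005),
    "osd.1 (slow)": (3000, 0.265),
}

total_ops = sum(n for n, _ in osds.values())
# Weight each OSD's latency by how many ops it served.
blended = sum(n * lat for n, lat in osds.values()) / total_ops
print(f"blended client-side average: {blended * 1000:.0f} ms")
# prints: blended client-side average: 200 ms
```

So a ~200ms client-side average is perfectly consistent with one healthy OSD and one badly backlogged one.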
[PATCH 0/6] rbd: old patches from Josh
Late last year Josh Durgin put together a series of fixes for rbd that never got committed. I told him I would get them in, and this series represents the last six that remain. Here's a summary:

[PATCH 1/6] rbd: return errors for mapped but deleted snapshot
This adds code to distinguish the result of attempting to read data from a deleted snapshot from the result of reading a hole in a snapshot. The former now produces ENXIO.

[PATCH 2/6] rbd: only reset capacity when pointing to head
When an rbd header is refreshed, its capacity is set in case it has been changed. This should not happen for mapped snapshots.

[PATCH 3/6] rbd: expose the correct size of the device in sysfs
An rbd_dev (even one mapping a snapshot) holds the size of its base image in its header's image_size field. The sysfs entry for the snapshot size was showing the wrong value.

[PATCH 4/6] rbd: set image size when header is updated
The rbd image size was not getting updated when a header was refreshed.

[PATCH 5/6] rbd: use reference counting for the snap context
This makes sure the rbd code takes a reference to its snapshot context while a request related to that context is underway.

[PATCH 6/6] rbd: send header version when notifying
This ensures the version that gets sent back on a watch notify acknowledgement is the one that got read as a result of refreshing the header.

I've reviewed them all, but am posting them for a chance for others to comment before I commit them.

-Alex
[PATCH 1/6] rbd: return errors for mapped but deleted snapshot
When a snapshot is deleted, the OSD will return ENOENT when reading from it. This is normally interpreted as a hole by rbd, which will return zeroes. To minimize the time in which this can happen, stop requests early when we are notified that our snapshot no longer exists.

[el...@inktank.com: updated __rbd_init_snaps_header() logic]
Signed-off-by: Josh Durgin josh.dur...@inktank.com
Reviewed-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c | 32 ++++++++++++++++++++++++++++++--
 1 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b124442..730d0ce 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -172,9 +172,13 @@ struct rbd_device {
 	/* protects updating the header */
 	struct rw_semaphore     header_rwsem;
+	/* name of the snapshot this device reads from */
 	char			*snap_name;
+	/* id of the snapshot this device reads from */
 	u64			snap_id;	/* current snapshot id */
-	int			read_only;
+	/* whether the snap_id this device reads from still exists */
+	bool			snap_exists;
+	int			read_only;

 	struct list_head	node;

@@ -597,6 +601,7 @@ static int rbd_header_set_snap(struct rbd_device *rbd_dev, u64 *size)
 	else
 		snapc->seq = 0;
 	rbd_dev->snap_id = CEPH_NOSNAP;
+	rbd_dev->snap_exists = false;
 	rbd_dev->read_only = 0;
 	if (size)
 		*size = header->image_size;
@@ -606,6 +611,7 @@ static int rbd_header_set_snap(struct rbd_device *rbd_dev, u64 *size)
 	if (ret < 0)
 		goto done;
 	rbd_dev->snap_id = snapc->seq;
+	rbd_dev->snap_exists = true;
 	rbd_dev->read_only = 1;
 }

@@ -1468,6 +1474,21 @@ static void rbd_rq_fn(struct request_queue *q)

 		spin_unlock_irq(q->queue_lock);

+		if (rbd_dev->snap_id != CEPH_NOSNAP) {
+			bool snap_exists;
+
+			down_read(&rbd_dev->header_rwsem);
+			snap_exists = rbd_dev->snap_exists;
+			up_read(&rbd_dev->header_rwsem);
+
+			if (!snap_exists) {
+				dout("request for non-existent snapshot");
+				spin_lock_irq(q->queue_lock);
+				__blk_end_request_all(rq, -ENXIO);
+				continue;
+			}
+		}
+
 		dout("%s 0x%x bytes at 0x%llx\n",
 		     do_write ? "write" : "read",
 		     size, blk_rq_pos(rq) * SECTOR_SIZE);
@@ -2088,7 +2109,14 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev)
 			cur_id = rbd_dev->header.snapc->snaps[i - 1];

 		if (!i || old_snap->id < cur_id) {
-			/* old_snap->id was skipped, thus was removed */
+			/*
+			 * old_snap->id was skipped, thus was
+			 * removed. If this rbd_dev is mapped to
+			 * the removed snapshot, record that it no
+			 * longer exists, to prevent further I/O.
+			 */
+			if (rbd_dev->snap_id == old_snap->id)
+				rbd_dev->snap_exists = false;
 			__rbd_remove_snap_dev(rbd_dev, old_snap);
 			continue;
 		}
--
1.7.5.4
[PATCH 2/6] rbd: only reset capacity when pointing to head
Snapshots cannot be resized, and the new capacity of head should not be reflected by the snapshot.

Signed-off-by: Josh Durgin josh.dur...@inktank.com
Reviewed-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c | 7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 730d0ce..f171ceb 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1725,7 +1725,12 @@ static int __rbd_refresh_header(struct rbd_device *rbd_dev)
 		return ret;

 	/* resized? */
-	set_capacity(rbd_dev->disk, h.image_size / SECTOR_SIZE);
+	if (rbd_dev->snap_id == CEPH_NOSNAP) {
+		sector_t size = (sector_t) h.image_size / SECTOR_SIZE;
+
+		dout("setting size to %llu sectors", (unsigned long long) size);
+		set_capacity(rbd_dev->disk, size);
+	}

 	down_write(&rbd_dev->header_rwsem);
--
1.7.5.4
[PATCH 3/6] rbd: expose the correct size of the device in sysfs
If an image was mapped to a snapshot, the size of the head version would be shown. Protect the capacity with header_rwsem, since it may change.

Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
Reviewed-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c | 11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index f171ceb..9c3a1db 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1724,6 +1724,8 @@ static int __rbd_refresh_header(struct rbd_device *rbd_dev)
 	if (ret < 0)
 		return ret;

+	down_write(&rbd_dev->header_rwsem);
+
 	/* resized? */
 	if (rbd_dev->snap_id == CEPH_NOSNAP) {
 		sector_t size = (sector_t) h.image_size / SECTOR_SIZE;
@@ -1732,8 +1734,6 @@ static int __rbd_refresh_header(struct rbd_device *rbd_dev)
 		set_capacity(rbd_dev->disk, size);
 	}

-	down_write(&rbd_dev->header_rwsem);
-
 	snap_seq = rbd_dev->header.snapc->seq;
 	if (rbd_dev->header.total_snaps &&
 	    rbd_dev->header.snapc->snaps[0] == snap_seq)
@@ -1853,8 +1853,13 @@ static ssize_t rbd_size_show(struct device *dev,
 			     struct device_attribute *attr, char *buf)
 {
 	struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
+	sector_t size;
+
+	down_read(&rbd_dev->header_rwsem);
+	size = get_capacity(rbd_dev->disk);
+	up_read(&rbd_dev->header_rwsem);

-	return sprintf(buf, "%llu\n", (unsigned long long)rbd_dev->header.image_size);
+	return sprintf(buf, "%llu\n", (unsigned long long) size * SECTOR_SIZE);
 }

 static ssize_t rbd_major_show(struct device *dev,
--
1.7.5.4
[PATCH 4/6] rbd: set image size when header is updated
The image may have been resized.

Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
Reviewed-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 9c3a1db..a6bbda2 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1746,6 +1746,7 @@ static int __rbd_refresh_header(struct rbd_device *rbd_dev)
 	kfree(rbd_dev->header.snap_names);
 	kfree(rbd_dev->header.snapc);

+	rbd_dev->header.image_size = h.image_size;
 	rbd_dev->header.total_snaps = h.total_snaps;
 	rbd_dev->header.snapc = h.snapc;
 	rbd_dev->header.snap_names = h.snap_names;
--
1.7.5.4
[PATCH 5/6] rbd: use reference counting for the snap context
This prevents a race between requests with a given snap context and header updates that free it. The osd client was already expecting the snap context to be reference counted, since it get()s it in ceph_osdc_build_request and put()s it when the request completes.

Also remove the second down_read()/up_read() on header_rwsem in rbd_do_request, which wasn't actually preventing this race or protecting any other data.

Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
Reviewed-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c | 36 ++++++++++++++++++------------------
 1 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index a6bbda2..988f944 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -626,7 +626,7 @@ static void rbd_header_free(struct rbd_image_header *header)
 	kfree(header->object_prefix);
 	kfree(header->snap_sizes);
 	kfree(header->snap_names);
-	kfree(header->snapc);
+	ceph_put_snap_context(header->snapc);
 }

 /*
@@ -902,13 +902,10 @@ static int rbd_do_request(struct request *rq,
 	dout("rbd_do_request object_name=%s ofs=%lld len=%lld\n",
 	     object_name, len, ofs);

-	down_read(&rbd_dev->header_rwsem);
-
 	osdc = &rbd_dev->rbd_client->client->osdc;
 	req = ceph_osdc_alloc_request(osdc, flags, snapc, ops,
 				      false, GFP_NOIO, pages, bio);
 	if (!req) {
-		up_read(&rbd_dev->header_rwsem);
 		ret = -ENOMEM;
 		goto done_pages;
 	}
@@ -942,7 +939,6 @@ static int rbd_do_request(struct request *rq,
 				snapc, &mtime,
 				req->r_oid, req->r_oid_len);
-	up_read(&rbd_dev->header_rwsem);

 	if (linger_req) {
 		ceph_osdc_set_request_linger(osdc, req);
@@ -1448,6 +1444,7 @@ static void rbd_rq_fn(struct request_queue *q)
 		u64 ofs;
 		int num_segs, cur_seg = 0;
 		struct rbd_req_coll *coll;
+		struct ceph_snap_context *snapc;

 		/* peek at request from block layer */
 		if (!rq)
@@ -1474,21 +1471,20 @@ static void rbd_rq_fn(struct request_queue *q)

 		spin_unlock_irq(q->queue_lock);

-		if (rbd_dev->snap_id != CEPH_NOSNAP) {
-			bool snap_exists;
+		down_read(&rbd_dev->header_rwsem);

-			down_read(&rbd_dev->header_rwsem);
-			snap_exists = rbd_dev->snap_exists;
+		if (rbd_dev->snap_id != CEPH_NOSNAP && !rbd_dev->snap_exists) {
 			up_read(&rbd_dev->header_rwsem);
-
-			if (!snap_exists) {
-				dout("request for non-existent snapshot");
-				spin_lock_irq(q->queue_lock);
-				__blk_end_request_all(rq, -ENXIO);
-				continue;
-			}
+			dout("request for non-existent snapshot");
+			spin_lock_irq(q->queue_lock);
+			__blk_end_request_all(rq, -ENXIO);
+			continue;
 		}

+		snapc = ceph_get_snap_context(rbd_dev->header.snapc);
+
+		up_read(&rbd_dev->header_rwsem);
+
 		dout("%s 0x%x bytes at 0x%llx\n",
 		     do_write ? "write" : "read",
 		     size, blk_rq_pos(rq) * SECTOR_SIZE);
@@ -1498,6 +1494,7 @@ static void rbd_rq_fn(struct request_queue *q)
 		if (!coll) {
 			spin_lock_irq(q->queue_lock);
 			__blk_end_request_all(rq, -ENOMEM);
+			ceph_put_snap_context(snapc);
 			continue;
 		}

@@ -1521,7 +1518,7 @@ static void rbd_rq_fn(struct request_queue *q)
 			/* init OSD command: write or read */
 			if (do_write)
 				rbd_req_write(rq, rbd_dev,
-					      rbd_dev->header.snapc,
+					      snapc,
 					      ofs,
 					      op_size, bio,
 					      coll, cur_seg);
@@ -1544,6 +1541,8 @@ next_seg:
 		if (bp)
 			bio_pair_release(bp);
 		spin_lock_irq(q->queue_lock);
+
+		ceph_put_snap_context(snapc);
 	}
 }

@@ -1744,7 +1743,8 @@ static int __rbd_refresh_header(struct rbd_device *rbd_dev)
 	/* rbd_dev->header.object_prefix shouldn't change */
 	kfree(rbd_dev->header.snap_sizes);
 	kfree(rbd_dev->header.snap_names);
-	kfree(rbd_dev->header.snapc);
+	/* osd requests may still refer to snapc */
+	ceph_put_snap_context(rbd_dev->header.snapc);

 	rbd_dev->header.image_size = h.image_size;
[PATCH 6/6] rbd: send header version when notifying
Previously the original header version was sent. Now, we update it when the header changes.

Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
Reviewed-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c | 7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 988f944..4d3a1e0 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1197,7 +1197,7 @@ static int rbd_req_sync_notify_ack(struct rbd_device *rbd_dev,
 	if (ret < 0)
 		return ret;

-	ops[0].watch.ver = cpu_to_le64(rbd_dev->header.obj_version);
+	ops[0].watch.ver = cpu_to_le64(ver);
 	ops[0].watch.cookie = notify_id;
 	ops[0].watch.flag = 0;

@@ -1216,6 +1216,7 @@ static int rbd_req_sync_notify_ack(struct rbd_device *rbd_dev,
 static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data)
 {
 	struct rbd_device *rbd_dev = (struct rbd_device *)data;
+	u64 hver;
 	int rc;

 	if (!rbd_dev)
@@ -1225,12 +1226,13 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data)
 		rbd_dev->header_name, notify_id, (int) opcode);
 	mutex_lock_nested(&ctl_mutex, SINGLE_DEPTH_NESTING);
 	rc = __rbd_refresh_header(rbd_dev);
+	hver = rbd_dev->header.obj_version;
 	mutex_unlock(&ctl_mutex);

 	if (rc)
 		pr_warning(RBD_DRV_NAME "%d got notification but failed to update snaps: %d\n", rbd_dev->major, rc);

-	rbd_req_sync_notify_ack(rbd_dev, ver, notify_id, rbd_dev->header_name);
+	rbd_req_sync_notify_ack(rbd_dev, hver, notify_id, rbd_dev->header_name);
 }

 /*
@@ -1746,6 +1748,7 @@ static int __rbd_refresh_header(struct rbd_device *rbd_dev)
 	/* osd requests may still refer to snapc */
 	ceph_put_snap_context(rbd_dev->header.snapc);

+	rbd_dev->header.obj_version = h.obj_version;
 	rbd_dev->header.image_size = h.image_size;
 	rbd_dev->header.total_snaps = h.total_snaps;
 	rbd_dev->header.snapc = h.snapc;
--
1.7.5.4
Re: [PATCH] rbd: fix the memory leak of bio_chain_clone
On Tue, Jul 17, 2012 at 01:18:50PM -0700, Yehuda Sadeh wrote:
On Wed, Jul 11, 2012 at 5:34 AM, Guangliang Zhao gz...@suse.com wrote:

The bio_pair allocated in bio_chain_clone would not be freed; this will cause a memory leak. It could actually be freed only after 3 releases, because the reference count of bio_pair is initialized to 3 in bio_split, and bio_pair_release only drops the reference count.

bio_pair_release must be called three times to release a bio_pair, and the callback functions of the bios on the requests are invoked on the last release in bio_pair_release; however, these functions are also called in rbd_req_cb. In other words, they would be called twice, which may have serious consequences.

This patch clones the bio chain from the original directly, without using bio_split (and without a bio_pair). The new bio chain can be released whenever we no longer need it.

Signed-off-by: Guangliang Zhao gz...@suse.com
---
 drivers/block/rbd.c | 66 ++++++++++++++++++++++++--------------------------------------------------------------
 1 files changed, 26 insertions(+), 40 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 013c7a5..6a12040 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -712,51 +712,43 @@ static void zero_bio_chain(struct bio *chain, int start_ofs)
 	}
 }

-/*
- * bio_chain_clone - clone a chain of bios up to a certain length.
- * might return a bio_pair that will need to be released.
+/**
+ * bio_chain_clone - clone a chain of bios up to a certain length.
+ * @old: bio to clone
+ * @offset: start point for bio clone
+ * @len: length of bio chain
+ * @gfp_mask: allocation priority
+ *
+ * RETURNS:
+ * Pointer to new bio chain on success, NULL on failure.
  */
-static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
-				   struct bio_pair **bp,
+static struct bio *bio_chain_clone(struct bio **old, int *offset,
 				   int len, gfp_t gfpmask)
 {
 	struct bio *tmp, *old_chain = *old, *new_chain = NULL, *tail = NULL;
 	int total = 0;

-	if (*bp) {
-		bio_pair_release(*bp);
-		*bp = NULL;
-	}
-
 	while (old_chain && (total < len)) {
+		int need = len - total;
+
 		tmp = bio_kmalloc(gfpmask, old_chain->bi_max_vecs);
 		if (!tmp)
 			goto err_out;

-		if (total + old_chain->bi_size > len) {
-			struct bio_pair *bp;
-
+		__bio_clone(tmp, old_chain);
+		if (total + (tmp->bi_size - *offset) > len) {

We can change this to:

		if (tmp->bi_size - *offset > need)

I think it'll be slightly more readable.

Yes, I will modify it in the next version.

+			tmp->bi_sector += *offset >> SECTOR_SHIFT;
+			tmp->bi_io_vec->bv_offset += *offset >> SECTOR_SHIFT;

Shouldn't these two lines be outside this if?

Excellent, the else branch needs it too.

 			/*
-			 * this split can only happen with a single paged bio,
-			 * split_bio will BUG_ON if this is not the case
+			 * This can only happen with a single paged bio,
+			 * rbd_merge_bvec would guarantee it.
 			 */
-			dout("bio_chain_clone split! total=%d remaining=%d"
-			     "bi_size=%d\n",
-			     (int)total, (int)len-total,
-			     (int)old_chain->bi_size);
-
-			/* split the bio. We'll release it either in the next
-			   call, or it will have to be released outside */
-			bp = bio_split(old_chain, (len - total) / SECTOR_SIZE);
-			if (!bp)
-				goto err_out;
-
-			__bio_clone(tmp, &bp->bio1);
-
-			*next = &bp->bio2;
+			tmp->bi_size = need;
+			tmp->bi_io_vec->bv_len = need;
+			*offset += need;
 		} else {
-			__bio_clone(tmp, old_chain);
-			*next = old_chain->bi_next;
+			old_chain = old_chain->bi_next;
+			*offset = 0;

We also need to adjust the length of the tmp bios in the else branch, so the following two lines should be added:

			tmp->bi_size -= *offset;
			tmp->bi_io_vec->bv_offset -= *offset;

 		}

 		tmp->bi_bdev = NULL;
@@ -769,7 +761,6 @@ static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
 			tail->bi_next = tmp;
 		tail = tmp;
[PATCH] rbd: fix the repeat initialization of semaphore
The header_rwsem of rbd_dev is initialized twice in rbd_add().

Signed-off-by: Guangliang Zhao gz...@suse.com
---
 drivers/block/rbd.c | 2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 013c7a5..50117dd 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2381,8 +2381,6 @@ static ssize_t rbd_add(struct bus_type *bus,
 	INIT_LIST_HEAD(&rbd_dev->snaps);
 	init_rwsem(&rbd_dev->header_rwsem);

-	init_rwsem(&rbd_dev->header_rwsem);
-
 	/* generate unique id: find highest unique id, add one */
 	rbd_id_get(rbd_dev);
--
1.7.3.4
Ceph doesn't update the block device size while an rbd image is mounted
Hi Cephers!

I'm working with rbd mapping. I figured out that the block device size of the rbd device is not updated while the device is mounted. Here are my tests:

1. Pick a device and check its size:

# rbd ls
# rbd info test
rbd image 'test':
	size 10000 MB in 2500 objects
	order 22 (4096 KB objects)
	block_name_prefix: rb.0.6
	parent:  (pool -1)

2. Map the device:

# rbd map --secret /etc/ceph/secret test
# rbd showmapped
id pool image snap device
1  rbd  test  -    /dev/rbd1

3. Put a fs on it and check the block device size:

# mkfs.ext4 /dev/rbd1
...
# fdisk -l /dev/rbd1
Disk /dev/rbd1: 10.5 GB, 10485760000 bytes

4. Mount it:

# mount /dev/rbd1 /mnt
# df -h
/dev/rbd1  9.8G  277M  9.0G  3% /mnt

5. Change the image size:

# rbd resize --size 11000 test
Resizing image: 100% complete...done.
# rbd info test
rbd image 'test':
	size 11000 MB in 2750 objects
	order 22 (4096 KB objects)
	block_name_prefix: rb.0.6
	parent:  (pool -1)

At this point, if you run fdisk -l /dev/rbd1, the block device size will remain the same.

6. Unmount the device:

# umount /media
# fdisk -l /dev/rbd1
Disk /dev/rbd1: 11.5 GB, 11534336000 bytes

Unmounting the directory did update the block device size. Of course you can do something really fast like:

# umount /media && mount /dev/rbd1 /media

That will work; it's a valid solution as long as there is no open file. I won't use this trick in production... I also tried mount -o remount and it didn't work.

7. Resize the fs (this can be performed while the fs is mounted):

# e2fsck -f /dev/rbd1
e2fsck 1.42 (29-Nov-2011)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/rbd1: 11/644640 files (0.0% non-contiguous), 77173/2560000 blocks
# resize2fs /dev/rbd1
resize2fs 1.42 (29-Nov-2011)
Resizing the filesystem on /dev/rbd1 to 2816000 (4k) blocks.
The filesystem on /dev/rbd1 is now 2816000 blocks long.

Did I miss something? Is this feature coming? Thank you in advance :)
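As a cross-check, the rbd figures in this transcript are self-consistent: with order 22 (4 MB objects), 2500 objects is 10000 MB, 2750 objects is 11000 MB, and 11000 MiB is exactly the 11534336000 bytes fdisk reports after the unmount. A quick sketch of the arithmetic:

```python
MiB = 1 << 20
order = 22              # from "rbd info": 4096 KB objects
obj_bytes = 1 << order  # 4 MiB per object

for size_mb in (10000, 11000):
    size_bytes = size_mb * MiB
    print(f"{size_mb} MB -> {size_bytes} bytes, {size_bytes // obj_bytes} objects")
# prints:
# 10000 MB -> 10485760000 bytes, 2500 objects
# 11000 MB -> 11534336000 bytes, 2750 objects
```

So the kernel does learn the new size eventually; it just doesn't re-read it while the device is held open, as the observation in step 5 shows.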
Re: Ceph doesn't update the block device size while an rbd image is mounted
Hi,

On 19-07-12 16:55, Sébastien Han wrote:
> Hi Cephers! I'm working with rbd mapping. I figured out that the block
> device size of the rbd device is not updated while the device is
> mounted. Here are my tests: [...]

IIRC this is not something RBD-specific: since the device is in use, its header can't be re-read by the kernel. So when you unmount it, the kernel can re-read the header and resize the device.

Wido
Re: Ceph doesn't update the block device size while a rbd image is mounted
Ok, I got your point, it seems logical, but then why is this possible with LVM, for example? You can easily do this with LVM without unmounting the device.

Cheers.

On Thu, Jul 19, 2012 at 5:15 PM, Wido den Hollander w...@widodh.nl wrote:
> IIRC this is not something RBD-specific: since the device is in use, its
> header can't be re-read by the kernel. So when you unmount it, the
> kernel can re-read the header and resize the device.
>
> Wido
Re: Ceph doesn't update the block device size while a rbd image is mounted
On Thu, Jul 19, 2012 at 8:26 AM, Sébastien Han han.sebast...@gmail.com wrote:
> Ok, I got your point, it seems logical, but then why is this possible
> with LVM, for example? You can easily do this with LVM without
> unmounting the device.

Do your LVM volumes have partition tables inside them? That might be the difference. Of course, you can put your filesystem straight on the RBD; that would be a good experiment to run.
Re: Ceph doesn't update the block device size while a rbd image is mounted
On Thu, Jul 19, 2012 at 8:38 AM, Tommi Virtanen t...@inktank.com wrote:
> Do your LVM volumes have partition tables inside them? That might be
> the difference. Of course, you can put your filesystem straight on the
> RBD; that would be a good experiment to run.

Oops, I see you did put your fs straight on the RBD already; seeing fdisk tripped my brain.
Re: Ceph doesn't update the block device size while a rbd image is mounted
On 19-07-12 17:26, Sébastien Han wrote:
> Ok, I got your point, it seems logical, but then why is this possible
> with LVM, for example? You can easily do this with LVM without
> unmounting the device.

LVM volumes run through the device mapper and are not regular block devices. If you resize the disk underneath LVM, you won't see an increased VG or PV size unless you change the availability of the VG to unavailable and back to available again.

I'm not 100% sure what the exact root cause is, but the kernel won't read the new size of a block device as long as it is in use.

Wido
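One way to see what size the kernel currently believes, without fdisk, is to read it from sysfs; /sys/block/<dev>/size reports 512-byte sectors regardless of the device's logical block size. A small sketch (the rbd1 device name comes from the test above and assumes a mapped device, so here the conversion is demonstrated on a literal value):

```shell
# /sys/block/<dev>/size reports 512-byte sectors; convert to bytes
sectors_to_bytes() { echo $(( $1 * 512 )); }

# an 11000 MB image is 22528000 sectors:
sectors_to_bytes 22528000     # 11534336000

# on a live system you would feed it the real sysfs value, e.g.:
#   sectors_to_bytes "$(cat /sys/block/rbd1/size)"
# "blockdev --getsize64 /dev/rbd1" prints the same byte count directly.
```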
Re: Ceph doesn't update the block device size while a rbd image is mounted
Hmm ok, I see. Thanks! But let me know if you have any clue how to force the kernel to re-read the size without unmounting/mounting :)

On Thu, Jul 19, 2012 at 5:47 PM, Wido den Hollander w...@widodh.nl wrote:
> I'm not 100% sure what the exact root cause is, but the kernel won't
> read the new size of a block device as long as it is in use.
>
> Wido
Re: Poor read performance in KVM
On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev vladi...@bashkirtsev.com wrote:
> It looks like osd.0 performs with low latency but osd.1 latency is way
> too high, and on average it appears as 200ms. The osds are backed by
> btrfs over LVM2. Might the issue lie in the backing fs selection? All
> four osds are running similar [...]

Our typical experience with btrfs right now seems to be that it works fast when the filesystem is fresh, but as it ages, it starts to have higher and higher delays on syncing writes. This does not seem to be completely deterministic; that is, if you run many btrfs'es, the symptoms start to appear at different times on different instances.
[PATCH 0/4] rbd: use snapc->seq the way the server does
This series of patches changes the way the snap context seq field is used. Currently it is used in a way that isn't really useful, and as such is a bit confusing. This behavior seems to be a holdover from a time when there was no snap_id field maintained for an rbd_dev.

Summary:

[PATCH 1/4] rbd: don't use snapc->seq that way
    Removes special handling in __rbd_refresh_header() that ensured the
    seq field was updated to point to the head if it had been at the
    start of the function.
[PATCH 2/4] rbd: preserve snapc->seq in rbd_header_set_snap()
    Changes rbd_header_set_snap() so it doesn't set the seq field to the
    snapshot id (for a snapshot mapping) or the highest snapshot id (for
    the base image).
[PATCH 3/4] rbd: set snapc->seq only when refreshing header
    Assigns snapc->seq whenever an updated rbd image header is received
    rather than when a new snapshot id has been assigned.
[PATCH 4/4] rbd: kill rbd_image_header->snap_seq
    Gets rid of the rbd_image_header->snap_seq field, which previously
    kept the same information now maintained in the snapc->seq field.

-Alex
[PATCH 1/4] rbd: don't use snapc->seq that way
In what appears to be an artifact of a different way of encoding whether an rbd image maps a snapshot, __rbd_refresh_header() has code that arranges to update the seq value in an rbd image's snapshot context to point to the first entry in its snapshot array if that's where it was pointing initially.

We now use rbd_dev->snap_id to record the snapshot id, using the special value SNAP_NONE to indicate the rbd_dev is not mapping a snapshot at all. There is therefore no need to check for this case, nor to update the seq value, in __rbd_refresh_header(). Just preserve the seq value that rbd_read_header() provides (which, at the moment, is nothing).

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c | 14 --------------
 1 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 4d3a1e0..8a46599 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1718,8 +1718,6 @@ static int __rbd_refresh_header(struct rbd_device *rbd_dev)
 {
 	int ret;
 	struct rbd_image_header h;
-	u64 snap_seq;
-	int follow_seq = 0;
 
 	ret = rbd_read_header(rbd_dev, &h);
 	if (ret < 0)
@@ -1735,13 +1733,6 @@ static int __rbd_refresh_header(struct rbd_device *rbd_dev)
 		set_capacity(rbd_dev->disk, size);
 	}
 
-	snap_seq = rbd_dev->header.snapc->seq;
-	if (rbd_dev->header.total_snaps &&
-	    rbd_dev->header.snapc->snaps[0] == snap_seq)
-		/* pointing at the head, will need to follow that
-		   if head moves */
-		follow_seq = 1;
-
 	/* rbd_dev->header.object_prefix shouldn't change */
 	kfree(rbd_dev->header.snap_sizes);
 	kfree(rbd_dev->header.snap_names);
@@ -1759,11 +1750,6 @@ static int __rbd_refresh_header(struct rbd_device *rbd_dev)
 	WARN_ON(strcmp(rbd_dev->header.object_prefix, h.object_prefix));
 	kfree(h.object_prefix);
 
-	if (follow_seq)
-		rbd_dev->header.snapc->seq = rbd_dev->header.snapc->snaps[0];
-	else
-		rbd_dev->header.snapc->seq = snap_seq;
-
 	ret = __rbd_init_snaps_header(rbd_dev);
 	up_write(&rbd_dev->header_rwsem);
-- 
1.7.5.4
[PATCH 3/4] rbd: set snapc->seq only when refreshing header
In rbd_header_add_snap() there is code to set snapc->seq to the just-added snapshot id. This is the only remnant left of the use of that field for recording which snapshot an rbd_dev was associated with. That functionality is no longer supported, so get rid of that final bit of code.

Doing so means we never actually set snapc->seq any more. On the server, the snapshot context's sequence value represents the highest snapshot id ever issued for a particular rbd image. So we'll make it have that meaning here as well. To do so, set this value whenever the rbd header is (re-)read. That way it will always be consistent with the rest of the snapshot context we maintain.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c | 10 ++--------
 1 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index ac8a83f..c299a55 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -537,6 +537,7 @@ static int rbd_header_from_disk(struct rbd_image_header *header,
 	atomic_set(&header->snapc->nref, 1);
 	header->snap_seq = le64_to_cpu(ondisk->snap_seq);
+	header->snapc->seq = le64_to_cpu(ondisk->snap_seq);
 	header->snapc->num_snaps = snap_count;
 	header->total_snaps = snap_count;
@@ -1685,14 +1686,7 @@ static int rbd_header_add_snap(struct rbd_device *rbd_dev,
 	kfree(data);
 
-	if (ret < 0)
-		return ret;
-
-	down_write(&rbd_dev->header_rwsem);
-	rbd_dev->header.snapc->seq = new_snapid;
-	up_write(&rbd_dev->header_rwsem);
-
-	return 0;
+	return ret < 0 ? ret : 0;
 bad:
 	return -ERANGE;
 }
-- 
1.7.5.4
[PATCH 4/4] rbd: kill rbd_image_header->snap_seq
The snap_seq field in an rbd_image_header structure held the value from the rbd image header when it was last refreshed. We now maintain this value in the snapc->seq field, so get rid of the other one.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c | 2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index c299a55..6df8c62 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -82,7 +82,6 @@ struct rbd_image_header {
 	__u8 comp_type;
 	struct ceph_snap_context *snapc;
 	size_t snap_names_len;
-	u64 snap_seq;
 	u32 total_snaps;
 
 	char *snap_names;
@@ -536,7 +535,6 @@ static int rbd_header_from_disk(struct rbd_image_header *header,
 	header->comp_type = ondisk->options.comp_type;
 	atomic_set(&header->snapc->nref, 1);
-	header->snap_seq = le64_to_cpu(ondisk->snap_seq);
 	header->snapc->seq = le64_to_cpu(ondisk->snap_seq);
 	header->snapc->num_snaps = snap_count;
 	header->total_snaps = snap_count;
-- 
1.7.5.4
Re: Ceph doesn't update the block device size while a rbd image is mounted
I've had a little more luck using cfdisk than vanilla fdisk when it comes to detecting changes. You might try running partprobe and then cfdisk and see if you get anything different.

Calvin

On Thu, Jul 19, 2012 at 9:50 AM, Sébastien Han han.sebast...@gmail.com wrote:
> Hmm ok, I see. Thanks! But let me know if you have any clue how to
> force the kernel to re-read the size without unmounting/mounting :)
Re: Poor read performance in KVM
On Thu, Jul 19, 2012 at 9:52 AM, Tommi Virtanen t...@inktank.com wrote:
> Our typical experience with btrfs right now seems to be that it works
> fast when the filesystem is fresh, but as it ages, it starts to have
> higher and higher delays on syncing writes.

Is Inktank seeing the slowdown on btrfs volumes with large metadata (32k / 64k) node/leaf sizes as well, or only on default (4k) sizes?

Vladimir, what node/leaf size are you using on your btrfs volume?
Re: Poor read performance in KVM
On 07/19/2012 01:06 PM, Calvin Morrow wrote:
> Is Inktank seeing the slowdown on btrfs volumes with large metadata
> (32k / 64k) node/leaf sizes as well, or only on default (4k) sizes?

Hi Vladimir,

We are seeing degradation at 64k node/leaf sizes as well. So far the degradation is most obvious with small writes. It affects XFS as well, though not as severely. We are vigorously looking into it. :)

Mark

-- 
Mark Nelson
Performance Engineer
Inktank
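For anyone wanting to reproduce the large-metadata setup being compared here: with the btrfs-progs of that era, node and leaf sizes were fixed at mkfs time. A sketch that only builds and prints the invocation (the -l/-n flags and the /dev/sdX device name are assumptions; check your mkfs.btrfs man page before running anything):

```shell
# build the mkfs invocation for 64k metadata nodes/leaves;
# -l (leafsize) and -n (nodesize) are the historical btrfs-progs flags,
# and /dev/sdX is a placeholder -- this just prints the command
size=$((64 * 1024))
cmd="mkfs.btrfs -l $size -n $size /dev/sdX"
echo "$cmd"     # mkfs.btrfs -l 65536 -n 65536 /dev/sdX
```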
Re: Ceph doesn't update the block device size while a rbd image is mounted
With LVM, you can re-scan the SCSI bus to extend a physical drive and then run a pvextend.

@Calvin: I tried your solution:

# partprobe /dev/rbd1

Unfortunately, nothing changed. Did you get it working?

Cheers!

On Thu, Jul 19, 2012 at 5:50 PM, Sébastien Han han.sebast...@gmail.com wrote:
> Hmm ok, I see. Thanks! But let me know if you have any clue how to
> force the kernel to re-read the size without unmounting/mounting :)
Re: Ceph doesn't update the block device size while a rbd image is mounted
I haven't tried resizing an rbd yet, but I was changing partitions on a non-Ceph two-node cluster with shared storage yesterday while certain partitions were in use (partitions 1, 2, and 5 were mounted; deleting partition ids 6+, adding new ones) and fdisk wasn't re-reading disk changes. partprobe followed by cfdisk seemed to update the kernel with the on-disk changes. I realize it's not exactly what you are doing, but I figured it might be close enough to be worth a shot. cfdisk seems to make a system call to refresh disk information that vanilla fdisk doesn't. Calvin On Thu, Jul 19, 2012 at 1:44 PM, Sébastien Han han.sebast...@gmail.com wrote: With LVM, you can re-scan the SCSI bus to extend a physical drive and then run a pvresize. @Calvin: I tried your solution # partprobe /dev/rbd1 Unfortunately nothing changed. Did you get it working? Cheers! [...] -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ceph doesn't update the block device size while a rbd image is mounted
On 07/19/2012 09:44 PM, Sébastien Han wrote: With LVM, you can re-scan the SCSI bus to extend a physical drive and then run a pvresize. @Calvin: I tried your solution # partprobe /dev/rbd1 Did you try blockdev? # blockdev --rereadpt /dev/rbd1 Regards, Andreas [...] -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 00/12] rbd: cleanup series
This series includes a bunch of relatively small cleanups. They're grouped a bit below, but they apply together in this sequence and the later ones may have dependencies on those earlier in the series. Summaries: [PATCH 01/12] rbd: drop extra header_rwsem init [PATCH 02/12] rbd: simplify __rbd_remove_all_snaps() [PATCH 03/12] rbd: clean up a few dout() calls [PATCH 04/12] ceph: define snap counts as u32 everywhere These four are very simple and straightforward cleanups. [PATCH 05/12] rbd: snapc is unused in rbd_req_sync_read() [PATCH 06/12] rbd: drop rbd_header_from_disk() gfp_flags parameter [PATCH 07/12] rbd: drop rbd_dev parameter in snap functions [PATCH 08/12] rbd: drop rbd_req_sync_exec() ver parameter These four each drop an unused argument from a function. [PATCH 09/12] rbd: have __rbd_add_snap_dev() return a pointer [PATCH 10/12] rbd: make rbd_create_rw_ops() return a pointer These two each change a function so it returns a pointer rather than filling in the address of a provided pointer. [PATCH 11/12] rbd: always pass ops array to rbd_req_sync_op() [PATCH 12/12] rbd: fixes in rbd_header_from_disk() These comprise slightly more involved refactoring of the code, but as with the rest of the patches in this series, there should be no functional difference as a result. -Alex -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] rbd: don't use snapc->seq that way
On 07/19/2012 10:11 AM, Alex Elder wrote: We now use rbd_dev->snap_id to record the snapshot id, using the special value SNAP_NONE to indicate the rbd_dev is not mapping a snapshot at all. That's CEPH_NOSNAP, not SNAP_NONE, right? In any case, Reviewed-by: Josh Durgin josh.dur...@inktank.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 01/12] rbd: drop extra header_rwsem init
In commit c01a there was inadvertently added an extra initialization of rbd_dev->header_rwsem. This gets rid of the duplicate. (Guangliang Zhao also offered up the same fix.) Reported-by: Guangliang Zhao gz...@suse.com Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 2 -- 1 files changed, 0 insertions(+), 2 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 6df8c62..b9895fe 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2458,8 +2458,6 @@ static ssize_t rbd_add(struct bus_type *bus, INIT_LIST_HEAD(&rbd_dev->snaps); init_rwsem(&rbd_dev->header_rwsem); - init_rwsem(&rbd_dev->header_rwsem); - /* generate unique id: find highest unique id, add one */ rbd_id_get(rbd_dev); -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/12] rbd: simplify __rbd_remove_all_snaps()
This just replaces a while loop with list_for_each_entry_safe() in __rbd_remove_all_snaps(). Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index b9895fe..74e6a33 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1692,11 +1692,10 @@ bad: static void __rbd_remove_all_snaps(struct rbd_device *rbd_dev) { struct rbd_snap *snap; + struct rbd_snap *next; - while (!list_empty(&rbd_dev->snaps)) { - snap = list_first_entry(&rbd_dev->snaps, struct rbd_snap, node); + list_for_each_entry_safe(snap, next, &rbd_dev->snaps, node) __rbd_remove_snap_dev(rbd_dev, snap); - } } /* -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 03/12] rbd: clean up a few dout() calls
There was a dout() call in rbd_do_request() that was reporting the offset as the length and vice versa. While fixing that I did a quick scan of other dout() calls and fixed a couple of other minor things. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 7 +++---- 1 files changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 74e6a33..93b2447 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -895,7 +895,7 @@ static int rbd_do_request(struct request *rq, } dout("rbd_do_request object_name=%s ofs=%lld len=%lld\n", - object_name, len, ofs); + object_name, ofs, len); osdc = &rbd_dev->rbd_client->client->osdc; req = ceph_osdc_alloc_request(osdc, flags, snapc, ops, @@ -1315,8 +1315,7 @@ static void rbd_notify_cb(u64 ver, u64 notify_id, u8 opcode, void *data) return; dout("rbd_notify_cb %s notify_id=%lld opcode=%d\n", - rbd_dev->header_name, - notify_id, (int)opcode); + rbd_dev->header_name, notify_id, (int) opcode); } /* @@ -1664,7 +1663,7 @@ static int rbd_header_add_snap(struct rbd_device *rbd_dev, monc = &rbd_dev->rbd_client->client->monc; ret = ceph_monc_create_snapid(monc, rbd_dev->pool_id, &new_snapid); - dout("created snapid=%lld\n", new_snapid); + dout("created snapid=%llu\n", new_snapid); if (ret < 0) return ret; -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/12] ceph: define snap counts as u32 everywhere
There are two structures in which a count of snapshots is maintained: struct ceph_snap_context { ... u32 num_snaps; ... } and struct ceph_snap_realm { ... u32 num_prior_parent_snaps; /* had prior to parent_since */ ... u32 num_snaps; ... } These fields never take on negative values (e.g., to hold special meaning), and so are really inherently unsigned. Furthermore they take their value from over-the-wire or on-disk formatted 32-bit values. So change their definition to have type u32, and change some spots elsewhere in the code to account for this change. Signed-off-by: Alex Elder el...@inktank.com --- fs/ceph/snap.c | 18 ++++++++++-------- fs/ceph/super.h | 4 ++-- include/linux/ceph/libceph.h | 2 +- 3 files changed, 13 insertions(+), 11 deletions(-) diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index e5206fc..cbb2f54 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -296,8 +296,7 @@ static int build_snap_context(struct ceph_snap_realm *realm) struct ceph_snap_realm *parent = realm->parent; struct ceph_snap_context *snapc; int err = 0; - int i; - int num = realm->num_prior_parent_snaps + realm->num_snaps; + u32 num = realm->num_prior_parent_snaps + realm->num_snaps; /* * build parent context, if it hasn't been built.
@@ -321,11 +320,11 @@ static int build_snap_context(struct ceph_snap_realm *realm) realm->cached_context->seq == realm->seq && (!parent || realm->cached_context->seq >= parent->cached_context->seq)) { - dout("build_snap_context %llx %p: %p seq %lld (%d snaps)" + dout("build_snap_context %llx %p: %p seq %lld (%u snaps)" " (unchanged)\n", realm->ino, realm, realm->cached_context, realm->cached_context->seq, - realm->cached_context->num_snaps); + (unsigned int) realm->cached_context->num_snaps); return 0; } @@ -342,6 +341,8 @@ static int build_snap_context(struct ceph_snap_realm *realm) num = 0; snapc->seq = realm->seq; if (parent) { + u32 i; + /* include any of parent's snaps occurring _after_ my parent became my parent */ for (i = 0; i < parent->cached_context->num_snaps; i++) @@ -361,8 +362,9 @@ static int build_snap_context(struct ceph_snap_realm *realm) sort(snapc->snaps, num, sizeof(u64), cmpu64_rev, NULL); snapc->num_snaps = num; - dout("build_snap_context %llx %p: %p seq %lld (%d snaps)\n", - realm->ino, realm, snapc, snapc->seq, snapc->num_snaps); + dout("build_snap_context %llx %p: %p seq %lld (%u snaps)\n", + realm->ino, realm, snapc, snapc->seq, + (unsigned int) snapc->num_snaps); if (realm->cached_context) ceph_put_snap_context(realm->cached_context); @@ -402,9 +404,9 @@ static void rebuild_snap_realms(struct ceph_snap_realm *realm) * helper to allocate and decode an array of snapids. free prior * instance, if any.
*/ -static int dup_array(u64 **dst, __le64 *src, int num) +static int dup_array(u64 **dst, __le64 *src, u32 num) { - int i; + u32 i; kfree(*dst); if (num) { diff --git a/fs/ceph/super.h b/fs/ceph/super.h index fc35036..3ea48b7 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -612,9 +612,9 @@ struct ceph_snap_realm { u64 parent_since; /* snapid when our current parent became so */ u64 *prior_parent_snaps; /* snaps inherited from any parents we */ - int num_prior_parent_snaps; /* had prior to parent_since */ + u32 num_prior_parent_snaps; /* had prior to parent_since */ u64 *snaps; /* snaps specific to this realm */ - int num_snaps; + u32 num_snaps; struct ceph_snap_realm *parent; struct list_head children; /* list of child realms */ diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h index 98ec36a..0b72295 100644 --- a/include/linux/ceph/libceph.h +++ b/include/linux/ceph/libceph.h @@ -160,7 +160,7 @@ struct ceph_client { struct ceph_snap_context { atomic_t nref; u64 seq; - int num_snaps; + u32 num_snaps; u64 snaps[]; }; -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 05/12] rbd: snapc is unused in rbd_req_sync_read()
The snapc parameter to rbd_req_sync_read() is not used, so get rid of it. Reported-by: Josh Durgin josh.dur...@inktank.com Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 93b2447..06e022d 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1162,7 +1162,6 @@ static int rbd_req_read(struct request *rq, * Request sync osd read */ static int rbd_req_sync_read(struct rbd_device *rbd_dev, - struct ceph_snap_context *snapc, u64 snapid, const char *object_name, u64 ofs, u64 len, @@ -1609,7 +1608,7 @@ static int rbd_read_header(struct rbd_device *rbd_dev, return -ENOMEM; rc = rbd_req_sync_read(rbd_dev, - NULL, CEPH_NOSNAP, + CEPH_NOSNAP, rbd_dev->header_name, 0, len, (char *)dh, &ver); -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 06/12] rbd: drop rbd_header_from_disk() gfp_flags parameter
The function rbd_header_from_disk() is only called in one spot, and it passes GFP_KERNEL as its value for the gfp_flags parameter. Just drop that parameter and substitute GFP_KERNEL everywhere it had been used within that function. (If we find we need the parameter again in the future it's easy enough to add back.) Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 13 ++++++------- 1 files changed, 6 insertions(+), 7 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 06e022d..3b4b4d2 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -487,8 +487,7 @@ static void rbd_coll_release(struct kref *kref) */ static int rbd_header_from_disk(struct rbd_image_header *header, struct rbd_image_header_ondisk *ondisk, - u32 allocated_snaps, - gfp_t gfp_flags) + u32 allocated_snaps) { u32 i, snap_count; @@ -501,18 +500,18 @@ static int rbd_header_from_disk(struct rbd_image_header *header, return -EINVAL; header->snapc = kmalloc(sizeof(struct ceph_snap_context) + snap_count * sizeof(u64), - gfp_flags); + GFP_KERNEL); if (!header->snapc) return -ENOMEM; header->snap_names_len = le64_to_cpu(ondisk->snap_names_len); if (snap_count) { header->snap_names = kmalloc(header->snap_names_len, - gfp_flags); + GFP_KERNEL); if (!header->snap_names) goto err_snapc; header->snap_sizes = kmalloc(snap_count * sizeof(u64), - gfp_flags); + GFP_KERNEL); if (!header->snap_sizes) goto err_names; } else { @@ -521,7 +520,7 @@ static int rbd_header_from_disk(struct rbd_image_header *header, } header->object_prefix = kmalloc(sizeof (ondisk->block_name) + 1, - gfp_flags); + GFP_KERNEL); if (!header->object_prefix) goto err_sizes; @@ -1615,7 +1614,7 @@ static int rbd_read_header(struct rbd_device *rbd_dev, if (rc < 0) goto out_dh; - rc = rbd_header_from_disk(header, dh, snap_count, GFP_KERNEL); + rc = rbd_header_from_disk(header, dh, snap_count); if (rc < 0) { if (rc == -ENXIO) pr_warning(unrecognized header format -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 07/12] rbd: drop rbd_dev parameter in snap functions
Both rbd_register_snap_dev() and __rbd_remove_snap_dev() have rbd_dev parameters that are unused. Remove them. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 19 +++++++------------ 1 files changed, 7 insertions(+), 12 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 3b4b4d2..955d75d 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -202,8 +202,7 @@ static ssize_t rbd_snap_add(struct device *dev, struct device_attribute *attr, const char *buf, size_t count); -static void __rbd_remove_snap_dev(struct rbd_device *rbd_dev, - struct rbd_snap *snap); +static void __rbd_remove_snap_dev(struct rbd_snap *snap); static ssize_t rbd_add(struct bus_type *bus, const char *buf, size_t count); @@ -1692,7 +1691,7 @@ static void __rbd_remove_all_snaps(struct rbd_device *rbd_dev) struct rbd_snap *next; list_for_each_entry_safe(snap, next, &rbd_dev->snaps, node) - __rbd_remove_snap_dev(rbd_dev, snap); + __rbd_remove_snap_dev(snap); } /* @@ -2000,15 +1999,13 @@ static struct device_type rbd_snap_device_type = { .release = rbd_snap_dev_release, }; -static void __rbd_remove_snap_dev(struct rbd_device *rbd_dev, - struct rbd_snap *snap) +static void __rbd_remove_snap_dev(struct rbd_snap *snap) { list_del(&snap->node); device_unregister(&snap->dev); } -static int rbd_register_snap_dev(struct rbd_device *rbd_dev, - struct rbd_snap *snap, +static int rbd_register_snap_dev(struct rbd_snap *snap, struct device *parent) { struct device *dev = &snap->dev; @@ -2035,8 +2032,7 @@ static int __rbd_add_snap_dev(struct rbd_device *rbd_dev, snap->size = rbd_dev->header.snap_sizes[i]; snap->id = rbd_dev->header.snapc->snaps[i]; if (device_is_registered(&rbd_dev->dev)) { - ret = rbd_register_snap_dev(rbd_dev, snap, - &rbd_dev->dev); + ret = rbd_register_snap_dev(snap, &rbd_dev->dev); if (ret < 0) goto err; } @@ -2101,7 +2097,7 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev) */ if (rbd_dev->snap_id == old_snap->id) rbd_dev->snap_exists = false; - __rbd_remove_snap_dev(rbd_dev, old_snap); + __rbd_remove_snap_dev(old_snap); continue; } if (old_snap->id == cur_id) { @@ -2165,8 +2161,7 @@ static int rbd_bus_add_dev(struct rbd_device *rbd_dev) goto out; list_for_each_entry(snap, &rbd_dev->snaps, node) { - ret = rbd_register_snap_dev(rbd_dev, snap, - &rbd_dev->dev); + ret = rbd_register_snap_dev(snap, &rbd_dev->dev); if (ret < 0) break; } -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 08/12] rbd: drop rbd_req_sync_exec() ver parameter
The only place that passes a version pointer to rbd_req_sync_exec() is in rbd_header_add_snap(), and that spot ignores the result. The only thing rbd_req_sync_exec() does with its ver parameter is pass it directly to rbd_req_sync_op(). So we can just use a null pointer there, and drop the ver parameter to rbd_req_sync_exec(). Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 8 +++----- 1 files changed, 3 insertions(+), 5 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 955d75d..d4a8d9e 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1374,8 +1374,7 @@ static int rbd_req_sync_exec(struct rbd_device *rbd_dev, const char *class_name, const char *method_name, const char *data, - int len, - u64 *ver) + int len) { struct ceph_osd_req_op *ops; int class_name_len = strlen(class_name); @@ -1398,7 +1397,7 @@ static int rbd_req_sync_exec(struct rbd_device *rbd_dev, 0, CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK, ops, - object_name, 0, 0, NULL, NULL, ver); + object_name, 0, 0, NULL, NULL, NULL); rbd_destroy_ops(ops); @@ -1651,7 +1650,6 @@ static int rbd_header_add_snap(struct rbd_device *rbd_dev, u64 new_snapid; int ret; void *data, *p, *e; - u64 ver; struct ceph_mon_client *monc; /* we should create a snapshot only if we're pointing at the head */ @@ -1676,7 +1674,7 @@ static int rbd_header_add_snap(struct rbd_device *rbd_dev, ret = rbd_req_sync_exec(rbd_dev, rbd_dev->header_name, "rbd", "snap_add", - data, p - data, &ver); + data, p - data); kfree(data); -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 09/12] rbd: have __rbd_add_snap_dev() return a pointer
It's not obvious whether the snapshot pointer whose address is provided to __rbd_add_snap_dev() will be assigned by that function. Change it to return the snapshot, or a pointer-coded errno in the event of a failure. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 37 ++--- 1 files changed, 22 insertions(+), 15 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index d4a8d9e..79b0762 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2018,15 +2018,21 @@ static int rbd_register_snap_dev(struct rbd_snap *snap, return ret; } -static int __rbd_add_snap_dev(struct rbd_device *rbd_dev, - int i, const char *name, - struct rbd_snap **snapp) +static struct rbd_snap *__rbd_add_snap_dev(struct rbd_device *rbd_dev, + int i, const char *name) { + struct rbd_snap *snap; int ret; - struct rbd_snap *snap = kzalloc(sizeof(*snap), GFP_KERNEL); + + snap = kzalloc(sizeof (*snap), GFP_KERNEL); if (!snap) - return -ENOMEM; + return ERR_PTR(-ENOMEM); + + ret = -ENOMEM; snap-name = kstrdup(name, GFP_KERNEL); + if (!snap-name) + goto err; + snap-size = rbd_dev-header.snap_sizes[i]; snap-id = rbd_dev-header.snapc-snaps[i]; if (device_is_registered(rbd_dev-dev)) { @@ -2034,12 +2040,14 @@ static int __rbd_add_snap_dev(struct rbd_device *rbd_dev, if (ret 0) goto err; } - *snapp = snap; - return 0; + + return snap; + err: kfree(snap-name); kfree(snap); - return ret; + + return ERR_PTR(ret); } /* @@ -2072,7 +2080,6 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev) const char *name, *first_name; int i = rbd_dev-header.total_snaps; struct rbd_snap *snap, *old_snap = NULL; - int ret; struct list_head *p, *n; first_name = rbd_dev-header.snap_names; @@ -2115,9 +2122,9 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev) if (cur_id = old_snap-id) break; /* a new snapshot */ - ret = __rbd_add_snap_dev(rbd_dev, i - 1, name, snap); - if (ret 0) - return ret; + snap = __rbd_add_snap_dev(rbd_dev, i - 1, name); + if 
(IS_ERR(snap)) + return PTR_ERR(snap); /* note that we add it backward so using n and not p */ list_add(snap-node, n); @@ -2131,9 +2138,9 @@ static int __rbd_init_snaps_header(struct rbd_device *rbd_dev) WARN_ON(1); return -EINVAL; } - ret = __rbd_add_snap_dev(rbd_dev, i - 1, name, snap); - if (ret 0) - return ret; + snap = __rbd_add_snap_dev(rbd_dev, i - 1, name); + if (IS_ERR(snap)) + return PTR_ERR(snap); list_add(snap-node, rbd_dev-snaps); } -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 10/12] rbd: make rbd_create_rw_ops() return a pointer
Either rbd_create_rw_ops() will succeed, or it will fail because a memory allocation failed. Have it just return a valid pointer or null rather than stuffing a pointer into a provided address and returning an errno.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   68 ++++++++++++++++++++++++++++-----------------------
 1 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 79b0762..0535612 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -783,22 +783,24 @@ err_out:
 /*
  * helpers for osd request op vectors.
  */
-static int rbd_create_rw_ops(struct ceph_osd_req_op **ops,
-			int num_ops,
-			int opcode,
-			u32 payload_len)
-{
-	*ops = kzalloc(sizeof(struct ceph_osd_req_op) * (num_ops + 1),
-		       GFP_NOIO);
-	if (!*ops)
-		return -ENOMEM;
-	(*ops)[0].op = opcode;
+static struct ceph_osd_req_op *rbd_create_rw_ops(int num_ops,
+					int opcode, u32 payload_len)
+{
+	struct ceph_osd_req_op *ops;
+
+	ops = kzalloc(sizeof (*ops) * (num_ops + 1), GFP_NOIO);
+	if (!ops)
+		return NULL;
+
+	ops[0].op = opcode;
+
 	/*
 	 * op extent offset and length will be set later on
 	 * in calc_raw_layout()
 	 */
-	(*ops)[0].payload_len = payload_len;
-	return 0;
+	ops[0].payload_len = payload_len;
+
+	return ops;
 }
 
 static void rbd_destroy_ops(struct ceph_osd_req_op *ops)
@@ -1033,8 +1035,9 @@ static int rbd_req_sync_op(struct rbd_device *rbd_dev,
 	if (!orig_ops) {
 		payload_len = (flags & CEPH_OSD_FLAG_WRITE ? len : 0);
-		ret = rbd_create_rw_ops(&ops, 1, opcode, payload_len);
-		if (ret < 0)
+		ret = -ENOMEM;
+		ops = rbd_create_rw_ops(1, opcode, payload_len);
+		if (!ops)
 			goto done;
 
 		if ((flags & CEPH_OSD_FLAG_WRITE) && buf) {
@@ -1097,8 +1100,9 @@ static int rbd_do_op(struct request *rq,
 	payload_len = (flags & CEPH_OSD_FLAG_WRITE ? seg_len : 0);
-	ret = rbd_create_rw_ops(&ops, 1, opcode, payload_len);
-	if (ret < 0)
+	ret = -ENOMEM;
+	ops = rbd_create_rw_ops(1, opcode, payload_len);
+	if (!ops)
 		goto done;
 
 	/* we've taken care of segment sizes earlier when we
@@ -1185,9 +1189,9 @@ static int rbd_req_sync_notify_ack(struct rbd_device *rbd_dev,
 	struct ceph_osd_req_op *ops;
 	int ret;
 
-	ret = rbd_create_rw_ops(&ops, 1, CEPH_OSD_OP_NOTIFY_ACK, 0);
-	if (ret < 0)
-		return ret;
+	ops = rbd_create_rw_ops(1, CEPH_OSD_OP_NOTIFY_ACK, 0);
+	if (!ops)
+		return -ENOMEM;
 
 	ops[0].watch.ver = cpu_to_le64(ver);
 	ops[0].watch.cookie = notify_id;
@@ -1236,10 +1240,11 @@ static int rbd_req_sync_watch(struct rbd_device *rbd_dev,
 {
 	struct ceph_osd_req_op *ops;
 	struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
+	int ret;
 
-	int ret = rbd_create_rw_ops(&ops, 1, CEPH_OSD_OP_WATCH, 0);
-	if (ret < 0)
-		return ret;
+	ops = rbd_create_rw_ops(1, CEPH_OSD_OP_WATCH, 0);
+	if (!ops)
+		return -ENOMEM;
 
 	ret = ceph_osdc_create_event(osdc, rbd_watch_cb, 0,
 				     (void *)rbd_dev, &rbd_dev->watch_event);
@@ -1279,10 +1284,11 @@ static int rbd_req_sync_unwatch(struct rbd_device *rbd_dev,
 				const char *object_name)
 {
 	struct ceph_osd_req_op *ops;
+	int ret;
 
-	int ret = rbd_create_rw_ops(&ops, 1, CEPH_OSD_OP_WATCH, 0);
-	if (ret < 0)
-		return ret;
+	ops = rbd_create_rw_ops(1, CEPH_OSD_OP_WATCH, 0);
+	if (!ops)
+		return -ENOMEM;
 
 	ops[0].watch.ver = 0;
 	ops[0].watch.cookie = cpu_to_le64(rbd_dev->watch_event->cookie);
@@ -1328,9 +1334,9 @@ static int rbd_req_sync_notify(struct rbd_device *rbd_dev,
 	int payload_len = sizeof(u32) + sizeof(u32);
 	int ret;
 
-	ret = rbd_create_rw_ops(&ops, 1, CEPH_OSD_OP_NOTIFY, payload_len);
-	if (ret < 0)
-		return ret;
+	ops = rbd_create_rw_ops(1, CEPH_OSD_OP_NOTIFY, payload_len);
+	if (!ops)
+		return -ENOMEM;
 
 	info.rbd_dev = rbd_dev;
@@ -1379,10 +1385,12 @@ static int rbd_req_sync_exec(struct rbd_device *rbd_dev,
 	struct ceph_osd_req_op *ops;
 	int class_name_len = strlen(class_name);
 	int method_name_len = strlen(method_name);
-	int ret = rbd_create_rw_ops(&ops, 1, CEPH_OSD_OP_CALL,
+	int ret;
+
+	ops = rbd_create_rw_ops(1, CEPH_OSD_OP_CALL,
 				class_name_len + method_name_len + len);
-	if (ret < 0)
-		return ret;
+	if (!ops)
+		return
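The conversion above replaces an out-parameter-plus-errno interface with a pointer-or-NULL return. A minimal userspace sketch of the refactored allocator (with calloc standing in for kzalloc(..., GFP_NOIO) and a stub struct in place of the real ceph_osd_req_op, which has many more fields):

```c
#include <stdlib.h>

/* Stub standing in for struct ceph_osd_req_op. */
struct osd_req_op {
	int op;
	unsigned int payload_len;
};

/* Pointer-or-NULL style, as in the refactored rbd_create_rw_ops():
 * returns the zeroed array on success, NULL on allocation failure
 * (which the caller maps to -ENOMEM). */
static struct osd_req_op *create_rw_ops(int num_ops, int opcode,
					unsigned int payload_len)
{
	struct osd_req_op *ops;

	/* one extra zeroed slot terminates the array, as in the patch */
	ops = calloc(num_ops + 1, sizeof(*ops));
	if (!ops)
		return NULL;

	ops[0].op = opcode;
	ops[0].payload_len = payload_len;
	return ops;
}
```

The only failure mode is allocation failure, so the pointer itself carries all the information the old int-plus-out-parameter pair did.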
[PATCH 11/12] rbd: always pass ops array to rbd_req_sync_op()
All of the callers of rbd_req_sync_op() except one pass a non-null ops pointer. The only one that does not is rbd_req_sync_read(), which passes CEPH_OSD_OP_READ as its opcode and CEPH_OSD_FLAG_READ for flags. By allocating the ops array in rbd_req_sync_read() and moving the special-case code for the null ops pointer into it, it becomes clear that much of that code is not even necessary. In addition, the opcode argument to rbd_req_sync_op() is never actually used, so get rid of that.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   46 ++++++++++++++++------------------------------
 1 files changed, 16 insertions(+), 30 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 0535612..6279186 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1013,9 +1013,8 @@ static void rbd_simple_req_cb(struct ceph_osd_request *req, struct ceph_msg *msg
 static int rbd_req_sync_op(struct rbd_device *rbd_dev,
 			   struct ceph_snap_context *snapc,
 			   u64 snapid,
-			   int opcode,
 			   int flags,
-			   struct ceph_osd_req_op *orig_ops,
+			   struct ceph_osd_req_op *ops,
 			   const char *object_name,
 			   u64 ofs, u64 len,
 			   char *buf,
@@ -1025,28 +1024,14 @@ static int rbd_req_sync_op(struct rbd_device *rbd_dev,
 	int ret;
 	struct page **pages;
 	int num_pages;
-	struct ceph_osd_req_op *ops = orig_ops;
-	u32 payload_len;
+
+	BUG_ON(ops == NULL);
 
 	num_pages = calc_pages_for(ofs, len);
 	pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL);
 	if (IS_ERR(pages))
 		return PTR_ERR(pages);
 
-	if (!orig_ops) {
-		payload_len = (flags & CEPH_OSD_FLAG_WRITE ? len : 0);
-		ret = -ENOMEM;
-		ops = rbd_create_rw_ops(1, opcode, payload_len);
-		if (!ops)
-			goto done;
-
-		if ((flags & CEPH_OSD_FLAG_WRITE) && buf) {
-			ret = ceph_copy_to_page_vector(pages, buf, ofs, len);
-			if (ret < 0)
-				goto done_ops;
-		}
-	}
-
 	ret = rbd_do_request(NULL, rbd_dev, snapc, snapid,
 			     object_name, ofs, len, NULL,
 			     pages, num_pages,
@@ -1056,14 +1041,11 @@ static int rbd_req_sync_op(struct rbd_device *rbd_dev,
 			     NULL, linger_req, ver);
 	if (ret < 0)
-		goto done_ops;
+		goto done;
 
 	if ((flags & CEPH_OSD_FLAG_READ) && buf)
 		ret = ceph_copy_from_page_vector(pages, buf, ofs, ret);
 
-done_ops:
-	if (!orig_ops)
-		rbd_destroy_ops(ops);
 done:
 	ceph_release_page_vector(pages, num_pages);
 	return ret;
@@ -1170,12 +1152,20 @@ static int rbd_req_sync_read(struct rbd_device *rbd_dev,
 			     char *buf,
 			     u64 *ver)
 {
-	return rbd_req_sync_op(rbd_dev, NULL,
+	struct ceph_osd_req_op *ops;
+	int ret;
+
+	ops = rbd_create_rw_ops(1, CEPH_OSD_OP_READ, 0);
+	if (!ops)
+		return -ENOMEM;
+
+	ret = rbd_req_sync_op(rbd_dev, NULL,
 			       snapid,
-			       CEPH_OSD_OP_READ,
 			       CEPH_OSD_FLAG_READ,
-			       NULL,
-			       object_name, ofs, len, buf, NULL, ver);
+			       ops, object_name, ofs, len, buf, NULL, ver);
+	rbd_destroy_ops(ops);
+
+	return ret;
 }
 
 /*
@@ -1257,7 +1247,6 @@ static int rbd_req_sync_watch(struct rbd_device *rbd_dev,
 	ret = rbd_req_sync_op(rbd_dev, NULL,
 			      CEPH_NOSNAP,
-			      0,
 			      CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK,
 			      ops,
 			      object_name, 0, 0, NULL,
@@ -1296,7 +1285,6 @@ static int rbd_req_sync_unwatch(struct rbd_device *rbd_dev,
 	ret = rbd_req_sync_op(rbd_dev, NULL,
 			      CEPH_NOSNAP,
-			      0,
 			      CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK,
 			      ops,
 			      object_name, 0, 0, NULL, NULL, NULL);
@@ -1353,7 +1341,6 @@ static int rbd_req_sync_notify(struct rbd_device *rbd_dev,
 	ret = rbd_req_sync_op(rbd_dev, NULL,
 			      CEPH_NOSNAP,
-			      0,
 			      CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK,
 			      ops,
 			      object_name, 0, 0, NULL, NULL, NULL);
@@ -1402,7 +1389,6 @@ static int rbd_req_sync_exec(struct rbd_device *rbd_dev,
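The ownership rule this patch establishes — the caller always allocates the ops array and always frees it, so the callee never conditionally frees memory it did not allocate — can be sketched in userspace C; all names below are stand-ins for the kernel functions, not the actual rbd API:

```c
#include <stdlib.h>

/* Stub standing in for struct ceph_osd_req_op. */
struct osd_req_op {
	int op;
};

/* Stand-in for rbd_req_sync_op() after the patch: it requires a
 * non-null ops array (kernel code uses BUG_ON(ops == NULL)) and
 * never frees it. */
static int do_sync_op(struct osd_req_op *ops)
{
	if (ops == NULL)
		return -22;	/* stand-in for -EINVAL / BUG_ON */
	return 0;		/* pretend the request succeeded */
}

/* Stand-in for rbd_req_sync_read() after the patch: allocation and
 * release both happen here, at the same level. */
static int sync_read(int opcode)
{
	struct osd_req_op *ops;
	int ret;

	ops = calloc(2, sizeof(*ops));	/* caller allocates... */
	if (!ops)
		return -12;		/* stand-in for -ENOMEM */
	ops[0].op = opcode;

	ret = do_sync_op(ops);
	free(ops);			/* ...and caller frees */
	return ret;
}
```

Keeping allocation and release at the same level is what lets the patch delete the `done_ops:` label and the `if (!orig_ops)` cleanup branch.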
[PATCH 12/12] rbd: fixes in rbd_header_from_disk()
This fixes a few issues in rbd_header_from_disk():

- The memcmp() call at the beginning of the function really looks at the text field of struct rbd_image_header_ondisk. While that field does lie at the beginning of the structure, the comparison should be done against the field, not the structure as a whole.
- There is a check intended to catch overflow, but it's wrong in two ways. First, the type we don't want to overflow is size_t, not unsigned int, and there is a SIZE_MAX we can use with that type. Second, we're allocating the snapshot ids and snapshot image sizes separately (each has type u64; on disk they are grouped together in the rbd_image_header_ondisk structure), so we can use the size of u64 in this overflow check.
- If there are no snapshots, then there should be no snapshot names. Enforce this, and issue a warning if we encounter a header with no snapshots but a non-zero snap_names_len.
- When saving the snapshot names into the header, be more direct in defining the offset in the on-disk structure from which they're being copied, by using snap_count rather than i in the array index.
- If an error occurs, the snapc and snap_names fields are freed at the end of the function. Make those fields null pointers after they're freed, to be explicit that they are no longer valid.
- Finally, move the definition of the local variable i to the innermost scope in which it's needed.
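The corrected overflow check described above can be sketched as a standalone predicate. This is an illustration of the arithmetic, not the kernel code: `snap_context_size` stands in for `sizeof(struct ceph_snap_context)`, and the allocation being guarded is `snap_context_size + snap_count * sizeof(u64)`.

```c
#include <stdint.h>
#include <stddef.h>

/* Reject any snap_count for which snap_context_size +
 * snap_count * sizeof(u64) would wrap past SIZE_MAX. Dividing
 * first avoids performing the overflowing multiplication. */
static int snap_count_fits(size_t snap_count, size_t snap_context_size)
{
	return snap_count <= (SIZE_MAX - snap_context_size) / sizeof(uint64_t);
}
```

Checking against SIZE_MAX rather than UINT_MAX matters because the allocation size is computed in size_t; on a 32-bit kernel the two bounds differ from a 64-bit one, and the guard must track the type actually used.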
Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   19 +++++++++++++------
 1 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6279186..573202a 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -488,14 +488,14 @@ static int rbd_header_from_disk(struct rbd_image_header *header,
 				 struct rbd_image_header_ondisk *ondisk,
 				 u32 allocated_snaps)
 {
-	u32 i, snap_count;
+	u32 snap_count;
 
-	if (memcmp(ondisk, RBD_HEADER_TEXT, sizeof(RBD_HEADER_TEXT)))
+	if (memcmp(ondisk->text, RBD_HEADER_TEXT, sizeof(RBD_HEADER_TEXT)))
 		return -ENXIO;
 
 	snap_count = le32_to_cpu(ondisk->snap_count);
-	if (snap_count > (UINT_MAX - sizeof(struct ceph_snap_context))
-			 / sizeof (*ondisk))
+	if (snap_count > (SIZE_MAX - sizeof(struct ceph_snap_context))
+			 / sizeof (u64))
 		return -EINVAL;
 	header->snapc = kmalloc(sizeof(struct ceph_snap_context) +
 				snap_count * sizeof(u64),
@@ -503,8 +503,8 @@ static int rbd_header_from_disk(struct rbd_image_header *header,
 	if (!header->snapc)
 		return -ENOMEM;
 
-	header->snap_names_len = le64_to_cpu(ondisk->snap_names_len);
 	if (snap_count) {
+		header->snap_names_len = le64_to_cpu(ondisk->snap_names_len);
 		header->snap_names = kmalloc(header->snap_names_len,
 					     GFP_KERNEL);
 		if (!header->snap_names)
@@ -514,6 +514,8 @@ static int rbd_header_from_disk(struct rbd_image_header *header,
 		if (!header->snap_sizes)
 			goto err_names;
 	} else {
+		WARN_ON(ondisk->snap_names_len);
+		header->snap_names_len = 0;
 		header->snap_names = NULL;
 		header->snap_sizes = NULL;
 	}
@@ -538,6 +540,8 @@ static int rbd_header_from_disk(struct rbd_image_header *header,
 	header->total_snaps = snap_count;
 
 	if (snap_count && allocated_snaps == snap_count) {
+		int i;
+
 		for (i = 0; i < snap_count; i++) {
 			header->snapc->snaps[i] =
 				le64_to_cpu(ondisk->snaps[i].id);
@@ -546,7 +550,7 @@ static int rbd_header_from_disk(struct rbd_image_header *header,
 		}
 
 		/* copy snapshot names */
-		memcpy(header->snap_names, &ondisk->snaps[i],
+		memcpy(header->snap_names, &ondisk->snaps[snap_count],
 		       header->snap_names_len);
 	}
 
@@ -556,8 +560,11 @@ err_sizes:
 	kfree(header->snap_sizes);
 err_names:
 	kfree(header->snap_names);
+	header->snap_names = NULL;
 err_snapc:
 	kfree(header->snapc);
+	header->snapc = NULL;
+
 	return -ENOMEM;
 }
-- 
1.7.5.4
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
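The free-then-null cleanup style the error path above adopts keeps a partially initialized structure from holding stale pointers. A hedged userspace sketch (struct header is a stub standing in for rbd_image_header; the error code is a stand-in for -ENOMEM):

```c
#include <stdlib.h>

/* Stub standing in for the relevant rbd_image_header fields. */
struct header {
	char *snap_names;
	void *snapc;
};

/* Error-path cleanup: free each field, then null it so the caller's
 * structure cannot be used to reach freed memory. */
static int header_cleanup(struct header *h)
{
	free(h->snap_names);
	h->snap_names = NULL;	/* explicit: field no longer valid */
	free(h->snapc);
	h->snapc = NULL;

	return -12;		/* stand-in for -ENOMEM */
}
```

Nulling after kfree() costs nothing and turns a potential use-after-free by the caller into an easily diagnosed null dereference.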
Re: [PATCH 1/4] rbd: don't use snapc->seq that way
On 07/19/2012 04:02 PM, Josh Durgin wrote:
On 07/19/2012 10:11 AM, Alex Elder wrote:
We now use rbd_dev->snap_id to record the snapshot id--using the special value SNAP_NONE to indicate the rbd_dev is not mapping a snapshot at all.

That's CEPH_NOSNAP, not SNAP_NONE, right? In any case,

Yes. I'll fix the description before I commit it...

Reviewed-by: Josh Durgin josh.dur...@inktank.com
Re: [PATCH 0/4] rbd: use snapc->seq the way the server does
On 07/19/2012 10:09 AM, Alex Elder wrote:
This series of patches changes the way the snap context seq field is used. Currently it is used in a way that isn't really useful, and as such is a bit confusing. This behavior seems to be a holdover from a time when there was no snap_id field maintained for an rbd_dev.

Summary:

[PATCH 1/4] rbd: don't use snapc->seq that way
    Removes special handling in __rbd_refresh_header() that ensured the seq field was updated to point to the head if it had been at the start of the function.

[PATCH 2/4] rbd: preserve snapc->seq in rbd_header_set_snap()
    Changes rbd_header_set_snap() so it doesn't set the seq field to the snapshot id (for a snapshot mapping) or the highest snapshot id (for the base image).

[PATCH 3/4] rbd: set snapc->seq only when refreshing header
    Assigns snapc->seq whenever an updated rbd image header is received, rather than when a new snapshot id has been assigned.

[PATCH 4/4] rbd: kill rbd_image_header->snap_seq
    Gets rid of the rbd_image_header->snap_seq field, which previously kept the same information now maintained in the snapc->seq field.

-Alex

The rest of the series looks good too.
Reviewed-by: Josh Durgin josh.dur...@inktank.com
[GIT PULL] Ceph fixes for 3.5
Hi Linus,

Please pull these last-minute fixes for Ceph from:

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

The important one fixes a bug in the socket failure handling behavior that was turned up in some recent failure injection testing. The other two are minor bug fixes.

Thanks!
sage

Dan Carpenter (1):
      rbd: endian bug in rbd_req_cb()

Sage Weil (1):
      libceph: fix messenger retry

Yan, Zheng (1):
      rbd: Fix ceph_snap_context size calculation

 drivers/block/rbd.c            |    4 ++--
 include/linux/ceph/messenger.h |   12 ++--
 net/ceph/messenger.c           |   12 ++--
 3 files changed, 10 insertions(+), 18 deletions(-)
making objdump useful
I finally figured out how to make objdump interleave the source code in the .ko file dumps on our qa machines. The problem is that the debug info references the path where the kernel was compiled (which is non-obvious since the info is compressed). For our environment, this is a quick workaround:

sudo mkdir -p /srv/autobuild-ceph/gitbuilder.git
sudo ln -s /path/to/your/kernel/tree /srv/autobuild-ceph/gitbuilder.git/build
objdump -rdS whatever.ko | less

Now, to figure out why kdb freezes on the 'dmesg' command...

sage
mkcephfs problem
I am trying to get Ceph running on an ARM system, currently one quad-core node running Ubuntu 12.04. It compiles fine, currently without tcmalloc and google perf tools, but I am running into a problem with mkcephfs. 'mkcephfs -a -c ceph.conf' didn't work, so I did it piece by piece until I got this:

root@ubuntu:/home/tim# mkcephfs -d /tmp/foo --prepare-mon
Building generic osdmap from /tmp/foo/conf
/usr/local/bin/osdmaptool: osdmap file '/tmp/foo/osdmap'
/usr/local/bin/osdmaptool: writing epoch 1 to /tmp/foo/osdmap
Generating admin key at /tmp/foo/keyring.admin
creating /tmp/foo/keyring.admin
Building initial monitor keyring
cat: /tmp/foo/key.*: No such file or directory

There are no key.* files in /tmp/foo:

# ls /tmp/foo
conf  keyring.admin  keyring.mon  monmap  osdmap

If you are running everything on one node, is it OK to use localhost and 127.0.0.1? Here is my config file:

[osd]
        osd journal size = 1000
        filestore xattr use omap = true

[mon.a]
        host = localhost
        mon addr = 127.0.0.1:6789

[osd.0]
        host = localhost

[mds.a]
        host = localhost

Here are the other configuration commands:

root@ubuntu:/home/tim# mkcephfs -c ceph.conf -d /tmp/foo --prepare-monmap
preparing monmap in /tmp/foo/monmap
/usr/local/bin/monmaptool --create --clobber --add a 127.0.0.1:6789 --print /tmp/foo/monmap
/usr/local/bin/monmaptool: monmap file /tmp/foo/monmap
/usr/local/bin/monmaptool: generated fsid 8e4f846b-1829-46e4-9902-5d8316c3fa4f
epoch 0
fsid 8e4f846b-1829-46e4-9902-5d8316c3fa4f
last_changed 2012-07-19 22:40:38.651045
created 2012-07-19 22:40:38.651045
0: 127.0.0.1:6789/0 mon.a
/usr/local/bin/monmaptool: writing epoch 0 to /tmp/foo/monmap (1 monitors)
root@ubuntu:/home/tim# mkcephfs -d /tmp/foo --init-local-daemons mds
root@ubuntu:/home/tim# mkcephfs -d /tmp/foo --init-local-daemons osd
root@ubuntu:/home/tim# mkcephfs -d /tmp/foo --prepare-mon
Building generic osdmap from /tmp/foo/conf
/usr/local/bin/osdmaptool: osdmap file '/tmp/foo/osdmap'
/usr/local/bin/osdmaptool: writing epoch 1 to /tmp/foo/osdmap
Generating admin key at /tmp/foo/keyring.admin
creating /tmp/foo/keyring.admin
Building initial monitor keyring
cat: /tmp/foo/key.*: No such file or directory
root@ubuntu:/home/tim#

Any ideas? Which command was supposed to generate the key.* files?

Tim
Re: Poor read performance in KVM
On 20/07/2012 1:22 AM, Tommi Virtanen wrote:
On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev vladi...@bashkirtsev.com wrote:
Looks like osd.0 performs with low latency but osd.1 latency is way too high; on average it appears as 200ms. The osds are backed by btrfs over LVM2. Maybe the issue lies in the backing fs selection? All four osds running similar

Our typical experience with btrfs right now seems to be that it works fast when the filesystem is fresh, but as it ages, it starts to have higher and higher delays on syncing writes. This does not seem to be completely deterministic; that is, if you run many btrfs'es, the symptoms start to appear at different times on different instances.

I would suspect it as well, but all the btrfs filesystems are relatively fresh, as we have moved ceph from a test bed to a production environment.
Re: How to compile Java-Rados.
Hi Noah,

Thank you for the fixes and suggestions on compiling java-rados; it's working fine now.

Thanks,
Ramu.
Re: Poor read performance in KVM
We are seeing degradation at 64k node/leaf sizes as well. So far the degradation is most obvious with small writes. It affects XFS as well, though not as severely. We are vigorously looking into it. :)

Just confirming that one of our clients has run a fair amount (on the gigabyte scale) of mysql updates in the preceding few days. So it appears all the performance degradation stems from random writes. Perhaps I will rebuild each osd to check whether that helps it recover.
Re: Poor read performance in KVM
What node/leaf size are you using on your btrfs volume?

Default 4K.
Re: Poor read performance in KVM
Yes, they can hold up reads to the same object. Depending on where they're stuck, they may be blocking other requests as well, if they're e.g. taking up all the filestore threads. Waiting for subops means they're waiting for replicas to acknowledge the write and commit it to disk, so the real cause of the slowness of those ops is the replicas. If you enable 'debug osd = 25', 'debug filestore = 25', and 'debug journal = 20', you can trace through the logs to see exactly what's happening with the subops for those requests.

Looks like I hit exactly the same issue as described in "Slow request warnings on 0.48", but from a different angle. As our client ran mysql updates, performance started to degrade across the cluster, bringing the rest of the VMs to a standstill and producing incredible latency. At some point slow request warnings started to pop up, and now it seems I cannot get rid of them at all: I have shut down all clients and all ceph subsystems and restarted everything, and it is back to the same behaviour - slow request warnings. Before rebuilding the osds I will enable debug as you suggested, in an attempt to find the underlying issue. Then I will rebuild the osds as a measure of last resort, to make sure it is indeed the osds causing the issue.