'Racing read got wrong version' during proxy write testing
Hi all, I ran into a problem during the teuthology test of proxy write. It goes like this:

- Client sends 3 writes and a read on the same object to the base tier
- Cache tiering is set up
- Client retries the ops and sends the 3 writes and 1 read to the cache tier
- The 3 writes had finished on the base tier, say with versions v1, v2 and v3
- Cache tier proxies the 1st write, and starts to promote the object for the 2nd write; the 2nd and 3rd writes and the read are blocked
- The proxied 1st write finishes on the base tier with version v4 and returns to the cache tier. But the cache tier fails to send the reply due to socket failure injection
- Client retries the writes and the read again; the writes are identified as dup ops
- The promotion finishes; it copies the pg_log entries from the base tier and puts them in the cache tier's pg_log. This includes the 3 writes on the base tier and the proxied write
- The writes dispatch after the promotion and are identified as completed dup ops. The cache tier replies to these write ops with the versions from the base tier (v1, v2 and v3)
- Finally, the read dispatches; it reads the version of the proxied write (v4) and replies to the client
- Client complains 'racing read got wrong version'

In a previous discussion of the 'ops not idempotent' problem, we solved it by copying the pg_log entries from the base tier to the cache tier during promotion. It seems there is still a problem with this approach in the above scenario.

My first thought is that when proxying the write, the cache tier should use the original reqid from the client. But currently we don't have a way to pass the original reqid from cache to base. Any ideas?
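To make the failure concrete, here is a minimal standalone sketch (plain C with made-up reqids and types, not Ceph code) of the completed-dup check described above: the retried writes are answered from the copied pg_log (v1..v3), but the proxied write was logged under the cache tier's reqid rather than the client's, so the object head it produced (v4) is what the read sees.

#include <stdio.h>

struct reqid { unsigned long client, tid; };
struct log_entry { struct reqid r; unsigned long version; };

/* pg_log after promotion: the three base-tier writes plus the proxied
 * write, which was logged under the cache tier's reqid, not the client's */
static const struct log_entry pg_log[] = {
	{ { 4130, 101 }, 1 },	/* client's 1st write, v1 */
	{ { 4130, 102 }, 2 },	/* client's 2nd write, v2 */
	{ { 4130, 103 }, 3 },	/* client's 3rd write, v3 */
	{ { 9999, 555 }, 4 },	/* proxied 1st write, v4, cache tier's reqid */
};

/* completed-dup check: a retried op whose reqid is already in the log is
 * answered with the logged version instead of being re-applied */
static int lookup_dup(struct reqid r, unsigned long *version)
{
	for (size_t i = 0; i < sizeof(pg_log) / sizeof(pg_log[0]); i++) {
		if (pg_log[i].r.client == r.client && pg_log[i].r.tid == r.tid) {
			*version = pg_log[i].version;
			return 1;
		}
	}
	return 0;
}

int main(void)
{
	const unsigned long head_version = 4;	/* object head after the proxied write */
	struct reqid retry = { 4130, 103 };	/* client retries its 3rd write */
	unsigned long v;

	if (lookup_dup(retry, &v))
		printf("write reply: v%lu (from pg_log)\n", v);		/* v3 */
	printf("read reply:  v%lu (from object head)\n", head_version);	/* v4 */
	return 0;
}

The client's retried 1st write matches the v1 log entry, never the v4 one, which is exactly why the read's v4 looks like a racing write to the client.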
preparing v0.80.11
Hi Nathan,

Although v0.80.10 is not out yet, the odds of discovering a problem that would require an additional backport are low. I think you can start with v0.80.11 without further delay :-)

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
Cache pool on top of ec base pool teuthology test
We have a bunch of teuthology tests which build a cache pool on top of an ec base pool and do partial object writes. This is ok with the current cache tiering implementation, but with proxy write it won't work. In my testing, the error message is something like the below:

2015-05-20 05:51:42.896828 7f682eed4700 10 osd.5 pg_epoch: 183 pg[2.1( v 183'7978 (179'7799,183'7978] local-les=172 n=1258 ec=8 les/c 172/172 170/171/171) [5,0] r=0 lpr=171 luod=183'7975 crt=183'7974 lcod 183'7974 mlcod 183'7974 active+clean] do_proxy_write Start proxy write for osd_op(client.4130.0:32967 plana4222147-6594 [write 733117~408660] 2.3ed99ff9 ack+ondisk+write+known_if_redirected e183) v5
2015-05-20 05:51:42.899958 7f682c6cf700 1 -- 10.214.132.32:6808/20666 --> 10.214.132.32:0/20666 -- osd_op_reply(17556 plana4222147-6594 [write 733117~408660] v0'0 uv0 ondisk = -95 ((95) Operation not supported)) v6 -- ?+0 0xa2e23c0 con 0x9355760

What should we do with these tests?
Re: preparing v0.80.11
> Although v0.80.10 is not out yet, the odds of discovering a problem
> that would require an additional backport are low. I think you can
> start with v0.80.11 without further delay :-)

As soon as I get over the flu :-(

Nathan
Re: [PATCH 3/5] libceph: a couple tweaks for wait loops
On Thu, May 21, 2015 at 4:29 PM, Alex Elder <el...@ieee.org> wrote:
> On 05/21/2015 07:35 AM, Ilya Dryomov wrote:
>> - return -ETIMEDOUT instead of -EIO in case of timeout
>> - wait_event_interruptible_timeout() returns time left until timeout
>>   and since it can be almost LONG_MAX we had better assign it to long
>
> Any error returned by wait_event_interruptible_timeout() can now be
> returned by __ceph_open_session(). It looks like that may, in fact,
> be only -EINTR and -ERESTARTSYS. But it's a change you could note in
> the log message.

I think it's just -ERESTARTSYS so I didn't bother.

> It turns out the only caller ignores the return value of
> ceph_monc_wait_osdmap() anyway. That should maybe be fixed.

That's on purpose - rbd map tries to wait for a new enough osdmap only if the pool that the image is supposed to be in doesn't exist and we know we have a stale osdmap. We ignore wait retval because if we timeout we should return this pool doesn't exist, not -ETIMEDOUT.

Thanks,

                Ilya
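For reference, the pattern the patch description implies looks roughly like this (a kernel-style sketch, not the actual libceph diff; the helper name and wait condition are made up):

#include <linux/types.h>
#include <linux/wait.h>
#include <linux/errno.h>

/* sketch: wait for *done with a timeout, libceph-style */
static int wait_with_timeout(wait_queue_head_t *wq, bool *done,
			     unsigned long timeout)
{
	long ret;	/* can be almost LONG_MAX, so a long, not an int */

	ret = wait_event_interruptible_timeout(*wq, *done, timeout);
	if (ret < 0)
		return ret;		/* -ERESTARTSYS: interrupted */
	if (ret == 0)
		return -ETIMEDOUT;	/* timed out: not -EIO */
	return 0;			/* condition true; ret was time left */
}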
Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely
On Fri, May 22, 2015 at 11:18:40AM -0700, Ming Lin wrote:
> From: Kent Overstreet <kent.overstr...@gmail.com>
>
> As generic_make_request() is now able to handle arbitrarily sized bios,
> it's no longer necessary for each individual block driver to define its
> own ->merge_bvec_fn() callback. Remove every invocation completely.

It might be good to replace patch 1 and this one by a patch per driver to remove the merge_bvec_fn instance and add the blk_queue_split call for all those drivers that actually had a ->merge_bvec_fn. As some of them were non-trivial, attention from the maintainers would be helpful, and a patch per driver might help with that.

> -/* This is called by bio_add_page().
> - *
> - * q->max_hw_sectors and other global limits are already enforced there.
> - *
> - * We need to call down to our lower level device,
> - * in case it has special restrictions.
> - *
> - * We also may need to enforce configured max-bio-bvecs limits.
> - *
> - * As long as the BIO is empty we have to allow at least one bvec,
> - * regardless of size and offset, so no need to ask lower levels.
> - */
> -int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec)

This just checks the lower device, so it looks obviously fine.

> -static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
> -			  struct bio_vec *bvec)
> -{
> -	struct pktcdvd_device *pd = q->queuedata;
> -	sector_t zone = get_zone(bmd->bi_sector, pd);
> -	int used = ((bmd->bi_sector - zone) << 9) + bmd->bi_size;
> -	int remaining = (pd->settings.size << 9) - used;
> -	int remaining2;
> -
> -	/*
> -	 * A bio <= PAGE_SIZE must be allowed. If it crosses a packet
> -	 * boundary, pkt_make_request() will split the bio.
> -	 */
> -	remaining2 = PAGE_SIZE - bmd->bi_size;
> -	remaining = max(remaining, remaining2);
> -
> -	BUG_ON(remaining < 0);
> -	return remaining;
> -}

As mentioned in the comment, pkt_make_request will split the bio, so pkt looks fine.

> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index ec6c5c6..f50edb3 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -3440,52 +3440,6 @@ static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	return BLK_MQ_RQ_QUEUE_OK;
>  }
>
> -/*
> - * a queue callback. Makes sure that we don't create a bio that spans across
> - * multiple osd objects. One exception would be with a single page bios,
> - * which we handle later at bio_chain_clone_range()
> - */
> -static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
> -			  struct bio_vec *bvec)

It seems rbd handles requests spanning objects just fine, so I don't really understand why rbd_merge_bvec even exists. Getting some form of ACK from the ceph folks would be useful.

> -/*
> - * We assume I/O is going to the origin (which is the volume
> - * more likely to have restrictions e.g. by being striped).
> - * (Looking up the exact location of the data would be expensive
> - * and could always be out of date by the time the bio is submitted.)
> - */
> -static int cache_bvec_merge(struct dm_target *ti,
> -			    struct bvec_merge_data *bvm,
> -			    struct bio_vec *biovec, int max_size)
> -{

DM seems to have the most complex merge functions of all drivers, so I'd really love to see an ACK from Mike.
ceph branch status
-- All Branches --

Adam Crume <adamcr...@gmail.com>
  2014-12-01 20:45:58 -0800  wip-doc-rbd-replay

Alfredo Deza <ad...@redhat.com>
  2015-03-23 16:39:48 -0400  wip-11212
  2015-03-25 10:10:43 -0400  wip-11065

Alfredo Deza <alfredo.d...@inktank.com>
  2014-07-08 13:58:35 -0400  wip-8679
  2014-09-04 13:58:14 -0400  wip-8366
  2014-10-13 11:10:10 -0400  wip-9730

Boris Ranto <bra...@redhat.com>
  2015-04-13 13:51:32 +0200  wip-fix-ceph-dencoder-build
  2015-04-14 13:51:49 +0200  wip-fix-ceph-dencoder-build-master
  2015-05-15 15:26:05 +0200  wip-selinux-policy

Chi Xinze <xmdx...@gmail.com>
  2015-05-15 21:47:44 +0000  XinzeChi-wip-ec-read

Dan Mick <dan.m...@inktank.com>
  2013-07-16 23:00:06 -0700  wip-5634

Danny Al-Gaaf <danny.al-g...@bisect.de>
  2015-04-23 16:32:00 +0200  wip-da-SCA-20150421
  2015-04-23 17:18:57 +0200  wip-nosetests
  2015-04-23 18:20:16 +0200  wip-unify-num_objects_degraded
  2015-04-30 12:34:08 +0200  wip-user
  2015-05-22 18:31:01 +0200  wip-da-SCA-20150427

David Zafman <dzaf...@redhat.com>
  2014-08-29 10:41:23 -0700  wip-libcommon-rebase
  2014-11-26 09:41:50 -0800  wip-9403
  2014-12-02 21:20:17 -0800  wip-zafman-docfix
  2015-01-08 15:07:45 -0800  wip-vstart-kvs
  2015-02-20 16:13:43 -0800  wip-10883-firefly
  2015-02-20 16:14:57 -0800  wip-10883-dumpling
  2015-04-23 10:22:09 -0700  wip-11454-80.8
  2015-04-24 13:14:23 -0700  wip-cot-giant
  2015-05-21 15:29:48 -0700  wip-11511
  2015-05-22 19:42:28 -0700  wip-10794

Dongmao Zhang <deanracc...@gmail.com>
  2014-11-14 19:14:34 +0800  thesues-master

Greg Farnum <gfar...@redhat.com>
  2015-04-19 18:03:41 -0700  greg-testing-quota-full
  2015-04-29 21:44:11 -0700  wip-init-names
  2015-05-15 10:38:22 -0700  greg-fs-testing

Greg Farnum <g...@inktank.com>
  2014-10-23 13:33:44 -0700  wip-forward-scrub

Gregory Meno <gm...@redhat.com>
  2015-02-25 17:30:33 -0800  wip-fix-typo-troubleshooting

Guang Yang <ygu...@yahoo-inc.com>
  2014-08-08 10:41:12 +0000  wip-guangyy-pg-splitting
  2014-09-25 00:47:46 +0000  wip-9008
  2014-09-30 10:36:39 +0000  guangyy-wip-9614

Haomai Wang <haomaiw...@gmail.com>
  2014-07-27 13:37:49 +0800  wip-flush-set
  2015-04-20 00:47:59 +0800  update-organization
  2015-04-20 00:48:42 +0800  update-organization-1

Ilya Dryomov <idryo...@gmail.com>
  2015-05-22 17:44:55 +0300  wip-crush-ruleset-name
  2015-05-25 15:41:51 +0300  wip-mount-timeout-doc

Ilya Dryomov <ilya.dryo...@inktank.com>
  2014-09-05 16:15:10 +0400  wip-rbd-notify-errors

James Page <james.p...@ubuntu.com>
  2013-02-27 22:50:38 +0000  wip-debhelper-8

Jason Dillaman <dilla...@redhat.com>
  2015-04-13 19:36:27 -0400  wip-11056
  2015-04-14 14:55:50 -0400  wip-11056-firefly
  2015-05-12 11:04:36 -0400  wip-librbd-helgrind
  2015-05-15 11:19:33 -0400  wip-11537
  2015-05-21 16:27:18 -0400  wip-11579
  2015-05-22 00:52:20 -0400  wip-11625

Jenkins <jenk...@inktank.com>
  2014-07-29 05:24:39 -0700  wip-nhm-hang
  2015-02-02 10:35:28 -0800  wip-sam-v0.92
  2015-04-13 13:24:40 -0700  rhcs-v0.80.8

Joao Eduardo Luis <jec...@gmail.com>
  2014-09-10 09:39:23 +0100  wip-leveldb-get.dumpling

Joao Eduardo Luis <joao.l...@gmail.com>
  2014-07-22 15:41:42 +0100  wip-leveldb-misc

Joao Eduardo Luis <joao.l...@inktank.com>
  2014-09-02 17:19:52 +0100  wip-leveldb-get
  2014-10-17 16:20:11 +0100  wip-paxos-fix
  2014-10-21 21:32:46 +0100  wip-9675.dumpling

Joao Eduardo Luis <j...@redhat.com>
  2014-11-17 16:43:53 +0000  wip-mon-osdmap-cleanup
  2014-12-15 16:18:56 +0000  wip-giant-mon-backports
  2014-12-17 17:13:57 +0000  wip-mon-backports.firefly
  2014-12-17 23:15:10 +0000  wip-mon-sync-fix.dumpling
  2015-01-07 23:01:00 +0000  wip-mon-blackhole-mlog-0.87.7
  2015-01-10 02:40:42 +0000  wip-dho-joao
  2015-01-10 02:46:31 +0000  wip-mon-paxos-fix
  2015-01-26 13:00:09 +0000  wip-mon-datahealth-fix
  2015-02-04 22:36:14 +0000  wip-10643
  2015-02-26 14:54:01 +0000  wip-10507

Joao Eduardo Luis <j...@suse.de>
  2015-05-05 12:24:28 +0100  wip-mon-scrub
  2015-05-12 16:04:23 +0100  wip-11545

John Spray <john.sp...@redhat.com>
  2015-02-18 14:04:18 +0000  wip10649
  2015-04-06 17:25:02 +0100  wip-progress-events
  2015-05-05 14:29:16 +0100  wip-live-query
  2015-05-06 16:21:25 +0100  wip-11541-hammer-workaround
  2015-05-07 14:23:37 +0100  wip-11504
  2015-05-13 13:48:41 +0100  wip-9963
  2015-05-13 18:55:04 +0100  wip-9964
  2015-05-22 10:48:32 +0100  wip-offline-backward
  2015-05-22 14:08:04 +0100  wip-blacklist-json
  2015-05-22 14:17:20 +0100  wip-damaged-fixes
Re: [PATCH 3/5] libceph: a couple tweaks for wait loops
On 05/25/2015 05:38 AM, Ilya Dryomov wrote:
> On Thu, May 21, 2015 at 4:29 PM, Alex Elder <el...@ieee.org> wrote:
>> On 05/21/2015 07:35 AM, Ilya Dryomov wrote:
>>> - return -ETIMEDOUT instead of -EIO in case of timeout
>>> - wait_event_interruptible_timeout() returns time left until timeout
>>>   and since it can be almost LONG_MAX we had better assign it to long
>>
>> Any error returned by wait_event_interruptible_timeout() can now be
>> returned by __ceph_open_session(). It looks like that may, in fact,
>> be only -EINTR and -ERESTARTSYS. But it's a change you could note in
>> the log message.
>
> I think it's just -ERESTARTSYS so I didn't bother.

My point was almost a little more philosophical. It's conceivable (though not likely) that wait_event_interruptible_timeout() could be changed to return a value that your caller here does not expect.

>> It turns out the only caller ignores the return value of
>> ceph_monc_wait_osdmap() anyway. That should maybe be fixed.
>
> That's on purpose - rbd map tries to wait for a new enough osdmap
> only if the pool that the image is supposed to be in doesn't exist
> and we know we have a stale osdmap. We ignore wait retval because if
> we timeout we should return this pool doesn't exist, not -ETIMEDOUT.

Yes, I realize that. This second part of my response was following on to my previous thought. That is, the caller might get a different return value that it didn't expect; but since the only caller ignores what gets returned, it's a moot point.

                    -Alex
Re: 'Racing read got wrong version' during proxy write testing
On Mon, 25 May 2015, Wang, Zhiqiang wrote:
> Hi all, I ran into a problem during the teuthology test of proxy
> write. It goes like this:
>
> [...]
>
> In a previous discussion of the 'ops not idempotent' problem, we
> solved it by copying the pg_log entries from the base tier to the
> cache tier during promotion. It seems there is still a problem with
> this approach in the above scenario.
>
> My first thought is that when proxying the write, the cache tier
> should use the original reqid from the client. But currently we don't
> have a way to pass the original reqid from cache to base. Any ideas?

I agree--I think the correct fix here is to make the proxied op be recognized as a dup. We can either do that by passing in an optional reqid to the Objecter, or extending the op somehow so that both reqids are listed. I think the first option will be cleaner, but I think we will also need to make sure the 'retry' count is preserved as (I think) we skip the dup check if retry==0. And we probably want to preserve the behavior that a given (reqid, retry) only exists once in the system. This probably means adding more optional args to Objecter::read()...?

sage
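A rough sketch of the shape of the first option (hypothetical field and function names throughout -- the real change would be to the Objecter's C++ API):

#include <stdbool.h>
#include <stdio.h>

struct reqid { unsigned long client, tid; };

/* hypothetical op descriptor, not the real Objecter interface */
struct proxied_op {
	struct reqid proxier;	/* cache tier's own reqid for the proxied op */
	struct reqid orig;	/* client's original reqid, passed through */
	int retry;		/* preserved retry count: dup check is skipped at 0 */
	bool has_orig;		/* optional arg: unset for non-proxied ops */
};

/* the base tier logs and dup-checks against the original identity */
static struct reqid effective_reqid(const struct proxied_op *op)
{
	return op->has_orig ? op->orig : op->proxier;
}

int main(void)
{
	struct proxied_op op = {
		.proxier = { 9999, 555 },
		.orig = { 4130, 101 },	/* client's 1st write */
		.retry = 1,
		.has_orig = true,
	};
	struct reqid r = effective_reqid(&op);

	/* the pg_log entry for v4 would now carry the client's reqid, so the
	 * client's retried 1st write is recognized as a completed dup */
	printf("log under client.%lu:%lu retry=%d\n", r.client, r.tid, op.retry);
	return 0;
}

Keeping (reqid, retry) unique in the system, as Sage notes, is what makes it safe for the base tier to answer the dup from its log.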
Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely
On Mon, May 25, 2015 at 5:04 PM, Christoph Hellwig <h...@lst.de> wrote:
> On Fri, May 22, 2015 at 11:18:40AM -0700, Ming Lin wrote:
>> From: Kent Overstreet <kent.overstr...@gmail.com>
>>
>> As generic_make_request() is now able to handle arbitrarily sized bios,
>> it's no longer necessary for each individual block driver to define its
>> own ->merge_bvec_fn() callback. Remove every invocation completely.
>
> It might be good to replace patch 1 and this one by a patch per driver
> to remove the merge_bvec_fn instance and add the blk_queue_split call
> for all those drivers that actually had a ->merge_bvec_fn. As some of
> them were non-trivial, attention from the maintainers would be helpful,
> and a patch per driver might help with that.
>
> [...]
>
>> -/*
>> - * a queue callback. Makes sure that we don't create a bio that spans across
>> - * multiple osd objects. One exception would be with a single page bios,
>> - * which we handle later at bio_chain_clone_range()
>> - */
>> -static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd,
>> -			  struct bio_vec *bvec)
>
> It seems rbd handles requests spanning objects just fine, so I don't
> really understand why rbd_merge_bvec even exists. Getting some form
> of ACK from the ceph folks would be useful.

I'm not Alex, but yeah, we have all the clone/split machinery and so we can handle a spanning case just fine. I think rbd_merge_bvec() exists to make sure we don't have to do that unless it's really necessary - like when a single page gets submitted at an inconvenient offset.

I have a patch that adds a blk_queue_chunk_sectors(object_size) call to rbd_init_disk() but I haven't had a chance to play with it yet. In any case, we should be fine with getting rid of rbd_merge_bvec(). If this ends up a per-driver patchset, I can make rbd_merge_bvec() -> blk_queue_chunk_sectors() a single patch and push it through ceph-client.git.

Thanks,

                Ilya
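For what it's worth, the change Ilya describes might look roughly like this (a sketch, not his actual patch; it assumes rbd_init_disk() has the request_queue in hand, and relies on the object size 1 << obj_order being a power of two, which blk_queue_chunk_sectors() requires):

#include <linux/blkdev.h>

/* sketch: called from rbd_init_disk() once the queue is set up */
static void rbd_limit_bios_to_objects(struct request_queue *q,
				      struct rbd_device *rbd_dev)
{
	u32 obj_bytes = 1U << rbd_dev->header.obj_order;	/* 4M by default */

	/* have the block layer split bios at osd object boundaries,
	 * making rbd_merge_bvec() unnecessary */
	blk_queue_chunk_sectors(q, obj_bytes >> 9);
}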
Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely
On Mon, May 25, 2015 at 06:02:30PM +0300, Ilya Dryomov wrote:
> I'm not Alex, but yeah, we have all the clone/split machinery and so
> we can handle a spanning case just fine. I think rbd_merge_bvec()
> exists to make sure we don't have to do that unless it's really
> necessary - like when a single page gets submitted at an inconvenient
> offset.
>
> I have a patch that adds a blk_queue_chunk_sectors(object_size) call
> to rbd_init_disk() but I haven't had a chance to play with it yet.
> In any case, we should be fine with getting rid of rbd_merge_bvec().
> If this ends up a per-driver patchset, I can make rbd_merge_bvec() ->
> blk_queue_chunk_sectors() a single patch and push it through
> ceph-client.git.

Hmm, looks like the new blk_queue_split_bio ignores the chunk_sectors value, another thing that needs updating. I forgot how many weird merging hacks we had to add for nvme..

While I'd like to see per-driver patches we'd still need to merge them together through the block tree. Note that with this series there won't be any benefit of using blk_queue_chunk_sectors over just doing the split in rbd. Maybe we can even remove it again and do that work in the drivers in the future.
Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely
On 05/25/2015 10:02 AM, Ilya Dryomov wrote:
> On Mon, May 25, 2015 at 5:04 PM, Christoph Hellwig <h...@lst.de> wrote:
> [...]
>> It seems rbd handles requests spanning objects just fine, so I don't
>> really understand why rbd_merge_bvec even exists. Getting some form
>> of ACK from the ceph folks would be useful.
>
> I'm not Alex, but yeah, we have all the clone/split machinery and so
> we can handle a spanning case just fine. I think rbd_merge_bvec()
> exists to make sure we don't have to do that unless it's really
> necessary - like when a single page gets submitted at an inconvenient
> offset.

I am Alex. This is something I never removed. I haven't looked at it closely now, but it seems to me that after I created a function that split stuff properly up *before* the BIO layer got to it (which has since been replaced by code related to Kent's immutable BIO work), there has been no need for this function. Removing this was on a long-ago to-do list--but I didn't want to do it without spending some time ensuring it wouldn't break anything.

If you want me to work through it in more detail so I can give a more certain response, let me know and I will do so.

                    -Alex

> I have a patch that adds a blk_queue_chunk_sectors(object_size) call
> to rbd_init_disk() but I haven't had a chance to play with it yet.
> In any case, we should be fine with getting rid of rbd_merge_bvec().
> If this ends up a per-driver patchset, I can make rbd_merge_bvec() ->
> blk_queue_chunk_sectors() a single patch and push it through
> ceph-client.git.
Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely
On Mon, May 25, 2015 at 6:08 PM, Christoph Hellwig <h...@lst.de> wrote:
> On Mon, May 25, 2015 at 06:02:30PM +0300, Ilya Dryomov wrote:
> [...]
>
> Hmm, looks like the new blk_queue_split_bio ignores the chunk_sectors
> value, another thing that needs updating. I forgot how many weird
> merging hacks we had to add for nvme..
>
> While I'd like to see per-driver patches we'd still need to merge them
> together through the block tree. Note that with this series there
> won't be any benefit of using blk_queue_chunk_sectors over just doing
> the split in rbd. Maybe we can even remove it again and do that work
> in the drivers in the future.

OK, I'll drop it, especially if it's potentially on its way out. With the fancy striping support, which I'll hopefully get to sometime, the striping pattern will become much more complicated anyway, so relying on rbd doing bio splitting is right in the long run as well.

Thanks,

                Ilya
Re: Cache pool on top of ec base pool teuthology test
On Mon, 25 May 2015, Wang, Zhiqiang wrote:
> We have a bunch of teuthology tests which build a cache pool on top of
> an ec base pool and do partial object writes. This is ok with the
> current cache tiering implementation, but with proxy write it won't
> work. In my testing, the error message is something like the below:
>
> 2015-05-20 05:51:42.896828 7f682eed4700 10 osd.5 pg_epoch: 183 pg[2.1( v 183'7978 (179'7799,183'7978] local-les=172 n=1258 ec=8 les/c 172/172 170/171/171) [5,0] r=0 lpr=171 luod=183'7975 crt=183'7974 lcod 183'7974 mlcod 183'7974 active+clean] do_proxy_write Start proxy write for osd_op(client.4130.0:32967 plana4222147-6594 [write 733117~408660] 2.3ed99ff9 ack+ondisk+write+known_if_redirected e183) v5
> 2015-05-20 05:51:42.899958 7f682c6cf700 1 -- 10.214.132.32:6808/20666 --> 10.214.132.32:0/20666 -- osd_op_reply(17556 plana4222147-6594 [write 733117~408660] v0'0 uv0 ondisk = -95 ((95) Operation not supported)) v6 -- ?+0 0xa2e23c0 con 0x9355760
>
> What should we do with these tests?

I think the test is fine.. but the OSD should refuse to proxy the write if the base tier won't support the write operation in question. I believe we recently renamed one of the helpers supports_omap()... we probably need a similar set of helpers for object overwrites? Or, we can make a method like should_proxy_to_ec() that scans through the op vector and makes a conservative judgement of whether it is safe to proxy...

sage
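A conservative version of that helper might look like the following sketch (hypothetical op codes and types; the real code would scan the vector<OSDOp> in C++):

#include <stdbool.h>
#include <stdio.h>

struct osd_op { int opcode; unsigned long off, len; };

enum { OP_READ, OP_WRITE, OP_WRITEFULL, OP_OMAP_SETVALS };

/* can an ec base tier apply this single op? */
static bool ec_can_apply(const struct osd_op *op, unsigned long obj_size)
{
	switch (op->opcode) {
	case OP_READ:
		return true;
	case OP_WRITEFULL:
		return true;	/* whole-object overwrite is fine */
	case OP_WRITE:
		/* no partial overwrites on ec pools: allow only a full rewrite */
		return op->off == 0 && op->len >= obj_size;
	default:
		return false;	/* omap etc.: be conservative */
	}
}

static bool should_proxy_to_ec(const struct osd_op *ops, int n,
			       unsigned long obj_size)
{
	for (int i = 0; i < n; i++)
		if (!ec_can_apply(&ops[i], obj_size))
			return false;	/* don't proxy; force a promotion */
	return true;
}

int main(void)
{
	/* the partial write from the log above: [write 733117~408660] */
	struct osd_op partial = { OP_WRITE, 733117, 408660 };

	printf("proxy? %d\n", should_proxy_to_ec(&partial, 1, 4194304)); /* 0 */
	return 0;
}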
RE: Cache pool on top of ec base pool teuthology test
OK, so the cache tier OSD checks if it's able to proxy the write op, and if not, it forces a promotion.

Btw, I don't find the supports_omap helper in the current master. Do you mean the 'CEPH_OSD_COPY_GET_FLAG_NOTSUPP_OMAP' flag?

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Monday, May 25, 2015 11:04 PM
To: Wang, Zhiqiang
Cc: ceph-devel@vger.kernel.org
Subject: Re: Cache pool on top of ec base pool teuthology test

On Mon, 25 May 2015, Wang, Zhiqiang wrote:
> [...]

I think the test is fine.. but the OSD should refuse to proxy the write if the base tier won't support the write operation in question. I believe we recently renamed one of the helpers supports_omap()... we probably need a similar set of helpers for object overwrites? Or, we can make a method like should_proxy_to_ec() that scans through the op vector and makes a conservative judgement of whether it is safe to proxy...

sage
Re: [PATCH 2/5] Mon: expose commands for temperature related setting
Thanks for reviewing. We will update and add the Reviewed-by after the patch is accepted; the follow-up discussion is at https://github.com/ceph/ceph/pull/4737

On 2015/5/22 8:55, Joao Eduardo Luis wrote:
> As far as I can tell, this patch can be split in two different patches:
>
> - add hit_set_grade_decay_rate option to 'osd pool set/get'
> - add 'osd tier cache-measure'
>
> Also, for the latter we could also use an explanatory commit message.
>
> Aside from that, I don't see anything obviously wrong with the patch.
>
> -Joao
>
> On 05/21/2015 02:34 PM, Li Wang wrote:
>> From: MingXin Liu <mingxin...@ubuntukylin.com>
>>
>> Signed-off-by: MingXin Liu <mingxin...@ubuntukylin.com>
>> Reviewed-by: Li Wang <liw...@ubuntukylin.com>
>> ---
>>  src/mon/MonCommands.h |  8 +++--
>>  src/mon/OSDMonitor.cc | 87 ---
>>  2 files changed, 88 insertions(+), 7 deletions(-)
>>
>> diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
>> index 8a36807..b26834d 100644
>> --- a/src/mon/MonCommands.h
>> +++ b/src/mon/MonCommands.h
>> @@ -639,11 +639,11 @@ COMMAND(osd pool rename \
>>  	rename srcpool to destpool, osd, rw, cli,rest)
>>  COMMAND(osd pool get \
>>  	name=pool,type=CephPoolname \
>> -	name=var,type=CephChoices,strings=size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|auid|target_max_objects|target_max_bytes|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|erasure_code_profile|min_read_recency_for_promote|write_fadvise_dontneed|all, \
>> +	name=var,type=CephChoices,strings=size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|auid|target_max_objects|target_max_bytes|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|erasure_code_profile|min_read_recency_for_promote|write_fadvise_dontneed|hit_set_grade_decay_rate|all, \
>>  	get pool parameter <var>, osd, r, cli,rest)
>>  COMMAND(osd pool set \
>>  	name=pool,type=CephPoolname \
>> -	name=var,type=CephChoices,strings=size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|nodelete|nopgchange|nosizechange|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid|min_read_recency_for_promote|write_fadvise_dontneed \
>> +	name=var,type=CephChoices,strings=size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|nodelete|nopgchange|nosizechange|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid|min_read_recency_for_promote|write_fadvise_dontneed|hit_set_grade_decay_rate \
>>  	name=val,type=CephString \
>>  	name=force,type=CephChoices,strings=--yes-i-really-mean-it,req=false, \
>>  	set pool parameter <var> to <val>, osd, rw, cli,rest)
>> @@ -695,6 +695,10 @@ COMMAND(osd tier cache-mode \
>>  	name=pool,type=CephPoolname \
>>  	name=mode,type=CephChoices,strings=none|writeback|forward|readonly|readforward|readproxy, \
>>  	specify the caching mode for cache tier pool, osd, rw, cli,rest)
>> +COMMAND(osd tier cache-measure \
>> +	name=pool,type=CephPoolname \
>> +	name=measure,type=CephChoices,strings=atime|temperature, \
>> +	specify the caching measure to judge hot objects for cache tier pool, osd, rw, cli,rest)
>>  COMMAND(osd tier set-overlay \
>>  	name=pool,type=CephPoolname \
>>  	name=overlaypool,type=CephPoolname, \
>>
>> diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
>> index 10597d0..0374778 100644
>> --- a/src/mon/OSDMonitor.cc
>> +++ b/src/mon/OSDMonitor.cc
>> @@ -2803,7 +2803,7 @@ namespace {
>>  	CACHE_TARGET_DIRTY_RATIO, CACHE_TARGET_FULL_RATIO,
>>  	CACHE_MIN_FLUSH_AGE, CACHE_MIN_EVICT_AGE,
>>  	ERASURE_CODE_PROFILE, MIN_READ_RECENCY_FOR_PROMOTE,
>> -	WRITE_FADVISE_DONTNEED};
>> +	WRITE_FADVISE_DONTNEED, HIT_SET_GRADE_DECAY_RATE};
>>
>>  std::set<osd_pool_get_choices> subtract_second_from_first(const std::set<osd_pool_get_choices> first,
>>
>> @@ -3251,7 +3251,8 @@ bool OSDMonitor::preprocess_command(MMonCommand *m)
>>  	(cache_min_evict_age, CACHE_MIN_EVICT_AGE)
>>  	(erasure_code_profile, ERASURE_CODE_PROFILE)
>>  	(min_read_recency_for_promote, MIN_READ_RECENCY_FOR_PROMOTE)
>> -	(write_fadvise_dontneed, WRITE_FADVISE_DONTNEED);
>> +	(write_fadvise_dontneed, WRITE_FADVISE_DONTNEED)
>> +	(hit_set_grade_decay_rate, HIT_SET_GRADE_DECAY_RATE);
>>
>>  typedef std::set<osd_pool_get_choices> choices_set_t;
>>
>> @@ -3259,7 +3260,7 @@ bool OSDMonitor::preprocess_command(MMonCommand *m)
>>  	(HIT_SET_TYPE)(HIT_SET_PERIOD)(HIT_SET_COUNT)(HIT_SET_FPP)
>>  	(TARGET_MAX_OBJECTS)(TARGET_MAX_BYTES)(CACHE_TARGET_FULL_RATIO)
RE: Cache pool on top of ec base pool teuthology test
Yes, we can do the force promotion check in init_op_flags, as we did before (sketch below).

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Tuesday, May 26, 2015 10:19 AM
To: Wang, Zhiqiang
Cc: ceph-devel@vger.kernel.org
Subject: RE: Cache pool on top of ec base pool teuthology test

On Tue, 26 May 2015, Wang, Zhiqiang wrote:
> OK, so the cache tier OSD checks if it's able to proxy the write op,
> and if not, it forces a promotion.
>
> Btw, I don't find the supports_omap helper in the current master. Do
> you mean the 'CEPH_OSD_COPY_GET_FLAG_NOTSUPP_OMAP' flag?

Yeah, that isn't helpful. I think what we need is the helper to check the op, and make that dependent on is_erasure()? That should be good enough for now.

sage

> [...]
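Sketched against should_proxy_to_ec() from the earlier message, the init_op_flags() hook might look like this (the flag name and context struct are hypothetical, not the actual OSD code):

#include <stdbool.h>

struct osd_op;	/* as in the earlier sketch */
bool should_proxy_to_ec(const struct osd_op *ops, int n, unsigned long obj_size);

#define OP_FLAG_FORCE_PROMOTE 0x1	/* hypothetical flag */

struct op_ctx {
	const struct osd_op *ops;
	int num_ops;
	unsigned long obj_size;
	bool base_is_erasure;
	unsigned int flags;
};

/* while scanning the ops to set flags, mark the op "must promote" if an
 * erasure-coded base tier couldn't apply it as a proxied write */
static void init_op_flags(struct op_ctx *ctx)
{
	if (ctx->base_is_erasure &&
	    !should_proxy_to_ec(ctx->ops, ctx->num_ops, ctx->obj_size))
		ctx->flags |= OP_FLAG_FORCE_PROMOTE;
}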
RE: teuthology timeout error
Hi Loic,

We rebased our teuthology/ceph-qa-suite and retried the test against LRC on current master. However, we unfortunately got the same result as before (timeout error).

[test conditions]
Target: Ceph-9.0.0-971-gd49d816
https://github.com/kawaguchi-s/teuthology
https://github.com/kawaguchi-s/ceph-qa-suite/tree/wip-10886-lrc

[teuthology log]
2015-05-25 10:18:23 # start
2015-05-25 11:59:52,106.106 INFO:teuthology.orchestra.run.RX35-1:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph status --format=json-pretty'
2015-05-25 11:59:52,564.564 INFO:tasks.ceph.ceph_manager:no progress seen, keeping timeout for now
2015-05-25 11:59:52,565.565 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/root/src/ceph-qa-suite_master/tasks/ceph_manager.py", line 635, in wrapper
    return func(self)
  File "/root/src/ceph-qa-suite_master/tasks/ceph_manager.py", line 668, in do_thrash
    timeout=self.config.get('timeout')
  File "/root/src/ceph-qa-suite_master/tasks/ceph_manager.py", line 1569, in wait_for_recovery
    'failed to recover before timeout expired'
AssertionError: failed to recover before timeout expired
Traceback (most recent call last):
  File "/root/work/teuthology/virtualenv/lib/python2.7/site-packages/gevent/greenlet.py", line 390, in run
    result = self._run(*self.args, **self.kwargs)
  File "/root/src/ceph-qa-suite_master/tasks/ceph_manager.py", line 635, in wrapper
    return func(self)
  File "/root/src/ceph-qa-suite_master/tasks/ceph_manager.py", line 668, in do_thrash
    timeout=self.config.get('timeout')
  File "/root/src/ceph-qa-suite_master/tasks/ceph_manager.py", line 1569, in wait_for_recovery
    'failed to recover before timeout expired'
AssertionError: failed to recover before timeout expired
<Greenlet at 0x36cacd0: <bound method Thrasher.do_thrash of <tasks.ceph_manager.Thrasher instance at 0x36df3f8>>> failed with AssertionError

Best regards,
Takeshi Miyamae

-----Original Message-----
From: Loic Dachary [mailto:l...@dachary.org]
Sent: Thursday, May 21, 2015 6:38 PM
To: Miyamae, Takeshi/宮前 剛; Ceph Development
Cc: Kawaguchi, Shotaro/川口 翔太朗; Imai, Hiroki/今井 宏樹; Nakao, Takanori/中尾 鷹詔; Shiozawa, Kensuke/塩沢 賢輔
Subject: Re: teuthology timeout error

Hi,

[sorry the previous mail was sent by accident, here is the full mail]

On 21/05/2015 10:32, Miyamae, Takeshi wrote:
> Hi Loic,
>
>> Could you please share the teuthology/ceph-qa-suite repository you
>> are using to run these tests so I can try to reproduce / diagnose
>> the problem ?
>
> https://github.com/kawaguchi-s/teuthology/tree/wip-10886
> https://github.com/kawaguchi-s/ceph-qa-suite/tree/wip-10886

When compared against master they show differences that indicate it would be good to rebase:

https://github.com/ceph/teuthology/compare/master...kawaguchi-s:wip-10886
https://github.com/ceph/ceph-qa-suite/compare/master...kawaguchi-s:wip-10886

I think the teuthology commit on top of wip-10886 is a mistake

https://github.com/kawaguchi-s/teuthology/commit/348e54931f89c9b0ae7a84eb931576f8414017b5

do you really need to modify teuthology ? It should just be necessary to use the latest master branch.

It looks like the https://github.com/kawaguchi-s/ceph-qa-suite/commit/f2e3ca5d12ceef742eae2a9cf4057c436e9040c3 commit in your ceph-qa-suite is not what you intended. However

https://github.com/kawaguchi-s/ceph-qa-suite/commit/4b39d6d4862f9091a849d224e880795be406815d
https://github.com/kawaguchi-s/ceph-qa-suite/commit/d16b4b058ae118931928541a2c8acd68f9703a44

look ok :-)

Instead of naming the test 4nodes16osds3mons1client.yaml it would be better to use the same kind of naming you see at https://github.com/ceph/ceph-qa-suite/tree/master/suites/rados/thrash-erasure-code/workloads. That is a file name made of the distinctive parameters for the shec plugin (the parameters that are the default can be omitted).

Cheers

> Here are our teuthology/ceph-qa-suite repositories. Thanks in advance.
>
> Best regards,
> Takeshi Miyamae
>
> -----Original Message-----
> From: Loic Dachary [mailto:l...@dachary.org]
> Sent: Wednesday, May 20, 2015 4:49 PM
> To: Miyamae, Takeshi/宮前 剛; Ceph Development
> Cc: Kawaguchi, Shotaro/川口 翔太朗; Imai, Hiroki/今井 宏樹; Nakao, Takanori/中尾 鷹詔; Shiozawa, Kensuke/塩沢 賢輔
> Subject: Re: teuthology timeout error
>
> Hi,
>
> On 20/05/2015 04:20, Miyamae, Takeshi wrote:
>> Hi Loic,
>>
>> When we fixed our own issue and restarted teuthology,
>
> Great !
>
>> we encountered another issue (timeout error) which occurs in case of
>> LRC as well. Do you have any information about that ?
>
> Could you please share the teuthology/ceph-qa-suite repository you are
> using to run these tests so I can try to reproduce / diagnose the
> problem ?
>
> Thanks
>
>> [error messages (in case of LRC pool)]
>> 2015-04-28 12:38:54,128.128 INFO:teuthology.orchestra.run.RX35-1:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph status --format=json-pretty'
>> 2015-04-28 12:38:54,516.516
RE: Cache pool on top of ec base pool teuthology test
On Tue, 26 May 2015, Wang, Zhiqiang wrote:
> OK, so the cache tier OSD checks if it's able to proxy the write op,
> and if not, it forces a promotion.
>
> Btw, I don't find the supports_omap helper in the current master. Do
> you mean the 'CEPH_OSD_COPY_GET_FLAG_NOTSUPP_OMAP' flag?

Yeah, that isn't helpful. I think what we need is the helper to check the op, and make that dependent on is_erasure()? That should be good enough for now.

sage

> [...]