'Racing read got wrong version' during proxy write testing

2015-05-25 Thread Wang, Zhiqiang
Hi all,

I ran into a problem during the teuthology test of proxy write. It is like this:

- Client sends 3 writes and a read on the same object to base tier
- Set up cache tiering
- Client retries ops and sends the 3 writes and 1 read to the cache tier
- The 3 writes finished on the base tier, say with versions v1, v2 and v3
- Cache tier proxies the 1st write, and starts to promote the object for the 
2nd write; the 2nd and 3rd writes and the read are blocked
- The proxied 1st write finishes on the base tier with version v4, and returns 
to the cache tier. But somehow the cache tier fails to send the reply due to 
socket failure injection
- Client retries the writes and the read again; the writes are identified as 
dup ops
- The promotion finishes; it copies the pg_log entries from the base tier and 
puts them in the cache tier's pg_log. This includes the 3 writes on the base 
tier and the proxied write
- The writes dispatch after the promotion and are identified as completed dup 
ops. The cache tier replies to these write ops with the versions from the base 
tier (v1, v2 and v3)
- Finally, the read dispatches; it reads the version of the proxied write (v4) 
and replies to the client
- Client complains that 'racing read got wrong version'

In a previous discussion of the 'ops not idempotent' problem, we solved it by 
copying the pg_log entries in the base tier to cache tier during promotion. 
Seems like there is still a problem with this approach in the above scenario. 
My first thought is that when proxying the write, the cache tier should use the 
original reqid from the client. But currently we don't have a way to pass the 
original reqid from cache to base. Any ideas?


preparing v0.80.11

2015-05-25 Thread Loic Dachary
Hi Nathan,

Although v0.80.10 is not out yet, the odds of discovering a problem that 
would require an additional backport are low. I think you can start with 
v0.80.11 without further delay :-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre





Cache pool on top of ec base pool teuthology test

2015-05-25 Thread Wang, Zhiqiang
We have a bunch of teuthology tests which build a cache pool on top of an ec 
base pool and do partial object writes. This is OK with the current cache 
tiering implementation, but with proxy write it won't work. In my testing, the 
error message is something like the one below:

2015-05-20 05:51:42.896828 7f682eed4700 10 osd.5 pg_epoch: 183 pg[2.1( v 
183'7978 (179'7799,183'7978] local-les=172 n=1258 ec=8 les/c 172/172 
170/171/171) [5,0] r=0 lpr=171 luod=183'7975 crt=183'7974 lcod 183'7974 mlcod 
183'7974 active+clean] do_proxy_write Start proxy write for 
osd_op(client.4130.0:32967 plana4222147-6594 [write 733117~408660] 2.3ed99ff9 
ack+ondisk+write+known_if_redirected e183) v5
2015-05-20 05:51:42.899958 7f682c6cf700  1 -- 10.214.132.32:6808/20666 -- 
10.214.132.32:0/20666 -- osd_op_reply(17556 plana4222147-6594 [write 
733117~408660] v0'0 uv0 ondisk = -95 ((95) Operation not supported)) v6 -- ?+0 
0xa2e23c0 con 0x9355760

What should we do with these tests?


Re: preparing v0.80.11

2015-05-25 Thread Nathan Cutler

Although v0.80.10 is not out yet, the odds of a discovering a problem that 
would require an additional backport are low. I think you can start with 
v0.80.11 without further delay :-)


As soon as I get over the flu :-(

Nathan



Re: [PATCH 3/5] libceph: a couple tweaks for wait loops

2015-05-25 Thread Ilya Dryomov
On Thu, May 21, 2015 at 4:29 PM, Alex Elder el...@ieee.org wrote:
 On 05/21/2015 07:35 AM, Ilya Dryomov wrote:
 - return -ETIMEDOUT instead of -EIO in case of timeout
 - wait_event_interruptible_timeout() returns time left until timeout
   and since it can be almost LONG_MAX we had better assign it to long

 Any error returned by wait_event_interruptible_timeout()
 can now be returned by __ceph_open_session().  It looks
 like that may, in fact, be only -EINTR and -ERESTARTSYS.
 But it's a change you could note in the log message.

I think it's just -ERESTARTSYS so I didn't bother.


 It turns out the only caller ignores the return value of
 ceph_monc_wait_osdmap() anyway.  That should maybe be fixed.

That's on purpose - rbd map tries to wait for a new enough osdmap only
if the pool that the image is supposed to be in doesn't exist and we
know we have a stale osdmap.  We ignore the wait retval because if we
time out we should return "this pool doesn't exist", not -ETIMEDOUT.

Thanks,

Ilya
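
[For illustration, a minimal userspace sketch of the return-value convention 
discussed above: the wait helper distinguishes a timeout from success, and the 
caller is free to ignore the timeout and report its own error. The names are 
hypothetical; this is not the libceph kernel code.]

// Sketch only: a userspace analogue of the pattern above (hypothetical names,
// not the libceph implementation).
#include <chrono>
#include <condition_variable>
#include <mutex>

constexpr int err_timedout = 110;  // stand-in for -ETIMEDOUT
constexpr int err_noent = 2;       // stand-in for -ENOENT

struct OsdmapWaiter {
    std::mutex m;
    std::condition_variable cv;
    unsigned long have_epoch = 0;

    // 0 on success, -ETIMEDOUT on timeout: a timeout is not an I/O error.
    int wait_for_epoch(unsigned long want, std::chrono::seconds timeout) {
        std::unique_lock<std::mutex> lk(m);
        bool satisfied = cv.wait_for(lk, timeout,
                                     [&] { return have_epoch >= want; });
        return satisfied ? 0 : -err_timedout;
    }
};

// Like the rbd map path: wait for a newer map but deliberately ignore the
// wait's return value; if the pool still isn't there, report that instead
// of the timeout.
int lookup_pool(OsdmapWaiter& w, bool (*pool_exists)(), unsigned long newest) {
    if (!pool_exists()) {
        (void)w.wait_for_epoch(newest, std::chrono::seconds(10));
        if (!pool_exists())
            return -err_noent;  // "pool doesn't exist", not -ETIMEDOUT
    }
    return 0;
}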


Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

2015-05-25 Thread Christoph Hellwig
On Fri, May 22, 2015 at 11:18:40AM -0700, Ming Lin wrote:
 From: Kent Overstreet kent.overstr...@gmail.com
 
 As generic_make_request() is now able to handle arbitrarily sized bios,
 it's no longer necessary for each individual block driver to define its
 own ->merge_bvec_fn() callback. Remove every invocation completely.

It might be good to replace patch 1 and this one by a patch per driver
to remove the merge_bvec_fn instance and add the blk_queue_split call
for all those drivers that actually had a ->merge_bvec_fn.  As some 
of them were non-trivial attention from the maintainers would be helpful,
and a patch per driver might help with that.

 -/* This is called by bio_add_page().
 - *
 - * q->max_hw_sectors and other global limits are already enforced there.
 - *
 - * We need to call down to our lower level device,
 - * in case it has special restrictions.
 - *
 - * We also may need to enforce configured max-bio-bvecs limits.
 - *
 - * As long as the BIO is empty we have to allow at least one bvec,
 - * regardless of size and offset, so no need to ask lower levels.
 - */
 -int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, 
 struct bio_vec *bvec)


This just checks the lower device, so it looks obviously fine.

 -static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data 
 *bmd,
 -   struct bio_vec *bvec)
 -{
 - struct pktcdvd_device *pd = q->queuedata;
 - sector_t zone = get_zone(bmd->bi_sector, pd);
 - int used = ((bmd->bi_sector - zone) << 9) + bmd->bi_size;
 - int remaining = (pd->settings.size << 9) - used;
 - int remaining2;
 -
 - /*
 -  * A bio <= PAGE_SIZE must be allowed. If it crosses a packet
 -  * boundary, pkt_make_request() will split the bio.
 -  */
 - remaining2 = PAGE_SIZE - bmd->bi_size;
 - remaining = max(remaining, remaining2);
 -
 - BUG_ON(remaining < 0);
 - return remaining;
 -}

As mentioned in the comment pkt_make_request will split the bio so pkt
looks fine.

 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index ec6c5c6..f50edb3 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -3440,52 +3440,6 @@ static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
   return BLK_MQ_RQ_QUEUE_OK;
  }
  
 -/*
 - * a queue callback. Makes sure that we don't create a bio that spans across
 - * multiple osd objects. One exception would be with a single page bios,
 - * which we handle later at bio_chain_clone_range()
 - */
 -static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data 
 *bmd,
 -   struct bio_vec *bvec)

It seems rbd handles requests spanning objects just fine, so I don't
really understand why rbd_merge_bvec even exists.  Getting some form
of ACK from the ceph folks would be useful.

 -/*
 - * We assume I/O is going to the origin (which is the volume
 - * more likely to have restrictions e.g. by being striped).
 - * (Looking up the exact location of the data would be expensive
 - * and could always be out of date by the time the bio is submitted.)
 - */
 -static int cache_bvec_merge(struct dm_target *ti,
 - struct bvec_merge_data *bvm,
 - struct bio_vec *biovec, int max_size)
 -{

DM seems to have the most complex merge functions of all drivers, so
I'd really love to see an ACK from Mike.



ceph branch status

2015-05-25 Thread ceph branch robot
-- All Branches --

Adam Crume adamcr...@gmail.com
2014-12-01 20:45:58 -0800   wip-doc-rbd-replay

Alfredo Deza ad...@redhat.com
2015-03-23 16:39:48 -0400   wip-11212
2015-03-25 10:10:43 -0400   wip-11065

Alfredo Deza alfredo.d...@inktank.com
2014-07-08 13:58:35 -0400   wip-8679
2014-09-04 13:58:14 -0400   wip-8366
2014-10-13 11:10:10 -0400   wip-9730

Boris Ranto bra...@redhat.com
2015-04-13 13:51:32 +0200   wip-fix-ceph-dencoder-build
2015-04-14 13:51:49 +0200   wip-fix-ceph-dencoder-build-master
2015-05-15 15:26:05 +0200   wip-selinux-policy

Chi Xinze xmdx...@gmail.com
2015-05-15 21:47:44 +   XinzeChi-wip-ec-read

Dan Mick dan.m...@inktank.com
2013-07-16 23:00:06 -0700   wip-5634

Danny Al-Gaaf danny.al-g...@bisect.de
2015-04-23 16:32:00 +0200   wip-da-SCA-20150421
2015-04-23 17:18:57 +0200   wip-nosetests
2015-04-23 18:20:16 +0200   wip-unify-num_objects_degraded
2015-04-30 12:34:08 +0200   wip-user
2015-05-22 18:31:01 +0200   wip-da-SCA-20150427

David Zafman dzaf...@redhat.com
2014-08-29 10:41:23 -0700   wip-libcommon-rebase
2014-11-26 09:41:50 -0800   wip-9403
2014-12-02 21:20:17 -0800   wip-zafman-docfix
2015-01-08 15:07:45 -0800   wip-vstart-kvs
2015-02-20 16:13:43 -0800   wip-10883-firefly
2015-02-20 16:14:57 -0800   wip-10883-dumpling
2015-04-23 10:22:09 -0700   wip-11454-80.8
2015-04-24 13:14:23 -0700   wip-cot-giant
2015-05-21 15:29:48 -0700   wip-11511
2015-05-22 19:42:28 -0700   wip-10794

Dongmao Zhang deanracc...@gmail.com
2014-11-14 19:14:34 +0800   thesues-master

Greg Farnum gfar...@redhat.com
2015-04-19 18:03:41 -0700   greg-testing-quota-full
2015-04-29 21:44:11 -0700   wip-init-names
2015-05-15 10:38:22 -0700   greg-fs-testing

Greg Farnum g...@inktank.com
2014-10-23 13:33:44 -0700   wip-forward-scrub

Gregory Meno gm...@redhat.com
2015-02-25 17:30:33 -0800   wip-fix-typo-troubleshooting

Guang Yang ygu...@yahoo-inc.com
2014-08-08 10:41:12 +   wip-guangyy-pg-splitting
2014-09-25 00:47:46 +   wip-9008
2014-09-30 10:36:39 +   guangyy-wip-9614

Haomai Wang haomaiw...@gmail.com
2014-07-27 13:37:49 +0800   wip-flush-set
2015-04-20 00:47:59 +0800   update-organization
2015-04-20 00:48:42 +0800   update-organization-1

Ilya Dryomov idryo...@gmail.com
2015-05-22 17:44:55 +0300   wip-crush-ruleset-name
2015-05-25 15:41:51 +0300   wip-mount-timeout-doc

Ilya Dryomov ilya.dryo...@inktank.com
2014-09-05 16:15:10 +0400   wip-rbd-notify-errors

James Page james.p...@ubuntu.com
2013-02-27 22:50:38 +   wip-debhelper-8

Jason Dillaman dilla...@redhat.com
2015-04-13 19:36:27 -0400   wip-11056
2015-04-14 14:55:50 -0400   wip-11056-firefly
2015-05-12 11:04:36 -0400   wip-librbd-helgrind
2015-05-15 11:19:33 -0400   wip-11537
2015-05-21 16:27:18 -0400   wip-11579
2015-05-22 00:52:20 -0400   wip-11625

Jenkins jenk...@inktank.com
2014-07-29 05:24:39 -0700   wip-nhm-hang
2015-02-02 10:35:28 -0800   wip-sam-v0.92
2015-04-13 13:24:40 -0700   rhcs-v0.80.8

Joao Eduardo Luis jec...@gmail.com
2014-09-10 09:39:23 +0100   wip-leveldb-get.dumpling

Joao Eduardo Luis joao.l...@gmail.com
2014-07-22 15:41:42 +0100   wip-leveldb-misc

Joao Eduardo Luis joao.l...@inktank.com
2014-09-02 17:19:52 +0100   wip-leveldb-get
2014-10-17 16:20:11 +0100   wip-paxos-fix
2014-10-21 21:32:46 +0100   wip-9675.dumpling

Joao Eduardo Luis j...@redhat.com
2014-11-17 16:43:53 +   wip-mon-osdmap-cleanup
2014-12-15 16:18:56 +   wip-giant-mon-backports
2014-12-17 17:13:57 +   wip-mon-backports.firefly
2014-12-17 23:15:10 +   wip-mon-sync-fix.dumpling
2015-01-07 23:01:00 +   wip-mon-blackhole-mlog-0.87.7
2015-01-10 02:40:42 +   wip-dho-joao
2015-01-10 02:46:31 +   wip-mon-paxos-fix
2015-01-26 13:00:09 +   wip-mon-datahealth-fix
2015-02-04 22:36:14 +   wip-10643
2015-02-26 14:54:01 +   wip-10507

Joao Eduardo Luis j...@suse.de
2015-05-05 12:24:28 +0100   wip-mon-scrub
2015-05-12 16:04:23 +0100   wip-11545

John Spray john.sp...@redhat.com
2015-02-18 14:04:18 +   wip10649
2015-04-06 17:25:02 +0100   wip-progress-events
2015-05-05 14:29:16 +0100   wip-live-query
2015-05-06 16:21:25 +0100   wip-11541-hammer-workaround
2015-05-07 14:23:37 +0100   wip-11504
2015-05-13 13:48:41 +0100   wip-9963
2015-05-13 18:55:04 +0100   wip-9964
2015-05-22 10:48:32 +0100   wip-offline-backward
2015-05-22 14:08:04 +0100   wip-blacklist-json
2015-05-22 14:17:20 +0100   wip-damaged-fixes


Re: [PATCH 3/5] libceph: a couple tweaks for wait loops

2015-05-25 Thread Alex Elder
On 05/25/2015 05:38 AM, Ilya Dryomov wrote:
 On Thu, May 21, 2015 at 4:29 PM, Alex Elder el...@ieee.org wrote:
 On 05/21/2015 07:35 AM, Ilya Dryomov wrote:
 - return -ETIMEDOUT instead of -EIO in case of timeout
 - wait_event_interruptible_timeout() returns time left until timeout
   and since it can be almost LONG_MAX we had better assign it to long

 Any error returned by wait_event_interruptible_timeout()
 can now be returned by __ceph_open_session().  It looks
 like that may, in fact, be only -EINTR and -ERESTARTSYS.
 But it's a change you could note in the log message.
 
 I think it's just -ERESTARTSYS so I didn't bother.

My point was a little more philosophical.  It's conceivable
(though not likely) that wait_event_interruptible_timeout() could
be changed to return a value that your caller here does not expect.

 It turns out the only caller ignores the return value of
 ceph_monc_wait_osdmap() anyway.  That should maybe be fixed.
 
 That's on purpose - rbd map tries to wait for a new enough osdmap only
 if the pool that the image is supposed to be in doesn't exist and we
 know we have a stale osdmap.  We ignore wait retval because if we
 timeout we should return this pool doesn't exist, not -ETIMEDOUT.

Yes I realize that.  This second part of my response was
following on to my previous thought.  That is, the caller
might get a different return value that it didn't expect;
but since the only caller ignores what gets returned, it's
a moot point.

-Alex

 Thanks,
 
 Ilya
 



Re: 'Racing read got wrong version' during proxy write testing

2015-05-25 Thread Sage Weil
On Mon, 25 May 2015, Wang, Zhiqiang wrote:
 Hi all,
 
 I ran into a problem during the teuthology test of proxy write. It is like 
 this:
 
 - Client sends 3 writes and a read on the same object to base tier
 - Set up cache tiering
 - Client retries ops and sends the 3 writes and 1 read to the cache tier
 - The 3 writes finished on the base tier, say with versions v1, v2 and v3
 - Cache tier proxies the 1st write, and start to promote the object for the 
 2nd write, the 2nd and 3rd writes and the read are blocked
 - The proxied 1st write finishes on the base tier with version v4, and 
 returns to cache tier. But somehow the cache tier fails to send the reply due 
 to socket failure injecting
 - Client retries the writes and the read again, the writes are identified as 
 dup ops
 - The promotion finishes, it copies the pg_log entries from the base tier and 
 put it in the cache tier's pg_log. This includes the 3 writes on the base 
 tier and the proxied write
 - The writes dispatches after the promotion, they are identified as completed 
 dup ops. Cache tier replies these write ops with the version from the base 
 tier (v1, v2 and v3)
 - In the last, the read dispatches, it reads the version of the proxied write 
 (v4) and replies to client
 - Client complains that 'racing read got wrong version'
 
 In a previous discussion of the 'ops not idempotent' problem, we solved it by 
 copying the pg_log entries in the base tier to cache tier during promotion. 
 Seems like there is still a problem with this approach in the above scenario. 
 My first thought is that when proxying the write, the cache tier should use 
 the original reqid from the client. But currently we don't have a way to pass 
 the original reqid from cache to base. Any ideas?

I agree--I think the correct fix here is to make the proxied op be 
recognized as a dup.  We can do that either by passing in an optional 
reqid to the Objecter, or by extending the op somehow so that both reqids 
are listed.  I think the first option will be cleaner, but I think we 
will also need to make sure the 'retry' count is preserved as (I think) we 
skip the dup check if retry==0.  And we probably want to preserve the 
behavior that a given (reqid, retry) only exists once in the system.

This probably means adding more optional args to Objecter::read()...?

sage
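
[A minimal sketch of the first option: carry the client's original (reqid, 
retry) with the proxied op so the base tier's dup check recognizes it. The 
types and names below are simplified stand-ins, not the actual Objecter or PG 
interfaces.]

// Hypothetical sketch only; simplified stand-ins for Ceph's osd_reqid_t and
// pg_log dup tracking, not the real interfaces.
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <tuple>
#include <utility>

struct ReqidSketch {
    std::string client;                       // e.g. "client.4130"
    uint64_t tid = 0;                         // client-assigned transaction id
    bool operator<(const ReqidSketch& o) const {
        return std::tie(client, tid) < std::tie(o.client, o.tid);
    }
};

struct ProxiedWriteSketch {
    ReqidSketch reqid;                        // the client's original reqid...
    uint32_t retry = 0;                       // ...and retry count, preserved
    std::string oid;
};

// Base-tier side: (reqid, retry) identifies the op, so a retried client op and
// its proxied counterpart resolve to the same completed entry and version.
struct DupCheckSketch {
    std::map<std::pair<ReqidSketch, uint32_t>, uint64_t> completed;  // -> user version

    std::optional<uint64_t> find_dup(const ProxiedWriteSketch& op) const {
        auto it = completed.find({op.reqid, op.retry});
        if (it == completed.end())
            return std::nullopt;              // not a dup; apply the write
        return it->second;                    // dup: reply with recorded version
    }
};

Whichever way the reqid is threaded through (an extra optional argument or an 
extended op encoding), the invariant above still has to hold: a given (reqid, 
retry) pair exists only once in the system.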


Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

2015-05-25 Thread Ilya Dryomov
On Mon, May 25, 2015 at 5:04 PM, Christoph Hellwig h...@lst.de wrote:
 On Fri, May 22, 2015 at 11:18:40AM -0700, Ming Lin wrote:
 From: Kent Overstreet kent.overstr...@gmail.com

 As generic_make_request() is now able to handle arbitrarily sized bios,
 it's no longer necessary for each individual block driver to define its
 own ->merge_bvec_fn() callback. Remove every invocation completely.

 It might be good to replace patch 1 and this one by a patch per driver
 to remove the merge_bvec_fn instance and add the blk_queue_split call
 for all those drivers that actually had a ->merge_bvec_fn.  As some
 of them were non-trivial attention from the maintainers would be helpful,
 and a patch per driver might help with that.

 -/* This is called by bio_add_page().
 - *
 - * q->max_hw_sectors and other global limits are already enforced there.
 - *
 - * We need to call down to our lower level device,
 - * in case it has special restrictions.
 - *
 - * We also may need to enforce configured max-bio-bvecs limits.
 - *
 - * As long as the BIO is empty we have to allow at least one bvec,
 - * regardless of size and offset, so no need to ask lower levels.
 - */
 -int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, 
 struct bio_vec *bvec)


 This just checks the lower device, so it looks obviously fine.

 -static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data 
 *bmd,
 -   struct bio_vec *bvec)
 -{
 - struct pktcdvd_device *pd = q->queuedata;
 - sector_t zone = get_zone(bmd->bi_sector, pd);
 - int used = ((bmd->bi_sector - zone) << 9) + bmd->bi_size;
 - int remaining = (pd->settings.size << 9) - used;
 - int remaining2;
 -
 - /*
 -  * A bio <= PAGE_SIZE must be allowed. If it crosses a packet
 -  * boundary, pkt_make_request() will split the bio.
 -  */
 - remaining2 = PAGE_SIZE - bmd->bi_size;
 - remaining = max(remaining, remaining2);
 -
 - BUG_ON(remaining < 0);
 - return remaining;
 -}

 As mentioned in the comment pkt_make_request will split the bio so pkt
 looks fine.

 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index ec6c5c6..f50edb3 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -3440,52 +3440,6 @@ static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
   return BLK_MQ_RQ_QUEUE_OK;
  }

 -/*
 - * a queue callback. Makes sure that we don't create a bio that spans across
 - * multiple osd objects. One exception would be with a single page bios,
 - * which we handle later at bio_chain_clone_range()
 - */
 -static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data 
 *bmd,
 -   struct bio_vec *bvec)

 It seems rbd handles requests spanning objects just fine, so I don't
 really understand why rbd_merge_bvec even exists.  Getting some form
 of ACK from the ceph folks would be useful.

I'm not Alex, but yeah, we have all the clone/split machinery and so we
can handle a spanning case just fine.  I think rbd_merge_bvec() exists
to make sure we don't have to do that unless it's really necessary -
like when a single page gets submitted at an inconvenient offset.

I have a patch that adds a blk_queue_chunk_sectors(object_size) call to
rbd_init_disk() but I haven't had a chance to play with it yet.  In any
case, we should be fine with getting rid of rbd_merge_bvec().  If this
ends up a per-driver patchset, I can make rbd_merge_bvec() ->
blk_queue_chunk_sectors() a single patch and push it through
ceph-client.git.

Thanks,

Ilya
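
[The property rbd_merge_bvec() was enforcing, and that either 
blk_queue_chunk_sectors() or in-driver splitting would preserve, is that no 
request crosses an object boundary. A small self-contained illustration of 
that boundary arithmetic, with hypothetical names; not rbd or block-layer 
code.]

// Illustration only: split an I/O extent so no piece crosses an object
// (chunk) boundary. Hypothetical names.
#include <algorithm>
#include <cstdint>
#include <vector>

struct ExtentSketch {
    uint64_t offset;   // byte offset into the image
    uint64_t length;   // byte length
};

std::vector<ExtentSketch> split_at_object_boundaries(uint64_t offset,
                                                     uint64_t length,
                                                     uint64_t object_size) {
    std::vector<ExtentSketch> out;
    while (length > 0) {
        uint64_t within = offset % object_size;               // offset inside object
        uint64_t take = std::min(length, object_size - within);
        out.push_back({offset, take});
        offset += take;
        length -= take;
    }
    return out;
}

// e.g. with 4 MiB objects, a 6 MiB write at offset 3 MiB splits into
// 1 MiB + 4 MiB + 1 MiB pieces, each contained in a single object.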


Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

2015-05-25 Thread Christoph Hellwig
On Mon, May 25, 2015 at 06:02:30PM +0300, Ilya Dryomov wrote:
 I'm not Alex, but yeah, we have all the clone/split machinery and so we
 can handle a spanning case just fine.  I think rbd_merge_bvec() exists
 to make sure we don't have to do that unless it's really necessary -
 like when a single page gets submitted at an inconvenient offset.
 
 I have a patch that adds a blk_queue_chunk_sectors(object_size) call to
 rbd_init_disk() but I haven't had a chance to play with it yet.  In any
 case, we should be fine with getting rid of rbd_merge_bvec().  If this
 ends up a per-driver patchset, I can make rbd_merge_bvec() ->
 blk_queue_chunk_sectors() a single patch and push it through
 ceph-client.git.

Hmm, looks like the new blk_queue_split_bio ignores the chunk_sectors
value, another thing that needs updating.  I forgot how many weird
merging hacks we had to add for nvme..

While I'd like to see per-driver patches we'd still need to merge
them together through the block tree.  Note that with this series
there won't be any benefit of using blk_queue_chunk_sectors over just
doing the split in rbd.  Maybe we can even remove it again and do
that work in the drivers in the future.


Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

2015-05-25 Thread Alex Elder
On 05/25/2015 10:02 AM, Ilya Dryomov wrote:
 On Mon, May 25, 2015 at 5:04 PM, Christoph Hellwig h...@lst.de wrote:
 On Fri, May 22, 2015 at 11:18:40AM -0700, Ming Lin wrote:
 From: Kent Overstreet kent.overstr...@gmail.com

 As generic_make_request() is now able to handle arbitrarily sized bios,
 it's no longer necessary for each individual block driver to define its
 own ->merge_bvec_fn() callback. Remove every invocation completely.

 It might be good to replace patch 1 and this one by a patch per driver
 to remove the merge_bvec_fn instance and add the blk_queue_split call
 for all those drivers that actually had a ->merge_bvec_fn.  As some
 of them were non-trivial attention from the maintainers would be helpful,
 and a patch per driver might help with that.

 -/* This is called by bio_add_page().
 - *
 - * q->max_hw_sectors and other global limits are already enforced there.
 - *
 - * We need to call down to our lower level device,
 - * in case it has special restrictions.
 - *
 - * We also may need to enforce configured max-bio-bvecs limits.
 - *
 - * As long as the BIO is empty we have to allow at least one bvec,
 - * regardless of size and offset, so no need to ask lower levels.
 - */
 -int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, 
 struct bio_vec *bvec)


 This just checks the lower device, so it looks obviously fine.

 -static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data 
 *bmd,
 -   struct bio_vec *bvec)
 -{
 - struct pktcdvd_device *pd = q->queuedata;
 - sector_t zone = get_zone(bmd->bi_sector, pd);
 - int used = ((bmd->bi_sector - zone) << 9) + bmd->bi_size;
 - int remaining = (pd->settings.size << 9) - used;
 - int remaining2;
 -
 - /*
 -  * A bio <= PAGE_SIZE must be allowed. If it crosses a packet
 -  * boundary, pkt_make_request() will split the bio.
 -  */
 - remaining2 = PAGE_SIZE - bmd->bi_size;
 - remaining = max(remaining, remaining2);
 -
 - BUG_ON(remaining < 0);
 - return remaining;
 -}

 As mentioned in the comment pkt_make_request will split the bio so pkt
 looks fine.

 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index ec6c5c6..f50edb3 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -3440,52 +3440,6 @@ static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
   return BLK_MQ_RQ_QUEUE_OK;
  }

 -/*
 - * a queue callback. Makes sure that we don't create a bio that spans 
 across
 - * multiple osd objects. One exception would be with a single page bios,
 - * which we handle later at bio_chain_clone_range()
 - */
 -static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data 
 *bmd,
 -   struct bio_vec *bvec)

 It seems rbd handles requests spanning objects just fine, so I don't
 really understand why rbd_merge_bvec even exists.  Getting some form
 of ACK from the ceph folks would be useful.
 
 I'm not Alex, but yeah, we have all the clone/split machinery and so we
 can handle a spanning case just fine.  I think rbd_merge_bvec() exists
 to make sure we don't have to do that unless it's really necessary -
 like when a single page gets submitted at an inconvenient offset.

I am Alex.  This is something I never removed.  I haven't
looked at it closely now, but it seems to me that after I
created a function that split stuff properly up *before*
the BIO layer got to it (which has since been replaced by
code related to Kent's immutable BIO work), there has been
no need for this function.  Removing this was on a long-ago
to-do list--but I didn't want to do it without spending some
time ensuring it wouldn't break anything.

If you want me to work through it in more detail so I can
give a more certain response, let me know and I will do so.

-Alex

 I have a patch that adds a blk_queue_chunk_sectors(object_size) call to
 rbd_init_disk() but I haven't had a chance to play with it yet.  In any
 case, we should be fine with getting rid of rbd_merge_bvec().  If this
 ends up a per-driver patchset, I can make rbd_merge_bvec() ->
 blk_queue_chunk_sectors() a single patch and push it through
 ceph-client.git.
 
 Thanks,
 
 Ilya
 



Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

2015-05-25 Thread Ilya Dryomov
On Mon, May 25, 2015 at 6:08 PM, Christoph Hellwig h...@lst.de wrote:
 On Mon, May 25, 2015 at 06:02:30PM +0300, Ilya Dryomov wrote:
 I'm not Alex, but yeah, we have all the clone/split machinery and so we
 can handle a spanning case just fine.  I think rbd_merge_bvec() exists
 to make sure we don't have to do that unless it's really necessary -
 like when a single page gets submitted at an inconvenient offset.

 I have a patch that adds a blk_queue_chunk_sectors(object_size) call to
 rbd_init_disk() but I haven't had a chance to play with it yet.  In any
 case, we should be fine with getting rid of rbd_merge_bvec().  If this
 ends up a per-driver patchset, I can make rbd_merge_bvec() ->
 blk_queue_chunk_sectors() a single patch and push it through
 ceph-client.git.

 Hmm, looks like the new blk_queue_split_bio ignore the chunk_sectors
 value, another thing that needs updating.  I forgot how many weird
 merging hacks we had to add for nvme..

 While I'd like to see per-driver patches we'd still need to merge
 them together through the block tree.  Note that with this series
 there won't be any benefit of using blk_queue_chunk_sectors over just
 doing the split in rbd.  Maybe we can even remove it again and do
 that work in the drivers in the future.

OK, I'll drop it, especially if it's potentially on its way out.  With
the fancy striping support, which I'll hopefully get to sometime, the
striping pattern will become much more complicated anyway, so relying
on rbd doing bio splitting is right in the long run as well.

Thanks,

Ilya


Re: Cache pool on top of ec base pool teuthology test

2015-05-25 Thread Sage Weil
On Mon, 25 May 2015, Wang, Zhiqiang wrote:
 We have a bunch of teuthology tests which build cache pool on top of an 
 ec base pool, and do partial object write. This is ok with the current 
 cache tiering implementation. But with proxy write, this won't work. In 
 my testing, the error message is something like below:
 
 2015-05-20 05:51:42.896828 7f682eed4700 10 osd.5 pg_epoch: 183 pg[2.1( v 
 183'7978 (179'7799,183'7978] local-les=172 n=1258 ec=8 les/c 172/172 
 170/171/171) [5,0] r=0 lpr=171 luod=183'7975 crt=183'7974 lcod 183'7974 mlcod 
 183'7974 active+clean] do_proxy_write Start proxy write for 
 osd_op(client.4130.0:32967 plana4222147-6594 [write 733117~408660] 2.3ed99ff9 
 ack+ondisk+write+known_if_redirected e183) v5
 2015-05-20 05:51:42.899958 7f682c6cf700  1 -- 10.214.132.32:6808/20666 -- 
 10.214.132.32:0/20666 -- osd_op_reply(17556 plana4222147-6594 [write 
 733117~408660] v0'0 uv0 ondisk = -95 ((95) Operation not supported)) v6 -- 
 ?+0 0xa2e23c0 con 0x9355760
 
 What should we do with these tests?

I think the test is fine.. but the OSD should refuse to proxy the write if 
the base tier won't support the write operation in question.  I believe we 
recently renamed one of the helpers supports_omap()... we probably need a 
similar set of helpers for object overwrites?

Or, we can make a method like should_proxy_to_ec() that scans through 
the op vector and makes a conservative judgement of whether it is safe 
to proxy...

sage
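
[A rough sketch of the second idea, a conservative scan of the op vector 
before proxying to an erasure-coded base tier. The helper name comes from the 
suggestion above; the op codes and types are simplified stand-ins that would 
need to match the real OSDOp definitions.]

// Hypothetical sketch only; simplified stand-ins, not Ceph's OSDOp.
#include <vector>

enum class OpCodeSketch { READ, STAT, WRITEFULL, WRITE, TRUNCATE, OMAP_SETVALS, DELETE };

struct OSDOpSketch { OpCodeSketch op; };

// Conservative judgement: only proxy ops we are sure the EC base pool can
// apply. Partial overwrites, truncates and omap writes are rejected, which
// would force a promotion instead of the EOPNOTSUPP reply seen in the log.
bool should_proxy_to_ec(const std::vector<OSDOpSketch>& ops) {
    for (const auto& o : ops) {
        switch (o.op) {
        case OpCodeSketch::READ:
        case OpCodeSketch::STAT:
        case OpCodeSketch::WRITEFULL:   // full-object write is fine on EC
        case OpCodeSketch::DELETE:
            break;
        default:                        // WRITE, TRUNCATE, OMAP_*, anything doubtful
            return false;
        }
    }
    return true;
}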


RE: Cache pool on top of ec base pool teuthology test

2015-05-25 Thread Wang, Zhiqiang
OK, so the cache tier OSD checks whether it's able to proxy the write op, and 
if not, forces a promotion.

Btw, I don't find the supports_omap helper in the current master. Do you mean 
the 'CEPH_OSD_COPY_GET_FLAG_NOTSUPP_OMAP' flag?

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com] 
Sent: Monday, May 25, 2015 11:04 PM
To: Wang, Zhiqiang
Cc: ceph-devel@vger.kernel.org
Subject: Re: Cache pool on top of ec base pool teuthology test

On Mon, 25 May 2015, Wang, Zhiqiang wrote:
 We have a bunch of teuthology tests which build cache pool on top of 
 an ec base pool, and do partial object write. This is ok with the 
 current cache tiering implementation. But with proxy write, this won't 
 work. In my testing, the error message is something like below:
 
 2015-05-20 05:51:42.896828 7f682eed4700 10 osd.5 pg_epoch: 183 pg[2.1( 
 v 183'7978 (179'7799,183'7978] local-les=172 n=1258 ec=8 les/c 172/172 
 170/171/171) [5,0] r=0 lpr=171 luod=183'7975 crt=183'7974 lcod 
 183'7974 mlcod 183'7974 active+clean] do_proxy_write Start proxy write 
 for osd_op(client.4130.0:32967 plana4222147-6594 [write 733117~408660] 
 2.3ed99ff9 ack+ondisk+write+known_if_redirected e183) v5
 2015-05-20 05:51:42.899958 7f682c6cf700  1 -- 10.214.132.32:6808/20666 
 -- 10.214.132.32:0/20666 -- osd_op_reply(17556 plana4222147-6594 
 [write 733117~408660] v0'0 uv0 ondisk = -95 ((95) Operation not 
 supported)) v6 -- ?+0 0xa2e23c0 con 0x9355760
 
 What should we do with these tests?

I think the test is fine.. but the OSD should refuse to proxy the write if the 
base tier won't support the write operation in question.  I believe we recently 
renamed one of the helpers supports_omap()... we probably need a similar set of 
helpers for object overwrites?

Or, we can make a method like should_proxy_to_ec() that scans through the op 
vector and makes a conservative judgement of whether it is safe to proxy...

sage


Re: [PATCH 2/5] Mon: expose commands for temperature related setting

2015-05-25 Thread Li Wang

Thanks for reviewing. We will update and add the reviewed-by after
the patch is accepted; the follow-up discussion is at
https://github.com/ceph/ceph/pull/4737

On 2015/5/22 8:55, Joao Eduardo Luis wrote:

As far as I can tell, this patch can be split in two different patches:

- add hit_set_grade_decay_rate option to 'osd pool set/get'
- add 'osd tier cache-measure'

Also, for the latter we could also use an explanatory commit message.

Aside from that, I don't see anything obviously wrong with the patch.

   -Joao

On 05/21/2015 02:34 PM, Li Wang wrote:

From: MingXin Liu mingxin...@ubuntukylin.com

Signed-off-by: MingXin Liu mingxin...@ubuntukylin.com
Reviewed-by: Li Wang liw...@ubuntukylin.com
---
  src/mon/MonCommands.h |  8 +++--
  src/mon/OSDMonitor.cc | 87 ---
  2 files changed, 88 insertions(+), 7 deletions(-)

diff --git a/src/mon/MonCommands.h b/src/mon/MonCommands.h
index 8a36807..b26834d 100644
--- a/src/mon/MonCommands.h
+++ b/src/mon/MonCommands.h
@@ -639,11 +639,11 @@ COMMAND(osd pool rename  \
rename srcpool to destpool, osd, rw, cli,rest)
  COMMAND(osd pool get  \
name=pool,type=CephPoolname  \
-   
name=var,type=CephChoices,strings=size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|auid|target_max_objects|target_max_bytes|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|erasure_code_profile|min_read_recency_for_promote|write_fadvise_dontneed|all,
 \
+   
name=var,type=CephChoices,strings=size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|auid|target_max_objects|target_max_bytes|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|erasure_code_profile|min_read_recency_for_promote|write_fadvise_dontneed|hit_set_grade_decay_rate|all,
 \
get pool parameter var, osd, r, cli,rest)
  COMMAND(osd pool set  \
name=pool,type=CephPoolname  \
-   
name=var,type=CephChoices,strings=size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|nodelete|nopgchange|nosizechange|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid|min_read_recency_for_promote|write_fadvise_dontneed
  \
+   
name=var,type=CephChoices,strings=size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|nodelete|nopgchange|nosizechange|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid|min_read_recency_for_promote|write_fadvise_dontneed|hit_set_grade_decay_rate
  \
name=val,type=CephString  \
name=force,type=CephChoices,strings=--yes-i-really-mean-it,req=false, 
\
set pool parameter var to val, osd, rw, cli,rest)
@@ -695,6 +695,10 @@ COMMAND(osd tier cache-mode  \
name=pool,type=CephPoolname  \

name=mode,type=CephChoices,strings=none|writeback|forward|readonly|readforward|readproxy,
 \
specify the caching mode for cache tier pool, osd, rw, 
cli,rest)
+COMMAND(osd tier cache-measure  \
+name=pool,type=CephPoolname  \
+name=measure,type=CephChoices,strings=atime|temperature, \
+specify the caching measure to judge hot objects for cache tier pool, osd, 
rw, cli,rest)
  COMMAND(osd tier set-overlay  \
name=pool,type=CephPoolname  \
name=overlaypool,type=CephPoolname, \
diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 10597d0..0374778 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -2803,7 +2803,7 @@ namespace {
  CACHE_TARGET_DIRTY_RATIO, CACHE_TARGET_FULL_RATIO,
  CACHE_MIN_FLUSH_AGE, CACHE_MIN_EVICT_AGE,
  ERASURE_CODE_PROFILE, MIN_READ_RECENCY_FOR_PROMOTE,
-WRITE_FADVISE_DONTNEED};
+WRITE_FADVISE_DONTNEED, HIT_SET_GRADE_DECAY_RATE};

std::set<osd_pool_get_choices>
  subtract_second_from_first(const std::set<osd_pool_get_choices>& first,
@@ -3251,7 +3251,8 @@ bool OSDMonitor::preprocess_command(MMonCommand *m)
(cache_min_evict_age, CACHE_MIN_EVICT_AGE)
(erasure_code_profile, ERASURE_CODE_PROFILE)
(min_read_recency_for_promote, MIN_READ_RECENCY_FOR_PROMOTE)
-  (write_fadvise_dontneed, WRITE_FADVISE_DONTNEED);
+  (write_fadvise_dontneed, WRITE_FADVISE_DONTNEED)
+  (hit_set_grade_decay_rate, HIT_SET_GRADE_DECAY_RATE);

  typedef std::set<osd_pool_get_choices> choices_set_t;

@@ -3259,7 +3260,7 @@ bool OSDMonitor::preprocess_command(MMonCommand *m)
(HIT_SET_TYPE)(HIT_SET_PERIOD)(HIT_SET_COUNT)(HIT_SET_FPP)
(TARGET_MAX_OBJECTS)(TARGET_MAX_BYTES)(CACHE_TARGET_FULL_RATIO)

RE: Cache pool on top of ec base pool teuthology test

2015-05-25 Thread Wang, Zhiqiang
Yes, we can do the force promotion check in init_op_flags, as we did before.

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com] 
Sent: Tuesday, May 26, 2015 10:19 AM
To: Wang, Zhiqiang
Cc: ceph-devel@vger.kernel.org
Subject: RE: Cache pool on top of ec base pool teuthology test

On Tue, 26 May 2015, Wang, Zhiqiang wrote:
 OK, so the cache tier OSD checks if it's able to proxy the write op, if not, 
 then forces a promotion.
 
 Btw, I don't find the supports_omap helps in the current master. Do you mean 
 the 'CEPH_OSD_COPY_GET_FLAG_NOTSUPP_OMAP' flag?

Yeah, that isn't helpful.  I think what we need is the helper to check the op, 
and make that dependent on is_erasure()?  That shoudl be good enough for now.

sage


 
 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Monday, May 25, 2015 11:04 PM
 To: Wang, Zhiqiang
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: Cache pool on top of ec base pool teuthology test
 
 On Mon, 25 May 2015, Wang, Zhiqiang wrote:
  We have a bunch of teuthology tests which build cache pool on top of 
  an ec base pool, and do partial object write. This is ok with the 
  current cache tiering implementation. But with proxy write, this 
  won't work. In my testing, the error message is something like below:
  
  2015-05-20 05:51:42.896828 7f682eed4700 10 osd.5 pg_epoch: 183 
  pg[2.1( v 183'7978 (179'7799,183'7978] local-les=172 n=1258 ec=8 
  les/c 172/172
  170/171/171) [5,0] r=0 lpr=171 luod=183'7975 crt=183'7974 lcod
  183'7974 mlcod 183'7974 active+clean] do_proxy_write Start proxy 
  write for osd_op(client.4130.0:32967 plana4222147-6594 [write 
  733117~408660]
  2.3ed99ff9 ack+ondisk+write+known_if_redirected e183) v5
  2015-05-20 05:51:42.899958 7f682c6cf700  1 -- 
  10.214.132.32:6808/20666
  -- 10.214.132.32:0/20666 -- osd_op_reply(17556 plana4222147-6594
  [write 733117~408660] v0'0 uv0 ondisk = -95 ((95) Operation not
  supported)) v6 -- ?+0 0xa2e23c0 con 0x9355760
  
  What should we do with these tests?
 
 I think the test is fine.. but the OSD should refuse to proxy the write if 
 the base tier won't support the write operation in question.  I believe we 
 recently renamed one of the helpers supports_omap()... we probably need a 
 similar set of helpers for object overwrites?
 
 Or, we can make a method like should_proxy_to_ec() that scans through the 
 op vector and makes a conservative judgement of whether it is safe to proxy...
 
 sage
 
 


RE: teuthology timeout error

2015-05-25 Thread Miyamae, Takeshi
Hi Loic,

We rebased our teuthology/ceph-qa-suite and retried the test against LRC on 
current master.
However, we unfortunately got the same result as before (timeout error).

[test conditions]
Target : Ceph-9.0.0-971-gd49d816
https://github.com/kawaguchi-s/teuthology
https://github.com/kawaguchi-s/ceph-qa-suite/tree/wip-10886-lrc

[teuthology log]

2015-05-25 10:18:23 # start

2015-05-25 11:59:52,106.106 INFO:teuthology.orchestra.run.RX35-1:Running: 
'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph 
status --format=json-pretty'
2015-05-25 11:59:52,564.564 INFO:tasks.ceph.ceph_manager:no progress seen, 
keeping timeout for now
2015-05-25 11:59:52,565.565 INFO:tasks.thrashosds.thrasher:Traceback (most 
recent call last):
  File /root/src/ceph-qa-suite_master/tasks/ceph_manager.py, line 635, in 
wrapper
return func(self)
  File /root/src/ceph-qa-suite_master/tasks/ceph_manager.py, line 668, in 
do_thrash
timeout=self.config.get('timeout')
  File /root/src/ceph-qa-suite_master/tasks/ceph_manager.py, line 1569, in 
wait_for_recovery
'failed to recover before timeout expired'
AssertionError: failed to recover before timeout expired

Traceback (most recent call last):
  File 
/root/work/teuthology/virtualenv/lib/python2.7/site-packages/gevent/greenlet.py,
 line 390, in run
result = self._run(*self.args, **self.kwargs)
  File /root/src/ceph-qa-suite_master/tasks/ceph_manager.py, line 635, in 
wrapper
return func(self)
  File /root/src/ceph-qa-suite_master/tasks/ceph_manager.py, line 668, in 
do_thrash
timeout=self.config.get('timeout')
  File /root/src/ceph-qa-suite_master/tasks/ceph_manager.py, line 1569, in 
wait_for_recovery
'failed to recover before timeout expired'
AssertionError: failed to recover before timeout expired
<Greenlet at 0x36cacd0: <bound method Thrasher.do_thrash of 
<tasks.ceph_manager.Thrasher instance at 0x36df3f8>>> failed with AssertionError

Best regards,
Takeshi Miyamae

-Original Message-
From: Loic Dachary [mailto:l...@dachary.org] 
Sent: Thursday, May 21, 2015 6:38 PM
To: Miyamae, Takeshi/宮前 剛; Ceph Development
Cc: Kawaguchi, Shotaro/川口 翔太朗; Imai, Hiroki/今井 宏樹; Nakao, Takanori/中尾 鷹詔; 
Shiozawa, Kensuke/塩沢 賢輔
Subject: Re: teuthology timeout error

Hi,

[sorry the previous mail was sent by accident, here is the full mail]

On 21/05/2015 10:32, Miyamae, Takeshi wrote:
 Hi Loic,
 
 Could you please share the teuthology/ceph-qa-suite repository you 
 are using to run these tests so I can try to reproduce / diagnose the 
 problem ?
 
 https://github.com/kawaguchi-s/teuthology/tree/wip-10886
 https://github.com/kawaguchi-s/ceph-qa-suite/tree/wip-10886
 

When compared against master they show differences that indicate it would be 
good to rebase:

https://github.com/ceph/teuthology/compare/master...kawaguchi-s:wip-10886
https://github.com/ceph/ceph-qa-suite/compare/master...kawaguchi-s:wip-10886

I think the teuthology commit on top of wip-10886 is a mistake

https://github.com/kawaguchi-s/teuthology/commit/348e54931f89c9b0ae7a84eb931576f8414017b5

do you really need to modify teuthology ? It should just be necessary to use 
the latest master branch.

It looks like the

https://github.com/kawaguchi-s/ceph-qa-suite/commit/f2e3ca5d12ceef742eae2a9cf4057c436e9040c3

commit in your ceph-qa-suite is not what you intended. However

https://github.com/kawaguchi-s/ceph-qa-suite/commit/4b39d6d4862f9091a849d224e880795be406815d
https://github.com/kawaguchi-s/ceph-qa-suite/commit/d16b4b058ae118931928541a2c8acd68f9703a44

look ok :-) Instead of naming the test 4nodes16osds3mons1client.yaml it would 
be better to use the same kind of naming you see at 
https://github.com/ceph/ceph-qa-suite/tree/master/suites/rados/thrash-erasure-code/workloads.
 That is a file name made of the distinctive parameters for the shec plugin 
(the parameters that are the default can be omited).

Cheers

 Here are our teuthology/ceph-qa-suite repositories. Thanks in advance.
 
 Best regards,
 Takeshi Miyamae
 
 -Original Message-
 From: Loic Dachary [mailto:l...@dachary.org]
 Sent: Wednesday, May 20, 2015 4:49 PM
 To: Miyamae, Takeshi/宮前 剛; Ceph Development
 Cc: Kawaguchi, Shotaro/川口 翔太朗; Imai, Hiroki/今井 宏樹; Nakao, Takanori/中尾 
 鷹詔; Shiozawa, Kensuke/塩沢 賢輔
 Subject: Re: teuthology timeout error
 
 Hi,
 
 On 20/05/2015 04:20, Miyamae, Takeshi wrote:
 Hi Loic,

 When we fixed our own issue and restarted teuthology,
 
 Great !
 
 we encountered another issue (timeout error) which occurs in case of LRC as 
 well.
 Do you have any information about that ?
 
 Could you please share the teuthology/ceph-qa-suite repository you are using 
 to run these tests so I can try to reproduce / diagnose the problem ?
 
 Thanks
 

 [error messages (in case of LRC pool)]

 2015-04-28 12:38:54,128.128 INFO:teuthology.orchestra.run.RX35-1:Running: 
 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph 
 status --format=json-pretty'
 2015-04-28 12:38:54,516.516 

RE: Cache pool on top of ec base pool teuthology test

2015-05-25 Thread Sage Weil
On Tue, 26 May 2015, Wang, Zhiqiang wrote:
 OK, so the cache tier OSD checks if it's able to proxy the write op, if not, 
 then forces a promotion.
 
 Btw, I don't find the supports_omap helps in the current master. Do you mean 
 the 'CEPH_OSD_COPY_GET_FLAG_NOTSUPP_OMAP' flag?

Yeah, that isn't helpful.  I think what we need is the helper to check the 
op, and make that dependent on is_erasure()?  That should be good enough 
for now.

sage


 
 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com] 
 Sent: Monday, May 25, 2015 11:04 PM
 To: Wang, Zhiqiang
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: Cache pool on top of ec base pool teuthology test
 
 On Mon, 25 May 2015, Wang, Zhiqiang wrote:
  We have a bunch of teuthology tests which build cache pool on top of 
  an ec base pool, and do partial object write. This is ok with the 
  current cache tiering implementation. But with proxy write, this won't 
  work. In my testing, the error message is something like below:
  
  2015-05-20 05:51:42.896828 7f682eed4700 10 osd.5 pg_epoch: 183 pg[2.1( 
  v 183'7978 (179'7799,183'7978] local-les=172 n=1258 ec=8 les/c 172/172 
  170/171/171) [5,0] r=0 lpr=171 luod=183'7975 crt=183'7974 lcod 
  183'7974 mlcod 183'7974 active+clean] do_proxy_write Start proxy write 
  for osd_op(client.4130.0:32967 plana4222147-6594 [write 733117~408660] 
  2.3ed99ff9 ack+ondisk+write+known_if_redirected e183) v5
  2015-05-20 05:51:42.899958 7f682c6cf700  1 -- 10.214.132.32:6808/20666 
  -- 10.214.132.32:0/20666 -- osd_op_reply(17556 plana4222147-6594 
  [write 733117~408660] v0'0 uv0 ondisk = -95 ((95) Operation not 
  supported)) v6 -- ?+0 0xa2e23c0 con 0x9355760
  
  What should we do with these tests?
 
 I think the test is fine.. but the OSD should refuse to proxy the write if 
 the base tier won't support the write operation in question.  I believe we 
 recently renamed one of the helpers supports_omap()... we probably need a 
 similar set of helpers for object overwrites?
 
 Or, we can make a method like should_proxy_to_ec() that scans through the 
 op vector and makes a conservative judgement of whether it is safe to proxy...
 
 sage
 
 