On Thu, Jul 27, 2017 at 12:32 PM, <xiongweijiang...@gmail.com> wrote:
> From: Xiongwei Jiang <xiongwei.ji...@alibaba-inc.com>
>
> when an application is writing or reading on rbd device, if some or all
> OSDs crash, the application will hang and can't be killed because it is
> in D state. Even though OSDs comes up later, the application may still
> keeps in D state. So we need a timeout mechanism to solve this problem.

Hi Xiongwei,

This shouldn't happen -- when the OSDs come back up, all requests should
be properly resent and completed. If you have a scenario where the OSDs
come back up into HEALTH_OK and this doesn't happen, that's likely a bug.

>
> Signed-off-by: Xiongwei Jiang <xiongwei.ji...@alibaba-inc.com>
> ---
>  drivers/block/rbd.c | 34 +++++++++++++++++++++++++++++++++-
>  1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index c16f745..33a1c97 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -159,6 +159,13 @@ struct rbd_image_header {
>  	u64 *snap_sizes;	/* format 1 only */
>  };
>
> +
> +struct rbd_request_linker {
> +	struct work_struct work;
> +	void *img_request;
> +};
> +
> +
>  /*
>   * An rbd image specification.
>   *
> @@ -4013,6 +4020,7 @@ static void rbd_queue_workfn(struct work_struct *work)
>  	struct request *rq = blk_mq_rq_from_pdu(work);
>  	struct rbd_device *rbd_dev = rq->q->queuedata;
>  	struct rbd_img_request *img_request;
> +	struct rbd_request_linker *linker;
>  	struct ceph_snap_context *snapc = NULL;
>  	u64 offset = (u64)blk_rq_pos(rq) << SECTOR_SHIFT;
>  	u64 length = blk_rq_bytes(rq);
> @@ -4120,6 +4128,7 @@ static void rbd_queue_workfn(struct work_struct *work)
>  		goto err_unlock;
>  	}
>  	img_request->rq = rq;
> +	linker->img_request = img_request;
>  	snapc = NULL; /* img_request consumes a ref */
>
>  	if (op_type == OBJ_OP_DISCARD)
> @@ -4358,9 +4367,32 @@ static int rbd_init_request(struct blk_mq_tag_set
> *set, struct request *rq,
>  	return 0;
>  }
>
> +static enum blk_eh_timer_return rbd_request_timeout(struct request *rq,
> +		bool reserved)
> +{
> +	struct rbd_obj_request *obj_request;
> +	struct rbd_obj_request *next_obj_request;
> +	struct rbd_img_request *img_request;
> +	struct rbd_request_linker *linker = blk_mq_rq_to_pdu(rq);
> +
> +	img_request = (struct rbd_img_request *)linker->img_request;
> +	for_each_obj_request_safe(img_request, obj_request, next_obj_request)
> {
> +		struct ceph_osd_request *osd_req = obj_request->osd_req;
> +
> +		if (!osd_req)
> +			printk(KERN_INFO "osd_req is null \n");
> +		else
> +			ceph_osdc_cancel_request(osd_req);
> +	}
> +	return BLK_EH_HANDLED;
> +}
> +
> +
> +
>  static const struct blk_mq_ops rbd_mq_ops = {
>  	.queue_rq = rbd_queue_rq,
>  	.init_request = rbd_init_request,
> +	.timeout = rbd_request_timeout,

The default blk-mq timeout is 30 seconds. This means that any "slow"
request (osd_op_complaint_time ceph.conf option) will be timed out,
forcing the mounted filesystem into read-only mode at best.

>  };
>
>  static int rbd_init_disk(struct rbd_device *rbd_dev)
> @@ -4392,7 +4424,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
>  	rbd_dev->tag_set.numa_node = NUMA_NO_NODE;
>  	rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
>  	rbd_dev->tag_set.nr_hw_queues = 1;
> -	rbd_dev->tag_set.cmd_size = sizeof(struct work_struct);
> +	rbd_dev->tag_set.cmd_size = sizeof(struct rbd_request_linker);
>
>  	err = blk_mq_alloc_tag_set(&rbd_dev->tag_set);
>  	if (err)

We already have such a timeout at libceph level -- osd_request_timeout
(similar to rados_osd_op_timeout ceph.conf option). Note that just as
rados_osd_op_timeout in userspace, osd_request_timeout is disabled by
default and is NOT recommended for use with librbd/krbd.

Thanks,

                Ilya
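
To illustrate the difference between the two timeouts discussed above,
here is a minimal userspace sketch (not kernel code). The helper
request_times_out() and the macro names are hypothetical, and the sketch
assumes that a timeout value of 0 means "disabled", which is how a
timeout that is off by default would typically be represented.

```c
#include <stdbool.h>

/* Hypothetical names for illustration only; these are not kernel symbols. */
#define BLK_MQ_DEFAULT_TIMEOUT_SECS	30	/* default cited in the reply */
#define OSD_REQUEST_TIMEOUT_DISABLED	0	/* assumption: 0 == no timeout */

/*
 * Return true if a request that needs latency_secs to complete would be
 * failed by a timeout of timeout_secs. A timeout of 0 is treated as
 * "disabled": the request simply waits until the OSDs answer.
 */
bool request_times_out(unsigned int latency_secs, unsigned int timeout_secs)
{
	if (timeout_secs == OSD_REQUEST_TIMEOUT_DISABLED)
		return false;
	return latency_secs > timeout_secs;
}
```

Under these assumptions, a slow but otherwise healthy 45-second request
is failed when checked against BLK_MQ_DEFAULT_TIMEOUT_SECS, whereas with
the timeout disabled it is left to complete once the OSDs respond.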