Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
On Thu, Mar 30, 2017 at 07:15:50PM -0400, Mike Snitzer wrote:
> I got pretty far along with implementing the DM thinp support for
> WRITE_ZEROES in terms of thinp's DISCARD support (more of an
> implementation detail.. or so I thought).
>
> But while discussing this effort with Jeff Moyer he asked: shouldn't the
> zeroed blocks be provisioned?  This is a fairly embarrassing question not
> to be able to answer in the moment.  So I clearly need to learn what the
> overall intent of WRITE_ZEROES actually is.

It is to ensure that the block range it's called on returns zeroes on
future reads.  Currently if REQ_UNMAP is set you may free the space
allocation, else not.  After looking at the callers I think this is the
wrong way around, so I'm going to invert the flag, so that the two
callers that care that blocks are allocated will set it instead.
Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
Mike Snitzer writes:

Mike,

> But while discussing this effort with Jeff Moyer he asked: shouldn't the
> zeroed blocks be provisioned?  This is a fairly embarrassing question not
> to be able to answer in the moment.  So I clearly need to learn what the
> overall intent of WRITE_ZEROES actually is.

The intent is to guarantee that all future reads to these blocks will
return zeroes.  Whether to allocate, deallocate, or do a combination of
the two to achieve that goal is up to the device.

> If it is meant as a replacement for WRITE_SAME (as hch switched dm-io
> over from WRITE_SAME with a payload of 0 to WRITE_ZEROES) and for the
> backing mechanism for blkdev_issue_zeroout() then I guess I have my
> answer.

Yes.  I was hoping MD would use WRITE SAME to initialize data and parity
drives.  That's why I went with the SAME nomenclature rather than ZEROES
(which had just appeared in NVMe at that time).  Christoph's renaming is
mostly to emphasize the purpose and the semantics.

People keep getting confused because both REQ_DISCARD and REQ_WRITE_SAME
use the WRITE SAME SCSI command.  But the two uses are completely
orthogonal and governed by different heuristics in sd.c.

> Unless DM thinp can guarantee that the discarded blocks will
> always return zeroes (by zeroing before all partial block writes) my
> discard based dm-thinp implementation of WRITE_ZEROES is a complete
> throw-away (unless block zeroing is enabled.. which it never is because
> performance sucks with it).  So if an upper-level of the IO stack
> (e.g. ext4) were to assume that a block will _definitely_ have zeroes
> then DM thinp would fall short.

You don't have a way to mark those blocks as being full of zeroes
without actually writing them?

Note that the fallback for a zeroout command is to do a regular write.
So if DM doesn't zero the blocks, the block layer is going to do it.

--
Martin K. Petersen      Oracle Linux Engineering
Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
On Thu, Mar 30 2017 at 11:20am -0400, Martin K. Petersen wrote:
> Mike Snitzer writes:
>
> > I can work on this now.  Only question I have is: should DM thinp take
> > care to zero any misaligned head and tail?  (I assume so but with all
> > the back and forth between Bart, Paolo and Martin I figured I'd ask
> > explicitly).
>
> Yep, let's make sure our semantics match the hardware ditto.
>
>  - So write zeroes should behave deterministically and explicitly handle
>    any blocks that can't be cleared via deprovisioning.
>
>  - And discard can work at the discard granularity in a
>    non-deterministic fashion.

I got pretty far along with implementing the DM thinp support for
WRITE_ZEROES in terms of thinp's DISCARD support (more of an
implementation detail.. or so I thought).

But while discussing this effort with Jeff Moyer he asked: shouldn't the
zeroed blocks be provisioned?  This is a fairly embarrassing question not
to be able to answer in the moment.  So I clearly need to learn what the
overall intent of WRITE_ZEROES actually is.

If it is meant as a replacement for WRITE_SAME (as hch switched dm-io
over from WRITE_SAME with a payload of 0 to WRITE_ZEROES) and for the
backing mechanism for blkdev_issue_zeroout() then I guess I have my
answer.  Unless DM thinp can guarantee that the discarded blocks will
always return zeroes (by zeroing before all partial block writes) my
discard-based dm-thinp implementation of WRITE_ZEROES is a complete
throw-away (unless block zeroing is enabled.. which it never is because
performance sucks with it).  So if an upper level of the IO stack
(e.g. ext4) were to assume that a block will _definitely_ have zeroes
then DM thinp would fall short.

This is all to say: I don't see a quick way forward on implementing
performant WRITE_ZEROES support for DM thinp.
Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
Mike Snitzer writes:

> I can work on this now.  Only question I have is: should DM thinp take
> care to zero any misaligned head and tail?  (I assume so but with all
> the back and forth between Bart, Paolo and Martin I figured I'd ask
> explicitly).

Yep, let's make sure our semantics match the hardware ditto.

 - So write zeroes should behave deterministically and explicitly handle
   any blocks that can't be cleared via deprovisioning.

 - And discard can work at the discard granularity in a
   non-deterministic fashion.

--
Martin K. Petersen      Oracle Linux Engineering
Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
On Thu, Mar 30 2017 at 7:44am -0400, Christoph Hellwig wrote:
> On Thu, Mar 30, 2017 at 12:06:41PM +0200, Lars Ellenberg wrote:
>
> > Will it make an fstrim cause thinly provisioned
> > devices to suddenly be fully allocated?
>
> Not for SCSI devices.  Yes for dm-thinp until it implements
> REQ_OP_WRITE_ZEROES, which will hopefully be soon.

I can work on this now.  Only question I have is: should DM thinp take
care to zero any misaligned head and tail?  (I assume so but with all
the back and forth between Bart, Paolo and Martin I figured I'd ask
explicitly).
Re: [Drbd-dev] [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
On Thu, Mar 30, 2017 at 01:44:09PM +0200, Christoph Hellwig wrote:
> On Thu, Mar 30, 2017 at 12:06:41PM +0200, Lars Ellenberg wrote:
> > On Thu, Mar 23, 2017 at 10:33:40AM -0400, Christoph Hellwig wrote:
> > > It seems like DRBD assumes its on the wire TRIM request always zeroes
> > > data.  Use that fact to implement REQ_OP_WRITE_ZEROES.
> > >
> > > XXX: will need a careful audit from the drbd team!
> >
> > Thanks, this one looks ok to me.
>
> So the DRBD protocol requires the TRIM operation to always zero?

"Users" (both as in submitting entities, and people using DRBD) expect
that DRBD guarantees replicas to be identical after whatever operations
have been completed by all replicas.  Which means that for
trim/discard/unmap, we can only expose that to upper layers (or use it
for internal purposes) if the operation has a well-defined, and on all
backends identical, result.

Short answer: yes.

> > The real question for me is, will the previous one (21/23)
> > return != 0 (some EOPNOTSUPP or else) to DRBD in more situations than
> > what we have now?
>
> No, blkdev_issue_zeroout should never return -EOPNOTSUPP.
>
> > Will it make an fstrim cause thinly provisioned
> > devices to suddenly be fully allocated?
>
> Not for SCSI devices.  Yes for dm-thinp until it implements
> REQ_OP_WRITE_ZEROES, which will hopefully be soon.

"Good enough for me" ;-)

Thanks,
Lars
Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
On Thu, Mar 30, 2017 at 12:06:41PM +0200, Lars Ellenberg wrote:
> On Thu, Mar 23, 2017 at 10:33:40AM -0400, Christoph Hellwig wrote:
> > It seems like DRBD assumes its on the wire TRIM request always zeroes
> > data.  Use that fact to implement REQ_OP_WRITE_ZEROES.
> >
> > XXX: will need a careful audit from the drbd team!
>
> Thanks, this one looks ok to me.

So the DRBD protocol requires the TRIM operation to always zero?

> The real question for me is, will the previous one (21/23)
> return != 0 (some EOPNOTSUPP or else) to DRBD in more situations than
> what we have now?

No, blkdev_issue_zeroout should never return -EOPNOTSUPP.

> Will it make an fstrim cause thinly provisioned
> devices to suddenly be fully allocated?

Not for SCSI devices.  Yes for dm-thinp until it implements
REQ_OP_WRITE_ZEROES, which will hopefully be soon.
Re: [PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
On Thu, Mar 23, 2017 at 10:33:40AM -0400, Christoph Hellwig wrote:
> It seems like DRBD assumes its on the wire TRIM request always zeroes
> data.  Use that fact to implement REQ_OP_WRITE_ZEROES.
>
> XXX: will need a careful audit from the drbd team!

Thanks, this one looks ok to me.

The real question for me is: will the previous one (21/23)
return != 0 (some EOPNOTSUPP or else) to DRBD in more situations than
what we have now?  Will it make an fstrim cause thinly provisioned
devices to suddenly be fully allocated?  Or does it unmap "the same"
as what we have now?  Especially on top of dm-thin, but also on top of
any other device.  That's something that is not really "obvious" to me
yet.

Cheers,
Lars
[PATCH 22/23] drbd: implement REQ_OP_WRITE_ZEROES
It seems like DRBD assumes its on the wire TRIM request always zeroes
data.  Use that fact to implement REQ_OP_WRITE_ZEROES.

XXX: will need a careful audit from the drbd team!

Signed-off-by: Christoph Hellwig
---
 drivers/block/drbd/drbd_main.c     | 3 ++-
 drivers/block/drbd/drbd_nl.c       | 2 ++
 drivers/block/drbd/drbd_receiver.c | 6 +++---
 drivers/block/drbd/drbd_req.c      | 7 +++++--
 drivers/block/drbd/drbd_worker.c   | 4 +++-
 5 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 92c60cbd04ee..8e62d9f65510 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -1668,7 +1668,8 @@ static u32 bio_flags_to_wire(struct drbd_connection *connection,
 		       (bio->bi_opf & REQ_FUA ? DP_FUA : 0) |
 		       (bio->bi_opf & REQ_PREFLUSH ? DP_FLUSH : 0) |
 		       (bio_op(bio) == REQ_OP_WRITE_SAME ? DP_WSAME : 0) |
-		       (bio_op(bio) == REQ_OP_DISCARD ? DP_DISCARD : 0);
+		       (bio_op(bio) == REQ_OP_DISCARD ? DP_DISCARD : 0) |
+		       (bio_op(bio) == REQ_OP_WRITE_ZEROES ? DP_DISCARD : 0);
 	else
 		return bio->bi_opf & REQ_SYNC ? DP_RW_SYNC : 0;
 }
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 908c704e20aa..e4516d3b971d 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1217,10 +1217,12 @@ static void decide_on_discard_support(struct drbd_device *device,
 		blk_queue_discard_granularity(q, 512);
 		q->limits.max_discard_sectors = drbd_max_discard_sectors(connection);
 		queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
+		q->limits.max_write_zeroes_sectors = drbd_max_discard_sectors(connection);
 	} else {
 		queue_flag_clear_unlocked(QUEUE_FLAG_DISCARD, q);
 		blk_queue_discard_granularity(q, 0);
 		q->limits.max_discard_sectors = 0;
+		q->limits.max_write_zeroes_sectors = 0;
 	}
 }
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 8641776f043a..e64e01ca4d1a 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -2285,7 +2285,7 @@ static unsigned long wire_flags_to_bio_flags(u32 dpf)
 static unsigned long wire_flags_to_bio_op(u32 dpf)
 {
 	if (dpf & DP_DISCARD)
-		return REQ_OP_DISCARD;
+		return REQ_OP_WRITE_ZEROES;
 	else
 		return REQ_OP_WRITE;
 }
@@ -2476,7 +2476,7 @@ static int receive_Data(struct drbd_connection *connection, struct packet_info *
 	op_flags = wire_flags_to_bio_flags(dp_flags);
 	if (pi->cmd == P_TRIM) {
 		D_ASSERT(peer_device, peer_req->i.size > 0);
-		D_ASSERT(peer_device, op == REQ_OP_DISCARD);
+		D_ASSERT(peer_device, op == REQ_OP_WRITE_ZEROES);
 		D_ASSERT(peer_device, peer_req->pages == NULL);
 	} else if (peer_req->pages == NULL) {
 		D_ASSERT(device, peer_req->i.size == 0);
@@ -4789,7 +4789,7 @@ static int receive_rs_deallocated(struct drbd_connection *connection, struct pac
 	if (get_ldev(device)) {
 		struct drbd_peer_request *peer_req;
-		const int op = REQ_OP_DISCARD;
+		const int op = REQ_OP_WRITE_ZEROES;

 		peer_req = drbd_alloc_peer_req(peer_device, ID_SYNCER, sector,
 					       size, 0, GFP_NOIO);
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index bdd42b8360cc..d76ee53ce528 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -59,6 +59,7 @@ static struct drbd_request *drbd_req_new(struct drbd_device *device, struct bio
 	drbd_req_make_private_bio(req, bio_src);
 	req->rq_state = (bio_data_dir(bio_src) == WRITE ? RQ_WRITE : 0)
 		      | (bio_op(bio_src) == REQ_OP_WRITE_SAME ? RQ_WSAME : 0)
+		      | (bio_op(bio_src) == REQ_OP_WRITE_ZEROES ? RQ_UNMAP : 0)
 		      | (bio_op(bio_src) == REQ_OP_DISCARD ? RQ_UNMAP : 0);
 	req->device = device;
 	req->master_bio = bio_src;
@@ -1180,7 +1181,8 @@ drbd_submit_req_private_bio(struct drbd_request *req)
 	if (get_ldev(device)) {
 		if (drbd_insert_fault(device, type))
 			bio_io_error(bio);
-		else if (bio_op(bio) == REQ_OP_DISCARD)
+		else if (bio_op(bio) == REQ_OP_WRITE_ZEROES ||
+			 bio_op(bio) == REQ_OP_DISCARD)
 			drbd_process_discard_req(req);
 		else
 			generic_make_request(bio);
@@ -1234,7 +1236,8 @@ drbd_request_prepare(struct drbd_device *device, struct bio *bio, unsigned long
 	_drbd_start_io_acct(device, req);

 	/* process discards alwa