Re: [PATCH] bcache: recover data from backing device when read request hit clean
Rui Hua-- Thanks for the clarification. You're correct, it doesn't get called, and it goes to read_complete. However, read_error should probably get called. How would you suggest handling this-- should we initially set read_dirty_data true and clear it if it is all clean? (and change the condition to properly go into the error path). As it is, the patch makes things no worse, but the error path is very unsafe.

Mike

On Thu, Nov 16, 2017 at 8:25 PM, Rui Hua wrote:
> Hi, Michael--
>
> Thanks for the quick reply.
>
> If the metadata on the cache device was NOT correctly read,
> cached_dev_read_error() will not be called, so the recovery check will
> not be executed, so this patch is safe.
> s->iop.error was not set when we failed to read metadata on the cache
> device, so cached_dev_bio_complete() will be called instead of
> cached_dev_read_error().
>
> 2017-11-17 12:02 GMT+08:00 Michael Lyle :
>> Hi, Rui Hua--
>>
>> On 11/16/2017 07:51 PM, Rui Hua wrote:
>>> In this patch, we use s->read_dirty_data to judge whether the read
>>> request hit dirty data in the cache device; it is safe to reread data
>>> from the backing device when the read request hit clean data. This can
>>> not only handle the cache read race, but also recover data when a read
>>> request from the cache device fails.
>>
>> I haven't read over all of this yet, to understand the read race. But
>> this change can't be safe-- read_dirty_data is only set to true if the
>> metadata on the cache device is correctly read. If that fails, the flag
>> may not be set and we could read stale data from the backing device.
>>
>> Mike
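The fail-safe default suggested above (assume dirty until proven clean) can be sketched in userspace C. This is a hypothetical model, not bcache's actual code; `search`, `metadata_read_done`, and `may_recover_from_backing` are made-up names:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-request state, modeling bcache's struct search. */
struct search {
	bool read_dirty_data;
	bool metadata_ok;
};

static void search_init(struct search *s)
{
	/* Conservative default: treat the data as dirty until proven clean. */
	s->read_dirty_data = true;
	s->metadata_ok = false;
}

static void metadata_read_done(struct search *s, bool ok, bool all_clean)
{
	s->metadata_ok = ok;
	if (ok && all_clean)
		s->read_dirty_data = false;	/* proven clean */
}

/* Rereading from the backing device is only safe for clean data. */
static bool may_recover_from_backing(const struct search *s)
{
	return s->metadata_ok && !s->read_dirty_data;
}
```

With this ordering, a failed metadata read leaves the flag in its safe state, so the error path cannot accidentally serve stale backing-device data.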
Re: [GIT PULL] Followup merge window block fixes/changes
On 11/17/2017 12:35 PM, Linus Torvalds wrote:
> On Fri, Nov 17, 2017 at 11:29 AM, Linus Torvalds wrote:
>>
>> "F*ck no, that code is too ugly, you need to write it so that it can
>> be maintained".
>
> Dammit, the rest of the pull looks ok, so I'll take it anyway.
>
> But I really do expect you to
>
> (a) clean up that mess - maybe with my patch as an example, but maybe
> some entirely different way. The patch I sent is already deleted from
> my tree, it was purely meant as an example.
>
> (b) really push back on people when they send you ugly code
>
> It shouldn't have to always be me being upset and pushing back on the
> block pulls.

My bad, I have probably given Paolo and Luca a bit more leeway than usual, since they are new to upstream development and the BFQ code does need some cleanups here and there. A huge part of that was completed as part of getting it merged, but it still needs a bit of love. Now that they have received the Linus hazing, I guess the honeymoon is over.

In all seriousness, I do push back on crap quite often. Sometimes I miss some too...

-- 
Jens Axboe
Re: [GIT PULL] Followup merge window block fixes/changes
> On 17 Nov 2017, at 20:29, Linus Torvalds wrote:
>
> On Fri, Nov 17, 2017 at 8:51 AM, Jens Axboe wrote:
>>
>> - Small BFQ updates series from Luca and Paolo.
>
> Honestly, this code is too ugly to live.
>
> Seriously. That update should have been rejected on the grounds of
> being completely unmaintainable crap.
>
> Why didn't you?
>
> Why are you allowing code like that patch to bfq_dispatch_request()
> into the kernel source tree?
>
> Yes, it improves performance. But it does so by adding random
> #ifdef's into the middle of some pretty critical code, with
> absolutely no attempt at having a sane maintainable code-base.
>
> That function is literally *three* lines of code without the #ifdef case:
>
>	spin_lock_irq(&bfqd->lock);
>	rq = __bfq_dispatch_request(hctx);
>	spin_unlock_irq(&bfqd->lock);
>
> and you could actually see what it did.
>
> But instead of trying to abstract it in some legible manner, that
> three-line obvious function got *three* copies of the same #if mess
> all enclosing random crap, and the end result is really hard to read.
>
> For example, I just spent a couple of minutes trying to make sense of
> the code, and stop that unreadable mess. I came up with the appended
> patch.
>
> It may not work for some reason I can't see, but that's not the point.
> The attached patch is not meant as a "apply this as-is".
>
> It's meant as a "look, you can write legible code where the statistics
> just go away when they are compiled out".
>
> Now that "obvious three-liner function" is not three lines any more,
> but it's at least *clear* straight-line code:
>
>	spin_lock_irq(&bfqd->lock);
>
>	in_serv_queue = bfqd->in_service_queue;
>	waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
>
>	rq = __bfq_dispatch_request(hctx);
>
>	idle_timer_disabled =
>		waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);
>
>	spin_unlock_irq(&bfqd->lock);
>
>	bfq_dispatch_statistics(hctx, rq, in_serv_queue, waiting_rq,
>				idle_timer_disabled);
>
> and I am hopeful that when bfq_dispatch_statistics() is disabled, the
> compiler will be smart enough to not generate extra code, certainly
> not noticeably so.
>

Sorry for causing this problem. Yours was our first version, but then we feared that leaving useless instructions was worse than adding a burst of ifdefs. I'll try not to repeat this mistake.

Thanks,
Paolo

> See? One is a mess of horrible ugly #ifdef's in the middle of the code
> that makes the end result completely illegible.
>
> The other is actually somewhat legible, and has a clear separation of
> the statistics gathering. And that *statistics* gathering is clearly
> optionally enabled or not.
>
> Not the mess of random code put together in random ways that are
> completely illegible.
>
> Your job as maintainer is _literally_ to tell people "f*ck no, that
> code is too ugly, you need to write it so that it can be maintained".
>
> And I'm doing my job.
>
> "F*ck no, that code is too ugly, you need to write it so that it can
> be maintained".
>
> Show some taste.
>
>	Linus
Re: [GIT PULL] Followup merge window block fixes/changes
On Fri, Nov 17, 2017 at 12:18 PM, Paolo Valente wrote:
>
> Sorry for causing this problem. Yours was our first version, but then
> we feared that leaving useless instructions was worse than adding a
> burst of ifdefs. I'll try not to repeat this mistake.

So I generally do not want people to depend on the compiler to do non-obvious conversions (so please do not write bad code that you just hope the compiler can sort out and make reasonable), but we do heavily rely on the compiler doing the obvious dead code removal, particularly for when things end up not being used.

A lot of the time that's very much something you can and should depend on, because we have abstraction layers that involve a lot of code just becoming no-ops on many architectures, while other architectures need some particular sequence.

Another example is the many helper functions that do verification of various kernel rules and are enabled by some debug option (eg CONFIG_DEBUG_ATOMIC_SLEEP etc), where we know that (and depend on) the compiler will do a reasonable job. That's not so different from having a "statistics gathering function that compiles to nothing".

And yes, sometimes the code might need to be written so that the compiler has an easier time _seeing_ that the code really goes away, so it's not necessarily a blind trust in the compiler. So for example, the functions that can go away should obviously be inline functions, so that you don't end up having the compiler generate the arguments and the call to an empty function body, and so that the compiler can see that "oh, those arguments aren't even used at all, so now I can remove all the code that generates them".

If you are comfy with assembler, it can actually be a great thing to occasionally just verify the actual generated code if it's some piece of code you care deeply about. Obviously for performance work, profiling happens, and then you see it that way. Because sometimes you find some silly "Duh!" moments, where you didn't realize that the compiler wasn't able to really get rid of something for some reason that is usually very obvious in hindsight, but wasn't actually obvious at all until you looked at what the compiler generated.

	Linus
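The pattern described above can be sketched in userspace C (hypothetical names and config macro, not the actual BFQ code): when the option is off, the helper is an empty `static inline`, so an optimizing compiler drops both the call and the computation of its arguments.

```c
#include <assert.h>
#include <stdbool.h>

/* Uncomment to simulate the config option being enabled. */
/* #define CONFIG_DEBUG_STATS 1 */

static int stats_calls;	/* hypothetical counter, only bumped when enabled */

#ifdef CONFIG_DEBUG_STATS
static void record_dispatch_stats(bool waiting, bool idle_timer_disabled)
{
	if (waiting || idle_timer_disabled)
		stats_calls++;
}
#else
/*
 * Empty inline stub: with optimization on, the compiler removes the call
 * and any argument setup that feeds it, so the fast path pays nothing.
 */
static inline void record_dispatch_stats(bool waiting, bool idle_timer_disabled) { }
#endif

static int dispatch_one(int queue_busy)
{
	bool waiting = queue_busy != 0;	/* stand-in for bfq_bfqq_wait_request() */
	int rq = queue_busy;		/* stand-in for __bfq_dispatch_request() */

	record_dispatch_stats(waiting, !waiting);
	return rq;
}
```

Compiling both variants with `-O2` and comparing the disassembly is one way to verify, as suggested above, that the stub variant really generates no extra code.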
Re: [PATCH] bio: ensure __bio_clone_fast copies bi_partno
On 17 November 2017 at 20:47, Michael Lyle wrote:
> A new field was introduced in 74d46992e0d9dee7f1f376de0d56d31614c8a17a,
> bi_partno, instead of using bdev->bd_contains and encoding the partition
> information in the bi_bdev field. __bio_clone_fast was changed to copy
> the disk information, but not the partition information. At minimum,
> this regressed bcache and caused data corruption.

Thanks a lot for this Michael, I will run it up on our test servers today.

Campbell
Re: [GIT PULL] Followup merge window block fixes/changes
On Fri, Nov 17, 2017 at 11:29 AM, Linus Torvalds wrote:
>
> "F*ck no, that code is too ugly, you need to write it so that it can
> be maintained".

Dammit, the rest of the pull looks ok, so I'll take it anyway.

But I really do expect you to

(a) clean up that mess - maybe with my patch as an example, but maybe some entirely different way. The patch I sent is already deleted from my tree, it was purely meant as an example.

(b) really push back on people when they send you ugly code

It shouldn't have to always be me being upset and pushing back on the block pulls.

	Linus
Re: [GIT PULL] Followup merge window block fixes/changes
On Fri, Nov 17, 2017 at 8:51 AM, Jens Axboe wrote:
>
> - Small BFQ updates series from Luca and Paolo.

Honestly, this code is too ugly to live.

Seriously. That update should have been rejected on the grounds of being completely unmaintainable crap.

Why didn't you?

Why are you allowing code like that patch to bfq_dispatch_request() into the kernel source tree?

Yes, it improves performance. But it does so by adding random #ifdef's into the middle of some pretty critical code, with absolutely no attempt at having a sane maintainable code-base.

That function is literally *three* lines of code without the #ifdef case:

	spin_lock_irq(&bfqd->lock);
	rq = __bfq_dispatch_request(hctx);
	spin_unlock_irq(&bfqd->lock);

and you could actually see what it did.

But instead of trying to abstract it in some legible manner, that three-line obvious function got *three* copies of the same #if mess all enclosing random crap, and the end result is really hard to read.

For example, I just spent a couple of minutes trying to make sense of the code, and stop that unreadable mess. I came up with the appended patch.

It may not work for some reason I can't see, but that's not the point. The attached patch is not meant as a "apply this as-is".

It's meant as a "look, you can write legible code where the statistics just go away when they are compiled out".

Now that "obvious three-liner function" is not three lines any more, but it's at least *clear* straight-line code:

	spin_lock_irq(&bfqd->lock);

	in_serv_queue = bfqd->in_service_queue;
	waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);

	rq = __bfq_dispatch_request(hctx);

	idle_timer_disabled =
		waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);

	spin_unlock_irq(&bfqd->lock);

	bfq_dispatch_statistics(hctx, rq, in_serv_queue, waiting_rq,
				idle_timer_disabled);

and I am hopeful that when bfq_dispatch_statistics() is disabled, the compiler will be smart enough to not generate extra code, certainly not noticeably so.

See?
One is a mess of horrible ugly #ifdef's in the middle of the code that makes the end result completely illegible.

The other is actually somewhat legible, and has a clear separation of the statistics gathering. And that *statistics* gathering is clearly optionally enabled or not.

Not the mess of random code put together in random ways that are completely illegible.

Your job as maintainer is _literally_ to tell people "f*ck no, that code is too ugly, you need to write it so that it can be maintained".

And I'm doing my job.

"F*ck no, that code is too ugly, you need to write it so that it can be maintained".

Show some taste.

	Linus

 block/bfq-iosched.c | 55 -
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index bcb6d21baf12..47d88d03dadd 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3689,35 +3689,14 @@ static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	return rq;
 }
 
-static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
-{
-	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
-	struct request *rq;
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-	struct bfq_queue *in_serv_queue, *bfqq;
-	bool waiting_rq, idle_timer_disabled;
-#endif
-
-	spin_lock_irq(&bfqd->lock);
-
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-	in_serv_queue = bfqd->in_service_queue;
-	waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
-
-	rq = __bfq_dispatch_request(hctx);
-
-	idle_timer_disabled =
-		waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);
-
-#else
-	rq = __bfq_dispatch_request(hctx);
-#endif
-	spin_unlock_irq(&bfqd->lock);
+static void bfq_dispatch_statistics(struct blk_mq_hw_ctx *hctx, struct request *rq,
+		struct bfq_queue *in_serv_queue, bool waiting_rq, bool idle_timer_disabled)
+{
+	struct bfq_queue *bfqq = rq ? RQ_BFQQ(rq) : NULL;
 
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-	bfqq = rq ? RQ_BFQQ(rq) : NULL;
 	if (!idle_timer_disabled && !bfqq)
-		return rq;
+		return;
 
 	/*
 	 * rq and bfqq are guaranteed to exist until this function
@@ -3752,8 +3731,32 @@ static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
 		bfqg_stats_update_io_remove(bfqg, rq->cmd_flags);
 	}
 	spin_unlock_irq(hctx->queue->queue_lock);
+}
+#else
+static inline void bfq_dispatch_statistics(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, struct bfq_queue *in_serv_queue,
+		bool waiting_rq, bool idle_timer_disabled) {}
 #endif
 
+static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+
Re: [RFC PATCH] blk-throttle: add burst allowance.
On Thu, Nov 16, 2017 at 08:25:58PM -0800, Khazhismel Kumykov wrote:
> On Thu, Nov 16, 2017 at 8:50 AM, Shaohua Li wrote:
> > On Tue, Nov 14, 2017 at 03:10:22PM -0800, Khazhismel Kumykov wrote:
> >> Allows configuring additional bytes or ios before a throttle is
> >> triggered.
> >>
> >> This allows implementation of a bucket style rate-limit/throttle on a
> >> block device. Previously, bursting to a device was limited to the
> >> allowance granted in a single throtl_slice (similar to a bucket with
> >> limit N and refill rate N/slice).
> >>
> >> Additional parameters bytes/io_burst_conf defined for tg, which define a
> >> number of bytes/ios that must be depleted before throttling happens. A
> >> tg that does not deplete this allowance functions as though it has no
> >> configured limits. tgs earn additional allowance at the rate defined by
> >> bps/iops for the tg. Once a tg has *_disp > *_burst_conf, throttling
> >> kicks in. If a tg is idle for a while, it will again have some burst
> >> allowance before it gets throttled again.
> >>
> >> slice_end for a tg is extended until io_disp/byte_disp would fall to 0,
> >> when all "used" burst allowance would be earned back. trim_slice still
> >> does progress slice_start as before and decrements *_disp as before, and
> >> tgs continue to get bytes/ios in throtl_slice intervals.
> >
> > Can you describe why we need this? It would be great if you can describe
> > the usage model and an example. Does this work for io.low/io.max or both?
> >
> > Thanks,
> > Shaohua
>
> The use case that brought this up was configuring limits for a remote
> shared device. Bursting beyond io.max is desired, but only for so much
> before the limit kicks in; afterwards, with sustained usage, throughput
> is capped. (This proactively avoids remote-side limits). In that case one
> would configure in a root container io.max + io.burst, and configure
> low/other limits on descendants sharing the resource on the same node.
> With this patch, so long as a tg has not dispatched more than the burst,
> no limit is applied at all by that tg, including the limit imposed by
> io.low in tg_iops_limit, etc.

I'd appreciate it if you can give more details about the 'why'. 'Configuring limits for a remote shared device' doesn't justify the change.
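The bucket-style behavior described in the patch can be sketched in userspace C. This is a simplified model (all names hypothetical, integer seconds instead of jiffies, no hierarchy): dispatched bytes accumulate in `disp`, are earned back at `bps`, and dispatch is refused once the burst allowance would be exceeded.

```c
#include <assert.h>

/*
 * Hypothetical token-bucket throttle: a group may dispatch freely until
 * it has consumed `burst` bytes beyond what its bps rate has earned back.
 */
struct throtl_bucket {
	unsigned long bps;	/* refill rate, bytes per second */
	unsigned long burst;	/* extra allowance before throttling */
	unsigned long disp;	/* bytes dispatched, not yet earned back */
	unsigned long last_time;/* last refill timestamp, in seconds */
};

/* Earn allowance back at `bps`, then check whether `bytes` may pass now. */
static int throtl_may_dispatch(struct throtl_bucket *tg,
			       unsigned long now, unsigned long bytes)
{
	unsigned long earned = (now - tg->last_time) * tg->bps;

	tg->disp = earned >= tg->disp ? 0 : tg->disp - earned;
	tg->last_time = now;

	if (tg->disp + bytes > tg->burst)
		return 0;	/* over the burst allowance: throttle */
	tg->disp += bytes;
	return 1;
}
```

An idle group's `disp` drains back toward zero, which models the commit message's point that an idle tg regains burst allowance before being throttled again.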
Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
On Nov 15, 2017, at 7:18 PM, Qu Wenruo wrote:
>
> [Background]
> Recently I'm considering the possibility to use checksums from the
> filesystem to enhance device-mapper raid.
>
> The idea behind it is quite simple, since most modern filesystems have
> checksums for their metadata, and some (btrfs) even have checksums for data.
>
> And for btrfs RAID1/10 (just ignore RAID5/6 for now), at read time
> it can use the checksum to determine which copy is correct, so it can
> return the correct data even if one copy gets corrupted.
>
> [Objective]
> The final objective is to allow device mapper to do the checksum
> verification (and repair if possible).
>
> If only for verification, it's not much different from the current endio
> hook method used by most of the filesystems.
> However, if we can move the repair part out of the filesystem (well, only
> btrfs supports it yet), it would benefit all filesystems.

I recall Darrick was looking into a mechanism to do this. Rather than changing the whole block layer to take a callback to do a checksum, what we looked at was to allow the upper-layer read to specify a "retry count" to the lower-layer block device. If the lower layer is able to retry the read then it will read a different device (or combination of devices for e.g. RAID-6) based on the retry count, until the upper layer gets a good read (based on checksum, or whatever). If there are no more devices (or combinations) to try then a final error is returned.

Darrick can probably point at the original thread/patch.

Cheers, Andreas
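The retry-count interface sketched above can be modeled in userspace C. All names here are hypothetical (this is not a real dm or block-layer API): the upper layer owns the checksum check and keeps bumping the retry count; the lower layer maps the count to a different copy until it runs out.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_COPIES 2

/* Hypothetical mirror: copy 0 returns corrupt data, copy 1 is good. */
static int read_copy(int retry_count, int *data)
{
	if (retry_count >= NUM_COPIES)
		return -1;	/* no more copies to try: final error */
	*data = (retry_count == 0) ? 999 : 42;
	return 0;
}

static bool checksum_ok(int data)
{
	return data == 42;	/* stand-in for a real checksum check */
}

/*
 * The upper layer drives recovery: bump the retry count until the lower
 * layer hands back data that passes the checksum, or runs out of copies.
 */
static int read_with_recovery(int *data)
{
	for (int retry = 0; ; retry++) {
		if (read_copy(retry, data) < 0)
			return -1;		/* exhausted all combinations */
		if (checksum_ok(*data))
			return retry;		/* which copy finally succeeded */
	}
}
```

The appeal of this split is that the block layer never needs to know the checksum algorithm; it only needs a deterministic mapping from retry count to device (or device combination, for RAID-6).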
Re: [RFC PATCH 0/2] apply write hints to select the type of segments
...

> >>> From: Hyunchul Lee
> >>>
> >>> Using write hints[1], applications can inform the lifetime of the data
> >>> written to devices. And this[2] reported that the write hints patch
> >>> decreased writes in NAND by 25%.
> >>>
> >>> These hints help F2FS to determine the following:
> >>> 1) the segment types where the data will be written.
> >>> 2) the hints that will be passed down to devices with the data of
> >>> segments.
> >>>
> >>> This patch set implements the first mapping from write hints to
> >>> segment types as shown below.
> >>>
> >>>   hints                segment type
> >>>   -----------------------------------
> >>>   WRITE_LIFE_SHORT     CURSEG_COLD_DATA
> >>>   WRITE_LIFE_EXTREME   CURSEG_HOT_DATA
> >>>   others               CURSEG_WARM_DATA
> >>>
> >>> The F2FS policy for hot/cold separation has precedence over these
> >>> hints, and hints are not applied in in-place updates.
> >>
> >> Could we change to disable IPU if a file/inode write hint is existing?
> >
> > I am afraid that this makes side effects. For example, this could cause
> > out-of-place updates even when there are not enough free segments.
> > I can write the patch that handles these situations. But I wonder
> > whether this is required, and I am not sure which IPU policies can
> > be disabled.
>
> Oh, as I replied in another thread, I think IPU just affects filesystem
> hot/cold separating, rather than this feature. So I think it will be okay
> to not consider it.
>
> >>>
> >>> Before the second mapping is implemented, write hints are not passed
> >>> down to devices. Because it is better that the data of a segment have
> >>> the same hint.
> >>>
> >>> [1]: c75b1d9421f80f4143e389d2d50ddfc8a28c8c35
> >>> [2]: https://lwn.net/Articles/726477/
> >>
> >> Could you write a patch to support passing the write hint to the block
> >> layer for buffered writes, as in the below commit:
> >> 0127251c45ae ("ext4: add support for passing in write hints
> >> for buffered writes")
> >
> > Sure I will. I wrote it already ;)
>
> Cool, ;)
>
> > I think that data from the same segment should be passed down with the
> > same hint, and the following mapping is reasonable. I wonder what is
> > your opinion about it.
> >
> >   segment type       hints
> >   -----------------------------------
> >   CURSEG_COLD_DATA   WRITE_LIFE_EXTREME
> >   CURSEG_HOT_DATA    WRITE_LIFE_SHORT
> >   CURSEG_COLD_NODE   WRITE_LIFE_NORMAL
>
> We have WRITE_LIFE_LONG defined rather than WRITE_LIFE_NORMAL in fs.h?
>
> >   CURSEG_HOT_NODE    WRITE_LIFE_MEDIUM
>
> As I know, in the scenario of a cell phone, data of meta_inode is hottest,
> then hot data, warm node, and cold node should be coldest. So I suggested
> we can define as below:
>
>   META_DATA              WRITE_LIFE_SHORT
>   HOT_DATA & WARM_NODE   WRITE_LIFE_MEDIUM
>   HOT_NODE & WARM_DATA   WRITE_LIFE_LONG
>   COLD_NODE & COLD_DATA  WRITE_LIFE_EXTREME
>
> >>>
> >>> I agree, but I am not sure that assigning the same hint to a node and
> >>> data segment is good, because NVMe is likely to write them in the same
> >>> erase block if they have the same hint.
> >>
> >> If we do not give the hint, they can still be written to the same
> >> erase block,
>
> I mean it's possible to write them to the same erase block. :)
>
> >> right? It will not be worse?
> >
> > If the hint is not given, I think that they could be written to
> > the same erase block, or not. But if we give the same hint, they
Re: [RFC] md: make queue limits depending on limits of RAID members
On Wed, Nov 15, 2017 at 11:25:12AM +0100, Mariusz Dabrowski wrote:
> Hi all,
>
> In order to be compliant with a pass-through drive behavior, RAID queue
> limits should be calculated in a way that minimal io, optimal io and
> discard granularity size will be met from a single drive perspective.
> Currently the MD driver is ignoring queue limits reported by members and
> all calculations are based on chunk (stripe) size.

We did use blk_stack_limits, which will take care of these.

> I am working on a patch which changes this. Since kernel 4.13, drives
> which support streams (write hints) might report discard_alignment or
> discard_granularity bigger than the array stripe (for more details view
> the NVMe 1.3 specification, chapter 9.3 Streams), so the result of the
> current calculation is not optimal for those drives. For example, for a
> 4 disk RAID 0 with chunk size smaller than discard_granularity of member
> drives, discard_granularity of the RAID array should be equal to
> 4*member_discard_granularity.
>
> The problem is that for some drives there is a risk that the result of
> this calculation exceeds the unsigned int range. The current
> implementation of function nvme_config_discard in the NVMe driver can
> also suffer the same problem:
>
>	if (ctrl->nr_streams && ns->sws && ns->sgs) {
>		unsigned int sz = logical_block_size * ns->sws * ns->sgs;
>
>		ns->queue->limits.discard_alignment = sz;
>		ns->queue->limits.discard_granularity = sz;
>	}
>
> One potential solution for that is to change the type of some queue
> limits (at least discard_granularity and discard_alignment, maybe more,
> like max_discard_sectors?) from 32 bits to 64 bits.
>
> What are your thoughts about this? Do you expect any troubles with
> changing this in the block layer? Would you be willing to do such a
> change?
> Signed-off-by: Mariusz Dabrowski
> ---
>
>  drivers/md/md.c     | 24
>  drivers/md/md.h     |  1 +
>  drivers/md/raid0.c  | 23
>  drivers/md/raid10.c | 30
>  drivers/md/raid5.c  | 25
>  5 files changed, 89 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 0ff1bbf6c90e..e0dc114cab19 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -66,6 +66,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include "md.h"
> @@ -679,6 +680,29 @@ struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr)
>  }
>  EXPORT_SYMBOL_GPL(md_find_rdev_nr_rcu);
>
> +void md_rdev_get_queue_limits(struct mddev *mddev, struct queue_limits *limits)
> +{
> +	struct md_rdev *rdev;
> +
> +	memset(limits, 0, sizeof(struct queue_limits));
> +
> +	rdev_for_each(rdev, mddev) {
> +		struct queue_limits *rdev_limits;
> +
> +		rdev_limits = &rdev->bdev->bd_queue->limits;
> +		limits->io_min = max(limits->io_min, rdev_limits->io_min);
> +		limits->io_opt = lcm_not_zero(limits->io_opt,
> +					      rdev_limits->io_opt);
> +		limits->discard_granularity =
> +			max(limits->discard_granularity,
> +			    rdev_limits->discard_granularity);
> +		limits->discard_alignment =
> +			max(limits->discard_alignment,
> +			    rdev_limits->discard_alignment);
> +	}
> +}

Isn't this exactly what blk_stack_limits does?

> +EXPORT_SYMBOL_GPL(md_rdev_get_queue_limits);
> +
>  static struct md_rdev *find_rdev(struct mddev *mddev, dev_t dev)
>  {
>  	struct md_rdev *rdev;
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index d8287d3cd1bf..5b514b9bb535 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -702,6 +702,7 @@ extern void md_reload_sb(struct mddev *mddev, int raid_disk);
>  extern void md_update_sb(struct mddev *mddev, int force);
>  extern void md_kick_rdev_from_array(struct md_rdev * rdev);
>  struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr);
> +void md_rdev_get_queue_limits(struct mddev *mddev, struct queue_limits *limits);
>
>  static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
>  {
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index 5a00fc118470..f1dcf7fd3eb1 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -382,15 +382,31 @@ static int raid0_run(struct mddev *mddev)
>  	if (mddev->queue) {
>  		struct md_rdev *rdev;
>  		bool discard_supported = false;
> +		struct queue_limits limits;
> +		unsigned int io_min, io_opt;
> +		unsigned int granularity, alignment;
> +		unsigned int chunk_size = mddev->chunk_sectors << 9;
>
> +		md_rdev_get_queue_limits(mddev, &limits);
>  		blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
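The combining rules in the patch above (max for io_min and discard limits, lcm for io_opt) can be illustrated standalone in userspace C. This is a simplified sketch, not the kernel's blk_stack_limits; `struct limits` and `stack_limits` are made-up names:

```c
#include <assert.h>

/* Greatest common divisor, used to build the least common multiple. */
static unsigned int gcd(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;
		a = b;
		b = t;
	}
	return a;
}

/* Like the kernel's lcm_not_zero(): a zero operand is ignored. */
static unsigned int lcm_not_zero(unsigned int a, unsigned int b)
{
	if (!a)
		return b;
	if (!b)
		return a;
	return (a / gcd(a, b)) * b;
}

struct limits {
	unsigned int io_min;
	unsigned int io_opt;
	unsigned int discard_granularity;
};

/* Fold one member device's limits into the array-wide limits. */
static void stack_limits(struct limits *t, const struct limits *b)
{
	t->io_min = t->io_min > b->io_min ? t->io_min : b->io_min;
	t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
	t->discard_granularity = t->discard_granularity > b->discard_granularity
		? t->discard_granularity : b->discard_granularity;
}
```

The lcm step also makes the overflow concern in the cover letter concrete: the lcm of two large 32-bit granularities can easily exceed the `unsigned int` range.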
Re: [RFC PATCH 0/2] apply write hints to select the type of segments
On 11/17, Christoph Hellwig wrote:
>
> Next time please coordinate this with the block list and Jens, who
> actually wrote the patch.

Got it.

>
> >   hints                segment type
> >   -----------------------------------
> >   WRITE_LIFE_SHORT     CURSEG_COLD_DATA
> >   WRITE_LIFE_EXTREME   CURSEG_HOT_DATA
> >   others               CURSEG_WARM_DATA
>
> Normally cold data is data with a long lifetime, and extreme is colder
> than cold, so there seems to be some mismatch here.

It was a wrong description and I fixed it so that it matches the implementation. The description below was merged:

  WRITE_LIFE_SHORT     CURSEG_HOT_DATA
  WRITE_LIFE_EXTREME   CURSEG_COLD_DATA
  others               CURSEG_WARM_DATA
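The corrected mapping above is a simple switch. Here is a userspace sketch with hypothetical stand-in enums (the real `WRITE_LIFE_*` values live in the kernel's fs.h, and `CURSEG_*` in F2FS):

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's write-hint and segment enums. */
enum write_hint { WRITE_LIFE_NONE, WRITE_LIFE_SHORT, WRITE_LIFE_MEDIUM,
		  WRITE_LIFE_LONG, WRITE_LIFE_EXTREME };
enum seg_type { CURSEG_HOT_DATA, CURSEG_WARM_DATA, CURSEG_COLD_DATA };

/* The corrected mapping: short-lived data is hot, extreme-lived is cold. */
static enum seg_type hint_to_segment(enum write_hint hint)
{
	switch (hint) {
	case WRITE_LIFE_SHORT:
		return CURSEG_HOT_DATA;
	case WRITE_LIFE_EXTREME:
		return CURSEG_COLD_DATA;
	default:
		return CURSEG_WARM_DATA;
	}
}
```

The default arm gives the merged behavior for "others": anything without a strong lifetime signal lands in the warm segment.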
Re: [RFC PATCH 0/2] apply write hints to select the type of segments
Next time please coordinate this with the block list and Jens, who actually wrote the patch.

>   hints                segment type
>   -----------------------------------
>   WRITE_LIFE_SHORT     CURSEG_COLD_DATA
>   WRITE_LIFE_EXTREME   CURSEG_HOT_DATA
>   others               CURSEG_WARM_DATA

Normally cold data is data with a long lifetime, and extreme is colder than cold, so there seems to be some mismatch here.
Re: Regression in 4.14: wrong data being read from bcache device
On Fri, Nov 17, 2017 at 7:08 AM, Christoph Hellwig wrote:
> On Thu, Nov 16, 2017 at 07:07:40PM -0800, Michael Lyle wrote:
>> I think the probable cause is this: we construct bios based on
>> previously completed bios in a few places,
>
> That is an extremely bad idea in many ways, so I think we'll need
> to fix this as the priority.

Yah, sorry, this wasn't actually the case. I was sleep deprived and hadn't been able to follow it all yet. Bart points out that there are many remaining cases of assignment of the disk where partno isn't assigned as well-- so I wonder if there are other potential problems from this.

Mike
Re: [PATCH] bio: ensure __bio_clone_fast copies bi_partno
Jens & everyone-- thanks for the speedy review and handling. I've updated my test cases to ensure that volumes from old releases work, even when I "don't think" there's been a disk format change.

Bart--

On 11/17/2017 08:50 AM, Bart Van Assche wrote:
> Have you considered to use bio_copy_dev() instead of open-coding it?

One could... Right now almost all the uses of bio_copy_dev are in bcache, and they need to change for other reasons (e.g. the macro uses a parameter more than once, and a function is passed in as a parameter). There's a whole lot of places to change if it's desired to make bio_copy_dev universally used to copy device information.

> Additionally, there is more code that copies these fields, e.g. the code in
> bio_clone_bioset(). Shouldn't that code be modified such that it also copies
> bi_partno?

Yes, when I was grepping around there were other things that looked possibly unsafe. I don't have test environments for all of these other subsystems. I wanted to get the minimal fix for this in, though, because people are actively losing data to the problem it triggers with bcache.

Mike
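The "macro uses a parameter more than once" concern mentioned above is a classic C pitfall, shown here with a hypothetical macro (not the kernel's actual bio_copy_dev): when the source argument has side effects, a multi-evaluation macro copies fields from two different objects.

```c
#include <assert.h>

struct dev_info { int disk; int partno; };

/* Hypothetical copy macro: evaluates `src` once per field. */
#define COPY_DEV_MACRO(dst, src) do {		\
	(dst)->disk = (src)->disk;		\
	(dst)->partno = (src)->partno;		\
} while (0)

/* An inline function evaluates its arguments exactly once. */
static inline void copy_dev(struct dev_info *dst, const struct dev_info *src)
{
	dst->disk = src->disk;
	dst->partno = src->partno;
}

static struct dev_info devs[2] = { { 1, 0 }, { 2, 3 } };
static int calls;

static struct dev_info *next_dev(void)
{
	return &devs[calls++];	/* side effect: advances on each call */
}
```

Passing `next_dev()` to the macro calls it twice, so `disk` comes from `devs[0]` but `partno` from `devs[1]`; the inline function has no such hazard. This is one reason a reviewer might prefer converting such helpers to functions before using them everywhere.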
Re: [PATCH] bio: ensure __bio_clone_fast copies bi_partno
On Thu, 2017-11-16 at 23:47 -0800, Michael Lyle wrote:
> diff --git a/block/bio.c b/block/bio.c
> index 101c2a9b5481..33fa6b4af312 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -597,6 +597,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
>  	 * so we don't set nor calculate new physical/hw segment counts here
>  	 */
>  	bio->bi_disk = bio_src->bi_disk;
> +	bio->bi_partno = bio_src->bi_partno;
>  	bio_set_flag(bio, BIO_CLONED);
>  	bio->bi_opf = bio_src->bi_opf;
>  	bio->bi_write_hint = bio_src->bi_write_hint;

Have you considered to use bio_copy_dev() instead of open-coding it?

Additionally, there is more code that copies these fields, e.g. the code in bio_clone_bioset(). Shouldn't that code be modified such that it also copies bi_partno?

How about the following class of assignments in drivers/md/raid1.c:

	mbio->bi_disk = (void *)conf->mirrors[i].rdev;

Should these assignments perhaps be followed by an mbio->bi_partno assignment?

How about the following class of assignments in the NVMe code:

	bio->bi_disk = disk;

Should these assignments perhaps be followed by a bio->bi_partno assignment?

Thanks,

Bart.
Re: [PATCH v2 2/3] bdi: add error handle for bdi_debug_register
On 11/17/2017 08:06 AM, weiping zhang wrote:
> On Wed, Nov 01, 2017 at 02:47:22PM +0100, Jan Kara wrote:
>> On Tue 31-10-17 18:38:24, weiping zhang wrote:
>>> In order to make error handling cleaner we call bdi_debug_register
>>> before setting the state to WB_registered, so that we can avoid calling
>>> bdi_unregister in release_bdi().
>>>
>>> Signed-off-by: weiping zhang
>>
>> Looks good to me. You can add:
>>
>> Reviewed-by: Jan Kara
>>
>>								Honza
>>
>>> ---
>>>  mm/backing-dev.c | 5 -
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>>> index b5f940ce0143..84b2dc76f140 100644
>>> --- a/mm/backing-dev.c
>>> +++ b/mm/backing-dev.c
>>> @@ -882,10 +882,13 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
>>>  	if (IS_ERR(dev))
>>>  		return PTR_ERR(dev);
>>>
>>> +	if (bdi_debug_register(bdi, dev_name(dev))) {
>>> +		device_destroy(bdi_class, dev->devt);
>>> +		return -ENOMEM;
>>> +	}
>>>  	cgwb_bdi_register(bdi);
>>>  	bdi->dev = dev;
>>>
>>> -	bdi_debug_register(bdi, dev_name(dev));
>>>  	set_bit(WB_registered, &bdi->wb.state);
>>>
>>>  	spin_lock_bh(&bdi_lock);
>>> --
>
> Hello Jens,
>
> Could you please give some comments on this cleanup series?

It looks good to me - for some reason I seem to be missing patch 2/3 locally, but I have this followup. I'll get it applied for 4.15, thanks.

-- 
Jens Axboe
Re: [PATCH] bio: ensure __bio_clone_fast copies bi_partno
On 11/17/2017 12:47 AM, Michael Lyle wrote:
> A new field was introduced in 74d46992e0d9dee7f1f376de0d56d31614c8a17a,
> bi_partno, instead of using bdev->bd_contains and encoding the partition
> information in the bi_bdev field. __bio_clone_fast was changed to copy
> the disk information, but not the partition information. At minimum,
> this regressed bcache and caused data corruption.

That's not good... Fix looks good to me, I'll queue this up for a pull today. Thanks for bisecting this, Michael.

-- 
Jens Axboe
Re: Regression in 4.14: wrong data being read from bcache device
On Thu, Nov 16, 2017 at 07:07:40PM -0800, Michael Lyle wrote:
> I think the probable cause is this: we construct bios based on
> previously completed bios in a few places,

That is an extremely bad idea in many ways, so I think we'll need to fix this as the priority.
Re: [PATCH v2 2/3] bdi: add error handle for bdi_debug_register
On Wed, Nov 01, 2017 at 02:47:22PM +0100, Jan Kara wrote:
> On Tue 31-10-17 18:38:24, weiping zhang wrote:
> > In order to make error handling cleaner, call bdi_debug_register
> > before setting the state to WB_registered, so that we can avoid
> > calling bdi_unregister in release_bdi().
> >
> > Signed-off-by: weiping zhang
>
> Looks good to me. You can add:
>
> Reviewed-by: Jan Kara
>
> Honza
>
> > ---
> >  mm/backing-dev.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index b5f940ce0143..84b2dc76f140 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -882,10 +882,13 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
> >  	if (IS_ERR(dev))
> >  		return PTR_ERR(dev);
> >
> > +	if (bdi_debug_register(bdi, dev_name(dev))) {
> > +		device_destroy(bdi_class, dev->devt);
> > +		return -ENOMEM;
> > +	}
> >  	cgwb_bdi_register(bdi);
> >  	bdi->dev = dev;
> >
> > -	bdi_debug_register(bdi, dev_name(dev));
> >  	set_bit(WB_registered, &bdi->wb.state);
> >
> >  	spin_lock_bh(&bdi_lock);
> > --

Hello Jens,

Could you please give some comments for this series cleanup.

--
Thanks
weiping
4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk
When doing cpu hotplug in a KVM guest with virtio-blk I get warnings like

[  747.652408] [ cut here ]
[  747.652410] WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 __blk_mq_run_hw_queue+0xd4/0x100
[  747.652410] Modules linked in: dm_multipath
[  747.652412] CPU: 4 PID: 2895 Comm: kworker/4:1H Tainted: G W 4.14.0+ #191
[  747.652412] Hardware name: IBM 2964 NC9 704 (KVM/Linux)
[  747.652414] Workqueue: kblockd blk_mq_run_work_fn
[  747.652414] task: 6068 task.stack: 5ea3
[  747.652415] Krnl PSW : 0704f0018000 00505864 (__blk_mq_run_hw_queue+0xd4/0x100)
[  747.652417]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
[  747.652417] Krnl GPRS: 0010 00ff 5cbec400
[  747.652418]            63709120 63709500 59fa44b0
[  747.652418]            59fa4480 6370f700 63709100
[  747.652419]            5cbec500 00970948 5ea33d80 5ea33d48
[  747.652423] Krnl Code: 00505854: ebaff0a4	lmg	%r10,%r15,160(%r15)
                          0050585a: c0f4ffe690d3	brcl	15,1d7a00
                         #00505860: a7f40001	brc	15,505862
                         >00505864: 581003b0	l	%r1,944
                          00505868: c01b001fff00	nilf	%r1,2096896
                          0050586e: a784ffdb	brc	8,505824
                          00505872: a7f40001	brc	15,505874
                          00505876: 9120218f	tm	399(%r2),32
[  747.652435] Call Trace:
[  747.652435] ([<63709600>] 0x63709600)
[  747.652436]  [<00187bcc>] process_one_work+0x264/0x4b8
[  747.652438]  [<00187e78>] worker_thread+0x58/0x4f8
[  747.652439]  [<0018ee94>] kthread+0x144/0x168
[  747.652439]  [<008f8a62>] kernel_thread_starter+0x6/0xc
[  747.652440]  [<008f8a5c>] kernel_thread_starter+0x0/0xc
[  747.652440] Last Breaking-Event-Address:
[  747.652441]  [<00505860>] __blk_mq_run_hw_queue+0xd0/0x100
[  747.652442] ---[ end trace 4a001a80379b18ba ]---
[  747.652450] [ cut here ]

This is

b7a71e66d (Jens Axboe       2017-08-01 09:28:24 -0600 1141)	 * are mapped to it.
b7a71e66d (Jens Axboe       2017-08-01 09:28:24 -0600 1142)	 */
6a83e74d2 (Bart Van Assche  2016-11-02 10:09:51 -0600 1143)	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
6a83e74d2 (Bart Van Assche  2016-11-02 10:09:51 -0600 1144)		cpu_online(hctx->next_cpu));
6a83e74d2 (Bart Van Assche  2016-11-02 10:09:51 -0600 1145)
b7a71e66d (Jens Axboe       2017-08-01 09:28:24 -0600 1146)	/*

Is this a known issue?

Christian
Re: [PATCH] bcache: recover data from backing device when read request hit clean
On 17/11/17 13:22, Coly Li wrote:
> On 17/11/2017 8:57 PM, Eddie Chapman wrote:
>> On 17/11/17 10:20, Rui Hua wrote:
>>> Hi, Stefan
>>>
>>> 2017-11-17 16:28 GMT+08:00 Stefan Priebe - Profihost AG:
>>>> I'm getting the same xfs error message under high load. Does this
>>>> patch fix it?
>>>>
>>> Did you apply the patch "bcache: only permit to recovery read error
>>> when cache device is clean"?
>>> If you did, maybe this patch can fix it. And you'd better check
>>> /sys/fs/bcache/XXX/internal/cache_read_races in your environment;
>>> meanwhile, it should not be zero when you get that error message.
>>
>> Hi all,
>>
>> I have 3 servers running a very recent 4.9 stable release, with several
>> recent bcache patches cherry-picked, including v4 of "bcache: only
>> permit to recovery read error when cache device is clean".
>>
>> In the 3 weeks since using these cherry-picks I've experienced a very
>> small number of isolated read errors in the layer above bcache, on all
>> 3 servers.
>>
>> On one of the servers, 2 out of the 6 bcache resources have a value of
>> 1 in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on these
>> same 2 bcache resources where one read error has occurred on the upper
>> layer. The other 4 bcache resources have 0 in cache_read_races and I
>> haven't had any read errors on the layers above them.
>>
>> On another server, I have 1 bcache resource out of 10 with a value of 5
>> in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on that
>> bcache resource where a read error occurred on one occasion. The other
>> 9 bcache resources have 0 in cache_read_races, and no read errors have
>> occurred on the layers above any of them.
>>
>> On the 3rd server where some read errors occurred, I cannot verify
>> whether there were positive values in cache_read_races, as I moved the
>> data from there onto other storage and shut down the bcache resources
>> where the errors occurred.
>>
>> If I can provide any other info which might help with this issue,
>> please let me know.
>
> Hi Eddie,
>
> This is very informative, thank you so much :-)
>
> Coly Li

Hi Coly,

You are most welcome.

Another interesting bit of info, though maybe it is unrelated/coincidence: on the bcache resources where the errors occurred, the underlying backing device was a raid adapter that is quite a lot slower than the (different) underlying physical storage on the other bcache resources that do not have read races. Up to now I had suspected a driver issue with this raid adapter as causing the read errors, so I started the process of gradually retiring the adapter on these servers in the last 3 weeks. Anyway, in light of this issue coming up here, I'm wondering whether this suggests that the read races are more likely to occur when the backing storage is quite slow. Or maybe not.

Eddie
Re: [PATCH] bcache: recover data from backing device when read request hit clean
On 17/11/2017 8:57 PM, Eddie Chapman wrote:
> On 17/11/17 10:20, Rui Hua wrote:
>> Hi, Stefan
>>
>> 2017-11-17 16:28 GMT+08:00 Stefan Priebe - Profihost AG:
>>> I'm getting the same xfs error message under high load. Does this
>>> patch fix it?
>>>
>> Did you apply the patch "bcache: only permit to recovery read error
>> when cache device is clean"?
>> If you did, maybe this patch can fix it. And you'd better check
>> /sys/fs/bcache/XXX/internal/cache_read_races in your environment;
>> meanwhile, it should not be zero when you get that error message.
>
> Hi all,
>
> I have 3 servers running a very recent 4.9 stable release, with several
> recent bcache patches cherry-picked, including v4 of "bcache: only
> permit to recovery read error when cache device is clean".
>
> In the 3 weeks since using these cherry-picks I've experienced a very
> small number of isolated read errors in the layer above bcache, on all
> 3 servers.
>
> On one of the servers, 2 out of the 6 bcache resources have a value of
> 1 in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on these
> same 2 bcache resources where one read error has occurred on the upper
> layer. The other 4 bcache resources have 0 in cache_read_races and I
> haven't had any read errors on the layers above them.
>
> On another server, I have 1 bcache resource out of 10 with a value of 5
> in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on that
> bcache resource where a read error occurred on one occasion. The other
> 9 bcache resources have 0 in cache_read_races, and no read errors have
> occurred on the layers above any of them.
>
> On the 3rd server where some read errors occurred, I cannot verify
> whether there were positive values in cache_read_races, as I moved the
> data from there onto other storage and shut down the bcache resources
> where the errors occurred.
>
> If I can provide any other info which might help with this issue,
> please let me know.

Hi Eddie,

This is very informative, thank you so much :-)

Coly Li
Re: [PATCH] bcache: recover data from backing device when read request hit clean
On 17/11/17 10:20, Rui Hua wrote:
> Hi, Stefan
>
> 2017-11-17 16:28 GMT+08:00 Stefan Priebe - Profihost AG:
>> I'm getting the same xfs error message under high load. Does this
>> patch fix it?
>>
> Did you apply the patch "bcache: only permit to recovery read error
> when cache device is clean"?
> If you did, maybe this patch can fix it. And you'd better check
> /sys/fs/bcache/XXX/internal/cache_read_races in your environment;
> meanwhile, it should not be zero when you get that error message.

Hi all,

I have 3 servers running a very recent 4.9 stable release, with several recent bcache patches cherry-picked, including v4 of "bcache: only permit to recovery read error when cache device is clean".

In the 3 weeks since using these cherry-picks I've experienced a very small number of isolated read errors in the layer above bcache, on all 3 servers.

On one of the servers, 2 out of the 6 bcache resources have a value of 1 in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on these same 2 bcache resources where one read error has occurred on the upper layer. The other 4 bcache resources have 0 in cache_read_races and I haven't had any read errors on the layers above them.

On another server, I have 1 bcache resource out of 10 with a value of 5 in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on that bcache resource where a read error occurred on one occasion. The other 9 bcache resources have 0 in cache_read_races, and no read errors have occurred on the layers above any of them.

On the 3rd server where some read errors occurred, I cannot verify whether there were positive values in cache_read_races, as I moved the data from there onto other storage and shut down the bcache resources where the errors occurred.

If I can provide any other info which might help with this issue, please let me know.

regards,
Eddie
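The per-cache-set check Rui suggests can be scripted in one pass over sysfs. The sketch below builds a mock of the layout in a temporary directory so it is runnable anywhere; on a real system you would iterate over /sys/fs/bcache instead, and the UUIDs and counter values here are made up for illustration:

```shell
#!/bin/sh
# Report and sum cache_read_races across all bcache cache sets.
# For illustration this builds a mock sysfs tree; on a real machine,
# set SYSROOT=/sys/fs/bcache and skip the mock setup below.
SYSROOT=$(mktemp -d)
mkdir -p "$SYSROOT/aaaa-1111/internal" "$SYSROOT/bbbb-2222/internal"
echo 0 > "$SYSROOT/aaaa-1111/internal/cache_read_races"
echo 5 > "$SYSROOT/bbbb-2222/internal/cache_read_races"

total=0
for f in "$SYSROOT"/*/internal/cache_read_races; do
    [ -e "$f" ] || continue          # glob may match nothing
    n=$(cat "$f")
    printf '%s: %s\n' "$f" "$n"
    total=$((total + n))
done
echo "total cache_read_races: $total"
```

A non-zero counter on a cache set is the signal described above: it should correlate with the read errors seen in the layer on top of that bcache device.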
Re: [PATCH] bcache: recover data from backing device when read request hit clean
On 17.11.2017 at 11:20, Rui Hua wrote:
> Hi, Stefan
>
> 2017-11-17 16:28 GMT+08:00 Stefan Priebe - Profihost AG:
>> I'm getting the same xfs error message under high load. Does this
>> patch fix it?
>>
> Did you apply the patch "bcache: only permit to recovery read error
> when cache device is clean"?
> If you did, maybe this patch can fix it. And you'd better check
> /sys/fs/bcache/XXX/internal/cache_read_races in your environment;
> meanwhile, it should not be zero when you get that error message.

Yes, this patch is applied.

Stefan
Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?
On 2017-11-16 20:30, Qu Wenruo wrote:

On 2017-11-17 00:47, Austin S. Hemmelgarn wrote:

This is at least less complicated than dm-integrity. Just a new hook for READ bios. And it can start from the easy part, like starting from dm-raid1 and other fs support.

It's less complicated for end users (in theory, but cryptsetup devs are working on that for dm-integrity), but significantly more complicated for developers. It also brings up the question of what happens when you want some other layer between the filesystem and the MD/DM RAID layer (say, running bcache or dm-cache on top of the RAID array). In the case of dm-integrity, that's not an issue because dm-integrity is entirely self-contained; it doesn't depend on other layers beyond the standard block interface.

Each layer can choose to drop the support for extra verification. If the layer is not modifying the data, it can pass it to the lower layer, just as with the integrity payload.

Which then makes things a bit more complicated in every other layer as well, in turn making things more complicated for all developers. As I mentioned in my other reply on this thread, running with dm-integrity _below_ the RAID layer instead of on top of it will provide the same net effect, and in fact provide a stronger guarantee than what you are proposing (because dm-integrity does real cryptographic integrity verification, as opposed to just checking for bit-rot).

Although with more CPU usage for each device, even when they contain the same data.

I never said it wasn't higher resource usage.

If your checksum is calculated and checked at FS level there is no added value when you spread this logic to other layers.

That's why I'm moving the checking part to a lower level, to get more value from the checksum.

dm-integrity adds basic 'check-summing' to any filesystem without the need to modify the fs itself.

Well, despite the fact that modern filesystems have already implemented their metadata csums.

The paid price is: if there is a bug between passing data from 'fs' to 'dm-integrity', it cannot be caught. The advantage of having separated 'fs' and 'block' layers is in their separation and simplicity at each level.

Totally agreed on this. But the idea here should not bring that large an impact (compared to big things like ZFS/Btrfs):

1) It only affects READ bios.

2) Every dm target can choose whether to support the hook or pass it down. There is no point in supporting it for RAID0, for example. And for complex raid like RAID5/6 there is no need to support it from the very beginning.

3) The main part of the functionality is already implemented. The core complexity has 2 parts: a) checksum calculation and checking, which modern fs is already doing, at least for metadata; and b) recovery, which dm targets already have implemented for each supported raid profile.

All of these are already implemented; just moving them to a different timing does not bring such a big modification, IIRC.

If you want an integrated solution, you are simply looking for btrfs, where multiple layers are integrated together.

If with such a verification hook (along with something extra to handle scrub), btrfs chunk mapping could be re-implemented with device-mapper. In fact the btrfs logical space is just a dm-linear device, and each chunk can be implemented by its corresponding dm-* module, like:

dm-linear: | btrfs chunk 1 | btrfs chunk 2 | ... | btrfs chunk n |

btrfs chunk 1: metadata, using dm-raid1 on disks A and B
btrfs chunk 2: data, using dm-raid0 on disks A B C D ...
btrfs chunk n: system, using dm-raid1 on disks A and B

At least btrfs could take advantage of the simplicity of separate layers. And other filesystems would get a somewhat higher chance to recover their metadata if built on dm-raid.

Again, just put dm-integrity below dm-raid.

The other filesystems primarily have metadata checksums to catch data corruption, not repair it.

Because they have no extra copy. If they had one, they would definitely use the extra copy to repair.

But they don't have those extra copies now, so that really becomes irrelevant as an argument (especially since it's not likely they will add data or metadata replication in the filesystem any time in the near future). And I severely doubt that you will manage to convince developers to add support in their filesystem (especially XFS) because:

1. It's a layering violation (yes, I know BTRFS is too, but that's a bit less of an issue because it's a completely self-contained layering violation, while this isn't).

If passing something along with a bio is violating layers, then the integrity payload has already been doing this for a long time.

The block integrity layer is also interfacing directly with hardware and _needs_ to pass that data down. Unless I'm mistaken, it also doesn't do any verification except in the filesystem layer, and doesn't pass down any complaints about the integrity of the data (it may try to re-read it, but that's not the same as what
Re: [PATCH] bio: ensure __bio_clone_fast copies bi_partno
On Thu, Nov 16, 2017 at 11:47:25PM -0800, Michael Lyle wrote:
> A new field was introduced in 74d46992e0d9dee7f1f376de0d56d31614c8a17a,
> bi_partno, instead of using bdev->bd_contains and encoding the partition
> information in the bi_bdev field. __bio_clone_fast was changed to copy
> the disk information, but not the partition information. At minimum,
> this regressed bcache and caused data corruption.
>
> Signed-off-by: Michael Lyle
> Fixes: 74d46992e0d9dee7f1f376de0d56d31614c8a17a
> Reported-by: Pavel Goran
> Reported-by: Campbell Steven
> Cc: Christoph Hellwig
> Cc: Jens Axboe
> Cc:
> ---
>  block/bio.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/block/bio.c b/block/bio.c
> index 101c2a9b5481..33fa6b4af312 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -597,6 +597,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
>  	 * so we don't set nor calculate new physical/hw segment counts here
>  	 */
>  	bio->bi_disk = bio_src->bi_disk;
> +	bio->bi_partno = bio_src->bi_partno;
>  	bio_set_flag(bio, BIO_CLONED);
>  	bio->bi_opf = bio_src->bi_opf;
>  	bio->bi_write_hint = bio_src->bi_write_hint;

Reviewed-by: Ming Lei

--
Ming
Re: [PATCH] bio: ensure __bio_clone_fast copies bi_partno
On 17/11/2017 3:47 PM, Michael Lyle wrote:
> A new field was introduced in 74d46992e0d9dee7f1f376de0d56d31614c8a17a,
> bi_partno, instead of using bdev->bd_contains and encoding the partition
> information in the bi_bdev field. __bio_clone_fast was changed to copy
> the disk information, but not the partition information. At minimum,
> this regressed bcache and caused data corruption.
>

Hi Michael,

Thanks for the fix, it looks good to me.

> Signed-off-by: Michael Lyle
> Fixes: 74d46992e0d9dee7f1f376de0d56d31614c8a17a
> Reported-by: Pavel Goran
> Reported-by: Campbell Steven

Reviewed-by: Coly Li

Coly Li

> Cc: Christoph Hellwig
> Cc: Jens Axboe
> Cc:
> ---
>  block/bio.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/block/bio.c b/block/bio.c
> index 101c2a9b5481..33fa6b4af312 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -597,6 +597,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
>  	 * so we don't set nor calculate new physical/hw segment counts here
>  	 */
>  	bio->bi_disk = bio_src->bi_disk;
>  	bio->bi_partno = bio_src->bi_partno;
>  	bio_set_flag(bio, BIO_CLONED);
>  	bio->bi_opf = bio_src->bi_opf;
>  	bio->bi_write_hint = bio_src->bi_write_hint;
>