Re: [PATCH] bcache: recover data from backing device when read request hit clean

2017-11-17 Thread Michael Lyle
Rui Hua--

Thanks for the clarification.  You're correct, it doesn't get called,
and it goes to read_complete.  However, read_error should probably get
called.  How would you suggest handling this-- should we initially set
read_dirty_data true and clear it if it is all clean?  (and change the
condition to properly go into the error path).  As it is, the patch
makes things no worse but the error path is very unsafe.

Mike
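The suggestion above — initially set read_dirty_data true and clear it only once the metadata is known to be all clean — can be sketched as a conservative-default flag. This is a hypothetical, simplified model of the proposal, not the actual bcache code; the struct and function names are stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the proposal: assume dirty until the
 * cache metadata has been read successfully and shows only clean data. */
struct search_state {
    bool read_dirty_data;   /* conservatively true until proven clean */
    bool metadata_ok;       /* cache-device metadata read succeeded   */
    bool saw_dirty;         /* metadata says some data was dirty      */
};

static void begin_request(struct search_state *s)
{
    s->read_dirty_data = true;   /* conservative default */
    s->metadata_ok = false;
    s->saw_dirty = false;
}

static void metadata_read_done(struct search_state *s, bool ok, bool dirty)
{
    s->metadata_ok = ok;
    s->saw_dirty = dirty;
    if (ok && !dirty)
        s->read_dirty_data = false;  /* proven all-clean: recovery is safe */
}

/* Recovery from the backing device is only safe when we know the data
 * was clean; a request whose metadata was unreadable stays out of this
 * path, which is the safety property being discussed. */
static bool may_recover_from_backing(const struct search_state *s)
{
    return !s->read_dirty_data;
}
```

With this ordering, a failed metadata read leaves the flag true and the error path remains conservative.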

On Thu, Nov 16, 2017 at 8:25 PM, Rui Hua  wrote:
> Hi, Michael--
>
> Thanks for the quick reply.
>
> If the metadata on the cache device was NOT correctly read,
> cached_dev_read_error() will not be called, so the recovery check will
> not be executed, and this patch is safe.
> s->iop.error is not set when we fail to read metadata on the cache
> device, so cached_dev_bio_complete() will be called instead of
> cached_dev_read_error()
>
> 2017-11-17 12:02 GMT+08:00 Michael Lyle :
>> Hi, Rui Hua--
>>
>> On 11/16/2017 07:51 PM, Rui Hua wrote:
>>> In this patch, we use s->read_dirty_data to judge whether the read
>>> request hit dirty data in the cache device; it is safe to reread data from
>>> the backing device when the read request hit clean data. This can not
>>> only handle the cache read race, but also recover data when a read
>>> request from the cache device fails.
>>
>> I haven't read over all of this yet, to understand the read race.  But
>> this change can't be safe-- read_dirty_data only is set to true if the
>> metadata on the cache device is correctly read.  If that fails, the flag
>> may not be set and we could read stale data on the backing device.
>>
>> Mike


Re: [GIT PULL] Followup merge window block fixes/changes

2017-11-17 Thread Jens Axboe
On 11/17/2017 12:35 PM, Linus Torvalds wrote:
> On Fri, Nov 17, 2017 at 11:29 AM, Linus Torvalds
>  wrote:
>>
>> "F*ck no, that code is too ugly, you need to write it so that it can
>> be maintained".
> 
> Dammit, the rest of the pull looks ok, so I'll take it anyway.
> 
> But I really do expect you to
> 
>  (a) clean up that mess - maybe with my patch as an example, but maybe
> some entirely different way. The patch I sent is already deleted from
> my tree, it was purely meant as an example.
> 
>  (b) really push back on people when they send you ugly code
> 
> It shouldn't have to always be me being upset and pushing back on the
> block pulls.

My bad, I have probably given Paolo and Luca a bit more leeway than
usual, since they are new to upstream development and the BFQ code does
need some cleanups here and there. A huge part of that was completed as
part of getting it merged, but it still needs a bit of love.

Now that they have received the Linus hazing, I guess the honeymoon is
over.

In all seriousness, I do push back on crap quite often. Sometimes I miss
some too...

-- 
Jens Axboe



Re: [GIT PULL] Followup merge window block fixes/changes

2017-11-17 Thread Paolo Valente

> On 17 Nov 2017, at 20:29, Linus Torvalds wrote:
> 
> On Fri, Nov 17, 2017 at 8:51 AM, Jens Axboe  wrote:
>> 
>> - Small BFQ updates series from Luca and Paolo.
> 
> Honestly, this code is too ugly to live.
> 
> Seriously. That update should have been rejected on the grounds of
> being completely unmaintainable crap.
> 
> Why didn't you?
> 
> Why are you allowing code like that patch to bfq_dispatch_request()
> into the kernel source tree?
> 
> Yes, it improves performance. But it does so by adding random
> #ifdefs into the middle of some pretty critical code, with
> absolutely no attempt at having a sane maintainable code-base.
> 
> That function is literally *three* lines of code without the #ifdef case:
> 
>spin_lock_irq(&bfqd->lock);
>rq = __bfq_dispatch_request(hctx);
>spin_unlock_irq(&bfqd->lock);
> 
> and you could actually see what it did.
> 
> But instead of trying to abstract it in some legible manner, that
> three-line obvious function got *three* copies of the same #if mess
> all enclosing random crap, and the end result is really hard to read.
> 
> For example, I just spent a couple of minutes trying to make sense of
> the code and clean up that unreadable mess.  I came up with the appended
> patch.
> 
> It may not work for some reason I can't see, but that's not the point.
> The attached patch is not meant as a "apply this as-is".
> 
> It's meant as a "look, you can write legible code where the statistics
> just go away when they are compiled out".
> 
> Now that "obvious three-liner function" is not three lines any more,
> but it's at least *clear* straight-line code:
> 
>spin_lock_irq(&bfqd->lock);
> 
>in_serv_queue = bfqd->in_service_queue;
>waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
> 
>rq = __bfq_dispatch_request(hctx);
> 
>idle_timer_disabled =
>waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);
> 
>spin_unlock_irq(&bfqd->lock);
> 
>bfq_dispatch_statistics(hctx, rq, in_serv_queue, waiting_rq,
> idle_timer_disabled);
> 
> and I am hopeful that when bfq_dispatch_statistics() is disabled, the
> compiler will be smart enough to not generate extra code, certainly
> not noticeably so.
> 

Sorry for causing this problem.  Yours was our first version, but then
we feared that leaving useless instructions was worse than adding a
burst of ifdefs. I'll try not to repeat this mistake.

Thanks,
Paolo

> See? One is a mess of horrible ugly #ifdef's in the middle of the code
> that makes the end result completely illegible.
> 
> The other is actually somewhat legible, and has a clear separation of
> the statistics gathering. And that *statistics* gathering is clearly
> optionally enabled or not.
> 
> Not the mess of random code put together in random ways that are
> completely illegible.
> 
> Your job as maintainer is _literally_ to tell people "f*ck no, that
> code is too ugly, you need to write it so that it can be maintained".
> 
> And I'm doing my job.
> 
> "F*ck no, that code is too ugly, you need to write it so that it can
> be maintained".
> 
> Show some taste.
> 
>Linus
> 



Re: [GIT PULL] Followup merge window block fixes/changes

2017-11-17 Thread Linus Torvalds
On Fri, Nov 17, 2017 at 12:18 PM, Paolo Valente
 wrote:
>
> Sorry for causing this problem.  Yours was our first version, but then
> we feared that leaving useless instructions was worse than adding a
> burst of ifdefs. I'll try not to repeat this mistake.

So I generally do not want people to depend on the compiler to do
non-obvious conversions (so please do not write bad code that you just
hope the compiler can sort out and make reasonable), but we do heavily
rely on the compiler doing the obvious dead code removal particularly
for when things end up not being used.

A lot of the time that's very much something you can and should depend
on, because we have abstraction layers that involve a lot of code
just becoming no-ops on many architectures, while other architectures
need some particular sequence.

Another example is a lot of helper functions to do verification of
various kernel rules that then are enabled by some debug option (eg
CONFIG_DEBUG_ATOMIC_SLEEP etc), where we know that (and depend on)
the compiler will do a reasonable job.

That's not so different from having "statistics gathering function
that compiles to nothing".

And yes, sometimes the code might need to be written so that the
compiler has an easier time _seeing_ that the code really goes away,
so it's not necessarily a blind trust in the compiler.

So for example, the functions that can go away should obviously be
inline functions so that you don't end up having the compiler generate
the arguments and the call to an empty function body, and so that the
compiler can see that "oh, those arguments aren't even used at all, so
now I can remove all the code that generates them".

If you are comfy with assembler, it can actually be a great thing to
occasionally just verify the actual generated code if it's some piece
of code you care deeply about. Obviously for performance work,
profiling happens, and then you see it that way.

Because sometimes you find some silly "Duh!" moments, where you didn't
realize that the compiler wasn't able to really get rid of something
for some reason that is usually very obvious in hindsight, but wasn't
actually obvious at all until you looked at what the compiler
generated.

   Linus
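The pattern described above — a real function under the config option, an empty static inline stub otherwise, so calls and their argument computations compile away — can be sketched in a few lines. This is a minimal standalone illustration; FOO_STATS is a hypothetical compile-time switch standing in for something like CONFIG_DEBUG_BLK_CGROUP, and none of these names are from the kernel.

```c
#include <assert.h>

/* Hypothetical config switch; flip it at compile time with -DFOO_STATS=0. */
#ifndef FOO_STATS
#define FOO_STATS 1
#endif

static int stats_events;

#if FOO_STATS
static void foo_update_stats(int queue_was_waiting, int timer_disabled)
{
    /* The real bookkeeping lives here, out of the fast path. */
    stats_events += queue_was_waiting + timer_disabled;
}
#else
/* Empty inline stub: the call compiles away, and a reasonable compiler
 * also drops the argument computations once it sees they are unused. */
static inline void foo_update_stats(int queue_was_waiting, int timer_disabled)
{
    (void)queue_was_waiting;
    (void)timer_disabled;
}
#endif

static int dispatch_one(void)
{
    int waiting = 1;     /* stand-ins for the state sampled under  */
    int timer_off = 1;   /* the lock in the real dispatch code     */
    int rq = 42;         /* "dispatched request"                   */

    foo_update_stats(waiting, timer_off);
    return rq;
}
```

The fast path stays straight-line code either way; only the stub/real-function pair changes with the option.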


Re: [PATCH] bio: ensure __bio_clone_fast copies bi_partno

2017-11-17 Thread Campbell Steven
On 17 November 2017 at 20:47, Michael Lyle  wrote:
> A new field was introduced in 74d46992e0d9dee7f1f376de0d56d31614c8a17a,
> bi_partno, instead of using bdev->bd_contains and encoding the partition
> information in the bi_bdev field.  __bio_clone_fast was changed to copy
> the disk information, but not the partition information.  At minimum,
> this regressed bcache and caused data corruption.

Thanks a lot for this, Michael. I will run it up on our test servers today.

Campbell


Re: [GIT PULL] Followup merge window block fixes/changes

2017-11-17 Thread Linus Torvalds
On Fri, Nov 17, 2017 at 11:29 AM, Linus Torvalds
 wrote:
>
> "F*ck no, that code is too ugly, you need to write it so that it can
> be maintained".

Dammit, the rest of the pull looks ok, so I'll take it anyway.

But I really do expect you to

 (a) clean up that mess - maybe with my patch as an example, but maybe
some entirely different way. The patch I sent is already deleted from
my tree, it was purely meant as an example.

 (b) really push back on people when they send you ugly code

It shouldn't have to always be me being upset and pushing back on the
block pulls.

  Linus


Re: [GIT PULL] Followup merge window block fixes/changes

2017-11-17 Thread Linus Torvalds
On Fri, Nov 17, 2017 at 8:51 AM, Jens Axboe  wrote:
>
> - Small BFQ updates series from Luca and Paolo.

Honestly, this code is too ugly to live.

Seriously. That update should have been rejected on the grounds of
being completely unmaintainable crap.

Why didn't you?

Why are you allowing code like that patch to bfq_dispatch_request()
into the kernel source tree?

Yes, it improves performance. But it does so by adding random
#ifdefs into the middle of some pretty critical code, with
absolutely no attempt at having a sane maintainable code-base.

That function is literally *three* lines of code without the #ifdef case:

spin_lock_irq(&bfqd->lock);
rq = __bfq_dispatch_request(hctx);
spin_unlock_irq(&bfqd->lock);

and you could actually see what it did.

But instead of trying to abstract it in some legible manner, that
three-line obvious function got *three* copies of the same #if mess
all enclosing random crap, and the end result is really hard to read.

For example, I just spent a couple of minutes trying to make sense of
the code and clean up that unreadable mess.  I came up with the appended
patch.

It may not work for some reason I can't see, but that's not the point.
The attached patch is not meant as a "apply this as-is".

It's meant as a "look, you can write legible code where the statistics
just go away when they are compiled out".

Now that "obvious three-liner function" is not three lines any more,
but it's at least *clear* straight-line code:

spin_lock_irq(&bfqd->lock);

in_serv_queue = bfqd->in_service_queue;
waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);

rq = __bfq_dispatch_request(hctx);

idle_timer_disabled =
waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);

spin_unlock_irq(&bfqd->lock);

bfq_dispatch_statistics(hctx, rq, in_serv_queue, waiting_rq,
idle_timer_disabled);

and I am hopeful that when bfq_dispatch_statistics() is disabled, the
compiler will be smart enough to not generate extra code, certainly
not noticeably so.

See? One is a mess of horrible ugly #ifdef's in the middle of the code
that makes the end result completely illegible.

The other is actually somewhat legible, and has a clear separation of
the statistics gathering. And that *statistics* gathering is clearly
optionally enabled or not.

Not the mess of random code put together in random ways that are
completely illegible.

Your job as maintainer is _literally_ to tell people "f*ck no, that
code is too ugly, you need to write it so that it can be maintained".

And I'm doing my job.

"F*ck no, that code is too ugly, you need to write it so that it can
be maintained".

Show some taste.

Linus
 block/bfq-iosched.c | 55 -
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index bcb6d21baf12..47d88d03dadd 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3689,35 +3689,14 @@ static struct request *__bfq_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
return rq;
 }
 
-static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
-{
-   struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
-   struct request *rq;
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-   struct bfq_queue *in_serv_queue, *bfqq;
-   bool waiting_rq, idle_timer_disabled;
-#endif
-
-   spin_lock_irq(&bfqd->lock);
-
 #if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-   in_serv_queue = bfqd->in_service_queue;
-   waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);
-
-   rq = __bfq_dispatch_request(hctx);
-
-   idle_timer_disabled =
-   waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);
-
-#else
-   rq = __bfq_dispatch_request(hctx);
-#endif
-   spin_unlock_irq(&bfqd->lock);
+static void bfq_dispatch_statistics(struct blk_mq_hw_ctx *hctx, struct request 
*rq,
+   struct bfq_queue *in_serv_queue, bool waiting_rq, bool 
idle_timer_disabled)
+{
+   struct bfq_queue *bfqq = rq ? RQ_BFQQ(rq) : NULL;
 
-#if defined(CONFIG_BFQ_GROUP_IOSCHED) && defined(CONFIG_DEBUG_BLK_CGROUP)
-   bfqq = rq ? RQ_BFQQ(rq) : NULL;
if (!idle_timer_disabled && !bfqq)
-   return rq;
+   return;
 
/*
 * rq and bfqq are guaranteed to exist until this function
@@ -3752,8 +3731,32 @@ static struct request *bfq_dispatch_request(struct 
blk_mq_hw_ctx *hctx)
bfqg_stats_update_io_remove(bfqg, rq->cmd_flags);
}
spin_unlock_irq(hctx->queue->queue_lock);
+}
+#else
+static inline void bfq_dispatch_statistics(struct blk_mq_hw_ctx *hctx, struct 
request *rq, struct bfq_queue *in_serv_queue, bool waiting_rq, bool 
idle_timer_disabled) {}
 #endif
 
+static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+   

Re: [RFC PATCH] blk-throttle: add burst allowance.

2017-11-17 Thread Shaohua Li
On Thu, Nov 16, 2017 at 08:25:58PM -0800, Khazhismel Kumykov wrote:
> On Thu, Nov 16, 2017 at 8:50 AM, Shaohua Li  wrote:
> > On Tue, Nov 14, 2017 at 03:10:22PM -0800, Khazhismel Kumykov wrote:
> >> Allows configuring additional bytes or ios before a throttle is
> >> triggered.
> >>
> >> This allows implementation of a bucket style rate-limit/throttle on a
> >> block device. Previously, bursting to a device was limited to allowance
> >> granted in a single throtl_slice (similar to a bucket with limit N and
> >> refill rate N/slice).
> >>
> >> Additional parameters bytes/io_burst_conf defined for tg, which define a
> >> number of bytes/ios that must be depleted before throttling happens. A
> >> tg that does not deplete this allowance functions as though it has no
> >> configured limits. tgs earn additional allowance at rate defined by
> >> bps/iops for the tg. Once a tg has *_disp > *_burst_conf, throttling
> >> kicks in. If a tg is idle for a while, it will again have some burst
> >> allowance before it gets throttled again.
> >>
> >> slice_end for a tg is extended until io_disp/byte_disp would fall to 0,
> >> when all "used" burst allowance would be earned back. trim_slice still
> >> does progress slice_start as before and decrements *_disp as before, and
> >> tgs continue to get bytes/ios in throtl_slice intervals.
> >
> > Can you describe why we need this? It would be great if you can describe the
> > usage model and an example. Does this work for io.low/io.max or both?
> >
> > Thanks,
> > Shaohua
> >
> 
> Use case that brought this up was configuring limits for a remote
> shared device. Bursting beyond io.max is desired but only for so much
> before the limit kicks in, afterwards with sustained usage throughput
> is capped. (This proactively avoids remote-side limits). In that case
> one would configure in a root container io.max + io.burst, and
> configure low/other limits on descendants sharing the resource on the
> same node.
> 
> With this patch, so long as tg has not dispatched more than the burst,
> no limit is applied at all by that tg, including limit imposed by
> io.low in tg_iops_limit, etc.

I'd appreciate it if you could give more details about the 'why'. 'Configuring
limits for a remote shared device' doesn't justify the change.
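The semantics described in the patch — earn allowance at the bps/iops rate, throttle only once dispatched bytes/ios exceed the configured burst, and have idle time earn the burst back — are essentially a token bucket. The sketch below models that behavior with hypothetical names; it mirrors the description in the commit message, not the actual blk-throttle fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical token bucket: capacity = configured burst, refilled at
 * `rate` tokens per second (standing in for bps or iops). */
struct bucket {
    uint64_t rate;      /* tokens earned per second          */
    uint64_t burst;     /* maximum stored allowance          */
    uint64_t tokens;    /* current allowance                 */
    uint64_t last;      /* timestamp of last update, seconds */
};

static void bucket_init(struct bucket *b, uint64_t rate, uint64_t burst)
{
    b->rate = rate;
    b->burst = burst;
    b->tokens = burst;  /* an idle group starts with its full burst */
    b->last = 0;
}

static bool bucket_allow(struct bucket *b, uint64_t cost, uint64_t now)
{
    uint64_t earned = (now - b->last) * b->rate;

    b->last = now;
    b->tokens = (b->tokens + earned > b->burst) ? b->burst
                                                : b->tokens + earned;
    if (b->tokens < cost)
        return false;   /* allowance depleted: throttling kicks in */
    b->tokens -= cost;
    return true;
}
```

A group that never exhausts the burst is effectively unthrottled; sustained usage above the rate drains the bucket and caps throughput, which matches the use case described above.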


Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?

2017-11-17 Thread Andreas Dilger
On Nov 15, 2017, at 7:18 PM, Qu Wenruo  wrote:
> 
> [Background]
> Recently I'm considering the possibility to use checksum from filesystem
> to enhance device-mapper raid.
> 
> The idea behind it is quite simple, since most modern filesystems have
> checksum for their metadata, and even some (btrfs) have checksum for data.
> 
> And for btrfs RAID1/10 (just ignore RAID5/6 for now), at read time
> it can use the checksum to determine which copy is correct, so it can
> return the correct data even if one copy gets corrupted.
> 
> [Objective]
> The final objective is to allow device mapper to do the checksum
> verification (and repair if possible).
> 
> If only for verification, it's not much different from current endio
> hook method used by most of the fs.
> However, if we can move the repair part out of the filesystem (well, only
> btrfs supports it yet), it would benefit all filesystems.

I recall Darrick was looking into a mechanism to do this.  Rather than
changing the whole block layer to take a callback to do a checksum, what
we looked at was to allow the upper-layer read to specify a "retry count"
to the lower-layer block device.  If the lower layer is able to retry the
read then it will read a different device (or combination of devices for
e.g. RAID-6) based on the retry count, until the upper layer gets a good
read (based on checksum, or whatever).  If there are no more devices (or
combinations) to try then a final error is returned.

Darrick can probably point at the original thread/patch.

Cheers, Andreas
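The retry-count mechanism Andreas describes can be sketched as a loop: the upper layer bumps an attempt number, the lower layer maps it to a different copy (or device combination), and reading stops when a copy verifies or the combinations are exhausted. This is a toy model under assumed names, not Darrick's actual patch; the "checksum" and "devices" are simulated.

```c
#include <assert.h>
#include <stdbool.h>

#define NCOPIES 2
#define BAD (-1)

/* Hypothetical model: each "device" holds one copy of the block, and the
 * retry count simply selects which copy the lower layer reads. */
static int copies[NCOPIES] = { BAD, 7 };   /* copy 0 corrupted, copy 1 good */

static bool checksum_ok(int data)          /* stand-in for a real checksum */
{
    return data != BAD;
}

static int read_from_copy(int retry)       /* "lower layer" read */
{
    return copies[retry % NCOPIES];
}

/* Upper layer: bump the retry count until a copy verifies, or give up
 * once there are no more copies/combinations to try. */
static int read_verified(int *out)
{
    for (int retry = 0; retry < NCOPIES; retry++) {
        int data = read_from_copy(retry);
        if (checksum_ok(data)) {
            *out = data;
            return 0;
        }
    }
    return -1;                             /* final error to the caller */
}
```

The appeal of this split is that the block layer never needs a checksum callback: verification stays in the upper layer, and the lower layer only needs to interpret the retry count.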









Re: [RFC PATCH 0/2] apply write hints to select the type of segments

2017-11-17 Thread Jaegeuk Kim
...
> >>> From: Hyunchul Lee 
> >>>
> >>> Using write hints[1], applications can inform the lifetime of the data
> >>> written to devices. And this[2] reported that the write hints patch
> >>> decreased writes in NAND by 25%.
> >>> decreased writes in NAND by 25%.
> >>>
> >>> These hints help F2FS to determine the following:
> >>>   1) the segment types where the data will be written.
> >>>   2) the hints that will be passed down to devices with the 
> >>> data of segments.
> >>>
> >>> This patch set implements the first mapping from write hints 
> >>> to segment types
> >>> as shown below.
> >>>
> >>>   hints                 segment type
> >>>   -----                 ------------
> >>>   WRITE_LIFE_SHORT      CURSEG_COLD_DATA
> >>>   WRITE_LIFE_EXTREME    CURSEG_HOT_DATA
> >>>   others                CURSEG_WARM_DATA
> >>>
> >>> The F2FS policy for hot/cold separation has precedence over these
> >>> hints, and hints are not applied for in-place updates.
> >>
> >> Could we change it to disable IPU if a file/inode write hint
> >> exists?
> >>
> >
> > I am afraid that this has side effects. For example, this could
> > cause out-of-place updates even when there are not enough free
> > segments. I can write a patch that handles these situations, but I
> > wonder whether this is required, and I am not sure which IPU
> > policies can be disabled.
> 
>  Oh, as I replied in another thread, I think IPU just affects
>  filesystem hot/cold separation rather than this feature. So I think
>  it will be okay not to consider it.
> 
> >
> >>>
> >>> Before the second mapping is implemented, write hints are not
> >>> passed down to devices, because it is better that the data of a
> >>> segment all have the same hint.
> >>>
> >>> [1]: c75b1d9421f80f4143e389d2d50ddfc8a28c8c35
> >>> [2]: https://lwn.net/Articles/726477/
> >>
> >> Could you write a patch to support passing the write hint to the
> >> block layer for buffered writes, as in the below commit:
> >> 0127251c45ae ("ext4: add support for passing in write hints for
> >> buffered writes")
> >>
> >
> > Sure I will. I wrote it already ;)
> 
>  Cool, ;)
> 
> > I think that data from the same segment should be passed down with
> > the same hint, and the following mapping is reasonable. I wonder
> > what your opinion of it is.
> >
> >   segment type       hints
> >   ------------       -----
> >   CURSEG_COLD_DATA   WRITE_LIFE_EXTREME
> >   CURSEG_HOT_DATA    WRITE_LIFE_SHORT
> >   CURSEG_COLD_NODE   WRITE_LIFE_NORMAL
> 
>  We have WRITE_LIFE_LONG defined rather than WRITE_LIFE_NORMAL in 
>  fs.h?
> 
> >   CURSEG_HOT_NODE    WRITE_LIFE_MEDIUM
> 
>  As I know, in the cell phone scenario, data of meta_inode is the
>  hottest, then hot data, then warm node, and cold node should be the
>  coldest. So I suggested we can define them as below:
> 
>  META_DATA              WRITE_LIFE_SHORT
>  HOT_DATA & WARM_NODE   WRITE_LIFE_MEDIUM
>  HOT_NODE & WARM_DATA   WRITE_LIFE_LONG
>  COLD_NODE & COLD_DATA  WRITE_LIFE_EXTREME
> 
> >>>
> >>> I agree, but I am not sure that assigning the same hint to a node
> >>> and a data segment is good, because NVMe is likely to write them
> >>> to the same erase block if they have the same hint.
> >>
> >> If we do not give the hint, they can still be written to the same 
> >> erase block,
> 
>  I mean it's possible to write them to the same erase block. :)
> 
> >> right? it will not be worse?
> >>
> >
> > If the hint is not given, I think that they could be written to 
> > the same erase block, or not. But if we give the same hint, they 
> 

Re: [RFC] md: make queue limits depending on limits of RAID members

2017-11-17 Thread Shaohua Li
On Wed, Nov 15, 2017 at 11:25:12AM +0100, Mariusz Dabrowski wrote:
> Hi all,
> 
> In order to be compliant with pass-through drive behavior, RAID queue
> limits should be calculated in a way that minimal io, optimal io and
> discard granularity size will be met from a single drive perspective.
> Currently the MD driver is ignoring queue limits reported by members and all
> calculations are based on chunk (stripe) size.

We do use blk_stack_limits, which takes care of these.

> I am working on a patch which
> changes this. Since kernel 4.13, drives which support streams (write hints)
> might report discard_alignment or discard_granularity bigger than the array
> stripe (for more details see the NVMe 1.3 specification, chapter 9.3 Streams),
> so the result of the current calculation is not optimal for those drives. For
> example, for a 4-disk RAID 0 with chunk size smaller than the
> discard_granularity of member drives, the discard_granularity of the RAID
> array should be equal to 4*member_discard_granularity.
> 
> The problem is that for some drives there is a risk that the result of this
> calculation exceeds the unsigned int range. The current implementation of
> nvme_config_discard in the NVMe driver can also suffer from the same problem:
> 
>   if (ctrl->nr_streams && ns->sws && ns->sgs) {
>   unsigned int sz = logical_block_size * ns->sws * ns->sgs;
> 
>   ns->queue->limits.discard_alignment = sz;
>   ns->queue->limits.discard_granularity = sz;
>   }
> 
> One potential solution for that is to change type of some queue limits
> (at least discard_granularity and discard_alignment, maybe more, like
> max_discard_sectors?) from 32 bits to 64 bits.
> 
> What are your thoughts about this? Do you expect any trouble with
> changing this in the block layer? Would you be willing to do such a change?
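The overflow risk raised above is concrete: the queue-limit fields are 32-bit, so stacking, say, four members that each report a multi-hundred-MB discard granularity can wrap. A sketch of doing the multiplication in a 64-bit temporary and detecting the wrap (a hypothetical helper, not the proposed patch):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: compute an array-level discard granularity from a
 * member's value in a 64-bit temporary and detect 32-bit wraparound. */
static int stacked_discard_granularity(uint32_t member_gran,
                                       uint32_t data_disks,
                                       uint32_t *out)
{
    uint64_t g = (uint64_t)member_gran * data_disks;

    if (g > UINT32_MAX)
        return -1;          /* would overflow a 32-bit queue limit */
    *out = (uint32_t)g;
    return 0;
}
```

Widening the queue-limit fields to 64 bits, as the mail proposes, would make such clamping unnecessary; until then, stacking code has to either detect the wrap or saturate.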
> 
> Signed-off-by: Mariusz Dabrowski 
> ---
> 
>  drivers/md/md.c | 24 
>  drivers/md/md.h |  1 +
>  drivers/md/raid0.c  | 23 ---
>  drivers/md/raid10.c | 30 +-
>  drivers/md/raid5.c  | 25 +++--
>  5 files changed, 89 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 0ff1bbf6c90e..e0dc114cab19 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -66,6 +66,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include "md.h"
> @@ -679,6 +680,29 @@ struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, 
> int nr)
>  }
>  EXPORT_SYMBOL_GPL(md_find_rdev_nr_rcu);
>  
> +void md_rdev_get_queue_limits(struct mddev *mddev, struct queue_limits 
> *limits)
> +{
> + struct md_rdev *rdev;
> +
> + memset(limits, 0, sizeof(struct queue_limits));
> +
> + rdev_for_each(rdev, mddev) {
> + struct queue_limits *rdev_limits;
> +
> + rdev_limits = &rdev->bdev->bd_queue->limits;
> + limits->io_min = max(limits->io_min, rdev_limits->io_min);
> + limits->io_opt = lcm_not_zero(limits->io_opt,
> +   rdev_limits->io_opt);
> + limits->discard_granularity =
> + max(limits->discard_granularity,
> + rdev_limits->discard_granularity);
> + limits->discard_alignment =
> + max(limits->discard_alignment,
> + rdev_limits->discard_alignment);
> + }
> +}

isn't this exactly what blk_stack_limits does?

> +EXPORT_SYMBOL_GPL(md_rdev_get_queue_limits);
> +
>  static struct md_rdev *find_rdev(struct mddev *mddev, dev_t dev)
>  {
>   struct md_rdev *rdev;
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index d8287d3cd1bf..5b514b9bb535 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -702,6 +702,7 @@ extern void md_reload_sb(struct mddev *mddev, int 
> raid_disk);
>  extern void md_update_sb(struct mddev *mddev, int force);
>  extern void md_kick_rdev_from_array(struct md_rdev * rdev);
>  struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr);
> +void md_rdev_get_queue_limits(struct mddev *mddev, struct queue_limits 
> *limits);
>  
>  static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev 
> *mddev)
>  {
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index 5a00fc118470..f1dcf7fd3eb1 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -382,15 +382,31 @@ static int raid0_run(struct mddev *mddev)
>   if (mddev->queue) {
>   struct md_rdev *rdev;
>   bool discard_supported = false;
> + struct queue_limits limits;
> + unsigned int io_min, io_opt;
> + unsigned int granularity, alignment;
> + unsigned int chunk_size = mddev->chunk_sectors << 9;
>  
> + md_rdev_get_queue_limits(mddev, &limits);
>   blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
>   

Re: [RFC PATCH 0/2] apply write hints to select the type of segments

2017-11-17 Thread Jaegeuk Kim
On 11/17, Christoph Hellwig wrote:
> 
> Next time please coordinate this with the block list and Jens, who
> actually wrote the patch.

Got it.

> 
> >   hints                 segment type
> >   -----                 ------------
> >   WRITE_LIFE_SHORT      CURSEG_COLD_DATA
> >   WRITE_LIFE_EXTREME    CURSEG_HOT_DATA
> >   others                CURSEG_WARM_DATA
> 
> Normally cold data is data with a long lifetime, and extreme is colder
> than cold, so there seems to be some mismatch here.

It was a wrong description, and I fixed it to match the implementation.

The below description was merged:

WRITE_LIFE_SHORT      CURSEG_HOT_DATA
WRITE_LIFE_EXTREME    CURSEG_COLD_DATA
others                CURSEG_WARM_DATA
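The corrected mapping is small enough to express as a function. The enums below are local stand-ins mirroring the kernel's RWH_WRITE_LIFE_* hints and the F2FS CURSEG_* segment types, not the kernel's actual definitions:

```c
#include <assert.h>

/* Stand-in enums for the write-hint and segment-type mapping above. */
enum rw_hint { HINT_NOT_SET, HINT_SHORT, HINT_MEDIUM, HINT_LONG, HINT_EXTREME };
enum seg_type { SEG_HOT_DATA, SEG_WARM_DATA, SEG_COLD_DATA };

static enum seg_type seg_type_for_hint(enum rw_hint hint)
{
    switch (hint) {
    case HINT_SHORT:
        return SEG_HOT_DATA;    /* short-lived data: hot segment     */
    case HINT_EXTREME:
        return SEG_COLD_DATA;   /* longest-lived data: cold segment  */
    default:
        return SEG_WARM_DATA;   /* everything else stays warm        */
    }
}
```

This matches Christoph's point: cold means long-lived, so WRITE_LIFE_EXTREME maps to the cold segment and WRITE_LIFE_SHORT to the hot one.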


Re: [RFC PATCH 0/2] apply write hints to select the type of segments

2017-11-17 Thread Christoph Hellwig

Next time please coordinate this with the block list and Jens, who
actually wrote the patch.

>   hints                 segment type
>   -----                 ------------
>   WRITE_LIFE_SHORT      CURSEG_COLD_DATA
>   WRITE_LIFE_EXTREME    CURSEG_HOT_DATA
>   others                CURSEG_WARM_DATA

Normally cold data is data with a long lifetime, and extreme is colder
than cold, so there seems to be some mismatch here.


Re: Regression in 4.14: wrong data being read from bcache device

2017-11-17 Thread Michael Lyle
On Fri, Nov 17, 2017 at 7:08 AM, Christoph Hellwig  wrote:
> On Thu, Nov 16, 2017 at 07:07:40PM -0800, Michael Lyle wrote:
>> I think the probable cause is this: we construct bios based on
>> previously completed bios in a few places,
>
> That is an extremely bad idea in many ways, so I think we'll need
> to fix this as the priority.

Yah, sorry, this wasn't actually the case.  I was sleep-deprived and
hadn't been able to follow it all yet.

Bart points out that there are many remaining cases of assignment of
bi_disk where bi_partno isn't assigned as well-- so I wonder if there are
other potential problems from this.

Mike


Re: [PATCH] bio: ensure __bio_clone_fast copies bi_partno

2017-11-17 Thread Michael Lyle
Jens & everyone-- thanks for the speedy review and handling.  I've
updated my test cases to ensure that volumes from old releases work,
even when I "don't think" there's been a disk format change.

Bart--

On 11/17/2017 08:50 AM, Bart Van Assche wrote:
> Have you considered to use bio_copy_dev() instead of open-coding it?

One could... Right now almost all the uses of bio_copy_dev are in bcache
and they need to change for other reasons.  (e.g. macro uses parameter
more than once, function is passed in as parameter).  There's a whole
lot of places to change if it's desired to make bio_copy_dev universally
used to copy device information.

> Additionally, there is more code that copies these fields, e.g. the code in
> bio_clone_bioset(). Shouldn't that code be modified such that it also copies
> bi_partno?

Yes, when I was grepping around there were other things that looked
possibly unsafe.  I don't have test environments for all of these other
subsystems.

I wanted to get the minimal fix for this in, though, because people are
actively losing data to the problem it triggers with bcache.

Mike


Re: [PATCH] bio: ensure __bio_clone_fast copies bi_partno

2017-11-17 Thread Bart Van Assche
On Thu, 2017-11-16 at 23:47 -0800, Michael Lyle wrote:
> diff --git a/block/bio.c b/block/bio.c
> index 101c2a9b5481..33fa6b4af312 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -597,6 +597,7 @@ void __bio_clone_fast(struct bio *bio, struct bio 
> *bio_src)
>* so we don't set nor calculate new physical/hw segment counts here
>*/
>   bio->bi_disk = bio_src->bi_disk;
> + bio->bi_partno = bio_src->bi_partno;
>   bio_set_flag(bio, BIO_CLONED);
>   bio->bi_opf = bio_src->bi_opf;
>   bio->bi_write_hint = bio_src->bi_write_hint;

Have you considered to use bio_copy_dev() instead of open-coding it?

Additionally, there is more code that copies these fields, e.g. the code in
bio_clone_bioset(). Shouldn't that code be modified such that it also copies
bi_partno?

How about the following class of assignments in drivers/md/raid1.c:

mbio->bi_disk = (void *)conf->mirrors[i].rdev;

Should these assignments perhaps be followed by a mbio->bi_partno assignment?

How about the following class of assignments in the NVMe code:

   bio->bi_disk = disk;

Should these assignments perhaps be followed by a bio->bi_partno assignment?

Thanks,

Bart.
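The class of regression this thread is chasing — a hand-rolled field-by-field clone silently dropping a newly added member — can be shown in miniature. This is a toy model of the bug pattern, not the actual bio code; the struct and both clone functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Miniature model of the bug class: a clone that copies fields one by
 * one misses a member added later (here, the partition number). */
struct mini_bio {
    int bi_disk;        /* stand-in for the gendisk pointer */
    uint8_t bi_partno;  /* partition number, added later    */
    unsigned int bi_opf;
};

static void clone_buggy(struct mini_bio *dst, const struct mini_bio *src)
{
    dst->bi_disk = src->bi_disk;
    /* bi_partno forgotten: the clone now points at the whole disk */
    dst->bi_opf = src->bi_opf;
}

static void clone_fixed(struct mini_bio *dst, const struct mini_bio *src)
{
    dst->bi_disk = src->bi_disk;
    dst->bi_partno = src->bi_partno;  /* the one-line fix */
    dst->bi_opf = src->bi_opf;
}
```

This is also the argument for a single helper like bio_copy_dev(): one place that copies the device identity means a new field only has to be added once.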

Re: [PATCH v2 2/3] bdi: add error handle for bdi_debug_register

2017-11-17 Thread Jens Axboe
On 11/17/2017 08:06 AM, weiping zhang wrote:
> On Wed, Nov 01, 2017 at 02:47:22PM +0100, Jan Kara wrote:
>> On Tue 31-10-17 18:38:24, weiping zhang wrote:
>>> In order to make error handling cleaner we call bdi_debug_register
>>> before setting the state to WB_registered, so that we can avoid calling
>>> bdi_unregister in release_bdi().
>>>
>>> Signed-off-by: weiping zhang 
>>
>> Looks good to me. You can add:
>>
>> Reviewed-by: Jan Kara 
>>
>>  Honza
>>
>>> ---
>>>  mm/backing-dev.c | 5 -
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>>> index b5f940ce0143..84b2dc76f140 100644
>>> --- a/mm/backing-dev.c
>>> +++ b/mm/backing-dev.c
>>> @@ -882,10 +882,13 @@ int bdi_register_va(struct backing_dev_info *bdi, 
>>> const char *fmt, va_list args)
>>> if (IS_ERR(dev))
>>> return PTR_ERR(dev);
>>>  
>>> +   if (bdi_debug_register(bdi, dev_name(dev))) {
>>> +   device_destroy(bdi_class, dev->devt);
>>> +   return -ENOMEM;
>>> +   }
>>> cgwb_bdi_register(bdi);
>>> bdi->dev = dev;
>>>  
>>> -   bdi_debug_register(bdi, dev_name(dev));
>>> set_bit(WB_registered, &bdi->wb.state);
>>>  
>>> spin_lock_bh(&bdi_lock);
>>> -- 
> 
> Hello Jens,
> 
> Could you please give some comments on this cleanup series?

It looks good to me - for some reason I seem to be missing patch
2/3 locally, but I have this followup. I'll get it applied for
4.15, thanks.

-- 
Jens Axboe



Re: [PATCH] bio: ensure __bio_clone_fast copies bi_partno

2017-11-17 Thread Jens Axboe
On 11/17/2017 12:47 AM, Michael Lyle wrote:
> A new field was introduced in 74d46992e0d9dee7f1f376de0d56d31614c8a17a,
> bi_partno, instead of using bdev->bd_contains and encoding the partition
> information in the bi_bdev field.  __bio_clone_fast was changed to copy
> the disk information, but not the partition information.  At minimum,
> this regressed bcache and caused data corruption.

That's not good... Fix looks good to me, I'll queue this up for a pull
today. Thanks for bisecting this, Michael.

-- 
Jens Axboe



Re: Regression in 4.14: wrong data being read from bcache device

2017-11-17 Thread Christoph Hellwig
On Thu, Nov 16, 2017 at 07:07:40PM -0800, Michael Lyle wrote:
> I think the probable cause is this: we construct bios based on
> previously completed bios in a few places,

That is an extremely bad idea in many ways, so I think we'll need
to fix this as the priority.


Re: [PATCH v2 2/3] bdi: add error handle for bdi_debug_register

2017-11-17 Thread weiping zhang
On Wed, Nov 01, 2017 at 02:47:22PM +0100, Jan Kara wrote:
> On Tue 31-10-17 18:38:24, weiping zhang wrote:
> > In order to make error handling cleaner, call bdi_debug_register
> > before setting the state to WB_registered, so that we can avoid calling
> > bdi_unregister in release_bdi().
> > 
> > Signed-off-by: weiping zhang 
> 
> Looks good to me. You can add:
> 
> Reviewed-by: Jan Kara 
> 
>   Honza
> 
> > ---
> >  mm/backing-dev.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index b5f940ce0143..84b2dc76f140 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -882,10 +882,13 @@ int bdi_register_va(struct backing_dev_info *bdi, 
> > const char *fmt, va_list args)
> > if (IS_ERR(dev))
> > return PTR_ERR(dev);
> >  
> > +   if (bdi_debug_register(bdi, dev_name(dev))) {
> > +   device_destroy(bdi_class, dev->devt);
> > +   return -ENOMEM;
> > +   }
> > cgwb_bdi_register(bdi);
> > bdi->dev = dev;
> >  
> > -   bdi_debug_register(bdi, dev_name(dev));
> > set_bit(WB_registered, &bdi->wb.state);
> >  
> > spin_lock_bh(&bdi_lock);
> > -- 

Hello Jens,

Could you please give some comments on this cleanup series?

--
Thanks
weiping



4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk

2017-11-17 Thread Christian Borntraeger
When doing CPU hotplug in a KVM guest with virtio-blk I get warnings like:
[  747.652408] [ cut here ]
[  747.652410] WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 
__blk_mq_run_hw_queue+0xd4/0x100
[  747.652410] Modules linked in: dm_multipath
[  747.652412] CPU: 4 PID: 2895 Comm: kworker/4:1H Tainted: GW   
4.14.0+ #191
[  747.652412] Hardware name: IBM 2964 NC9 704 (KVM/Linux)
[  747.652414] Workqueue: kblockd blk_mq_run_work_fn
[  747.652414] task: 6068 task.stack: 5ea3
[  747.652415] Krnl PSW : 0704f0018000 00505864 
(__blk_mq_run_hw_queue+0xd4/0x100)
[  747.652417]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 
RI:0 EA:3
[  747.652417] Krnl GPRS: 0010 00ff 5cbec400 

[  747.652418]63709120  63709500 
59fa44b0
[  747.652418]59fa4480  6370f700 
63709100
[  747.652419]5cbec500 00970948 5ea33d80 
5ea33d48
[  747.652423] Krnl Code: 00505854: ebaff0a4lmg 
%r10,%r15,160(%r15)
   0050585a: c0f4ffe690d3   brcl15,1d7a00
  #00505860: a7f40001   brc 15,505862
  >00505864: 581003b0   l   %r1,944
   00505868: c01b001fff00   nilf%r1,2096896
   0050586e: a784ffdb   brc 8,505824
   00505872: a7f40001   brc 15,505874
   00505876: 9120218f   tm  399(%r2),32
[  747.652435] Call Trace:
[  747.652435] ([<63709600>] 0x63709600)
[  747.652436]  [<00187bcc>] process_one_work+0x264/0x4b8 
[  747.652438]  [<00187e78>] worker_thread+0x58/0x4f8 
[  747.652439]  [<0018ee94>] kthread+0x144/0x168 
[  747.652439]  [<008f8a62>] kernel_thread_starter+0x6/0xc 
[  747.652440]  [<008f8a5c>] kernel_thread_starter+0x0/0xc 
[  747.652440] Last Breaking-Event-Address:
[  747.652441]  [<00505860>] __blk_mq_run_hw_queue+0xd0/0x100
[  747.652442] ---[ end trace 4a001a80379b18ba ]---
[  747.652450] [ cut here ]


This is 

b7a71e66d (Jens Axboe2017-08-01 09:28:24 -0600 1141) * are 
mapped to it.
b7a71e66d (Jens Axboe2017-08-01 09:28:24 -0600 1142) */
6a83e74d2 (Bart Van Assche   2016-11-02 10:09:51 -0600 1143)
WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
6a83e74d2 (Bart Van Assche   2016-11-02 10:09:51 -0600 1144)
cpu_online(hctx->next_cpu));
6a83e74d2 (Bart Van Assche   2016-11-02 10:09:51 -0600 1145) 
b7a71e66d (Jens Axboe2017-08-01 09:28:24 -0600 1146)/*


Is this a known issue?

Christian



Re: [PATCH] bcache: recover data from backing device when read request hit clean

2017-11-17 Thread Eddie Chapman

On 17/11/17 13:22, Coly Li wrote:

On 17/11/2017 8:57 PM, Eddie Chapman wrote:

On 17/11/17 10:20, Rui Hua wrote:

Hi, Stefan

2017-11-17 16:28 GMT+08:00 Stefan Priebe - Profihost AG
:

I‘m getting the same xfs error message under high load. Does this
patch fix
it?


Did you apply the patch "bcache: only permit to recovery read error
when cache device is clean"?
If you did, maybe this patch can fix it. You'd better also check
/sys/fs/bcache/XXX/internal/cache_read_races in your environment;
it should not be zero when you get that error message.


Hi all,

I have 3 servers running a very recent 4.9 stable release, with several
recent bcache patches cherry picked, including V4 of "bcache: only
permit to recovery read error when cache device is clean".

In the 3 weeks since using these cherry picks I've experienced a very
small number of isolated read errors in the layer above bcache, on all 3
servers.

On one of the servers, 2 out of the 6 bcache resources have a value of 1
in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on these same
2 bcache resources where one read error has occurred on the upper layer.
The other 4 bcache resources have 0 in cache_read_races and I haven't
had any read errors on the layers above them.

On another server, I have 1 bcache resource out of 10 with a value of 5
in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on that
bcache resource where a read error occurred on one occasion. The other 9
bcache resources have 0 in cache_read_races, and no read errors have
occurred on the layers above any of them.

On the 3rd server where some read errors occurred, I cannot verify if
there were positive values in cache_read_races as I moved the data from
there onto other storage, and shut down the bcache resources where the
errors occurred.

If I can provide any other info which might help with this issue, please
let me know.


Hi Eddie,

This is very informative, thank you so much :-)

Coly Li


Hi Coly,

You are most welcome. Another interesting piece of information, though it 
may be unrelated coincidence: on the bcache resources where the errors 
occurred, the underlying backing device was a RAID adapter that is quite 
a lot slower than the (different) underlying physical storage on the 
other bcache resources that have no read races. Up to now I had suspected 
a driver issue with this RAID adapter as the cause of the read errors, so 
I started gradually retiring the adapter on these servers over the last 3 
weeks. Anyway, in light of this issue coming up here, I'm wondering 
whether this suggests that read races are more likely to occur when the 
backing storage is quite slow. Or maybe not.


Eddie


Re: [PATCH] bcache: recover data from backing device when read request hit clean

2017-11-17 Thread Coly Li
On 17/11/2017 8:57 PM, Eddie Chapman wrote:
> On 17/11/17 10:20, Rui Hua wrote:
>> Hi, Stefan
>>
>> 2017-11-17 16:28 GMT+08:00 Stefan Priebe - Profihost AG
>> :
>>> I‘m getting the same xfs error message under high load. Does this
>>> patch fix
>>> it?
>>>
>> Did you apply the patch "bcache: only permit to recovery read error
>> when cache device is clean"?
>> If you did, maybe this patch can fix it. You'd better also check
>> /sys/fs/bcache/XXX/internal/cache_read_races in your environment;
>> it should not be zero when you get that error message.
> 
> Hi all,
> 
> I have 3 servers running a very recent 4.9 stable release, with several
> recent bcache patches cherry picked, including V4 of "bcache: only
> permit to recovery read error when cache device is clean".
> 
> In the 3 weeks since using these cherry picks I've experienced a very
> small number of isolated read errors in the layer above bcache, on all 3
> servers.
> 
> On one of the servers, 2 out of the 6 bcache resources have a value of 1
> in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on these same
> 2 bcache resources where one read error has occurred on the upper layer.
> The other 4 bcache resources have 0 in cache_read_races and I haven't
> had any read errors on the layers above them.
> 
> On another server, I have 1 bcache resource out of 10 with a value of 5
> in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on that
> bcache resource where a read error occurred on one occasion. The other 9
> bcache resources have 0 in cache_read_races, and no read errors have
> occurred on the layers above any of them.
> 
> On the 3rd server where some read errors occurred, I cannot verify if
> there were positive values in cache_read_races as I moved the data from
> there onto other storage, and shut down the bcache resources where the
> errors occurred.
> 
> If I can provide any other info which might help with this issue, please
> let me know.

Hi Eddie,

This is very informative, thank you so much :-)

Coly Li


Re: [PATCH] bcache: recover data from backing device when read request hit clean

2017-11-17 Thread Eddie Chapman

On 17/11/17 10:20, Rui Hua wrote:

Hi, Stefan

2017-11-17 16:28 GMT+08:00 Stefan Priebe - Profihost AG :

I‘m getting the same xfs error message under high load. Does this patch fix
it?


Did you apply the patch "bcache: only permit to recovery read error
when cache device is clean"?
If you did, maybe this patch can fix it. You'd better also check
/sys/fs/bcache/XXX/internal/cache_read_races in your environment;
it should not be zero when you get that error message.


Hi all,

I have 3 servers running a very recent 4.9 stable release, with several 
recent bcache patches cherry picked, including V4 of "bcache: only 
permit to recovery read error when cache device is clean".


In the 3 weeks since using these cherry picks I've experienced a very 
small number of isolated read errors in the layer above bcache, on all 3 
servers.


On one of the servers, 2 out of the 6 bcache resources have a value of 1 
in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on these same 
2 bcache resources where one read error has occurred on the upper layer. 
The other 4 bcache resources have 0 in cache_read_races and I haven't 
had any read errors on the layers above them.


On another server, I have 1 bcache resource out of 10 with a value of 5 
in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on that 
bcache resource where a read error occurred on one occasion. The other 9 
bcache resources have 0 in cache_read_races, and no read errors have 
occurred on the layers above any of them.


On the 3rd server where some read errors occurred, I cannot verify if 
there were positive values in cache_read_races as I moved the data from 
there onto other storage, and shut down the bcache resources where the 
errors occurred.


If I can provide any other info which might help with this issue, please 
let me know.


regards,
Eddie


Re: [PATCH] bcache: recover data from backing device when read request hit clean

2017-11-17 Thread Stefan Priebe - Profihost AG

Am 17.11.2017 um 11:20 schrieb Rui Hua:
> Hi, Stefan
> 
> 2017-11-17 16:28 GMT+08:00 Stefan Priebe - Profihost AG 
> :
>> I‘m getting the same xfs error message under high load. Does this patch fix
>> it?
>>
> Did you apply the patch "bcache: only permit to recovery read error
> when cache device is clean"?
> If you did, maybe this patch can fix it. You'd better also check
> /sys/fs/bcache/XXX/internal/cache_read_races in your environment;
> it should not be zero when you get that error message.
> 


Yes this patch is applied.

Stefan


Re: Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?

2017-11-17 Thread Austin S. Hemmelgarn

On 2017-11-16 20:30, Qu Wenruo wrote:



On 2017-11-17 00:47, Austin S. Hemmelgarn wrote:



This is at least less complicated than dm-integrity.

It is just a new hook for READ bios, and it can start from the easy part,
e.g. from dm-raid1 and support in other filesystems.

It's less complicated for end users (in theory, but cryptsetup devs are
working on that for dm-integrity), but significantly more complicated
for developers.

It also brings up the question of what happens when you want some other
layer between the filesystem and the MD/DM RAID layer (say, running
bcache or dm-cache on top of the RAID array).  In the case of
dm-integrity, that's not an issue because dm-integrity is entirely
self-contained, it doesn't depend on other layers beyond the standard
block interface.


Each layer can choose to drop support for the extra verification.

If a layer is not modifying the data, it can pass the hook down to the
lower layer, just as with the integrity payload.
Which then makes things a bit more complicated in every other layer as 
well, in turn making things more complicated for all developers.




As I mentioned in my other reply on this thread, running with
dm-integrity _below_ the RAID layer instead of on top of it will provide
the same net effect, and in fact provide a stronger guarantee than what
you are proposing (because dm-integrity does real cryptographic
integrity verification, as opposed to just checking for bit-rot).


Although with more CPU usage for each device, even when they contain the
same data.

I never said it wasn't higher resource usage.






If your checksum is calculated and checked at FS level there is no added
value when you spread this logic to other layers.


That's why I'm moving the checking part to a lower level, to get more
value from the checksum.



dm-integrity adds basic 'check-summing' to any filesystem without the
need to modify fs itself


Well, despite the fact that modern filesystems have already implemented
their own metadata checksums.

   - the price paid is: if there is a bug between passing data from 'fs'
to 'dm-integrity', it cannot be caught.

Advantage of having separated 'fs' and 'block' layer is in its
separation and simplicity at each level.


Totally agreed on this.

But the idea here should not bring that large an impact (compared to big
things like ZFS/Btrfs):

1) It only affects READ bios.
2) Every dm target can choose whether to support the hook or pass it down;
   there is no need to support it for RAID0, for example.
   And for complex RAID like RAID5/6, it need not be supported from the
   very beginning.
3) The main part of the functionality is already implemented.
   The core complexity has 2 parts:
   a) checksum calculation and checking
      Modern filesystems already do this, at least for metadata.
   b) recovery
      dm targets already have this implemented for the supported RAID
      profiles.
   All of this is already implemented; just invoking it at a different
   time should not require that big a modification, IIRC.


If you want integrated solution - you are simply looking for btrfs where
multiple layers are integrated together.


With such a verification hook (along with something extra to handle
scrub), btrfs chunk mapping could be re-implemented with device-mapper:

In fact btrfs logical space is just a dm-linear device, and each chunk
can be implemented by its corresponding dm-* module like:

dm-linear:   | btrfs chunk 1 | btrfs chunk 2 | ... | btrfs chunk n |
and
btrfs chunk 1: metadata, using dm-raid1 on diskA and diskB
btrfs chunk 2: data, using dm-raid0 on disk A B C D
...
btrfs chunk n: system, using dm-raid1 on disk A B
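Purely as an illustration of that picture (the device names, sector counts, and table parameters below are made up, not taken from any real setup), the chunk-to-target mapping could be expressed with ordinary dm tables:

```text
# hypothetical dmsetup tables, sizes in 512-byte sectors

# btrfs logical space as one linear concatenation of chunks:
0        2097152  linear  /dev/mapper/chunk-meta  0
2097152  8388608  linear  /dev/mapper/chunk-data  0

# chunk-meta: RAID1 across disks A and B (dm-raid target)
0  2097152  raid  raid1  3 0 region_size 1024  2  -  /dev/sdA1  -  /dev/sdB1

# chunk-data: RAID0 striped across disks A B C D (dm-stripe target)
0  8388608  striped  4  256  /dev/sdA2 0  /dev/sdB2 0  /dev/sdC2 0  /dev/sdD2 0
```

In this framing each btrfs chunk really is just an instance of an existing dm target, which is what would let the verification hook reuse the recovery code those targets already have.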

At least btrfs could take advantage of the simplicity of separate
layers.

And other filesystems would get a somewhat better chance of recovering
their metadata if built on dm-raid.

Again, just put dm-integrity below dm-raid.  The other filesystems
primarily have metadata checksums to catch data corruption, not repair
it,


Because they have no extra copy.
If they had one, they would definitely use the extra copy to repair.
But they don't have those extra copies now, so that really becomes 
irrelevant as an argument (especially since it's not likely they will 
add data or metadata replication in the filesystem any time in the near 
future).



and I severely doubt that you will manage to convince developers to
add support in their filesystem (especially XFS) because:
1. It's a layering violation (yes, I know BTRFS is too, but that's a bit
less of an issue because it's a completely self-contained layering
violation, while this isn't).


If passing something along with a bio violates layering, then the
integrity payload has already been doing that for a long time.
The block integrity layer is also interfacing directly with hardware and 
_needs_ to pass that data down.  Unless I'm mistaken, it also doesn't do 
any verification except in the filesystem layer, and doesn't pass down 
any complaints about the integrity of the data (it may try to re-read 
it, but that's not the same as what 

Re: [PATCH] bio: ensure __bio_clone_fast copies bi_partno

2017-11-17 Thread Ming Lei
On Thu, Nov 16, 2017 at 11:47:25PM -0800, Michael Lyle wrote:
> A new field was introduced in 74d46992e0d9dee7f1f376de0d56d31614c8a17a,
> bi_partno, instead of using bdev->bd_contains and encoding the partition
> information in the bi_bdev field.  __bio_clone_fast was changed to copy
> the disk information, but not the partition information.  At minimum,
> this regressed bcache and caused data corruption.
> 
> Signed-off-by: Michael Lyle 
> Fixes: 74d46992e0d9dee7f1f376de0d56d31614c8a17a
> Reported-by: Pavel Goran 
> Reported-by: Campbell Steven 
> Cc: Christoph Hellwig 
> Cc: Jens Axboe 
> Cc: 
> ---
>  block/bio.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 101c2a9b5481..33fa6b4af312 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -597,6 +597,7 @@ void __bio_clone_fast(struct bio *bio, struct bio 
> *bio_src)
>* so we don't set nor calculate new physical/hw segment counts here
>*/
>   bio->bi_disk = bio_src->bi_disk;
> + bio->bi_partno = bio_src->bi_partno;
>   bio_set_flag(bio, BIO_CLONED);
>   bio->bi_opf = bio_src->bi_opf;
>   bio->bi_write_hint = bio_src->bi_write_hint;

Reviewed-by: Ming Lei 

-- 
Ming


Re: [PATCH] bio: ensure __bio_clone_fast copies bi_partno

2017-11-17 Thread Coly Li
On 17/11/2017 3:47 PM, Michael Lyle wrote:
> A new field was introduced in 74d46992e0d9dee7f1f376de0d56d31614c8a17a,
> bi_partno, instead of using bdev->bd_contains and encoding the partition
> information in the bi_bdev field.  __bio_clone_fast was changed to copy
> the disk information, but not the partition information.  At minimum,
> this regressed bcache and caused data corruption.
> 

Hi Michael,

Thanks for the fix, it looks good to me.

> Signed-off-by: Michael Lyle 
> Fixes: 74d46992e0d9dee7f1f376de0d56d31614c8a17a
> Reported-by: Pavel Goran 
> Reported-by: Campbell Steven 

Reviewed-by: Coly Li 

Coly Li

> Cc: Christoph Hellwig 
> Cc: Jens Axboe 
> Cc: 
> ---
>  block/bio.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 101c2a9b5481..33fa6b4af312 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -597,6 +597,7 @@ void __bio_clone_fast(struct bio *bio, struct bio 
> *bio_src)
>* so we don't set nor calculate new physical/hw segment counts here
>*/
>   bio->bi_disk = bio_src->bi_disk;
> + bio->bi_partno = bio_src->bi_partno;
>   bio_set_flag(bio, BIO_CLONED);
>   bio->bi_opf = bio_src->bi_opf;
>   bio->bi_write_hint = bio_src->bi_write_hint;
>