v4.15 and I/O hang with BFQ

2018-01-30 Thread Oleksandr Natalenko

Hi, Paolo, Ivan, Ming et al.

It looks like I've just encountered the issue Ivan has already described 
in [1]. Since I'm able to reproduce it reliably in a VM, I'd like to 
draw more attention to it.


First, I'm using v4.15 kernel with all pending BFQ fixes:

===
2ad909a300c4 bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHED
83c97a310f83 block, bfq: release oom-queue ref to root group on exit
5b9eb4716af1 block, bfq: put async queues for root bfq groups too
3c5529454a27 block, bfq: limit sectors served with interactive weight raising
e6c72be3486b block, bfq: limit tags for writes and async I/O
e579b91d96ce block, bfq: increase threshold to deem I/O as random
f6cbc16aac88 block, bfq: remove superfluous check in queue-merging setup
8045d8575183 block, bfq: let a queue be merged only shortly after starting I/O
242954975f5e block, bfq: check low_latency flag in bfq_bfqq_save_state()
8349c1bddd95 block, bfq: add missing rq_pos_tree update on rq removal
558200440cb9 block, bfq: fix occurrences of request finish method's old name
6ed2f47ee870 block, bfq: consider also past I/O in soft real-time detection
e5f295dd18f2 block, bfq: remove batches of confusing ifdefs
===

Next, I boot an Arch VM with this kernel and an emulated USB stick
attached:

===
qemu-system-x86_64 -display gtk,gl=on -machine q35,accel=kvm -cpu 
host,+vmx -enable-kvm -drive 
if=pflash,format=raw,readonly,file=/mnt/vms/ovmf/code.img -drive 
if=pflash,format=raw,file=/mnt/vms/ovmf/vars.img -cdrom 
/mnt/vms/ovmf/shell.iso -netdev user,id=user.0 -device 
virtio-net,netdev=user.0 -usb -device nec-usb-xhci,id=xhci -device 
usb-tablet,bus=xhci.0 -serial stdio -m 512 -hda sda.img -hdb sdb.img 
-smp 4 -drive if=none,id=stick,file=usb.img -device 
usb-storage,bus=xhci.0,drive=stick
===

Within the VM itself I use a udev rule to set the I/O scheduler:

===
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="bfq"
===

Things boot and work fine until I try to create a partition on the 
emulated USB stick with cfdisk. On writing a new partition table it 
blocks, and I get the following stacktrace:


===
[  225.670702] INFO: task cfdisk:416 blocked for more than 20 seconds.
[  225.681427]   Not tainted 4.15.0-pf1 #1
[  225.685341] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  225.691910] cfdisk  D0   416407 0x
[  225.700777] Call Trace:
[  225.703654]  ? __schedule+0x35f/0x1000
[  225.706990]  ? __switch_to_asm+0x40/0x70
[  225.709943]  ? __switch_to_asm+0x34/0x70
[  225.712224]  ? __switch_to_asm+0x40/0x70
[  225.714225]  ? __switch_to_asm+0x40/0x70
[  225.716790]  schedule+0x32/0xc0
[  225.718355]  io_schedule+0x12/0x40
[  225.719747]  wait_on_page_bit+0xea/0x130
[  225.721266]  ? add_to_page_cache_lru+0xe0/0xe0
[  225.722622]  __filemap_fdatawait_range+0xbf/0x110
[  225.724625]  ? preempt_count_sub+0x50/0x90
[  225.726591]  ? sync_inodes_one_sb+0x20/0x20
[  225.727507]  filemap_fdatawait_keep_errors+0x1a/0x40
[  225.728491]  iterate_bdevs+0xa7/0x140
[  225.729304]  sys_sync+0x7c/0xb0
[  225.730095]  entry_SYSCALL_64_fastpath+0x20/0x83
[  225.732420] RIP: 0033:0x7f2631cf4a17
===

I've tried it several times with the same result. Next, I reboot the 
system, change the scheduler to Kyber, and then I can create a new 
partition successfully.


I also went further. With Kyber active so that the emulated USB stick 
works properly, I create RAID10 on it, then an LVM PV, then a VG, then 
an LV, then XFS on top of that; then I set the I/O scheduler back to 
BFQ and reboot. On reboot, mdadm gets blocked while discovering things, 
producing the following stacktrace:


===
[   41.350763] INFO: task mdadm:354 blocked for more than 20 seconds.
[   41.356154]   Not tainted 4.15.0-pf1 #1
[   41.358674] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[   41.363046] mdadm   D0   354352 0x0100
[   41.368700] Call Trace:
[   41.370417]  ? __schedule+0x35f/0x1000
[   41.372668]  ? blk_queue_exit+0x3e/0x60
[   41.374816]  ? generic_make_request+0x12f/0x2d0
[   41.377363]  schedule+0x32/0xc0
[   41.380416]  io_schedule+0x12/0x40
[   41.382698]  __blkdev_direct_IO_simple+0x206/0x360
[   41.382707]  ? bdget+0x120/0x120
[   41.382714]  ? blkdev_direct_IO+0x3a5/0x3f0
[   41.382718]  blkdev_direct_IO+0x3a5/0x3f0
[   41.382724]  ? current_time+0x18/0x70
[   41.382731]  ? __atime_needs_update+0x7f/0x190
[   41.382743]  ? generic_file_read_iter+0x8c/0x9d0
[   41.382747]  generic_file_read_iter+0x8c/0x9d0
[   41.382759]  ? __seccomp_filter+0x3b/0x260
[   41.382765]  __vfs_read+0xf5/0x170
[   41.382772]  vfs_read+0x91/0x130
[   41.382778]  SyS_read+0x52/0xc0
[   41.382819]  do_syscall_64+0x67/0x1a0
[   41.382828]  entry_SYSCALL64_slow_path+0x25/0x25
[   41.382833] RIP: 0033:0x7f4b8088a3a1
[   41.382836] RSP: 002b:7ffdd681def8 EFLAGS: 0246 ORIG_RAX: 

[   41.382841] RAX: ffda RBX: 
===

Re: v4.15 and I/O hang with BFQ

2018-01-30 Thread Ming Lei
Hi,

On Tue, Jan 30, 2018 at 09:05:26AM +0100, Oleksandr Natalenko wrote:
> Hi, Paolo, Ivan, Ming et al.
> 
> It looks like I've just encountered the issue Ivan has already described in
> [1]. Since I'm able to reproduce it reliably in a VM, I'd like to draw more
> attention to it.
> 
> First, I'm using v4.15 kernel with all pending BFQ fixes:
> 
> ===
> 2ad909a300c4 bfq-iosched: don't call bfqg_and_blkg_put for
> !CONFIG_BFQ_GROUP_IOSCHED
> 83c97a310f83 block, bfq: release oom-queue ref to root group on exit
> 5b9eb4716af1 block, bfq: put async queues for root bfq groups too
> 3c5529454a27 block, bfq: limit sectors served with interactive weight
> raising
> e6c72be3486b block, bfq: limit tags for writes and async I/O
> e579b91d96ce block, bfq: increase threshold to deem I/O as random
> f6cbc16aac88 block, bfq: remove superfluous check in queue-merging setup
> 8045d8575183 block, bfq: let a queue be merged only shortly after starting
> I/O
> 242954975f5e block, bfq: check low_latency flag in bfq_bfqq_save_state()
> 8349c1bddd95 block, bfq: add missing rq_pos_tree update on rq removal
> 558200440cb9 block, bfq: fix occurrences of request finish method's old name
> 6ed2f47ee870 block, bfq: consider also past I/O in soft real-time detection
> e5f295dd18f2 block, bfq: remove batches of confusing ifdefs
> ===
> 
> Next, I boot an Arch VM with this kernel and emulated USB stick attached:
> 
> ===
> qemu-system-x86_64 -display gtk,gl=on -machine q35,accel=kvm -cpu host,+vmx
> -enable-kvm -drive if=pflash,format=raw,readonly,file=/mnt/vms/ovmf/code.img
> -drive if=pflash,format=raw,file=/mnt/vms/ovmf/vars.img -cdrom
> /mnt/vms/ovmf/shell.iso -netdev user,id=user.0 -device
> virtio-net,netdev=user.0 -usb -device nec-usb-xhci,id=xhci -device
> usb-tablet,bus=xhci.0 -serial stdio -m 512 -hda sda.img -hdb sdb.img -smp 4
> -drive if=none,id=stick,file=usb.img -device
> usb-storage,bus=xhci.0,drive=stick
> ===
> 
> Within the VM itself I use a udev rule to set the I/O scheduler:
> 
> ===
> ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="bfq"
> ===

We have known there is an I/O hang issue with BFQ over USB storage wrt.
blk-mq, and last time I tracked it down to the inside of BFQ. You can try
the debug patch in the following link [1] to see if it is the same as the
previous report [1][2]:

[1] https://marc.info/?l=linux-block&m=151214241518562&w=2
[2] https://bugzilla.kernel.org/show_bug.cgi?id=198023

If you aren't sure whether they are the same, please post the trace
somewhere, and I can check whether it is a new bug.

Or Paolo should know whether the issue is fixed in v4.15.

Thanks,
Ming


Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-30 Thread Martin Steigerwald
Ming Lei - 30.01.18, 02:24:
> > > SCSI_MQ was first enabled in v3.17, but disabled by default. In
> > > v4.13-rc1 it was enabled by default, but the patch was later reverted
> > > in v4.13-rc7, making it disabled by default again.
> > > 
> > > Now both the originally reported PM issue (actually SCSI quiesce) and
> > > the sequential IO performance issue have been addressed.
> > 
> > Is the blocker bug just not closed because no-one thought to do it:
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=178381
> > 
> > (we have confirmed that this issue is now fixed with the original
> > reporter?)
> 
> From a developer's view, this issue is fixed by the following commit:
> 3a0a52997 ("block, scsi: Make SCSI quiesce and resume work reliably"),
> and it has been verified by the kernel list reporter.

I have never seen any suspend/hibernate related issues with blk-mq + BFQ since 
then, using a heavily utilized BTRFS dual-SSD RAID 1.

% egrep "MQ|BFQ" /boot/config-4.15.0-tp520-btrfstrim+
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_BLK_WBT_MQ=y
CONFIG_BLK_MQ_PCI=y
CONFIG_BLK_MQ_VIRTIO=y
CONFIG_MQ_IOSCHED_DEADLINE=m
CONFIG_MQ_IOSCHED_KYBER=m
CONFIG_IOSCHED_BFQ=m
CONFIG_BFQ_GROUP_IOSCHED=y
CONFIG_NET_SCH_MQPRIO=m
# CONFIG_SCSI_MQ_DEFAULT is not set
# CONFIG_DM_MQ_DEFAULT is not set
CONFIG_DM_CACHE_SMQ=m

% cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-4.15.0-tp520-btrfstrim+ root=UUID=[…] ro 
rootflags=subvol=debian resume=/dev/mapper/sata-swap init=/bin/systemd 
thinkpad_acpi.fan_control=1 systemd.restore_state=0 scsi_mod.use_blk_mq=1

% cat /sys/block/sda/queue/scheduler 
[bfq] none

Thanks,
-- 
Martin


Re: [LSF/MM TOPIC] De-clustered RAID with MD

2018-01-30 Thread Johannes Thumshirn
Wols Lists  writes:

> On 29/01/18 15:23, Johannes Thumshirn wrote:
>> Hi linux-raid, lsf-pc
>> 
>> (If you've received this mail multiple times, I'm sorry, I'm having
>> trouble with the mail setup).
>
> My immediate reactions as a lay person (I edit the raid wiki) ...
>> 
>> With the rise of bigger and bigger disks, array rebuilding times start
>> skyrocketing.
>
> And? Yes, your data is at risk during a rebuild, but md-raid throttles
> the i/o, so it doesn't hammer the system.
>> 
>> In a paper from '92, Holland and Gibson [1] suggest a mapping algorithm
>> similar to RAID5, but instead of utilizing all disks in an array for
>> every I/O operation, they implement a per-I/O mapping function to use
>> only a subset of the available disks.
>> 
>> This has at least two advantages:
>> 1) If one disk has to be replaced, it's not needed to read the data from
>>all disks to recover the one failed disk so non-affected disks can be
>>used for real user I/O and not just recovery and
>
> Again, that's throttling, so that's not a problem ...

And throttling in a production environment is not exactly
desired. Imagine a 500-disk array (and yes, this is something we've seen
with MD) and you have to replace disks. While the array is being rebuilt
you have to throttle all I/O, because with raid-{1,5,6,10} all data is
striped across all disks.

With a parity-declustered RAID (or DDP, as Dell, NetApp or Huawei call
it) you don't have to, as the I/O is replicated in parity groups across a
subset of disks. I/O targeting disks which aren't needed to recover
the data from the failed disk isn't affected by the throttling at all.

>> 2) an efficient mapping function can improve parallel I/O submission, as
>>two different I/Os are not necessarily going to the same disks in the
>>array. 
>> 
>> For the mapping function used a hashing algorithm like Ceph's CRUSH [2]
>> would be ideal, as it provides a pseudo random but deterministic mapping
>> for the I/O onto the drives.
>> 
>> This whole declustering of course only makes sense for more than (at
>> least) 4 drives, but we do have customers with several orders of
>> magnitude more drives in an MD array.
>
> If you have four drives or more - especially if they are multi-terabyte
> drives - you should NOT be using raid-5 ...

raid-6 won't help you much in the above scenario.

>> 
>> At LSF I'd like to discuss if:
>> 1) The wider MD audience is interested in de-clusterd RAID with MD
>
> I haven't read the papers, so no comment, sorry.
>
>> 2) de-clustered RAID should be implemented as a sublevel of RAID5 or
>>as a new personality
>
> Neither! If you're going to do it, it should be raid-6.
>
>> 3) CRUSH is a suitable algorithm for this (there's evidence in [3] that
>>the NetApp E-Series Arrays do use CRUSH for parity declustering)
>> 
>> [1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf 
>> [2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
>> [3]
>> https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf
>> 
> Okay - I've now skimmed the crush paper [2]. Looks well interesting.
> BUT. It feels more like btrfs than it does like raid.
>
> Btrfs manages disks, and does raid, it tries to be the "everything
> between the hard drive and the file". This crush thingy reads to me like
> it wants to be the same. There's nothing wrong with that, but md is a
> unix-y "do one thing (raid) and do it well".

Well, CRUSH is (one of) the algorithms behind Ceph. It makes the
decisions about where to place a block. It is just a hash function (well,
technically a weighted decision tree) that takes a block of I/O and some
configuration parameters and "calculates" the placement.
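
As a hedged illustration of that idea only (this is not the CRUSH
algorithm, and the mixer and names below are assumptions), a
pseudo-random but deterministic mapping can be as simple as hashing the
logical stripe number together with a replica index:

#include <stdint.h>
#include <stdio.h>

/* 64-bit mixer (splitmix64-style finalizer). */
static uint64_t mix(uint64_t x)
{
	x ^= x >> 33;
	x *= 0xff51afd7ed558ccdULL;
	x ^= x >> 33;
	x *= 0xc4ceb9fe1a85ec53ULL;
	x ^= x >> 33;
	return x;
}

/* Same inputs always give the same disk, but placement looks random.
 * A real algorithm must also guarantee that replicas land on distinct
 * disks, which this toy version does not. */
static unsigned place(uint64_t stripe, unsigned replica, unsigned ndisks)
{
	return mix(stripe * 131 + replica) % ndisks;
}

int main(void)
{
	for (uint64_t s = 0; s < 4; s++)
		printf("stripe %llu -> disks %u %u %u\n",
		       (unsigned long long)s,
		       place(s, 0, 100), place(s, 1, 100), place(s, 2, 100));
	return 0;
}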

> My knee-jerk reaction is if you want to go for it, it sounds like a good
> idea. It just doesn't really feel a good fit for md.

Thanks for the input.

   Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-30 Thread Johannes Thumshirn
[+Cc Mel]
Jens Axboe  writes:
> On 1/29/18 1:56 PM, James Bottomley wrote:
>> On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
>> [...]
>>> 2. When to enable SCSI_MQ at default again?
>> 
>> I'm not sure there's much to discuss ... I think the basic answer is as
>> soon as Christoph wants to try it again.
>
> FWIW, internally I've been running various IO intensive workloads on
> what is essentially 4.12 upstream with scsi-mq the default (with
> mq-deadline as the scheduler) and comparing IO workloads with a
> previous 4.6 kernel (without scsi-mq), and things are looking
> great.
>
> We're never going to iron out the last kinks with it being off
> by default, I think we should attempt to flip the switch again
> for 4.16.

The 4.12 results sound interesting. I remember Mel ran some tests with
4.12 when we were considering flipping the config option for SLES, and it
showed several road blocks.

I'm not sure whether he re-evaluated 4.13/4.14 on his grid, though.

But I'm definitely interested in this discussion and can possibly even
share some benchmark results we did in our FC Lab.

Byte,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-30 Thread John Garry

On 30/01/2018 01:24, Ming Lei wrote:

On Mon, Jan 29, 2018 at 12:56:30PM -0800, James Bottomley wrote:

On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
[...]

2. When to enable SCSI_MQ at default again?


I'm not sure there's much to discuss ... I think the basic answer is as
soon as Christoph wants to try it again.


I guess Christoph still needs to evaluate whether there are existing
issues or blockers before trying it again. And more input may be gathered
from an F2F discussion, IMHO.




SCSI_MQ was first enabled in v3.17, but disabled by default. In
v4.13-rc1 it was enabled by default, but the patch was later reverted
in v4.13-rc7, making it disabled by default again.

Now both the originally reported PM issue (actually SCSI quiesce) and
the sequential IO performance issue have been addressed.


Is the blocker bug just not closed because no-one thought to do it:

https://bugzilla.kernel.org/show_bug.cgi?id=178381

(we have confirmed that this issue is now fixed with the original
reporter?)



From a developer's view, this issue is fixed by the following commit:

3a0a52997 ("block, scsi: Make SCSI quiesce and resume work reliably"),
and it has been verified by the kernel list reporter.



And did the Huawei guy (Jonathan Cameron) confirm his performance issue
was fixed (I don't think I saw an email saying that he did)?


Last time I talked with John Garry about the issue, the merged .get_budget
based patch improved the IO performance a lot, but there was still a bit
of a gap compared with the legacy path. It seems to be a driver-specific
issue; I remember that removing a driver's lock can improve performance
considerably.

Garry, could you provide a further update on this issue?


Hi Ming,

From our testing with experimental changes to our driver to support 
SCSI MQ, we were getting almost on-par performance with the legacy path. 
But without these changes MQ was hurting performance (and I would not 
necessarily say it was a driver issue).


We can retest from today's mainline and see where we are.

BTW, have you got performance figures for many other single-queue HBAs 
with and without CONFIG_SCSI_MQ_DEFAULT=y?


Thanks,
John



Thanks,
Ming


Re: [LSF/MM TOPIC] De-clustered RAID with MD

2018-01-30 Thread Wols Lists
On 29/01/18 21:50, NeilBrown wrote:
> By doing declustered parity you can sanely do raid6 on 100 drives, using
> a logical stripe size that is much smaller than 100.
> When recovering a single drive, the 10-groups-of-10 would put heavy load
> on 9 other drives, while the decluster approach puts light load on 99
> other drives.  No matter how clever md is at throttling recovery, I
> would still rather distribute the load so that md has an easier job.

Not offering to do it ... :-)

But that sounds a bit like linux raid-10. Could a simple approach be to
do something like "raid-6,11,100", ie raid-6 with 9 data chunks, two
parity, striped across 100 drives? Okay, it's not as good as the
decluster approach, but it would spread the stress of a rebuild across
20 drives, not 10. And probably be fairly easy to implement.

Cheers,
Wol


Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-30 Thread Mel Gorman
On Tue, Jan 30, 2018 at 11:08:28AM +0100, Johannes Thumshirn wrote:
> [+Cc Mel]
> Jens Axboe  writes:
> > On 1/29/18 1:56 PM, James Bottomley wrote:
> >> On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
> >> [...]
> >>> 2. When to enable SCSI_MQ at default again?
> >> 
> >> I'm not sure there's much to discuss ... I think the basic answer is as
> >> soon as Christoph wants to try it again.
> >
> > FWIW, internally I've been running various IO intensive workloads on
> > what is essentially 4.12 upstream with scsi-mq the default (with
> > mq-deadline as the scheduler) and comparing IO workloads with a
> > previous 4.6 kernel (without scsi-mq), and things are looking
> > great.
> >
> > We're never going to iron out the last kinks with it being off
> > by default, I think we should attempt to flip the switch again
> > for 4.16.
> 
> The 4.12 results sound interesting. I remember Mel ran some tests with
> 4.12 when we were considering flipping the config option for SLES, and it
> showed several road blocks.
> 

Mostly due to slow storage and BFQ where mq-deadline was not a universal
win as an alternative default. I don't have current data and I archived
what I had, but it was based on 4.13-rc7 at the time and BFQ has changed
a lot since so it would need to be redone.

> I'm not sure whether he re-evaluated 4.13/4.14 on his grid though.
> 

No, it hasn't. Grid time for performance testing has been tight during
the last few months to say the least.

> But I'm definitely interested in this discussion and can possibly even
> share some benchmark results we did in our FC Lab.
> 

If you remind me, I may be able to re-execute the tests in a 4.16-rcX
before LSF/MM so you have other data to work with.  Unfortunately, I'll
not be able to make LSF/MM this time due to personal commitments that
conflict and are unmovable.

-- 
Mel Gorman
SUSE Labs


[RFC] hard LOCKUP caused by race between blk_init_queue_node and blkcg_print_blkgs

2018-01-30 Thread Joseph Qi
Hi Jens and Folks,

Recently we've gotten a hard LOCKUP issue. After investigating the issue
we've found a race between blk_init_queue_node and blkcg_print_blkgs.
The race is described below.

blk_init_queue_node                      blkcg_print_blkgs
  blk_alloc_queue_node (1)
    q->queue_lock = &q->__queue_lock (2)
    blkcg_init_queue(q) (3)
                                         spin_lock_irq(blkg->q->queue_lock) (4)
  q->queue_lock = lock (5)
                                         spin_unlock_irq(blkg->q->queue_lock) (6)

(1) allocate an uninitialized queue;
(2) initialize queue_lock to its default internal lock;
(3) initialize blkcg part of request queue, which will create blkg and
then insert it to blkg_list;
(4) traverse blkg_list and find the created blkg, and then take its
queue lock, here it is the default *internal lock*;
(5) *race window*, now queue_lock is overridden with *driver specified
lock*;
(6) now unlock *driver specified lock*, not the locked *internal lock*,
unlock balance breaks.

For the issue above, I think blkcg_init_queue is called a bit too early.
We can't allow further use of the request queue before it is fully
initialized. Since blk_init_queue_node is a really common path and it
allows the driver to override the default internal lock, I'm afraid
several other places may also have the same issue.
Am I missing something here?
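
For illustration only (a minimal userspace sketch with hypothetical
names, making no claim about the actual upstream fix): the window closes
if the final lock pointer is decided before the object is published to
any reader.

#include <pthread.h>
#include <stddef.h>

struct queue {
	pthread_mutex_t internal_lock;
	pthread_mutex_t *queue_lock;	/* may point at a driver lock */
};

/* Racy ordering, mirroring steps (2)..(5) above: the object becomes
 * visible (3) while queue_lock still points at the internal lock, then
 * the pointer is switched (5). A concurrent reader can lock one mutex
 * and unlock the other. */
static void init_queue_racy(struct queue *q, pthread_mutex_t *driver_lock,
			    void (*publish)(struct queue *))
{
	pthread_mutex_init(&q->internal_lock, NULL);
	q->queue_lock = &q->internal_lock;	/* (2) */
	publish(q);				/* (3) readers may run now */
	q->queue_lock = driver_lock;		/* (5) race window */
}

/* Fixed ordering: decide the final lock before publishing, so every
 * reader locks and unlocks the same mutex. */
static void init_queue_fixed(struct queue *q, pthread_mutex_t *driver_lock,
			     void (*publish)(struct queue *))
{
	pthread_mutex_init(&q->internal_lock, NULL);
	q->queue_lock = driver_lock ? driver_lock : &q->internal_lock;
	publish(q);
}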

Thanks,
Joseph


Re: [LSF/MM TOPIC] De-clustered RAID with MD

2018-01-30 Thread NeilBrown
On Tue, Jan 30 2018, Wols Lists wrote:

> On 29/01/18 21:50, NeilBrown wrote:
>> By doing declustered parity you can sanely do raid6 on 100 drives, using
>> a logical stripe size that is much smaller than 100.
>> When recovering a single drive, the 10-groups-of-10 would put heavy load
>> on 9 other drives, while the decluster approach puts light load on 99
>> other drives.  No matter how clever md is at throttling recovery, I
>> would still rather distribute the load so that md has an easier job.
>
> Not offering to do it ... :-)
>
> But that sounds a bit like linux raid-10. Could a simple approach be to
> do something like "raid-6,11,100", ie raid-6 with 9 data chunks, two
> parity, striped across 100 drives? Okay, it's not as good as the
> decluster approach, but it would spread the stress of a rebuild across
> 20 drives, not 10. And probably be fairly easy to implement.

If you did that, I think you would be about 80% of the way to fully
declustered-parity RAID.
If you then tweak the math a bit so that one stripe was

A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 

and the next

A1 C1 A2 C2 A3 C3 A4 C4 B1 D1 B2 D2 

and then

A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 

  XX
  
When Ax are a logical stripe and Bx are the next,  then you have a
slightly better distribution.  If device XX fails then the reads needed
for the first stripe mostly come from different drives than those for
the second stripe, which are mostly different again for the 3rd stripe.

Presumably the CRUSH algorithm (which I only skim-read once about a year
ago) formalizes how to do this, and does it better.
Once you have the data handling in place for your proposal, it should be
little more than replacing a couple of calculations to get the full
solution.
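
As a toy sketch of "a couple of calculations" (assumptions: a fixed
rotation stride, not CRUSH and not an md implementation), rotating each
logical stripe's starting drive by a stride coprime to the drive count
already spreads rebuild load:

#include <stdio.h>

#define NDRIVES	12	/* physical drives */
#define WIDTH	4	/* logical stripe width (data + parity) */
#define STRIDE	5	/* coprime to NDRIVES so every start is used */

/* Physical drive holding position `pos` of logical stripe `stripe`. */
static int device_for(int stripe, int pos)
{
	return (stripe * STRIDE + pos) % NDRIVES;
}

int main(void)
{
	for (int s = 0; s < 4; s++) {
		printf("stripe %d:", s);
		for (int p = 0; p < WIDTH; p++)
			printf(" d%d", device_for(s, p));
		printf("\n");
	}
	return 0;
}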

NeilBrown


signature.asc
Description: PGP signature


Re: [PATCH] Use bio_endio instead of bio_put in error path of blk_rq_append_bio

2018-01-30 Thread Ming Lei
On Thu, Jan 25, 2018 at 9:58 PM, Jiří Paleček  wrote:
>  Avoids page leak from bounced requests
> ---
>  block/blk-map.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-map.c b/block/blk-map.c
> index d3a94719f03f..702d68166689 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -26,7 +26,8 @@ int blk_rq_append_bio(struct request *rq, struct bio **bio)
> } else {
> if (!ll_back_merge_fn(rq->q, rq, *bio)) {
> if (orig_bio != *bio) {
> -   bio_put(*bio);
> +   bio_inc_remaining(orig_bio);
> +   bio_endio(*bio);

'bio_inc_remaining(orig_bio);' shouldn't be needed, since we don't chain the
bounced bio; otherwise this patch is fine.

Thanks,
Ming Lei


[PATCH 1/2] lightnvm: remove mlc pairs structure

2018-01-30 Thread Matias Bjørling
The known implementations of the 1.2 specification, and the upcoming 2.0
implementation, all expose a sequential list of pages to write.
Remove the data structure, as it is no longer needed.

Signed-off-by: Matias Bjørling 
---
 drivers/nvme/host/lightnvm.c | 14 +-
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 50ef71ee3d86..16906327645e 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -116,17 +116,6 @@ struct nvme_nvm_command {
};
 };
 
-#define NVME_NVM_LP_MLC_PAIRS 886
-struct nvme_nvm_lp_mlc {
-   __le16  num_pairs;
-   __u8pairs[NVME_NVM_LP_MLC_PAIRS];
-};
-
-struct nvme_nvm_lp_tbl {
-   __u8id[8];
-   struct nvme_nvm_lp_mlc  mlc;
-};
-
 struct nvme_nvm_id_group {
__u8mtype;
__u8fmtype;
@@ -150,8 +139,7 @@ struct nvme_nvm_id_group {
__le32  mpos;
__le32  mccap;
__le16  cpar;
-   __u8reserved[10];
-   struct nvme_nvm_lp_tbl lptbl;
+   __u8reserved[906];
 } __packed;
 
 struct nvme_nvm_addr_format {
-- 
2.11.0



[PATCH 2/2] lightnvm: remove multiple groups in 1.2 data structure

2018-01-30 Thread Matias Bjørling
Only one id group from the 1.2 specification is supported. Make
sure that only the first group is accessible.

Signed-off-by: Matias Bjørling 
---
 drivers/nvme/host/lightnvm.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index 16906327645e..e5544806fb0e 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -167,7 +167,8 @@ struct nvme_nvm_id {
__le32  dom;
struct nvme_nvm_addr_format ppaf;
__u8resv[228];
-   struct nvme_nvm_id_group groups[4];
+   struct nvme_nvm_id_group group;
+   __u8resv2[2880];
 } __packed;
 
 struct nvme_nvm_bb_tbl {
@@ -209,7 +210,7 @@ static int init_grps(struct nvm_id *nvm_id, struct 
nvme_nvm_id *nvme_nvm_id)
if (nvme_nvm_id->cgrps != 1)
return -EINVAL;
 
-   src = &nvme_nvm_id->groups[0];
+   src = &nvme_nvm_id->group;
grp = &nvm_id->grp;
 
grp->mtype = src->mtype;
-- 
2.11.0



[PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Mike Snitzer
From: Ming Lei 

This status is returned from the driver to the block layer if a
device-related resource is unavailable, but the driver can guarantee
that IO dispatch will be triggered in the future when the resource
becomes available.

Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.  BLK_MQ_DELAY_QUEUE is
3 ms because both scsi-mq and nvmefc are using that magic value.

If a driver can make sure there is in-flight IO, it is safe to return
BLK_STS_DEV_RESOURCE because:

1) If all in-flight IOs complete before examining SCHED_RESTART in
blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
is run immediately in this case by blk_mq_dispatch_rq_list();

2) if there is any in-flight IO after/when examining SCHED_RESTART
in blk_mq_dispatch_rq_list():
- if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
- otherwise, this request will be dispatched after any in-flight IO is
  completed via blk_mq_sched_restart()

3) if SCHED_RESTART is set concurrently in context because of
BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
cases and make sure IO hang can be avoided.

One invariant is that queue will be rerun if SCHED_RESTART is set.

Suggested-by: Jens Axboe 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
Signed-off-by: Mike Snitzer 
---
V5:
- fixed stale reference to 10ms delay in blk-mq.c comment
- revised commit header to include Ming's proof of how
  blk-mq drivers will make progress (serves to document how
  scsi_queue_rq now works)
V4:
- cleanup header and code comments
- rerun queue after BLK_MQ_QUEUE_DELAY (3ms) instead of 10ms
- eliminate nvmefc's queue rerun now that blk-mq does it
V3:
- fix typo, and improvement document
- add tested-by tag
V2:
- add comments on the new introduced status
- patch style fix
- both are suggested by Christoph

 block/blk-core.c |  1 +
 block/blk-mq.c   | 20 
 drivers/block/null_blk.c |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-rq.c   |  5 ++---
 drivers/nvme/host/fc.c   | 12 ++--
 drivers/scsi/scsi_lib.c  |  6 +++---
 include/linux/blk_types.h| 17 +
 9 files changed, 44 insertions(+), 23 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cdae69be68e9..38279d4ae08b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
[BLK_STS_MEDIUM]= { -ENODATA,   "critical medium" },
[BLK_STS_PROTECTION]= { -EILSEQ,"protection" },
[BLK_STS_RESOURCE]  = { -ENOMEM,"kernel resource" },
+   [BLK_STS_DEV_RESOURCE]  = { -ENOMEM,"device resource" },
[BLK_STS_AGAIN] = { -EAGAIN,"nonblocking retry" },
 
/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 43e7449723e0..e39b4e2a63db 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
return true;
 }
 
+#define BLK_MQ_QUEUE_DELAY 3   /* ms units */
+
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 bool got_budget)
 {
@@ -1167,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
struct request *rq, *nxt;
bool no_tag = false;
int errors, queued;
+   blk_status_t ret = BLK_STS_OK;
 
if (list_empty(list))
return false;
@@ -1179,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
errors = queued = 0;
do {
struct blk_mq_queue_data bd;
-   blk_status_t ret;
 
rq = list_first_entry(list, struct request, queuelist);
if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1224,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
}
 
ret = q->mq_ops->queue_rq(hctx, &bd);
-   if (ret == BLK_STS_RESOURCE) {
+   if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
/*
 * If an I/O scheduler has been configured and we got a
 * driver tag for the next request already, free it
@@ -1255,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
 * that is where we will continue on next queue run.
 */
if (!list_empty(list)) {
+   bool needs_restart;
+
spin_lock(&hctx->lock);
list_splice_init(list, &hctx->dispatch);
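
As a hedged illustration of the new status from a driver's point of view
(the toy_* names below are hypothetical; only the BLK_STS_* values and
the blk-mq types come from this patch):

/* Return BLK_STS_DEV_RESOURCE only when in-flight IO is guaranteed to
 * rerun the queue on completion; otherwise return BLK_STS_RESOURCE and
 * let the core's delayed run (BLK_MQ_QUEUE_DELAY) break potential
 * stalls. */
static blk_status_t toy_queue_rq(struct blk_mq_hw_ctx *hctx,
				 const struct blk_mq_queue_data *bd)
{
	struct toy_dev *dev = hctx->queue->queuedata;

	if (!toy_dev_get_slot(dev))
		return dev->inflight ? BLK_STS_DEV_RESOURCE
				     : BLK_STS_RESOURCE;

	toy_dev_submit(dev, bd->rq);
	return BLK_STS_OK;
}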

Re: v4.15 and I/O hang with BFQ

2018-01-30 Thread Oleksandr Natalenko

Hi.

30.01.2018 09:19, Ming Lei wrote:

Hi,
We have known there is an I/O hang issue with BFQ over USB storage wrt.
blk-mq, and last time I tracked it down to the inside of BFQ. You can try
the debug patch in the following link[1] to see if it is the same as the
previous report[1][2]:

[1] https://marc.info/?l=linux-block&m=151214241518562&w=2
[2] https://bugzilla.kernel.org/show_bug.cgi?id=198023

If you aren't sure whether they are the same, please post the trace
somewhere, and I can check whether it is a new bug.


OK. First, I got 2 more stacktraces without tracing; then I rebooted 
with your patch and checked the tracing.


Before reboot, cfdisk:

===
[  266.630386] INFO: task cfdisk:437 blocked for more than 20 seconds.
[  266.640950]   Not tainted 4.15.0-pf1 #1
[  266.645131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  266.651789] cfdisk  D0   437410 0x
[  266.661331] Call Trace:
[  266.664517]  ? __schedule+0x35f/0x1000
[  266.668660]  ? bio_alloc_bioset+0xc7/0x1e0
[  266.672330]  ? submit_bh_wbc+0x162/0x190
[  266.678034]  schedule+0x32/0xc0
[  266.681293]  io_schedule+0x12/0x40
[  266.685230]  wait_on_page_bit+0xea/0x130
[  266.689015]  ? add_to_page_cache_lru+0xe0/0xe0
[  266.693456]  ? blkdev_writepages+0x30/0x30
[  266.695521]  do_read_cache_page+0x167/0x350
[  266.697160]  read_dev_sector+0x28/0xc0
[  266.698685]  scsi_bios_ptable+0x4e/0x120 [scsi_mod]
[  266.700156]  scsicam_bios_param+0x16/0x1a0 [scsi_mod]
[  266.701868]  sd_getgeo+0xbf/0xd0 [sd_mod]
[  266.703388]  blkdev_ioctl+0x178/0x990
[  266.704818]  block_ioctl+0x39/0x40
[  266.706381]  do_vfs_ioctl+0xa4/0x630
[  266.708584]  ? __fput+0x131/0x1e0
[  266.710184]  ? preempt_count_add+0x68/0xa0
[  266.711762]  ? _raw_spin_unlock_irq+0x1d/0x30
[  266.713304]  SyS_ioctl+0x74/0x80
[  266.714502]  ? exit_to_usermode_loop+0x39/0xa0
[  266.717352]  entry_SYSCALL_64_fastpath+0x20/0x83
[  266.718857] RIP: 0033:0x7fc482064d87
===

Blocked kworker:

===
[  389.511083] INFO: task kworker/u8:5:178 blocked for more than 20 seconds.
[  389.516454]   Not tainted 4.15.0-pf1 #1
[  389.520091] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  389.524726] kworker/u8:5D0   178  2 0x8000
[  389.528321] Workqueue: events_freezable_power_ disk_events_workfn
[  389.532228] Call Trace:
[  389.534909]  ? __schedule+0x35f/0x1000
[  389.541906]  ? blk_mq_sched_dispatch_requests+0x116/0x190
[  389.546561]  ? __sbitmap_get_word+0x2a/0x80
[  389.551167]  ? sbitmap_get_shallow+0x5c/0xa0
[  389.555633]  schedule+0x32/0xc0
[  389.559803]  io_schedule+0x12/0x40
[  389.564504]  blk_mq_get_tag+0x181/0x2a0
[  389.572541]  ? wait_woken+0x80/0x80
[  389.576008]  blk_mq_get_request+0xf9/0x410
[  389.579998]  blk_mq_alloc_request+0x7e/0xe0
[  389.584824]  blk_get_request_flags+0x40/0x160
[  389.588917]  scsi_execute+0x38/0x1e0 [scsi_mod]
[  389.593079]  scsi_test_unit_ready+0x5d/0xd0 [scsi_mod]
[  389.596966]  sd_check_events+0xf5/0x1a0 [sd_mod]
[  389.602558]  disk_check_events+0x69/0x150
[  389.604822]  process_one_work+0x1df/0x420
[  389.606584]  worker_thread+0x2b/0x3d0
[  389.608175]  ? process_one_work+0x420/0x420
[  389.609833]  kthread+0x113/0x130
[  389.611327]  ? kthread_create_on_node+0x70/0x70
[  389.612852]  ret_from_fork+0x35/0x40
===

After reboot, tracing info:

===
   systemd-udevd-269   [000] ...1 4.309198: blk_mq_do_dispatch_sched: get rq->0, 1
kworker/0:1H-174   [000]  4.309380: blk_mq_do_dispatch_sched: not get rq, 1
kworker/u8:2-167   [000]  4.309562: bfq_insert_requests: insert rq->0
kworker/u8:2-167   [000] ...1 4.309563: blk_mq_do_dispatch_sched: get rq->0, 1
kworker/0:1H-174   [000]  4.309694: blk_mq_do_dispatch_sched: not get rq, 1
kworker/u8:2-167   [000]  4.309879: bfq_insert_requests: insert rq->0
kworker/u8:2-167   [000] ...1 4.309880: blk_mq_do_dispatch_sched: get rq->0, 1
kworker/0:1H-174   [000]  4.310001: blk_mq_do_dispatch_sched: not get rq, 1
   systemd-udevd-271   [000]  4.311033: bfq_insert_requests: insert rq->0
   systemd-udevd-271   [000] ...1 4.311037: blk_mq_do_dispatch_sched: not get rq, 1
  cfdisk-408   [000] 13.484220: bfq_insert_requests: insert rq->1
kworker/0:1H-174   [000] 13.484253: blk_mq_do_dispatch_sched: not get rq, 1
===

Looks the same, right?


Or Paolo should know whether the issue is fixed in v4.15.


So, Paolo :)?

Regards,
  Oleksandr


[PATCH] lightnvm: remove chnl_offset in nvme_nvm_identity

2018-01-30 Thread Matias Bjørling
The identity structure is initialized to zero in the beginning of
the nvme_nvm_identity function. The chnl_offset is separately set to
zero. Since both the variable and assignment is never changed, remove
them.

Signed-off-by: Matias Bjørling 
---
 drivers/nvme/host/lightnvm.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c
index e5544806fb0e..dc0b1335c7c6 100644
--- a/drivers/nvme/host/lightnvm.c
+++ b/drivers/nvme/host/lightnvm.c
@@ -59,8 +59,7 @@ struct nvme_nvm_identity {
__u64   rsvd[2];
__le64  prp1;
__le64  prp2;
-   __le32  chnl_off;
-   __u32   rsvd11[5];
+   __u32   rsvd11[6];
 };
 
 struct nvme_nvm_getbbtbl {
@@ -268,7 +267,6 @@ static int nvme_nvm_identity(struct nvm_dev *nvmdev, struct 
nvm_id *nvm_id)
 
c.identity.opcode = nvme_nvm_admin_identity;
c.identity.nsid = cpu_to_le32(ns->head->ns_id);
-   c.identity.chnl_off = 0;
 
nvme_nvm_id = kmalloc(sizeof(struct nvme_nvm_id), GFP_KERNEL);
if (!nvme_nvm_id)
-- 
2.11.0



Re: v4.15 and I/O hang with BFQ

2018-01-30 Thread Ming Lei
On Tue, Jan 30, 2018 at 03:30:28PM +0100, Oleksandr Natalenko wrote:
> Hi.
> 
...
>systemd-udevd-271   [000]  4.311033: bfq_insert_requests: insert rq->0
>systemd-udevd-271   [000] ...1 4.311037: blk_mq_do_dispatch_sched: not get rq, 1
>   cfdisk-408   [000] 13.484220: bfq_insert_requests: insert rq->1
> kworker/0:1H-174   [000] 13.484253: blk_mq_do_dispatch_sched: not get rq, 1
> ===
> 
> Looks the same, right?

Yeah, same as before.

-- 
Ming


Re: [dm-devel] [LSF/MM TOPIC] block, dm: restack queue_limits

2018-01-30 Thread Hannes Reinecke
On 01/29/2018 10:08 PM, Mike Snitzer wrote:
> We currently don't restack the queue_limits if the lowest, or
> intermediate, layer of an IO stack changes.
> 
> This is particularly unfortunate in the case of FLUSH/FUA which may
> change if/when a HW controller's BBU fails; whereby requiring the device
> advertise that it has a volatile write cache (WCE=1).
> 
Uh-oh. Device rescan.
Would be a valid topic on its own...

> But in the context of DM, really it'd be best if the entire stack of
> devices had their limits restacked if any underlying layer's limits
> change.
> 
> In the past, Martin and I discussed that we should "just do it" but
> never did.  Not sure we need a lengthy discussion but figured I'd put it
> out there.
> 
> Maybe I'll find time, between now and April, to try implementing it.
> 
For SCSI, the device capabilities are pretty much set in stone after the
initial scan; there are hooks for rescanning, but they will only work
half of the time.
Plus we can't really change the device type on the fly (e.g. if the SCSI
device type changes: if it moves away from '0' we would need to unbind
the sd driver, and if it moves to '0' we'll need to rescan the sd
device). None of this is happening right now.

So I'd be glad to have a discussion around this.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH] Use bio_endio instead of bio_put in error path of blk_rq_append_bio

2018-01-30 Thread Jiri Palecek


On 1/30/18 1:53 PM, Ming Lei wrote:

On Thu, Jan 25, 2018 at 9:58 PM, Jiří Paleček  wrote:

  Avoids page leak from bounced requests
---
  block/blk-map.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index d3a94719f03f..702d68166689 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -26,7 +26,8 @@ int blk_rq_append_bio(struct request *rq, struct bio **bio)
 } else {
 if (!ll_back_merge_fn(rq->q, rq, *bio)) {
 if (orig_bio != *bio) {
-   bio_put(*bio);
+   bio_inc_remaining(orig_bio);
+   bio_endio(*bio);

'bio_inc_remaining(orig_bio);' shouldn't be needed, since we don't chain the
bounced bio; otherwise this patch is fine.


I believe it is needed, or at least desirable. The situation when the 
request is bounced is like this:

bio (bounced) .bi_private ---> orig_bio

and at the end of bounce_end_io, bio_endio(bio->bi_private) [which is 
bio_endio(orig_bio) in our case] is called. That doesn't have any effect 
on __blk_rq_map_user_iov; its bios have .bi_end_io == 0, so one call 
more or less doesn't matter. However, for other callers, like 
osd_initiator.c, this is not the case. They pass bios which have a 
bi_end_io, and might be surprised if it was called unexpectedly.


Before commit caa4b02476e3, blk_rq_append_bio wouldn't touch/delete the 
passed bio at all under any circumstances. I believe it should stay that 
way, and incrementing the remaining counter, which effectively nullifies 
the extra bio_endio call, does that pretty straightforwardly.
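
A simplified sketch of why the increment neutralizes the nested call
(assumption: this collapses the real __bi_remaining/BIO_CHAIN handling
in block/bio.c into its essence):

/* Simplified completion logic: bi_end_io only runs once the remaining
 * count drops to zero. */
static void bio_endio_simplified(struct bio *bio)
{
	if (!atomic_dec_and_test(&bio->__bi_remaining))
		return;			/* a completion reference remains */
	if (bio->bi_end_io)
		bio->bi_end_io(bio);	/* the caller-visible completion */
}

/*
 * Error path with the patch applied:
 *   bio_inc_remaining(orig_bio);   orig_bio->__bi_remaining: 1 -> 2
 *   bio_endio(bounced_bio);        frees the bounce pages, then reaches
 *                                  bio_endio(orig_bio): 2 -> 1, returns
 * so orig_bio->bi_end_io is never invoked from the error path.
 */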


Regards

    Jiri Palecek




Re: v4.15 and I/O hang with BFQ

2018-01-30 Thread Paolo Valente


> Il giorno 30 gen 2018, alle ore 15:40, Ming Lei  ha 
> scritto:
> 
> On Tue, Jan 30, 2018 at 03:30:28PM +0100, Oleksandr Natalenko wrote:
>> Hi.
>> 
> ...
>>   systemd-udevd-271   [000]  4.311033: bfq_insert_requests: insert rq->0
>>   systemd-udevd-271   [000] ...1 4.311037: blk_mq_do_dispatch_sched: not get rq, 1
>>  cfdisk-408   [000] 13.484220: bfq_insert_requests: insert rq->1
>>kworker/0:1H-174   [000] 13.484253: blk_mq_do_dispatch_sched: not get rq, 1
>> ===
>> 
>> Looks the same, right?
> 
> Yeah, same as before.
> 

Hi guys,
sorry for the delay with this fix.  We are proceeding very slowly on
this, because I'm super busy.  Anyway, now I can at least explain in
more detail the cause that leads to this hang.  Commit a6a252e64914
("blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ")
makes all non-flush re-prepared requests be re-inserted into the I/O
scheduler.  With this change, I/O schedulers may get the same request
inserted again, even several times, without a finish_request invoked
on the request before each re-insertion.

For the I/O scheduler, every such re-prepared request is equivalent
to the insertion of a new request. For schedulers like mq-deadline
or kyber this fact causes no problems. In contrast, it confuses a stateful
scheduler like BFQ, which preserves state for an I/O request until
finish_request is invoked on it. In particular, BFQ has no way
to know that the above re-insertions concern the same, already-dispatched
request. So it may get stuck waiting forever for the completion of these
re-inserted requests, thus preventing any other queue of
requests from being served.

We are trying to address this issue by adding the hook requeue_request
to the bfq interface.
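
(As a hedged sketch only — hypothetical names, not the in-progress BFQ
patch — such an elevator hook would tear down per-request scheduler
state so that a later re-insertion really looks like a fresh request:)

/* Hypothetical requeue_request hook for a stateful scheduler. */
static void toy_requeue_request(struct request *rq)
{
	struct toy_sched_queue *tq = rq->elv.priv[0];

	if (!tq)
		return;
	/* Forget that rq was dispatched, so the scheduler does not keep
	 * waiting for a completion that will instead arrive as a fresh
	 * insertion of the same request. */
	toy_forget_dispatched(tq, rq);
}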

Unfortunately, with our current implementation of requeue_request in
place, bfq eventually gets to an incoherent state.  This is apparently
caused by a requeue of an I/O request, immediately followed by a
completion of the same request.  This seems rather absurd, and drives
bfq crazy.  But this is something for which we don't have definite
results yet.

We're working on it, sorry again for the delay.

Thanks,
Paolo

> -- 
> Ming



WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-30 Thread Jens Axboe
Hi,

I just hit this on 4.15+ on the laptop, it's running Linus' git
as of yesterday, right after the block tree merge:

commit 0a4b6e2f80aad46fb55a5cf7b1664c0aef030ee0
Merge: 9697e9da8429 796baeeef85a
Author: Linus Torvalds 
Date:   Mon Jan 29 11:51:49 2018 -0800

Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block

It's hitting this case:

if (WARN_ON_ONCE(n != segments)) {  

in nvme, indicating that rq->nr_phys_segments does not equal the
number of bios in the request. Anyone seen this? I'll take a closer
look as time permits, full trace below.


 WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 
nvme_setup_cmd+0x3d3/0x430
 Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc 
snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat 
snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 
snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq 
x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb 
snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer 
videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 
crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore 
hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd 
intel_gtt
 CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U   4.15.0+ #176
 Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
 RIP: 0010:nvme_setup_cmd+0x3d3/0x430
 RSP: 0018:880423e9f838 EFLAGS: 00010217
 RAX:  RBX: 880423e9f8c8 RCX: 0001
 RDX: 88022b200010 RSI: 0002 RDI: 327f
 RBP: 880421251400 R08: 88022b20 R09: 0009
 R10:  R11:  R12: 
 R13: 88042341e280 R14:  R15: 880421251440
 FS:  () GS:88044150() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 55b684795030 CR3: 02e09006 CR4: 001606e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 Call Trace:
  nvme_queue_rq+0x40/0xa00
  ? __sbitmap_queue_get+0x24/0x90
  ? blk_mq_get_tag+0xa3/0x250
  ? wait_woken+0x80/0x80
  ? blk_mq_get_driver_tag+0x97/0xf0
  blk_mq_dispatch_rq_list+0x7b/0x4a0
  ? deadline_remove_request+0x49/0xb0
  blk_mq_do_dispatch_sched+0x4f/0xc0
  blk_mq_sched_dispatch_requests+0x106/0x170
  __blk_mq_run_hw_queue+0x53/0xa0
  __blk_mq_delay_run_hw_queue+0x83/0xa0
  blk_mq_run_hw_queue+0x6c/0xd0
  blk_mq_sched_insert_request+0x96/0x140
  __blk_mq_try_issue_directly+0x3d/0x190
  blk_mq_try_issue_directly+0x30/0x70
  blk_mq_make_request+0x1a4/0x6a0
  generic_make_request+0xfd/0x2f0
  ? submit_bio+0x5c/0x110
  submit_bio+0x5c/0x110
  ? __blkdev_issue_discard+0x152/0x200
  submit_bio_wait+0x43/0x60
  ext4_process_freed_data+0x1cd/0x440
  ? account_page_dirtied+0xe2/0x1a0
  ext4_journal_commit_callback+0x4a/0xc0
  jbd2_journal_commit_transaction+0x17e2/0x19e0
  ? kjournald2+0xb0/0x250
  kjournald2+0xb0/0x250
  ? wait_woken+0x80/0x80
  ? commit_timeout+0x10/0x10
  kthread+0x111/0x130
  ? kthread_create_worker_on_cpu+0x50/0x50
  ? do_group_exit+0x3a/0xa0
  ret_from_fork+0x1f/0x30
 Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 
8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 41 bc 
0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff 
 ---[ end trace 50d361cc444506c8 ]---
 print_req_error: I/O error, dev nvme0n1, sector 847167488

-- 
Jens Axboe



Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-30 Thread Jens Axboe
On 1/30/18 8:41 AM, Jens Axboe wrote:
> Hi,
> 
> I just hit this on 4.15+ on the laptop, it's running Linus' git
> as of yesterday, right after the block tree merge:
> 
> commit 0a4b6e2f80aad46fb55a5cf7b1664c0aef030ee0
> Merge: 9697e9da8429 796baeeef85a
> Author: Linus Torvalds 
> Date:   Mon Jan 29 11:51:49 2018 -0800
> 
> Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block
> 
> It's hitting this case:
> 
> if (WARN_ON_ONCE(n != segments)) {
>   
> 
> in nvme, indicating that rq->nr_phys_segments does not equal the
> number of bios in the request. Anyone seen this? I'll take a closer
> look as time permits, full trace below.
> 
> 
>  WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 
> nvme_setup_cmd+0x3d3/0x430
>  Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc 
> snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat 
> snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 
> snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq 
> x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb 
> snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer 
> videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 
> crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore 
> hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd 
> intel_gtt
>  CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U   4.15.0+ #176
>  Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
>  RIP: 0010:nvme_setup_cmd+0x3d3/0x430
>  RSP: 0018:880423e9f838 EFLAGS: 00010217
>  RAX:  RBX: 880423e9f8c8 RCX: 0001
>  RDX: 88022b200010 RSI: 0002 RDI: 327f
>  RBP: 880421251400 R08: 88022b20 R09: 0009
>  R10:  R11:  R12: 
>  R13: 88042341e280 R14:  R15: 880421251440
>  FS:  () GS:88044150() knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2: 55b684795030 CR3: 02e09006 CR4: 001606e0
>  DR0:  DR1:  DR2: 
>  DR3:  DR6: fffe0ff0 DR7: 0400
>  Call Trace:
>   nvme_queue_rq+0x40/0xa00
>   ? __sbitmap_queue_get+0x24/0x90
>   ? blk_mq_get_tag+0xa3/0x250
>   ? wait_woken+0x80/0x80
>   ? blk_mq_get_driver_tag+0x97/0xf0
>   blk_mq_dispatch_rq_list+0x7b/0x4a0
>   ? deadline_remove_request+0x49/0xb0
>   blk_mq_do_dispatch_sched+0x4f/0xc0
>   blk_mq_sched_dispatch_requests+0x106/0x170
>   __blk_mq_run_hw_queue+0x53/0xa0
>   __blk_mq_delay_run_hw_queue+0x83/0xa0
>   blk_mq_run_hw_queue+0x6c/0xd0
>   blk_mq_sched_insert_request+0x96/0x140
>   __blk_mq_try_issue_directly+0x3d/0x190
>   blk_mq_try_issue_directly+0x30/0x70
>   blk_mq_make_request+0x1a4/0x6a0
>   generic_make_request+0xfd/0x2f0
>   ? submit_bio+0x5c/0x110
>   submit_bio+0x5c/0x110
>   ? __blkdev_issue_discard+0x152/0x200
>   submit_bio_wait+0x43/0x60
>   ext4_process_freed_data+0x1cd/0x440
>   ? account_page_dirtied+0xe2/0x1a0
>   ext4_journal_commit_callback+0x4a/0xc0
>   jbd2_journal_commit_transaction+0x17e2/0x19e0
>   ? kjournald2+0xb0/0x250
>   kjournald2+0xb0/0x250
>   ? wait_woken+0x80/0x80
>   ? commit_timeout+0x10/0x10
>   kthread+0x111/0x130
>   ? kthread_create_worker_on_cpu+0x50/0x50
>   ? do_group_exit+0x3a/0xa0
>   ret_from_fork+0x1f/0x30
>  Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 
> 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 
> 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff 
>  ---[ end trace 50d361cc444506c8 ]---
>  print_req_error: I/O error, dev nvme0n1, sector 847167488

Looking at the disassembly, 'n' is 2 and 'segments' is 0x.

-- 
Jens Axboe



Re: [PATCH v6 1/2] Return bytes transferred for partial direct I/O

2018-01-30 Thread Goldwyn Rodrigues


On 01/29/2018 01:04 PM, Randy Dunlap wrote:
> On 01/29/2018 06:57 AM, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues 
>>
> 
>> diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt
>> index 6c00c1e2743f..72e213d62511 100644
>> --- a/Documentation/sysctl/fs.txt
>> +++ b/Documentation/sysctl/fs.txt
>> @@ -76,6 +77,19 @@ dcache isn't pruned yet.
>>  
>>  ==
>>  
>> +dio_short_writes:
>> +
>> +In case Direct I/O encounters an transient error, it returns
> 
>  a transient
> 
>> +the errorcode, even if it has performed part of the write.
> 
>error code,
> 
>> +This flag, if on (default), will return the number of bytes written
>> +so far, as the write(2) symantics are. However, some older applications
> 
>semantics
> 
>> +still consider a direct write as an error if all of the I/O
>> +submitted is not complete. ie write(file, count, buf) != count.
> 
>   I.e.
> 
>> +This option can be disabled on systems in order to support
>> +existing applications which do not expect short writes.

Thanks for the comments. I will incorporate the language changes.

> 
> and if my system has a mix of older applications and new ones,
> will they all work just fine?
> 

Newer applications should treat the error as if nothing was written.
But yes, I tried doing it through prctl for individual processes, but
did not find a free bit to stick it in.
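
(As a hedged aside, a minimal sketch of what the short-write semantics
ask of applications — plain POSIX I/O, nothing specific to this patch:)

#include <errno.h>
#include <unistd.h>

/* Resubmit the remainder after a short write; only a failure with no
 * forward progress is treated as a hard error. */
ssize_t write_all(int fd, const char *buf, size_t count)
{
	size_t done = 0;

	while (done < count) {
		ssize_t ret = write(fd, buf + done, count - done);

		if (ret < 0) {
			if (errno == EINTR)
				continue;
			return done ? (ssize_t)done : -1;
		}
		if (ret == 0)
			break;
		done += ret;
	}
	return done;
}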

-- 
Goldwyn


Re: [LSF/MM TOPIC] De-clustered RAID with MD

2018-01-30 Thread Wol's lists

On 30/01/18 11:24, NeilBrown wrote:

On Tue, Jan 30 2018, Wols Lists wrote:


On 29/01/18 21:50, NeilBrown wrote:

By doing declustered parity you can sanely do raid6 on 100 drives, using
a logical stripe size that is much smaller than 100.
When recovering a single drive, the 10-groups-of-10 would put heavy load
on 9 other drives, while the decluster approach puts light load on 99
other drives.  No matter how clever md is at throttling recovery, I
would still rather distribute the load so that md has an easier job.


Not offering to do it ... :-)

But that sounds a bit like linux raid-10. Could a simple approach be to
do something like "raid-6,11,100", ie raid-6 with 9 data chunks, two
parity, striped across 100 drives? Okay, it's not as good as the
decluster approach, but it would spread the stress of a rebuild across
20 drives, not 10. And probably be fairly easy to implement.


If you did that, I think you would be about 80% of the way to fully
declustered-parity RAID.
If you then tweak the math a bit so that one stripe was

A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 

and the next

A1 C1 A2 C2 A3 C3 A4 C4 B1 D1 B2 D2 

and then

A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 

   XX
   
When Ax are a logical stripe and Bx are the next,  then you have a

slightly better distribution.  If device XX fails then the reads needed
for the first stripe mostly come from different drives than those for
the second stripe, which are mostly different again for the 3rd stripe.

Presumably the CRUSH algorithm (which I only skim-read once about a year
ago) formalizes how to do this, and does it better.
Once you have the data handling in place for your proposal, it should be
little more than replacing a couple of calculations to get the full
solution.

Okay. I think I have it - a definition for raid-16 (or is it raid-61?). 
But I need a bit of help with the maths. And it might need a look-up 
table :-(


Okay. Like raid-10, raid-16 would be spec'd as "--level 16,3,8", ie 3 
mirrors, emulating an 8-drive raid-6.


What I'm looking for is a PRNG that has the "bug" that it repeats over a 
short period, and over that period is guaranteed to repeat each number 
the same number of times. I saw a wonderful video demonstration of this 
years ago - if you plot the generated number against the number of times 
it was generated, after a few rows it "filled up" a rectangle on the graph.


At which point the maths becomes very simple. I just need at least as 
many real drives as "mirrors times emulated".


If somebody can come up with the algorithm I want, I can spec it, and 
then maybe someone can implement it? It'll be fun testing - I'll need my 
new machine when I get it working :-)
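
(One hedged sketch of such a generator — an assumption, not a vetted
design: shuffle an array holding each drive index the same number of
times with a seeded Fisher-Yates pass; the sequence then repeats with
exactly equal counts per period, which is the "bug" asked for above.)

#include <stdint.h>
#include <stdio.h>

#define DRIVES	6	/* real drives */
#define REPEATS	4	/* appearances of each drive per period */
#define PERIOD	(DRIVES * REPEATS)

static uint32_t rnd(uint32_t *s)	/* xorshift32 */
{
	*s ^= *s << 13; *s ^= *s >> 17; *s ^= *s << 5;
	return *s;
}

/* One period: each drive index appears exactly REPEATS times, in a
 * pseudo-random but deterministic (seeded) order. */
static void build_period(unsigned seq[PERIOD], uint32_t seed)
{
	uint32_t s = seed ? seed : 1;

	for (int i = 0; i < PERIOD; i++)
		seq[i] = i % DRIVES;
	for (int i = PERIOD - 1; i > 0; i--) {	/* Fisher-Yates */
		int j = rnd(&s) % (i + 1);
		unsigned t = seq[i]; seq[i] = seq[j]; seq[j] = t;
	}
}

int main(void)
{
	unsigned seq[PERIOD];

	build_period(seq, 42);
	for (int i = 0; i < PERIOD; i++)
		printf("%u ", seq[i]);
	printf("\n");
	return 0;
}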


Cheers,
Wol


Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Bart Van Assche

On 01/30/18 06:24, Mike Snitzer wrote:

+*
+* If driver returns BLK_STS_RESOURCE and SCHED_RESTART
+* bit is set, run queue after a delay to avoid IO stalls
+* that could otherwise occur if the queue is idle.
 */
-   if (!blk_mq_sched_needs_restart(hctx) ||
+   needs_restart = blk_mq_sched_needs_restart(hctx);
+   if (!needs_restart ||
(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
blk_mq_run_hw_queue(hctx, true);
+   else if (needs_restart && (ret == BLK_STS_RESOURCE))
+   blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
}


If a request completes concurrently with execution of the above code 
then the request completion will trigger a call of 
blk_mq_sched_restart_hctx() and that call will clear the 
BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above code 
tests it then the above code will schedule an asynchronous call of 
__blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the new 
queue run returns again BLK_STS_RESOURCE then the above code will be 
executed again. In other words, a loop occurs. That loop will repeat as 
long as the described race occurs. The current (kernel v4.15) block 
layer behavior is simpler: only block drivers call 
blk_mq_delay_run_hw_queue() and the block layer core never calls that 
function. Hence that loop cannot occur with the v4.15 block layer core 
and block drivers. A motivation of why that loop is preferred compared 
to the current behavior (no loop) is missing. Does this mean that that 
loop is a needless complication of the block layer core?


Sorry but I still prefer the v4.15 block layer approach because this 
patch has in my view the following disadvantages:

- It involves a blk-mq API change. API changes are always risky and need
  some time to stabilize.
- The delay after which to rerun the queue is moved from block layer
  drivers into the block layer core. I think that's wrong because only
  the block driver authors can make a good choice for this constant.
- This patch makes block drivers harder to understand. Anyone who sees
  return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first
  time will have to look up the meaning of these constants. An explicit
  blk_mq_delay_run_hw_queue() call is easier to understand.
- This patch makes the blk-mq core harder to understand because of the
  loop mentioned above.
- This patch does not fix any bugs nor makes block drivers easier to
  read or to implement. So why is this patch considered useful?

Thanks,

Bart.


Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Laurence Oberman
On Tue, 2018-01-30 at 09:52 -0800, Bart Van Assche wrote:
> On 01/30/18 06:24, Mike Snitzer wrote:
> > +    *
> > +    * If driver returns BLK_STS_RESOURCE and
> > SCHED_RESTART
> > +    * bit is set, run queue after a delay to avoid IO
> > stalls
> > +    * that could otherwise occur if the queue is
> > idle.
> >      */
> > -   if (!blk_mq_sched_needs_restart(hctx) ||
> > +   needs_restart = blk_mq_sched_needs_restart(hctx);
> > +   if (!needs_restart ||
> >     (no_tag && list_empty_careful(&hctx-
> > >dispatch_wait.entry)))
> >     blk_mq_run_hw_queue(hctx, true);
> > +   else if (needs_restart && (ret ==
> > BLK_STS_RESOURCE))
> > +   blk_mq_delay_run_hw_queue(hctx,
> > BLK_MQ_QUEUE_DELAY);
> >     }
> 
> If a request completes concurrently with execution of the above code 
> then the request completion will trigger a call of 
> blk_mq_sched_restart_hctx() and that call will clear the 
> BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above
> code 
> tests it then the above code will schedule an asynchronous call of 
> __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the
> new 
> queue run returns again BLK_STS_RESOURCE then the above code will be 
> executed again. In other words, a loop occurs. That loop will repeat
> as 
> long as the described race occurs. The current (kernel v4.15) block 
> layer behavior is simpler: only block drivers call 
> blk_mq_delay_run_hw_queue() and the block layer core never calls
> that 
> function. Hence that loop cannot occur with the v4.15 block layer
> core 
> and block drivers. A motivation of why that loop is preferred
> compared 
> to the current behavior (no loop) is missing. Does this mean that
> that 
> loop is a needless complication of the block layer core?
> 
> Sorry but I still prefer the v4.15 block layer approach because this 
> patch has in my view the following disadvantages:
> - It involves a blk-mq API change. API changes are always risky and
> need
>    some time to stabilize.
> - The delay after which to rerun the queue is moved from block layer
>    drivers into the block layer core. I think that's wrong because
> only
>    the block driver authors can make a good choice for this constant.
> - This patch makes block drivers harder to understand. Anyone who
> sees
>    return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the
> first
>    time will have to look up the meaning of these constants. An
> explicit
>    blk_mq_delay_run_hw_queue() call is easier to understand.
> - This patch makes the blk-mq core harder to understand because of
> the
>    loop mentioned above.
> - This patch does not fix any bugs nor makes block drivers easier to
>    read or to implement. So why is this patch considered useful?
> 
> Thanks,
> 
> Bart.

Hello Bart

What is the performance implication of your method versus this patch
above?
Is there more of a delay in your approach or in Ming's approach, from
your own testing?
It seems we will never get consensus on this, so is it time to
take a vote? 

I respect and trust your inputs, you know that, but are you perhaps
prepared to accept the approach above and agree to it, and if it turns
out to expose more issues, address them later?

Is that not a better way to progress this? To me it looks like
you and Ming will continue to disagree on which is the better approach.

With much respect
Laurence


Re: [dm-devel] [LSF/MM TOPIC] block, dm: restack queue_limits

2018-01-30 Thread Ewan D. Milne
On Tue, 2018-01-30 at 16:07 +0100, Hannes Reinecke wrote:
> On 01/29/2018 10:08 PM, Mike Snitzer wrote:
> > We currently don't restack the queue_limits if the lowest, or
> > intermediate, layer of an IO stack changes.
> > 
> > This is particularly unfortunate in the case of FLUSH/FUA which may
> > change if/when a HW controller's BBU fails; thereby requiring the device
> > to advertise that it has a volatile write cache (WCE=1).
> > 
> Uh-oh. Device rescan.
> Would be a valid topic on its own...
> 
> > But in the context of DM, really it'd be best if the entire stack of
> > devices had their limits restacked if any underlying layer's limits
> > change.
> > 
> > In the past, Martin and I discussed that we should "just do it" but
> > never did.  Not sure we need a lengthy discussion but figured I'd put it
> > out there.
> > 
> > Maybe I'll find time, between now and April, to try implementing it.
> > 
> For SCSI the device capabilities are pretty much set in stone after the
> initial scan; there are hooks for rescanning, but they will only work
> half of the time.
> Plus we can't really change the device type on the fly (eg if the SCSI
> device type changes; if it moves away from '0' we would need to unbind
> the sd driver, and if it moves to '0' we'll need to rescan the sd
> device. None of this is happening right now.)
> 
> So I'd be glad to have a discussion around this.

At least one array vendor has also desired the ability to change various
attributes of the device after the initial scan, such as the model name.
Not sure what would break if we did this, since who knows what userspace
software might be caching this info, but...

-Ewan




Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Mike Snitzer
On Tue, Jan 30 2018 at 12:52pm -0500,
Bart Van Assche  wrote:

> On 01/30/18 06:24, Mike Snitzer wrote:
> >+ *
> >+ * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
> >+ * bit is set, run queue after a delay to avoid IO stalls
> >+ * that could otherwise occur if the queue is idle.
> >  */
> >-if (!blk_mq_sched_needs_restart(hctx) ||
> >+needs_restart = blk_mq_sched_needs_restart(hctx);
> >+if (!needs_restart ||
> > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> > blk_mq_run_hw_queue(hctx, true);
> >+else if (needs_restart && (ret == BLK_STS_RESOURCE))
> >+blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
> > }
> 
> If a request completes concurrently with execution of the above code
> then the request completion will trigger a call of
> blk_mq_sched_restart_hctx() and that call will clear the
> BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above
> code tests it then the above code will schedule an asynchronous call
> of __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the
> new queue run returns again BLK_STS_RESOURCE then the above code
> will be executed again. In other words, a loop occurs. That loop
> will repeat as long as the described race occurs. The current
> (kernel v4.15) block layer behavior is simpler: only block drivers
> call blk_mq_delay_run_hw_queue() and the block layer core never
> calls that function. Hence that loop cannot occur with the v4.15
> block layer core and block drivers. A motivation of why that loop is
> preferred compared to the current behavior (no loop) is missing.
> Does this mean that that loop is a needless complication of the
> block layer core?

No, it means the loop is an internal blk-mq concern, and that drivers
needn't worry about kicking the queue.  And it affords blk-mq latitude
to change how it responds to BLK_STS_RESOURCE in the future (without
needing to change every driver).

But even v4.15 has a similar loop.  It just happens to extend into the
.queue_rq() where the driver is completely blind to SCHED_RESTART.  And
the driver may just repeatedly kick the queue after a delay (via
blk_mq_delay_run_hw_queue).
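
Roughly, that v4.15-era driver-side pattern looks like this sketch (the
foo_* names are hypothetical; the 3 ms mirrors the scsi-mq/nvmefc constant
discussed in this thread):

===
#include <linux/blk-mq.h>

struct foo_dev;					/* hypothetical driver state */
extern bool foo_resources_available(struct foo_dev *foo);
extern void foo_submit(struct foo_dev *foo, struct request *rq);

static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
				 const struct blk_mq_queue_data *bd)
{
	struct foo_dev *foo = hctx->queue->queuedata;

	if (!foo_resources_available(foo)) {
		/* the driver is blind to SCHED_RESTART here; it just
		 * kicks its own queue again after a fixed delay */
		blk_mq_delay_run_hw_queue(hctx, 3 /* ms */);
		return BLK_STS_RESOURCE;
	}

	foo_submit(foo, bd->rq);
	return BLK_STS_OK;
}
===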

This cycle should be a very rare occurrence regardless of which approach
is taken (V5 vs 4.15).  The fact that you engineered your SRP initiator
and target code to pathologically trigger this worst case (via
target_can_queue) doesn't mean it is the fast path for a properly
configured and functioning storage network.

> Sorry but I still prefer the v4.15 block layer approach because this
> patch has in my view the following disadvantages:
> - It involves a blk-mq API change. API changes are always risky and need
>   some time to stabilize.

The number of blk-mq API changes that have occurred since blk-mq was
introduced is a very long list.  Seems contrived to make this the straw
that breaks the camel's back.

> - The delay after which to rerun the queue is moved from block layer
>   drivers into the block layer core. I think that's wrong because only
>   the block driver authors can make a good choice for this constant.

Unsubstantiated.  3ms (scsi-mq, nvmefc) vs 100ms (dm-mq mpath): which is
better?  Pretty sure if the underlying storage network is 1) performant
and 2) properly configured, then a shorter delay is preferable.

> - This patch makes block drivers harder to understand. Anyone who sees
>   return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first
>   time will have to look up the meaning of these constants. An explicit
>   blk_mq_delay_run_hw_queue() call is easier to understand.

No, it doesn't make blk-mq harder to understand.  But even if it did, it
actually acknowledges that there is an existing blk-mq SCHED_RESTART
heuristic for how to deal with the need for blk-mq to back off in the
face of BLK_STS_RESOURCE returns.  By just having each blk-mq driver
blindly kick the queue you're actively ignoring, and defeating, that
entire design element of blk-mq (SCHED_RESTART).

It is an instance where blk-mq can and does know better.  Acknowledging
that fact is moving blk-mq in a better direction.

> - This patch makes the blk-mq core harder to understand because of the
>   loop mentioned above.

You've said your piece.  But you've taken on this campaign to undermine
this line of development with such passion that we're now in a place
where Jens is shell-shocked by all the repeat campaigning and noise.

Bart, you keep saying the same things over and over.  Yet you cannot show
the patch to actively be a problem with testing-based detail.

Seems you'd rather refuse to even test it than open yourself up to the
possibility that this concern of yours has been making a mountain out of
a molehill.

> - This patch does not fix any bugs nor makes block drivers easier to
>   read or to implement. So why is this patch considered useful?

Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Bart Van Assche
On Tue, 2018-01-30 at 14:33 -0500, Mike Snitzer wrote:
> On Tue, Jan 30 2018 at 12:52pm -0500,
> Bart Van Assche  wrote:
> 
> > - This patch does not fix any bugs nor makes block drivers easier to
> >   read or to implement. So why is this patch considered useful?
> 
> It enables blk-mq core to own the problem that individual drivers should
> have no business needing to worry about.  Period.

Thanks for having confirmed that this patch is an API change only and does
not fix any bugs.

Bart.

Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Mike Snitzer
On Tue, Jan 30 2018 at  2:42pm -0500,
Bart Van Assche  wrote:

> On Tue, 2018-01-30 at 14:33 -0500, Mike Snitzer wrote:
> > On Tue, Jan 30 2018 at 12:52pm -0500,
> > Bart Van Assche  wrote:
> > 
> > > - This patch does not fix any bugs nor makes block drivers easier to
> > >   read or to implement. So why is this patch considered useful?
> > 
> > It enables blk-mq core to own the problem that individual drivers should
> > have no business needing to worry about.  Period.
> 
> Thanks for having confirmed that this patch is an API change only and does
> not fix any bugs.

No, it is an API change that enables blk-mq drivers to make forward
progress without compromising the potential benefits associated with
blk-mq's SCHED_RESTART functionality.

(If we're going to beat this horse to death it might as well be with
precision)


Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-30 Thread Keith Busch
On Tue, Jan 30, 2018 at 08:57:49AM -0700, Jens Axboe wrote:
> 
> Looking at the disassembly, 'n' is 2 and 'segments' is 0x.

Is this still a problem if you don't use an IO scheduler? With deadline,
I'm not finding any path to bio_attempt_discard_merge, which is where
nr_phys_segments is supposed to get set to 2. Not sure how it could
become 0x, though.


Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-30 Thread Jens Axboe
On 1/30/18 1:30 PM, Keith Busch wrote:
> On Tue, Jan 30, 2018 at 08:57:49AM -0700, Jens Axboe wrote:
>>
>> Looking at the disassembly, 'n' is 2 and 'segments' is 0x.
> 
> Is this still a problem if you don't use an IO scheduler? With deadline,
> I'm not finding any path to bio_attempt_discard_merge, which is where
> nr_phys_segments is supposed to get set to 2. Not sure how it could
> become 0x, though.

blk_mq_make_request() -> blk_mq_sched_bio_merge() -> __blk_mq_sched_bio_merge()
-> blk_mq_attempt_merge() -> bio_attempt_discard_merge()

Doesn't matter what IO scheduler you use.

I don't know if it triggers without a scheduler. I've been running this code
continually on the laptop (always do), and haven't seen it before today.


-- 
Jens Axboe



Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-30 Thread Keith Busch
On Tue, Jan 30, 2018 at 01:32:25PM -0700, Jens Axboe wrote:
> On 1/30/18 1:30 PM, Keith Busch wrote:
> > On Tue, Jan 30, 2018 at 08:57:49AM -0700, Jens Axboe wrote:
> >>
> >> Looking at the disassembly, 'n' is 2 and 'segments' is 0x.
> > 
> > Is this still a problem if you don't use an IO scheduler? With deadline,
> > I'm not finding any path to bio_attempt_discard_merge, which is where
> > nr_phys_segments is supposed to get set to 2. Not sure how it could
> > become 0x, though.
> 
> blk_mq_make_request() -> blk_mq_sched_bio_merge() -> 
> __blk_mq_sched_bio_merge()
>   -> blk_mq_attempt_merge() -> bio_attempt_discard_merge()

Those are the calls only if you don't have an elevator_queue, right? With
deadline, it looks like it goes through this path (ftrace confirms):

  __blk_mq_sched_bio_merge() -> dd_bio_merge() -> blk_mq_sched_try_merge()

Which doesn't have a case for ELEVATOR_DISCARD_MERGE.

Relevant function_graph:

  46)   |blk_mq_make_request() {
  46)   0.133 us|  blk_queue_bounce();
  46)   0.370 us|  blk_queue_split();
  46)   0.314 us|  bio_integrity_prep();
  46)   0.081 us|  blk_attempt_plug_merge();
  46)   |  __blk_mq_sched_bio_merge() {
  46)   |dd_bio_merge() {
  46)   0.792 us|  _raw_spin_lock();
  46)   |  blk_mq_sched_try_merge() {


Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-30 Thread Jens Axboe
On 1/30/18 1:49 PM, Keith Busch wrote:
> On Tue, Jan 30, 2018 at 01:32:25PM -0700, Jens Axboe wrote:
>> On 1/30/18 1:30 PM, Keith Busch wrote:
>>> On Tue, Jan 30, 2018 at 08:57:49AM -0700, Jens Axboe wrote:

 Looking at the disassembly, 'n' is 2 and 'segments' is 0x.
>>>
>>> Is this still a problem if you don't use an IO scheduler? With deadline,
>>> I'm not finding any path to bio_attempt_discard_merge, which is where
>>> nr_phys_segments is supposed to get set to 2. Not sure how it could
>>> become 0x, though.
>>
>> blk_mq_make_request() -> blk_mq_sched_bio_merge() -> 
>> __blk_mq_sched_bio_merge()
>>  -> blk_mq_attempt_merge() -> bio_attempt_discard_merge()
> 
> Those are the calls only if you don't have an elevator_queue, right? With
> deadline, it looks like it goes through this path (ftrace confirms):
> 
>   __blk_mq_sched_bio_merge() -> dd_bio_merge() -> blk_mq_sched_try_merge()
> 
> Which doesn't have a case for ELEVATOR_DISCARD_MERGE.
> 
> Relevant function_graph:
> 
>   46)   |blk_mq_make_request() {
>   46)   0.133 us|  blk_queue_bounce();
>   46)   0.370 us|  blk_queue_split();
>   46)   0.314 us|  bio_integrity_prep();
>   46)   0.081 us|  blk_attempt_plug_merge();
>   46)   |  __blk_mq_sched_bio_merge() {
>   46)   |dd_bio_merge() {
>   46)   0.792 us|  _raw_spin_lock();
>   46)   |  blk_mq_sched_try_merge() {

Yeah I guess you are right, it can't happen for mq-deadline since the
generic sched path doesn't support it.

I'll see if I can make it happen again, but I'm not too hopeful. And if
I can't, it's hard to compare with "none" to see if it makes a difference
or not.

-- 
Jens Axboe



Re: [RFC] hard LOCKUP caused by race between blk_init_queue_node and blkcg_print_blkgs

2018-01-30 Thread Bart Van Assche
On Tue, 2018-01-30 at 19:21 +0800, Joseph Qi wrote:
> Hi Jens and Folks,
> 
> Recently we've gotten a hard LOCKUP issue. After investigating the issue
> we've found a race between blk_init_queue_node and blkcg_print_blkgs.
> The race is described below.
> 
> blk_init_queue_node                  blkcg_print_blkgs
>   blk_alloc_queue_node (1)
>     q->queue_lock = &q->__queue_lock (2)
>     blkcg_init_queue(q) (3)
>                                      spin_lock_irq(blkg->q->queue_lock) (4)
>   q->queue_lock = lock (5)
>                                      spin_unlock_irq(blkg->q->queue_lock) (6)
> 
> (1) allocate an uninitialized queue;
> (2) initialize queue_lock to its default internal lock;
> (3) initialize blkcg part of request queue, which will create blkg and
> then insert it to blkg_list;
> (4) traverse blkg_list and find the created blkg, and then take its
> queue lock, here it is the default *internal lock*;
> (5) *race window*, now queue_lock is overridden with *driver specified
> lock*;
> (6) now unlock *driver specified lock*, not the locked *internal lock*,
> unlock balance breaks.
> 
> For the issue above, I think blkcg_init_queue is called a bit too early. We
> can't allow further use before the request queue is fully initialized.
> Since blk_init_queue_node is a really common path and it allows driver
> to override the default internal lock, I'm afraid several other places
> may also have the same issue.
> Am I missing something here?

What code triggered the blkcg_print_blkgs() call? Was it perhaps a function
related to throttling? If so, I think there are two possible ways to fix this
race:
1. Moving the .queue_lock initialization from blk_init_queue_node() into
   blk_alloc_queue_node() such that it happens before the blkcg_init_queue()
   call. This however will require a tree-wide change of the blk_alloc_queue()
   and all blk_alloc_queue_node() calls in all block drivers.
2. Splitting the blk_throtl_init() function such that the 
blkcg_activate_policy()
   call can occur after the .queue_lock pointer has been initialized, e.g. from
   inside blk_init_allocated_queue() or from inside blk_register_queue() instead
   of from inside blkcg_init_queue().

I'm not sure what the best approach is. Maybe there is even a third approach
that is better than the two approaches mentioned above.

Bart.
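
For what it's worth, steps (4)-(6) of Joseph's report flatten into a few
lines of user-space C (illustrative only; in the kernel the lock
acquisition and the pointer swap race on different CPUs rather than
running sequentially):

===
#include <pthread.h>

static pthread_mutex_t internal_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t driver_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t *queue_lock = &internal_lock;	/* step (2) */

int main(void)
{
	pthread_mutex_lock(queue_lock);		/* step (4): takes internal_lock */
	queue_lock = &driver_lock;		/* step (5): the race window */
	pthread_mutex_unlock(queue_lock);	/* step (6): "unlocks" driver_lock,
						 * which was never taken, while
						 * internal_lock stays held */
	return 0;
}
===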

Re: BUG: unable to handle kernel NULL pointer dereference in blk_throtl_update_limit_valid

2018-01-30 Thread Eric Biggers
On Tue, Dec 19, 2017 at 06:42:00AM -0800, syzbot wrote:
> Hello,
> 
> syzkaller hit the following crash on
> 6084b576dca2e898f5c101baef151f7bfdbb606d
> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> 
> Unfortunately, I don't have any reproducer for this bug yet.
> 
> 
> netlink: 14 bytes leftover after parsing attributes in process
> `syz-executor6'.
> BUG: unable to handle kernel NULL pointer dereference at 0098
> IP: blk_throtl_update_limit_valid.isra.19+0x6a/0x2e0
> block/blk-throttle.c:580
> PGD 213563067 P4D 213563067 PUD 214f2a067 PMD 0
> Oops:  [#1] SMP
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 3445 Comm: kworker/1:3 Not tainted 4.15.0-rc3-next-20171214+ #67
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Workqueue: events __blk_release_queue
> RIP: 0010:blk_throtl_update_limit_valid.isra.19+0x6a/0x2e0
> block/blk-throttle.c:580
> RSP: 0018:c90002e13d30 EFLAGS: 00010093
> RAX: 8801e125c600 RBX: 8801df611c00 RCX: 81714e8a
> RDX:  RSI: d36b0597 RDI: 0086
> RBP: c90002e13d68 R08: 0001 R09: 000a
> R10: c90002e13cb8 R11: 000a R12: 
> R13: 8801dfdcd808 R14: 817178f0 R15: 0098
> FS:  () GS:88021fd0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0098 CR3: 000213784000 CR4: 001426e0
> DR0: 2000 DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0600
> Call Trace:
>  throtl_pd_offline+0x4e/0x80 block/blk-throttle.c:602
>  blkg_destroy+0x93/0x270 block/blk-cgroup.c:327
>  blkg_destroy_all+0x69/0xe0 block/blk-cgroup.c:372
>  blkcg_exit_queue+0x21/0x40 block/blk-cgroup.c:1202
>  __blk_release_queue+0x54/0x180 block/blk-sysfs.c:802
>  process_one_work+0x288/0x7a0 kernel/workqueue.c:2112
>  worker_thread+0x43/0x4d0 kernel/workqueue.c:2246
>  kthread+0x149/0x170 kernel/kthread.c:238
>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:524
> Code: c7 40 07 08 83 e8 c7 15 b1 ff e8 52 85 b3 ff 85 c0 59 74 12 e8 c8 54
> ba ff 80 3d aa f0 a9 01 00 0f 84 fd 01 00 00 e8 b6 54 ba ff <49> 8b 07 31 ff
> 48 8b 80 c8 06 00 00 48 8b 70 28 e8 e1 d9 b7 ff
> RIP: blk_throtl_update_limit_valid.isra.19+0x6a/0x2e0
> block/blk-throttle.c:580 RSP: c90002e13d30
> CR2: 0098
> ---[ end trace b30dc449e3987ceb ]---
> Kernel panic - not syncing: Fatal exception
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..

Invalidating this bug since it hasn't been seen again, and it was reported while
KASAN was accidentally disabled in the syzbot kconfig due to a change to the
kconfig menus in linux-next (so this crash was possibly caused by slab
corruption elsewhere).

#syz invalid


Re: [RFC] hard LOCKUP caused by race between blk_init_queue_node and blkcg_print_blkgs

2018-01-30 Thread Joseph Qi
Hi Bart,
Thanks very much for the quick response.

On 18/1/31 05:19, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 19:21 +0800, Joseph Qi wrote:
>> Hi Jens and Folks,
>>
>> Recently we've gotten a hard LOCKUP issue. After investigating the issue
>> we've found a race between blk_init_queue_node and blkcg_print_blkgs.
>> The race is described below.
>>
>> blk_init_queue_node                  blkcg_print_blkgs
>>   blk_alloc_queue_node (1)
>>     q->queue_lock = &q->__queue_lock (2)
>>     blkcg_init_queue(q) (3)
>>                                      spin_lock_irq(blkg->q->queue_lock) (4)
>>   q->queue_lock = lock (5)
>>                                      spin_unlock_irq(blkg->q->queue_lock) (6)
>>
>> (1) allocate an uninitialized queue;
>> (2) initialize queue_lock to its default internal lock;
>> (3) initialize blkcg part of request queue, which will create blkg and
>> then insert it to blkg_list;
>> (4) traverse blkg_list and find the created blkg, and then take its
>> queue lock, here it is the default *internal lock*;
>> (5) *race window*, now queue_lock is overridden with *driver specified
>> lock*;
>> (6) now unlock *driver specified lock*, not the locked *internal lock*,
>> unlock balance breaks.
>>
>> For the issue above, I think blkcg_init_queue is called a bit too early. We
>> can't allow further use before the request queue is fully initialized.
>> Since blk_init_queue_node is a really common path and it allows driver
>> to override the default internal lock, I'm afraid several other places
>> may also have the same issue.
>> Am I missing something here?
> 
> What code triggered the blkcg_print_blkgs() call? Was it perhaps a function
> related to throttling? If so, I think there are two possible ways to fix this
> race:
Yes, it is from block throttle.

> 1. Moving the .queue_lock initialization from blk_init_queue_node() into
>blk_alloc_queue_node() such that it happens before the blkcg_init_queue()
>call. This however will require a tree-wide change of the blk_alloc_queue()
>and all blk_alloc_queue_node() calls in all block drivers.
> 2. Splitting the blk_throtl_init() function such that the 
> blkcg_activate_policy()
>call can occur after the .queue_lock pointer has been initialized, e.g. 
> from
>inside blk_init_allocated_queue() or from inside blk_register_queue() 
> instead
>of from inside blkcg_init_queue().
> 
At first glance, I'm thinking of moving blkcg_init_queue to after
the queue_lock is overridden. But I don't yet have an idea of how to
avoid the risk during cleanup.
Since the initialization of the request queue is so fundamental, I'm not
sure whether several other places have the same risk.

Thanks,
Joseph

> I'm not sure what the best approach is. Maybe there is even a third approach
> that is better than the two approaches mentioned above.
> 
> Bart.
> 


Re: [PATCH] lightnvm: remove chnl_offset in nvme_nvm_identity

2018-01-30 Thread Javier González
> On 30 Jan 2018, at 22.30, Matias Bjørling  wrote:
> 
> The identity structure is initialized to zero at the beginning of
> the nvme_nvm_identity function. The chnl_offset is separately set to
> zero. Since neither the variable nor the assignment ever changes, remove
> them.
> 
> Signed-off-by: Matias Bjørling 
> ---
> drivers/nvme/host/lightnvm.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
> 

Looks good.

Reviewed-by: Javier González 





Re: [PATCH 1/2] lightnvm: remove mlc pairs structure

2018-01-30 Thread Javier González
> On 30 Jan 2018, at 21.26, Matias Bjørling  wrote:
> 
> The known implementations of the 1.2 specification, and the upcoming 2.0
> implementation, all expose a sequential list of pages to write.
> Remove the data structure, as it is no longer needed.
> 
> Signed-off-by: Matias Bjørling 
> ---
> drivers/nvme/host/lightnvm.c | 14 +-
> 1 file changed, 1 insertion(+), 13 deletions(-)

Even though the current implementations do not use the MLC pairing
information, this is part of the 1.2 identification. Until we
eventually remove 1.2 support (if we do), the identify structure should
reflect the specification as is.

Javier





Re: [PATCH 2/2] lightnvm: remove multiple groups in 1.2 data structure

2018-01-30 Thread Javier González
> On 30 Jan 2018, at 21.26, Matias Bjørling  wrote:
> 
> Only one id group from the 1.2 specification is supported. Make
> sure that only the first group is accessible.
> 
> Signed-off-by: Matias Bjørling 
> ---
> drivers/nvme/host/lightnvm.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
> 

Same as before. Even though the current implementation only uses one group,
the 1.2 specification relies on four. At the least, the identify structure
should reflect this.

Javier




[PATCH 1/5] lightnvm: pblk: handle bad sectors in the emeta area correctly

2018-01-30 Thread Javier González
From: Hans Holmberg 

Unless we check for bad sectors in the entire emeta area,
we risk ending up with a valid-bitmap / available-sector-count inconsistency.
This results in lines with a bad chunk in the last LUN being marked as bad,
so go through the whole emeta area and mark up the invalid sectors.

Signed-off-by: Hans Holmberg 
---
 drivers/lightnvm/pblk-core.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 0487b9340c1d..9027cf2ed1d8 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1021,6 +1021,7 @@ static int pblk_line_init_bb(struct pblk *pblk, struct 
pblk_line *line,
int nr_bb = 0;
u64 off;
int bit = -1;
+   int emeta_secs;
 
line->sec_in_line = lm->sec_per_line;
 
@@ -1055,18 +1056,18 @@ static int pblk_line_init_bb(struct pblk *pblk, struct 
pblk_line *line,
/* Mark emeta metadata sectors as bad sectors. We need to consider bad
 * blocks to make sure that there are enough sectors to store emeta
 */
-   off = lm->sec_per_line - lm->emeta_sec[0];
-   bitmap_set(line->invalid_bitmap, off, lm->emeta_sec[0]);
-   while (nr_bb) {
+   emeta_secs = lm->emeta_sec[0];
+   off = lm->sec_per_line;
+   while (emeta_secs) {
off -= geo->sec_per_pl;
if (!test_bit(off, line->invalid_bitmap)) {
bitmap_set(line->invalid_bitmap, off, geo->sec_per_pl);
-   nr_bb--;
+   emeta_secs -= geo->sec_per_pl;
}
}
 
-   line->sec_in_line -= lm->emeta_sec[0];
line->emeta_ssec = off;
+   line->sec_in_line -= lm->emeta_sec[0];
line->nr_valid_lbas = 0;
line->left_msecs = line->sec_in_line;
*line->vsc = cpu_to_le32(line->sec_in_line);
-- 
2.7.4



[PATCH 2/5] lightnvm: pblk: check data lines version on recovery

2018-01-30 Thread Javier González
From: Hans Holmberg 

As a preparation for future bumps of data line persistent storage
versions, we need to start checking the emeta line version during
recovery. Also split up the current emeta/smeta version into two
bytes (major, minor).

Recovering lines with the same major number as the current pblk data
line version must succeed. This means that any changes in the
persistent format must be:

 (1) Backward compatible: if we switch back to an older
 kernel, recovery of lines stored with major == current_major
 and minor > current_minor must succeed.

 (2) Forward compatible: switching to a newer kernel,
 recovery of lines stored with major == current_major and
 minor < current_minor must handle the data format differences
 gracefully (i.e. initialize new data structures to default values).

If we detect lines that have a different major number than
the current one, we must abort recovery. The user must manually
migrate the data in this case.

Previously the version stored in the emeta header was copied
from smeta, which has version 1, so we need to set the minor
version to 1.

Signed-off-by: Hans Holmberg 
Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-core.c |  9 -
 drivers/lightnvm/pblk-recovery.c | 26 --
 drivers/lightnvm/pblk.h  | 16 ++--
 3 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 9027cf2ed1d8..155e42a26293 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -975,7 +975,8 @@ static int pblk_line_init_metadata(struct pblk *pblk, 
struct pblk_line *line,
memcpy(smeta_buf->header.uuid, pblk->instance_uuid, 16);
smeta_buf->header.id = cpu_to_le32(line->id);
smeta_buf->header.type = cpu_to_le16(line->type);
-   smeta_buf->header.version = SMETA_VERSION;
+   smeta_buf->header.version_major = SMETA_VERSION_MAJOR;
+   smeta_buf->header.version_minor = SMETA_VERSION_MINOR;
 
/* Start metadata */
smeta_buf->seq_nr = cpu_to_le64(line->seq_nr);
@@ -998,6 +999,12 @@ static int pblk_line_init_metadata(struct pblk *pblk, 
struct pblk_line *line,
/* End metadata */
memcpy(&emeta_buf->header, &smeta_buf->header,
sizeof(struct line_header));
+
+   emeta_buf->header.version_major = EMETA_VERSION_MAJOR;
+   emeta_buf->header.version_minor = EMETA_VERSION_MINOR;
+   emeta_buf->header.crc = cpu_to_le32(
+   pblk_calc_meta_header_crc(pblk, &emeta_buf->header));
+
emeta_buf->seq_nr = cpu_to_le64(line->seq_nr);
emeta_buf->nr_lbas = cpu_to_le64(line->sec_in_line);
emeta_buf->nr_valid_lbas = cpu_to_le64(0);
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 1d5e961bf5e0..a30fe203d454 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -826,6 +826,25 @@ static u64 pblk_line_emeta_start(struct pblk *pblk, struct 
pblk_line *line)
return emeta_start;
 }
 
+static int pblk_recov_check_line_version(struct pblk *pblk,
+struct line_emeta *emeta)
+{
+   struct line_header *header = &emeta->header;
+
+   if (header->version_major != EMETA_VERSION_MAJOR) {
+   pr_err("pblk: line major version mismatch: %d, expected: %d\n",
+  header->version_major, EMETA_VERSION_MAJOR);
+   return 1;
+   }
+
+#ifdef NVM_DEBUG
+   if (header->version_minor > EMETA_VERSION_MINOR)
+   pr_info("pblk: newer line minor version found: %d\n", line_v);
+#endif
+
+   return 0;
+}
+
 struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
 {
struct pblk_line_meta *lm = &pblk->lm;
@@ -873,9 +892,9 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
if (le32_to_cpu(smeta_buf->header.identifier) != PBLK_MAGIC)
continue;
 
-   if (smeta_buf->header.version != SMETA_VERSION) {
+   if (smeta_buf->header.version_major != SMETA_VERSION_MAJOR) {
pr_err("pblk: found incompatible line version %u\n",
-   le16_to_cpu(smeta_buf->header.version));
+   smeta_buf->header.version_major);
return ERR_PTR(-EINVAL);
}
 
@@ -943,6 +962,9 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk)
goto next;
}
 
+   if (pblk_recov_check_line_version(pblk, line->emeta->buf))
+   return ERR_PTR(-EINVAL);
+
if (pblk_recov_l2p_from_emeta(pblk, line))
pblk_recov_l2p_from_oob(pblk, line);
 
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 8c357fb6538e..fae2526f80b2 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/light

[PATCH 3/5] lightnvm: pblk: export write amplification counters to sysfs

2018-01-30 Thread Javier González
From: Hans Holmberg 

In an SSD, write amplification, WA, is defined as the average
number of page writes per user page write. Write amplification
negatively affects write performance and decreases the lifetime
of the disk, so it's a useful metric to add to sysfs.

In pblk's case, the number of writes per user sector is the sum of:

(1) number of user writes
(2) number of sectors written by the garbage collector
(3) number of sectors padded (i.e. due to syncs)

This patch adds persistent counters for 1-3 and two sysfs attributes
to export these along with WA calculated with five decimals:

write_amp_mileage: the accumulated write amplification stats
  for the lifetime of the pblk instance

write_amp_trip: resettable stats to facilitate delta measurements,
values reset at creation and if 0 is written
to the attribute.
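
As an aside (not part of the patch), the arithmetic behind the two
attributes reduces to WA = (user + gc + pad) / user; a minimal user-space
sketch of printing it with five decimals, using made-up counter values:

===
#include <inttypes.h>
#include <stdio.h>

/* WA = (user + gc + pad) / user, with five decimals via integer math */
static void print_wa(uint64_t user, uint64_t gc, uint64_t pad)
{
	uint64_t scaled = user ? ((user + gc + pad) * 100000ULL) / user : 0;

	printf("write amplification: %" PRIu64 ".%05" PRIu64 "\n",
	       scaled / 100000, scaled % 100000);
}

int main(void)
{
	print_wa(1000000, 150000, 2500);	/* made-up counter values */
	return 0;
}
===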

64-bit counters are used, as a 32-bit counter would wrap around
after only about 17 TB worth of user data. It will take a
very long time before the 64-bit sector counters wrap around.

The counters are stored after the bad block bitmap in the first
emeta sector of each written line. There is plenty of space in the
first emeta sector, so we don't need to bump the major version of
the line data format.

Signed-off-by: Hans Holmberg 
Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-cache.c|  4 ++
 drivers/lightnvm/pblk-core.c |  6 +++
 drivers/lightnvm/pblk-init.c | 11 -
 drivers/lightnvm/pblk-map.c  |  2 +
 drivers/lightnvm/pblk-rb.c   |  3 ++
 drivers/lightnvm/pblk-recovery.c | 25 
 drivers/lightnvm/pblk-sysfs.c| 86 +++-
 drivers/lightnvm/pblk.h  | 42 
 8 files changed, 169 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-cache.c b/drivers/lightnvm/pblk-cache.c
index 000fcad38136..29a23111b31c 100644
--- a/drivers/lightnvm/pblk-cache.c
+++ b/drivers/lightnvm/pblk-cache.c
@@ -63,6 +63,8 @@ int pblk_write_to_cache(struct pblk *pblk, struct bio *bio, 
unsigned long flags)
bio_advance(bio, PBLK_EXPOSED_PAGE_SIZE);
}
 
+   atomic64_add(nr_entries, &pblk->user_wa);
+
 #ifdef CONFIG_NVM_DEBUG
atomic_long_add(nr_entries, &pblk->inflight_writes);
atomic_long_add(nr_entries, &pblk->req_writes);
@@ -117,6 +119,8 @@ int pblk_write_gc_to_cache(struct pblk *pblk, struct 
pblk_gc_rq *gc_rq)
WARN_ONCE(gc_rq->secs_to_gc != valid_entries,
"pblk: inconsistent GC write\n");
 
+   atomic64_add(valid_entries, &pblk->gc_wa);
+
 #ifdef CONFIG_NVM_DEBUG
atomic_long_add(valid_entries, &pblk->inflight_writes);
atomic_long_add(valid_entries, &pblk->recov_gc_writes);
diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 155e42a26293..22e61cd4f801 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -1630,11 +1630,16 @@ void pblk_line_close_meta(struct pblk *pblk, struct 
pblk_line *line)
struct pblk_line_meta *lm = &pblk->lm;
struct pblk_emeta *emeta = line->emeta;
struct line_emeta *emeta_buf = emeta->buf;
+   struct wa_counters *wa = emeta_to_wa(lm, emeta_buf);
 
/* No need for exact vsc value; avoid a big line lock and take aprox. */
memcpy(emeta_to_vsc(pblk, emeta_buf), l_mg->vsc_list, lm->vsc_list_len);
memcpy(emeta_to_bb(emeta_buf), line->blk_bitmap, lm->blk_bitmap_len);
 
+   wa->user = cpu_to_le64(atomic64_read(&pblk->user_wa));
+   wa->pad = cpu_to_le64(atomic64_read(&pblk->pad_wa));
+   wa->gc = cpu_to_le64(atomic64_read(&pblk->gc_wa));
+
emeta_buf->nr_valid_lbas = cpu_to_le64(line->nr_valid_lbas);
emeta_buf->crc = cpu_to_le32(pblk_calc_emeta_crc(pblk, emeta_buf));
 
@@ -1837,6 +1842,7 @@ void pblk_update_map_dev(struct pblk *pblk, sector_t lba,
 #endif
/* Invalidate and discard padded entries */
if (lba == ADDR_EMPTY) {
+   atomic64_inc(&pblk->pad_wa);
 #ifdef CONFIG_NVM_DEBUG
atomic_long_inc(&pblk->padded_wb);
 #endif
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 93d671ca518e..7eedc5daebc8 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -559,8 +559,8 @@ static unsigned int calc_emeta_len(struct pblk *pblk)
 
/* Round to sector size so that lba_list starts on its own sector */
lm->emeta_sec[1] = DIV_ROUND_UP(
-   sizeof(struct line_emeta) + lm->blk_bitmap_len,
-   geo->sec_size);
+   sizeof(struct line_emeta) + lm->blk_bitmap_len +
+   sizeof(struct wa_counters), geo->sec_size);
lm->emeta_len[1] = lm->emeta_sec[1] * geo->sec_size;
 
/* Round to sector size so that vsc_list starts on its own sector */
@@ -991,6 +991,13 @@ static void *pblk_init(str

[PATCH 5/5] lightnvm: pblk: refactor bad block identification

2018-01-30 Thread Javier González
In preparation for the OCSSD 2.0 specification's bad block identification,
refactor the current code to generalize bad block get/set functions and
structures.

Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-init.c | 213 +++
 drivers/lightnvm/pblk.h  |   6 --
 2 files changed, 112 insertions(+), 107 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index a5e3510c0f38..86a94a7faa96 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -365,7 +365,25 @@ static void pblk_luns_free(struct pblk *pblk)
kfree(pblk->luns);
 }
 
-static void pblk_free_line_bitmaps(struct pblk_line *line)
+static void pblk_line_mg_free(struct pblk *pblk)
+{
+   struct pblk_line_mgmt *l_mg = &pblk->l_mg;
+   int i;
+
+   kfree(l_mg->bb_template);
+   kfree(l_mg->bb_aux);
+   kfree(l_mg->vsc_list);
+
+   for (i = 0; i < PBLK_DATA_LINES; i++) {
+   kfree(l_mg->sline_meta[i]);
+   pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type);
+   kfree(l_mg->eline_meta[i]);
+   }
+
+   kfree(pblk->lines);
+}
+
+static void pblk_line_meta_free(struct pblk_line *line)
 {
kfree(line->blk_bitmap);
kfree(line->erase_bitmap);
@@ -382,40 +400,16 @@ static void pblk_lines_free(struct pblk *pblk)
line = &pblk->lines[i];
 
pblk_line_free(pblk, line);
-   pblk_free_line_bitmaps(line);
+   pblk_line_meta_free(line);
}
spin_unlock(&l_mg->free_lock);
 }
 
-static void pblk_line_meta_free(struct pblk *pblk)
+static int pblk_bb_get_tbl(struct nvm_tgt_dev *dev, struct pblk_lun *rlun,
+  u8 *blks, int nr_blks)
 {
-   struct pblk_line_mgmt *l_mg = &pblk->l_mg;
-   int i;
-
-   kfree(l_mg->bb_template);
-   kfree(l_mg->bb_aux);
-   kfree(l_mg->vsc_list);
-
-   for (i = 0; i < PBLK_DATA_LINES; i++) {
-   kfree(l_mg->sline_meta[i]);
-   pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type);
-   kfree(l_mg->eline_meta[i]);
-   }
-
-   kfree(pblk->lines);
-}
-
-static int pblk_bb_discovery(struct nvm_tgt_dev *dev, struct pblk_lun *rlun)
-{
-   struct nvm_geo *geo = &dev->geo;
struct ppa_addr ppa;
-   u8 *blks;
-   int nr_blks, ret;
-
-   nr_blks = geo->nr_chks * geo->plane_mode;
-   blks = kmalloc(nr_blks, GFP_KERNEL);
-   if (!blks)
-   return -ENOMEM;
+   int ret;
 
ppa.ppa = 0;
ppa.g.ch = rlun->bppa.g.ch;
@@ -423,34 +417,56 @@ static int pblk_bb_discovery(struct nvm_tgt_dev *dev, 
struct pblk_lun *rlun)
 
ret = nvm_get_tgt_bb_tbl(dev, ppa, blks);
if (ret)
-   goto out;
+   return ret;
 
nr_blks = nvm_bb_tbl_fold(dev->parent, blks, nr_blks);
-   if (nr_blks < 0) {
-   ret = nr_blks;
-   goto out;
-   }
-
-   rlun->bb_list = blks;
+   if (nr_blks < 0)
+   return -EIO;
 
return 0;
-out:
-   kfree(blks);
-   return ret;
+}
+
+static void *pblk_bb_get_log(struct pblk *pblk)
+{
+   struct nvm_tgt_dev *dev = pblk->dev;
+   struct nvm_geo *geo = &dev->geo;
+   u8 *log;
+   int i, nr_blks, blk_per_lun;
+   int ret;
+
+   blk_per_lun = geo->nr_chks * geo->plane_mode;
+   nr_blks = blk_per_lun * geo->all_luns;
+
+   log = kmalloc(nr_blks, GFP_KERNEL);
+   if (!log)
+   return ERR_PTR(-ENOMEM);
+
+   for (i = 0; i < geo->all_luns; i++) {
+   struct pblk_lun *rlun = &pblk->luns[i];
+   u8 *log_pos = log + i * blk_per_lun;
+
+   ret = pblk_bb_get_tbl(dev, rlun, log_pos, blk_per_lun);
+   if (ret) {
+   kfree(log);
+   return ERR_PTR(-EIO);
+   }
+   }
+
+   return log;
 }
 
 static int pblk_bb_line(struct pblk *pblk, struct pblk_line *line,
-   int blk_per_line)
+   u8 *bb_log, int blk_per_line)
 {
struct nvm_tgt_dev *dev = pblk->dev;
struct nvm_geo *geo = &dev->geo;
-   struct pblk_lun *rlun;
-   int bb_cnt = 0;
-   int i;
+   int i, bb_cnt = 0;
 
for (i = 0; i < blk_per_line; i++) {
-   rlun = &pblk->luns[i];
-   if (rlun->bb_list[line->id] == NVM_BLK_T_FREE)
+   struct pblk_lun *rlun = &pblk->luns[i];
+   u8 *lun_bb_log = bb_log + i * blk_per_line;
+
+   if (lun_bb_log[line->id] == NVM_BLK_T_FREE)
continue;
 
set_bit(pblk_ppa_to_pos(geo, rlun->bppa), line->blk_bitmap);
@@ -460,29 +476,12 @@ static int pblk_bb_line(struct pblk *pblk, struct 
pblk_line *line,
return bb_cnt;
 }
 
-static int pblk_alloc_line_bitmaps(struct pblk *pblk, struct pblk_line *line)
-{
-   struct pblk_line_meta 

[PATCH 4/5] lightnvm: pblk: add padding distribution sysfs attribute

2018-01-30 Thread Javier González
From: Hans Holmberg 

When pblk receives a sync, all data up to that point in the write buffer
must be committed to persistent storage, and as flash memory comes with a
minimal write size, there is a significant cost involved both in terms
of time for completing the sync and in terms of write amplification
(sectors padded to fill up to the minimal write size).

In order to get a better understanding of the costs involved for syncs,
add a sysfs attribute to pblk: padded_dist, showing a normalized
distribution of sectors padded. In order to facilitate measurements of
specific workloads during the lifetime of the pblk instance, the
distribution can be reset by writing 0 to the attribute.

Do this by introducing counters for each possible padding,
{0..(minimal write size - 1)}, and calculating the normalized distribution
when the attribute is read.
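
As an illustration (not part of the patch), the normalization reduces to
dividing each padding counter by the total number of padded writes; the
counter values and minimal write size below are made up:

===
#include <stdio.h>

int main(void)
{
	const int min_write_pgs = 8;	/* assumed minimal write size */
	/* one counter per possible padding 1..(min_write_pgs - 1) */
	unsigned long long pad_dist[] = { 40, 25, 12, 9, 8, 4, 2 };
	unsigned long long total = 0;
	int i;

	for (i = 0; i < min_write_pgs - 1; i++)
		total += pad_dist[i];

	for (i = 0; i < min_write_pgs - 1; i++)
		printf("pad=%d: %.5f\n", i + 1,
		       total ? (double)pad_dist[i] / total : 0.0);
	return 0;
}
===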

Signed-off-by: Hans Holmberg 
Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk-init.c  | 16 --
 drivers/lightnvm/pblk-rb.c| 15 +-
 drivers/lightnvm/pblk-sysfs.c | 68 +++
 drivers/lightnvm/pblk.h   |  6 +++-
 4 files changed, 95 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 7eedc5daebc8..a5e3510c0f38 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -921,6 +921,7 @@ static void pblk_free(struct pblk *pblk)
 {
pblk_luns_free(pblk);
pblk_lines_free(pblk);
+   kfree(pblk->pad_dist);
pblk_line_meta_free(pblk);
pblk_core_free(pblk);
pblk_l2p_free(pblk);
@@ -998,11 +999,13 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
pblk->pad_rst_wa = 0;
pblk->gc_rst_wa = 0;
 
+   atomic_long_set(&pblk->nr_flush, 0);
+   pblk->nr_flush_rst = 0;
+
 #ifdef CONFIG_NVM_DEBUG
atomic_long_set(&pblk->inflight_writes, 0);
atomic_long_set(&pblk->padded_writes, 0);
atomic_long_set(&pblk->padded_wb, 0);
-   atomic_long_set(&pblk->nr_flush, 0);
atomic_long_set(&pblk->req_writes, 0);
atomic_long_set(&pblk->sub_writes, 0);
atomic_long_set(&pblk->sync_writes, 0);
@@ -1034,10 +1037,17 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
goto fail_free_luns;
}
 
+   pblk->pad_dist = kzalloc((pblk->min_write_pgs - 1) * sizeof(atomic64_t),
+GFP_KERNEL);
+   if (!pblk->pad_dist) {
+   ret = -ENOMEM;
+   goto fail_free_line_meta;
+   }
+
ret = pblk_core_init(pblk);
if (ret) {
pr_err("pblk: could not initialize core\n");
-   goto fail_free_line_meta;
+   goto fail_free_pad_dist;
}
 
ret = pblk_l2p_init(pblk);
@@ -1097,6 +1107,8 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
pblk_l2p_free(pblk);
 fail_free_core:
pblk_core_free(pblk);
+fail_free_pad_dist:
+   kfree(pblk->pad_dist);
 fail_free_line_meta:
pblk_line_meta_free(pblk);
 fail_free_luns:
diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 7044b5599cc4..264372d7b537 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -437,9 +437,7 @@ static int pblk_rb_may_write_flush(struct pblk_rb *rb, 
unsigned int nr_entries,
if (bio->bi_opf & REQ_PREFLUSH) {
struct pblk *pblk = container_of(rb, struct pblk, rwb);
 
-#ifdef CONFIG_NVM_DEBUG
atomic_long_inc(&pblk->nr_flush);
-#endif
if (pblk_rb_flush_point_set(&pblk->rwb, bio, mem))
*io_ret = NVM_IO_OK;
}
@@ -620,14 +618,17 @@ unsigned int pblk_rb_read_to_bio(struct pblk_rb *rb, 
struct nvm_rq *rqd,
pr_err("pblk: could not pad page in write bio\n");
return NVM_IO_ERR;
}
+
+   if (pad < pblk->min_write_pgs)
+   atomic64_inc(&pblk->pad_dist[pad - 1]);
+   else
+   pr_warn("pblk: padding more than min. sectors\n");
+
+   atomic64_add(pad, &pblk->pad_wa);
}
 
-   atomic64_add(pad, &((struct pblk *)
-   (container_of(rb, struct pblk, rwb)))->pad_wa);
-
 #ifdef CONFIG_NVM_DEBUG
-   atomic_long_add(pad, &((struct pblk *)
-   (container_of(rb, struct pblk, rwb)))->padded_writes);
+   atomic_long_add(pad, &pblk->padded_writes);
 #endif
 
return NVM_IO_OK;
diff --git a/drivers/lightnvm/pblk-sysfs.c b/drivers/lightnvm/pblk-sysfs.c
index 4804bbd32d5f..f902ac00d071 100644
--- a/drivers/lightnvm/pblk-sysfs.c
+++ b/drivers/lightnvm/pblk-sysfs.c
@@ -341,6 +341,38 @@ static ssize_t pblk_sysfs_get_write_amp_trip(struct pblk 
*pblk, char *page)
atomic64_read(&pblk->pad_wa) - pblk->pad_rst_wa, page);
 }
 
+static ssize_t pblk_sysfs_get_p

Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Ming Lei
On Tue, Jan 30, 2018 at 09:52:31AM -0800, Bart Van Assche wrote:
> On 01/30/18 06:24, Mike Snitzer wrote:
> > +*
> > +* If driver returns BLK_STS_RESOURCE and SCHED_RESTART
> > +* bit is set, run queue after a delay to avoid IO stalls
> > +* that could otherwise occur if the queue is idle.
> >  */
> > -   if (!blk_mq_sched_needs_restart(hctx) ||
> > +   needs_restart = blk_mq_sched_needs_restart(hctx);
> > +   if (!needs_restart ||
> > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> > blk_mq_run_hw_queue(hctx, true);
> > +   else if (needs_restart && (ret == BLK_STS_RESOURCE))
> > +   blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
> > }
> 
> If a request completes concurrently with execution of the above code then
> the request completion will trigger a call of blk_mq_sched_restart_hctx()
> and that call will clear the BLK_MQ_S_SCHED_RESTART bit. If that bit is
> cleared before the above code tests it then the above code will schedule an
> asynchronous call of __blk_mq_run_hw_queue(). If the .queue_rq() call

Right.

> triggered by the new queue run returns again BLK_STS_RESOURCE then the above
> code will be executed again. In other words, a loop occurs. That loop will

This patch doesn't change anything about that. When BLK_STS_RESOURCE is
returned, this request is added to hctx->dispatch; next time, before
this request is dispatched, SCHED_RESTART is set and the loop is cut.

> repeat as long as the described race occurs. The current (kernel v4.15)
> block layer behavior is simpler: only block drivers call
> blk_mq_delay_run_hw_queue() and the block layer core never calls that
> function. Hence that loop cannot occur with the v4.15 block layer core and

That way isn't safe, as I have explained to you in the following link:

https://marc.info/?l=dm-devel&m=151727454815018&w=2

> block drivers. A motivation of why that loop is preferred compared to the
> current behavior (no loop) is missing. Does this mean that that loop is a
> needless complication of the block layer core?

No such loop as I explained above.

> 
> Sorry but I still prefer the v4.15 block layer approach because this patch
> has in my view the following disadvantages:
> - It involves a blk-mq API change. API changes are always risky and need
>   some time to stabilize.
> - The delay after which to rerun the queue is moved from block layer
>   drivers into the block layer core. I think that's wrong because only
>   the block driver authors can make a good choice for this constant.
> - This patch makes block drivers harder to understand. Anyone who sees
>   return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first
>   time will have to look up the meaning of these constants. An explicit
>   blk_mq_delay_run_hw_queue() call is easier to understand.
> - This patch makes the blk-mq core harder to understand because of the
>   loop mentioned above.
> - This patch does not fix any bugs nor makes block drivers easier to
>   read or to implement. So why is this patch considered useful?

It does avoid the race I mentioned in the following link:

https://marc.info/?l=dm-devel&m=151727454815018&w=2

More importantly, every driver needs this change. If you have a better idea
to fix them all, please post a patch; then we can compare both.

Thanks,
Ming


[PATCH] bcache: fix error count in memory shrink

2018-01-30 Thread tang . junhui
From: Tang Junhui 

In bch_mca_scan(), nr btree nodes are expected to shrink, so the for(;;)
loop needs to satisfy the condition freed < nr.
And since c->btree_cache_used always decreases after mca_data_free() is
called in the for(;;) loop, we need an auto variable to record
c->btree_cache_used before the for(;;) loop.

Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/btree.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 81e8dc3..7e9ef3d 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -664,6 +664,7 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
struct btree *b, *t;
unsigned long i, nr = sc->nr_to_scan;
unsigned long freed = 0;
+   unsigned int btree_cache_used;
 
if (c->shrinker_disabled)
return SHRINK_STOP;
@@ -700,7 +701,8 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
}
}
 
-   for (i = 0; (nr--) && i < c->btree_cache_used; i++) {
+   btree_cache_used = c->btree_cache_used;
+   for (i = 0; freed < nr && i < btree_cache_used; i++) {
if (list_empty(&c->btree_cache))
goto out;
 
-- 
1.8.3.1



Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Jens Axboe
On 1/30/18 7:24 AM, Mike Snitzer wrote:
> From: Ming Lei 
> 
> This status is returned from driver to block layer if device related
> resource is unavailable, but driver can guarantee that IO dispatch
> will be triggered in future when the resource is available.
> 
> Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if driver
> returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
> a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.  BLK_MQ_DELAY_QUEUE is
> 3 ms because both scsi-mq and nvmefc are using that magic value.
> 
> If a driver can make sure there is in-flight IO, it is safe to return
> BLK_STS_DEV_RESOURCE because:
> 
> 1) If all in-flight IOs complete before examining SCHED_RESTART in
> blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
> is run immediately in this case by blk_mq_dispatch_rq_list();
> 
> 2) if there is any in-flight IO after/when examining SCHED_RESTART
> in blk_mq_dispatch_rq_list():
> - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
> - otherwise, this request will be dispatched after any in-flight IO is
>   completed via blk_mq_sched_restart()
> 
> 3) if SCHED_RESTART is set concurrently in context because of
> BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
> cases and make sure IO hang can be avoided.
> 
> One invariant is that queue will be rerun if SCHED_RESTART is set.

This looks pretty good to me. I'm waffling a bit on whether to retain
the current BLK_STS_RESOURCE behavior and name the new one something
else, but I do like using the DEV name in there to signify the
difference between a global and device resource.

Just a few small nits below - can you roll a v6 with the changes?

> diff --git a/block/blk-core.c b/block/blk-core.c
> index cdae69be68e9..38279d4ae08b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -145,6 +145,7 @@ static const struct {
>   [BLK_STS_MEDIUM]= { -ENODATA,   "critical medium" },
>   [BLK_STS_PROTECTION]= { -EILSEQ,"protection" },
>   [BLK_STS_RESOURCE]  = { -ENOMEM,"kernel resource" },
> + [BLK_STS_DEV_RESOURCE]  = { -ENOMEM,"device resource" },
>   [BLK_STS_AGAIN] = { -EAGAIN,"nonblocking retry" },

I think we should make BLK_STS_DEV_RESOURCE be -EBUSY; that more closely
matches the result and makes it distinctly different from the global
shortage that is STS_RESOURCE/ENOMEM.

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 43e7449723e0..e39b4e2a63db 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
> **hctx,
>   return true;
>  }
>  
> +#define BLK_MQ_QUEUE_DELAY   3   /* ms units */

Make that BLK_MQ_RESOURCE_DELAY or something like that. The name is too
generic right now.

> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 2d973ac54b09..f41d2057215f 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -39,6 +39,23 @@ typedef u8 __bitwise blk_status_t;
>  
>  #define BLK_STS_AGAIN((__force blk_status_t)12)
>  
> +/*
> + * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device
> + * related resource is unavailable, but driver can guarantee that queue
> + * will be rerun in future once the resource is available (whereby
> + * dispatching requests).
> + *
> + * To safely return BLK_STS_DEV_RESOURCE, and allow forward progress, a
> + * driver just needs to make sure there is in-flight IO.
> + *
> + * Difference with BLK_STS_RESOURCE:
> + * If driver isn't sure if the queue will be rerun once device resource
> + * is made available, please return BLK_STS_RESOURCE.  For example: when
> + * memory allocation, DMA Mapping or other system resource allocation
> + * fails and IO can't be submitted to device.
> + */
> +#define BLK_STS_DEV_RESOURCE ((__force blk_status_t)13)

I'd rephrase that as:

BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if
device related resource are unavailable, but the driver can guarantee
that the queue will be rerun in the future once resources become
available again. This is typically the case for device specific
resources that are consumed for IO. If the driver fails allocating these
resources, we know that inflight (or pending) IO will free these
resources upon completion.

This is different from BLK_STS_RESOURCE in that it explicitly references
device specific resource. For resources of wider scope, allocation
failure can happen without having pending IO. This means that we can't
rely on request completions freeing these resources, as IO may not be in
flight. Examples of that are kernel memory allocations, DMA mappings, or
any other system wide resources.
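
A hedged driver-side sketch of that distinction (the bar_* helpers are
hypothetical, not a real driver's API):

===
#include <linux/blk-mq.h>

struct bar_dev;					/* hypothetical driver state */
extern bool bar_get_hw_tag(struct bar_dev *bar);
extern int bar_dma_map(struct bar_dev *bar, struct request *rq);
extern void bar_submit(struct bar_dev *bar, struct request *rq);

static blk_status_t bar_queue_rq(struct blk_mq_hw_ctx *hctx,
				 const struct blk_mq_queue_data *bd)
{
	struct bar_dev *bar = hctx->queue->queuedata;

	/*
	 * Device-specific resource, freed when in-flight IO completes:
	 * a queue rerun is guaranteed, so BLK_STS_DEV_RESOURCE is safe.
	 */
	if (!bar_get_hw_tag(bar))
		return BLK_STS_DEV_RESOURCE;

	/*
	 * System-wide resource (e.g. a DMA mapping) can fail with no IO
	 * in flight, so nothing guarantees a rerun: BLK_STS_RESOURCE.
	 */
	if (bar_dma_map(bar, bd->rq))
		return BLK_STS_RESOURCE;

	bar_submit(bar, bd->rq);
	return BLK_STS_OK;
}
===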

-- 
Jens Axboe



[PATCH v6] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Mike Snitzer
From: Ming Lei 

This status is returned from driver to block layer if device related
resource is unavailable, but driver can guarantee that IO dispatch
will be triggered in future when the resource is available.

Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.  BLK_MQ_DELAY_QUEUE is
3 ms because both scsi-mq and nvmefc are using that magic value.

If a driver can make sure there is in-flight IO, it is safe to return
BLK_STS_DEV_RESOURCE because:

1) If all in-flight IOs complete before examining SCHED_RESTART in
blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
is run immediately in this case by blk_mq_dispatch_rq_list();

2) if there is any in-flight IO after/when examining SCHED_RESTART
in blk_mq_dispatch_rq_list():
- if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
- otherwise, this request will be dispatched after any in-flight IO is
  completed via blk_mq_sched_restart()

3) if SCHED_RESTART is set concurrently in context because of
BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
cases and make sure IO hang can be avoided.

One invariant is that queue will be rerun if SCHED_RESTART is set.

Suggested-by: Jens Axboe 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
Signed-off-by: Mike Snitzer 
---
V6:
- use -EBUSY, instead of -ENOMEM, for BLK_STS_DEV_RESOURCE
- rename BLK_MQ_QUEUE_DELAY to BLK_MQ_RESOURCE_DELAY
- cleanup blk_types.h comment block above BLK_STS_DEV_RESOURCE
- all suggested by Jens
V5:
- fixed stale reference to 10ms delay in blk-mq.c comment
- revised commit header to include Ming's proof of how
  blk-mq drivers will make progress (serves to document how
  scsi_queue_rq now works)
V4:
- cleanup header and code comments
- rerun queue after BLK_MQ_QUEUE_DELAY (3ms) instead of 10ms
- eliminate nvmefc's queue rerun now that blk-mq does it
V3:
- fix typo, and improvement document
- add tested-by tag
V2:
- add comments on the new introduced status
- patch style fix
- both are suggested by Christoph

 block/blk-core.c |  1 +
 block/blk-mq.c   | 20 
 drivers/block/null_blk.c |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-rq.c   |  5 ++---
 drivers/nvme/host/fc.c   | 12 ++--
 drivers/scsi/scsi_lib.c  |  6 +++---
 include/linux/blk_types.h| 18 ++
 9 files changed, 45 insertions(+), 23 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cdae69be68e9..79dfef84c66c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
[BLK_STS_MEDIUM]= { -ENODATA,   "critical medium" },
[BLK_STS_PROTECTION]= { -EILSEQ,"protection" },
[BLK_STS_RESOURCE]  = { -ENOMEM,"kernel resource" },
+   [BLK_STS_DEV_RESOURCE]  = { -EBUSY, "device resource" },
[BLK_STS_AGAIN] = { -EAGAIN,"nonblocking retry" },
 
/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 43e7449723e0..52a206e3777f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
return true;
 }
 
+#define BLK_MQ_RESOURCE_DELAY  3   /* ms units */
+
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 bool got_budget)
 {
@@ -1167,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
struct request *rq, *nxt;
bool no_tag = false;
int errors, queued;
+   blk_status_t ret = BLK_STS_OK;
 
if (list_empty(list))
return false;
@@ -1179,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
errors = queued = 0;
do {
struct blk_mq_queue_data bd;
-   blk_status_t ret;
 
rq = list_first_entry(list, struct request, queuelist);
if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1224,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
}
 
ret = q->mq_ops->queue_rq(hctx, &bd);
-   if (ret == BLK_STS_RESOURCE) {
+   if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
/*
 * If an I/O scheduler has been configured and we got a
 * driver tag for the next request already, free it
@@ -1255,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 * that is wh
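
The hunk is cut off above. Going by the v5 version of this logic that
Bart quotes later in the thread, plus the V6 rename of
BLK_MQ_QUEUE_DELAY to BLK_MQ_RESOURCE_DELAY, the tail of the dispatch
path presumably ends up roughly like this (a reconstruction for
readability, not the verbatim v6 hunk):

===
/* Reconstructed from the v5 hunk quoted later in this thread; the only
 * assumed change is the BLK_MQ_RESOURCE_DELAY name from the V6
 * changelog. */
needs_restart = blk_mq_sched_needs_restart(hctx);
if (!needs_restart ||
    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
	blk_mq_run_hw_queue(hctx, true);
else if (needs_restart && (ret == BLK_STS_RESOURCE))
	blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
===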

Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Mike Snitzer
On Tue, Jan 30 2018 at  9:44pm -0500,
Jens Axboe  wrote:

> On 1/30/18 7:24 AM, Mike Snitzer wrote:
> > From: Ming Lei 
> > 
> > This status is returned from the driver to the block layer if a
> > device-related resource is unavailable, but the driver can guarantee
> > that IO dispatch will be triggered again in the future once the
> > resource becomes available.
> > 
> > Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if the
> > driver returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the
> > queue after a delay (BLK_MQ_QUEUE_DELAY) to avoid IO stalls.
> > BLK_MQ_QUEUE_DELAY is 3 ms because both scsi-mq and nvmefc already
> > use that magic value.
> > 
> > If a driver can make sure there is in-flight IO, it is safe to return
> > BLK_STS_DEV_RESOURCE because:
> > 
> > 1) If all in-flight IOs complete before examining SCHED_RESTART in
> > blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
> > is run immediately in this case by blk_mq_dispatch_rq_list();
> > 
> > 2) if there is any in-flight IO after/when examining SCHED_RESTART
> > in blk_mq_dispatch_rq_list():
> > - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
> > - otherwise, this request will be dispatched after any in-flight IO is
> >   completed via blk_mq_sched_restart()
> > 
> > 3) if SCHED_RESTART is set concurrently in this context because of
> > BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
> > cases and make sure an IO hang is avoided.
> > 
> > One invariant is that the queue will be rerun if SCHED_RESTART is set.
> 
> This looks pretty good to me. I'm waffling a bit on whether to retain
> the current BLK_STS_RESOURCE behavior and name the new one something
> else, but I do like using the DEV name in there to signify the
> difference between a global and device resource.
> 
> Just a few small nits below - can you roll a v6 with the changes?

Folded in all your feedback and just replied with v6.

> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index 2d973ac54b09..f41d2057215f 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -39,6 +39,23 @@ typedef u8 __bitwise blk_status_t;
> >  
> >  #define BLK_STS_AGAIN  ((__force blk_status_t)12)
> >  
> > +/*
> > + * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device
> > + * related resource is unavailable, but driver can guarantee that queue
> > + * will be rerun in future once the resource is available (whereby
> > + * dispatching requests).
> > + *
> > + * To safely return BLK_STS_DEV_RESOURCE, and allow forward progress, a
> > + * driver just needs to make sure there is in-flight IO.
> > + *
> > + * Difference with BLK_STS_RESOURCE:
> > + * If driver isn't sure if the queue will be rerun once device resource
> > + * is made available, please return BLK_STS_RESOURCE.  For example: when
> > + * memory allocation, DMA Mapping or other system resource allocation
> > + * fails and IO can't be submitted to device.
> > + */
> > +#define BLK_STS_DEV_RESOURCE   ((__force blk_status_t)13)
> 
> I'd rephrase that as:
> 
> BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if
> device-related resources are unavailable, but the driver can guarantee
> that the queue will be rerun in the future once the resources become
> available again. This is typically the case for device-specific
> resources that are consumed for IO. If the driver fails allocating these
> resources, we know that inflight (or pending) IO will free these
> resources upon completion.
> 
> This is different from BLK_STS_RESOURCE in that it explicitly references
> device-specific resources. For resources of wider scope, allocation
> failure can happen without having pending IO. This means that we can't
> rely on request completions freeing these resources, as IO may not be in
> flight. Examples of that are kernel memory allocations, DMA mappings, or
> any other system-wide resources.

Thanks for that, definitely clearer, nice job.

Mike


Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Jens Axboe
On 1/30/18 10:52 AM, Bart Van Assche wrote:
> On 01/30/18 06:24, Mike Snitzer wrote:
>> + *
>> + * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
>> + * bit is set, run queue after a delay to avoid IO stalls
>> + * that could otherwise occur if the queue is idle.
>>   */
>> -if (!blk_mq_sched_needs_restart(hctx) ||
>> +needs_restart = blk_mq_sched_needs_restart(hctx);
>> +if (!needs_restart ||
>>  (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>>  blk_mq_run_hw_queue(hctx, true);
>> +else if (needs_restart && (ret == BLK_STS_RESOURCE))
>> +blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
>>  }
> 
> If a request completes concurrently with execution of the above code 
> then the request completion will trigger a call of 
> blk_mq_sched_restart_hctx() and that call will clear the 
> BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above code 
> tests it then the above code will schedule an asynchronous call of 
> __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the new 
> queue run returns again BLK_STS_RESOURCE then the above code will be 
> executed again. In other words, a loop occurs. That loop will repeat as 
> long as the described race occurs. The current (kernel v4.15) block 
> layer behavior is simpler: only block drivers call 
> blk_mq_delay_run_hw_queue() and the block layer core never calls that 
> function. Hence that loop cannot occur with the v4.15 block layer core 
> and block drivers. A motivation for why that loop is preferred over
> the current behavior (no loop) is missing. Does this mean that the
> loop is a needless complication of the block layer core?

The dispatch, and later restart check, is within the hctx lock. The
completions should be as well.

> Sorry but I still prefer the v4.15 block layer approach because this 
> patch has in my view the following disadvantages:
> - It involves a blk-mq API change. API changes are always risky and need
>   some time to stabilize.
> - The delay after which to rerun the queue is moved from block layer
>   drivers into the block layer core. I think that's wrong because only
>   the block driver authors can make a good choice for this constant.

It's exactly the right place to put it, as drivers cannot make a good
decision for when to run the queue again. You get NULL when allocating
some piece of memory; when do you run the queue again? That's the last
thing I want driver writers to make a decision on, because a novice
device driver writer will just think that he should run the queue again
asap. In reality we are screwed. Decisions like that SHOULD be in shared
and generic code, not in driver-private code.

> - This patch makes block drivers harder to understand. Anyone who sees
>   return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first
>   time will have to look up the meaning of these constants. An explicit
>   blk_mq_delay_run_hw_queue() call is easier to understand.

BLK_STS_RESOURCE should always be safe to return, and it should work
the same as STS_DEV_RESOURCE, except it may cause an extra queue
run.

Well written drivers should use STS_DEV_RESOURCE where it makes
sense.

> - This patch does not fix any bugs nor makes block drivers easier to
>   read or to implement. So why is this patch considered useful?

It does fix the queue-run bug on global resources. I'm assuming you
mean it doesn't fix any EXTRA bugs compared to just using the delayed
run?

-- 
Jens Axboe



Re: [PATCH v6] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Jens Axboe
On 1/30/18 8:04 PM, Mike Snitzer wrote:
> From: Ming Lei 
> 
> This status is returned from the driver to the block layer if a
> device-related resource is unavailable, but the driver can guarantee
> that IO dispatch will be triggered again in the future once the
> resource becomes available.
> 
> Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if the
> driver returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the
> queue after a delay (BLK_MQ_RESOURCE_DELAY) to avoid IO stalls.
> BLK_MQ_RESOURCE_DELAY is 3 ms because both scsi-mq and nvmefc already
> use that magic value.
> 
> If a driver can make sure there is in-flight IO, it is safe to return
> BLK_STS_DEV_RESOURCE because:
> 
> 1) If all in-flight IOs complete before examining SCHED_RESTART in
> blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
> is run immediately in this case by blk_mq_dispatch_rq_list();
> 
> 2) if there is any in-flight IO after/when examining SCHED_RESTART
> in blk_mq_dispatch_rq_list():
> - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
> - otherwise, this request will be dispatched after any in-flight IO is
>   completed via blk_mq_sched_restart()
> 
> 3) if SCHED_RESTART is set concurrently in this context because of
> BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
> cases and make sure an IO hang is avoided.
> 
> One invariant is that the queue will be rerun if SCHED_RESTART is set.

Applied, thanks.

-- 
Jens Axboe



Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Bart Van Assche
On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote:
> BLK_STS_RESOURCE should always be safe to return, and it should work
> the same as STS_DEV_RESOURCE, except it may cause an extra queue
> run.
> 
> Well written drivers should use STS_DEV_RESOURCE where it makes
> sense.

Hello Jens,

I would appreciate it if names other than BLK_STS_RESOURCE and
BLK_STS_DEV_RESOURCE were chosen, e.g. names that directly refer to the
fact that one of the two status codes causes the queue to be rerun and
the other does not. I'm afraid that the currently chosen names will
cause confusion.

Thanks,

Bart.

Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Jens Axboe
On 1/30/18 8:21 PM, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote:
>> BLK_STS_RESOURCE should always be safe to return, and it should work
>> the same as STS_DEV_RESOURCE, except it may cause an extra queue
>> run.
>>
>> Well written drivers should use STS_DEV_RESOURCE where it makes
>> sense.
> 
> Hello Jens,
> 
> I would appreciate it if names other than BLK_STS_RESOURCE and
> BLK_STS_DEV_RESOURCE were chosen, e.g. names that directly refer to the
> fact that one of the two status codes causes the queue to be rerun and
> the other does not. I'm afraid that the currently chosen names will
> cause confusion.

DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE
could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction
a bit clearer.

-- 
Jens Axboe



Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Bart Van Assche
On Tue, 2018-01-30 at 20:22 -0700, Jens Axboe wrote:
> On 1/30/18 8:21 PM, Bart Van Assche wrote:
> > On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote:
> > > BLK_STS_RESOURCE should always be safe to return, and it should work
> > > the same as STS_DEV_RESOURCE, except it may cause an extra queue
> > > run.
> > > 
> > > Well written drivers should use STS_DEV_RESOURCE where it makes
> > > sense.
> > 
> > Hello Jens,
> > 
> > I would appreciate it if names other than BLK_STS_RESOURCE and
> > BLK_STS_DEV_RESOURCE were chosen, e.g. names that directly refer to the
> > fact that one of the two status codes causes the queue to be rerun and
> > the other does not. I'm afraid that the currently chosen names will
> > cause confusion.
> 
> DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE
> could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction
> a bit clearer.

I'm concerned about both. A block driver can run out of device resources
without receiving a notification when that resource becomes available
again. In that case the appropriate return code is BLK_STS_RESOURCE. Hence
my concern ...

Bart.

[PATCH] bcache: fix error return value in memory shrink

2018-01-30 Thread tang . junhui
From: Tang Junhui 

In bch_mca_scan(), the return value should not be the number of freed
btree nodes, but the number of pages occupied by the freed btree nodes.

Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/btree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 81e8dc3..a5fbccc 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -718,7 +718,7 @@ static unsigned long bch_mca_scan(struct shrinker *shrink,
}
 out:
mutex_unlock(&c->bucket_lock);
-   return freed;
+   return freed * c->btree_pages;
 }
 
 static unsigned long bch_mca_count(struct shrinker *shrink,
-- 
1.8.3.1
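
The fix follows from the shrinker contract: the scan callback must
report progress in the same unit the count callback advertises, and
bcache's count side is expressed in pages (each btree node accounting
for c->btree_pages of them). A minimal userspace sketch of that
unit-consistency rule; the toy_* names are hypothetical:

===
#include <stdio.h>

struct toy_cache {
	unsigned long nodes;		/* cached btree nodes */
	unsigned long btree_pages;	/* pages per node */
};

/* count side: advertises reclaimable memory in pages */
static unsigned long toy_count(const struct toy_cache *c)
{
	return c->nodes * c->btree_pages;
}

/* scan side: frees nodes, but must report progress in pages too */
static unsigned long toy_scan(struct toy_cache *c, unsigned long nr_pages)
{
	unsigned long freed_nodes = 0;

	while (c->nodes && freed_nodes * c->btree_pages < nr_pages) {
		c->nodes--;
		freed_nodes++;
	}
	/* the bug: returning freed_nodes (a node count) instead of pages */
	return freed_nodes * c->btree_pages;
}

int main(void)
{
	struct toy_cache c = { .nodes = 8, .btree_pages = 4 };
	unsigned long want = toy_count(&c) / 2;

	printf("asked for %lu pages, freed %lu pages\n",
	       want, toy_scan(&c, want));
	return 0;
}
===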



Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Jens Axboe
On 1/30/18 8:27 PM, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 20:22 -0700, Jens Axboe wrote:
>> On 1/30/18 8:21 PM, Bart Van Assche wrote:
>>> On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote:
>>>> BLK_STS_RESOURCE should always be safe to return, and it should work
>>>> the same as STS_DEV_RESOURCE, except it may cause an extra queue
>>>> run.
>>>>
>>>> Well written drivers should use STS_DEV_RESOURCE where it makes
>>>> sense.
>>>
>>> Hello Jens,
>>>
>>> I would appreciate it if names other than BLK_STS_RESOURCE and
>>> BLK_STS_DEV_RESOURCE were chosen, e.g. names that directly refer to the
>>> fact that one of the two status codes causes the queue to be rerun and
>>> the other does not. I'm afraid that the currently chosen names will
>>> cause confusion.
>>
>> DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE
>> could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction
>> a bit clearer.
> 
> I'm concerned about both. A block driver can run out of device
> resources without receiving a notification when that resource becomes
> available again.

That is true, and that is why I don't want the driver to make guesses on
when to re-run. The saving grace here is that it should happen extremely
rarely. I went over this in a previous email. If it's not a rare
occurrence, then there's no other way around it than wiring up a
notification mechanism to kick the queue when the shortage clears.

One way to deal with that is making the assumption that other IO will
clear the issue. That means we can confine it to the blk-mq layer.
Similarly to how we currently sometimes have to track starvation across
hardware queues and restart queue X when Y completes IO, we may need a
broader scope for the shortage. That would probably fix all but the
pathological cases.

> In that case the appropriate return code is BLK_STS_RESOURCE. Hence my
> concern ...

But this isn't a new thing, the only real change here is making it so
that drivers don't have to think about that case.

-- 
Jens Axboe



Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Ming Lei
On Tue, Jan 30, 2018 at 08:22:27PM -0700, Jens Axboe wrote:
> On 1/30/18 8:21 PM, Bart Van Assche wrote:
> > On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote:
> >> BLK_STS_RESOURCE should always be safe to return, and it should work
> >> the same as STS_DEV_RESOURCE, except it may cause an extra queue
> >> run.
> >>
> >> Well written drivers should use STS_DEV_RESOURCE where it makes
> >> sense.
> > 
> > Hello Jens,
> > 
> > I would appreciate it if names other than BLK_STS_RESOURCE and
> > BLK_STS_DEV_RESOURCE were chosen, e.g. names that directly refer to the
> > fact that one of the two status codes causes the queue to be rerun and
> > the other does not. I'm afraid that the currently chosen names will
> > cause confusion.
> 
> DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE
> could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction

I guess it still can't cover everything; for example, .queue_rq() may
fail to submit the rq to hardware because the transport is busy, as
with FC.

-- 
Ming


Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3

2018-01-30 Thread jianchao.wang
Hi Jens

On 01/30/2018 11:57 PM, Jens Axboe wrote:
> On 1/30/18 8:41 AM, Jens Axboe wrote:
>> Hi,
>>
>> I just hit this on 4.15+ on the laptop, it's running Linus' git
>> as of yesterday, right after the block tree merge:
>>
>> commit 0a4b6e2f80aad46fb55a5cf7b1664c0aef030ee0
>> Merge: 9697e9da8429 796baeeef85a
>> Author: Linus Torvalds 
>> Date:   Mon Jan 29 11:51:49 2018 -0800
>>
>> Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block
>>
>> It's hitting this case:
>>
>> if (WARN_ON_ONCE(n != segments)) {   
>>
>>
>> in nvme, indicating that rq->nr_phys_segments does not equal the
>> number of bios in the request. Anyone seen this? I'll take a closer
>> look as time permits, full trace below.
>>
>>
>>  WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 
>> nvme_setup_cmd+0x3d3/0x430
>>  Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc 
>> snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat 
>> snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 
>> snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq 
>> x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb 
>> snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer 
>> videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 
>> crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore 
>> hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd 
>> intel_gtt
>>  CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U   4.15.0+ 
>> #176
>>  Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 
>> 12/19/2017
>>  RIP: 0010:nvme_setup_cmd+0x3d3/0x430
>>  RSP: 0018:880423e9f838 EFLAGS: 00010217
>>  RAX:  RBX: 880423e9f8c8 RCX: 0001
>>  RDX: 88022b200010 RSI: 0002 RDI: 327f
>>  RBP: 880421251400 R08: 88022b20 R09: 0009
>>  R10:  R11:  R12: 
>>  R13: 88042341e280 R14:  R15: 880421251440
>>  FS:  () GS:88044150() knlGS:
>>  CS:  0010 DS:  ES:  CR0: 80050033
>>  CR2: 55b684795030 CR3: 02e09006 CR4: 001606e0
>>  DR0:  DR1:  DR2: 
>>  DR3:  DR6: fffe0ff0 DR7: 0400
>>  Call Trace:
>>   nvme_queue_rq+0x40/0xa00
>>   ? __sbitmap_queue_get+0x24/0x90
>>   ? blk_mq_get_tag+0xa3/0x250
>>   ? wait_woken+0x80/0x80
>>   ? blk_mq_get_driver_tag+0x97/0xf0
>>   blk_mq_dispatch_rq_list+0x7b/0x4a0
>>   ? deadline_remove_request+0x49/0xb0
>>   blk_mq_do_dispatch_sched+0x4f/0xc0
>>   blk_mq_sched_dispatch_requests+0x106/0x170
>>   __blk_mq_run_hw_queue+0x53/0xa0
>>   __blk_mq_delay_run_hw_queue+0x83/0xa0
>>   blk_mq_run_hw_queue+0x6c/0xd0
>>   blk_mq_sched_insert_request+0x96/0x140
>>   __blk_mq_try_issue_directly+0x3d/0x190
>>   blk_mq_try_issue_directly+0x30/0x70
>>   blk_mq_make_request+0x1a4/0x6a0
>>   generic_make_request+0xfd/0x2f0
>>   ? submit_bio+0x5c/0x110
>>   submit_bio+0x5c/0x110
>>   ? __blkdev_issue_discard+0x152/0x200
>>   submit_bio_wait+0x43/0x60
>>   ext4_process_freed_data+0x1cd/0x440
>>   ? account_page_dirtied+0xe2/0x1a0
>>   ext4_journal_commit_callback+0x4a/0xc0
>>   jbd2_journal_commit_transaction+0x17e2/0x19e0
>>   ? kjournald2+0xb0/0x250
>>   kjournald2+0xb0/0x250
>>   ? wait_woken+0x80/0x80
>>   ? commit_timeout+0x10/0x10
>>   kthread+0x111/0x130
>>   ? kthread_create_worker_on_cpu+0x50/0x50
>>   ? do_group_exit+0x3a/0xa0
>>   ret_from_fork+0x1f/0x30
>>  Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 
>> 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 
>> 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff 
>>  ---[ end trace 50d361cc444506c8 ]---
>>  print_req_error: I/O error, dev nvme0n1, sector 847167488
> 
> Looking at the disassembly, 'n' is 2 and 'segments' is 0x.
> 

Let's consider the following case:

blk_mq_bio_to_request()
  -> blk_init_request_from_bio()
-> blk_rq_bio_prep()

if (bio_has_data(bio))
rq->nr_phys_segments = bio_phys_segments(q, bio);

static inline bool bio_has_data(struct bio *bio)
{
if (bio &&
bio->bi_iter.bi_size &&
bio_op(bio) != REQ_OP_DISCARD &&   //-> HERE !
bio_op(bio) != REQ_OP_SECURE_ERASE &&
bio_op(bio) != REQ_OP_WRITE_ZEROES)
return true;

return false;
}
For the DISCARD req, the nr_phys_segments is 0.

dd_insert_request()
  -> blk_mq_sched_try_insert_merge()
-> elv_attempt_insert_merge()
  -> blk_attempt_req_merge()
-> attempt_merge()
  -> ll_merge_requests_fn()

total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
// total_phys_segments will be 0
if (blk
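
The analysis is cut off above, but the arithmetic it points at is easy
to demonstrate in isolation: a DISCARD bio carries no data, so
nr_phys_segments stays 0 even after two discard requests are merged,
while the merged request still holds two bios. The sketch below also
shows how a subtraction from that 0 wraps a 16-bit field to 0xffff,
which would match the odd 'segments' value Jens pulled out of the
disassembly; whether nvme derives its count exactly this way is an
assumption:

===
#include <stdio.h>

int main(void)
{
	/* rq->nr_phys_segments is an unsigned short; per the analysis
	 * above it stays 0 for a merged multi-bio DISCARD request. */
	unsigned short nr_phys_segments = 0;

	int n = 2;	/* bios actually hanging off the merged request */

	/* any "nr_phys_segments - 1" style derivation wraps the u16 */
	unsigned short segments = nr_phys_segments - 1;

	printf("n = %d, segments = 0x%x\n", n, segments);
	if (n != segments)
		printf("WARN_ON_ONCE(n != segments) would fire\n");
	return 0;
}
===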

Re: [PATCH] Use bio_endio instead of bio_put in error path of blk_rq_append_bio

2018-01-30 Thread Ming Lei
On Tue, Jan 30, 2018 at 04:24:14PM +0100, Jiri Palecek wrote:
> 
> On 1/30/18 1:53 PM, Ming Lei wrote:
> > On Thu, Jan 25, 2018 at 9:58 PM, Jiří Paleček  wrote:
> > >   Avoids page leak from bounced requests
> > > ---
> > >   block/blk-map.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/block/blk-map.c b/block/blk-map.c
> > > index d3a94719f03f..702d68166689 100644
> > > --- a/block/blk-map.c
> > > +++ b/block/blk-map.c
> > > @@ -26,7 +26,8 @@ int blk_rq_append_bio(struct request *rq, struct bio 
> > > **bio)
> > >  } else {
> > >  if (!ll_back_merge_fn(rq->q, rq, *bio)) {
> > >  if (orig_bio != *bio) {
> > > -   bio_put(*bio);
> > > +   bio_inc_remaining(orig_bio);
> > > +   bio_endio(*bio);
> > 'bio_inc_remaining(orig_bio);' shouldn't be needed since we don't
> > chain the bounced bio; otherwise this patch is fine.
> 
> I believe it is needed, or at least desirable. The situation when the
> request is bounced is like this:
> 
> bio (bounced) .bi_private ---> orig_bio
> 
> and at the end of bounce_end_io(), bio_endio(bio->bi_private) [which is
> bio_endio(orig_bio) in our case] is called. That doesn't have any effect
> on __blk_rq_map_user_iov(); its bios have .bi_end_io == NULL, therefore
> one call more or less doesn't matter. However, for other callers, like
> osd_initiator.c, this is not the case. They pass bios which have
> bi_end_io set, and might be surprised if it was called unexpectedly.

OK, I think it is good to follow the rule of not calling .bi_end_io() in
the failure path, just like __blk_rq_map_user_iov()/blk_rq_map_kern().

But it seems pscsi_map_sg() depends on bio_endio(orig_bio) to free the
'orig_bio'; could you fix that in this patch too?

> 
> Before caa4b02476e3, blk_rq_append_bio() wouldn't touch/delete the
> passed bio at all under any circumstances. I believe it should stay that
> way, and incrementing the remaining counter, which effectively nullifies
> the extra bio_endio() call, does that pretty straightforwardly.

It seems too tricky to use bio_inc_remaining() to avoid
bio_endio(orig_bio); if we have to do that, I suggest adding a comment
explaining it.

Thanks,
Ming
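
For reference, the mechanism Jiří's fix leans on is bio's remaining
counter: bio_endio() only invokes ->bi_end_io once the counter drops to
zero, so a bio_inc_remaining() beforehand turns exactly one extra
bio_endio() call into a no-op. A minimal userspace model of that
behavior (the toy_* names are hypothetical, not the kernel's struct
bio):

===
#include <stdio.h>

struct toy_bio {
	int remaining;				/* models bio->__bi_remaining */
	void (*end_io)(struct toy_bio *);	/* models bio->bi_end_io */
};

static void toy_bio_endio(struct toy_bio *bio)
{
	if (--bio->remaining > 0)
		return;			/* swallowed: a reference was held */
	if (bio->end_io)
		bio->end_io(bio);
}

static void toy_inc_remaining(struct toy_bio *bio)
{
	bio->remaining++;
}

static void owner_end_io(struct toy_bio *bio)
{
	(void)bio;
	printf("orig_bio completion ran\n");
}

int main(void)
{
	struct toy_bio orig = { .remaining = 1, .end_io = owner_end_io };

	/* neutralize the one extra endio the error path will trigger */
	toy_inc_remaining(&orig);
	toy_bio_endio(&orig);	/* prints nothing */

	/* the owner's completion still fires exactly once later */
	toy_bio_endio(&orig);	/* prints the message */
	return 0;
}
===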