Re: usercopy whitelist woe in scsi_sense_cache

2018-04-10 Thread Kees Cook
On Tue, Apr 10, 2018 at 10:16 AM, Oleksandr Natalenko
 wrote:
> Hi, Kees, Paolo et al.
>
> 10.04.2018 08:53, Kees Cook wrote:
>>
>> Unfortunately I only had a single hang with no dumps. I haven't been
>> able to reproduce it since. :(
>
>
> For your convenience I've prepared a VM that contains a reproducer.

Awesome. :)

> Under the /root folder there is a reproducer script (reproducer.sh). It does
> trivial things like enabling sysrq, opening LUKS device, mounting a volume,
> running a background I/O (this is an important part, actually, since I
> wasn't able to trigger the issue without the background I/O) and, finally,
> running the smartctl in a loop. If you are lucky, within a minute or two
> you'll get the first warning followed shortly by subsequent bugs and I/O
> stall (htop is pre-installed for your convenience too).

Yup!

[   27.729498] Bad or missing usercopy whitelist? Kernel memory
exposure attempt detected from SLUB object 'scsi_sense_cache' (offset
76, size 22)!

I'll see about booting with my own kernels, etc, and try to narrow this down. :)

-Kees

-- 
Kees Cook
Pixel Security


Re: Hang with blk-mq map series (block/008)

2018-04-10 Thread Jens Axboe
On 4/10/18 8:30 PM, Ming Lei wrote:
> On Tue, Apr 10, 2018 at 08:08:43PM -0600, Jens Axboe wrote:
>> On 4/10/18 8:02 PM, Ming Lei wrote:
>>> On Tue, Apr 10, 2018 at 09:51:41AM -0600, Jens Axboe wrote:
 Hi Ming,

 Ran the above blktests on the current tree, and we end up getting
 a hang that we never recover from. There's on request perpetually
 stuck:

 root@dell:/sys/kernel/debug/block/nvme0n1/hctx2# cat busy 
 5e2b09fe {.op=READ, .cmd_flags=, .rq_flags=DONTPREP|IO_STAT|STATS, 
 .state=in_flight, .tag=313, .internal_tag=-1}

 and no amount of manual running or kicking the queue will bring it
 back to life. If I run it on 'master', it works fine.

 Did you run blktests on your series?
>>>
>>> Not yet.
>>>
>>> Will take a look at it, but this seems related with driver, since
>>> the request has been submitted to hardware already.
>>
>> It might be, but it repeatedly failed for me on for-linus, and not on
>> master. I can try doing some more runs just to gain more confidence.
> 
> Looks it is one IO vs CPU hotplug test, which was done before posting.
> Not reproduce it on NVMe inside KVM.
> 
> Will install a kernel and run it on real HW.

Sounds good. I'll do some more testing tomorrow to ensure that my
report was sound.

> BTW, could you share your lscpu inf
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):48
On-line CPU(s) list:   0-47
Thread(s) per core:2
Core(s) per socket:12
Socket(s): 2
NUMA node(s):  2
Vendor ID: GenuineIntel
CPU family:6
Model: 79
Model name:Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping:  1
CPU MHz:   1201.251
CPU max MHz:   2900.
CPU min MHz:   1200.
BogoMIPS:  4394.61
Virtualization:VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  30720K
NUMA node0 CPU(s): 
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s): 
1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est 
tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt 
aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 
cdp_l3 invpcid_single tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust 
bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt 
xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln 
pts


-- 
Jens Axboe



Re: Hang with blk-mq map series (block/008)

2018-04-10 Thread Ming Lei
On Tue, Apr 10, 2018 at 08:08:43PM -0600, Jens Axboe wrote:
> On 4/10/18 8:02 PM, Ming Lei wrote:
> > On Tue, Apr 10, 2018 at 09:51:41AM -0600, Jens Axboe wrote:
> >> Hi Ming,
> >>
> >> Ran the above blktests on the current tree, and we end up getting
> >> a hang that we never recover from. There's on request perpetually
> >> stuck:
> >>
> >> root@dell:/sys/kernel/debug/block/nvme0n1/hctx2# cat busy 
> >> 5e2b09fe {.op=READ, .cmd_flags=, .rq_flags=DONTPREP|IO_STAT|STATS, 
> >> .state=in_flight, .tag=313, .internal_tag=-1}
> >>
> >> and no amount of manual running or kicking the queue will bring it
> >> back to life. If I run it on 'master', it works fine.
> >>
> >> Did you run blktests on your series?
> > 
> > Not yet.
> > 
> > Will take a look at it, but this seems related with driver, since
> > the request has been submitted to hardware already.
> 
> It might be, but it repeatedly failed for me on for-linus, and not on
> master. I can try doing some more runs just to gain more confidence.

Looks it is one IO vs CPU hotplug test, which was done before posting.
Not reproduce it on NVMe inside KVM.

Will install a kernel and run it on real HW.

BTW, could you share your lscpu info?

Thanks,
Ming


Re: [PATCH] blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 20:10 -0600, Jens Axboe wrote:
> On 4/10/18 8:05 PM, Ming Lei wrote:
> > On Tue, Apr 10, 2018 at 02:30:35PM -0600, Bart Van Assche wrote:
> > > Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
> > > it is no longer safe to access cgroup information during or after the
> > > blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
> > > call with blk_queue_enter() / blk_queue_exit().
> > > 
> > > Reported-by: Ming Lei 
> > > Fixes: a063057d7c73 ("block: Fix a race between request queue removal and 
> > > the block cgroup controller")
> > > Signed-off-by: Bart Van Assche 
> > > Cc: Ming Lei 
> > > Cc: Joseph Qi 
> > > ---
> > >  block/blk-core.c | 32 ++--
> > >  1 file changed, 26 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > index 34e2f2227fd9..a330cd2829e1 100644
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -2386,8 +2386,19 @@ blk_qc_t generic_make_request(struct bio *bio)
> > >* yet.
> > >*/
> > >   struct bio_list bio_list_on_stack[2];
> > > + blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
> > > + BLK_MQ_REQ_NOWAIT : 0;
> > > + struct request_queue *q = bio->bi_disk->queue;
> > >   blk_qc_t ret = BLK_QC_T_NONE;
> > >  
> > > + if (blk_queue_enter(q, flags) < 0) {
> > 
> > As I mentioned last time, the queue pointer has to be checked before
> > calling blk_queue_enter(), since it isn't difficult to trigger the
> > failure log of 'generic_make_request: Trying to access nonexistent
> > block-device'.
> 
> That's a good point, the NULL check needs to be included first. Bart,
> do you want to send an update? I can just rebase the for-linus branch.

But how could the request pointer be NULL? Which code clears that pointer?

Thanks,

Bart.





Re: [PATCH v2] blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

2018-04-10 Thread Bart Van Assche
On Wed, 2018-04-11 at 10:12 +0800, Ming Lei wrote:
> On Tue, Apr 10, 2018 at 02:45:54PM -0600, Bart Van Assche wrote:
> > +   struct request_queue *q = bio->bi_disk->queue;
> > blk_qc_t ret = BLK_QC_T_NONE;
> >  
> > +   if (blk_queue_enter(q, flags) < 0) {
> 
> Same issue with V1, the queue pointer has to be checked before calling
> blk_queue_enter().

I think it's the responsibility of the caller to keep a reference on the
request queue.

Bart.




Re: [PATCH v2] blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

2018-04-10 Thread Ming Lei
On Tue, Apr 10, 2018 at 02:45:54PM -0600, Bart Van Assche wrote:
> Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
> it is no longer safe to access cgroup information during or after the
> blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
> call with blk_queue_enter() / blk_queue_exit().
> 
> Reported-by: Ming Lei 
> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the 
> block cgroup controller")
> Signed-off-by: Bart Van Assche 
> Cc: Ming Lei 
> Cc: Joseph Qi 
> ---
> 
> Changes compared to v1: changed the blk_queue_exit() inside the loop with "if 
> (q)".
> 
>  block/blk-core.c | 33 +++--
>  1 file changed, 27 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 34e2f2227fd9..181b1a688a5b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2386,8 +2386,19 @@ blk_qc_t generic_make_request(struct bio *bio)
>* yet.
>*/
>   struct bio_list bio_list_on_stack[2];
> + blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
> + BLK_MQ_REQ_NOWAIT : 0;
> + struct request_queue *q = bio->bi_disk->queue;
>   blk_qc_t ret = BLK_QC_T_NONE;
>  
> + if (blk_queue_enter(q, flags) < 0) {

Same issue with V1, the queue pointer has to be checked before calling
blk_queue_enter().

-- 
Ming


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-10 Thread Ming Lei
On Tue, Apr 10, 2018 at 03:01:57PM -0600, Bart Van Assche wrote:
> The blk-mq timeout handling code ignores completions that occur after
> blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
> has reset rq->aborted_gstate. If a block driver timeout handler always
> returns BLK_EH_RESET_TIMER then the result will be that the request
> never terminates.

Under this situation:

IMO, if this request has been handled by driver's irq handler, and if
driver's .timeout still returns BLK_EH_RESET_TIMER, it is driver's bug,
and the correct return value should be BLK_EH_HANDLED.

-- 
Ming


Re: [PATCH] blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

2018-04-10 Thread Jens Axboe
On 4/10/18 8:05 PM, Ming Lei wrote:
> On Tue, Apr 10, 2018 at 02:30:35PM -0600, Bart Van Assche wrote:
>> Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
>> it is no longer safe to access cgroup information during or after the
>> blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
>> call with blk_queue_enter() / blk_queue_exit().
>>
>> Reported-by: Ming Lei 
>> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and 
>> the block cgroup controller")
>> Signed-off-by: Bart Van Assche 
>> Cc: Ming Lei 
>> Cc: Joseph Qi 
>> ---
>>  block/blk-core.c | 32 ++--
>>  1 file changed, 26 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 34e2f2227fd9..a330cd2829e1 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2386,8 +2386,19 @@ blk_qc_t generic_make_request(struct bio *bio)
>>   * yet.
>>   */
>>  struct bio_list bio_list_on_stack[2];
>> +blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
>> +BLK_MQ_REQ_NOWAIT : 0;
>> +struct request_queue *q = bio->bi_disk->queue;
>>  blk_qc_t ret = BLK_QC_T_NONE;
>>  
>> +if (blk_queue_enter(q, flags) < 0) {
> 
> As I mentioned last time, the queue pointer has to be checked before
> calling blk_queue_enter(), since it isn't difficult to trigger the
> failure log of 'generic_make_request: Trying to access nonexistent
> block-device'.

That's a good point, the NULL check needs to be included first. Bart,
do you want to send an update? I can just rebase the for-linus branch.

-- 
Jens Axboe



Re: Hang with blk-mq map series (block/008)

2018-04-10 Thread Jens Axboe
On 4/10/18 8:02 PM, Ming Lei wrote:
> On Tue, Apr 10, 2018 at 09:51:41AM -0600, Jens Axboe wrote:
>> Hi Ming,
>>
>> Ran the above blktests on the current tree, and we end up getting
>> a hang that we never recover from. There's on request perpetually
>> stuck:
>>
>> root@dell:/sys/kernel/debug/block/nvme0n1/hctx2# cat busy 
>> 5e2b09fe {.op=READ, .cmd_flags=, .rq_flags=DONTPREP|IO_STAT|STATS, 
>> .state=in_flight, .tag=313, .internal_tag=-1}
>>
>> and no amount of manual running or kicking the queue will bring it
>> back to life. If I run it on 'master', it works fine.
>>
>> Did you run blktests on your series?
> 
> Not yet.
> 
> Will take a look at it, but this seems related with driver, since
> the request has been submitted to hardware already.

It might be, but it repeatedly failed for me on for-linus, and not on
master. I can try doing some more runs just to gain more confidence.

-- 
Jens Axboe



Re: [PATCH] blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

2018-04-10 Thread Ming Lei
On Tue, Apr 10, 2018 at 02:30:35PM -0600, Bart Van Assche wrote:
> Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
> it is no longer safe to access cgroup information during or after the
> blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
> call with blk_queue_enter() / blk_queue_exit().
> 
> Reported-by: Ming Lei 
> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the 
> block cgroup controller")
> Signed-off-by: Bart Van Assche 
> Cc: Ming Lei 
> Cc: Joseph Qi 
> ---
>  block/blk-core.c | 32 ++--
>  1 file changed, 26 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 34e2f2227fd9..a330cd2829e1 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2386,8 +2386,19 @@ blk_qc_t generic_make_request(struct bio *bio)
>* yet.
>*/
>   struct bio_list bio_list_on_stack[2];
> + blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
> + BLK_MQ_REQ_NOWAIT : 0;
> + struct request_queue *q = bio->bi_disk->queue;
>   blk_qc_t ret = BLK_QC_T_NONE;
>  
> + if (blk_queue_enter(q, flags) < 0) {

As I mentioned last time, the queue pointer has to be checked before
calling blk_queue_enter(), since it isn't difficult to trigger the
failure log of 'generic_make_request: Trying to access nonexistent
block-device'.

-- 
Ming


Re: Hang with blk-mq map series (block/008)

2018-04-10 Thread Ming Lei
On Tue, Apr 10, 2018 at 09:51:41AM -0600, Jens Axboe wrote:
> Hi Ming,
> 
> Ran the above blktests on the current tree, and we end up getting
> a hang that we never recover from. There's on request perpetually
> stuck:
> 
> root@dell:/sys/kernel/debug/block/nvme0n1/hctx2# cat busy 
> 5e2b09fe {.op=READ, .cmd_flags=, .rq_flags=DONTPREP|IO_STAT|STATS, 
> .state=in_flight, .tag=313, .internal_tag=-1}
> 
> and no amount of manual running or kicking the queue will bring it
> back to life. If I run it on 'master', it works fine.
> 
> Did you run blktests on your series?

Not yet.

Will take a look at it, but this seems related with driver, since
the request has been submitted to hardware already.

Thanks,
Ming


Re: 4.15.14 crash with iscsi target and dvd

2018-04-10 Thread Wakko Warner
Ming Lei wrote:
> Sure, thanks for your sharing.
> 
> Wakko, could you test the following patch and see if there is any
> difference?
> 
> --
> diff --git a/drivers/target/target_core_pscsi.c 
> b/drivers/target/target_core_pscsi.c
> index 0d99b242e82e..6147178f1f37 100644
> --- a/drivers/target/target_core_pscsi.c
> +++ b/drivers/target/target_core_pscsi.c
> @@ -888,7 +888,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist *sgl, 
> u32 sgl_nents,
>   if (len > 0 && data_len > 0) {
>   bytes = min_t(unsigned int, len, PAGE_SIZE - off);
>   bytes = min(bytes, data_len);
> -
> + new_bio:
>   if (!bio) {
>   nr_vecs = min_t(int, BIO_MAX_PAGES, nr_pages);
>   nr_pages -= nr_vecs;
> @@ -931,6 +931,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist *sgl, 
> u32 sgl_nents,
>* be allocated with pscsi_get_bio() above.
>*/
>   bio = NULL;
> + goto new_bio;
>   }
>  
>   data_len -= bytes;

Sorry for the delay.  I reverted my change, added this one.  I didn't
reboot, I just unloaded and loaded this one.
Note: /dev/sr1 as seen from the initiator is /dev/sr0 (physical disc) on the
target.

Doesn't crash, however on the initiator I see this:
[9273849.70] ISO 9660 Extensions: RRIP_1991A
[9273863.359718] scsi_io_completion: 13 callbacks suppressed
[9273863.359788] sr 26:0:0:0: [sr1] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 
driverbyte=0x08
[9273863.359909] sr 26:0:0:0: [sr1] tag#1 Sense Key : 0x2 [current] 
[9273863.359974] sr 26:0:0:0: [sr1] tag#1 ASC=0x8 ASCQ=0x0 
[9273863.360036] sr 26:0:0:0: [sr1] tag#1 CDB: opcode=0x28 28 00 00 22 f6 96 00 
00 80 00
[9273863.360116] blk_update_request: 13 callbacks suppressed
[9273863.360177] blk_update_request: I/O error, dev sr1, sector 9165400
[9273875.864648] sr 26:0:0:0: [sr1] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 
driverbyte=0x08
[9273875.864738] sr 26:0:0:0: [sr1] tag#1 Sense Key : 0x2 [current] 
[9273875.864801] sr 26:0:0:0: [sr1] tag#1 ASC=0x8 ASCQ=0x0 
[9273875.864890] sr 26:0:0:0: [sr1] tag#1 CDB: opcode=0x28 28 00 00 22 f7 16 00 
00 80 00
[9273875.864971] blk_update_request: I/O error, dev sr1, sector 9165912

To cause this, I mounted the dvd as seen in the first line and ran this
command: find /cdrom2 -type f | xargs -tn1 cat > /dev/null
I did some various tests.  Each test was done after umount and mount to
clear the cache.
cat  > /dev/null causes the message.
dd if= of=/dev/null bs=2048 doesn't
using bs=4096 doesn't
using bs=64k doesn't
using bs=128k does
cat uses a blocksize of 128k.

The following was done without being mounted.
ddrescue -f -f /dev/sr1 /dev/null 
doesn't cause the message
dd if=/dev/sr1 of=/dev/null bs=128k
doesn't cause the message
using bs=256k causes the message once:
[9275916.857409] sr 27:0:0:0: [sr1] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 
driverbyte=0x08
[9275916.857482] sr 27:0:0:0: [sr1] tag#0 Sense Key : 0x2 [current] 
[9275916.857520] sr 27:0:0:0: [sr1] tag#0 ASC=0x8 ASCQ=0x0 
[9275916.857556] sr 27:0:0:0: [sr1] tag#0 CDB: opcode=0x28 28 00 00 00 00 00 00 
00 80 00
[9275916.857614] blk_update_request: I/O error, dev sr1, sector 0

If I access the disc from the target natively either by mounting and
accessing files or working with the device directly (ie dd) no errors are
logged on the target.

-- 
 Microsoft has beaten Volkswagen's world record.  Volkswagen only created 22
 million bugs.


Re: [PATCH v3] blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

2018-04-10 Thread Jens Axboe
On 4/10/18 5:02 PM, Bart Van Assche wrote:
> Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
> it is no longer safe to access cgroup information during or after the
> blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
> call with blk_queue_enter() / blk_queue_exit().

Looks good, applied.

-- 
Jens Axboe



[PATCH v3] blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

2018-04-10 Thread Bart Van Assche
Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
it is no longer safe to access cgroup information during or after the
blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
call with blk_queue_enter() / blk_queue_exit().

Reported-by: Ming Lei 
Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the 
block cgroup controller")
Signed-off-by: Bart Van Assche 
Cc: Ming Lei 
Cc: Joseph Qi 
---

Changes compared to v2: converted two ternary expressions into if-statements.

Changes compared to v1: guarded the blk_queue_exit() inside the loop with "if 
(q)".

 block/blk-core.c | 35 +--
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 34e2f2227fd9..39308e874ffa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2386,8 +2386,20 @@ blk_qc_t generic_make_request(struct bio *bio)
 * yet.
 */
struct bio_list bio_list_on_stack[2];
+   blk_mq_req_flags_t flags = 0;
+   struct request_queue *q = bio->bi_disk->queue;
blk_qc_t ret = BLK_QC_T_NONE;
 
+   if (bio->bi_opf & REQ_NOWAIT)
+   flags = BLK_MQ_REQ_NOWAIT;
+   if (blk_queue_enter(q, flags) < 0) {
+   if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
+   bio_wouldblock_error(bio);
+   else
+   bio_io_error(bio);
+   return ret;
+   }
+
if (!generic_make_request_checks(bio))
goto out;
 
@@ -2424,11 +2436,22 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_init(_list_on_stack[0]);
current->bio_list = bio_list_on_stack;
do {
-   struct request_queue *q = bio->bi_disk->queue;
-   blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
-   BLK_MQ_REQ_NOWAIT : 0;
+   bool enter_succeeded = true;
+
+   if (unlikely(q != bio->bi_disk->queue)) {
+   if (q)
+   blk_queue_exit(q);
+   q = bio->bi_disk->queue;
+   flags = 0;
+   if (bio->bi_opf & REQ_NOWAIT)
+   flags = BLK_MQ_REQ_NOWAIT;
+   if (blk_queue_enter(q, flags) < 0) {
+   enter_succeeded = false;
+   q = NULL;
+   }
+   }
 
-   if (likely(blk_queue_enter(q, flags) == 0)) {
+   if (enter_succeeded) {
struct bio_list lower, same;
 
/* Create a fresh bio_list for all subordinate requests 
*/
@@ -2436,8 +2459,6 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_init(_list_on_stack[0]);
ret = q->make_request_fn(q, bio);
 
-   blk_queue_exit(q);
-
/* sort new bios into those for a lower level
 * and those for the same level
 */
@@ -2464,6 +2485,8 @@ blk_qc_t generic_make_request(struct bio *bio)
current->bio_list = NULL; /* deactivate */
 
 out:
+   if (q)
+   blk_queue_exit(q);
return ret;
 }
 EXPORT_SYMBOL(generic_make_request);
-- 
2.16.2



Re: [PATCH v2] blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

2018-04-10 Thread Jens Axboe
On 4/10/18 2:45 PM, Bart Van Assche wrote:
> Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
> it is no longer safe to access cgroup information during or after the
> blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
> call with blk_queue_enter() / blk_queue_exit().

This looks better. But can we please get rid of the ternary? I hate it
with a vengeance, especially when it ends up spanning multiple lines.
This:

flags = 0;
if (bio->bi_opf & REQ_NOWAIT)
flags = BLK_MQ_REQ_NOWAIT;

is so much more readable.

-- 
Jens Axboe



Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
On Tue, Apr 10, 2018 at 09:46:23PM +, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 14:33 -0700, t...@kernel.org wrote:
> > +   else
> > +   rq->missed_completion = true;
> 
> In this patch I found code that sets rq->missed_completion but no code that
> clears it. Did I perhaps miss something?

Ah, yeah, I was moving it out of add_timer but forgot to actully add
it to the issue path.  Fixed patch below.

BTW, no matter what we do w/ the request handover between normal and
timeout paths, we'd need something similar.  Otherwise, while we can
reduce the window, we can't get rid of it.

Thanks.

---
 block/blk-mq.c |   45 -
 include/linux/blkdev.h |2 ++
 2 files changed, 34 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, _idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
 
blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
blk_add_timer(rq);
+   rq->missed_completion = false;
 
write_seqcount_end(>gstate_seq);
preempt_enable();
@@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct
}
 }
 
-static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
 {
struct blk_mq_hw_ctx *hctx;
bool has_rcu = false;
@@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
else
synchronize_srcu(hctx->srcu);
 
-   hctx->need_sync_rcu = false;
+   if (clear)
+   hctx->need_sync_rcu = false;
}
if (has_rcu)
synchronize_rcu();
@@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
blk_mq_rq_timed_out(hctx, rq, priv, reserved);
 }
 
-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
 {
/*
 * @rq's timer reset has gone through rcu synchronization and is
 * visible now.  Allow normal completions again by resetting
 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-* there's no memory ordering around ->aborted_gstate making it the
-* only field safe to update.  Let blk_add_timer() clear it later
-* when the request is recycled or times out again.
-*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored completions
-* and further spurious timeouts.
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* memory ordering around ->aborted_gstate making it the only field
+* safe to update.  Let blk_add_timer() clear it later when the
+* request is recycled or times out again.
 */
if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq is now fully returned to the normal path.  If normal
+* completion raced timeout handling, execute the missed completion
+* here.  This is safe because 1. ->missed_completion can no longer
+* be asserted because nothing is timing out right now and 2. if
+* ->missed_completion is set, @rq is safe from recycling because
+* nobody could have completed it.
+*/
+   if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+   blk_mq_complete_request(rq);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
struct request_queue *q =
@@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
 * becomes a problem, we can add per-hw_ctx rcu_head and
 * wait in parallel.
 */
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, true);
 
/* terminate the ones we won */
blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
 * reset racing against recycling.
 */
if (nr_resets) {
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, false);
+   blk_mq_queue_tag_busy_iter(q,
+   blk_mq_timeout_reset_return, NULL);
+

Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 14:33 -0700, t...@kernel.org wrote:
> +   else
> +   rq->missed_completion = true;

In this patch I found code that sets rq->missed_completion but no code that
clears it. Did I perhaps miss something?

Thanks,

Bart.




Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
Hello,

On Tue, Apr 10, 2018 at 08:40:43AM -0700, t...@kernel.org wrote:
> On Tue, Apr 10, 2018 at 11:38:18PM +0800, Ming Lei wrote:
> > I agree, the issue should be in driver's irq handler and .timeout in
> > theory.
> > 
> > For example, even though one request has been done by irq handler, .timeout
> > still may return RESET_TIMER.
> 
> blk-mq can use a separate flag to track normal completions during
> timeout and complete the request normally on RESET_TIMER if the flag
> is set while EH was in progress.  With a bit of care, we'd be able to
> plug the race completely.

So, something like the following.  Just compile tested.  The extra
dedicated field is kinda unfortunate but we need something like that
no matter how we do this because the involved completion marking
violates the ownership (we want to record the normail completion while
timeout handling is in progress).  Alternatively, we can add an atomic
flags field.  The addition doesn't change the size of struct request
because it fits in a hole but that hole seems removeable, so...

Anyways, it'd be great if someone can test this combined with the
previous two rcu sync patches.

Thanks.
---
 block/blk-mq.c |   44 +++-
 include/linux/blkdev.h |2 ++
 2 files changed, 33 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, _idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -881,7 +883,7 @@ static void blk_mq_check_expired(struct
}
 }
 
-static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
 {
struct blk_mq_hw_ctx *hctx;
bool has_rcu = false;
@@ -896,7 +898,8 @@ static void blk_mq_timeout_sync_rcu(stru
else
synchronize_srcu(hctx->srcu);
 
-   hctx->need_sync_rcu = false;
+   if (clear)
+   hctx->need_sync_rcu = false;
}
if (has_rcu)
synchronize_rcu();
@@ -917,25 +920,37 @@ static void blk_mq_terminate_expired(str
blk_mq_rq_timed_out(hctx, rq, priv, reserved);
 }
 
-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
 {
/*
 * @rq's timer reset has gone through rcu synchronization and is
 * visible now.  Allow normal completions again by resetting
 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-* there's no memory ordering around ->aborted_gstate making it the
-* only field safe to update.  Let blk_add_timer() clear it later
-* when the request is recycled or times out again.
-*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored completions
-* and further spurious timeouts.
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* memory ordering around ->aborted_gstate making it the only field
+* safe to update.  Let blk_add_timer() clear it later when the
+* request is recycled or times out again.
 */
if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq is now fully returned to the normal path.  If normal
+* completion raced timeout handling, execute the missed completion
+* here.  This is safe because 1. ->missed_completion can no longer
+* be asserted because nothing is timing out right now and 2. if
+* ->missed_completion is set, @rq is safe from recycling because
+* nobody could have completed it.
+*/
+   if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+   blk_mq_complete_request(rq);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
struct request_queue *q =
@@ -976,7 +991,7 @@ static void blk_mq_timeout_work(struct w
 * becomes a problem, we can add per-hw_ctx rcu_head and
 * wait in parallel.
 */
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, true);
 
/* terminate the ones we won */
blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1003,12 @@ static void blk_mq_timeout_work(struct w
 * reset racing against recycling.
 */

[PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-10 Thread Bart Van Assche
The blk-mq timeout handling code ignores completions that occur after
blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
has reset rq->aborted_gstate. If a block driver timeout handler always
returns BLK_EH_RESET_TIMER then the result will be that the request
never terminates.

Since the request state can be updated from two different contexts,
namely regular completion and request timeout, this race cannot be
fixed with RCU synchronization only. Fix this race as follows:
- Use the deadline instead of the request generation to detect whether
  or not a request timer fired after reinitialization of a request.
- Store the request state in the lowest two bits of the deadline instead
  of the lowest two bits of 'gstate'.
- Rename MQ_RQ_STATE_MASK into RQ_STATE_MASK and change it from an
  enumeration member into a #define such that its type can be changed
  into unsigned long. That allows to write & ~RQ_STATE_MASK instead of
  ~(unsigned long)RQ_STATE_MASK.
- Remove all request member variables that became superfluous due to
  this change: gstate, aborted_gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to this
  patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the hctx member that became superfluous due to these changes,
  namely nr_expired.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.

Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Sagi Grimberg 
Cc: Israel Rukshin ,
Cc: Max Gurtovoy 
Cc:  # v4.16
---

Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important are
  that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
  there is now a nice and clean split between the legacy and blk-mq versions
  of blk_add_timer().
- Changed the patch name and modified the patch description because there is
  disagreement about whether or not the v4.16 blk-mq core can complete a
  single request twice. Kept the "Cc: stable" tag because of
  https://bugzilla.kernel.org/show_bug.cgi?id=199077.

Changes compared to v3 (see also 
https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
  v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
  for blk-mq.

Changes compared to v2 
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1 
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
  the __deadline member to encode both the generation and state information.

 block/blk-core.c   |   2 -
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 174 +
 block/blk-mq.h |  65 ++
 block/blk-timeout.c|  89 ++---
 block/blk.h|  13 ++--
 include/linux/blk-mq.h |   1 -
 include/linux/blkdev.h |  30 ++---
 8 files changed, 118 insertions(+), 257 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8625ec929fe5..181b1a688a5b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -200,8 +200,6 @@ void blk_rq_init(struct request_queue *q, struct request 
*rq)
rq->start_time = jiffies;
set_start_time_ns(rq);
rq->part = NULL;
-   seqcount_init(>gstate_seq);
-   u64_stats_init(>aborted_gstate_sync);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 6f72413b6cab..80c7c585769f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -345,7 +345,6 @@ static const char *const rqf_name[] = {
RQF_NAME(STATS),
RQF_NAME(SPECIAL_PAYLOAD),
RQF_NAME(ZONE_WRITE_LOCKED),
-   RQF_NAME(MQ_TIMEOUT_EXPIRED),
RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7816d28b7219..0680977d6d98 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -305,7 +305,6 @@ static struct request *blk_mq_rq_ctx_init(struct 
blk_mq_alloc_data *data,
rq->special = NULL;
/* tag was already set */
rq->extra_len = 0;
-   rq->__deadline = 0;
 
INIT_LIST_HEAD(>timeout_list);
rq->timeout = 0;
@@ -481,7 +480,8 @@ void blk_mq_free_request(struct request *rq)
if (blk_rq_rl(rq))
blk_put_rl(blk_rq_rl(rq));
 
-   blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+   if (!blk_mq_change_rq_state(rq, 

[PATCH v2] blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

2018-04-10 Thread Bart Van Assche
Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
it is no longer safe to access cgroup information during or after the
blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
call with blk_queue_enter() / blk_queue_exit().

Reported-by: Ming Lei 
Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the 
block cgroup controller")
Signed-off-by: Bart Van Assche 
Cc: Ming Lei 
Cc: Joseph Qi 
---

Changes compared to v1: changed the blk_queue_exit() inside the loop with "if 
(q)".

 block/blk-core.c | 33 +++--
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 34e2f2227fd9..181b1a688a5b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2386,8 +2386,19 @@ blk_qc_t generic_make_request(struct bio *bio)
 * yet.
 */
struct bio_list bio_list_on_stack[2];
+   blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
+   BLK_MQ_REQ_NOWAIT : 0;
+   struct request_queue *q = bio->bi_disk->queue;
blk_qc_t ret = BLK_QC_T_NONE;
 
+   if (blk_queue_enter(q, flags) < 0) {
+   if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
+   bio_wouldblock_error(bio);
+   else
+   bio_io_error(bio);
+   return ret;
+   }
+
if (!generic_make_request_checks(bio))
goto out;
 
@@ -2424,11 +2435,21 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_init(_list_on_stack[0]);
current->bio_list = bio_list_on_stack;
do {
-   struct request_queue *q = bio->bi_disk->queue;
-   blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
-   BLK_MQ_REQ_NOWAIT : 0;
+   bool enter_succeeded = true;
+
+   if (unlikely(q != bio->bi_disk->queue)) {
+   if (q)
+   blk_queue_exit(q);
+   q = bio->bi_disk->queue;
+   flags = bio->bi_opf & REQ_NOWAIT ? BLK_MQ_REQ_NOWAIT :
+   0;
+   if (blk_queue_enter(q, flags) < 0) {
+   enter_succeeded = false;
+   q = NULL;
+   }
+   }
 
-   if (likely(blk_queue_enter(q, flags) == 0)) {
+   if (enter_succeeded) {
struct bio_list lower, same;
 
/* Create a fresh bio_list for all subordinate requests 
*/
@@ -2436,8 +2457,6 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_init(_list_on_stack[0]);
ret = q->make_request_fn(q, bio);
 
-   blk_queue_exit(q);
-
/* sort new bios into those for a lower level
 * and those for the same level
 */
@@ -2464,6 +2483,8 @@ blk_qc_t generic_make_request(struct bio *bio)
current->bio_list = NULL; /* deactivate */
 
 out:
+   if (q)
+   blk_queue_exit(q);
return ret;
 }
 EXPORT_SYMBOL(generic_make_request);
-- 
2.16.2



Re: [PATCH] blk-mq: Directly schedule q->timeout_work when aborting a request

2018-04-10 Thread Martin Steigerwald
Martin Steigerwald - 10.04.18, 20:43:
> Tejun Heo - 03.04.18, 00:04:
> > Request abortion is performed by overriding deadline to now and
> > scheduling timeout handling immediately.  For the latter part, the
> > code was using mod_timer(timeout, 0) which can't guarantee that the
> > timer runs afterwards.  Let's schedule the underlying work item
> > directly instead.
> > 
> > This fixes the hangs during probing reported by Sitsofe but it isn't
> > yet clear to me how the failure can happen reliably if it's just the
> > above described race condition.
> 
> Compiling a 4.16.1 kernel with that patch to test whether this fixes
> the boot hang I reported in:
> 
> [Possible REGRESSION, 4.16-rc4] Error updating SMART data during
> runtime and boot failures with blk_mq_terminate_expired in backtrace
> https://bugzilla.kernel.org/show_bug.cgi?id=199077

Fails as well, see

https://bugzilla.kernel.org/show_bug.cgi?id=199077#c8

for photo with (part of) backtrace.

> The "Error updating SMART data during runtime" thing I reported there
> as well may still be another (independent) issue.
> 
> > Signed-off-by: Tejun Heo 
> > Reported-by: Sitsofe Wheeler 
> > Reported-by: Meelis Roos 
> > Fixes: 358f70da49d7 ("blk-mq: make blk_abort_request() trigger
> > timeout path") Cc: sta...@vger.kernel.org # v4.16
> > Link:
> > http://lkml.kernel.org/r/CALjAwxh-PVYFnYFCJpGOja+m5SzZ8Sa4J7ohxdK=r8
> > NyOF-EM a...@mail.gmail.com Link:
> > http://lkml.kernel.org/r/alpine.lrh.2.21.1802261049140.4...@math.ut.
> > ee --- Hello,
> > 
> > I don't have the full explanation yet but here's a preliminary
> > patch.
> > 
> > Thanks.
> > 
> >  block/blk-timeout.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> > index a05e367..f0e6e41 100644
> > --- a/block/blk-timeout.c
> > +++ b/block/blk-timeout.c
> > @@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
> > 
> >  * No need for fancy synchronizations.
> >  */
> > 
> > blk_rq_set_deadline(req, jiffies);
> > 
> > -   mod_timer(>q->timeout, 0);
> > +   kblockd_schedule_work(>q->timeout_work);
> > 
> > } else {
> > 
> > if (blk_mark_rq_complete(req))
> > 
> > return;


-- 
Martin




[PATCH] blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash

2018-04-10 Thread Bart Van Assche
Because blkcg_exit_queue() is now called from inside blk_cleanup_queue()
it is no longer safe to access cgroup information during or after the
blk_cleanup_queue() call. Hence protect the generic_make_request_checks()
call with blk_queue_enter() / blk_queue_exit().

Reported-by: Ming Lei 
Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the 
block cgroup controller")
Signed-off-by: Bart Van Assche 
Cc: Ming Lei 
Cc: Joseph Qi 
---
 block/blk-core.c | 32 ++--
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 34e2f2227fd9..a330cd2829e1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2386,8 +2386,19 @@ blk_qc_t generic_make_request(struct bio *bio)
 * yet.
 */
struct bio_list bio_list_on_stack[2];
+   blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
+   BLK_MQ_REQ_NOWAIT : 0;
+   struct request_queue *q = bio->bi_disk->queue;
blk_qc_t ret = BLK_QC_T_NONE;
 
+   if (blk_queue_enter(q, flags) < 0) {
+   if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
+   bio_wouldblock_error(bio);
+   else
+   bio_io_error(bio);
+   return ret;
+   }
+
if (!generic_make_request_checks(bio))
goto out;
 
@@ -2424,11 +2435,20 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_init(_list_on_stack[0]);
current->bio_list = bio_list_on_stack;
do {
-   struct request_queue *q = bio->bi_disk->queue;
-   blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
-   BLK_MQ_REQ_NOWAIT : 0;
+   bool enter_succeeded = true;
 
-   if (likely(blk_queue_enter(q, flags) == 0)) {
+   if (unlikely(q != bio->bi_disk->queue)) {
+   flags = bio->bi_opf & REQ_NOWAIT ? BLK_MQ_REQ_NOWAIT :
+   0;
+   blk_queue_exit(q);
+   q = bio->bi_disk->queue;
+   if (blk_queue_enter(q, flags) < 0) {
+   enter_succeeded = false;
+   q = NULL;
+   }
+   }
+
+   if (enter_succeeded) {
struct bio_list lower, same;
 
/* Create a fresh bio_list for all subordinate requests 
*/
@@ -2436,8 +2456,6 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_init(_list_on_stack[0]);
ret = q->make_request_fn(q, bio);
 
-   blk_queue_exit(q);
-
/* sort new bios into those for a lower level
 * and those for the same level
 */
@@ -2464,6 +2482,8 @@ blk_qc_t generic_make_request(struct bio *bio)
current->bio_list = NULL; /* deactivate */
 
 out:
+   if (q)
+   blk_queue_exit(q);
return ret;
 }
 EXPORT_SYMBOL(generic_make_request);
-- 
2.16.2



Re: [PATCH] blk-mq: Directly schedule q->timeout_work when aborting a request

2018-04-10 Thread Martin Steigerwald
Tejun Heo - 03.04.18, 00:04:
> Request abortion is performed by overriding deadline to now and
> scheduling timeout handling immediately.  For the latter part, the
> code was using mod_timer(timeout, 0) which can't guarantee that the
> timer runs afterwards.  Let's schedule the underlying work item
> directly instead.
> 
> This fixes the hangs during probing reported by Sitsofe but it isn't
> yet clear to me how the failure can happen reliably if it's just the
> above described race condition.

Compiling a 4.16.1 kernel with that patch to test whether this fixes the boot 
hang I reported in:

[Possible REGRESSION, 4.16-rc4] Error updating SMART data during runtime and 
boot failures with blk_mq_terminate_expired in backtrace
https://bugzilla.kernel.org/show_bug.cgi?id=199077

The "Error updating SMART data during runtime" thing I reported there as well 
may still be another (independent) issue.

> Signed-off-by: Tejun Heo 
> Reported-by: Sitsofe Wheeler 
> Reported-by: Meelis Roos 
> Fixes: 358f70da49d7 ("blk-mq: make blk_abort_request() trigger timeout
> path") Cc: sta...@vger.kernel.org # v4.16
> Link:
> http://lkml.kernel.org/r/CALjAwxh-PVYFnYFCJpGOja+m5SzZ8Sa4J7ohxdK=r8NyOF-EM
> a...@mail.gmail.com Link:
> http://lkml.kernel.org/r/alpine.lrh.2.21.1802261049140.4...@math.ut.ee ---
> Hello,
> 
> I don't have the full explanation yet but here's a preliminary patch.
> 
> Thanks.
> 
>  block/blk-timeout.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index a05e367..f0e6e41 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
>* No need for fancy synchronizations.
>*/
>   blk_rq_set_deadline(req, jiffies);
> - mod_timer(>q->timeout, 0);
> + kblockd_schedule_work(>q->timeout_work);
>   } else {
>   if (blk_mark_rq_complete(req))
>   return;
-- 
Martin




Hotplug

2018-04-10 Thread Jens Axboe
Hi,

Been running some tests and I keep running into issues with hotplug.
This looks similar to what Bart posted the other day, but it looks
more deeply rooted than just having to protect the queue in
generic_make_request_checks(). The test run is blktests,
block/001. Current -git doesn't survive it. I've seen at least two
different oopses, pasted below.

[  102.163442] NULL pointer dereference at 0010
[  102.163444] PGD 0 P4D 0 
[  102.163447] Oops:  [#1] PREEMPT SMP
[  102.163449] Modules linked in:
[  102.175540] sr 12:0:0:0: [sr2] scsi-1 drive
[  102.180112]  scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme 
nvme_core sb_edac xl
[  102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2
[  102.191896]  sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash 
lzo_compress zlib_defc
[  102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5
[  102.203475]  igb ahci libahci i2c_algo_bit libata dca [last unloaded: 
crc_t10dif]
[  102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650
[  102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
11/09/2016
[  102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
[  102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292
[  102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: 883ff357bb66
[  102.373220] RDX: 0003 RSI: 7530 RDI: 881fea631000
[  102.381705] RBP:  R08: 881fe4d38400 R09: 
[  102.390185] R10:  R11: 01b6 R12: 085d
[  102.398671] R13: 085d R14: 883ffd9b3790 R15: 
[  102.407156] FS:  7f7dc8e6d8c0() GS:883fff34() 
knlGS:
[  102.417138] CS:  0010 DS:  ES:  CR0: 80050033
[  102.424066] CR2: 0010 CR3: 003ffda98005 CR4: 003606e0
[  102.432545] DR0:  DR1:  DR2: 
[  102.441024] DR3:  DR6: fffe0ff0 DR7: 0400
[  102.449502] Call Trace:
[  102.452744]  ? __invalidate_device+0x48/0x60
[  102.458022]  check_disk_change+0x4c/0x60
[  102.462900]  sr_block_open+0x16/0xd0 [sr_mod]
[  102.468270]  __blkdev_get+0xb9/0x450
[  102.472774]  ? iget5_locked+0x1c0/0x1e0
[  102.477568]  blkdev_get+0x11e/0x320
[  102.481969]  ? bdget+0x11d/0x150
[  102.486083]  ? _raw_spin_unlock+0xa/0x20
[  102.490968]  ? bd_acquire+0xc0/0xc0
[  102.495368]  do_dentry_open+0x1b0/0x320
[  102.500159]  ? inode_permission+0x24/0xc0
[  102.505140]  path_openat+0x4e6/0x1420
[  102.509741]  ? cpumask_any_but+0x1f/0x40
[  102.514630]  ? flush_tlb_mm_range+0xa0/0x120
[  102.519903]  do_filp_open+0x8c/0xf0
[  102.524305]  ? __seccomp_filter+0x28/0x230
[  102.529389]  ? _raw_spin_unlock+0xa/0x20
[  102.534283]  ? __handle_mm_fault+0x7d6/0x9b0
[  102.539559]  ? list_lru_add+0xa8/0xc0
[  102.544157]  ? _raw_spin_unlock+0xa/0x20
[  102.549047]  ? __alloc_fd+0xaf/0x160
[  102.553549]  ? do_sys_open+0x1a6/0x230
[  102.558244]  do_sys_open+0x1a6/0x230
[  102.562742]  do_syscall_64+0x5a/0x100
[  102.567336]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

and this one, similar to Barts:

[ 4676.738069] NULL pointer dereference at 0154
[ 4676.738071] PGD 0 P4D 0 
[ 4676.738073] Oops:  [#1] PREEMPT SMP
[ 4676.738075] Modules linked in: scsi_debug crc_t10dif nvme nvme_core loop 
configfs crct10dif_ge]
[ 4676.859272] CPU: 10 PID: 16598 Comm: systemd-udevd Not tainted 4.16.0+ #651
[ 4676.867525] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
11/09/2016
[ 4676.876765] RIP: 0010:blk_throtl_bio+0x45/0x9b0
[ 4676.882296] RSP: 0018:881ff0c8bb38 EFLAGS: 00010246
[ 4676.888610] RAX:  RBX: 881ffa273a40 RCX: 
[ 4676.897059] RDX: 881ffa273a40 RSI:  RDI: 
[ 4676.905507] RBP:  R08: 881ffa273ac0 R09: 881ff123f458
[ 4676.913955] R10: 881ff0c8bc70 R11: 1000 R12: 
[ 4676.922402] R13: 82600980 R14: 88208113 R15: 
[ 4676.930856] FS:  7fe63e5228c0() GS:881fff74() 
knlGS:
[ 4676.940773] CS:  0010 DS:  ES:  CR0: 80050033
[ 4676.947667] CR2: 0154 CR3: 001fed2a0003 CR4: 003606e0
[ 4676.956116] DR0:  DR1:  DR2: 
[ 4676.964568] DR3:  DR6: fffe0ff0 DR7: 0400
[ 4676.973021] Call Trace:
[ 4676.976229]  generic_make_request_checks+0x640/0x660
[ 4676.982245]  ? kmem_cache_alloc+0x37/0x190
[ 4676.987295]  generic_make_request+0x29/0x2f0
[ 4676.992534]  ? submit_bio+0x5c/0x110
[ 4676.996998]  ? bio_alloc_bioset+0x99/0x1c0
[ 4677.002050]  submit_bio+0x5c/0x110
[ 4677.006317]  ? guard_bio_eod+0x42/0x120
[ 4677.011074]  submit_bh_wbc.isra.49+0x132/0x160
[ 4677.016517]  ? bh_uptodate_or_lock+0x90/0x90
[ 4677.021757]  

Re: usercopy whitelist woe in scsi_sense_cache

2018-04-10 Thread Oleksandr Natalenko

Hi, Kees, Paolo et al.

10.04.2018 08:53, Kees Cook wrote:

Unfortunately I only had a single hang with no dumps. I haven't been
able to reproduce it since. :(


For your convenience I've prepared a VM that contains a reproducer.

It consists of 3 disk images (sda.img is for the system, it is 
Arch-based, sdb/sdc.img are for RAID). They are available (in a 
compressed form) to download here [1].


RAID is built as RAID10 with far2 layout, on top of it there is a LUKS 
container (can be opened either with keyfile under the /root or using 
"qwerty" password). There's one LVM PV, one VG and one volume on top of 
LUKS containing XFS. RAID is automatically assembled during the boot, so 
you don't have to worry about it.


I run the VM like this:

$ qemu-system-x86_64 -display gtk,gl=on -machine q35,accel=kvm -cpu 
host,+vmx -enable-kvm -netdev user,id=user.0 -device 
virtio-net,netdev=user.0 -usb -device nec-usb-xhci,id=xhci -device 
usb-tablet,bus=xhci.0 -serial stdio -smp 4 -m 512 -hda sda.img -hdb 
sdb.img -hdc sdc.img


The system is accessible via both VGA and serial console. The user is 
"root", the password is "qwerty".


Under the /root folder there is a reproducer script (reproducer.sh). It 
does trivial things like enabling sysrq, opening LUKS device, mounting a 
volume, running a background I/O (this is an important part, actually, 
since I wasn't able to trigger the issue without the background I/O) 
and, finally, running the smartctl in a loop. If you are lucky, within a 
minute or two you'll get the first warning followed shortly by 
subsequent bugs and I/O stall (htop is pre-installed for your 
convenience too).


Notable changes in this VM comparing to generic defaults:

1) blk-mq is enabled via kernel cmdline (scsi_mod.use_blk_mq=1 is in 
/etc/default/grub)
2) BFQ is set via udev (check /etc/udev/rules.d/10-io-scheduler.rules 
file)


Again, I wasn't able to reproduce the usercopy warning/bug and I/O hang 
without all these components being involved.


Hope you enjoy it.

P.S. I haven't tested Linus' master branch yet. For now, this VM runs 
v4.16 kernel.


Regards,
  Oleksandr

[1] https://natalenko.name/myfiles/usercopy_bfq_woe/


Hang with blk-mq map series (block/008)

2018-04-10 Thread Jens Axboe
Hi Ming,

Ran the above blktests on the current tree, and we end up getting
a hang that we never recover from. There's on request perpetually
stuck:

root@dell:/sys/kernel/debug/block/nvme0n1/hctx2# cat busy 
5e2b09fe {.op=READ, .cmd_flags=, .rq_flags=DONTPREP|IO_STAT|STATS, 
.state=in_flight, .tag=313, .internal_tag=-1}

and no amount of manual running or kicking the queue will bring it
back to life. If I run it on 'master', it works fine.

Did you run blktests on your series?

-- 
Jens Axboe



Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
Hello,

On Tue, Apr 10, 2018 at 11:38:18PM +0800, Ming Lei wrote:
> I agree, the issue should be in driver's irq handler and .timeout in
> theory.
> 
> For example, even though one request has been done by irq handler, .timeout
> still may return RESET_TIMER.

blk-mq can use a separate flag to track normal completions during
timeout and complete the request normally on RESET_TIMER if the flag
is set while EH was in progress.  With a bit of care, we'd be able to
plug the race completely.

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Ming Lei
Hi Tejun,

On Tue, Apr 10, 2018 at 08:30:31AM -0700, t...@kernel.org wrote:
> Hello, Ming.
> 
> On Tue, Apr 10, 2018 at 11:25:54PM +0800, Ming Lei wrote:
> > +   if (time_after_eq(jiffies, deadline) &&
> > +   blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
> > +   blk_mq_rq_timed_out(rq, reserved);
> > 
> > Normal completion still can happen between blk_mq_change_rq_state()
> > and blk_mq_rq_timed_out().
> > 
> > In tj's approach, there is synchronize_rcu() between writing aborted_gstate
> > and blk_mq_rq_timed_out, it is easier for normal completion to happen during
> > the big window.
> 
> I don't think plugging this hole is all that difficult, but this
> shouldn't lead to any critical failures.  If so, that'd be a driver
> bug.

I agree, the issue should be in driver's irq handler and .timeout in
theory.

For example, even though one request has been done by irq handler, .timeout
still may return RESET_TIMER.

Thanks,
Ming


Re: [PATCH] blk-mq: Fix recently introduced races in the timeout handling code

2018-04-10 Thread Alex G.
I've observed a very nasty NULL pointer dereference with NVME drives on
hot-unplug [1]. While I haven't gone in the details of this change, it
does fix the NULL dereference I've been seeing.

Alex

[1] http://lists.infradead.org/pipermail/linux-nvme/2018-April/016623.html


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
Hello, Ming.

On Tue, Apr 10, 2018 at 11:25:54PM +0800, Ming Lei wrote:
> +   if (time_after_eq(jiffies, deadline) &&
> +   blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
> +   blk_mq_rq_timed_out(rq, reserved);
> 
> Normal completion still can happen between blk_mq_change_rq_state()
> and blk_mq_rq_timed_out().
> 
> In tj's approach, there is synchronize_rcu() between writing aborted_gstate
> and blk_mq_rq_timed_out, it is easier for normal completion to happen during
> the big window.

I don't think plugging this hole is all that difficult, but this
shouldn't lead to any critical failures.  If so, that'd be a driver
bug.

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Ming Lei
On Tue, Apr 10, 2018 at 03:02:11PM +, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 22:30 +0800, Ming Lei wrote:
> > On Tue, Apr 10, 2018 at 02:09:33PM +, Bart Van Assche wrote:
> > > Please keep in mind that all synchronize_rcu() does is to wait for pre-
> > > existing RCU readers to finish. synchronize_rcu() does not prevent that 
> > > new
> > > rcu_read_lock() calls happen. It is e.g. possible that after
> > 
> > That is right, and I also mentioned normal completion can be done between
> > 1) and reset aborted_gstate in 3).
> > 
> > > blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
> > > completion occurs. If that request is not reused before the timer that was
> > > restarted by the timeout code expires, that request will be completed 
> > > twice.
> > 
> > In this patch, blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT) is
> > called for handling BLK_EH_RESET_TIMER. And after rq's state is changed
> > to MQ_RQ_IN_FLIGHT, normal completion still can come and complete this rq,
> > just like the above you described, right?
> 
> I should have added the following in my previous e-mail: "if the completion
> occurs after blk_mq_check_expired() examined rq->gstate and before it updated
> rq->aborted_gstate".

That is the difference between tj's approach and your patch, but the
difference is just in the timing.

See this patch:

+   if (time_after_eq(jiffies, deadline) &&
+   blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
+   blk_mq_rq_timed_out(rq, reserved);

Normal completion still can happen between blk_mq_change_rq_state()
and blk_mq_rq_timed_out().

In tj's approach, there is synchronize_rcu() between writing aborted_gstate
and blk_mq_rq_timed_out, it is easier for normal completion to happen during
the big window.

> That race can occur with the current upstream blk-mq
> timeout handling code but not after my patch has been applied.

In theory, the 'race' can occur with your patch too, but the window
is just so small.

If you think my comment is correct, please update your commit log.

-- 
Ming


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 22:30 +0800, Ming Lei wrote:
> On Tue, Apr 10, 2018 at 02:09:33PM +, Bart Van Assche wrote:
> > Please keep in mind that all synchronize_rcu() does is to wait for pre-
> > existing RCU readers to finish. synchronize_rcu() does not prevent that new
> > rcu_read_lock() calls happen. It is e.g. possible that after
> 
> That is right, and I also mentioned normal completion can be done between
> 1) and reset aborted_gstate in 3).
> 
> > blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
> > completion occurs. If that request is not reused before the timer that was
> > restarted by the timeout code expires, that request will be completed twice.
> 
> In this patch, blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT) is
> called for handling BLK_EH_RESET_TIMER. And after rq's state is changed
> to MQ_RQ_IN_FLIGHT, normal completion still can come and complete this rq,
> just like the above you described, right?

I should have added the following in my previous e-mail: "if the completion
occurs after blk_mq_check_expired() examined rq->gstate and before it updated
rq->aborted_gstate". That race can occur with the current upstream blk-mq
timeout handling code but not after my patch has been applied.

Bart.




Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread h...@lst.de
On Tue, Apr 10, 2018 at 01:26:39PM +, Bart Van Assche wrote:
> Can you explain why you think that using cmpxchg() is safe in this context?
> The regular completion path and the timeout code can both execute e.g. the
> following code concurrently:
> 
>   blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);
> 
> That's why I think that we need an atomic compare and exchange instead of
> cmpxchg(). What I found in the Intel Software Developer Manual seems to
> confirm that:

The Linux cmpxchg() helper is supposed to always be atomic, you only
need atomic_cmpxchg and friends if you want to operate on an atomic_t.
(same for the _long variant).

In fact if you look at the x86 implementation of the cmpxchg() macro
you'll see that it will use the lock prefix if the kernel has been
built with CONFIG_SMP enabled.


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Jens Axboe
On 4/10/18 3:55 AM, Christoph Hellwig wrote:
> On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
>> If a completion occurs after blk_mq_rq_timed_out() has reset
>> rq->aborted_gstate and the request is again in flight when the timeout
>> expires then a request will be completed twice: a first time by the
>> timeout handler and a second time when the regular completion occurs.
>>
>> Additionally, the blk-mq timeout handling code ignores completions that
>> occur after blk_mq_check_expired() has been called and before
>> blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
>> timeout handler always returns BLK_EH_RESET_TIMER then the result will
>> be that the request never terminates.
>>
>> Since the request state can be updated from two different contexts,
>> namely regular completion and request timeout, this race cannot be
>> fixed with RCU synchronization only. Fix this race as follows:
>> - Split __deadline in two variables, namely lq_deadline for legacy
>>   request queues and mq_deadline for blk-mq request queues. Use atomic
>>   operations to update mq_deadline.
> 
> I don't think we need the atomic_long_cmpxchg, and can do with a plain
> cmpxhg.  Also unterminated cmpxchg loops are a bad idea, but I think
> both callers are protected from other changes so we can turn that
> into a WARN_ON().  That plus some minor cleanups in the version below,
> only boot tested:

This version looks reasonable, let me throw it through the testing
machinery.

-- 
Jens Axboe



Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
Hello,

On Tue, Apr 10, 2018 at 02:30:26PM +, Bart Van Assche wrote:
> > Switching to another model might be better but let's please do that
> > with the right rationales.  A good portion of this seems to be built
> > on misunderstandings.
> 
> Which misunderstandings? I'm not aware of any misunderstandings at my side.
> Additionally, tests with two different block drivers (NVMeOF initiator and
> the SRP initiator driver) have shown that the current blk-mq timeout
> implementation with or without your two patches applied result in subtle and
> hard to debug crashes and/or memory corruption. That is not the case for the

I must have missed that part.  Which tests were they?

> patch at the start of this thread. The latest report of a crash I ran into
> myself and that is fixed by the patch at the start of this thread is
> available here: https://www.spinics.net/lists/linux-rdma/msg63240.html.
> 
> Please also keep in mind that if this patch would be accepted that that does
> not prevent this patch to be replaced with an RCU-based solution later on.
> If anyone comes up any time with a reliably working RCU-based solution I
> will be happy to accept a revert of this patch and I will help reviewing that
> RCU-based solution.

Oh, switching is fine but let's get in sync first.  Who have the repro
cases and what were tested?

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread jianchao.wang
Hi Bart

Thanks for your kindly response.

On 04/10/2018 09:01 PM, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 15:59 +0800, jianchao.wang wrote:
>> If yes, how does the timeout handler get the freed request when the tag has 
>> been freed ?
> 
> Hello Jianchao,
> 
> Have you noticed that the timeout handler does not check whether or not the 
> request
> tag is freed? Additionally, I don't think it would be possible to add such a 
> check
> to the timeout code without introducing a new race condition.

Doesn't blk_mq_queue_tag_busy_iter only iterate the tags that has been 
allocated/set ?
When the request is freed, the tag will be cleared through 
blk_mq_put_tag->sbitmap_queue_clear
Do I miss something else ?

Thanks
Jianchao

> 
> 


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Ming Lei
On Tue, Apr 10, 2018 at 02:09:33PM +, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 21:55 +0800, Ming Lei wrote:
> > Then I have same question with Jianchao, what is the actual double
> > complete in linus tree between BLK_EH_RESET_TIMER and normal completion?
> > 
> > Follows my understanding:
> > 
> > 1) when timeout is detected on one request, its aborted_gstate is
> > updated in blk_mq_check_expired().
> > 
> > 2) run synchronize_rcu(), and make sure all pending completion is done
> > 
> > 3) run blk_mq_rq_timed_out()
> > - ret = ops->timeout
> > - blk_mq_rq_update_aborted_gstate(req, 0)
> > - blk_add_timer(req);
> > 
> > If normal completion is done between 1) and reset aborted_gstate in 3),
> > blk_mq_complete_request() will be called, and found that aborted_gstate
> > is set, then the rq won't be completed really.
> > 
> > If normal completion is done after reset aborted_gstate in 3), it should
> > be same with applying this patch.
> 
> Hello Ming,
> 
> Please keep in mind that all synchronize_rcu() does is to wait for pre-
> existing RCU readers to finish. synchronize_rcu() does not prevent that new
> rcu_read_lock() calls happen. It is e.g. possible that after

That is right, and I also mentioned normal completion can be done between
1) and reset aborted_gstate in 3).

> blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
> completion occurs. If that request is not reused before the timer that was
> restarted by the timeout code expires, that request will be completed twice.

In this patch, blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT) is
called for handling BLK_EH_RESET_TIMER. And after rq's state is changed
to MQ_RQ_IN_FLIGHT, normal completion still can come and complete this rq,
just like the above you described, right?


Thanks,
Ming


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 07:20 -0700, Tejun Heo wrote:
> On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> > Since the request state can be updated from two different contexts,
> > namely regular completion and request timeout, this race cannot be
> > fixed with RCU synchronization only. Fix this race as follows:
> 
> Well, it can be and the patches have been posted months ago.

That's not correct. I have explained you in detail that the two patches you
posted do not fix all the races fixed by the patch at the start of this
e-mail thread.

> Switching to another model might be better but let's please do that
> with the right rationales.  A good portion of this seems to be built
> on misunderstandings.

Which misunderstandings? I'm not aware of any misunderstandings at my side.
Additionally, tests with two different block drivers (NVMeOF initiator and
the SRP initiator driver) have shown that the current blk-mq timeout
implementation with or without your two patches applied result in subtle and
hard to debug crashes and/or memory corruption. That is not the case for the
patch at the start of this thread. The latest report of a crash I ran into
myself and that is fixed by the patch at the start of this thread is
available here: https://www.spinics.net/lists/linux-rdma/msg63240.html.

Please also keep in mind that if this patch would be accepted that that does
not prevent this patch to be replaced with an RCU-based solution later on.
If anyone comes up any time with a reliably working RCU-based solution I
will be happy to accept a revert of this patch and I will help reviewing that
RCU-based solution.

Bart.




Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Tejun Heo
Hello, Bart.

On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> Since the request state can be updated from two different contexts,
> namely regular completion and request timeout, this race cannot be
> fixed with RCU synchronization only. Fix this race as follows:

Well, it can be and the patches have been posted months ago.  It just
needed a repro case to confirm the fix, which we now seem to have.

Switching to another model might be better but let's please do that
with the right rationales.  A good portion of this seems to be built
on misunderstandings.

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 21:55 +0800, Ming Lei wrote:
> Then I have same question with Jianchao, what is the actual double
> complete in linus tree between BLK_EH_RESET_TIMER and normal completion?
> 
> Follows my understanding:
> 
> 1) when timeout is detected on one request, its aborted_gstate is
> updated in blk_mq_check_expired().
> 
> 2) run synchronize_rcu(), and make sure all pending completion is done
> 
> 3) run blk_mq_rq_timed_out()
> - ret = ops->timeout
> - blk_mq_rq_update_aborted_gstate(req, 0)
> - blk_add_timer(req);
> 
> If normal completion is done between 1) and reset aborted_gstate in 3),
> blk_mq_complete_request() will be called, and found that aborted_gstate
> is set, then the rq won't be completed really.
> 
> If normal completion is done after reset aborted_gstate in 3), it should
> be same with applying this patch.

Hello Ming,

Please keep in mind that all synchronize_rcu() does is to wait for pre-
existing RCU readers to finish. synchronize_rcu() does not prevent that new
rcu_read_lock() calls happen. It is e.g. possible that after
blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
completion occurs. If that request is not reused before the timer that was
restarted by the timeout code expires, that request will be completed twice.

Bart.





Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Ming Lei
On Tue, Apr 10, 2018 at 12:58:04PM +, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 16:41 +0800, Ming Lei wrote:
> > On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> > > If a completion occurs after blk_mq_rq_timed_out() has reset
> > > rq->aborted_gstate and the request is again in flight when the timeout
> > 
> > Given rq->aborted_gstate is reset only for BLK_EH_RESET_TIMER, I
> > think you are addressing the race between normal completion and
> > the timeout of BLK_EH_RESET_TIMER.
> 
> Yes, that's correct. I will make this more explicit.
> 
> > > expires then a request will be completed twice: a first time by the
> > > timeout handler and a second time when the regular completion occurs.
> > 
> > This patch looks simpler, and more like the previous model of
> > using blk_mark_rq_complete().
> 
> That's also correct.
> 
> > I have one question:
> > 
> > - with this patch, rq's state is updated atomically as cmpxchg. Suppose
> > one rq is timed out, the rq's state is updated as MQ_RQ_COMPLETE by
> > blk_mq_check_expired(), then ops->timeout() is run and
> > BLK_EH_RESET_TIMER is returned, so blk_mq_add_timer(req, MQ_RQ_COMPLETE,
> > MQ_RQ_IN_FLIGHT) is called to update rq's state into MQ_RQ_IN_FLIGHT.
> > 
> > Now the original normal completion still may occur after rq's state
> > becomes MQ_RQ_IN_FLIGHT, and seems there is still the double completion
> > with this patch? Maybe I am wrong, please explain a bit.
> 
> That scenario won't result in a double completion. After the timer has
> been reset the block driver blk_mq_complete_request() call will change
> the request state from MQ_RQ_IN_FLIGHT into MQ_RQ_COMPLETE. The next
> time blk_mq_check_expired() is called it will execute the following code:
> 
>   blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);
> 
> That function call only changes the request state if the current state is
> IN_FLIGHT. However, the blk_mq_complete_request() call changed the request
> state into COMPLETE. Hence, the above blk_mq_change_rq_state() call will
> return false and the blk-mq timeout code will skip this request. If the
> request would already have been reused and would have been marked again as
> IN_FLIGHT then its deadline will also have been updated and the request
> will be skipped by the timeout code because its deadline has not yet
> expired.

OK.

Then I have same question with Jianchao, what is the actual double
complete in linus tree between BLK_EH_RESET_TIMER and normal completion?

Follows my understanding:

1) when timeout is detected on one request, its aborted_gstate is
updated in blk_mq_check_expired().

2) run synchronize_rcu(), and make sure all pending completion is done

3) run blk_mq_rq_timed_out()
- ret = ops->timeout
- blk_mq_rq_update_aborted_gstate(req, 0)
- blk_add_timer(req);

If normal completion is done between 1) and reset aborted_gstate in 3),
blk_mq_complete_request() will be called, and found that aborted_gstate
is set, then the rq won't be completed really.

If normal completion is done after reset aborted_gstate in 3), it should
be same with applying this patch.

> 
> > And synchronize_rcu() is needed by Tejun's approach between marking
> > COMPLETE and handling this rq's timeout, and the time can be quite long,
> > I guess it might be easier to trigger?
> 
> I have done what I could to trigger races between the regular completion
> path and the timeout code in my tests. Without this patch if I run the
> srp-test software I see crashes being reported in the rdma_rxe driver but
> with this patch applied I don't see any crashes with my tests.

I believe this patch may fix this issue, but I think the idea behind
should be understood.

Thanks,
Ming


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-10 Thread Oleksandr Natalenko

Hi.

10.04.2018 08:35, Oleksandr Natalenko wrote:

- does it reproduce _without_ hardened usercopy? (I would assume yes,
but you'd just not get any warning until the hangs started.) If it
does reproduce without hardened usercopy, then a new bisect run could
narrow the search even more.


Looks like it cannot be disabled via kernel cmdline, so I have to
re-compile the kernel, right? I can certainly do that anyway.


Okay, I've recompiled the kernel without hardened usercopy:

[root@archlinux ~]# zgrep USERCOPY /proc/config.gz
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR=y
# CONFIG_HARDENED_USERCOPY is not set

and I cannot reproduce the issue anymore. I/O doesn't hang regardless of 
how long I hammer it.


Eeeh? Maybe, this is a matter of some cleanup code path once the 
warn/bug condition is hit with hardening enabled? I'm just guessing here 
again.


Will work towards checking Linus' master branch now…

Regards,
  Oleksandr


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 11:55 +0200, Christoph Hellwig wrote:
> I don't think we need the atomic_long_cmpxchg, and can do with a plain
> cmpxhg.  Also unterminated cmpxchg loops are a bad idea, but I think
> both callers are protected from other changes so we can turn that
> into a WARN_ON().

Hello Christoph,

I will remove the blk_mq_rq_update_state() function so that's one while
loop less.

Can you explain why you think that using cmpxchg() is safe in this context?
The regular completion path and the timeout code can both execute e.g. the
following code concurrently:

blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);

That's why I think that we need an atomic compare and exchange instead of
cmpxchg(). What I found in the Intel Software Developer Manual seems to
confirm that:

Description

Compares the value in the AL, AX, EAX, or RAX register with the first
operand (destination operand). If the two values are equal, the second
operand (source operand) is loaded into the destination operand. Otherwise,
the destination operand is loaded into the AL, AX, EAX or RAX register. RAX
register is available only in 64-bit mode. This instruction can be used
with a LOCK prefix to allow the instruction to be executed atomically. To
simplify the interface to the processor’s bus, the destination operand
receives a write cycle without regard to the result of the comparison. The
destination operand is written back if the comparison fails; otherwise, the
source operand is written into the destination. (The processor never
produces a locked read without also producing a locked write.)

Thanks,

Bart.





Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 15:59 +0800, jianchao.wang wrote:
> If yes, how does the timeout handler get the freed request when the tag has 
> been freed ?

Hello Jianchao,

Have you noticed that the timeout handler does not check whether or not the 
request
tag is freed? Additionally, I don't think it would be possible to add such a 
check
to the timeout code without introducing a new race condition.

Bart.




Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 16:41 +0800, Ming Lei wrote:
> On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> > If a completion occurs after blk_mq_rq_timed_out() has reset
> > rq->aborted_gstate and the request is again in flight when the timeout
> 
> Given rq->aborted_gstate is reset only for BLK_EH_RESET_TIMER, I
> think you are addressing the race between normal completion and
> the timeout of BLK_EH_RESET_TIMER.

Yes, that's correct. I will make this more explicit.

> > expires then a request will be completed twice: a first time by the
> > timeout handler and a second time when the regular completion occurs.
> 
> This patch looks simpler, and more like the previous model of
> using blk_mark_rq_complete().

That's also correct.

> I have one question:
> 
> - with this patch, rq's state is updated atomically as cmpxchg. Suppose
> one rq is timed out, the rq's state is updated as MQ_RQ_COMPLETE by
> blk_mq_check_expired(), then ops->timeout() is run and
> BLK_EH_RESET_TIMER is returned, so blk_mq_add_timer(req, MQ_RQ_COMPLETE,
> MQ_RQ_IN_FLIGHT) is called to update rq's state into MQ_RQ_IN_FLIGHT.
> 
> Now the original normal completion still may occur after rq's state
> becomes MQ_RQ_IN_FLIGHT, and seems there is still the double completion
> with this patch? Maybe I am wrong, please explain a bit.

That scenario won't result in a double completion. After the timer has
been reset the block driver blk_mq_complete_request() call will change
the request state from MQ_RQ_IN_FLIGHT into MQ_RQ_COMPLETE. The next
time blk_mq_check_expired() is called it will execute the following code:

blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);

That function call only changes the request state if the current state is
IN_FLIGHT. However, the blk_mq_complete_request() call changed the request
state into COMPLETE. Hence, the above blk_mq_change_rq_state() call will
return false and the blk-mq timeout code will skip this request. If the
request would already have been reused and would have been marked again as
IN_FLIGHT then its deadline will also have been updated and the request
will be skipped by the timeout code because its deadline has not yet
expired.

> And synchronize_rcu() is needed by Tejun's approach between marking
> COMPLETE and handling this rq's timeout, and the time can be quite long,
> I guess it might be easier to trigger?

I have done what I could to trigger races between the regular completion
path and the timeout code in my tests. Without this patch if I run the
srp-test software I see crashes being reported in the rdma_rxe driver but
with this patch applied I don't see any crashes with my tests.

Bart.




Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Shan Hai



On 2018年04月10日 18:04, Ming Lei wrote:

On Tue, Apr 10, 2018 at 03:59:30PM +0800, jianchao.wang wrote:

Hi Bart

On 04/10/2018 09:34 AM, Bart Van Assche wrote:

If a completion occurs after blk_mq_rq_timed_out() has reset
rq->aborted_gstate and the request is again in flight when the timeout
expires then a request will be completed twice: a first time by the
timeout handler and a second time when the regular completion occurs

Would you please detail more here about why the request could be completed 
twice ?

Is it the scenario you described as below in 
https://marc.info/?l=linux-block=151796816127318

The following race can occur between the code that resets the timer
and completion handling:
- The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
- A completion occurs and blk_mq_complete_request() calls
   __blk_mq_complete_request().
- The timeout code calls blk_add_timer() and that function sets the
   request deadline and adjusts the timer.
- __blk_mq_complete_request() frees the request tag.
- The timer fires and the timeout handler gets called for a freed
   request.
If yes, how does the timeout handler get the freed request when the tag has 
been freed ?

Thinking of this patch further.

The issue may not be a double completion issue, and it may be the
following behaviour which breaks NVMe or other drivers easily:

1) there is long delay(synchronize_rcu()) between setting rq->aborted_gstate
and handling the timeout by blk_mq_rq_timed_out().

2) during the long delay, the rq may be completed by hardware, then
if the following timeout is handled as BLK_EH_RESET_TIMER, it is
driver's bug, and driver's .timeout() may be confused about this
behaviour, I guess.

In theory this behaviour should exist in all these approaches,
but just easier to trigger if long delay is introduced before handling
timeout.


Or think it as below?


   C   C    C
+-+---+---+
S   T F  E

Request life time line:
S: start
T: timed out
F: found (by timer), inflight but timed out because of a long delay
E: completed by timeout handler
C: regular completion

normal request completion time range: (S, T)
completion in (T, F) is fine since it's not inflight anymore
race window time range: (F, E)

if the delayed request is completed in (F, E) range then it could be 
completed

twice by the regular completion first and then the timeout handler.

Thanks
Shan Hai

Thanks,
Ming




Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Ming Lei
On Tue, Apr 10, 2018 at 03:59:30PM +0800, jianchao.wang wrote:
> Hi Bart
> 
> On 04/10/2018 09:34 AM, Bart Van Assche wrote:
> > If a completion occurs after blk_mq_rq_timed_out() has reset
> > rq->aborted_gstate and the request is again in flight when the timeout
> > expires then a request will be completed twice: a first time by the
> > timeout handler and a second time when the regular completion occurs
> 
> Would you please detail more here about why the request could be completed 
> twice ?
> 
> Is it the scenario you described as below in 
> https://marc.info/?l=linux-block=151796816127318
> 
> The following race can occur between the code that resets the timer
> and completion handling:
> - The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
> - A completion occurs and blk_mq_complete_request() calls
>   __blk_mq_complete_request().
> - The timeout code calls blk_add_timer() and that function sets the
>   request deadline and adjusts the timer.
> - __blk_mq_complete_request() frees the request tag.
> - The timer fires and the timeout handler gets called for a freed
>   request.
> If yes, how does the timeout handler get the freed request when the tag has 
> been freed ?

Thinking of this patch further.

The issue may not be a double completion issue, and it may be the
following behaviour which breaks NVMe or other drivers easily:

1) there is long delay(synchronize_rcu()) between setting rq->aborted_gstate
and handling the timeout by blk_mq_rq_timed_out().

2) during the long delay, the rq may be completed by hardware, then
if the following timeout is handled as BLK_EH_RESET_TIMER, it is
driver's bug, and driver's .timeout() may be confused about this
behaviour, I guess.

In theory this behaviour should exist in all these approaches,
but just easier to trigger if long delay is introduced before handling
timeout.

Thanks,
Ming


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Christoph Hellwig
On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout
> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs.
> 
> Additionally, the blk-mq timeout handling code ignores completions that
> occur after blk_mq_check_expired() has been called and before
> blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
> timeout handler always returns BLK_EH_RESET_TIMER then the result will
> be that the request never terminates.
> 
> Since the request state can be updated from two different contexts,
> namely regular completion and request timeout, this race cannot be
> fixed with RCU synchronization only. Fix this race as follows:
> - Split __deadline in two variables, namely lq_deadline for legacy
>   request queues and mq_deadline for blk-mq request queues. Use atomic
>   operations to update mq_deadline.

I don't think we need the atomic_long_cmpxchg, and can do with a plain
cmpxhg.  Also unterminated cmpxchg loops are a bad idea, but I think
both callers are protected from other changes so we can turn that
into a WARN_ON().  That plus some minor cleanups in the version below,
only boot tested:

diff --git a/block/blk-core.c b/block/blk-core.c
index abcb8684ba67..7e88e4a9133d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -199,8 +199,6 @@ void blk_rq_init(struct request_queue *q, struct request 
*rq)
rq->start_time = jiffies;
set_start_time_ns(rq);
rq->part = NULL;
-   seqcount_init(>gstate_seq);
-   u64_stats_init(>aborted_gstate_sync);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 58b3b79cbe83..ecb934bafb29 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -345,7 +345,6 @@ static const char *const rqf_name[] = {
RQF_NAME(STATS),
RQF_NAME(SPECIAL_PAYLOAD),
RQF_NAME(ZONE_WRITE_LOCKED),
-   RQF_NAME(MQ_TIMEOUT_EXPIRED),
RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f5c7dbcb954f..0f5c56789fde 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -477,7 +477,9 @@ void blk_mq_free_request(struct request *rq)
if (blk_rq_rl(rq))
blk_put_rl(blk_rq_rl(rq));
 
-   blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+   if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+   WARN_ON_ONCE(1);
+
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
@@ -523,8 +525,7 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
 
-   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-   blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
@@ -573,36 +574,6 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int 
*srcu_idx)
*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
-   unsigned long flags;
-
-   /*
-* blk_mq_rq_aborted_gstate() is used from the completion path and
-* can thus be called from irq context.  u64_stats_fetch in the
-* middle of update on the same CPU leads to lockup.  Disable irq
-* while updating.
-*/
-   local_irq_save(flags);
-   u64_stats_update_begin(>aborted_gstate_sync);
-   rq->aborted_gstate = gstate;
-   u64_stats_update_end(>aborted_gstate_sync);
-   local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-   unsigned int start;
-   u64 aborted_gstate;
-
-   do {
-   start = u64_stats_fetch_begin(>aborted_gstate_sync);
-   aborted_gstate = rq->aborted_gstate;
-   } while (u64_stats_fetch_retry(>aborted_gstate_sync, start));
-
-   return aborted_gstate;
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:the request being processed
@@ -614,27 +585,12 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
struct request_queue *q = rq->q;
-   struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-   int srcu_idx;
 
if (unlikely(blk_should_fake_timeout(q)))
return;
 
-   /*
-* If @rq->aborted_gstate equals the current instance, timeout is
-* claiming @rq and we lost.  This is synchronized through
-* hctx_lock().  See blk_mq_timeout_work() for details.
-*
-* Completion path never blocks and we can directly use RCU here
-* instead of 

Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Ming Lei
On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout

Given rq->aborted_gstate is reset only for BLK_EH_RESET_TIMER, I
think you are addressing the race between normal completion and
the timeout of BLK_EH_RESET_TIMER.

> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs.

This patch looks simpler, and more like the previous model of
using blk_mark_rq_complete().

I have one question:

- with this patch, rq's state is updated atomically as cmpxchg. Suppose
one rq is timed out, the rq's state is updated as MQ_RQ_COMPLETE by
blk_mq_check_expired(), then ops->timeout() is run and
BLK_EH_RESET_TIMER is returned, so blk_mq_add_timer(req, MQ_RQ_COMPLETE,
MQ_RQ_IN_FLIGHT) is called to update rq's state into MQ_RQ_IN_FLIGHT.

Now the original normal completion still may occur after rq's state
becomes MQ_RQ_IN_FLIGHT, and seems there is still the double completion
with this patch? Maybe I am wrong, please explain a bit.

And synchronize_rcu() is needed by Tejun's approach between marking
COMPLETE and handling this rq's timeout, and the time can be quite long,
I guess it might be easier to trigger?

Thanks,
Ming


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread jianchao.wang
Hi Bart

On 04/10/2018 09:34 AM, Bart Van Assche wrote:
> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout
> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs

Would you please detail more here about why the request could be completed 
twice ?

Is it the scenario you described as below in 
https://marc.info/?l=linux-block=151796816127318

The following race can occur between the code that resets the timer
and completion handling:
- The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
- A completion occurs and blk_mq_complete_request() calls
  __blk_mq_complete_request().
- The timeout code calls blk_add_timer() and that function sets the
  request deadline and adjusts the timer.
- __blk_mq_complete_request() frees the request tag.
- The timer fires and the timeout handler gets called for a freed
  request.
If yes, how does the timeout handler get the freed request when the tag has 
been freed ?

Thanks
Jianchao


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-10 Thread Kees Cook
On Mon, Apr 9, 2018 at 11:35 PM, Oleksandr Natalenko
 wrote:
> Did your system hang on smartctl hammering too? Have you got some stack
> traces to compare with mine ones?

Unfortunately I only had a single hang with no dumps. I haven't been
able to reproduce it since. :(

-Kees

-- 
Kees Cook
Pixel Security


Re: usercopy whitelist woe in scsi_sense_cache

2018-04-10 Thread Oleksandr Natalenko

Hi.

09.04.2018 22:30, Kees Cook wrote:

echo 1 | tee /sys/block/sd*/queue/nr_requests


I can't get this below "4".


Oops, yeah. It cannot be less than BLKDEV_MIN_RQ (which is 4), so it is 
enforced explicitly in queue_requests_store(). It is the same for me.



echo 1 | tee /sys/block/sd*/device/queue_depth


I've got this now too.
Ah! dm-crypt too. I'll see if I can get that added easily to my tests.
And XFS! You love your corner cases. ;)


Yeah, so far this wonderful configuration has allowed me to uncover a 
bunch of bugs, and see, we are not done yet ;).



Two other questions, since you can reproduce this easily:
- does it reproduce _without_ hardened usercopy? (I would assume yes,
but you'd just not get any warning until the hangs started.) If it
does reproduce without hardened usercopy, then a new bisect run could
narrow the search even more.


Looks like it cannot be disabled via kernel cmdline, so I have to 
re-compile the kernel, right? I can certainly do that anyway.



- does it reproduce with Linus's current tree?


Will try this too.


What would imply missing locking, yes? Yikes. But I'd expect
use-after-free or something, or bad data, not having the pointer slip
forward?


I still think this has something to do with blk-mq re-queuing capability 
and how BFQ implements it, because there are no sings of issue popping 
up with Kyber so far.



Quick update: I added dm-crypt (with XFS on top) and it hung my system
almost immediately. I got no warnings at all, though.


Did your system hang on smartctl hammering too? Have you got some stack 
traces to compare with mine ones?


Regards,
  Oleksandr


Re: [PATCH 6/7] block: consistently use GFP_NOIO instead of __GFP_NORECLAIM

2018-04-10 Thread Hannes Reinecke
On Mon,  9 Apr 2018 17:39:15 +0200
Christoph Hellwig  wrote:

> Same numerical value (for now at least), but a much better
> documentation of intent.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/scsi_ioctl.c   |  2 +-
>  drivers/block/drbd/drbd_bitmap.c |  3 ++-
>  drivers/block/pktcdvd.c  |  2 +-
>  drivers/ide/ide-tape.c   |  2 +-
>  drivers/ide/ide-taskfile.c   |  2 +-
>  drivers/scsi/scsi_lib.c  |  2 +-
>  fs/direct-io.c   |  4 ++--
>  kernel/power/swap.c  | 14 +++---
>  8 files changed, 16 insertions(+), 15 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes



Re: [PATCH 7/7] block: use GFP_KERNEL for allocations from blk_get_request

2018-04-10 Thread Hannes Reinecke
On Mon,  9 Apr 2018 17:39:16 +0200
Christoph Hellwig  wrote:

> blk_get_request is used for pass-through style I/O and thus doesn't
> need GFP_NOIO.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/blk-core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 432923751551..253a869558f9 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1578,7 +1578,7 @@ static struct request
> *blk_old_get_request(struct request_queue *q, unsigned int op,
> blk_mq_req_flags_t flags) {
>   struct request *rq;
> - gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
> GFP_NOIO;
> + gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
> GFP_KERNEL; int ret = 0;
>  
>   WARN_ON_ONCE(q->mq_ops);

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes



Re: [PATCH 5/7] block: use GFP_NOIO instead of __GFP_DIRECT_RECLAIM

2018-04-10 Thread Hannes Reinecke
On Mon,  9 Apr 2018 17:39:14 +0200
Christoph Hellwig  wrote:

> We just can't do I/O when doing block layer requests allocations,
> so use GFP_NOIO instead of the even more limited __GFP_DIRECT_RECLAIM.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/blk-core.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes



Re: [PATCH 4/7] block: pass explicit gfp_t to get_request

2018-04-10 Thread Hannes Reinecke
On Mon,  9 Apr 2018 17:39:13 +0200
Christoph Hellwig  wrote:

> blk_old_get_request already has it at hand, and in blk_queue_bio,
> which is the fast path, it is constant.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/blk-core.c  | 14 +++---
>  drivers/scsi/scsi_error.c |  4 
>  2 files changed, 7 insertions(+), 11 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes



Re: [PATCH 3/7] block: sanitize blk_get_request calling conventions

2018-04-10 Thread Hannes Reinecke
On Mon,  9 Apr 2018 17:39:12 +0200
Christoph Hellwig  wrote:

> Switch everyone to blk_get_request_flags, and then rename
> blk_get_request_flags to blk_get_request.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/blk-core.c   | 14 +++---
>  block/bsg.c|  5 ++---
>  block/scsi_ioctl.c |  8 +++-
>  drivers/block/paride/pd.c  |  2 +-
>  drivers/block/pktcdvd.c|  2 +-
>  drivers/block/sx8.c|  2 +-
>  drivers/block/virtio_blk.c |  2 +-
>  drivers/cdrom/cdrom.c  |  2 +-
>  drivers/ide/ide-atapi.c|  2 +-
>  drivers/ide/ide-cd.c   |  2 +-
>  drivers/ide/ide-cd_ioctl.c |  2 +-
>  drivers/ide/ide-devsets.c  |  2 +-
>  drivers/ide/ide-disk.c |  2 +-
>  drivers/ide/ide-ioctls.c   |  4 ++--
>  drivers/ide/ide-park.c |  4 ++--
>  drivers/ide/ide-pm.c   |  5 ++---
>  drivers/ide/ide-tape.c |  2 +-
>  drivers/ide/ide-taskfile.c |  2 +-
>  drivers/md/dm-mpath.c  |  3 ++-
>  drivers/mmc/core/block.c   | 12 +---
>  drivers/scsi/osd/osd_initiator.c   |  2 +-
>  drivers/scsi/osst.c|  2 +-
>  drivers/scsi/scsi_error.c  |  2 +-
>  drivers/scsi/scsi_lib.c|  2 +-
>  drivers/scsi/sg.c  |  2 +-
>  drivers/scsi/st.c  |  2 +-
>  drivers/target/target_core_pscsi.c |  3 +--
>  fs/nfsd/blocklayout.c  |  2 +-
>  include/linux/blkdev.h |  5 +
>  29 files changed, 42 insertions(+), 59 deletions(-)
> 
Long overdue.

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes