Re: Multi-Actuator SAS HDD First Look

2018-04-05 Thread Christoph Hellwig
On Fri, Apr 06, 2018 at 08:24:18AM +0200, Hannes Reinecke wrote:
> Ah. Far better.
> What about delegating FORMAT UNIT to the control LUN, and not
> implementing it for the individual disk LUNs?
> That would make an even stronger case for having a control LUN;
> with that there wouldn't be any problem with having to synchronize
> across LUNs etc.

It sounds to me like NVMe might be a much better model for this drive
than SCSI, btw :)


Re: Multi-Actuator SAS HDD First Look

2018-04-05 Thread Hannes Reinecke
On Thu, 5 Apr 2018 17:43:46 -0600
Tim Walker  wrote:

> On Tue, Apr 3, 2018 at 1:46 AM, Christoph Hellwig 
> wrote:
> > On Sat, Mar 31, 2018 at 01:03:46PM +0200, Hannes Reinecke wrote:  
> >> Actually I would propose to have a 'management' LUN at LUN0, which
> >> could handle all the device-wide commands (e.g. things like START
> >> STOP UNIT, firmware update, or even SMART commands), and ignoring
> >> them for the remaining LUNs.
> >
> > That is in fact the only workable option at all.  Everything else
> > completely breaks the scsi architecture.
> 
> Here's an update: Seagate will eliminate the inter-LU actions from
> FORMAT UNIT and SANITIZE. Probably SANITIZE will be per-LUN, but
> FORMAT UNIT is trickier due to internal drive architecture, and how
> FORMAT UNIT initializes on-disk metadata. Likely it will require some
> sort of synchronization across LUNs, such as the command being sent to
> both LUNs sequentially or something similar. We are also considering
> not supporting FORMAT UNIT at all - would anybody object? Any other
> suggestions?
> 

Ah. Far better.
What about delegating FORMAT UNIT to the control LUN, and not
implementing it for the individual disk LUNs?
That would make an even stronger case for having a control LUN;
with that there wouldn't be any problem with having to synchronize
across LUNs etc.

Cheers,

Hannes


Re: 4.15.14 crash with iscsi target and dvd

2018-04-05 Thread Bart Van Assche
On Thu, 2018-04-05 at 22:06 -0400, Wakko Warner wrote:
> I know now why scsi_print_command isn't doing anything.  cmd->cmnd is null.
> I added a dev_printk in scsi_print_command where the 2 if statements return.
> Logs:
> [  29.866415] sr 3:0:0:0: cmd->cmnd is NULL

That's something that should never happen. As one can see in
scsi_setup_scsi_cmnd() and scsi_setup_fs_cmnd() both functions initialize
that pointer. Since I have not yet been able to reproduce myself what you
reported, would it be possible for you to bisect this issue? You will need
to follow something like the following procedure (see also
https://git-scm.com/docs/git-bisect):

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git bisect start
git bisect bad v4.10
git bisect good v4.9

and then build the kernel, install it, boot the kernel and test it.
Depending on the result, run either git bisect bad or git bisect good and
keep going until git bisect comes to a conclusion. This can take an hour or
more.
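Each iteration is roughly the following (only a sketch; the config, install
method and reproducer are whatever matches your setup):

cd linux
make olddefconfig && make -j"$(nproc)" && sudo make modules_install install
sudo reboot
# after boot: mount the DVD over iSCSI and run the find | xargs cat test
git bisect good   # if the target survived
git bisect bad    # if it crashed; repeat until git bisect names one commit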

Bart.





Re: 4.15.14 crash with iscsi target and dvd

2018-04-05 Thread Wakko Warner
Wakko Warner wrote:
> Bart Van Assche wrote:
> > On Sun, 2018-04-01 at 14:27 -0400, Wakko Warner wrote:
> > > Wakko Warner wrote:
> > > > Wakko Warner wrote:
> > > > > I tested 4.14.32 last night with the same oops.  4.9.91 works fine.
> > > > > From the initiator, if I do cat /dev/sr1 > /dev/null it works.  If I
> > > > > mount /dev/sr1 and then do find -type f | xargs cat > /dev/null the
> > > > > target crashes.  I'm using the builtin iscsi target with pscsi.  I can
> > > > > burn from the initiator without problems.  I'll test other kernels
> > > > > between 4.9 and 4.14.
> > > > 
> > > > So I've tested 4.x.y where x is one of 10 11 12 14 15 and y is the
> > > > latest patch (except for 4.15 which was 1 behind).
> > > > Each of these kernels crashes within seconds or immediately when doing
> > > > find -type f | xargs cat > /dev/null from the initiator.
> > > 
> > > I tried 4.10.0.  It doesn't completely lock up the system, but the device
> > > that was used hangs.  So from the initiator, it's /dev/sr1 and from the
> > > target it's /dev/sr0.  Attempting to read /dev/sr0 after the oops causes
> > > the process to hang in D state.
> > 
> > Hello Wakko,
> > 
> > Thank you for having narrowed down this further. I think that you
> > encountered a regression either in the block layer core or in the SCSI
> > core. Unfortunately the number of changes between kernel versions v4.9
> > and v4.10 in these two subsystems is huge. I see two possible ways forward:
> > - Either that you perform a bisect to identify the patch that introduced this
> >   regression. However, I'm not sure whether you are familiar with the bisect
> >   process.
> > - Or that you identify the command that triggers this crash such that others
> >   can reproduce this issue without needing access to your setup.
> > 
> > How about reproducing this crash with the below patch applied on top of
> > kernel v4.15.x? The additional output sent by this patch to the system log
> > should allow us to reproduce this issue by submitting the same SCSI command
> > with sg_raw.
> 
> Ok, so I tried this, but scsi_print_command doesn't print anything.  I added
> a check for !rq and the same thing that blk_rq_nr_phys_segments does in an
> if statement above this thinking it might have crashed during WARN_ON_ONCE.
> It still didn't print anything.  My printk shows this:
> [  36.263193] sr 3:0:0:0: cmd->request->nr_phys_segments is 0
> 
> I also had scsi_print_command in the same if block which again didn't print
> anything.  Is there some debug option I need to turn on to make it print?  I
> tried looking through the code for this and following some of the function
> calls but didn't see any config options.

I know now why scsi_print_command isn't doing anything.  cmd->cmnd is null.
I added a dev_printk in scsi_print_command where the 2 if statements return.
Logs:
[  29.866415] sr 3:0:0:0: cmd->cmnd is NULL

> > Subject: [PATCH] Report commands with no physical segments in the system log
> > 
> > ---
> >  drivers/scsi/scsi_lib.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 6b6a6705f6e5..74a39db57d49 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1093,8 +1093,10 @@ int scsi_init_io(struct scsi_cmnd *cmd)
> > bool is_mq = (rq->mq_ctx != NULL);
> > int error = BLKPREP_KILL;
> >  
> > -   if (WARN_ON_ONCE(!blk_rq_nr_phys_segments(rq)))
> > +   if (WARN_ON_ONCE(!blk_rq_nr_phys_segments(rq))) {
> > +   scsi_print_command(cmd);
> > goto err_exit;
> > +   }
> >  
> > error = scsi_init_sgtable(rq, &cmd->sdb);
> > if (error)
> -- 
>  Microsoft has beaten Volkswagen's world record.  Volkswagen only created 22
>  million bugs.
-- 
 Microsoft has beaten Volkswagen's world record.  Volkswagen only created 22
 million bugs.


Re: 4.15.14 crash with iscsi target and dvd

2018-04-05 Thread Wakko Warner
Bart Van Assche wrote:
> On Sun, 2018-04-01 at 14:27 -0400, Wakko Warner wrote:
> > Wakko Warner wrote:
> > > Wakko Warner wrote:
> > > I tested 4.14.32 last night with the same oops.  4.9.91 works fine.
> > > From the initiator, if I do cat /dev/sr1 > /dev/null it works.  If I
> > > mount /dev/sr1 and then do find -type f | xargs cat > /dev/null the
> > > target crashes.  I'm using the builtin iscsi target with pscsi.  I can
> > > burn from the initiator without problems.  I'll test other kernels
> > > between 4.9 and 4.14.
> > > 
> > > So I've tested 4.x.y where x is one of 10 11 12 14 15 and y is the
> > > latest patch (except for 4.15 which was 1 behind).
> > > Each of these kernels crashes within seconds or immediately when doing
> > > find -type f | xargs cat > /dev/null from the initiator.
> > 
> > I tried 4.10.0.  It doesn't completely lock up the system, but the device
> > that was used hangs.  So from the initiator, it's /dev/sr1 and from the
> > target it's /dev/sr0.  Attempting to read /dev/sr0 after the oops causes the
> > process to hang in D state.
> 
> Hello Wakko,
> 
> Thank you for having narrowed down this further. I think that you encountered
> a regression either in the block layer core or in the SCSI core. Unfortunately
> the number of changes between kernel versions v4.9 and v4.10 in these two
> subsystems is huge. I see two possible ways forward:
> - Either that you perform a bisect to identify the patch that introduced this
>   regression. However, I'm not sure whether you are familiar with the bisect
>   process.
> - Or that you identify the command that triggers this crash such that others
>   can reproduce this issue without needing access to your setup.
> 
> How about reproducing this crash with the below patch applied on top of
> kernel v4.15.x? The additional output sent by this patch to the system log
> should allow us to reproduce this issue by submitting the same SCSI command
> with sg_raw.

Ok, so I tried this, but scsi_print_command doesn't print anything.  I added
a check for !rq and the same thing that blk_rq_nr_phys_segments does in an
if statement above this thinking it might have crashed during WARN_ON_ONCE.
It still didn't print anything.  My printk shows this:
[  36.263193] sr 3:0:0:0: cmd->request->nr_phys_segments is 0

I also had scsi_print_command in the same if block which again didn't print
anything.  Is there some debug option I need to turn on to make it print?  I
tried looking through the code for this and following some of the function
calls but didn't see any config options.

> Subject: [PATCH] Report commands with no physical segments in the system log
> 
> ---
>  drivers/scsi/scsi_lib.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 6b6a6705f6e5..74a39db57d49 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1093,8 +1093,10 @@ int scsi_init_io(struct scsi_cmnd *cmd)
>   bool is_mq = (rq->mq_ctx != NULL);
>   int error = BLKPREP_KILL;
>  
> - if (WARN_ON_ONCE(!blk_rq_nr_phys_segments(rq)))
> + if (WARN_ON_ONCE(!blk_rq_nr_phys_segments(rq))) {
> + scsi_print_command(cmd);
>   goto err_exit;
> + }
>  
>   error = scsi_init_sgtable(rq, &cmd->sdb);
>   if (error)
-- 
 Microsoft has beaten Volkswagen's world record.  Volkswagen only created 22
 million bugs.


Re: Multi-Actuator SAS HDD First Look

2018-04-05 Thread Douglas Gilbert

On 2018-04-05 07:43 PM, Tim Walker wrote:

On Tue, Apr 3, 2018 at 1:46 AM, Christoph Hellwig  wrote:

On Sat, Mar 31, 2018 at 01:03:46PM +0200, Hannes Reinecke wrote:

Actually I would propose to have a 'management' LUN at LUN0, which could
handle all the device-wide commands (e.g. things like START STOP UNIT,
firmware update, or even SMART commands), and ignoring them for the
remaining LUNs.


That is in fact the only workable option at all.  Everything else
completely breaks the scsi architecture.


Here's an update: Seagate will eliminate the inter-LU actions from
FORMAT UNIT and SANITIZE. Probably SANITIZE will be per-LUN, but
FORMAT UNIT is trickier due to internal drive architecture, and how
FORMAT UNIT initializes on-disk metadata. Likely it will require some
sort of synchronization across LUNs, such as the command being sent to
both LUNs sequentially or something similar. We are also considering
not supporting FORMAT UNIT at all - would anybody object? Any other
suggestions?


Good, that is progress. [But you still only have one spindle.]

If Protection Information (PI) or changing the logical block size between
512 and 4096 bytes per block are options, then you need FU for that.
But does it need to take 900 minutes like one I got recently from S..?
Couldn't the actual reformatting of a track be deferred until the first
block is written to that track?
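(For reference, those cases are typically driven from Linux with sg_format
from sg3_utils, along these lines; the device name and values are only
placeholders:

sg_format --format --size=4096 /dev/sg1                # switch to 4096-byte logical blocks
sg_format --format --size=512 --fmtpinfo=2 /dev/sg1    # request protection information; the exact
                                                       # FMTPINFO value depends on the PI type wanted

and it is exactly this kind of invocation that can run for many hours on a
large drive.)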

Doug Gilbert



Re: bcache and hibernation

2018-04-05 Thread Michael Lyle
On Thu, Apr 5, 2018 at 12:51 PM, Nikolaus Rath  wrote:
> Hi Michael,
>
> Could you explain why this isn't a problem with writethrough? It seems
> to me that the trouble happens when the hibernation image is *read*, so
> why does it matter what kind of write caching is used?

With writethrough you can set up your loader to read it directly from
the backing device-- e.g. you don't need the cache, and there are at
least some valid configurations; with writeback some of the extents
may be on the cache dev so...

That said, it's not really great to put swap/hibernate on a cache
device... the workloads don't usually benefit much from tiering (since
they tend to be write-once-read-never or write-once-read-once).
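If you want to check what a given swap/resume device is actually stacked on,
something like this shows the chain (the LV path is just an example):

cat /proc/swaps
lsblk --inverse -o NAME,TYPE /dev/mapper/vg0-swap   # should list dm-*, then bcache*, then the raw disks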

>> I am unaware of a mechanism to prohibit this in the kernel-- to say that
>> a given type of block provider can't be involved in a resume operation.
>> Most documentation for hibernation explicitly cautions about the btrfs
>> situation, but use of bcache is less common and as a result generally
>> isn't covered.
>
> Could you maybe add a warning to Documentation/bcache.txt? I think this
> would have saved me.

Yah, I can look at that.

>
> Best,
> -Nikolaus

Mike


Re: Multi-Actuator SAS HDD First Look

2018-04-05 Thread Tim Walker
On Tue, Apr 3, 2018 at 1:46 AM, Christoph Hellwig  wrote:
> On Sat, Mar 31, 2018 at 01:03:46PM +0200, Hannes Reinecke wrote:
>> Actually I would propose to have a 'management' LUN at LUN0, which could
>> handle all the device-wide commands (e.g. things like START STOP UNIT,
>> firmware update, or even SMART commands), and ignoring them for the
>> remaining LUNs.
>
> That is in fact the only workable option at all.  Everything else
> completely breaks the scsi architecture.

Here's an update: Seagate will eliminate the inter-LU actions from
FORMAT UNIT and SANITIZE. Probably SANITIZE will be per-LUN, but
FORMAT UNIT is trickier due to internal drive architecture, and how
FORMAT UNIT initializes on-disk metadata. Likely it will require some
sort of synchronization across LUNs, such as the command being sent to
both LUNs sequentially or something similar. We are also considering
not supporting FORMAT UNIT at all - would anybody object? Any other
suggestions?

-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770


Re: [PATCH v2 08/11] block: sed-opal: ioctl for writing to shadow mbr

2018-04-05 Thread Scott Bauer
On Thu, Mar 29, 2018 at 08:27:30PM +0200, catch...@ghostav.ddnss.de wrote:
> On Thu, Mar 29, 2018 at 11:16:42AM -0600, Scott Bauer wrote:
> > Yeah, having to authenticate to write the MBR is a real bummer. Theoretically
> > you could dd the pw struct + the shadow MBR into sysfs. But that's
> > a pretty disgusting hack just to use sysfs. The other method I thought of
> > was to authenticate via ioctl then write via sysfs. We already save the PW
> > in-kernel for unlocks, so perhaps we can re-use the save-for-unlock to
> > do shadow MBR writes via sysfs?
> > 
> > Re-using an already exposed ioctl for another purpose seems somewhat
> > dangerous? In the sense that what if the user wants to write the smbr but
> > doesn't want to unlock on suspends, or does not want their PW hanging
> > around in the kernel.
> Well. If we would force the user to a two-step interaction, why not stay
> completely in sysfs? So instead of using the save-for-unlock ioctl, we
> could export each security provider (AdminSP, UserSPX, ...) as a sysfs

The problem with this is single user mode, where you can assign users to
locking ranges. There would have to be a lot of dynamic changes to sysfs as
users get added/removed, or added to LRs etc. It seems like we're trying to
mold something that already works fine into something that doesn't really
work as we dig into the details.



> directory with appropriate files (e.g. mbr for AdminSP) as well as a
> 'unlock' file to store a user's password for the specific locking space
> and a 'lock' file to remove the stored password on write to it.
> Of course, while this will prevent from reuse of the ioctl and
> stays within the same configuration method, the PW will still hang
> around in the kernel between 'lock' and 'unlock'.
> 
> Another idea I just came across while writing this down:
> Instead of storing/releasing the password permanently with the 'unlock' and
> 'lock' files, those may be used to start/stop an authenticated session.
> To make it more clear what I mean: Each ioctl that requires
> authentication has a similar pattern:
> discovery0, start_session, , end_session
> Instead of having the combination determined by the ioctl, the 'unlock'
> would do discovery0 and start_session while the 'lock' would do the
> end_session. The user is free to issue further commands with the
> appropriate write/reads to other files of the sysfs-directory.
> While this removes the requirement to store the key within kernel space,
> the open session handle may be used by everybody with permissions for
> read/write access to the sysfs-directory files. So this is not optimal
> as not only the user who provided the password will finally be able to use
> it.

I generally like the idea of being able to run arbitrary opal commands, but
that's probably not going to work for the final reason you outlined.
Even though it's root-only access (to sysfs), we're breaking the authentication
lower down by essentially allowing any opal command to be run if you've somehow
become root.

The other issue with this is the session timeout in opal. When we dispatch the
commands in-kernel we're hammering them out 1-by-1. If the user needs to do an
activatelsp, setuplr, etc., they do that with a new session.

If someone starts the session and it times out, it may be hard to figure out
how to not get an SP_BUSY back from the controller. I've in the past just had
to wipe my damn fw to get out of SP_BUSYs, but that could be due to the early
implementations I was dealing with.



> I already did some basic work to split off the session-information from
> the opal_dev struct (initially to reduce the memory-footprint of devices with
> currently no active opal-interaction). So I think I could get a
> proof-of-concept of this approach within the next one or two weeks if
> there are no objections to the base idea.

Sorry to come back a week later, but if you do have anything it would be at
least interesting to see. I would still prefer the ioctl route, but will
review and test any implementation people deem acceptable.


Re: bcache and hibernation

2018-04-05 Thread Nikolaus Rath
Hi Michael,

On Apr 05 2018, Michael Lyle  wrote:
> On 04/05/2018 01:51 AM, Nikolaus Rath wrote:
>> Is there a way to prevent this from happening? Could e.g. the kernel
>> detect that the swap device is (indirectly) on bcache and refuse to
>> hibernate? Or is there a way to do a "true" read-only mount of a
>> bcache volume so that one can safely resume from it?
>
> I think you're correct.  If you're using bcache in writeback mode, it is
> not safe to hibernate there, because some of the blocks involved in the
> resume can end up in cache (and dependency issues, like you mention).

Could you explain why this isn't a problem with writethrough? It seems
to me that the trouble happens when the hibernation image is *read*, so
why does it matter what kind of write caching is used?

> I am unaware of a mechanism to prohibit this in the kernel-- to say that
> a given type of block provider can't be involved in a resume operation.
> Most documentation for hibernation explicitly cautions about the btrfs
> situation, but use of bcache is less common and as a result generally
> isn't covered.

Could you maybe add a warning to Documentation/bcache.txt? I think this
would have saved me.

Best,
-Nikolaus

-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«


Re: bcache and hibernation

2018-04-05 Thread Michael Lyle
Hi Nikolaus (and everyone else),

Sorry I've been slow in responding.  I probably need to step down as
bcache maintainer because so many other things have competed for my time
lately and I've fallen behind on both patches and the mailing list.

On 04/05/2018 01:51 AM, Nikolaus Rath wrote:
> Is there a way to prevent this from happening? Could e.g. the kernel detect
> that the swap device is (indirectly) on bcache and refuse to hibernate? Or
> is there a way to do a "true" read-only mount of a bcache volume so that one
> can safely resume from it?

I think you're correct.  If you're using bcache in writeback mode, it is
not safe to hibernate there, because some of the blocks involved in the
resume can end up in cache (and dependency issues, like you mention).
There's similar cautions/problems with btrfs.

I am unaware of a mechanism to prohibit this in the kernel-- to say that
a given type of block provider can't be involved in a resume operation.
Most documentation for hibernation explicitly cautions about the btrfs
situation, but use of bcache is less common and as a result generally
isn't covered.

> Best,
> -Nikolaus
Mike


Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-05 Thread Christian Borntraeger


On 04/05/2018 07:39 PM, Christian Borntraeger wrote:
> 
> 
> On 04/05/2018 06:11 PM, Ming Lei wrote:
>>>
>>> Could you please apply the following patch and provide the dmesg boot log?
>>
>> And please post out the 'lscpu' log together from the test machine too.
> 
> attached.
> 
> As I said before, this seems to go away with CONFIG_NR_CPUS=64 or smaller.
> We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only
> 8 cores == 16 threads.

To say it differently: the whole system has up to 141 CPUs, but this LPAR has
only 8 CPUs assigned. So we have 16 CPUs (SMT), but this could become up to
282 if I were to do CPU hotplug. (But this is not used here.)
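(If it helps, the numbers are easy to double-check on the LPAR, for example:

cat /sys/devices/system/cpu/online     # 0-15 here
cat /sys/devices/system/cpu/possible   # bounded by nr_cpu_ids with the large CONFIG_NR_CPUS
zgrep CONFIG_NR_CPUS= /proc/config.gz  # assumes CONFIG_IKCONFIG_PROC is enabled
)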



Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-05 Thread Christian Borntraeger


On 04/05/2018 06:11 PM, Ming Lei wrote:
>>
>> Could you please apply the following patch and provide the dmesg boot log?
> 
> And please post out the 'lscpu' log together from the test machine too.

attached.

As I said before, this seems to go away with CONFIG_NR_CPUS=64 or smaller.
We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only
8 cores == 16 threads.





dmesg.gz
Description: application/gzip
Architecture:        s390x
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Big Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s) per book:  3
Book(s) per drawer:  2
Drawer(s):           4
NUMA node(s):        1
Vendor ID:           IBM/S390
Machine type:        2964
CPU dynamic MHz:     5000
CPU static MHz:      5000
BogoMIPS:            20325.00
Hypervisor:          PR/SM
Hypervisor vendor:   IBM
Virtualization type: full
Dispatching mode:    horizontal
L1d cache:           128K
L1i cache:           96K
L2d cache:           2048K
L2i cache:           2048K
L3 cache:            65536K
L4 cache:            491520K
NUMA node0 CPU(s):   0-15
Flags:               esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx sie
CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2d:L2i ONLINE CONFIGURED POLARIZATION ADDRESS
0   0    0      0    0      0    0:0:0:0         yes    yes        horizontal   0
1   0    0      0    0      0    1:1:1:1         yes    yes        horizontal   1
2   0    0      0    0      1    2:2:2:2         yes    yes        horizontal   2
3   0    0      0    0      1    3:3:3:3         yes    yes        horizontal   3
4   0    0      0    0      2    4:4:4:4         yes    yes        horizontal   4
5   0    0      0    0      2    5:5:5:5         yes    yes        horizontal   5
6   0    0      0    0      3    6:6:6:6         yes    yes        horizontal   6
7   0    0      0    0      3    7:7:7:7         yes    yes        horizontal   7
8   0    0      0    1      4    8:8:8:8         yes    yes        horizontal   8
9   0    0      0    1      4    9:9:9:9         yes    yes        horizontal   9
10  0    0      0    1      5    10:10:10:10     yes    yes        horizontal   10
11  0    0      0    1      5    11:11:11:11     yes    yes        horizontal   11
12  0    0      0    1      6    12:12:12:12     yes    yes        horizontal   12
13  0    0      0    1      6    13:13:13:13     yes    yes        horizontal   13
14  0    0      0    1      7    14:14:14:14     yes    yes        horizontal   14
15  0    0      0    1      7    15:15:15:15     yes    yes        horizontal   15


Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-05 Thread Yi Zhang



On 04/04/2018 09:22 PM, Sagi Grimberg wrote:



On 03/30/2018 12:32 PM, Yi Zhang wrote:

Hello
I got this kernel BUG on 4.16.0-rc7 during my NVMeoF RDMA testing; here is the
reproducer and log, let me know if you need more info, thanks.


Reproducer:
1. setup target
#nvmetcli restore /etc/rdma.json
2. connect target on host
#nvme connect-all -t rdma -a $IP -s 4420
3. do fio background on host
#fio -filename=/dev/nvme0n1 -iodepth=1 -thread -rw=randwrite 
-ioengine=psync 
-bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 
-bs_unaligned -runtime=180 -size=-group_reporting -name=mytest 
-numjobs=60 &

4. offline cpu on host
#echo 0 > /sys/devices/system/cpu/cpu1/online
#echo 0 > /sys/devices/system/cpu/cpu2/online
#echo 0 > /sys/devices/system/cpu/cpu3/online
5. clear target
#nvmetcli clear
6. restore target
#nvmetcli restore /etc/rdma.json
7. check console log on host


Hi Yi,

Does this happen with this applied?
--
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..b89da55e8aaa 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -35,6 +35,8 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
    const struct cpumask *mask;
    unsigned int queue, cpu;

+   goto fallback;
+
    for (queue = 0; queue < set->nr_hw_queues; queue++) {
    mask = ib_get_vector_affinity(dev, first_vec + queue);
    if (!mask)
--



Hi Sagi

Still can reproduce this issue with the change:

[  133.469908] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420

[  133.554025] nvme nvme0: creating 40 I/O queues.
[  133.947648] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[  138.740870] smpboot: CPU 1 is now offline
[  138.778382] IRQ 37: no longer affine to CPU2
[  138.783153] IRQ 54: no longer affine to CPU2
[  138.787919] IRQ 70: no longer affine to CPU2
[  138.792687] IRQ 98: no longer affine to CPU2
[  138.797458] IRQ 140: no longer affine to CPU2
[  138.802319] IRQ 141: no longer affine to CPU2
[  138.807189] IRQ 166: no longer affine to CPU2
[  138.813622] smpboot: CPU 2 is now offline
[  139.043610] smpboot: CPU 3 is now offline
[  141.587283] print_req_error: operation not supported error, dev nvme0n1, sector 494622136
[  141.587303] print_req_error: operation not supported error, dev nvme0n1, sector 219643648
[  141.587304] print_req_error: operation not supported error, dev nvme0n1, sector 279256456
[  141.587306] print_req_error: operation not supported error, dev nvme0n1, sector 1208024
[  141.587322] print_req_error: operation not supported error, dev nvme0n1, sector 100575248
[  141.587335] print_req_error: operation not supported error, dev nvme0n1, sector 111717456
[  141.587346] print_req_error: operation not supported error, dev nvme0n1, sector 171939296
[  141.587348] print_req_error: operation not supported error, dev nvme0n1, sector 476420528
[  141.587353] print_req_error: operation not supported error, dev nvme0n1, sector 371566696
[  141.587356] print_req_error: operation not supported error, dev nvme0n1, sector 161758408
[  141.587463] Buffer I/O error on dev nvme0n1, logical block 54193430, lost async page write
[  141.587472] Buffer I/O error on dev nvme0n1, logical block 54193431, lost async page write
[  141.587478] Buffer I/O error on dev nvme0n1, logical block 54193432, lost async page write
[  141.587483] Buffer I/O error on dev nvme0n1, logical block 54193433, lost async page write
[  141.587532] Buffer I/O error on dev nvme0n1, logical block 54193476, lost async page write
[  141.587534] Buffer I/O error on dev nvme0n1, logical block 54193477, lost async page write
[  141.587536] Buffer I/O error on dev nvme0n1, logical block 54193478, lost async page write
[  141.587538] Buffer I/O error on dev nvme0n1, logical block 54193479, lost async page write
[  141.587540] Buffer I/O error on dev nvme0n1, logical block 54193480, lost async page write
[  141.587542] Buffer I/O error on dev nvme0n1, logical block 54193481, lost async page write

[  142.573522] nvme nvme0: Reconnecting in 10 seconds...
[  146.587532] buffer_io_error: 3743628 callbacks suppressed
[  146.587534] Buffer I/O error on dev nvme0n1, logical block 64832757, lost async page write
[  146.602837] Buffer I/O error on dev nvme0n1, logical block 64832758, lost async page write
[  146.612091] Buffer I/O error on dev nvme0n1, logical block 64832759, lost async page write
[  146.621346] Buffer I/O error on dev nvme0n1, logical block 64832760, lost async page write

[  146.630615] print_req_error: 556822 callbacks suppressed
[  146.630616] print_req_error: I/O error, dev nvme0n1, sector 518662176
[  146.643776] Buffer I/O error on dev nvme0n1, logical block 64832772, lost async page write
[  146.653030] Buffer I/O error on dev nvme0n1, logical block 64832773, lost async page write
[  146.662282] Buffer I/O error on dev nvme0n1, logical block 64832774, lost async page

Re: BUG: KASAN: use-after-free in bt_for_each+0x1ea/0x29f

2018-04-05 Thread Bart Van Assche
On Wed, 2018-04-04 at 19:26 -0600, Jens Axboe wrote:
> Leaving the whole trace here, but I'm having a hard time making sense of it.
> It complains about a use-after-free in the inflight iteration, which is only
> working on the queue, request, and on-stack mi data. None of these would be
> freed. The below trace on allocation and free indicates a bio, but that isn't
> used in the inflight path at all. Is it possible that kasan gets confused 
> here?
> Not sure what to make of it so far.

Hello Jens,

In the many block layer tests I ran with KASAN enabled I have never seen
anything like this nor have I seen anything that made me wonder about the
reliability of KASAN. Maybe some code outside the block layer core corrupted
a request queue data structure and triggered this weird report?

Bart.





Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-05 Thread Ming Lei
On Fri, Apr 06, 2018 at 12:05:03AM +0800, Ming Lei wrote:
> On Wed, Apr 04, 2018 at 10:18:13AM +0200, Christian Borntraeger wrote:
> > 
> > 
> > On 03/30/2018 04:53 AM, Ming Lei wrote:
> > > On Thu, Mar 29, 2018 at 01:49:29PM +0200, Christian Borntraeger wrote:
> > >>
> > >>
> > >> On 03/29/2018 01:43 PM, Ming Lei wrote:
> > >>> On Thu, Mar 29, 2018 at 12:49:55PM +0200, Christian Borntraeger wrote:
> > 
> > 
> >  On 03/29/2018 12:48 PM, Ming Lei wrote:
> > > On Thu, Mar 29, 2018 at 12:10:11PM +0200, Christian Borntraeger wrote:
> > >>
> > >>
> > >> On 03/29/2018 11:40 AM, Ming Lei wrote:
> > >>> On Thu, Mar 29, 2018 at 11:09:08AM +0200, Christian Borntraeger 
> > >>> wrote:
> > 
> > 
> >  On 03/29/2018 09:23 AM, Christian Borntraeger wrote:
> > >
> > >
> > > On 03/29/2018 04:00 AM, Ming Lei wrote:
> > >> On Wed, Mar 28, 2018 at 05:36:53PM +0200, Christian Borntraeger 
> > >> wrote:
> > >>>
> > >>>
> > >>> On 03/28/2018 05:26 PM, Ming Lei wrote:
> >  Hi Christian,
> > 
> >  On Wed, Mar 28, 2018 at 09:45:10AM +0200, Christian 
> >  Borntraeger wrote:
> > > FWIW, this patch does not fix the issue for me:
> > >
> > > ostname=? addr=? terminal=? res=success'
> > > [   21.454961] WARNING: CPU: 3 PID: 1882 at 
> > > block/blk-mq.c:1410 __blk_mq_delay_run_hw_queue+0xbe/0xd8
> > > [   21.454968] Modules linked in: scsi_dh_rdac scsi_dh_emc 
> > > scsi_dh_alua dm_mirror dm_region_hash dm_log dm_multipath 
> > > dm_mod autofs4
> > > [   21.454984] CPU: 3 PID: 1882 Comm: dasdconf.sh Not tainted 
> > > 4.16.0-rc7+ #26
> > > [   21.454987] Hardware name: IBM 2964 NC9 704 (LPAR)
> > > [   21.454990] Krnl PSW : c0131ea3 3ea2f7bf 
> > > (__blk_mq_delay_run_hw_queue+0xbe/0xd8)
> > > [   21.454996]R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 
> > > AS:3 CC:0 PM:0 RI:0 EA:3
> > > [   21.455005] Krnl GPRS: 013abb69a000 013a 
> > > 013ac6c0dc00 0001
> > > [   21.455008] 013abb69a710 
> > > 013a 0001b691fd98
> > > [   21.455011]0001b691fd98 013ace4775c8 
> > > 0001 
> > > [   21.455014]013ac6c0dc00 00b47238 
> > > 0001b691fc08 0001b691fbd0
> > > [   21.455032] Krnl Code: 0069c596: ebaff0a4  
> > > lmg %r10,%r15,160(%r15)
> > >   0069c59c: c0f47a5e  
> > > brcl15,68ba58
> > >  #0069c5a2: a7f40001  
> > > brc 15,69c5a4
> > >  >0069c5a6: e340f0c4  
> > > lg  %r4,192(%r15)
> > >   0069c5ac: ebaff0a4  
> > > lmg %r10,%r15,160(%r15)
> > >   0069c5b2: 07f4  
> > > bcr 15,%r4
> > >   0069c5b4: c0e5feea  
> > > brasl   %r14,69c388
> > >   0069c5ba: a7f4fff6  
> > > brc 15,69c5a6
> > > [   21.455067] Call Trace:
> > > [   21.455072] ([<0001b691fd98>] 0x1b691fd98)
> > > [   21.455079]  [<0069c692>] 
> > > blk_mq_run_hw_queue+0xba/0x100 
> > > [   21.455083]  [<0069c740>] 
> > > blk_mq_run_hw_queues+0x68/0x88 
> > > [   21.455089]  [<0069b956>] 
> > > __blk_mq_complete_request+0x11e/0x1d8 
> > > [   21.455091]  [<0069ba9c>] 
> > > blk_mq_complete_request+0x8c/0xc8 
> > > [   21.455103]  [<008aa250>] 
> > > dasd_block_tasklet+0x158/0x490 
> > > [   21.455110]  [<0014c742>] 
> > > tasklet_hi_action+0x92/0x120 
> > > [   21.455118]  [<00a7cfc0>] __do_softirq+0x120/0x348 
> > > [   21.455122]  [<0014c212>] irq_exit+0xba/0xd0 
> > > [   21.455130]  [<0010bf92>] do_IRQ+0x8a/0xb8 
> > > [   21.455133]  [<00a7c298>] 
> > > io_int_handler+0x130/0x298 
> > > [   21.455136] Last Breaking-Event-Address:
> > > [   21.455138]  [<0069c5a2>] 
> > > __blk_mq_delay_run_hw_queue+0xba/0xd8
> > > [   21.455140] ---[ end trace be43f99a5d1e553e ]---
> > > [   21.5100

Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()

2018-04-05 Thread Ming Lei
On Wed, Apr 04, 2018 at 10:18:13AM +0200, Christian Borntraeger wrote:
> 
> 
> On 03/30/2018 04:53 AM, Ming Lei wrote:
> > On Thu, Mar 29, 2018 at 01:49:29PM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 03/29/2018 01:43 PM, Ming Lei wrote:
> >>> On Thu, Mar 29, 2018 at 12:49:55PM +0200, Christian Borntraeger wrote:
> 
> 
>  On 03/29/2018 12:48 PM, Ming Lei wrote:
> > On Thu, Mar 29, 2018 at 12:10:11PM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 03/29/2018 11:40 AM, Ming Lei wrote:
> >>> On Thu, Mar 29, 2018 at 11:09:08AM +0200, Christian Borntraeger wrote:
> 
> 
>  On 03/29/2018 09:23 AM, Christian Borntraeger wrote:
> >
> >
> > On 03/29/2018 04:00 AM, Ming Lei wrote:
> >> On Wed, Mar 28, 2018 at 05:36:53PM +0200, Christian Borntraeger 
> >> wrote:
> >>>
> >>>
> >>> On 03/28/2018 05:26 PM, Ming Lei wrote:
>  Hi Christian,
> 
>  On Wed, Mar 28, 2018 at 09:45:10AM +0200, Christian Borntraeger 
>  wrote:
> > FWIW, this patch does not fix the issue for me:
> >
> > ostname=? addr=? terminal=? res=success'
> > [   21.454961] WARNING: CPU: 3 PID: 1882 at block/blk-mq.c:1410 
> > __blk_mq_delay_run_hw_queue+0xbe/0xd8
> > [   21.454968] Modules linked in: scsi_dh_rdac scsi_dh_emc 
> > scsi_dh_alua dm_mirror dm_region_hash dm_log dm_multipath 
> > dm_mod autofs4
> > [   21.454984] CPU: 3 PID: 1882 Comm: dasdconf.sh Not tainted 
> > 4.16.0-rc7+ #26
> > [   21.454987] Hardware name: IBM 2964 NC9 704 (LPAR)
> > [   21.454990] Krnl PSW : c0131ea3 3ea2f7bf 
> > (__blk_mq_delay_run_hw_queue+0xbe/0xd8)
> > [   21.454996]R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 
> > AS:3 CC:0 PM:0 RI:0 EA:3
> > [   21.455005] Krnl GPRS: 013abb69a000 013a 
> > 013ac6c0dc00 0001
> > [   21.455008] 013abb69a710 
> > 013a 0001b691fd98
> > [   21.455011]0001b691fd98 013ace4775c8 
> > 0001 
> > [   21.455014]013ac6c0dc00 00b47238 
> > 0001b691fc08 0001b691fbd0
> > [   21.455032] Krnl Code: 0069c596: ebaff0a4
> > lmg %r10,%r15,160(%r15)
> >   0069c59c: c0f47a5e
> > brcl15,68ba58
> >  #0069c5a2: a7f40001
> > brc 15,69c5a4
> >  >0069c5a6: e340f0c4
> > lg  %r4,192(%r15)
> >   0069c5ac: ebaff0a4
> > lmg %r10,%r15,160(%r15)
> >   0069c5b2: 07f4
> > bcr 15,%r4
> >   0069c5b4: c0e5feea
> > brasl   %r14,69c388
> >   0069c5ba: a7f4fff6
> > brc 15,69c5a6
> > [   21.455067] Call Trace:
> > [   21.455072] ([<0001b691fd98>] 0x1b691fd98)
> > [   21.455079]  [<0069c692>] 
> > blk_mq_run_hw_queue+0xba/0x100 
> > [   21.455083]  [<0069c740>] 
> > blk_mq_run_hw_queues+0x68/0x88 
> > [   21.455089]  [<0069b956>] 
> > __blk_mq_complete_request+0x11e/0x1d8 
> > [   21.455091]  [<0069ba9c>] 
> > blk_mq_complete_request+0x8c/0xc8 
> > [   21.455103]  [<008aa250>] 
> > dasd_block_tasklet+0x158/0x490 
> > [   21.455110]  [<0014c742>] 
> > tasklet_hi_action+0x92/0x120 
> > [   21.455118]  [<00a7cfc0>] __do_softirq+0x120/0x348 
> > [   21.455122]  [<0014c212>] irq_exit+0xba/0xd0 
> > [   21.455130]  [<0010bf92>] do_IRQ+0x8a/0xb8 
> > [   21.455133]  [<00a7c298>] io_int_handler+0x130/0x298 
> > [   21.455136] Last Breaking-Event-Address:
> > [   21.455138]  [<0069c5a2>] 
> > __blk_mq_delay_run_hw_queue+0xba/0xd8
> > [   21.455140] ---[ end trace be43f99a5d1e553e ]---
> > [   21.510046] dasdconf.sh Warning: 0.0.241e is already online, 
> > not configuring
> 
>  Thinking about this issue further, I can't understand the root 
>  cause for
>  this issue.
> 
>  FWIW, Li

Re: 4.15.15: BFQ stalled at blk_mq_get_tag

2018-04-05 Thread Jens Axboe
On 4/5/18 8:45 AM, Paolo Valente wrote:
> 
> 
>> Il giorno 05 apr 2018, alle ore 15:15, Sami Farin 
>>  ha scritto:
>>
>> I was using chacharand to fill a 32 GB SD card (VFAT fs) (maybe 30 MiB/s)
>> with random data; it froze halfway.  There was 400 MiB of Dirty data.
>> After reboot the filling operation went OK when I used kyber scheduler.
>> System is Fedora 27 on Core i5 2500K / 16 GiB.
>>
> 
> I'm afraid this crash is caused by a bug fixed for 4.16 [1].  In the
> same thread [1], Oleksander (in CC) proposed to backport this and
> other fixes and improvements to 4.15.  But Jens (in CC) didn't accept,
> because too general stuff was included in the batch.  Maybe this bug
> report could be the opportunity to reconsider that backport or part of
> it?

I never objected to back porting the single fix. What I did object
to was a huge list of other fixes. I'm also fine with back porting
a set of fixes, if they are all relevant to back port. I don't
want a wholesale list of "these are all the changes we did to BFQ,
let's back port them".

-- 
Jens Axboe



Re: 4.15.15: BFQ stalled at blk_mq_get_tag

2018-04-05 Thread Paolo Valente


> Il giorno 05 apr 2018, alle ore 15:15, Sami Farin 
>  ha scritto:
> 
> I was using chacharand to fill a 32 GB SD card (VFAT fs) (maybe 30 MiB/s)
> with random data; it froze halfway.  There was 400 MiB of Dirty data.
> After reboot the filling operation went OK when I used kyber scheduler.
> System is Fedora 27 on Core i5 2500K / 16 GiB.
> 

I'm afraid this crash is caused by a bug fixed for 4.16 [1].  In the
same thread [1], Oleksander (in CC) proposed to backport this and
other fixes and improvements to 4.15.  But Jens (in CC) didn't accept,
because too general stuff was included in the batch.  Maybe this bug
report could be the opportunity to reconsider that backport or part of
it?

Thanks,
Paolo

[1] https://lkml.org/lkml/2018/2/7/678

> sysrq: SysRq : Show Blocked State
> taskPC stack   pid father
> device poll D0 2811838  1 0x
> Call Trace:
> ? __schedule+0x2c2/0x910
> schedule+0x2a/0x80
> schedule_timeout+0x8a/0x490
> ? collect_expired_timers+0xa0/0xa0
> msleep+0x24/0x30
> usb_port_suspend+0x298/0x430 [usbcore]
> usb_suspend_both+0x17d/0x200 [usbcore]
> ? usb_probe_interface+0x300/0x300 [usbcore]
> usb_runtime_suspend+0x25/0x60 [usbcore]
> __rpm_callback+0xb7/0x1f0
> ? usb_probe_interface+0x300/0x300 [usbcore]
> rpm_callback+0x1a/0x80
> ? usb_probe_interface+0x300/0x300 [usbcore]
> rpm_suspend+0x11e/0x660
> __pm_runtime_suspend+0x36/0x60
> usbdev_release+0xb3/0x120 [usbcore]
> __fput+0xa3/0x1f0
> task_work_run+0x82/0xa0
> exit_to_usermode_loop+0x91/0xa0
> do_syscall_64+0xe7/0x100
> entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> RIP: 0033:0x7f41101c170c
> RSP: 002b:7f410e655b80 EFLAGS: 0293 ORIG_RAX: 0003
> RAX:  RBX: 7f410e655ecb RCX: 7f41101c170c
> RDX:  RSI: 7f410e655ea0 RDI: 0007
> RBP: 7f410e655ec4 R08:  R09: 7f410080
> R10:  R11: 0293 R12: 7f410e655c90
> R13: 7f410e655ecb R14: 0007 R15: 7f410e655ebb
> kworker/u8:4D0 2978647  2 0x8000
> Workqueue: writeback wb_workfn (flush-8:80)
> Call Trace:
> ? __schedule+0x2c2/0x910
> schedule+0x2a/0x80
> io_schedule+0xd/0x30
> blk_mq_get_tag+0x150/0x250
> ? wait_woken+0x80/0x80
> blk_mq_get_request+0x131/0x450
> ? bfq_bio_merge+0xcb/0x100
> blk_mq_make_request+0x118/0x6e0
> ? blk_queue_enter+0x31/0x2f0
> generic_make_request+0xfd/0x2a0
> ? submit_bio+0x67/0x140
> submit_bio+0x67/0x140
> ? guard_bio_eod+0x78/0x150
> mpage_writepages+0xa7/0xe0
> ? fat_add_cluster+0x60/0x60 [fat]
> ? do_writepages+0x37/0xc0
> ? fat_writepage+0x10/0x10 [fat]
> do_writepages+0x37/0xc0
> ? reacquire_held_locks+0x8f/0x150
> ? writeback_sb_inodes+0xef/0x490
> ? __writeback_single_inode+0x5a/0x530
> __writeback_single_inode+0x5a/0x530
> writeback_sb_inodes+0x1ed/0x490
> __writeback_inodes_wb+0x55/0xa0
> wb_writeback+0x261/0x3f0
> ? wb_workfn+0x1fd/0x4f0
> wb_workfn+0x1fd/0x4f0
> process_one_work+0x206/0x560
> worker_thread+0x2c/0x380
> ? process_one_work+0x560/0x560
> kthread+0x10e/0x130
> ? kthread_create_on_node+0x40/0x40
> ret_from_fork+0x35/0x40
> kworker/0:3 D0 2979285  2 0x8000
> Workqueue: events_freezable_power_ disk_events_workfn
> Call Trace:
> ? __schedule+0x2c2/0x910
> schedule+0x2a/0x80
> io_schedule+0xd/0x30
> blk_mq_get_tag+0x150/0x250
> ? wait_woken+0x80/0x80
> blk_mq_get_request+0x131/0x450
> blk_mq_alloc_request+0x58/0xb0
> blk_get_request_flags+0x3b/0x150
> scsi_execute+0x33/0x250
> scsi_test_unit_ready+0x48/0xb0
> sd_check_events+0xc8/0x170
> disk_check_events+0x54/0x130
> process_one_work+0x206/0x560
> worker_thread+0x2c/0x380
> ? process_one_work+0x560/0x560
> kthread+0x10e/0x130
> ? kthread_create_on_node+0x40/0x40
> ? SyS_exit+0xe/0x10
> ret_from_fork+0x35/0x40
> chacharand  D0 2980742 2978974 0x8002
> Call Trace:
> ? __schedule+0x2c2/0x910
> schedule+0x2a/0x80
> io_schedule+0xd/0x30
> blk_mq_get_tag+0x150/0x250
> ? wait_woken+0x80/0x80
> blk_mq_get_request+0x131/0x450
> ? bfq_bio_merge+0xcb/0x100
> blk_mq_make_request+0x118/0x6e0
> ? blk_queue_enter+0x31/0x2f0
> generic_make_request+0xfd/0x2a0
> ? submit_bio+0x67/0x140
> submit_bio+0x67/0x140
> ? guard_bio_eod+0x78/0x150
> __mpage_writepage+0x67e/0x7a0
> ? clear_page_dirty_for_io+0x10f/0x240
> ? clear_page_dirty_for_io+0x12f/0x240
> write_cache_pages+0x1ee/0x460
> ? clean_buffers+0x60/0x60
> ? fat_add_cluster+0x60/0x60 [fat]
> mpage_writepages+0x68/0xe0
> ? fat_add_cluster+0x60/0x60 [fat]
> ? do_writepages+0x37/0xc0
> ? fat_writepage+0x10/0x10 [fat]
> do_writepages+0x37/0xc0
> ? __filemap_fdatawrite_range+0x99/0xe0
> ? __filemap_fdatawrite_range+0xa6/0xe0
> __filemap_fdatawrite_range+0xa6/0xe0
> ? sync_inode_metadata+0x2a/0x30
> fat_flush_inodes+0x25/0x60 [fat]
> fat_file_release+0x2a/0x40 [fat]
> __fput+0xa3/0x1f0
> task_work_run+0x82/0xa0
> do_exit+0x29b/0xbf0
> do_group_exit+0x34/0xb0
> SyS_exit_group+0xb/0x10
> do_syscall_64+0x62/0x100
> entry_SYSCALL_64_after_hwframe+0

Re: [BISECTED][REGRESSION] Hang while booting EeePC 900

2018-04-05 Thread Tejun Heo
Hello,

On Thu, Apr 05, 2018 at 09:14:15AM +0100, Sitsofe Wheeler wrote:
> Just out of interest, does the fact that an abort occurs mean that the
> hardware is somehow broken or badly behaved?

Not really.  For example, ATAPI devices depend on exception handling
to fetch sense data as a part of normal operation which is handled by
libata exception handler which is invoked by aborting the original
command.  So, exception handling can often be a part of normal
operation.

Thanks.

-- 
tejun


Re: [RFC PATCH 0/2] use larger max_request_size for virtio_blk

2018-04-05 Thread Jens Axboe
On 4/5/18 4:09 AM, Weiping Zhang wrote:
> Hi,
> 
> For virtio block devices there is actually no hard limit on the max request
> size, and the virtio_blk driver calls blk_queue_max_hw_sectors(q, -1U).
> But it doesn't work, because there is a default upper limit,
> BLK_DEF_MAX_SECTORS (1280 sectors). So this series wants to add a new helper,
> blk_queue_max_hw_sectors_no_limit, to set a proper max request size.
> 
> Weiping Zhang (2):
>   blk-setting: add new helper blk_queue_max_hw_sectors_no_limit
>   virtio_blk: add new module parameter to set max request size
> 
>  block/blk-settings.c   | 20 
>  drivers/block/virtio_blk.c | 32 ++--
>  include/linux/blkdev.h |  2 ++
>  3 files changed, 52 insertions(+), 2 deletions(-)

The driver should just use blk_queue_max_hw_sectors() to set the limit,
and then the soft limit can be modified by a udev rule. Technically the
driver doesn't own the software limit; it's imposed to ensure that we
don't introduce too much latency per request.

Your situation is no different from many other setups, where the
hw limit is much higher than the default 1280k.
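For a one-off test you can also poke the soft limit directly in sysfs (vda is
just an example name; a value above max_hw_sectors_kb will be rejected):

cat /sys/block/vda/queue/max_hw_sectors_kb
echo 4096 > /sys/block/vda/queue/max_sectors_kb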

-- 
Jens Axboe



4.15.15: BFQ stalled at blk_mq_get_tag

2018-04-05 Thread Sami Farin
I was using chacharand to fill a 32 GB SD card (VFAT fs) (maybe 30 MiB/s)
with random data; it froze halfway.  There was 400 MiB of Dirty data.
After reboot the filling operation went OK when I used kyber scheduler.
System is Fedora 27 on Core i5 2500K / 16 GiB.

sysrq: SysRq : Show Blocked State
 taskPC stack   pid father
device poll D0 2811838  1 0x
Call Trace:
? __schedule+0x2c2/0x910
schedule+0x2a/0x80
schedule_timeout+0x8a/0x490
? collect_expired_timers+0xa0/0xa0
msleep+0x24/0x30
usb_port_suspend+0x298/0x430 [usbcore]
usb_suspend_both+0x17d/0x200 [usbcore]
? usb_probe_interface+0x300/0x300 [usbcore]
usb_runtime_suspend+0x25/0x60 [usbcore]
__rpm_callback+0xb7/0x1f0
? usb_probe_interface+0x300/0x300 [usbcore]
rpm_callback+0x1a/0x80
? usb_probe_interface+0x300/0x300 [usbcore]
rpm_suspend+0x11e/0x660
__pm_runtime_suspend+0x36/0x60
usbdev_release+0xb3/0x120 [usbcore]
__fput+0xa3/0x1f0
task_work_run+0x82/0xa0
exit_to_usermode_loop+0x91/0xa0
do_syscall_64+0xe7/0x100
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f41101c170c
RSP: 002b:7f410e655b80 EFLAGS: 0293 ORIG_RAX: 0003
RAX:  RBX: 7f410e655ecb RCX: 7f41101c170c
RDX:  RSI: 7f410e655ea0 RDI: 0007
RBP: 7f410e655ec4 R08:  R09: 7f410080
R10:  R11: 0293 R12: 7f410e655c90
R13: 7f410e655ecb R14: 0007 R15: 7f410e655ebb
kworker/u8:4D0 2978647  2 0x8000
Workqueue: writeback wb_workfn (flush-8:80)
Call Trace:
? __schedule+0x2c2/0x910
schedule+0x2a/0x80
io_schedule+0xd/0x30
blk_mq_get_tag+0x150/0x250
? wait_woken+0x80/0x80
blk_mq_get_request+0x131/0x450
? bfq_bio_merge+0xcb/0x100
blk_mq_make_request+0x118/0x6e0
? blk_queue_enter+0x31/0x2f0
generic_make_request+0xfd/0x2a0
? submit_bio+0x67/0x140
submit_bio+0x67/0x140
? guard_bio_eod+0x78/0x150
mpage_writepages+0xa7/0xe0
? fat_add_cluster+0x60/0x60 [fat]
? do_writepages+0x37/0xc0
? fat_writepage+0x10/0x10 [fat]
do_writepages+0x37/0xc0
? reacquire_held_locks+0x8f/0x150
? writeback_sb_inodes+0xef/0x490
? __writeback_single_inode+0x5a/0x530
__writeback_single_inode+0x5a/0x530
writeback_sb_inodes+0x1ed/0x490
__writeback_inodes_wb+0x55/0xa0
wb_writeback+0x261/0x3f0
? wb_workfn+0x1fd/0x4f0
wb_workfn+0x1fd/0x4f0
process_one_work+0x206/0x560
worker_thread+0x2c/0x380
? process_one_work+0x560/0x560
kthread+0x10e/0x130
? kthread_create_on_node+0x40/0x40
ret_from_fork+0x35/0x40
kworker/0:3 D0 2979285  2 0x8000
Workqueue: events_freezable_power_ disk_events_workfn
Call Trace:
? __schedule+0x2c2/0x910
schedule+0x2a/0x80
io_schedule+0xd/0x30
blk_mq_get_tag+0x150/0x250
? wait_woken+0x80/0x80
blk_mq_get_request+0x131/0x450
blk_mq_alloc_request+0x58/0xb0
blk_get_request_flags+0x3b/0x150
scsi_execute+0x33/0x250
scsi_test_unit_ready+0x48/0xb0
sd_check_events+0xc8/0x170
disk_check_events+0x54/0x130
process_one_work+0x206/0x560
worker_thread+0x2c/0x380
? process_one_work+0x560/0x560
kthread+0x10e/0x130
? kthread_create_on_node+0x40/0x40
? SyS_exit+0xe/0x10
ret_from_fork+0x35/0x40
chacharand  D0 2980742 2978974 0x8002
Call Trace:
? __schedule+0x2c2/0x910
schedule+0x2a/0x80
io_schedule+0xd/0x30
blk_mq_get_tag+0x150/0x250
? wait_woken+0x80/0x80
blk_mq_get_request+0x131/0x450
? bfq_bio_merge+0xcb/0x100
blk_mq_make_request+0x118/0x6e0
? blk_queue_enter+0x31/0x2f0
generic_make_request+0xfd/0x2a0
? submit_bio+0x67/0x140
submit_bio+0x67/0x140
? guard_bio_eod+0x78/0x150
__mpage_writepage+0x67e/0x7a0
? clear_page_dirty_for_io+0x10f/0x240
? clear_page_dirty_for_io+0x12f/0x240
write_cache_pages+0x1ee/0x460
? clean_buffers+0x60/0x60
? fat_add_cluster+0x60/0x60 [fat]
mpage_writepages+0x68/0xe0
? fat_add_cluster+0x60/0x60 [fat]
? do_writepages+0x37/0xc0
? fat_writepage+0x10/0x10 [fat]
do_writepages+0x37/0xc0
? __filemap_fdatawrite_range+0x99/0xe0
? __filemap_fdatawrite_range+0xa6/0xe0
__filemap_fdatawrite_range+0xa6/0xe0
? sync_inode_metadata+0x2a/0x30
fat_flush_inodes+0x25/0x60 [fat]
fat_file_release+0x2a/0x40 [fat]
__fput+0xa3/0x1f0
task_work_run+0x82/0xa0
do_exit+0x29b/0xbf0
do_group_exit+0x34/0xb0
SyS_exit_group+0xb/0x10
do_syscall_64+0x62/0x100
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7fd9172b3178
RSP: 002b:7fffe01eb248 EFLAGS: 0246 ORIG_RAX: 00e7
RAX: ffda RBX:  RCX: 7fd9172b3178
RDX:  RSI: 003c RDI: 
RBP: 7fd9175b08b8 R08: 00e7 R09: ff80
R10: 7fffe01eb1d0 R11: 0246 R12: 7fd9175b08b8
R13: 7fd9175b5d60 R14:  R15: 
(ostnamed)  D0 2981753  1 0x0004
Call Trace:
? __schedule+0x2c2/0x910
? rwsem_down_write_failed+0x174/0x260
schedule+0x2a/0x80
rwsem_down_write_failed+0x179/0x260
? call_rwsem_down_write_failed+0x13/0x20
call_rwsem_down_write_failed+0x13/0x20
down_write+0x3b/0x50
? do_mount+0x434/0xdb0
do_mount+

Re: [RFC PATCH 0/2] use larger max_request_size for virtio_blk

2018-04-05 Thread Martin K. Petersen

Weiping,

> For virtio block devices there is actually no hard limit on the max
> request size, and the virtio_blk driver calls
> blk_queue_max_hw_sectors(q, -1U).  But it doesn't work, because there
> is a default upper limit, BLK_DEF_MAX_SECTORS (1280 sectors).

That's intentional (although it's an ongoing debate what the actual
value should be).

> So this series wants to add a new helper,
> blk_queue_max_hw_sectors_no_limit, to set a proper max request size.

BLK_DEF_MAX_SECTORS is a kernel default empirically chosen to strike a
decent balance between I/O latency and bandwidth. It sets an upper bound
for filesystem requests only, regardless of the capabilities of the
block device driver and underlying hardware.

You can override the limit on a per-device basis via max_sectors_kb in
sysfs. People generally do it via a udev rule.
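As a sketch only (the device match and value are just examples):

cat > /etc/udev/rules.d/60-block-max-sectors.rules <<'EOF'
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="vd[a-z]", ATTR{queue/max_sectors_kb}="4096"
EOF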

-- 
Martin K. Petersen  Oracle Linux Engineering


[RFC PATCH 1/2] blk-setting: add new helper blk_queue_max_hw_sectors_no_limit

2018-04-05 Thread Weiping Zhang
There is a default upper limit, BLK_DEF_MAX_SECTORS, but for
some virtual block device drivers there is no such limitation. So
add a new helper to set the max request size.

Signed-off-by: Weiping Zhang 
---
 block/blk-settings.c   | 20 
 include/linux/blkdev.h |  2 ++
 2 files changed, 22 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 48ebe6b..685c30c 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -253,6 +253,26 @@ void blk_queue_max_hw_sectors(struct request_queue *q, unsigned int max_hw_secto
 }
 EXPORT_SYMBOL(blk_queue_max_hw_sectors);
 
+/* same as blk_queue_max_hw_sectors but without default upper limitation */
+void blk_queue_max_hw_sectors_no_limit(struct request_queue *q,
+   unsigned int max_hw_sectors)
+{
+   struct queue_limits *limits = &q->limits;
+   unsigned int max_sectors;
+
+   if ((max_hw_sectors << 9) < PAGE_SIZE) {
+   max_hw_sectors = 1 << (PAGE_SHIFT - 9);
+   printk(KERN_INFO "%s: set to minimum %d\n",
+  __func__, max_hw_sectors);
+   }
+
+   limits->max_hw_sectors = max_hw_sectors;
+   max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
+   limits->max_sectors = max_sectors;
+   q->backing_dev_info->io_pages = max_sectors >> (PAGE_SHIFT - 9);
+}
+EXPORT_SYMBOL(blk_queue_max_hw_sectors_no_limit);
+
 /**
  * blk_queue_chunk_sectors - set size of the chunk for this queue
  * @q:  the request queue for the device
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ed63f3b..2250709 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1243,6 +1243,8 @@ extern void blk_cleanup_queue(struct request_queue *);
 extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
 extern void blk_queue_bounce_limit(struct request_queue *, u64);
 extern void blk_queue_max_hw_sectors(struct request_queue *, unsigned int);
+extern void blk_queue_max_hw_sectors_no_limit(struct request_queue *,
+   unsigned int);
 extern void blk_queue_chunk_sectors(struct request_queue *, unsigned int);
 extern void blk_queue_max_segments(struct request_queue *, unsigned short);
 extern void blk_queue_max_discard_segments(struct request_queue *,
-- 
2.9.4



[RFC PATCH 0/2] use larger max_request_size for virtio_blk

2018-04-05 Thread Weiping Zhang
Hi,

For virtio block device, actually there is no a hard limit for max request
size, and virtio_blk driver set -1 to blk_queue_max_hw_sectors(q, -1U);.
But it doesn't work, because there is a default upper limitation
BLK_DEF_MAX_SECTORS (1280 sectors). So this series want to add a new helper
blk_queue_max_hw_sectors_no_limit to set a proper max reqeust size.

Weiping Zhang (2):
  blk-setting: add new helper blk_queue_max_hw_sectors_no_limit
  virtio_blk: add new module parameter to set max request size

 block/blk-settings.c   | 20 
 drivers/block/virtio_blk.c | 32 ++--
 include/linux/blkdev.h |  2 ++
 3 files changed, 52 insertions(+), 2 deletions(-)

-- 
2.9.4



Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-04-05 Thread Thomas Gleixner
On Wed, 4 Apr 2018, Ming Lei wrote:
> On Wed, Apr 04, 2018 at 02:45:18PM +0200, Thomas Gleixner wrote:
> > Now the 4 offline CPUs are plugged in again. These CPUs won't ever get an
> > interrupt as all interrupts stay on CPU 0-3 unless one of these CPUs is
> > unplugged. Using cpu_present_mask the spread would be:
> > 
> > irq 39, cpu list 0,1
> > irq 40, cpu list 2,3
> > irq 41, cpu list 4,5
> > irq 42, cpu list 6,7
> 
> Given physical CPU hotplug isn't common, this way will make only irq 39
> and irq 40 active most of the time, so a performance regression is caused
> just as Kashyap reported.

That is only true if CPUs 4-7 are in the present mask at boot time. I
seriously doubt that this is the case for Kashyap's scenario. Grrr, if you
had included him in the Reported-by: tags then I could have asked
him myself.

In the physical hotplug case, the physically (or virtually) unavailable
CPUs are not in the present mask. They are solely in the possible mask.

The above is about soft hotplug where the CPUs are physically there and
therefore in the present mask and can be onlined without interaction from
the outside (mechanical or virt config).
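(All of this is easy to inspect on the affected machine; the irq number below
is just the one from the example spread above:

cat /sys/devices/system/cpu/possible
cat /sys/devices/system/cpu/present
cat /sys/devices/system/cpu/online
cat /proc/irq/39/smp_affinity_list
cat /proc/irq/39/effective_affinity_list
)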

If nobody objects, I'll make that change and queue the stuff tomorrow
morning so it can brew a few days in next before I send it off to Linus.

Thanks,

tglx


[RFC PATCH 2/2] virtio_blk: add new module parameter to set max request size

2018-04-05 Thread Weiping Zhang
Actually there is no upper limit, so add a new module parameter
to provide a way to set a proper max request size for virtio block.
Using a larger request size can improve sequential performance in theory,
and reduce the interaction between guest and hypervisor.

Signed-off-by: Weiping Zhang 
---
 drivers/block/virtio_blk.c | 32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4a07593c..5ac6d59 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -64,6 +64,34 @@ struct virtblk_req {
struct scatterlist sg[];
 };
 
+
+static int max_request_size_set(const char *val, const struct kernel_param *kp);
+
+static const struct kernel_param_ops max_request_size_ops = {
+   .set = max_request_size_set,
+   .get = param_get_uint,
+};
+
+static unsigned int max_request_size = 4096; /* in unit of KiB */
+module_param_cb(max_request_size, &max_request_size_ops, &max_request_size,
+   0444);
+MODULE_PARM_DESC(max_request_size, "set max request size, in unit of KiB");
+
+static int max_request_size_set(const char *val, const struct kernel_param *kp)
+{
+   int ret;
+   unsigned int size_kb, page_kb = 1 << (PAGE_SHIFT - 10);
+
+   ret = kstrtouint(val, 10, &size_kb);
+   if (ret != 0)
+   return -EINVAL;
+
+   if (size_kb < page_kb)
+   return -EINVAL;
+
+   return param_set_uint(val, kp);
+}
+
 static inline blk_status_t virtblk_result(struct virtblk_req *vbr)
 {
switch (vbr->status) {
@@ -730,8 +758,8 @@ static int virtblk_probe(struct virtio_device *vdev)
/* We can handle whatever the host told us to handle. */
blk_queue_max_segments(q, vblk->sg_elems-2);
 
-   /* No real sector limit. */
-   blk_queue_max_hw_sectors(q, -1U);
+   /* No real sector limit, use 512b (max_request_size << 10) >> 9 */
+   blk_queue_max_hw_sectors_no_limit(q, max_request_size << 1);
 
/* Host can optionally specify maximum segment size and number of
 * segments. */
-- 
2.9.4



bcache and hibernation (was: bcache: bad block header)

2018-04-05 Thread Nikolaus Rath
Hi,

I have a hypothesis of what happened. My swap volume is also on LVM, and thus
also eventually backed by bcache. Hibernation and resume work fine. But when
the hibernation image is read during resume, the contents of the cache device
change, because with bcache reading is no longer a read-only operation. When the
hibernation image is loaded, the kernel loses track of these changes, so that
what's on the cache disk no longer matches the structures in the kernel.
Therefore, on the first boot after the successful resume, havoc ensues.

I needed the system running again, so I've now detached the backing volumes, 
re-initialized the cache volume and re-attached the backing volumes. 
Unfortunately there was too much filesystem damage, so I restored everything 
from backup.
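(For the record, the recovery was roughly the following; device names and the
cache set UUID are placeholders, and re-initializing the cache of course throws
away its contents:

echo 1 > /sys/block/bcache0/bcache/detach
wipefs -a /dev/sdb2
make-bcache -C /dev/sdb2
echo /dev/sdb2 > /sys/fs/bcache/register
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
)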

Is there a way to prevent this from happening? Could e.g. the kernel detect that
the swap device is (indirectly) on bcache and refuse to hibernate? Or is there
a way to do a "true" read-only mount of a bcache volume so that one can safely
resume from it?
 
Best,
-Nikolaus

--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«

On Tue, 3 Apr 2018, at 23:38, Jens Axboe wrote:
> CC'ing Mike
> 
> On 4/3/18 1:01 PM, Nikolaus Rath wrote:
> > [ Re-send to both linux-block and linux-bcache ]
> > 
> > Hi,
> > 
> > A few days ago, my system refused to boot because it couldn't find the root 
> > filesystem anymore. The root filesystem is ext4 on LVM on dm-crypt on 
> > bcache, using kernel 4.9.92 (from Debian stretch). Booting from a recovery 
> > medium with Kernel 4.16, I got:
> > 
> > [   84.551715] bcache: register_bcache() error /dev/sda4: device already 
> > registered
> > [   84.553188] bcache: register_bcache() error /dev/sdc2: device already 
> > registered
> > [   84.616438] bcache: error on 1330b5f6-0c13-43ec-b925-2ee2734b135f:
> > [   84.616440] bad btree header at bucket 85065, block 0, 0 keys
> > [   84.616442] , disabling caching
> > [   84.616445] bcache: register_cache() registered cache device sdb2
> > [   84.616597] bcache: cache_set_free() Cache set 
> > 1330b5f6-0c13-43ec-b925-2ee2734b135f unregistered
> > [   85.375933]  sdb: sdb1 sdb2 sdb4 < sdb5 >
> > [   85.416610] bcache: error on 1330b5f6-0c13-43ec-b925-2ee2734b135f:
> > [   85.416612] bad btree header at bucket 85065, block 0, 0 keys
> > [   85.416614] , disabling caching
> > [   85.416618] bcache: register_cache() registered cache device sdb2
> > [   85.416624] bcache: register_bcache() error /dev/sdc2: device already 
> > registered
> > [   85.416626] bcache: register_bcache() error /dev/sda4: device already 
> > registered
> > [   85.416796] bcache: cache_set_free() Cache set 
> > 1330b5f6-0c13-43ec-b925-2ee2734b135f unregistered
> > [   85.488246] bcache: error on 1330b5f6-0c13-43ec-b925-2ee2734b135f:
> > [   85.488249] bad btree header at bucket 85065, block 0, 0 keys
> > [   85.488251] , disabling caching
> > [   85.488254] bcache: register_cache() registered cache device sdb2
> > [   85.488429] bcache: cache_set_free() Cache set 
> > 1330b5f6-0c13-43ec-b925-2ee2734b135f unregistered
> > [   85.560003] bcache: error on 1330b5f6-0c13-43ec-b925-2ee2734b135f:
> > [   85.560006] bad btree header at bucket 85065, block 0, 0 keys
> > [   85.560008] , disabling caching
> > [   85.560013] bcache: register_cache() registered cache device sdb2
> > [   85.560017] bcache: register_bcache() error /dev/sda4: device already 
> > registered
> > [   85.560217] bcache: cache_set_free() Cache set 
> > 1330b5f6-0c13-43ec-b925-2ee2734b135f unregistered
> > [   85.571950] bcache: register_bcache() error /dev/sdc2: device already 
> > registered
> > [   85.580628] bcache: register_bcache() error /dev/sdc2: device already 
> > registered
> > [   85.761969] bcache: register_bcache() error /dev/sda4: device already 
> > registered
> > [   85.792749] bcache: register_bcache() error /dev/sda4: device already 
> > registered
> > [   85.952931] bcache: register_bcache() error /dev/sda4: device already 
> > registered
> > [   85.955640] bcache: register_bcache() error /dev/sda4: device already 
> > registered
> > [...]
> > 
> > These are the first messages that mention bcache. Note that the first 
> > message is that the device is already registered - is that normal?
> > 
> > smartctl does not report any errors on backing or caching disks, and the 
> > system was shutdown cleanly.
> > 
> > The only possibly related thing that comes to mind is that a few days ago I 
> > hibernated and resumed the system (this is something I normally don't do). 
> > Resume worked fine as far as I could tell though, and there have been no 
> > unclean shutdowns.
> > 
> > Is there a way to narrow down what may have caused this corruption?
> > 
> > And, is there a way to gracefully recover from this situation without 
> > wiping everything? Since the message mentions only problems with one block, 
> > can I maybe tell bcache t

Re: [BISECTED][REGRESSION] Hang while booting EeePC 900

2018-04-05 Thread Sitsofe Wheeler
On 2 April 2018 at 21:29, Tejun Heo  wrote:
> Hello, Sitsofe.
>
> Can you see whether the following patch makes any difference?
>
> Thanks.
>
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index a05e367..f0e6e41 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
>  * No need for fancy synchronizations.
>  */
> blk_rq_set_deadline(req, jiffies);
> -   mod_timer(&req->q->timeout, 0);
> +   kblockd_schedule_work(&req->q->timeout_work);
> } else {
> if (blk_mark_rq_complete(req))
> return;

Just out of interest, does the fact that an abort occurs mean that the
hardware is somehow broken or badly behaved?

-- 
Sitsofe | http://sucs.org/~sits/