Re: Multi-Actuator SAS HDD First Look
On Fri, Apr 06, 2018 at 08:24:18AM +0200, Hannes Reinecke wrote:
> Ah. Far better.
> What about delegating FORMAT UNIT to the control LUN, and not
> implementing it for the individual disk LUNs?
> That would make an even stronger case for having a control LUN;
> with that there wouldn't be any problem with having to synchronize
> across LUNs etc.

It sounds to me like NVMe might be a much better model for this drive than SCSI, btw :)
Re: Multi-Actuator SAS HDD First Look
On Thu, 5 Apr 2018 17:43:46 -0600 Tim Walker wrote:
> On Tue, Apr 3, 2018 at 1:46 AM, Christoph Hellwig wrote:
> > On Sat, Mar 31, 2018 at 01:03:46PM +0200, Hannes Reinecke wrote:
> >> Actually I would propose to have a 'management' LUN at LUN0, who
> >> could handle all the device-wide commands (eg things like START
> >> STOP UNIT, firmware update, or even SMART commands), and ignoring
> >> them for the remaining LUNs.
> >
> > That is in fact the only workable option at all. Everything else
> > completely breaks the scsi architecture.
>
> Here's an update: Seagate will eliminate the inter-LU actions from
> FORMAT UNIT and SANITIZE. Probably SANITIZE will be per-LUN, but
> FORMAT UNIT is trickier due to internal drive architecture, and how
> FORMAT UNIT initializes on-disk metadata. Likely it will require some
> sort of synchronization across LUNs, such as the command being sent to
> both LUNs sequentially or something similar. We are also considering
> not supporting FORMAT UNIT at all - would anybody object? Any other
> suggestions?

Ah. Far better.
What about delegating FORMAT UNIT to the control LUN, and not
implementing it for the individual disk LUNs?
That would make an even stronger case for having a control LUN;
with that there wouldn't be any problem with having to synchronize
across LUNs etc.

Cheers,
Hannes
Re: 4.15.14 crash with iscsi target and dvd
On Thu, 2018-04-05 at 22:06 -0400, Wakko Warner wrote:
> I know now why scsi_print_command isn't doing anything. cmd->cmnd is null.
> I added a dev_printk in scsi_print_command where the 2 if statements return.
> Logs:
> [   29.866415] sr 3:0:0:0: cmd->cmnd is NULL

That's something that should never happen. As one can see in scsi_setup_scsi_cmnd() and scsi_setup_fs_cmnd(), both functions initialize that pointer. Since I have not yet been able to reproduce what you reported myself, would it be possible for you to bisect this issue? You will need to follow a procedure like the following (see also https://git-scm.com/docs/git-bisect):

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git bisect start
git bisect bad v4.10
git bisect good v4.9

and then build the kernel, install it, boot it and test it. Depending on the result, run either "git bisect bad" or "git bisect good" and keep going until git bisect comes to a conclusion. This can take an hour or more.

Bart.
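The manual good/bad loop described above can also be automated with `git bisect run`. Here is a self-contained toy demo of the procedure using a throwaway repository, where a trivial grep stands in for the kernel build-install-boot-test step (all names and the "BUG" marker are illustrative, not part of the real workflow):

```shell
# Demo of git bisect on a disposable repo: 10 commits, commit 7
# introduces the "regression" (a line containing BUG).
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email demo@example.com
git config user.name demo

for i in 1 2 3 4 5 6 7 8 9 10; do
    echo "change $i" >> file.txt
    if [ "$i" -eq 7 ]; then
        echo "BUG" >> file.txt    # the regression lands here
    fi
    git add file.txt
    git commit -qm "commit $i"
done

# HEAD is known bad, the root commit is known good.
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"

# 'git bisect run' repeats the checkout+test cycle automatically.
# Exit 0 means "good", non-zero means "bad" -- here the test is a grep
# instead of building and booting a kernel.
result=$(git bisect run sh -c '! grep -q BUG file.txt' 2>&1)
echo "$result" | grep "first bad commit" | head -n 1
git bisect reset >/dev/null
```

With a real kernel tree the test command would be replaced by a script that builds, boots, and exercises the iscsi target, but the bisect mechanics are identical.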
Re: 4.15.14 crash with iscsi target and dvd
Wakko Warner wrote:
> Bart Van Assche wrote:
> > On Sun, 2018-04-01 at 14:27 -0400, Wakko Warner wrote:
> > > Wakko Warner wrote:
> > > > Wakko Warner wrote:
> > > > > I tested 4.14.32 last night with the same oops. 4.9.91 works fine.
> > > > > From the initiator, if I do cat /dev/sr1 > /dev/null it works. If I mount
> > > > > /dev/sr1 and then do find -type f | xargs cat > /dev/null the target
> > > > > crashes. I'm using the builtin iscsi target with pscsi. I can burn from
> > > > > the initiator without problems. I'll test other kernels between 4.9 and
> > > > > 4.14.
> > > >
> > > > So I've tested 4.x.y where x is one of 10 11 12 14 15 and y is the latest
> > > > patch (except for 4.15 which was 1 behind).
> > > > Each of these kernels crashes within seconds of, or immediately after,
> > > > doing find -type f | xargs cat > /dev/null from the initiator.
> > >
> > > I tried 4.10.0. It doesn't completely lock up the system, but the device
> > > that was used hangs. So from the initiator it's /dev/sr1 and from the
> > > target it's /dev/sr0. Attempting to read /dev/sr0 after the oops causes the
> > > process to hang in D state.
> >
> > Hello Wakko,
> >
> > Thank you for having narrowed this down further. I think that you encountered
> > a regression either in the block layer core or in the SCSI core. Unfortunately
> > the number of changes between kernel versions v4.9 and v4.10 in these two
> > subsystems is huge. I see two possible ways forward:
> > - Either you perform a bisect to identify the patch that introduced this
> >   regression. However, I'm not sure whether you are familiar with the bisect
> >   process.
> > - Or you identify the command that triggers this crash such that others
> >   can reproduce this issue without needing access to your setup.
> >
> > How about reproducing this crash with the below patch applied on top of
> > kernel v4.15.x? The additional output sent by this patch to the system log
> > should allow us to reproduce this issue by submitting the same SCSI command
> > with sg_raw.
>
> Ok, so I tried this, but scsi_print_command doesn't print anything. I added
> a check for !rq and the same thing that blk_rq_nr_phys_segments does in an
> if statement above this, thinking it might have crashed during WARN_ON_ONCE.
> It still didn't print anything. My printk shows this:
> [   36.263193] sr 3:0:0:0: cmd->request->nr_phys_segments is 0
>
> I also had scsi_print_command in the same if block, which again didn't print
> anything. Is there some debug option I need to turn on to make it print? I
> tried looking through the code for this and following some of the function
> calls but didn't see any config options.

I know now why scsi_print_command isn't doing anything. cmd->cmnd is null.
I added a dev_printk in scsi_print_command where the 2 if statements return.
Logs:
[   29.866415] sr 3:0:0:0: cmd->cmnd is NULL

> > Subject: [PATCH] Report commands with no physical segments in the system log
> >
> > ---
> >  drivers/scsi/scsi_lib.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 6b6a6705f6e5..74a39db57d49 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1093,8 +1093,10 @@ int scsi_init_io(struct scsi_cmnd *cmd)
> >  	bool is_mq = (rq->mq_ctx != NULL);
> >  	int error = BLKPREP_KILL;
> >
> > -	if (WARN_ON_ONCE(!blk_rq_nr_phys_segments(rq)))
> > +	if (WARN_ON_ONCE(!blk_rq_nr_phys_segments(rq))) {
> > +		scsi_print_command(cmd);
> >  		goto err_exit;
> > +	}
> >
> >  	error = scsi_init_sgtable(rq, &cmd->sdb);
> >  	if (error)

--
Microsoft has beaten Volkswagen's world record. Volkswagen only created 22 million bugs.
Re: 4.15.14 crash with iscsi target and dvd
Bart Van Assche wrote:
> On Sun, 2018-04-01 at 14:27 -0400, Wakko Warner wrote:
> > Wakko Warner wrote:
> > > Wakko Warner wrote:
> > > > I tested 4.14.32 last night with the same oops. 4.9.91 works fine.
> > > > From the initiator, if I do cat /dev/sr1 > /dev/null it works. If I mount
> > > > /dev/sr1 and then do find -type f | xargs cat > /dev/null the target
> > > > crashes. I'm using the builtin iscsi target with pscsi. I can burn from
> > > > the initiator without problems. I'll test other kernels between 4.9 and
> > > > 4.14.
> > >
> > > So I've tested 4.x.y where x is one of 10 11 12 14 15 and y is the latest
> > > patch (except for 4.15 which was 1 behind).
> > > Each of these kernels crashes within seconds of, or immediately after,
> > > doing find -type f | xargs cat > /dev/null from the initiator.
> >
> > I tried 4.10.0. It doesn't completely lock up the system, but the device
> > that was used hangs. So from the initiator it's /dev/sr1 and from the
> > target it's /dev/sr0. Attempting to read /dev/sr0 after the oops causes the
> > process to hang in D state.
>
> Hello Wakko,
>
> Thank you for having narrowed this down further. I think that you encountered
> a regression either in the block layer core or in the SCSI core. Unfortunately
> the number of changes between kernel versions v4.9 and v4.10 in these two
> subsystems is huge. I see two possible ways forward:
> - Either you perform a bisect to identify the patch that introduced this
>   regression. However, I'm not sure whether you are familiar with the bisect
>   process.
> - Or you identify the command that triggers this crash such that others
>   can reproduce this issue without needing access to your setup.
>
> How about reproducing this crash with the below patch applied on top of
> kernel v4.15.x? The additional output sent by this patch to the system log
> should allow us to reproduce this issue by submitting the same SCSI command
> with sg_raw.

Ok, so I tried this, but scsi_print_command doesn't print anything. I added
a check for !rq and the same thing that blk_rq_nr_phys_segments does in an
if statement above this, thinking it might have crashed during WARN_ON_ONCE.
It still didn't print anything. My printk shows this:
[   36.263193] sr 3:0:0:0: cmd->request->nr_phys_segments is 0

I also had scsi_print_command in the same if block, which again didn't print
anything. Is there some debug option I need to turn on to make it print? I
tried looking through the code for this and following some of the function
calls but didn't see any config options.

> Subject: [PATCH] Report commands with no physical segments in the system log
>
> ---
>  drivers/scsi/scsi_lib.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 6b6a6705f6e5..74a39db57d49 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1093,8 +1093,10 @@ int scsi_init_io(struct scsi_cmnd *cmd)
>  	bool is_mq = (rq->mq_ctx != NULL);
>  	int error = BLKPREP_KILL;
>
> -	if (WARN_ON_ONCE(!blk_rq_nr_phys_segments(rq)))
> +	if (WARN_ON_ONCE(!blk_rq_nr_phys_segments(rq))) {
> +		scsi_print_command(cmd);
>  		goto err_exit;
> +	}
>
>  	error = scsi_init_sgtable(rq, &cmd->sdb);
>  	if (error)

--
Microsoft has beaten Volkswagen's world record. Volkswagen only created 22 million bugs.
Re: Multi-Actuator SAS HDD First Look
On 2018-04-05 07:43 PM, Tim Walker wrote:
> On Tue, Apr 3, 2018 at 1:46 AM, Christoph Hellwig wrote:
> > On Sat, Mar 31, 2018 at 01:03:46PM +0200, Hannes Reinecke wrote:
> > > Actually I would propose to have a 'management' LUN at LUN0, who
> > > could handle all the device-wide commands (eg things like START
> > > STOP UNIT, firmware update, or even SMART commands), and ignoring
> > > them for the remaining LUNs.
> >
> > That is in fact the only workable option at all. Everything else
> > completely breaks the scsi architecture.
>
> Here's an update: Seagate will eliminate the inter-LU actions from
> FORMAT UNIT and SANITIZE. Probably SANITIZE will be per-LUN, but
> FORMAT UNIT is trickier due to internal drive architecture, and how
> FORMAT UNIT initializes on-disk metadata. Likely it will require some
> sort of synchronization across LUNs, such as the command being sent to
> both LUNs sequentially or something similar. We are also considering
> not supporting FORMAT UNIT at all - would anybody object? Any other
> suggestions?

Good, that is progress. [But you still only have one spindle.]

If Protection Information (PI) or changing the logical block size between 512 and 4096 bytes per block are options, then you need FORMAT UNIT for that. But does it need to take 900 minutes like one I got recently from S..? Couldn't the actual reformatting of a track be deferred until the first block written to that track?

Doug Gilbert
Re: bcache and hibernation
On Thu, Apr 5, 2018 at 12:51 PM, Nikolaus Rath wrote:
> Hi Michael,
>
> Could you explain why this isn't a problem with writethrough? It seems
> to me that the trouble happens when the hibernation image is *read*, so
> why does it matter what kind of write caching is used?

With writethrough you can set up your loader to read it directly from the backing device-- e.g. you don't need the cache, and there are at least some valid configurations; with writeback some of the extents may be on the cache dev so...

That said, it's not really great to put swap/hibernate on a cache device... the workloads don't usually benefit much from tiering (since they tend to be write-once-read-never or write-once-read-once).

>> I am unaware of a mechanism to prohibit this in the kernel-- to say that
>> a given type of block provider can't be involved in a resume operation.
>> Most documentation for hibernation explicitly cautions about the btrfs
>> situation, but use of bcache is less common and as a result generally
>> isn't covered.
>
> Could you maybe add a warning to Documentation/bcache.txt? I think this
> would have saved me.

Yah, I can look at that.

> Best,
> -Nikolaus

Mike
Re: Multi-Actuator SAS HDD First Look
On Tue, Apr 3, 2018 at 1:46 AM, Christoph Hellwig wrote:
> On Sat, Mar 31, 2018 at 01:03:46PM +0200, Hannes Reinecke wrote:
>> Actually I would propose to have a 'management' LUN at LUN0, who could
>> handle all the device-wide commands (eg things like START STOP UNIT,
>> firmware update, or even SMART commands), and ignoring them for the
>> remaining LUNs.
>
> That is in fact the only workable option at all. Everything else
> completely breaks the scsi architecture.

Here's an update: Seagate will eliminate the inter-LU actions from FORMAT UNIT and SANITIZE. Probably SANITIZE will be per-LUN, but FORMAT UNIT is trickier due to internal drive architecture, and how FORMAT UNIT initializes on-disk metadata. Likely it will require some sort of synchronization across LUNs, such as the command being sent to both LUNs sequentially or something similar. We are also considering not supporting FORMAT UNIT at all - would anybody object? Any other suggestions?

--
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770
Re: [PATCH v2 08/11] block: sed-opal: ioctl for writing to shadow mbr
On Thu, Mar 29, 2018 at 08:27:30PM +0200, catch...@ghostav.ddnss.de wrote:
> On Thu, Mar 29, 2018 at 11:16:42AM -0600, Scott Bauer wrote:
> > Yeah, having to authenticate to write the MBR is a real bummer. Theoretically
> > you could dd the pw struct + the shadow MBR into sysfs. But that's
> > a pretty disgusting hack just to use sysfs. The other method I thought of
> > was to authenticate via ioctl then write via sysfs. We already save the PW
> > in-kernel for unlocks, so perhaps we can re-use the save-for-unlock to
> > do shadow MBR writes via sysfs?
> >
> > Re-using an already exposed ioctl for another purpose seems somewhat dangerous?
> > In the sense that what if the user wants to write the smbr but doesn't want to
> > unlock on suspends, or does not want their PW hanging around in the kernel.
>
> Well. If we would force the user to a two-step interaction, why not stay
> completely in sysfs? So instead of using the save-for-unlock ioctl, we
> could export each security provider (AdminSP, UserSPX, ...) as a sysfs

The problem with this is Single User Mode, where you can assign users to locking ranges. There would have to be a lot of dynamic changes to sysfs as users get added/removed, or added to LRs, etc. It seems like we're trying to mold something that already works fine into something that doesn't really work as we dig into the details.

> directory with appropriate files (e.g. mbr for AdminSP) as well as an
> 'unlock' file to store a user's password for the specific locking space
> and a 'lock' file to remove the stored password on write to it.
> Of course, while this will prevent reuse of the ioctl and
> stays within the same configuration method, the PW will still hang
> around in the kernel between 'lock' and 'unlock'.
>
> Another idea I just came across while writing this down:
> Instead of storing/releasing the password permanently with the 'unlock' and
> 'lock' files, those may be used to start/stop an authenticated session.
> To make it more clear what I mean: Each ioctl that requires
> authentication has a similar pattern:
> discovery0, start_session, <command>, end_session
> Instead of having the combination determined by the ioctl, the 'unlock'
> would do discovery0 and start_session while the 'lock' would do the
> end_session. The user is free to issue further commands with the
> appropriate writes/reads to other files of the sysfs directory.
> While this removes the requirement to store the key within kernel space,
> the open session handle may be used by everybody with permissions for
> read/write access to the sysfs directory files. So this is not optimal,
> as not only the user who provided the password will finally be able to use it.

I generally like the idea of being able to run arbitrary opal commands, but: that's probably not going to work, for the final reason you outlined. Even though it's root-only access (to sysfs), we're breaking the authentication lower down by essentially allowing any opal command to be run if you've somehow become root.

The other issue with this is the session timeout in opal. When we dispatch the commands in-kernel we're hammering them out 1-by-1. If the user needs to do an activatelsp, setuplr, etc., they do that with a new session. If someone starts the session and it times out, it may be hard to figure out how to not get an SP_BUSY back from the controller. I've in the past just had to wipe my damn fw to get out of SP_BUSYs, but that could be due to the early implementations I was dealing with.

> I already did some basic work to split off the session information from
> the opal_dev struct (initially to reduce the memory footprint of devices with
> currently no active opal interaction). So I think I could get a
> proof-of-concept of this approach within the next one or two weeks if
> there are no objections to the base idea.

Sorry to come back a week later, but if you do have anything it would be at least interesting to see. I would still prefer the ioctl route, but will review and test any implementation people deem acceptable.
Re: bcache and hibernation
Hi Michael,

On Apr 05 2018, Michael Lyle wrote:
> On 04/05/2018 01:51 AM, Nikolaus Rath wrote:
>> Is there a way to prevent this from happening? Could eg the kernel
>> detect that the swap device is (indirectly) on bcache and refuse to
>> hibernate? Or is there a way to do a "true" read-only mount of a
>> bcache volume so that one can safely resume from it?
>
> I think you're correct. If you're using bcache in writeback mode, it is
> not safe to hibernate there, because some of the blocks involved in the
> resume can end up in cache (and dependency issues, like you mention).

Could you explain why this isn't a problem with writethrough? It seems to me that the trouble happens when the hibernation image is *read*, so why does it matter what kind of write caching is used?

> I am unaware of a mechanism to prohibit this in the kernel-- to say that
> a given type of block provider can't be involved in a resume operation.
> Most documentation for hibernation explicitly cautions about the btrfs
> situation, but use of bcache is less common and as a result generally
> isn't covered.

Could you maybe add a warning to Documentation/bcache.txt? I think this would have saved me.

Best,
-Nikolaus

--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«
Re: bcache and hibernation
Hi Nikolaus (and everyone else),

Sorry I've been slow in responding. I probably need to step down as bcache maintainer because so many other things have competed for my time lately and I've fallen behind on both patches and the mailing list.

On 04/05/2018 01:51 AM, Nikolaus Rath wrote:
> Is there a way to prevent this from happening? Could eg the kernel detect
> that the swap device is (indirectly) on bcache and refuse to hibernate? Or
> is there a way to do a "true" read-only mount of a bcache volume so that one
> can safely resume from it?

I think you're correct. If you're using bcache in writeback mode, it is not safe to hibernate there, because some of the blocks involved in the resume can end up in cache (and dependency issues, like you mention). There are similar cautions/problems with btrfs.

I am unaware of a mechanism to prohibit this in the kernel-- to say that a given type of block provider can't be involved in a resume operation. Most documentation for hibernation explicitly cautions about the btrfs situation, but use of bcache is less common and as a result generally isn't covered.

> Best,
> -Nikolaus

Mike
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On 04/05/2018 07:39 PM, Christian Borntraeger wrote:
> On 04/05/2018 06:11 PM, Ming Lei wrote:
>>> Could you please apply the following patch and provide the dmesg boot log?
>>
>> And please post out the 'lscpu' log together from the test machine too.
>
> attached.
>
> As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller.
> We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only
> 8 cores == 16 threads.

To say it differently: the whole system has up to 141 CPUs, but this LPAR has only 8 CPUs assigned. So we have 16 CPUs (SMT), but this could become up to 282 if I would do CPU hotplug. (But this is not used here.)
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On 04/05/2018 06:11 PM, Ming Lei wrote:
>> Could you please apply the following patch and provide the dmesg boot log?
>
> And please post out the 'lscpu' log together from the test machine too.

attached.

As I said before this seems to go away with CONFIG_NR_CPUS=64 or smaller. We have 282 nr_cpu_ids here (max 141 CPUs on that z13 with SMT2) but only 8 cores == 16 threads.

[Attachment: dmesg.gz (application/gzip)]

Architecture:          s390x
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Big Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s) per book:    3
Book(s) per drawer:    2
Drawer(s):             4
NUMA node(s):          1
Vendor ID:             IBM/S390
Machine type:          2964
CPU dynamic MHz:       5000
CPU static MHz:        5000
BogoMIPS:              20325.00
Hypervisor:            PR/SM
Hypervisor vendor:     IBM
Virtualization type:   full
Dispatching mode:      horizontal
L1d cache:             128K
L1i cache:             96K
L2d cache:             2048K
L2i cache:             2048K
L3 cache:              65536K
L4 cache:              491520K
NUMA node0 CPU(s):     0-15
Flags:                 esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx sie

CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2d:L2i ONLINE CONFIGURED POLARIZATION ADDRESS
0   0    0      0    0      0    0:0:0:0         yes    yes        horizontal   0
1   0    0      0    0      0    1:1:1:1         yes    yes        horizontal   1
2   0    0      0    0      1    2:2:2:2         yes    yes        horizontal   2
3   0    0      0    0      1    3:3:3:3         yes    yes        horizontal   3
4   0    0      0    0      2    4:4:4:4         yes    yes        horizontal   4
5   0    0      0    0      2    5:5:5:5         yes    yes        horizontal   5
6   0    0      0    0      3    6:6:6:6         yes    yes        horizontal   6
7   0    0      0    0      3    7:7:7:7         yes    yes        horizontal   7
8   0    0      0    1      4    8:8:8:8         yes    yes        horizontal   8
9   0    0      0    1      4    9:9:9:9         yes    yes        horizontal   9
10  0    0      0    1      5    10:10:10:10     yes    yes        horizontal   10
11  0    0      0    1      5    11:11:11:11     yes    yes        horizontal   11
12  0    0      0    1      6    12:12:12:12     yes    yes        horizontal   12
13  0    0      0    1      6    13:13:13:13     yes    yes        horizontal   13
14  0    0      0    1      7    14:14:14:14     yes    yes        horizontal   14
15  0    0      0    1      7    15:15:15:15     yes    yes        horizontal   15
Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
On 04/04/2018 09:22 PM, Sagi Grimberg wrote:
>> On 03/30/2018 12:32 PM, Yi Zhang wrote:
>> Hello
>>
>> I got this kernel BUG on 4.16.0-rc7 during my NVMeoF RDMA testing; here is
>> the reproducer and log, let me know if you need more info, thanks.
>>
>> Reproducer:
>> 1. setup target
>> #nvmetcli restore /etc/rdma.json
>> 2. connect target on host
>> #nvme connect-all -t rdma -a $IP -s 4420
>> 3. do fio background on host
>> #fio -filename=/dev/nvme0n1 -iodepth=1 -thread -rw=randwrite -ioengine=psync
>> -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10
>> -bs_unaligned -runtime=180 -size=-group_reporting -name=mytest -numjobs=60 &
>> 4. offline cpu on host
>> #echo 0 > /sys/devices/system/cpu/cpu1/online
>> #echo 0 > /sys/devices/system/cpu/cpu2/online
>> #echo 0 > /sys/devices/system/cpu/cpu3/online
>> 5. clear target
>> #nvmetcli clear
>> 6. restore target
>> #nvmetcli restore /etc/rdma.json
>> 7. check console log on host
>
> Hi Yi,
>
> Does this happen with this applied?
>
> --
> diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
> index 996167f1de18..b89da55e8aaa 100644
> --- a/block/blk-mq-rdma.c
> +++ b/block/blk-mq-rdma.c
> @@ -35,6 +35,8 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
>         const struct cpumask *mask;
>         unsigned int queue, cpu;
>
> +       goto fallback;
> +
>         for (queue = 0; queue < set->nr_hw_queues; queue++) {
>                 mask = ib_get_vector_affinity(dev, first_vec + queue);
>                 if (!mask)
> --

Hi Sagi

Still can reproduce this issue with the change:

[  133.469908] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
[  133.554025] nvme nvme0: creating 40 I/O queues.
[  133.947648] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[  138.740870] smpboot: CPU 1 is now offline
[  138.778382] IRQ 37: no longer affine to CPU2
[  138.783153] IRQ 54: no longer affine to CPU2
[  138.787919] IRQ 70: no longer affine to CPU2
[  138.792687] IRQ 98: no longer affine to CPU2
[  138.797458] IRQ 140: no longer affine to CPU2
[  138.802319] IRQ 141: no longer affine to CPU2
[  138.807189] IRQ 166: no longer affine to CPU2
[  138.813622] smpboot: CPU 2 is now offline
[  139.043610] smpboot: CPU 3 is now offline
[  141.587283] print_req_error: operation not supported error, dev nvme0n1, sector 494622136
[  141.587303] print_req_error: operation not supported error, dev nvme0n1, sector 219643648
[  141.587304] print_req_error: operation not supported error, dev nvme0n1, sector 279256456
[  141.587306] print_req_error: operation not supported error, dev nvme0n1, sector 1208024
[  141.587322] print_req_error: operation not supported error, dev nvme0n1, sector 100575248
[  141.587335] print_req_error: operation not supported error, dev nvme0n1, sector 111717456
[  141.587346] print_req_error: operation not supported error, dev nvme0n1, sector 171939296
[  141.587348] print_req_error: operation not supported error, dev nvme0n1, sector 476420528
[  141.587353] print_req_error: operation not supported error, dev nvme0n1, sector 371566696
[  141.587356] print_req_error: operation not supported error, dev nvme0n1, sector 161758408
[  141.587463] Buffer I/O error on dev nvme0n1, logical block 54193430, lost async page write
[  141.587472] Buffer I/O error on dev nvme0n1, logical block 54193431, lost async page write
[  141.587478] Buffer I/O error on dev nvme0n1, logical block 54193432, lost async page write
[  141.587483] Buffer I/O error on dev nvme0n1, logical block 54193433, lost async page write
[  141.587532] Buffer I/O error on dev nvme0n1, logical block 54193476, lost async page write
[  141.587534] Buffer I/O error on dev nvme0n1, logical block 54193477, lost async page write
[  141.587536] Buffer I/O error on dev nvme0n1, logical block 54193478, lost async page write
[  141.587538] Buffer I/O error on dev nvme0n1, logical block 54193479, lost async page write
[  141.587540] Buffer I/O error on dev nvme0n1, logical block 54193480, lost async page write
[  141.587542] Buffer I/O error on dev nvme0n1, logical block 54193481, lost async page write
[  142.573522] nvme nvme0: Reconnecting in 10 seconds...
[  146.587532] buffer_io_error: 3743628 callbacks suppressed
[  146.587534] Buffer I/O error on dev nvme0n1, logical block 64832757, lost async page write
[  146.602837] Buffer I/O error on dev nvme0n1, logical block 64832758, lost async page write
[  146.612091] Buffer I/O error on dev nvme0n1, logical block 64832759, lost async page write
[  146.621346] Buffer I/O error on dev nvme0n1, logical block 64832760, lost async page write
[  146.630615] print_req_error: 556822 callbacks suppressed
[  146.630616] print_req_error: I/O error, dev nvme0n1, sector 518662176
[  146.643776] Buffer I/O error on dev nvme0n1, logical block 64832772, lost async page write
[  146.653030] Buffer I/O error on dev nvme0n1, logical block 64832773, lost async page write
[  146.662282] Buffer I/O error on dev nvme0n1, logical block 64832774, lost async page
Re: BUG: KASAN: use-after-free in bt_for_each+0x1ea/0x29f
On Wed, 2018-04-04 at 19:26 -0600, Jens Axboe wrote:
> Leaving the whole trace here, but I'm having a hard time making sense of it.
> It complains about a use-after-free in the inflight iteration, which is only
> working on the queue, request, and on-stack mi data. None of these would be
> freed. The below trace on allocation and free indicates a bio, but that isn't
> used in the inflight path at all. Is it possible that kasan gets confused here?
> Not sure what to make of it so far.

Hello Jens,

In the many block layer tests I ran with KASAN enabled I have never seen anything like this, nor have I seen anything that made me wonder about the reliability of KASAN. Maybe some code outside the block layer core corrupted a request queue data structure and triggered this weird report?

Bart.
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On Fri, Apr 06, 2018 at 12:05:03AM +0800, Ming Lei wrote:
> On Wed, Apr 04, 2018 at 10:18:13AM +0200, Christian Borntraeger wrote:
>> On 03/30/2018 04:53 AM, Ming Lei wrote:
>>> On Thu, Mar 29, 2018 at 01:49:29PM +0200, Christian Borntraeger wrote:
>>>> On 03/29/2018 01:43 PM, Ming Lei wrote:
>>>>> On Thu, Mar 29, 2018 at 12:49:55PM +0200, Christian Borntraeger wrote:
>>>>>> On 03/29/2018 12:48 PM, Ming Lei wrote:
>>>>>>> On Thu, Mar 29, 2018 at 12:10:11PM +0200, Christian Borntraeger wrote:
>>>>>>>> On 03/29/2018 11:40 AM, Ming Lei wrote:
>>>>>>>>> On Thu, Mar 29, 2018 at 11:09:08AM +0200, Christian Borntraeger wrote:
>>>>>>>>>> On 03/29/2018 09:23 AM, Christian Borntraeger wrote:
>>>>>>>>>>> On 03/29/2018 04:00 AM, Ming Lei wrote:
>>>>>>>>>>>> On Wed, Mar 28, 2018 at 05:36:53PM +0200, Christian Borntraeger wrote:
>>>>>>>>>>>>> On 03/28/2018 05:26 PM, Ming Lei wrote:
>>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 09:45:10AM +0200, Christian Borntraeger wrote:
>>>>>>>>>>>>>>> FWIW, this patch does not fix the issue for me:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ostname=? addr=? terminal=? res=success'
>>>>>>>>>>>>>>> [   21.454961] WARNING: CPU: 3 PID: 1882 at block/blk-mq.c:1410 __blk_mq_delay_run_hw_queue+0xbe/0xd8
>>>>>>>>>>>>>>> [   21.454968] Modules linked in: scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_mirror dm_region_hash dm_log dm_multipath dm_mod autofs4
>>>>>>>>>>>>>>> [   21.454984] CPU: 3 PID: 1882 Comm: dasdconf.sh Not tainted 4.16.0-rc7+ #26
>>>>>>>>>>>>>>> [   21.454987] Hardware name: IBM 2964 NC9 704 (LPAR)
>>>>>>>>>>>>>>> [   21.454990] Krnl PSW : c0131ea3 3ea2f7bf (__blk_mq_delay_run_hw_queue+0xbe/0xd8)
>>>>>>>>>>>>>>> [   21.454996]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>>>>>>>>>>>>>>> [   21.455005] Krnl GPRS: 013abb69a000 013a 013ac6c0dc00 0001
>>>>>>>>>>>>>>> [   21.455008]            013abb69a710 013a 0001b691fd98
>>>>>>>>>>>>>>> [   21.455011]            0001b691fd98 013ace4775c8 0001
>>>>>>>>>>>>>>> [   21.455014]            013ac6c0dc00 00b47238 0001b691fc08 0001b691fbd0
>>>>>>>>>>>>>>> [   21.455032] Krnl Code: 0069c596: ebaff0a4   lmg %r10,%r15,160(%r15)
>>>>>>>>>>>>>>>                           0069c59c: c0f47a5e   brcl 15,68ba58
>>>>>>>>>>>>>>>                          #0069c5a2: a7f40001   brc 15,69c5a4
>>>>>>>>>>>>>>>                          >0069c5a6: e340f0c4   lg %r4,192(%r15)
>>>>>>>>>>>>>>>                           0069c5ac: ebaff0a4   lmg %r10,%r15,160(%r15)
>>>>>>>>>>>>>>>                           0069c5b2: 07f4       bcr 15,%r4
>>>>>>>>>>>>>>>                           0069c5b4: c0e5feea   brasl %r14,69c388
>>>>>>>>>>>>>>>                           0069c5ba: a7f4fff6   brc 15,69c5a6
>>>>>>>>>>>>>>> [   21.455067] Call Trace:
>>>>>>>>>>>>>>> [   21.455072] ([<0001b691fd98>] 0x1b691fd98)
>>>>>>>>>>>>>>> [   21.455079]  [<0069c692>] blk_mq_run_hw_queue+0xba/0x100
>>>>>>>>>>>>>>> [   21.455083]  [<0069c740>] blk_mq_run_hw_queues+0x68/0x88
>>>>>>>>>>>>>>> [   21.455089]  [<0069b956>] __blk_mq_complete_request+0x11e/0x1d8
>>>>>>>>>>>>>>> [   21.455091]  [<0069ba9c>] blk_mq_complete_request+0x8c/0xc8
>>>>>>>>>>>>>>> [   21.455103]  [<008aa250>] dasd_block_tasklet+0x158/0x490
>>>>>>>>>>>>>>> [   21.455110]  [<0014c742>] tasklet_hi_action+0x92/0x120
>>>>>>>>>>>>>>> [   21.455118]  [<00a7cfc0>] __do_softirq+0x120/0x348
>>>>>>>>>>>>>>> [   21.455122]  [<0014c212>] irq_exit+0xba/0xd0
>>>>>>>>>>>>>>> [   21.455130]  [<0010bf92>] do_IRQ+0x8a/0xb8
>>>>>>>>>>>>>>> [   21.455133]  [<00a7c298>] io_int_handler+0x130/0x298
>>>>>>>>>>>>>>> [   21.455136] Last Breaking-Event-Address:
>>>>>>>>>>>>>>> [   21.455138]  [<0069c5a2>] __blk_mq_delay_run_hw_queue+0xba/0xd8
>>>>>>>>>>>>>>> [   21.455140] ---[ end trace be43f99a5d1e553e ]---
>>>>>>>>>>>>>>> [   21.5100
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On Wed, Apr 04, 2018 at 10:18:13AM +0200, Christian Borntraeger wrote: > > > On 03/30/2018 04:53 AM, Ming Lei wrote: > > On Thu, Mar 29, 2018 at 01:49:29PM +0200, Christian Borntraeger wrote: > >> > >> > >> On 03/29/2018 01:43 PM, Ming Lei wrote: > >>> On Thu, Mar 29, 2018 at 12:49:55PM +0200, Christian Borntraeger wrote: > > > On 03/29/2018 12:48 PM, Ming Lei wrote: > > On Thu, Mar 29, 2018 at 12:10:11PM +0200, Christian Borntraeger wrote: > >> > >> > >> On 03/29/2018 11:40 AM, Ming Lei wrote: > >>> On Thu, Mar 29, 2018 at 11:09:08AM +0200, Christian Borntraeger wrote: > > > On 03/29/2018 09:23 AM, Christian Borntraeger wrote: > > > > > > On 03/29/2018 04:00 AM, Ming Lei wrote: > >> On Wed, Mar 28, 2018 at 05:36:53PM +0200, Christian Borntraeger > >> wrote: > >>> > >>> > >>> On 03/28/2018 05:26 PM, Ming Lei wrote: > Hi Christian, > > On Wed, Mar 28, 2018 at 09:45:10AM +0200, Christian Borntraeger > wrote: > > FWIW, this patch does not fix the issue for me: > > > > ostname=? addr=? terminal=? 
res=success' > > [ 21.454961] WARNING: CPU: 3 PID: 1882 at block/blk-mq.c:1410 > > __blk_mq_delay_run_hw_queue+0xbe/0xd8 > > [ 21.454968] Modules linked in: scsi_dh_rdac scsi_dh_emc > > scsi_dh_alua dm_mirror dm_region_hash dm_log dm_multipath > > dm_mod autofs4 > > [ 21.454984] CPU: 3 PID: 1882 Comm: dasdconf.sh Not tainted > > 4.16.0-rc7+ #26 > > [ 21.454987] Hardware name: IBM 2964 NC9 704 (LPAR) > > [ 21.454990] Krnl PSW : c0131ea3 3ea2f7bf > > (__blk_mq_delay_run_hw_queue+0xbe/0xd8) > > [ 21.454996]R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 > > AS:3 CC:0 PM:0 RI:0 EA:3 > > [ 21.455005] Krnl GPRS: 013abb69a000 013a > > 013ac6c0dc00 0001 > > [ 21.455008] 013abb69a710 > > 013a 0001b691fd98 > > [ 21.455011]0001b691fd98 013ace4775c8 > > 0001 > > [ 21.455014]013ac6c0dc00 00b47238 > > 0001b691fc08 0001b691fbd0 > > [ 21.455032] Krnl Code: 0069c596: ebaff0a4 > > lmg %r10,%r15,160(%r15) > > 0069c59c: c0f47a5e > > brcl15,68ba58 > > #0069c5a2: a7f40001 > > brc 15,69c5a4 > > >0069c5a6: e340f0c4 > > lg %r4,192(%r15) > > 0069c5ac: ebaff0a4 > > lmg %r10,%r15,160(%r15) > > 0069c5b2: 07f4 > > bcr 15,%r4 > > 0069c5b4: c0e5feea > > brasl %r14,69c388 > > 0069c5ba: a7f4fff6 > > brc 15,69c5a6 > > [ 21.455067] Call Trace: > > [ 21.455072] ([<0001b691fd98>] 0x1b691fd98) > > [ 21.455079] [<0069c692>] > > blk_mq_run_hw_queue+0xba/0x100 > > [ 21.455083] [<0069c740>] > > blk_mq_run_hw_queues+0x68/0x88 > > [ 21.455089] [<0069b956>] > > __blk_mq_complete_request+0x11e/0x1d8 > > [ 21.455091] [<0069ba9c>] > > blk_mq_complete_request+0x8c/0xc8 > > [ 21.455103] [<008aa250>] > > dasd_block_tasklet+0x158/0x490 > > [ 21.455110] [<0014c742>] > > tasklet_hi_action+0x92/0x120 > > [ 21.455118] [<00a7cfc0>] __do_softirq+0x120/0x348 > > [ 21.455122] [<0014c212>] irq_exit+0xba/0xd0 > > [ 21.455130] [<0010bf92>] do_IRQ+0x8a/0xb8 > > [ 21.455133] [<00a7c298>] io_int_handler+0x130/0x298 > > [ 21.455136] Last Breaking-Event-Address: > > [ 21.455138] [<0069c5a2>] > > __blk_mq_delay_run_hw_queue+0xba/0xd8 > > [ 
21.455140] ---[ end trace be43f99a5d1e553e ]--- > > [ 21.510046] dasdconf.sh Warning: 0.0.241e is already online, > > not configuring > > Thinking about this issue further, I can't understand the root > cause for > this issue. > > FWIW, Li
Re: 4.15.15: BFQ stalled at blk_mq_get_tag
On 4/5/18 8:45 AM, Paolo Valente wrote:
>> On 05 Apr 2018, at 15:15, Sami Farin wrote:
>>
>> I was using chacharand to fill 32 GB SD card (VFAT fs) (maybe 30 MiB/s)
>> with random data, it froze halfway. There was 400 MiB Dirty data.
>> After reboot the filling operation went OK when I used kyber scheduler.
>> System is Fedora 27 on Core i5 2500K / 16 GiB.
>>
> I'm afraid this crash is caused by a bug fixed for 4.16 [1]. In the
> same thread [1], Oleksandr (in CC) proposed to backport this and
> other fixes and improvements to 4.15. But Jens (in CC) didn't accept,
> because too general stuff was included in the batch. Maybe this bug
> report could be the opportunity to reconsider that backport or part of
> it?

I never objected to backporting the single fix. What I did object to was a huge list of other fixes. I'm also fine with backporting a set of fixes, if they are all relevant to backport. I don't want a wholesale list of "these are all the changes we did to BFQ, let's backport them".

--
Jens Axboe
Re: 4.15.15: BFQ stalled at blk_mq_get_tag
> Il giorno 05 apr 2018, alle ore 15:15, Sami Farin > ha scritto: > > I was using chacharand to fill 32 GB SD card (VFAT fs) (maybe 30 MiB/s) > with random data, it froze halfway. There was 400 MiB Dirty data. > After reboot the filling operation went OK when I used kyber scheduler. > System is Fedora 27 on Core i5 2500K / 16 GiB. > I'm afraid this crash is caused by a bug fixed for 4.16 [1]. In the same thread [1], Oleksander (in CC) proposed to backport this and other fixes and improvements to 4.15. But Jens (in CC) didn't accept, because too general stuff was included in the batch. Maybe this bug report could be the opportunity to reconsider that backport or part of it? Thanks, Paolo [1] https://lkml.org/lkml/2018/2/7/678 > sysrq: SysRq : Show Blocked State > taskPC stack pid father > device poll D0 2811838 1 0x > Call Trace: > ? __schedule+0x2c2/0x910 > schedule+0x2a/0x80 > schedule_timeout+0x8a/0x490 > ? collect_expired_timers+0xa0/0xa0 > msleep+0x24/0x30 > usb_port_suspend+0x298/0x430 [usbcore] > usb_suspend_both+0x17d/0x200 [usbcore] > ? usb_probe_interface+0x300/0x300 [usbcore] > usb_runtime_suspend+0x25/0x60 [usbcore] > __rpm_callback+0xb7/0x1f0 > ? usb_probe_interface+0x300/0x300 [usbcore] > rpm_callback+0x1a/0x80 > ? usb_probe_interface+0x300/0x300 [usbcore] > rpm_suspend+0x11e/0x660 > __pm_runtime_suspend+0x36/0x60 > usbdev_release+0xb3/0x120 [usbcore] > __fput+0xa3/0x1f0 > task_work_run+0x82/0xa0 > exit_to_usermode_loop+0x91/0xa0 > do_syscall_64+0xe7/0x100 > entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > RIP: 0033:0x7f41101c170c > RSP: 002b:7f410e655b80 EFLAGS: 0293 ORIG_RAX: 0003 > RAX: RBX: 7f410e655ecb RCX: 7f41101c170c > RDX: RSI: 7f410e655ea0 RDI: 0007 > RBP: 7f410e655ec4 R08: R09: 7f410080 > R10: R11: 0293 R12: 7f410e655c90 > R13: 7f410e655ecb R14: 0007 R15: 7f410e655ebb > kworker/u8:4D0 2978647 2 0x8000 > Workqueue: writeback wb_workfn (flush-8:80) > Call Trace: > ? 
__schedule+0x2c2/0x910 > schedule+0x2a/0x80 > io_schedule+0xd/0x30 > blk_mq_get_tag+0x150/0x250 > ? wait_woken+0x80/0x80 > blk_mq_get_request+0x131/0x450 > ? bfq_bio_merge+0xcb/0x100 > blk_mq_make_request+0x118/0x6e0 > ? blk_queue_enter+0x31/0x2f0 > generic_make_request+0xfd/0x2a0 > ? submit_bio+0x67/0x140 > submit_bio+0x67/0x140 > ? guard_bio_eod+0x78/0x150 > mpage_writepages+0xa7/0xe0 > ? fat_add_cluster+0x60/0x60 [fat] > ? do_writepages+0x37/0xc0 > ? fat_writepage+0x10/0x10 [fat] > do_writepages+0x37/0xc0 > ? reacquire_held_locks+0x8f/0x150 > ? writeback_sb_inodes+0xef/0x490 > ? __writeback_single_inode+0x5a/0x530 > __writeback_single_inode+0x5a/0x530 > writeback_sb_inodes+0x1ed/0x490 > __writeback_inodes_wb+0x55/0xa0 > wb_writeback+0x261/0x3f0 > ? wb_workfn+0x1fd/0x4f0 > wb_workfn+0x1fd/0x4f0 > process_one_work+0x206/0x560 > worker_thread+0x2c/0x380 > ? process_one_work+0x560/0x560 > kthread+0x10e/0x130 > ? kthread_create_on_node+0x40/0x40 > ret_from_fork+0x35/0x40 > kworker/0:3 D0 2979285 2 0x8000 > Workqueue: events_freezable_power_ disk_events_workfn > Call Trace: > ? __schedule+0x2c2/0x910 > schedule+0x2a/0x80 > io_schedule+0xd/0x30 > blk_mq_get_tag+0x150/0x250 > ? wait_woken+0x80/0x80 > blk_mq_get_request+0x131/0x450 > blk_mq_alloc_request+0x58/0xb0 > blk_get_request_flags+0x3b/0x150 > scsi_execute+0x33/0x250 > scsi_test_unit_ready+0x48/0xb0 > sd_check_events+0xc8/0x170 > disk_check_events+0x54/0x130 > process_one_work+0x206/0x560 > worker_thread+0x2c/0x380 > ? process_one_work+0x560/0x560 > kthread+0x10e/0x130 > ? kthread_create_on_node+0x40/0x40 > ? SyS_exit+0xe/0x10 > ret_from_fork+0x35/0x40 > chacharand D0 2980742 2978974 0x8002 > Call Trace: > ? __schedule+0x2c2/0x910 > schedule+0x2a/0x80 > io_schedule+0xd/0x30 > blk_mq_get_tag+0x150/0x250 > ? wait_woken+0x80/0x80 > blk_mq_get_request+0x131/0x450 > ? bfq_bio_merge+0xcb/0x100 > blk_mq_make_request+0x118/0x6e0 > ? blk_queue_enter+0x31/0x2f0 > generic_make_request+0xfd/0x2a0 > ? 
submit_bio+0x67/0x140 > submit_bio+0x67/0x140 > ? guard_bio_eod+0x78/0x150 > __mpage_writepage+0x67e/0x7a0 > ? clear_page_dirty_for_io+0x10f/0x240 > ? clear_page_dirty_for_io+0x12f/0x240 > write_cache_pages+0x1ee/0x460 > ? clean_buffers+0x60/0x60 > ? fat_add_cluster+0x60/0x60 [fat] > mpage_writepages+0x68/0xe0 > ? fat_add_cluster+0x60/0x60 [fat] > ? do_writepages+0x37/0xc0 > ? fat_writepage+0x10/0x10 [fat] > do_writepages+0x37/0xc0 > ? __filemap_fdatawrite_range+0x99/0xe0 > ? __filemap_fdatawrite_range+0xa6/0xe0 > __filemap_fdatawrite_range+0xa6/0xe0 > ? sync_inode_metadata+0x2a/0x30 > fat_flush_inodes+0x25/0x60 [fat] > fat_file_release+0x2a/0x40 [fat] > __fput+0xa3/0x1f0 > task_work_run+0x82/0xa0 > do_exit+0x29b/0xbf0 > do_group_exit+0x34/0xb0 > SyS_exit_group+0xb/0x10 > do_syscall_64+0x62/0x100 > entry_SYSCALL_64_after_hwframe+0
Re: [BISECTED][REGRESSION] Hang while booting EeePC 900
Hello, On Thu, Apr 05, 2018 at 09:14:15AM +0100, Sitsofe Wheeler wrote: > Just out of interest, does the fact that an abort occurs mean that the > hardware is somehow broken or badly behaved? Not really. For example, ATAPI devices depend on exception handling to fetch sense data as a part of normal operation which is handled by libata exception handler which is invoked by aborting the original command. So, exception handling can often be a part of normal operation. Thanks. -- tejun
Re: [RFC PATCH 0/2] use larger max_request_size for virtio_blk
On 4/5/18 4:09 AM, Weiping Zhang wrote:
> Hi,
>
> For a virtio block device there is actually no hard limit on the max
> request size, and the virtio_blk driver passes -1U to
> blk_queue_max_hw_sectors(q, -1U). But that doesn't work, because there
> is a default upper limitation BLK_DEF_MAX_SECTORS (1280 sectors). So
> this series wants to add a new helper,
> blk_queue_max_hw_sectors_no_limit, to set a proper max request size.
>
> Weiping Zhang (2):
>   blk-setting: add new helper blk_queue_max_hw_sectors_no_limit
>   virtio_blk: add new module parameter to set max request size
>
>  block/blk-settings.c       | 20 ++++++++++++++++++++
>  drivers/block/virtio_blk.c | 32 ++++++++++++++++++++++++++++++--
>  include/linux/blkdev.h     |  2 ++
>  3 files changed, 52 insertions(+), 2 deletions(-)

The driver should just use blk_queue_max_hw_sectors() to set the limit, and then the soft limit can be modified by a udev rule. Technically the driver doesn't own the software limit; it's imposed to ensure that we don't introduce too much latency per request. Your situation is no different from many other setups, where the hw limit is much higher than the default 1280k.

--
Jens Axboe
4.15.15: BFQ stalled at blk_mq_get_tag
I was using chacharand to fill 32 GB SD card (VFAT fs) (maybe 30 MiB/s) with random data, it froze halfway. There was 400 MiB Dirty data. After reboot the filling operation went OK when I used kyber scheduler. System is Fedora 27 on Core i5 2500K / 16 GiB.

sysrq: SysRq : Show Blocked State
  task                        PC stack   pid father
device poll     D    0 2811838      1 0x
Call Trace:
 ? __schedule+0x2c2/0x910
 schedule+0x2a/0x80
 schedule_timeout+0x8a/0x490
 ? collect_expired_timers+0xa0/0xa0
 msleep+0x24/0x30
 usb_port_suspend+0x298/0x430 [usbcore]
 usb_suspend_both+0x17d/0x200 [usbcore]
 ? usb_probe_interface+0x300/0x300 [usbcore]
 usb_runtime_suspend+0x25/0x60 [usbcore]
 __rpm_callback+0xb7/0x1f0
 ? usb_probe_interface+0x300/0x300 [usbcore]
 rpm_callback+0x1a/0x80
 ? usb_probe_interface+0x300/0x300 [usbcore]
 rpm_suspend+0x11e/0x660
 __pm_runtime_suspend+0x36/0x60
 usbdev_release+0xb3/0x120 [usbcore]
 __fput+0xa3/0x1f0
 task_work_run+0x82/0xa0
 exit_to_usermode_loop+0x91/0xa0
 do_syscall_64+0xe7/0x100
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f41101c170c
RSP: 002b:7f410e655b80 EFLAGS: 0293 ORIG_RAX: 0003
RAX: RBX: 7f410e655ecb RCX: 7f41101c170c
RDX: RSI: 7f410e655ea0 RDI: 0007
RBP: 7f410e655ec4 R08: R09: 7f410080
R10: R11: 0293 R12: 7f410e655c90
R13: 7f410e655ecb R14: 0007 R15: 7f410e655ebb

kworker/u8:4    D    0 2978647      2 0x8000
Workqueue: writeback wb_workfn (flush-8:80)
Call Trace:
 ? __schedule+0x2c2/0x910
 schedule+0x2a/0x80
 io_schedule+0xd/0x30
 blk_mq_get_tag+0x150/0x250
 ? wait_woken+0x80/0x80
 blk_mq_get_request+0x131/0x450
 ? bfq_bio_merge+0xcb/0x100
 blk_mq_make_request+0x118/0x6e0
 ? blk_queue_enter+0x31/0x2f0
 generic_make_request+0xfd/0x2a0
 ? submit_bio+0x67/0x140
 submit_bio+0x67/0x140
 ? guard_bio_eod+0x78/0x150
 mpage_writepages+0xa7/0xe0
 ? fat_add_cluster+0x60/0x60 [fat]
 ? do_writepages+0x37/0xc0
 ? fat_writepage+0x10/0x10 [fat]
 do_writepages+0x37/0xc0
 ? reacquire_held_locks+0x8f/0x150
 ? writeback_sb_inodes+0xef/0x490
 ? __writeback_single_inode+0x5a/0x530
 __writeback_single_inode+0x5a/0x530
 writeback_sb_inodes+0x1ed/0x490
 __writeback_inodes_wb+0x55/0xa0
 wb_writeback+0x261/0x3f0
 ? wb_workfn+0x1fd/0x4f0
 wb_workfn+0x1fd/0x4f0
 process_one_work+0x206/0x560
 worker_thread+0x2c/0x380
 ? process_one_work+0x560/0x560
 kthread+0x10e/0x130
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x35/0x40

kworker/0:3     D    0 2979285      2 0x8000
Workqueue: events_freezable_power_ disk_events_workfn
Call Trace:
 ? __schedule+0x2c2/0x910
 schedule+0x2a/0x80
 io_schedule+0xd/0x30
 blk_mq_get_tag+0x150/0x250
 ? wait_woken+0x80/0x80
 blk_mq_get_request+0x131/0x450
 blk_mq_alloc_request+0x58/0xb0
 blk_get_request_flags+0x3b/0x150
 scsi_execute+0x33/0x250
 scsi_test_unit_ready+0x48/0xb0
 sd_check_events+0xc8/0x170
 disk_check_events+0x54/0x130
 process_one_work+0x206/0x560
 worker_thread+0x2c/0x380
 ? process_one_work+0x560/0x560
 kthread+0x10e/0x130
 ? kthread_create_on_node+0x40/0x40
 ? SyS_exit+0xe/0x10
 ret_from_fork+0x35/0x40

chacharand      D    0 2980742 2978974 0x8002
Call Trace:
 ? __schedule+0x2c2/0x910
 schedule+0x2a/0x80
 io_schedule+0xd/0x30
 blk_mq_get_tag+0x150/0x250
 ? wait_woken+0x80/0x80
 blk_mq_get_request+0x131/0x450
 ? bfq_bio_merge+0xcb/0x100
 blk_mq_make_request+0x118/0x6e0
 ? blk_queue_enter+0x31/0x2f0
 generic_make_request+0xfd/0x2a0
 ? submit_bio+0x67/0x140
 submit_bio+0x67/0x140
 ? guard_bio_eod+0x78/0x150
 __mpage_writepage+0x67e/0x7a0
 ? clear_page_dirty_for_io+0x10f/0x240
 ? clear_page_dirty_for_io+0x12f/0x240
 write_cache_pages+0x1ee/0x460
 ? clean_buffers+0x60/0x60
 ? fat_add_cluster+0x60/0x60 [fat]
 mpage_writepages+0x68/0xe0
 ? fat_add_cluster+0x60/0x60 [fat]
 ? do_writepages+0x37/0xc0
 ? fat_writepage+0x10/0x10 [fat]
 do_writepages+0x37/0xc0
 ? __filemap_fdatawrite_range+0x99/0xe0
 ? __filemap_fdatawrite_range+0xa6/0xe0
 __filemap_fdatawrite_range+0xa6/0xe0
 ? sync_inode_metadata+0x2a/0x30
 fat_flush_inodes+0x25/0x60 [fat]
 fat_file_release+0x2a/0x40 [fat]
 __fput+0xa3/0x1f0
 task_work_run+0x82/0xa0
 do_exit+0x29b/0xbf0
 do_group_exit+0x34/0xb0
 SyS_exit_group+0xb/0x10
 do_syscall_64+0x62/0x100
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7fd9172b3178
RSP: 002b:7fffe01eb248 EFLAGS: 0246 ORIG_RAX: 00e7
RAX: ffda RBX: RCX: 7fd9172b3178
RDX: RSI: 003c RDI:
RBP: 7fd9175b08b8 R08: 00e7 R09: ff80
R10: 7fffe01eb1d0 R11: 0246 R12: 7fd9175b08b8
R13: 7fd9175b5d60 R14: R15:

(ostnamed)      D    0 2981753      1 0x0004
Call Trace:
 ? __schedule+0x2c2/0x910
 ? rwsem_down_write_failed+0x174/0x260
 schedule+0x2a/0x80
 rwsem_down_write_failed+0x179/0x260
 ? call_rwsem_down_write_failed+0x13/0x20
 call_rwsem_down_write_failed+0x13/0x20
 down_write+0x3b/0x50
 ? do_mount+0x434/0xdb0
 do_mount+
Re: [RFC PATCH 0/2] use larger max_request_size for virtio_blk
Weiping,

> For a virtio block device there is actually no hard limit on the max
> request size, and the virtio_blk driver passes -1U to
> blk_queue_max_hw_sectors(q, -1U). But that doesn't work, because there
> is a default upper limitation BLK_DEF_MAX_SECTORS (1280 sectors).

That's intentional (although it's an ongoing debate what the actual value should be).

> So this series wants to add a new helper,
> blk_queue_max_hw_sectors_no_limit, to set a proper max request size.

BLK_DEF_MAX_SECTORS is a kernel default empirically chosen to strike a decent balance between I/O latency and bandwidth. It sets an upper bound for filesystem requests only, regardless of the capabilities of the block device driver and underlying hardware.

You can override the limit on a per-device basis via max_sectors_kb in sysfs. People generally do it via a udev rule.

--
Martin K. Petersen
Oracle Linux Engineering
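As a concrete sketch of that override (the device name and value here are examples only, not recommendations — max_sectors_kb must not exceed the device's max_hw_sectors_kb):

```shell
# One-off: raise the soft limit for a single device (value in KiB)
echo 4096 > /sys/block/vda/queue/max_sectors_kb

# Persistent: a udev rule applying the same override on device discovery,
# e.g. in /etc/udev/rules.d/99-virtio-max-sectors.rules
ACTION=="add|change", KERNEL=="vd[a-z]", ATTR{queue/max_sectors_kb}="4096"
```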
[RFC PATCH 1/2] blk-setting: add new helper blk_queue_max_hw_sectors_no_limit
There is a default upper limitation BLK_DEF_MAX_SECTORS, but for some virtual block device drivers there is no such limitation. So add a new helper to set the max request size.

Signed-off-by: Weiping Zhang
---
 block/blk-settings.c   | 20 ++++++++++++++++++++
 include/linux/blkdev.h |  2 ++
 2 files changed, 22 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 48ebe6b..685c30c 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -253,6 +253,26 @@ void blk_queue_max_hw_sectors(struct request_queue *q, unsigned int max_hw_secto
 }
 EXPORT_SYMBOL(blk_queue_max_hw_sectors);
 
+/* same as blk_queue_max_hw_sectors but without default upper limitation */
+void blk_queue_max_hw_sectors_no_limit(struct request_queue *q,
+				       unsigned int max_hw_sectors)
+{
+	struct queue_limits *limits = &q->limits;
+	unsigned int max_sectors;
+
+	if ((max_hw_sectors << 9) < PAGE_SIZE) {
+		max_hw_sectors = 1 << (PAGE_SHIFT - 9);
+		printk(KERN_INFO "%s: set to minimum %d\n",
+		       __func__, max_hw_sectors);
+	}
+
+	limits->max_hw_sectors = max_hw_sectors;
+	max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
+	limits->max_sectors = max_sectors;
+	q->backing_dev_info->io_pages = max_sectors >> (PAGE_SHIFT - 9);
+}
+EXPORT_SYMBOL(blk_queue_max_hw_sectors_no_limit);
+
 /**
  * blk_queue_chunk_sectors - set size of the chunk for this queue
  * @q: the request queue for the device
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ed63f3b..2250709 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1243,6 +1243,8 @@ extern void blk_cleanup_queue(struct request_queue *);
 extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
 extern void blk_queue_bounce_limit(struct request_queue *, u64);
 extern void blk_queue_max_hw_sectors(struct request_queue *, unsigned int);
+extern void blk_queue_max_hw_sectors_no_limit(struct request_queue *,
+					      unsigned int);
 extern void blk_queue_chunk_sectors(struct request_queue *, unsigned int);
 extern void blk_queue_max_segments(struct request_queue *, unsigned short);
 extern void blk_queue_max_discard_segments(struct request_queue *,
-- 
2.9.4
[RFC PATCH 0/2] use larger max_request_size for virtio_blk
Hi,

For a virtio block device there is actually no hard limit on the max request size, and the virtio_blk driver passes -1U to blk_queue_max_hw_sectors(q, -1U). But that doesn't work, because there is a default upper limitation BLK_DEF_MAX_SECTORS (1280 sectors). So this series wants to add a new helper, blk_queue_max_hw_sectors_no_limit, to set a proper max request size.

Weiping Zhang (2):
  blk-setting: add new helper blk_queue_max_hw_sectors_no_limit
  virtio_blk: add new module parameter to set max request size

 block/blk-settings.c       | 20 ++++++++++++++++++++
 drivers/block/virtio_blk.c | 32 ++++++++++++++++++++++++++++++--
 include/linux/blkdev.h     |  2 ++
 3 files changed, 52 insertions(+), 2 deletions(-)

-- 
2.9.4
Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible
On Wed, 4 Apr 2018, Ming Lei wrote:
> On Wed, Apr 04, 2018 at 02:45:18PM +0200, Thomas Gleixner wrote:
> > Now the 4 offline CPUs are plugged in again. These CPUs won't ever get an
> > interrupt as all interrupts stay on CPU 0-3 unless one of these CPUs is
> > unplugged. Using cpu_present_mask the spread would be:
> >
> > irq 39, cpu list 0,1
> > irq 40, cpu list 2,3
> > irq 41, cpu list 4,5
> > irq 42, cpu list 6,7
>
> Given physical CPU hotplug isn't common, this way will make only irq 39
> and irq 40 active most of times, so performance regression is caused just
> as Kashyap reported.

That is only true if CPU 4-7 are in the present mask at boot time. I seriously doubt that this is the case for Kashyap's scenario.

Grrr, if you would have included him in the Reported-by: tags then I could have asked him myself.

In the physical hotplug case, the physically (or virtually) not available CPUs are not in the present mask. They are solely in the possible mask.

The above is about soft hotplug, where the CPUs are physically there and therefore in the present mask and can be onlined without interaction from the outside (mechanical or virt config).

If nobody objects, I'll make that change and queue the stuff tomorrow morning so it can brew a few days in next before I send it off to Linus.

Thanks,

	tglx
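To make the chunk arithmetic in the quoted spread concrete, here is a toy shell model: it divides C present CPUs into V consecutive chunks, one per vector. This is an illustration only, not the kernel's genirq code, and it prints CPU ranges rather than comma-separated lists.

```shell
# Toy model of spreading V interrupt vectors over the C present CPUs in
# consecutive chunks, reproducing the example above (4 vectors, CPUs 0-7).
spread() {
    vectors=$1
    cpus=$2
    base_irq=$3
    per=$((cpus / vectors))            # CPUs per vector chunk
    i=0
    while [ "$i" -lt "$vectors" ]; do
        first=$((i * per))
        last=$((first + per - 1))
        echo "irq $((base_irq + i)), cpu list $first-$last"
        i=$((i + 1))
    done
}

spread 4 8 39
# prints:
# irq 39, cpu list 0-1
# irq 40, cpu list 2-3
# irq 41, cpu list 4-5
# irq 42, cpu list 6-7
```

With cpu_possible_mask instead of cpu_present_mask, the same chunking over CPUs that may never come online is what leaves some vectors idle.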
[RFC PATCH 2/2] virtio_blk: add new module parameter to set max request size
Actually there is no upper limitation, so add a new module parameter to provide a way to set a proper max request size for virtio block. Using a larger request size can improve sequential performance in theory, and reduce the interaction between guest and hypervisor.

Signed-off-by: Weiping Zhang
---
 drivers/block/virtio_blk.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4a07593c..5ac6d59 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -64,6 +64,34 @@ struct virtblk_req {
 	struct scatterlist sg[];
 };
 
+
+static int max_request_size_set(const char *val, const struct kernel_param *kp);
+
+static const struct kernel_param_ops max_request_size_ops = {
+	.set = max_request_size_set,
+	.get = param_get_uint,
+};
+
+static unsigned int max_request_size = 4096;	/* in unit of KiB */
+module_param_cb(max_request_size, &max_request_size_ops, &max_request_size,
+		0444);
+MODULE_PARM_DESC(max_request_size, "set max request size, in unit of KiB");
+
+static int max_request_size_set(const char *val, const struct kernel_param *kp)
+{
+	int ret;
+	unsigned int size_kb, page_kb = 1 << (PAGE_SHIFT - 10);
+
+	ret = kstrtouint(val, 10, &size_kb);
+	if (ret != 0)
+		return -EINVAL;
+
+	if (size_kb < page_kb)
+		return -EINVAL;
+
+	return param_set_uint(val, kp);
+}
+
 static inline blk_status_t virtblk_result(struct virtblk_req *vbr)
 {
 	switch (vbr->status) {
@@ -730,8 +758,8 @@ static int virtblk_probe(struct virtio_device *vdev)
 	/* We can handle whatever the host told us to handle. */
 	blk_queue_max_segments(q, vblk->sg_elems-2);
 
-	/* No real sector limit. */
-	blk_queue_max_hw_sectors(q, -1U);
+	/* No real sector limit, use 512b sectors: (max_request_size << 10) >> 9 */
+	blk_queue_max_hw_sectors_no_limit(q, max_request_size << 1);
 
 	/* Host can optionally specify maximum segment size and number of
 	 * segments. */
-- 
2.9.4
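For completeness, and only assuming this RFC were applied (the max_request_size parameter does not exist in mainline virtio_blk), the new knob would be set at module load time, e.g. via a modprobe configuration fragment:

```shell
# /etc/modprobe.d/virtio_blk.conf -- hypothetical; valid only with this
# RFC patch applied. Value is in KiB per the patch's MODULE_PARM_DESC.
options virtio_blk max_request_size=8192
```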
bcache and hibernation (was: bcache: bad block header)
Hi,

I have a hypothesis of what happened. My swap volume is also on LVM, and thus also eventually backed by bcache. Hibernation and resume work fine. But when the hibernation image is read during resume, the contents of the cache device change, because with bcache reading is no longer a read-only operation. When the hibernation image is loaded, the kernel loses track of these changes, so that what's on the cache disk no longer matches the structures in the kernel. Therefore, on the first boot after the successful resume, havoc ensues.

I needed the system running again, so I've now detached the backing volumes, re-initialized the cache volume and re-attached the backing volumes. Unfortunately there was too much filesystem damage, so I restored everything from backup.

Is there a way to prevent this from happening? Could eg the kernel detect that the swap device is (indirectly) on bcache and refuse to hibernate? Or is there a way to do a "true" read-only mount of a bcache volume so that one can safely resume from it?

Best,
-Nikolaus

--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«

On Tue, 3 Apr 2018, at 23:38, Jens Axboe wrote:
> CC'ing Mike
>
> On 4/3/18 1:01 PM, Nikolaus Rath wrote:
> > [ Re-send to both linux-block and linux-bcache ]
> >
> > Hi,
> >
> > A few days ago, my system refused to boot because it couldn't find the root
> > filesystem anymore. The root filesystem is ext4 on LVM on dm-crypt on
> > bcache, using kernel 4.9.92 (from Debian stretch).
Booting from a recovery > > medium with Kernel 4.16, I got: > > > > [ 84.551715] bcache: register_bcache() error /dev/sda4: device already > > registered > > [ 84.553188] bcache: register_bcache() error /dev/sdc2: device already > > registered > > [ 84.616438] bcache: error on 1330b5f6-0c13-43ec-b925-2ee2734b135f: > > [ 84.616440] bad btree header at bucket 85065, block 0, 0 keys > > [ 84.616442] , disabling caching > > [ 84.616445] bcache: register_cache() registered cache device sdb2 > > [ 84.616597] bcache: cache_set_free() Cache set > > 1330b5f6-0c13-43ec-b925-2ee2734b135f unregistered > > [ 85.375933] sdb: sdb1 sdb2 sdb4 < sdb5 > > > [ 85.416610] bcache: error on 1330b5f6-0c13-43ec-b925-2ee2734b135f: > > [ 85.416612] bad btree header at bucket 85065, block 0, 0 keys > > [ 85.416614] , disabling caching > > [ 85.416618] bcache: register_cache() registered cache device sdb2 > > [ 85.416624] bcache: register_bcache() error /dev/sdc2: device already > > registered > > [ 85.416626] bcache: register_bcache() error /dev/sda4: device already > > registered > > [ 85.416796] bcache: cache_set_free() Cache set > > 1330b5f6-0c13-43ec-b925-2ee2734b135f unregistered > > [ 85.488246] bcache: error on 1330b5f6-0c13-43ec-b925-2ee2734b135f: > > [ 85.488249] bad btree header at bucket 85065, block 0, 0 keys > > [ 85.488251] , disabling caching > > [ 85.488254] bcache: register_cache() registered cache device sdb2 > > [ 85.488429] bcache: cache_set_free() Cache set > > 1330b5f6-0c13-43ec-b925-2ee2734b135f unregistered > > [ 85.560003] bcache: error on 1330b5f6-0c13-43ec-b925-2ee2734b135f: > > [ 85.560006] bad btree header at bucket 85065, block 0, 0 keys > > [ 85.560008] , disabling caching > > [ 85.560013] bcache: register_cache() registered cache device sdb2 > > [ 85.560017] bcache: register_bcache() error /dev/sda4: device already > > registered > > [ 85.560217] bcache: cache_set_free() Cache set > > 1330b5f6-0c13-43ec-b925-2ee2734b135f unregistered > > [ 85.571950] bcache: 
register_bcache() error /dev/sdc2: device already > > registered > > [ 85.580628] bcache: register_bcache() error /dev/sdc2: device already > > registered > > [ 85.761969] bcache: register_bcache() error /dev/sda4: device already > > registered > > [ 85.792749] bcache: register_bcache() error /dev/sda4: device already > > registered > > [ 85.952931] bcache: register_bcache() error /dev/sda4: device already > > registered > > [ 85.955640] bcache: register_bcache() error /dev/sda4: device already > > registered > > [...] > > > > These are the first messages that mention bcache. Note that the first > > message is that the device is already registered - is that normal? > > > > smartctl does not report any errors on backing or caching disks, and the > > system was shutdown cleanly. > > > > The only possibly related thing that comes to mind is that a few days ago I > > hibernated and resumed the system (this is something I normally don't do). > > Resume worked fine as far as I could tell though, and there have been no > > unclean shutdowns. > > > > Is there a way to narrow down what may have caused this corruption? > > > > And, is there a way to gracefully recover from this situation without > > wiping everything? Since the message mentions only problems with one block, > > can I maybe tell bcache t
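On the detection question raised above: a rough, hypothetical sketch of how one might check from userspace whether a swap device is (indirectly) backed by bcache. The helper just scans a NAME/TYPE dependency listing such as `lsblk --inverse --noheadings -o NAME,TYPE <dev>` produces; the sample listing is invented for illustration and will differ per system.

```shell
# Succeeds (exit 0) if a device dependency listing on stdin mentions a
# bcache layer anywhere in the stack under the swap device.
swap_on_bcache() {
    grep -q 'bcache'
}

# Hypothetical listing for a swap LV on dm-crypt on bcache, in the shape
# that `lsblk --inverse --noheadings -o NAME,TYPE` would emit:
sample='vg0-swap  lvm
crypt0    crypt
bcache0   disk
sda       disk'

if printf '%s\n' "$sample" | swap_on_bcache; then
    echo "swap is backed by bcache - resuming from it is unsafe"
fi
```

In practice one would feed it each device from /proc/swaps; this is a diagnostic heuristic, not a kernel-level safeguard.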
Re: [BISECTED][REGRESSION] Hang while booting EeePC 900
On 2 April 2018 at 21:29, Tejun Heo wrote:
> Hello, Sitsofe.
>
> Can you see whether the following patch makes any difference?
>
> Thanks.
>
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index a05e367..f0e6e41 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
>  		 * No need for fancy synchronizations.
>  		 */
>  		blk_rq_set_deadline(req, jiffies);
> -		mod_timer(&req->q->timeout, 0);
> +		kblockd_schedule_work(&req->q->timeout_work);
>  	} else {
>  		if (blk_mark_rq_complete(req))
>  			return;

Just out of interest, does the fact that an abort occurs mean that the hardware is somehow broken or badly behaved?

--
Sitsofe | http://sucs.org/~sits/