You have been subscribed to a public bug:

So, after this is now all upstream in mainline and I found some time to
test this properly on Ubuntu, here is the backport request.

Some time ago we noticed - Fedor Loshakov did - that our DIV and DIX support in
zfcp broke at some point (by "broke" I mean, the kernel will unconditionally 
crash if one activates either DIF, or DIX, and attaches *any* LU). I tracked 
that down to a commit made for v5.4 (737eb78e82d5), but we didn't notice it
back than, because our CI doesn't currently run with either DIV, nor DIX
enabled (time allowing this is something we want to improve so we catch stuff
like this earlier). It also turned out that the commit in v5.4 was not really
the root-cause, and was only making the problem visible more easy.

In short: zfcp used to allocate/add the shost object for a HBA before
knowing all the HBA's capabilities, and we later patched the shost
object to make more of the capabilities known - including the protection
capabilities. Back when we still had the old blk queue, this worked
fine; after scsi_mod switched to blk-mq and because requests are now
all allocated during allocation time of the blk-mq tag-set, this doesn't
work anymore. Changes we make later to the protection capabilities don't
get reflected into the tag-set's requests, and they are missing parts.
When we then try to send I/O, scsi_mod tries to access the protection
payload data, who are not there, and it crashes the kernel.

So instead, I now want to allocate/add the shost object for a HBA
after we know all of its base capabilities. This solves the bug.

Because we had this modus operandi for a very long time, I had to touch
many places that assume the shost object was already allocated -
explaining the rather big patchset for a 'fix'. And because this also
involves/depends on code that went upstream in v5.5 and v5.7 we now have
a rather complicated situation for backports of the fix. Nothing "just"
applies.

The easiest and most straight forward way to deal with that is to basically
backport most everything that is involved - which is most of the stuff
that went upstream since v5.5 for our driver.

I complied a list of the upstream commits that would have to be picked
in order to be merge-conflict free:

    92953c6e0aa77 scsi: zfcp: signal incomplete or error for sync exchange 
config/port data
    7e418833e6894 scsi: zfcp: diagnostics buffer caching and use for exchange 
port data
    088210233e6fc scsi: zfcp: add diagnostics buffer for exchange config data
    a10a61e807b0a scsi: zfcp: support retrieval of SFP Data via Exchange Port 
Data
    6028f7c4cd87c scsi: zfcp: introduce sysfs interface for diagnostics of 
local SFP transceiver
    8155eb0785279 scsi: zfcp: implicitly refresh port-data diagnostics when 
reading sysfs
    5a2876f0d1ef2 scsi: zfcp: introduce sysfs interface to read the local 
B2B-Credit
    8a72db70b5ca3 scsi: zfcp: implicitly refresh config-data diagnostics when 
reading sysfs
    48910f8c35cfd scsi: zfcp: move maximum age of diagnostic buffers into a 
per-adapter variable
    e76acc5194264 scsi: zfcp: proper indentation to reduce confusion in 
zfcp_erp_required_act
    a3fd4bfe85fbb scsi: zfcp: fix wrong data and display format of SFP+ 
temperature
    e05a10a055098 scsi: zfcp: expose fabric name as common fc_host sysfs 
attribute
    538c6e910baea scsi: zfcp: wire previously driver-specific sysfs attributes 
also to fc_host
    7e0e4e0958ef7 scsi: zfcp: fix fc_host attributes that should be unknown on 
local link down
    185f2d2d595c2 scsi: zfcp: auto variables for dereferenced structs in open 
port handler
    a17c78460093a scsi: zfcp: report FC Endpoint Security in sysfs
    f0d26ae847489 scsi: zfcp: log FC Endpoint Security of connections
    616da39e0060f scsi: zfcp: trace FC Endpoint Security of FCP devices and 
connections
    e53d92856e9f1 scsi: zfcp: enhance handling of FC Endpoint Security errors
    42cabdaf103be scsi: zfcp: log FC Endpoint Security errors
    cec9cbac5244b scsi: zfcp: use fallthrough;
    978857c7e367d scsi: zfcp: Move shost modification after QDIO (re-)open into 
fenced function
    bd1684817d7d8 scsi: zfcp: Move shost updates during xconfig data handling 
into fenced function
    52e61fde5ec95 scsi: zfcp: Move fc_host updates during xport data handling 
into fenced function
    990486f3a8508 scsi: zfcp: Fence fc_host updates during link-down handling
    ac007adc4d2d9 scsi: zfcp: Move p-t-p port allocation to after xport data
    971f2abb4ca40 scsi: zfcp: Fence adapter status propagation for common 
statuses
    71159b6ecb067 scsi: zfcp: Fence early sysfs interfaces for accesses of 
shost objects
    d0dff2ac98dd4 scsi: zfcp: Move allocation of the shost object to after 
xconf- and xport-data

I test this with the kernel you provide @ git://kernel.ubuntu.com/ubuntu
/ubuntu-focal.git, added Linus' tree as secondary remote (@
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) and
ran:

    git cherry-pick 92953c6e0aa7~1..e76acc519426
    git cherry-pick a3fd4bfe85fb
    git cherry-pick e05a10a05509~1..42cabdaf103b
    git cherry-pick cec9cbac5244
    git cherry-pick 978857c7e367~1..d0dff2ac98dd

That worked without a hitch on top of tag "Ubuntu-5.4.0-41.45"

I tested this then by building an Ubuntu distribution kernel (unsigned)
on level 5.4.0-41.45 with the patches from above applied. I ran our
regression-suite with DIV/DIX/NONE; with I/O, and local/remote cable
pulls, switched and p-t-p. Everything worked fine for me.

So I'm positive that this should work just fine.

If you don't want to pull all these commits it'll get complicated; I
gave it a look some time ago if there was a smaller changeset possible
without/with minimal changes, but found that we would have to
touch/change several patches to make them apply properly and not have
any regressions. So I would prefer this. It would also make future
stable backports easier.

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Skipper Bug Screeners (skipper-screen-team)
         Status: New


** Tags: architecture-s39064 bugnameltc-186041 severity-critical 
targetmilestone-inin2004
-- 
[UBUNTU 20.04] kernel panic with zfcp.dif=1 and zfcp.dix=1 - crash in 
scsi_queue_rq
https://bugs.launchpad.net/bugs/1887124
You received this bug notification because you are a member of Kernel Packages, 
which is subscribed to linux in Ubuntu.

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to