Many thx, Thorsten!

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1887124

Title:
  [UBUNTU 20.04] DIF and DIX support in zfcp (s390x) is broken and the
  kernel crashes unconditionally

Status in Ubuntu on IBM z Systems:
  Fix Committed
Status in linux package in Ubuntu:
  Fix Committed
Status in linux source package in Focal:
  Fix Committed

Bug description:
  SRU Justification:
  ==================

  [Impact]

  * DIF and DIX support in zfcp (s390x) is broken and the kernel crashes
  unconditionally, in case one activates either zfcp DIF (data integrity
  field) or DIX (data integrity extention) and attaches any LU.

  * zfcp used to allocate/add the shost object for a HBA, before knowing
  all the HBA's capabilities. But later the shost object got patched to
  make more of the capabilities known - including the protection
  capabilities.

  * Changes that are later made to the protection capabilities don't get
  reflected in the tag-set requests and are missing.

  * Now sending I/O, scsi_mod tries to access the protection payload
  data, that isn't there, and the kernel crashes.

  [Fix]

  * The SRU request was created as pull request, so please pull starting
  at 0b38938af911 to head / 9227405ee311 from
  https://code.launchpad.net/~fheimes/+git/lp1887124

  [Test Case]

  * Have a IBM Z or LInuxONE system (ideally on LPAR) in place that is
  connected to a DS8xxx storage subsystem that supports, zfcp SCSI
  storage with DIF and DIX.

  * Have an Ubuntu Server 20.04 installed in such a LPAR system.

  * Now enable DIF (zfcp.dif=1) or DIX (zfcp.dix=1) and connect the
  system to a logical unit.

  * Monitor the system for any kernel crashes (/var/log/syslog and
  /var/log/kern.log) - look for kernel panic in scsi_queue_rq.

  [Regression Potential]

  * Even if the modifications are substantial, the regression potnetial
  can be considered as moderate.

  * This issue could be mainly tracked down to commit '737eb78e82d5'
  (made for v5.4) that made the problem more visible.

  * With the old blk queue DIF/DIX worked fine, but after scsi_mod switched to 
blk-mq (and because requests are now
  all allocated during allocation time of the blk-mq tag-set), it didn't worked 
anymore.

  * The solution is to allocate/add the shost object for a HBA after all
  of its base capabilities are known.

  * Since the situation is like this for quite a while, several places
  in the zfcp driver code (where it's assumed that the shost object was
  correctly allocated) needed to be touched.

  * This is finally the reason (as well as the fact that some parts
  depend on code that went upstream with the kernels 5.5 and 5.7) for
  this rather big patchset for fixing DIF and DIX support in zfcp.

  * On the other hand side this entire set is now upstream accepted and
  entirely included in 5.8 (and with that soon in groovy), and no
  5.4-specific (hand-crafted) backports are needed.

  * Also notice that all modifications are in the zfcp driver layer,
  hence they are s390x specific, and with that limited to
  /drivers/s390/scsi/zfcp_*

  * Finally the patches were already tested with a patched focal kernel
  with DIF/DIX/NONE, with I/O, and local/remote cable pulls - switched
  and point-to-point connected. No regressions were identified.

  [Other]

  * Alternatively the patches can also be cherry-picked from linux master with:
      git cherry-pick 92953c6e0aa7~1..e76acc519426
      git cherry-pick a3fd4bfe85fb
      git cherry-pick e05a10a05509~1..42cabdaf103b
      git cherry-pick cec9cbac5244
      git cherry-pick 978857c7e367~1..d0dff2ac98dd

  * The following lists the individual commits - just for the reason of
  completeness:

  * 92953c6e0aa77d4febcca6dd691e8192910c8a28 92953c6e0aa77 "scsi: zfcp:
  signal incomplete or error for sync exchange config/port data"

  * 7e418833e68948cb9ed15262889173b7db2960cb 7e418833e6894 "scsi: zfcp:
  diagnostics buffer caching and use for exchange port data"

  * 088210233e6fc039fd2c0bfe44b06bb94328d09e 088210233e6fc "scsi: zfcp:
  add diagnostics buffer for exchange config data"

  * a10a61e807b0a226b78e2041843cbf0521bd0c35 a10a61e807b0a "scsi: zfcp:
  support retrieval of SFP Data via Exchange Port Data"

  * 6028f7c4cd87cac13481255d7e35dd2c9207ecae 6028f7c4cd87c "scsi: zfcp:
  introduce sysfs interface for diagnostics of local SFP transceiver"

  * 8155eb0785279728b6b2e29aba2ca52d16aa526f 8155eb0785279 "scsi: zfcp:
  implicitly refresh port-data diagnostics when reading sysfs"

  * 5a2876f0d1ef26b76755749f978d46e4666013dd 5a2876f0d1ef2 "scsi: zfcp:
  introduce sysfs interface to read the local B2B-Credit"

  * 8a72db70b5ca3c3feb3ca25519e8a9516cc60cfe 8a72db70b5ca3 "scsi: zfcp:
  implicitly refresh config-data diagnostics when reading sysfs"

  * 48910f8c35cfd250d806f3e03150d256f40b6d4c 48910f8c35cfd "scsi: zfcp:
  move maximum age of diagnostic buffers into a per-adapter variable"

  * e76acc51942649398660ca50655af5afecf29c42 e76acc5194264 "scsi: zfcp:
  proper indentation to reduce confusion in zfcp_erp_required_act"

  * a3fd4bfe85fbb67cf4ec1232d0af625ece3c508b a3fd4bfe85fbb "scsi: zfcp:
  fix wrong data and display format of SFP+ temperature"

  * e05a10a055098bf55168a9d682156e38c6b00cfa e05a10a055098 "scsi: zfcp:
  expose fabric name as common fc_host sysfs attribute"

  * 538c6e910baea9a94ba2a816c19c3e071892b49c 538c6e910baea "scsi: zfcp:
  wire previously driver-specific sysfs attributes also to fc_host"

  * 7e0e4e0958ef794ee868838249880d5c521ff761 7e0e4e0958ef7 "scsi: zfcp:
  fix fc_host attributes that should be unknown on local link down"

  * 185f2d2d595c29c6e2ed6e2897b9ccc52c50c917 185f2d2d595c2 "scsi: zfcp:
  auto variables for dereferenced structs in open port handler"

  * a17c78460093aad8fb97fc6905c22355b7d1c923 a17c78460093a "scsi: zfcp:
  report FC Endpoint Security in sysfs"

  * f0d26ae847489850509b793ef3f74be62f69ab0f f0d26ae847489 "scsi: zfcp:
  log FC Endpoint Security of connections"

  * 616da39e0060f3b8bbc0f36f7d911bb5abb31746 616da39e0060f "scsi: zfcp:
  trace FC Endpoint Security of FCP devices and connections"

  * e53d92856e9f1cfa0be284fa1dc3367130ce433a e53d92856e9f1 "scsi: zfcp:
  enhance handling of FC Endpoint Security errors"

  * 42cabdaf103be174adb6f1ca61383eb2b35a013a 42cabdaf103be "scsi: zfcp:
  log FC Endpoint Security errors"

  * cec9cbac5244b017f2671e3770abfacc939d753d cec9cbac5244b "scsi: zfcp:
  use fallthrough;"

  * 978857c7e367d6841f71c4ded5a8c244520f5e22 978857c7e367d "scsi: zfcp:
  Move shost modification after QDIO (re-)open into fenced function"

  * bd1684817d7d8d1a3b95a4347166246ad1f7670b bd1684817d7d8 "scsi: zfcp:
  Move shost updates during xconfig data handling into fenced function"

  * 52e61fde5ec95cb4011784fb0bc6b436e16fcaa8 52e61fde5ec95 "scsi: zfcp:
  Move fc_host updates during xport data handling into fenced function"

  * 990486f3a8508494dab2a7ff0fcc3eb977557d89 990486f3a8508 "scsi: zfcp:
  Fence fc_host updates during link-down handling"

  * ac007adc4d2d9258fdf27abd25cc77a4e0e8d19f ac007adc4d2d9 "scsi: zfcp:
  Move p-t-p port allocation to after xport data"

  * 971f2abb4ca4095bb4bd7c2ba119d1cca078b433 971f2abb4ca40 "scsi: zfcp:
  Fence adapter status propagation for common statuses"

  * 71159b6ecb067565fb41faba616116e8c0fc10ea 71159b6ecb067 "scsi: zfcp:
  Fence early sysfs interfaces for accesses of shost objects"

  * d0dff2ac98dd41d7d451127d9eae2f6478fc40b0 d0dff2ac98dd4 "scsi: zfcp:
  Move allocation of the shost object to after xconf- and xport-data"

  __________

  So, after this is now all upstream in mainline and I found some time
  to test this properly on Ubuntu, here is the backport request.

  Some time ago we noticed - Fedor Loshakov did - that our DIF and DIX support 
in
  zfcp broke at some point (by "broke" I mean, the kernel will unconditionally
  crash if one activates either DIF, or DIX, and attaches *any* LU). I tracked
  that down to a commit made for v5.4 (737eb78e82d5), but we didn't notice it
  back than, because our CI doesn't currently run with either DIF, nor DIX
  enabled (time allowing this is something we want to improve so we catch stuff
  like this earlier). It also turned out that the commit in v5.4 was not really
  the root-cause, and was only making the problem visible more easy.

  In short: zfcp used to allocate/add the shost object for a HBA before
  knowing all the HBA's capabilities, and we later patched the shost
  object to make more of the capabilities known - including the protection
  capabilities. Back when we still had the old blk queue, this worked
  fine; after scsi_mod switched to blk-mq and because requests are now
  all allocated during allocation time of the blk-mq tag-set, this doesn't
  work anymore. Changes we make later to the protection capabilities don't
  get reflected into the tag-set's requests, and they are missing parts.
  When we then try to send I/O, scsi_mod tries to access the protection
  payload data, who are not there, and it crashes the kernel.

  So instead, I now want to allocate/add the shost object for a HBA
  after we know all of its base capabilities. This solves the bug.

  Because we had this modus operandi for a very long time, I had to touch
  many places that assume the shost object was already allocated -
  explaining the rather big patchset for a 'fix'. And because this also
  involves/depends on code that went upstream in v5.5 and v5.7 we now have
  a rather complicated situation for backports of the fix. Nothing "just"
  applies.

  The easiest and most straight forward way to deal with that is to basically
  backport most everything that is involved - which is most of the stuff
  that went upstream since v5.5 for our driver.

  I complied a list of the upstream commits that would have to be picked
  in order to be merge-conflict free:

      92953c6e0aa77 scsi: zfcp: signal incomplete or error for sync exchange 
config/port data
      7e418833e6894 scsi: zfcp: diagnostics buffer caching and use for exchange 
port data
      088210233e6fc scsi: zfcp: add diagnostics buffer for exchange config data
      a10a61e807b0a scsi: zfcp: support retrieval of SFP Data via Exchange Port 
Data
      6028f7c4cd87c scsi: zfcp: introduce sysfs interface for diagnostics of 
local SFP transceiver
      8155eb0785279 scsi: zfcp: implicitly refresh port-data diagnostics when 
reading sysfs
      5a2876f0d1ef2 scsi: zfcp: introduce sysfs interface to read the local 
B2B-Credit
      8a72db70b5ca3 scsi: zfcp: implicitly refresh config-data diagnostics when 
reading sysfs
      48910f8c35cfd scsi: zfcp: move maximum age of diagnostic buffers into a 
per-adapter variable
      e76acc5194264 scsi: zfcp: proper indentation to reduce confusion in 
zfcp_erp_required_act
      a3fd4bfe85fbb scsi: zfcp: fix wrong data and display format of SFP+ 
temperature
      e05a10a055098 scsi: zfcp: expose fabric name as common fc_host sysfs 
attribute
      538c6e910baea scsi: zfcp: wire previously driver-specific sysfs 
attributes also to fc_host
      7e0e4e0958ef7 scsi: zfcp: fix fc_host attributes that should be unknown 
on local link down
      185f2d2d595c2 scsi: zfcp: auto variables for dereferenced structs in open 
port handler
      a17c78460093a scsi: zfcp: report FC Endpoint Security in sysfs
      f0d26ae847489 scsi: zfcp: log FC Endpoint Security of connections
      616da39e0060f scsi: zfcp: trace FC Endpoint Security of FCP devices and 
connections
      e53d92856e9f1 scsi: zfcp: enhance handling of FC Endpoint Security errors
      42cabdaf103be scsi: zfcp: log FC Endpoint Security errors
      cec9cbac5244b scsi: zfcp: use fallthrough;
      978857c7e367d scsi: zfcp: Move shost modification after QDIO (re-)open 
into fenced function
      bd1684817d7d8 scsi: zfcp: Move shost updates during xconfig data handling 
into fenced function
      52e61fde5ec95 scsi: zfcp: Move fc_host updates during xport data handling 
into fenced function
      990486f3a8508 scsi: zfcp: Fence fc_host updates during link-down handling
      ac007adc4d2d9 scsi: zfcp: Move p-t-p port allocation to after xport data
      971f2abb4ca40 scsi: zfcp: Fence adapter status propagation for common 
statuses
      71159b6ecb067 scsi: zfcp: Fence early sysfs interfaces for accesses of 
shost objects
      d0dff2ac98dd4 scsi: zfcp: Move allocation of the shost object to after 
xconf- and xport-data

  I test this with the kernel you provide @
  git://kernel.ubuntu.com/ubuntu/ubuntu-focal.git, added Linus' tree as
  secondary remote (@
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) and
  ran:

      git cherry-pick 92953c6e0aa7~1..e76acc519426
      git cherry-pick a3fd4bfe85fb
      git cherry-pick e05a10a05509~1..42cabdaf103b
      git cherry-pick cec9cbac5244
      git cherry-pick 978857c7e367~1..d0dff2ac98dd

  That worked without a hitch on top of tag "Ubuntu-5.4.0-41.45"

  I tested this then by building an Ubuntu distribution kernel
  (unsigned) on level 5.4.0-41.45 with the patches from above applied. I
  ran our regression-suite with DIF/DIX/NONE; with I/O, and local/remote
  cable pulls, switched and p-t-p. Everything worked fine for me.

  So I'm positive that this should work just fine.

  If you don't want to pull all these commits it'll get complicated; I
  gave it a look some time ago if there was a smaller changeset possible
  without/with minimal changes, but found that we would have to
  touch/change several patches to make them apply properly and not have
  any regressions. So I would prefer this. It would also make future
  stable backports easier.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1887124/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to