[Bug 1887774] Re: [UBUNTU 20.04] zfcp: Fix panic on ERP timeout for previously dismissed ERP

Frank Heimes Thu, 16 Jul 2020 08:01:25 -0700

Kernel SRU request submitted:
https://lists.ubuntu.com/archives/kernel-team/2020-July/thread.html#112154
Updating status to 'In Progress'.


** Changed in: linux (Ubuntu Focal)
       Status: New => In Progress

** Changed in: ubuntu-z-systems
       Status: Triaged => In Progress

** Description changed:

+ SRU Justification:
+ ==================
+ 
+ [Impact]
+ 
+ * The Linux kernel panics due to a kernel page fault in IRQ context when
+ zfcp_erp_timeout_handler() calls zfcp_erp_notify().
+ 
+ [Fix]
+ 
+ * 936e6b85da0476dd2edac7c51c68072da9fb4ba2 936e6b85da04 "scsi: zfcp: Fix
+ panic on ERP timeout for previously dismissed ERP action"
+ 
+ [Test Case]
+ 
+ * Requires an IBM z13/z13s or LinuxONE Rockhopper/Emperor system (or
+ newer) connected to a zfcp-capable storage subsystem.
+ 
+ * Trigger an ERP timeout (for example by injection or by otherwise
+ causing a slow recovery).
+ 
+ * Monitor the system log for any kernel panics.
+ 
+ [Regression Potential]
+ 
+ * The regression potential can be considered medium, since the
+ modification is platform specific (limited to s390x) and further limited
+ to the zfcp layer.
+ 
+ * Within zfcp it's further limited to the error recovery procedure (ERP)
+ of fcp and only touches zfcp_erp.c, which means the code path is mainly
+ active under error conditions.
+ 
+ [Other]
+ 
+ * The above fix was accepted upstream with v5.8-rc3, hence it will make
+ its way to groovy with kernel 5.8.
+ 
+ * Therefore this SRU request was submitted for bionic and focal only and
+ not for groovy.
+ 
+ __________
+ 
  Description:   zfcp: Fix panic on ERP timeout for previously dismissed ERP
  Symptom:       Linux kernel panic due to kernel page fault in IRQ context
-                when running zfcp_erp_timeout_handler() calling
-                zfcp_erp_notify().
+                when running zfcp_erp_timeout_handler() calling
+                zfcp_erp_notify().
  Problem:       Suppose that, for unrelated reasons, FSF requests on behalf
-                of recovery are very slow and can run into the ERP timeout.
-                In the case at hand, we did adapter recovery to a large
-                degree. However due to the slowness a LUN open is pending so
-                the corresponding fc_rport remains blocked. After
-                fast_io_fail_tmo we trigger close physical port recovery for
-                the port under which the LUN should have been opened. The
-                new higher order port recovery dismisses the pending LUN
-                open ERP action and dismisses the pending LUN open FSF
-                request. Such dismissal decouples the ERP action from the
-                pending corresponding FSF request by setting
-                zfcp_fsf_req->erp_action to NULL (among other things)
-                [zfcp_erp_strategy_check_fsfreq()].
-                If now the ERP timeout for the pending open LUN request runs
-                out, we must not use zfcp_fsf_req->erp_action in the ERP
-                timeout handler. This is a problem since v4.15 commit
-                75492a51568b ("s390/scsi: Convert timers to use
-                timer_setup()"). Before that we intentionally only passed
-                zfcp_erp_action as context argument to
-                zfcp_erp_timeout_handler().
-                Note: The lifetime of the corresponding zfcp_fsf_req object
-                continues until a (late) response or an (unrelated) adapter
-                recovery.
+                of recovery are very slow and can run into the ERP timeout.
+                In the case at hand, we did adapter recovery to a large
+                degree. However due to the slowness a LUN open is pending so
+                the corresponding fc_rport remains blocked. After
+                fast_io_fail_tmo we trigger close physical port recovery for
+                the port under which the LUN should have been opened. The
+                new higher order port recovery dismisses the pending LUN
+                open ERP action and dismisses the pending LUN open FSF
+                request. Such dismissal decouples the ERP action from the
+                pending corresponding FSF request by setting
+                zfcp_fsf_req->erp_action to NULL (among other things)
+                [zfcp_erp_strategy_check_fsfreq()].
+                If now the ERP timeout for the pending open LUN request runs
+                out, we must not use zfcp_fsf_req->erp_action in the ERP
+                timeout handler. This is a problem since v4.15 commit
+                75492a51568b ("s390/scsi: Convert timers to use
+                timer_setup()"). Before that we intentionally only passed
+                zfcp_erp_action as context argument to
+                zfcp_erp_timeout_handler().
+                Note: The lifetime of the corresponding zfcp_fsf_req object
+                continues until a (late) response or an (unrelated) adapter
+                recovery.
  Solution:      Just like the regular response path ignores dismissed
-                requests [zfcp_fsf_req_complete() =>
-                zfcp_fsf_protstatus_eval() => return early] the ERP timeout
-                handler now needs to ignore dismissed requests. So simply
-                return early in the ERP timeout handler if the FSF request
-                is marked as dismissed in its status flags. To protect
-                against the race where zfcp_erp_strategy_check_fsfreq()
-                dismisses and sets zfcp_fsf_req->erp_action to NULL after
-                our previous status flag check, return early if
-                zfcp_fsf_req->erp_action is NULL. After all, the former ERP
-                action does not need to be woken up as that was already done
-                as part of the dismissal above [zfcp_erp_action_dismiss()].
+                requests [zfcp_fsf_req_complete() =>
+                zfcp_fsf_protstatus_eval() => return early] the ERP timeout
+                handler now needs to ignore dismissed requests. So simply
+                return early in the ERP timeout handler if the FSF request
+                is marked as dismissed in its status flags. To protect
+                against the race where zfcp_erp_strategy_check_fsfreq()
+                dismisses and sets zfcp_fsf_req->erp_action to NULL after
+                our previous status flag check, return early if
+                zfcp_fsf_req->erp_action is NULL. After all, the former ERP
+                action does not need to be woken up as that was already done
+                as part of the dismissal above [zfcp_erp_action_dismiss()].
  
  Upstream-ID:   936e6b85da0476dd2edac7c51c68072da9fb4ba2 -> kernel 5.8
  
  Will be integrated with kernel 5.8 in groovy.
  
  Please check that this is also integrated into 20.04.
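
The early-return logic described under "Solution" can be sketched as a small userspace C model. This is only an illustration under stated assumptions, not the upstream patch: the struct layouts, flag values, and the names fsf_req, erp_action, erp_notify(), erp_timeout_handler(), dismiss() and demo() are simplified, hypothetical stand-ins for the zfcp objects and functions named above. The model is single-threaded, so it cannot reproduce the race itself; in kernel code the pointer would presumably be loaded exactly once (for example with READ_ONCE()) before the NULL check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; the kernel's real constants differ. */
#define FSFREQ_DISMISSED 0x1u
#define ERP_TIMEDOUT     0x2u

/* Simplified stand-ins for zfcp_erp_action and zfcp_fsf_req. */
struct erp_action {
	unsigned int status;
};

struct fsf_req {
	unsigned int status;
	struct erp_action *erp_action;	/* set to NULL on dismissal */
};

static int notify_count;	/* counts wake-ups, for demonstration only */

/* Stand-in for zfcp_erp_notify(): dereferences the action. */
static void erp_notify(struct erp_action *act, unsigned int set_mask)
{
	act->status |= set_mask;
	notify_count++;
}

/* Stand-in for the dismissal: mark the request and decouple the action. */
static void dismiss(struct fsf_req *req)
{
	req->status |= FSFREQ_DISMISSED;
	req->erp_action = NULL;
}

/* Timeout handler that ignores dismissed requests. */
static void erp_timeout_handler(struct fsf_req *req)
{
	struct erp_action *act;

	/* First guard: request already marked as dismissed. */
	if (req->status & FSFREQ_DISMISSED)
		return;

	/* Second guard: load the pointer once into a local, then check it,
	 * covering a dismissal that happens after the flag check above. */
	act = req->erp_action;
	if (!act)
		return;

	erp_notify(act, ERP_TIMEDOUT);
}

/* Small self-check: a live request is notified, a dismissed one is not. */
static void demo(void)
{
	struct erp_action act = { 0 };
	struct fsf_req req = { 0, &act };

	notify_count = 0;
	erp_timeout_handler(&req);
	assert(notify_count == 1);

	dismiss(&req);
	erp_timeout_handler(&req);	/* must return early, no NULL dereference */
	assert(notify_count == 1);
}
```

Without the two guards, the second call to erp_timeout_handler() would pass a NULL pointer to erp_notify(), which corresponds to the page fault described under "Symptom".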

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1887774

Title:
  [UBUNTU 20.04] zfcp: Fix panic on ERP timeout for previously dismissed
  ERP

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1887774/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
