[PATCH v2] scsi_debug: add cmd abort option to every_nth
This patch is motivated by a response in the thread: Re: [PATCH 0/5]stop normal completion path entering a timeout req by Jianchao Wang . It generalizes the error injection of blk_abort_request() to use scsi_debug's "every_nth" mechanism. Ref with original patch to scsi_debug: https://lore.kernel.org/lkml/a68ad043-26a1-d3d8-2009-504ba4230...@oracle.com/ Also convert two vmalloc/memset(0) to vzalloc() calls. Signed-off-by: Douglas Gilbert --- Changes since original version: - thanks to changes/fixes suggested by Bart Van Assche, mainly to avoid calling scsi_done() in worker in the abort case, this version works The scsi mid-level call scsi_debug_abort but at the time this driver has already removed the command context so the abort is ignored. One small wrinkle: ... sd 0:0:0:0: scsi_debug: tag=0x4d, cmd 28 00 00 00 27 c0 00 00 10 00 scsi_debug: tag=0xa7, cmd 28 00 00 00 27 d0 00 00 10 00 sd 0:0:0:0: abort request tag 167 scsi_debug:sdebug_q_cmd_complete: bypassing scsi_done() due to aborted cmd scsi_debug_abort: command not found scsi_debug: tag=0xa7, cmd 28 00 00 00 27 e0 00 00 10 00 scsi_debug: tag=0xaa, cmd 28 00 00 00 27 f0 00 00 10 00 ... My ddpt utility is sending those READs to /dev/sg0 using ioctl(SG_IO_v3) and is not informed that the READ on LBA 0x27d0 has failed. The patch is against MKP's 4.19/scsi-queue branch. drivers/scsi/scsi_debug.c | 47 ++- 1 file changed, 32 insertions(+), 15 deletions(-) diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c index 4dda192b8fba..2c816ecb0a92 100644 --- a/drivers/scsi/scsi_debug.c +++ b/drivers/scsi/scsi_debug.c @@ -164,29 +164,29 @@ static const char *sdebug_version_date = "20180128"; #define SDEBUG_OPT_RESET_NOISE 0x2000 #define SDEBUG_OPT_NO_CDB_NOISE0x4000 #define SDEBUG_OPT_HOST_BUSY 0x8000 +#define SDEBUG_OPT_CMD_ABORT 0x1 #define SDEBUG_OPT_ALL_NOISE (SDEBUG_OPT_NOISE | SDEBUG_OPT_Q_NOISE | \ SDEBUG_OPT_RESET_NOISE) #define SDEBUG_OPT_ALL_INJECTING (SDEBUG_OPT_RECOVERED_ERR | \ SDEBUG_OPT_TRANSPORT_ERR | \ SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR | \ SDEBUG_OPT_SHORT_TRANSFER | \ - SDEBUG_OPT_HOST_BUSY) + SDEBUG_OPT_HOST_BUSY | \ + SDEBUG_OPT_CMD_ABORT) /* When "every_nth" > 0 then modulo "every_nth" commands: * - a missing response is simulated if SDEBUG_OPT_TIMEOUT is set * - a RECOVERED_ERROR is simulated on successful read and write * commands if SDEBUG_OPT_RECOVERED_ERR is set. * - a TRANSPORT_ERROR is simulated on successful read and write * commands if SDEBUG_OPT_TRANSPORT_ERR is set. + * - similarly for DIF_ERR, DIX_ERR, SHORT_TRANSFER, HOST_BUSY and + * CMD_ABORT * - * When "every_nth" < 0 then after "- every_nth" commands: - * - a missing response is simulated if SDEBUG_OPT_TIMEOUT is set - * - a RECOVERED_ERROR is simulated on successful read and write - * commands if SDEBUG_OPT_RECOVERED_ERR is set. - * - a TRANSPORT_ERROR is simulated on successful read and write - * commands if _DEBUG_OPT_TRANSPORT_ERR is set. - * This will continue on every subsequent command until some other action - * occurs (e.g. the user * writing a new value (other than -1 or 1) to - * every_nth via sysfs). + * When "every_nth" < 0 then after "- every_nth" commands the selected + * error will be injected. The error will be injected on every subsequent + * command until some other action occurs; for example, the user writing + * a new value (other than -1 or 1) to every_nth: + * echo 0 > /sys/bus/pseudo/drivers/scsi_debug/every_nth */ /* As indicated in SAM-5 and SPC-4 Unit Attentions (UAs) are returned in @@ -281,6 +281,7 @@ struct sdebug_defer { int issuing_cpu; bool init_hrt; bool init_wq; + bool aborted; /* true when blk_abort_request() already called */ enum sdeb_defer_type defer_t; }; @@ -296,6 +297,7 @@ struct sdebug_queued_cmd { unsigned int inj_dix:1; unsigned int inj_short:1; unsigned int inj_host_busy:1; + unsigned int inj_cmd_abort:1; }; struct sdebug_queue { @@ -3792,6 +3794,7 @@ static struct sdebug_queue *get_queue(struct scsi_cmnd *cmnd) /* Queued (deferred) command completions converge here. */ static void sdebug_q_cmd_complete(struct sdebug_defer *sd_dp) { + bool aborted = sd_dp->aborted; int qc_idx; int retiring = 0; unsigned long iflags; @@ -3801,6 +3804,8 @@ static void sdebug_q_cmd_complete(struct sdebug_defer *sd_dp) struct sdebug_dev_info *devip; sd_dp->defer_t = SDEB_DEFER_NONE; + if (unlikely(aborted)) + sd_dp->aborted = false; qc_idx = sd_dp->qc_idx; sqp = sdebug_q_arr
Re: [PATCH] tcmu: fix crash for dereferencing the released udev->mb_addr memory
On 2018/7/20 23:13, Mike Christie wrote: On 07/19/2018 07:34 PM, Xiubo Li wrote: On 2018/7/19 23:49, Mike Christie wrote: On 07/19/2018 09:30 AM, xiu...@redhat.com wrote: From: Xiubo Li The logs are: BUG: unable to handle kernel NULL pointer dereference at 0040 IP: [] tcmu_reset_ring_store+0x149/0x240 [target_core_user] PGD 8e254067 PUD e255067 PMD 0 Oops: 0002 [#1] SMP [...] CPU: 0 PID: 36077 Comm: tcmu-runner Kdump: loaded Not tainted 3.10.0-924.el7.test.x86_64 #1 It is not clear to me how you hit this. It's not a RHEL only bug is it? Are you sure you are hitting it during device removal? Yes, not only the RHEL bug, for the RHEL we have sync the related code to the upstream. Before as I remembered I have hit one sudden crash by using the SCSI repo code when testing the tcmu-runner PR#402: tcmu-runner: add reset netlink support. The SCSI repo is : git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git Since the kdump service was not working for the upstream code in my setups that time, so could only get very simple logs. This time I was using Prasanna's script and RHEL code with the netlink and UIO crash fixing patches to reproduce this. The test script please see the attachment. If we are calling tcmu_reset_ring_store isn't there a refcount on the item? So we cannot call tcmu_destroy_device -> tcmu_dev_kref_release until after that has been released, so they would never run at the same time. And, if userspace were to try to open/write to that reset file after the rmdir has been done on the se_device then configfs detects this and fails the open. With the tcmu-runner PR#402, when starting tcmu-runner the daemon will do : 1, block netlink 2, reset netlink events 3, block device --> reset ring --> open device --> unblock device If the device was only created and not yet enabled you will hit the crash here. If in your test you kill runner while a enable is in process then this would happen. As you mentioned it is possible only created then runner is killed. 4, unblock netlink Currently there may has one bug in systemd service, because we hit one odd issue by using the test script: When executing 'systemctl restart tcmu-runner', this process seems waiting one signal from somewhere by stucking itself, if we do nothing It will wait for about 6 ~20 minutes(the time comes from our test result) to recovery from that. And we can execute 'systemctl restart gluster-blockd' in another terminal, which could stop the stuck state above immediately. [root@localhost ~]# ps -aux | grep -e tcmu -e target root 2021 0.0 0.0 0 0 ?S< 15:09 0:00 [target_completi] root 2386 0.6 0.5 226232 11096 ?Ds 15:11 0:00 /usr/bin/python /usr/bin/targetctl clear root 2400 0.0 0.0 134816 1316 pts/0S+ 15:11 0:00 systemctl restart tcmu-runnerroot 2415 0.0 0.0 112704 988 pts/1R+ 15:11 0:00 grep --color=auto -e tcmu -e target [root@localhost ~]# cat /proc/2400/stack [] poll_schedule_timeout+0x55/0xb0 [] do_sys_poll+0x48d/0x590 [] SyS_ppoll+0x1d3/0x1f0 [] system_call_fastpath+0x1c/0x21 And with IOs going on, when killing the tcmu-runner process, the Are you just doing a test to kill runner at various times or is your restart of your service doing this while the targetcli clear operation is running? There are 3 related services, and here '-->' means will Bindto and depend to gluster-blockd.service --> gluster-block-target.service --> tcmu-runner service I was running the script as I have attached in last mail and in another terminal running the 'systemctl restart gluster-blockd' cmd when this crash is hit. Because as I mentioned without the 'systemctl restart gluster-blockd' in another terminal, the script will be stuck in 'systemctl restart tcmu-runner' for a very long time. When resarting the gluster-blockd, it will start the tcmu-runner and gluster-block-target first. And it is possible that the targetcli clear operation is stuck somewhere then the runner is restarted. If you re killing it at various times, how do you know you are not killing it during a enable operation and hitting this bug there? It is very possible. If you are killing on purpose as part of your restart operation why are you killing it when the targetcli clear operation needs the backends to complete their IO as part of the tpg disablement process. Just after runner service is killed then the gluster-block-target service will be killed too, and it will invoke the 'targetctl clear' when stopping. We are testing the netlink reset and the ring reset feature, so the script is killing the runner daemon to emulate the D state cases. And the script also emulating the crash case of the tcmu-runner daemon for some reasons. 'targetctl clear' process will stuck in [1] first. Then we restart the tcmu-runner, which will reset ring with accessing the udev->mb_addr, currently we still could access the /
Re: [PATCH] scsi_debug: add cmd abort option to every_nth
On Fri, 2018-07-20 at 15:21 -0400, Douglas Gilbert wrote: > /* Complete the processing of the thread that queued a SCSI command to this > @@ -4459,6 +4462,11 @@ static int schedule_resp(struct scsi_cmnd *cmnd, > struct sdebug_dev_info *devip, > sd_dp->issuing_cpu = raw_smp_processor_id(); > sd_dp->defer_t = SDEB_DEFER_WQ; > schedule_work(&sd_dp->ew.work); > + if (unlikely(sqcp->inj_cmd_abort)) { > + blk_abort_request(cmnd->request); > + sdev_printk(KERN_INFO, sdp, "abort request tag %d\n", > + cmnd->request->tag); > + } > } > if (unlikely((SDEBUG_OPT_Q_NOISE & sdebug_opts) && >(scsi_result == device_qfull_result))) Should the sdev_printk() call occur before the blk_abort_request() call to avoid that the sdev_printk() call triggers a use-after-free? Does the above change cause schedule_resp() to call both blk_abort_request() and scsi_done()? I think that's wrong. A SCSI driver should call one of these two functions but not both. Thanks, Bart.
Re: [PATCH 02/15] target: fix isid copying and comparision
On Fri, 2018-07-20 at 16:08 -0500, Mike Christie wrote: > Hey Bart and Christoph, > > Bart, I noticed we basically had what you are requesting but Christoph > had moved the id code from the fabric drivers to lio core in this commit: > > commit 2650d71e244fb3637b5f58a0080682a8bf9c7091 > Author: Christoph Hellwig > Date: Fri May 1 17:47:58 2015 +0200 > > target: move transport ID handling to the core Hello Mike, I'm not in favor of reverting Christoph's patch because that patch simplified the target code significantly. On the other hand, it's inconvenient that with the current approach there is some code and knowledge in the target core that should be in target drivers. I think that the code for parsing the initiator name in srp_get_pr_transport_id() should be in the SRP target driver instead of the core. When I added support for a new initiator name format in the SRP target driver I overlooked that I had to update srp_get_pr_transport_id() because that function is in the core instead of ib_srpt.c. See also commit 2dc98f09f9e6 ("IB/srpt: Use the source GUID as session name"). Christoph, do you want me to add support for the new ib_srpt initiator name format in drivers/target/target_core_fabric_lib.c or should I find a way to move the code for parsing the initiator name into ib_srpt.c? Thanks, Bart.
Re: [PATCH 02/15] target: fix isid copying and comparision
On 07/20/2018 04:08 PM, Mike Christie wrote: > On 07/19/2018 11:13 AM, Mike Christie wrote: >> > On 07/19/2018 10:15 AM, Bart Van Assche wrote: >>> >> On Wed, 2018-07-18 at 20:02 -0500, Mike Christie wrote: >>> On 07/18/2018 07:03 PM, Mike Christie wrote: > On 07/18/2018 05:09 PM, Bart Van Assche wrote: >> > [ ... ] >> > is that these involve a transport ID and that that transport ID >> > can be up to 228 >> > bytes long for iSCSI. > > I am talking about the Initiator Session ID above. That along with > the > iscsi name make up the Initiator Port Transport ID. In spc4r37 > checkout > table 508 or in SAM 5r21 checkout table A.4. > > So in the SCSI specs as part of the Initiator Port Transport ID we > have > this from that SAM table: > > The Initiator Session Identifier (ISID) portion of the string is a > UTF-8 > encoded hexadecimal representation of a six byte binary value. > > --- > > In the PR parts of SPC it sometimes mentions only "Transport ID" but > then clarifies the initiator port so I am assuming in those cases it > means "Initiator Port Transport ID" so it is both the name and isid > for > iscsi. >>> >>> It looks like we are supposed to go by what the initiator specifies in >>> the TPID field, so it can be either the Transport ID or Initiator Port >>> Transport ID. >>> >> >>> >> Hello Mike, >>> >> >>> >> Since the ISID is iSCSI-specific I think that all code that knows about >>> >> the >>> >> ISID and its encoding should be in the iSCSI target driver instead of the >>> >> target core. Do you think an approach similar to that of the SCST >>> >> function >>> >> iscsi_get_initiator_port_transport_id() can be implemented in LIO? The >>> >> caller >> > >> > Yeah, I can make that work. > Hey Bart and Christoph, > > Bart, I noticed we basically had what you are requesting but Christoph > had moved the id code from the fabric drivers to lio core in this commit: > > commit 2650d71e244fb3637b5f58a0080682a8bf9c7091 > Author: Christoph Hellwig > Date: Fri May 1 17:47:58 2015 +0200 > > target: move transport ID handling to the core > > > So what do you guys want to do here? Revert Christoph's patch and go > back to that style or have lio core do the transport ID processing? Oh yeah for the latter, we can make it so at least target_core_pr.c does not have to do any isid conversions. We could just rename sess_bin_isid to sess_str_isid, store the isid as a string, and then never convert it between the binary and string format. All the isid tests in target_core_pr.c would be strcmps.
Re: [PATCH 02/15] target: fix isid copying and comparision
On 07/19/2018 11:13 AM, Mike Christie wrote: > On 07/19/2018 10:15 AM, Bart Van Assche wrote: >> On Wed, 2018-07-18 at 20:02 -0500, Mike Christie wrote: >>> On 07/18/2018 07:03 PM, Mike Christie wrote: On 07/18/2018 05:09 PM, Bart Van Assche wrote: > [ ... ] > is that these involve a transport ID and that that transport ID can be up > to 228 > bytes long for iSCSI. I am talking about the Initiator Session ID above. That along with the iscsi name make up the Initiator Port Transport ID. In spc4r37 checkout table 508 or in SAM 5r21 checkout table A.4. So in the SCSI specs as part of the Initiator Port Transport ID we have this from that SAM table: The Initiator Session Identifier (ISID) portion of the string is a UTF-8 encoded hexadecimal representation of a six byte binary value. --- In the PR parts of SPC it sometimes mentions only "Transport ID" but then clarifies the initiator port so I am assuming in those cases it means "Initiator Port Transport ID" so it is both the name and isid for iscsi. >>> >>> It looks like we are supposed to go by what the initiator specifies in >>> the TPID field, so it can be either the Transport ID or Initiator Port >>> Transport ID. >> >> Hello Mike, >> >> Since the ISID is iSCSI-specific I think that all code that knows about the >> ISID and its encoding should be in the iSCSI target driver instead of the >> target core. Do you think an approach similar to that of the SCST function >> iscsi_get_initiator_port_transport_id() can be implemented in LIO? The caller > > Yeah, I can make that work. Hey Bart and Christoph, Bart, I noticed we basically had what you are requesting but Christoph had moved the id code from the fabric drivers to lio core in this commit: commit 2650d71e244fb3637b5f58a0080682a8bf9c7091 Author: Christoph Hellwig Date: Fri May 1 17:47:58 2015 +0200 target: move transport ID handling to the core So what do you guys want to do here? Revert Christoph's patch and go back to that style or have lio core do the transport ID processing? > >> of that function is in source file scst/src/scst_targ.c: >> >> res = sess->tgt->tgtt->get_initiator_port_transport_id( >> sess->tgt, sess, &sess->transport_id); >> >> Thanks, >> >> Bart.-- >> To unsubscribe from this list: send the line "unsubscribe target-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > > -- > To unsubscribe from this list: send the line "unsubscribe target-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
we provide editing
Hi, We provide image editing services like - photo cut out; photo clipping path; photo masking; photo shadow creation; photo color correction; photo retouching; beauty model retouching on skin, face, body; glamour retouching; products retouching and other image editing. We are also offering to deliver testing for you, so that you get to know our quality first hand. If you want to explore further, please reply back. Thanks and Regards, Scott
[PATCH] scsi_debug: add cmd abort option to every_nth
This patch is motivated by a response in the thread: Re: [PATCH 0/5]stop normal completion path entering a timeout req by Jianchao Wang . It generalizes the error injection of blk_abort_request() to use scsi_debug's "every_nth" mechanism. Ref with original patch to scsi_debug: https://lore.kernel.org/lkml/a68ad043-26a1-d3d8-2009-504ba4230...@oracle.com/ Signed-off-by: Douglas Gilbert --- This patch may not be correct, or sufficient (nothing else in SCSI subsystem calls blk_abort_request() ) and it freezes my laptop with lk 2.18.0-rc1 kernel based on MKP's repository. However that crash may be the point of the original patch. Needs positive feedback from knowledgeable parties. drivers/scsi/scsi_debug.c | 30 +++--- 1 file changed, 19 insertions(+), 11 deletions(-) diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c index 4dda192b8fba..9eda71138744 100644 --- a/drivers/scsi/scsi_debug.c +++ b/drivers/scsi/scsi_debug.c @@ -164,29 +164,29 @@ static const char *sdebug_version_date = "20180128"; #define SDEBUG_OPT_RESET_NOISE 0x2000 #define SDEBUG_OPT_NO_CDB_NOISE0x4000 #define SDEBUG_OPT_HOST_BUSY 0x8000 +#define SDEBUG_OPT_CMD_ABORT 0x1 #define SDEBUG_OPT_ALL_NOISE (SDEBUG_OPT_NOISE | SDEBUG_OPT_Q_NOISE | \ SDEBUG_OPT_RESET_NOISE) #define SDEBUG_OPT_ALL_INJECTING (SDEBUG_OPT_RECOVERED_ERR | \ SDEBUG_OPT_TRANSPORT_ERR | \ SDEBUG_OPT_DIF_ERR | SDEBUG_OPT_DIX_ERR | \ SDEBUG_OPT_SHORT_TRANSFER | \ - SDEBUG_OPT_HOST_BUSY) + SDEBUG_OPT_HOST_BUSY | \ + SDEBUG_OPT_CMD_ABORT) /* When "every_nth" > 0 then modulo "every_nth" commands: * - a missing response is simulated if SDEBUG_OPT_TIMEOUT is set * - a RECOVERED_ERROR is simulated on successful read and write * commands if SDEBUG_OPT_RECOVERED_ERR is set. * - a TRANSPORT_ERROR is simulated on successful read and write * commands if SDEBUG_OPT_TRANSPORT_ERR is set. + * - similarly for DIF_ERR, DIX_ERR, SHORT_TRANSFER, HOST_BUSY and + * CMD_ABORT * - * When "every_nth" < 0 then after "- every_nth" commands: - * - a missing response is simulated if SDEBUG_OPT_TIMEOUT is set - * - a RECOVERED_ERROR is simulated on successful read and write - * commands if SDEBUG_OPT_RECOVERED_ERR is set. - * - a TRANSPORT_ERROR is simulated on successful read and write - * commands if _DEBUG_OPT_TRANSPORT_ERR is set. - * This will continue on every subsequent command until some other action - * occurs (e.g. the user * writing a new value (other than -1 or 1) to - * every_nth via sysfs). + * When "every_nth" < 0 then after "- every_nth" commands the selected + * error will be injected. The error will be injected on every subsequent + * command until some other action occurs; for example, the user writing + * a new value (other than -1 or 1) to every_nth: + * echo 0 > /sys/bus/pseudo/drivers/scsi_debug/every_nth */ /* As indicated in SAM-5 and SPC-4 Unit Attentions (UAs) are returned in @@ -296,6 +296,7 @@ struct sdebug_queued_cmd { unsigned int inj_dix:1; unsigned int inj_short:1; unsigned int inj_host_busy:1; + unsigned int inj_cmd_abort:1; }; struct sdebug_queue { @@ -4312,7 +4313,8 @@ static void setup_inject(struct sdebug_queue *sqp, if (sdebug_every_nth > 0) sqcp->inj_recovered = sqcp->inj_transport = sqcp->inj_dif - = sqcp->inj_dix = sqcp->inj_short = 0; + = sqcp->inj_dix = sqcp->inj_short + = sqcp->inj_host_busy = sqcp->inj_cmd_abort = 0; return; } sqcp->inj_recovered = !!(SDEBUG_OPT_RECOVERED_ERR & sdebug_opts); @@ -4321,6 +4323,7 @@ static void setup_inject(struct sdebug_queue *sqp, sqcp->inj_dix = !!(SDEBUG_OPT_DIX_ERR & sdebug_opts); sqcp->inj_short = !!(SDEBUG_OPT_SHORT_TRANSFER & sdebug_opts); sqcp->inj_host_busy = !!(SDEBUG_OPT_HOST_BUSY & sdebug_opts); + sqcp->inj_cmd_abort = !!(SDEBUG_OPT_CMD_ABORT & sdebug_opts); } /* Complete the processing of the thread that queued a SCSI command to this @@ -4459,6 +4462,11 @@ static int schedule_resp(struct scsi_cmnd *cmnd, struct sdebug_dev_info *devip, sd_dp->issuing_cpu = raw_smp_processor_id(); sd_dp->defer_t = SDEB_DEFER_WQ; schedule_work(&sd_dp->ew.work); + if (unlikely(sqcp->inj_cmd_abort)) { + blk_abort_request(cmnd->request); + sdev_printk(KERN_INFO, sdp, "abort request tag %d\n", + cmnd->request->tag); + } } if (unl
we provide editing
Hi, We provide image editing services like - photo cut out; photo clipping path; photo masking; photo shadow creation; photo color correction; photo retouching; beauty model retouching on skin, face, body; glamour retouching; products retouching and other image editing. We are also offering to deliver testing for you, so that you get to know our quality first hand. If you want to explore further, please reply back. Thanks and Regards, Scott
Re: [PATCH] tcmu: fix crash for dereferencing the released udev->mb_addr memory
On 07/19/2018 07:34 PM, Xiubo Li wrote: > On 2018/7/19 23:49, Mike Christie wrote: >> On 07/19/2018 09:30 AM, xiu...@redhat.com wrote: >>> From: Xiubo Li >>> >>> The logs are: >>> >>> BUG: unable to handle kernel NULL pointer dereference at 0040 >>> IP: [] tcmu_reset_ring_store+0x149/0x240 >>> [target_core_user] >>> PGD 8e254067 PUD e255067 PMD 0 >>> Oops: 0002 [#1] SMP >>> [...] >>> CPU: 0 PID: 36077 Comm: tcmu-runner Kdump: loaded Not tainted >>> 3.10.0-924.el7.test.x86_64 #1 >> >> It is not clear to me how you hit this. It's not a RHEL only bug is it? >> Are you sure you are hitting it during device removal? > Yes, not only the RHEL bug, for the RHEL we have sync the related code > to the upstream. Before as I remembered I have hit one sudden crash by > using the SCSI repo code when testing the tcmu-runner PR#402: > tcmu-runner: add reset netlink support. > > The SCSI repo is : > git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git > > Since the kdump service was not working for the upstream code in my > setups that time, so could only get very simple logs. This time I was > using Prasanna's script and RHEL code with the netlink and UIO crash > fixing patches to reproduce this. The test script please see the attachment. > >> If we are calling tcmu_reset_ring_store isn't there a refcount on the >> item? So we cannot call tcmu_destroy_device -> tcmu_dev_kref_release >> until after that has been released, so they would never run at the same >> time. And, if userspace were to try to open/write to that reset file >> after the rmdir has been done on the se_device then configfs detects >> this and fails the open. > With the tcmu-runner PR#402, when starting tcmu-runner the daemon will do : > 1, block netlink > 2, reset netlink events > 3, block device --> reset ring --> open device --> unblock device If the device was only created and not yet enabled you will hit the crash here. If in your test you kill runner while a enable is in process then this would happen. > 4, unblock netlink > > > Currently there may has one bug in systemd service, because we hit one > odd issue by using the test script: > When executing 'systemctl restart tcmu-runner', this process seems > waiting one signal from somewhere by stucking itself, if we do nothing > It will wait for about 6 ~20 minutes(the time comes from our test > result) to recovery from that. And we can execute 'systemctl restart > gluster-blockd' in another terminal, which could stop the stuck state > above immediately. > > [root@localhost ~]# ps -aux | grep -e tcmu -e > target > root 2021 0.0 0.0 0 0 ?S< 15:09 0:00 > [target_completi] > root 2386 0.6 0.5 226232 11096 ?Ds 15:11 0:00 > /usr/bin/python /usr/bin/targetctl clear > root 2400 0.0 0.0 134816 1316 pts/0S+ 15:11 0:00 > systemctl restart tcmu-runnerroot 2415 > 0.0 0.0 112704 988 pts/1R+ 15:11 0:00 grep --color=auto -e > tcmu -e target > [root@localhost ~]# cat > /proc/2400/stack > > > [] > poll_schedule_timeout+0x55/0xb0 > > > [] > do_sys_poll+0x48d/0x590 > > > [] > SyS_ppoll+0x1d3/0x1f0 > > > [] > system_call_fastpath+0x1c/0x21 > > > > And with IOs going on, when killing the tcmu-runner process, the Are you just doing a test to kill runner at various times or is your restart of your service doing this while the targetcli clear operation is running? If you re killing it at various times, how do you know you are not killing it during a enable operation and hitting this bug there? If you are killing on purpose as part of your restart operation why are you killing it when the targetcli clear operation needs the backends to complete their IO as part of the tpg disablement process. > 'targetctl clear' process will stuck in [1] first. Then we restart the > tcmu-runner, which will reset ring with accessing the udev->mb_addr, > currently we still could access the > /sys/kernel/config/target/core/user_XX/*. And while the ring reset is > still going on, the stucked 'targetctl clear' process will be woken up > and then tries to delete the device which will kfree(udev->mb_addr), > here any accessing to /sys/kernel/config/target/core/user_XX/* will stuck. The rtslib clear operation deletes the target first and that disables the tpg which waits for outstanding commands. It then finishes tearing down the objects under the target, and then the backend devices are deleted. If targetcli clear is unhung from the
Re: [PATCH 11/15] target: export initiator port values for all sessions
On Thu, Jul 19, 2018 at 04:30:36PM -0500, Mike Christie wrote: > Just to clarify. We can create a dir from the kernel already. It is no > problem. I am doing that in this patchset with configfs_register_group. > > What Bart was requesting originally and what is missing is being able to > add a symlink from the kernel. > > I have not fully looked into it, but I think it would be something like > taking part of configfs_symlink and making it so we can call it with > config_items for the 2 items to be symlinked. Yes, it shouldn't be too hard.
[PATCH] mpt3sas: correct reset of smid while clearing scsi tracker
In mpt3sas_base_clear_st() function smid value is reseted in wrong line, i.e. driver should reset smid value to zero after decrementing chain_offset counter in chain_lookup table but in current code, driver is resetting smid value before decrementing the chain_offset counter. which we are correcting with this patch. v1 changelog: Added memory barriers before & after atomic_set() API. Signed-off-by: Sreekanth Reddy --- drivers/scsi/mpt3sas/mpt3sas_base.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c b/drivers/scsi/mpt3sas/mpt3sas_base.c index 902610d..94359d8 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_base.c +++ b/drivers/scsi/mpt3sas/mpt3sas_base.c @@ -1702,6 +1702,8 @@ static int mpt3sas_remove_dead_ioc_func(void *arg) return NULL; chain_req = &ioc->chain_lookup[smid - 1].chains_per_smid[chain_offset]; + /* Adding memory barrier before atomic operation. */ + smp_mb__before_atomic(); atomic_inc(&ioc->chain_lookup[smid - 1].chain_offset); return chain_req; } @@ -3283,8 +3285,12 @@ void mpt3sas_base_clear_st(struct MPT3SAS_ADAPTER *ioc, return; st->cb_idx = 0xFF; st->direct_io = 0; - st->smid = 0; + /* Adding memory barrier before atomic operation. */ + smp_mb__before_atomic(); atomic_set(&ioc->chain_lookup[st->smid - 1].chain_offset, 0); + /* Adding memory barrier after atomic operation. */ + smp_mb__after_atomic(); + st->smid = 0; } /** -- 1.8.3.1