Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
On Fri, Feb 12, 2016 at 12:30 AM, Himanshu Madhani wrote:
> Hi Nic,
>
> On 2/11/16, 3:47 PM, "Nicholas A. Bellinger" wrote:
>
> >On Wed, 2016-02-10 at 22:53 -0800, Nicholas A. Bellinger wrote:
> >> On Tue, 2016-02-09 at 18:03 +, Himanshu Madhani wrote:
> >> > On 2/8/16, 9:25 PM, "Nicholas A. Bellinger" wrote:
> >> > >On Mon, 2016-02-08 at 23:27 +, Himanshu Madhani wrote:
> >> > >>
> >> > >> I am testing this series with a 4.5.0-rc2+ kernel and I am
> >> > >> seeing an issue where triggering sg_reset with the
> >> > >> host/device/bus options in a loop at a 120-second interval
> >> > >> causes a call stack. At this point removing the configuration
> >> > >> hangs indefinitely. See the attached dmesg output from my
> >> > >> setup.
> >> > >>
> >> > >
> >> > >Thanks a lot for testing this.
> >> > >
> >> > >So it looks like we're still hitting an indefinite schedule() on
> >> > >se_cmd->cmd_wait_comp once tcm_qla2xxx session
> >> > >disconnect/reconnect occurs, after repeated explicit active I/O
> >> > >remote-port sg_resets.
> >> > >
> >> > >Does this trigger on the first tcm_qla2xxx session reconnect
> >> > >after explicit remote-port sg_reset..? Are session reconnects
> >> > >actively being triggered during the test..?
> >> > >
> >> > >To verify the latter for iscsi-target, I've been using a small
> >> > >patch to trigger session reset from TMR kthread context in order
> >> > >to simulate the I_T disconnects. Something like that would be
> >> > >useful for verifying with tcm_qla2xxx too.
> >> > >
> >> > >That said, I'll be reproducing with tcm_qla2xxx ports this week,
> >> > >and will enable various debug in a WIP branch for testing.
> >>
> >> Following up here..
> >>
> >> So far using my test setup with ISP2532 ports in P2P + RAMDISK_MCP
> >> and v4.5-rc1, repeated remote-port active I/O LUN_RESET
> >> (sg_reset -d) has been functioning as expected with a
> >> blocksize_range=4k-256k + iodepth=32 fio write-verify style
> >> workload.
> >>
> >> No ->cmd_kref -1 OOPsen or qla2xxx initiator generated ABORT_TASKs
> >> from outstanding target TAS responses, nor fio write-verify
> >> failures to report after 800x remote-port active I/O LUN_RESETS.
> >>
> >> Next step will be to verify explicit tcm_qla2xxx port + module
> >> shutdown after 1K test iterations, and then IBLOCK async
> >> completions <-> NVMe backends with the same case.
> >>
> >
> >After letting this test run overnight up to 7k active I/O remote-port
> >LUN_RESETs, things are still functioning as expected.
> >
> >Also, /etc/init.d/target stop was able to successfully shut down all
> >active sessions and unload tcm_qla2xxx after the test run.
> >
> >So AFAICT, the active I/O remote-port LUN_RESET changes are stable
> >with tcm_qla2xxx ports, separate from the concurrent session
> >disconnect hung task you reported earlier.
> >
> >That said, I'll likely push this series as-is for -rc4, given that Dan
> >has also been able to verify the non-concurrent session disconnect
> >case on his setup generating constant ABORT_TASKs, and it's still
> >surviving both cases for iscsi-target ports.
> >
> >Please give the debug patch from last night a shot, and see if we can
> >determine the se_cmd states when you hit the hung task.
>
> I'll give your latest debug patch a try in a little while.
>
> From the testing that I have done, what is seen is that active IO has
> already been completed and the qla2xxx driver is waiting for commands
> to be completed, and it's waiting indefinitely for cmd_wait_comp.
> So it looks like there is a missing complete call from target_core.
> I've attached our analysis from crash debug on a live system after the
> issue happens.
>
> I can recreate this issue at will within 5 minutes of triggering
> sg_reset with the following steps:
>
> 1. Export 4 RAM disk LUNs on each port of a 2-port adapter. Initiator
>    will see 8 RAM disk targets.
> 2. Start IO with 4K block size and 8 threads with 80% write 20% read
>    and 100% random. (I am using vdbench for generating IO. I can
>    provide setup/config script if needed)
> 3. Start sg_reset for each LUN with first device, bus and host with
>    120s delay. (I've attached my script that I am using for triggering
>    sg_reset)
>
> >
> >Thank you,
> >
> >-nab
> >
>

Has there been any update to this?

Thanks

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
Hi Nic,

On 2/12/16, 11:03 PM, "Nicholas A. Bellinger" wrote:

>Hi Himanshu & Co,
>
>On Fri, 2016-02-12 at 00:48 -0800, Nicholas A. Bellinger wrote:
>> On Fri, 2016-02-12 at 05:30 +, Himanshu Madhani wrote:
>>
>> Thanks for the crash dump output.
>>
>> So it's a t_state = TRANSPORT_WRITE_PENDING descriptor with
>> SAM_STAT_CHECK_CONDITION + cmd_kref.refcount = 0:
>>
>> struct qla_tgt_cmd {
>>   se_cmd = {
>>     scsi_status = 0x2
>>     se_cmd_flags = 0x80090d,
>>
>>     cmd_kref = {
>>       refcount = {
>>         counter = 0x0
>>       }
>>     },
>> }
>>
>> The se_cmd_flags=0x80090d translation to enum se_cmd_flags_table:
>>
>> - SCF_TRANSPORT_TASK_SENSE
>> - SCF_EMULATED_TASK_SENSE
>> - SCF_SCSI_DATA_CDB
>> - SCF_SE_LUN_CMD
>> - SCF_SENT_CHECK_CONDITION
>> - SCF_USE_CPUID
>>
>
>After groking your dump some more:
>
>For SAM_STAT_CHECK_CONDITION with t_state = TRANSPORT_WRITE_PENDING,
>the se_cmd->transport_state = 0x880 bits set are:
>
>- CMD_T_DEV_ACTIVE
>- CMD_T_FABRIC_STOP
>
>and sense buffer = 0x70 00 0b 00 00 00 00 0a 00 00 00 00 29 03 00,
>which is the following from sense_info_table[]:
>
>    [TCM_CHECK_CONDITION_ABORT_CMD] = {
>        .key = ABORTED_COMMAND,
>        .asc = 0x29, /* BUS DEVICE RESET FUNCTION OCCURRED */
>        .ascq = 0x03,
>    },
>
>The descriptor looks like it did make it to tcm_qla2xxx_complete_free()
>-> transport_generic_free_cmd() with both
>qla_tgt_cmd->cmd_sent_to_fw=0, and
>qla_tgt_cmd->write_data_transferred=0 set.
>
>The best I can tell, it looks like tcm_qla2xxx_handle_data_work() ->
>transport_generic_request_failure() w/ TCM_CHECK_CONDITION_ABORT_CMD is
>occurring..
>
>So to confirm, this specific bug was not a result of active I/O
>LUN_RESET w/ CMD_T_ABORTED during session disconnect, or otherwise.
>
>>
>> > I can recreate this issue at will within 5 minutes of triggering
>> > sg_reset with the following steps:
>> >
>> > 1. Export 4 RAM disk LUNs on each port of a 2-port adapter.
>> >    Initiator will see 8 RAM disk targets.
>> > 2. Start IO with 4K block size and 8 threads with 80% write 20%
>> >    read and 100% random. (I am using vdbench for generating IO. I
>> >    can provide setup/config script if needed)
>> > 3. Start sg_reset for each LUN with first device, bus and host with
>> >    120s delay. (I've attached my script that I am using for
>> >    triggering sg_reset)
>> >
>>
>> Thanks, will keep looking and try to reproduce with your script.
>
>So here's my test setup with 3x Intel P3600 NVMe/IBLOCK backends,
>across dual ISP2532 ports:
>
>o- / ....................................................... [...]
>  o- backstores ............................................ [...]
>  | o- fileio ................................. [0 Storage Object]
>  | o- iblock ................................ [3 Storage Objects]
>  | | o- nvme0n1 ........................ [/dev/nvme0n1, in use]
>  | | o- nvme1n1 ........................ [/dev/nvme1n1, in use]
>  | | o- nvme2n1 ........................ [/dev/nvme2n1, in use]
>  | o- pscsi .................................. [0 Storage Object]
>  | o- rd_mcp ................................. [1 Storage Object]
>  |   o- ramdisk ............... [16.0G, ramdisk, not in use]
>  o- qla2xxx ......................................... [2 Targets]
>  | o- 21:00:00:24:ff:48:97:7e ........................ [enabled]
>  | | o- acls ........................................... [1 ACL]
>  | | | o- 21:00:00:24:ff:48:97:7c ............. [3 Mapped LUNs]
>  | | |   o- mapped_lun0 ........................... [lun0 (rw)]
>  | | |   o- mapped_lun1 ........................... [lun1 (rw)]
>  | | |   o- mapped_lun2 ........................... [lun2 (rw)]
>  | | o- luns .......................................... [3 LUNs]
>  | |   o- lun0 .............. [iblock/nvme0n1 (/dev/nvme0n1)]
>  | |   o- lun1 .............. [iblock/nvme1n1 (/dev/nvme1n1)]
>  | |   o- lun2 .............. [iblock/nvme2n1 (/dev/nvme2n1)]
>  | o- 21:00:00:24:ff:48:97:7f ........................ [enabled]
>  |   o- acls ........................................... [1 ACL]
>  |
Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
Hi Himanshu & Co,

On Fri, 2016-02-12 at 00:48 -0800, Nicholas A. Bellinger wrote:
> On Fri, 2016-02-12 at 05:30 +, Himanshu Madhani wrote:
>
> Thanks for the crash dump output.
>
> So it's a t_state = TRANSPORT_WRITE_PENDING descriptor with
> SAM_STAT_CHECK_CONDITION + cmd_kref.refcount = 0:
>
> struct qla_tgt_cmd {
>   se_cmd = {
>     scsi_status = 0x2
>     se_cmd_flags = 0x80090d,
>
>     cmd_kref = {
>       refcount = {
>         counter = 0x0
>       }
>     },
> }
>
> The se_cmd_flags=0x80090d translation to enum se_cmd_flags_table:
>
> - SCF_TRANSPORT_TASK_SENSE
> - SCF_EMULATED_TASK_SENSE
> - SCF_SCSI_DATA_CDB
> - SCF_SE_LUN_CMD
> - SCF_SENT_CHECK_CONDITION
> - SCF_USE_CPUID
>

After groking your dump some more:

For SAM_STAT_CHECK_CONDITION with t_state = TRANSPORT_WRITE_PENDING,
the se_cmd->transport_state = 0x880 bits set are:

- CMD_T_DEV_ACTIVE
- CMD_T_FABRIC_STOP

and sense buffer = 0x70 00 0b 00 00 00 00 0a 00 00 00 00 29 03 00,
which is the following from sense_info_table[]:

    [TCM_CHECK_CONDITION_ABORT_CMD] = {
        .key = ABORTED_COMMAND,
        .asc = 0x29, /* BUS DEVICE RESET FUNCTION OCCURRED */
        .ascq = 0x03,
    },

The descriptor looks like it did make it to tcm_qla2xxx_complete_free()
-> transport_generic_free_cmd() with both qla_tgt_cmd->cmd_sent_to_fw=0,
and qla_tgt_cmd->write_data_transferred=0 set.

The best I can tell, it looks like tcm_qla2xxx_handle_data_work() ->
transport_generic_request_failure() w/ TCM_CHECK_CONDITION_ABORT_CMD is
occurring..

So to confirm, this specific bug was not a result of active I/O
LUN_RESET w/ CMD_T_ABORTED during session disconnect, or otherwise.

> > I can recreate this issue at will within 5 minutes of triggering
> > sg_reset with the following steps:
> >
> > 1. Export 4 RAM disk LUNs on each port of a 2-port adapter.
> >    Initiator will see 8 RAM disk targets.
> > 2. Start IO with 4K block size and 8 threads with 80% write 20%
> >    read and 100% random. (I am using vdbench for generating IO. I
> >    can provide setup/config script if needed)
> > 3. Start sg_reset for each LUN with first device, bus and host with
> >    120s delay. (I've attached my script that I am using for
> >    triggering sg_reset)
> >
>
> Thanks, will keep looking and try to reproduce with your script.

So here's my test setup with 3x Intel P3600 NVMe/IBLOCK backends, across
dual ISP2532 ports:

o- / ........................................................ [...]
  o- backstores ............................................. [...]
  | o- fileio .................................. [0 Storage Object]
  | o- iblock ................................. [3 Storage Objects]
  | | o- nvme0n1 ......................... [/dev/nvme0n1, in use]
  | | o- nvme1n1 ......................... [/dev/nvme1n1, in use]
  | | o- nvme2n1 ......................... [/dev/nvme2n1, in use]
  | o- pscsi ................................... [0 Storage Object]
  | o- rd_mcp .................................. [1 Storage Object]
  |   o- ramdisk ................ [16.0G, ramdisk, not in use]
  o- qla2xxx .......................................... [2 Targets]
  | o- 21:00:00:24:ff:48:97:7e ......................... [enabled]
  | | o- acls ............................................ [1 ACL]
  | | | o- 21:00:00:24:ff:48:97:7c .............. [3 Mapped LUNs]
  | | |   o- mapped_lun0 ............................ [lun0 (rw)]
  | | |   o- mapped_lun1 ............................ [lun1 (rw)]
  | | |   o- mapped_lun2 ............................ [lun2 (rw)]
  | | o- luns ........................................... [3 LUNs]
  | |   o- lun0 ............... [iblock/nvme0n1 (/dev/nvme0n1)]
  | |   o- lun1 ............... [iblock/nvme1n1 (/dev/nvme1n1)]
  | |   o- lun2 ............... [iblock/nvme2n1 (/dev/nvme2n1)]
  | o- 21:00:00:24:ff:48:97:7f ......................... [enabled]
  |   o- acls ............................................ [1 ACL]
  |   | o- 21:00:00:24:ff:48:97:7d .............. [3 Mapped LUNs]
  |   |   o- mapped_lun0 ............................ [lun0 (rw)]
  |   |
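For anyone following along with the sense buffer quoted in this message, here is a minimal sketch of how those bytes break down. It assumes the standard SPC fixed-format sense layout (response code 0x70/0x71, sense key in the low nibble of byte 2, additional length in byte 7, ASC in byte 12, ASCQ in byte 13). The function name decode_fixed_sense is made up for illustration and is not part of LIO or sg3_utils.

```python
# Sketch only: decode a fixed-format SCSI sense buffer.
# Offsets follow the SPC fixed-format sense data layout; the helper
# name is hypothetical.

def decode_fixed_sense(buf):
    """Return the main fields of a fixed-format sense buffer as a dict."""
    if len(buf) < 14:
        raise ValueError("need at least 14 bytes to read ASC/ASCQ")
    response_code = buf[0] & 0x7F
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    return {
        "response_code": response_code,
        "sense_key": buf[2] & 0x0F,      # 0x0b is ABORTED COMMAND
        "additional_length": buf[7],
        "asc": buf[12],
        "ascq": buf[13],
    }

# The bytes exactly as quoted in the message above.
sense = bytes.fromhex("70 00 0b 00 00 00 00 0a 00 00 00 00 29 03 00")
fields = decode_fixed_sense(sense)
print({name: hex(value) for name, value in fields.items()})
```

With the quoted bytes this gives sense key 0x0b and ASC/ASCQ 0x29/0x03, which lines up with the TCM_CHECK_CONDITION_ABORT_CMD entry shown above.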
Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
On Fri, 2016-02-12 at 05:30 +, Himanshu Madhani wrote:
> Hi Nic,
>
> On 2/11/16, 3:47 PM, "Nicholas A. Bellinger" wrote:
>
> >On Wed, 2016-02-10 at 22:53 -0800, Nicholas A. Bellinger wrote:
> >> On Tue, 2016-02-09 at 18:03 +, Himanshu Madhani wrote:
> >> > On 2/8/16, 9:25 PM, "Nicholas A. Bellinger" wrote:
> >> > >On Mon, 2016-02-08 at 23:27 +, Himanshu Madhani wrote:
> >> > >>
> >> > >> I am testing this series with a 4.5.0-rc2+ kernel and I am
> >> > >> seeing an issue where triggering sg_reset with the
> >> > >> host/device/bus options in a loop at a 120-second interval
> >> > >> causes a call stack. At this point removing the configuration
> >> > >> hangs indefinitely. See the attached dmesg output from my
> >> > >> setup.
> >> > >>
> >> > >
> >> > >Thanks a lot for testing this.
> >> > >
> >> > >So it looks like we're still hitting an indefinite schedule() on
> >> > >se_cmd->cmd_wait_comp once tcm_qla2xxx session
> >> > >disconnect/reconnect occurs, after repeated explicit active I/O
> >> > >remote-port sg_resets.
> >> > >
> >> > >Does this trigger on the first tcm_qla2xxx session reconnect
> >> > >after explicit remote-port sg_reset..? Are session reconnects
> >> > >actively being triggered during the test..?
> >> > >
> >> > >To verify the latter for iscsi-target, I've been using a small
> >> > >patch to trigger session reset from TMR kthread context in order
> >> > >to simulate the I_T disconnects. Something like that would be
> >> > >useful for verifying with tcm_qla2xxx too.
> >> > >
> >> > >That said, I'll be reproducing with tcm_qla2xxx ports this week,
> >> > >and will enable various debug in a WIP branch for testing.
> >>
> >> Following up here..
> >>
> >> So far using my test setup with ISP2532 ports in P2P + RAMDISK_MCP
> >> and v4.5-rc1, repeated remote-port active I/O LUN_RESET
> >> (sg_reset -d) has been functioning as expected with a
> >> blocksize_range=4k-256k + iodepth=32 fio write-verify style
> >> workload.
> >>
> >> No ->cmd_kref -1 OOPsen or qla2xxx initiator generated ABORT_TASKs
> >> from outstanding target TAS responses, nor fio write-verify
> >> failures to report after 800x remote-port active I/O LUN_RESETS.
> >>
> >> Next step will be to verify explicit tcm_qla2xxx port + module
> >> shutdown after 1K test iterations, and then IBLOCK async
> >> completions <-> NVMe backends with the same case.
> >>
> >
> >After letting this test run overnight up to 7k active I/O remote-port
> >LUN_RESETs, things are still functioning as expected.
> >
> >Also, /etc/init.d/target stop was able to successfully shut down all
> >active sessions and unload tcm_qla2xxx after the test run.
> >
> >So AFAICT, the active I/O remote-port LUN_RESET changes are stable
> >with tcm_qla2xxx ports, separate from the concurrent session
> >disconnect hung task you reported earlier.
> >
> >That said, I'll likely push this series as-is for -rc4, given that Dan
> >has also been able to verify the non-concurrent session disconnect
> >case on his setup generating constant ABORT_TASKs, and it's still
> >surviving both cases for iscsi-target ports.
> >
> >Please give the debug patch from last night a shot, and see if we can
> >determine the se_cmd states when you hit the hung task.
>
> I'll give your latest debug patch a try in a little while.
>
> From the testing that I have done, what is seen is that active IO has
> already been completed and the qla2xxx driver is waiting for commands
> to be completed, and it's waiting indefinitely for cmd_wait_comp.
> So it looks like there is a missing complete call from target_core.
> I've attached our analysis from crash debug on a live system after the
> issue happens.
>

Thanks for the crash dump output.

So it's a t_state = TRANSPORT_WRITE_PENDING descriptor with
SAM_STAT_CHECK_CONDITION + cmd_kref.refcount = 0:

struct qla_tgt_cmd {
  se_cmd = {
    scsi_status = 0x2
    se_cmd_flags = 0x80090d,

    cmd_kref = {
      refcount = {
        counter = 0x0
      }
    },
}

The se_cmd_flags=0x80090d translation to enum se_cmd_flags_table:

- SCF_TRANSPORT_TASK_SENSE
- SCF_EMULATED_TASK_SENSE
- SCF_SCSI_DATA_CDB
- SCF_SE_LUN_CMD
- SCF_SENT_CHECK_CONDITION
- SCF_USE_CPUID

Also, ->cmd_wait_comp has a zero rlock counter + dead magic:

  cmd_wait_comp = {
    done = 0x0,
    wait = {
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0x0
              }
            },
            magic = 0xdead4ead,
            owner_cpu = 0x,
            owner = 0x,

> I can recreate this issue at will within 5 minutes of triggering
> sg_reset with the following steps:
>
> 1. Export 4 RAM disk LUNs on each port of a 2-port adapter. Initiator
>    will see 8 RAM disk targets.
> 2. Start IO with 4K block size and 8 threads with 80% write 20% read
>    and 100% random. (I am using vdbench for generating IO. I can
>    provide setup/config scr
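As a small aid for reading the se_cmd_flags value in the dump above, the sketch below splits a bitmask into its individual set bits. It deliberately does not hard-code flag names: mapping each bit to a name should be done against enum se_cmd_flags_table in include/target/target_core_base.h of the exact kernel being debugged, since the values can differ between versions. The helper name set_bits is hypothetical.

```python
# Sketch only: list the individual bits set in a flags word so each one
# can be looked up by hand in the kernel headers.

def set_bits(mask):
    """Return the single-bit values that are set in mask, lowest first."""
    bits = []
    bit = 1
    while bit <= mask:
        if mask & bit:
            bits.append(bit)
        bit <<= 1
    return bits

flags = 0x80090D  # se_cmd_flags from the crash dump above
print([hex(b) for b in set_bits(flags)])
```

The value has six bits set, the same count as the six flag names listed in the message. The same helper can be pointed at any other bitmask field taken from a crash dump.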
Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
Hi Nic,

On 2/11/16, 3:47 PM, "Nicholas A. Bellinger" wrote:

>On Wed, 2016-02-10 at 22:53 -0800, Nicholas A. Bellinger wrote:
>> On Tue, 2016-02-09 at 18:03 +, Himanshu Madhani wrote:
>> > On 2/8/16, 9:25 PM, "Nicholas A. Bellinger" wrote:
>> > >On Mon, 2016-02-08 at 23:27 +, Himanshu Madhani wrote:
>> > >>
>> > >> I am testing this series with a 4.5.0-rc2+ kernel and I am
>> > >> seeing an issue where triggering sg_reset with the
>> > >> host/device/bus options in a loop at a 120-second interval
>> > >> causes a call stack. At this point removing the configuration
>> > >> hangs indefinitely. See the attached dmesg output from my setup.
>> > >>
>> > >
>> > >Thanks a lot for testing this.
>> > >
>> > >So it looks like we're still hitting an indefinite schedule() on
>> > >se_cmd->cmd_wait_comp once tcm_qla2xxx session
>> > >disconnect/reconnect occurs, after repeated explicit active I/O
>> > >remote-port sg_resets.
>> > >
>> > >Does this trigger on the first tcm_qla2xxx session reconnect after
>> > >explicit remote-port sg_reset..? Are session reconnects actively
>> > >being triggered during the test..?
>> > >
>> > >To verify the latter for iscsi-target, I've been using a small
>> > >patch to trigger session reset from TMR kthread context in order
>> > >to simulate the I_T disconnects. Something like that would be
>> > >useful for verifying with tcm_qla2xxx too.
>> > >
>> > >That said, I'll be reproducing with tcm_qla2xxx ports this week,
>> > >and will enable various debug in a WIP branch for testing.
>>
>> Following up here..
>>
>> So far using my test setup with ISP2532 ports in P2P + RAMDISK_MCP
>> and v4.5-rc1, repeated remote-port active I/O LUN_RESET (sg_reset -d)
>> has been functioning as expected with a blocksize_range=4k-256k +
>> iodepth=32 fio write-verify style workload.
>>
>> No ->cmd_kref -1 OOPsen or qla2xxx initiator generated ABORT_TASKs
>> from outstanding target TAS responses, nor fio write-verify failures
>> to report after 800x remote-port active I/O LUN_RESETS.
>>
>> Next step will be to verify explicit tcm_qla2xxx port + module
>> shutdown after 1K test iterations, and then IBLOCK async completions
>> <-> NVMe backends with the same case.
>>
>
>After letting this test run overnight up to 7k active I/O remote-port
>LUN_RESETs, things are still functioning as expected.
>
>Also, /etc/init.d/target stop was able to successfully shut down all
>active sessions and unload tcm_qla2xxx after the test run.
>
>So AFAICT, the active I/O remote-port LUN_RESET changes are stable with
>tcm_qla2xxx ports, separate from the concurrent session disconnect hung
>task you reported earlier.
>
>That said, I'll likely push this series as-is for -rc4, given that Dan
>has also been able to verify the non-concurrent session disconnect case
>on his setup generating constant ABORT_TASKs, and it's still surviving
>both cases for iscsi-target ports.
>
>Please give the debug patch from last night a shot, and see if we can
>determine the se_cmd states when you hit the hung task.

I'll give your latest debug patch a try in a little while.

From the testing that I have done, what is seen is that active IO has
already been completed and the qla2xxx driver is waiting for commands to
be completed, and it's waiting indefinitely for cmd_wait_comp.
So it looks like there is a missing complete call from target_core. I've
attached our analysis from crash debug on a live system after the issue
happens.

I can recreate this issue at will within 5 minutes of triggering
sg_reset with the following steps:

1. Export 4 RAM disk LUNs on each port of a 2-port adapter. Initiator
   will see 8 RAM disk targets.
2. Start IO with 4K block size and 8 threads with 80% write 20% read
   and 100% random. (I am using vdbench for generating IO. I can
   provide setup/config script if needed)
3. Start sg_reset for each LUN with first device, bus and host with
   120s delay. (I've attached my script that I am using for triggering
   sg_reset)

>
>Thank you,
>
>-nab
>
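The attached script is not part of the archive, so the following is only a rough sketch of what step 3 might look like, not the script that was actually used. It assumes sg_reset from a recent sg3_utils that accepts the long options --device, --bus and --host (check sg_reset --help on your system), and the device paths in the usage note are hypothetical. The runner and sleep arguments are injectable so the sequencing can be dry-run without touching any hardware.

```python
# Sketch only: issue device, bus and host resets in turn for each LUN,
# pausing between resets. Not the original script from this thread.
import subprocess
import time

# Assumed long option names from recent sg3_utils; verify locally.
RESET_OPTIONS = ["--device", "--bus", "--host"]

def build_reset_commands(devices):
    """For each device: device reset, then bus reset, then host reset."""
    return [["sg_reset", opt, dev] for dev in devices for opt in RESET_OPTIONS]

def run_reset_loop(devices, delay_s=120, iterations=1,
                   runner=subprocess.run, sleep=time.sleep):
    """Run the reset sequence, sleeping delay_s seconds after each reset."""
    for _ in range(iterations):
        for cmd in build_reset_commands(devices):
            runner(cmd, check=False)  # keep going even if one reset fails
            sleep(delay_s)
```

A run against real hardware would be something like run_reset_loop(["/dev/sg3", "/dev/sg4"], iterations=100), with the sg device nodes replaced by whatever the initiator actually sees for the exported LUNs.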
Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
On Wed, 2016-02-10 at 22:53 -0800, Nicholas A. Bellinger wrote:
> On Tue, 2016-02-09 at 18:03 +, Himanshu Madhani wrote:
> > On 2/8/16, 9:25 PM, "Nicholas A. Bellinger" wrote:
> > >On Mon, 2016-02-08 at 23:27 +, Himanshu Madhani wrote:
> > >>
> > >> I am testing this series with a 4.5.0-rc2+ kernel and I am seeing
> > >> an issue where triggering sg_reset with the host/device/bus
> > >> options in a loop at a 120-second interval causes a call stack.
> > >> At this point removing the configuration hangs indefinitely. See
> > >> the attached dmesg output from my setup.
> > >>
> > >
> > >Thanks a lot for testing this.
> > >
> > >So it looks like we're still hitting an indefinite schedule() on
> > >se_cmd->cmd_wait_comp once tcm_qla2xxx session disconnect/reconnect
> > >occurs, after repeated explicit active I/O remote-port sg_resets.
> > >
> > >Does this trigger on the first tcm_qla2xxx session reconnect after
> > >explicit remote-port sg_reset..? Are session reconnects actively
> > >being triggered during the test..?
> > >
> > >To verify the latter for iscsi-target, I've been using a small
> > >patch to trigger session reset from TMR kthread context in order to
> > >simulate the I_T disconnects. Something like that would be useful
> > >for verifying with tcm_qla2xxx too.
> > >
> > >That said, I'll be reproducing with tcm_qla2xxx ports this week,
> > >and will enable various debug in a WIP branch for testing.
>
> Following up here..
>
> So far using my test setup with ISP2532 ports in P2P + RAMDISK_MCP and
> v4.5-rc1, repeated remote-port active I/O LUN_RESET (sg_reset -d) has
> been functioning as expected with a blocksize_range=4k-256k +
> iodepth=32 fio write-verify style workload.
>
> No ->cmd_kref -1 OOPsen or qla2xxx initiator generated ABORT_TASKs from
> outstanding target TAS responses, nor fio write-verify failures to
> report after 800x remote-port active I/O LUN_RESETS.
>
> Next step will be to verify explicit tcm_qla2xxx port + module shutdown
> after 1K test iterations, and then IBLOCK async completions <-> NVMe
> backends with the same case.
>

After letting this test run overnight up to 7k active I/O remote-port
LUN_RESETs, things are still functioning as expected.

Also, /etc/init.d/target stop was able to successfully shut down all
active sessions and unload tcm_qla2xxx after the test run.

So AFAICT, the active I/O remote-port LUN_RESET changes are stable with
tcm_qla2xxx ports, separate from the concurrent session disconnect hung
task you reported earlier.

That said, I'll likely push this series as-is for -rc4, given that Dan
has also been able to verify the non-concurrent session disconnect case
on his setup generating constant ABORT_TASKs, and it's still surviving
both cases for iscsi-target ports.

Please give the debug patch from last night a shot, and see if we can
determine the se_cmd states when you hit the hung task.

Thank you,

-nab
Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
Hi Himanshu & Co,

On Tue, 2016-02-09 at 18:03 +, Himanshu Madhani wrote:
> Hi Nic,
>
> On 2/8/16, 9:25 PM, "Nicholas A. Bellinger" wrote:
>
> >Hi Himanshu,
> >
> >On Mon, 2016-02-08 at 23:27 +, Himanshu Madhani wrote:
> >>
> >> I am testing this series with a 4.5.0-rc2+ kernel and I am seeing
> >> an issue where triggering sg_reset with the host/device/bus options
> >> in a loop at a 120-second interval causes a call stack. At this
> >> point removing the configuration hangs indefinitely. See the
> >> attached dmesg output from my setup.
> >>
> >
> >Thanks a lot for testing this.
> >
> >So it looks like we're still hitting an indefinite schedule() on
> >se_cmd->cmd_wait_comp once tcm_qla2xxx session disconnect/reconnect
> >occurs, after repeated explicit active I/O remote-port sg_resets.
> >
> >Does this trigger on the first tcm_qla2xxx session reconnect after
> >explicit remote-port sg_reset..? Are session reconnects actively
> >being triggered during the test..?
> >
> >To verify the latter for iscsi-target, I've been using a small patch
> >to trigger session reset from TMR kthread context in order to
> >simulate the I_T disconnects. Something like that would be useful for
> >verifying with tcm_qla2xxx too.
> >
> >That said, I'll be reproducing with tcm_qla2xxx ports this week, and
> >will enable various debug in a WIP branch for testing.

Following up here..

So far using my test setup with ISP2532 ports in P2P + RAMDISK_MCP and
v4.5-rc1, repeated remote-port active I/O LUN_RESET (sg_reset -d) has
been functioning as expected with a blocksize_range=4k-256k + iodepth=32
fio write-verify style workload.

No ->cmd_kref -1 OOPsen or qla2xxx initiator generated ABORT_TASKs from
outstanding target TAS responses, nor fio write-verify failures to
report after 800x remote-port active I/O LUN_RESETS.

Next step will be to verify explicit tcm_qla2xxx port + module shutdown
after 1K test iterations, and then IBLOCK async completions <-> NVMe
backends with the same case.

> Let me know if I can help in any way for testing/validating this
> series.
>

Thanks. :)

So based on your original log, it's still unclear whether the session
reset resulting in the se_cmd->cmd_wait_comp indefinite sleep +
->cmd_kref leak is happening concurrently with repeated remote-port
LUN_RESET, or the session reset -> target_wait_for_sess_cmds() occurs
after active I/O has already completed..?

Please confirm.

To that end, target-pending/debug-for-himanshu has been pushed to enable
extra debug for test, please update.

Also, you'll want to enable microsecond ring buffer timestamps in your
kernel build too, as it's very useful for this type of debugging.

Thank you,

--nab
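To make the "cmd_wait_comp indefinite sleep + ->cmd_kref leak" scenario a bit more concrete, here is a toy user-space model. It is only an analogy and not the LIO code: ToyCmd, get, put and wait_for_cmd are made-up names, and threading.Event stands in for a kernel completion. The one assumption it illustrates is that if the waiter is only woken when the last reference is dropped, a single reference that is never put leaves the waiter asleep.

```python
# Sketch only: a toy refcount + completion pair. Not kernel code.
import threading

class ToyCmd:
    """Toy stand-in for a command with a reference count and a completion."""

    def __init__(self):
        self.refcount = 1                  # initial reference
        self.lock = threading.Lock()
        self.wait_comp = threading.Event()

    def get(self):
        with self.lock:
            self.refcount += 1

    def put(self):
        """Drop one reference; the final put signals the waiter."""
        with self.lock:
            self.refcount -= 1
            if self.refcount < 0:
                raise RuntimeError("refcount underflow")
            last = self.refcount == 0
        if last:
            self.wait_comp.set()
        return last

def wait_for_cmd(cmd, timeout_s):
    """Return True if the command was released before the timeout."""
    return cmd.wait_comp.wait(timeout_s)

balanced = ToyCmd()
balanced.get()
balanced.put()
balanced.put()                             # final put wakes the waiter
print("balanced released:", wait_for_cmd(balanced, 0.05))

leaked = ToyCmd()
leaked.get()
leaked.put()                               # one put is missing: a leak
print("leaked released:", wait_for_cmd(leaked, 0.05))
```

The toy waiter uses a timeout so the script terminates. An untimed wait in the same situation would simply never return, which is the shape of the hang being discussed here.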
Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
Hi Nic,

On 2/8/16, 9:25 PM, "Nicholas A. Bellinger" wrote:

>Hi Himanshu,
>
>On Mon, 2016-02-08 at 23:27 +, Himanshu Madhani wrote:
>>
>> I am testing this series with a 4.5.0-rc2+ kernel and I am seeing an
>> issue where triggering sg_reset with the host/device/bus options in a
>> loop at a 120-second interval causes a call stack. At this point
>> removing the configuration hangs indefinitely. See the attached dmesg
>> output from my setup.
>>
>
>Thanks a lot for testing this.
>
>So it looks like we're still hitting an indefinite schedule() on
>se_cmd->cmd_wait_comp once tcm_qla2xxx session disconnect/reconnect
>occurs, after repeated explicit active I/O remote-port sg_resets.
>
>Does this trigger on the first tcm_qla2xxx session reconnect after
>explicit remote-port sg_reset..? Are session reconnects actively being
>triggered during the test..?
>
>To verify the latter for iscsi-target, I've been using a small patch to
>trigger session reset from TMR kthread context in order to simulate the
>I_T disconnects. Something like that would be useful for verifying with
>tcm_qla2xxx too.
>
>That said, I'll be reproducing with tcm_qla2xxx ports this week, and
>will enable various debug in a WIP branch for testing.

Let me know if I can help in any way for testing/validating this series.

>
>Thank you,
>
>--nab
>
Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
Hi Himanshu,

On Mon, 2016-02-08 at 23:27 +, Himanshu Madhani wrote:
>
> I am testing this series with a 4.5.0-rc2+ kernel and I am seeing an
> issue where triggering sg_reset with the host/device/bus options in a
> loop at a 120-second interval causes a call stack. At this point
> removing the configuration hangs indefinitely. See the attached dmesg
> output from my setup.
>

Thanks a lot for testing this.

So it looks like we're still hitting an indefinite schedule() on
se_cmd->cmd_wait_comp once tcm_qla2xxx session disconnect/reconnect
occurs, after repeated explicit active I/O remote-port sg_resets.

Does this trigger on the first tcm_qla2xxx session reconnect after
explicit remote-port sg_reset..? Are session reconnects actively being
triggered during the test..?

To verify the latter for iscsi-target, I've been using a small patch to
trigger session reset from TMR kthread context in order to simulate the
I_T disconnects. Something like that would be useful for verifying with
tcm_qla2xxx too.

That said, I'll be reproducing with tcm_qla2xxx ports this week, and
will enable various debug in a WIP branch for testing.

Thank you,

--nab
Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
On Sun, 2016-02-07 at 08:02 -0800, Bart Van Assche wrote:
> On 02/06/16 21:19, Nicholas A. Bellinger wrote:
> > On Sat, 2016-02-06 at 20:19 -0800, Bart Van Assche wrote:
> >> On 02/06/16 19:17, Nicholas A. Bellinger wrote:
> >>> Here is -v4 series to address the set of LUN_RESET
> >>> active I/O + TMR se_cmd->cmd_kref < 0 bugs as reported
> >>> recently by Quinn & Co. This can occur during active
> >>> I/O remote port TMR LUN_RESET with multi-port LIO
> >>> configurations.
> >>
> >> Hi Nic,
> >>
> >> If I understood the purpose of this patch series correctly then this
> >> patch series is a brave attempt to fix what is also fixed by my patch
> >> called "target: Make ABORT and LUN RESET handling synchronous".
> >> Wouldn't it be better to focus on that patch instead of trying to fix
> >> the current approach in which TMR handling happens from another
> >> context than the command processing context ?
> >>
> >
> > This statement is a gross oversimplification of the issues involved.
> >
> > If you'll recall, this was already highlighted in the context of your
> > patch here:
> >
> > http://www.spinics.net/lists/target-devel/msg11057.html
> >
> > There are a number of comments on why the bug-fix was incorrect and
> > broken, the basics of what needed to be done and in what order it
> > should happen.
> >
> > But instead of replying to the comments, this was your response:
> >
> > http://www.spinics.net/lists/target-devel/msg11542.html
> >
> > If you are authentically interested in understanding the issues
> > involved, you'll probably need to go back and comment on those topics
> > individually, instead of ignoring them.
>
> Hi Nic,
>
> What you write is not correct. All your review comments that made sense
> have been addressed in the latest version of my patch
> (http://www.spinics.net/lists/target-devel/msg11666.html).

Did you respond to the specific feedback of my email..? No.

Did you include a change-log in the subsequent patch explaining the
changes..? No.

If you're still not willing or able to have a technical discussion on
the specific issues involved or give feedback inline to the patch series
itself, then you are just trying to waste everyone's time.

>
> Additionally, you haven't answered my question. My question was: why to
> spend more energy on trying to fix the current approach if the LIO TMR
> handling code can be simplified greatly by handling ABORT and LUN RESET
> from the regular command execution path ?
>

Because your patch was incorrect and broken, and you still don't seem
interested in taking the time to actually understand why that is.

Listen, Bart, I'm getting tired of your inability to have a technical
discussion of the issues.

So that said, I'm going to put down some ground rules for our future
interactions.

I expect you to:

- Ask questions when you're unsure of a specific piece of code, before
  attempting to push changes to re-write significant pieces and spend
  community review cycles.
- Comment inline to all feedback for changes of substance.
- Stop ignoring subsystem maintainer feedback.
- Provide a change-log between patches for all changes of substance.

If you are genuinely interested in contributing to LIO, then these
should be a no-brainer.

If you aren't genuinely interested in contributing to LIO, then keep
doing what you're doing.
Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
On 02/06/16 21:19, Nicholas A. Bellinger wrote:
> On Sat, 2016-02-06 at 20:19 -0800, Bart Van Assche wrote:
>> On 02/06/16 19:17, Nicholas A. Bellinger wrote:
>>> Here is -v4 series to address the set of LUN_RESET
>>> active I/O + TMR se_cmd->cmd_kref < 0 bugs as reported
>>> recently by Quinn & Co. This can occur during active
>>> I/O remote port TMR LUN_RESET with multi-port LIO
>>> configurations.
>>
>> Hi Nic,
>>
>> If I understood the purpose of this patch series correctly then this
>> patch series is a brave attempt to fix what is also fixed by my patch
>> called "target: Make ABORT and LUN RESET handling synchronous". Wouldn't
>> it be better to focus on that patch instead of trying to fix the current
>> approach in which TMR handling happens from another context than the
>> command processing context ?
>
> This statement is a gross oversimplification of the issues involved.
>
> If you'll recall, this was already highlighted in the context of your
> patch here:
>
> http://www.spinics.net/lists/target-devel/msg11057.html
>
> There are a number of comments on why the bug-fix was incorrect and
> broken, the basics of what needed to be done and in what order it should
> happen.
>
> But instead of replying to the comments, this was your response:
>
> http://www.spinics.net/lists/target-devel/msg11542.html
>
> If you are authentically interested in understanding the issues
> involved, you'll probably need to go back and comment on those topics
> individually, instead of ignoring them.

Hi Nic,

What you write is not correct. All your review comments that made sense
have been addressed in the latest version of my patch
(http://www.spinics.net/lists/target-devel/msg11666.html).

Additionally, you haven't answered my question. My question was: why to
spend more energy on trying to fix the current approach if the LIO TMR
handling code can be simplified greatly by handling ABORT and LUN RESET
from the regular command execution path ?

Bart.
Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
On Sat, 2016-02-06 at 20:19 -0800, Bart Van Assche wrote:
> On 02/06/16 19:17, Nicholas A. Bellinger wrote:
> > Here is -v4 series to address the set of LUN_RESET
> > active I/O + TMR se_cmd->cmd_kref < 0 bugs as reported
> > recently by Quinn & Co. This can occur during active
> > I/O remote port TMR LUN_RESET with multi-port LIO
> > configurations.
>
> Hi Nic,
>
> If I understood the purpose of this patch series correctly then this
> patch series is a brave attempt to fix what is also fixed by my patch
> called "target: Make ABORT and LUN RESET handling synchronous". Wouldn't
> it be better to focus on that patch instead of trying to fix the current
> approach in which TMR handling happens from another context than the
> command processing context ?
>

This statement is a gross oversimplification of the issues involved.

If you'll recall, this was already highlighted in the context of your
patch here:

http://www.spinics.net/lists/target-devel/msg11057.html

There are a number of comments on why the bug-fix was incorrect and
broken, the basics of what needed to be done and in what order it should
happen.

But instead of replying to the comments, this was your response:

http://www.spinics.net/lists/target-devel/msg11542.html

If you are authentically interested in understanding the issues
involved, you'll probably need to go back and comment on those topics
individually, instead of ignoring them.
Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
On 02/06/16 19:17, Nicholas A. Bellinger wrote:
> Here is -v4 series to address the set of LUN_RESET
> active I/O + TMR se_cmd->cmd_kref < 0 bugs as reported
> recently by Quinn & Co. This can occur during active
> I/O remote port TMR LUN_RESET with multi-port LIO
> configurations.

Hi Nic,

If I understood the purpose of this patch series correctly then this
patch series is a brave attempt to fix what is also fixed by my patch
called "target: Make ABORT and LUN RESET handling synchronous". Wouldn't
it be better to focus on that patch instead of trying to fix the current
approach in which TMR handling happens from another context than the
command processing context ?

Thanks,

Bart.
[PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling
From: Nicholas Bellinger

Hi folks,

Here is -v4 series to address the set of LUN_RESET active I/O + TMR
se_cmd->cmd_kref < 0 bugs as reported recently by Quinn & Co.

This can occur during active I/O remote port TMR LUN_RESET with
multi-port LIO configurations.

To address this bug, we add a __target_check_io_state() common handler
for ABORT_TASK + LUN_RESET I/O abort cases, and move the remaining
se_cmd SGL page + release into target_free_cmd_mem() to now be called
directly from final target_release_cmd_kref() callback.

It also adds a target_wait_free_cmd() helper and makes
transport_generic_free_cmd() aware of CMD_T_ABORTED status during
concurrent session disconnects, and introduces CMD_T_FABRIC_STOP bit to
signal this special case.

Currently this series is running atop v4.5-rc1 + v3.14.y, and with
iscsi-target ports is able to survive active I/O remote-port LUN resets,
plus remote-port LUN_RESET with concurrent simulated session
disconnects.

At this point the changes are stable with iscsi-target ports, and as
Himanshu + Co can verify with tcm_qla2xxx should be considered ready to
merge for -rc4.

Please review + test.

--nab

v4 changes:
  - Add explicit CMD_T_FABRIC_STOP check and drop cmd_wait_set bit set
    usage in __target_check_io_state().
  - Set early CMD_T_TAS in __target_check_io_state to avoid potential
    race in transport_send_task_abort() with shutdown.
  - Add fabric_stop + aborted checks in __transport_wait_for_tasks() in
    order to let TMR CMD_T_ABORTED se_cmd shutdown complete during
    concurrent session disconnect.
  - Fix race with driver SCF_SEND_DELAYED_TAS handling when
    __transport_check_aborted_status() could happen before
    transport_send_task_abort() in TMR kthread context.
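
As a rough illustration of the state check described in the v4 changes above, here is a minimal userspace sketch in C. It is not the kernel code: `struct model_cmd`, `model_check_io_state()`, the flag values, and the plain `int` reference count are hypothetical stand-ins, and the locking a real handler would need is left out. The rule that CMD_T_TAS is set only when the TMR arrives on a different session than the command is an assumption made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag names follow the cover letter; the values are illustrative only. */
enum {
    CMD_T_ABORTED     = 1 << 0,
    CMD_T_COMPLETE    = 1 << 1,
    CMD_T_TAS         = 1 << 2,
    CMD_T_FABRIC_STOP = 1 << 3,
};

/* Hypothetical model of a command; not the kernel's struct se_cmd. */
struct model_cmd {
    unsigned int state;   /* bitmask of CMD_T_* flags */
    int refcount;         /* stands in for se_cmd->cmd_kref */
    const void *sess;     /* session that submitted the command */
};

/*
 * Decide whether the TMR path may abort this command.
 * Returns true with one extra reference held, or false if the command
 * has already completed, is owned by session shutdown, or has no
 * references left. A real implementation would do this under a lock.
 */
static bool model_check_io_state(struct model_cmd *cmd,
                                 const void *tmr_sess, bool tas)
{
    assert(cmd != NULL);

    /* Completed or fabric-stopped commands are left to their owner. */
    if (cmd->state & (CMD_T_COMPLETE | CMD_T_FABRIC_STOP))
        return false;

    /* Never resurrect a command whose last reference is already gone. */
    if (cmd->refcount <= 0)
        return false;

    cmd->state |= CMD_T_ABORTED;

    /* Assumption: TAS only applies to commands from another session. */
    if (tas && tmr_sess != cmd->sess)
        cmd->state |= CMD_T_TAS;

    cmd->refcount++;
    return true;
}
```

In this model the caller that gets `true` owns one reference and must drop it once the abort has been processed; a `false` return means the TMR path must not touch the command at all.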
Nicholas Bellinger (5):
  target: Fix LUN_RESET active I/O handling for ACK_KREF
  target: Fix LUN_RESET active TMR descriptor handling
  target: Fix TAS handling for multi-session se_node_acls
  target: Fix remote-port TMR ABORT + se_cmd fabric stop
  target: Fix race with SCF_SEND_DELAYED_TAS handling

 drivers/target/target_core_tmr.c       | 139 -
 drivers/target/target_core_transport.c | 278 +++--
 include/target/target_core_base.h      |   3 +
 3 files changed, 301 insertions(+), 119 deletions(-)

--
1.9.1
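
The cover letter above moves the remaining se_cmd memory release into a helper called from the final reference-count release callback. The general pattern can be sketched in userspace C as follows. Everything here is a hypothetical model: `struct demo_cmd`, `demo_cmd_get()`, `demo_cmd_put()`, and the `release_count` hook do not exist in the kernel, and the counter is a plain `int` rather than an atomic kref.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a command whose buffer must outlive all users. */
struct demo_cmd {
    int refcount;        /* stands in for se_cmd->cmd_kref */
    void *data;          /* stands in for the SGL pages */
    int *release_count;  /* observation hook: bumped when release runs */
};

/* Allocate a command holding one initial reference. */
static struct demo_cmd *demo_cmd_alloc(size_t len, int *release_count)
{
    struct demo_cmd *cmd = calloc(1, sizeof(*cmd));

    if (!cmd)
        return NULL;
    cmd->data = malloc(len);
    if (!cmd->data) {
        free(cmd);
        return NULL;
    }
    cmd->refcount = 1;
    cmd->release_count = release_count;
    return cmd;
}

/* Take an extra reference, e.g. for an abort path inspecting the command. */
static void demo_cmd_get(struct demo_cmd *cmd)
{
    assert(cmd->refcount > 0);
    cmd->refcount++;
}

/* Final release: the only place where command memory is freed. */
static void demo_cmd_release(struct demo_cmd *cmd)
{
    if (cmd->release_count)
        (*cmd->release_count)++;
    free(cmd->data);
    free(cmd);
}

/* Drop one reference; returns true if this call released the command. */
static bool demo_cmd_put(struct demo_cmd *cmd)
{
    /* A put with no references left is the "refcount < 0" class of bug. */
    assert(cmd->refcount > 0);
    if (--cmd->refcount == 0) {
        demo_cmd_release(cmd);
        return true;
    }
    return false;
}
```

Because the buffer is freed only inside `demo_cmd_release()`, a second holder (for example an abort path that called `demo_cmd_get()`) can still read the command after the normal completion path has dropped its own reference.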