Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling

2016-02-26 Thread Dan Lane
On Fri, Feb 12, 2016 at 12:30 AM, Himanshu Madhani
 wrote:
> Hi Nic,
>
>
>
> On 2/11/16, 3:47 PM, "Nicholas A. Bellinger"  wrote:
>
>>On Wed, 2016-02-10 at 22:53 -0800, Nicholas A. Bellinger wrote:
>>> On Tue, 2016-02-09 at 18:03 +, Himanshu Madhani wrote:
>>> > On 2/8/16, 9:25 PM, "Nicholas A. Bellinger" 
>>>wrote:
>>> > >On Mon, 2016-02-08 at 23:27 +, Himanshu Madhani wrote:
>>> > >>
>>> > >> I am testing this series with 4.5.0-rc2+ kernel and I am seeing an
>>> > >> issue where trying to trigger sg_reset with option of
>>> > >> host/device/bus in a loop at a 120-second interval causes a call
>>> > >> stack. At this point removing configuration hangs indefinitely.
>>> > >> See attached dmesg output from my setup.
>>> > >>
>>> > >
>>> > >Thanks a lot for testing this.
>>> > >
>>> > >So it looks like we're still hitting an indefinite schedule() on
>>> > >se_cmd->cmd_wait_comp once tcm_qla2xxx session disconnect/reconnect
>>> > >occurs, after repeated explicit active I/O remote-port sg_resets.
>>> > >
>>> > >Does this trigger on the first tcm_qla2xxx session reconnect after
>>> > >explicit remote-port sg_reset..?  Are session reconnects actively
>>> > >being triggered during the test..?
>>> > >
>>> > >To verify the latter for iscsi-target, I've been using a small patch
>>> > >to trigger session reset from TMR kthread context in order to simulate
>>> > >the I_T disconnects.  Something like that would be useful for
>>> > >verifying with tcm_qla2xxx too.
>>> > >
>>> > >That said, I'll be reproducing with tcm_qla2xxx ports this week, and
>>> > >will enable various debug in a WIP branch for testing.
>>>
>>> Following up here..
>>>
>>> So far using my test setup with ISP2532 ports in P2P + RAMDISK_MCP and
>>> v4.5-rc1, repeated remote-port active I/O LUN_RESET (sg_reset -d) has
>>> been functioning as expected with a blocksize_range=4k-256k + iodepth=32
>>> fio write-verify style workload.
>>>
>>> No ->cmd_kref -1 OOPsen or qla2xxx initiator generated ABORT_TASKs from
>>> outstanding target TAS responses, nor fio write-verify failures to
>>> report after 800x remote-port active I/O LUN_RESETS.
>>>
>>> Next step will be to verify explicit tcm_qla2xxx port + module shutdown
>>> after 1K test iterations, and then IBLOCK async completions <-> NVMe
>>> backends with the same case.
>>>
>>
>>After letting this test run over-night up to 7k active I/O remote-port
>>LUN_RESETs, things are still functioning as expected.
>>
>>Also, /etc/init.d/target stop was able to successfully shutdown all
>>active sessions and unload tcm_qla2xxx after the test run.
>>
>>So AFAICT, the active I/O remote-port LUN_RESET changes are stable with
>>tcm_qla2xxx ports, separate from concurrent session disconnect hung task
>>you reported earlier.
>>
>>That said, I'll likely push this series as-is for -rc4, given that Dan
>>has also been able to verify the non-concurrent session disconnect case
>>on his setup generating constant ABORT_TASKs, and it's still surviving
>>both cases for iscsi-target ports.
>>
>>Please give the debug patch from last night a shot, and see if we can
>>determine the se_cmd states when you hit the hung task.
>
> I'll give your latest debug patch a try in a little while.
>
> From the testing that I have done, what is seen is that active IO has
> already been completed and the qla2xxx driver is waiting for commands to be
> completed, and it's waiting indefinitely for cmd_wait_comp.
> So it looks like there is a missing complete call from target_core. I've
> attached our analysis from crash debug on a live system after the issue
> happens.
>
>
> I can recreate this issue at will within 5 minutes of triggering sg_reset
> with the following steps:
>
> 1. Export 4 RAM disk LUNs on each port of a 2-port adapter. The initiator
> will see 8 RAM disk targets.
> 2. Start IO with 4K block size and 8 threads with 80% write 20% read and
> 100% random.
> (I am using vdbench for generating IO. I can provide setup/config script
> if needed)
> 3. Start sg_reset for each LUN, first with device, then bus, then host,
> with a 120s delay. (I've attached my script that I am using for
> triggering sg_reset)
>
>>
>>Thank you,
>>
>>-nab
>>
>

Has there been any update to this?

Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling

2016-02-15 Thread Himanshu Madhani
Hi Nic,



On 2/12/16, 11:03 PM, "Nicholas A. Bellinger"  wrote:

>Hi Himanshu & Co,
>
>On Fri, 2016-02-12 at 00:48 -0800, Nicholas A. Bellinger wrote:
>> On Fri, 2016-02-12 at 05:30 +, Himanshu Madhani wrote:
>
>
>
>> Thanks for the crash dump output.
>> 
>> So it's a t_state = TRANSPORT_WRITE_PENDING descriptor with
>> SAM_STAT_CHECK_CONDITION + cmd_kref.refcount = 0:
>> 
>> struct qla_tgt_cmd {
>>   se_cmd = {
>> scsi_status = 0x2
>> se_cmd_flags = 0x80090d,
>> 
>> 
>> 
>> cmd_kref = {
>>   refcount = {
>> counter = 0x0
>>   }
>> }, 
>> }
>> 
>> The se_cmd_flags=0x80090d translation to enum se_cmd_flags_table:
>> 
>> - SCF_TRANSPORT_TASK_SENSE
>> - SCF_EMULATED_TASK_SENSE
>> - SCF_SCSI_DATA_CDB
>> - SCF_SE_LUN_CMD
>> - SCF_SENT_CHECK_CONDITION
>> - SCF_USE_CPUID
>> 
>
>After grokking your dump some more:
>
>For SAM_STAT_CHECK_CONDITION with t_state = TRANSPORT_WRITE_PENDING, the
>se_cmd->transport_state = 0x880 bits set are:
>
>- CMD_T_DEV_ACTIVE
>- CMD_T_FABRIC_STOP
>
>and sense buffer = 0x70 00 0b 00 00 00 00 0a 00 00 00 00 29 03 00,
>which is the following from sense_info_table[]:
>
>  [TCM_CHECK_CONDITION_ABORT_CMD] = {
>.key = ABORTED_COMMAND,
>.asc = 0x29, /* BUS DEVICE RESET FUNCTION OCCURRED */
>.ascq = 0x03,
>},
>
>The descriptor looks like it did make it to tcm_qla2xxx_complete_free()
>-> transport_generic_free_cmd() with both qla_tgt_cmd->cmd_sent_to_fw=0,
>and qla_tgt_cmd->write_data_transferred=0 set.
>
>The best I can tell, it looks like tcm_qla2xxx_handle_data_work() ->
>transport_generic_request_failure() w/ TCM_CHECK_CONDITION_ABORT_CMD is
>occurring..
>
>So to confirm, this specific bug was not a result of active I/O
>LUN_RESET w/ CMD_T_ABORTED during session disconnect, or otherwise.
>
>> 
>> > I can recreate this issue at will within 5 minutes of triggering
>> > sg_reset with the following steps:
>> >
>> > 1. Export 4 RAM disk LUNs on each port of a 2-port adapter. The
>> > initiator will see 8 RAM disk targets.
>> > 2. Start IO with 4K block size and 8 threads with 80% write 20% read
>> > and 100% random.
>> > (I am using vdbench for generating IO. I can provide setup/config
>> > script if needed)
>> > 3. Start sg_reset for each LUN, first with device, then bus, then
>> > host, with a 120s delay. (I've attached my script that I am using for
>> > triggering sg_reset)
>> > 
>> 
>> Thanks, will keep looking and try to reproduce with your script.
>
>So here's my test setup with 3x Intel P3600 NVMe/IBLOCK backends, across
>dual ISP2532 ports:
>
>o- / ........ [...]
>  o- backstores ........ [...]
>  | o- fileio ........ [0 Storage Object]
>  | o- iblock ........ [3 Storage Objects]
>  | | o- nvme0n1 ........ [/dev/nvme0n1, in use]
>  | | o- nvme1n1 ........ [/dev/nvme1n1, in use]
>  | | o- nvme2n1 ........ [/dev/nvme2n1, in use]
>  | o- pscsi ........ [0 Storage Object]
>  | o- rd_mcp ........ [1 Storage Object]
>  |   o- ramdisk ........ [16.0G, ramdisk, not in use]
>  o- qla2xxx ........ [2 Targets]
>  | o- 21:00:00:24:ff:48:97:7e ........ [enabled]
>  | | o- acls ........ [1 ACL]
>  | | | o- 21:00:00:24:ff:48:97:7c ........ [3 Mapped LUNs]
>  | | |   o- mapped_lun0 ........ [lun0 (rw)]
>  | | |   o- mapped_lun1 ........ [lun1 (rw)]
>  | | |   o- mapped_lun2 ........ [lun2 (rw)]
>  | | o- luns ........ [3 LUNs]
>  | |   o- lun0 ........ [iblock/nvme0n1 (/dev/nvme0n1)]
>  | |   o- lun1 ........ [iblock/nvme1n1 (/dev/nvme1n1)]
>  | |   o- lun2 ........ [iblock/nvme2n1 (/dev/nvme2n1)]
>  | o- 21:00:00:24:ff:48:97:7f ........ [enabled]
>  |   o- acls
Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling

2016-02-12 Thread Nicholas A. Bellinger
On Fri, 2016-02-12 at 05:30 +, Himanshu Madhani wrote:
> Hi Nic, 
> 
> 
> 
> On 2/11/16, 3:47 PM, "Nicholas A. Bellinger"  wrote:
> 
> >On Wed, 2016-02-10 at 22:53 -0800, Nicholas A. Bellinger wrote:
> >> On Tue, 2016-02-09 at 18:03 +, Himanshu Madhani wrote:
> >> > On 2/8/16, 9:25 PM, "Nicholas A. Bellinger" 
> >>wrote:
> >> > >On Mon, 2016-02-08 at 23:27 +, Himanshu Madhani wrote:
> >> > >> 
> >> > >> I am testing this series with 4.5.0-rc2+ kernel and I am seeing an
> >> > >> issue where trying to trigger sg_reset with option of
> >> > >> host/device/bus in a loop at a 120-second interval causes a call
> >> > >> stack. At this point removing configuration hangs indefinitely.
> >> > >> See attached dmesg output from my setup.
> >> > >>
> >> > >
> >> > >Thanks a lot for testing this.
> >> > >
> >> > >So it looks like we're still hitting an indefinite schedule() on
> >> > >se_cmd->cmd_wait_comp once tcm_qla2xxx session disconnect/reconnect
> >> > >occurs, after repeated explicit active I/O remote-port sg_resets.
> >> > >
> >> > >Does this trigger on the first tcm_qla2xxx session reconnect after
> >> > >explicit remote-port sg_reset..?  Are session reconnects actively
> >> > >being triggered during the test..?
> >> > >
> >> > >To verify the latter for iscsi-target, I've been using a small patch
> >> > >to trigger session reset from TMR kthread context in order to simulate
> >> > >the I_T disconnects.  Something like that would be useful for
> >> > >verifying with tcm_qla2xxx too.
> >> > >
> >> > >That said, I'll be reproducing with tcm_qla2xxx ports this week, and
> >> > >will enable various debug in a WIP branch for testing.
> >> 
> >> Following up here..
> >> 
> >> So far using my test setup with ISP2532 ports in P2P + RAMDISK_MCP and
> >> v4.5-rc1, repeated remote-port active I/O LUN_RESET (sg_reset -d) has
> >> been functioning as expected with a blocksize_range=4k-256k + iodepth=32
> >> fio write-verify style workload.
> >> 
> >> No ->cmd_kref -1 OOPsen or qla2xxx initiator generated ABORT_TASKs from
> >> outstanding target TAS responses, nor fio write-verify failures to
> >> report after 800x remote-port active I/O LUN_RESETS.
> >> 
> >> Next step will be to verify explicit tcm_qla2xxx port + module shutdown
> >> after 1K test iterations, and then IBLOCK async completions <-> NVMe
> >> backends with the same case.
> >> 
> >
> >After letting this test run over-night up to 7k active I/O remote-port
> >LUN_RESETs, things are still functioning as expected.
> >
> >Also, /etc/init.d/target stop was able to successfully shutdown all
> >active sessions and unload tcm_qla2xxx after the test run.
> >
> >So AFAICT, the active I/O remote-port LUN_RESET changes are stable with
> >tcm_qla2xxx ports, separate from concurrent session disconnect hung task
> >you reported earlier.
> >
> >That said, I'll likely push this series as-is for -rc4, given that Dan
> >has also been able to verify the non-concurrent session disconnect case
> >on his setup generating constant ABORT_TASKs, and it's still surviving
> >both cases for iscsi-target ports.
> >
> >Please give the debug patch from last night a shot, and see if we can
> >determine the se_cmd states when you hit the hung task.
> 
> I'll give your latest debug patch a try in a little while.
>
> From the testing that I have done, what is seen is that active IO has
> already been completed and the qla2xxx driver is waiting for commands to be
> completed, and it's waiting indefinitely for cmd_wait_comp.
> So it looks like there is a missing complete call from target_core. I've
> attached our analysis from crash debug on a live system after the issue
> happens.
> 
> 

Thanks for the crash dump output.

So it's a t_state = TRANSPORT_WRITE_PENDING descriptor with
SAM_STAT_CHECK_CONDITION + cmd_kref.refcount = 0:

struct qla_tgt_cmd {
  se_cmd = {
scsi_status = 0x2
se_cmd_flags = 0x80090d,



cmd_kref = {
  refcount = {
counter = 0x0
  }
}, 
}

The se_cmd_flags=0x80090d translation to enum se_cmd_flags_table:

- SCF_TRANSPORT_TASK_SENSE
- SCF_EMULATED_TASK_SENSE
- SCF_SCSI_DATA_CDB
- SCF_SE_LUN_CMD
- SCF_SENT_CHECK_CONDITION
- SCF_USE_CPUID
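
As a quick cross-check of a mask like the one above, the set bit positions
can be listed mechanically and then matched by hand against enum
se_cmd_flags_table in the tree under test. A minimal sketch in Python; the
helper name set_bits is hypothetical, and the flag names are deliberately
not reproduced here since their values depend on the kernel version:

```python
def set_bits(value):
    """Return the positions of the bits set in a non-negative integer."""
    return [pos for pos in range(value.bit_length()) if (value >> pos) & 1]


if __name__ == "__main__":
    flags = 0x80090d  # se_cmd_flags value taken from the crash dump
    for pos in set_bits(flags):
        # Print each set bit as a position and as a single-bit mask, ready
        # to be compared against the enum values in the kernel header.
        print("bit %2d -> 0x%08x" % (pos, 1 << pos))
```

Six bits are set in 0x80090d, which matches the number of flag names in the
list above.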

Also, ->cmd_wait_comp has a zero rlock counter + dead magic:

cmd_wait_comp = {
  done = 0x0, 
  wait = {
lock = {
  {
rlock = {
  raw_lock = {
val = {
  counter = 0x0
}
  }, 
  magic = 0xdead4ead, 
  owner_cpu = 0x, 
  owner = 0x, 

> I can recreate this issue at will within 5 minutes of triggering sg_reset
> with the following steps:
> 
> 1. Export 4 RAM disk LUNs on each port of a 2-port adapter. The initiator
> will see 8 RAM disk targets.
> 2. Start IO with 4K block size and 8 threads with 80% write 20% read and
> 100% random.
> (I am using vdbench for 

Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling

2016-02-12 Thread Nicholas A. Bellinger
Hi Himanshu & Co,

On Fri, 2016-02-12 at 00:48 -0800, Nicholas A. Bellinger wrote:
> On Fri, 2016-02-12 at 05:30 +, Himanshu Madhani wrote:



> Thanks for the crash dump output.
> 
> So it's a t_state = TRANSPORT_WRITE_PENDING descriptor with
> SAM_STAT_CHECK_CONDITION + cmd_kref.refcount = 0:
> 
> struct qla_tgt_cmd {
>   se_cmd = {
> scsi_status = 0x2
> se_cmd_flags = 0x80090d,
> 
> 
> 
> cmd_kref = {
>   refcount = {
> counter = 0x0
>   }
> }, 
> }
> 
> The se_cmd_flags=0x80090d translation to enum se_cmd_flags_table:
> 
> - SCF_TRANSPORT_TASK_SENSE
> - SCF_EMULATED_TASK_SENSE
> - SCF_SCSI_DATA_CDB
> - SCF_SE_LUN_CMD
> - SCF_SENT_CHECK_CONDITION
> - SCF_USE_CPUID
> 

After grokking your dump some more:

For SAM_STAT_CHECK_CONDITION with t_state = TRANSPORT_WRITE_PENDING, the
se_cmd->transport_state = 0x880 bits set are:

- CMD_T_DEV_ACTIVE
- CMD_T_FABRIC_STOP

and sense buffer = 0x70 00 0b 00 00 00 00 0a 00 00 00 00 29 03 00,
which is the following from sense_info_table[]:

  [TCM_CHECK_CONDITION_ABORT_CMD] = {
.key = ABORTED_COMMAND,
.asc = 0x29, /* BUS DEVICE RESET FUNCTION OCCURRED */
.ascq = 0x03,
},
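
For anyone repeating this decode by hand, here is a rough sketch in Python
of the two steps above. The CMD_T_* bit positions are an assumption,
inferred only from the 0x880 value and the two names given here, so please
verify them against include/target/target_core_base.h in your tree. The
sense parsing follows the SPC fixed-format layout (sense key in byte 2,
ASC in byte 12, ASCQ in byte 13); parse_fixed_sense is a hypothetical
helper name:

```python
# Assumed bit positions (not confirmed from the source tree): chosen so that
# the two flags named above OR together to the observed 0x880.
CMD_T_DEV_ACTIVE = 1 << 7
CMD_T_FABRIC_STOP = 1 << 11

# Sense buffer bytes exactly as quoted from the crash dump.
SENSE = bytes.fromhex("70 00 0b 00 00 00 00 0a 00 00 00 00 29 03 00")


def parse_fixed_sense(buf):
    """Decode the key fields of SPC fixed-format sense data (0x70/0x71)."""
    if len(buf) < 14:
        raise ValueError("need at least 14 bytes of sense data")
    response_code = buf[0] & 0x7F
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    return {
        "response_code": response_code,
        "sense_key": buf[2] & 0x0F,      # 0x0b is ABORTED COMMAND
        "additional_length": buf[7],
        "asc": buf[12],
        "ascq": buf[13],
    }


if __name__ == "__main__":
    state = 0x880
    print("DEV_ACTIVE set: ", bool(state & CMD_T_DEV_ACTIVE))
    print("FABRIC_STOP set:", bool(state & CMD_T_FABRIC_STOP))
    for name, value in parse_fixed_sense(SENSE).items():
        print("%-18s 0x%02x" % (name, value))
```

The decoded key/asc/ascq triple (0x0b, 0x29, 0x03) is what lines up with
the TCM_CHECK_CONDITION_ABORT_CMD entry shown above.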

The descriptor looks like it did make it to tcm_qla2xxx_complete_free()
-> transport_generic_free_cmd() with both qla_tgt_cmd->cmd_sent_to_fw=0,
and qla_tgt_cmd->write_data_transferred=0 set.

The best I can tell, it looks like tcm_qla2xxx_handle_data_work() ->
transport_generic_request_failure() w/ TCM_CHECK_CONDITION_ABORT_CMD is
occurring..

So to confirm, this specific bug was not a result of active I/O
LUN_RESET w/ CMD_T_ABORTED during session disconnect, or otherwise.

> 
> > I can recreate this issue at will within 5 minutes of triggering sg_reset
> > with the following steps:
> > 
> > 1. Export 4 RAM disk LUNs on each port of a 2-port adapter. The initiator
> > will see 8 RAM disk targets.
> > 2. Start IO with 4K block size and 8 threads with 80% write 20% read and
> > 100% random.
> > (I am using vdbench for generating IO. I can provide setup/config script
> > if needed)
> > 3. Start sg_reset for each LUN, first with device, then bus, then host,
> > with a 120s delay. (I've attached my script that I am using for
> > triggering sg_reset)
> > 
> 
> Thanks, will keep looking and try to reproduce with your script.

So here's my test setup with 3x Intel P3600 NVMe/IBLOCK backends, across
dual ISP2532 ports:

o- / ........ [...]
  o- backstores ........ [...]
  | o- fileio ........ [0 Storage Object]
  | o- iblock ........ [3 Storage Objects]
  | | o- nvme0n1 ........ [/dev/nvme0n1, in use]
  | | o- nvme1n1 ........ [/dev/nvme1n1, in use]
  | | o- nvme2n1 ........ [/dev/nvme2n1, in use]
  | o- pscsi ........ [0 Storage Object]
  | o- rd_mcp ........ [1 Storage Object]
  |   o- ramdisk ........ [16.0G, ramdisk, not in use]
  o- qla2xxx ........ [2 Targets]
  | o- 21:00:00:24:ff:48:97:7e ........ [enabled]
  | | o- acls ........ [1 ACL]
  | | | o- 21:00:00:24:ff:48:97:7c ........ [3 Mapped LUNs]
  | | |   o- mapped_lun0 ........ [lun0 (rw)]
  | | |   o- mapped_lun1 ........ [lun1 (rw)]
  | | |   o- mapped_lun2 ........ [lun2 (rw)]
  | | o- luns ........ [3 LUNs]
  | |   o- lun0 ........ [iblock/nvme0n1 (/dev/nvme0n1)]
  | |   o- lun1 ........ [iblock/nvme1n1 (/dev/nvme1n1)]
  | |   o- lun2 ........ [iblock/nvme2n1 (/dev/nvme2n1)]
  | o- 21:00:00:24:ff:48:97:7f ........ [enabled]
  |   o- acls ........ [1 ACL]
  |   | o- 21:00:00:24:ff:48:97:7d ........ [3 Mapped LUNs]
  |   |   o- mapped_lun0 ........ [lun0 (rw)]
  |   | 

Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling

2016-02-11 Thread Himanshu Madhani
Hi Nic, 



On 2/11/16, 3:47 PM, "Nicholas A. Bellinger"  wrote:

>On Wed, 2016-02-10 at 22:53 -0800, Nicholas A. Bellinger wrote:
>> On Tue, 2016-02-09 at 18:03 +, Himanshu Madhani wrote:
>> > On 2/8/16, 9:25 PM, "Nicholas A. Bellinger" 
>>wrote:
>> > >On Mon, 2016-02-08 at 23:27 +, Himanshu Madhani wrote:
>> > >> 
>> > >> I am testing this series with 4.5.0-rc2+ kernel and I am seeing an
>> > >> issue where trying to trigger sg_reset with option of
>> > >> host/device/bus in a loop at a 120-second interval causes a call
>> > >> stack. At this point removing configuration hangs indefinitely.
>> > >> See attached dmesg output from my setup.
>> > >>
>> > >
>> > >Thanks a lot for testing this.
>> > >
>> > >So it looks like we're still hitting an indefinite schedule() on
>> > >se_cmd->cmd_wait_comp once tcm_qla2xxx session disconnect/reconnect
>> > >occurs, after repeated explicit active I/O remote-port sg_resets.
>> > >
>> > >Does this trigger on the first tcm_qla2xxx session reconnect after
>> > >explicit remote-port sg_reset..?  Are session reconnects actively
>> > >being triggered during the test..?
>> > >
>> > >To verify the latter for iscsi-target, I've been using a small patch
>> > >to trigger session reset from TMR kthread context in order to simulate
>> > >the I_T disconnects.  Something like that would be useful for
>> > >verifying with tcm_qla2xxx too.
>> > >
>> > >That said, I'll be reproducing with tcm_qla2xxx ports this week, and
>> > >will enable various debug in a WIP branch for testing.
>> 
>> Following up here..
>> 
>> So far using my test setup with ISP2532 ports in P2P + RAMDISK_MCP and
>> v4.5-rc1, repeated remote-port active I/O LUN_RESET (sg_reset -d) has
>> been functioning as expected with a blocksize_range=4k-256k + iodepth=32
>> fio write-verify style workload.
>> 
>> No ->cmd_kref -1 OOPsen or qla2xxx initiator generated ABORT_TASKs from
>> outstanding target TAS responses, nor fio write-verify failures to
>> report after 800x remote-port active I/O LUN_RESETS.
>> 
>> Next step will be to verify explicit tcm_qla2xxx port + module shutdown
>> after 1K test iterations, and then IBLOCK async completions <-> NVMe
>> backends with the same case.
>> 
>
>After letting this test run over-night up to 7k active I/O remote-port
>LUN_RESETs, things are still functioning as expected.
>
>Also, /etc/init.d/target stop was able to successfully shutdown all
>active sessions and unload tcm_qla2xxx after the test run.
>
>So AFAICT, the active I/O remote-port LUN_RESET changes are stable with
>tcm_qla2xxx ports, separate from concurrent session disconnect hung task
>you reported earlier.
>
>That said, I'll likely push this series as-is for -rc4, given that Dan
>has also been able to verify the non conncurrent session disconnect case
>on his setup generating constant ABORT_TASKs, and it's still surviving
>both cases for iscsi-target ports.
>
>Please give the debug patch from last night a shot, and see if we can
>determine the se_cmd states when you hit the hung task.

I'll give your latest debug patch a try in a little while.

From the testing that I have done, what is seen is that active IO has
already been completed and the qla2xxx driver is waiting for commands to be
completed, and it's waiting indefinitely for cmd_wait_comp.
So it looks like there is a missing complete call from target_core. I've
attached our analysis from crash debug on a live system after the issue
happens.


I can recreate this issue at will within 5 minutes of triggering sg_reset
with the following steps:

1. Export 4 RAM disk LUNs on each port of a 2-port adapter. The initiator
will see 8 RAM disk targets.
2. Start IO with 4K block size and 8 threads with 80% write 20% read and
100% random.
(I am using vdbench for generating IO. I can provide setup/config script
if needed)
3. Start sg_reset for each LUN, first with device, then bus, then host,
with a 120s delay. (I've attached my script that I am using for
triggering sg_reset)
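
For readers without the attachment, step 3 could look roughly like the
Python sketch below. This is not the attached script, only an illustration
under stated assumptions: the function names are hypothetical, /dev/sg1 and
/dev/sg2 are placeholder device nodes, and the long option names
(--device, --bus, --host) are those of recent sg3_utils releases, so older
versions may need the short forms instead. It defaults to a dry run that
only prints the commands:

```python
import subprocess
import time

# sg_reset long options (assumed: recent sg3_utils), in escalation order.
RESET_LEVELS = ("--device", "--bus", "--host")


def build_reset_commands(devices, levels=RESET_LEVELS):
    """Return one sg_reset argv per (device, level): device, bus, then host."""
    return [["sg_reset", level, dev] for dev in devices for level in levels]


def run_reset_loop(devices, delay_s=120, rounds=1, dry_run=True):
    """Issue each reset in turn, sleeping delay_s seconds after each one."""
    for _ in range(rounds):
        for argv in build_reset_commands(devices):
            if dry_run:
                # Only show what would be executed.
                print(" ".join(argv))
            else:
                subprocess.run(argv, check=False)
                time.sleep(delay_s)


if __name__ == "__main__":
    # Placeholder sg nodes; substitute the LUNs seen by the initiator.
    run_reset_loop(["/dev/sg1", "/dev/sg2"], dry_run=True)
```

Passing dry_run=False on the initiator, with the real sg nodes, would issue
the resets while the I/O from step 2 is running.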

>
>Thank you,
>
>-nab
>


Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling

2016-02-11 Thread Nicholas A. Bellinger
On Wed, 2016-02-10 at 22:53 -0800, Nicholas A. Bellinger wrote:
> On Tue, 2016-02-09 at 18:03 +, Himanshu Madhani wrote:
> > On 2/8/16, 9:25 PM, "Nicholas A. Bellinger"  wrote:
> > >On Mon, 2016-02-08 at 23:27 +, Himanshu Madhani wrote:
> > >> 
> > >> I am testing this series with 4.5.0-rc2+ kernel and I am seeing an
> > >> issue where trying to trigger sg_reset with option of
> > >> host/device/bus in a loop at a 120-second interval causes a call
> > >> stack. At this point removing configuration hangs indefinitely.
> > >> See attached dmesg output from my setup.
> > >> 
> > >
> > >Thanks a lot for testing this.
> > >
> > >So it looks like we're still hitting an indefinite schedule() on
> > >se_cmd->cmd_wait_comp once tcm_qla2xxx session disconnect/reconnect
> > >occurs, after repeated explicit active I/O remote-port sg_resets.
> > >
> > >Does this trigger on the first tcm_qla2xxx session reconnect after
> > >explicit remote-port sg_reset..?  Are session reconnects actively being
> > >triggered during the test..?
> > >
> > >To verify the latter for iscsi-target, I've been using a small patch to
> > >trigger session reset from TMR kthread context in order to simulate the
> > >I_T disconnects.  Something like that would be useful for verifying with
> > >tcm_qla2xxx too.
> > >
> > >That said, I'll be reproducing with tcm_qla2xxx ports this week, and
> > >will enable various debug in a WIP branch for testing.
> 
> Following up here..
> 
> So far using my test setup with ISP2532 ports in P2P + RAMDISK_MCP and
> v4.5-rc1, repeated remote-port active I/O LUN_RESET (sg_reset -d) has
> been functioning as expected with a blocksize_range=4k-256k + iodepth=32
> fio write-verify style workload.
> 
> No ->cmd_kref -1 OOPsen or qla2xxx initiator generated ABORT_TASKs from
> outstanding target TAS responses, nor fio write-verify failures to
> report after 800x remote-port active I/O LUN_RESETS.
> 
> Next step will be to verify explicit tcm_qla2xxx port + module shutdown
> after 1K test iterations, and then IBLOCK async completions <-> NVMe
> backends with the same case.
> 

After letting this test run over-night up to 7k active I/O remote-port
LUN_RESETs, things are still functioning as expected.

Also, /etc/init.d/target stop was able to successfully shutdown all
active sessions and unload tcm_qla2xxx after the test run.

So AFAICT, the active I/O remote-port LUN_RESET changes are stable with
tcm_qla2xxx ports, separate from concurrent session disconnect hung task
you reported earlier.

That said, I'll likely push this series as-is for -rc4, given that Dan
has also been able to verify the non-concurrent session disconnect case
on his setup generating constant ABORT_TASKs, and it's still surviving
both cases for iscsi-target ports.

Please give the debug patch from last night a shot, and see if we can
determine the se_cmd states when you hit the hung task.

Thank you,

-nab



Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling

2016-02-10 Thread Nicholas A. Bellinger
Hi Himanshu & Co,

On Tue, 2016-02-09 at 18:03 +, Himanshu Madhani wrote:
> Hi Nic, 
> 
> 
> On 2/8/16, 9:25 PM, "Nicholas A. Bellinger"  wrote:
> 
> >Hi Himanshu,
> >
> >On Mon, 2016-02-08 at 23:27 +, Himanshu Madhani wrote:
> >> 
> >> I am testing this series with 4.5.0-rc2+ kernel and I am seeing an
> >> issue where trying to trigger sg_reset with option of
> >> host/device/bus in a loop at a 120-second interval causes a call
> >> stack. At this point removing configuration hangs indefinitely.
> >> See attached dmesg output from my setup.
> >> 
> >
> >Thanks a lot for testing this.
> >
> >So it looks like we're still hitting an indefinite schedule() on
> >se_cmd->cmd_wait_comp once tcm_qla2xxx session disconnect/reconnect
> >occurs, after repeated explicit active I/O remote-port sg_resets.
> >
> >Does this trigger on the first tcm_qla2xxx session reconnect after
> >explicit remote-port sg_reset..?  Are session reconnects actively being
> >triggered during the test..?
> >
> >To verify the latter for iscsi-target, I've been using a small patch to
> >trigger session reset from TMR kthread context in order to simulate the
> >I_T disconnects.  Something like that would be useful for verifying with
> >tcm_qla2xxx too.
> >
> >That said, I'll be reproducing with tcm_qla2xxx ports this week, and
> >will enable various debug in a WIP branch for testing.

Following up here..

So far using my test setup with ISP2532 ports in P2P + RAMDISK_MCP and
v4.5-rc1, repeated remote-port active I/O LUN_RESET (sg_reset -d) has
been functioning as expected with a blocksize_range=4k-256k + iodepth=32
fio write-verify style workload.

No ->cmd_kref -1 OOPsen or qla2xxx initiator generated ABORT_TASKs from
outstanding target TAS responses, nor fio write-verify failures to
report after 800x remote-port active I/O LUN_RESETS.

Next step will be to verify explicit tcm_qla2xxx port + module shutdown
after 1K test iterations, and then IBLOCK async completions <-> NVMe
backends with the same case.

> Let me know if I can help in any way for testing/validating this series.
> 

Thanks.  :)

So based on your original log, it's still unclear whether the session
reset resulting in se_cmd->cmd_wait_comp indefinite sleep + ->cmd_kref
leak is happening concurrently with repeated remote port LUN_RESET, or the
session reset -> target_wait_for_sess_cmds() occurs after active I/O has
already completed..?  Please confirm.

To that end, target-pending/debug-for-himanshu has been pushed to enable
extra debug for test, please update.

Also, you'll want to enable microsecond ring buffer timestamps in your
kernel build too, as it's very useful for this type of debugging.

Thank you,

--nab



Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling

2016-02-09 Thread Himanshu Madhani
Hi Nic, 


On 2/8/16, 9:25 PM, "Nicholas A. Bellinger"  wrote:

>Hi Himanshu,
>
>On Mon, 2016-02-08 at 23:27 +, Himanshu Madhani wrote:
>> 
>> I am testing this series with 4.5.0-rc2+ kernel and I am seeing an
>> issue where trying to trigger sg_reset with option of
>> host/device/bus in a loop at a 120-second interval causes a call
>> stack. At this point removing configuration hangs indefinitely.
>> See attached dmesg output from my setup.
>> 
>
>Thanks a lot for testing this.
>
>So it looks like we're still hitting an indefinite schedule() on
>se_cmd->cmd_wait_comp once tcm_qla2xxx session disconnect/reconnect
>occurs, after repeated explicit active I/O remote-port sg_resets.
>
>Does this trigger on the first tcm_qla2xxx session reconnect after
>explicit remote-port sg_reset..?  Are session reconnects actively being
>triggered during the test..?
>
>To verify the latter for iscsi-target, I've been using a small patch to
>trigger session reset from TMR kthread context in order to simulate the
>I_T disconnects.  Something like that would be useful for verifying with
>tcm_qla2xxx too.
>
>That said, I'll be reproducing with tcm_qla2xxx ports this week, and
>will enable various debug in a WIP branch for testing.

Let me know if I can help in any way for testing/validating this series.

>
>Thank you,
>
>--nab
>


Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling

2016-02-08 Thread Nicholas A. Bellinger
Hi Himanshu,

On Mon, 2016-02-08 at 23:27 +, Himanshu Madhani wrote:
> 
> I am testing this series with 4.5.0-rc2+ kernel and I am seeing an
> issue where trying to trigger sg_reset with option of
> host/device/bus in a loop at a 120-second interval causes a call
> stack. At this point removing configuration hangs indefinitely.
> See attached dmesg output from my setup.
> 

Thanks a lot for testing this.

So it looks like we're still hitting an indefinite schedule() on
se_cmd->cmd_wait_comp once tcm_qla2xxx session disconnect/reconnect
occurs, after repeated explicit active I/O remote-port sg_resets.

Does this trigger on the first tcm_qla2xxx session reconnect after
explicit remote-port sg_reset..?  Are session reconnects actively being
triggered during the test..?

To verify the latter for iscsi-target, I've been using a small patch to
trigger session reset from TMR kthread context in order to simulate the
I_T disconnects.  Something like that would be useful for verifying with
tcm_qla2xxx too.

That said, I'll be reproducing with tcm_qla2xxx ports this week, and
will enable various debug in a WIP branch for testing.

Thank you,

--nab



Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling

2016-02-07 Thread Bart Van Assche

On 02/06/16 21:19, Nicholas A. Bellinger wrote:
> On Sat, 2016-02-06 at 20:19 -0800, Bart Van Assche wrote:
>> On 02/06/16 19:17, Nicholas A. Bellinger wrote:
>>> Here is -v4 series to address the set of LUN_RESET
>>> active I/O + TMR se_cmd->cmd_kref < 0 bugs as reported
>>> recently by Quinn & Co.  This can occur during active
>>> I/O remote port TMR LUN_RESET with multi-port LIO
>>> configurations.
>>
>> Hi Nic,
>>
>> If I understood the purpose of this patch series correctly then this
>> patch series is a brave attempt to fix what is also fixed by my patch
>> called "target: Make ABORT and LUN RESET handling synchronous". Wouldn't
>> it be better to focus on that patch instead of trying to fix the current
>> approach in which TMR handling happens from another context than the
>> command processing context ?
>>
>
> This statement is a gross oversimplification of the issues involved.
>
> If you'll recall, this was already highlighted in the context of your
> patch here:
>
> http://www.spinics.net/lists/target-devel/msg11057.html
>
> There are a number of comments on why the bug-fix was incorrect and
> broken, the basics of what needed to be done and in what order it should
> happen.
>
> But instead of replying to the comments, this was your response:
>
> http://www.spinics.net/lists/target-devel/msg11542.html
>
> If you are authentically interested in understanding the issues
> involved, you'll probably need to go back and comment on those topics
> individually, instead of ignoring them.


Hi Nic,

What you write is not correct. All your review comments that made sense 
have been addressed in the latest version of my patch 
(http://www.spinics.net/lists/target-devel/msg11666.html).


Additionally, you haven't answered my question. My question was: why 
spend more energy on trying to fix the current approach if the LIO TMR 
handling code can be simplified greatly by handling ABORT and LUN RESET 
from the regular command execution path ?


Bart.


Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling

2016-02-07 Thread Nicholas A. Bellinger
On Sun, 2016-02-07 at 08:02 -0800, Bart Van Assche wrote:
> On 02/06/16 21:19, Nicholas A. Bellinger wrote:
> > On Sat, 2016-02-06 at 20:19 -0800, Bart Van Assche wrote:
> >> On 02/06/16 19:17, Nicholas A. Bellinger wrote:
> >>> Here is -v4 series to address the set of LUN_RESET
> >>> active I/O + TMR se_cmd->cmd_kref < 0 bugs as reported
> >>> recently by Quinn & Co.  This can occur during active
> >>> I/O remote port TMR LUN_RESET with multi-port LIO
> >>> configurations.
> >>
> >> Hi Nic,
> >>
> >> If I understood the purpose of this patch series correctly then this
> >> patch series is a brave attempt to fix what is also fixed by my patch
> >> called "target: Make ABORT and LUN RESET handling synchronous". Wouldn't
> >> it be better to focus on that patch instead of trying to fix the current
> >> approach in which TMR handling happens from another context than the
> >> command processing context?
> >>
> >
> > This statement is a gross oversimplification of the issues involved.
> >
> > If you'll recall, this was already highlighted in the context of your
> > patch here:
> >
> > http://www.spinics.net/lists/target-devel/msg11057.html
> >
> > There are a number of comments on why the bug-fix was incorrect and
> > broken, the basics of what needed to be done and in what order it should
> > happen.
> >
> > But instead of replying to the comments, this was your response:
> >
> > http://www.spinics.net/lists/target-devel/msg11542.html
> >
> > If you are authentically interested in understanding the issues
> > involved, you'll probably need to go back and comment on those topics
> > individually, instead of ignoring them.
> 
> Hi Nic,
> 
> What you write is not correct. All your review comments that made sense 
> have been addressed in the latest version of my patch 
> (http://www.spinics.net/lists/target-devel/msg11666.html).

Did you respond to the specific feedback of my email..?  No.

Did you include a change-log in the subsequent patch explaining the
changes..?  No.

If you're still not willing or able to have a technical discussion on
the specific issues involved or give feedback inline to the patch series
itself, then you are just trying to waste everyone's time.

> 
> Additionally, you haven't answered my question. My question was: why
> spend more energy on trying to fix the current approach if the LIO TMR
> handling code can be simplified greatly by handling ABORT and LUN RESET
> from the regular command execution path?
> 

Because your patch was incorrect and broken, and you still don't seem
interested in taking the time to actually understand why that is.

Listen, Bart, I'm getting tired of your inability to have a technical
discussion of the issues.

So that said, I'm going to put down some ground rules for our future
interactions.  I expect you to:

   - Ask questions when you're unsure of a specific piece of code,
     before attempting to push changes to re-write significant pieces
     and spend community review cycles.
   - Comment inline to all feedback for changes of substance.
   - Stop ignoring subsystem maintainer feedback.
   - Provide a change-log between patches for all changes of substance.

If you are genuinely interested in contributing to LIO, then these
should be a no-brainer. 

If you aren't genuinely interested in contributing to LIO, then keep
doing what you're doing.



[PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling

2016-02-06 Thread Nicholas A. Bellinger
From: Nicholas Bellinger 

Hi folks,

Here is -v4 series to address the set of LUN_RESET
active I/O + TMR se_cmd->cmd_kref < 0 bugs as reported
recently by Quinn & Co.  This can occur during active
I/O remote port TMR LUN_RESET with multi-port LIO
configurations.

To address this bug, we add a __target_check_io_state()
common handler for ABORT_TASK + LUN_RESET I/O abort
cases, and move the remaining se_cmd SGL page + release
into target_free_cmd_mem() to now be called directly
from final target_release_cmd_kref() callback.
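
As an illustration only (this is not code from the series), the idea of
freeing a command's memory solely from the final reference-count release
callback can be sketched in userspace C. Every name below (demo_cmd,
demo_cmd_get_unless_zero(), demo_cmd_put()) is hypothetical, and C11
atomics stand in for the kernel's kref. The assumption is that an abort
path may only take a reference while the count is still non-zero, which
is what keeps the count from ever dropping below zero:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a command descriptor (not struct se_cmd). */
struct demo_cmd {
	atomic_int refcount;	/* plays the role of cmd_kref */
	void *data_pages;	/* plays the role of the SGL pages */
	int *released;		/* counts how often the release callback ran */
};

static struct demo_cmd *demo_cmd_alloc(int *released)
{
	struct demo_cmd *cmd = calloc(1, sizeof(*cmd));

	if (!cmd)
		return NULL;
	cmd->data_pages = malloc(4096);
	if (!cmd->data_pages) {
		free(cmd);
		return NULL;
	}
	atomic_init(&cmd->refcount, 1);	/* the submitter's reference */
	cmd->released = released;
	return cmd;
}

/* Take a reference only while the command is still live (count > 0). */
static bool demo_cmd_get_unless_zero(struct demo_cmd *cmd)
{
	int old = atomic_load(&cmd->refcount);

	while (old > 0) {
		/* On failure, 'old' is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&cmd->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Final release callback: the only place the payload and descriptor are freed. */
static void demo_cmd_release(struct demo_cmd *cmd)
{
	free(cmd->data_pages);
	(*cmd->released)++;
	free(cmd);
}

/* Drop one reference; returns true if this was the final put. */
static bool demo_cmd_put(struct demo_cmd *cmd)
{
	if (atomic_fetch_sub(&cmd->refcount, 1) == 1) {
		demo_cmd_release(cmd);
		return true;
	}
	return false;
}
```

In this sketch an abort path would call demo_cmd_get_unless_zero()
before touching the command and demo_cmd_put() when done, so the
payload is freed exactly once regardless of which side drops the last
reference.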

It also adds a target_wait_free_cmd() helper and makes
transport_generic_free_cmd() aware of CMD_T_ABORTED
status during concurrent session disconnects, and
introduces CMD_T_FABRIC_STOP bit to signal this special
case.

Currently this series is running atop v4.5-rc1 + v3.14.y,
and with iscsi-target ports is able to survive active
I/O remote-port LUN resets, plus remote-port LUN_RESET
with concurrent simulated session disconnects.

At this point the changes are stable with iscsi-target
ports, and as Himanshu + Co can verify with tcm_qla2xxx
should be considered ready to merge for -rc4.

Please review + test.

--nab

v4 changes:

- Add explicit CMD_T_FABRIC_STOP check and drop cmd_wait_set
  bit set usage in __target_check_io_state().
- Set early CMD_T_TAS in __target_check_io_state to avoid
  potential race in transport_send_task_abort() with shutdown.
- Add fabric_stop + aborted checks in __transport_wait_for_tasks()
  in order to let TMR CMD_T_ABORTED se_cmd shutdown complete
  during concurrent session disconnect.
- Fix race with driver SCF_SEND_DELAYED_TAS handling when
  __transport_check_aborted_status() could happen before
  transport_send_task_abort() in TMR kthread context.
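
One possible reading of the "fabric_stop + aborted checks" item above,
sketched as standalone C rather than the actual
__transport_wait_for_tasks() code: when a command is both aborted by a
TMR and marked as being stopped by the fabric, the session-shutdown
path does not block on it and leaves completion to the TMR path. The
flag names, their values, and demo_should_wait() are hypothetical and
only loosely modelled on the CMD_T_* bits named in this series:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the kernel's actual CMD_T_* values differ. */
enum {
	DEMO_T_ABORTED     = 1u << 0,
	DEMO_T_ACTIVE      = 1u << 1,
	DEMO_T_FABRIC_STOP = 1u << 2,
};

/*
 * Returns true if the shutdown path should block waiting for the command.
 * '*aborted' reports whether the command was already aborted by a TMR.
 */
static bool demo_should_wait(unsigned int state, bool *aborted)
{
	bool fabric_stop = state & DEMO_T_FABRIC_STOP;

	*aborted = state & DEMO_T_ABORTED;

	/* Nothing in flight: no reason to wait. */
	if (!(state & DEMO_T_ACTIVE))
		return false;

	/* Assumed rule: TMR abort + fabric stop means the TMR path finishes it. */
	if (fabric_stop && *aborted)
		return false;

	return true;
}
```

With these assumptions, an active command that is neither aborted nor
fabric-stopped is waited for as usual, while the aborted + fabric-stop
combination returns immediately.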

Nicholas Bellinger (5):
  target: Fix LUN_RESET active I/O handling for ACK_KREF
  target: Fix LUN_RESET active TMR descriptor handling
  target: Fix TAS handling for multi-session se_node_acls
  target: Fix remote-port TMR ABORT + se_cmd fabric stop
  target: Fix race with SCF_SEND_DELAYED_TAS handling

 drivers/target/target_core_tmr.c   | 139 -
 drivers/target/target_core_transport.c | 278 +++--
 include/target/target_core_base.h  |   3 +
 3 files changed, 301 insertions(+), 119 deletions(-)

-- 
1.9.1



Re: [PATCH-v4 0/5] Fix LUN_RESET active I/O + TMR handling

2016-02-06 Thread Bart Van Assche

On 02/06/16 19:17, Nicholas A. Bellinger wrote:

> Here is -v4 series to address the set of LUN_RESET
> active I/O + TMR se_cmd->cmd_kref < 0 bugs as reported
> recently by Quinn & Co.  This can occur during active
> I/O remote port TMR LUN_RESET with multi-port LIO
> configurations.


Hi Nic,

If I understood the purpose of this patch series correctly then this
patch series is a brave attempt to fix what is also fixed by my patch
called "target: Make ABORT and LUN RESET handling synchronous". Wouldn't
it be better to focus on that patch instead of trying to fix the current
approach in which TMR handling happens from another context than the
command processing context?


Thanks,

Bart.
