Re: No abort is sent for a WRITE command that takes too long
On Mon, May 18, 2009 at 6:01 PM, Mike Christie wrote: > > Erez Zilber wrote: >> On Mon, May 18, 2009 at 4:36 PM, Mike Christie wrote: >>> Erez Zilber wrote: I enabled open-iscsi logging + added some printk calls when the abort handler returns. Here's the log. I see that iscsi_eh_cmd_timed_out gets called, but there's no abort. May 17 11:00:06 kpc36 kernel: session1: iscsi_eh_cmd_timed_out scsi cmd 8101e30efe40 timedout May 17 11:00:06 kpc36 kernel: session1: iscsi_eh_cmd_timed_out return timer reset >>> As you can see in iscsi_eh_cmd_timed_out, if the sesison is down then >>> there is no point in letting the scsi eh run since we have to relogin >>> and restart commands so we would return reset timer which prevents the >>> scsi eh from running. >> >> Makes sense. There's only one thing that I don't understand - when >> does scsi-ml call iscsi_eh_cmd_timed_out? I thought that if a cmd >> times out, scsi-ml sends an abort. > > > scsi-ml calls scsi_times_out (which calls iscsi_eh_cmd_timed_out) when > the scsi command timer expires (timeout is value in > /sys/block/sdX/device/timeout). > > If the eh_timed_out callout returns BLK_EH_NOT_HANDLED then the scsi eh > will run and call the abort function. > > The transportt->eh_timed_out allows the class/lld to override the scsi > eh if for example it is a transport problem. In that case an abort or > lun reset would not help, because you cannot send the TMF since there is > no access to the target/disk. > I understand. Thanks! Erez --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: No abort is sent for a WRITE command that takes too long
Erez Zilber wrote: > On Mon, May 18, 2009 at 4:36 PM, Mike Christie wrote: >> Erez Zilber wrote: >>> I enabled open-iscsi logging + added some printk calls when the abort >>> handler returns. >>> Here's the log. I see that iscsi_eh_cmd_timed_out gets called, but >>> there's no abort. >>> May 17 11:00:06 kpc36 kernel: session1: iscsi_eh_cmd_timed_out scsi >>> cmd 8101e30efe40 timedout >>> May 17 11:00:06 kpc36 kernel: session1: iscsi_eh_cmd_timed_out return >>> timer reset >> As you can see in iscsi_eh_cmd_timed_out, if the sesison is down then >> there is no point in letting the scsi eh run since we have to relogin >> and restart commands so we would return reset timer which prevents the >> scsi eh from running. > > Makes sense. There's only one thing that I don't understand - when > does scsi-ml call iscsi_eh_cmd_timed_out? I thought that if a cmd > times out, scsi-ml sends an abort. scsi-ml calls scsi_times_out (which calls iscsi_eh_cmd_timed_out) when the scsi command timer expires (timeout is value in /sys/block/sdX/device/timeout). If the eh_timed_out callout returns BLK_EH_NOT_HANDLED then the scsi eh will run and call the abort function. The transportt->eh_timed_out allows the class/lld to override the scsi eh if for example it is a transport problem. In that case an abort or lun reset would not help, because you cannot send the TMF since there is no access to the target/disk. --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: No abort is sent for a WRITE command that takes too long
On Mon, May 18, 2009 at 4:36 PM, Mike Christie wrote: > > Erez Zilber wrote: >> >> I enabled open-iscsi logging + added some printk calls when the abort >> handler returns. >> Here's the log. I see that iscsi_eh_cmd_timed_out gets called, but >> there's no abort. > >> May 17 11:00:06 kpc36 kernel: session1: iscsi_eh_cmd_timed_out scsi >> cmd 8101e30efe40 timedout >> May 17 11:00:06 kpc36 kernel: session1: iscsi_eh_cmd_timed_out return >> timer reset > > As you can see in iscsi_eh_cmd_timed_out, if the sesison is down then > there is no point in letting the scsi eh run since we have to relogin > and restart commands so we would return reset timer which prevents the > scsi eh from running. Makes sense. There's only one thing that I don't understand - when does scsi-ml call iscsi_eh_cmd_timed_out? I thought that if a cmd times out, scsi-ml sends an abort. > > And then there is code in there to check if we are in the middle of > checking the connection. If we are then we ask for some more time with > the command, and that will prevent the scsi eh from running. This looks > like it can be problem because we would get a response to our nop which > would update the last_recv field. If there was no progress being made > for the scsi command we would still ask to reset the timer and we could > end up in that loop forever since the scsi layer does not cap the number > of times you can reset the time. I will send a patch to fix that. > > > However, that probably will not fix your problem. > > > For your specific setup, it looks like we hit the > iscsi_eh_cmd_timed_out, reset the scsi command timer becuase we are in > the middle of checking the the connection with the nop/ping, but then > the nop/ping does not return in time and so we drop the session: > > connection1:0: ping timeout of 5 secs > expired, recv timeout 5, last rx 4526718494, last ping 4526723494, now > 4526728494 > > That is why on the target you see it cleanup up commands. On the > initiator you can see us cleaning up: > > May 17 11:00:07 kpc36 kernel: session1: iscsi_start_session_recovery > blocking session > May 17 11:00:07 kpc36 kernel: session1: fail_scsi_tasks failing sc > 8101e30efe40 itt 0x13 state 3 > > And then later in the logs you will see us start the commands again when > we are logged in again. > > > So you probably need to continue to replying to nops when the r2t is > dropped. I will fix it on the initiatotr side to detect if we are not > getting IO for a specific command and then let the scsi eh run. The current behavior doesn't create a problem for me - instead of getting an 'abort' for the cmd, the session gets dropped and the cmd is cleaned up anyway. I was only wondering why it happens. Thanks for the detailed explanation. It was helpful. Erez --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: No abort is sent for a WRITE command that takes too long
Erez Zilber wrote: > > I enabled open-iscsi logging + added some printk calls when the abort > handler returns. > Here's the log. I see that iscsi_eh_cmd_timed_out gets called, but > there's no abort. > May 17 11:00:06 kpc36 kernel: session1: iscsi_eh_cmd_timed_out scsi > cmd 8101e30efe40 timedout > May 17 11:00:06 kpc36 kernel: session1: iscsi_eh_cmd_timed_out return > timer reset As you can see in iscsi_eh_cmd_timed_out, if the sesison is down then there is no point in letting the scsi eh run since we have to relogin and restart commands so we would return reset timer which prevents the scsi eh from running. And then there is code in there to check if we are in the middle of checking the connection. If we are then we ask for some more time with the command, and that will prevent the scsi eh from running. This looks like it can be problem because we would get a response to our nop which would update the last_recv field. If there was no progress being made for the scsi command we would still ask to reset the timer and we could end up in that loop forever since the scsi layer does not cap the number of times you can reset the time. I will send a patch to fix that. However, that probably will not fix your problem. For your specific setup, it looks like we hit the iscsi_eh_cmd_timed_out, reset the scsi command timer becuase we are in the middle of checking the the connection with the nop/ping, but then the nop/ping does not return in time and so we drop the session: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4526718494, last ping 4526723494, now 4526728494 That is why on the target you see it cleanup up commands. On the initiator you can see us cleaning up: May 17 11:00:07 kpc36 kernel: session1: iscsi_start_session_recovery blocking session May 17 11:00:07 kpc36 kernel: session1: fail_scsi_tasks failing sc 8101e30efe40 itt 0x13 state 3 And then later in the logs you will see us start the commands again when we are logged in again. So you probably need to continue to replying to nops when the r2t is dropped. I will fix it on the initiatotr side to detect if we are not getting IO for a specific command and then let the scsi eh run. --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: No abort is sent for a WRITE command that takes too long
On Thu, May 14, 2009 at 6:25 PM, Mike Christie wrote: > > Erez Zilber wrote: >>> Could you try my linux-2.6-iscsi git tree and turn on debugging? >> >> Moving to another kernel will be harder for me, but I can add a printk >> in the eh handler. >> > > Can you recompile the libiscsi module with debugging? > uncomment this: > > /* #define DEBUG_SCSI */ > > in libiscsi.h > > > You might also want to add some debugs in iscsi_eh_abort. We might > hitting one of those early return statements (some do not have > printks/debug statements). > > >>> open-iscsi logs in again after 10 seconds and sends the command again. >>> Do you see something about a target reset or host reset? >> >> No >> SCST also cleans up the session. >>> If the initiator thinks the abort failed (actually it is more like we >>> will return failed if we think it is possible that someone could still >>> be accessing the commands buffers, because we do not want scsi-ml to >>> start using them again) it would return FAILED to the scsi-eh which for >>> us would end up running the host/target reset, which we just drop the >>> session for. >>> >>> We discussed before we need to modify how we decided when to return >>> failed and we need to send a target reset for the host/target reset >>> handler because the target reset and session relogin have different >>> clearing effects. Vlad also has concerns about the tcp/ip connection >>> teardown and buildup. >>> Can anyone explain the reason for this behavior? I would expect that >>> iser and iscsi_tcp both hooked in the same libiscsi.c eh code. You >>> should know this :) >> >> I suspect that scsi-ml doesn't call the eh handler at all. Anyway, I >> will add this printk and retest. >> > > Yeah, that is what it is looking like. If the initiator dropped the > session then you should see the conn error 1011 or a "target reset > succeeded" or "host reset succeeded" message or a failure message if we > did not log back in. > > > > I enabled open-iscsi logging + added some printk calls when the abort handler returns. Here's the log. I see that iscsi_eh_cmd_timed_out gets called, but there's no abort. May 17 10:59:30 kpc36 kernel: connection1:0: iscsi_check_transport_timeouts Sending nopout as ping May 17 10:59:30 kpc36 kernel: connection1:0: iscsi_check_transport_timeouts Setting next tmo 4526696621 May 17 10:59:30 kpc36 kernel: session1: iscsi_prep_mgmt_task mgmtpdu [op 0x0 hdr->itt 0x4009 datalen 0] May 17 10:59:30 kpc36 kernel: session1: __iscsi_complete_pdu [op 0x20 cid 0 itt 0x9 len 0] May 17 10:59:30 kpc36 kernel: session1: iscsi_complete_task complete task itt 0x9 state 3 sc May 17 10:59:30 kpc36 kernel: session1: iscsi_free_task freeing task itt 0x9 state 1 sc May 17 10:59:30 kpc36 kernel: connection1:0: iscsi_check_transport_timeouts Setting next tmo 4526696621 May 17 10:59:33 kpc36 kernel: session1: iscsi_prep_scsi_cmd_pdu iscsi prep [read cid 0 sc 8101e5e1e6c0 cdb 0x12 itt 0xa len 36 bidi_len 0 cmdsn 81 win 32] May 17 10:59:33 kpc36 kernel: session1: __iscsi_complete_pdu [op 0x25 cid 0 itt 0xa len 0] May 17 10:59:33 kpc36 kernel: session1: iscsi_data_in_rsp data in with status done [sc 8101e5e1e6c0 res 0 itt 0xa] May 17 10:59:33 kpc36 kernel: session1: iscsi_complete_task complete task itt 0xa state 3 sc 8101e5e1e6c0 May 17 10:59:33 kpc36 kernel: session1: iscsi_free_task freeing task itt 0xa state 1 sc 8101e5e1e6c0 May 17 10:59:33 kpc36 kernel: session1: iscsi_prep_scsi_cmd_pdu iscsi prep [read cid 0 sc 8101dcca4080 cdb 0x25 itt 0xb len 8 bidi_len 0 cmdsn 82 win 32] May 17 10:59:33 kpc36 kernel: session1: __iscsi_complete_pdu [op 0x25 cid 0 itt 0xb len 0] May 17 10:59:33 kpc36 kernel: session1: iscsi_data_in_rsp data in with status done [sc 8101dcca4080 res 0 itt 0xb] May 17 10:59:33 kpc36 kernel: session1: iscsi_complete_task complete task itt 0xb state 3 sc 8101dcca4080 May 17 10:59:33 kpc36 kernel: session1: iscsi_free_task freeing task itt 0xb state 1 sc 8101dcca4080 May 17 10:59:33 kpc36 kernel: session1: iscsi_prep_scsi_cmd_pdu iscsi prep [read cid 0 sc 8101d8cd7980 cdb 0x28 itt 0xc len 512 bidi_len 0 cmdsn 83 win 32] May 17 10:59:33 kpc36 kernel: session1: __iscsi_complete_pdu [op 0x25 cid 0 itt 0xc len 0] May 17 10:59:33 kpc36 kernel: session1: iscsi_data_in_rsp data in with status done [sc 8101d8cd7980 res 0 itt 0xc] May 17 10:59:33 kpc36 kernel: session1: iscsi_complete_task complete task itt 0xc state 3 sc 8101d8cd7980 May 17 10:59:33 kpc36 kernel: session1: iscsi_free_task freeing task itt 0xc state 1 sc 8101d8cd7980 May 17 10:59:33 kpc36 kernel: session1: iscsi_prep_scsi_cmd_pdu iscsi prep [write cid 0 sc 81023eff3680 cdb 0x2a itt 0xd len 512 bidi_len 0 cmdsn 84 win 32] May 17 10:59:33 kpc36 kernel: session1: __iscsi_complete_pdu [op 0x21 cid 0 itt 0xd len 0] May 17 10:59:33 kpc36 kernel: session1: iscsi_scsi_cmd_rsp cmd rsp done [sc 81023eff3680 res 0
Re: No abort is sent for a WRITE command that takes too long
Erez Zilber wrote: >> Could you try my linux-2.6-iscsi git tree and turn on debugging? > > Moving to another kernel will be harder for me, but I can add a printk > in the eh handler. > Can you recompile the libiscsi module with debugging? uncomment this: /* #define DEBUG_SCSI */ in libiscsi.h You might also want to add some debugs in iscsi_eh_abort. We might hitting one of those early return statements (some do not have printks/debug statements). >> >>> open-iscsi logs in again after 10 seconds and sends the command again. >> Do you see something about a target reset or host reset? > > No > >>> SCST also cleans up the session. >> If the initiator thinks the abort failed (actually it is more like we >> will return failed if we think it is possible that someone could still >> be accessing the commands buffers, because we do not want scsi-ml to >> start using them again) it would return FAILED to the scsi-eh which for >> us would end up running the host/target reset, which we just drop the >> session for. >> >> We discussed before we need to modify how we decided when to return >> failed and we need to send a target reset for the host/target reset >> handler because the target reset and session relogin have different >> clearing effects. Vlad also has concerns about the tcp/ip connection >> teardown and buildup. >> >>> Can anyone explain the reason for this behavior? I would expect that >> iser and iscsi_tcp both hooked in the same libiscsi.c eh code. You >> should know this :) > > I suspect that scsi-ml doesn't call the eh handler at all. Anyway, I > will add this printk and retest. > Yeah, that is what it is looking like. If the initiator dropped the session then you should see the conn error 1011 or a "target reset succeeded" or "host reset succeeded" message or a failure message if we did not log back in. --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: No abort is sent for a WRITE command that takes too long
On Thu, May 14, 2009 at 5:55 PM, Mike Christie wrote: > > Erez Zilber wrote: >> Hi, >> >> I'm running a setup of open-iscsi & SCST. In order to test error >> scenarios during a WRITE command, I've added a delay in SCST, so after >> it receives the command, it doesn't send an R2T for 20 seconds. I also >> modified the device timeout on the initiator side to 5 seconds. >> However, I don't see an 'abort' for that command. Instead, I see that > > How are you looking? With ethereal? Did you see a TUR get sent before > the session recovery? I'm running wireshark. i didn't see any TUR. > > Could you try my linux-2.6-iscsi git tree and turn on debugging? Moving to another kernel will be harder for me, but I can add a printk in the eh handler. > > >> open-iscsi logs in again after 10 seconds and sends the command again. > > Do you see something about a target reset or host reset? No > >> SCST also cleans up the session. > > If the initiator thinks the abort failed (actually it is more like we > will return failed if we think it is possible that someone could still > be accessing the commands buffers, because we do not want scsi-ml to > start using them again) it would return FAILED to the scsi-eh which for > us would end up running the host/target reset, which we just drop the > session for. > > We discussed before we need to modify how we decided when to return > failed and we need to send a target reset for the host/target reset > handler because the target reset and session relogin have different > clearing effects. Vlad also has concerns about the tcp/ip connection > teardown and buildup. > >> >> Can anyone explain the reason for this behavior? I would expect that > > iser and iscsi_tcp both hooked in the same libiscsi.c eh code. You > should know this :) I suspect that scsi-ml doesn't call the eh handler at all. Anyway, I will add this printk and retest. Erez --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: No abort is sent for a WRITE command that takes too long
Erez Zilber wrote: > Hi, > > I'm running a setup of open-iscsi & SCST. In order to test error > scenarios during a WRITE command, I've added a delay in SCST, so after > it receives the command, it doesn't send an R2T for 20 seconds. I also > modified the device timeout on the initiator side to 5 seconds. > However, I don't see an 'abort' for that command. Instead, I see that How are you looking? With ethereal? Did you see a TUR get sent before the session recovery? Could you try my linux-2.6-iscsi git tree and turn on debugging? > open-iscsi logs in again after 10 seconds and sends the command again. Do you see something about a target reset or host reset? > SCST also cleans up the session. If the initiator thinks the abort failed (actually it is more like we will return failed if we think it is possible that someone could still be accessing the commands buffers, because we do not want scsi-ml to start using them again) it would return FAILED to the scsi-eh which for us would end up running the host/target reset, which we just drop the session for. We discussed before we need to modify how we decided when to return failed and we need to send a target reset for the host/target reset handler because the target reset and session relogin have different clearing effects. Vlad also has concerns about the tcp/ip connection teardown and buildup. > > Can anyone explain the reason for this behavior? I would expect that iser and iscsi_tcp both hooked in the same libiscsi.c eh code. You should know this :) > scsi-ml on the initiator side will send an 'abort' after 5 seconds. > > Thanks, > Erez > > > --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: No abort is sent for a WRITE command that takes too long
On Thu, May 14, 2009 at 2:54 PM, Ulrich Windl wrote: > > On 14 May 2009 at 13:12, Erez Zilber wrote: > >> >> Hi, >> >> I'm running a setup of open-iscsi & SCST. In order to test error >> scenarios during a WRITE command, I've added a delay in SCST, so after >> it receives the command, it doesn't send an R2T for 20 seconds. I also >> modified the device timeout on the initiator side to 5 seconds. >> However, I don't see an 'abort' for that command. Instead, I see that >> open-iscsi logs in again after 10 seconds and sends the command again. >> SCST also cleans up the session. >> >> Can anyone explain the reason for this behavior? I would expect that >> scsi-ml on the initiator side will send an 'abort' after 5 seconds. > > IMHO 5s for a "target abort" is way too short: If your physical disk has some > problems (like seek errors, verify error, etc.) it can take very long for a > write > to complete. Usually one would prefer to see the data on disk over triggering > some > error. I don't know the spects, but I'd guess the suggested timeout is > minutes, > not seconds. Maybe it wasn't clear, but the 5 seconds timeout is only for testing & debugging. Erez --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---
Re: No abort is sent for a WRITE command that takes too long
On 14 May 2009 at 13:12, Erez Zilber wrote: > > Hi, > > I'm running a setup of open-iscsi & SCST. In order to test error > scenarios during a WRITE command, I've added a delay in SCST, so after > it receives the command, it doesn't send an R2T for 20 seconds. I also > modified the device timeout on the initiator side to 5 seconds. > However, I don't see an 'abort' for that command. Instead, I see that > open-iscsi logs in again after 10 seconds and sends the command again. > SCST also cleans up the session. > > Can anyone explain the reason for this behavior? I would expect that > scsi-ml on the initiator side will send an 'abort' after 5 seconds. IMHO 5s for a "target abort" is way too short: If your physical disk has some problems (like seek errors, verify error, etc.) it can take very long for a write to complete. Usually one would prefer to see the data on disk over triggering some error. I don't know the spects, but I'd guess the suggested timeout is minutes, not seconds. Despite of that the question is how long does the target take to abort the command. Regards, Ulrich --~--~-~--~~~---~--~~ You received this message because you are subscribed to the Google Groups "open-iscsi" group. To post to this group, send email to open-iscsi@googlegroups.com To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/open-iscsi -~--~~~~--~~--~--~---