Configuration question

2008-08-19 Thread v42bis

I have changed the timeouts in iscsid.conf as directed in the iSCSI
Root section 8.2 of the README so that in case the open-iscsi
initiator loses connection or has other communications problems with
my OpenSolaris target then the open-iscsi initiator will wait for up
to 24 hours before it fails to the SCSI layer:

node.session.timeo.replacement_timeout = 86400
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 0
node.conn[0].timeo.noop_out_timeout = 0
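
As a sanity check on the values above, here is a minimal sketch that reads settings in this "key = value" style and converts the replacement timeout to hours. The helper name `parse_iscsid_conf` and the inline sample text are assumptions made for illustration; this is not part of open-iscsi.

```python
# Sketch only: parse "key = value" lines in the style of iscsid.conf.
# parse_iscsid_conf is a hypothetical helper, not an open-iscsi API.

SAMPLE = """\
node.session.timeo.replacement_timeout = 86400
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 0
node.conn[0].timeo.noop_out_timeout = 0
"""

def parse_iscsid_conf(text):
    """Return a dict of settings, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            settings[key.strip()] = value.strip()
    return settings

conf = parse_iscsid_conf(SAMPLE)
timeout_s = int(conf["node.session.timeo.replacement_timeout"])
print(f"replacement_timeout = {timeout_s} s = {timeout_s / 3600:g} hours")
```

With the sample text this reports 86400 s as 24 hours, which matches the intent described above.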

My OpenSolaris target recently core dumped and came back online in
about 5 minutes. By that time, all of my ext3 partitions mounted over
iscsi had aborted their journals. Shouldn't iscsi wait for 24 hours
before I see any failures on my SCSI layer affecting my ext3
partitions?

Am I missing some other configuration? Any help appreciated.

Thank you,
--
Dave
--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: Configuration question

2008-08-19 Thread Mike Christie

v42bis wrote:
> I have changed the timeouts in iscsid.conf as directed in the iSCSI
> Root section 8.2 of the README so that in case the open-iscsi
> initiator loses connection or has other communications problems with
> my OpenSolaris target then the open-iscsi initiator will wait for up
> to 24 hours before it fails to the SCSI layer:
> 
> node.session.timeo.replacement_timeout = 86400
> node.conn[0].timeo.login_timeout = 15
> node.conn[0].timeo.logout_timeout = 15
> node.conn[0].timeo.noop_out_interval = 0
> node.conn[0].timeo.noop_out_timeout = 0
> 
> My OpenSolaris target recently core dumped and came back online in
> about 5 minutes. By that time, all of my ext3 partitions mounted over
> iscsi had aborted their journals. Shouldn't iscsi wait for 24 hours
> before I see any failures on my SCSI layer affecting my ext3
> partitions?

It should have. Do you have the logs? Do you see something about the
replacement or recovery timeout timing out? The message would show the
correct 86400 value, but the log timestamps would show that it fired a
lot sooner, like the 5 minutes you mention. If this happens you may be
hitting a bug where the kernel cannot support long timeouts: basically
the kernel's timer is rolling over and not catching itself right, or
maybe we are not supposed to be setting it that high. We are still
investigating to see who is at fault.




Re: Configuration question

2008-08-20 Thread v42bis


On Aug 20, 12:46 am, Mike Christie <[EMAIL PROTECTED]> wrote:
> v42bis wrote:
> > I have changed the timeouts in iscsid.conf as directed in the iSCSI
> > Root section 8.2 of the README so that in case the open-iscsi
> > initiator loses connection or has other communications problems with
> > my OpenSolaris target then the open-iscsi initiator will wait for up
> > to 24 hours before it fails to the SCSI layer:
>
> > node.session.timeo.replacement_timeout = 86400
> > node.conn[0].timeo.login_timeout = 15
> > node.conn[0].timeo.logout_timeout = 15
> > node.conn[0].timeo.noop_out_interval = 0
> > node.conn[0].timeo.noop_out_timeout = 0
>
> > My OpenSolaris target recently core dumped and came back online in
> > about 5 minutes. By that time, all of my ext3 partitions mounted over
> > iscsi had aborted their journals. Shouldn't iscsi wait for 24 hours
> > before I see any failures on my SCSI layer affecting my ext3
> > partitions?
>
> It should have. Do you have the logs? Do you see something about the
> replacement or recovery timeout timing out. It would have the correct
> 86400 value, but when you look at the log it would say that it failed a
> lot quicker like the 5 minutes you mention. If this happens you may be
> hitting a bug where the kernel cannot support long timeouts and
> basically what is happening is the kernel's timer is rolling over and
> not caching it self right or maybe we are not supposed to be setting
> that high. We are still investigating to see who is at fault.


Thanks for the reply, Mike.

The iscsi connections failed about 1m13s after my iscsi target went
down (the timestamps that follow are synced from the same NTP master;
however, clock skew may account for a few seconds' difference [1m45sec
seems very conspicuous - a multiple of the default 15sec timers?]). The
target went down at Aug 19 13:33:33.

From /var/log/messages of one of my open-iscsi clients with two
sessions active and ext3 filesystems mounted from each at the time of
target failure:

Aug 19 13:34:46 ak1-vz2 kernel:  connection2:0: iscsi: detected conn error (1011)
Aug 19 13:35:47 ak1-vz2 kernel:  connection1:0: iscsi: detected conn error (1011)
Aug 19 13:36:38 ak1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined - not ready after error recovery
Aug 19 13:36:38 ak1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined - not ready after error recovery
Aug 19 13:36:38 ak1-vz2 kernel: sd 6:0:0:0: SCSI error: return code = 0x0002
Aug 19 13:36:38 ak1-vz2 kernel: end_request: I/O error, dev sdc, sector 4063233
Aug 19 13:36:38 ak1-vz2 kernel: lost page write due to I/O error on sdc1
Aug 19 13:36:38 ak1-vz2 kernel: sd 6:0:0:0: SCSI error: return code = 0x0002
Aug 19 13:36:38 ak1-vz2 kernel: end_request: I/O error, dev sdc, sector 4157905
Aug 19 13:36:38 ak1-vz2 kernel: lost page write due to I/O error on sdc1
Aug 19 13:36:39 ak1-vz2 kernel: iscsi: scsi conn_destroy(): host_busy 0 host_failed 0
Aug 19 13:36:39 ak1-vz2 kernel: lost page write due to I/O error on sdc1
Aug 19 13:36:39 ak1-vz2 last message repeated 2 times
Aug 19 13:36:41 ak1-vz2 kernel: sd 7:0:0:0: scsi: Device offlined - not ready after error recovery
Aug 19 13:36:41 ak1-vz2 kernel: sd 7:0:0:0: scsi: Device offlined - not ready after error recovery
Aug 19 13:36:41 ak1-vz2 kernel: sd 7:0:0:0: SCSI error: return code = 0x0002
Aug 19 13:36:41 ak1-vz2 kernel: end_request: I/O error, dev sdd, sector 126137
Aug 19 13:36:41 ak1-vz2 kernel: lost page write due to I/O error on sdd1
Aug 19 13:36:41 ak1-vz2 last message repeated 4 times
Aug 19 13:36:41 ak1-vz2 kernel: sd 7:0:0:0: SCSI error: return code = 0x0002
Aug 19 13:36:41 ak1-vz2 kernel: end_request: I/O error, dev sdd, sector 33214121
Aug 19 13:36:42 ak1-vz2 kernel: iscsi: scsi conn_destroy(): host_busy 0 host_failed 0
Aug 19 13:36:42 ak1-vz2 kernel: sd 7:0:0:0: SCSI error: return code = 0x0001
Aug 19 13:36:42 ak1-vz2 kernel: end_request: I/O error, dev sdd, sector 33216097
Aug 19 13:36:42 ak1-vz2 kernel: __journal_remove_journal_head: freeing b_committed_data
Aug 19 13:36:45 ak1-vz2 kernel: reading directory #1245317 offset 0
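
To put numbers on the timeline, the following sketch computes how long after the reported target failure (13:33:33) a few of the logged events occurred. The timestamps are copied from the log above; treating them all as the same day, and the short event labels, are assumptions made for the example.

```python
from datetime import datetime

# All events fall on Aug 19, so only the time of day is parsed.
FMT = "%H:%M:%S"
target_down = datetime.strptime("13:33:33", FMT)

# (label, timestamp) pairs taken from the log excerpt; labels are informal.
events = [
    ("connection2:0 conn error (1011)", "13:34:46"),
    ("connection1:0 conn error (1011)", "13:35:47"),
    ("sd 6:0:0:0 offlined", "13:36:38"),
    ("sd 7:0:0:0 offlined", "13:36:41"),
]

def elapsed_seconds(stamp):
    """Seconds between the target going down and the given time of day."""
    return int((datetime.strptime(stamp, FMT) - target_down).total_seconds())

for label, stamp in events:
    secs = elapsed_seconds(stamp)
    print(f"{label}: +{secs // 60}m{secs % 60:02d}s ({secs} s)")
```

Under those assumptions the first connection error comes 73 seconds after the target went down and the devices are offlined after roughly three minutes, both far short of the 86400-second replacement_timeout.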

I have seen the following logs in the past when my iscsi target
machine fails over to a multipath/bonded NIC within about 30 seconds:

Jul 29 08:20:09 ak1-vz3.aktiom.net iscsid: received iferror -38
Jul 29 08:20:09 ak1-vz3.aktiom.net iscsid: connection8:0 is operational now

The above did not affect normal operation of my open-iscsi initiators.

This is the only debug info I have. I didn't install open-iscsi with
debug enabled.

--
Dave




Re: Configuration question

2008-08-20 Thread Mike Christie

v42bis wrote:
> Thank for the reply, Mike.
> 

No problem.

> The iscsi connections failed about 1m13s after my iscsi target went
> down (timestamps that follow are synced from same ntp master, however
> clock skew may account for a few seconds difference [1m45sec seems
> very conspicuous - a multiplier of default 15sec timers?]). The target
> went down at Aug 19 13:33:33.

Actually this looks like a different problem. What version of open-iscsi 
are you using? Do a "iscsiadm -P 3". The top part should dump the 
iscsiadm version.


> Aug 19 13:36:42 ak1-vz2 kernel: iscsi: scsi conn_destroy(): host_busy
> 0 host_failed 0

This means that userspace decided to kill the iscsi session/connection.
When that happens we ignore the recovery/replacement timeout and just
kill everything, which forces IO errors. We used to do this only for
fatal errors, and current versions should not do it at all anymore.
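
A rough sketch for spotting this pattern in a saved log: it counts lines containing the strings that appear in the excerpt quoted in this thread. The marker names and the `classify` helper are assumptions made for illustration, and other open-iscsi or kernel versions may word these messages differently.

```python
# Sketch: count log lines that match strings seen in this thread's excerpt.
# The dictionary keys are informal labels, not open-iscsi terminology.
MARKERS = {
    "conn_error": "detected conn error",
    "teardown": "conn_destroy()",
    "offlined": "Device offlined",
}

def classify(lines):
    """Return how many lines contain each marker string."""
    counts = {name: 0 for name in MARKERS}
    for line in lines:
        for name, needle in MARKERS.items():
            if needle in line:
                counts[name] += 1
    return counts

sample = [
    "Aug 19 13:34:46 ak1-vz2 kernel:  connection2:0: iscsi: detected conn error (1011)",
    "Aug 19 13:36:38 ak1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined - not ready after error recovery",
    "Aug 19 13:36:39 ak1-vz2 kernel: iscsi: scsi conn_destroy(): host_busy 0 host_failed 0",
]
print(classify(sample))
```

A non-zero "teardown" count alongside I/O errors would be consistent with the session having been destroyed rather than a timeout expiring.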

> 
> The above did not affect normal operation of my open-iscsi initiators.
> 

That is weirder. In this setup do you have multiple 
sessions/connections? When you checked the machine were all the 
session/connections running? There should have been two sessions that 
were destroyed.

In older open-iscsi userspace tools there were certain errors the target 
could send us and iscsid would consider it a fatal error and it would 
kill the sessions like above. For example if a target was shutting down 
it could tell us that it was not coming back, so we would kill the 
session. There was also a case where iscsid got confused and thought it 
was a fatal error and would kill the session. We now just retry forever 
or until the user kills the session manually to avoid problems like this.

Please tell me you were using an older version than open-iscsi-2.0-869.2 
:) If you were using open-iscsi-2.0-869.2 then we have a different 
problem :(




Re: Configuration question

2008-08-20 Thread v42bis



On Aug 20, 1:39 am, Mike Christie <[EMAIL PROTECTED]> wrote:
> v42bis wrote:
> > Thank for the reply, Mike.
>
> No problem.
>
> > The iscsi connections failed about 1m13s after my iscsi target went
> > down (timestamps that follow are synced from same ntp master, however
> > clock skew may account for a few seconds difference [1m45sec seems
> > very conspicuous - a multiplier of default 15sec timers?]). The target
> > went down at Aug 19 13:33:33.
>
> Actually this looks like a different problem. What version of open-iscsi
> are you using? Do a "iscsiadm -P 3". The top part should dump the
> iscsiadm version.

`iscsiadm -P 3` just spits out the usage/help information - no
version. I know it is version open-iscsi-2.0-865.15, though.

>
> > Aug 19 13:36:42 ak1-vz2 kernel: iscsi: scsi conn_destroy(): host_busy
> > 0 host_failed 0
>
> This means that userspace decided to kill the iscsi session/connection
> which means that we ignore the recovery/replacement timeout and just
> kill everything which forces IO errors. We only did this for fatal
> errors, but we should not do that anymore.

What userspace process would have done that?

>
> > The above did not affect normal operation of my open-iscsi initiators.
>
> That is weirder. In this setup do you have multiple
> sessions/connections? When you checked the machine were all the
> session/connections running? There should have been two sessions that
> were destroyed.

Only one session per connection. One connection to each iscsi target.

All of the filesystems and iscsi connections seemed fine, as far as I
could tell.

>
> In older open-iscsi userspace tools there were certain errors the target
> could send us and iscsid would consider it a fatal error and it would
> kill the sessions like above. For example if a target was shutting down
> it could tell us that it was not coming back, so we would kill the
> session. There was also a case where iscsid got confused and thought it
> was a fatal error and would kill the session. We now just retry forever
> or until the user kills the session manually to avoid problems like this.

To confirm: open-iscsi version 2.0-869.2 and above will never kill
iscsi sessions unless the user explicitly tells iscsid to logout/kill
the session? I want to make sure my open-iscsi initiators never return
errors until replacement_timeout is reached. I'd rather have any
processes accessing filesystems on iscsi hang forever than have the
connections lost and journals aborted.

Looking at the code, there is no problem with setting such a high
replacement_timeout?

>
> Please tell me you were using a older version than open-iscsi-2.0-869.2
> :) If you were using open-iscsi-2.0-869.2 then we have a different
> problem :(

I am definitely running 2.0-865.15. I will upgrade to 2.0-869.2.
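
For scripting the same check across several initiators, here is a small sketch that compares two version strings of the form used in this thread. It assumes the scheme is major.minor-build.patch and that comparing the numeric fields in order is meaningful; `parse_version` is a hypothetical helper, not something shipped with open-iscsi.

```python
import re

def parse_version(text):
    """Split a string like 'open-iscsi-2.0-865.15' into a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)-(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    # A missing patch field (e.g. '2.0-869') is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())

installed = parse_version("open-iscsi-2.0-865.15")
reference = parse_version("open-iscsi-2.0-869.2")
print(installed, reference, "older" if installed < reference else "not older")
```

With these two strings the installed version compares as older than 2.0-869.2.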

It would be *very* convenient if the Changelog would include changes
in every version and not just the current release. :)





Re: Configuration question

2008-08-21 Thread Mike Christie

v42bis wrote:
> 
> 
> On Aug 20, 1:39 am, Mike Christie <[EMAIL PROTECTED]> wrote:
>> v42bis wrote:
>>> Thank for the reply, Mike.
>> No problem.
>>
>>> The iscsi connections failed about 1m13s after my iscsi target went
>>> down (timestamps that follow are synced from same ntp master, however
>>> clock skew may account for a few seconds difference [1m45sec seems
>>> very conspicuous - a multiplier of default 15sec timers?]). The target
>>> went down at Aug 19 13:33:33.
>> Actually this looks like a different problem. What version of open-iscsi
>> are you using? Do a "iscsiadm -P 3". The top part should dump the
>> iscsiadm version.
> 
> `iscsiadm -P 3` just spits out the usage/help information - no
> version. I know it is version open-iscsi-2.0-865.15, though.

Ah, older versions had a private info argument for debugging. It later 
became stable as -P. Try "iscsiadm -m --info"


> 
>>> Aug 19 13:36:42 ak1-vz2 kernel: iscsi: scsi conn_destroy(): host_busy
>>> 0 host_failed 0
>> This means that userspace decided to kill the iscsi session/connection
>> which means that we ignore the recovery/replacement timeout and just
>> kill everything which forces IO errors. We only did this for fatal
>> errors, but we should not do that anymore.
> 
> What userspace process would have done that?

The iscsi userspace daemon that handles iscsi errors and does the 
login/relogin and session/connection management, iscsid.


> 
>>> The above did not affect normal operation of my open-iscsi initiators.
>> That is weirder. In this setup do you have multiple
>> sessions/connections? When you checked the machine were all the
>> session/connections running? There should have been two sessions that
>> were destroyed.
> 
> Only one session per connection. One connection to each iscsi target.
> 
> All of the filesystems and iscsi connections seemed fine, as far as I
> could tell.
> 
>> In older open-iscsi userspace tools there were certain errors the target
>> could send us and iscsid would consider it a fatal error and it would
>> kill the sessions like above. For example if a target was shutting down
>> it could tell us that it was not coming back, so we would kill the
>> session. There was also a case where iscsid got confused and thought it
>> was a fatal error and would kill the session. We now just retry forever
>> or until the user kills the session manually to avoid problems like this.
> 
> To confirm: open-iscsi version 2.0-869.2 and above will never kill
> iscsi sessions unless the user explicitly tells iscsid to logout/kill

Right.

> the session? I want to make sure my open-iscsi initiators never return
> errors until replacement_timeout is reached. I'd rather have any
> processes accessing filesystems on iscsi hang forever than have the
> connections lost and journals aborted.
> 
> Looking at the code, there is no problem with setting such a high
> replacement_timeout?

With the kernel time code or the iscsi code that handles the timer? As a 
quick test, try setting the timer to 10 days and the nop times to 5 
seconds. Unplug the cable and in about 10 seconds you will see the ping 
timeout message. Then shortly after that (within minutes instead of 
days) you should see the recovery/replacement timed out message.
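
A sketch of the settings this test implies, using the parameter names from the start of the thread. Mapping "nop times" to both noop_out_interval and noop_out_timeout is an assumption, as is printing them in iscsid.conf style.

```python
from datetime import timedelta

# 10 days expressed in seconds for replacement_timeout.
replacement_timeout = int(timedelta(days=10).total_seconds())
nop_seconds = 5  # assumed to apply to both the nop interval and nop timeout

settings = {
    "node.session.timeo.replacement_timeout": replacement_timeout,
    "node.conn[0].timeo.noop_out_interval": nop_seconds,
    "node.conn[0].timeo.noop_out_timeout": nop_seconds,
}
for key, value in settings.items():
    print(f"{key} = {value}")

# Interval plus timeout gives the "about 10 seconds" before the ping timeout.
print("expected ping timeout after about", 2 * nop_seconds, "seconds")
```

Ten days works out to 864000 seconds, and the 5-second interval plus 5-second timeout accounts for the roughly 10 seconds mentioned above.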


> 
>> Please tell me you were using a older version than open-iscsi-2.0-869.2
>> :) If you were using open-iscsi-2.0-869.2 then we have a different
>> problem :(
> 
> I am definitely running 2.0-865.15. I will upgrade to 2.0-869.2.
> 
> It would be *very* convenient if the Changelog would include changes
> in every version and not just the current release. :)
> 

Will start that on the next release.





Re: Configuration question

2008-08-21 Thread Mike Christie

Mike Christie wrote:
> v42bis wrote:
>>
>> On Aug 20, 1:39 am, Mike Christie <[EMAIL PROTECTED]> wrote:
>>> v42bis wrote:
>>>> Thank for the reply, Mike.
>>> No problem.
>>>
>>>> The iscsi connections failed about 1m13s after my iscsi target went
>>>> down (timestamps that follow are synced from same ntp master, however
>>>> clock skew may account for a few seconds difference [1m45sec seems
>>>> very conspicuous - a multiplier of default 15sec timers?]). The target
>>>> went down at Aug 19 13:33:33.
>>> Actually this looks like a different problem. What version of open-iscsi
>>> are you using? Do a "iscsiadm -P 3". The top part should dump the
>>> iscsiadm version.
>> `iscsiadm -P 3` just spits out the usage/help information - no
>> version. I know it is version open-iscsi-2.0-865.15, though.
> 
> Ah older versions had private info argument for debugging. It later 
> become stable as -P. Try "iscsiadm -m --info"
> 
> 
>>>> Aug 19 13:36:42 ak1-vz2 kernel: iscsi: scsi conn_destroy(): host_busy
>>>> 0 host_failed 0
>>> This means that userspace decided to kill the iscsi session/connection
>>> which means that we ignore the recovery/replacement timeout and just
>>> kill everything which forces IO errors. We only did this for fatal
>>> errors, but we should not do that anymore.
>> What userspace process would have done that?
> 
> The iscsi userspace daemon that handles iscsi errors and does the 
> login/relogin and session/connection management, iscsid.
> 
> 
>>>> The above did not affect normal operation of my open-iscsi initiators.
>>> That is weirder. In this setup do you have multiple
>>> sessions/connections? When you checked the machine were all the
>>> session/connections running? There should have been two sessions that
>>> were destroyed.
>> Only one session per connection. One connection to each iscsi target.
>>
>> All of the filesystems and iscsi connections seemed fine, as far as I
>> could tell.
>>
>>> In older open-iscsi userspace tools there were certain errors the target
>>> could send us and iscsid would consider it a fatal error and it would
>>> kill the sessions like above. For example if a target was shutting down
>>> it could tell us that it was not coming back, so we would kill the
>>> session. There was also a case where iscsid got confused and thought it
>>> was a fatal error and would kill the session. We now just retry forever
>>> or until the user kills the session manually to avoid problems like this.
>> To confirm: open-iscsi version 2.0-869.2 and above will never kill
>> iscsi sessions unless the user explicitly tells iscsid to logout/kill
> 
> Right.
> 
>> the session? I want to make sure my open-iscsi initiators never return
>> errors until replacement_timeout is reached. I'd rather have any
>> processes accessing filesystems on iscsi hang forever than have the
>> connections lost and journals aborted.
>>
>> Looking at the code, there is no problem with setting such a high
>> replacement_timeout?
> 
> With the kernel time code or iscsi code that handles the timer? As a 
> quick test try setting the timer to 10 days and set the nop times to 5 
> seconds. Unplug the cable and in about 10 seconds you will see the ping 
> timeout message. Then shortly after (within minutes instead of days) 
> that you should see the recovery/replacment timed out message.
>

Actually that is a waste of time. It looks like not everyone is hitting 
it, and it may be due to the kernel config and having the right timing.
