Re: 4.6.4, 3w_sas: timeout too small?

2016-11-02 Thread Harald Dunkel
Hi Adam,

thanx very much for your detailed analysis.

Regards
Harri



Re: 4.6.4, 3w_sas: timeout too small?

2016-11-02 Thread Harald Dunkel
Hi Adam,

thanx very much for your detailed analysis.

Regards
Harri



Re: 4.6.4, 3w_sas: timeout too small?

2016-10-18 Thread adam radford
>> Oct 16 06:41:33 nasl003b kernel: [4903207.577484] 3w-sas: scsi0: ERROR: 
>> (0x06:0x000D): Microcontroller Error: clearing.

Microcontroller Error = FW crashed and will be reset on the next I/O
timeout or ioctl() timeout, whichever happens first.

It could be the iSCSI target code or initiator sent a SCSI CDB not
supported by 3ware 9750 firmware (or malformed CDB), or your 3ware
card is going bad, most likely the later.

>> Oct 16, 2016 06:45:03AM (0x04:0x005E): Cache synchronization completed: 
>> unit=0

Cache sync happens after an online controller reset.  Previously
acknowledged write back cache entries are flushed to disk at this
time.

>> The 3ware web interface says the controller was busy to write the cache to 
>> disk

It is not the case that the controller was performing some long cache
flush interval, > 60 seconds, such that you would need to adjust any
timeout values in the controller reset sequence code, i.e.
twl_reset_sequence().

-Adam


On Mon, Oct 17, 2016 at 6:09 AM, Harald Dunkel  wrote:
> Hi folks,
>
> Question about the reset timeout of the 3w_sas:
>
> For an unknown reason my 3ware 9750 RAID controller became
> unresponsive. kernel.log:
>
> Oct 16 06:30:42 nasl003b kernel: [4902556.775272] 3w-sas: scsi0: AEN: INFO 
> (0x04:0x002B): Verify completed:unit=1.
> Oct 16 06:38:29 nasl003b kernel: [4903023.388357] 3w-sas: scsi0: AEN: INFO 
> (0x04:0x002B): Verify completed:unit=2.
> Oct 16 06:41:33 nasl003b kernel: [4903207.577484] 3w-sas: scsi0: ERROR: 
> (0x06:0x000D): Microcontroller Error: clearing.
> Oct 16 06:42:11 nasl003b kernel: [4903245.701070] TMR Opcode 
> TARGET_WARM_RESET authorization failed for Initiator Node: 
> iqn.2014-01.com.example.ac.srva001
> Oct 16 06:42:11 nasl003b kernel: [4903245.706778] TMR Opcode 
> TARGET_WARM_RESET authorization failed for Initiator Node: 
> iqn.2014-01.com.example.ac.srva001
> Oct 16 06:42:15 nasl003b kernel: [4903249.298778] TMR Opcode 
> TARGET_WARM_RESET authorization failed for Initiator Node: 
> iqn.2014-01.com.example.ac.srva002
> Oct 16 06:42:15 nasl003b kernel: [4903249.304369] TMR Opcode 
> TARGET_WARM_RESET authorization failed for Initiator Node: 
> iqn.2014-01.com.example.ac.srva002
> Oct 16 06:42:15 nasl003b kernel: [4903249.318734] TMR Opcode 
> TARGET_WARM_RESET authorization failed for Initiator Node: 
> iqn.2014-01.com.example.ac.srva003
> Oct 16 06:42:15 nasl003b kernel: [4903249.324122] TMR Opcode 
> TARGET_WARM_RESET authorization failed for Initiator Node: 
> iqn.2014-01.com.example.ac.srva003
> Oct 16 06:42:41 nasl003b kernel: [4903275.739327] 3w-sas: scsi0: WARNING: 
> (0x06:0x0006): Character ioctl (0x108) timed out, resetting card.
> Oct 16 06:42:41 nasl003b kernel: [4903275.739402] sd 0:0:0:0: WARNING: 
> (0x06:0x002C): Command (0x2a) timed out, resetting card.
> Oct 16 06:43:36 nasl003b kernel: [4903330.757153] iSCSI Login timeout on 
> Network Portal 172.19.96.217:3260
> Oct 16 06:43:36 nasl003b kernel: [4903330.762124] iSCSI Login negotiation 
> failed.
> Oct 16 06:43:42 nasl003b kernel: [4903336.720892] 3w-sas: scsi0: ERROR: 
> (0x06:0x0011): Controller not ready during reset sequence.
> Oct 16 06:43:51 nasl003b kernel: [4903345.796529] iSCSI Login timeout on 
> Network Portal 172.19.96.217:3260
> Oct 16 06:43:51 nasl003b kernel: [4903345.801462] iSCSI Login negotiation 
> failed.
> Oct 16 06:44:06 nasl003b kernel: [4903360.835920] iSCSI Login timeout on 
> Network Portal 172.19.96.217:3260
> Oct 16 06:44:06 nasl003b kernel: [4903360.840843] iSCSI Login negotiation 
> failed.
> Oct 16 06:44:13 nasl003b kernel: [4903367.919662] INFO: task jbd2/sda1-8:235 
> blocked for more than 120 seconds.
> Oct 16 06:44:13 nasl003b kernel: [4903367.924633]   Tainted: G
> E   4.6.0-0.bpo.1-amd64 #1
> Oct 16 06:44:13 nasl003b kernel: [4903367.929624] "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 16 06:44:13 nasl003b kernel: [4903367.934751] jbd2/sda1-8 D 
> 88042d896000 0   235  2 0x
> Oct 16 06:44:13 nasl003b kernel: [4903367.934756]  880424c2c440 
> 88042b43a140 8804249e 8804249dfbc0
> Oct 16 06:44:13 nasl003b kernel: [4903367.934760]  7fff 
> 815c50a0 8804249dfc40 8804000daf00
> Oct 16 06:44:13 nasl003b kernel: [4903367.934763]  815c4821 
>  815c785f 7fff
> Oct 16 06:44:13 nasl003b kernel: [4903367.934766] Call Trace:
> Oct 16 06:44:13 nasl003b kernel: [4903367.934783]  [] ? 
> bit_wait_timeout+0xa0/0xa0
> Oct 16 06:44:13 nasl003b kernel: [4903367.934786]  [] ? 
> schedule+0x31/0x80
> Oct 16 06:44:13 nasl003b kernel: [4903367.934789]  [] ? 
> schedule_timeout+0x22f/0x2c0
> Oct 16 06:44:13 nasl003b kernel: [4903367.934794]  [] ? 
> blk_peek_request+0x4d/0x280
> Oct 16 06:44:13 nasl003b kernel: [4903367.934797]  [] ? 
> bit_wait_timeout+0xa0/0xa0
> :
> :
>
> The interesting point is that apparently the reset didn't succeed.
> 

Re: 4.6.4, 3w_sas: timeout too small?

2016-10-18 Thread adam radford
>> Oct 16 06:41:33 nasl003b kernel: [4903207.577484] 3w-sas: scsi0: ERROR: 
>> (0x06:0x000D): Microcontroller Error: clearing.

Microcontroller Error = FW crashed and will be reset on the next I/O
timeout or ioctl() timeout, whichever happens first.

It could be the iSCSI target code or initiator sent a SCSI CDB not
supported by 3ware 9750 firmware (or malformed CDB), or your 3ware
card is going bad, most likely the later.

>> Oct 16, 2016 06:45:03AM (0x04:0x005E): Cache synchronization completed: 
>> unit=0

Cache sync happens after an online controller reset.  Previously
acknowledged write back cache entries are flushed to disk at this
time.

>> The 3ware web interface says the controller was busy to write the cache to 
>> disk

It is not the case that the controller was performing some long cache
flush interval, > 60 seconds, such that you would need to adjust any
timeout values in the controller reset sequence code, i.e.
twl_reset_sequence().

-Adam


On Mon, Oct 17, 2016 at 6:09 AM, Harald Dunkel  wrote:
> Hi folks,
>
> Question about the reset timeout of the 3w_sas:
>
> For an unknown reason my 3ware 9750 RAID controller became
> unresponsive. kernel.log:
>
> Oct 16 06:30:42 nasl003b kernel: [4902556.775272] 3w-sas: scsi0: AEN: INFO 
> (0x04:0x002B): Verify completed:unit=1.
> Oct 16 06:38:29 nasl003b kernel: [4903023.388357] 3w-sas: scsi0: AEN: INFO 
> (0x04:0x002B): Verify completed:unit=2.
> Oct 16 06:41:33 nasl003b kernel: [4903207.577484] 3w-sas: scsi0: ERROR: 
> (0x06:0x000D): Microcontroller Error: clearing.
> Oct 16 06:42:11 nasl003b kernel: [4903245.701070] TMR Opcode 
> TARGET_WARM_RESET authorization failed for Initiator Node: 
> iqn.2014-01.com.example.ac.srva001
> Oct 16 06:42:11 nasl003b kernel: [4903245.706778] TMR Opcode 
> TARGET_WARM_RESET authorization failed for Initiator Node: 
> iqn.2014-01.com.example.ac.srva001
> Oct 16 06:42:15 nasl003b kernel: [4903249.298778] TMR Opcode 
> TARGET_WARM_RESET authorization failed for Initiator Node: 
> iqn.2014-01.com.example.ac.srva002
> Oct 16 06:42:15 nasl003b kernel: [4903249.304369] TMR Opcode 
> TARGET_WARM_RESET authorization failed for Initiator Node: 
> iqn.2014-01.com.example.ac.srva002
> Oct 16 06:42:15 nasl003b kernel: [4903249.318734] TMR Opcode 
> TARGET_WARM_RESET authorization failed for Initiator Node: 
> iqn.2014-01.com.example.ac.srva003
> Oct 16 06:42:15 nasl003b kernel: [4903249.324122] TMR Opcode 
> TARGET_WARM_RESET authorization failed for Initiator Node: 
> iqn.2014-01.com.example.ac.srva003
> Oct 16 06:42:41 nasl003b kernel: [4903275.739327] 3w-sas: scsi0: WARNING: 
> (0x06:0x0006): Character ioctl (0x108) timed out, resetting card.
> Oct 16 06:42:41 nasl003b kernel: [4903275.739402] sd 0:0:0:0: WARNING: 
> (0x06:0x002C): Command (0x2a) timed out, resetting card.
> Oct 16 06:43:36 nasl003b kernel: [4903330.757153] iSCSI Login timeout on 
> Network Portal 172.19.96.217:3260
> Oct 16 06:43:36 nasl003b kernel: [4903330.762124] iSCSI Login negotiation 
> failed.
> Oct 16 06:43:42 nasl003b kernel: [4903336.720892] 3w-sas: scsi0: ERROR: 
> (0x06:0x0011): Controller not ready during reset sequence.
> Oct 16 06:43:51 nasl003b kernel: [4903345.796529] iSCSI Login timeout on 
> Network Portal 172.19.96.217:3260
> Oct 16 06:43:51 nasl003b kernel: [4903345.801462] iSCSI Login negotiation 
> failed.
> Oct 16 06:44:06 nasl003b kernel: [4903360.835920] iSCSI Login timeout on 
> Network Portal 172.19.96.217:3260
> Oct 16 06:44:06 nasl003b kernel: [4903360.840843] iSCSI Login negotiation 
> failed.
> Oct 16 06:44:13 nasl003b kernel: [4903367.919662] INFO: task jbd2/sda1-8:235 
> blocked for more than 120 seconds.
> Oct 16 06:44:13 nasl003b kernel: [4903367.924633]   Tainted: G
> E   4.6.0-0.bpo.1-amd64 #1
> Oct 16 06:44:13 nasl003b kernel: [4903367.929624] "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 16 06:44:13 nasl003b kernel: [4903367.934751] jbd2/sda1-8 D 
> 88042d896000 0   235  2 0x
> Oct 16 06:44:13 nasl003b kernel: [4903367.934756]  880424c2c440 
> 88042b43a140 8804249e 8804249dfbc0
> Oct 16 06:44:13 nasl003b kernel: [4903367.934760]  7fff 
> 815c50a0 8804249dfc40 8804000daf00
> Oct 16 06:44:13 nasl003b kernel: [4903367.934763]  815c4821 
>  815c785f 7fff
> Oct 16 06:44:13 nasl003b kernel: [4903367.934766] Call Trace:
> Oct 16 06:44:13 nasl003b kernel: [4903367.934783]  [] ? 
> bit_wait_timeout+0xa0/0xa0
> Oct 16 06:44:13 nasl003b kernel: [4903367.934786]  [] ? 
> schedule+0x31/0x80
> Oct 16 06:44:13 nasl003b kernel: [4903367.934789]  [] ? 
> schedule_timeout+0x22f/0x2c0
> Oct 16 06:44:13 nasl003b kernel: [4903367.934794]  [] ? 
> blk_peek_request+0x4d/0x280
> Oct 16 06:44:13 nasl003b kernel: [4903367.934797]  [] ? 
> bit_wait_timeout+0xa0/0xa0
> :
> :
>
> The interesting point is that apparently the reset didn't succeed.
> "Controller not ready during reset 

4.6.4, 3w_sas: timeout too small?

2016-10-17 Thread Harald Dunkel
Hi folks,

Question about the reset timeout of the 3w_sas:

For an unknown reason my 3ware 9750 RAID controller became
unresponsive. kernel.log:

Oct 16 06:30:42 nasl003b kernel: [4902556.775272] 3w-sas: scsi0: AEN: INFO 
(0x04:0x002B): Verify completed:unit=1.
Oct 16 06:38:29 nasl003b kernel: [4903023.388357] 3w-sas: scsi0: AEN: INFO 
(0x04:0x002B): Verify completed:unit=2.
Oct 16 06:41:33 nasl003b kernel: [4903207.577484] 3w-sas: scsi0: ERROR: 
(0x06:0x000D): Microcontroller Error: clearing.
Oct 16 06:42:11 nasl003b kernel: [4903245.701070] TMR Opcode TARGET_WARM_RESET 
authorization failed for Initiator Node: iqn.2014-01.com.example.ac.srva001
Oct 16 06:42:11 nasl003b kernel: [4903245.706778] TMR Opcode TARGET_WARM_RESET 
authorization failed for Initiator Node: iqn.2014-01.com.example.ac.srva001
Oct 16 06:42:15 nasl003b kernel: [4903249.298778] TMR Opcode TARGET_WARM_RESET 
authorization failed for Initiator Node: iqn.2014-01.com.example.ac.srva002
Oct 16 06:42:15 nasl003b kernel: [4903249.304369] TMR Opcode TARGET_WARM_RESET 
authorization failed for Initiator Node: iqn.2014-01.com.example.ac.srva002
Oct 16 06:42:15 nasl003b kernel: [4903249.318734] TMR Opcode TARGET_WARM_RESET 
authorization failed for Initiator Node: iqn.2014-01.com.example.ac.srva003
Oct 16 06:42:15 nasl003b kernel: [4903249.324122] TMR Opcode TARGET_WARM_RESET 
authorization failed for Initiator Node: iqn.2014-01.com.example.ac.srva003
Oct 16 06:42:41 nasl003b kernel: [4903275.739327] 3w-sas: scsi0: WARNING: 
(0x06:0x0006): Character ioctl (0x108) timed out, resetting card.
Oct 16 06:42:41 nasl003b kernel: [4903275.739402] sd 0:0:0:0: WARNING: 
(0x06:0x002C): Command (0x2a) timed out, resetting card.
Oct 16 06:43:36 nasl003b kernel: [4903330.757153] iSCSI Login timeout on 
Network Portal 172.19.96.217:3260
Oct 16 06:43:36 nasl003b kernel: [4903330.762124] iSCSI Login negotiation 
failed.
Oct 16 06:43:42 nasl003b kernel: [4903336.720892] 3w-sas: scsi0: ERROR: 
(0x06:0x0011): Controller not ready during reset sequence.
Oct 16 06:43:51 nasl003b kernel: [4903345.796529] iSCSI Login timeout on 
Network Portal 172.19.96.217:3260
Oct 16 06:43:51 nasl003b kernel: [4903345.801462] iSCSI Login negotiation 
failed.
Oct 16 06:44:06 nasl003b kernel: [4903360.835920] iSCSI Login timeout on 
Network Portal 172.19.96.217:3260
Oct 16 06:44:06 nasl003b kernel: [4903360.840843] iSCSI Login negotiation 
failed.
Oct 16 06:44:13 nasl003b kernel: [4903367.919662] INFO: task jbd2/sda1-8:235 
blocked for more than 120 seconds.
Oct 16 06:44:13 nasl003b kernel: [4903367.924633]   Tainted: GE 
  4.6.0-0.bpo.1-amd64 #1
Oct 16 06:44:13 nasl003b kernel: [4903367.929624] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 16 06:44:13 nasl003b kernel: [4903367.934751] jbd2/sda1-8 D 
88042d896000 0   235  2 0x
Oct 16 06:44:13 nasl003b kernel: [4903367.934756]  880424c2c440 
88042b43a140 8804249e 8804249dfbc0
Oct 16 06:44:13 nasl003b kernel: [4903367.934760]  7fff 
815c50a0 8804249dfc40 8804000daf00
Oct 16 06:44:13 nasl003b kernel: [4903367.934763]  815c4821 
 815c785f 7fff
Oct 16 06:44:13 nasl003b kernel: [4903367.934766] Call Trace:
Oct 16 06:44:13 nasl003b kernel: [4903367.934783]  [] ? 
bit_wait_timeout+0xa0/0xa0
Oct 16 06:44:13 nasl003b kernel: [4903367.934786]  [] ? 
schedule+0x31/0x80
Oct 16 06:44:13 nasl003b kernel: [4903367.934789]  [] ? 
schedule_timeout+0x22f/0x2c0
Oct 16 06:44:13 nasl003b kernel: [4903367.934794]  [] ? 
blk_peek_request+0x4d/0x280
Oct 16 06:44:13 nasl003b kernel: [4903367.934797]  [] ? 
bit_wait_timeout+0xa0/0xa0
:
:

The interesting point is that apparently the reset didn't succeed.
"Controller not ready during reset sequence" at 06:43:42. The 3ware
web interface says the controller was busy to write the cache to disk:

Oct 16, 2016 06:45:03AM (0x04:0x005E): Cache synchronization completed: unit=0

I have the impression that the timeout (60 secs?) in the driver is
insufficient. Do you think the status check in twl_reset_sequence()
could be improved?

There were no disk failures, of course.


Every helpful comment is highly appreciated.
Harri


4.6.4, 3w_sas: timeout too small?

2016-10-17 Thread Harald Dunkel
Hi folks,

Question about the reset timeout of the 3w_sas:

For an unknown reason my 3ware 9750 RAID controller became
unresponsive. kernel.log:

Oct 16 06:30:42 nasl003b kernel: [4902556.775272] 3w-sas: scsi0: AEN: INFO 
(0x04:0x002B): Verify completed:unit=1.
Oct 16 06:38:29 nasl003b kernel: [4903023.388357] 3w-sas: scsi0: AEN: INFO 
(0x04:0x002B): Verify completed:unit=2.
Oct 16 06:41:33 nasl003b kernel: [4903207.577484] 3w-sas: scsi0: ERROR: 
(0x06:0x000D): Microcontroller Error: clearing.
Oct 16 06:42:11 nasl003b kernel: [4903245.701070] TMR Opcode TARGET_WARM_RESET 
authorization failed for Initiator Node: iqn.2014-01.com.example.ac.srva001
Oct 16 06:42:11 nasl003b kernel: [4903245.706778] TMR Opcode TARGET_WARM_RESET 
authorization failed for Initiator Node: iqn.2014-01.com.example.ac.srva001
Oct 16 06:42:15 nasl003b kernel: [4903249.298778] TMR Opcode TARGET_WARM_RESET 
authorization failed for Initiator Node: iqn.2014-01.com.example.ac.srva002
Oct 16 06:42:15 nasl003b kernel: [4903249.304369] TMR Opcode TARGET_WARM_RESET 
authorization failed for Initiator Node: iqn.2014-01.com.example.ac.srva002
Oct 16 06:42:15 nasl003b kernel: [4903249.318734] TMR Opcode TARGET_WARM_RESET 
authorization failed for Initiator Node: iqn.2014-01.com.example.ac.srva003
Oct 16 06:42:15 nasl003b kernel: [4903249.324122] TMR Opcode TARGET_WARM_RESET 
authorization failed for Initiator Node: iqn.2014-01.com.example.ac.srva003
Oct 16 06:42:41 nasl003b kernel: [4903275.739327] 3w-sas: scsi0: WARNING: 
(0x06:0x0006): Character ioctl (0x108) timed out, resetting card.
Oct 16 06:42:41 nasl003b kernel: [4903275.739402] sd 0:0:0:0: WARNING: 
(0x06:0x002C): Command (0x2a) timed out, resetting card.
Oct 16 06:43:36 nasl003b kernel: [4903330.757153] iSCSI Login timeout on 
Network Portal 172.19.96.217:3260
Oct 16 06:43:36 nasl003b kernel: [4903330.762124] iSCSI Login negotiation 
failed.
Oct 16 06:43:42 nasl003b kernel: [4903336.720892] 3w-sas: scsi0: ERROR: 
(0x06:0x0011): Controller not ready during reset sequence.
Oct 16 06:43:51 nasl003b kernel: [4903345.796529] iSCSI Login timeout on 
Network Portal 172.19.96.217:3260
Oct 16 06:43:51 nasl003b kernel: [4903345.801462] iSCSI Login negotiation 
failed.
Oct 16 06:44:06 nasl003b kernel: [4903360.835920] iSCSI Login timeout on 
Network Portal 172.19.96.217:3260
Oct 16 06:44:06 nasl003b kernel: [4903360.840843] iSCSI Login negotiation 
failed.
Oct 16 06:44:13 nasl003b kernel: [4903367.919662] INFO: task jbd2/sda1-8:235 
blocked for more than 120 seconds.
Oct 16 06:44:13 nasl003b kernel: [4903367.924633]   Tainted: GE 
  4.6.0-0.bpo.1-amd64 #1
Oct 16 06:44:13 nasl003b kernel: [4903367.929624] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 16 06:44:13 nasl003b kernel: [4903367.934751] jbd2/sda1-8 D 
88042d896000 0   235  2 0x
Oct 16 06:44:13 nasl003b kernel: [4903367.934756]  880424c2c440 
88042b43a140 8804249e 8804249dfbc0
Oct 16 06:44:13 nasl003b kernel: [4903367.934760]  7fff 
815c50a0 8804249dfc40 8804000daf00
Oct 16 06:44:13 nasl003b kernel: [4903367.934763]  815c4821 
 815c785f 7fff
Oct 16 06:44:13 nasl003b kernel: [4903367.934766] Call Trace:
Oct 16 06:44:13 nasl003b kernel: [4903367.934783]  [] ? 
bit_wait_timeout+0xa0/0xa0
Oct 16 06:44:13 nasl003b kernel: [4903367.934786]  [] ? 
schedule+0x31/0x80
Oct 16 06:44:13 nasl003b kernel: [4903367.934789]  [] ? 
schedule_timeout+0x22f/0x2c0
Oct 16 06:44:13 nasl003b kernel: [4903367.934794]  [] ? 
blk_peek_request+0x4d/0x280
Oct 16 06:44:13 nasl003b kernel: [4903367.934797]  [] ? 
bit_wait_timeout+0xa0/0xa0
:
:

The interesting point is that apparently the reset didn't succeed.
"Controller not ready during reset sequence" at 06:43:42. The 3ware
web interface says the controller was busy to write the cache to disk:

Oct 16, 2016 06:45:03AM (0x04:0x005E): Cache synchronization completed: unit=0

I have the impression that the timeout (60 secs?) in the driver is
insufficient. Do you think the status check in twl_reset_sequence()
could be improved?

There were no disk failures, of course.


Every helpful comment is highly appreciated.
Harri