RE: aacraid: Host adapter reset request. SCSI hang ?

Salyzyn, Mark Mon, 18 Feb 2008 06:49:30 -0800

This 'Host adapter reset request. SCSI hang ?' message hides a rather 
complicated-to-explain underlying hardware behavior.


Aacraid based controllers have an underlying timeout/recovery cycle that is 35
seconds long. There is a driver patch for this, post RHEL5 (but going into
RHEL5.2), that increases the default Linux per-device timeout to 45 seconds. The
default in some SCSI subsystems was 60 seconds in the past, but is now
standardized at 30 seconds, which results in an interference pattern between the
controller and Linux's SCSI subsystem. The alternate workaround is for the user
to adjust the timeout in sysfs if it is shorter than 45 seconds. This is the
only likely Linux driver issue that you may be having; everything else is out of
the scope of Linux or the aacraid driver and needs to be addressed separately.
This is not a panacea, so do NOT get your hopes up, and do NOT feel that the
warning behavior is in fact a problem that needs to be solved (!)
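
For those who would rather script the sysfs workaround than do it by hand, here
is a minimal Python sketch. It assumes the usual sysfs layout where each SCSI
disk exposes its timeout at /sys/block/sdX/device/timeout; the function name,
the `dry_run` switch and the `sd*` pattern are my own choices for illustration,
and the 45 second floor is simply the patched default mentioned above.

```python
from pathlib import Path

# Assumption: 45 seconds mirrors the patched driver default described above.
MIN_TIMEOUT = 45


def raise_short_timeouts(sys_block="/sys/block", minimum=MIN_TIMEOUT, dry_run=True):
    """Return {device: (old, new)} for disks whose timeout is below `minimum`.

    With dry_run=False the new value is also written back, which needs root
    on a real system. Devices already at or above `minimum` are left alone.
    """
    changed = {}
    for attr in sorted(Path(sys_block).glob("sd*/device/timeout")):
        current = int(attr.read_text().strip())
        if current < minimum:
            if not dry_run:
                attr.write_text(f"{minimum}\n")
            # attr is <sys_block>/<dev>/device/timeout, so two parents up is <dev>
            changed[attr.parent.parent.name] = (current, minimum)
    return changed
```

Calling `raise_short_timeouts()` only reports what would change; passing
`dry_run=False` applies it. A value written this way does not survive a reboot,
so it would have to be reapplied at boot by whatever mechanism your
distribution offers.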

Keep in mind that I/O completions on these controllers, when everything is
working, are typically in the millisecond region or less.

The 3405 (or generically any aacraid based controller) is likely going through
an error correction cycle on the SAS/SATA bus that is delaying the completion
of I/O beyond the Linux default timeout set for the device. This may be a
hardware issue (i.e., a driver or OS change will not fix anything) and typically
will be of little concern to the Linux Kernel or Distribution folks. Or this may
be a problem with an overly aggressive default timeout value, as outlined
immediately above. The fact that all you get are these messages and I/O
continues indicates that everything is working as designed, dealing with or
working around whatever the hardware issue is.

That does not mean the driver is free and clear; perhaps we are reaching a
fundamental resource limit in the controller, on the bus, in the enclosure or
in the drives. You may be able to mitigate this by adjusting the maximum queue
depth down somewhat. You can adjust this per device in sysfs, or indirectly
through the driver's controller-wide limit using the insmod parameter numacb. If
you find this to be the case and it is prevalent, we may need to adjust the
rules within the driver regarding load balancing and queue depths to improve
the general reliability and responsiveness of the system. It is a balance
between reliability, performance, responsiveness and periodic warning messages
in your logs. Target devices typically have the ability to handle at least 32
outstanding commands, often more.
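
As a rough illustration of the sysfs route, the sketch below lowers a single
device's queue depth. It assumes the attribute lives at
/sys/block/sdX/device/queue_depth; the helper name and its arguments are
hypothetical, and the limit you pass in is something to experiment with, not a
recommended value.

```python
from pathlib import Path


def cap_queue_depth(device, limit, sys_block="/sys/block"):
    """Lower <sys_block>/<device>/device/queue_depth to `limit` if it is higher.

    Returns (old, new). The depth is never raised. Writing needs root on a
    real system, and the driver may still clamp the value it accepts.
    """
    attr = Path(sys_block) / device / "device" / "queue_depth"
    old = int(attr.read_text().strip())
    new = min(old, limit)
    if new != old:
        attr.write_text(f"{new}\n")
    return old, new
```

For example, `cap_queue_depth("sda", 32)` would bring a deeper queue on sda
down to 32 and leave a shallower one untouched. It is worth reading the
attribute back afterwards to see what the driver actually settled on.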

My guess is that if the drives, enclosure and the controllers are all up to
date (latest firmware on all), swapping components has not resulted in any
change in behavior, and this problem persists at per-device queue depths below
32 and timeouts of 60 seconds, then you likely have a drive compatibility
problem.

The first incompatibility that comes to mind, and that needs to be ruled out,
is the use of desktop class (typically self correcting, which can take more
than ten seconds to respond to I/O requests) or off-spec (i.e., higher error
rate drives often shipped into low-budget markets) drives. The controller needs
to work with these drives, and it does, dealing with them in an as-designed
manner. If these are the drives you have, then you are getting what you paid
for. If this is the case, you can reduce the annoyance by increasing the
component's timeout by programming an updated value in sysfs.

Keep in mind that if you intended to purchase enterprise class drives and the
supplier (or the manufacturer) supplied you desktop class drives, they are
often very willing to swap out the desktop class drives in exchange for
enterprise class drives. The difference is often as simple as the drive
firmware behavior.

I would not increase this timeout much beyond 120 seconds, for anything larger
starts affecting your rough guarantee of I/O delivery required for server
network connections. The timeout results in a quiescing of I/O to the device and
in the servicing of any starved requests that may be backed up behind resource
limits in the components, and thus also has an effect on dealing with the
fundamental resource limit issues indicated above. The messages you are seeing,
although annoying, are actually a desirable, albeit clunky, handshake somewhat
ensuring the guarantee of I/O delivery. It is clunky because it also leads to a
pause in more recently submitted I/O as it offers up ten seconds of silence in
its honor :-(

Another problem can arise surrounding the enclosure compatibility, for it may
be negotiating in bad faith with the components, or have an incompatibility
with the enclosure services in the controller, especially if the enclosure has
not been certified for use with the controller.

Another is that the drive is server/enterprise class but, because it is a
late-model unit, has never been certified with the controller and has an odd
behavior not yet worked around within the controller's firmware.

If either of these incompatibilities is the case, it is in Adaptec's interest 
to work with you via the technical support department to resolve this problem. 
They will no doubt immediately ask you for the 'diagnostic dump'. If that does 
not bear any fruit, then they will work with you to either internally duplicate 
the problem, or as a last resort require your system to duplicate the issue to 
acquire the all-important SAS/SATA trace.

I hope you can see why these messages are discussed ad nauseam on the network,
with fingers pointing in all different directions. Good luck working on this
issue now armed with this understanding!

My cookbook for the 'aacraid: Host adapter reset request. SCSI hang ?' message:

- Check for any updated firmware for the controller, targets and enclosure on 
the respective manufacturer's websites.
- If you get a BlinkLED (rare) following this message, contact the controller
supplier's technical support department.
- Check per-device queue depth in sysfs to make sure it is reasonable.
- Check per-device timeout in sysfs to make sure it is reasonable.
- Engage Drive supplier's technical support department to check through 
compatibility or drive class issues.
- Engage Enclosure supplier's technical support department to check through 
compatibility issues.
- Engage Controller supplier's (some aacraid controllers are OEM'd through 
other channels) technical support department to check through compatibility 
issues.
        - Tech support will ask for the 'support.zip' and diagnostic dumps to
triage the issue.
        - Tech support may further engage with you to acquire additional
details from your specific system.
- Or ... be happy with what class of underlying target components you have 
selected; for a warning does not mean there is a problem that needs to be 
resolved.
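
The two sysfs checks in the cookbook can be done in one read-only pass. The
sketch below is only that, a sketch: the function name and the wording of the
notes are mine, and the thresholds (45 to 120 seconds for the timeout, 32 as a
reference queue depth) are just the figures used earlier in this message, not
limits enforced by the driver or any tool.

```python
from pathlib import Path

# Assumptions taken from the figures in this message, not from any official tool.
TIMEOUT_RANGE = (45, 120)  # seconds: patched default up to the suggested ceiling
DEPTH_HINT = 32            # per-device depth used as a reference point


def audit(sys_block="/sys/block"):
    """Return a list of (device, timeout, queue_depth, notes); nothing is written."""
    report = []
    for dev in sorted(Path(sys_block).glob("sd*")):
        try:
            timeout = int((dev / "device" / "timeout").read_text())
            depth = int((dev / "device" / "queue_depth").read_text())
        except (OSError, ValueError):
            continue  # attributes missing or unreadable, skip this device
        notes = []
        if timeout < TIMEOUT_RANGE[0]:
            notes.append("timeout below 45s")
        elif timeout > TIMEOUT_RANGE[1]:
            notes.append("timeout above 120s")
        if depth > DEPTH_HINT:
            notes.append("queue depth above 32")
        report.append((dev.name, timeout, depth, notes))
    return report
```

An empty notes list for a device just means its values fall inside the ranges
assumed here; what counts as "reasonable" still depends on your drives and
workload.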

Sincerely -- Mark Salyzyn

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Omar Kilani
> Sent: Saturday, February 16, 2008 1:38 AM
> To: linux-kernel@vger.kernel.org
> Subject: aacraid: Host adapter reset request. SCSI hang ?
>
> Hi there,
>
> We're having issues with our Adaptec RAID controller and I was
> wondering if anyone would be able to advise on how to go about
> resolving them. :)
>
> The system:
>
> RHEL 5.1 x86_64
> Kernel 2.6.18-53.1.4.el5
> aacraid 1.1-5[2437]-mh4 (The default module shipped with the
> RHEL kernel)
>
> The controller:
>
> Adaptec 3405 + BBU
>
> BIOS : 5.2-0 (12415)
> Firmware : 5.2-0 (12415)
> Driver : 1.1-5 (2437)
> Boot Flash : 5.2-0 (12415)
>
> The disks:
>
> 4x Seagate ST373455SS (Firmware 0002) in a RAID10
>
> The issue:
>
> We continually get what seem to be adapter lockups with the error:
>
> aacraid: Host adapter abort request (3,0,0,0)
> aacraid: Host adapter reset request. SCSI hang ?
>
> The controller hangs for a while, recovers, and everything
> continues normally.
>
> We've tried a replacement controller and a replacement set of disks
> (we swapped out all 4 disks) to no avail.
>
> The problem seems (?) to happen during spikes of IO -- like when the
> PostgreSQL autovacuum daemon kicks in. But not always.
>
> From searching around LKML and Google, this seems to be a fairly
> common issue, but I'm not quite clear on the cause (hardware?
> software? firmware?) or the resolution.
>
> Any help would be greatly appreciated.
>
> Thanks!
>
> Regards,
> Omar
> --
> To unsubscribe from this list: send the line "unsubscribe
> linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
