This 'Host adapter reset request. SCSI hang ?' message hides a rather
complicated-to-explain underlying hardware behavior.
Aacraid based controllers have an underlying timeout/recovery cycle that is 35
seconds long. There is a driver patch for this that postdates RHEL5 (but is
going into RHEL5.2) and increases the default Linux per-device timeout to 45
seconds. The default in some SCSI subsystems was 60 seconds in the past, but is
now standardized at 30 seconds, which is shorter than the controller's cycle
and results in an interference pattern between the controller and the Linux
SCSI subsystem. The alternate workaround is for the user to raise the
per-device timeout in sysfs if it is shorter than 45 seconds. This is the only
likely Linux driver issue that you may be having; everything else is out of
scope for Linux or the aacraid driver and needs to be addressed separately.
This is not a panacea, so do NOT get your hopes up, and do NOT feel that the
warning behavior is in fact a problem that needs to be solved (!)
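A minimal sketch of that sysfs workaround, assuming the array shows up as a
SCSI disk whose command timeout (in seconds) is exposed at
/sys/block/<dev>/device/timeout. The helper name raise_timeout and its
sysfs_root parameter are made up for illustration, and writing the real file
needs root:

```python
from pathlib import Path

PATCHED_DEFAULT_S = 45  # per-device timeout the driver patch mentioned above sets


def raise_timeout(dev, wanted=PATCHED_DEFAULT_S, sysfs_root="/sys/block"):
    """Raise the SCSI command timeout of `dev` (e.g. "sda") to `wanted` seconds
    if it is currently shorter. Returns the timeout in effect afterwards."""
    path = Path(sysfs_root) / dev / "device" / "timeout"
    current = int(path.read_text().strip())
    if current < wanted:
        # Only ever lengthen the timeout; a longer existing value is left alone.
        path.write_text(f"{wanted}\n")
        return wanted
    return current
```

Calling raise_timeout("sda") on a device still at 30 seconds would leave it at
45. The setting does not survive a reboot, so it would have to be reapplied
from a boot script or similar.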
Keep in mind that I/O completions on these controllers, when everything is
working, are typically in the millisecond range or less.
The 3405 (or generically any aacraid based controller) is likely going through
an error correction cycle on the SAS/SATA bus that is delaying the completion
of I/O beyond the Linux default timeout set for the device. This may be a
hardware issue (i.e., no driver or OS change will make a difference) and
typically will be of little concern to the Linux Kernel or Distribution folks.
Or this may be a problem with an overly aggressive default timeout value, as
outlined immediately above. The fact that all you get are these messages and
I/O continues indicates that everything is working as designed, dealing with or
working around whatever the hardware issue is.
That does not mean the driver is free and clear; perhaps we are reaching a
fundamental resource limit in the controller, on the bus, in the enclosure or
in the drives. You may be able to mitigate this by adjusting the maximum queue
depth down somewhat. You can adjust this per device in sysfs, or indirectly
through the driver's controller-wide limit using the insmod parameter numacb.
If you find this to be the case and it is prevalent, we may need to adjust the
rules within the driver regarding load balancing and queue depths to improve
the general reliability and responsiveness of the system. It is a balance
between reliability, performance, responsiveness and periodic warning messages
in your logs. Target devices typically have the ability to handle at least 32
outstanding commands, often more.
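As a rough sketch of the per-device route, assuming the depth is exposed at
/sys/block/<dev>/device/queue_depth (the helper name lower_queue_depth and the
sysfs_root parameter are hypothetical, and the kernel or driver may refuse or
clamp the value written):

```python
from pathlib import Path


def lower_queue_depth(dev, limit, sysfs_root="/sys/block"):
    """Lower the queue depth of `dev` (e.g. "sda") to `limit` if it is
    currently deeper. Returns the depth this helper left in the file."""
    path = Path(sysfs_root) / dev / "device" / "queue_depth"
    current = int(path.read_text().strip())
    if current > limit:
        # Only ever reduce the depth; a shallower existing value is kept.
        path.write_text(f"{limit}\n")
        return limit
    return current
```

Stepping the limit down gradually and watching whether the warnings thin out
says more than jumping straight to a very small value. The numacb route is
different in kind: it is set when the module is loaded and applies to the
whole controller rather than to one device.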
My guess is that if the drives, enclosure and the controllers are all up to
date (latest firmware on all), swapping components has not resulted in any
change in behavior, and this problem persists at per-device queue depths below
32 and timeouts of 60 seconds, then you likely have a drive compatibility
problem.
One of the first incompatibilities that comes to mind, and that needs to be
ruled out, is the use of desktop class (typically self correcting, which can
take more than ten seconds to respond to I/O requests) or off-spec (i.e.,
higher error rate drives often shipped into low-budget markets) drives. The
controller needs to work with these drives, and it does, and is dealing with
them in an as-designed manner. If these are the drives you have, then you are
getting what you paid for. If this is the case, you can reduce the annoyance by
increasing the device's timeout by programming an updated value in sysfs.
Keep in mind that if you intended to buy enterprise class drives and the
supplier (or the manufacturer) shipped you desktop class drives, they are
often very willing to swap out the desktop class drives in exchange for
enterprise class drives. The difference is often as simple as the drive
firmware's behavior.
I would not increase this timeout much beyond 120 seconds, for anything larger
starts affecting the rough guarantee of I/O delivery required for server
network connections. The timeout results in a quiescing of I/O to the device
and in the servicing of any starved requests that may be backed up behind
resource limits in the components, and thus also has an effect on the
fundamental resource limit issues indicated above. The messages you are
seeing, although annoying, are actually a desirable, albeit clunky, handshake
somewhat ensuring the guarantee of I/O delivery. It is clunky because it also
leads to a pause in more recently submitted I/O as it offers up ten seconds of
silence in its honor :-(
Another problem can arise around enclosure compatibility, for the enclosure may
be negotiating in bad faith with the components, or may have an incompatibility
with the enclosure services in the controller, especially if the enclosure has
not been certified for use with the controller.
Another is that the drive is server/enterprise class but, because it is a
late-model unit, has never been certified with the controller and has an odd
behavior not yet worked around within the controller's firmware.
If either of these incompatibilities is the case, it is in Adaptec's interest
to work with you via the technical support department to resolve this problem.
They will no doubt immediately ask you for the 'diagnostic dump'. If that does
not bear any fruit, then they will work with you to either internally duplicate
the problem, or as a last resort require your system to duplicate the issue to
acquire the all-important SAS/SATA trace.
I hope you can see why these messages are discussed ad nauseam on the network,
with fingers pointing in all different directions. Good luck working on this
issue now armed with this understanding!
My cookbook for the 'aacraid: Host adapter reset request. SCSI hang ?' message:
- Check for any updated firmware for the controller, targets and enclosure on
the respective manufacturer's websites.
- If you get a BlinkLED (rare) following this message, contact the controller
supplier's technical support department.
- Check per-device queue depth in sysfs to make sure it is reasonable.
- Check per-device timeout in sysfs to make sure it is reasonable.
- Engage Drive supplier's technical support department to check through
compatibility or drive class issues.
- Engage Enclosure supplier's technical support department to check through
compatibility issues.
- Engage Controller supplier's (some aacraid controllers are OEM'd through
other channels) technical support department to check through compatibility
issues.
  - Tech support will ask for the 'support.zip' and diagnostic dumps to
    triage the issue.
  - Tech support may further engage you to acquire additional details from
    your specific system.
- Or ... be happy with what class of underlying target components you have
selected; for a warning does not mean there is a problem that needs to be
resolved.
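The two sysfs checks in the cookbook can be done in one pass. This is only a
sketch, assuming the usual /sys/block/sd*/device/{timeout,queue_depth}
attributes; the function name audit is made up, and treating 45 and 120
seconds and a depth of 32 as the "reasonable" bounds is my reading of the
numbers above, not a rule enforced by the driver:

```python
from pathlib import Path


def audit(sysfs_root="/sys/block", min_timeout=45, max_timeout=120, max_depth=32):
    """Return (device, timeout, queue_depth, notes) for every sd* disk found."""
    rows = []
    for dev_dir in sorted(Path(sysfs_root).glob("sd*")):
        try:
            timeout = int((dev_dir / "device" / "timeout").read_text())
            depth = int((dev_dir / "device" / "queue_depth").read_text())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable: skip this entry
        notes = []
        if timeout < min_timeout:
            notes.append(f"timeout below {min_timeout} s")
        if timeout > max_timeout:
            notes.append(f"timeout above {max_timeout} s")
        if depth > max_depth:
            notes.append(f"queue depth above {max_depth}")
        rows.append((dev_dir.name, timeout, depth, notes))
    return rows
```

Printing the rows returned by audit() gives a quick list of the devices worth a
second look. An empty notes list only means the two settings are within those
assumed bounds; it says nothing about firmware or drive class.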
Sincerely -- Mark Salyzyn
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Omar Kilani
> Sent: Saturday, February 16, 2008 1:38 AM
> To: [email protected]
> Subject: aacraid: Host adapter reset request. SCSI hang ?
>
> Hi there,
>
> We're having issues with our Adaptec RAID controller and I was
> wondering if anyone would be able to advise on how to go about
> resolving them. :)
>
> The system:
>
> RHEL 5.1 x86_64
> Kernel 2.6.18-53.1.4.el5
> aacraid 1.1-5[2437]-mh4 (The default module shipped with the
> RHEL kernel)
>
> The controller:
>
> Adaptec 3405 + BBU
>
> BIOS : 5.2-0 (12415)
> Firmware : 5.2-0 (12415)
> Driver : 1.1-5 (2437)
> Boot Flash : 5.2-0 (12415)
>
> The disks:
>
> 4x Seagate ST373455SS (Firmware 0002) in a RAID10
>
> The issue:
>
> We continually get what seem to be adapter lockups with the error:
>
> aacraid: Host adapter abort request (3,0,0,0)
> aacraid: Host adapter reset request. SCSI hang ?
>
> The controller hangs for a while, recovers, and everything
> continues normally.
>
> We've tried a replacement controller and a replacement set of disks
> (we swapped out all 4 disks) to no avail.
>
> The problem seems (?) to happen during spikes of IO -- like when the
> PostgreSQL autovacuum daemon kicks in. But not always.
>
> From searching around LKML and Google, this seems to be a fairly
> common issue, but I'm not quite clear on the cause (hardware?
> software? firmware?) or the resolution.
>
> Any help would be greatly appreciated.
>
> Thanks!
>
> Regards,
> Omar
> --
> To unsubscribe from this list: send the line "unsubscribe
> linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>