Dear Mr. Long,

First of all, let me thank you for maintaining the Adaptec RAID drivers.

I've got a problem with the Adaptec 2120S in FreeBSD 4.8-RELEASE
and I haven't found any notes about that in the mailing lists.

In SMP mode, upon a RAID array degradation event (a disk is
ripped out), the system locks up almost entirely, stuck at
disk operations.
The same happens upon boot with a rebuilding/degraded array
- building from scratch or rebuilding after a disk failure,
or even just running off a single disk while the other one
is dead (no rebuild going on in the background).

The problem doesn't occur in UP mode (when options SMP and
APIC_IO are off) - that way the host system works happily
just as if there was nothing wrong with the array (except
for a few **Monitor** warnings and the LEDs going disco).
The problem also doesn't occur as long as the RAID is
"optimal".
The problem was only observed and tested in a configuration
with two disks in a mirror (one or two logical "containers"
on them), no hot spare.


My system configuration is:

2x Intel P4 Xeon @ 2.4 GHz, 533MHz FSB
1 GB RAM (dual-channel, 2x 512 MB DIMM DDR266, ECC, REG)
ServerWorks GC-LE chipset, PCI-X [EMAIL PROTECTED]
2x3 SCA backplane with two GEM318 SAF-TE processors
On one channel, there are two pieces of Seagate ST336607LC (36 GB)
(+ 2x onboard BCM570x GbETH, 2x onboard AIC7902,
  onboard ATI RageXL PCI 8MB, etc)

The array on the AAC controller is the only disk drive in the
system -> the machine is booting from it.

To the best of my knowledge, the mechanical and electrical
parts of the U320 system are fine - they've been working for
me in Linux and with other SCSI controllers, and the
dual-channel onboard U320 HBA works just fine, too.


Attached is a tarball with debugging logs.

There are three directories, containing three different
combinations of debug options (see below items A to C).
Each directory contains six log files: a boot from a clean
array, a disk failure (somewhat improperly simulated by
ripping the SCA enclosure out), and a boot from a degraded
volume - all of that for a UP and SMP kernel. 3*2=6.

I've tried the following different debugging options and levels:
A) full CAM debug and AAC_DEBUG=2
B) AAC_DEBUG=2
C) AAC_DEBUG=4  (after I found in the sources that level 4 exists)

With A), everything worked as described above.
         Just the CAM debugging messages probably cluttered the
         kernel ring buffer to the extent that some of the AAC_DEBUG
         and generic messages are missing in the log, such as those
         announcing the detection of /dev/aacd0 and /dev/aacd1
         (the two RAID volumes/containers)
With B), upon runtime disk failure, the fault occurred even in UP configuration!
         -- while UP kernels without debugging continued to operate,
         and even the UP kernel as per B) continued to run fine
         after reboot, on the failed array.
With C), __SMP__: the machine behaved as expected (dead) upon
         runtime disk failure, but consistently managed to boot with
         the degraded array while it was not rebuilding (=anomaly)
         - then it crashed when I logged in and told it to `reboot`.
         When I plugged the disk drive back and the array started
         rebuilding, the SMP kernel consistently failed to boot.
         __UP__: the machine was consistently failing miserably
         upon array degradation (=anomaly). It did boot fine
         consistently with a degraded array (not rebuilding).
         It failed at boot consistently with a rebuilding array.

So it seems that the serial logging / debugging stuff modifies
timing, and hence the behavior with debugging on is different.
Reminds me of the Heisenbergian uncertainty.

Still, without debugging, the consistent pattern is:
UP = boots fine from a clean array, survives array degradation
     and boots from a degraded array.
SMP = boots fine from a clean array, does not survive array degradation
      and fails to boot from a degraded array.


While I was trying to find a typical healthy "SCSI request/response"
pattern in the logs, it seemed to me that quite often some of the
debugging messages were missing, and some were clearly cut in
half or so - perhaps I should check my RS232 cabling? Though
I really think that my cabling is all right...

From the debug listings it would seem that the AAC driver
on the host PC gets a zero-padded FIB from the controller,
and then an endless row of interrupts.
This happens immediately after a disk failure or after driver
initialization upon boot.

The following is a piece of pseudo-code for your reference,
based on /usr/src/sys/dev/aac/aac.c. The aac_host_command()
function forms the body of a kthread that gets started upon
adapter initialization. Note the line with "!!!":

aac_host_command()
{
   while(true)
   {
      tsleep();

      for (;;)
      {
         // check for enqueued FIBs
         aac_dequeue_fib(AAC_HOST_NORM_CMD_QUEUE);

         if (found one)
         {
            // process it
         }
         else
         {
            break; // go to sleep again
         }
      }
   }
}

aac_dequeue_fib()
{
   if (ci != pi)  // consumer/producer indices
   {
      // there are some FIBs in the queue
      // !!! at the same time, the FIB is zero-padded !!!
   }
   else return(ENOENT);
}
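
To make the suspicious case easier to talk about, here is a minimal,
self-contained C sketch of the same consumer/producer pattern. It is
not the driver code: the struct layouts, sizes and all fib_* names
are made up for illustration, and returning EINVAL for an all-zero
entry is only my assumption of how such a case could be flagged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_ENTRIES 8   /* hypothetical queue depth */
#define FIB_SIZE      64  /* hypothetical FIB size in bytes */

struct fib {              /* stand-in, not the real FIB layout */
    uint8_t bytes[FIB_SIZE];
};

struct fib_queue {
    uint32_t   pi;        /* producer index (controller side) */
    uint32_t   ci;        /* consumer index (host side) */
    struct fib slot[QUEUE_ENTRIES];
};

static void fib_queue_init(struct fib_queue *q)
{
    memset(q, 0, sizeof(*q));
}

/* Simulates the controller posting a FIB; ENOSPC if the ring is full. */
static int fib_enqueue(struct fib_queue *q, const struct fib *f)
{
    uint32_t next = (q->pi + 1) % QUEUE_ENTRIES;

    if (next == q->ci)
        return ENOSPC;
    q->slot[q->pi] = *f;
    q->pi = next;
    return 0;
}

/* Returns 1 if every byte of the FIB is zero, 0 otherwise. */
static int fib_is_zeroed(const struct fib *f)
{
    size_t i;

    for (i = 0; i < FIB_SIZE; i++)
        if (f->bytes[i] != 0)
            return 0;
    return 1;
}

/*
 * Returns ENOENT if ci == pi (queue empty), EINVAL if the indices
 * claim an entry but it is all zeroes, 0 for a normal entry.
 * The entry is consumed in both non-empty cases.
 */
static int fib_dequeue(struct fib_queue *q, struct fib *out)
{
    assert(q != NULL && out != NULL);

    if (q->ci == q->pi)
        return ENOENT;
    *out = q->slot[q->ci];
    q->ci = (q->ci + 1) % QUEUE_ENTRIES;
    return fib_is_zeroed(out) ? EINVAL : 0;
}

struct drain_stats {
    int good;             /* entries with some non-zero content */
    int zeroed;           /* entries that were entirely zero */
};

/* Mirrors the inner for(;;) loop above: drain until ENOENT. */
static struct drain_stats fib_drain(struct fib_queue *q)
{
    struct drain_stats st = { 0, 0 };
    struct fib f;
    int rc;

    while ((rc = fib_dequeue(q, &f)) != ENOENT) {
        if (rc == EINVAL)
            st.zeroed++;
        else
            st.good++;
    }
    return st;
}
```

Counting the zeroed entries separately, rather than processing them
like normal ones, is the kind of instrumentation hook I could add to
the real driver if that would help.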

Another symptom is that, upon array degradation, the controller
seems to reset the RAID-private SCSI bus (I hope that's what
the **Monitor** message says).

The trouble is that both the aac_host_command() wakeup with the
zero-padded FIB and the monitor messages appear in asynchronous
context (in a separate kthread or in an interrupt), and I'm not
skilled enough to say which previous action of the driver is
the immediate cause.

More on the behavior of the disk LEDs:
These LEDs on my server case are controlled by the SCA/SAF-TE
chip (GEM318).
- When the array is degraded but operating normally, the dead
disk's LED is dark and the live disk's LED flashes green,
indicating normal storage transfers.
- When a degraded array is rebuilding, the two disk LEDs dance
in shades of green to orange (both the green and red
pads flashing).
- When the whole controller or the RAID-private SCSI channel
is being reset, both LEDs shine a steady red.
- When the machine fails at boot with a rebuilding array,
the LEDs often turn red for a few seconds (reset?) and then
one of them remains red and the other one starts dancing
green/orange... and the reset may come back a few times
before the machine locks up entirely or BSD manages
to do an auto-reboot. Or the LEDs just stay red and the
machine hangs.
- When the machine boots and runs fine (i.e., with a UP kernel
under normal conditions), the disk LEDs never go red, except
for a cold reset of the whole PC. When the array is rebuilding,
the LEDs keep dancing merrily between green and orange
throughout the boot process.

I guess this would indicate that it's not just the BSD driver
getting messed up - the controller probably also gets
seriously confused. Is that a chicken-vs.-egg style puzzle?


As a side note: it seems interesting to me that, regardless
of whether debugging and SMP are on or off in any particular
combination, the kernel always rushes through to
"Waiting 15 seconds for the SCSI devices to settle"
and _immediately_ reports the RAID containers.
Only then does it wait those fifteen seconds before proceeding
to detect the regular SCSI devices.


Attached are my kernel config file and a listing of
`pciconf -lv`

I can't think of anything else to tell you at the moment.
Ask me if you need further help - perhaps I can modify the
debugging flags and try again, add some more instrumentation
hooks here and there to focus on particular points in the
code etc.

Any ideas are welcome.
Sorry about taking up your time with such a long-winded
explanation.
Thanks for the great job that you're doing.


Frank Rysanek
[EMAIL PROTECTED]:0:0:  class=0x060000 card=0x00000000 chip=0x00141166 rev=0x31 hdr=0x00
    vendor   = 'Reliance Computer Corp./ServerWorks'
    device   = 'CNB20-HE Host Bridge'
    class    = bridge
    subclass = HOST-PCI
[EMAIL PROTECTED]:0:1:  class=0x060000 card=0x00000000 chip=0x00141166 rev=0x00 hdr=0x00
    vendor   = 'Reliance Computer Corp./ServerWorks'
    device   = 'CNB20-HE Host Bridge'
    class    = bridge
    subclass = HOST-PCI
[EMAIL PROTECTED]:0:2:  class=0x060000 card=0x00000000 chip=0x00151166 rev=0x00 hdr=0x00
    vendor   = 'Reliance Computer Corp./ServerWorks'
    device   = 'CMIC-GC Hostbridge and MCH'
    class    = bridge
    subclass = HOST-PCI
[EMAIL PROTECTED]:2:0:  class=0x030000 card=0x80041002 chip=0x47521002 rev=0x27 hdr=0x00
    vendor   = 'ATI Technologies'
    device   = 'Rage XL PCI'
    class    = display
    subclass = VGA
[EMAIL PROTECTED]:15:0: class=0x060100 card=0x02011166 chip=0x02011166 rev=0x93 hdr=0x00
    vendor   = 'Reliance Computer Corp./ServerWorks'
    device   = 'CSB5 PCI to ISA Bridge'
    class    = bridge
    subclass = PCI-ISA
[EMAIL PROTECTED]:15:1: class=0x01018a card=0x02121166 chip=0x02121166 rev=0x93 hdr=0x00
    vendor   = 'Reliance Computer Corp./ServerWorks'
    device   = 'CSB5 PCI EIDE Controller'
    class    = mass storage
    subclass = ATA
[EMAIL PROTECTED]:15:2: class=0x0c0310 card=0x02201166 chip=0x02201166 rev=0x05 hdr=0x00
    vendor   = 'Reliance Computer Corp./ServerWorks'
    device   = 'OSB4 OpenHCI Compliant USB Controller'
    class    = serial bus
    subclass = USB
[EMAIL PROTECTED]:15:3: class=0x060000 card=0x02301166 chip=0x02251166 rev=0x00 hdr=0x00
    vendor   = 'Reliance Computer Corp./ServerWorks'
    device   = 'CSB5 PCI Bridge'
    class    = bridge
    subclass = HOST-PCI
[EMAIL PROTECTED]:17:0: class=0x060000 card=0x00000000 chip=0x01011166 rev=0x03 hdr=0x00
    vendor   = 'Reliance Computer Corp./ServerWorks'
    device   = 'CIOB-X2'
    class    = bridge
    subclass = HOST-PCI
[EMAIL PROTECTED]:17:2: class=0x060000 card=0x00000000 chip=0x01011166 rev=0x03 hdr=0x00
    vendor   = 'Reliance Computer Corp./ServerWorks'
    device   = 'CIOB-X2'
    class    = bridge
    subclass = HOST-PCI
[EMAIL PROTECTED]:4:0:  class=0x010400 card=0x02869005 chip=0x02859005 rev=0x01 hdr=0x00
    vendor   = 'Adaptec'
    device   = 'AAC-RAID RAID Controller'
    class    = mass storage
    subclass = RAID
[EMAIL PROTECTED]:2:0:  class=0x020000 card=0x000814e4 chip=0x164514e4 rev=0x15 hdr=0x00
    vendor   = 'Broadcom Corporation'
    device   = 'BCM5701 NetXtreme Gigabit Ethernet'
    class    = network
    subclass = ethernet
[EMAIL PROTECTED]:3:0:  class=0x020000 card=0x000814e4 chip=0x164514e4 rev=0x15 hdr=0x00
    vendor   = 'Broadcom Corporation'
    device   = 'BCM5701 NetXtreme Gigabit Ethernet'
    class    = network
    subclass = ethernet
[EMAIL PROTECTED]:4:0:  class=0x010000 card=0x005e9005 chip=0x801d9005 rev=0x10 hdr=0x00
    vendor   = 'Adaptec'
    class    = mass storage
    subclass = SCSI
[EMAIL PROTECTED]:4:1:  class=0x010000 card=0x005e9005 chip=0x801d9005 rev=0x10 hdr=0x00
    vendor   = 'Adaptec'
    class    = mass storage
    subclass = SCSI
#
# GENERIC -- Generic kernel configuration file for FreeBSD/i386
#
# For more information on this file, please read the handbook section on
# Kernel Configuration Files:
#
#    http://www.FreeBSD.org/doc/en_US.ISO8859-1/books/handbook/kernelconfig-config.html
#
# The handbook is also available locally in /usr/share/doc/handbook
# if you've installed the doc distribution, otherwise always see the
# FreeBSD World Wide Web server (http://www.FreeBSD.org/) for the
# latest information.
#
# An exhaustive list of options and more detailed explanations of the
# device lines is also present in the ./LINT configuration file. If you are
# in doubt as to the purpose or necessity of a line, check first in LINT.
#
# $FreeBSD: src/sys/i386/conf/GENERIC,v 1.246.2.51.2.2 2003/03/25 23:35:15 jhb Exp $

machine         i386
#cpu            I386_CPU
#cpu            I486_CPU
#cpu            I586_CPU
cpu             I686_CPU
ident           GENERIC
maxusers        0

#makeoptions    DEBUG=-g                #Build kernel with gdb(1) debug symbols

options         MATH_EMULATE            #Support for x87 emulation
options         INET                    #InterNETworking
#options        INET6                   #IPv6 communications protocols
options         FFS                     #Berkeley Fast Filesystem
options         FFS_ROOT                #FFS usable as root device [keep this!]
options         SOFTUPDATES             #Enable FFS soft updates support
options         UFS_DIRHASH             #Improve performance on big directories
options         MFS                     #Memory Filesystem
options         MD_ROOT                 #MD is a potential root device
options         NFS                     #Network Filesystem
options         NFS_ROOT                #NFS usable as root device, NFS required
options         MSDOSFS                 #MSDOS Filesystem
options         CD9660                  #ISO 9660 Filesystem
options         CD9660_ROOT             #CD-ROM usable as root, CD9660 required
options         PROCFS                  #Process filesystem
options         COMPAT_43               #Compatible with BSD 4.3 [KEEP THIS!]
options         SCSI_DELAY=15000        #Delay (in ms) before probing SCSI
options         UCONSOLE                #Allow users to grab the console
options         USERCONFIG              #boot -c editor
options         VISUAL_USERCONFIG       #visual boot -c editor
options         KTRACE                  #ktrace(1) support
options         SYSVSHM                 #SYSV-style shared memory
options         SYSVMSG                 #SYSV-style message queues
options         SYSVSEM                 #SYSV-style semaphores
options         P1003_1B                #Posix P1003_1B real-time extensions
options         _KPOSIX_PRIORITY_SCHEDULING
options         ICMP_BANDLIM            #Rate limit bad replies
options         KBD_INSTALL_CDEV        # install a CDEV entry in /dev
options         AHC_REG_PRETTY_PRINT    # Print register bitfields in debug
                                        # output.  Adds ~128k to driver.
options         AHD_REG_PRETTY_PRINT    # Print register bitfields in debug 
                                        # output.  Adds ~215k to driver.

# To make an SMP kernel, the next two are needed
options         SMP                     # Symmetric MultiProcessor Kernel
options         APIC_IO                 # Symmetric (APIC) I/O

# To support HyperThreading, HTT is needed in addition to SMP and APIC_IO
options         HTT                     # HyperThreading Technology

device          isa
#device         eisa
device          pci

# Floppy drives
device          fdc0    at isa? port IO_FD1 irq 6 drq 2
device          fd0     at fdc0 drive 0
device          fd1     at fdc0 drive 1
#
# If you have a Toshiba Libretto with its Y-E Data PCMCIA floppy,
# don't use the above line for fdc0 but the following one:
#device         fdc0

# ATA and ATAPI devices
device          ata0    at isa? port IO_WD1 irq 14
device          ata1    at isa? port IO_WD2 irq 15
device          ata
device          atadisk                 # ATA disk drives
device          atapicd                 # ATAPI CDROM drives
device          atapifd                 # ATAPI floppy drives
device          atapist                 # ATAPI tape drives
options         ATA_STATIC_ID           #Static device numbering

# SCSI Controllers
#device         ahb             # EISA AHA1742 family
#device         ahc             # AHA2940 and onboard AIC7xxx devices
device          ahd             # AHA39320/29320 and onboard AIC79xx devices
#device         amd             # AMD 53C974 (Tekram DC-390(T))
#device         isp             # Qlogic family
#device         mpt             # LSI-Logic MPT/Fusion
#device         ncr             # NCR/Symbios Logic
#device         sym             # NCR/Symbios Logic (newer chipsets)
#options        SYM_SETUP_LP_PROBE_MAP=0x40
                                # Allow ncr to attach legacy NCR devices when 
                                # both sym and ncr are configured

#device         adv0    at isa?
#device         adw
#device         bt0     at isa?
#device         aha0    at isa?
#device         aic0    at isa?

#device         ncv             # NCR 53C500
#device         nsp             # Workbit Ninja SCSI-3
#device         stg             # TMC 18C30/18C50

# SCSI peripherals
device          scbus           # SCSI bus (required)
device          da              # Direct Access (disks)
device          sa              # Sequential Access (tape etc)
device          cd              # CD
device          pass            # Passthrough device (direct SCSI access)

# RAID controllers interfaced to the SCSI subsystem
#device         asr             # DPT SmartRAID V, VI and Adaptec SCSI RAID
#device         dpt             # DPT Smartcache - See LINT for options!
#device         iir             # Intel Integrated RAID
#device         mly             # Mylex AcceleRAID/eXtremeRAID
#device         ciss            # Compaq SmartRAID 5* series

# RAID controllers
device          aac             # Adaptec FSA RAID, Dell PERC2/PERC3
#options        AAC_DEBUG=4
#device         aacp            # SCSI passthrough for aac (requires CAM)
#device         ida             # Compaq Smart RAID
#device         amr             # AMI MegaRAID
#device         mlx             # Mylex DAC960 family
#device         twe             # 3ware Escalade

#options        CAMDEBUG
#options        CAM_DEBUG_BUS=-1
#options        CAM_DEBUG_TARGET=-1
#options        CAM_DEBUG_LUN=-1
#options        CAM_DEBUG_FLAGS="CAM_DEBUG_INFO|CAM_DEBUG_TRACE|CAM_DEBUG_SUBTRACE|CAM_DEBUG_CDB|CAM_DEBUG_XPT|CAM_DEBUG_PERIPH"

# atkbdc0 controls both the keyboard and the PS/2 mouse
device          atkbdc0 at isa? port IO_KBD
device          atkbd0  at atkbdc? irq 1 flags 0x1
device          psm0    at atkbdc? irq 12

device          vga0    at isa?

# splash screen/screen saver
pseudo-device   splash

# syscons is the default console driver, resembling an SCO console
device          sc0     at isa? flags 0x100

# Enable this and PCVT_FREEBSD for pcvt vt220 compatible console driver
#device         vt0     at isa?
#options        XSERVER                 # support for X server on a vt console
#options        FAT_CURSOR              # start with block cursor
# If you have a ThinkPAD, uncomment this along with the rest of the PCVT lines
#options        PCVT_SCANSET=2          # IBM keyboards are non-std

device          agp             # support several AGP chipsets

# Floating point support - do not disable.
device          npx0    at nexus? port IO_NPX irq 13

# Power management support (see LINT for more options)
device          apm0    at nexus? disable flags 0x20 # Advanced Power Management

# PCCARD (PCMCIA) support
#device         card
#device         pcic0   at isa? irq 0 port 0x3e0 iomem 0xd0000
#device         pcic1   at isa? irq 0 port 0x3e2 iomem 0xd4000 disable

# Serial (COM) ports
device          sio0    at isa? port IO_COM1 flags 0x30 irq 4
device          sio1    at isa? port IO_COM2 irq 3
device          sio2    at isa? disable port IO_COM3 irq 5
device          sio3    at isa? disable port IO_COM4 irq 9

options CONSPEED=115200

# Parallel port
device          ppc0    at isa? irq 7
device          ppbus           # Parallel port bus (required)
device          lpt             # Printer
#device         plip            # TCP/IP over parallel
#device         ppi             # Parallel port interface device
#device         vpo             # Requires scbus and da


# PCI Ethernet NICs.
#device         de              # DEC/Intel DC21x4x (``Tulip'')
#device         em              # Intel PRO/1000 adapter Gigabit Ethernet Card (``Wiseman'')
#device         txp             # 3Com 3cR990 (``Typhoon'')
#device         vx              # 3Com 3c590, 3c595 (``Vortex'')

# PCI Ethernet NICs that use the common MII bus controller code.
# NOTE: Be sure to keep the 'device miibus' line in order to use these NICs!
device          miibus          # MII bus support
#device         dc              # DEC/Intel 21143 and various workalikes
#device         fxp             # Intel EtherExpress PRO/100B (82557, 82558)
#device         pcn             # AMD Am79C97x PCI 10/100 NICs
#device         rl              # RealTek 8129/8139
#device         sf              # Adaptec AIC-6915 (``Starfire'')
#device         sis             # Silicon Integrated Systems SiS 900/SiS 7016
#device         ste             # Sundance ST201 (D-Link DFE-550TX)
#device         tl              # Texas Instruments ThunderLAN
#device         tx              # SMC EtherPower II (83c170 ``EPIC'')
#device         vr              # VIA Rhine, Rhine II
#device         wb              # Winbond W89C840F
#device         xl              # 3Com 3c90x (``Boomerang'', ``Cyclone'')
device          bge             # Broadcom BCM570x (``Tigon III'')

# ISA Ethernet NICs.
# 'device ed' requires 'device miibus'
#device         ed0     at isa? disable port 0x280 irq 10 iomem 0xd8000
#device         ex
#device         ep
#device         fe0     at isa? disable port 0x300
# Xircom Ethernet
#device         xe
# PRISM I IEEE 802.11b wireless NIC.
#device         awi
# WaveLAN/IEEE 802.11 wireless NICs. Note: the WaveLAN/IEEE really
# exists only as a PCMCIA device, so there is no ISA attachment needed
# and resources will always be dynamically assigned by the pccard code.
#device         wi
# Aironet 4500/4800 802.11 wireless NICs. Note: the declaration below will
# work for PCMCIA and PCI cards, as well as ISA cards set to ISA PnP
# mode (the factory default). If you set the switches on your ISA
# card for a manually chosen I/O address and IRQ, you must specify
# those parameters here.
#device         an
# The probe order of these is presently determined by i386/isa/isa_compat.c.
#device         ie0     at isa? disable port 0x300 irq 10 iomem 0xd0000
#device         le0     at isa? disable port 0x300 irq 5 iomem 0xd0000
#device         lnc0    at isa? disable port 0x280 irq 10 drq 0
#device         cs0     at isa? disable port 0x300
#device         sn0     at isa? disable port 0x300 irq 10

# Pseudo devices - the number indicates how many units to allocate.
pseudo-device   loop            # Network loopback
pseudo-device   ether           # Ethernet support
pseudo-device   sl      1       # Kernel SLIP
pseudo-device   ppp     1       # Kernel PPP
pseudo-device   tun             # Packet tunnel.
pseudo-device   pty             # Pseudo-ttys (telnet etc)
pseudo-device   md              # Memory "disks"
pseudo-device   gif             # IPv6 and IPv4 tunneling
pseudo-device   faith   1       # IPv6-to-IPv4 relaying (translation)

# The `bpf' pseudo-device enables the Berkeley Packet Filter.
# Be aware of the administrative consequences of enabling this!
pseudo-device   bpf             #Berkeley packet filter

# USB support
device          uhci            # UHCI PCI->USB interface
device          ohci            # OHCI PCI->USB interface
device          usb             # USB Bus (required)
device          ugen            # Generic
device          uhid            # "Human Interface Devices"
device          ukbd            # Keyboard
device          ulpt            # Printer
device          umass           # Disks/Mass storage - Requires scbus and da
device          ums             # Mouse
#device         uscanner        # Scanners
#device         urio            # Diamond Rio MP3 Player
# USB Ethernet, requires mii
#device         aue             # ADMtek USB ethernet
#device         cue             # CATC USB ethernet
#device         kue             # Kawasaki LSI USB ethernet