Re: aic94xx: failing on high load (another data point)

2008-02-20 Thread James Bottomley
On Wed, 2008-02-20 at 17:54 +0800, Keith Hopkins wrote:
 On 02/20/2008 11:48 AM, James Bottomley wrote:
  On Tue, 2008-02-19 at 10:22 -0600, James Bottomley wrote:
  I'll see if I can come up with patches to fix this ... or at least
  mitigate the problems it causes.
  
  Darrick's working on the ascb sequencer use after free problem.
  
  I looked into some of the error handling in libsas, and apparently
  that's a bit of a huge screw up too.  There are a number of places where
  we won't complete a task that is being errored out, which then causes
  timeout errors.  This patch is actually for libsas to fix all of this.
  
  I've managed to reproduce some of your problem by firing random resets
  across a disk under load, and this recovers the protocol errors for me.
  However, I can't reproduce the TMF timeout which caused the sequencer
  screw up, so you still need to wait for Darrick's fix as well.
  
  James
  
 
 Hi James, Darrick,
 
   Thanks again for looking more into this.  I'll wait for Darrick's
 patch and try it together with this libsas patch.  Should I leave
 James' first patch in also?

Yes, that's a requirement just to get the REQ_TASK_ABORT for the
protocol errors actually to work ...

I'm afraid this is like peeling an onion, as I said, and you're going
to build up layers of patches.  However, the ones that are obvious bug
fixes and that I can test (all of them so far) I'm putting into the SCSI
rc-fixes tree, so you can download a rollup here:

http://www.kernel.org/pub/linux/kernel/people/jejb/scsi-rc-fixes-2.6.diff

James




Re: aic94xx: failing on high load (another data point)

2008-02-19 Thread James Bottomley
On Tue, 2008-02-19 at 10:22 -0600, James Bottomley wrote:
 I'll see if I can come up with patches to fix this ... or at least
 mitigate the problems it causes.

Darrick's working on the ascb sequencer use after free problem.

I looked into some of the error handling in libsas, and apparently
that's a bit of a huge screw up too.  There are a number of places where
we won't complete a task that is being errored out, which then causes
timeout errors.  This patch is actually for libsas to fix all of this.
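
To make the failure mode concrete, here is a minimal sketch of the contract the
error handler has to honour (an illustration only, not part of the patch below,
and the strategy-handler name is made up): every scsi_cmnd handed to the
error-handler strategy routine must eventually be moved onto a done list with
scsi_eh_finish_cmd(); a command that is simply dropped from the work queue is
never completed, so the midlayer keeps waiting for it and you see timeouts.

#include <linux/list.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_eh.h>
#include <scsi/scsi_host.h>

static void example_eh_strategy(struct Scsi_Host *shost,
				struct list_head *work_q,
				struct list_head *done_q)
{
	struct scsi_cmnd *cmd, *n;

	list_for_each_entry_safe(cmd, n, work_q, eh_entry) {
		/* ... try to abort/reset whatever is behind cmd ... */

		/* every command must be accounted for: without this the
		 * command is never completed and eventually times out */
		scsi_eh_finish_cmd(cmd, done_q);
	}
}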

I've managed to reproduce some of your problem by firing random resets
across a disk under load, and this recovers the protocol errors for me.
However, I can't reproduce the TMF timeout which caused the sequencer
screw up, so you still need to wait for Darrick's fix as well.
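
For reference, the sort of reset injection described above can be approximated
with a small userspace tool like the sketch below (an illustration only: the
/dev/sg1 path and the intervals are assumptions, and this is not necessarily
the exact harness used).  It fires SCSI device resets through the SG_SCSI_RESET
ioctl from <scsi/sg.h> while the disk is kept busy by other I/O (e.g. a RAID
resync):

#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* assumption: the disk under test is /dev/sg1 unless given on the
	 * command line; run as root while the disk is under load */
	const char *dev = argc > 1 ? argv[1] : "/dev/sg1";
	int fd = open(dev, O_RDWR | O_NONBLOCK);

	if (fd < 0) {
		perror(dev);
		return 1;
	}
	srand(getpid());
	for (;;) {
		int op = SG_SCSI_RESET_DEVICE;

		if (ioctl(fd, SG_SCSI_RESET, &op) < 0)
			perror("SG_SCSI_RESET");
		sleep(1 + rand() % 10);	/* random gap between resets */
	}
}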

James

---

diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index f869fba..b656e29 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -51,8 +51,6 @@ static void sas_scsi_task_done(struct sas_task *task)
 {
 	struct task_status_struct *ts = task->task_status;
 	struct scsi_cmnd *sc = task->uldd_task;
-	struct sas_ha_struct *sas_ha = SHOST_TO_SAS_HA(sc->device->host);
-	unsigned ts_flags = task->task_state_flags;
 	int hs = 0, stat = 0;
 
 	if (unlikely(!sc)) {
@@ -120,11 +118,7 @@ static void sas_scsi_task_done(struct sas_task *task)
 	sc->result = (hs << 16) | stat;
 	list_del_init(&task->list);
 	sas_free_task(task);
-	/* This is very ugly but this is how SCSI Core works. */
-	if (ts_flags & SAS_TASK_STATE_ABORTED)
-		scsi_eh_finish_cmd(sc, &sas_ha->eh_done_q);
-	else
-		sc->scsi_done(sc);
+	sc->scsi_done(sc);
 }
 
 static enum task_attribute sas_scsi_get_task_attr(struct scsi_cmnd *cmd)
@@ -255,13 +249,27 @@ out:
 	return res;
 }
 
+static void sas_eh_finish_cmd(struct scsi_cmnd *cmd)
+{
+	struct sas_task *task = TO_SAS_TASK(cmd);
+	struct sas_ha_struct *sas_ha = SHOST_TO_SAS_HA(cmd->device->host);
+
+	/* First off call task_done.  However, task will
+	 * be free'd after this */
+	task->task_done(task);
+	/* now finish the command and move it on to the error
+	 * handler done list, this also takes it off the
+	 * error handler pending list */
+	scsi_eh_finish_cmd(cmd, &sas_ha->eh_done_q);
+}
+
 static void sas_scsi_clear_queue_lu(struct list_head *error_q, struct scsi_cmnd *my_cmd)
 {
 	struct scsi_cmnd *cmd, *n;
 
 	list_for_each_entry_safe(cmd, n, error_q, eh_entry) {
 		if (cmd == my_cmd)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -274,7 +282,7 @@ static void sas_scsi_clear_queue_I_T(struct list_head *error_q,
 		struct domain_device *x = cmd_to_domain_dev(cmd);
 
 		if (x == dev)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -288,7 +296,7 @@ static void sas_scsi_clear_queue_port(struct list_head *error_q,
 		struct asd_sas_port *x = dev->port;
 
 		if (x == port)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -528,14 +536,14 @@ Again:
 		case TASK_IS_DONE:
 			SAS_DPRINTK("%s: task 0x%p is done\n", __FUNCTION__,
 				    task);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			continue;
 		case TASK_IS_ABORTED:
 			SAS_DPRINTK("%s: task 0x%p is aborted\n",
 				    __FUNCTION__, task);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			continue;
@@ -547,7 +555,7 @@ Again:
 					    "recovered\n",
 					    SAS_ADDR(task->dev),
 					    cmd->device->lun);
-				task->task_done(task);
+				sas_eh_finish_cmd(cmd);
 				if (need_reset)
 					try_to_reset_cmd_device(shost, cmd);
 				sas_scsi_clear_queue_lu(work_q, cmd);
@@ -562,7 +570,7 @@ Again:
 			if (tmf_resp == TMF_RESP_FUNC_COMPLETE) {
 				SAS_DPRINTK("I_T %016llx recovered\n",
 					    SAS_ADDR(task->dev->sas_addr));
-				task->task_done(task);
+

Re: aic94xx: failing on high load (another data point)

2008-02-18 Thread James Bottomley
On Mon, 2008-02-18 at 22:26 +0800, Keith Hopkins wrote:
 Well, that made life interesting ... but didn't seem to fix anything.
 
 The behavior is about the same as before, but with more verbose
 errors.  I failed one member of the raid and had it rebuild as a
 test...which hangs for a while and the drive falls off-line.

Actually, it now finds the task and tries to do error handling for
it ... so we've now uncovered bugs in the error handler.  It may not
look like it, but this is actually progress.  I'm afraid, though, that it's
going to be a bit like peeling an onion: every time one error gets
fixed, you just get to the next layer of errors.

 Please grab the dmesg output in all its gory glory from here:
 http://wiki.hopnet.net/dokuwiki/lib/exe/fetch.php?media=myit:sas:dmesg-20080218-wpatch-fail.txt.gz
 
 The drive is a Dell OEM drive, but it's not in a Dell system.  There
 is at least one firmware (S527) upgrade for it, but the Dell loader
 refuses to load it (because it isn't in a Dell system...)
 Does anyone know a generic way to load a new firmware onto a SAS drive?

The firmware upgrade tools are usually vendor specific, though, because
the format of the firmware file is vendor specific.  Could you just put
it in a Dell box to upgrade?

James




Re: aic94xx: failing on high load (another data point)

2008-02-15 Thread James Bottomley
On Fri, 2008-02-15 at 00:11 +0800, Keith Hopkins wrote:
 On 01/31/2008 03:29 AM, Darrick J. Wong wrote:
  On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
  V28.  My controller functions well with a single drive (low-medium load).  
  Unfortunately, all attempts to get the mirrors in sync fail and usually 
  hang the whole box.
  
  Adaptec posted a V30 sequencer on their website; does that fix the
  problems?
  
  http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm
  
 
 I lost connectivity to the drive again, and had to reboot to recover
 the drive, so it seemed a good time to try out the V30 firmware.
 Unfortunately, it didn't work any better.  Details are in the
 attachment.

Well, I can offer some hope.  The errors you report:

 aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
 aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!

are requests by the sequencer to abort a task because of a protocol
error.  IBM did some extensive testing with Seagate drives and found
that the protocol errors were genuine and the result of drive firmware
problems.  IBM released a version of Seagate firmware (BA17) to correct
these.  Unfortunately, your drive identifies its firmware as S513, which
is likely OEM firmware from another vendor ... however, that vendor may
have an update which corrects the problem.

Of course, the other issue is this:

 aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!

This is a bug in the driver.  It's not finding the task in the
outstanding list.  The problem seems to be that it's taking the task
pointer from the escb which, by definition, is always NULL.  It should be
taking the task from the ascb it finds by looping over the pending queue.

If you're willing, could you try this patch, which may correct the
problem?  It's sort of like falling off a cliff: if you never go near
the edge (i.e. you upgrade the drive fw) you never fall off;
alternatively, it would be nice if you could help me put up guard rails
just in case.

Thanks,

James

---
diff --git a/drivers/scsi/aic94xx/aic94xx_scb.c b/drivers/scsi/aic94xx/aic94xx_scb.c
index 0febad4..ab35050 100644
--- a/drivers/scsi/aic94xx/aic94xx_scb.c
+++ b/drivers/scsi/aic94xx/aic94xx_scb.c
@@ -458,13 +458,19 @@ static void escb_tasklet_complete(struct asd_ascb *ascb,
 		tc_abort = le16_to_cpu(tc_abort);
 
 		list_for_each_entry_safe(a, b, &asd_ha->seq.pend_q, list) {
-			struct sas_task *task = ascb->uldd_task;
+			struct sas_task *task = a->uldd_task;
+
+			if (a->tc_index != tc_abort)
+				continue;
 
-			if (task && a->tc_index == tc_abort) {
+			if (task) {
 				failed_dev = task->dev;
 				sas_task_abort(task);
-				break;
+			} else {
+				ASD_DPRINTK("R_T_A for non TASK scb 0x%x\n",
+					    a->scb->header.opcode);
 			}
+			break;
 		}
 
 		if (!failed_dev) {
@@ -478,7 +484,7 @@ static void escb_tasklet_complete(struct asd_ascb *ascb,
 		 * that the EH will wake up and do something.
 		 */
 		list_for_each_entry_safe(a, b, &asd_ha->seq.pend_q, list) {
-			struct sas_task *task = ascb->uldd_task;
+			struct sas_task *task = a->uldd_task;
 
 			if (task &&
 			    task->dev == failed_dev &&




Re: aic94xx: failing on high load (another data point)

2008-02-15 Thread Keith Hopkins
On 02/15/2008 11:28 PM, James Bottomley wrote:
 If you're willing, could you try this patch which may correct the
 problem?  It's sort of like falling off a cliff: if you never go near
 the edge (i.e. you upgrade the drive fw) you never fall off;
 alternatively, it would be nice if you could help me put up guard rails
 just in case.

Hi James,

  Thanks for your feedback & suggestions.  Yes, I'll give the patch a try.  It
might take a few days to get onto the system.  The system/drive isn't IBM, but
I'll also see if I can track down a firmware update for the protocol errors.

--Keith




Re: aic94xx: failing on high load (another data point)

2008-02-14 Thread Keith Hopkins
On 01/31/2008 03:29 AM, Darrick J. Wong wrote:
 On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
 V28.  My controller functions well with a single drive (low-medium load).  
 Unfortunately, all attempts to get the mirrors in sync fail and usually hang 
 the whole box.
 
 Adaptec posted a V30 sequencer on their website; does that fix the
 problems?
 
 http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm
 

I lost connectivity to the drive again, and had to reboot to recover the drive, 
so it seemed a good time to try out the V30 firmware.  Unfortunately, it didn't 
work any better.  Details are in the attachment.

--Keith


Running V28 Firmware

Feb 14 21:45:55 titan syslog-ng[28369]: STATS: dropped 60
Feb 14 21:47:59 titan kernel: sd 0:0:1:0: rejecting I/O to offline device
Feb 14 21:47:59 titan kernel: raid1: sdb2: rescheduling sector 0
Feb 14 21:47:59 titan kernel: sd 0:0:1:0: rejecting I/O to offline device
Feb 14 21:47:59 titan kernel: sd 0:0:1:0: rejecting I/O to offline device
Feb 14 21:47:59 titan kernel: raid1: Disk failure on sdb2, disabling device. 
Feb 14 21:47:59 titan kernel:   Operation continuing on 1 devices
Feb 14 21:47:59 titan kernel: raid1: sda2: redirecting sector 0 to another mirror
Feb 14 21:47:59 titan kernel: RAID1 conf printout:
Feb 14 21:47:59 titan kernel:  --- wd:1 rd:2
Feb 14 21:47:59 titan kernel:  disk 0, wo:1, o:0, dev:sdb2
Feb 14 21:47:59 titan kernel:  disk 1, wo:0, o:1, dev:sda2
Feb 14 21:47:59 titan kernel: RAID1 conf printout:
Feb 14 21:47:59 titan kernel:  --- wd:1 rd:2
Feb 14 21:47:59 titan kernel:  disk 1, wo:0, o:1, dev:sda2
Feb 14 21:50:08 titan smartd[28072]: Device: /dev/sdb, No such device or address, open() failed

V30 firmware was installed in the OS via rpm.  Ran mkinitrd and...

== manually reboot to get drive back online ==

(lots of kruft removed)

Linux version 2.6.22.16-0.1-default ([EMAIL PROTECTED]) (gcc version 4.2.1 (SUSE Linux)) #1 SMP 2008/01/23 14:28:52 UTC
Command line: root=/dev/vgtitan/lvroot vga=0x346 noresume splash=silent PROFILE=default profile=default [EMAIL PROTECTED]  splash=off nosplash
SMP: Allowing 8 CPUs, 0 hotplug CPUs
Kernel command line: root=/dev/vgtitan/lvroot vga=0x346 noresume splash=silent PROFILE=default profile=default [EMAIL PROTECTED]  splash=off nosplash
bootsplash: silent mode.
Initializing CPU#0
time.c: Detected 2327.500 MHz processor.
Memory: 14256200k/15728640k available (2053k kernel code, 422808k reserved, 1017k data, 316k init)
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 4096K
CPU 0/0 - Node 0
using mwait in idle threads.
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
CPU0: Thermal monitoring handled by SMI
SMP alternatives: switching to UP code
Unpacking initramfs... done
Freeing initrd memory: 4931k freed
ACPI: Core revision 20070126
Brought up 8 CPUs
io scheduler cfq registered (default)
Boot video device is :08:00.0
Freeing unused kernel memory: 316k freed
ACPI Error (dsopcode-0250): No pointer back to NS node in buffer obj 8103b0131c60 [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU0._PDC] (Node 8103b0ae0770), AE_AML_INTERNAL
ACPI: Processor [CPU0] (supports 8 throttling states)
md: raid1 personality registered for level 1
device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: [EMAIL PROTECTED]
BIOS EDD facility v0.16 2004-Jun-25, 4 devices found
SCSI subsystem initialized
aic94xx: Adaptec aic94xx SAS/SATA driver version 1.0.3 loaded
ACPI: PCI Interrupt :05:02.0[A] - GSI 16 (level, low) - IRQ 16
aic94xx: found Adaptec AIC-9410W SAS/SATA Host Adapter, device :05:02.0
scsi0 : aic94xx
aic94xx: BIOS present (1,1), 1918
aic94xx: ue num:2, ue size:88
aic94xx: manuf sect SAS_ADDR 5d10002d9380
aic94xx: manuf sect PCBA SN 0BB0C54904VA
aic94xx: ms: num_phy_desc: 8
aic94xx: ms: phy0: ENABLED
aic94xx: ms: phy1: ENABLED
aic94xx: ms: phy2: ENABLED
aic94xx: ms: phy3: ENABLED
aic94xx: ms: phy4: ENABLED
aic94xx: ms: phy5: ENABLED
aic94xx: ms: phy6: ENABLED
aic94xx: ms: phy7: ENABLED
aic94xx: ms: max_phys:0x8, num_phys:0x8
aic94xx: ms: enabled_phys:0xff
aic94xx: ctrla: phy0: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy1: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy2: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy3: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy4: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy5: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy6: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy7: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: max_scbs:512, max_ddbs:128
aic94xx: setting phy0 addr to 5d10002d9380

Re: aic94xx: failing on high load (another data point)

2008-01-30 Thread Keith Hopkins
 We've tried the new Adaptec firmware shipped with SLES and we got
 ourselves a new error string that appears just above the error messages that
 you have seen before and that were attached to the original message:

 kernel: aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
 kernel: aic94xx: escb_tasklet_complete: Can't find task (tc=71) to abort!
 
 Do you think they have any significance?

Hi Jan,

Which firmware version is that?

I get similar errors under a high load (rebuilding sw raid1 partitions) with
sequencer firmware version 1.1 (V28), which will eventually hang my box.
Previous firmware versions would also hang in similar situations.

My box is OpenSuSE 10.3 with a 2.6.22.13-0.3-default kernel,
2x quad-core Xeon E5345 CPUs @ 2.33GHz, stepping 0b,
14G of DDR2-667 memory, and an
Adaptec 48300 (AIC-9410W SAS/SATA Host Adapter, device :05:02.0)
directly connected to two SEAGATE ST3146855SS drives.

My 2 bits.

--Keith Hopkins




Re: aic94xx: failing on high load (another data point)

2008-01-30 Thread Keith Hopkins
On 01/30/2008 05:14 PM, Jan Sembera wrote:
 
   We tried firmware versions V28, V30, and even V32, which is, as
 far as I know, not yet available on the Adaptec website. All of them were
 unfortunately displaying exactly the same behaviour :-(. Did you get your
 SAS controller working? And if so, with which firmware was that?
 

V28.  My controller functions well with a single drive (low-medium load).  
Unfortunately, all attempts to get the mirrors in sync fail and usually hang 
the whole box.

--Keith


Re: aic94xx: failing on high load (another data point)

2008-01-30 Thread Darrick J. Wong
On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
 
 V28.  My controller functions well with a single drive (low-medium load).  
 Unfortunately, all attempts to get the mirrors in sync fail and usually hang 
 the whole box.

Adaptec posted a V30 sequencer on their website; does that fix the
problems?

http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm

--D