Re: aic94xx: failing on high load (another data point)
On Wed, 2008-02-20 at 17:54 +0800, Keith Hopkins wrote:
> On 02/20/2008 11:48 AM, James Bottomley wrote:
>> On Tue, 2008-02-19 at 10:22 -0600, James Bottomley wrote:
>>> I'll see if I can come up with patches to fix this ... or at least
>>> mitigate the problems it causes.
>>
>> Darrick's working on the ascb sequencer use-after-free problem. I looked
>> into some of the error handling in libsas, and apparently that's a bit of
>> a huge screw-up too. There are a number of places where we won't complete
>> a task that is being errored out, which causes timeout errors. This
>> patch, for libsas, fixes all of that. I've managed to reproduce some of
>> your problem by firing random resets across a disk under load, and this
>> recovers the protocol errors for me. However, I can't reproduce the TMF
>> timeout which caused the sequencer screw-up, so you still need to wait
>> for Darrick's fix as well.
>>
>> James
>
> Hi James, Darrick,
>
> Thanks again for looking more into this. I'll wait for Darrick's patch and
> try it together with this libsas patch. Should I leave James' first patch
> in also?

Yes, that's a requirement just to get the REQ_TASK_ABORT for the protocol errors actually to work ... I'm afraid this is like peeling an onion, as I said, and you're going to build up layers of patches. However, the ones that are obvious bug fixes and that I can test (all of them so far) I'm putting in the rc-fixes tree of SCSI, so you can download a rollup here:

http://www.kernel.org/pub/linux/kernel/people/jejb/scsi-rc-fixes-2.6.diff

James
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: aic94xx: failing on high load (another data point)
On Tue, 2008-02-19 at 10:22 -0600, James Bottomley wrote:
> I'll see if I can come up with patches to fix this ... or at least
> mitigate the problems it causes.

Darrick's working on the ascb sequencer use-after-free problem. I looked into some of the error handling in libsas, and apparently that's a bit of a huge screw-up too. There are a number of places where we won't complete a task that is being errored out, which causes timeout errors. This patch, for libsas, fixes all of that. I've managed to reproduce some of your problem by firing random resets across a disk under load, and this recovers the protocol errors for me. However, I can't reproduce the TMF timeout which caused the sequencer screw-up, so you still need to wait for Darrick's fix as well.

James

---

diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index f869fba..b656e29 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -51,8 +51,6 @@ static void sas_scsi_task_done(struct sas_task *task)
 {
 	struct task_status_struct *ts = task->task_status;
 	struct scsi_cmnd *sc = task->uldd_task;
-	struct sas_ha_struct *sas_ha = SHOST_TO_SAS_HA(sc->device->host);
-	unsigned ts_flags = task->task_state_flags;
 	int hs = 0, stat = 0;
 
 	if (unlikely(!sc)) {
@@ -120,11 +118,7 @@ static void sas_scsi_task_done(struct sas_task *task)
 	sc->result = (hs << 16) | stat;
 	list_del_init(&task->list);
 	sas_free_task(task);
-	/* This is very ugly but this is how SCSI Core works. */
-	if (ts_flags & SAS_TASK_STATE_ABORTED)
-		scsi_eh_finish_cmd(sc, &sas_ha->eh_done_q);
-	else
-		sc->scsi_done(sc);
+	sc->scsi_done(sc);
 }
 
 static enum task_attribute sas_scsi_get_task_attr(struct scsi_cmnd *cmd)
@@ -255,13 +249,27 @@ out:
 	return res;
 }
 
+static void sas_eh_finish_cmd(struct scsi_cmnd *cmd)
+{
+	struct sas_task *task = TO_SAS_TASK(cmd);
+	struct sas_ha_struct *sas_ha = SHOST_TO_SAS_HA(cmd->device->host);
+
+	/* First off call task_done.  However, task will
+	 * be free'd after this */
+	task->task_done(task);
+	/* now finish the command and move it on to the error
+	 * handler done list, this also takes it off the
+	 * error handler pending list */
+	scsi_eh_finish_cmd(cmd, &sas_ha->eh_done_q);
+}
+
 static void sas_scsi_clear_queue_lu(struct list_head *error_q, struct scsi_cmnd *my_cmd)
 {
 	struct scsi_cmnd *cmd, *n;
 
 	list_for_each_entry_safe(cmd, n, error_q, eh_entry) {
 		if (cmd == my_cmd)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -274,7 +282,7 @@ static void sas_scsi_clear_queue_I_T(struct list_head *error_q,
 		struct domain_device *x = cmd_to_domain_dev(cmd);
 
 		if (x == dev)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -288,7 +296,7 @@ static void sas_scsi_clear_queue_port(struct list_head *error_q,
 		struct asd_sas_port *x = dev->port;
 
 		if (x == port)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -528,14 +536,14 @@ Again:
 		case TASK_IS_DONE:
 			SAS_DPRINTK("%s: task 0x%p is done\n", __FUNCTION__,
 				    task);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			continue;
 		case TASK_IS_ABORTED:
 			SAS_DPRINTK("%s: task 0x%p is aborted\n",
 				    __FUNCTION__, task);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			continue;
@@ -547,7 +555,7 @@ Again:
 				    "recovered\n",
 				    SAS_ADDR(task->dev),
 				    cmd->device->lun);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			sas_scsi_clear_queue_lu(work_q, cmd);
@@ -562,7 +570,7 @@ Again:
 			if (tmf_resp == TMF_RESP_FUNC_COMPLETE) {
 				SAS_DPRINTK("I_T %016llx recovered\n",
 					    SAS_ADDR(task->dev->sas_addr));
-				task->task_done(task);
+
Re: aic94xx: failing on high load (another data point)
On Mon, 2008-02-18 at 22:26 +0800, Keith Hopkins wrote:
> Well, that made life interesting, but didn't seem to fix anything. The
> behavior is about the same as before, but with more verbose errors. I
> failed one member of the raid and had it rebuild as a test... which hangs
> for a while, and then the drive falls off-line.

Actually, it now finds the task and tries to do error handling for it ... so we've now uncovered bugs in the error handler. It may not look like it, but this is actually progress. Although I'm afraid it's going to be a bit like peeling an onion: every time one error gets fixed, you just get to the next layer of errors.

> Please grab the dmesg output in all its gory glory from here:
>
> http://wiki.hopnet.net/dokuwiki/lib/exe/fetch.php?media=myit:sas:dmesg-20080218-wpatch-fail.txt.gz
>
> The drive is a Dell OEM drive, but it's not in a Dell system. There is at
> least one firmware upgrade (S527) for it, but the Dell loader refuses to
> load it (because it isn't in a Dell system...). Does anyone know a
> generic way to load new firmware onto a SAS drive?

The firmware upgrade tools are usually vendor specific, though, because the format of the firmware file is vendor specific. Could you just put it in a Dell box to upgrade?

James
Re: aic94xx: failing on high load (another data point)
On Fri, 2008-02-15 at 00:11 +0800, Keith Hopkins wrote:
> On 01/31/2008 03:29 AM, Darrick J. Wong wrote:
>> On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
>>> V28. My controller functions well with a single drive (low-medium
>>> load). Unfortunately, all attempts to get the mirrors in sync fail and
>>> usually hang the whole box.
>>
>> Adaptec posted a V30 sequencer on their website; does that fix the
>> problems?
>>
>> http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm
>
> I lost connectivity to the drive again and had to reboot to recover it,
> so it seemed a good time to try out the V30 firmware. Unfortunately, it
> didn't work any better. Details are in the attachment.

Well, I can offer some hope. The errors you report:

aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!

are requests by the sequencer to abort a task because of a protocol error. IBM did some extensive testing with Seagate drives and found that the protocol errors were genuine and the result of drive firmware problems. IBM released a version of the Seagate firmware (BA17) to correct these. Unfortunately, your drive identifies its firmware as S513, which is likely OEM firmware from another vendor ... however, that vendor may have an update which corrects the problem.

Of course, the other issue is this:

aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!

This is a bug in the driver: it's not finding the task in the outstanding list. The problem seems to be that it's taking the task from the escb, which, by definition, is always NULL. It should be taking the task from the ascb it finds by looping over the pending queue. If you're willing, could you try this patch, which may correct the problem? It's sort of like falling off a cliff: if you never go near the edge (i.e. you upgrade the drive fw) you never fall off; alternatively, it would be nice if you could help me put up guard rails just in case.
Thanks,

James

---

diff --git a/drivers/scsi/aic94xx/aic94xx_scb.c b/drivers/scsi/aic94xx/aic94xx_scb.c
index 0febad4..ab35050 100644
--- a/drivers/scsi/aic94xx/aic94xx_scb.c
+++ b/drivers/scsi/aic94xx/aic94xx_scb.c
@@ -458,13 +458,19 @@ static void escb_tasklet_complete(struct asd_ascb *ascb,
 		tc_abort = le16_to_cpu(tc_abort);
 
 		list_for_each_entry_safe(a, b, &asd_ha->seq.pend_q, list) {
-			struct sas_task *task = ascb->uldd_task;
+			struct sas_task *task = a->uldd_task;
+
+			if (a->tc_index != tc_abort)
+				continue;
 
-			if (task && a->tc_index == tc_abort) {
+			if (task) {
 				failed_dev = task->dev;
 				sas_task_abort(task);
-				break;
+			} else {
+				ASD_DPRINTK("R_T_A for non TASK scb 0x%x\n",
+					    a->scb->header.opcode);
 			}
+			break;
 		}
 
 		if (!failed_dev) {
@@ -478,7 +484,7 @@ static void escb_tasklet_complete(struct asd_ascb *ascb,
 		 * that the EH will wake up and do something. */
 		list_for_each_entry_safe(a, b, &asd_ha->seq.pend_q, list) {
-			struct sas_task *task = ascb->uldd_task;
+			struct sas_task *task = a->uldd_task;
 
 			if (task && task->dev == failed_dev
Re: aic94xx: failing on high load (another data point)
On 02/15/2008 11:28 PM, James Bottomley wrote:
> If you're willing, could you try this patch, which may correct the
> problem? It's sort of like falling off a cliff: if you never go near the
> edge (i.e. you upgrade the drive fw) you never fall off; alternatively,
> it would be nice if you could help me put up guard rails just in case.

Hi James,

Thanks for your feedback and suggestions. Yes, I'll give the patch a try; it might take a few days to get it onto the system. The system/drive isn't IBM, but I'll also see if I can track down a firmware update for the protocol errors.

--Keith
Re: aic94xx: failing on high load (another data point)
On 01/31/2008 03:29 AM, Darrick J. Wong wrote:
> On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
>> V28. My controller functions well with a single drive (low-medium
>> load). Unfortunately, all attempts to get the mirrors in sync fail and
>> usually hang the whole box.
>
> Adaptec posted a V30 sequencer on their website; does that fix the
> problems?
>
> http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm

I lost connectivity to the drive again, and had to reboot to recover the drive, so it seemed a good time to try out the V30 firmware. Unfortunately, it didn't work any better. Details are in the attachment.

--Keith

Running V28 Firmware:

Feb 14 21:45:55 titan syslog-ng[28369]: STATS: dropped 60
Feb 14 21:47:59 titan kernel: sd 0:0:1:0: rejecting I/O to offline device
Feb 14 21:47:59 titan kernel: raid1: sdb2: rescheduling sector 0
Feb 14 21:47:59 titan kernel: sd 0:0:1:0: rejecting I/O to offline device
Feb 14 21:47:59 titan kernel: sd 0:0:1:0: rejecting I/O to offline device
Feb 14 21:47:59 titan kernel: raid1: Disk failure on sdb2, disabling device.
Feb 14 21:47:59 titan kernel: Operation continuing on 1 devices
Feb 14 21:47:59 titan kernel: raid1: sda2: redirecting sector 0 to another mirror
Feb 14 21:47:59 titan kernel: RAID1 conf printout:
Feb 14 21:47:59 titan kernel: --- wd:1 rd:2
Feb 14 21:47:59 titan kernel: disk 0, wo:1, o:0, dev:sdb2
Feb 14 21:47:59 titan kernel: disk 1, wo:0, o:1, dev:sda2
Feb 14 21:47:59 titan kernel: RAID1 conf printout:
Feb 14 21:47:59 titan kernel: --- wd:1 rd:2
Feb 14 21:47:59 titan kernel: disk 1, wo:0, o:1, dev:sda2
Feb 14 21:50:08 titan smartd[28072]: Device: /dev/sdb, No such device or address, open() failed

V30 Firmware was installed in OS via rpm. Ran mkinitrd and...

== manually reboot to get drive back online ==
(lots of kruft removed)

Linux version 2.6.22.16-0.1-default ([EMAIL PROTECTED]) (gcc version 4.2.1 (SUSE Linux)) #1 SMP 2008/01/23 14:28:52 UTC
Command line: root=/dev/vgtitan/lvroot vga=0x346 noresume splash=silent PROFILE=default profile=default [EMAIL PROTECTED] splash=off nosplash
SMP: Allowing 8 CPUs, 0 hotplug CPUs
Kernel command line: root=/dev/vgtitan/lvroot vga=0x346 noresume splash=silent PROFILE=default profile=default [EMAIL PROTECTED] splash=off nosplash
bootsplash: silent mode.
Initializing CPU#0
time.c: Detected 2327.500 MHz processor.
Memory: 14256200k/15728640k available (2053k kernel code, 422808k reserved, 1017k data, 316k init)
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 4096K
CPU 0/0 -> Node 0
using mwait in idle threads.
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
CPU0: Thermal monitoring handled by SMI
SMP alternatives: switching to UP code
Unpacking initramfs... done
Freeing initrd memory: 4931k freed
ACPI: Core revision 20070126
Brought up 8 CPUs
io scheduler cfq registered (default)
Boot video device is :08:00.0
Freeing unused kernel memory: 316k freed
ACPI Error (dsopcode-0250): No pointer back to NS node in buffer obj 8103b0131c60 [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU0._PDC] (Node 8103b0ae0770), AE_AML_INTERNAL
ACPI: Processor [CPU0] (supports 8 throttling states)
md: raid1 personality registered for level 1
device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: [EMAIL PROTECTED]
BIOS EDD facility v0.16 2004-Jun-25, 4 devices found
SCSI subsystem initialized
aic94xx: Adaptec aic94xx SAS/SATA driver version 1.0.3 loaded
ACPI: PCI Interrupt :05:02.0[A] -> GSI 16 (level, low) -> IRQ 16
aic94xx: found Adaptec AIC-9410W SAS/SATA Host Adapter, device :05:02.0
scsi0 : aic94xx
aic94xx: BIOS present (1,1), 1918
aic94xx: ue num:2, ue size:88
aic94xx: manuf sect SAS_ADDR 5d10002d9380
aic94xx: manuf sect PCBA SN 0BB0C54904VA
aic94xx: ms: num_phy_desc: 8
aic94xx: ms: phy0: ENABLED
aic94xx: ms: phy1: ENABLED
aic94xx: ms: phy2: ENABLED
aic94xx: ms: phy3: ENABLED
aic94xx: ms: phy4: ENABLED
aic94xx: ms: phy5: ENABLED
aic94xx: ms: phy6: ENABLED
aic94xx: ms: phy7: ENABLED
aic94xx: ms: max_phys:0x8, num_phys:0x8
aic94xx: ms: enabled_phys:0xff
aic94xx: ctrla: phy0: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy1: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy2: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy3: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy4: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy5: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy6: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy7: sas_addr: 5d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: max_scbs:512, max_ddbs:128
aic94xx: setting phy0 addr to 5d10002d9380
Re: aic94xx: failing on high load (another data point)
On 01/30/2008, Jan Sembera wrote:
> We've tried the new Adaptec firmware shipped with SLES, and we got a new
> error string that appears just above the error messages you have seen
> before and that were attached to the original message:
>
> kernel: aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
> kernel: aic94xx: escb_tasklet_complete: Can't find task (tc=71) to abort!
>
> Do you think they have any significance?

Hi Jan,

Which firmware version is that? I get similar errors under a high load (rebuilding sw raid1 partitions) with sequencer firmware version 1.1 (V28), which will eventually hang my box. Previous fw versions would also hang in similar situations.

My box is OpenSuSE 10.3, a 2.6.22.13-0.3-default kernel, 2x quad-core Xeon E5345 @ 2.33GHz stepping 0b, 14G of DDR2-667 memory, and an Adaptec 48300 (AIC-9410W SAS/SATA Host Adapter, device :05:02.0) directly connected to two SEAGATE ST3146855SS drives.

My 2 bits.

--Keith Hopkins
Re: aic94xx: failing on high load (another data point)
On 01/30/2008 05:14 PM, Jan Sembera wrote:
> We tried firmware versions V28, V30, and even V32, which is, as far as I
> know, not yet available on the Adaptec website. Unfortunately, all of
> them displayed exactly the same behaviour :-(. Did you get your SAS
> controller working? And if so, with which firmware?

V28. My controller functions well with a single drive (low-medium load). Unfortunately, all attempts to get the mirrors in sync fail and usually hang the whole box.

--Keith
Re: aic94xx: failing on high load (another data point)
On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
> V28. My controller functions well with a single drive (low-medium load).
> Unfortunately, all attempts to get the mirrors in sync fail and usually
> hang the whole box.

Adaptec posted a V30 sequencer on their website; does that fix the problems?

http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm

--D