Re: Change Libata Error Handling for Drive Testing

2007-03-22 Thread Fajun Chen

Hi Tejun,

JFYI, it turns out that spurious interrupts was caused by User Scan
before drive is ready.  I wait for 2 seconds after drive is powered on
which is not sufficient for some drives.  Alt status should be checked
first but there's no good way to check it in user space. Does User
Scan related code check alt status before drive is touched?

Thanks,
Fajun

On 3/19/07, Fajun Chen [EMAIL PROTECTED] wrote:

On 3/19/07, Tejun Heo [EMAIL PROTECTED] wrote:
 Fajun Chen wrote:
  Please ignore the changes to pata_sil680.c. The same failure happened
  to standard sil680 driver without my change as well.

 Does it also happen when the second port is empty?


Yes, it happens even when one of the port (either one) is powered off.
It used to happen in the middle of our IO test application, now it
happened much early in our test spinup process with debugging version
of ata_host_intr() function. We boot up the target (ARM XScale
processor) with hard drive powered off, then power up the drive and do
test spinup. What test spinup does is to issue sysfs user scan on the
port followed by Identify Device.

Below is my debugging version of ata_host_intr() function with
ATA_IRQ_TRAP enabled and hacked. What puzzled me is that none of the
sub-counters (initial value is 1) get incremented in most failures?
Please see dmesg log for details. I did see one failure (out of many)
where idle_irq_hsm_state is incremented and matches idle_irq
counter though.

Thanks,
Fajun

inline unsigned int ata_host_intr (struct ata_port *ap,
   struct ata_queued_cmd *qc)
{
u8 status, host_stat = 0;

/*  VPRINTK(ata%u: protocol %d task_state %d\n, */
/*  ap-id, qc-tf.protocol, ap-hsm_task_state); */
/*  printk(KERN_INFO ata%u: protocol %d task_state %d\n,  */
/* ap-id, qc-tf.protocol, ap-hsm_task_state);  */

/* Check whether we are expecting interrupt in this state */
switch (ap-hsm_task_state) {
case HSM_ST_FIRST:
/* Some pre-ATAPI-4 devices assert INTRQ
 * at this state when ready to receive CDB.
 */

/* Check the ATA_DFLAG_CDB_INTR flag is enough here.
 * The flag was turned on only for atapi devices.
 * No need to check is_atapi_taskfile(qc-tf) again.
 */
if (!(qc-dev-flags  ATA_DFLAG_CDB_INTR))
{
/* printk(KERN_INFO ata%u: flags %lu\n,  */
/* ap-id, qc-dev-flags); */
ap-stats.idle_irq_non_atapi++;
goto idle_irq;
}
break;
case HSM_ST_LAST:
if (qc-tf.protocol == ATA_PROT_DMA ||
qc-tf.protocol == ATA_PROT_ATAPI_DMA) {
/* check status of DMA engine */
host_stat = ap-ops-bmdma_status(ap);
VPRINTK(ata%u: host_stat 0x%X\n, ap-id, host_stat);

/* if it's not our irq... */
if (!(host_stat  ATA_DMA_INTR))
/* printk(KERN_INFO ata%u: host_stat %d\n,  */
/* ap-id, host_stat); */
ap-stats.idle_irq_host_state++;
goto idle_irq;

/* before we do anything else, clear DMA-Start bit */
ap-ops-bmdma_stop(qc);

if (unlikely(host_stat  ATA_DMA_ERR)) {
/* error when transfering data to/from memory */
qc-err_mask |= AC_ERR_HOST_BUS;
ap-hsm_task_state = HSM_ST_ERR;
}
}
break;
case HSM_ST:
break;
default:
/* printk(KERN_INFO ata%u: hsm_state %d\n,  */
/* ap-id, ap-hsm_task_state); */
ap-stats.idle_irq_hsm_state++;
goto idle_irq;
}

/* check altstatus */
status = ata_altstatus(ap);
if (status  ATA_BUSY)
{
/* printk(KERN_INFO ata%u: altstatus %d\n,  */
/* ap-id, status); */
ap-stats.idle_irq_altstatus++;
goto idle_irq;
}

/* check main status, clearing INTRQ */
status = ata_chk_status(ap);
if (unlikely(status  ATA_BUSY))
{
/* printk(KERN_INFO ata%u: status %d\n,  */
/* ap-id, status); */
ap-stats.idle_irq_status++;
goto idle_irq;
}

/* ack bmdma irq events */
ap-ops-irq_clear(ap);

ata_hsm_move(ap, qc, status, 0);
return 1;   /* irq handled */

idle_irq:
ap-stats.idle_irq++;

#ifdef ATA_IRQ_TRAP
if ((ap-stats.idle_irq % 

Re: Change Libata Error Handling for Drive Testing

2007-03-19 Thread Fajun Chen

  Another error we're analyzing is disabled IRQ due to screaming
  unhandled interrupts from ata_interrupt(). Since this happened on
  different drives, it doesn't seem to be a drive issue. Any known
  problem on this one? Could long ISR locking by some rogue application
  cause this problem? To detect the problem early, it probably makes
  sense to add error handling code to check for screeming unhandled ISR
  in ata_interrupt() function?

 Which driver are you using?  Please report with lspci -nn and full dmesg
 with linux-ide cc'd.

We use PATA Sil680.  I don't have lspci ported yet on the target test
hardware. I'll collect the kernel log and report it to you later.


Kernel revision is 2.6.18-rc2 and libata 2.00. The following changes
were made to pata_sil680.c:
1.For PCI IO performance:
  /* FIXME: double check */
//pci_write_config_byte(pdev, PCI_CACHE_LINE_SIZE, (class_rev) ? 1 : 
255);
pci_write_config_byte(pdev, PCI_CACHE_LINE_SIZE, 0);
pci_write_config_byte(pdev, PCI_LATENCY_TIMER, 0x40);
2.  Applied one of the patches to change TF1 timing register to 0x2283

Below are the kernel logs at the time of the failure:
call 0xc0015f90: init_ramfs_fs+0x0/0x1c()
[42949375.53] Calling initcall 0xc001618c: noop_init+0x0/0x1c()
[42949375.53] io scheduler noop registered
[42949375.53] Calling initcall 0xc00161a8: as_init+0x0/0x80()
[42949375.53] io scheduler anticipatory registered
[42949375.53] Calling initcall 0xc0016228: deadline_init+0x0/0x78()
[42949375.53] io scheduler deadline registered
[42949375.53] Calling initcall 0xc00162a0: cfq_init+0x0/0x10c()
[42949375.53] io scheduler cfq registered (default)
[42949375.53] Calling initcall 0xc00175b8: pci_init+0x0/0x3c()
[42949375.53] PCI: Calling quirk c0018048 for :00:01.0
[42949375.53] PCI: Calling quirk c001cb64 for :00:01.0
[42949375.53] Calling initcall 0xc0019044: pci_sysfs_init+0x0/0x44()
[42949375.53] Calling initcall 0xc0019298: pci_proc_init+0x0/0x90()
[42949375.53] Calling initcall 0xc0019e9c: rand_initialize+0x0/0x38()
[42949375.53] Calling initcall 0xc0019f70: tty_init+0x0/0x1f8()
[42949375.53] Calling initcall 0xc001a168: pty_init+0x0/0x3fc()
[42949375.72] Calling initcall 0xc00db7cc: aauc_init+0x0/0x14c()
[42949375.72] Calling initcall 0xc00e8b00: usbfifoInit+0x0/0x1c()
[42949375.72] Calling initcall 0xc001b184: serial8250_init+0x0/0x130()
[42949375.72] Serial: 8250/16550 driver $Revision: 1.4 $ 2 ports,
IRQ sharing disabled
[42949375.72] serial8250: ttyS0 at MMIO 0xfe80 (irq = 29) is a SC16C752
[42949375.72] serial8250: ttyS1 at MMIO 0xfe88 (irq = 30) is a SC16C752
[42949375.72] Calling initcall 0xc001b578: serial8250_pci_init+0x0/0x20()
[42949375.72] Calling initcall 0xc001bc80: rd_init+0x0/0x194()
[42949375.73] RAMDISK driver initialized: 16 RAM disks of
10485760K size 1024 blocksize
[42949375.73] Calling initcall 0xc001bef4: net_olddevs_init+0x0/0xdc()
[42949375.73] Calling initcall 0xc001c024: tun_init+0x0/0x64()
[42949375.73] tun: Universal TUN/TAP device driver, 1.6
[42949375.73] tun: (C) 1999-2004 Max Krasnyansky [EMAIL PROTECTED]
[42949375.73] Calling initcall 0xc001c3ac: spi_transport_init+0x0/0x44()
[42949375.73] Calling initcall 0xc001c3f0: fc_transport_init+0x0/0x4c()
[42949375.73] Calling initcall 0xc001c43c: iscsi_transport_init+0x0/0x108()
[42949375.73] Calling initcall 0xc001c544: ata_init+0x0/0xa0()
[42949375.73] libata version 2.00 loaded.
[42949375.73] Calling initcall 0xc001c5e4: sil_init+0x0/0x20()
[42949375.73] Calling initcall 0xc001c604: sil24_init+0x0/0x20()
[42949375.73] Calling initcall 0xc001c624: sil680_init+0x0/0x20()
[42949375.73] pata_sil680 :00:01.0: version 0.3.2
[42949375.73] sil680: BA5_EN = 1 clock = 00
[42949375.73] sil680: BA5_EN = 1 clock = 10
[42949375.73] sil680: 133MHz clock.
[42949375.73] PCI: enabling device :00:01.0 (0140 - 0143)
[42949375.73] PCI: Enabling bus mastering for device :00:01.0
[42949375.73] ata1: PATA max UDMA/133 cmd 0xFE10 ctl
0xFE22 bmdma 0xFE00 irq 27
[42949375.73] ata2: PATA max UDMA/133 cmd 0xFE18 ctl
0xFE26 bmdma 0xFE08 irq 27
[42949375.73] scsi0 : pata_sil680
[42949375.75] scsi1 : pata_sil680
[42949375.77] Calling initcall 0xc001c910: vsc_sata_init+0x0/0x20()
[42949375.77] Calling initcall 0xc001c930: init_sd+0x0/0x8c()
[42949375.77] Calling initcall 0xc001c9bc: init_sg+0x0/0x14c()
[42949375.77] Calling initcall 0xc001d03c: i2c_dev_init+0x0/0xa8()
[42949375.77] i2c /dev entries driver
[42949375.77] Calling initcall 0xc001d0e4: i2c_iop3xx_init+0x0/0x1c()
[42949375.77] Calling initcall 0xc001d100: hwmon_init+0x0/0x5c()
[42949375.77] Calling initcall 0xc001d74c: flow_cache_init+0x0/0x174()
[42949375.77] Calling initcall 0xc001eb74: xfrm4_transport_init+0x0/0x20()

Re: Change Libata Error Handling for Drive Testing

2007-03-19 Thread Fajun Chen

On 3/19/07, Fajun Chen [EMAIL PROTECTED] wrote:

   Another error we're analyzing is disabled IRQ due to screaming
   unhandled interrupts from ata_interrupt(). Since this happened on
   different drives, it doesn't seem to be a drive issue. Any known
   problem on this one? Could long ISR locking by some rogue application
   cause this problem? To detect the problem early, it probably makes
   sense to add error handling code to check for screeming unhandled ISR
   in ata_interrupt() function?
 
  Which driver are you using?  Please report with lspci -nn and full dmesg
  with linux-ide cc'd.
 
 We use PATA Sil680.  I don't have lspci ported yet on the target test
 hardware. I'll collect the kernel log and report it to you later.

Kernel revision is 2.6.18-rc2 and libata 2.00. The following changes
were made to pata_sil680.c:
1.For PCI IO performance:
   /* FIXME: double check */
//pci_write_config_byte(pdev, PCI_CACHE_LINE_SIZE, (class_rev) ? 1 : 
255);
pci_write_config_byte(pdev, PCI_CACHE_LINE_SIZE, 0);
pci_write_config_byte(pdev, PCI_LATENCY_TIMER, 0x40);
2.  Applied one of the patches to change TF1 timing register to 0x2283

Below are the kernel logs at the time of the failure:
call 0xc0015f90: init_ramfs_fs+0x0/0x1c()
[42949375.53] Calling initcall 0xc001618c: noop_init+0x0/0x1c()
[42949375.53] io scheduler noop registered
[42949375.53] Calling initcall 0xc00161a8: as_init+0x0/0x80()
[42949375.53] io scheduler anticipatory registered
[42949375.53] Calling initcall 0xc0016228: deadline_init+0x0/0x78()
[42949375.53] io scheduler deadline registered
[42949375.53] Calling initcall 0xc00162a0: cfq_init+0x0/0x10c()
[42949375.53] io scheduler cfq registered (default)
[42949375.53] Calling initcall 0xc00175b8: pci_init+0x0/0x3c()
[42949375.53] PCI: Calling quirk c0018048 for :00:01.0
[42949375.53] PCI: Calling quirk c001cb64 for :00:01.0
[42949375.53] Calling initcall 0xc0019044: pci_sysfs_init+0x0/0x44()
[42949375.53] Calling initcall 0xc0019298: pci_proc_init+0x0/0x90()
[42949375.53] Calling initcall 0xc0019e9c: rand_initialize+0x0/0x38()
[42949375.53] Calling initcall 0xc0019f70: tty_init+0x0/0x1f8()
[42949375.53] Calling initcall 0xc001a168: pty_init+0x0/0x3fc()
[42949375.72] Calling initcall 0xc00db7cc: aauc_init+0x0/0x14c()
[42949375.72] Calling initcall 0xc00e8b00: usbfifoInit+0x0/0x1c()
[42949375.72] Calling initcall 0xc001b184: serial8250_init+0x0/0x130()
[42949375.72] Serial: 8250/16550 driver $Revision: 1.4 $ 2 ports,
IRQ sharing disabled
[42949375.72] serial8250: ttyS0 at MMIO 0xfe80 (irq = 29) is a SC16C752
[42949375.72] serial8250: ttyS1 at MMIO 0xfe88 (irq = 30) is a SC16C752
[42949375.72] Calling initcall 0xc001b578: serial8250_pci_init+0x0/0x20()
[42949375.72] Calling initcall 0xc001bc80: rd_init+0x0/0x194()
[42949375.73] RAMDISK driver initialized: 16 RAM disks of
10485760K size 1024 blocksize
[42949375.73] Calling initcall 0xc001bef4: net_olddevs_init+0x0/0xdc()
[42949375.73] Calling initcall 0xc001c024: tun_init+0x0/0x64()
[42949375.73] tun: Universal TUN/TAP device driver, 1.6
[42949375.73] tun: (C) 1999-2004 Max Krasnyansky [EMAIL PROTECTED]
[42949375.73] Calling initcall 0xc001c3ac: spi_transport_init+0x0/0x44()
[42949375.73] Calling initcall 0xc001c3f0: fc_transport_init+0x0/0x4c()
[42949375.73] Calling initcall 0xc001c43c: iscsi_transport_init+0x0/0x108()
[42949375.73] Calling initcall 0xc001c544: ata_init+0x0/0xa0()
[42949375.73] libata version 2.00 loaded.
[42949375.73] Calling initcall 0xc001c5e4: sil_init+0x0/0x20()
[42949375.73] Calling initcall 0xc001c604: sil24_init+0x0/0x20()
[42949375.73] Calling initcall 0xc001c624: sil680_init+0x0/0x20()
[42949375.73] pata_sil680 :00:01.0: version 0.3.2
[42949375.73] sil680: BA5_EN = 1 clock = 00
[42949375.73] sil680: BA5_EN = 1 clock = 10
[42949375.73] sil680: 133MHz clock.
[42949375.73] PCI: enabling device :00:01.0 (0140 - 0143)
[42949375.73] PCI: Enabling bus mastering for device :00:01.0
[42949375.73] ata1: PATA max UDMA/133 cmd 0xFE10 ctl
0xFE22 bmdma 0xFE00 irq 27
[42949375.73] ata2: PATA max UDMA/133 cmd 0xFE18 ctl
0xFE26 bmdma 0xFE08 irq 27
[42949375.73] scsi0 : pata_sil680
[42949375.75] scsi1 : pata_sil680
[42949375.77] Calling initcall 0xc001c910: vsc_sata_init+0x0/0x20()
[42949375.77] Calling initcall 0xc001c930: init_sd+0x0/0x8c()
[42949375.77] Calling initcall 0xc001c9bc: init_sg+0x0/0x14c()
[42949375.77] Calling initcall 0xc001d03c: i2c_dev_init+0x0/0xa8()
[42949375.77] i2c /dev entries driver
[42949375.77] Calling initcall 0xc001d0e4: i2c_iop3xx_init+0x0/0x1c()
[42949375.77] Calling initcall 0xc001d100: hwmon_init+0x0/0x5c()
[42949375.77] Calling initcall 0xc001d74c: flow_cache_init+0x0/0x174()
[42949375.77] Calling 

Re: Change Libata Error Handling for Drive Testing

2007-03-19 Thread Tejun Heo

Fajun Chen wrote:

Please ignore the changes to pata_sil680.c. The same failure happened
to standard sil680 driver without my change as well.


Does it also happen when the second port is empty?

--
tejun
-
To unsubscribe from this list: send the line unsubscribe linux-ide in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Change Libata Error Handling for Drive Testing

2007-03-19 Thread Fajun Chen

On 3/19/07, Tejun Heo [EMAIL PROTECTED] wrote:

Fajun Chen wrote:
 Please ignore the changes to pata_sil680.c. The same failure happened
 to standard sil680 driver without my change as well.

Does it also happen when the second port is empty?



Yes, it happens even when one of the port (either one) is powered off.
It used to happen in the middle of our IO test application, now it
happened much early in our test spinup process with debugging version
of ata_host_intr() function. We boot up the target (ARM XScale
processor) with hard drive powered off, then power up the drive and do
test spinup. What test spinup does is to issue sysfs user scan on the
port followed by Identify Device.

Below is my debugging version of ata_host_intr() function with
ATA_IRQ_TRAP enabled and hacked. What puzzled me is that none of the
sub-counters (initial value is 1) get incremented in most failures?
Please see dmesg log for details. I did see one failure (out of many)
where idle_irq_hsm_state is incremented and matches idle_irq
counter though.

Thanks,
Fajun

inline unsigned int ata_host_intr (struct ata_port *ap,
  struct ata_queued_cmd *qc)
{
   u8 status, host_stat = 0;

/*  VPRINTK(ata%u: protocol %d task_state %d\n, */
/*  ap-id, qc-tf.protocol, ap-hsm_task_state); */
/*  printk(KERN_INFO ata%u: protocol %d task_state %d\n,  */
/* ap-id, qc-tf.protocol, ap-hsm_task_state);  */

   /* Check whether we are expecting interrupt in this state */
   switch (ap-hsm_task_state) {
   case HSM_ST_FIRST:
   /* Some pre-ATAPI-4 devices assert INTRQ
* at this state when ready to receive CDB.
*/

   /* Check the ATA_DFLAG_CDB_INTR flag is enough here.
* The flag was turned on only for atapi devices.
* No need to check is_atapi_taskfile(qc-tf) again.
*/
   if (!(qc-dev-flags  ATA_DFLAG_CDB_INTR))
   {
   /* printk(KERN_INFO ata%u: flags %lu\n,  */
/* ap-id, qc-dev-flags); */
   ap-stats.idle_irq_non_atapi++;
   goto idle_irq;
   }
   break;
   case HSM_ST_LAST:
   if (qc-tf.protocol == ATA_PROT_DMA ||
   qc-tf.protocol == ATA_PROT_ATAPI_DMA) {
   /* check status of DMA engine */
   host_stat = ap-ops-bmdma_status(ap);
   VPRINTK(ata%u: host_stat 0x%X\n, ap-id, host_stat);

   /* if it's not our irq... */
   if (!(host_stat  ATA_DMA_INTR))
   /* printk(KERN_INFO ata%u: host_stat %d\n,  */
/* ap-id, host_stat); */
   ap-stats.idle_irq_host_state++;
   goto idle_irq;

   /* before we do anything else, clear DMA-Start bit */
   ap-ops-bmdma_stop(qc);

   if (unlikely(host_stat  ATA_DMA_ERR)) {
   /* error when transfering data to/from memory */
   qc-err_mask |= AC_ERR_HOST_BUS;
   ap-hsm_task_state = HSM_ST_ERR;
   }
   }
   break;
   case HSM_ST:
   break;
   default:
   /* printk(KERN_INFO ata%u: hsm_state %d\n,  */
/* ap-id, ap-hsm_task_state); */
   ap-stats.idle_irq_hsm_state++;
   goto idle_irq;
   }

   /* check altstatus */
   status = ata_altstatus(ap);
   if (status  ATA_BUSY)
   {
   /* printk(KERN_INFO ata%u: altstatus %d\n,  */
/* ap-id, status); */
   ap-stats.idle_irq_altstatus++;
   goto idle_irq;
   }

   /* check main status, clearing INTRQ */
   status = ata_chk_status(ap);
   if (unlikely(status  ATA_BUSY))
   {
   /* printk(KERN_INFO ata%u: status %d\n,  */
/* ap-id, status); */
   ap-stats.idle_irq_status++;
   goto idle_irq;
   }

   /* ack bmdma irq events */
   ap-ops-irq_clear(ap);

   ata_hsm_move(ap, qc, status, 0);
   return 1;   /* irq handled */

idle_irq:
   ap-stats.idle_irq++;

#ifdef ATA_IRQ_TRAP
   if ((ap-stats.idle_irq % 1000) == 0) {
 /* ata_irq_ack(ap, 0); /\* debug trap *\/ */
 ata_port_printk(ap, KERN_WARNING, irq trap: %lu - %lu %lu
%lu %lu %lu\n,
 ap-stats.idle_irq,
 ap-stats.idle_irq_non_atapi,
 ap-stats.idle_irq_hsm_state,
 ap-stats.idle_irq_host_state,
 ap-stats.idle_irq_altstatus,
 ap-stats.idle_irq_status);
 return 0;   //