2.6.24 found 5 exceptions in dmesg after not using the system for a while
I hadn't used the system for a day or two, and when I came back I noticed 5 exceptions in dmesg. They are all the same:

ata1: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen
ata1: irq_stat 0x0040, PHY RDY changed
ata1: SError: { Persist PHYRdyChg 10B8B }
ata1: hard resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Any idea what caused these? This is on an ATI SB600 SATA controller:

00:12.0 SATA controller [0106]: ATI Technologies Inc SB600 Non-Raid-5 SATA [1002:4380] (prog-if 01 [AHCI 1.0])
        Subsystem: Albatron Corp. Unknown device [17f2:5999]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR-

From boot time:

ata1: SATA max UDMA/133 abar [EMAIL PROTECTED] port 0xfe02f100 irq 18
ata2: SATA max UDMA/133 abar [EMAIL PROTECTED] port 0xfe02f180 irq 18
ata3: SATA max UDMA/133 abar [EMAIL PROTECTED] port 0xfe02f200 irq 18
ata4: SATA max UDMA/133 abar [EMAIL PROTECTED] port 0xfe02f280 irq 18
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: ATA-7: HDT722525DLA380, V44OA96A, max UDMA/133
ata1.00: 488397168 sectors, multi 16: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133
ata2: SATA link down (SStatus 0 SControl 300)
ata3: SATA link down (SStatus 0 SControl 300)
ata4: SATA link down (SStatus 0 SControl 300)
scsi 0:0:0:0: Direct-Access ATA HDT722525DLA380 V44O PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sda: sda1
sd 0:0:0:0: [sda] Attached SCSI disk

Thanks,
-Andrew

- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
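As a reading aid for the register values in this report, here is a minimal Python sketch that splits an SStatus value into its DET/SPD/IPM fields and names the SError bits seen in the exception. The field layout follows the SATA SStatus register definition; the SError table deliberately lists only the three bits that the kernel itself decoded in the log above ("Persist PHYRdyChg 10B8B"), so it is not a complete decoder. The function names are made up for illustration.

```python
# SStatus field meanings (SATA SStatus / AHCI PxSSTS layout).
DET = {
    0: "no device detected",
    1: "device detected, PHY communication not established",
    3: "device detected, PHY communication established",
    4: "PHY offline",
}
SPD = {0: "no negotiated speed", 1: "Gen1 (1.5 Gbps)", 2: "Gen2 (3.0 Gbps)"}
IPM = {0: "not present", 1: "active", 2: "partial", 6: "slumber"}

# Only the SError bits that appear in the log above; other bits are omitted.
SERROR_BITS = {9: "Persist", 16: "PHYRdyChg", 19: "10B8B"}


def decode_sstatus(value):
    """Split an SStatus value into its DET, SPD and IPM fields."""
    return {
        "det": value & 0xF,
        "spd": (value >> 4) & 0xF,
        "ipm": (value >> 8) & 0xF,
    }


def describe_sstatus(value):
    """Return a human-readable summary of an SStatus value."""
    f = decode_sstatus(value)
    return "; ".join([
        DET.get(f["det"], "reserved"),
        SPD.get(f["spd"], "reserved"),
        IPM.get(f["ipm"], "reserved"),
    ])


def decode_serror(value):
    """Name the known SError bits that are set, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items())
            if value & (1 << bit)]


if __name__ == "__main__":
    # libata prints SStatus in hex, so "SStatus 113" is 0x113.
    print(describe_sstatus(0x113))
    print(decode_serror(0x90200))
```

Read this way, "SStatus 113" after the hard reset means the link came back up at Gen1 speed with the interface active, and 0x90200 is exactly the three flags the kernel printed. The sketch says nothing about why the PHY-ready state changed in the first place.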
Hitachi 7K1000 1tb drives and sata_sil 3114 chipset
I've been encountering many, many problems with Hitachi 1TB drives under a Sil3114 chipset and I'm wondering if there could be something wrong with the driver/chipset in relation to these drives. Statistically, I've had 6 out of 7 drives exhibit very strange failure conditions while being used under this controller. Some symptoms:

- Clicking noises while in operation
- What sounds/feels like the drive spinning down quickly and back up again, with no console output
- SMART reporting seek read errors which mysteriously appear/disappear completely
- Failed I/O requests

I most recently swapped out the 1TB drives for 500GB Hitachi models and have not experienced any of the problems above.

The most recent failed I/O requests output lots of messages, which I've pasted below. I triggered the I/O errors by setting up lots of simultaneous copies of large files between two drives to test the configuration.

If anyone has any ideas whether this could be some kind of incompatibility or bug, let me know. If anyone has any positive/negative experiences with these drives on this controller, it would also help.

Thanks,
-Andrew

sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 1559281795
sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 1563089767
sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 1563115585
sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdb, sector 7775
printk: 124 messages suppressed.
Buffer I/O error on device sdb1, logical block 3856
Buffer I/O error on device sdb1, logical block 3857
Buffer I/O error on device sdb1, logical block 3858
Buffer I/O error on device sdb1, logical block 3859
EXT3-fs error (device sdb1): ext3_readdir: directory #2 contains a hole at offset 0

# lspci -vvnnxxx -s 00:11.0
00:11.0 RAID bus controller [0104]: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller [1095:3114] (rev 02)
        Subsystem: Silicon Image, Inc. SiI 3114 SATARaid Controller [1095:6114]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR-
Re: 2.6.24 sata_sil Sil3114 drive clicking / restarting?
Both drives already had PM disabled, visible in hdparm -i: "AdvancedPM=yes: disabled (255) WriteCache=enabled"

Looking at the SMART reporting, it is showing both drives have a FAILING_NOW condition for Seek_Error_Rate. I don't know what to believe, because it seems like whatever drives I attach to this system are chewed up and start showing Seek_Error_Rate failure conditions.

/dev/sda:
  7 Seek_Error_Rate 0x000b 046 046 067 Pre-fail Always FAILING_NOW 393853
/dev/sdb:
  7 Seek_Error_Rate 0x000b 044 044 067 Pre-fail Always FAILING_NOW 2556544

I swapped in 2 more drives of the same model, and one exhibits the same Seek_Error_Rate FAILING_NOW condition. I now have 4 out of 5 of this same model drive which are failing. They appear to be from the same batch, so I'm not ruling out some kind of manufacturing defect, but this definitely seems strange. I guess I'm just fishing to see if there is anything on the system that could have damaged the drives.

Thanks,
-Andrew

On Jan 27, 2008 1:33 PM, Jim Paris <[EMAIL PROTECTED]> wrote:
> Andrew Paprocki wrote:
> > I've been noticing something strange on an AMD Geode LX board that I
> > have.. I have two SATA drives connected to the onboard Sil3114 chip,
> > and the drives appear to be continually restarting (soft resetting?)
> > during normal operation when nothing at all is happening on the
> > machine. You can hear the drives doing it as well as feel it
> > physically if you touch the drive. They are spinning down and back up
> > again over and over again. All the while the OS never prints out any
> > ata/scsi problems. The only manifestation of this in the kernel is
> > that if you're doing something w/ the drives, it pauses momentarily
> > while this happens (for instance, during an ext3 format).
>
> It could be drive power management. Try "hdparm -B 255" or "hdparm -B
> 254" to turn that off. The output of "smartctl -A" output can also be
> helpful to figure out what's causing it.
>
> -jim
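For readers unfamiliar with the smartctl attribute table, the FAILING_NOW flag in the message above follows from comparing the normalized VALUE column against THRESH. The sketch below parses one attribute row, assuming the usual "smartctl -A" column order (ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE) with whitespace between every column. The names parse_attr and is_failing are hypothetical helpers, not part of smartmontools.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int       # normalized current value
    worst: int       # lowest normalized value recorded
    thresh: int      # vendor failure threshold
    when_failed: str
    raw: str


def parse_attr(row):
    """Parse one whitespace-separated row of the smartctl -A table.

    Assumed columns:
    ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    f = row.split()
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]),
                     f[8], f[9])


def is_failing(attr):
    """An attribute has failed when VALUE is at or below THRESH."""
    return attr.value <= attr.thresh


if __name__ == "__main__":
    sda = parse_attr(
        "7 Seek_Error_Rate 0x000b 046 046 067 Pre-fail Always FAILING_NOW 393853")
    print(sda.name, sda.value, sda.thresh, is_failing(sda))
```

For both drives in the report the normalized value (46 and 44) is below the threshold of 67, which is why smartctl flags them. The raw values are vendor-specific and cannot be compared across drive models in this way.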
2.6.24 sata_sil Sil3114 drive clicking / restarting?
I've been noticing something strange on an AMD Geode LX board that I have.. I have two SATA drives connected to the onboard Sil3114 chip, and the drives appear to be continually restarting (soft resetting?) during normal operation when nothing at all is happening on the machine. You can hear the drives doing it as well as feel it physically if you touch the drive. They are spinning down and back up again over and over again. All the while the OS never prints out any ata/scsi problems. The only manifestation of this in the kernel is that if you're doing something w/ the drives, it pauses momentarily while this happens (for instance, during an ext3 format). I thought this might be a bad drive because smartctl listed some errors, but I have a stack of drives here and after swapping out the drive doing this, the replacement is doing it as well. These drives are all new 1TB Hitachi drives less than 6 months old. Now, I'm wondering if this is some Sil3114 problem w/ libata. Has anyone else seen this type of behavior before with no errors showing up in the console? Thanks, -Andrew Some info (unknown partition tables are because these are an md RAID1 pair): Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2 ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx Probing IDE interface ide0... hda: , ATA DISK drive Probing IDE interface ide1... 
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14 hda: max request size: 128KiB hda: 256000 sectors (131 MB) w/0KiB Cache, CHS=500/16/32 hda: hda1 Driver 'sd' needs updating - please use bus_type methods sata_sil :00:11.0: version 2.3 ACPI: PCI Interrupt Link [LNKD] BIOS reported IRQ 0, using IRQ 10 ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 10 ACPI: PCI Interrupt :00:11.0[A] -> Link [LNKD] -> GSI 10 (level, low) -> IRQ 10 sata_sil :00:11.0: Applying R_ERR on DMA activate FIS errata fix PCI: Setting latency timer of device :00:11.0 to 64 scsi0 : sata_sil scsi1 : sata_sil scsi2 : sata_sil scsi3 : sata_sil ata1: SATA max UDMA/100 mmio [EMAIL PROTECTED] tf 0xefffb080 irq 10 ata2: SATA max UDMA/100 mmio [EMAIL PROTECTED] tf 0xefffb0c0 irq 10 ata3: SATA max UDMA/100 mmio [EMAIL PROTECTED] tf 0xefffb280 irq 10 ata4: SATA max UDMA/100 mmio [EMAIL PROTECTED] tf 0xefffb2c0 irq 10 ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310) ata1.00: ATA-7: Hitachi HDS721010KLA330, GKAOA70F, max UDMA/133 ata1.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 0/32) ata1.00: configured for UDMA/100 ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) ata2.00: ATA-7: Hitachi HDS721010KLA330, GKAOA70F, max UDMA/133 ata2.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 0/32) ata2.00: configured for UDMA/100 ata3: SATA link down (SStatus 0 SControl 310) ata4: SATA link down (SStatus 0 SControl 310) scsi 0:0:0:0: Direct-Access ATA Hitachi HDS72101 GKAO PQ: 0 ANSI: 5 sd 0:0:0:0: [sda] 1953525168 512-byte hardware sectors (1000205 MB) sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sd 0:0:0:0: [sda] 1953525168 512-byte hardware sectors (1000205 MB) sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sda: unknown partition table sd 0:0:0:0: [sda] Attached SCSI 
disk scsi 1:0:0:0: Direct-Access ATA Hitachi HDS72101 GKAO PQ: 0 ANSI: 5 sd 1:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB) sd 1:0:0:0: [sdb] Write Protect is off sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00 sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sd 1:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB) sd 1:0:0:0: [sdb] Write Protect is off sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00 sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sdb: unknown partition table sd 1:0:0:0: [sdb] Attached SCSI disk # hdparm -i /dev/sda /dev/sda: hdparm: ioctl 0x304 failed: Inappropriate ioctl for device Model=Hitachi HDS721010KLA330 , FwRev=GKAOA70F, SerialNo= GTJ000PAG2L50C Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=52 BuffType=(3) DualPortCache, BuffSize=31157kB, MaxMultSect=16, MultSect=?16? CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=268435455 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 AdvancedPM=yes: disabled (255) WriteCache=enabled Drive conforms to: ATA/ATAPI-7 T13 1532D rev.1: ATA/ATAPI-2 ATA/ATAPI-3 ATA/ATAPI-4 ATA/ATAPI-5 ATA/ATAPI-6 ATA/ATAPI-7 * current active mode # smartctl -a /dev/sda ... === START OF INFORMATION SECTION === Device Model: Hitachi HDS721010KLA330 Serial Number:GTJ000PAG2L50C Firmware Version:
Re: About forcing 32bit DMA patch for AMD690G(SB600)
I'll try to get that configuration together. Right now I only have two 1GB sticks installed on the board, so I would need to track down 2GB ones. If I can find some lying around, I'll let you know.

Thanks,
-Andrew

On Jan 25, 2008 12:50 AM, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Andrew Paprocki wrote:
> > I have an SB600/RS690 here with SATA drives connected. I haven't been
> > following this thread, but I can help test something if it would help.
>
> We're trying to determine whether SB600 ahci controller can do 64bit DMA
> or not. Srihari's couldn't but Shane's test result tells a different
> story. Do you have memory mapped over 4G (if you have 4G some of them
> will be over 4G, you can know this by looking at the e820 map printed
> during boot)?
>
> --
> tejun
Re: About forcing 32bit DMA patch for AMD690G(SB600)
Tejun,

I have an SB600/RS690 here with SATA drives connected. I haven't been following this thread, but I can help test something if it would help.

Thanks,
-Andrew

On Jan 24, 2008 7:21 PM, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Hello, Shane. Sorry about the delay. Got caught up in other stuff.
>
> Shane Huang wrote:
> > Quoting Tejun:
> >> Uh-oh, wait a bit. Nope. Until we figure out what the something else is
> >> and positively verify 64bit DMA works fine, the quirk stays in.
> >
> > Our HW engineer has confirmed that our SB600 SATA controller indeed
> > has some MSI issue, and we do not have any workaround.
> >
> > The workaround "quirk_msi_intx_disable_bug" to SB700 SATA controller
> > can NOT work to SB600 SATA controller in my debug, while disablement
> > to RS690 MSI in kernel source can fix it.
>
> Hmmm... Okay. Is the SB600 SATA controller culprit or the north bridge
> - RS690? If the former is the case, proper way to work around it is to
> add AHCI_HFLAG_NO_MSI for SB600 AHCI.
>
> > As to the SB600 64 bit DMA capacity, do you have any methods to do
> > further verification? I do NOT find any problem in my debug after I
> > disabled RS690 MSI in kernel 2.6.24-rc7.
>
> The problem is that we didn't actually prove anything. In the tests
> you've done, pci=nomsi didn't fix the problem but disable_all_msi quirk
> did. pci=nomsi and disable_all_msi quirk are identical. Also,
> Srihari's problem was not reproduced, so currently we can't say much
> from the test results. Srihari, do you still have the system around?
>
> Thanks.
>
> --
> tejun
Re: [PATCH] pata_cs5536: ATA driver for Geode companion chip
Long story short -- this has nothing to do with pata_cs5536. It isn't in that list! I just patched my kernel to print the reason why it is being blacklisted and this turned up:

ata5.00: ATA-0: , 060729DA, max MWDMA2
ata5.00: 256000 sectors, multi 1: LBA
ata5.00: device is on DMA blacklist, disabling DMA
ata5.00: device matched DMA blacklist, model: WDC AC11000H
ata5.00: configured for PIO4

The strn_pattern_cmp function does not handle blank model names. I would like to give this Taiwanese manufacturer a bug for not bothering to put a model name on their device, but I don't think they'll care too much.. :)

I just submitted a patch to fix strn_pattern_cmp to handle the strlen(name)==0 case appropriately (i.e. only match it against "*" or ""). With that change, it is properly detected as MWDMA2 again:

ata5.00: ATA-0: , 060729DA, max MWDMA2
ata5.00: 256000 sectors, multi 1: LBA
ata5.00: configured for MWDMA2
ata5.00: configured for MWDMA2

Thanks,
-Andrew

On 10/14/07, Alan Cox <[EMAIL PROTECTED]> wrote:
> On Sun, 14 Oct 2007 15:42:19 -0400
> "Andrew Paprocki" <[EMAIL PROTECTED]> wrote:
>
> > Just noticed something.. I'm not sure if this is due to a libata-dev
> > change or me switching to pata_cs5536, but my 128MB DOM on the PATA
> > port is hitting the ata_dma_blacklisted() case and it was not
> > previously. This did not happen under 2.6.22.6 using pata_amd. The
> > system is noticeably slower when forced to use PIO4 (as you would
> > expect).
> >
> > Is this expected in the newer code, or is it a bug?
>
> Sounds like someone added it wrongly to the blacklist. Remove the
> blacklist entry, test again and if DMA is working we need to get that
> fixed ASAP in .2 and 2.6.24.
>
> Alan
[PATCH] libata: prevent devices with blank model names from being DMA blacklisted
The strn_pattern_cmp routine does not handle a blank name parameter properly. The only patterns which should match a blank name are "*" and an explicit "". If the function is passed a blank name in current code, it will always match against the patt parameter. The bug manifests itself as the device with the empty model name always matching the first device in the DMA blacklist, forcing it to revert to PIO mode.

Signed-off-by: Andrew Paprocki <[EMAIL PROTECTED]>
---
 drivers/ata/libata-core.c |   13 ++++++++++++-
 1 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 4e11e39..e73b7b4 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4013,8 +4013,19 @@ int strn_pattern_cmp(const char *patt, const char *name, int wildchar)
 	p = strchr(patt, wildchar);
 	if (p && ((*(p + 1)) == 0))
 		len = p - patt;
-	else
+	else {
 		len = strlen(name);
+		/* If the model name parameter is empty, it should not match
+		 * against anything other than "*" or "".
+		 */
+		if (unlikely(len == 0)) {
+			/* In the rare case your pattern is "". */
+			if (strlen(patt) == 0)
+				return 0;
+			else
+				return -1;
+		}
+	}
 
 	return strncmp(patt, name, len);
 }
-- 
1.5.3.4.g58ba4
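To make the failure mode concrete, here is a small Python model of the matching logic described in the patch above. It is a sketch, not the kernel code: strncmp is approximated by comparing string prefixes, the return value is only 0 (match) or non-zero (no match), and the fixed flag is a hypothetical switch added here so the old and new behaviour can be compared side by side.

```python
def strn_pattern_cmp(patt, name, wildchar="*", fixed=True):
    """Model of libata's strn_pattern_cmp; returns 0 on match.

    fixed=False mimics the behaviour before the patch, fixed=True the
    behaviour after it. The flag is an illustration aid only.
    """
    idx = patt.find(wildchar)
    if idx != -1 and idx == len(patt) - 1:
        # Trailing wildcard: compare only the part before it.
        n = idx
    else:
        n = len(name)
        if fixed and n == 0:
            # An empty name may only match an explicitly empty pattern.
            return 0 if patt == "" else -1
    # Stand-in for strncmp(patt, name, n): equal prefixes give 0.
    return 0 if patt[:n] == name[:n] else 1


if __name__ == "__main__":
    # Before the fix an empty model name compares zero characters
    # against the blacklist entry, so the comparison reports a match.
    print(strn_pattern_cmp("WDC AC11000H", "", fixed=False))
    print(strn_pattern_cmp("WDC AC11000H", "", fixed=True))
```

With an empty name the compared length is zero, and a zero-length comparison always reports equality. That explains why the blank-model device matched "WDC AC11000H", the entry reported in the debugging output earlier in this thread.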
Re: [PATCH] pata_cs5536: ATA driver for Geode companion chip
Just noticed something.. I'm not sure if this is due to a libata-dev change or me switching to pata_cs5536, but my 128MB DOM on the PATA port is hitting the ata_dma_blacklisted() case and it was not previously. This did not happen under 2.6.22.6 using pata_amd. The system is noticeably slower when forced to use PIO4 (as you would expect).

Is this expected in the newer code, or is it a bug?

Previous 2.6.22.6 kernel using pata_amd:

pata_amd :00:0f.2: version 0.3.8
scsi4 : pata_amd
scsi5 : pata_amd
ata5: PATA max UDMA/100 cmd 0x000101f0 ctl 0x000103f6 bmdma 0x0001ff00 irq 14
ata6: PATA max UDMA/100 cmd 0x00010170 ctl 0x00010376 bmdma 0x0001ff08 irq 15
ata5.00: ATA-0: , 060729DA, max MWDMA2
ata5.00: 256000 sectors, multi 1: LBA
ata5.00: configured for MWDMA2
ata6: port disabled. ignoring.

Up-to-date libata-dev using pata_cs5536:

scsi4 : pata_cs5536
scsi5 : pata_cs5536
ata5: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xff00 irq 14
ata6: DUMMY
ata5: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xff00 irq 14
ata5.00: ATA-0: , 060729DA, max MWDMA2
ata5.00: 256000 sectors, multi 1: LBA
ata5.00: device is on DMA blacklist, disabling DMA
ata5.00: configured for PIO4
ata5.00: device is on DMA blacklist, disabling DMA
ata5.00: configured for PIO4
ata5: EH complete

Thanks,
-Andrew

On 10/14/07, Andrew Paprocki <[EMAIL PROTECTED]> wrote:
> On 10/11/07, Alan Cox <[EMAIL PROTECTED]> wrote:
> > On Thu, 11 Oct 2007 03:38:19 -0400
> > "Martin K. Petersen" <[EMAIL PROTECTED]> wrote:
> >
> > >
> > > This is a driver for the ATA controller on the Geode CS5536 companion
> > > chip. The PCI device ID for this device was previously claimed by
> > > pata_amd.c but the PIO timings were not correct. This driver also
> > > works around a bug in some BIOSes that handle unaligned access to the
> > > PCI config registers poorly. Finally, the driver allows fallback to
> > > using MSR registers for configuration on BIOSes that are truly
> > > broken.
> > >
> > > Signed-off-by: Martin K. Petersen <[EMAIL PROTECTED]>
> >
> > Acked-by: Alan Cox <[EMAIL PROTECTED]>
>
> I've been using the driver (boot drive on the port) since Martin's
> post and haven't experienced any problems.
>
> Tested-by: Andrew Paprocki <[EMAIL PROTECTED]>
Re: [PATCH] pata_cs5536: ATA driver for Geode companion chip
On 10/11/07, Alan Cox <[EMAIL PROTECTED]> wrote:
> On Thu, 11 Oct 2007 03:38:19 -0400
> "Martin K. Petersen" <[EMAIL PROTECTED]> wrote:
>
> >
> > This is a driver for the ATA controller on the Geode CS5536 companion
> > chip. The PCI device ID for this device was previously claimed by
> > pata_amd.c but the PIO timings were not correct. This driver also
> > works around a bug in some BIOSes that handle unaligned access to the
> > PCI config registers poorly. Finally, the driver allows fallback to
> > using MSR registers for configuration on BIOSes that are truly
> > broken.
> >
> > Signed-off-by: Martin K. Petersen <[EMAIL PROTECTED]>
>
> Acked-by: Alan Cox <[EMAIL PROTECTED]>

I've been using the driver (boot drive on the port) since Martin's post and haven't experienced any problems.

Tested-by: Andrew Paprocki <[EMAIL PROTECTED]>
Re: [PATCH #upstream 2/2] libata: track SLEEP state and issue SRST to wake it up
Tejun,

This patch applied on top of your set works for me. It clears the error mask and completes any ATA_CMD_SLEEP when the drive is already sleeping. I tried `hdparm -Y` twice and it didn't loop like before.

Thanks,
-Andrew

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 45b781b..7e0627f 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -5763,6 +5763,16 @@ void ata_qc_issue(struct ata_queued_cmd *qc)
 
 	/* if device is sleeping, schedule softreset and abort the link */
 	if (unlikely(qc->dev->flags & ATA_DFLAG_SLEEPING)) {
+		if (unlikely(qc->tf.command == ATA_CMD_SLEEP)) {
+			/* to prevent a loop, do not wake up if sleeping
+			 * and a sleep cmd is sent. instead, simply clear
+			 * the error mask and complete as if it was
+			 * successful.
+			 */
+			qc->err_mask = 0;
+			ata_qc_complete(qc);
+			return;
+		}
 		link->eh_info.action |= ATA_EH_SOFTRESET;
 		ata_ehi_push_desc(&link->eh_info, "waking up from sleep");
 		ata_link_abort(link);

On 10/13/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Jeff, please forget about this patchset. I'll re-post updated version.
Re: [PATCH #upstream 2/2] libata: track SLEEP state and issue SRST to wake it up
Tejun, I'm able to break my system using this patch. I had a hunch this might be possible.. :) In short, if you issue a sleep command while the drive is already sleeping, it puts libata into an infinite loop resetting the port. I've illustrated the working test and the evil hunch below. The sleep command itself will need a short-circuit out of this logic in order to prevent this loop. Also, in the working case below the hddtemp command actually blocked until the drive was spun up before returning a valid temp. While testing, I was able to get hddtemp to trigger the drive wake-up when it was sleeping, but hddtemp then returned stating the drive was sleeping. Re-running hddtemp until the drive was fully spun up (another 5 seconds) kept returning that it was sleeping. I'll see if I can reproduce this reliably. Am I correct in assuming the process which triggers the wake-up should block? -Andrew Working case: # hddtemp /dev/sdb /dev/sdb: Hitachi HDS721010KLA330 : 35 C # hdparm -Y /dev/sdb /dev/sdb: issuing sleep command # time hddtemp /dev/sdb ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 ata2.00: waking up from sleep ata2: soft resetting link ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) ata2.00: configured for UDMA/100 ata2: EH complete sd 1:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB) sd 1:0:0:0: [sdb] Write Protect is off sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00 sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA /dev/sdb: Hitachi HDS721010KLA330 : 34 C real0m 10.89s user0m 0.00s sys 0m 0.00s # time hddtemp /dev/sdb /dev/sdb: Hitachi HDS721010KLA330 : 34 C real0m 0.26s user0m 0.00s sys 0m 0.00s Evil DoS case: # hddtemp /dev/sdb /dev/sdb: Hitachi HDS721010KLA330 : 35 C # hdparm -Y /dev/sdb /dev/sdb: issuing sleep command # hdparm -Y /dev/sdb ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 ata2.00: waking up from sleep ata2: soft resetting link ata2: SATA link up 1.5 Gbps (SStatus 
113 SControl 310) ata2.00: configured for UDMA/100 ata2: EH complete ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 ata2.00: waking up from sleep ata2: soft resetting link ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) ata2.00: configured for UDMA/100 ata2: EH complete to infinity On 10/12/07, Tejun Heo <[EMAIL PROTECTED]> wrote: > ATA devices in SLEEP mode don't respond to any commands. SRST is > necessary to wake it up. Till now, when a command is issued to a > device in SLEEP mode, the command times out, which makes EH reset the > device and retry the command after that, causing a long delay. > > This patch makes libata track SLEEP state and issue SRST automatically > if a command is about to be issued to a device in SLEEP. > > Signed-off-by: Tejun Heo <[EMAIL PROTECTED]> > Cc: Bruce Allen <[EMAIL PROTECTED]> > Cc: Andrew Paprocki <[EMAIL PROTECTED]> > --- > drivers/ata/libata-core.c | 12 > drivers/ata/libata-eh.c |4 +++- > include/linux/ata.h |1 + > include/linux/libata.h|1 + > 4 files changed, 17 insertions(+), 1 deletion(-) > > Index: work/include/linux/ata.h > === > --- work.orig/include/linux/ata.h > +++ work/include/linux/ata.h > @@ -179,6 +179,7 @@ enum { > ATA_CMD_VERIFY = 0x40, > ATA_CMD_VERIFY_EXT = 0x42, > ATA_CMD_STANDBYNOW1 = 0xE0, > + ATA_CMD_SLEEP = 0xE6, > ATA_CMD_IDLEIMMEDIATE = 0xE1, > ATA_CMD_INIT_DEV_PARAMS = 0x91, > ATA_CMD_READ_NATIVE_MAX = 0xF8, > Index: work/include/linux/libata.h > === > --- work.orig/include/linux/libata.h > +++ work/include/linux/libata.h > @@ -145,6 +145,7 @@ enum { > ATA_DFLAG_PIO = (1 << 12), /* device limited to PIO mode */ > ATA_DFLAG_NCQ_OFF = (1 << 13), /* device limited to non-NCQ > mode */ > ATA_DFLAG_SPUNDOWN = (1 << 14), /* XXX: for spindown_compat */ > + ATA_DFLAG_SLEEPING = (1 << 15), /* device is sleeping */ > ATA_DFLAG_INIT_MASK = (1 << 16) - 1, > > ATA_DFLAG_DETACH= (1 << 16), > Index: work/drivers/ata/libata-core.c > === > --- work.orig/drivers/ata/libata-core.c > +++ 
work/drivers/ata/libata-core.c > @@ -5553,6 +5553,10 @@ void __ata_qc_complete(struct ata_queued > case ATA_CMD_SET_MULTI: /* multi_count changed */ > eh_action |= ATA_EH_REVALIDATE; > break; > +
Re: pata_cs5536: ATA driver for Geode companion chip
Martin,

Just wanted to report that the 2.6.23 libata-dev with the MSR pata_cs5536 applied to it is working fine with my LX board. I am using the first PATA port as my boot/root drive right now.

Thanks,
-Andrew

scsi4 : pata_cs5536
scsi5 : pata_cs5536
ata5: PATA max UDMA/100 cmd 0x000101f0 ctl 0x000103f6 bmdma 0x0001ff00 irq 14
ata6: DUMMY
ata5.00: ATA-0: , 060729DA, max MWDMA2
ata5.00: 256000 sectors, multi 1: LBA
ata5.00: configured for MWDMA2
scsi 4:0:0:0: Direct-Access ATA 0607 PQ: 0 ANSI: 5
sd 4:0:0:0: [sdc] 256000 512-byte hardware sectors (131 MB)
sd 4:0:0:0: [sdc] Write Protect is off
sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 4:0:0:0: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Re: smartd causing SATA timeouts on sleeping drives
Bruce/Tejun,

Just so you both know, even when specifying '-n standby,q' in smartd, it still triggers timeouts on my system. The timeouts are no longer coming from the default half-hour checks, but from my configured self-test times with the '-s' option. It appears smartd overrides the '-n' parameter in this case, triggering the libata soft reset. This is another case that would be fixed if libata does the SRST automatically.

Thanks,
-Andrew

Oct 11 02:16:52 (none) daemon.info smartd[23848]: Device: /dev/sdb, STANDBY mode ignored due to scheduled self test (47 checks skipped)
Oct 11 02:17:03 (none) user.err kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Oct 11 02:17:03 (none) user.err kernel: ata2.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Oct 11 02:17:03 (none) user.warn kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 11 02:17:08 (none) user.warn kernel: ata2: port is slow to respond, please be patient (Status 0xd0)
Oct 11 02:17:10 (none) user.info kernel: ata2: soft resetting port
Oct 11 02:17:10 (none) user.info kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 11 02:17:10 (none) user.info kernel: ata2.00: configured for UDMA/100
Oct 11 02:17:10 (none) user.info kernel: ata2: EH complete

On 10/10/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Maybe what should be done is to track sleep mode in libata and issue
> SRST automatically if a command is issued to a sleeping drive. I'll
> work on it.
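The "cmd b0/da:..." line in the log above identifies which command timed out. The following Python sketch pulls the first taskfile fields out of such a line, assuming the libata error-handler layout command/feature:nsect:lbal:lbam:lbah/... for the first two groups. The SMART feature table is a partial list of ATA SMART subcommands, and the helper names are made up for illustration.

```python
import re

# First two groups of a libata "cmd" line:
# command/feature:nsect:lbal:lbam:lbah/...
TF_RE = re.compile(
    r"cmd ([0-9a-f]{2})/([0-9a-f]{2}):([0-9a-f]{2}):([0-9a-f]{2}):"
    r"([0-9a-f]{2}):([0-9a-f]{2})/")

ATA_CMD_SMART = 0xB0

# Partial list of SMART subcommands selected by the feature register.
SMART_FEATURES = {
    0xD0: "READ DATA",
    0xD4: "EXECUTE OFF-LINE IMMEDIATE",
    0xD8: "ENABLE OPERATIONS",
    0xD9: "DISABLE OPERATIONS",
    0xDA: "RETURN STATUS",
}


def parse_cmd(line):
    """Return the taskfile fields of a libata 'cmd' log line, or None."""
    m = TF_RE.search(line)
    if m is None:
        return None
    keys = ("command", "feature", "nsect", "lbal", "lbam", "lbah")
    return dict(zip(keys, (int(g, 16) for g in m.groups())))


def describe(tf):
    """Give a short name for the command held in a parsed taskfile."""
    if tf["command"] == ATA_CMD_SMART:
        name = SMART_FEATURES.get(tf["feature"])
        if name:
            return "SMART " + name
        return "SMART (feature 0x%02x)" % tf["feature"]
    return "command 0x%02x" % tf["command"]


if __name__ == "__main__":
    line = "ata2.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0"
    print(describe(parse_cmd(line)))
```

Decoded this way, the command that timed out is a SMART RETURN STATUS request (command 0xb0, feature 0xda, with the 0x4f/0xc2 SMART signature in the LBA mid/high registers). That is consistent with smartd being the process that poked the sleeping drive.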
Re: smartd causing SATA timeouts on sleeping drives
On 10/10/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Maybe what should be done is to track sleep mode in libata and issue
> SRST automatically if a command is issued to a sleeping drive. I'll
> work on it.

Another tidbit of info.. I just went through the pain of tracking down everything in my system (system apps as well as my own code) responsible for waking up sleeping drives. My end goal was to make sure sleeping drives stayed asleep to reduce power consumption and wear due to unnecessary spin-ups. I'm sure distros targeting laptops or embedded systems that use live disks go through this pain frequently.

Would all SRST cmds sent from libata come from the ata_std_softreset() call? Could something like SystemTap be used without modifying libata to track all pids which cause that function to be called? If that would work, it could be an easy way to do what I did manually. That is, unless someone knows of an easier way that I'm overlooking.. :) I might give that a try to see if it works well and document the result.

-Andrew
Permanent disk shutdown instead of soft/hard reset?
I'm currently running into a situation where I have 4 SATA drives in a striped array where one of the drives is failing (or has failed). The single drive failure manifests itself as ext3 errors and libata SCSI media errors which occur non-stop as software attempts to read/write to the mounted array. Because libata is seeing media errors, the bad drive endlessly soft resets while the software is still running and attempting to access the drive. This winds up hanging the entire system because the software (consider it a 'find' command running on the drive) runs in the init.d boot scripts. The end result is that a login prompt is never reached until the software finishes what it is doing and hours of soft resets have occurred.

Is there any way that this behavior can be stopped by permanently disconnecting the drive after a configurable number of errors that would otherwise soft reset? Does the layer allow for the concept of a full disk shutdown rather than a reset? I assume this would have to forcefully unmount any active mounts which use the drive/array to ensure that no subsequent cmds would cause libata to attempt to reconnect to the bad drive(s). Is this even possible?

Using smartd is invaluable for detecting failing drives, but when the failed drive prevents the system from booting, it is hard to recover remotely. It may not be possible to "recover" (e.g. if the failed drive is the boot drive), but that should be up to the system designer. In my case, I would still want to boot into the system (I do not boot from the array), establish network connectivity, and "phone home" that a permanent hardware failure has occurred in the array.

Thanks,
-Andrew
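The "configurable number of errors" idea above can at least be prototyped in user space. The Python sketch below counts "end_request: I/O error" lines per device, using the message format seen in kernel logs earlier in this archive, and reports devices that reach a limit. It is an assumption of this note, not something the thread proposes, that an administrator might then write to the SCSI sysfs delete attribute for such a device. Whether removing the device actually stops the reset loop or leaves mounted filesystems in a sane state is not established here, so the sketch only computes the path and never writes to it.

```python
import re
from collections import Counter

# Matches lines such as:
#   end_request: I/O error, dev sdb, sector 1559281795
IO_ERR = re.compile(r"end_request: I/O error, dev (\w+), sector \d+")


def devices_over_limit(log_lines, limit):
    """Return the sorted device names with at least `limit` I/O errors."""
    counts = Counter()
    for line in log_lines:
        m = IO_ERR.search(line)
        if m:
            counts[m.group(1)] += 1
    return sorted(dev for dev, n in counts.items() if n >= limit)


def sysfs_delete_path(dev):
    """Path of the SCSI 'delete' attribute for a block device.

    Writing "1" to it asks the kernel to remove the device. This helper
    only builds the path; acting on it is left to the caller.
    """
    return "/sys/block/%s/device/delete" % dev


if __name__ == "__main__":
    sample = [
        "end_request: I/O error, dev sdb, sector 1559281795",
        "end_request: I/O error, dev sdb, sector 1563089767",
        "Buffer I/O error on device sdb1, logical block 3856",
    ]
    for dev in devices_over_limit(sample, limit=2):
        print(dev, sysfs_delete_path(dev))
```

The limit and the choice of log pattern are illustrative. A real policy would also need to decide what happens to the array and its mounts once a member disappears.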
Re: pata_cs5536: ATA driver for Geode companion chip
On 10/10/07, Jordan Crouse <[EMAIL PROTECTED]> wrote:
> I never heard back from anybody - so either nobody is using pata_amd
> (which I suspect), or they didn't have any problems. Go ahead and merge,
> and I'll do a sanity check on a few boards next week just to make sure.

I am using the CS5536 currently with pata_amd and it worked without issue. I only used the PATA port for a short time, though, before switching to USB + SATA. I will be going back to using the PATA port this week, so I'll try out the new driver in place of pata_amd and write back if there are any problems.

Thanks,
-Andrew
Re: smartd causing SATA timeouts on sleeping drives
Yes, the drives were in sleep mode. That is the only case where these timeouts/resets occur. It seems like the "-n never" mode of smartd should send the SRST if the drive is truly sleeping, otherwise libata will soft reset the drive when it sees the timeout. The "-n standby" option sounds like a more sane default, but there might be legacy reasons why it isn't configured that way.

On 10/8/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Andrew Paprocki wrote:
> > I found out after posting that this is governed by the -n parameter to
> > smartd. The default behavior is "-n never" which means smartd will
> > send the cmds regardless of the drive status. The man page indicates
> > that may cause the drive to spin-up to answer the cmds. It appears for
> > some drives (?) the cmds just timeout and libata performs a soft
> > reset. I'm going to change my setup to "-n standby", but it seems
> > strange to me that "-n never" is the default if it has this drastic of
> > a result (at least under Linux). Is there any way to know if the drive
> > will actually spin up as a result of the cmd instead of timing out?
>
> If in standby mode, the drive would automatically spin up to process
> command. If in sleep mode, it needs SRST to spin back up. Was your
> drive in sleep mode?
Re: smartd causing SATA timeouts on sleeping drives
I found out after posting that this is governed by the -n parameter to smartd. The default behavior is "-n never", which means smartd will send the cmds regardless of the drive status. The man page indicates that this may cause the drive to spin up to answer the cmds. It appears that for some drives (?) the cmds just time out and libata performs a soft reset. I'm going to change my setup to "-n standby", but it seems strange to me that "-n never" is the default if it has this drastic a result (at least under Linux). Is there any way to know if the drive will actually spin up as a result of the cmd instead of timing out?

On 10/6/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> smartd should probably issue CHECK POWER MODE (0xe5) before issuing
> other commands. Bruce?
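To make the "-n" discussion concrete, here is a small Python sketch of the decision being described: map the value a drive returns for CHECK POWER MODE to a power state, then decide whether a given "-n" setting would skip the poll. The result codes (0x00, 0x80, 0xFF) are the ones I know from the ATA spec, and the skip rules follow the smartd.conf man page as I understand it. The function names are made up for illustration. Note that, per the thread, a drive in sleep mode needs an SRST before it answers anything, so the "sleep" state is something the caller has to learn by other means; that is an assumption of this sketch.

```python
# Sector-count values returned by ATA CHECK POWER MODE (command 0xE5).
POWER_MODES = {0x00: "standby", 0x80: "idle", 0xFF: "active/idle"}

# Power states in which each smartd "-n" setting skips its check
# (my reading of the smartd.conf man page, not verified against source).
SKIP_WHEN = {
    "never": set(),
    "sleep": {"sleep"},
    "standby": {"sleep", "standby"},
    "idle": {"sleep", "standby", "idle"},
}


def decode_power_mode(count):
    """Translate a CHECK POWER MODE result into a state name."""
    return POWER_MODES.get(count, "unknown")


def should_poll(policy, state):
    """Return True if smartd-style policy would still query the drive."""
    return state not in SKIP_WHEN[policy]


if __name__ == "__main__":
    state = decode_power_mode(0x00)
    print(state, should_poll("never", state), should_poll("standby", state))
```

With "-n never" the sketch always polls, which matches the half-hourly timeouts reported here; with "-n standby" a spun-down drive is left alone.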
smartd causing SATA timeouts on sleeping drives
Tejun/Bruce,

I tracked down the source of timeouts I have been frequently getting. It appears smartd is not properly handling drives that are spun down by the BIOS ACPI settings. I have SATA timeouts which occur every half hour (the default -i 1800 in smartd) that do not occur when smartd is not running. The drives smartd is configured to look at have a sleep time configured in the BIOS. When the drives are asleep, I get a soft reset every half hour as smartd attempts to access the drives. While in this state, smartd also reports bad state to syslog (e.g. temperature changes to 200C).

Just for comparison, hddtemp knows the drives are sleeping:

# hddtemp /dev/sda
/dev/sda: Hitachi HDS721010KLA330 : drive is sleeping
# ls /storage
... wakes up the drives ...
# hddtemp /dev/sda
/dev/sda: Hitachi HDS721010KLA330 : 29 C or F

I'm pasting the example cmd / timeout error / soft reset below. Also, I'm pasting the invalid settings which smartd detects when in this state. What needs to change for smartd to recognize drives are sleeping and either not perform its checks, or forcefully wake them up to perform them? (Should that be a configuration parameter in smartd?)

Thanks,
-Andrew

# uname -a
Linux (none) 2.6.22.6 #5 Mon Sep 10 02:15:22 EDT 2007 i586 unknown
(Using sata_sil on 3114 chips)

# smartctl -V
smartmontools release 5.38 dated 2006/12/20 at 20:37:59 UTC
...
smartctl compile dated Sep 17 2007 at 13:47:25
(repository code checked out on Sep 17th)

# cat /var/run/smartd.conf
/dev/sda -d ata -a -S on -s (S/../.././02|L/../../6/03)
/dev/sdb -d ata -a -S on -s (S/../.././02|L/../../6/03)

What happens every 30 minutes when drives are sleeping:

Oct 6 01:05:48 (none) user.err kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Oct 6 01:05:48 (none) user.err kernel: ata2.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
Oct 6 01:05:48 (none) user.warn kernel: res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 6 01:05:53 (none) user.warn kernel: ata2: port is slow to respond, please be patient (Status 0xd0)
Oct 6 01:05:55 (none) user.info kernel: ata2: soft resetting port
Oct 6 01:05:56 (none) user.info kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 6 01:05:56 (none) user.info kernel: ata2.00: configured for UDMA/100
Oct 6 01:05:56 (none) user.info kernel: ata2: EH complete
Oct 6 01:05:56 (none) user.notice kernel: sd 1:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
Oct 6 01:05:56 (none) user.notice kernel: sd 1:0:0:0: [sdb] Write Protect is off
Oct 6 01:05:56 (none) user.debug kernel: sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Oct 6 01:05:56 (none) user.notice kernel: sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Invalid attribute values:

Oct 2 22:35:21 (none) daemon.info smartd[585]: Device: /dev/sda, SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 87 to 86
Oct 2 23:35:21 (none) daemon.info smartd[585]: Device: /dev/sda, SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 86 to 85
Oct 5 20:05:56 (none) daemon.info smartd[585]: Device: /dev/sdb, SMART Prefailure Attribute: 3 Spin_Up_Time changed from 84 to 85
Oct 6 01:05:38 (none) daemon.info smartd[585]: Device: /dev/sda, SMART Usage Attribute: 194 Temperature_Celsius changed from 200 to 206
Oct 6 01:05:56 (none) daemon.info smartd[585]: Device: /dev/sdb, SMART Usage Attribute: 194 Temperature_Celsius changed from 193 to 200

Once the drives are started up, those values report:

  3 Spin_Up_Time            0x0007   085   085   024    Pre-fail  Always       -       821 (Average 820)
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0002   193   193   000    Old_age   Always       -       31 (Lifetime Min/Max 24/67)
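Since hddtemp already reports the sleeping state, one stopgap is to gate any SMART polling on its output. The Python sketch below parses lines shaped like the hddtemp output quoted above ("<device>: <model> : <status>") and lists the devices that are awake. The function names are hypothetical, and the only output format assumed is the one shown in this report; other hddtemp versions may print something different.

```python
def drive_is_sleeping(hddtemp_line):
    """True if one line of hddtemp output reports a sleeping drive.

    Assumes the "<device>: <model> : <status>" shape shown in the report.
    """
    status = hddtemp_line.rsplit(":", 1)[-1].strip()
    return status == "drive is sleeping"


def awake_devices(hddtemp_output):
    """Return device paths from hddtemp output that are not asleep."""
    awake = []
    for line in hddtemp_output.splitlines():
        if not line.strip():
            continue
        device = line.split(":", 1)[0].strip()
        if not drive_is_sleeping(line):
            awake.append(device)
    return awake


if __name__ == "__main__":
    sample = (
        "/dev/sda: Hitachi HDS721010KLA330 : drive is sleeping\n"
        "/dev/sdb: Hitachi HDS721010KLA330 : 29 C or F\n"
    )
    print(awake_devices(sample))
```

A wrapper script could run smartctl only against the devices this returns, leaving sleeping drives untouched.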
Chipset selection for NCQ md RAID
Given the current kernel (2.6.22.8), which SATA chipset would you feel most comfortable plugging in to achieve:

- 16 1TB Hitachi 7K1000 drives
- NCQ support
- Kernel md RAID (i.e. hardware RAID not necessary)
- Max transfer rate to OS (PCIe should handle 16 drives given their current transfer rate?)

(Assuming the required number of chips of any chipset could be made available on the bus.)

If support for SiI chips is the most robust, would multiple SiI3124/SiI3132 chips be most reliable? Or is support for something like a Marvell 88SX6081 good enough so that only 2 chips are needed? I have not owned any of these chips, and my impression is that the SiI chips have some of the most robust support. The comments at the top of sata_mv.c scare me, even if they might be out of date... :)

What about port multiplier support? Do they ever introduce stability problems with the drivers?

Is native libata support for high-density Areca cards planned? Does anyone know if the manufacturer driver is good enough to rely on for the above config to "just work"? I'd personally like to stay within libata. (Comments from any Areca Linux users with an ARC-1261ML? http://www.areca.com.tw/products/pcie341.htm)

Thanks in advance,
-Andrew
Re: 2.6.22.6 sata_sil device errors & timeouts
It appears to be the '-o on' causing the problem. If I remove that, the errors go away. The strange part is that according to the smartctl documentation, my drives support it: # smartctl -c /dev/sda smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-7 Bruce Allen Home page is http://smartmontools.sourceforge.net/ === START OF READ SMART DATA SECTION === General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: (4797) seconds. Offline data collection capabilities:(0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities:(0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability:(0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time:( 1) minutes. Extended self-test routine recommended polling time:( 80) minutes. Thanks, -Andrew On 9/18/07, Bruce Allen <[EMAIL PROTECTED]> wrote: > Does removing '-o on' and/or '-S on' eliminate the errors? 
>
> On Mon, 17 Sep 2007, Andrew Paprocki wrote:
>
> > Bruce,
> >
> > Just built it -- it eliminated the HSM violations, but I still get the
> > device errors:
> >
> > smartmontools release 5.38 dated 2006/12/20 at 20:37:59 UTC
> > (I see the above date, even though I verified it is built from CVS head)
> >
> > ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> > ata2.00: cmd b0/db:f8:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
> >          res 51/04:f8:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error)
> > ata2.00: configured for UDMA/100
> > ata2: EH complete
> >
> > This is what it is in smartd.conf:
> > /dev/sda -d ata -a -o on -S on
> > /dev/sdb -d ata -a -o on -S on
> > /dev/sdc -d ata -a -o on -S on
> >
> > Thanks, -Andrew
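The failing command in these traces can be read straight off the "cmd" line, which helps explain why '-o on' is the trigger. The Python sketch below assumes the libata layout command/feature:count:lbal:lbam:lbah for the first group of bytes, and uses the SMART subcommand codes from the ATA spec as I remember them (0xDB being enable/disable automatic off-line, which is what '-o on' asks for). The function and table names are made up for illustration.

```python
import re

SMART_COMMAND = 0xB0

# SMART subcommands selected by the feature register (ATA spec values).
SMART_FEATURES = {
    0xD0: "READ DATA",
    0xD1: "READ ATTRIBUTE THRESHOLDS",
    0xD2: "ENABLE/DISABLE ATTRIBUTE AUTOSAVE",
    0xD4: "EXECUTE OFF-LINE IMMEDIATE",
    0xD5: "READ LOG",
    0xD6: "WRITE LOG",
    0xD8: "ENABLE OPERATIONS",
    0xD9: "DISABLE OPERATIONS",
    0xDA: "RETURN STATUS",
    0xDB: "ENABLE/DISABLE AUTOMATIC OFF-LINE",
}

# First byte group of a libata "cmd" line:
# command/feature:count:lbal:lbam:lbah/
CMD_RE = re.compile(
    r"cmd ([0-9a-f]{2})/([0-9a-f]{2}):([0-9a-f]{2}):"
    r"([0-9a-f]{2}):([0-9a-f]{2}):([0-9a-f]{2})/"
)


def describe_cmd(line):
    """Name the ATA command on a libata error line, or None if absent."""
    match = CMD_RE.search(line)
    if not match:
        return None
    command, feature, _count, _lbal, lbam, lbah = (
        int(field, 16) for field in match.groups()
    )
    if command != SMART_COMMAND:
        return "non-SMART command 0x%02x" % command
    name = SMART_FEATURES.get(feature, "unknown subcommand 0x%02x" % feature)
    # SMART commands carry the signature 0x4F/0xC2 in LBA mid/high.
    signed = (lbam, lbah) == (0x4F, 0xC2)
    return "SMART %s%s" % (name, "" if signed else " (bad signature)")


if __name__ == "__main__":
    print(describe_cmd("ata2.00: cmd b0/db:f8:00:4f:c2/00:00:00:00:00/00 tag 0"))
```

Run on the trace above, this names the automatic off-line subcommand, which lines up with the errors disappearing once '-o on' is removed. The half-hourly timeout reported elsewhere on the list (cmd b0/da) is a different subcommand, RETURN STATUS.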
Re: 2.6.22.6 sata_sil device errors & timeouts
Bruce,

Just built it -- it eliminated the HSM violations, but I still get the device errors:

smartmontools release 5.38 dated 2006/12/20 at 20:37:59 UTC
(I see the above date, even though I verified it is built from CVS head)

ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: cmd b0/db:f8:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
         res 51/04:f8:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error)
ata2.00: configured for UDMA/100
ata2: EH complete

This is what it is in smartd.conf:

/dev/sda -d ata -a -o on -S on
/dev/sdb -d ata -a -o on -S on
/dev/sdc -d ata -a -o on -S on

Thanks,
-Andrew

On 9/17/07, Bruce Allen <[EMAIL PROTECTED]> wrote:
> Hi Andrew,
>
> Please build the CVS version (unreleased) of smartmontools. The versions
> below are dated 2006/12/20 and 2006/04/12. You need to build a code
> version based on the past few weeks of code.
>
> Cheers,
> Bruce
Re: 2.6.22.6 sata_sil device errors & timeouts
On 9/17/07, Andrew Paprocki <[EMAIL PROTECTED]> wrote:
> On 9/17/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> > Upgrading smartd should fix it. Which version are you using?
>
> smartmontools release 5.36 dated 2006/04/12 at 17:39:01 UTC
> smartmontools configure arguments: '--prefix=/opt/smartmontools'
>
> I see a newer experimental 5.37 is out. I'll give it a go and see if
> the trace goes away.

Upgrading made it worse. I now receive the same device errors as well as a slew of new "HSM violation" errors when smartd starts up:

smartmontools release 5.37 dated 2006/12/20 at 20:37:59 UTC
smartmontools configure arguments: '--prefix=/opt/smartmontools'

ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata5.00: cmd b0/db:f8:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 126976 in
         res 50/00:f8:00:4f:c2/00:00:00:00:00/a0 Emask 0x202 (HSM violation)
ata5: soft resetting port
ata5.00: configured for UDMA/100
ata5: EH complete

# smartctl -i /dev/sda
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Hitachi Deskstar T7K250 series
Device Model:     HDT722525DLA380
Serial Number:    VDK41GT5F3S4JK
Firmware Version: V44OA96A
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 1
Local Time is:    Mon Sep 17 15:25:29 2007 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Thanks,
-Andrew
Re: 2.6.22.6 sata_sil device errors & timeouts
On 9/17/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> [cc'ing Bruce Allen]
>
> Andrew Paprocki wrote:
> > ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> > ata2.00: cmd b0/db:f8:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
> >          res 51/04:f8:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error)
> > ata2.00: configured for UDMA/100
> > ata2: EH complete
>
> Upgrading smartd should fix it. Which version are you using?

smartmontools release 5.36 dated 2006/04/12 at 17:39:01 UTC
smartmontools configure arguments: '--prefix=/opt/smartmontools'

I see a newer experimental 5.37 is out. I'll give it a go and see if the trace goes away.

Thanks,
-Andrew
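For anyone reading these traces, the "res" line says how the drive answered. The Python sketch below assumes the first two bytes of "res" are the ATA status and error registers, and uses the bit names I know from libata and the ATA spec (DSC for status bit 4 is the older spec name). The helper names are hypothetical.

```python
import re

# ATA status register bits, high to low.
STATUS_BITS = [
    (0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"),
    (0x10, "DSC"), (0x08, "DRQ"), (0x01, "ERR"),
]

# ATA error register bits that libata names explicitly.
ERROR_BITS = [(0x80, "ICRC"), (0x40, "UNC"), (0x10, "IDNF"), (0x04, "ABRT")]

# "res <status>/<error>:..." on a libata error line.
RES_RE = re.compile(r"\bres ([0-9a-f]{2})/([0-9a-f]{2}):")


def _names(value, table):
    """List the names of the bits from table that are set in value."""
    return [name for mask, name in table if value & mask]


def decode_res(line):
    """Return (status flags, error flags) for a "res" line, or None."""
    match = RES_RE.search(line)
    if not match:
        return None
    status, error = (int(field, 16) for field in match.groups())
    return _names(status, STATUS_BITS), _names(error, ERROR_BITS)


if __name__ == "__main__":
    print(decode_res("res 51/04:f8:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error)"))
```

For "res 51/04" this gives DRDY, DSC and ERR in the status register and ABRT in the error register. As far as I understand it, ABRT usually means the drive rejected the command rather than failing to read or write data.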
Re: 2.6.22.6 sata_sil device errors & timeouts
On 9/17/07, Tejun Heo <[EMAIL PROTECTED]> wrote:
> Andrew Paprocki wrote:
> > boot configuration more complicated if booting off the pata drive. Is
> > there any way to control which order the drives are assigned when not
> > building w/ modules?
>
> Please use mount-by-LABEL or UUID.

Thanks, wasn't aware of that functionality. Works like a charm.

> > ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x240 action 0x2 frozen
> > ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x28 action 0x0
>
> In both cases, SError is indicating transmission problem. Handshake
> error and Unrecognized FIS type in the first case, 10b to 8b decode
> error and CRC error on the second case. I can't tell why but signals
> flying through those redish cables are getting corrupted.

I've replaced the cables with a different brand I had laying around, and I haven't seen a problem yet. I'll need to test it heavily, though, to see if I can trigger anything to pop up.

I didn't mention it before, but I'm also getting these errors every time I boot. I'm thinking they're related to the drive not supporting cmds that smartd is sending it. If so, is there any way that libata/smartd can handle this more gracefully? This stuff spews into dmesg and gives a scare that there is a real hardware problem that may cause data corruption. I get exactly 6 instances of each of these two blocks of output prior to reaching the login prompt:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: cmd b0/db:f8:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
         res 51/04:f8:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error)
ata1.00: configured for UDMA/100
ata1: EH complete

ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: cmd b0/db:f8:00:4f:c2/00:00:00:00:00/00 tag 0 cdb 0x0 data 0
         res 51/04:f8:00:4f:c2/00:00:00:00:00/00 Emask 0x1 (device error)
ata2.00: configured for UDMA/100
ata2: EH complete

Thanks,
-Andrew
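The SError decoding quoted above can be reproduced with a small bit table. The Python sketch below uses the SError bit positions from the Serial ATA spec as I know them. One caveat, and it is my assumption rather than anything stated in the thread: the values as they appear in this archive (0x240 and 0x28) do not match the quoted decoding, whereas 0x2400000 and 0x280000 match it exactly, so the archived copy may have lost a run of zeros. The example uses the longer values on that assumption.

```python
# SError register bits (Serial ATA spec numbering).
SERROR_BITS = {
    0: "recovered data integrity error",
    1: "recovered communications error",
    8: "transient data integrity error",
    9: "persistent communication or data integrity error",
    10: "protocol error",
    11: "internal error",
    16: "PhyRdy change",
    17: "Phy internal error",
    18: "comm wake",
    19: "10b to 8b decode error",
    20: "disparity error",
    21: "CRC error",
    22: "handshake error",
    23: "link sequence error",
    24: "transport state transition error",
    25: "unrecognized FIS type",
    26: "device exchanged",
}


def decode_serror(value):
    """Return the names of the SError bits set in value, lowest bit first."""
    return [
        name for bit, name in sorted(SERROR_BITS.items())
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    # Assumed original values; the archived text shows 0x240 and 0x28.
    for serr in (0x2400000, 0x280000):
        print(hex(serr), decode_serror(serr))
```

Both results are link-level transmission errors, which fits the advice to suspect the cables.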
2.6.22.6 sata_sil device errors & timeouts
I have a sata_sil 3114 integrated chipset with 2 Hitachi 250gb sata drives connected, and I'm seeing errors print out during use. The problems seem to get much worse when I switch from these 250gb drives to brand new Hitachi HDS721010KLA330 1tb drives, and eventually the system hangs. With the 250gb drives, I haven't seen a hang, but I still see the errors below. Also, I'm seeing two other "issues": 1) When built with modules disabled, and libata handling the sata + pata (AMD CS5536) connections, the pata drives come _after_ the sata drives (i.e. w/ 2 sata drives, the first IDE drive is sdc). This makes boot configuration more complicated if booting off the pata drive. Is there any way to control which order the drives are assigned when not building w/ modules? 2) The drives display that they support udma6 in hdparm -I, but only udma5 is being used. And hdparm -i only shows up to udma2.. ? Any ideas? Thanks, -Andrew ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x240 action 0x2 frozen ata2.00: cmd 35/00:00:80:31:54/00:04:02:00:00/e0 tag 0 cdb 0x0 data 524288 out res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout) ata2: soft resetting port ata2: port is slow to respond, please be patient (Status 0xd1) ata2: SRST failed (errno=-16) ata2: hard resetting port ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310) ata2.00: configured for UDMA/100 ata2: EH complete sd 1:0:0:0: [sdb] 488397168 512-byte hardware sectors (250059 MB) sd 1:0:0:0: [sdb] Write Protect is off sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00 sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x28 action 0x0 ata1.00: (BMDMA2 stat 0x617d9009) ata1.00: cmd 25/00:80:00:d6:bd/00:02:0b:00:00/e0 tag 0 cdb 0x0 data 327680 in res 51/04:e0:9f:d7:bd/00:00:0b:00:00/eb Emask 0x1 (device error) ata1.00: configured for UDMA/100 ata1: EH complete sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB) sd 0:0:0:0: [sda] Write 
Protect is off sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA # hdparm -i /dev/sda /dev/sda: Model=HDT722525DLA380 , FwRev=V44OA96A, SerialNo= VDK41GT5F3S4JK Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=52 BuffType=DualPortCache, BuffSize=7674kB, MaxMultSect=16, MultSect=?16? CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=268435455 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 AdvancedPM=yes: disabled (255) WriteCache=enabled Drive conforms to: ATA/ATAPI-7 T13 1532D revision 1: ATA/ATAPI-2 ATA/ATAPI-3 ATA/ATAPI-4 ATA/ATAPI-5 ATA/ATAPI-6 ATA/ATAPI-7 # hdparm -I /dev/sda | grep udma DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 udma6 # lspci -vv -d 1095:3114 :00:11.0 0180: 1095:3114 (rev 02) Subsystem: 1095:3114 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- http://vger.kernel.org/majordomo-info.html
Re: JMicron JMB363 issue fixed / ICH8 RAID volume trace
Tejun, fdisk -l output is attached Basically, in the ICH8 BIOS: /dev/sda + sdb = 2 500GB drives in RAID1 configuration /dev/sdc + sdd + sde + sdf = 4 320GB drives in RAID5 configuration /dev/sdg is a 320GB boot drive connected to the JMB363 chipset Is there some kind of problem when probing these partitions because they are fake software RAID through the ICH8? The messages only spew at boot time. Thanks, -Andrew On 5/24/07, Tejun Heo <[EMAIL PROTECTED]> wrote: Andrew Paprocki wrote: > Ethan, I believe my 2.6.22-rc2 kernel *is* working with respect to the > libata problem. By removing CONFIG_IDE, the system now works fine. The > reason why I thought that libata was still having a problem was > because the system would hang after agpgart printed: > "agpgart: detected an Intel 965G chipset." > > I *thought* the system was once again waiting for the root drive to > become available, but it turns out it was actually hung. I found > another user with a Gigabyte board with the same issue. I also have > 4GB ram.. http://lists.opensuse.org/opensuse-amd64/2007-04/msg1.html > > I added "mem=4096M" to the boot line and now everything is working > properly. The IDE subsystem is off and libata is handling everything. > I'll post on the kernel mailing list to see if this is a known issue > w/ agpgart or amd64+4gb. > > I do see some trace print out complaining about reads past the end of > the device.. Does anyone have an idea if these are harmful? They are > coming from my ICH8 RAID volumes: > > sda: sda1 > sda: p1 exceeds device capacity > sdb: unknown partition table > sdc: sdc1 > sdc: p1 exceeds device capacity > sdf1 > sdf: p1 exceeds device capacity > ... 
> attempt to access beyond end of device > sda: rw=0, want=1953533832, limit=976773168 > Buffer I/O error on device sda1, logical block 244191472 > (repeats about 25 times) > attempt to access beyond end of device > sdf: rw=0, want=1875410824, limit=625142448 > (repeats about 25 times) > sdc: rw=0, want=1875410824, limit=625142448 > attempt to access beyond end of device > (repeats about 25 times) What does 'fdisk -l' say? -- tejun - To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Disk /dev/sdb doesn't contain a valid partition table Disk /dev/sdd doesn't contain a valid partition table Disk /dev/sde doesn't contain a valid partition table Disk /dev/sda: 500.1 GB, 500107862016 bytes 255 heads, 63 sectors/track, 60801 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Device Boot Start End Blocks Id System /dev/sda1 * 1 121602 9767659527 HPFS/NTFS Disk /dev/sdb: 500.1 GB, 500107862016 bytes 255 heads, 63 sectors/track, 60801 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Disk /dev/sdc: 320.0 GB, 320072933376 bytes 255 heads, 63 sectors/track, 38913 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Device Boot Start End Blocks Id System /dev/sdc1 1 116739 9377044487 HPFS/NTFS Disk /dev/sdd: 320.0 GB, 320072933376 bytes 255 heads, 63 sectors/track, 38913 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Disk /dev/sde: 320.0 GB, 320072933376 bytes 255 heads, 63 sectors/track, 38913 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Disk /dev/sdf: 320.0 GB, 320072933376 bytes 255 heads, 63 sectors/track, 38913 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Device Boot Start End Blocks Id System /dev/sdf1 1 116739 9377044487 HPFS/NTFS Disk /dev/sdg: 320.0 GB, 320072933376 bytes 255 heads, 63 sectors/track, 38913 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Device Boot 
Start End Blocks Id System /dev/sdg1 1 32680 2625003527 HPFS/NTFS /dev/sdg2 32680 32713 262144 82 Linux swap / Solaris /dev/sdg3 * 32713 3891449806336 83 Linux
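The "attempt to access beyond end of device" numbers can be checked against the fdisk output with a little arithmetic. The Python sketch below converts the kernel's want/limit sector counts to bytes (assuming 512-byte sectors, as the sd messages report) and shows how far past the end of the member disk each request lands. The function name is made up for illustration.

```python
SECTOR_BYTES = 512  # the sd driver reports 512-byte hardware sectors


def describe_overrun(want, limit):
    """Summarize a "want=..., limit=..." pair from the kernel log."""
    return {
        "disk_bytes": limit * SECTOR_BYTES,
        "requested_bytes": want * SECTOR_BYTES,
        "ratio": want / limit,
    }


if __name__ == "__main__":
    # Values copied from the trace: sda first, then sdc/sdf.
    for want, limit in ((1953533832, 976773168), (1875410824, 625142448)):
        info = describe_overrun(want, limit)
        print(info["disk_bytes"], round(info["ratio"], 3))
```

The limits work out to 500107862016 and 320072933376 bytes, exactly the disk sizes fdisk prints, while the requests land at about 2x and 3x those sizes. That is consistent with the partition tables describing the BIOS RAID volumes rather than the individual member disks the kernel is probing (3x would fit a 4-disk RAID5), though that reading is my inference and not something confirmed in the thread.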
Re: JMicron JMB363 issue fixed / ICH8 RAID volume trace
Ethan, I believe my 2.6.22-rc2 kernel *is* working with respect to the libata problem. By removing CONFIG_IDE, the system now works fine. The reason why I thought that libata was still having a problem was because the system would hang after agpgart printed: "agpgart: detected an Intel 965G chipset." I *thought* the system was once again waiting for the root drive to become available, but it turns out it was actually hung. I found another user with a Gigabyte board with the same issue. I also have 4GB ram.. http://lists.opensuse.org/opensuse-amd64/2007-04/msg1.html I added "mem=4096M" to the boot line and now everything is working properly. The IDE subsystem is off and libata is handling everything. I'll post on the kernel mailing list to see if this is a known issue w/ agpgart or amd64+4gb. I do see some trace print out complaining about reads past the end of the device.. Does anyone have an idea if these are harmful? They are coming from my ICH8 RAID volumes: sda: sda1 sda: p1 exceeds device capacity sdb: unknown partition table sdc: sdc1 sdc: p1 exceeds device capacity sdf1 sdf: p1 exceeds device capacity ... attempt to access beyond end of device sda: rw=0, want=1953533832, limit=976773168 Buffer I/O error on device sda1, logical block 244191472 (repeats about 25 times) attempt to access beyond end of device sdf: rw=0, want=1875410824, limit=625142448 (repeats about 25 times) sdc: rw=0, want=1875410824, limit=625142448 attempt to access beyond end of device (repeats about 25 times) I've attached the full dmesg output as dmesg.052207.txt. Thanks -Andrew On 5/23/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > Everyone, I tried rebuilding 2.6.22-rc2 last night with CONFIG_IDE > disabled, but it still produces the same problem. The relevant config > options: > > # CONFIG_IDE is not set > CONFIG_ATA=y > CONFIG_ATA_ACPI=y > CONFIG_SATA_AHCI=y > CONFIG_ATA_PIIX=y > CONFIG_PATA_JMICRON=y > > Ethan, you mention that this is a known issue.. 
I can't find any link > to this problem. This is happening on a cleanly rebuilt kernel, so I'm > not sure if this has to do with Debian 4.0r0 probing the module > incorrectly. I already have the OS installed, and I figured a newer > kernel would have resolved this issue. Unlike the bug report I linked > to, I am not seeing the driver detect JMB363 & JMB361 in the same boot > log. Even when everything works, it only detects a JMB361. I've tried to reproduce this issue under GA-965P-DQ6. JMB363 works fine in 2.6.22-rc2. The dmesg log is attached. It will not show any device name in pata_jmicron. > A working boot looks like this: > > JMB361: IDE controller at PCI slot :04:00.1 > ACPI: PCI Interrupt :04:00.1[B] -> GSI 18 (level, low) -> IRQ 58 > JMB361: chipset revision 2 > JMB361: 100% native mode on irq 58 > ide0: BM-DMA at 0xa000-0xa007, BIOS settings: hda:pio, hdb:pio > ide1: BM-DMA at 0xa008-0xa00f, BIOS settings: hdc:DMA, hdd:DMA If you disabled the entire old-IDE driver, this message should not be existed. It should be like this: ACPI: PCI Interrupt :03:00.1[B] -> GSI 18 (level, low) -> IRQ 19 PCI: Setting latency timer of device :03:00.1 to 64 scsi6 : pata_jmicron scsi7 : pata_jmicron ata7: PATA max UDMA/100 cmd 0x0001a000 ctl 0x0001a402 bmdma 0x0001b000 irq 0 ata8: PATA max UDMA/100 cmd 0x0001a800 ctl 0x0001ac02 bmdma 0x0001b008 irq 0 > No use of acpi=off, noapic, nolapic seems to affect the JMB361 from > being detected properly or not. > > Any next steps? Ethan, do you have more information about this > particular issue with the DQ6 motherboard? 
> > Thanks -Andrew Linux version 2.6.22-rc2-plateado ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #3 SMP Mon May 21 19:38:38 EDT 2007 Command line: root=/dev/sdg3 ro mem=4096M BIOS-provided physical RAM map: BIOS-e820: - 00097c00 (usable) BIOS-e820: 0009f800 - 000a (reserved) BIOS-e820: 000f - 0010 (reserved) BIOS-e820: 0010 - dfee (usable) BIOS-e820: dfee - dfee3000 (ACPI NVS) BIOS-e820: dfee3000 - dfef (ACPI data) BIOS-e820: dfef - dff0 (reserved) BIOS-e820: f000 - f400 (reserved) BIOS-e820: fec0 - 0001 (reserved) BIOS-e820: 0001 - 00012000 (usable) Entering add_active_range(0, 0, 151) 0 entries of 256 used Entering add_active_range(0, 256, 917216) 1 entries of 256 used end_pfn_map = 1048576 DMI 2.4 present. ACPI: RSDP 000F6E70, 0014 (r0 GBT ) ACPI: RSDT DFEE3040, 0034 (r1 GBTGBTUACPI 42302E31 GBTU 1010101) ACPI: FACP DFEE30C0, 0074 (r1 GBTGBTUACPI 42302E31 GBTU 1010101) ACPI: DSDT DFEE3180, 49F4 (r1 GBTGBTUACPI 1000 MSFT 10C) ACPI: FACS DFEE, 0040 ACPI: HPET DFEE7CC0, 0038 (r1 GBTGBTUACPI 42302E31 GBTU 98) ACPI: MCFG DFEE7D40, 003C (r1 GBTGBTUACPI 42302E31 GBTU 1010101)
Re: JMicron JMB361 sporadically failing to initialize from at least 2.6.18.4 to 2.6.22-rc2
Everyone, I tried rebuilding 2.6.22-rc2 last night with CONFIG_IDE disabled, but it still produces the same problem. The relevant config options: # CONFIG_IDE is not set CONFIG_ATA=y CONFIG_ATA_ACPI=y CONFIG_SATA_AHCI=y CONFIG_ATA_PIIX=y CONFIG_PATA_JMICRON=y Ethan, you mention that this is a known issue.. I can't find any link to this problem. This is happening on a cleanly rebuilt kernel, so I'm not sure if this has to do with Debian 4.0r0 probing the module incorrectly. I already have the OS installed, and I figured a newer kernel would have resolved this issue. Unlike the bug report I linked to, I am not seeing the driver detect JMB363 & JMB361 in the same boot log. Even when everything works, it only detects a JMB361. A working boot looks like this: JMB361: IDE controller at PCI slot :04:00.1 ACPI: PCI Interrupt :04:00.1[B] -> GSI 18 (level, low) -> IRQ 58 JMB361: chipset revision 2 JMB361: 100% native mode on irq 58 ide0: BM-DMA at 0xa000-0xa007, BIOS settings: hda:pio, hdb:pio ide1: BM-DMA at 0xa008-0xa00f, BIOS settings: hdc:DMA, hdd:DMA No use of acpi=off, noapic, nolapic seems to affect the JMB361 from being detected properly or not. Any next steps? Ethan, do you have more information about this particular issue with the DQ6 motherboard? Thanks -Andrew On 5/21/07, Alan Cox <[EMAIL PROTECTED]> wrote: > Does anyone know what is causing this and if it is fixed in any dev > branch? I've tried the stock Debian etch netinst 2.6.18.4 kernel, as > well as my own build of 2.6.21.1 and 2.6.22-rc2 and they all exhibit > the same problem. > > Let me know what I can do to help debug this on my end. What configuration options have you got selected - in particular if you have the libata support for SATA enabled then the kernel configures the hardware to expose the AHCI interface for SATA and the PATA interface separately. This requires you to be using the libata drivers for both the SATA and PATA components of the hardware if you wish to use both. 
JMicron JMB361 sporadically failing to initialize from at least 2.6.18.4 to 2.6.22-rc2
I have a Gigabyte GA-965P-DQ6 motherboard which has onboard Intel ICH8 RAID as well as a "Gigabyte" (rebranded JMicron) chipset for 2 separate SATA ports. When I boot the machine, it sporadically fails to initialize the JMB361 chipset which it detects, claiming "dma_base is invalid". When it works (~20% of the time), it correctly detects the chip and all drives connected to it. It feels like a race condition. I have a DVD-RW and my boot SATAII drive connected to the controller, so this bug has the nasty side effect of hanging my machine for eternity waiting for the root drive to appear. The dma_base trace is listed below:

JMB361: IDE controller at PCI slot :03:00.0
ACPI: PCI Interrupt :03:00.0[A] -> GSI 17 (level, low) -> IRQ 177
JMB361: chipset revision 3
JMB361: 100% native mode on irq 177
JMB361: dma_base is invalid
ide0: JMB361 Bus-Master DMA disabled (BIOS)
JMB361: dma_base is invalid
ide1: JMB361 Bus-Master DMA disabled (BIOS)

This seems to be the same problem as reported here:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg339806.html

Numerous other people seem to be hitting this in newer kernels. A Google search for 'jmb361 dma_base' turns up a lot of hits.

Does anyone know what is causing this and if it is fixed in any dev branch? I've tried the stock Debian etch netinst 2.6.18.4 kernel, as well as my own builds of 2.6.21.1 and 2.6.22-rc2, and they all exhibit the same problem.

Let me know what I can do to help debug this on my end.

Thanks
-Andrew