Re: sata_nv + ADMA + Samsung disk problem

2008-01-11 Thread Gabor Gombas
On Mon, Jan 07, 2008 at 06:10:29PM -0600, Robert Hancock wrote:

> Gabor, I just noticed you said that it worked OK in 2.6.20, yet 2.6.22
> fails. 2.6.20 had ADMA support as well, so I wonder what change started
> causing the problem. Would it be possible for you to do a git bisect (or
> at least try 2.6.21 to try and narrow it down)?

I've now booted 2.6.21.7, we'll see. The problem with the bisection is
that I can't explicitly trigger the bug, so I can't say for sure whether a
kernel is good or it just needs more time to trigger. The average
uptime of this machine is just a couple of hours a day.

For example, with 2.6.24-rc6 it took over 3 hours for the first disk to
trigger the bug and the second disk needed more than 7 hours. This
machine is seldom turned on for that long.

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -
-
To unsubscribe from this list: send the line unsubscribe linux-ide in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sata_nv + ADMA + Samsung disk problem

2008-01-07 Thread Robert Hancock

Allen Martin wrote:
 

Dunno about the NVidia version.
Theirs works rather differently - the GO bit is there, but there's
another append register which is used to tell the controller that a new
tag has been added to the CPB list.

The only thing we currently use the GO bit for is to switch between ADMA
and port register mode. Could be there's something we need to do there,
though, who knows..




You shouldn't ever need to touch GO other than the ADMA / legacy mode
switch as you say.

The NVIDIA ADMA hw is not based on the Pacific Digital core.


Gabor, I just noticed you said that it worked OK in 2.6.20, yet 2.6.22 
fails. 2.6.20 had ADMA support as well, so I wonder what change started 
causing the problem. Would it be possible for you to do a git bisect (or 
at least try 2.6.21 to try and narrow it down)?



Re: sata_nv + ADMA + Samsung disk problem

2008-01-03 Thread Mark Lord

Robert Hancock wrote:

Mark Lord wrote:

Robert Hancock wrote:
..
 From some of the traces I took previously (posted on LKML as 
sata_nv ADMA controller lockup investigation way back in Feb 07), 
what seems to occur is that when the second command is issued very 
rapidly (within less than 20 microseconds, or potentially longer) 
after the previous command's completion, the ADMA status changes from 
0x500 (STOPPED and IDLE) to 0x400 (just IDLE) as it typically does, 
but then it sticks there, no interrupt is ever raised, and CPB 
response flags remain at 0.

..

Assuming that NVidia got their ADMA core logic from Pacific Digital
(the inventors), then it may have some of the same bugs as the original.

One of those bugs is that the aGO trigger is sampled in a racy way,
such that it sometimes may miss a recent addition to the ring.

The *only* way to guarantee things with the original Pacific Digital core
was to (1) always retrigger aGO for a full ring scan with each new addition,
and (2) poll periodically (every half second or so) rather than relying
exclusively on the IRQ actually working..

Dunno about the NVidia version.


Theirs works rather differently - the GO bit is there, but there's 
another append register which is used to tell the controller that a new 
tag has been added to the CPB list.

..

The PacDigi core uses a search count register for that purpose,
but the buggy nature of the core required that it always be set
to 2 * ring_size to ensure nothing got missed.

Here's some comments from the original ADMA driver.
Maybe something from here might help with the NV stuff, too.

   // There is a chance that the chip will skip over a CPB if a SERVICE interrupt
   // occurs while it's reading the CPB header.  This won't cause us to get
   // stuck anywhere, but it might slow down execution of the new CPB if
   // it has to wait for the next time we hit aGO.  So.. Dxxx/Dxxx suggest
   // that all we need to do is tell the chip to do two passes around the ring
   // from an aGO instead of one pass, so that it will find the missed CPB
   // on the second pass.  This isn't as bad as it first looks.
   //
   writew(channel->num_cpbs * 2, adma_regs->cpb_search_count);

Or again, the NV stuff may be completely different (?).


Re: sata_nv + ADMA + Samsung disk problem

2008-01-03 Thread Mark Lord

Mark Lord wrote:

Robert Hancock wrote:

Mark Lord wrote:

Robert Hancock wrote:
..

 From some of the traces I took previously (posted on LKML as sata_nv ADMA 
controller lockup investigation way back in Feb 07), what seems to occur is that 
when the second command is issued very rapidly (within less than 20 microseconds, or 
potentially longer) after the previous command's completion, the ADMA status changes from 
0x500 (STOPPED and IDLE) to 0x400 (just IDLE) as it typically does, but then it sticks 
there, no interrupt is ever raised, and CPB response flags remain at 0.

..

Assuming that NVidia got their ADMA core logic from Pacific Digital
(the inventors), then it may have some of the same bugs as the original.

One of those bugs is that the aGO trigger is sampled in a racy way,
such that it sometimes may miss a recent addition to the ring.

The *only* way to guarantee things with the original Pacific Digital core
was to (1) always retrigger aGO for a full ring scan with each new addition,
and (2) poll periodically (every half second or so) rather than relying
exclusively on the IRQ actually working..

Dunno about the NVidia version.


Theirs works rather differently - the GO bit is there, but there's another 
append register which is used to tell the controller that a new tag has been 
added to the CPB list.

..

The PacDigi core uses a search count register for that purpose,
but the buggy nature of the core required that it always be set
to 2 * ring_size to ensure nothing got missed.

Here's some comments from the original ADMA driver.
Maybe something from here might help with the NV stuff, too.

   // There is a chance that the chip will skip over a CPB if a SERVICE interrupt
   // occurs while it's reading the CPB header.  This won't cause us to get
   // stuck anywhere, but it might slow down execution of the new CPB if
   // it has to wait for the next time we hit aGO.  So.. Dxxx/Dxxx suggest
   // that all we need to do is tell the chip to do two passes around the ring
   // from an aGO instead of one pass, so that it will find the missed CPB
   // on the second pass.  This isn't as bad as it first looks.
   //
   writew(channel->num_cpbs * 2, adma_regs->cpb_search_count);

Or again, the NV stuff may be completely different (?).

..

Another thing about the PacDigi core:  one has to be very careful
to avoid sequential accesses to sequential PCI locations when
programming the chip -- it cannot handle merged register writes.

So for any group of sequentially laid out registers, the code has
to ensure it never writes two adjacent registers in sequence..

-ml


RE: sata_nv + ADMA + Samsung disk problem

2008-01-03 Thread Allen Martin
 
> > Dunno about the NVidia version.
>
> Theirs works rather differently - the GO bit is there, but there's
> another append register which is used to tell the controller that a new
> tag has been added to the CPB list.
>
> The only thing we currently use the GO bit for is to switch between ADMA
> and port register mode. Could be there's something we need to do there,
> though, who knows..
 

You shouldn't ever need to touch GO other than the ADMA / legacy mode
switch as you say.

The NVIDIA ADMA hw is not based on the Pacific Digital core.
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: sata_nv + ADMA + Samsung disk problem

2008-01-03 Thread Robert Hancock

Benjamin Herrenschmidt wrote:

Another thing about the PacDigi core:  one has to be very careful
to avoid sequential accesses to sequential PCI locations when
programming the chip -- it cannot handle merged register writes.

So for any group of sequentially laid out registers, the code has
to ensure it never writes two adjacent registers in sequence..


Ugh ? Write combining isn't permitted on normal registers afaik...

Ben.


Byte merging can be done by the chipset on MMIO writes (merging multiple 
8 or 16-bit writes into a single 32-bit cycle).



Re: sata_nv + ADMA + Samsung disk problem

2008-01-03 Thread Robert Hancock

Allen Martin wrote:
 

Dunno about the NVidia version.
Theirs works rather differently - the GO bit is there, but there's
another append register which is used to tell the controller that a new
tag has been added to the CPB list.

The only thing we currently use the GO bit for is to switch between ADMA
and port register mode. Could be there's something we need to do there,
though, who knows..




You shouldn't ever need to touch GO other than the ADMA / legacy mode
switch as you say.

The NVIDIA ADMA hw is not based on the Pacific Digital core.


That answers that question, I guess. Still guessing at why the 
controller would get stuck in IDLE state with no interrupt raised, then..



Re: sata_nv + ADMA + Samsung disk problem

2008-01-03 Thread Benjamin Herrenschmidt

On Thu, 2008-01-03 at 19:43 -0600, Robert Hancock wrote:
> Benjamin Herrenschmidt wrote:
> >> Another thing about the PacDigi core:  one has to be very careful
> >> to avoid sequential accesses to sequential PCI locations when
> >> programming the chip -- it cannot handle merged register writes.
> >>
> >> So for any group of sequentially laid out registers, the code has
> >> to ensure it never writes two adjacent registers in sequence..
> >
> > Ugh ? Write combining isn't permitted on normal registers afaik...
> >
> > Ben.
>
> Byte merging can be done by the chipset on MMIO writes (merging multiple
> 8 or 16-bit writes into a single 32-bit cycle).

That is true, if they are consecutive. You mean that this HW is f*cked
up enough to actually have separate 8/16 bits registers that are
contiguous ? Yuck... I'm afraid you -have- to add reads in between to
guarantee that no merging will occur.
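
The idea of separating adjacent narrow writes with a read can be sketched
roughly as follows. This is a user-space illustration only: the register
layout (fake_regs, two adjacent 16-bit fields) and the helper names are
hypothetical, and plain volatile accesses stand in for the kernel's MMIO
accessors. It is not taken from any real driver.

```c
#include <stdint.h>

/* Hypothetical layout: two adjacent 16-bit registers that a chipset
 * could merge into one 32-bit cycle if they were written back to back. */
struct fake_regs {
    volatile uint16_t reg_a;   /* offset 0 */
    volatile uint16_t reg_b;   /* offset 2 */
};

/* Write one 16-bit register, then read it back. On real MMIO the read
 * is expected to push the preceding write out to the device before the
 * next write is issued, so the two writes cannot be merged. Here it is
 * just an ordinary memory access. */
static uint16_t write16_flushed(volatile uint16_t *reg, uint16_t val)
{
    *reg = val;
    return *reg;   /* the intervening read */
}

/* Program both registers with a read separating the two writes. */
static void program_pair(struct fake_regs *r, uint16_t a, uint16_t b)
{
    (void)write16_flushed(&r->reg_a, a);
    (void)write16_flushed(&r->reg_b, b);
}
```

The cost is one extra bus read per register write, which is why this kind
of workaround would normally be limited to the registers that need it.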

Cheers,

Ben.




RE: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Allen Martin

> The question I had for NVIDIA regarding this that I never got answered
> was, is there any reason why we would need a delay when switching
> between NCQ and non-NCQ commands on ADMA, and if not, is there any known
> cause that could cause the controller to get into this seemingly
> locked-up state?

When switching from NCQ to non NCQ or vice versa you must make sure all
outstanding commands are completed before issuing the new command.  The
hardware doesn't do anything to prevent queued and non queued commands
from going out on the wire at the same time which will certainly cause
some drives to fail.
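
The rule above (never mix queued and non-queued commands on the wire) can
be expressed as a small issue gate. This is a minimal sketch under stated
assumptions: port_state, may_issue, try_issue and complete_one are
hypothetical names invented for illustration, and this is not how libata
itself implements the guarantee.

```c
#include <stdbool.h>

enum cmd_kind { CMD_NON_NCQ, CMD_NCQ };

/* Hypothetical per-port bookkeeping, for illustration only. */
struct port_state {
    unsigned int outstanding;   /* commands issued but not yet completed */
    enum cmd_kind active_kind;  /* kind of the outstanding commands */
};

/* A command may go out only if the port is empty, or if it is an NCQ
 * command joining other NCQ commands. Non-NCQ commands never overlap
 * with anything. */
static bool may_issue(const struct port_state *p, enum cmd_kind kind)
{
    if (p->outstanding == 0)
        return true;
    return kind == CMD_NCQ && p->active_kind == CMD_NCQ;
}

/* Issue the command if allowed; otherwise the caller must defer it. */
static bool try_issue(struct port_state *p, enum cmd_kind kind)
{
    if (!may_issue(p, kind))
        return false;
    p->active_kind = kind;
    p->outstanding++;
    return true;
}

/* Account for one completed command. */
static void complete_one(struct port_state *p)
{
    if (p->outstanding > 0)
        p->outstanding--;
}
```

With two NCQ commands in flight, a cache flush would be refused by
try_issue() until both have completed.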

-Allen


Re: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Jeff Garzik

Allen Martin wrote:
The question I had for NVIDIA regarding this that I never got answered
was, is there any reason why we would need a delay when switching
between NCQ and non-NCQ commands on ADMA, and if not, is there any known
cause that could cause the controller to get into this seemingly
locked-up state?


When switching from NCQ to non NCQ or vice versa you must make sure all
outstanding commands are completed before issuing the new command.  The
hardware doesn't do anything to prevent queued and non queued commands
from going out on the wire at the same time which will certainly cause
some drives to fail.


The software definitely provides that guarantee for all NCQ-capable 
controllers.


Jeff





RE: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Allen Martin
> The software definitely provides that guarantee for all NCQ-capable
> controllers.
 

Well if that's not it, it must be some problem entering ADMA legacy
mode.  Here's what the Windows driver does:


ADMACtrl.aGO = 0
ADMACtrl.aEIEN = 0
poll {
  until ADMAStatus.aLGCY = 1 || timeout
}


Re: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Robert Hancock

Allen Martin wrote:
The software definitely provides that guarantee for all NCQ-capable 
controllers.




Well if that's not it, it must be some problem entering ADMA legacy
mode.  Here's what the Windows driver does:


ADMACtrl.aGO = 0
ADMACtrl.aEIEN = 0
poll {
  until ADMAStatus.aLGCY = 1 || timeout
}


What we're doing to enter legacy mode is essentially:

-wait until ADMA status indicates IDLE bit set (max wait of 1 microsecond)
-clear GO bit in control register
-wait until status indicates LEGACY bit set (max wait of 1 microsecond)

and to enter ADMA mode:

-set GO bit in control register
-wait until status indicates LEGACY bit cleared and IDLE bit set (max 
wait of 1 microsecond)


The 1 microsecond timeout is pretty aggressive admittedly, but it 
apparently isn't being broken (the only timeouts when switching modes 
I've seen are during error handling after a command timeout has already 
occurred). What timeout value is the Windows driver using?
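
The two mode-switch sequences described above can be sketched against a
simulated controller as follows. Everything here is an assumption made for
illustration: the bit positions, the fake_adma model (which reflects the GO
bit in the status immediately), and the function names. The real driver
polls MMIO registers with a delay between reads, not a plain struct.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions are assumptions made for this sketch. */
#define STAT_IDLE    0x0100u
#define STAT_LEGACY  0x0200u
#define CTL_GO       0x0080u

/* Stand-in for the controller: a well-behaved model that reflects the
 * GO bit in the LEGACY status bit as soon as the control write lands. */
struct fake_adma {
    uint16_t ctl;
    uint16_t status;
};

static void fake_write_ctl(struct fake_adma *hw, uint16_t ctl)
{
    hw->ctl = ctl;
    if (ctl & CTL_GO)
        hw->status = (uint16_t)((hw->status & ~STAT_LEGACY) | STAT_IDLE);
    else
        hw->status = (uint16_t)(hw->status | STAT_LEGACY);
}

/* Poll until (status & mask) == want, giving up after max_polls reads. */
static bool wait_status(const struct fake_adma *hw, uint16_t mask,
                        uint16_t want, unsigned int max_polls)
{
    while (max_polls--) {
        if ((hw->status & mask) == want)
            return true;
    }
    return false;
}

/* Wait for IDLE, clear GO, wait for LEGACY. Returns false if either
 * wait timed out; GO is cleared in both cases. */
static bool enter_legacy_mode(struct fake_adma *hw, unsigned int max_polls)
{
    bool idle = wait_status(hw, STAT_IDLE, STAT_IDLE, max_polls);

    fake_write_ctl(hw, (uint16_t)(hw->ctl & ~CTL_GO));
    return wait_status(hw, STAT_LEGACY, STAT_LEGACY, max_polls) && idle;
}

/* Set GO, then wait for LEGACY cleared and IDLE set. */
static bool enter_adma_mode(struct fake_adma *hw, unsigned int max_polls)
{
    fake_write_ctl(hw, (uint16_t)(hw->ctl | CTL_GO));
    return wait_status(hw, STAT_IDLE | STAT_LEGACY, STAT_IDLE, max_polls);
}
```

A controller that never reports IDLE makes enter_legacy_mode() return
false, which corresponds to the timeouts seen only during error handling.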


Also, I see you are clearing the aEIEN bit when in register mode, while 
we're not. Is that important/necessary?


Aside from all this though, in the case of NCQ writes followed by a 
cache flush, that sequence of commands won't put us into legacy mode at 
all since the cache flush is a no-data command which we should be able 
to handle in ADMA mode, from my understanding (correct me if I'm wrong). 
So I don't imagine legacy/ADMA mode switch could be the cause of this 
problem.


I also saw in my previous investigation that a flush immediately 
followed by a write could cause the write to time out as well.


From some of the traces I took previously (posted on LKML as "sata_nv 
ADMA controller lockup investigation" way back in Feb 07), what seems to 
occur is that when the second command is issued very rapidly (within 
less than 20 microseconds, or potentially longer) after the previous 
command's completion, the ADMA status changes from 0x500 (STOPPED and 
IDLE) to 0x400 (just IDLE) as it typically does, but then it sticks 
there, no interrupt is ever raised, and CPB response flags remain at 0.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/




Re: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Robert Hancock

Tejun Heo wrote:

Robert Hancock wrote:

Jeff Garzik wrote:

Tejun Heo wrote:

Thanks a lot for the detailed explanation.  Nvidia ppl, any ideas?
FLUSH is used regularly.  We really need to fix this.


I reiterate my opinion :)  ...   We should remove ADMA support from
sata_nv.  It's only in a few chips, it's not appearing in any new
chips, and nasty problems have lingered since ADMA support was
introduced.

Definitely sounds like we should disable ADMA by default for
2.6.24-rc, too.

I wouldn't agree.. It's only in a few chips (CK804/MCP04), but those
chips are very common in desktop, workstation, even some server
machines. Given the huge number of these chips out there, problem
reports have been quite rare.


I agree with Jeff here.  Maybe not remove but disable it by default and
warn loudly when enabling.  NCQ just doesn't do enough for its cost when
the cost includes erratic behaviors.  Only a very small fraction of error
cases actually make it to bugzilla or this mailing list.

Nvidia gents, is there any way (be it NDA or whatever) to get Robert or
any of us technical documentation?

Thanks.


Last I heard, NVIDIA management gave the thumbs down to any more NDAs 
for ADMA documentation. It would be nice if they would reconsider. 
Apparently Jeff does have the docs, though..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Mark Lord

Robert Hancock wrote:


What we're doing to enter legacy mode is essentially:

-wait until ADMA status indicates IDLE bit set (max wait of 1 microsecond)
-clear GO bit in control register
-wait until status indicates LEGACY bit set (max wait of 1 microsecond)

and to enter ADMA mode:

-set GO bit in control register
-wait until status indicates LEGACY bit cleared and IDLE bit set (max 
wait of 1 microsecond)

..

If there are outstanding TCQ/NCQ commands (any drive),
then this could take (much) longer to enter legacy mode,
as the ADMA engine will wait for them all to finish.

But for normal, nothing outstanding mode, it should be very quick.

Cheers


Re: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Mark Lord

Robert Hancock wrote:
..
 From some of the traces I took previously (posted on LKML as sata_nv 
ADMA controller lockup investigation way back in Feb 07), what seems to 
occur is that when the second command is issued very rapidly (within 
less than 20 microseconds, or potentially longer) after the previous 
command's completion, the ADMA status changes from 0x500 (STOPPED and 
IDLE) to 0x400 (just IDLE) as it typically does, but then it sticks 
there, no interrupt is ever raised, and CPB response flags remain at 0.

..

Assuming that NVidia got their ADMA core logic from Pacific Digital
(the inventors), then it may have some of the same bugs as the original.

One of those bugs is that the aGO trigger is sampled in a racy way,
such that it sometimes may miss a recent addition to the ring.

The *only* way to guarantee things with the original Pacific Digital core
was to (1) always retrigger aGO for a full ring scan with each new addition,
and (2) poll periodically (every half second or so) rather than relying
exclusively on the IRQ actually working..
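
Point (2) above, polling as a safety net next to the interrupt, can be
sketched like this. The ring structure and function names are hypothetical
and the hardware is simulated by a bitmask; it only illustrates sharing one
completion scan between the interrupt path and a periodic timer.

```c
#include <stdint.h>

#define RING_SIZE 8u

/* Hypothetical ring bookkeeping: one bit per slot. */
struct ring {
    uint32_t issued;      /* slots handed to the hardware */
    uint32_t hw_done;     /* slots the (simulated) hardware has finished */
    unsigned int reaped;  /* completions delivered to the upper layer */
};

/* Scan the whole ring and reap everything that is finished. Both the
 * interrupt path and the timer path call this; a slot is reaped only
 * once because its issued bit is cleared as it is reaped. */
static unsigned int reap_completions(struct ring *r)
{
    uint32_t ready = r->issued & r->hw_done;
    unsigned int n = 0;

    for (unsigned int slot = 0; slot < RING_SIZE; slot++) {
        if (ready & (1u << slot))
            n++;
    }
    r->issued &= ~ready;
    r->hw_done &= ~ready;
    r->reaped += n;
    return n;
}

/* Interrupt path: runs only when the IRQ actually arrives. */
static unsigned int on_irq(struct ring *r)
{
    return reap_completions(r);
}

/* Timer path: called every half second or so, so that a completion
 * whose interrupt was lost is still picked up. */
static unsigned int on_poll_tick(struct ring *r)
{
    return reap_completions(r);
}
```

A real driver would also need locking between the two paths; that is left
out here to keep the sketch short.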

Dunno about the NVidia version.

Cheers




Re: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Robert Hancock

Mark Lord wrote:

Robert Hancock wrote:
..
 From some of the traces I took previously (posted on LKML as sata_nv 
ADMA controller lockup investigation way back in Feb 07), what seems 
to occur is that when the second command is issued very rapidly 
(within less than 20 microseconds, or potentially longer) after the 
previous command's completion, the ADMA status changes from 0x500 
(STOPPED and IDLE) to 0x400 (just IDLE) as it typically does, but then 
it sticks there, no interrupt is ever raised, and CPB response flags 
remain at 0.

..

Assuming that NVidia got their ADMA core logic from Pacific Digital
(the inventors), then it may have some of the same bugs as the original.

One of those bugs is that the aGO trigger is sampled in a racy way,
such that it sometimes may miss a recent addition to the ring.

The *only* way to guarantee things with the original Pacific Digital core
was to (1) always retrigger aGO for a full ring scan with each new addition,
and (2) poll periodically (every half second or so) rather than relying
exclusively on the IRQ actually working..

Dunno about the NVidia version.


Theirs works rather differently - the GO bit is there, but there's 
another append register which is used to tell the controller that a new 
tag has been added to the CPB list.


The only thing we currently use the GO bit for is to switch between ADMA 
and port register mode. Could be there's something we need to do there, 
though, who knows..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Gabor Gombas
Hi,

Just FYI I've tried to enable ADMA again (now running 2.6.24-rc6) but
the bug is still present:

Jan  1 16:11:21 host kernel: ata7: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb idx 0x0
Jan  1 16:11:21 host kernel: ata7: CPB 0: ctl_flags 0x9, resp_flags 0x0
Jan  1 16:11:21 host kernel: ata7: timeout waiting for ADMA IDLE, stat=0x400
Jan  1 16:11:21 host kernel: ata7: timeout waiting for ADMA LEGACY, stat=0x400
Jan  1 16:11:21 host kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan  1 16:11:21 host kernel: ata7.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jan  1 16:11:21 host kernel:  res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  1 16:11:21 host kernel: ata7.00: status: { DRDY }
Jan  1 16:11:21 host kernel: ata7: soft resetting link
Jan  1 16:11:22 host kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  1 16:11:22 host kernel: ata7.00: configured for UDMA/133
Jan  1 16:11:22 host kernel: ata7: EH complete
Jan  1 16:11:22 host kernel: sd 6:0:0:0: [sdc] 488397168 512-byte hardware sectors (250059 MB)
Jan  1 16:11:22 host kernel: sd 6:0:0:0: [sdc] Write Protect is off
Jan  1 16:11:22 host kernel: sd 6:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Jan  1 16:11:22 host kernel: sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Although this time the above happened more than 3 hours after boot
which is much better than 2.6.22 was. In the past ~4 months ADMA was
disabled and I never had any libata-related error messages.

SMART does not show anything interesting:

smartctl version 5.37 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint P120 series
Device Model:     SAMSUNG SP2504C
Serial Number:    XX
Firmware Version: VT100-33
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Tue Jan  1 17:38:21 2008 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (4867) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  81) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always       -       6144
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1218
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       11363
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       3325
 10 Spin_Retry_Count        0x0033   253   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   253   002   000    Old_age   

Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Tejun Heo
[cc'ing Robert Hancock and NVidia people]

Whole thread can be read from the following URL.

  http://thread.gmane.org/gmane.linux.ide/21710

In a nutshell, with ADMA enabled, FLUSH_EXT occasionally times out.  I
first suspected faulty disk (reallocation failure on flush) but SMART
reports nothing suspicious and w/ ADMA disabled, the drive works just fine.

On a side note, on 2.6.22.1, SMART fails from time to time but the
problem went away on 2.6.24-rc6.  This was apparently fixed during that
period.  I guess we can ignore this for now.

Thanks.


Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Robert Hancock

Tejun Heo wrote:

[cc'ing Robert Hancock and NVidia people]

Whole thread can be read from the following URL.

  http://thread.gmane.org/gmane.linux.ide/21710

In a nutshell, with ADMA enabled, FLUSH_EXT occasionally times out.  I
first suspected faulty disk (reallocation failure on flush) but SMART
reports nothing suspicious and w/ ADMA disabled, the drive works just fine.

On a side note, on 2.6.22.1, SMART fails from time to time but the
problem went away on 2.6.24-rc6.  This was apparently fixed during that
period.  I guess we can ignore this for now.

Thanks.


This is kind of a longstanding problem which has been partially worked 
around, but it seems not entirely. This is what I had diagnosed some 
time ago:


recently, some issues cropped up with command timeouts when a cache 
flush command was immediately followed by an NCQ write. In this case, 
sometimes when the NCQ write was issued, the status register changed 
from 0x500 (Stopped and Idle) to 0x400 (Stopped) as it normally appears 
to, however it seems like the controller would get hung in that state, 
and we would time out with no notifiers set, the gen_ctl register not 
indicating interrupt status, and the CPB response flags still 0 as we 
left them, seemingly indicating the controller hasn't done anything with 
it. Then, when the error handler kicks in we clear the GO bit to put it 
back into register mode, but the Legacy flag in the status register 
doesn't get set (or at least it takes longer than 1 microsecond). 
Finally when we do an ADMA channel reset that seems to get it responding 
again, until this happens the next time.
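As a reading aid, the two status values mentioned above can be decoded with a small sketch. The bit values are inferred only from the readings given in this message (0x500 = Stopped and Idle, 0x400 = Stopped); the constant and function names are made up for illustration and are not taken from the driver.

```python
# Bit values inferred from the two readings quoted in the message:
# 0x500 = Stopped and Idle, 0x400 = Stopped, hence Idle = 0x100.
# Names are illustrative, not the driver's own.
ADMA_STAT_IDLE = 0x100
ADMA_STAT_STOPPED = 0x400


def decode_adma_status(status):
    """Return the names of the known flags set in an ADMA status word."""
    names = []
    if status & ADMA_STAT_STOPPED:
        names.append("Stopped")
    if status & ADMA_STAT_IDLE:
        names.append("Idle")
    return names


if __name__ == "__main__":
    for value in (0x500, 0x400):
        print(hex(value), decode_adma_status(value))
```

In the hang described above the status stays at 0x400, i.e. Stopped without Idle.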


From some experimentation, I found that when we are issuing a NCQ
command when the last command was non-NCQ, or vice versa, if I added in
a delay of 20 microseconds between setting up the CPB and writing to the
append register, the problem appeared to go away. Problem is I don't
know if that's because it actually needs this delay, or because it
changes the timing so that it happens to work even though we're doing
something wrong, there's some event we're not waiting for, etc.
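The workaround described above can be modelled roughly as follows. This is only a toy sketch in Python, not driver code: the class and its fields are hypothetical, and the only values taken from the message are the 20 microsecond delay and the rule that it applies when the command type changes between NCQ and non-NCQ.

```python
import time

MODE_SWITCH_DELAY_US = 20  # delay from the experiment described in the message


class AdmaPortModel:
    """Toy model of one port; it only tracks whether the last command was NCQ."""

    def __init__(self, sleep=time.sleep):
        self.last_was_ncq = None  # None until the first command is issued
        self.delays = 0           # how many times the delay was applied
        self._sleep = sleep

    def issue(self, is_ncq):
        # Setting up the CPB would happen here.
        if self.last_was_ncq is not None and self.last_was_ncq != is_ncq:
            # Command type changed (NCQ <-> non-NCQ): wait before appending.
            self._sleep(MODE_SWITCH_DELAY_US / 1_000_000)
            self.delays += 1
        self.last_was_ncq = is_ncq
        # Writing to the append register would happen here.


if __name__ == "__main__":
    port = AdmaPortModel()
    for is_ncq in (True, True, False, True):
        port.issue(is_ncq)
    print("delays applied:", port.delays)
```

The sketch says nothing about whether the delay is actually required by the hardware, which is exactly the open question.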

I've now verified that no switches between ADMA and register mode occur 
near the time of these timeouts. Neither are we reading or writing any 
of the ATA shadow registers while we're in ADMA mode.


It seems likely that this is what is happening here (a switch from an 
NCQ command to a non-NCQ command, then the non-NCQ times out). It could 
be in some cases the 20 microsecond delay is not enough. But it seems 
bogus that we should need such an arbitrary delay in the first place.


The question I had for NVIDIA regarding this that I never got answered 
was, is there any reason why we would need a delay when switching 
between NCQ and non-NCQ commands on ADMA, and if not, is there any known 
cause that could cause the controller to get into this seemingly 
locked-up state?


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Robert Hancock

Robert Hancock wrote:

Tejun Heo wrote:

[cc'ing Robert Hancock and NVidia people]

Whole thread can be read from the following URL.

  http://thread.gmane.org/gmane.linux.ide/21710

In a nutshell, with ADMA enabled, FLUSH_EXT occasionally times out.  I
first suspected faulty disk (reallocation failure on flush) but SMART
reports nothing suspicious and w/ ADMA disabled, the drive works just 
fine.


On a side note, on 2.6.22.1, SMART fails from time to time but the
problem went away on 2.6.24-rc6.  This was apparently fixed during that
period.  I guess we can ignore this for now.

Thanks.


This is kind of a longstanding problem which has been partially worked 
around, but it seems not entirely. This is what I had diagnosed some 
time ago:


recently, some issues cropped up with command timeouts when a cache 
flush command was immediately followed by an NCQ write. In this case, 
sometimes when the NCQ write was issued, the status register changed 
from 0x500 (Stopped and Idle) to 0x400 (Stopped) as it normally appears 
to, however it seems like the controller would get hung in that state, 
and we would time out with no notifiers set, the gen_ctl register not 
indicating interrupt status, and the CPB response flags still 0 as we 
left them, seemingly indicating the controller hasn't done anything with 
it. Then, when the error handler kicks in we clear the GO bit to put it 
back into register mode, but the Legacy flag in the status register 
doesn't get set (or at least it takes longer than 1 microsecond). 
Finally when we do an ADMA channel reset that seems to get it responding 
again, until this happens the next time.


 From some experimentation, I found that when we are issuing a NCQ
command when the last command was non-NCQ, or vice versa, if I added in
a delay of 20 microseconds between setting up the CPB and writing to the
append register, the problem appeared to go away. Problem is I don't
know if that's because it actually needs this delay, or because it
changes the timing so that it happens to work even though we're doing
something wrong, there's some event we're not waiting for, etc.

I've now verified that no switches between ADMA and register mode occur 
near the time of these timeouts. Neither are we reading or writing any 
of the ATA shadow registers while we're in ADMA mode.


It seems likely that this is what is happening here (a switch from an 
NCQ command to a non-NCQ command, then the non-NCQ times out). It could 
be in some cases the 20 microsecond delay is not enough. But it seems 
bogus that we should need such an arbitrary delay in the first place.


The question I had for NVIDIA regarding this that I never got answered 
was, is there any reason why we would need a delay when switching 
between NCQ and non-NCQ commands on ADMA, and if not, is there any known 
cause that could cause the controller to get into this seemingly 
locked-up state?


Well, I guess I did sort of get an answer, but the only theory was that 
the flush and the NCQ commands were being overlapped, which shouldn't be 
possible (the libata core guarantees that, and if it didn't work it 
would affect all controllers).
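The exclusion rule referred to above (NCQ and non-NCQ commands are never in flight together) can be pictured with a toy model. This is a simplified sketch with hypothetical names, not the libata implementation.

```python
class LinkModel:
    """Toy model: a link carries either NCQ commands or one non-NCQ command."""

    def __init__(self):
        self.ncq_in_flight = 0
        self.non_ncq_in_flight = False

    def can_issue(self, is_ncq):
        if is_ncq:
            # NCQ commands may queue up, but not behind a non-NCQ command.
            return not self.non_ncq_in_flight
        # A non-NCQ command (e.g. a cache flush) needs the link to be empty.
        return self.ncq_in_flight == 0 and not self.non_ncq_in_flight

    def issue(self, is_ncq):
        """Issue a command; return False if it has to be deferred."""
        if not self.can_issue(is_ncq):
            return False
        if is_ncq:
            self.ncq_in_flight += 1
        else:
            self.non_ncq_in_flight = True
        return True

    def complete(self, is_ncq):
        if is_ncq:
            self.ncq_in_flight -= 1
        else:
            self.non_ncq_in_flight = False


if __name__ == "__main__":
    link = LinkModel()
    link.issue(True)
    print("flush deferred while NCQ is active:", not link.issue(False))
```

Under this rule a flush is only issued once every outstanding NCQ command has completed, which is why an overlap should not be possible.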


I'm kind of wondering if there's something funny going on with the 
notifier register stuff, which is supposed to tell us what commands have 
completed. We don't really use it at all (we had some problems with 
missed completions, etc. when I tried using it, also it doesn't work if 
ATAPI is enabled on the other port on the controller, apparently). I 
know these controllers will do strange things like not signalling 
interrupts for later events if you don't clear the notifiers in just the 
right way (that being mostly determined by trial and error).


Or, maybe somehow the flush is getting issued before the controller is 
really ready for it somehow (it's not finished cleaning up after 
preceding NCQ command).


It's pretty hard for me to figure out which of the above might be the 
case, especially without access to the detailed controller documentation..



Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Tejun Heo
Robert Hancock wrote:
 This is kind of a longstanding problem which has been partially worked
 around, but it seems not entirely. This is what I had diagnosed some
 time ago:

 recently, some issues cropped up with command timeouts when a cache
 flush command was immediately followed by an NCQ write. In this case,
 sometimes when the NCQ write was issued, the status register changed
 from 0x500 (Stopped and Idle) to 0x400 (Stopped) as it normally
 appears to, however it seems like the controller would get hung in
 that state, and we would time out with no notifiers set, the gen_ctl
 register not indicating interrupt status, and the CPB response flags
 still 0 as we left them, seemingly indicating the controller hasn't
 done anything with it. Then, when the error handler kicks in we clear
 the GO bit to put it back into register mode, but the Legacy flag in
 the status register doesn't get set (or at least it takes longer than
 1 microsecond). Finally when we do an ADMA channel reset that seems to
 get it responding again, until this happens the next time.

  From some experimentation, I found that when we are issuing a NCQ
 command when the last command was non-NCQ, or vice versa, if I added in
 a delay of 20 microseconds between setting up the CPB and writing to the
 append register, the problem appeared to go away. Problem is I don't
 know if that's because it actually needs this delay, or because it
 changes the timing so that it happens to work even though we're doing
 something wrong, there's some event we're not waiting for, etc.

 I've now verified that no switches between ADMA and register mode
 occur near the time of these timeouts. Neither are we reading or
 writing any of the ATA shadow registers while we're in ADMA mode.

 It seems likely that this is what is happening here (a switch from an
 NCQ command to a non-NCQ command, then the non-NCQ times out). It
 could be in some cases the 20 microsecond delay is not enough. But it
 seems bogus that we should need such an arbitrary delay in the first
 place.

 The question I had for NVIDIA regarding this that I never got answered
 was, is there any reason why we would need a delay when switching
 between NCQ and non-NCQ commands on ADMA, and if not, is there any
 known cause that could cause the controller to get into this seemingly
 locked-up state?
 
 Well, I guess I did sort of get an answer, but the only theory was that
 the flush and the NCQ commands were being overlapped, which shouldn't be
 possible (the libata core guarantees that, and if it didn't work it
 would affect all controllers).
 
 I'm kind of wondering if there's something funny going on with the
 notifier register stuff, which is supposed to tell us what commands have
 completed. We don't really use it at all (we had some problems with
 missed completions, etc. when I tried using it, also it doesn't work if
 ATAPI is enabled on the other port on the controller, apparently). I
 know these controllers will do strange things like not signalling
 interrupts for later events if you don't clear the notifiers in just the
 right way (that being mostly determined by trial and error).
 
 Or, maybe somehow the flush is getting issued before the controller is
 really ready for it somehow (it's not finished cleaning up after
 preceding NCQ command).
 
 It's pretty hard for me to figure out which of the above might be the
 case, especially without access to the detailed controller documentation..

Thanks a lot for the detailed explanation.  Nvidia ppl, any ideas?
FLUSH is used regularly.  We really need to fix this.

-- 
tejun


Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Jeff Garzik

Tejun Heo wrote:

Thanks a lot for the detailed explanation.  Nvidia ppl, any ideas?
FLUSH is used regularly.  We really need to fix this.



I reiterate my opinion :)  ...   We should remove ADMA support from 
sata_nv.  It's only in a few chips, it's not appearing in any new chips, 
and nasty problems have lingered since ADMA support was introduced.


Definitely sounds like we should disable ADMA by default for 2.6.24-rc, too.

Jeff




Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Robert Hancock

Jeff Garzik wrote:

Tejun Heo wrote:

Thanks a lot for the detailed explanation.  Nvidia ppl, any ideas?
FLUSH is used regularly.  We really need to fix this.



I reiterate my opinion :)  ...   We should remove ADMA support from 
sata_nv.  It's only in a few chips, it's not appearing in any new chips, 
and nasty problems have lingered since ADMA support was introduced.


Definitely sounds like we should disable ADMA by default for 2.6.24-rc, 
too.


I wouldn't agree.. It's only in a few chips (CK804/MCP04), but those 
chips are very common in desktop, workstation, even some server 
machines. Given the huge number of these chips out there, problem 
reports have been quite rare.



Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Tejun Heo
Robert Hancock wrote:
 Jeff Garzik wrote:
 Tejun Heo wrote:
 Thanks a lot for the detailed explanation.  Nvidia ppl, any ideas?
 FLUSH is used regularly.  We really need to fix this.


 I reiterate my opinion :)  ...   We should remove ADMA support from
 sata_nv.  It's only in a few chips, it's not appearing in any new
 chips, and nasty problems have lingered since ADMA support was
 introduced.

 Definitely sounds like we should disable ADMA by default for
 2.6.24-rc, too.
 
 I wouldn't agree.. It's only in a few chips (CK804/MCP04), but those
 chips are very common in desktop, workstation, even some server
 machines. Given the huge number of these chips out there, problem
 reports have been quite rare.

I agree with Jeff here.  Maybe not remove it, but disable it by default and
warn loudly when it is enabled.  NCQ just doesn't gain enough for its cost when
the cost includes erratic behavior.  Only a very small fraction of error
cases actually make it to bugzilla or this mailing list.

Nvidia gents, is there any way (be it NDA or whatever) to get Robert or
any of us technical documentation?

Thanks.

-- 
tejun


Re: sata_nv + ADMA + Samsung disk problem

2007-08-16 Thread Gabor Gombas
Hi,

On Tue, Aug 14, 2007 at 06:30:28PM +0900, Tejun Heo wrote:

 Hmmm... That's timeout on cache flush, indicative of failing disk.
 Please post the result of 'smartctl -a /dev/sdc'.

Ok, so something is fishy in 2.6.22 wrt. SMART.

First, booting back to 2.6.20.5 I confirmed that SMART works without any
problems for all 4 disks, so all the following is a regression in
2.6.22.

I have 4 disks: two Maxtors (hdparm -I output below): sda/sdb, and two
Samsung (hdparm -I output is in my previous mail): sdc/sdd.

 cut 
/dev/sda:

ATA device, with non-removable media
Model Number:   Maxtor 6B250S0  
Serial Number:  
Firmware Revision:  BANC1G10
Standards:
Used: ATA/ATAPI-7 T13 1532D revision 0 
Supported: 7 6 5 4 
Configuration:
Logical max current
cylinders   16383   16383
heads   16  16
sectors/track   63  63
--
CHS current addressable sectors:   16514064
LBA    user addressable sectors:  268435455
LBA48  user addressable sectors:  490234752
device size with M = 1024*1024:  239372 MBytes
device size with M = 1000*1000:  251000 MBytes (251 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16  Current = 16
Advanced power management level: unknown setting (0x)
Recommended acoustic management value: 192, current value: 128
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
 Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4 
 Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
Enabled Supported:
   *SMART feature set
Security Mode feature set
   *Power Management feature set
   *Write cache
   *Look-ahead
   *Host Protected Area feature set
   *WRITE_VERIFY command
   *WRITE_BUFFER command
   *READ_BUFFER command
   *NOP cmd
   *DOWNLOAD_MICROCODE
Advanced Power Management feature set
SET_MAX security extension
   *Automatic Acoustic Management feature set
   *48-bit Address feature set
   *Device Configuration Overlay feature set
   *Mandatory FLUSH_CACHE
   *FLUSH_CACHE_EXT
   *SMART error logging
   *SMART self-test
Media Card Pass-Through
   *General Purpose Logging feature set
   *WRITE_{DMA|MULTIPLE}_FUA_EXT
   *URG for READ_STREAM[_DMA]_EXT
   *URG for WRITE_STREAM[_DMA]_EXT
   *SATA-I signaling speed (1.5Gb/s)
   *Native Command Queueing (NCQ)
Software settings preservation
   *SMART Command Transport (SCT) feature set
   *SCT Data Tables (AC5)
Security: 
Master password revision code = 65534
supported
not enabled
not locked
frozen
not expired: security count
not supported: enhanced erase
Checksum: correct
 cut 
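The size figures in the hdparm output above follow directly from the LBA48 sector count. A quick check, assuming 512-byte sectors (the variable and function names are only for illustration):

```python
SECTOR_BYTES = 512             # assumed sector size
LBA48_SECTORS = 490234752      # "LBA48 user addressable sectors" from hdparm


def capacity(sectors, sector_bytes=SECTOR_BYTES):
    """Return (bytes, MBytes with M = 1024*1024, MBytes with M = 1000*1000)."""
    total = sectors * sector_bytes
    return total, total // (1024 * 1024), total // (1000 * 1000)


if __name__ == "__main__":
    print(capacity(LBA48_SECTORS))
```

The result is 251000193024 bytes, 239372 MBytes and 251000 MBytes, the same two MByte figures that hdparm prints.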

Under 2.6.22.1, when I try to do smartctl -d ata -s on /dev/sd[ab] or
smartctl -d ata -a /dev/sd[ab], I get the following error:

 cut 
smartctl version 5.37 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family: Maxtor DiamondMax 10 family (ATA/133 and SATA/150)
Device Model: Maxtor 6B250S0
Serial Number:
Firmware Version: BANC1G10
User Capacity:    251,000,193,024 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Wed Aug 15 12:01:38 2007 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Error SMART Status command failed
Please get assistance from http://smartmontools.sourceforge.net/
Register values returned from SMART Status command are:
CMD=0x50
FR =0x00
NS =0x00
SC =0x00
CL =0xc2
CH =0x00
SEL=0x00
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
 cut 
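For context on why smartctl rejects these values: the ATA SMART RETURN STATUS command reports its result as a signature in the LBA Mid / LBA High registers, 0x4F/0xC2 when no threshold is exceeded and 0xF4/0x2C when one is. The sketch below assumes that smartctl's CL and CH are those two registers (cylinder low / cylinder high); the function name is made up.

```python
# Signatures defined for ATA SMART RETURN STATUS (LBA Mid, LBA High).
SMART_PASSED = (0x4F, 0xC2)
SMART_THRESHOLD_EXCEEDED = (0xF4, 0x2C)


def classify_smart_status(cl, ch):
    """Classify the CL/CH pair returned by SMART RETURN STATUS."""
    if (cl, ch) == SMART_PASSED:
        return "passed"
    if (cl, ch) == SMART_THRESHOLD_EXCEEDED:
        return "threshold exceeded"
    return "invalid"


if __name__ == "__main__":
    # Values from the error output above: CL=0xc2, CH=0x00.
    print(classify_smart_status(0xC2, 0x00))
```

CL=0xc2 with CH=0x00 matches neither signature, so the command result is unusable rather than a report of a failing disk.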

To repeat, this does not happen under 2.6.20.5. Using -T permissive works:

 cut 
smartctl version 5.37 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family: Maxtor 

Re: sata_nv + ADMA + Samsung disk problem

2007-08-16 Thread Jim Paris
Gabor Gombas wrote:
 On Tue, Aug 14, 2007 at 06:30:28PM +0900, Tejun Heo wrote:
  Hmmm... That's timeout on cache flush, indicative of failing disk.
  Please post the result of 'smartctl -a /dev/sdc'.
 
 Ok, so something is fishy in 2.6.22 wrt. SMART.

See http://lkml.org/lkml/2007/7/8/198

-jim


Re: sata_nv + ADMA + Samsung disk problem

2007-08-14 Thread Tejun Heo
Gabor Gombas wrote:
 Hi,
 
 Since I have upgraded to 2.6.22.1 from 2.6.20 I have problems with
 Samsung disks. Sometimes the disks stall for about half a minute and
 then I have these messages in the logs:
 
 Aug  6 20:10:11 twister kernel: ata7: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb idx 0x0
 Aug  6 20:10:12 twister kernel: ata7: CPB 0: ctl_flags 0x9, resp_flags 0x0
 Aug  6 20:10:12 twister kernel: ata7: timeout waiting for ADMA IDLE, stat=0x400
 Aug  6 20:10:12 twister kernel: ata7: timeout waiting for ADMA LEGACY, stat=0x400
 Aug  6 20:10:12 twister kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
 Aug  6 20:10:12 twister kernel: ata7.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
 Aug  6 20:10:12 twister kernel:  res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
 Aug  6 20:10:12 twister kernel: ata7: soft resetting port
 Aug  6 20:10:12 twister kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
 Aug  6 20:10:12 twister kernel: ata7.00: configured for UDMA/133
 Aug  6 20:10:12 twister kernel: ata7: EH complete
 Aug  6 20:10:12 twister kernel: sd 6:0:0:0: [sdc] 488397168 512-byte hardware sectors (250059 MB)
 Aug  6 20:10:12 twister kernel: sd 6:0:0:0: [sdc] Write Protect is off
 Aug  6 20:10:12 twister kernel: sd 6:0:0:0: [sdc] Mode Sense: 00 3a 00 00
 Aug  6 20:10:12 twister kernel: sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 Aug  6 20:20:25 twister kernel: ata8: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb idx 0x0
 Aug  6 20:20:25 twister kernel: ata8: CPB 0: ctl_flags 0x9, resp_flags 0x0
 Aug  6 20:20:25 twister kernel: ata8: timeout waiting for ADMA IDLE, stat=0x400
 Aug  6 20:20:25 twister kernel: ata8: timeout waiting for ADMA LEGACY, stat=0x400
 Aug  6 20:20:25 twister kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
 Aug  6 20:20:25 twister kernel: ata8.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
 Aug  6 20:20:25 twister kernel:  res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
 Aug  6 20:20:25 twister kernel: ata8: soft resetting port
 Aug  6 20:20:25 twister kernel: ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
 Aug  6 20:20:25 twister kernel: ata8.00: configured for UDMA/133
 Aug  6 20:20:25 twister kernel: ata8: EH complete
 Aug  6 20:20:25 twister kernel: sd 7:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
 Aug  6 20:20:25 twister kernel: sd 7:0:0:0: [sdd] Write Protect is off
 Aug  6 20:20:25 twister kernel: sd 7:0:0:0: [sdd] Mode Sense: 00 3a 00 00
 Aug  6 20:20:25 twister kernel: sd 7:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 
 I also have two Maxtor disks on the same controller but they are working
 correctly in ADMA mode. I now disabled ADMA mode and that seems to help.

Hmmm... That's timeout on cache flush, indicative of failing disk.
Please post the result of 'smartctl -a /dev/sdc'.
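For readers following along, the "timeout on cache flush" reading comes from the cmd and res lines of the quoted log: the first byte after "cmd" is the ATA command opcode (0xea is FLUSH CACHE EXT) and the res line carries the error mask with its label. A rough sketch of pulling those fields out; the regular expressions and function names are illustrative, and only the two flush opcodes are listed.

```python
import re

# ATA flush opcodes; anything else is reported as "unknown" in this sketch.
ATA_COMMANDS = {0xE7: "FLUSH CACHE", 0xEA: "FLUSH CACHE EXT"}

CMD_RE = re.compile(r"\bcmd\s+([0-9a-f]{2})/")
RES_RE = re.compile(r"\bres\s+([0-9a-f]{2})/.*Emask\s+(0x[0-9a-f]+)\s+\((\w+)\)")


def decode_cmd(line):
    """Return (opcode, name) from a libata 'cmd' log line, or None."""
    match = CMD_RE.search(line)
    if match is None:
        return None
    opcode = int(match.group(1), 16)
    return opcode, ATA_COMMANDS.get(opcode, "unknown")


def decode_res(line):
    """Return (status, emask, label) from a libata 'res' log line, or None."""
    match = RES_RE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16), match.group(3)


if __name__ == "__main__":
    cmd_line = "ata7.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0"
    res_line = "res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)"
    print(decode_cmd(cmd_line))
    print(decode_res(res_line))
```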

-- 
tejun


Re: sata_nv + ADMA + Samsung disk problem

2007-08-14 Thread Gabor Gombas
On Tue, Aug 14, 2007 at 06:30:28PM +0900, Tejun Heo wrote:

 Hmmm... That's timeout on cache flush, indicative of failing disk.
 Please post the result of 'smartctl -a /dev/sdc'.

Will do when I get home. Note however that this only occurs in ADMA
mode. It never occurred with 2.6.20 and it has never occurred with 2.6.22
since I disabled ADMA.
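One way to double-check which mode a running system is in is to read the driver's module parameter. The sketch below assumes that sata_nv exposes a read-only boolean parameter named adma under /sys/module/sata_nv/parameters/ and that it reads back as Y or N; both the path and the value format are assumptions to verify on the actual kernel.

```python
from pathlib import Path

# Assumed sysfs location of the sata_nv "adma" parameter; check it locally.
ADMA_PARAM = Path("/sys/module/sata_nv/parameters/adma")


def parse_bool_param(text):
    """Interpret a boolean module parameter value ("Y"/"N" or "1"/"0")."""
    return text.strip() in ("Y", "y", "1")


def adma_enabled(param_path=ADMA_PARAM):
    """Return True/False for the parameter, or None if it cannot be read."""
    try:
        return parse_bool_param(Path(param_path).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("ADMA enabled:", adma_enabled())
```

A result of None only means the parameter file was not found, for example because the module is not loaded.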

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -