Re: sata_nv + ADMA + Samsung disk problem

2008-01-11 Thread Robert Hancock

Gabor Gombas wrote:
> On Mon, Jan 07, 2008 at 06:10:29PM -0600, Robert Hancock wrote:
>
>> Gabor, I just noticed you said that it worked OK in 2.6.20, yet 2.6.22
>> fails. 2.6.20 had ADMA support as well, so I wonder what change started
>> causing the problem. Would it be possible for you to do a git bisect (or
>> at least try 2.6.21 to try and narrow it down)?
>
> I've now booted 2.6.21.7, we'll see. The problem with the bisection is
> that I can't explicitly trigger the bug so I can't say for sure if a
> kernel is good or it just needs more time to trigger. The average
> uptime of this machine is just a couple hours a day.
>
> For example, with 2.6.24-rc6 it took over 3 hours for the first disk to
> trigger the bug and the second disk needed more than 7 hours. This
> machine is seldom turned on for that long.


If you want to try to reproduce the problem more rapidly, you can try 
the recipe I just suggested to the NVIDIA guys:


Run 2 instances of this C program, with different output files as the 
argument, i.e. save this to fsynctest.c, and do

gcc fsynctest.c -o fsynctest
./fsynctest testfile & ./fsynctest testfile2 &

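#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Write one byte and fsync it, 100 times, so the disk sees a steady
   stream of small writes each followed by a cache flush. */
int main(int argc, char* argv[])
{
    int i;
    int fd;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s outputfile\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1)
    {
        perror("open");
        return 1;
    }
    for (i = 0; i < 100; i++)
    {
        int rc = write(fd, "0", 1);
        if (rc != 1)
        {
            perror("write");
            return 2;
        }
        rc = fsync(fd);
        if (rc)
        {
            perror("fsync");
            return 2;
        }
    }
    return 0;
}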

Also run one instance of this:

dd if=/dev/zero of=blankfile bs=512 count=10 oflag=direct

and one of this:

while /bin/true; do sdparm --command=sync /dev/sdb; done

all at the same time. In my experience, it helps to disable cpufreq (on 
Red Hat/Fedora, /sbin/service cpuspeed stop) to force the CPU to run at 
max frequency all the time. After a few minutes I got this:


ata4: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x2 next cpb idx 0x0
ata4: CPB 0: ctl_flags 0x1f, resp_flags 0x0
ata4: CPB 1: ctl_flags 0x1f, resp_flags 0x0
ata4: CPB 2: ctl_flags 0x1f, resp_flags 0x0
ata4: timeout waiting for ADMA IDLE, stat=0x400
ata4: timeout waiting for ADMA LEGACY, stat=0x400
ata4.00: exception Emask 0x0 SAct 0x7 SErr 0x0 action 0x2 frozen
ata4.00: cmd 61/08:00:e0:74:64/00:00:0a:00:00/40 tag 0 ncq 4096 out
 res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata4.00: status: { DRDY }
ata4.00: cmd 61/08:08:30:5b:76/00:00:0c:00:00/40 tag 1 ncq 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata4.00: status: { DRDY }
ata4.00: cmd 61/01:10:ba:51:77/00:00:0c:00:00/40 tag 2 ncq 512 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata4.00: status: { DRDY }
ata4: soft resetting link
ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata4.00: configured for UDMA/133
ata4: EH complete
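
The "cmd" lines in a report like the one above can be picked apart mechanically. The following is a minimal sketch, assuming the usual libata print order (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and the NCQ convention that the sector count travels in the feature fields and the tag in bits 7:3 of the count field. The struct and function names are made up for illustration and are not part of libata.

```c
#include <stdio.h>
#include <stdint.h>

/* Fields of one "cmd" line from a libata error report, in print order.
   Names are illustrative, not libata's own. */
struct tf_fields {
    unsigned cmd, feature, nsect, lbal, lbam, lbah;
    unsigned hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah;
    unsigned device;
};

/* Parse e.g. "61/08:00:e0:74:64/00:00:0a:00:00/40"; returns 0 on success. */
int parse_tf(const char *s, struct tf_fields *t)
{
    int n = sscanf(s, "%x/%x:%x:%x:%x:%x/%x:%x:%x:%x:%x/%x",
                   &t->cmd, &t->feature, &t->nsect,
                   &t->lbal, &t->lbam, &t->lbah,
                   &t->hob_feature, &t->hob_nsect,
                   &t->hob_lbal, &t->hob_lbam, &t->hob_lbah,
                   &t->device);
    return n == 12 ? 0 : -1;
}

/* 48-bit LBA assembled from the low and high-order LBA bytes. */
uint64_t tf_lba(const struct tf_fields *t)
{
    return ((uint64_t)t->hob_lbah << 40) | ((uint64_t)t->hob_lbam << 32) |
           ((uint64_t)t->hob_lbal << 24) | ((uint64_t)t->lbah << 16) |
           ((uint64_t)t->lbam << 8) | (uint64_t)t->lbal;
}

/* For NCQ (FPDMA) commands the sector count is in the feature fields. */
unsigned tf_ncq_sectors(const struct tf_fields *t)
{
    return (t->hob_feature << 8) | t->feature;
}

/* For NCQ (FPDMA) commands the tag is in bits 7:3 of the count field. */
unsigned tf_ncq_tag(const struct tf_fields *t)
{
    return t->nsect >> 3;
}
```

Fed the first line above, "61/08:00:e0:74:64/00:00:0a:00:00/40", this gives LBA 0x0a6474e0, 8 sectors (4096 bytes, agreeing with "ncq 4096 out") and tag 0. All three timed-out commands are NCQ writes (0x61).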

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: sata_nv + ADMA + Samsung disk problem

2008-01-11 Thread Gabor Gombas
On Mon, Jan 07, 2008 at 06:10:29PM -0600, Robert Hancock wrote:

> Gabor, I just noticed you said that it worked OK in 2.6.20, yet 2.6.22  
> fails. 2.6.20 had ADMA support as well, so I wonder what change started  
> causing the problem. Would it be possible for you to do a git bisect (or  
> at least try 2.6.21 to try and narrow it down)?

I've now booted 2.6.21.7, we'll see. The problem with the bisection is
that I can't explicitly trigger the bug so I can't say for sure if a
kernel is good or it just needs more time to trigger. The average
uptime of this machine is just a couple hours a day.

For example, with 2.6.24-rc6 it took over 3 hours for the first disk to
trigger the bug and the second disk needed more than 7 hours. This
machine is seldom turned on for that long.

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -


Re: sata_nv + ADMA + Samsung disk problem

2008-01-07 Thread Robert Hancock

Allen Martin wrote:
>>> Dunno about the NVidia version.
>>
>> Theirs works rather differently - the GO bit is there, but there's
>> another append register which is used to tell the controller that a new
>> tag has been added to the CPB list.
>>
>> The only thing we currently use the GO bit for is to switch between ADMA
>> and port register mode. Could be there's something we need to do there,
>> though, who knows..
>
> You shouldn't ever need to touch GO other than the ADMA / legacy mode
> switch as you say.
>
> The NVIDIA ADMA hw is not based on the Pacific Digital core.


Gabor, I just noticed you said that it worked OK in 2.6.20, yet 2.6.22 
fails. 2.6.20 had ADMA support as well, so I wonder what change started 
causing the problem. Would it be possible for you to do a git bisect (or 
at least try 2.6.21 to try and narrow it down)?



Re: sata_nv + ADMA + Samsung disk problem

2008-01-03 Thread Benjamin Herrenschmidt

On Thu, 2008-01-03 at 19:43 -0600, Robert Hancock wrote:
> Benjamin Herrenschmidt wrote:
> >> Another thing about the PacDigi core:  one has to be very careful
> >> to avoid sequential accesses to sequential PCI locations when
> >> programming the chip -- it cannot handle merged register writes.
> >>
> >> So for any group of sequentially laid out registers, the code has
> >> to ensure it never writes two adjacent registers in sequence..
> > 
> > Ugh ? Write combining isn't permitted on normal registers afaik...
> > 
> > Ben.
> 
> Byte merging can be done by the chipset on MMIO writes (merging multiple 
> 8 or 16-bit writes into a single 32-bit cycle).

That is true, if they are consecutive. You mean that this HW is f*cked
up enough to actually have separate 8/16 bits registers that are
contiguous ? Yuck... I'm afraid you -have- to add reads in between to
guarantee that no merging will occur.

Cheers,

Ben.
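
The read-in-between idea can be sketched as follows. This is a userspace illustration only: the register block is a hypothetical plain struct, and the volatile accesses stand in for what would be writew()/readw() on a mapped BAR in a driver. On real MMIO, the read has to complete on the bus before the next write is issued, so the two writes cannot be merged into one wider cycle.

```c
#include <stdint.h>

/* Hypothetical register block: two adjacent 16-bit registers. */
struct fake_regs {
    volatile uint16_t ctl;
    volatile uint16_t count;
};

/* Write one register, then read it back so the write is finished
   before the caller goes on to touch the neighbouring register. */
static inline uint16_t write16_readback(volatile uint16_t *reg, uint16_t val)
{
    *reg = val;
    return *reg;
}

/* Program both registers with a read separating the two writes. */
void program_regs(struct fake_regs *r, uint16_t ctl, uint16_t count)
{
    (void)write16_readback(&r->ctl, ctl);
    (void)write16_readback(&r->count, count);
}
```

The cost is one extra bus read per register write, which is usually acceptable for setup paths but worth avoiding on the per-command path.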




Re: sata_nv + ADMA + Samsung disk problem

2008-01-03 Thread Robert Hancock

Allen Martin wrote:
>>> Dunno about the NVidia version.
>>
>> Theirs works rather differently - the GO bit is there, but there's
>> another append register which is used to tell the controller that a new
>> tag has been added to the CPB list.
>>
>> The only thing we currently use the GO bit for is to switch between ADMA
>> and port register mode. Could be there's something we need to do there,
>> though, who knows..
>
> You shouldn't ever need to touch GO other than the ADMA / legacy mode
> switch as you say.
>
> The NVIDIA ADMA hw is not based on the Pacific Digital core.


That answers that question, I guess. Still guessing at why the 
controller would get stuck in IDLE state with no interrupt raised, then..



Re: sata_nv + ADMA + Samsung disk problem

2008-01-03 Thread Robert Hancock

Benjamin Herrenschmidt wrote:
>> Another thing about the PacDigi core:  one has to be very careful
>> to avoid sequential accesses to sequential PCI locations when
>> programming the chip -- it cannot handle merged register writes.
>>
>> So for any group of sequentially laid out registers, the code has
>> to ensure it never writes two adjacent registers in sequence..
>
> Ugh ? Write combining isn't permitted on normal registers afaik...
>
> Ben.


Byte merging can be done by the chipset on MMIO writes (merging multiple 
8 or 16-bit writes into a single 32-bit cycle).



RE: sata_nv + ADMA + Samsung disk problem

2008-01-03 Thread Allen Martin
 
> > Dunno about the NVidia version.
>
> Theirs works rather differently - the GO bit is there, but there's
> another append register which is used to tell the controller that a new
> tag has been added to the CPB list.
>
> The only thing we currently use the GO bit for is to switch between ADMA
> and port register mode. Could be there's something we need to do there,
> though, who knows..
> 

You shouldn't ever need to touch GO other than the ADMA / legacy mode
switch as you say.

The NVIDIA ADMA hw is not based on the Pacific Digital core.
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: sata_nv + ADMA + Samsung disk problem

2008-01-03 Thread Benjamin Herrenschmidt

> Another thing about the PacDigi core:  one has to be very careful
> to avoid sequential accesses to sequential PCI locations when
> programming the chip -- it cannot handle merged register writes.
> 
> So for any group of sequentially laid out registers, the code has
> to ensure it never writes two adjacent registers in sequence..

Ugh ? Write combining isn't permitted on normal registers afaik...

Ben.




Re: sata_nv + ADMA + Samsung disk problem

2008-01-03 Thread Mark Lord

Mark Lord wrote:
> Robert Hancock wrote:
>> Mark Lord wrote:
>>> Robert Hancock wrote:
>>> ..
>>>> From some of the traces I took previously (posted on LKML as
>>>> "sata_nv ADMA controller lockup investigation" way back in Feb 07),
>>>> what seems to occur is that when the second command is issued very
>>>> rapidly (within less than 20 microseconds, or potentially longer)
>>>> after the previous command's completion, the ADMA status changes from
>>>> 0x500 (STOPPED and IDLE) to 0x400 (just IDLE) as it typically does,
>>>> but then it sticks there, no interrupt is ever raised, and CPB
>>>> response flags remain at 0.
>>> ..
>>>
>>> Assuming that NVidia got their ADMA core logic from Pacific Digital
>>> (the inventors), then it may have some of the same bugs as the original.
>>>
>>> One of those bugs is that the aGO trigger is sampled in a "racey" way,
>>> such that it sometimes may miss a recent addition to the ring.
>>>
>>> The *only* way to guarantee things with the original Pacific Digital core
>>> was to (1) always retrigger aGO for a full ring scan with each new
>>> addition, and (2) poll periodically (every half second or so) rather
>>> than relying exclusively on the IRQ actually working..
>>>
>>> Dunno about the NVidia version.
>>
>> Theirs works rather differently - the GO bit is there, but there's another
>> append register which is used to tell the controller that a new tag has
>> been added to the CPB list.
> ..
>
> The PacDigi core uses a "search count" register for that purpose,
> but the buggy nature of the core required that it always be set
> to "2 * ring_size" to ensure nothing got missed.
>
> Here's some comments from the original ADMA driver.
> Maybe something from here might help with the NV stuff, too.
>
>    // There is a chance that the chip will skip over a CPB if a SERVICE
>    // interrupt occurs while it's reading the CPB header.  This won't cause
>    // us to get stuck anywhere, but it might slow down execution of the new
>    // CPB if it has to wait for the next time we hit aGO.  So.. Dxxx/Dxxx
>    // suggest that all we need to do is tell the chip to do two passes
>    // around the ring from an aGO instead of one pass, so that it will find
>    // the "missed" CPB on the second pass.  This isn't as bad as it first
>    // looks.
>    //
>    writew(channel->num_cpbs * 2, &adma_regs->cpb_search_count);
>
> Or again, the NV stuff may be completely different (?).

..

Another thing about the PacDigi core:  one has to be very careful
to avoid sequential accesses to sequential PCI locations when
programming the chip -- it cannot handle merged register writes.

So for any group of sequentially laid out registers, the code has
to ensure it never writes two adjacent registers in sequence..

-ml


Re: sata_nv + ADMA + Samsung disk problem

2008-01-03 Thread Mark Lord

Robert Hancock wrote:
> Mark Lord wrote:
>> Robert Hancock wrote:
>> ..
>>> From some of the traces I took previously (posted on LKML as
>>> "sata_nv ADMA controller lockup investigation" way back in Feb 07),
>>> what seems to occur is that when the second command is issued very
>>> rapidly (within less than 20 microseconds, or potentially longer)
>>> after the previous command's completion, the ADMA status changes from
>>> 0x500 (STOPPED and IDLE) to 0x400 (just IDLE) as it typically does,
>>> but then it sticks there, no interrupt is ever raised, and CPB
>>> response flags remain at 0.
>> ..
>>
>> Assuming that NVidia got their ADMA core logic from Pacific Digital
>> (the inventors), then it may have some of the same bugs as the original.
>>
>> One of those bugs is that the aGO trigger is sampled in a "racey" way,
>> such that it sometimes may miss a recent addition to the ring.
>>
>> The *only* way to guarantee things with the original Pacific Digital core
>> was to (1) always retrigger aGO for a full ring scan with each new
>> addition, and (2) poll periodically (every half second or so) rather
>> than relying exclusively on the IRQ actually working..
>>
>> Dunno about the NVidia version.
>
> Theirs works rather differently - the GO bit is there, but there's
> another append register which is used to tell the controller that a new
> tag has been added to the CPB list.

..

The PacDigi core uses a "search count" register for that purpose,
but the buggy nature of the core required that it always be set
to "2 * ring_size" to ensure nothing got missed.

Here's some comments from the original ADMA driver.
Maybe something from here might help with the NV stuff, too.

   // There is a chance that the chip will skip over a CPB if a SERVICE
   // interrupt occurs while it's reading the CPB header.  This won't cause
   // us to get stuck anywhere, but it might slow down execution of the new
   // CPB if it has to wait for the next time we hit aGO.  So.. Dxxx/Dxxx
   // suggest that all we need to do is tell the chip to do two passes
   // around the ring from an aGO instead of one pass, so that it will find
   // the "missed" CPB on the second pass.  This isn't as bad as it first
   // looks.
   //
   writew(channel->num_cpbs * 2, &adma_regs->cpb_search_count);

Or again, the NV stuff may be completely different (?).


Re: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Robert Hancock

Mark Lord wrote:
> Robert Hancock wrote:
> ..
>> From some of the traces I took previously (posted on LKML as "sata_nv
>> ADMA controller lockup investigation" way back in Feb 07), what seems
>> to occur is that when the second command is issued very rapidly
>> (within less than 20 microseconds, or potentially longer) after the
>> previous command's completion, the ADMA status changes from 0x500
>> (STOPPED and IDLE) to 0x400 (just IDLE) as it typically does, but then
>> it sticks there, no interrupt is ever raised, and CPB response flags
>> remain at 0.
> ..
>
> Assuming that NVidia got their ADMA core logic from Pacific Digital
> (the inventors), then it may have some of the same bugs as the original.
>
> One of those bugs is that the aGO trigger is sampled in a "racey" way,
> such that it sometimes may miss a recent addition to the ring.
>
> The *only* way to guarantee things with the original Pacific Digital core
> was to (1) always retrigger aGO for a full ring scan with each new
> addition, and (2) poll periodically (every half second or so) rather
> than relying exclusively on the IRQ actually working..
>
> Dunno about the NVidia version.


Theirs works rather differently - the GO bit is there, but there's 
another append register which is used to tell the controller that a new 
tag has been added to the CPB list.


The only thing we currently use the GO bit for is to switch between ADMA 
and port register mode. Could be there's something we need to do there, 
though, who knows..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Mark Lord

Robert Hancock wrote:
..
> From some of the traces I took previously (posted on LKML as "sata_nv
> ADMA controller lockup investigation" way back in Feb 07), what seems to
> occur is that when the second command is issued very rapidly (within
> less than 20 microseconds, or potentially longer) after the previous
> command's completion, the ADMA status changes from 0x500 (STOPPED and
> IDLE) to 0x400 (just IDLE) as it typically does, but then it sticks
> there, no interrupt is ever raised, and CPB response flags remain at 0.

..

Assuming that NVidia got their ADMA core logic from Pacific Digital
(the inventors), then it may have some of the same bugs as the original.

One of those bugs is that the aGO trigger is sampled in a "racey" way,
such that it sometimes may miss a recent addition to the ring.

The *only* way to guarantee things with the original Pacific Digital core
was to (1) always retrigger aGO for a full ring scan with each new addition,
and (2) poll periodically (every half second or so) rather than relying
exclusively on the IRQ actually working..

Dunno about the NVidia version.

Cheers
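
A minimal sketch of the polling half of that workaround, assuming a hypothetical "done" bit in each CPB's response flags (the real flag layout is not given in this thread). The point is that the same reaping routine can be called both from the interrupt handler and from a timer that fires every half second or so, so a missed interrupt delays a completion instead of losing it.

```c
#include <stdint.h>

#define RESP_DONE 0x01u   /* hypothetical "command finished" flag */

/* Stand-in for a CPB; only the field this sketch needs. */
struct cpb {
    uint8_t resp_flags;
};

/* Return a bitmask of tags that were pending and are now marked done.
   Safe to call from either the IRQ path or a periodic timer. */
uint32_t reap_done(const struct cpb *ring, unsigned n, uint32_t pending)
{
    uint32_t done = 0;
    unsigned tag;

    for (tag = 0; tag < n && tag < 32; tag++) {
        if ((pending & (1u << tag)) && (ring[tag].resp_flags & RESP_DONE))
            done |= 1u << tag;
    }
    return done;
}
```

The caller would clear the returned bits from its pending mask and complete those commands. Whether the NVIDIA hardware needs this at all is exactly the open question in this thread.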




Re: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Mark Lord

Robert Hancock wrote:
> What we're doing to enter legacy mode is essentially:
>
> -wait until ADMA status indicates IDLE bit set (max wait of 1 microsecond)
> -clear GO bit in control register
> -wait until status indicates LEGACY bit set (max wait of 1 microsecond)
>
> and to enter ADMA mode:
>
> -set GO bit in control register
> -wait until status indicates LEGACY bit cleared and IDLE bit set (max
>  wait of 1 microsecond)

..

If there are outstanding TCQ/NCQ commands (any drive),
then this could take (much) longer to enter legacy mode,
as the ADMA engine will wait for them all to finish.

But for normal, "nothing outstanding" mode, it should be very quick.

Cheers


Re: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Robert Hancock

Tejun Heo wrote:
> Robert Hancock wrote:
>> Jeff Garzik wrote:
>>> Tejun Heo wrote:
>>>> Thanks a lot for the detailed explanation.  Nvidia ppl, any ideas?
>>>> FLUSH is used regularly.  We really need to fix this.
>>>
>>> I reiterate my opinion :)  ...   We should remove ADMA support from
>>> sata_nv.  It's only in a few chips, it's not appearing in any new
>>> chips, and nasty problems have lingered since ADMA support was
>>> introduced.
>>>
>>> Definitely sounds like we should disable ADMA by default for
>>> 2.6.24-rc, too.
>>
>> I wouldn't agree.. It's only in a few chips (CK804/MCP04), but those
>> chips are very common in desktop, workstation, even some server
>> machines. Given the huge number of these chips out there, problem
>> reports have been quite rare.
>
> I agree with Jeff here.  Maybe not remove but disable it by default and
> when enabling warn loudly.  NCQ just doesn't do enough for its cost when
> the cost includes erratic behaviors.  Only a very small fraction of error
> cases actually make it to bugzilla or this mailing list.
>
> Nvidia gents, is there any way (be it NDA or whatever) to get Robert or
> any of us technical documentation?
>
> Thanks.


Last I heard, NVIDIA management gave the thumbs down to any more NDAs 
for ADMA documentation. It would be nice if they would reconsider. 
Apparently Jeff does have the docs, though..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Robert Hancock

Allen Martin wrote:
>> The software definitely provides that guarantee for all NCQ-capable
>> controllers.
>
> Well if that's not it, it must be some problem entering ADMA legacy
> mode.  Here's what the Windows driver does:
>
> ADMACtrl.aGO = 0
> ADMACtrl.aEIEN = 0
> poll {
>   until ADMAStatus.aLGCY = 1 || timeout
> }


What we're doing to enter legacy mode is essentially:

-wait until ADMA status indicates IDLE bit set (max wait of 1 microsecond)
-clear GO bit in control register
-wait until status indicates LEGACY bit set (max wait of 1 microsecond)

and to enter ADMA mode:

-set GO bit in control register
-wait until status indicates LEGACY bit cleared and IDLE bit set (max 
wait of 1 microsecond)
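
The two sequences above can be sketched against a simulated port. Everything here is an assumption for illustration: the bit values are made up (they are not the sata_nv register layout), the port is a plain struct whose status follows the control register after a configurable number of reads, and the wait is bounded by a poll count, not by the 1 microsecond the driver uses.

```c
#include <stdint.h>

/* Illustrative bit values only; not taken from sata_nv or any datasheet. */
enum {
    CTL_GO      = 1 << 0,
    STAT_IDLE   = 1 << 0,
    STAT_LEGACY = 1 << 1
};

/* Hypothetical stand-in for one port. */
struct fake_port {
    uint16_t ctl;
    uint16_t stat;
    int lag;   /* status reads still needed before a mode change shows up */
};

/* Simulated status read: after `lag` reads, status follows the GO bit. */
static uint16_t read_stat(struct fake_port *p)
{
    if (p->lag > 0)
        p->lag--;
    else if (p->ctl & CTL_GO)
        p->stat = STAT_IDLE;                 /* ADMA mode, engine idle */
    else
        p->stat = STAT_IDLE | STAT_LEGACY;   /* register (legacy) mode */
    return p->stat;
}

/* Poll until every bit in `set` is set and every bit in `clear` is clear. */
static int wait_stat(struct fake_port *p, uint16_t set, uint16_t clear,
                     int max_polls)
{
    while (max_polls-- > 0) {
        uint16_t s = read_stat(p);
        if ((s & set) == set && (s & clear) == 0)
            return 0;
    }
    return -1;   /* timed out */
}

/* Wait for IDLE, clear GO, wait for LEGACY. */
int enter_legacy(struct fake_port *p, int max_polls)
{
    if (wait_stat(p, STAT_IDLE, 0, max_polls))
        return -1;
    p->ctl = (uint16_t)(p->ctl & ~CTL_GO);
    return wait_stat(p, STAT_LEGACY, 0, max_polls);
}

/* Set GO, wait for LEGACY clear and IDLE set. */
int enter_adma(struct fake_port *p, int max_polls)
{
    p->ctl = (uint16_t)(p->ctl | CTL_GO);
    return wait_stat(p, STAT_IDLE, STAT_LEGACY, max_polls);
}
```

A port whose status never reports IDLE makes enter_legacy() fail at its first wait, which corresponds to the "timeout waiting for ADMA IDLE" message seen during error handling.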


The 1 microsecond timeout is pretty aggressive admittedly, but it 
apparently isn't being broken (the only timeouts when switching modes 
I've seen are during error handling after a command timeout has already 
occurred). What timeout value is the Windows driver using?


Also, I see you are clearing the aEIEN bit when in register mode, while 
we're not. Is that important/necessary?


Aside from all this though, in the case of NCQ writes followed by a 
cache flush, that sequence of commands won't put us into legacy mode at 
all since the cache flush is a no-data command which we should be able 
to handle in ADMA mode, from my understanding (correct me if I'm wrong). 
So I don't imagine legacy/ADMA mode switch could be the cause of this 
problem.


I also saw in my previous investigation that a flush immediately 
followed by a write could cause the write to time out as well.


From some of the traces I took previously (posted on LKML as "sata_nv 
ADMA controller lockup investigation" way back in Feb 07), what seems to 
occur is that when the second command is issued very rapidly (within 
less than 20 microseconds, or potentially longer) after the previous 
command's completion, the ADMA status changes from 0x500 (STOPPED and 
IDLE) to 0x400 (just IDLE) as it typically does, but then it sticks 
there, no interrupt is ever raised, and CPB response flags remain at 0.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/




RE: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Allen Martin
> The software definitely provides that guarantee for all NCQ-capable 
> controllers.
> 

Well if that's not it, it must be some problem entering ADMA legacy
mode.  Here's what the Windows driver does:


ADMACtrl.aGO = 0
ADMACtrl.aEIEN = 0
poll {
  until ADMAStatus.aLGCY = 1 || timeout
}


Re: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Jeff Garzik

Allen Martin wrote:
>> The question I had for NVIDIA regarding this that I never got answered
>> was, is there any reason why we would need a delay when switching
>> between NCQ and non-NCQ commands on ADMA, and if not, is there any known
>> cause that could cause the controller to get into this seemingly
>> locked-up state?
>
> When switching from NCQ to non NCQ or vice versa you must make sure all
> outstanding commands are completed before issuing the new command.  The
> hardware doesn't do anything to prevent queued and non queued commands
> from going out on the wire at the same time which will certainly cause
> some drives to fail.


The software definitely provides that guarantee for all NCQ-capable 
controllers.


Jeff





RE: sata_nv + ADMA + Samsung disk problem

2008-01-02 Thread Allen Martin

> The question I had for NVIDIA regarding this that I never got answered
> was, is there any reason why we would need a delay when switching
> between NCQ and non-NCQ commands on ADMA, and if not, is there any known
> cause that could cause the controller to get into this seemingly
> locked-up state?

When switching from NCQ to non NCQ or vice versa you must make sure all
outstanding commands are completed before issuing the new command.  The
hardware doesn't do anything to prevent queued and non queued commands
from going out on the wire at the same time which will certainly cause
some drives to fail.

-Allen
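
The rule stated above amounts to a small gate in front of command issue. The following is a sketch under stated assumptions: the struct and function names are hypothetical, queue-depth limits are ignored, and this is not the libata implementation, only an illustration of the ordering constraint.

```c
#include <stdbool.h>

/* Hypothetical per-port bookkeeping; names are illustrative. */
struct port_queue {
    unsigned inflight;     /* commands issued but not yet completed */
    bool inflight_ncq;     /* kind of the commands currently in flight */
};

/* True if a command of the given kind may be issued right now. */
bool can_issue(const struct port_queue *q, bool is_ncq)
{
    if (q->inflight == 0)
        return true;       /* idle port: either kind may go */
    if (!q->inflight_ncq)
        return false;      /* non-queued commands run one at a time */
    return is_ncq;         /* NCQ may join NCQ; non-NCQ must wait */
}

/* Record an issued command; returns false if it has to be deferred. */
bool issue(struct port_queue *q, bool is_ncq)
{
    if (!can_issue(q, is_ncq))
        return false;
    q->inflight++;
    q->inflight_ncq = is_ncq;
    return true;
}

/* Record one completion. */
void complete(struct port_queue *q)
{
    if (q->inflight > 0)
        q->inflight--;
}
```

With two NCQ writes in flight, a cache flush is refused until both have completed, and a following NCQ write is refused until the flush itself completes.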


Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Tejun Heo
Robert Hancock wrote:
> Jeff Garzik wrote:
>> Tejun Heo wrote:
>>> Thanks a lot for the detailed explanation.  Nvidia ppl, any ideas?
>>> FLUSH is used regularly.  We really need to fix this.
>>
>>
>> I reiterate my opinion :)  ...   We should remove ADMA support from
>> sata_nv.  It's only in a few chips, it's not appearing in any new
>> chips, and nasty problems have lingered since ADMA support was
>> introduced.
>>
>> Definitely sounds like we should disable ADMA by default for
>> 2.6.24-rc, too.
> 
> I wouldn't agree.. It's only in a few chips (CK804/MCP04), but those
> chips are very common in desktop, workstation, even some server
> machines. Given the huge number of these chips out there, problem
> reports have been quite rare.

I agree with Jeff here.  Maybe not remove but disable it by default and
when enabling warn loudly.  NCQ just doesn't do enough for its cost when
the cost includes erratic behaviors.  Only a very small fraction of error
cases actually make it to bugzilla or this mailing list.

Nvidia gents, is there any way (be it NDA or whatever) to get Robert or
any of us technical documentation?

Thanks.

-- 
tejun


Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Robert Hancock

Jeff Garzik wrote:
> Tejun Heo wrote:
>> Thanks a lot for the detailed explanation.  Nvidia ppl, any ideas?
>> FLUSH is used regularly.  We really need to fix this.
>
> I reiterate my opinion :)  ...   We should remove ADMA support from
> sata_nv.  It's only in a few chips, it's not appearing in any new chips,
> and nasty problems have lingered since ADMA support was introduced.
>
> Definitely sounds like we should disable ADMA by default for 2.6.24-rc,
> too.


I wouldn't agree.. It's only in a few chips (CK804/MCP04), but those 
chips are very common in desktop, workstation, even some server 
machines. Given the huge number of these chips out there, problem 
reports have been quite rare.



Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Jeff Garzik

Tejun Heo wrote:
> Thanks a lot for the detailed explanation.  Nvidia ppl, any ideas?
> FLUSH is used regularly.  We really need to fix this.



I reiterate my opinion :)  ...   We should remove ADMA support from 
sata_nv.  It's only in a few chips, it's not appearing in any new chips, 
and nasty problems have lingered since ADMA support was introduced.


Definitely sounds like we should disable ADMA by default for 2.6.24-rc, too.

Jeff




Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Tejun Heo
Robert Hancock wrote:
>> This is kind of a longstanding problem which has been partially worked
>> around, but it seems not entirely. This is what I had diagnosed some
>> time ago:
>>
>> "recently, some issues cropped up with command timeouts when a cache
>> flush command was immediately followed by an NCQ write. In this case,
>> sometimes when the NCQ write was issued, the status register changed
>> from 0x500 (Stopped and Idle) to 0x400 (Stopped) as it normally
>> appears to, however it seems like the controller would get hung in
>> that state, and we would time out with no notifiers set, the gen_ctl
>> register not indicating interrupt status, and the CPB response flags
>> still 0 as we left them, seemingly indicating the controller hasn't
>> done anything with it. Then, when the error handler kicks in we clear
>> the GO bit to put it back into register mode, but the Legacy flag in
>> the status register doesn't get set (or at least it takes longer than
>> 1 microsecond). Finally when we do an ADMA channel reset that seems to
>> get it responding again, until this happens the next time.
>>
>>  From some experimentation, I found that when we are issuing a NCQ
>> command when the last command was non-NCQ, or vice versa, if I added in
>> a delay of 20 microseconds between setting up the CPB and writing to the
>> append register, the problem appeared to go away. Problem is I don't
>> know if that's because it actually needs this delay, or because it
>> changes the timing so that it happens to work even though we're doing
>> something wrong, there's some event we're not waiting for, etc.
>>
>> I've now verified that no switches between ADMA and register mode
>> occur near the time of these timeouts. Neither are we reading or
>> writing any of the ATA shadow registers while we're in ADMA mode."
>>
>> It seems likely that this is what is happening here (a switch from an
>> NCQ command to a non-NCQ command, then the non-NCQ times out). It
>> could be in some cases the 20 microsecond delay is not enough. But it
>> seems bogus that we should need such an arbitrary delay in the first
>> place.
>>
>> The question I had for NVIDIA regarding this that I never got answered
>> was, is there any reason why we would need a delay when switching
>> between NCQ and non-NCQ commands on ADMA, and if not, is there any
>> known cause that could cause the controller to get into this seemingly
>> locked-up state?
> 
> Well, I guess I did sort of get an answer, but the only theory was that
> the flush and the NCQ commands were being overlapped, which shouldn't be
> possible (the libata core guarantees that, and if it didn't work it
> would affect all controllers).
> 
> I'm kind of wondering if there's something funny going on with the
> notifier register stuff, which is supposed to tell us what commands have
> completed. We don't really use it at all (we had some problems with
> missed completions, etc. when I tried using it, also it doesn't work if
> ATAPI is enabled on the other port on the controller, apparently). I
> know these controllers will do strange things like not signalling
> interrupts for later events if you don't clear the notifiers in just the
> right way (that being mostly determined by trial and error).
> 
> Or, maybe somehow the flush is getting issued before the controller is
> really "ready" for it somehow (it's not finished cleaning up after
> preceding NCQ command).
> 
> It's pretty hard for me to figure out which of the above might be the
> case, especially without access to the detailed controller documentation..

Thanks a lot for the detailed explanation.  Nvidia ppl, any ideas?
FLUSH is used regularly.  We really need to fix this.

-- 
tejun


Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Robert Hancock

Robert Hancock wrote:
> Tejun Heo wrote:
>> [cc'ing Robert Hancock and NVidia people]
>>
>> Whole thread can be read from the following URL.
>>
>>   http://thread.gmane.org/gmane.linux.ide/21710
>>
>> In a nutshell, with ADMA enabled, FLUSH_EXT occasionally times out.  I
>> first suspected faulty disk (reallocation failure on flush) but SMART
>> reports nothing suspicious and w/ ADMA disabled, the drive works just fine.
>>
>> On a side note, on 2.6.22.1, SMART fails from time to time but the
>> problem went away on 2.6.24-rc6.  This was apparently fixed during that
>> period.  I guess we can ignore this for now.
>>
>> Thanks.
>
> This is kind of a longstanding problem which has been partially worked
> around, but it seems not entirely. This is what I had diagnosed some
> time ago:
>
> "recently, some issues cropped up with command timeouts when a cache
> flush command was immediately followed by an NCQ write. In this case,
> sometimes when the NCQ write was issued, the status register changed
> from 0x500 (Stopped and Idle) to 0x400 (Stopped) as it normally appears
> to, however it seems like the controller would get hung in that state,
> and we would time out with no notifiers set, the gen_ctl register not
> indicating interrupt status, and the CPB response flags still 0 as we
> left them, seemingly indicating the controller hasn't done anything with
> it. Then, when the error handler kicks in we clear the GO bit to put it
> back into register mode, but the Legacy flag in the status register
> doesn't get set (or at least it takes longer than 1 microsecond).
> Finally when we do an ADMA channel reset that seems to get it responding
> again, until this happens the next time.
>
> From some experimentation, I found that when we are issuing a NCQ
> command when the last command was non-NCQ, or vice versa, if I added in
> a delay of 20 microseconds between setting up the CPB and writing to the
> append register, the problem appeared to go away. Problem is I don't
> know if that's because it actually needs this delay, or because it
> changes the timing so that it happens to work even though we're doing
> something wrong, there's some event we're not waiting for, etc.
>
> I've now verified that no switches between ADMA and register mode occur
> near the time of these timeouts. Neither are we reading or writing any
> of the ATA shadow registers while we're in ADMA mode."
>
> It seems likely that this is what is happening here (a switch from an
> NCQ command to a non-NCQ command, then the non-NCQ times out). It could
> be in some cases the 20 microsecond delay is not enough. But it seems
> bogus that we should need such an arbitrary delay in the first place.
>
> The question I had for NVIDIA regarding this that I never got answered
> was, is there any reason why we would need a delay when switching
> between NCQ and non-NCQ commands on ADMA, and if not, is there any known
> cause that could cause the controller to get into this seemingly
> locked-up state?

Well, I guess I did sort of get an answer, but the only theory was that 
the flush and the NCQ commands were being overlapped, which shouldn't be 
possible (the libata core guarantees that, and if it didn't work it 
would affect all controllers).


I'm kind of wondering if there's something funny going on with the 
notifier register stuff, which is supposed to tell us what commands have 
completed. We don't really use it at all (we had some problems with 
missed completions, etc. when I tried using it, also it doesn't work if 
ATAPI is enabled on the other port on the controller, apparently). I 
know these controllers will do strange things like not signalling 
interrupts for later events if you don't clear the notifiers in just the 
right way (that being mostly determined by trial and error).


Or, maybe somehow the flush is getting issued before the controller is 
really "ready" for it somehow (it's not finished cleaning up after 
preceding NCQ command).


It's pretty hard for me to figure out which of the above might be the 
case, especially without access to the detailed controller documentation..
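
To make the workaround discussed above concrete, here is a minimal user-space sketch of the decision it describes: wait 20 microseconds between setting up the CPB and writing the append register, but only when the command class changes between NCQ and non-NCQ. This is not the sata_nv code. The struct, the function name and the choice to skip the delay for the very first command are assumptions made for illustration, and a real driver would perform the wait with a kernel delay primitive rather than return a number.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-port bookkeeping; not the driver's real structure. */
struct port_state {
    bool have_last;     /* false until the first command has been issued */
    bool last_was_ncq;  /* class of the previously issued command */
};

/* Delay quoted in the thread for an NCQ <-> non-NCQ switch. */
enum { SWITCH_DELAY_US = 20 };

/*
 * Return how many microseconds to wait between filling in the CPB and
 * writing the append register for the next command, then remember the
 * command's class for the following call.
 * Assumption: the first command on a port needs no delay.
 */
static unsigned int delay_before_append(struct port_state *ps, bool is_ncq)
{
    unsigned int delay_us = 0;

    if (ps->have_last && ps->last_was_ncq != is_ncq)
        delay_us = SWITCH_DELAY_US;

    ps->have_last = true;
    ps->last_was_ncq = is_ncq;
    return delay_us;
}
```

With this bookkeeping, a flush issued right after an NCQ write gets the 20 microsecond gap, and so does the first NCQ write after the flush, while back-to-back commands of the same class are issued without waiting.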



Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Robert Hancock

Tejun Heo wrote:

> [cc'ing Robert Hancock and NVidia people]
>
> Whole thread can be read from the following URL.
>
>   http://thread.gmane.org/gmane.linux.ide/21710
>
> In a nutshell, with ADMA enabled, FLUSH_EXT occasionally times out.  I
> first suspected faulty disk (reallocation failure on flush) but SMART
> reports nothing suspicious and w/ ADMA disabled, the drive works just fine.
>
> On a side note, on 2.6.22.1, SMART fails from time to time but the
> problem went away on 2.6.24-rc6.  This was apparently fixed during that
> period.  I guess we can ignore this for now.
>
> Thanks.


This is kind of a longstanding problem which has been partially worked 
around, but it seems not entirely. This is what I had diagnosed some 
time ago:


"recently, some issues cropped up with command timeouts when a cache 
flush command was immediately followed by an NCQ write. In this case, 
sometimes when the NCQ write was issued, the status register changed 
from 0x500 (Stopped and Idle) to 0x400 (Stopped) as it normally appears 
to, however it seems like the controller would get hung in that state, 
and we would time out with no notifiers set, the gen_ctl register not 
indicating interrupt status, and the CPB response flags still 0 as we 
left them, seemingly indicating the controller hasn't done anything with 
it. Then, when the error handler kicks in we clear the GO bit to put it 
back into register mode, but the Legacy flag in the status register 
doesn't get set (or at least it takes longer than 1 microsecond). 
Finally when we do an ADMA channel reset that seems to get it responding 
again, until this happens the next time.


From some experimentation, I found that when we are issuing a NCQ
command when the last command was non-NCQ, or vice versa, if I added in
a delay of 20 microseconds between setting up the CPB and writing to the
append register, the problem appeared to go away. Problem is I don't
know if that's because it actually needs this delay, or because it
changes the timing so that it happens to work even though we're doing
something wrong, there's some event we're not waiting for, etc.

I've now verified that no switches between ADMA and register mode occur 
near the time of these timeouts. Neither are we reading or writing any 
of the ATA shadow registers while we're in ADMA mode."


It seems likely that this is what is happening here (a switch from an 
NCQ command to a non-NCQ command, then the non-NCQ times out). It could 
be in some cases the 20 microsecond delay is not enough. But it seems 
bogus that we should need such an arbitrary delay in the first place.


The question I had for NVIDIA regarding this that I never got answered 
was, is there any reason why we would need a delay when switching 
between NCQ and non-NCQ commands on ADMA, and if not, is there any known 
cause that could cause the controller to get into this seemingly 
locked-up state?


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Tejun Heo
[cc'ing Robert Hancock and NVidia people]

Whole thread can be read from the following URL.

  http://thread.gmane.org/gmane.linux.ide/21710

In a nutshell, with ADMA enabled, FLUSH_EXT occasionally times out.  I
first suspected faulty disk (reallocation failure on flush) but SMART
reports nothing suspicious and w/ ADMA disabled, the drive works just fine.

On a side note, on 2.6.22.1, SMART fails from time to time but the
problem went away on 2.6.24-rc6.  This was apparently fixed during that
period.  I guess we can ignore this for now.

Thanks.

Gabor Gombas wrote:
> Hi,
> 
> Just FYI I've tried to enable ADMA again (now running 2.6.24-rc6) but
> the bug is still present:
> 
> Jan  1 16:11:21 host kernel: ata7: EH in ADMA mode, notifier 0x0 
> notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb 
> idx 0x0
> Jan  1 16:11:21 host kernel: ata7: CPB 0: ctl_flags 0x9, resp_flags 0x0
> Jan  1 16:11:21 host kernel: ata7: timeout waiting for ADMA IDLE, stat=0x400
> Jan  1 16:11:21 host kernel: ata7: timeout waiting for ADMA LEGACY, stat=0x400
> Jan  1 16:11:21 host kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 
> action 0x2 frozen
> Jan  1 16:11:21 host kernel: ata7.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 
> tag 0
> Jan  1 16:11:21 host kernel:  res 40/00:00:00:4f:c2/00:00:00:00:00/00 
> Emask 0x4 (timeout)
> Jan  1 16:11:21 host kernel: ata7.00: status: { DRDY }
> Jan  1 16:11:21 host kernel: ata7: soft resetting link
> Jan  1 16:11:22 host kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 
> SControl 300)
> Jan  1 16:11:22 host kernel: ata7.00: configured for UDMA/133
> Jan  1 16:11:22 host kernel: ata7: EH complete
> Jan  1 16:11:22 host kernel: sd 6:0:0:0: [sdc] 488397168 512-byte hardware 
> sectors (250059 MB)
> Jan  1 16:11:22 host kernel: sd 6:0:0:0: [sdc] Write Protect is off
> Jan  1 16:11:22 host kernel: sd 6:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> Jan  1 16:11:22 host kernel: sd 6:0:0:0: [sdc] Write cache: enabled, read 
> cache: enabled, doesn't support DPO or FUA
> 
> Although this time the above happened more than 3 hours after boot
> which is much better than 2.6.22 was. In the past ~4 months ADMA was
> disabled and I never had any libata-related error messages.
> 
> SMART does not show anything interesting:
> 
> smartctl version 5.37 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce 
> Allen
> Home page is http://smartmontools.sourceforge.net/
> 
> === START OF INFORMATION SECTION ===
> Model Family: SAMSUNG SpinPoint P120 series
> Device Model: SAMSUNG SP2504C
> Serial Number:XX
> Firmware Version: VT100-33
> User Capacity:250,059,350,016 bytes
> Device is:In smartctl database [for details use: -P show]
> ATA Version is:   7
> ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
> Local Time is:Tue Jan  1 17:38:21 2008 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x82)   Offline data collection activity
>   was completed without error.
>   Auto Offline Data Collection: Enabled.
> Self-test execution status:  (   0)   The previous self-test routine 
> completed
>   without error or no self-test has ever 
>   been run.
> Total time to complete Offline 
> data collection:   (4867) seconds.
> Offline data collection
> capabilities:  (0x5b) SMART execute Offline immediate.
>   Auto Offline data collection on/off 
> support.
>   Suspend Offline collection upon new
>   command.
>   Offline surface scan supported.
>   Self-test supported.
>   No Conveyance Self-test supported.
>   Selective Self-test supported.
> SMART capabilities:(0x0003)   Saves SMART data before entering
>   power-saving mode.
>   Supports SMART auto save timer.
> Error logging capability:(0x01)   Error logging supported.
>   General Purpose Logging supported.
> Short self-test routine 
> recommended polling time:  (   1) minutes.
> Extended self-test routine
> recommended polling time:  (  81) minutes.
> 
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME  FLAG VALUE WORST THRESH TYPE  UPDATED  
> WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate 0x000f   100   10

Re: sata_nv + ADMA + Samsung disk problem

2008-01-01 Thread Gabor Gombas
Hi,

Just FYI I've tried to enable ADMA again (now running 2.6.24-rc6) but
the bug is still present:

Jan  1 16:11:21 host kernel: ata7: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb idx 0x0
Jan  1 16:11:21 host kernel: ata7: CPB 0: ctl_flags 0x9, resp_flags 0x0
Jan  1 16:11:21 host kernel: ata7: timeout waiting for ADMA IDLE, stat=0x400
Jan  1 16:11:21 host kernel: ata7: timeout waiting for ADMA LEGACY, stat=0x400
Jan  1 16:11:21 host kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Jan  1 16:11:21 host kernel: ata7.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jan  1 16:11:21 host kernel:  res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  1 16:11:21 host kernel: ata7.00: status: { DRDY }
Jan  1 16:11:21 host kernel: ata7: soft resetting link
Jan  1 16:11:22 host kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan  1 16:11:22 host kernel: ata7.00: configured for UDMA/133
Jan  1 16:11:22 host kernel: ata7: EH complete
Jan  1 16:11:22 host kernel: sd 6:0:0:0: [sdc] 488397168 512-byte hardware sectors (250059 MB)
Jan  1 16:11:22 host kernel: sd 6:0:0:0: [sdc] Write Protect is off
Jan  1 16:11:22 host kernel: sd 6:0:0:0: [sdc] Mode Sense: 00 3a 00 00
Jan  1 16:11:22 host kernel: sd 6:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Although this time the above happened more than 3 hours after boot
which is much better than 2.6.22 was. In the past ~4 months ADMA was
disabled and I never had any libata-related error messages.
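
For readers trying to interpret the stat=0x400 values in the log, here is a small sketch of how the ADMA status word could be decoded. The Idle and Stopped bit values are inferred from the description elsewhere in this thread (0x500 is "Stopped and Idle", 0x400 is "Stopped"). The Legacy bit value is an assumption, since the thread mentions the flag without giving its value, and all names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Inferred from the thread: 0x500 = Stopped and Idle, 0x400 = Stopped. */
#define ADMA_STAT_IDLE    0x100u
#define ADMA_STAT_STOPPED 0x400u
/* Assumption: Legacy is the bit in between; the thread gives no value. */
#define ADMA_STAT_LEGACY  0x200u

static bool adma_idle(unsigned int stat)
{
    return (stat & ADMA_STAT_IDLE) != 0;
}

static bool adma_legacy(unsigned int stat)
{
    return (stat & ADMA_STAT_LEGACY) != 0;
}

static bool adma_stopped(unsigned int stat)
{
    return (stat & ADMA_STAT_STOPPED) != 0;
}

/*
 * True for the pattern seen in the log: the engine reports Stopped,
 * but it is neither Idle nor back in Legacy (register) mode.
 */
static bool adma_looks_hung(unsigned int stat)
{
    return adma_stopped(stat) && !adma_idle(stat) && !adma_legacy(stat);
}
```

Under these assumptions, 0x400 matches the hung pattern behind both "timeout waiting for ADMA IDLE" and "timeout waiting for ADMA LEGACY", while 0x500 does not.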

SMART does not show anything interesting:

smartctl version 5.37 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce 
Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint P120 series
Device Model:     SAMSUNG SP2504C
Serial Number:    XX
Firmware Version: VT100-33
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Tue Jan  1 17:38:21 2008 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status:  (   0) The previous self-test routine completed
without error or no self-test has ever 
been run.
Total time to complete Offline 
data collection: (4867) seconds.
Offline data collection
capabilities:(0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off 
support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:(0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability:(0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine 
recommended polling time:(   1) minutes.
Extended self-test routine
recommended polling time:(  81) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always       -       6144
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1218
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       11363
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       3325
 10 Spin_Retry_Count        0x0033   253   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   253   002   000    Old_age

Re: sata_nv + ADMA + Samsung disk problem

2007-08-16 Thread Jim Paris
Gabor Gombas wrote:
> On Tue, Aug 14, 2007 at 06:30:28PM +0900, Tejun Heo wrote:
> > Hmmm... That's timeout on cache flush, indicative of failing disk.
> > Please post the result of 'smartctl -a /dev/sdc'.
> 
> Ok, so something is fishy in 2.6.22 wrt. SMART.

See http://lkml.org/lkml/2007/7/8/198

-jim


Re: sata_nv + ADMA + Samsung disk problem

2007-08-16 Thread Gabor Gombas
Hi,

On Tue, Aug 14, 2007 at 06:30:28PM +0900, Tejun Heo wrote:

> Hmmm... That's timeout on cache flush, indicative of failing disk.
> Please post the result of 'smartctl -a /dev/sdc'.

Ok, so something is fishy in 2.6.22 wrt. SMART.

First, booting back to 2.6.20.5 I confirmed that SMART works without any
problems for all 4 disks, so all the following is a regression in
2.6.22.

I have 4 disks: two Maxtors (hdparm -I output below): sda/sdb, and two
Samsung (hdparm -I output is in my previous mail): sdc/sdd.

< cut >
/dev/sda:

ATA device, with non-removable media
Model Number:   Maxtor 6B250S0  
Serial Number:  
Firmware Revision:  BANC1G10
Standards:
Used: ATA/ATAPI-7 T13 1532D revision 0 
Supported: 7 6 5 4 
Configuration:
Logical max current
cylinders   16383   16383
heads   16  16
sectors/track   63  63
--
CHS current addressable sectors:   16514064
LBA    user addressable sectors:  268435455
LBA48  user addressable sectors:  490234752
device size with M = 1024*1024:  239372 MBytes
device size with M = 1000*1000:  251000 MBytes (251 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16  Current = 16
Advanced power management level: unknown setting (0x)
Recommended acoustic management value: 192, current value: 128
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
 Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4 
 Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
Enabled Supported:
   *SMART feature set
Security Mode feature set
   *Power Management feature set
   *Write cache
   *Look-ahead
   *Host Protected Area feature set
   *WRITE_VERIFY command
   *WRITE_BUFFER command
   *READ_BUFFER command
   *NOP cmd
   *DOWNLOAD_MICROCODE
Advanced Power Management feature set
SET_MAX security extension
   *Automatic Acoustic Management feature set
   *48-bit Address feature set
   *Device Configuration Overlay feature set
   *Mandatory FLUSH_CACHE
   *FLUSH_CACHE_EXT
   *SMART error logging
   *SMART self-test
Media Card Pass-Through
   *General Purpose Logging feature set
   *WRITE_{DMA|MULTIPLE}_FUA_EXT
   *URG for READ_STREAM[_DMA]_EXT
   *URG for WRITE_STREAM[_DMA]_EXT
   *SATA-I signaling speed (1.5Gb/s)
   *Native Command Queueing (NCQ)
Software settings preservation
   *SMART Command Transport (SCT) feature set
   *SCT Data Tables (AC5)
Security: 
Master password revision code = 65534
supported
not enabled
not locked
frozen
not expired: security count
not supported: enhanced erase
Checksum: correct
< cut >

Under 2.6.22.1, when I try to do "smartctl -d ata -s on /dev/sd[ab]" or
"smartctl -d ata -a /dev/sd[ab]", I get the following error:

< cut >
smartctl version 5.37 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce 
Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Maxtor DiamondMax 10 family (ATA/133 and SATA/150)
Device Model:     Maxtor 6B250S0
Serial Number:
Firmware Version: BANC1G10
User Capacity:    251,000,193,024 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Wed Aug 15 12:01:38 2007 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Error SMART Status command failed
Please get assistance from http://smartmontools.sourceforge.net/
Register values returned from SMART Status command are:
CMD=0x50
FR =0x00
NS =0x00
SC =0x00
CL =0xc2
CH =0x00
SEL=0x00
A mandatory SMART command failed: exiting. To continue, add one or more '-T 
permissive' options.
< cut >
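
As background for the register dump above: as far as I know, the ATA SMART RETURN STATUS command reports its result through the cylinder low/high (LBA mid/high) registers, with 0x4F/0xC2 meaning no threshold was exceeded and 0xF4/0x2C meaning one was. The sketch below only shows that check. The enum and function names are hypothetical and this is not smartctl's actual code.

```c
#include <assert.h>

/* Hypothetical result codes for this sketch. */
enum smart_status {
    SMART_STATUS_OK,       /* no threshold exceeded */
    SMART_STATUS_FAILING,  /* a threshold was exceeded */
    SMART_STATUS_INVALID   /* registers match neither signature */
};

/*
 * Classify the cylinder low (CL) and cylinder high (CH) register values
 * returned by SMART RETURN STATUS.
 */
static enum smart_status smart_return_status(unsigned char cl,
                                             unsigned char ch)
{
    if (cl == 0x4F && ch == 0xC2)
        return SMART_STATUS_OK;
    if (cl == 0xF4 && ch == 0x2C)
        return SMART_STATUS_FAILING;
    return SMART_STATUS_INVALID;
}
```

The values in the dump (CL=0xc2, CH=0x00) match neither signature, which would explain why smartctl treats the command as failed rather than reporting a disk health verdict.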

To repeat, this does not happen under 2.6.20.5. Using "-T permissive" works:

< cut >
smartctl version 5.37 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce 
Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:

Re: sata_nv + ADMA + Samsung disk problem

2007-08-14 Thread Gabor Gombas
On Tue, Aug 14, 2007 at 06:30:28PM +0900, Tejun Heo wrote:

> Hmmm... That's timeout on cache flush, indicative of failing disk.
> Please post the result of 'smartctl -a /dev/sdc'.

Will do when I get home. Note however that this only occurs in ADMA
mode. It never occurred with 2.6.20 and it has never occurred with 2.6.22
since I disabled ADMA.

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -


Re: sata_nv + ADMA + Samsung disk problem

2007-08-14 Thread Tejun Heo
Gabor Gombas wrote:
> Hi,
> 
> Since I have upgraded to 2.6.22.1 from 2.6.20 I have problems with
> Samsung disks. Sometimes the disks stall for about half a minute and
> then I have these messages in the logs:
> 
> Aug  6 20:10:11 twister kernel: ata7: EH in ADMA mode, notifier 0x0 
> notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb 
> idx 0x0
> Aug  6 20:10:12 twister kernel: ata7: CPB 0: ctl_flags 0x9, resp_flags 0x0
> Aug  6 20:10:12 twister kernel: ata7: timeout waiting for ADMA IDLE, 
> stat=0x400
> Aug  6 20:10:12 twister kernel: ata7: timeout waiting for ADMA LEGACY, 
> stat=0x400
> Aug  6 20:10:12 twister kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 
> 0x0 action 0x2 frozen
> Aug  6 20:10:12 twister kernel: ata7.00: cmd 
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 
> Aug  6 20:10:12 twister kernel:  res 
> 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> Aug  6 20:10:12 twister kernel: ata7: soft resetting port
> Aug  6 20:10:12 twister kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 
> SControl 300)
> Aug  6 20:10:12 twister kernel: ata7.00: configured for UDMA/133
> Aug  6 20:10:12 twister kernel: ata7: EH complete
> Aug  6 20:10:12 twister kernel: sd 6:0:0:0: [sdc] 488397168 512-byte hardware 
> sectors (250059 MB)
> Aug  6 20:10:12 twister kernel: sd 6:0:0:0: [sdc] Write Protect is off
> Aug  6 20:10:12 twister kernel: sd 6:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> Aug  6 20:10:12 twister kernel: sd 6:0:0:0: [sdc] Write cache: enabled, read 
> cache: enabled, doesn't support DPO or FUA
> Aug  6 20:20:25 twister kernel: ata8: EH in ADMA mode, notifier 0x0 
> notifier_error 0x0 gen_ctl 0x1501000 status 0x400 next cpb count 0x0 next cpb 
> idx 0x0
> Aug  6 20:20:25 twister kernel: ata8: CPB 0: ctl_flags 0x9, resp_flags 0x0
> Aug  6 20:20:25 twister kernel: ata8: timeout waiting for ADMA IDLE, 
> stat=0x400
> Aug  6 20:20:25 twister kernel: ata8: timeout waiting for ADMA LEGACY, 
> stat=0x400
> Aug  6 20:20:25 twister kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 
> 0x0 action 0x2 frozen
> Aug  6 20:20:25 twister kernel: ata8.00: cmd 
> ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 
> Aug  6 20:20:25 twister kernel:  res 
> 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> Aug  6 20:20:25 twister kernel: ata8: soft resetting port
> Aug  6 20:20:25 twister kernel: ata8: SATA link up 3.0 Gbps (SStatus 123 
> SControl 300)
> Aug  6 20:20:25 twister kernel: ata8.00: configured for UDMA/133
> Aug  6 20:20:25 twister kernel: ata8: EH complete
> Aug  6 20:20:25 twister kernel: sd 7:0:0:0: [sdd] 488397168 512-byte hardware 
> sectors (250059 MB)
> Aug  6 20:20:25 twister kernel: sd 7:0:0:0: [sdd] Write Protect is off
> Aug  6 20:20:25 twister kernel: sd 7:0:0:0: [sdd] Mode Sense: 00 3a 00 00
> Aug  6 20:20:25 twister kernel: sd 7:0:0:0: [sdd] Write cache: enabled, read 
> cache: enabled, doesn't support DPO or FUA
> 
> I also have two Maxtor disks on the same controller but they are working
> correctly in ADMA mode. I now disabled ADMA mode and that seems to help.

Hmmm... That's timeout on cache flush, indicative of failing disk.
Please post the result of 'smartctl -a /dev/sdc'.
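
For anyone wondering how the log shows a cache flush timeout: in the libata error lines, the first byte after "cmd" is the ATA command opcode and the first byte after "res" is the device status. The sketch below decodes the two values from the quoted log (0xea and 0x40) using the standard ATA opcode and status-bit values. The macro and function names are chosen for illustration and are not taken from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Standard ATA command opcodes for the two cache flush commands. */
#define CMD_FLUSH_CACHE     0xE7
#define CMD_FLUSH_CACHE_EXT 0xEA

/* Standard ATA status register bits. */
#define STAT_BSY  0x80
#define STAT_DRDY 0x40
#define STAT_DF   0x20
#define STAT_DRQ  0x08
#define STAT_ERR  0x01

/* Name the opcode if it is one of the flush commands. */
static const char *ata_cmd_name(unsigned char cmd)
{
    switch (cmd) {
    case CMD_FLUSH_CACHE:     return "FLUSH CACHE";
    case CMD_FLUSH_CACHE_EXT: return "FLUSH CACHE EXT";
    default:                  return "other";
    }
}

static bool ata_is_cache_flush(unsigned char cmd)
{
    return cmd == CMD_FLUSH_CACHE || cmd == CMD_FLUSH_CACHE_EXT;
}

/* True when only "ready" is reported: no busy, fault, data request or error. */
static bool ata_status_ready_only(unsigned char status)
{
    return (status & STAT_DRDY) != 0 &&
           (status & (STAT_BSY | STAT_DF | STAT_DRQ | STAT_ERR)) == 0;
}
```

So "cmd ea/..." is FLUSH CACHE EXT, and "res 40/..." is a device that reports ready with no error bit set, which fits a command that timed out rather than one the disk rejected.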

-- 
tejun