Re: SATA exceptions with 2.6.20-rc5

2007-02-09 Thread Björn Steinbrink
On 2007.02.04 02:13:51 +0100, Björn Steinbrink wrote:
> On 2007.02.02 23:48:14 -0600, Robert Hancock wrote:
> > There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch) 
> > which should hopefully avoid this problem for the cache flush commands, 
> > at least - can you try that one out? You'll have to apply the other 
> > sata_nv patches in -mm first, i.e. this order:
> > 
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch
> 
> Got 2.6.20-rc7 with them applied now (the rejects seemed trivial enough
> for me to fix them). Let's see how that works out...

After about 1.5 days of uptime, an involuntary reboot and another 3
days of uptime, there is no sign of an exception. No stress testing was
done, but a few disk-intensive actions did happen, at least more than
with the -rc6 kernel that did throw an exception at me.

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-02-03 Thread Björn Steinbrink
On 2007.02.02 23:48:14 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:
> >>On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> >>>Larry Walton wrote:
> >>>>The last patch (sata_nv-force-int-dev-in-interrupt.patch)
> >>>>seems to have fixed the problem.  Much appreciated,
> >>>>thank you. I'd consider it a must have in 2.6.20.
> >>>Can any of the rest of you that have been seeing this problem also 
> >>>confirm that this fixes it?
> >>Seems to work for me, uptime is about an hour now and no exception yet.
> >>Had the stress test running for only about 10 minutes, but I usually got
> >>an exception within an hour even during plain irssi usage, so I'm quite
> >>confident that the patch fixes it.
> >
> >Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
> >uptime to trigger, so it's just a lot harder to trigger now.
> 
> Same exception details as before?

Yes, exactly the same.

> There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch) 
> which should hopefully avoid this problem for the cache flush commands, 
> at least - can you try that one out? You'll have to apply the other 
> sata_nv patches in -mm first, i.e. this order:
> 
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch

Got 2.6.20-rc7 with them applied now (the rejects seemed trivial enough
for me to fix them). Let's see how that works out...

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-02-02 Thread Robert Hancock

Björn Steinbrink wrote:
> On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:
>> On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
>>> Larry Walton wrote:
>>>> The last patch (sata_nv-force-int-dev-in-interrupt.patch)
>>>> seems to have fixed the problem.  Much appreciated,
>>>> thank you. I'd consider it a must have in 2.6.20.
>>> Can any of the rest of you that have been seeing this problem also
>>> confirm that this fixes it?
>> Seems to work for me, uptime is about an hour now and no exception yet.
>> Had the stress test running for only about 10 minutes, but I usually got
>> an exception within an hour even during plain irssi usage, so I'm quite
>> confident that the patch fixes it.
>
> Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
> uptime to trigger, so it's just a lot harder to trigger now.

Same exception details as before?

There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch) 
which should hopefully avoid this problem for the cache flush commands, 
at least - can you try that one out? You'll have to apply the other 
sata_nv patches in -mm first, i.e. this order:


http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: SATA exceptions with 2.6.20-rc5

2007-02-02 Thread Björn Steinbrink
On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:
> On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> > Larry Walton wrote:
> > >The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> > >seems to have fixed the problem.  Much appreciated,
> > >thank you. I'd consider it a must have in 2.6.20.
> > 
> > Can any of the rest of you that have been seeing this problem also 
> > confirm that this fixes it?
> 
> Seems to work for me, uptime is about an hour now and no exception yet.
> Had the stress test running for only about 10 minutes, but I usually got
> an exception within an hour even during plain irssi usage, so I'm quite
> confident that the patch fixes it.

Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
uptime to trigger, so it's just a lot harder to trigger now.

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-24 Thread Björn Steinbrink
On 2007.01.24 09:24:00 +0100, Ian Kumlien wrote:
> On tis, 2007-01-23 at 17:18 -0600, Robert Hancock wrote:
> > Larry Walton wrote:
> > > The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> > > seems to have fixed the problem.  Much appreciated,
> > > thank you. I'd consider it a must have in 2.6.20.
> > 
> > Can any of the rest of you that have been seeing this problem also 
> > confirm that this fixes it?
> 
> I applied it yesterday and today my dmesg contains three:
> BUG: at mm/truncate.c:60 cancel_dirty_page()

David Chinner sent two patches regarding that bug yesterday.
http://lkml.org/lkml/2007/1/23/190
http://lkml.org/lkml/2007/1/23/192

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-24 Thread Ian Kumlien
On tis, 2007-01-23 at 17:18 -0600, Robert Hancock wrote:
> Larry Walton wrote:
> > The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> > seems to have fixed the problem.  Much appreciated,
> > thank you. I'd consider it a must have in 2.6.20.
> 
> Can any of the rest of you that have been seeing this problem also 
> confirm that this fixes it?

I applied it yesterday and today my dmesg contains three:
BUG: at mm/truncate.c:60 cancel_dirty_page()

Call Trace:
 [8029f3e5] cancel_dirty_page+0x43/0x71
 [802ec1ab] reiserfs_cut_from_item+0x5f8/0x61d
 [802074fc] find_get_page+0x21/0x47
 [802ec51d] reiserfs_do_truncate+0x34d/0x495
 [802d9d47] reiserfs_truncate_file+0x199/0x2aa
 [802df9c5] reiserfs_file_release+0x261/0x281
 [80211b02] __fput+0xb1/0x17d
 [802218e0] filp_close+0x5d/0x65
 [8021bef5] sys_close+0x8c/0xcf
 [8025725e] system_call+0x7e/0x83

Which never happened before... I dunno if they are related though, but
they weren't there before...

(It does fix the timeout problem)

-- 
Ian Kumlien  -- http://pomac.netswarm.net


Re: SATA exceptions with 2.6.20-rc5

2007-01-23 Thread Björn Steinbrink
On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> Larry Walton wrote:
> >The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> >seems to have fixed the problem.  Much appreciated,
> >thank you. I'd consider it a must have in 2.6.20.
> 
> Can any of the rest of you that have been seeing this problem also 
> confirm that this fixes it?

Seems to work for me, uptime is about an hour now and no exception yet.
Had the stress test running for only about 10 minutes, but I usually got
an exception within an hour even during plain irssi usage, so I'm quite
confident that the patch fixes it.

Thanks,
Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-23 Thread Robert Hancock

Larry Walton wrote:
> The last patch (sata_nv-force-int-dev-in-interrupt.patch)
> seems to have fixed the problem.  Much appreciated,
> thank you. I'd consider it a must have in 2.6.20.


Can any of the rest of you that have been seeing this problem also 
confirm that this fixes it?


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: SATA exceptions with 2.6.20-rc5

2007-01-23 Thread Larry Walton
The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
seems to have fixed the problem.  Much appreciated,
thank you. I'd consider it a must have in 2.6.20.


-- 
*--* Mail: [EMAIL PROTECTED]
*--* Voice: 206.892.6269
*--* Cell: 206.225.0154
*--* HTTP://real.com
--
- - - - - - - R e a l - - - - - - - -



Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock

Björn Steinbrink wrote:

> Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
> I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
> without a single problem and that should still look at
> NV_INT_STATUS_CK804, right?
> I just noticed that my last email might not have been clear enough. The
> exceptions happened when I re-enabled the return statement in addition
> to the debug message. Without the INT_DEV check, it is completely fine
> AFAICT.


Indeed, it seems to be just the NV_INT_DEV check that is problematic. 
Here's a patch that's likely better to test, it forces the NV_INT_DEV 
flag on when a command is active, and also fixes that questionable code 
in nv_host_intr that I mentioned.
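
For anyone following along without the driver source open, the effect of forcing the flag can be sketched in plain user-space C. This is only a rough model, not the driver code: the bit values are assumed to match the NV_INT_* definitions in sata_nv.c (check them against your tree), and port_irq_stat() is a made-up helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit layout, modelled on sata_nv.c; verify against your tree. */
enum {
	NV_INT_DEV        = 0x01, /* device interrupt for this port */
	NV_INT_PORT_SHIFT = 4     /* each port occupies 4 bits */
};

/*
 * Hypothetical helper (not in the driver): the status byte as one port
 * sees it, with NV_INT_DEV forced on while a command is active.
 */
static uint8_t port_irq_stat(uint8_t raw, int port, int cmd_active)
{
	uint8_t irq_stat;

	assert(port == 0 || port == 1);
	irq_stat = (uint8_t)(raw >> (NV_INT_PORT_SHIFT * port));
	if (cmd_active)
		irq_stat |= NV_INT_DEV; /* do not trust a missing DEV bit */
	return irq_stat;
}
```

With a raw register value of 0x10, port 0 sees 0x10 (no DEV bit in its own nibble) when idle, but 0x11 once a command is active, so the completion is no longer dismissed as "not ours".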


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c	2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c	2007-01-22 22:33:43.0 -0600
@@ -700,7 +700,6 @@ static void nv_adma_check_cpb(struct ata
 static int nv_host_intr(struct ata_port *ap, u8 irq_stat)
 {
 	struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
-	int handled;
 
 	/* freeze if hotplugged */
 	if (unlikely(irq_stat & (NV_INT_ADDED | NV_INT_REMOVED))) {
@@ -719,13 +718,7 @@ static int nv_host_intr(struct ata_port 
 	}
 
 	/* handle interrupt */
-	handled = ata_host_intr(ap, qc);
-	if (unlikely(!handled)) {
-		/* spurious, clear it */
-		ata_check_status(ap);
-	}
-
-	return 1;
+	return ata_host_intr(ap, qc);
 }
 
 static irqreturn_t nv_adma_interrupt(int irq, void *dev_instance)
@@ -752,6 +745,11 @@ static irqreturn_t nv_adma_interrupt(int
 			if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
 				u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
 					>> (NV_INT_PORT_SHIFT * i);
+				if(ata_tag_valid(ap->active_tag))
+					/** NV_INT_DEV indication seems unreliable at times
+					    at least in ADMA mode. Force it on always when a
+					    command is active, to prevent losing interrupts. */
+					irq_stat |= NV_INT_DEV;
 				handled += nv_host_intr(ap, irq_stat);
 				continue;
 			}


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink
On 2007.01.22 19:24:22 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >>>Running a kernel with the return statement replaced by a line that prints
> >>>the irq_stat instead.
> >>>
> >>>Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
> >>40 minutes stress test now and no exception yet. What's interesting is
> >>that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> >>might have gotten dropped are as above.
> >>I'll keep it running for some time and will then re-enable the return
> >>statement to see if there's a relation between the irq_stat 0x0 and the
> >>exception.
> >
> >No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
> >0x0 for ata1. Syslog/dmesg has nothing new either, still the same
> >pattern of dismissed irq_stats.
> 
> I've finally managed to reproduce this problem on my box, by doing:
> 
> watch --interval=0.1 /sbin/hdparm -I /dev/sda
> 
> on one drive and then running bonnie++ on /dev/sdb connected to the 
> other port on the same controller device. Usually within a few minutes 
> one of the IDENTIFY commands would time out in the same way you guys 
> have been seeing.
> 
> Through various trials and tribulations, the only conclusion I can
> come to is that this controller really doesn't like that 
> NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried 
> adding some debug code to the qc_issue function that would check to see 
> if the BUSY flag in altstatus went high or that register showed an 
> interrupt within a certain time afterwards, however that really seemed 
> to hose things, the system wouldn't even boot.

Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
without a single problem and that should still look at
NV_INT_STATUS_CK804, right?
I just noticed that my last email might not have been clear enough. The
exceptions happened when I re-enabled the return statement in addition
to the debug message. Without the INT_DEV check, it is completely fine
AFAICT.
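
To make the check in question concrete, here is a minimal user-space sketch of how nv_host_intr appears to triage the status value. It is an approximation under stated assumptions, not the driver source: the NV_INT_* values are assumed from sata_nv.c, and the toy_verdict enum and classify_irq_stat() are invented names for illustration.

```c
#include <stdint.h>

/* Assumed bit values, modelled on sata_nv.c; verify against your tree. */
enum {
	NV_INT_DEV     = 0x01,
	NV_INT_ADDED   = 0x04,
	NV_INT_REMOVED = 0x08
};

/* Hypothetical result type for this sketch only. */
enum toy_verdict { TOY_FREEZE, TOY_NOT_OURS, TOY_HANDLE };

/* Rough model of the triage order: hotplug first, then the INT_DEV check. */
static enum toy_verdict classify_irq_stat(uint8_t irq_stat)
{
	if (irq_stat & (NV_INT_ADDED | NV_INT_REMOVED))
		return TOY_FREEZE;
	if (!(irq_stat & NV_INT_DEV))
		return TOY_NOT_OURS; /* the early return under suspicion */
	return TOY_HANDLE;
}
```

Under these assumptions both 0x10 and 0x0 lack the low DEV bit and land in the "not ours" branch, which is the return statement that was swapped for a debug print.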

> Try out this patch, it just calls the ata_host_intr function where 
> appropriate without using nv_host_intr which looks at the 
> NV_INT_STATUS_CK804 register. This is what the original ADMA patch from 
> Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for 
> that. With this patch I can get through a whole bonnie++ run with the 
> repeated IDENTIFY requests running without seeing the error.

I'll see if I can schedule a test run for tomorrow, I currently need
this box.

Thanks,
Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock

Alistair John Strachan wrote:

> On Tuesday 23 January 2007 01:24, Robert Hancock wrote:
>> As a final aside, this is another case where the hardware docs for this
>> controller would really be useful, in order to know whether we are
>> actually supposed to be reading that register in ADMA mode or not. I
>> sent a query to Allen Martin at NVIDIA asking if there's a way I could
>> get access to the documents, but I haven't heard anything yet.
>
> Obviously, NVIDIA's response is disappointing, but thank you for putting the
> time in to debug this problem. Definitely sounds like a hardware defect, I'm
> just glad there's a workaround.
>
> Will we see this fix in 2.6.20?


Hopefully, assuming it actually does fix the problem for those that have 
been seeing it..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Alistair John Strachan
On Tuesday 23 January 2007 01:24, Robert Hancock wrote:
> As a final aside, this is another case where the hardware docs for this
> controller would really be useful, in order to know whether we are
> actually supposed to be reading that register in ADMA mode or not. I
> sent a query to Allen Martin at NVIDIA asking if there's a way I could
> get access to the documents, but I haven't heard anything yet.

Obviously, NVIDIA's response is disappointing, but thank you for putting the 
time in to debug this problem. Definitely sounds like a hardware defect, I'm 
just glad there's a workaround.

Will we see this fix in 2.6.20?

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock

Björn Steinbrink wrote:

>>> Running a kernel with the return statement replaced by a line that prints
>>> the irq_stat instead.
>>>
>>> Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
>> 40 minutes stress test now and no exception yet. What's interesting is
>> that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
>> might have gotten dropped are as above.
>> I'll keep it running for some time and will then re-enable the return
>> statement to see if there's a relation between the irq_stat 0x0 and the
>> exception.
>
> No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
> 0x0 for ata1. Syslog/dmesg has nothing new either, still the same
> pattern of dismissed irq_stats.


I've finally managed to reproduce this problem on my box, by doing:

watch --interval=0.1 /sbin/hdparm -I /dev/sda

on one drive and then running bonnie++ on /dev/sdb connected to the 
other port on the same controller device. Usually within a few minutes 
one of the IDENTIFY commands would time out in the same way you guys 
have been seeing.


Through various trials and tribulations, the only conclusion I can 
come to is that this controller really doesn't like that 
NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried 
adding some debug code to the qc_issue function that would check to see 
if the BUSY flag in altstatus went high or that register showed an 
interrupt within a certain time afterwards, however that really seemed 
to hose things, the system wouldn't even boot.


Try out this patch, it just calls the ata_host_intr function where 
appropriate without using nv_host_intr which looks at the 
NV_INT_STATUS_CK804 register. This is what the original ADMA patch from 
Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for 
that. With this patch I can get through a whole bonnie++ run with the 
repeated IDENTIFY requests running without seeing the error.


As an aside, there seems to be some dubious code in nv_host_intr, if 
ata_host_intr returns 0 for handled when a command is outstanding, it 
goes and calls ata_check_status anyway. This is rather dangerous since 
if an interrupt showed up right after ata_host_intr but before 
ata_check_status, the ata_check_status would clear it and we would 
forget about it. I tried fixing just that issue and still had this 
problem however. I suspect that code is truly broken and needs further 
thought, but this patch avoids calling it in the ADMA case, at any rate.
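
The window described here can be illustrated with a toy model in user-space C. It is only a sketch: the device is reduced to a single latched flag, the toy_* functions are stand-ins invented for this example, and the real ata_host_intr and ata_check_status do considerably more. The late_irq parameter injects a completion in the gap between the two calls.

```c
#include <stdbool.h>

/* Toy device: one latched interrupt that a status read clears. */
struct toy_port {
	bool irq_pending; /* latched completion interrupt */
	int  completions; /* completions the "driver" has noticed */
};

/* Stand-in for ata_host_intr(): consume the interrupt if one is latched. */
static int toy_host_intr(struct toy_port *p)
{
	if (!p->irq_pending)
		return 0;
	p->irq_pending = false;
	p->completions++;
	return 1;
}

/* Stand-in for ata_check_status(): reading status clears the latch. */
static void toy_check_status(struct toy_port *p)
{
	p->irq_pending = false;
}

/* Questionable flow: treat "not handled" as spurious and clear status. */
static int old_flow(struct toy_port *p, bool late_irq)
{
	int handled = toy_host_intr(p);

	if (late_irq)
		p->irq_pending = true; /* device completes right here */
	if (!handled)
		toy_check_status(p);   /* ...and this wipes the completion out */
	return 1;
}

/* Flow without the clearing step: just report what the handler did. */
static int new_flow(struct toy_port *p, bool late_irq)
{
	int handled = toy_host_intr(p);

	if (late_irq)
		p->irq_pending = true;
	return handled;
}
```

In this model old_flow() with a late interrupt leaves nothing pending and nothing counted, so the command would eventually time out, while new_flow() leaves the interrupt latched for the next pass. As noted above, fixing only this did not make the timeouts go away on real hardware.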


As a final aside, this is another case where the hardware docs for this 
controller would really be useful, in order to know whether we are 
actually supposed to be reading that register in ADMA mode or not. I 
sent a query to Allen Martin at NVIDIA asking if there's a way I could 
get access to the documents, but I haven't heard anything yet.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c  2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-22 18:35:09.0 -0600
@@ -750,9 +750,9 @@ static irqreturn_t nv_adma_interrupt(int
 
             /* if in ATA register mode, use standard ata interrupt handler */
             if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
-                u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
-                    >> (NV_INT_PORT_SHIFT * i);
-                handled += nv_host_intr(ap, irq_stat);
+                struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
+                if(qc && !(qc->tf.flags & ATA_TFLAG_POLLING))
+                    handled += ata_host_intr(ap, qc);
                 continue;
             }
 


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Eric D. Mudama

On 1/15/07, Jeff Garzik <[EMAIL PROTECTED]> wrote:
> Jens Axboe wrote:
> > On Mon, Jan 15 2007, Jeff Garzik wrote:
> >> Jens Axboe wrote:
> >>> I'd be surprised if the device would not obey the 7 second timeout rule
> >>> that seems to be set in stone and not allow more dirty in-drive cache
> >>> than it could flush out in approximately that time.
> >> AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other
> >> commands...
> >
> > Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as
> > it would pretty much guarantee lower latencies for random writes and
> > write back caching. The concern is the barrier code, of course. I guess
> > I should do some timings on potential worst case patterns some day. Alan
> > may have done that sometime in the past, iirc.
>
> FWIW:  According to the drive guys (Eric M, among others), FLUSH CACHE
> will "probably" be under 30 seconds, but pathological cases might even
> extend beyond that.
>
> Definitely more than 7 seconds in less-than-pathological cases,
> unfortunately...


The mentioned Maxtor model (6Yxxx) isn't susceptible to the
large-buffer long completion times, due to architectural differences
and availability of only small buffers.  Any "real" long-completion
flush on this device would, I believe, involve damage to the disk that
hinders the ability to seek, settle, or write.  (e.g. 30-second
flushes are easy to hit if you mount the disk on a shaker-table with
sufficient amplitude)

Later in the thread I think people have pretty much isolated it as not
the disk's problem, but just wanted to point this out.

I assume that large enough customers can buy enterprise-type command
completion ("all commands within X seconds") from most any disk
vendor.  However, these firmwares require much smarter or more active
drivers or block layers, to handle the higher error rate when the data
on the device is valid, but it will take longer than allowed by the
arbitrary enterprise rules.  Most customers who are buying this many
devices have software engineers customizing the drivers or disk
management applications to handle this differing behavior.

--eric


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink
On 2007.01.22 17:57:08 +0100, Björn Steinbrink wrote:
> On 2007.01.22 17:12:40 +0100, Björn Steinbrink wrote:
> > On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> > > Hmm, another miss, apparently.. Has anyone tried removing these lines
> > > > from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> > > 
> > > /* bail out if not our interrupt */
> > > if (!(irq_stat & NV_INT_DEV))
> > > return 0;
> > 
> > Running a kernel with the return statement replaced by a line that prints
> > the irq_stat instead.
> > 
> > Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
> 
> 40 minutes stress test now and no exception yet. What's interesting is
> that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> might have gotten dropped are as above.
> I'll keep it running for some time and will then re-enable the return
> statement to see if there's a relation between the irq_stat 0x0 and the
> exception.

No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
0x0 for ata1. Syslog/dmesg has nothing new either, still the same
pattern of dismissed irq_stats.

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink
On 2007.01.22 17:12:40 +0100, Björn Steinbrink wrote:
> On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > >>Björn Steinbrink wrote:
> > >>>All kernels were bad using that approach. So back to square 1. :/
> > >>>
> > >>>Björn
> > >>>
> > >>OK guys, here's a new patch to try against 2.6.20-rc5:
> > >>
> > >>Right now when switching between ADMA mode and legacy mode (i.e. when 
> > >>going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> > >>set the ADMA GO register bit appropriately and continue with no delay. 
> > >>It looks like in some cases the controller doesn't respond to this 
> > >>immediately, it takes some nanoseconds for the controller's status 
> > >>registers to reflect the change that was made. It's possible that if we 
> > >>were trying to issue commands during this time, the controller might not 
> > >>react properly. This patch adds some code to wait for the status 
> > >>register to change to the state we asked for before continuing.
> > >
> > >Just got two exceptions with your patch, none of the debug messages were
> > >issued.
> > >
> > >Björn
> > 
> > Hmm, another miss, apparently.. Has anyone tried removing these lines
> > > from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> > 
> > /* bail out if not our interrupt */
> > if (!(irq_stat & NV_INT_DEV))
> > return 0;
> 
> Running a kernel with the return statement replaced by a line that prints
> the irq_stat instead.
> 
> Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.

40 minutes stress test now and no exception yet. What's interesting is
that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
might have gotten dropped are as above.
I'll keep it running for some time and will then re-enable the return
statement to see if there's a relation between the irq_stat 0x0 and the
exception.

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink
On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> >>Björn Steinbrink wrote:
> >>>All kernels were bad using that approach. So back to square 1. :/
> >>>
> >>>Björn
> >>>
> >>OK guys, here's a new patch to try against 2.6.20-rc5:
> >>
> >>Right now when switching between ADMA mode and legacy mode (i.e. when 
> >>going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> >>set the ADMA GO register bit appropriately and continue with no delay. 
> >>It looks like in some cases the controller doesn't respond to this 
> >>immediately, it takes some nanoseconds for the controller's status 
> >>registers to reflect the change that was made. It's possible that if we 
> >>were trying to issue commands during this time, the controller might not 
> >>react properly. This patch adds some code to wait for the status 
> >>register to change to the state we asked for before continuing.
> >
> >Just got two exceptions with your patch, none of the debug messages were
> >issued.
> >
> >Björn
> 
> Hmm, another miss, apparently.. Has anyone tried removing these lines
> > from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> 
> /* bail out if not our interrupt */
> if (!(irq_stat & NV_INT_DEV))
> return 0;

Running a kernel with the return statement replaced by a line that prints
the irq_stat instead.

Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
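
For anyone following along, a quick user-space sketch of why both of those values get dismissed by the check. It assumes NV_INT_DEV is 0x01 and NV_INT_PORT_SHIFT is 4, which is how I remember sata_nv.c but is not checked against this tree, and the helper names are made up:

```c
#include <assert.h>
#include <stdint.h>

/* Values as recalled from sata_nv.c; treat them as assumptions. */
enum {
    NV_INT_DEV        = 0x01,
    NV_INT_PORT_SHIFT = 4,
};

/* Per-port view of the shared status byte, following the
 * readb(...) >> (NV_INT_PORT_SHIFT * i) pattern in the driver. */
static uint8_t port_irq_stat(uint8_t raw, unsigned int port)
{
    assert(port < 2);   /* two ports share one 8-bit register */
    return (uint8_t)(raw >> (NV_INT_PORT_SHIFT * port));
}

/* Mirrors the "bail out if not our interrupt" test in nv_host_intr. */
static int would_bail_out(uint8_t irq_stat)
{
    return !(irq_stat & NV_INT_DEV);
}
```

Under those assumptions would_bail_out() is true for both 0x10 and 0x0, since neither has bit 0 set. A 0x10 seen by the first port is the bit that the second port would see as NV_INT_DEV after the shift.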

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Chr
On Monday, 22. January 2007 03:39, Tejun Heo wrote:
> Hello,
> 
> Chr wrote:
> > Ok, you won't believe this... I opened my case and rewired my drives... 
> > And guess what, my second (aka the "good") HDD is now failing! 
> > I guess, my mainboard has a (but maybe two, or three :( ) "bad" 
> > sata-port(s)!  
> 
> Or, you have power related problem.  Try to rewire the power lines or 
> connect harddrives to a separate powersupply.  It's often useful to 
> change one component at a time and watch which change the problem 
> follows.  Anyways, you seem to be suffering transmission failures, not a 
> driver problem.
> 
> Thanks.
> 

Yes and no. It's probably not a power problem: I've tried another
PSU with the same result :( . Furthermore, the RAID0 setup makes
it impossible to try only one drive alone :(. 

Anyway, the WD2500KS is known to have some strange bugs in the FW,
e.g. it reports 255°C right after a cold start
( http://www.bugtrack.almico.com/view.php?id=468 ).

Thanks,
Chr.
 


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Alistair John Strachan
On Tuesday 23 January 2007 01:24, Robert Hancock wrote:
> As a final aside, this is another case where the hardware docs for this
> controller would really be useful, in order to know whether we are
> actually supposed to be reading that register in ADMA mode or not. I
> sent a query to Allen Martin at NVIDIA asking if there's a way I could
> get access to the documents, but I haven't heard anything yet.

Obviously, NVIDIA's response is disappointing, but thank you for putting the 
time in to debug this problem. Definitely sounds like a hardware defect, I'm 
just glad there's a workaround.

Will we see this fix in 2.6.20?

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock

Alistair John Strachan wrote:
> On Tuesday 23 January 2007 01:24, Robert Hancock wrote:
> > As a final aside, this is another case where the hardware docs for this
> > controller would really be useful, in order to know whether we are
> > actually supposed to be reading that register in ADMA mode or not. I
> > sent a query to Allen Martin at NVIDIA asking if there's a way I could
> > get access to the documents, but I haven't heard anything yet.
>
> Obviously, NVIDIA's response is disappointing, but thank you for putting the 
> time in to debug this problem. Definitely sounds like a hardware defect, I'm 
> just glad there's a workaround.
>
> Will we see this fix in 2.6.20?


Hopefully, assuming it actually does fix the problem for those that have 
been seeing it..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink
On 2007.01.22 19:24:22 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> > > > Running a kernel with the return statement replaced by a line that prints
> > > > the irq_stat instead.
> > > >
> > > > Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
> > > 40 minutes stress test now and no exception yet. What's interesting is
> > > that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> > > might have gotten dropped are as above.
> > > I'll keep it running for some time and will then re-enable the return
> > > statement to see if there's a relation between the irq_stat 0x0 and the
> > > exception.
> >
> > No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
> > 0x0 for ata1. Syslog/dmesg has nothing new either, still the same
> > pattern of dismissed irq_stats.
>
> I've finally managed to reproduce this problem on my box, by doing:
>
> watch --interval=0.1 /sbin/hdparm -I /dev/sda
>
> on one drive and then running bonnie++ on /dev/sdb connected to the 
> other port on the same controller device. Usually within a few minutes 
> one of the IDENTIFY commands would time out in the same way you guys 
> have been seeing.
>
> Through some various trials and tribulations, the only conclusion I can 
> come to is that this controller really doesn't like that 
> NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried 
> adding some debug code to the qc_issue function that would check to see 
> if the BUSY flag in altstatus went high or that register showed an 
> interrupt within a certain time afterwards, however that really seemed 
> to hose things, the system wouldn't even boot.

Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
without a single problem and that should still look at
NV_INT_STATUS_CK804, right?
I just noticed that my last email might not have been clear enough. The
exceptions happened when I re-enabled the return statement in addition
to the debug message. Without the INT_DEV check, it is completely fine
AFAICT.

> Try out this patch, it just calls the ata_host_intr function where 
> appropriate without using nv_host_intr which looks at the 
> NV_INT_STATUS_CK804 register. This is what the original ADMA patch from 
> Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for 
> that. With this patch I can get through a whole bonnie++ run with the 
> repeated IDENTIFY requests running without seeing the error.

I'll see if I can schedule a test run for tomorrow, I currently need
this box.

Thanks,
Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock

Björn Steinbrink wrote:
> Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
> I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
> without a single problem and that should still look at
> NV_INT_STATUS_CK804, right?
> I just noticed that my last email might not have been clear enough. The
> exceptions happened when I re-enabled the return statement in addition
> to the debug message. Without the INT_DEV check, it is completely fine
> AFAICT.


Indeed, it seems to be just the NV_INT_DEV check that is problematic. 
Here's a patch that's likely better to test, it forces the NV_INT_DEV 
flag on when a command is active, and also fixes that questionable code 
in nv_host_intr that I mentioned.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c  2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-22 22:33:43.0 -0600
@@ -700,7 +700,6 @@ static void nv_adma_check_cpb(struct ata
 static int nv_host_intr(struct ata_port *ap, u8 irq_stat)
 {
     struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
-    int handled;
 
     /* freeze if hotplugged */
     if (unlikely(irq_stat & (NV_INT_ADDED | NV_INT_REMOVED))) {
@@ -719,13 +718,7 @@ static int nv_host_intr(struct ata_port 
     }
 
     /* handle interrupt */
-    handled = ata_host_intr(ap, qc);
-    if (unlikely(!handled)) {
-        /* spurious, clear it */
-        ata_check_status(ap);
-    }
-
-    return 1;
+    return ata_host_intr(ap, qc);
 }
 
 static irqreturn_t nv_adma_interrupt(int irq, void *dev_instance)
@@ -752,6 +745,11 @@ static irqreturn_t nv_adma_interrupt(int
             if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
                 u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
                     >> (NV_INT_PORT_SHIFT * i);
+                if(ata_tag_valid(ap->active_tag))
+                    /** NV_INT_DEV indication seems unreliable at times
+                        at least in ADMA mode. Force it on always when a
+                        command is active, to prevent losing interrupts. */
+                    irq_stat |= NV_INT_DEV;
                 handled += nv_host_intr(ap, irq_stat);
                 continue;
             }


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Tejun Heo

Hello,

Chr wrote:
> Ok, you won't believe this... I opened my case and rewired my drives... 
> And guess what, my second (aka the "good") HDD is now failing! 
> I guess, my mainboard has a (but maybe two, or three :( ) "bad" sata-port(s)!  

Or, you have a power-related problem.  Try to rewire the power lines or 
connect the hard drives to a separate power supply.  It's often useful to 
change one component at a time and watch which change the problem 
follows.  Anyway, you seem to be suffering transmission failures, not a 
driver problem.


Thanks.

--
tejun


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Robert Hancock

Björn Steinbrink wrote:
> On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > > All kernels were bad using that approach. So back to square 1. :/
> > >
> > > Björn
> >
> > OK guys, here's a new patch to try against 2.6.20-rc5:
> >
> > Right now when switching between ADMA mode and legacy mode (i.e. when 
> > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> > set the ADMA GO register bit appropriately and continue with no delay. 
> > It looks like in some cases the controller doesn't respond to this 
> > immediately, it takes some nanoseconds for the controller's status 
> > registers to reflect the change that was made. It's possible that if we 
> > were trying to issue commands during this time, the controller might not 
> > react properly. This patch adds some code to wait for the status 
> > register to change to the state we asked for before continuing.
>
> Just got two exceptions with your patch, none of the debug messages were
> issued.
>
> Björn

Hmm, another miss, apparently.. Has anyone tried removing these lines
from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?

    /* bail out if not our interrupt */
    if (!(irq_stat & NV_INT_DEV))
        return 0;

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/




Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Robert Hancock

Björn Steinbrink wrote:
> On 2007.01.21 23:08:11 +0100, Björn Steinbrink wrote:
> > On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > > Björn Steinbrink wrote:
> > > >All kernels were bad using that approach. So back to square 1. :/
> > > >
> > > >Björn
> > > >
> > >
> > > OK guys, here's a new patch to try against 2.6.20-rc5:
> > >
> > > Right now when switching between ADMA mode and legacy mode (i.e. when
> > > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > > set the ADMA GO register bit appropriately and continue with no delay.
> > > It looks like in some cases the controller doesn't respond to this
> > > immediately, it takes some nanoseconds for the controller's status
> > > registers to reflect the change that was made. It's possible that if we
> > > were trying to issue commands during this time, the controller might not
> > > react properly. This patch adds some code to wait for the status
> > > register to change to the state we asked for before continuing.
> >
> > I went for the "I feel lucky" route and did just add mmio reads after the
> > mmio writes, posting them. Rationale being that if it is a write posting
> > issue, the debug patch would/could actually hide it AFAICT.
> > It's the "I feel lucky" route, because my whole "knowledge" about mmio
> > and write posting originates from the few things I read up on when you
> > discovered the comment about write posting in the generic ata code.
>
> Uhm, yeah, exception occurred about the time that I hit "send".
>
> Björn


Yeah, I don't think just adding reads to flush posted writes is enough
here. It seems to need more delay than that, and the controller also wasn't
always in the idle state even before we wrote the register.




Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >All kernels were bad using that approach. So back to square 1. :/
> >
> >Björn
> >
> 
> OK guys, here's a new patch to try against 2.6.20-rc5:
> 
> Right now when switching between ADMA mode and legacy mode (i.e. when 
> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> set the ADMA GO register bit appropriately and continue with no delay. 
> It looks like in some cases the controller doesn't respond to this 
> immediately, it takes some nanoseconds for the controller's status 
> registers to reflect the change that was made. It's possible that if we 
> were trying to issue commands during this time, the controller might not 
> react properly. This patch adds some code to wait for the status 
> register to change to the state we asked for before continuing.

Just got two exceptions with your patch, none of the debug messages were
issued.

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 23:08:11 +0100, Björn Steinbrink wrote:
> On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >All kernels were bad using that approach. So back to square 1. :/
> > >
> > >Björn
> > >
> > 
> > OK guys, here's a new patch to try against 2.6.20-rc5:
> > 
> > Right now when switching between ADMA mode and legacy mode (i.e. when 
> > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> > set the ADMA GO register bit appropriately and continue with no delay. 
> > It looks like in some cases the controller doesn't respond to this 
> > immediately, it takes some nanoseconds for the controller's status 
> > registers to reflect the change that was made. It's possible that if we 
> > were trying to issue commands during this time, the controller might not 
> > react properly. This patch adds some code to wait for the status 
> > register to change to the state we asked for before continuing.
> 
> I went for the "I feel lucky" route and did just add mmio reads after the
> mmio writes, posting them. Rationale being that if it is a write posting
> issue, the debug patch would/could actually hide it AFAICT.
> It's the "I feel lucky" route, because my whole "knowledge" about mmio
> and write posting originates from the few things I read up on when you
> discovered the comment about write posting in the generic ata code.

Uhm, yeah, exception occurred about the time that I hit "send".

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >All kernels were bad using that approach. So back to square 1. :/
> >
> >Björn
> >
> 
> OK guys, here's a new patch to try against 2.6.20-rc5:
> 
> Right now when switching between ADMA mode and legacy mode (i.e. when 
> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> set the ADMA GO register bit appropriately and continue with no delay. 
> It looks like in some cases the controller doesn't respond to this 
> immediately, it takes some nanoseconds for the controller's status 
> registers to reflect the change that was made. It's possible that if we 
> were trying to issue commands during this time, the controller might not 
> react properly. This patch adds some code to wait for the status 
> register to change to the state we asked for before continuing.

I went for the "I feel lucky" route and did just add mmio reads after the
mmio writes, posting them. Rationale being that if it is a write posting
issue, the debug patch would/could actually hide it AFAICT.
It's the "I feel lucky" route, because my whole "knowledge" about mmio
and write posting originates from the few things I read up on when you
discovered the comment about write posting in the generic ata code.

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Chr
On Sunday, 21. January 2007 19:01, Björn Steinbrink wrote:
> On 2007.01.21 18:34:40 +0100, Chr wrote:
>
> I run those two in parallel:
> while /bin/true; do ls -lR / > /dev/null 2>&1; done
> while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done
>
> Not sure if running them in parallel is necessary, but I don't want to
> change the test setup ;) Takes between 1 and 40 minutes to trigger it.
> Most of the time it's around 15 minutes now, doing more random stuff in
> addition to that seems to trigger it even easier (like reading mail,
> rebuilding the kernel etc.).
>
> I'm down to 2 commits after 2.6.19 now, only bad kernels, so I tend to
> say that 2.6.19 with 2.6.20-rc5's sata_nv.c will also fail for me, but I
> thought I might finish bisection just to be sure.
>
> > But, this time it looks slightly different:
> > ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> > ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
> >
> > [Rest of the error message + SMART error snipped]
>
> I get the same exception every time, doesn't change for me. And neither
> do I get any SMART errors or something.
>
> Thanks,
> Björn

OK, you won't believe this... I opened my case and rewired my drives,
and guess what: my second (aka the "good") HDD is now failing!
I guess my mainboard has one (or maybe two or three :( ) "bad" SATA port(s)!

But one small question remains: when I opened my case, I saw that my drives
are plugged into SATA jacks 1 and 2. The BIOS also says they're on 1 and 2.
Now Linux says they're on ports 3 & 4!



it's always ata3.00!
"ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: tag 0 cmd 0xea Emask 0x4 stat 0x40 err 0x0 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133
ata3: EH complete
SCSI device sda: 490234752 512-byte hdwr sectors (251000 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back"


Thanks,
Chr.


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Robert Hancock

Björn Steinbrink wrote:
> All kernels were bad using that approach. So back to square 1. :/
>
> Björn



OK guys, here's a new patch to try against 2.6.20-rc5:

Right now when switching between ADMA mode and legacy mode (i.e. when 
going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
set the ADMA GO register bit appropriately and continue with no delay. 
It looks like in some cases the controller doesn't respond to this 
immediately, it takes some nanoseconds for the controller's status 
registers to reflect the change that was made. It's possible that if we 
were trying to issue commands during this time, the controller might not 
react properly. This patch adds some code to wait for the status 
register to change to the state we asked for before continuing.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c	2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c	2007-01-21 13:35:17.0 -0600
@@ -509,14 +509,38 @@ static void nv_adma_register_mode(struct
 {
 	void __iomem *mmio = nv_adma_ctl_block(ap);
 	struct nv_adma_port_priv *pp = ap->private_data;
-	u16 tmp;
+	u16 tmp, status;
+	int count = 0;
 
 	if (pp->flags & NV_ADMA_PORT_REGISTER_MODE)
 		return;
 
+	status = readw(mmio + NV_ADMA_STAT);
+	while(!(status & NV_ADMA_STAT_IDLE) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA IDLE, stat=0x%hx\n",
+			status);
+
 	tmp = readw(mmio + NV_ADMA_CTL);
 	writew(tmp & ~NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);
 
+	count = 0;
+	status = readw(mmio + NV_ADMA_STAT);
+	while(!(status & NV_ADMA_STAT_LEGACY) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			 "timeout waiting for ADMA LEGACY, stat=0x%hx\n",
+			 status);
+
 	pp->flags |= NV_ADMA_PORT_REGISTER_MODE;
 }
 
@@ -524,7 +548,8 @@ static void nv_adma_mode(struct ata_port
 {
 	void __iomem *mmio = nv_adma_ctl_block(ap);
 	struct nv_adma_port_priv *pp = ap->private_data;
-	u16 tmp;
+	u16 tmp, status;
+	int count = 0;
 
 	if (!(pp->flags & NV_ADMA_PORT_REGISTER_MODE))
 		return;
@@ -534,6 +559,18 @@ static void nv_adma_mode(struct ata_port
 	tmp = readw(mmio + NV_ADMA_CTL);
 	writew(tmp | NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);
 
+	status = readw(mmio + NV_ADMA_STAT);
+	while(((status & NV_ADMA_STAT_LEGACY) ||
+	      !(status & NV_ADMA_STAT_IDLE)) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA LEGACY clear and IDLE, stat=0x%hx\n",
+			status);
+
 	pp->flags &= ~NV_ADMA_PORT_REGISTER_MODE;
 }
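
For illustration only, and not part of the posted patch: the diff repeats
one pattern three times. It reads NV_ADMA_STAT and, while the wanted bits
are not there, calls ndelay(50) and re-reads, giving up after 20 tries
(about 1 us of added delay, not counting the register reads themselves).
A minimal user-space sketch of that pattern as a single helper could look
like the code below. Everything in it is an assumption made for the
example: struct fake_reg, fake_readw(), poll_status() and the SKETCH_STAT_*
bit values are hypothetical stand-ins, not names or values taken from
sata_nv.c. One deliberate difference: the posted loops warn whenever count
reaches 20, even if the 20th re-read returned the wanted state, while the
sketch decides on the final status instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for the example only; NOT the driver's constants. */
enum {
	SKETCH_STAT_IDLE   = 0x0100,
	SKETCH_STAT_LEGACY = 0x0200
};

/*
 * Hypothetical stand-in for a memory-mapped status register: it reports
 * `pending` until it has been read `latency` times, then reports `settled`.
 */
struct fake_reg {
	uint16_t pending;
	uint16_t settled;
	int latency;
	int reads;
};

/* Simulated register read; a real driver would call readw() here. */
static uint16_t fake_readw(struct fake_reg *r)
{
	uint16_t v = (r->reads >= r->latency) ? r->settled : r->pending;

	r->reads++;
	return v;
}

/*
 * Poll until (status & mask) == want, re-reading at most max_polls times
 * after the initial read. The last value read is stored in *last.
 * Returns true if the wanted state was observed, false on timeout.
 */
static bool poll_status(struct fake_reg *r, uint16_t mask, uint16_t want,
			int max_polls, uint16_t *last)
{
	uint16_t status = fake_readw(r);
	int count = 0;

	assert(max_polls >= 0);
	while ((status & mask) != want && count < max_polls) {
		/* the posted patch calls ndelay(50) at this point */
		status = fake_readw(r);
		count++;
	}
	*last = status;
	return (status & mask) == want;
}
```

With mask = IDLE | LEGACY and want = IDLE the helper expresses the
nv_adma_mode() wait (LEGACY clear and IDLE set); with mask = want = LEGACY
it expresses the second wait in nv_adma_register_mode().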
 


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 09:36:18 +0100, Björn Steinbrink wrote:
> On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> > >>Robert Hancock wrote:
> > >>>change in 2.6.20-rc is either causing or triggering this problem. It 
> > >>>would be useful if you could try git bisect between 2.6.19 and 
> > >>>2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that 
> > >>
> > >>Yes, 'git bisect' would be the next step in figuring out this puzzle.
> > >>
> > >>Anybody up for it?
> > >
> > >I'll go for it, but could I get an explanation how that could lead to a
> > >different result than my last bisection? I see the difference of keeping
> > >sata_nv.c but my brain can't wrap around it right now (woke up in the
> > >middle of the night and still not up to speed...).
> > 
> > Whatever the problem is, only seems to show up when ADMA is enabled, and 
> > so the patch that added ADMA support shows up as the culprit from your 
> > git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA 
> > support added in doesn't seem to have the problem, so presumably 
> > something else that changed in the 2.6.20-rc series is triggering it. 
> > Doing a bisect while keeping the driver code itself the same will 
> > hopefully identify what that change is..
> 
> Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
> 
> Up to now, I only got bad kernels, latest tested being:
> 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
> 
> Which, unless I missed a commit in the diff, only USB changes,
> continuing anyway.

All kernels were bad using that approach. So back to square 1. :/

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 18:34:40 +0100, Chr wrote:
> On Sunday, 21. January 2007 09:36, Björn Steinbrink wrote:
> > On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> >
> > Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
> >
> > Up to now, I only got bad kernels, latest tested being:
> > 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
> >
> > Which, unless I missed a commit in the diff, only USB changes,
> > continuing anyway.
> >
> > Just to make sure, here's my little helper for this bisect run, I hope
> > it does what you expected:
> >
> > #!/bin/bash
> > cp ../sata_nv.c.orig drivers/ata/sata_nv.c
> > git bisect good
> > cp drivers/ata/sata_nv.c ../sata_nv.c.orig
> > cp ../sata_nv.c drivers/ata/
> > make oldconfig
> > make -j4
> >
> > Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done
> > to avoid conflicts and keep git happy. Of course there's also a version
> > for bad kernels ;) No idea, why I didn't make that an argument to the
> > script...
> >
> > Thanks,
> > Björn
> 
> Ar, 2.6.19 (with 2.6.20-rc5 adma stuff) is affected too (BTW, what do you 
> do to trigger the exceptions? Because, it takes hours to "reproduces" this
> silly *).

I run those two in parallel:
while /bin/true; do ls -lR / > /dev/null 2>&1; done
while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done

Not sure if running them in parallel is necessary, but I don't want to
change the test setup ;) Takes between 1 and 40 minutes to trigger it.
Most of the time it's around 15 minutes now, doing more random stuff in
addition to that seems to trigger it even easier (like reading mail,
rebuilding the kernel etc.).

I'm down to 2 commits after 2.6.19 now, only bad kernels, so I tend to
say that 2.6.19 with 2.6.20-rc5's sata_nv.c will also fail for me, but I
thought I might finish bisection just to be sure.

> But, this time it looks slightly different:
> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)

> [Rest of the error message + SMART error snipped]

I get the same exception every time, doesn't change for me. And neither
do I get any SMART errors or something.

Thanks,
Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Chr
On Sunday, 21. January 2007 09:36, Björn Steinbrink wrote:
> On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
>
> Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
>
> Up to now, I only got bad kernels, latest tested being:
> 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
>
> Which, unless I missed a commit in the diff, only USB changes,
> continuing anyway.
>
> Just to make sure, here's my little helper for this bisect run, I hope
> it does what you expected:
>
> #!/bin/bash
> cp ../sata_nv.c.orig drivers/ata/sata_nv.c
> git bisect good
> cp drivers/ata/sata_nv.c ../sata_nv.c.orig
> cp ../sata_nv.c drivers/ata/
> make oldconfig
> make -j4
>
> Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done
> to avoid conflicts and keep git happy. Of course there's also a version
> for bad kernels ;) No idea, why I didn't make that an argument to the
> script...
>
> Thanks,
> Björn

Ah, 2.6.19 (with the 2.6.20-rc5 ADMA stuff) is affected too. (BTW, what do
you do to trigger the exceptions? It takes me hours to reproduce this
silly *.)

But, this time it looks slightly different:
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
!!!
ata3.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x1)
ata3.00: revalidation failed (errno=-5)
ata3: failed to recover some devices, retrying in 5 secs
!!!
ata3: hard resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133
ata3: EH complete
SCSI device sda: 488395055 512-byte hdwr sectors (250058 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back


Oh, and I got this nice SMART Error: 

ID# ATTRIBUTE_NAME        FLAG          RAW VALUE
199 UDMA_CRC_Error_Count  0x003e  ... - 12

SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 5603 hours (233 days + 11 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 3f 00 00 00 af

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  91 00 3f 00 00 00 0f 00  05:30:59.655  INITIALIZE DEVICE PARAMETERS [OBS-6]
  ec 00 01 01 00 00 00 00  05:30:59.654  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00  05:30:56.191  IDENTIFY DEVICE
  ca 00 28 02 ee 9a 0c 00  05:30:56.190  WRITE DMA
  ca 00 10 e8 4c 10 0a 00  05:30:56.190  WRITE DMA


Maybe, it's really the HDD!

OT: "http://www.nvidia.com/object/680i_hotfix.html"


Chr.


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> >>Robert Hancock wrote:
> >>>change in 2.6.20-rc is either causing or triggering this problem. It 
> >>>would be useful if you could try git bisect between 2.6.19 and 
> >>>2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that 
> >>
> >>Yes, 'git bisect' would be the next step in figuring out this puzzle.
> >>
> >>Anybody up for it?
> >
> >I'll go for it, but could I get an explanation how that could lead to a
> >different result than my last bisection? I see the difference of keeping
> >sata_nv.c but my brain can't wrap around it right now (woke up in the
> >middle of the night and still not up to speed...).
> 
> Whatever the problem is, only seems to show up when ADMA is enabled, and 
> so the patch that added ADMA support shows up as the culprit from your 
> git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA 
> support added in doesn't seem to have the problem, so presumably 
> something else that changed in the 2.6.20-rc series is triggering it. 
> Doing a bisect while keeping the driver code itself the same will 
> hopefully identify what that change is..

Ah, right... sata_nv.c of course interacts with the outside world, d'oh!

Up to now, I only got bad kernels, latest tested being:
94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576

Which, unless I missed a commit in the diff, only USB changes,
continuing anyway.

Just to make sure, here's my little helper for this bisect run, I hope
it does what you expected:

#!/bin/bash
cp ../sata_nv.c.orig drivers/ata/sata_nv.c
git bisect good
cp drivers/ata/sata_nv.c ../sata_nv.c.orig
cp ../sata_nv.c drivers/ata/
make oldconfig
make -j4

Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done
to avoid conflicts and keep git happy. Of course there's also a version
for bad kernels ;) No idea, why I didn't make that an argument to the
script...

Thanks,
Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 09:36:18 +0100, Björn Steinbrink wrote:
> On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > > On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> > > > Robert Hancock wrote:
> > > > > change in 2.6.20-rc is either causing or triggering this problem. It
> > > > > would be useful if you could try git bisect between 2.6.19 and
> > > > > 2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that
> > > >
> > > > Yes, 'git bisect' would be the next step in figuring out this puzzle.
> > > >
> > > > Anybody up for it?
> > >
> > > I'll go for it, but could I get an explanation how that could lead to a
> > > different result than my last bisection? I see the difference of keeping
> > > sata_nv.c but my brain can't wrap around it right now (woke up in the
> > > middle of the night and still not up to speed...).
> >
> > Whatever the problem is, only seems to show up when ADMA is enabled, and
> > so the patch that added ADMA support shows up as the culprit from your
> > git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA
> > support added in doesn't seem to have the problem, so presumably
> > something else that changed in the 2.6.20-rc series is triggering it.
> > Doing a bisect while keeping the driver code itself the same will
> > hopefully identify what that change is..
>
> Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
>
> Up to now, I only got bad kernels, latest tested being:
> 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
>
> Which, unless I missed a commit in the diff, only USB changes,
> continuing anyway.

All kernels were bad using that approach. So back to square 1. :/

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Robert Hancock

Björn Steinbrink wrote:
> All kernels were bad using that approach. So back to square 1. :/
>
> Björn

OK guys, here's a new patch to try against 2.6.20-rc5:

Right now when switching between ADMA mode and legacy mode (i.e. when 
going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
set the ADMA GO register bit appropriately and continue with no delay. 
It looks like in some cases the controller doesn't respond to this 
immediately, it takes some nanoseconds for the controller's status 
registers to reflect the change that was made. It's possible that if we 
were trying to issue commands during this time, the controller might not 
react properly. This patch adds some code to wait for the status 
register to change to the state we asked for before continuing.
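
As a rough illustration of the wait described above, here is a minimal userspace sketch of a bounded status poll. It is not the driver code: `fake_ctrl`, `fake_readw`, `wait_for_status` and the `STAT_*` bit values are hypothetical stand-ins, and the simulated controller simply reports the new state after a fixed number of stale reads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status bits; the real NV_ADMA_STAT_* values live in sata_nv.c. */
#define STAT_IDLE   (1u << 0)
#define STAT_LEGACY (1u << 1)

/* Simulated controller: shows `target` only after `latency` stale reads. */
struct fake_ctrl {
    uint16_t stat;    /* value returned until the change becomes visible */
    uint16_t target;  /* value returned afterwards */
    int latency;      /* number of stale reads before the change shows up */
    int reads;        /* reads performed so far */
};

static uint16_t fake_readw(struct fake_ctrl *c)
{
    if (c->reads++ >= c->latency)
        c->stat = c->target;
    return c->stat;
}

/*
 * Poll until every bit in `set` is set and every bit in `clear` is clear,
 * giving up after `max_polls` re-reads. The last value read is stored in
 * *last so the caller can log it. Returns true if the state was reached.
 */
static bool wait_for_status(struct fake_ctrl *c, uint16_t set, uint16_t clear,
                            int max_polls, uint16_t *last)
{
    uint16_t status = fake_readw(c);
    int count = 0;

    while (((status & set) != set || (status & clear) != 0) &&
           count < max_polls) {
        /* a driver would insert a short delay here, e.g. ndelay(50) */
        status = fake_readw(c);
        count++;
    }
    *last = status;
    return (status & set) == set && (status & clear) == 0;
}
```

One design choice in this sketch: success is decided from the final status value rather than from the loop counter, so a controller that reaches the requested state exactly on the last poll is not reported as a timeout.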


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c	2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c	2007-01-21 13:35:17.0 -0600
@@ -509,14 +509,38 @@ static void nv_adma_register_mode(struct
 {
 	void __iomem *mmio = nv_adma_ctl_block(ap);
 	struct nv_adma_port_priv *pp = ap->private_data;
-	u16 tmp;
+	u16 tmp, status;
+	int count = 0;
 
 	if (pp->flags & NV_ADMA_PORT_REGISTER_MODE)
 		return;
 
+	status = readw(mmio + NV_ADMA_STAT);
+	while(!(status & NV_ADMA_STAT_IDLE) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA IDLE, stat=0x%hx\n",
+			status);
+
 	tmp = readw(mmio + NV_ADMA_CTL);
 	writew(tmp & ~NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);
 
+	count = 0;
+	status = readw(mmio + NV_ADMA_STAT);
+	while(!(status & NV_ADMA_STAT_LEGACY) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			 "timeout waiting for ADMA LEGACY, stat=0x%hx\n",
+			 status);
+
 	pp->flags |= NV_ADMA_PORT_REGISTER_MODE;
 }
 
@@ -524,7 +548,8 @@ static void nv_adma_mode(struct ata_port
 {
 	void __iomem *mmio = nv_adma_ctl_block(ap);
 	struct nv_adma_port_priv *pp = ap->private_data;
-	u16 tmp;
+	u16 tmp, status;
+	int count = 0;
 
 	if (!(pp->flags & NV_ADMA_PORT_REGISTER_MODE))
 		return;
@@ -534,6 +559,18 @@ static void nv_adma_mode(struct ata_port
 	tmp = readw(mmio + NV_ADMA_CTL);
 	writew(tmp | NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);
 
+	status = readw(mmio + NV_ADMA_STAT);
+	while(((status & NV_ADMA_STAT_LEGACY) ||
+	      !(status & NV_ADMA_STAT_IDLE)) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA LEGACY clear and IDLE, stat=0x%hx\n",
+			status);
+
 	pp->flags &= ~NV_ADMA_PORT_REGISTER_MODE;
 }
 


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Chr
On Sunday, 21. January 2007 19:01, Björn Steinbrink wrote:
> On 2007.01.21 18:34:40 +0100, Chr wrote:
>
> I run those two in parallel:
> while /bin/true; do ls -lR / > /dev/null 2>&1; done
> while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done
>
> Not sure if running them in parallel is necessary, but I don't want to
> change the test setup ;) Takes between 1 and 40 minutes to trigger it.
> Most of the time it's around 15 minutes now, doing more random stuff in
> addition to that seems to trigger it even easier (like reading mail,
> rebuilding the kernel etc.).
>
> I'm down to 2 commits after 2.6.19 now, only bad kernels, so I tend to
> say that 2.6.19 with 2.6.20-rc5's sata_nv.c will also fail for me, but I
> thought I might finish bisection just to be sure.
>
> > But, this time it looks slightly different:
> > ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> > ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
> >
> > [Rest of the error message + SMART error snipped]
>
> I get the same exception every time, doesn't change for me. And neither
> do I get any SMART errors or something.
>
> Thanks,
> Björn

Ok, you won't believe this... I opened my case and rewired my drives... 
And guess what, my second (aka the good) HDD is now failing! 
I guess, my mainboard has a (but maybe two, or three :( ) bad sata-port(s)!  

But one small question remains: when I opened my case, I saw that my drives
are plugged into SATA jacks 1 and 2... The BIOS also says they're on 1 and 2.
Now, Linux says they're on ports 3 & 4!



it's always ata3.00!
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: tag 0 cmd 0xea Emask 0x4 stat 0x40 err 0x0 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133
ata3: EH complete
SCSI device sda: 490234752 512-byte hdwr sectors (251000 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
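
For anyone decoding the exception lines in this thread by hand, here is a small sketch that names the command opcodes seen so far and checks the timeout bit. The opcode values come from the ATA command set and the timeout bit matches libata's AC_ERR_TIMEOUT; the helper names `ata_cmd_name` and `timed_out_while_ready` are made up for this illustration.

```c
#include <stdint.h>

/* libata error-mask bit for a command that timed out (AC_ERR_TIMEOUT). */
#define EMASK_TIMEOUT (1u << 2)

/* ATA status register bits. */
#define ATA_STAT_BSY  0x80u  /* device busy */
#define ATA_STAT_DRDY 0x40u  /* device ready */
#define ATA_STAT_ERR  0x01u  /* error flagged by the device */

/* Name the ATA opcodes that appear in the "cmd 0x.." field of the logs. */
static const char *ata_cmd_name(uint8_t cmd)
{
    switch (cmd) {
    case 0xe7: return "FLUSH CACHE";
    case 0xea: return "FLUSH CACHE EXT";
    case 0xec: return "IDENTIFY DEVICE";
    default:   return "unknown";
    }
}

/*
 * True when the driver gave up waiting (timeout bit in Emask) although the
 * status register shows a ready device that is neither busy nor in error.
 */
static int timed_out_while_ready(unsigned int emask, uint8_t stat)
{
    return (emask & EMASK_TIMEOUT) != 0 &&
           (stat & ATA_STAT_DRDY) != 0 &&
           (stat & (ATA_STAT_BSY | ATA_STAT_ERR)) == 0;
}
```

With the values from the logs, `ata_cmd_name(0xea)` gives "FLUSH CACHE EXT" and `ata_cmd_name(0xec)` gives "IDENTIFY DEVICE", and `timed_out_while_ready(0x4, 0x40)` is true: the drive looks ready, but the command never completed as far as the driver could tell.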


Thanks,
Chr.


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> > All kernels were bad using that approach. So back to square 1. :/
> >
> > Björn
>
> OK guys, here's a new patch to try against 2.6.20-rc5:
>
> Right now when switching between ADMA mode and legacy mode (i.e. when
> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> set the ADMA GO register bit appropriately and continue with no delay.
> It looks like in some cases the controller doesn't respond to this
> immediately, it takes some nanoseconds for the controller's status
> registers to reflect the change that was made. It's possible that if we
> were trying to issue commands during this time, the controller might not
> react properly. This patch adds some code to wait for the status
> register to change to the state we asked for before continuing.

I went for the "I feel lucky" route and did just add mmio reads after the
mmio writes, posting them. Rationale being that if it is a write posting
issue, the debug patch would/could actually hide it AFAICT.
It's the "I feel lucky" route, because my whole knowledge about mmio
and write posting originates from the few things I read up on when you
discovered the comment about write posting in the generic ata code.
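
To make the read-after-write idea concrete, here is a toy model of a posted write. Everything in it is hypothetical (`posted_dev`, `posted_writew`, `flushing_readw`, `device_view`); it only mimics the ordering rule that a read from a device cannot overtake an earlier write to it, and it does not touch real hardware.

```c
#include <stdint.h>

/* Toy device: a write sits in a posting buffer until something flushes it. */
struct posted_dev {
    uint16_t reg;      /* value the device actually holds */
    uint16_t pending;  /* write still in flight on the bus */
    int has_pending;   /* non-zero while a write is buffered */
};

/* The CPU-side write returns at once; the device has not seen it yet. */
static void posted_writew(struct posted_dev *d, uint16_t v)
{
    d->pending = v;
    d->has_pending = 1;
}

/* A read from the same device forces the buffered write to land first. */
static uint16_t flushing_readw(struct posted_dev *d)
{
    if (d->has_pending) {
        d->reg = d->pending;
        d->has_pending = 0;
    }
    return d->reg;
}

/* What the device itself holds right now, without generating bus traffic. */
static uint16_t device_view(const struct posted_dev *d)
{
    return d->reg;
}
```

In this model `device_view()` still shows the old value right after `posted_writew()` and only changes once `flushing_readw()` has run. Note that a read-back of this kind only guarantees the write has arrived; it says nothing about how long the controller then takes to act on it.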

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 23:08:11 +0100, Björn Steinbrink wrote:
> On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > > All kernels were bad using that approach. So back to square 1. :/
> > >
> > > Björn
> >
> > OK guys, here's a new patch to try against 2.6.20-rc5:
> >
> > Right now when switching between ADMA mode and legacy mode (i.e. when
> > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > set the ADMA GO register bit appropriately and continue with no delay.
> > It looks like in some cases the controller doesn't respond to this
> > immediately, it takes some nanoseconds for the controller's status
> > registers to reflect the change that was made. It's possible that if we
> > were trying to issue commands during this time, the controller might not
> > react properly. This patch adds some code to wait for the status
> > register to change to the state we asked for before continuing.
>
> I went for the "I feel lucky" route and did just add mmio reads after the
> mmio writes, posting them. Rationale being that if it is a write posting
> issue, the debug patch would/could actually hide it AFAICT.
> It's the "I feel lucky" route, because my whole knowledge about mmio
> and write posting originates from the few things I read up on when you
> discovered the comment about write posting in the generic ata code.

Uhm, yeah, exception occurred about the time that I hit send.

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Robert Hancock

Björn Steinbrink wrote:
> On 2007.01.21 23:08:11 +0100, Björn Steinbrink wrote:
> > On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > > Björn Steinbrink wrote:
> > > > All kernels were bad using that approach. So back to square 1. :/
> > > >
> > > > Björn
> > >
> > > OK guys, here's a new patch to try against 2.6.20-rc5:
> > >
> > > Right now when switching between ADMA mode and legacy mode (i.e. when
> > > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > > set the ADMA GO register bit appropriately and continue with no delay.
> > > It looks like in some cases the controller doesn't respond to this
> > > immediately, it takes some nanoseconds for the controller's status
> > > registers to reflect the change that was made. It's possible that if we
> > > were trying to issue commands during this time, the controller might not
> > > react properly. This patch adds some code to wait for the status
> > > register to change to the state we asked for before continuing.
> >
> > I went for the "I feel lucky" route and did just add mmio reads after the
> > mmio writes, posting them. Rationale being that if it is a write posting
> > issue, the debug patch would/could actually hide it AFAICT.
> > It's the "I feel lucky" route, because my whole knowledge about mmio
> > and write posting originates from the few things I read up on when you
> > discovered the comment about write posting in the generic ata code.
>
> Uhm, yeah, exception occurred about the time that I hit send.
>
> Björn

Yeah, I don't think just adding reads to flush posted writes is enough 
here - it seems to need more delay than that, and it also wasn't always 
in the idle state even before we would write the register..




Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink
On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> > All kernels were bad using that approach. So back to square 1. :/
> >
> > Björn
>
> OK guys, here's a new patch to try against 2.6.20-rc5:
>
> Right now when switching between ADMA mode and legacy mode (i.e. when
> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> set the ADMA GO register bit appropriately and continue with no delay.
> It looks like in some cases the controller doesn't respond to this
> immediately, it takes some nanoseconds for the controller's status
> registers to reflect the change that was made. It's possible that if we
> were trying to issue commands during this time, the controller might not
> react properly. This patch adds some code to wait for the status
> register to change to the state we asked for before continuing.

Just got two exceptions with your patch, none of the debug messages were
issued.

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Robert Hancock

Björn Steinbrink wrote:
> On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > > All kernels were bad using that approach. So back to square 1. :/
> > >
> > > Björn
> >
> > OK guys, here's a new patch to try against 2.6.20-rc5:
> >
> > Right now when switching between ADMA mode and legacy mode (i.e. when
> > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > set the ADMA GO register bit appropriately and continue with no delay.
> > It looks like in some cases the controller doesn't respond to this
> > immediately, it takes some nanoseconds for the controller's status
> > registers to reflect the change that was made. It's possible that if we
> > were trying to issue commands during this time, the controller might not
> > react properly. This patch adds some code to wait for the status
> > register to change to the state we asked for before continuing.
>
> Just got two exceptions with your patch, none of the debug messages were
> issued.
>
> Björn

Hmm, another miss, apparently.. Has anyone tried removing these lines
from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?

	/* bail out if not our interrupt */
	if (!(irq_stat & NV_INT_DEV))
		return 0;

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/




Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Tejun Heo

Hello,

Chr wrote:
> Ok, you won't believe this... I opened my case and rewired my drives...
> And guess what, my second (aka the good) HDD is now failing!
> I guess, my mainboard has a (but maybe two, or three :( ) bad sata-port(s)!


Or you have a power-related problem.  Try rewiring the power lines or
connecting the hard drives to a separate power supply.  It's often useful to
change one component at a time and watch which change the problem
follows.  Anyway, you seem to be suffering transmission failures, not a
driver problem.


Thanks.

--
tejun


Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Robert Hancock

Björn Steinbrink wrote:
> On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> > Robert Hancock wrote:
> > > change in 2.6.20-rc is either causing or triggering this problem. It
> > > would be useful if you could try git bisect between 2.6.19 and
> > > 2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that
> >
> > Yes, 'git bisect' would be the next step in figuring out this puzzle.
> >
> > Anybody up for it?
>
> I'll go for it, but could I get an explanation how that could lead to a
> different result than my last bisection? I see the difference of keeping
> sata_nv.c but my brain can't wrap around it right now (woke up in the
> middle of the night and still not up to speed...).

Whatever the problem is, only seems to show up when ADMA is enabled, and 
so the patch that added ADMA support shows up as the culprit from your 
git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA 
support added in doesn't seem to have the problem, so presumably 
something else that changed in the 2.6.20-rc series is triggering it. 
Doing a bisect while keeping the driver code itself the same will 
hopefully identify what that change is..



Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Björn Steinbrink
On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> Robert Hancock wrote:
> >change in 2.6.20-rc is either causing or triggering this problem. It 
> >would be useful if you could try git bisect between 2.6.19 and 
> >2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that 
> 
> 
> Yes, 'git bisect' would be the next step in figuring out this puzzle.
> 
> Anybody up for it?

I'll go for it, but could I get an explanation how that could lead to a
different result than my last bisection? I see the difference of keeping
sata_nv.c but my brain can't wrap around it right now (woke up in the
middle of the night and still not up to speed...).

Thanks,
Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Jeff Garzik

Robert Hancock wrote:
> change in 2.6.20-rc is either causing or triggering this problem. It
> would be useful if you could try git bisect between 2.6.19 and
> 2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that



Yes, 'git bisect' would be the next step in figuring out this puzzle.

Anybody up for it?

Jeff




Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Robert Hancock

Chr wrote:
> > Could you (or anyone else) test what happens if you take the 2.6.20-rc5
> > version of sata_nv.c and try it on 2.6.19? That would tell us whether
> > it's this change or whether it's something else (i.e. in libata core).
>
> Ok, did that! (got a fresh 2.6.19 tar ball, and used 2.6.20-rc5' sata_nv.c
> with the oneliner in libata_sff.c)
>
> And surprise after one hour uptime, there is not even one
> sata exceptions in dmesg! (I'll report back tomorrow...)


That is interesting, indeed.. If that holds up then I assume some other 
change in 2.6.20-rc is either causing or triggering this problem. It 
would be useful if you could try git bisect between 2.6.19 and 
2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that 
gives any indication. If not, just trying some of the different 
2.6.20-rcX versions may be useful.


Before that, though, can you try making this change I suggested below in 
2.6.20-rc5 and see if the problem still shows up?





> > Assuming that still doesn't work, can you then try removing these lines
> > from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> >
> >	/* bail out if not our interrupt */
> >	if (!(irq_stat & NV_INT_DEV))
> >		return 0;
> >
> > as that's the difference I'm most suspicious of causing the problem.





Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Chr
On Saturday, 20. January 2007 20:59, you wrote:
> Ian Kumlien wrote:
> > Hi,
> >
> > I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and ADMA
> > enabled, to 2.6.20-rc5, which gave me problems almost instantly.
> >
> > I just thought that it might be interesting to know that it DID work
> > nicely.
> >
> > CC since i'm not on the ml
>
> (I'm ccing more of the people who reported this)
>
> Well that's interesting.. The only significant change that went into
> 2.6.20-rc5 in that driver that wasn't in that version you mentioned was
> this one:
>
> http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861
>
> Could you (or anyone else) test what happens if you take the 2.6.20-rc5
> version of sata_nv.c and try it on 2.6.19? That would tell us whether
> it's this change or whether it's something else (i.e. in libata core).

Ok, did that! (got a fresh 2.6.19 tar ball, and used 2.6.20-rc5' sata_nv.c
with the oneliner in libata_sff.c)

And surprise after one hour uptime, there is not even one 
sata exceptions in dmesg! (I'll report back tomorrow...)

>
> Assuming that still doesn't work, can you then try removing these lines
> from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
>
>   /* bail out if not our interrupt */
>   if (!(irq_stat & NV_INT_DEV))
>   return 0;
>
> as that's the difference I'm most suspicious of causing the problem.


Linux version 2.6.19test ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #2 SMP PREEMPT Sat Jan 20 22:19:20 CET 2007
Command line: root=/dev/md1 ro
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009f800 (usable)
 BIOS-e820: 0009f800 - 000a (reserved)
 BIOS-e820: 000f - 0010 (reserved)
 BIOS-e820: 0010 - 7fff (usable)
 BIOS-e820: 7fff - 7fff3000 (ACPI NVS)
 BIOS-e820: 7fff3000 - 8000 (ACPI data)
 BIOS-e820: e000 - f000 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 524272) 1 entries of 256 used
end_pfn_map = 1048576
DMI 2.3 present.
ACPI: RSDP (v000 Nvidia) @ 0x000f7d30
ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff3040
ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff30c0
ACPI: SSDT (v001 PTLTD  POWERNOW 0x0001  LTP 0x0001) @ 0x7fff9900
ACPI: SRAT (v001 AMDHAMMER   0x0001 AMD  0x0001) @ 0x7fff9b40
ACPI: MCFG (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff9c40
ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff9840
ACPI: DSDT (v001 NVIDIA AWRDACPI 0x1000 MSFT 0x010e) @ 0x
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 524272) 1 entries of 256 used
Zone PFN ranges:
  DMA             0 ->     4096
  DMA32        4096 ->  1048576
  Normal    1048576 ->  1048576
early_node_map[2] active PFN ranges
    0:        0 ->      159
    0:      256 ->   524272
On node 0 totalpages: 524175
  DMA zone: 56 pages used for memmap
  DMA zone: 10 pages reserved
  DMA zone: 3933 pages, LIFO batch:0
  DMA32 zone: 7111 pages used for memmap
  DMA32 zone: 513065 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
Nvidia board detected. Ignoring ACPI timer override.
If you got timer trouble try acpi_use_timer_override
ACPI: PM-Timer IO Port: 0x4008
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 2, address 0xfec0, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: BIOS IRQ0 pin2 override ignored.
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 15 global_irq 15 high edge)
ACPI: IRQ9 used by override.
ACPI: IRQ14 used by override.
ACPI: IRQ15 used by override.
Setting APIC routing to physical flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 0009f000 - 000a
Nosave address range: 000a - 000f
Nosave address range: 000f - 0010
Allocating PCI resources starting at 8800 (gap: 8000:6000)
SMP: Allowing 2 CPUs, 0 hotplug CPUs
PERCPU: Allocating 32320 bytes of per cpu data
Built 1 zonelists.  Total pages: 516998
Kernel command line: root=/dev/md1 ro
Initializing CPU#0
PID hash table 

Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Ian Kumlien
On lör, 2007-01-20 at 21:43 +, Alistair John Strachan wrote:
> On Saturday 20 January 2007 19:59, Robert Hancock wrote:
> > Ian Kumlien wrote:
> > > Hi,
> > >
> > > > I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and ADMA
> > > enabled, to 2.6.20-rc5, which gave me problems almost instantly.
> > >
> > > I just thought that it might be interesting to know that it DID work
> > > nicely.
> > >
> > > CC since i'm not on the ml
> >
> > (I'm ccing more of the people who reported this)
> >
> > Well that's interesting.. The only significant change that went into
> > 2.6.20-rc5 in that driver that wasn't in that version you mentioned was
> > this one:
> >
> > > http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861
> >
> > Could you (or anyone else) test what happens if you take the 2.6.20-rc5
> > version of sata_nv.c and try it on 2.6.19? That would tell us whether
> > it's this change or whether it's something else (i.e. in libata core).
> 
> I'm still running an -rc5 kernel with ADMA switched off entirely and I can't 
> reproduce the problem. How is everybody else reproducing this?
> 
> I've been successful installing bonnie++, then going to a large XFS partition 
> and running "bonnie++ -u 1000:1000" and letting it run through, all defaults.
> 
> It doesn't cause the problem I was seeing in -rc5 with ADMA on, when I switch 
> ADMA off, so I think this is sufficient to fix it.

Eh? The whole point with that patch was to ADD ADMA support to sata_nv,
imho that is something we want to have and i have been running with ADMA
on on two computers since sata_nv-adma-ncq-v4 or 5 or so without
problems.

So, something has been introduced or been broken to cause this error,
wouldn't it be better to find the error introduced than to just totally
negate the patch in the first place?

I haven't had the energy to go through the patch that was found as
causing the problem yet... I don't know if i even have all the info
needed to make any form of educated guess but i'll give it a try when i
have the energy.

I really hope someone finds it before then =)

-- 
Ian Kumlien  -- http://pomac.netswarm.net




Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Alistair John Strachan
On Saturday 20 January 2007 19:59, Robert Hancock wrote:
> Ian Kumlien wrote:
> > Hi,
> >
> > I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and ADMA
> > enabled, to 2.6.20-rc5, which gave me problems almost instantly.
> >
> > I just thought that it might be interesting to know that it DID work
> > nicely.
> >
> > CC since i'm not on the ml
>
> (I'm ccing more of the people who reported this)
>
> Well that's interesting.. The only significant change that went into
> 2.6.20-rc5 in that driver that wasn't in that version you mentioned was
> this one:
>
> http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861
>
> Could you (or anyone else) test what happens if you take the 2.6.20-rc5
> version of sata_nv.c and try it on 2.6.19? That would tell us whether
> it's this change or whether it's something else (i.e. in libata core).

I'm still running an -rc5 kernel with ADMA switched off entirely and I can't 
reproduce the problem. How is everybody else reproducing this?

I've been successful installing bonnie++, then going to a large XFS partition 
and running "bonnie++ -u 1000:1000" and letting it run through, all defaults.

It doesn't cause the problem I was seeing in -rc5 with ADMA on, when I switch 
ADMA off, so I think this is sufficient to fix it.

Others have reported differently. Did you guys do:

[EMAIL PROTECTED]:~$ cat /proc/cmdline
root=/dev/sda1 ro sata_nv.adma=0

Or something similar? This is how Jeff suggested disabling ADMA and indeed the 
messages about its use disappear from dmesg.

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Robert Hancock

Ian Kumlien wrote:
> Hi,
>
> I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and ADMA
> enabled, to 2.6.20-rc5, which gave me problems almost instantly.
>
> I just thought that it might be interesting to know that it DID work
> nicely.
>
> CC since i'm not on the ml



(I'm ccing more of the people who reported this)

Well that's interesting.. The only significant change that went into 
2.6.20-rc5 in that driver that wasn't in that version you mentioned was 
this one:


http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861

Could you (or anyone else) test what happens if you take the 2.6.20-rc5 
version of sata_nv.c and try it on 2.6.19? That would tell us whether 
it's this change or whether it's something else (i.e. in libata core).


Assuming that still doesn't work, can you then try removing these lines 
from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?


	/* bail out if not our interrupt */
	if (!(irq_stat & NV_INT_DEV))
		return 0;

as that's the difference I'm most suspicious of causing the problem.

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Chr
On Saturday, 20. January 2007 03:41, Robert Hancock wrote:
> Alistair John Strachan wrote:
> > On Tuesday 16 January 2007 01:53, Jeff Garzik wrote:
> >> Robert Hancock wrote:
> >>> I'll try your stress test when I get a chance, but I doubt I'll run
> >>> into the same problem and I haven't seen any similar reports. Perhaps
> >>> it's some kind of weird timing issue or incompatibility between the
> >>> controller and that drive when running in ADMA mode? I seem to remember
> >>> various reports of issues with certain Maxtor drives and some nForce
> >>> SATA controllers under Windows at least..
> >>
> >> Just to eliminate things, has disabling ADMA been attempted?
> >>
> >> It can be disabled using the sata_nv.adma module parameter.
> >
> > Setting this option fixes the problem for me. I suggest that ADMA
> > defaults off in 2.6.20, if there's still time to do that.
>
> Can you guys that are having this problem try the attached debug patch?
> It's possible it will fix the problem, as I'm trying a private
> exec_command implementation that flushes the write by reading a
> controller register instead of reading altstatus from the drive like the
> libata core code does.
>
> If the problem still happens, I also added some more debugging in to
> help figure out what is going on, so please post full dmesg.
>
> By the way, I assume that you guys are using reiserfs or xfs, as it
> appears no other file systems issue flush commands automatically. I had
> to test this by "echo 1 > delete" on the SCSI disk in sysfs, as I am
> using ext3.
Yes, I've some reiserfs partitions, but I don't think it's reiserfs fault ;). 
Here is the log. (I cut out some parts, because it was too big.) 


BTW: please CC, I'm not on the list!
18:17:29 sys kernel: Linux version 2.6.20-rc5 ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #2 SMP PREEMPT Sat 18:07:36 CET 2007
18:17:29 sys kernel: Command line: root=/dev/md1 ro 
18:17:29 sys kernel: BIOS-provided physical RAM map:
18:17:29 sys kernel: BIOS-e820:  - 0009f800 (usable)
18:17:29 sys kernel: BIOS-e820: 0009f800 - 000a (reserved)
18:17:29 sys kernel: BIOS-e820: 000f - 0010 (reserved)
18:17:29 sys kernel: BIOS-e820: 0010 - 7fff (usable)
18:17:29 sys kernel: BIOS-e820: 7fff - 7fff3000 (ACPI NVS)
18:17:29 sys kernel: BIOS-e820: 7fff3000 - 8000 (ACPI data)
18:17:29 sys kernel: BIOS-e820: e000 - f000 (reserved)
18:17:29 sys kernel: BIOS-e820: fec0 - 0001 (reserved)
18:17:29 sys kernel: Entering add_active_range(0, 0, 159) 0 entries of 256 used
18:17:29 sys kernel: Entering add_active_range(0, 256, 524272) 1 entries of 256 used
18:17:29 sys kernel: end_pfn_map = 1048576
18:17:29 sys kernel: DMI 2.3 present.
18:17:29 sys kernel: ACPI: RSDP (v000 Nvidia) @ 0x000f7d30
18:17:29 sys kernel: ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff3040
18:17:29 sys kernel: ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff30c0
18:17:29 sys kernel: ACPI: SSDT (v001 PTLTD  POWERNOW 0x0001  LTP 0x0001) @ 0x7fff9900
18:17:29 sys kernel: ACPI: SRAT (v001 AMDHAMMER   0x0001 AMD  0x0001) @ 0x7fff9b40
18:17:29 sys kernel: ACPI: MCFG (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff9c40
18:17:29 sys kernel: ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff9840
18:17:29 sys kernel: ACPI: DSDT (v001 NVIDIA AWRDACPI 0x1000 MSFT 0x010e) @ 0x
18:17:29 sys kernel: Entering add_active_range(0, 0, 159) 0 entries of 256 used
18:17:29 sys kernel: Entering add_active_range(0, 256, 524272) 1 entries of 256 used
18:17:29 sys kernel: Zone PFN ranges:
18:17:29 sys kernel:   DMA             0 ->     4096
18:17:29 sys kernel:   DMA32        4096 ->  1048576
18:17:29 sys kernel:   Normal    1048576 ->  1048576
18:17:29 sys kernel: early_node_map[2] active PFN ranges
18:17:29 sys kernel:     0:        0 ->      159
18:17:29 sys kernel:     0:      256 ->   524272
18:17:29 sys kernel: On node 0 totalpages: 524175
18:17:29 sys kernel: DMA zone: 56 pages used for memmap
18:17:29 sys kernel: DMA zone: 10 pages reserved
18:17:29 sys kernel: DMA zone: 3933 pages, LIFO batch:0
18:17:29 sys kernel: DMA32 zone: 7111 pages used for memmap
18:17:29 sys kernel: DMA32 zone: 513065 pages, LIFO batch:31
18:17:29 sys kernel: Normal zone: 0 pages used for memmap
18:17:29 sys kernel: Nvidia board detected. Ignoring ACPI timer override.
18:17:29 sys kernel: If you got timer trouble try acpi_use_timer_override
18:17:29 sys kernel: ACPI: PM-Timer IO Port: 0x4008
18:17:29 sys kernel: ACPI: Local APIC address 0xfee0
18:17:29 sys kernel: ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
18:17:29 sys kernel: Processor #0 (Bootup-CPU)
18:17:29 sys kernel: ACPI: 

Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Ian Kumlien
Hi,

I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and ADMA
enabled, to 2.6.20-rc5, which gave me problems almost instantly.

I just thought that it might be interesting to know that it DID work
nicely.

Please CC me, since I'm not on the ML.

-- 
Ian Kumlien  -- http://pomac.netswarm.net




Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Robert Hancock

Ian Kumlien wrote:
> Hi,
>
> I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and ADMA
> enabled, to 2.6.20-rc5, which gave me problems almost instantly.
>
> I just thought that it might be interesting to know that it DID work
> nicely.
>
> CC since I'm not on the ml

(I'm ccing more of the people who reported this)

Well that's interesting.. The only significant change that went into 
2.6.20-rc5 in that driver that wasn't in that version you mentioned was 
this one:


http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861

Could you (or anyone else) test what happens if you take the 2.6.20-rc5 
version of sata_nv.c and try it on 2.6.19? That would tell us whether 
it's this change or whether it's something else (i.e. in libata core).


Assuming that still doesn't work, can you then try removing these lines 
from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?


	/* bail out if not our interrupt */
	if (!(irq_stat & NV_INT_DEV))
		return 0;

as that's the difference I'm most suspicious of causing the problem.

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Alistair John Strachan
On Saturday 20 January 2007 19:59, Robert Hancock wrote:
> Ian Kumlien wrote:
> > Hi,
> >
> > I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and ADMA
> > enabled, to 2.6.20-rc5, which gave me problems almost instantly.
> >
> > I just thought that it might be interesting to know that it DID work
> > nicely.
> >
> > CC since I'm not on the ml
>
> (I'm ccing more of the people who reported this)
>
> Well that's interesting.. The only significant change that went into
> 2.6.20-rc5 in that driver that wasn't in that version you mentioned was
> this one:
>
> http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861
>
> Could you (or anyone else) test what happens if you take the 2.6.20-rc5
> version of sata_nv.c and try it on 2.6.19? That would tell us whether
> it's this change or whether it's something else (i.e. in libata core).

I'm still running an -rc5 kernel with ADMA switched off entirely and I can't 
reproduce the problem. How is everybody else reproducing this?

I've been able to reproduce it by installing bonnie++, going to a large XFS
partition, running bonnie++ -u 1000:1000, and letting it run through with all
defaults.

With ADMA switched off, that test no longer triggers the problem I was seeing
in -rc5 with ADMA on, so I think this is sufficient to fix it.

Others have reported differently. Did you guys do:

[EMAIL PROTECTED]:~$ cat /proc/cmdline
root=/dev/sda1 ro sata_nv.adma=0

Or something similar? This is how Jeff suggested disabling ADMA and indeed the 
messages about its use disappear from dmesg.

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Ian Kumlien
On Sat, 2007-01-20 at 21:43 +, Alistair John Strachan wrote:
> On Saturday 20 January 2007 19:59, Robert Hancock wrote:
> > Ian Kumlien wrote:
> > > Hi,
> > >
> > > I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and ADMA
> > > enabled, to 2.6.20-rc5, which gave me problems almost instantly.
> > >
> > > I just thought that it might be interesting to know that it DID work
> > > nicely.
> > >
> > > CC since I'm not on the ml
> >
> > (I'm ccing more of the people who reported this)
> >
> > Well that's interesting.. The only significant change that went into
> > 2.6.20-rc5 in that driver that wasn't in that version you mentioned was
> > this one:
> >
> > http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861
> >
> > Could you (or anyone else) test what happens if you take the 2.6.20-rc5
> > version of sata_nv.c and try it on 2.6.19? That would tell us whether
> > it's this change or whether it's something else (i.e. in libata core).
>
> I'm still running an -rc5 kernel with ADMA switched off entirely and I can't
> reproduce the problem. How is everybody else reproducing this?
>
> I've been successful installing bonnie++, then going to a large XFS partition
> and running bonnie++ -u 1000:1000 and letting it run through, all defaults.
>
> It doesn't cause the problem I was seeing in -rc5 with ADMA on, when I switch
> ADMA off, so I think this is sufficient to fix it.

Eh? The whole point of that patch was to ADD ADMA support to sata_nv.
IMHO that is something we want to have, and I have been running with ADMA
on on two computers since sata_nv-adma-ncq-v4 or v5 or so without
problems.

So something has been introduced or broken to cause this error. Wouldn't
it be better to find the error that was introduced than to just totally
negate the patch in the first place?

I haven't had the energy to go through the patch that was identified as
causing the problem yet... I don't know if I even have all the info
needed to make any form of educated guess, but I'll give it a try when I
have the energy.

I really hope someone finds it before then =)

-- 
Ian Kumlien pomac () vapor ! com -- http://pomac.netswarm.net




Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Chr
On Saturday, 20. January 2007 20:59, you wrote:
> Ian Kumlien wrote:
> > Hi,
> >
> > I went from 2.6.19+sata_nv-adma-ncq-v7.patch, with no problems and ADMA
> > enabled, to 2.6.20-rc5, which gave me problems almost instantly.
> >
> > I just thought that it might be interesting to know that it DID work
> > nicely.
> >
> > CC since I'm not on the ml
>
> (I'm ccing more of the people who reported this)
>
> Well that's interesting.. The only significant change that went into
> 2.6.20-rc5 in that driver that wasn't in that version you mentioned was
> this one:
>
> http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2dec7555e6bf2772749113ea0ad454fcdb8cf861
>
> Could you (or anyone else) test what happens if you take the 2.6.20-rc5
> version of sata_nv.c and try it on 2.6.19? That would tell us whether
> it's this change or whether it's something else (i.e. in libata core).

OK, did that! (Got a fresh 2.6.19 tarball and used 2.6.20-rc5's sata_nv.c
with the one-liner in libata_sff.c.)

And, surprise: after one hour of uptime, there is not a single
SATA exception in dmesg! (I'll report back tomorrow...)


> Assuming that still doesn't work, can you then try removing these lines
> from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
>
> 	/* bail out if not our interrupt */
> 	if (!(irq_stat & NV_INT_DEV))
> 		return 0;
>
> as that's the difference I'm most suspicious of causing the problem.


Linux version 2.6.19test ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #2 SMP PREEMPT Sat Jan 20 22:19:20 CET 2007
Command line: root=/dev/md1 ro
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009f800 (usable)
 BIOS-e820: 0009f800 - 000a (reserved)
 BIOS-e820: 000f - 0010 (reserved)
 BIOS-e820: 0010 - 7fff (usable)
 BIOS-e820: 7fff - 7fff3000 (ACPI NVS)
 BIOS-e820: 7fff3000 - 8000 (ACPI data)
 BIOS-e820: e000 - f000 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 524272) 1 entries of 256 used
end_pfn_map = 1048576
DMI 2.3 present.
ACPI: RSDP (v000 Nvidia) @ 0x000f7d30
ACPI: RSDT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff3040
ACPI: FADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff30c0
ACPI: SSDT (v001 PTLTD  POWERNOW 0x0001  LTP 0x0001) @ 0x7fff9900
ACPI: SRAT (v001 AMDHAMMER   0x0001 AMD  0x0001) @ 0x7fff9b40
ACPI: MCFG (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff9c40
ACPI: MADT (v001 Nvidia AWRDACPI 0x42302e31 AWRD 0x) @ 0x7fff9840
ACPI: DSDT (v001 NVIDIA AWRDACPI 0x1000 MSFT 0x010e) @ 0x
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 524272) 1 entries of 256 used
Zone PFN ranges:
  DMA             0 ->     4096
  DMA32        4096 ->  1048576
  Normal    1048576 ->  1048576
early_node_map[2] active PFN ranges
    0:        0 ->      159
    0:      256 ->   524272
On node 0 totalpages: 524175
  DMA zone: 56 pages used for memmap
  DMA zone: 10 pages reserved
  DMA zone: 3933 pages, LIFO batch:0
  DMA32 zone: 7111 pages used for memmap
  DMA32 zone: 513065 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
Nvidia board detected. Ignoring ACPI timer override.
If you got timer trouble try acpi_use_timer_override
ACPI: PM-Timer IO Port: 0x4008
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 (Bootup-CPU)
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 2, address 0xfec0, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: BIOS IRQ0 pin2 override ignored.
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
ACPI: INT_SRC_OVR (bus 0 bus_irq 14 global_irq 14 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 15 global_irq 15 high edge)
ACPI: IRQ9 used by override.
ACPI: IRQ14 used by override.
ACPI: IRQ15 used by override.
Setting APIC routing to physical flat
Using ACPI (MADT) for SMP configuration information
Nosave address range: 0009f000 - 000a
Nosave address range: 000a - 000f
Nosave address range: 000f - 0010
Allocating PCI resources starting at 8800 (gap: 8000:6000)
SMP: Allowing 2 CPUs, 0 hotplug CPUs
PERCPU: Allocating 32320 bytes of per cpu data
Built 1 zonelists.  Total pages: 516998
Kernel command line: root=/dev/md1 ro
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)

Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Robert Hancock

Chr wrote:
>> Could you (or anyone else) test what happens if you take the 2.6.20-rc5
>> version of sata_nv.c and try it on 2.6.19? That would tell us whether
>> it's this change or whether it's something else (i.e. in libata core).
>
> OK, did that! (Got a fresh 2.6.19 tarball and used 2.6.20-rc5's sata_nv.c
> with the one-liner in libata_sff.c.)
>
> And, surprise: after one hour of uptime, there is not a single
> SATA exception in dmesg! (I'll report back tomorrow...)


That is interesting, indeed.. If that holds up then I assume some other 
change in 2.6.20-rc is either causing or triggering this problem. It 
would be useful if you could try git bisect between 2.6.19 and 
2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that 
gives any indication. If not, just trying some of the different 
2.6.20-rcX versions may be useful.


Before that, though, can you try making this change I suggested below in 
2.6.20-rc5 and see if the problem still shows up?





>> Assuming that still doesn't work, can you then try removing these lines
>> from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
>>
>> 	/* bail out if not our interrupt */
>> 	if (!(irq_stat & NV_INT_DEV))
>> 		return 0;
>>
>> as that's the difference I'm most suspicious of causing the problem.





Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Jeff Garzik

Robert Hancock wrote:
> change in 2.6.20-rc is either causing or triggering this problem. It
> would be useful if you could try git bisect between 2.6.19 and
> 2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that



Yes, 'git bisect' would be the next step in figuring out this puzzle.

Anybody up for it?

Jeff




Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Björn Steinbrink
On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> Robert Hancock wrote:
> > change in 2.6.20-rc is either causing or triggering this problem. It
> > would be useful if you could try git bisect between 2.6.19 and
> > 2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that
>
> Yes, 'git bisect' would be the next step in figuring out this puzzle.
>
> Anybody up for it?

I'll go for it, but could I get an explanation how that could lead to a
different result than my last bisection? I see the difference of keeping
sata_nv.c but my brain can't wrap around it right now (woke up in the
middle of the night and still not up to speed...).

Thanks,
Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-20 Thread Robert Hancock

Björn Steinbrink wrote:
> On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
>> Robert Hancock wrote:
>>> change in 2.6.20-rc is either causing or triggering this problem. It
>>> would be useful if you could try git bisect between 2.6.19 and
>>> 2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that
>>
>> Yes, 'git bisect' would be the next step in figuring out this puzzle.
>>
>> Anybody up for it?
>
> I'll go for it, but could I get an explanation how that could lead to a
> different result than my last bisection? I see the difference of keeping
> sata_nv.c but my brain can't wrap around it right now (woke up in the
> middle of the night and still not up to speed...).


Whatever the problem is, it only seems to show up when ADMA is enabled, and
so the patch that added ADMA support shows up as the culprit from your
git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA
support added in doesn't seem to have the problem, so presumably
something else that changed in the 2.6.20-rc series is triggering it.
Doing a bisect while keeping the driver code itself the same will
hopefully identify what that change is.



Re: SATA exceptions with 2.6.20-rc5

2007-01-19 Thread Björn Steinbrink
On 2007.01.19 20:41:36 -0600, Robert Hancock wrote:
> Alistair John Strachan wrote:
> >On Tuesday 16 January 2007 01:53, Jeff Garzik wrote:
> >>Robert Hancock wrote:
> >>>I'll try your stress test when I get a chance, but I doubt I'll run into
> >>>the same problem and I haven't seen any similar reports. Perhaps it's
> >>>some kind of weird timing issue or incompatibility between the
> >>>controller and that drive when running in ADMA mode? I seem to remember
> >>>various reports of issues with certain Maxtor drives and some nForce
> >>>SATA controllers under Windows at least..
> >>Just to eliminate things, has disabling ADMA been attempted?
> >>
> >>It can be disabled using the sata_nv.adma module parameter.
> >
> >Setting this option fixes the problem for me. I suggest that ADMA defaults 
> >off in 2.6.20, if there's still time to do that.
> >
> 
> Can you guys that are having this problem try the attached debug patch? 
> It's possible it will fix the problem, as I'm trying a private 
> exec_command implementation that flushes the write by reading a 
> controller register instead of reading altstatus from the drive like the 
> libata core code does.

Will give it a spin in about an hour.

> If the problem still happens, I also added some more debugging in to 
> help figure out what is going on, so please post full dmesg.
> 
> By the way, I assume that you guys are using reiserfs or xfs, as it 
> appears no other file systems issue flush commands automatically. I had 
> to test this by "echo 1 > delete" on the SCSI disk in sysfs, as I am 
> using ext3.

No, ext3 here, on top of md RAID1 and LVM. Oh, and one ext2, I wonder
where that comes from...

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-19 Thread Alistair John Strachan
On Saturday 20 January 2007 02:41, Robert Hancock wrote:
> By the way, I assume that you guys are using reiserfs or xfs, as it
> appears no other file systems issue flush commands automatically. I had
> to test this by "echo 1 > delete" on the SCSI disk in sysfs, as I am
> using ext3.

I'll give it a spin now, and yes I'm using several large XFS partitions on 
this machine, layered on top of md RAID5. That's why this particular defect 
is so catastrophic (literally _everything_ is stalled).

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


Re: SATA exceptions with 2.6.20-rc5

2007-01-19 Thread Robert Hancock

Alistair John Strachan wrote:
> On Tuesday 16 January 2007 01:53, Jeff Garzik wrote:
>> Robert Hancock wrote:
>>> I'll try your stress test when I get a chance, but I doubt I'll run into
>>> the same problem and I haven't seen any similar reports. Perhaps it's
>>> some kind of weird timing issue or incompatibility between the
>>> controller and that drive when running in ADMA mode? I seem to remember
>>> various reports of issues with certain Maxtor drives and some nForce
>>> SATA controllers under Windows at least..
>>
>> Just to eliminate things, has disabling ADMA been attempted?
>>
>> It can be disabled using the sata_nv.adma module parameter.
>
> Setting this option fixes the problem for me. I suggest that ADMA defaults off
> in 2.6.20, if there's still time to do that.




Can you guys that are having this problem try the attached debug patch? 
It's possible it will fix the problem, as I'm trying a private 
exec_command implementation that flushes the write by reading a 
controller register instead of reading altstatus from the drive like the 
libata core code does.


If the problem still happens, I also added some more debugging in to 
help figure out what is going on, so please post full dmesg.


By the way, I assume that you guys are using reiserfs or xfs, as it 
appears no other file systems issue flush commands automatically. I had 
to test this by "echo 1 > delete" on the SCSI disk in sysfs, as I am 
using ext3.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c  2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-19 20:25:31.0 -0600
@@ -245,6 +245,7 @@ static void nv_adma_bmdma_setup(struct a
 static void nv_adma_bmdma_start(struct ata_queued_cmd *qc);
 static void nv_adma_bmdma_stop(struct ata_queued_cmd *qc);
 static u8 nv_adma_bmdma_status(struct ata_port *ap);
+static void nv_adma_exec_command(struct ata_port *ap, const struct ata_taskfile *tf);
 
 enum nv_host_type
 {
@@ -409,7 +410,7 @@ static const struct ata_port_operations 
 	.tf_load		= ata_tf_load,
 	.tf_read		= ata_tf_read,
 	.check_atapi_dma	= nv_adma_check_atapi_dma,
-	.exec_command		= ata_exec_command,
+	.exec_command		= nv_adma_exec_command,
 	.check_status		= ata_check_status,
 	.dev_select		= ata_std_dev_select,
 	.bmdma_setup		= nv_adma_bmdma_setup,
@@ -617,6 +618,14 @@ static int nv_adma_check_atapi_dma(struc
 	return !(pp->flags & NV_ADMA_ATAPI_SETUP_COMPLETE);
 }
 
+static void nv_adma_exec_command(struct ata_port *ap, const struct ata_taskfile *tf)
+{
+	void __iomem* mmio = nv_adma_ctl_block(ap);
+	writeb(tf->command, (void __iomem *) ap->ioaddr.command_addr);
+	readw(mmio + NV_ADMA_CTL); /* flush */
+	ndelay(400);
+}
+
 static unsigned int nv_adma_tf_to_cpb(struct ata_taskfile *tf, __le16 *cpb)
 {
 	unsigned int idx = 0;
@@ -701,6 +710,9 @@ static int nv_host_intr(struct ata_port 
 {
 	struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
 	int handled;
+	u8 cmd = 0;
+	if(qc)
+		cmd = qc->tf.command;
 
 	/* freeze if hotplugged */
 	if (unlikely(irq_stat & (NV_INT_ADDED | NV_INT_REMOVED))) {
@@ -709,8 +721,11 @@ static int nv_host_intr(struct ata_port 
 	}
 
 	/* bail out if not our interrupt */
-	if (!(irq_stat & NV_INT_DEV))
+	if (!(irq_stat & NV_INT_DEV)) {
+		if( cmd == ATA_CMD_FLUSH || cmd == ATA_CMD_FLUSH_EXT )
+			ata_port_printk(ap, KERN_NOTICE, "cmd 0x%x active but stat 0x%x\n", cmd, irq_stat);
 		return 0;
+	}
 
 	/* DEV interrupt w/ no active qc? */
 	if (unlikely(!qc || (qc->tf.flags & ATA_TFLAG_POLLING))) {
@@ -720,6 +735,8 @@ static int nv_host_intr(struct ata_port 
 
 	/* handle interrupt */
 	handled = ata_host_intr(ap, qc);
+	if( cmd == ATA_CMD_FLUSH || cmd == ATA_CMD_FLUSH_EXT )
+		ata_port_printk(ap, KERN_NOTICE, "cmd 0x%x active, stat = 0x%x, handled = 0x%x\n", cmd, irq_stat, handled);
 	if (unlikely(!handled)) {
 		/* spurious, clear it */
 		ata_check_status(ap);
@@ -870,7 +887,7 @@ static void nv_adma_bmdma_setup(struct a
 	outb(dmactl, ap->ioaddr.bmdma_addr + ATA_DMA_CMD);
 
 	/* issue r/w command */
-	ata_exec_command(ap, &qc->tf);
+	nv_adma_exec_command(ap, &qc->tf);
 }
 
 static void nv_adma_bmdma_start(struct ata_queued_cmd *qc)
@@ -1161,6 +1178,9 @@ static unsigned int nv_adma_qc_issue(str
 		/* use ATA register mode */
 		VPRINTK("no dmamap or ATAPI, using ATA register mode: 0x%lx\n", qc->flags);
 		nv_adma_register_mode(qc->ap);
+		if(qc->tf.command == ATA_CMD_FLUSH ||
+		   qc->tf.command 

Re: SATA exceptions with 2.6.20-rc5

2007-01-19 Thread chunkeey
On Friday, 19. January 2007 16:05, Alistair John Strachan wrote:
> On Tuesday 16 January 2007 01:53, Jeff Garzik wrote:
> > Robert Hancock wrote:
> > > I'll try your stress test when I get a chance, but I doubt I'll run
> > > into the same problem and I haven't seen any similar reports. Perhaps
> > > it's some kind of weird timing issue or incompatibility between the
> > > controller and that drive when running in ADMA mode? I seem to remember
> > > various reports of issues with certain Maxtor drives and some nForce
> > > SATA controllers under Windows at least..
> >
> > Just to eliminate things, has disabling ADMA been attempted?
> >
> > It can be disabled using the sata_nv.adma module parameter.
>
> Setting this option fixes the problem for me. I suggest that ADMA defaults
> off in 2.6.20, if there's still time to do that.

Not for me.
I still have the same trouble, but less often (maybe about every hour, instead
of every 5 minutes). Furthermore, I found a patch,
cocktail-2.6.20-rc3.patch: http://tinyurl.com/2gza8q, which improves the
situation too!

Now, the funny thing is that I've two SATA HDDs, but only 1 causes all the
headaches.

The affected drive is a:
sda - @ata3.0 - WDC WD2500KS-00M 02.0
ATA-7, max UDMA/133, 488395055 sectors: LBA48

"ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0 out
 res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133:PIO0
ata3: EH complete
SCSI device sda: 488395055 512-byte hdwr sectors (250058 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00"

the "good" HDD is a:
sdb - @ata4.0 - WDC WD2500YD-01N 10.0
ATA-7, max UDMA/133, 490234752 sectors: LBA48 NCQ (depth 0/1)

System:
AMD64 4200+ 
nForce 4 SLI
2 GB
SMP PREEMPT kernel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SATA exceptions with 2.6.20-rc5

2007-01-19 Thread Alistair John Strachan
On Tuesday 16 January 2007 01:53, Jeff Garzik wrote:
> Robert Hancock wrote:
> > I'll try your stress test when I get a chance, but I doubt I'll run into
> > the same problem and I haven't seen any similar reports. Perhaps it's
> > some kind of weird timing issue or incompatibility between the
> > controller and that drive when running in ADMA mode? I seem to remember
> > various reports of issues with certain Maxtor drives and some nForce
> > SATA controllers under Windows at least..
>
> Just to eliminate things, has disabling ADMA been attempted?
>
> It can be disabled using the sata_nv.adma module parameter.

Setting this option fixes the problem for me. I suggest that ADMA defaults off 
in 2.6.20, if there's still time to do that.

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


Re: SATA exceptions with 2.6.20-rc5

2007-01-19 Thread Alistair John Strachan
On Tuesday 16 January 2007 00:34, Robert Hancock wrote:
> I'll try your stress test when I get a chance, but I doubt I'll run into
> the same problem and I haven't seen any similar reports. Perhaps it's
> some kind of weird timing issue or incompatibility between the
> controller and that drive when running in ADMA mode? I seem to remember
> various reports of issues with certain Maxtor drives and some nForce
> SATA controllers under Windows at least..

I have exactly the same problem on -rc5 and it causes all I/O to stall 
periodically if I do _anything_ I/O intensive.

On my box, I have 4 sata_nv handled SATA ports, with two pairs of different 
drives (two Maxtor, two WD) and it happens randomly on both. So it's 
absolutely nothing to do with the drive make/model.

I'll try Jeff's suggestion of disabling ADMA now, but I think something more 
radical than this workaround should make it into 2.6.20 final, otherwise a 
lot of people are going to have broken boxes.

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


Re: SATA exceptions with 2.6.20-rc5

2007-01-19 Thread Alistair John Strachan
On Saturday 20 January 2007 02:41, Robert Hancock wrote:
> By the way, I assume that you guys are using reiserfs or xfs, as it
> appears no other file systems issue flush commands automatically. I had
> to test this by echo 1 > delete on the SCSI disk in sysfs, as I am
> using ext3.

I'll give it a spin now, and yes I'm using several large XFS partitions on 
this machine, layered on top of md RAID5. That's why this particular defect 
is so catastrophic (literally _everything_ is stalled).

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


Re: SATA exceptions with 2.6.20-rc5

2007-01-19 Thread Björn Steinbrink
On 2007.01.19 20:41:36 -0600, Robert Hancock wrote:
> Alistair John Strachan wrote:
> > On Tuesday 16 January 2007 01:53, Jeff Garzik wrote:
> > > Robert Hancock wrote:
> > > > I'll try your stress test when I get a chance, but I doubt I'll run into
> > > > the same problem and I haven't seen any similar reports. Perhaps it's
> > > > some kind of weird timing issue or incompatibility between the
> > > > controller and that drive when running in ADMA mode? I seem to remember
> > > > various reports of issues with certain Maxtor drives and some nForce
> > > > SATA controllers under Windows at least..
> > > Just to eliminate things, has disabling ADMA been attempted?
> > > 
> > > It can be disabled using the sata_nv.adma module parameter.
> > 
> > Setting this option fixes the problem for me. I suggest that ADMA defaults 
> > off in 2.6.20, if there's still time to do that.
> 
> Can you guys that are having this problem try the attached debug patch? 
> It's possible it will fix the problem, as I'm trying a private 
> exec_command implementation that flushes the write by reading a 
> controller register instead of reading altstatus from the drive like the 
> libata core code does.

Will give it a spin in about an hour.

> If the problem still happens, I also added some more debugging in to 
> help figure out what is going on, so please post full dmesg.
> 
> By the way, I assume that you guys are using reiserfs or xfs, as it 
> appears no other file systems issue flush commands automatically. I had 
> to test this by echo 1 > delete on the SCSI disk in sysfs, as I am 
> using ext3.

No, ext3 here, on top of md RAID1 and LVM. Oh, and one ext2, I wonder
where that comes from...

Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-18 Thread Björn Steinbrink
On 2007.01.18 18:09:50 -0600, Robert Hancock wrote:
> I heard from Larry Walton who was apparently seeing this problem as 
> well. He tried my recent "sata_nv: cleanup ADMA error handling v2" patch 
> and originally thought it fixed the problem, but it turned out to only 
> make it happen less often.
> 
> I wouldn't expect that patch to have an effect on this problem. If it 
> seems to reduce the frequency, that would tend to be further evidence of 
> some kind of timing-related issue where the code change just happens 
> to make a difference.
> 
> I'll see if I can come up with a debug patch for people having this 
> problem to try, which prints out when a flush command is issued and what 
> interrupts happen when a flush is pending.
> 
> There is one important difference between ADMA and non-ADMA mode for 
> non-DMA commands like flushes, which didn't come to mind before: ADMA 
> mode uses MMIO registers on the controller whereas non-ADMA mode uses 
> legacy IO registers. Posted write flushing is a concern with MMIO 
> registers but not with PIO. The libata core is supposed to handle this, 
> but maybe it doesn't in some case(s). In fact, just looking at 
> libata-sff.c there's this comment on the ata_exec_command_mmio function:
> 
>  *  FIXME: missing write posting for 400nS delay enforcement
> 
> That seems a bit suspicious..

That would imply that disabling adma via a module parameter should make
the issue go away, right? I'll try to have a test run with adma disabled
over night then.

Thanks,
Björn


Re: SATA exceptions with 2.6.20-rc5

2007-01-18 Thread Robert Hancock
I heard from Larry Walton who was apparently seeing this problem as 
well. He tried my recent "sata_nv: cleanup ADMA error handling v2" patch 
and originally thought it fixed the problem, but it turned out to only 
make it happen less often.


I wouldn't expect that patch to have an effect on this problem. If it 
seems to reduce the frequency, that would tend to be further evidence of 
some kind of timing-related issue where the code change just happens 
to make a difference.


I'll see if I can come up with a debug patch for people having this 
problem to try, which prints out when a flush command is issued and what 
interrupts happen when a flush is pending.


There is one important difference between ADMA and non-ADMA mode for 
non-DMA commands like flushes, which didn't come to mind before: ADMA 
mode uses MMIO registers on the controller whereas non-ADMA mode uses 
legacy IO registers. Posted write flushing is a concern with MMIO 
registers but not with PIO. The libata core is supposed to handle this, 
but maybe it doesn't in some case(s). In fact, just looking at 
libata-sff.c there's this comment on the ata_exec_command_mmio function:


 *  FIXME: missing write posting for 400nS delay enforcement

That seems a bit suspicious..

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Robert Hancock

Björn Steinbrink wrote:
> > It should be correct the way it is - that check is trying to prevent 
> > ATAPI commands from using DMA until the slave_config function has been 
> > called to set up the DMA parameters properly. When the 
> > NV_ADMA_ATAPI_SETUP_COMPLETE flag is not set, this returns 1 which 
> > disallows DMA transfers. Unless you were using an ATAPI (i.e. CD/DVD) 
> > device on the channel this wouldn't affect you anyway.
> 
> I wondered about it, because the flag is cleared when adma_enabled is 1,
> which seems to be consistent with everything but nv_adma_check_atapi_dma.


When ADMA is enabled we can't use ATAPI at all (or so says NVidia 
anyway), so it has to be disabled when an ATAPI device is detected in 
slave_config. Since doing that implies using the legacy BMDMA engine 
with its greater restrictions, this is why we need to prevent DMA 
transfers from being attempted until those restrictions have been set 
properly. (Otherwise, the libata core will try to use PACKET commands on 
an ATAPI device with DMA enabled before slave_config is even called.)



> Thus I thought that nv_adma_check_atapi_dma might be wrong, but maybe
> setting/clearing the flag is wrong instead? *feels lost*


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/




Re: SATA exceptions with 2.6.20-rc5

2007-01-15 Thread Jeff Garzik

Robert Hancock wrote:
> I'll try your stress test when I get a chance, but I doubt I'll run into 
> the same problem and I haven't seen any similar reports. Perhaps it's 
> some kind of weird timing issue or incompatibility between the 
> controller and that drive when running in ADMA mode? I seem to remember 
> various reports of issues with certain Maxtor drives and some nForce 
> SATA controllers under Windows at least..



Just to eliminate things, has disabling ADMA been attempted?

It can be disabled using the sata_nv.adma module parameter.

Jeff



