Re: [PATCH 0/7] blk_end_request: full I/O completion handler

2008-02-08 Thread Jens Axboe
On Fri, Feb 08 2008, S, Chandrakala (STSD) wrote:
 Hi,
 
 Thanks for the information! 
 We would like to know when the 2.6.25 kernel will be available at
 kernel.org.

So would I; if I locate my crystal ball, I'll be sure to let you know :-)

Seriously, given past experience, it's probably sometime in April.

 
 Thanks,
 Chandrakala 
 
 -Original Message-
 From: Jens Axboe [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, February 05, 2008 4:09 PM
 To: S, Chandrakala (STSD)
 Cc: Kiyoshi Ueda; [EMAIL PROTECTED];
 [EMAIL PROTECTED]; linux-ide@vger.kernel.org; Miller, Mike (OS
 Dev); [EMAIL PROTECTED]; [EMAIL PROTECTED];
 [EMAIL PROTECTED]
 Subject: Re: [PATCH 0/7] blk_end_request: full I/O completion handler
 
 On Tue, Feb 05 2008, S, Chandrakala (STSD) wrote:
  Hello,
  
  We would like to know in which kernel version these patches are 
  available.
 
 They were merged after 2.6.24 was released, so they will show up in the
 2.6.25 kernel.
 
 --
 Jens Axboe
 

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHSET #upstream] block/libata: update and use block layer padding and draining

2008-02-08 Thread Jens Axboe
On Fri, Feb 08 2008, Jeff Garzik wrote:
 Jeff Garzik wrote:
 Tejun Heo wrote:
 This patchset updates block layer padding and draining support and
 makes libata use it.  It's based on James Bottomley's initial work and,
 of the five, the last two patches are from James with some
 modifications.
 
 Please read the following thread for more info.
 
   http://thread.gmane.org/gmane.linux.scsi/37185
 
 This patchset is on top of
 
   upstream (a6af42fc9a12165136d82206ad52f18c5955ce87)
 + kill-n_iter-and-fix-fsl patch [1]
 
 ACK patchset...  let's definitely get these fixes upstream.
 
 Once Jens is happy, I would prefer to merge the lot upstream, if that 
 is OK with everyone involved?
 
 Jens, ping?
 
 It's a bug fix, so it would be nice to get this in soonish.  As noted, 
 if all looks good, I would prefer to merge via libata-dev...

I'm OK with it, but let's please merge the block bits through the block
repo, since they are not trivial. Won't be until the week after next,
though.

-- 
Jens Axboe



Re: [PATCH 0/7] blk_end_request: full I/O completion handler

2008-02-05 Thread Jens Axboe
On Tue, Feb 05 2008, S, Chandrakala (STSD) wrote:
 Hello,
 
 We would like to know in which kernel version these patches are
 available.  

They were merged after 2.6.24 was released, so they will show up in the
2.6.25 kernel.

-- 
Jens Axboe



Re: kernel BUG at ide-cd.c:1726 in 2.6.24-03863-g0ba6c33 -g8561b089

2008-01-31 Thread Jens Axboe
On Thu, Jan 31 2008, Nai Xia wrote:
 My dmesg relevant info is quite similar:
 
 [6.875041] Freeing unused kernel memory: 320k freed
 [8.143120] ide-cd: rq still having bio: dev hdc: type=2, flags=114c8
 [8.144439]
 [8.144439] sector 10824201199534213, nr/cnr 0/0
 [8.144439] bio cf029280, biotail cf029280, buffer , data
 , len 158
 [8.144439] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
 [8.144439] backup: data_len=158  bi_size=158
 [8.160756] ide-cd: rq still having bio: dev hdc: type=2, flags=114c8
 [8.160756]
 [8.160756] sector 2669858, nr/cnr 0/0
 [8.160756] bio cf029300, biotail cf029300, buffer , data
 , len 158
 [8.160756] cdb: 12 01 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
 [8.160756] backup: data_len=158  bi_size=158
 [   14.851101] eth0: link up
 [   27.121883] eth0: no IPv6 routers present
 
 
 And by the way, Kiyoshi,
 This can be reproduced in a typical setup: vmware workstation 6.02 with
 a virtual IDE cdrom,
 in case you wanna catch that with your own eyes. :-)
 Thanks for trying hard to correct this annoying bug.

The below fix should be enough. It's perfectly legal to have leftover
byte counts when the drive signals completion, happens all the time for
eg user issued commands where you don't know an exact byte count.

diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 74c6087..bee05a3 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -1722,7 +1722,7 @@ static ide_startstop_t cdrom_newpc_intr(ide_drive_t 
*drive)
 */
 	if ((stat & DRQ_STAT) == 0) {
 		spin_lock_irqsave(&ide_lock, flags);
-		if (__blk_end_request(rq, 0, 0))
+		if (__blk_end_request(rq, 0, rq->data_len))
 			BUG();
 		HWGROUP(drive)->rq = NULL;
 		spin_unlock_irqrestore(&ide_lock, flags);

-- 
Jens Axboe



Re: kernel BUG at ide-cd.c:1726 in 2.6.24-03863-g0ba6c33 -g8561b089

2008-01-31 Thread Jens Axboe
On Thu, Jan 31 2008, Florian Lohoff wrote:
 On Thu, Jan 31, 2008 at 02:05:58PM +0100, Jens Axboe wrote:
  The below fix should be enough. It's perfectly legal to have leftover
  byte counts when the drive signals completion, happens all the time for
  eg user issued commands where you don't know an exact byte count.
  
  diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
  index 74c6087..bee05a3 100644
  --- a/drivers/ide/ide-cd.c
  +++ b/drivers/ide/ide-cd.c
  @@ -1722,7 +1722,7 @@ static ide_startstop_t cdrom_newpc_intr(ide_drive_t 
  *drive)
   */
  	if ((stat & DRQ_STAT) == 0) {
  		spin_lock_irqsave(&ide_lock, flags);
  -		if (__blk_end_request(rq, 0, 0))
  +		if (__blk_end_request(rq, 0, rq->data_len))
  			BUG();
  		HWGROUP(drive)->rq = NULL;
  		spin_unlock_irqrestore(&ide_lock, flags);
  
 
 Fixes the crash on boot for me ...

Great, thanks for confirming that. I'll make sure the patch goes
upstream today, if Linus is available.

-- 
Jens Axboe



Re: kernel BUG at ide-cd.c:1726 in 2.6.24-03863-g0ba6c33 -g8561b089

2008-01-31 Thread Jens Axboe



On 31/01/2008, at 18.04, Kiyoshi Ueda [EMAIL PROTECTED] wrote:


Hi Jens,

On Thu, 31 Jan 2008 14:05:58 +0100, Jens Axboe wrote:

On Thu, Jan 31 2008, Nai Xia wrote:

My dmesg relevant info is quite similar:

[6.875041] Freeing unused kernel memory: 320k freed
[8.143120] ide-cd: rq still having bio: dev hdc: type=2, flags=114c8
[8.144439]
[8.144439] sector 10824201199534213, nr/cnr 0/0
[8.144439] bio cf029280, biotail cf029280, buffer , data
, len 158
[8.144439] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
[8.144439] backup: data_len=158  bi_size=158
[8.160756] ide-cd: rq still having bio: dev hdc: type=2, flags=114c8
[8.160756]
[8.160756] sector 2669858, nr/cnr 0/0
[8.160756] bio cf029300, biotail cf029300, buffer , data
, len 158
[8.160756] cdb: 12 01 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
[8.160756] backup: data_len=158  bi_size=158
[   14.851101] eth0: link up
[   27.121883] eth0: no IPv6 routers present


And by the way, Kiyoshi,
This can be reproduced in a typical setup: vmware workstation 6.02 with
a virtual IDE cdrom,
in case you wanna catch that with your own eyes. :-)
Thanks for trying hard to correct this annoying bug.


The below fix should be enough. It's perfectly legal to have leftover
byte counts when the drive signals completion, happens all the time for
eg user issued commands where you don't know an exact byte count.

diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 74c6087..bee05a3 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -1722,7 +1722,7 @@ static ide_startstop_t cdrom_newpc_intr 
(ide_drive_t *drive)

*/
 	if ((stat & DRQ_STAT) == 0) {
 		spin_lock_irqsave(&ide_lock, flags);
-		if (__blk_end_request(rq, 0, 0))
+		if (__blk_end_request(rq, 0, rq->data_len))
 			BUG();
 		HWGROUP(drive)->rq = NULL;
 		spin_unlock_irqrestore(&ide_lock, flags);


OK, I understand the leftover is legal.

By the way, is it safe to always return success if there is a leftover?
I thought we might have to complete the rq with -EIO in such case.


data_len being non zero should pass the residual count back to the
issuer.


Re: [PATCH 41/63] ide-cd: move lba_to_msf() and msf_to_lba() to linux/cdrom.h

2007-12-20 Thread Jens Axboe
On Thu, Dec 20 2007, Bartlomiej Zolnierkiewicz wrote:
 
 * Move lba_to_msf() and msf_to_lba() to linux/cdrom.h
   (use 'u8' type instead of 'byte' while at it).
 
 * Remove msf_to_lba() copy from drivers/cdrom/cdrom.c.
 
 Cc: Jens Axboe [EMAIL PROTECTED]
 Signed-off-by: Bartlomiej Zolnierkiewicz [EMAIL PROTECTED]

Acked-by: Jens Axboe [EMAIL PROTECTED]

 ---
  drivers/cdrom/cdrom.c |6 --
  drivers/ide/ide-cd.c  |   18 --
  include/linux/cdrom.h |   14 ++
  3 files changed, 14 insertions(+), 24 deletions(-)
 
 Index: b/drivers/cdrom/cdrom.c
 ===
 --- a/drivers/cdrom/cdrom.c
 +++ b/drivers/cdrom/cdrom.c
 @@ -2787,12 +2787,6 @@ int cdrom_ioctl(struct file * file, stru
   return -ENOSYS;
  }
  
 -static inline
 -int msf_to_lba(char m, char s, char f)
 -{
 - return (((m * CD_SECS) + s) * CD_FRAMES + f) - CD_MSF_OFFSET;
 -}
 -
  /*
   * Required when we need to use READ_10 to issue other than 2048 block
   * reads
 Index: b/drivers/ide/ide-cd.c
 ===
 --- a/drivers/ide/ide-cd.c
 +++ b/drivers/ide/ide-cd.c
 @@ -1643,24 +1643,6 @@ void msf_from_bcd (struct atapi_msf *msf
 	msf->frame  = BCD2BIN(msf->frame);
  }
  
 -static inline
 -void lba_to_msf (int lba, byte *m, byte *s, byte *f)
 -{
 - lba += CD_MSF_OFFSET;
 - lba &= 0xffffff;	/* negative lbas use only 24 bits */
 - *m = lba / (CD_SECS * CD_FRAMES);
 - lba %= (CD_SECS * CD_FRAMES);
 - *s = lba / CD_FRAMES;
 - *f = lba % CD_FRAMES;
 -}
 -
 -
 -static inline
 -int msf_to_lba (byte m, byte s, byte f)
 -{
 - return (((m * CD_SECS) + s) * CD_FRAMES + f) - CD_MSF_OFFSET;
 -}
 -
  static int cdrom_check_status(ide_drive_t *drive, struct request_sense 
 *sense)
  {
   struct request req;
 Index: b/include/linux/cdrom.h
 ===
 --- a/include/linux/cdrom.h
 +++ b/include/linux/cdrom.h
 @@ -1184,6 +1184,20 @@ struct media_event_desc {
  
  extern int cdrom_get_media_event(struct cdrom_device_info *cdi, struct 
 media_event_desc *med);
  
 +static inline void lba_to_msf(int lba, u8 *m, u8 *s, u8 *f)
 +{
 + lba += CD_MSF_OFFSET;
 + lba &= 0xffffff;	/* negative lbas use only 24 bits */
 + *m = lba / (CD_SECS * CD_FRAMES);
 + lba %= (CD_SECS * CD_FRAMES);
 + *s = lba / CD_FRAMES;
 + *f = lba % CD_FRAMES;
 +}
 +
 +static inline int msf_to_lba(u8 m, u8 s, u8 f)
 +{
 + return (((m * CD_SECS) + s) * CD_FRAMES + f) - CD_MSF_OFFSET;
 +}
  #endif  /* End of kernel only stuff */ 
  
  #endif  /* _LINUX_CDROM_H */

-- 
Jens Axboe



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe
On Thu, Dec 13 2007, Mark Lord wrote:
 Matthew Wilcox wrote:
 On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
 Problem confirmed.  2.6.23.8 regularly generates segments up to 64KB for 
 libata,
 but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
 
 Just a suspicion ... could this be slab vs slub?  ie check your configs
 are the same / similar between the two kernels.
 ..
 
 Mmmm.. a good thought, that one.
 But I just rechecked, and both have CONFIG_SLAB=y
 
 My guess is that something got changed around when Jens
 reworked the block layer for 2.6.24.
 I'm going to dig around in there now.

I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects in
  this area, since it changes some of the code involved with merges and
  blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do see
occasional segments being merged. So it sounds more like the input data
having changed.

Why not just bisect it?

-- 
Jens Axboe



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe
On Thu, Dec 13 2007, Mark Lord wrote:
 Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Matthew Wilcox wrote:
 On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
 Problem confirmed.  2.6.23.8 regularly generates segments up to 
 64KB for libata,
 but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
 Just a suspicion ... could this be slab vs slub?  ie check your configs
 are the same / similar between the two kernels.
 ..
 
 Mmmm.. a good thought, that one.
 But I just rechecked, and both have CONFIG_SLAB=y
 
 My guess is that something got changed around when Jens
 reworked the block layer for 2.6.24.
 I'm going to dig around in there now.
 
 I didn't rework the block layer for 2.6.24 :-). The core block layer
 changes since 2.6.23 are:
 
 - Support for empty barriers. Not a likely candidate.
 - Shared tag queue fixes. Totally unlikely.
 - sg chaining support. Not likely.
 - The bio changes from Neil. Of the bunch, the most likely suspects in
   this area, since it changes some of the code involved with merges and
   blk_rq_map_sg().
 - Lots of simple stuff, again very unlikely.
 
 Anyway, it sounds odd for this to be a block layer problem if you do see
 occasional segments being merged. So it sounds more like the input data
 having changed.
 
 Why not just bisect it?
 ..
 
 Because the early 2.6.24 series failed to boot on this machine
 due to bugs in the block layer -- so the code that caused this regression
 is probably in the stuff from before the kernels became usable here.
 ..
 
 That sounds more harsh than intended -- the earlier 2.6.24 kernels (up to
 the first couple of -rc* ones) failed here because of incompatibilities
 between the block/bio changes and libata.
 
 That's better, I think! 

No worries, I didn't pick it up as harsh just as an odd conclusion :-)

If I were you, I'd just start from the first -rc that booted for you. If
THAT has the bug, then we'll think of something else. If you don't get
anywhere, I can run some tests tomorrow and see if I can reproduce it
here.

-- 
Jens Axboe



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe
On Thu, Dec 13 2007, Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Matthew Wilcox wrote:
 On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
 Problem confirmed.  2.6.23.8 regularly generates segments up to 
 64KB for libata,
 but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
 Just a suspicion ... could this be slab vs slub?  ie check your 
 configs
 are the same / similar between the two kernels.
 ..
 
 Mmmm.. a good thought, that one.
 But I just rechecked, and both have CONFIG_SLAB=y
 
 My guess is that something got changed around when Jens
 reworked the block layer for 2.6.24.
 I'm going to dig around in there now.
 I didn't rework the block layer for 2.6.24 :-). The core block layer
 changes since 2.6.23 are:
 
 - Support for empty barriers. Not a likely candidate.
 - Shared tag queue fixes. Totally unlikely.
 - sg chaining support. Not likely.
 - The bio changes from Neil. Of the bunch, the most likely suspects in
  this area, since it changes some of the code involved with merges and
  blk_rq_map_sg().
 - Lots of simple stuff, again very unlikely.
 
 Anyway, it sounds odd for this to be a block layer problem if you do see
 occasional segments being merged. So it sounds more like the input data
 having changed.
 
 Why not just bisect it?
 ..
 
 Because the early 2.6.24 series failed to boot on this machine
 due to bugs in the block layer -- so the code that caused this regression
 is probably in the stuff from before the kernels became usable here.
 ..
 
 That sounds more harsh than intended -- the earlier 2.6.24 kernels (up to
 the first couple of -rc* ones failed here because of incompatibilities
 between the block/bio changes and libata.
 
 That's better, I think! 
 
 No worries, I didn't pick it up as harsh just as an odd conclusion :-)
 
 If I were you, I'd just start from the first -rc that booted for you. If
 THAT has the bug, then we'll think of something else. If you don't get
 anywhere, I can run some tests tomorrow and see if I can reproduce it
 here.
 ..
 
 I believe that *anyone* can reproduce it, since it's broken long before
 the requests ever get to SCSI or libata.  Which also means that *anyone*
 who wants to can bisect it, as well.
 
 I don't do bisects.

It was just a suggestion on how to narrow it down, do as you see fit.

 But I will dig a bit more and see if I can find the culprit.

Sure, I'll dig around as well.

-- 
Jens Axboe



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe
On Thu, Dec 13 2007, Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
  Jens Axboe wrote:
  On Thu, Dec 13 2007, Mark Lord wrote:
  Mark Lord wrote:
  Jens Axboe wrote:
  On Thu, Dec 13 2007, Mark Lord wrote:
  Matthew Wilcox wrote:
  On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
  Problem confirmed.  2.6.23.8 regularly generates segments up to 
  64KB for libata,
  but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
  Just a suspicion ... could this be slab vs slub?  ie check your 
  configs
  are the same / similar between the two kernels.
  ..
  
  Mmmm.. a good thought, that one.
  But I just rechecked, and both have CONFIG_SLAB=y
  
  My guess is that something got changed around when Jens
  reworked the block layer for 2.6.24.
  I'm going to dig around in there now.
  I didn't rework the block layer for 2.6.24 :-). The core block layer
  changes since 2.6.23 are:
  
  - Support for empty barriers. Not a likely candidate.
  - Shared tag queue fixes. Totally unlikely.
  - sg chaining support. Not likely.
  - The bio changes from Neil. Of the bunch, the most likely suspects in
   this area, since it changes some of the code involved with merges and
   blk_rq_map_sg().
  - Lots of simple stuff, again very unlikely.
  
  Anyway, it sounds odd for this to be a block layer problem if you do see
  occasional segments being merged. So it sounds more like the input data
  having changed.
  
  Why not just bisect it?
  ..
  
  Because the early 2.6.24 series failed to boot on this machine
  due to bugs in the block layer -- so the code that caused this regression
  is probably in the stuff from before the kernels became usable here.
  ..
  
  That sounds more harsh than intended -- the earlier 2.6.24 kernels (up to
  the first couple of -rc* ones failed here because of incompatibilities
  between the block/bio changes and libata.
  
  That's better, I think! 
  
  No worries, I didn't pick it up as harsh just as an odd conclusion :-)
  
  If I were you, I'd just start from the first -rc that booted for you. If
  THAT has the bug, then we'll think of something else. If you don't get
  anywhere, I can run some tests tomorrow and see if I can reproduce it
  here.
  ..
  
  I believe that *anyone* can reproduce it, since it's broken long before
  the requests ever get to SCSI or libata.  Which also means that *anyone*
  who wants to can bisect it, as well.
  
  I don't do bisects.
 
 It was just a suggestion on how to narrow it down, do as you see fit.
 
  But I will dig a bit more and see if I can find the culprit.
 
 Sure, I'll dig around as well.

Just tried something simple. I only see one 12kb segment so far, so not
a lot by any stretch. I also DON'T see any signs of missed merges, so it
would appear that the pages in the request are simply not contiguous
physically.

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index e30b1a4..1e34b6f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1330,6 +1330,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request 
*rq,
goto new_segment;
 
 		sg->length += nbytes;
+		if (sg->length > 8192)
+			printk("sg_len=%d\n", sg->length);
} else {
 new_segment:
if (!sg)
@@ -1349,6 +1351,8 @@ new_segment:
sg = sg_next(sg);
}
 
+		if (bvprv && (page_address(bvprv->bv_page) + bvprv->bv_len == page_address(bvec->bv_page)))
+			printk("missed merge\n");
 		sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
nsegs++;
}

-- 
Jens Axboe



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe
On Thu, Dec 13 2007, Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Matthew Wilcox wrote:
 On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
 Problem confirmed.  2.6.23.8 regularly generates segments up to 
 64KB for libata,
 but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
 Just a suspicion ... could this be slab vs slub?  ie check your 
 configs
 are the same / similar between the two kernels.
 ..
 
 Mmmm.. a good thought, that one.
 But I just rechecked, and both have CONFIG_SLAB=y
 
 My guess is that something got changed around when Jens
 reworked the block layer for 2.6.24.
 I'm going to dig around in there now.
 I didn't rework the block layer for 2.6.24 :-). The core block layer
 changes since 2.6.23 are:
 
 - Support for empty barriers. Not a likely candidate.
 - Shared tag queue fixes. Totally unlikely.
 - sg chaining support. Not likely.
 - The bio changes from Neil. Of the bunch, the most likely suspects in
 this area, since it changes some of the code involved with merges and
 blk_rq_map_sg().
 - Lots of simple stuff, again very unlikely.
 
 Anyway, it sounds odd for this to be a block layer problem if you do 
 see
 occasional segments being merged. So it sounds more like the input 
 data
 having changed.
 
 Why not just bisect it?
 ..
 
 Because the early 2.6.24 series failed to boot on this machine
 due to bugs in the block layer -- so the code that caused this 
 regression
 is probably in the stuff from before the kernels became usable here.
 ..
 
 That sounds more harsh than intended -- the earlier 2.6.24 kernels (up 
 to
 the first couple of -rc* ones failed here because of incompatibilities
 between the block/bio changes and libata.
 
 That's better, I think! 
 No worries, I didn't pick it up as harsh just as an odd conclusion :-)
 
 If I were you, I'd just start from the first -rc that booted for you. If
 THAT has the bug, then we'll think of something else. If you don't get
 anywhere, I can run some tests tomorrow and see if I can reproduce it
 here.
 ..
 
 I believe that *anyone* can reproduce it, since it's broken long before
 the requests ever get to SCSI or libata.  Which also means that *anyone*
 who wants to can bisect it, as well.
 
 I don't do bisects.
 
 It was just a suggestion on how to narrow it down, do as you see fit.
 
 But I will dig a bit more and see if I can find the culprit.
 
 Sure, I'll dig around as well.
 ..
 
 I wonder if it's 9dfa52831e96194b8649613e3131baa2c109f7dc:
 Merge blk_recount_segments into blk_recalc_rq_segments ?
 
 That particular commit does some rather innocent code-shuffling,
 but also introduces a couple of new if (nr_hw_segs == 1) conditions
 that were not there before.

You can try and revert it of course, but I think you are looking at the
wrong bits. If the segment counts were totally off, you'd never be
anywhere close to reaching the set limit. Your problems seems to be
missed contig segment merges.

 Okay git experts:  how do I pull out a kernel at the point of this exact 
 commit ?

Dummy approach - git log and grep for
9dfa52831e96194b8649613e3131baa2c109f7dc, then see what commit is before
that. Then do a git checkout of that commit.
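The parent of a commit can also be addressed directly with gitrevisions' `^` suffix, which avoids the grep step entirely. A sketch in a scratch repository (repository and commit names are illustrative):

```shell
# Check out the state just before a given commit using the ^ parent
# suffix, demonstrated in a throwaway repo with two empty commits.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git -c user.email=a@b -c user.name=t commit -q --allow-empty -m one
git -c user.email=a@b -c user.name=t commit -q --allow-empty -m two
hash=$(git rev-parse HEAD)     # stands in for 9dfa52831e96...
git checkout -q "${hash}^"     # parent of that commit
git log -1 --format=%s         # prints: one
```

In the real tree this is just `git checkout 9dfa52831e96194b8649613e3131baa2c109f7dc^`.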

-- 
Jens Axboe



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe
On Thu, Dec 13 2007, Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Mark Lord wrote:
 Jens Axboe wrote:
 On Thu, Dec 13 2007, Mark Lord wrote:
 Matthew Wilcox wrote:
 On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
 Problem confirmed.  2.6.23.8 regularly generates segments up to 
 64KB for libata,
 but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
 Just a suspicion ... could this be slab vs slub?  ie check your 
 configs
 are the same / similar between the two kernels.
 ..
 
 Mmmm.. a good thought, that one.
 But I just rechecked, and both have CONFIG_SLAB=y
 
 My guess is that something got changed around when Jens
 reworked the block layer for 2.6.24.
 I'm going to dig around in there now.
 I didn't rework the block layer for 2.6.24 :-). The core block 
 layer
 changes since 2.6.23 are:
 
 - Support for empty barriers. Not a likely candidate.
 - Shared tag queue fixes. Totally unlikely.
 - sg chaining support. Not likely.
 - The bio changes from Neil. Of the bunch, the most likely 
 suspects in
 this area, since it changes some of the code involved with merges 
 and
 blk_rq_map_sg().
 - Lots of simple stuff, again very unlikely.
 
 Anyway, it sounds odd for this to be a block layer problem if you 
 do see
 occasional segments being merged. So it sounds more like the input 
 data
 having changed.
 
 Why not just bisect it?
 ..
 
 Because the early 2.6.24 series failed to boot on this machine
 due to bugs in the block layer -- so the code that caused this 
 regression
 is probably in the stuff from before the kernels became usable here.
 ..
 
 That sounds more harsh than intended -- the earlier 2.6.24 kernels 
 (up to
 the first couple of -rc* ones failed here because of 
 incompatibilities
 between the block/bio changes and libata.
 
 That's better, I think! 
 No worries, I didn't pick it up as harsh just as an odd conclusion :-)
 
 If I were you, I'd just start from the first -rc that booted for you. 
 If
 THAT has the bug, then we'll think of something else. If you don't get
 anywhere, I can run some tests tomorrow and see if I can reproduce it
 here.
 ..
 
 I believe that *anyone* can reproduce it, since it's broken long before
 the requests ever get to SCSI or libata.  Which also means that 
 *anyone*
 who wants to can bisect it, as well.
 
 I don't do bisects.
 It was just a suggestion on how to narrow it down, do as you see fit.
 
 But I will dig a bit more and see if I can find the culprit.
 Sure, I'll dig around as well.
 Just tried something simple. I only see one 12kb segment so far, so not
 a lot by any stretch. I also DONT see any missed merges signs, so it
 would appear that the pages in the request are simply not contigious
 physically.
 
 diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
 index e30b1a4..1e34b6f 100644
 --- a/block/ll_rw_blk.c
 +++ b/block/ll_rw_blk.c
 @@ -1330,6 +1330,8 @@ int blk_rq_map_sg(struct request_queue *q, struct 
 request *rq,
goto new_segment;
 
 		sg->length += nbytes;
 +		if (sg->length > 8192)
 +			printk("sg_len=%d\n", sg->length);
} else {
 new_segment:
if (!sg)
 @@ -1349,6 +1351,8 @@ new_segment:
sg = sg_next(sg);
}
 
 +		if (bvprv && (page_address(bvprv->bv_page) + bvprv->bv_len == page_address(bvec->bv_page)))
 +			printk("missed merge\n");
 		sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
nsegs++;
}
 
 ..
 
 Yeah, the first part is similar to my own hack.
 
 For testing, try dd if=/dev/sda of=/dev/null bs=4096k.
 That *really* should end up using contiguous pages on most systems.
 
 I figured out the git thing, and am now building some in-between kernels 
 to try.
 
 OK, it's a vm issue, I have tens of thousands of backward pages after a
 boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not the
 reverse. So it looks like that bug got reintroduced.
 ...
 
 Mmm.. shouldn't one of the front- or back- merge logics work for either 
 order?

I think you are misunderstanding the merging. The front/back bits are
for contig on disk, this is sg segment merging. We can only join pieces
that are contig in memory, otherwise the result would not be pretty :-)

-- 
Jens Axboe



Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe
On Thu, Dec 13 2007, Andrew Morton wrote:
 On Thu, 13 Dec 2007 21:09:59 +0100
 Jens Axboe [EMAIL PROTECTED] wrote:
 
 
  OK, it's a vm issue,
 
 cc linux-mm and probable culprit.
 
  I have tens of thousands of backward pages after a
 boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not the
 reverse. So it looks like that bug got reintroduced.
 
 Bill Irwin fixed this a couple of years back: changed the page allocator so
 that it mostly hands out pages in ascending physical-address order.
 
 I guess we broke that, quite possibly in Mel's page allocator rework.
 
 It would help if you could provide us with a simple recipe for
 demonstrating this problem, please.

Basically anything involving IO :-). A boot here showed a handful of
good merges, and probably in the order of 100,000 descending
allocations. A kernel make is a fine test as well.

Something like the below should work fine - if you see oodles of these
basically doing any type of IO, then you are screwed.

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index e30b1a4..8ce3fcc 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1349,6 +1349,10 @@ new_segment:
sg = sg_next(sg);
}
 
+		if (bvprv) {
+			if (page_address(bvec->bv_page) + PAGE_SIZE == page_address(bvprv->bv_page) && printk_ratelimit())
+				printk("page alloc order backwards\n");
+		}
 		sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
nsegs++;
}

-- 
Jens Axboe



Re: SAS v SATA interface performance

2007-12-10 Thread Jens Axboe
On Mon, Dec 10 2007, Tejun Heo wrote:
 There's one thing we can do to improve the situation tho.  Several
 drives including raptors and 7200.11s suffer serious performance hit if
 sequential transfer is performed by multiple NCQ commands.  My 7200.11
 can do > 100MB/s if a non-NCQ command is used or only up to two NCQ
 commands are issued; however, if all 31 (maximum currently supported by
 libata) are used, the transfer rate drops to a miserable 70MB/s.
 
 It seems that what we need to do is not issuing too many commands to one
 sequential stream.  In fact, there isn't much to gain by issuing more
 than two commands to one sequential stream.

Well... CFQ won't go to deep queue depths across processes if they are
doing streaming IO, but it won't stop a single process from doing so. I'd
like to know what real life process would issue a streaming IO in some
async manner as to get 31 pending commands sequentially? Not very likely
:-)

So I'd consider your case above a microbenchmark result. I'd also claim
that the firmware is very crappy, if it performs like described.

There's another possibility as well - that the queueing by the drive
generates a worse issue IO pattern, and that is why the performance
drops. Did you check with blktrace what the generated IO looks like?

 Both raptors and 7200.11 perform noticeably better on random workload
 with NCQ enabled.  So, it's about time to update IO schedulers
 accordingly, it seems.

Definitely. Again microbenchmarks are able to show 30-40% improvements
when I last tested. That's a pure random workload though, again not
something that you would see in real life.

I tend to always run with a depth around 4 here. It seems to be a good
value, you get some benefits from NCQ but you don't allow the drive
firmware to screw you over.

-- 
Jens Axboe



Re: Bug: get EXT3-fs error Allocating block in system zone

2007-12-10 Thread Jens Axboe
On Mon, Dec 10 2007, Marco Gatti wrote:
 Jens Axboe schrieb:
 Hello Jens,
 Thanks for help. I just applied the patch. Unfortunately it doesn't 
 work.
 Can you try and additionally boot with iommu=off as a boot parameter?
 
 Yes. This is the end of getting any sata devices. See screenshots for 
 errors. It continued until ata4. At the end no root device was found.
 
 Hmm, even though the address is set to 0x we still seem to
 receive requests outside that range. Lets assume it's the scsi logic,
 can you test this? IOW, patch + iommu=off + this patch.
 
 I probably won't see any more mails tonight, we can continue this
 tomorrow (or someone else can step in, whatever comes first :-)
 
 diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
 index 2faced6..769ce3a 100644
 --- a/drivers/scsi/scsi_lib.c
 +++ b/drivers/scsi/scsi_lib.c
 @@ -1572,7 +1572,9 @@ struct request_queue *__scsi_alloc_queue(struct 
 Scsi_Host *shost,
  #endif
  
  blk_queue_max_sectors(q, shost->max_sectors);
  +#if 0
  blk_queue_bounce_limit(q, scsi_calculate_bounce_limit(shost));
  +#else
  blk_queue_segment_boundary(q, shost->dma_boundary);
  
  if (!shost->use_clustering)
 I applied the patch. Got Hunk #1 succeeded at 1562 with fuzz 2 (offset 
 -10 lines).
 
 I didn't compile completely.
 
 drivers/scsi/scsi_lib.c:1565:1: error: unterminated #else
 make[2]: *** [drivers/scsi/scsi_lib.o] Error 1
 make[1]: *** [drivers/scsi] Error 2
 make: *** [drivers] Error 2

Doh sorry, that #else wants to be an #endif

-- 
Jens Axboe



Re: Bug: get EXT3-fs error Allocating block in system zone

2007-12-09 Thread Jens Axboe
On Sun, Dec 09 2007, Linus Torvalds wrote:
 
 
 On Sun, 9 Dec 2007, Robert Hancock wrote:
  
  The obvious suspect with a filesystem problem would be the disk 
  controller driver, AHCI here. However, the controller appears to set the 
  flag to indicate that it supports 64-bit DMA, so it should be fine, 
  unless it lies of course (which we know that ATI SB600 chipset does, but 
  I don't believe Intel is known to).
  
  Could still be a DMA mapping bug that only shows up when IOMMU is used. 
  However, AHCI is a pretty well tested driver..
 
 AHCI is a pretty well tested driver, but 99%+ of all testers still tend to 
 have less than 4GB of memory. So I do *not* believe that the highmem bits 
 are all that well tested at all. 
 
 Can somebody who knows the driver send Marco a test-patch to just limit 
 DMA to the low 32 bits, and then Marco can at least verify that yes, that 
 that it. While it looks like DMA problems, there could obviously be some 
 other subtle issue with big-memory machines (ie the PCI allocations etc 
 tend to change too!)

Was just thinking that, this should do the trick. If this works, then we
can look at whether this is a hardware or iommu or block bouncing
(unlikely, would affect more people) bug.

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 4688dbf..cad3cbc 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -623,6 +623,9 @@ static void ahci_save_initial_config(struct pci_dev *pdev,
	hpriv->saved_cap = cap = readl(mmio + HOST_CAP);
	hpriv->saved_port_map = port_map = readl(mmio + HOST_PORTS_IMPL);
 
+	hpriv->saved_cap &= ~HOST_CAP_64;
+	cap &= ~HOST_CAP_64;
+
	/* some chips have errata preventing 64bit use */
	if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) {
		dev_printk(KERN_INFO, &pdev->dev,

-- 
Jens Axboe



Re: Bug: get EXT3-fs error Allocating block in system zone

2007-12-09 Thread Jens Axboe
On Sun, Dec 09 2007, Marco Gatti wrote:
 Jens Axboe schrieb:
 
 Was just thinking that, this should do the trick. If this works, then we
 can look at whether this is a hardware or iommu or block bouncing
 (unlikely, would affect more people) bug.
 
 diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
 index 4688dbf..cad3cbc 100644
 --- a/drivers/ata/ahci.c
 +++ b/drivers/ata/ahci.c
 @@ -623,6 +623,9 @@ static void ahci_save_initial_config(struct pci_dev 
 *pdev,
  hpriv->saved_cap = cap = readl(mmio + HOST_CAP);
  hpriv->saved_port_map = port_map = readl(mmio + HOST_PORTS_IMPL);
  
  +	hpriv->saved_cap &= ~HOST_CAP_64;
  +	cap &= ~HOST_CAP_64;
  +
  /* some chips have errata preventing 64bit use */
  if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) {
  dev_printk(KERN_INFO, &pdev->dev,
 
 
 Hello Jens,
 Thanks for help. I just applied the patch. Unfortunately it doesn't work.

Can you try and additionally boot with iommu=off as a boot parameter?

-- 
Jens Axboe



Re: Bug: get EXT3-fs error Allocating block in system zone

2007-12-09 Thread Jens Axboe
On Sun, Dec 09 2007, Marco Gatti wrote:
 Jens Axboe schrieb:
 On Sun, Dec 09 2007, Marco Gatti wrote:
 Jens Axboe schrieb:
 Was just thinking that, this should do the trick. If this works, then we
 can look at whether this is a hardware or iommu or block bouncing
 (unlikely, would affect more people) bug.
 
 diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
 index 4688dbf..cad3cbc 100644
 --- a/drivers/ata/ahci.c
 +++ b/drivers/ata/ahci.c
 @@ -623,6 +623,9 @@ static void ahci_save_initial_config(struct pci_dev 
 *pdev,
   hpriv->saved_cap = cap = readl(mmio + HOST_CAP);
   hpriv->saved_port_map = port_map = readl(mmio + HOST_PORTS_IMPL);
  
  +	hpriv->saved_cap &= ~HOST_CAP_64;
  +	cap &= ~HOST_CAP_64;
  +
   /* some chips have errata preventing 64bit use */
   if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) {
   dev_printk(KERN_INFO, &pdev->dev,
 
 Hello Jens,
 Thanks for help. I just applied the patch. Unfortunately it doesn't work.
 
 Can you try and additionally boot with iommu=off as a boot parameter?
 
 
 Yes. This is the end of getting any sata devices. See screenshots for 
 errors. It continued until ata4. At the end no root device was found.

Hmm, even though the address is set to 0x we still seem to
receive requests outside that range. Lets assume it's the scsi logic,
can you test this? IOW, patch + iommu=off + this patch.

I probably won't see any more mails tonight, we can continue this
tomorrow (or someone else can step in, whatever comes first :-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2faced6..769ce3a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1572,7 +1572,9 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host 
*shost,
 #endif
 
	blk_queue_max_sectors(q, shost->max_sectors);
+#if 0
	blk_queue_bounce_limit(q, scsi_calculate_bounce_limit(shost));
+#else
	blk_queue_segment_boundary(q, shost->dma_boundary);
 
	if (!shost->use_clustering)

-- 
Jens Axboe



Re: [PATCH 01/28] blk_end_request: add new request completion interface (take 3)

2007-12-05 Thread Jens Axboe
On Tue, Dec 04 2007, Kiyoshi Ueda wrote:
 Hi Boaz and Jens,
 
 On Tue, 04 Dec 2007 15:56:32 +0200, Boaz Harrosh [EMAIL PROTECTED] wrote:
   +/**
   + * blk_end_request - Helper function for drivers to complete the request.
   + * @rq:   the request being processed
  + * @uptodate: 1 for success, 0 for I/O error, < 0 for specific error
   + * @nr_bytes: number of bytes to complete
   + *
   + * Description:
   + * Ends I/O on a number of bytes attached to @rq.
   + * If @rq has leftover, sets it up for the next range of segments.
   + *
   + * Return:
   + * 0 - we are done with this request
   + * 1 - still buffers pending for this request
   + **/
   +int blk_end_request(struct request *rq, int uptodate, int nr_bytes)
  
  I always hated that uptodate boolean with possible negative error value.
  I guess it was done for backward compatibility of then users of 
  end_that_request_first(). But since you are introducing a new API then
  this is not the case. Just have regular status int where 0 means ALL_OK
  and negative value means error code. 
  Just my $0.02.
 
 Thank you for the comment.
 I think it's quite reasonable.
 By doing that, we don't need end_io_error() anymore.
 
 
 Jens,
 What do you think?

I agree, the uptodate usage right now is horrible.

 If you agree with the interface change above, I would prefer to
 separate the patch-set from blk_end_request patch-set like below:
 o blk_end_request: remove end_that_request_*
 o change interface of 'uptodate' in blk_end_request to 'error'
 It makes the purpose of blk_end_request patch-set clear
 (and also, each patch of device drivers could be smaller).
 But, it doubles your merging work.  So if you would like to get
 the changes at once, I'll merge them into blk_end_request patch-set.

Twice the merging is not an issue for me.

 As for the patch inclusion, do you push the driver changes to Linus
 all at once?  Or should I ask each maintainer to take the patch?

Lets just try to get as many maintainer acks as possible, since the
patches need to go in together.

-- 
Jens Axboe



Re: [PATCH 00/28] blk_end_request: full I/O completion handler (take 3)

2007-12-04 Thread Jens Axboe
On Fri, Nov 30 2007, Kiyoshi Ueda wrote:
 Hello Jens,
 
 The following is the updated patch-set for blk_end_request().
 Changes since the last version are only minor updates to catch up
 with the base kernel changes.
 Do you agree the implementation of blk_end_request()?
 If there's no problem, could you merge it to your tree?
 Or does it have to be merged to -mm tree first?

Looks good to me now, I'll queue it up. Thanks!

-- 
Jens Axboe



Re: [BUG] New Kernel Bugs

2007-11-13 Thread Jens Axboe
On Tue, Nov 13 2007, Andrew Morton wrote:
  I/O STORAGE===
  
  kernel bug from pktcdvd
  http://bugzilla.kernel.org/show_bug.cgi?id=9294
  Kernel: 2.6.23
 
 I think we might have fixed this.

It's fixed and merged, I just forgot to close the bugzilla. Did so now.

-- 
Jens Axboe



Re: Suspend to ram regression (2.6.24-rc1-git)

2007-11-05 Thread Jens Axboe
On Fri, Nov 02 2007, Kristen Carlson Accardi wrote:
 On Thu, 1 Nov 2007 09:41:46 +0100
 Jens Axboe [EMAIL PROTECTED] wrote:
 
  On Wed, Oct 31 2007, Jens Axboe wrote:
   Hi,
   
   My x60 stopped suspending about two days ago. It just freezes after
   printing
   
   Suspending console(s)
   
   where it would normally turn everything off and the 'moon' light would
   go on. Posting this message in case somebody else knows what is up, if
   not I'll do a bisect on it tomorrow.
  
  Did the bisect, it points to this commit:
  
  31556594f913fa81d008cecfe46d7211c919a853 is first bad commit
  commit 31556594f913fa81d008cecfe46d7211c919a853
  Author: Kristen Carlson Accardi [EMAIL PROTECTED]
  Date:   Thu Oct 25 01:33:26 2007 -0400
  
  [libata] AHCI: add hw link power management support
  
  Booting any kernel after this commit fails suspending to ram, it just
  sits there forever.
  
  -- 
  Jens Axboe
  
 
 Does this patch fix your problem?  It seems to get hung up while disabling
 DIPM, and after thinking about this a bit, I don't think we really need
 to do this anyway.

Yep, works for me! Can we get this expedited upstream?

-- 
Jens Axboe



Re: Suspend to ram regression (2.6.24-rc1-git)

2007-11-01 Thread Jens Axboe
On Wed, Oct 31 2007, Jens Axboe wrote:
 Hi,
 
 My x60 stopped suspending about two days ago. It just freezes after
 printing
 
 Suspending console(s)
 
 where it would normally turn everything off and the 'moon' light would
 go on. Posting this message in case somebody else knows what is up, if
 not I'll do a bisect on it tomorrow.

Did the bisect, it points to this commit:

31556594f913fa81d008cecfe46d7211c919a853 is first bad commit
commit 31556594f913fa81d008cecfe46d7211c919a853
Author: Kristen Carlson Accardi [EMAIL PROTECTED]
Date:   Thu Oct 25 01:33:26 2007 -0400

[libata] AHCI: add hw link power management support

Booting any kernel after this commit fails suspending to ram, it just
sits there forever.

-- 
Jens Axboe



Re: Suspend to ram regression (2.6.24-rc1-git)

2007-11-01 Thread Jens Axboe
On Thu, Nov 01 2007, Jens Axboe wrote:
 On Wed, Oct 31 2007, Jens Axboe wrote:
  Hi,
  
  My x60 stopped suspending about two days ago. It just freezes after
  printing
  
  Suspending console(s)
  
  where it would normally turn everything off and the 'moon' light would
  go on. Posting this message in case somebody else knows what is up, if
  not I'll do a bisect on it tomorrow.
 
 Did the bisect, it points to this commit:
 
 31556594f913fa81d008cecfe46d7211c919a853 is first bad commit
 commit 31556594f913fa81d008cecfe46d7211c919a853
 Author: Kristen Carlson Accardi [EMAIL PROTECTED]
 Date:   Thu Oct 25 01:33:26 2007 -0400
 
 [libata] AHCI: add hw link power management support
 
 Booting any kernel after this commit fails suspending to ram, it just
 sits there forever.

Reverting just the default AHCI flags makes it work again. IOW, with the
below patch I can suspend properly with current -git.

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index ed9b407..77f7631 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -190,8 +190,7 @@ enum {
 
AHCI_FLAG_COMMON= ATA_FLAG_SATA | ATA_FLAG_NO_LEGACY |
  ATA_FLAG_MMIO | ATA_FLAG_PIO_DMA |
- ATA_FLAG_ACPI_SATA | ATA_FLAG_AN |
- ATA_FLAG_IPM,
+ ATA_FLAG_ACPI_SATA | ATA_FLAG_AN,
AHCI_LFLAG_COMMON   = ATA_LFLAG_SKIP_D2H_BSY,
 };
 

-- 
Jens Axboe



Re: Suspend to ram regression (2.6.24-rc1-git)

2007-11-01 Thread Jens Axboe
On Thu, Nov 01 2007, Jens Axboe wrote:
 On Thu, Nov 01 2007, Jens Axboe wrote:
  On Wed, Oct 31 2007, Jens Axboe wrote:
   Hi,
   
   My x60 stopped suspending about two days ago. It just freezes after
   printing
   
   Suspending console(s)
   
   where it would normally turn everything off and the 'moon' light would
   go on. Posting this message in case somebody else knows what is up, if
   not I'll do a bisect on it tomorrow.
  
  Did the bisect, it points to this commit:
  
  31556594f913fa81d008cecfe46d7211c919a853 is first bad commit
  commit 31556594f913fa81d008cecfe46d7211c919a853
  Author: Kristen Carlson Accardi [EMAIL PROTECTED]
  Date:   Thu Oct 25 01:33:26 2007 -0400
  
  [libata] AHCI: add hw link power management support
  
  Booting any kernel after this commit fails suspending to ram, it just
  sits there forever.
 
 Reverting just the default AHCI flags makes it work again. IOW, with the
 below patch I can suspend properly with current -git.
 
 diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
 index ed9b407..77f7631 100644
 --- a/drivers/ata/ahci.c
 +++ b/drivers/ata/ahci.c
 @@ -190,8 +190,7 @@ enum {
  
   AHCI_FLAG_COMMON= ATA_FLAG_SATA | ATA_FLAG_NO_LEGACY |
 ATA_FLAG_MMIO | ATA_FLAG_PIO_DMA |
 -   ATA_FLAG_ACPI_SATA | ATA_FLAG_AN |
 -   ATA_FLAG_IPM,
 +   ATA_FLAG_ACPI_SATA | ATA_FLAG_AN,
   AHCI_LFLAG_COMMON   = ATA_LFLAG_SKIP_D2H_BSY,
  };

I should also mention that just stubbing out ahci_enable_alpm() and
ahci_disable_alpm() in ahci.c is NOT enough to make it work.

-- 
Jens Axboe



Re: Suspend to ram regression (2.6.24-rc1-git)

2007-11-01 Thread Jens Axboe
On Thu, Nov 01 2007, Jeff Garzik wrote:
 Jens Axboe wrote:
 Reverting just the default AHCI flags makes it work again. IOW, with the
 below patch I can suspend properly with current -git.
 diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
 index ed9b407..77f7631 100644
 --- a/drivers/ata/ahci.c
 +++ b/drivers/ata/ahci.c
 @@ -190,8 +190,7 @@ enum {
  AHCI_FLAG_COMMON= ATA_FLAG_SATA | ATA_FLAG_NO_LEGACY |
ATA_FLAG_MMIO | ATA_FLAG_PIO_DMA |
 -  ATA_FLAG_ACPI_SATA | ATA_FLAG_AN |
 -  ATA_FLAG_IPM,
 +  ATA_FLAG_ACPI_SATA | ATA_FLAG_AN,
  AHCI_LFLAG_COMMON   = ATA_LFLAG_SKIP_D2H_BSY,


 sounds like the easy thing to do, in light of this breakage, might be to 
 default it to off, add a module parameter turning it on by setting that 
 flag.

Wouldn't it be better to just get this bug fixed? IOW, is there a reason
for disabling ALPM if it's bug free?

I'd suggest committing the patch disabling IPM, then Kristen can debug
the thing in peace and quiet. Once confident it works with ahci again, we
can revert that commit.

-- 
Jens Axboe



Re: Fix ATAPI transfer lengths causes CD writing regression

2007-10-31 Thread Jens Axboe
On Wed, Oct 31 2007, Alan Cox wrote:
 On Tue, 30 Oct 2007 19:21:29 +
 Daniel Drake [EMAIL PROTECTED] wrote:
 
  Alan Cox wrote:
   I would guess Brasero is issuing a command with the length of data
   wrongly set. In the old code that might well just produce errors of the
   Umm wtf is this data left over for ?, with the new code the drive is
   likely to change state as it knows the transfer size and that will
   *correctly* cause an HSM error and what follows.
   
   Now the question is who gets the length wrong - Brasero or the ata
   translation code in libata
  
  Brasero does exactly the same as my test app which I attached to my last 
  mail. Is my test app wrong?
 
 Would need to double check the SCSI specifications to be sure but I think
 you are asking for less data than the drive wishes to provide. You
 aren't allowed to do that with ATA.

ide-cd handles this by throwing the excess away, which I think is the
sane way to do this.

-- 
Jens Axboe



Re: Fix ATAPI transfer lengths causes CD writing regression

2007-10-31 Thread Jens Axboe
On Wed, Oct 31 2007, Jeff Garzik wrote:
 Jens Axboe wrote:
 On Wed, Oct 31 2007, Alan Cox wrote:
 On Tue, 30 Oct 2007 19:21:29 +
 Daniel Drake [EMAIL PROTECTED] wrote:

 Alan Cox wrote:
 I would guess Brasero is issuing a command with the length of data
 wrongly set. In the old code that might well just produce errors of the
 Umm wtf is this data left over for ?, with the new code the drive is
 likely to change state as it knows the transfer size and that will
 *correctly* cause an HSM error and what follows.

 Now the question is who gets the length wrong - Brasero or the ata
 translation code in libata
 Brasero does exactly the same as my test app which I attached to my last 
 mail. Is my test app wrong?
 Would need to double check the SCSI specificatons to be sure but I think
 you are asking for less data than the drive wishes to provide. You
 aren't allowed to do that with ATA.
 ide-cd handles this by throwing the excess away, which I think is the
 sane way to do this.

 That's easy for the PIO case.  But CD writing is normally DMA, which means 
 you will get a DMA engine exception if the device wants to give you more 
 data than the scatter/gather entries permit.

Right, that's of course problematic... There has to be a way to recover
that situation though, or you can't export any user command issue
facility.

-- 
Jens Axboe



Re: Fix ATAPI transfer lengths causes CD writing regression

2007-10-31 Thread Jens Axboe
On Wed, Oct 31 2007, Jeff Garzik wrote:
 Jens Axboe wrote:
 Right, that's of course problematic... There has to be a way to recover
 that situation though, or you can't export any user command issue
 facility.
 
 You cannot hope to handle all possible effects arising from an app 
 providing an invalid sg header / cdb.
 
 Once you start talking recovery you are already screwed:  we are 
 talking about low-level hardware commands that are passed straight to 
 the hardware.  It is trivial to lock up hardware, brick hardware, and 
 corrupt data at that level.
 
 
 If this is NOT a privileged app, we must update the command validation 
 to ensure that invalid commands are not transported to the hardware.
 
 If this is a privileged app, our work is done.  Fix the app.  We gave 
 root rope, and he took it.

Woaw, back the truck up a bit :-)

I'm talking about simple things - like asking for 8 bytes of sense data.
Simple mistakes. You cannot possibly check for everything like that in a
command filter, it's utterly impossible.

 I even venture to say that accept anything, clean up afterwards is 
 /impossible/ to implement, in addition to being dangerous.

Certainly, that's not what I'm talking about.

-- 
Jens Axboe



Re: ata_exec_internal crash at boot in -git22

2007-10-22 Thread Jens Axboe
On Mon, Oct 22 2007, Andi Kleen wrote:
 
 One of the systems tested in autoboot crashes at boot with with -git22.
 This is a AMD 2 socket Opteron NUMA system.
 
 The tester was a little flakey and happened to hit the x86-merge-broke-
 compilation window, so the last good data point I have is 2.6.23-rc9.

Andi, can you test with this patch applied?

http://brick.kernel.dk/sg-git.patch

-- 
Jens Axboe



Re: ata_exec_internal crash at boot in -git22

2007-10-22 Thread Jens Axboe
On Mon, Oct 22 2007, Andi Kleen wrote:
 On Monday 22 October 2007 20:26:45 Jens Axboe wrote:
  On Mon, Oct 22 2007, Andi Kleen wrote:
   
   One of the systems tested in autoboot crashes at boot with with -git22.
   This is a AMD 2 socket Opteron NUMA system.
   
   The tester was a little flakey and happened to hit the x86-merge-broke-
   compilation window, so the last good data point I have is 2.6.23-rc9.
  
  Andi, can you test with this patch applied?
  
  http://brick.kernel.dk/sg-git.patch
 Sorry was a mistake (cue: there is no 2.6.23-git22 yet). It turned out
 to be an old broken kernel. Current -git seems to boot.
 Sorry for the noise.

OK, all for the better :-)

-- 
Jens Axboe



Re: remove blk_queue_max_phys_segments in libata

2007-10-16 Thread Jens Axboe
On Tue, Oct 16 2007, Jeff Garzik wrote:
 Linux Kernel Mailing List wrote:
 Gitweb: 
 http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8889e3c129780cdbe15fed3c366ba3aa3026684d
 Commit: 8889e3c129780cdbe15fed3c366ba3aa3026684d
 Parent: fd820f405574a30aacf9a859886e173d641f080b
 Author: FUJITA Tomonori [EMAIL PROTECTED]
 AuthorDate: Tue Sep 18 12:16:45 2007 +0200
 Committer:  Jens Axboe [EMAIL PROTECTED]
 CommitDate: Tue Oct 16 11:24:44 2007 +0200
 
 remove blk_queue_max_phys_segments in libata
 
 LIBATA_MAX_PRD is the maximum number of DMA scatter/gather elements
 permitted by the HBA's DMA engine. It's properly set to
  q->max_hw_segments via the sg_tablesize parameter.
 
 libata shouldn't call blk_queue_max_phys_segments. Now LIBATA_MAX_PRD
  is equal to SCSI_MAX_PHYS_SEGMENTS by default (both are 128), so
 everything is fine. But if they are changed, some code (like the scsi
 mid layer, sg chaining, etc) might not work properly.
 
 (Addition from Jens) The basic issue is that the physical segment
 setting is purely a driver issue. And since SCSI is managing the 
 sglist,
 libata has no business changing the setting. All libata should care
 about is the hw segment setting.
 
 Signed-off-by: FUJITA Tomonori [EMAIL PROTECTED]
 Signed-off-by: Jens Axboe [EMAIL PROTECTED]
 ---
  drivers/ata/libata-scsi.c |2 --
  1 files changed, 0 insertions(+), 2 deletions(-)
 
 diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
 index d63c81e..ba62d53 100644
 --- a/drivers/ata/libata-scsi.c
 +++ b/drivers/ata/libata-scsi.c
 @@ -801,8 +801,6 @@ int ata_scsi_slave_config(struct scsi_device *sdev)
  
  ata_scsi_sdev_config(sdev);
  
  -	blk_queue_max_phys_segments(sdev->request_queue, LIBATA_MAX_PRD);
  -
  	sdev->manage_start_stop = 1;
 
 As I noted when this first patch was posted...   this line of code 
 existed because the difference between hw-segments and phys-segments was 
 incredibly difficult to discern.  The names say basically the same thing 
 to me, and I could find no documentation explaining the difference?

We discussed this change to death afair, so I'm a little surprised you
say that. The commit message itself has good reasoning as well.

 Does such documentation exist?  If not, can you please explain the 
 difference?

OK, I can give a brief summary. The driver can set two restrictions -
hardware and physical. I'm not sure who originally came up with the
naming (I think it's pretty terrible), but essentially the 'hardware'
restriction is what the hardware can support (OK that name works) and
the 'physical' restriction is imposed by the driver. Typically that last
restriction has to do with structure limitations, like allocating stuff
on the stack or limiting the size of the sgtable you have to allocate.

 Finally, some notification and coordination would have been helpful. 

Confused, as I mentioned, we discussed this at quite some length a few
months ago.

 Commit 6c08772e49622e90d39903e7ff0be1a0f463ac86 appears to have been 
 missed, at the very least... or is it actually correct?

Depends on what you think it does. The number of hardware segments will
never be larger than the physical segments. But it looks like you want
to set the hw seg value there. As mentioned in the commit above, since
libata doesn't allocate its own sg tables, letting it fiddle with the
physical segment limit is dodgy at best.

-- 
Jens Axboe



Re: 2.6.23-rc7-mm1 AHCI ATA errors -- won't boot

2007-09-25 Thread Jens Axboe
On Tue, Sep 25 2007, Berck E. Nash wrote:
 Jeff Garzik wrote:
 
  The first step would be to clone the upstream branch of
  git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git
  
  and see if the problem is reproducible there.  If yes, then you have
  narrowed down the problem to something my ATA devel tree has introduced
  into -mm.
 
 Nope, you're off the hook.  The libata tree works great, so it must be
 something else in -mm conflicting.

Can you try 2.6.23-rc8 plus this patch:

http://brick.kernel.dk/git-block.patch.bz2

and see if that works?

-- 
Jens Axboe



Re: 2.6.23-rc7-mm1 AHCI ATA errors -- won't boot

2007-09-25 Thread Jens Axboe
On Tue, Sep 25 2007, Berck E. Nash wrote:
 Jens Axboe wrote:
  On Tue, Sep 25 2007, Berck E. Nash wrote:
  Jeff Garzik wrote:
 
  The first step would be to clone the upstream branch of
  git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git
 
  and see if the problem is reproducible there.  If yes, then you have
  narrowed down the problem to something my ATA devel tree has introduced
  into -mm.
  Nope, you're off the hook.  The libata tree works great, so it must be
  something else in -mm conflicting.
 
 Whoops, sorry!  I just lied.  I'm a git newbie, and failed to actually
 get the upstream branch the first time, so rc8 is clean, but it fails
 when I actually pull the upstream branch.  I'll git bisect and get back
 to you.

OK, you probably realize this, but you can forget about the git-block
testing for now then.

-- 
Jens Axboe



Re: [PATCH] ahci: enable GHC.AE bit before set GHC.HR

2007-09-21 Thread Jens Axboe
On Fri, Sep 21 2007, Alan Cox wrote:
 On Fri, 21 Sep 2007 12:31:20 +0200
 Jens Axboe [EMAIL PROTECTED] wrote:
 
  On Fri, Sep 21 2007, Peer Chen wrote:
   According to the description of section 5.2.2.1 and 10.1.2 of AHCI 
   specification rev1_1/rev1_2, GHC.HR shall only be set to '1'
   by software when GHC.AE is set to '1'.
   
   Signed-off-by: Peer Chen [EMAIL PROTECTED]
   ---
   --- linux-2.6.23-rc7/drivers/ata/ahci.c.orig  2007-09-20 
   11:01:55.0 -0400
   +++ linux-2.6.23-rc7/drivers/ata/ahci.c   2007-09-20 11:07:31.0 
   -0400
   @@ -834,6 +834,10 @@ static int ahci_reset_controller(struct 
 void __iomem *mmio = host->iomap[AHCI_PCI_BAR];
 u32 tmp;

  +/* turn on AHCI mode before controller reset */
   +writel(HOST_AHCI_EN, mmio + HOST_CTL);
   +(void) readl(mmio + HOST_CTL);  /* flush */
   +
 /* global controller reset */
 tmp = readl(mmio + HOST_CTL);
 if ((tmp  HOST_RESET) == 0) {
  
  I appreciate the readl() flushes, but in this particular case we end up
  reading the exact offset again below.
 
 The code above is wrong btw - it is an iomap so both the new and existing
 code should be using ioread* not readl

Happy converting :-)
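The two observations combine naturally: use the iomap accessors, and let the single read after the write serve both as the posted-write flush and as the value the reset logic needs. A rough userspace sketch of that shape, with a mocked register file standing in for ioread32()/iowrite32() (register index and mock names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace mock of the port register file; in the kernel these would be
 * ioread32()/iowrite32() on the iomapped BAR. */
static uint32_t host_regs[16];
#define HOST_CTL      1            /* register index, made up for the mock */
#define HOST_AHCI_EN  (1u << 31)
#define HOST_RESET    (1u << 0)

static uint32_t mock_ioread32(unsigned int idx)
{
	return host_regs[idx];
}

static void mock_iowrite32(uint32_t val, unsigned int idx)
{
	host_regs[idx] = val;
}

/* Enable AHCI mode, then start the global reset.  The single read after
 * the write both flushes the posted write and fetches the value the
 * reset logic tests, avoiding the duplicated read Jens pointed out. */
static void reset_prologue(void)
{
	uint32_t tmp;

	mock_iowrite32(mock_ioread32(HOST_CTL) | HOST_AHCI_EN, HOST_CTL);
	tmp = mock_ioread32(HOST_CTL);	/* flush + reuse */
	if (!(tmp & HOST_RESET))
		mock_iowrite32(tmp | HOST_RESET, HOST_CTL);
}
```

This is a sketch of the control flow only; the real patch would keep the existing ahci_reset_controller() structure.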

-- 
Jens Axboe



Re: [PATCH] ide-cd is unmaintained

2007-09-20 Thread Jens Axboe
On Thu, Sep 20 2007, Alan Cox wrote:
 I simply don't have any old IDE systems any more or time to really look
 after this. Nobody responded to the previous linux-ide mail about
 maintainers so...
 
 Signed-off-by: Alan Cox [EMAIL PROTECTED]

Just fold it in with the IDE system in general? Would make the most
sense, I think.

-- 
Jens Axboe



Re: 2.6.23-rc4-mm1

2007-09-11 Thread Jens Axboe
On Mon, Sep 10 2007, Torsten Kaiser wrote:
 On 9/10/07, FUJITA Tomonori [EMAIL PROTECTED] wrote:
  On Mon, 10 Sep 2007 12:20:38 -0700
  Andrew Morton [EMAIL PROTECTED] wrote:
 
   On Mon, 10 Sep 2007 20:59:49 +0200 Torsten Kaiser [EMAIL PROTECTED] 
   wrote:
The system boots, reads the partition tables, starts the RAID and then
kicks one drive out because of errors.
  
   Andy is using qla1280.  You're using sata.  So it's probably a different
   bug, with the same symptoms.
 
  This might be a sg chaining bug too (probably sg chaining libata
  patch).
 
  Can you try the following patch that I've just sent:
 
  http://lkml.org/lkml/2007/9/10/251
 
  The patch also disables chaining sg list for libata.
 
 With this patch 2.6.23-rc4-mm1 works for me.
 Mainline 2.6.23-rc5-git1 works also without needing any patches.

OK, thanks for testing that. I'll merge Tomo's patch so that we can
selectively enable drivers when we KNOW they work, instead of trying to
do this (massive) operation whole sale.

-- 
Jens Axboe



Re: [-mm patch] remove ide_get_error_location()

2007-09-11 Thread Jens Axboe
On Tue, Sep 11 2007, Bartlomiej Zolnierkiewicz wrote:
 On Sunday 09 September 2007, Adrian Bunk wrote:
  On Fri, Aug 31, 2007 at 09:58:22PM -0700, Andrew Morton wrote:
  ...
   Changes since 2.6.23-rc3-mm1:
  ...
git-block.patch
  ...
git trees
  ...
  
  ide_get_error_location() is no longer used.
  
  Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
 
 Signed-off-by: Bartlomiej Zolnierkiewicz [EMAIL PROTECTED]
 
 Since git-block contains the patch which removes the only user of
 ide_get_error_location() I think that this patch should be also merged
 through block tree.  Jens?

Yeah, I'll add it there.

 PS: none of the blkdev_issue_flush() users use the *error_sector argument,
 so it can probably be removed as well

I had hoped that the existence was enough incentive, but it didn't
happen. I'll make a note to kill that again.

-- 
Jens Axboe



Re: [PATCH 6/7] blk_end_request: remove/unexport end_that_request_*

2007-09-04 Thread Jens Axboe
On Tue, Sep 04 2007, Halevy, Benny wrote:
 Boaz raised my attention to this patchset today...
 We suspect we'll still need the extern entry points for handling the bidi 
 request in the scsi_io_completion() path as we only want to call
 end_that_request_chunk on req->next_rq and never
 end_that_request_last.
  
 (see 
 http://www.bhalevy.com/open-osd/download/linux-2.6.23-rc2_and_iscsi-iscsi-2007_08_09/0005-SCSI-bidi-support.patch)
  
 If this is ok with you I'd leave these entry points in place rather than
 taking them out and putting them back in later.

There's no point in leaving them in when nothing current needs it, I'd
much rather add it back in should the need arise. That's the proper way
to handle things like this.

-- 
Jens Axboe



Re: [PATCH] libata: kill sector_buf temporary buffer

2007-08-16 Thread Jens Axboe
On Thu, Aug 16 2007, Jeff Garzik wrote:
 
 Rather than carrying around this buffer all the time, for rare
 circumstances, it seems that we can easily alloc/free a temp buffer as
 needed.
 
 Saves a big chunk of per-port memory.
 
 I forget the justification for why it was in ata_port to begin with,
 but the code uses don't seem to justify any need.
 
 diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
 index 735f74b..5e9049f 100644
 --- a/drivers/ata/libata-core.c
 +++ b/drivers/ata/libata-core.c
 @@ -3679,20 +3679,29 @@ static int ata_dev_same_device(struct ata_device 
 *dev, unsigned int new_class,
  int ata_dev_reread_id(struct ata_device *dev, unsigned int readid_flags)
  {
   unsigned int class = dev->class;
 - u16 *id = (void *)dev->link->ap->sector_buf;
 - int rc;
 + u16 *id;
 + int rc = 0;
 +
 + id = kzalloc(ATA_SECT_SIZE, GFP_KERNEL);
 + if (!id)
 + return -ENOMEM;
  
   /* read ID data */
   rc = ata_dev_read_id(dev, class, readid_flags, id);
   if (rc)
 - return rc;
 + goto out;
  
   /* is the device still there? */
 - if (!ata_dev_same_device(dev, class, id))
 - return -ENODEV;
 + if (!ata_dev_same_device(dev, class, id)) {
 + rc = -ENODEV;
 + goto out;
 + }
  
   memcpy(dev-id, id, sizeof(id[0]) * ATA_ID_WORDS);
 - return 0;
 +
 +out:
 + kfree(id);
 + return rc;
  }
  
  /**
 diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
 index 2ddc2ed..fc731e2 100644
 --- a/drivers/ata/libata-eh.c
 +++ b/drivers/ata/libata-eh.c
 @@ -1133,14 +1133,20 @@ static unsigned int ata_read_log_page(struct 
 ata_device *dev,
  static int ata_eh_read_log_10h(struct ata_device *dev,
  int *tag, struct ata_taskfile *tf)
  {
  - u8 *buf = dev->link->ap->sector_buf;
 + u8 *buf;
   unsigned int err_mask;
   u8 csum;
 - int i;
 + int i, rc = 0;
 +
 + buf = kzalloc(ATA_SECT_SIZE, GFP_KERNEL);
 + if (!buf)
 + return -ENOMEM;

How can this work? If we're issuing this log page read, we cannot do any
IO to the device. So this allocation must not generate any IO. And if we
fail in allocating memory and just return, the device won't talk to us in
the future.
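One conventional answer to this problem is to keep the buffer preallocated from probe time, so the error-handling path never calls an allocator that might itself wait on I/O to the failed device. A minimal sketch of that idea, with a hypothetical struct port standing in for the real per-port state:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ATA_SECT_SIZE 512

/* Hypothetical stand-in for the per-port state: the sector buffer is
 * allocated once at init time (kmalloc at probe in the kernel), not in
 * the error handler. */
struct port {
	unsigned char *sector_buf;
};

static int port_init(struct port *ap)
{
	ap->sector_buf = malloc(ATA_SECT_SIZE);
	return ap->sector_buf ? 0 : -1;
}

/* EH path: reuse the preallocated buffer; no allocation can fail here
 * and no allocation can generate I/O to the device being recovered. */
static unsigned char *eh_get_log_buf(struct port *ap)
{
	memset(ap->sector_buf, 0, ATA_SECT_SIZE);
	return ap->sector_buf;
}
```

This is the rationale for sector_buf living in ata_port in the first place; the alternative of allocating in EH would at minimum need a non-I/O-generating allocation mode rather than a plain GFP_KERNEL call.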

-- 
Jens Axboe



Re: libata git tree, mbox queue status and contents

2007-08-13 Thread Jens Axboe
On Fri, Aug 03 2007, Jeff Garzik wrote:
 * Kristen: ALPM patches.  We definitely want them, as they save a ton of 
 power.

The problem with ALPM, as I see it, is that it is way too aggressive. It
really needs to be combined with a timer to be useful, it's really a
huge shame that it doesn't come equipped with a timeout setting in
hardware. Lacking that, we could punt to using a second aligned timer
that just checks for activity in the last second, and if none was seen
then enable ALPM. That should have absolutely minimal impact on CPU
consumption. Likewise for when we see IO, when the rate/sec goes beyond
a low threshold then disable ALPM again.
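The second-aligned timer policy described above could reduce to something like the following sketch, where the per-second request count drives the ALPM state (the threshold value is made up, and the real hook would live in the driver's timer callback):

```c
#include <assert.h>

/* Once a second: enable ALPM if no I/O was seen in the last interval,
 * disable it again when the per-second rate crosses a threshold. */
struct alpm_state {
	unsigned int io_this_sec;	/* bumped on each request */
	int alpm_enabled;
};

#define ALPM_DISABLE_THRESH 8	/* reqs/sec; hypothetical number */

static void alpm_timer_tick(struct alpm_state *st)
{
	if (st->io_this_sec == 0)
		st->alpm_enabled = 1;		/* idle: save power */
	else if (st->io_this_sec >= ALPM_DISABLE_THRESH)
		st->alpm_enabled = 0;		/* busy: full speed */
	st->io_this_sec = 0;			/* start new interval */
}
```

The point of the second alignment is that the timer fires with other periodic work, so the CPU cost of the check is close to zero.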

In my testing on this notebook (x60), throughput was reduced to about
30% when using ALPM. So while it does save a good amount of power, it
also makes the disk a slow dog if you are actually using it.

-- 
Jens Axboe



Re: Some NCQ numbers...

2007-07-09 Thread Jens Axboe
On Wed, Jul 04 2007, James Bottomley wrote:
 On Wed, 2007-07-04 at 10:19 +0900, Tejun Heo wrote:
  Michael Tokarev wrote:
   Well.  It looks like the results does not depend on the
   elevator.  Originally I tried with deadline, and just
   re-ran the test with noop (hence the long delay with
   the answer) - changing linux elevator changes almost
   nothing in the results - modulo some random fluctuations.
  
  I see.  Thanks for testing.
  
   In any case, NCQ - at least in this drive - just does
   not work.  Linux with its I/O elevator may help to
   speed things up a bit, but the disk does nothing in
   this area.  NCQ doesn't slow things down either - it
   just does not work.
   
   The same's for ST3250620NS enterprise drives.
   
   By the way, Seagate announced Barracuda ES 2 series
   (in range 500..1200Gb if memory serves) - maybe with
   those, NCQ will work better?
  
  No one would know without testing.
  
   Or maybe it's libata which does not implement NCQ
   properly?  (As I shown before, with almost all
   ol'good SCSI drives TCQ helps alot - up to 2x the
   difference and more - with multiple I/O threads)
  
  Well, what the driver does is minimal.  It just passes through all the
  commands to the harddrive.  After all, NCQ/TCQ gives the harddrive more
  responsibility regarding request scheduling.
 
 Actually, in many ways the result support a theory of SCSI TCQ Jens used
 when designing the block layer.  The original TCQ theory held that the
 drive could make much better head scheduling decisions than the
 Operating System, so you just used TCQ to pass all the outstanding I/O
 unfiltered down to the drive to let it schedule.  However, the I/O
 results always seemed to indicate that the effect of TCQ was negligible
 at around 4 outstanding commands, leading to the second theory that all
 TCQ was good for was saturating the transport, and making scheduling
 decisions was, indeed, better left to the OS (hence all our I/O
 schedulers).

Indeed, the above I still find to be true. The only real case where
larger depths make a real difference, is a pure random reads (or writes,
with write caching off) workload. And those situations are largely
synthetic, hence benchmarks tend to show NCQ being a lot more beneficial
since they construct workloads that consist 100% of random IO. Real life
is rarely so black and white.

Additionally, there are cases where drive queue depths hurt a lot. The
drive has no knowledge of fairness, or process-to-io mappings. So AS/CFQ
has to artificially limit queue depths competing IO processes doing
semi (or fully) sequential workloads, or throughput plummets.

So while NCQ has some benefits, I typically tend to prefer managing the
IO queue largely in software instead of punting to (often) buggy
firmware.

-- 
Jens Axboe



Re: [patch 3/3] Enable Aggressive Link Power management for AHCI controllers.

2007-06-22 Thread Jens Axboe
On Fri, Jun 22 2007, Kristen Carlson Accardi wrote:
 On Thu, 21 Jun 2007 15:08:32 +0200
 Jens Axboe [EMAIL PROTECTED] wrote:
 
  On Wed, Jun 20 2007, Kristen Carlson Accardi wrote:
   Enable Aggressive Link Power management for AHCI controllers.
   
   This patch will set the correct bits to turn on Aggressive
   Link Power Management (ALPM) for the ahci driver.  This
   will cause the controller and disk to negotiate a lower
   power state for the link when there is no activity (see
   the AHCI 1.x spec for details).  This feature is mutually
   exclusive with Hot Plug, so when ALPM is enabled, Hot Plug
   is disabled.  ALPM will be enabled by default, but it is
  settable via the scsi host sysfs interface.  Possible 
   settings for this feature are:
   
   Setting   Effect
   --
   min_power ALPM is enabled, and link set to enter 
 lowest power state (SLUMBER) when idle
 Hot plug not allowed.
   
   max_performance   ALPM is disabled, Hot Plug is allowed
   
   medium_power  ALPM is enabled, and link set to enter
 second lowest power state (PARTIAL) when
 idle.  Hot plug not allowed.
   
   Signed-off-by:  Kristen Carlson Accardi [EMAIL PROTECTED]
  
  A suggestion (it comes with a patch!) - default to max_performance/ALPM off,
  not min_power. For two reasons:
  
  - There's such a big performance difference between the two, you really
want max_performance when booting.
  
  - It's a lot better to default to no change, than default to enabling
something new.
 
 Sounds reasonable to me.  Distros/users can decide if they want to have
 scripts that enable this after boot to run at min_power. 

Exactly, it needs to be handled by some power management daemon anyway
and be integrated with power savings in general. You could use io load
to determine when to enable/disable alpm, for instance.
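The daemon side of that could be as simple as mapping the observed I/O rate to one of the three policy strings and writing the result to the scsi_host sysfs node. A hedged sketch (the thresholds are invented, and the sysfs path shown in the comment is only what the patch's interface suggests it would look like):

```c
#include <assert.h>
#include <string.h>

/* Pick a link power policy from the observed request rate.  The
 * thresholds are hypothetical; a real daemon would tune them against
 * measured power/throughput trade-offs. */
static const char *pick_link_pm_policy(unsigned int reqs_per_sec)
{
	if (reqs_per_sec == 0)
		return "min_power";
	if (reqs_per_sec < 16)
		return "medium_power";
	return "max_performance";
}

/* The daemon would then do, roughly:
 *   echo min_power > /sys/class/scsi_host/host0/link_power_management_policy
 * (exact attribute name depends on the merged sysfs interface).
 */
```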

 Acked-by: Kristen Carlson Accardi [EMAIL PROTECTED]

Will you integrate it into the next posting?

-- 
Jens Axboe



Re: 126X DVD drive? (odd ide-cd output)

2007-06-21 Thread Jens Axboe
On Wed, Jun 20 2007, Cal Peake wrote:
 Hello,
 
 It seems if I boot with a disc in the drive (in this case a dual fs 
 ISO-9660/UDF DVD-R) the drive speed gets reported incorrectly:
 
 hdc: DVDRW DRW-6S160P, ATAPI CD/DVD-ROM drive
 hdc: ATAPI 126X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(66)
 
 normal output (w/ no disc) is:
 
 hdc: DVDRW DRW-6S160P, ATAPI CD/DVD-ROM drive
 hdc: ATAPI 48X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(66)
 
 Kernel is 2.6.22-rc5-gf1518a08.
 
 Harmless, but odd...

Sounds like a firmware bug. From a drive with a 'DVDRW' vendor string,
that's not highly surprising :-)

-- 
Jens Axboe



Re: [patch 3/3] Enable Aggressive Link Power management for AHCI controllers.

2007-06-21 Thread Jens Axboe
On Wed, Jun 20 2007, Kristen Carlson Accardi wrote:
 Enable Aggressive Link Power management for AHCI controllers.
 
 This patch will set the correct bits to turn on Aggressive
 Link Power Management (ALPM) for the ahci driver.  This
 will cause the controller and disk to negotiate a lower
 power state for the link when there is no activity (see
 the AHCI 1.x spec for details).  This feature is mutually
 exclusive with Hot Plug, so when ALPM is enabled, Hot Plug
 is disabled.  ALPM will be enabled by default, but it is
 settable via the scsi host sysfs interface.  Possible 
 settings for this feature are:
 
 Setting   Effect
 --
 min_power ALPM is enabled, and link set to enter 
   lowest power state (SLUMBER) when idle
   Hot plug not allowed.
 
 max_performance   ALPM is disabled, Hot Plug is allowed
 
 medium_power  ALPM is enabled, and link set to enter
   second lowest power state (PARTIAL) when
   idle.  Hot plug not allowed.
 
 Signed-off-by:  Kristen Carlson Accardi [EMAIL PROTECTED]

A suggestion (it comes with a patch!) - default to max_performance/ALPM off,
not min_power. For two reasons:

- There's such a big performance difference between the two, you really
  want max_performance when booting.

- It's a lot better to default to no change, than default to enabling
  something new.

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 841cf0a..e7a2072 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -786,8 +786,7 @@ static int ahci_disable_alpm(struct ata_port *ap)
return 0;
 }
 
-static int ahci_enable_alpm(struct ata_port *ap,
-   enum scsi_host_link_pm policy)
+static int ahci_enable_alpm(struct ata_port *ap, enum scsi_host_link_pm policy)
 {
	struct ahci_host_priv *hpriv = ap->host->private_data;
void __iomem *port_mmio = ahci_port_base(ap);
@@ -808,19 +807,19 @@ static int ahci_enable_alpm(struct ata_port *ap,
return -EINVAL;
}
 
-   switch(policy) {
+   switch (policy) {
case SHOST_MAX_PERFORMANCE:
-   ahci_disable_alpm(ap);
-   ap->pm_policy = policy;
-   return 0;
case SHOST_NOT_AVAILABLE:
-   case SHOST_MIN_POWER:
/*
 * if we came here with SHOST_NOT_AVAILABLE,
 * it just means this is the first time we
-* have tried to enable - so try to do
-* min_power
+* have tried to enable - default to max performance,
+* and let the user go to lower power modes on request.
 */
+   ahci_disable_alpm(ap);
+   ap->pm_policy = SHOST_MAX_PERFORMANCE;
+   return 0;
+   case SHOST_MIN_POWER:
	ap->pm_policy = SHOST_MIN_POWER;
 
/* configure HBA to enter SLUMBER */

-- 
Jens Axboe



Re: [patch 0/3] AHCI Link Power Management

2007-06-14 Thread Jens Axboe
On Tue, Jun 12 2007, Tejun Heo wrote:
 Arjan van de Ven wrote:
  I'm not sure about this.  We need better PM framework to support 
  powersaving in other controllers and some ahcis don't save much
  when only link power management is used,
  
  do you have data to support this?
 
 Yeah, it was some Lenovo notebook.  Pavel is more familiar with the
 hardware.  Pavel, what was the notebook which didn't save much power
 with standard SATA power save but needed port to be completely turned off?
 
  The data we have from this patch is that it saves typically a Watt of
  power (depends on the machine of course, but the range is 0.5W to
  1.5W). If you want to also have an even more agressive thing where
  you want to start disabling the entire controller... I don't see how
  this is in conflict with saving power on the link level by just
  enabling a hardware feature 
 
 Well, both implement about the same thing.  I prefer software
 implementation because it's more generic and ALPE/ASP seems too
 aggressive to me.  Here are reasons why sw implementation wasn't merged.
 
 1. It didn't have proper interface with userland.  This was mainly
 because of missing ATA sysfs nodes.  I'm not sure whether adding this to
 scsi node is a good idea.
 
 2. It was focused on SATA link PS and couldn't cover the Lenovo case.
 
 I think we need something at the block layer.

I think the hardware method is preferable, actually. Doing this in the
block layer would mean keeping track of idle time, and that quickly
turns into a lot of timer management. Not exactly free, in terms of CPU
usage.

I've yet to do some power measurements with this ahci patch, I just
noticed that with min_power performance drops from ~55mb/sec to
~15mb/sec sequential on my drive. That's pretty drastic :-)

-- 
Jens Axboe



Re: use_clustering (sht) bit set to 0 in AHCI ?

2007-05-21 Thread Jens Axboe
On Thu, Apr 26 2007, Alok kataria wrote:
 Hi Jeff,
 
 I recently got a new AHCI disk, and was using the AHCI-libata driver to run
 this. I noticed that the scattergather lists that were being built for
 the I/O on this device were just of PAGE_SIZE length, even though i
 was doing IO on
 contiguous pages.
 
 Through a little instrumentation I figured out that the use_clustering
 bit in the ahci_sht (scsi_host_template) is set to zero. Due to which
 we are not putting consecutive bios into one sg in blk_rq_map_sg.
 
 I tried changing the clustering bit to 1, but encountered a panic at
 the initialization of the disks  during the bootup process, and so
 couldn't entirely get hold of the panic mesg.
 I was wondering, though the max segment size supported with this driver
 is 65536, why was I not able to feed in bigger SGs to this driver.
 
 I searched on the net and the intel ahci arch doc too but couldn't
 find anything relative to the clustering support for AHCI.
 
 It would be great help if you could let me know about the possible
 problems with clustering on AHCI  ? why is it off by default in the
 driver ? or anything else that might help.

It's a good question. If you look at the documentation, it states that
ahci supports up to 64k sg entries and each can have a size of up to 4mb
(bits 0 through 21). So as far as I can tell, clustering should work
with a segment size up to those 4mb.
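The 4 MB figure follows directly from the width of the per-entry byte-count field: bits 0 through 21 give 22 bits, and since such count fields are conventionally 0-based (a value of N encodes N+1 bytes), the largest single segment is 2^22 bytes. A small sketch of that arithmetic:

```c
#include <assert.h>

/* The AHCI per-PRD byte count occupies bits 0..21, i.e. 22 bits.
 * Treating the field as 0-based (value N means N+1 bytes), the largest
 * segment one entry can describe is 2^22 bytes = 4 MB. */
#define AHCI_DBC_BITS 22

static unsigned int ahci_max_segment_bytes(void)
{
	unsigned int dbc_max = (1u << AHCI_DBC_BITS) - 1;	/* 0x3fffff */

	return dbc_max + 1;					/* 0-based field */
}
```

So on paper nothing stops clustering with a segment size limit of 4 MB; the observed panic suggests the driver (or hardware) has an additional, undocumented constraint.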

ahci has always had clustering disabled, perhaps Jeff can expand on why?

-- 
Jens Axboe



Re: [PATCH 4/4] bidi support: bidirectional request

2007-05-01 Thread Jens Axboe
On Tue, May 01 2007, Boaz Harrosh wrote:
 Please consider the attached proposal. It is a complete block-level bidi
 implementation that is, I hope, a middle ground which will keep everyone
 happy (including Christoph). It is both quite small and not invasive,
 yet has a full bidi API that is easy to use and maintain.

This isn't much of an improvement imo, if any at all. Why didn't you do
the ->next_rq approach I suggested? Your patch still makes struct
request considerably fatter (30% here, from 280 to 368 bytes on x86-64
from a quick look) for something that will have relatively few uses. And
it still has its paws all over the block layer code.

Please just implement the 2nd data phase as a linked request off the
first one. I think that approach is both much cleaner from a design
perspective, and also much leaner and has zero (well almost, it costs a
pointer) impact on the regular read-write paths.

-- 
Jens Axboe



Re: [PATCH 4/4] bidi support: bidirectional request

2007-04-30 Thread Jens Axboe
On Sun, Apr 29 2007, James Bottomley wrote:
 On Sun, 2007-04-29 at 18:48 +0300, Boaz Harrosh wrote:
  FUJITA Tomonori wrote:
   From: Boaz Harrosh [EMAIL PROTECTED]
   Subject: [PATCH 4/4] bidi support: bidirectional request
   Date: Sun, 15 Apr 2007 20:33:28 +0300
   
   diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
   index 645d24b..16a02ee 100644
   --- a/include/linux/blkdev.h
   +++ b/include/linux/blkdev.h
   @@ -322,6 +322,7 @@ struct request {
void *end_io_data;
  
struct request_io_part uni;
   +struct request_io_part bidi_read;
};
   
   Would be more straightforward to have:
   
   struct request_io_part in;
   struct request_io_part out;
   
  
  Yes I wish I could do that. For bidi supporting drivers this is the most 
  logical.
  But for the 99.9% of uni-directional drivers, calling rq_uni(), and being 
  some what on
  the hotish paths, this means we will need a pointer to a uni 
  request_io_part.
  This is bad because:
  1st- There is no defined stage in a request life where to definitely set 
  that pointer,
   specially in the preparation stages.
  2nd- hacks like scsi_error.c/scsi_send_eh_cmnd() will not work at all. Now 
  this is a
   very bad spot already, and I have a short term fix for it in the 
  SCSI-bidi patches
   (not sent yet) but a more long term solution is needed. Once such 
  hacks are
   cleaned up we can do what you say. This is exactly why I use the 
  access functions
   rq_uni/rq_io/rq_in/rq_out and not open code access.
 
 I'm still not really convinced about this approach.  The primary job of
 the block layer is to manage and merge READ and WRITE requests.  It
 serves a beautiful secondary function of queueing for arbitrary requests
 it doesn't understand (REQ_TYPE_BLOCK_PC or REQ_TYPE_SPECIAL ... or
 indeed any non REQ_TYPE_FS).
 
 bidirectional requests fall into the latter category (there's nothing
 really we can do to merge them ... they're just transported by the block
 layer).  The only unusual feature is that they carry two bios.  I think
 the drivers that actually support bidirectional will be a rarity, so it
 might even be advisable to add it to the queue capability (refuse
 bidirectional requests at the top rather than perturbing all the drivers
 to process them).
 
 So, what about REQ_TYPE_BIDIRECTIONAL rather than REQ_BIDI?  That will
 remove it from the standard path and put it on the special command type
 path where we can process it specially.  Additionally, if you take this
 approach, you can probably simply chain the second bio through
 req->special as an additional request in the stream.  The only thing
 that would then need modification would be the dequeue of the block
 driver (it would have to dequeue both requests and prepare them) and
 that needs to be done only for drivers handling bidirectional requests.

I agree, I'm really not crazy about shuffling the entire request setup
around just for something as exotic as bidirection commands. How about
just keeping it simple - have a second request linked off the first one
for the second data phase? So keep it completely seperate, not just
overload ->special for 2nd bio list.

So basically just add a struct request pointer, so you can do rq =
rq->next_rq or something for the next data phase. I bet this would be a
LOT less invasive as well, and we can get by with a few helpers to
support it.
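The cost accounting in that argument is easy to see in miniature: a unidirectional request pays exactly one extra pointer, and only bidi-aware drivers ever follow it. A sketch of the shape being proposed (field names per the discussion, everything else hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal sketch of the linked-request idea: the out-phase request
 * carries a pointer to a second request for the in phase.  Ordinary
 * unidirectional requests leave it NULL, so the regular read/write
 * path pays only the pointer. */
struct request {
	int data_dir;			/* 0 = write/out, 1 = read/in */
	struct request *next_rq;	/* second data phase, or NULL */
};

static int request_is_bidi(const struct request *rq)
{
	return rq->next_rq != NULL;
}
```

A bidi-aware driver would dequeue the first request and, when request_is_bidi() holds, prepare rq->next_rq as the second data phase; everyone else never looks at the field.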

And it should definitely be a request type.

-- 
Jens Axboe



Re: [PATCH 4/4] bidi support: bidirectional request

2007-04-30 Thread Jens Axboe
On Mon, Apr 30 2007, Benny Halevy wrote:
 Jens Axboe wrote:
  On Sun, Apr 29 2007, James Bottomley wrote:
  On Sun, 2007-04-29 at 18:48 +0300, Boaz Harrosh wrote:
  FUJITA Tomonori wrote:
  From: Boaz Harrosh [EMAIL PROTECTED]
  Subject: [PATCH 4/4] bidi support: bidirectional request
  Date: Sun, 15 Apr 2007 20:33:28 +0300
 
  diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
  index 645d24b..16a02ee 100644
  --- a/include/linux/blkdev.h
  +++ b/include/linux/blkdev.h
  @@ -322,6 +322,7 @@ struct request {
   void *end_io_data;
 
   struct request_io_part uni;
  +struct request_io_part bidi_read;
   };
  Would be more straightforward to have:
 
  struct request_io_part in;
  struct request_io_part out;
 
  Yes I wish I could do that. For bidi supporting drivers this is the most 
  logical.
  But for the 99.9% of uni-directional drivers, calling rq_uni(), and being 
  some what on
  the hotish paths, this means we will need a pointer to a uni 
  request_io_part.
  This is bad because:
  1st- There is no defined stage in a request life where to definitely set 
  that pointer,
   specially in the preparation stages.
  2nd- hacks like scsi_error.c/scsi_send_eh_cmnd() will not work at all. 
  Now this is a
   very bad spot already, and I have a short term fix for it in the 
  SCSI-bidi patches
   (not sent yet) but a more long term solution is needed. Once such 
  hacks are
   cleaned up we can do what you say. This is exactly why I use the 
  access functions
   rq_uni/rq_io/rq_in/rq_out and not open code access.
  I'm still not really convinced about this approach.  The primary job of
  the block layer is to manage and merge READ and WRITE requests.  It
  serves a beautiful secondary function of queueing for arbitrary requests
  it doesn't understand (REQ_TYPE_BLOCK_PC or REQ_TYPE_SPECIAL ... or
  indeed any non REQ_TYPE_FS).
 
  bidirectional requests fall into the latter category (there's nothing
  really we can do to merge them ... they're just transported by the block
  layer).  The only unusual feature is that they carry two bios.  I think
  the drivers that actually support bidirectional will be a rarity, so it
  might even be advisable to add it to the queue capability (refuse
  bidirectional requests at the top rather than perturbing all the drivers
  to process them).
 
  So, what about REQ_TYPE_BIDIRECTIONAL rather than REQ_BIDI?  That will
  remove it from the standard path and put it on the special command type
  path where we can process it specially.  Additionally, if you take this
  approach, you can probably simply chain the second bio through
  req->special as an additional request in the stream.  The only thing
  that would then need modification would be the dequeue of the block
  driver (it would have to dequeue both requests and prepare them) and
  that needs to be done only for drivers handling bidirectional requests.
  
  I agree, I'm really not crazy about shuffling the entire request setup
  around just for something as exotic as bidirection commands. How about
  just keeping it simple - have a second request linked off the first one
  for the second data phase? So keep it completely seperate, not just
  overload -special for 2nd bio list.
  
  So basically just add a struct request pointer, so you can do rq =
   rq->next_rq or something for the next data phase. I bet this would be a
  LOT less invasive as well, and we can get by with a few helpers to
  support it.
  
  And it should definitely be a request type.
  
 
 I'm a bit confused since what you both suggest is very similar to what we've
 proposed back in October 2006 and the impression we got was that it will be
 better to support bidirectional block requests natively (yet to be honest,
 James, you wanted a linked request all along).

It still has to be implemented natively at the block layer, just
differently like described above. So instead of messing all over the
block layer adding rq_uni() stuff, just add that struct request pointer
to the request structure for the 2nd data phase. You can relatively easy
then modify the block layer helpers to support mapping and setup of such
requests.

 Before we go on that route again, how do you see the support for bidi
 at the scsi mid-layer done?  Again, we prefer to support that officially
 using two struct scsi_cmnd_buff instances in struct scsi_cmnd and not as
 a one-off feature, using special-purpose state and logic (e.g. a linked
 struct scsi_cmd for the bidi_read sg list).

The SCSI part is up to James, that can be done as either inside a single
scsi command, or as linked scsi commands as well. I don't care too much
about that bit, just the block layer parts :-). And the proposed block
layer design can be used both ways by the scsi layer.

-- 
Jens Axboe



Re: [PATCH 4/4] bidi support: bidirectional request

2007-04-30 Thread Jens Axboe
On Mon, Apr 30 2007, Douglas Gilbert wrote:
 Jens Axboe wrote:
  On Mon, Apr 30 2007, Benny Halevy wrote:
  Jens Axboe wrote:
  On Sun, Apr 29 2007, James Bottomley wrote:
  On Sun, 2007-04-29 at 18:48 +0300, Boaz Harrosh wrote:
  FUJITA Tomonori wrote:
  From: Boaz Harrosh [EMAIL PROTECTED]
  Subject: [PATCH 4/4] bidi support: bidirectional request
  Date: Sun, 15 Apr 2007 20:33:28 +0300
 
  diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
  index 645d24b..16a02ee 100644
  --- a/include/linux/blkdev.h
  +++ b/include/linux/blkdev.h
  @@ -322,6 +322,7 @@ struct request {
   void *end_io_data;
 
   struct request_io_part uni;
  +struct request_io_part bidi_read;
   };
  Would be more straightforward to have:
 
  struct request_io_part in;
  struct request_io_part out;
 
  Yes I wish I could do that. For bidi supporting drivers this is the 
  most logical.
  But for the 99.9% of uni-directional drivers, calling rq_uni(), and 
  being some what on
  the hotish paths, this means we will need a pointer to a uni 
  request_io_part.
  This is bad because:
  1st- There is no defined stage in a request life where to definitely 
  set that pointer,
   specially in the preparation stages.
  2nd- hacks like scsi_error.c/scsi_send_eh_cmnd() will not work at all. 
  Now this is a
   very bad spot already, and I have a short term fix for it in the 
  SCSI-bidi patches
   (not sent yet) but a more long term solution is needed. Once such 
  hacks are
   cleaned up we can do what you say. This is exactly why I use the 
  access functions
   rq_uni/rq_io/rq_in/rq_out and not open code access.
  I'm still not really convinced about this approach.  The primary job of
  the block layer is to manage and merge READ and WRITE requests.  It
  serves a beautiful secondary function of queueing for arbitrary requests
  it doesn't understand (REQ_TYPE_BLOCK_PC or REQ_TYPE_SPECIAL ... or
  indeed any non REQ_TYPE_FS).
 
  bidirectional requests fall into the latter category (there's nothing
  really we can do to merge them ... they're just transported by the block
  layer).  The only unusual feature is that they carry two bios.  I think
  the drivers that actually support bidirectional will be a rarity, so it
  might even be advisable to add it to the queue capability (refuse
  bidirectional requests at the top rather than perturbing all the drivers
  to process them).
 
  So, what about REQ_TYPE_BIDIRECTIONAL rather than REQ_BIDI?  That will
  remove it from the standard path and put it on the special command type
  path where we can process it specially.  Additionally, if you take this
  approach, you can probably simply chain the second bio through
  req-&gt;special as an additional request in the stream. The only thing
  that would then need modification would be the dequeue of the block
  driver (it would have to dequeue both requests and prepare them) and
  that needs to be done only for drivers handling bidirectional requests.
  I agree, I'm really not crazy about shuffling the entire request setup
  around just for something as exotic as bidirection commands. How about
  just keeping it simple - have a second request linked off the first one
  for the second data phase? So keep it completely seperate, not just
  overload -special for 2nd bio list.
 
  So basically just add a struct request pointer, so you can do rq =
  rq-&gt;next_rq or something for the next data phase. I bet this would be a
  LOT less invasive as well, and we can get by with a few helpers to
  support it.
 
  And it should definitely be a request type.
 
  I'm a bit confused since what you both suggest is very similar to what 
  we've
  proposed back in October 2006 and the impression we got was that it will be
  better to support bidirectional block requests natively (yet to be honest,
  James, you wanted a linked request all along).
  
  It still has to be implemented natively at the block layer, just
  differently like described above. So instead of messing all over the
  block layer adding rq_uni() stuff, just add that struct request pointer
  to the request structure for the 2nd data phase. You can relatively easy
  then modify the block layer helpers to support mapping and setup of such
  requests.
  
  Before we go on that route again, how do you see the support for bidi
  at the scsi mid-layer done?  Again, we prefer to support that officially
  using two struct scsi_cmnd_buff instances in struct scsi_cmnd and not as
  a one-off feature, using special-purpose state and logic (e.g. a linked
  struct scsi_cmd for the bidi_read sg list).
  
  The SCSI part is up to James, that can be done as either inside a single
  scsi command, or as linked scsi commands as well. I don't care too much
  about that bit, just the block layer parts :-). And the proposed block
  layer design can be used both ways by the scsi layer.
 
 Linked SCSI commands have been obsolete since SPC-4 rev 6
 (18 July 2006) after proposal 06-259r1

Re: BUG: Null pointer dereference in fs/open.c

2007-04-26 Thread Jens Axboe
On Wed, Apr 25 2007, Andrew Morton wrote:
  # pktsetup 2 /dev/sr0
  [19982.934793] drivers/scsi/scsi_lib.c:838: setting error to 134217730
  [19982.941070]  [c010521a] show_trace_log_lvl+0x1a/0x30
  [19982.946256]  [c0105952] show_trace+0x12/0x20
  [19982.950744]  [c0105a46] dump_stack+0x16/0x20
  [19982.955232]  [c034543a] scsi_io_completion+0x28a/0x3a0
  [19982.960586]  [c034556b] scsi_blk_pc_done+0x1b/0x30
  [19982.965594]  [c0340d0c] scsi_finish_command+0x4c/0x60
  [19982.970861]  [c0345c07] scsi_softirq_done+0x77/0xe0
  [19982.975955]  [c0257f8b] blk_done_softirq+0x6b/0x80
  [19982.980962]  [c01243a2] __do_softirq+0x62/0xc0
  [19982.985624]  [c0124455] do_softirq+0x55/0x60
  [19982.990112]  [c0124be5] ksoftirqd+0x65/0x100
  [19982.994599]  [c0132963] kthread+0xa3/0xd0
  [19982.998827]  [c0104e17] kernel_thread_helper+0x7/0x10
  [19983.004095]  ===
  [19983.009065] cdrom: This disc doesn't have any tracks I recognize!
 
 So what has happened here is that this code, in ide-cd.c's
 cdrom_decode_status() is now triggering:
 
    } else if (blk_pc_request(rq) || rq-&gt;cmd_type == REQ_TYPE_ATA_PC) {
    	/* All other functions, except for READ. */
    	unsigned long flags;
  
    	/*
    	 * if we have an error, pass back CHECK_CONDITION as the
    	 * scsi status byte
    	 */
    	if (blk_pc_request(rq) &amp;&amp; !rq-&gt;errors)
    		rq-&gt;errors = SAM_STAT_CHECK_CONDITION;
 
 
 I suspect this is a bug introduced by
 406c9b605cbc45151c03ac9a3f95e9acf050808c (in which case it'll be the third
 bug so far).
 
 Perhaps the IDE driver was previously not considering these requests to be
 of type blk_pc_request(), and after
 406c9b605cbc45151c03ac9a3f95e9acf050808c it _is_ treating them as
 blk_pc_request() and is incorrectly reporting an error.  Or something like
 that.

But it IS a block pc request. We've been setting the SAM status on
-&gt;errors for block pc requests for a long time.

 Guys: help!

I'm not sure what your question is. If someone can sum up what
goes wrong and what the expected result is, I can try and help.

-- 
Jens Axboe



Re: BUG: Null pointer dereference in fs/open.c

2007-04-26 Thread Jens Axboe
On Thu, Apr 26 2007, William Heimbigner wrote:
 On Wed, 25 Apr 2007, Andrew Morton wrote:
 On Wed, 25 Apr 2007 22:53:00 + (GMT) William Heimbigner 
 [EMAIL PROTECTED] wrote:
 
 On Wed, 25 Apr 2007, Andrew Morton wrote:
 snip
 OK.  I am able to use the pktcdvd driver OK in mainline with a piix/sata
 drive.  It could be that something is going wrong at the IDE level for 
 you.
 Perhaps; I'll try an external usb cd burner, and see where that goes.
 
 Are you able to identify the most recent kernel which actually worked?
 No, because I haven't set packet writing up in Linux before - however, I 
 do know
 that I've successfully set up packet writing (using 2 of the 3 cd burners 
 I
 have) in another operating system before. I'll try 2.6.18 and see if that 
 gets
 me anywhere different, though.
 
 OK.
 
 A quick summary: mainline's pktcdvd isn't working for William using IDE.
 It is working for me using sata.
 
 snip
 
 So what has happened here is that this code, in ide-cd.c's
 cdrom_decode_status() is now triggering:
 
   } else if (blk_pc_request(rq) || rq-&gt;cmd_type == REQ_TYPE_ATA_PC) {
   	/* All other functions, except for READ. */
   	unsigned long flags;
  
   	/*
   	 * if we have an error, pass back CHECK_CONDITION as the
   	 * scsi status byte
   	 */
   	if (blk_pc_request(rq) &amp;&amp; !rq-&gt;errors)
   		rq-&gt;errors = SAM_STAT_CHECK_CONDITION;
 
 
 I suspect this is a bug introduced by
 406c9b605cbc45151c03ac9a3f95e9acf050808c (in which case it'll be the third
 bug so far).
 
 Perhaps the IDE driver was previously not considering these requests to be
 of type blk_pc_request(), and after
 406c9b605cbc45151c03ac9a3f95e9acf050808c it _is_ treating them as
 blk_pc_request() and is incorrectly reporting an error.  Or something like
 that.
 
 Guys: help!
 
 A follow-up: after looking around a bit, I have managed to get packet 
 writing to work properly on /dev/hdc (before, it was reporting only 1.8 MB 
 available or so; this was a formatting issue).
 I've also gotten the external cd-rw drive to work. However, I'm still at a 
 loss as to why /dev/hdd won't work. I tried formatting a dvd-rw for this 
 drive, however, it consistently gives me:
 [27342.503933] drivers/ide/ide-cd.c:729: setting error to 2
 [27342.509251]  [c010521a] show_trace_log_lvl+0x1a/0x30
 [27342.514411]  [c0105952] show_trace+0x12/0x20
 [27342.518864]  [c0105a46] dump_stack+0x16/0x20
 [27342.523317]  [c033f6e4] cdrom_decode_status+0x1f4/0x3b0
 [27342.528732]  [c033fae8] cdrom_newpc_intr+0x38/0x320
 [27342.533791]  [c0331106] ide_intr+0x96/0x200
 [27342.538157]  [c0150cf8] handle_IRQ_event+0x28/0x60
 [27342.543139]  [c0151f96] handle_edge_irq+0xa6/0x130
 [27342.548121]  [c0106449] do_IRQ+0x49/0xa0
 [27342.552228]  [c0104c3a] common_interrupt+0x2e/0x34
 [27342.557200]  [c01022d2] mwait_idle+0x12/0x20
 [27342.561653]  [c01023ca] cpu_idle+0x4a/0x80
 [27342.565934]  [c0101147] rest_init+0x37/0x40
 [27342.570300]  [c068ac7b] start_kernel+0x34b/0x420
 [27342.575109]  [] 0x0
 [27342.578089]  ===
 and doesn't work (the above output was generated by Andrew's patch to log 
 certain areas).
 
 # dvd+rw-format /dev/hdd -force
  * BD/DVD±RW/-RAM format utility by [EMAIL PROTECTED], version 7.0.
 :-( failed to locate Quick Format descriptor.
 * 4.7GB DVD-RW media in Sequential mode detected.
 * formatting 0.0\:-[ READ TRACK INFORMATION failed with 
 SK=3h/ASC=11h/ACQ=05h]: Input/output error

That's an uncorrectable read error. Is the media good?

 I tried putting in a different dvd-rw, and this time I get:
 # dvd+rw-format /dev/hdd -force
  * BD/DVD±RW/-RAM format utility by [EMAIL PROTECTED], version 7.0.
 * 4.7GB DVD-RW media in Sequential mode detected.
 * formatting 0.0|:-[ FORMAT UNIT failed with SK=5h/ASC=26h/ACQ=00h]: 
 Input/output error

That's the drive complaining about an invalid bit being set in the
command descriptor block. That's usually a bug in the issuer.

-- 
Jens Axboe



[PATCH] Maxtor 6B250S0/BANC1B70 hangs with NCQ

2007-03-30 Thread Jens Axboe
Hi,

I've seen this several times on this drive, completely reproducible.
Once it has hung, power needs to be cut from the drive to recover it, a
simple reboot is not enough. So I'd suggest disabling NCQ on this
driver.

Error log attached.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index f1f595f..ddb3909 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -3361,6 +3361,8 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
 	{ "FUJITSU MHT2060BH",	NULL,		ATA_HORKAGE_NONCQ },
 	/* NCQ is broken */
 	{ "Maxtor 6L250S0",	"BANC1G10",	ATA_HORKAGE_NONCQ },
+	/* NCQ hard hangs device under heavier load, needs hard power cycle */
+	{ "Maxtor 6B250S0",	"BANC1B70",	ATA_HORKAGE_NONCQ },
 
 	/* Devices with NCQ limits */
 

-- 
Jens Axboe

ata1.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 0x2 frozen
ata1.00: cmd 61/08:00:3f:00:f4/00:00:0c:00:00/40 tag 0 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:08:3f:00:f8/00:00:0c:00:00/40 tag 1 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:10:3f:00:fc/00:00:0c:00:00/40 tag 2 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:18:3f:00:00/00:00:0d:00:00/40 tag 3 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:20:3f:00:04/00:00:0d:00:00/40 tag 4 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:28:3f:00:08/00:00:0d:00:00/40 tag 5 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:30:3f:00:0c/00:00:0d:00:00/40 tag 6 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:38:3f:00:10/00:00:0d:00:00/40 tag 7 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:40:3f:00:14/00:00:0d:00:00/40 tag 8 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:48:3f:00:18/00:00:0d:00:00/40 tag 9 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:50:3f:00:1c/00:00:0d:00:00/40 tag 10 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:58:3f:00:20/00:00:0d:00:00/40 tag 11 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:60:3f:00:24/00:00:0d:00:00/40 tag 12 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:68:3f:00:28/00:00:0d:00:00/40 tag 13 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:70:3f:00:2c/00:00:0d:00:00/40 tag 14 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:78:3f:00:30/00:00:0d:00:00/40 tag 15 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:80:3f:00:34/00:00:0d:00:00/40 tag 16 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:88:3f:00:38/00:00:0d:00:00/40 tag 17 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:90:3f:00:3c/00:00:0d:00:00/40 tag 18 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:98:3f:00:40/00:00:0d:00:00/40 tag 19 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:a0:3f:00:44/00:00:0d:00:00/40 tag 20 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:a8:3f:00:48/00:00:0d:00:00/40 tag 21 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:b0:3f:00:4c/00:00:0d:00:00/40 tag 22 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:b8:3f:00:50/00:00:0d:00:00/40 tag 23 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:c0:3f:00:54/00:00:0d:00:00/40 tag 24 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:c8:3f:00:dc/00:00:0c:00:00/40 tag 25 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:d0:3f:00:e0/00:00:0c:00:00/40 tag 26 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:d8:3f:00:e4/00:00:0c:00:00/40 tag 27 cdb 0x0

Re: 2.6.20.3 AMD64 oops in CFQ code

2007-03-22 Thread Jens Axboe
On Thu, Mar 22 2007, [EMAIL PROTECTED] wrote:
 This is a uniprocessor AMD64 system running software RAID-5 and RAID-10
 over multiple PCIe SiI3132 SATA controllers.  The hardware has been very
 stable for a long time, but has been acting up of late since I upgraded
 to 2.6.20.3.  ECC memory should preclude the possibility of bit-flip
 errors.
 
 Kernel 2.6.20.3 + linuxpps patches (confined to drivers/serial, and not
 actually in use as I stole the serial port for a console).
 
 It takes half a day to reproduce the problem, so bisecting would be painful.
 
 BackupPC_dump mostly writes to a large (1.7 TB) ext3 RAID5 partition.
 
 
 Here are two oopes, a few minutes (16:31, to be precise) apart.
 Unusually, it oopsed twice *without* locking up the system..  Usually,
 I see this followed by an error from drivers/input/keyboard/atkbd.c:
 printk(KERN_WARNING atkbd.c: Spurious %s on %s. 
Some program might be trying access hardware 
 directly.\n,
 emitted at 1 Hz with the keyboard LEDs flashing and the system
 unresponsive to keyboard or pings.
 (I think it was spurious ACK on serio/input0, but my memory may be faulty.)
 
 
 If anyone has any suggestions, they'd be gratefully received.
 
 
 Unable to handle kernel NULL pointer dereference at 0098 RIP: 
  [8031504a] cfq_dispatch_insert+0x18/0x68
 PGD 777e9067 PUD 78774067 PMD 0 
 Oops:  [1] 
 CPU 0 
 Modules linked in: ecb
 Pid: 2837, comm: BackupPC_dump Not tainted 2.6.20.3-g691f5333 #40
 RIP: 0010:[8031504a]  [8031504a] 
 cfq_dispatch_insert+0x18/0x68
 RSP: 0018:8100770bbaf8  EFLAGS: 00010092
 RAX: 81007fb36c80 RBX:  RCX: 0001
 RDX: 00010003e4e7 RSI:  RDI: 
 RBP: 81007fb37a00 R08:  R09: 81005d390298
 R10: 81007fcb4f80 R11: 81007fcb4f80 R12: 81007facd280
 R13: 0004 R14: 0001 R15: 
 FS:  2b322d120d30() GS:805de000() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0098 CR3: 7bcf CR4: 06e0
 Process BackupPC_dump (pid: 2837, threadinfo 8100770ba000, task 
 81007fc5d8e0)
 Stack:   8100770f39f0  0004
  0001 80315253 803b2607 81005da2bc40
  81007fac3800 81007facd280 81007facd280 81005d390298
 Call Trace:
  [80315253] cfq_dispatch_requests+0x152/0x512
  [803b2607] scsi_done+0x0/0x18
  [8030d9f1] elv_next_request+0x137/0x147
  [803b7ce0] scsi_request_fn+0x6a/0x33a
  [8024d407] generic_unplug_device+0xa/0xe
  [80407ced] unplug_slaves+0x5b/0x94
  [80223d65] sync_page+0x0/0x40
  [80223d9b] sync_page+0x36/0x40
  [80256d45] __wait_on_bit_lock+0x36/0x65
  [80237496] __lock_page+0x5e/0x64
  [8028061d] wake_bit_function+0x0/0x23
  [802074de] find_get_page+0xe/0x2d
  [8020b38e] do_generic_mapping_read+0x1c2/0x40d
  [8020bd80] file_read_actor+0x0/0x118
  [8021422e] generic_file_aio_read+0x15c/0x19e
  [8020bafa] do_sync_read+0xc9/0x10c
  [80210342] may_open+0x5b/0x1c6
  [802805ef] autoremove_wake_function+0x0/0x2e
  [8020a857] vfs_read+0xaa/0x152
  [8020faf3] sys_read+0x45/0x6e
  [8025041e] system_call+0x7e/0x83

3 (I think) separate instances of this, each involving raid5. Is your
array degraded or fully operational?


-- 
Jens Axboe



Re: [PATCH] libata: add NCQ blacklist entries from Silicon Image Windows driver

2007-02-22 Thread Jens Axboe
On Thu, Feb 22 2007, Jeff Garzik wrote:
 Robert Hancock wrote:
 This patch adds in some NCQ blacklist entries taken from the Silicon
 Image Windows drivers' .inf files for the 3124 and 3132 controllers.
 These entries were marked as DisableSataQueueing. Assume these are
 in their blacklist for a reason and disable NCQ on these drives.
 
 Signed-off-by: Robert Hancock [EMAIL PROTECTED]
 
 --- linux-2.6.21-rc1edit/drivers/ata/libata-core.c.prev	2007-02-21 22:23:05.000000000 -0600
 +++ linux-2.6.21-rc1edit/drivers/ata/libata-core.c	2007-02-21 22:25:44.000000000 -0600
 @@ -3269,6 +3269,13 @@ static const struct ata_blacklist_entry
 	/* Devices with NCQ limits */
 
 +	/* The following blacklist entries are taken from the Windows
 +	   driver .inf files for the Silicon Image 3124 and 3132. */
 +	{ "Maxtor 7B250S0",	"BANC1B70",	ATA_HORKAGE_NONCQ, },
 +	{ "HTS541060G9SA00",	"MB3OC60D",	ATA_HORKAGE_NONCQ, },
 +	{ "HTS541080G9SA00",	"MB4OC60D",	ATA_HORKAGE_NONCQ, },
 +	{ "HTS541010G9SA00",	"MBZOC60D",	ATA_HORKAGE_NONCQ, },
 
 Do we have information that these drives fail on non-SiI controllers?
 
 Sometimes the problem can be related to a single family of controllers.

I don't know about the Hitachi's, but the Maxtor with that firmware is
definitely broken. It _appeared_ to work if the depth was limited to 4,
but I didn't test it long enough to be absolutely certain. So the safest
is indeed to blacklist it.

I'm pretty sure my initial NCQ patches had a blacklist entry for that
drive as well, but apparently the blacklist got lost somewhere along the
way?

-- 
Jens Axboe



Re: [PATCH] libata: add NCQ blacklist entries from Silicon Image Windows driver

2007-02-22 Thread Jens Axboe
On Thu, Feb 22 2007, Mark Lord wrote:
 Jens Axboe wrote:
 Robert Hancock wrote:
 ..
  +	/* The following blacklist entries are taken from the Windows
  +	   driver .inf files for the Silicon Image 3124 and 3132. */
  +	{ "Maxtor 7B250S0",	"BANC1B70",	ATA_HORKAGE_NONCQ, },
  +	{ "HTS541060G9SA00",	"MB3OC60D",	ATA_HORKAGE_NONCQ, },
  +	{ "HTS541080G9SA00",	"MB4OC60D",	ATA_HORKAGE_NONCQ, },
  +	{ "HTS541010G9SA00",	"MBZOC60D",	ATA_HORKAGE_NONCQ, },
 Do we have information that these drives fail on non-SiI controllers?
 
 Sometimes the problem can be related to a single family of controllers.
 
 I don't know about the Hitachi's, but the Maxtor with that firmware is
 definitely broken. It _appeared_ to work if the depth was limited to 4,
 but I didn't test it long enough to be absolutely certain. So the safest
 is indeed to blacklist it.
 
 Yes, broken on the Silicon Image controllers for sure.
 But what type of controller did you observe the failures on, Jens ?

ahci

-- 
Jens Axboe



Re: libata FUA revisited

2007-02-21 Thread Jens Axboe
On Wed, Feb 21 2007, Tejun Heo wrote:
 [cc'ing Ric, Hannes and Dongjun, Hello.  Feel free to drag other people in.]
 
 Robert Hancock wrote:
  Jens Axboe wrote:
  But we can't really change that, since you need the cache flushed before
  issuing the FUA write. I've been advocating for an ordered bit for
  years, so that we could just do:
 
  3. w/FUA+ORDERED
 
  normal operation -&gt; barrier issued -&gt; write barrier FUA+ORDERED
   -&gt; normal operation resumes
 
  So we don't have to serialize everything both at the block and device
  level. I would have made FUA imply this already, but apparently it's not
  what MS wanted FUA for, so... The current implementations take the FUA
  bit (or WRITE FUA) as a hint to boost it to head of queue, so you are
  almost certainly going to jump ahead of already queued writes. Which we
  of course really do not.
 
 Yeah, I think if we have tagged write command and flush tagged (or
 barrier tagged) things can be pretty efficient.  Again, I'm much more
 comfortable with separate opcodes for those rather than bits changing
 the behavior.

ORDERED+FUA NCQ would still be preferable to an NCQ enabled flush
command, though.

 Another idea Dongjun talked about while drinking in LSF was ranged
 flush.  Not as flexible/efficient as the previous option but much less
 intrusive and should help quite a bit, I think.

But that requires extensive tracking, I'm not so sure the implementation
of that for barriers would be very clean. It'd probably be good for
fsync, though.

-- 
Jens Axboe



Re: libata FUA revisited

2007-02-21 Thread Jens Axboe
On Mon, Feb 19 2007, Robert Hancock wrote:
 Jens Axboe wrote:
 But we can't really change that, since you need the cache flushed before
 issuing the FUA write. I've been advocating for an ordered bit for
 years, so that we could just do:
 
 3. w/FUA+ORDERED
 
 normal operation -&gt; barrier issued -&gt; write barrier FUA+ORDERED
  -&gt; normal operation resumes
 
 So we don't have to serialize everything both at the block and device
 level. I would have made FUA imply this already, but apparently it's not
 what MS wanted FUA for, so... The current implementations take the FUA
 bit (or WRITE FUA) as a hint to boost it to head of queue, so you are
 almost certainly going to jump ahead of already queued writes. Which we
 of course really do not.
 
 I think that FUA was designed for a different use case than what Linux 
 is using barriers for currently. The advantage with FUA is when you have 

[snip]

Yes that's pretty obvious, my point is just that FUA+ORDERED would be a
nice thing to have for us.

 I'm not too nervous about the FUA write commands, I hope we can safely
 assume that if you set the FUA supported bit in the id AND the write fua
 command doesn't get aborted, that FUA must work. Anything else would
 just be an immensely stupid implementation. NCQ+FUA is more tricky, I
 agree that it being just a command bit does make it more likely that it
 could be ignored. And that is indeed a danger. Given state of NCQ in
 early firmware drives, I would not at all be surprised if the drive
 vendors screwed that up too.
 
 But, since we don't have the ordered bit for NCQ/FUA anyway, we do need
 to drain the drive queue before issuing the WRITE/FUA. And at that point
 we may as well not use the NCQ command, just go for the regular non-NCQ
 FUA write. I think that should be safe.
 
 Aside from the issue above, as I mentioned elsewhere, lots of NCQ drives 
 don't support non-NCQ FUA writes..

Lots meaning how many? All the ones I have here support FUA.

-- 
Jens Axboe



Re: libata FUA revisited

2007-02-21 Thread Jens Axboe
On Wed, Feb 21 2007, Tejun Heo wrote:
 Jens Axboe wrote:
  On Wed, Feb 21 2007, Tejun Heo wrote:
  [cc'ing Ric, Hannes and Dongjun, Hello.  Feel free to drag other people 
  in.]
 
  Robert Hancock wrote:
  Jens Axboe wrote:
  But we can't really change that, since you need the cache flushed before
  issuing the FUA write. I've been advocating for an ordered bit for
  years, so that we could just do:
 
  3. w/FUA+ORDERED
 
   normal operation -&gt; barrier issued -&gt; write barrier FUA+ORDERED
    -&gt; normal operation resumes
 
  So we don't have to serialize everything both at the block and device
  level. I would have made FUA imply this already, but apparently it's not
  what MS wanted FUA for, so... The current implementations take the FUA
  bit (or WRITE FUA) as a hint to boost it to head of queue, so you are
  almost certainly going to jump ahead of already queued writes. Which we
  of course really do not.
  Yeah, I think if we have tagged write command and flush tagged (or
  barrier tagged) things can be pretty efficient.  Again, I'm much more
  comfortable with separate opcodes for those rather than bits changing
  the behavior.
  
  ORDERED+FUA NCQ would still be preferable to an NCQ enabled flush
  command, though.
 
 I think we're talking about two different things here.
 
 1. The barrier write (FUA write) combined with flush.  I think it would
 help improving the performance but I think issuing two commands
 shouldn't be too slower than issuing one combined command unless it
 causes extra physical activity (moving head, etc...).

The command overhead is dwarfed by other factors, agree.

 2. FLUSH currently flushes all writes.  If we can mark certain commands
 requiring ordering, we can selectively flush or order necessary writes.
  (No need to flush 16M buffer all over the disk when only journal needs
 barriering)

Sure, anything is better than the sledge hammer flush. But my claim is
that an ORDERED+FUA enabled write for critical data would be a good
approach, and simple in software.

  Another idea Dongjun talked about while drinking in LSF was ranged
  flush.  Not as flexible/efficient as the previous option but much less
  intrusive and should help quite a bit, I think.
  
  But that requires extensive tracking, I'm not so sure the implementation
  of that for barriers would be very clean. It'd probably be good for
  fsync, though.
 
 I was mostly thinking about journal area.  Using it for other purposes
 would incur a lot of complexity.  :-(

Yep if it's just for the journal, the range is known and fixed, so the
flush range would work nicely there.

-- 
Jens Axboe



Re: libata FUA revisited

2007-02-15 Thread Jens Axboe
On Tue, Feb 13 2007, Tejun Heo wrote:
 So, actually, I was thinking about *always* using the non-NCQ FUA 
 opcode.  As currently implemented, FUA request is always issued by 
 itself, so NCQ doesn't make any difference there.  So, I think it 
 would be better to turn on FUA on driver-by-driver basis whether the 
 controller supports NCQ or not.
 
 Unfortunately not all drives that support NCQ support the non-NCQ FUA 
 commands (my Seagates are like this).
 
 And I'm a bit scared to set FUA bit on such drives and trust that it 
 will actually do FUA, so our opinions aren't too far away from each 
 other.  :-)
 
 There's definitely a potential advantage to FUA with NCQ - if you have 
 non-synchronous accesses going on concurrently with synchronous ones, if 
 you have to use non-NCQ FUA or flush cache commands, you have to wait 
 for all the IOs of both types to drain out before you can issue the 
 flush (since those can't be overlapped with the NCQ read/writes). And if 
 you can only use flush cache, then you're forcing all the writes to be 
 flushed including the non-synchronous ones you didn't care about. 
 Whether or not the block layer currently exploits this I don't know, but 
 it definitely could.
 
 The current barrier implementation uses the following sequences for 
 no-FUA and FUA cases.
 
 1. w/o FUA
 
 normal operation -&gt; barrier issued -&gt; drain IO -&gt; flush -&gt; barrier
 written -&gt; flush -&gt; normal operation resumes
 
 2. w/ FUA
 
 normal operation -&gt; barrier issued -&gt; drain IO -&gt; flush -&gt; barrier
 written / FUA -&gt; normal operation resumes
 
 So, the FUA write is issued by itself.  This isn't really efficient and 
 frequent barriers impact the performance badly.  If we can change that 
 NCQ FUA will be certainly beneficial.

But we can't really change that, since you need the cache flushed before
issuing the FUA write. I've been advocating for an ordered bit for
years, so that we could just do:

3. w/FUA+ORDERED

normal operation -&gt; barrier issued -&gt; write barrier FUA+ORDERED
 -&gt; normal operation resumes

So we don't have to serialize everything both at the block and device
level. I would have made FUA imply this already, but apparently it's not
what MS wanted FUA for, so... The current implementations take the FUA
bit (or WRITE FUA) as a hint to boost it to head of queue, so you are
almost certainly going to jump ahead of already queued writes. Which we
of course really do not want.

 Well, I might be being too paranoid but silent FUA failure would be 
 really hard to diagnose if that ever happens (and I'm fairly certain 
 that it will on some firmwares).
 
 Well, there are also probably drives that ignore flush cache commands or 
  fail to do other things that they should. There's only so far we can go 
 in coping if the firmware authors are being retarded. If any drive is 
 broken like that we should likely just blacklist NCQ on it entirely as 
 obviously little thought or testing went into the implementation..
 
 FLUSH has been around quite long time now and most drives don't have 
 problem with that.  FUA on ATA is still quite new and libata will be the 
 first major user of it if we enable it by default.  It just seems too 
 easy to ignore that bit and successfully complete the write - there 
 isn't any safety net as opposed to using a separate opcode.  So, I'm a 
 bit nervous.

I'm not too nervous about the FUA write commands, I hope we can safely
assume that if you set the FUA supported bit in the id AND the write fua
command doesn't get aborted, that FUA must work. Anything else would
just be an immensely stupid implementation. NCQ+FUA is more tricky, I
agree that it being just a command bit does make it more likely that it
could be ignored. And that is indeed a danger. Given state of NCQ in
early firmware drives, I would not at all be surprised if the drive
vendors screwed that up too.

But, since we don't have the ordered bit for NCQ/FUA anyway, we do need
to drain the drive queue before issuing the WRITE/FUA. And at that point
we may as well not use the NCQ command, just go for the regular non-NCQ
FUA write. I think that should be safe.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ahci problems with sata disk.

2007-01-17 Thread Jens Axboe
On Tue, Jan 16 2007, Jeff Garzik wrote:
 Mark Hahn wrote:
 I thought that NCQ was intended to increase performance ??
 
 intended to increase _sales_ performance ;)
 
 Yep.
 
 
 remember that you've always had command queueing (kernel elevator): the 
 main difference with NCQ (or SCSI tagged queueing) is when
 the disk can out-schedule the kernel.  afaict, this means squeezing
 in a rotationally intermediate request along the way.
 
 that intermediate request must be fairly small and should be a read
 (for head-settling reasons).
 
 I wonder how often this happens in the real world, given the relatively
 small queues the disk has to work with.
 
 ISTR either Jens or Andrew ran some numbers, and found that there was 
 little utility beyond 4 or 8 tags or so.

It entirely depends on the access pattern. For truly random reads,
performance does seem to continue to scale up with increasing drive
queue depths. It may only be a benchmark figure though, as truly random
read workloads probably aren't that common :-)

For anything else, going beyond 4 tags doesn't improve much.

 My hdparm test is a sequential read-ahead test, so it will
 naturally perform worse on a Raptor when NCQ is on.
 
 that's a surprisingly naive heuristic, especially since NCQ is concerned 
 with just a max of ~4MB of reads, only a smallish
 fraction of the available cache.
 
 NCQ mainly helps with multiple threads doing reads.  Writes are largely 
 asynchronous to the user already (except for fsync-style writes).  You 
 want to be able to stuff the disk's internal elevator with as many read 
 requests as possible, because reads are very often synchronous -- most 
 apps (1) read a block, (2) do something, (3) goto step #1.  The kernel's 
 elevator isn't much use in these cases.

Au contraire, this is one of the cases where intelligent IO scheduling
in the kernel makes a ton of difference. It's the primary reason that AS
and CFQ are able to maintain > 90% of disk bandwidth for more than one
process, idling the drive for the duration of step 2 in the sequence
above (step 2 is typically really small, time wise). If the next block
read is close to the first one, that is. If you do that, you will
greatly outperform the same workload pushed to the drive scheduling.
I've done considerable benchmarks on this. Only if the processes are
doing random IO should the IO scheduler punt and push everything to the
drive queue.

-- 
Jens Axboe



Re: ahci problems with sata disk.

2007-01-17 Thread Jens Axboe
On Tue, Jan 16 2007, Eric D. Mudama wrote:

[snip lots of stuff I agree completely with]

 If done properly, queueing should never hurt performance.  High queue
 depths will increase average latency of course, but shouldn't hurt
 overall performance.

It may never hurt performance, but there are common scenarios where you
are much better off not doing queuing even if you could. A good example
of that is a media serving service, where you end up reading a bunch of
files sequentially. It's faster to read chunks of each file sequentially
at depth 1 and move on, than to queue a request from each of them and
send them to the drive. On my laptop with an NCQ enabled drive, the
mentioned approach outperforms queuing by more than 100%.

 NCQ mainly helps with multiple threads doing reads.  Writes are
 largely asynchronous to the user already (except for fsync-style
 writes).  You want to be able to stuff the disk's internal elevator
 with as many read requests as possible, because reads are very often
 synchronous -- most apps (1) read a block, (2) do something, (3) goto
 step #1.  The kernel's elevator isn't much use in these cases.
 
 True.  And internal to the drive, normal elevator is meh.  There are
 other algorithms for scheduling that perform better.

Well Linux doesn't default to using a normal elevator, so it's a moot
point.

-- 
Jens Axboe



Re: 2.6.20-rc4-mm1

2007-01-15 Thread Jens Axboe
On Mon, Jan 15 2007, Ingo Molnar wrote:
 
 * Jens Axboe [EMAIL PROTECTED] wrote:
 
   In a previous write invoked by: fsck.ext3(1896): WRITE block 8552 on 
   sdb1 end_buffer_async_write() is invoked.
   
   sdb1 is not a part of a raid device.
  
  When I briefly tested this before I left (and found it broken), doing 
  a cat /proc/mdstat got things going again. Hard if that's your rootfs, 
  it's just a hint :-)
 
 hm, so you knew it's broken, still you let Andrew pick it up, or am i 
 misunderstanding something?

Well the raid issue wasn't known before it was in -mm.

-- 
Jens Axboe



Re: 2.6.20-rc4-mm1

2007-01-14 Thread Jens Axboe
On Sun, Jan 14 2007, Thomas Gleixner wrote:
 On Mon, 2007-01-15 at 09:05 +1100, Jens Axboe wrote:
  raid seems to have severe problems with the plugging change. I'll try
  and find Neil and have a chat with him, hopefully we can work it out.
 
 Some hints:
 
 mount(1899): WRITE block 16424 on md3
 call md_write_start
 md3_raid1(438): WRITE block 40965504 on sdb6
 md3_raid1(438): WRITE block 40965504 on sda6
 First Write sector 16424 disks 2
 
 Stuck.
 
 Note, that neither end_buffer_async_write() nor
 raid1_end_write_request() are invoked, 
 
 In a previous write invoked by:
 fsck.ext3(1896): WRITE block 8552 on sdb1
 end_buffer_async_write() is invoked.
 
 sdb1 is not a part of a raid device.

When I briefly tested this before I left (and found it broken), doing a
cat /proc/mdstat got things going again. Hard if that's your rootfs,
it's just a hint :-)

 Hope that helps,

I can reproduce, so that's not a problem. I can't do much about it until
I'm back next week, but Neil might be able to help. We shall see. Thanks
for testing.

-- 
Jens Axboe



Re: [PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices

2007-01-04 Thread Jens Axboe
On Wed, Jan 03 2007, James Bottomley wrote:
 Er, well, as you know, I've never been a fan of this static list.  I
 thought Jens was going to put us all out of our misery by making the
 list settable per device by root and thus shovel the problem off onto
 the distros?

The code is there, just haven't had the guts to merge it yet.

http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=bf5f922d167a5c5cf57132bbcaa1e0ddfd5c45f7

-- 
Jens Axboe



Re: [PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices

2007-01-04 Thread Jens Axboe
On Thu, Jan 04 2007, Mark Lord wrote:
 Jens Axboe wrote:
 On Wed, Jan 03 2007, James Bottomley wrote:
 Er, well, as you know, I've never been a fan of this static list.  I
 thought Jens was going to put us all out of our misery by making the
 list settable per device by root and thus shovel the problem off onto
 the distros?
 
 The code is there, just haven't had the guts to merge it yet.
 
 http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=bf5f922d167a5c5cf57132bbcaa1e0ddfd5c45f7
 
 
 That's nice, but doesn't help with the case of trying to do ATA passthru
 to ATAPI devices, the subject of the original patch here.

James brought up the filtering issue, my reply pertains to that alone.
Hence the snipping of unrelated contents in the original mail :-)

-- 
Jens Axboe



Re: AHCI and add_disk_randomness()

2006-11-16 Thread Jens Axboe
On Wed, Nov 15 2006, Marek Podmaka wrote:
 Hello,
 
   I have server with Intel 5000V motherboard with integrated AHCI SATA
   controller. It works well with kernel 2.6.18.2. But I have problem
   with little entropy available and I'm not sure if one of the reasons
   is that AHCI driver does not use add_disk_randomness() to contribute
   to the kernel entropy pool.
 
   I'm not very skilled on kernel hacking... I tried finding where
   this is called (it's defined in drivers/char/random.c and used for
   example in Compaq SmartArray driver in drivers/block/cciss.c). For
   the SCSI part, I found it in scsi_lib.c, but I was not able to
   determine if this is actually used by ahci/libata drivers.
 
   If not, would it be possible to implement it? I tried to figure out
   where to call it by looking at cciss.c, but it seems that this is
   totally different case, at least for me. I don't know where to add
   it, because its parameter is struct gendisk *disk and didn't find it
   used anywhere in ahci or libata.

Since ahci attaches its devices through the scsi layer,
add_disk_randomness() will get called from scsi_end_request() like for
any other scsi controller.

-- 
Jens Axboe



Re: [patch 0/4] ide: Break ide_lock to per-hwgroup lock

2005-09-07 Thread Jens Axboe
On Tue, Sep 06 2005, Ravikiran G Thirumalai wrote:
 The following patchset breaks down the global ide_lock to per-hwgroup lock.
 We have taken the following approach.

Curious, what is the point of this?

-- 
Jens Axboe



Re: [patch 0/4] ide: Break ide_lock to per-hwgroup lock

2005-09-07 Thread Jens Axboe
On Wed, Sep 07 2005, Ravikiran G Thirumalai wrote:
 On Wed, Sep 07, 2005 at 11:19:24AM +0200, Jens Axboe wrote:
  On Tue, Sep 06 2005, Ravikiran G Thirumalai wrote:
   The following patchset breaks down the global ide_lock to per-hwgroup 
   lock.
   We have taken the following approach.
  
  Curious, what is the point of this?
  
 
 On smp machines with multiple ide interfaces, we take a per-group lock instead
 of a global lock, thereby breaking the lock into per-irq hwgroups.

I realize the theory behind breaking up locks, I'm just wondering about
this specific case. Please show actual contention data promoting this
specific case, we don't break up locks just because.

I'm asking because I've never heard anyone complain about IDE lock
contention and a proper patch usually comes with analysis of why it is
needed.

-- 
Jens Axboe



Re: libata: clustering on or off?

2005-08-28 Thread Jens Axboe
On Sun, Aug 28 2005, Arjan van de Ven wrote:
 On Sun, 2005-08-28 at 05:42 -0400, Jeff Garzik wrote:
  The constant ATA_SHT_USE_CLUSTERING in include/linux/libata.h controls
  the use of SCSI layer's use_clustering feature, for a great many libata
  drivers.
  
  The current setup has clustering disabled, which in theory causes the
  block layer to do less work, at the expense of a greater number of
  scatter/gather table entries used.
  
  Any opinions WRT turning on clustering for libata?
 
 in 2.4 clustering was expensive due to a large number of checks that
 were done (basically the number of fragments got recounted a gazilion
 times). In 2.6 Jens fixed that afaik to make it basically free...
 at which point it's a win always.

Yeah, it won't cost any extra cycles, so there's no point in keeping it
turned off for that reason.

 Imo clustering on the driver level should announce driver capabilities.
 If clustering for some arch/kernel makes it slower, that should be
 decided at a midlayer level and not in each driver; eg the midlayer
 would chose to ignore the drivers capabilities.
 So .. my opinion would be that libata should announce the capability (it
 seems the code/hw can do it). 

Agree, we should just remove the ability to control clustering, as it
really overlaps with the segment settings anyways.

-- 
Jens Axboe



Re: libata: clustering on or off?

2005-08-28 Thread Jens Axboe
On Sun, Aug 28 2005, Christoph Hellwig wrote:
 On Sun, Aug 28, 2005 at 04:20:19PM +0200, Jens Axboe wrote:
  Agree, we should just remove the ability to control clustering, as it
  really overlaps with the segment settings anyways.
 
 What are we going to do with iscsi then?  It really doesn't like segments
 over a pages size.  Best thing would probably be to switch networking to
 use sg lists and dma_map_sg, but that's not a trivial task.

Limit the segment size, then. There are provisions for doing both length
and boundary limits, that should suffice.

-- 
Jens Axboe



Re: [PATCH] libata: error processing + rw 6 byte fix

2005-08-27 Thread Jens Axboe
On Sat, Aug 27 2005, Douglas Gilbert wrote:
 Jeff Garzik wrote:
  Does the attached look OK to everybody?
  
 
 Jeff,
 Yes.
 
 And after an exchange with Jens, it would probably be
 safer to map transfer_length=0 for READ_10 and READ_16
 (as well as WRITE_10 and WRITE_16) to a nop on the ATA
 side. Otherwise an ATA disk may attempt to transfer 2**16
 sectors. [Only READ_6 and WRITE_6 have the quirky "0 means
 256" definition; in the larger commands 0 means 0.]

Yes, that part needs to be added as well. Jeff, the below looks good for
the READ/WRITE_6 case.

  if (scsicmd[0] == READ_6 || scsicmd[0] == WRITE_6) {
  qc->nsect = tf->nsect = scsicmd[4];
  +   if (!qc->nsect) {
  +   qc->nsect = 256;
  +   if (lba48)
  +   tf->hob_nsect = 1;
  +   }
  +
  tf->lbal = scsicmd[3];
  tf->lbam = scsicmd[2];
  tf->lbah = scsicmd[1] & 0x1f; /* mask out reserved bits */
 
 

-- 
Jens Axboe



Re: [PATCH] libata: error processing + rw 6 byte fix

2005-08-27 Thread Jens Axboe
On Sat, Aug 27 2005, Jeff Garzik wrote:
 Here is the patch I just checked in.

Looks perfect.

-- 
Jens Axboe



Re: [Hdaps-devel] Re: HDAPS, Need to park the head for real

2005-08-26 Thread Jens Axboe
On Thu, Aug 25 2005, Jon Escombe wrote:
 Alan Cox wrote:
 @@ -1661,6 +1671,9 @@
 where = ELEVATOR_INSERT_FRONT;
 rq->flags |= REQ_PREEMPT;
 }
 +   if (action == ide_next)
 +   where = ELEVATOR_INSERT_FRONT;
 +
 __elv_add_request(drive->queue, rq, where, 0);
 ide_do_request(hwgroup, IDE_NO_IRQ);
 spin_unlock_irqrestore(&ide_lock, flags);
 
 Also puzzles me- why is this needed ?
 
 I wanted the park command to get in at the head of the queue (behind the 
 currently executing request).
 
 Contrary to the comments for ide_do_drive_cmd(), ide_next didn't appear 
 to do anything to achieve this? At least from my initial testing before 
 I made this change - it could take a second or so for the park command 
 to be issued if the disk was busy

That part seems to have been lost, apparently. The above patch is
correct.

-- 
Jens Axboe



Re: PATCH: ide: ide-disk freeze support for hdaps

2005-08-26 Thread Jens Axboe
On Fri, Aug 26 2005, Yani Ioannou wrote:
  Please make the interface accept number of seconds (as suggested by Jens)
  and remove this module parameter. This way interface will be more flexible
  and cleaner.  I really don't see any advantage in doing "echo 1 > ..." instead
  of "echo x > ..." (Pavel, please explain).
 
 Either way is pretty easy enough to implement. Note though that I'd
 expect the userspace app should thaw the device when danger is out of
 the way (the timeout is mainly there to ensure that the queue isn't
 frozen forever, and should probably be higher). Personally I don't
 have too much of an opinion either way though... what's the consensus?
 :).

Yes please, I don't understand why you would want a 0/1 interface
instead, when the timer-seconds method gives you the exact same ability
plus a way to control when to unfreeze...

  +static struct timer_list freeze_timer =
  +   TIMER_INITIALIZER(freeze_expire, 0, 0);
  
  There needs to be a timer per device not a global one
  (it works for a current special case of T42 but sooner
   or later we will hit this problem).
 
 I was considering that, but I am confused as to whether each drive has
 it's own queue or not? (I really am a newbie to this stuff...). If so
 then yes there should be a per-device timer.

Each drive has its own queue.

  queue handling should be done through block layer helpers
  (as described in Jens' email) - we will need them for libata too.
 
 Good point, I'll try to move as much as I can up to the block layer,
 it helps when it comes to implementing freeze for libata as you point
 out too.

That includes things like the timer as well, reuse the queue plugging
timer as I described in my initial posting on how to implement this.

  At this time attribute can still be in use (because refcounting is done
  on drive-gendev), you need to add disk class to ide-disk driver
  (drivers/scsi/st.c looks like a good example how to do it).
 
 I missed that completely, I'll look at changing it.
 
  IMO this should also be handled by block layer
  which has all needed information, Jens?
  
  While at it: I think that sysfs support should be moved to block layer 
  (queue
  attributes) and storage driver should only need to provide queue_freeze_fn
  and queue_thaw_fn functions (similarly to cache flush support).  This should
  be done now not later because this stuff is exposed to the user-space.
 
 I was actually considering using a queue attribute originally, but in
 my indecision I decided to go with Jen's suggestion. A queue attribute
 does make sense in that the attribute primarily is there to freeze the
 queue, but it would also be performing the head park. Would a queue
 attribute be confusing because of that?

I fully agree with Bart here. The only stuff that should be ide special
is the actual command setup and completion check, since that is a
hardware property. libata will get a few little helpers for that as
well. The rest should be block layer implementation.

   * Sanity: don't accept a request that isn't a PM request
   * if we are currently power managed. This is very 
  important as
   * blk_stop_queue() doesn't prevent the elv_next_request()
  @@ -1661,6 +1671,9 @@ int ide_do_drive_cmd (ide_drive_t *drive
  where = ELEVATOR_INSERT_FRONT;
  rq->flags |= REQ_PREEMPT;
  }
  +   if (action == ide_next)
  +   where = ELEVATOR_INSERT_FRONT;
  +
  __elv_add_request(drive->queue, rq, where, 0);
  ide_do_request(hwgroup, IDE_NO_IRQ);
  spin_unlock_irqrestore(&ide_lock, flags);
  
  Why is this needed?
 
 I think Jon discussed that in a previous thread, but basically
 although ide_next is documented in the comment for ide_do_drive_cmd,
 there isn't (as far as Jon or I could see) anything actually handling
 it. This patch is carried over from Jon's work and adds the code to
 handle ide_next by inserting the request at the front of the queue.

As per my previous mail, I will ack that bit.

  Overall, very promising work!
 
 Thanks :-), most of it is Jon's work, and Jen's suggestions though.

My name is Jens, not Jen :-)

-- 
Jens Axboe



Re: [PATCH] libata: error processing + rw 6 byte fix

2005-08-23 Thread Jens Axboe
On Tue, Aug 23 2005, Douglas Gilbert wrote:
 Jens Axboe wrote:
  On Mon, Aug 22 2005, Douglas Gilbert wrote:
  
 if (scsicmd[0] == READ_6 || scsicmd[0] == WRITE_6) {
 -   qc->nsect = tf->nsect = scsicmd[4];
 +   if (scsicmd[4] == 0) {
 +   /*
 +* For READ_6 and WRITE_6 (only)
 +* transfer_len==0 -> 256 blocks !!
 +*/
 +   if (lba48) {
 +   tf->hob_nsect = 1;
 +   qc->nsect = 256;
 +   } else
 +   return 1;
  
  
  This isn't quite right, for 28-bit lba a 0 sector value means 256
  sectors to transfer as well. So just make that:
  
  if (lba48) {
  tf->hob_nsect = 1;
  qc->nsect = 256;
  }
  
  /* continue */
  
  and it should work fine. Similarly for 48-bit lba, 0 means 2^16 (65536) sectors.
 
 Jens,
 Since for 28-bit lba a 0 sector value means 256 sectors
 do I need to check for the lba48 case at all? As proposed
 to Jeff is this ok (for READ_6 and WRITE_6):
 
if (scsicmd[4] == 0) {
/*
 * For READ_6 and WRITE_6 (only)
 * transfer_len==0 -> 256 blocks !!
 */
qc->nsect = 256;
} else
qc->nsect = scsicmd[4];
tf->nsect = scsicmd[4];

This will break for lba48 devices, since if you have scsicmd[4] == 0, a
lba48 read/write will want to transfer 65536 sectors instead of the
intended 256.

Your qc->nsect logic is correct, but you need to set tf->hob_nsect = 1
for lba48 if scsicmd[4] == 0 to correctly tell that command to transfer
256 sectors.

 Also I noticed while testing the original code with READ_6
 (sectors=0) that the device locked up (power cycle required).
 So given the point you make for 48-bit lba, 0 means 16^2
 sectors, then the READ_10 (sectors=0) and READ_16 (sectors=0)
 which are valid nops according to SBC-2 may also lock up
 in libata.

Try with the corrected sector counts, should work. I didn't check the
other READ_X/WRITE_X, so you should probably audit them as well :)

-- 
Jens Axboe



Re: ahci, SActive flag, and the HD activity LED

2005-08-03 Thread Jens Axboe
On Wed, Aug 03 2005, Martin Wilck wrote:
 Jens Axboe wrote:
 
 If I am reading the specs correctly, that'd mean the ahci driver is 
 wrong in setting the SActive bit.
 
 I completely agree, that was my reading of the spec as well and hence my
 original posts about this in the NCQ thread.
 
 Have you (or has anybody else) also seen the wrong behavior of the 
 activity LED?

No, but I have observed that SActive never gets cleared by the device
for non-NCQ commands (which is probably which gets you the stuck LED on
some systems?), which to me is another indication that we should not be
setting the tag bits for those commands.

-- 
Jens Axboe



Re: Crash in ide_do_request() on card removal

2005-08-02 Thread Jens Axboe
On Tue, Aug 02 2005, Steven Scholz wrote:
 Jens Axboe wrote:
 
 That's not quite true, q is not invalid after this call. It will only be
 invalid when it is freed (which doesn't happen from here but rather from
 the blk_cleanup_queue() call when the reference count drops to 0).
 
 This is still not perfect, but a lot better. Does it work for you?
 
 --- linux-2.6.12/drivers/ide/ide-disk.c~ 2005-08-02 
 12:48:16.0 +0200
 +++ linux-2.6.12/drivers/ide/ide-disk.c  2005-08-02 
 12:48:32.0 +0200
 @@ -1054,6 +1054,7 @@
   drive->driver_data = NULL;
   drive->devfs_name[0] = '\0';
   g->private_data = NULL;
  +g->disk = NULL;
  put_disk(g);
  kfree(idkp);
  }
 
 No.
 drivers/ide/ide-disk.c: In function `ide_disk_release':
 drivers/ide/ide-disk.c:1057: error: structure has no member named `disk'

Eh, typo, should be g->queue of course :-)

--- linux-2.6.12/drivers/ide/ide-disk.c~2005-08-02 12:48:16.0 
+0200
+++ linux-2.6.12/drivers/ide/ide-disk.c 2005-08-02 13:12:54.0 +0200
@@ -1054,6 +1054,7 @@
drive->driver_data = NULL;
drive->devfs_name[0] = '\0';
g->private_data = NULL;
+   g->queue = NULL;
put_disk(g);
kfree(idkp);
 }

-- 
Jens Axboe



Re: Crash in ide_do_request() on card removal

2005-08-02 Thread Jens Axboe
On Tue, Aug 02 2005, Steven Scholz wrote:
 Jens Axboe wrote:
 
 On Tue, Aug 02 2005, Steven Scholz wrote:
 
 Jens Axboe wrote:
 
 
 That's not quite true, q is not invalid after this call. It will only be
 invalid when it is freed (which doesn't happen from here but rather from
 the blk_cleanup_queue() call when the reference count drops to 0).
 
 This is still not perfect, but a lot better. Does it work for you?
 
 --- linux-2.6.12/drivers/ide/ide-disk.c~   2005-08-02 
 12:48:16.0 +0200
 +++ linux-2.6.12/drivers/ide/ide-disk.c2005-08-02 
 12:48:32.0 +0200
 @@ -1054,6 +1054,7 @@
 drive->driver_data = NULL;
 drive->devfs_name[0] = '\0';
 g->private_data = NULL;
  +  g->disk = NULL;
put_disk(g);
kfree(idkp);
 }
 
 No.
 drivers/ide/ide-disk.c: In function `ide_disk_release':
 drivers/ide/ide-disk.c:1057: error: structure has no member named `disk'
 
 
  Eh, typo, should be g->queue of course :-)
 
 --- linux-2.6.12/drivers/ide/ide-disk.c~ 2005-08-02 
 12:48:16.0 +0200
 +++ linux-2.6.12/drivers/ide/ide-disk.c  2005-08-02 
 13:12:54.0 +0200
 @@ -1054,6 +1054,7 @@
   drive->driver_data = NULL;
   drive->devfs_name[0] = '\0';
   g->private_data = NULL;
  +g->queue = NULL;
  put_disk(g);
  kfree(idkp);
  }
 
 No. That does not work:
 
 ~ # umount /mnt/pcmcia/
 generic_make_request(2859) q=c02d3040
  __generic_unplug_device(1447) calling q->request_fn() @ c00f97ec
 
 do_ide_request(1281) HWIF=c01dee8c (0), HWGROUP=c089cea0 (1038681856), 
 drive=c01def1c (0, 0), queue=c02d3040 ()
 do_ide_request(1287) HWIF is not present anymore!!!
 do_ide_request(1291) DRIVE is not present anymore. SKIPPING REQUEST!!!
 
 As you can see generic_make_request() still has the pointer to that queue!
 It gets it with
 
    q = bdev_get_queue(bio->bi_bdev);
 
  So the pointer is still stored somewhere else...

Hmmm, perhaps just letting ide end requests when the drive has been
removed might be better. The disconnection between the queue cleanup and
the gendisk cleanup makes it harder to do it properly. SCSI deals with
it the same way, basically.

-- 
Jens Axboe



Re: Crash in ide_do_request() on card removal

2005-08-02 Thread Jens Axboe
On Tue, Aug 02 2005, Steven Scholz wrote:
 Jens Axboe wrote:
 
 On Tue, Aug 02 2005, Steven Scholz wrote:
 
 Jens Axboe wrote:
 
 
 On Tue, Aug 02 2005, Steven Scholz wrote:
 
 
 Jens Axboe wrote:
 
 
 
 That's not quite true, q is not invalid after this call. It will only 
 be
 invalid when it is freed (which doesn't happen from here but rather 
 from
 the blk_cleanup_queue() call when the reference count drops to 0).
 
 This is still not perfect, but a lot better. Does it work for you?
 
 --- linux-2.6.12/drivers/ide/ide-disk.c~ 2005-08-02 
 12:48:16.0 +0200
 +++ linux-2.6.12/drivers/ide/ide-disk.c  2005-08-02 
 12:48:32.0 +0200
 @@ -1054,6 +1054,7 @@
   drive->driver_data = NULL;
   drive->devfs_name[0] = '\0';
   g->private_data = NULL;
  +g->disk = NULL;
  put_disk(g);
  kfree(idkp);
 }
 
 No.
 drivers/ide/ide-disk.c: In function `ide_disk_release':
 drivers/ide/ide-disk.c:1057: error: structure has no member named `disk'
 
 
  Eh, typo, should be g->queue of course :-)
 
 --- linux-2.6.12/drivers/ide/ide-disk.c~   2005-08-02 
 12:48:16.0 +0200
 +++ linux-2.6.12/drivers/ide/ide-disk.c2005-08-02 
 13:12:54.0 +0200
 @@ -1054,6 +1054,7 @@
 drive->driver_data = NULL;
 drive->devfs_name[0] = '\0';
 g->private_data = NULL;
  +  g->queue = NULL;
put_disk(g);
kfree(idkp);
 }
 
 No. That does not work:
 
 ~ # umount /mnt/pcmcia/
 generic_make_request(2859) q=c02d3040
  __generic_unplug_device(1447) calling q->request_fn() @ c00f97ec
 
 do_ide_request(1281) HWIF=c01dee8c (0), HWGROUP=c089cea0 (1038681856), 
 drive=c01def1c (0, 0), queue=c02d3040 ()
 do_ide_request(1287) HWIF is not present anymore!!!
 do_ide_request(1291) DRIVE is not present anymore. SKIPPING REQUEST!!!
 
 As you can see generic_make_request() still has the pointer to that queue!
 It gets it with
 
  q = bdev_get_queue(bio->bi_bdev);
 
  So the pointer is still stored somewhere else...
 
 
 Hmmm, perhaps just let ide end requests where the drive has been
 removed might be better. 
 
 I don't understand what you mean.
 
 If requests are issued (e.g calling umount) after the drive is gone, then I 
 get either a kernel crash or umount hangs cause it waits in 
 __wait_on_buffer() ...

No, those waiters will be woken up when ide does an end_request for
requests coming in for a device which no longer exists.

-- 
Jens Axboe



Re: Crash in ide_do_request() on card removal

2005-08-02 Thread Jens Axboe
On Tue, Aug 02 2005, Steven Scholz wrote:
 Jens Axboe wrote:
 
 On Tue, Aug 02 2005, Steven Scholz wrote:
 
 Jens Axboe wrote:
 
 
 On Tue, Aug 02 2005, Steven Scholz wrote:
 
 
 Jens Axboe wrote:
 
 
 
 On Tue, Aug 02 2005, Steven Scholz wrote:
 
 
 
 Jens Axboe wrote:
 
 
 
 
 That's not quite true, q is not invalid after this call. It will 
 only be
 invalid when it is freed (which doesn't happen from here but rather 
 from
 the blk_cleanup_queue() call when the reference count drops to 0).
 
 This is still not perfect, but a lot better. Does it work for you?
 
 --- linux-2.6.12/drivers/ide/ide-disk.c~   2005-08-02 
 12:48:16.0 +0200
 +++ linux-2.6.12/drivers/ide/ide-disk.c2005-08-02 
 12:48:32.0 +0200
 @@ -1054,6 +1054,7 @@
 drive->driver_data = NULL;
 drive->devfs_name[0] = '\0';
 g->private_data = NULL;
  +  g->disk = NULL;
put_disk(g);
kfree(idkp);
 }
 
 No.
 drivers/ide/ide-disk.c: In function `ide_disk_release':
 drivers/ide/ide-disk.c:1057: error: structure has no member named 
 `disk'
 
 
  Eh, typo, should be g->queue of course :-)
 
 --- linux-2.6.12/drivers/ide/ide-disk.c~ 2005-08-02 
 12:48:16.0 +0200
 +++ linux-2.6.12/drivers/ide/ide-disk.c  2005-08-02 
 13:12:54.0 +0200
 @@ -1054,6 +1054,7 @@
   drive->driver_data = NULL;
   drive->devfs_name[0] = '\0';
   g->private_data = NULL;
  +	g->queue = NULL;
   put_disk(g);
   kfree(idkp);
 }
 
 No. That does not work:
 
 ~ # umount /mnt/pcmcia/
 generic_make_request(2859) q=c02d3040
  __generic_unplug_device(1447) calling q->request_fn() @ c00f97ec
 
 do_ide_request(1281) HWIF=c01dee8c (0), HWGROUP=c089cea0 (1038681856), 
 drive=c01def1c (0, 0), queue=c02d3040 ()
 do_ide_request(1287) HWIF is not present anymore!!!
 do_ide_request(1291) DRIVE is not present anymore. SKIPPING REQUEST!!!
 
 As you can see generic_make_request() still has the pointer to that 
 queue!
 It gets it with
 
    q = bdev_get_queue(bio->bi_bdev);
 
  So the pointer is still stored somewhere else...
 
 
  Hmmm, perhaps just letting ide end requests where the drive has been
  removed might be better. 
 
 I don't understand what you mean.
 
  If requests are issued (e.g. calling umount) after the drive is gone, then 
  I get either a kernel crash or umount hangs because it waits in 
 __wait_on_buffer() ...
 
 
 No, those waiters will be woken up when ide does an end_request for
 requests coming in for a device which no longer exists.
 
 But that would mean generating requests for devices, drives and hwifs that 
  no longer exist. But exactly there it will crash! In do_ide_request() and 
 ide_do_request().

ide doesn't generate the requests, it just receives them for processing.
And you want to halt that at the earliest stage possible.

Basically the problem you are trying to solve is hacking around the
missing hotplug support in drivers/ide. And that will never be pretty.
The correct solution would of course be to improve the hotplug support,
I think Bart was/is working on that (cc'ing him).

 ide_unregister() restores some old hwif structure. drive and queue are set 
 to NULL. When I wait long enough between cardctl eject and umount it 
 looks like this:
 
 ~ # cardctl eject
 ide_release(398)
 ide_unregister(585): index=0
 ide_unregister(698) old HWIF restored!
 hwif=c01dee8c (0), hwgroup=c0fac2a0, drive=, queue=
 ide_detach(164)
 cardmgr[253]: shutting down socket 0
 cardmgr[253]: executing: './ide stop hda'
 cardmgr[253]: executing: 'modprobe -r ide-cs'
 exit_ide_cs(514)
 
 ~ # umount /mnt/pcmcia/
 sys_umount(494)
 generic_make_request(2859) q=c02d3040
  __generic_unplug_device(1447) calling q->request_fn() @ c00f97e4
 do_ide_request(1279) HWIF=c01dee8c (0), HWGROUP=c0fac2a0 (738987520), 
 drive=c01def1c (0, 0), queue=c02d3040 ()

I don't understand what values you are dumping above, please explain. Is
HWIF c01dee8c or 0?

  Assertion '(hwif->present)' failed in 
  drivers/ide/ide-io.c:do_ide_request(1284)
  Assertion '(drive->present)' failed in 
  drivers/ide/ide-io.c:do_ide_request(1290)
 ide_do_request(1133) hwgroup is busy!
 ide_do_request(1135) hwif=01000406
 
  The 738987520 above is hwgroup->busy! Obviously completely wrong. This 
 seems to be a hint that an invalid pointer is dereferenced! The pointer 
 hwif=01000406 also does not look very healthy! drive=c01def1c is the result 
 of

Yeah it looks very bad. Same thing with the reference counting, ide
should not be freeing various structures that the block layer still
holds a reference to.

   drive = choose_drive(hwgroup);
 
 but can't be as it was set to NULL before.
 
 If I don't wait long enough  between cardctl eject and umount the 
 kernel crashes with:
 
 ~ # cardctl eject; umount /mnt/pcmcia
 ide_release(398)
 ide_unregister(585): index=0
 ide_unregister(698) old HWIF restored!
 hwif=c01dee8c (0), hwgroup=c0268080, drive=, queue=
 ide_detach(164)
 cardmgr[253]: shutting down socket 0
 cardmgr[253]: executing

Re: Crash in ide_do_request() on card removal

2005-08-02 Thread Jens Axboe
On Tue, Aug 02 2005, Steven Scholz wrote:
 Jens Axboe wrote:
 
 No, those waiters will be woken up when ide does an end_request for
 requests coming in for a device which no longer exists.
 
 But that would mean generating requests for devices, drives and hwifs 
  that no longer exist. But exactly there it will crash! In 
 do_ide_request() and ide_do_request().
 
 
 ide doesn't generate the requests, it just receives them for processing.
 I know.
 
 And you want to halt that at the earliest stage possible.
 Agreed.
 
 Problems seems to be:
 
  A reference to the request queue is stored in struct gendisk. Thus if you 
  unregister a block device you should make sure that no one can still try to 
  access that request queue, right?

Well the problem is that ide-cs/ide doesn't handle unplug gracefully.
You are trying to fix it in the wrong location, fix belongs in ide.

 ~ # umount /mnt/pcmcia/
 sys_umount(494)
 generic_make_request(2859) q=c02d3040
  __generic_unplug_device(1447) calling q->request_fn() @ c00f97e4
 do_ide_request(1279) HWIF=c01dee8c (0), HWGROUP=c0fac2a0 (738987520), 
 drive=c01def1c (0, 0), queue=c02d3040 ()
 
 
 I don't understand what values you are dumping above, please explain. Is
 HWIF c01dee8c or 0?
 
  printk("%s(%d) HWIF=%p (%d), HWGROUP=%p (%d), drive=%p (%d, %d), queue=%p 
  (%p)\n", __FUNCTION__, __LINE__, hwif, hwif->present, hwgroup, 
  hwgroup->busy, drive, drive->present, drive->dead, q, drive->queue);
 
  So HWIF is c01dee8c and hwif->present=0.

 Ok, so you could kill any request arriving for a !hwif->present hardware
interface.

  Assertion '(hwif->present)' failed in 
  drivers/ide/ide-io.c:do_ide_request(1284)
  Assertion '(drive->present)' failed in 
  drivers/ide/ide-io.c:do_ide_request(1290)
 ide_do_request(1133) hwgroup is busy!
 ide_do_request(1135) hwif=01000406
 
  The 738987520 above is hwgroup->busy! Obviously completely wrong. This 
 seems to be a hint that an invalid pointer is dereferenced! The pointer 
 hwif=01000406 also does not look very healthy! drive=c01def1c is the 
 result of
 
 
 Yeah it looks very bad. Same thing with the reference counting, ide
 should not be freeing various structures that the block layer still
 holds a reference to.
 
  Well, or better, tell the block layer that the drive is gone and that it 
  makes no sense to make any requests ...

It's not enough! What if requests are already on the queue waiting to be
serviced? Again, forget request generation.

  So how could you generate requests (and handle them sanely) for devices 
  that were removed?
 
 Generation is not a problem, that happens outside of your scope. The job
 of the driver is just to make sure that it plays by the rule and at
 least makes sure it doesn't crash on its own for an active queue.
 
 do_ide_request() could check hwif-present and/or drive-present.

Precisely.

  BUT: at this point the request is already made and the low-level block 
  layer is sleeping and waiting for its completion.

Which will complete when you error the request, as I wrote a few mails
ago.

 I could not figure out how to kill a request in do_ide_request() and wake 
 up the block layer (sleeping in __wait_on_buffer()).
  That's why I thought preventing the generation of such requests would be 
 the right way.

It's not the right way, it only solves a little part of the problem.
Killing a request with an error usually looks like this:

blkdev_dequeue_request(rq);
end_that_request_first(rq, 0, rq->hard_nr_sectors);
end_that_request_last(rq);


-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-ide in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
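The queue-lifetime rule Jens describes earlier in the thread (the queue stays valid until blk_cleanup_queue() drops the last reference) can be modeled in a few lines of user-space C. The structure and function names below are illustrative stand-ins, not the 2.6 block-layer API:

```c
#include <stddef.h>

/* Minimal model of a reference-counted queue: put_queue() only "frees"
 * the object when the last reference is dropped, so any holder (e.g.
 * the block layer) keeps the pointer valid until it lets go.
 * Illustrative names, not the actual kernel API. */
struct request_queue {
	int refcount;
	int freed;	/* stands in for the memory actually being released */
};

static void get_queue(struct request_queue *q)
{
	q->refcount++;
}

static void put_queue(struct request_queue *q)
{
	/* blk_cleanup_queue() would free the queue at refcount zero */
	if (--q->refcount == 0)
		q->freed = 1;
}
```

With this model, a driver dropping its own reference does not invalidate the queue while another subsystem still holds one, which is exactly why `q` is not invalid right after the driver-side release.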


Re: Crash in ide_do_request() on card removal

2005-08-02 Thread Jens Axboe
On Tue, Aug 02 2005, Steven Scholz wrote:
 Jens Axboe wrote:
 
 It's not the right way, it only solves a little part of the problem.
 Killing a request with an error usually looks like this:
 
  blkdev_dequeue_request(rq);
  end_that_request_first(rq, 0, rq-hard_nr_sectors);
  end_that_request_last(rq);
 
  How do I get the request? do_ide_request() only gets the complete 
  request_queue_t *q. Shall I use elv_next_request()?

Yes.

-- 
Jens Axboe



Re: Crash in ide_do_request() on card removal

2005-08-02 Thread Jens Axboe
On Tue, Aug 02 2005, Steven Scholz wrote:
 [PATCH] ide: kill requests when drive is not present anymore
 
 Signed-off-by: Steven Scholz [EMAIL PROTECTED]
 
 Ok?
 
 Change the CHANGELOG at your will.
 
 --
 Steven

 --- linux-2.6.13-rc5/drivers/ide/ide-io.c 2005-06-17 21:48:29.0 
 +0200
 +++ linux-2.6.13-rc4-at91-multiIO/drivers/ide/ide-io.c2005-08-02 
 15:46:53.0 +0200
 @@ -1230,7 +1257,17 @@ void do_ide_request(request_queue_t *q)
  {
   ide_drive_t *drive = q->queuedata;
  
 - ide_do_request(HWGROUP(drive), IDE_NO_IRQ);
  + if (drive->present)
 + ide_do_request(HWGROUP(drive), IDE_NO_IRQ);
 + else {
 + struct request  *rq;
  + printk(KERN_WARNING "%s: not present, killing requests\n", 
  drive->name);
 + while ((rq = elv_next_request(q)) != NULL) {
 + blkdev_dequeue_request(rq);
  + end_that_request_first(rq, 0, rq->hard_nr_sectors);
 + end_that_request_last(rq);
 + }
 + }
  }
  
  /*

Looks good to me now, that's one item off Bart's list :-)

-- 
Jens Axboe

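The drain loop in the patch above (dequeue each pending request and complete it with an error) can be modeled in plain user-space C. The structures and helpers below are illustrative stand-ins for the 2.6 block-layer calls, not the real API:

```c
#include <stddef.h>

/* User-space model of the "kill all pending requests" loop: pop each
 * queued request and complete it with an error, which is what releases
 * any process sleeping on the request's completion.  Illustrative
 * names, not the actual kernel API. */
struct request {
	struct request *next;
	int error;		/* set when the request is error-completed */
};

struct request_queue {
	struct request *head;
};

static struct request *next_request(struct request_queue *q)
{
	return q->head;		/* peek, like elv_next_request() */
}

static void kill_pending_requests(struct request_queue *q)
{
	struct request *rq;

	while ((rq = next_request(q)) != NULL) {
		q->head = rq->next;	/* like blkdev_dequeue_request() */
		rq->error = 1;		/* error completion wakes waiters */
	}
}
```

Erroring the requests rather than merely discarding them is the point: a process stuck in __wait_on_buffer() is only released once its request completes, even if that completion is an error.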


Re: ahci, SActive flag, and the HD activity LED

2005-08-02 Thread Jens Axboe
On Tue, Aug 02 2005, Martin Wilck wrote:
 Hello Jeff, hello Jens, hello everybody,
 
 I am referring to the debate about whether or not setting the SActive 
 bit for non-NCQ ATA commands (e.g. http://lkml.org/lkml/2005/5/26/142).
 
 In our machines, this behavior of the Linux AHCI driver causes the HD 
 activity LED to stay on all the time. If I apply the attached trivial 
 patch (this is for the RedHat EL4.0-U1 kernel), the LED behaves nicely.
 
 Jeff has stated in the above thread that SActive is intentionally used 
 for non-NCQ devices. However I find clear indication in the specs that 
 the SActive flag should be set if and only if tagged queuing is being 
 used, and only for a specified subset of commands that support queuing 
 (http://www.t13.org/docs2005/D1699r1e-ATA8-ACS.pdf, secs. 4.19 and 
 4.20). The current mainline driver doesn't use queuing.
 
 If I am reading the specs correctly, that'd mean the ahci driver is 
 wrong in setting the SActive bit. Could you please comment? Jeff, in 
 particular, could you please give more detail why you say this flag is 
 intentionally used?

I completely agree, that was my reading of the spec as well and hence my
original posts about this in the NCQ thread.

-- 
Jens Axboe



Re: Driver for sata adapter promise sata300 tx4

2005-08-01 Thread Jens Axboe
On Mon, Aug 01 2005, Daniel Drake wrote:
 Otto Meier wrote:
  My question is also: are these features (NCQ/TCQ) and the higher 
  data rate supported by this
  modification? Or is only the basic feature set of SATA 150 TX4 supported?
 
 NCQ support is under development. Search the archives for Jens Axboe's 
 recent patches to support this. I don't know about TCQ.

It's done for ahci, because we have documentation. I have no intention
on working on NCQ for chipset where full documentation is not available.
But the bulk of the code is the libata core support, adding NCQ support
to a sata_* driver should now be fairly trivial (with docs).

-- 
Jens Axboe



Re: Driver for sata adapter promise sata300 tx4

2005-08-01 Thread Jens Axboe
On Mon, Aug 01 2005, Jens Axboe wrote:
 On Mon, Aug 01 2005, Daniel Drake wrote:
  Otto Meier wrote:
   My question is also: are these features (NCQ/TCQ) and the higher 
   data rate supported by this
   modification? Or is only the basic feature set of SATA 150 TX4 supported?
  
  NCQ support is under development. Search the archives for Jens Axboe's 
  recent patches to support this. I don't know about TCQ.
 
 It's done for ahci, because we have documentation. I have no intention
 on working on NCQ for chipset where full documentation is not available.
 But the bulk of the code is the libata core support, adding NCQ support
 to a sata_* driver should now be fairly trivial (with docs).

Oh, and forget TCQ. It's a completely worthless technology inherited
from PATA, whose continued existence can only be explained by some
companies having invested money adding firmware support for it already
and/or because it sounds good to marketing who apparently rely on
customers thinking it must be similar to SCSI TCQ because it shares the
same name. In reality, they really share nothing. IDE TCQ makes a
mockery of the TLA, I hope the people that came up with it bury their
heads in shame for having wasted people's time actually trying it out.

-- 
Jens Axboe



Re: Driver for sata adapter promise sata300 tx4

2005-08-01 Thread Jens Axboe
On Mon, Aug 01 2005, Jeff Garzik wrote:
 Jens Axboe wrote:
 On Mon, Aug 01 2005, Daniel Drake wrote:
 
 Otto Meier wrote:
 
  My question is also: are these features (NCQ/TCQ) and the higher 
  data rate supported by this
  modification? Or is only the basic feature set of SATA 150 TX4 supported?
 
 NCQ support is under development. Search the archives for Jens Axboe's 
 recent patches to support this. I don't know about TCQ.
 
 
 It's done for ahci, because we have documentation. I have no intention
 on working on NCQ for chipset where full documentation is not available.
 But the bulk of the code is the libata core support, adding NCQ support
 to a sata_* driver should now be fairly trivial (with docs).
 
 
 I have docs for the Promise NCQ stuff.  Once NCQ is fully fleshed out (I 
 haven't wrapped my brain around it in a couple weeks), it shouldn't be 
 difficult to add NCQ support to sata_promise.

Excellent!

-- 
Jens Axboe



Re: [git patches] IDE update

2005-07-08 Thread Jens Axboe
On Fri, Jul 08 2005, Ingo Molnar wrote:
 
 * Jens Axboe [EMAIL PROTECTED] wrote:
 
  But! I used hdparm -t solely, 2.6 was always ~5% faster than 2.4. But 
  using -Tt slowed down the hd speed by about 30%. So it looks like some 
  scheduler interaction, perhaps the memory timing loops gets it marked 
  as batch or something?
 
 to check whether that could be the case, could you try:
 
   nice -n -20 hdparm -t /dev/hdc
 
 does that produce different results?

Same result, see my next mail, it turned out to be a read-ahead bug.

-- 
Jens Axboe



Re: Readahead with softraid1

2005-07-08 Thread Jens Axboe
On Fri, Jul 08 2005, Erik Slagter wrote:
 Hi,
 
 I am using softraid 1 on two sata disks and I'm trying to get the best
 possible performance. IMHO read actions (if properly addressed) should
  be split over the two drives and performed independently. However, I
 don't notice anything to back this up. The read performance (with the
 dreaded hdparm) shows read performance on sda,sdb and md0 exactly the
 same.
 
 I've been playing a bit with readahead and it does matter a bit in that
 if I disable readahead for sda/sdb completely, the read rate for these
 goes completely down (to be expected) whilst the read rate on md0 stays
 the same (also a bit to be expected). Other combinations do not show any
 significant impact.
 
 I also played with the i/o scheduler and nr_requests (as from previous
 messages here).
 
 What am I doing wrong here???

raid1 doesn't split reads, currently you need more than one process
doing io to get a performance increase.

-- 
Jens Axboe



Re: Readahead with softraid1

2005-07-08 Thread Jens Axboe
On Fri, Jul 08 2005, Erik Slagter wrote:
 On Fri, 2005-07-08 at 08:16 -0400, Danny Cox wrote:
 
   What am I doing wrong here???
  
  Nothing.  I'll take a shot at answering this one instead of lurking
  this time.  Then, I'll crawl back under my rock.
  
  The raid1 driver keeps a last visited block for each drive.  This is
  the block number that was most recently read or written by that drive.
  When a read request arrives, the driver examines each drive for the
  nearest last visited block to the one requested.  Guess what?  If the
  read starts with drive sda, then it will *always* be the one chosen to
  service the read in the future, because the last visited block number is
  only one off.  This would only change if there are multiple processes
  performing I/O on the md device.  Then, it may switch to another drive.
  In any case, it will *tend* to stick with the same drive.
  
  Did I explain that well, or only muddy the waters?
 
 perfect explanation, thanks (and also Jens!).
 
  Is this a design decision or is it fundamentally impossible to split the
  work amongst several drives?  I guess the md driver could at least do a
  prefetch of the next block/chunk on the other drive(s)?

Not sure if it's a design decision or just "this works ok, I'll fix it
later". Clearly there is a lot of room for improvement in the balancing
logic, to get more cases correct/faster. It's quite doable to split a
bio and send bits of it to various drives.

-- 
Jens Axboe

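Danny's "last visited block" description above can be sketched in a few lines of user-space C. The names below are illustrative, not the actual drivers/md/raid1.c code:

```c
#include <stdlib.h>

/* Simplified model of raid1 "nearest head" read balancing: each mirror
 * remembers the last sector it serviced, and a read is sent to the
 * mirror whose remembered position is closest to the requested sector.
 * Illustrative names, not the actual md driver code. */
struct mirror {
	long head_pos;	/* last visited sector on this drive */
};

/* Pick the mirror with the smallest seek distance to 'sector'. */
static int read_balance(struct mirror *m, int nmirrors, long sector)
{
	int i, best = 0;
	long best_dist = labs(sector - m[0].head_pos);

	for (i = 1; i < nmirrors; i++) {
		long dist = labs(sector - m[i].head_pos);
		if (dist < best_dist) {
			best_dist = dist;
			best = i;
		}
	}
	m[best].head_pos = sector;	/* remember for the next request */
	return best;
}
```

A single sequential reader is always nearest to the mirror that serviced the previous block, so it sticks to one drive; only a second process reading elsewhere pulls requests onto the other mirror, which matches the behaviour observed with hdparm.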


Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling

2005-07-08 Thread Jens Axboe
On Fri, Jul 08 2005, Tejun Heo wrote:
 On Fri, Jul 08, 2005 at 10:03:48AM +0200, Jens Axboe wrote:
  Hi,
  
  Ok, one more error, this time from irq context:
  
  AHCI: ata1: error irq, status=4001 stat=51 err=04 sstat=0113
  serr=
  ata1: aborting commands due to error.  active_tag -1, sactive 0001
  ahci: sactive 1
  ata1: recovering from error
  sactive=0
  ata1: failed to read log page 10h (-110)
  ata1: resetting...
  ata1: started resetting...
  ata1: end resetting, sstatus=0113
  ata1: status=0x01 { Error }
  ata1: error=0x80 { Sector }
  SCSI error : 0 0 0 0 return code = 0x802
  sda: Current: sense key=0x3
  ASC=0x11 ASCQ=0x4
  end_request: I/O error, dev sda, sector 2120875
  Buffer I/O error on device sda2, logical block 2045
  lost page write due to I/O error on sda2
  
  -- 
  Jens Axboe
 
  Hi, Jens.
 
  I also have a weird lockup log.  This log is generated with the
 second take of NCQ patchset I've posted yesterday.  It's Samsung
 HD160JJ on ICH7R AHCI.  Every command with tag#3 is scrambled, so
 recovery operations are performed constantly (log 10h, if that fails
 COMRESET).  After a few hours, the drive failed to become online after
 COMRESET.  Rebooting didn't work.  BIOS couldn't detect/recover it.  I
 had to power-cycle to make it online again.  I'm running similar test
  with much lower error-rate (5~6 errors per 3000 requests) to avoid too
 many COMRESET's and, for more than six hours, it's been running okay.
 Error log follows.

very strange, the drive must be really buggered.

  Jens, can you run your test with the new patchset?

Sure, I will try. But I will be away from this box for the next two
weeks (vacation, then KS/OLS), so I cannot do much about it in the near
future...

-- 
Jens Axboe



Re: sata_sil 3112 activity LED patch

2005-07-07 Thread Jens Axboe
On Wed, Jul 06 2005, Jeff Garzik wrote:
 Aric Cyr wrote:
 After finally getting fed up with not having my activity light working
 for my SATA drives, I came up with a small patch (more like hack) to
 make it work.  It works quite well, but I'm afraid that there are many
  restrictions that this patch does not check for that it probably
 should... so consider this a work-in-progress.  My information is
 based on a document from Silicon Image that appears to no longer be
 available on their website (Sil-AN-0082-E; 8112-0082.pdf).  I still
 have a copy if anyone is interested.
 
 There are two restrictions that are not checked:
 
 1) Is the chip a 3112 or 3114?  I assume that this would only work on
a 3112, but whether it is a bad thing on a 3114 I do not know.
 
 2) BAR5 + 0x54 is apparently used for the flash memory address and
data lines.  However for most motherboards (i.e. not add-on cards)
with the chip, like my EPOX 8RDA3+, there is no flash memory, so
these lines are hijacked as LED GPIO.  I assume that this is a
common practice for motherboard makers using the sil3112 since
Silicon Image went out of their way to produce the above mentioned
document.  Anyways, the problem is that this patch does not check
if flash memory is installed or not before twiddling with the GPIO
lines.  This could be extremely bad for people running the 3112
from add-on cards (or any implementation with flash memory
installed).
 
 Setting the low 8bits at BAR5+54h seems to enable the LED circuit.  It
 seems that this circuit is patched through into the motherboard as it
 lights the regular hard drive light on the front of my case.  Setting
 bits [8:15] at BAR5+54h clears the bits, disabling the LED.  I hooked
 this logic into the ata_bmdma_start and ata_bmdma_stop which were made
 into simple wrapper functions in sata_sil.c that just set the GPIO
 bits and calls ata_bmdma_*.
 
 I don't think its ugly, necessarily.  I do worry about the flash memory 
 stuff, though, which is why I don't want to merge this upstream for now.
 
 For your patch specifically, it would be nice to follow the coding style 
 that is found in the rest of the driver (single-tab indents, etc., read 
 CodingStyle in kernel source tree).

There's also an existing variant of this in the block layer, the
activity_fn, that we use on the ibook/powerbook to use the sleep led as
an activity light. Just in case you prefer that to overloading the bmdma
start/stop handlers.

-- 
Jens Axboe



Re: sata_sil 3112 activity LED patch

2005-07-07 Thread Jens Axboe
On Thu, Jul 07 2005, Aric Cyr wrote:
  There's also an existing variant of this in the block layer, the
  activity_fn, that we use on the ibook/powerbook to use the sleep led as
  an activity light. Just in case you prefer that to overloading the bmdma
  start/stop handlers.
 
 You suggestion at first looked to be incredibly nice... until I looked
 at how much implementation was required.  I am considering trying it,
 but I cannot find a place for an sata driver to call the
 blk_queue_activity_fn() with meaningful parameters during init.
 
 On a second look, I guess I would have to override
 ata_scsi_slave_config() in the driver and hook up the activity light
 there.  This would be fine I guess.  Unless I am interpreting this
 incorrectly, however, I would need to use a timer or something to turn
 the light back off?  I'm probably missing something, so is there a
 simpler way to do this?

Hmm yes, it will require more work for you. It should be cleaned up a
little to pass in a START/STOP variable and handle everything in the
block layer instead. You probably just want to continue using the bmdma
hooks now, that is actually a fine implementation imo.

-- 
Jens Axboe



Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling

2005-07-06 Thread Jens Axboe
On Wed, Jul 06 2005, Jens Axboe wrote:
 On Mon, Jul 04 2005, Jens Axboe wrote:
  On Fri, Jul 01 2005, Jens Axboe wrote:
 I converted most of debug messages I've used during development into
warning messages when posting the patchset and forgot about it, so
I've never posted the debug patch.  Sorry about that.  Here's a small
patch which adds some more messages though.  The following patch also
adds printk'ing FIS on each command issue in ahci.c:ahci_qc_issue(),
if you think it would fill your log excessively, feel free to turn it
off.  It probably wouldn't matter anyway.
   
   I will have to kill the issue part of the patch, that would generate
   insane amounts of printk traffic :-)
   
   I'll boot the kernel and report what happens.
  
  It triggered last night, but the old kernel was booted. This was the
  log:
  
  ahci ata1: stat=d0, issuing COMRESET
  ata1: recovering from error
  ata1: status=0x01 { Error }
  ata1: error=0x80 { Sector }
  SCSI error : 0 0 0 0 return code = 0x802
  sda: Current: sense key=0x3
  ASC=0x11 ASCQ=0x4
  end_request: I/O error, dev sda, sector 66255899
  Buffer I/O error on device sda2, logical block 8018923
  lost page write due to I/O error on sda2
  ata1: status=0x01 { Error }
  ata1: error=0x80 { Sector }
  SCSI error : 0 0 0 0 return code = 0x802
  sda: Current: sense key=0x3
  ASC=0x11 ASCQ=0x4
  end_request: I/O error, dev sda, sector 66239043
  Buffer I/O error on device sda2, logical block 8016816
  lost page write due to I/O error on sda2
  ata1: recovering from error
  ata1: status=0x01 { Error }
  ata1: error=0x80 { Sector }
  SCSI error : 0 0 0 0 return code = 0x802
  sda: Current: sense key=0x3
  ASC=0x11 ASCQ=0x4
  end_request: I/O error, dev sda, sector 66239051
  Buffer I/O error on device sda2, logical block 8016817
  lost page write due to I/O error on sda2
  ata1: status=0x01 { Error }
  ata1: error=0x80 { Sector }
  SCSI error : 0 0 0 0 return code = 0x802
  sda: Current: sense key=0x3
  ASC=0x11 ASCQ=0x4
  end_request: I/O error, dev sda, sector 35137043
 
 This is with the extra debug. Given that it is the timeout triggering,
 only the sstatus is new.
 
 ahci ata1: stat=d0, issuing COMRESET
 ata1: started resetting...
 ata1: end resetting, sstatus=0113
 ata1: recovering from error
 ata1: status=0x01 { Error }
 ata1: error=0x80 { Sector }
 SCSI error : 0 0 0 0 return code = 0x802
 sda: Current: sense key=0x3
 ASC=0x11 ASCQ=0x4
 end_request: I/O error, dev sda, sector 66190875
 Buffer I/O error on device sda2, logical block 8010795
 lost page write due to I/O error on sda2
 ata1: status=0x01 { Error }
 ata1: error=0x80 { Sector }
 SCSI error : 0 0 0 0 return code = 0x802
 sda: Current: sense key=0x3
 ASC=0x11 ASCQ=0x4
 end_request: I/O error, dev sda, sector 66159699
 Buffer I/O error on device sda2, logical block 8006898
 lost page write due to I/O error on sda2

btw, the reason it hangs here (I suspect) is that your read_log_page()
logic is wrong - not every error condition will have NCQ_FAILED set
before entering ncq_recover. The timeout will not, for instance.
Testing... As usual, this will take days.

-- 
Jens Axboe



Re: [git patches] IDE update

2005-07-05 Thread Jens Axboe
On Tue, Jul 05 2005, Ondrej Zary wrote:
 Jens Axboe wrote:
 On Tue, Jul 05 2005, Ondrej Zary wrote:
 
 André Tomt wrote:
 
 Al Boldi wrote:
 
 
  Bartlomiej Zolnierkiewicz wrote:
 
 
 On 7/4/05, Al Boldi [EMAIL PROTECTED] wrote:
 Hdparm -tT gives 38mb/s in 2.4.31
  Cat /dev/hda > /dev/null gives 2% user 33% sys 65% idle
 
 Hdparm -tT gives 28mb/s in 2.6.12
  Cat /dev/hda > /dev/null gives 2% user 25% sys 0% idle 73% IOWAIT
 
 
  The "hdparm doesn't get as high scores as in 2.4" issue is an old, 
  discussed-to-death problem on LKML. So far nobody has been able to show it 
  affects anything but that pretty useless quasi-benchmark.
 
 
 No, it's not a problem with hdparm. hdparm only shows that there is 
 _really_ a problem:
 
 2.6.12
 [EMAIL PROTECTED]:/home/rainbow# time dd if=/dev/hda of=/dev/null bs=512
 count=1048576
 1048576+0 records in
 1048576+0 records out
 
 real0m32.339s
 user0m1.500s
 sys 0m14.560s
 
 2.4.26
 [EMAIL PROTECTED]:/home/rainbow# time dd if=/dev/hda of=/dev/null bs=512
 count=1048576
 1048576+0 records in
 1048576+0 records out
 
 real0m23.858s
 user0m1.750s
 sys 0m15.180s
 
 
 Perhaps some read-ahead bug. What happens if you use bs=128k for
 instance?
 
 Nothing - it's still the same.
 
 [EMAIL PROTECTED]:/home/rainbow# time dd if=/dev/hda of=/dev/null bs=128k 
 count=4096
 4096+0 records in
 4096+0 records out
 
 real0m32.832s
 user0m0.040s
 sys 0m15.670s

Can you post full dmesg of 2.4 and 2.6 kernel boot? What does hdparm
-I/-i say for both kernels?

-- 
Jens Axboe



Re: [git patches] IDE update

2005-07-05 Thread Jens Axboe
On Tue, 2005-07-05 at 14:35 +0200, Ondrej Zary wrote:
 2.4.26
 [EMAIL PROTECTED]:/home/rainbow# time dd if=/dev/hda of=/dev/null bs=512
 count=1048576
 1048576+0 records in
 1048576+0 records out
 
 real0m23.858s
 user0m1.750s
 sys 0m15.180s
 
 
 Perhaps some read-ahead bug. What happens if you use bs=128k for
 instance?
 
 
 Nothing - it's still the same.
 
 [EMAIL PROTECTED]:/home/rainbow# time dd if=/dev/hda of=/dev/null bs=128k 
 count=4096
 4096+0 records in
 4096+0 records out
 
 real0m32.832s
 user0m0.040s
 sys 0m15.670s
  
  
  Can you post full dmesg of 2.4 and 2.6 kernel boot? What does hdparm
  -I/-i say for both kernels?
  
 
 The 2.4.26 kernel is the one from Slackware 10.0 bootable install CD.
 dmesg outputs attached, hdparm -i and hdparm -I shows the same in both
 kernels (compared using diff) - attached too.

Ok, looks alright for both. Your machine is quite slow, perhaps that is
showing the slower performance. Can you try and make HZ 100 in 2.6 and
test again? 2.6.13-recent has it as a config option, otherwise edit
include/asm/param.h appropriately.


-- 
Jens Axboe [EMAIL PROTECTED]



Re: [git patches] IDE update

2005-07-05 Thread Jens Axboe
On Tue, Jul 05 2005, Ondrej Zary wrote:
 Jens Axboe wrote:
 On Tue, 2005-07-05 at 15:02 +0200, Ondrej Zary wrote:
 
 Ok, looks alright for both. Your machine is quite slow, perhaps that is
 showing the slower performance. Can you try and make HZ 100 in 2.6 and
 test again? 2.6.13-recent has it as a config option, otherwise edit
 include/asm/param.h appropriately.
 
 
 I forgot to write that my 2.6.12 kernel is already compiled with HZ 100 
 (it makes the system more responsive).
 I've just tried 2.6.8.1 with HZ 1000 and there is no difference in HDD 
 performance comparing to 2.6.12.
 
 
 OK, interesting. You could try and boot with profile=2 and do
 
 # readprofile -r
 # dd if=/dev/hda of=/dev/null bs=128k 
  # readprofile > prof_output
 
 for each kernel and post it here, so we can see if anything sticks out.
 
 Here are the profiles (used dd with count=4096) from 2.4.26 and 2.6.12 
 (nothing from 2.6.8.1 because I don't have the .map file anymore).

Looks interesting, 2.6 spends oodles of time copying to user space.
Lets check if raw reads perform ok, please try and time this app in 2.4
and 2.6 as well.

# gcc -Wall -O2 -o oread oread.c
# time ./oread /dev/hda

-- 
Jens Axboe

#include <stdio.h>
#include <unistd.h>
#define __USE_GNU
#include <fcntl.h>
#include <stdlib.h>

#define BS	(131072)
#define BLOCKS	(4096)
#define ALIGN(buf)	(char *) (((unsigned long) (buf) + 4095) & ~(4095))

int main(int argc, char *argv[])
{
	char *buffer;
	int fd, i;

	if (argc < 2) {
		printf("%s: device\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd == -1) {
		perror("open");
		return 2;
	}

	buffer = ALIGN(malloc(BS + 4095));

	for (i = 0; i < BLOCKS; i++) {
		int ret = read(fd, buffer, BS);

		if (!ret)
			break;
		else if (ret < 0) {
			perror("read infile");
			break;
		}
	}

	return 0;
}

