Re: [PATCH 0/7] blk_end_request: full I/O completion handler
On Fri, Feb 08 2008, S, Chandrakala (STSD) wrote:
> Hi,
>
> Thanks for the information! We would like to know when the 2.6.25
> kernel will be available at kernel.org.

So would I; if I locate my crystal ball I'll be sure to let you know :-)
Seriously, given past experience, it's probably sometime in April.

> Thanks,
> Chandrakala
>
> -----Original Message-----
> From: Jens Axboe [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, February 05, 2008 4:09 PM
> To: S, Chandrakala (STSD)
> Cc: Kiyoshi Ueda; [EMAIL PROTECTED]; [EMAIL PROTECTED];
> linux-ide@vger.kernel.org; Miller, Mike (OS Dev); [EMAIL PROTECTED];
> [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [PATCH 0/7] blk_end_request: full I/O completion handler
>
> On Tue, Feb 05 2008, S, Chandrakala (STSD) wrote:
> > Hello,
> >
> > We would like to know in which kernel version these patches are
> > available.
>
> They were merged after 2.6.24 was released, so they will show up in
> the 2.6.25 kernel.
>
> --
> Jens Axboe

--
Jens Axboe
-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCHSET #upstream] block/libata: update and use block layer padding and draining
On Fri, Feb 08 2008, Jeff Garzik wrote:
> Jeff Garzik wrote:
> > Tejun Heo wrote:
> > > This patchset updates block layer padding and draining support and
> > > makes libata use it. It's based on James Bottomley's initial work
> > > and, of the five, the last two patches are from James with some
> > > modifications. Please read the following thread for more info.
> > >
> > > http://thread.gmane.org/gmane.linux.scsi/37185
> > >
> > > This patchset is on top of upstream
> > > (a6af42fc9a12165136d82206ad52f18c5955ce87) +
> > > kill-n_iter-and-fix-fsl patch [1]
> >
> > ACK patchset... let's definitely get these fixes upstream. Once Jens
> > is happy, I would prefer to merge the lot upstream, if that is OK
> > with everyone involved?
>
> Jens, ping? It's a bug fix, so it would be nice to get this in
> soonish. As noted, if all looks good, I would prefer to merge via
> libata-dev...

I'm OK with it, but let's please merge the block bits through the block
repo, since they are not trivial. Won't be until the week after next,
though.

--
Jens Axboe
Re: [PATCH 0/7] blk_end_request: full I/O completion handler
On Tue, Feb 05 2008, S, Chandrakala (STSD) wrote:
> Hello,
>
> We would like to know in which kernel version these patches are
> available.

They were merged after 2.6.24 was released, so they will show up in the
2.6.25 kernel.

--
Jens Axboe
Re: kernel BUG at ide-cd.c:1726 in 2.6.24-03863-g0ba6c33 -g8561b089
On Thu, Jan 31 2008, Nai Xia wrote:
> My dmesg relevant info is quite similar:
>
> [    6.875041] Freeing unused kernel memory: 320k freed
> [    8.143120] ide-cd: rq still having bio: dev hdc: type=2, flags=114c8
> [    8.144439]
> [    8.144439] sector 10824201199534213, nr/cnr 0/0
> [    8.144439] bio cf029280, biotail cf029280, buffer , data , len 158
> [    8.144439] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> [    8.144439] backup: data_len=158 bi_size=158
> [    8.160756] ide-cd: rq still having bio: dev hdc: type=2, flags=114c8
> [    8.160756]
> [    8.160756] sector 2669858, nr/cnr 0/0
> [    8.160756] bio cf029300, biotail cf029300, buffer , data , len 158
> [    8.160756] cdb: 12 01 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> [    8.160756] backup: data_len=158 bi_size=158
> [   14.851101] eth0: link up
> [   27.121883] eth0: no IPv6 routers present
>
> And by the way, Kiyoshi, this can be reproduced in a typical setup:
> vmware workstation 6.02 with a virtual IDE cdrom, in case you wanna
> catch that with your own eyes. :-)
>
> Thanks for trying hard to correct this annoying bug.

The below fix should be enough. It's perfectly legal to have leftover
byte counts when the drive signals completion, happens all the time for
e.g. user issued commands where you don't know an exact byte count.

diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 74c6087..bee05a3 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -1722,7 +1722,7 @@ static ide_startstop_t cdrom_newpc_intr(ide_drive_t *drive)
 	 */
 	if ((stat & DRQ_STAT) == 0) {
 		spin_lock_irqsave(&ide_lock, flags);
-		if (__blk_end_request(rq, 0, 0))
+		if (__blk_end_request(rq, 0, rq->data_len))
 			BUG();
 		HWGROUP(drive)->rq = NULL;
 		spin_unlock_irqrestore(&ide_lock, flags);

--
Jens Axboe
Re: kernel BUG at ide-cd.c:1726 in 2.6.24-03863-g0ba6c33 -g8561b089
On Thu, Jan 31 2008, Florian Lohoff wrote:
> On Thu, Jan 31, 2008 at 02:05:58PM +0100, Jens Axboe wrote:
> > The below fix should be enough. It's perfectly legal to have
> > leftover byte counts when the drive signals completion, happens all
> > the time for e.g. user issued commands where you don't know an exact
> > byte count.
> >
> > diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
> > index 74c6087..bee05a3 100644
> > --- a/drivers/ide/ide-cd.c
> > +++ b/drivers/ide/ide-cd.c
> > @@ -1722,7 +1722,7 @@ static ide_startstop_t cdrom_newpc_intr(ide_drive_t *drive)
> >  	 */
> >  	if ((stat & DRQ_STAT) == 0) {
> >  		spin_lock_irqsave(&ide_lock, flags);
> > -		if (__blk_end_request(rq, 0, 0))
> > +		if (__blk_end_request(rq, 0, rq->data_len))
> >  			BUG();
> >  		HWGROUP(drive)->rq = NULL;
> >  		spin_unlock_irqrestore(&ide_lock, flags);
>
> Fixes the crash on boot for me ...

Great, thanks for confirming that. I'll make sure the patch goes
upstream today, if Linus is available.

--
Jens Axboe
Re: kernel BUG at ide-cd.c:1726 in 2.6.24-03863-g0ba6c33 -g8561b089
On 31/01/2008, at 18.04, Kiyoshi Ueda [EMAIL PROTECTED] wrote:
> Hi Jens,
>
> On Thu, 31 Jan 2008 14:05:58 +0100, Jens Axboe wrote:
> > On Thu, Jan 31 2008, Nai Xia wrote:
> > > My dmesg relevant info is quite similar:
> > > [dmesg excerpt and vmware reproduction details snipped]
> >
> > The below fix should be enough. It's perfectly legal to have
> > leftover byte counts when the drive signals completion, happens all
> > the time for e.g. user issued commands where you don't know an exact
> > byte count.
> >
> > diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
> > index 74c6087..bee05a3 100644
> > --- a/drivers/ide/ide-cd.c
> > +++ b/drivers/ide/ide-cd.c
> > @@ -1722,7 +1722,7 @@ static ide_startstop_t cdrom_newpc_intr(ide_drive_t *drive)
> >  	 */
> >  	if ((stat & DRQ_STAT) == 0) {
> >  		spin_lock_irqsave(&ide_lock, flags);
> > -		if (__blk_end_request(rq, 0, 0))
> > +		if (__blk_end_request(rq, 0, rq->data_len))
> >  			BUG();
> >  		HWGROUP(drive)->rq = NULL;
> >  		spin_unlock_irqrestore(&ide_lock, flags);
>
> OK, I understand the leftover is legal. By the way, is it safe to
> always return success if there is a leftover? I thought we might have
> to complete the rq with -EIO in such case.
data_len being non-zero should pass the residual count back to the
issuer.
Re: [PATCH 41/63] ide-cd: move lba_to_msf() and msf_to_lba() to linux/cdrom.h
On Thu, Dec 20 2007, Bartlomiej Zolnierkiewicz wrote:
> * Move lba_to_msf() and msf_to_lba() to linux/cdrom.h
>   (use 'u8' type instead of 'byte' while at it).
>
> * Remove msf_to_lba() copy from drivers/cdrom/cdrom.c.
>
> Cc: Jens Axboe [EMAIL PROTECTED]
> Signed-off-by: Bartlomiej Zolnierkiewicz [EMAIL PROTECTED]

Acked-by: Jens Axboe [EMAIL PROTECTED]

---
 drivers/cdrom/cdrom.c |    6 ------
 drivers/ide/ide-cd.c  |   18 ------------------
 include/linux/cdrom.h |   14 ++++++++++++++
 3 files changed, 14 insertions(+), 24 deletions(-)

Index: b/drivers/cdrom/cdrom.c
===================================================================
--- a/drivers/cdrom/cdrom.c
+++ b/drivers/cdrom/cdrom.c
@@ -2787,12 +2787,6 @@ int cdrom_ioctl(struct file * file, stru
 	return -ENOSYS;
 }
 
-static inline
-int msf_to_lba(char m, char s, char f)
-{
-	return (((m * CD_SECS) + s) * CD_FRAMES + f) - CD_MSF_OFFSET;
-}
-
 /*
  * Required when we need to use READ_10 to issue other than 2048 block
  * reads
Index: b/drivers/ide/ide-cd.c
===================================================================
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -1643,24 +1643,6 @@ void msf_from_bcd (struct atapi_msf *msf
 	msf->frame = BCD2BIN(msf->frame);
 }
 
-static inline
-void lba_to_msf (int lba, byte *m, byte *s, byte *f)
-{
-	lba += CD_MSF_OFFSET;
-	lba &= 0xffffff;	/* negative lbas use only 24 bits */
-	*m = lba / (CD_SECS * CD_FRAMES);
-	lba %= (CD_SECS * CD_FRAMES);
-	*s = lba / CD_FRAMES;
-	*f = lba % CD_FRAMES;
-}
-
-static inline
-int msf_to_lba (byte m, byte s, byte f)
-{
-	return (((m * CD_SECS) + s) * CD_FRAMES + f) - CD_MSF_OFFSET;
-}
-
 static int cdrom_check_status(ide_drive_t *drive, struct request_sense *sense)
 {
 	struct request req;
Index: b/include/linux/cdrom.h
===================================================================
--- a/include/linux/cdrom.h
+++ b/include/linux/cdrom.h
@@ -1184,6 +1184,20 @@ struct media_event_desc {
 extern int cdrom_get_media_event(struct cdrom_device_info *cdi,
 				 struct media_event_desc *med);
 
+static inline void lba_to_msf(int lba, u8 *m, u8 *s, u8 *f)
+{
+	lba += CD_MSF_OFFSET;
+	lba &= 0xffffff;	/* negative lbas use only 24 bits */
+	*m = lba / (CD_SECS * CD_FRAMES);
+	lba %= (CD_SECS * CD_FRAMES);
+	*s = lba / CD_FRAMES;
+	*f = lba % CD_FRAMES;
+}
+
+static inline int msf_to_lba(u8 m, u8 s, u8 f)
+{
+	return (((m * CD_SECS) + s) * CD_FRAMES + f) - CD_MSF_OFFSET;
+}
 #endif /* End of kernel only stuff */
 
 #endif /* _LINUX_CDROM_H */

--
Jens Axboe
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
On Thu, Dec 13 2007, Mark Lord wrote:
> Matthew Wilcox wrote:
> > On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
> > > Problem confirmed. 2.6.23.8 regularly generates segments up to
> > > 64KB for libata, but 2.6.24 uses only 4KB segments and a *few* 8KB
> > > segments.
> >
> > Just a suspicion ... could this be slab vs slub? ie check your
> > configs are the same / similar between the two kernels.
>
> Mmmm.. a good thought, that one. But I just rechecked, and both have
> CONFIG_SLAB=y
>
> My guess is that something got changed around when Jens reworked the
> block layer for 2.6.24. I'm going to dig around in there now.

I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects in
  this area, since it changes some of the code involved with merges and
  blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do
see occasional segments being merged. So it sounds more like the input
data having changed. Why not just bisect it?

--
Jens Axboe
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
On Thu, Dec 13 2007, Mark Lord wrote:
> Jens Axboe wrote:
> > Anyway, it sounds odd for this to be a block layer problem if you do
> > see occasional segments being merged. So it sounds more like the
> > input data having changed. Why not just bisect it?
>
> Because the early 2.6.24 series failed to boot on this machine due to
> bugs in the block layer -- so the code that caused this regression is
> probably in the stuff from before the kernels became usable here.
>
> That sounds more harsh than intended -- the earlier 2.6.24 kernels (up
> to the first couple of -rc* ones) failed here because of
> incompatibilities between the block/bio changes and libata. That's
> better, I think!

No worries, I didn't pick it up as harsh, just as an odd conclusion :-)
If I were you, I'd just start from the first -rc that booted for you.
If THAT has the bug, then we'll think of something else.

If you don't get anywhere, I can run some tests tomorrow and see if I
can reproduce it here.

--
Jens Axboe
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
On Thu, Dec 13 2007, Mark Lord wrote:
> Jens Axboe wrote:
> > No worries, I didn't pick it up as harsh, just as an odd conclusion
> > :-) If I were you, I'd just start from the first -rc that booted for
> > you. If THAT has the bug, then we'll think of something else.
> >
> > If you don't get anywhere, I can run some tests tomorrow and see if
> > I can reproduce it here.
>
> I believe that *anyone* can reproduce it, since it's broken long
> before the requests ever get to SCSI or libata. Which also means that
> *anyone* who wants to can bisect it, as well. I don't do bisects.

It was just a suggestion on how to narrow it down, do as you see fit.

> But I will dig a bit more and see if I can find the culprit.

Sure, I'll dig around as well.

--
Jens Axboe
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
On Thu, Dec 13 2007, Jens Axboe wrote:
> On Thu, Dec 13 2007, Mark Lord wrote:
> > I believe that *anyone* can reproduce it, since it's broken long
> > before the requests ever get to SCSI or libata. Which also means
> > that *anyone* who wants to can bisect it, as well. I don't do
> > bisects.
>
> It was just a suggestion on how to narrow it down, do as you see fit.
>
> > But I will dig a bit more and see if I can find the culprit.
>
> Sure, I'll dig around as well.

Just tried something simple. I only see one 12kb segment so far, so not
a lot by any stretch. I also don't see any missed-merge signs, so it
would appear that the pages in the request are simply not contiguous
physically.

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index e30b1a4..1e34b6f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1330,6 +1330,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 				goto new_segment;
 
 			sg->length += nbytes;
+			if (sg->length > 8192)
+				printk("sg_len=%d\n", sg->length);
 		} else {
 new_segment:
 			if (!sg)
@@ -1349,6 +1351,8 @@ new_segment:
 				sg = sg_next(sg);
 			}
 
+			if (bvprv && (page_address(bvprv->bv_page) + bvprv->bv_len == page_address(bvec->bv_page)))
+				printk("missed merge\n");
 			sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
 			nsegs++;
 		}

--
Jens Axboe
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
On Thu, Dec 13 2007, Mark Lord wrote:
> Jens Axboe wrote:
> > Sure, I'll dig around as well.
>
> I wonder if it's 9dfa52831e96194b8649613e3131baa2c109f7dc: Merge
> blk_recount_segments into blk_recalc_rq_segments?
>
> That particular commit does some rather innocent code-shuffling, but
> also introduces a couple of new if (nr_hw_segs == 1) conditions that
> were not there before.

You can try and revert it of course, but I think you are looking at the
wrong bits. If the segment counts were totally off, you'd never be
anywhere close to reaching the set limit. Your problem seems to be
missed contig segment merges.

> Okay git experts: how do I pull out a kernel at the point of this
> exact commit?

Dummy approach: git log and grep for
9dfa52831e96194b8649613e3131baa2c109f7dc, then see what commit is
before that. Then do a git checkout <commit>.

--
Jens Axboe
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
On Thu, Dec 13 2007, Mark Lord wrote:
> Jens Axboe wrote:
> > On Thu, Dec 13 2007, Mark Lord wrote:
> > > Jens Axboe wrote:
> > > > Just tried something simple. I only see one 12kb segment so far,
> > > > so not a lot by any stretch. I also don't see any missed-merge
> > > > signs, so it would appear that the pages in the request are
> > > > simply not contiguous physically.
> > > >
> > > > [debug patch for blk_rq_map_sg() snipped]
> > >
> > > Yeah, the first part is similar to my own hack.
> >
> > For testing, try dd if=/dev/sda of=/dev/null bs=4096k. That *really*
> > should end up using contiguous pages on most systems.
> >
> > > I figured out the git thing, and am now building some in-between
> > > kernels to try.
> >
> > OK, it's a vm issue, I have tens of thousands of backward pages
> > after a boot - IOW, bvec->bv_page is the page before bvprv->bv_page,
> > not the reverse. So it looks like that bug got reintroduced.
>
> Mmm.. shouldn't one of the front- or back-merge logics work for either
> order?

I think you are misunderstanding the merging. The front/back bits are
for contig on disk, this is sg segment merging. We can only join pieces
that are contig in memory, otherwise the result would not be pretty :-)

--
Jens Axboe
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
On Thu, Dec 13 2007, Andrew Morton wrote:
> On Thu, 13 Dec 2007 21:09:59 +0100 Jens Axboe [EMAIL PROTECTED] wrote:
> > OK, it's a vm issue, I have tens of thousands of backward pages
> > after a boot - IOW, bvec->bv_page is the page before bvprv->bv_page,
> > not the reverse. So it looks like that bug got reintroduced.
>
> cc linux-mm and probable culprit. Bill Irwin fixed this a couple of
> years back: changed the page allocator so that it mostly hands out
> pages in ascending physical-address order. I guess we broke that,
> quite possibly in Mel's page allocator rework.
>
> It would help if you could provide us with a simple recipe for
> demonstrating this problem, please.

Basically anything involving IO :-). A boot here showed a handful of
good merges, and probably in the order of 100,000 descending
allocations. A kernel make is a fine test as well.

Something like the below should work fine - if you see oodles of these
basically doing any type of IO, then you are screwed.

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index e30b1a4..8ce3fcc 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1349,6 +1349,10 @@ new_segment:
 				sg = sg_next(sg);
 			}
 
+			if (bvprv) {
+				if (page_address(bvec->bv_page) + PAGE_SIZE == page_address(bvprv->bv_page) && printk_ratelimit())
+					printk("page alloc order backwards\n");
+			}
 			sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
 			nsegs++;
 		}

--
Jens Axboe
Re: SAS v SATA interface performance
On Mon, Dec 10 2007, Tejun Heo wrote:
> There's one thing we can do to improve the situation tho. Several
> drives including raptors and 7200.11s suffer a serious performance hit
> if sequential transfer is performed by multiple NCQ commands. My
> 7200.11 can do 100MB/s if a non-NCQ command is used or only up to two
> NCQ commands are issued; however, if all 31 (maximum currently
> supported by libata) are used, the transfer rate drops to a miserable
> 70MB/s. It seems that what we need to do is not issuing too many
> commands to one sequential stream. In fact, there isn't much to gain
> by issuing more than two commands to one sequential stream.

Well... CFQ won't go to deep queue depths across processes if they are
doing streaming IO, but it won't stop a single process from doing so.
I'd like to know what real life process would issue streaming IO in
some async manner so as to get 31 pending commands sequentially? Not
very likely :-) So I'd consider your case above a microbenchmark
result. I'd also claim that the firmware is very crappy, if it performs
like described.

There's another possibility as well - that the queueing by the drive
generates a worse issue IO pattern, and that is why the performance
drops. Did you check with blktrace what the generated IO looks like?

> Both raptors and 7200.11 perform noticeably better on random workload
> with NCQ enabled. So, it's about time to update IO schedulers
> accordingly, it seems.

Definitely. Again, microbenchmarks are able to show 30-40% improvements
when I last tested. That's a pure random workload though, again not
something that you would see in real life.

I tend to always run with a depth around 4 here. It seems to be a good
value, you get some benefits from NCQ but you don't allow the drive
firmware to screw you over.

--
Jens Axboe
Re: Bug: get EXT3-fs error Allocating block in system zone
On Mon, Dec 10 2007, Marco Gatti wrote: Jens Axboe schrieb: Hello Jens, Thanks for the help. I just applied the patch. Unfortunately it doesn't work. Can you try and additionally boot with iommu=off as a boot parameter? Yes. This is the end of getting any sata devices. See screenshots for errors. It continued until ata4. At the end no root device was found. Hmm, even though the address is set to 0x we still seem to receive requests outside that range. Lets assume it's the scsi logic, can you test this? IOW, patch + iommu=off + this patch. I probably won't see any more mails tonight, we can continue this tomorrow (or someone else can step in, whatever comes first :-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 2faced6..769ce3a 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1572,7 +1572,9 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost, #endif blk_queue_max_sectors(q, shost->max_sectors); +#if 0 blk_queue_bounce_limit(q, scsi_calculate_bounce_limit(shost)); +#else blk_queue_segment_boundary(q, shost->dma_boundary); if (!shost->use_clustering) I applied the patch. Got Hunk #1 succeeded at 1562 with fuzz 2 (offset -10 lines). It didn't compile completely. drivers/scsi/scsi_lib.c:1565:1: error: unterminated #else make[2]: *** [drivers/scsi/scsi_lib.o] Error 1 make[1]: *** [drivers/scsi] Error 2 make: *** [drivers] Error 2 Doh sorry, that #else wants to be an #endif -- Jens Axboe
Re: Bug: get EXT3-fs error Allocating block in system zone
On Sun, Dec 09 2007, Linus Torvalds wrote: On Sun, 9 Dec 2007, Robert Hancock wrote: The obvious suspect with a filesystem problem would be the disk controller driver, AHCI here. However, the controller appears to set the flag to indicate that it supports 64-bit DMA, so it should be fine, unless it lies of course (which we know that the ATI SB600 chipset does, but I don't believe Intel is known to). Could still be a DMA mapping bug that only shows up when the IOMMU is used. However, AHCI is a pretty well tested driver.. AHCI is a pretty well tested driver, but 99%+ of all testers still tend to have less than 4GB of memory. So I do *not* believe that the highmem bits are all that well tested at all. Can somebody who knows the driver send Marco a test-patch to just limit DMA to the low 32 bits, and then Marco can at least verify that yes, that was it. While it looks like DMA problems, there could obviously be some other subtle issue with big-memory machines (ie the PCI allocations etc tend to change too!) Was just thinking that, this should do the trick. If this works, then we can look at whether this is a hardware or iommu or block bouncing (unlikely, would affect more people) bug. diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c index 4688dbf..cad3cbc 100644 --- a/drivers/ata/ahci.c +++ b/drivers/ata/ahci.c @@ -623,6 +623,9 @@ static void ahci_save_initial_config(struct pci_dev *pdev, hpriv->saved_cap = cap = readl(mmio + HOST_CAP); hpriv->saved_port_map = port_map = readl(mmio + HOST_PORTS_IMPL); + hpriv->saved_cap &= ~HOST_CAP_64; + cap &= ~HOST_CAP_64; + /* some chips have errata preventing 64bit use */ if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) { dev_printk(KERN_INFO, &pdev->dev, -- Jens Axboe
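The whole effect of the test-patch is to clear the 64-bit DMA capability bit before the rest of the driver reads it, which forces ahci to fall back to a 32-bit DMA mask. A minimal sketch of that bit manipulation (HOST_CAP_64 really is bit 31, S64A, in the AHCI CAP register; the helper name here is made up):

```c
#include <assert.h>
#include <stdint.h>

#define HOST_CAP_64 (1u << 31)  /* S64A: controller supports 64-bit addressing */

/* Hypothetical helper: mask out the 64-bit capability so that probing
 * code later sees a controller that must use 32-bit DMA addresses. */
static uint32_t force_32bit_dma(uint32_t cap)
{
    return cap & ~HOST_CAP_64;
}
```

If the corruption disappears with the bit masked, the bug is somewhere in the >4GB (highmem/IOMMU) path rather than in the filesystem.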
Re: Bug: get EXT3-fs error Allocating block in system zone
On Sun, Dec 09 2007, Marco Gatti wrote: Jens Axboe schrieb: Was just thinking that, this should do the trick. If this works, then we can look at whether this is a hardware or iommu or block bouncing (unlikely, would affect more people) bug. diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c index 4688dbf..cad3cbc 100644 --- a/drivers/ata/ahci.c +++ b/drivers/ata/ahci.c @@ -623,6 +623,9 @@ static void ahci_save_initial_config(struct pci_dev *pdev, hpriv->saved_cap = cap = readl(mmio + HOST_CAP); hpriv->saved_port_map = port_map = readl(mmio + HOST_PORTS_IMPL); + hpriv->saved_cap &= ~HOST_CAP_64; + cap &= ~HOST_CAP_64; + /* some chips have errata preventing 64bit use */ if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) { dev_printk(KERN_INFO, &pdev->dev, Hello Jens, Thanks for the help. I just applied the patch. Unfortunately it doesn't work. Can you try and additionally boot with iommu=off as a boot parameter? -- Jens Axboe
Re: Bug: get EXT3-fs error Allocating block in system zone
On Sun, Dec 09 2007, Marco Gatti wrote: Jens Axboe schrieb: On Sun, Dec 09 2007, Marco Gatti wrote: Jens Axboe schrieb: Was just thinking that, this should do the trick. If this works, then we can look at whether this is a hardware or iommu or block bouncing (unlikely, would affect more people) bug. diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c index 4688dbf..cad3cbc 100644 --- a/drivers/ata/ahci.c +++ b/drivers/ata/ahci.c @@ -623,6 +623,9 @@ static void ahci_save_initial_config(struct pci_dev *pdev, hpriv->saved_cap = cap = readl(mmio + HOST_CAP); hpriv->saved_port_map = port_map = readl(mmio + HOST_PORTS_IMPL); + hpriv->saved_cap &= ~HOST_CAP_64; + cap &= ~HOST_CAP_64; + /* some chips have errata preventing 64bit use */ if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) { dev_printk(KERN_INFO, &pdev->dev, Hello Jens, Thanks for the help. I just applied the patch. Unfortunately it doesn't work. Can you try and additionally boot with iommu=off as a boot parameter? Yes. This is the end of getting any sata devices. See screenshots for errors. It continued until ata4. At the end no root device was found. Hmm, even though the address is set to 0x we still seem to receive requests outside that range. Lets assume it's the scsi logic, can you test this? IOW, patch + iommu=off + this patch.
I probably won't see any more mails tonight, we can continue this tomorrow (or someone else can step in, whatever comes first :-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 2faced6..769ce3a 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1572,7 +1572,9 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost, #endif blk_queue_max_sectors(q, shost->max_sectors); +#if 0 blk_queue_bounce_limit(q, scsi_calculate_bounce_limit(shost)); +#else blk_queue_segment_boundary(q, shost->dma_boundary); if (!shost->use_clustering) -- Jens Axboe
Re: [PATCH 01/28] blk_end_request: add new request completion interface (take 3)
On Tue, Dec 04 2007, Kiyoshi Ueda wrote: Hi Boaz and Jens, On Tue, 04 Dec 2007 15:56:32 +0200, Boaz Harrosh [EMAIL PROTECTED] wrote: +/** + * blk_end_request - Helper function for drivers to complete the request. + * @rq: the request being processed + * @uptodate: 1 for success, 0 for I/O error, < 0 for specific error + * @nr_bytes: number of bytes to complete + * + * Description: + * Ends I/O on a number of bytes attached to @rq. + * If @rq has leftover, sets it up for the next range of segments. + * + * Return: + * 0 - we are done with this request + * 1 - still buffers pending for this request + **/ +int blk_end_request(struct request *rq, int uptodate, int nr_bytes) I always hated that uptodate boolean with possible negative error value. I guess it was done for backward compatibility with the then users of end_that_request_first(). But since you are introducing a new API then this is not the case. Just have a regular status int where 0 means ALL_OK and a negative value means error code. Just my $0.02. Thank you for the comment. I think it's quite reasonable. By doing that, we don't need end_io_error() anymore. Jens, what do you think? I agree, the uptodate usage right now is horrible. If you agree with the interface change above, I would prefer to separate the patch-set from the blk_end_request patch-set like below: o blk_end_request: remove end_that_request_* o change interface of 'uptodate' in blk_end_request to 'error' It makes the purpose of the blk_end_request patch-set clear (and also, each patch of device drivers could be smaller). But, it doubles your merging work. So if you would like to get the changes at once, I'll merge them into the blk_end_request patch-set. Twice the merging is not an issue for me. As for the patch inclusion, do you push the driver changes to Linus all at once? Or should I ask each maintainer to take the patch? Lets just try to get as many maintainer acks as possible, since the patches need to go in together.
-- Jens Axboe
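The convention Boaz objects to overloads one int: positive means success, zero means a generic I/O error, and negative means a specific errno. A sketch of the translation the proposed 'error' interface would make unnecessary (the helper name is illustrative, not the kernel's):

```c
#include <assert.h>
#include <errno.h>

/* Map the legacy 'uptodate' convention (1 ok, 0 generic failure,
 * <0 specific -errno) onto the proposed plain error convention
 * (0 ok, <0 -errno). */
static int uptodate_to_error(int uptodate)
{
    if (uptodate > 0)
        return 0;          /* success */
    if (uptodate == 0)
        return -EIO;       /* unspecified I/O error */
    return uptodate;       /* already a -errno value */
}
```

With the new convention, callers pass -errno (or 0) directly and this mapping, along with end_io_error(), disappears.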
Re: [PATCH 00/28] blk_end_request: full I/O completion handler (take 3)
On Fri, Nov 30 2007, Kiyoshi Ueda wrote: Hello Jens, The following is the updated patch-set for blk_end_request(). Changes since the last version are only minor updates to catch up with the base kernel changes. Do you agree the implementation of blk_end_request()? If there's no problem, could you merge it to your tree? Or does it have to be merged to -mm tree first? Looks good to me now, I'll queue it up. Thanks! -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] New Kernel Bugs
On Tue, Nov 13 2007, Andrew Morton wrote: I/O STORAGE=== kernel bug from pktcdvd http://bugzilla.kernel.org/show_bug.cgi?id=9294 Kernel: 2.6.23 I think we might have fixed this. It's fixed and merged, I just forgot to close the bugzilla. Did so now. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Suspend to ram regression (2.6.24-rc1-git)
On Fri, Nov 02 2007, Kristen Carlson Accardi wrote: On Thu, 1 Nov 2007 09:41:46 +0100 Jens Axboe [EMAIL PROTECTED] wrote: On Wed, Oct 31 2007, Jens Axboe wrote: Hi, My x60 stopped suspending about two days ago. It just freezes after printing Suspending console(s) where it would normally turn everything off and the 'moon' light would go on. Posting this message in case somebody else knows what is up, if not I'll do a bisect on it tomorrow. Did the bisect, it points to this commit: 31556594f913fa81d008cecfe46d7211c919a853 is first bad commit commit 31556594f913fa81d008cecfe46d7211c919a853 Author: Kristen Carlson Accardi [EMAIL PROTECTED] Date: Thu Oct 25 01:33:26 2007 -0400 [libata] AHCI: add hw link power management support Booting any kernel after this commit fails suspending to ram, it just sits there forever. -- Jens Axboe Does this patch fix your problem? It seems to get hung up while disabling DIPM, and after thinking about this a bit, I don't think we really need to do this anyway. Yep, works for me! Can we get this expedited upstream? -- Jens Axboe
Re: Suspend to ram regression (2.6.24-rc1-git)
On Wed, Oct 31 2007, Jens Axboe wrote: Hi, My x60 stopped suspending about two days ago. It just freezes after printing Suspending console(s) where it would normally turn everything off and the 'moon' light would go on. Posting this message in case somebody else knows what is up, if not I'll do a bisect on it tomorrow. Did the bisect, it points to this commit: 31556594f913fa81d008cecfe46d7211c919a853 is first bad commit commit 31556594f913fa81d008cecfe46d7211c919a853 Author: Kristen Carlson Accardi [EMAIL PROTECTED] Date: Thu Oct 25 01:33:26 2007 -0400 [libata] AHCI: add hw link power management support Booting any kernel after this commit fails suspending to ram, it just sits there forever. -- Jens Axboe
Re: Suspend to ram regression (2.6.24-rc1-git)
On Thu, Nov 01 2007, Jens Axboe wrote: On Wed, Oct 31 2007, Jens Axboe wrote: Hi, My x60 stopped suspending about two days ago. It just freezes after printing Suspending console(s) where it would normally turn everything off and the 'moon' light would go on. Posting this message in case somebody else knows what is up, if not I'll do a bisect on it tomorrow. Did the bisect, it points to this commit: 31556594f913fa81d008cecfe46d7211c919a853 is first bad commit commit 31556594f913fa81d008cecfe46d7211c919a853 Author: Kristen Carlson Accardi [EMAIL PROTECTED] Date: Thu Oct 25 01:33:26 2007 -0400 [libata] AHCI: add hw link power management support Booting any kernel after this commit fails suspending to ram, it just sits there forever. Reverting just the default AHCI flags makes it work again. IOW, with the below patch I can suspend properly with current -git. diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c index ed9b407..77f7631 100644 --- a/drivers/ata/ahci.c +++ b/drivers/ata/ahci.c @@ -190,8 +190,7 @@ enum { AHCI_FLAG_COMMON = ATA_FLAG_SATA | ATA_FLAG_NO_LEGACY | ATA_FLAG_MMIO | ATA_FLAG_PIO_DMA | - ATA_FLAG_ACPI_SATA | ATA_FLAG_AN | - ATA_FLAG_IPM, + ATA_FLAG_ACPI_SATA | ATA_FLAG_AN, AHCI_LFLAG_COMMON = ATA_LFLAG_SKIP_D2H_BSY, }; -- Jens Axboe
Re: Suspend to ram regression (2.6.24-rc1-git)
On Thu, Nov 01 2007, Jens Axboe wrote: On Thu, Nov 01 2007, Jens Axboe wrote: On Wed, Oct 31 2007, Jens Axboe wrote: Hi, My x60 stopped suspending about two days ago. It just freezes after printing Suspending console(s) where it would normally turn everything off and the 'moon' light would go on. Posting this message in case somebody else knows what is up, if not I'll do a bisect on it tomorrow. Did the bisect, it points to this commit: 31556594f913fa81d008cecfe46d7211c919a853 is first bad commit commit 31556594f913fa81d008cecfe46d7211c919a853 Author: Kristen Carlson Accardi [EMAIL PROTECTED] Date: Thu Oct 25 01:33:26 2007 -0400 [libata] AHCI: add hw link power management support Booting any kernel after this commit fails suspending to ram, it just sits there forever. Reverting just the default AHCI flags makes it work again. IOW, with the below patch I can suspend properly with current -git. diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c index ed9b407..77f7631 100644 --- a/drivers/ata/ahci.c +++ b/drivers/ata/ahci.c @@ -190,8 +190,7 @@ enum { AHCI_FLAG_COMMON = ATA_FLAG_SATA | ATA_FLAG_NO_LEGACY | ATA_FLAG_MMIO | ATA_FLAG_PIO_DMA | - ATA_FLAG_ACPI_SATA | ATA_FLAG_AN | - ATA_FLAG_IPM, + ATA_FLAG_ACPI_SATA | ATA_FLAG_AN, AHCI_LFLAG_COMMON = ATA_LFLAG_SKIP_D2H_BSY, }; I should also mention that just stubbing out ahci_enable_alpm() and ahci_disable_alpm() in ahci.c is NOT enough to make it work. -- Jens Axboe
Re: Suspend to ram regression (2.6.24-rc1-git)
On Thu, Nov 01 2007, Jeff Garzik wrote: Jens Axboe wrote: Reverting just the default AHCI flags makes it work again. IOW, with the below patch I can suspend properly with current -git. diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c index ed9b407..77f7631 100644 --- a/drivers/ata/ahci.c +++ b/drivers/ata/ahci.c @@ -190,8 +190,7 @@ enum { AHCI_FLAG_COMMON = ATA_FLAG_SATA | ATA_FLAG_NO_LEGACY | ATA_FLAG_MMIO | ATA_FLAG_PIO_DMA | - ATA_FLAG_ACPI_SATA | ATA_FLAG_AN | - ATA_FLAG_IPM, + ATA_FLAG_ACPI_SATA | ATA_FLAG_AN, AHCI_LFLAG_COMMON = ATA_LFLAG_SKIP_D2H_BSY, sounds like the easy thing to do, in light of this breakage, might be to default it to off, and add a module parameter turning it on by setting that flag. Wouldn't it be better to just get this bug fixed? IOW, is there a reason for disabling ALPM if it's bug free? I'd suggest committing the patch disabling IPM, then Kristen can debug the thing in peace and quiet. Once confident it works with ahci again, we can revert that commit. -- Jens Axboe
Re: Fix ATAPI transfer lengths causes CD writing regression
On Wed, Oct 31 2007, Alan Cox wrote: On Tue, 30 Oct 2007 19:21:29 + Daniel Drake [EMAIL PROTECTED] wrote: Alan Cox wrote: I would guess Brasero is issuing a command with the length of data wrongly set. In the old code that might well just produce errors of the "Umm wtf is this data left over for?" kind; with the new code the drive is likely to change state as it knows the transfer size, and that will *correctly* cause an HSM error and what follows. Now the question is who gets the length wrong - Brasero or the ata translation code in libata. Brasero does exactly the same as my test app which I attached to my last mail. Is my test app wrong? Would need to double check the SCSI specifications to be sure but I think you are asking for less data than the drive wishes to provide. You aren't allowed to do that with ATA. ide-cd handles this by throwing the excess away, which I think is the sane way to do this. -- Jens Axboe
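The ide-cd behavior Jens describes, accepting the device's larger transfer and throwing the tail away, boils down to computing how many excess bytes must be read and discarded. A toy sketch (the function name is hypothetical, not from ide-cd):

```c
#include <assert.h>

/* If the device insists on transferring more than the caller asked
 * for, the difference has to be drained (read and discarded) rather
 * than erroring out mid-transfer. */
static int bytes_to_drain(int device_len, int user_len)
{
    return (device_len > user_len) ? device_len - user_len : 0;
}
```

This is straightforward for PIO; as the follow-up messages note, the DMA case is the hard part, since the scatter/gather list has already fixed the transfer size.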
Re: Fix ATAPI transfer lengths causes CD writing regression
On Wed, Oct 31 2007, Jeff Garzik wrote: Jens Axboe wrote: On Wed, Oct 31 2007, Alan Cox wrote: On Tue, 30 Oct 2007 19:21:29 + Daniel Drake [EMAIL PROTECTED] wrote: Alan Cox wrote: I would guess Brasero is issuing a command with the length of data wrongly set. In the old code that might well just produce errors of the "Umm wtf is this data left over for?" kind; with the new code the drive is likely to change state as it knows the transfer size, and that will *correctly* cause an HSM error and what follows. Now the question is who gets the length wrong - Brasero or the ata translation code in libata. Brasero does exactly the same as my test app which I attached to my last mail. Is my test app wrong? Would need to double check the SCSI specifications to be sure but I think you are asking for less data than the drive wishes to provide. You aren't allowed to do that with ATA. ide-cd handles this by throwing the excess away, which I think is the sane way to do this. That's easy for the PIO case. But CD writing is normally DMA, which means you will get a DMA engine exception if the device wants to give you more data than the scatter/gather entries permit. Right, that's of course problematic... There has to be a way to recover from that situation though, or you can't export any user command issue facility. -- Jens Axboe
Re: Fix ATAPI transfer lengths causes CD writing regression
On Wed, Oct 31 2007, Jeff Garzik wrote: Jens Axboe wrote: Right, that's of course problematic... There has to be a way to recover that situation though, or you can't export any user command issue facility. You cannot hope to handle all possible effects arising from an app providing an invalid sg header / cdb. Once you start talking recovery you are already screwed: we are talking about low-level hardware commands that are passed straight to the hardware. It is trivial to lock up hardware, brick hardware, and corrupt data at that level. If this is NOT a privileged app, we must update the command validation to ensure that invalid commands are not transported to the hardware. If this is a privileged app, our work is done. Fix the app. We gave root rope, and he took it. Woaw, back the truck up a bit :-) I'm talking about simple things - like asking for 8 bytes of sense data. Simple mistakes. You cannot possibly check for everything like that in a command filter, it's utterly impossible. I even venture to say that accept anything, clean up afterwards is /impossible/ to implement, in addition to being dangerous. Certainly, that's not what I'm talking about. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ata_exec_internal crash at boot in -git22
On Mon, Oct 22 2007, Andi Kleen wrote: One of the systems tested in autoboot crashes at boot with with -git22. This is a AMD 2 socket Opteron NUMA system. The tester was a little flakey and happened to hit the x86-merge-broke- compilation window, so the last good data point I have is 2.6.23-rc9. Andi, can you test with this patch applied? http://brick.kernel.dk/sg-git.patch -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ata_exec_internal crash at boot in -git22
On Mon, Oct 22 2007, Andi Kleen wrote: On Monday 22 October 2007 20:26:45 Jens Axboe wrote: On Mon, Oct 22 2007, Andi Kleen wrote: One of the systems tested in autoboot crashes at boot with with -git22. This is a AMD 2 socket Opteron NUMA system. The tester was a little flakey and happened to hit the x86-merge-broke- compilation window, so the last good data point I have is 2.6.23-rc9. Andi, can you test with this patch applied? http://brick.kernel.dk/sg-git.patch Sorry was a mistake (cue: there is no 2.6.23-git22 yet). It turned out to be an old broken kernel. Current -git seems to boot. Sorry for the noise. OK, all for the better :-) -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: remove blk_queue_max_phys_segments in libata
On Tue, Oct 16 2007, Jeff Garzik wrote: Linux Kernel Mailing List wrote: Gitweb: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8889e3c129780cdbe15fed3c366ba3aa3026684d Commit: 8889e3c129780cdbe15fed3c366ba3aa3026684d Parent: fd820f405574a30aacf9a859886e173d641f080b Author: FUJITA Tomonori [EMAIL PROTECTED] AuthorDate: Tue Sep 18 12:16:45 2007 +0200 Committer: Jens Axboe [EMAIL PROTECTED] CommitDate: Tue Oct 16 11:24:44 2007 +0200 remove blk_queue_max_phys_segments in libata LIBATA_MAX_PRD is the maximum number of DMA scatter/gather elements permitted by the HBA's DMA engine. It's properly set to q->max_hw_segments via the sg_tablesize parameter. libata shouldn't call blk_queue_max_phys_segments. Now LIBATA_MAX_PRD is equal to SCSI_MAX_PHYS_SEGMENTS by default (both are 128), so everything is fine. But if they are changed, some code (like the scsi mid layer, sg chaining, etc) might not work properly. (Addition from Jens) The basic issue is that the physical segment setting is purely a driver issue. And since SCSI is managing the sglist, libata has no business changing the setting. All libata should care about is the hw segment setting. Signed-off-by: FUJITA Tomonori [EMAIL PROTECTED] Signed-off-by: Jens Axboe [EMAIL PROTECTED] --- drivers/ata/libata-scsi.c | 2 -- 1 files changed, 0 insertions(+), 2 deletions(-) diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index d63c81e..ba62d53 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -801,8 +801,6 @@ int ata_scsi_slave_config(struct scsi_device *sdev) ata_scsi_sdev_config(sdev); - blk_queue_max_phys_segments(sdev->request_queue, LIBATA_MAX_PRD); - sdev->manage_start_stop = 1; As I noted when this first patch was posted... this line of code existed because the difference between hw-segments and phys-segments was incredibly difficult to discern. The names say basically the same thing to me, and I could find no documentation explaining the difference.
We discussed this change to death afair, so I'm a little surprised you say that. The commit message itself has good reasoning as well. Does such documentation exist? If not, can you please explain the difference? OK, I can give a brief summary. The driver can set two restrictions - hardware and physical. I'm not sure who originally came up with the naming (I think it's pretty terrible), but essentially the 'hardware' restriction is what the hardware can support (OK that name works) and the 'physical' restriction is imposed by the driver. Typically that last restriction has to do with structure limitations, like allocating stuff on the stack or limiting the size of the sgtable you have to allocate. Finally, some notification and coordination would have been helpful. Confused, as I mentioned, we discussed this at quite some length a few months ago. Commit 6c08772e49622e90d39903e7ff0be1a0f463ac86 appears to have been missed, at the very least... or is it actually correct? Depends on what you think it does. The number of hardware segments will never be larger than the physical segments. But it looks like you want to set the hw seg value there. As mentioned in the commit above, since libata doesn't allocate its own sg tables, letting it fiddle with the physical segment limit is dodgy at best. -- Jens Axboe
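The distinction Jens summarizes means a request queue effectively honors both limits at once, so the usable scatter/gather size for any request is the smaller of the two. A sketch of that relationship (not the block layer's actual code; the function name is made up):

```c
#include <assert.h>

/* 'hw' models what the HBA's DMA engine can take (set via
 * blk_queue_max_hw_segments / sg_tablesize), 'phys' models what the
 * driver's own structures can hold (blk_queue_max_phys_segments).
 * A request must respect both limits. */
static unsigned int usable_segments(unsigned int hw, unsigned int phys)
{
    return hw < phys ? hw : phys;
}
```

This also shows why Jens says the hw segment count can never exceed the phys count in practice: the effective limit is always the minimum.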
Re: 2.6.23-rc7-mm1 AHCI ATA errors -- won't boot
On Tue, Sep 25 2007, Berck E. Nash wrote: Jeff Garzik wrote: The first step would be to clone the upstream branch of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git and see if the problem is reproducible there. If yes, then you have narrowed down the problem to something my ATA devel tree has introduced into -mm. Nope, you're off the hook. The libata tree works great, so it must be something else in -mm conflicting. Can you try 2.6.23-rc8 plus this patch: http://brick.kernel.dk/git-block.patch.bz2 and see if that works? -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23-rc7-mm1 AHCI ATA errors -- won't boot
On Tue, Sep 25 2007, Berck E. Nash wrote: Jens Axboe wrote: On Tue, Sep 25 2007, Berck E. Nash wrote: Jeff Garzik wrote: The first step would be to clone the upstream branch of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git and see if the problem is reproducible there. If yes, then you have narrowed down the problem to something my ATA devel tree has introduced into -mm. Nope, you're off the hook. The libata tree works great, so it must be something else in -mm conflicting. Whoops, sorry! I just lied. I'm a git newbie, and failed to actually get the upstream branch the first time, so rc8 is clean, but it fails when I actually pull the upstream branch. I'll git bisect and get back to you. OK, you probably realize this, but you can forget about the git-block testing for now then. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ahci: enable GHC.AE bit before set GHC.HR
On Fri, Sep 21 2007, Alan Cox wrote: On Fri, 21 Sep 2007 12:31:20 +0200 Jens Axboe [EMAIL PROTECTED] wrote: On Fri, Sep 21 2007, Peer Chen wrote: According to the description of sections 5.2.2.1 and 10.1.2 of the AHCI specification rev1_1/rev1_2, GHC.HR shall only be set to '1' by software when GHC.AE is set to '1'. Signed-off-by: Peer Chen [EMAIL PROTECTED] --- --- linux-2.6.23-rc7/drivers/ata/ahci.c.orig 2007-09-20 11:01:55.0 -0400 +++ linux-2.6.23-rc7/drivers/ata/ahci.c 2007-09-20 11:07:31.0 -0400 @@ -834,6 +834,10 @@ static int ahci_reset_controller(struct void __iomem *mmio = host->iomap[AHCI_PCI_BAR]; u32 tmp; + /* turn on AHCI mode before controller reset */ + writel(HOST_AHCI_EN, mmio + HOST_CTL); + (void) readl(mmio + HOST_CTL); /* flush */ + /* global controller reset */ tmp = readl(mmio + HOST_CTL); if ((tmp & HOST_RESET) == 0) { I appreciate the readl() flushes, but in this particular case we end up reading the exact offset again below. The code above is wrong btw - it is an iomap so both the new and existing code should be using ioread* not readl Happy converting :-) -- Jens Axboe
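The spec requirement behind the patch, only set GHC.HR while GHC.AE is already 1, is an ordering constraint on two bits of the same register. A sketch using the real bit positions (HOST_AHCI_EN is bit 31, HOST_RESET is bit 0, matching ahci.h); the checking function itself and its wrapper are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define HOST_RESET    (1u << 0)   /* GHC.HR: reset controller */
#define HOST_AHCI_EN  (1u << 31)  /* GHC.AE: AHCI enabled */

/* Refuse to start a controller reset unless AHCI mode is already on,
 * per AHCI 1.1/1.2 sections 5.2.2.1 and 10.1.2. */
static int start_reset(uint32_t *ghc)
{
    if (!(*ghc & HOST_AHCI_EN))
        return -1;            /* caller must set GHC.AE first */
    *ghc |= HOST_RESET;
    return 0;
}

/* Convenience wrapper for testing: return the register value after a
 * reset attempt on the given initial state. */
static uint32_t after_reset_attempt(uint32_t ghc)
{
    (void)start_reset(&ghc);
    return ghc;
}
```

The actual driver fix writes HOST_AHCI_EN to HOST_CTL (and flushes the write) before touching HOST_RESET; the sketch only captures the ordering rule.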
Re: [PATCH] ide-cd is unmaintained
On Thu, Sep 20 2007, Alan Cox wrote: I simply don't have any old IDE systems any more or time to really look after this. Nobody responded to the previous linux-ide mail about maintainers so... Signed-off-by: Alan Cox [EMAIL PROTECTED] Just fold it in with the IDE system in general? Would make the most sense, I think. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23-rc4-mm1
On Mon, Sep 10 2007, Torsten Kaiser wrote: On 9/10/07, FUJITA Tomonori [EMAIL PROTECTED] wrote: On Mon, 10 Sep 2007 12:20:38 -0700 Andrew Morton [EMAIL PROTECTED] wrote: On Mon, 10 Sep 2007 20:59:49 +0200 Torsten Kaiser [EMAIL PROTECTED] wrote: The system boots, reads the partition tables, starts the RAID and then kicks one drive out because of errors. Andy is using qla1280. You're using sata. So it's probably a different bug, with the same symptoms. This might be a sg chaining bug too (probably the sg chaining libata patch). Can you try the following patch that I've just sent: http://lkml.org/lkml/2007/9/10/251 The patch also disables chaining sg lists for libata. With this patch 2.6.23-rc4-mm1 works for me. Mainline 2.6.23-rc5-git1 also works without needing any patches. OK, thanks for testing that. I'll merge Tomo's patch so that we can selectively enable drivers when we KNOW they work, instead of trying to do this (massive) operation wholesale. -- Jens Axboe
Re: [-mm patch] remove ide_get_error_location()
On Tue, Sep 11 2007, Bartlomiej Zolnierkiewicz wrote: On Sunday 09 September 2007, Adrian Bunk wrote: On Fri, Aug 31, 2007 at 09:58:22PM -0700, Andrew Morton wrote: ... Changes since 2.6.23-rc3-mm1: ... git-block.patch ... git trees ... ide_get_error_location() is no longer used. Signed-off-by: Adrian Bunk [EMAIL PROTECTED] Signed-off-by: Bartlomiej Zolnierkiewicz [EMAIL PROTECTED] Since git-block contains the patch which removes the only user of ide_get_error_location(), I think that this patch should also be merged through the block tree. Jens? Yeah, I'll add it there. PS none of the blkdev_issue_flush() users uses the *error_sector argument so it can probably be removed as well I had hoped that its existence was enough incentive, but it didn't happen. I'll make a note to kill that again. -- Jens Axboe
Re: [PATCH 6/7] blk_end_request: remove/unexport end_that_request_*
On Tue, Sep 04 2007, Halevy, Benny wrote: Boaz raised my attention to this patchset today... We suspect we'll still need the extern entry points for handling the bidi request in the scsi_io_completion() path, as we only want to call end_that_request_chunk on req->next_rq and never end_that_request_last. (see http://www.bhalevy.com/open-osd/download/linux-2.6.23-rc2_and_iscsi-iscsi-2007_08_09/0005-SCSI-bidi-support.patch) If this is ok with you I'd leave these entry points in place rather than taking them out and putting them back in later. There's no point in leaving them in when nothing current needs them; I'd much rather add them back should the need arise. That's the proper way to handle things like this.
Re: [PATCH] libata: kill sector_buf temporary buffer
On Thu, Aug 16 2007, Jeff Garzik wrote: Rather than carrying around this buffer all the time, for rare circumstances, it seems that we can easily alloc/free a temp buffer as needed. Saves a big chunk of per-port memory. I forget the justification for why it was in ata_port to begin with, but the code's uses don't seem to justify any need.

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 735f74b..5e9049f 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -3679,20 +3679,29 @@ static int ata_dev_same_device(struct ata_device *dev, unsigned int new_class,
 int ata_dev_reread_id(struct ata_device *dev, unsigned int readid_flags)
 {
 	unsigned int class = dev->class;
-	u16 *id = (void *)dev->link->ap->sector_buf;
-	int rc;
+	u16 *id;
+	int rc = 0;
+
+	id = kzalloc(ATA_SECT_SIZE, GFP_KERNEL);
+	if (!id)
+		return -ENOMEM;
 
 	/* read ID data */
 	rc = ata_dev_read_id(dev, class, readid_flags, id);
 	if (rc)
-		return rc;
+		goto out;
 
 	/* is the device still there? */
-	if (!ata_dev_same_device(dev, class, id))
-		return -ENODEV;
+	if (!ata_dev_same_device(dev, class, id)) {
+		rc = -ENODEV;
+		goto out;
+	}
 
 	memcpy(dev->id, id, sizeof(id[0]) * ATA_ID_WORDS);
-	return 0;
+
+out:
+	kfree(id);
+	return rc;
 }

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 2ddc2ed..fc731e2 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -1133,14 +1133,20 @@ static unsigned int ata_read_log_page(struct ata_device *dev,
 static int ata_eh_read_log_10h(struct ata_device *dev,
 			       int *tag, struct ata_taskfile *tf)
 {
-	u8 *buf = dev->link->ap->sector_buf;
+	u8 *buf;
 	unsigned int err_mask;
 	u8 csum;
-	int i;
+	int i, rc = 0;
+
+	buf = kzalloc(ATA_SECT_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;

How can this work? If we're issuing this log page read, we cannot do any IO to the device, so this allocation must not generate any IO. And if we fail in allocating memory and just return, the device won't talk to us in the future.
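The refactor above replaces the early returns with a single exit label so the temporary buffer is freed on every path. That shape can be modeled in plain userspace C (a sketch with made-up names; malloc stands in for kzalloc, and the error values are hard-coded stand-ins for -ENOMEM/-EIO):

```c
#include <stdlib.h>
#include <string.h>

#define SECT_SIZE 512
#define ID_WORDS  256

/* Model of the patch's single-exit alloc/use/free shape: allocate a
 * temporary buffer, and leave only through the "out" label so the
 * buffer is freed on every path, success or failure. */
static int reread_id_model(unsigned short *dest, int simulate_read_error)
{
	unsigned short *id;
	int rc = 0;

	id = calloc(1, SECT_SIZE);
	if (!id)
		return -12;               /* stand-in for -ENOMEM */

	if (simulate_read_error) {        /* stand-in for ata_dev_read_id() failing */
		rc = -5;                  /* stand-in for -EIO */
		goto out;
	}

	id[0] = 0xabcd;                   /* pretend the device answered */
	memcpy(dest, id, sizeof(id[0]) * ID_WORDS);
out:
	free(id);
	return rc;
}
```

Jens' objection is orthogonal to this shape: in the error-handling path the allocation itself must not recurse into IO, which a plain kzalloc(GFP_KERNEL) does not guarantee, so a preallocated buffer (or an allocation mode that avoids IO) is needed there regardless of how the cleanup is structured.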
Re: libata git tree, mbox queue status and contents
On Fri, Aug 03 2007, Jeff Garzik wrote: * Kristen: ALPM patches. We definitely want them, as they save a ton of power. The problem with ALPM, as I see it, is that it is way too aggressive. It really needs to be combined with a timer to be useful; it's a huge shame that it doesn't come equipped with a timeout setting in hardware. Lacking that, we could punt to using a second-aligned timer that just checks for activity in the last second, and if none was seen, then enable ALPM. That should have absolutely minimal impact on CPU consumption. Likewise, when we see IO and the rate per second goes beyond a low threshold, disable ALPM again. In my testing on this notebook (x60), throughput was reduced to about 30% when using ALPM. So while it does save a good amount of power, it also makes the disk a slow dog if you are actually using it.
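The one-second activity check Jens sketches reduces to a small decision function (illustrative only; the names and the threshold value are invented for the sketch, not taken from any driver):

```c
/* Decide the ALPM state for the next second from the IO count
 * observed in the last second: enable power saving only after a
 * fully idle second, disable it as soon as the IO rate passes a
 * low threshold, and otherwise keep the current state. */
enum alpm_state { ALPM_OFF, ALPM_ON };

#define ALPM_DISABLE_THRESHOLD 8   /* IOs per second; arbitrary for this sketch */

static enum alpm_state alpm_next_state(enum alpm_state cur, unsigned ios_last_sec)
{
	if (ios_last_sec == 0)
		return ALPM_ON;
	if (ios_last_sec > ALPM_DISABLE_THRESHOLD)
		return ALPM_OFF;
	return cur;
}
```

A single periodic tick evaluating this function per port is what keeps the CPU cost minimal, versus arming a timer per request.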
Re: Some NCQ numbers...
On Wed, Jul 04 2007, James Bottomley wrote: On Wed, 2007-07-04 at 10:19 +0900, Tejun Heo wrote: Michael Tokarev wrote: Well. It looks like the results do not depend on the elevator. Originally I tried with deadline, and just re-ran the test with noop (hence the long delay with the answer) - changing the linux elevator changes almost nothing in the results - modulo some random fluctuations. I see. Thanks for testing. In any case, NCQ - at least in this drive - just does not work. Linux with its I/O elevator may help to speed things up a bit, but the disk does nothing in this area. NCQ doesn't slow things down either - it just does not work. The same's true for ST3250620NS enterprise drives. By the way, Seagate announced the Barracuda ES 2 series (in the range 500..1200Gb if memory serves) - maybe with those, NCQ will work better? No one would know without testing. Or maybe it's libata which does not implement NCQ properly? (As I showed before, with almost all ol' good SCSI drives TCQ helps a lot - up to 2x the difference and more - with multiple I/O threads) Well, what the driver does is minimal. It just passes all the commands through to the hard drive. After all, NCQ/TCQ gives the hard drive more responsibility regarding request scheduling. Actually, in many ways the results support a theory of SCSI TCQ Jens used when designing the block layer. The original TCQ theory held that the drive could make much better head scheduling decisions than the operating system, so you just used TCQ to pass all the outstanding I/O unfiltered down to the drive to let it schedule. However, the I/O results always seemed to indicate that the effect of TCQ was negligible at around 4 outstanding commands, leading to the second theory that all TCQ was good for was saturating the transport, and making scheduling decisions was, indeed, better left to the OS (hence all our I/O schedulers). Indeed, the above I still find to be true.
The only real case where larger depths make a real difference is a pure random read (or write, with write caching off) workload. And those situations are largely synthetic, hence benchmarks tend to show NCQ being a lot more beneficial, since they construct workloads that consist 100% of random IO. Real life is rarely so black and white. Additionally, there are cases where drive queue depths hurt a lot. The drive has no knowledge of fairness, or process-to-io mappings. So AS/CFQ has to artificially limit queue depths for competing IO processes doing semi (or fully) sequential workloads, or throughput plummets. So while NCQ has some benefits, I typically tend to prefer managing the IO queue largely in software instead of punting to (often) buggy firmware.
Re: [patch 3/3] Enable Aggressive Link Power management for AHCI controllers.
On Fri, Jun 22 2007, Kristen Carlson Accardi wrote: On Thu, 21 Jun 2007 15:08:32 +0200 Jens Axboe [EMAIL PROTECTED] wrote: On Wed, Jun 20 2007, Kristen Carlson Accardi wrote: Enable Aggressive Link Power management for AHCI controllers. This patch will set the correct bits to turn on Aggressive Link Power Management (ALPM) for the ahci driver. This will cause the controller and disk to negotiate a lower power state for the link when there is no activity (see the AHCI 1.x spec for details). This feature is mutually exclusive with Hot Plug, so when ALPM is enabled, Hot Plug is disabled. ALPM will be enabled by default, but it is settable via the scsi host sysfs interface. Possible settings for this feature are: Setting Effect -- min_power ALPM is enabled, and link set to enter lowest power state (SLUMBER) when idle. Hot plug not allowed. max_performance ALPM is disabled, Hot Plug is allowed medium_power ALPM is enabled, and link set to enter second lowest power state (PARTIAL) when idle. Hot plug not allowed. Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED] A suggestion (it comes with a patch!) - default to max_performance/ALPM off, not min_power. For two reasons: - There's such a big performance difference between the two, you really want max_performance when booting. - It's a lot better to default to no change than to default to enabling something new. Sounds reasonable to me. Distros/users can decide if they want to have scripts that enable this after boot to run at min_power. Exactly, it needs to be handled by some power management daemon anyway and be integrated with power savings in general. You could use IO load to determine when to enable/disable ALPM, for instance. Acked-by: Kristen Carlson Accardi [EMAIL PROTECTED] Will you integrate it into the next posting?
Re: 126X DVD drive? (odd ide-cd output)
On Wed, Jun 20 2007, Cal Peake wrote: Hello, It seems if I boot with a disc in the drive (in this case a dual fs ISO-9660/UDF DVD-R) the drive speed gets reported incorrectly: hdc: DVDRW DRW-6S160P, ATAPI CD/DVD-ROM drive hdc: ATAPI 126X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(66) normal output (w/ no disc) is: hdc: DVDRW DRW-6S160P, ATAPI CD/DVD-ROM drive hdc: ATAPI 48X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(66) Kernel is 2.6.22-rc5-gf1518a08. Harmless, but odd... Sounds like a firmware bug. From a drive with a 'DVDRW' vendor string, that's not highly surprising :-)
Re: [patch 3/3] Enable Aggressive Link Power management for AHCI controllers.
On Wed, Jun 20 2007, Kristen Carlson Accardi wrote: Enable Aggressive Link Power management for AHCI controllers. This patch will set the correct bits to turn on Aggressive Link Power Management (ALPM) for the ahci driver. This will cause the controller and disk to negotiate a lower power state for the link when there is no activity (see the AHCI 1.x spec for details). This feature is mutually exclusive with Hot Plug, so when ALPM is enabled, Hot Plug is disabled. ALPM will be enabled by default, but it is settable via the scsi host sysfs interface. Possible settings for this feature are: Setting Effect -- min_power ALPM is enabled, and link set to enter lowest power state (SLUMBER) when idle. Hot plug not allowed. max_performance ALPM is disabled, Hot Plug is allowed medium_power ALPM is enabled, and link set to enter second lowest power state (PARTIAL) when idle. Hot plug not allowed. Signed-off-by: Kristen Carlson Accardi [EMAIL PROTECTED] A suggestion (it comes with a patch!) - default to max_performance/ALPM off, not min_power. For two reasons: - There's such a big performance difference between the two, you really want max_performance when booting. - It's a lot better to default to no change than to default to enabling something new.
diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 841cf0a..e7a2072 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -786,8 +786,7 @@ static int ahci_disable_alpm(struct ata_port *ap)
 	return 0;
 }
 
-static int ahci_enable_alpm(struct ata_port *ap,
-	enum scsi_host_link_pm policy)
+static int ahci_enable_alpm(struct ata_port *ap, enum scsi_host_link_pm policy)
 {
 	struct ahci_host_priv *hpriv = ap->host->private_data;
 	void __iomem *port_mmio = ahci_port_base(ap);
@@ -808,19 +807,19 @@ static int ahci_enable_alpm(struct ata_port *ap,
 		return -EINVAL;
 	}
 
-	switch(policy) {
+	switch (policy) {
 	case SHOST_MAX_PERFORMANCE:
-		ahci_disable_alpm(ap);
-		ap->pm_policy = policy;
-		return 0;
 	case SHOST_NOT_AVAILABLE:
-	case SHOST_MIN_POWER:
 		/*
 		 * if we came here with SHOST_NOT_AVAILABLE,
 		 * it just means this is the first time we
-		 * have tried to enable - so try to do
-		 * min_power
+		 * have tried to enable - default to max performance,
+		 * and let the user go to lower power modes on request.
 		 */
+		ahci_disable_alpm(ap);
+		ap->pm_policy = SHOST_MAX_PERFORMANCE;
+		return 0;
+	case SHOST_MIN_POWER:
 		ap->pm_policy = SHOST_MIN_POWER;
 		/* configure HBA to enter SLUMBER */
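The behavioural change in the hunk above - treating a first-time SHOST_NOT_AVAILABLE like an explicit max performance request instead of min power - reduces to a small policy mapping (a sketch; the enum constants mirror the patch, but the helper function is invented):

```c
/* Link power management policies, as in the patch above. */
enum link_pm {
	SHOST_NOT_AVAILABLE,    /* no policy chosen yet */
	SHOST_MIN_POWER,
	SHOST_MEDIUM_POWER,
	SHOST_MAX_PERFORMANCE,
};

/* A first enable with no explicit choice defaults to max performance;
 * explicit requests are honoured as-is. This is the default-to-no-change
 * behaviour Jens' patch introduces. */
static enum link_pm effective_policy(enum link_pm requested)
{
	if (requested == SHOST_NOT_AVAILABLE)
		return SHOST_MAX_PERFORMANCE;
	return requested;
}
```

Users (or a power management daemon) then opt in to min_power/medium_power explicitly after boot.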
Re: [patch 0/3] AHCI Link Power Management
On Tue, Jun 12 2007, Tejun Heo wrote: Arjan van de Ven wrote: I'm not sure about this. We need a better PM framework to support power saving in other controllers and some AHCIs don't save much when only link power management is used, do you have data to support this? Yeah, it was some Lenovo notebook. Pavel is more familiar with the hardware. Pavel, what was the notebook which didn't save much power with standard SATA power save but needed the port to be completely turned off? The data we have from this patch is that it saves typically a Watt of power (depends on the machine of course, but the range is 0.5W to 1.5W). If you want to also have an even more aggressive thing where you want to start disabling the entire controller... I don't see how this is in conflict with saving power on the link level by just enabling a hardware feature. Well, both implement about the same thing. I prefer a software implementation because it's more generic, and ALPE/ASP seems too aggressive to me. Here are the reasons why the sw implementation wasn't merged. 1. It didn't have a proper interface with userland. This was mainly because of missing ATA sysfs nodes. I'm not sure whether adding this to the scsi node is a good idea. 2. It was focused on SATA link PS and couldn't cover the Lenovo case. I think we need something at the block layer. I think the hardware method is preferable, actually. Doing this in the block layer would mean keeping track of idle time, and that quickly turns into a lot of timer management. Not exactly free, in terms of CPU usage. I've yet to do some power measurements with this ahci patch, I just noticed that with min_power, performance drops from ~55MB/sec to ~15MB/sec sequential on my drive. That's pretty drastic :-)
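The timer-management cost Jens worries about can be sidestepped by keeping only a last-activity timestamp and evaluating idleness lazily, from an existing periodic tick or on the next IO, instead of arming a timer per request (a userspace sketch with invented names, not any proposed kernel interface):

```c
#include <time.h>

/* Per-link idle tracking: one timestamp, no timers. */
struct link_idle {
	time_t last_io;    /* updated on every IO completion */
};

static void idle_mark_io(struct link_idle *li, time_t now)
{
	li->last_io = now;
}

/* True when no IO has completed for at least idle_secs. Callers check
 * this opportunistically (e.g. from a coarse periodic tick), so the
 * steady-state cost is one comparison rather than timer (re)arming
 * on every request. */
static int idle_long_enough(const struct link_idle *li, time_t now,
			    time_t idle_secs)
{
	return now - li->last_io >= idle_secs;
}
```

The trade-off is latency: idleness is only noticed the next time somebody checks, which is fine for a power-saving decision measured in seconds.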
Re: use_clustering (sht) bit set to 0 in AHCI ?
On Thu, Apr 26 2007, Alok kataria wrote: Hi Jeff, I recently got a new AHCI disk, and was using the AHCI-libata driver to run this. I noticed that the scatter-gather lists that were being built for the I/O on this device were just of PAGE_SIZE length, even though I was doing IO on contiguous pages. Through a little instrumentation I figured out that the use_clustering bit in the ahci_sht (scsi_host_template) is set to zero. Due to this we are not putting consecutive bios into one sg in blk_rq_map_sg. I tried changing the clustering bit to 1, but encountered a panic at the initialization of the disks during the bootup process, and so couldn't entirely get hold of the panic mesg. I was wondering, though the max segment size supported with this driver is 65536, why I was not able to feed bigger SGs to this driver. I searched on the net and the intel AHCI arch doc too but couldn't find anything related to clustering support for AHCI. It would be a great help if you could let me know about the possible problems with clustering on AHCI ? why is it off by default in the driver ? or anything else that might help. It's a good question. If you look at the documentation, it states that ahci supports up to 64k sg entries and each can have a size of up to 4mb (bits 0 through 21). So as far as I can tell, clustering should work with a segment size up to those 4mb. ahci has always had clustering disabled, perhaps Jeff can expand on why?
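What Alok describes - blk_rq_map_sg() folding physically contiguous pages into one scatter-gather segment when clustering is enabled - can be sketched as a plain merge over (address, length) runs, capped by the 65536-byte max segment size he mentions (a hypothetical helper modelling the behaviour, not the kernel code):

```c
#include <stddef.h>

#define PAGE_SZ     4096u
#define MAX_SEG_SZ 65536u   /* the driver's max segment size from the mail */

struct seg { unsigned long addr; unsigned len; };

/* Merge physically contiguous input runs into as few segments as
 * possible, without letting any single segment exceed MAX_SEG_SZ.
 * Returns the number of segments written to out[]; with clustering
 * off, every input run would instead become its own segment. */
static size_t cluster_segments(const struct seg *in, size_t n, struct seg *out)
{
	size_t m = 0;

	for (size_t i = 0; i < n; i++) {
		if (m && out[m - 1].addr + out[m - 1].len == in[i].addr &&
		    out[m - 1].len + in[i].len <= MAX_SEG_SZ)
			out[m - 1].len += in[i].len;   /* cluster with previous */
		else
			out[m++] = in[i];              /* start a new segment */
	}
	return m;
}
```

So with a 64 KB segment cap, sixteen contiguous 4 KB pages collapse into a single segment and the seventeenth starts a new one - which is why PAGE_SIZE-only segments point at clustering being disabled rather than at the segment size limit.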
Re: [PATCH 4/4] bidi support: bidirectional request
On Tue, May 01 2007, Boaz Harrosh wrote: Please consider the attached proposal. It is a complete block-level bidi implementation that is, I hope, a middle ground which will keep everyone happy (including Christoph). It is both quite small and not invasive, yet has a full bidi API that is easy to use and maintain. This isn't much of an improvement imo, if any at all. Why didn't you do the ->next_rq approach I suggested? Your patch still makes struct request considerably fatter (30% here, from 280 to 368 bytes on x86-64 from a quick look) for something that will have relatively few uses. And it still has its paws all over the block layer code. Please just implement the 2nd data phase as a linked request off the first one. I think that approach is both much cleaner from a design perspective, and also much leaner and has zero (well almost, it costs a pointer) impact on the regular read-write paths.
Re: [PATCH 4/4] bidi support: bidirectional request
On Sun, Apr 29 2007, James Bottomley wrote: On Sun, 2007-04-29 at 18:48 +0300, Boaz Harrosh wrote: FUJITA Tomonori wrote: From: Boaz Harrosh [EMAIL PROTECTED] Subject: [PATCH 4/4] bidi support: bidirectional request Date: Sun, 15 Apr 2007 20:33:28 +0300

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 645d24b..16a02ee 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -322,6 +322,7 @@ struct request {
 	void *end_io_data;
 	struct request_io_part uni;
+	struct request_io_part bidi_read;
 };

Would be more straightforward to have: struct request_io_part in; struct request_io_part out; Yes I wish I could do that. For bidi supporting drivers this is the most logical. But for the 99.9% of uni-directional drivers, calling rq_uni(), and being somewhat on the hot paths, this means we will need a pointer to a uni request_io_part. This is bad because: 1st - there is no defined stage in a request's life where to definitely set that pointer, especially in the preparation stages. 2nd - hacks like scsi_error.c/scsi_send_eh_cmnd() will not work at all. Now this is a very bad spot already, and I have a short term fix for it in the SCSI-bidi patches (not sent yet) but a more long term solution is needed. Once such hacks are cleaned up we can do what you say. This is exactly why I use the access functions rq_uni/rq_io/rq_in/rq_out and not open-coded access. I'm still not really convinced about this approach. The primary job of the block layer is to manage and merge READ and WRITE requests. It serves a beautiful secondary function of queueing for arbitrary requests it doesn't understand (REQ_TYPE_BLOCK_PC or REQ_TYPE_SPECIAL ... or indeed any non REQ_TYPE_FS). Bidirectional requests fall into the latter category (there's nothing really we can do to merge them ... they're just transported by the block layer). The only unusual feature is that they carry two bios.
I think the drivers that actually support bidirectional will be a rarity, so it might even be advisable to add it to the queue capability (refuse bidirectional requests at the top rather than perturbing all the drivers to process them). So, what about REQ_TYPE_BIDIRECTIONAL rather than REQ_BIDI? That will remove it from the standard path and put it on the special command type path where we can process it specially. Additionally, if you take this approach, you can probably simply chain the second bio through req->special as an additional request in the stream. The only thing that would then need modification would be the dequeue of the block driver (it would have to dequeue both requests and prepare them) and that needs to be done only for drivers handling bidirectional requests. I agree, I'm really not crazy about shuffling the entire request setup around just for something as exotic as bidirectional commands. How about just keeping it simple - have a second request linked off the first one for the second data phase? So keep it completely separate, not just overload ->special for the 2nd bio list. So basically just add a struct request pointer, so you can do rq = rq->next_rq or something for the next data phase. I bet this would be a LOT less invasive as well, and we can get by with a few helpers to support it. And it should definitely be a request type.
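The shape Jens proposes - a second request hanging off the first for the bidi read phase - costs exactly one pointer in struct request. A minimal model (field and helper names are invented here to mirror the suggestion, and the struct is stripped down to the relevant members):

```c
#include <stddef.h>

/* Stripped-down model of a block request with a linked second data
 * phase; next_rq is NULL for ordinary uni-directional requests, so
 * the regular read/write path pays only for one unused pointer. */
struct request {
	int data_dir;               /* 0 = read, 1 = write */
	unsigned nr_bytes;
	struct request *next_rq;    /* second data phase for bidi, else NULL */
};

static int rq_is_bidi(const struct request *rq)
{
	return rq->next_rq != NULL;
}

/* Walk all data phases of a (possibly bidi) request. */
static unsigned rq_total_phases(const struct request *rq)
{
	unsigned n = 0;

	for (; rq; rq = rq->next_rq)
		n++;
	return n;
}
```

A bidi command is then built as a write request whose next_rq is the read-phase request; drivers that never advertise bidi support simply never see a non-NULL next_rq.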
Re: [PATCH 4/4] bidi support: bidirectional request
On Mon, Apr 30 2007, Benny Halevy wrote: Jens Axboe wrote: On Sun, Apr 29 2007, James Bottomley wrote: On Sun, 2007-04-29 at 18:48 +0300, Boaz Harrosh wrote: FUJITA Tomonori wrote: From: Boaz Harrosh [EMAIL PROTECTED] Subject: [PATCH 4/4] bidi support: bidirectional request Date: Sun, 15 Apr 2007 20:33:28 +0300 diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 645d24b..16a02ee 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -322,6 +322,7 @@ struct request { void *end_io_data; struct request_io_part uni; +struct request_io_part bidi_read; }; Would be more straightforward to have: struct request_io_part in; struct request_io_part out; Yes I wish I could do that. For bidi supporting drivers this is the most logical. But for the 99.9% of uni-directional drivers, calling rq_uni(), and being some what on the hotish paths, this means we will need a pointer to a uni request_io_part. This is bad because: 1st- There is no defined stage in a request life where to definitely set that pointer, specially in the preparation stages. 2nd- hacks like scsi_error.c/scsi_send_eh_cmnd() will not work at all. Now this is a very bad spot already, and I have a short term fix for it in the SCSI-bidi patches (not sent yet) but a more long term solution is needed. Once such hacks are cleaned up we can do what you say. This is exactly why I use the access functions rq_uni/rq_io/rq_in/rq_out and not open code access. I'm still not really convinced about this approach. The primary job of the block layer is to manage and merge READ and WRITE requests. It serves a beautiful secondary function of queueing for arbitrary requests it doesn't understand (REQ_TYPE_BLOCK_PC or REQ_TYPE_SPECIAL ... or indeed any non REQ_TYPE_FS). bidirectional requests fall into the latter category (there's nothing really we can do to merge them ... they're just transported by the block layer). The only unusual feature is that they carry two bios. 
I think the drivers that actually support bidirectional will be a rarity, so it might even be advisable to add it to the queue capability (refuse bidirectional requests at the top rather than perturbing all the drivers to process them). So, what about REQ_TYPE_BIDIRECTIONAL rather than REQ_BIDI? That will remove it from the standard path and put it on the special command type path where we can process it specially. Additionally, if you take this approach, you can probably simply chain the second bio through req-special as an additional request in the stream. The only thing that would then need modification would be the dequeue of the block driver (it would have to dequeue both requests and prepare them) and that needs to be done only for drivers handling bidirectional requests. I agree, I'm really not crazy about shuffling the entire request setup around just for something as exotic as bidirection commands. How about just keeping it simple - have a second request linked off the first one for the second data phase? So keep it completely seperate, not just overload -special for 2nd bio list. So basically just add a struct request pointer, so you can do rq = rq-next_rq or something for the next data phase. I bet this would be a LOT less invasive as well, and we can get by with a few helpers to support it. And it should definitely be a request type. I'm a bit confused since what you both suggest is very similar to what we've proposed back in October 2006 and the impression we got was that it will be better to support bidirectional block requests natively (yet to be honest, James, you wanted a linked request all along). It still has to be implemented natively at the block layer, just differently like described above. So instead of messing all over the block layer adding rq_uni() stuff, just add that struct request pointer to the request structure for the 2nd data phase. 
You can then relatively easily modify the block layer helpers to support mapping and setup of such requests. Before we go on that route again, how do you see the support for bidi at the scsi mid-layer done? Again, we prefer to support that officially using two struct scsi_cmnd_buff instances in struct scsi_cmnd and not as a one-off feature, using special-purpose state and logic (e.g. a linked struct scsi_cmd for the bidi_read sg list). The SCSI part is up to James; that can be done either inside a single scsi command, or as linked scsi commands as well. I don't care too much about that bit, just the block layer parts :-). And the proposed block layer design can be used both ways by the scsi layer.
Re: [PATCH 4/4] bidi support: bidirectional request
On Mon, Apr 30 2007, Douglas Gilbert wrote: Jens Axboe wrote: On Mon, Apr 30 2007, Benny Halevy wrote: Jens Axboe wrote: On Sun, Apr 29 2007, James Bottomley wrote: On Sun, 2007-04-29 at 18:48 +0300, Boaz Harrosh wrote: FUJITA Tomonori wrote: From: Boaz Harrosh [EMAIL PROTECTED] Subject: [PATCH 4/4] bidi support: bidirectional request Date: Sun, 15 Apr 2007 20:33:28 +0300 diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 645d24b..16a02ee 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -322,6 +322,7 @@ struct request { void *end_io_data; struct request_io_part uni; +struct request_io_part bidi_read; }; Would be more straightforward to have: struct request_io_part in; struct request_io_part out; Yes I wish I could do that. For bidi supporting drivers this is the most logical. But for the 99.9% of uni-directional drivers, calling rq_uni(), and being some what on the hotish paths, this means we will need a pointer to a uni request_io_part. This is bad because: 1st- There is no defined stage in a request life where to definitely set that pointer, specially in the preparation stages. 2nd- hacks like scsi_error.c/scsi_send_eh_cmnd() will not work at all. Now this is a very bad spot already, and I have a short term fix for it in the SCSI-bidi patches (not sent yet) but a more long term solution is needed. Once such hacks are cleaned up we can do what you say. This is exactly why I use the access functions rq_uni/rq_io/rq_in/rq_out and not open code access. I'm still not really convinced about this approach. The primary job of the block layer is to manage and merge READ and WRITE requests. It serves a beautiful secondary function of queueing for arbitrary requests it doesn't understand (REQ_TYPE_BLOCK_PC or REQ_TYPE_SPECIAL ... or indeed any non REQ_TYPE_FS). bidirectional requests fall into the latter category (there's nothing really we can do to merge them ... they're just transported by the block layer). 
The only unusual feature is that they carry two bios. I think the drivers that actually support bidirectional will be a rarity, so it might even be advisable to add it to the queue capability (refuse bidirectional requests at the top rather than perturbing all the drivers to process them). So, what about REQ_TYPE_BIDIRECTIONAL rather than REQ_BIDI? That will remove it from the standard path and put it on the special command type path where we can process it specially. Additionally, if you take this approach, you can probably simply chain the second bio through req-special as an additional request in the stream. The only thing that would then need modification would be the dequeue of the block driver (it would have to dequeue both requests and prepare them) and that needs to be done only for drivers handling bidirectional requests. I agree, I'm really not crazy about shuffling the entire request setup around just for something as exotic as bidirection commands. How about just keeping it simple - have a second request linked off the first one for the second data phase? So keep it completely seperate, not just overload -special for 2nd bio list. So basically just add a struct request pointer, so you can do rq = rq-next_rq or something for the next data phase. I bet this would be a LOT less invasive as well, and we can get by with a few helpers to support it. And it should definitely be a request type. I'm a bit confused since what you both suggest is very similar to what we've proposed back in October 2006 and the impression we got was that it will be better to support bidirectional block requests natively (yet to be honest, James, you wanted a linked request all along). It still has to be implemented natively at the block layer, just differently like described above. So instead of messing all over the block layer adding rq_uni() stuff, just add that struct request pointer to the request structure for the 2nd data phase. 
You can relatively easy then modify the block layer helpers to support mapping and setup of such requests. Before we go on that route again, how do you see the support for bidi at the scsi mid-layer done? Again, we prefer to support that officially using two struct scsi_cmnd_buff instances in struct scsi_cmnd and not as a one-off feature, using special-purpose state and logic (e.g. a linked struct scsi_cmd for the bidi_read sg list). The SCSI part is up to James, that can be done as either inside a single scsi command, or as linked scsi commands as well. I don't care too much about that bit, just the block layer parts :-). And the proposed block layer design can be used both ways by the scsi layer. Linked SCSI commands have been obsolete since SPC-4 rev 6 (18 July 2006) after proposal 06-259r1
Re: BUG: Null pointer dereference in fs/open.c
On Wed, Apr 25 2007, Andrew Morton wrote:

# pktsetup 2 /dev/sr0
[19982.934793] drivers/scsi/scsi_lib.c:838: setting error to 134217730
[19982.941070] [c010521a] show_trace_log_lvl+0x1a/0x30
[19982.946256] [c0105952] show_trace+0x12/0x20
[19982.950744] [c0105a46] dump_stack+0x16/0x20
[19982.955232] [c034543a] scsi_io_completion+0x28a/0x3a0
[19982.960586] [c034556b] scsi_blk_pc_done+0x1b/0x30
[19982.965594] [c0340d0c] scsi_finish_command+0x4c/0x60
[19982.970861] [c0345c07] scsi_softirq_done+0x77/0xe0
[19982.975955] [c0257f8b] blk_done_softirq+0x6b/0x80
[19982.980962] [c01243a2] __do_softirq+0x62/0xc0
[19982.985624] [c0124455] do_softirq+0x55/0x60
[19982.990112] [c0124be5] ksoftirqd+0x65/0x100
[19982.994599] [c0132963] kthread+0xa3/0xd0
[19982.998827] [c0104e17] kernel_thread_helper+0x7/0x10
[19983.004095] ===
[19983.009065] cdrom: This disc doesn't have any tracks I recognize!

So what has happened here is that this code, in ide-cd.c's cdrom_decode_status(), is now triggering:

	} else if (blk_pc_request(rq) || rq->cmd_type == REQ_TYPE_ATA_PC) {
		/* All other functions, except for READ. */
		unsigned long flags;

		/*
		 * if we have an error, pass back CHECK_CONDITION as the
		 * scsi status byte
		 */
		if (blk_pc_request(rq) && !rq->errors)
			rq->errors = SAM_STAT_CHECK_CONDITION;

I suspect this is a bug introduced by 406c9b605cbc45151c03ac9a3f95e9acf050808c (in which case it'll be the third bug so far). Perhaps the IDE driver was previously not considering these requests to be of type blk_pc_request(), and after 406c9b605cbc45151c03ac9a3f95e9acf050808c it _is_ treating them as blk_pc_request() and is incorrectly reporting an error. Or something like that. But it IS a block pc request. We've been setting the sam stats on ->errors for block pc requests for a long time. Guys: help! I'm not sure what your question is; if someone can sum up what goes wrong and what the expected result is, I can try and help.
-- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: Null pointer dereference in fs/open.c
On Thu, Apr 26 2007, William Heimbigner wrote: On Wed, 25 Apr 2007, Andrew Morton wrote: On Wed, 25 Apr 2007 22:53:00 + (GMT) William Heimbigner [EMAIL PROTECTED] wrote: On Wed, 25 Apr 2007, Andrew Morton wrote: snip OK. I am able to use the pktcdvd driver OK in mainline with a piix/sata drive. It could be that something is going wrong at the IDE level for you. Perhaps; I'll try an external usb cd burner, and see where that goes. Are you able to identify the most recent kernel which actually worked? No, because I haven't set packet writing up in Linux before - however, I do know that I've successfully set up packet writing (using 2 of the 3 cd burners I have) in another operating system before. I'll try 2.6.18 and see if that gets me anywhere different, though. OK. A quick summary: mainline's pktcdvd isn't working for William using IDE. It is working for me using sata. snip So what has happened here is that this code, in ide-cd.c's cdrom_decode_status(), is now triggering: } else if (blk_pc_request(rq) || rq->cmd_type == REQ_TYPE_ATA_PC) { /* All other functions, except for READ. */ unsigned long flags; /* * if we have an error, pass back CHECK_CONDITION as the * scsi status byte */ if (blk_pc_request(rq) && !rq->errors) rq->errors = SAM_STAT_CHECK_CONDITION; I suspect this is a bug introduced by 406c9b605cbc45151c03ac9a3f95e9acf050808c (in which case it'll be the third bug so far). Perhaps the IDE driver was previously not considering these requests to be of type blk_pc_request(), and after 406c9b605cbc45151c03ac9a3f95e9acf050808c it _is_ treating them as blk_pc_request() and is incorrectly reporting an error. Or something like that. Guys: help! A follow-up: after looking around a bit, I have managed to get packet writing to work properly on /dev/hdc (before, it was reporting only 1.8 MB available or so; this was a formatting issue). I've also gotten the external cd-rw drive to work. However, I'm still at a loss as to why /dev/hdd won't work.
I tried formatting a dvd-rw for this drive, however, it consistently gives me: [27342.503933] drivers/ide/ide-cd.c:729: setting error to 2 [27342.509251] [c010521a] show_trace_log_lvl+0x1a/0x30 [27342.514411] [c0105952] show_trace+0x12/0x20 [27342.518864] [c0105a46] dump_stack+0x16/0x20 [27342.523317] [c033f6e4] cdrom_decode_status+0x1f4/0x3b0 [27342.528732] [c033fae8] cdrom_newpc_intr+0x38/0x320 [27342.533791] [c0331106] ide_intr+0x96/0x200 [27342.538157] [c0150cf8] handle_IRQ_event+0x28/0x60 [27342.543139] [c0151f96] handle_edge_irq+0xa6/0x130 [27342.548121] [c0106449] do_IRQ+0x49/0xa0 [27342.552228] [c0104c3a] common_interrupt+0x2e/0x34 [27342.557200] [c01022d2] mwait_idle+0x12/0x20 [27342.561653] [c01023ca] cpu_idle+0x4a/0x80 [27342.565934] [c0101147] rest_init+0x37/0x40 [27342.570300] [c068ac7b] start_kernel+0x34b/0x420 [27342.575109] [] 0x0 [27342.578089] === and doesn't work (the above output was generated by Andrew's patch to log certain areas). # dvd+rw-format /dev/hdd -force * BD/DVDRW/-RAM format utility by [EMAIL PROTECTED], version 7.0. :-( failed to locate Quick Format descriptor. * 4.7GB DVD-RW media in Sequential mode detected. * formatting 0.0\:-[ READ TRACK INFORMATION failed with SK=3h/ASC=11h/ACQ=05h]: Input/output error That's an uncorrectable read error. Is the media good? I tried putting in a different dvd-rw, and this time I get: # dvd+rw-format /dev/hdd -force * BD/DVDRW/-RAM format utility by [EMAIL PROTECTED], version 7.0. * 4.7GB DVD-RW media in Sequential mode detected. * formatting 0.0|:-[ FORMAT UNIT failed with SK=5h/ASC=26h/ACQ=00h]: Input/output error That's the drive complaining about an invalid bit being set in the command descriptor block. That's usually a bug in the issuer. -- Jens Axboe
[PATCH] Maxtor 6B250S0/BANC1B70 hangs with NCQ
Hi, I've seen this several times on this drive, completely reproducible. Once it has hung, power needs to be cut from the drive to recover it; a simple reboot is not enough. So I'd suggest disabling NCQ on this drive. Error log attached. Signed-off-by: Jens Axboe [EMAIL PROTECTED] diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index f1f595f..ddb3909 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -3361,6 +3361,8 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = { { "FUJITSU MHT2060BH", NULL, ATA_HORKAGE_NONCQ }, /* NCQ is broken */ { "Maxtor 6L250S0", "BANC1G10", ATA_HORKAGE_NONCQ }, + /* NCQ hard hangs device under heavier load, needs hard power cycle */ + { "Maxtor 6B250S0", "BANC1B70", ATA_HORKAGE_NONCQ }, /* Devices with NCQ limits */ -- Jens Axboe ata1.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 0x2 frozen ata1.00: cmd 61/08:00:3f:00:f4/00:00:0c:00:00/40 tag 0 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:08:3f:00:f8/00:00:0c:00:00/40 tag 1 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:10:3f:00:fc/00:00:0c:00:00/40 tag 2 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:18:3f:00:00/00:00:0d:00:00/40 tag 3 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:20:3f:00:04/00:00:0d:00:00/40 tag 4 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:28:3f:00:08/00:00:0d:00:00/40 tag 5 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:30:3f:00:0c/00:00:0d:00:00/40 tag 6 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:38:3f:00:10/00:00:0d:00:00/40 tag 7 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd
61/08:40:3f:00:14/00:00:0d:00:00/40 tag 8 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:48:3f:00:18/00:00:0d:00:00/40 tag 9 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:50:3f:00:1c/00:00:0d:00:00/40 tag 10 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:58:3f:00:20/00:00:0d:00:00/40 tag 11 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:60:3f:00:24/00:00:0d:00:00/40 tag 12 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:68:3f:00:28/00:00:0d:00:00/40 tag 13 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:70:3f:00:2c/00:00:0d:00:00/40 tag 14 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:78:3f:00:30/00:00:0d:00:00/40 tag 15 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:80:3f:00:34/00:00:0d:00:00/40 tag 16 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:88:3f:00:38/00:00:0d:00:00/40 tag 17 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:90:3f:00:3c/00:00:0d:00:00/40 tag 18 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:98:3f:00:40/00:00:0d:00:00/40 tag 19 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:a0:3f:00:44/00:00:0d:00:00/40 tag 20 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:a8:3f:00:48/00:00:0d:00:00/40 tag 21 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:b0:3f:00:4c/00:00:0d:00:00/40 tag 22 cdb 0x0 data 4096 out res 
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:b8:3f:00:50/00:00:0d:00:00/40 tag 23 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:c0:3f:00:54/00:00:0d:00:00/40 tag 24 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:c8:3f:00:dc/00:00:0c:00:00/40 tag 25 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:d0:3f:00:e0/00:00:0c:00:00/40 tag 26 cdb 0x0 data 4096 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) ata1.00: cmd 61/08:d8:3f:00:e4/00:00:0c:00:00/40 tag 27 cdb 0x0
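For readers unfamiliar with how such blacklist entries take effect, here is a simplified userspace C sketch of the lookup. It mirrors the shape of the ata_device_blacklist table in the patch above, but the helper names are invented for this sketch and it uses exact string compares only; the real libata code also matches glob-style patterns and pulls the model/firmware strings from the drive's IDENTIFY data:

```c
#include <assert.h>
#include <string.h>

#define HORKAGE_NONCQ (1 << 0)   /* simplified stand-in for ATA_HORKAGE_NONCQ */

struct blacklist_entry {
	const char *model;
	const char *firmware;    /* NULL matches any firmware revision */
	unsigned int horkage;
};

/* Table shaped like the kernel's ata_device_blacklist, terminated by
 * a NULL-model sentinel. */
static const struct blacklist_entry blacklist[] = {
	{ "Maxtor 6L250S0", "BANC1G10", HORKAGE_NONCQ },
	{ "Maxtor 6B250S0", "BANC1B70", HORKAGE_NONCQ },
	{ NULL, NULL, 0 }
};

/* Return the horkage flags for a model/firmware pair, 0 if not listed. */
static unsigned int device_horkage(const char *model, const char *fw)
{
	const struct blacklist_entry *e;

	for (e = blacklist; e->model; e++) {
		if (strcmp(model, e->model))
			continue;
		if (e->firmware && strcmp(fw, e->firmware))
			continue;
		return e->horkage;
	}
	return 0;
}
```

Because the firmware field participates in the match, a later firmware revision of the same model escapes the blacklist automatically, which is why entries above pin specific revisions like BANC1B70.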
Re: 2.6.20.3 AMD64 oops in CFQ code
On Thu, Mar 22 2007, [EMAIL PROTECTED] wrote: This is a uniprocessor AMD64 system running software RAID-5 and RAID-10 over multiple PCIe SiI3132 SATA controllers. The hardware has been very stable for a long time, but has been acting up of late since I upgraded to 2.6.20.3. ECC memory should preclude the possibility of bit-flip errors. Kernel 2.6.20.3 + linuxpps patches (confined to drivers/serial, and not actually in use as I stole the serial port for a console). It takes half a day to reproduce the problem, so bisecting would be painful. BackupPC_dump mostly writes to a large (1.7 TB) ext3 RAID5 partition. Here are two oopses, a few minutes (16:31, to be precise) apart. Unusually, it oopsed twice *without* locking up the system. Usually, I see this followed by an error from drivers/input/keyboard/atkbd.c: printk(KERN_WARNING "atkbd.c: Spurious %s on %s. Some program might be trying access hardware directly.\n", emitted at 1 Hz with the keyboard LEDs flashing and the system unresponsive to keyboard or pings. (I think it was "spurious ACK on serio/input0", but my memory may be faulty.) If anyone has any suggestions, they'd be gratefully received.
Unable to handle kernel NULL pointer dereference at 0098 RIP: [8031504a] cfq_dispatch_insert+0x18/0x68 PGD 777e9067 PUD 78774067 PMD 0 Oops: [1] CPU 0 Modules linked in: ecb Pid: 2837, comm: BackupPC_dump Not tainted 2.6.20.3-g691f5333 #40 RIP: 0010:[8031504a] [8031504a] cfq_dispatch_insert+0x18/0x68 RSP: 0018:8100770bbaf8 EFLAGS: 00010092 RAX: 81007fb36c80 RBX: RCX: 0001 RDX: 00010003e4e7 RSI: RDI: RBP: 81007fb37a00 R08: R09: 81005d390298 R10: 81007fcb4f80 R11: 81007fcb4f80 R12: 81007facd280 R13: 0004 R14: 0001 R15: FS: 2b322d120d30() GS:805de000() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0098 CR3: 7bcf CR4: 06e0 Process BackupPC_dump (pid: 2837, threadinfo 8100770ba000, task 81007fc5d8e0) Stack: 8100770f39f0 0004 0001 80315253 803b2607 81005da2bc40 81007fac3800 81007facd280 81007facd280 81005d390298 Call Trace: [80315253] cfq_dispatch_requests+0x152/0x512 [803b2607] scsi_done+0x0/0x18 [8030d9f1] elv_next_request+0x137/0x147 [803b7ce0] scsi_request_fn+0x6a/0x33a [8024d407] generic_unplug_device+0xa/0xe [80407ced] unplug_slaves+0x5b/0x94 [80223d65] sync_page+0x0/0x40 [80223d9b] sync_page+0x36/0x40 [80256d45] __wait_on_bit_lock+0x36/0x65 [80237496] __lock_page+0x5e/0x64 [8028061d] wake_bit_function+0x0/0x23 [802074de] find_get_page+0xe/0x2d [8020b38e] do_generic_mapping_read+0x1c2/0x40d [8020bd80] file_read_actor+0x0/0x118 [8021422e] generic_file_aio_read+0x15c/0x19e [8020bafa] do_sync_read+0xc9/0x10c [80210342] may_open+0x5b/0x1c6 [802805ef] autoremove_wake_function+0x0/0x2e [8020a857] vfs_read+0xaa/0x152 [8020faf3] sys_read+0x45/0x6e [8025041e] system_call+0x7e/0x83 3 (I think) separate instances of this, each involving raid5. Is your array degraded or fully operational? -- Jens Axboe
Re: [PATCH] libata: add NCQ blacklist entries from Silicon Image Windows driver
On Thu, Feb 22 2007, Jeff Garzik wrote: Robert Hancock wrote: This patch adds in some NCQ blacklist entries taken from the Silicon Image Windows drivers' .inf files for the 3124 and 3132 controllers. These entries were marked as DisableSataQueueing. Assume these are in their blacklist for a reason and disable NCQ on these drives. Signed-off-by: Robert Hancock [EMAIL PROTECTED] --- linux-2.6.21-rc1edit/drivers/ata/libata-core.c.prev 2007-02-21 22:23:05.0 -0600 +++ linux-2.6.21-rc1edit/drivers/ata/libata-core.c 2007-02-21 22:25:44.0 -0600 @@ -3269,6 +3269,13 @@ static const struct ata_blacklist_entry /* Devices with NCQ limits */ +/* The following blacklist entries are taken from the Windows + driver .inf files for the Silicon Image 3124 and 3132. */ +{ "Maxtor 7B250S0", "BANC1B70", ATA_HORKAGE_NONCQ, }, +{ "HTS541060G9SA00", "MB3OC60D", ATA_HORKAGE_NONCQ, }, +{ "HTS541080G9SA00", "MB4OC60D", ATA_HORKAGE_NONCQ, }, +{ "HTS541010G9SA00", "MBZOC60D", ATA_HORKAGE_NONCQ, }, Do we have information that these drives fail on non-SiI controllers? Sometimes the problem can be related to a single family of controllers. I don't know about the Hitachis, but the Maxtor with that firmware is definitely broken. It _appeared_ to work if the depth was limited to 4, but I didn't test it long enough to be absolutely certain. So the safest is indeed to blacklist it. I'm pretty sure my initial NCQ patches had a blacklist entry for that drive as well, but apparently the blacklist got lost somewhere along the way? -- Jens Axboe
Re: [PATCH] libata: add NCQ blacklist entries from Silicon Image Windows driver
On Thu, Feb 22 2007, Mark Lord wrote: Jens Axboe wrote: Robert Hancock wrote: .. +/* The following blacklist entries are taken from the Windows + driver .inf files for the Silicon Image 3124 and 3132. */ +{ "Maxtor 7B250S0", "BANC1B70", ATA_HORKAGE_NONCQ, }, +{ "HTS541060G9SA00", "MB3OC60D", ATA_HORKAGE_NONCQ, }, +{ "HTS541080G9SA00", "MB4OC60D", ATA_HORKAGE_NONCQ, }, +{ "HTS541010G9SA00", "MBZOC60D", ATA_HORKAGE_NONCQ, }, Do we have information that these drives fail on non-SiI controllers? Sometimes the problem can be related to a single family of controllers. I don't know about the Hitachis, but the Maxtor with that firmware is definitely broken. It _appeared_ to work if the depth was limited to 4, but I didn't test it long enough to be absolutely certain. So the safest is indeed to blacklist it. Yes, broken on the Silicon Image controllers for sure. But what type of controller did you observe the failures on, Jens? AHCI. -- Jens Axboe
Re: libata FUA revisited
On Wed, Feb 21 2007, Tejun Heo wrote: [cc'ing Ric, Hannes and Dongjun, Hello. Feel free to drag other people in.] Robert Hancock wrote: Jens Axboe wrote: But we can't really change that, since you need the cache flushed before issuing the FUA write. I've been advocating for an ordered bit for years, so that we could just do: 3. w/FUA+ORDERED normal operation - barrier issued - write barrier FUA+ORDERED - normal operation resumes So we don't have to serialize everything both at the block and device level. I would have made FUA imply this already, but apparently it's not what MS wanted FUA for, so... The current implementations take the FUA bit (or WRITE FUA) as a hint to boost it to head of queue, so you are almost certainly going to jump ahead of already queued writes. Which we of course really do not. Yeah, I think if we have tagged write command and flush tagged (or barrier tagged) things can be pretty efficient. Again, I'm much more comfortable with separate opcodes for those rather than bits changing the behavior. ORDERED+FUA NCQ would still be preferable to an NCQ enabled flush command, though. Another idea Dongjun talked about while drinking in LSF was ranged flush. Not as flexible/efficient as the previous option but much less intrusive and should help quite a bit, I think. But that requires extensive tracking, I'm not so sure the implementation of that for barriers would be very clean. It'd probably be good for fsync, though. -- Jens Axboe
Re: libata FUA revisited
On Mon, Feb 19 2007, Robert Hancock wrote: Jens Axboe wrote: But we can't really change that, since you need the cache flushed before issuing the FUA write. I've been advocating for an ordered bit for years, so that we could just do: 3. w/FUA+ORDERED normal operation - barrier issued - write barrier FUA+ORDERED - normal operation resumes So we don't have to serialize everything both at the block and device level. I would have made FUA imply this already, but apparently it's not what MS wanted FUA for, so... The current implementations take the FUA bit (or WRITE FUA) as a hint to boost it to head of queue, so you are almost certainly going to jump ahead of already queued writes. Which we of course really do not. I think that FUA was designed for a different use case than what Linux is using barriers for currently. The advantage with FUA is when you have [snip] Yes that's pretty obvious, my point is just that FUA+ORDERED would be a nice thing to have for us. I'm not too nervous about the FUA write commands, I hope we can safely assume that if you set the FUA supported bit in the id AND the write fua command doesn't get aborted, that FUA must work. Anything else would just be an immensely stupid implementation. NCQ+FUA is more tricky, I agree that it being just a command bit does make it more likely that it could be ignored. And that is indeed a danger. Given state of NCQ in early firmware drives, I would not at all be surprised if the drive vendors screwed that up too. But, since we don't have the ordered bit for NCQ/FUA anyway, we do need to drain the drive queue before issuing the WRITE/FUA. And at that point we may as well not use the NCQ command, just go for the regular non-NCQ FUA write. I think that should be safe. Aside from the issue above, as I mentioned elsewhere, lots of NCQ drives don't support non-NCQ FUA writes.. Lots meaning how many? All the ones I have here support FUA. 
-- Jens Axboe
Re: libata FUA revisited
On Wed, Feb 21 2007, Tejun Heo wrote: Jens Axboe wrote: On Wed, Feb 21 2007, Tejun Heo wrote: [cc'ing Ric, Hannes and Dongjun, Hello. Feel free to drag other people in.] Robert Hancock wrote: Jens Axboe wrote: But we can't really change that, since you need the cache flushed before issuing the FUA write. I've been advocating for an ordered bit for years, so that we could just do: 3. w/FUA+ORDERED normal operation - barrier issued - write barrier FUA+ORDERED - normal operation resumes So we don't have to serialize everything both at the block and device level. I would have made FUA imply this already, but apparently it's not what MS wanted FUA for, so... The current implementations take the FUA bit (or WRITE FUA) as a hint to boost it to head of queue, so you are almost certainly going to jump ahead of already queued writes. Which we of course really do not. Yeah, I think if we have tagged write command and flush tagged (or barrier tagged) things can be pretty efficient. Again, I'm much more comfortable with separate opcodes for those rather than bits changing the behavior. ORDERED+FUA NCQ would still be preferable to an NCQ enabled flush command, though. I think we're talking about two different things here. 1. The barrier write (FUA write) combined with flush. I think it would help improving the performance but I think issuing two commands shouldn't be too slower than issuing one combined command unless it causes extra physical activity (moving head, etc...). The command overhead is dwarfed by other factors, agree. 2. FLUSH currently flushes all writes. If we can mark certain commands requiring ordering, we can selectively flush or order necessary writes. (No need to flush 16M buffer all over the disk when only journal needs barriering) Sure, anything is better than the sledge hammer flush. But my claim is that an ORDERED+FUA enabled write for critical data would be a good approach, and simple in software. 
Another idea Dongjun talked about while drinking in LSF was ranged flush. Not as flexible/efficient as the previous option but much less intrusive and should help quite a bit, I think. But that requires extensive tracking, I'm not so sure the implementation of that for barriers would be very clean. It'd probably be good for fsync, though. I was mostly thinking about journal area. Using it for other purposes would incur a lot of complexity. :-( Yep if it's just for the journal, the range is known and fixed, so the flush range would work nicely there. -- Jens Axboe
Re: libata FUA revisited
On Tue, Feb 13 2007, Tejun Heo wrote: So, actually, I was thinking about *always* using the non-NCQ FUA opcode. As currently implemented, FUA request is always issued by itself, so NCQ doesn't make any difference there. So, I think it would be better to turn on FUA on driver-by-driver basis whether the controller supports NCQ or not. Unfortunately not all drives that support NCQ support the non-NCQ FUA commands (my Seagates are like this). And I'm a bit scared to set FUA bit on such drives and trust that it will actually do FUA, so our opinions aren't too far away from each other. :-) There's definitely a potential advantage to FUA with NCQ - if you have non-synchronous accesses going on concurrently with synchronous ones, if you have to use non-NCQ FUA or flush cache commands, you have to wait for all the IOs of both types to drain out before you can issue the flush (since those can't be overlapped with the NCQ read/writes). And if you can only use flush cache, then you're forcing all the writes to be flushed including the non-synchronous ones you didn't care about. Whether or not the block layer currently exploits this I don't know, but it definitely could. The current barrier implementation uses the following sequences for no-FUA and FUA cases. 1. w/o FUA normal operation - barrier issued - drain IO - flush - barrier written - flush - normal operation resumes 2. w/ FUA normal operation - barrier issued - drain IO - flush - barrier written / FUA - normal operation resumes So, the FUA write is issued by itself. This isn't really efficient and frequent barriers impact the performance badly. If we can change that NCQ FUA will be certainly beneficial. But we can't really change that, since you need the cache flushed before issuing the FUA write. I've been advocating for an ordered bit for years, so that we could just do: 3. 
w/FUA+ORDERED normal operation - barrier issued - write barrier FUA+ORDERED - normal operation resumes So we don't have to serialize everything both at the block and device level. I would have made FUA imply this already, but apparently it's not what MS wanted FUA for, so... The current implementations take the FUA bit (or WRITE FUA) as a hint to boost it to head of queue, so you are almost certainly going to jump ahead of already queued writes. Which we of course really do not. Well, I might be being too paranoid but silent FUA failure would be really hard to diagnose if that ever happens (and I'm fairly certain that it will on some firmwares). Well, there are also probably drives that ignore flush cache commands or fail to do other things that they should. There's only so far we can go in coping if the firmware authors are being retarded. If any drive is broken like that we should likely just blacklist NCQ on it entirely as obviously little thought or testing went into the implementation.. FLUSH has been around quite long time now and most drives don't have problem with that. FUA on ATA is still quite new and libata will be the first major user of it if we enable it by default. It just seems too easy to ignore that bit and successfully complete the write - there isn't any safety net as opposed to using a separate opcode. So, I'm a bit nervous. I'm not too nervous about the FUA write commands, I hope we can safely assume that if you set the FUA supported bit in the id AND the write fua command doesn't get aborted, that FUA must work. Anything else would just be an immensely stupid implementation. NCQ+FUA is more tricky, I agree that it being just a command bit does make it more likely that it could be ignored. And that is indeed a danger. Given state of NCQ in early firmware drives, I would not at all be surprised if the drive vendors screwed that up too. 
But, since we don't have the ordered bit for NCQ/FUA anyway, we do need to drain the drive queue before issuing the WRITE/FUA. And at that point we may as well not use the NCQ command, just go for the regular non-NCQ FUA write. I think that should be safe. -- Jens Axboe
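The flush/FUA barrier sequences debated in this thread can be summarized in a small C model. This is purely an illustrative count of the commands a device would see per barrier write (after the queue has been drained), not kernel code; the function names are invented, and the hypothetical FUA+ORDERED case discussed above would reduce the count further to a single tagged write:

```c
#include <assert.h>

/* Command kinds in this toy model. */
enum { CMD_FLUSH, CMD_WRITE, CMD_WRITE_FUA };

/* Sequence 1 (no FUA support): drain IO, flush the write cache,
 * write the barrier, then flush again so the barrier itself is
 * known to be on media. */
static int barrier_cmds_no_fua(int *seq)
{
	seq[0] = CMD_FLUSH;
	seq[1] = CMD_WRITE;
	seq[2] = CMD_FLUSH;
	return 3;
}

/* Sequence 2 (FUA supported): drain IO, flush once, then write the
 * barrier with the FUA bit set -- the post-flush is saved because
 * FUA forces that one write to media. */
static int barrier_cmds_fua(int *seq)
{
	seq[0] = CMD_FLUSH;
	seq[1] = CMD_WRITE_FUA;
	return 2;
}
```

Either way, the pre-flush remains unavoidable without an ordered bit, which is exactly the point made above about why plain FUA alone cannot eliminate the serialization.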
Re: ahci problems with sata disk.
On Tue, Jan 16 2007, Jeff Garzik wrote: Mark Hahn wrote: I thought that NCQ was intended to increase performance ?? intended to increase _sales_ performance ;) Yep. remember that you've always had command queueing (kernel elevator): the main difference with NCQ (or SCSI tagged queueing) is when the disk can out-schedule the kernel. afaict, this means squeezing in a rotationally intermediate request along the way. that intermediate request must be fairly small and should be a read (for head-settling reasons). I wonder how often this happens in the real world, given the relatively small queues the disk has to work with. ISTR either Jens or Andrew ran some numbers, and found that there was little utility beyond 4 or 8 tags or so. It entirely depends on the access pattern. For truly random reads, performance does seem to continue to scale up with increasing drive queue depths. It may only be a benchmark figure though, as truly random read workloads probably aren't that common :-) For anything else, going beyond 4 tags doesn't improve much. My hdparm test is a sequential read-ahead test, so it will naturally perform worse on a Raptor when NCQ is on. that's a surprisingly naive heuristic, especially since NCQ is concerned with just a max of ~4MB of reads, only a smallish fraction of the available cache. NCQ mainly helps with multiple threads doing reads. Writes are largely asynchronous to the user already (except for fsync-style writes). You want to be able to stuff the disk's internal elevator with as many read requests as possible, because reads are very often synchronous -- most apps (1) read a block, (2) do something, (3) goto step #1. The kernel's elevator isn't much use in these cases. Au contraire, this is one of the cases where intelligent IO scheduling in the kernel makes a ton of difference.
It's the primary reason that AS and CFQ are able to maintain 90% of disk bandwidth for more than one process, idling the drive for the duration of step 2 in the sequence above (step 2 is typically really small, time wise). If the next block read is close to the first one, that is. If you do that, you will greatly outperform the same workload pushed to the drive scheduling. I've done considerable benchmarks on this. Only if the processes are doing random IO should the IO scheduler punt and push everything to the drive queue. -- Jens Axboe
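The "read, think, read again" argument above can be illustrated with a toy seek-distance model in C. It is a deliberately crude sketch invented for this illustration (block positions on a 1-D axis, seek cost equals distance), not a benchmark: two processes each read sequential runs that are far apart, and idling briefly to serve one process's run contiguously beats strictly alternating between them:

```c
#include <assert.h>

static long dist(long from, long to)
{
	return to > from ? to - from : from - to;
}

/* No anticipation: the head ping-pongs between the two processes'
 * sequential streams, paying a full seek on nearly every request. */
static long seeks_alternating(long a0, long b0, int n)
{
	long pos = a0, total = 0;
	for (int i = 0; i < n; i++) {
		total += dist(pos, a0 + i); pos = a0 + i;
		total += dist(pos, b0 + i); pos = b0 + i;
	}
	return total;
}

/* With anticipation: finish process A's nearby run first (each next
 * read is adjacent, so seeks are tiny), then move once to B's run. */
static long seeks_anticipatory(long a0, long b0, int n)
{
	long pos = a0, total = 0;
	for (int i = 0; i < n; i++) { total += dist(pos, a0 + i); pos = a0 + i; }
	for (int i = 0; i < n; i++) { total += dist(pos, b0 + i); pos = b0 + i; }
	return total;
}
```

With the runs a million blocks apart, the alternating schedule pays that distance on every switch while the anticipatory one pays it once, which is the intuition behind AS/CFQ idling during the application's "think" step.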
Re: ahci problems with sata disk.
On Tue, Jan 16 2007, Eric D. Mudama wrote: [snip lots of stuff I agree completely with] If done properly, queueing should never hurt performance. High queue depths will increase average latency of course, but shouldn't hurt overall performance. It may never hurt performance, but there are common scenarios where you are much better off not doing queuing even if you could. A good example of that is a media serving service, where you end up reading a bunch of files sequentially. It's faster to read chunks of each file sequentially at depth 1 and move on, than to queue a request from each of them and send them to the drive. On my laptop with an NCQ enabled drive, the mentioned approach outperforms queuing by more than 100%. NCQ mainly helps with multiple threads doing reads. Writes are largely asynchronous to the user already (except for fsync-style writes). You want to be able to stuff the disk's internal elevator with as many read requests as possible, because reads are very often synchronous -- most apps (1) read a block, (2) do something, (3) goto step #1. The kernel's elevator isn't much use in these cases. True. And internal to the drive, normal elevator is meh. There are other algorithms for scheduling that perform better. Well Linux doesn't default to using a normal elevator, so it's a moot point. -- Jens Axboe
Re: 2.6.20-rc4-mm1
On Mon, Jan 15 2007, Ingo Molnar wrote: * Jens Axboe [EMAIL PROTECTED] wrote: In a previous write invoked by: fsck.ext3(1896): WRITE block 8552 on sdb1 end_buffer_async_write() is invoked. sdb1 is not a part of a raid device. When I briefly tested this before I left (and found it broken), doing a cat /proc/mdstat got things going again. Hard if that's your rootfs, it's just a hint :-) hm, so you knew it's broken, still you let Andrew pick it up, or am i misunderstanding something? Well the raid issue wasn't known before it was in -mm. -- Jens Axboe
Re: 2.6.20-rc4-mm1
On Sun, Jan 14 2007, Thomas Gleixner wrote: On Mon, 2007-01-15 at 09:05 +1100, Jens Axboe wrote: raid seems to have severe problems with the plugging change. I'll try and find Neil and have a chat with him, hopefully we can work it out. Some hints: mount(1899): WRITE block 16424 on md3 call md_write_start md3_raid1(438): WRITE block 40965504 on sdb6 md3_raid1(438): WRITE block 40965504 on sda6 First Write sector 16424 disks 2 Stuck. Note, that neither end_buffer_async_write() nor raid1_end_write_request() are invoked, In a previous write invoked by: fsck.ext3(1896): WRITE block 8552 on sdb1 end_buffer_async_write() is invoked. sdb1 is not a part of a raid device. When I briefly tested this before I left (and found it broken), doing a cat /proc/mdstat got things going again. Hard if that's your rootfs, it's just a hint :-) Hope that helps, I can reproduce, so that's not a problem. I can't do much about it until I'm back next week, but Neil might be able to help. We shall see. Thanks for testing. -- Jens Axboe
Re: [PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices
On Wed, Jan 03 2007, James Bottomley wrote: Er, well, as you know, I've never been a fan of this static list. I thought Jens was going to put us all out of our misery by making the list settable per device by root and thus shovel the problem off onto the distros? The code is there, just haven't had the guts to merge it yet. http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=bf5f922d167a5c5cf57132bbcaa1e0ddfd5c45f7 -- Jens Axboe
Re: [PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices
On Thu, Jan 04 2007, Mark Lord wrote: Jens Axboe wrote: On Wed, Jan 03 2007, James Bottomley wrote: Er, well, as you know, I've never been a fan of this static list. I thought Jens was going to put us all out of our misery by making the list settable per device by root and thus shovel the problem off onto the distros? The code is there, just haven't had the guts to merge it yet. http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=bf5f922d167a5c5cf57132bbcaa1e0ddfd5c45f7 That's nice, but doesn't help with the case of trying to do ATA passthru to ATAPI devices, the subject of the original patch here. James brought up the filtering issue, my reply pertains to that alone. Hence the snipping of unrelated contents in the original mail :-) -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: AHCI and add_disk_randomness()
On Wed, Nov 15 2006, Marek Podmaka wrote: Hello, I have a server with an Intel 5000V motherboard with an integrated AHCI SATA controller. It works well with kernel 2.6.18.2. But I have a problem with little entropy being available and I'm not sure if one of the reasons is that the AHCI driver does not use add_disk_randomness() to contribute to the kernel entropy pool. I'm not very skilled at kernel hacking... I tried finding where this is called (it's defined in drivers/char/random.c and used for example in the Compaq SmartArray driver in drivers/block/cciss.c). For the SCSI part, I found it in scsi_lib.c, but I was not able to determine if this is actually used by the ahci/libata drivers. If not, would it be possible to implement it? I tried to figure out where to call it by looking at cciss.c, but it seems that this is a totally different case, at least for me. I don't know where to add it, because its parameter is struct gendisk *disk and I didn't find it used anywhere in ahci or libata. Since ahci attaches its devices through the scsi layer, add_disk_randomness() will get called from scsi_end_request() like for any other scsi controller. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/4] ide: Break ide_lock to per-hwgroup lock
On Tue, Sep 06 2005, Ravikiran G Thirumalai wrote: The following patchset breaks down the global ide_lock to per-hwgroup lock. We have taken the following approach. Curious, what is the point of this? -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/4] ide: Break ide_lock to per-hwgroup lock
On Wed, Sep 07 2005, Ravikiran G Thirumalai wrote: On Wed, Sep 07, 2005 at 11:19:24AM +0200, Jens Axboe wrote: On Tue, Sep 06 2005, Ravikiran G Thirumalai wrote: The following patchset breaks down the global ide_lock to a per-hwgroup lock. We have taken the following approach. Curious, what is the point of this? On smp machines with multiple ide interfaces, we take a per-group lock instead of a global lock, thereby breaking the lock down to per-irq hwgroups. I realize the theory behind breaking up locks, I'm just wondering about this specific case. Please show actual contention data promoting this specific case, we don't break up locks just because. I'm asking because I've never heard anyone complain about IDE lock contention, and a proper patch usually comes with an analysis of why it is needed. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: libata: clustering on or off?
On Sun, Aug 28 2005, Arjan van de Ven wrote: On Sun, 2005-08-28 at 05:42 -0400, Jeff Garzik wrote: The constant ATA_SHT_USE_CLUSTERING in include/linux/libata.h controls the use of the SCSI layer's use_clustering feature, for a great many libata drivers. The current setup has clustering disabled, which in theory causes the block layer to do less work, at the expense of a greater number of scatter/gather table entries used. Any opinions WRT turning on clustering for libata? In 2.4 clustering was expensive due to a large number of checks that were done (basically the number of fragments got recounted a gazillion times). In 2.6 Jens fixed that afaik to make it basically free... at which point it's always a win. Yeah, it won't cost any extra cycles, so there's no point in keeping it turned off for that reason. Imo clustering on the driver level should announce driver capabilities. If clustering for some arch/kernel makes it slower, that should be decided at a midlayer level and not in each driver; eg the midlayer would choose to ignore the driver's capabilities. So... my opinion would be that libata should announce the capability (it seems the code/hw can do it). Agree, we should just remove the ability to control clustering, as it really overlaps with the segment settings anyway. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
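The clustering trade-off described above can be sketched with a toy model (invented names, not the block-layer implementation): with clustering on, physically contiguous pages collapse into one scatter/gather segment; with it off, each page costs its own entry.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: each "page" is identified by its physical frame number.
 * With clustering, a page that directly follows the previous one is
 * merged into the current scatter/gather segment; without it, every
 * page starts a new segment. */
static size_t count_sg_segments(const unsigned long *pfn, size_t npages,
                                int use_clustering)
{
    size_t segs = 0;
    size_t i;

    for (i = 0; i < npages; i++) {
        if (i == 0 || !use_clustering || pfn[i] != pfn[i - 1] + 1)
            segs++;  /* start a new segment */
    }
    return segs;
}
```

With the merge check essentially free, enabling clustering only shrinks the sg table, which is why the thread concludes there is no reason to keep it off.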
Re: libata: clustering on or off?
On Sun, Aug 28 2005, Christoph Hellwig wrote: On Sun, Aug 28, 2005 at 04:20:19PM +0200, Jens Axboe wrote: Agree, we should just remove the ability to control clustering, as it really overlaps with the segment settings anyway. What are we going to do with iscsi then? It really doesn't like segments over a page size. The best thing would probably be to switch networking to use sg lists and dma_map_sg, but that's not a trivial task. Limit the segment size, then. There are provisions for doing both length and boundary limits, that should suffice. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
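The length and boundary limits Jens mentions can be modelled as follows (a sketch with invented names; the kernel's actual check lives in the bio merge path): a segment must not cross a boundary window, which holds exactly when its first and last bytes fall in the same mask-aligned window.

```c
#include <assert.h>

/* A segment [addr, addr + len) stays within one boundary-aligned window
 * iff its first and last bytes share the same window.  'mask' is the
 * boundary size minus one, e.g. 0xffff for a 64KB boundary. */
static int segment_within_boundary(unsigned long addr, unsigned long len,
                                   unsigned long mask)
{
    return (addr | mask) == ((addr + len - 1) | mask);
}
```

A driver that cannot tolerate large or boundary-crossing segments (the iscsi case above) would advertise a small segment length and a page-sized boundary mask, and the midlayer stops merging as soon as either limit would be violated.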
Re: [PATCH] libata: error processing + rw 6 byte fix
On Sat, Aug 27 2005, Douglas Gilbert wrote: Jeff Garzik wrote: Does the attached look OK to everybody? Jeff, Yes. And after an exchange with Jens, it would probably be safer to map transfer_length=0 for READ_10 and READ_16 (as well as WRITE_10 and WRITE_16) to a nop on the ATA side. Otherwise an ATA disk may attempt to transfer 2**16 sectors. [Only READ_6 and WRITE_6 have the quirky 0 means 256 definition, in the larger commands 0 means 0.] Yes, that part needs to be added as well. Jeff, the below looks good for the READ/WRITE_6 case.

	if (scsicmd[0] == READ_6 || scsicmd[0] == WRITE_6) {
		qc->nsect = tf->nsect = scsicmd[4];
+		if (!qc->nsect) {
+			qc->nsect = 256;
+			if (lba48)
+				tf->hob_nsect = 1;
+		}
+
		tf->lbal = scsicmd[3];
		tf->lbam = scsicmd[2];
		tf->lbah = scsicmd[1] & 0x1f;	/* mask out reserved bits */

-- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libata: error processing + rw 6 byte fix
On Sat, Aug 27 2005, Jeff Garzik wrote: Here is the patch I just checked in. Looks perfect. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Hdaps-devel] Re: HDAPS, Need to park the head for real
On Thu, Aug 25 2005, Jon Escombe wrote: Alan Cox wrote:

@@ -1661,6 +1671,9 @@
 		where = ELEVATOR_INSERT_FRONT;
 		rq->flags |= REQ_PREEMPT;
 	}
+	if (action == ide_next)
+		where = ELEVATOR_INSERT_FRONT;
+
 	__elv_add_request(drive->queue, rq, where, 0);
 	ide_do_request(hwgroup, IDE_NO_IRQ);
 	spin_unlock_irqrestore(&ide_lock, flags);

Also puzzles me - why is this needed? I wanted the park command to get in at the head of the queue (behind the currently executing request). Contrary to the comments for ide_do_drive_cmd(), ide_next didn't appear to do anything to achieve this? At least from my initial testing before I made this change - it could take a second or so for the park command to be issued if the disk was busy. That part seems to have been lost, apparently. The above patch is correct. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PATCH: ide: ide-disk freeze support for hdaps
On Fri, Aug 26 2005, Yani Ioannou wrote: Please make the interface accept a number of seconds (as suggested by Jens) and remove this module parameter. This way the interface will be more flexible and cleaner. I really don't see any advantage in doing echo 1 ... instead of echo x ... (Pavel, please explain). Either way is easy enough to implement. Note though that I'd expect the userspace app to thaw the device when the danger is out of the way (the timeout is mainly there to ensure that the queue isn't frozen forever, and should probably be higher). Personally I don't have too much of an opinion either way though... what's the consensus? :). Yes please, I don't understand why you would want a 0/1 interface instead, when the timer-seconds method gives you the exact same ability plus a way to control when to unfreeze...

+static struct timer_list freeze_timer =
+	TIMER_INITIALIZER(freeze_expire, 0, 0);

There needs to be a timer per device, not a global one (it works for the current special case of the T42, but sooner or later we will hit this problem). I was considering that, but I am confused as to whether each drive has its own queue or not? (I really am a newbie to this stuff...). If so then yes there should be a per-device timer. Each drive has its own queue. queue handling should be done through block layer helpers (as described in Jens' email) - we will need them for libata too. Good point, I'll try to move as much as I can up to the block layer, it helps when it comes to implementing freeze for libata as you point out too. That includes things like the timer as well, reuse the queue plugging timer as I described in my initial posting on how to implement this. At this time the attribute can still be in use (because refcounting is done on drive->gendev), you need to add a disk class to the ide-disk driver (drivers/scsi/st.c looks like a good example of how to do it). I missed that completely, I'll look at changing it. 
IMO this should also be handled by the block layer which has all the needed information, Jens? While at it: I think that sysfs support should be moved to the block layer (queue attributes) and the storage driver should only need to provide queue_freeze_fn and queue_thaw_fn functions (similarly to cache flush support). This should be done now, not later, because this stuff is exposed to user-space. I was actually considering using a queue attribute originally, but in my indecision I decided to go with Jen's suggestion. A queue attribute does make sense in that the attribute primarily is there to freeze the queue, but it would also be performing the head park. Would a queue attribute be confusing because of that? I fully agree with Bart here. The only stuff that should be ide-special is the actual command setup and completion check, since that is a hardware property. libata will get a few little helpers for that as well. The rest should be block layer implementation.

	 * Sanity: don't accept a request that isn't a PM request
	 * if we are currently power managed. This is very important as
	 * blk_stop_queue() doesn't prevent the elv_next_request()

@@ -1661,6 +1671,9 @@ int ide_do_drive_cmd (ide_drive_t *drive
 		where = ELEVATOR_INSERT_FRONT;
 		rq->flags |= REQ_PREEMPT;
 	}
+	if (action == ide_next)
+		where = ELEVATOR_INSERT_FRONT;
+
 	__elv_add_request(drive->queue, rq, where, 0);
 	ide_do_request(hwgroup, IDE_NO_IRQ);
 	spin_unlock_irqrestore(&ide_lock, flags);

Why is this needed? I think Jon discussed that in a previous thread, but basically although ide_next is documented in the comment for ide_do_drive_cmd, there isn't (as far as Jon or I could see) anything actually handling it. This patch is carried over from Jon's work and adds the code to handle ide_next by inserting the request at the front of the queue. As per my previous mail, I will ack that bit. Overall, very promising work! Thanks :-), most of it is Jon's work, and Jen's suggestions though. 
My name is Jens, not Jen :-) -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
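The per-device-timer point from the review above can be illustrated with a minimal model (invented names, not the proposed patch): when each drive embeds its own freeze deadline, thawing is per-queue and one drive's timeout cannot clobber another's, which a single file-scope timer cannot guarantee.

```c
#include <assert.h>

/* Toy model of the freeze discussion: one timeout per drive,
 * embedded in the drive structure rather than a file-scope global. */
struct toy_drive {
    int frozen;
    long thaw_deadline;  /* "jiffies" at which the queue auto-thaws */
};

static void drive_freeze(struct toy_drive *d, long now, long secs)
{
    d->frozen = 1;
    d->thaw_deadline = now + secs;
}

/* Stands in for the per-device timer firing. */
static void drive_tick(struct toy_drive *d, long now)
{
    if (d->frozen && now >= d->thaw_deadline)
        d->frozen = 0;  /* timer expired: thaw the queue */
}
```

Freezing two drives for different durations and advancing time shows why the global timer breaks: with one shared timer, the second freeze would overwrite the first drive's deadline.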
Re: [PATCH] libata: error processing + rw 6 byte fix
On Tue, Aug 23 2005, Douglas Gilbert wrote: Jens Axboe wrote: On Mon, Aug 22 2005, Douglas Gilbert wrote:

	if (scsicmd[0] == READ_6 || scsicmd[0] == WRITE_6) {
-		qc->nsect = tf->nsect = scsicmd[4];
+		if (scsicmd[4] == 0) {
+			/*
+			 * For READ_6 and WRITE_6 (only)
+			 * transfer_len==0 -> 256 blocks !!
+			 */
+			if (lba48) {
+				tf->hob_nsect = 1;
+				qc->nsect = 256;
+			} else
+				return 1;

This isn't quite right, for 28-bit lba a 0 sector value means 256 sectors to transfer as well. So just make that:

	if (lba48) {
		tf->hob_nsect = 1;
		qc->nsect = 256;
	}
	/* continue */

and it should work fine. Similarly for 48-bit lba, 0 means 2^16 sectors. Jens, Since for 28-bit lba a 0 sector value means 256 sectors, do I need to check for the lba48 case at all? As proposed to Jeff, is this ok (for READ_6 and WRITE_6):

	if (scsicmd[4] == 0) {
		/*
		 * For READ_6 and WRITE_6 (only)
		 * transfer_len==0 -> 256 blocks !!
		 */
		qc->nsect = 256;
	} else
		qc->nsect = scsicmd[4];
	tf->nsect = scsicmd[4];

This will break for lba48 devices, since if you have scsicmd[4] == 0, a lba48 read/write will want to transfer 65536 sectors instead of the intended 256. Your qc->nsect logic is correct, but you need to set tf->hob_nsect to 1 for lba48 if scsicmd[4] == 0 to correctly tell that command to transfer 256 sectors. Also I noticed while testing the original code with READ_6 (sectors=0) that the device locked up (power cycle required). So given the point you make that for 48-bit lba, 0 means 2^16 sectors, then the READ_10 (sectors=0) and READ_16 (sectors=0) which are valid nops according to SBC-2 may also lock up in libata. Try with the corrected sector counts, should work. I didn't check the other READ_X/WRITE_X, so you should probably audit them as well :) -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
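The zero-length rule being debated can be captured in a standalone sketch (a simplified model, not the actual libata code; `setup_rw6_count` and `toy_tf` are invented names): a SCSI READ_6/WRITE_6 transfer length byte of 0 means 256 sectors, and for an lba48 taskfile that count must be encoded as hob_nsect=1, nsect=0 rather than left as all zeroes, which the device would read as 65536.

```c
#include <assert.h>

struct toy_tf {
    unsigned char nsect;      /* low sector count byte */
    unsigned char hob_nsect;  /* high byte, lba48 only */
};

/* Returns the number of sectors the device will actually transfer.
 * For READ_6/WRITE_6 a SCSI transfer length byte of 0 means 256;
 * 28-bit ATA already defines nsect==0 as 256, but lba48 needs the
 * count spelled out as 0x0100 via hob_nsect. */
static unsigned int setup_rw6_count(unsigned char scsi_len, int lba48,
                                    struct toy_tf *tf)
{
    tf->nsect = scsi_len;
    tf->hob_nsect = 0;
    if (scsi_len == 0 && lba48)
        tf->hob_nsect = 1;  /* (1 << 8) | 0 == 256 sectors */
    return scsi_len ? scsi_len : 256;
}
```

This mirrors the fix that was eventually merged: qc->nsect carries the intended 256 either way, while the taskfile encoding differs between 28-bit and 48-bit addressing.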
Re: ahci, SActive flag, and the HD activity LED
On Wed, Aug 03 2005, Martin Wilck wrote: Jens Axboe wrote: If I am reading the specs correctly, that'd mean the ahci driver is wrong in setting the SActive bit. I completely agree, that was my reading of the spec as well and hence my original posts about this in the NCQ thread. Have you (or has anybody else) also seen the wrong behavior of the activity LED? No, but I have observed that SActive never gets cleared by the device for non-NCQ commands (which is probably what gets you the stuck LED on some systems?), which to me is another indication that we should not be setting the tag bits for those commands. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash in ide_do_request() on card removal
On Tue, Aug 02 2005, Steven Scholz wrote: Jens Axboe wrote: That's not quite true, q is not invalid after this call. It will only be invalid when it is freed (which doesn't happen from here but rather from the blk_cleanup_queue() call when the reference count drops to 0). This is still not perfect, but a lot better. Does it work for you?

--- linux-2.6.12/drivers/ide/ide-disk.c~	2005-08-02 12:48:16.0 +0200
+++ linux-2.6.12/drivers/ide/ide-disk.c	2005-08-02 12:48:32.0 +0200
@@ -1054,6 +1054,7 @@
 	drive->driver_data = NULL;
 	drive->devfs_name[0] = '\0';
 	g->private_data = NULL;
+	g->disk = NULL;
 	put_disk(g);
 	kfree(idkp);
 }

No. drivers/ide/ide-disk.c: In function `ide_disk_release': drivers/ide/ide-disk.c:1057: error: structure has no member named `disk' Eh, typo, should be g->queue of course :-)

--- linux-2.6.12/drivers/ide/ide-disk.c~	2005-08-02 12:48:16.0 +0200
+++ linux-2.6.12/drivers/ide/ide-disk.c	2005-08-02 13:12:54.0 +0200
@@ -1054,6 +1054,7 @@
 	drive->driver_data = NULL;
 	drive->devfs_name[0] = '\0';
 	g->private_data = NULL;
+	g->queue = NULL;
 	put_disk(g);
 	kfree(idkp);
 }

-- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
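The lifetime rule Jens states here (the queue only goes away when blk_cleanup_queue() drops the last reference) is ordinary reference counting; a minimal model (invented names, not the block-layer API):

```c
#include <assert.h>

/* Toy model of queue lifetime: the object is only torn down when the
 * last holder drops its reference, never at the unregister call site. */
struct toy_queue {
    int refcount;
    int freed;  /* stands in for the actual kfree() */
};

static void toy_get(struct toy_queue *q)
{
    q->refcount++;
}

static void toy_put(struct toy_queue *q)
{
    if (--q->refcount == 0)
        q->freed = 1;  /* final put: now, and only now, free it */
}
```

The crash under discussion happens when a driver nulls or frees state that another holder (here, the block layer via the gendisk) still reaches through its own reference: the count says the object must stay valid.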
Re: Crash in ide_do_request() on card removal
On Tue, Aug 02 2005, Steven Scholz wrote: Jens Axboe wrote: On Tue, Aug 02 2005, Steven Scholz wrote: Jens Axboe wrote: That's not quite true, q is not invalid after this call. It will only be invalid when it is freed (which doesn't happen from here but rather from the blk_cleanup_queue() call when the reference count drops to 0). This is still not perfect, but a lot better. Does it work for you? [...] Eh, typo, should be g->queue of course :-)

	drive->driver_data = NULL;
	drive->devfs_name[0] = '\0';
	g->private_data = NULL;
+	g->queue = NULL;
	put_disk(g);
	kfree(idkp);

No. That does not work:

~ # umount /mnt/pcmcia/
generic_make_request(2859) q=c02d3040
__generic_unplug_device(1447) calling q->request_fn() @ c00f97ec
do_ide_request(1281) HWIF=c01dee8c (0), HWGROUP=c089cea0 (1038681856), drive=c01def1c (0, 0), queue=c02d3040 ()
do_ide_request(1287) HWIF is not present anymore!!!
do_ide_request(1291) DRIVE is not present anymore. SKIPPING REQUEST!!!

As you can see, generic_make_request() still has the pointer to that queue! It gets it with q = bdev_get_queue(bio->bi_bdev); so the pointer is still stored somewhere else... Hmmm, perhaps just letting ide end requests when the drive has been removed might be better. The disconnection between the queue cleanup and the gendisk cleanup makes it harder to do it properly. SCSI deals with it the same way, basically. 
-- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash in ide_do_request() on card removal
On Tue, Aug 02 2005, Steven Scholz wrote: Jens Axboe wrote: On Tue, Aug 02 2005, Steven Scholz wrote: Jens Axboe wrote: [...] Hmmm, perhaps just letting ide end requests when the drive has been removed might be better. I don't understand what you mean. 
If requests are issued (e.g. calling umount) after the drive is gone, then I get either a kernel crash, or umount hangs because it waits in __wait_on_buffer() ... No, those waiters will be woken up when ide does an end_request for requests coming in for a device which no longer exists. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash in ide_do_request() on card removal
On Tue, Aug 02 2005, Steven Scholz wrote: Jens Axboe wrote: On Tue, Aug 02 2005, Steven Scholz wrote: Jens Axboe wrote: [...] No, those waiters will be woken up when ide does an end_request for requests coming in for a device which no longer exists. But that would mean generating requests for devices, drives and hwifs that no longer exist. But exactly there it will crash! In do_ide_request() and ide_do_request(). ide doesn't generate the requests, it just receives them for processing. And you want to halt that at the earliest stage possible. Basically the problem you are trying to solve is hacking around the missing hotplug support in drivers/ide. And that will never be pretty. The correct solution would of course be to improve the hotplug support, I think Bart was/is working on that (cc'ing him). ide_unregister() restores some old hwif structure. drive and queue are set to NULL. When I wait long enough between cardctl eject and umount it looks like this:

~ # cardctl eject
ide_release(398)
ide_unregister(585): index=0
ide_unregister(698) old HWIF restored! hwif=c01dee8c (0), hwgroup=c0fac2a0, drive=, queue=
ide_detach(164)
cardmgr[253]: shutting down socket 0
cardmgr[253]: executing: './ide stop hda'
cardmgr[253]: executing: 'modprobe -r ide-cs'
exit_ide_cs(514)
~ # umount /mnt/pcmcia/
sys_umount(494)
generic_make_request(2859) q=c02d3040
__generic_unplug_device(1447) calling q->request_fn() @ c00f97e4
do_ide_request(1279) HWIF=c01dee8c (0), HWGROUP=c0fac2a0 (738987520), drive=c01def1c (0, 0), queue=c02d3040 ()

I don't understand what values you are dumping above, please explain. Is HWIF c01dee8c or 0?

Assertion '(hwif->present)' failed in drivers/ide/ide-io.c:do_ide_request(1284)
Assertion '(drive->present)' failed in drivers/ide/ide-io.c:do_ide_request(1290)
ide_do_request(1133) hwgroup is busy!
ide_do_request(1135) hwif=01000406

The 738987520 above is hwgroup->busy! Obviously completely wrong. 
This seems to be a hint that an invalid pointer is being dereferenced! The pointer hwif=01000406 also does not look very healthy! drive=c01def1c is the result of drive = choose_drive(hwgroup); but can't be, as it was set to NULL before. Yeah, it looks very bad. Same thing with the reference counting, ide should not be freeing various structures that the block layer still holds a reference to. If I don't wait long enough between cardctl eject and umount the kernel crashes with:

~ # cardctl eject; umount /mnt/pcmcia
ide_release(398)
ide_unregister(585): index=0
ide_unregister(698) old HWIF restored! hwif=c01dee8c (0), hwgroup=c0268080, drive=, queue=
ide_detach(164)
cardmgr[253]: shutting down socket 0
cardmgr[253]: executing
Re: Crash in ide_do_request() on card removal
On Tue, Aug 02 2005, Steven Scholz wrote: Jens Axboe wrote: No, those waiters will be woken up when ide does an end_request for requests coming in for a device which no longer exists. But that would mean generating requests for devices, drives and hwifs that no longer exist. But exactly there it will crash! In do_ide_request() and ide_do_request(). ide doesn't generate the requests, it just receives them for processing. I know. And you want to halt that at the earliest stage possible. Agreed. The problem seems to be: a reference to the request queue is stored in struct gendisk. Thus if you unregister a block device you should make sure that no one can still try to access that request queue, right? Well the problem is that ide-cs/ide doesn't handle unplug gracefully. You are trying to fix it in the wrong location, the fix belongs in ide.

~ # umount /mnt/pcmcia/
sys_umount(494)
generic_make_request(2859) q=c02d3040
__generic_unplug_device(1447) calling q->request_fn() @ c00f97e4
do_ide_request(1279) HWIF=c01dee8c (0), HWGROUP=c0fac2a0 (738987520), drive=c01def1c (0, 0), queue=c02d3040 ()

I don't understand what values you are dumping above, please explain. Is HWIF c01dee8c or 0?

	printk("%s(%d) HWIF=%p (%d), HWGROUP=%p (%d), drive=%p (%d, %d), queue=%p (%p)\n",
	       __FUNCTION__, __LINE__, hwif, hwif->present, hwgroup, hwgroup->busy,
	       drive, drive->present, drive->dead, q, drive->queue);

So HWIF is at c01dee8c and hwif->present=0. Ok, so you could kill any request arriving for a !hwif->present hardware interface.

Assertion '(hwif->present)' failed in drivers/ide/ide-io.c:do_ide_request(1284)
Assertion '(drive->present)' failed in drivers/ide/ide-io.c:do_ide_request(1290)
ide_do_request(1133) hwgroup is busy!
ide_do_request(1135) hwif=01000406

The 738987520 above is hwgroup->busy! Obviously completely wrong. This seems to be a hint that an invalid pointer is being dereferenced! The pointer hwif=01000406 also does not look very healthy! drive=c01def1c is the result of drive = choose_drive(hwgroup). Yeah, it looks very bad. 
Same thing with the reference counting, ide should not be freeing various structures that the block layer still holds a reference to. Well, or better: tell the block layer that the drive is gone and it makes no sense to make any requests ... It's not enough! What if requests are already on the queue waiting to be serviced? Again, forget request generation. So how could you generate requests (and handle them sanely) for devices that were removed? Generation is not a problem, that happens outside of your scope. The job of the driver is just to make sure that it plays by the rules and at least makes sure it doesn't crash on its own for an active queue. do_ide_request() could check hwif->present and/or drive->present. Precisely. BUT: at this point the request is already made and the low level block layer is sleeping and waiting for its completion. Which will complete when you error the request, as I wrote a few mails ago. I could not figure out how to kill a request in do_ide_request() and wake up the block layer (sleeping in __wait_on_buffer()). That's why I thought preventing the generation of such requests would be the right way. It's not the right way, it only solves a little part of the problem. Killing a request with an error usually looks like this:

	blkdev_dequeue_request(rq);
	end_that_request_first(rq, 0, rq->hard_nr_sectors);
	end_that_request_last(rq);

-- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash in ide_do_request() on card removal
On Tue, Aug 02 2005, Steven Scholz wrote: Jens Axboe wrote: It's not the right way, it only solves a little part of the problem. Killing a request with an error usually looks like this:

	blkdev_dequeue_request(rq);
	end_that_request_first(rq, 0, rq->hard_nr_sectors);
	end_that_request_last(rq);

How do I get the request? do_ide_request() only gets the complete request_queue_t *q. Shall I use elv_next_request()? Yes. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash in ide_do_request() on card removal
On Tue, Aug 02 2005, Steven Scholz wrote: [PATCH] ide: kill requests when drive is not present anymore Signed-off-by: Steven Scholz [EMAIL PROTECTED] Ok? Change the CHANGELOG at your will. -- Steven

--- linux-2.6.13-rc5/drivers/ide/ide-io.c	2005-06-17 21:48:29.0 +0200
+++ linux-2.6.13-rc4-at91-multiIO/drivers/ide/ide-io.c	2005-08-02 15:46:53.0 +0200
@@ -1230,7 +1257,17 @@
 void do_ide_request(request_queue_t *q)
 {
 	ide_drive_t *drive = q->queuedata;
 
-	ide_do_request(HWGROUP(drive), IDE_NO_IRQ);
+	if (drive->present)
+		ide_do_request(HWGROUP(drive), IDE_NO_IRQ);
+	else {
+		struct request *rq;
+
+		printk(KERN_WARNING "%s: not present, killing requests\n", drive->name);
+		while ((rq = elv_next_request(q)) != NULL) {
+			blkdev_dequeue_request(rq);
+			end_that_request_first(rq, 0, rq->hard_nr_sectors);
+			end_that_request_last(rq);
+		}
+	}
 }
 
 /*

Looks good to me now, that's one item off Bart's list :-) -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ahci, SActive flag, and the HD activity LED
On Tue, Aug 02 2005, Martin Wilck wrote: Hello Jeff, hello Jens, hello everybody, I am referring to the debate about whether or not setting the SActive bit for non-NCQ ATA commands (e.g. http://lkml.org/lkml/2005/5/26/142). In our machines, this behavior of the Linux AHCI driver causes the HD activity LED to stay on all the time. If I apply the attached trivial patch (this is for the RedHat EL4.0-U1 kernel), the LED behaves nicely. Jeff has stated in the above thread that SActive is intentionally used for non-NCQ devices. However I find clear indication in the specs that the SActive flag should be set if and only if tagged queuing is being used, and only for a specified subset of commands that support queuing (http://www.t13.org/docs2005/D1699r1e-ATA8-ACS.pdf, secs. 4.19 and 4.20). The current mainline driver doesn't use queuing. If I am reading the specs correctly, that'd mean the ahci driver is wrong in setting the SActive bit. Could you please comment? Jeff, in particular, could you please give more detail why you say this flag is intentionally used? I completely agree, that was my reading of the spec as well and hence my original posts about this in the NCQ thread. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-ide in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
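The spec reading Martin and Jens agree on can be modelled in a few lines (invented names, not the ahci driver's code): a command's tag bit goes into SActive only when the command is actually an NCQ (queued) command; plain commands leave the register untouched, so the device can clear bits it owns and the activity LED logic stays sane.

```c
#include <assert.h>

/* Toy model of the SActive rule from the thread: set the tag bit only
 * for NCQ commands; non-queued commands must not touch SActive. */
static unsigned int issue_cmd(unsigned int sactive, unsigned int tag,
                              int is_ncq)
{
    if (is_ncq)
        sactive |= 1u << tag;  /* device clears this bit on completion */
    return sactive;
}
```

Under the behavior Jens observed, setting a tag bit for a non-NCQ command leaves a bit in SActive that the device never clears, which matches the stuck-LED symptom.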
Re: Driver for sata adapter promise sata300 tx4
On Mon, Aug 01 2005, Daniel Drake wrote: Otto Meier wrote: My question is also: are these features (NCQ/TCQ) and the higher data rate supported by this modification, or is only the basic feature set of SATA 150 TX4 supported? NCQ support is under development. Search the archives for Jens Axboe's recent patches to support this. I don't know about TCQ. It's done for ahci, because we have documentation. I have no intention of working on NCQ for chipsets where full documentation is not available. But the bulk of the code is the libata core support, adding NCQ support to a sata_* driver should now be fairly trivial (with docs). -- Jens Axboe
Re: Driver for sata adapter promise sata300 tx4
On Mon, Aug 01 2005, Jens Axboe wrote: On Mon, Aug 01 2005, Daniel Drake wrote: Otto Meier wrote: My question is also: are these features (NCQ/TCQ) and the higher data rate supported by this modification, or is only the basic feature set of SATA 150 TX4 supported? NCQ support is under development. Search the archives for Jens Axboe's recent patches to support this. I don't know about TCQ. It's done for ahci, because we have documentation. I have no intention of working on NCQ for chipsets where full documentation is not available. But the bulk of the code is the libata core support, adding NCQ support to a sata_* driver should now be fairly trivial (with docs). Oh, and forget TCQ. It's a completely worthless technology inherited from PATA, whose continued existence can only be explained by some companies having invested money adding firmware support for it already, and/or because it sounds good to marketing, who apparently rely on customers thinking it must be similar to SCSI TCQ because it shares the same name. In reality, they really share nothing. IDE TCQ makes a mockery of the TLA; I hope the people that came up with it bury their heads in shame for having wasted people's time actually trying it out. -- Jens Axboe
Re: Driver for sata adapter promise sata300 tx4
On Mon, Aug 01 2005, Jeff Garzik wrote: Jens Axboe wrote: On Mon, Aug 01 2005, Daniel Drake wrote: Otto Meier wrote: My question is also: are these features (NCQ/TCQ) and the higher data rate supported by this modification, or is only the basic feature set of SATA 150 TX4 supported? NCQ support is under development. Search the archives for Jens Axboe's recent patches to support this. I don't know about TCQ. It's done for ahci, because we have documentation. I have no intention of working on NCQ for chipsets where full documentation is not available. But the bulk of the code is the libata core support, adding NCQ support to a sata_* driver should now be fairly trivial (with docs). I have docs for the Promise NCQ stuff. Once NCQ is fully fleshed out (I haven't wrapped my brain around it in a couple weeks), it shouldn't be difficult to add NCQ support to sata_promise. Excellent! -- Jens Axboe
Re: [git patches] IDE update
On Fri, Jul 08 2005, Ingo Molnar wrote: * Jens Axboe [EMAIL PROTECTED] wrote: But! I used hdparm -t solely, 2.6 was always ~5% faster than 2.4. But using -Tt slowed down the hd speed by about 30%. So it looks like some scheduler interaction, perhaps the memory timing loops gets it marked as batch or something? to check whether that could be the case, could you try: nice -n -20 hdparm -t /dev/hdc does that produce different results? Same result, see my next mail, it turned out to be a read-ahead bug. -- Jens Axboe
Re: Readahead with softraid1
On Fri, Jul 08 2005, Erik Slagter wrote: Hi, I am using softraid 1 on two sata disks and I'm trying to get the best possible performance. IMHO read actions (if properly addressed) should be split over the two drives and performed independently. However, I don't notice anything to back this up. The read performance (with the dreaded hdparm) shows read performance on sda, sdb and md0 exactly the same. I've been playing a bit with readahead and it does matter a bit in that if I disable readahead for sda/sdb completely, the read rate for these goes completely down (to be expected) whilst the read rate on md0 stays the same (also a bit to be expected). Other combinations do not show any significant impact. I also played with the i/o scheduler and nr_requests (as from previous messages here). What am I doing wrong here??? raid1 doesn't split reads, currently you need more than one process doing io to get a performance increase. -- Jens Axboe
Re: Readahead with softraid1
On Fri, Jul 08 2005, Erik Slagter wrote: On Fri, 2005-07-08 at 08:16 -0400, Danny Cox wrote: What am I doing wrong here??? Nothing. I'll take a shot at answering this one instead of lurking this time. Then, I'll crawl back under my rock. The raid1 driver keeps a last visited block for each drive. This is the block number that was most recently read or written by that drive. When a read request arrives, the driver examines each drive for the nearest last visited block to the one requested. Guess what? If the read starts with drive sda, then it will *always* be the one chosen to service the read in the future, because the last visited block number is only one off. This would only change if there are multiple processes performing I/O on the md device. Then, it may switch to another drive. In any case, it will *tend* to stick with the same drive. Did I explain that well, or only muddy the waters? Perfect explanation, thanks (and also Jens!). Is this a design decision or is it fundamentally impossible to split the work amongst several drives? I guess the md driver at least does a prefetch of the next block/chunk on the other drive(s)? Not sure if it's a design decision or just "this works ok, I'll fix it later". Clearly there is a lot of room for improvement in the balancing logic, to get more cases correct/faster. It's quite doable to split a bio and send bits of it to various drives. -- Jens Axboe
Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
On Fri, Jul 08 2005, Tejun Heo wrote: On Fri, Jul 08, 2005 at 10:03:48AM +0200, Jens Axboe wrote: Hi, Ok, one more error, this time from irq context: AHCI: ata1: error irq, status=4001 stat=51 err=04 sstat=0113 serr= ata1: aborting commands due to error. active_tag -1, sactive 0001 ahci: sactive 1 ata1: recovering from error sactive=0 ata1: failed to read log page 10h (-110) ata1: resetting... ata1: started resetting... ata1: end resetting, sstatus=0113 ata1: status=0x01 { Error } ata1: error=0x80 { Sector } SCSI error : 0 0 0 0 return code = 0x802 sda: Current: sense key=0x3 ASC=0x11 ASCQ=0x4 end_request: I/O error, dev sda, sector 2120875 Buffer I/O error on device sda2, logical block 2045 lost page write due to I/O error on sda2 -- Jens Axboe Hi, Jens. I also have a weird lockup log. This log is generated with the second take of the NCQ patchset I posted yesterday. It's a Samsung HD160JJ on ICH7R AHCI. Every command with tag#3 is scrambled, so recovery operations are performed constantly (log 10h, if that fails COMRESET). After a few hours, the drive failed to come online after COMRESET. Rebooting didn't work. BIOS couldn't detect/recover it. I had to power-cycle to make it online again. I'm running a similar test with a much lower error rate (5~6 errors per 3000 requests) to avoid too many COMRESETs and, for more than six hours, it's been running okay. Error log follows. Very strange, the drive must be really buggered. Jens, can you run your test with the new patchset? Sure, I will try. But I will be away from this box for the next two weeks (vacation, then KS/OLS), so I cannot do much about it in the near future... -- Jens Axboe
Re: sata_sil 3112 activity LED patch
On Wed, Jul 06 2005, Jeff Garzik wrote: Aric Cyr wrote: After finally getting fed up with not having my activity light working for my SATA drives, I came up with a small patch (more like a hack) to make it work. It works quite well, but I'm afraid that there are many restrictions that this patch does not check for that it probably should... so consider this a work-in-progress. My information is based on a document from Silicon Image that appears to no longer be available on their website (Sil-AN-0082-E; 8112-0082.pdf). I still have a copy if anyone is interested. There are two restrictions that are not checked: 1) Is the chip a 3112 or 3114? I assume that this would only work on a 3112, but whether it is a bad thing on a 3114 I do not know. 2) BAR5 + 0x54 is apparently used for the flash memory address and data lines. However, for most motherboards (i.e. not add-on cards) with the chip, like my EPOX 8RDA3+, there is no flash memory, so these lines are hijacked as LED GPIO. I assume that this is a common practice for motherboard makers using the sil3112, since Silicon Image went out of their way to produce the above-mentioned document. Anyways, the problem is that this patch does not check if flash memory is installed or not before twiddling with the GPIO lines. This could be extremely bad for people running the 3112 from add-on cards (or any implementation with flash memory installed). Setting the low 8 bits at BAR5+54h seems to enable the LED circuit. It seems that this circuit is patched through into the motherboard, as it lights the regular hard drive light on the front of my case. Setting bits [8:15] at BAR5+54h clears the bits, disabling the LED. I hooked this logic into ata_bmdma_start and ata_bmdma_stop, which were made into simple wrapper functions in sata_sil.c that just set the GPIO bits and call ata_bmdma_*. I don't think it's ugly, necessarily. I do worry about the flash memory stuff, though, which is why I don't want to merge this upstream for now. 
For your patch specifically, it would be nice to follow the coding style that is found in the rest of the driver (single-tab indents, etc., read CodingStyle in the kernel source tree). There's also an existing variant of this in the block layer, the activity_fn, that we use on the ibook/powerbook to use the sleep led as an activity light. Just in case you prefer that to overloading the bmdma start/stop handlers. -- Jens Axboe
Re: sata_sil 3112 activity LED patch
On Thu, Jul 07 2005, Aric Cyr wrote: There's also an existing variant of this in the block layer, the activity_fn, that we use on the ibook/powerbook to use the sleep led as an activity light. Just in case you prefer that to overloading the bmdma start/stop handlers. Your suggestion at first looked to be incredibly nice... until I looked at how much implementation was required. I am considering trying it, but I cannot find a place for an sata driver to call blk_queue_activity_fn() with meaningful parameters during init. On a second look, I guess I would have to override ata_scsi_slave_config() in the driver and hook up the activity light there. This would be fine I guess. Unless I am interpreting this incorrectly, however, I would need to use a timer or something to turn the light back off? I'm probably missing something, so is there a simpler way to do this? Hmm yes, it will require more work for you. It should be cleaned up a little to pass in a START/STOP variable and handle everything in the block layer instead. You probably just want to continue using the bmdma hooks for now, that is actually a fine implementation imo. -- Jens Axboe
Re: [PATCH Linux 2.6.12 00/09] NCQ: generic NCQ completion/error-handling
On Wed, Jul 06 2005, Jens Axboe wrote: On Mon, Jul 04 2005, Jens Axboe wrote: On Fri, Jul 01 2005, Jens Axboe wrote: I converted most of debug messages I've used during development into warning messages when posting the patchset and forgot about it, so I've never posted the debug patch. Sorry about that. Here's a small patch which adds some more messages though. The following patch also adds printk'ing FIS on each command issue in ahci.c:ahci_qc_issue(), if you think it would fill your log excessively, feel free to turn it off. It wouldn't probably matter anyway. I will have to kill the issue part of the patch, that would generate insane amounts of printk traffic :-) I'll boot the kernel and report what happens. It triggered last night, but the old kernel was booted. This was the log: ahci ata1: stat=d0, issuing COMRESET ata1: recovering from error ata1: status=0x01 { Error } ata1: error=0x80 { Sector } SCSI error : 0 0 0 0 return code = 0x802 sda: Current: sense key=0x3 ASC=0x11 ASCQ=0x4 end_request: I/O error, dev sda, sector 66255899 Buffer I/O error on device sda2, logical block 8018923 lost page write due to I/O error on sda2 ata1: status=0x01 { Error } ata1: error=0x80 { Sector } SCSI error : 0 0 0 0 return code = 0x802 sda: Current: sense key=0x3 ASC=0x11 ASCQ=0x4 end_request: I/O error, dev sda, sector 66239043 Buffer I/O error on device sda2, logical block 8016816 lost page write due to I/O error on sda2 ata1: recovering from error ata1: status=0x01 { Error } ata1: error=0x80 { Sector } SCSI error : 0 0 0 0 return code = 0x802 sda: Current: sense key=0x3 ASC=0x11 ASCQ=0x4 end_request: I/O error, dev sda, sector 66239051 Buffer I/O error on device sda2, logical block 8016817 lost page write due to I/O error on sda2 ata1: status=0x01 { Error } ata1: error=0x80 { Sector } SCSI error : 0 0 0 0 return code = 0x802 sda: Current: sense key=0x3 ASC=0x11 ASCQ=0x4 end_request: I/O error, dev sda, sector 35137043 This is with the extra debug. 
Given that it is the timeout triggering, only the sstatus is new. ahci ata1: stat=d0, issuing COMRESET ata1: started resetting... ata1: end resetting, sstatus=0113 ata1: recovering from error ata1: status=0x01 { Error } ata1: error=0x80 { Sector } SCSI error : 0 0 0 0 return code = 0x802 sda: Current: sense key=0x3 ASC=0x11 ASCQ=0x4 end_request: I/O error, dev sda, sector 66190875 Buffer I/O error on device sda2, logical block 8010795 lost page write due to I/O error on sda2 ata1: status=0x01 { Error } ata1: error=0x80 { Sector } SCSI error : 0 0 0 0 return code = 0x802 sda: Current: sense key=0x3 ASC=0x11 ASCQ=0x4 end_request: I/O error, dev sda, sector 66159699 Buffer I/O error on device sda2, logical block 8006898 lost page write due to I/O error on sda2 btw, the reason it hangs here (I suspect) is that your read_log_page() logic is wrong - not every error condition will have NCQ_FAILED set before entering ncq_recover. The timeout will not, for instance. Testing... As usual, this will take days. -- Jens Axboe
Re: [git patches] IDE update
On Tue, Jul 05 2005, Ondrej Zary wrote: Jens Axboe wrote: On Tue, Jul 05 2005, Ondrej Zary wrote: André Tomt wrote: Al Boldi wrote: Bartlomiej Zolnierkiewicz wrote: On 7/4/05, Al Boldi [EMAIL PROTECTED] wrote: hdparm -tT gives 38mb/s in 2.4.31; cat /dev/hda > /dev/null gives 2% user 33% sys 65% idle. hdparm -tT gives 28mb/s in 2.6.12; cat /dev/hda > /dev/null gives 2% user 25% sys 0% idle 73% IOWAIT. That hdparm doesn't get as high scores as in 2.4 is an old, discussed-to-death problem on LKML. So far nobody has been able to show it affects anything but that pretty useless quasi-benchmark. No, it's not a problem with hdparm. hdparm only shows that there is _really_ a problem: 2.6.12 [EMAIL PROTECTED]:/home/rainbow# time dd if=/dev/hda of=/dev/null bs=512 count=1048576 1048576+0 records in 1048576+0 records out real 0m32.339s user 0m1.500s sys 0m14.560s 2.4.26 [EMAIL PROTECTED]:/home/rainbow# time dd if=/dev/hda of=/dev/null bs=512 count=1048576 1048576+0 records in 1048576+0 records out real 0m23.858s user 0m1.750s sys 0m15.180s Perhaps some read-ahead bug. What happens if you use bs=128k for instance? Nothing - it's still the same. [EMAIL PROTECTED]:/home/rainbow# time dd if=/dev/hda of=/dev/null bs=128k count=4096 4096+0 records in 4096+0 records out real 0m32.832s user 0m0.040s sys 0m15.670s Can you post full dmesg of 2.4 and 2.6 kernel boot? What does hdparm -I/-i say for both kernels? -- Jens Axboe
Re: [git patches] IDE update
On Tue, 2005-07-05 at 14:35 +0200, Ondrej Zary wrote: 2.4.26 [EMAIL PROTECTED]:/home/rainbow# time dd if=/dev/hda of=/dev/null bs=512 count=1048576 1048576+0 records in 1048576+0 records out real 0m23.858s user 0m1.750s sys 0m15.180s Perhaps some read-ahead bug. What happens if you use bs=128k for instance? Nothing - it's still the same. [EMAIL PROTECTED]:/home/rainbow# time dd if=/dev/hda of=/dev/null bs=128k count=4096 4096+0 records in 4096+0 records out real 0m32.832s user 0m0.040s sys 0m15.670s Can you post full dmesg of 2.4 and 2.6 kernel boot? What does hdparm -I/-i say for both kernels? The 2.4.26 kernel is the one from the Slackware 10.0 bootable install CD. dmesg outputs attached; hdparm -i and hdparm -I show the same in both kernels (compared using diff) - attached too. Ok, looks alright for both. Your machine is quite slow, perhaps that is showing the slower performance. Can you try and make HZ 100 in 2.6 and test again? 2.6.13-recent has it as a config option, otherwise edit include/asm/param.h appropriately. -- Jens Axboe [EMAIL PROTECTED]
Re: [git patches] IDE update
On Tue, Jul 05 2005, Ondrej Zary wrote: Jens Axboe wrote: On Tue, 2005-07-05 at 15:02 +0200, Ondrej Zary wrote: Ok, looks alright for both. Your machine is quite slow, perhaps that is showing the slower performance. Can you try and make HZ 100 in 2.6 and test again? 2.6.13-recent has it as a config option, otherwise edit include/asm/param.h appropriately. I forgot to write that my 2.6.12 kernel is already compiled with HZ 100 (it makes the system more responsive). I've just tried 2.6.8.1 with HZ 1000 and there is no difference in HDD performance compared to 2.6.12. OK, interesting. You could try and boot with profile=2 and do # readprofile -r # dd if=/dev/hda of=/dev/null bs=128k # readprofile > prof_output for each kernel and post it here, so we can see if anything sticks out. Here are the profiles (used dd with count=4096) from 2.4.26 and 2.6.12 (nothing from 2.6.8.1 because I don't have the .map file anymore). Looks interesting, 2.6 spends oodles of time copying to user space. Let's check if raw reads perform ok, please try and time this app in 2.4 and 2.6 as well. # gcc -Wall -O2 -o oread oread.c # time ./oread /dev/hda -- Jens Axboe

#include <stdio.h>
#include <unistd.h>
#define __USE_GNU
#include <fcntl.h>
#include <stdlib.h>

#define BS (131072)
#define BLOCKS (4096)
#define ALIGN(buf) (char *) (((unsigned long) (buf) + 4095) & ~(4095))

int main(int argc, char *argv[])
{
	char *buffer;
	int fd, i;

	if (argc < 2) {
		printf("%s: device\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd == -1) {
		perror("open");
		return 2;
	}

	buffer = ALIGN(malloc(BS + 4095));

	for (i = 0; i < BLOCKS; i++) {
		int ret = read(fd, buffer, BS);

		if (!ret)
			break;
		else if (ret < 0) {
			perror("read infile");
			break;
		}
	}

	return 0;
}