from:"\"Jens Axboe\""

Re: [PATCHSET #upstream] block/libata: update and use block layer padding and draining

2008-02-08 Thread Jens Axboe

On Fri, Feb 08 2008, Jeff Garzik wrote:
> Jeff Garzik wrote:
> >Tejun Heo wrote:
> >>This patchset updates block layer padding and draining support and
> >>make libata use it.  It's based on James Bottomley's initial work and,
> >>of the five, the last two patches are from James with some
> >>modifications.
> >>
> >>Please read the following thread for more info.
> >>
> >>  http://thread.gmane.org/gmane.linux.scsi/37185
> >>
> >>This patchset is on top of
> >>
> >>  upstream (a6af42fc9a12165136d82206ad52f18c5955ce87)
> >>+ kill-n_iter-and-fix-fsl patch [1]
> >
> >ACK patchset...  lets definitely get these fixes upstream.
> >
> >Once Jens is happy, I would prefer the merge the lot upstream, if that 
> >is OK with everyone involved?
> 
> Jens, ping?
> 
> It's a bug fix, so it would be nice to get this in soonish.  As noted, 
> if all looks good, I would prefer to merge via libata-dev...

I'm ok with it, but lets please merge the block bits through the block
repo, since they are not trivial. Wont be until the week after next,
though.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/7] blk_end_request: full I/O completion handler

2008-02-08 Thread Jens Axboe

On Fri, Feb 08 2008, S, Chandrakala (STSD) wrote:
> Hi,
> 
> Thanks for the information! 
> We would like to know when does the 2.6.25 kernel will be available at
> kernel.org.

So would I, if I locate my crystal ball I'll be sure to let you know :-)

Seriously, given past experience, it's probably sometime in April.

> 
> Thanks,
> Chandrakala 
> 
> -Original Message-
> From: Jens Axboe [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, February 05, 2008 4:09 PM
> To: S, Chandrakala (STSD)
> Cc: Kiyoshi Ueda; [EMAIL PROTECTED];
> [EMAIL PROTECTED]; linux-ide@vger.kernel.org; Miller, Mike (OS
> Dev); [EMAIL PROTECTED]; [EMAIL PROTECTED];
> [EMAIL PROTECTED]
> Subject: Re: [PATCH 0/7] blk_end_request: full I/O completion handler
> 
> On Tue, Feb 05 2008, S, Chandrakala (STSD) wrote:
> > Hello,
> > 
> > We would like to know in which kernel version these patches are 
> > available.
> 
> They were merged after 2.6.24 was released, so they will show up in the
> 2.6.25 kernel.
> 
> --
> Jens Axboe
> 

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/7] blk_end_request: full I/O completion handler

2008-02-05 Thread Jens Axboe

On Tue, Feb 05 2008, S, Chandrakala (STSD) wrote:
> Hello,
> 
> We would like to know in which kernel version these patches are
> available.  

They were merged after 2.6.24 was released, so they will show up in the
2.6.25 kernel.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: kernel BUG at ide-cd.c:1726 in 2.6.24-03863-g0ba6c33 && -g8561b089

2008-01-31 Thread Jens Axboe




On 31/01/2008, at 18.04, Kiyoshi Ueda <[EMAIL PROTECTED]> wrote:


Hi Jens,

On Thu, 31 Jan 2008 14:05:58 +0100, Jens Axboe wrote:

On Thu, Jan 31 2008, Nai Xia wrote:

My dmesg relevant info is quite similar:

[6.875041] Freeing unused kernel memory: 320k freed
[8.143120] ide-cd: rq still having bio: dev hdc: type=2,  
flags=114c8

[8.144439]
[8.144439] sector 10824201199534213, nr/cnr 0/0
[8.144439] bio cf029280, biotail cf029280, buffer , data
, len 158
[8.144439] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
[8.144439] backup: data_len=158  bi_size=158
[8.160756] ide-cd: rq still having bio: dev hdc: type=2,  
flags=114c8

[8.160756]
[8.160756] sector 2669858, nr/cnr 0/0
[8.160756] bio cf029300, biotail cf029300, buffer , data
, len 158
[8.160756] cdb: 12 01 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
[8.160756] backup: data_len=158  bi_size=158
[   14.851101] eth0: link up
[   27.121883] eth0: no IPv6 routers present


And by the way, Kiyoshi,
This can be reproduced in a typical setup vmware workstation 6.02  
with

a vritual IDE cdrom,
in case you wanna catch that with your own eyes. :-)
Thanks for your trying hard to correct this annoying bug.


The below fix should be enough. It's perfectly legal to have leftover
byte counts when the drive signals completion, happens all the time  
for

eg user issued commands where you don't know an exact byte count.

diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 74c6087..bee05a3 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -1722,7 +1722,7 @@ static ide_startstop_t cdrom_newpc_intr 
(ide_drive_t *drive)

*/
   if ((stat & DRQ_STAT) == 0) {
   spin_lock_irqsave(&ide_lock, flags);
-if (__blk_end_request(rq, 0, 0))
+if (__blk_end_request(rq, 0, rq->data_len))
   BUG();
   HWGROUP(drive)->rq = NULL;
   spin_unlock_irqrestore(&ide_lock, flags);


OK, I undarstand the leftover is legal.

By the way, is it safe to always return success if there is a  
leftover?

I thought we might have to complete the rq with -EIO in such case.


data_len being non zero should pass the residual count back to the  
issuer. 
-

To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: kernel BUG at ide-cd.c:1726 in 2.6.24-03863-g0ba6c33 && -g8561b089

2008-01-31 Thread Jens Axboe

On Thu, Jan 31 2008, Florian Lohoff wrote:
> On Thu, Jan 31, 2008 at 02:05:58PM +0100, Jens Axboe wrote:
> > The below fix should be enough. It's perfectly legal to have leftover
> > byte counts when the drive signals completion, happens all the time for
> > eg user issued commands where you don't know an exact byte count.
> > 
> > diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
> > index 74c6087..bee05a3 100644
> > --- a/drivers/ide/ide-cd.c
> > +++ b/drivers/ide/ide-cd.c
> > @@ -1722,7 +1722,7 @@ static ide_startstop_t cdrom_newpc_intr(ide_drive_t 
> > *drive)
> >  */
> > if ((stat & DRQ_STAT) == 0) {
> > spin_lock_irqsave(&ide_lock, flags);
> > -   if (__blk_end_request(rq, 0, 0))
> > +   if (__blk_end_request(rq, 0, rq->data_len))
> > BUG();
> > HWGROUP(drive)->rq = NULL;
> > spin_unlock_irqrestore(&ide_lock, flags);
> > 
> 
> Fixes the crash on boot for me ...

Great, thanks for confirming that. I'll make sure the patch goes
upstream today, if Linus is available.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: kernel BUG at ide-cd.c:1726 in 2.6.24-03863-g0ba6c33 && -g8561b089

2008-01-31 Thread Jens Axboe

On Thu, Jan 31 2008, Nai Xia wrote:
> My dmesg relevant info is quite similar:
> 
> [6.875041] Freeing unused kernel memory: 320k freed
> [8.143120] ide-cd: rq still having bio: dev hdc: type=2, flags=114c8
> [8.144439]
> [8.144439] sector 10824201199534213, nr/cnr 0/0
> [8.144439] bio cf029280, biotail cf029280, buffer , data
> , len 158
> [8.144439] cdb: 12 00 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> [8.144439] backup: data_len=158  bi_size=158
> [8.160756] ide-cd: rq still having bio: dev hdc: type=2, flags=114c8
> [8.160756]
> [8.160756] sector 2669858, nr/cnr 0/0
> [8.160756] bio cf029300, biotail cf029300, buffer , data
> , len 158
> [8.160756] cdb: 12 01 00 00 fe 00 00 00 00 00 00 00 00 00 00 00
> [8.160756] backup: data_len=158  bi_size=158
> [   14.851101] eth0: link up
> [   27.121883] eth0: no IPv6 routers present
> 
> 
> And by the way, Kiyoshi,
> This can be reproduced in a typical setup vmware workstation 6.02 with
> a vritual IDE cdrom,
> in case you wanna catch that with your own eyes. :-)
> Thanks for your trying hard to correct this annoying bug.

The below fix should be enough. It's perfectly legal to have leftover
byte counts when the drive signals completion, happens all the time for
eg user issued commands where you don't know an exact byte count.

diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 74c6087..bee05a3 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -1722,7 +1722,7 @@ static ide_startstop_t cdrom_newpc_intr(ide_drive_t 
*drive)
 */
if ((stat & DRQ_STAT) == 0) {
spin_lock_irqsave(&ide_lock, flags);
-   if (__blk_end_request(rq, 0, 0))
+   if (__blk_end_request(rq, 0, rq->data_len))
BUG();
    HWGROUP(drive)->rq = NULL;
spin_unlock_irqrestore(&ide_lock, flags);

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 41/63] ide-cd: move lba_to_msf() and msf_to_lba() to

2007-12-20 Thread Jens Axboe

On Thu, Dec 20 2007, Bartlomiej Zolnierkiewicz wrote:
> 
> * Move lba_to_msf() and msf_to_lba() to 
>   (use 'u8' type instead of 'byte' while at it).
> 
> * Remove msf_to_lba() copy from drivers/cdrom/cdrom.c.
> 
> Cc: Jens Axboe <[EMAIL PROTECTED]>
> Signed-off-by: Bartlomiej Zolnierkiewicz <[EMAIL PROTECTED]>

Acked-by: Jens Axboe <[EMAIL PROTECTED]>

> ---
>  drivers/cdrom/cdrom.c |6 --
>  drivers/ide/ide-cd.c  |   18 --
>  include/linux/cdrom.h |   14 ++
>  3 files changed, 14 insertions(+), 24 deletions(-)
> 
> Index: b/drivers/cdrom/cdrom.c
> ===
> --- a/drivers/cdrom/cdrom.c
> +++ b/drivers/cdrom/cdrom.c
> @@ -2787,12 +2787,6 @@ int cdrom_ioctl(struct file * file, stru
>   return -ENOSYS;
>  }
>  
> -static inline
> -int msf_to_lba(char m, char s, char f)
> -{
> - return (((m * CD_SECS) + s) * CD_FRAMES + f) - CD_MSF_OFFSET;
> -}
> -
>  /*
>   * Required when we need to use READ_10 to issue other than 2048 block
>   * reads
> Index: b/drivers/ide/ide-cd.c
> ===
> --- a/drivers/ide/ide-cd.c
> +++ b/drivers/ide/ide-cd.c
> @@ -1643,24 +1643,6 @@ void msf_from_bcd (struct atapi_msf *msf
>   msf->frame  = BCD2BIN(msf->frame);
>  }
>  
> -static inline
> -void lba_to_msf (int lba, byte *m, byte *s, byte *f)
> -{
> - lba += CD_MSF_OFFSET;
> - lba &= 0xff;  /* negative lbas use only 24 bits */
> - *m = lba / (CD_SECS * CD_FRAMES);
> - lba %= (CD_SECS * CD_FRAMES);
> - *s = lba / CD_FRAMES;
> - *f = lba % CD_FRAMES;
> -}
> -
> -
> -static inline
> -int msf_to_lba (byte m, byte s, byte f)
> -{
> - return (((m * CD_SECS) + s) * CD_FRAMES + f) - CD_MSF_OFFSET;
> -}
> -
>  static int cdrom_check_status(ide_drive_t *drive, struct request_sense 
> *sense)
>  {
>   struct request req;
> Index: b/include/linux/cdrom.h
> ===
> --- a/include/linux/cdrom.h
> +++ b/include/linux/cdrom.h
> @@ -1184,6 +1184,20 @@ struct media_event_desc {
>  
>  extern int cdrom_get_media_event(struct cdrom_device_info *cdi, struct 
> media_event_desc *med);
>  
> +static inline void lba_to_msf(int lba, u8 *m, u8 *s, u8 *f)
> +{
> + lba += CD_MSF_OFFSET;
> + lba &= 0xff;  /* negative lbas use only 24 bits */
> + *m = lba / (CD_SECS * CD_FRAMES);
> + lba %= (CD_SECS * CD_FRAMES);
> +     *s = lba / CD_FRAMES;
> + *f = lba % CD_FRAMES;
> +}
> +
> +static inline int msf_to_lba(u8 m, u8 s, u8 f)
> +{
> + return (((m * CD_SECS) + s) * CD_FRAMES + f) - CD_MSF_OFFSET;
> +}
>  #endif  /* End of kernel only stuff */ 
>  
>  #endif  /* _LINUX_CDROM_H */

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/3] Add GD-Rom support to the SEGA Dreamcast

2007-12-17 Thread Jens Axboe

On Sun, Dec 16 2007, Adrian McMenamin wrote:
> +static int gdrom_readdisk_dma(int block, int block_cnt, char *buffer)
> +{
> + int err;
> + struct packet_command *read_command;
> + /* release the spin lock but check later
> +  * we're not in the middle of some dma */
> + spin_unlock(&gdrom_lock);
> + ctrl_outl(0x8843407F, GDROM_DMA_ACCESS_CTRL_REG); /* memory setting */
> + ctrl_outl(9, GDROM_DMA_WAIT_REG); /* DMA word setting */
> + ctrl_outl(PHYSADDR(buffer), GDROM_DMA_STARTADDR_REG);
> + ctrl_outl(block_cnt * GDROM_HARD_SECTOR, GDROM_DMA_LENGTH_REG);
> + ctrl_outl(1, GDROM_DMA_DIRECTION_REG);
> + ctrl_outl(1, GDROM_DMA_ENABLE_REG);
> + /* send command */
> + read_command = kzalloc(sizeof(struct packet_command), GFP_KERNEL);
> + if (!read_command)
> + return -ENOMEM;
> + read_command->cmd[0] = 0x30;
> + read_command->cmd[1] = 0x20;
> + read_command->cmd[2] = (block >> 16) & 0xFF;
> + read_command->cmd[3] = (block >> 8) & 0xFF;
> + read_command->cmd[4] = block & 0xFF;
> + read_command->cmd[8] = (block_cnt >> 16) & 0xFF;
> + read_command->cmd[9] = (block_cnt >> 8) & 0xFF;
> + read_command->cmd[10] = block_cnt & 0xFF;
> + /* set for DMA */
> + ctrl_outb(1, GDROM_ERROR_REG);
> + /* other parameters */
> + ctrl_outb(0, GDROM_SECNUM_REG);
> + ctrl_outb(0, GDROM_BCL_REG);
> + ctrl_outb(0, GDROM_BCH_REG);
> + ctrl_outb(0, GDROM_DSEL_REG);
> + ctrl_outb(0, GDROM_INTSEC_REG);
> + /* In multiple DMA transfers need to wait */
> + while (ctrl_inb(GDROM_ALTSTATUS_REG) & 0x80)
> + cpu_relax();
> + ctrl_outb(GDROM_COM_PACKET, GDROM_STATUSCOMMAND_REG);
> + while ((ctrl_inb(GDROM_ALTSTATUS_REG) & 0x88) != 8)
> + cpu_relax(); /* wait for DRQ to be set to 1 */
> + gd.pending = 1;
> + gd.transfer = 1;
> + outsw(PHYSADDR(GDROM_DATA_REG), &read_command->cmd, 6);
> + while (ctrl_inb(GDROM_DMA_STATUS_REG))
> + cpu_relax();
> + ctrl_outb(1, GDROM_DMA_STATUS_REG);
> + /* 5 second error margin here seems more reasonable */
> + wait_event_interruptible_timeout(request_queue, gd.transfer == 0, HZ * 
> 5);
> + err = ctrl_inb(GDROM_DMA_STATUS_REG);
> + err = gd.transfer;
> + gd.transfer = 0;
> + gd.pending = 0;
> + kfree(read_command);
> + spin_lock(&gdrom_lock);
> + return err;
> +}
> +
> +static void gdrom_request_handler_dma(struct request *req)
> +{
> + int err, block, block_cnt;
> + err = 0;
> + block = req->sector/GD_TO_BLK + GD_SESSION_OFFSET;
> + block_cnt = req->nr_sectors/GD_TO_BLK;
> + err = gdrom_readdisk_dma(block, block_cnt, req->buffer);
> + if (unlikely(err)) {
> + end_request(req, 0);
> + spin_unlock(&gdrom_lock);
> + gdrom_getsense(NULL);
> + spin_lock(&gdrom_lock);
> + }
> + else
> + end_request(req, 1);
> +}
> +
> +static void gdrom_request(struct request_queue *rq)
> +{
> + struct request *req;
> + unsigned long pages;
> + pages = rq->backing_dev_info.ra_pages;
> + while ((req = elv_next_request(rq)) != NULL) {
> + if (! blk_fs_request(req)) {
> + printk(KERN_DEBUG "GDROM: Non-fs request ignored\n");
> + end_request(req, 0);
> + }
> + if (rq_data_dir(req)) {
> + printk(KERN_NOTICE "GDROM: Read only device - write 
> request ignored\n");
> + end_request(req, 0);
> + }
> + if (req->nr_sectors) {
> + gdrom_request_handler_dma(req);
> + }
> + }
> +}
> +

Few notes:

- Compare rq_data_dir() with WRITE, don't just assume that any non-zero
  will be a write.

- You need to offload this request handling to a workqueue of some sort,
  your current request handling is very broken for two reasons: One is
  that interrupts are still disabled when you drop your queue lock, so
  you cannot use sleeping functions like GFP_KERNEL allocations or
  wait_event(). The other is that it's illegal to sleep from your
  request_fn context in the first place, since you could be stalling
  others.

- You also seem to be busy waiting for other transactions to finish. Any
  idea how long those might take? Perhaps put an upper bound on this
  waiting, and/or do blocking waits?

- I'm assuming this hardware can't do sg transfers?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe

On Thu, Dec 13 2007, Andrew Morton wrote:
> On Thu, 13 Dec 2007 21:09:59 +0100
> Jens Axboe <[EMAIL PROTECTED]> wrote:
> 
> >
> > OK, it's a vm issue,
> 
> cc linux-mm and probable culprit.
> 
> >  I have tens of thousand "backward" pages after a
> > boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not
> > reverse. So it looks like that bug got reintroduced.
> 
> Bill Irwin fixed this a couple of years back: changed the page allocator so
> that it mostly hands out pages in ascending physical-address order.
> 
> I guess we broke that, quite possibly in Mel's page allocator rework.
> 
> It would help if you could provide us with a simple recipe for
> demonstrating this problem, please.

Basically anything involving IO :-). A boot here showed a handful of
good merges, and probably in the order of 100,000 descending
allocations. A kernel make is a fine test as well.

Something like the below should work fine - if you see oodles of these
basicaly doing any type of IO, then you are screwed.

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index e30b1a4..8ce3fcc 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1349,6 +1349,10 @@ new_segment:
sg = sg_next(sg);
}
 
+   if (bvprv) {
+   if (page_address(bvec->bv_page) + PAGE_SIZE == 
page_address(bvprv->bv_page) && printk_ratelimit())
+   printk("page alloc order backwards\n");
+   }
    sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
nsegs++;
}

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe

On Thu, Dec 13 2007, Mark Lord wrote:
> Jens Axboe wrote:
> >On Thu, Dec 13 2007, Mark Lord wrote:
> >>Jens Axboe wrote:
> >>>On Thu, Dec 13 2007, Jens Axboe wrote:
> >>>>On Thu, Dec 13 2007, Mark Lord wrote:
> >>>>>Jens Axboe wrote:
> >>>>>>On Thu, Dec 13 2007, Mark Lord wrote:
> >>>>>>>Mark Lord wrote:
> >>>>>>>>Jens Axboe wrote:
> >>>>>>>>>On Thu, Dec 13 2007, Mark Lord wrote:
> >>>>>>>>>>Matthew Wilcox wrote:
> >>>>>>>>>>>On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
> >>>>>>>>>>>>Problem confirmed.  2.6.23.8 regularly generates segments up to 
> >>>>>>>>>>>>64KB for libata,
> >>>>>>>>>>>>but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
> >>>>>>>>>>>Just a suspicion ... could this be slab vs slub?  ie check your 
> >>>>>>>>>>>configs
> >>>>>>>>>>>are the same / similar between the two kernels.
> >>>>>>>>>>..
> >>>>>>>>>>
> >>>>>>>>>>Mmmm.. a good thought, that one.
> >>>>>>>>>>But I just rechecked, and both have CONFIG_SLAB=y
> >>>>>>>>>>
> >>>>>>>>>>My guess is that something got changed around when Jens
> >>>>>>>>>>reworked the block layer for 2.6.24.
> >>>>>>>>>>I'm going to dig around in there now.
> >>>>>>>>>I didn't rework the block layer for 2.6.24 :-). The core block 
> >>>>>>>>>layer
> >>>>>>>>>changes since 2.6.23 are:
> >>>>>>>>>
> >>>>>>>>>- Support for empty barriers. Not a likely candidate.
> >>>>>>>>>- Shared tag queue fixes. Totally unlikely.
> >>>>>>>>>- sg chaining support. Not likely.
> >>>>>>>>>- The bio changes from Neil. Of the bunch, the most likely 
> >>>>>>>>>suspects in
> >>>>>>>>>this area, since it changes some of the code involved with merges 
> >>>>>>>>>and
> >>>>>>>>>blk_rq_map_sg().
> >>>>>>>>>- Lots of simple stuff, again very unlikely.
> >>>>>>>>>
> >>>>>>>>>Anyway, it sounds odd for this to be a block layer problem if you 
> >>>>>>>>>do see
> >>>>>>>>>occasional segments being merged. So it sounds more like the input 
> >>>>>>>>>data
> >>>>>>>>>having changed.
> >>>>>>>>>
> >>>>>>>>>Why not just bisect it?
> >>>>>>>>..
> >>>>>>>>
> >>>>>>>>Because the early 2.6.24 series failed to boot on this machine
> >>>>>>>>due to bugs in the block layer -- so the code that caused this 
> >>>>>>>>regression
> >>>>>>>>is probably in the stuff from before the kernels became usable here.
> >>>>>>>..
> >>>>>>>
> >>>>>>>That sounds more harsh than intended --> the earlier 2.6.24 kernels 
> >>>>>>>(up to
> >>>>>>>the first couple of -rc* ones failed here because of 
> >>>>>>>incompatibilities
> >>>>>>>between the block/bio changes and libata.
> >>>>>>>
> >>>>>>>That's better, I think! 
> >>>>>>No worries, I didn't pick it up as harsh just as an odd conclusion :-)
> >>>>>>
> >>>>>>If I were you, I'd just start from the first -rc that booted for you. 
> >>>>>>If
> >>>>>>THAT has the bug, then we'll think of something else. If you don't get
> >>>>>>anywhere, I can run some tests tomorrow and see if I can reproduce it
> >>>>>>here.
> >>>>>..
> >>>>>
> >>>>>I believe that *anyone* can reproduce it, since it's broken long befo

Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe

On Thu, Dec 13 2007, Mark Lord wrote:
> Jens Axboe wrote:
> >On Thu, Dec 13 2007, Jens Axboe wrote:
> >>On Thu, Dec 13 2007, Mark Lord wrote:
> >>>Jens Axboe wrote:
> >>>>On Thu, Dec 13 2007, Mark Lord wrote:
> >>>>>Mark Lord wrote:
> >>>>>>Jens Axboe wrote:
> >>>>>>>On Thu, Dec 13 2007, Mark Lord wrote:
> >>>>>>>>Matthew Wilcox wrote:
> >>>>>>>>>On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
> >>>>>>>>>>Problem confirmed.  2.6.23.8 regularly generates segments up to 
> >>>>>>>>>>64KB for libata,
> >>>>>>>>>>but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
> >>>>>>>>>Just a suspicion ... could this be slab vs slub?  ie check your 
> >>>>>>>>>configs
> >>>>>>>>>are the same / similar between the two kernels.
> >>>>>>>>..
> >>>>>>>>
> >>>>>>>>Mmmm.. a good thought, that one.
> >>>>>>>>But I just rechecked, and both have CONFIG_SLAB=y
> >>>>>>>>
> >>>>>>>>My guess is that something got changed around when Jens
> >>>>>>>>reworked the block layer for 2.6.24.
> >>>>>>>>I'm going to dig around in there now.
> >>>>>>>I didn't rework the block layer for 2.6.24 :-). The core block layer
> >>>>>>>changes since 2.6.23 are:
> >>>>>>>
> >>>>>>>- Support for empty barriers. Not a likely candidate.
> >>>>>>>- Shared tag queue fixes. Totally unlikely.
> >>>>>>>- sg chaining support. Not likely.
> >>>>>>>- The bio changes from Neil. Of the bunch, the most likely suspects 
> >>>>>>>in
> >>>>>>>this area, since it changes some of the code involved with merges and
> >>>>>>>blk_rq_map_sg().
> >>>>>>>- Lots of simple stuff, again very unlikely.
> >>>>>>>
> >>>>>>>Anyway, it sounds odd for this to be a block layer problem if you do 
> >>>>>>>see
> >>>>>>>occasional segments being merged. So it sounds more like the input 
> >>>>>>>data
> >>>>>>>having changed.
> >>>>>>>
> >>>>>>>Why not just bisect it?
> >>>>>>..
> >>>>>>
> >>>>>>Because the early 2.6.24 series failed to boot on this machine
> >>>>>>due to bugs in the block layer -- so the code that caused this 
> >>>>>>regression
> >>>>>>is probably in the stuff from before the kernels became usable here.
> >>>>>..
> >>>>>
> >>>>>That sounds more harsh than intended --> the earlier 2.6.24 kernels 
> >>>>>(up to
> >>>>>the first couple of -rc* ones failed here because of incompatibilities
> >>>>>between the block/bio changes and libata.
> >>>>>
> >>>>>That's better, I think! 
> >>>>No worries, I didn't pick it up as harsh just as an odd conclusion :-)
> >>>>
> >>>>If I were you, I'd just start from the first -rc that booted for you. If
> >>>>THAT has the bug, then we'll think of something else. If you don't get
> >>>>anywhere, I can run some tests tomorrow and see if I can reproduce it
> >>>>here.
> >>>..
> >>>
> >>>I believe that *anyone* can reproduce it, since it's broken long before
> >>>the requests ever get to SCSI or libata.  Which also means that *anyone*
> >>>who wants to can bisect it, as well.
> >>>
> >>>I don't do "bisects".
> >>It was just a suggestion on how to narrow it down, do as you see fit.
> >>
> >>>But I will dig a bit more and see if I can find the culprit.
> >>Sure, I'll dig around as well.
> >
> >Just tried something simple. I only see one 12kb segment so far, so not
> >a lot by any stretch. I also DONT see any missed merges signs, so it
> >would appear that the pages in the request are simply not contigious
> >physically.
> >
> >diff --git a

Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe

On Thu, Dec 13 2007, Mark Lord wrote:
> Jens Axboe wrote:
> >On Thu, Dec 13 2007, Mark Lord wrote:
> >>Jens Axboe wrote:
> >>>On Thu, Dec 13 2007, Mark Lord wrote:
> >>>>Mark Lord wrote:
> >>>>>Jens Axboe wrote:
> >>>>>>On Thu, Dec 13 2007, Mark Lord wrote:
> >>>>>>>Matthew Wilcox wrote:
> >>>>>>>>On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
> >>>>>>>>>Problem confirmed.  2.6.23.8 regularly generates segments up to 
> >>>>>>>>>64KB for libata,
> >>>>>>>>>but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
> >>>>>>>>Just a suspicion ... could this be slab vs slub?  ie check your 
> >>>>>>>>configs
> >>>>>>>>are the same / similar between the two kernels.
> >>>>>>>..
> >>>>>>>
> >>>>>>>Mmmm.. a good thought, that one.
> >>>>>>>But I just rechecked, and both have CONFIG_SLAB=y
> >>>>>>>
> >>>>>>>My guess is that something got changed around when Jens
> >>>>>>>reworked the block layer for 2.6.24.
> >>>>>>>I'm going to dig around in there now.
> >>>>>>I didn't rework the block layer for 2.6.24 :-). The core block layer
> >>>>>>changes since 2.6.23 are:
> >>>>>>
> >>>>>>- Support for empty barriers. Not a likely candidate.
> >>>>>>- Shared tag queue fixes. Totally unlikely.
> >>>>>>- sg chaining support. Not likely.
> >>>>>>- The bio changes from Neil. Of the bunch, the most likely suspects in
> >>>>>>this area, since it changes some of the code involved with merges and
> >>>>>>blk_rq_map_sg().
> >>>>>>- Lots of simple stuff, again very unlikely.
> >>>>>>
> >>>>>>Anyway, it sounds odd for this to be a block layer problem if you do 
> >>>>>>see
> >>>>>>occasional segments being merged. So it sounds more like the input 
> >>>>>>data
> >>>>>>having changed.
> >>>>>>
> >>>>>>Why not just bisect it?
> >>>>>..
> >>>>>
> >>>>>Because the early 2.6.24 series failed to boot on this machine
> >>>>>due to bugs in the block layer -- so the code that caused this 
> >>>>>regression
> >>>>>is probably in the stuff from before the kernels became usable here.
> >>>>..
> >>>>
> >>>>That sounds more harsh than intended --> the earlier 2.6.24 kernels (up 
> >>>>to
> >>>>the first couple of -rc* ones failed here because of incompatibilities
> >>>>between the block/bio changes and libata.
> >>>>
> >>>>That's better, I think! 
> >>>No worries, I didn't pick it up as harsh just as an odd conclusion :-)
> >>>
> >>>If I were you, I'd just start from the first -rc that booted for you. If
> >>>THAT has the bug, then we'll think of something else. If you don't get
> >>>anywhere, I can run some tests tomorrow and see if I can reproduce it
> >>>here.
> >>..
> >>
> >>I believe that *anyone* can reproduce it, since it's broken long before
> >>the requests ever get to SCSI or libata.  Which also means that *anyone*
> >>who wants to can bisect it, as well.
> >>
> >>I don't do "bisects".
> >
> >It was just a suggestion on how to narrow it down, do as you see fit.
> >
> >>But I will dig a bit more and see if I can find the culprit.
> >
> >Sure, I'll dig around as well.
> ..
> 
> I wonder if it's 9dfa52831e96194b8649613e3131baa2c109f7dc:
> "Merge blk_recount_segments into blk_recalc_rq_segments" ?
> 
> That particular commit does some rather innocent code-shuffling,
> but also introduces a couple of new "if (nr_hw_segs == 1" conditions
> that were not there before.

You can try and revert it of course, but I think you are looking at the
wrong bits. If the segment counts were totally off, you'd never be
anywhere close to reaching the set limit. Your problems seems to be
missed contig segment merges.

> Okay git experts:  how do I pull out a kernel at the point of this exact 
> commit ?

Dummy approach - git log and grep for
9dfa52831e96194b8649613e3131baa2c109f7dc, then see what commit is before
that. Then do a git checkout commit.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe

On Thu, Dec 13 2007, Jens Axboe wrote:
> On Thu, Dec 13 2007, Mark Lord wrote:
> > Jens Axboe wrote:
> > >On Thu, Dec 13 2007, Mark Lord wrote:
> > >>Mark Lord wrote:
> > >>>Jens Axboe wrote:
> > >>>>On Thu, Dec 13 2007, Mark Lord wrote:
> > >>>>>Matthew Wilcox wrote:
> > >>>>>>On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
> > >>>>>>>Problem confirmed.  2.6.23.8 regularly generates segments up to 
> > >>>>>>>64KB for libata,
> > >>>>>>>but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
> > >>>>>>Just a suspicion ... could this be slab vs slub?  ie check your 
> > >>>>>>configs
> > >>>>>>are the same / similar between the two kernels.
> > >>>>>..
> > >>>>>
> > >>>>>Mmmm.. a good thought, that one.
> > >>>>>But I just rechecked, and both have CONFIG_SLAB=y
> > >>>>>
> > >>>>>My guess is that something got changed around when Jens
> > >>>>>reworked the block layer for 2.6.24.
> > >>>>>I'm going to dig around in there now.
> > >>>>I didn't rework the block layer for 2.6.24 :-). The core block layer
> > >>>>changes since 2.6.23 are:
> > >>>>
> > >>>>- Support for empty barriers. Not a likely candidate.
> > >>>>- Shared tag queue fixes. Totally unlikely.
> > >>>>- sg chaining support. Not likely.
> > >>>>- The bio changes from Neil. Of the bunch, the most likely suspects in
> > >>>> this area, since it changes some of the code involved with merges and
> > >>>> blk_rq_map_sg().
> > >>>>- Lots of simple stuff, again very unlikely.
> > >>>>
> > >>>>Anyway, it sounds odd for this to be a block layer problem if you do see
> > >>>>occasional segments being merged. So it sounds more like the input data
> > >>>>having changed.
> > >>>>
> > >>>>Why not just bisect it?
> > >>>..
> > >>>
> > >>>Because the early 2.6.24 series failed to boot on this machine
> > >>>due to bugs in the block layer -- so the code that caused this regression
> > >>>is probably in the stuff from before the kernels became usable here.
> > >>..
> > >>
> > >>That sounds more harsh than intended --> the earlier 2.6.24 kernels (up to
> > >>the first couple of -rc* ones failed here because of incompatibilities
> > >>between the block/bio changes and libata.
> > >>
> > >>That's better, I think! 
> > >
> > >No worries, I didn't pick it up as harsh just as an odd conclusion :-)
> > >
> > >If I were you, I'd just start from the first -rc that booted for you. If
> > >THAT has the bug, then we'll think of something else. If you don't get
> > >anywhere, I can run some tests tomorrow and see if I can reproduce it
> > >here.
> > ..
> > 
> > I believe that *anyone* can reproduce it, since it's broken long before
> > the requests ever get to SCSI or libata.  Which also means that *anyone*
> > who wants to can bisect it, as well.
> > 
> > I don't do "bisects".
> 
> It was just a suggestion on how to narrow it down, do as you see fit.
> 
> > But I will dig a bit more and see if I can find the culprit.
> 
> Sure, I'll dig around as well.

Just tried something simple. I only see one 12kb segment so far, so not
a lot by any stretch. I also DONT see any missed merges signs, so it
would appear that the pages in the request are simply not contigious
physically.

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index e30b1a4..1e34b6f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1330,6 +1330,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request 
*rq,
goto new_segment;
 
sg->length += nbytes;
+   if (sg->length > 8192)
+   printk("sg_len=%d\n", sg->length);
} else {
 new_segment:
if (!sg)
@@ -1349,6 +1351,8 @@ new_segment:
sg = sg_next(sg);
}
 
+   if (bvprv && (page_address(bvprv->bv_page) + 
bvprv->bv_len == page_address(bvec->bv_page)))
+   printk("missed merge\n");
sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
nsegs++;
}

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe

On Thu, Dec 13 2007, Mark Lord wrote:
> Jens Axboe wrote:
> >On Thu, Dec 13 2007, Mark Lord wrote:
> >>Mark Lord wrote:
> >>>Jens Axboe wrote:
> >>>>On Thu, Dec 13 2007, Mark Lord wrote:
> >>>>>Matthew Wilcox wrote:
> >>>>>>On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
> >>>>>>>Problem confirmed.  2.6.23.8 regularly generates segments up to 
> >>>>>>>64KB for libata,
> >>>>>>>but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
> >>>>>>Just a suspicion ... could this be slab vs slub?  ie check your 
> >>>>>>configs
> >>>>>>are the same / similar between the two kernels.
> >>>>>..
> >>>>>
> >>>>>Mmmm.. a good thought, that one.
> >>>>>But I just rechecked, and both have CONFIG_SLAB=y
> >>>>>
> >>>>>My guess is that something got changed around when Jens
> >>>>>reworked the block layer for 2.6.24.
> >>>>>I'm going to dig around in there now.
> >>>>I didn't rework the block layer for 2.6.24 :-). The core block layer
> >>>>changes since 2.6.23 are:
> >>>>
> >>>>- Support for empty barriers. Not a likely candidate.
> >>>>- Shared tag queue fixes. Totally unlikely.
> >>>>- sg chaining support. Not likely.
> >>>>- The bio changes from Neil. Of the bunch, the most likely suspects in
> >>>> this area, since it changes some of the code involved with merges and
> >>>> blk_rq_map_sg().
> >>>>- Lots of simple stuff, again very unlikely.
> >>>>
> >>>>Anyway, it sounds odd for this to be a block layer problem if you do see
> >>>>occasional segments being merged. So it sounds more like the input data
> >>>>having changed.
> >>>>
> >>>>Why not just bisect it?
> >>>..
> >>>
> >>>Because the early 2.6.24 series failed to boot on this machine
> >>>due to bugs in the block layer -- so the code that caused this regression
> >>>is probably in the stuff from before the kernels became usable here.
> >>..
> >>
> >>That sounds more harsh than intended --> the earlier 2.6.24 kernels (up to
> >>the first couple of -rc* ones failed here because of incompatibilities
> >>between the block/bio changes and libata.
> >>
> >>That's better, I think! 
> >
> >No worries, I didn't pick it up as harsh just as an odd conclusion :-)
> >
> >If I were you, I'd just start from the first -rc that booted for you. If
> >THAT has the bug, then we'll think of something else. If you don't get
> >anywhere, I can run some tests tomorrow and see if I can reproduce it
> >here.
> ..
> 
> I believe that *anyone* can reproduce it, since it's broken long before
> the requests ever get to SCSI or libata.  Which also means that *anyone*
> who wants to can bisect it, as well.
> 
> I don't do "bisects".

It was just a suggestion on how to narrow it down, do as you see fit.

> But I will dig a bit more and see if I can find the culprit.

Sure, I'll dig around as well.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe

On Thu, Dec 13 2007, Mark Lord wrote:
> Mark Lord wrote:
> >Jens Axboe wrote:
> >>On Thu, Dec 13 2007, Mark Lord wrote:
> >>>Matthew Wilcox wrote:
> >>>>On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
> >>>>>Problem confirmed.  2.6.23.8 regularly generates segments up to 
> >>>>>64KB for libata,
> >>>>>but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
> >>>>Just a suspicion ... could this be slab vs slub?  ie check your configs
> >>>>are the same / similar between the two kernels.
> >>>..
> >>>
> >>>Mmmm.. a good thought, that one.
> >>>But I just rechecked, and both have CONFIG_SLAB=y
> >>>
> >>>My guess is that something got changed around when Jens
> >>>reworked the block layer for 2.6.24.
> >>>I'm going to dig around in there now.
> >>
> >>I didn't rework the block layer for 2.6.24 :-). The core block layer
> >>changes since 2.6.23 are:
> >>
> >>- Support for empty barriers. Not a likely candidate.
> >>- Shared tag queue fixes. Totally unlikely.
> >>- sg chaining support. Not likely.
> >>- The bio changes from Neil. Of the bunch, the most likely suspects in
> >>  this area, since it changes some of the code involved with merges and
> >>  blk_rq_map_sg().
> >>- Lots of simple stuff, again very unlikely.
> >>
> >>Anyway, it sounds odd for this to be a block layer problem if you do see
> >>occasional segments being merged. So it sounds more like the input data
> >>having changed.
> >>
> >>Why not just bisect it?
> >..
> >
> >Because the early 2.6.24 series failed to boot on this machine
> >due to bugs in the block layer -- so the code that caused this regression
> >is probably in the stuff from before the kernels became usable here.
> ..
> 
> That sounds more harsh than intended --> the earlier 2.6.24 kernels (up to
> the first couple of -rc* ones failed here because of incompatibilities
> between the block/bio changes and libata.
> 
> That's better, I think! 

No worries, I didn't pick it up as harsh just as an odd conclusion :-)

If I were you, I'd just start from the first -rc that booted for you. If
THAT has the bug, then we'll think of something else. If you don't get
anywhere, I can run some tests tomorrow and see if I can reproduce it
here.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe

On Thu, Dec 13 2007, Mark Lord wrote:
> Jens Axboe wrote:
> >On Thu, Dec 13 2007, Mark Lord wrote:
> >>Matthew Wilcox wrote:
> >>>On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
> >>>>Problem confirmed.  2.6.23.8 regularly generates segments up to 64KB 
> >>>>for libata,
> >>>>but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
> >>>Just a suspicion ... could this be slab vs slub?  ie check your configs
> >>>are the same / similar between the two kernels.
> >>..
> >>
> >>Mmmm.. a good thought, that one.
> >>But I just rechecked, and both have CONFIG_SLAB=y
> >>
> >>My guess is that something got changed around when Jens
> >>reworked the block layer for 2.6.24.
> >>I'm going to dig around in there now.
> >
> >I didn't rework the block layer for 2.6.24 :-). The core block layer
> >changes since 2.6.23 are:
> >
> >- Support for empty barriers. Not a likely candidate.
> >- Shared tag queue fixes. Totally unlikely.
> >- sg chaining support. Not likely.
> >- The bio changes from Neil. Of the bunch, the most likely suspects in
> >  this area, since it changes some of the code involved with merges and
> >  blk_rq_map_sg().
> >- Lots of simple stuff, again very unlikely.
> >
> >Anyway, it sounds odd for this to be a block layer problem if you do see
> >occasional segments being merged. So it sounds more like the input data
> >having changed.
> >
> >Why not just bisect it?
> ..
> 
> Because the early 2.6.24 series failed to boot on this machine
> due to bugs in the block layer -- so the code that caused this regression
> is probably in the stuff from before the kernels became usable here.

That would be the sg chain stuff, I don't think there's any correlation
between the two "bugs" (ie I don't get how you jump to the conclusion
that this regression is from stuff before that). Just go back as early
as you can, you could even just start with a -rc bisect.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Jens Axboe

On Thu, Dec 13 2007, Mark Lord wrote:
> Matthew Wilcox wrote:
> >On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
> >>Problem confirmed.  2.6.23.8 regularly generates segments up to 64KB for 
> >>libata,
> >>but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
> >
> >Just a suspicion ... could this be slab vs slub?  ie check your configs
> >are the same / similar between the two kernels.
> ..
> 
> Mmmm.. a good thought, that one.
> But I just rechecked, and both have CONFIG_SLAB=y
> 
> My guess is that something got changed around when Jens
> reworked the block layer for 2.6.24.
> I'm going to dig around in there now.

I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects in
  this area, since it changes some of the code involved with merges and
  blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do see
occasional segments being merged. So it sounds more like the input data
having changed.

Why not just bisect it?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 00/30] blk_end_request: full I/O completion handler (take 4)

2007-12-12 Thread Jens Axboe

On Tue, Dec 11 2007, Kiyoshi Ueda wrote:
> Hi Jens, Boaz,
> 
> The following is the updated patch-set for blk_end_request().
> I have done some interface/implementation changes based on
> feedbacks/discussions since the previous version.
> (Although this patch-set was made on top of 2.6.24-rc4, I confirmed
>  it can be applied to 2.6.24-rc5, too.  Also, I confirmed it has
>  no build error on my IA64 box.)

Trying to apply this, but it fails from patch #24 and on:

patching file block/ll_rw_blk.c
Hunk #1 succeeded at 3852 (offset 49 lines).
Hunk #2 succeeded at 3867 with fuzz 2 (offset 49 lines).
Hunk #3 succeeded at 3904 with fuzz 1 (offset 66 lines).
Hunk #4 FAILED at 3916.
Hunk #5 succeeded at 3967 with fuzz 2 (offset 60 lines).
1 out of 5 hunks FAILED -- saving rejects to file block/ll_rw_blk.c.rej
patching file include/linux/blkdev.h
Reversed (or previously applied) patch detected!  Assume -R? [n] 

This is 2.6.24-rc5 (pretty close to at least, no changes to ll_rw_blk.c
from that version). Are you using quilt and forgot to refresh, or
something like that?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Bug: get EXT3-fs error Allocating block in system zone

2007-12-10 Thread Jens Axboe

On Mon, Dec 10 2007, Marco Gatti wrote:
> Jens Axboe schrieb:
> >>>>Hello Jens,
> >>>>Thanks for help. I just applied the patch. Unfortunately it doesn't 
> >>>>work.
> >>>Can you try and additionally boot with iommu=off as a boot parameter?
> >>>
> >>Yes. This is the end of getting any sata devices. See screenshots for 
> >>errors. It continued untill ata4. At the end no root device was found.
> >
> >Hmm, even though the address is set to 0x we still seem to
> >receive requests outside that range. Lets assume it's the scsi logic,
> >can you test this? IOW, patch + iommu=off + this patch.
> >
> >I probably wont see any more mails tonight, we can continue this
> >tomorrow (or someone else can step in, whatever comes first :-)
> >
> >diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> >index 2faced6..769ce3a 100644
> >--- a/drivers/scsi/scsi_lib.c
> >+++ b/drivers/scsi/scsi_lib.c
> >@@ -1572,7 +1572,9 @@ struct request_queue *__scsi_alloc_queue(struct 
> >Scsi_Host *shost,
> > #endif
> > 
> > blk_queue_max_sectors(q, shost->max_sectors);
> >+#if 0
> > blk_queue_bounce_limit(q, scsi_calculate_bounce_limit(shost));
> >+#else
> > blk_queue_segment_boundary(q, shost->dma_boundary);
> > 
> > if (!shost->use_clustering)
> I applied the path. Got Hunk #1 succeeded at 1562 with fuzz 2 (offset 
> -10 lines).
> 
> I didn't compile completly.
> 
> drivers/scsi/scsi_lib.c:1565:1: error: unterminated #else
> make[2]: *** [drivers/scsi/scsi_lib.o] Error 1
> make[1]: *** [drivers/scsi] Error 2
> make: *** [drivers] Error 2

Doh sorry, that #else wants to be an #endif

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SAS v SATA interface performance

2007-12-10 Thread Jens Axboe

On Mon, Dec 10 2007, Tejun Heo wrote:
> There's one thing we can do to improve the situation tho.  Several
> drives including raptors and 7200.11s suffer serious performance hit if
> sequential transfer is performed by multiple NCQ commands.  My 7200.11
> can do > 100MB/s if non-NCQ command is used or only upto two NCQ
> commands are issued; however, if all 31 (maximum currently supported by
> libata) are used, the transfer rate drops to miserable 70MB/s.
> 
> It seems that what we need to do is not issuing too many commands to one
> sequential stream.  In fact, there isn't much to gain by issuing more
> than two commands to one sequential stream.

Well... CFQ wont go to deep queue depths across processes if they are
doing streaming IO, but it wont stop a single process from doing so. I'd
like to know what real life process would issue a streaming IO in some
async manner as to get 31 pending commands sequentially? Not very likely
:-)

So I'd consider your case above a microbenchmark results. I'd also claim
that the firmware is very crappy, if it performs like described.

There's another possibility as well - that the queueing by the drive
generates a worse issue IO pattern, and that is why the performance
drops. Did you check with blktrace what the generated IO looks like?

> Both raptors and 7200.11 perform noticeably better on random workload
> with NCQ enabled.  So, it's about time to update IO schedulers
> accordingly, it seems.

Definitely. Again microbenchmarks are able to show 30-40% improvements
when I last tested. That's a pure random workload though, again not
something that you would see in real life.

I tend to always run with a depth around 4 here. It seems to be a good
value, you get some benefits from NCQ but you don't allow the drive
firmware to screw you over.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Bug: get EXT3-fs error Allocating block in system zone

2007-12-09 Thread Jens Axboe

On Sun, Dec 09 2007, Marco Gatti wrote:
> Jens Axboe schrieb:
> >On Sun, Dec 09 2007, Marco Gatti wrote:
> >>Jens Axboe schrieb:
> >>>Was just thinking that, this should do the trick. If this works, then we
> >>>can look at whether this is a hardware or iommu or block bouncing
> >>>(unlikely, would affect more people) bug.
> >>>
> >>>diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
> >>>index 4688dbf..cad3cbc 100644
> >>>--- a/drivers/ata/ahci.c
> >>>+++ b/drivers/ata/ahci.c
> >>>@@ -623,6 +623,9 @@ static void ahci_save_initial_config(struct pci_dev 
> >>>*pdev,
> >>>   hpriv->saved_cap = cap = readl(mmio + HOST_CAP);
> >>>   hpriv->saved_port_map = port_map = readl(mmio + HOST_PORTS_IMPL);
> >>>
> >>>+  hpriv->saved_cap &= ~HOST_CAP_64;
> >>>+  cap &= ~HOST_CAP_64;
> >>>+
> >>>   /* some chips have errata preventing 64bit use */
> >>>   if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) {
> >>>   dev_printk(KERN_INFO, &pdev->dev,
> >>>
> >>Hello Jens,
> >>Thanks for help. I just applied the patch. Unfortunately it doesn't work.
> >
> >Can you try and additionally boot with iommu=off as a boot parameter?
> >
> 
> Yes. This is the end of getting any sata devices. See screenshots for 
> errors. It continued untill ata4. At the end no root device was found.

Hmm, even though the address is set to 0x we still seem to
receive requests outside that range. Lets assume it's the scsi logic,
can you test this? IOW, patch + iommu=off + this patch.

I probably wont see any more mails tonight, we can continue this
tomorrow (or someone else can step in, whatever comes first :-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2faced6..769ce3a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1572,7 +1572,9 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host 
*shost,
 #endif
 
blk_queue_max_sectors(q, shost->max_sectors);
+#if 0
blk_queue_bounce_limit(q, scsi_calculate_bounce_limit(shost));
+#else
blk_queue_segment_boundary(q, shost->dma_boundary);
 
if (!shost->use_clustering)

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Bug: get EXT3-fs error Allocating block in system zone

2007-12-09 Thread Jens Axboe

On Sun, Dec 09 2007, Marco Gatti wrote:
> Jens Axboe schrieb:
> >
> >Was just thinking that, this should do the trick. If this works, then we
> >can look at whether this is a hardware or iommu or block bouncing
> >(unlikely, would affect more people) bug.
> >
> >diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
> >index 4688dbf..cad3cbc 100644
> >--- a/drivers/ata/ahci.c
> >+++ b/drivers/ata/ahci.c
> >@@ -623,6 +623,9 @@ static void ahci_save_initial_config(struct pci_dev 
> >*pdev,
> > hpriv->saved_cap = cap = readl(mmio + HOST_CAP);
> > hpriv->saved_port_map = port_map = readl(mmio + HOST_PORTS_IMPL);
> > 
> >+hpriv->saved_cap &= ~HOST_CAP_64;
> >+cap &= ~HOST_CAP_64;
> >+
> > /* some chips have errata preventing 64bit use */
> > if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) {
> > dev_printk(KERN_INFO, &pdev->dev,
> >
> 
> Hello Jens,
> Thanks for help. I just applied the patch. Unfortunately it doesn't work.

Can you try and additionally boot with iommu=off as a boot parameter?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Bug: get EXT3-fs error Allocating block in system zone

2007-12-09 Thread Jens Axboe

On Sun, Dec 09 2007, Linus Torvalds wrote:
> 
> 
> On Sun, 9 Dec 2007, Robert Hancock wrote:
> > 
> > The obvious suspect with a filesystem problem would be the disk 
> > controller driver, AHCI here. However, the controller appears to set the 
> > flag to indicate that it supports 64-bit DMA, so it should be fine, 
> > unless it lies of course (which we know that ATI SB600 chipset does, but 
> > I don't believe Intel is known to).
> > 
> > Could still be a DMA mapping bug that only shows up when IOMMU is used. 
> > However, AHCI is a pretty well tested driver..
> 
> AHCI is a pretty well tested driver, but 99%+ of all testers still tend to 
> have less than 4GB of memory. So I do *not* believe that the highmem bits 
> are all that well tested at all. 
> 
> Can somebody who knows the driver send Marco a test-patch to just limit 
> DMA to the low 32 bits, and then Marco can at least verify that yes, that 
> that it. While it looks like DMA problems, there could obviously be some 
> other subtle issue with big-memory machines (ie the PCI allocations etc 
> tend to change too!)

Was just thinking that, this should do the trick. If this works, then we
can look at whether this is a hardware or iommu or block bouncing
(unlikely, would affect more people) bug.

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 4688dbf..cad3cbc 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -623,6 +623,9 @@ static void ahci_save_initial_config(struct pci_dev *pdev,
hpriv->saved_cap = cap = readl(mmio + HOST_CAP);
hpriv->saved_port_map = port_map = readl(mmio + HOST_PORTS_IMPL);
 
+   hpriv->saved_cap &= ~HOST_CAP_64;
+   cap &= ~HOST_CAP_64;
+
/* some chips have errata preventing 64bit use */
if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) {
dev_printk(KERN_INFO, &pdev->dev,

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 01/28] blk_end_request: add new request completion interface (take 3)

2007-12-05 Thread Jens Axboe

On Tue, Dec 04 2007, Kiyoshi Ueda wrote:
> Hi Boaz and Jens,
> 
> On Tue, 04 Dec 2007 15:56:32 +0200, Boaz Harrosh <[EMAIL PROTECTED]> wrote:
> > > +/**
> > > + * blk_end_request - Helper function for drivers to complete the request.
> > > + * @rq:   the request being processed
> > > + * @uptodate: 1 for success, 0 for I/O error, < 0 for specific error
> > > + * @nr_bytes: number of bytes to complete
> > > + *
> > > + * Description:
> > > + * Ends I/O on a number of bytes attached to @rq.
> > > + * If @rq has leftover, sets it up for the next range of segments.
> > > + *
> > > + * Return:
> > > + * 0 - we are done with this request
> > > + * 1 - still buffers pending for this request
> > > + **/
> > > +int blk_end_request(struct request *rq, int uptodate, int nr_bytes)
> > 
> > I always hated that uptodate boolean with possible negative error value.
> > I guess it was done for backward compatibility of then users of 
> > end_that_request_first(). But since you are introducing a new API then
> > this is not the case. Just have regular status int where 0 means ALL_OK
> > and negative value means error code. 
> > Just my $0.02.
> 
> Thank you for the comment.
> I think it's quite reasonable.
> By doing that, we don't need end_io_error() anymore.
> 
> 
> Jens,
> What do you think?

I agree, the uptodate usage right now is horrible.

> If you agree with the interface change above, I would prefer to
> separate the patch-set from blk_end_request patch-set like below:
> o blk_end_request: remove end_that_request_*
> o change interface of 'uptodate' in blk_end_request to 'error'
> It makes the purpose of blk_end_request patch-set clear
> (and also, each patch of device drivers could be smaller).
> But, it doubles your merging work.  So if you would like to get
> the changes at once, I'll merge them into blk_end_request patch-set.

Twice the merging is not an issue for me.

> As for the patch inclusion, do you push the driver changes to Linus
> all at once?  Or should I ask each maintainer to take the patch?

Lets just try to get as many maintainer acks as possible, since the
patches need to go in together.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 00/28] blk_end_request: full I/O completion handler (take 3)

2007-12-04 Thread Jens Axboe

On Fri, Nov 30 2007, Kiyoshi Ueda wrote:
> Hello Jens,
> 
> The following is the updated patch-set for blk_end_request().
> Changes since the last version are only minor updates to catch up
> with the base kernel changes.
> Do you agree the implementation of blk_end_request()?
> If there's no problem, could you merge it to your tree?
> Or does it have to be merged to -mm tree first?
> 
> 
> Boaz,
> Could you review the newly added PATCH 27 which converts the bidi part,
> and give me your comments?
> It uses blk_end_request_callback() in PATCH 25, which was only for
> the tricky ide-cd driver.
> If bidi added a 'resid' member to struct request instead of reusing
> 'data_len' for the other purpose, it could use the standard
> blk_end_request() instead.
> 
> -- Changes from the previous post -
> Changes between take2 and take3:
>   o Rebased on top of 2.6.24-rc3-mm2

OK, so this means that I can't apply it unfortunately. It depends on
other patches in -mm (bidi).

SCSI sits on block, so the best approach imho is to base this patchset
on mainline so I can include the block bits.


-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 00/28] blk_end_request: full I/O completion handler (take 3)

2007-12-04 Thread Jens Axboe

On Fri, Nov 30 2007, Kiyoshi Ueda wrote:
> Hello Jens,
> 
> The following is the updated patch-set for blk_end_request().
> Changes since the last version are only minor updates to catch up
> with the base kernel changes.
> Do you agree the implementation of blk_end_request()?
> If there's no problem, could you merge it to your tree?
> Or does it have to be merged to -mm tree first?

Looks good to me now, I'll queue it up. Thanks!

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.24-rc & SB600 AHCI no go on >=4GB of RAM II

2007-11-20 Thread Jens Axboe

On Tue, Nov 20 2007, Andi Kleen wrote:
> 
> > Which in turn enables the iommu_merge functionality in gart_map_sg().
> 
> > for_each_sg(sg, s, nents, i) {
> 
> Hmm, another thought. Maybe this code just has trouble with the new 
> linked SG lists and it's not really a SB600 problem?
> 
> I did a quick test on two ATI machines with older chipset and
> iommu=force,merge and it didn't show a problem though.

chained sg lists aren't enabled on libata, so it should not affect
libata drivers.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [BUG] New Kernel Bugs

2007-11-13 Thread Jens Axboe

On Tue, Nov 13 2007, Andrew Morton wrote:
> > I/O STORAGE===
> > 
> > kernel bug from pktcdvd
> > http://bugzilla.kernel.org/show_bug.cgi?id=9294
> > Kernel: 2.6.23
> 
> I think we might have fixed this.

It's fixed and merged, I just forgot to close the bugzilla. Did so now.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Suspend to ram regression (2.6.24-rc1-git)

2007-11-05 Thread Jens Axboe

On Fri, Nov 02 2007, Kristen Carlson Accardi wrote:
> On Thu, 1 Nov 2007 09:41:46 +0100
> Jens Axboe <[EMAIL PROTECTED]> wrote:
> 
> > On Wed, Oct 31 2007, Jens Axboe wrote:
> > > Hi,
> > > 
> > > My x60 stopped suspending about two days ago. It just freezes after
> > > printing
> > > 
> > > Suspending console(s)
> > > 
> > > where it would normally turn everything off and the 'moon' light would
> > > go on. Posting this message in case somebody else knows what is up, if
> > > not I'll do a bisect on it tomorrow.
> > 
> > Did the bisect, it points to this commit:
> > 
> > 1556594f913fa81d008cecfe46d7211c919a853 is first bad commit
> > commit 31556594f913fa81d008cecfe46d7211c919a853
> > Author: Kristen Carlson Accardi <[EMAIL PROTECTED]>
> > Date:   Thu Oct 25 01:33:26 2007 -0400
> > 
> > [libata] AHCI: add hw link power management support
> > 
> > Booting any kernel after this commit fails suspending to ram, it just
> > sits there forever.
> > 
> > -- 
> > Jens Axboe
> > 
> 
> Does this patch fix your problem?  It seems to get hung up while disabling
> DIPM, and after thinking about this a bit, I don't think we really need
> to do this anyway.

Yep, works for me! Can we get this expedited upstream?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Suspend to ram regression (2.6.24-rc1-git)

2007-11-01 Thread Jens Axboe

On Thu, Nov 01 2007, Jeff Garzik wrote:
> Jens Axboe wrote:
>> On Thu, Nov 01 2007, Jeff Garzik wrote:
>>> Jens Axboe wrote:
>>>> Reverting just the default AHCI flags makes it work again. IOW, with the
>>>> below patch I can suspend properly with current -git.
>>>> diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
>>>> index ed9b407..77f7631 100644
>>>> --- a/drivers/ata/ahci.c
>>>> +++ b/drivers/ata/ahci.c
>>>> @@ -190,8 +190,7 @@ enum {
>>>>AHCI_FLAG_COMMON= ATA_FLAG_SATA | ATA_FLAG_NO_LEGACY |
>>>>  ATA_FLAG_MMIO | ATA_FLAG_PIO_DMA |
>>>> -ATA_FLAG_ACPI_SATA | ATA_FLAG_AN |
>>>> -ATA_FLAG_IPM,
>>>> +ATA_FLAG_ACPI_SATA | ATA_FLAG_AN,
>>>>AHCI_LFLAG_COMMON   = ATA_LFLAG_SKIP_D2H_BSY,
>>>
>>> sounds like the easy thing to do, in light of this breakage, might be to 
>>> default it to off, add a module parameter turning it on by setting that 
>>> flag.
>> Wouldn't it be better to just get this bug fixed? IOW, is there a reason
>> for disabling ALPM if it's Bug Free?
>> I'd suggest committing the patch disabling IPM, then Kristen can debug
>> the thing in piece in quiet. Once confident it works with ahci again, we
>> can revert that commit.
>
> Right -- if you are going to commit a patch "disabling" it, it is best to 
> do so via a simple module option, which allows users to easily try the 
> feature in parallel with Intel's debugging.

OK, so you just want the option to be temporary? In that case I think a
config option is better, since you don't risk breaking peoples setups
later when removing the option. That can be quite annoying. Ala the
below.

diff --git a/drivers/ata/Kconfig b/drivers/ata/Kconfig
index ba63619..e276ab6 100644
--- a/drivers/ata/Kconfig
+++ b/drivers/ata/Kconfig
@@ -48,6 +48,14 @@ config SATA_AHCI
 
  If unsure, say N.
 
+config SATA_AHCI_IPM
+   bool "AHCI power management"
+   depends on EXPERIMENTAL && SATA_AHCI
+   help
+ This option adds support for AHCI power management. It current
+ breaks suspend on some laptops. This option is temporary and will
+ go away once those issues are fully resolved.
+
 config SATA_SVW
tristate "ServerWorks Frodo / Apple K2 SATA support"
depends on PCI
diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index ed9b407..37266ce 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -190,8 +190,11 @@ enum {
 
AHCI_FLAG_COMMON= ATA_FLAG_SATA | ATA_FLAG_NO_LEGACY |
  ATA_FLAG_MMIO | ATA_FLAG_PIO_DMA |
- ATA_FLAG_ACPI_SATA | ATA_FLAG_AN |
-     ATA_FLAG_IPM,
+ ATA_FLAG_ACPI_SATA | ATA_FLAG_AN
+#ifdef CONFIG_SATA_AHCI_IPM
+ | ATA_FLAG_IPM
+#endif
+   ,
AHCI_LFLAG_COMMON   = ATA_LFLAG_SKIP_D2H_BSY,
 };
 

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Suspend to ram regression (2.6.24-rc1-git)

2007-11-01 Thread Jens Axboe

On Thu, Nov 01 2007, Jeff Garzik wrote:
> Jens Axboe wrote:
>> Reverting just the default AHCI flags makes it work again. IOW, with the
>> below patch I can suspend properly with current -git.
>> diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
>> index ed9b407..77f7631 100644
>> --- a/drivers/ata/ahci.c
>> +++ b/drivers/ata/ahci.c
>> @@ -190,8 +190,7 @@ enum {
>>  AHCI_FLAG_COMMON= ATA_FLAG_SATA | ATA_FLAG_NO_LEGACY |
>>ATA_FLAG_MMIO | ATA_FLAG_PIO_DMA |
>> -  ATA_FLAG_ACPI_SATA | ATA_FLAG_AN |
>> -  ATA_FLAG_IPM,
>> +  ATA_FLAG_ACPI_SATA | ATA_FLAG_AN,
>>  AHCI_LFLAG_COMMON   = ATA_LFLAG_SKIP_D2H_BSY,
>
>
> sounds like the easy thing to do, in light of this breakage, might be to 
> default it to off, add a module parameter turning it on by setting that 
> flag.

Wouldn't it be better to just get this bug fixed? IOW, is there a reason
for disabling ALPM if it's Bug Free?

I'd suggest committing the patch disabling IPM, then Kristen can debug
the thing in piece in quiet. Once confident it works with ahci again, we
can revert that commit.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Suspend to ram regression (2.6.24-rc1-git)

2007-11-01 Thread Jens Axboe

On Thu, Nov 01 2007, Jens Axboe wrote:
> On Thu, Nov 01 2007, Jens Axboe wrote:
> > On Wed, Oct 31 2007, Jens Axboe wrote:
> > > Hi,
> > > 
> > > My x60 stopped suspending about two days ago. It just freezes after
> > > printing
> > > 
> > > Suspending console(s)
> > > 
> > > where it would normally turn everything off and the 'moon' light would
> > > go on. Posting this message in case somebody else knows what is up, if
> > > not I'll do a bisect on it tomorrow.
> > 
> > Did the bisect, it points to this commit:
> > 
> > 1556594f913fa81d008cecfe46d7211c919a853 is first bad commit
> > commit 31556594f913fa81d008cecfe46d7211c919a853
> > Author: Kristen Carlson Accardi <[EMAIL PROTECTED]>
> > Date:   Thu Oct 25 01:33:26 2007 -0400
> > 
> > [libata] AHCI: add hw link power management support
> > 
> > Booting any kernel after this commit fails suspending to ram, it just
> > sits there forever.
> 
> Reverting just the default AHCI flags makes it work again. IOW, with the
> below patch I can suspend properly with current -git.
> 
> diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
> index ed9b407..77f7631 100644
> --- a/drivers/ata/ahci.c
> +++ b/drivers/ata/ahci.c
> @@ -190,8 +190,7 @@ enum {
>  
>   AHCI_FLAG_COMMON= ATA_FLAG_SATA | ATA_FLAG_NO_LEGACY |
> ATA_FLAG_MMIO | ATA_FLAG_PIO_DMA |
> -   ATA_FLAG_ACPI_SATA | ATA_FLAG_AN |
> -   ATA_FLAG_IPM,
> +   ATA_FLAG_ACPI_SATA | ATA_FLAG_AN,
>   AHCI_LFLAG_COMMON   = ATA_LFLAG_SKIP_D2H_BSY,
>  };

I should also mention that just stubbing out ahci_enable_alpm() and
ahci_disable_alpm() in ahci.c is NOT enough to make it work.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Suspend to ram regression (2.6.24-rc1-git)

2007-11-01 Thread Jens Axboe

On Thu, Nov 01 2007, Jens Axboe wrote:
> On Wed, Oct 31 2007, Jens Axboe wrote:
> > Hi,
> > 
> > My x60 stopped suspending about two days ago. It just freezes after
> > printing
> > 
> > Suspending console(s)
> > 
> > where it would normally turn everything off and the 'moon' light would
> > go on. Posting this message in case somebody else knows what is up, if
> > not I'll do a bisect on it tomorrow.
> 
> Did the bisect, it points to this commit:
> 
> 1556594f913fa81d008cecfe46d7211c919a853 is first bad commit
> commit 31556594f913fa81d008cecfe46d7211c919a853
> Author: Kristen Carlson Accardi <[EMAIL PROTECTED]>
> Date:   Thu Oct 25 01:33:26 2007 -0400
> 
> [libata] AHCI: add hw link power management support
> 
> Booting any kernel after this commit fails suspending to ram, it just
> sits there forever.

Reverting just the default AHCI flags makes it work again. IOW, with the
below patch I can suspend properly with current -git.

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index ed9b407..77f7631 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -190,8 +190,7 @@ enum {
 
AHCI_FLAG_COMMON= ATA_FLAG_SATA | ATA_FLAG_NO_LEGACY |
  ATA_FLAG_MMIO | ATA_FLAG_PIO_DMA |
- ATA_FLAG_ACPI_SATA | ATA_FLAG_AN |
- ATA_FLAG_IPM,
+ ATA_FLAG_ACPI_SATA | ATA_FLAG_AN,
AHCI_LFLAG_COMMON   = ATA_LFLAG_SKIP_D2H_BSY,
 };
 

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Suspend to ram regression (2.6.24-rc1-git)

2007-11-01 Thread Jens Axboe

On Wed, Oct 31 2007, Jens Axboe wrote:
> Hi,
> 
> My x60 stopped suspending about two days ago. It just freezes after
> printing
> 
> Suspending console(s)
> 
> where it would normally turn everything off and the 'moon' light would
> go on. Posting this message in case somebody else knows what is up, if
> not I'll do a bisect on it tomorrow.

Did the bisect, it points to this commit:

1556594f913fa81d008cecfe46d7211c919a853 is first bad commit
commit 31556594f913fa81d008cecfe46d7211c919a853
Author: Kristen Carlson Accardi <[EMAIL PROTECTED]>
Date:   Thu Oct 25 01:33:26 2007 -0400

[libata] AHCI: add hw link power management support

Booting any kernel after this commit fails suspending to ram, it just
sits there forever.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: "Fix ATAPI transfer lengths" causes CD writing regression

2007-10-31 Thread Jens Axboe

On Wed, Oct 31 2007, Jeff Garzik wrote:
> Jens Axboe wrote:
> >Right, that's of course problematic... There has to be a way to recover
> >that situation though, or you can't export any user command issue
> >facility.
> 
> You cannot hope to handle all possible effects arising from an app 
> providing an invalid sg header / cdb.
> 
> Once you start talking "recovery" you are already screwed:  we are 
> talking about low-level hardware commands that are passed straight to 
> the hardware.  It is trivial to lock up hardware, brick hardware, and 
> corrupt data at that level.
> 
> 
> If this is NOT a privileged app, we must update the command validation 
> to ensure that invalid commands are not transported to the hardware.
> 
> If this is a privileged app, our work is done.  Fix the app.  We gave 
> root rope, and he took it.

Woaw, back the truck up a bit :-)

I'm talking about simple things - like asking for 8 bytes of sense data.
Simple mistakes. You cannot possibly check for everything like that in a
command filter, it's utterly impossible.

> I even venture to say that "accept anything, clean up afterwards" is 
> /impossible/ to implement, in addition to being dangerous.

Certainly, that's not what I'm talking about.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: "Fix ATAPI transfer lengths" causes CD writing regression

2007-10-31 Thread Jens Axboe

On Wed, Oct 31 2007, Jeff Garzik wrote:
> Jens Axboe wrote:
>> On Wed, Oct 31 2007, Alan Cox wrote:
>>> On Tue, 30 Oct 2007 19:21:29 +
>>> Daniel Drake <[EMAIL PROTECTED]> wrote:
>>>
>>>> Alan Cox wrote:
>>>>> I would guess Brasero is issuing a command with the length of data
>>>>> wrongly set. In the old code that might well just produce errors of the
>>>>> "Umm wtf is this data left over for ?", with the new code the drive is
>>>>> likely to change state as it knows the transfer size and that will
>>>>> *correctly* cause an HSM error and what follows.
>>>>>
>>>>> Now the question is who gets the length wrong - Brasero or the ata
>>>>> translation code in libata
>>>> Brasero does exactly the same as my test app which I attached to my last 
>>>> mail. Is my test app wrong?
>>> Would need to double check the SCSI specificatons to be sure but I think
>>> you are asking for less data than the drive wishes to provide. You
>>> aren't allowed to do that with ATA.
>> ide-cd handles this by throwing the excess away, which I think is the
>> sane way to do this.
>
> That's easy for the PIO case.  But CD writing is normally DMA, which means 
> you will get a DMA engine exception if the device wants to give you more 
> data than the scatter/gather entries permit.

Right, that's of course problematic... There has to be a way to recover
that situation though, or you can't export any user command issue
facility.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: "Fix ATAPI transfer lengths" causes CD writing regression

2007-10-31 Thread Jens Axboe

On Wed, Oct 31 2007, Alan Cox wrote:
> On Tue, 30 Oct 2007 19:21:29 +
> Daniel Drake <[EMAIL PROTECTED]> wrote:
> 
> > Alan Cox wrote:
> > > I would guess Brasero is issuing a command with the length of data
> > > wrongly set. In the old code that might well just produce errors of the
> > > "Umm wtf is this data left over for ?", with the new code the drive is
> > > likely to change state as it knows the transfer size and that will
> > > *correctly* cause an HSM error and what follows.
> > > 
> > > Now the question is who gets the length wrong - Brasero or the ata
> > > translation code in libata
> > 
> > Brasero does exactly the same as my test app which I attached to my last 
> > mail. Is my test app wrong?
> 
> Would need to double check the SCSI specificatons to be sure but I think
> you are asking for less data than the drive wishes to provide. You
> aren't allowed to do that with ATA.

ide-cd handles this by throwing the excess away, which I think is the
sane way to do this.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: ata_exec_internal crash at boot in -git22

2007-10-22 Thread Jens Axboe

On Mon, Oct 22 2007, Andi Kleen wrote:
> On Monday 22 October 2007 20:26:45 Jens Axboe wrote:
> > On Mon, Oct 22 2007, Andi Kleen wrote:
> > > 
> > > One of the systems tested in autoboot crashes at boot with with -git22.
> > > This is a AMD 2 socket Opteron NUMA system.
> > > 
> > > The tester was a little flakey and happened to hit the x86-merge-broke-
> > > compilation window, so the last good data point I have is 2.6.23-rc9.
> > 
> > Andi, can you test with this patch applied?
> > 
> > http://brick.kernel.dk/sg-git.patch
> Sorry was a mistake (cue: there is no 2.6.23-git22 yet). It turned out
> to be an old broken kernel. Current -git seems to boot.
> Sorry for the noise.

OK, all for the better :-)

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: ata_exec_internal crash at boot in -git22

2007-10-22 Thread Jens Axboe

On Mon, Oct 22 2007, Andi Kleen wrote:
> 
> One of the systems tested in autoboot crashes at boot with with -git22.
> This is a AMD 2 socket Opteron NUMA system.
> 
> The tester was a little flakey and happened to hit the x86-merge-broke-
> compilation window, so the last good data point I have is 2.6.23-rc9.

Andi, can you test with this patch applied?

http://brick.kernel.dk/sg-git.patch

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: remove blk_queue_max_phys_segments in libata

2007-10-16 Thread Jens Axboe

On Tue, Oct 16 2007, Jeff Garzik wrote:
> Linux Kernel Mailing List wrote:
> >Gitweb: 
> >http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8889e3c129780cdbe15fed3c366ba3aa3026684d
> >Commit: 8889e3c129780cdbe15fed3c366ba3aa3026684d
> >Parent: fd820f405574a30aacf9a859886e173d641f080b
> >Author: FUJITA Tomonori <[EMAIL PROTECTED]>
> >AuthorDate: Tue Sep 18 12:16:45 2007 +0200
> >Committer:  Jens Axboe <[EMAIL PROTECTED]>
> >CommitDate: Tue Oct 16 11:24:44 2007 +0200
> >
> >remove blk_queue_max_phys_segments in libata
> >
> >LIBATA_MAX_PRD is the maximum number of DMA scatter/gather elements
> >permitted by the HBA's DMA engine. It's properly set to
> >q->max_hw_segments via the sg_tablesize parameter.
> >
> >libata shouldn't call blk_queue_max_phys_segments. Now LIBATA_MAX_PRD
> >is equal to SCSI_MAX_PHYS_SEGMENTS by default (both is 128), so
> >everything is fine. But if they are changed, some code (like the scsi
> >mid layer, sg chaining, etc) might not work properly.
> >
> >(Addition from Jens) The basic issue is that the physical segment
> >setting is purely a driver issue. And since SCSI is managing the 
> >sglist,
> >libata has no business changing the setting. All libata should care
> >about is the hw segment setting.
> >
> >Signed-off-by: FUJITA Tomonori <[EMAIL PROTECTED]>
> >Signed-off-by: Jens Axboe <[EMAIL PROTECTED]>
> >---
> > drivers/ata/libata-scsi.c |2 --
> > 1 files changed, 0 insertions(+), 2 deletions(-)
> >
> >diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> >index d63c81e..ba62d53 100644
> >--- a/drivers/ata/libata-scsi.c
> >+++ b/drivers/ata/libata-scsi.c
> >@@ -801,8 +801,6 @@ int ata_scsi_slave_config(struct scsi_device *sdev)
> > 
> > ata_scsi_sdev_config(sdev);
> > 
> >-blk_queue_max_phys_segments(sdev->request_queue, LIBATA_MAX_PRD);
> >-
> > sdev->manage_start_stop = 1;
> 
> As I noted when this first patch was posted...   this line of code 
> existed because the difference between hw-segments and phys-segments was 
> incredibly difficult to discern.  The names say basically the same thing 
> to me, and I could find no documentation explaining the difference?

We discussed this change to death afair, so I'm a little surprised you
say that. The commit message itself has good reasoning as well.

> Does such documentation exist?  If not, can you please explain the 
> difference?

OK, I can give a brief summary. The driver can set two restrictions -
hardware and physical. I'm not sure who originally came up with the
naming (I think it's prety terrible), but essentially the 'hardware'
restriction is what the hardware can support (OK that name works) and
the 'physical' restriction is imposed by the driver. Typically that last
restriction has to do with structure limitations, like allocating stuff
on the stack or limiting the size of the sgtable you have to allocate.

> Finally, some notification and coordination would have been helpful. 

Confused, as I mentioned, we discussed this at quite some length a few
months ago.

> Commit 6c08772e49622e90d39903e7ff0be1a0f463ac86 appears to have been 
> missed, at the very least... or is it actually correct?

Depends on what you think it does. The number of hardware segments will
never be larger than the physical segments. But it looks like you want
to set the hw seg value there. As mentioned in the commit above, since
libata doesn't allocate its own sg tables, letting it fiddle with the
physical segment limit is dodgy at best.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.23-rc7-mm1 AHCI ATA errors -- won't boot

2007-09-25 Thread Jens Axboe

On Tue, Sep 25 2007, Berck E. Nash wrote:
> Jens Axboe wrote:
> > On Tue, Sep 25 2007, Berck E. Nash wrote:
> >> Jeff Garzik wrote:
> >>
> >>> The first step would be to clone the "upstream" branch of
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git
> >>>
> >>> and see if the problem is reproducible there.  If yes, then you have
> >>> narrowed down the problem to something my ATA devel tree has introduced
> >>> into -mm.
> >> Nope, you're off the hook.  The libata tree works great, so it must be
> >> something else in -mm conflicting.
> 
> Whoops, sorry!  I just lied.  I'm a git newbie, and failed to actually
> get the "upstream" branch the first time, so rc8 is clean, but it fails
> when I actually pull the upstream branch.  I'll git bisect and get back
> to you.

OK, you probably realize this, but you can forget about the git-block
testing for now then.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.23-rc7-mm1 AHCI ATA errors -- won't boot

2007-09-25 Thread Jens Axboe

On Tue, Sep 25 2007, Berck E. Nash wrote:
> Jeff Garzik wrote:
> 
> > The first step would be to clone the "upstream" branch of
> > git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git
> > 
> > and see if the problem is reproducible there.  If yes, then you have
> > narrowed down the problem to something my ATA devel tree has introduced
> > into -mm.
> 
> Nope, you're off the hook.  The libata tree works great, so it must be
> something else in -mm conflicting.

Can you try 2.6.23-rc8 plus this patch:

http://brick.kernel.dk/git-block.patch.bz2

and see if that works?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ahci: enable GHC.AE bit before set GHC.HR

2007-09-21 Thread Jens Axboe

On Fri, Sep 21 2007, Alan Cox wrote:
> On Fri, 21 Sep 2007 12:31:20 +0200
> Jens Axboe <[EMAIL PROTECTED]> wrote:
> 
> > On Fri, Sep 21 2007, Peer Chen wrote:
> > > According to the description of section 5.2.2.1 and 10.1.2 of AHCI 
> > > specification rev1_1/rev1_2, GHC.HR shall only be set to ¡®1¡¯
> > > by software when GHC.AE is set to ¡®1¡¯.
> > > 
> > > Signed-off-by: Peer Chen <[EMAIL PROTECTED]>
> > > ---
> > > --- linux-2.6.23-rc7/drivers/ata/ahci.c.orig  2007-09-20 
> > > 11:01:55.0 -0400
> > > +++ linux-2.6.23-rc7/drivers/ata/ahci.c   2007-09-20 11:07:31.0 
> > > -0400
> > > @@ -834,6 +834,10 @@ static int ahci_reset_controller(struct 
> > >   void __iomem *mmio = host->iomap[AHCI_PCI_BAR];
> > >   u32 tmp;
> > >  
> > > +/* turn on AHCI mode before controller reset*/
> > > +writel(HOST_AHCI_EN, mmio + HOST_CTL);
> > > +(void) readl(mmio + HOST_CTL);  /* flush */
> > > +
> > >   /* global controller reset */
> > >   tmp = readl(mmio + HOST_CTL);
> > >   if ((tmp & HOST_RESET) == 0) {
> > 
> > I appreciate the readl() flushes, but in this particular case we end up
> > reading the exact offset again below.
> 
> The code above is wrong btw - it is an iomap so both the new and existing
> code should be using ioread* not readl

Happy converting :-)

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ahci: enable GHC.AE bit before set GHC.HR

2007-09-21 Thread Jens Axboe

On Fri, Sep 21 2007, Peer Chen wrote:
> According to the description of section 5.2.2.1 and 10.1.2 of AHCI 
> specification rev1_1/rev1_2, GHC.HR shall only be set to ¡®1¡¯
> by software when GHC.AE is set to ¡®1¡¯.
> 
> Signed-off-by: Peer Chen <[EMAIL PROTECTED]>
> ---
> --- linux-2.6.23-rc7/drivers/ata/ahci.c.orig  2007-09-20 11:01:55.0 
> -0400
> +++ linux-2.6.23-rc7/drivers/ata/ahci.c   2007-09-20 11:07:31.0 
> -0400
> @@ -834,6 +834,10 @@ static int ahci_reset_controller(struct 
>   void __iomem *mmio = host->iomap[AHCI_PCI_BAR];
>   u32 tmp;
>  
> +/* turn on AHCI mode before controller reset*/
> +writel(HOST_AHCI_EN, mmio + HOST_CTL);
> +(void) readl(mmio + HOST_CTL);  /* flush */
> +
>   /* global controller reset */
>   tmp = readl(mmio + HOST_CTL);
>   if ((tmp & HOST_RESET) == 0) {

I appreciate the readl() flushes, but in this particular case we end up
reading the exact offset again below.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ide-cd is unmaintained

2007-09-20 Thread Jens Axboe

On Thu, Sep 20 2007, Alan Cox wrote:
> I simply don't have any old IDE systems any more or time to really look
> after this. Nobody responded to the previous linux-ide mail about
> maintainers so...
> 
> Signed-off-by: Alan Cox <[EMAIL PROTECTED]>

Just fold it in with the IDE system in general? Would make the most
sense, I think.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [-mm patch] remove ide_get_error_location()

2007-09-11 Thread Jens Axboe

On Tue, Sep 11 2007, Bartlomiej Zolnierkiewicz wrote:
> On Sunday 09 September 2007, Adrian Bunk wrote:
> > On Fri, Aug 31, 2007 at 09:58:22PM -0700, Andrew Morton wrote:
> > >...
> > > Changes since 2.6.23-rc3-mm1:
> > >...
> > >  git-block.patch
> > >...
> > >  git trees
> > >...
> > 
> > ide_get_error_location() is no longer used.
> > 
> > Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
> 
> Signed-off-by: Bartlomiej Zolnierkiewicz <[EMAIL PROTECTED]>
> 
> Since git-block contains the patch which removes the only user of
> ide_get_error_location() I think that this patch should be also merged
> through block tree.  Jens?

Yeah, I'll add it there.

> PS none of the blkdev_issue_flush() users uses *error_sector argument
> so it can be probably removed as well

I had hoped that the existance was enough incentive, but it didn't
happen. I'll make a note to kill that again.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.23-rc4-mm1

2007-09-11 Thread Jens Axboe

On Mon, Sep 10 2007, Torsten Kaiser wrote:
> On 9/10/07, FUJITA Tomonori <[EMAIL PROTECTED]> wrote:
> > On Mon, 10 Sep 2007 12:20:38 -0700
> > Andrew Morton <[EMAIL PROTECTED]> wrote:
> >
> > > On Mon, 10 Sep 2007 20:59:49 +0200 "Torsten Kaiser" <[EMAIL PROTECTED]> 
> > > wrote:
> > > > The system boots, reads the partition tables, starts the RAID and then
> > > > kicks one drive out because of errors.
> > >
> > > Andy is using qla1280.  You're using sata.  So it's probably a different
> > > bug, with the same symptoms.
> >
> > This might be a sg chaining bug too (probabaly sg chaining libata
> > patch).
> >
> > Can you try the following patch that I've just sent:
> >
> > http://lkml.org/lkml/2007/9/10/251
> >
> > The patch also disables chaining sg list for libata.
> >
> With this patch 2.6.23-rc4-mm1 works for me.
> Mainline 2.6.23-rc5-git1 works also without needing any patches.

OK, thanks for testing that. I'll merge Tomo's patch so that we can
selectively enable drivers when we KNOW they work, instead of trying to
do this (massive) operation whole sale.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/7] blk_end_request: remove/unexport end_that_request_*

2007-09-04 Thread Jens Axboe

On Tue, Sep 04 2007, Halevy, Benny wrote:
> Boaz raised my attention to this patchset today...
> We suspect we'll still need the extern entry points for handling the bidi 
> request in the scsi_io_completion() path as we only want to call
> end_that_request_chunk on req->next_rq and never
> end_that_request_last.
>  
> (see 
> http://www.bhalevy.com/open-osd/download/linux-2.6.23-rc2_and_iscsi-iscsi-2007_08_09/0005-SCSI-bidi-support.patch)
>  
> If this is ok with you I'd leave these entry points in place rather than
> taking them out and putting them back in later.

There's no point in leaving them in when nothing current needs it, I'd
much rather add it back in should the need arise. That's the proper way
to handle things like this.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/7] blk_end_request: full I/O completion handler

2007-09-03 Thread Jens Axboe

On Fri, Aug 31 2007, Kiyoshi Ueda wrote:
> Hello,
> 
> This set of patches changes request completion interface
> between device drivers and block layer to 1 step procedure
> from current 2 step procedures using end_that_request_{first/chunk}
> and end_that_request_last().
> 
> This change allows request-based multipath to hook in before
> completing each chunk of request, check errors for it and
> retry it using another path if error is detected.
> 
> Summaries of each patch are below:
>   1/7: add new request completion interface, blk_end_request()
>   2/7: add some macros to get the size of request in bytes
>   3/7: convert normal drivers to use blk_end_request()
>   4/7: convert odd drivers like cciss/cpqarray/xsysace to use
>blk_end_request()
>   5/7: convert ide-cd (cdrom_newpc_intr) to use blk_end_request()
>   6/7: remove/unexport no longer needed end_that_request_*
>   7/7: change rq->end_io to cover request completion as a whole
> 
> I have tested the patch on two machines, ia64+QLA1280+QLA2200
> and x86_64+SATA+IDE-CDROM.
> I can't test other device drivers for which I don't have hardware.
> So testing help and any comments would be very much appreciated.
> 
> The interface change causes code modifications of *ALL DEVICE DRIVERS*
> which are using end_that_request_{first/chunk/last} to complete request.
> But it should not affect the behavior.
> 
> Please review and apply if no problem.
> This patch-set should be applied on top of 2.6.23-rc3-mm1.
> 
> 
> BACKGROUND
> ==
> The patch is necessary to allow device stacking at request level,
> that is request-based device-mapper multipath.
> Currently, device-mapper is implemented as a stacking block device
> at BIO level.  OTOH, request-based DM will stack at request level to
> allow better multipathing decision.
> To allow device stacking at request level, the completion procedure
> need to provide a hook for it.
> For example, dm-multipath has to check errors and retry with other
> paths if necessary before returning the I/O result to upper layer.
> struct request has 'end_io' hook currently.  But it's called at
> the very late stage of completion handling where the I/O result
> is already returned to the upper layer.
> So we need something here.
> 
> The first approach to hook in completion of each chunk of request
> was adding a new rq->end_io_first() hook and calling it on the top
> of __end_that_request_first().
>   - http://marc.theaimsgroup.com/?l=linux-scsi&m=115520444515914&w=2
>   - http://marc.theaimsgroup.com/?l=linux-kernel&m=116656637425880&w=2
> However, Jens pointed out that redesigning rq->end_io() as a full
> completion handler would be better:
> 
> On Thu, 21 Dec 2006 08:49:47 +0100, Jens Axboe <[EMAIL PROTECTED]> wrote:
> > Ok, I see what you are getting at. The current ->end_io() is called when
> > the request has fully completed, you want notification for each chunk
> > potentially completed.
> > 
> > I think a better design here would be to use ->end_io() as the full
> > completion handler, similar to how bio->bi_end_io() works. A request
> > originating from __make_request() would set something ala:
> .
> > instead of calling the functions manually. That would allow you to get
> > notification right at the beginning and do what you need, without adding
> > a special hook for this.
> 
> I thought his comment was reasonable.
> So I modified the patches based on his suggestion.
> 
> 
> WHAT IS CHANGED
> ===
> The change is basically illustlated by the following pseudo code:
> 
> [Before]
>   if (end_that_request_{first/chunk} succeeds) { <-- completes bios
>  
>  end_that_request_last() <-- calls end_io()
>  
>   } else {
>  
>   }
> 
> [After]
>   if (blk_end_request() succeeds) { <-- calls end_io(), completes bios
>  
>   } else {
>  
>   }
> 
> 
> In detail, request completion procedures are changed like below.
> 
> [Before]
>   o 2 steps completion using end_that_request_{first/chunk}
> and end_that_request_last().
>   o Device drivers have ownership of a request until they
> call end_that_request_last().
>   o rq->end_io() is called at the last stage of
> end_that_request_last() for some block layer codes need
> specific request handling when completing it.
> 
> [After]
>   o 1 step completion using blk_end_request().
> (end_that_request_* are no longer used from device drivers.)
>   o Device drivers give over ownership of a request
> when calling blk_end_request().
> If it returns 0, the reques

Re: [PATCH 1/7] blk_end_request: add new request completion interface

2007-09-03 Thread Jens Axboe

extern int __blk_end_request(struct request *rq, int uptodate, int nr_bytes);
>  extern int end_that_request_first(struct request *, int, int);
>  extern int end_that_request_chunk(struct request *, int, int);
>  extern void end_that_request_last(struct request *, int);

We get in to way too many levels of underscores here. Please changes
this to be blk_end_request() and blk_end_request_locked(), where the
former grabs the queue lock but the latter assumes it's held. Then have
the static __blk_end_request() where the lock MUST be held - do this in
the caller, don't pass it as an argument!

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/7] blk_end_request: add blk_rq_size() macros

2007-09-03 Thread Jens Axboe

On Fri, Aug 31 2007, Kiyoshi Ueda wrote:
> This patch adds macros to get the size of request in bytes.
> They are useful because blk_end_request() takes bytes
> as a completed I/O size instead of sectors.
> 
> Signed-off-by: Kiyoshi Ueda <[EMAIL PROTECTED]>
> Signed-off-by: Jun'ichi Nomura <[EMAIL PROTECTED]>
> ---
>  blkdev.h |9 +
>  1 files changed, 9 insertions(+)
> 
> diff -rupN 01-blkendreq-interface/include/linux/blkdev.h 
> 02-sect2byte-macro/include/linux/blkdev.h
> --- 01-blkendreq-interface/include/linux/blkdev.h 2007-08-23 
> 17:22:50.0 -0400
> +++ 02-sect2byte-macro/include/linux/blkdev.h 2007-08-23 17:25:59.0 
> -0400
> @@ -737,6 +737,15 @@ extern void end_request(struct request *
>  extern void blk_complete_request(struct request *);
>  
>  /*
> + * blk_end_request() takes bytes instead of sectors as a complete size.
> + * blk_rq_size() returns the entire size left to complete in the request.
> + * blk_rq_cur_size() returns the size left to complete in the current 
> segment.
> + */
> +#define sect2byte(nr_sectors) ((nr_sectors) << 9)
> +#define blk_rq_size(rq) (sect2byte((rq)->hard_nr_sectors))
> +#define blk_rq_cur_size(rq) (sect2byte((rq)->current_nr_sectors))
> +
> +/*
>   * end_that_request_first/chunk() takes an uptodate argument. we account
>   * any value <= as an io error. 0 means -EIO for compatability reasons,
>   * any other < 0 value is the direct error type. An uptodate value of

Don't use a sect2byte() macro, kill that. And it doesn't look quite
right, for blk_pc_requests() you don't want to look at *nr_sectors.
Something ala:

static unsigned int blk_rq_size(struct request *rq)
{
if (blk_fs_request(rq))
return rq->nr_sectors << 9;

return rq->data_len;
}

static unsigned int blk_rq_cur_size(struct request *rq)
{
if (blk_fs_request(rq))
    return rq->current_nr_sectors << 9;

if (rq->bio)
return rq->bio->bi_size;

return rq->data_len;
}

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libata: kill sector_buf temporary buffer

2007-08-16 Thread Jens Axboe

On Thu, Aug 16 2007, Jeff Garzik wrote:
> 
> Rather than carrying around this buffer all the time, for rare
> circumstances, it seems that we can easily alloc/free a temp buffer as
> needed.
> 
> Saves a big chunk of per-port memory.
> 
> I forget the justification for what it was in ata_port to begin with,
> but the code uses don't see to justify any need.
> 
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index 735f74b..5e9049f 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -3679,20 +3679,29 @@ static int ata_dev_same_device(struct ata_device 
> *dev, unsigned int new_class,
>  int ata_dev_reread_id(struct ata_device *dev, unsigned int readid_flags)
>  {
>   unsigned int class = dev->class;
> - u16 *id = (void *)dev->link->ap->sector_buf;
> - int rc;
> + u16 *id;
> + int rc = 0;
> +
> + id = kzalloc(ATA_SECT_SIZE, GFP_KERNEL);
> + if (!id)
> + return -ENOMEM;
>  
>   /* read ID data */
>   rc = ata_dev_read_id(dev, &class, readid_flags, id);
>   if (rc)
> - return rc;
> + goto out;
>  
>   /* is the device still there? */
> - if (!ata_dev_same_device(dev, class, id))
> - return -ENODEV;
> + if (!ata_dev_same_device(dev, class, id)) {
> + rc = -ENODEV;
> + goto out;
> + }
>  
>   memcpy(dev->id, id, sizeof(id[0]) * ATA_ID_WORDS);
> - return 0;
> +
> +out:
> + kfree(id);
> + return rc;
>  }
>  
>  /**
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index 2ddc2ed..fc731e2 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -1133,14 +1133,20 @@ static unsigned int ata_read_log_page(struct 
> ata_device *dev,
>  static int ata_eh_read_log_10h(struct ata_device *dev,
>  int *tag, struct ata_taskfile *tf)
>  {
> - u8 *buf = dev->link->ap->sector_buf;
> + u8 *buf;
>   unsigned int err_mask;
>   u8 csum;
> - int i;
> + int i, rc = 0;
> +
> + buf = kzalloc(ATA_SECT_SIZE, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;

How can this work? If we're issuing this log page read, we cannot do any
IO to the device. So this allocation must not generate any IO. And if we
fail in allocating memory and just return, the device wont talk to us in
the future.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: libata git tree, mbox queue status and contents

2007-08-13 Thread Jens Axboe

On Fri, Aug 03 2007, Jeff Garzik wrote:
> * Kristen: ALPM patches.  We definitely want them, as they save a ton of 
> power.

The problem with ALPM, as I see it, is that it is way too aggressive. It
really needs to be combined with a timer to be useful, it's really a
huge shame that it doesn't come equipped with a timeout setting in
hardware. Lacking that, we could punt to using a second aligned timer
that just checks for activity in the last second, and if none was seen
then enable ALPM. That should have absolutely minimal impact on CPU
consumption. Likewise for when we see IO, when the rate/sec goes beyond
a low threshold then disable ALPM again.

In my testing on this notebook (x60), throughput was reduced to about
30% when using ALPM. So while it does save a good amount of power, it
also makes the disk a slow dog if you are actually using it.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SATA open bugs

2007-08-08 Thread Jens Axboe

On Thu, Aug 09 2007, Tejun Heo wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=7805 tejun is pinging
> > pktcdvd maintainer (?)
> 
> Not a libata bug.  This is caused by pktcdvd holding onto the device
> thus preventing it from being revalidated.  Didn't get pong for my ping
> yet.  Jens, any ideas?
> 
> > http://bugzilla.kernel.org/show_bug.cgi?id=7547 tejun is asking " and Jens
> 
> This one is pktcdvd too.  Jens?

Ask Peter, I haven't worked on pktcdvd in ~5 years or so :-). CC'ed.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Some NCQ numbers...

2007-07-09 Thread Jens Axboe

On Wed, Jul 04 2007, Justin Piszcz wrote:
>  On Wed, 4 Jul 2007, Michael Tokarev wrote:
> 
> > Tejun Heo wrote:
> >> Hello,
> >>
> >> Michael Tokarev wrote:
> >>> Well.  It looks like the results does not depend on the
> >>> elevator.  Originally I tried with deadline, and just
> >>> re-ran the test with noop (hence the long delay with
> >>> the answer) - changing linux elevator changes almost
> >>> nothing in the results - modulo some random "fluctuations".
> >>
> >> I see.  Thanks for testing.
> >
> > Here are actual results - the tests were still running when
> > I replied yesterday.
> >
> > Again, this is Seagate ST3250620AS "desktop" drive, 7200RPM,
> > 16Mb cache, 250Gb capacity.  The tests were performed with
> > queue depth = 64 (on mptsas), drive write cache is turned
> > off.
> 
>  I found AS scheduler to be the premium and best for single-user performance.
> 
>  You want speed? Use AS.
> 
>  http://home.comcast.net/~jpiszcz/sched/cfq_vs_as_vs_deadline_vs_noop.html

Hmm, I find your data very weak for such a conclusion. Value of the test
itself withstanding, AS seems to be a lot faster for sequential output
for some reason, yet slower for everything else. Which is odd, deadline
should always be running at the same speed for writeout as AS. The only
real difference should be sequential and random reads.

So allow me to call your results questionable. It also looks like bonnie
(some version) output, I never found bonnie to provide good and
repeatable numbers. tiotest is much better, or (of course) fio.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Some NCQ numbers...

2007-07-09 Thread Jens Axboe

On Wed, Jul 04 2007, James Bottomley wrote:
> On Wed, 2007-07-04 at 10:19 +0900, Tejun Heo wrote:
> > Michael Tokarev wrote:
> > > Well.  It looks like the results does not depend on the
> > > elevator.  Originally I tried with deadline, and just
> > > re-ran the test with noop (hence the long delay with
> > > the answer) - changing linux elevator changes almost
> > > nothing in the results - modulo some random "fluctuations".
> > 
> > I see.  Thanks for testing.
> > 
> > > In any case, NCQ - at least in this drive - just does
> > > not work.  Linux with its I/O elevator may help to
> > > speed things up a bit, but the disk does nothing in
> > > this area.  NCQ doesn't slow things down either - it
> > > just does not work.
> > > 
> > > The same's for ST3250620NS "enterprise" drives.
> > > 
> > > By the way, Seagate announced Barracuda ES 2 series
> > > (in range 500..1200Gb if memory serves) - maybe with
> > > those, NCQ will work better?
> > 
> > No one would know without testing.
> > 
> > > Or maybe it's libata which does not implement NCQ
> > > "properly"?  (As I shown before, with almost all
> > > ol'good SCSI drives TCQ helps alot - up to 2x the
> > > difference and more - with multiple I/O threads)
> > 
> > Well, what the driver does is minimal.  It just passes through all the
> > commands to the harddrive.  After all, NCQ/TCQ gives the harddrive more
> > responsibility regarding request scheduling.
> 
> Actually, in many ways the result support a theory of SCSI TCQ Jens used
> when designing the block layer.  The original TCQ theory held that the
> drive could make much better head scheduling decisions than the
> Operating System, so you just used TCQ to pass all the outstanding I/O
> unfiltered down to the drive to let it schedule.  However, the I/O
> results always seemed to indicate that the effect of TCQ was negligible
> at around 4 outstanding commands, leading to the second theory that all
> TCQ was good for was saturating the transport, and making scheduling
> decisions was, indeed, better left to the OS (hence all our I/O
> schedulers).

Indeed, the above I still find to be true. The only real case where
larger depths make a real difference, is a pure random reads (or writes,
with write caching off) workload. And those situations are largely
synthetic, hence benchmarks tend to show NCQ being a lot more beneficial
since they construct workloads that consist 100% of random IO. Real life
is rarely so black and white.

Additionally, there are cases where drive queue depths hurt a lot. The
drive has no knowledge of fairness, or process-to-io mappings. So AS/CFQ
has to artificially limit queue depths competing IO processes doing
semi (or fully) sequential workloads, or throughput plummets.

So while NCQ has some benefits, I typically tend to prefer managing the
IO queue largely in software instead of punting to (often) buggy
firmware.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 3/3] Enable Aggressive Link Power management for AHCI controllers.

2007-06-22 Thread Jens Axboe

On Fri, Jun 22 2007, Kristen Carlson Accardi wrote:
> On Thu, 21 Jun 2007 15:08:32 +0200
> Jens Axboe <[EMAIL PROTECTED]> wrote:
> 
> > On Wed, Jun 20 2007, Kristen Carlson Accardi wrote:
> > > Enable Aggressive Link Power management for AHCI controllers.
> > > 
> > > This patch will set the correct bits to turn on Aggressive
> > > Link Power Management (ALPM) for the ahci driver.  This
> > > will cause the controller and disk to negotiate a lower
> > > power state for the link when there is no activity (see
> > > the AHCI 1.x spec for details).  This feature is mutually
> > > exclusive with Hot Plug, so when ALPM is enabled, Hot Plug
> > > is disabled.  ALPM will be enabled by default, but it is
> > > settable via the scsi host syfs interface.  Possible 
> > > settings for this feature are:
> > > 
> > > Setting   Effect
> > > --
> > > min_power ALPM is enabled, and link set to enter 
> > >   lowest power state (SLUMBER) when idle
> > >   Hot plug not allowed.
> > > 
> > > max_performance   ALPM is disabled, Hot Plug is allowed
> > > 
> > > medium_power  ALPM is enabled, and link set to enter
> > >   second lowest power state (PARTIAL) when
> > >   idle.  Hot plug not allowed.
> > > 
> > > Signed-off-by:  Kristen Carlson Accardi <[EMAIL PROTECTED]>
> > 
> > A suggestion (it comes with a patch!) - default to max_power/almp off,
> > not min_power. For two reasons:
> > 
> > - There's such a big performance difference between the two, you really
> >   want max_power when booting.
> > 
> > - It's a lot better to default to no change, than default to enabling
> >   something new.
> 
> Sounds reasonable to me.  Distros/users can decide if they want to have
> scripts that enable this after boot to run at min_power. 

Exactly, it needs to be handled by some power management daemon anyway
and be integrated with power savings in general. You could use io load
to determine when to enable/disable alpm, for instance.

> Acked-by: Kristen Carlson Accardi <[EMAIL PROTECTED]>

Will you integrate it into the next posting?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 3/3] Enable Aggressive Link Power management for AHCI controllers.

2007-06-21 Thread Jens Axboe

On Wed, Jun 20 2007, Kristen Carlson Accardi wrote:
> Enable Aggressive Link Power management for AHCI controllers.
> 
> This patch will set the correct bits to turn on Aggressive
> Link Power Management (ALPM) for the ahci driver.  This
> will cause the controller and disk to negotiate a lower
> power state for the link when there is no activity (see
> the AHCI 1.x spec for details).  This feature is mutually
> exclusive with Hot Plug, so when ALPM is enabled, Hot Plug
> is disabled.  ALPM will be enabled by default, but it is
> settable via the scsi host syfs interface.  Possible 
> settings for this feature are:
> 
> Setting   Effect
> --
> min_power ALPM is enabled, and link set to enter 
>   lowest power state (SLUMBER) when idle
>   Hot plug not allowed.
> 
> max_performance   ALPM is disabled, Hot Plug is allowed
> 
> medium_power  ALPM is enabled, and link set to enter
>   second lowest power state (PARTIAL) when
>   idle.  Hot plug not allowed.
> 
> Signed-off-by:  Kristen Carlson Accardi <[EMAIL PROTECTED]>

A suggestion (it comes with a patch!) - default to max_power/almp off,
not min_power. For two reasons:

- There's such a big performance difference between the two, you really
  want max_power when booting.

- It's a lot better to default to no change, than default to enabling
  something new.

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 841cf0a..e7a2072 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -786,8 +786,7 @@ static int ahci_disable_alpm(struct ata_port *ap)
return 0;
 }
 
-static int ahci_enable_alpm(struct ata_port *ap,
-   enum scsi_host_link_pm policy)
+static int ahci_enable_alpm(struct ata_port *ap, enum scsi_host_link_pm policy)
 {
struct ahci_host_priv *hpriv = ap->host->private_data;
void __iomem *port_mmio = ahci_port_base(ap);
@@ -808,19 +807,19 @@ static int ahci_enable_alpm(struct ata_port *ap,
return -EINVAL;
}
 
-   switch(policy) {
+   switch (policy) {
case SHOST_MAX_PERFORMANCE:
-   ahci_disable_alpm(ap);
-   ap->pm_policy = policy;
-   return 0;
case SHOST_NOT_AVAILABLE:
-   case SHOST_MIN_POWER:
/*
 * if we came here with SHOST_NOT_AVAILABLE,
 * it just means this is the first time we
-* have tried to enable - so try to do
-* min_power
+* have tried to enable - default to max performance,
+* and let the user go to lower power modes on request.
 */
+   ahci_disable_alpm(ap);
+   ap->pm_policy = SHOST_MAX_PERFORMANCE;
+   return 0;
+   case SHOST_MIN_POWER:
    ap->pm_policy = SHOST_MIN_POWER;
 
/* configure HBA to enter SLUMBER */

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 126X DVD drive? (odd ide-cd output)

2007-06-21 Thread Jens Axboe

On Wed, Jun 20 2007, Cal Peake wrote:
> Hello,
> 
> It seems if I boot with a disc in the drive (in this case a dual fs 
> ISO-9660/UDF DVD-R) the drive speed gets reported incorrectly:
> 
> hdc: DVDRW DRW-6S160P, ATAPI CD/DVD-ROM drive
> hdc: ATAPI 126X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(66)
> 
> normal output (w/ no disc) is:
> 
> hdc: DVDRW DRW-6S160P, ATAPI CD/DVD-ROM drive
> hdc: ATAPI 48X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(66)
> 
> Kernel is 2.6.22-rc5-gf1518a08.
> 
> Harmless, but odd...

Sounds like a firmware bug. From a drive with a 'DVDRW' vendor string,
that's not highly surprising :-)

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 0/3] AHCI Link Power Management

2007-06-14 Thread Jens Axboe

On Tue, Jun 12 2007, Tejun Heo wrote:
> Arjan van de Ven wrote:
> >> I'm not sure about this.  We need better PM framework to support 
> >> powersaving in other controllers and some ahcis don't save much
> >> when only link power management is used,
> > 
> > do you have data to support this?
> 
> Yeah, it was some Lenovo notebook.  Pavel is more familiar with the
> hardware.  Pavel, what was the notebook which didn't save much power
> with standard SATA power save but needed port to be completely turned off?
> 
> > The data we have from this patch is that it saves typically a Watt of
> > power (depends on the machine of course, but the range is 0.5W to
> > 1.5W). If you want to also have an even more agressive thing where
> > you want to start disabling the entire controller... I don't see how
> > this is in conflict with saving power on the link level by "just"
> > enabling a hardware feature 
> 
> Well, both implement about the same thing.  I prefer software
> implementation because it's more generic and ALPE/ASP seems too
> aggressive to me.  Here are reasons why sw implementation wasn't merged.
> 
> 1. It didn't have proper interface with userland.  This was mainly
> because of missing ATA sysfs nodes.  I'm not sure whether adding this to
> scsi node is a good idea.
> 
> 2. It was focused on SATA link PS and couldn't cover the Lenovo case.
> 
> I think we need something at the block layer.

I think the hardware method is preferable, actually. Doing this in the
block layer would mean keeping track of idle time, and that quickly
turns into a lot of timer management. Not exactly free, in terms of CPU
usage.

I've yet to do some power measurements with this ahci patch, I just
noticed that with min_power performance drops from ~55mb/sec to
~15mb/sec sequential on my drive. That's pretty drastic :-)

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: use_clustering (sht) bit set to 0 in AHCI ?

2007-05-22 Thread Jens Axboe

On Mon, May 21 2007, Jeff Garzik wrote:
> Jens Axboe wrote:
> >ahci has always had clustering disabled, perhaps Jeff can expand on why?
> 
> 
> Just historical reasons.  libata had clustering disabled by default in 
> the beginning, for all drivers.  Then we enabled it globally by changing 
> the value of ATA_SHT_USE_CLUSTERING...  but apparently forgot to change 
> drivers which use their own value rather than ATA_SHT_USE_CLUSTERING.
> 
> Feel free to submit a patch turning it on...

The below works for me, but it's only lightly tested. I booted it up and
ran some large IO tests, I've verified really large IO sizes as well
(using blktrace, I've verified ios up to 9216KiB being accepted and
completed by the drive).

-

From: Jens Axboe <[EMAIL PROTECTED]>

ahci: enable sg segment clustering

The specification states that ahci supports segments up to 4MiB in size,
so enable clustering.

Signed-off-by: Jens Axboe <[EMAIL PROTECTED]>

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index e00e1b9..bfcd8ec 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -54,7 +54,7 @@ enum {
AHCI_MAX_PORTS  = 32,
AHCI_MAX_SG = 168, /* hardware max is 64K */
AHCI_DMA_BOUNDARY   = 0x,
-   AHCI_USE_CLUSTERING = 0,
+   AHCI_USE_CLUSTERING = 1,
AHCI_MAX_CMDS   = 32,
AHCI_CMD_SZ = 32,
AHCI_CMD_SLOT_SZ= AHCI_MAX_CMDS * AHCI_CMD_SZ,

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: use_clustering (sht) bit set to 0 in AHCI ?

2007-05-21 Thread Jens Axboe

On Thu, Apr 26 2007, Alok kataria wrote:
> Hi Jeff,
> 
> I recently got a new AHCI disk, and was using the AHCI-libata driver to run
> this. I noticed that the scattergather lists that were being built for
> the I/O on this device were just of PAGE_SIZE length, even though i
> was doing IO on
> contiguous pages.
> 
> Through a little instrumentation i figured out that the use_clustering
> bit in the ahci_sht (scsi_host_template) is set to zero. Due to which
> we are not putting consecutive bios into one sg in blk_rq_map_sg.
> 
> I tried changing the clustering bit to 1, but encountered a panic at
> the initialization of the disks  during the bootup process, and so
> couldn't entirely get hold of the panic mesg.
> I was wondering, though the max segment size suported with this driver
> is 65536, why was i not able to feed in bigger SG's to this driver.
> 
> I searched on the net and the intel ahci arch doc too but couldn't
> find anything relative to the clustering support for AHCI.
> 
> It would be great help if you could let me know about the possible
> problems with clustering on AHCI  ? why is it off by default in the
> driver ? or anything else that might help.

It's a good question. If you look at the documentation, it states that
ahci supports up to 64k sg entries and each can have a size of up to 4mb
(bits 0 through 21). So as far as I can tell, clustering should work
with a segment size up to those 4mb.

ahci has always had clustering disabled, perhaps Jeff can expand on why?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] bidi support: bidirectional request

2007-05-01 Thread Jens Axboe

On Tue, May 01 2007, Boaz Harrosh wrote:
> Please consider the attached proposal. It is a complete block-level bidi
> implementation that is, I hope, a middle ground which will keep everyone
> happy (including Christoph). It is both quite small and not invasive,
> yet has a full bidi API that is easy to use and maintain.

This isn't much of an improvement imo, if any at all. Why didn't you do
the ->next_rq approach I suggested? Your patch still makes struct
request considerably fatter (30% here, from 280 to 368 bytes on x86-64
from a quick look) for something that will have relatively few uses. And
it still has its paws all over the block layer code.

Please just implement the 2nd data phase as a linked request off the
first one. I think that approach is both much cleaner from a design
perspective, and also much leaner and has zero (well almost, it costs a
pointer) impact on the regular read-write paths.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] bidi support: bidirectional request

2007-04-30 Thread Jens Axboe

On Mon, Apr 30 2007, Douglas Gilbert wrote:
> Jens Axboe wrote:
> > On Mon, Apr 30 2007, Benny Halevy wrote:
> >> Jens Axboe wrote:
> >>> On Sun, Apr 29 2007, James Bottomley wrote:
> >>>> On Sun, 2007-04-29 at 18:48 +0300, Boaz Harrosh wrote:
> >>>>> FUJITA Tomonori wrote:
> >>>>>> From: Boaz Harrosh <[EMAIL PROTECTED]>
> >>>>>> Subject: [PATCH 4/4] bidi support: bidirectional request
> >>>>>> Date: Sun, 15 Apr 2007 20:33:28 +0300
> >>>>>>
> >>>>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> >>>>>>> index 645d24b..16a02ee 100644
> >>>>>>> --- a/include/linux/blkdev.h
> >>>>>>> +++ b/include/linux/blkdev.h
> >>>>>>> @@ -322,6 +322,7 @@ struct request {
> >>>>>>>  void *end_io_data;
> >>>>>>>
> >>>>>>>  struct request_io_part uni;
> >>>>>>> +struct request_io_part bidi_read;
> >>>>>>>  };
> >>>>>> Would be more straightforward to have:
> >>>>>>
> >>>>>> struct request_io_part in;
> >>>>>> struct request_io_part out;
> >>>>>>
> >>>>> Yes I wish I could do that. For bidi supporting drivers this is the 
> >>>>> most logical.
> >>>>> But for the 99.9% of uni-directional drivers, calling rq_uni(), and 
> >>>>> being some what on
> >>>>> the hotish paths, this means we will need a pointer to a uni 
> >>>>> request_io_part.
> >>>>> This is bad because:
> >>>>> 1st- There is no defined stage in a request life where to definitely 
> >>>>> set that pointer,
> >>>>>  specially in the preparation stages.
> >>>>> 2nd- hacks like scsi_error.c/scsi_send_eh_cmnd() will not work at all. 
> >>>>> Now this is a
> >>>>>  very bad spot already, and I have a short term fix for it in the 
> >>>>> SCSI-bidi patches
> >>>>>  (not sent yet) but a more long term solution is needed. Once such 
> >>>>> hacks are
> >>>>>  cleaned up we can do what you say. This is exactly why I use the 
> >>>>> access functions
> >>>>>  rq_uni/rq_io/rq_in/rq_out and not open code access.
> >>>> I'm still not really convinced about this approach.  The primary job of
> >>>> the block layer is to manage and merge READ and WRITE requests.  It
> >>>> serves a beautiful secondary function of queueing for arbitrary requests
> >>>> it doesn't understand (REQ_TYPE_BLOCK_PC or REQ_TYPE_SPECIAL ... or
> >>>> indeed any non REQ_TYPE_FS).
> >>>>
> >>>> bidirectional requests fall into the latter category (there's nothing
> >>>> really we can do to merge them ... they're just transported by the block
> >>>> layer).  The only unusual feature is that they carry two bios.  I think
> >>>> the drivers that actually support bidirectional will be a rarity, so it
> >>>> might even be advisable to add it to the queue capability (refuse
> >>>> bidirectional requests at the top rather than perturbing all the drivers
> >>>> to process them).
> >>>>
> >>>> So, what about REQ_TYPE_BIDIRECTIONAL rather than REQ_BIDI?  That will
> >>>> remove it from the standard path and put it on the special command type
> >>>> path where we can process it specially.  Additionally, if you take this
> >>>> approach, you can probably simply chain the second bio through
> >>>> req->special as an additional request in the stream.  The only thing
> >>>> that would then need modification would be the dequeue of the block
> >>>> driver (it would have to dequeue both requests and prepare them) and
> >>>> that needs to be done only for drivers handling bidirectional requests.
> >>> I agree, I'm really not crazy about shuffling the entire request setup
> >>> around just for something as exotic as bidirection commands. How about
> >>> just keeping it simple - have a second request linked off the first one
> >>> for the second data phase? So keep it completely seperate, not just
> >>> overload ->special for 2nd bio list.

Re: [PATCH 4/4] bidi support: bidirectional request

2007-04-30 Thread Jens Axboe

On Mon, Apr 30 2007, Mark Lord wrote:
> Jens Axboe wrote:
> >
> >So basically just add a struct request pointer, so you can do rq =
> >rq->next_rq or something for the next data phase. I bet this would be a
> >LOT less invasive as well, and we can get by with a few helpers to
> >support it.
> 
> Hey, I want a way to issue those (linked requests) from userspace (SG_IO), 
> too.
> Specifically for use with the new SMART Command Transport (SCT) feature set
> on modern SATA drives.  As well as for a disk recovery utility I'm working 
> on.
> 
> Sounds generally useful, that.

Yep, one of the reasons why I like (my :-) proposal as well, we could
potentially find other uses for linking commands like that.
> 

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/4] bidi support: bidirectional request

2007-04-30 Thread Jens Axboe

On Mon, Apr 30 2007, Benny Halevy wrote:
> Jens Axboe wrote:
> > On Sun, Apr 29 2007, James Bottomley wrote:
> >> On Sun, 2007-04-29 at 18:48 +0300, Boaz Harrosh wrote:
> >>> FUJITA Tomonori wrote:
> >>>> From: Boaz Harrosh <[EMAIL PROTECTED]>
> >>>> Subject: [PATCH 4/4] bidi support: bidirectional request
> >>>> Date: Sun, 15 Apr 2007 20:33:28 +0300
> >>>>
> >>>>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> >>>>> index 645d24b..16a02ee 100644
> >>>>> --- a/include/linux/blkdev.h
> >>>>> +++ b/include/linux/blkdev.h
> >>>>> @@ -322,6 +322,7 @@ struct request {
> >>>>>  void *end_io_data;
> >>>>>
> >>>>>  struct request_io_part uni;
> >>>>> +struct request_io_part bidi_read;
> >>>>>  };
> >>>> Would be more straightforward to have:
> >>>>
> >>>> struct request_io_part in;
> >>>> struct request_io_part out;
> >>>>
> >>> Yes I wish I could do that. For bidi supporting drivers this is the most 
> >>> logical.
> >>> But for the 99.9% of uni-directional drivers, calling rq_uni(), and being 
> >>> some what on
> >>> the hotish paths, this means we will need a pointer to a uni 
> >>> request_io_part.
> >>> This is bad because:
> >>> 1st- There is no defined stage in a request life where to definitely set 
> >>> that pointer,
> >>>  specially in the preparation stages.
> >>> 2nd- hacks like scsi_error.c/scsi_send_eh_cmnd() will not work at all. 
> >>> Now this is a
> >>>  very bad spot already, and I have a short term fix for it in the 
> >>> SCSI-bidi patches
> >>>  (not sent yet) but a more long term solution is needed. Once such 
> >>> hacks are
> >>>  cleaned up we can do what you say. This is exactly why I use the 
> >>> access functions
> >>>  rq_uni/rq_io/rq_in/rq_out and not open code access.
> >> I'm still not really convinced about this approach.  The primary job of
> >> the block layer is to manage and merge READ and WRITE requests.  It
> >> serves a beautiful secondary function of queueing for arbitrary requests
> >> it doesn't understand (REQ_TYPE_BLOCK_PC or REQ_TYPE_SPECIAL ... or
> >> indeed any non REQ_TYPE_FS).
> >>
> >> bidirectional requests fall into the latter category (there's nothing
> >> really we can do to merge them ... they're just transported by the block
> >> layer).  The only unusual feature is that they carry two bios.  I think
> >> the drivers that actually support bidirectional will be a rarity, so it
> >> might even be advisable to add it to the queue capability (refuse
> >> bidirectional requests at the top rather than perturbing all the drivers
> >> to process them).
> >>
> >> So, what about REQ_TYPE_BIDIRECTIONAL rather than REQ_BIDI?  That will
> >> remove it from the standard path and put it on the special command type
> >> path where we can process it specially.  Additionally, if you take this
> >> approach, you can probably simply chain the second bio through
> >> req->special as an additional request in the stream.  The only thing
> >> that would then need modification would be the dequeue of the block
> >> driver (it would have to dequeue both requests and prepare them) and
> >> that needs to be done only for drivers handling bidirectional requests.
> > 
> > I agree, I'm really not crazy about shuffling the entire request setup
> > around just for something as exotic as bidirection commands. How about
> > just keeping it simple - have a second request linked off the first one
> > for the second data phase? So keep it completely seperate, not just
> > overload ->special for 2nd bio list.
> > 
> > So basically just add a struct request pointer, so you can do rq =
> > rq->next_rq or something for the next data phase. I bet this would be a
> > LOT less invasive as well, and we can get by with a few helpers to
> > support it.
> > 
> > And it should definitely be a request type.
> > 
> 
> I'm a bit confused since what you both suggest is very similar to what we've
> proposed back in October 2006 and the impression we got was that it will be
> better to support bidirectional block requests natively (yet to be honest,
> Jam

Re: [PATCH 4/4] bidi support: bidirectional request

2007-04-30 Thread Jens Axboe

On Sun, Apr 29 2007, James Bottomley wrote:
> On Sun, 2007-04-29 at 18:48 +0300, Boaz Harrosh wrote:
> > FUJITA Tomonori wrote:
> > > From: Boaz Harrosh <[EMAIL PROTECTED]>
> > > Subject: [PATCH 4/4] bidi support: bidirectional request
> > > Date: Sun, 15 Apr 2007 20:33:28 +0300
> > > 
> > >> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > >> index 645d24b..16a02ee 100644
> > >> --- a/include/linux/blkdev.h
> > >> +++ b/include/linux/blkdev.h
> > >> @@ -322,6 +322,7 @@ struct request {
> > >>  void *end_io_data;
> > >>
> > >>  struct request_io_part uni;
> > >> +struct request_io_part bidi_read;
> > >>  };
> > > 
> > > Would be more straightforward to have:
> > > 
> > > struct request_io_part in;
> > > struct request_io_part out;
> > > 
> > 
> > Yes I wish I could do that. For bidi supporting drivers this is the most 
> > logical.
> > But for the 99.9% of uni-directional drivers, calling rq_uni(), and being 
> > some what on
> > the hotish paths, this means we will need a pointer to a uni 
> > request_io_part.
> > This is bad because:
> > 1st- There is no defined stage in a request life where to definitely set 
> > that pointer,
> >  specially in the preparation stages.
> > 2nd- hacks like scsi_error.c/scsi_send_eh_cmnd() will not work at all. Now 
> > this is a
> >  very bad spot already, and I have a short term fix for it in the 
> > SCSI-bidi patches
> >  (not sent yet) but a more long term solution is needed. Once such 
> > hacks are
> >  cleaned up we can do what you say. This is exactly why I use the 
> > access functions
> >  rq_uni/rq_io/rq_in/rq_out and not open code access.
> 
> I'm still not really convinced about this approach.  The primary job of
> the block layer is to manage and merge READ and WRITE requests.  It
> serves a beautiful secondary function of queueing for arbitrary requests
> it doesn't understand (REQ_TYPE_BLOCK_PC or REQ_TYPE_SPECIAL ... or
> indeed any non REQ_TYPE_FS).
> 
> bidirectional requests fall into the latter category (there's nothing
> really we can do to merge them ... they're just transported by the block
> layer).  The only unusual feature is that they carry two bios.  I think
> the drivers that actually support bidirectional will be a rarity, so it
> might even be advisable to add it to the queue capability (refuse
> bidirectional requests at the top rather than perturbing all the drivers
> to process them).
> 
> So, what about REQ_TYPE_BIDIRECTIONAL rather than REQ_BIDI?  That will
> remove it from the standard path and put it on the special command type
> path where we can process it specially.  Additionally, if you take this
> approach, you can probably simply chain the second bio through
> req->special as an additional request in the stream.  The only thing
> that would then need modification would be the dequeue of the block
> driver (it would have to dequeue both requests and prepare them) and
> that needs to be done only for drivers handling bidirectional requests.

I agree, I'm really not crazy about shuffling the entire request setup
around just for something as exotic as bidirection commands. How about
just keeping it simple - have a second request linked off the first one
for the second data phase? So keep it completely seperate, not just
overload ->special for 2nd bio list.

So basically just add a struct request pointer, so you can do rq =
rq->next_rq or something for the next data phase. I bet this would be a
LOT less invasive as well, and we can get by with a few helpers to
support it.

And it should definitely be a request type.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BUG: Null pointer dereference in fs/open.c

2007-04-25 Thread Jens Axboe

On Thu, Apr 26 2007, William Heimbigner wrote:
> On Wed, 25 Apr 2007, Andrew Morton wrote:
> >On Wed, 25 Apr 2007 22:53:00 + (GMT) William Heimbigner 
> ><[EMAIL PROTECTED]> wrote:
> >
> >>On Wed, 25 Apr 2007, Andrew Morton wrote:
> >>
> >>>OK.  I am able to use the pktcdvd driver OK in mainline with a piix/sata
> >>>drive.  It could be that something is going wrong at the IDE level for 
> >>>you.
> >>Perhaps; I'll try an external usb cd burner, and see where that goes.
> >>
> >>>Are you able to identify the most recent kernel which actually worked?
> >>No, because I haven't set packet writing up in Linux before - however, I 
> >>do know
> >>that I've successfully set up packet writing (using 2 of the 3 cd burners 
> >>I
> >>have) in another operating system before. I'll try 2.6.18 and see if that 
> >>gets
> >>me anywhere different, though.
> >
> >OK.
> >
> >A quick summary: mainline's pktcdvd isn't working for William using IDE.
> >It is working for me using sata.
> >
> 
> >
> >So what has happened here is that this code, in ide-cd.c's
> >cdrom_decode_status() is now triggering:
> >
> > } else if (blk_pc_request(rq) || rq->cmd_type == REQ_TYPE_ATA_PC) {
> > /* All other functions, except for READ. */
> > unsigned long flags;
> >
> > /*
> >  * if we have an error, pass back CHECK_CONDITION as the
> >  * scsi status byte
> >  */
> > if (blk_pc_request(rq) && !rq->errors)
> > rq->errors = SAM_STAT_CHECK_CONDITION;
> >
> >
> >I suspect this is a bug introduced by
> >406c9b605cbc45151c03ac9a3f95e9acf050808c (in which case it'll be the third
> >bug so far).
> >
> >Perhaps the IDE driver was previously not considering these requests to be
> >of type blk_pc_request(), and after
> >406c9b605cbc45151c03ac9a3f95e9acf050808c it _is_ treating them as
> >blk_pc_request() and is incorrectly reporting an error.  Or something like
> >that.
> >
> >Guys: help!
> >
> A follow-up: after looking around a bit, I have managed to get packet 
> writing to work properly on /dev/hdc (before, it was reporting only 1.8 MB 
> available or so; this was a formatting issue).
> I've also gotten the external cd-rw drive to work. However, I'm still at a 
> loss as to why /dev/hdd won't work. I tried formatting a dvd-rw for this 
> drive, however, it consistently gives me:
> [27342.503933] drivers/ide/ide-cd.c:729: setting error to 2
> [27342.509251]  [] show_trace_log_lvl+0x1a/0x30
> [27342.514411]  [] show_trace+0x12/0x20
> [27342.518864]  [] dump_stack+0x16/0x20
> [27342.523317]  [] cdrom_decode_status+0x1f4/0x3b0
> [27342.528732]  [] cdrom_newpc_intr+0x38/0x320
> [27342.533791]  [] ide_intr+0x96/0x200
> [27342.538157]  [] handle_IRQ_event+0x28/0x60
> [27342.543139]  [] handle_edge_irq+0xa6/0x130
> [27342.548121]  [] do_IRQ+0x49/0xa0
> [27342.552228]  [] common_interrupt+0x2e/0x34
> [27342.557200]  [] mwait_idle+0x12/0x20
> [27342.561653]  [] cpu_idle+0x4a/0x80
> [27342.565934]  [] rest_init+0x37/0x40
> [27342.570300]  [] start_kernel+0x34b/0x420
> [27342.575109]  [<>] 0x0
> [27342.578089]  ===
> and doesn't work (the above output was generated by Andrew's patch to log 
> certain areas).
> 
> # dvd+rw-format /dev/hdd -force
> * BD/DVDRW/-RAM format utility by <[EMAIL PROTECTED]>, version 7.0.
> :-( failed to locate "Quick Format" descriptor.
> * 4.7GB DVD-RW media in Sequential mode detected.
> * formatting 0.0\:-[ READ TRACK INFORMATION failed with 
> SK=3h/ASC=11h/ACQ=05h]: Input/output error

That's an uncorrectable read error. Is the media good?

> I tried putting in a different dvd-rw, and this time I get:
> # dvd+rw-format /dev/hdd -force
> * BD/DVDRW/-RAM format utility by <[EMAIL PROTECTED]>, version 7.0.
> * 4.7GB DVD-RW media in Sequential mode detected.
> * formatting 0.0|:-[ FORMAT UNIT failed with SK=5h/ASC=26h/ACQ=00h]: 
> Input/output error

That's the drive complaining about an invalid bit being set in the
command descriptor block. That's usually a bug in the issuer.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BUG: Null pointer dereference in fs/open.c

2007-04-25 Thread Jens Axboe

On Wed, Apr 25 2007, Andrew Morton wrote:
> > # pktsetup 2 /dev/sr0
> > [19982.934793] drivers/scsi/scsi_lib.c:838: setting error to 134217730
> > [19982.941070]  [] show_trace_log_lvl+0x1a/0x30
> > [19982.946256]  [] show_trace+0x12/0x20
> > [19982.950744]  [] dump_stack+0x16/0x20
> > [19982.955232]  [] scsi_io_completion+0x28a/0x3a0
> > [19982.960586]  [] scsi_blk_pc_done+0x1b/0x30
> > [19982.965594]  [] scsi_finish_command+0x4c/0x60
> > [19982.970861]  [] scsi_softirq_done+0x77/0xe0
> > [19982.975955]  [] blk_done_softirq+0x6b/0x80
> > [19982.980962]  [] __do_softirq+0x62/0xc0
> > [19982.985624]  [] do_softirq+0x55/0x60
> > [19982.990112]  [] ksoftirqd+0x65/0x100
> > [19982.994599]  [] kthread+0xa3/0xd0
> > [19982.998827]  [] kernel_thread_helper+0x7/0x10
> > [19983.004095]  ===
> > [19983.009065] cdrom: This disc doesn't have any tracks I recognize!
> 
> So what has happened here is that this code, in ide-cd.c's
> cdrom_decode_status() is now triggering:
> 
>   } else if (blk_pc_request(rq) || rq->cmd_type == REQ_TYPE_ATA_PC) {
>   /* All other functions, except for READ. */
>   unsigned long flags;
> 
>   /*
>* if we have an error, pass back CHECK_CONDITION as the
>* scsi status byte
>*/
>   if (blk_pc_request(rq) && !rq->errors)
>   rq->errors = SAM_STAT_CHECK_CONDITION;
> 
> 
> I suspect this is a bug introduced by
> 406c9b605cbc45151c03ac9a3f95e9acf050808c (in which case it'll be the third
> bug so far).
> 
> Perhaps the IDE driver was previously not considering these requests to be
> of type blk_pc_request(), and after
> 406c9b605cbc45151c03ac9a3f95e9acf050808c it _is_ treating them as
> blk_pc_request() and is incorrectly reporting an error.  Or something like
> that.

But it IS a block pc request. We've been setting the sam stats on
->errors for block pc request for a long time.

> Guys: help!

I'm not sure what your question is, if someone can some up what the what
goes wrong and what the expected result is, I can try and help.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Maxtor 6B250S0/BANC1B70 hangs with NCQ

2007-03-29 Thread Jens Axboe

Hi,

I've seen this several times on this drive, completely reproducible.
Once it has hung, power needs to be cut from the drive to recover it, a
simple reboot is not enough. So I'd suggest disabling NCQ on this
driver.

Error log attached.

Signed-off-by: Jens Axboe <[EMAIL PROTECTED]>

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index f1f595f..ddb3909 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -3361,6 +3361,8 @@ static const struct ata_blacklist_entry 
ata_device_blacklist [] = {
{ "FUJITSU MHT2060BH",  NULL,   ATA_HORKAGE_NONCQ },
/* NCQ is broken */
{ "Maxtor 6L250S0", "BANC1G10", ATA_HORKAGE_NONCQ },
+   /* NCQ hard hangs device under heavier load, needs hard power cycle */
+   { "Maxtor 6B250S0", "BANC1B70", ATA_HORKAGE_NONCQ },
 
/* Devices with NCQ limits */
 

-- 
Jens Axboe

ata1.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 0x2 frozen
ata1.00: cmd 61/08:00:3f:00:f4/00:00:0c:00:00/40 tag 0 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:08:3f:00:f8/00:00:0c:00:00/40 tag 1 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:10:3f:00:fc/00:00:0c:00:00/40 tag 2 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:18:3f:00:00/00:00:0d:00:00/40 tag 3 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:20:3f:00:04/00:00:0d:00:00/40 tag 4 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:28:3f:00:08/00:00:0d:00:00/40 tag 5 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:30:3f:00:0c/00:00:0d:00:00/40 tag 6 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:38:3f:00:10/00:00:0d:00:00/40 tag 7 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:40:3f:00:14/00:00:0d:00:00/40 tag 8 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:48:3f:00:18/00:00:0d:00:00/40 tag 9 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:50:3f:00:1c/00:00:0d:00:00/40 tag 10 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:58:3f:00:20/00:00:0d:00:00/40 tag 11 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:60:3f:00:24/00:00:0d:00:00/40 tag 12 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:68:3f:00:28/00:00:0d:00:00/40 tag 13 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:70:3f:00:2c/00:00:0d:00:00/40 tag 14 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:78:3f:00:30/00:00:0d:00:00/40 tag 15 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:80:3f:00:34/00:00:0d:00:00/40 tag 16 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:88:3f:00:38/00:00:0d:00:00/40 tag 17 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:90:3f:00:3c/00:00:0d:00:00/40 tag 18 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:98:3f:00:40/00:00:0d:00:00/40 tag 19 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:a0:3f:00:44/00:00:0d:00:00/40 tag 20 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:a8:3f:00:48/00:00:0d:00:00/40 tag 21 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:b0:3f:00:4c/00:00:0d:00:00/40 tag 22 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:b8:3f:00:50/00:00:0d:00:00/40 tag 23 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:c0:3f:00:54/00:00:0d:00:00/40 tag 24 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:c8:3f:00:dc/00:00:0c:00:00/40 tag 25 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: cmd 61/08:d0:3f:00:e0/00:00:0c:00:00/40 tag 26 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask

Re: 2.6.20.3 AMD64 oops in CFQ code

2007-03-22 Thread Jens Axboe

On Thu, Mar 22 2007, [EMAIL PROTECTED] wrote:
> This is a uniprocessor AMD64 system running software RAID-5 and RAID-10
> over multiple PCIe SiI3132 SATA controllers.  The hardware has been very
> stable for a long time, but has been acting up of late since I upgraded
> to 2.6.20.3.  ECC memory should preclude the possibility of bit-flip
> errors.
> 
> Kernel 2.6.20.3 + linuxpps patches (confined to drivers/serial, and not
> actually in use as I stole the serial port for a console).
> 
> It takes half a day to reproduce the problem, so bisecting would be painful.
> 
> BackupPC_dump mostly writes to a large (1.7 TB) ext3 RAID5 partition.
> 
> 
> Here are two oopes, a few minutes (16:31, to be precise) apart.
> Unusually, it oopsed twice *without* locking up the system..  Usually,
> I see this followed by an error from drivers/input/keyboard/atkbd.c:
> printk(KERN_WARNING "atkbd.c: Spurious %s on %s. "
>"Some program might be trying access hardware 
> directly.\n",
> emitted at 1 Hz with the keyboard LEDs flashing and the system
> unresponsive to keyboard or pings.
> (I think it was spurious ACK on serio/input0, but my memory may be faulty.)
> 
> 
> If anyone has any suggestions, they'd be gratefully received.
> 
> 
> Unable to handle kernel NULL pointer dereference at 0098 RIP: 
>  [] cfq_dispatch_insert+0x18/0x68
> PGD 777e9067 PUD 78774067 PMD 0 
> Oops:  [1] 
> CPU 0 
> Modules linked in: ecb
> Pid: 2837, comm: BackupPC_dump Not tainted 2.6.20.3-g691f5333 #40
> RIP: 0010:[]  [] 
> cfq_dispatch_insert+0x18/0x68
> RSP: 0018:8100770bbaf8  EFLAGS: 00010092
> RAX: 81007fb36c80 RBX:  RCX: 0001
> RDX: 00010003e4e7 RSI:  RDI: 
> RBP: 81007fb37a00 R08:  R09: 81005d390298
> R10: 81007fcb4f80 R11: 81007fcb4f80 R12: 81007facd280
> R13: 0004 R14: 0001 R15: 
> FS:  2b322d120d30() GS:805de000() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0098 CR3: 7bcf CR4: 06e0
> Process BackupPC_dump (pid: 2837, threadinfo 8100770ba000, task 
> 81007fc5d8e0)
> Stack:   8100770f39f0  0004
>  0001 80315253 803b2607 81005da2bc40
>  81007fac3800 81007facd280 81007facd280 81005d390298
> Call Trace:
>  [] cfq_dispatch_requests+0x152/0x512
>  [] scsi_done+0x0/0x18
>  [] elv_next_request+0x137/0x147
>  [] scsi_request_fn+0x6a/0x33a
>  [] generic_unplug_device+0xa/0xe
>  [] unplug_slaves+0x5b/0x94
>  [] sync_page+0x0/0x40
>  [] sync_page+0x36/0x40
>  [] __wait_on_bit_lock+0x36/0x65
>  [] __lock_page+0x5e/0x64
>  [] wake_bit_function+0x0/0x23
>  [] find_get_page+0xe/0x2d
>  [] do_generic_mapping_read+0x1c2/0x40d
>  [] file_read_actor+0x0/0x118
>  [] generic_file_aio_read+0x15c/0x19e
>  [] do_sync_read+0xc9/0x10c
>  [] may_open+0x5b/0x1c6
>  [] autoremove_wake_function+0x0/0x2e
>  [] vfs_read+0xaa/0x152
>  [] sys_read+0x45/0x6e
>  [] system_call+0x7e/0x83

3 (I think) seperate instances of this, each involving raid5. Is your
array degraded or fully operational?


-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libata: add NCQ blacklist entries from Silicon Image Windows driver

2007-02-22 Thread Jens Axboe

On Thu, Feb 22 2007, Mark Lord wrote:
> Jens Axboe wrote:
> >>Robert Hancock wrote:
> >..
> >>>+/* The following blacklist entries are taken from the Windows
> >>>+   driver .inf files for the Silicon Image 3124 and 3132. */
> >>>+{ "Maxtor 7B250S0","BANC1B70",ATA_HORKAGE_NONCQ, },
> >>>+{ "HTS541060G9SA00","MB3OC60D",ATA_HORKAGE_NONCQ, },
> >>>+{ "HTS541080G9SA00","MB4OC60D",ATA_HORKAGE_NONCQ, },
> >>>+{ "HTS541010G9SA00","MBZOC60D",ATA_HORKAGE_NONCQ, },
> >>Do we have information that these drives fail on non-SiI controllers?
> >>
> >>Sometimes the problem can be related to a single family of controllers.
> >
> >I don't know about the Hitachi's, but the Maxtor with that firmware is
> >definitely broken. It _appeared_ to work if the depth was limited to 4,
> >but I didn't test it long enough to be absolutely certain. So the safest
> >is indeed to blacklist it.
> 
> Yes, broken on the Silicon Image controllers for sure.
> But what type of controller did you observe the failures on, Jens ?

achi

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libata: add NCQ blacklist entries from Silicon Image Windows driver

2007-02-22 Thread Jens Axboe

On Thu, Feb 22 2007, Jeff Garzik wrote:
> Robert Hancock wrote:
> >This patch adds in some NCQ blacklist entries taken from the Silicon
> >Image Windows drivers' .inf files for the 3124 and 3132 controllers.
> >These entries were marked as ""DisableSataQueueing". Assume these are
> >in their blacklist for a reason and disable NCQ on these drives.
> >
> >Signed-off-by: Robert Hancock <[EMAIL PROTECTED]>
> >
> >--- linux-2.6.21-rc1edit/drivers/ata/libata-core.c.prev2007-02-21 
> >22:23:05.0 -0600
> >+++ linux-2.6.21-rc1edit/drivers/ata/libata-core.c2007-02-21 
> >22:25:44.0 -0600
> >@@ -3269,6 +3269,13 @@ static const struct ata_blacklist_entry
> >/* Devices with NCQ limits */
> >
> >+/* The following blacklist entries are taken from the Windows
> >+   driver .inf files for the Silicon Image 3124 and 3132. */
> >+{ "Maxtor 7B250S0","BANC1B70",ATA_HORKAGE_NONCQ, },
> >+{ "HTS541060G9SA00","MB3OC60D",ATA_HORKAGE_NONCQ, },
> >+{ "HTS541080G9SA00","MB4OC60D",ATA_HORKAGE_NONCQ, },
> >+{ "HTS541010G9SA00","MBZOC60D",ATA_HORKAGE_NONCQ, },
> 
> Do we have information that these drives fail on non-SiI controllers?
> 
> Sometimes the problem can be related to a single family of controllers.

I don't know about the Hitachi's, but the Maxtor with that firmware is
definitely broken. It _appeared_ to work if the depth was limited to 4,
but I didn't test it long enough to be absolutely certain. So the safest
is indeed to blacklist it.

I'm pretty sure my initial NCQ patches had a blacklist entry for that
drive as well, but apparently the blacklist got lost somewhere along the
way?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: libata FUA revisited

2007-02-21 Thread Jens Axboe

On Wed, Feb 21 2007, Tejun Heo wrote:
> Jens Axboe wrote:
> > On Wed, Feb 21 2007, Tejun Heo wrote:
> >> [cc'ing Ric, Hannes and Dongjun, Hello.  Feel free to drag other people 
> >> in.]
> >>
> >> Robert Hancock wrote:
> >>> Jens Axboe wrote:
> >>>> But we can't really change that, since you need the cache flushed before
> >>>> issuing the FUA write. I've been advocating for an ordered bit for
> >>>> years, so that we could just do:
> >>>>
> >>>> 3. w/FUA+ORDERED
> >>>>
> >>>> normal operation -> barrier issued -> write barrier FUA+ORDERED
> >>>>  -> normal operation resumes
> >>>>
> >>>> So we don't have to serialize everything both at the block and device
> >>>> level. I would have made FUA imply this already, but apparently it's not
> >>>> what MS wanted FUA for, so... The current implementations take the FUA
> >>>> bit (or WRITE FUA) as a hint to boost it to head of queue, so you are
> >>>> almost certainly going to jump ahead of already queued writes. Which we
> >>>> of course really do not.
> >> Yeah, I think if we have tagged write command and flush tagged (or
> >> barrier tagged) things can be pretty efficient.  Again, I'm much more
> >> comfortable with separate opcodes for those rather than bits changing
> >> the behavior.
> > 
> > ORDERED+FUA NCQ would still be preferable to an NCQ enabled flush
> > command, though.
> 
> I think we're talking about two different things here.
> 
> 1. The barrier write (FUA write) combined with flush.  I think it would
> help improving the performance but I think issuing two commands
> shouldn't be too slower than issuing one combined command unless it
> causes extra physical activity (moving head, etc...).

The command overhead is dwarfed by other factors, agree.

> 2. FLUSH currently flushes all writes.  If we can mark certain commands
> requiring ordering, we can selectively flush or order necessary writes.
>  (No need to flush 16M buffer all over the disk when only journal needs
> barriering)

Sure, anything is better than the sledge hammer flush. But my claim is
that an ORDERED+FUA enabled write for critical data would be a good
approach, and simple in software.

> >> Another idea Dongjun talked about while drinking in LSF was ranged
> >> flush.  Not as flexible/efficient as the previous option but much less
> >> intrusive and should help quite a bit, I think.
> > 
> > But that requires extensive tracking, I'm not so sure the implementation
> > of that for barriers would be very clean. It'd probably be good for
> > fsync, though.
> 
> I was mostly thinking about journal area.  Using it for other purposes
> would incur a lot of complexity.  :-(

Yep if it's just for the journal, the range is known and fixed, so the
flush range would work nicely there.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: libata FUA revisited

2007-02-21 Thread Jens Axboe

On Mon, Feb 19 2007, Robert Hancock wrote:
> Jens Axboe wrote:
> >But we can't really change that, since you need the cache flushed before
> >issuing the FUA write. I've been advocating for an ordered bit for
> >years, so that we could just do:
> >
> >3. w/FUA+ORDERED
> >
> >normal operation -> barrier issued -> write barrier FUA+ORDERED
> > -> normal operation resumes
> >
> >So we don't have to serialize everything both at the block and device
> >level. I would have made FUA imply this already, but apparently it's not
> >what MS wanted FUA for, so... The current implementations take the FUA
> >bit (or WRITE FUA) as a hint to boost it to head of queue, so you are
> >almost certainly going to jump ahead of already queued writes. Which we
> >of course really do not.
> 
> I think that FUA was designed for a different use case than what Linux 
> is using barriers for currently. The advantage with FUA is when you have 

[snip]

Yes that's pretty obvious, my point is just that FUA+ORDERED would be a
nice thing to have for us.

> >I'm not too nervous about the FUA write commands, I hope we can safely
> >assume that if you set the FUA supported bit in the id AND the write fua
> >command doesn't get aborted, that FUA must work. Anything else would
> >just be an immensely stupid implementation. NCQ+FUA is more tricky, I
> >agree that it being just a command bit does make it more likely that it
> >could be ignored. And that is indeed a danger. Given state of NCQ in
> >early firmware drives, I would not at all be surprised if the drive
> >vendors screwed that up too.
> >
> >But, since we don't have the ordered bit for NCQ/FUA anyway, we do need
> >to drain the drive queue before issuing the WRITE/FUA. And at that point
> >we may as well not use the NCQ command, just go for the regular non-NCQ
> >FUA write. I think that should be safe.
> 
> Aside from the issue above, as I mentioned elsewhere, lots of NCQ drives 
> don't support non-NCQ FUA writes..

"Lots" meaning how many? All the ones I have here support FUA.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: libata FUA revisited

2007-02-21 Thread Jens Axboe

On Wed, Feb 21 2007, Tejun Heo wrote:
> [cc'ing Ric, Hannes and Dongjun, Hello.  Feel free to drag other people in.]
> 
> Robert Hancock wrote:
> > Jens Axboe wrote:
> >> But we can't really change that, since you need the cache flushed before
> >> issuing the FUA write. I've been advocating for an ordered bit for
> >> years, so that we could just do:
> >>
> >> 3. w/FUA+ORDERED
> >>
> >> normal operation -> barrier issued -> write barrier FUA+ORDERED
> >>  -> normal operation resumes
> >>
> >> So we don't have to serialize everything both at the block and device
> >> level. I would have made FUA imply this already, but apparently it's not
> >> what MS wanted FUA for, so... The current implementations take the FUA
> >> bit (or WRITE FUA) as a hint to boost it to head of queue, so you are
> >> almost certainly going to jump ahead of already queued writes. Which we
> >> of course really do not.
> 
> Yeah, I think if we have tagged write command and flush tagged (or
> barrier tagged) things can be pretty efficient.  Again, I'm much more
> comfortable with separate opcodes for those rather than bits changing
> the behavior.

ORDERED+FUA NCQ would still be preferable to an NCQ enabled flush
command, though.

> Another idea Dongjun talked about while drinking in LSF was ranged
> flush.  Not as flexible/efficient as the previous option but much less
> intrusive and should help quite a bit, I think.

But that requires extensive tracking, I'm not so sure the implementation
of that for barriers would be very clean. It'd probably be good for
fsync, though.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: libata FUA revisited

2007-02-15 Thread Jens Axboe

On Tue, Feb 13 2007, Tejun Heo wrote:
> >>So, actually, I was thinking about *always* using the non-NCQ FUA 
> >>opcode.  As currently implemented, FUA request is always issued by 
> >>itself, so NCQ doesn't make any difference there.  So, I think it 
> >>would be better to turn on FUA on driver-by-driver basis whether the 
> >>controller supports NCQ or not.
> >
> >Unfortunately not all drives that support NCQ support the non-NCQ FUA 
> >commands (my Seagates are like this).
> 
> And I'm a bit scared to set FUA bit on such drives and trust that it 
> will actually do FUA, so our opinions aren't too far away from each 
> other.  :-)
> 
> >There's definitely a potential advantage to FUA with NCQ - if you have 
> >non-synchronous accesses going on concurrently with synchronous ones, if 
> >you have to use non-NCQ FUA or flush cache commands, you have to wait 
> >for all the IOs of both types to drain out before you can issue the 
> >flush (since those can't be overlapped with the NCQ read/writes). And if 
> >you can only use flush cache, then you're forcing all the writes to be 
> >flushed including the non-synchronous ones you didn't care about. 
> >Whether or not the block layer currently exploits this I don't know, but 
> >it definitely could.
> 
> The current barrier implementation uses the following sequences for 
> no-FUA and FUA cases.
> 
> 1. w/o FUA
> 
> normal operation -> barrier issued -> drain IO -> flush -> barrier 
> written -> flush -> normal operation resumes
> 
> 2. w/ FUA
> 
> normal operation -> barrier issued -> drain IO -> flush -> barrier 
> written / FUA -> normal operation resumes
> 
> So, the FUA write is issued by itself.  This isn't really efficient and 
> frequent barriers impact the performance badly.  If we can change that 
> NCQ FUA will be certainly beneficial.

But we can't really change that, since you need the cache flushed before
issuing the FUA write. I've been advocating for an ordered bit for
years, so that we could just do:

3. w/FUA+ORDERED

normal operation -> barrier issued -> write barrier FUA+ORDERED
 -> normal operation resumes

So we don't have to serialize everything both at the block and device
level. I would have made FUA imply this already, but apparently it's not
what MS wanted FUA for, so... The current implementations take the FUA
bit (or WRITE FUA) as a hint to boost it to head of queue, so you are
almost certainly going to jump ahead of already queued writes. Which we
of course really do not.

> >>Well, I might be being too paranoid but silent FUA failure would be 
> >>really hard to diagnose if that ever happens (and I'm fairly certain 
> >>that it will on some firmwares).
> >
> >Well, there are also probably drives that ignore flush cache commands or 
> > fail to do other things that they should. There's only so far we can go 
> >in coping if the firmware authors are being retarded. If any drive is 
> >broken like that we should likely just blacklist NCQ on it entirely as 
> >obviously little thought or testing went into the implementation..
> 
> FLUSH has been around quite long time now and most drives don't have 
> problem with that.  FUA on ATA is still quite new and libata will be the 
> first major user of it if we enable it by default.  It just seems too 
> easy to ignore that bit and successfully complete the write - there 
> isn't any safety net as opposed to using a separate opcode.  So, I'm a 
> bit nervous.

I'm not too nervous about the FUA write commands, I hope we can safely
assume that if you set the FUA supported bit in the id AND the write fua
command doesn't get aborted, that FUA must work. Anything else would
just be an immensely stupid implementation. NCQ+FUA is more tricky, I
agree that it being just a command bit does make it more likely that it
could be ignored. And that is indeed a danger. Given state of NCQ in
early firmware drives, I would not at all be surprised if the drive
vendors screwed that up too.

But, since we don't have the ordered bit for NCQ/FUA anyway, we do need
to drain the drive queue before issuing the WRITE/FUA. And at that point
we may as well not use the NCQ command, just go for the regular non-NCQ
FUA write. I think that should be safe.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: ahci problems with sata disk.

2007-01-17 Thread Jens Axboe

On Tue, Jan 16 2007, Eric D. Mudama wrote:

[snip lots of stuff I agree completely with]

> If done properly, queueing should never hurt performance.  High queue
> depths will increase average latency of course, but shouldn't hurt
> overall performance.

It may never hurt performance, but there are common scenarios where you
are much better off not doing queuing even if you could. A good example
of that is a media serving service, where you end up reading a bunch of
files sequentially. It's faster to read chunks of each file sequentially
at depth 1 and move on, than queue a a request from each of them and
send them to the drive. On my laptop with an NCQ enabled drive, the
mentioned approach outperforms queuing by more than 100%.

> >NCQ mainly helps with multiple threads doing reads.  Writes are
> >largely asynchronous to the user already (except for fsync-style
> >writes).  You want to be able to stuff the disk's internal elevator
> >with as many read requests as possible, because reads are very often
> >synchronous -- most apps (1) read a block, (2) do something, (3) goto
> >step #1.  The kernel's elevator isn't much use in these cases.
> 
> True.  And internal to the drive, normal elevator is "meh."  There are
> other algorithms for scheduling that perform better.

Well Linux doesn't default to using a normal elevator, so it's a moot
point.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: ahci problems with sata disk.

2007-01-17 Thread Jens Axboe

On Tue, Jan 16 2007, Jeff Garzik wrote:
> Mark Hahn wrote:
> >>>>I though that NCQ was intended to increase performance ??
> >
> >intended to increase _sales_ performance ;)
> 
> Yep.
> 
> 
> >remember that you've always had command queueing (kernel elevator): the 
> >main difference with NCQ (or SCSI tagged queueing) is when
> >the disk can out-schedule the kernel.  afaikt, this means sqeezing
> >in a rotationally intermediate request along the way.
> >
> >that intermediate request must be fairly small and should be a read
> >(for head-settling reasons).
> >
> >I wonder how often this happens in the real world, given the relatively
> >small queues the disk has to work with.
> 
> ISTR either Jens or Andrew ran some numbers, and found that there was 
> little utility beyond 4 or 8 tags or so.

It entirely depends on the access pattern. For truly random reads,
performance does seem to continue to scale up with increasing drive
queue depths. It may only be a benchmark figure though, as truly random
read workloads probably aren't that common :-)

For anything else, going beyond 4 tags doesn't improve much.

> >>My hdparm test is a sequential read-ahead test, so it will
> >>naturally perform worse on a Raptor when NCQ is on.
> >
> >that's a surprisingly naive heuristic, especially since NCQ is concerned 
> >with just a max of ~4MB of reads, only a smallish
> >fraction of the available cache.
> 
> NCQ mainly helps with multiple threads doing reads.  Writes are largely 
> asynchronous to the user already (except for fsync-style writes).  You 
> want to be able to stuff the disk's internal elevator with as many read 
> requests as possible, because reads are very often synchronous -- most 
> apps (1) read a block, (2) do something, (3) goto step #1.  The kernel's 
> elevator isn't much use in these cases.

Au contraire, this is one of the cases where intelligent IO scheduling
in the kernel makes a ton of difference. It's the primary reason that AS
and CFQ are able to maintain > 90% of disk bandwidth for more than one
process, idling the drive for the duration of step 2 in the sequence
above (step 2 is typically really small, time wise). If the next block
read is close to the first one, that is. If you do that, you will
greatly outperform the same workload pushed to the drive scheduling.
I've done considerable benchmarks on this. Only if the processes are
doing random IO should the IO scheduler punt and push everything to the
drive queue.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.20-rc4-mm1

2007-01-15 Thread Jens Axboe

On Mon, Jan 15 2007, Ingo Molnar wrote:
> 
> * Jens Axboe <[EMAIL PROTECTED]> wrote:
> 
> > > In a previous write invoked by: fsck.ext3(1896): WRITE block 8552 on 
> > > sdb1 end_buffer_async_write() is invoked.
> > > 
> > > sdb1 is not a part of a raid device.
> > 
> > When I briefly tested this before I left (and found it broken), doing 
> > a cat /proc/mdstat got things going again. Hard if that's your rootfs, 
> > it's just a hint :-)
> 
> hm, so you knew it's broken, still you let Andrew pick it up, or am i 
> misunderstanding something?

Well the raid issue wasn't known before it was in -mm.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.20-rc4-mm1

2007-01-14 Thread Jens Axboe

On Sun, Jan 14 2007, Thomas Gleixner wrote:
> On Sun, 2007-01-14 at 11:46 +0100, Thomas Gleixner wrote:
> > > Boot proceeds, but gets stuck hard at:
> > > "Remounting root filesystem in read-write mode:"
> > > 
> > > No SysRq-T, nothing.
> > > 
> > > The above BUG seems unrelated to that. Investigating further.
> > 
> > Bisect identified: git-block.patch
> 
> Does only happen on 2 systems. Both have sata + raid1 setup. I managed 
> to get a stacktrace from the SMP box. Sits there and sleeps forever.
> 
>   tglx
> 
> [] io_schedule+0x7a/0x9a
> [] sleep_on_page+0x8/0xc
> [] __wait_on_bit+0x36/0x5d
> [] wait_on_page_bit+0x5b/0x61
> [] wait_on_page_writeback_range+0x4f/0xef
> [] filemap_fdatawait+0x44/0x49
> [] filemap_write_and_wait+0x22/0x2d
> [] sync_blockdev+0x17/0x1d
> [] quota_sync_sb+0x33/0xd6
> [] sync_dquots+0x22/0xfa
> [] __fsync_super+0x17/0x66
> [] fsync_super+0xb/0x19
> [] do_remount_sb+0x49/0x101
> [] do_mount+0x1ad/0x678
> [] sys_mount+0x6f/0xa4
> [] sysenter_past_esp+0x5f/0x99

raid seems to have severe problems with the plugging change. I'll try
and find Neil and have a chat with him, hopefully we can work it out.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.20-rc4-mm1

2007-01-14 Thread Jens Axboe

On Sun, Jan 14 2007, Thomas Gleixner wrote:
> On Mon, 2007-01-15 at 09:05 +1100, Jens Axboe wrote:
> > raid seems to have severe problems with the plugging change. I'll try
> > and find Neil and have a chat with him, hopefully we can work it out.
> 
> Some hints:
> 
> mount(1899): WRITE block 16424 on md3
> call md_write_start
> md3_raid1(438): WRITE block 40965504 on sdb6
> md3_raid1(438): WRITE block 40965504 on sda6
> First Write sector 16424 disks 2
> 
> Stuck.
> 
> Note, that neither end_buffer_async_write() nor
> raid1_end_write_request() are invoked, 
> 
> In a previous write invoked by:
> fsck.ext3(1896): WRITE block 8552 on sdb1
> end_buffer_async_write() is invoked.
> 
> sdb1 is not a part of a raid device.

When I briefly tested this before I left (and found it broken), doing a
cat /proc/mdstat got things going again. Hard if that's your rootfs,
it's just a hint :-)

> Hope that helps,

I can reproduce, so that's not a problem. I can't do much about it until
I'm back next week, but Neil might be able to help. We shall see. Thanks
for testing.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices

2007-01-04 Thread Jens Axboe

On Thu, Jan 04 2007, Mark Lord wrote:
> Jens Axboe wrote:
> >On Wed, Jan 03 2007, James Bottomley wrote:
> >>Er, well, as you know, I've never been a fan of this static list.  I
> >>thought Jens was going to put us all out of our misery by making the
> >>list settable per device by root and thus shovel the problem off onto
> >>the distros?
> >
> >The code is there, just haven't had the guts to merge it yet.
> >
> >http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=bf5f922d167a5c5cf57132bbcaa1e0ddfd5c45f7
> >
> 
> That's nice, but doesn't help with the case of trying to do ATA passthru
> to ATAPI devices, the subject of the original patch here.

James brought up the filtering issue, my reply pertains to that alone.
Hence the snipping of unrelated contents in the original mail :-)

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices

2007-01-04 Thread Jens Axboe

On Wed, Jan 03 2007, James Bottomley wrote:
> Er, well, as you know, I've never been a fan of this static list.  I
> thought Jens was going to put us all out of our misery by making the
> list settable per device by root and thus shovel the problem off onto
> the distros?

The code is there, just haven't had the guts to merge it yet.

http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=bf5f922d167a5c5cf57132bbcaa1e0ddfd5c45f7

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] cdrom: longer timeout for "Read Track Info" command

2007-01-02 Thread Jens Axboe

On Tue, Jan 02 2007, Alan wrote:
> On Mon, 1 Jan 2007 18:36:24 -0800
> Jeremy Higdon <[EMAIL PROTECTED]> wrote:
> 
> > I have a DVD combo drive and a CD in which the
> > "READ TRACK INFORMATION" command (implemented in the
> > cdrom_get_track_info() function) takes about 7 seconds to run.
> > The current implementation of cdrom_get_track_info() uses the
> > default timeout of 5 seconds.  So here's a patch that increases
> > the timeout from 5 to 15 seconds.
> > 
> > The drive in question is a TSSTcorpCD/DVDW SN-S082D, and I have
> > a Silicon Image 680A adapter, in case that's of interest.
> > 
> > signed-off-by: <[EMAIL PROTECTED]>
> 
> Please test with a seven second timeout rather than fifteen which is way
> longer than anyone wants to wait. Seven is the magic value used by
> another major vendor so ought to be right for all hardware 8)

Yep, I suspect this patch is long overdue. Jeremy, is this enough to fix
it for you?

diff --git a/drivers/cdrom/cdrom.c b/drivers/cdrom/cdrom.c
index 66d028d..3105ddd 100644
--- a/drivers/cdrom/cdrom.c
+++ b/drivers/cdrom/cdrom.c
@@ -337,6 +337,12 @@ static const char *mrw_address_space[] = { "DMA", "GAA" };
 /* used in the audio ioctls */
 #define CHECKAUDIO if ((ret=check_for_audio_disc(cdi, cdo))) return ret
 
+/*
+ * Another popular OS uses 7 seconds as the hard timeout for default
+ * commands, so it is a good choice for us as well.
+ */
+#define CDROM_DEF_TIMEOUT  (7 * HZ)
+
 /* Not-exported routines. */
 static int open_for_data(struct cdrom_device_info * cdi);
 static int check_for_audio_disc(struct cdrom_device_info * cdi,
@@ -1528,7 +1534,7 @@ void init_cdrom_command(struct packet_command *cgc, void 
*buf, int len,
cgc->buffer = (char *) buf;
cgc->buflen = len;
    cgc->data_direction = type;
-   cgc->timeout = 5*HZ;
+   cgc->timeout = CDROM_DEF_TIMEOUT;
 }
 
 /* DVD handling */

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] libata: turn off NCQ if queue depth is adjusted to 1

2006-12-18 Thread Jens Axboe

On Sat, Dec 16 2006, Jeff Garzik wrote:
> Ric Wheeler wrote:
> >Just thinking out loud, but it would be really helpful to get drive 
> >vendor's a basic set of tests for Linux systems -  error handling, 
> >performance, SMART features, etc - that would run natively on linux.
> >We would need to get something really easy to deploy, like a live CD 
> >image with the test suite that could be booted on a pc, to get into an 
> >environment that is used to booting DOS based floppies...
> 
> Strongly agreed.
> 
> I know some people use DOS-based environments; I would prefer the 
> following test environment:
> 
> Equip systems with NICs that can do wake-on-lan and PXE.  To initiate 
> testing of a system, perform a PXE boot, which downloads a 
> custom-compiled kernel and initrd over the net.  The kernel boots, sets 
> up a test environment in either ramfs or nfs (or a combination thereof), 
> and runs a "do everything" script which starts the tests specified by 
> the network admin.
> 
> The tests performed should be in three classes:  (1) data and non-data 
> tests performed over a "direct submit" interface like SG_IO, (2) data 
> tests performed by directly accessing the block device, and (3) data 
> tests performed by accessing data through a common filesystem [ext3 or 
> whatever is popular].
> 
> It is already trivial to write tests for #2 and #3.  Tests in class #1 
> may require some thought and complexity, such as using multiple threads, 
> to achieve maximal use of command queueing features.  I'm not aware of 
> any userspace interface that allows fine-grained control of TCQ (Jens 
> correct me here), or even an interface that does not require multiple 
> threads to submit multiple tasks simultaneously.

fio can do all of that, it supports a variety of io engines that allow
you to control the queue depth. So for SG_IO, you can obviously only do
depth of 1 as it's a sync interface, but the fio sg io engine also
supports the async /dev/sg (or bsg) interface that allows you to do
queueing.

Using fio it's quite simple to stress test the various interfaces. From
the io scheduler POV, it's also quite interesting to mix eg normal block
device access with SG_IO, to test that as well.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: -mm merge plans for 2.6.20

2006-12-04 Thread Jens Axboe

On Mon, Dec 04 2006, Andrew Morton wrote:
> > 
> > > libata_resume_fix.patch
> > 
> > I thought this was resolved long ago?  Are there still open reports that 
> > this solves, where upstream doesn't work?
> 
> Heck, I don't know.

I'm not aware of any, and resume works for me. Did that patch ever get
verified as fixing something for anybody? I think it can be safely
dropped.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: AHCI and add_disk_randomness()

2006-11-16 Thread Jens Axboe

On Wed, Nov 15 2006, Marek Podmaka wrote:
> Hello,
> 
>   I have server with Intel 5000V motherboard with integrated AHCI SATA
>   controller. It works well with kernel 2.6.18.2. But I have problem
>   with little entropy available and I'm not sure if one of the reasons
>   is that AHCI driver does not use add_disk_randomness() to contribute
>   to the kernel entropy pool.
> 
>   I'm not very skilled on kernel "hacking"... I tried finding where
>   this is called (it's defined in drivers/char/random.c and used for
>   example in Comapaq SmartArray driver in drivers/block/cciss.c). For
>   the SCSI part, I found it in scsi_lib.c, but I was not able to
>   determine if this is actually used by ahci/libata drivers.
> 
>   If not, would it be possible to implement it? I tried to figure out
>   where to call it by looking at cciss.c, but it seems that this is
>   totally different case, at least for me. I don't know where to add
>   it, because its parameter is struct gendisk *disk and didn't find it
>   used anywhere in ahci or libata.

Since ahci attaches its devices through the scsi layer,
add_disk_randomness() will get called from scsi_end_request() like for
any other scsi controller.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 0/4] ide: Break ide_lock to per-hwgroup lock

2005-09-07 Thread Jens Axboe

On Wed, Sep 07 2005, Ravikiran G Thirumalai wrote:
> On Wed, Sep 07, 2005 at 11:19:24AM +0200, Jens Axboe wrote:
> > On Tue, Sep 06 2005, Ravikiran G Thirumalai wrote:
> > > The following patchset breaks down the global ide_lock to per-hwgroup 
> > > lock.
> > > We have taken the following approach.
> > 
> > Curious, what is the point of this?
> > 
> 
> On smp machines with multiple ide interfaces, we take per-group lock instead
> of a global lock, there by breaking the lock to per-irq hwgroups.

I realize the theory behind breaking up locks, I'm just wondering about
this specific case. Please show actual contention data promoting this
specific case, we don't break up locks "just because".

I'm asking because I've never heard anyone complain about IDE lock
contention and a proper patch usually comes with analysis of why it is
needed.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 0/4] ide: Break ide_lock to per-hwgroup lock

2005-09-07 Thread Jens Axboe

On Tue, Sep 06 2005, Ravikiran G Thirumalai wrote:
> The following patchset breaks down the global ide_lock to per-hwgroup lock.
> We have taken the following approach.

Curious, what is the point of this?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: libata: clustering on or off?

2005-08-28 Thread Jens Axboe

On Sun, Aug 28 2005, Jeff Garzik wrote:
> Jens Axboe wrote:
> >On Sun, Aug 28 2005, Arjan van de Ven wrote:
> >
> >>On Sun, 2005-08-28 at 05:42 -0400, Jeff Garzik wrote:
> >>
> >>>The constant ATA_SHT_USE_CLUSTERING in include/linux/libata.h controls
> >>>the use of SCSI layer's use_clustering feature, for a great many libata
> >>>drivers.
> >>>
> >>>The current setup has clustering disabled, which in theory causes the
> >>>block layer to do less work, at the expense of a greater number of
> >>>scatter/gather table entries used.
> >>>
> >>>Any opinions WRT turning on clustering for libata?
> >>
> >>in 2.4 clustering was expensive due to a large number of checks that
> >>were done (basically the number of fragments got recounted a gazilion
> >>times). In 2.6 Jens fixed that afaik to make it basically free...
> >>at which point it's a win always.
> 
> >Yeah, it wont cost any extra cycles,
> 
> A simple grep for QUEUE_FLAG_CLUSTER-related code shows that it -does- 
> cost extra cycles.

Well yes, none is not true of course. But it's not a lot, like extra
iterations of the request mappings like it used to. So in by far the
most cases, it should be a win overall.

> >>Imo clustering on the driver level should announce driver capabilities.
> >>If clustering for some arch/kernel makes it slower, that should be
> >>decided at a midlayer level and not in each driver; eg the midlayer
> >>would chose to ignore the drivers capabilities.
> >>So .. my opinion would be that libata should announce the capability (it
> >>seems the code/hw can do it). 
> >
> >
> >Agree, we should just remove the ability to control clustering, as it
> >really overlaps with the segment settings anyways.
> 
> OK, I guess the consensus is to use clustering :)
> 
> We'll see if anything blows up in 2.6.14...

;-)

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: libata: clustering on or off?

2005-08-28 Thread Jens Axboe

On Sun, Aug 28 2005, Christoph Hellwig wrote:
> On Sun, Aug 28, 2005 at 04:20:19PM +0200, Jens Axboe wrote:
> > Agree, we should just remove the ability to control clustering, as it
> > really overlaps with the segment settings anyways.
> 
> What are we going to do with iscsi then?  It really doesn't like segments
> over a pages size.  Best thing would probably be to switch networking to
> use sg lists and dma_map_sg, but that's not a trivial task.

Limit the segment size, then. There are provisions for doing both length
and boundary limits, that should suffice.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: libata: clustering on or off?

2005-08-28 Thread Jens Axboe

On Sun, Aug 28 2005, Arjan van de Ven wrote:
> On Sun, 2005-08-28 at 05:42 -0400, Jeff Garzik wrote:
> > The constant ATA_SHT_USE_CLUSTERING in include/linux/libata.h controls
> > the use of SCSI layer's use_clustering feature, for a great many libata
> > drivers.
> > 
> > The current setup has clustering disabled, which in theory causes the
> > block layer to do less work, at the expense of a greater number of
> > scatter/gather table entries used.
> > 
> > Any opinions WRT turning on clustering for libata?
> 
> in 2.4 clustering was expensive due to a large number of checks that
> were done (basically the number of fragments got recounted a gazilion
> times). In 2.6 Jens fixed that afaik to make it basically free...
> at which point it's a win always.

Yeah, it wont cost any extra cycles, so there's no point in keeping it
turned off for that reason.

> Imo clustering on the driver level should announce driver capabilities.
> If clustering for some arch/kernel makes it slower, that should be
> decided at a midlayer level and not in each driver; eg the midlayer
> would chose to ignore the drivers capabilities.
> So .. my opinion would be that libata should announce the capability (it
> seems the code/hw can do it). 

Agree, we should just remove the ability to control clustering, as it
really overlaps with the segment settings anyways.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libata: error processing + rw 6 byte fix

2005-08-27 Thread Jens Axboe

On Sat, Aug 27 2005, Jeff Garzik wrote:
> Here is the patch I just checked in.

Looks perfect.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libata: error processing + rw 6 byte fix

2005-08-27 Thread Jens Axboe

On Sat, Aug 27 2005, Douglas Gilbert wrote:
> Jeff Garzik wrote:
> > Does the attached look OK to everybody?
> > 
> 
> Jeff,
> Yes.
> 
> And after an exchange with Jens, it would probably be
> safer to map transfer_length=0 for READ_10 and READ_16
> (as well as WRITE_10 and WRITE_16) to a nop on the ATA
> side. Otherwise an ATA disk may attempt to transfer 2**16
> sectors. [Only READ_6 and WRITE_6 have the quirky "0 means
> 256" definition, in the larger commands "0 means 0".]

Yes, that part needs to be added as well. Jeff, the below looks good for
the READ/WRITE_6 case.

> > if (scsicmd[0] == READ_6 || scsicmd[0] == WRITE_6) {
> > qc->nsect = tf->nsect = scsicmd[4];
> > +   if (!qc->nsect) {
> > +   qc->nsect = 256;
> > +   if (lba48)
> > +   tf->hob_nsect = 1;
> > +   }
> > +
> > tf->lbal = scsicmd[3];
> > tf->lbam = scsicmd[2];
> > tf->lbah = scsicmd[1] & 0x1f; /* mask out reserved bits */
> 
> 

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: PATCH: ide: ide-disk freeze support for hdaps

2005-08-25 Thread Jens Axboe

On Fri, Aug 26 2005, Yani Ioannou wrote:
> > Please make the interface accept number of seconds (as suggested by Jens)
> > and remove this module parameter. This way interface will be more flexible
> > and cleaner.  I really don't see any advantage in doing "echo 1 > ..." 
> > instead
> > of "echo x > ..." (Pavel, please explain).
> 
> Either way is pretty easy enough to implement. Note though that I'd
> expect the userspace app should thaw the device when danger is out of
> the way (the timeout is mainly there to ensure that the queue isn't
> frozen forever, and should probably be higher). Personally I don't
> have too much of an opinion either way though... what's the consensus?
> :).

Yes please, I don't understand why you would want a 0/1 interface
instead, when the timer-seconds method gives you the exact same ability
plus a way to control when to unfreeze...

> > +static struct timer_list freeze_timer =
> > +   TIMER_INITIALIZER(freeze_expire, 0, 0);
> > 
> > There needs to be a timer per device not a global one
> > (it works for a current special case of T42 but sooner
> >  or later we will hit this problem).
> 
> I was considering that, but I am confused as to whether each drive has
> it's own queue or not? (I really am a newbie to this stuff...). If so
> then yes there should be a per-device timer.

Each drive has its own queue.

> > queue handling should be done through block layer helpers
> > (as described in Jens' email) - we will need them for libata too.
> 
> Good point, I'll try to move as much as I can up to the block layer,
> it helps when it comes to implementing freeze for libata as you point
> out too.

That includes things like the timer as well, reuse the queue plugging
timer as I described in my initial posting on how to implement this.

> > At this time attribute can still be in use (because refcounting is done
> > on drive->gendev), you need to add "disk" class to ide-disk driver
> > (drivers/scsi/st.c looks like a good example how to do it).
> 
> I missed that completely, I'll look at changing it.
> 
> > IMO this should also be handled by block layer
> > which has all needed information, Jens?
> > 
> > While at it: I think that sysfs support should be moved to block layer 
> > (queue
> > attributes) and storage driver should only need to provide queue_freeze_fn
> > and queue_thaw_fn functions (similarly to cache flush support).  This should
> > be done now not later because this stuff is exposed to the user-space.
> 
> I was actually considering using a queue attribute originally, but in
> my indecision I decided to go with Jen's suggestion. A queue attribute
> does make sense in that the attribute primarily is there to freeze the
> queue, but it would also be performing the head park. Would a queue
> attribute be confusing because of that?

I fully agree with Bart here. The only stuff that should be ide special
is the actual command setup and completion check, since that is a
hardware property. libata will get a few little helpers for that as
well. The rest should be block layer implementation.

> >  * Sanity: don't accept a request that isn't a PM request
> >  * if we are currently power managed. This is very 
> > important as
> >  * blk_stop_queue() doesn't prevent the elv_next_request()
> > @@ -1661,6 +1671,9 @@ int ide_do_drive_cmd (ide_drive_t *drive
> > where = ELEVATOR_INSERT_FRONT;
> > rq->flags |= REQ_PREEMPT;
> > }
> > +   if (action == ide_next)
> > +   where = ELEVATOR_INSERT_FRONT;
> > +
> > __elv_add_request(drive->queue, rq, where, 0);
> > ide_do_request(hwgroup, IDE_NO_IRQ);
> > spin_unlock_irqrestore(&ide_lock, flags);
> > 
> > Why is this needed?
> 
> I think Jon discussed that in a previous thread, but basically
> although ide_next is documented in the comment for ide_do_drive_cmd,
> there isn't (as far as Jon or I could see) anything actually handling
> it. This patch is carried over from Jon's work and adds the code to
> handle ide_next by inserting the request at the front of the queue.

As per my previous mail, I will ack that bit.

> > Overall, very promising work!
> 
> Thanks :-), most of it is Jon's work, and Jen's suggestions though.

My name is Jens, not Jen :-)

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Hdaps-devel] Re: HDAPS, Need to park the head for real

2005-08-25 Thread Jens Axboe

On Thu, Aug 25 2005, Jon Escombe wrote:
> Alan Cox wrote:
> >@@ -1661,6 +1671,9 @@
> >where = ELEVATOR_INSERT_FRONT;
> >rq->flags |= REQ_PREEMPT;
> >}
> >+   if (action == ide_next)
> >+   where = ELEVATOR_INSERT_FRONT;
> >+
> >__elv_add_request(drive->queue, rq, where, 0);
> >ide_do_request(hwgroup, IDE_NO_IRQ);
> >spin_unlock_irqrestore(&ide_lock, flags);
> >
> >Also puzzles me- why is this needed ?
> 
> I wanted the park command to get in at the head of the queue (behind the 
> currently executing request).
> 
> Contrary to the comments for ide_do_drive_cmd(), ide_next didn't appear 
> to do anything to achieve this? At least from my initial testing before 
> I made this change - it could take a second or so for the park command 
> to be issued if the disk was busy

That part seems to have been lost, apparently. The above patch is
correct.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libata: error processing + rw 6 byte fix

2005-08-23 Thread Jens Axboe

On Tue, Aug 23 2005, Douglas Gilbert wrote:
> Jens Axboe wrote:
> > On Mon, Aug 22 2005, Douglas Gilbert wrote:
> > 
> >>if (scsicmd[0] == READ_6 || scsicmd[0] == WRITE_6) {
> >>-   qc->nsect = tf->nsect = scsicmd[4];
> >>+   if (scsicmd[4] == 0) {
> >>+   /*
> >>+* For READ_6 and WRITE_6 (only)
> >>+* transfer_len==0 -> 256 blocks !!
> >>+*/
> >>+   if (lba48) {
> >>+   tf->hob_nsect = 1;
> >>+   qc->nsect = 256;
> >>+   } else
> >>+   return 1;
> > 
> > 
> > This isn't quite right, for 28-bit lba a 0 sector value means 256
> > sectors to transfer as well. So just make that:
> > 
> > if (lba48) {
> > tf->hob_nsect = 1;
> > qc->nsect = 256;
> > }
> > 
> > /* continue */
> > 
> > and it should work fine. Similarly for 48-bit lba, 0 means 16^2 sectors.
> 
> Jens,
> Since for 28-bit lba a 0 sector value means 256 sectors
> do I need to check for the lba48 case at all? As proposed
> to Jeff is this ok (for READ_6 and WRITE_6):
> 
>if (scsicmd[4] == 0) {
>/*
> * For READ_6 and WRITE_6 (only)
> * transfer_len==0 -> 256 blocks !!
> */
>qc->nsect = 256;
>} else
>qc->nsect = scsicmd[4];
>tf->nsect = scsicmd[4];

This will break for lba48 devices, since if you have scsicmd[4] == 0, a
lba48 read/write will want to transfer 65536 sectors instead of the
intended 256.

Your qc->nsect logic is correct, but you need to set tf->hob_nsect 1
for lba48 if scsicmd[4] == 0 to correctly tell that command to transfer
256 sectors.

> Also I noticed while testing the original code with READ_6
> (sectors=0) that the device locked up (power cycle required).
> So given the point you make for 48-bit lba, 0 means 16^2
> sectors, then the READ_10 (sectors=0) and READ_16 (sectors=0)
> which are valid nops according to SBC-2 may also lock up
> in libata.

Try with the corrected sector counts, should work. I didn't check the
other READ_X/WRITE_X, so you should probably audit them as well :)

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libata: error processing + rw 6 byte fix

2005-08-23 Thread Jens Axboe

On Mon, Aug 22 2005, Douglas Gilbert wrote:
>   if (scsicmd[0] == READ_6 || scsicmd[0] == WRITE_6) {
> - qc->nsect = tf->nsect = scsicmd[4];
> + if (scsicmd[4] == 0) {
> + /*
> +  * For READ_6 and WRITE_6 (only)
> +  * transfer_len==0 -> 256 blocks !!
> +  */
> + if (lba48) {
> + tf->hob_nsect = 1;
> + qc->nsect = 256;
> + } else
> + return 1;

This isn't quite right, for 28-bit lba a 0 sector value means 256
sectors to transfer as well. So just make that:

if (lba48) {
tf->hob_nsect = 1;
qc->nsect = 256;
}

/* continue */

and it should work fine. Similarly for 48-bit lba, 0 means 16^2 sectors.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [parisc-linux] Re: Why toggle_bounce only for disks?

2005-08-19 Thread Jens Axboe

On Fri, Aug 19 2005, James Bottomley wrote:
> On Fri, 2005-08-19 at 12:21 +0200, Jens Axboe wrote:
> > Because not bouncing is a performance optimization and I only did the
> > work on ide-cd to allow it. Your patch breaks ide-cd on highmem i386
> > machines, so it's not acceptable.
> > 
> > Tells us more about this crash instead, I'm pretty sure you are working
> > around another issue (your io-mmu code, is it hardware or software?)
> > somehwere with this patch.
> 
> OK, so the particular fix is wrong; but the logic in ide_toggle_bounce()
> is also incorrec.  Our problem is not that we don't want to bounce
> highmem in ide-cd, it's that this is a parisc system with an IOMMU and
> doesn't have any highmem to begin with.
> 
> we need ide_toggle_bounce to return BLK_BOUNCE_ANY always if
> PCI_DMA_BUS_IS_PHYS is not set.
> 
> James
> 
> diff --git a/drivers/ide/ide-lib.c b/drivers/ide/ide-lib.c
> --- a/drivers/ide/ide-lib.c
> +++ b/drivers/ide/ide-lib.c
> @@ -410,10 +410,10 @@ void ide_toggle_bounce(ide_drive_t *driv
>  {
>   u64 addr = BLK_BOUNCE_HIGH; /* dma64_addr_t */
>  
> - if (on && drive->media == ide_disk) {
> - if (!PCI_DMA_BUS_IS_PHYS)
> - addr = BLK_BOUNCE_ANY;
> - else if (HWIF(drive)->pci_dev)
> + if (!PCI_DMA_BUS_IS_PHYS)
> + addr = BLK_BOUNCE_ANY;
> + else if (on && drive->media == ide_disk) {
> + if (HWIF(drive)->pci_dev)
>   addr = HWIF(drive)->pci_dev->dma_mask;
>   }

That looks more correct, indeed. It's really too convoluted for drivers,
the defines/setup could do with a little work-over.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

1 2 >

1 - 100 of 153 matches

Mail list logo