Re: [RFC][PATCH] libata ATAPI alignment

2005-08-06 Thread Tejun Heo
On Fri, Jul 29, 2005 at 01:06:54AM -0400, Jeff Garzik wrote:
> 
> So, one thing that's terribly ugly about SATA ATAPI is that we need to
> pad DMA transfers to the next 32-bit boundary, if the length is not
> evenly divisible by 4.
> 
> Messing with the scatterlist to accomplish this is terribly ugly
> no matter how you slice it.  One way would be to create my own
> scatterlist, via memcpy and then manual labor.  Another way would be
> to special case a pad buffer, appending it onto the end of various
> scatterlist code.
> 
> Complicating matters, we currently must support two methods of data
> buffer submission:  a single kernel virtual address, or a struct
> scatterlist.
> 
> Review is requested by any and all parties, as well as suggestions for
> a prettier approach.
> 
> This is one of the last steps needed to get ATAPI going.
> 

 Hello, Jeff, Jens & Alan.

 I've rewritten the patch to fix some bugs and make it a bit (IMHO)
simpler to use.

 Bug fixes...

 * When copying a sg, original implementation just kmap'ed sg->page
   which can cause trouble as a sg can span more than a page.

 * In ata_sg_clean(), the original implementation accesses last sg by
   indexing w/ (qc->n_elem - 1).  This is incorrect as qc->n_elem
   could have been shrunk by dma_map_sg() in ata_sg_setup().

 * In the original code (before Jeff's patch), sata_sx4 used
   sg[i].legnth to calculate total_len.  This is wrong for the same
   reason as above and Jeff's patch fixes it.  I separated out this
   fix just to clarify.

 Changes...

 * Instead of adding pad sg handling to each fill_sg function,
   ata_for_each_sg() macro is added.  Normal sg entries and
   qc->pad_sgent are setup by sg_setup routines and fill_sg functions
   can just iterate w/ ata_for_each_sg() without caring about padding.

 * hw_max_segments is automatically decremented in slave_config if
   attached device is ATAPI.

 Questions/suggestions...

 * I didn't include AHCI_MAX_SG change as it looked a bit more out of
   place w/ other changes to ahci gone.  Also, I'm curious about how
   meaningful increasing AHCI_MAX_SG is.  We're currently setting
   max_phys_segments to LIBATA_MAX_PRD, which is 128, and AFAIK max hw
   segments higher than phys segments is meaningless (phys segs are
   merged into hw segs and one phys segment cannot be splitted into
   two hw segs).  Am I missing something here?

 * About core routines/callbacks.  Currently, libata-core file
   contains actual libata core routines and all helper functions for
   taskfile controllers.  ata_ prefix is also shared by both function
   categories.  This is a bit confusing, IMO.

   I think ata_port_start should allocate stuff necessary for libata
   core and call ->port_start callback and similary for ata_port_stop,
   and current ata_port_start/stop need to be renamed to something
   like ata_tf_port_start/stop, such that we don't have to allocate
   data structures needed by libata core in specific drivers (ahci in
   this case).

 I've tested both sg and non-sg paths with sg_test_rwbuf.  When
testing sg path, I've commented out sg.c:L1951 (sg_build_indirect:L10)
to prevent it padding transfer size to 512byte boundary.  Read/write
padding in both paths work.

 Two patches will follow this mail.  The first one fixes sata_sx4 as
mentioned above and the second implements atapi align.

 Thanks.

--
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] libata ATAPI alignment

2005-08-06 Thread Tejun Heo
On Fri, Jul 29, 2005 at 01:06:54AM -0400, Jeff Garzik wrote:
 
 So, one thing that's terribly ugly about SATA ATAPI is that we need to
 pad DMA transfers to the next 32-bit boundary, if the length is not
 evenly divisible by 4.
 
 Messing with the scatterlist to accomplish this is terribly ugly
 no matter how you slice it.  One way would be to create my own
 scatterlist, via memcpy and then manual labor.  Another way would be
 to special case a pad buffer, appending it onto the end of various
 scatterlist code.
 
 Complicating matters, we currently must support two methods of data
 buffer submission:  a single kernel virtual address, or a struct
 scatterlist.
 
 Review is requested by any and all parties, as well as suggestions for
 a prettier approach.
 
 This is one of the last steps needed to get ATAPI going.
 

 Hello, Jeff, Jens  Alan.

 I've rewritten the patch to fix some bugs and make it a bit (IMHO)
simpler to use.

 Bug fixes...

 * When copying a sg, original implementation just kmap'ed sg-page
   which can cause trouble as a sg can span more than a page.

 * In ata_sg_clean(), the original implementation accesses last sg by
   indexing w/ (qc-n_elem - 1).  This is incorrect as qc-n_elem
   could have been shrunk by dma_map_sg() in ata_sg_setup().

 * In the original code (before Jeff's patch), sata_sx4 used
   sg[i].legnth to calculate total_len.  This is wrong for the same
   reason as above and Jeff's patch fixes it.  I separated out this
   fix just to clarify.

 Changes...

 * Instead of adding pad sg handling to each fill_sg function,
   ata_for_each_sg() macro is added.  Normal sg entries and
   qc-pad_sgent are setup by sg_setup routines and fill_sg functions
   can just iterate w/ ata_for_each_sg() without caring about padding.

 * hw_max_segments is automatically decremented in slave_config if
   attached device is ATAPI.

 Questions/suggestions...

 * I didn't include AHCI_MAX_SG change as it looked a bit more out of
   place w/ other changes to ahci gone.  Also, I'm curious about how
   meaningful increasing AHCI_MAX_SG is.  We're currently setting
   max_phys_segments to LIBATA_MAX_PRD, which is 128, and AFAIK max hw
   segments higher than phys segments is meaningless (phys segs are
   merged into hw segs and one phys segment cannot be splitted into
   two hw segs).  Am I missing something here?

 * About core routines/callbacks.  Currently, libata-core file
   contains actual libata core routines and all helper functions for
   taskfile controllers.  ata_ prefix is also shared by both function
   categories.  This is a bit confusing, IMO.

   I think ata_port_start should allocate stuff necessary for libata
   core and call -port_start callback and similary for ata_port_stop,
   and current ata_port_start/stop need to be renamed to something
   like ata_tf_port_start/stop, such that we don't have to allocate
   data structures needed by libata core in specific drivers (ahci in
   this case).

 I've tested both sg and non-sg paths with sg_test_rwbuf.  When
testing sg path, I've commented out sg.c:L1951 (sg_build_indirect:L10)
to prevent it padding transfer size to 512byte boundary.  Read/write
padding in both paths work.

 Two patches will follow this mail.  The first one fixes sata_sx4 as
mentioned above and the second implements atapi align.

 Thanks.

--
tejun
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] libata ATAPI alignment

2005-08-02 Thread Jeff Garzik

Jens Axboe wrote:

On Fri, Jul 29 2005, Jeff Garzik wrote:

diff --git a/drivers/scsi/ahci.c b/drivers/scsi/ahci.c
--- a/drivers/scsi/ahci.c
+++ b/drivers/scsi/ahci.c
@@ -44,7 +44,7 @@

enum {
AHCI_PCI_BAR= 5,
-   AHCI_MAX_SG = 168, /* hardware max is 64K */
+   AHCI_MAX_SG = 300, /* hardware max is 64K */
AHCI_DMA_BOUNDARY   = 0x,
AHCI_USE_CLUSTERING = 0,
AHCI_CMD_SLOT_SZ= 32 * 32,



Reasoning? I agree, just wondering... How big is the allocated area now?


168 kept the entire allocated DMA area to 4K.

300 increases that to K.  ;-)

Jeff


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] libata ATAPI alignment

2005-08-02 Thread Jens Axboe
On Fri, Jul 29 2005, Jeff Garzik wrote:
> 
> So, one thing that's terribly ugly about SATA ATAPI is that we need to
> pad DMA transfers to the next 32-bit boundary, if the length is not
> evenly divisible by 4.
> 
> Messing with the scatterlist to accomplish this is terribly ugly
> no matter how you slice it.  One way would be to create my own
> scatterlist, via memcpy and then manual labor.  Another way would be
> to special case a pad buffer, appending it onto the end of various
> scatterlist code.

It's not pretty, but I think it's the only solution currently.

> Complicating matters, we currently must support two methods of data
> buffer submission:  a single kernel virtual address, or a struct
> scatterlist.

Fairly soon the !use_sg case will be gone, at least coming from SCSI. I
hope we can completely get away from the virtual address + length for
any remaining cases, just making it a single entry sg list.

> 
> Review is requested by any and all parties, as well as suggestions for
> a prettier approach.
> 
> This is one of the last steps needed to get ATAPI going.
> 
> 
> 
> diff --git a/drivers/scsi/ahci.c b/drivers/scsi/ahci.c
> --- a/drivers/scsi/ahci.c
> +++ b/drivers/scsi/ahci.c
> @@ -44,7 +44,7 @@
>  
>  enum {
>   AHCI_PCI_BAR= 5,
> - AHCI_MAX_SG = 168, /* hardware max is 64K */
> + AHCI_MAX_SG = 300, /* hardware max is 64K */
>   AHCI_DMA_BOUNDARY   = 0x,
>   AHCI_USE_CLUSTERING = 0,
>   AHCI_CMD_SLOT_SZ= 32 * 32,

Reasoning? I agree, just wondering... How big is the allocated area now?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] libata ATAPI alignment

2005-08-02 Thread Jens Axboe
On Fri, Jul 29 2005, Jeff Garzik wrote:
 
 So, one thing that's terribly ugly about SATA ATAPI is that we need to
 pad DMA transfers to the next 32-bit boundary, if the length is not
 evenly divisible by 4.
 
 Messing with the scatterlist to accomplish this is terribly ugly
 no matter how you slice it.  One way would be to create my own
 scatterlist, via memcpy and then manual labor.  Another way would be
 to special case a pad buffer, appending it onto the end of various
 scatterlist code.

It's not pretty, but I think it's the only solution currently.

 Complicating matters, we currently must support two methods of data
 buffer submission:  a single kernel virtual address, or a struct
 scatterlist.

Fairly soon the !use_sg case will be gone, at least coming from SCSI. I
hope we can completely get away from the virtual address + length for
any remaining cases, just making it a single entry sg list.

 
 Review is requested by any and all parties, as well as suggestions for
 a prettier approach.
 
 This is one of the last steps needed to get ATAPI going.
 
 
 
 diff --git a/drivers/scsi/ahci.c b/drivers/scsi/ahci.c
 --- a/drivers/scsi/ahci.c
 +++ b/drivers/scsi/ahci.c
 @@ -44,7 +44,7 @@
  
  enum {
   AHCI_PCI_BAR= 5,
 - AHCI_MAX_SG = 168, /* hardware max is 64K */
 + AHCI_MAX_SG = 300, /* hardware max is 64K */
   AHCI_DMA_BOUNDARY   = 0x,
   AHCI_USE_CLUSTERING = 0,
   AHCI_CMD_SLOT_SZ= 32 * 32,

Reasoning? I agree, just wondering... How big is the allocated area now?

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] libata ATAPI alignment

2005-08-02 Thread Jeff Garzik

Jens Axboe wrote:

On Fri, Jul 29 2005, Jeff Garzik wrote:

diff --git a/drivers/scsi/ahci.c b/drivers/scsi/ahci.c
--- a/drivers/scsi/ahci.c
+++ b/drivers/scsi/ahci.c
@@ -44,7 +44,7 @@

enum {
AHCI_PCI_BAR= 5,
-   AHCI_MAX_SG = 168, /* hardware max is 64K */
+   AHCI_MAX_SG = 300, /* hardware max is 64K */
AHCI_DMA_BOUNDARY   = 0x,
AHCI_USE_CLUSTERING = 0,
AHCI_CMD_SLOT_SZ= 32 * 32,



Reasoning? I agree, just wondering... How big is the allocated area now?


168 kept the entire allocated DMA area to 4K.

300 increases that to I don't careK.  ;-)

Jeff


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] libata ATAPI alignment

2005-07-29 Thread Alan Cox
On Gwe, 2005-07-29 at 01:06 -0400, Jeff Garzik wrote:
> So, one thing that's terribly ugly about SATA ATAPI is that we need to
> pad DMA transfers to the next 32-bit boundary, if the length is not
> evenly divisible by 4.

Looks good and avoids the special case leaking into the core code.

> Complicating matters, we currently must support two methods of data
> buffer submission:  a single kernel virtual address, or a struct
> scatterlist.

For the moment - also you turn the single buffer into a one entry sglist
so its not to bad.

> Review is requested by any and all parties, as well as suggestions for
> a prettier approach.

I'd pull the code into seperate functions but thats my only real
comment.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] libata ATAPI alignment

2005-07-29 Thread Alan Cox
On Gwe, 2005-07-29 at 01:06 -0400, Jeff Garzik wrote:
 So, one thing that's terribly ugly about SATA ATAPI is that we need to
 pad DMA transfers to the next 32-bit boundary, if the length is not
 evenly divisible by 4.

Looks good and avoids the special case leaking into the core code.

 Complicating matters, we currently must support two methods of data
 buffer submission:  a single kernel virtual address, or a struct
 scatterlist.

For the moment - also you turn the single buffer into a one entry sglist
so its not to bad.

 Review is requested by any and all parties, as well as suggestions for
 a prettier approach.

I'd pull the code into seperate functions but thats my only real
comment.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC][PATCH] libata ATAPI alignment

2005-07-28 Thread Jeff Garzik

So, one thing that's terribly ugly about SATA ATAPI is that we need to
pad DMA transfers to the next 32-bit boundary, if the length is not
evenly divisible by 4.

Messing with the scatterlist to accomplish this is terribly ugly
no matter how you slice it.  One way would be to create my own
scatterlist, via memcpy and then manual labor.  Another way would be
to special case a pad buffer, appending it onto the end of various
scatterlist code.

Complicating matters, we currently must support two methods of data
buffer submission:  a single kernel virtual address, or a struct
scatterlist.

Review is requested by any and all parties, as well as suggestions for
a prettier approach.

This is one of the last steps needed to get ATAPI going.



diff --git a/drivers/scsi/ahci.c b/drivers/scsi/ahci.c
--- a/drivers/scsi/ahci.c
+++ b/drivers/scsi/ahci.c
@@ -44,7 +44,7 @@
 
 enum {
AHCI_PCI_BAR= 5,
-   AHCI_MAX_SG = 168, /* hardware max is 64K */
+   AHCI_MAX_SG = 300, /* hardware max is 64K */
AHCI_DMA_BOUNDARY   = 0x,
AHCI_USE_CLUSTERING = 0,
AHCI_CMD_SLOT_SZ= 32 * 32,
@@ -197,7 +197,7 @@ static Scsi_Host_Template ahci_sht = {
.eh_strategy_handler= ata_scsi_error,
.can_queue  = ATA_DEF_QUEUE,
.this_id= ATA_SHT_THIS_ID,
-   .sg_tablesize   = AHCI_MAX_SG,
+   .sg_tablesize   = AHCI_MAX_SG - 1,
.max_sectors= ATA_MAX_SECTORS,
.cmd_per_lun= ATA_SHT_CMD_PER_LUN,
.emulated   = ATA_SHT_EMULATED,
@@ -313,8 +313,15 @@ static int ahci_port_start(struct ata_po
return -ENOMEM;
memset(pp, 0, sizeof(*pp));
 
+   ap->pad = dma_alloc_coherent(dev, ATA_DMA_PAD_BUF_SZ, >pad_dma, 
GFP_KERNEL);
+   if (!ap->pad) {
+   kfree(pp);
+   return -ENOMEM;
+   }
+
mem = dma_alloc_coherent(dev, AHCI_PORT_PRIV_DMA_SZ, _dma, 
GFP_KERNEL);
if (!mem) {
+   dma_free_coherent(dev, ATA_DMA_PAD_BUF_SZ, ap->pad, 
ap->pad_dma);
kfree(pp);
return -ENOMEM;
}
@@ -390,6 +397,7 @@ static void ahci_port_stop(struct ata_po
ap->private_data = NULL;
dma_free_coherent(dev, AHCI_PORT_PRIV_DMA_SZ,
  pp->cmd_slot, pp->cmd_slot_dma);
+   dma_free_coherent(dev, ATA_DMA_PAD_BUF_SZ, ap->pad, ap->pad_dma);
kfree(pp);
 }
 
@@ -474,7 +482,8 @@ static void ahci_tf_read(struct ata_port
 
 static void ahci_fill_sg(struct ata_queued_cmd *qc)
 {
-   struct ahci_port_priv *pp = qc->ap->private_data;
+   struct ata_port *ap = qc->ap;
+   struct ahci_port_priv *pp = ap->private_data;
unsigned int i;
 
VPRINTK("ENTER\n");
@@ -493,6 +502,24 @@ static void ahci_fill_sg(struct ata_queu
pp->cmd_tbl_sg[i].addr_hi = cpu_to_le32((addr >> 16) >> 16);
pp->cmd_tbl_sg[i].flags_size = cpu_to_le32(sg_len - 1);
}
+
+   /* if we added a small buffer, to pad xfer to next 32-bit bound,
+* add it to the s/g list here
+*/
+   if (qc->flags & ATA_QCFLAG_PADDED) {
+   dma_addr_t pad_addr = ap->pad_dma + (qc->tag * ATA_DMA_PAD_SZ);
+   u32 len;
+
+   /* fixup last s/g entry */
+   len = le32_to_cpu(pp->cmd_tbl_sg[i - 1].flags_size);
+   pp->cmd_tbl_sg[i - 1].flags_size =
+   cpu_to_le32(len - qc->pad_len);
+
+   /* append pad buffer to s/g list */
+   pp->cmd_tbl_sg[i].addr = cpu_to_le32(pad_addr & 0x);
+   pp->cmd_tbl_sg[i].addr_hi = cpu_to_le32((pad_addr >> 16) >> 16);
+   pp->cmd_tbl_sg[i].flags_size = cpu_to_le32(ATA_DMA_PAD_SZ - 1);
+   }
 }
 
 static void ahci_qc_prep(struct ata_queued_cmd *qc)
@@ -501,13 +528,16 @@ static void ahci_qc_prep(struct ata_queu
struct ahci_port_priv *pp = ap->private_data;
u32 opts;
const u32 cmd_fis_len = 5; /* five dwords */
+   unsigned int n_elem = qc->n_elem;
 
/*
 * Fill in command slot information (currently only one slot,
 * slot 0, is currently since we don't do queueing)
 */
 
-   opts = (qc->n_elem << 16) | cmd_fis_len;
+   if (qc->flags & ATA_QCFLAG_PADDED)
+   n_elem++;
+   opts = (n_elem << 16) | cmd_fis_len;
if (qc->tf.flags & ATA_TFLAG_WRITE)
opts |= AHCI_CMD_WRITE;
if (is_atapi_taskfile(>tf))
diff --git a/drivers/scsi/libata-core.c b/drivers/scsi/libata-core.c
--- a/drivers/scsi/libata-core.c
+++ b/drivers/scsi/libata-core.c
@@ -2144,6 +2144,8 @@ static void ata_sg_clean(struct ata_queu
struct ata_port *ap = qc->ap;
struct scatterlist *sg = qc->sg;
int dir = qc->dma_dir;
+   unsigned int copy_pad = 0;
+   void *pad_buf = NULL;
 
assert(qc->flags & 

[RFC][PATCH] libata ATAPI alignment

2005-07-28 Thread Jeff Garzik

So, one thing that's terribly ugly about SATA ATAPI is that we need to
pad DMA transfers to the next 32-bit boundary, if the length is not
evenly divisible by 4.

Messing with the scatterlist to accomplish this is terribly ugly
no matter how you slice it.  One way would be to create my own
scatterlist, via memcpy and then manual labor.  Another way would be
to special case a pad buffer, appending it onto the end of various
scatterlist code.

Complicating matters, we currently must support two methods of data
buffer submission:  a single kernel virtual address, or a struct
scatterlist.

Review is requested by any and all parties, as well as suggestions for
a prettier approach.

This is one of the last steps needed to get ATAPI going.



diff --git a/drivers/scsi/ahci.c b/drivers/scsi/ahci.c
--- a/drivers/scsi/ahci.c
+++ b/drivers/scsi/ahci.c
@@ -44,7 +44,7 @@
 
 enum {
AHCI_PCI_BAR= 5,
-   AHCI_MAX_SG = 168, /* hardware max is 64K */
+   AHCI_MAX_SG = 300, /* hardware max is 64K */
AHCI_DMA_BOUNDARY   = 0x,
AHCI_USE_CLUSTERING = 0,
AHCI_CMD_SLOT_SZ= 32 * 32,
@@ -197,7 +197,7 @@ static Scsi_Host_Template ahci_sht = {
.eh_strategy_handler= ata_scsi_error,
.can_queue  = ATA_DEF_QUEUE,
.this_id= ATA_SHT_THIS_ID,
-   .sg_tablesize   = AHCI_MAX_SG,
+   .sg_tablesize   = AHCI_MAX_SG - 1,
.max_sectors= ATA_MAX_SECTORS,
.cmd_per_lun= ATA_SHT_CMD_PER_LUN,
.emulated   = ATA_SHT_EMULATED,
@@ -313,8 +313,15 @@ static int ahci_port_start(struct ata_po
return -ENOMEM;
memset(pp, 0, sizeof(*pp));
 
+   ap-pad = dma_alloc_coherent(dev, ATA_DMA_PAD_BUF_SZ, ap-pad_dma, 
GFP_KERNEL);
+   if (!ap-pad) {
+   kfree(pp);
+   return -ENOMEM;
+   }
+
mem = dma_alloc_coherent(dev, AHCI_PORT_PRIV_DMA_SZ, mem_dma, 
GFP_KERNEL);
if (!mem) {
+   dma_free_coherent(dev, ATA_DMA_PAD_BUF_SZ, ap-pad, 
ap-pad_dma);
kfree(pp);
return -ENOMEM;
}
@@ -390,6 +397,7 @@ static void ahci_port_stop(struct ata_po
ap-private_data = NULL;
dma_free_coherent(dev, AHCI_PORT_PRIV_DMA_SZ,
  pp-cmd_slot, pp-cmd_slot_dma);
+   dma_free_coherent(dev, ATA_DMA_PAD_BUF_SZ, ap-pad, ap-pad_dma);
kfree(pp);
 }
 
@@ -474,7 +482,8 @@ static void ahci_tf_read(struct ata_port
 
 static void ahci_fill_sg(struct ata_queued_cmd *qc)
 {
-   struct ahci_port_priv *pp = qc-ap-private_data;
+   struct ata_port *ap = qc-ap;
+   struct ahci_port_priv *pp = ap-private_data;
unsigned int i;
 
VPRINTK(ENTER\n);
@@ -493,6 +502,24 @@ static void ahci_fill_sg(struct ata_queu
pp-cmd_tbl_sg[i].addr_hi = cpu_to_le32((addr  16)  16);
pp-cmd_tbl_sg[i].flags_size = cpu_to_le32(sg_len - 1);
}
+
+   /* if we added a small buffer, to pad xfer to next 32-bit bound,
+* add it to the s/g list here
+*/
+   if (qc-flags  ATA_QCFLAG_PADDED) {
+   dma_addr_t pad_addr = ap-pad_dma + (qc-tag * ATA_DMA_PAD_SZ);
+   u32 len;
+
+   /* fixup last s/g entry */
+   len = le32_to_cpu(pp-cmd_tbl_sg[i - 1].flags_size);
+   pp-cmd_tbl_sg[i - 1].flags_size =
+   cpu_to_le32(len - qc-pad_len);
+
+   /* append pad buffer to s/g list */
+   pp-cmd_tbl_sg[i].addr = cpu_to_le32(pad_addr  0x);
+   pp-cmd_tbl_sg[i].addr_hi = cpu_to_le32((pad_addr  16)  16);
+   pp-cmd_tbl_sg[i].flags_size = cpu_to_le32(ATA_DMA_PAD_SZ - 1);
+   }
 }
 
 static void ahci_qc_prep(struct ata_queued_cmd *qc)
@@ -501,13 +528,16 @@ static void ahci_qc_prep(struct ata_queu
struct ahci_port_priv *pp = ap-private_data;
u32 opts;
const u32 cmd_fis_len = 5; /* five dwords */
+   unsigned int n_elem = qc-n_elem;
 
/*
 * Fill in command slot information (currently only one slot,
 * slot 0, is currently since we don't do queueing)
 */
 
-   opts = (qc-n_elem  16) | cmd_fis_len;
+   if (qc-flags  ATA_QCFLAG_PADDED)
+   n_elem++;
+   opts = (n_elem  16) | cmd_fis_len;
if (qc-tf.flags  ATA_TFLAG_WRITE)
opts |= AHCI_CMD_WRITE;
if (is_atapi_taskfile(qc-tf))
diff --git a/drivers/scsi/libata-core.c b/drivers/scsi/libata-core.c
--- a/drivers/scsi/libata-core.c
+++ b/drivers/scsi/libata-core.c
@@ -2144,6 +2144,8 @@ static void ata_sg_clean(struct ata_queu
struct ata_port *ap = qc-ap;
struct scatterlist *sg = qc-sg;
int dir = qc-dma_dir;
+   unsigned int copy_pad = 0;
+   void *pad_buf = NULL;
 
assert(qc-flags  ATA_QCFLAG_DMAMAP);
assert(sg != NULL);
@@ -2153,11