Re: [RFC] libata-scsi: introducing SANITIZE translation

2016-10-26 Thread Mark Lord

On 16-07-11 02:35 AM, Tom Yan wrote:

I don't suppose there would be any problem doing it in userspace /
with ATA PASS-THROUGH anyway.

..

On 8 July 2016 at 17:29, James Bottomley
 wrote:

..

Not really.  The point is that you've proposed something as an addition
to the kernel that can also be done in userspace.  Checking if it can
work easily there is like a barrier to entry.  If it works, then fine,
we're done.  If it throws up problems then we reconsider the kernel
route.


hdparm has full support for the SANITIZE commands in userspace.

-ml
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-04 Thread Mark Lord
On 14-01-03 10:40 AM, walt wrote:
> On 01/02/2014 11:15 AM, Sarah Sharp wrote:
>> On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
>>> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
>>>> 3.12-stable review patch.  If anyone has any objections, please let me 
>>>> know.
>>>>
>>>> --
>>>>
>>>> From: David Laight 
>>>>
>>>> commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
>>>>
>>>> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link 
>>>> TRB
>>>> can only occur at a boundary between underlying USB frames (512 bytes for
>>>> high speed devices).
>>>>
>>>> If this isn't done the USB frames aren't formatted correctly and, for 
>>>> example,
>>>> the USB3 ethernet ax88179_178a card will stop sending...
>>>
>>>
>>> Unfortunately this patch causes a regression when copying large files to my
>>> outboard USB3 drive. (Nothing at all to do with networking.)
> 
>> Do you have CONFIG_USB_DEBUG turned on for 3.13?  If so, you should see
>> dmesg output from this statement shortly before your drive fails:
>>
>> if (num_trbs >= TRBS_PER_SEGMENT) {
>>  xhci_err(xhci, "Too many fragments %d, max %d\n",
>>  num_trbs, TRBS_PER_SEGMENT - 1);
>>  return -ENOMEM;
>> }
> 
> Well, the answers depend on whether the usb3 drive uses logical volumes or not
> (lvm2), which I can't explain.  What I've described so far is with lvm2.
> 
> When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect -- the
> console prints two or three lines stating that the ext4 journal has quit and
> the drive is remounted ro.  That particular drive stays wedged until the next
> reboot, but no other ill effects to the system.
> 
> OTOH, when I put a disk with just an ordinary ext4 partition in the usb3 dock,
> (no logical volumes) the copy failure becomes catastrophic, with kernel panic
> messages, leaving the system unresponsive and needing a hard reset to recover.
> 
> I also tried your other suggestion:
> 
> diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> index 4265b48..1a6a43d 100644
> --- a/drivers/usb/host/xhci.c
> +++ b/drivers/usb/host/xhci.c
> @@ -4714,7 +4714,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, 
> xhci_get_quirks_t get_quirks)
> int retval;
>  
> /* Accept arbitrarily long scatter-gather lists */
> -   hcd->self.sg_tablesize = ~0;
> +   hcd->self.sg_tablesize = 31;
>  
> /* support to build packet from discontinuous buffers */
> hcd->self.no_sg_constraint = 1;
> 
> Sadly it didn't fix the problem.  Did I get the patch right?


That sounds almost as if the old version is still being loaded/run,
possibly from the initramfs image?

-- 
Mark Lord
Real-Time Remedies Inc.
ml...@pobox.com
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-02 Thread Mark Lord
On 14-01-02 02:15 PM, Sarah Sharp wrote:
> On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
..
>> Unfortunately this patch causes a regression when copying large files to my
>> outboard USB3 drive. (Nothing at all to do with networking.)
>>
>> When I try to copy a large (20GB) file to the USB3 drive, the copy dies after
>> about 7GB, the ext4 journal aborts and the drive is remounted read-only.
>>
>> This bug is 100% reproducible (always pretty close to 7GB) and reverting this
>> patch completely fixes the problem.
> 
> Ok, I had feared that would be a consequence of this patch.  I think the
> problem is that the usb-storage driver submitted an URB with more
> scatter-gather entries than would fit on the ring segment, the xHCI
> driver rejected the URB with -ENOMEM, and the SCSI core eventually gave
> up on the SCSI command.


Is there not a block layer / scheduler tunable for max sg entries or something?
-- 
Mark Lord
Real-Time Remedies Inc.
ml...@pobox.com
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] sd: don't bother spinning up disks on resume

2013-11-20 Thread Mark Lord
On 13-11-18 10:54 AM, Phillip Susi wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 11/17/2013 8:06 PM, Phillip Susi wrote:
>> I don't see anything particularly inefficient about issuing the
>> command in the eh path ( in fact, libata always uses the eh path to
>> handle power management ).  You can of course, use the
>> manage_start_stop flag to have the disk start on resume if you
>> really want.
> 
> Yikes, I think I see now why scsi eh is so bad: the entire host, not
> just the device is quiesced.
> 
> I suppose now the question is how to issue the START STOP UNIT from
> sd_prep_fn()?  I gather you can't just call scsi_execute_req() from
> there?  Is there maybe a way to insert a START STOP UNIT request into
> the queue ahead of the current request being prepped?  Can
> sd_prep_fn() be called concurrently on multiple cpus?  Hrm.. what
> about TCQ?  Would it be bad if the original request were queued ( to
> the drive ) before the START STOP UNIT command completed?

Yeah, solve that, and you're golden.
I totally understand the motivation for this patch,
and would love to have it working on my machines here too.

But it may have to be some kind of sysfs optional setting,
defaulting to current behaviour, to keep the server keepers happy.

Cheers


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] sd: don't bother spinning up disks on resume

2013-11-15 Thread Mark Lord
On 13-11-07 01:21 PM, Douglas Gilbert wrote:
> On 13-11-06 08:57 PM, Phillip Susi wrote:
>> Don't bother forcing disks to spin up on resume, as they
>> will do so automatically when accessed, and forcing them
>> to spin up slows down the resume.  Add a second bit to the
>> manage_start_stop flag to restore the previous behavior.
> 
> SCSI disks when in STOP state do not spin up "automatically
> when accessed".
..

Ditto for many SATA disks with "power up in standby" enabled.


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-11 Thread Mark Lord
On 13-10-11 04:41 AM, Alexander Gordeev wrote:
> On Thu, Oct 10, 2013 at 07:17:18PM -0400, Mark Lord wrote:
>> Just to help us all understand "the loop" issue..
>>
>> Here's an example of driver code which uses the existing MSI-X interfaces,
>> for a device which can work with either 16, 8, 4, 2, or 1 MSI-X interrupt.
>> This is from a new driver I'm working on right now:
..
> Now, this is a loop-free alternative:
> 
> static int xx_alloc_msix_irqs(struct xx_dev *dev, int nvec)
> {
>   nvec = roundup_pow_of_two(nvec);/* assume 0 > nvec <= 16 */
> 
>   xx_disable_all_irqs(dev);
> 
>   pci_lock_msi(dev->pdev);
> 
>   rc = pci_get_msix_limit(dev->pdev, nvec);
>   if (rc < 0)
>   goto err;
> 
>   nvec = min(nvec, rc);   /* if limit is more than requested */
>   nvec = rounddown_pow_of_two(nvec);  /* (a) */
> 
>   xx_prep_for_msix_vectors(dev, nvec);
> 
>   rc = pci_enable_msix(dev->pdev, dev->irqs, nvec);   /* (b)  */
>   if (rc < 0)
>   goto err;
> 
>   pci_unlock_msi(dev->pdev);
> 
>   dev->num_vectors = nvec;/* (b) */
>   return 0;
> 
> err:
>   pci_unlock_msi(dev->pdev);
> 
> kerr(dev->name, "pci_enable_msix() failed, err=%d", rc);
> dev->num_vectors = 0;
> return rc;
> }

That would still need a loop, to handle the natural race between
the calls to pci_get_msix_limit() and pci_enable_msix() -- the driver and device
can and should fall back to a smaller number of vectors when pci_enable_msix() 
fails.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-10 Thread Mark Lord
Just to help us all understand "the loop" issue..

Here's an example of driver code which uses the existing MSI-X interfaces,
for a device which can work with either 16, 8, 4, 2, or 1 MSI-X interrupt.
This is from a new driver I'm working on right now:


static int xx_alloc_msix_irqs (struct xx_dev *dev, int nvec)
{
xx_disable_all_irqs(dev);
do {
if (nvec < 2)
xx_prep_for_1_msix_vector(dev);
else if (nvec < 4)
xx_prep_for_2_msix_vectors(dev);
else if (nvec < 8)
xx_prep_for_4_msix_vectors(dev);
else if (nvec < 16)
xx_prep_for_8_msix_vectors(dev);
else
xx_prep_for_16_msix_vectors(dev);
nvec = pci_enable_msix(dev->pdev, dev->irqs, dev->num_vectors);
} while (nvec > 0);

if (nvec) {
kerr(dev->name, "pci_enable_msix() failed, err=%d", nvec);
dev->num_vectors = 0;
return nvec;
}
return 0;   /* success */
}
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern

2013-10-08 Thread Mark Lord
On 13-10-02 06:29 AM, Alexander Gordeev wrote:
..
> This update converts pci_enable_msix() and pci_enable_msi_block()
> interfaces to canonical kernel functions and makes them return a
> error code in case of failure or 0 in case of success.

Rather than silently break dozens of drivers in mysterious ways,
please invent new function names for the replacements to the
existing pci_enable_msix() and pci_enable_msi_block() functions.

That way, both in-tree and out-of-tree drivers will notice the API change,
rather than having it go unseen and just failing for unknown reasons.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend

2013-03-12 Thread Mark Lord
On 13-03-11 03:30 PM, Robert Hancock wrote:
..
> I'm sure there are some SSDs that do violate their data integrity
> commitments - a while ago some tests were done (don't have a link
> handy and I don't think they identified the actual vendor/model of the
> drives in any case) but there were definitely some SSDs that did do
> nasty things like trash unrelated data if power was lost while
> writing, etc. So we definitely don't want to risk this sort of thing
> occurring on a normal power-off or suspend.


Just about all current SATA SSDs advertise having some form(s)
of "background garbage collection".  That "feature" has always concerned me,
because of the real possibility of power being removed (system shutdown)
while the drive firmware is in the midst of re-shuffling my data around.

One would hope that "STANDBY IMMEDIATE" is taken by the drive
as a signal to stop that fussing about and prepare for full/sudden power-off.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix libata-eh don't retry command after reset succeeded.

2013-01-14 Thread Mark Lord
On 13-01-14 09:01 AM, Mark Lord wrote:
>>
>> From 9cc9a85f17a8525e53caf430611d762c105d324c Mon Sep 17 00:00:00 2001
>> From: Bian Yu 
>> Date: Tue, 18 Dec 2012 05:58:34 -0500
>> Subject: [PATCH] Fix libata-eh don't retry command after reset succeeded.
>>  It's introduced by commit 8d899e70c1b3afff, When disk has a UNC error,
>>  qc->err_mask will set AC_ERR_MEDIA flag.
>>  Signed-off-by: Bian Yu 
..
> Yup, good catch.  The original patch to fix retries (from me) had it that way,
> but somewhere among the enforced revisions this error crept in unnoticed.
> 
> Jeff -- wanna pick this one up?
> Could be useful for -stable, too.

The original version of this correction patch had the "Signed-off-by" line.
See below.


>From 78c0fb104c9957db682518b97ce6a01ce1bc07b6 Mon Sep 17 00:00:00 2001
From: Bian Yu 
Date: Wed, 12 Dec 2012 22:26:58 -0500
Subject: [PATCH] It should be a mistake introduced by commit 8d899e70c1b3afff.
 because only qc->flags can't be set AC_ERR_*

Signed-off-by: Bian Yu 
---
 drivers/ata/libata-eh.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index bf039b0..bcf4437 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -2094,7 +2094,7 @@ static unsigned int ata_eh_speed_down(struct
ata_device *dev,
  */
 static inline int ata_eh_worth_retry(struct ata_queued_cmd *qc)
 {
-   if (qc->flags & AC_ERR_MEDIA)
+   if (qc->err_mask & AC_ERR_MEDIA)
return 0;   /* don't retry media errors */
if (qc->flags & ATA_QCFLAG_IO)
return 1;   /* otherwise retry anything from fs stack */
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix libata-eh don't retry command after reset succeeded.

2013-01-14 Thread Mark Lord
> 
> From 9cc9a85f17a8525e53caf430611d762c105d324c Mon Sep 17 00:00:00 2001
> From: Bian Yu 
> Date: Tue, 18 Dec 2012 05:58:34 -0500
> Subject: [PATCH] Fix libata-eh don't retry command after reset succeeded.
>  It's introduced by commit 8d899e70c1b3afff, When disk has a UNC error,
>  qc->err_mask will set AC_ERR_MEDIA flag.
>  Signed-off-by: Bian Yu 
>  
> ---
>  drivers/ata/libata-eh.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>  
> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
> index bf039b0..bcf4437 100644
> --- a/drivers/ata/libata-eh.c
> +++ b/drivers/ata/libata-eh.c
> @@ -2094,7 +2094,7 @@ static unsigned int ata_eh_speed_down(struct ata_device 
> *dev,
>   */
>  static inline int ata_eh_worth_retry(struct ata_queued_cmd *qc)
>  {
> -   if (qc->flags & AC_ERR_MEDIA)
> +   if (qc->err_mask & AC_ERR_MEDIA)
> return 0;   /* don't retry media errors */
> if (qc->flags & ATA_QCFLAG_IO)
> return 1;   /* otherwise retry anything from fs stack */
> -- 
> 1.7.1

Yup, good catch.  The original patch to fix retries (from me) had it that way,
but somewhere among the enforced revisions this error crept in unnoticed.

Jeff -- wanna pick this one up?
Could be useful for -stable, too.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: DMA mapping on SCSI device?

2008-01-30 Thread Mark Lord

Mark Lord wrote:

..
Commands which were not ADMA compatible (eg. MODE_SENSE, 
TEST_UNIT_READY, ..)

were simply handled with PIO (in the driver) rather than any form of DMA,
which is okay because those commands are relatively infrequent.

..

A slight correction there:  TEST_UNIT_READY was fine in ADMA mode as well.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: DMA mapping on SCSI device?

2008-01-30 Thread Mark Lord

Robert Hancock wrote:

Luben Tuikov wrote:

--- On Mon, 1/28/08, Robert Hancock <[EMAIL PROTECTED]> wrote:

The trick is that if an ATAPI device is connected, we (as
far as I'm aware) can't use ADMA mode, so we have to switch that
port into legacy mode.


Can you double check this with the HW architect of the
HW DMA engine of the ASIC?


Will do so. However, previous statements from NVIDIA fairly clearly 
indicate that this is the case.





This means it's only capable of 32-bit DMA.
However the other port on the controller may be connected to a hard 
drive and therefore still capable of 64-bit DMA.


If this is indeed the case as you've presented it here,
it sounds like a HW shortcoming.  I cannot see how the device
type (or protocol) dictate how the DMA engine operates.
They live in two different domains.


Well, there is an indirect link. The ADMA interface (which supports 
64-bit DMA) cannot be used to issue ATAPI commands, so if an ATAPI 
device is connected we have to go to legacy mode, which supports only 
32-bit DMA.


I'm not sure why ADMA mode doesn't support ATAPI. The only reason I can 
think of is that there's issues since ATAPI commands can potentially be 
of unpredictable transfer size. The "real" ADMA spec that the NVIDIA 
implementation is loosely based on does have some special "ignore 
excess" controls that don't seem to be in the NVIDIA version (or at 
least not to the knowledge I have on this hardware).

..

The original Pacific Digital ADMA cores *do* support most ATAPI commands
in ADMA mode, including READ_CD, READ_10, etc..  With the caveat that if
DSC completion state is required, the driver has to drop out of ADMA
and poll for it after the ADMA command completes.

Commands which were not ADMA compatible (eg. MODE_SENSE, TEST_UNIT_READY, ..)
were simply handled with PIO (in the driver) rather than any form of DMA,
which is okay because those commands are relatively infrequent.

Note that Pacific Digital "officially" said "no ATAPI" for the ADMA design,
but I implemented it regardless (for Linux) and it worked rather well.
We could burn DVDs and back-up to tape simultaneously, with the burner
and the tape unit sharing a single IDE cable/channel.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fix page_alloc for larger I/O segments (improved)

2007-12-14 Thread Mark Lord

Matthew Wilcox wrote:

On Fri, Dec 14, 2007 at 05:42:37PM +, Mel Gorman wrote:

Regrettably this interferes with anti-fragmentation because the "next" page
on the list on return from rmqueue_bulk is not guaranteed to be of the right
mobility type. I fixed it as an additional patch but it adds additional cost
that should not be necessary and it's visible in microbenchmark results on
at least one machine.


Is this patch to be preferred to the one Andrew Morton posted to do
list_for_each_entry_reverse?

..

This patch replaces my earlier patch that Andrew has:

-   list_add(&page->lru, list);
+   list_add_tail(&page->lru, list);

Which, in turn, replaced the even-earlier list_for_each_entry_reverse patch.

-ml
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fix page_alloc for larger I/O segments (improved)

2007-12-14 Thread Mark Lord

Mel Gorman wrote:

On (13/12/07 19:46), Mark Lord didst pronounce:

"Improved version", more similar to the 2.6.23 code:

Fix page allocator to give better chance of larger contiguous segments 
(again).


Signed-off-by: Mark Lord <[EMAIL PROTECTED]


Regrettably this interferes with anti-fragmentation because the "next" page
on the list on return from rmqueue_bulk is not guaranteed to be of the right
mobility type. I fixed it as an additional patch but it adds additional cost
that should not be necessary and it's visible in microbenchmark results on
at least one machine.

The following patch should fix the page ordering problem without incurring an
additional cost or interfering with anti-fragmentation. However, I haven't
anything in place yet to verify that the physical page ordering is correct
but it makes sense. Can you verify it fixes the problem please?

It'll still be some time before I have a full set of performance results
but initially at least, this fix seems to avoid any impact.

==
Subject: Fix page allocation for larger I/O segments

In some cases the IO subsystem is able to merge requests if the pages are
adjacent in physical memory. This was achieved in the allocator by having
expand() return pages in physically contiguous order in situations were
a large buddy was split. However, list-based anti-fragmentation changed
the order pages were returned in to avoid searching in buffered_rmqueue()
for a page of the appropriate migrate type.

This patch restores behaviour of rmqueue_bulk() preserving the physical order
of pages returned by the allocator without incurring increased search costs for
anti-fragmentation.

Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
--- 
 page_alloc.c |   11 +++

 1 file changed, 11 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff 
linux-2.6.24-rc5-clean/mm/page_alloc.c 
linux-2.6.24-rc5-giveback-physorder-listmove/mm/page_alloc.c
--- linux-2.6.24-rc5-clean/mm/page_alloc.c  2007-12-14 11:55:13.0 
+
+++ linux-2.6.24-rc5-giveback-physorder-listmove/mm/page_alloc.c
2007-12-14 15:33:12.0 +
@@ -847,8 +847,19 @@ static int rmqueue_bulk(struct zone *zon
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
+
+   /*
+* Split buddy pages returned by expand() are received here
+* in physical page order. The page is added to the callers and
+* list and the list head then moves forward. From the callers
+* perspective, the linked list is ordered by page number in
+* some conditions. This is useful for IO devices that can
+* merge IO requests if the physical pages are ordered
+* properly.
+*/
list_add(&page->lru, list);
set_page_private(page, migratetype);
+   list = &page->lru;
}
spin_unlock(&zone->lock);
return i;

..

That (also) works for me here, regularly generating 64KB I/O segments with SLAB.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-14 Thread Mark Lord

Mel Gorman wrote:

On (13/12/07 16:37), Andrew Morton didst pronounce:

On Thu, 13 Dec 2007 19:30:00 -0500
Mark Lord <[EMAIL PROTECTED]> wrote:


Here's the commit that causes the regression:

...

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -760,7 +760,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int 
order,
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
-   list_add_tail(&page->lru, list);
+   list_add(&page->lru, list);

well that looks fishy.



The reasoning behind the change was the first page encountered on the list
by the caller would have a matching migratetype. I failed to take into
account the physical ordering of pages returned. I'm setting up to run some
performance benchmarks of the candidate fix merged into the -mm tree to see
if the search shows up or not. I'm testing against 2.6.25-rc5 but it'll
take a few hours to complete.

..

Thanks, Mel.  This is all with CONFIG_SLAB=y, by the way.

Note that it did appear to behave better with CONFIG_SLUB=y when I accidently
used that .config on my 4GB machine here.  Physical segments of 4-10 pages
happended much more common than with CONFIG_SLAB=y on my 3GB machine
Slightly "apples and oranges" there, I know, but at least both were x86-32.  :)

So I would expect CONFIG_SLAB to be well off with this patch under most (all?)
conditions, but dunno about CONFIG_SLUB.

Cheers




-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fix page_alloc for larger I/O segments (improved)

2007-12-13 Thread Mark Lord

Andrew Morton wrote:

On Thu, 13 Dec 2007 19:57:29 -0500
James Bottomley <[EMAIL PROTECTED]> wrote:


On Thu, 2007-12-13 at 19:46 -0500, Mark Lord wrote:

"Improved version", more similar to the 2.6.23 code:

Fix page allocator to give better chance of larger contiguous segments (again).

Signed-off-by: Mark Lord <[EMAIL PROTECTED]
---

--- old/mm/page_alloc.c 2007-12-13 19:25:15.0 -0500
+++ linux-2.6/mm/page_alloc.c   2007-12-13 19:43:07.0 -0500
@@ -760,7 +760,7 @@
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
-   list_add(&page->lru, list);
+   list_add_tail(&page->lru, list);

Could we put a big comment above this explaining to the would be vm
tweakers why this has to be a list_add_tail, so we don't end up back in
this position after another two years?



Already done ;)

..

I thought of the comment as I rushed off for dinner.
Thanks, Andrew!


--- a/mm/page_alloc.c~fix-page_alloc-for-larger-i-o-segments-fix
+++ a/mm/page_alloc.c
@@ -847,6 +847,10 @@ static int rmqueue_bulk(struct zone *zon
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
+   /*
+* Doing a list_add_tail() here helps us to hand out pages in
+* ascending physical-address order.
+*/
list_add_tail(&page->lru, list);
set_page_private(page, migratetype);
}
_


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Mark Lord wrote:

Andrew Morton wrote:

On Thu, 13 Dec 2007 19:30:00 -0500
Mark Lord <[EMAIL PROTECTED]> wrote:


Here's the commit that causes the regression:

...

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -760,7 +760,8 @@ static int rmqueue_bulk(struct zone *zone, 
unsigned int order,

 struct page *page = __rmqueue(zone, order, migratetype);
 if (unlikely(page == NULL))
 break;
-list_add_tail(&page->lru, list);
+list_add(&page->lru, list);


well that looks fishy.

..

Yeah.  I missed that, and instead just posted a patch
to search the list in reverse order, which seems to work for me.

I'll try just reversing that line above here now.. gimme 5 minutes or so.

..

Yep, that works too.  Alternative "improved" patch now posted.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] fix page_alloc for larger I/O segments (improved)

2007-12-13 Thread Mark Lord


"Improved version", more similar to the 2.6.23 code:

Fix page allocator to give better chance of larger contiguous segments (again).

Signed-off-by: Mark Lord <[EMAIL PROTECTED]
---

--- old/mm/page_alloc.c 2007-12-13 19:25:15.0 -0500
+++ linux-2.6/mm/page_alloc.c   2007-12-13 19:43:07.0 -0500
@@ -760,7 +760,7 @@
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
-   list_add(&page->lru, list);
+   list_add_tail(&page->lru, list);
set_page_private(page, migratetype);
}
spin_unlock(&zone->lock);
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Andrew Morton wrote:

On Thu, 13 Dec 2007 19:30:00 -0500
Mark Lord <[EMAIL PROTECTED]> wrote:


Here's the commit that causes the regression:

...

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -760,7 +760,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int 
order,
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
-   list_add_tail(&page->lru, list);
+   list_add(&page->lru, list);


well that looks fishy.

..

Yeah.  I missed that, and instead just posted a patch
to search the list in reverse order, which seems to work for me.

I'll try just reversing that line above here now.. gimme 5 minutes or so.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] fix page_alloc for larger I/O segments

2007-12-13 Thread Mark Lord

Mark Lord wrote:

Mark Lord wrote:

Mark Lord wrote:

Mark Lord wrote:

Andrew Morton wrote:

On Thu, 13 Dec 2007 17:15:06 -0500
James Bottomley <[EMAIL PROTECTED]> wrote:


On Thu, 2007-12-13 at 14:02 -0800, Andrew Morton wrote:

On Thu, 13 Dec 2007 21:09:59 +0100
Jens Axboe <[EMAIL PROTECTED]> wrote:


OK, it's a vm issue,

cc linux-mm and probable culprit.


 I have tens of thousand "backward" pages after a
boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not
reverse. So it looks like that bug got reintroduced.
Bill Irwin fixed this a couple of years back: changed the page 
allocator so

that it mostly hands out pages in ascending physical-address order.

I guess we broke that, quite possibly in Mel's page allocator 
rework.


It would help if you could provide us with a simple recipe for
demonstrating this problem, please.
The simple way seems to be to malloc a large area, touch every 
page and
then look at the physical pages assigned ... they now mostly seem 
to be

descending in physical address.



OIC.  -mm's /proc/pid/pagemap can be used to get the pfn's...

..

I'm actually running the treadmill right now (have been for many 
hours, actually,

to bisect it to a specific commit.

Thought I was almost done, and then noticed that git-bisect doesn't 
keep
the Makefile VERSION lines the same, so I was actually running the 
wrong

kernel after the first few times.. duh.

Wrote a script to fix it now.

..

Well, that was a waste of three hours.

..

Ahh.. it seems to be sensitive to one/both of these:

CONFIG_HIGHMEM64G=y with 4GB RAM:  not so bad, frequently does 20KB - 
48KB segments.
CONFIG_HIGHMEM4G=y  with 2GB RAM:  very severe, rarely does more than 
8KB segments.
CONFIG_HIGHMEM4G=y  with 3GB RAM:  very severe, rarely does more than 
8KB segments.


So if you want to reproduce this on a large memory machine, use 
"mem=2GB" for starters.

..

Here's the commit that causes the regression:

535131e6925b4a95f321148ad7293f496e0e58d7 Choose pages from the per-cpu 
list based on migration type




And here is a patch that seems to fix it for me here:

* * * *

Fix page allocator to give better change of larger contiguous segments (again).

Signed-off-by: Mark Lord <[EMAIL PROTECTED]
---


--- old/mm/page_alloc.c.orig2007-12-13 19:25:15.0 -0500
+++ linux-2.6/mm/page_alloc.c   2007-12-13 19:35:50.0 -0500
@@ -954,7 +954,7 @@
goto failed;
}
/* Find a page of the appropriate migrate type */
-   list_for_each_entry(page, &pcp->list, lru) {
+   list_for_each_entry_reverse(page, &pcp->list, lru) {
if (page_private(page) == migratetype) {
list_del(&page->lru);
pcp->count--;
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Mark Lord wrote:

Mark Lord wrote:

Mark Lord wrote:

Andrew Morton wrote:

On Thu, 13 Dec 2007 17:15:06 -0500
James Bottomley <[EMAIL PROTECTED]> wrote:


On Thu, 2007-12-13 at 14:02 -0800, Andrew Morton wrote:

On Thu, 13 Dec 2007 21:09:59 +0100
Jens Axboe <[EMAIL PROTECTED]> wrote:


OK, it's a vm issue,

cc linux-mm and probable culprit.


 I have tens of thousand "backward" pages after a
boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not
reverse. So it looks like that bug got reintroduced.
Bill Irwin fixed this a couple of years back: changed the page 
allocator so

that it mostly hands out pages in ascending physical-address order.

I guess we broke that, quite possibly in Mel's page allocator rework.

It would help if you could provide us with a simple recipe for
demonstrating this problem, please.
The simple way seems to be to malloc a large area, touch every page 
and
then look at the physical pages assigned ... they now mostly seem 
to be

descending in physical address.



OIC.  -mm's /proc/pid/pagemap can be used to get the pfn's...

..

I'm actually running the treadmill right now (have been for many 
hours, actually,

to bisect it to a specific commit.

Thought I was almost done, and then noticed that git-bisect doesn't keep
the Makefile VERSION lines the same, so I was actually running the wrong
kernel after the first few times.. duh.

Wrote a script to fix it now.

..

Well, that was a waste of three hours.

..

Ahh.. it seems to be sensitive to one/both of these:

CONFIG_HIGHMEM64G=y with 4GB RAM:  not so bad, frequently does 20KB - 
48KB segments.
CONFIG_HIGHMEM4G=y  with 2GB RAM:  very severe, rarely does more than 
8KB segments.
CONFIG_HIGHMEM4G=y  with 3GB RAM:  very severe, rarely does more than 
8KB segments.


So if you want to reproduce this on a large memory machine, use 
"mem=2GB" for starters.

..

Here's the commit that causes the regression:

535131e6925b4a95f321148ad7293f496e0e58d7 Choose pages from the per-cpu list 
based on migration type


From: Mel Gorman <[EMAIL PROTECTED]>
Date: Tue, 16 Oct 2007 08:25:49 + (-0700)
Subject: Choose pages from the per-cpu list based on migration type
X-Git-Tag: v2.6.24-rc1~1141
X-Git-Url: 
http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=535131e6925b4a95f321148ad7293f496e0e58d7;hp=b2a0ac8875a0a3b9f0739b60526f8c5977d2200f

Choose pages from the per-cpu list based on migration type

The freelists for each migrate type can slowly become polluted due to the
per-cpu list.  Consider what happens when the following happens

1. A 2^(MAX_ORDER-1) list is reserved for __GFP_MOVABLE pages
2. An order-0 page is allocated from the newly reserved block
3. The page is freed and placed on the per-cpu list
4. alloc_page() is called with GFP_KERNEL as the gfp_mask
5. The per-cpu list is used to satisfy the allocation

This results in a kernel page is in the middle of a migratable region. This
patch prevents this leak occuring by storing the MIGRATE_ type of the page in
page->private. On allocate, a page will only be returned of the desired type,
else more pages will be allocated. This may temporarily allow a per-cpu list
to go over the pcp->high limit but it'll be corrected on the next free. Care
is taken to preserve the hotness of pages recently freed.

The additional code is not measurably slower for the workloads we've tested.

Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d54ecf4..e3e726b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -760,7 +760,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int 
order,
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
-   list_add_tail(&page->lru, list);
+   list_add(&page->lru, list);
+   set_page_private(page, migratetype);
}
spin_unlock(&zone->lock);
return i;
@@ -887,6 +888,7 @@ static void fastcall free_hot_cold_page(struct page *page, 
int cold)
local_irq_save(flags);
__count_vm_event(PGFREE);
list_add(&page->lru, &pcp->list);
+   set_page_private(page, get_pageblock_migratetype(page));
pcp->count++;
if (pcp->count >= pcp->high) {
free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
@@ -951,9 +953,27 @@ again:
if (unlikely(!pcp->count))
goto failed;
}
-   page = list_entry(pcp->list.next, struct page, lru);
-   list_del(&page->lru);
-   pcp->count--;
+ 

Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Mark Lord wrote:

Mark Lord wrote:

Andrew Morton wrote:

On Thu, 13 Dec 2007 17:15:06 -0500
James Bottomley <[EMAIL PROTECTED]> wrote:


On Thu, 2007-12-13 at 14:02 -0800, Andrew Morton wrote:

On Thu, 13 Dec 2007 21:09:59 +0100
Jens Axboe <[EMAIL PROTECTED]> wrote:


OK, it's a vm issue,

cc linux-mm and probable culprit.


 I have tens of thousand "backward" pages after a
boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not
reverse. So it looks like that bug got reintroduced.
Bill Irwin fixed this a couple of years back: changed the page 
allocator so

that it mostly hands out pages in ascending physical-address order.

I guess we broke that, quite possibly in Mel's page allocator rework.

It would help if you could provide us with a simple recipe for
demonstrating this problem, please.

The simple way seems to be to malloc a large area, touch every page and
then look at the physical pages assigned ... they now mostly seem to be
descending in physical address.



OIC.  -mm's /proc/pid/pagemap can be used to get the pfn's...

..

I'm actually running the treadmill right now (have been for many 
hours, actually,

to bisect it to a specific commit.

Thought I was almost done, and then noticed that git-bisect doesn't keep
the Makefile VERSION lines the same, so I was actually running the wrong
kernel after the first few times.. duh.

Wrote a script to fix it now.

..

Well, that was a waste of three hours.

..

Ahh.. it seems to be sensitive to one/both of these:

CONFIG_HIGHMEM64G=y with 4GB RAM:  not so bad, frequently does 20KB - 48KB 
segments.
CONFIG_HIGHMEM4G=y  with 2GB RAM:  very severe, rarely does more than 8KB 
segments.
CONFIG_HIGHMEM4G=y  with 3GB RAM:  very severe, rarely does more than 8KB 
segments.

So if you want to reproduce this on a large memory machine, use "mem=2GB" for 
starters.

Still testing..



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Mark Lord wrote:

Andrew Morton wrote:

On Thu, 13 Dec 2007 17:15:06 -0500
James Bottomley <[EMAIL PROTECTED]> wrote:


On Thu, 2007-12-13 at 14:02 -0800, Andrew Morton wrote:

On Thu, 13 Dec 2007 21:09:59 +0100
Jens Axboe <[EMAIL PROTECTED]> wrote:


OK, it's a vm issue,

cc linux-mm and probable culprit.


 I have tens of thousand "backward" pages after a
boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not
reverse. So it looks like that bug got reintroduced.
Bill Irwin fixed this a couple of years back: changed the page 
allocator so

that it mostly hands out pages in ascending physical-address order.

I guess we broke that, quite possibly in Mel's page allocator rework.

It would help if you could provide us with a simple recipe for
demonstrating this problem, please.

The simple way seems to be to malloc a large area, touch every page and
then look at the physical pages assigned ... they now mostly seem to be
descending in physical address.



OIC.  -mm's /proc/pid/pagemap can be used to get the pfn's...

..

I'm actually running the treadmill right now (have been for many hours, 
actually,

to bisect it to a specific commit.

Thought I was almost done, and then noticed that git-bisect doesn't keep
the Makefile VERSION lines the same, so I was actually running the wrong
kernel after the first few times.. duh.

Wrote a script to fix it now.

..

Well, that was a waste of three hours.

Somebody else can try it now.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Andrew Morton wrote:

On Thu, 13 Dec 2007 17:15:06 -0500
James Bottomley <[EMAIL PROTECTED]> wrote:


On Thu, 2007-12-13 at 14:02 -0800, Andrew Morton wrote:

On Thu, 13 Dec 2007 21:09:59 +0100
Jens Axboe <[EMAIL PROTECTED]> wrote:


OK, it's a vm issue,

cc linux-mm and probable culprit.


 I have tens of thousand "backward" pages after a
boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not
reverse. So it looks like that bug got reintroduced.

Bill Irwin fixed this a couple of years back: changed the page allocator so
that it mostly hands out pages in ascending physical-address order.

I guess we broke that, quite possibly in Mel's page allocator rework.

It would help if you could provide us with a simple recipe for
demonstrating this problem, please.

The simple way seems to be to malloc a large area, touch every page and
then look at the physical pages assigned ... they now mostly seem to be
descending in physical address.



OIC.  -mm's /proc/pid/pagemap can be used to get the pfn's...

..

I'm actually running the treadmill right now (have been for many hours, 
actually,
to bisect it to a specific commit.

Thought I was almost done, and then noticed that git-bisect doesn't keep
the Makefile VERSION lines the same, so I was actually running the wrong
kernel after the first few times.. duh.

Wrote a script to fix it now.

-ml
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Mark Lord wrote:

Jens Axboe wrote:

..

OK, it's a vm issue, I have tens of thousand "backward" pages after a
boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not
reverse. So it looks like that bug got reintroduced.

...

Mmm.. shouldn't one of the front- or back- merge logics work for either order?

..

Belay that thought.  I'm slowly remembering how this is supposed to work now.  
:)
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Matthew Wilcox wrote:

On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
Problem confirmed.  2.6.23.8 regularly generates segments up to 
64KB for libata,

but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
Just a suspicion ... could this be slab vs slub?  ie check your 
configs

are the same / similar between the two kernels.

..

Mmmm.. a good thought, that one.
But I just rechecked, and both have CONFIG_SLAB=y

My guess is that something got changed around when Jens
reworked the block layer for 2.6.24.
I'm going to dig around in there now.

I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects 
in

this area, since it changes some of the code involved with merges and
blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do 
see
occasional segments being merged. So it sounds more like the input 
data

having changed.

Why not just bisect it?

..

Because the early 2.6.24 series failed to boot on this machine
due to bugs in the block layer -- so the code that caused this 
regression

is probably in the stuff from before the kernels became usable here.

..

That sounds more harsh than intended --> the earlier 2.6.24 kernels 
(up to

the first couple of -rc* ones failed here because of incompatibilities
between the block/bio changes and libata.

That's better, I think! 

No worries, I didn't pick it up as harsh just as an odd conclusion :-)

If I were you, I'd just start from the first -rc that booted for you. If
THAT has the bug, then we'll think of something else. If you don't get
anywhere, I can run some tests tomorrow and see if I can reproduce it
here.

..

I believe that *anyone* can reproduce it, since it's broken long before
the requests ever get to SCSI or libata.  Which also means that *anyone*
who wants to can bisect it, as well.

I don't do "bisects".

It was just a suggestion on how to narrow it down, do as you see fit.


But I will dig a bit more and see if I can find the culprit.

Sure, I'll dig around as well.

Just tried something simple. I only see one 12kb segment so far, so not
a lot by any stretch. I also DONT see any missed merges signs, so it
would appear that the pages in the request are simply not contigious
physically.

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index e30b1a4..1e34b6f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1330,6 +1330,8 @@ int blk_rq_map_sg(struct request_queue *q, struct 
request *rq,

goto new_segment;

sg->length += nbytes;
+   if (sg->length > 8192)
+   printk("sg_len=%d\n", sg->length);
} else {
new_segment:
if (!sg)
@@ -1349,6 +1351,8 @@ new_segment:
sg = sg_next(sg);
}

+			if (bvprv && (page_address(bvprv->bv_page) + 
bvprv->bv_len == page_address(bvec->bv_page)))

+   printk("missed merge\n");
			sg_set_page(sg, bvec->bv_page, nbytes, 
			bvec->bv_offset);

nsegs++;
}


..

Yeah, the first part is similar to my own hack.

For testing, try "dd if=/dev/sda of=/dev/null bs=4096k".
That *really* should end up using contiguous pages on most systems.

I figured out the git thing, and am now building some in-between kernels to 
try.


OK, it's a vm issue, I have tens of thousand "backward" pages after a
boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not
reverse. So it looks like that bug got reintroduced.

...

Mmm.. shouldn't one of the front- or back- merge logics work for either order?


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Jens Axboe wrote:

On Thu, Dec 13 2007, Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Matthew Wilcox wrote:

On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
Problem confirmed.  2.6.23.8 regularly generates segments up to 
64KB for libata,

but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
Just a suspicion ... could this be slab vs slub?  ie check your 
configs

are the same / similar between the two kernels.

..

Mmmm.. a good thought, that one.
But I just rechecked, and both have CONFIG_SLAB=y

My guess is that something got changed around when Jens
reworked the block layer for 2.6.24.
I'm going to dig around in there now.

I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects in
this area, since it changes some of the code involved with merges and
blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do see
occasional segments being merged. So it sounds more like the input data
having changed.

Why not just bisect it?

..

Because the early 2.6.24 series failed to boot on this machine
due to bugs in the block layer -- so the code that caused this regression
is probably in the stuff from before the kernels became usable here.

..

That sounds more harsh than intended --> the earlier 2.6.24 kernels (up to
the first couple of -rc* ones failed here because of incompatibilities
between the block/bio changes and libata.

That's better, I think! 

No worries, I didn't pick it up as harsh just as an odd conclusion :-)

If I were you, I'd just start from the first -rc that booted for you. If
THAT has the bug, then we'll think of something else. If you don't get
anywhere, I can run some tests tomorrow and see if I can reproduce it
here.

..

I believe that *anyone* can reproduce it, since it's broken long before
the requests ever get to SCSI or libata.  Which also means that *anyone*
who wants to can bisect it, as well.

I don't do "bisects".

It was just a suggestion on how to narrow it down, do as you see fit.


But I will dig a bit more and see if I can find the culprit.

Sure, I'll dig around as well.


Just tried something simple. I only see one 12kb segment so far, so not
a lot by any stretch. I also DONT see any missed merges signs, so it
would appear that the pages in the request are simply not contigious
physically.

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index e30b1a4..1e34b6f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1330,6 +1330,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request 
*rq,
goto new_segment;
 
 			sg->length += nbytes;

+   if (sg->length > 8192)
+   printk("sg_len=%d\n", sg->length);
} else {
 new_segment:
if (!sg)
@@ -1349,6 +1351,8 @@ new_segment:
sg = sg_next(sg);
}
 
+			if (bvprv && (page_address(bvprv->bv_page) + bvprv->bv_len == page_address(bvec->bv_page)))

+   printk("missed merge\n");
sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
nsegs++;
}


..

Yeah, the first part is similar to my own hack.

For testing, try "dd if=/dev/sda of=/dev/null bs=4096k".
That *really* should end up using contiguous pages on most systems.

I figured out the git thing, and am now building some in-between kernels to try.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Matthew Wilcox wrote:

On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
Problem confirmed.  2.6.23.8 regularly generates segments up to 
64KB for libata,

but 2.6.24 uses only 4KB segments and a *few* 8KB segments.
Just a suspicion ... could this be slab vs slub?  ie check your 
configs

are the same / similar between the two kernels.

..

Mmmm.. a good thought, that one.
But I just rechecked, and both have CONFIG_SLAB=y

My guess is that something got changed around when Jens
reworked the block layer for 2.6.24.
I'm going to dig around in there now.

I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects in
this area, since it changes some of the code involved with merges and
blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do see
occasional segments being merged. So it sounds more like the input data
having changed.

Why not just bisect it?

..

Because the early 2.6.24 series failed to boot on this machine
due to bugs in the block layer -- so the code that caused this regression
is probably in the stuff from before the kernels became usable here.

..

That sounds more harsh than intended --> the earlier 2.6.24 kernels (up to
the first couple of -rc* ones failed here because of incompatibilities
between the block/bio changes and libata.

That's better, I think! 

No worries, I didn't pick it up as harsh just as an odd conclusion :-)

If I were you, I'd just start from the first -rc that booted for you. If
THAT has the bug, then we'll think of something else. If you don't get
anywhere, I can run some tests tomorrow and see if I can reproduce it
here.

..

I believe that *anyone* can reproduce it, since it's broken long before
the requests ever get to SCSI or libata.  Which also means that *anyone*
who wants to can bisect it, as well.

I don't do "bisects".


It was just a suggestion on how to narrow it down, do as you see fit.


But I will dig a bit more and see if I can find the culprit.


Sure, I'll dig around as well.

..

I wonder if it's 9dfa52831e96194b8649613e3131baa2c109f7dc:
"Merge blk_recount_segments into blk_recalc_rq_segments" ?

That particular commit does some rather innocent code-shuffling,
but also introduces a couple of new "if (nr_hw_segs == 1" conditions
that were not there before.

Okay git experts:  how do I pull out a kernel at the point of this exact commit 
?

Thanks!




-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Matthew Wilcox wrote:

On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
Problem confirmed.  2.6.23.8 regularly generates segments up to 64KB for 
libata,

but 2.6.24 uses only 4KB segments and a *few* 8KB segments.

Just a suspicion ... could this be slab vs slub?  ie check your configs
are the same / similar between the two kernels.

..

Mmmm.. a good thought, that one.
But I just rechecked, and both have CONFIG_SLAB=y

My guess is that something got changed around when Jens
reworked the block layer for 2.6.24.
I'm going to dig around in there now.


I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects in
  this area, since it changes some of the code involved with merges and
  blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do see
occasional segments being merged. So it sounds more like the input data
having changed.

Why not just bisect it?

..

CC'ing Neil Brown.

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Matthew Wilcox wrote:

On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
Problem confirmed.  2.6.23.8 regularly generates segments up to 
64KB for libata,

but 2.6.24 uses only 4KB segments and a *few* 8KB segments.

Just a suspicion ... could this be slab vs slub?  ie check your configs
are the same / similar between the two kernels.

..

Mmmm.. a good thought, that one.
But I just rechecked, and both have CONFIG_SLAB=y

My guess is that something got changed around when Jens
reworked the block layer for 2.6.24.
I'm going to dig around in there now.

I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects in
 this area, since it changes some of the code involved with merges and
 blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do see
occasional segments being merged. So it sounds more like the input data
having changed.

Why not just bisect it?

..

Because the early 2.6.24 series failed to boot on this machine
due to bugs in the block layer -- so the code that caused this regression
is probably in the stuff from before the kernels became usable here.

..

That sounds more harsh than intended --> the earlier 2.6.24 kernels (up to
the first couple of -rc* ones failed here because of incompatibilities
between the block/bio changes and libata.

That's better, I think! 


No worries, I didn't pick it up as harsh just as an odd conclusion :-)

If I were you, I'd just start from the first -rc that booted for you. If
THAT has the bug, then we'll think of something else. If you don't get
anywhere, I can run some tests tomorrow and see if I can reproduce it
here.

..

I believe that *anyone* can reproduce it, since it's broken long before
the requests ever get to SCSI or libata.  Which also means that *anyone*
who wants to can bisect it, as well.

I don't do "bisects".

But I will dig a bit more and see if I can find the culprit.

Cheers

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Mark Lord wrote:

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Matthew Wilcox wrote:

On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
Problem confirmed.  2.6.23.8 regularly generates segments up to 
64KB for libata,

but 2.6.24 uses only 4KB segments and a *few* 8KB segments.

Just a suspicion ... could this be slab vs slub?  ie check your configs
are the same / similar between the two kernels.

..

Mmmm.. a good thought, that one.
But I just rechecked, and both have CONFIG_SLAB=y

My guess is that something got changed around when Jens
reworked the block layer for 2.6.24.
I'm going to dig around in there now.


I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects in
  this area, since it changes some of the code involved with merges and
  blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do see
occasional segments being merged. So it sounds more like the input data
having changed.

Why not just bisect it?

..

Because the early 2.6.24 series failed to boot on this machine
due to bugs in the block layer -- so the code that caused this regression
is probably in the stuff from before the kernels became usable here.

..

That sounds more harsh than intended --> the earlier 2.6.24 kernels (up to
the first couple of -rc* ones failed here because of incompatibilities
between the block/bio changes and libata.

That's better, I think! 


Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Jens Axboe wrote:

On Thu, Dec 13 2007, Mark Lord wrote:

Matthew Wilcox wrote:

On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
Problem confirmed.  2.6.23.8 regularly generates segments up to 64KB for 
libata,

but 2.6.24 uses only 4KB segments and a *few* 8KB segments.

Just a suspicion ... could this be slab vs slub?  ie check your configs
are the same / similar between the two kernels.

..

Mmmm.. a good thought, that one.
But I just rechecked, and both have CONFIG_SLAB=y

My guess is that something got changed around when Jens
reworked the block layer for 2.6.24.
I'm going to dig around in there now.


I didn't rework the block layer for 2.6.24 :-). The core block layer
changes since 2.6.23 are:

- Support for empty barriers. Not a likely candidate.
- Shared tag queue fixes. Totally unlikely.
- sg chaining support. Not likely.
- The bio changes from Neil. Of the bunch, the most likely suspects in
  this area, since it changes some of the code involved with merges and
  blk_rq_map_sg().
- Lots of simple stuff, again very unlikely.

Anyway, it sounds odd for this to be a block layer problem if you do see
occasional segments being merged. So it sounds more like the input data
having changed.

Why not just bisect it?

..

Because the early 2.6.24 series failed to boot on this machine
due to bugs in the block layer -- so the code that caused this regression
is probably in the stuff from before the kernels became usable here.

Cheers

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Matthew Wilcox wrote:

On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote:
Problem confirmed.  2.6.23.8 regularly generates segments up to 64KB for 
libata,

but 2.6.24 uses only 4KB segments and a *few* 8KB segments.


Just a suspicion ... could this be slab vs slub?  ie check your configs
are the same / similar between the two kernels.

..

Mmmm.. a good thought, that one.
But I just rechecked, and both have CONFIG_SLAB=y

My guess is that something got changed around when Jens
reworked the block layer for 2.6.24.
I'm going to dig around in there now.

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Mark Lord wrote:

(resending with corrected email address for Jens)

Jens,

I'm experimenting here with trying to generate large I/O through libata,
and not having much luck.

The limit seems to be the number of hardware PRD (SG) entries permitted
by the driver (libata:ata_piix), which is 128 by default.

The problem is, the block layer *never* sends an SG entry larger than 
8192 bytes,
and even that size is exceptionally rare.  Nearly all I/O segments are 
4096 bytes,

so I never see a single I/O larger than 512KB (128 * 4096).

If I patch various parts of block and SCSI, this limit doesn't budge,
but when I change the hardware PRD limit in libata, it scales by exactly
whatever I set the new value to.  This tells me that adjacent I/O segments
are not being combined.

I thought that QUEUE_FLAG_CLUSTER (aka. SCSI host .use_clustering=1) should
result in adjacent single pages being combined into larger physical 
segments?


This is x86-32 with latest 2.6.24-rc*.
I'll re-test on older kernels next.

...

Problem confirmed.  2.6.23.8 regularly generates segments up to 64KB for libata,
but 2.6.24 uses only 4KB segments and a *few* 8KB segments.

???
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

(resending with corrected email address for Jens)

Jens,

I'm experimenting here with trying to generate large I/O through libata,
and not having much luck.

The limit seems to be the number of hardware PRD (SG) entries permitted
by the driver (libata:ata_piix), which is 128 by default.

The problem is, the block layer *never* sends an SG entry larger than 8192 
bytes,
and even that size is exceptionally rare.  Nearly all I/O segments are 4096 
bytes,
so I never see a single I/O larger than 512KB (128 * 4096).

If I patch various parts of block and SCSI, this limit doesn't budge,
but when I change the hardware PRD limit in libata, it scales by exactly
whatever I set the new value to.  This tells me that adjacent I/O segments
are not being combined.

I thought that QUEUE_FLAG_CLUSTER (aka. SCSI host .use_clustering=1) should
result in adjacent single pages being combined into larger physical segments?

This is x86-32 with latest 2.6.24-rc*.
I'll re-test on older kernels next.

???
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?

2007-12-13 Thread Mark Lord

Jens,

I'm experimenting here with trying to generate large I/O through libata,
and not having much luck.

The limit seems to be the number of hardware PRD (SG) entries permitted
by the driver (libata:ata_piix), which is 128 by default.

The problem is, the block layer *never* sends an SG entry larger than 8192 
bytes,
and even that size is exceptionally rare.  Nearly all I/O segments are 4096 
bytes,
so I never see a single I/O larger than 512KB (128 * 4096).

If I patch various parts of block and SCSI, this limit doesn't budge,
but when I change the hardware PRD limit in libata, it scales by exactly
whatever I set the new value to.  This tells me that adjacent I/O segments
are not being combined.

I thought that QUEUE_FLAG_CLUSTER (aka. SCSI host .use_clustering=1) should
result in adjacent single pages being combined into larger physical segments?

This is x86-32 with latest 2.6.24-rc*.
I'll re-test on older kernels next.

???
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Process Scheduling Issue using sg/libata

2007-11-18 Thread Mark Lord

Fajun Chen wrote:


As a matter of fact, I'm using /dev/sg*.  Due to the size of my test
application, I have not be able to compress it into a small and
publishable form. However, this issue can be easily reproduced on my
ARM XScale target using sg3_util code as follows:
1. Run printtime.c attached,  which prints message to console in a loop.
2. Run sgm_dd (part of sg3_util package, source code attached) on the
same system as follows:

sgm_dd if=/dev/sg0 of=/dev/null count=10M bpt=1

The print task can be delayed for as many as 25 seconds. Surprisingly,
I can't reproduce the problem in an i386 test system with a more
powerful processor.

Some clarification to MAP_ANONYMOUS option in mmap(). After fixing a
bug and more testing, this option seems to make no difference to cpu
load. Sorry about previous report. Back to the drawing board now :-)

..

Okay, I don't see anything unusual here.  The code is on a slow CPU,
and is triggering 10MBytes of PIO over a (probably) slow bus to an ATA device.

This *will* tie up the CPU at 100% for the duration of the I/O,
because the I/O happens in interrupt handlers, which are outside
of the realm of the CPU scheduler.

This is a known shortcoming of Linux for real-time uses.

When the I/O uses DMA transfers, it *may* still have a similar effect,
depending upon the caching in the ATA device, and on how the DMA shares
the memory bus with the CPU.

Again, no surprise here.

One way to deal with it in an embedded device, is to force the
application that's generating the I/O to self-throttle.
Or modify the device driver to self-throttle.

You may want to find an embedded Linux consultant to help out
with this situation if it's beyond your expertise.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Process Scheduling Issue using sg/libata

2007-11-18 Thread Mark Lord

Fajun Chen wrote:

..
I verified your program works in my system and my application works as
well if changed accordingly. However, this change (indirect IO in sg
term) may come at a performance cost for IO intensive applications
since it does NOT utilize mmaped buffer managed by sg driver.  Please
see relevant sg document below:
http://sg.torque.net/sg/p/sg_v3_ho.html#id2495330
http://sg.torque.net/sg/p/sg_v3_ho.html#dmmio
As an example, sg_rbuf.c in sg3_util package uses SG_FLAG_MMAP_IO flag
in SG_IO. Please see source code attached. I also noticed that
MAP_ANONYMOUS is NOT used in mmap() call in sg_rbuf.c, which may not
be desirable as you pointed out in previous emails. So this brings up
an interesting sg usage issue: can we use MAP_ANONYMOUS with
SG_FLAG_MMAP_IO flag in SG_IO?

..

The SG_FLAG_MMAP works only with /dev/sg* devices, not /dev/sd* devices.
I don't know which kind you were trying to use, since you still have
not provided your source code for examination.

If you are using /dev/sg*, then you should be able to get your original mmap()
code to work.  But the behaviour described thus far seems to indicate that
your secret program must have been using /dev/sd* instead.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Process Scheduling Issue using sg/libata

2007-11-18 Thread Mark Lord

Fajun Chen wrote:

On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote:

..

What you probably intended to do instead, was to use mmap to just allocate
some page-aligned RAM, not to actually mmap'd any on-disk data.  Right?

Here's how that's done:

  read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
 MAP_SHARED|MAP_ANONYMOUS, -1, 0);


What I intended to do is to write data into disc or read data from
disc via SG_IO as requested by my user-space application. I don't want
any automatically scheduled kernel task to sync data with disc.

..

Right.  Then you definitely do NOT want to mmap your device,
because that's exactly what would otherwise happen, by design!



I've experimented with memory mapping using MAP_ANONYMOUS as you
suggested, the good news is that it does free up the cpu load and my
system is much more responsive with the change.

..

Yes, that's what we expected to see.



The bad news is that
the data read back from disc (PIO or DMA read) seems to be invisible
to user-space application. For instance, read buffer is all zeros
after Identify Device command. Is this expected side effect of
MAP_ANONYMOUS option?

..

No, that would be a side effect of some other bug in the code.

Here (attached) is a working program that performs (PACKET)IDENTIFY DEVICE
commands, using a mmap() buffer to receive the data.

Cheers
/*
 * This code is copyright 2007 by Mark Lord,
 * and is made available to all under the terms
 * of the GNU General Public License v2.
 */
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include 
#include 

typedef unsigned long long u64;

enum {
	ATA_CMD_PIO_IDENTIFY		= 0xec,
	ATA_CMD_PIO_PIDENTIFY		= 0xa1,

	/* normal sector size (bytes) for PIO/DMA */
	ATA_SECT_SIZE			= 512,

	ATA_16= 0x85,
	ATA_16_LEN			= 16,

	ATA_DEV_REG_LBA			= (1 << 6),

	ATA_LBA48			= 1,

	/* data transfer protocols; only basic PIO and DMA actually work */
	ATA_PROTO_NON_DATA		= ( 3 << 1),
	ATA_PROTO_PIO_IN		= ( 4 << 1),
	ATA_PROTO_PIO_OUT		= ( 5 << 1),
	ATA_PROTO_DMA			= ( 6 << 1),
	ATA_PROTO_UDMA_IN		= (11 << 1), /* unsupported */
	ATA_PROTO_UDMA_OUT		= (12 << 1), /* unsupported */
};

/*
 * Taskfile layout for ATA_16 cdb (LBA28/LBA48):
 *
 *	cdb[ 4] = feature
 *	cdb[ 6] = nsect
 *	cdb[ 8] = lbal
 *	cdb[10] = lbam
 *	cdb[12] = lbah
 *	cdb[13] = device
 *	cdb[14] = command
 *
 * "high order byte" (hob) fields for LBA48 commands:
 *
 *	cdb[ 3] = hob_feature
 *	cdb[ 5] = hob_nsect
 *	cdb[ 7] = hob_lbal
 *	cdb[ 9] = hob_lbam
 *	cdb[11] = hob_lbah
 *
 * dxfer_direction choices:
 *
 *	SG_DXFER_TO_DEV		(writing to drive)
 *	SG_DXFER_FROM_DEV	(reading from drive)
 *	SG_DXFER_NONE		(non-data commands)
 */

static int sg_issue (int fd, unsigned char ata_op, void *buf)
{
	unsigned char cdb[ATA_16_LEN]
		= { ATA_16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
	unsigned char sense[32];
	unsigned int nsects = 1;
	struct sg_io_hdr hdr;

	cdb[ 1] = ATA_PROTO_PIO_IN;
	cdb[ 6] = nsects;
	cdb[14] = ata_op;

	memset(&hdr, 0, sizeof(struct sg_io_hdr));
	hdr.interface_id	= 'S';
	hdr.cmd_len		= ATA_16_LEN;
	hdr.mx_sb_len		= sizeof(sense);
	hdr.dxfer_direction	= SG_DXFER_FROM_DEV;
	hdr.dxfer_len		= nsects * ATA_SECT_SIZE;
	hdr.dxferp		= buf;
	hdr.cmdp		= cdb;
	hdr.sbp			= sense;
	hdr.timeout		= 5000; /* milliseconds */

	memset(sense, 0, sizeof(sense));
	if (ioctl(fd, SG_IO, &hdr) < 0) {
		perror("ioctl(SG_IO)");
		return (-1);
	}
	if (hdr.status == 0 && hdr.host_status == 0 && hdr.driver_status == 0)
		return 0; /* success */

	if (hdr.status > 0) {
		unsigned char *d = sense + 8;
		/* SCSI status is non-zero */
		fprintf(stderr, "SG_IO error: SCSI sense=0x%x/%02x/%02x, ATA=0x%02x/%02x\n",
			sense[1] & 0xf, sense[2], sense[3], d[13], d[3]);
		return -1;
	}
	/* some other error we don't know about yet */
	fprintf(stderr, "SG_IO returned: SCSI status=0x%x, host_status=0x%x, driver_status=0x%x\n",
		hdr.status, hdr.host_status, hdr.driver_status);
	return -1;
}

int main (int argc, char *argv[])
{
	const char *devpath;
	int i, rc, fd;
#if 0
	unsigned short id[ATA_SECT_SIZE / 2];
	memset(id, 0, sizeof(id));
#else
	unsigned short *id;
	id = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0);
	if (id == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
#endif
	if (argc != 2) {
		fprintf(stderr, "%s: bad/missing parm: expected \n", argv[0]);
		exit(1);
	}
	devpath = argv[1];

	fd = open(devpath, O_RDWR|O_NONBLOCK);
	if (fd == -1) {
		perror(devpath);
		exit(1);
	}
	rc = sg_issue(fd, ATA_CMD_PIO_IDENTIFY, id);
	if (rc != 0)
		rc = sg_issue(fd, ATA_CMD_PIO_PIDENTIFY, id);
	if (rc == 0) {
		unsigned short *d = id;
		for (i = 0; i < (256/8); ++i) {
			printf("%04x %04x %04x %04x %04x %04x %04x %04x\n",
d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7]);
			d += 8;
		}
		exit(0);
	}
	exit(1);
}


Re: Process Scheduling Issue using sg/libata

2007-11-17 Thread Mark Lord

Fajun Chen wrote:

On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote:

Fajun Chen wrote:

On 11/16/07, Mark Lord <[EMAIL PROTECTED]> wrote:

Fajun Chen wrote:

..

This problem also happens with R/W DMA ops. Below are simplified code snippets:
// Open one sg device for read
  if ((sg_fd  = open(dev_name, O_RDWR))<0)
  {
  ...
  }
  read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
 MAP_SHARED, sg_fd, 0);

// Open the same sg device for write
  if ((sg_fd_wr = open(dev_name, O_RDWR))<0)
  {
 ...
  }
  write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
 MAP_SHARED, sg_fd_wr, 0);

..

Mmmm.. what is the purpose of those two mmap'd areas ?
I think this is important and relevant here:  what are they used for?

As coded above, these are memory mapped areas taht (1) overlap,
and (2) will be demand paged automatically to/from the disk
as they are accessed/modified.  This *will* conflict with any SG_IO
operations happening at the same time on the same device.

..

The purpose of using two memory mapped areas is to meet our
requirement that certain data patterns for writing need to be kept
across commands. For instance, if one buffer is used for both reads
and writes, then this buffer will need to be re-populated with certain
write data after each read command, which would be very costly for
write-read mixed type of ops. This separate R/W buffer setting also
facilitates data comparison.

These buffers are not used at the same time (one will be used only
after the command on the other is completed). My application is the
only program accessing disk using sg/libata and the rest of the
programs run from ramdisk. Also, each buffer is only about 0.5MB and
we have 64MB RAM on the target board.
With this setup,  these two buffers should be pretty much independent
and free from block layer/file system, correct?

..

No.  Those "buffers" as coded above are actually mmap'ed representations
of portions of the device (disk drive).  So any write into one of those
buffers will trigger disk writes, and just accessing ("read") the buffers
may trigger disk reads.

So what could be happening here, is when you trigger manual disk accesses
via SG_IO, that result in data being copied into those "buffers", the kernel
then automatically schedules disk writes to update the on-disk copies of
those mmap'd regions.

What you probably intended to do instead, was to use mmap to just allocate
some page-aligned RAM, not to actually mmap'd any on-disk data.  Right?

Here's how that's done:

  read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
 MAP_SHARED|MAP_ANONYMOUS, -1, 0);

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Process Scheduling Issue using sg/libata

2007-11-17 Thread Mark Lord

Fajun Chen wrote:

On 11/16/07, Mark Lord <[EMAIL PROTECTED]> wrote:

Fajun Chen wrote:

Hi All,

I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
and libata version 2.00 are loaded on ARM XScale board.  Under heavy
cpu load (e.g. when blocks per transfer/sector count is set to 1),
I've observed that the test application can suck cpu away for long
time (more than 20 seconds) and other processes including high
priority shell can not get the time slice to run.  What's interesting
is that if the application is under heavy IO load (e.g. when blocks
per transfer/sector count is set to 256),  the problem goes away. I
also tested with open source code sg_utils and got the same result, so
this is not a problem specific to my user-space application.

..

Post the relevant code here, and then we'll be able to better understand
and explain it to you.

For example, if the code is using ATA opcodes 0x20, 0x21, 0x24,
0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops),
then this behaviour does not surprise me in the least.  Fully expected
and difficult to avoid.



This problem also happens with R/W DMA ops. Below are simplified code snippets:
// Open one sg device for read
  if ((sg_fd  = open(dev_name, O_RDWR))<0)
  {
  ...
  }
  read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
 MAP_SHARED, sg_fd, 0);

// Open the same sg device for write
  if ((sg_fd_wr = open(dev_name, O_RDWR))<0)
  {
 ...
  }
  write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
 MAP_SHARED, sg_fd_wr, 0);

..

Mmmm.. what is the purpose of those two mmap'd areas ?
I think this is important and relevant here:  what are they used for?

As coded above, these are memory mapped areas taht (1) overlap,
and (2) will be demand paged automatically to/from the disk
as they are accessed/modified.  This *will* conflict with any SG_IO
operations happening at the same time on the same device.






  sg_io_hdr_t io_hdr;

  memset(&io_hdr, 0, sizeof(sg_io_hdr_t));

  io_hdr.interface_id = 'S';
  io_hdr.mx_sb_len= sizeof(sense_buffer);
  io_hdr.sbp  = sense_buffer;
  io_hdr.dxfer_len= dxfer_len;
  io_hdr.cmd_len  = cmd_len;
  io_hdr.cmdp = cmdp;// ATA pass through command block
  io_hdr.timeout  = cmd_tmo * 1000;   // In millisecs
  io_hdr.pack_id = id;  // Read/write counter for now
  io_hdr.iovec_count=0;   // scatter gather elements, 0=not being used

  if (direction == 1)
  {
io_hdr.dxfer_direction = SG_DXFER_TO_DEV;
io_hdr.flags |= SG_FLAG_MMAP_IO;
status = ioctl(sg_fd_wr, SG_IO, &io_hdr);
  }
  else
  {
io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
io_hdr.flags |= SG_FLAG_MMAP_IO;
status = ioctl(sg_fd, SG_IO, &io_hdr);
  }
  ...
Mmaped IO is a moot point here since this problem is also observed
when using direct IO.

Thanks,
Fajun


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Process Scheduling Issue using sg/libata

2007-11-16 Thread Mark Lord

Fajun Chen wrote:

Hi All,

I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
and libata version 2.00 are loaded on ARM XScale board.  Under heavy
cpu load (e.g. when blocks per transfer/sector count is set to 1),
I've observed that the test application can suck cpu away for long
time (more than 20 seconds) and other processes including high
priority shell can not get the time slice to run.  What's interesting
is that if the application is under heavy IO load (e.g. when blocks
per transfer/sector count is set to 256),  the problem goes away. I
also tested with open source code sg_utils and got the same result, so
this is not a problem specific to my user-space application.

..

Post the relevant code here, and then we'll be able to better understand
and explain it to you.

For example, if the code is using ATA opcodes 0x20, 0x21, 0x24,
0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops),
then this behaviour does not surprise me in the least.  Fully expected
and difficult to avoid.

Cheers

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [linux-usb-devel] Problems with USB disk [solved]

2007-09-13 Thread Mark Lord

Alan Stern wrote:

On Thu, 13 Sep 2007, Mark Lord wrote:

..

What happens is, there's a nice little LED on the Cruzer stick,
that is "lit" when the stick itself is not in a "power suspend" state
(or whatever you USB folks call it).


We call it "suspended".  (Wasn't there an episode of Classic Trek where 
Mr. Spock explained to somebody, "I call them `ears'."?)



On 2.6.22, that little LED stays "on" normally, and flickers off/on
when data is being transfered.

The new "USB autosuspend" logic in 2.6.23 now causes that little LED
to turn off after a few seconds of inactivity.

Once that happens, the USB stick is not accessible until after a longish
timeout (~30s I think), followed by a USB reset.  Then it is usable again
until the next inactivity timeout and autosuspend (a few seconds).


So this _isn't_ the regression described above -- to wit, that the 
drive gets spun down and then won't work without first being spun back 
up.  You wrote:



My Sandisk Cruzer Micro 1GB USB sticks suffer from this regression.


But it doesn't.  It suffers from a different regression.


Perhaps, but that doesn't necessarily follow from the above.

From this observer's point of view (and being an expert on disk technologies),

the same "spin up" issue for rotating media could be the culprit here too.

A reset to a disk drive usually (not always) causes it to spin up.

Regardless, Greg has acknowledged the regression and is planning to revert 
things.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [linux-usb-devel] Problems with USB disk [solved]

2007-09-13 Thread Mark Lord

Alan Stern wrote:

On Wed, 12 Sep 2007, Mark Lord wrote:


Chuck Ebbert wrote:

..

Oh, nice. The usb-storage (SCSI) disk spins itself down and we can't handle 
that.
Should we be disabling auto-spindown when we connect the device, or be able to
handle this by sending the start command when needed?

There's more to this.

My Sandisk Cruzer Micro 1GB USB sticks suffer from this regression.


I seriously doubt that.  Are you claiming that your USB stick spins 
itself down during a suspend?  And then requires to be spun back up 
before it will resume proper operation?


No, the machine is not being suspended at all.

What happens is, there's a nice little LED on the Cruzer stick,
that is "lit" when the stick itself is not in a "power suspend" state
(or whatever you USB folks call it).

On 2.6.22, that little LED stays "on" normally, and flickers off/on
when data is being transfered.

The new "USB autosuspend" logic in 2.6.23 now causes that little LED
to turn off after a few seconds of inactivity.

Once that happens, the USB stick is not accessible until after a longish
timeout (~30s I think), followed by a USB reset.  Then it is usable again
until the next inactivity timeout and autosuspend (a few seconds).

Etc.

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problems with USB disk [solved]

2007-09-12 Thread Mark Lord

Chuck Ebbert wrote:

On 08/13/2007 10:50 AM, Niels wrote:

On Sunday 12 August 2007 11:54, Niels wrote:


On Friday 10 August 2007 14:43, Niels wrote:


On Wednesday 08 August 2007 12:57, Ismail Dönmez wrote:


On Wednesday 08 August 2007 13:48:29 you wrote:

On Tuesday 07 August 2007 23:18, Greg KH wrote:

On Tue, Aug 07, 2007 at 10:26:15PM +0200, Niels wrote:

Hi,

I'm having problems with a new 500 GB USB disk. It works, but
sometimes I get these in dmesg:


usb 1-3: reset high speed USB device using ehci_hcd and address 2
usb 5-1: USB disconnect, address 2
drivers/usb/class/usblp.c: usblp0: removed
sd 0:0:0:0: Device not ready: <6>: Sense Key : 0x2 [current]

: ASC=0x4 ASCQ=0x2

end_request: I/O error, dev sda, sector 254148215
sd 0:0:0:0: Device not ready: <6>: Sense Key : 0x2 [current]

: ASC=0x4 ASCQ=0x2

end_request: I/O error, dev sda, sector 252434023
EXT3-fs error (device sda1): ext3_find_entry: reading directory
#15761836 offset 0


There's also a printer connected. This is on a pci/usb2 card. When
the above happens, I get I/O errors. When I mount the drive next,
there are errors and often missing files. Quite annoying!

Kernel is 2.6.21

What's going on?

You have a low voltage issue, or a bad cable.  The device is
electronically disconnecting itself.  Try using a externally-powered
hub, or a new cable.

I am seeing a similar problem with 2.6.22 and 2.6.23-* kernels with my
60G iPod Video, works fine with 2.6.18 kernel though.


So far I'm seeing this:

- On 2.6.21 I mount the drive. After a while it spins down, and when I
then unmount it, an error pops up in dmesg.

- On 2.6.18 I can't provoke the same error. The drive doesn't appear to
spin down. I don't know if the data corruption from 2.6.21 occurs with
regular use.

There are a number of other factor I need to eliminate on my system, but
that's it so far. CONFIG_USB_SUSPEND is not set on either kernel.

OK, on a vanilla 2.6.18.8 I also have this problem, with both the pci/usb2
card, and the usb1 on the board. I listen to music from the drive, and
after some time (10-20 minutes or so), it freaks out:

=
sd 1:0:0:0: Device not ready: <6>: Current: sense key=0x2
ASC=0x4 ASCQ=0x2
end_request: I/O error, dev sda, sector 126693711
sd 1:0:0:0: Device not ready: <6>: Current: sense key=0x2
ASC=0x4 ASCQ=0x2
end_request: I/O error, dev sda, sector 126693711
sd 1:0:0:0: Device not ready: <6>: Current: sense key=0x2
ASC=0x4 ASCQ=0x2
end_request: I/O error, dev sda, sector 126693711
=


Using a new PSU and a powered hub made no difference. But I found a solution
here:

http://alienghic.livejournal.com/382903.html

Basically, the problem is, as suspected, that the drive spins down / goes to
suspend. This can be disabled with "sdparm --clear STANDBY -6 /dev/sda".

It seems to me to be an error that the kernel reports this as something like
a hardware failure. Or at least very misleading.



Oh, nice. The usb-storage (SCSI) disk spins itself down and we can't handle 
that.
Should we be disabling auto-spindown when we connect the device, or be able to
handle this by sending the start command when needed?


There's more to this.

My Sandisk Cruzer Micro 1GB USB sticks suffer from this regression.
Plug one in, it works for about 5 seconds, then the light goes off (bad).
Next access requires a 30s timeout + reset.  Etc..

This is with 2.6.23-rc6.
Works without any problems in 2.6.22.  REGRESSION.

-ml
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HDIO_DRIVE_CMD and hdparm

2007-05-10 Thread Mark Lord

Geert Uytterhoeven wrote:

Hi,

`hdparm -t' uses HDIO_DRIVE_CMD(null) to flush the disk's buffer.


More correctly, that command is supposed to act like an I/O queue "barrier"
operation, not returning from the syscall until everything queued in front
of it has been issued/completed.

I believe that only the original IDE driver actually implements it, though.
And hdparm-7.4 (not released yet) will no longer complain about ENOTTY.

Note that current versions of hdparm use SG_IO/ATA_16 (SAT) for nearly 
everything
now, only falling back to the older ioctl's for drivers which reject the SAT 
attempt.

I'd love to find a USB drive enclosure that supports SAT.
Anyone know of one?

And does the USB storage layer actually pass the ATA_16 packets to the device?

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HDIO_DRIVE_CMD and hdparm

2007-05-10 Thread Mark Lord

Alan Cox wrote:

  - SCSI doesn't handle HDIO_DRIVE_CMD(null), and returns EINVAL
=> fine for hdparm
  - If a block device doesn't support the ioctl, blkdev_driver_ioctl() returns
ENOTTY
=> hdparm error message


Those are both correct
-ENOTTY I don't know this ioctl
-EINVAL I know this ioctl, usage wrong


Exactly.
I'll have hdparm-7.4 not complain on ENOTTY as well.

-ml
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: libata /dev/scd0 problem: mount after burn fails without eject

2007-05-01 Thread Mark Lord

Forwarding to linux-scsi and linux-ide mailing lists.

Frank van Maarseveen wrote:

Tested on 2.6.20.6 and 2.6.21.1

I decided to swich from the old IDE drivers to libata and now there
seems to be a little but annoying problem: cannot mount an ISO image
after burning it.

May  1 14:32:55 kernel: attempt to access beyond end of device
May  1 14:32:55 kernel: sr0: rw=0, want=68, limit=4
May  1 14:32:55 kernel: isofs_fill_super: bread failed, dev=sr0, iso_blknum=16, 
block=16

an "eject" command seems to fix the state of the PATA DVD writer
or driver. The problem occurs for burning a CD and for DVD too with
identical error messages.


relevant kernel boot messages:
| ata1: PATA max UDMA/100 cmd 0x000101f0 ctl 0x000103f6 bmdma 0x0001ffa0 irq 14
| ata2: PATA max UDMA/100 cmd 0x00010170 ctl 0x00010376 bmdma 0x0001ffa8 irq 15
| scsi0 : ata_piix
| ata1.00: ATA-7: ST3120814A, 3.AAJ, max UDMA/100
| ata1.00: 234441648 sectors, multi 8: LBA48 
| ata1.00: configured for UDMA/100

| scsi1 : ata_piix
| ata2.00: ATAPI, max UDMA/33
| ata2.01: ATAPI, max UDMA/33
| ata2.00: configured for UDMA/33
| ata2.01: configured for UDMA/33
| scsi 0:0:0:0: Direct-Access ATA  ST3120814A   3.AA PQ: 0 ANSI: 5
| SCSI device sda: 234441648 512-byte hdwr sectors (120034 MB)
| sda: Write Protect is off
| SCSI device sda: write cache: enabled, read cache: enabled, doesn't support 
DPO or FUA
| SCSI device sda: 234441648 512-byte hdwr sectors (120034 MB)
| sda: Write Protect is off
| SCSI device sda: write cache: enabled, read cache: enabled, doesn't support 
DPO or FUA
|  sda: sda1 sda2 sda4
| sd 0:0:0:0: Attached scsi disk sda
| sd 0:0:0:0: Attached scsi generic sg0 type 0
| scsi 1:0:0:0: CD-ROMHP   DVD Writer 640c  CS30 PQ: 0 ANSI: 5
| sr0: scsi3-mmc drive: 40x/40x writer cd/rw xa/form2 cdda tray
| Uniform CD-ROM driver Revision: 3.20
| sr 1:0:0:0: Attached scsi generic sg1 type 5
| scsi 1:0:1:0: CD-ROMSAMSUNG  CD-ROM SC-148C   B105 PQ: 0 ANSI: 5
| sr1: scsi3-mmc drive: 1x/48x cd/rw xa/form2 cdda tray
| sr 1:0:1:0: Attached scsi generic sg2 type 5

stripped config (well, as far of I'm sure it shouldn't matter):
| CONFIG_PM=y
| CONFIG_PM_LEGACY=y
| 
| CONFIG_ACPI=y

| CONFIG_ACPI_PROCFS=y
| CONFIG_ACPI_FAN=y
| CONFIG_ACPI_PROCESSOR=y
| CONFIG_ACPI_THERMAL=y
| CONFIG_ACPI_BLACKLIST_YEAR=0
| CONFIG_ACPI_EC=y
| CONFIG_ACPI_POWER=y
| CONFIG_ACPI_SYSTEM=y
| CONFIG_X86_PM_TIMER=y
| 
| CONFIG_PCI=y

| CONFIG_PCI_GOANY=y
| CONFIG_PCI_BIOS=y
| CONFIG_PCI_DIRECT=y
| CONFIG_PCI_MMCONFIG=y
| CONFIG_PCIEPORTBUS=y
| CONFIG_PCIEAER=y
| CONFIG_HT_IRQ=y
| CONFIG_ISA_DMA_API=y
| 
| CONFIG_PNP=y
| 
| CONFIG_PNPACPI=y
| 
| CONFIG_CDROM_PKTCDVD=y

| CONFIG_CDROM_PKTCDVD_BUFFERS=8
| 
| CONFIG_IDE=y
| 
| CONFIG_RAID_ATTRS=y

| CONFIG_SCSI=y
| CONFIG_SCSI_PROC_FS=y
| 
| CONFIG_BLK_DEV_SD=y

| CONFIG_BLK_DEV_SR=y
| CONFIG_CHR_DEV_SG=y
| 
| CONFIG_SCSI_MULTI_LUN=y

| CONFIG_SCSI_CONSTANTS=y
| CONFIG_SCSI_LOGGING=y
| 
| CONFIG_SCSI_SPI_ATTRS=y
| 
| CONFIG_ATA=y

| CONFIG_SATA_AHCI=y
| CONFIG_ATA_PIIX=y
| CONFIG_SATA_INTEL_COMBINED=y
| CONFIG_SATA_ACPI=y
| CONFIG_PATA_SERVERWORKS=y


lspci:
00:00.0 Host bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE DRAM 
Controller/Host-Hub Interface (rev 01)
00:01.0 PCI bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE Host-to-AGP 
Bridge (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) 
USB UHCI Controller #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) 
USB UHCI Controller #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) 
USB UHCI Controller #3 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI 
Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 81)
00:1f.0 ISA bridge: Intel Corporation 82801DB/DBL (ICH4/ICH4-L) LPC Interface 
Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801DB (ICH4) IDE Controller (rev 01)
00:1f.3 SMBus: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) SMBus 
Controller (rev 01)
00:1f.5 Multimedia audio controller: Intel Corporation 82801DB/DBL/DBM 
(ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller (rev 01)
01:00.0 VGA compatible controller: nVidia Corporation NV18 [GeForce4 MX 440 AGP 
8x] (rev a2)
02:0c.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet 
Controller (rev 05)

lspci -v -v -v for IDE
00:1f.1 IDE interface: Intel Corporation 82801DB (ICH4) IDE Controller (rev 01) 
(prog-if 8a [Master SecP PriP])
Subsystem: Dell Unknown device 0142
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- 

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] bidi support: bidirectional request

2007-04-30 Thread Mark Lord

Jens Axboe wrote:


So basically just add a struct request pointer, so you can do rq =
rq->next_rq or something for the next data phase. I bet this would be a
LOT less invasive as well, and we can get by with a few helpers to
support it.


Hey, I want a way to issue those (linked requests) from userspace (SG_IO), too.
Specifically for use with the new SMART Command Transport (SCT) feature set
on modern SATA drives.  As well as for a disk recovery utility I'm working on.

Sounds generally useful, that.


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-02 Thread Mark Lord

Matt Mackall wrote:

..
Also worth considering is that spending minutes trying to reread
damaged sectors is likely to accelerate your death spiral. More data
may be recoverable if you give up quickly in a first pass, then go
back and manually retry damaged bits with smaller I/Os.


All good input.  But what was being debated here is not so much
the retrying of known-bad sectors, but rather what to do about
the kiBs or MiBs of sectors remaining in a merged request after
hitting a single bad sector mid-way.

Currently, SCSI just abandons the entire remaining workload.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-02 Thread Mark Lord

Alan wrote:


If this is the right strategy for disk recovery for a given type of
device then this ought to be an automatic strategy. Most end users will
not have the knowledge to frob about in sysfs, and if the bad sector hits
at the wrong moment a sensible automatic recovery strategy is going to do
the right thing faster than human intervention, which may be some hours
away.


I think something we seem to be discussing here are the opposite aims
of enterprise RAID (traditional SCSI market) versus desktop filesystems
(now the number one user of Linux SCSI, courtesy of libata).

With RAID, it's safe to suggest that a very fast, bounded exit from EH
is almost always what the customer / end-user wants, because the higher
level RAID management can then deal with the failed drive offline.

But for a desktop filesystem, failing out quickly and bulk-failing megabytes
around a couple of bad sectors is not good -- no redundancy, and this will
lead to unneeded data loss.

It's beginning to look like this needs to be run-time tuneable,
per block minor device (per-partition), through sysfs at a minimum.
The RAID tools could then automatically choose settings to bias more
towards an "instant exit" when errors are found, whereas a non-RAID
desktop could select a more reliable recovery strategy.

Right now, with Jame's patch (earlier in this thread), the whole scheme
is heavily weighted towards the RAID "instant exit" strategy, which is
making me quite nervous about the data on my laptop.

Cheers 
-

To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-01 Thread Mark Lord

James Bottomley wrote:

On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote:
..

One thing that could be even better than the patch below,
would be to have it perhaps skip the entire bio that includes
the failed sector, rather than only the bad sector itself.


Er ... define "skip over the bio".  A bio is simply a block
representation for a bunch of sg elements coming in to the elevator.


Exactly.  Or rather, a block of sg_elements from a single point
of request, is it not?


Mostly what we see in SCSI is a single bio per request, so skipping the
bio is really the current behaviour (to fail the rest of the request).


Very good.  That's what it's supposed to do.

But if each request contained only a single bio, then all of Jens'
work on IO scheduling would be for nothing, n'est-ce pas?

In the case where a request consists of multiple bio's
which have been merged under a single request struct,
we really should give at least one attempt to each bio.

This way, in most cases, only the process that requested the
failed sector(s) will see an error, not the innocent victims
that happened to get merged onto the end.  Which could be very
critical stuff (or not -- it could be quite random).

So the time factor works out to one disk I/O timeout per failed bio.
That's what would have happened with the NOP scheduler anyway.

On the sytems I'm working with, I don't see huge numbers of bad sectors.
What they tend to show is just one or two bad sectors, widely scattered.

So:

I think doing that might address most concerns expressed here.
Have you got an alternate suggestion, James?


Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-02-01 Thread Mark Lord

James Bottomley wrote:

On Tue, 2007-01-30 at 19:47 -0500, Mark Lord wrote:

Kernels since about 2.6.16 or so have been broken in this regard.
They "complete" the good sectors before the error,
and then fail the entire remaining portions of the request.


What was the commit that introduced the change? ... I have a vague
memory of it being deliberate.


I believe you made the first change in response to my prodding at the time,
when libata was not returning valid sense data (no LBA) for media errors.
The SCSI EH handling of that was rather poor at the time,
and so having it not retry the remaining sectors was actually
a very good fix at the time.

But now, libata *does* return valid sense data for LBA/DMA drives,
and the workaround from circa 2.6.16 is no longer the best we can do.
Now that we know which sector failed, we ought to be able to skip
over it, and continue with the rest of the merged request.

One thing that could be even better than the patch below,
would be to have it perhaps skip the entire bio that includes
the failed sector, rather than only the bad sector itself.

I think doing that might address most concerns expressed here.
Have you got an alternate suggestion, James?

..

Signed-off-by:  Mark Lord <[EMAIL PROTECTED]>
---
diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff 
old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c
--- old/drivers/scsi/scsi_lib.c 2007-01-30 13:58:05.0 -0500
+++ linux/drivers/scsi/scsi_lib.c   2007-01-30 18:30:01.0 -0500
@@ -865,6 +865,12 @@
 */
if (sense_valid && !sense_deferred) {
switch (sshdr.sense_key) {
+   case MEDIUM_ERROR:
+   // Bad sector.  Fail it, and then continue the rest of 
the request:
+   if (scsi_end_request(cmd, 0, cmd->device->sector_size, 
1) == NULL) {


The sense key may have come with additional information  I think we want
to parse that (if it exists) rather than just blindly failing the first
sector of the request.


+   cmd->retries = 0;// go around again..
+   return;
+   }


This would drop through to the UNIT_ATTENTION case if scsi_end_request()
fails ... I don't think that's correct.


case UNIT_ATTENTION:
if (cmd->device->removable) {
/* Detected disc change.  Set a bit


-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] RESEND: SCSI, libata: add support for ATA_16 commands to libata ATAPI devices

2007-01-31 Thread Mark Lord

Tejun Heo wrote:

Anyways, this has never been guaranteed because
the limit is host wide.


But until very very recently, "host wide" meant just a single device
for libata.  I was just assuming we did all of the fiddling to ensure
a minimal value there for some real reason.

But, yes, now we have PATA (2 drives per host), and PMP (many more
drives per host), so just maxing out the limit seems sensible.


So, I'm for setting it to 16.  Jeff, what do you think?


This patch sets libata to always accept CDB lengths up to 16 bytes.
With this change, ATA_16 passthrough commands can now be passed into
libata for ATAPI devices.  There is a separate related patch already
pending to actually implement ATA_16 for ATAPI inside libata.

Signed-off-by:  Mark Lord <[EMAIL PROTECTED]>
---
--- old/drivers/ata/libata-core.c   2007-01-31 19:55:53.0 -0500
+++ linux/drivers/ata/libata-core.c 2007-01-31 19:57:15.0 -0500
@@ -1577,20 +1577,6 @@
snprintf(desc, desc_sz, "NCQ (depth %d/%d)", hdepth, ddepth);
}

-static void ata_set_port_max_cmd_len(struct ata_port *ap)
-{
-   int i;
-
-   if (ap->scsi_host) {
-   unsigned int len = 0;
-
-   for (i = 0; i < ATA_MAX_DEVICES; i++)
-   len = max(len, ap->device[i].cdb_len);
-
-   ap->scsi_host->max_cmd_len = len;
-   }
-}
-
/**
 *  ata_dev_configure - Configure the specified ATA/ATAPI device
 *  @dev: Target device to configure
@@ -1771,8 +1757,6 @@
}
}

-   ata_set_port_max_cmd_len(ap);
-
/* limit bridge transfers to udma5, 200 sectors */
if (ata_dev_knobble(dev)) {
if (ata_msg_drv(ap) && print_info)
@@ -5689,7 +5673,7 @@
shost->max_id = 16;
shost->max_lun = 1;
shost->max_channel = 1;
-   shost->max_cmd_len = 12;
+   shost->max_cmd_len = 16;
}

/**
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] RESEND: SCSI, libata: add support for ATA_16 commands to libata ATAPI devices

2007-01-31 Thread Mark Lord

James Bottomley wrote:

..
In SCSI, we're very careful only to use large commands where necessary,
so even for a >2TB array, you only see 16 byte commands for sectors
beyond 2TB.  This essentially means that the capacity (and we do a
careful READ_CAPACITY(10) and only move on to READ_CAPACITY(16) if we're
told to) limits the command size.


Very cool, James.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] RESEND: SCSI, libata: add support for ATA_16 commands to libata ATAPI devices

2007-01-31 Thread Mark Lord

Tejun Heo wrote:

Mark Lord wrote:
..

So, to achieve ATA passthru capability for libata ATAPI,
we have to instead use the ATA_16 opcode: a 16-byte command.

SCSI normally disallows issuing 16-byte commands to 12-byte devices,
so special support has to be added for this.



I might have missed the discussion but can't we just set
host->max_cmd_len to 16 unconditionally?


Sure thing, if you and Jeff are happy with that, then lets do it.

I just kind of assumed that the complexity in ata_set_port_max_cmd_len()
was there for some kind of reason.

For example, I think all existing ATAPI drives only speak 12-byte packet
protocols, and so if we tell SCSI we're good for 16-byte, then won't the
SCSI layer suddenly start sending us READ_16 and the like?

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Mark Lord

James Bottomley wrote:

On Wed, 2007-01-31 at 12:57 -0500, Mark Lord wrote:

Alan wrote:

When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
as the drive itself has already done internal retries (libata uses the
"with retry" ATA opcodes for this).

This depends on the firmware. Some of the "raid firmware" drives don't
appear to do retries in firmware.

One way to tell if this is true, is simply to time how long
the failed operation takes.  If the drive truly does not do retries,
then the media error should be reported more or less instantly
(assuming drive was already spun up).


Well, the simpler way (and one we have a hope of implementing) is to
examine the ASC/ASCQ codes to see if the error is genuinely unretryable.


My suggestion above was not for a kernel fix,
but rather just as a way of determining if drives
which claim "no retries" actually do them or not.  :)


I seem to have dropped the ball on this one in that the scsi_error.c
pieces of this patch

http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885

I thought I'd applied.  Apparently I didn't, so I'll go back and put
them in.


Good.  That would be a useful supplement to the patch I posted here.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Mark Lord

Alan wrote:

When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
as the drive itself has already done internal retries (libata uses the
"with retry" ATA opcodes for this).


This depends on the firmware. Some of the "raid firmware" drives don't
appear to do retries in firmware.


One way to tell if this is true, is simply to time how long
the failed operation takes.  If the drive truly does not do retries,
then the media error should be reported more or less instantly
(assuming drive was already spun up).

If the failure takes more than a few hundred milliseconds to be reported,
or in this case 4-7 seconds typically, then we know the drive was doing
retries before it reported back.

I haven't seen any drive fail instantly yet.
Can anyone with those newfangled "RAID edition" drives try it
and report back?  Oh.. you'll need a way to create a bad sector.
I've got patches and a command-line utility for the job.

If your drive supports "WRITE UNCORRECTABLE" ("hdparm -I", w/latest hdparm),
then the patches aren't needed.


But meanwhile, we still have the original issue too, where a single stray
bad sector can blow a system out of the water, because the mid-layer
currently aborts everything after it from a large merged request.

Thus the original patch from this thread.  :)


Agreed

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Mark Lord

Douglas Gilbert wrote:


Ric,
Both ATA (ATA8-ACS) and SCSI (SBC-3) have recently added
command support to flag a block as "uncorrectable". There
is no need to send bad "long" data to it and suppress the
disk's automatic re-allocation logic.


That'll be useful in a couple of years, once drives that have it
become more common.  For now, though, we're hacking current drives
using READ/WRITE LONG commands, with a corresponding patch to libata
to allow for the longer "sector size" involved.

Having real bad sectors, exactly where we want them on the media,
sure does make testing / fixing the EH mechanisms a lot more feasible.

Cheers

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Mark Lord

Mark Lord wrote:

James Bottomley wrote:


For the MD case, this is what REQ_FAILFAST is for.


I cannot find where SCSI honours that flag.  James?


Scratch that thought.. SCSI honours it in scsi_end_request().

But I'm not certain that the block layer handles it correctly,
at least not in the 2.6.16/2.6.18 kernel that I'm working with today.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Mark Lord

James Bottomley wrote:


For the MD case, this is what REQ_FAILFAST is for.


I cannot find where SCSI honours that flag.  James?

And for that matter, even when I patch SCSI so that it *does* honour it,
I don't actually see the flag making it into the SCSI layer from above.

And I don't see where/how the block layer takes care when considering
merge FAILFAST/READA requests with non FAILFAST/READA requests.
To me, it looks perfectly happy to add non-FAILFAST/READA bios
to a FAILFAST request, risking data loss if a lower-layer decides
to honour the FAILFAST/READA flags.

So it's a pretty Good Thing(tm) that SCSI doesn't currently honour it. ;)


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-31 Thread Mark Lord

Ric Wheeler wrote:

Mark Lord wrote:

Eric D. Mudama wrote:
Actually, it's possibly worse, since each failure in libata will 
generate 3-4 retries. 


(note: libata does *not* generate retries for medium errors;
the looping is driven by the SCSI mid-layer code).


It really beats the alternative of a forced reboot
due to, say, superblock I/O failing because it happened
to get merged with an unrelated I/O which then failed..
Etc..

Definitely an improvement.

The number of retries is an entirely separate issue.
If we really care about it, then we should fix SD_MAX_RETRIES.

The current value of 5 is *way* too high.  It should be zero or one.

..
I think that drives retry enough, we should leave retry at zero for 
normal (non-removable) drives. Should this  be a policy we can set like 
we do with NCQ queue depth via /sys ?


Or perhaps we could have the mid-layer always "early-exit"
without retries for "MEDIUM_ERROR", and still do retries for the rest.

When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
as the drive itself has already done internal retries (libata uses the
"with retry" ATA opcodes for this).

But meanwhile, we still have the original issue too, where a single stray
bad sector can blow a system out of the water, because the mid-layer
currently aborts everything after it from a large merged request.

Thus the original patch from this thread.  :)

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-30 Thread Mark Lord

James Bottomley wrote:

First off, please send SCSI patches to the SCSI list:



Fixed already, thanks!


This patch fixes the behaviour to be similar to what we had originally.

When a bad sector is encounted, SCSI will now work around it again,
failing *only* the bad sector itself.


Erm, but the corollary is that if we get a large read failure because of
a bad track, you're going to try and chunk up it a sector at a time


That's better than the huge data-loss scenario that we currently
have for single-sector errors.  MUCH better.


forcing an individual error for each sector is going to annoy some
people ... particularly removable medium ones which return this error if
the medium isn't present ... Are you sure this is really what we want to
do?


No, for removed-medium everything just fails right away.
This patch is *only* for media errors, not any other failures.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

2007-01-30 Thread Mark Lord

Eric D. Mudama wrote:


Actually, it's possibly worse, since each failure in libata will 
generate 3-4 retries.  With existing ATA error recovery in the drives, 
that's about 3 seconds per retry on average, or 12 seconds per failure.  
Multiply that by the number of blocks past the error to complete the 
request..


It really beats the alternative of a forced reboot
due to, say, superblock I/O failing because it happened
to get merged with an unrelated I/O which then failed..
Etc..

Definitely an improvement.

The number of retries is an entirely separate issue.
If we really care about it, then we should fix SD_MAX_RETRIES.

The current value of 5 is *way* too high.  It should be zero or one.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] RESEND scsi_lib.c: continue after MEDIUM_ERROR

2007-01-30 Thread Mark Lord
Fixed for 80-columns, and copying linux-scsi this time.

In ancient kernels, the SCSI disk code used to continue after
encountering a MEDIUM_ERROR.  It would "complete" the good
sectors before the error, fail the bad sector/block, and then
continue with the rest of the request.

Kernels since about 2.6.16 or so have been broken in this regard.
They "complete" the good sectors before the error,
and then fail the entire remaining portions of the request.

This is very risky behaviour, as a request is often a merge
of several bios, and just because one application hits a bad sector
is no reason to pretend that (for example) an adjacent directly lookup also 
failed.

This patch fixes the behaviour to be similar to what we had originally.

When a bad sector is encounted, SCSI will now work around it again,
failing *only* the bad sector itself.

Signed-off-by:  Mark Lord <[EMAIL PROTECTED]>
---
--- old/drivers/scsi/scsi_lib.c 2007-01-30 20:06:15.0 -0500
+++ linux/drivers/scsi/scsi_lib.c   2007-01-30 20:06:59.0 -0500
@@ -865,6 +865,13 @@
 */
if (sense_valid && !sense_deferred) {
switch (sshdr.sense_key) {
+   case MEDIUM_ERROR:
+   /* Bad sector. Fail it, and continue on with the rest */
+   if (scsi_end_request(cmd, 0,
+   cmd->device->sector_size, 1) == NULL) {
+   cmd->retries = 0;   /* go around again.. */
+   return;
+   }
case UNIT_ATTENTION:
if (cmd->device->removable) {
/* Detected disc change.  Set a bit
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices

2007-01-04 Thread Mark Lord

Jens Axboe wrote:

On Wed, Jan 03 2007, James Bottomley wrote:

Er, well, as you know, I've never been a fan of this static list.  I
thought Jens was going to put us all out of our misery by making the
list settable per device by root and thus shovel the problem off onto
the distros?


The code is there, just haven't had the guts to merge it yet.

http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=bf5f922d167a5c5cf57132bbcaa1e0ddfd5c45f7



That's nice, but doesn't help with the case of trying to do ATA passthru
to ATAPI devices, the subject of the original patch here.

Cheers!
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices

2007-01-03 Thread Mark Lord

James Bottomley wrote:


As I read the code Mark sent me, current libata will only accept ATA_16
for ATAPI devices (whether type 5 or not) ... pehaps this needs to be
altered?


It could be.

Both ATA_12 and ATA_16 are just ways of getting a taskfile sent
down to the LLD, and ATA_12 is merely a restricted subset of ATA_16
so it is not really necessary here.

I think that given the opcode conflict with "BLANK", it may be best for
the moment to just not bother with ATA_12 for libata ATAPI.  We don't really
have a need for ATA_12 anyway, as ATA_16 is more flexible.

And any application code that uses these isn't going to want to have
to vary its commands between ATA_12 and ATA_16 depending upon the target.
I think they'll all just go straight for ATA_16 for uniformity and
to minimize testing drudgery.

hdparm will certainly use only ATA_16 once it becomes available in SCSI/libata.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices

2007-01-03 Thread Mark Lord

James Bottomley wrote:


Er I've only seen one patch ... I did ask if there was another to add
the ATA_16 interpretation ... did that get lost by the SCSI reflector?


Ahh.. my apologies on that, James.

The two patches apply and build independently,
and the second patch was only for libata, so it was
sent only to the linux-ide list.

There's actually a third patch as well, to fix a one-line
bug in sr.c that prevents *any* SG_IO from ever working
on ATAPI as well.

I'll email you the full set.

Cheers!

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices

2007-01-02 Thread Mark Lord

James Bottomley wrote:

..
I don't think I quite understand what you're trying to do here.  My
understanding is that ATA_12 and ATA_16 are part of the SAT layer. i.e.
they're used when we're speaking SCSI to an underlying ATA device to
send taskfiles.  However, ATAPI devices don't use SAT ... every SCSI
command you send to an ATAPI device goes out as an ATA PACKET command
without being translated.


ATAPI devices also implement quite a few ATA (non-packet) commands.
This patch gives us a way to issue them in response to SG_IO ATA passthru
and the existing (non working) libata HDIO_DRIVE_CMD and HDIO_DRIVE_TASK
ioctl calls.

Without this patch, SCSI blocks the ATA_16 passthru attempts,
because it thinks the CDB is too large for the 12-byte packet protocol
that ATAPI devices use.  This limit (12 bytes) is totally non-applicable
to ATA protocol commands.


Is there a missing piece to this patch, where you scan the incoming
commands to ATAPI devices and actually do translation for ATA_16?


Look at the existing libata-core.c ioctl's for HDIO_DRIVE_CMD
and HDIO_DRIVE_TASK, used by hdparm and others.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices

2007-01-02 Thread Mark Lord
In an ideal world, we would use the existing ATA_12 opcode
to issue 12-byte ATA passthrough commands for libata ATAPI drives
from userspace.

But ATA_12 happens to have the same SCSI opcode value as the older
CD/RW "BLANK" command, widely used by cdrecord and friends.

So, to achieve ATA passthru capability for libata ATAPI,
we have to instead use the ATA_16 opcode: a 16-byte command.

SCSI normally disallows issuing 16-byte commands to 12-byte devices,
so special support has to be added for this.

Introduce an "allow_ata_16" boolean to the scsi_host struct.
This provides a means for libata to signal that 16-byte ATA_16
commands should be permitted even for 12-byte ATAPI devices.

Signed-off-by:  Mark Lord <[EMAIL PROTECTED]>

--- old/include/scsi/scsi_host.h2007-01-02 19:06:45.0 -0500
+++ new/include/scsi/scsi_host.h2007-01-02 19:09:06.0 -0500
@@ -608,6 +608,13 @@
unsigned async_scan:1;
 
/*
+* True for libata, to allow issuing ATA_16 16-byte CDBs
+* to (otherwise) 12-byte ATAPI drives.  ATAPI cannot use
+* the ATA_12 opcode, because it means "BLANK" to CDRW drives.
+*/
+   unsigned allow_ata_16:1;
+
+   /*
 * Optional work queue to be utilized by the transport
 */
char work_q_name[KOBJ_NAME_LEN];
--- old/drivers/scsi/scsi.c 2007-01-02 19:06:45.0 -0500
+++ new/drivers/scsi/scsi.c 2007-01-02 19:09:06.0 -0500
@@ -567,12 +567,17 @@
 * length exceeds what the host adapter can handle.
 */
if (CDB_SIZE(cmd) > cmd->device->host->max_cmd_len) {
-   SCSI_LOG_MLQUEUE(3,
+   /* permit ATA_16 passthru to 12-byte ATAPI devices */
+   if (cmd->cmnd[0]  != ATA_16
+|| CDB_SIZE(cmd) != 16
+|| !cmd->device->host->allow_ata_16) {
+   SCSI_LOG_MLQUEUE(3,
printk("queuecommand : command too long.\n"));
-   cmd->result = (DID_ABORT << 16);
+   cmd->result = (DID_ABORT << 16);
 
-   scsi_done(cmd);
-   goto out;
+   scsi_done(cmd);
+   goto out;
+   }
}
 
spin_lock_irqsave(host->host_lock, flags);
--- old/drivers/ata/libata-core.c   2007-01-02 19:06:44.0 -0500
+++ new/drivers/ata/libata-core.c   2007-01-02 19:09:06.0 -0500
@@ -5686,6 +5686,7 @@
shost->max_lun = 1;
shost->max_channel = 1;
shost->max_cmd_len = 12;
+   shost->allow_ata_16 = 1;
 }
 
 /**
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2.6.20-rc3] fix broken retval test in sr_block_ioctl

2007-01-02 Thread Mark Lord
With recent kernels (and possibly much older ones, too),
ioctl() calls for ATAPI devices never make it to the driver layers.

The reason for this is a borked return code test in drivers/scsi/sr.c.
This patch fixes it.

Signed-off-by:  Mark Lord <[EMAIL PROTECTED]>

---
--- old/drivers/scsi/sr.c   2006-11-29 16:57:37.0 -0500
+++ linux/drivers/scsi/sr.c 2007-01-02 16:40:33.0 -0500
@@ -468,7 +468,7 @@
}
 
ret = cdrom_ioctl(file, &cd->cdi, inode, cmd, arg);
-   if (ret != ENOSYS)
+   if (ret != -ENOSYS)
return ret;
 
/*
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2.6.20-rc3] fix broken retval test in sr_block_ioctl

2007-01-02 Thread Mark Lord

Jeff Garzik wrote:

Mark Lord wrote:

...

 ret = cdrom_ioctl(file, &cd->cdi, inode, cmd, arg);
-if (ret != ENOSYS)
+if (ret != -ENOSYS)


I think Tejun posted the same patch earlier today.


Ahh.. missed it -- could have saved me an hour or two of debugging!

Cheers!

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: libata: clustering on or off?

2005-08-29 Thread Mark Lord

Clustering has always been a (small) win for the ATA/SATA hardware
I have worked on.  Many controllers read the scatter-gather list
"on demand", so if clustering is not used, there are a lot of
times where the controller must stop streaming data, and fetch
the next s/g entry, then resume streaming data.

Clustering reduces the number of such events, giving better bus
utilization and (slightly) faster transfers.

The more modern hardware I'm working with now will generally
read the s/g table in larger blocks, so that it only rarely
needs to interrupt data transfers to fetch s/g info.
Clustering is not as big a win with this, but still reduces
total bus cycles required.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2.6.11-rc4+] sata_qstor: new basic driver for Pacific Digital QStor

2005-02-16 Thread Mark Lord
This is a new basic libata SATA driver
for the Pacific Digital QStor SATA/RAID card.
It features ordinary per-drive SATA with single-request DMA
for R/W commands, and PIO for everything else.
It currently does not implement any of the hardware RAID support,
nor hot-plug, nor the tagged/native command-queuing features.
On the other hand, it is small and simple as a result.
Patch is against 2.6.11-rc4 + minor libata updates from Jeff.
Signed-off-by: Mark Lord <[EMAIL PROTECTED]>
diff -uprN linux-2.6.11-rc4jeff/drivers/scsi/Kconfig linux/drivers/scsi/Kconfig
--- linux-2.6.11-rc4jeff/drivers/scsi/Kconfig	2005-02-14 10:27:44.0 -0500
+++ linux/drivers/scsi/Kconfig	2005-02-16 20:29:23.0 -0500
@@ -457,6 +457,14 @@ config SCSI_SATA_PROMISE
 
 	  If unsure, say N.
 
+config SCSI_SATA_QSTOR
+	tristate "Pacific Digital SATA QStor support"
+	depends on SCSI_SATA && PCI
+	help
+	  This option enables support for Pacific Digital Serial ATA QStor.
+
+	  If unsure, say N.
+
 config SCSI_SATA_SX4
 	tristate "Promise SATA SX4 support"
 	depends on SCSI_SATA && PCI && EXPERIMENTAL
diff -uprN linux-2.6.11-rc4jeff/drivers/scsi/Makefile linux/drivers/scsi/Makefile
--- linux-2.6.11-rc4jeff/drivers/scsi/Makefile	2005-02-14 10:27:44.0 -0500
+++ linux/drivers/scsi/Makefile	2005-02-16 20:29:23.0 -0500
@@ -125,6 +125,7 @@ obj-$(CONFIG_SCSI_SATA_AHCI)	+= libata.o
 obj-$(CONFIG_SCSI_SATA_SVW)	+= libata.o sata_svw.o
 obj-$(CONFIG_SCSI_ATA_PIIX)	+= libata.o ata_piix.o
 obj-$(CONFIG_SCSI_SATA_PROMISE)	+= libata.o sata_promise.o
+obj-$(CONFIG_SCSI_SATA_QSTOR)	+= libata.o sata_qstor.o
 obj-$(CONFIG_SCSI_SATA_SIL)	+= libata.o sata_sil.o
 obj-$(CONFIG_SCSI_SATA_VIA)	+= libata.o sata_via.o
 obj-$(CONFIG_SCSI_SATA_VITESSE)	+= libata.o sata_vsc.o
diff -uprN linux-2.6.11-rc4jeff/drivers/scsi/sata_qstor.c linux/drivers/scsi/sata_qstor.c
--- linux-2.6.11-rc4jeff/drivers/scsi/sata_qstor.c	1969-12-31 19:00:00.0 -0500
+++ linux/drivers/scsi/sata_qstor.c	2005-02-16 20:25:01.0 -0500
@@ -0,0 +1,700 @@
+/*
+ *  sata_qstor.c - Pacific Digital Corporation QStor SATA
+ *
+ *  Maintained by:  Mark Lord <[EMAIL PROTECTED]>
+ *
+ *  Copyright 2005 Pacific Digital Corporation.
+ *  (OSL/GPL code release authorized by Jalil Fadavi).
+ *
+ *  The contents of this file are subject to the Open
+ *  Software License version 1.1 that can be found at
+ *  http://www.opensource.org/licenses/osl-1.1.txt and is included herein
+ *  by reference.
+ *
+ *  Alternatively, the contents of this file may be used under the terms
+ *  of the GNU General Public License version 2 (the "GPL") as distributed
+ *  in the kernel source COPYING file, in which case the provisions of
+ *  the GPL are applicable instead of the above.  If you wish to allow
+ *  the use of your version of this file only under the terms of the
+ *  GPL and not to allow others to use your version of this file under
+ *  the OSL, indicate your decision by deleting the provisions above and
+ *  replace them with the notice and other provisions required by the GPL.
+ *  If you do not delete the provisions above, a recipient may use your
+ *  version of this file under either the OSL or the GPL.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "scsi.h"
+#include 
+#include 
+#include 
+
+#define DRV_NAME	"sata_qstor"
+#define DRV_VERSION	"0.03"
+
+enum {
+	QS_PORTS		= 4,
+	QS_MAX_PRD		= LIBATA_MAX_PRD,
+	QS_CPB_ORDER		= 6,
+	QS_CPB_BYTES		= (1 << QS_CPB_ORDER),
+	QS_PRD_BYTES		= QS_MAX_PRD * 16,
+	QS_PKT_BYTES		= QS_CPB_BYTES + QS_PRD_BYTES,
+
+	QS_DMA_BOUNDARY		= ~0UL,
+
+	/* global register offsets */
+	QS_HCF_CNFG3		= 0x0003, /* host configuration offset */
+	QS_HID_HPHY		= 0x0004, /* host physical interface info */
+	QS_HCT_CTRL		= 0x00e4, /* global interrupt mask offset */
+	QS_HST_SFF		= 0x0100, /* host status fifo offset */
+	QS_HVS_SERD3		= 0x0393, /* PHY enable offset */
+
+	/* global control bits */
+	QS_HPHY_64BIT		= (1 << 1), /* 64-bit bus detected */
+	QS_CNFG3_GSRST		= 0x01, /* global chip reset */
+	QS_SERD3_PHY_ENA	= 0xf0, /* PHY detection ENAble*/
+
+	/* per-channel register offsets */
+	QS_CCF_CPBA		= 0x0710, /* chan CPB base address */
+	QS_CCF_CSEP		= 0x0718, /* chan CPB separation factor */
+	QS_CFC_HUFT		= 0x0800, /* host upstream fifo threshold */
+	QS_CFC_HDFT		= 0x0804, /* host downstream fifo threshold */
+	QS_CFC_DUFT		= 0x0808, /* dev upstream fifo threshold */
+	QS_CFC_DDFT		= 0x080c, /* dev downstream fifo threshold */
+	QS_CCT_CTR0		= 0x0900, /* chan control-0 offset */
+	QS_CCT_CTR1		= 0x0901, /* chan control-1 offset */
+	QS_CCT_CFF		= 0x0a00, /* chan command fifo offset */
+
+	/* channel control bits */
+	QS_CTR0_REG		= (1 << 1),   /* register mode (vs. pkt mode) */
+	QS_CTR0_CLER		= (1 << 2),   /* clear channel errors */
+	QS_CTR1_RDEV		=