Re: [RFC] libata-scsi: introducing SANITIZE translation
On 16-07-11 02:35 AM, Tom Yan wrote: I don't suppose there would be any problem doing it in userspace / with ATA PASS-THROUGH anyway. .. On 8 July 2016 at 17:29, James Bottomley wrote: .. Not really. The point is that you've proposed something as an addition to the kernel that can also be done in userspace. Checking if it can work easily there is like a barrier to entry. If it works, then fine, we're done. If it throws up problems then we reconsider the kernel route. hdparm has full support for the SANITIZE commands in userspace. -ml -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On 14-01-03 10:40 AM, walt wrote: > On 01/02/2014 11:15 AM, Sarah Sharp wrote: >> On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote: >>> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote: >>>> 3.12-stable review patch. If anyone has any objections, please let me >>>> know. >>>> >>>> -- >>>> >>>> From: David Laight >>>> >>>> commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream. >>>> >>>> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link >>>> TRB >>>> can only occur at a boundary between underlying USB frames (512 bytes for >>>> high speed devices). >>>> >>>> If this isn't done the USB frames aren't formatted correctly and, for >>>> example, >>>> the USB3 ethernet ax88179_178a card will stop sending... >>> >>> >>> Unfortunately this patch causes a regression when copying large files to my >>> outboard USB3 drive. (Nothing at all to do with networking.) > >> Do you have CONFIG_USB_DEBUG turned on for 3.13? If so, you should see >> dmesg output from this statement shortly before your drive fails: >> >> if (num_trbs >= TRBS_PER_SEGMENT) { >> xhci_err(xhci, "Too many fragments %d, max %d\n", >> num_trbs, TRBS_PER_SEGMENT - 1); >> return -ENOMEM; >> } > > Well, the answers depend on whether the usb3 drive uses logical volumes or not > (lvm2), which I can't explain. What I've described so far is with lvm2. > > When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect -- the > console prints two or three lines stating that the ext4 journal has quit and > the drive is remounted ro. That particular drive stays wedged until the next > reboot, but no other ill effects to the system. > > OTOH, when I put a disk with just an ordinary ext4 partition in the usb3 dock, > (no logical volumes) the copy failure becomes catastrophic, with kernel panic > messages, leaving the system unresponsive and needing a hard reset to recover. > > I also tried your other suggestion: > > diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c > index 4265b48..1a6a43d 100644 > --- a/drivers/usb/host/xhci.c > +++ b/drivers/usb/host/xhci.c > @@ -4714,7 +4714,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, > xhci_get_quirks_t get_quirks) > int retval; > > /* Accept arbitrarily long scatter-gather lists */ > - hcd->self.sg_tablesize = ~0; > + hcd->self.sg_tablesize = 31; > > /* support to build packet from discontinuous buffers */ > hcd->self.no_sg_constraint = 1; > > Sadly it didn't fix the problem. Did I get the patch right? That sounds almost as if the old version is still being loaded/run, possibly from the initramfs image? -- Mark Lord Real-Time Remedies Inc. ml...@pobox.com -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst
On 14-01-02 02:15 PM, Sarah Sharp wrote: > On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote: .. >> Unfortunately this patch causes a regression when copying large files to my >> outboard USB3 drive. (Nothing at all to do with networking.) >> >> When I try to copy a large (20GB) file to the USB3 drive, the copy dies after >> about 7GB, the ext4 journal aborts and the drive is remounted read-only. >> >> This bug is 100% reproducible (always pretty close to 7GB) and reverting this >> patch completely fixes the problem. > > Ok, I had feared that would be a consequence of this patch. I think the > problem is that the usb-storage driver submitted an URB with more > scatter-gather entries than would fit on the ring segment, the xHCI > driver rejected the URB with -ENOMEM, and the SCSI core eventually gave > up on the SCSI command. Is there not a block layer / scheduler tunable for max sg entries or something? -- Mark Lord Real-Time Remedies Inc. ml...@pobox.com -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] sd: don't bother spinning up disks on resume
On 13-11-18 10:54 AM, Phillip Susi wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > On 11/17/2013 8:06 PM, Phillip Susi wrote: >> I don't see anything particularly inefficient about issuing the >> command in the eh path ( in fact, libata always uses the eh path to >> handle power management ). You can of course, use the >> manage_start_stop flag to have the disk start on resume if you >> really want. > > Yikes, I think I see now why scsi eh is so bad: the entire host, not > just the device is quiesced. > > I suppose now the question is how to issue the START STOP UNIT from > sd_prep_fn()? I gather you can't just call scsi_execute_req() from > there? Is there maybe a way to insert a START STOP UNIT request into > the queue ahead of the current request being prepped? Can > sd_prep_fn() be called concurrently on multiple cpus? Hrm.. what > about TCQ? Would it be bad if the original request were queued ( to > the drive ) before the START STOP UNIT command completed? Yeah, solve that, and you're golden. I totally understand the motivation for this patch, and would love to have it working on my machines here too. But it may have to be some kind of sysfs optional setting, defaulting to current behaviour, to keep the server keepers happy. Cheers -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] sd: don't bother spinning up disks on resume
On 13-11-07 01:21 PM, Douglas Gilbert wrote: > On 13-11-06 08:57 PM, Phillip Susi wrote: >> Don't bother forcing disks to spin up on resume, as they >> will do so automatically when accessed, and forcing them >> to spin up slows down the resume. Add a second bit to the >> manage_start_stop flag to restore the previous behavior. > > SCSI disks when in STOP state do not spin up "automatically > when accessed". .. Ditto for many SATA disks with "power up in standby" enabled. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On 13-10-11 04:41 AM, Alexander Gordeev wrote: > On Thu, Oct 10, 2013 at 07:17:18PM -0400, Mark Lord wrote: >> Just to help us all understand "the loop" issue.. >> >> Here's an example of driver code which uses the existing MSI-X interfaces, >> for a device which can work with either 16, 8, 4, 2, or 1 MSI-X interrupt. >> This is from a new driver I'm working on right now: .. > Now, this is a loop-free alternative: > > static int xx_alloc_msix_irqs(struct xx_dev *dev, int nvec) > { > nvec = roundup_pow_of_two(nvec);/* assume 0 > nvec <= 16 */ > > xx_disable_all_irqs(dev); > > pci_lock_msi(dev->pdev); > > rc = pci_get_msix_limit(dev->pdev, nvec); > if (rc < 0) > goto err; > > nvec = min(nvec, rc); /* if limit is more than requested */ > nvec = rounddown_pow_of_two(nvec); /* (a) */ > > xx_prep_for_msix_vectors(dev, nvec); > > rc = pci_enable_msix(dev->pdev, dev->irqs, nvec); /* (b) */ > if (rc < 0) > goto err; > > pci_unlock_msi(dev->pdev); > > dev->num_vectors = nvec;/* (b) */ > return 0; > > err: > pci_unlock_msi(dev->pdev); > > kerr(dev->name, "pci_enable_msix() failed, err=%d", rc); > dev->num_vectors = 0; > return rc; > } That would still need a loop, to handle the natural race between the calls to pci_get_msix_limit() and pci_enable_msix() -- the driver and device can and should fall back to a smaller number of vectors when pci_enable_msix() fails. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
Just to help us all understand "the loop" issue.. Here's an example of driver code which uses the existing MSI-X interfaces, for a device which can work with either 16, 8, 4, 2, or 1 MSI-X interrupt. This is from a new driver I'm working on right now: static int xx_alloc_msix_irqs (struct xx_dev *dev, int nvec) { xx_disable_all_irqs(dev); do { if (nvec < 2) xx_prep_for_1_msix_vector(dev); else if (nvec < 4) xx_prep_for_2_msix_vectors(dev); else if (nvec < 8) xx_prep_for_4_msix_vectors(dev); else if (nvec < 16) xx_prep_for_8_msix_vectors(dev); else xx_prep_for_16_msix_vectors(dev); nvec = pci_enable_msix(dev->pdev, dev->irqs, dev->num_vectors); } while (nvec > 0); if (nvec) { kerr(dev->name, "pci_enable_msix() failed, err=%d", nvec); dev->num_vectors = 0; return nvec; } return 0; /* success */ } -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 00/77] Re-design MSI/MSI-X interrupts enablement pattern
On 13-10-02 06:29 AM, Alexander Gordeev wrote: .. > This update converts pci_enable_msix() and pci_enable_msi_block() > interfaces to canonical kernel functions and makes them return a > error code in case of failure or 0 in case of success. Rather than silently break dozens of drivers in mysterious ways, please invent new function names for the replacements to the existing pci_enable_msix() and pci_enable_msi_block() functions. That way, both in-tree and out-of-tree drivers will notice the API change, rather than having it go unseen and just failing for unknown reasons. Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: STANDBY IMMEDIATE failed on NVIDIA MCP5x controllers when system suspend
On 13-03-11 03:30 PM, Robert Hancock wrote: .. > I'm sure there are some SSDs that do violate their data integrity > commitments - a while ago some tests were done (don't have a link > handy and I don't think they identified the actual vendor/model of the > drives in any case) but there were definitely some SSDs that did do > nasty things like trash unrelated data if power was lost while > writing, etc. So we definitely don't want to risk this sort of thing > occurring on a normal power-off or suspend. Just about all current SATA SSDs advertise having some form(s) of "background garbage collection". That "feature" has always concerned me, because of the real possibility of power being removed (system shutdown) while the drive firmware is in the midst of re-shuffling my data around. One would hope that "STANDBY IMMEDIATE" is taken by the drive as a signal to stop that fussing about and prepare for full/sudden power-off. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix libata-eh don't retry command after reset succeeded.
On 13-01-14 09:01 AM, Mark Lord wrote: >> >> From 9cc9a85f17a8525e53caf430611d762c105d324c Mon Sep 17 00:00:00 2001 >> From: Bian Yu >> Date: Tue, 18 Dec 2012 05:58:34 -0500 >> Subject: [PATCH] Fix libata-eh don't retry command after reset succeeded. >> It's introduced by commit 8d899e70c1b3afff, When disk has a UNC error, >> qc->err_mask will set AC_ERR_MEDIA flag. >> Signed-off-by: Bian Yu .. > Yup, good catch. The original patch to fix retries (from me) had it that way, > but somewhere among the enforced revisions this error crept in unnoticed. > > Jeff -- wanna pick this one up? > Could be useful for -stable, too. The original version of this correction patch had the "Signed-off-by" line. See below. >From 78c0fb104c9957db682518b97ce6a01ce1bc07b6 Mon Sep 17 00:00:00 2001 From: Bian Yu Date: Wed, 12 Dec 2012 22:26:58 -0500 Subject: [PATCH] It should be a mistake introduced by commit 8d899e70c1b3afff. because only qc->flags can't be set AC_ERR_* Signed-off-by: Bian Yu --- drivers/ata/libata-eh.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c index bf039b0..bcf4437 100644 --- a/drivers/ata/libata-eh.c +++ b/drivers/ata/libata-eh.c @@ -2094,7 +2094,7 @@ static unsigned int ata_eh_speed_down(struct ata_device *dev, */ static inline int ata_eh_worth_retry(struct ata_queued_cmd *qc) { - if (qc->flags & AC_ERR_MEDIA) + if (qc->err_mask & AC_ERR_MEDIA) return 0; /* don't retry media errors */ if (qc->flags & ATA_QCFLAG_IO) return 1; /* otherwise retry anything from fs stack */ -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix libata-eh don't retry command after reset succeeded.
> > From 9cc9a85f17a8525e53caf430611d762c105d324c Mon Sep 17 00:00:00 2001 > From: Bian Yu > Date: Tue, 18 Dec 2012 05:58:34 -0500 > Subject: [PATCH] Fix libata-eh don't retry command after reset succeeded. > It's introduced by commit 8d899e70c1b3afff, When disk has a UNC error, > qc->err_mask will set AC_ERR_MEDIA flag. > Signed-off-by: Bian Yu > > --- > drivers/ata/libata-eh.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c > index bf039b0..bcf4437 100644 > --- a/drivers/ata/libata-eh.c > +++ b/drivers/ata/libata-eh.c > @@ -2094,7 +2094,7 @@ static unsigned int ata_eh_speed_down(struct ata_device > *dev, > */ > static inline int ata_eh_worth_retry(struct ata_queued_cmd *qc) > { > - if (qc->flags & AC_ERR_MEDIA) > + if (qc->err_mask & AC_ERR_MEDIA) > return 0; /* don't retry media errors */ > if (qc->flags & ATA_QCFLAG_IO) > return 1; /* otherwise retry anything from fs stack */ > -- > 1.7.1 Yup, good catch. The original patch to fix retries (from me) had it that way, but somewhere among the enforced revisions this error crept in unnoticed. Jeff -- wanna pick this one up? Could be useful for -stable, too. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: DMA mapping on SCSI device?
Mark Lord wrote: .. Commands which were not ADMA compatible (eg. MODE_SENSE, TEST_UNIT_READY, ..) were simply handled with PIO (in the driver) rather than any form of DMA, which is okay because those commands are relatively infrequent. .. A slight correction there: TEST_UNIT_READY was fine in ADMA mode as well. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: DMA mapping on SCSI device?
Robert Hancock wrote: Luben Tuikov wrote: --- On Mon, 1/28/08, Robert Hancock <[EMAIL PROTECTED]> wrote: The trick is that if an ATAPI device is connected, we (as far as I'm aware) can't use ADMA mode, so we have to switch that port into legacy mode. Can you double check this with the HW architect of the HW DMA engine of the ASIC? Will do so. However, previous statements from NVIDIA fairly clearly indicate that this is the case. This means it's only capable of 32-bit DMA. However the other port on the controller may be connected to a hard drive and therefore still capable of 64-bit DMA. If this is indeed the case as you've presented it here, it sounds like a HW shortcoming. I cannot see how the device type (or protocol) dictate how the DMA engine operates. They live in two different domains. Well, there is an indirect link. The ADMA interface (which supports 64-bit DMA) cannot be used to issue ATAPI commands, so if an ATAPI device is connected we have to go to legacy mode, which supports only 32-bit DMA. I'm not sure why ADMA mode doesn't support ATAPI. The only reason I can think of is that there's issues since ATAPI commands can potentially be of unpredictable transfer size. The "real" ADMA spec that the NVIDIA implementation is loosely based on does have some special "ignore excess" controls that don't seem to be in the NVIDIA version (or at least not to the knowledge I have on this hardware). .. The original Pacific Digital ADMA cores *do* support most ATAPI commands in ADMA mode, including READ_CD, READ_10, etc.. With the caveat that if DSC completion state is required, the driver has to drop out of ADMA and poll for it after the ADMA command completes. Commands which were not ADMA compatible (eg. MODE_SENSE, TEST_UNIT_READY, ..) were simply handled with PIO (in the driver) rather than any form of DMA, which is okay because those commands are relatively infrequent. Note that Pacific Digital "officially" said "no ATAPI" for the ADMA design, but I implemented it regardless (for Linux) and it worked rather well. We could burn DVDs and back-up to tape simultaneously, with the burner and the tape unit sharing a single IDE cable/channel. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix page_alloc for larger I/O segments (improved)
Matthew Wilcox wrote: On Fri, Dec 14, 2007 at 05:42:37PM +, Mel Gorman wrote: Regrettably this interferes with anti-fragmentation because the "next" page on the list on return from rmqueue_bulk is not guaranteed to be of the right mobility type. I fixed it as an additional patch but it adds additional cost that should not be necessary and it's visible in microbenchmark results on at least one machine. Is this patch to be preferred to the one Andrew Morton posted to do list_for_each_entry_reverse? .. This patch replaces my earlier patch that Andrew has: - list_add(&page->lru, list); + list_add_tail(&page->lru, list); Which, in turn, replaced the even-earlier list_for_each_entry_reverse patch. -ml - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix page_alloc for larger I/O segments (improved)
Mel Gorman wrote: On (13/12/07 19:46), Mark Lord didst pronounce: "Improved version", more similar to the 2.6.23 code: Fix page allocator to give better chance of larger contiguous segments (again). Signed-off-by: Mark Lord <[EMAIL PROTECTED] Regrettably this interferes with anti-fragmentation because the "next" page on the list on return from rmqueue_bulk is not guaranteed to be of the right mobility type. I fixed it as an additional patch but it adds additional cost that should not be necessary and it's visible in microbenchmark results on at least one machine. The following patch should fix the page ordering problem without incurring an additional cost or interfering with anti-fragmentation. However, I haven't anything in place yet to verify that the physical page ordering is correct but it makes sense. Can you verify it fixes the problem please? It'll still be some time before I have a full set of performance results but initially at least, this fix seems to avoid any impact. == Subject: Fix page allocation for larger I/O segments In some cases the IO subsystem is able to merge requests if the pages are adjacent in physical memory. This was achieved in the allocator by having expand() return pages in physically contiguous order in situations were a large buddy was split. However, list-based anti-fragmentation changed the order pages were returned in to avoid searching in buffered_rmqueue() for a page of the appropriate migrate type. This patch restores behaviour of rmqueue_bulk() preserving the physical order of pages returned by the allocator without incurring increased search costs for anti-fragmentation. Signed-off-by: Mel Gorman <[EMAIL PROTECTED]> --- page_alloc.c | 11 +++ 1 file changed, 11 insertions(+) diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc5-clean/mm/page_alloc.c linux-2.6.24-rc5-giveback-physorder-listmove/mm/page_alloc.c --- linux-2.6.24-rc5-clean/mm/page_alloc.c 2007-12-14 11:55:13.0 + +++ linux-2.6.24-rc5-giveback-physorder-listmove/mm/page_alloc.c 2007-12-14 15:33:12.0 + @@ -847,8 +847,19 @@ static int rmqueue_bulk(struct zone *zon struct page *page = __rmqueue(zone, order, migratetype); if (unlikely(page == NULL)) break; + + /* +* Split buddy pages returned by expand() are received here +* in physical page order. The page is added to the callers and +* list and the list head then moves forward. From the callers +* perspective, the linked list is ordered by page number in +* some conditions. This is useful for IO devices that can +* merge IO requests if the physical pages are ordered +* properly. +*/ list_add(&page->lru, list); set_page_private(page, migratetype); + list = &page->lru; } spin_unlock(&zone->lock); return i; .. That (also) works for me here, regularly generating 64KB I/O segments with SLAB. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Mel Gorman wrote: On (13/12/07 16:37), Andrew Morton didst pronounce: On Thu, 13 Dec 2007 19:30:00 -0500 Mark Lord <[EMAIL PROTECTED]> wrote: Here's the commit that causes the regression: ... --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -760,7 +760,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, struct page *page = __rmqueue(zone, order, migratetype); if (unlikely(page == NULL)) break; - list_add_tail(&page->lru, list); + list_add(&page->lru, list); well that looks fishy. The reasoning behind the change was the first page encountered on the list by the caller would have a matching migratetype. I failed to take into account the physical ordering of pages returned. I'm setting up to run some performance benchmarks of the candidate fix merged into the -mm tree to see if the search shows up or not. I'm testing against 2.6.25-rc5 but it'll take a few hours to complete. .. Thanks, Mel. This is all with CONFIG_SLAB=y, by the way. Note that it did appear to behave better with CONFIG_SLUB=y when I accidently used that .config on my 4GB machine here. Physical segments of 4-10 pages happended much more common than with CONFIG_SLAB=y on my 3GB machine Slightly "apples and oranges" there, I know, but at least both were x86-32. :) So I would expect CONFIG_SLAB to be well off with this patch under most (all?) conditions, but dunno about CONFIG_SLUB. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fix page_alloc for larger I/O segments (improved)
Andrew Morton wrote: On Thu, 13 Dec 2007 19:57:29 -0500 James Bottomley <[EMAIL PROTECTED]> wrote: On Thu, 2007-12-13 at 19:46 -0500, Mark Lord wrote: "Improved version", more similar to the 2.6.23 code: Fix page allocator to give better chance of larger contiguous segments (again). Signed-off-by: Mark Lord <[EMAIL PROTECTED] --- --- old/mm/page_alloc.c 2007-12-13 19:25:15.0 -0500 +++ linux-2.6/mm/page_alloc.c 2007-12-13 19:43:07.0 -0500 @@ -760,7 +760,7 @@ struct page *page = __rmqueue(zone, order, migratetype); if (unlikely(page == NULL)) break; - list_add(&page->lru, list); + list_add_tail(&page->lru, list); Could we put a big comment above this explaining to the would be vm tweakers why this has to be a list_add_tail, so we don't end up back in this position after another two years? Already done ;) .. I thought of the comment as I rushed off for dinner. Thanks, Andrew! --- a/mm/page_alloc.c~fix-page_alloc-for-larger-i-o-segments-fix +++ a/mm/page_alloc.c @@ -847,6 +847,10 @@ static int rmqueue_bulk(struct zone *zon struct page *page = __rmqueue(zone, order, migratetype); if (unlikely(page == NULL)) break; + /* +* Doing a list_add_tail() here helps us to hand out pages in +* ascending physical-address order. +*/ list_add_tail(&page->lru, list); set_page_private(page, migratetype); } _ - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Mark Lord wrote: Andrew Morton wrote: On Thu, 13 Dec 2007 19:30:00 -0500 Mark Lord <[EMAIL PROTECTED]> wrote: Here's the commit that causes the regression: ... --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -760,7 +760,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, struct page *page = __rmqueue(zone, order, migratetype); if (unlikely(page == NULL)) break; -list_add_tail(&page->lru, list); +list_add(&page->lru, list); well that looks fishy. .. Yeah. I missed that, and instead just posted a patch to search the list in reverse order, which seems to work for me. I'll try just reversing that line above here now.. gimme 5 minutes or so. .. Yep, that works too. Alternative "improved" patch now posted. - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] fix page_alloc for larger I/O segments (improved)
"Improved version", more similar to the 2.6.23 code: Fix page allocator to give better chance of larger contiguous segments (again). Signed-off-by: Mark Lord <[EMAIL PROTECTED] --- --- old/mm/page_alloc.c 2007-12-13 19:25:15.0 -0500 +++ linux-2.6/mm/page_alloc.c 2007-12-13 19:43:07.0 -0500 @@ -760,7 +760,7 @@ struct page *page = __rmqueue(zone, order, migratetype); if (unlikely(page == NULL)) break; - list_add(&page->lru, list); + list_add_tail(&page->lru, list); set_page_private(page, migratetype); } spin_unlock(&zone->lock); - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Andrew Morton wrote: On Thu, 13 Dec 2007 19:30:00 -0500 Mark Lord <[EMAIL PROTECTED]> wrote: Here's the commit that causes the regression: ... --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -760,7 +760,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, struct page *page = __rmqueue(zone, order, migratetype); if (unlikely(page == NULL)) break; - list_add_tail(&page->lru, list); + list_add(&page->lru, list); well that looks fishy. .. Yeah. I missed that, and instead just posted a patch to search the list in reverse order, which seems to work for me. I'll try just reversing that line above here now.. gimme 5 minutes or so. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] fix page_alloc for larger I/O segments
Mark Lord wrote: Mark Lord wrote: Mark Lord wrote: Mark Lord wrote: Andrew Morton wrote: On Thu, 13 Dec 2007 17:15:06 -0500 James Bottomley <[EMAIL PROTECTED]> wrote: On Thu, 2007-12-13 at 14:02 -0800, Andrew Morton wrote: On Thu, 13 Dec 2007 21:09:59 +0100 Jens Axboe <[EMAIL PROTECTED]> wrote: OK, it's a vm issue, cc linux-mm and probable culprit. I have tens of thousand "backward" pages after a boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not reverse. So it looks like that bug got reintroduced. Bill Irwin fixed this a couple of years back: changed the page allocator so that it mostly hands out pages in ascending physical-address order. I guess we broke that, quite possibly in Mel's page allocator rework. It would help if you could provide us with a simple recipe for demonstrating this problem, please. The simple way seems to be to malloc a large area, touch every page and then look at the physical pages assigned ... they now mostly seem to be descending in physical address. OIC. -mm's /proc/pid/pagemap can be used to get the pfn's... .. I'm actually running the treadmill right now (have been for many hours, actually, to bisect it to a specific commit. Thought I was almost done, and then noticed that git-bisect doesn't keep the Makefile VERSION lines the same, so I was actually running the wrong kernel after the first few times.. duh. Wrote a script to fix it now. .. Well, that was a waste of three hours. .. Ahh.. it seems to be sensitive to one/both of these: CONFIG_HIGHMEM64G=y with 4GB RAM: not so bad, frequently does 20KB - 48KB segments. CONFIG_HIGHMEM4G=y with 2GB RAM: very severe, rarely does more than 8KB segments. CONFIG_HIGHMEM4G=y with 3GB RAM: very severe, rarely does more than 8KB segments. So if you want to reproduce this on a large memory machine, use "mem=2GB" for starters. .. Here's the commit that causes the regression: 535131e6925b4a95f321148ad7293f496e0e58d7 Choose pages from the per-cpu list based on migration type And here is a patch that seems to fix it for me here: * * * * Fix page allocator to give better change of larger contiguous segments (again). Signed-off-by: Mark Lord <[EMAIL PROTECTED] --- --- old/mm/page_alloc.c.orig2007-12-13 19:25:15.0 -0500 +++ linux-2.6/mm/page_alloc.c 2007-12-13 19:35:50.0 -0500 @@ -954,7 +954,7 @@ goto failed; } /* Find a page of the appropriate migrate type */ - list_for_each_entry(page, &pcp->list, lru) { + list_for_each_entry_reverse(page, &pcp->list, lru) { if (page_private(page) == migratetype) { list_del(&page->lru); pcp->count--; - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Mark Lord wrote: Mark Lord wrote: Mark Lord wrote: Andrew Morton wrote: On Thu, 13 Dec 2007 17:15:06 -0500 James Bottomley <[EMAIL PROTECTED]> wrote: On Thu, 2007-12-13 at 14:02 -0800, Andrew Morton wrote: On Thu, 13 Dec 2007 21:09:59 +0100 Jens Axboe <[EMAIL PROTECTED]> wrote: OK, it's a vm issue, cc linux-mm and probable culprit. I have tens of thousand "backward" pages after a boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not reverse. So it looks like that bug got reintroduced. Bill Irwin fixed this a couple of years back: changed the page allocator so that it mostly hands out pages in ascending physical-address order. I guess we broke that, quite possibly in Mel's page allocator rework. It would help if you could provide us with a simple recipe for demonstrating this problem, please. The simple way seems to be to malloc a large area, touch every page and then look at the physical pages assigned ... they now mostly seem to be descending in physical address. OIC. -mm's /proc/pid/pagemap can be used to get the pfn's... .. I'm actually running the treadmill right now (have been for many hours, actually, to bisect it to a specific commit. Thought I was almost done, and then noticed that git-bisect doesn't keep the Makefile VERSION lines the same, so I was actually running the wrong kernel after the first few times.. duh. Wrote a script to fix it now. .. Well, that was a waste of three hours. .. Ahh.. it seems to be sensitive to one/both of these: CONFIG_HIGHMEM64G=y with 4GB RAM: not so bad, frequently does 20KB - 48KB segments. CONFIG_HIGHMEM4G=y with 2GB RAM: very severe, rarely does more than 8KB segments. CONFIG_HIGHMEM4G=y with 3GB RAM: very severe, rarely does more than 8KB segments. So if you want to reproduce this on a large memory machine, use "mem=2GB" for starters. .. Here's the commit that causes the regression: 535131e6925b4a95f321148ad7293f496e0e58d7 Choose pages from the per-cpu list based on migration type From: Mel Gorman <[EMAIL PROTECTED]> Date: Tue, 16 Oct 2007 08:25:49 + (-0700) Subject: Choose pages from the per-cpu list based on migration type X-Git-Tag: v2.6.24-rc1~1141 X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=535131e6925b4a95f321148ad7293f496e0e58d7;hp=b2a0ac8875a0a3b9f0739b60526f8c5977d2200f Choose pages from the per-cpu list based on migration type The freelists for each migrate type can slowly become polluted due to the per-cpu list. Consider what happens when the following happens 1. A 2^(MAX_ORDER-1) list is reserved for __GFP_MOVABLE pages 2. An order-0 page is allocated from the newly reserved block 3. The page is freed and placed on the per-cpu list 4. alloc_page() is called with GFP_KERNEL as the gfp_mask 5. The per-cpu list is used to satisfy the allocation This results in a kernel page is in the middle of a migratable region. This patch prevents this leak occuring by storing the MIGRATE_ type of the page in page->private. On allocate, a page will only be returned of the desired type, else more pages will be allocated. This may temporarily allow a per-cpu list to go over the pcp->high limit but it'll be corrected on the next free. Care is taken to preserve the hotness of pages recently freed. The additional code is not measurably slower for the workloads we've tested. Signed-off-by: Mel Gorman <[EMAIL PROTECTED]> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]> Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]> --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d54ecf4..e3e726b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -760,7 +760,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, struct page *page = __rmqueue(zone, order, migratetype); if (unlikely(page == NULL)) break; - list_add_tail(&page->lru, list); + list_add(&page->lru, list); + set_page_private(page, migratetype); } spin_unlock(&zone->lock); return i; @@ -887,6 +888,7 @@ static void fastcall free_hot_cold_page(struct page *page, int cold) local_irq_save(flags); __count_vm_event(PGFREE); list_add(&page->lru, &pcp->list); + set_page_private(page, get_pageblock_migratetype(page)); pcp->count++; if (pcp->count >= pcp->high) { free_pages_bulk(zone, pcp->batch, &pcp->list, 0); @@ -951,9 +953,27 @@ again: if (unlikely(!pcp->count)) goto failed; } - page = list_entry(pcp->list.next, struct page, lru); - list_del(&page->lru); - pcp->count--; +
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Mark Lord wrote: Mark Lord wrote: Andrew Morton wrote: On Thu, 13 Dec 2007 17:15:06 -0500 James Bottomley <[EMAIL PROTECTED]> wrote: On Thu, 2007-12-13 at 14:02 -0800, Andrew Morton wrote: On Thu, 13 Dec 2007 21:09:59 +0100 Jens Axboe <[EMAIL PROTECTED]> wrote: OK, it's a vm issue, cc linux-mm and probable culprit. I have tens of thousand "backward" pages after a boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not reverse. So it looks like that bug got reintroduced. Bill Irwin fixed this a couple of years back: changed the page allocator so that it mostly hands out pages in ascending physical-address order. I guess we broke that, quite possibly in Mel's page allocator rework. It would help if you could provide us with a simple recipe for demonstrating this problem, please. The simple way seems to be to malloc a large area, touch every page and then look at the physical pages assigned ... they now mostly seem to be descending in physical address. OIC. -mm's /proc/pid/pagemap can be used to get the pfn's... .. I'm actually running the treadmill right now (have been for many hours, actually, to bisect it to a specific commit. Thought I was almost done, and then noticed that git-bisect doesn't keep the Makefile VERSION lines the same, so I was actually running the wrong kernel after the first few times.. duh. Wrote a script to fix it now. .. Well, that was a waste of three hours. .. Ahh.. it seems to be sensitive to one/both of these: CONFIG_HIGHMEM64G=y with 4GB RAM: not so bad, frequently does 20KB - 48KB segments. CONFIG_HIGHMEM4G=y with 2GB RAM: very severe, rarely does more than 8KB segments. CONFIG_HIGHMEM4G=y with 3GB RAM: very severe, rarely does more than 8KB segments. So if you want to reproduce this on a large memory machine, use "mem=2GB" for starters. Still testing.. - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Mark Lord wrote: Andrew Morton wrote: On Thu, 13 Dec 2007 17:15:06 -0500 James Bottomley <[EMAIL PROTECTED]> wrote: On Thu, 2007-12-13 at 14:02 -0800, Andrew Morton wrote: On Thu, 13 Dec 2007 21:09:59 +0100 Jens Axboe <[EMAIL PROTECTED]> wrote: OK, it's a vm issue, cc linux-mm and probable culprit. I have tens of thousand "backward" pages after a boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not reverse. So it looks like that bug got reintroduced. Bill Irwin fixed this a couple of years back: changed the page allocator so that it mostly hands out pages in ascending physical-address order. I guess we broke that, quite possibly in Mel's page allocator rework. It would help if you could provide us with a simple recipe for demonstrating this problem, please. The simple way seems to be to malloc a large area, touch every page and then look at the physical pages assigned ... they now mostly seem to be descending in physical address. OIC. -mm's /proc/pid/pagemap can be used to get the pfn's... .. I'm actually running the treadmill right now (have been for many hours, actually, to bisect it to a specific commit. Thought I was almost done, and then noticed that git-bisect doesn't keep the Makefile VERSION lines the same, so I was actually running the wrong kernel after the first few times.. duh. Wrote a script to fix it now. .. Well, that was a waste of three hours. Somebody else can try it now. - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Andrew Morton wrote: On Thu, 13 Dec 2007 17:15:06 -0500 James Bottomley <[EMAIL PROTECTED]> wrote: On Thu, 2007-12-13 at 14:02 -0800, Andrew Morton wrote: On Thu, 13 Dec 2007 21:09:59 +0100 Jens Axboe <[EMAIL PROTECTED]> wrote: OK, it's a vm issue, cc linux-mm and probable culprit. I have tens of thousand "backward" pages after a boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not reverse. So it looks like that bug got reintroduced. Bill Irwin fixed this a couple of years back: changed the page allocator so that it mostly hands out pages in ascending physical-address order. I guess we broke that, quite possibly in Mel's page allocator rework. It would help if you could provide us with a simple recipe for demonstrating this problem, please. The simple way seems to be to malloc a large area, touch every page and then look at the physical pages assigned ... they now mostly seem to be descending in physical address. OIC. -mm's /proc/pid/pagemap can be used to get the pfn's... .. I'm actually running the treadmill right now (have been for many hours, actually, to bisect it to a specific commit. Thought I was almost done, and then noticed that git-bisect doesn't keep the Makefile VERSION lines the same, so I was actually running the wrong kernel after the first few times.. duh. Wrote a script to fix it now. -ml - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Mark Lord wrote: Jens Axboe wrote: .. OK, it's a vm issue, I have tens of thousand "backward" pages after a boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not reverse. So it looks like that bug got reintroduced. ... Mmm.. shouldn't one of the front- or back- merge logics work for either order? .. Belay that thought. I'm slowly remembering how this is supposed to work now. :) - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Jens Axboe wrote: On Thu, Dec 13 2007, Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Mark Lord wrote: Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Matthew Wilcox wrote: On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote: Problem confirmed. 2.6.23.8 regularly generates segments up to 64KB for libata, but 2.6.24 uses only 4KB segments and a *few* 8KB segments. Just a suspicion ... could this be slab vs slub? ie check your configs are the same / similar between the two kernels. .. Mmmm.. a good thought, that one. But I just rechecked, and both have CONFIG_SLAB=y My guess is that something got changed around when Jens reworked the block layer for 2.6.24. I'm going to dig around in there now. I didn't rework the block layer for 2.6.24 :-). The core block layer changes since 2.6.23 are: - Support for empty barriers. Not a likely candidate. - Shared tag queue fixes. Totally unlikely. - sg chaining support. Not likely. - The bio changes from Neil. Of the bunch, the most likely suspects in this area, since it changes some of the code involved with merges and blk_rq_map_sg(). - Lots of simple stuff, again very unlikely. Anyway, it sounds odd for this to be a block layer problem if you do see occasional segments being merged. So it sounds more like the input data having changed. Why not just bisect it? .. Because the early 2.6.24 series failed to boot on this machine due to bugs in the block layer -- so the code that caused this regression is probably in the stuff from before the kernels became usable here. .. That sounds more harsh than intended --> the earlier 2.6.24 kernels (up to the first couple of -rc* ones failed here because of incompatibilities between the block/bio changes and libata. That's better, I think! No worries, I didn't pick it up as harsh just as an odd conclusion :-) If I were you, I'd just start from the first -rc that booted for you. If THAT has the bug, then we'll think of something else. If you don't get anywhere, I can run some tests tomorrow and see if I can reproduce it here. .. I believe that *anyone* can reproduce it, since it's broken long before the requests ever get to SCSI or libata. Which also means that *anyone* who wants to can bisect it, as well. I don't do "bisects". It was just a suggestion on how to narrow it down, do as you see fit. But I will dig a bit more and see if I can find the culprit. Sure, I'll dig around as well. Just tried something simple. I only see one 12kb segment so far, so not a lot by any stretch. I also DONT see any missed merges signs, so it would appear that the pages in the request are simply not contigious physically. diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c index e30b1a4..1e34b6f 100644 --- a/block/ll_rw_blk.c +++ b/block/ll_rw_blk.c @@ -1330,6 +1330,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq, goto new_segment; sg->length += nbytes; + if (sg->length > 8192) + printk("sg_len=%d\n", sg->length); } else { new_segment: if (!sg) @@ -1349,6 +1351,8 @@ new_segment: sg = sg_next(sg); } + if (bvprv && (page_address(bvprv->bv_page) + bvprv->bv_len == page_address(bvec->bv_page))) + printk("missed merge\n"); sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset); nsegs++; } .. Yeah, the first part is similar to my own hack. For testing, try "dd if=/dev/sda of=/dev/null bs=4096k". That *really* should end up using contiguous pages on most systems. I figured out the git thing, and am now building some in-between kernels to try. OK, it's a vm issue, I have tens of thousand "backward" pages after a boot - IOW, bvec->bv_page is the page before bvprv->bv_page, not reverse. So it looks like that bug got reintroduced. ... Mmm.. shouldn't one of the front- or back- merge logics work for either order? - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Jens Axboe wrote: On Thu, Dec 13 2007, Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Mark Lord wrote: Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Matthew Wilcox wrote: On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote: Problem confirmed. 2.6.23.8 regularly generates segments up to 64KB for libata, but 2.6.24 uses only 4KB segments and a *few* 8KB segments. Just a suspicion ... could this be slab vs slub? ie check your configs are the same / similar between the two kernels. .. Mmmm.. a good thought, that one. But I just rechecked, and both have CONFIG_SLAB=y My guess is that something got changed around when Jens reworked the block layer for 2.6.24. I'm going to dig around in there now. I didn't rework the block layer for 2.6.24 :-). The core block layer changes since 2.6.23 are: - Support for empty barriers. Not a likely candidate. - Shared tag queue fixes. Totally unlikely. - sg chaining support. Not likely. - The bio changes from Neil. Of the bunch, the most likely suspects in this area, since it changes some of the code involved with merges and blk_rq_map_sg(). - Lots of simple stuff, again very unlikely. Anyway, it sounds odd for this to be a block layer problem if you do see occasional segments being merged. So it sounds more like the input data having changed. Why not just bisect it? .. Because the early 2.6.24 series failed to boot on this machine due to bugs in the block layer -- so the code that caused this regression is probably in the stuff from before the kernels became usable here. .. That sounds more harsh than intended --> the earlier 2.6.24 kernels (up to the first couple of -rc* ones failed here because of incompatibilities between the block/bio changes and libata. That's better, I think! No worries, I didn't pick it up as harsh just as an odd conclusion :-) If I were you, I'd just start from the first -rc that booted for you. If THAT has the bug, then we'll think of something else. If you don't get anywhere, I can run some tests tomorrow and see if I can reproduce it here. .. I believe that *anyone* can reproduce it, since it's broken long before the requests ever get to SCSI or libata. Which also means that *anyone* who wants to can bisect it, as well. I don't do "bisects". It was just a suggestion on how to narrow it down, do as you see fit. But I will dig a bit more and see if I can find the culprit. Sure, I'll dig around as well. Just tried something simple. I only see one 12kb segment so far, so not a lot by any stretch. I also DONT see any missed merges signs, so it would appear that the pages in the request are simply not contigious physically. diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c index e30b1a4..1e34b6f 100644 --- a/block/ll_rw_blk.c +++ b/block/ll_rw_blk.c @@ -1330,6 +1330,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq, goto new_segment; sg->length += nbytes; + if (sg->length > 8192) + printk("sg_len=%d\n", sg->length); } else { new_segment: if (!sg) @@ -1349,6 +1351,8 @@ new_segment: sg = sg_next(sg); } + if (bvprv && (page_address(bvprv->bv_page) + bvprv->bv_len == page_address(bvec->bv_page))) + printk("missed merge\n"); sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset); nsegs++; } .. Yeah, the first part is similar to my own hack. For testing, try "dd if=/dev/sda of=/dev/null bs=4096k". That *really* should end up using contiguous pages on most systems. I figured out the git thing, and am now building some in-between kernels to try. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Mark Lord wrote: Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Matthew Wilcox wrote: On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote: Problem confirmed. 2.6.23.8 regularly generates segments up to 64KB for libata, but 2.6.24 uses only 4KB segments and a *few* 8KB segments. Just a suspicion ... could this be slab vs slub? ie check your configs are the same / similar between the two kernels. .. Mmmm.. a good thought, that one. But I just rechecked, and both have CONFIG_SLAB=y My guess is that something got changed around when Jens reworked the block layer for 2.6.24. I'm going to dig around in there now. I didn't rework the block layer for 2.6.24 :-). The core block layer changes since 2.6.23 are: - Support for empty barriers. Not a likely candidate. - Shared tag queue fixes. Totally unlikely. - sg chaining support. Not likely. - The bio changes from Neil. Of the bunch, the most likely suspects in this area, since it changes some of the code involved with merges and blk_rq_map_sg(). - Lots of simple stuff, again very unlikely. Anyway, it sounds odd for this to be a block layer problem if you do see occasional segments being merged. So it sounds more like the input data having changed. Why not just bisect it? .. Because the early 2.6.24 series failed to boot on this machine due to bugs in the block layer -- so the code that caused this regression is probably in the stuff from before the kernels became usable here. .. That sounds more harsh than intended --> the earlier 2.6.24 kernels (up to the first couple of -rc* ones failed here because of incompatibilities between the block/bio changes and libata. That's better, I think! No worries, I didn't pick it up as harsh just as an odd conclusion :-) If I were you, I'd just start from the first -rc that booted for you. If THAT has the bug, then we'll think of something else. If you don't get anywhere, I can run some tests tomorrow and see if I can reproduce it here. .. I believe that *anyone* can reproduce it, since it's broken long before the requests ever get to SCSI or libata. Which also means that *anyone* who wants to can bisect it, as well. I don't do "bisects". It was just a suggestion on how to narrow it down, do as you see fit. But I will dig a bit more and see if I can find the culprit. Sure, I'll dig around as well. .. I wonder if it's 9dfa52831e96194b8649613e3131baa2c109f7dc: "Merge blk_recount_segments into blk_recalc_rq_segments" ? That particular commit does some rather innocent code-shuffling, but also introduces a couple of new "if (nr_hw_segs == 1" conditions that were not there before. Okay git experts: how do I pull out a kernel at the point of this exact commit ? Thanks! - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Matthew Wilcox wrote: On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote: Problem confirmed. 2.6.23.8 regularly generates segments up to 64KB for libata, but 2.6.24 uses only 4KB segments and a *few* 8KB segments. Just a suspicion ... could this be slab vs slub? ie check your configs are the same / similar between the two kernels. .. Mmmm.. a good thought, that one. But I just rechecked, and both have CONFIG_SLAB=y My guess is that something got changed around when Jens reworked the block layer for 2.6.24. I'm going to dig around in there now. I didn't rework the block layer for 2.6.24 :-). The core block layer changes since 2.6.23 are: - Support for empty barriers. Not a likely candidate. - Shared tag queue fixes. Totally unlikely. - sg chaining support. Not likely. - The bio changes from Neil. Of the bunch, the most likely suspects in this area, since it changes some of the code involved with merges and blk_rq_map_sg(). - Lots of simple stuff, again very unlikely. Anyway, it sounds odd for this to be a block layer problem if you do see occasional segments being merged. So it sounds more like the input data having changed. Why not just bisect it? .. CC'ing Neil Brown. - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Mark Lord wrote: Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Matthew Wilcox wrote: On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote: Problem confirmed. 2.6.23.8 regularly generates segments up to 64KB for libata, but 2.6.24 uses only 4KB segments and a *few* 8KB segments. Just a suspicion ... could this be slab vs slub? ie check your configs are the same / similar between the two kernels. .. Mmmm.. a good thought, that one. But I just rechecked, and both have CONFIG_SLAB=y My guess is that something got changed around when Jens reworked the block layer for 2.6.24. I'm going to dig around in there now. I didn't rework the block layer for 2.6.24 :-). The core block layer changes since 2.6.23 are: - Support for empty barriers. Not a likely candidate. - Shared tag queue fixes. Totally unlikely. - sg chaining support. Not likely. - The bio changes from Neil. Of the bunch, the most likely suspects in this area, since it changes some of the code involved with merges and blk_rq_map_sg(). - Lots of simple stuff, again very unlikely. Anyway, it sounds odd for this to be a block layer problem if you do see occasional segments being merged. So it sounds more like the input data having changed. Why not just bisect it? .. Because the early 2.6.24 series failed to boot on this machine due to bugs in the block layer -- so the code that caused this regression is probably in the stuff from before the kernels became usable here. .. That sounds more harsh than intended --> the earlier 2.6.24 kernels (up to the first couple of -rc* ones failed here because of incompatibilities between the block/bio changes and libata. That's better, I think! No worries, I didn't pick it up as harsh just as an odd conclusion :-) If I were you, I'd just start from the first -rc that booted for you. If THAT has the bug, then we'll think of something else. If you don't get anywhere, I can run some tests tomorrow and see if I can reproduce it here. .. I believe that *anyone* can reproduce it, since it's broken long before the requests ever get to SCSI or libata. Which also means that *anyone* who wants to can bisect it, as well. I don't do "bisects". But I will dig a bit more and see if I can find the culprit. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Mark Lord wrote: Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Matthew Wilcox wrote: On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote: Problem confirmed. 2.6.23.8 regularly generates segments up to 64KB for libata, but 2.6.24 uses only 4KB segments and a *few* 8KB segments. Just a suspicion ... could this be slab vs slub? ie check your configs are the same / similar between the two kernels. .. Mmmm.. a good thought, that one. But I just rechecked, and both have CONFIG_SLAB=y My guess is that something got changed around when Jens reworked the block layer for 2.6.24. I'm going to dig around in there now. I didn't rework the block layer for 2.6.24 :-). The core block layer changes since 2.6.23 are: - Support for empty barriers. Not a likely candidate. - Shared tag queue fixes. Totally unlikely. - sg chaining support. Not likely. - The bio changes from Neil. Of the bunch, the most likely suspects in this area, since it changes some of the code involved with merges and blk_rq_map_sg(). - Lots of simple stuff, again very unlikely. Anyway, it sounds odd for this to be a block layer problem if you do see occasional segments being merged. So it sounds more like the input data having changed. Why not just bisect it? .. Because the early 2.6.24 series failed to boot on this machine due to bugs in the block layer -- so the code that caused this regression is probably in the stuff from before the kernels became usable here. .. That sounds more harsh than intended --> the earlier 2.6.24 kernels (up to the first couple of -rc* ones failed here because of incompatibilities between the block/bio changes and libata. That's better, I think! Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Jens Axboe wrote: On Thu, Dec 13 2007, Mark Lord wrote: Matthew Wilcox wrote: On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote: Problem confirmed. 2.6.23.8 regularly generates segments up to 64KB for libata, but 2.6.24 uses only 4KB segments and a *few* 8KB segments. Just a suspicion ... could this be slab vs slub? ie check your configs are the same / similar between the two kernels. .. Mmmm.. a good thought, that one. But I just rechecked, and both have CONFIG_SLAB=y My guess is that something got changed around when Jens reworked the block layer for 2.6.24. I'm going to dig around in there now. I didn't rework the block layer for 2.6.24 :-). The core block layer changes since 2.6.23 are: - Support for empty barriers. Not a likely candidate. - Shared tag queue fixes. Totally unlikely. - sg chaining support. Not likely. - The bio changes from Neil. Of the bunch, the most likely suspects in this area, since it changes some of the code involved with merges and blk_rq_map_sg(). - Lots of simple stuff, again very unlikely. Anyway, it sounds odd for this to be a block layer problem if you do see occasional segments being merged. So it sounds more like the input data having changed. Why not just bisect it? .. Because the early 2.6.24 series failed to boot on this machine due to bugs in the block layer -- so the code that caused this regression is probably in the stuff from before the kernels became usable here. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Matthew Wilcox wrote: On Thu, Dec 13, 2007 at 01:48:18PM -0500, Mark Lord wrote: Problem confirmed. 2.6.23.8 regularly generates segments up to 64KB for libata, but 2.6.24 uses only 4KB segments and a *few* 8KB segments. Just a suspicion ... could this be slab vs slub? ie check your configs are the same / similar between the two kernels. .. Mmmm.. a good thought, that one. But I just rechecked, and both have CONFIG_SLAB=y My guess is that something got changed around when Jens reworked the block layer for 2.6.24. I'm going to dig around in there now. - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Mark Lord wrote: (resending with corrected email address for Jens) Jens, I'm experimenting here with trying to generate large I/O through libata, and not having much luck. The limit seems to be the number of hardware PRD (SG) entries permitted by the driver (libata:ata_piix), which is 128 by default. The problem is, the block layer *never* sends an SG entry larger than 8192 bytes, and even that size is exceptionally rare. Nearly all I/O segments are 4096 bytes, so I never see a single I/O larger than 512KB (128 * 4096). If I patch various parts of block and SCSI, this limit doesn't budge, but when I change the hardware PRD limit in libata, it scales by exactly whatever I set the new value to. This tells me that adjacent I/O segments are not being combined. I thought that QUEUE_FLAG_CLUSTER (aka. SCSI host .use_clustering=1) should result in adjacent single pages being combined into larger physical segments? This is x86-32 with latest 2.6.24-rc*. I'll re-test on older kernels next. ... Problem confirmed. 2.6.23.8 regularly generates segments up to 64KB for libata, but 2.6.24 uses only 4KB segments and a *few* 8KB segments. ??? - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
(resending with corrected email address for Jens) Jens, I'm experimenting here with trying to generate large I/O through libata, and not having much luck. The limit seems to be the number of hardware PRD (SG) entries permitted by the driver (libata:ata_piix), which is 128 by default. The problem is, the block layer *never* sends an SG entry larger than 8192 bytes, and even that size is exceptionally rare. Nearly all I/O segments are 4096 bytes, so I never see a single I/O larger than 512KB (128 * 4096). If I patch various parts of block and SCSI, this limit doesn't budge, but when I change the hardware PRD limit in libata, it scales by exactly whatever I set the new value to. This tells me that adjacent I/O segments are not being combined. I thought that QUEUE_FLAG_CLUSTER (aka. SCSI host .use_clustering=1) should result in adjacent single pages being combined into larger physical segments? This is x86-32 with latest 2.6.24-rc*. I'll re-test on older kernels next. ??? - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
QUEUE_FLAG_CLUSTER: not working in 2.6.24 ?
Jens, I'm experimenting here with trying to generate large I/O through libata, and not having much luck. The limit seems to be the number of hardware PRD (SG) entries permitted by the driver (libata:ata_piix), which is 128 by default. The problem is, the block layer *never* sends an SG entry larger than 8192 bytes, and even that size is exceptionally rare. Nearly all I/O segments are 4096 bytes, so I never see a single I/O larger than 512KB (128 * 4096). If I patch various parts of block and SCSI, this limit doesn't budge, but when I change the hardware PRD limit in libata, it scales by exactly whatever I set the new value to. This tells me that adjacent I/O segments are not being combined. I thought that QUEUE_FLAG_CLUSTER (aka. SCSI host .use_clustering=1) should result in adjacent single pages being combined into larger physical segments? This is x86-32 with latest 2.6.24-rc*. I'll re-test on older kernels next. ??? - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Process Scheduling Issue using sg/libata
Fajun Chen wrote: As a matter of fact, I'm using /dev/sg*. Due to the size of my test application, I have not be able to compress it into a small and publishable form. However, this issue can be easily reproduced on my ARM XScale target using sg3_util code as follows: 1. Run printtime.c attached, which prints message to console in a loop. 2. Run sgm_dd (part of sg3_util package, source code attached) on the same system as follows: sgm_dd if=/dev/sg0 of=/dev/null count=10M bpt=1 The print task can be delayed for as many as 25 seconds. Surprisingly, I can't reproduce the problem in an i386 test system with a more powerful processor. Some clarification to MAP_ANONYMOUS option in mmap(). After fixing a bug and more testing, this option seems to make no difference to cpu load. Sorry about previous report. Back to the drawing board now :-) .. Okay, I don't see anything unusual here. The code is on a slow CPU, and is triggering 10MBytes of PIO over a (probably) slow bus to an ATA device. This *will* tie up the CPU at 100% for the duration of the I/O, because the I/O happens in interrupt handlers, which are outside of the realm of the CPU scheduler. This is a known shortcoming of Linux for real-time uses. When the I/O uses DMA transfers, it *may* still have a similar effect, depending upon the caching in the ATA device, and on how the DMA shares the memory bus with the CPU. Again, no surprise here. One way to deal with it in an embedded device, is to force the application that's generating the I/O to self-throttle. Or modify the device driver to self-throttle. You may want to find an embedded Linux consultant to help out with this situation if it's beyond your expertise. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Process Scheduling Issue using sg/libata
Fajun Chen wrote: .. I verified your program works in my system and my application works as well if changed accordingly. However, this change (indirect IO in sg term) may come at a performance cost for IO intensive applications since it does NOT utilize mmaped buffer managed by sg driver. Please see relevant sg document below: http://sg.torque.net/sg/p/sg_v3_ho.html#id2495330 http://sg.torque.net/sg/p/sg_v3_ho.html#dmmio As an example, sg_rbuf.c in sg3_util package uses SG_FLAG_MMAP_IO flag in SG_IO. Please see source code attached. I also noticed that MAP_ANONYMOUS is NOT used in mmap() call in sg_rbuf.c, which may not be desirable as you pointed out in previous emails. So this brings up an interesting sg usage issue: can we use MAP_ANONYMOUS with SG_FLAG_MMAP_IO flag in SG_IO? .. The SG_FLAG_MMAP works only with /dev/sg* devices, not /dev/sd* devices. I don't know which kind you were trying to use, since you still have not provided your source code for examination. If you are using /dev/sg*, then you should be able to get your original mmap() code to work. But the behaviour described thus far seems to indicate that your secret program must have been using /dev/sd* instead. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Process Scheduling Issue using sg/libata
Fajun Chen wrote: On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote: .. What you probably intended to do instead, was to use mmap to just allocate some page-aligned RAM, not to actually mmap'd any on-disk data. Right? Here's how that's done: read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0); What I intended to do is to write data into disc or read data from disc via SG_IO as requested by my user-space application. I don't want any automatically scheduled kernel task to sync data with disc. .. Right. Then you definitely do NOT want to mmap your device, because that's exactly what would otherwise happen, by design! I've experimented with memory mapping using MAP_ANONYMOUS as you suggested, the good news is that it does free up the cpu load and my system is much more responsive with the change. .. Yes, that's what we expected to see. The bad news is that the data read back from disc (PIO or DMA read) seems to be invisible to user-space application. For instance, read buffer is all zeros after Identify Device command. Is this expected side effect of MAP_ANONYMOUS option? .. No, that would be a side effect of some other bug in the code. Here (attached) is a working program that performs (PACKET)IDENTIFY DEVICE commands, using a mmap() buffer to receive the data. Cheers /* * This code is copyright 2007 by Mark Lord, * and is made available to all under the terms * of the GNU General Public License v2. */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include typedef unsigned long long u64; enum { ATA_CMD_PIO_IDENTIFY = 0xec, ATA_CMD_PIO_PIDENTIFY = 0xa1, /* normal sector size (bytes) for PIO/DMA */ ATA_SECT_SIZE = 512, ATA_16= 0x85, ATA_16_LEN = 16, ATA_DEV_REG_LBA = (1 << 6), ATA_LBA48 = 1, /* data transfer protocols; only basic PIO and DMA actually work */ ATA_PROTO_NON_DATA = ( 3 << 1), ATA_PROTO_PIO_IN = ( 4 << 1), ATA_PROTO_PIO_OUT = ( 5 << 1), ATA_PROTO_DMA = ( 6 << 1), ATA_PROTO_UDMA_IN = (11 << 1), /* unsupported */ ATA_PROTO_UDMA_OUT = (12 << 1), /* unsupported */ }; /* * Taskfile layout for ATA_16 cdb (LBA28/LBA48): * * cdb[ 4] = feature * cdb[ 6] = nsect * cdb[ 8] = lbal * cdb[10] = lbam * cdb[12] = lbah * cdb[13] = device * cdb[14] = command * * "high order byte" (hob) fields for LBA48 commands: * * cdb[ 3] = hob_feature * cdb[ 5] = hob_nsect * cdb[ 7] = hob_lbal * cdb[ 9] = hob_lbam * cdb[11] = hob_lbah * * dxfer_direction choices: * * SG_DXFER_TO_DEV (writing to drive) * SG_DXFER_FROM_DEV (reading from drive) * SG_DXFER_NONE (non-data commands) */ static int sg_issue (int fd, unsigned char ata_op, void *buf) { unsigned char cdb[ATA_16_LEN] = { ATA_16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; unsigned char sense[32]; unsigned int nsects = 1; struct sg_io_hdr hdr; cdb[ 1] = ATA_PROTO_PIO_IN; cdb[ 6] = nsects; cdb[14] = ata_op; memset(&hdr, 0, sizeof(struct sg_io_hdr)); hdr.interface_id = 'S'; hdr.cmd_len = ATA_16_LEN; hdr.mx_sb_len = sizeof(sense); hdr.dxfer_direction = SG_DXFER_FROM_DEV; hdr.dxfer_len = nsects * ATA_SECT_SIZE; hdr.dxferp = buf; hdr.cmdp = cdb; hdr.sbp = sense; hdr.timeout = 5000; /* milliseconds */ memset(sense, 0, sizeof(sense)); if (ioctl(fd, SG_IO, &hdr) < 0) { perror("ioctl(SG_IO)"); return (-1); } if (hdr.status == 0 && hdr.host_status == 0 && hdr.driver_status == 0) return 0; /* success */ if (hdr.status > 0) { unsigned char *d = sense + 8; /* SCSI status is non-zero */ fprintf(stderr, "SG_IO error: SCSI sense=0x%x/%02x/%02x, ATA=0x%02x/%02x\n", sense[1] & 0xf, sense[2], sense[3], d[13], d[3]); return -1; } /* some other error we don't know about yet */ fprintf(stderr, "SG_IO returned: SCSI status=0x%x, host_status=0x%x, driver_status=0x%x\n", hdr.status, hdr.host_status, hdr.driver_status); return -1; } int main (int argc, char *argv[]) { const char *devpath; int i, rc, fd; #if 0 unsigned short id[ATA_SECT_SIZE / 2]; memset(id, 0, sizeof(id)); #else unsigned short *id; id = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0); if (id == MAP_FAILED) { perror("mmap"); exit(1); } #endif if (argc != 2) { fprintf(stderr, "%s: bad/missing parm: expected \n", argv[0]); exit(1); } devpath = argv[1]; fd = open(devpath, O_RDWR|O_NONBLOCK); if (fd == -1) { perror(devpath); exit(1); } rc = sg_issue(fd, ATA_CMD_PIO_IDENTIFY, id); if (rc != 0) rc = sg_issue(fd, ATA_CMD_PIO_PIDENTIFY, id); if (rc == 0) { unsigned short *d = id; for (i = 0; i < (256/8); ++i) { printf("%04x %04x %04x %04x %04x %04x %04x %04x\n", d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7]); d += 8; } exit(0); } exit(1); }
Re: Process Scheduling Issue using sg/libata
Fajun Chen wrote: On 11/17/07, Mark Lord <[EMAIL PROTECTED]> wrote: Fajun Chen wrote: On 11/16/07, Mark Lord <[EMAIL PROTECTED]> wrote: Fajun Chen wrote: .. This problem also happens with R/W DMA ops. Below are simplified code snippets: // Open one sg device for read if ((sg_fd = open(dev_name, O_RDWR))<0) { ... } read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, MAP_SHARED, sg_fd, 0); // Open the same sg device for write if ((sg_fd_wr = open(dev_name, O_RDWR))<0) { ... } write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, MAP_SHARED, sg_fd_wr, 0); .. Mmmm.. what is the purpose of those two mmap'd areas ? I think this is important and relevant here: what are they used for? As coded above, these are memory mapped areas taht (1) overlap, and (2) will be demand paged automatically to/from the disk as they are accessed/modified. This *will* conflict with any SG_IO operations happening at the same time on the same device. .. The purpose of using two memory mapped areas is to meet our requirement that certain data patterns for writing need to be kept across commands. For instance, if one buffer is used for both reads and writes, then this buffer will need to be re-populated with certain write data after each read command, which would be very costly for write-read mixed type of ops. This separate R/W buffer setting also facilitates data comparison. These buffers are not used at the same time (one will be used only after the command on the other is completed). My application is the only program accessing disk using sg/libata and the rest of the programs run from ramdisk. Also, each buffer is only about 0.5MB and we have 64MB RAM on the target board. With this setup, these two buffers should be pretty much independent and free from block layer/file system, correct? .. No. Those "buffers" as coded above are actually mmap'ed representations of portions of the device (disk drive). So any write into one of those buffers will trigger disk writes, and just accessing ("read") the buffers may trigger disk reads. So what could be happening here, is when you trigger manual disk accesses via SG_IO, that result in data being copied into those "buffers", the kernel then automatically schedules disk writes to update the on-disk copies of those mmap'd regions. What you probably intended to do instead, was to use mmap to just allocate some page-aligned RAM, not to actually mmap'd any on-disk data. Right? Here's how that's done: read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0); Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Process Scheduling Issue using sg/libata
Fajun Chen wrote: On 11/16/07, Mark Lord <[EMAIL PROTECTED]> wrote: Fajun Chen wrote: Hi All, I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2 and libata version 2.00 are loaded on ARM XScale board. Under heavy cpu load (e.g. when blocks per transfer/sector count is set to 1), I've observed that the test application can suck cpu away for long time (more than 20 seconds) and other processes including high priority shell can not get the time slice to run. What's interesting is that if the application is under heavy IO load (e.g. when blocks per transfer/sector count is set to 256), the problem goes away. I also tested with open source code sg_utils and got the same result, so this is not a problem specific to my user-space application. .. Post the relevant code here, and then we'll be able to better understand and explain it to you. For example, if the code is using ATA opcodes 0x20, 0x21, 0x24, 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops), then this behaviour does not surprise me in the least. Fully expected and difficult to avoid. This problem also happens with R/W DMA ops. Below are simplified code snippets: // Open one sg device for read if ((sg_fd = open(dev_name, O_RDWR))<0) { ... } read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, MAP_SHARED, sg_fd, 0); // Open the same sg device for write if ((sg_fd_wr = open(dev_name, O_RDWR))<0) { ... } write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, MAP_SHARED, sg_fd_wr, 0); .. Mmmm.. what is the purpose of those two mmap'd areas ? I think this is important and relevant here: what are they used for? As coded above, these are memory mapped areas taht (1) overlap, and (2) will be demand paged automatically to/from the disk as they are accessed/modified. This *will* conflict with any SG_IO operations happening at the same time on the same device. sg_io_hdr_t io_hdr; memset(&io_hdr, 0, sizeof(sg_io_hdr_t)); io_hdr.interface_id = 'S'; io_hdr.mx_sb_len= sizeof(sense_buffer); io_hdr.sbp = sense_buffer; io_hdr.dxfer_len= dxfer_len; io_hdr.cmd_len = cmd_len; io_hdr.cmdp = cmdp;// ATA pass through command block io_hdr.timeout = cmd_tmo * 1000; // In millisecs io_hdr.pack_id = id; // Read/write counter for now io_hdr.iovec_count=0; // scatter gather elements, 0=not being used if (direction == 1) { io_hdr.dxfer_direction = SG_DXFER_TO_DEV; io_hdr.flags |= SG_FLAG_MMAP_IO; status = ioctl(sg_fd_wr, SG_IO, &io_hdr); } else { io_hdr.dxfer_direction = SG_DXFER_FROM_DEV; io_hdr.flags |= SG_FLAG_MMAP_IO; status = ioctl(sg_fd, SG_IO, &io_hdr); } ... Mmaped IO is a moot point here since this problem is also observed when using direct IO. Thanks, Fajun - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Process Scheduling Issue using sg/libata
Fajun Chen wrote: Hi All, I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2 and libata version 2.00 are loaded on ARM XScale board. Under heavy cpu load (e.g. when blocks per transfer/sector count is set to 1), I've observed that the test application can suck cpu away for long time (more than 20 seconds) and other processes including high priority shell can not get the time slice to run. What's interesting is that if the application is under heavy IO load (e.g. when blocks per transfer/sector count is set to 256), the problem goes away. I also tested with open source code sg_utils and got the same result, so this is not a problem specific to my user-space application. .. Post the relevant code here, and then we'll be able to better understand and explain it to you. For example, if the code is using ATA opcodes 0x20, 0x21, 0x24, 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops), then this behaviour does not surprise me in the least. Fully expected and difficult to avoid. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [linux-usb-devel] Problems with USB disk [solved]
Alan Stern wrote: On Thu, 13 Sep 2007, Mark Lord wrote: .. What happens is, there's a nice little LED on the Cruzer stick, that is "lit" when the stick itself is not in a "power suspend" state (or whatever you USB folks call it). We call it "suspended". (Wasn't there an episode of Classic Trek where Mr. Spock explained to somebody, "I call them `ears'."?) On 2.6.22, that little LED stays "on" normally, and flickers off/on when data is being transfered. The new "USB autosuspend" logic in 2.6.23 now causes that little LED to turn off after a few seconds of inactivity. Once that happens, the USB stick is not accessible until after a longish timeout (~30s I think), followed by a USB reset. Then it is usable again until the next inactivity timeout and autosuspend (a few seconds). So this _isn't_ the regression described above -- to wit, that the drive gets spun down and then won't work without first being spun back up. You wrote: My Sandisk Cruzer Micro 1GB USB sticks suffer from this regression. But it doesn't. It suffers from a different regression. Perhaps, but that doesn't necessarily follow from the above. From this observer's point of view (and being an expert on disk technologies), the same "spin up" issue for rotating media could be the culprit here too. A reset to a disk drive usually (not always) causes it to spin up. Regardless, Greg has acknowledged the regression and is planning to revert things. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [linux-usb-devel] Problems with USB disk [solved]
Alan Stern wrote: On Wed, 12 Sep 2007, Mark Lord wrote: Chuck Ebbert wrote: .. Oh, nice. The usb-storage (SCSI) disk spins itself down and we can't handle that. Should we be disabling auto-spindown when we connect the device, or be able to handle this by sending the start command when needed? There's more to this. My Sandisk Cruzer Micro 1GB USB sticks suffer from this regression. I seriously doubt that. Are you claiming that your USB stick spins itself down during a suspend? And then requires to be spun back up before it will resume proper operation? No, the machine is not being suspended at all. What happens is, there's a nice little LED on the Cruzer stick, that is "lit" when the stick itself is not in a "power suspend" state (or whatever you USB folks call it). On 2.6.22, that little LED stays "on" normally, and flickers off/on when data is being transfered. The new "USB autosuspend" logic in 2.6.23 now causes that little LED to turn off after a few seconds of inactivity. Once that happens, the USB stick is not accessible until after a longish timeout (~30s I think), followed by a USB reset. Then it is usable again until the next inactivity timeout and autosuspend (a few seconds). Etc. - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problems with USB disk [solved]
Chuck Ebbert wrote: On 08/13/2007 10:50 AM, Niels wrote: On Sunday 12 August 2007 11:54, Niels wrote: On Friday 10 August 2007 14:43, Niels wrote: On Wednesday 08 August 2007 12:57, Ismail Dönmez wrote: On Wednesday 08 August 2007 13:48:29 you wrote: On Tuesday 07 August 2007 23:18, Greg KH wrote: On Tue, Aug 07, 2007 at 10:26:15PM +0200, Niels wrote: Hi, I'm having problems with a new 500 GB USB disk. It works, but sometimes I get these in dmesg: usb 1-3: reset high speed USB device using ehci_hcd and address 2 usb 5-1: USB disconnect, address 2 drivers/usb/class/usblp.c: usblp0: removed sd 0:0:0:0: Device not ready: <6>: Sense Key : 0x2 [current] : ASC=0x4 ASCQ=0x2 end_request: I/O error, dev sda, sector 254148215 sd 0:0:0:0: Device not ready: <6>: Sense Key : 0x2 [current] : ASC=0x4 ASCQ=0x2 end_request: I/O error, dev sda, sector 252434023 EXT3-fs error (device sda1): ext3_find_entry: reading directory #15761836 offset 0 There's also a printer connected. This is on a pci/usb2 card. When the above happens, I get I/O errors. When I mount the drive next, there are errors and often missing files. Quite annoying! Kernel is 2.6.21 What's going on? You have a low voltage issue, or a bad cable. The device is electronically disconnecting itself. Try using a externally-powered hub, or a new cable. I am seeing a similar problem with 2.6.22 and 2.6.23-* kernels with my 60G iPod Video, works fine with 2.6.18 kernel though. So far I'm seeing this: - On 2.6.21 I mount the drive. After a while it spins down, and when I then unmount it, an error pops up in dmesg. - On 2.6.18 I can't provoke the same error. The drive doesn't appear to spin down. I don't know if the data corruption from 2.6.21 occurs with regular use. There are a number of other factor I need to eliminate on my system, but that's it so far. CONFIG_USB_SUSPEND is not set on either kernel. OK, on a vanilla 2.6.18.8 I also have this problem, with both the pci/usb2 card, and the usb1 on the board. I listen to music from the drive, and after some time (10-20 minutes or so), it freaks out: = sd 1:0:0:0: Device not ready: <6>: Current: sense key=0x2 ASC=0x4 ASCQ=0x2 end_request: I/O error, dev sda, sector 126693711 sd 1:0:0:0: Device not ready: <6>: Current: sense key=0x2 ASC=0x4 ASCQ=0x2 end_request: I/O error, dev sda, sector 126693711 sd 1:0:0:0: Device not ready: <6>: Current: sense key=0x2 ASC=0x4 ASCQ=0x2 end_request: I/O error, dev sda, sector 126693711 = Using a new PSU and a powered hub made no difference. But I found a solution here: http://alienghic.livejournal.com/382903.html Basically, the problem is, as suspected, that the drive spins down / goes to suspend. This can be disabled with "sdparm --clear STANDBY -6 /dev/sda". It seems to me to be an error that the kernel reports this as something like a hardware failure. Or at least very misleading. Oh, nice. The usb-storage (SCSI) disk spins itself down and we can't handle that. Should we be disabling auto-spindown when we connect the device, or be able to handle this by sending the start command when needed? There's more to this. My Sandisk Cruzer Micro 1GB USB sticks suffer from this regression. Plug one in, it works for about 5 seconds, then the light goes off (bad). Next access requires a 30s timeout + reset. Etc.. This is with 2.6.23-rc6. Works without any problems in 2.6.22. REGRESSION. -ml - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: HDIO_DRIVE_CMD and hdparm
Geert Uytterhoeven wrote: Hi, `hdparm -t' uses HDIO_DRIVE_CMD(null) to flush the disk's buffer. More correctly, that command is supposed to act like an I/O queue "barrier" operation, not returning from the syscall until everything queued in front of it has been issued/completed. I believe that only the original IDE driver actually implements it, though. And hdparm-7.4 (not released yet) will no longer complain about ENOTTY. Note that current versions of hdparm use SG_IO/ATA_16 (SAT) for nearly everything now, only falling back to the older ioctl's for drivers which reject the SAT attempt. I'd love to find a USB drive enclosure that supports SAT. Anyone know of one? And does the USB storage layer actually pass the ATA_16 packets to the device? Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: HDIO_DRIVE_CMD and hdparm
Alan Cox wrote: - SCSI doesn't handle HDIO_DRIVE_CMD(null), and returns EINVAL => fine for hdparm - If a block device doesn't support the ioctl, blkdev_driver_ioctl() returns ENOTTY => hdparm error message Those are both correct -ENOTTY I don't know this ioctl -EINVAL I know this ioctl, usage wrong Exactly. I'll have hdparm-7.4 not complain on ENOTTY as well. -ml - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: libata /dev/scd0 problem: mount after burn fails without eject
Forwarding to linux-scsi and linux-ide mailing lists. Frank van Maarseveen wrote: Tested on 2.6.20.6 and 2.6.21.1 I decided to swich from the old IDE drivers to libata and now there seems to be a little but annoying problem: cannot mount an ISO image after burning it. May 1 14:32:55 kernel: attempt to access beyond end of device May 1 14:32:55 kernel: sr0: rw=0, want=68, limit=4 May 1 14:32:55 kernel: isofs_fill_super: bread failed, dev=sr0, iso_blknum=16, block=16 an "eject" command seems to fix the state of the PATA DVD writer or driver. The problem occurs for burning a CD and for DVD too with identical error messages. relevant kernel boot messages: | ata1: PATA max UDMA/100 cmd 0x000101f0 ctl 0x000103f6 bmdma 0x0001ffa0 irq 14 | ata2: PATA max UDMA/100 cmd 0x00010170 ctl 0x00010376 bmdma 0x0001ffa8 irq 15 | scsi0 : ata_piix | ata1.00: ATA-7: ST3120814A, 3.AAJ, max UDMA/100 | ata1.00: 234441648 sectors, multi 8: LBA48 | ata1.00: configured for UDMA/100 | scsi1 : ata_piix | ata2.00: ATAPI, max UDMA/33 | ata2.01: ATAPI, max UDMA/33 | ata2.00: configured for UDMA/33 | ata2.01: configured for UDMA/33 | scsi 0:0:0:0: Direct-Access ATA ST3120814A 3.AA PQ: 0 ANSI: 5 | SCSI device sda: 234441648 512-byte hdwr sectors (120034 MB) | sda: Write Protect is off | SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA | SCSI device sda: 234441648 512-byte hdwr sectors (120034 MB) | sda: Write Protect is off | SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA | sda: sda1 sda2 sda4 | sd 0:0:0:0: Attached scsi disk sda | sd 0:0:0:0: Attached scsi generic sg0 type 0 | scsi 1:0:0:0: CD-ROMHP DVD Writer 640c CS30 PQ: 0 ANSI: 5 | sr0: scsi3-mmc drive: 40x/40x writer cd/rw xa/form2 cdda tray | Uniform CD-ROM driver Revision: 3.20 | sr 1:0:0:0: Attached scsi generic sg1 type 5 | scsi 1:0:1:0: CD-ROMSAMSUNG CD-ROM SC-148C B105 PQ: 0 ANSI: 5 | sr1: scsi3-mmc drive: 1x/48x cd/rw xa/form2 cdda tray | sr 1:0:1:0: Attached scsi generic sg2 type 5 stripped config (well, as far of I'm sure it shouldn't matter): | CONFIG_PM=y | CONFIG_PM_LEGACY=y | | CONFIG_ACPI=y | CONFIG_ACPI_PROCFS=y | CONFIG_ACPI_FAN=y | CONFIG_ACPI_PROCESSOR=y | CONFIG_ACPI_THERMAL=y | CONFIG_ACPI_BLACKLIST_YEAR=0 | CONFIG_ACPI_EC=y | CONFIG_ACPI_POWER=y | CONFIG_ACPI_SYSTEM=y | CONFIG_X86_PM_TIMER=y | | CONFIG_PCI=y | CONFIG_PCI_GOANY=y | CONFIG_PCI_BIOS=y | CONFIG_PCI_DIRECT=y | CONFIG_PCI_MMCONFIG=y | CONFIG_PCIEPORTBUS=y | CONFIG_PCIEAER=y | CONFIG_HT_IRQ=y | CONFIG_ISA_DMA_API=y | | CONFIG_PNP=y | | CONFIG_PNPACPI=y | | CONFIG_CDROM_PKTCDVD=y | CONFIG_CDROM_PKTCDVD_BUFFERS=8 | | CONFIG_IDE=y | | CONFIG_RAID_ATTRS=y | CONFIG_SCSI=y | CONFIG_SCSI_PROC_FS=y | | CONFIG_BLK_DEV_SD=y | CONFIG_BLK_DEV_SR=y | CONFIG_CHR_DEV_SG=y | | CONFIG_SCSI_MULTI_LUN=y | CONFIG_SCSI_CONSTANTS=y | CONFIG_SCSI_LOGGING=y | | CONFIG_SCSI_SPI_ATTRS=y | | CONFIG_ATA=y | CONFIG_SATA_AHCI=y | CONFIG_ATA_PIIX=y | CONFIG_SATA_INTEL_COMBINED=y | CONFIG_SATA_ACPI=y | CONFIG_PATA_SERVERWORKS=y lspci: 00:00.0 Host bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE DRAM Controller/Host-Hub Interface (rev 01) 00:01.0 PCI bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE Host-to-AGP Bridge (rev 01) 00:1d.0 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #1 (rev 01) 00:1d.1 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #2 (rev 01) 00:1d.2 USB Controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #3 (rev 01) 00:1d.7 USB Controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller (rev 01) 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 81) 00:1f.0 ISA bridge: Intel Corporation 82801DB/DBL (ICH4/ICH4-L) LPC Interface Bridge (rev 01) 00:1f.1 IDE interface: Intel Corporation 82801DB (ICH4) IDE Controller (rev 01) 00:1f.3 SMBus: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) SMBus Controller (rev 01) 00:1f.5 Multimedia audio controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller (rev 01) 01:00.0 VGA compatible controller: nVidia Corporation NV18 [GeForce4 MX 440 AGP 8x] (rev a2) 02:0c.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05) lspci -v -v -v for IDE 00:1f.1 IDE interface: Intel Corporation 82801DB (ICH4) IDE Controller (rev 01) (prog-if 8a [Master SecP PriP]) Subsystem: Dell Unknown device 0142 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] bidi support: bidirectional request
Jens Axboe wrote: So basically just add a struct request pointer, so you can do rq = rq->next_rq or something for the next data phase. I bet this would be a LOT less invasive as well, and we can get by with a few helpers to support it. Hey, I want a way to issue those (linked requests) from userspace (SG_IO), too. Specifically for use with the new SMART Command Transport (SCT) feature set on modern SATA drives. As well as for a disk recovery utility I'm working on. Sounds generally useful, that. - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Matt Mackall wrote: .. Also worth considering is that spending minutes trying to reread damaged sectors is likely to accelerate your death spiral. More data may be recoverable if you give up quickly in a first pass, then go back and manually retry damaged bits with smaller I/Os. All good input. But what was being debated here is not so much the retrying of known-bad sectors, but rather what to do about the kiBs or MiBs of sectors remaining in a merged request after hitting a single bad sector mid-way. Currently, SCSI just abandons the entire remaining workload. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Alan wrote: If this is the right strategy for disk recovery for a given type of device then this ought to be an automatic strategy. Most end users will not have the knowledge to frob about in sysfs, and if the bad sector hits at the wrong moment a sensible automatic recovery strategy is going to do the right thing faster than human intervention, which may be some hours away. I think something we seem to be discussing here are the opposite aims of enterprise RAID (traditional SCSI market) versus desktop filesystems (now the number one user of Linux SCSI, courtesy of libata). With RAID, it's safe to suggest that a very fast, bounded exit from EH is almost always what the customer / end-user wants, because the higher level RAID management can then deal with the failed drive offline. But for a desktop filesystem, failing out quickly and bulk-failing megabytes around a couple of bad sectors is not good -- no redundancy, and this will lead to unneeded data loss. It's beginning to look like this needs to be run-time tuneable, per block minor device (per-partition), through sysfs at a minimum. The RAID tools could then automatically choose settings to bias more towards an "instant exit" when errors are found, whereas a non-RAID desktop could select a more reliable recovery strategy. Right now, with Jame's patch (earlier in this thread), the whole scheme is heavily weighted towards the RAID "instant exit" strategy, which is making me quite nervous about the data on my laptop. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
James Bottomley wrote: On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote: .. One thing that could be even better than the patch below, would be to have it perhaps skip the entire bio that includes the failed sector, rather than only the bad sector itself. Er ... define "skip over the bio". A bio is simply a block representation for a bunch of sg elements coming in to the elevator. Exactly. Or rather, a block of sg_elements from a single point of request, is it not? Mostly what we see in SCSI is a single bio per request, so skipping the bio is really the current behaviour (to fail the rest of the request). Very good. That's what it's supposed to do. But if each request contained only a single bio, then all of Jens' work on IO scheduling would be for nothing, n'est-ce pas? In the case where a request consists of multiple bio's which have been merged under a single request struct, we really should give at least one attempt to each bio. This way, in most cases, only the process that requested the failed sector(s) will see an error, not the innocent victims that happened to get merged onto the end. Which could be very critical stuff (or not -- it could be quite random). So the time factor works out to one disk I/O timeout per failed bio. That's what would have happened with the NOP scheduler anyway. On the sytems I'm working with, I don't see huge numbers of bad sectors. What they tend to show is just one or two bad sectors, widely scattered. So: I think doing that might address most concerns expressed here. Have you got an alternate suggestion, James? Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
James Bottomley wrote: On Tue, 2007-01-30 at 19:47 -0500, Mark Lord wrote: Kernels since about 2.6.16 or so have been broken in this regard. They "complete" the good sectors before the error, and then fail the entire remaining portions of the request. What was the commit that introduced the change? ... I have a vague memory of it being deliberate. I believe you made the first change in response to my prodding at the time, when libata was not returning valid sense data (no LBA) for media errors. The SCSI EH handling of that was rather poor at the time, and so having it not retry the remaining sectors was actually a very good fix at the time. But now, libata *does* return valid sense data for LBA/DMA drives, and the workaround from circa 2.6.16 is no longer the best we can do. Now that we know which sector failed, we ought to be able to skip over it, and continue with the rest of the merged request. One thing that could be even better than the patch below, would be to have it perhaps skip the entire bio that includes the failed sector, rather than only the bad sector itself. I think doing that might address most concerns expressed here. Have you got an alternate suggestion, James? .. Signed-off-by: Mark Lord <[EMAIL PROTECTED]> --- diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c --- old/drivers/scsi/scsi_lib.c 2007-01-30 13:58:05.0 -0500 +++ linux/drivers/scsi/scsi_lib.c 2007-01-30 18:30:01.0 -0500 @@ -865,6 +865,12 @@ */ if (sense_valid && !sense_deferred) { switch (sshdr.sense_key) { + case MEDIUM_ERROR: + // Bad sector. Fail it, and then continue the rest of the request: + if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) { The sense key may have come with additional information I think we want to parse that (if it exists) rather than just blindly failing the first sector of the request. + cmd->retries = 0;// go around again.. + return; + } This would drop through to the UNIT_ATTENTION case if scsi_end_request() fails ... I don't think that's correct. case UNIT_ATTENTION: if (cmd->device->removable) { /* Detected disc change. Set a bit - To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] RESEND: SCSI, libata: add support for ATA_16 commands to libata ATAPI devices
Tejun Heo wrote: Anyways, this has never been guaranteed because the limit is host wide. But until very very recently, "host wide" meant just a single device for libata. I was just assuming we did all of the fiddling to ensure a minimal value there for some real reason. But, yes, now we have PATA (2 drives per host), and PMP (many more drives per host), so just maxing out the limit seems sensible. So, I'm for setting it to 16. Jeff, what do you think? This patch sets libata to always accept CDB lengths up to 16 bytes. With this change, ATA_16 passthrough commands can now be passed into libata for ATAPI devices. There is a separate related patch already pending to actually implement ATA_16 for ATAPI inside libata. Signed-off-by: Mark Lord <[EMAIL PROTECTED]> --- --- old/drivers/ata/libata-core.c 2007-01-31 19:55:53.0 -0500 +++ linux/drivers/ata/libata-core.c 2007-01-31 19:57:15.0 -0500 @@ -1577,20 +1577,6 @@ snprintf(desc, desc_sz, "NCQ (depth %d/%d)", hdepth, ddepth); } -static void ata_set_port_max_cmd_len(struct ata_port *ap) -{ - int i; - - if (ap->scsi_host) { - unsigned int len = 0; - - for (i = 0; i < ATA_MAX_DEVICES; i++) - len = max(len, ap->device[i].cdb_len); - - ap->scsi_host->max_cmd_len = len; - } -} - /** * ata_dev_configure - Configure the specified ATA/ATAPI device * @dev: Target device to configure @@ -1771,8 +1757,6 @@ } } - ata_set_port_max_cmd_len(ap); - /* limit bridge transfers to udma5, 200 sectors */ if (ata_dev_knobble(dev)) { if (ata_msg_drv(ap) && print_info) @@ -5689,7 +5673,7 @@ shost->max_id = 16; shost->max_lun = 1; shost->max_channel = 1; - shost->max_cmd_len = 12; + shost->max_cmd_len = 16; } /** - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] RESEND: SCSI, libata: add support for ATA_16 commands to libata ATAPI devices
James Bottomley wrote: .. In SCSI, we're very careful only to use large commands where necessary, so even for a >2TB array, you only see 16 byte commands for sectors beyond 2TB. This essentially means that the capacity (and we do a careful READ_CAPACITY(10) and only move on to READ_CAPACITY(16) if we're told to) limits the command size. Very cool, James. - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] RESEND: SCSI, libata: add support for ATA_16 commands to libata ATAPI devices
Tejun Heo wrote: Mark Lord wrote: .. So, to achieve ATA passthru capability for libata ATAPI, we have to instead use the ATA_16 opcode: a 16-byte command. SCSI normally disallows issuing 16-byte commands to 12-byte devices, so special support has to be added for this. I might have missed the discussion but can't we just set host->max_cmd_len to 16 unconditionally? Sure thing, if you and Jeff are happy with that, then lets do it. I just kind of assumed that the complexity in ata_set_port_max_cmd_len() was there for some kind of reason. For example, I think all existing ATAPI drives only speak 12-byte packet protocols, and so if we tell SCSI we're good for 16-byte, then won't the SCSI layer suddenly start sending us READ_16 and the like? Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
James Bottomley wrote: On Wed, 2007-01-31 at 12:57 -0500, Mark Lord wrote: Alan wrote: When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, as the drive itself has already done internal retries (libata uses the "with retry" ATA opcodes for this). This depends on the firmware. Some of the "raid firmware" drives don't appear to do retries in firmware. One way to tell if this is true, is simply to time how long the failed operation takes. If the drive truly does not do retries, then the media error should be reported more or less instantly (assuming drive was already spun up). Well, the simpler way (and one we have a hope of implementing) is to examine the ASC/ASCQ codes to see if the error is genuinely unretryable. My suggestion above was not for a kernel fix, but rather just as a way of determining if drives which claim "no retries" actually do them or not. :) I seem to have dropped the ball on this one in that the scsi_error.c pieces of this patch http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885 I thought I'd applied. Apparently I didn't, so I'll go back and put them in. Good. That would be a useful supplement to the patch I posted here. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Alan wrote: When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, as the drive itself has already done internal retries (libata uses the "with retry" ATA opcodes for this). This depends on the firmware. Some of the "raid firmware" drives don't appear to do retries in firmware. One way to tell if this is true, is simply to time how long the failed operation takes. If the drive truly does not do retries, then the media error should be reported more or less instantly (assuming drive was already spun up). If the failure takes more than a few hundred milliseconds to be reported, or in this case 4-7 seconds typically, then we know the drive was doing retries before it reported back. I haven't seen any drive fail instantly yet. Can anyone with those newfangled "RAID edition" drives try it and report back? Oh.. you'll need a way to create a bad sector. I've got patches and a command-line utility for the job. If your drive supports "WRITE UNCORRECTABLE" ("hdparm -I", w/latest hdparm), then the patches aren't needed. But meanwhile, we still have the original issue too, where a single stray bad sector can blow a system out of the water, because the mid-layer currently aborts everything after it from a large merged request. Thus the original patch from this thread. :) Agreed - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Douglas Gilbert wrote: Ric, Both ATA (ATA8-ACS) and SCSI (SBC-3) have recently added command support to flag a block as "uncorrectable". There is no need to send bad "long" data to it and suppress the disk's automatic re-allocation logic. That'll be useful in a couple of years, once drives that have it become more common. For now, though, we're hacking current drives using READ/WRITE LONG commands, with a corresponding patch to libata to allow for the longer "sector size" involved. Having real bad sectors, exactly where we want them on the media, sure does make testing / fixing the EH mechanisms a lot more feasible. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Mark Lord wrote: James Bottomley wrote: For the MD case, this is what REQ_FAILFAST is for. I cannot find where SCSI honours that flag. James? Scratch that thought.. SCSI honours it in scsi_end_request(). But I'm not certain that the block layer handles it correctly, at least not in the 2.6.16/2.6.18 kernel that I'm working with today. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
James Bottomley wrote: For the MD case, this is what REQ_FAILFAST is for. I cannot find where SCSI honours that flag. James? And for that matter, even when I patch SCSI so that it *does* honour it, I don't actually see the flag making it into the SCSI layer from above. And I don't see where/how the block layer takes care when considering merge FAILFAST/READA requests with non FAILFAST/READA requests. To me, it looks perfectly happy to add non-FAILFAST/READA bios to a FAILFAST request, risking data loss if a lower-layer decides to honour the FAILFAST/READA flags. So it's a pretty Good Thing(tm) that SCSI doesn't currently honour it. ;) - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Ric Wheeler wrote: Mark Lord wrote: Eric D. Mudama wrote: Actually, it's possibly worse, since each failure in libata will generate 3-4 retries. (note: libata does *not* generate retries for medium errors; the looping is driven by the SCSI mid-layer code). It really beats the alternative of a forced reboot due to, say, superblock I/O failing because it happened to get merged with an unrelated I/O which then failed.. Etc.. Definitely an improvement. The number of retries is an entirely separate issue. If we really care about it, then we should fix SD_MAX_RETRIES. The current value of 5 is *way* too high. It should be zero or one. .. I think that drives retry enough, we should leave retry at zero for normal (non-removable) drives. Should this be a policy we can set like we do with NCQ queue depth via /sys ? Or perhaps we could have the mid-layer always "early-exit" without retries for "MEDIUM_ERROR", and still do retries for the rest. When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable, as the drive itself has already done internal retries (libata uses the "with retry" ATA opcodes for this). But meanwhile, we still have the original issue too, where a single stray bad sector can blow a system out of the water, because the mid-layer currently aborts everything after it from a large merged request. Thus the original patch from this thread. :) Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
James Bottomley wrote: First off, please send SCSI patches to the SCSI list: Fixed already, thanks! This patch fixes the behaviour to be similar to what we had originally. When a bad sector is encounted, SCSI will now work around it again, failing *only* the bad sector itself. Erm, but the corollary is that if we get a large read failure because of a bad track, you're going to try and chunk up it a sector at a time That's better than the huge data-loss scenario that we currently have for single-sector errors. MUCH better. forcing an individual error for each sector is going to annoy some people ... particularly removable medium ones which return this error if the medium isn't present ... Are you sure this is really what we want to do? No, for removed-medium everything just fails right away. This patch is *only* for media errors, not any other failures. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR
Eric D. Mudama wrote: Actually, it's possibly worse, since each failure in libata will generate 3-4 retries. With existing ATA error recovery in the drives, that's about 3 seconds per retry on average, or 12 seconds per failure. Multiply that by the number of blocks past the error to complete the request.. It really beats the alternative of a forced reboot due to, say, superblock I/O failing because it happened to get merged with an unrelated I/O which then failed.. Etc.. Definitely an improvement. The number of retries is an entirely separate issue. If we really care about it, then we should fix SD_MAX_RETRIES. The current value of 5 is *way* too high. It should be zero or one. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] RESEND scsi_lib.c: continue after MEDIUM_ERROR
Fixed for 80-columns, and copying linux-scsi this time. In ancient kernels, the SCSI disk code used to continue after encountering a MEDIUM_ERROR. It would "complete" the good sectors before the error, fail the bad sector/block, and then continue with the rest of the request. Kernels since about 2.6.16 or so have been broken in this regard. They "complete" the good sectors before the error, and then fail the entire remaining portions of the request. This is very risky behaviour, as a request is often a merge of several bios, and just because one application hits a bad sector is no reason to pretend that (for example) an adjacent directly lookup also failed. This patch fixes the behaviour to be similar to what we had originally. When a bad sector is encounted, SCSI will now work around it again, failing *only* the bad sector itself. Signed-off-by: Mark Lord <[EMAIL PROTECTED]> --- --- old/drivers/scsi/scsi_lib.c 2007-01-30 20:06:15.0 -0500 +++ linux/drivers/scsi/scsi_lib.c 2007-01-30 20:06:59.0 -0500 @@ -865,6 +865,13 @@ */ if (sense_valid && !sense_deferred) { switch (sshdr.sense_key) { + case MEDIUM_ERROR: + /* Bad sector. Fail it, and continue on with the rest */ + if (scsi_end_request(cmd, 0, + cmd->device->sector_size, 1) == NULL) { + cmd->retries = 0; /* go around again.. */ + return; + } case UNIT_ATTENTION: if (cmd->device->removable) { /* Detected disc change. Set a bit - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices
Jens Axboe wrote: On Wed, Jan 03 2007, James Bottomley wrote: Er, well, as you know, I've never been a fan of this static list. I thought Jens was going to put us all out of our misery by making the list settable per device by root and thus shovel the problem off onto the distros? The code is there, just haven't had the guts to merge it yet. http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=bf5f922d167a5c5cf57132bbcaa1e0ddfd5c45f7 That's nice, but doesn't help with the case of trying to do ATA passthru to ATAPI devices, the subject of the original patch here. Cheers! - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices
James Bottomley wrote: As I read the code Mark sent me, current libata will only accept ATA_16 for ATAPI devices (whether type 5 or not) ... pehaps this needs to be altered? It could be. Both ATA_12 and ATA_16 are just ways of getting a taskfile sent down to the LLD, and ATA_12 is merely a restricted subset of ATA_16 so it is not really necessary here. I think that given the opcode conflict with "BLANK", it may be best for the moment to just not bother with ATA_12 for libata ATAPI. We don't really have a need for ATA_12 anyway, as ATA_16 is more flexible. And any application code that uses these isn't going to want to have to vary its commands between ATA_12 and ATA_16 depending upon the target. I think they'll all just go straight for ATA_16 for uniformity and to minimize testing drudgery. hdparm will certainly use only ATA_16 once it becomes available in SCSI/libata. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices
James Bottomley wrote: Er I've only seen one patch ... I did ask if there was another to add the ATA_16 interpretation ... did that get lost by the SCSI reflector? Ahh.. my apologies on that, James. The two patches apply and build independently, and the second patch was only for libata, so it was sent only to the linux-ide list. There's actually a third patch as well, to fix a one-line bug in sr.c that prevents *any* SG_IO from ever working on ATAPI as well. I'll email you the full set. Cheers! - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices
James Bottomley wrote: .. I don't think I quite understand what you're trying to do here. My understanding is that ATA_12 and ATA_16 are part of the SAT layer. i.e. they're used when we're speaking SCSI to an underlying ATA device to send taskfiles. However, ATAPI devices don't use SAT ... every SCSI command you send to an ATAPI device goes out as an ATA PACKET command without being translated. ATAPI devices also implement quite a few ATA (non-packet) commands. This patch gives us a way to issue them in response to SG_IO ATA passthru and the existing (non working) libata HDIO_DRIVE_CMD and HDIO_DRIVE_TASK ioctl calls. Without this patch, SCSI blocks the ATA_16 passthru attempts, because it thinks the CDB is too large for the 12-byte packet protocol that ATAPI devices use. This limit (12 bytes) is totally non-applicable to ATA protocol commands. Is there a missing piece to this patch, where you scan the incoming commands to ATAPI devices and actually do translation for ATA_16? Look at the existing libata-core.c ioctl's for HDIO_DRIVE_CMD and HDIO_DRIVE_TASK, used by hdparm and others. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] SCSI, libata: add support for ATA_16 commands to libata ATAPI devices
In an ideal world, we would use the existing ATA_12 opcode to issue 12-byte ATA passthrough commands for libata ATAPI drives from userspace. But ATA_12 happens to have the same SCSI opcode value as the older CD/RW "BLANK" command, widely used by cdrecord and friends. So, to achieve ATA passthru capability for libata ATAPI, we have to instead use the ATA_16 opcode: a 16-byte command. SCSI normally disallows issuing 16-byte commands to 12-byte devices, so special support has to be added for this. Introduce an "allow_ata_16" boolean to the scsi_host struct. This provides a means for libata to signal that 16-byte ATA_16 commands should be permitted even for 12-byte ATAPI devices. Signed-off-by: Mark Lord <[EMAIL PROTECTED]> --- old/include/scsi/scsi_host.h2007-01-02 19:06:45.0 -0500 +++ new/include/scsi/scsi_host.h2007-01-02 19:09:06.0 -0500 @@ -608,6 +608,13 @@ unsigned async_scan:1; /* +* True for libata, to allow issuing ATA_16 16-byte CDBs +* to (otherwise) 12-byte ATAPI drives. ATAPI cannot use +* the ATA_12 opcode, because it means "BLANK" to CDRW drives. +*/ + unsigned allow_ata_16:1; + + /* * Optional work queue to be utilized by the transport */ char work_q_name[KOBJ_NAME_LEN]; --- old/drivers/scsi/scsi.c 2007-01-02 19:06:45.0 -0500 +++ new/drivers/scsi/scsi.c 2007-01-02 19:09:06.0 -0500 @@ -567,12 +567,17 @@ * length exceeds what the host adapter can handle. */ if (CDB_SIZE(cmd) > cmd->device->host->max_cmd_len) { - SCSI_LOG_MLQUEUE(3, + /* permit ATA_16 passthru to 12-byte ATAPI devices */ + if (cmd->cmnd[0] != ATA_16 +|| CDB_SIZE(cmd) != 16 +|| !cmd->device->host->allow_ata_16) { + SCSI_LOG_MLQUEUE(3, printk("queuecommand : command too long.\n")); - cmd->result = (DID_ABORT << 16); + cmd->result = (DID_ABORT << 16); - scsi_done(cmd); - goto out; + scsi_done(cmd); + goto out; + } } spin_lock_irqsave(host->host_lock, flags); --- old/drivers/ata/libata-core.c 2007-01-02 19:06:44.0 -0500 +++ new/drivers/ata/libata-core.c 2007-01-02 19:09:06.0 -0500 @@ -5686,6 +5686,7 @@ shost->max_lun = 1; shost->max_channel = 1; shost->max_cmd_len = 12; + shost->allow_ata_16 = 1; } /** - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2.6.20-rc3] fix broken retval test in sr_block_ioctl
With recent kernels (and possibly much older ones, too), ioctl() calls for ATAPI devices never make it to the driver layers. The reason for this is a borked return code test in drivers/scsi/sr.c. This patch fixes it. Signed-off-by: Mark Lord <[EMAIL PROTECTED]> --- --- old/drivers/scsi/sr.c 2006-11-29 16:57:37.0 -0500 +++ linux/drivers/scsi/sr.c 2007-01-02 16:40:33.0 -0500 @@ -468,7 +468,7 @@ } ret = cdrom_ioctl(file, &cd->cdi, inode, cmd, arg); - if (ret != ENOSYS) + if (ret != -ENOSYS) return ret; /* - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2.6.20-rc3] fix broken retval test in sr_block_ioctl
Jeff Garzik wrote: Mark Lord wrote: ... ret = cdrom_ioctl(file, &cd->cdi, inode, cmd, arg); -if (ret != ENOSYS) +if (ret != -ENOSYS) I think Tejun posted the same patch earlier today. Ahh.. missed it -- could have saved me an hour or two of debugging! Cheers! - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: libata: clustering on or off?
Clustering has always been a (small) win for the ATA/SATA hardware I have worked on. Many controllers read the scatter-gather list "on demand", so if clustering is not used, there are a lot of times where the controller must stop streaming data, and fetch the next s/g entry, then resume streaming data. Clustering reduces the number of such events, giving better bus utilization and (slightly) faster transfers. The more modern hardware I'm working with now will generally read the s/g table in larger blocks, so that it only rarely needs to interrupt data transfers to fetch s/g info. Clustering is not as big a win with this, but still reduces total bus cycles required. Cheers - To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2.6.11-rc4+] sata_qstor: new basic driver for Pacific Digital QStor
This is a new basic libata SATA driver for the Pacific Digital QStor SATA/RAID card. It features ordinary per-drive SATA with single-request DMA for R/W commands, and PIO for everything else. It currently does not implement any of the hardware RAID support, nor hot-plug, nor the tagged/native command-queuing features. On the other hand, it is small and simple as a result. Patch is against 2.6.11-rc4 + minor libata updates from Jeff. Signed-off-by: Mark Lord <[EMAIL PROTECTED]> diff -uprN linux-2.6.11-rc4jeff/drivers/scsi/Kconfig linux/drivers/scsi/Kconfig --- linux-2.6.11-rc4jeff/drivers/scsi/Kconfig 2005-02-14 10:27:44.0 -0500 +++ linux/drivers/scsi/Kconfig 2005-02-16 20:29:23.0 -0500 @@ -457,6 +457,14 @@ config SCSI_SATA_PROMISE If unsure, say N. +config SCSI_SATA_QSTOR + tristate "Pacific Digital SATA QStor support" + depends on SCSI_SATA && PCI + help + This option enables support for Pacific Digital Serial ATA QStor. + + If unsure, say N. + config SCSI_SATA_SX4 tristate "Promise SATA SX4 support" depends on SCSI_SATA && PCI && EXPERIMENTAL diff -uprN linux-2.6.11-rc4jeff/drivers/scsi/Makefile linux/drivers/scsi/Makefile --- linux-2.6.11-rc4jeff/drivers/scsi/Makefile 2005-02-14 10:27:44.0 -0500 +++ linux/drivers/scsi/Makefile 2005-02-16 20:29:23.0 -0500 @@ -125,6 +125,7 @@ obj-$(CONFIG_SCSI_SATA_AHCI) += libata.o obj-$(CONFIG_SCSI_SATA_SVW) += libata.o sata_svw.o obj-$(CONFIG_SCSI_ATA_PIIX) += libata.o ata_piix.o obj-$(CONFIG_SCSI_SATA_PROMISE) += libata.o sata_promise.o +obj-$(CONFIG_SCSI_SATA_QSTOR) += libata.o sata_qstor.o obj-$(CONFIG_SCSI_SATA_SIL) += libata.o sata_sil.o obj-$(CONFIG_SCSI_SATA_VIA) += libata.o sata_via.o obj-$(CONFIG_SCSI_SATA_VITESSE) += libata.o sata_vsc.o diff -uprN linux-2.6.11-rc4jeff/drivers/scsi/sata_qstor.c linux/drivers/scsi/sata_qstor.c --- linux-2.6.11-rc4jeff/drivers/scsi/sata_qstor.c 1969-12-31 19:00:00.0 -0500 +++ linux/drivers/scsi/sata_qstor.c 2005-02-16 20:25:01.0 -0500 @@ -0,0 +1,700 @@ +/* + * sata_qstor.c - Pacific Digital Corporation QStor SATA + * + * Maintained by: Mark Lord <[EMAIL PROTECTED]> + * + * Copyright 2005 Pacific Digital Corporation. + * (OSL/GPL code release authorized by Jalil Fadavi). + * + * The contents of this file are subject to the Open + * Software License version 1.1 that can be found at + * http://www.opensource.org/licenses/osl-1.1.txt and is included herein + * by reference. + * + * Alternatively, the contents of this file may be used under the terms + * of the GNU General Public License version 2 (the "GPL") as distributed + * in the kernel source COPYING file, in which case the provisions of + * the GPL are applicable instead of the above. If you wish to allow + * the use of your version of this file only under the terms of the + * GPL and not to allow others to use your version of this file under + * the OSL, indicate your decision by deleting the provisions above and + * replace them with the notice and other provisions required by the GPL. + * If you do not delete the provisions above, a recipient may use your + * version of this file under either the OSL or the GPL. + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include "scsi.h" +#include +#include +#include + +#define DRV_NAME "sata_qstor" +#define DRV_VERSION "0.03" + +enum { + QS_PORTS = 4, + QS_MAX_PRD = LIBATA_MAX_PRD, + QS_CPB_ORDER = 6, + QS_CPB_BYTES = (1 << QS_CPB_ORDER), + QS_PRD_BYTES = QS_MAX_PRD * 16, + QS_PKT_BYTES = QS_CPB_BYTES + QS_PRD_BYTES, + + QS_DMA_BOUNDARY = ~0UL, + + /* global register offsets */ + QS_HCF_CNFG3 = 0x0003, /* host configuration offset */ + QS_HID_HPHY = 0x0004, /* host physical interface info */ + QS_HCT_CTRL = 0x00e4, /* global interrupt mask offset */ + QS_HST_SFF = 0x0100, /* host status fifo offset */ + QS_HVS_SERD3 = 0x0393, /* PHY enable offset */ + + /* global control bits */ + QS_HPHY_64BIT = (1 << 1), /* 64-bit bus detected */ + QS_CNFG3_GSRST = 0x01, /* global chip reset */ + QS_SERD3_PHY_ENA = 0xf0, /* PHY detection ENAble*/ + + /* per-channel register offsets */ + QS_CCF_CPBA = 0x0710, /* chan CPB base address */ + QS_CCF_CSEP = 0x0718, /* chan CPB separation factor */ + QS_CFC_HUFT = 0x0800, /* host upstream fifo threshold */ + QS_CFC_HDFT = 0x0804, /* host downstream fifo threshold */ + QS_CFC_DUFT = 0x0808, /* dev upstream fifo threshold */ + QS_CFC_DDFT = 0x080c, /* dev downstream fifo threshold */ + QS_CCT_CTR0 = 0x0900, /* chan control-0 offset */ + QS_CCT_CTR1 = 0x0901, /* chan control-1 offset */ + QS_CCT_CFF = 0x0a00, /* chan command fifo offset */ + + /* channel control bits */ + QS_CTR0_REG = (1 << 1), /* register mode (vs. pkt mode) */ + QS_CTR0_CLER = (1 << 2), /* clear channel errors */ + QS_CTR1_RDEV =