Hi Marek, > On 3/2/20 2:21 PM, Lukasz Majewski wrote: > > Hi, > > Hi, > > >> This patch aims to improve robustness of 'usb' command operation on > >> the ehci-hcd IP block as it ports to contemporary U-Boot the patch > >> described and provided in [1] (originally applicable to U-Boot > >> 2016.05). > >> > >> According to the fix author - "rayvt" (from [1]): > >> > >> TD stands for "Queue Element Transfer Descriptor (qTD)", which is a > >> micro-code instruction for the EHCI USB chip. > >> The "token" field is detailed information, control, and status of > >> the TD's data transfer. In particular, the rightmost byte is the > >> status field. 0x80 bit means the TD is active and the data > >> transfer has not yet completed. 0x08 bit means there was some sort > >> of data transfer error (XactErr). > > ^^^^^^^ - this error is NOT handled now in mainline > > u-boot. Do you have any experience with it? > > No, but then I also didn't run into those yet.
Ok. > > >> If the drive gets a transfer error, it refuses to do any other I/O > >> transfer until the error is properly cleared and reset. U-boot did > >> not do this, so every subsequent disk read would get a timeout > >> error because the drive would not send any data. The fix is to > >> properly clear and reset the USB disk when it gets a transfer > >> error. > > So basically, what this says is, "we crashed the drive, because the > drive suffered counter overflow, thus we fix it by resetting the > drive". The transfer error may be also caused by some timeout from internal controller when the USB disk is worn out. Counter overflow may not be the only reason. > No, the fix is to not crash the drive in the first place. From 2019.0{14} a lot was done to make "usb" command more robust, but still some mishaps happen. > > >> Every read operation starts at the maximum block size. When the USB > >> pendrive is not able to correctly serve this data read request, the > >> dynamic reduction of IO size is performed. Up to six tries (with > >> smaller IO block each time) are attempted. > >> > >> A related problem is that some drives are slow to come up. Linux > >> handles this by issuing a spinup command and allowing more time for > >> the drive to respond. The same idea is applied in this fix. > >> > >> On TPC70 (i.MX6Q) once per ~10 times (without this patch): > >> > >> Bus usb@2184200: USB EHCI 1.00 > >> scanning bus usb@2184200 for devices... 2 USB Device(s) found > >> scanning usb for storage devices... 1 Storage Device(s) > >> found EHCI timed out on TD - token=0x1f8c80 > > > > This is how the error gets evident. The detailed explanation is in > > link [1]. > > Is there one specific post in that forum, or do I need to read through > the whole multi-page thread ? There is a thread pointed out by the below link which explains the issue thoroughly: https://forum.doozan.com/read.php?3,35295,35295#msg-35295 (The first 2 posts from 'rayvt' explains the problem). > > I would expect that if you run -- I assume 'usb reset' (the command is > missing above) -- then the bus gets power-cycled, hence the USB device > also gets power-cycled. Unfortunately, the 'usb reset' is not fixing the issue. The controller is unresponsive and the only way for recovering is a power cycle. > > >> Performance impact: > >> > >> With U-Boot -master in mainline: > >> 16869440 bytes read in 979 ms (16.4 MiB/s) > >> > >> With this patch: > >> 16869440 bytes read in 1219 ms (13.2 MiB/s) > >> > >> Links: > >> [1] - https://forum.doozan.com/read.php?3,35295,35295#msg-35295 > >> [2] - > >> https://www.dropbox.com/s/nrkrd1no63viuu8/uboot-bodhi-2016.05-timeoutTD.patch?dl=0 > >> > >> Signed-off-by: Lukasz Majewski <lu...@denx.de> > >> [Unfortunately, the original patch [2] did not contain S-o-B from > >> the original author - "rayvt"] > >> --- > >> > >> common/usb.c | 10 +++- > >> common/usb_storage.c | 106 > >> +++++++++++++++++++++++++++++++++--- drivers/usb/host/ehci-hcd.c | > >> 55 +++++++++++++++++-- include/usb.h | 1 + > >> include/usb_defs.h | 1 + > >> 5 files changed, 160 insertions(+), 13 deletions(-) > >> > >> diff --git a/common/usb.c b/common/usb.c > >> index 349e838f1d..305482b5bb 100644 > >> --- a/common/usb.c > >> +++ b/common/usb.c > >> @@ -925,14 +925,20 @@ static int get_descriptor_len(struct > >> usb_device *dev, int len, int expect_len) __maybe_unused struct > >> usb_device_descriptor *desc; ALLOC_CACHE_ALIGN_BUFFER(unsigned > >> char, tmpbuf, USB_BUFSIZ); int err; > >> + int retry = 5; > >> > >> desc = (struct usb_device_descriptor *)tmpbuf; > >> > >> +again: > >> err = usb_get_descriptor(dev, USB_DT_DEVICE, 0, desc, > >> len); if (err < expect_len) { > >> if (err < 0) { > >> - printf("unable to get device descriptor > >> (error=%d)\n", > >> - err); > >> + printf("unable to get device descriptor > >> (error=%d) retry: %d\n", > >> + err, retry); > >> + mdelay(50); > >> + if (--retry >= 0) > >> + /* Some drives are just slow to > >> wake up. */ > >> + goto again; > > > > This I think is just to "wait" for the inserted device if it is > > rather slow (or probably its memory got degraded over time). > > > > This shouldn't bring any transfer slow down. > > Then this could be a separate patch ? Yes. I think so. > > >> return err; > >> } else { > >> printf("USB device descriptor short read > >> (expected %i, got %i)\n", diff --git a/common/usb_storage.c > >> b/common/usb_storage.c index 097b6729c1..48c8c2ae64 100644 > >> --- a/common/usb_storage.c > >> +++ b/common/usb_storage.c > >> @@ -111,6 +111,18 @@ int usb_stor_get_info(struct usb_device *dev, > >> struct us_data *us, struct blk_desc *dev_desc); > >> int usb_storage_probe(struct usb_device *dev, unsigned int ifnum, > >> struct us_data *ss); > >> + > >> +#ifdef CONFIG_USB_EHCI_HCD > >> + /* > >> + * The U-Boot EHCI driver can handle any transfer length > >> as long as > >> + * there is enough free heap space left, but the SCSI > >> READ(10) and > >> + * WRITE(10) commands are limited to 65535 blocks. > >> + */ > >> +int usb_max_xfer_blk = 4096; > > > > Setting it to magic value - 4096 may be indeed problematic. > > However, I don't know what is the 'usual' transfer size. > > 240 blocks , see 7d6fd7f0ba71cd93d94079132f958d9630f27a89 . Ok. > > >> +#else > >> +int usb_max_xfer_blk = 20; > > > > This is also a magic value - probably took from some old U-Boot > > (when default transfer was set to this value). > > > > I do guess that this snippet shall be removed from this patch set. > > Probably. Ok. > > >> +#endif > >> + > >> #if CONFIG_IS_ENABLED(BLK) > >> static unsigned long usb_stor_read(struct udevice *dev, lbaint_t > >> blknr, lbaint_t blkcnt, void *buffer); > >> @@ -729,6 +741,7 @@ static int usb_stor_BBB_transport(struct > >> scsi_cmd *srb, struct us_data *us) pipeout = > >> usb_sndbulkpipe(us->pusb_dev, us->ep_out); /* DATA phase + error > >> handling */ data_actlen = 0; > >> + mdelay(10); /* Like linux does. */ > >> /* no data, go immediately to the STATUS phase */ > >> if (srb->datalen == 0) > >> goto st; > >> @@ -740,6 +753,16 @@ static int usb_stor_BBB_transport(struct > >> scsi_cmd *srb, struct us_data *us) > >> result = usb_bulk_msg(us->pusb_dev, pipe, srb->pdata, > >> srb->datalen, &data_actlen, USB_CNTL_TIMEOUT * 5); > >> + > >> + /* special handling of XACTERR in DATA phase */ > >> + if (result < 0 && (us->pusb_dev->status & > >> USB_ST_XACTERR)) { > >> + debug("XACTERR in data phase - clr, reset, and > >> return fail.\n"); > >> + usb_stor_BBB_clear_endpt_stall(us, dir_in ? > >> + us->ep_in : > >> us->ep_out); > >> + usb_stor_BBB_reset(us); > >> + return USB_STOR_TRANSPORT_FAILED; > >> + } > >> + > > > > Here the case of XACTERR is handled. This is missing now in > > -master. > > Surely this could be added as a separate patch ? Ok. > > >> /* special handling of STALL in DATA phase */ > >> if ((result < 0) && (us->pusb_dev->status & > >> USB_ST_STALLED)) { debug("DATA:stall\n"); > >> @@ -1013,9 +1036,32 @@ static int usb_request_sense(struct scsi_cmd > >> *srb, struct us_data *ss) return 0; > >> } > >> > >> +/* > >> + * This spins up the disk and also consumes the time that the > >> + * disk takes to become active and ready to read data. > >> + * Some drives (like Western Digital) can take more than 5 > >> seconds. > >> + * The delay occurs on the 1st data read from the disk. > >> + * Extending the timeout here works better than handling the > >> timeout > >> + * as an error on a "real" read. > >> + */ > >> +static int usb_spinup(struct scsi_cmd *srb, struct us_data *ss) > >> +{ > >> + memset(&srb->cmd[0], 0, 12); > >> + srb->cmd[0] = SCSI_START_STP; > >> + srb->cmd[1] = srb->lun << 5; > >> + srb->cmd[4] = 1; /* Start spinup. */ > >> + srb->datalen = 0; > >> + srb->cmdlen = 6; > >> + ss->pusb_dev->extra_timeout = 9876; > >> + ss->transport(srb, ss); > >> + ss->pusb_dev->extra_timeout = 0; > >> + return 0; > >> +} > >> + > >> static int usb_test_unit_ready(struct scsi_cmd *srb, struct > >> us_data *ss) { > >> int retries = 10; > >> + int gave_extra_time = 0; > >> > >> do { > >> memset(&srb->cmd[0], 0, 12); > >> @@ -1038,6 +1084,17 @@ static int usb_test_unit_ready(struct > >> scsi_cmd *srb, struct us_data *ss) if ((srb->sense_buf[2] == 0x02) > >> && (srb->sense_buf[12] == 0x3a)) > >> return -1; > >> + /* > >> + * If the status is "Not Ready - becoming ready", > >> give it > >> + * more time. Linux issues a spinup command > >> (once) and gives > >> + * it 100 seconds. > >> + */ > >> + if (srb->sense_buf[2] == 0x02 && > >> + srb->sense_buf[12] == 0x04 && > >> + gave_extra_time == 0) { > >> + retries = 100; /* Allow 10 seconds. */ > >> + gave_extra_time = retries; > >> + } > > > > The above code looks like targeting the HDDs, but some degraded pen > > drives may need this time as well for proper operation. > > Hard drive spin up handling. This could be a separate patch too. > Ok. > >> mdelay(100); > >> } while (retries--); > >> > >> @@ -1166,22 +1223,57 @@ static unsigned long usb_stor_read(struct > >> blk_desc *block_dev, lbaint_t blknr, block_dev->devnum, start, > >> blks, buf_addr); > >> do { > >> - /* XXX need some comment here */ > >> + /* > >> + * Probably most errors are USB errors, not hard > >> disk error. > >> + * Many disks use a USB chip that is flaky when > >> doing large > >> + * transfers. The workaround is to dynamically > >> reduce the > >> + * transfer size and allow an additional try. > >> + * This should pick up flaky disks. Linux uses a > >> quirks table. > >> + * We'll use observation: > >> + * Give it 1 try very large, 1 try large, 2 tries > >> medium and 2 > >> + * tries small(ish). > >> + * On a solid fail (actual disk error - which > >> should be rare), > >> + * this will give us 6 tries max, and only that > >> many if the read > >> + * is quite large. > >> + * A fail on a very short read obviously doesn't > >> have a > >> + * too-large max_blks. Timeout due to spinup > >> being a case in > >> + * this point. > >> + */ > >> retry = 2; > >> srb->pdata = (unsigned char *)buf_addr; > >> - if (blks > ss->max_xfer_blk) > >> - smallblks = ss->max_xfer_blk; > >> +retry_it: > >> + if (blks > usb_max_xfer_blk) > >> + smallblks = usb_max_xfer_blk; > >> else > >> smallblks = (unsigned short) blks; > >> -retry_it: > >> - if (smallblks == ss->max_xfer_blk) > >> + > >> + if (smallblks == usb_max_xfer_blk) > >> usb_show_progress(); > >> + > >> srb->datalen = block_dev->blksz * smallblks; > >> srb->pdata = (unsigned char *)buf_addr; > >> if (usb_read_10(srb, ss, start, smallblks)) { > >> debug("Read ERROR\n"); > >> ss->flags &= ~USB_READY; > >> usb_request_sense(srb, ss); > >> + > >> + /* Dynamically reduce the I/O size. */ > >> + if (smallblks > 2047) { > >> + usb_max_xfer_blk = 2047; > >> + ++retry; > >> + } else if (smallblks > 512) { > >> + usb_max_xfer_blk = 512; > >> + ++retry; > >> + } else if (smallblks > 511) { > >> + usb_max_xfer_blk = 511; > >> + ++retry; > >> + } else if (smallblks > 63) { > >> + usb_max_xfer_blk = 63; > >> + retry += 2; > >> + } > >> + debug("step down usb_max_xfer_blk to > >> %d\n", > >> + usb_max_xfer_blk); > >> + > > > > This shall not be needed, as the ehci controller's driver will > > scatter the transfers up till 240 blocks max. Am I right? > > > > The above code was supposed to work before the 240 block limitation > > was in place? > > Correct, this code is wrong. Ok. > > [...] > > >> + qhtoken = hc32_to_cpu(qh->qh_overlay.qt_token); > >> if (!(QT_TOKEN_GET_STATUS(qhtoken) & > >> QT_TOKEN_STATUS_ACTIVE)) { debug("TOKEN=%#x\n", qhtoken); > >> + if (qhtoken & QT_TOKEN_STATUS_XACTERR) { > > ^^^^^^^^^^^^^^^^^^^^^^^^ - this flag > > before this patch was also not > > checked. > >> + if (--trynum >= 0) { > >> + /* > >> + * It is necessary to do this, > >> otherwise the > >> + * disk is clagged. > >> + */ > >> + debug("reset the TD and redo, > >> because of XACTERR\n"); > >> + qhtoken &= > >> ~QT_TOKEN_STATUS_HALTED; > >> + qhtoken |= QT_TOKEN_STATUS_ACTIVE > >> | > >> + QT_TOKEN_CERR(2); > >> + vtd->qt_token = > >> cpu_to_hc32(qhtoken); > >> + qh->qh_overlay.qt_token = > >> cpu_to_hc32(qhtoken); > >> + goto retry_xacterr; > >> + } > >> + dev->status = USB_ST_XACTERR; > >> + dev->act_len = length - > >> + QT_TOKEN_GET_TOTALBYTES(qhtoken); > >> > > > > This may solve the issue - as it resets the controller in the case > > of transfer error. > > Do you need to reset the controller ? Or is there some graceful way > out, i.e. a way to recover from the error ? The original patch was developed when the async transmission was start and stopped for each transfer (before your optimization), hence the code for disabling it. I need to check if it would be enough to just clear the QT_TOKEN_STATUS_HALTED and set QT_TOKEN_STATUS_ACTIVE again. > [...] Best regards, Lukasz Majewski -- DENX Software Engineering GmbH, Managing Director: Wolfgang Denk HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lu...@denx.de
pgpuf_gXf3kRD.pgp
Description: OpenPGP digital signature