Re: [PATCH] usb: ehci: Fix "EHCI timed out on TD - token=XXXX" error on ehci-hcd

Lukasz Majewski Mon, 02 Mar 2020 15:25:55 -0800

Hi Marek,

> On 3/2/20 2:21 PM, Lukasz Majewski wrote:
> > Hi,  
> 
> Hi,
> 
> >> This patch aims to improve robustness of 'usb' command operation on
> >> the ehci-hcd IP block as it ports to contemporary U-Boot the patch
> >> described and provided in [1] (originally applicable to U-Boot
> >> 2016.05).
> >>
> >> According to the fix author - "rayvt" (from [1]):
> >>
> >> TD stands for "Queue Element Transfer Descriptor (qTD)", which is a
> >> micro-code instruction for the EHCI USB chip.
> >> The "token" field is detailed information, control, and status of
> >> the TD's data transfer. In particular, the rightmost byte is the
> >> status field. 0x80 bit means the TD is active and the data
> >> transfer has not yet completed. 0x08 bit means there was some sort
> >> of data transfer error (XactErr).  
> >      ^^^^^^^ - this error is NOT handled now in mainline
> > u-boot. Do you have any experience with it?  
> 
> No, but then I also didn't run into those yet.


Ok.

> 
> >> If the drive gets a transfer error, it refuses to do any other I/O
> >> transfer until the error is properly cleared and reset. U-boot did
> >> not do this, so every subsequent disk read would get a timeout
> >> error because the drive would not send any data. The fix is to
> >> properly clear and reset the USB disk when it gets a transfer
> >> error.  
> 
> So basically, what this says is, "we crashed the drive, because the
> drive suffered counter overflow, thus we fix it by resetting the
> drive". 

The transfer error may be also caused by some timeout from internal
controller when the USB disk is worn out. Counter overflow may not be
the only reason.

> No, the fix is to not crash the drive in the first place.

From 2019.0{14} a lot was done to make "usb" command more robust, but
still some mishaps happen.

> 
> >> Every read operation starts at the maximum block size. When the USB
> >> pendrive is not able to correctly serve this data read request, the
> >> dynamic reduction of IO size is performed. Up to six tries (with
> >> smaller IO block each time) are attempted.
> >>
> >> A related problem is that some drives are slow to come up. Linux
> >> handles this by issuing a spinup command and allowing more time for
> >> the drive to respond. The same idea is applied in this fix.
> >>
> >> On TPC70 (i.MX6Q) once per ~10 times (without this patch):
> >>
> >> Bus usb@2184200: USB EHCI 1.00
> >> scanning bus usb@2184200 for devices... 2 USB Device(s) found
> >>        scanning usb for storage devices... 1 Storage Device(s)
> >> found EHCI timed out on TD - token=0x1f8c80  
> > 
> > This is how the error gets evident. The detailed explanation is in
> > link [1].  
> 
> Is there one specific post in that forum, or do I need to read through
> the whole multi-page thread ?

There is a thread pointed out by the below link which explains the
issue thoroughly:
https://forum.doozan.com/read.php?3,35295,35295#msg-35295

(The first 2 posts from 'rayvt' explains the problem).

> 
> I would expect that if you run -- I assume 'usb reset' (the command is
> missing above) -- then the bus gets power-cycled, hence the USB device
> also gets power-cycled.

Unfortunately, the 'usb reset' is not fixing the issue. The controller
is unresponsive and the only way for recovering is a power cycle.

> 
> >> Performance impact:
> >>
> >> With U-Boot -master in mainline:
> >> 16869440 bytes read in 979 ms (16.4 MiB/s)
> >>
> >> With this patch:
> >> 16869440 bytes read in 1219 ms (13.2 MiB/s)
> >>
> >> Links:
> >> [1] - https://forum.doozan.com/read.php?3,35295,35295#msg-35295
> >> [2] -
> >> https://www.dropbox.com/s/nrkrd1no63viuu8/uboot-bodhi-2016.05-timeoutTD.patch?dl=0
> >>
> >> Signed-off-by: Lukasz Majewski <lu...@denx.de>
> >> [Unfortunately, the original patch [2] did not contain S-o-B from
> >> the original author - "rayvt"]
> >> ---
> >>
> >>  common/usb.c                |  10 +++-
> >>  common/usb_storage.c        | 106
> >> +++++++++++++++++++++++++++++++++--- drivers/usb/host/ehci-hcd.c |
> >> 55 +++++++++++++++++-- include/usb.h               |   1 +
> >>  include/usb_defs.h          |   1 +
> >>  5 files changed, 160 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/common/usb.c b/common/usb.c
> >> index 349e838f1d..305482b5bb 100644
> >> --- a/common/usb.c
> >> +++ b/common/usb.c
> >> @@ -925,14 +925,20 @@ static int get_descriptor_len(struct
> >> usb_device *dev, int len, int expect_len) __maybe_unused struct
> >> usb_device_descriptor *desc; ALLOC_CACHE_ALIGN_BUFFER(unsigned
> >> char, tmpbuf, USB_BUFSIZ); int err;
> >> +  int retry = 5;
> >>  
> >>    desc = (struct usb_device_descriptor *)tmpbuf;
> >>  
> >> +again:
> >>    err = usb_get_descriptor(dev, USB_DT_DEVICE, 0, desc,
> >> len); if (err < expect_len) {
> >>            if (err < 0) {
> >> -                  printf("unable to get device descriptor
> >> (error=%d)\n",
> >> -                          err);
> >> +                  printf("unable to get device descriptor
> >> (error=%d) retry: %d\n",
> >> +                         err, retry);
> >> +                  mdelay(50);
> >> +                  if (--retry >= 0)
> >> +                          /* Some drives are just slow to
> >> wake up. */
> >> +                          goto again;  
> > 
> > This I think is just to "wait" for the inserted device if it is
> > rather slow (or probably its memory got degraded over time).
> > 
> > This shouldn't bring any transfer slow down.  
> 
> Then this could be a separate patch ?

Yes. I think so.

> 
> >>                    return err;
> >>            } else {
> >>                    printf("USB device descriptor short read
> >> (expected %i, got %i)\n", diff --git a/common/usb_storage.c
> >> b/common/usb_storage.c index 097b6729c1..48c8c2ae64 100644
> >> --- a/common/usb_storage.c
> >> +++ b/common/usb_storage.c
> >> @@ -111,6 +111,18 @@ int usb_stor_get_info(struct usb_device *dev,
> >> struct us_data *us, struct blk_desc *dev_desc);
> >>  int usb_storage_probe(struct usb_device *dev, unsigned int ifnum,
> >>                  struct us_data *ss);
> >> +
> >> +#ifdef CONFIG_USB_EHCI_HCD
> >> +  /*
> >> +   * The U-Boot EHCI driver can handle any transfer length
> >> as long as
> >> +   * there is enough free heap space left, but the SCSI
> >> READ(10) and
> >> +   * WRITE(10) commands are limited to 65535 blocks.
> >> +   */
> >> +int usb_max_xfer_blk = 4096;  
> > 
> > Setting it to magic value - 4096 may be indeed problematic.
> > However, I don't know what is the 'usual' transfer size.  
> 
> 240 blocks , see 7d6fd7f0ba71cd93d94079132f958d9630f27a89 .

Ok.

> 
> >> +#else
> >> +int usb_max_xfer_blk = 20;  
> > 
> > This is also a magic value - probably took from some old U-Boot
> > (when default transfer was set to this value).
> > 
> > I do guess that this snippet shall be removed from this patch set.  
> 
> Probably.

Ok.

> 
> >> +#endif
> >> +
> >>  #if CONFIG_IS_ENABLED(BLK)
> >>  static unsigned long usb_stor_read(struct udevice *dev, lbaint_t
> >> blknr, lbaint_t blkcnt, void *buffer);
> >> @@ -729,6 +741,7 @@ static int usb_stor_BBB_transport(struct
> >> scsi_cmd *srb, struct us_data *us) pipeout =
> >> usb_sndbulkpipe(us->pusb_dev, us->ep_out); /* DATA phase + error
> >> handling */ data_actlen = 0;
> >> +  mdelay(10);             /* Like linux does. */
> >>    /* no data, go immediately to the STATUS phase */
> >>    if (srb->datalen == 0)
> >>            goto st;
> >> @@ -740,6 +753,16 @@ static int usb_stor_BBB_transport(struct
> >> scsi_cmd *srb, struct us_data *us) 
> >>    result = usb_bulk_msg(us->pusb_dev, pipe, srb->pdata,
> >> srb->datalen, &data_actlen, USB_CNTL_TIMEOUT * 5);
> >> +
> >> +  /* special handling of XACTERR in DATA phase */
> >> +  if (result < 0 && (us->pusb_dev->status &
> >> USB_ST_XACTERR)) {
> >> +          debug("XACTERR in data phase - clr, reset, and
> >> return fail.\n");
> >> +          usb_stor_BBB_clear_endpt_stall(us, dir_in ?
> >> +                                         us->ep_in :
> >> us->ep_out);
> >> +          usb_stor_BBB_reset(us);
> >> +          return USB_STOR_TRANSPORT_FAILED;
> >> +  }
> >> +  
> > 
> > Here the case of XACTERR is handled. This is missing now in
> > -master.  
> 
> Surely this could be added as a separate patch ?

Ok.

> 
> >>    /* special handling of STALL in DATA phase */
> >>    if ((result < 0) && (us->pusb_dev->status &
> >> USB_ST_STALLED)) { debug("DATA:stall\n");
> >> @@ -1013,9 +1036,32 @@ static int usb_request_sense(struct scsi_cmd
> >> *srb, struct us_data *ss) return 0;
> >>  }
> >>  
> >> +/*
> >> + * This spins up the disk and also consumes the time that the
> >> + * disk takes to become active and ready to read data.
> >> + * Some drives (like Western Digital) can take more than 5
> >> seconds.
> >> + * The delay occurs on the 1st data read from the disk.
> >> + * Extending the timeout here works better than handling the
> >> timeout
> >> + * as an error on a "real" read.
> >> + */
> >> +static int usb_spinup(struct scsi_cmd *srb, struct us_data *ss)
> >> +{
> >> +  memset(&srb->cmd[0], 0, 12);
> >> +  srb->cmd[0] = SCSI_START_STP;
> >> +  srb->cmd[1] = srb->lun << 5;
> >> +  srb->cmd[4] = 1;        /* Start spinup. */
> >> +  srb->datalen = 0;
> >> +  srb->cmdlen = 6;
> >> +  ss->pusb_dev->extra_timeout = 9876;
> >> +  ss->transport(srb, ss);
> >> +  ss->pusb_dev->extra_timeout = 0;
> >> +  return 0;
> >> +}
> >> +
> >>  static int usb_test_unit_ready(struct scsi_cmd *srb, struct
> >> us_data *ss) {
> >>    int retries = 10;
> >> +  int gave_extra_time = 0;
> >>  
> >>    do {
> >>            memset(&srb->cmd[0], 0, 12);
> >> @@ -1038,6 +1084,17 @@ static int usb_test_unit_ready(struct
> >> scsi_cmd *srb, struct us_data *ss) if ((srb->sense_buf[2] == 0x02)
> >> && (srb->sense_buf[12] == 0x3a))
> >>                    return -1;
> >> +          /*
> >> +           * If the status is "Not Ready - becoming ready",
> >> give it
> >> +           * more time.  Linux issues a spinup command
> >> (once) and gives
> >> +           * it 100 seconds.
> >> +           */
> >> +          if (srb->sense_buf[2] == 0x02 &&
> >> +              srb->sense_buf[12] == 0x04 &&
> >> +              gave_extra_time == 0) {
> >> +                  retries = 100; /* Allow 10 seconds. */
> >> +                  gave_extra_time = retries;
> >> +          }  
> > 
> > The above code looks like targeting the HDDs, but some degraded pen
> > drives may need this time as well for proper operation.  
> 
> Hard drive spin up handling. This could be a separate patch too.
> 

Ok.

> >>            mdelay(100);
> >>    } while (retries--);
> >>  
> >> @@ -1166,22 +1223,57 @@ static unsigned long usb_stor_read(struct
> >> blk_desc *block_dev, lbaint_t blknr, block_dev->devnum, start,
> >> blks, buf_addr); 
> >>    do {
> >> -          /* XXX need some comment here */
> >> +          /*
> >> +           * Probably most errors are USB errors, not hard
> >> disk error.
> >> +           * Many disks use a USB chip that is flaky when
> >> doing large
> >> +           * transfers. The workaround is to dynamically
> >> reduce the
> >> +           * transfer size and allow an additional try.
> >> +           * This should pick up flaky disks. Linux uses a
> >> quirks table.
> >> +           * We'll use observation:
> >> +           * Give it 1 try very large, 1 try large, 2 tries
> >> medium and 2
> >> +           * tries small(ish).
> >> +           * On a solid fail (actual disk error - which
> >> should be rare),
> >> +           * this will give us 6 tries max, and only that
> >> many if the read
> >> +           * is quite large.
> >> +           * A fail on a very short read obviously doesn't
> >> have a
> >> +           * too-large max_blks.  Timeout due to spinup
> >> being a case in
> >> +           * this point.
> >> +           */
> >>            retry = 2;
> >>            srb->pdata = (unsigned char *)buf_addr;
> >> -          if (blks > ss->max_xfer_blk)
> >> -                  smallblks = ss->max_xfer_blk;
> >> +retry_it:
> >> +          if (blks > usb_max_xfer_blk)
> >> +                  smallblks = usb_max_xfer_blk;
> >>            else
> >>                    smallblks = (unsigned short) blks;
> >> -retry_it:
> >> -          if (smallblks == ss->max_xfer_blk)
> >> +
> >> +          if (smallblks == usb_max_xfer_blk)
> >>                    usb_show_progress();
> >> +
> >>            srb->datalen = block_dev->blksz * smallblks;
> >>            srb->pdata = (unsigned char *)buf_addr;
> >>            if (usb_read_10(srb, ss, start, smallblks)) {
> >>                    debug("Read ERROR\n");
> >>                    ss->flags &= ~USB_READY;
> >>                    usb_request_sense(srb, ss);
> >> +
> >> +                  /* Dynamically reduce the I/O size. */
> >> +                  if (smallblks > 2047) {
> >> +                          usb_max_xfer_blk = 2047;
> >> +                          ++retry;
> >> +                  } else if (smallblks > 512) {
> >> +                          usb_max_xfer_blk = 512;
> >> +                          ++retry;
> >> +                  } else if (smallblks > 511) {
> >> +                          usb_max_xfer_blk = 511;
> >> +                          ++retry;
> >> +                  } else if (smallblks > 63) {
> >> +                          usb_max_xfer_blk = 63;
> >> +                          retry += 2;
> >> +                  }
> >> +                  debug("step down usb_max_xfer_blk to
> >> %d\n",
> >> +                        usb_max_xfer_blk);
> >> +  
> > 
> > This shall not be needed, as the ehci controller's driver will
> > scatter the transfers up till 240 blocks max. Am I right?
> > 
> > The above code was supposed to work before the 240 block limitation
> > was in place?  
> 
> Correct, this code is wrong.

Ok.

> 
> [...]
> 
> >> +  qhtoken = hc32_to_cpu(qh->qh_overlay.qt_token);
> >>    if (!(QT_TOKEN_GET_STATUS(qhtoken) &
> >> QT_TOKEN_STATUS_ACTIVE)) { debug("TOKEN=%#x\n", qhtoken);
> >> +          if (qhtoken & QT_TOKEN_STATUS_XACTERR) {  
> >                           ^^^^^^^^^^^^^^^^^^^^^^^^ - this flag
> >                           before this patch was also not
> > checked. 
> >> +                  if (--trynum >= 0) {
> >> +                          /*
> >> +                           * It is necessary to do this,
> >> otherwise the
> >> +                           * disk is clagged.
> >> +                           */
> >> +                          debug("reset the TD and redo,
> >> because of XACTERR\n");
> >> +                          qhtoken &=
> >> ~QT_TOKEN_STATUS_HALTED;
> >> +                          qhtoken |= QT_TOKEN_STATUS_ACTIVE
> >> |
> >> +                                  QT_TOKEN_CERR(2);
> >> +                          vtd->qt_token =
> >> cpu_to_hc32(qhtoken);
> >> +                          qh->qh_overlay.qt_token =
> >> cpu_to_hc32(qhtoken);
> >> +                          goto retry_xacterr;
> >> +                  }
> >> +                  dev->status = USB_ST_XACTERR;
> >> +                  dev->act_len = length -
> >> +                          QT_TOKEN_GET_TOTALBYTES(qhtoken);
> >>  
> > 
> > This may solve the issue - as it resets the controller in the case
> > of transfer error.  
> 
> Do you need to reset the controller ? Or is there some graceful way
> out, i.e. a way to recover from the error ?

The original patch was developed when the async transmission was start
and stopped for each transfer (before your optimization), hence the code
for disabling it.

I need to check if it would be enough to just clear the
QT_TOKEN_STATUS_HALTED and set QT_TOKEN_STATUS_ACTIVE again.

> [...]




Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lu...@denx.de

pgpuf_gXf3kRD.pgp
Description: OpenPGP digital signature

Re: [PATCH] usb: ehci: Fix "EHCI timed out on TD - token=XXXX" error on ehci-hcd

Reply via email to