On Wed, Mar 24, 2021 at 02:05:27PM -0600, Theo de Raadt wrote:

> Marcus Glocker <[email protected]> wrote:
> 
> > On Tue, Mar 23, 2021 at 09:52:42AM -0600, Theo de Raadt wrote:
> > 
> > > Mark Kettenis <[email protected]> wrote:
> > > 
> > > > > > Index: sys/dev/sdmmc/sdmmc_scsi.c
> > > > > > ===================================================================
> > > > > > RCS file: /cvs/src/sys/dev/sdmmc/sdmmc_scsi.c,v
> > > > > > retrieving revision 1.59
> > > > > > diff -u -p -u -p -r1.59 sdmmc_scsi.c
> > > > > > --- sys/dev/sdmmc/sdmmc_scsi.c      15 Oct 2020 13:22:13 -0000      
> > > > > > 1.59
> > > > > > +++ sys/dev/sdmmc/sdmmc_scsi.c      23 Mar 2021 07:32:52 -0000
> > > > > > @@ -365,9 +365,8 @@ sdmmc_scsi_cmd(struct scsi_xfer *xs)
> > > > > >     /* A read or write operation. */
> > > > > >     sdmmc_scsi_decode_rw(xs, &blockno, &blockcnt);
> > > > > >  
> > > > > > -   if (blockno >= tgt->card->csd.capacity ||
> > > > > > -       blockno + blockcnt > tgt->card->csd.capacity) {
> > > > > > -           DPRINTF(("%s: out of bounds %u-%u >= %u\n", DEVNAME(sc),
> > > > > > +   if (blockno + blockcnt >= tgt->card->csd.capacity) {
> > > > > > +           DPRINTF(("%s: out of bounds %u+%u >= %u\n", DEVNAME(sc),
> > > > > >                 blockno, blockcnt, tgt->card->csd.capacity));
> > > > > >             xs->error = XS_DRIVER_STUFFUP;
> > > > > >             scsi_done(xs);
> > > > 
> > > > Apart from a potential overflow, the original condition looks correct
> > > > to me.  If csd.capacity is 16 and blockno is 0 and blockcnt is 16, the
> > > > operation should succeed, shouldn't it?
> > 
> > Yes, you're right.  In my case the patch suppressed the issue by
> > skipping the last block of the disk.
> >  
> > > Hmm.  Maybe this is a specific SD card with a bug accessing the last 
> > > sector.
> > > 
> > > A manual dd using
> > > 
> > >   skip='c partition size -1' count=1 bs=512
> > > 
> > > Is failing on that SD card, but it is working on other SD cards.  Which 
> > > appear
> > > to all be SD v2 cards.
> > > 
> > > Maybe his specific SD card has an off-by-one bug relating to the last 
> > > sector,
> > > and we need to determine a way to handle that, or skip the last sector on
> > > all devices of this sub-class?
> > > 
> > > I don't see any issue in the controller code.   Marcus, can you move the 
> > > card
> > > to another machine (different controller) and use dd to read the last 
> > > sector?
> > > Though I suspect that might not provide enough clue either, the "fail to 
> > > read"
> > > behaviour might vary between controllers...
> > > 
> > > > > blockno and blockcnt are both 32-bit, so this feels dangerously close
> > > > > to integer wrap, and I believe the first condition cannot be deleted.
> > > > 
> > > > But the second condition can still wrap.  So it probably needs to be
> > > > something like:
> > > > 
> > > >    if (blockno >= tgt->card->csd.capacity ||
> > > >        blockcnt > tgt->card->csd.capacity - blockno)
> > > 
> > > All 3 variables are int.  Not sure how moving the + to - on the other
> > > side changes anything.
> > 
> > In the meantime I figured out how to trigger the SD controller command
> > to fail.  It always happens if you read more than one block through one
> > command, where the last block is accessing the last disk block.  If you
> > only access the last disk block directly, the command doesn't fail.
> > Some examples:
> > 
> >     # dd if=/dev/rsd0c of=/dev/null skip=15591935 bs=4096 count=1
> >     nblks=8, datalen=4096, blklen=512, arg=0x76f4ff8, cmd=0x8052
> >     CMD FAIL!
> >     bcmsdhost0: transfer timeout!
> > 
> >     # dd if=/dev/rsd0c of=/dev/null skip=31183871 bs=2048 count=1
> >     CMD FAIL!
> >     bcmsdhost0: transfer timeout!
> > 
> >     # dd if=/dev/rsd0c of=/dev/null skip=62367743 bs=1024 count=1
> >     CMD FAIL!
> >     bcmsdhost0: transfer timeout!
> > 
> >     # dd if=/dev/rsd0c of=/dev/null skip=124735487 bs=512 count=1
> >     < all good >
> > 
> > I don't think it's an issue with the microSD card since I have NetBSD
> > installed on a different card of exactly the same type/size, and there I
> > also can trigger the issue:
> > 
> >     # dd if=/dev/rld0c of=/dev/null skip=15591935 bs=4096 count=1
> >     [ 179.4170221] ld0c: error reading fsbn 124735480 of 124735480-124735487
> >     (ld0 bn 124735480; cn 60905 tn 63 sn 24), retrying
> > 
> > Also I have tried that on the Pine64/sximmc(4) with the same microSD
> > card, and there all is hunky-dory.
> > 
> > It seems to be a behavior/bug of the BCM2835 SD controller.  But I'm
> > not sure what's the right approach to work around that in the driver, or
> > even if there is a real fix by changing the command code.
> > 
> 
> I think the bug is in bcm2835_sdhost.c bcmsdhost_exec_command()
> 
> This chunk is fishy, why is it increasing the transaction size?
> 
>                 if (nblks == 0 || (cmd->c_datalen % cmd->c_blklen) != 0)
>                         ++nblks;
> 
> I guess this does not occur for this operation, but still it is weird.
> 
> Maybe investigate which of the "goto done" cases gets hit; I am pretty
> sure the controller is indicating which failure mode occurs.
> 
> Surely c_error isn't zero when this function completes?

Yes, I did that already.  It happens here:

        if (ISSET(bcmsdhost_read(sc, SDCMD), SDCMD_FAIL)) {
                cmd->c_error = EIO;
                goto done;
        }

The error can be found in the SDHSTS register, and it's:

        CMD FAIL! SDHSTS=0x80
        bcmsdhost0: transfer timeout!

0x80 = SDHSTS_REW_TO.  What ever 'REW' means (Read/Write)?!
 
> > I mean one can just change the disklabel to skip the last block, but
> > it's not really a solution since somebody lazy like me, who does a
> > very simple disklabel where 'a'=root, and 'b'=swap is the rest of the
> > disk can easily trigger that "bug" through e.g. savecore(8) which we
> > run by default during boot.  At least in this case it's a very good
> > debugger :-)
> 
> We would need to change the disklabel on all existing and future SD
> cards, since the cards are removable and can move between controllers.
> That won't work.

Reply via email to