On Wed, Mar 24, 2021 at 02:05:27PM -0600, Theo de Raadt wrote:
> Marcus Glocker <[email protected]> wrote:
>
> > On Tue, Mar 23, 2021 at 09:52:42AM -0600, Theo de Raadt wrote:
> >
> > > Mark Kettenis <[email protected]> wrote:
> > >
> > > > > > Index: sys/dev/sdmmc/sdmmc_scsi.c
> > > > > > ===================================================================
> > > > > > RCS file: /cvs/src/sys/dev/sdmmc/sdmmc_scsi.c,v
> > > > > > retrieving revision 1.59
> > > > > > diff -u -p -u -p -r1.59 sdmmc_scsi.c
> > > > > > --- sys/dev/sdmmc/sdmmc_scsi.c 15 Oct 2020 13:22:13 -0000
> > > > > > 1.59
> > > > > > +++ sys/dev/sdmmc/sdmmc_scsi.c 23 Mar 2021 07:32:52 -0000
> > > > > > @@ -365,9 +365,8 @@ sdmmc_scsi_cmd(struct scsi_xfer *xs)
> > > > > > /* A read or write operation. */
> > > > > > sdmmc_scsi_decode_rw(xs, &blockno, &blockcnt);
> > > > > >
> > > > > > - if (blockno >= tgt->card->csd.capacity ||
> > > > > > - blockno + blockcnt > tgt->card->csd.capacity) {
> > > > > > - DPRINTF(("%s: out of bounds %u-%u >= %u\n", DEVNAME(sc),
> > > > > > + if (blockno + blockcnt >= tgt->card->csd.capacity) {
> > > > > > + DPRINTF(("%s: out of bounds %u+%u >= %u\n", DEVNAME(sc),
> > > > > > blockno, blockcnt, tgt->card->csd.capacity));
> > > > > > xs->error = XS_DRIVER_STUFFUP;
> > > > > > scsi_done(xs);
> > > >
> > > > Apart from a potential overflow, the original condition looks correct
> > > > to me. If csd.capacity is 16 and blockno is 0 and blockcnt is 16, the
> > > > operation should succeed, shouldn't it?
> >
> > Yes, you're right. In my case the patch suppressed the issue by
> > skipping the last block of the disk.
> >
> > > Hmm. Maybe this is a specific SD card with a bug accessing the last
> > > sector.
> > >
> > > A manual dd using
> > >
> > > skip='c partition size -1' count=1 bs=512
> > >
> > > Is failing on that SD card, but it is working on other SD cards. Which
> > > appear
> > > to all be SD v2 cards.
> > >
> > > Maybe his specific SD card has an off-by-one bug relating to the last
> > > sector,
> > > and we need to determine a way to handle that, or skip the last sector on
> > > all devices of this sub-class?
> > >
> > > I don't see any issue in the controller code. Marcus, can you move the
> > > card
> > > to another machine (different controller) and use dd to read the last
> > > sector?
> > > Though I suspect that might not provide enough clue either, the "fail to
> > > read"
> > > behaviour might vary between controllers...
> > >
> > > > > blockno and blockcnt are both 32-bit, so this feels dangerously close
> > > > > to integer wrap, and I believe the first condition cannot be deleted.
> > > >
> > > > But the second condition can still wrap. So it probably needs to be
> > > > something like:
> > > >
> > > > if (blockno >= tgt->card->csd.capacity ||
> > > > blockcnt > tgt->card->csd.capacity - blockno)
> > >
> > > All 3 variables are int. Not sure how moving the + to - on the other
> > > side changes anything.
> >
> > In the meantime I figured out how to trigger the SD controller command
> > to fail. It always happens if you read more than one block through one
> > command, where the last block is accessing the last disk block. If you
> > only access the last disk block directly, the command doesn't fail.
> > Some examples:
> >
> > # dd if=/dev/rsd0c of=/dev/null skip=15591935 bs=4096 count=1
> > nblks=8, datalen=4096, blklen=512, arg=0x76f4ff8, cmd=0x8052
> > CMD FAIL!
> > bcmsdhost0: transfer timeout!
> >
> > # dd if=/dev/rsd0c of=/dev/null skip=31183871 bs=2048 count=1
> > CMD FAIL!
> > bcmsdhost0: transfer timeout!
> >
> > # dd if=/dev/rsd0c of=/dev/null skip=62367743 bs=1024 count=1
> > CMD FAIL!
> > bcmsdhost0: transfer timeout!
> >
> > # dd if=/dev/rsd0c of=/dev/null skip=124735487 bs=512 count=1
> > < all good >
> >
> > I don't think it's an issue with the microSD card since I have NetBSD
> > installed on a different card of exactly the same type/size, and there I
> > also can trigger the issue:
> >
> > # dd if=/dev/rld0c of=/dev/null skip=15591935 bs=4096 count=1
> > [ 179.4170221] ld0c: error reading fsbn 124735480 of 124735480-124735487
> > (ld0 bn 124735480; cn 60905 tn 63 sn 24), retrying
> >
> > Also I have tried that on the Pine64/sximmc(4) with the same microSD
> > card, and there all is hunky-dory.
> >
> > It seems to be a behavior/bug of the BCM2835 SD controller. But I'm
> > not sure what's the right approach to work around that in the driver, or
> > even if there is a real fix by changing the command code.
> >
>
> I think the bug is in bcm2835_sdhost.c bcmsdhost_exec_command()
>
> This chunk is fishy, why is it increasing the transaction size?
>
> if (nblks == 0 || (cmd->c_datalen % cmd->c_blklen) != 0)
> ++nblks;
>
> I guess this does not occur for this operation, but still it is weird.
>
> Maybe investigate which of the "goto done" cases gets hit; I am pretty
> sure the controller is indicating which failure mode occurs.
>
> Surely c_error isn't zero when this function completes?
Yes, I did that already. It happens here:
if (ISSET(bcmsdhost_read(sc, SDCMD), SDCMD_FAIL)) {
cmd->c_error = EIO;
goto done;
}
The error can be found in the SDHSTS register, and it's:
CMD FAIL! SDHSTS=0x80
bcmsdhost0: transfer timeout!
0x80 = SDHSTS_REW_TO. What ever 'REW' means (Read/Write)?!
> > I mean one can just change the disklabel to skip the last block, but
> > it's not really a solution since somebody lazy like me, who does a
> > very simple disklabel where 'a'=root, and 'b'=swap is the rest of the
> > disk can easily trigger that "bug" through e.g. savecore(8) which we
> > run by default during boot. At least in this case it's a very good
> > debugger :-)
>
> We would need to change the disklabel on all existing and future SD
> cards, since the cards are removable and can move between controllers.
> That won't work.