Marcus Glocker <[email protected]> wrote:
> On Tue, Mar 23, 2021 at 09:52:42AM -0600, Theo de Raadt wrote:
>
> > Mark Kettenis <[email protected]> wrote:
> >
> > > > > Index: sys/dev/sdmmc/sdmmc_scsi.c
> > > > > ===================================================================
> > > > > RCS file: /cvs/src/sys/dev/sdmmc/sdmmc_scsi.c,v
> > > > > retrieving revision 1.59
> > > > > diff -u -p -u -p -r1.59 sdmmc_scsi.c
> > > > > --- sys/dev/sdmmc/sdmmc_scsi.c 15 Oct 2020 13:22:13 -0000
> > > > > 1.59
> > > > > +++ sys/dev/sdmmc/sdmmc_scsi.c 23 Mar 2021 07:32:52 -0000
> > > > > @@ -365,9 +365,8 @@ sdmmc_scsi_cmd(struct scsi_xfer *xs)
> > > > > /* A read or write operation. */
> > > > > sdmmc_scsi_decode_rw(xs, &blockno, &blockcnt);
> > > > >
> > > > > - if (blockno >= tgt->card->csd.capacity ||
> > > > > - blockno + blockcnt > tgt->card->csd.capacity) {
> > > > > - DPRINTF(("%s: out of bounds %u-%u >= %u\n", DEVNAME(sc),
> > > > > + if (blockno + blockcnt >= tgt->card->csd.capacity) {
> > > > > + DPRINTF(("%s: out of bounds %u+%u >= %u\n", DEVNAME(sc),
> > > > > blockno, blockcnt, tgt->card->csd.capacity));
> > > > > xs->error = XS_DRIVER_STUFFUP;
> > > > > scsi_done(xs);
> > >
> > > Apart from a potential overflow, the original condition looks correct
> > > to me. If csd.capacity is 16 and blockno is 0 and blockcnt is 16, the
> > > operation should succeed, shouldn't it?
>
> Yes, you're right. In my case the patch suppressed the issue by
> skipping the last block of the disk.
>
> > Hmm. Maybe this is a specific SD card with a bug accessing the last sector.
> >
> > A manual dd using
> >
> > skip='c partition size -1' count=1 bs=512
> >
> > Is failing on that SD card, but it is working on other SD cards. Which
> > appear
> > to all be SD v2 cards.
> >
> > Maybe his specific SD card has an off-by-one bug relating to the last
> > sector,
> > and we need to determine a way to handle that, or skip the last sector on
> > all devices of this sub-class?
> >
> > I don't see any issue in the controller code. Marcus, can you move the
> > card
> > to another machine (different controller) and use dd to read the last
> > sector?
> > Though I suspect that might not provide enough clue either, the "fail to
> > read"
> > behaviour might vary between controllers...
> >
> > > > blockno and blockcnt are both 32-bit, so this feels dangerously close
> > > > to integer wrap, and I believe the first condition cannot be deleted.
> > >
> > > But the second condition can still wrap. So it probably needs to be
> > > something like:
> > >
> > > if (blockno >= tgt->card->csd.capacity ||
> > > blockcnt > tgt->card->csd.capacity - blockno)
> >
> > All 3 variables are int. Not sure how moving the + to - on the other
> > side changes anything.
>
> In the meantime I figured out how to trigger the SD controller command
> to fail. It always happens if you read more than one block through one
> command, where the last block is accessing the last disk block. If you
> only access the last disk block directly, the command doesn't fail.
> Some examples:
>
> # dd if=/dev/rsd0c of=/dev/null skip=15591935 bs=4096 count=1
> nblks=8, datalen=4096, blklen=512, arg=0x76f4ff8, cmd=0x8052
> CMD FAIL!
> bcmsdhost0: transfer timeout!
>
> # dd if=/dev/rsd0c of=/dev/null skip=31183871 bs=2048 count=1
> CMD FAIL!
> bcmsdhost0: transfer timeout!
>
> # dd if=/dev/rsd0c of=/dev/null skip=62367743 bs=1024 count=1
> CMD FAIL!
> bcmsdhost0: transfer timeout!
>
> # dd if=/dev/rsd0c of=/dev/null skip=124735487 bs=512 count=1
> < all good >
>
> I don't think it's an issue with the microSD card since I have NetBSD
> installed on a different card of exactly the same type/size, and there I
> also can trigger the issue:
>
> # dd if=/dev/rld0c of=/dev/null skip=15591935 bs=4096 count=1
> [ 179.4170221] ld0c: error reading fsbn 124735480 of 124735480-124735487
> (ld0 bn 124735480; cn 60905 tn 63 sn 24), retrying
>
> Also I have tried that on the Pine64/sximmc(4) with the same microSD
> card, and there all is hunky-dory.
>
> It seems to be a behavior/bug of the BCM2835 SD controller. But I'm
> not sure what's the right approach to work around that in the driver, or
> even if there is a real fix by changing the command code.
>
I think the bug is in bcm2835_sdhost.c bcmsdhost_exec_command()
This chunk is fishy, why is it increasing the transaction size?
if (nblks == 0 || (cmd->c_datalen % cmd->c_blklen) != 0)
++nblks;
I guess this does not occur for this operation, but still it is weird.
Maybe investigate which of the "goto done" cases gets hit; I am pretty
sure the controller is indicating which failure mode occurs.
Surely c_error isn't zero when this function completes?
> I mean one can just change the disklabel to skip the last block, but
> it's not really a solution since somebody lazy like me, who does a
> very simple disklabel where 'a'=root, and 'b'=swap is the rest of the
> disk can easily trigger that "bug" through e.g. savecore(8) which we
> run by default during boot. At least in this case it's a very good
> debugger :-)
We would need to change the disklabel on all existing and future SD
cards, since the cards are removable and can move between controllers.
That won't work.