On 6/9/23 20:52, Xavier Drudis Ferran wrote:
Sorry, I replied to Marek only but meant to reply to all.

El Fri, Jun 09, 2023 at 03:20:33AM +0200, Marek Vasut deia:

No. Well, in some tests yes and some no, but I got the error in all cases.

This is doubtful. It is mandatory to run 'usb start' or 'usb reset' before
you would get any meaningful result out of 'usb info'. Without either, you
would get 'USB is stopped.' message. Could it be there are some extra
scripts in your environment that manipulate the USB ?


I saw usb_boootdev_hunt() calls usb_init() in drivers/usb/host/usb_bootdev.c

But maybe I don't get that called and it's really something silly in
my setup as you say later... Maybe it doesn't get called unless it
finds nothing else useful to boot.


Can you test with stock U-Boot ?


I don't know. I'll see if I have time ...
I'd rather read the code to understand what's the condition for finding 
bootdevices...

Can you test with another USB stick, i.e. is the issue specific to this USB
stick ?


I could test this, yes.

Is the issue specific to this partition layout of this USB stick, i.e. if
you clone (dd if=... of=...) the content of the USB stick to another USB
stick, does the error still occur.


I'll try to partition and flash a new USB.

[...]

Model: Radxa ROCK Pi 4B
Net:   eth0: ethernet@fe300000
Hit any key to stop autoboot:  2  1  0
rockchip_pcie pcie@f8000000: PCIe link training gen1 timeout!
Bus usb@fe380000: USB EHCI 1.00
Bus usb@fe3c0000: USB EHCI 1.00
Bus usb@fe800000: Register 2000140 NbrPorts 2
Starting the controller
USB XHCI 1.10
Bus usb@fe900000: Register 2000140 NbrPorts 2
Starting the controller
USB XHCI 1.10
scanning bus usb@fe380000 for devices... 1 USB Device(s) found
scanning bus usb@fe3c0000 for devices... 2 USB Device(s) found
scanning bus usb@fe800000 for devices... 1 USB Device(s) found
scanning bus usb@fe900000 for devices... 1 USB Device(s) found
rockchip_pcie pcie@f8000000: failed to find ep-gpios property
ethernet@fe300000 Waiting for PHY auto negotiation to complete......... TIMEOUT 
!
Could not initialize PHY ethernet@fe300000
rockchip_pcie pcie@f8000000: failed to find ep-gpios property
ethernet@fe300000 Waiting for PHY auto negotiation to complete......... TIMEOUT 
!
Could not initialize PHY ethernet@fe300000

Is this some $preboot script which initializes your hardware ?


Mmm... yes, I used to have it... I thought not in this test, but I'd better 
recheck

Anyway, one should be allowed to stop the boot, call usb start and usb tree
and don't get a reset, shouldn't one?

=> printenv preboot


I'll send this later when I repeat the test. I'd like to find a
minimal test case or something...

Thank you

=> usb tree
USB device tree:
    1  Hub (480 Mb/s, 0mA)
       u-boot EHCI Host Controller
    1  Hub (480 Mb/s, 0mA)
    |  u-boot EHCI Host Controller
    |
                         uclass_id=64
    |+-2  Mass Storage (480 Mb/s, 200mA)
         TDK LoR TF10 07032B6B1D0ACB96
                         uclass_id=22
                         uclass_id=25
       "Synchronous Abort" handler, esr 0x96000010, far 0x948
elr: 00000000002157d4 lr : 00000000002157d4 (reloc)
elr: 00000000f3f4f7d4 lr : 00000000f3f4f7d4

Take the u-boot (unstripped elf) which matches this binary, and run
aarch64-...objdump -lSD on it, then search for the $lr value, see
doc/develop/crash_dumps.rst for details. That should tell you where exactly
the crash occurred. Where did it occur ?


I didn't do it exactly so, but from u-boot.map I gathered that it was
in cmd/usb.c and the fact that my patch fixed it implies the problem
is the functions usb_show_tree_graph() or usb_show_info() get called
recursively with null as a first parameter.

Now I don't have that u-boot.map anymore and would have to repeat the
experiment, to find out exactly as you say, so I won't do it right
now. But thanks, understood.

The reason usb_show_tree_graph() gets called with a null usb_device *
is that the code in cmd/usb.c for usb info and usb tree assumes
everything a UCLASS_MASS_STORAGE device can have as children are
devices with some usb_device in their dev_get_parent_priv().  It
carves out exceptions to this general rule for UCLASS_USB_EMUL and
UCLASS_BLK, but not for UCLASS_BOOTDEV. When it finds a child that is
UCLASS_BOOTDEV it happily recurses on it passing its parent_priv as
usb_device, but the bootdev code did not put any usb_device there,
it's null. So the first access causes a null pointer dereference.

I would have to wrap my mind around more code to start understanding
if it's better to give that UCLASS_BOOTDEV some usb_device as parent
priv data, or it is better to give USB devices that can be enumerated
for listing (usb tree or usb info) some RECURSIBLE flag that indicates
their priv parent data is reliably a usb_device.

So checking that the alledged usb_device at least isn't null as in my
patch is possibly a partial solution. I'm sure if it's null we
shouldn't call, but if it points to something other than a usb_device we
shouldn't either, and it doesn't address why it is null (well, because
it's not really a USB internal node, not even a proper leaf, so it
shouldn't be recursed anyway).

In usb_show_info() it is similar, usb_display_desc() gets called with a
null udev because that's what came in. Recursion is avoided for
UCLASS_USB_EMUL or UCLASS_BLK but not for UCLASS_BOOTDEV.

A different solution could be to expand the exception to
UCLASS_BOOTDEV, but it still seems a wrong strategy to expect to know
everything you can find so you can list all the exceptions, instead of
checking that you have something expected that you can recurse on. Not
completely sure, just smells so to me. At least checking for null is
more general.

Maybe the solution is to fix common/usb_storage.c when it calls
bootdev_setup_sibling_blk(), to ensure there's some useful usb_device
there as parent priv. Or something needs fixing in
drivers/usb/host/usb_bootdev.c ??? But I'm not sure. We really
shouldn't call recursively for usb info or tree anyway.

For now I was trying to understand when that UCLASS_BOOTDEV is added
and why, and whether it should be removed at some time. This should
hint me at what is the minimum scenario to reproduce the issue.


The NAK is really only to prevent this from accidentally going on.


The NAK it's OK. I like people to want to understand stuff. Don't worry.

s@going on@going in@ ... typo.

Please see above, maybe it could be narrowed down ?

I'll see if I can test better and send more useful reports.
Not sure when.

I'm not sure I'll have the time to learn all I need. I just hoped someone
else had it in mind...

I reproducer would be a good starting point, i.e. something like "I run these ... commands and I get this crash" . But please make sure you use current codebase and no preboot or similar scripting .

Reply via email to