Heyya, misc@, I'm very new to using OpenBSD for anything more than what's 'on the tin' (DNS RR, Router, etc), and have got myself stuck and have tried during nights and weekends for about a week to try and unwedge myself here, unsuccessfully. I'm hoping someone can help point me to a OpenBSD kernel for newbies guide/help of some sort.
I've emailed a few lists, but so far everyone either looked the other way quickly[0], didn't know, or didn't have time to help me out (fair enough!) I've been trying to take a library[1] I use on my Linux boxen, and coax it into working on OpenBSD[2], and have been able to get a compiled .so that looks good, with the exception of the USB transport. Given the history of related libraries having similar-sounding[3] issues[4] with libusb1, I suspect the issue isn't in the library I'm trying to port, rather, in libusb1's OpenBSD support, or in (*shudder*) the kernel. Either that or this is a common libusb 'gotcha' that everyone eventually finds and patches that presents itself on OpenBSD by always locking up. To get my library working on OpenBSD, I've had to use -snapshot (it requires waitid(2)), and to debug the system (which is now entirely just for testing this singular problem), I've built a -current kernel. Specifically, it's mostly based on the source tree that the Git mirror knows as e3f6ba90cc00f3d7457f857a0fd00f2b435bc0ec (Wed, Mar 29th, 2023).[5] Everything below is the state as I understand it while the program is locked up: [Split-cut to: userland] When reading from the device, my 3-line test program (3-line is disengenious here since it's calling into a library that calls into a library that eventually talks to the OS) eventually hangs while invoking 'libusb_submit_transfer', which is, specifically, hung on a read(2) against a ugen device (specifically, the read in '_sync_gen_transfer' in the libusb OpenBSD implementation). A pthread in the background is continuing to exercise the 'libusb_handle_events' endpoint, to no avail. I asked the upstream list about this[6], no reply yet. [Fade to: kernelspace] [Scene one: read(2) syscall] The kernelspace end of the read(2) has wound up in `usbd_transfer`, roundabout line 406 of usbdi.c, waiting for the xfer on the other end of the endpoint's pipe to be complete. This intrepid syscall earnestly stands by, waiting for the day when it'll be called back. [Scene two: ugen driver state] Everything is using USB3/xhci. That's the only bus on the system, and this device communicates using USB3 in this case. The ugen device (other USB devices are not a factor as far as I can tell, and will continue to operate, even in the locked state, interestingly) looks something like: ugen device unit 1: - endpoint 2: dir=0 (OUT) -- NULL pipe - endpoint 4: dir=0 (OUT) -- NULL pipe - endpoint 6: dir=1 (IN) -- NULL pipe - endpoint 8: dir=1 (IN) -- pipe alloc'd The one related endpoint, which is endpoint 8 here, has a refcount of 1. It's running (not aborting). Here's the output of some USB_DEBUG knobs: usbd_dump_device: dev=0xffff800001291500 bus=0xffff800000132000 default_pipe=0xffff8000001bd000 address=3 config=1 depth=1 speed=4 self_powered=0 power=2 langid=1033 pipe=0xffff8000008b6000 usbd_dump_endpoint: endp=0xffff8000012d6ee0 edesc=0xffff8000001b785d refcnt=1 bEndpointAddress=0x88 (usbd_dump_pipe:) running=1 aborting=0 intrxfer=0x0, repeat=0, interval=-1 Since this doesn't give me a lot of insight into the xfer state, I wrote a small stub and threw it into a dark corner where I could let it be, and have the following from it: pipe=0xffff8000008b6000 xfer=0xfffffd827f59daf0 status=1 done=0 length=1024 flags=0x16 timeout=0 My understanding is that this means that endpoint 8 has an active pipe that is running (not aborting), with an in-flight xfer which is IN_PROGRESS, with the (USBD_SYNCHRONOUS | USBD_SHORT_XFER_OK | USBD_CATCH) flags set [0x02 | 0x04 | 0x10 - as seen in usbdi.h]. Even while waiting a signficant (in human-time) length of time, the xfer remains in this state, and is never marked as done until I C-c the process, and everything cleans up on close(2)/EINTR [Cut to: author] Right, thank you for sticking around. I'm looking for some help for what steps to take next on trying to isolate what may be blocking the xfer from being processed -- and trying to figure out what userland knobs are invoking kernelspace in a way that's causing it. Since other libraries are working OK, I have to assume this isn't actually a kernelspace bug, but I'd feel a lot better root causing it. During the wedged state, xhci messages don't come through after the transfer was submitted (I have xhcidebug, as well as other {usb,ugen,*}debug set high), and I've very quickly got myself lost past this point. This makes me suspect some sort of inturupt is maybe being suspended while a call is in progress; I just have no idea where to start with that theory, much less trick 'ddb' into prying that information from RAM. I've been able to set up the serial console, and get 'ddb' working well, but I am having a hard time using it without poking myself on the sharp bits. Is there a good way to explore the wedged system that isn't using ddb off a serial line? Looks like 'ddb' can 'boot crash'; is there a good workflow documented there? As I mentioned above, my gut was pushing me look into checking if the USB intrurupts are inhibited somehow, as well as checking for lock contention (although, all the kernelspace locks I've instrumented in the specific files i'm in are properly locked/unlocked during the program's runtime). Failing that we're onto plans G, H, I and J. Right, so the ask: 0) Is there a good place to have this conversation? I don't see topical "usb subsystem interest group" mailing list(s) where this may be less tedious to most of the readers. I tried ports@[7], but I don't think that list was right either. I feel like a nunsense posting on all these lists right now. 1) How would I go about checking into some of the above? Does that sound like an interesting idea? Am I wasting my time? What tools exist that could help me out here? 2) Does anyone have other good ideas for what could be to blame here? Similar bugs that you may know about? 3) Is there a guide to the OpenBSD kernel or something? I hadn't so much as built an OpenBSD kernel until last Sunday, so I know very little about how this is all working, much less conventions or internal architecture of a BSD kernel. FWIW it's been very pleasent compared to building and running a modified Linux kernel. Fondly, paultag [0]: <insert gif of grandpa simpson walking out of the bar from The Simpsons episode "Bart After Dark"> [1]: libuhd [2]: https://github.com/EttusResearch/uhd/pull/668 [3]: https://marc.info/?l=libusb-devel&m=144466221830349&w=2 - although interestingly the version in ports doesn't have any patches; I suspect something got fixed in libhackrf, libusb1 or the kernel to fix that during the last decade-ish. I confirmed libhackrf functions normally on my system. [4]: comms/rtl-sdr has a few patches, such as patch-src_rtl_fm_c and their ilk that comprehensively removes async code from the binaries -- although this, too, works unpatched (from upstream sources) now on -current, with the exception of a few patches I sent in for CMake[4.1], which I don't think will get merged at this rate. I confirmed librtlsdr workes normally on my system. [4.1]: https://lists.osmocom.org/hyperkitty/list/osmocom-...@lists.osmocom.org/thread/NOIG7LNP3VXM4JQYVVVXJ4TZR6IDQCQJ/ [5]: "Mostly" in this case means that I toggled a few options, and added a bunch of DPRINTFN and an ioctl stub for me to extract kernelspace state without launching ddb and starting over again. I pop my patches off when they get to be too much now and again. The specific options as of the writing of this email: ``` option MULTIPROCESSOR option MP_LOCKDEBUG option WITNESS option UGEN_DEBUG option USB_DEBUG option XHCI_DEBUG ``` [6]: https://sourceforge.net/p/libusb/mailman/message/37796867/ [7]: https://marc.info/?t=167993667800007&r=1&w=2 -- :wq