Re: [SUCCESS] Debugging/fixing a kernel stalled not crashing
On Sun, Aug 21, 2022 at 03:25:36PM +, Emmanuel Dreyfus wrote:
> On Sun, Aug 21, 2022 at 02:16:58PM +0200, tlaro...@polynum.com wrote:
> > Addition (asked by Taylor R Campbell): a current GENERIC boots only
> > with i915drmkms disabled.
> >
> > With the framebuffer stuff enabled, it does not boot, and does not even
> > panic and reboot. It freezes somewhere. The same as the 9.x series.
>
> I have a machine that randomly crashes during boot since we had the
> Linux 5.x DRM import. The feature is still an asset, since it supports
> the GPU that was not supported before, but it suggests booting with DRM
> based framebuffer is more fragile than booting without. Perhaps we need
> a boot flag to disable framebuffer?

I feel the same: a generic flag to disable the framebuffer via userconf
would be better than explicitly listing all the drivers.

At the very least, people installing on a server should be advised to
try with the framebuffer disabled first, to see if NetBSD boots, and to
try with it enabled only afterwards. When one installs on a remote
server, without seeing anything of the boot process[*], it is quite
frustrating.

*: I plan to play a little with UEFI EDKII to see if installing it and
dealing with an Ethernet card EFI Runtime driver (persistent after
exiting boot) could be a solution for remote debugging. But no schedule
is set, so don't hold your breath; it's vaporware for the moment.

Another idea: write messages to a memory area left untouched by both
UEFI and NetBSD so that, after rebooting into UEFI (in case of a crash),
a UEFI application could dump that memory somewhere on the disk, in the
EFI partition, for post-mortem inspection.
-- 
Thierry Laronde
http://www.kergis.com/
http://kertex.kergis.com/
http://www.sbfa.fr/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
Re: [SUCCESS] Debugging/fixing a kernel stalled not crashing
On Sun, Aug 21, 2022 at 02:16:58PM +0200, tlaro...@polynum.com wrote:
> Addition (asked by Taylor R Campbell): a current GENERIC boots only
> with i915drmkms disabled.
>
> With the framebuffer stuff enabled, it does not boot, and does not even
> panic and reboot. It freezes somewhere. The same as the 9.x series.

I have a machine that randomly crashes during boot since we had the
Linux 5.x DRM import. The feature is still an asset, since it supports
a GPU that was not supported before, but it suggests that booting with
a DRM-based framebuffer is more fragile than booting without. Perhaps
we need a boot flag to disable the framebuffer?

-- 
Emmanuel Dreyfus
m...@netbsd.org
Re: [SUCCESS] Debugging/fixing a kernel stalled not crashing
Addition (asked by Taylor R Campbell): a current GENERIC boots only
with i915drmkms disabled.

With the framebuffer stuff enabled, it does not boot, and does not even
panic and reboot. It freezes somewhere. The same as the 9.x series.
-- 
Thierry Laronde
Re: [SUCCESS] Debugging/fixing a kernel stalled not crashing
A final point:

Context: I rent a bare-metal server (OVH) that has a quad-core Intel
Xeon (Ivy Bridge), 16GB of RAM, three 2TB disks and an Intel PRO 1000
Ethernet card (but the bandwidth is limited to 100 Mbit/s). It is an
entry-level offer that I wanted only for an IPv4 address (there is an
IPv6 address too).

The images offered for installation include no BSD, only Linux/Debian
variants.

Following instructions from a helpful wiki page, I tried to install
using a Linux rescue disk (provided by OVH), running entirely in memory,
with qemu-system-x86_64 giving access to a CD-ROM install image.

Nothing booted.

Since it was unclear from the web interface whether the boot process
depended on the information about an image being installed (to allow
booting from the disk), I then installed a Linux/Debian on only one
disk (one can select 1, 2 or 3 disks, but with multiple disks it is
software RAID).

Using the rescue system, I then resized the Debian partition and
installed NetBSD on another partition (dual booting). To bypass a
possible limitation in the booting process (only booting GRUB and
accessing GRUB directly), I chainloaded the NetBSD stage1 from the
GRUB2 menu and verified, under qemu, that this would boot. I used the
GRUB2 boot-once feature so that, if NetBSD crashed and rebooted, I
could go back to Debian to try something else.

Still no success.

It was almost certain there was a problem with the kernel.

So I wrote a special /boot.cfg to test various things, compiled a
custom kernel (since the GENERIC installation one was not running), and
tried to validate the booting procedure step by step. The plan was then
to insert a cpu_reboot() call in the kernel to see where the problem
occurred: if the machine rebooted, I would be able to connect to Debian
and would know that everything before the call was OK.
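
The reboot-probe idea above amounts to a binary search over the boot
sequence. Below is a minimal Python sketch of that logic only, not kernel
code. The stage labels and the `reboots_when_probe_after` callback are
hypothetical: each call to the callback stands for one manual test boot
with a cpu_reboot() call placed right after the given stage. The sketch
assumes the stall is deterministic and lies somewhere in the listed
stages.

```python
from typing import Callable, Sequence


def first_stalling_stage(
    stages: Sequence[str],
    reboots_when_probe_after: Callable[[int], bool],
) -> int:
    """Return the index of the first stage that stalls the boot.

    reboots_when_probe_after(i) models one real trial: boot a kernel with
    a reboot call placed right after stage i and report True if the
    machine came back to Debian (stages 0..i are fine), or False if it
    never came back (the stall is at or before stage i).
    """
    lo, hi = 0, len(stages) - 1  # the culprit is within stages[lo..hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if reboots_when_probe_after(mid):
            lo = mid + 1  # the probe was reached: the stall is later
        else:
            hi = mid  # the probe was never reached: stall at or before mid
    return lo


# Hypothetical stage labels; a real list would follow the kernel's
# start-up code.
STAGES = ["early init", "console", "autoconfiguration", "root mount", "init"]

# Simulated machine for illustration: the stall sits in stage index 2.
trials = []


def simulated_trial(i: int) -> bool:
    trials.append(i)
    return i < 2


culprit = first_stalling_stage(STAGES, simulated_trial)
print(f"stall in {STAGES[culprit]!r} found after {len(trials)} test boots")
```

With n candidate stages this needs about log2(n) test boots, which
matters when every trial means rebuilding a kernel and waiting to see
whether Debian comes back.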
In order to limit the work, I once more used qemu, but this time to
install NetBSD on another disk (so that I could use qemu not from the
rescue system but directly under Debian, without trashing the very disk
Debian runs from).

The first test was to see whether NetBSD stage2 was indeed loaded. The
menu in /boot.cfg was simple: the instruction "quit".

=> First lesson: this does not work. The reboot is not a total one and,
with the drives mapped (in GRUB2) to ensure that the booting succeeds,
the stage2 reboots but ends up back in itself, so the machine was
rebooting endlessly and I had no connection.

It took me various modifications before realizing that this was the
case (under qemu), so I abandoned the idea and tried to boot a custom
kernel, without SMP and without framebuffer (i915drmkms).

This succeeded.

I then went back to testing with the framebuffer enabled. It didn't
work. I then disabled the framebuffer for everything, and tried with
SMP. It worked. Then I tried 9.2 GENERIC and 9.3 GENERIC without
framebuffer. Both work.

So the final lesson: NetBSD can be installed on such a machine, but the
framebuffer is a problem. And NetBSD is not far behind Linux, because
the Debian distribution is a recent one, and the main clue was in the
Linux dmesg:

  Command line: BOOT_IMAGE=/boot/vmlinuz-5.10.0-14-amd64 root=UUID=eea6d0a4-03b6-44e6-8588-ff6c4eba2095 ro nomodeset iommu=pt

Note the "nomodeset".

Linux doesn't work with the embedded graphics (HD 4000) either.

So it is partly a kernel problem (the kernel stalling during
framebuffer initialization) but mainly an install problem (the
framebuffer should be disabled in such cases).

If someone thinks there is interest in how I set up dual booting,
chainloaded NetBSD from GRUB2 and configured the boot procedure, I can
write a mini-page about it.

For the rest: problem solved. NetBSD can be installed on an OVH
bare-metal server (at least this kind of machine).
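
The "nomodeset" clue can be looked for mechanically on another machine.
This is a small sketch under stated assumptions: the function name is
made up for illustration, the sample string is the command line quoted
above, and on a live Linux system the same text would typically be read
from /proc/cmdline instead.

```python
def kernel_cmdline_flags(cmdline: str) -> dict:
    """Split a Linux kernel command line into {name: value-or-True}."""
    flags = {}
    for token in cmdline.split():
        # Split on the first "=" only, so "root=UUID=..." keeps its value.
        name, sep, value = token.partition("=")
        flags[name] = value if sep else True
    return flags


# Command line copied from the Linux dmesg quoted in this message.
CMDLINE = (
    "BOOT_IMAGE=/boot/vmlinuz-5.10.0-14-amd64 "
    "root=UUID=eea6d0a4-03b6-44e6-8588-ff6c4eba2095 ro nomodeset iommu=pt"
)

flags = kernel_cmdline_flags(CMDLINE)
if flags.get("nomodeset"):
    # The provider's own image avoids kernel modesetting, a hint that the
    # integrated graphics may also need to be disabled for NetBSD.
    print("Linux was booted with kernel modesetting disabled")
```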
-- 
Thierry Laronde
[PARTIAL SUCCESS] Debugging/fixing a kernel stalled not crashing
On Thu, Aug 18, 2022 at 04:33:04PM +0200, tlaro...@polynum.com wrote:
> Context: I rent a baremetal server and try to install NetBSD on it. I
> finally installed a Linux (Debian) and installed NetBSD as a dual boot.
> But NetBSD doesn't come up (in case there was a
> network misconfiguration, I verified that no log, no dmesg was written)
> and neither does it crashes and reboots (because I use GRUB2 boot once
> feature and, if it was the case, the server will go back to Debian, and
> it doesn't).
>

So:

- I have installed a Linux/Debian and I'm using GRUB2 to chainload the
  stage1 block in order to load the NetBSD kernel, using the boot-once
  feature of GRUB2 so that, if something goes wrong, I can go back to
  Linux/Debian;

- since I can see nothing of the boot process, I have set up a
  /boot.cfg with several choices, and I set the default entry so that
  each boot chainloaded by GRUB2 tries a different thing (since I
  haven't found a way to mount ffs read-write under Linux, I use
  qemu-system-x86_64, under Debian, to write and modify the NetBSD
  partitions);

- the machine is an Intel Xeon, quad-core, Ivy Bridge.

Since the GENERIC kernel does not boot, I have compiled a custom 9.3,
stripping everything unneeded and adding this feature (commented out in
the GENERIC config):

  acpismbus* at acpi?	# ACPI SMBus CMI (experimental)

since, according to x86/pci/imcsmb/imc.c, there are some peculiarities
about the (Sandy,Ivy)Bridge with the Xeon.

With the framebuffer (i915drmkms) disabled via userconf and SMP
disabled, NetBSD boots on the machine. The dmesg is here:

http://downloads.kergis.com/misc/rpt_netbsd9.3_monocore_no-fb.dmesg

Since I fought quite a lot with Debian, GRUB2 and so on for the
installation and the boot process, I still have to verify whether an
SMP version of the same kernel boots or not. If an SMP kernel does not
boot, I will come back to the list for tips on how best to gather
information about what is going wrong, in order to fix it or help to
fix it.
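
For readers who want to reproduce a /boot.cfg with several choices, here
is a hedged Python sketch that renders such a file. It assumes the
`menu=`, `default=` and `timeout=` keys described in boot.cfg(5) and the
`userconf` command of the x86 boot loader; the menu labels and the
kernel file name `netbsd.custom` are hypothetical, not the ones actually
used on this server.

```python
def render_boot_cfg(entries, default=1, timeout=5):
    """Render boot.cfg text from (label, [boot loader commands]) pairs."""
    lines = [
        f"menu={label}:{';'.join(commands)}" for label, commands in entries
    ]
    lines.append(f"default={default}")  # 1-based entry booted unattended
    lines.append(f"timeout={timeout}")  # seconds before the default is used
    return "\n".join(lines) + "\n"


# Hypothetical entries: only the default is ever tried on a headless
# machine, so it is changed between test boots.
ENTRIES = [
    ("Custom kernel without framebuffer",
     ["userconf disable i915drmkms*", "boot netbsd.custom"]),
    ("GENERIC without framebuffer",
     ["userconf disable i915drmkms*", "boot netbsd"]),
    ("GENERIC unmodified", ["boot netbsd"]),
]

print(render_boot_cfg(ENTRIES, default=1), end="")
```

Generating the file keeps each trial down to changing the `default`
argument, which helps when the file can only be edited from inside qemu.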
-- 
Thierry Laronde
Re: Debugging/fixing a kernel stalled not crashing
>> If it's an issue picking up the root filesystem, you could boot an
>> INSTALL type kernel with a built in ramdisk with dhcpcd and sshd
>> enabled, [...]

> Yes, I plan to test this also, depending on [...]

This reminds me of a case I had, once. I wanted to test-boot a
particular kernel version on a machine which had no disk interface or
network supported by that kernel. (Its USB was USB 3, and the kernel
in question didn't support anything past USB 2; and the network
interfaces weren't supported either. The kernel was old compared to
the hardware.) The machine's ROM code, though, could boot a kernel off
a USB thumbdrive just fine.

I ended up building a kernel that booted with a SLIPpish interface
configured on the console serial port, at an address specified at
kernel config time. Running diskless over SLIP on a serial line...well,
it was painful, but it worked.

If I'd ended up wanting to use that kernel on that hardware more
extensively, I probably would have used that kernel to support porting
either USB support or Ethernet support, but the desire disappeared
before I made any significant progress beyond getting it to boot and
run - or, perhaps more accurately, crawl - diskless.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Debugging/fixing a kernel stalled not crashing
Hello,

On Fri, Aug 19, 2022 at 02:36:33PM +0100, David Brownlee wrote:
> Tangentially...
>
> If it's an issue picking up the root filesystem, you could boot an
> INSTALL type kernel with a built in ramdisk with dhcpcd and sshd
> enabled, and see if you can ssh into the box (I think someone had
> pre-built arm images which did just that, so the code should be out
> there :)

Yes, I plan to test this too, depending on the stage at which my reboot
tactic shows the problem to be. The aim is to be able to connect to a
running kernel. Once that is achieved, the hardest part will have been
done.

I have already built a custom kernel (with acpismbus* added, since the
machine is Ivy Bridge and this is related, and it is not in GENERIC)
and will start to debug tomorrow.

Best,
-- 
Thierry Laronde
Re: Debugging/fixing a kernel stalled not crashing
Tangentially...

If it's an issue picking up the root filesystem, you could boot an
INSTALL type kernel with a built-in ramdisk with dhcpcd and sshd
enabled, and see if you can ssh into the box (I think someone had
pre-built arm images which did just that, so the code should be out
there :)

David
Debugging/fixing a kernel stalled not crashing
Context: I rent a bare-metal server and am trying to install NetBSD on
it. I finally installed a Linux (Debian) and installed NetBSD as a dual
boot. But NetBSD doesn't come up (in case there was a network
misconfiguration, I verified that no log and no dmesg were written),
and neither does it crash and reboot (I use the GRUB2 boot-once
feature, so if it did, the server would go back to Debian, and it
doesn't).

I can't "see" the boot process (no IPMI for this entry-level offer),
but I at least have the dmesg from Linux as a description of the
machine, and I'd like to try to find the culprit and, once it is
identified, to correct it.

In order to bisect the problem, the simplest approach seems to be to
place a cpu_reboot() at various steps: if the machine reboots, I will
be back in Debian and hence will know that everything "until this" is
OK.

Questions:

1) Is src/sys/kern/init_main.c the correct file to start the bisection
   with?

2) From what stage on would a problem almost certainly cause a reboot
   (DDB_ONPANIC being unset), so that I can know the problem is very
   likely before that stage? I would then perhaps start working
   backwards from that point;

3) Are there places where cpu_reboot() may leave the hardware in such a
   state that a soft reset will perhaps not bring the machine back and
   let the boot sequence succeed (or is cpu_reboot() immune to this)?

TIA,
-- 
Thierry Laronde