Re: [SUCCESS] Debugging/fixing a kernel stalled not crashing

2022-08-21 Thread tlaronde
On Sun, Aug 21, 2022 at 03:25:36PM +, Emmanuel Dreyfus wrote:
> On Sun, Aug 21, 2022 at 02:16:58PM +0200, tlaro...@polynum.com wrote:
> > Addition (asked by Taylor R Campbell): a current GENERIC boots only
> > with i915drmkms disabled.
> > 
> > With the framebuffer stuff enabled, it does not boot, and does not even
> > panic and reboot. It freezes somewhere. The same as the 9.x series.
> 
> I have a machine that randomly crashes during boot since we had the Linux 5.x
> DRM import. The feature is still an asset, since it supports the GPU
> that was not supported before, but it suggests booting with DRM based
> framebuffer is more fragile than booting without. Perhaps we need a boot
> flag to disable framebuffer?

I too feel that a generic flag to disable it via userconf would be
better than explicitly listing all the drivers. At the very least, we
should advise people installing on a server to try with the framebuffer
disabled first, to see if NetBSD boots, and to try with it enabled only
afterwards. When one installs on a remote server without seeing
anything of the boot process[*], it is quite frustrating.

*: I plan to play a little with UEFI EDKII to see if installing it and
dealing with an ethernet card EFI Runtime driver (persistent after exiting boot)
could be a solution for remote debugging. But there is no schedule, so
don't hold your breath; it's vaporware for the moment. Another idea:
write messages to a memory region left untouched by both UEFI and NetBSD
so that, after rebooting into UEFI (in case of a crash), a UEFI
application could dump that memory somewhere on the disk, in the EFI
partition, for post-mortem inspection.
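
A minimal sketch of that second idea, in C, under loud assumptions:
`struct plog`, `PLOG_MAGIC` and all helper names are hypothetical, and
the buffer here is an ordinary struct, whereas a real version would have
to sit at a physical address reserved from both the firmware and the
kernel. It only shows how a magic number plus a checksum could let a
UEFI application decide, after a reset, whether the region holds a
usable log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PLOG_MAGIC 0x504c4f47u   /* "PLOG": arbitrary marker, an assumption */
#define PLOG_CAP   4000u         /* bytes of message space, an assumption */

/* Hypothetical layout of the region that survives the reset. */
struct plog {
    uint32_t magic;              /* PLOG_MAGIC once initialized */
    uint32_t len;                /* bytes used in data[] */
    uint32_t sum;                /* checksum of data[0..len) */
    char     data[PLOG_CAP];
};

/* Simple multiplicative checksum; not cryptographic, just a sanity check. */
static uint32_t plog_sum(const char *p, uint32_t n)
{
    uint32_t s = 0;
    for (uint32_t i = 0; i < n; i++)
        s = s * 31u + (unsigned char)p[i];
    return s;
}

/* Called once, early in boot, by the side that writes messages. */
static void plog_init(struct plog *l)
{
    l->magic = PLOG_MAGIC;
    l->len = 0;
    l->sum = plog_sum(l->data, 0);
}

/* Append one message; returns 0 on success, -1 if it does not fit. */
static int plog_append(struct plog *l, const char *msg)
{
    size_t n = strlen(msg);

    assert(l->magic == PLOG_MAGIC);
    if (n > PLOG_CAP - l->len)
        return -1;
    memcpy(l->data + l->len, msg, n);
    l->len += (uint32_t)n;
    l->sum = plog_sum(l->data, l->len);
    return 0;
}

/* Called after the reset by the side that dumps: nonzero if usable. */
static int plog_valid(const struct plog *l)
{
    return l->magic == PLOG_MAGIC &&
           l->len <= PLOG_CAP &&
           l->sum == plog_sum(l->data, l->len);
}
```

The dumping side would call plog_valid() on the region and write
data[0..len) to the EFI partition only if it returns nonzero; memory
that was never initialized, or was overwritten in between, fails the
check.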
-- 
Thierry Laronde 
 http://www.kergis.com/
http://kertex.kergis.com/
   http://www.sbfa.fr/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


Re: [SUCCESS] Debugging/fixing a kernel stalled not crashing

2022-08-21 Thread Emmanuel Dreyfus
On Sun, Aug 21, 2022 at 02:16:58PM +0200, tlaro...@polynum.com wrote:
> Addition (asked by Taylor R Campbell): a current GENERIC boots only
> with i915drmkms disabled.
> 
> With the framebuffer stuff enabled, it does not boot, and does not even
> panic and reboot. It freezes somewhere. The same as the 9.x series.

I have a machine that randomly crashes during boot since we had the Linux 5.x
DRM import. The feature is still an asset, since it supports the GPU
that was not supported before, but it suggests booting with DRM based
framebuffer is more fragile than booting without. Perhaps we need a boot
flag to disable framebuffer?

-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: [SUCCESS] Debugging/fixing a kernel stalled not crashing

2022-08-21 Thread tlaronde
Addition (asked by Taylor R Campbell): a current GENERIC boots only
with i915drmkms disabled.

With the framebuffer stuff enabled, it does not boot, and does not even
panic and reboot. It freezes somewhere. The same as the 9.x series.

-- 
Thierry Laronde 
 http://www.kergis.com/
http://kertex.kergis.com/
   http://www.sbfa.fr/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


Re: [SUCCESS] Debugging/fixing a kernel stalled not crashing

2022-08-20 Thread tlaronde
A final point:

Context: I rent a bare-metal server (OVH) that has a quad-core Intel
Xeon (Ivy Bridge), 16 GB of RAM, three 2 TB disks, and an Intel PRO 1000
ethernet card (but the bandwidth is limited to 100 Mbit/s). It is an
entry-level offer that I wanted only for an IPv4 address (there is an
IPv6 address too).

The installable images on offer include no BSD, only Linux/Debian
variants.

Following instructions from a helpful wiki page, I tried to install from
a Linux rescue system (provided by OVH) that runs entirely in memory and
includes qemu-system-x86_64, which makes it possible to use a CD-ROM
install image.

Nothing booted.

Since it was unclear from the web interface whether the boot process
depended on an image being recorded as installed (to allow booting from
the disk), I then installed Linux/Debian on only one disk (one can
select 1, 2 or 3 disks, but with multiple disks this means software
RAID).

Using the rescue system, I then resized the Debian partition and
installed NetBSD on another partition (dual booting). To bypass a
possible limitation in the boot process (only GRUB being booted and
accessed directly), I chainloaded the NetBSD stage1 from the GRUB2 menu
and verified under qemu that this boots. I used the GRUB2 boot-once
feature so that if NetBSD crashed and rebooted, I could get back to
Debian to try something else.

Still no success.

It was almost certain there was a problem with the kernel.

So I wrote a special /boot.cfg to test various things, compiled a custom
kernel (since the GENERIC install kernel was not running), and tried to
validate the boot procedure step by step. The plan was then to insert a
cpu_reboot() call in the kernel to see where the problem occurred: since
after a reboot I would be able to connect to Debian, I would know that
everything before the call was OK.

To limit the work, I once more used qemu, this time to install NetBSD
on another disk (so that I could run qemu directly under Debian rather
than from the rescue system, without trashing the very disk Debian runs
from).

The first test was to see if, indeed, NetBSD stage2 was loaded. The
menu in /boot.cfg was simple: the instruction "quit".

=> First lesson: this does not work, because the reboot is not a full
one. With the drives remapped in GRUB2 so that booting succeeds, the
stage2 reboots but ends up back in itself, so the machine was rebooting
endlessly and I had no connection.

It took me several modifications before I realized (under qemu) that
this was the case, so I abandoned the idea and tried to boot a custom
kernel without SMP and without the framebuffer (i915drmkms).

This succeeded.

I then went back and tested with the framebuffer left enabled. It
didn't work.
I then disabled the framebuffer for all further tests and tried with
SMP. It worked.
Then I tried 9.2 GENERIC and 9.3 GENERIC without the framebuffer. Both
work.

So the final lesson: NetBSD can be installed on such a machine, but the
framebuffer is a problem. And NetBSD is not far behind Linux, because
the Debian distribution is a recent one, and the main clue was in the
Linux dmesg:

Command line: BOOT_IMAGE=/boot/vmlinuz-5.10.0-14-amd64 
root=UUID=eea6d0a4-03b6-44e6-8588-ff6c4eba2095 ro nomodeset iommu=pt

The clue: nomodeset.

Linux doesn't work with the embedded graphics (HD 4000) either.

So it is partly a kernel problem (kernel stalling with framebuffer
initializations) but mainly an install problem (framebuffer in such
cases should be disabled).

If anyone is interested in how I set up dual booting, chainloading
NetBSD from GRUB2 and configuring the boot procedure, I can write a
mini-page about it.

For the rest: problem solved. NetBSD can be installed on an OVH
bare-metal server (at least on this kind of machine).
-- 
Thierry Laronde 
 http://www.kergis.com/
http://kertex.kergis.com/
   http://www.sbfa.fr/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


[PARTIAL SUCCESS] Debugging/fixing a kernel stalled not crashing

2022-08-20 Thread tlaronde
On Thu, Aug 18, 2022 at 04:33:04PM +0200, tlaro...@polynum.com wrote:
> Context: I rent a baremetal server and try to install NetBSD on it. I
> finally installed a Linux (Debian) and installed NetBSD as a dual boot.
> But NetBSD doesn't come up (in case there was a
> network misconfiguration, I verified that no log, no dmesg was written),
> nor does it crash and reboot (I use the GRUB2 boot-once feature, so if
> it did, the server would go back to Debian, and it doesn't).
> 

So:

- I have installed Linux/Debian and I'm using GRUB2 to chainload
the stage1 block in order to load the NetBSD kernel, using the boot-once
feature of GRUB2 so that if something goes wrong, I can go back to
Linux/Debian;

- Since I can see nothing of the boot process, I have set up a /boot.cfg
with several choices, and I change its default entry so that each
chainload from GRUB2 tries something different (since I haven't found a
way to mount ffs read-write under Linux, I use qemu-system-x86_64,
under Debian, to write and modify the NetBSD partitions);

- The machine is a quad-core Intel Xeon (Ivy Bridge). Since the GENERIC
kernel does not boot, I have compiled a custom 9.3, stripping everything
unneeded and adding this feature (commented out in the GENERIC config):

acpismbus*  at acpi?        # ACPI SMBus CMI (experimental)

since, judging from x86/pci/imcsmb/imc.c, there are some peculiarities
with the (Sandy,Ivy) Bridge Xeons.

With the framebuffer (i915drmkms) disabled via userconf and SMP
disabled, NetBSD boots on the machine. The dmesg is here:

http://downloads.kergis.com/misc/rpt_netbsd9.3_monocore_no-fb.dmesg

Since I fought quite a lot with Debian, GRUB2 and so on for the
installation and the boot process, I still have to verify whether an SMP
version of the same kernel boots.

If an SMP kernel does not boot, I will come back to the list for tips on
how best to gather information about what is going wrong, in order to
try to fix it or help fix it.
-- 
Thierry Laronde 
 http://www.kergis.com/
http://kertex.kergis.com/
   http://www.sbfa.fr/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


Re: Debugging/fixing a kernel stalled not crashing

2022-08-19 Thread Mouse
>> If it's an issue picking up the root filesystem, you could boot an
>> INSTALL type kernel with a built in ramdisk with dhcpcd and sshd
>> enabled, [...]
> Yes, I plan to test this also, depending on [...]

This reminds me of a case I had, once.

I wanted to test-boot a particular kernel version on a machine which
had no disk interface or network supported by that kernel.  (Its USB
was USB 3, and the kernel in question didn't support anything past USB 2;
and the network interfaces weren't supported either.  The kernel was
old compared to the hardware.)

The machine's ROM code, though, could boot a kernel off a USB
thumbdrive just fine.

I ended up building a kernel that booted with a SLIPpish interface
configured on the console serial port, at an address specified at
kernel config time.  Running diskless over SLIP on a serial
line...well, it was painful, but it worked.  If I'd ended up wanting to
use that kernel on that hardware more extensively, I probably would
have used that kernel to support porting either USB support or Ethernet
support, but the desire disappeared before I made any significant
progress beyond getting it to boot and run - or, perhaps more
accurately, crawl - diskless.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML        mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Debugging/fixing a kernel stalled not crashing

2022-08-19 Thread tlaronde
Hello,

On Fri, Aug 19, 2022 at 02:36:33PM +0100, David Brownlee wrote:
> Tangentially...
> 
> If it's an issue picking up the root filesystem, you could boot an
> INSTALL type kernel with a built in ramdisk with dhcpcd and sshd
> enabled, and see if you can ssh into the box (I think someone had
> pre-built arm images which did just that, so the code should be out
> there :)

Yes, I plan to test this as well, depending on the stage at which my
reboot tactic shows the problem to be. The aim is to be able to connect
to a running kernel. Once that is achieved, the hardest part will be
done.

I have already built a custom kernel (with acpismbus* added since the
machine has IvyBridge and it is related, and it's not in GENERIC) and
will start to debug tomorrow.

Best,
-- 
Thierry Laronde 
 http://www.kergis.com/
http://kertex.kergis.com/
   http://www.sbfa.fr/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C


Re: Debugging/fixing a kernel stalled not crashing

2022-08-19 Thread David Brownlee
Tangentially...

If it's an issue picking up the root filesystem, you could boot an
INSTALL type kernel with a built in ramdisk with dhcpcd and sshd
enabled, and see if you can ssh into the box (I think someone had
pre-built arm images which did just that, so the code should be out
there :)

David


Debugging/fixing a kernel stalled not crashing

2022-08-18 Thread tlaronde
Context: I rent a baremetal server and try to install NetBSD on it. I
finally installed a Linux (Debian) and installed NetBSD as a dual boot.
But NetBSD doesn't come up (in case there was a
network misconfiguration, I verified that no log, no dmesg was written),
nor does it crash and reboot (I use the GRUB2 boot-once feature, so if
it did, the server would go back to Debian, and it doesn't).

I can't "see" the boot process (no IPMI for this entry level offer), but
I have at least the dmesg from Linux for the description of the machine,
and I'd like to give it a try to see if I can find the culprit and,
this being identified, manage to correct it.

In order to bisect the problem, the simplest approach seems to be to
place a cpu_reboot() call at various steps to identify the culprit: if
the machine reboots, I will be back in Debian and hence will know that
everything up to that call is OK.
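
To make the search concrete, here is a small userland C model of it (not
kernel code). The checkpoint numbering, boot_reboots() and
last_good_checkpoint() are hypothetical names; each call to
boot_reboots() stands in for one rebuild with cpu_reboot() placed at a
given checkpoint, followed by one test boot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of one test boot: cpu_reboot() is compiled in at
 * checkpoint `probe`, and the kernel stalls at checkpoint `stall_at`.
 * The machine comes back to Debian (true) only if the probe is reached
 * before the stall.
 */
static bool boot_reboots(int probe, int stall_at)
{
    return probe < stall_at;
}

/*
 * Binary search over checkpoints 0..n-1 for the last one that still
 * reboots. Returns -1 if even checkpoint 0 is never reached.
 * Each loop iteration corresponds to one rebuild-and-boot cycle,
 * counted in *boots_used.
 */
static int last_good_checkpoint(int n, int stall_at, int *boots_used)
{
    int lo = 0, hi = n - 1, best = -1;

    assert(n > 0 && boots_used != NULL);
    *boots_used = 0;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;

        (*boots_used)++;
        if (boot_reboots(mid, stall_at)) {
            best = mid;          /* reached: the stall is later */
            lo = mid + 1;
        } else {
            hi = mid - 1;        /* never reached: the stall is earlier */
        }
    }
    return best;
}
```

With 16 candidate checkpoints this needs at most 5 test boots, which
matters when every failed attempt means waiting for a stalled machine to
be reset by hand.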

Questions:

1) Is src/sys/kern/init_main.c the correct file to start the bisection
with?

2) From what stage on would a problem almost certainly cause a
reboot (DDB_ONPANIC being unset), so that I can know the problem
very likely lies before it? I would then perhaps work backwards from
that point;

3) Are there places where cpu_reboot() may leave the hardware in such a
state that a soft reset might not bring the machine back far enough
for the boot sequence to succeed (or is cpu_reboot() immune
to this)?

TIA,
-- 
Thierry Laronde 
 http://www.kergis.com/
http://kertex.kergis.com/
   http://www.sbfa.fr/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C