debugging initramfs boot process, install client can't mount root file system
If you have any trouble during the boot process of your install clients and they can't mount the nfsroot, it helps to have a shell for debugging. You can get a shell by adding break=bottom as a kernel parameter (in a pxelinux.cfg file). This stops the boot process and spawns a shell within the initramfs. Other values for break= can be found by searching all initramfs scripts for calls to the subroutine maybe_break:

maybe_break top
maybe_break modules
maybe_break premount
maybe_break mount
maybe_break mountroot
maybe_break bottom
maybe_break init
maybe_break live-premount
maybe_break live-bottom
maybe_break pre-mdadm
maybe_break post-mdadm

-- 
regards Thomas
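The search for maybe_break calls can be scripted. What follows is a minimal sketch, not part of FAI or initramfs-tools. It assumes the scripts live under /usr/share/initramfs-tools, which is the usual location on Debian and Ubuntu but is an assumption here; pass a different directory (for example an unpacked initrd) as the first argument if yours are elsewhere. The function name list_breakpoints is made up for illustration.

```shell
#!/bin/sh
# list_breakpoints [DIR]
# Hypothetical helper: print every name that the scripts under DIR pass to
# maybe_break, one per line, sorted and de-duplicated.
# DIR defaults to /usr/share/initramfs-tools (an assumption, adjust as needed).
list_breakpoints() {
    dir=${1:-/usr/share/initramfs-tools}
    # -r: recurse, -h: omit file names, -o: print only the matching text
    grep -rhoE 'maybe_break[[:space:]]+[A-Za-z0-9_-]+' "$dir" 2>/dev/null |
        awk '{ print $2 }' |
        sort -u
}
```

Any name it prints should be usable as break=<name> on the kernel command line. Note that the pattern is deliberately simple and would also pick up a comment that happens to contain "maybe_break" followed by a word.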
Re: debugging initramfs boot process, install client can't mount root file system
Thomas Lange wrote:
> If you have any troubles during the boot process of your install clients and they can't mount the nfsroot, you like to have a shell for debugging. You can get a shell by adding break=bottom as kernel parameter (in a pxelinux.cfg file). This stops the boot process and spawns a shell within the initramfs. Other values for break= can be found if you search all initramfs scripts for the subroutine maybe_break:
>
> maybe_break top
> maybe_break modules
> maybe_break premount
> maybe_break mount
> maybe_break mountroot
> maybe_break bottom
> maybe_break init
> maybe_break live-premount
> maybe_break live-bottom
> maybe_break pre-mdadm
> maybe_break post-mdadm

Wow. Having just read the manpage for initramfs-tools, I see the break parameter can be *very* helpful. I'm currently debugging a kernel panic and this is sure to be a useful tool. Thanks for the tip, Thomas!

Respectfully, Ryan
tg3 network cards
This may or may not be the proper list/outlet for this, so if it's not, feel free to let me know and I'll pursue it elsewhere, but since it came up during a FAI installation I'll start here.

I've got a server with a Broadcom network card (a 5721) and I'm using the tg3 driver, and the box with that NIC just absolutely refuses to get through the initrd. The installation hangs in the initramfs in /scripts/live in the function do_netmount(), and I'm pretty sure it's because the 'ipconfig' binary included in the initrd is killing networking. I've spent a few days now hacking the initrd, the init script, and its functions to determine the path it takes to get there, which appears to be:

1. init is invoked
2. init sources /scripts/live (since boot=live in the pxelinux.cfg), and then calls the function mountroot(), which is defined in /scripts/live
3. mountroot calls several other functions within the /scripts/live script, and eventually gets to do_netmount()
4. inside do_netmount, it encounters a line in which the binary 'ipconfig' (yes, ipconfig, not ifconfig) is called, and this is where it hangs.

I've added some debugging code for clarity:

    do_netmount ()
    {
        rc=1

        modprobe -q af_packet # For DHCP

        udevtrigger
        udevsettle

        echo -e "\nThis is right before we 'ipconfig ${DEVICE}'\n" > /dev/console 2>&1
        ipconfig ${DEVICE} | tee /netboot.config
        echo -e "\nRight before sourcing ipconfig output\n" > /dev/console 2>&1

        # source relevant ipconfig output
        OLDHOSTNAME=${HOSTNAME}
        . /tmp/net-${DEVICE}.conf
        <snip>
    }

With that debugging output in place, the last output to the console is:

    This is right before we 'ipconfig eth0'
    [ 100.068705] tg3: eth0: Link is up at 1000 Mbps, full duplex.
    [ 100.068767] tg3: eth0: Flow control is off for TX and off for RX.
    [ 393.930374] Machine check events logged
    [ 699.732829] Machine check events logged

...and from there it just hangs indefinitely.
I know for a fact that the kernel module, tg3.ko, is being loaded by load_modules, so that's not the problem. In fact, I'm almost 100% positive that 'ipconfig' is killing network connectivity. I initially thought I was (and maybe I still am) getting bitten by a really crappy Broadcom card/driver, but when I tested ipconfig in a VM (extracted the initrd, ran 'bin/ipconfig eth0 | tee /outfile'), the only way I could get the machine to be operable over the network again was to pop into the VM console and issue an '/etc/init.d/networking restart'. That may not be a great litmus test since it's using a virtual interface, not a real hardware interface, but it does mimic the behavior I see on real hardware exactly.

I would love to hear any insights and/or similar experiences others may have had with this. I have several other servers that use different network drivers (igb, e1000, etc.) that all seem to work just fine, which reinforces my feeling that this Broadcom card is just poorly supported on Linux. I've tried both the tg3.ko that ships with Ubuntu and compiling the driver myself, both with the same results.
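Since the hang never returns control to the script, one way to keep debugging is to put an upper bound on the suspect command so the script can carry on (for example to a shell) instead of blocking forever. The following is a generic POSIX sh sketch, not something taken from live-initramfs or FAI. The name run_with_timeout and its variables are hypothetical, and it assumes that sleep and kill are available inside the initramfs.

```shell
#!/bin/sh
# run_with_timeout SECONDS COMMAND [ARGS...]
# Hypothetical helper: run COMMAND in the background and send it SIGTERM if it
# has not finished after SECONDS. Returns COMMAND's exit status, which is
# non-zero when the command was terminated by the watchdog.
run_with_timeout() {
    limit=$1
    shift

    "$@" &
    cmd_pid=$!

    # Watchdog: after $limit seconds, terminate the command if it still runs.
    # Its output is discarded so it never holds the console or a pipe open.
    ( sleep "$limit"; kill "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    watchdog_pid=$!

    status=0
    wait "$cmd_pid" || status=$?

    # The command is gone either way, so the watchdog is no longer needed.
    kill "$watchdog_pid" 2>/dev/null || true
    return "$status"
}
```

In do_netmount this could be tried as `run_with_timeout 60 ipconfig ${DEVICE}`, where 60 is an arbitrary value; a non-zero return then means ipconfig either failed or had to be terminated, and the script can print a message to /dev/console before continuing.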
Re: tg3 network cards
> This may or may not be the proper list/outlet for this, so if it's not, feel free to let me know and I'll pursue it elsewhere, but since it came up during a FAI installation I'll start here. I've got a server with a Broadcom network card (a 5721) and I'm using the tg3 driver, and the box with that NIC just absolutely refuses to get through the initrd. The installation hangs in the initramfs in /scripts/live on the function do_netmount(), and I'm pretty sure it's because the 'ipconfig' binary included in the initrd is killing networking. I've spent a few days now hacking the initrd, the init script, and its functions to determine the path it takes to get there, which appears to be: [...]

What kind of access do you have for debugging this? Console access? Do you get it to boot by some other means (e.g., GRML live CD)?

ipconfig is part of the klibc-utils package and most likely just tries to get an answer from your DHCP server at that very moment. If you get your system to boot by some other means, you could safely copy over ipconfig from your Debian systems and just run it on the console manually to see what happens.

Furthermore, a tcpdump capture or the like may be useful to find out what ipconfig is trying to achieve.

Other than that, there is also the frequently discussed issue of systems with more than one NIC -- your ipconfig may simply be trying to get a response from the DHCP server over some interface that doesn't have any cable plugged in.

Best, Michael
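The multi-NIC possibility can be checked from a break= shell by looking at what the kernel reports for each interface. Below is a minimal sketch that reads the address and carrier files under /sys/class/net; the function name list_links is made up for illustration, and the optional directory argument exists only so the function can be tried against a copy of the tree.

```shell
#!/bin/sh
# list_links [SYSFS_NET_DIR]
# Hypothetical helper: print "name MAC carrier" for every network interface.
# carrier is 1 (link detected), 0 (no link), or "unknown" when the file
# cannot be read, which typically happens while the interface is still down.
list_links() {
    net_dir=${1:-/sys/class/net}
    for path in "$net_dir"/*; do
        # Skip anything that is not an interface directory.
        [ -d "$path" ] || continue
        name=${path##*/}
        mac=$(cat "$path/address" 2>/dev/null) || mac=unknown
        carrier=$(cat "$path/carrier" 2>/dev/null) || carrier=unknown
        echo "$name $mac $carrier"
    done
}
```

If the interface that the initrd hands to ipconfig shows a carrier of 0 while another one shows 1, the DHCP request is probably going out on the unplugged port.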
Re: tg3 network cards
Michael Tautschnig wrote:
> > This may or may not be the proper list/outlet for this, so if it's not, feel free to let me know and I'll pursue it elsewhere, but since it came up during a FAI installation I'll start here. I've got a server with a Broadcom network card (a 5721) and I'm using the tg3 driver, and the box with that NIC just absolutely refuses to get through the initrd. The installation hangs in the initramfs in /scripts/live on the function do_netmount(), and I'm pretty sure it's because the 'ipconfig' binary included in the initrd is killing networking. I've spent a few days now hacking the initrd, the init script, and its functions to determine the path it takes to get there, which appears to be: [...]
>
> What kind of access do you have for debugging this? Console access?

Yes, I have console access, both via KVM over IP and IPMI.

> Do you get it to boot by some other means (e.g., GRML live CD)?

Unfortunately, I don't have easy physical access, hence the KVM over IP and IPMI interface. The server is physically located at a colo facility about 30 minutes away. I'm doing what I can without driving down there. :)

> ipconfig is part of the klibc-utils package and most likely just tries to get an answer from your DHCP server at that very moment.

Ah, didn't know that's where it came from - thanks for the tip. Re: 'most likely tries...' - really? It doesn't just use the info supplied to the card initially when it boots up, DHCP's, and proceeds via PXE? I submit, I tried to dump the binary with 'strings', but not much research beyond that yet (I'm exhausted today, been working for a long time).

> If you get your system to boot by some other means, you could safely copy over ipconfig from your Debian

Ubuntu. I apologize for only making a passing reference to that at the end of my OP.

> systems and just run it on the console manually to see what happens.

I tested this on a VM, and it also killed networking. But the VM has a virtual interface, and may not be a very good test, even if the results were the same. I'll try to test ipconfig on a physical network interface to see if I hit the same problem. But, like I said, I've installed other servers the same way successfully, the only difference being they had nice Intel cards, not a crappy Broadcom card.

> Furthermore, some tcpdump or the like may be useful to find out what ipconfig is trying to achieve.

I'll see if the DHCP server picks up anything, though I think I tried that and didn't glean much from it. (Then again, I'm exhausted, so maybe when I'm fresh tomorrow I'll get different results.)

> Other than that, there is also the frequently discussed issue of systems with more than one NIC -- your ipconfig may simply be trying to get a response from the DHCP server over some interface that doesn't have any cable plugged in.

I suppose that could be the case - hadn't even occurred to me. The box does have dual on-board NICs, so that is a viable suggestion. I'll do some more research on that front, see what comes of it. Thanks for the suggestion.

Respectfully, Ryan
Re: tg3 network cards
[...]

> > ipconfig is part of the klibc-utils package and most likely just tries to get an answer from your DHCP server at that very moment.
>
> Ah, didn't know that's where it came from - thanks for the tip. Re: 'most likely tries...' - really? It doesn't just use the info supplied to the card initially when it boots up, DHCP's, and proceeds via PXE? I submit, I tried to dump the binary with 'strings', but not much research beyond that yet (I'm exhausted today, been working for a long time).

Yes, it does DHCP on its own, no re-use of PXE's info at all (actually, I don't know whether that would be available to Linux anyway). I think that by default ipconfig eth0 will just do DHCP on eth0.

> > If you get your system to boot by some other means, you could safely copy over ipconfig from your Debian
>
> Ubuntu. I apologize for only making a passing reference to that at the end of my OP.

Oh, yes, sorry, spotted your mention of Ubuntu. But that shouldn't matter that much, just use klibc from your Ubuntu boxes :-)

> > systems and just run it on the console manually to see what happens.
>
> I tested this on a VM, and it also killed networking. But the VM has a virtual interface, and may not be a very good test, even if the results were the same. I'll try to test ipconfig on a physical network interface to see if I hit the same problem. But, like I said, I've installed other servers the same way successfully, the only difference being they had nice Intel cards, not a crappy Broadcom card.

Well, as long as the VM shows a very similar behavior, it will be useful for debugging from there, even though you might end up having debugged two different issues :-)

[...]

> > Other than that, there is also the frequently discussed issue of systems with more than one NIC -- your ipconfig may simply be trying to get a response from the DHCP server over some interface that doesn't have any cable plugged in.
>
> I suppose that could be the case - hadn't even occurred to me. The box does have dual on-board NICs, so that is a viable suggestion. I'll do some more research on that front, see what comes of it. Thanks for the suggestion.

Woo, two things to check first:

- There are known issues with some Broadcom cards and their IPMI firmware. You may or may not be able to apply these bugfixes (I think it required some DOS boot disk :-( ) and they might also fully disable IPMI. I guess the net may help you further along. But AFAIK the symptoms were a bit different (DHCP was fine, but no NFS afterwards, and somewhat Xen-related).
- The easier one: Check the MAC address of the interface the initrd is trying to run ipconfig on. This should help you to find out whether it is using the correct link.

HTH, Michael
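The MAC address check can be done from a break= shell without extra tools. A minimal sketch, assuming the kernel exposes interfaces under /sys/class/net: given the MAC address of the port that is actually cabled (taken from the switch, the DHCP server configuration, or the BIOS), it prints the interface name the kernel assigned to it. The function name iface_for_mac is hypothetical, and the MAC address in the usage note is a placeholder.

```shell
#!/bin/sh
# iface_for_mac MAC [SYSFS_NET_DIR]
# Hypothetical helper: print the name of the interface whose hardware address
# equals MAC (compared case-insensitively). Returns 1 if none matches.
iface_for_mac() {
    # sysfs reports addresses in lower case, so normalise the input.
    want=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    net_dir=${2:-/sys/class/net}
    for path in "$net_dir"/*; do
        [ -r "$path/address" ] || continue
        if [ "$(cat "$path/address")" = "$want" ]; then
            echo "${path##*/}"
            return 0
        fi
    done
    return 1
}
```

For example, `iface_for_mac 00:11:22:33:44:55` printing eth1 while the initrd runs ipconfig on eth0 would suggest that DHCP is being attempted on the wrong port.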