debugging initramfs boot process, install client can't mount root file system

2008-12-03 Thread Thomas Lange
If you have any trouble during the boot process of your install
clients and they can't mount the nfsroot, you may want a shell for
debugging.

You can get a shell by adding break=bottom as a kernel parameter (in
a pxelinux.cfg file). This stops the boot process and spawns a shell
within the initramfs. Other values for break= can be found by searching
all initramfs scripts for calls to the subroutine maybe_break:

maybe_break top
maybe_break modules
maybe_break premount
maybe_break mount
maybe_break mountroot
maybe_break bottom
maybe_break init
maybe_break live-premount
maybe_break live-bottom
maybe_break pre-mdadm
maybe_break post-mdadm
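
The search described above can be sketched as a small shell function. This is only an illustration: the function name list_break_points is made up, and the default directory is an assumption about where initramfs-tools installs its scripts on your system.

```shell
# Hypothetical helper: list every name passed to maybe_break under a directory.
# The default path is an assumption; point it at an unpacked initrd if needed.
list_break_points() {
    dir="${1:-/usr/share/initramfs-tools}"
    # -o prints only the matching "maybe_break <name>" part, -h drops file names.
    grep -rhoE 'maybe_break[[:space:]]+[A-Za-z0-9_-]+' "$dir" 2>/dev/null |
        awk '{ print $2 }' | sort -u
}

# Usage: list_break_points /usr/share/initramfs-tools
```

Each name printed should be usable as break=<name> on the kernel command line, provided the script containing it is actually part of your initrd.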

-- 
regards Thomas


Re: debugging initramfs boot process, install client can't mount root file system

2008-12-03 Thread Ryan Steele

Wow.  Having just read the manpage for initramfs-tools, I see the break
parameter can be *very* helpful.  I'm currently debugging a kernel panic
and this is sure to be a useful tool.  Thanks for the tip, Thomas!



Respectfully,
Ryan



tg3 network cards

2008-12-03 Thread Ryan Steele
This may or may not be the proper list/outlet for this, so if it's not, 
feel free to let me know and I'll pursue it elsewhere, but since it came 
up during a FAI installation I'll start here.


I've got a server with a Broadcom network card (a 5721) using
the tg3 driver, and the box with that NIC just absolutely refuses
to get through the initrd.  The installation hangs in the initramfs in
/scripts/live in the function do_netmount(), and I'm pretty sure it's
because the 'ipconfig' binary included in the initrd is killing
networking.  I've spent a few days now hacking the initrd, the init
script, and its functions to determine the path it takes to get there,
which appears to be:



1. init is invoked
2. init sources /scripts/live (since boot=live in the pxelinux.cfg), and 
then calls the function mountroot(), which is defined in /scripts/live
3. mountroot calls several other functions within the /scripts/live 
script, and eventually gets to do_netmount()
4. inside do_netmount, it encounters a line in which the binary 
'ipconfig' (yes, ipconfig, not ifconfig) is called, and this is where it 
hangs.  I've added some debugging code for clarity:



do_netmount ()
{
  rc=1

  modprobe -q af_packet # For DHCP

  udevtrigger
  udevsettle

  echo -e "\nThis is right before we 'ipconfig ${DEVICE}'\n" > /dev/console 2>&1

  ipconfig ${DEVICE} | tee /netboot.config

  echo -e "\nRight before sourcing ipconfig output\n" > /dev/console 2>&1
  # source relevant ipconfig output
  OLDHOSTNAME=${HOSTNAME}
  . /tmp/net-${DEVICE}.conf

  # snip
}
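
If more markers like these are needed, the echo lines could be folded into a small helper. This is only a sketch: dbg and DEBUG_CONSOLE are made-up names and not part of /scripts/live, and printf is used instead of echo -e on the assumption that the shell in the initrd provides it.

```shell
# Hypothetical helper, not part of /scripts/live: print a marker line.
# DEBUG_CONSOLE is an assumed override so the helper can be tried outside
# the initramfs; by default the marker goes to /dev/console.
dbg() {
    printf '\n%s\n\n' "$*" >> "${DEBUG_CONSOLE:-/dev/console}" 2>&1
}

# Usage inside do_netmount: dbg "This is right before we 'ipconfig ${DEVICE}'"
```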


With that debugging output in place, the last output to the console is:



This is right before we 'ipconfig eth0'


[  100.068705] tg3: eth0: Link is up at 1000 Mbps, full duplex.
[  100.068767] tg3: eth0: Flow control is off for TX and off for RX.
[  393.930374] Machine check events logged
[  699.732829] Machine check events logged



...and from there it just hangs indefinitely.  I know for a fact that 
the kernel module, tg3.ko, is being loaded by load_modules, so that's 
not the problem - in fact I'm almost 100% positive that 'ipconfig' is 
killing network connectivity.  I initially thought I was (and maybe I 
still am) getting bitten by a really crappy Broadcom card/driver, but 
when I tested ipconfig in a VM (extracted the initrd, ran 'bin/ipconfig 
eth0 | tee /outfile'), the only way I could get the machine to be 
operable over the network again was to pop into the VM console and issue
an '/etc/init.d/networking restart'.  That may not be a great litmus 
test since it's using a virtual interface, not a real hardware interface 
- but it does mimic the behavior I see on real hardware exactly.


I would love to hear any insights and/or similar experiences others may 
have had with this.  I have several other servers that use different 
network drivers (igb, e1000, etc.) that all seem to work just fine, 
which reinforces my feeling that this Broadcom card is just poorly
supported on Linux.  I've tried both the tg3.ko that ships with Ubuntu, 
and compiling the driver myself, both with the same results.


Re: tg3 network cards

2008-12-03 Thread Michael Tautschnig
What kind of access do you have for debugging this? Console access? Can you get
it to boot by some other means (e.g., GRML live CD)?

ipconfig is part of the klibc-utils package and most likely just tries to get an
answer from your DHCP server at that very moment. If you get your system to boot
by some other means, you could safely copy over ipconfig from your Debian
systems and just run it on the console manually to see what happens.
Furthermore, tcpdump or the like may be useful to find out what ipconfig is
trying to achieve.

Other than that, there is also the frequently discussed issue of systems with
more than one NIC -- your ipconfig may simply be trying to get a response from
the DHCP server over some interface that doesn't have any cable plugged in.
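
One way to check the multi-NIC case from a break shell is to look at the link state the kernel reports for each interface. The sketch below reads the carrier attribute under /sys/class/net; list_links is a made-up name, and the optional directory argument exists only so the function can be tried against a test tree.

```shell
# Hypothetical helper: print "<interface> <carrier>" for each network interface.
# carrier is 1 (link detected), 0 (no link) or "unknown" (attribute unreadable,
# which typically happens while the interface is still down).
list_links() {
    root="${1:-/sys/class/net}"
    for dev in "$root"/*; do
        [ -d "$dev" ] || continue
        carrier=$(cat "$dev/carrier" 2>/dev/null) || carrier=unknown
        echo "${dev##*/} ${carrier:-unknown}"
    done
}

# Usage: list_links
```

An interface that shows 0 has no cable or no link partner, so DHCP requests sent over it will never be answered.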

Best,
Michael







Re: tg3 network cards

2008-12-03 Thread Ryan Steele

Michael Tautschnig wrote:
> What kind of access do you have for debugging this? Console access?

Yes, I have console access, both via KVM over IP and IPMI.

> Do you get
> it to boot by some other means (e.g., GRML live CD)?


Unfortunately, I don't have easy physical access, hence the KVM over IP 
and IPMI interface.  The server is physically located at a colo facility 
about 30 minutes away.  I'm doing what I can without driving down there.  :)



> ipconfig is part of the klibc-utils package and most likely just tries to get an
> answer from your DHCP server at that very moment.


Ah, didn't know that's where it came from - thanks for the tip.  Re: 
'most likely tries...' - really?  It doesn't just use the info supplied 
to the card initially when it boots up, DHCP's, and proceeds via PXE?  I 
submit, I tried to dump the binary with 'strings', but not much research 
beyond that yet (I'm exhausted today, been working for a long time).



> If you get your system to boot
> by some other means, you could safely copy over ipconfig from your Debian

Ubuntu.  I apologize for only making a passing reference to that at the
end of my OP.

> systems and just run it on the console manually to see what happens.


I tested this on a VM, and it also killed networking.  But, the VM has a 
virtual interface, and may not be a very good test, even if the results 
were the same.  I'll try to test ipconfig on a physical network 
interface to see if I hit the same problem.  But, like I said, I've 
installed other servers the same way successfully, the only difference 
being they had nice Intel cards, not a crappy Broadcom card.



> Furthermore, some tcpdump or the like may be useful to find out what ipconfig is
> trying to achieve.

I'll see if the DHCP server picks up anything, though I think I tried
that and didn't glean much from it.  (Then again, I'm exhausted, so
maybe when I'm fresh tomorrow I'll get different results.)



> Other than that, there is also the frequently discussed issue of systems with
> more than one NIC -- your ipconfig may simply be trying to get a response from
> the DHCP server over some interface that doesn't have any cable plugged in.

I suppose that could be the case - hadn't even occurred to me.  The box
does have dual on-board NICs, so that is a viable suggestion.  I'll do
some more research on that front, see what comes of it.  Thanks for the
suggestion.


Respectfully,
Ryan


Re: tg3 network cards

2008-12-03 Thread Michael Tautschnig
[...]

>> ipconfig is part of the klibc-utils package and most likely just tries to get an
>> answer from your DHCP server at that very moment.
>
> Ah, didn't know that's where it came from - thanks for the tip.  Re:
> 'most likely tries...' - really?  It doesn't just use the info supplied
> to the card initially when it boots up, DHCP's, and proceeds via PXE?  I
> submit, I tried to dump the binary with 'strings', but not much research
> beyond that yet (I'm exhausted today, been working for a long time).


Yes, it does DHCP on its own, no re-use of PXE's info at all (actually, I don't
know whether that would be available at all to Linux anyway). I think, by
default ipconfig eth0 will just do DHCP on eth0.
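
Since do_netmount sources /tmp/net-${DEVICE}.conf right after ipconfig returns, that file is a convenient place to see whether DHCP got an answer. The sketch below is only an illustration: check_netconf is a made-up name, and IPV4ADDR is an assumption about the variable name klibc's ipconfig writes, so compare it against the file on your system first.

```shell
# Hypothetical helper: report whether ipconfig left an address in its output file.
# IPV4ADDR is an assumed variable name; check your own net-<device>.conf.
check_netconf() {
    conf="${1:-/tmp/net-eth0.conf}"
    [ -r "$conf" ] || { echo "no config file: $conf"; return 1; }
    # Source in a subshell so the caller's variables are left alone.
    addr=$( unset IPV4ADDR; . "$conf" >/dev/null 2>&1; echo "$IPV4ADDR" )
    if [ -n "$addr" ] && [ "$addr" != "0.0.0.0" ]; then
        echo "lease: $addr"
    else
        echo "no address in $conf"
        return 1
    fi
}

# Usage: check_netconf /tmp/net-eth0.conf
```

A missing file or an empty address would suggest that ipconfig never received a DHCP reply on that interface.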

>> If you get your system to boot
>> by some other means, you could safely copy over ipconfig from your Debian
>
> Ubuntu.  I apologize for only making a passing reference to that at the
> end of my OP.

Oh, yes, sorry, spotted your mention of Ubuntu. But that shouldn't matter that
much, just use klibc from your Ubuntu boxes :-)

>> systems and just run it on the console manually to see what happens.
>
> I tested this on a VM, and it also killed networking.  But, the VM has a
> virtual interface, and may not be a very good test, even if the results
> were the same.  I'll try to test ipconfig on a physical network
> interface to see if I hit the same problem.  But, like I said, I've
> installed other servers the same way successfully, the only difference
> being they had nice Intel cards, not a crappy Broadcom card.

Well, as long as the VM shows a very similar behavior, it will be useful for
debugging from there, even though you might end up having debugged two different
issues :-)

[...]
>> Other than that, there is also the frequently discussed issue of systems with
>> more than one NIC -- your ipconfig may simply be trying to get a response from
>> the DHCP server over some interface that doesn't have any cable plugged in.
>
> I suppose that could be the case - hadn't even occurred to me.  The box
> does have dual on-board NICs, so that is a viable suggestion.  I'll do
> some more research on that front, see what comes of it.  Thanks for the
> suggestion.


Woo, two things to check first:
- There are known issues with some Broadcom cards and their IPMI firmware. You
  may or may not be able to apply these bugfixes (I think it required some DOS
  boot disk :-( ) and they might also fully disable IPMI. I guess the net may
  help you further along. But AFAIK the symptoms were a bit different (DHCP was
  fine, but no NFS afterwards, and somewhat Xen-related).
- The easier one: Check the MAC address of the interface the initrd is trying to
  run ipconfig on. This should help you to find out whether it is using the
  correct link.
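
The MAC check can be done from a break shell by reading the address attribute under /sys/class/net. A minimal sketch, assuming tr is available in the initrd; mac_matches is a made-up name and the third argument exists only so the function can be tried against a test tree.

```shell
# Hypothetical helper: succeed if interface $1 has MAC address $2
# (compared case-insensitively). Returns 2 if the interface cannot be read.
mac_matches() {
    dev="$1"
    want="$2"
    root="${3:-/sys/class/net}"
    have=$(cat "$root/$dev/address" 2>/dev/null) || return 2
    [ "$(echo "$have" | tr 'A-F' 'a-f')" = "$(echo "$want" | tr 'A-F' 'a-f')" ]
}

# Usage: mac_matches eth0 00:11:22:33:44:55 && echo "eth0 is the cabled NIC"
```

The expected address would be the one of the port that is actually cabled, for example as recorded in the DHCP server configuration.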

HTH,
Michael


