Hi,

I've had a PXE net-install setup that has worked very well for a while. You can plug any box into our network and boot from the network, and you get a grub menu with a bunch of generally useful options (memtest86+ for anybody, no password; sysresccd with a password; etc.). In addition, if our DHCP server knows what OS and configuration are supposed to be on that box, there will also be a submenu with rescue, prompted-install, and saved-answer install options custom to that box.

All this is just background to indicate how simple or complicated our grub.cfg is. Really it's not bad. It's written by hand (not the usual assembled-from-pieces-in-/etc), and is about 90 lines long. It defines a few functions, and it uses net_get_dhcp_option to query a site-local option the DHCP server sends to say what should be on the box. If that string is, say, "RHEL x86_64 6 5 workstation", then the main grub.cfg will source $prefix/RHEL and expect it to define a function RHEL that can be called with the remaining arguments; that function creates the submenu with the right menuentries for booting the OS installer with the right cmdline arguments. The RHEL script, for example, is another 45 lines. Nothing very big at all.
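
To make the dispatch concrete, here is a minimal, untested sketch of the idea rather than my actual grub.cfg. The interface name "pxe", the option number 224, and the names site_cfg, family, and dispatch are placeholders chosen for illustration; the per-OS file is assumed to define a function named after itself, as described above.

```
# Sketch only. Assumed for illustration: interface "pxe", site-local
# DHCP option 224, and the names site_cfg / family / dispatch.

# Store the site-local option as a string, e.g.
# "RHEL x86_64 6 5 workstation".
net_get_dhcp_option site_cfg pxe 224 string

function dispatch {
  # $1 is the OS family (e.g. RHEL); everything after it is passed on.
  set family="$1"
  shift
  # The sourced file is expected to define a function with the same
  # name as the file, which builds the per-box submenu.
  source "$prefix/$family"
  $family "$@"
}

if [ -n "$site_cfg" ]
then
  # Left unquoted on purpose so the string splits into separate words.
  dispatch $site_cfg
fi
```

With the example string above, this would source $prefix/RHEL and then call RHEL with the arguments x86_64 6 5 workstation.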

This whole setup works great on every other box I've used it on, but we just bought some brand new Dell Precision T5610 workstations, and the behavior is really crazy. Usually it loads grub, parses the scripts OK, and puts up the correct menu, but try to load any of the choices and it either hangs or reboots on the first kernel load command (linux, knetbsd, whatever). Sometimes it won't accept the pbkdf2 password at all, or will say "command not found" even for something built-in like reboot. On occasions when it does accept the password and I can get to the command line, it will hang on even something simple like

    testload ($root)/memtest86+.elf

It is acting very much like something in the grub-script code is stomping on memory somewhere.

As a test, I moved most of grub.cfg into grub.normal, and made a very short grub.cfg:

    set superusers="..."
    password_pbkdf2 root grub.pbkdf2.sha512....
    echo -n 'm for minimal: '
    read min
    echo
    if [ 'm' != "$min" ]
    then
      source "$prefix/grub.normal"
    fi

With that, I can boot and enter m, and it skips all the rest of the script. At the command line I can directly enter the same lines that were otherwise hanging:

    testload ($root)/memtest86+.elf

...works fine.

    knetbsd ($root)/memtest86+.elf
    boot

...gives me a memtest just fine, and so on.

So as long as it hasn't run my other 130 lines of script yet, apparently nothing has been stomped on yet. And there's really nothing at all fancy in the script: some function definitions, variable assignments and uses, uses of setparams and shift, and the one use of net_get_dhcp_option.

And why am I seeing this problem on these 5610s and not on other boxes? Do different BIOSes leave grub with really different amounts of working memory or something? Are there any grub commands I can use to see the memory stats or anything else that might help pin down where the problem is?

The 5610s have 32 GB of RAM each, and are set for PXE boot in legacy BIOS mode.
I have played with some BIOS settings that seemed like they might affect the memory available to grub, but with no improvement. I even pulled some DIMMs just to see if it might work better with less total RAM (32 GB is more than most of our other boxes, where I don't see this problem). Also no improvement.

Any hints on how to further troubleshoot this? And is there a way to build grub to give errors if something in the script code is clobbering memory or whatever, instead of just seeming to work until the machine hangs or reboots later?

Thanks,
Chapman Flack

_______________________________________________
Help-grub mailing list
https://lists.gnu.org/mailman/listinfo/help-grub
