Hi!

My department at the university has inherited an SGI Altix UV-1000 compute cluster [1]. It consists of two blade centers with 16 blades each. Each blade has 64 GB of local memory and two Intel Xeon X7560 CPUs, for a total of 64 CPUs (1024 logical CPUs with multi-core and Hyper-Threading enabled) and 2 TiB of RAM.
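As a quick sanity check of those totals, here is a minimal sketch of the arithmetic. The blade, socket and memory counts are taken from the paragraph above. The 8 cores per socket and 2 threads per core are assumptions based on the X7560 being an 8-core part with Hyper-Threading, and the 64 GB per blade is treated as 64 GiB so that the sum matches the stated 2 TiB.

```python
# Topology figures from the post. CORES_PER_SOCKET and THREADS_PER_CORE are
# assumptions (Xeon X7560: 8 cores, Hyper-Threading), not stated in the post.
BLADE_CENTERS = 2
BLADES_PER_CENTER = 16
SOCKETS_PER_BLADE = 2
CORES_PER_SOCKET = 8
THREADS_PER_CORE = 2
GIB_PER_BLADE = 64  # "64 GB" in the post, treated as GiB here


def totals():
    """Return the aggregate counts for the fully linked machine."""
    blades = BLADE_CENTERS * BLADES_PER_CENTER
    sockets = blades * SOCKETS_PER_BLADE
    logical_cpus = sockets * CORES_PER_SOCKET * THREADS_PER_CORE
    memory_gib = blades * GIB_PER_BLADE
    return {
        "blades": blades,
        "sockets": sockets,
        "logical_cpus": logical_cpus,
        "memory_tib": memory_gib / 1024,
    }


def single_blade_logical_cpus():
    """Logical CPUs visible when only one blade is used (NUMAlink disabled)."""
    return SOCKETS_PER_BLADE * CORES_PER_SOCKET * THREADS_PER_CORE


if __name__ == "__main__":
    print(totals())
    print(single_blade_logical_cpus())
```

Under these assumptions the numbers agree with the post: 32 blades, 64 sockets, 1024 logical CPUs and 2 TiB for the whole machine, and 32 logical CPUs for a single blade.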
The blades are interconnected through the NUMAlink system [2], meaning that all blades form a single logical node with 1024 CPUs and 2 TiB of RAM. The machine was originally shipped with SuSE Linux Enterprise Server 11 (SLES) running kernel 2.6.32, if memory serves. I have replaced the SLES installation with a stock Debian Wheezy (after making a full backup of the original SLES installation), since this is the distribution of our choice.

After getting familiar with the system, it turned out that getting Linux to boot on it is anything but trivial. First, I encountered problems with GRUB, which had trouble with the number of e820 memory table entries. This was resolved by a more recent GRUB release [3]. Once GRUB was working, I ran into problems with the kernel, which apparently simply froze when trying to boot. I tried various Linux distributions and kernels without success.

However, it turned out that the kernel boots just fine when NUMAlink is disabled, meaning that only the first of the 32 blades is used. This reduces the machine to 32 CPUs and 64 GB of RAM, which is obviously not what you want when you have a machine that consumes 33 kW of power ;).

Anyway, I did some further research, and it turns out that SGI passes a very long list of kernel parameters when booting the machine. To be exact, these:

    /sgiroot splash=silent showopts stop_machine.lazy=1 add_efi_memmap nortsched processor.max_cstate=1 nobau log_buf_len=8M kdb=on cgroup_disable=memory earlyprintk=ttyS0,115200n8 pcie_aspm=on nohz=off crashkernel=512M intel_iommu=off init=/sbin/bootcpuset console=ttyS0,115200n8

Most of these are explained here [4] and are obviously part of the vanilla Linux kernel. However, the parameter "stop_machine.lazy" appears to be exclusive to SuSE kernels [5]. Now I am wondering whether the SuSE patch is actually what gets the kernel booting on the UV1000 with NUMAlink enabled. I haven't built a kernel with the patch added yet, however.
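To see which of SGI's parameters a given Debian boot is actually missing, the list quoted above can be compared against the running kernel's command line. The following is only a rough sketch: the helper names `parse_cmdline` and `missing_params` are made up for illustration, the SGI string is copied verbatim from above (including the leading "/sgiroot" token, which is kept as-is), and the comparison looks at parameter names only, not at their values.

```python
# Parameter list as quoted in the post, copied verbatim.
SGI_CMDLINE = (
    "/sgiroot splash=silent showopts stop_machine.lazy=1 add_efi_memmap "
    "nortsched processor.max_cstate=1 nobau log_buf_len=8M kdb=on "
    "cgroup_disable=memory earlyprintk=ttyS0,115200n8 pcie_aspm=on nohz=off "
    "crashkernel=512M intel_iommu=off init=/sbin/bootcpuset "
    "console=ttyS0,115200n8"
)


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(reference, current):
    """Return the parameter names present in reference but not in current."""
    ref = parse_cmdline(reference)
    cur = parse_cmdline(current)
    return sorted(name for name in ref if name not in cur)


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the running kernel on Linux.
    try:
        with open("/proc/cmdline") as fh:
            current = fh.read()
    except OSError:
        current = ""
    for name in missing_params(SGI_CMDLINE, current):
        print(name)
```

Running this on the single-blade Debian boot would give a starting list of parameters to try one at a time, although it cannot tell whether any of them is actually responsible for the freeze.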
I will do that once I get back to work in the new year. I was just wondering whether anyone has further suggestions on what I could look into and what might cause the kernel to freeze immediately after GRUB with NUMAlink enabled. It freezes right after decompressing the kernel.

Any ideas?

Cheers,
Adrian

PS: Some documentation on the UV from SGI [6-7].

[1] http://www.sgi.com/products/remarketed/servers/uv1.html
[2] http://en.wikipedia.org/wiki/NUMAlink
[3] http://git.savannah.gnu.org/cgit/grub.git/commit/?id=a4e5ca80d97077cf302223a7c6aa38a2a9bedf8a
[4] https://access.redhat.com/site/articles/42548
[5] http://kernel.opensuse.org/cgit/kernel-source/commit/?id=39eac1e710e6c9c8a524ad9a6319a3426e872894
[6] http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/hdwr/bks/SGI_Admin/books/UV_Wind_Install_AG/sgi_html/ch01.html#Z1299705322tls
[7] http://techpubs.sgi.com/library/manuals/5000/007-5663-003/pdf/007-5663-003.pdf

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - [email protected]
`. `'   Freie Universitaet Berlin - [email protected]
  `-    GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

