This looks like expected behavior to me. An NVIDIA kernel module only works with the kernel it was compiled against. By default that is the kernel that was running at build time (not any other kernel that happened to be installed), and the matching kernel headers must have been available. So you have to recompile the driver after every kernel update, *after* rebooting into the new kernel. It has been a while since I worked with NVIDIA drivers, but I recall that there was a command-line option that lets you compile the driver for a different kernel. I'm not sure how well that works; you would also need the correct kernel headers, the kernel-devel RPM, and probably more. I never bothered trying to make that work.
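To see which situation a node is in, something like the following Python sketch might help. It is untested on a real cluster and makes several assumptions: the function names are made up for illustration, the module is assumed to live somewhere under `/lib/modules/<version>/` (as in the error message quoted below), and the `build` entry in that directory is assumed to point at the kernel-devel tree. I believe the installer option mentioned above is `--kernel-name`, but treat that as an assumption and check the installer's `--advanced-options` help before relying on it.

```python
from pathlib import Path


def kernel_status(modules_root="/lib/modules", module_glob="nvidia.ko*"):
    """Map each installed kernel version to (has_module, has_headers)."""
    status = {}
    root = Path(modules_root)
    if not root.is_dir():
        return status
    for kdir in sorted(p for p in root.iterdir() if p.is_dir()):
        # The module may sit in different subdirectories (e.g. video/),
        # so search the whole per-kernel tree.
        has_module = any(kdir.rglob(module_glob))
        # "build" is normally a symlink into the kernel-devel tree; exists()
        # returns False when the link is dangling (kernel-devel missing).
        has_headers = (kdir / "build").exists()
        status[kdir.name] = (has_module, has_headers)
    return status


def installer_command(installer, kernel_version):
    """Build (but do not run) an installer call for a non-running kernel."""
    # Assumption: --kernel-name takes the `uname -r` string of the target kernel.
    return ["sh", str(installer), "--silent", f"--kernel-name={kernel_version}"]
```

Calling `kernel_status()` on a freshly imaged node should show the module present for 3.10.0-514.el7.x86_64 and missing for 3.10.0-514.26.2.el7.x86_64. The list returned by `installer_command(...)` could then be passed to `subprocess.run`, provided the headers for the target kernel are installed.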
In your case, the proper procedure should be (caveat: this is theory; I did not actually test this):

- Create a driver-build machine that has the correct *old* kernel installed. I would recommend doing this away from xCAT. A virtual machine is fine for the purpose.
- Recompile the driver.
- Build an RPM from the driver's binaries, not the source code. This RPM should include a dependency on the correct kernel version.
- On the build machine, upgrade the kernel and all other RPMs, reboot into the new kernel, and recompile the driver.
- Build another RPM from these binaries. Again, make sure this RPM depends on the correct kernel version.
- Create a repository (or use an existing one) and put both RPMs in it (along with any future ones you create this way).
- Make this repository available to xCAT.

This will allow you to initially install the old NVIDIA driver into your osimage (because of the dependency, it will pick the one for the old kernel), and then when you update the kernel, the NVIDIA driver will be updated along with it from your repository.

If you want to avoid a few steps, at the expense of more manual work later, you can also install this RPM into your osimage *after* you update the kernel RPM to the correct version.

You have to rebuild the RPM with every new kernel version. Since you are using an older version of CentOS, that shouldn't be too frequent.

There is another option, but it may be less desirable in an xCAT system: you can install DKMS to automatically recompile the driver every time a kernel is updated. That means you will have to have a lot of extra stuff (kernel headers, gcc, various devel RPMs) on each node.

_______________________________________________________________________
Kevin Keane | Systems Architect | University of San Diego ITS | [email protected]
Maher Hall, 192 | 5998 Alcalá Park | San Diego, CA 92110-2492 | 619.260.6859 | Text: 760-721-8339

*REMEMBER! No one from IT at USD will ever ask to confirm or supply your password.*
On Thu, May 2, 2019 at 6:31 AM Roosen, Nicolas <[email protected]> wrote:

> Hello, I have some trouble installing the Nvidia drivers into a compute
> node, using a custom script.
>
> Using xcat 2.13.5 on Centos 7.3
>
> We repackaged the Nvidia driver in a RPM, which installs fine when the
> node is up.
>
> But when we install it during a node re-image, it fails, because there
> are two different kernel version.
>
> Bellow are more details, does anyone has some experience with the Nvidia
> driver ?
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> This RPM is installed during the deployment process, which uses the
> default Centos 7.3 kernel (3.10.0-514.el7). The kernel is also updated
> during the installation process (but *before* the Nvidia driver
> installation).
>
> Once the node deployment is finished, it reboots into the latest kernel
> (3.10.0-514.26.2.el7), and the Nvidia driver fails to load. If I reboot
> into the older kernel, it works.
>
> So I'd like to know if there is an options to install the Nvidia driver
> for another kernel than the running one?
>
> I have this error, if that helps:
>
> Making nvidia.ko silently in
> /opt/sgi/Factory-Install/nvidia/NVIDIA-Linux-x86_64-418.40.04/kernel
> Module nvidia.ko from kernel 3.10.0-514.el7.x86_64 is not compatible
> with kernel 3.10.0-514.26.2.el7.x86_64 in symbols:
> acpi_bus_register_driver acpi_bus_get_device acpi_bus_unregister_driver
> nvidia.ko:
> /lib/modules/3.10.0-514.el7.x86_64/video/nvidia.ko
>
> Curiously enough, if I re-install the same RPM by hand while running the
> latest kernel, it works ... So I'm a bit lost here ...
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> Thanks.
> --
> Nicolas
>
> _______________________________________________
> xCAT-user mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/xcat-user
