It looks to me like this is expected behavior.

By default, the NVIDIA driver module only works with the kernel that was
running when it was compiled (not with any other kernels installed at the
same time), and it can only be compiled if that kernel's headers were
available. That means you have to recompile the driver after every kernel
update, *after* rebooting into the new kernel. It has been a while since I
worked with NVIDIA drivers, but I recall that there was a command-line
option that let you compile it for a different kernel. I'm not sure how
well that works; you would still need the matching kernel headers, the
kernel-devel RPM, and probably more. I never bothered trying to make that
work.
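
To see which installed kernels actually have a driver module, something
like the following could help. This is only a sketch, not something I have
run on an xCAT node: the function name is made up, and it assumes the
module is an uncompressed file called nvidia.ko somewhere under
/lib/modules/<kernel release>/ (the error output quoted below shows it in
video/).

```python
import os


def kernels_missing_module(modules_root="/lib/modules", module_name="nvidia.ko"):
    """Return the kernel releases under modules_root that lack module_name.

    Hypothetical helper: assumes an uncompressed module file.
    """
    if not os.path.isdir(modules_root):
        return []
    missing = []
    for release in sorted(os.listdir(modules_root)):
        release_dir = os.path.join(modules_root, release)
        if not os.path.isdir(release_dir):
            continue
        # Walk the whole tree, since the module may sit in video/, extra/, etc.
        found = any(module_name in files for _, _, files in os.walk(release_dir))
        if not found:
            missing.append(release)
    return missing
```

Calling kernels_missing_module() on a node lists every kernel release that
would come up without the driver if you booted into it.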

In your case, the proper procedure should be (caveat: this is theory; I did
not actually test this):

- Create a driver-build machine that has the correct *old* kernel
installed. I would recommend doing this away from xCAT. A virtual machine
is fine for the purpose.
- Recompile the driver.
- Build an RPM from the driver's binaries, not the source code. This RPM
should include a dependency on the correct kernel version.
- On the build machine, upgrade the kernel and all other RPMs, reboot into
the new kernel, and recompile the driver.
- Build another RPM from these new binaries. Again, make sure this RPM
depends on the correct kernel version.
- Create a repository (or use an existing one) and put both RPMs in (and
any future ones you create this way).
- Make this repository available to xCAT.
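
For the kernel dependency in the steps above, the line in the spec file
could be derived from the output of `uname -r` on the build machine. A
minimal sketch, assuming the release string ends in the architecture (as it
does on CentOS 7) and that the package you depend on is named "kernel"; the
function name and the list of architectures are mine, not anything from
xCAT or NVIDIA:

```python
KNOWN_ARCHES = ("x86_64", "i686", "aarch64", "ppc64le", "s390x")


def kernel_requires_line(uname_r):
    """Turn a `uname -r` string into an RPM spec `Requires:` line."""
    # The kernel RPM's version-release does not include the trailing
    # architecture that `uname -r` reports, so split it off.
    version_release, sep, arch = uname_r.rpartition(".")
    if not sep or arch not in KNOWN_ARCHES or "-" not in version_release:
        raise ValueError("unexpected kernel release: %r" % uname_r)
    return "Requires: kernel = %s" % version_release


print(kernel_requires_line("3.10.0-514.26.2.el7.x86_64"))
# prints: Requires: kernel = 3.10.0-514.26.2.el7
```

With an exact "=" dependency like this, each driver RPM can only be
installed alongside the kernel it was built for.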

This will allow you to initially install the old nvidia driver into your
osimage (because of the dependency, it will pick the one for the old
kernel), and then when you update the kernel, the nvidia driver will be
updated along with it from your repository.

If you want to avoid a few steps, at the expense of more manual work later,
you can also install this RPM into your osimage *after* you update the
kernel RPM to the correct version.

You have to rebuild the RPM with every new kernel version. Since you are
using an older version of CentOS, that shouldn't be too frequent.

There is another option, but it may be less desirable in an xCAT
system: you can install DKMS to automatically recompile the driver every
time the kernel is updated. That means you need a lot of extra packages
(kernel headers, gcc, various devel RPMs) on each node.

_______________________________________________________________________
Kevin Keane | Systems Architect | University of San Diego ITS |
[email protected]
Maher Hall, 192 |5998 Alcalá Park | San Diego, CA 92110-2492 | 619.260.6859
| Text: 760-721-8339

On Thu, May 2, 2019 at 6:31 AM Roosen, Nicolas <[email protected]>
wrote:

> Hello, I have some trouble installing the Nvidia drivers into a compute
> node, using a custom script.
>
> Using xcat 2.13.5 on Centos 7.3
>
> We repackaged the Nvidia driver in a RPM, which installs fine when the
> node is up.
>
> But when we install it during a node re-image, it fails, because there
> are two different kernel versions.
>
> Below are more details. Does anyone have experience with the Nvidia
> driver?
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> This RPM is installed during the deployment process, which uses the
> default Centos 7.3 kernel (3.10.0-514.el7). The kernel is also updated
> during the installation process (but *before* the Nvidia driver
> installation).
>
> Once the node deployment is finished, it reboots into the latest kernel
> (3.10.0-514.26.2.el7), and the Nvidia driver fails to load. If I reboot
> into the older kernel, it works.
>
> So I'd like to know if there is an option to install the Nvidia driver
> for a kernel other than the running one?
>
> I have this error, if that helps:
>
> Making nvidia.ko silently in
> /opt/sgi/Factory-Install/nvidia/NVIDIA-Linux-x86_64-418.40.04/kernel
> Module nvidia.ko from kernel 3.10.0-514.el7.x86_64 is not compatible
> with kernel 3.10.0-514.26.2.el7.x86_64 in symbols:
> acpi_bus_register_driver acpi_bus_get_device acpi_bus_unregister_driver
> nvidia.ko:
> /lib/modules/3.10.0-514.el7.x86_64/video/nvidia.ko
>
> Curiously enough, if I re-install the same RPM by hand while running the
> latest kernel, it works ... So I'm a bit lost here ...
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> Thanks.
> --
> Nicolas
>
> _______________________________________________
> xCAT-user mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/xcat-user
>