Bug#976901: nvidia-tesla-450-kernel-dkms: Fails to build DKMS kernel module on ppc64le 450.80.02
On 19/12/20 3:24 PM, Andreas Beckmann wrote:
> Control: severity -1 important
>
> This bug should not cause nvidia-cuda-toolkit to be removed from testing...

Indeed, and seeing that Nvidia only supports the "compute" mode of the driver on ppc64le, I'm having second thoughts about using them on my Talos II. I think I will revert to AMD in this case and use the Titan cards on plain x86.

> Did you ever have a working kernel/driver/toolkit combination?

Partly yes. I could get CUDA working, even on latest testing, but it was very shaky: as soon as I tried to use both cards, the driver would crash and I would have to reboot the system.

> IIRC you had two GPUs in that machine, could you try with only one installed?

That made things a bit better, but it's still not entirely stable. It's not a tested/supported configuration and it shows.

> I seem to remember reading that, as of CUDA 11.x, Ubuntu on ppc64el is no
> longer a combination supported by nvidia. That makes it a bit more difficult
> to find a setup that is supposed to work. But if it actually works in some
> RHEL/CentOS environment, why shouldn't we get it running on Debian as well?

Indeed, that is the case. Unless one owns a very high-end data center card on ppc64le, it's probably not going to be supported by nvidia. I'm not sure whether I should ask you to close the bug as wontfix: I don't expect the situation to change soon, and it's not something that Debian can fix, tbh. As I said, I am already considering using the cards on x86. Nevertheless, I appreciate your help, thank you.

Regards
Konstantinos
Control: severity -1 important

This bug should not cause nvidia-cuda-toolkit to be removed from testing...

Did you ever have a working kernel/driver/toolkit combination?

On 12/10/20 5:05 PM, Konstantinos Margaritis wrote:
> I just installed 5.8 from buster-backports along with tesla-440 driver
> and it seems to give slightly better results, the modules build and load,
> but similar errors in kernel:

You could try the kernel from Debian stable ...

IIRC you had two GPUs in that machine, could you try with only one installed?

I seem to remember reading that, as of CUDA 11.x, Ubuntu on ppc64el is no longer a combination supported by nvidia. That makes it a bit more difficult to find a setup that is supposed to work. But if it actually works in some RHEL/CentOS environment, why shouldn't we get it running on Debian as well?

I only found this footnote in
https://docs.nvidia.com/cuda/archive/11.1.1/cuda-installation-guide-linux/index.html

(4) Only Tesla V100 and T4 GPUs are supported for CUDA 11.2 on Arm64 (aarch64) POWER9 (ppc64le)

Andreas
On 10/12/20 5:37 PM, Andreas Beckmann wrote:
> If it works with Linux 5.8, can you stick to an older kernel (which will
> not receive security updates) until there is a newer driver release
> available from Nvidia?

I just installed 5.8 from buster-backports along with the tesla-440 driver and it seems to give slightly better results: the modules build and load, but there are similar errors in the kernel log:

[ 15.167039] NVRM: GPU :01:00.0: DMA address not in addressable range of device (0x80020054106-0x80020054106, 0x800-0x8ff)
[ 15.167070] NVRM: GPU :01:00.0: DMA address not in addressable range of device (0x80020054106-0x80020054106, 0x800-0x8ff)
[ 15.168422] NVRM: GPU :01:00.0: RmInitAdapter failed! (0x31:0x40:937)
[ 15.168517] NVRM: GPU :01:00.0: rm_init_adapter failed, device minor number 0
...
[ 18.058799] NVRM: GPU :01:00.0: DMA address not in addressable range of device (0x80020053266-0x80020053266, 0x800-0x8ff)
[ 18.058816] NVRM: GPU :01:00.0: DMA address not in addressable range of device (0x80020053266-0x80020053266, 0x800-0x8ff)
[ 18.059885] NVRM: GPU :01:00.0: RmInitAdapter failed! (0x31:0x40:937)
[ 18.059943] NVRM: GPU :01:00.0: rm_init_adapter failed, device minor number 0

As for Xorg.0.log:

...
[ 18.046] (II) NVIDIA GLX Module 440.118.02 Thu Sep 3 09:45:23 UTC 2020
[ 18.049] (II) NVIDIA: The X server supports PRIME Render Offload.
[ 18.158] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA GPU at PCI:1:0:0. Please
[ 18.158] (EE) NVIDIA(GPU-0): check your system's kernel log for additional error
[ 18.158] (EE) NVIDIA(GPU-0): messages and refer to Chapter 8: Common Problems in the
[ 18.158] (EE) NVIDIA(GPU-0): README for additional information.
[ 18.158] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!
[ 18.158] (EE) NVIDIA(0): Failing initialization of X screen
[ 18.158] (II) UnloadModule: "nvidia"
[ 18.158] (II) UnloadSubModule: "glxserver_nvidia"
[ 18.158] (II) Unloading glxserver_nvidia
[ 18.158] (II) UnloadSubModule: "wfb"
[ 18.158] (II) UnloadSubModule: "fb"
[ 18.158] (==) NVIDIA(G0): Depth 24, (==) framebuffer bpp 32
[ 18.158] (==) NVIDIA(G0): RGB weight 888
[ 18.158] (==) NVIDIA(G0): Default visual is TrueColor
[ 18.158] (==) NVIDIA(G0): Using gamma correction (1.0, 1.0, 1.0)
[ 18.158] (**) NVIDIA(G0): Enabling 2D acceleration
[ 18.158] (EE) NVIDIA(G0): GPU screens are not yet supported by the NVIDIA driver
[ 18.158] (EE) NVIDIA(G0): Failing initialization of X screen
[ 18.158] (II) UnloadModule: "nvidia"
[ 18.158] (II) UnloadSubModule: "wfb"
[ 18.158] (II) UnloadSubModule: "fb"
[ 18.158] (EE) Screen(s) found, but none have a usable configuration.
[ 18.158] (EE) Fatal server error:
[ 18.158] (EE) no screens found(EE)

I don't mind waiting for a newer release, or using an older kernel. However, any suggestions to solve the problem would be appreciated, even outside this particular bug report.

Regards
Konstantinos
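
For anyone comparing kernel/driver combinations as in the message above, a minimal Python sketch that tallies the two NVRM messages seen in these logs per GPU may help. The function name `summarize_nvrm` and the regular expressions are illustrative assumptions derived only from the log lines quoted in this thread; they are not part of any NVIDIA or Debian tool, and the device string is taken exactly as printed.

```python
import re
from collections import Counter

# Patterns modelled on the NVRM lines quoted in this thread (an assumption:
# other driver versions may word these messages differently).
DMA_ERROR = re.compile(
    r"NVRM: GPU (?P<device>\S+): DMA address not in addressable range"
)
INIT_FAILED = re.compile(
    r"NVRM: GPU (?P<device>\S+): RmInitAdapter failed! \((?P<code>[^)]+)\)"
)


def summarize_nvrm(log_text):
    """Count DMA range errors and RmInitAdapter failures per GPU.

    Returns a dict with two Counters:
      "dma_errors"    keyed by device string
      "init_failures" keyed by (device string, error code)
    """
    dma_errors = Counter()
    init_failures = Counter()
    for line in log_text.splitlines():
        match = DMA_ERROR.search(line)
        if match:
            dma_errors[match.group("device")] += 1
            continue
        match = INIT_FAILED.search(line)
        if match:
            init_failures[(match.group("device"), match.group("code"))] += 1
    return {"dma_errors": dma_errors, "init_failures": init_failures}
```

Feeding it the output of `dmesg` from each boot makes it easier to see whether a kernel or driver change alters the failure code or only the addresses.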
On 12/10/20 10:37 AM, Konstantinos Margaritis wrote:
> On 10/12/20 1:19 AM, Andreas Beckmann wrote:
>> but I'm not sure whether it is worth backporting them,
>> since you most likely will be affected by
>> #973729 - nvidia-uvm does not work with Linux 5.9
>> which is fixed in 455.45.01
>
> Well, I did the replace you suggested below and even though the modules
> load, I don't get a display, here is what dmesg gives:
>
> [ 15.889326] NVRM: GPU :01:00.0: DMA address not in addressable
> range of device (0x80020054de8-0x80020054de8,
> 0x800-0x8ff)

If it works with Linux 5.8, can you stick to an older kernel (which will not receive security updates) until there is a newer driver release available from Nvidia?

Andreas
On 10/12/20 1:19 AM, Andreas Beckmann wrote:
> but I'm not sure whether it is worth backporting them,
> since you most likely will be affected by
> #973729 - nvidia-uvm does not work with Linux 5.9
> which is fixed in 455.45.01

Well, I did the replace you suggested below and, even though the modules load, I don't get a display. Here is what dmesg gives:

[ 15.889326] NVRM: GPU :01:00.0: DMA address not in addressable range of device (0x80020054de8-0x80020054de8, 0x800-0x8ff)
[ 15.889341] NVRM: GPU :01:00.0: DMA address not in addressable range of device (0x80020054de8-0x80020054de8, 0x800-0x8ff)
[ 15.890377] NVRM: GPU :01:00.0: DMA address not in addressable range of device (0x80020054de8-0x80020054de8, 0x800-0x8ff)
[ 15.890564] NVRM: GPU :01:00.0: RmInitAdapter failed! (0x24:0x1e:1224)
[ 15.890601] NVRM: GPU :01:00.0: rm_init_adapter failed, device minor number 0
[ 15.995590] NVRM: GPU 0030:01:00.0: DMA address not in addressable range of device (0x80020054a31-0x80020054a31, 0x800-0x8ff)
[ 15.995601] NVRM: GPU 0030:01:00.0: DMA address not in addressable range of device (0x80020054a31-0x80020054a31, 0x800-0x8ff)
[ 15.996482] NVRM: GPU 0030:01:00.0: DMA address not in addressable range of device (0x80020054a31-0x80020054a31, 0x800-0x8ff)
[ 15.996650] NVRM: GPU 0030:01:00.0: RmInitAdapter failed! (0x24:0x1e:1224)
[ 15.996705] NVRM: GPU 0030:01:00.0: rm_init_adapter failed, device minor number 1
[ 34.850800] [ cut here ]
[ 34.850801] remap_4k_pfn called with wrong pfn value
[ 34.850966] WARNING: CPU: 5 PID: 1584 at arch/powerpc/include/asm/book3s/64/hash-64k.h:166 nvidia_mmap_helper+0x6bc/0x800 [nvidia]
[ 34.850967] Modules linked in: xt_conntrack(E) nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) br_netfilter(E) overlay(E) xt_CHECKSUM(E) nft_chain_nat(E) xt_MASQUERADE(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) nft_counter(E) xt_tcpudp(E) nft_compat(E) bridge(E) stp(E) llc(E) nf_tables(E) nfnetlink(E) rfkill(E) nvidia_drm(POE) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) nvidia_modeset(POE) nvidia(POE) binfmt_misc(E) evdev(E) joydev(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) snd_intel_dspcfg(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) snd_timer(E) ctr(E) cbc(E) snd(E) vmx_crypto(E) soundcore(E) gf128mul(E) ofpart(E) ipmi_powernv(E) powernv_flash(E) ipmi_devintf(E) mtd(E) ipmi_msghandler(E) opal_prd(E) at24(E) regmap_i2c(E) parport_pc(E) lp(E) drm(E) parport(E) sunrpc(E) fuse(E) configfs(E) drm_panel_orientation_quirks(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E)
[ 34.850994] crc32c_generic(E) ecb(E) aes_generic(E) libaes(E) xts(E) hid_generic(E) usbhid(E) hid(E) dm_crypt(E) dm_mod(E) xhci_pci(E) xhci_hcd(E) tg3(E) usbcore(E) nvme(E) libphy(E) nvme_core(E) ptp(E) pps_core(E) usb_common(E) t10_pi(E) crc_t10dif(E) crct10dif_generic(E) crct10dif_common(E)
[ 34.851008] CPU: 5 PID: 1584 Comm: Xorg Tainted: P OE 5.9.0-4-powerpc64le #1 Debian 5.9.11-1
[ 34.851009] NIP: c0080e44c664 LR: c0080e44c660 CTR:
[ 34.851010] REGS: c7493750 TRAP: 0700 Tainted: P OE (5.9.0-4-powerpc64le Debian 5.9.11-1)
[ 34.851011] MSR: 90029033 CR: 2804 XER:
[ 34.851014] CFAR: c01314e4 IRQMASK: 0
GPR00: c0080e44c660 c74939e0 c0080f09ec00 0028
GPR04: 0001 0004 0027 c005ff6cbf90
GPR08: 0023 ffd8 0027
GPR12: 2000 c005fffea600 0001473aaad0 7fffeac1ac14
GPR16: 0013 0008
GPR20: 0001 1000 00600240
GPR24: c005f7085e08 c005f7085800 2000 0003
GPR28: 00060024 c005f7085800 c005fa441800 c005f38fbb80
[ 34.851105] NIP [c0080e44c664] nvidia_mmap_helper+0x6bc/0x800 [nvidia]
[ 34.851187] LR [c0080e44c660] nvidia_mmap_helper+0x6b8/0x800 [nvidia]
[ 34.851188] Call Trace:
[ 34.851270] [c74939e0] [c0080e44c660] nvidia_mmap_helper+0x6b8/0x800 [nvidia] (unreliable)
[ 34.851353] [c7493ac0] [c0080e44c814] nvidia_mmap+0x6c/0xc0 [nvidia]
[ 34.851434] [c7493b00] [c0080e4400ec] nvidia_frontend_mmap+0x54/0x80 [nvidia]
[ 34.851438] [c7493b20] [c03bf51c] mmap_region+0x4cc/0x840
[ 34.851439] [c7493c00] [c03bfcac]
On 12/9/20 9:30 AM, Konstantinos Margaritis wrote:
> Hi, I am trying to use a Titan X on a Talos II Power9 system on bullseye
> and the nvidia module fails to compile. Build log attached:

The error:

/var/lib/dkms/nvidia-tesla-450/450.80.02/build/nvidia/nv-pci.c:891:10: warning: 'enum pci_channel_state' declared inside parameter list will not be visible outside of this definition or declaration
/var/lib/dkms/nvidia-tesla-450/450.80.02/build/nvidia/nv-pci.c:891:28: error: parameter 2 ('error') has incomplete type
/var/lib/dkms/nvidia-tesla-450/450.80.02/build/nvidia/nv-pci.c:889:1: error: function declaration isn't a prototype [-Werror=strict-prototypes]

Similar errors happen building the 418 and 440 tesla driver modules for Linux 5.9 on ppc64el.

There are changes in 455.45.01-1 to mitigate this kernel change:

# pci_channel_state was removed by commit 16d79cd4e23b ("PCI: Use
# 'pci_channel_state_t' instead of 'enum pci_channel_state'") in
# v5.9-rc1 (2020-07-02).

but I'm not sure whether it is worth backporting them, since you most likely will be affected by
#973729 - nvidia-uvm does not work with Linux 5.9
which is fixed in 455.45.01

Unfortunately there is no separate 455.xx driver release available for ppc64el.

Andreas

PS: you could try s/enum pci_channel_state/pci_channel_state_t/g on the dkms tree

PPS: the first time I hear that someone is actually trying to use the ppc64el packages ;-)
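
The substitution suggested in the PS can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `patch_text` and `patch_tree` are hypothetical, the choice to touch only `.c` and `.h` files is mine, and the directory to run it on (the unpacked driver source that DKMS builds from) has to be located by hand. Unlike the plain `s///g` expression, the pattern matches whole tokens, so text that already uses `pci_channel_state_t` is left alone.

```python
import re
from pathlib import Path
from typing import List

# Old spelling removed in Linux v5.9-rc1; the trailing \b keeps an already
# converted 'pci_channel_state_t' from being rewritten a second time.
OLD_TYPE = re.compile(r"\benum\s+pci_channel_state\b")
NEW_TYPE = "pci_channel_state_t"


def patch_text(source: str) -> str:
    """Return source with 'enum pci_channel_state' replaced by the typedef."""
    return OLD_TYPE.sub(NEW_TYPE, source)


def patch_tree(root: Path) -> List[Path]:
    """Rewrite every .c/.h file under root in place; return the changed files."""
    changed = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in {".c", ".h"}:
            continue
        # surrogateescape round-trips any non-UTF-8 bytes unchanged.
        original = path.read_text(encoding="utf-8", errors="surrogateescape")
        patched = patch_text(original)
        if patched != original:
            path.write_text(patched, encoding="utf-8", errors="surrogateescape")
            changed.append(path)
    return changed
```

Calling `patch_tree(Path(...))` on a copy of the source tree first, and diffing the result, is a cautious way to check what would change before rebuilding the module. As noted later in the thread, this only gets the modules to build and load; it does not address the runtime failures.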