Bug#976901: nvidia-tesla-450-kernel-dkms: Fails to build DKMS kernel module on ppc64le 450.80.02

2020-12-21 Thread Konstantinos Margaritis

On 19/12/20 3:24 PM, Andreas Beckmann wrote:

> Control: severity -1 important
>
> This bug should not cause nvidia-cuda-toolkit to be removed from testing...


Indeed, and seeing that Nvidia only supports the "compute" mode of the
driver on ppc64le, I'm having second thoughts about using the cards on my
Talos II. I think I will revert to AMD in this case and use the Titan
cards on plain x86.



> Did you ever have a working kernel/driver/toolkit combination?

Partly yes, I could get CUDA working, even on latest testing, but it was
very shaky, and as soon as I tried to use both cards it would crash the
driver and I would have to reboot the system.


> IIRC you had two GPUs in that machine, could you try with only one
> installed?

That made things a bit better, but it's still not entirely stable. It's
not a tested/supported configuration and it shows.


> I seem to remember reading that Ubuntu on ppc64el is no longer a
> combination supported by nvidia in cuda 11.x. That makes it a bit more
> difficult to find a setup that is supposed to work. But if it actually
> works in some RHEL/CentOS environment, why shouldn't we get it running
> on Debian as well?


Indeed, that is the case. Unless one owns a very high-end data center
card on ppc64le, it's probably not going to be supported by nvidia.


I'm not sure if I should ask you to close the bug as wontfix; I don't
expect the situation will change soon, and it's not something that
Debian can fix tbh. As I said, I am already considering using the cards
on x86.


Nevertheless, I appreciate your help, thank you.

Regards

Konstantinos



2020-12-19 Thread Andreas Beckmann
Control: severity -1 important

This bug should not cause nvidia-cuda-toolkit to be removed from testing...

Did you ever have a working kernel/driver/toolkit combination?

On 12/10/20 5:05 PM, Konstantinos Margaritis wrote:
> I just installed 5.8 from buster-backports along with tesla-440 driver
> and it seems to give slightly better results, the modules build and load,
> but similar errors in kernel:

You could try the kernel from Debian stable ...

IIRC you had two GPUs in that machine, could you try with only one
installed?

I seem to remember reading that Ubuntu on ppc64el is no longer a
combination supported by nvidia in cuda 11.x. That makes it a bit more
difficult to find a setup that is supposed to work. But if it actually
works in some RHEL/CentOS environment, why shouldn't we get it running
on Debian as well?

I only found this footnote in
https://docs.nvidia.com/cuda/archive/11.1.1/cuda-installation-guide-linux/index.html
  (4) Only Tesla V100 and T4 GPUs are supported for CUDA 11.2 on Arm64
(aarch64) POWER9 (ppc64le)

Andreas



2020-12-10 Thread Konstantinos Margaritis

On 10/12/20 5:37 PM, Andreas Beckmann wrote:

> If it works with Linux 5.8, can you stick to an older kernel (which will
> not receive security updates) until there is a newer driver release
> available from Nvidia?


I just installed 5.8 from buster-backports along with the tesla-440 driver
and it seems to give slightly better results: the modules build and load,
but there are similar errors in the kernel log:



[   15.167039] NVRM: GPU :01:00.0: DMA address not in addressable range of device (0x80020054106-0x80020054106, 0x800-0x8ff)
[   15.167070] NVRM: GPU :01:00.0: DMA address not in addressable range of device (0x80020054106-0x80020054106, 0x800-0x8ff)
[   15.168422] NVRM: GPU :01:00.0: RmInitAdapter failed! (0x31:0x40:937)
[   15.168517] NVRM: GPU :01:00.0: rm_init_adapter failed, device minor number 0
...
[   18.058799] NVRM: GPU :01:00.0: DMA address not in addressable range of device (0x80020053266-0x80020053266, 0x800-0x8ff)
[   18.058816] NVRM: GPU :01:00.0: DMA address not in addressable range of device (0x80020053266-0x80020053266, 0x800-0x8ff)
[   18.059885] NVRM: GPU :01:00.0: RmInitAdapter failed! (0x31:0x40:937)
[   18.059943] NVRM: GPU :01:00.0: rm_init_adapter failed, device minor number 0



As for Xorg.0.log:

...

[    18.046] (II) NVIDIA GLX Module 440.118.02  Thu Sep  3 09:45:23 UTC 2020
[    18.049] (II) NVIDIA: The X server supports PRIME Render Offload.
[    18.158] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA GPU at PCI:1:0:0.  Please
[    18.158] (EE) NVIDIA(GPU-0): check your system's kernel log for additional error
[    18.158] (EE) NVIDIA(GPU-0): messages and refer to Chapter 8: Common Problems in the
[    18.158] (EE) NVIDIA(GPU-0): README for additional information.
[    18.158] (EE) NVIDIA(GPU-0): Failed to initialize the NVIDIA graphics device!

[    18.158] (EE) NVIDIA(0): Failing initialization of X screen
[    18.158] (II) UnloadModule: "nvidia"
[    18.158] (II) UnloadSubModule: "glxserver_nvidia"
[    18.158] (II) Unloading glxserver_nvidia
[    18.158] (II) UnloadSubModule: "wfb"
[    18.158] (II) UnloadSubModule: "fb"
[    18.158] (==) NVIDIA(G0): Depth 24, (==) framebuffer bpp 32
[    18.158] (==) NVIDIA(G0): RGB weight 888
[    18.158] (==) NVIDIA(G0): Default visual is TrueColor
[    18.158] (==) NVIDIA(G0): Using gamma correction (1.0, 1.0, 1.0)
[    18.158] (**) NVIDIA(G0): Enabling 2D acceleration
[    18.158] (EE) NVIDIA(G0): GPU screens are not yet supported by the NVIDIA driver
[    18.158] (EE) NVIDIA(G0): Failing initialization of X screen
[    18.158] (II) UnloadModule: "nvidia"
[    18.158] (II) UnloadSubModule: "wfb"
[    18.158] (II) UnloadSubModule: "fb"
[    18.158] (EE) Screen(s) found, but none have a usable configuration.
[    18.158] (EE)
Fatal server error:
[    18.158] (EE) no screens found(EE)

I don't mind waiting for a newer release, or using an older kernel;
however, any suggestions to solve the problem would be appreciated, even
outside this particular bug report.


Regards

Konstantinos



2020-12-10 Thread Andreas Beckmann
On 12/10/20 10:37 AM, Konstantinos Margaritis wrote:
> On 10/12/20 1:19 AM, Andreas Beckmann wrote:
>> but I'm not sure whether it is worth backporting them,
>> since you most likely will be affected by
>> #973729 - nvidia-uvm does not work with Linux 5.9
>> which is fixed in 455.45.01
> 
> Well, I did the replace you suggested below and even though the modules
> load, I don't get a display, here is what dmesg gives:
> 
> [   15.889326] NVRM: GPU :01:00.0: DMA address not in addressable
> range of device (0x80020054de8-0x80020054de8,
> 0x800-0x8ff)

If it works with Linux 5.8, can you stick to an older kernel (which will
not receive security updates) until there is a newer driver release
available from Nvidia?


Andreas



2020-12-10 Thread Konstantinos Margaritis

On 10/12/20 1:19 AM, Andreas Beckmann wrote:

> but I'm not sure whether it is worth backporting them,
> since you most likely will be affected by
> #973729 - nvidia-uvm does not work with Linux 5.9
> which is fixed in 455.45.01


Well, I did the replacement you suggested below, and even though the
modules load, I don't get a display. Here is what dmesg gives:


[   15.889326] NVRM: GPU :01:00.0: DMA address not in addressable range of device (0x80020054de8-0x80020054de8, 0x800-0x8ff)
[   15.889341] NVRM: GPU :01:00.0: DMA address not in addressable range of device (0x80020054de8-0x80020054de8, 0x800-0x8ff)
[   15.890377] NVRM: GPU :01:00.0: DMA address not in addressable range of device (0x80020054de8-0x80020054de8, 0x800-0x8ff)
[   15.890564] NVRM: GPU :01:00.0: RmInitAdapter failed! (0x24:0x1e:1224)
[   15.890601] NVRM: GPU :01:00.0: rm_init_adapter failed, device minor number 0
[   15.995590] NVRM: GPU 0030:01:00.0: DMA address not in addressable range of device (0x80020054a31-0x80020054a31, 0x800-0x8ff)
[   15.995601] NVRM: GPU 0030:01:00.0: DMA address not in addressable range of device (0x80020054a31-0x80020054a31, 0x800-0x8ff)
[   15.996482] NVRM: GPU 0030:01:00.0: DMA address not in addressable range of device (0x80020054a31-0x80020054a31, 0x800-0x8ff)
[   15.996650] NVRM: GPU 0030:01:00.0: RmInitAdapter failed! (0x24:0x1e:1224)
[   15.996705] NVRM: GPU 0030:01:00.0: rm_init_adapter failed, device minor number 1

[   34.850800] [ cut here ]
[   34.850801] remap_4k_pfn called with wrong pfn value
[   34.850966] WARNING: CPU: 5 PID: 1584 at arch/powerpc/include/asm/book3s/64/hash-64k.h:166 nvidia_mmap_helper+0x6bc/0x800 [nvidia]
[   34.850967] Modules linked in: xt_conntrack(E) 
nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) 
br_netfilter(E) overlay(E) xt_CHECKSUM(E) nft_chain_nat(E) 
xt_MASQUERADE(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) 
nf_defrag_ipv4(E) libcrc32c(E) nft_counter(E) xt_tcpudp(E) nft_compat(E) 
bridge(E) stp(E) llc(E) nf_tables(E) nfnetlink(E) rfkill(E) 
nvidia_drm(POE) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) 
sysimgblt(E) fb_sys_fops(E) nvidia_modeset(POE) nvidia(POE) 
binfmt_misc(E) evdev(E) joydev(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) 
snd_intel_dspcfg(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) 
snd_pcm(E) snd_timer(E) ctr(E) cbc(E) snd(E) vmx_crypto(E) soundcore(E) 
gf128mul(E) ofpart(E) ipmi_powernv(E) powernv_flash(E) ipmi_devintf(E) 
mtd(E) ipmi_msghandler(E) opal_prd(E) at24(E) regmap_i2c(E) 
parport_pc(E) lp(E) drm(E) parport(E) sunrpc(E) fuse(E) configfs(E) 
drm_panel_orientation_quirks(E) ip_tables(E) x_tables(E) autofs4(E) 
ext4(E) crc16(E) mbcache(E) jbd2(E)
[   34.850994]  crc32c_generic(E) ecb(E) aes_generic(E) libaes(E) xts(E) 
hid_generic(E) usbhid(E) hid(E) dm_crypt(E) dm_mod(E) xhci_pci(E) 
xhci_hcd(E) tg3(E) usbcore(E) nvme(E) libphy(E) nvme_core(E) ptp(E) 
pps_core(E) usb_common(E) t10_pi(E) crc_t10dif(E) crct10dif_generic(E) 
crct10dif_common(E)
[   34.851008] CPU: 5 PID: 1584 Comm: Xorg Tainted: P OE 
5.9.0-4-powerpc64le #1 Debian 5.9.11-1
[   34.851009] NIP:  c0080e44c664 LR: c0080e44c660 CTR: 

[   34.851010] REGS: c7493750 TRAP: 0700   Tainted: P   
OE  (5.9.0-4-powerpc64le Debian 5.9.11-1)
[   34.851011] MSR:  90029033   CR: 
2804  XER: 

[   34.851014] CFAR: c01314e4 IRQMASK: 0
   GPR00: c0080e44c660 c74939e0 
c0080f09ec00 0028
   GPR04: 0001 0004 
0027 c005ff6cbf90
   GPR08: 0023 ffd8 
0027 
   GPR12: 2000 c005fffea600 
0001473aaad0 7fffeac1ac14
   GPR16:   
0013 0008
   GPR20:  0001 
1000 00600240
   GPR24: c005f7085e08 c005f7085800 
2000 0003
   GPR28: 00060024 c005f7085800 
c005fa441800 c005f38fbb80
[   34.851105] NIP [c0080e44c664] nvidia_mmap_helper+0x6bc/0x800 [nvidia]
[   34.851187] LR [c0080e44c660] nvidia_mmap_helper+0x6b8/0x800 [nvidia]
[   34.851188] Call Trace:
[   34.851270] [c74939e0] [c0080e44c660] nvidia_mmap_helper+0x6b8/0x800 [nvidia] (unreliable)
[   34.851353] [c7493ac0] [c0080e44c814] nvidia_mmap+0x6c/0xc0 [nvidia]
[   34.851434] [c7493b00] [c0080e4400ec] nvidia_frontend_mmap+0x54/0x80 [nvidia]
[   34.851438] [c7493b20] [c03bf51c] mmap_region+0x4cc/0x840
[   34.851439] [c7493c00] [c03bfcac]

2020-12-09 Thread Andreas Beckmann
On 12/9/20 9:30 AM, Konstantinos Margaritis wrote:
> Hi, I am trying to use a Titan X on a Talos II Power9 system on bullseye
> and the nvidia module fails to compile. Build log attached:

The error:

/var/lib/dkms/nvidia-tesla-450/450.80.02/build/nvidia/nv-pci.c:891:10: warning: 'enum pci_channel_state' declared inside parameter list will not be visible outside of this definition or declaration
/var/lib/dkms/nvidia-tesla-450/450.80.02/build/nvidia/nv-pci.c:891:28: error: parameter 2 ('error') has incomplete type
/var/lib/dkms/nvidia-tesla-450/450.80.02/build/nvidia/nv-pci.c:889:1: error: function declaration isn't a prototype [-Werror=strict-prototypes]

Similar errors occur when building the 418 and 440 tesla driver modules
for Linux 5.9 on ppc64el.

There are changes in 455.45.01-1 to mitigate this kernel change:

# pci_channel_state was removed by commit 16d79cd4e23b ("PCI: Use
# 'pci_channel_state_t' instead of 'enum pci_channel_state'") in
# v5.9-rc1 (2020-07-02).

but I'm not sure whether it is worth backporting them,
since you most likely will be affected by
#973729 - nvidia-uvm does not work with Linux 5.9
which is fixed in 455.45.01

Unfortunately there is no separate 455.xx driver release available
for ppc64el.


Andreas

PS: you could try s/enum pci_channel_state/pci_channel_state_t/g
on the dkms tree

PPS: this is the first time I have heard of someone actually trying to
use the ppc64el packages ;-)
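
The substitution suggested in the PS could be applied to a whole source
tree along the following lines. This is only a minimal sketch, not part
of the packaging: the function name patch_tree and the restriction to
.c and .h files are assumptions made for illustration, and the directory
to patch has to be supplied by the caller (the location of the dkms
source tree on a given system is not stated in this thread).

```python
import re
from pathlib import Path

# Same expression as s/enum pci_channel_state/pci_channel_state_t/g,
# with word boundaries so already-converted names are left alone.
PATTERN = re.compile(r"\benum\s+pci_channel_state\b")
REPLACEMENT = "pci_channel_state_t"


def patch_tree(root):
    """Rewrite .c/.h files under root in place; return the changed paths.

    Hypothetical helper: only files that contain the pattern are rewritten.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in {".c", ".h"}:
            continue
        # surrogateescape keeps any non-UTF-8 bytes intact on write-back.
        text = path.read_text(encoding="utf-8", errors="surrogateescape")
        new_text, count = PATTERN.subn(REPLACEMENT, text)
        if count:
            path.write_text(new_text, encoding="utf-8", errors="surrogateescape")
            changed.append(path)
    return changed
```

Running it twice on the same tree changes nothing the second time, so it
should be safe to re-run before rebuilding the module. As the follow-ups
above show, getting the module to build does not by itself make the
driver work on Linux 5.9.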