On 2021-03-30 14:55 +0200, Christian König wrote: > Am 30.03.21 um 14:04 schrieb Xi Ruoyao: > > On 2021-03-30 03:40 +0800, Xi Ruoyao wrote: > > > On 2021-03-29 21:36 +0200, Christian König wrote: > > > > Am 29.03.21 um 21:27 schrieb Xi Ruoyao: > > > > > Hi Christian, > > > > > > > > > > I don't think there is any constraint implemented to ensure > > > > > `num_entries > > > > > % > > > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in > > > > > `amdgpu_vm_bo_map()`: > > > > > > > > > > /* validate the parameters */ > > > > > if (saddr & AMDGPU_GPU_PAGE_MASK || offset & > > > > > AMDGPU_GPU_PAGE_MASK > > > > > size == 0 || size & AMDGPU_GPU_PAGE_MASK) > > > > > return -EINVAL; > > > > > > > > > > /* snip */ > > > > > > > > > > saddr /= AMDGPU_GPU_PAGE_SIZE; > > > > > eaddr /= AMDGPU_GPU_PAGE_SIZE; > > > > > > > > > > /* snip */ > > > > > > > > > > mapping->start = saddr; > > > > > mapping->last = eaddr; > > > > > > > > > > > > > > > If we really want to ensure (mapping->last - mapping->start + 1) % > > > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace > > > > > "AMDGPU_GPU_PAGE_MASK" > > > > > in "validate the parameters" with "PAGE_MASK". > > It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of > > "AMDGPU_GPU_PAGE_MASK" :(. > > > > > > Yeah, good point. > > > > > > > > > I tried it and it broke userspace: Xorg startup fails with EINVAL with > > > > > this > > > > > change. > > > > Well in theory it is possible that we always fill the GPUVM on a 4k > > > > basis while the native page size of the CPU is larger. Let me double > > > > check the code. > > On my platform, there are two issues both causing the VM corruption. One is > > fixed by agd5f/linux@fe001e7. > > What is fe001e7? A commit id? If yes then that is to short and I can't > find it. > > > Another is in Mesa from userspace: it uses > > `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver > > expects it to use `dev_info->virtual_address_alignment`. > > Mhm, looking at the kernel code I would rather say Mesa is correct and > the kernel driver is broken. > > The gart_page_size is limited by the CPU page size, but the > virtual_address_alignment isn't. > > > If we can make the change to fill the GPUVM on a 4k basis, we can fix this > > issue > > and make virtual_address_alignment = 4K. Otherwise, we should fortify the > > parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then > > the > > userspace will just get an EINVAL, instead of a slient VM corruption. And > > someone should tell Mesa developers to fix the code in this case. > > I rather see this as a kernel bug. Can you test if this code fragment > fixes your issue: > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > index 64beb3399604..e1260b517e1b 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void > *data, struct drm_file *filp) > } > dev_info->virtual_address_alignment = > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); > dev_info->pte_fragment_size = (1 << > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; > - dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; > + dev_info->gart_page_size = > dev_info->virtual_address_alignment; > dev_info->cu_active_number = adev->gfx.cu_info.number; > dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; > dev_info->ce_ram_size = adev->gfx.ce_ram_size;
It works. I've seen it at https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae191 before (with a common sub-expression, though :). -- Xi Ruoyao <xry...@mengyan1223.wang> School of Aerospace Science and Technology, Xidian University