Ok, I've figured out what is going on here. The problem was that we tried to clear VM BOs with the standard background clear function.

Because of this you needed not only one, but two fence slots for newly allocated page directories/tables.

But I've already fixed this bug quite a while ago, make sure you have this patch in your branch:
commit 13307f7e1d0c05a68f4ba19193cbd213573a8680
Author: Christian König <christian.koe...@amd.com>
Date:   Wed Jan 24 17:19:04 2018 +0100

    drm/amdgpu: revert "drm/amdgpu: use AMDGPU_GEM_CREATE_VRAM_CLEARED for VM PD/PTs" v2

    Using the standard clear turned out to be too inflexible.

    First of all it is executed on the system queue, together with buffer
    moves, instead of on the per-VM queue.

    And second we need to fill in the page tables with more than just zero.

    We keep the new functionality of initializing the PDEs/PTEs with ATC
    routing entries intact.

    v2: update commit message.

    Signed-off-by: Christian König <christian.koe...@amd.com>
    Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>

And also this follow up commit:
commit 29e8357b4cbbfcee6d375f2d183b674b678923d7
Author: Christian König <christian.koe...@amd.com>
Date:   Sun Feb 4 19:36:52 2018 +0100

    drm/amdgpu: sync the VM PD/PT before clearing it

    Otherwise we might overwrite stuff which is still in use.

    Signed-off-by: Christian König <christian.koe...@amd.com>
    Reviewed-by: Felix Kuehling <felix.kuehl...@amd.com>
    Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>

Regards,
Christian.

On 27.02.2018 10:17, Liu, Monk wrote:
I know that in theory this issue should be the same whether under SRIOV or not.
But since GPU performance under SRIOV is only one quarter of bare-metal
(assuming 4 VFs), and CPU performance is also limited, there is a chance that
the fence is not signaled as quickly as on bare-metal, leading to the scenario
where fences from four different contexts are living in the resv.

If you try it on bare-metal you may need some tricks; at least on my SRIOV
platform I can reproduce this BUG() every time with 4 repeated loops of
VK_EXAMPLE.

/Monk

-----Original Message-----
From: Koenig, Christian
Sent: February 27, 2018 16:34
To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org; Chris Wilson 
<ch...@chris-wilson.co.uk>
Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug

On 27.02.2018 04:36, Liu, Monk wrote:
Well then there is something else broken and I actually think that this is the 
root cause here.
What if this VM submits four different jobs on four different rings,
e.g. gfx/compute1/compute2/vce? Then the fences for those jobs are from
different contexts, and now you get a mapping on this VM, so the shared
count is now 4.
Yeah, but each VM submission reserves a shared slot before adding the fence. So 
even if you have 4 submissions we still have a guaranteed free slot.

I ran VK_EXAMPLE and it can always hit this reservation BUG(): just run it
repeatedly for 4 loops and the BUG() hits. But I did it under SRIOV, so the
CPU and GPU performance is different from bare-metal.

Anyway, do you think the above scenario exists?
Well you somehow run into this problem, I just don't understand how.

It shouldn't matter if you run on bare-metal or SRIOV, the VM handling is the 
same as far as I know.

Going to give it a try today to create some unit tests exercising the problem.

Christian.


-----Original Message-----
From: Koenig, Christian
Sent: February 26, 2018 19:19
To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org; Chris
Wilson <ch...@chris-wilson.co.uk>
Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared
count bug

What if all those contexts are not equal, and finally accumulate to 4?
Well then there is something else broken and I actually think that this is the 
root cause here.

You can try the VK_EXAMPLE test yourself and see if you can hit that BUG()
in the reservation.c file.
Could you generate a simpler unit test to trigger the problem?


Thanks,
Christian.

On 26.02.2018 11:57, Liu, Monk wrote:
So we call reserve_shared only once and then call 
reservation_object_add_shared_fence() multiple times, always with the same 
context.
reservation_object_add_shared_fence() should replace the fence when it has
the same context as a previously added fence.
What if all those contexts are not equal, and finally accumulate to 4?
You can try the VK_EXAMPLE test yourself and see if you can hit that BUG()
in the reservation.c file. I hit it, and it is fixed by this patch; I don't
think there is another reason.

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On
Behalf Of Christian König
Sent: February 26, 2018 18:45
To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org;
Chris Wilson <ch...@chris-wilson.co.uk>
Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared
count bug

We must call reserve_shared before amdgpu_bo_fence....
Actually that's not correct. See reservation_object_add_shared_fence(): it
should replace the fence when it has the same context as a previously added
fence.

So we call reserve_shared only once and then call
reservation_object_add_shared_fence() multiple times, always with the same 
context.

Writing this, Chris actually had patches which may have broken that assumption, 
but as far as I know those aren't merged yet.

Regards,
Christian.

On 26.02.2018 11:36, Liu, Monk wrote:
amdgpu_vm_get_pd_bo() should already reserve a shared slot for the fence:
            entry->tv.shared = true;
That's not related to the bug I fixed! The root cause is that there is some
sequence where, when the code path comes to amdgpu_bo_fence() in
vm_update_directories(), the shared_cnt is by coincidence already at the max
value of the resv obj, so the BUG() in reservation.c will hit.

We must call reserve_shared before amdgpu_bo_fence....



-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com]
Sent: February 26, 2018 17:48
To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared
count bug

That fix is incorrect.

amdgpu_vm_get_pd_bo() should already reserve a shared slot for the fence:
            entry->tv.shared = true;
So you must run into this issue because of something else.

Regards,
Christian.

On 26.02.2018 06:35, Monk Liu wrote:
We should call reservation_object_reserve_shared() before
amdgpu_bo_fence(); otherwise there are odds that the kernel hits the BUG()
in reservation.c.

bug:
[12622.076435] ------------[ cut here ]------------ [12622.076438]
kernel BUG at drivers/dma-buf/reservation.c:233!
[12622.078046] invalid opcode: 0000 [#1] SMP KASAN [12622.078345]
Modules linked in: amdgpu(E) chash(E) gpu_sched(E) ttm(E) drm_kms_helper(E) 
drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) 
sysimgblt(E) amdkfd(E) amd_iommu_v2(E) snd_hda_codec_generic(E) 
snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) 
crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) 
snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) aesni_intel(E) 
snd_seq_device(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) 
snd_timer(E) serio_raw(E) snd(E) soundcore(E) i2c_piix4(E) mac_hid(E) 
binfmt_misc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) 
parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) 8139too(E) psmouse(E) 
floppy(E) 8139cp(E) mii(E) pata_acpi(E)
[12622.082664] CPU: 2 PID: 6075 Comm: ShaderImageLoad Tainted: G            E   
4.13.0-feb15-2018-guest-12 #8
[12622.083278] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [12622.083880] task:
ffff8801314f8000 task.stack: ffff8801e6df0000 [12622.084267] RIP:
0010:reservation_object_add_shared_fence+0x2bc/0x2c0
[12622.084680] RSP: 0018:ffff8801e6df7ae8 EFLAGS: 00010246
[12622.085016] RAX: 0000000000000004 RBX: ffff8801f2542a80 RCX:
ffff8801f2542b58 [12622.085471] RDX: ffff880044791428 RSI:
ffff8801e2657d20 RDI: ffff880044791428 [12622.085995] RBP:
ffff8801e6df7b28 R08: 0000000000000000 R09: 0000000000000000
[12622.086451] R10: 000000009fe2fe07 R11: 0000000000000002 R12:
ffff880044791200 [12622.086905] R13: ffff8801e1da6850 R14:
0000000000000000 R15: ffff8801e9303000 [12622.087359] FS:
00007f70109ff700(0000) GS:ffff8801f7500000(0000)
knlGS:0000000000000000 [12622.087874] CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033 [12622.088242] CR2: 00007f6fe1a49000 CR3: 00000001e8377000 
CR4: 00000000001406e0 [12622.088698] Call Trace:
[12622.088934]  amdgpu_bo_fence+0x25/0x30 [amdgpu] [12622.089271]
amdgpu_vm_update_directories+0x3d7/0x450 [amdgpu] [12622.089721]  ?
amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu] [12622.090184]
amdgpu_gem_va_ioctl+0x409/0x4f0 [amdgpu] [12622.090569]  ?
amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu] [12622.090959]
drm_ioctl_kernel+0x6a/0xb0 [drm] [12622.091247]  ?
kasan_check_write+0x14/0x20 [12622.091517]  drm_ioctl+0x2ea/0x3b0
[drm] [12622.091805]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0
[amdgpu] [12622.092212]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu]
[12622.092519]
do_vfs_ioctl+0x96/0x5b0 [12622.092754]  ?
kvm_sched_clock_read+0x1e/0x30 [12622.093036]  ?
kvm_sched_clock_read+0x1e/0x30 [12622.093322]  ?
sched_clock+0x9/0x10 [12622.093550]  SyS_ioctl+0x79/0x90 [12622.093812]  ?
vtime_user_exit+0x29/0x70 [12622.094108]  do_syscall_64+0x6e/0x160
[12622.094350]  entry_SYSCALL64_slow_path+0x25/0x25
[12622.094651] RIP: 0033:0x7f7012ebff07 [12622.094884] RSP:
002b:00007f70109fdae8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[12622.095369] RAX: ffffffffffffffda RBX: 00007f6fe100fe30 RCX:
00007f7012ebff07 [12622.095827] RDX: 00007f70109fdbb0 RSI:
00000000c0286448 RDI: 0000000000000004 [12622.096283] RBP:
00007f70109fdb20 R08: 0000000332000000 R09: 000000000000000e
[12622.096739] R10: 0000000002e2e628 R11: 0000000000000206 R12:
0000000002ebb8f0 [12622.097195] R13: 00007f6fe100fe30 R14:
0000000002fe2bf0 R15: 0000000000000000 [12622.097676] Code: 2b e6
ff ff 41 8b 4f 10 48 8b 75 c8 48 8b 55 d0 e9 06 ff ff ff 49 8d 44
cf 10
45 31 ed 48 89 70 08 41 83 47 10 01 e9 12 ff ff ff <0f> 0b 90 90 0f
1f
44 00 00 55 31 c0 48 81 7f 08 00 df a9 ac 48 [12622.098919] RIP:
reservation_object_add_shared_fence+0x2bc/0x2c0 RSP:
ffff8801e6df7ae8 [12622.099527] ---[ end trace 8235addd8e22e444
]---

Change-Id: I1de84f60c75a0d78b0cad244b8e2bff02aced9fa
Signed-off-by: Monk Liu <monk....@amd.com>
---
      drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++++
      1 file changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 0b237e0..e6f159f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -989,6 +989,11 @@ int amdgpu_vm_update_directories(struct amdgpu_device 
*adev,
                amdgpu_ring_pad_ib(ring, params.ib);
                amdgpu_sync_resv(adev, &job->sync, root->tbo.resv,
                                 AMDGPU_FENCE_OWNER_VM, false);
+
+               r = reservation_object_reserve_shared(root->tbo.resv);
+               if (r)
+                       goto error;
+
                WARN_ON(params.ib->length_dw > ndw);
                r = amdgpu_job_submit(job, ring, &vm->entity,
                                      AMDGPU_FENCE_OWNER_VM, &fence);
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

