Re: Support for amdgpu VM update via CPU on large-bar systems

2017-05-13 Thread Christian König

On 12.05.2017 at 21:25, Felix Kuehling wrote:

On 17-05-12 04:43 AM, Christian König wrote:

On 12.05.2017 at 10:37, zhoucm1 wrote:



If SDMA is faster, then even if they wait for it to finish, the total
time is shorter than with the CPU, isn't it? Of course, the precondition
is that the SDMA is exclusive. They can reserve an SDMA for PT updating.


No, if I understood Felix's numbers correctly, the setup and wait time
for SDMA is a bit (but not much) longer than doing it with the CPU.

I'm skeptical of claims that SDMA is faster. Even when you use SDMA to
write the page table, the CPU still has to do about the same amount of
work writing PTEs into the SDMA IBs. SDMA can only save CPU time in
certain cases:

   * Copying PTEs from the GART table if they are on the same GPU (not
     possible on Vega10 due to different MTYPE bits)
   * Generating PTEs for contiguous VRAM BOs

At least for system memory BOs writing the PTEs directly to
write-combining VRAM should be faster than writing them to cached system
memory IBs first and then kicking off an SDMA transfer and waiting for
completion.
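
To make the "same amount of CPU work" point concrete, here is a minimal sketch of the per-PTE loop the CPU runs in either case. Only the destination differs: a CPU-mapped page table in VRAM for the direct path, or an IB in system memory for the SDMA path. The function name, the parameters and the flag encoding are assumptions made for illustration; this is not the amdgpu code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: write `count` PTEs for a physically contiguous
 * range starting at `base_addr`. `dst` may point at a CPU-mapped page
 * table or at an SDMA IB; the loop is identical either way.
 * OR-ing `flags` into the low bits is an illustrative encoding only.
 */
static void fill_contiguous_ptes(uint64_t *dst, size_t count,
                                 uint64_t base_addr, uint64_t page_size,
                                 uint64_t flags)
{
    assert(dst != NULL);
    assert(page_size != 0);

    for (size_t i = 0; i < count; i++)
        dst[i] = (base_addr + (uint64_t)i * page_size) | flags;
}
```

For a contiguous VRAM BO the SDMA can generate these entries itself from a base address and an increment, which is the case listed above where it saves CPU time.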


That's unfortunately not correct at all.

Nicolai did quite a few measurements on this, and even with WC enabled,
on most systems the SDMA is more efficient than the CPU at transferring
even small amounts of memory over the bus.


And no, we couldn't figure out why; it indeed doesn't make much sense
when WC is enabled.


I think the SDMA is simply optimized for those kinds of transfers, so it
wins even considering the overhead of allocating an IB.


So anything larger than, I would say, 1KB is handled faster when you
write it to system memory and then copy it to VRAM with the SDMA.
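
As a rough sketch of that rule of thumb (the names are hypothetical, and the 1 KB cut-off is just the estimate given above, not a measured constant), the choice of update path could look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum pt_update_path { PT_UPDATE_CPU, PT_UPDATE_SDMA };

/* Assumed cut-off: the "I would say 1KB" estimate from this mail. */
#define PT_UPDATE_SDMA_THRESHOLD 1024u

/* Pick the update path from the size of the PTE update in bytes. */
static enum pt_update_path choose_pt_update_path(size_t num_ptes,
                                                 size_t pte_size)
{
    assert(pte_size > 0);

    size_t bytes = num_ptes * pte_size;

    /* Larger than the threshold: stage in system memory, copy with SDMA. */
    return bytes > PT_UPDATE_SDMA_THRESHOLD ? PT_UPDATE_SDMA
                                            : PT_UPDATE_CPU;
}
```

With 8-byte PTEs this sends anything above 128 entries to the SDMA; the real crossover point would have to come from measurements on the target system.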


Regards,
Christian.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH libdrm] amdgpu: add missing extern "C" headers

2017-05-13 Thread Nicolai Hähnle
From: Nicolai Hähnle 

Signed-off-by: Nicolai Hähnle 
---
 amdgpu/amdgpu.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/amdgpu/amdgpu.h b/amdgpu/amdgpu.h
index fdea905..1901fa8 100644
--- a/amdgpu/amdgpu.h
+++ b/amdgpu/amdgpu.h
@@ -30,20 +30,24 @@
  * User wanted to use libdrm_amdgpu functionality must include
  * this file.
  *
  */
 #ifndef _AMDGPU_H_
 #define _AMDGPU_H_
 
 #include 
 #include 
 
+#ifdef __cplusplus
+extern "C" {
+#endif
+
 struct drm_amdgpu_info_hw_ip;
 
 /*--*/
 /* --- Defines  */
 /*--*/
 
 /**
  * Define max. number of Command Buffers (IB) which could be sent to the single
  * hardware IP to accommodate CE/DE requirements
  *
@@ -1317,11 +1321,15 @@ int amdgpu_cs_destroy_semaphore(amdgpu_semaphore_handle sem);
 /**
  *  Get the ASIC marketing name
  *
  * \param   dev - \c [in] Device handle. See #amdgpu_device_initialize()
  *
  * \return  the constant string of the marketing name
  *  "NULL" means the ASIC is not found
 */
 const char *amdgpu_get_marketing_name(amdgpu_device_handle dev);
 
+#ifdef __cplusplus
+}
+#endif
+
 #endif /* #ifdef _AMDGPU_H_ */
-- 
2.9.3



Re: Plan: BO move throttling for visible VRAM evictions

2017-05-13 Thread Marek Olšák
On Mon, Apr 17, 2017 at 11:55 AM, Michel Dänzer  wrote:
> On 17/04/17 07:58 AM, Marek Olšák wrote:
>> On Fri, Apr 14, 2017 at 12:14 PM, Michel Dänzer  wrote:
>>> On 04/04/17 05:11 AM, Marek Olšák wrote:
>>>> On Fri, Mar 31, 2017 at 5:24 AM, Michel Dänzer  wrote:
>>>>> On 30/03/17 07:03 PM, Michel Dänzer wrote:
>>>>>> On 25/03/17 01:33 AM, Marek Olšák wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm sharing this idea here, because it's something that has been
>>>>>>> decreasing our performance a lot recently, for example:
>>>>>>> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
>>>>>>
>>>>>> The attached proof-of-concept patch (on top of Christian's "CPU mapping
>>>>>> of split VRAM buffers" series, ported from radeon) results in 145.05 fps
>>>>>> on my Tonga.
>>>>>
>>>>> I get the same result without my or Christian's patches though, with
>>>>> 4.11 based DRM or amd-staging-4.9. So I guess I just can't reproduce the
>>>>> problem with this test. Are there any other tests for it?
>>>>
>>>> It's random. Sometimes the benchmark runs OK, other times it's slow.
>>>> You can easily see the difference by observing how smooth it is. The
>>>> visible VRAM evictions result in constant 100-200ms stalls but not
>>>> every frame, which feels like the frame rate is much lower than it
>>>> actually is.
>>>>
>>>> Make sure your graphics details are maxed out. The best score I can
>>>> get with my rig is 70 fps. (Fiji & Core i5 3570)
>>>
>>> I'm getting around 53-54 fps at Ultra with Tonga, both with Mesa 13.0.6
>>> and Git.
>>>
>>> Have you tried if Christian's patches for CPU access to split VRAM
>>> buffers help? I can imagine that forcing contiguous VRAM buffers for CPU
>>> access could cause lots of other BOs to be unnecessarily evicted from
>>> VRAM, if at least one of their fragments happens to be in the CPU
>>> visible part of VRAM.
>>
>> I've finally tested the latest amd-staging-4.9 and I'm very pleased. For
>> the first time, the Deus Ex benchmark has almost no hiccups. I've
>> never seen it so smooth. At one point, the BO move rate increased
>> to 200MB/s, stayed there for a couple of seconds, and then dropped
>> to 0 again. The frame rate was OK-ish, so I guess the moves didn't
>> happen all at once. I also tested DiRT Rally and I haven't been able
>> to reproduce the low FPS with the consistently-high BO move rate that
>> I saw several months ago.
>>
>> We could do some move throttling there for sure, but it's much better
>> than it ever was.
>
> That's great to hear. If you get a chance, it would be interesting if
> the attached updated patch improves things even more for you. (The patch
> I attached previously couldn't work as intended, this one at least might :)

Frogging101 on IRC noticed that we get a ton of TTM BO moves due to
visible VRAM thrashing and Michel's patch doesn't help. His kernel is
up to date with amd-staging. It looks like the only option left is my
original plan: BO move throttling for visible VRAM by redirecting
mapped buffers to GTT and not allowing them to go back to VRAM if some
counter is too high.
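
To make the proposal easier to discuss, here is a minimal sketch of the policy, assuming a per-window byte counter as the "counter" mentioned above. All names, the window reset and the budget are hypothetical; none of this is existing amdgpu or TTM code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placement domains, for illustration only. */
enum bo_domain { BO_DOMAIN_VRAM, BO_DOMAIN_GTT };

/* Hypothetical throttle state for visible-VRAM BO moves. */
struct move_throttle {
    uint64_t bytes_moved; /* bytes moved in the current window */
    uint64_t max_bytes;   /* assumed per-window budget */
};

/* Account for a BO move of `bytes` in the current window. */
static void move_throttle_account(struct move_throttle *t, uint64_t bytes)
{
    assert(t != NULL);
    t->bytes_moved += bytes;
}

/* Start a new accounting window (e.g. once per frame or per second). */
static void move_throttle_new_window(struct move_throttle *t)
{
    assert(t != NULL);
    t->bytes_moved = 0;
}

/*
 * Decide where a BO should go. CPU-mapped BOs that would prefer VRAM
 * are redirected to GTT, and kept there, while the counter is over
 * budget. Everything else keeps its preferred domain.
 */
static enum bo_domain place_bo(const struct move_throttle *t,
                               bool cpu_mapped, enum bo_domain preferred)
{
    assert(t != NULL);

    if (cpu_mapped && preferred == BO_DOMAIN_VRAM &&
        t->bytes_moved > t->max_bytes)
        return BO_DOMAIN_GTT;

    return preferred;
}
```

Unmapped BOs are unaffected in this sketch; only buffers that need CPU access stop competing for visible VRAM once the move rate is too high.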

Opinions?

Marek