[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

bugzilla-daemon Tue, 04 Sep 2018 01:48:39 -0700

https://bugs.freedesktop.org/show_bug.cgi?id=107670


--- Comment #13 from i...@yahoo.com ---
As I've said, I'm still investigating the issue.
Here are some of the things I've found so far:

1. Slackware32, i586 and glibc.
Slackware tries to support as many machines as possible. Since i586 is still
supported by the kernel, Slackware compiles everything to be able to run on
i586.

The problem is that, for some reason, Glibc compiled for i586 does NOT support
multi-arch. It does not use CPUID (which is available on all i586 and some i486
CPUs) to pick a specific implementation for the running CPU. Glibc supports
multi-arch only for i686 builds.
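
To make clear what "multi-arch" means here, this is a minimal sketch of
picking a copy routine once at startup. It is not how glibc does it
internally; copy_bytewise(), copy_tuned(), select_copy() and the
has_fast_path flag are all hypothetical names, and the flag stands in for
whatever a real CPUID probe would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Baseline variant: plain forward byte copy, safe on any CPU. */
static void *copy_bytewise(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}

/* Stand-in for a CPU-specific variant (for example a SIMD copy).
 * Here it just forwards to memcpy() so the sketch stays portable. */
static void *copy_tuned(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Hypothetical selector: has_fast_path would come from a CPUID probe. */
static copy_fn select_copy(bool has_fast_path)
{
    copy_fn chosen = has_fast_path ? copy_tuned : copy_bytewise;

    assert(chosen != NULL);
    return chosen;
}
```

Without this kind of selection, every machine gets the one routine the
library was built with, which on Slackware32 is the i586 one.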

2. Glibc i586 memcpy()
At first I thought that the problem was writing backwards. It made sense.
I was wrong.

Actually the glibc i586 memcpy() does a forward copy, but it also reads the
_destination_ in order to load the entire cache line (32 bytes) before
overwriting that cache line.
It seems that Pentium 1 processors had no "write store" and did not "write
allocate", so a "write miss" would fall through, i.e. the write would be sent
to system RAM without being cached. So the "optimization" involves manually
loading the cache line first, by explicitly reading it.

Of course, reading from PCI is slow and not cached, and in this exact case it
is also completely unnecessary.
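
A rough C sketch of the access pattern, to show where the extra read comes
from. This is not the glibc code (that is hand-written assembly and also
deals with alignment, which this ignores); copy_touching_dst() is a
hypothetical name and the 32-byte chunk size is just the cache line size
mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 32u  /* Pentium cache line size quoted above */

/* Hypothetical helper: forward copy that reads one byte of each
 * destination chunk before overwriting it, mimicking the "load the
 * cache line first" trick. Alignment handling is left out. */
static void *copy_touching_dst(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t off = 0;

    assert(n == 0 || (dst != NULL && src != NULL));

    while (off < n) {
        size_t chunk = (n - off < CACHE_LINE) ? (n - off) : CACHE_LINE;

        /* The extra read. It does not change the result, but on
         * uncached memory every such read goes out to the bus. */
        volatile unsigned char touch = *(volatile unsigned char *)(d + off);
        (void)touch;

        memcpy(d + off, s + off, chunk);
        off += chunk;
    }
    return dst;
}
```

The result is identical to a plain copy. The only difference is one
destination read per chunk, which is cheap on cached RAM and expensive on
an uncached mapping.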

Here is the source of the memcpy function and the comment explaining the
read-ahead-destination.
https://github.com/lattera/glibc/blob/master/sysdeps/i386/i586/memcpy.S#L70

3. Why the system package had no issue.
Well, it turned out to be quite simple: gcc inlined its own built-in memcpy(),
which was just `rep movsb`. It does not do that if you compile for i486, i686
or newer, or if you touch the compile flags.

4. The Upload Data function.
I added a printf() to the problem function u_upload_data() to check what
parameters its memcpy() gets.

An apitrace file I had called the problem function about 820 times per rendered
frame. Most calls (56%) used the largest size, 3136 bytes; 32% used sizes
around 512 bytes, and the rest were smaller (up to 128 bytes).
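
For a sense of scale, here is a back-of-the-envelope estimate of how many
bytes go through this memcpy() per frame. It is only an upper-ish guess:
it treats "around 512" as exactly 512, counts the remaining 12% (100 - 56 -
32) at the full 128 bytes, and bytes_per_frame() is a hypothetical helper.

```c
#include <assert.h>

/* Rough estimate of bytes passed to memcpy() per frame, using the
 * shares and sizes quoted above. The 12% share is 100 - 56 - 32. */
static unsigned long bytes_per_frame(unsigned long calls)
{
    /* Percent-weighted bytes per call:
     * 56% x 3136 + 32% x 512 + 12% x 128 */
    const unsigned long weighted =
        56UL * 3136UL + 32UL * 512UL + 12UL * 128UL;

    assert(calls <= 20000UL);  /* keeps the product within 32 bits */
    return calls * weighted / 100UL;
}
```

With 820 calls this comes to roughly 1.6 MB per frame, almost all of it
from the 3136-byte uploads.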

I do suspect that the transfer might actually be vram->vram.
For example, in MTRR I have:
    reg01: base=0x0e0000000 ( 3584MB), size=  512MB, count=1: uncachable

Most of the logs look like:
    u_upload_data.memcpy(0xec90cf00, 0xed4e84c0, 544)

If I run the trace with `R600_DEBUG=nodma`, I do get:
    u_upload_data.memcpy(0xeab60900, 0x7cbef570, 3136)

(The "nodma" does not help with the glibc i586 memcpy slowness.)

It looks to me like the data is first moved ram->vram using dma, then
vram->vram using CPU...
There should be a better way to do that.
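
The comparison above can be written as a small range check. This is only a
sketch of the reasoning: the addresses printed by memcpy() are process
virtual addresses while MTRR describes physical ranges, so a match is a
hint and not proof. in_uncachable_window() and check_logged_addresses()
are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Range from the MTRR line above: base 0xe0000000, size 512 MB. */
#define UC_BASE 0xe0000000u
#define UC_SIZE (512u * 1024u * 1024u)

/* Hypothetical helper: true if addr lies in [UC_BASE, UC_BASE + UC_SIZE). */
static bool in_uncachable_window(uint32_t addr)
{
    /* Subtract first so the check cannot overflow at the 4 GB boundary. */
    return addr >= UC_BASE && (addr - UC_BASE) < UC_SIZE;
}

/* Replays the addresses from the two log lines above. */
static void check_logged_addresses(void)
{
    /* Default run: both destination and source fall in the window. */
    assert(in_uncachable_window(0xec90cf00u));
    assert(in_uncachable_window(0xed4e84c0u));

    /* nodma run: destination in the window, source in ordinary RAM. */
    assert(in_uncachable_window(0xeab60900u));
    assert(!in_uncachable_window(0x7cbef570u));
}
```

Read this way, the default run copies between two addresses inside the
uncachable window, while with "nodma" only the destination is inside it.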


My video card is an AMD Radeon HD5670 (Evergreen, Redwood), which uses the r600
driver.
This transfer function is highlighted by Nine. There are others that involve
OpenGL too; I just haven't tracked them down yet.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are the QA Contact for the bug.

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
