[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-09-05 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

Timothy Arceri  changed:

       What |Removed  |Added
----------------------------------
     Status |NEEDINFO |RESOLVED
 Resolution |---      |NOTOURBUG

--- Comment #21 from Timothy Arceri  ---
(In reply to iive from comment #13)
> Slackware32, i586 and glibc. 
> Slackware tries to support as many machines as possible, since i586 is still
> supported by the kernel, Slackware compiles everything to be able to run on
> i586.
> 
> The problem is that for some reason Glibc compiled for i586 does NOT support
> multi-arch. It does not use CPUID (that is available on all i586 and some
> i486) to pick specific version for the running CPU. Glibc supports
> multi-arch only for i686 builds.
> 

I'm all for allowing old hardware to continue to be used but if you want
performance you should pick a distro that targets "modern" hardware. 

Alternatively file a bug against / submit a patch for Glibc.

Given this and comment 20 I'm going to close this as not our bug.

-- 
You are receiving this mail because:
You are the QA Contact for the bug.
You are the assignee for the bug.
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-09-05 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #20 from Axel Davy  ---
To clarify what I said, based on our source code and the calls made by the game
trace, the only "upload" that could occur every frame is buffer data upload.

The game uses two types of d3d vertex buffers.
. One is mapped to the gallium STREAM pool, thus to GTT WC. For this d3d
buffer, the game uses unsynchronized writes. There is no memcpy on mesa side.

. One is stored into a ram buffer, linked to a buffer in the gallium DEFAULT
pool. The application writes to the ram buffer, and when required, nine uploads
dirty locations to the gpu buffer with the buffer_subdata call.
si_buffer_subdata seems to map the buffer, and memcpy the data.

If as you say VRAM mapping is write-combined, then we don't need to look
further. The slowdown comes from the distro memcpy, which will read the VRAM
content on the buffer_subdata call.
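
As a rough illustration of the second buffer type described above (a RAM copy whose dirty range is uploaded on demand), here is a minimal C sketch. All names in it (shadow_buffer, shadow_write, shadow_upload) are hypothetical and are not taken from nine or radeonsi; gpu_map merely stands in for the mapped GPU buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a CPU-side shadow buffer with one dirty range.
 * None of these names come from nine or radeonsi. */
struct shadow_buffer {
    unsigned char *ram;      /* system-memory copy the application writes to */
    unsigned char *gpu_map;  /* stand-in for the mapped GPU buffer */
    size_t size;
    size_t dirty_start;      /* dirty range is [dirty_start, dirty_end) */
    size_t dirty_end;        /* dirty_start == dirty_end means clean */
};

/* Application write: update the RAM copy and grow the dirty range. */
static void shadow_write(struct shadow_buffer *b, size_t off,
                         const void *data, size_t len)
{
    memcpy(b->ram + off, data, len);
    if (b->dirty_start == b->dirty_end) {
        b->dirty_start = off;
        b->dirty_end = off + len;
    } else {
        if (off < b->dirty_start) b->dirty_start = off;
        if (off + len > b->dirty_end) b->dirty_end = off + len;
    }
}

/* Upload: copy only the dirty bytes into the mapping; returns bytes copied.
 * This is the memcpy whose implementation matters when gpu_map is a
 * write-combined mapping. */
static size_t shadow_upload(struct shadow_buffer *b)
{
    size_t len = b->dirty_end - b->dirty_start;
    if (len)
        memcpy(b->gpu_map + b->dirty_start, b->ram + b->dirty_start, len);
    b->dirty_start = b->dirty_end = 0;
    return len;
}
```

In this model the application only ever touches the RAM copy, so the single memcpy in shadow_upload() is the only place where the CPU accesses the mapping.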



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-09-05 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #19 from Michel Dänzer  ---
Axel, I'm not sure what you're saying. Anyway, if the problem was that the
source of the memcpy is uncacheable, surely it would always be slow, regardless
of which memcpy implementation is used?


> So while it lists VRAM for the location, I'm not sure how the flag is used
> and if it affects the mapping.

RADEON_FLAG_GTT_WC means a write-combined CPU mapping will be used while the
buffer object resides in the GTT domain. CPU mappings of VRAM are always
write-combined.



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-09-05 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #18 from Axel Davy  ---
I double-checked that it is indeed likely to be a GTT WC read issue by looking
at the mentioned trace. Some vertex buffers are in GTT WC (but with no memcpy
inside mesa) and some buffers are in VRAM, with the content being filled by
nine with buffer_subdata, which does a memcpy inside radeonsi (it maps the
buffer then does memcpy).

That said, the elements of the default pool are allocated with:
res->domains = RADEON_DOMAIN_VRAM;
res->flags |= RADEON_FLAG_GTT_WC;

So while it lists VRAM for the location, I'm not sure how the flag is used and
if it affects the mapping.



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-09-04 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #17 from i...@yahoo.com ---
(In reply to Michel Dänzer from comment #16)
> (In reply to iive from comment #15)
> > Aka, I do expect that the whole 512MB buffer is mapped at once.
> 
> It's not (if it was, one process could access the buffer object memory of
> another process, bypassing process separation), TTM maps the memory of each
> buffer object into userspace individually. 
> 
> The whole MTRR thing is irrelevant anyway due to PAT.
> 
> You've found the problem in glibc's memcpy() reading from the destination,
> no need to look any further.

The physical and effective addresses could be mapped 1:1, while each process
loads only the pages that belong to it. Meaning that pages owned by other
processes would simply remain unloaded in the current one.

Anyway,
This does not answer my question of how to (dis)prove that the memcpy does or
does not do vmem->vmem.

It is relevant, as one way to fix this issue is to NOT use memcpy() for
transfer, if DMA is already employed.

Axel Davy promised to take a look at that one, as it is related to Nine.



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-09-04 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #16 from Michel Dänzer  ---
(In reply to iive from comment #15)
> Aka, I do expect that the whole 512MB buffer is mapped at once.

It's not (if it was, one process could access the buffer object memory of
another process, bypassing process separation), TTM maps the memory of each
buffer object into userspace individually. 


The whole MTRR thing is irrelevant anyway due to PAT.


You've found the problem in glibc's memcpy() reading from the destination, no
need to look any further.



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-09-04 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #15 from i...@yahoo.com ---
(In reply to Michel Dänzer from comment #14)
> (In reply to iive from comment #13)
> > It looks to me like the data is first moved ram->vram using dma, then
> > vram->vram using CPU...
> 
> No. u_upload_data is for copying data from normal system memory into GPU
> accessible memory. (You're comparing physical and virtual memory addresses,
> AKA apples and oranges :)

Not really.
It is customary for the Linux kernel to repeat the physical addresses as
effective mappings, since it simplifies a number of things. Also, this is not
system memory that could be freed and reused in any order at the kernel's
discretion. It is a frame buffer that is mapped from another device, and the
relative addressing should be preserved as much as possible.

Aka, I do expect that the whole 512MB buffer is mapped at once.
So if one of these addresses is in vram, then the other should be too.

You could probably help me (dis)prove this, by telling me how to obtain the
effective address of the frame-buffer. Xorg.0.log lists only:
[   114.494] (--) PCI:*(1@0:0:0) 1002:68d8:1458:21d9 rev 0, Mem @
0xe000/268435456, 0xfbdc/131072, I/O @ 0xce00/256, BIOS @
0x/131072


I do understand that u_upload_data() is for copying data from normal system
memory into GPU accessible memory, so copying vram->vram should be some kind of
bug.



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-09-04 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #14 from Michel Dänzer  ---
(In reply to iive from comment #13)
> Of course, reading from PCI is slow, not cached; and in this exact case also
> completely unnecessary. 

Right, reading from uncacheable memory can certainly explain the slowness.


> It looks to me like the data is first moved ram->vram using dma, then
> vram->vram using CPU...

No. u_upload_data is for copying data from normal system memory into GPU
accessible memory. (You're comparing physical and virtual memory addresses, AKA
apples and oranges :)



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-09-04 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #13 from i...@yahoo.com ---
As I've said, I'm still investigating the issue.
Here are some of the things I've found so far:

1. Slackware32, i586 and glibc. 
Slackware tries to support as many machines as possible, since i586 is still
supported by the kernel, Slackware compiles everything to be able to run on
i586.

The problem is that for some reason Glibc compiled for i586 does NOT support
multi-arch. It does not use CPUID (that is available on all i586 and some i486)
to pick specific version for the running CPU. Glibc supports multi-arch only
for i686 builds.

2. Glibc i586 memcpy()
At first I thought that the problem is writing backwards. It made sense.
I was wrong.

Actually the glibc.i586 memcpy() does a forward copy, but it also reads the
_destination_ in order to load the entire cache line (32 bytes) before
overwriting that cache line.
It seems that Pentium1 processors had no "write store" and did not "write
allocate", so a "write miss" would fall through, i.e. writes would be sent to
the system RAM without being cached. So the "optimization" involves manually
loading the cache line first, by explicitly reading it.

Of course, reading from PCI is slow, not cached; and in this exact case also
completely unnecessary. 

Here is the source of the memcpy function and the comment explaining the
read-ahead-destination.
https://github.com/lattera/glibc/blob/master/sysdeps/i386/i586/memcpy.S#L70

3. Why the system package had no issue.
Well, it turned out quite simple - gcc inlined its own built-in memcpy(), that
was just `rep movsb`. It does not do it if you compile for i486, i686 or newer;
or if you touch the compile flags.

4. The Upload Data function.
I did add a printf() to the problem function u_upload_data() to check what
parameters its memcpy() gets.

An apitrace file I had called the problem function about 820 times per rendered
frame. Most of the time (56%) with biggest size 3136 bytes, then 32% with sizes
around 512 and the rest were smaller (up to 128).

I do suspect that the transfer might actually be vram->vram.
e.g. 
In MTRR I have:
reg01: base=0x0e000 ( 3584MB), size=  512MB, count=1: uncachable

Most of the logs look like:
u_upload_data.memcpy(0xec90cf00, 0xed4e84c0, 544)

If I run the trace with `R600_DEBUG=nodma`, I do get:
u_upload_data.memcpy(0xeab60900, 0x7cbef570, 3136)

(The "nodma" does not help with the glibc i586 memcpy slowness.)

It looks to me like the data is first moved ram->vram using dma, then
vram->vram using CPU...
There should be a better way to do that.


My video card is AMD Radeon HD5670 Evergreen Redwood, that uses the r600
driver.
This transfer function is highlighted by Nine. There are others that involve
OpenGL too, I just haven't tracked them down yet.
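
To make point 2 concrete, here is a minimal C sketch of a forward copy that reads the destination before writing it. It is only an illustration of the access pattern, not the glibc code (which is assembly and works on aligned lines); the function name and the read counter are made up for this example.

```c
#include <stddef.h>

/* Illustrative only: a forward byte copy that first reads one byte of each
 * 32-byte destination chunk, mimicking the cache-line preload described in
 * point 2. The return value counts how often the destination was read. */
#define LINE 32

static size_t copy_with_dst_preload(unsigned char *dst,
                                    const unsigned char *src, size_t n)
{
    size_t dst_reads = 0;
    volatile unsigned char sink = 0;

    for (size_t i = 0; i < n; i += LINE) {
        /* Read from the destination to pull the line into the cache. */
        sink = ((volatile unsigned char *)dst)[i];
        dst_reads++;

        size_t end = (i + LINE < n) ? i + LINE : n;
        for (size_t j = i; j < end; j++)
            dst[j] = src[j];
    }
    (void)sink;
    return dst_reads;
}
```

On cacheable system RAM the extra read is cheap. If dst is a mapping of memory behind the PCI bus, every one of those reads has to be served by the device; for a 3136-byte upload that is 98 destination reads per call.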



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-09-03 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #12 from Emil Velikov  ---
Why are we even discussing a potential optimisation where the user is
_unknown_?
It contradicts the principles that we've been using in Mesa for years.



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-09-03 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #11 from Eero Tamminen  ---
Libc memcpy() obviously won't be optimized for PCI bus transfers; that is far
too rare a use case for it.

E.g. libpciaccess would seem more suitable place for PCI bus transfer optimized
memory copy function, but unfortunately it doesn't (currently) provide an API
for that.


(In reply to Emil Velikov from comment #10)
> If memcpy shows so prominently in perf, we should look why we're using it so
> often. Polishing the memcpy implementation is putting a band-aid instead of
> fixing the actual problem.

I.e. are the uploads triggered by something in driver, rather than application
itself directly doing it?

"valgrind --tool=callgrind <program>" would output callgraph info with call
counts etc, which can be viewed in kcachegrind.



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-09-03 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #10 from Emil Velikov  ---
My personal train of thought:

Details such as WC are left to the kernel module. Even in the case where
userspace can provide hints, it's ultimately up to the kernel to manage it.

Optimising w/o saying the benchmark/game name is _seriously_ moot.
Furthermore, doing benchmarks on a i586 build is also fairly moot.

You are correct though - _if_ glibc decides to change things perf. _may_ drop.

If memcpy shows so prominently in perf, we should look why we're using it so
often. Polishing the memcpy implementation is putting a band-aid instead of
fixing the actual problem.

Again, that's my personal take. Feel free to ignore.



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-08-31 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #9 from i...@yahoo.com ---
(In reply to Timothy Arceri from comment #8)
> Using SSE2 memcpy seems to avoid this problem"
> 
> Glibc should select the SSE2 (or better) version of memcpy. If Slackware
> doesn't ship SSE2 support for glibc I don't see how this is Mesa's fault.
> 
> If I'm misunderstanding something please clarify. Otherwise I'm inclined to
> close this as won't fix.

Please,
I'm not done investigating this bug.
I also intend to write some patches for it.

1.
The glibc memcpy() is optimized for system->system memory transfer. While it
might be faster than the problematic one, it still may not be the optimal one.

Also, nothing guarantees that glibc memcpy() will continue to work properly in
the future. That's why it is a good idea for Mesa to have its own
implementation that is known to always do the right thing when doing sys->vid
memory transfers.

I can write the x86(_64) MMX/AVX assembly; I've written SIMD before.
Finding all the functions that have to use it might be trickier and needs
help from experts.
(The memcpy I've reported is mostly used by Nine, but I'm getting the same
problem with other memcpy()s when using OpenGL.)
---
2.
Another issue that has to be checked is related to Write Combine caching.

In the past the XFree86 DDX driver was setting video memory region caching
through MTRR registers. That was removed in favor of using PAT (Page Attribute
Table, aka setting caching per memory page).

I have asked developers where the PAT handling code is. Is it in the kernel
kms, libdrm or Mesa3D itself? Where exactly? How do I check the caching status?

So far nobody was brave enough to answer. And if nobody has checked that code
recently, it might have silently stopped working some time ago.

(One reason why SSE2 code might be working better is that it usually employs
MOVNTQ. That instruction forces WC to avoid cache pollution.)


I want Mesa3D to always be fast. So help me help you.
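
To illustrate the non-temporal store remark above, here is a hedged C sketch. The helper name copy_streaming() is hypothetical and is not a Mesa function. It uses the SSE2 intrinsic _mm_stream_si128(), which compiles to MOVNTDQ (the 128-bit counterpart of MOVNTQ), and falls back to plain memcpy() when SSE2 is not enabled at compile time or the destination is not 16-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper, not a Mesa function: forward copy that uses
 * non-temporal 16-byte stores when SSE2 is available and dst is 16-byte
 * aligned; otherwise it falls back to memcpy(). */
static void copy_streaming(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    if (((uintptr_t)dst & 15) == 0) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        size_t blocks = n / 16;

        for (size_t i = 0; i < blocks; i++) {
            /* Unaligned load from the source, streaming store to dst. */
            __m128i v = _mm_loadu_si128((const __m128i *)(s + 16 * i));
            _mm_stream_si128((__m128i *)(d + 16 * i), v);
        }
        _mm_sfence();  /* order the streamed stores before later stores */
        memcpy(d + 16 * blocks, s + 16 * blocks, n % 16);  /* tail bytes */
        return;
    }
#endif
    memcpy(dst, src, n);
}
```

The loop only ever writes to the destination and walks it in ascending order. On an i486 or i586 build __SSE2__ is not defined, so the sketch degrades to whatever memcpy() the C library provides.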



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-08-30 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

Timothy Arceri  changed:

   What |Removed |Added
-----------------------------
 Status |NEW     |NEEDINFO

--- Comment #8 from Timothy Arceri  ---
(In reply to Eero Tamminen from comment #6)
> (In reply to Timothy Arceri from comment #1)
> > There already is asm optimized version of memcpy() in glibc. Why would we
> > want to reinvent that in Mesa?
> > 
> > glibc should pick the right implementation for your system.
> 
> How would memcpy() know that the destination is mapped to PCI-E address
> space i.e. gets transparently transferred over the PCI-E bus (which has its
> own performance constraints)?

"The slowdown could be observed if non-SIMD version of the glibc-2.27 function
is used (like the one that comes with the 32 bit Slackware-current). 



Using SSE2 memcpy seems to avoid this problem"

Glibc should select the SSE2 (or better) version of memcpy. If Slackware doesn't
ship SSE2 support for glibc I don't see how this is Mesa's fault.

If I'm misunderstanding something please clarify. Otherwise I'm inclined to
close this as won't fix.



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-08-24 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #7 from i...@yahoo.com ---
(In reply to Grazvydas Ignotas from comment #4)
> What game/benchmark do you see this with?
> 
> Can you try calling _mesa_streaming_load_memcpy() there? It's for reading
> uncached memory, but by the looks of it it might be suitable for writing too.

I'm running Left4Dead2 under wine with Gallium Nine. The game has a `timedemo`
option where it could replay a previously `record`-ed gameplay, so the
benchmark is consistent.
I run it in a window, so I could watch the terminal with `perf top`.
When the problem is present, memcpy() is always the first with 25% usage, while
everything else is less than 2%.
I have to point out that I do run 64bit kernel, I just need the 32 bit
libraries, since the game is 32bit.

_mesa_streaming_load_memcpy() is a little problematic to test, since it is
written with intrinsics and I'm compiling for i486 (that's what my distribution
does). The function also has strong requirement for alignment of both src,
and could fall back to regular memcpy().
Still its existence is proof that there is need for such functionality.



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-08-24 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #6 from Eero Tamminen  ---
(In reply to Timothy Arceri from comment #1)
> There already is asm optimized version of memcpy() in glibc. Why would we
> want to reinvent that in Mesa?
> 
> glibc should pick the right implementation for your system.

How would memcpy() know that the destination is mapped to PCI-E address space
i.e. gets transparently transferred over the PCI-E bus (which has its own
performance constraints)?



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-08-24 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #5 from i...@yahoo.com ---
(In reply to Roland Scheidegger from comment #3)
> Isn't this mapped as WC?
> In this case I'd expect the direction to make little difference, since write
> combine of any decent cpu should be able to combine the writes regardless
> of the order?
> Although if it's UC I suppose someone needs to ensure that the maximum
> possible size is picked...

The theory that this is a caching issue has merit, since the distribution
version and my build seem to use the exact same memcpy(), one that goes
backwards, yet the distribution one is not triggering the massive slowdown.
The memmove() uses `rep movsb` and direction flag.

The question is, what controls the cache? How userland Mesa3D controls the PAT
cache flags? Because I am just changing the libraries, without rebooting
machine or restarting Xorg, I don't even stop the steam client. This means that
MTRR registers are not changed and the exact same kernel module and
configuration is used.

I do use modified build script, that disables support for hardware I don't
have, like intel and nvidia. Some of my options might cause the cache problem,
but I need to know what I am looking for.
BTW, the system libdrm is latest version (2.4.92).



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-08-24 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #4 from Grazvydas Ignotas  ---
What game/benchmark do you see this with?

Can you try calling _mesa_streaming_load_memcpy() there? It's for reading
uncached memory, but by the looks of it it might be suitable for writing too.



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-08-23 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #3 from Roland Scheidegger  ---
Isn't this mapped as WC?
In this case I'd expect the direction to make little difference, since write
combine of any decent cpu should be able to combine the writes regardless of
the order?
Although if it's UC I suppose someone needs to ensure that the maximum possible
size is picked...



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-08-23 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #2 from i...@yahoo.com ---
(In reply to Timothy Arceri from comment #1)
> There already is asm optimized version of memcpy() in glibc. Why would we
> want to reinvent that in Mesa?
> 
> glibc should pick the right implementation for your system.

Because some implementations copy data backwards and this creates a huge
problem when it is written over PCIe.

To be clear:
for(i=0;i<size;i++) dst[i]=src[i]; // forward copy
for(i=size-1;i>=0;i--) dst[i]=src[i]; // backwards copy



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-08-23 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

--- Comment #1 from Timothy Arceri  ---
There already is asm optimized version of memcpy() in glibc. Why would we want
to reinvent that in Mesa?

glibc should pick the right implementation for your system.



[Mesa-dev] [Bug 107670] Massive slowdown under specific memcpy implementations (32bit, no-SIMD, backward copy).

2018-08-23 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107670

Bug ID: 107670
   Summary: Massive slowdown under specific memcpy implementations
(32bit, no-SIMD, backward copy).
   Product: Mesa
   Version: unspecified
  Hardware: x86 (IA32)
OS: All
Status: NEW
  Severity: normal
  Priority: medium
 Component: Other
  Assignee: mesa-dev@lists.freedesktop.org
  Reporter: i...@yahoo.com
QA Contact: mesa-dev@lists.freedesktop.org

I've traced the massive slowdown to the memcpy() in
"mesa/src/gallium/auxiliary/util/u_upload_mgr.c::u_upload_data()" that seems to
be used to move data from the host memory into the video card memory.

The slowdown could be observed if non-SIMD version of the glibc-2.27 function
is used (like the one that comes with the 32 bit Slackware-current). The system
mesa3d package does not exhibit the same slowdown, but it seems to be linked to
glibc-2.5.

I do suspect that the slowdown is caused by memcpy() implementation that copies
data backwards, starting from the end and moving to the beginning. This is
likely treated as non-sequential data transfer over the PCI bus (it probably
sends the full 32 bit address for every 32 bits of data).
Using SSE2 memcpy seems to avoid this problem, but I have no idea if it is
because it copies more data at once or because it copies forward.

In my benchmarks, `perf top` showed that the problematic memcpy() consumes 25%
CPU time. In a particular game benchmark, I was getting 50fps instead of 70fps.


Just replacing that memcpy() with memmove() fixed the issue for me, without
having to recompile and replace glibc.
However, I do not consider it a reliable fix, as there is nothing guaranteeing
that memmove() would do the right thing.


I think that the correct solution would be to create a new function
memcpy_to_pci() with assembly implementation(s) that are specifically
crafted to maximize PCI/PCIe throughput.
The kernel has memcpy_toio/fromio(), but they don't seem to be asm optimized.
I've seen MPlayer MMX optimized mem2agpcpy() in aclib_template.c .
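
For illustration, here is a plain C sketch of what the proposed memcpy_to_pci() could look like. No such function exists in Mesa or the kernel; this version has no SIMD and only assumes that the goal is to write the destination strictly in ascending order and never read it. The volatile qualifier keeps the compiler from reordering or merging the stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the proposed memcpy_to_pci(): strictly ascending addresses,
 * destination is only ever written. Intended for mapped device memory;
 * this is not an existing Mesa or kernel function. */
static void *memcpy_to_pci(void *dst, const void *src, size_t n)
{
    volatile unsigned char *d = dst;
    const unsigned char *s = src;

    /* Byte stores until the destination is 4-byte aligned. */
    while (n && ((uintptr_t)d & 3)) {
        *d++ = *s++;
        n--;
    }
    /* Aligned 32-bit stores, lowest address first. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, s, 4);  /* alignment-safe load from the source */
        *(volatile uint32_t *)d = w;
        d += 4;
        s += 4;
        n -= 4;
    }
    /* Remaining tail bytes. */
    while (n--)
        *d++ = *s++;
    return dst;
}
```

An assembly or SIMD version would replace the middle loop with wider stores, but the ordering and write-only properties are the part that matters for the bus.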
