Good evening/morning/night/day tweakers,
I'm investigating the capabilities of various processors' SIMD engines
and would like some help. I have a Pentium II w/ MMX and an AMD64 w/
SSE2 so I need some help testing on the other architectures. To that
end I have written a test program that uses the shading routines from
Eterm to test the performance of the various routines. Although this
does not directly relate to E the imlib2 blending routines were based
off the same code that I based my shading routines on: that written by
Willem Monsuwe <[EMAIL PROTECTED]>. Further, the imlib2 libraries have
some issues with aligned memory moves that information collected by
these routines should help track down. In reality a 'blend' of two
images is the same as a shade where one image has the same pixel color
value for the entire image. This testing code contains x86-64/SSE2,
x86/MMX, and x86/SSE2. The x86/SSE code has turned into a much more
complicated beast than I imagined so for now those users get to use the
MMX routines. :-( I don't know that I'm going to do this piece as I'm
not sure how to duplicate some of the SSE2 routines in SSE even
disregarding speed. I have marked the lines of code that need to be
replaced with an asterisk in the code in the x86/SSE section for those
that are interested. Links to the appropriate manuals are also in the
code (A64_128bit_Media_Programming.pdf). All the macro foo to support
x86/SSE is in place though it will emit an error if you try to use it.
Currently I'm looking for information on what works on what processor
and how fast it works. To that end this program collects some
information about your computer but leaves the "phone home" part to you
so that you know exactly what is collected and can review it prior to
sending it to me. If you don't really know what you are doing or are
limited on time but still want to help please see the 'basic'
instructions. If you want to dive in deeper then please see the
'advanced' instructions. When a processor gets an opcode that it
doesn't understand it usually generates a SIGILL (illegal instruction),
while a misaligned aligned move such as movdqa usually generates a SEGV;
if it generates something else I'd like to know that too. Don't be
misled by a SEGV into thinking that there is a memory management error;
it is most likely the way the memory is accessed.
There is no aligned memory concept in MMX and the movdqa/movdqu
instructions are SSE2 so alignment speed is not tested on SSE or MMX.
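To make the movdqa/movdqu distinction concrete, here is a minimal C
sketch (not taken from the tester) of the 16-byte alignment check that
decides which of the two moves is safe for a given address. The helper
name is_aligned16 is hypothetical.

```c
#include <stdint.h>

/* Returns 1 if p sits on a 16-byte boundary, which is what the aligned
 * SSE2 move (movdqa) requires; returns 0 if the unaligned move (movdqu)
 * has to be used instead. is_aligned16 is a hypothetical helper name. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

With a 16-byte aligned buf, is_aligned16(buf) is true and
is_aligned16(buf + 4) is false, so the second address would need movdqu
(or _mm_loadu_si128 if intrinsics are used instead of assembly).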
Do x86 SSE blending routines exist in imlib2? If not they should be
implemented as it should halve the time it takes to blend two images.
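For reference, the per-pixel work that the SIMD shading routines speed
up can be sketched in plain C. This is not Eterm's or imlib2's code; it
assumes 0xAARRGGBB pixels and modifiers where 256 means "unchanged",
and the function names are hypothetical. Only red_mod, green_mod, and
blue_mod are names taken from the tester.

```c
#include <stddef.h>
#include <stdint.h>

/* Scale one 8-bit channel by mod/256 and clamp to 255.
 * Assumption: mod == 256 leaves the channel unchanged, mod > 256 brightens. */
static uint32_t scale_channel(uint32_t c, uint32_t mod)
{
    uint32_t v = (c * mod) >> 8;
    return v > 255u ? 255u : v;
}

/* Shade n pixels in place. Assumption: pixels are 0xAARRGGBB; the alpha
 * byte is passed through untouched. */
static void shade_scalar(uint32_t *px, size_t n,
                         uint32_t red_mod, uint32_t green_mod,
                         uint32_t blue_mod)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t p = px[i];
        uint32_t r = scale_channel((p >> 16) & 0xffu, red_mod);
        uint32_t g = scale_channel((p >> 8) & 0xffu, green_mod);
        uint32_t b = scale_channel(p & 0xffu, blue_mod);
        px[i] = (p & 0xff000000u) | (r << 16) | (g << 8) | b;
    }
}
```

A blend in this picture would take the three modifiers per pixel from a
second image instead of using the same three values for every pixel.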
Instructions (basic):
Your hardware needs to be set up in the Makefile. Search for:
# ADJUST HERE:
And set the CFLAGS accordingly for your hardware. The x86 arch can
handle MMX (SSE & SSE2 soon) and the x86-64 arch can handle SSE2. I
am particularly interested in the alignment requirements for
SSE2 on x86 hardware. Then simply run:
make clean && make && ./tst 2>&1 | tee output.txt
When everything is set up and working, please turn up the
count variable in main() in tst.c so that it runs for a while.
It took about 20 hours for 10,000 iterations on my amd64 3500+ using
SSE2 and aligned tests and about 12 hours for 1000 iterations on my Dual
Pentium II 400MHz (it only uses one processor though) using MMX. Then
review the output.txt file and email it back to me if you
think the information is okay. Maybe post the summaries to the list
too.
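If you are not sure which CFLAGS apply to your machine, a small check
like the following may help. It is only a sketch and is not part of the
tester: it assumes GCC or Clang, whose __builtin_cpu_supports builtin
does the real work, and report_simd is a hypothetical name.

```c
#include <stdio.h>

/* Print which of MMX/SSE/SSE2 the running CPU reports and return them
 * as a bit mask (bit 0 = MMX, bit 1 = SSE, bit 2 = SSE2).
 * Assumption: GCC or Clang on x86/x86-64; anything else returns 0. */
static int report_simd(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    int mmx  = __builtin_cpu_supports("mmx")  != 0;
    int sse  = __builtin_cpu_supports("sse")  != 0;
    int sse2 = __builtin_cpu_supports("sse2") != 0;
    printf("MMX: %d  SSE: %d  SSE2: %d\n", mmx, sse, sse2);
    return mmx | (sse << 1) | (sse2 << 2);
#else
    puts("not an x86 build; no MMX/SSE/SSE2 to report");
    return 0;
#endif
}
```

On Linux, the flags line in /proc/cpuinfo should give the same answer
without compiling anything.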
Instructions (advanced):
Follow the basic instructions first. Then beat the crap out of the
CFLAGS/ASFLAGS/LDFLAGS and see what you can do to break it (I'm really
only interested in sane flags though, not the serious ricer flags
like -fholy_shit_its_so_fast). Try it with different
assemblers/compilers. Try it on something other than Linux (like
{Free,Open,Net}BSD, Solaris, and whatever else Eterm has been ported to
that has an x86 or x86-64 processor). Alter the image sizes. And maybe
set the count in main() in tst.c (the iterations to test) to something
that will run through the night. And get as creative as you want.
Adjustables in main(): Many variables exist in main which can be
adjusted to perform more thorough testing. Some of these can be set to
the special value 'RANDOM' meaning the value is retrieved from a random
generator on each iteration and those variables are marked with an '*'.
If the random seed, r_seed, is also set to RANDOM then the random number
generator is seeded off the clock. The settable variables are: count,
width, height, *red_mod, *green_mod, *blue_mod, and *r_seed. Pay
attention to your machine's memory: 5000x5000 @ 32bpp = 100MB.
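The memory figure is just width x height x 4 bytes. A tiny helper such
as the following (hypothetical, not in tst.c) can be used to check a
width/height pair before starting a long run. It counts a single image
buffer only; how many buffers the tester holds at once is not assumed
here.

```c
#include <stddef.h>

/* Bytes needed for one width x height image at 32 bits (4 bytes) per
 * pixel. image_bytes is a hypothetical helper name. */
static size_t image_bytes(size_t width, size_t height)
{
    return width * height * 4u;
}
```

For 5000x5000 this gives 100,000,000 bytes, i.e. the 100MB quoted
above.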
Memory management:
There is also a replacement for malloc() and free() in
align/memory.[ch] that handles aligned memory management and a different
main() in memory.c that will test these routines fairly rigorously.
Feel free to expand on the tests or try to use it in something else.
The advantage of my code is that it can be included where it is needed
especially where the posix_memalign() might not be available (it was
written from scratch without looking at any other code so it is not
copyright encumbered in any way). The disadvantage is that the posix
code is much faster, better tested, and can be freed with free(). I
have not attempted to optimize mine at all and won't unless the code is
wanted here or in X.
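For readers who have not seen the technique, one common way to build an
aligned allocator on top of plain malloc() is sketched below. This is
NOT the code in align/memory.[ch]; the function names are hypothetical
and the approach (over-allocate, then stash the real pointer just below
the aligned block) is only an assumption about how such a replacement
can work.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate size bytes aligned to align (a power of two, at least
 * sizeof(void *)). The pointer malloc() really returned is stored in
 * the slot just below the aligned block so it can be recovered later. */
static void *aligned_malloc_sketch(size_t size, size_t align)
{
    if (align < sizeof(void *) || (align & (align - 1)) != 0)
        return NULL;
    if (size > SIZE_MAX - align - sizeof(void *))
        return NULL;

    unsigned char *raw = malloc(size + align + sizeof(void *));
    if (raw == NULL)
        return NULL;

    /* Leave room for the stashed pointer, then round up to align. */
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + align - 1) & ~(uintptr_t)(align - 1);

    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

/* Release a block from aligned_malloc_sketch(); plain free() on the
 * aligned pointer would be wrong because it is not what malloc gave us. */
static void aligned_free_sketch(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

The matching free routine is what makes this kind of allocator less
convenient than posix_memalign(), whose blocks go back through the
ordinary free().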
Timing:
The timing on these routines is not based on time at all. It is based
on CPU cycles consumed by the routines. Since the routines can be
preempted we are really looking for the "best case" scenario. If you
want really accurate cycle counts you either need to reboot to single
user mode or run the code as root and clear the interrupts prior to the
shading routines. (Don't forget to issue an sti when you return. :) It
is this possible preemption that explains negative reductions and most
of the other weird output. The timing crap is in time-it.[ch] and
utilizes a small amount of inline assembly. If anyone has an alternate
timing infrastructure I'd be interested in a macro that flips between 2
or 3 different ones.
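The "best case" idea can be sketched as follows. This is not the code
in time-it.[ch]: it assumes GCC/Clang inline assembly for rdtsc on x86
and falls back to clock() elsewhere, and every name in it is
hypothetical.

```c
#include <stdint.h>
#include <time.h>

/* Read a cycle-like counter: rdtsc on x86 with GCC/Clang inline
 * assembly, otherwise clock() ticks (a portability assumption). */
static uint64_t read_cycles(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    return (uint64_t)clock();
#endif
}

/* Run fn(arg) `runs` times and keep the smallest count, since being
 * preempted normally only makes a measurement larger. */
static uint64_t best_case_cycles(void (*fn)(void *), void *arg, int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = read_cycles();
        fn(arg);
        uint64_t t1 = read_cycles();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Hypothetical workload standing in for a shading routine: adds
 * 0..999 to the unsigned counter that arg points to. */
static void sample_work(void *arg)
{
    volatile unsigned *acc = arg;
    for (unsigned i = 0; i < 1000; i++)
        *acc += i;
}
```

Calling best_case_cycles(sample_work, &acc, 5) with an unsigned acc
returns the smallest of five measurements. A macro that switches
read_cycles() between two or three counters would fit in the same
place as the #if above.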
Needed:
Michael also wanted to know the Resident Set Size. How should I go
about retrieving it and when is the best time to do it? I also need help
figuring out why the aligned memory moves fail on x86 in my 32 bit
chroot jail. It might be that the memory chunk made available in a 32
bit chroot jail on a 64 bit processor is not 16 byte aligned. (I hope
anyway.) I might make aligned memory access enabled by a macro in the
code that I finally submit to MEJ, like it is now; adjust it in cmod.h.
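On the memory-size question, and assuming what is wanted is the
resident set size (RSS), one possible approach is getrusage(). The
sketch below is not part of the tester and peak_rss is a hypothetical
name.

```c
#include <sys/resource.h>

/* Peak resident set size of this process as reported by getrusage().
 * Linux reports ru_maxrss in kilobytes; other systems use other units
 * (macOS uses bytes), so treat the number as platform-specific.
 * Returns -1 if getrusage() fails. */
static long peak_rss(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}
```

Because ru_maxrss is a high-water mark, reading it once after the last
iteration should be enough. Whether a given kernel fills in the field
is worth checking; on Linux, /proc/self/status (the VmRSS and VmHWM
lines) is an alternative source.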
CC People (and why):
MEJ: It is the Eterm shading routines that I'm testing with.
Raster: You wanted to see some statistics on performance. Specifically
how fast this is over the internal x86 fixup microcode.
Tiago: As the person that found the alignment issues in imlib2
originally.
John: As the author of the imlib2 code that has an alignment issue. I
hope to dive into that code shortly and hope you have the time
to work with me on it. (Not yet, but soon)
vapier: Because weird code turns him on for some reason and he is
Gentoo's maintainer for Eterm/Enlightenment.
hparker, dang: My Gentoo AMD64 mentors. :p
Licensing:
I'm releasing this code as copyright by me and others with all my
rights reserved for the next month or so just so that other versions
don't appear and confuse things (I will gladly accept patches though).
If Michael or Raster want to adopt it into a testing branch of their
projects then they may have it to license however they please; if not
then I will release the portions that I wrote next month under their
preferred license: the BSD. (I prefer the GPL) If you do play with the
code anyway then be nice and set the major version number to 0
(TEST_VERSION_MAJOR in tst.c) for me please.
Thanks all, for helping out in Operation: Aggravate MEJ! Muu ha ha ha
ha. He hates duplicated code. :-) (For those newer to the list than
I, MEJ is really a good sport about accepting outside code -- look at
Escreen. But how many of us look forward to the prospect of having 16
different functions written in one of C, assembly, or inline assembly to
accomplish the single job of shading a background image? :/ ) And
please ignore all the crap that is still in the code. I'll skim it down
once I know more about the test results. MEJ: I know we have different
coding styles; I will make a concerted effort to adjust my code to your
style for everything that I submit to you so don't freak on me yet.
This should have most everything to test aligned/unaligned memory in
SSE2 on x86 & x86-64 as well as to profile the older MMX code that
Willem wrote. I'm very tired ATM so I'm sure I missed something
critical. Sorry. I'll get to it after some sleep.
The code is being hosted here (thanks Gentoo & Mr. Parker):
http://dev.gentoo.org/~hparker/simd-tester.tar.gz
Cheers,
--
Tres Melton
IRC & Gentoo: RiverRat
