Re: [E-devel] Attn: SIMD tweakers. :) Need some testing help please.

The Rasterman Sun, 08 Jan 2006 04:22:13 -0800

On Sat, 07 Jan 2006 11:09:46 -0700 Tres Melton <[EMAIL PROTECTED]> babbled:


> Good evening/morning/night/day tweakers,
> 
>       I'm investigating the capabilities of various processors' SIMD engines
> and would like some help.  I have a Pentium II w/ MMX and an AMD64 w/
> SSE2 so I need some help testing on the other architectures.  To that
> end I have written a test program that uses the shading routines from
> Eterm to test the performance of the various routines.  Although this
> does not directly relate to E the imlib2 blending routines were based
> off the same code that I based my shading routines on: that written by
> Willem Monsuwe <[EMAIL PROTECTED]>.  Further, the imlib2 libraries have
> some issues with aligned memory moves that information collected by
> these routines should help track down.  In reality a 'blend' of two
> images is the same as a shade where one image has the same pixel color
> value for the entire image.  This testing code contains x86-64/SSE2,
> x86/MMX, and x86/SSE2.  The x86/SSE code has turned into a much more
> complicated beast than I imagined so for now those users get to use the
> MMX routines.  :-(  I don't know that I'm going to do this piece as I'm
> not sure how to duplicate some of the SSE2 routines in SSE even
> disregarding speed.  I have marked the lines of code that need to be
> replaced with an asterisk in the code in the x86/SSE section for those
> that are interested.  Links to the appropriate manuals are also in the
> code (A64_128bit_Media_Programming.pdf).  All the macro foo to support
> x86/SSE is in place though it will emit an error if you try to use it.
> 
>       Currently I'm looking for information on what works on what processor
> and how fast it works.  To that end this program collects some
> information about your computer but leaves the "phone home" part to you
> so that you know exactly what is collected and can review it prior to
> sending it to me.  If you don't really know what you are doing or are
> limited on time but still want to help please see the 'basic'
> instructions.  If you want to dive in deeper then please see the
> 'advanced' instructions.  When a processor gets an op code that it
> doesn't understand it usually generates a SEGV and if it generates
> something else I'd like to know that too.  Don't be misled into thinking
> that there is a memory management error; it is most likely the way the
> memory is accessed.
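
One way to avoid running into an unsupported opcode at all is to ask the CPU what it supports before picking a routine. The following is only a minimal sketch, assuming GCC or Clang on x86 (it uses their `__builtin_cpu_supports` builtin); the helper name `simd_level` is hypothetical and is not part of the test program described above.

```c
/* Hypothetical helper, not taken from the tester:
 * returns 2 if SSE2 is available, 1 if only MMX is, 0 otherwise
 * (including when the compiler or architecture cannot answer). */
static int simd_level(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();                 /* safe to call more than once */
    if (__builtin_cpu_supports("sse2"))
        return 2;
    if (__builtin_cpu_supports("mmx"))
        return 1;
#endif
    return 0;
}
```

A caller could use the result to choose between the C, MMX, and SSE2 paths instead of relying on a signal to reveal a missing instruction set.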
> 
>       There is no aligned memory concept in MMX and the movdqa/movdqu
> instructions are SSE2 so alignment speed is not tested on SSE or MMX.
> Do x86 SSE blending routines exist in imlib2?  If not they should be
> implemented as it should halve the time it takes to blend two images.
> 
> Instructions (basic):
> Your hardware needs to be set up in the Makefile.  Search for:
> 
> #  ADJUST HERE:
> 
> And set the CFLAGS accordingly for your hardware.  The x86 arch can
> handle MMX (SSE & SSE2 soon) and the x86-64 arch can handle SSE2.  I
> am particularly interested in the alignment requirements for
> SSE2 on x86 hardware.  Then simply run "make clean && make && ./tst 2>&1 |
> tee output.txt".  When everything is set up and working, please turn up
> the count variable in main() in tst.c so that it runs for a while.
> It took about 20 hours for 10,000 iterations on my amd64 3500+ using
> SSE2 and aligned tests and about 12 hours for 1000 iterations on my Dual
> Pentium II 400MHz (it only uses one processor though) using MMX.  Catch
> and then review the output.txt file and email it back to me if you
> think the information is okay.  Maybe post the summaries to the list
> too.
> 
> 
> Instructions (advanced):
>       Follow the basic instructions first.  Then beat the crap out of the
> CFLAGS/ASFLAGS/LDFLAGS and see what you can do to break it (I'm really
> only interested in sane flags though, not the serious ricer flags
> like -fholy_shit_its_so_fast).  Try it with different
> assemblers/compilers.  Try it on something other than Linux (like
> {Free,Open,Net}BSD, Solaris, and whatever else Eterm has been ported to
> that has an x86 or x86-64 processor).  Alter the image sizes.  And maybe
> set the count in main() in tst.c (the iterations to test) to something
> that will run through the night.  And get as creative as you want.
> 
> Adjustables in main():  Many variables exist in main which can be
> adjusted to perform more thorough testing.  Some of these can be set to
> the special value 'RANDOM' meaning the value is retrieved from a random
> generator on each iteration and those variables are marked with an '*'.
> If the random seed, r_seed, is also set to RANDOM then the random number
> generator is seeded off the clock.  The settable variables are:  count,
> width, height, *red_mod, *green_mod, *blue_mod, and *r_seed.  Pay
> attention to your machine's memory. 5000x5000 @ 32bpp = 100MB
> 
> 
> Memory management:
>       There is also a replacement for malloc() and free() in
> align/memory.[ch] that handles aligned memory management and a different
> main() in memory.c that will test these routines fairly rigorously.
> Feel free to expand on the tests or try to use it in something else.
> The advantage of my code is that it can be included where it is needed
> especially where the posix_memalign() might not be available (it was
> written from scratch without looking at any other code so it is not
> copyright encumbered in any way).  The disadvantage is that the posix
> code is much faster, better tested, and can be freed with free().  I
> have not attempted to optimize mine at all and won't unless the code is
> wanted here or in X.
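
For readers wondering what such a replacement typically looks like, here is a minimal sketch of one common approach: over-allocate with malloc(), round the address up, and stash the original pointer just below the aligned block. This is not the code in align/memory.[ch]; the names `aligned_malloc_sketch` and `aligned_free_sketch` are hypothetical. It also shows why such memory cannot be released with plain free().

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns a block of at least `size` bytes aligned to `alignment`,
 * which must be a nonzero power of two. Returns NULL on failure. */
static void *aligned_malloc_sketch(size_t size, size_t alignment)
{
    if (alignment == 0 || (alignment & (alignment - 1)) != 0)
        return NULL;

    /* Extra room for worst-case padding plus the saved original pointer. */
    void *raw = malloc(size + alignment - 1 + sizeof(void *));
    if (raw == NULL)
        return NULL;

    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + alignment - 1) & ~(uintptr_t)(alignment - 1);

    /* Remember what malloc() really returned, just below the aligned block. */
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

/* Must be used instead of free() for blocks from aligned_malloc_sketch(). */
static void aligned_free_sketch(void *ptr)
{
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}
```

A 16 byte alignment, as needed for movdqa, would be requested with `aligned_malloc_sketch(n, 16)`.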
> 
> 
> Timing:
>       The timing on these routines is not based on time at all.  It is based
> on CPU cycles consumed by the routines.  Since the routines can be
> preempted we are really looking for the "best case" scenario.  If you
> want really accurate cycle counts you either need to reboot to single
> user mode or run the code as root and clear the interrupts prior to the
> shading routines. (Don't forget to issue an sti when you return. :)  It
> is this possible preemption that explains negative reductions and most
> of the other weird output.  The timing crap is in time-it.[ch] and
> utilizes a small amount of inline assembly.  If anyone has an alternate
> timing infrastructure I'd be interested in a macro that flips between 2
> or 3 different ones.
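
A rough sketch of what a switchable timing source with "best case" selection could look like. It is not the code in time-it.[ch]: the names `read_ticks`, `best_case_ticks`, and `bump_counter` are hypothetical, and it assumes GCC or Clang for the `__rdtsc()` intrinsic, falling back to the standard `clock()` elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <x86intrin.h>
#define HAVE_TSC 1
#endif

/* Flip between timing sources at compile time. */
static uint64_t read_ticks(void)
{
#ifdef HAVE_TSC
    return __rdtsc();            /* CPU time stamp counter */
#else
    return (uint64_t)clock();    /* portable fallback, much coarser */
#endif
}

/* Run fn(arg) `runs` times and keep the smallest tick count seen.
 * A preempted run only inflates its own sample, so the minimum
 * approximates the best case. Returns UINT64_MAX when runs == 0. */
static uint64_t best_case_ticks(void (*fn)(void *), void *arg, size_t runs)
{
    uint64_t best = UINT64_MAX;
    for (size_t i = 0; i < runs; i++) {
        uint64_t t0 = read_ticks();
        fn(arg);
        uint64_t t1 = read_ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Example workload: increments a counter passed through the void pointer. */
static void bump_counter(void *arg)
{
    ++*(volatile unsigned *)arg;
}
```

Taking the minimum rather than the average is one way to discard samples that were inflated by preemption without having to disable interrupts.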
> 
> 
> Needed:
>       Michael also wanted to know the Resident Stack Size.  How should I go
> about retrieving it and when is the best time to do it?  I also need help
> figuring out why the aligned memory moves fail on x86 in my 32 bit chroot
> jail.  It might be that the memory chunk made available in a 32 bit
> chroot jail on a 64 bit processor is not 16 byte aligned.  (I hope
> anyway.  I might make aligned memory access enabled by a macro in the
> code that I finally submit to MEJ like it is now.  Adjust it in cmod.h)
> 
> 
> CC People (and why):
> MEJ:  It is the Eterm shading routines that I'm testing with.
> Raster:       You wanted to see some statistics on performance.  Specifically
>       how fast this is over the internal x86 fixup microcode.
> Tiago:        As the person that found the alignment issues in imlib2
>       originally.
> John: As the author of the imlib2 code that has an alignment issue.  I
>       hope to dive into that code shortly and hope you have the time
>       to work with me on it.  (Not yet, but soon)
> vapier:       Because weird code turns him on for some reason and he is
>       Gentoo's maintainer for Eterm/Enlightenment.  
> hparker:
> dang: My Gentoo AMD64 mentors.  :p
> 
> Licensing:
>       I'm releasing this code as copyright by me and others with all my
> rights reserved for the next month or so just so that other versions
> don't appear and confuse things (I will gladly accept patches though).
> If Michael or Raster want to adopt it into a testing branch of their
> projects then they may have it to license however they please; if not
> then I will release the portions that I wrote next month under their
> preferred license: the BSD.  (I prefer the GPL)  If you do play with the
> code anyway then be nice and set the major version number to 0
> (TEST_VERSION_MAJOR in tst.c) for me please.
> 
>       Thanks all, for helping out in Operation: Aggravate MEJ!  Muu ha ha ha
> ha.  He hates duplicated code.  :-)   (For those newer to the list than
> I, MEJ is really a good sport about accepting outside code -- look at
> Escreen.  But how many of us look forward to the prospect of having 16
> different functions written in one of C, assembly, or inline assembly to
> accomplish the single job of shading a background image.  :/  )  And
> please ignore all the crap that is still in the code.  I'll skim it down
> once I know more about the test results.  MEJ: I know we have different
> coding styles; I will make a concerted effort to adjust my code to your
> style for everything that I submit to you so don't freak on me yet.
> This should have most everything to test aligned/unaligned memory in
> SSE2 on x86 & x86-64 as well as to profile the older MMX code that
> Willem wrote.  I'm very tired ATM so I'm sure I missed something
> critical.  Sorry.  I'll get to it after some sleep.
> 
> The code is being hosted here (thanks Gentoo & Mr. Parker):
> http://dev.gentoo.org/~hparker/simd-tester.tar.gz

Forbidden

You don't have permission to access /~hparker/simd-tester.tar.gz on this server.
Apache Server at dev.gentoo.org Port 80

:)

(i'd love to comment... but can't comment on the code atm.. so i'll comment on
whats in the mail).

1. in e17 cvs:

e17/proto/gfx_routines

you will find a quick tester program with routines for alpha blending & copying
(so far) - it tests c, mmx, sse and sse2 (so you can use it to see how to make
sse2 work. i could make it faster i guess... i dont have detection there as its
simple for testing correctness and relative speed). a few things i have
noticed. relative performance of routines can vary WILDLY between cpu's and
makers. a p4 mobile will be different to a p4 desktop. 2 p4 desktops will vary
(memory/bus differences with the motherboard). amd64 will be quite different to
p4. p2 and p3 will be quite different too (if you disable the routines that
will not work on them). same with pentium-m. you can't rely on the results on 1
box as on another one routine that was faster, now is slower than the competing
routine. i have tweaked these as much as i can here. sparse vs non-sparse is
basically where the blender does a compare+branch per pixel to avoid work if
the src alpha is 0 or 255 - this of course makes the speed entirely variable
based on input data - with some input data its faster than brute force, with
some it's slower. the trend is - the more 0 or 255 pixels u have the better it
optimises - if u have lots of intermediate (> 0 && < 255) alpha values, brute
force is better as the cost of compare+branch is greater than the savings.
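
to make the sparse vs brute force distinction concrete, here is a plain C sketch. it is not the gfx_routines code: the function names are made up, pixels are assumed to be 32 bit ARGB with the alpha in the top byte, and the destination alpha is simply left untouched.

```c
#include <stddef.h>
#include <stdint.h>

/* Blend one 8-bit channel: (src*a + dst*(255-a)) / 255, rounded. */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha)
{
    return (uint8_t)((src * alpha + dst * (255 - alpha) + 127) / 255);
}

/* Blend one ARGB pixel over dst, keeping the destination alpha byte. */
static uint32_t blend_pixel(uint32_t src, uint32_t dst)
{
    uint8_t a = (uint8_t)(src >> 24);
    uint32_t out = dst & 0xff000000u;
    for (int shift = 0; shift < 24; shift += 8) {
        uint8_t s = (uint8_t)(src >> shift);
        uint8_t d = (uint8_t)(dst >> shift);
        out |= (uint32_t)blend_channel(s, d, a) << shift;
    }
    return out;
}

/* Brute force: do the full blend math for every pixel. */
static void blend_brute(const uint32_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = blend_pixel(src[i], dst[i]);
}

/* Sparse: compare+branch per pixel to skip the math when alpha is 0 or 255. */
static void blend_sparse(const uint32_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t a = src[i] >> 24;
        if (a == 0)
            continue;                       /* fully transparent: dst unchanged */
        if (a == 255) {                     /* fully opaque: copy the colour */
            dst[i] = (dst[i] & 0xff000000u) | (src[i] & 0x00ffffffu);
            continue;
        }
        dst[i] = blend_pixel(src[i], dst[i]);
    }
}
```

both loops produce the same pixels; only the amount of work per pixel differs, which is why the winner depends on how many 0 or 255 alpha values the input has.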

and 2. your mail

imlib2's aligned memory stuff should be fixed now. the above gfx_routines code
doesnt need to worry about aligned/unaligned except for sse2 copies. :) also
except for memcpy i havent found sse to give any great speedups over mmx. sse2
though is nicer (see the above routines - well definitely on p4's). as for
aligned vs unaligned speed differences - i have not noted any big differences to
worry about overall. i have also noticed u NEED to turn on optimizations in gcc
for this code to compile - it refuses without (and hard to work out why). i
like your timing using cpu cycles - i still use wall clock time - BUT i do
tests on an almost idle system and the tests run many many loops so the impact
of the system on the bench is minimal enough to be ignored - as long as its
idle. as for stats - would love to see them... :)

-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    [EMAIL PROTECTED]
裸好多
Tokyo, Japan (東京 日本)


