http://bugs.freedesktop.org/show_bug.cgi?id=12216

------- Comment #9 from [EMAIL PROTECTED]  2007-08-30 16:23 PST -------
(In reply to comment #8)
> > the code so it guarantees it's 128bit aligned, if possible (using aligned
> > mallocs, though I'm right now not quite sure if there even is some
> > compiler-independent solution to do this if the variable is on the stack
> > instead?).
> 
> 
> Manually aligning a stack variable in a compiler-independent way is fairly
> simple, but like an aligned malloc() implementation, it takes slightly more
> memory and setup than what a compiler would normally do for you (e.g. GCC's
> __attribute__((aligned(16)))). The basic idea is that a 128-bit value requires
> 16 bytes of stack space, and assuming the worst case that its address mod 16
> is 1, up to 15 extra bytes of padding space are needed. To be round, I would
> suggest just using 32 bytes. To find the aligned address, you need:
> 
> char data[32];
> unsigned long addr; /* Hopefully same size as ptr in LP64 and ILP32 models */
> float* aligned_ptr;
> 
> addr = (unsigned long)&data[0];
> 
> addr = addr + (16 - (addr & 0x0F)); /* Align the address */
> 
> aligned_ptr = (float*)addr; /* Currently have a single 128-bit variable aligned */
Well, what I really meant was to do it in a compiler-independent way but
without any additional effort. I'll take that as a "no" ;-).
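
For reference, a minimal sketch of the manual alignment idea quoted above, with
two assumptions that are not in the original: uintptr_t from C99's <stdint.h>
instead of unsigned long (so the integer is pointer-sized on LLP64 targets as
well, wherever the type is provided), and a round-up-with-mask form that leaves
an already aligned address untouched, so 16 + 15 bytes are enough. The names
align_up and demo_offset are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round ptr up to the next multiple of align.
 * align must be a non-zero power of two. An already aligned pointer
 * is returned unchanged. */
static void *align_up(void *ptr, size_t align)
{
    uintptr_t addr = (uintptr_t)ptr;

    assert(align != 0 && (align & (align - 1)) == 0);
    addr = (addr + (align - 1)) & ~(uintptr_t)(align - 1);
    return (void *)addr;
}

/* Returns how many padding bytes were skipped to reach a 16-byte
 * boundary inside a stack buffer (always in the range 0..15). */
static size_t demo_offset(void)
{
    unsigned char raw[16 + 15];   /* 16 payload bytes + worst-case padding */
    unsigned char *p = (unsigned char *)align_up(raw, 16);

    return (size_t)(p - raw);
}
```

The aligned pointer still has to be used with some care: this only fixes the
address, it does nothing about the compiler's aliasing assumptions.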

> Replacing movaps with movups causes severe performance hits in areas with
> heavy memory traffic, and by forcing the lowest common denominator, you cannot
> use the x86 memory reference to relieve register pressure, since SSE/2/3
> operations that use memory operands must also be aligned, just as movaps
> must. I don't suggest that every instance be replaced; rather, I would imagine
> that getting aligned data would be the optimal solution.
Register pressure shouldn't be too bad, since the code in question is used on
x86_64 only; the x86 version uses a 32-bit mov there, so no problem. And at
least the destination is always aligned.
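
To illustrate the movaps/movups trade-off, here is a rough sketch using the SSE
intrinsics from <xmmintrin.h> rather than hand-written assembly (that swap is
my assumption, as is the hypothetical helper name add4; it is not the code from
the bug). _mm_load_ps typically compiles to movaps and requires a 16-byte
aligned address, while _mm_loadu_ps typically compiles to movups and accepts
any address. The scalar fallback is only there so the sketch builds on targets
without SSE:

```c
#include <stdint.h>

#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_SKETCH 1
#endif

/* Hypothetical helper: dst[0..3] += src[0..3].
 * dst is assumed to be 16-byte aligned (as in the thread, where the
 * destination is always aligned); src may have any float alignment. */
static void add4(float *dst, const float *src)
{
#ifdef HAVE_SSE_SKETCH
    __m128 d = _mm_load_ps(dst);      /* aligned load */
    __m128 s;

    if (((uintptr_t)src & 15u) == 0)
        s = _mm_load_ps(src);         /* aligned source: fast path */
    else
        s = _mm_loadu_ps(src);        /* unaligned source: safe path */

    _mm_store_ps(dst, _mm_add_ps(d, s));
#else
    int i;

    for (i = 0; i < 4; ++i)
        dst[i] += src[i];
#endif
}
```

A runtime check like this has its own cost, so it is no substitute for
guaranteeing aligned data in the first place.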

> I'm pretty familiar with x86 assembly language, especially the SIMD
> instruction sets; perhaps I could help a bit?
> 
> Patrick Baggett
If you want to give it a look that'll be great. There's not that much SSE
assembly actually (more functions are optimized "only" for x86 without SSE, as
it wasn't particularly easy to parallelize them, IIRC at least partly due to
the limited shuffle capabilities). And I think the x86_64 optimized version
is still using some 3DNow! opcodes, which, while maybe more suitable to the
code, certainly pose some problems for Intel chips...

