[Bug c++/15795] No way to teach operator new anything about alignment requirements
--- Comment #38 from timday at bottlenose dot demon dot co dot uk 2006-11-12 15:33 ---
Gah: I just spent several hours trying to figure out why my malloc'ed __v4sf values weren't 16-byte aligned before I stumbled on this thread. It would be nice if the "Using vector instructions through built-in functions" section of the gcc info manual contained a big warning about the issue.

--
timday at bottlenose dot demon dot co dot uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |timday at bottlenose dot demon dot co dot uk

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15795
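The alignment problem in comment #38 can be sidestepped by asking the allocator for 16-byte alignment explicitly. The following is only a minimal sketch, assuming a POSIX system that provides posix_memalign; the names v4sf, alloc_v4sf and is_aligned16 are hypothetical and do not come from the bug thread. The typedef mirrors how GCC's xmmintrin.h declares __v4sf.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free (POSIX, not ISO C++)

// Four packed floats, declared the same way as __v4sf in GCC's xmmintrin.h.
typedef float v4sf __attribute__((vector_size(16)));

// True if p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: allocate room for `count` vectors on a 16-byte
// boundary. Returns a null pointer on failure; release with free().
inline v4sf* alloc_v4sf(std::size_t count) {
    void* p = 0;
    if (posix_memalign(&p, 16, count * sizeof(v4sf)) != 0) {
        return 0;
    }
    assert(is_aligned16(p));  // guaranteed by posix_memalign on success
    return static_cast<v4sf*>(p);
}
```

A caller would write something like `v4sf* buf = alloc_v4sf(8);` and later `free(buf);`. This does nothing for plain `new`, which is the subject of this bug; it only covers code that allocates by hand.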
[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing
--- Comment #3 from timday at bottlenose dot demon dot co dot uk 2006-11-08 10:01 ---
I've just tried an alternative version (will upload later), replacing the union with a single __v4sf _rep and implementing the [] operators using e.g. (reinterpret_cast<const float*>(&_rep))[i]; However, the code generated for the two transform implementations remains the same (20 and 32 instructions, anyway; I haven't checked the details yet). Maybe that's not surprising, as it's just moving the problem around.

The big difference between the two methods is perhaps primarily that the bad one involves a __v4sf->float->__v4sf conversion, while the good one uses __v4sf throughout by using the mul_compN methods. I'll try to prepare a more concise test case based on the premise that bad handling of __v4sf <-> float is the real issue.

--
timday at bottlenose dot demon dot co dot uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |timday at bottlenose dot demon dot co dot uk

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756
[Bug middle-end/29756] SSE intrinsics hard to use without redundant temporaries appearing
--- Comment #4 from timday at bottlenose dot demon dot co dot uk 2006-11-08 22:18 ---
Created an attachment (id=12573)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=12573&action=view)
More concise demonstration of the v4sf->float->v4sf issue.

The attached code (no classes or unions, just a few inline functions), obtained from
gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer v4sf.cpp
compiles transform_good to 18 instructions and transform_bad to 33.

However, it's not really surprising that a round trip through stack temporaries is required when pointer arithmetic is being used to extract a float from a __v4sf. I've no idea whether it's realistic to hope this could ever be optimised away.

Alternatively, it would be very nice if the builtin vector types simply provided a [] operator, or if there were some intrinsics for extracting floats from a __v4sf.

(In the meantime, in the original vector4f class, remaining in the __v4sf domain by having the const operator[] return a suitably type-wrapped __v4sf filled with the specified component seems to be a promising direction.)

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756
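The two ideas at the end of comment #4 (a [] on the vector type, and an accessor that returns a vector filled with one component) can be sketched as below. This is an illustration only, not the attached code. It assumes a compiler that accepts subscripting on vector types, an extension that the GCC 4.1 used in this thread evidently lacks. The names v4sf, lane, splat and mul_comp are hypothetical; mul_comp is merely in the spirit of the mul_compN methods mentioned in comment #3.

```cpp
#include <cassert>

// Four packed floats, declared the same way as __v4sf in GCC's xmmintrin.h.
typedef float v4sf __attribute__((vector_size(16)));

// Read one lane as a scalar, using the vector subscript extension.
inline float lane(v4sf v, int i) {
    assert(i >= 0 && i < 4);
    return v[i];
}

// Hypothetical helper: return a vector whose four lanes all hold lane i of v,
// so that callers can keep working with vectors rather than scalar floats.
inline v4sf splat(v4sf v, int i) {
    const float s = lane(v, i);
    v4sf r = {s, s, s, s};
    return r;
}

// Multiply every lane of a by lane i of b.
inline v4sf mul_comp(v4sf a, v4sf b, int i) {
    return a * splat(b, i);
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, mul_comp(a, b, 3) multiplies each lane of a by 40. Whether the compiler turns splat into a single shuffle or a trip through the stack is exactly the code-generation question this bug raises, and the sketch makes no claim either way.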
[Bug inline-asm/29756] New: SSE intrinsics hard to use without redundant temporaries appearing
I've been adapting some old code's simple 4-float vector class to use SSE by way of the intrinsic functions. It seems to be quite hard to avoid the generated assembly code being rather diluted by apparently redundant spills of intermediate results to the stack.

On inspecting the assembly produced from the file to be attached, compare the code generated for matrix44f::transform_good and matrix44f::transform_bad. The former is 20 instructions and apparently optimal. However, it was only arrived at by prodding the latter version of the function (which does exactly the same thing, expressed more naturally, but results in 32 instructions) until the stack temporaries went away.

It would be nice if both versions of the function generated optimal code, and there doesn't seem to be any particular reason they shouldn't. Both versions' assembly contains the same expected numbers of shuffle, multiply and add instructions; the excess all seems to involve extra stack temporaries.

[I'm not sure what the triplet codes on this form are. I'm using a gcc in Debian Etch; gcc --version shows gcc (GCC) 4.1.2 20060901 (prerelease) (Debian 4.1.1-13); the platform is a Pentium3. Sorry if the inline-asm component is a completely inappropriate thing to assign to.]

--
           Summary: SSE intrinsics hard to use without redundant temporaries appearing
           Product: gcc
           Version: 4.1.2
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: inline-asm
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: timday at bottlenose dot demon dot co dot uk

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756
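For readers without the attachment, the two styles contrasted in the report can be sketched roughly as follows. This is a hypothetical reconstruction, not the attached matrix44f code: the struct mat44 and both function names are invented for illustration, and a column-major layout is assumed. The intrinsics themselves are the standard SSE ones from xmmintrin.h, so the sketch needs an x86 target with SSE enabled.

```cpp
#include <xmmintrin.h>  // SSE intrinsics, the header the report's .cpp includes

// Hypothetical stand-in for the report's matrix44f: four column vectors.
struct mat44 {
    __m128 col[4];
};

// Style 1: read each component of v out as a float, then broadcast it back
// into a vector. This is the vector -> float -> vector round trip.
inline __m128 transform_via_float(const mat44& m, __m128 v) {
    float f[4];
    _mm_storeu_ps(f, v);  // write the four lanes to memory to read them
    __m128 r = _mm_mul_ps(m.col[0], _mm_set1_ps(f[0]));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[1], _mm_set1_ps(f[1])));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[2], _mm_set1_ps(f[2])));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[3], _mm_set1_ps(f[3])));
    return r;
}

// Style 2: broadcast each component with a shuffle, never naming a float.
inline __m128 transform_via_shuffle(const mat44& m, __m128 v) {
    __m128 r = _mm_mul_ps(m.col[0], _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[1], _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1))));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[2], _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2))));
    r = _mm_add_ps(r, _mm_mul_ps(m.col[3], _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3))));
    return r;
}
```

Both functions compute the same matrix-vector product. The instruction counts quoted in the report apply to the attached code under GCC 4.1.2, not to this sketch; comparing the output of gcc -S for the two functions is the way to see what a given compiler does with them.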
[Bug inline-asm/29756] SSE intrinsics hard to use without redundant temporaries appearing
--- Comment #1 from timday at bottlenose dot demon dot co dot uk 2006-11-07 22:26 ---
Created an attachment (id=12566)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=12566&action=view)
Result of gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer intrin.cpp

This is the .ii file output from
gcc -v -save-temps -S -O3 -march=pentium3 -mfpmath=sse -msse -fomit-frame-pointer intrin.cpp

Most of it is the result of the .cpp's sole direct include, #include <xmmintrin.h>, which came immediately before the class vector4f declaration.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29756