http://gcc.gnu.org/bugzilla/show_bug.cgi?id=61043
Bug ID: 61043
Summary: LTO accumulates CPU requirements from all input objects
Product: gcc
Version: 4.8.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: lto
Assignee: unassigned at gcc dot gnu.org
Reporter: andysem at mail dot ru

Created attachment 32726
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=32726&action=edit
A test case to reproduce the problem

I have a test case (attached) with several input files: main.cpp contains generic code that should run on any CPU, while add_sse2.c and add_avx2.c contain code optimized with SSE2 and AVX2 intrinsics, respectively. main.cpp detects the available CPU features at run time and invokes the optimized code when possible.

The problem is that when this test is compiled with LTO enabled, the resulting executable contains an add_sse2 function with VEX-encoded instructions (i.e. AVX-128 code instead of legacy SSE2). This does not happen when LTO is disabled.

My guess is that LTO computes the highest required CPU across all input object files (here, the one with AVX2) and generates code for that target, instead of generating code for the CPU that was specified for each file at the compilation stage. The expected behavior would be to record the target-related compiler options for every function and honor those options at the LTO stage.

To compile the test, use compile.sh. To obtain the disassembled code, use disasm.sh. Look for the add_sse2 code in the disassembly.
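
For reference, the attached sources are organized roughly as follows. This is a hand-written approximation of the attachment, not its exact contents; the function names, loop bodies, and the CPU detection mechanism in main.cpp may differ.

    /* add_sse2.c -- built with -msse2 only; should contain legacy SSE2 encodings. */
    #include <emmintrin.h>

    void add_sse2(int *dst, const int *a, const int *b, int n)
    {
        int i;
        for (i = 0; i + 4 <= n; i += 4)
        {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
        }
    }

    /* add_avx2.c -- built with -mavx2; VEX-encoded instructions are expected here. */
    #include <immintrin.h>

    void add_avx2(int *dst, const int *a, const int *b, int n)
    {
        int i;
        for (i = 0; i + 8 <= n; i += 8)
        {
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
            _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(va, vb));
        }
    }

    /* main.cpp -- generic code, built without -msse2/-mavx2 overrides;
       selects the optimized routine at run time. */
    extern "C" void add_sse2(int *dst, const int *a, const int *b, int n);
    extern "C" void add_avx2(int *dst, const int *a, const int *b, int n);

    int main()
    {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        int b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        int dst[8];

        if (__builtin_cpu_supports("avx2"))
            add_avx2(dst, a, b, 8);
        else
            add_sse2(dst, a, b, 8);

        return dst[0] == 9 ? 0 : 1;
    }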
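
The attached compile.sh and disasm.sh presumably boil down to something like the following; the exact flags and output file names in the attachment may differ. The key point is that each .c file is compiled with its own -m option, yet after the LTO link the add_sse2 function shows v-prefixed (VEX-encoded) mnemonics in the disassembly.

    # compile.sh (presumed): per-file target flags, LTO at link time
    g++ -O2 -flto -c main.cpp -o main.o
    gcc -O2 -flto -msse2 -c add_sse2.c -o add_sse2.o
    gcc -O2 -flto -mavx2 -c add_avx2.c -o add_avx2.o
    g++ -O2 -flto main.o add_sse2.o add_avx2.o -o test

    # disasm.sh (presumed): dump the final code and inspect add_sse2
    objdump -d test > test.dis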