------- Comment #4 from pinskia at gcc dot gnu dot org  2008-12-31 08:10 -------
  D.45587 = VIEW_CONVERT_EXPR<__v4si>(x);
  D.45589 = __builtin_ia32_pcmpeqd128 (D.45587, D.45587);
  D.45591 = __builtin_ia32_psrldi128 (D.45589, 25);
  D.45594 = __builtin_ia32_pslldi128 (D.45591, 23);
  one = VIEW_CONVERT_EXPR<__m128>(VIEW_CONVERT_EXPR<__m128i>(D.45594));
  D.45644 = (long unsigned int) ((n >> 2) + 4294967295) + 1 * 16;
  ivtmp.516 = 0;

So the inner loop is not the issue, only the setup code.  The extra
subtract/add comes from D.45644.


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|target                      |middle-end
            Summary|[4.4 Regression] speed      |[4.4 Regression] extra code
                   |regression with sse         |for setting up loops
                   |intrinsics                  |


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38671
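
For readers following along, here is a minimal C sketch of the arithmetic behind D.45644. It rests on assumptions that the dump does not confirm: that n is a 32-bit unsigned value, that long unsigned int is 64 bits, and that the dump is meant to be read as ((long unsigned int) ((n >> 2) + 4294967295) + 1) * 16. The function names are hypothetical and are not taken from GCC or from the test case.

```c
#include <stdint.h>

/* Assumed reading of D.45644: compute (n >> 2) - 1 in 32 bits
   (adding 4294967295 wraps to a subtract of 1), widen the result
   to 64 bits, then add 1 back and scale by 16 bytes. */
uint64_t setup_bytes_as_dumped(uint32_t n)
{
    uint32_t m = n >> 2;
    uint32_t m_minus_1 = m + 4294967295u;   /* the extra subtract */
    return ((uint64_t)m_minus_1 + 1) * 16;  /* the extra add */
}

/* What the setup code would compute without the subtract/add pair. */
uint64_t setup_bytes_simplified(uint32_t n)
{
    return (uint64_t)(n >> 2) * 16;
}
```

Under these assumptions the two functions agree whenever n >> 2 is nonzero. They differ only when n >> 2 is zero, because the 32-bit subtract wraps before the value is widened. That may be why the pair is not folded away unless the compiler can prove the loop runs at least once.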