Re: Memset/memcpy patch
Hi HJ,
Last year's patch is currently almost useless, as rebasing it seems to need almost as much effort as writing it from scratch. I hoped to make a patch covering at least a subset of the cases, but unfortunately I haven't had time even for that yet. How much time do we have for it now? When does stage1 finish?
Thanks, Michael
On 26 September 2012 19:00, H.J. Lu hjl.to...@gmail.com wrote:
On Fri, Aug 31, 2012 at 1:54 AM, Jan Hubicka hubi...@ucw.cz wrote:
On Mon, Dec 12, 2011 at 6:02 AM, Jan Hubicka hubi...@ucw.cz wrote:
Any update?
I will look into it today, but anyway I think it is stage1 material, so we have some time to progress on it. Honza
Hi Honza, The old patch was reverted and the new patch was posted at http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00336.html Have you got a chance to review it?
I am in China till 5th, I will try to return to it shortly after returning. Ping me again if not - there seems to be a lot of work left on this patch...
Hi Honza, Michael, Any chance to get it into GCC 4.8? Thanks. -- H.J.
-- --- Best regards, Michael V. Zolotukhin, Software Engineer Intel Corporation.
Re: Memset/memcpy patch
On Mon, Dec 12, 2011 at 6:02 AM, Jan Hubicka hubi...@ucw.cz wrote: Any update? I will look into it today, but anyway I think it is stage1 material, so we have some time to progress on it. Honza Hi Honza, The old patch was reverted and the new patch was posted at http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00336.html Have you got a chance to review it? I am in China till 5th, I will try to return to it shortly after returning. Ping me again if not - there seems to be a lot of work left on this patch... Honza
Re: Memset/memcpy patch
On Mon, Dec 12, 2011 at 6:02 AM, Jan Hubicka hubi...@ucw.cz wrote: Any update? I will look into it today, but anyway I think it is stage1 material, so we have some time to progress on it. Honza Hi Honza, The old patch was reverted and the new patch was posted at http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00336.html Have you got a chance to review it? Thanks. -- H.J.
Re: Memset/memcpy patch
Any update? On 5 December 2011 15:14, Michael Zolotukhin michael.v.zolotuk...@gmail.com wrote: Hi Jan, I debugged the changes, and I think I've hunted down all the bugs. I slightly refactored the code - now all new SSE-related code is more localized. Also, I fixed some alignment issues. Please find the new patch in the attachment (it's made against rev 181709) - is it ok for trunk? Bootstrap and 'make check' passed on Atom and Corei7 (32,64 bits). I also checked specs2000, eembc1_1 and eembc2_0 on Atom. On 26 November 2011 09:18, Jan Hubicka hubi...@ucw.cz wrote: On Wed, Nov 23, 2011 at 3:32 PM, Michael Zolotukhin michael.v.zolotuk...@gmail.com wrote: I found and fixed another problem in the latest memcpy/memset changes - with this fix all the failing tests mentioned in #51134 started passing. Bootstraps are also ok. Though I still see failures in 32-bit make check, so it'd probably be better to revert the changes till these failures are fixed. I will revert it for now. OK. I guess I can break out the simple fixes and commit them for 4.7 and we could revisit this for next stage1. Probably not by adding all the features together, but extending prologues/epilogues first and adding SSE loops with the new alignment logic next. Honza -- H.J. -- --- Best regards, Michael V. Zolotukhin, Software Engineer Intel Corporation. -- --- Best regards, Michael V. Zolotukhin, Software Engineer Intel Corporation.
Re: Memset/memcpy patch
Any update? I will look into it today, but anyway I think it is stage1 material, so we have some time to progress on it. Honza
Re: Memset/memcpy patch
On Wed, Nov 23, 2011 at 3:32 PM, Michael Zolotukhin michael.v.zolotuk...@gmail.com wrote: I found and fixed another problem in the latest memcpy/memset changes - with this fix all the failing tests mentioned in #51134 started passing. Bootstraps are also ok. Though I still see failures in 32-bit make check, so it'd probably be better to revert the changes till these failures are fixed. I will revert it for now. OK. I guess I can break out the simple fixes and commit them for 4.7 and we could revisit this for next stage1. Probably not by adding all the features together, but extending prologues/epilogues first and adding SSE loops with the new alignment logic next. Honza -- H.J.
Re: Memset/memcpy patch
On Wed, Nov 23, 2011 at 3:32 PM, Michael Zolotukhin michael.v.zolotuk...@gmail.com wrote: I found and fixed another problem in the latest memcpy/memset changes - with this fix all the failing tests mentioned in #51134 started passing. Bootstraps are also ok. Though I still see failures in 32-bit make check, so it'd probably be better to revert the changes till these failures are fixed. I will revert it for now. -- H.J.
Re: Memset/memcpy patch
I found and fixed another problem in the latest memcpy/memset changes - with this fix all the failing tests mentioned in #51134 started passing. Bootstraps are also ok. Though I still see failures in 32-bit make check, so it'd probably be better to revert the changes till these failures are fixed.
On 21 November 2011 20:36, Michael Zolotukhin michael.v.zolotuk...@gmail.com wrote: Hi, Continuing the investigation of the bootstrap failures, I found the next problem (besides the problem with unknown alignment described above): there is a mess with size_needed and epilogue_size_needed when we generate an epilogue loop which also uses SSE moves but is not unrolled - that's probably the reason for the failures we saw. Please check the attached patch - though the full testing isn't over yet. Bootstraps seem to be ok, as well as the arrayarg.f90 test (with sse_loop enabled).
On 19 November 2011 05:38, Jan Hubicka hubi...@ucw.cz wrote: Given that x86 memset/memcpy is still broken, I think we should revert it for now. Well, looking into the code, the SSE alignment issues need work - the alignment test merely tests whether some alignment is known, not whether 16-byte alignment is known, and that is the cause of the failures in the 32-bit bootstrap. I originally convinced myself that this is safe since we shoot for unaligned loads/stores anyway. I've committed the following patch that disables SSE codegen and unbreaks the Atom bootstrap. This seems more sensible to me given that the patch accumulated some good improvements on the non-SSE path as well, and we could return to the SSE alignment issues incrementally. There is still a failure in the Fortran testcase that I am convinced is a previously latent issue. I will be offline tomorrow. If there are further serious problems, just feel free to revert the changes and we could look into them for the next stage1. Honza
* i386.c (atom_cost): Disable SSE loop until alignment issues are fixed.
Index: i386.c
===
--- i386.c (revision 181479)
+++ i386.c (working copy)
@@ -1783,18 +1783,18 @@ struct processor_costs atom_cost = {
 /* stringop_algs for memcpy. SSE loops works best on Atom, but fall back into non-SSE unrolled loop variant if that fails. */
- {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */
- {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall,
- {{libcall, {{2048, sse_loop}, {2048, unrolled_loop}, {-1, libcall}}}, /* Unknown alignment. */
- {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
+ {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */
+ {libcall, {{4096, unrolled_loop}, {-1, libcall,
+ {{libcall, {{2048, unrolled_loop}, {-1, libcall}}}, /* Unknown alignment. */
+ {libcall, {{2048, unrolled_loop},
 {-1, libcall},
 /* stringop_algs for memset. */
- {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */
- {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall,
- {{libcall, {{1024, sse_loop}, {1024, unrolled_loop}, /* Unknown alignment. */
+ {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */
+ {libcall, {{4096, unrolled_loop}, {-1, libcall,
+ {{libcall, {{1024, unrolled_loop}, /* Unknown alignment. */
 {-1, libcall}}},
- {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
+ {libcall, {{2048, unrolled_loop},
 {-1, libcall},
 1, /* scalar_stmt_cost. */
 1, /* scalar load_cost. */
-- --- Best regards, Michael V. Zolotukhin, Software Engineer Intel Corporation. -- --- Best regards, Michael V. Zolotukhin, Software Engineer Intel Corporation.
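To make the effect of the atom_cost change easier to follow, here is a minimal sketch of how a list of {max, alg} pairs can be walked to pick a strategy. It is a simplified model only, not the real decide_alg from i386.c: the names alg, size_strategy and pick_alg, and the sse_usable flag, are made up for the illustration, and the tables cover just the "known alignment" memcpy row.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the strategies named in the
   patch; the real enum in GCC has more members.  */
enum alg { ALG_LIBCALL, ALG_UNROLLED_LOOP, ALG_SSE_LOOP };

/* One {max, alg} pair: use ALG for blocks of at most MAX bytes;
   MAX == -1 means "any size".  */
struct size_strategy { long max; enum alg alg; };

/* Return the first usable strategy whose size bound covers SIZE.
   Entries that need SSE are skipped when SSE_USABLE is zero, which is
   how a second entry with the same bound can act as a fallback.  */
static enum alg
pick_alg (const struct size_strategy *table, size_t n, long size,
          int sse_usable)
{
  for (size_t i = 0; i < n; i++)
    {
      if (table[i].alg == ALG_SSE_LOOP && !sse_usable)
        continue;
      if (table[i].max == -1 || size <= table[i].max)
        return table[i].alg;
    }
  return ALG_LIBCALL;
}

/* The "known alignment" memcpy row before and after the patch.  */
static const struct size_strategy before_patch[] = {
  { 4096, ALG_SSE_LOOP }, { 4096, ALG_UNROLLED_LOOP }, { -1, ALG_LIBCALL }
};
static const struct size_strategy after_patch[] = {
  { 4096, ALG_UNROLLED_LOOP }, { -1, ALG_LIBCALL }
};
```

Under this model, a 100-byte copy picks the SSE loop from before_patch when SSE is usable, while after_patch can only ever yield the unrolled loop or a libcall, which is the point of the change.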
Re: Memset/memcpy patch
Hi, Continuing the investigation of the bootstrap failures, I found the next problem (besides the problem with unknown alignment described above): there is a mess with size_needed and epilogue_size_needed when we generate an epilogue loop which also uses SSE moves but is not unrolled - that's probably the reason for the failures we saw. Please check the attached patch - though the full testing isn't over yet. Bootstraps seem to be ok, as well as the arrayarg.f90 test (with sse_loop enabled).
On 19 November 2011 05:38, Jan Hubicka hubi...@ucw.cz wrote: Given that x86 memset/memcpy is still broken, I think we should revert it for now. Well, looking into the code, the SSE alignment issues need work - the alignment test merely tests whether some alignment is known, not whether 16-byte alignment is known, and that is the cause of the failures in the 32-bit bootstrap. I originally convinced myself that this is safe since we shoot for unaligned loads/stores anyway. I've committed the following patch that disables SSE codegen and unbreaks the Atom bootstrap. This seems more sensible to me given that the patch accumulated some good improvements on the non-SSE path as well, and we could return to the SSE alignment issues incrementally. There is still a failure in the Fortran testcase that I am convinced is a previously latent issue. I will be offline tomorrow. If there are further serious problems, just feel free to revert the changes and we could look into them for the next stage1. Honza
* i386.c (atom_cost): Disable SSE loop until alignment issues are fixed.
Index: i386.c
===
--- i386.c (revision 181479)
+++ i386.c (working copy)
@@ -1783,18 +1783,18 @@ struct processor_costs atom_cost = {
 /* stringop_algs for memcpy. SSE loops works best on Atom, but fall back into non-SSE unrolled loop variant if that fails. */
- {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */
- {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall,
- {{libcall, {{2048, sse_loop}, {2048, unrolled_loop}, {-1, libcall}}}, /* Unknown alignment. */
- {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
+ {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */
+ {libcall, {{4096, unrolled_loop}, {-1, libcall,
+ {{libcall, {{2048, unrolled_loop}, {-1, libcall}}}, /* Unknown alignment. */
+ {libcall, {{2048, unrolled_loop},
 {-1, libcall},
 /* stringop_algs for memset. */
- {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */
- {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall,
- {{libcall, {{1024, sse_loop}, {1024, unrolled_loop}, /* Unknown alignment. */
+ {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */
+ {libcall, {{4096, unrolled_loop}, {-1, libcall,
+ {{libcall, {{1024, unrolled_loop}, /* Unknown alignment. */
 {-1, libcall}}},
- {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
+ {libcall, {{2048, unrolled_loop},
 {-1, libcall},
 1, /* scalar_stmt_cost. */
 1, /* scalar load_cost. */
-- --- Best regards, Michael V. Zolotukhin, Software Engineer Intel Corporation. memfunc_epilogue_loops.patch Description: Binary data
Re: Memset/memcpy patch
I found another bug in the current implementation. A patch for it doesn't cure the i686-linux bootstrap, but fixes failures in some tests (see attached). The problem was that we tried to add runtime tests for alignment even if both SRC and DST had unknown alignment - in this case it could be impossible to make them both aligned simultaneously, so I think it's easier not to even try to use aligned SSE moves at all. Generation of prologues with runtime tests could be used only if at least one alignment is known - otherwise it's incorrect. Probably, generation of such prologues could be removed from MEMMOV altogether for now. Though, even with this fix the i686 bootstrap still fails.
Configure line for reproducing the bootstrap failure: CC="gcc -m32" CXX="g++ -m32" ../configure --with-arch=core2 --with-cpu=atom --prefix=`pwd` i686-linux --with-fpmath=sse --enable-languages=c,c++,fortran
On 18 November 2011 06:23, Jan Hubicka hubi...@ucw.cz wrote: The current x86 memset/memcpy expansion is broken. It miscompiles many programs, including GCC itself. Should it be reverted for now? There was a problem in the new code doing loopy epilogues. I am currently testing the following patch that should fix the problem. We could either revert now and I will apply a combined patch, or I hope to fix that tonight.
To expand a little bit: I was looking into the code for most of the day today, and the patch combines several fixes.
1) The new loopy epilogue code was quite broken. It did not work for memset at all because the promoted value was not always initialized; I fixed that in the version of the patch that is in mainline now. It however also misses a bounds check in some cases. This is fixed by the expand_set_or_movmem_via_loop_with_iter change.
2) I misupdated the atom description so 32-bit memset was not expanded inline; this is fixed by the memset changes.
3) decide_alg was broken in two ways - first, it gives complex algorithms for -O0, and second, it chose the wrong variant when sse_loop is used.
4) The epilogue loop was output even in cases where it is not needed - e.g. when the unrolled loops handle 16 bytes at once and the block size is 39. This is the ix86_movmem and ix86_setmem change.
5) The implementations of ix86_movmem/ix86_setmem diverged for no reason, so I got them back in sync. For some reason SSE code in movmem is not output for 64-bit unaligned memcpy; that is fixed too.
6) It seems that both bdver and core are good enough at handling misaligned blocks that the alignment prologues can be omitted. This greatly improves and reduces the size of the inline sequence. I will however break this out into an independent patch.
Life would be easier if the changes were made in multiple incremental steps; stringops expansion is relatively tricky business and relatively easy to get wrong in some cases, since there are so many of them depending on knowledge of size/alignment and target architecture.
Hi, this is the patch I committed after bootstrapping/regtesting x86_64-linux and --with-arch=core2 --with-cpu=atom. The gfortran.fortran-torture/execute/arrayarg.f90 failure stays. As I've explained in the PR log, I believe it is a previously latent problem elsewhere that is now triggered by inline memset expansion that is later unrolled. I would welcome help from someone who understands the testcase on whether it is aliasing-safe or not. Honza
PR bootstrap/51134
* i386.c (atom_cost): Fix 32bit memset description.
(expand_set_or_movmem_via_loop_with_iter): Output proper bounds check for epilogue loops.
(expand_movmem_epilogue): Handle epilogues up to size 15 w/o producing byte loop.
(decide_alg): sse_loop is not usable when SSE2 is disabled; when not optimizing always use rep movsb or libcall; do not produce word sized loops when optimizing memset for size (to avoid need for large constants).
(ix86_expand_movmem): Get into sync with ix86_expand_setmem; choose unroll factors better; always do 128bit moves when producing SSE loops; do not produce loopy epilogue when size is too small.
(promote_duplicated_reg_to_size): Do not look into desired alignments when doing vector expansion. (ix86_expand_setmem): Track better when promoted value is available; choose unroll factors more sanely.; output loopy epilogue only when needed. Index: config/i386/i386.c === *** config/i386/i386.c (revision 181407) --- config/i386/i386.c (working copy) *** struct processor_costs atom_cost = { *** 1785,1791 if that fails. */ {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */ {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall, ! {{libcall, {{-1, libcall}}},
Re: Memset/memcpy patch
I found another bug in the current implementation. A patch for it doesn't cure the i686-linux bootstrap, but fixes failures in some tests (see attached). The problem was that we tried to add runtime tests for alignment even if both SRC and DST had unknown alignment - in this case it could be impossible to make them both aligned simultaneously, so I think it's easier not to even try to use aligned SSE moves at all. Generation of prologues with runtime tests could be used only if at least one alignment is known - otherwise it's incorrect. Probably, generation of such prologues could be removed from MEMMOV altogether for now.
The prologues always align the destination, as it helps more than aligning the source on most chips. I do not see a problem with that. But for SSE, either we should arrange for unaligned load opcodes (that is what I see in the generated code, but I guess it depends on the -march setting) or simply disqualify the sse_loop algorithm in decide_alg when the alignment is not known.
Though, even with this fix the i686 bootstrap still fails. Configure line for reproducing the bootstrap failure: CC="gcc -m32" CXX="g++ -m32" ../configure --with-arch=core2 --with-cpu=atom --prefix=`pwd` i686-linux --with-fpmath=sse --enable-languages=c,c++,fortran
The default i686-linux bootstrap was working for me. I will try your settings, but my time this evening and at the weekend is limited. Honza
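The point about not being able to align SRC and DST at the same time can be checked with a little pointer arithmetic. The sketch below is only an illustration, not code from the patch; prologue_bytes and can_align_both are hypothetical helper names. A prologue advances both pointers by the same number of bytes, so both can end up 16-byte aligned only if they start at the same offset within a 16-byte block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes a prologue must handle before P reaches an ALIGN-byte
   boundary.  ALIGN must be a power of two.  */
static size_t
prologue_bytes (const void *p, size_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return (size_t) (-(uintptr_t) p & (align - 1));
}

/* Nonzero if advancing SRC and DST by the same byte count can make
   both of them ALIGN-aligned at the same time.  */
static int
can_align_both (const void *src, const void *dst, size_t align)
{
  return prologue_bytes (src, align) == prologue_bytes (dst, align);
}
```

When can_align_both is false, a prologue that aligns the destination leaves the source misaligned, so the loads have to be unaligned ones, which matches the two options mentioned in the reply above.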
Re: Memset/memcpy patch
Given that x86 memset/memcpy is still broken, I think we should revert it for now.
Well, looking into the code, the SSE alignment issues need work - the alignment test merely tests whether some alignment is known, not whether 16-byte alignment is known, and that is the cause of the failures in the 32-bit bootstrap. I originally convinced myself that this is safe since we shoot for unaligned loads/stores anyway. I've committed the following patch that disables SSE codegen and unbreaks the Atom bootstrap. This seems more sensible to me given that the patch accumulated some good improvements on the non-SSE path as well, and we could return to the SSE alignment issues incrementally. There is still a failure in the Fortran testcase that I am convinced is a previously latent issue. I will be offline tomorrow. If there are further serious problems, just feel free to revert the changes and we could look into them for the next stage1. Honza
* i386.c (atom_cost): Disable SSE loop until alignment issues are fixed.
Index: i386.c
===
--- i386.c (revision 181479)
+++ i386.c (working copy)
@@ -1783,18 +1783,18 @@ struct processor_costs atom_cost = {
 /* stringop_algs for memcpy. SSE loops works best on Atom, but fall back into non-SSE unrolled loop variant if that fails. */
- {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */
-{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall,
- {{libcall, {{2048, sse_loop}, {2048, unrolled_loop}, {-1, libcall}}}, /* Unknown alignment. */
-{libcall, {{2048, sse_loop}, {2048, unrolled_loop},
+ {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */
+{libcall, {{4096, unrolled_loop}, {-1, libcall,
+ {{libcall, {{2048, unrolled_loop}, {-1, libcall}}}, /* Unknown alignment. */
+{libcall, {{2048, unrolled_loop},
 {-1, libcall},
 /* stringop_algs for memset. */
- {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */
-{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall,
- {{libcall, {{1024, sse_loop}, {1024, unrolled_loop}, /* Unknown alignment. */
+ {{{libcall, {{4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */
+{libcall, {{4096, unrolled_loop}, {-1, libcall,
+ {{libcall, {{1024, unrolled_loop}, /* Unknown alignment. */
 {-1, libcall}}},
-{libcall, {{2048, sse_loop}, {2048, unrolled_loop},
+{libcall, {{2048, unrolled_loop},
 {-1, libcall},
 1, /* scalar_stmt_cost. */
 1, /* scalar load_cost. */
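The difference between "some alignment is known" and "16-byte alignment is known" can be shown in a few lines. This is an illustrative sketch only; move_kind, some_alignment_known and choose_sse_move are invented names and do not appear in i386.c. The assumption is that known_align holds the alignment in bytes proven at compile time, with 1 meaning nothing beyond byte alignment is known.

```c
#include <stddef.h>

/* Which flavour of 128-bit move is safe to emit.  */
enum move_kind { MOVE_UNALIGNED, MOVE_ALIGNED };

/* The weaker question: is anything at all known about the alignment?
   A 4-byte aligned pointer answers "yes" here.  */
static int
some_alignment_known (size_t known_align)
{
  return known_align > 1;
}

/* The question that matters for aligned SSE moves: is the pointer
   known to be at least 16-byte aligned?  */
static enum move_kind
choose_sse_move (size_t known_align)
{
  return known_align >= 16 ? MOVE_ALIGNED : MOVE_UNALIGNED;
}
```

A pointer known to be only 4-byte aligned passes the first test but still needs unaligned moves, which is the kind of gap the message above describes.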
Re: Memset/memcpy patch
On Mon, Nov 14, 2011 at 3:48 PM, H.J. Lu hjl.to...@gmail.com wrote: On Mon, Nov 14, 2011 at 9:03 AM, Jan Hubicka hubi...@ucw.cz wrote: Hi, this is hopefully the final variant of the patch. The epilogue code was broken in some scenarios for memset, but should work safely now. I also fixed the tables for core/bulldozer/amdfam10 chips. But before it can be committed, we need to resolve copyright assignment issues. You don't seem to be listed as having a copyright assignment; does your company have one? Otherwise, please try to get one soon. Honza
2011-11-14 Zolotukhin Michael michael.v.zolotuk...@gmail.com Jan Hubicka j...@suse.cz
* gcc.target/i386/sw-1.c: Force rep;movsb.
* config/i386/i386.h (processor_costs): Add second dimension to stringop_algs array.
* config/i386/i386.c (cost models): Initialize second dimension of stringop_algs arrays.
(core_cost): New costs based on generic64 costs with updated stringop values.
(promote_duplicated_reg): Add support for vector modes, add declaration.
(promote_duplicated_reg_to_size): Likewise.
(processor_target): Set core costs for core variants.
(expand_set_or_movmem_via_loop_with_iter): New function.
(expand_set_or_movmem_via_loop): Enable reuse of the same iters in different loops, produced by this function.
(emit_strset): New function.
(expand_movmem_epilogue): Add epilogue generation for bigger sizes, use SSE-moves where possible.
(expand_setmem_epilogue): Likewise.
(expand_movmem_prologue): Likewise for prologue.
(expand_setmem_prologue): Likewise.
(expand_constant_movmem_prologue): Likewise.
(expand_constant_setmem_prologue): Likewise.
(decide_alg): Add new argument align_unknown. Fix algorithm of strategy selection if TARGET_INLINE_ALL_STRINGOPS is set; Skip sse_loop
(decide_alignment): Update desired alignment according to chosen move mode.
(ix86_expand_movmem): Change unrolled_loop strategy to use SSE-moves.
(ix86_expand_setmem): Likewise.
(ix86_slow_unaligned_access): Implementation of new hook slow_unaligned_access.
* config/i386/i386.md (strset): Enable half-SSE moves.
* config/i386/sse.md (vec_dupv4si): Add expand for vec_dupv4si.
(vec_dupv2di): Add expand for vec_dupv2di.
This may have caused: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51134
The current x86 memset/memcpy expansion is broken. It miscompiles many programs, including GCC itself. Should it be reverted for now? -- H.J.
Re: Memset/memcpy patch
The current x86 memset/memcpy expansion is broken. It miscompiles many programs, including GCC itself. Should it be reverted for now? There was problem in the new code doing loopy epilogues. I am currently testing the following patch that shold fix the problem. We could either revert now and I will apply combined patch or I hope to fix that tonight. Honza Index: config/i386/i386.h === --- config/i386/i386.h (revision 181442) +++ config/i386/i386.h (working copy) @@ -276,6 +276,7 @@ enum ix86_tune_indices { X86_TUNE_PROMOTE_QIMODE, X86_TUNE_FAST_PREFIX, X86_TUNE_SINGLE_STRINGOP, + X86_TUNE_ALIGN_STRINGOP, X86_TUNE_QIMODE_MATH, X86_TUNE_HIMODE_MATH, X86_TUNE_PROMOTE_QI_REGS, Index: config/i386/i386.md === --- config/i386/i386.md (revision 181442) +++ config/i386/i386.md (working copy) @@ -15944,6 +15944,17 @@ (clobber (reg:CC FLAGS_REG))])] { + rtx vec_reg; + enum machine_mode mode = GET_MODE (operands[2]); + if (vector_extensions_used_for_mode (mode) + CONSTANT_P (operands[2])) +{ + if (mode == DImode) + mode = TARGET_64BIT ? V2DImode : V4SImode; + vec_reg = gen_reg_rtx (mode); + emit_move_insn (vec_reg, operands[2]); + operands[2] = vec_reg; +} if (GET_MODE (operands[1]) != GET_MODE (operands[2])) operands[1] = adjust_address_nv (operands[1], GET_MODE (operands[2]), 0); Index: config/i386/i386.c === --- config/i386/i386.c (revision 181442) +++ config/i386/i386.c (working copy) @@ -1785,7 +1785,7 @@ struct processor_costs atom_cost = { if that fails. */ {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */ {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall, - {{libcall, {{-1, libcall}}}, /* Unknown alignment. */ + {{libcall, {{2048, sse_loop}, {2048, unrolled_loop}, {-1, libcall}}}, /* Unknown alignment. 
*/ {libcall, {{2048, sse_loop}, {2048, unrolled_loop}, {-1, libcall}, @@ -2178,6 +2178,9 @@ static unsigned int initial_ix86_tune_fe /* X86_TUNE_SINGLE_STRINGOP */ m_386 | m_P4_NOCONA, + /* X86_TUNE_ALIGN_STRINGOP */ + ~(m_BDVER | m_CORE2I7), + /* X86_TUNE_QIMODE_MATH */ ~0, @@ -3724,6 +3727,14 @@ ix86_option_override_internal (bool main target_flags |= MASK_NO_RED_ZONE; } + if (!(target_flags_explicit MASK_NO_ALIGN_STRINGOPS)) +{ + if (ix86_tune_features[X86_TUNE_ALIGN_STRINGOP] ix86_tune_mask) +target_flags = ~MASK_NO_ALIGN_STRINGOPS; + else +target_flags |= MASK_NO_ALIGN_STRINGOPS; +} + /* Keep nonleaf frame pointers. */ if (flag_omit_frame_pointer) target_flags = ~MASK_OMIT_LEAF_FRAME_POINTER; @@ -21149,20 +21160,25 @@ expand_set_or_movmem_via_loop_with_iter top_label = gen_label_rtx (); out_label = gen_label_rtx (); - if (!reuse_iter) -iter = gen_reg_rtx (iter_mode); - size = expand_simple_binop (iter_mode, AND, count, piece_size_mask, - NULL, 1, OPTAB_DIRECT); - /* Those two should combine. */ - if (piece_size == const1_rtx) + NULL, 1, OPTAB_DIRECT); + if (!reuse_iter) { - emit_cmp_and_jump_insns (size, const0_rtx, EQ, NULL_RTX, iter_mode, + iter = gen_reg_rtx (iter_mode); + /* Those two should combine. */ + if (piece_size == const1_rtx) + { + emit_cmp_and_jump_insns (size, const0_rtx, EQ, NULL_RTX, iter_mode, + true, out_label); + predict_jump (REG_BR_PROB_BASE * 10 / 100); + } + emit_move_insn (iter, const0_rtx); +} + else +{ + emit_cmp_and_jump_insns (iter, size, GE, NULL_RTX, iter_mode, true, out_label); - predict_jump (REG_BR_PROB_BASE * 10 / 100); } - if (!reuse_iter) -emit_move_insn (iter, const0_rtx); emit_label (top_label); @@ -21588,17 +21604,28 @@ expand_setmem_epilogue (rtx destmem, rtx Remaining part we'll move using Pmode and narrower modes. 
*/ if (promoted_to_vector_value) - while (remainder_size = 16) - { - if (GET_MODE (destmem) != move_mode) - destmem = adjust_automodify_address_nv (destmem, move_mode, - destptr, offset); - emit_strset (destmem, promoted_to_vector_value, destptr, -move_mode, offset); + { + if (promoted_to_vector_value) + { + if (max_size = GET_MODE_SIZE (V4SImode)) + move_mode = V4SImode; + else if (max_size = GET_MODE_SIZE (DImode)) + move_mode = DImode; + } + while (remainder_size =
Re: Memset/memcpy patch
The current x86 memset/memcpy expansion is broken. It miscompiles many programs, including GCC itself. Should it be reverted for now?
There was a problem in the new code doing loopy epilogues. I am currently testing the following patch that should fix the problem. We could either revert now and I will apply a combined patch, or I hope to fix that tonight.
To expand a little bit: I was looking into the code for most of the day today, and the patch combines several fixes.
1) The new loopy epilogue code was quite broken. It did not work for memset at all because the promoted value was not always initialized; I fixed that in the version of the patch that is in mainline now. It however also misses a bounds check in some cases. This is fixed by the expand_set_or_movmem_via_loop_with_iter change.
2) I misupdated the atom description so 32-bit memset was not expanded inline; this is fixed by the memset changes.
3) decide_alg was broken in two ways - first, it gives complex algorithms for -O0, and second, it chose the wrong variant when sse_loop is used.
4) The epilogue loop was output even in cases where it is not needed - e.g. when the unrolled loops handle 16 bytes at once and the block size is 39. This is the ix86_movmem and ix86_setmem change.
5) The implementations of ix86_movmem/ix86_setmem diverged for no reason, so I got them back in sync. For some reason SSE code in movmem is not output for 64-bit unaligned memcpy; that is fixed too.
6) It seems that both bdver and core are good enough at handling misaligned blocks that the alignment prologues can be omitted. This greatly improves and reduces the size of the inline sequence. I will however break this out into an independent patch.
Life would be easier if the changes were made in multiple incremental steps; stringops expansion is relatively tricky business and relatively easy to get wrong in some cases, since there are so many of them depending on knowledge of size/alignment and target architecture.
Honza
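For item 4, the arithmetic behind the "16 bytes at once, block size 39" case can be sketched as follows. This is a hedged illustration rather than the expander's actual logic; split_block and tail_moves are hypothetical names, and the 8/4/2/1 move sizes are an assumption about how a short tail might be covered with straight-line code.

```c
#include <stddef.h>

/* How a block is divided between a main loop that moves CHUNK bytes
   per iteration and the leftover tail.  */
struct split { size_t main_bytes; size_t tail_bytes; };

/* CHUNK must be a power of two.  */
static struct split
split_block (size_t count, size_t chunk)
{
  struct split s;
  s.main_bytes = count & ~(chunk - 1);
  s.tail_bytes = count & (chunk - 1);
  return s;
}

/* Number of straight-line moves of 8, 4, 2 and 1 bytes needed to
   cover a tail shorter than 16 bytes, one move per set bit.  */
static unsigned
tail_moves (size_t tail)
{
  unsigned n = 0;
  for (size_t piece = 8; piece != 0; piece >>= 1)
    if (tail & piece)
      n++;
  return n;
}
```

With a constant size of 39 and 16-byte chunks, the main loop covers 32 bytes and 7 remain, which three moves (4 + 2 + 1) can handle without any epilogue loop.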
Re: Memset/memcpy patch
The current x86 memset/memcpy expansion is broken. It miscompiles many programs, including GCC itself. Should it be reverted for now?
There was a problem in the new code doing loopy epilogues. I am currently testing the following patch that should fix the problem. We could either revert now and I will apply a combined patch, or I hope to fix that tonight.
To expand a little bit: I was looking into the code for most of the day today, and the patch combines several fixes.
1) The new loopy epilogue code was quite broken. It did not work for memset at all because the promoted value was not always initialized; I fixed that in the version of the patch that is in mainline now. It however also misses a bounds check in some cases. This is fixed by the expand_set_or_movmem_via_loop_with_iter change.
2) I misupdated the atom description so 32-bit memset was not expanded inline; this is fixed by the memset changes.
3) decide_alg was broken in two ways - first, it gives complex algorithms for -O0, and second, it chose the wrong variant when sse_loop is used.
4) The epilogue loop was output even in cases where it is not needed - e.g. when the unrolled loops handle 16 bytes at once and the block size is 39. This is the ix86_movmem and ix86_setmem change.
5) The implementations of ix86_movmem/ix86_setmem diverged for no reason, so I got them back in sync. For some reason SSE code in movmem is not output for 64-bit unaligned memcpy; that is fixed too.
6) It seems that both bdver and core are good enough at handling misaligned blocks that the alignment prologues can be omitted. This greatly improves and reduces the size of the inline sequence. I will however break this out into an independent patch.
Life would be easier if the changes were made in multiple incremental steps; stringops expansion is relatively tricky business and relatively easy to get wrong in some cases, since there are so many of them depending on knowledge of size/alignment and target architecture.
Hi,
this is the patch I committed after bootstrapping/regtesting x86_64-linux and --with-arch=core2 --with-cpu=atom. The gfortran.fortran-torture/execute/arrayarg.f90 failure stays. As I've explained in the PR log, I believe it is a previously latent problem elsewhere that is now triggered by the inline memset expansion that is later unrolled. I would welcome help from someone who understands the testcase on whether it is aliasing safe or not.

Honza

PR bootstrap/51134
* i386.c (atom_cost): Fix 32bit memset description.
(expand_set_or_movmem_via_loop_with_iter): Output proper bounds check for epilogue loops.
(expand_movmem_epilogue): Handle epilogues up to size 15 w/o producing byte loop.
(decide_alg): sse_loop is not usable when SSE2 is disabled; when not optimizing always use rep movsb or libcall; do not produce word sized loops when optimizing memset for size (to avoid need for large constants).
(ix86_expand_movmem): Get into sync with ix86_expand_setmem; choose unroll factors better; always do 128bit moves when producing SSE loops; do not produce loopy epilogue when size is too small.
(promote_duplicated_reg_to_size): Do not look into desired alignments when doing vector expansion.
(ix86_expand_setmem): Track better when promoted value is available; choose unroll factors more sanely; output loopy epilogue only when needed.

Index: config/i386/i386.c
===
*** config/i386/i386.c (revision 181407)
--- config/i386/i386.c (working copy)
*** struct processor_costs atom_cost = {
*** 1785,1791
if that fails. */
{{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */
{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall,
!{{libcall, {{-1, libcall}}}, /* Unknown alignment. */
{libcall, {{2048, sse_loop}, {2048, unrolled_loop}, {-1, libcall},
--- 1785,1791
if that fails. */
{{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. */
{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall,
!{{libcall, {{2048, sse_loop}, {2048, unrolled_loop}, {-1, libcall}}}, /* Unknown alignment. */
{libcall, {{2048, sse_loop}, {2048, unrolled_loop}, {-1, libcall},
*** expand_set_or_movmem_via_loop_with_iter
*** 21149,21168
top_label = gen_label_rtx ();
out_label = gen_label_rtx ();
- if (!reuse_iter)
- iter = gen_reg_rtx (iter_mode);
- size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
! NULL, 1, OPTAB_DIRECT);
! /* Those two should combine. */
! if (piece_size ==
Re: Memset/memcpy patch
> Looks like we have a bootstrap issue, thus sorry if my message may appear to be stupid nitpicking: why Zolotukhin Michael instead of Michael Zolotukhin in the ChangeLog? Is Michael the family name?

Michael is the first name, Zolotukhin is the last name. I probably swapped them accidentally in the changelog.

Michael
Re: Memset/memcpy patch
On 11/15/2011 04:12 PM, Michael Zolotukhin wrote:
> > Looks like we have a bootstrap issue, thus sorry if my message may appear to be stupid nitpicking: why Zolotukhin Michael instead of Michael Zolotukhin in the ChangeLog? Is Michael the family name?
>
> Michael is the first name, Zolotukhin is the last name. I probably swapped them accidentally in the changelog.

Ah, ok, thanks. Many years ago I learned this funny (from my parochial Italian point of view, sorry) story:

http://en.wikipedia.org/wiki/Bui_Tuong_Phong

and I'm still quite sensitive to the issue.

Paolo.
Re: Memset/memcpy patch
On Mon, Nov 14, 2011 at 9:03 AM, Jan Hubicka hubi...@ucw.cz wrote:
> Hi,
> this is hopefully the final variant of the patch. The epilogue code was broken in some scenarios for memset, but should work safely now. I also fixed the tables for core/bulldozer/amdfam10 chips.
>
> But before it can be committed, we need to resolve copyright assignment issues. You don't seem to be listed as having a copyright assignment; does your company have one? Otherwise, please try to get one soon.
>
> Honza
>
> 2011-11-14 Zolotukhin Michael michael.v.zolotuk...@gmail.com
>            Jan Hubicka j...@suse.cz

Zolotukhin Michael works for Intel and has copyright assignment with FSF.

--
H.J.
Re: Memset/memcpy patch
> On Mon, Nov 14, 2011 at 9:03 AM, Jan Hubicka hubi...@ucw.cz wrote:
> > Hi,
> > this is hopefully the final variant of the patch. The epilogue code was broken in some scenarios for memset, but should work safely now. I also fixed the tables for core/bulldozer/amdfam10 chips.
> >
> > But before it can be committed, we need to resolve copyright assignment issues. You don't seem to be listed as having a copyright assignment; does your company have one? Otherwise, please try to get one soon.
> >
> > Honza
> >
> > 2011-11-14 Zolotukhin Michael michael.v.zolotuk...@gmail.com
> >            Jan Hubicka j...@suse.cz
>
> Zolotukhin Michael works for Intel and has copyright assignment with FSF.

Thank you. I went ahead and committed the patch then.

Honza
Re: Memset/memcpy patch
2011/11/14 Jan Hubicka hubi...@ucw.cz:
> > On Mon, Nov 14, 2011 at 9:03 AM, Jan Hubicka hubi...@ucw.cz wrote:
> > > Hi,
> > > this is hopefully the final variant of the patch. The epilogue code was broken in some scenarios for memset, but should work safely now. I also fixed the tables for core/bulldozer/amdfam10 chips.
> > >
> > > But before it can be committed, we need to resolve copyright assignment issues. You don't seem to be listed as having a copyright assignment; does your company have one? Otherwise, please try to get one soon.
> > >
> > > Honza
> > >
> > > 2011-11-14 Zolotukhin Michael michael.v.zolotuk...@gmail.com
> > >            Jan Hubicka j...@suse.cz
> >
> > Zolotukhin Michael works for Intel and has copyright assignment with FSF.
>
> Thank you. I went ahead and committed the patch then.

GCC failed to bootstrap:

../../src-trunk/libiberty/sort.c:100:14: internal compiler error: in decide_alg, at config/i386/i386.c:22094
Please submit a full bug report,
with preprocessed source if appropriate.
See http://gcc.gnu.org/bugs.html for instructions.
make[6]: *** [sort.o] Error 1

--
H.J.
Re: Memset/memcpy patch
On 14 Nov 2011, at 20:36, H.J. Lu wrote:
> 2011/11/14 Jan Hubicka hubi...@ucw.cz:
> > > On Mon, Nov 14, 2011 at 9:03 AM, Jan Hubicka hubi...@ucw.cz wrote:
> > > > Hi,
> > > > this is hopefully the final variant of the patch. The epilogue code was broken in some scenarios for memset, but should work safely now. I also fixed the tables for core/bulldozer/amdfam10 chips.
> > > >
> > > > But before it can be committed, we need to resolve copyright assignment issues. You don't seem to be listed as having a copyright assignment; does your company have one? Otherwise, please try to get one soon.
> > > >
> > > > Honza
> > > >
> > > > 2011-11-14 Zolotukhin Michael michael.v.zolotuk...@gmail.com
> > > >            Jan Hubicka j...@suse.cz
> > >
> > > Zolotukhin Michael works for Intel and has copyright assignment with FSF.
> >
> > Thank you. I went ahead and committed the patch then.
>
> GCC failed to bootstrap:
>
> ../../src-trunk/libiberty/sort.c:100:14: internal compiler error: in decide_alg, at config/i386/i386.c:22094
> Please submit a full bug report,
> with preprocessed source if appropriate.
> See http://gcc.gnu.org/bugs.html for instructions.
> make[6]: *** [sort.o] Error 1

Assuming that the target is a core processor: I'm testing a patch from Honza for this - which he has asked to be checked in if it works out OK.

just a pasto...

Index: i386.c
===
--- i386.c (revision 181360)
+++ i386.c (working copy)
@@ -1877,10 +1877,10 @@ struct processor_costs core_cost = {
 {libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_8_byte}, {-1, libcall},
 /* stringop_algs for memset. */
- {{{libcall, {{256, rep_prefix_4_byte}}}, /* Known alignment. */
-{libcall, {{256, rep_prefix_8_byte,
- {{libcall, {{256, rep_prefix_4_byte}}}, /* Unknown alignment. */
-{libcall, {{256, rep_prefix_8_byte},
+ {{{libcall, {{256, rep_prefix_4_byte}, {-1 libcall}}}, /* Known alignment. */
+{libcall, {{256, rep_prefix_8_byte}, {-1 libcall,
+ {{libcall, {{256, rep_prefix_4_byte}, {-1 libcall}}}, /* Unknown alignment. */
+{libcall, {{256, rep_prefix_8_byte}, {-1 libcall},
 1,/* scalar_stmt_cost. */
 1,/* scalar load_cost. */
 1,/* scalar_store_cost. */
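The pasto is easier to see with a simplified model of the cost tables. The sketch below assumes each table is a fixed-size array of {max, alg} pairs that is scanned in order, with max == -1 meaning "any size". The type and function names (strategy, pick_alg, ALG_*, MAX_ENTRIES) are hypothetical, and this is not the actual decide_alg code. Without the {-1, libcall} terminator, a size above 256 matches no entry, which is presumably the situation behind the internal compiler error reported above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical algorithm identifiers; ALG_NONE is 0 so that
   unused, zero-initialized table slots read as "no algorithm".  */
enum alg { ALG_NONE = 0, ALG_LIBCALL, ALG_REP_PREFIX_4_BYTE };

/* One table entry: use `alg` for sizes up to `max`; -1 means no limit.  */
struct strategy { long max; enum alg alg; };

#define MAX_ENTRIES 4

/* Table as it looked before the fix: no terminating {-1, ...} entry.  */
const struct strategy broken_table[MAX_ENTRIES] = {
    {256, ALG_REP_PREFIX_4_BYTE}
};

/* Table after the fix: the last entry catches every remaining size.  */
const struct strategy fixed_table[MAX_ENTRIES] = {
    {256, ALG_REP_PREFIX_4_BYTE}, {-1, ALG_LIBCALL}
};

/* Return the first algorithm whose size limit covers `size`,
   or ALG_NONE if the scan falls off the end of the table.  */
static enum alg pick_alg(const struct strategy *table, long size)
{
    assert(table != NULL);
    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        if (table[i].max == -1 || size <= table[i].max)
            return table[i].alg;
    }
    return ALG_NONE;
}
```

In this model pick_alg(fixed_table, 1000) selects the library call, while pick_alg(broken_table, 1000) finds nothing; a compiler that treats that case as unreachable would abort there.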
Re: Memset/memcpy patch
On Mon, Nov 14, 2011 at 12:40 PM, Iain Sandoe develo...@sandoe-acoustics.co.uk wrote:
> On 14 Nov 2011, at 20:36, H.J. Lu wrote:
> > 2011/11/14 Jan Hubicka hubi...@ucw.cz:
> > > > On Mon, Nov 14, 2011 at 9:03 AM, Jan Hubicka hubi...@ucw.cz wrote:
> > > > > Hi,
> > > > > this is hopefully the final variant of the patch. The epilogue code was broken in some scenarios for memset, but should work safely now. I also fixed the tables for core/bulldozer/amdfam10 chips.
> > > > >
> > > > > But before it can be committed, we need to resolve copyright assignment issues. You don't seem to be listed as having a copyright assignment; does your company have one? Otherwise, please try to get one soon.
> > > > >
> > > > > Honza
> > > > >
> > > > > 2011-11-14 Zolotukhin Michael michael.v.zolotuk...@gmail.com
> > > > >            Jan Hubicka j...@suse.cz
> > > >
> > > > Zolotukhin Michael works for Intel and has copyright assignment with FSF.
> > >
> > > Thank you. I went ahead and committed the patch then.
> >
> > GCC failed to bootstrap:
> >
> > ../../src-trunk/libiberty/sort.c:100:14: internal compiler error: in decide_alg, at config/i386/i386.c:22094
> > Please submit a full bug report,
> > with preprocessed source if appropriate.
> > See http://gcc.gnu.org/bugs.html for instructions.
> > make[6]: *** [sort.o] Error 1
>
> Assuming that the target is a core processor: I'm testing a patch from Honza for this - which he has asked to be checked in if it works out OK.
>
> just a pasto...
>
> Index: i386.c
> ===
> --- i386.c (revision 181360)
> +++ i386.c (working copy)
> @@ -1877,10 +1877,10 @@ struct processor_costs core_cost = {
>  {libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_8_byte}, {-1, libcall},
>  /* stringop_algs for memset. */
> - {{{libcall, {{256, rep_prefix_4_byte}}}, /* Known alignment. */
> - {libcall, {{256, rep_prefix_8_byte,
> - {{libcall, {{256, rep_prefix_4_byte}}}, /* Unknown alignment. */
> - {libcall, {{256, rep_prefix_8_byte},
> + {{{libcall, {{256, rep_prefix_4_byte}, {-1 libcall}}}, /* Known alignment. */
> + {libcall, {{256, rep_prefix_8_byte}, {-1 libcall,
> + {{libcall, {{256, rep_prefix_4_byte}, {-1 libcall}}}, /* Unknown alignment. */
> + {libcall, {{256, rep_prefix_8_byte}, {-1 libcall},
>  1, /* scalar_stmt_cost. */
>  1, /* scalar load_cost. */
>  1, /* scalar_store_cost. */

It looks reasonable.

--
H.J.
Re: Memset/memcpy patch
Hi,

> > 2011-11-14 Zolotukhin Michael michael.v.zolotuk...@gmail.com
> >            Jan Hubicka j...@suse.cz
>
> Zolotukhin Michael works for Intel and has copyright assignment with FSF.

Looks like we have a bootstrap issue, thus sorry if my message may appear to be stupid nitpicking: why Zolotukhin Michael instead of Michael Zolotukhin in the ChangeLog? Is Michael the family name?

Thanks,
Paolo
Re: Memset/memcpy patch
On 14 Nov 2011, at 20:44, H.J. Lu wrote:
> On Mon, Nov 14, 2011 at 12:40 PM, Iain Sandoe develo...@sandoe-acoustics.co.uk wrote:
> > On 14 Nov 2011, at 20:36, H.J. Lu wrote:
> > > 2011/11/14 Jan Hubicka hubi...@ucw.cz:
> > > > > On Mon, Nov 14, 2011 at 9:03 AM, Jan Hubicka hubi...@ucw.cz wrote:
> > > > > > Hi,
> > > > > > this is hopefully the final variant of the patch. The epilogue code was broken in some scenarios for memset, but should work safely now. I also fixed the tables for core/bulldozer/amdfam10 chips.
> > > > > >
> > > > > > But before it can be committed, we need to resolve copyright assignment issues. You don't seem to be listed as having a copyright assignment; does your company have one? Otherwise, please try to get one soon.
> > > > > >
> > > > > > Honza
> > > > > >
> > > > > > 2011-11-14 Zolotukhin Michael michael.v.zolotuk...@gmail.com
> > > > > >            Jan Hubicka j...@suse.cz
> > > > >
> > > > > Zolotukhin Michael works for Intel and has copyright assignment with FSF.
> > > >
> > > > Thank you. I went ahead and committed the patch then.
> > >
> > > GCC failed to bootstrap:
> > >
> > > ../../src-trunk/libiberty/sort.c:100:14: internal compiler error: in decide_alg, at config/i386/i386.c:22094
> > > Please submit a full bug report,
> > > with preprocessed source if appropriate.
> > > See http://gcc.gnu.org/bugs.html for instructions.
> > > make[6]: *** [sort.o] Error 1
> >
> > Assuming that the target is a core processor: I'm testing a patch from Honza for this - which he has asked to be checked in if it works out OK.
> >
> > just a pasto...
> >
> > Index: i386.c
> > ===
> > --- i386.c (revision 181360)
> > +++ i386.c (working copy)
> > @@ -1877,10 +1877,10 @@ struct processor_costs core_cost = {
> >  {libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_8_byte}, {-1, libcall},
> >  /* stringop_algs for memset. */
> > - {{{libcall, {{256, rep_prefix_4_byte}}}, /* Known alignment. */
> > -{libcall, {{256, rep_prefix_8_byte,
> > - {{libcall, {{256, rep_prefix_4_byte}}}, /* Unknown alignment. */
> > -{libcall, {{256, rep_prefix_8_byte},
> > + {{{libcall, {{256, rep_prefix_4_byte}, {-1 libcall}}}, /* Known alignment. */
> > +{libcall, {{256, rep_prefix_8_byte}, {-1 libcall,
> > + {{libcall, {{256, rep_prefix_4_byte}, {-1 libcall}}}, /* Unknown alignment. */
> > +{libcall, {{256, rep_prefix_8_byte}, {-1 libcall},
> >  1,/* scalar_stmt_cost. */
> >  1,/* scalar load_cost. */
> >  1,/* scalar_store_cost. */
>
> It looks reasonable.

bootstrap completed on i686-darwin9, so I've applied the following as requested,

Iain

gcc:

2011-11-14 Jan Hubicka j...@suse.cz

* config/i386/i386.c (core cost model): Correct pasto.

Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c (revision 181364)
+++ gcc/config/i386/i386.c (working copy)
@@ -1877,10 +1877,10 @@ struct processor_costs core_cost = {
 {libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_8_byte}, {-1, libcall},
 /* stringop_algs for memset. */
- {{{libcall, {{256, rep_prefix_4_byte}}}, /* Known alignment. */
-{libcall, {{256, rep_prefix_8_byte,
- {{libcall, {{256, rep_prefix_4_byte}}}, /* Unknown alignment. */
-{libcall, {{256, rep_prefix_8_byte},
+ {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}}, /* Known alignment. */
+{libcall, {{256, rep_prefix_8_byte}, {-1, libcall,
+ {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}}, /* Unknown alignment. */
+{libcall, {{256, rep_prefix_8_byte}, {-1, libcall},
 1, /* scalar_stmt_cost. */
 1, /* scalar load_cost. */
 1, /* scalar_store_cost. */
Re: Memset/memcpy patch
> bootstrap completed on i686-darwin9, so I've applied the following as requested,

Thank you and my apologies for the breakage!

Honza

> Iain
>
> gcc:
>
> 2011-11-14 Jan Hubicka j...@suse.cz
>
> * config/i386/i386.c (core cost model): Correct pasto.
>
> Index: gcc/config/i386/i386.c
> ===
> --- gcc/config/i386/i386.c (revision 181364)
> +++ gcc/config/i386/i386.c (working copy)
> @@ -1877,10 +1877,10 @@ struct processor_costs core_cost = {
>  {libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_8_byte}, {-1, libcall},
>  /* stringop_algs for memset. */
> - {{{libcall, {{256, rep_prefix_4_byte}}}, /* Known alignment. */
> -{libcall, {{256, rep_prefix_8_byte,
> - {{libcall, {{256, rep_prefix_4_byte}}}, /* Unknown alignment. */
> -{libcall, {{256, rep_prefix_8_byte},
> + {{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}}, /* Known alignment. */
> +{libcall, {{256, rep_prefix_8_byte}, {-1, libcall,
> + {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}}, /* Unknown alignment. */
> +{libcall, {{256, rep_prefix_8_byte}, {-1, libcall},
>  1, /* scalar_stmt_cost. */
>  1, /* scalar load_cost. */
>  1, /* scalar_store_cost. */
Re: Memset/memcpy patch
On Mon, Nov 14, 2011 at 9:03 AM, Jan Hubicka hubi...@ucw.cz wrote:
> Hi,
> this is hopefully the final variant of the patch. The epilogue code was broken in some scenarios for memset, but should work safely now. I also fixed the tables for core/bulldozer/amdfam10 chips.
>
> But before it can be committed, we need to resolve copyright assignment issues. You don't seem to be listed as having a copyright assignment; does your company have one? Otherwise, please try to get one soon.
>
> Honza
>
> 2011-11-14 Zolotukhin Michael michael.v.zolotuk...@gmail.com
>            Jan Hubicka j...@suse.cz
>
> * gcc.target/i386/sw-1.c: Force rep;movsb.
> * config/i386/i386.h (processor_costs): Add second dimension to stringop_algs array.
> * config/i386/i386.c (cost models): Initialize second dimension of stringop_algs arrays.
> (core_cost): New costs based on generic64 costs with updated stringop values.
> (promote_duplicated_reg): Add support for vector modes, add declaration.
> (promote_duplicated_reg_to_size): Likewise.
> (processor_target): Set core costs for core variants.
> (expand_set_or_movmem_via_loop_with_iter): New function.
> (expand_set_or_movmem_via_loop): Enable reuse of the same iters in different loops, produced by this function.
> (emit_strset): New function.
> (expand_movmem_epilogue): Add epilogue generation for bigger sizes, use SSE-moves where possible.
> (expand_setmem_epilogue): Likewise.
> (expand_movmem_prologue): Likewise for prologue.
> (expand_setmem_prologue): Likewise.
> (expand_constant_movmem_prologue): Likewise.
> (expand_constant_setmem_prologue): Likewise.
> (decide_alg): Add new argument align_unknown. Fix algorithm of strategy selection if TARGET_INLINE_ALL_STRINGOPS is set; Skip sse_loop
> (decide_alignment): Update desired alignment according to chosen move mode.
> (ix86_expand_movmem): Change unrolled_loop strategy to use SSE-moves.
> (ix86_expand_setmem): Likewise.
> (ix86_slow_unaligned_access): Implementation of new hook slow_unaligned_access.
> * config/i386/i386.md (strset): Enable half-SSE moves.
> * config/i386/sse.md (vec_dupv4si): Add expand for vec_dupv4si.
> (vec_dupv2di): Add expand for vec_dupv2di.

This may have caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51134

--
H.J.
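The ChangeLog entries for promote_duplicated_reg concern the value that memset stores: the fill byte is replicated across a wider register so that each store writes several bytes at once. A minimal scalar sketch of that idea in C follows. The names promote_byte and fill_block are hypothetical, and multiplying by 0x0101010101010101 is one common way to replicate a byte, not necessarily the sequence the patch generates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate one byte into all eight bytes of a 64-bit word.
   Each byte of the product is value * 1, so no carries occur.  */
static uint64_t promote_byte(uint8_t value)
{
    return (uint64_t)value * UINT64_C(0x0101010101010101);
}

/* Fill n bytes of dst with value, eight bytes at a time where
   possible, then finish the tail one byte at a time.  */
static void fill_block(unsigned char *dst, uint8_t value, size_t n)
{
    uint64_t wide = promote_byte(value);
    size_t i = 0;

    assert(dst != NULL);

    for (; i + 8 <= n; i += 8)
        memcpy(dst + i, &wide, 8);   /* stands in for one 64-bit store */
    for (; i < n; i++)
        dst[i] = value;
}
```

The promoted word has to be computed before any wide store uses it, which is the kind of ordering the "promoted value was not always initialized" fix in this thread is about.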