I found another bug in the current implementation. A patch for it doesn't cure the i686-linux bootstrap, but it does fix failures on some tests (see attached).
The problem was that we tried to add runtime tests for alignment even when both SRC and DST had unknown alignment. In that case it may be impossible to make both of them aligned simultaneously (see the short illustrative example appended at the end of this message), so I think it is simpler not to try to use aligned SSE moves at all. Generation of prologues with runtime tests is only valid when at least one alignment is known; otherwise it is incorrect. For now, generation of such prologues could probably be removed from MEMMOV entirely.

Even with this fix, though, the i686 bootstrap still fails. Configure line to reproduce the bootstrap failure:

CC="gcc -m32" CXX="g++ -m32" ../configure --with-arch=core2 --with-cpu=atom --prefix=`pwd` i686-linux --with-fpmath=sse --enable-languages=c,c++,fortran

On 18 November 2011 06:23, Jan Hubicka <hubi...@ucw.cz> wrote:
>> > > The current x86 memset/memcpy expansion is broken. It miscompiles
>> > > many programs, including GCC itself. Should it be reverted for now?
>> >
>> > There was a problem in the new code doing loopy epilogues.
>> > I am currently testing the following patch that should fix the problem.
>> > We could either revert now and I will apply a combined patch, or I hope
>> > to fix that tonight.
>>
>> To expand a little bit: I was looking into the code for most of the day
>> today, and the patch combines several fixes.
>> 1) The new loopy epilogue code was quite broken. It did not work for
>>    memset at all because the promoted value was not always initialized;
>>    I fixed that in the version of the patch that is in mainline now. It
>>    also missed the bounds check in some cases. This is fixed by the
>>    expand_set_or_movmem_via_loop_with_iter change.
>> 2) I updated the Atom description incorrectly, so 32bit memset was not
>>    expanded inline; this is fixed by the memset changes.
>> 3) decide_alg was broken in two ways: first, it chose complex algorithms
>>    at -O0, and second, it chose the wrong variant when sse_loop is used.
>> 4) The epilogue loop was output even when it is not needed, i.e. when the
>>    unrolled loop handles 16 bytes at once and the block size is 39. This
>>    is the ix86_movmem and ix86_setmem change.
>> 5) The implementations of ix86_movmem/ix86_setmem diverged for no reason,
>>    so I brought them back into sync. For some reason SSE code in movmem
>>    was not output for 64bit unaligned memcpy; that is fixed too.
>> 6) It seems that both bdver and core are good enough at handling
>>    misaligned blocks that the alignment prologues can be omitted. This
>>    greatly improves the inline sequence and reduces its size. I will,
>>    however, break this out into an independent patch.
>>
>> Life would be easier if the changes were made in multiple incremental
>> steps; stringop expansion is relatively tricky business and relatively
>> easy to get wrong in some cases, since there are so many of them
>> depending on knowledge of size/alignment and target architecture.
>
> Hi,
> this is the patch I committed after bootstrapping/regtesting x86_64-linux
> and --with-arch=core2 --with-cpu=atom. The
> gfortran.fortran-torture/execute/arrayarg.f90
> failure stays. As I've explained in the PR log, I believe it is a
> previously latent problem elsewhere that is now triggered by the inline
> memset expansion that is later unrolled. I would welcome help from someone
> who understands the testcase on whether it is aliasing safe or not.
>
> Honza
>
> PR bootstrap/51134 > * i386.c (atom_cost): Fix 32bit memset description. > (expand_set_or_movmem_via_loop_with_iter): Output proper bounds check > for epilogue loops.
> (expand_movmem_epilogue): Handle epilogues up to size 15 w/o producing > byte loop. > (decide_alg): sse_loop is not useable wthen SSE2 is disabled; when not > optimizing always > use rep movsb or lincall; do not produce word sized loops when > optimizing memset for > size (to avoid need for large constants). > (ix86_expand_movmem): Get into sync with ix86_expand_setmem; choose > unroll factors > better; always do 128bit moves when producing SSE loops; do not > produce loopy epilogue > when size is too small. > (promote_duplicated_reg_to_size): Do not look into desired alignments > when > doing vector expansion. > (ix86_expand_setmem): Track better when promoted value is available; > choose unroll factors > more sanely.; output loopy epilogue only when needed. > Index: config/i386/i386.c > =================================================================== > *** config/i386/i386.c (revision 181407) > --- config/i386/i386.c (working copy) > *************** struct processor_costs atom_cost = { > *** 1785,1791 **** > if that fails. */ > {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* > Known alignment. */ > {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}}, > ! {{libcall, {{-1, libcall}}}, /* Unknown > alignment. */ > {libcall, {{2048, sse_loop}, {2048, unrolled_loop}, > {-1, libcall}}}}}, > > --- 1785,1791 ---- > if that fails. */ > {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* > Known alignment. */ > {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}}, > ! {{libcall, {{2048, sse_loop}, {2048, unrolled_loop}, {-1, libcall}}}, /* > Unknown alignment. */ > {libcall, {{2048, sse_loop}, {2048, unrolled_loop}, > {-1, libcall}}}}}, > > *************** expand_set_or_movmem_via_loop_with_iter > *** 21149,21168 **** > > top_label = gen_label_rtx (); > out_label = gen_label_rtx (); > - if (!reuse_iter) > - iter = gen_reg_rtx (iter_mode); > - > size = expand_simple_binop (iter_mode, AND, count, piece_size_mask, > ! NULL, 1, OPTAB_DIRECT); > ! /* Those two should combine. */ > ! if (piece_size == const1_rtx) > { > ! emit_cmp_and_jump_insns (size, const0_rtx, EQ, NULL_RTX, iter_mode, > true, out_label); > - predict_jump (REG_BR_PROB_BASE * 10 / 100); > } > - if (!reuse_iter) > - emit_move_insn (iter, const0_rtx); > > emit_label (top_label); > > --- 21149,21173 ---- > > top_label = gen_label_rtx (); > out_label = gen_label_rtx (); > size = expand_simple_binop (iter_mode, AND, count, piece_size_mask, > ! NULL, 1, OPTAB_DIRECT); > ! if (!reuse_iter) > { > ! iter = gen_reg_rtx (iter_mode); > ! /* Those two should combine. */ > ! if (piece_size == const1_rtx) > ! { > ! emit_cmp_and_jump_insns (size, const0_rtx, EQ, NULL_RTX, iter_mode, > ! true, out_label); > ! predict_jump (REG_BR_PROB_BASE * 10 / 100); > ! } > ! emit_move_insn (iter, const0_rtx); > ! } > ! else > ! { > ! emit_cmp_and_jump_insns (iter, size, GE, NULL_RTX, iter_mode, > true, out_label); > } > > emit_label (top_label); > > *************** expand_movmem_epilogue (rtx destmem, rtx > *** 21460,21466 **** > gcc_assert (remainder_size == 0); > return; > } > ! if (max_size > 8) > { > count = expand_simple_binop (GET_MODE (count), AND, count, GEN_INT > (max_size - 1), > count, 1, OPTAB_DIRECT); > --- 21465,21471 ---- > gcc_assert (remainder_size == 0); > return; > } > ! 
if (max_size > 16) > { > count = expand_simple_binop (GET_MODE (count), AND, count, GEN_INT > (max_size - 1), > count, 1, OPTAB_DIRECT); > *************** expand_movmem_epilogue (rtx destmem, rtx > *** 21475,21480 **** > --- 21480,21504 ---- > */ > if (TARGET_SINGLE_STRINGOP) > { > + if (max_size > 8) > + { > + rtx label = ix86_expand_aligntest (count, 8, true); > + if (TARGET_64BIT) > + { > + src = change_address (srcmem, DImode, srcptr); > + dest = change_address (destmem, DImode, destptr); > + emit_insn (gen_strmov (destptr, dest, srcptr, src)); > + } > + else > + { > + src = change_address (srcmem, SImode, srcptr); > + dest = change_address (destmem, SImode, destptr); > + emit_insn (gen_strmov (destptr, dest, srcptr, src)); > + emit_insn (gen_strmov (destptr, dest, srcptr, src)); > + } > + emit_label (label); > + LABEL_NUSES (label) = 1; > + } > if (max_size > 4) > { > rtx label = ix86_expand_aligntest (count, 4, true); > *************** expand_movmem_epilogue (rtx destmem, rtx > *** 21508,21513 **** > --- 21532,21566 ---- > rtx offset = force_reg (Pmode, const0_rtx); > rtx tmp; > > + if (max_size > 8) > + { > + rtx label = ix86_expand_aligntest (count, 8, true); > + if (TARGET_64BIT) > + { > + src = change_address (srcmem, DImode, srcptr); > + dest = change_address (destmem, DImode, destptr); > + emit_move_insn (dest, src); > + tmp = expand_simple_binop (Pmode, PLUS, offset, GEN_INT (8), > NULL, > + true, OPTAB_LIB_WIDEN); > + } > + else > + { > + src = change_address (srcmem, SImode, srcptr); > + dest = change_address (destmem, SImode, destptr); > + emit_move_insn (dest, src); > + tmp = expand_simple_binop (Pmode, PLUS, offset, GEN_INT (4), > NULL, > + true, OPTAB_LIB_WIDEN); > + if (tmp != offset) > + emit_move_insn (offset, tmp); > + tmp = expand_simple_binop (Pmode, PLUS, offset, GEN_INT (4), > NULL, > + true, OPTAB_LIB_WIDEN); > + emit_move_insn (dest, src); > + } > + if (tmp != offset) > + emit_move_insn (offset, tmp); > + emit_label (label); > + LABEL_NUSES (label) = 1; > + } > if (max_size > 4) > { > rtx label = ix86_expand_aligntest (count, 4, true); > *************** expand_setmem_epilogue (rtx destmem, rtx > *** 21588,21604 **** > Remaining part we'll move using Pmode and narrower modes. */ > > if (promoted_to_vector_value) > ! while (remainder_size >= 16) > ! { > ! if (GET_MODE (destmem) != move_mode) > ! destmem = adjust_automodify_address_nv (destmem, move_mode, > ! destptr, offset); > ! emit_strset (destmem, promoted_to_vector_value, destptr, > ! move_mode, offset); > > ! offset += 16; > ! remainder_size -= 16; > ! } > > /* Move the remaining part of epilogue - its size might be > a size of the widest mode. */ > --- 21641,21668 ---- > Remaining part we'll move using Pmode and narrower modes. */ > > if (promoted_to_vector_value) > ! { > ! if (promoted_to_vector_value) > ! { > ! if (max_size >= GET_MODE_SIZE (V4SImode)) > ! move_mode = V4SImode; > ! else if (max_size >= GET_MODE_SIZE (DImode)) > ! move_mode = DImode; > ! } > ! while (remainder_size >= GET_MODE_SIZE (move_mode)) > ! { > ! if (GET_MODE (destmem) != move_mode) > ! destmem = adjust_automodify_address_nv (destmem, move_mode, > ! destptr, offset); > ! emit_strset (destmem, > ! promoted_to_vector_value, > ! destptr, > ! move_mode, offset); > > ! offset += GET_MODE_SIZE (move_mode); > ! remainder_size -= GET_MODE_SIZE (move_mode); > ! } > ! } > > /* Move the remaining part of epilogue - its size might be > a size of the widest mode. 
*/ > *************** decide_alg (HOST_WIDE_INT count, HOST_WI > *** 22022,22031 **** > || (memset > ? fixed_regs[AX_REG] : fixed_regs[SI_REG])); > > ! #define ALG_USABLE_P(alg) (rep_prefix_usable \ > ! || (alg != rep_prefix_1_byte \ > ! && alg != rep_prefix_4_byte \ > ! && alg != rep_prefix_8_byte)) > const struct processor_costs *cost; > > /* Even if the string operation call is cold, we still might spend a lot > --- 22086,22096 ---- > || (memset > ? fixed_regs[AX_REG] : fixed_regs[SI_REG])); > > ! #define ALG_USABLE_P(alg) ((rep_prefix_usable \ > ! || (alg != rep_prefix_1_byte \ > ! && alg != rep_prefix_4_byte \ > ! && alg != rep_prefix_8_byte)) \ > ! && (TARGET_SSE2 || alg != sse_loop)) > const struct processor_costs *cost; > > /* Even if the string operation call is cold, we still might spend a lot > *************** decide_alg (HOST_WIDE_INT count, HOST_WI > *** 22037,22042 **** > --- 22102,22110 ---- > else > optimize_for_speed = true; > > + if (!optimize) > + return (rep_prefix_usable ? rep_prefix_1_byte : libcall); > + > cost = optimize_for_speed ? ix86_cost : &ix86_size_cost; > > *dynamic_check = -1; > *************** decide_alg (HOST_WIDE_INT count, HOST_WI > *** 22049,22058 **** > /* rep; movq or rep; movl is the smallest variant. */ > else if (!optimize_for_speed) > { > ! if (!count || (count & 3)) > ! return rep_prefix_usable ? rep_prefix_1_byte : loop_1_byte; > else > ! return rep_prefix_usable ? rep_prefix_4_byte : loop; > } > /* Very tiny blocks are best handled via the loop, REP is expensive to > setup. > */ > --- 22117,22126 ---- > /* rep; movq or rep; movl is the smallest variant. */ > else if (!optimize_for_speed) > { > ! if (!count || (count & 3) || memset) > ! return rep_prefix_usable ? rep_prefix_1_byte : libcall; > else > ! return rep_prefix_usable ? rep_prefix_4_byte : libcall; > } > /* Very tiny blocks are best handled via the loop, REP is expensive to > setup. > */ > *************** decide_alg (HOST_WIDE_INT count, HOST_WI > *** 22106,22118 **** > int max = -1; > enum stringop_alg alg; > int i; > - bool any_alg_usable_p = true; > bool only_libcall_fits = true; > > for (i = 0; i < MAX_STRINGOP_ALGS; i++) > { > enum stringop_alg candidate = algs->size[i].alg; > - any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate); > > if (candidate != libcall && candidate > && ALG_USABLE_P (candidate)) > --- 22174,22184 ---- > *************** decide_alg (HOST_WIDE_INT count, HOST_WI > *** 22124,22130 **** > /* If there aren't any usable algorithms, then recursing on > smaller sizes isn't going to find anything. Just return the > simple byte-at-a-time copy loop. */ > ! if (!any_alg_usable_p || only_libcall_fits) > { > /* Pick something reasonable. */ > if (TARGET_INLINE_STRINGOPS_DYNAMICALLY) > --- 22190,22196 ---- > /* If there aren't any usable algorithms, then recursing on > smaller sizes isn't going to find anything. Just return the > simple byte-at-a-time copy loop. */ > ! if (only_libcall_fits) > { > /* Pick something reasonable. */ > if (TARGET_INLINE_STRINGOPS_DYNAMICALLY) > *************** ix86_expand_movmem (rtx dst, rtx src, rt > *** 22253,22259 **** > int dynamic_check; > bool need_zero_guard = false; > bool align_unknown; > ! int unroll_factor; > enum machine_mode move_mode; > rtx loop_iter = NULL_RTX; > int dst_offset, src_offset; > --- 22319,22325 ---- > int dynamic_check; > bool need_zero_guard = false; > bool align_unknown; > ! 
unsigned int unroll_factor; > enum machine_mode move_mode; > rtx loop_iter = NULL_RTX; > int dst_offset, src_offset; > *************** ix86_expand_movmem (rtx dst, rtx src, rt > *** 22316,22329 **** > case unrolled_loop: > need_zero_guard = true; > move_mode = Pmode; > ! unroll_factor = TARGET_64BIT ? 4 : 2; > size_needed = GET_MODE_SIZE (move_mode) * unroll_factor; > break; > case sse_loop: > need_zero_guard = true; > /* Use SSE instructions, if possible. */ > ! move_mode = align_unknown ? DImode : V4SImode; > ! unroll_factor = TARGET_64BIT ? 4 : 2; > size_needed = GET_MODE_SIZE (move_mode) * unroll_factor; > break; > case rep_prefix_8_byte: > --- 22382,22408 ---- > case unrolled_loop: > need_zero_guard = true; > move_mode = Pmode; > ! unroll_factor = 1; > ! /* Select maximal available 1,2 or 4 unroll factor. > ! In 32bit we can not afford to use 4 registers inside the loop. */ > ! if (!count) > ! unroll_factor = TARGET_64BIT ? 4 : 2; > ! else > ! while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count > ! && unroll_factor < (TARGET_64BIT ? 4 :2)) > ! unroll_factor *= 2; > size_needed = GET_MODE_SIZE (move_mode) * unroll_factor; > break; > case sse_loop: > need_zero_guard = true; > /* Use SSE instructions, if possible. */ > ! move_mode = V4SImode; > ! /* Select maximal available 1,2 or 4 unroll factor. */ > ! if (!count) > ! unroll_factor = 4; > ! else > ! while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count > ! && unroll_factor < 4) > size_needed = GET_MODE_SIZE (move_mode) * unroll_factor; > break; > case rep_prefix_8_byte: > *************** ix86_expand_movmem (rtx dst, rtx src, rt > *** 22568,22574 **** > if (alg == sse_loop || alg == unrolled_loop) > { > rtx tmp; > ! if (align_unknown && unroll_factor > 1) > { > /* Reduce epilogue's size by creating not-unrolled loop. If we won't > do this, we can have very big epilogue - when alignment is > statically > --- 22647,22659 ---- > if (alg == sse_loop || alg == unrolled_loop) > { > rtx tmp; > ! int remainder_size = epilogue_size_needed; > ! > ! /* We may not need the epilgoue loop at all when the count is known > ! and alignment is not adjusted. */ > ! if (count && desired_align <= align) > ! remainder_size = count % epilogue_size_needed; > ! if (remainder_size > 31) > { > /* Reduce epilogue's size by creating not-unrolled loop. If we won't > do this, we can have very big epilogue - when alignment is > statically > *************** promote_duplicated_reg_to_size (rtx val, > *** 22710,22716 **** > { > rtx promoted_val = NULL_RTX; > > ! if (size_needed > 8 || (desired_align > align && desired_align > 8)) > { > /* We want to promote to vector register, so we expect that at least > SSE > is available. */ > --- 22795,22801 ---- > { > rtx promoted_val = NULL_RTX; > > ! if (size_needed > 8) > { > /* We want to promote to vector register, so we expect that at least > SSE > is available. */ > *************** promote_duplicated_reg_to_size (rtx val, > *** 22724,22730 **** > else > promoted_val = promote_duplicated_reg (V4SImode, val); > } > ! else if (size_needed > 4 || (desired_align > align && desired_align > 4)) > { > gcc_assert (TARGET_64BIT); > promoted_val = promote_duplicated_reg (DImode, val); > --- 22809,22815 ---- > else > promoted_val = promote_duplicated_reg (V4SImode, val); > } > ! 
else if (size_needed > 4) > { > gcc_assert (TARGET_64BIT); > promoted_val = promote_duplicated_reg (DImode, val); > *************** ix86_expand_setmem (rtx dst, rtx count_e > *** 22764,22769 **** > --- 22849,22855 ---- > unsigned int unroll_factor; > enum machine_mode move_mode; > rtx loop_iter = NULL_RTX; > + bool early_jump = false; > > if (CONST_INT_P (align_exp)) > align = INTVAL (align_exp); > *************** ix86_expand_setmem (rtx dst, rtx count_e > *** 22783,22789 **** > /* Step 0: Decide on preferred algorithm, desired alignment and > size of chunks to be copied by main loop. */ > > ! align_unknown = CONST_INT_P (align_exp) && INTVAL (align_exp) > 0; > alg = decide_alg (count, expected_size, true, &dynamic_check, > align_unknown); > desired_align = decide_alignment (align, alg, expected_size); > unroll_factor = 1; > --- 22869,22875 ---- > /* Step 0: Decide on preferred algorithm, desired alignment and > size of chunks to be copied by main loop. */ > > ! align_unknown = !(CONST_INT_P (align_exp) && INTVAL (align_exp) > 0); > alg = decide_alg (count, expected_size, true, &dynamic_check, > align_unknown); > desired_align = decide_alignment (align, alg, expected_size); > unroll_factor = 1; > *************** ix86_expand_setmem (rtx dst, rtx count_e > *** 22813,22821 **** > move_mode = Pmode; > unroll_factor = 1; > /* Select maximal available 1,2 or 4 unroll factor. */ > ! while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count > ! && unroll_factor < 4) > ! unroll_factor *= 2; > size_needed = GET_MODE_SIZE (move_mode) * unroll_factor; > break; > case sse_loop: > --- 22899,22910 ---- > move_mode = Pmode; > unroll_factor = 1; > /* Select maximal available 1,2 or 4 unroll factor. */ > ! if (!count) > ! unroll_factor = 4; > ! else > ! while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count > ! && unroll_factor < 4) > ! unroll_factor *= 2; > size_needed = GET_MODE_SIZE (move_mode) * unroll_factor; > break; > case sse_loop: > *************** ix86_expand_setmem (rtx dst, rtx count_e > *** 22823,22831 **** > move_mode = TARGET_64BIT ? V2DImode : V4SImode; > unroll_factor = 1; > /* Select maximal available 1,2 or 4 unroll factor. */ > ! while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count > ! && unroll_factor < 4) > ! unroll_factor *= 2; > size_needed = GET_MODE_SIZE (move_mode) * unroll_factor; > break; > case rep_prefix_8_byte: > --- 22912,22923 ---- > move_mode = TARGET_64BIT ? V2DImode : V4SImode; > unroll_factor = 1; > /* Select maximal available 1,2 or 4 unroll factor. */ > ! if (!count) > ! unroll_factor = 4; > ! else > ! while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count > ! && unroll_factor < 4) > ! unroll_factor *= 2; > size_needed = GET_MODE_SIZE (move_mode) * unroll_factor; > break; > case rep_prefix_8_byte: > *************** ix86_expand_setmem (rtx dst, rtx count_e > *** 22904,22909 **** > --- 22996,23002 ---- > emit_move_insn (loop_iter, const0_rtx); > } > label = gen_label_rtx (); > + early_jump = true; > emit_cmp_and_jump_insns (count_exp, > GEN_INT (epilogue_size_needed), > LTU, 0, counter_mode (count_exp), 1, label); > *************** ix86_expand_setmem (rtx dst, rtx count_e > *** 23016,23022 **** > vec_promoted_val = > promote_duplicated_reg_to_size (gpr_promoted_val, > GET_MODE_SIZE (move_mode), > ! 
desired_align, align); > loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, > destreg, > NULL, vec_promoted_val, count_exp, > loop_iter, move_mode, unroll_factor, > --- 23109,23115 ---- > vec_promoted_val = > promote_duplicated_reg_to_size (gpr_promoted_val, > GET_MODE_SIZE (move_mode), > ! GET_MODE_SIZE (move_mode), align); > loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, > destreg, > NULL, vec_promoted_val, count_exp, > loop_iter, move_mode, unroll_factor, > *************** ix86_expand_setmem (rtx dst, rtx count_e > *** 23065,23085 **** > LABEL_NUSES (label) = 1; > /* We can not rely on fact that promoved value is known. */ > vec_promoted_val = 0; > ! gpr_promoted_val = 0; > } > epilogue: > if (alg == unrolled_loop || alg == sse_loop) > { > rtx tmp; > ! if (align_unknown && unroll_factor > 1 > ! && epilogue_size_needed >= GET_MODE_SIZE (move_mode) > ! && vec_promoted_val) > { > /* Reduce epilogue's size by creating not-unrolled loop. If we won't > do this, we can have very big epilogue - when alignment is > statically > unknown we'll have the epilogue byte by byte which may be very > slow. */ > loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, > destreg, > ! NULL, vec_promoted_val, count_exp, > loop_iter, move_mode, 1, > expected_size, false); > dst = change_address (dst, BLKmode, destreg); > --- 23158,23183 ---- > LABEL_NUSES (label) = 1; > /* We can not rely on fact that promoved value is known. */ > vec_promoted_val = 0; > ! if (early_jump) > ! gpr_promoted_val = 0; > } > epilogue: > if (alg == unrolled_loop || alg == sse_loop) > { > rtx tmp; > ! int remainder_size = epilogue_size_needed; > ! if (count && desired_align <= align) > ! remainder_size = count % epilogue_size_needed; > ! /* We may not need the epilgoue loop at all when the count is known > ! and alignment is not adjusted. */ > ! if (remainder_size > 31 > ! && (alg == sse_loop ? vec_promoted_val : gpr_promoted_val)) > { > /* Reduce epilogue's size by creating not-unrolled loop. If we won't > do this, we can have very big epilogue - when alignment is > statically > unknown we'll have the epilogue byte by byte which may be very > slow. */ > loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, > destreg, > ! NULL, (alg == sse_loop ? vec_promoted_val : gpr_promoted_val), > count_exp, > loop_iter, move_mode, 1, > expected_size, false); > dst = change_address (dst, BLKmode, destreg); > *************** ix86_expand_setmem (rtx dst, rtx count_e > *** 23090,23106 **** > if (tmp != destreg) > emit_move_insn (destreg, tmp); > } > ! if (count_exp == const0_rtx) > ; > ! else if (!gpr_promoted_val && epilogue_size_needed > 1) > expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp, > epilogue_size_needed); > else > ! { > ! if (epilogue_size_needed > 1) > ! expand_setmem_epilogue (dst, destreg, vec_promoted_val, > gpr_promoted_val, > ! val_exp, count_exp, epilogue_size_needed); > ! } > if (jump_around_label) > emit_label (jump_around_label); > return true; > --- 23188,23201 ---- > if (tmp != destreg) > emit_move_insn (destreg, tmp); > } > ! if (count_exp == const0_rtx || epilogue_size_needed <= 1) > ; > ! else if (!gpr_promoted_val) > expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp, > epilogue_size_needed); > else > ! expand_setmem_epilogue (dst, destreg, vec_promoted_val, > gpr_promoted_val, > ! val_exp, count_exp, epilogue_size_needed); > if (jump_around_label) > emit_label (jump_around_label); > return true; > -- --- Best regards, Michael V. 
Zolotukhin, Software Engineer Intel Corporation.
memcpy_unknown_alignment.patch
Description: Binary data
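
For context on the SRC/DST alignment point at the top of this mail, here is a minimal, self-contained sketch in plain C (not GCC internals; the buffer and the offsets 3 and 7 are made up purely for illustration) of why a runtime prologue that peels bytes until DST is 16-byte aligned does not, in general, leave SRC aligned as well:

/* Illustrative only: when the two pointers' offsets modulo 16 differ,
   no peel count can make both of them 16-byte aligned, so aligned SSE
   moves cannot safely be used for both sides.  */
#include <stdint.h>
#include <stdio.h>

int main (void)
{
  char buf[64];

  /* Hypothetical pointers whose alignment the compiler cannot see.
     Their offsets differ by 4, which is not a multiple of 16.  */
  char *dst = buf + 3;
  char *src = buf + 7;

  /* Bytes an alignment prologue would peel to make DST 16-byte aligned.  */
  size_t peel = (16 - ((uintptr_t) dst & 15)) & 15;

  unsigned dst_rem = (unsigned) (((uintptr_t) dst + peel) & 15); /* always 0 */
  unsigned src_rem = (unsigned) (((uintptr_t) src + peel) & 15); /* 4 here   */

  printf ("peel %u bytes: dst %% 16 = %u, src %% 16 = %u\n",
          (unsigned) peel, dst_rem, src_rem);
  return 0;
}

Only when the original pointers happen to share the same offset modulo 16 can a single peel count align both of them; otherwise at least one side stays misaligned, which matches the point above: the expander should not attempt the aligned-SSE prologue when both alignments are unknown.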