Forgot to add Uros - adding now.

On 18 April 2013 15:53, Michael Zolotukhin
<michael.v.zolotuk...@gmail.com> wrote:
> Hi,
> Jan, thanks for the review - I hope to prepare an updated version of
> the patch shortly.  Please see my answers to your comments below.
>
> Uros, there is a question about the best approach for generating wide
> moves.  Could you please comment on it (see details in bullets 3 and 5)?
>
> 1.
>> +static int smallest_pow2_greater_than (int);
>>
>> Perhaps it is easier to use existing 1<<ceil_log2?
> Well, yep.  Actually, this routine has already been used there, so I
> continued using it.  I guess we could change its implementation to
> call ceil_log2/floor_log2 or remove it entirely.
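> For reference, here is a sketch of the difference (assuming I recall
> the helper's definition correctly; ceil_log2 (x) returns the least N
> with 2^N >= x):
>
>   /* The existing helper returns the smallest power of two STRICTLY
>      greater than VAL.  */
>   static int
>   smallest_pow2_greater_than (int val)
>   {
>     int ret = 1;
>     while (ret <= val)
>       ret <<= 1;
>     return ret;
>   }
>
> So it corresponds to 1 << ceil_log2 (val + 1) rather than
> 1 << ceil_log2 (val) - the two differ when VAL is itself a power of
> two, which matters because the caller halves the result.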
>
> 2.
>> -  y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
>> -  srcmem = change_address (srcmem, mode, y_addr);
>> +  srcmem = offset_address (srcmem, copy_rtx (tmp), piece_size_n);
>> +  srcmem = adjust_address (srcmem, mode, 0);
>> ...
>> This change looks OK and can go into mainline independently.  Just
>> please ensure that changing the way the address is computed is not
>> making us preserve the alias set.  Memmove cannot rely on the alias
>> set of the src/destination objects.
> Could you explain this in more detail?  Do you mean that at the
> beginning DST and SRC could point to one memory location and have the
> corresponding alias sets, and I just change the addresses they point
> to without invalidating the alias sets?  I haven't thought about this,
> and it seems like a possible bug, but I guess it could simply be fixed
> by calling change_address at the end.
>
> 3.
>> +  /* Find the widest mode in which we could perform moves.
>> +     Start with the biggest power of 2 less than SIZE_TO_MOVE and half
>> +     it until move of such size is supported.  */
>> +  piece_size = smallest_pow2_greater_than (size_to_move) >> 1;
>> +  move_mode = mode_for_size (piece_size * BITS_PER_UNIT, MODE_INT, 0);
>>
>> I suppose this is a problem with SSE moves ending up in integer
>> registers, since you get TImode rather than a vectorized mode, like
>> V8QImode, here.  Why not stick with the original mode parameter?
> Yes, here we choose TImode instead of a vector mode, but that was
> actually done intentionally.  I tried several approaches here and
> decided that using the widest integer mode is the best one for now.
> We could try to find a particular (vector) mode in which we want to
> perform the copying, but isn't it better to rely on the machine
> description here?  My idea was to just request a copy of, for
> instance, a 128-bit piece (i.e. one TI-move) and leave it to the
> compiler to choose the most optimal way of performing it.  Currently,
> the compiler thinks that a move of 128 bits should be split into two
> 64-bit moves (this transformation is done in the split2 pass) - if
> that is actually suboptimal, we should fix it there, IMHO.
>
> I think Uros could advise on whether this is a reasonable approach or
> whether it should be changed.
>
> Also, I tried to avoid such fixes in this patch - that doesn't mean
> I'm not going to work on them, quite the contrary.  But it'd be easier
> to work on them if we had code in the trunk that could reveal the
> problem.
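> To illustrate (a made-up example, not from the patch or its
> testsuite): for something like
>
>   void
>   copy32 (char *dst, const char *src)
>   {
>     __builtin_memcpy (dst, src, 32);
>   }
>
> the expander would request 16-byte (TImode) pieces, and today split2
> turns each TImode move into two DImode moves; if a single vector move
> is preferable there, split2 is the place to teach it that.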
>
> 4.
>> Doesn't this effectively kill support for TARGET_SINGLE_STRINGOP?  It
>> is useful as a size optimization.
> Do you mean removing emit_strmov?  I don't think it'll kill anything,
> as the new emit_memmov is capable of doing what emit_strmov did and is
> just an extended version of it.  BTW, under the TARGET_SINGLE_STRINGOP
> switch gen_strmov is used, not emit_strmov - the behaviour there
> hasn't been changed by this patch.
>
> 5.
>> For SSE codegen, won't we need to track down whether the destination
>> was aligned to generate aligned/unaligned moves?
> We try to achieve the required alignment in the prologue, so in the
> main loop the destination is aligned properly.  The source, meanwhile,
> could be misaligned, so unaligned moves could be generated for it.
> Here I also rely on the fact that we have an optimal description of
> aligned/unaligned moves in the MD file, i.e. if it's better to emit
> two DI-moves instead of one unaligned TI-move, then splits/expands
> will manage to do that.
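> To be concrete (a sketch of the mechanism, not new code): the prologue
> records the achieved destination alignment on the MEM, roughly
>
>   destmem = emit_memmov (destmem, &srcmem, destptr, srcptr, i);
>   ...
>   set_mem_align (destmem, i * 2 * BITS_PER_UNIT);
>
> while the source MEM keeps whatever alignment is actually known for
> it, so the mov<mode> patterns can pick an aligned store for the
> destination and an unaligned load for the source where appropriate.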
>
> 6.
>> +  else if (TREE_CODE (expr) == MEM_REF)
>> +    {
>> +      tree base = TREE_OPERAND (expr, 0);
>> +      tree byte_offset = TREE_OPERAND (expr, 1);
>> +      if (TREE_CODE (base) != ADDR_EXPR
>> +          || TREE_CODE (byte_offset) != INTEGER_CST)
>> +        return -1;
>> +      if (!DECL_P (TREE_OPERAND (base, 0))
>> +          || DECL_ALIGN (TREE_OPERAND (base, 0)) < align)
>>
>> You can use TYPE_ALIGN here?  In general can't we replace all the
>> GIMPLE handling by get_object_alignment?
>>
>> +        return -1;
>> +      offset += tree_low_cst (byte_offset, 1);
>> +    }
>>    else
>>      return -1;
>>
>> This change ought to go in independently.  I can not review it.
>> I will make a first pass over the patch shortly, but please send an
>> updated patch fixing the problem with integer regs.
> Actually, I don't know what the right way to find out the alignment
> is, but the existing one didn't work.  The routine get_mem_align_offset
> didn't handle MEM_REFs at all, so I added some handling there - I'm
> not sure it's complete and absolutely correct, but it currently works
> for me.  I'd be glad to hear any suggestions on how this should be
> done - whom should I ask about it?
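> (If get_object_alignment is indeed the right interface, I'd guess the
> MEM_REF case above would collapse to something like
>
>   unsigned int obj_align = get_object_alignment (expr);
>   if (obj_align < (unsigned int) align)
>     return -1;
>
> since it already walks MEM_REFs and declarations internally - but I
> haven't checked whether it fits the offset bookkeeping this caller
> needs.)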
>
> ---
> Thanks, Michael
>
>
> On 17 April 2013 19:12, Jan Hubicka <hubi...@ucw.cz> wrote:
>> @@ -2392,6 +2392,7 @@ static void ix86_set_current_function (tree);
>>  static unsigned int ix86_minimum_incoming_stack_boundary (bool);
>>
>>  static enum calling_abi ix86_function_abi (const_tree);
>> +static int smallest_pow2_greater_than (int);
>>
>> Perhaps it is easier to use existing 1<<ceil_log2?
>>
>>  #ifndef SUBTARGET32_DEFAULT_CPU
>> @@ -21829,11 +21830,10 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
>>  {
>>    rtx out_label, top_label, iter, tmp;
>>    enum machine_mode iter_mode = counter_mode (count);
>> -  rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
>> +  int piece_size_n = GET_MODE_SIZE (mode) * unroll;
>> +  rtx piece_size = GEN_INT (piece_size_n);
>>    rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
>>    rtx size;
>> -  rtx x_addr;
>> -  rtx y_addr;
>>    int i;
>>
>>    top_label = gen_label_rtx ();
>> @@ -21854,13 +21854,18 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
>>    emit_label (top_label);
>>
>>    tmp = convert_modes (Pmode, iter_mode, iter, true);
>> -  x_addr = gen_rtx_PLUS (Pmode, destptr, tmp);
>> -  destmem = change_address (destmem, mode, x_addr);
>> +
>> +  /* This assert could be relaxed - in this case we'll need to compute
>> +     smallest power of two, containing in PIECE_SIZE_N and pass it to
>> +     offset_address.  */
>> +  gcc_assert ((piece_size_n & (piece_size_n - 1)) == 0);
>> +  destmem = offset_address (destmem, tmp, piece_size_n);
>> +  destmem = adjust_address (destmem, mode, 0);
>>
>>    if (srcmem)
>>      {
>> -      y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
>> -      srcmem = change_address (srcmem, mode, y_addr);
>> +      srcmem = offset_address (srcmem, copy_rtx (tmp), piece_size_n);
>> +      srcmem = adjust_address (srcmem, mode, 0);
>>
>>        /* When unrolling for chips that reorder memory reads and writes,
>>           we can save registers by using single temporary.
>> @@ -22039,13 +22044,61 @@ expand_setmem_via_rep_stos (rtx destmem, rtx destptr, rtx value,
>>    emit_insn (gen_rep_stos (destptr, countreg, destmem, value, destexp));
>>  }
>>
>> This change looks OK and can go into mainline independently.  Just
>> please ensure that changing the way the address is computed is not
>> making us preserve the alias set.  Memmove cannot rely on the alias
>> set of the src/destination objects.
>>
>> -static void
>> -emit_strmov (rtx destmem, rtx srcmem,
>> -             rtx destptr, rtx srcptr, enum machine_mode mode, int offset)
>> -{
>> -  rtx src = adjust_automodify_address_nv (srcmem, mode, srcptr, offset);
>> -  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
>> -  emit_insn (gen_strmov (destptr, dest, srcptr, src));
>> +/* This function emits moves to copy SIZE_TO_MOVE bytes from SRCMEM to
>> +   DESTMEM.
>> +   SRC is passed by pointer to be updated on return.
>> +   Return value is updated DST.  */
>> +static rtx
>> +emit_memmov (rtx destmem, rtx *srcmem, rtx destptr, rtx srcptr,
>> +             HOST_WIDE_INT size_to_move)
>> +{
>> +  rtx dst = destmem, src = *srcmem, adjust, tempreg;
>> +  enum insn_code code;
>> +  enum machine_mode move_mode;
>> +  int piece_size, i;
>> +
>> +  /* Find the widest mode in which we could perform moves.
>> +     Start with the biggest power of 2 less than SIZE_TO_MOVE and half
>> +     it until move of such size is supported.  */
>> +  piece_size = smallest_pow2_greater_than (size_to_move) >> 1;
>> +  move_mode = mode_for_size (piece_size * BITS_PER_UNIT, MODE_INT, 0);
>>
>> I suppose this is a problem with SSE moves ending up in integer
>> registers, since you get TImode rather than a vectorized mode, like
>> V8QImode, here.  Why not stick with the original mode parameter?
>>
>> +  code = optab_handler (mov_optab, move_mode);
>> +  while (code == CODE_FOR_nothing && piece_size > 1)
>> +    {
>> +      piece_size >>= 1;
>> +      move_mode = mode_for_size (piece_size * BITS_PER_UNIT, MODE_INT, 0);
>> +      code = optab_handler (mov_optab, move_mode);
>> +    }
>> +  gcc_assert (code != CODE_FOR_nothing);
>> +
>> +  dst = adjust_automodify_address_nv (dst, move_mode, destptr, 0);
>> +  src = adjust_automodify_address_nv (src, move_mode, srcptr, 0);
>> +
>> +  /* Emit moves.  We'll need SIZE_TO_MOVE/PIECE_SIZES moves.  */
>> +  gcc_assert (size_to_move % piece_size == 0);
>> +  adjust = GEN_INT (piece_size);
>> +  for (i = 0; i < size_to_move; i += piece_size)
>> +    {
>> +      /* We move from memory to memory, so we'll need to do it via
>> +         a temporary register.  */
>> +      tempreg = gen_reg_rtx (move_mode);
>> +      emit_insn (GEN_FCN (code) (tempreg, src));
>> +      emit_insn (GEN_FCN (code) (dst, tempreg));
>> +
>> +      emit_move_insn (destptr,
>> +                      gen_rtx_PLUS (Pmode, copy_rtx (destptr), adjust));
>> +      emit_move_insn (srcptr,
>> +                      gen_rtx_PLUS (Pmode, copy_rtx (srcptr), adjust));
>> +
>> +      dst = adjust_automodify_address_nv (dst, move_mode, destptr,
>> +                                          piece_size);
>> +      src = adjust_automodify_address_nv (src, move_mode, srcptr,
>> +                                          piece_size);
>> +    }
>> +
>> +  /* Update DST and SRC rtx.  */
>> +  *srcmem = src;
>> +  return dst;
>>
>> Doesn't this effectively kill support for TARGET_SINGLE_STRINGOP?  It
>> is useful as a size optimization.
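>> If I followed the callers correctly, only powers of two are ever
>> passed as SIZE_TO_MOVE here - e.g. for size_to_move == 8 this gives
>> smallest_pow2_greater_than (8) == 16, halved to piece_size == 8, i.e.
>> a single DImode move; a non-power-of-two size such as 12 would yield
>> piece_size == 8 and trip the size_to_move % piece_size assert.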
>>  }
>>
>>  /* Output code to copy at most count & (max_size - 1) bytes from SRC to
>>     DEST.  */
>> @@ -22057,44 +22110,17 @@ expand_movmem_epilogue (rtx destmem, rtx srcmem,
>>    if (CONST_INT_P (count))
>>      {
>>        HOST_WIDE_INT countval = INTVAL (count);
>> -      int offset = 0;
>> +      HOST_WIDE_INT epilogue_size = countval % max_size;
>> +      int i;
>>
>> -      if ((countval & 0x10) && max_size > 16)
>> -        {
>> -          if (TARGET_64BIT)
>> -            {
>> -              emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
>> -              emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
>> -            }
>> -          else
>> -            gcc_unreachable ();
>> -          offset += 16;
>> -        }
>> -      if ((countval & 0x08) && max_size > 8)
>> -        {
>> -          if (TARGET_64BIT)
>> -            emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
>> -          else
>> -            {
>> -              emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
>> -              emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
>> -            }
>> -          offset += 8;
>> -        }
>> -      if ((countval & 0x04) && max_size > 4)
>> +      /* For now MAX_SIZE should be a power of 2.  This assert could be
>> +         relaxed, but it'll require a bit more complicated epilogue
>> +         expanding.  */
>> +      gcc_assert ((max_size & (max_size - 1)) == 0);
>> +      for (i = max_size; i >= 1; i >>= 1)
>>          {
>> -          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
>> -          offset += 4;
>> -        }
>> -      if ((countval & 0x02) && max_size > 2)
>> -        {
>> -          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
>> -          offset += 2;
>> -        }
>> -      if ((countval & 0x01) && max_size > 1)
>> -        {
>> -          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
>> -          offset += 1;
>> +          if (epilogue_size & i)
>> +            destmem = emit_memmov (destmem, &srcmem, destptr, srcptr, i);
>>          }
>>        return;
>>      }
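>> If I read the new loop right: with max_size == 16 and, say,
>> countval % max_size == 13 (0b1101), the iterations i = 8, 4 and 1
>> satisfy (epilogue_size & i) and emit 8-, 4- and 1-byte moves, which
>> matches what the old explicit countval & 0x08/0x04/0x01 sequence did.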
>> @@ -22330,47 +22356,33 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
>>  }
>>
>>  /* Copy enough from DEST to SRC to align DEST known to by aligned by ALIGN to
>> -   DESIRED_ALIGNMENT.  */
>> -static void
>> +   DESIRED_ALIGNMENT.
>> +   Return value is updated DESTMEM.  */
>> +static rtx
>>  expand_movmem_prologue (rtx destmem, rtx srcmem,
>>                          rtx destptr, rtx srcptr, rtx count,
>>                          int align, int desired_alignment)
>>  {
>> -  if (align <= 1 && desired_alignment > 1)
>> -    {
>> -      rtx label = ix86_expand_aligntest (destptr, 1, false);
>> -      srcmem = change_address (srcmem, QImode, srcptr);
>> -      destmem = change_address (destmem, QImode, destptr);
>> -      emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
>> -      ix86_adjust_counter (count, 1);
>> -      emit_label (label);
>> -      LABEL_NUSES (label) = 1;
>> -    }
>> -  if (align <= 2 && desired_alignment > 2)
>> -    {
>> -      rtx label = ix86_expand_aligntest (destptr, 2, false);
>> -      srcmem = change_address (srcmem, HImode, srcptr);
>> -      destmem = change_address (destmem, HImode, destptr);
>> -      emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
>> -      ix86_adjust_counter (count, 2);
>> -      emit_label (label);
>> -      LABEL_NUSES (label) = 1;
>> -    }
>> -  if (align <= 4 && desired_alignment > 4)
>> +  int i;
>> +  for (i = 1; i < desired_alignment; i <<= 1)
>>      {
>> -      rtx label = ix86_expand_aligntest (destptr, 4, false);
>> -      srcmem = change_address (srcmem, SImode, srcptr);
>> -      destmem = change_address (destmem, SImode, destptr);
>> -      emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
>> -      ix86_adjust_counter (count, 4);
>> -      emit_label (label);
>> -      LABEL_NUSES (label) = 1;
>> +      if (align <= i)
>> +        {
>> +          rtx label = ix86_expand_aligntest (destptr, i, false);
>> +          destmem = emit_memmov (destmem, &srcmem, destptr, srcptr, i);
>> +          ix86_adjust_counter (count, i);
>> +          emit_label (label);
>> +          LABEL_NUSES (label) = 1;
>> +          set_mem_align (destmem, i * 2 * BITS_PER_UNIT);
>> +        }
>>      }
>> -  gcc_assert (desired_alignment <= 8);
>> +  return destmem;
>>  }
>>
>>  /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
>> -   ALIGN_BYTES is how many bytes need to be copied.  */
>> +   ALIGN_BYTES is how many bytes need to be copied.
>> +   The function updates DST and SRC, namely, it sets proper alignment.
>> +   DST is returned via return value, SRC is updated via pointer SRCP.  */
>>  static rtx
>>  expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
>>                                   int desired_align, int align_bytes)
>>  {
>>    rtx src = *srcp;
>>    rtx orig_dst = dst;
>>    rtx orig_src = src;
>> -  int off = 0;
>> +  int piece_size = 1;
>> +  int copied_bytes = 0;
>>    int src_align_bytes = get_mem_align_offset (src, desired_align * BITS_PER_UNIT);
>>    if (src_align_bytes >= 0)
>>      src_align_bytes = desired_align - src_align_bytes;
>> -  if (align_bytes & 1)
>> -    {
>> -      dst = adjust_automodify_address_nv (dst, QImode, destreg, 0);
>> -      src = adjust_automodify_address_nv (src, QImode, srcreg, 0);
>> -      off = 1;
>> -      emit_insn (gen_strmov (destreg, dst, srcreg, src));
>> -    }
>> -  if (align_bytes & 2)
>> -    {
>> -      dst = adjust_automodify_address_nv (dst, HImode, destreg, off);
>> -      src = adjust_automodify_address_nv (src, HImode, srcreg, off);
>> -      if (MEM_ALIGN (dst) < 2 * BITS_PER_UNIT)
>> -        set_mem_align (dst, 2 * BITS_PER_UNIT);
>> -      if (src_align_bytes >= 0
>> -          && (src_align_bytes & 1) == (align_bytes & 1)
>> -          && MEM_ALIGN (src) < 2 * BITS_PER_UNIT)
>> -        set_mem_align (src, 2 * BITS_PER_UNIT);
>> -      off = 2;
>> -      emit_insn (gen_strmov (destreg, dst, srcreg, src));
>> -    }
>> -  if (align_bytes & 4)
>> +
>> +  for (piece_size = 1;
>> +       piece_size <= desired_align && copied_bytes < align_bytes;
>> +       piece_size <<= 1)
>>      {
>> -      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
>> -      src = adjust_automodify_address_nv (src, SImode, srcreg, off);
>> -      if (MEM_ALIGN (dst) < 4 * BITS_PER_UNIT)
>> -        set_mem_align (dst, 4 * BITS_PER_UNIT);
>> -      if (src_align_bytes >= 0)
>> +      if (align_bytes & piece_size)
>>          {
>> -          unsigned int src_align = 0;
>> -          if ((src_align_bytes & 3) == (align_bytes & 3))
>> -            src_align = 4;
>> -          else if ((src_align_bytes & 1) == (align_bytes & 1))
>> -            src_align = 2;
>> -          if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
>> -            set_mem_align (src, src_align * BITS_PER_UNIT);
>> +          dst = emit_memmov (dst, &src, destreg, srcreg, piece_size);
>> +          copied_bytes += piece_size;
>>          }
>> -      off = 4;
>> -      emit_insn (gen_strmov (destreg, dst, srcreg, src));
>>      }
>> -  dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
>> -  src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
>> +
>>    if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
>>      set_mem_align (dst, desired_align * BITS_PER_UNIT);
>>    if (src_align_bytes >= 0)
>>      {
>> -      unsigned int src_align = 0;
>> -      if ((src_align_bytes & 7) == (align_bytes & 7))
>> -        src_align = 8;
>> -      else if ((src_align_bytes & 3) == (align_bytes & 3))
>> -        src_align = 4;
>> -      else if ((src_align_bytes & 1) == (align_bytes & 1))
>> -        src_align = 2;
>> +      unsigned int src_align;
>> +      for (src_align = desired_align; src_align >= 2; src_align >>= 1)
>> +        {
>> +          if ((src_align_bytes & (src_align - 1))
>> +              == (align_bytes & (src_align - 1)))
>> +            break;
>> +        }
>>        if (src_align > (unsigned int) desired_align)
>>          src_align = desired_align;
>>        if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
>>          set_mem_align (src, src_align * BITS_PER_UNIT);
>> @@ -22666,42 +22650,24 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
>>  static int
>>  decide_alignment (int align,
>>                    enum stringop_alg alg,
>> -                  int expected_size)
>> +                  int expected_size,
>> +                  enum machine_mode move_mode)
>>  {
>>    int desired_align = 0;
>> -  switch (alg)
>> -    {
>> -    case no_stringop:
>> -      gcc_unreachable ();
>> -    case loop:
>> -    case unrolled_loop:
>> -      desired_align = GET_MODE_SIZE (Pmode);
>> -      break;
>> -    case rep_prefix_8_byte:
>> -      desired_align = 8;
>> -      break;
>> -    case rep_prefix_4_byte:
>> -      /* PentiumPro has special logic triggering for 8 byte aligned blocks.
>> -         copying whole cacheline at once.  */
>> -      if (TARGET_PENTIUMPRO)
>> -        desired_align = 8;
>> -      else
>> -        desired_align = 4;
>> -      break;
>> -    case rep_prefix_1_byte:
>> -      /* PentiumPro has special logic triggering for 8 byte aligned blocks.
>> -         copying whole cacheline at once.  */
>> -      if (TARGET_PENTIUMPRO)
>> -        desired_align = 8;
>> -      else
>> -        desired_align = 1;
>> -      break;
>> -    case loop_1_byte:
>> -      desired_align = 1;
>> -      break;
>> -    case libcall:
>> -      return 0;
>> -    }
>> +
>> +  gcc_assert (alg != no_stringop);
>> +
>> +  if (alg == libcall)
>> +    return 0;
>> +  if (move_mode == VOIDmode)
>> +    return 0;
>> +
>> +  desired_align = GET_MODE_SIZE (move_mode);
>> +  /* PentiumPro has special logic triggering for 8 byte aligned blocks.
>> +     copying whole cacheline at once.  */
>> +  if (TARGET_PENTIUMPRO
>> +      && (alg == rep_prefix_4_byte || alg == rep_prefix_1_byte))
>> +    desired_align = 8;
>>
>>    if (optimize_size)
>>      desired_align = 1;
>> @@ -22709,6 +22675,7 @@ decide_alignment (int align,
>>      desired_align = align;
>>    if (expected_size != -1 && expected_size < 4)
>>      desired_align = align;
>> +
>>    return desired_align;
>>  }
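>> A sanity check of the new logic (my reading, not part of the patch):
>> for the loop algorithms move_mode is word_mode, whose size matches the
>> old GET_MODE_SIZE (Pmode) choice on typical targets, and for
>> vector_loop with, say, a 16-byte move_mode the desired alignment
>> automatically grows to 16 to match the chunk size.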
>> @@ -22765,6 +22732,8 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
>>    int dynamic_check;
>>    bool need_zero_guard = false;
>>    bool noalign;
>> +  enum machine_mode move_mode = VOIDmode;
>> +  int unroll_factor = 1;
>>
>>    if (CONST_INT_P (align_exp))
>>      align = INTVAL (align_exp);
>> @@ -22788,50 +22757,60 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
>>
>>    /* Step 0: Decide on preferred algorithm, desired alignment and
>>       size of chunks to be copied by main loop.  */
>> -
>>    alg = decide_alg (count, expected_size, false, &dynamic_check, &noalign);
>> -  desired_align = decide_alignment (align, alg, expected_size);
>> -
>> -  if (!TARGET_ALIGN_STRINGOPS || noalign)
>> -    align = desired_align;
>> -
>>    if (alg == libcall)
>>      return false;
>>    gcc_assert (alg != no_stringop);
>> +
>>    if (!count)
>>      count_exp = copy_to_mode_reg (GET_MODE (count_exp), count_exp);
>>    destreg = copy_addr_to_reg (XEXP (dst, 0));
>>    srcreg = copy_addr_to_reg (XEXP (src, 0));
>> +
>> +  unroll_factor = 1;
>> +  move_mode = word_mode;
>>    switch (alg)
>>      {
>>      case libcall:
>>      case no_stringop:
>>        gcc_unreachable ();
>> +    case loop_1_byte:
>> +      need_zero_guard = true;
>> +      move_mode = QImode;
>> +      break;
>>      case loop:
>>        need_zero_guard = true;
>> -      size_needed = GET_MODE_SIZE (word_mode);
>>        break;
>>      case unrolled_loop:
>>        need_zero_guard = true;
>> -      size_needed = GET_MODE_SIZE (word_mode) * (TARGET_64BIT ? 4 : 2);
>> +      unroll_factor = (TARGET_64BIT ? 4 : 2);
>> +      break;
>> +    case vector_loop:
>> +      need_zero_guard = true;
>> +      unroll_factor = 4;
>> +      /* Find the widest supported mode.  */
>> +      move_mode = Pmode;
>> +      while (optab_handler (mov_optab, GET_MODE_WIDER_MODE (move_mode))
>> +             != CODE_FOR_nothing)
>> +        move_mode = GET_MODE_WIDER_MODE (move_mode);
>>        break;
>>      case rep_prefix_8_byte:
>> -      size_needed = 8;
>> +      move_mode = DImode;
>>        break;
>>      case rep_prefix_4_byte:
>> -      size_needed = 4;
>> +      move_mode = SImode;
>>        break;
>>      case rep_prefix_1_byte:
>> -      size_needed = 1;
>> -      break;
>> -    case loop_1_byte:
>> -      need_zero_guard = true;
>> -      size_needed = 1;
>> +      move_mode = QImode;
>>        break;
>>      }
>> -
>> +  size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>>    epilogue_size_needed = size_needed;
>>
>> +  desired_align = decide_alignment (align, alg, expected_size, move_mode);
>> +  if (!TARGET_ALIGN_STRINGOPS || noalign)
>> +    align = desired_align;
>>
>> For SSE codegen, won't we need to track down whether the destination
>> was aligned to generate aligned/unaligned moves?
>>
>> Otherwise the patch seems reasonable.  Thanks for submitting it by
>> pieces so the review is easier.
>>
>> Honza
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
--
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.