Hi Jan! Thanks for the review; you can find my answers to some of your remarks below. I'll send a corrected patch soon, along with answers to the rest of your remarks.
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>     {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>     {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
>
> I am bit concerned about explossion of variants, but adding aligned variants
> probably makes sense. I guess we should specify what alignment needs to be
> known. I.e. is alignment of 2 enough or shall the alignment be matching the
> size of loads/stores produced?

Yes, the alignment should match the size of the loads/stores, and the offset
from the alignment boundary should be known as well. Otherwise, the strategies
for unknown alignment are chosen.

> This hunk seems dangerous in a way that by emitting the explicit loadq/storeq
> pairs (and similar) will prevent use of integer registers for 64bit/128bit
> arithmetic.
>
> I guess we could play such tricks for memory-memory moves & constant stores.
> With gimple optimizations we already know pretty well that the moves will
> stay as they are. That might be enough for you?

Yes, in theory it could harm 64/128-bit arithmetic, but what else can we do
when we have a DImode mem-to-mem move and the target is 32-bit? Ideally, the
RA should be able to decide how to perform such moves, but currently it
doesn't generate SSE moves; once it is able to do so, I think we could remove
this part and rely on the RA.

One more point: this is quite a special case, because here we want to perform
the move via half of a vector register. That is the main reason why these
particular cases are handled in a special, rather than the common, way.
> I wrote the original function, but it is not really clear for me what the
> function does now. I.e. what is code for updating addresses and what means
> reusing iter. I guess reusing iter means that we won't start the loop from 0.
> Could you expand comments a bit more?
>
> I know I did not documented them originally, but all the parameters ought to
> be explicitely documented in a function comment.

Yep, you're right: we just don't start the loop from 0. I'll send a version
with the comments soon.

> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to
> DEST. */
> +/* Emit strset instuction. If RHS is constant, and vector mode will be used,
> +   then move this consatnt to a vector register before emitting strset. */
> +static void
> +emit_strset (rtx destmem, rtx value,
> +             rtx destptr, enum machine_mode mode, int offset)
>
> This seems to more naturally belong into gen_strset expander?

I don't think it matters much here, but to make emit_strset look similar to
emit_strmov, most of the emit_strset body really could be moved to
gen_strset.

>   if (max_size > 16)
>     {
>       rtx label = ix86_expand_aligntest (count, 16, true);
>       if (TARGET_64BIT)
>         {
> -         dest = change_address (destmem, DImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, DImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode, value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode, value)));
>
> No use for 128bit moves here?
>         }
>       else
>         {
> -         dest = change_address (destmem, SImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, SImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
>
> And here?

For memset prologues/epilogues I avoid using vector moves, as they could
require expensive initialization (we need to create a vector filled with the
given value). For the other cases, I'll re-check whether vector moves can be
used.

> @@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp,
> rtx destreg, rtx srcreg,
>       off = 4;
>       emit_insn (gen_strmov (destreg, dst, srcreg, src));
>     }
> +  if (align_bytes & 8)
> +    {
> +      if (TARGET_64BIT || TARGET_SSE)
> +        {
> +          dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> +          src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> +          emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +        }
> +      else
> +        {
> +          dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> +          src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> +          emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +          emit_insn (gen_strmov (destreg, dst, srcreg, src));
>
> again, no use for vector moves?

Actually, vector moves are used here when they are available: if we are in
32-bit mode (so the move can't go through a single GPR) but SSE is available,
an SSE move is generated.

> +/* Target hook. Returns rtx of mode MODE with promoted value VAL, that is
> +   supposed to represent one byte. MODE could be a vector mode.
> +   Example:
> +   1) VAL = const_int (0xAB), mode = SImode,
> +   the result is const_int (0xABABABAB).
>
> This can be handled in machine independent way, right?
>
> +   2) if VAL isn't const, then the result will be the result of
> MUL-instruction
> +   of VAL and const_int (0x01010101) (for SImode). */
>
> This would probably go better as named expansion pattern, like we do for
> other machine description interfaces.

I don't think it can be done in a machine-independent way. E.g., if AVX is
available, we can use broadcast instructions; if not, we need to use multiply
instructions; and on other architectures there are probably other, more
efficient ways to duplicate a byte value across the entire vector register.
So IMO it's a good place for a hook.

> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 90cef1c..4b7d67b 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -5780,6 +5780,32 @@ mode returned by
> @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
>  The default is zero which means to not iterate over other vector sizes.
>  @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum
> machine_mode @var{mode}, unsigned int @var{align})
> +This hook should return true if memory accesses in mode @var{mode} to data
> +aligned by @var{align} bits have a cost many times greater than aligned
> +accesses, for example if they are emulated in a trap handler.
>
> New hooks should go to the machine indpendent part of the patch.

These changes are present in the middle-end part too (of course, in the full
patch they are not duplicated). I left them in this patch as well to avoid
possible build problems.

--
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

On 27 October 2011 19:09, Jan Hubicka <hubi...@ucw.cz> wrote:
> Hi,
> sorry for delay with the review. This is my first pass through the backend
> part, hopefully someone else will do the middle end bits.
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 2c53423..d7c4330 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for
> tuning for size */
>   COSTS_N_BYTES (2),                    /* cost of FABS instruction.  */
>   COSTS_N_BYTES (2),                    /* cost of FCHS instruction.  */
>   COSTS_N_BYTES (2),                    /* cost of FSQRT instruction.  */
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>     {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>     {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
>
> I am bit concerned about explossion of variants, but adding aligned variants
> probably makes sense. I guess we should specify what alignment needs to be
> known. I.e. is alignment of 2 enough or shall the alignment be matching the
> size of loads/stores produced?
>
> @@ -15266,6 +15362,38 @@ ix86_expand_move (enum machine_mode mode, rtx
> operands[])
>     }
>   else
>     {
> +      if (mode == DImode
> +          && !TARGET_64BIT
> +          && TARGET_SSE2
> +          && MEM_P (op0)
> +          && MEM_P (op1)
> +          && !push_operand (op0, mode)
> +          && can_create_pseudo_p ())
> +        {
> +          rtx temp = gen_reg_rtx (V2DImode);
> +          emit_insn (gen_sse2_loadq (temp, op1));
> +          emit_insn (gen_sse_storeq (op0, temp));
> +          return;
> +        }
> +      if (mode == DImode
> +          && !TARGET_64BIT
> +          && TARGET_SSE
> +          && !MEM_P (op1)
> +          && GET_MODE (op1) == V2DImode)
> +        {
> +          emit_insn (gen_sse_storeq (op0, op1));
> +          return;
> +        }
> +      if (mode == TImode
> +          && TARGET_AVX2
> +          && MEM_P (op0)
> +          && !MEM_P (op1)
> +          && GET_MODE (op1) == V4DImode)
> +        {
> +          op0 = convert_to_mode (V2DImode, op0, 1);
> +          emit_insn (gen_vec_extract_lo_v4di (op0, op1));
> +          return;
> +        }
>
> This hunk seems dangerous in a way that by emitting the explicit
> loadq/storeq pairs (and similar) will prevent use of integer registers for
> 64bit/128bit arithmetic.
>
> I guess we could play such tricks for memory-memory moves & constant stores.
> With gimple optimizations we already know pretty well that the moves will
> stay as they are. That might be enough for you?
>
> If you go this way, please make separate patch so it can be benchmarked.
> While the moves are faster there are always problem with size mismatches in
> load/store buffers.
>
> I think for string operations we should output the SSE/AVX instruction
> variants by hand + we could try to instruct IRA to actually preffer use of
> SSE/AVX when feasible? This has been traditionally problem with old RA
> because it did not see that because pseudo is eventually used for
> arithmetics, it can not go into SSE register. So it was not possible to make
> MMX/SSE/AVX to be preferred variants for 64bit/128bit manipulations w/o
> hurting performance of code that does arithmetic on long long and __int128.
> Perhaps IRA can solve this now. Vladimir, do you have any ida?
>
> +/* Helper function for expand_set_or_movmem_via_loop.
> +   This function can reuse iter rtx from another loop and don't generate
> +   code for updating the addresses.  */
> +static rtx
> +expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
> +                                         rtx destptr, rtx srcptr, rtx value,
> +                                         rtx count, rtx iter,
> +                                         enum machine_mode mode, int unroll,
> +                                         int expected_size, bool change_ptrs)
>
> I wrote the original function, but it is not really clear for me what the
> function does now. I.e. what is code for updating addresses and what means
> reusing iter. I guess reusing iter means that we won't start the loop from 0.
> Could you expand comments a bit more?
>
> I know I did not documented them originally, but all the parameters ought to
> be explicitely documented in a function comment.
>
> @@ -20913,7 +21065,27 @@ emit_strmov (rtx destmem, rtx srcmem,
>    emit_insn (gen_strmov (destptr, dest, srcptr, src));
>  }
>
> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to
> DEST. */
> +/* Emit strset instuction. If RHS is constant, and vector mode will be used,
> +   then move this consatnt to a vector register before emitting strset. */
> +static void
> +emit_strset (rtx destmem, rtx value,
> +             rtx destptr, enum machine_mode mode, int offset)
>
> This seems to more naturally belong into gen_strset expander?
>  {
> -  if (TARGET_64BIT)
> -    {
> -      dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
> -      emit_insn (gen_strset (destptr, dest, value));
> -    }
> -  else
> -    {
> -      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
> -      emit_insn (gen_strset (destptr, dest, value));
> -      dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
> -      emit_insn (gen_strset (destptr, dest, value));
> -    }
> -  offset += 8;
> +  if (GET_MODE (destmem) != move_mode)
> +    destmem = change_address (destmem, move_mode, destptr);
>
> AFAIK change_address is not equilvalent into adjust_automodify_address_nv in
> a way it copies memory aliasing attributes and it is needed to zap them here
> since stringops behaves funily WRT aliaseing.
>
>   if (max_size > 16)
>     {
>       rtx label = ix86_expand_aligntest (count, 16, true);
>       if (TARGET_64BIT)
>         {
> -         dest = change_address (destmem, DImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, DImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode, value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode, value)));
>
> No use for 128bit moves here?
>
>         }
>       else
>         {
> -         dest = change_address (destmem, SImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, SImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
>
> And here?
>
> @@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
>    if (align <= 1 && desired_alignment > 1)
>      {
>        rtx label = ix86_expand_aligntest (destptr, 1, false);
> -      srcmem = change_address (srcmem, QImode, srcptr);
> -      destmem = change_address (destmem, QImode, destptr);
> +      srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
> +      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);
>
> You want to always use adjust_automodify_address or
> adjust_automodify_address_nv, adjust_automodify_address_1 is not intended
> for general use.
>
> @@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp,
> rtx destreg, rtx srcreg,
>       off = 4;
>       emit_insn (gen_strmov (destreg, dst, srcreg, src));
>     }
> +  if (align_bytes & 8)
> +    {
> +      if (TARGET_64BIT || TARGET_SSE)
> +        {
> +          dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> +          src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> +          emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +        }
> +      else
> +        {
> +          dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> +          src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> +          emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +          emit_insn (gen_strmov (destreg, dst, srcreg, src));
>
> again, no use for vector moves?
>
> +/* Target hook. Returns rtx of mode MODE with promoted value VAL, that is
> +   supposed to represent one byte. MODE could be a vector mode.
> +   Example:
> +   1) VAL = const_int (0xAB), mode = SImode,
> +   the result is const_int (0xABABABAB).
>
> This can be handled in machine independent way, right?
>
> +   2) if VAL isn't const, then the result will be the result of
> MUL-instruction
> +   of VAL and const_int (0x01010101) (for SImode). */
>
> This would probably go better as named expansion pattern, like we do for
> other machine description interfaces.
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 90cef1c..4b7d67b 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -5780,6 +5780,32 @@ mode returned by
> @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
>  The default is zero which means to not iterate over other vector sizes.
>  @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum
> machine_mode @var{mode}, unsigned int @var{align})
> +This hook should return true if memory accesses in mode @var{mode} to data
> +aligned by @var{align} bits have a cost many times greater than aligned
> +accesses, for example if they are emulated in a trap handler.
>
> New hooks should go to the machine indpendent part of the patch.
>
> Honza