Attached is part 1 of the patch that enables the use of vector instructions in memset and memcpy expansion (the middle-end part). Most of the changes are in the functions move_by_pieces/set_by_pieces. In the new version the move-mode selection algorithm was changed: it now checks whether the alignment is known at compile time and uses cost models to choose between aligned and unaligned, vector and non-vector move modes.
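
To make the selection idea concrete, here is a minimal sketch of a cost-based choice between scalar, aligned-vector and unaligned-vector moves. It is not the code from the patch: the names (move_kind, move_costs, pieces_cost, choose_move_kind), the piece sizes and the way the tail is charged are all assumptions made for illustration, and the real implementation works with machine modes and the target cost tables.

```c
#include <stddef.h>

/* Hypothetical move kinds; the patch itself selects machine modes. */
enum move_kind { MOVE_SCALAR, MOVE_VECTOR_ALIGNED, MOVE_VECTOR_UNALIGNED };

/* Illustrative per-move costs; real values would come from the target. */
struct move_costs {
  unsigned scalar;           /* one word-sized (or smaller) move */
  unsigned vector_aligned;   /* one aligned vector move */
  unsigned vector_unaligned; /* one unaligned vector move */
};

/* Assumed piece sizes in bytes for this sketch. */
enum { WORD_BYTES = 8, VECTOR_BYTES = 16 };

/* Cost of handling SIZE bytes with pieces of PIECE bytes; each leftover
   tail byte is charged one move of cost TAIL_COST.  */
static unsigned
pieces_cost (size_t size, size_t piece, unsigned piece_cost,
             unsigned tail_cost)
{
  return (unsigned) (size / piece) * piece_cost
         + (unsigned) (size % piece) * tail_cost;
}

/* Pick the cheapest kind of move for SIZE bytes.  ALIGN is the alignment
   known at compile time, in bytes; aligned vector moves are considered
   only when it is at least VECTOR_BYTES.  Ties keep the scalar choice.  */
static enum move_kind
choose_move_kind (size_t size, size_t align, const struct move_costs *c)
{
  enum move_kind best = MOVE_SCALAR;
  unsigned best_cost = pieces_cost (size, WORD_BYTES, c->scalar, c->scalar);

  unsigned unaligned
    = pieces_cost (size, VECTOR_BYTES, c->vector_unaligned, c->scalar);
  if (unaligned < best_cost)
    {
      best = MOVE_VECTOR_UNALIGNED;
      best_cost = unaligned;
    }

  if (align >= VECTOR_BYTES)
    {
      unsigned aligned
        = pieces_cost (size, VECTOR_BYTES, c->vector_aligned, c->scalar);
      if (aligned < best_cost)
        best = MOVE_VECTOR_ALIGNED;
    }

  return best;
}
```

With such a model, a 64-byte block with known 16-byte alignment would pick aligned vector moves, while the same block with unknown alignment would fall back to unaligned vector or scalar moves depending on how expensive the target makes unaligned accesses.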
The build and 'make check' were tested. 'make check' shows one failure, which goes away once the complete patch is applied.

On 27 September 2011 18:44, Michael Zolotukhin <michael.v.zolotuk...@gmail.com> wrote:
> I divided the patch into three smaller ones:
>
> 1) Patch with target-independent changes (see attached file
> memfunc-mid.patch).
> The main part of the changes is in functions
> move_by_pieces/set_by_pieces. In new version algorithm of move-mode
> selection was changed - now it checks if alignment is known at compile
> time and uses cost-models to choose between aligned and unaligned
> vector or not-vector move-modes.
>
> 2) Patch with target-dependent changes (memfunc-be.patch).
> The main part of the changes is in functions
> ix86_expand_setmem/ix86_expand_movmem. The other changes are only
> needed to support it.
> The changes mostly touched unrolled_loop strategy - now vector move
> modes could be used here. That resulted in large epilogues and
> prologues, so their generation also was modified.
> This patch contains some changes in middle-end (to make build
> possible) - but all these changes are present in the first patch, so
> there is no need to review them here.
>
> 3) Patch with all new tests (memfunc-tests.patch).
> This patch contains a lot of small tests for different memset and memcopy
> cases.
>
> Separately from each other, these patches won't give performance gain.
> The positive effect will be noticeable only if they are applied
> together (I attach the complete patch also - see file
> memfunc-complete.patch).
>
>
> If you have any questions regarding these changes, please don't
> hesitate to ask them.
>
>
> On 18 July 2011 15:00, Michael Zolotukhin
> <michael.v.zolotuk...@gmail.com> wrote:
>> Here is a summary - probably, it doesn't cover every single piece in
>> the patch, but I tried to describe the major changes. I hope this will
>> help you a bit - and of course I'll answer your further questions if
>> they appear.
>>
>> The changes could be logically divided into two parts (though, these
>> parts have something in common).
>> The first part is changes in target-independent part, in functions
>> move_by_pieces() and store_by_pieces() - mostly located in expr.c.
>> The second part touches ix86_expand_movmem() and ix86_expand_setmem()
>> - mostly located in config/i386/i386.c.
>>
>> Changes in i386.c (target-dependent part):
>> 1) Strategies for cases with known and unknown alignment are separated
>> from each other.
>> When alignment is known at compile time, we could generate optimized
>> code without libcalls.
>> When it's unknown, we sometimes could create runtime-checks to reach
>> desired alignment, but not always.
>> Strategies for atom and generic_32, generic_64 were chosen according
>> to set of experiments, strategies in other
>> cost models are unchanged (strategies for unknown alignment are copied
>> from existing strategies).
>> 2) unrolled_loop algorithm was modified - now it uses SSE move-modes,
>> if they're available.
>> 3) As size of data, moved in one iteration, greatly increased, and
>> epilogues became bigger - so some changes were needed in epilogue
>> generation. In some cases a special loop (not unrolled) is generated
>> in epilogue to avoid slow copying by bytes (changes in
>> expand_set_or_movmem_via_loop() and introducing of
>> expand_set_or_movmem_via_loop_with_iter() is made for these cases).
>> 4) As bigger alignment might be needed than previously, prologue
>> generation was also modified.
>>
>> Changes in expr.c (target-independent part):
>> There are two possible strategies now: use of aligned and unaligned
>> moves. For each of them a cost model was implemented and the choice is
>> made according to the cost of each option. Move-mode choice is made by
>> functions widest_mode_for_unaligned_mov() and
>> widest_mode_for_aligned_mov().
>> Cost estimation is implemented in functions compute_aligned_cost() and
>> compute_unaligned_cost().
>> Choice between these two strategies and the generation of moves
>> themselves are in function move_by_pieces().
>>
>> Function store_by_pieces() calls set_by_pieces_1() instead of
>> store_by_pieces_1(), if this is memset-case (I needed to introduce
>> set_by_pieces_1 to separate memset-case from others -
>> store_by_pieces_1 is sometimes called for strcpy and some other
>> functions, not only for memset).
>>
>> Set_by_pieces_1() estimates costs of aligned and unaligned strategies
>> (as in move_by_pieces() ) and generates moves for memset. Single move
>> is generated via
>> generate_move_with_mode(). If it's called first time, a promoted value
>> (register, filled with one-byte value of memset argument) is generated
>> - later calls reuse this value.
>>
>> Changes in MD-files:
>> For generation of promoted values, I made some changes in
>> promote_duplicated_reg() and promote_duplicated_reg_to_size(). Expands
>> for vec_dup4si and vec_dupv2di were introduced for this too (these
>> expands differ from corresponding define_insns - existing define_insn
>> work only with registers, while new expands could process memory
>> operand as well).
>>
>> Some code was added to allow generation of MOVQ (with SSE-registers)
>> - such moves aren't usual ones, because they use only half of
>> xmm-register.
>> There was a need to generate such moves explicitly, so I added a
>> simple expand to sse.md.
>>
>>
>> On 16 July 2011 03:24, Jan Hubicka <hubi...@ucw.cz> wrote:
>>>> > New algorithm for move-mode selection is implemented for move_by_pieces,
>>>> > store_by_pieces.
>>>> > x86-specific ix86_expand_movmem and ix86_expand_setmem are also changed
>>>> > in
>>>> > similar way, x86 cost-models parameters are slightly changed to support
>>>> > this. This implementation checks if array's alignment is known at compile
>>>> > time and chooses expanding algorithm and move-mode according to it.
>>>
>>> Can you give some summary of changes you made?
>>> It would make it a lot
>>> easier to
>>> review if it was broken up into the generic changes (with rationale why they
>>> are
>>> needed) and i386 backend changes that I could review then.
>>>
>>> From first pass through the patch I don't quite see the need for i.e. adding
>>> new move patterns when we can output all kinds of SSE moves already. Will
>>> look
>>> more into the patch to see if I can come up with useful comments.
>>>
>>> Honza
>>>
>>
>
> --
> ---
> Best regards,
> Michael V. Zolotukhin,
> Software Engineer
> Intel Corporation.
>

--
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.