> 128 is about the upper bound you can expand with SSE moves.
> Tuning did not take code size into account and measured only when the
> code is in a tight loop.
> For GPR moves the limit is around 64.
Thanks for the data - I've not performed measurements with this
implementation yet, but we certainly should adjust the thresholds to
avoid performance degradation on small sizes.

Michael

On 10 April 2013 22:53, Ondřej Bílka <nel...@seznam.cz> wrote:
> On Wed, Apr 10, 2013 at 09:53:09PM +0400, Michael Zolotukhin wrote:
>> > Hi, I am writing a memcpy for libc. It avoids the computed jump and
>> > is much faster on small strings (variant for Sandy Bridge attached).
>>
>> I'm not sure I get what you meant - could you please explain what
>> computed jumps are?
> A computed goto. See Duff's device; it works almost exactly the same
> way.
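> For illustration, such a dispatch might look like the sketch below
> (a minimal example using the GNU labels-as-values extension; the
> function and label names are hypothetical, and this is not the actual
> libc code). The indirect "goto *" is the computed jump that is hard
> for the branch predictor when small sizes vary:
>
>     #include <stddef.h>
>
>     /* Copy n bytes, 0 <= n <= 4, Duff's-device style.  */
>     static void copy_upto_4 (char *dst, const char *src, size_t n)
>     {
>       static const void *const table[] =
>         { &&do0, &&do1, &&do2, &&do3, &&do4 };
>       goto *table[n];           /* computed jump on the size */
>     do4: dst[3] = src[3];       /* fall through */
>     do3: dst[2] = src[2];
>     do2: dst[1] = src[1];
>     do1: dst[0] = src[0];
>     do0: return;
>     }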
>>
>> > You must also check performance with a cold instruction cache.
>> > Now memcpy(x,y,128) takes 126 bytes, which is too much.
>>
>> > Do not align for small sizes. The dependency this causes erases any
>> > gains that you might get. Keep in mind that in 55% of cases the data
>> > are already aligned.
>>
>> Other algorithms are still available and we can use them for small
>> sizes. E.g. for sizes <128 we could emit a loop with GPR moves and
>> not use vector instructions in it.
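>> For illustration, such a GPR-move loop might behave like the
>> following C-level sketch, assuming 8-byte registers on x86_64 (the
>> helper name is hypothetical; this only models what the expanded
>> sequence does, it is not the patch itself):
>>
>>     #include <stdint.h>
>>     #include <string.h>
>>     #include <stddef.h>
>>
>>     static void copy_small_gpr (char *dst, const char *src, size_t n)
>>     {
>>       size_t i = 0;
>>       for (; i + 8 <= n; i += 8)    /* one GPR load + one GPR store */
>>         {
>>           uint64_t w;
>>           memcpy (&w, src + i, 8);  /* unaligned-safe 8-byte load */
>>           memcpy (dst + i, &w, 8);
>>         }
>>       for (; i < n; i++)            /* byte tail */
>>         dst[i] = src[i];
>>     }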
>
> 128 is about the upper bound you can expand with SSE moves.
> Tuning did not take code size into account and measured only when the
> code is in a tight loop.
> For GPR moves the limit is around 64.
>
> What matters is which code has the best performance/size ratio.
>> But that's tuning and I haven't worked on it yet - I'm going to
>> measure the performance of all algorithms on all sizes and thus
>> determine which algorithm is preferable for which sizes.
>> What I did in this patch is introduce some infrastructure to allow
>> emitting vector moves in movmem expansion - tuning is certainly
>> possible and needed, but that's out of the scope of this patch.
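>> As a rough C-level picture, the vector moves emitted by a
>> vector_loop expansion behave like the sketch below, assuming SSE2
>> 16-byte moves (illustrative only - GCC emits RTL during expansion,
>> not intrinsic calls, and the helper name is made up):
>>
>>     #include <emmintrin.h>
>>     #include <stddef.h>
>>
>>     static void copy_vector_loop (char *dst, const char *src, size_t n)
>>     {
>>       size_t i = 0;
>>       for (; i + 16 <= n; i += 16)   /* 16 bytes per iteration */
>>         {
>>           __m128i v = _mm_loadu_si128 ((const __m128i *) (src + i));
>>           _mm_storeu_si128 ((__m128i *) (dst + i), v);
>>         }
>>       for (; i < n; i++)             /* scalar epilogue */
>>         dst[i] = src[i];
>>     }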
>>
>> On 10 April 2013 21:43, Ondřej Bílka <nel...@seznam.cz> wrote:
>> > On Wed, Apr 10, 2013 at 08:14:30PM +0400, Michael Zolotukhin wrote:
>> >> Hi,
>> >> This patch adds a new algorithm for expanding movmem on x86 and
>> >> slightly refactors the existing implementation. This is a
>> >> reincarnation of a patch that was sent but wasn't checked in a
>> >> couple of years ago - I have now reworked it from scratch and
>> >> divided it into several more manageable parts.
>> >>
>> > Hi, I am writing a memcpy for libc. It avoids the computed jump and
>> > is much faster on small strings (variant for Sandy Bridge attached).
>> >
>> >> For now this algorithm isn't used, because the cost models are
>> >> tuned to use the existing ones. I believe the new algorithm will
>> >> give better performance, but I'll leave cost-model tuning for a
>> >> separate patch.
>> >>
>> > You must also check performance with a cold instruction cache.
>> > Now memcpy(x,y,128) takes 126 bytes, which is too much.
>> >
>> >> Also, I changed get_mem_align_offset to make it handle MEM_REFs as
>> >> well. Probably, there is another way of getting info about alignment -
>> >> if so, please let me know.
>> >>
>> > Do not align for small sizes. The dependency this causes erases any
>> > gains that you might get. Keep in mind that in 55% of cases the data
>> > are already aligned.
>> >
>> > Also, in my tests the best way to handle the prologue is to copy
>> > the last 16 bytes first and then loop.
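>> > A minimal sketch of that scheme, assuming n >= 16 and
>> > non-overlapping buffers (the helper name is hypothetical): store
>> > the last 16 bytes up front, then loop from the start in 16-byte
>> > steps; the loop's final stores may overlap bytes the first copy
>> > already wrote, which is harmless and removes the need for a
>> > separate tail branch.
>> >
>> >     #include <string.h>
>> >     #include <stddef.h>
>> >
>> >     static void copy_tail_first (char *dst, const char *src, size_t n)
>> >     {
>> >       memcpy (dst + n - 16, src + n - 16, 16);  /* tail first */
>> >       for (size_t i = 0; i + 16 < n; i += 16)   /* may overlap tail */
>> >         memcpy (dst + i, src + i, 16);
>> >     }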
>> >
>> >> Similar improvements could be done in the expansion of memset, but
>> >> that's in progress now and I'm going to proceed with it if this
>> >> patch is OK.
>> >>
>> >> Bootstrap/make check/Specs2k are passing on i686 and x86_64.
>> >>
>> >> Is it ok for trunk?
>> >>
>> >> Changelog entry:
>> >>
>> >> 2013-04-10  Michael Zolotukhin  <michael.v.zolotuk...@gmail.com>
>> >>
>> >>         * config/i386/i386-opts.h (enum stringop_alg): Add vector_loop.
>> >>         * config/i386/i386.c (expand_set_or_movmem_via_loop): Use
>> >>         adjust_address instead of change_address to keep info about
>> >>         alignment.
>> >>         (emit_strmov): Remove.
>> >>         (emit_memmov): New function.
>> >>         (expand_movmem_epilogue): Refactor to properly handle bigger
>> >>         sizes.
>> >>         (expand_setmem_epilogue): Likewise and return updated rtx for
>> >>         destination.
>> >>         (expand_constant_movmem_prologue): Likewise and return updated
>> >>         rtx for destination and source.
>> >>         (decide_alignment): Refactor, handle vector_loop.
>> >>         (ix86_expand_movmem): Likewise.
>> >>         (ix86_expand_setmem): Likewise.
>> >>         * config/i386/i386.opt (Enum): Add vector_loop to option
>> >>         stringop_alg.
>> >>         * emit-rtl.c (get_mem_align_offset): Compute alignment for
>> >>         MEM_REF.

--
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.
