* Alexey Dobriyan <adobri...@gmail.com> wrote:

> Current memset() implementation does silly things:
> * multiplication to get wide constant:
>       waste of cycles if filler is known at compile time,
> 
> * REP STOSQ followed by REP STOSB:
>       this path is chosen when REP STOSB is slow, yet REP STOSB is still
>       used for the short tail (< 8 bytes), where its setup overhead is
>       relatively large,
> 
> * suboptimal calling convention:
>       REP STOSB/STOSQ favours (rdi, rcx)
> 
> * memset_orig():
>       it is hard to even look at it :^)
> 
> New implementation is based on the following observations:
> * c == 0 is the most common form,
>       filler can be done with "xor eax, eax" and pushed into memset()
>       saving 2 bytes per call and multiplication
> 
> * len divisible by 8 is the most common form:
>       all it takes is one pointer or unsigned long inside structure,
>       dispatch at compile time to code without those ugly "let's fill
>       at most 7 bytes" tails,
> 
> * multiplication to get wider filler value can be done at compile time
>   for "c != 0" with 1 insn/10 bytes at most saving multiplication.
> 
> * those leaner forms of memset can be done within 3/4 registers (RDI,
>   RCX, RAX, [RSI]) saving the rest from clobbering.
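
The observations above (byte broadcast by multiplication, and compile-time dispatch when the length is a multiple of 8) can be sketched in plain C. This is only an illustration, not the code from the patch: the names broadcast_byte(), fill_words() and memset_sketch() are hypothetical, the dispatch assumes the GCC/Clang __builtin_constant_p() builtin, and an ordinary loop stands in for REP STOSQ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: replicate the low byte of c into all 8 bytes of a
 * 64-bit word. With a compile-time constant c the compiler can fold the
 * multiplication away, which is the saving the patch description is after.
 */
static inline uint64_t broadcast_byte(int c)
{
	return (uint64_t)(unsigned char)c * 0x0101010101010101ULL;
}

/*
 * Hypothetical fast path: fill len bytes, where len is a multiple of 8,
 * using only 8-byte stores, so no "at most 7 bytes" tail is needed.
 * The loop is a portable stand-in for REP STOSQ.
 */
static inline void *fill_words(void *dst, uint64_t pattern, size_t len)
{
	unsigned char *p = dst;
	size_t i;

	assert(len % 8 == 0);
	for (i = 0; i < len; i += 8)
		memcpy(p + i, &pattern, 8);	/* alignment-safe 8-byte store */
	return dst;
}

/*
 * Hypothetical dispatcher: take the tail-free path only when len is known
 * at compile time and divisible by 8; otherwise fall back to memset().
 */
#define memset_sketch(dst, c, len)					\
	((__builtin_constant_p(len) && (len) % 8 == 0)			\
		? fill_words((dst), broadcast_byte(c), (len))		\
		: memset((dst), (c), (len)))
```

With a call such as memset_sketch(&obj, 0, sizeof(obj)), the length is a compile-time constant, so the choice between the two paths is made by the compiler and costs nothing at run time.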

Ok, sorry about the belated reply - all that sounds like very nice 
improvements!

> Note: "memset0" name is chosen because "bzero" is officially deprecated.
> Note: memset(,0,) form is interleaved into memset(,c,) form to save
> space.
> 
> QUESTION: is it possible to tell gcc "this function is semantically
> equivalent to memset(3) so make high level optimizations but call it
> when it is necessary"? I suspect the answer is "no" :-\

No idea ...

> TODO:
>       CONFIG_FORTIFY_SOURCE is enabled by distros
>       benchmarks
>       testing
>       more comments
>       check with memset_io() so that no surprises pop up

I'd only like to make happy noises here to make sure you continue with 
this work - it does look promising. :-)
 
Thanks,

        Ingo
