* Alexey Dobriyan <adobri...@gmail.com> wrote:

> Current memset() implementation does silly things:
> * multiplication to get wide constant:
>   waste of cycles if filler is known at compile time,
>
> * REP STOSQ followed by REP STOSB:
>   this code is used when REP STOSB is slow but still it is used
>   for small length (< 8) when setup overhead is relatively big,
>
> * suboptimal calling convention:
>   REP STOSB/STOSQ favours (rdi, rcx)
>
> * memset_orig():
>   it is hard to even look at it :^)
>
> New implementation is based on the following observations:
> * c == 0 is the most common form,
>   filler can be done with "xor eax, eax" and pushed into memset()
>   saving 2 bytes per call and multiplication
>
> * len divisible by 8 is the most common form:
>   all it takes is one pointer or unsigned long inside structure,
>   dispatch at compile time to code without those ugly "let's fill
>   at most 7 bytes" tails,
>
> * multiplication to get wider filler value can be done at compile time
>   for "c != 0" with 1 insn/10 bytes at most saving multiplication.
>
> * those leaner forms of memset can be done within 3/4 registers (RDI,
>   RCX, RAX, [RSI]) saving the rest from clobbering.

Ok, sorry about the belated reply - all that sounds like very nice
improvements!

> Note: "memset0" name is chosen because "bzero" is officially deprecated.
> Note: memset(,0,) form is interleaved into memset(,c,) form to save
>       space.
>
> QUESTION: is it possible to tell gcc "this function is semantically
> equivalent to memset(3) so make high level optimizations but call it
> when it is necessary"? I suspect the answer is "no" :-\

No idea ...

> TODO:
>       CONFIG_FORTIFY_SOURCE is enabled by distros
>       benchmarks
>       testing
>       more comments
>       check with memset_io() so that no surprises pop up

I'd only like to make happy noises here to make sure you continue with
this work - it does look promising. :-)

Thanks,

	Ingo
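
For readers following the thread, the compile-time dispatch described in
the quoted patch notes can be roughly sketched in plain C as below. This
is an illustration only, not the code from the patch: sketch_memset(),
fill_qwords() and WIDEN() are made-up names, __builtin_constant_p() is a
GCC/Clang builtin, and the real implementation is assembly with its own
calling convention, which portable C cannot express.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: fill len bytes with a 64-bit pattern.
 * Only valid when len is a multiple of 8, so no byte-sized tail is needed.
 */
static inline void fill_qwords(void *dst, uint64_t pattern, size_t len)
{
	unsigned char *p = dst;
	size_t i;

	assert(len % 8 == 0);
	for (i = 0; i < len; i += 8)
		memcpy(p + i, &pattern, 8);	/* alignment-safe 8-byte store */
}

/*
 * Replicate the filler byte into all 8 byte lanes. When c is a
 * compile-time constant the multiplication is folded by the compiler,
 * and for c == 0 the result is simply 0.
 */
#define WIDEN(c) ((uint64_t)(unsigned char)(c) * 0x0101010101010101ULL)

/*
 * Hypothetical dispatch macro: if the length is known at compile time
 * and divisible by 8, use the tail-free qword fill; otherwise fall back
 * to the generic memset().
 */
#define sketch_memset(dst, c, len)					\
	((__builtin_constant_p(len) && (len) % 8 == 0)			\
		? (fill_qwords((dst), WIDEN(c), (len)), (void *)(dst))	\
		: memset((dst), (c), (len)))
```

With this, a call such as sketch_memset(&s, 0, sizeof(s)) on a structure
whose size is a multiple of 8 resolves to the qword path at compile time,
while a call with a runtime length still goes through plain memset().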