"Richard B. Johnson" wrote:
> 
> With intel processors, the 'rep' before an instruction will not
> execute that instruction if ecx is already zero. You do not
> have to test. Also, a jump is often much more harmful in instruction
> time than straight-through instruction. For instance, the fastest
> 486 code for an unaligned copy is:
> 
>         movl    SRC(%esp), %esi
>         movl    DST(%esp), %edi
>         movl    CNT(%esp), %ecx
>         shrl    $1,%ecx
>         rep     movsw
>         adcl    %ecx,%ecx
>         rep     movsb


Agreed. But most of the time we are memsetting or memcpying memory
regions that are aligned at compile time or aligned by kmalloc.
In both cases the alignment is 4 or a higher power of 2, which
makes such code redundant.
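
To make the comparison concrete, here is a rough C model of the
quoted rep movsw / adcl / rep movsb sequence, together with the
4-byte alignment check that decides whether it is needed at all.
This is only a sketch: is_aligned4() and copy_words_then_byte() are
hypothetical names made up for illustration, not kernel functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero when p sits on a 4-byte boundary (low two bits clear). */
static int is_aligned4(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/*
 * C model of: shrl $1,%ecx / rep movsw / adcl %ecx,%ecx / rep movsb.
 * Hypothetical helper for illustration only.
 */
static void *copy_words_then_byte(void *dst, const void *src, size_t cnt)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t words = cnt >> 1;    /* shrl $1,%ecx: ecx = cnt / 2 */
    size_t carry = cnt & 1;     /* the bit shifted out lands in CF */

    while (words--) {           /* rep movsw: a zero count copies nothing */
        memcpy(d, s, 2);        /* one 16-bit move, safe when unaligned */
        d += 2;
        s += 2;
    }

    /* rep movsw leaves ecx = 0, so adcl %ecx,%ecx gives 0 + 0 + CF. */
    while (carry--)             /* rep movsb: copies 0 or 1 trailing byte */
        *d++ = *s++;

    return dst;
}
```

A caller would test is_aligned4() on both pointers first and fall
back to copy_words_then_byte() only when that check fails.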

 
> If it's longword aligned, i.e.,  both source and destination addresses
> are clear in their low two bits, moving longwords through the edx
> register, with eax and ebx being the index registers, is faster, even with
> a beginning test for longword size.
> 
>         movl    SRC(%esp), %eax
>         movl    DST(%esp), %ebx
>         movl    CNT(%esp), %ecx
>         testl   $3, %ecx
>         jz      2f
>         shrl    $2, %ecx        # long words CY set if an extra word
> 1:      movl    (%eax), %edx    # Do NOT touch EAX in the next instruction
>         movl    %edx, (%ebx)    # Do NOT touch EBX in the next instruction
>         leal    4(%eax), %eax   # Adjust EAX index now
>         leal    4(%ebx), %ebx   # Adjust EBX index now
>         decl    %ecx            # does not change CY
>         jnz     1b
> 
> 2:
> 
> To be able to run some instructions in parallel, you have to follow the
> idea shown in the above comments, i.e., don't touch an index register
> in the instructions immediately following its use to address memory.
>
> This will allow the memory access to occur during the parallel execution
> of the next instruction(s).

I made exactly that mistake in memcpy: I added 4 to the register that
the previous instruction had just used for its memory reference.
I'm not so sure about placing "decl" between the two "leal"s, though.
I am using "addl", which is supposed to go through the V pipe (at
least on the 586), just as "decl" can.
Anyway, I'll run some performance tests on an old 486 I have.
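
A minimal sketch of the kind of timing harness such a test could use,
assuming a hosted C environment with clock(). copy_longwords() and
time_copy() are hypothetical names. The C loop only models what the
quoted assembly does; the compiler picks the actual instruction order,
so the leal/decl scheduling question still has to be measured on the
hand-written assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical C model of the longword loop; bytes must be a multiple of 4. */
static void copy_longwords(void *dst, const void *src, size_t bytes)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t n = bytes >> 2;      /* shrl $2,%ecx: number of longwords */

    while (n != 0) {
        uint32_t tmp;
        memcpy(&tmp, s, 4);     /* movl (%eax), %edx */
        memcpy(d, &tmp, 4);     /* movl %edx, (%ebx) */
        s += 4;                 /* leal 4(%eax), %eax */
        d += 4;                 /* leal 4(%ebx), %ebx */
        n--;                    /* decl %ecx */
    }
}

typedef void (*copy_fn)(void *dst, const void *src, size_t bytes);

/* CPU time in seconds spent on `reps` calls of fn. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t bytes, unsigned reps)
{
    clock_t start = clock();

    while (reps--)
        fn(dst, src, bytes);

    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy() with the same buffers and repeat count for each
candidate routine gives numbers that can be compared directly. The
repeat count needs to be large enough to rise above the resolution
of clock().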
 

> The decl %ecx should be put BETWEEN the two `leal` instructions so that
> the address calculation can occur in parallel with the register operation.
> LEA does not affect the flags. In the example above I didn't do this
> because it makes the code unclear.
> 
> Various registers used as index registers are not all the same. Register
> EAX was not an index register in i386 machines.  It became one in i486
> machines. It is faster to use (%eax) than (%ebx).


Right. This is inherited from earlier '86 CPUs where "ax" was the
accumulator - that's why many arithmetic operations generate smaller
code when the target is ax/eax.
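
As a small illustration of the size difference, the sketch below
holds the 32-bit-mode encodings of "addl $0x12345678, %eax" and
"addl $0x12345678, %ebx" as byte arrays. The array and function
names are made up for this example. The saving applies to full
32-bit immediates; for small immediates assemblers normally choose
the sign-extended 8-bit form, which is the same length for every
register.

```c
#include <stddef.h>

/* addl $0x12345678, %eax: short accumulator form, opcode 0x05 + imm32. */
static const unsigned char add_imm32_eax[] = {
    0x05, 0x78, 0x56, 0x34, 0x12
};

/* addl $0x12345678, %ebx: general form, opcode 0x81 + ModRM 0xC3 + imm32. */
static const unsigned char add_imm32_ebx[] = {
    0x81, 0xC3, 0x78, 0x56, 0x34, 0x12
};

/* Bytes saved by targeting the accumulator in this case. */
static size_t accumulator_saving(void)
{
    return sizeof add_imm32_ebx - sizeof add_imm32_eax;
}
```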



best,
Petkan
