On Mon, 11 Jun 2018, Konstantin Belousov wrote:
On Mon, Jun 11, 2018 at 03:48:52PM +1000, Bruce Evans wrote:
Change TYPE to unsigned char[sizeof(old TYPE)] and use memcpy() to assign
the variables to fix this with small churn and without the pessimizations
in this commit. This gives portable code back to K&R1978 + memcpy(), and
is as efficient as the above when the above works. On x86, even gcc-3.3
produces a load and store for memcpy(p, q, sizeof(register_t)) when p and
q are void *.
Change TYPE to unsigned char[n] and use a single memcpy() without a loop
to assign the variables to fix this with larger churn and with different
(smaller?) pessimizations than in this commit. Setup and initialization
for this method is also simpler. This uses the newfangled VLA feature,
and since n is variable for qsort(), builtin memcpy() with length n
doesn't work so well.
Either results in the unacceptable stack use.
'unsigned char[sizeof(old TYPE)]' uses no memory and thus no stack provided
memcpy() is optimized to register moves.
Similarly for 'unsigned char[n]', except this might be harder to optimize.
If it uses memory, then it wasn't optimized to register moves. n must be
split up into blocks that fit in registers, just like the reverse of what
clang does now for combining byte-sized moves into larger blocks that fit
in registers.
Any source code may result in unacceptable stack usage if the compiler
optimizations aren't quite right. The compiler should do something like
combining assignments of small objects to assignments of objects of size n;
then reblock to fit this in registers. When this is not done right, parts
that don't fit in registers might be spilled to the stack, and when this
is done wrong the spill size might be nearly n or even larger.
In my example, clang and gcc-4.2.1 get the optimization with the VLA wrong
by putting the VLA on the stack and not doing any reblocking; clang
doesn't use any stack when it generates large code to reblock the byte
assignments (except when it uses AVX, signal handlers use more stack).
I can limit the char[es] and memcpy to some value of es, but I do not
see a point. I will not object if somebody decides to do it.
The old code reblocked es into smaller objects and got the type punning
for this wrong.
I checked the speed of qsort(). It is O(n*log(n)) for both swaps and
comparisons. Howver, I think most uses of it have small elements, since
for large data the elements should be keys or pointers to avoid swapping
the data. The old version is almost optimal for swapping pointers.
clang's sophisticted reblocking of byte swaps is really bad for swapping
pointers: on amd64:
XX movq plen(%rip), %r10
XX testq %r10, %r10
XX je .LBB2_18
plen is 8, so don't branch.
XX # %bb.1: # %for.body.lr.ph.i
XX movq q(%rip), %rcx
XX movq p(%rip), %rdx
XX cmpq $32, %r10
XX jb .LBB2_12
plen is < 32, so branch.
XX ....
XX .LBB2_12: # %for.body.i8.preheader
XX leaq -1(%r10), %rsi
XX movq %r10, %rdi
XX andq $3, %rdi
XX je .LBB2_15
plen is 0 mod 4, so branch.
XX ...
XX .LBB2_15: # %for.body.i8.prol.loopexit
XX cmpq $3, %rsi
XX jb .LBB2_18
plen - 1 is not below 3, so don't branch.
XX # %bb.16: # %for.body.i8.preheader.new
XX xorl %esi, %esi
XX .p2align 4, 0x90
XX .LBB2_17: # %for.body.i8
XX # =>This Inner Loop Header: Depth=1
XX movzbl (%rdx,%rsi), %edi
XX movzbl (%rcx,%rsi), %eax
XX movb %al, (%rdx,%rsi)
XX movb %dil, (%rcx,%rsi)
XX movzbl 1(%rdx,%rsi), %edi
XX movzbl 1(%rcx,%rsi), %eax
XX movb %al, 1(%rdx,%rsi)
XX movb %dil, 1(%rcx,%rsi)
XX movzbl 2(%rdx,%rsi), %edi
XX movzbl 2(%rcx,%rsi), %eax
XX movb %al, 2(%rdx,%rsi)
XX movb %dil, 2(%rcx,%rsi)
XX movzbl 3(%rdx,%rsi), %edi
XX movzbl 3(%rcx,%rsi), %eax
XX movb %al, 3(%rdx,%rsi)
XX movb %dil, 3(%rcx,%rsi)
XX addq $4, %rsi
End swapping copying a byte at a time after all. Swap 4 bytes per iteration.
XX cmpq %rsi, %r10
XX jne .LBB2_17
Iterate twice for the 8 byte pointer.
All this instead of 2 loads and 2 stores that the old code might have
generated.
Since es is not always 8, the old code must generate more than 2 loads
and 2 stores -- it will have to reblock using similar branches to the
above. It only uses 2 block sizes (sizeof(long) and 1), but the compiler
might pessimize this to the same as the above.
clang actually generates enormous^2 code with the current version and
enormous^3 code with the previous version. For the current version,
1742 lines, with 192 lines for just movups instructions. For the
previous version, it generates 4823 lines with 576 movups instructions.
gcc -O2 generates 619 lines for the current version and 1053 lines for
the previous version.
So I don't mind the simplification. It would take lots of __predict_ugly()
annotations to prevent unrolling that pessimizes the usual case.
Bruce
_______________________________________________
svn-src-head@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"