This patch/request for comment is my proposed partial solution to
PR target/122454. Although preserving TImode operations until after
reload has many benefits (including STV), it also introduces a small
number of complications (mainly from the use of SUBREGs).
Consider the last use of a TImode register x; this is typically split
into two DImode registers, hi and lo, using a sequence of two
instructions that looks like:
(set (reg:DI lo) (subreg:DI (reg:TI x) 0))
(set (reg:DI hi) (subreg:DI (reg:TI x) 8)) (expr_list:REG_DEAD (reg:TI x)
The challenge for register allocation (reload) is that between these
two instructions the register x is still considered to be live, so lo
conflicts with x, and can't reside in the same hard register used to hold
the lowpart of x. This leads to increased register pressure and suboptimal
register allocation. In PR 122454 this manifests as additional move
instructions (a P2 regression).
The proposed solution is to introduce a parallel move instruction:
(parallel [(set (reg:DI lo) lo_src)
(set (reg:DI hi) hi_src)])
which allows the TImode pseudo to be decomposed "atomically" without
extending the lifetime of x, and which is then split after reload into
zero, one or two move instructions, or an xchg (swap) instruction,
depending upon the register allocation. The worst case requires two
moves (the same as before), but ideally this becomes a no-op if register
preferencing manages to assign lo and lo_src to the same hard register,
and hi and hi_src to the same hard register.
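For illustration, here is a rough sketch of how a single parmovdi4 might
resolve after reload; the hard registers below are purely hypothetical
allocator choices, not taken from the testcase:

(parallel [(set (reg:DI ax) (reg:DI ax))
           (set (reg:DI dx) (reg:DI dx))])  -> deleted (no-op)
(parallel [(set (reg:DI ax) (reg:DI ax))
           (set (reg:DI dx) (reg:DI cx))])  -> movq  %rcx, %rdx
(parallel [(set (reg:DI ax) (reg:DI si))
           (set (reg:DI dx) (reg:DI cx))])  -> movq  %rsi, %rax
                                              movq  %rcx, %rdx
(parallel [(set (reg:DI ax) (reg:DI dx))
           (set (reg:DI dx) (reg:DI ax))])  -> xchgq %rax, %rdx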
Previously, for the testcase in the bugzilla PR, we generated:
less: movq %rdi, %rax
movq %rdx, %r11
pushq %rbx
movq %rsi, %r10
xorq %rdx, %rax
movq %rsi, %rdx
xorq %rcx, %rdx
orq %rdx, %rax
je .L9
cmpq %r11, %rdi
popq %rbx
sbbq %rcx, %r10
setc %al
ret
.L9: movq %r9, %rsi
movq %r8, %rdi
popq %rbx
jmp bytewiseless
With this patch we now generate the following improved code (saving two
instructions):
movq %rdi, %r10
movq %rsi, %rax
pushq %rbx
xorq %rdx, %r10
xorq %rcx, %rax
orq %rax, %r10
je .L9
cmpq %rdx, %rdi
popq %rbx
sbbq %rcx, %rsi
setc %al
ret
.L9: movq %r9, %rsi
movq %r8, %rdi
popq %rbx
jmp bytewiseless
Alas, %rbx is still saved and restored even though it's never used (by
the time we reach final), but this appears to be a step in the right
direction. This TI->DI patch is conceptually related to the DI->TI
patches from Jakub (and me) that introduced concatditi3 and friends, to
aid with constructing TImode pseudos from a pair of DImode pseudos.
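Roughly speaking (this is only the conceptual shape, not the exact
pattern), those concat patterns match RTL of the form:

(set (reg:TI x)
     (ior:TI (ashift:TI (zero_extend:TI (reg:DI hi)) (const_int 64))
             (zero_extend:TI (reg:DI lo))))

so parmovdi4 can be thought of as handling the reverse direction.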
There are lots of places where this new parmovdi4 instruction can be
generated during the splitting/expansion of doubleword operations, but
the *cmpti_doubleword
splitter in this PR seems like a reasonable place to start as a proof of
concept. The call to gen_parmovdi4 could even eventually be moved into
the function split_double_mode.
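Concretely (a sketch only, with illustrative pseudo names), with this
patch the problematic two-instruction sequence from the start of this
message is instead emitted as a single insn:

(parallel [(set (reg:DI lo) (subreg:DI (reg:TI x) 0))
           (set (reg:DI hi) (subreg:DI (reg:TI x) 8))])

so x dies at this insn and lo/hi no longer conflict with it.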
Does this seem like a reasonable approach? Is there precedent for a
better instruction name than parmov<mode>4 (for either "parallel move"
or "pair move") on other targets? Is there some additional register
preferencing magic that can advise ira/reload to prefer to place the
sources and destinations in the same hard registers?
Testing on x86_64 showed no regressions in the testsuite, but it did
require tweaks to several gcc.target/i386/builtin-memmove-* testcases
that use check-function-bodies, mostly to account for differences in
hard register assignments. The only differences beyond register naming
were in builtin-memmove-2d.c, where the reduction in register pressure
allows us to save several instructions by avoiding a spill [reducing
code size by 16 bytes].
2026-01-12 Roger Sayle <[email protected]>
gcc/ChangeLog
PR target/122454
* config/i386/i386.md (*cmp<dwi>_doubleword): If split_double_mode
returns a pair of SUBREGs, emit a parmov<mode>4 to place them in
new pseudos, to improve register allocation.
(parmov<mode>4): New define_insn_and_split that splits to zero,
one or two moves, or remains as an xchg instruction.
gcc/testsuite/ChangeLog
PR target/122454
* gcc.target/i386/builtin-memmove-11a.c: Update for changes in
register allocation.
* gcc.target/i386/builtin-memmove-11b.c: Likewise.
* gcc.target/i386/builtin-memmove-2a.c: Likewise.
* gcc.target/i386/builtin-memmove-2b.c: Likewise.
* gcc.target/i386/builtin-memmove-2c.c: Likewise.
* gcc.target/i386/builtin-memmove-2d.c: Update for changes in
register allocation and code generation.
Thoughts?
Thanks in advance,
Roger
--
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index df7135f84d4..67beb631057 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -1704,6 +1704,24 @@
DONE;
}
+ /* Minimize live ranges of TImode pseudos. */
+ if (SUBREG_P (operands[0]) && SUBREG_P (operands[2]))
+ {
+ rtx lo = gen_reg_rtx (<MODE>mode);
+ rtx hi = gen_reg_rtx (<MODE>mode);
+ emit_insn (gen_parmov<mode>4 (lo, operands[0], hi, operands[2]));
+ operands[0] = lo;
+ operands[2] = hi;
+ }
+ if (SUBREG_P (operands[1]) && SUBREG_P (operands[3]))
+ {
+ rtx lo = gen_reg_rtx (<MODE>mode);
+ rtx hi = gen_reg_rtx (<MODE>mode);
+ emit_insn (gen_parmov<mode>4 (lo, operands[1], hi, operands[3]));
+ operands[1] = lo;
+ operands[3] = hi;
+ }
+
if (operands[1] == const0_rtx)
emit_move_insn (operands[4], operands[0]);
else if (operands[0] == const0_rtx)
@@ -3767,6 +3785,57 @@
split_double_concat (DImode, operands[0], operands[2], operands[4]);
DONE;
})
+
+;; Doubleword splitting
+(define_insn_and_split "parmov<mode>4"
+ [(set (match_operand:DWIH 0 "register_operand" "=r,r,r,r")
+ (match_operand:DWIH 1 "register_operand" "0,0,r,r"))
+ (set (match_operand:DWIH 2 "register_operand" "=r,r,r,r")
+ (match_operand:DWIH 3 "register_operand" "2,r,2,r"))]
+ ""
+ "xchg{<imodesuffix>}\t%0, %2"
+ "&& reload_completed
+ && (!rtx_equal_p (operands[0], operands[3])
+ || !rtx_equal_p (operands[1], operands[2]))"
+ [(set (match_dup 4) (match_dup 5))
+ (set (match_dup 6) (match_dup 7))]
+{
+ /* Single-set and no-op cases. */
+ if (rtx_equal_p (operands[0], operands[1]))
+ {
+ if (rtx_equal_p (operands[2], operands[3]))
+ emit_note (NOTE_INSN_DELETED);
+ else
+ emit_move_insn (operands[2], operands[3]);
+ DONE;
+ }
+ if (rtx_equal_p (operands[2], operands[3]))
+ {
+ emit_move_insn (operands[0], operands[1]);
+ DONE;
+ }
+
+ if (rtx_equal_p (operands[0], operands[3]))
+ {
+ operands[4] = operands[2];
+ operands[5] = operands[3];
+ operands[6] = operands[0];
+ operands[7] = operands[1];
+ }
+ else
+ {
+ operands[4] = operands[0];
+ operands[5] = operands[1];
+ operands[6] = operands[2];
+ operands[7] = operands[3];
+ }
+}
+ [(set_attr "type" "imov")
+ (set_attr "mode" "<MODE>")
+ (set_attr "pent_pair" "np")
+ (set_attr "athlon_decode" "vector")
+ (set_attr "amdfam10_decode" "double")
+ (set_attr "bdver1_decode" "double")])
;; Floating point push instructions.
diff --git a/gcc/testsuite/gcc.target/i386/builtin-memmove-11a.c b/gcc/testsuite/gcc.target/i386/builtin-memmove-11a.c
index bf5369cee8c..a79cedf0132 100644
--- a/gcc/testsuite/gcc.target/i386/builtin-memmove-11a.c
+++ b/gcc/testsuite/gcc.target/i386/builtin-memmove-11a.c
@@ -21,48 +21,48 @@
** movdqu \(%rsi\), %xmm3
** movdqu 16\(%rsi\), %xmm2
** subl \$64, %edx
-** addq \$64, %rax
+** addq \$64, %rdi
** movdqu 32\(%rsi\), %xmm1
** movdqu 48\(%rsi\), %xmm0
** addq \$64, %rsi
-** movups %xmm3, -64\(%rax\)
-** movups %xmm2, -48\(%rax\)
-** movups %xmm1, -32\(%rax\)
-** movups %xmm0, -16\(%rax\)
+** movups %xmm3, -64\(%rdi\)
+** movups %xmm2, -48\(%rdi\)
+** movups %xmm1, -32\(%rdi\)
+** movups %xmm0, -16\(%rdi\)
** cmpl \$64, %edx
** ja .L6
-** movups %xmm7, 496\(%rdi\)
-** movups %xmm6, 480\(%rdi\)
-** movups %xmm5, 464\(%rdi\)
-** movups %xmm4, 448\(%rdi\)
+** movups %xmm7, 496\(%rax\)
+** movups %xmm6, 480\(%rax\)
+** movups %xmm5, 464\(%rax\)
+** movups %xmm4, 448\(%rax\)
** ret
** .p2align 4,,10
** .p2align 3
**.L5:
** movdqu \(%rsi\), %xmm7
** movdqu 16\(%rsi\), %xmm6
-** leaq 512\(%rdi\), %rax
+** leaq 512\(%rdi\), %rdi
** addq \$512, %rsi
** movdqu -480\(%rsi\), %xmm5
** movdqu -464\(%rsi\), %xmm4
**.L7:
** movdqu -16\(%rsi\), %xmm3
** subl \$64, %edx
-** subq \$64, %rax
+** subq \$64, %rdi
** subq \$64, %rsi
** movdqu 32\(%rsi\), %xmm2
** movdqu 16\(%rsi\), %xmm1
** movdqu \(%rsi\), %xmm0
-** movups %xmm3, 48\(%rax\)
-** movups %xmm2, 32\(%rax\)
-** movups %xmm1, 16\(%rax\)
-** movups %xmm0, \(%rax\)
+** movups %xmm3, 48\(%rdi\)
+** movups %xmm2, 32\(%rdi\)
+** movups %xmm1, 16\(%rdi\)
+** movups %xmm0, \(%rdi\)
** cmpl \$64, %edx
** ja .L7
-** movups %xmm7, \(%rdi\)
-** movups %xmm6, 16\(%rdi\)
-** movups %xmm5, 32\(%rdi\)
-** movups %xmm4, 48\(%rdi\)
+** movups %xmm7, \(%rax\)
+** movups %xmm6, 16\(%rax\)
+** movups %xmm5, 32\(%rax\)
+** movups %xmm4, 48\(%rax\)
**.L1:
** ret
** .cfi_endproc
diff --git a/gcc/testsuite/gcc.target/i386/builtin-memmove-11b.c b/gcc/testsuite/gcc.target/i386/builtin-memmove-11b.c
index f80881db196..3ac1d6202eb 100644
--- a/gcc/testsuite/gcc.target/i386/builtin-memmove-11b.c
+++ b/gcc/testsuite/gcc.target/i386/builtin-memmove-11b.c
@@ -21,20 +21,20 @@
** vmovdqu \(%rsi\), %ymm3
** vmovdqu 32\(%rsi\), %ymm2
** addl \$-128, %edx
-** subq \$-128, %rax
+** subq \$-128, %rdi
** vmovdqu 64\(%rsi\), %ymm1
** vmovdqu 96\(%rsi\), %ymm0
** subq \$-128, %rsi
-** vmovdqu %ymm3, -128\(%rax\)
-** vmovdqu %ymm2, -96\(%rax\)
-** vmovdqu %ymm1, -64\(%rax\)
-** vmovdqu %ymm0, -32\(%rax\)
+** vmovdqu %ymm3, -128\(%rdi\)
+** vmovdqu %ymm2, -96\(%rdi\)
+** vmovdqu %ymm1, -64\(%rdi\)
+** vmovdqu %ymm0, -32\(%rdi\)
** cmpl \$128, %edx
** ja .L6
-** vmovdqu %ymm7, 480\(%rdi\)
-** vmovdqu %ymm6, 448\(%rdi\)
-** vmovdqu %ymm5, 416\(%rdi\)
-** vmovdqu %ymm4, 384\(%rdi\)
+** vmovdqu %ymm7, 480\(%rax\)
+** vmovdqu %ymm6, 448\(%rax\)
+** vmovdqu %ymm5, 416\(%rax\)
+** vmovdqu %ymm4, 384\(%rax\)
** vzeroupper
** ret
** .p2align 4,,10
@@ -42,28 +42,28 @@
**.L5:
** vmovdqu \(%rsi\), %ymm7
** vmovdqu 32\(%rsi\), %ymm6
-** leaq 512\(%rdi\), %rax
+** leaq 512\(%rdi\), %rdi
** addq \$512, %rsi
** vmovdqu -448\(%rsi\), %ymm5
** vmovdqu -416\(%rsi\), %ymm4
**.L7:
** vmovdqu -32\(%rsi\), %ymm3
** addl \$-128, %edx
-** addq \$-128, %rax
+** addq \$-128, %rdi
** addq \$-128, %rsi
** vmovdqu 64\(%rsi\), %ymm2
** vmovdqu 32\(%rsi\), %ymm1
** vmovdqu \(%rsi\), %ymm0
-** vmovdqu %ymm3, 96\(%rax\)
-** vmovdqu %ymm2, 64\(%rax\)
-** vmovdqu %ymm1, 32\(%rax\)
-** vmovdqu %ymm0, \(%rax\)
+** vmovdqu %ymm3, 96\(%rdi\)
+** vmovdqu %ymm2, 64\(%rdi\)
+** vmovdqu %ymm1, 32\(%rdi\)
+** vmovdqu %ymm0, \(%rdi\)
** cmpl \$128, %edx
** ja .L7
-** vmovdqu %ymm7, \(%rdi\)
-** vmovdqu %ymm6, 32\(%rdi\)
-** vmovdqu %ymm5, 64\(%rdi\)
-** vmovdqu %ymm4, 96\(%rdi\)
+** vmovdqu %ymm7, \(%rax\)
+** vmovdqu %ymm6, 32\(%rax\)
+** vmovdqu %ymm5, 64\(%rax\)
+** vmovdqu %ymm4, 96\(%rax\)
** vzeroupper
**.L10:
** ret
diff --git a/gcc/testsuite/gcc.target/i386/builtin-memmove-2a.c b/gcc/testsuite/gcc.target/i386/builtin-memmove-2a.c
index 903a31cfd34..5c4c50fe769 100644
--- a/gcc/testsuite/gcc.target/i386/builtin-memmove-2a.c
+++ b/gcc/testsuite/gcc.target/i386/builtin-memmove-2a.c
@@ -16,27 +16,27 @@
** jbe .L17
** cmpq \$128, %rdx
** jbe .L18
-** movq %rdx, %rsi
-** cmpq %rdi, %rcx
+** movq %rdx, %rcx
+** cmpq %rdi, %rsi
** jb .L11
** je .L2
-** movdqu -16\(%rcx,%rdx\), %xmm7
-** movdqu -32\(%rcx,%rdx\), %xmm6
-** movdqu -48\(%rcx,%rdx\), %xmm5
-** movdqu -64\(%rcx,%rdx\), %xmm4
+** movdqu -16\(%rsi,%rdx\), %xmm7
+** movdqu -32\(%rsi,%rdx\), %xmm6
+** movdqu -48\(%rsi,%rdx\), %xmm5
+** movdqu -64\(%rsi,%rdx\), %xmm4
**.L12:
-** movdqu \(%rcx\), %xmm3
-** subq \$64, %rsi
+** movdqu \(%rsi\), %xmm3
+** subq \$64, %rcx
** addq \$64, %rdi
-** addq \$64, %rcx
-** movdqu -48\(%rcx\), %xmm2
-** movdqu -32\(%rcx\), %xmm1
-** movdqu -16\(%rcx\), %xmm0
+** addq \$64, %rsi
+** movdqu -48\(%rsi\), %xmm2
+** movdqu -32\(%rsi\), %xmm1
+** movdqu -16\(%rsi\), %xmm0
** movups %xmm3, -64\(%rdi\)
** movups %xmm2, -48\(%rdi\)
** movups %xmm1, -32\(%rdi\)
** movups %xmm0, -16\(%rdi\)
-** cmpq \$64, %rsi
+** cmpq \$64, %rcx
** ja .L12
** movups %xmm7, -16\(%rax,%rdx\)
** movups %xmm6, -32\(%rax,%rdx\)
@@ -48,10 +48,10 @@
**.L3:
** cmpq \$8, %rdx
** jb .L19
-** movq \(%rsi\), %rdi
-** movq -8\(%rsi,%rdx\), %rcx
-** movq %rdi, \(%rax\)
-** movq %rcx, -8\(%rax,%rdx\)
+** movq \(%rsi\), %rsi
+** movq -8\(%rcx,%rdx\), %rcx
+** movq %rsi, \(%rdi\)
+** movq %rcx, -8\(%rdi,%rdx\)
** ret
** .p2align 4,,10
** .p2align 3
@@ -97,33 +97,33 @@
** .p2align 4,,10
** .p2align 3
**.L6:
-** movl \(%rsi\), %edi
-** movl -4\(%rsi,%rdx\), %ecx
-** movl %edi, \(%rax\)
-** movl %ecx, -4\(%rax,%rdx\)
+** movl \(%rsi\), %esi
+** movl -4\(%rcx,%rdx\), %ecx
+** movl %esi, \(%rdi\)
+** movl %ecx, -4\(%rdi,%rdx\)
** ret
** .p2align 4,,10
** .p2align 3
**.L11:
-** movdqu \(%rcx\), %xmm7
-** movdqu 16\(%rcx\), %xmm6
+** movdqu \(%rsi\), %xmm7
+** movdqu 16\(%rsi\), %xmm6
** leaq \(%rdi,%rdx\), %rdi
-** movdqu 32\(%rcx\), %xmm5
-** movdqu 48\(%rcx\), %xmm4
-** addq %rdx, %rcx
+** movdqu 32\(%rsi\), %xmm5
+** movdqu 48\(%rsi\), %xmm4
+** addq %rdx, %rsi
**.L13:
-** movdqu -16\(%rcx\), %xmm3
-** movdqu -32\(%rcx\), %xmm2
-** subq \$64, %rsi
-** subq \$64, %rdi
-** movdqu -48\(%rcx\), %xmm1
-** movdqu -64\(%rcx\), %xmm0
+** movdqu -16\(%rsi\), %xmm3
+** movdqu -32\(%rsi\), %xmm2
** subq \$64, %rcx
+** subq \$64, %rdi
+** movdqu -48\(%rsi\), %xmm1
+** movdqu -64\(%rsi\), %xmm0
+** subq \$64, %rsi
** movups %xmm3, 48\(%rdi\)
** movups %xmm2, 32\(%rdi\)
** movups %xmm1, 16\(%rdi\)
** movups %xmm0, \(%rdi\)
-** cmpq \$64, %rsi
+** cmpq \$64, %rcx
** ja .L13
** movups %xmm7, \(%rax\)
** movups %xmm6, 16\(%rax\)
@@ -146,10 +146,10 @@
** .p2align 4,,10
** .p2align 3
**.L7:
-** movzwl \(%rsi\), %edi
-** movzwl -2\(%rsi,%rdx\), %ecx
-** movw %di, \(%rax\)
-** movw %cx, -2\(%rax,%rdx\)
+** movzwl \(%rsi\), %esi
+** movzwl -2\(%rcx,%rdx\), %ecx
+** movw %si, \(%rdi\)
+** movw %cx, -2\(%rdi,%rdx\)
** ret
** .cfi_endproc
**...
diff --git a/gcc/testsuite/gcc.target/i386/builtin-memmove-2b.c b/gcc/testsuite/gcc.target/i386/builtin-memmove-2b.c
index ac676d07867..9e71bfa273a 100644
--- a/gcc/testsuite/gcc.target/i386/builtin-memmove-2b.c
+++ b/gcc/testsuite/gcc.target/i386/builtin-memmove-2b.c
@@ -16,27 +16,27 @@
** jbe .L18
** cmpq \$256, %rdx
** jbe .L19
-** movq %rdx, %rsi
-** cmpq %rdi, %rcx
+** movq %rdx, %rcx
+** cmpq %rdi, %rsi
** jb .L12
** je .L2
-** vmovdqu -32\(%rcx,%rdx\), %ymm7
-** vmovdqu -64\(%rcx,%rdx\), %ymm6
-** vmovdqu -96\(%rcx,%rdx\), %ymm5
-** vmovdqu -128\(%rcx,%rdx\), %ymm4
+** vmovdqu -32\(%rsi,%rdx\), %ymm7
+** vmovdqu -64\(%rsi,%rdx\), %ymm6
+** vmovdqu -96\(%rsi,%rdx\), %ymm5
+** vmovdqu -128\(%rsi,%rdx\), %ymm4
**.L13:
-** vmovdqu \(%rcx\), %ymm3
-** addq \$-128, %rsi
+** vmovdqu \(%rsi\), %ymm3
+** addq \$-128, %rcx
** subq \$-128, %rdi
-** subq \$-128, %rcx
-** vmovdqu -96\(%rcx\), %ymm2
-** vmovdqu -64\(%rcx\), %ymm1
-** vmovdqu -32\(%rcx\), %ymm0
+** subq \$-128, %rsi
+** vmovdqu -96\(%rsi\), %ymm2
+** vmovdqu -64\(%rsi\), %ymm1
+** vmovdqu -32\(%rsi\), %ymm0
** vmovdqu %ymm3, -128\(%rdi\)
** vmovdqu %ymm2, -96\(%rdi\)
** vmovdqu %ymm1, -64\(%rdi\)
** vmovdqu %ymm0, -32\(%rdi\)
-** cmpq \$128, %rsi
+** cmpq \$128, %rcx
** ja .L13
** vmovdqu %ymm7, -32\(%rax,%rdx\)
** vmovdqu %ymm6, -64\(%rax,%rdx\)
@@ -102,33 +102,33 @@
** .p2align 4,,10
** .p2align 3
**.L6:
-** movq \(%rsi\), %rdi
-** movq -8\(%rsi,%rdx\), %rcx
-** movq %rdi, \(%rax\)
-** movq %rcx, -8\(%rax,%rdx\)
+** movq \(%rsi\), %rsi
+** movq -8\(%rcx,%rdx\), %rcx
+** movq %rsi, \(%rdi\)
+** movq %rcx, -8\(%rdi,%rdx\)
** ret
** .p2align 4,,10
** .p2align 3
**.L12:
-** vmovdqu \(%rcx\), %ymm7
-** vmovdqu 32\(%rcx\), %ymm6
+** vmovdqu \(%rsi\), %ymm7
+** vmovdqu 32\(%rsi\), %ymm6
** leaq \(%rdi,%rdx\), %rdi
-** vmovdqu 64\(%rcx\), %ymm5
-** vmovdqu 96\(%rcx\), %ymm4
-** addq %rdx, %rcx
+** vmovdqu 64\(%rsi\), %ymm5
+** vmovdqu 96\(%rsi\), %ymm4
+** addq %rdx, %rsi
**.L14:
-** vmovdqu -32\(%rcx\), %ymm3
-** vmovdqu -64\(%rcx\), %ymm2
-** addq \$-128, %rsi
-** addq \$-128, %rdi
-** vmovdqu -96\(%rcx\), %ymm1
-** vmovdqu -128\(%rcx\), %ymm0
+** vmovdqu -32\(%rsi\), %ymm3
+** vmovdqu -64\(%rsi\), %ymm2
** addq \$-128, %rcx
+** addq \$-128, %rdi
+** vmovdqu -96\(%rsi\), %ymm1
+** vmovdqu -128\(%rsi\), %ymm0
+** addq \$-128, %rsi
** vmovdqu %ymm3, 96\(%rdi\)
** vmovdqu %ymm2, 64\(%rdi\)
** vmovdqu %ymm1, 32\(%rdi\)
** vmovdqu %ymm0, \(%rdi\)
-** cmpq \$128, %rsi
+** cmpq \$128, %rcx
** ja .L14
** vmovdqu %ymm7, \(%rax\)
** vmovdqu %ymm6, 32\(%rax\)
@@ -153,18 +153,18 @@
** .p2align 4,,10
** .p2align 3
**.L7:
-** movl \(%rsi\), %edi
-** movl -4\(%rsi,%rdx\), %ecx
-** movl %edi, \(%rax\)
-** movl %ecx, -4\(%rax,%rdx\)
+** movl \(%rsi\), %esi
+** movl -4\(%rcx,%rdx\), %ecx
+** movl %esi, \(%rdi\)
+** movl %ecx, -4\(%rdi,%rdx\)
** ret
** .p2align 4,,10
** .p2align 3
**.L8:
-** movzwl \(%rsi\), %edi
-** movzwl -2\(%rsi,%rdx\), %ecx
-** movw %di, \(%rax\)
-** movw %cx, -2\(%rax,%rdx\)
+** movzwl \(%rsi\), %esi
+** movzwl -2\(%rcx,%rdx\), %ecx
+** movw %si, \(%rdi\)
+** movw %cx, -2\(%rdi,%rdx\)
** ret
** .cfi_endproc
**...
diff --git a/gcc/testsuite/gcc.target/i386/builtin-memmove-2c.c b/gcc/testsuite/gcc.target/i386/builtin-memmove-2c.c
index 656986b458e..9a886fd4cce 100644
--- a/gcc/testsuite/gcc.target/i386/builtin-memmove-2c.c
+++ b/gcc/testsuite/gcc.target/i386/builtin-memmove-2c.c
@@ -16,27 +16,27 @@
** jbe .L19
** cmpq \$512, %rdx
** jbe .L20
-** movq %rdx, %rsi
-** cmpq %rdi, %rcx
+** movq %rdx, %rcx
+** cmpq %rdi, %rsi
** jb .L13
** je .L2
-** vmovdqu64 -64\(%rcx,%rdx\), %zmm7
-** vmovdqu64 -128\(%rcx,%rdx\), %zmm6
-** vmovdqu64 -192\(%rcx,%rdx\), %zmm5
-** vmovdqu64 -256\(%rcx,%rdx\), %zmm4
+** vmovdqu64 -64\(%rsi,%rdx\), %zmm7
+** vmovdqu64 -128\(%rsi,%rdx\), %zmm6
+** vmovdqu64 -192\(%rsi,%rdx\), %zmm5
+** vmovdqu64 -256\(%rsi,%rdx\), %zmm4
**.L14:
-** vmovdqu64 \(%rcx\), %zmm3
-** vmovdqu64 64\(%rcx\), %zmm2
-** subq \$256, %rsi
+** vmovdqu64 \(%rsi\), %zmm3
+** vmovdqu64 64\(%rsi\), %zmm2
+** subq \$256, %rcx
** addq \$256, %rdi
-** vmovdqu64 128\(%rcx\), %zmm1
-** addq \$256, %rcx
-** vmovdqu64 -64\(%rcx\), %zmm0
+** vmovdqu64 128\(%rsi\), %zmm1
+** addq \$256, %rsi
+** vmovdqu64 -64\(%rsi\), %zmm0
** vmovdqu64 %zmm3, -256\(%rdi\)
** vmovdqu64 %zmm2, -192\(%rdi\)
** vmovdqu64 %zmm1, -128\(%rdi\)
** vmovdqu64 %zmm0, -64\(%rdi\)
-** cmpq \$256, %rsi
+** cmpq \$256, %rcx
** ja .L14
** vmovdqu64 %zmm7, -64\(%rax,%rdx\)
** vmovdqu64 %zmm6, -128\(%rax,%rdx\)
@@ -113,25 +113,25 @@
** .p2align 4,,10
** .p2align 3
**.L13:
-** vmovdqu64 \(%rcx\), %zmm7
+** vmovdqu64 \(%rsi\), %zmm7
** leaq \(%rdi,%rdx\), %rdi
-** vmovdqu64 64\(%rcx\), %zmm6
-** vmovdqu64 128\(%rcx\), %zmm5
-** vmovdqu64 192\(%rcx\), %zmm4
-** addq %rdx, %rcx
+** vmovdqu64 64\(%rsi\), %zmm6
+** vmovdqu64 128\(%rsi\), %zmm5
+** vmovdqu64 192\(%rsi\), %zmm4
+** addq %rdx, %rsi
**.L15:
-** vmovdqu64 -64\(%rcx\), %zmm3
-** vmovdqu64 -128\(%rcx\), %zmm2
-** subq \$256, %rsi
-** subq \$256, %rdi
-** vmovdqu64 -192\(%rcx\), %zmm1
+** vmovdqu64 -64\(%rsi\), %zmm3
+** vmovdqu64 -128\(%rsi\), %zmm2
** subq \$256, %rcx
-** vmovdqu64 \(%rcx\), %zmm0
+** subq \$256, %rdi
+** vmovdqu64 -192\(%rsi\), %zmm1
+** subq \$256, %rsi
+** vmovdqu64 \(%rsi\), %zmm0
** vmovdqu64 %zmm3, 192\(%rdi\)
** vmovdqu64 %zmm2, 128\(%rdi\)
** vmovdqu64 %zmm1, 64\(%rdi\)
** vmovdqu64 %zmm0, \(%rdi\)
-** cmpq \$256, %rsi
+** cmpq \$256, %rcx
** ja .L15
** vmovdqu64 %zmm7, \(%rax\)
** vmovdqu64 %zmm6, 64\(%rax\)
@@ -156,26 +156,26 @@
** .p2align 4,,10
** .p2align 3
**.L7:
-** movq \(%rsi\), %rdi
-** movq -8\(%rsi,%rdx\), %rcx
-** movq %rdi, \(%rax\)
-** movq %rcx, -8\(%rax,%rdx\)
+** movq \(%rsi\), %rsi
+** movq -8\(%rcx,%rdx\), %rcx
+** movq %rsi, \(%rdi\)
+** movq %rcx, -8\(%rdi,%rdx\)
** ret
** .p2align 4,,10
** .p2align 3
**.L8:
-** movl \(%rsi\), %edi
-** movl -4\(%rsi,%rdx\), %ecx
-** movl %edi, \(%rax\)
-** movl %ecx, -4\(%rax,%rdx\)
+** movl \(%rsi\), %esi
+** movl -4\(%rcx,%rdx\), %ecx
+** movl %esi, \(%rdi\)
+** movl %ecx, -4\(%rdi,%rdx\)
** ret
** .p2align 4,,10
** .p2align 3
**.L9:
-** movzwl \(%rsi\), %edi
-** movzwl -2\(%rsi,%rdx\), %ecx
-** movw %di, \(%rax\)
-** movw %cx, -2\(%rax,%rdx\)
+** movzwl \(%rsi\), %esi
+** movzwl -2\(%rcx,%rdx\), %ecx
+** movw %si, \(%rdi\)
+** movw %cx, -2\(%rdi,%rdx\)
** ret
** .cfi_endproc
**...
diff --git a/gcc/testsuite/gcc.target/i386/builtin-memmove-2d.c b/gcc/testsuite/gcc.target/i386/builtin-memmove-2d.c
index 324de74519e..0fa919b4a74 100644
--- a/gcc/testsuite/gcc.target/i386/builtin-memmove-2d.c
+++ b/gcc/testsuite/gcc.target/i386/builtin-memmove-2d.c
@@ -9,6 +9,7 @@
**.LFB0:
** .cfi_startproc
** movq %rdi, %rax
+** movq %rsi, %rcx
** cmpq \$8, %rdx
** jb .L3
** cmpq \$16, %rdx
@@ -17,9 +18,8 @@
** .cfi_def_cfa_offset 40
** cmpq \$64, %rdx
** jbe .L20
-** movq %rsi, %rcx
-** movq %rdx, %rsi
-** cmpq %rdi, %rcx
+** movq %rdx, %rcx
+** cmpq %rdi, %rsi
** jb .L10
** je .L2
** movq %rbx, \(%rsp\)
@@ -30,26 +30,26 @@
** .cfi_offset 6, -32
** .cfi_offset 14, -24
** .cfi_offset 15, -16
-** movq -8\(%rcx,%rdx\), %r15
-** movq -16\(%rcx,%rdx\), %r14
-** movq -24\(%rcx,%rdx\), %rbp
-** movq -32\(%rcx,%rdx\), %r11
+** movq -8\(%rsi,%rdx\), %r14
+** movq -16\(%rsi,%rdx\), %r15
+** movq -24\(%rsi,%rdx\), %rbp
+** movq -32\(%rsi,%rdx\), %r11
**.L11:
-** movq 8\(%rcx\), %r10
-** movq 16\(%rcx\), %r9
-** subq \$32, %rsi
+** movq 8\(%rsi\), %r10
+** movq 16\(%rsi\), %r9
+** subq \$32, %rcx
** addq \$32, %rdi
-** movq 24\(%rcx\), %r8
-** movq \(%rcx\), %rbx
-** addq \$32, %rcx
+** movq 24\(%rsi\), %r8
+** movq \(%rsi\), %rbx
+** addq \$32, %rsi
** movq %r10, -24\(%rdi\)
** movq %rbx, -32\(%rdi\)
** movq %r9, -16\(%rdi\)
** movq %r8, -8\(%rdi\)
-** cmpq \$32, %rsi
+** cmpq \$32, %rcx
** ja .L11
-** movq %r15, -8\(%rax,%rdx\)
-** movq %r14, -16\(%rax,%rdx\)
+** movq %r14, -8\(%rax,%rdx\)
+** movq %r15, -16\(%rax,%rdx\)
** movq %rbp, -24\(%rax,%rdx\)
** movq %r11, -32\(%rax,%rdx\)
** movq \(%rsp\), %rbx
@@ -67,10 +67,10 @@
** .cfi_def_cfa_offset 8
** cmpq \$4, %rdx
** jb .L21
-** movl \(%rsi\), %edi
-** movl -4\(%rsi,%rdx\), %ecx
-** movl %edi, \(%rax\)
-** movl %ecx, -4\(%rax,%rdx\)
+** movl \(%rsi\), %esi
+** movl -4\(%rcx,%rdx\), %ecx
+** movl %esi, \(%rdi\)
+** movl %ecx, -4\(%rdi,%rdx\)
** ret
** .p2align 4,,10
** .p2align 3
@@ -84,10 +84,10 @@
** .p2align 4,,10
** .p2align 3
**.L19:
-** movq \(%rsi\), %rdi
-** movq -8\(%rsi,%rdx\), %rcx
-** movq %rdi, \(%rax\)
-** movq %rcx, -8\(%rax,%rdx\)
+** movq \(%rsi\), %rsi
+** movq -8\(%rcx,%rdx\), %rcx
+** movq %rsi, \(%rdi\)
+** movq %rcx, -8\(%rdi,%rdx\)
** ret
** .p2align 4,,10
** .p2align 3
@@ -96,29 +96,25 @@
** cmpq \$32, %rdx
** jb .L9
** movq %rbx, \(%rsp\)
-** movq %r14, 16\(%rsp\)
** .cfi_offset 3, -40
-** .cfi_offset 14, -24
** movq \(%rsi\), %rbx
-** movq 8\(%rsi\), %r14
-** movq 16\(%rsi\), %r11
-** movq 24\(%rsi\), %r10
-** movq -8\(%rsi,%rdx\), %r9
-** movq -16\(%rsi,%rdx\), %r8
-** movq -24\(%rsi,%rdx\), %rdi
-** movq -32\(%rsi,%rdx\), %rcx
+** movq 8\(%rsi\), %r11
+** movq 16\(%rsi\), %r10
+** movq 24\(%rsi\), %r9
+** movq -8\(%rsi,%rdx\), %r8
+** movq -16\(%rsi,%rdx\), %rdi
+** movq -32\(%rcx,%rdx\), %rcx
+** movq -24\(%rsi,%rdx\), %rsi
** movq %rbx, \(%rax\)
-** movq %r14, 8\(%rax\)
-** movq %r11, 16\(%rax\)
-** movq %r10, 24\(%rax\)
-** movq %r9, -8\(%rax,%rdx\)
-** movq %r8, -16\(%rax,%rdx\)
-** movq %rdi, -24\(%rax,%rdx\)
+** movq %r11, 8\(%rax\)
+** movq %r10, 16\(%rax\)
+** movq %r9, 24\(%rax\)
+** movq %r8, -8\(%rax,%rdx\)
+** movq %rdi, -16\(%rax,%rdx\)
+** movq %rsi, -24\(%rax,%rdx\)
** movq %rcx, -32\(%rax,%rdx\)
** movq \(%rsp\), %rbx
** .cfi_restore 3
-** movq 16\(%rsp\), %r14
-** .cfi_restore 14
**.L2:
** addq \$32, %rsp
** .cfi_def_cfa_offset 8
@@ -126,10 +122,10 @@
** .p2align 4,,10
** .p2align 3
**.L6:
-** movzwl \(%rsi\), %edi
-** movzwl -2\(%rsi,%rdx\), %ecx
-** movw %di, \(%rax\)
-** movw %cx, -2\(%rax,%rdx\)
+** movzwl \(%rsi\), %esi
+** movzwl -2\(%rcx,%rdx\), %ecx
+** movw %si, \(%rdi\)
+** movw %cx, -2\(%rdi,%rdx\)
** ret
** .p2align 4,,10
** .p2align 3
@@ -139,13 +135,13 @@
** .p2align 3
**.L9:
** .cfi_def_cfa_offset 40
-** movq \(%rsi\), %r9
-** movq 8\(%rsi\), %r8
-** movq -8\(%rsi,%rdx\), %rdi
-** movq -16\(%rsi,%rdx\), %rcx
-** movq %r9, \(%rax\)
-** movq %r8, 8\(%rax\)
-** movq %rdi, -8\(%rax,%rdx\)
+** movq \(%rsi\), %r8
+** movq 8\(%rsi\), %rdi
+** movq -16\(%rcx,%rdx\), %rcx
+** movq -8\(%rsi,%rdx\), %rsi
+** movq %r8, \(%rax\)
+** movq %rdi, 8\(%rax\)
+** movq %rsi, -8\(%rax,%rdx\)
** movq %rcx, -16\(%rax,%rdx\)
** jmp .L2
** .p2align 4,,10
@@ -158,35 +154,35 @@
** .cfi_offset 3, -40
** .cfi_offset 14, -24
** .cfi_offset 15, -16
-** movq \(%rcx\), %r14
-** movq 8\(%rcx\), %r15
-** movq 16\(%rcx\), %r10
-** movq 24\(%rcx\), %r11
-** addq %rdx, %rcx
+** movq \(%rsi\), %r14
+** movq 8\(%rsi\), %r15
+** movq 16\(%rsi\), %r10
+** movq 24\(%rsi\), %r11
+** addq %rdx, %rsi
**.L12:
-** movq -16\(%rcx\), %r9
-** movq -24\(%rcx\), %r8
-** subq \$32, %rsi
-** subq \$32, %rdi
-** movq -32\(%rcx\), %rdx
-** movq -8\(%rcx\), %rbx
+** movq -16\(%rsi\), %r9
+** movq -24\(%rsi\), %r8
** subq \$32, %rcx
+** subq \$32, %rdi
+** movq -32\(%rsi\), %rdx
+** movq -8\(%rsi\), %rbx
+** subq \$32, %rsi
** movq %r9, 16\(%rdi\)
** movq %rbx, 24\(%rdi\)
** movq %r8, 8\(%rdi\)
** movq %rdx, \(%rdi\)
-** cmpq \$32, %rsi
+** cmpq \$32, %rcx
** ja .L12
** movq %r14, \(%rax\)
+** movq %r15, 8\(%rax\)
+** movq %r10, 16\(%rax\)
+** movq %r11, 24\(%rax\)
** movq \(%rsp\), %rbx
** .cfi_restore 3
-** movq %r15, 8\(%rax\)
** movq 16\(%rsp\), %r14
** .cfi_restore 14
-** movq %r10, 16\(%rax\)
** movq 24\(%rsp\), %r15
** .cfi_restore 15
-** movq %r11, 24\(%rax\)
** jmp .L2
** .cfi_endproc
**...