This patch/request for comment is my proposed partial solution to
PR target/122454. Although preserving TImode operations until after
reload has many benefits (including STV), it also introduces a small
number of complications (mainly from the use of SUBREGs).

Consider the last use of a TImode register x.  This is typically
split into two DImode registers, hi and lo, using a sequence of two
instructions that looks like:

  (set (reg:DI lo) (subreg:DI (reg:TI x) 0))
  (set (reg:DI hi) (subreg:DI (reg:TI x) 8)) (expr_list:REG_DEAD (reg:TI x))

The challenge for register allocation (reload) is that between these
two instructions the register x is still considered to be live, so lo
conflicts with x, and can't reside in the same hard register used to hold
the lowpart of x.  This leads to increased register pressure and suboptimal
register allocation.  In PR 122454 this manifests as additional move
instructions (a P2 regression).
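
To make the conflict concrete, suppose (purely for illustration) that x has
been assigned the hard register pair %rdx:%rax.  Because x is still live at
the first of the two sets, lo cannot also be given %rax, so the lowpart copy
becomes a real move into a third register, e.g.:

        movq    %rax, %r10      # lo cannot share %rax while x is live
        movq    %rdx, %r11      # hi (x only dies at this instruction)

instead of lo simply remaining in the low half of x.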

The proposed solution is to introduce a parallel move instruction:

  (parallel [(set (reg:DI lo) lo_src)
             (set (reg:DI hi) hi_src)])

which allows the TImode pseudo to be decomposed "atomically" without
extending the lifetime of x.  After reload this is split into zero,
one or two move instructions, or an xchg (swap) instruction, depending
upon the register allocation.  The worst case requires two moves
(which is the same as previously), but ideally this becomes a no-op
if register preferencing manages to assign lo and lo_src to the same
hard register, and likewise hi and hi_src to the same hard register.
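
Concretely, after reload the new pattern reduces to one of the following
shapes, depending on which hard registers end up being chosen (register
names below are purely illustrative):

        (no instruction)        # lo == lo_src and hi == hi_src
        movq    %rsi, %rax      # only one of the two pairs differs
        movq    %rsi, %rax      # both pairs differ: two independent moves
        movq    %rcx, %rdx
        xchgq   %rax, %rdx      # lo == hi_src and hi == lo_src (a swap)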

Previously for the testcase in the bugzilla PR:

less:   movq    %rdi, %rax
        movq    %rdx, %r11
        pushq   %rbx
        movq    %rsi, %r10
        xorq    %rdx, %rax
        movq    %rsi, %rdx
        xorq    %rcx, %rdx
        orq     %rdx, %rax
        je      .L9
        cmpq    %r11, %rdi
        popq    %rbx
        sbbq    %rcx, %r10
        setc    %al
        ret
.L9:    movq    %r9, %rsi
        movq    %r8, %rdi
        popq    %rbx
        jmp     bytewiseless

With this patch we now generate the following improved code (saving two
instructions):

        movq    %rdi, %r10
        movq    %rsi, %rax
        pushq   %rbx
        xorq    %rdx, %r10
        xorq    %rcx, %rax
        orq     %rax, %r10
        je      .L9
        cmpq    %rdx, %rdi
        popq    %rbx
        sbbq    %rcx, %rsi
        setc    %al
        ret
.L9:    movq    %r9, %rsi
        movq    %r8, %rdi
        popq    %rbx
        jmp     bytewiseless


Alas %rbx is still being saved and restored, even though it's never used
(by the time we reach final), but this appears to be a step in the right
direction.  This TI->DI patch is conceptually related to the DI->TI patches
by Jakub (and me) that introduced concatditi3 and friends, to aid with
constructing TImode pseudos from a pair of DImode pseudos.

There are lots of places where this new parmovdi4 instruction can be generated
during the splitting/expansion of doubleword operations, but the *cmpti_doubleword
splitter in this PR seems like a reasonable place to start as a proof of
concept.  The call to gen_parmovdi4 could even eventually be moved into
the function split_double_mode.

Does this seem like a reasonable approach?  Is there precedent for a
better instruction name than parmov<mode>4 (for either "parallel move"
or "pair move") on other targets?  Is there some additional register
preferencing magic that can advise ira/reload to prefer to place the
sources and destinations in the same hard registers?

Testing on x86_64 showed no regressions in the testsuite, but it did require
tweaks to several gcc.target/i386/builtin-memmove-* testcases that use
check-function-bodies, (mostly) to account for differences in hard register
assignments.  The only differences beyond register naming were in
builtin-memmove-2d.c, where the reduction in register pressure saves
several instructions by avoiding a spill [reducing code size by 16 bytes].


2026-01-12  Roger Sayle  <[email protected]>

gcc/ChangeLog
        PR target/122454
        * config/i386/i386.md (*cmp<dwi>_doubleword): If split_double_mode
        returns a pair of SUBREGs, emit a parmov<mode>4 to place them in
        new pseudos, to improve register allocation.
        (parmov<mode>4): New define_insn_and_split that splits to zero,
        one or two moves, or remains as an xchg instruction.

gcc/testsuite/ChangeLog
        PR target/122454
        * gcc.target/i386/builtin-memmove-11a.c: Update for changes in
        register allocation.
        * gcc.target/i386/builtin-memmove-11b.c: Likewise.
        * gcc.target/i386/builtin-memmove-2a.c: Likewise.
        * gcc.target/i386/builtin-memmove-2b.c: Likewise.
        * gcc.target/i386/builtin-memmove-2c.c: Likewise.
        * gcc.target/i386/builtin-memmove-2d.c: Update for changes in
        register allocation and code generation.


Thoughts?
Thanks in advance,
Roger
--

diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index df7135f84d4..67beb631057 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -1704,6 +1704,24 @@
       DONE;
     }
 
+  /* Minimize live ranges of TImode pseudos.  */
+  if (SUBREG_P (operands[0]) && SUBREG_P (operands[2]))
+    {
+      rtx lo = gen_reg_rtx (<MODE>mode);
+      rtx hi = gen_reg_rtx (<MODE>mode);
+      emit_insn (gen_parmov<mode>4 (lo, operands[0], hi, operands[2]));
+      operands[0] = lo;
+      operands[2] = hi;
+    }
+  if (SUBREG_P (operands[1]) && SUBREG_P (operands[3]))
+    {
+      rtx lo = gen_reg_rtx (<MODE>mode);
+      rtx hi = gen_reg_rtx (<MODE>mode);
+      emit_insn (gen_parmov<mode>4 (lo, operands[1], hi, operands[3]));
+      operands[1] = lo;
+      operands[3] = hi;
+    }
+
   if (operands[1] == const0_rtx)
     emit_move_insn (operands[4], operands[0]);
   else if (operands[0] == const0_rtx)
@@ -3767,6 +3785,57 @@
   split_double_concat (DImode, operands[0], operands[2], operands[4]);
   DONE;
 })
+
+;; Doubleword splitting
+(define_insn_and_split "parmov<mode>4"
+  [(set (match_operand:DWIH 0 "register_operand" "=r,r,r,r")
+       (match_operand:DWIH 1 "register_operand" "0,0,r,r"))
+   (set (match_operand:DWIH 2 "register_operand" "=r,r,r,r")
+       (match_operand:DWIH 3 "register_operand" "2,r,2,r"))]
+  ""
+  "xchg{<imodesuffix>}\t%0, %2"
+  "&& reload_completed
+   && (!rtx_equal_p (operands[0], operands[3])
+       || !rtx_equal_p (operands[1], operands[2]))"
+  [(set (match_dup 4) (match_dup 5))
+   (set (match_dup 6) (match_dup 7))]
+{
+  /* Single-set and no-op cases.  */
+  if (rtx_equal_p (operands[0], operands[1]))
+    {
+      if (rtx_equal_p (operands[2], operands[3]))
+       emit_note (NOTE_INSN_DELETED);
+      else
+       emit_move_insn (operands[2], operands[3]);
+      DONE;
+    }
+  if (rtx_equal_p (operands[2], operands[3]))
+    {
+      emit_move_insn (operands[0], operands[1]);
+      DONE;
+    }
+
+  if (rtx_equal_p (operands[0], operands[3]))
+    {
+      operands[4] = operands[2];
+      operands[5] = operands[3];
+      operands[6] = operands[0];
+      operands[7] = operands[1];
+    }
+  else
+    {
+      operands[4] = operands[0];
+      operands[5] = operands[1];
+      operands[6] = operands[2];
+      operands[7] = operands[3];
+    }
+}
+  [(set_attr "type" "imov")
+   (set_attr "mode" "<MODE>")
+   (set_attr "pent_pair" "np")
+   (set_attr "athlon_decode" "vector")
+   (set_attr "amdfam10_decode" "double")
+   (set_attr "bdver1_decode" "double")])
 
 ;; Floating point push instructions.
 
diff --git a/gcc/testsuite/gcc.target/i386/builtin-memmove-11a.c b/gcc/testsuite/gcc.target/i386/builtin-memmove-11a.c
index bf5369cee8c..a79cedf0132 100644
--- a/gcc/testsuite/gcc.target/i386/builtin-memmove-11a.c
+++ b/gcc/testsuite/gcc.target/i386/builtin-memmove-11a.c
@@ -21,48 +21,48 @@
 **     movdqu  \(%rsi\), %xmm3
 **     movdqu  16\(%rsi\), %xmm2
 **     subl    \$64, %edx
-**     addq    \$64, %rax
+**     addq    \$64, %rdi
 **     movdqu  32\(%rsi\), %xmm1
 **     movdqu  48\(%rsi\), %xmm0
 **     addq    \$64, %rsi
-**     movups  %xmm3, -64\(%rax\)
-**     movups  %xmm2, -48\(%rax\)
-**     movups  %xmm1, -32\(%rax\)
-**     movups  %xmm0, -16\(%rax\)
+**     movups  %xmm3, -64\(%rdi\)
+**     movups  %xmm2, -48\(%rdi\)
+**     movups  %xmm1, -32\(%rdi\)
+**     movups  %xmm0, -16\(%rdi\)
 **     cmpl    \$64, %edx
 **     ja      .L6
-**     movups  %xmm7, 496\(%rdi\)
-**     movups  %xmm6, 480\(%rdi\)
-**     movups  %xmm5, 464\(%rdi\)
-**     movups  %xmm4, 448\(%rdi\)
+**     movups  %xmm7, 496\(%rax\)
+**     movups  %xmm6, 480\(%rax\)
+**     movups  %xmm5, 464\(%rax\)
+**     movups  %xmm4, 448\(%rax\)
 **     ret
 **     .p2align 4,,10
 **     .p2align 3
 **.L5:
 **     movdqu  \(%rsi\), %xmm7
 **     movdqu  16\(%rsi\), %xmm6
-**     leaq    512\(%rdi\), %rax
+**     leaq    512\(%rdi\), %rdi
 **     addq    \$512, %rsi
 **     movdqu  -480\(%rsi\), %xmm5
 **     movdqu  -464\(%rsi\), %xmm4
 **.L7:
 **     movdqu  -16\(%rsi\), %xmm3
 **     subl    \$64, %edx
-**     subq    \$64, %rax
+**     subq    \$64, %rdi
 **     subq    \$64, %rsi
 **     movdqu  32\(%rsi\), %xmm2
 **     movdqu  16\(%rsi\), %xmm1
 **     movdqu  \(%rsi\), %xmm0
-**     movups  %xmm3, 48\(%rax\)
-**     movups  %xmm2, 32\(%rax\)
-**     movups  %xmm1, 16\(%rax\)
-**     movups  %xmm0, \(%rax\)
+**     movups  %xmm3, 48\(%rdi\)
+**     movups  %xmm2, 32\(%rdi\)
+**     movups  %xmm1, 16\(%rdi\)
+**     movups  %xmm0, \(%rdi\)
 **     cmpl    \$64, %edx
 **     ja      .L7
-**     movups  %xmm7, \(%rdi\)
-**     movups  %xmm6, 16\(%rdi\)
-**     movups  %xmm5, 32\(%rdi\)
-**     movups  %xmm4, 48\(%rdi\)
+**     movups  %xmm7, \(%rax\)
+**     movups  %xmm6, 16\(%rax\)
+**     movups  %xmm5, 32\(%rax\)
+**     movups  %xmm4, 48\(%rax\)
 **.L1:
 **     ret
 **     .cfi_endproc
diff --git a/gcc/testsuite/gcc.target/i386/builtin-memmove-11b.c b/gcc/testsuite/gcc.target/i386/builtin-memmove-11b.c
index f80881db196..3ac1d6202eb 100644
--- a/gcc/testsuite/gcc.target/i386/builtin-memmove-11b.c
+++ b/gcc/testsuite/gcc.target/i386/builtin-memmove-11b.c
@@ -21,20 +21,20 @@
 **     vmovdqu \(%rsi\), %ymm3
 **     vmovdqu 32\(%rsi\), %ymm2
 **     addl    \$-128, %edx
-**     subq    \$-128, %rax
+**     subq    \$-128, %rdi
 **     vmovdqu 64\(%rsi\), %ymm1
 **     vmovdqu 96\(%rsi\), %ymm0
 **     subq    \$-128, %rsi
-**     vmovdqu %ymm3, -128\(%rax\)
-**     vmovdqu %ymm2, -96\(%rax\)
-**     vmovdqu %ymm1, -64\(%rax\)
-**     vmovdqu %ymm0, -32\(%rax\)
+**     vmovdqu %ymm3, -128\(%rdi\)
+**     vmovdqu %ymm2, -96\(%rdi\)
+**     vmovdqu %ymm1, -64\(%rdi\)
+**     vmovdqu %ymm0, -32\(%rdi\)
 **     cmpl    \$128, %edx
 **     ja      .L6
-**     vmovdqu %ymm7, 480\(%rdi\)
-**     vmovdqu %ymm6, 448\(%rdi\)
-**     vmovdqu %ymm5, 416\(%rdi\)
-**     vmovdqu %ymm4, 384\(%rdi\)
+**     vmovdqu %ymm7, 480\(%rax\)
+**     vmovdqu %ymm6, 448\(%rax\)
+**     vmovdqu %ymm5, 416\(%rax\)
+**     vmovdqu %ymm4, 384\(%rax\)
 **     vzeroupper
 **     ret
 **     .p2align 4,,10
@@ -42,28 +42,28 @@
 **.L5:
 **     vmovdqu \(%rsi\), %ymm7
 **     vmovdqu 32\(%rsi\), %ymm6
-**     leaq    512\(%rdi\), %rax
+**     leaq    512\(%rdi\), %rdi
 **     addq    \$512, %rsi
 **     vmovdqu -448\(%rsi\), %ymm5
 **     vmovdqu -416\(%rsi\), %ymm4
 **.L7:
 **     vmovdqu -32\(%rsi\), %ymm3
 **     addl    \$-128, %edx
-**     addq    \$-128, %rax
+**     addq    \$-128, %rdi
 **     addq    \$-128, %rsi
 **     vmovdqu 64\(%rsi\), %ymm2
 **     vmovdqu 32\(%rsi\), %ymm1
 **     vmovdqu \(%rsi\), %ymm0
-**     vmovdqu %ymm3, 96\(%rax\)
-**     vmovdqu %ymm2, 64\(%rax\)
-**     vmovdqu %ymm1, 32\(%rax\)
-**     vmovdqu %ymm0, \(%rax\)
+**     vmovdqu %ymm3, 96\(%rdi\)
+**     vmovdqu %ymm2, 64\(%rdi\)
+**     vmovdqu %ymm1, 32\(%rdi\)
+**     vmovdqu %ymm0, \(%rdi\)
 **     cmpl    \$128, %edx
 **     ja      .L7
-**     vmovdqu %ymm7, \(%rdi\)
-**     vmovdqu %ymm6, 32\(%rdi\)
-**     vmovdqu %ymm5, 64\(%rdi\)
-**     vmovdqu %ymm4, 96\(%rdi\)
+**     vmovdqu %ymm7, \(%rax\)
+**     vmovdqu %ymm6, 32\(%rax\)
+**     vmovdqu %ymm5, 64\(%rax\)
+**     vmovdqu %ymm4, 96\(%rax\)
 **     vzeroupper
 **.L10:
 **     ret
diff --git a/gcc/testsuite/gcc.target/i386/builtin-memmove-2a.c b/gcc/testsuite/gcc.target/i386/builtin-memmove-2a.c
index 903a31cfd34..5c4c50fe769 100644
--- a/gcc/testsuite/gcc.target/i386/builtin-memmove-2a.c
+++ b/gcc/testsuite/gcc.target/i386/builtin-memmove-2a.c
@@ -16,27 +16,27 @@
 **     jbe     .L17
 **     cmpq    \$128, %rdx
 **     jbe     .L18
-**     movq    %rdx, %rsi
-**     cmpq    %rdi, %rcx
+**     movq    %rdx, %rcx
+**     cmpq    %rdi, %rsi
 **     jb      .L11
 **     je      .L2
-**     movdqu  -16\(%rcx,%rdx\), %xmm7
-**     movdqu  -32\(%rcx,%rdx\), %xmm6
-**     movdqu  -48\(%rcx,%rdx\), %xmm5
-**     movdqu  -64\(%rcx,%rdx\), %xmm4
+**     movdqu  -16\(%rsi,%rdx\), %xmm7
+**     movdqu  -32\(%rsi,%rdx\), %xmm6
+**     movdqu  -48\(%rsi,%rdx\), %xmm5
+**     movdqu  -64\(%rsi,%rdx\), %xmm4
 **.L12:
-**     movdqu  \(%rcx\), %xmm3
-**     subq    \$64, %rsi
+**     movdqu  \(%rsi\), %xmm3
+**     subq    \$64, %rcx
 **     addq    \$64, %rdi
-**     addq    \$64, %rcx
-**     movdqu  -48\(%rcx\), %xmm2
-**     movdqu  -32\(%rcx\), %xmm1
-**     movdqu  -16\(%rcx\), %xmm0
+**     addq    \$64, %rsi
+**     movdqu  -48\(%rsi\), %xmm2
+**     movdqu  -32\(%rsi\), %xmm1
+**     movdqu  -16\(%rsi\), %xmm0
 **     movups  %xmm3, -64\(%rdi\)
 **     movups  %xmm2, -48\(%rdi\)
 **     movups  %xmm1, -32\(%rdi\)
 **     movups  %xmm0, -16\(%rdi\)
-**     cmpq    \$64, %rsi
+**     cmpq    \$64, %rcx
 **     ja      .L12
 **     movups  %xmm7, -16\(%rax,%rdx\)
 **     movups  %xmm6, -32\(%rax,%rdx\)
@@ -48,10 +48,10 @@
 **.L3:
 **     cmpq    \$8, %rdx
 **     jb      .L19
-**     movq    \(%rsi\), %rdi
-**     movq    -8\(%rsi,%rdx\), %rcx
-**     movq    %rdi, \(%rax\)
-**     movq    %rcx, -8\(%rax,%rdx\)
+**     movq    \(%rsi\), %rsi
+**     movq    -8\(%rcx,%rdx\), %rcx
+**     movq    %rsi, \(%rdi\)
+**     movq    %rcx, -8\(%rdi,%rdx\)
 **     ret
 **     .p2align 4,,10
 **     .p2align 3
@@ -97,33 +97,33 @@
 **     .p2align 4,,10
 **     .p2align 3
 **.L6:
-**     movl    \(%rsi\), %edi
-**     movl    -4\(%rsi,%rdx\), %ecx
-**     movl    %edi, \(%rax\)
-**     movl    %ecx, -4\(%rax,%rdx\)
+**     movl    \(%rsi\), %esi
+**     movl    -4\(%rcx,%rdx\), %ecx
+**     movl    %esi, \(%rdi\)
+**     movl    %ecx, -4\(%rdi,%rdx\)
 **     ret
 **     .p2align 4,,10
 **     .p2align 3
 **.L11:
-**     movdqu  \(%rcx\), %xmm7
-**     movdqu  16\(%rcx\), %xmm6
+**     movdqu  \(%rsi\), %xmm7
+**     movdqu  16\(%rsi\), %xmm6
 **     leaq    \(%rdi,%rdx\), %rdi
-**     movdqu  32\(%rcx\), %xmm5
-**     movdqu  48\(%rcx\), %xmm4
-**     addq    %rdx, %rcx
+**     movdqu  32\(%rsi\), %xmm5
+**     movdqu  48\(%rsi\), %xmm4
+**     addq    %rdx, %rsi
 **.L13:
-**     movdqu  -16\(%rcx\), %xmm3
-**     movdqu  -32\(%rcx\), %xmm2
-**     subq    \$64, %rsi
-**     subq    \$64, %rdi
-**     movdqu  -48\(%rcx\), %xmm1
-**     movdqu  -64\(%rcx\), %xmm0
+**     movdqu  -16\(%rsi\), %xmm3
+**     movdqu  -32\(%rsi\), %xmm2
 **     subq    \$64, %rcx
+**     subq    \$64, %rdi
+**     movdqu  -48\(%rsi\), %xmm1
+**     movdqu  -64\(%rsi\), %xmm0
+**     subq    \$64, %rsi
 **     movups  %xmm3, 48\(%rdi\)
 **     movups  %xmm2, 32\(%rdi\)
 **     movups  %xmm1, 16\(%rdi\)
 **     movups  %xmm0, \(%rdi\)
-**     cmpq    \$64, %rsi
+**     cmpq    \$64, %rcx
 **     ja      .L13
 **     movups  %xmm7, \(%rax\)
 **     movups  %xmm6, 16\(%rax\)
@@ -146,10 +146,10 @@
 **     .p2align 4,,10
 **     .p2align 3
 **.L7:
-**     movzwl  \(%rsi\), %edi
-**     movzwl  -2\(%rsi,%rdx\), %ecx
-**     movw    %di, \(%rax\)
-**     movw    %cx, -2\(%rax,%rdx\)
+**     movzwl  \(%rsi\), %esi
+**     movzwl  -2\(%rcx,%rdx\), %ecx
+**     movw    %si, \(%rdi\)
+**     movw    %cx, -2\(%rdi,%rdx\)
 **     ret
 **     .cfi_endproc
 **...
diff --git a/gcc/testsuite/gcc.target/i386/builtin-memmove-2b.c b/gcc/testsuite/gcc.target/i386/builtin-memmove-2b.c
index ac676d07867..9e71bfa273a 100644
--- a/gcc/testsuite/gcc.target/i386/builtin-memmove-2b.c
+++ b/gcc/testsuite/gcc.target/i386/builtin-memmove-2b.c
@@ -16,27 +16,27 @@
 **     jbe     .L18
 **     cmpq    \$256, %rdx
 **     jbe     .L19
-**     movq    %rdx, %rsi
-**     cmpq    %rdi, %rcx
+**     movq    %rdx, %rcx
+**     cmpq    %rdi, %rsi
 **     jb      .L12
 **     je      .L2
-**     vmovdqu -32\(%rcx,%rdx\), %ymm7
-**     vmovdqu -64\(%rcx,%rdx\), %ymm6
-**     vmovdqu -96\(%rcx,%rdx\), %ymm5
-**     vmovdqu -128\(%rcx,%rdx\), %ymm4
+**     vmovdqu -32\(%rsi,%rdx\), %ymm7
+**     vmovdqu -64\(%rsi,%rdx\), %ymm6
+**     vmovdqu -96\(%rsi,%rdx\), %ymm5
+**     vmovdqu -128\(%rsi,%rdx\), %ymm4
 **.L13:
-**     vmovdqu \(%rcx\), %ymm3
-**     addq    \$-128, %rsi
+**     vmovdqu \(%rsi\), %ymm3
+**     addq    \$-128, %rcx
 **     subq    \$-128, %rdi
-**     subq    \$-128, %rcx
-**     vmovdqu -96\(%rcx\), %ymm2
-**     vmovdqu -64\(%rcx\), %ymm1
-**     vmovdqu -32\(%rcx\), %ymm0
+**     subq    \$-128, %rsi
+**     vmovdqu -96\(%rsi\), %ymm2
+**     vmovdqu -64\(%rsi\), %ymm1
+**     vmovdqu -32\(%rsi\), %ymm0
 **     vmovdqu %ymm3, -128\(%rdi\)
 **     vmovdqu %ymm2, -96\(%rdi\)
 **     vmovdqu %ymm1, -64\(%rdi\)
 **     vmovdqu %ymm0, -32\(%rdi\)
-**     cmpq    \$128, %rsi
+**     cmpq    \$128, %rcx
 **     ja      .L13
 **     vmovdqu %ymm7, -32\(%rax,%rdx\)
 **     vmovdqu %ymm6, -64\(%rax,%rdx\)
@@ -102,33 +102,33 @@
 **     .p2align 4,,10
 **     .p2align 3
 **.L6:
-**     movq    \(%rsi\), %rdi
-**     movq    -8\(%rsi,%rdx\), %rcx
-**     movq    %rdi, \(%rax\)
-**     movq    %rcx, -8\(%rax,%rdx\)
+**     movq    \(%rsi\), %rsi
+**     movq    -8\(%rcx,%rdx\), %rcx
+**     movq    %rsi, \(%rdi\)
+**     movq    %rcx, -8\(%rdi,%rdx\)
 **     ret
 **     .p2align 4,,10
 **     .p2align 3
 **.L12:
-**     vmovdqu \(%rcx\), %ymm7
-**     vmovdqu 32\(%rcx\), %ymm6
+**     vmovdqu \(%rsi\), %ymm7
+**     vmovdqu 32\(%rsi\), %ymm6
 **     leaq    \(%rdi,%rdx\), %rdi
-**     vmovdqu 64\(%rcx\), %ymm5
-**     vmovdqu 96\(%rcx\), %ymm4
-**     addq    %rdx, %rcx
+**     vmovdqu 64\(%rsi\), %ymm5
+**     vmovdqu 96\(%rsi\), %ymm4
+**     addq    %rdx, %rsi
 **.L14:
-**     vmovdqu -32\(%rcx\), %ymm3
-**     vmovdqu -64\(%rcx\), %ymm2
-**     addq    \$-128, %rsi
-**     addq    \$-128, %rdi
-**     vmovdqu -96\(%rcx\), %ymm1
-**     vmovdqu -128\(%rcx\), %ymm0
+**     vmovdqu -32\(%rsi\), %ymm3
+**     vmovdqu -64\(%rsi\), %ymm2
 **     addq    \$-128, %rcx
+**     addq    \$-128, %rdi
+**     vmovdqu -96\(%rsi\), %ymm1
+**     vmovdqu -128\(%rsi\), %ymm0
+**     addq    \$-128, %rsi
 **     vmovdqu %ymm3, 96\(%rdi\)
 **     vmovdqu %ymm2, 64\(%rdi\)
 **     vmovdqu %ymm1, 32\(%rdi\)
 **     vmovdqu %ymm0, \(%rdi\)
-**     cmpq    \$128, %rsi
+**     cmpq    \$128, %rcx
 **     ja      .L14
 **     vmovdqu %ymm7, \(%rax\)
 **     vmovdqu %ymm6, 32\(%rax\)
@@ -153,18 +153,18 @@
 **     .p2align 4,,10
 **     .p2align 3
 **.L7:
-**     movl    \(%rsi\), %edi
-**     movl    -4\(%rsi,%rdx\), %ecx
-**     movl    %edi, \(%rax\)
-**     movl    %ecx, -4\(%rax,%rdx\)
+**     movl    \(%rsi\), %esi
+**     movl    -4\(%rcx,%rdx\), %ecx
+**     movl    %esi, \(%rdi\)
+**     movl    %ecx, -4\(%rdi,%rdx\)
 **     ret
 **     .p2align 4,,10
 **     .p2align 3
 **.L8:
-**     movzwl  \(%rsi\), %edi
-**     movzwl  -2\(%rsi,%rdx\), %ecx
-**     movw    %di, \(%rax\)
-**     movw    %cx, -2\(%rax,%rdx\)
+**     movzwl  \(%rsi\), %esi
+**     movzwl  -2\(%rcx,%rdx\), %ecx
+**     movw    %si, \(%rdi\)
+**     movw    %cx, -2\(%rdi,%rdx\)
 **     ret
 **     .cfi_endproc
 **...
diff --git a/gcc/testsuite/gcc.target/i386/builtin-memmove-2c.c b/gcc/testsuite/gcc.target/i386/builtin-memmove-2c.c
index 656986b458e..9a886fd4cce 100644
--- a/gcc/testsuite/gcc.target/i386/builtin-memmove-2c.c
+++ b/gcc/testsuite/gcc.target/i386/builtin-memmove-2c.c
@@ -16,27 +16,27 @@
 **     jbe     .L19
 **     cmpq    \$512, %rdx
 **     jbe     .L20
-**     movq    %rdx, %rsi
-**     cmpq    %rdi, %rcx
+**     movq    %rdx, %rcx
+**     cmpq    %rdi, %rsi
 **     jb      .L13
 **     je      .L2
-**     vmovdqu64       -64\(%rcx,%rdx\), %zmm7
-**     vmovdqu64       -128\(%rcx,%rdx\), %zmm6
-**     vmovdqu64       -192\(%rcx,%rdx\), %zmm5
-**     vmovdqu64       -256\(%rcx,%rdx\), %zmm4
+**     vmovdqu64       -64\(%rsi,%rdx\), %zmm7
+**     vmovdqu64       -128\(%rsi,%rdx\), %zmm6
+**     vmovdqu64       -192\(%rsi,%rdx\), %zmm5
+**     vmovdqu64       -256\(%rsi,%rdx\), %zmm4
 **.L14:
-**     vmovdqu64       \(%rcx\), %zmm3
-**     vmovdqu64       64\(%rcx\), %zmm2
-**     subq    \$256, %rsi
+**     vmovdqu64       \(%rsi\), %zmm3
+**     vmovdqu64       64\(%rsi\), %zmm2
+**     subq    \$256, %rcx
 **     addq    \$256, %rdi
-**     vmovdqu64       128\(%rcx\), %zmm1
-**     addq    \$256, %rcx
-**     vmovdqu64       -64\(%rcx\), %zmm0
+**     vmovdqu64       128\(%rsi\), %zmm1
+**     addq    \$256, %rsi
+**     vmovdqu64       -64\(%rsi\), %zmm0
 **     vmovdqu64       %zmm3, -256\(%rdi\)
 **     vmovdqu64       %zmm2, -192\(%rdi\)
 **     vmovdqu64       %zmm1, -128\(%rdi\)
 **     vmovdqu64       %zmm0, -64\(%rdi\)
-**     cmpq    \$256, %rsi
+**     cmpq    \$256, %rcx
 **     ja      .L14
 **     vmovdqu64       %zmm7, -64\(%rax,%rdx\)
 **     vmovdqu64       %zmm6, -128\(%rax,%rdx\)
@@ -113,25 +113,25 @@
 **     .p2align 4,,10
 **     .p2align 3
 **.L13:
-**     vmovdqu64       \(%rcx\), %zmm7
+**     vmovdqu64       \(%rsi\), %zmm7
 **     leaq    \(%rdi,%rdx\), %rdi
-**     vmovdqu64       64\(%rcx\), %zmm6
-**     vmovdqu64       128\(%rcx\), %zmm5
-**     vmovdqu64       192\(%rcx\), %zmm4
-**     addq    %rdx, %rcx
+**     vmovdqu64       64\(%rsi\), %zmm6
+**     vmovdqu64       128\(%rsi\), %zmm5
+**     vmovdqu64       192\(%rsi\), %zmm4
+**     addq    %rdx, %rsi
 **.L15:
-**     vmovdqu64       -64\(%rcx\), %zmm3
-**     vmovdqu64       -128\(%rcx\), %zmm2
-**     subq    \$256, %rsi
-**     subq    \$256, %rdi
-**     vmovdqu64       -192\(%rcx\), %zmm1
+**     vmovdqu64       -64\(%rsi\), %zmm3
+**     vmovdqu64       -128\(%rsi\), %zmm2
 **     subq    \$256, %rcx
-**     vmovdqu64       \(%rcx\), %zmm0
+**     subq    \$256, %rdi
+**     vmovdqu64       -192\(%rsi\), %zmm1
+**     subq    \$256, %rsi
+**     vmovdqu64       \(%rsi\), %zmm0
 **     vmovdqu64       %zmm3, 192\(%rdi\)
 **     vmovdqu64       %zmm2, 128\(%rdi\)
 **     vmovdqu64       %zmm1, 64\(%rdi\)
 **     vmovdqu64       %zmm0, \(%rdi\)
-**     cmpq    \$256, %rsi
+**     cmpq    \$256, %rcx
 **     ja      .L15
 **     vmovdqu64       %zmm7, \(%rax\)
 **     vmovdqu64       %zmm6, 64\(%rax\)
@@ -156,26 +156,26 @@
 **     .p2align 4,,10
 **     .p2align 3
 **.L7:
-**     movq    \(%rsi\), %rdi
-**     movq    -8\(%rsi,%rdx\), %rcx
-**     movq    %rdi, \(%rax\)
-**     movq    %rcx, -8\(%rax,%rdx\)
+**     movq    \(%rsi\), %rsi
+**     movq    -8\(%rcx,%rdx\), %rcx
+**     movq    %rsi, \(%rdi\)
+**     movq    %rcx, -8\(%rdi,%rdx\)
 **     ret
 **     .p2align 4,,10
 **     .p2align 3
 **.L8:
-**     movl    \(%rsi\), %edi
-**     movl    -4\(%rsi,%rdx\), %ecx
-**     movl    %edi, \(%rax\)
-**     movl    %ecx, -4\(%rax,%rdx\)
+**     movl    \(%rsi\), %esi
+**     movl    -4\(%rcx,%rdx\), %ecx
+**     movl    %esi, \(%rdi\)
+**     movl    %ecx, -4\(%rdi,%rdx\)
 **     ret
 **     .p2align 4,,10
 **     .p2align 3
 **.L9:
-**     movzwl  \(%rsi\), %edi
-**     movzwl  -2\(%rsi,%rdx\), %ecx
-**     movw    %di, \(%rax\)
-**     movw    %cx, -2\(%rax,%rdx\)
+**     movzwl  \(%rsi\), %esi
+**     movzwl  -2\(%rcx,%rdx\), %ecx
+**     movw    %si, \(%rdi\)
+**     movw    %cx, -2\(%rdi,%rdx\)
 **     ret
 **     .cfi_endproc
 **...
diff --git a/gcc/testsuite/gcc.target/i386/builtin-memmove-2d.c b/gcc/testsuite/gcc.target/i386/builtin-memmove-2d.c
index 324de74519e..0fa919b4a74 100644
--- a/gcc/testsuite/gcc.target/i386/builtin-memmove-2d.c
+++ b/gcc/testsuite/gcc.target/i386/builtin-memmove-2d.c
@@ -9,6 +9,7 @@
 **.LFB0:
 **     .cfi_startproc
 **     movq    %rdi, %rax
+**     movq    %rsi, %rcx
 **     cmpq    \$8, %rdx
 **     jb      .L3
 **     cmpq    \$16, %rdx
@@ -17,9 +18,8 @@
 **     .cfi_def_cfa_offset 40
 **     cmpq    \$64, %rdx
 **     jbe     .L20
-**     movq    %rsi, %rcx
-**     movq    %rdx, %rsi
-**     cmpq    %rdi, %rcx
+**     movq    %rdx, %rcx
+**     cmpq    %rdi, %rsi
 **     jb      .L10
 **     je      .L2
 **     movq    %rbx, \(%rsp\)
@@ -30,26 +30,26 @@
 **     .cfi_offset 6, -32
 **     .cfi_offset 14, -24
 **     .cfi_offset 15, -16
-**     movq    -8\(%rcx,%rdx\), %r15
-**     movq    -16\(%rcx,%rdx\), %r14
-**     movq    -24\(%rcx,%rdx\), %rbp
-**     movq    -32\(%rcx,%rdx\), %r11
+**     movq    -8\(%rsi,%rdx\), %r14
+**     movq    -16\(%rsi,%rdx\), %r15
+**     movq    -24\(%rsi,%rdx\), %rbp
+**     movq    -32\(%rsi,%rdx\), %r11
 **.L11:
-**     movq    8\(%rcx\), %r10
-**     movq    16\(%rcx\), %r9
-**     subq    \$32, %rsi
+**     movq    8\(%rsi\), %r10
+**     movq    16\(%rsi\), %r9
+**     subq    \$32, %rcx
 **     addq    \$32, %rdi
-**     movq    24\(%rcx\), %r8
-**     movq    \(%rcx\), %rbx
-**     addq    \$32, %rcx
+**     movq    24\(%rsi\), %r8
+**     movq    \(%rsi\), %rbx
+**     addq    \$32, %rsi
 **     movq    %r10, -24\(%rdi\)
 **     movq    %rbx, -32\(%rdi\)
 **     movq    %r9, -16\(%rdi\)
 **     movq    %r8, -8\(%rdi\)
-**     cmpq    \$32, %rsi
+**     cmpq    \$32, %rcx
 **     ja      .L11
-**     movq    %r15, -8\(%rax,%rdx\)
-**     movq    %r14, -16\(%rax,%rdx\)
+**     movq    %r14, -8\(%rax,%rdx\)
+**     movq    %r15, -16\(%rax,%rdx\)
 **     movq    %rbp, -24\(%rax,%rdx\)
 **     movq    %r11, -32\(%rax,%rdx\)
 **     movq    \(%rsp\), %rbx
@@ -67,10 +67,10 @@
 **     .cfi_def_cfa_offset 8
 **     cmpq    \$4, %rdx
 **     jb      .L21
-**     movl    \(%rsi\), %edi
-**     movl    -4\(%rsi,%rdx\), %ecx
-**     movl    %edi, \(%rax\)
-**     movl    %ecx, -4\(%rax,%rdx\)
+**     movl    \(%rsi\), %esi
+**     movl    -4\(%rcx,%rdx\), %ecx
+**     movl    %esi, \(%rdi\)
+**     movl    %ecx, -4\(%rdi,%rdx\)
 **     ret
 **     .p2align 4,,10
 **     .p2align 3
@@ -84,10 +84,10 @@
 **     .p2align 4,,10
 **     .p2align 3
 **.L19:
-**     movq    \(%rsi\), %rdi
-**     movq    -8\(%rsi,%rdx\), %rcx
-**     movq    %rdi, \(%rax\)
-**     movq    %rcx, -8\(%rax,%rdx\)
+**     movq    \(%rsi\), %rsi
+**     movq    -8\(%rcx,%rdx\), %rcx
+**     movq    %rsi, \(%rdi\)
+**     movq    %rcx, -8\(%rdi,%rdx\)
 **     ret
 **     .p2align 4,,10
 **     .p2align 3
@@ -96,29 +96,25 @@
 **     cmpq    \$32, %rdx
 **     jb      .L9
 **     movq    %rbx, \(%rsp\)
-**     movq    %r14, 16\(%rsp\)
 **     .cfi_offset 3, -40
-**     .cfi_offset 14, -24
 **     movq    \(%rsi\), %rbx
-**     movq    8\(%rsi\), %r14
-**     movq    16\(%rsi\), %r11
-**     movq    24\(%rsi\), %r10
-**     movq    -8\(%rsi,%rdx\), %r9
-**     movq    -16\(%rsi,%rdx\), %r8
-**     movq    -24\(%rsi,%rdx\), %rdi
-**     movq    -32\(%rsi,%rdx\), %rcx
+**     movq    8\(%rsi\), %r11
+**     movq    16\(%rsi\), %r10
+**     movq    24\(%rsi\), %r9
+**     movq    -8\(%rsi,%rdx\), %r8
+**     movq    -16\(%rsi,%rdx\), %rdi
+**     movq    -32\(%rcx,%rdx\), %rcx
+**     movq    -24\(%rsi,%rdx\), %rsi
 **     movq    %rbx, \(%rax\)
-**     movq    %r14, 8\(%rax\)
-**     movq    %r11, 16\(%rax\)
-**     movq    %r10, 24\(%rax\)
-**     movq    %r9, -8\(%rax,%rdx\)
-**     movq    %r8, -16\(%rax,%rdx\)
-**     movq    %rdi, -24\(%rax,%rdx\)
+**     movq    %r11, 8\(%rax\)
+**     movq    %r10, 16\(%rax\)
+**     movq    %r9, 24\(%rax\)
+**     movq    %r8, -8\(%rax,%rdx\)
+**     movq    %rdi, -16\(%rax,%rdx\)
+**     movq    %rsi, -24\(%rax,%rdx\)
 **     movq    %rcx, -32\(%rax,%rdx\)
 **     movq    \(%rsp\), %rbx
 **     .cfi_restore 3
-**     movq    16\(%rsp\), %r14
-**     .cfi_restore 14
 **.L2:
 **     addq    \$32, %rsp
 **     .cfi_def_cfa_offset 8
@@ -126,10 +122,10 @@
 **     .p2align 4,,10
 **     .p2align 3
 **.L6:
-**     movzwl  \(%rsi\), %edi
-**     movzwl  -2\(%rsi,%rdx\), %ecx
-**     movw    %di, \(%rax\)
-**     movw    %cx, -2\(%rax,%rdx\)
+**     movzwl  \(%rsi\), %esi
+**     movzwl  -2\(%rcx,%rdx\), %ecx
+**     movw    %si, \(%rdi\)
+**     movw    %cx, -2\(%rdi,%rdx\)
 **     ret
 **     .p2align 4,,10
 **     .p2align 3
@@ -139,13 +135,13 @@
 **     .p2align 3
 **.L9:
 **     .cfi_def_cfa_offset 40
-**     movq    \(%rsi\), %r9
-**     movq    8\(%rsi\), %r8
-**     movq    -8\(%rsi,%rdx\), %rdi
-**     movq    -16\(%rsi,%rdx\), %rcx
-**     movq    %r9, \(%rax\)
-**     movq    %r8, 8\(%rax\)
-**     movq    %rdi, -8\(%rax,%rdx\)
+**     movq    \(%rsi\), %r8
+**     movq    8\(%rsi\), %rdi
+**     movq    -16\(%rcx,%rdx\), %rcx
+**     movq    -8\(%rsi,%rdx\), %rsi
+**     movq    %r8, \(%rax\)
+**     movq    %rdi, 8\(%rax\)
+**     movq    %rsi, -8\(%rax,%rdx\)
 **     movq    %rcx, -16\(%rax,%rdx\)
 **     jmp     .L2
 **     .p2align 4,,10
@@ -158,35 +154,35 @@
 **     .cfi_offset 3, -40
 **     .cfi_offset 14, -24
 **     .cfi_offset 15, -16
-**     movq    \(%rcx\), %r14
-**     movq    8\(%rcx\), %r15
-**     movq    16\(%rcx\), %r10
-**     movq    24\(%rcx\), %r11
-**     addq    %rdx, %rcx
+**     movq    \(%rsi\), %r14
+**     movq    8\(%rsi\), %r15
+**     movq    16\(%rsi\), %r10
+**     movq    24\(%rsi\), %r11
+**     addq    %rdx, %rsi
 **.L12:
-**     movq    -16\(%rcx\), %r9
-**     movq    -24\(%rcx\), %r8
-**     subq    \$32, %rsi
-**     subq    \$32, %rdi
-**     movq    -32\(%rcx\), %rdx
-**     movq    -8\(%rcx\), %rbx
+**     movq    -16\(%rsi\), %r9
+**     movq    -24\(%rsi\), %r8
 **     subq    \$32, %rcx
+**     subq    \$32, %rdi
+**     movq    -32\(%rsi\), %rdx
+**     movq    -8\(%rsi\), %rbx
+**     subq    \$32, %rsi
 **     movq    %r9, 16\(%rdi\)
 **     movq    %rbx, 24\(%rdi\)
 **     movq    %r8, 8\(%rdi\)
 **     movq    %rdx, \(%rdi\)
-**     cmpq    \$32, %rsi
+**     cmpq    \$32, %rcx
 **     ja      .L12
 **     movq    %r14, \(%rax\)
+**     movq    %r15, 8\(%rax\)
+**     movq    %r10, 16\(%rax\)
+**     movq    %r11, 24\(%rax\)
 **     movq    \(%rsp\), %rbx
 **     .cfi_restore 3
-**     movq    %r15, 8\(%rax\)
 **     movq    16\(%rsp\), %r14
 **     .cfi_restore 14
-**     movq    %r10, 16\(%rax\)
 **     movq    24\(%rsp\), %r15
 **     .cfi_restore 15
-**     movq    %r11, 24\(%rax\)
 **     jmp     .L2
 **     .cfi_endproc
 **...
