Re: [RFC, ARM] later split of symbol_refs

2012-07-04 Thread Dmitry Melnik

On 06/29/2012 06:31 PM, Ramana Radhakrishnan wrote:

> > +;; Split symbol_refs at the later stage (after cprop), instead of generating
> > +;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
> > +;; and lo_sum would be merged back into memory load at cprop.  However,
> > +;; movt/movw is preferable, because it usually executes faster than a load
>
> I would rewrite part of your comment as
>
> "However if the default is to prefer to use movw/movt rather than the
> constant pool use that. instead of a load from the constant pool."

Ok with this comment?

+;; Split symbol_refs at the later stage (after cprop), instead of generating
+;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
+;; and lo_sum would be merged back into memory load at cprop.  However,
+;; if the default is to prefer movt/movw rather than a load from the constant
+;; pool, the performance is usually better.


--
Best regards,
   Dmitry

2009-05-29  Julian Brown  

gcc/
	* config/arm/arm.md (movsi): Don't split symbol refs here.
	(define_split): New.

--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -5472,14 +5472,6 @@
   optimize && can_create_pseudo_p ());
   DONE;
 }
-
-  if (TARGET_USE_MOVT && !target_word_relocations
-	  && GET_CODE (operands[1]) == SYMBOL_REF
-	  && !flag_pic && !arm_tls_referenced_p (operands[1]))
-	{
-	  arm_emit_movpair (operands[0], operands[1]);
-	  DONE;
-	}
 }
   else /* TARGET_THUMB1...  */
 {
@@ -5588,6 +5580,24 @@
   "
 )
 
+;; Split symbol_refs at the later stage (after cprop), instead of generating
+;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
+;; and lo_sum would be merged back into memory load at cprop.  However,
+;; if the default is to prefer movt/movw rather than a load from the constant
+;; pool, the performance is usually better.
+(define_split
+  [(set (match_operand:SI 0 "arm_general_register_operand" "")
+   (match_operand:SI 1 "general_operand" ""))]
+  "TARGET_32BIT
+   && TARGET_USE_MOVT && GET_CODE (operands[1]) == SYMBOL_REF
+   && !flag_pic && !target_word_relocations
+   && !arm_tls_referenced_p (operands[1])"
+  [(clobber (const_int 0))]
+{
+  arm_emit_movpair (operands[0], operands[1]);
+  DONE;
+})
+
 (define_insn "*thumb1_movsi_insn"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=l,l,l,l,l,>,l, m,*l*h*k")
(match_operand:SI 1 "general_operand"  "l, I,J,K,>,l,mi,l,*l*h*k"))]


Re: [RFC, ARM] later split of symbol_refs

2012-06-29 Thread Dmitry Melnik

On 06/27/2012 07:53 PM, Richard Earnshaw wrote:

Please update the ChangeLog entry (it's not appropriate to mention
Sourcery G++) and add a comment as Steven has suggested.

Otherwise OK.




Updated.
Ok to commit now?


--
Best regards,
  Dmitry

2009-05-29  Julian Brown  

gcc/
	* config/arm/arm.md (movsi): Don't split symbol refs here.
	(define_split): New.

diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 0654564..98ff382 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -5472,14 +5472,6 @@
 			   optimize && can_create_pseudo_p ());
   DONE;
 }
-
-  if (TARGET_USE_MOVT && !target_word_relocations
-	  && GET_CODE (operands[1]) == SYMBOL_REF
-	  && !flag_pic && !arm_tls_referenced_p (operands[1]))
-	{
-	  arm_emit_movpair (operands[0], operands[1]);
-	  DONE;
-	}
 }
   else /* TARGET_THUMB1...  */
 {
@@ -5588,6 +5580,23 @@
   "
 )
 
+;; Split symbol_refs at the later stage (after cprop), instead of generating
+;; movt/movw pair directly at expand.  Otherwise corresponding high_sum
+;; and lo_sum would be merged back into memory load at cprop.  However,
+;; movt/movw is preferable, because it usually executes faster than a load.
+(define_split
+  [(set (match_operand:SI 0 "arm_general_register_operand" "")
+   (match_operand:SI 1 "general_operand" ""))]
+  "TARGET_32BIT
+   && TARGET_USE_MOVT && GET_CODE (operands[1]) == SYMBOL_REF
+   && !flag_pic && !target_word_relocations
+   && !arm_tls_referenced_p (operands[1])"
+  [(clobber (const_int 0))]
+{
+  arm_emit_movpair (operands[0], operands[1]);
+  DONE;
+})
+
 (define_insn "*thumb1_movsi_insn"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=l,l,l,l,l,>,l, m,*l*h*k")
 	(match_operand:SI 1 "general_operand"  "l, I,J,K,>,l,mi,l,*l*h*k"))]


Re: [RFC, ARM] later split of symbol_refs

2012-06-29 Thread Dmitry Melnik


On 06/27/2012 07:55 PM, Ramana Radhakrishnan wrote:

> I must admit that I had been suggesting to Zhenqiang about turning
> this off by tightening the movsi_insn predicates rather than adding a
> split, but given that it appears to produce enough benefit in this
> case I don't have any reasons to object ...
>
> However it's interesting that this doesn't seem to help vpr 

We retested vpr, but it just seems to be unstable:

                 base            peak
                 time            time

  175.vpr   1400  502  X   1400  526  X
  175.vpr   1400  500  X   1400  524  X
  175.vpr   1400  516  X   1400  526  X
  175.vpr   1400  492  X   1400  481  X
  175.vpr   1400  496  X   1400  485  X
  median          500            524

However, the minimum time is still better with the patch.

And here are all 3 runs for previously reported data:

test         base ratio  peak ratio  median base  median peak  improvement
164.gzip         284         281         284          282         -0.70%
164.gzip         284         282
164.gzip         285         283
175.vpr          329         306         323          306         -5.26%
175.vpr          323         305
175.vpr          306         307
176.gcc          542         554         542          557          2.77%
176.gcc          541         558
176.gcc          544         557
181.mcf          343         340         340          341          0.29%
181.mcf          339         342
181.mcf          340         341
186.crafty       383         399         383          391          2.09%
186.crafty       390         391
186.crafty       380         386
197.parser       254         257         254          257          1.18%
197.parser       254         257
197.parser       254         257
252.eon          591         644         594          644          8.42%
252.eon          598         644
252.eon          594         643
253.perlbmk      462         490         463          490          5.83%
253.perlbmk      463         490
253.perlbmk      463         490
254.gap          415         473         425          467          9.88%
254.gap          430         467
254.gap          425         464
255.vortex       384         430         382          430         12.57%
255.vortex       382         430
255.vortex       381         430
256.bzip2        331         354         332          354          6.63%
256.bzip2        332         354
256.bzip2        335         349
300.twolf        323         356         328          352          7.32%
300.twolf        347         337
300.twolf        328         352


--
Best regards,
  Dmitry



[RFC, ARM] later split of symbol_refs

2012-06-27 Thread Dmitry Melnik

Hi,

We'd like to draw attention to CodeSourcery's patch for the ARM backend,
from which GCC mainline can gain 4% on SPEC2K INT:
http://cgit.openembedded.org/openembedded/plain/recipes/gcc/gcc-4.5/linaro/gcc-4.5-linaro-r99369.patch
(the patch is also attached).


Originally, we noticed that GNU Go runs 6% faster on cortex-a8 with
-fno-gcse.  After profiling, we found that this is most likely caused by
cache misses when accessing global variables.  GCC generates ldr
instructions (loads from the constant pool) for them, while this can be
avoided by emitting a movt/movw pair for such cases.  The corresponding
RTL expressions are high and lo_sum.  Currently, a symbol_ref expands into
high and lo_sum, but then cprop1 decides that this is redundant and merges
them back into a single load insn.
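To make the difference concrete, here is a hedged sketch of the two code
sequences for a global access (the variable name and registers are made up;
the assembly in the comment is schematic, not actual compiler output):

extern int counter;     /* hypothetical global */

int
get_counter (void)
{
  /* Constant-pool load, what is left after cprop1 merges high/lo_sum back:
         ldr   r3, .L2          @ load the address of counter from the pool
         ldr   r0, [r3]
         ...
     .L2:
         .word counter

     movt/movw pair, what arm_emit_movpair() produces:
         movw  r3, #:lower16:counter
         movt  r3, #:upper16:counter
         ldr   r0, [r3]

     The second form avoids the extra data load (and its potential cache
     miss) at the cost of one more instruction.  */
  return counter;
}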


The problem was also found by the Linaro community:
https://bugs.launchpad.net/gcc-linaro/+bug/886124 .
There is also a patch from CodeSourcery (attached), which was ported to
Linaro GCC 4.5 but is missing in later Linaro releases.
This patch splits symbol_refs at a later stage (after cprop), instead of
generating movt/movw at expand.


It fixed our test case on GNU Go.  Also we tested it on SPEC2K INT (ref) 
with GCC 4.8 snapshot from May 12, 2012 on cortex-a9 with -O2 and -mthumb:


                    Base      Base      Base    Peak      Peak      Peak
Benchmarks        Ref Time  Run Time   Ratio  Ref Time  Run Time   Ratio
----------------  --------  --------  ------  --------  --------  ------
164.gzip              1400       492     284      1400       497     282   -0.70%
175.vpr               1400       433     323      1400       458     306   -5.26%
176.gcc               1100       203     542      1100       198     557    2.77%
181.mcf               1800       529     340      1800       528     341    0.29%
186.crafty            1000       261     383      1000       256     391    2.09%
197.parser            1800       709     254      1800       701     257    1.18%
252.eon               1300       219     594      1300       202     644    8.42%
253.perlbmk           1800       389     463      1800       367     490    5.83%
254.gap               1100       259     425      1100       236     467    9.88%
255.vortex            1900       498     382      1900       442     430   12.57%
256.bzip2             1500       452     332      1500       424     354    6.63%
300.twolf             3000       916     328      3000       853     352    7.32%
SPECint_base2000                          376
SPECint2000                                                           391    3.99%


SPEC2K INT improves by 4% (up to 12.5% on vortex; the vpr slowdown is
likely due to the large variance on this test).


Similarly, there are gains of 3-4% without -mthumb on cortex-a9 and on 
cortex-a8 (thumb2 and ARM modes).


This patch can be applied to current trunk and passes regtest 
successfully on qemu-arm.

Maybe it will be good to have it in trunk?
If everybody agrees, we can take care of committing it.

--
Best regards,
  Dmitry
2010-08-20  Jie Zhang  

	Merged from Sourcery G++ 4.4:

	gcc/
	2009-05-29  Julian Brown  
	Merged from Sourcery G++ 4.3:
	* config/arm/arm.md (movsi): Don't split symbol refs here.
	(define_split): New.

 2010-08-18  Julian Brown  
 
 	Issue #9222

=== modified file 'gcc/config/arm/arm.md'
--- old/gcc/config/arm/arm.md	2010-08-20 16:41:37 +
+++ new/gcc/config/arm/arm.md	2010-08-23 14:39:12 +
@@ -5150,14 +5150,6 @@
 			   optimize && can_create_pseudo_p ());
   DONE;
 }
-
-  if (TARGET_USE_MOVT && !target_word_relocations
-	  && GET_CODE (operands[1]) == SYMBOL_REF
-	  && !flag_pic && !arm_tls_referenced_p (operands[1]))
-	{
-	  arm_emit_movpair (operands[0], operands[1]);
-	  DONE;
-	}
 }
   else /* TARGET_THUMB1...  */
 {
@@ -5265,6 +5257,19 @@
   "
 )
 
+(define_split
+  [(set (match_operand:SI 0 "arm_general_register_operand" "")
+	(match_operand:SI 1 "general_operand" ""))]
+  "TARGET_32BIT
+   && TARGET_USE_MOVT && GET_CODE (operands[1]) == SYMBOL_REF
+   && !flag_pic && !target_word_relocations
+   && !arm_tls_referenced_p (operands[1])"
+  [(clobber (const_int 0))]
+{
+  arm_emit_movpair (operands[0], operands[1]);
+  DONE;
+})
+
 (define_insn "*thumb1_movsi_insn"
   [(set (match_operand:SI 0 "nonimmediate_operand" "=l,l,l,l,l,>,l, m,*lhk")
 	(match_operand:SI 1 "general_operand"  "l, I,J,K,>,l,mi,l,*lhk"))]



[PATCH, ARM] Cortex-A8 backend fixes

2012-02-09 Thread Dmitry Melnik

This patch fixes a few things in the pipeline description of ARM Cortex-A8.

1) arm_no_early_alu_shift_value_dep() checks the early dependence only for
one argument, ignoring the dependence on the register used as the shift
amount.  For example, this function is used as a condition in the bypass
that sets dep_cost=0 between mov and ALU operations:


  mov r0, r1
  add r3, r4, r5, asr r0

This results in dep_cost returning 0 for these insns, while according
to the Technical Reference Manual it should be 1
(http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Babcagee.html).



Also, in PLUS and MINUS rtx expressions the order of operands differs:
PLUS has the shift expression as its first operand, while MINUS usually
has the shift as its second operand.  But arm_no_early_alu_shift_value_dep()
checks only the first operand as EARLY_OP.  We changed
arm_no_early_alu_shift_dep() so that it uses for_each_rtx() to find the
SHIFT expression.  As all registers of the SHIFT expression are required
at stage E1, it makes no difference whether it is the shift's first or
second operand, so we use the new function instead of
arm_no_early_alu_shift_value_dep() in the Cortex-A8 bypasses.  The
functions arm_no_early_alu_shift_[value_]dep() are also used in the
Cortex-A5, Cortex-R4 and ARM1136JFS descriptions, so we named the modified
function arm_cortex_a8_no_early_alu_shift_dep().
Besides SHIFTs and ROTATEs, the function also handles MULT (which is used
to represent shifts by a constant) as well as ZERO_EXTEND and SIGN_EXTEND
(they also have the alu_shift type).
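To illustrate the insn shapes involved, here is a hedged sketch (the
assembly in the comments is schematic, not verified compiler output):

int
shift_by_register (int a, int b, int n)
{
  return a + (b << n);       /* add r0, r0, r1, lsl r2 -- the shift amount
                                lives in a register and is needed at E1 */
}

int
shift_by_constant (int a, int b)
{
  return a - (b << 2);       /* sub r0, r0, r1, lsl #2 -- combine represents
                                the constant left shift as
                                (mult ... (const_int 4)) in RTL, hence the
                                MULT case in is_early_op()  */
}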


2) The MUL to ALU bypass has an incorrect delay of 4 cycles, while
according to the TRM it has to be 5 for MUL and 6 for MULL.  The patch
splits this bypass in two and sets the correct delay values.


3) In cortex-a8.md, MOV-with-shift instructions were matched to the wrong
reservations (cortex_a8_alu_shift, cortex_a8_alu_shift_reg).  Adding the
"mov" insn attribute to the arm_shiftsi3 pattern in arm.md fixes that.


4) SMLALxy was moved from the cortex_a8_mull reservation to
cortex_a8_smlald, which according to the TRM has the proper timing for
this insn (1 cycle less than MULL).


5) The ARM Cortex-A8 TRM itself contains inaccurate timings for the
availability of RdLo in some multiply instructions.  Namely, the lower
part of the result of the (S|U)MULL, (S|U)MLAL, UMAAL, SMLALxy, SMLALD and
SMLSLD instructions is already available at stage E4 (instead of E5 as in
the TRM).


This information was initially found on the Beagle Board mailing list,
and it is confirmed by our tests and by these sites:
http://www.avison.me.uk/ben/programming/cortex-a8.html and
http://hilbert-space.de/?p=66


The patch adds two bypasses between these instructions and the MOV
instruction, using arm_mull_low_part_dep() to check whether the dependency
is only on the low part of the MUL destination.  Bypasses between MULL and
ALU insns for RdLo can't be added, because bypasses between this pair of
reservations already exist.  However, in practice these multiply insns are
rare, and in SPEC2K INT code the low part of the result of such insns is
never used.


--
Best regards,
  Dmitry
2012-02-09  Ruben Buchatskiy 

* config/arm/arm-protos.h (arm_cortex_a8_no_early_alu_shift_dep,
arm_mull_low_part_dep): Declare.
* config/arm/arm.c (arm_cortex_a8_no_early_alu_shift_dep,
arm_mull_low_part_dep, is_early_op): New functions.
* config/arm/arm.md (arm_shiftsi3): Add "mov" insn attribute.

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 23a29c6..2a1334e 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -97,10 +97,12 @@ extern int neon_struct_mem_operand (rtx);
 extern int arm_no_early_store_addr_dep (rtx, rtx);
 extern int arm_early_store_addr_dep (rtx, rtx);
 extern int arm_early_load_addr_dep (rtx, rtx);
+extern int arm_cortex_a8_no_early_alu_shift_dep (rtx, rtx);
 extern int arm_no_early_alu_shift_dep (rtx, rtx);
 extern int arm_no_early_alu_shift_value_dep (rtx, rtx);
 extern int arm_no_early_mul_dep (rtx, rtx);
 extern int arm_mac_accumulator_is_mul_result (rtx, rtx);
+extern int arm_mull_low_part_dep (rtx, rtx);
 
 extern int tls_mentioned_p (rtx);
 extern int symbol_mentioned_p (rtx);
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index ee26c51..e92c75b 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -23035,6 +23035,56 @@ arm_early_load_addr_dep (rtx producer, rtx consumer)
   return reg_overlap_mentioned_p (value, addr);
 }
 
+/* Return nonzero and copy *X to *DATA if *X is a SHIFT operand.
+   This is a callback for for_each_rtx in arm_no_early_alu_shift_dep().  */
+
+static int
+is_early_op (rtx *x, void *data)
+{
+  rtx *rtx_data = (rtx *) data;
+  enum rtx_code code;
+  code = GET_CODE (*x);
+
+  if (code == ASHIFT || code == ASHIFTRT || code == LSHIFTRT
+      || code == ROTATERT || code == ROTATE || code == MULT
+      || code == ZERO_EXTEND || code == SIGN_EXTEND)
+    {
+      *rtx_data = *x;
+      return 1;
+    }
+  else
+    return 0;

[RFC, ARM][PATCH 5/5] Swap passes peephole2 and if_after_reload

2011-12-30 Thread Dmitry Melnik
After Thumb-2's peephole2 adds a flag clobber to suitable insns in order to
generate the 16-bit encoding for them, if-conversion can no longer
transform these insns into cond_execs.  In theory, if such an instruction
were converted to conditional form, it would also use a 16-bit encoding,
so the flag clobber would not have to be added to force the 16-bit
encoding.
Swapping the order of these passes actually increased the number of
if-conversions (it results in 2% more IT blocks on SPEC2K INT and
4% more "long" IT blocks that contain two or more instructions).
However, this caused a 2038-byte code size regression on SPEC2K INT.
At first we blamed this growth on the effects described in patches 1-4,
but with those fixes the regression shrank by only 474 bytes.  That is,
the more conditional insns there are, the greater the code size reduction
from patches 1-4, but the altered pass order still loses 1564 bytes to the
original one.

What do you think about the order of these passes on Thumb-2?

(The patch is obvious, and is not attached).


[RFC, ARM][PATCH 4/5] Limit on frequency in if-conversion

2011-12-30 Thread Dmitry Melnik
If one of the branches has a significantly greater probability than the
other, it may be better to rely on the CPU's branch prediction and block
reordering than to put rarely executed instructions into the pipeline.
In this patch we set a 10% frequency ratio as the cutoff.

On SPEC2K INT with -O2 this reduced code size by 28 bytes (no regressions).
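A worked example of the cutoff with hypothetical frequencies (the real
check operates on basic_block fields, as in the ifcvt.c hunk below):

#include <stdio.h>

int
main (void)
{
  int test_bb_frequency = 10000;  /* frequency of the block ending in the test */
  int then_bb_frequency = 800;    /* rarely executed THEN block */

  /* Mirrors the new check: 800 < 10000 / 10, so cond_exec_process_if_block
     now gives up and leaves this branch to the CPU's predictor.  */
  if (then_bb_frequency < test_bb_frequency / 10)
    printf ("not if-converted\n");
  else
    printf ("if-converted\n");
  return 0;
}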
2011-12-29  Dmitry Plotnikov  

gcc/
* ifcvt.c (cond_exec_process_if_block): Added check for frequency ratio.

diff --git a/gcc/ifcvt.c b/gcc/ifcvt.c
index ce60ce2..330034e 100644
--- a/gcc/ifcvt.c
+++ b/gcc/ifcvt.c
@@ -445,6 +445,16 @@ cond_exec_process_if_block (ce_if_block_t * ce_info,
   int then_n_insns, else_n_insns, n_insns;
   enum rtx_code false_code;
 
+  /* If one of branches has significantly greater probability than the other,
+     then we'd better rely on CPU's branch prediction and block reordering
+     than putting rarely executed instructions into the pipeline.  10% ratio
+     seems like a reasonable cutoff.  */
+  if (then_bb && then_bb->frequency < (test_bb->frequency / 10))
+    return FALSE;
+
+  if (else_bb && else_bb->frequency < (test_bb->frequency / 10))
+    return FALSE;
+
   /* If test is comprised of && or || elements, and we've failed at handling
  all of them together, just use the last test if it is the special case of
  && elements without an ELSE block.  */


[RFC, ARM][PATCH 3/5] Adjust the maximum number of if-converted insns to 4

2011-12-30 Thread Dmitry Melnik
This patch adjusts the maximum number of instructions in a basic block
that can be converted into conditional form from 5 down to 4.  The idea is
that the 5th conditional instruction comes at the cost of an extra IT
instruction, while 4 insns fit into a single IT block, so that IT
instruction replaces the eliminated branch insn and the code does not
grow.  This limit is applied to each converted conditional branch.
This reduces code size by 96 bytes on SPEC2K INT with -O2 (with a +4 byte
regression on one test).
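A hedged illustration of the limit (the assembly in the comment is
hand-written and schematic, not verified compiler output):

int
bump_all (int cond, int a, int b, int c, int d)
{
  /* If-converted, the four conditional adds fit under one IT instruction:
         cmp    r0, #0
         itttt  ne
         addne  r1, r1, #1
         addne  r2, r2, #1
         addne  r3, r3, #1
         addne  r4, r4, #1
     A fifth conditional insn would require a second IT instruction, so the
     branch eliminated by if-conversion would be traded for extra bytes.  */
  if (cond)
    {
      a++;
      b++;
      c++;
      d++;
    }
  return a + b + c + d;
}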
2011-12-29  Dmitry Melnik  

gcc/
* config/arm/arm.h (MAX_CONDITIONAL_EXECUTE): New macro.

diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index 85e2b99..acad3ec 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -1990,6 +1990,15 @@ typedef struct
 #define BRANCH_COST(speed_p, predictable_p) \
   (current_tune->branch_cost (speed_p, predictable_p))
 
+/* One IT-block consists of 4 insns at maximum.  If-conversion can eliminate
+   2 branches for IFs with both THEN and ELSE branches, and 1 branch for
+   those with only THEN branch.  If we don't want code size to grow, and just
+   allow trading branch insns for IT insns, we should limit number of converted
+   insns to 4 (and in ifcvt.c it will be doubled if there are 2 branches).  */
+#define MAX_CONDITIONAL_EXECUTE \
+  ((TARGET_THUMB2) ? 4 : BRANCH_COST (optimize_function_for_speed_p (cfun), \
+  false) + 1)
+
 
 /* Position Independent Code.  */
 /* We decide which register to use based on the compilation options and


[RFC, ARM][PATCH 2/5] Try not to split IT-blocks by scheduling conditional insns together

2011-12-30 Thread Dmitry Melnik
GCC's scheduler has no idea about Thumb-2 IT blocks, which are generated
from cond_execs at the stage of emitting assembly code.  So it treats
conditional instructions independently and may intermix them with
non-conditional instructions, which causes extra IT instructions to be
generated.
We have added an arm_sched_reorder target hook, which moves conditional
instructions to the top of the ready list if the last scheduled insn was
conditional, and vice versa.  This is only done for insns with the same
INSN_PRIORITY, so it only resolves tie-breaks.  It also required a few
more target hooks just to save the correct can_issue_more value.
This reduced code size by 144 bytes on SPEC2K INT with -O2 (no
regressions).
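A hedged sketch of the effect (the Thumb-2 assembly in the comment is
hand-written and schematic, not compiler output):

int
mix (int flag, int x, int y, int acc, int k)
{
  if (flag)     /* after if-conversion, both increments become conditional */
    {
      x++;
      y++;
    }
  acc += k;     /* unconditional insn with no ordering constraint */

  /* If the scheduler places the unconditional add between the two
     conditional adds (all of equal priority), the output needs two IT
     instructions; keeping the conditional insns together needs only one:

         it     ne                        itt    ne
         addne  r1, r1, #1                addne  r1, r1, #1
         add    r3, r3, r2                addne  r2, r2, #1
         it     ne                        add    r3, r3, r2
         addne  r2, r2, #1
  */
  return x + y + acc;
}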


2011-12-29  Dmitry Melnik  

gcc/
* config/arm/arm.c (arm_variable_issue, arm_sched_init, arm_sched_finish,
  arm_sched_reorder, arm_dfa_post_advance_cycle): New functions.
  (TARGET_SCHED_VARIABLE_ISSUE, TARGET_SCHED_INIT, TARGET_SCHED_FINISH,
  TARGET_SCHED_REORDER, TARGET_SCHED_REORDER2,
  TARGET_SCHED_DFA_POST_ADVANCE_CYCLE): Added hooks.
  (last_scheduled_insn): New variable.

diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index ee26c51..cabf343 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -57,6 +57,7 @@
 #include "libfuncs.h"
 #include "params.h"
 #include "opts.h"
+#include "sched-int.h"
 
 /* Forward definitions of types.  */
 typedef struct minipool_nodeMnode;
@@ -133,6 +134,12 @@ static void arm_output_function_prologue (FILE *, HOST_WIDE_INT);
 static int arm_comp_type_attributes (const_tree, const_tree);
 static void arm_set_default_type_attributes (tree);
 static int arm_adjust_cost (rtx, rtx, rtx, int);
+static int arm_variable_issue (FILE *, int, rtx, int);
+static void arm_sched_init (FILE *, int, int);
+static void arm_sched_finish (FILE *, int);
+static int arm_sched_reorder (FILE *, int, rtx *, int *, int);
+static void arm_dfa_post_advance_cycle (state_t state);
+
 static int optimal_immediate_sequence (enum rtx_code code,
    unsigned HOST_WIDE_INT val,
    struct four_ints *return_sequence);
@@ -361,6 +368,19 @@ static const struct attribute_spec arm_attribute_table[] =
 #undef  TARGET_SCHED_ADJUST_COST
 #define TARGET_SCHED_ADJUST_COST arm_adjust_cost
 
+#undef TARGET_SCHED_VARIABLE_ISSUE
+#define TARGET_SCHED_VARIABLE_ISSUE arm_variable_issue
+#undef TARGET_SCHED_INIT
+#define TARGET_SCHED_INIT arm_sched_init
+#undef TARGET_SCHED_FINISH
+#define TARGET_SCHED_FINISH arm_sched_finish
+#undef TARGET_SCHED_REORDER
+#define TARGET_SCHED_REORDER arm_sched_reorder
+#undef TARGET_SCHED_REORDER2
+#define TARGET_SCHED_REORDER2 arm_sched_reorder
+#undef TARGET_SCHED_DFA_POST_ADVANCE_CYCLE
+#define TARGET_SCHED_DFA_POST_ADVANCE_CYCLE arm_dfa_post_advance_cycle
+
 #undef TARGET_ENCODE_SECTION_INFO
 #ifdef ARM_PE
 #define TARGET_ENCODE_SECTION_INFO  arm_pe_encode_section_info
@@ -804,6 +824,9 @@ int arm_condexec_mask = 0;
 /* The number of bits used in arm_condexec_mask.  */
 int arm_condexec_masklen = 0;
 
+/* Last scheduled instruction.  */
+static rtx last_scheduled_insn;
+
 /* The condition codes of the ARM, and the inverse function.  */
 static const char * const arm_condition_codes[] =
 {
@@ -8428,6 +8451,101 @@ fa726te_sched_adjust_cost (rtx insn, rtx link, rtx dep, int * cost)
   return true;
 }
 
+/* Holds correct CAN_ISSUE_MORE so arm_sched_reorder can return correct value.  */
+static int cached_can_issue_more;
+
+/* Save CAN_ISSUE_MORE in CACHED_CAN_ISSUE_MORE.  Also move the code from 
+   haifa-sched.c that won't work with arm_variable_issue hook defined.  */
+static int
+arm_variable_issue (FILE *dump ATTRIBUTE_UNUSED,
+int sched_verbose ATTRIBUTE_UNUSED,
+rtx insn,
+int can_issue_more)
+{
+  last_scheduled_insn = insn;
+
+  cached_can_issue_more = can_issue_more;
+
+  if (GET_CODE (PATTERN (insn)) != USE
+  && GET_CODE (PATTERN (insn)) != CLOBBER)
+cached_can_issue_more = can_issue_more - 1;
+
+  return cached_can_issue_more;
+}
+
+/* Init LAST_SCHEDULED_INSN.  */
+static void
+arm_sched_init (FILE *dump ATTRIBUTE_UNUSED,
+int sched_verbose ATTRIBUTE_UNUSED,
+int max_ready ATTRIBUTE_UNUSED)
+{
+  last_scheduled_insn = NULL_RTX;
+}
+
+/* Reset LAST_SCHEDULED_INSN.  */
+static void
+arm_sched_finish (FILE *dump ATTRIBUTE_UNUSED, 
+  int sched_verbose ATTRIBUTE_UNUSED)
+{
+  last_scheduled_insn = NULL_RTX;
+}
+
+/* Remove the instruction at index LOWER from ready queue READY and
+   reinsert it in front of the instruction at index HIGHER.  LOWER must
+   be <= HIGHER.  */
+static void
+arm_promote_ready (rtx *ready, int lower, int higher)
+{
+  rtx new_head;
+  int i;
+
+  new_head = ready[lower];
+  for (i = lower; i < higher; i++)
+ready[i] = ready[i + 1];
+  ready[i] = new_head;
+}
+
+/* M

[RFC, ARM][PATCH 1/5] Split if_then_else into cond_execs

2011-12-30 Thread Dmitry Melnik
This patch adds splits of if_then_else into cond_execs.  This helps
generate the minimum number of IT blocks for two consecutive
if_then_elses, e.g. one ITETE insn instead of the two ITE insns that would
result if each if_then_else were expanded directly into assembly code.
There are three splitters: one for the case where both the THEN and ELSE
values have to be written, and two for the cases where only one of them
does (the last two splitters are required to prevent generation of code
like "(cc) r0 = r0").

On SPEC2K INT with -O2 this reduces code size by 76 bytes (no regressions).
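A hedged sketch of the source pattern and the single IT block it can now
produce (register names in the comment are placeholders; the assembly is
schematic, not compiler output):

int a, b;

void
select_both (int c, int x0, int x1, int y0, int y1)
{
  /* Two back-to-back conditional selects on the same comparison.  After
     the splits, both if_then_elses become cond_exec pairs, and the backend
     can cover all four moves with one ITETE block:
         cmp    rC, #0
         itete  ne
         movne  rA, rX0
         moveq  rA, rX1
         movne  rB, rY0
         moveq  rB, rY1
     instead of two separate ITE blocks.  */
  a = c ? x0 : x1;
  b = c ? y0 : y1;
}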

2011-12-08  Sevak Sargsyan 

gcc/
* config/arm/thumb2.md (new splitters for if_then_else): Turn them
into cond_execs.

diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index 05585da..662f995 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -299,6 +299,57 @@
(set_attr "conds" "use")]
 )
 
+(define_split
+ [(set (match_operand:SI 0 "s_register_operand" "")
+   (if_then_else:SI
+ (match_operator 3 "arm_comparison_operator"
+  [(match_operand 4 "cc_register" "") (const_int 0)])
+ (match_operand:SI 1 "arm_not_operand" "")
+ (match_operand:SI 2 "arm_not_operand" "")))]
+  "TARGET_THUMB2 && reload_completed
+   && (!REG_P (operands[1]) || REGNO (operands[0]) != REGNO (operands[1]))
+   && (!REG_P (operands[2]) || REGNO (operands[0]) != REGNO (operands[2]))"
+[(cond_exec (match_dup 5) (set (match_dup 0) (match_dup 1)))
+ (cond_exec (match_dup 6) (set (match_dup 0) (match_dup 2)))]
+{
+   operands[5] = gen_rtx_fmt_ee (GET_CODE (operands[3]), VOIDmode,
+ operands[4], const0_rtx);
+   operands[6] = gen_rtx_fmt_ee (reversed_comparison_code (operands[3], NULL_RTX),
+ VOIDmode, operands[4], const0_rtx);
+})
+
+(define_split
+ [(set (match_operand:SI 0 "s_register_operand" "")
+   (if_then_else:SI
+ (match_operator 3 "arm_comparison_operator"
+  [(match_operand 4 "cc_register" "") (const_int 0)])
+ (match_operand:SI 1 "arm_not_operand" "")
+ (match_operand:SI 2 "arm_not_operand" "")))]
+  "TARGET_THUMB2 && reload_completed
+   && REG_P (operands[1]) && REGNO (operands[0]) == REGNO (operands[1])
+   && (!REG_P (operands[2]) || REGNO (operands[0]) != REGNO (operands[2]))"
+[(cond_exec (match_dup 5) (set (match_dup 0) (match_dup 2)))]
+{
+   operands[5] = gen_rtx_fmt_ee (reversed_comparison_code (operands[3], NULL_RTX),
+ VOIDmode, operands[4], const0_rtx);
+})
+
+(define_split
+ [(set (match_operand:SI 0 "s_register_operand" "")
+   (if_then_else:SI
+ (match_operator 3 "arm_comparison_operator"
+  [(match_operand 4 "cc_register" "") (const_int 0)])
+ (match_operand:SI 1 "arm_not_operand" "")
+ (match_operand:SI 2 "arm_not_operand" "")))]
+  "TARGET_THUMB2 && reload_completed
+   && (!REG_P (operands[1]) || REGNO (operands[0]) != REGNO (operands[1]))
+   && REG_P (operands[2]) && REGNO (operands[0]) == REGNO (operands[2])"
+[(cond_exec (match_dup 5) (set (match_dup 0) (match_dup 1)))]
+{
+   operands[5] = gen_rtx_fmt_ee (GET_CODE (operands[3]), VOIDmode, operands[4],
+ const0_rtx);
+})
+
 (define_insn "*call_reg_thumb2"
   [(call (mem:SI (match_operand:SI 0 "s_register_operand" "r"))
  (match_operand 1 "" ""))


[RFC, ARM][PATCH 0/5] Enhancements to handling of Thumb-2 conditional insns

2011-12-30 Thread Dmitry Melnik

Hi,

This series of patches solves a few issues we found with Thumb-2
conditional insns.  The fixes include:


1) Split if_then_else into cond_execs to generate only the required
minimum of IT blocks;
2) Group conditional insns of the same INSN_PRIORITY to avoid excessive
splitting of IT blocks;
3) In if-conversion, set the maximum number of converted insns in a branch
to 4, to match the limit for an IT block;
4) Don't perform if-conversion if one of the branches has a significantly
greater probability than the other;
5) Swap the peephole2 and if_after_reload passes in order to generate more
conditional insns (this one is actually more of a problem report than a
fix).


The combined effect on code size from patches 1-4 on SPEC2K INT with -O2 
is as follows:


Test      bytes
name      saved
---------------
gzip          4
vpr           0
gcc         100
mcf           4
crafty       16
parser        0
eon           8
perlbmk       8
gap          44
vortex        8
bzip2        16
twolf        24
---------------
Total:      232

Do you think some of these patches are OK for trunk?

--
Best regards,
  Dmitry


Re: [PATCH, ARM] Support NEON's VABD with combine pass

2011-09-12 Thread Dmitry Melnik



> Interesting but I would be a bit defensive and make sure that this
> matches only if -ffast-math in the FP case. You are sort of relying on
> the fact that vsub wouldn't be generated without ffast-math but I'd
> rather be defensive about it. (This is in case it's not clear in the
> non-intrinsics case).

Fixed.

> BTW was SPEC2k built with -Ofast ? Maybe then you'll see a bit of
> vectorization.

Yes, I built it with -Ofast.  I think it's because SPEC2K tests mostly
use doubles, which are not supported by vabd.


--
Best regards,
   Dmitry
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index a8c1b87..aceb564 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -5607,3 +5607,32 @@
   emit_insn (gen_neon_vec_pack_trunc_ (operands[0], tempreg));
   DONE;
 })
+
+(define_insn "neon_vabd_2"
+ [(set (match_operand:VDQ 0 "s_register_operand" "=w")
+   (abs:VDQ (minus:VDQ (match_operand:VDQ 1 "s_register_operand" "w")
+   (match_operand:VDQ 2 "s_register_operand" "w"]
+ "TARGET_NEON && (! || flag_unsafe_math_optimizations)"
+ "vabd. %0, %1, %2"
+ [(set (attr "neon_type")
+   (if_then_else (ne (symbol_ref "") (const_int 0))
+ (if_then_else (ne (symbol_ref "") (const_int 0))
+   (const_string "neon_fp_vadd_ddd_vabs_dd")
+   (const_string "neon_fp_vadd_qqq_vabs_qq"))
+ (const_string "neon_int_5")))]
+)
+
+(define_insn "neon_vabd_3"
+ [(set (match_operand:VDQ 0 "s_register_operand" "=w")
+   (abs:VDQ (unspec:VDQ [(match_operand:VDQ 1 "s_register_operand" "w")
+ (match_operand:VDQ 2 "s_register_operand" "w")]
+ UNSPEC_VSUB)))]
+ "TARGET_NEON && (! || flag_unsafe_math_optimizations)"
+ "vabd. %0, %1, %2"
+ [(set (attr "neon_type")
+   (if_then_else (ne (symbol_ref "") (const_int 0))
+ (if_then_else (ne (symbol_ref "") (const_int 0))
+   (const_string "neon_fp_vadd_ddd_vabs_dd")
+   (const_string "neon_fp_vadd_qqq_vabs_qq"))
+ (const_string "neon_int_5")))]
+)
diff --git a/gcc/testsuite/gcc.target/arm/neon-combine-sub-abs-into-vabd.c b/gcc/testsuite/gcc.target/arm/neon-combine-sub-abs-into-vabd.c
new file mode 100644
index 000..ad6ba75
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-combine-sub-abs-into-vabd.c
@@ -0,0 +1,50 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-options "-O2 -funsafe-math-optimizations" } */
+/* { dg-add-options arm_neon } */
+
+#include <arm_neon.h>
+float32x2_t f_sub_abs_to_vabd_32()
+{
+  float32x2_t val1 = vdup_n_f32 (10);
+  float32x2_t val2 = vdup_n_f32 (30);
+  float32x2_t sres = vsub_f32(val1, val2);
+  float32x2_t res = vabs_f32 (sres);
+
+  return res;
+}
+/* { dg-final { scan-assembler "vabd\.f32" } }*/
+
+#include <arm_neon.h>
+int8x8_t sub_abs_to_vabd_8()
+{
+  int8x8_t val1 = vdup_n_s8 (10);
+  int8x8_t val2 = vdup_n_s8 (30);
+  int8x8_t sres = vsub_s8(val1, val2);
+  int8x8_t res = vabs_s8 (sres);
+
+  return res;
+}
+/* { dg-final { scan-assembler "vabd\.s8" } }*/
+
+int16x4_t sub_abs_to_vabd_16()
+{
+  int16x4_t val1 = vdup_n_s16 (10);
+  int16x4_t val2 = vdup_n_s16 (30);
+  int16x4_t sres = vsub_s16(val1, val2);
+  int16x4_t res = vabs_s16 (sres);
+
+  return res;
+}
+/* { dg-final { scan-assembler "vabd\.s16" } }*/
+
+int32x2_t sub_abs_to_vabd_32()
+{
+  int32x2_t val1 = vdup_n_s32 (10);
+  int32x2_t val2 = vdup_n_s32 (30);
+  int32x2_t sres = vsub_s32(val1, val2);
+  int32x2_t res = vabs_s32 (sres);
+
+   return res;
+}
+/* { dg-final { scan-assembler "vabd\.s32" } }*/


[PATCH, ARM] Support NEON's VABD with combine pass

2011-07-29 Thread Dmitry Melnik
This patch adds two define_insn patterns for the NEON vabd instruction so
that the combine pass recognizes expressions matching (vabs (vsub ...)) as
vabd.
This patch reduces the code size of the x264 binary from 649143 to 648343
bytes (800 bytes, or 0.12%) and increases its performance on average by
2.5% on the plain C version of x264 with -O2 -ftree-vectorize.
On SPEC2K it didn't make any difference -- all vabs instructions found in
the SPEC2K binaries either use .f64 mode or scalar .f32, which are not
supported by NEON's vabd.
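For the non-intrinsics case, a hedged sketch of the kind of loop this can
catch once the vectorizer has produced an integer (abs (minus ...))
operation (the flags and the claim about the generated code are
assumptions, not verified output):

/* Built with something like -O2 -ftree-vectorize -mfpu=neon; after
   vectorization the loop body becomes (abs (minus ...)) on a vector mode,
   which the new pattern can match as a single vabd.s32.  */
void
abs_diff (const int *a, const int *b, int *out, int n)
{
  int i;
  for (i = 0; i < n; i++)
    out[i] = __builtin_abs (a[i] - b[i]);
}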

Regtested with QEMU.

Ok for trunk?


--
Best regards,
   Dmitry

2011-07-21  Sevak Sargsyan 

* config/arm/neon.md (neon_vabd_2, neon_vabd_3): New define_insn patterns for combine.

gcc/testsuite:

* gcc.target/arm/neon-combine-sub-abs-into-vabd.c: New test.

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index a8c1b87..f457365 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -5607,3 +5607,32 @@
   emit_insn (gen_neon_vec_pack_trunc_ (operands[0], tempreg));
   DONE;
 })
+
+(define_insn "neon_vabd_2"
+ [(set (match_operand:VDQ 0 "s_register_operand" "=w")
+   (abs:VDQ (minus:VDQ (match_operand:VDQ 1 "s_register_operand" "w")
+   (match_operand:VDQ 2 "s_register_operand" "w"]
+ "TARGET_NEON"
+ "vabd. %0, %1, %2"
+ [(set (attr "neon_type")
+   (if_then_else (ne (symbol_ref "") (const_int 0))
+ (if_then_else (ne (symbol_ref "") (const_int 0))
+   (const_string "neon_fp_vadd_ddd_vabs_dd")
+   (const_string "neon_fp_vadd_qqq_vabs_qq"))
+ (const_string "neon_int_5")))]
+)
+
+(define_insn "neon_vabd_3"
+ [(set (match_operand:VDQ 0 "s_register_operand" "=w")
+   (abs:VDQ (unspec:VDQ [(match_operand:VDQ 1 "s_register_operand" "w")
+ (match_operand:VDQ 2 "s_register_operand" "w")]
+ UNSPEC_VSUB)))]
+ "TARGET_NEON"
+ "vabd. %0, %1, %2"
+ [(set (attr "neon_type")
+   (if_then_else (ne (symbol_ref "") (const_int 0))
+ (if_then_else (ne (symbol_ref "") (const_int 0))
+   (const_string "neon_fp_vadd_ddd_vabs_dd")
+   (const_string "neon_fp_vadd_qqq_vabs_qq"))
+ (const_string "neon_int_5")))]
+)
diff --git a/gcc/testsuite/gcc.target/arm/neon-combine-sub-abs-into-vabd.c b/gcc/testsuite/gcc.target/arm/neon-combine-sub-abs-into-vabd.c
new file mode 100644
index 000..aae4117
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-combine-sub-abs-into-vabd.c
@@ -0,0 +1,54 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-options "-O2 -funsafe-math-optimizations" } */
+/* { dg-add-options arm_neon } */
+
+#include <arm_neon.h>
+float32x2_t f_sub_abs_to_vabd_32()
+{
+
+   float32x2_t val1 = vdup_n_f32 (10); 
+   float32x2_t val2 = vdup_n_f32 (30);
+   float32x2_t sres = vsub_f32(val1, val2);
+   float32x2_t res = vabs_f32 (sres); 
+
+   return res;
+}
+/* { dg-final { scan-assembler "vabd\.f32" } }*/
+
+#include <arm_neon.h>
+int8x8_t sub_abs_to_vabd_8()
+{
+   
+   int8x8_t val1 = vdup_n_s8 (10); 
+int8x8_t val2 = vdup_n_s8 (30);
+int8x8_t sres = vsub_s8(val1, val2);
+int8x8_t res = vabs_s8 (sres); 
+
+   return res;
+}
+/* { dg-final { scan-assembler "vabd\.s8" } }*/
+
+int16x4_t sub_abs_to_vabd_16()
+{
+   
+   int16x4_t val1 = vdup_n_s16 (10); 
+int16x4_t val2 = vdup_n_s16 (30);
+int16x4_t sres = vsub_s16(val1, val2);
+int16x4_t res = vabs_s16 (sres); 
+
+   return res;
+}
+/* { dg-final { scan-assembler "vabd\.s16" } }*/
+
+int32x2_t sub_abs_to_vabd_32()
+{
+
+int32x2_t val1 = vdup_n_s32 (10);
+int32x2_t val2 = vdup_n_s32 (30);
+int32x2_t sres = vsub_s32(val1, val2);
+int32x2_t res = vabs_s32 (sres);
+
+   return res;
+}
+/* { dg-final { scan-assembler "vabd\.s32" } }*/



[PATCH, ARM] Reload register class fix for NEON constants

2011-04-25 Thread Dmitry Melnik


Hi All,

The attached patch changes the reload class for NEON constant vectors 
from GENERAL_REGS to NO_REGS.

The issue was found on this code from libevas:

void
_op_blend_p_caa_dp(unsigned *s, unsigned *e, unsigned *d, unsigned c) {
   while (d < e) {
      *d = ( (((((*s) >> 8) & 0x00ff00ff) * (c)) & 0xff00ff00) +
             (((((*s) & 0x00ff00ff) * (c)) >> 8) & 0x00ff00ff) );

      //*d = (*s) & 0x00ff00ff;
      d++;
      s++;
   }
}

Original asm:

.L4:
        adr     r8, .L10
        ldmia   r8, {r8-fp}
        ...
        vmov    d22, r8, r9   @ v4si
        vmov    d23, sl, fp
        vand    q12, q8, q11
        ...
        bhi     .L4

.L10:
        .word   16711935      @ 0xff00ff
        .word   16711935
        .word   16711935
        .word   16711935

Fixed asm:

.L4:
        vmov.i16  q11, #255   @ v4si
        ...
        vand    q12, q8, q11
        bhi     .L4

This fix results in a +3.7% gain on the expedite (reduced) test suite, and
up to 15% on the affected tests.


Ok for trunk?


--
Best regards,
   Dmitry


2011-04-22  Sergey Grechanik  

	* config/arm/arm.c (coproc_secondary_reload_class): Treat constant
	vectors the same way as memory locations to prevent loading them 
	through the ARM general registers.

--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -9152,7 +9152,7 @@ coproc_secondary_reload_class (enum machine_mode mode, rtx x, bool wb)
   /* The neon move patterns handle all legitimate vector and struct
  addresses.  */
   if (TARGET_NEON
-  && MEM_P (x)
+  && (MEM_P (x) || GET_CODE (x) == CONST_VECTOR)
   && (GET_MODE_CLASS (mode) == MODE_VECTOR_INT
  || GET_MODE_CLASS (mode) == MODE_VECTOR_FLOAT
  || VALID_NEON_STRUCT_MODE (mode)))