Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing

2011-03-31 Thread Richard Earnshaw

On Thu, 2011-03-31 at 11:33 +0800, Chung-Lin Tang wrote:
 On 2011/3/30 05:28 PM, Richard Earnshaw wrote:
  
  On Wed, 2011-03-30 at 15:35 +0800, Chung-Lin Tang wrote:
  On 2011/3/30 上午 12:23, Richard Earnshaw wrote:
 
  On Tue, 2011-03-29 at 22:53 +0800, Chung-Lin Tang wrote:
  On 2011/3/29 下午 10:26, Richard Earnshaw wrote:
  On Tue, 2011-03-29 at 18:25 +0800, Chung-Lin Tang wrote:
  On 2011/3/24 06:51 PM, Richard Earnshaw wrote:
 
  On Thu, 2011-03-24 at 12:56 +0900, Chung-Lin Tang wrote:
  Hi,
  PR48250 happens under TARGET_NEON, where DImode is included within 
  the
  valid NEON modes. This turns the range of legitimate constant 
  indexes to
  step-4 (coproc load/store), thus arm_legitimize_reload_address() when
  trying to decompose the [reg+index] reload address into
  [(reg+index_high)+index_low], can cause an ICE later when 'index_low'
  part is not aligned to 4.
 
  I'm not sure why the current DImode index is computed as:
  low = ((val  0xf) ^ 0x8) - 0x8;  the sign-extending into negative
  values, then subtracting back, actually creates further off indexes.
  e.g. in the supplied testcase, [sp+13] was turned into [(sp+16)-3].
 
 
  Hysterical Raisins... the code there was clearly written for the days
  before we had LDRD in the architecture.  At that time the most 
  efficient
  way to load a 64-bit object was to use the LDM{ia,ib,da,db}
  instructions.  The computation here was (I think), intended to try and
  make the most efficient use of an add/sub instruction followed by
  LDM/STM offsetting.  At that time the architecture had no unaligned
  access either, so dealing with 64-bit that were less than 32-bit 
  aligned
  (in those days 32-bit was the maximum alignment) probably wasn't
  considered, or couldn't even get through to reload.
 
 
  I see it now. The code in output_move_double() returning assembly for
  ldm/stm(db/da/ib) for offsets -8/-4/+4 seems to confirm this.
 
  I have changed the patch to let the new code handle the TARGET_LDRD 
  case
  only.  The pre-LDRD case is still handled by the original code, with an
  additional  ~0x3 for aligning the offset to 4.
 
  I've also added a comment for the pre-TARGET_LDRD case. Please see if
  the description is accurate enough.
 
  My patch changes the index decomposing to a more straightforward 
  way; it
  also sort of outlines the way the other reload address indexes are
  broken by using and-masks, is not the most effective.  The address is
  computed by addition, subtracting away the parts to obtain low+high
  should be the optimal way of giving the largest computable index 
  range.
 
  I have included a few Thumb-2 bits in the patch; I know currently
  arm_legitimize_reload_address() is only used under TARGET_ARM, but I
  guess it might eventually be turned into TARGET_32BIT.
 
 
  I think this needs to be looked at carefully on ARMv4/ARMv4T to check
  that it doesn't cause regressions there when we don't have LDRD in the
  instruction set.
 
  I'll be testing the modified patch under an ARMv4/ARMv4T configuration.
  Okay for trunk if no regressions?
 
  Thanks,
  Chung-Lin
 
 PR target/48250
 * config/arm/arm.c (arm_legitimize_reload_address): Adjust
 DImode constant index decomposing under TARGET_LDRD. Clear
 lower two bits for NEON, Thumb-2, and !TARGET_LDRD. Add
 comment for !TARGET_LDRD case.
 
  This looks technically correct, but I can't help feeling that trying to
  deal with the bottom two bits being non-zero is not really worthwhile.
  This hook is an optimization that's intended to generate better code
  than the default mechanisms that reload provides.  It is allowed to
  fail, but it must say so if it does (by returning false).
 
  What this hook is trying to do for DImode is to take an address of the
  form (reg + TOO_BIG_CONST) and turn it into two instructions that are
  legitimate:
 
  tmp = (REG + LEGAL_BIG_CONST)
  some_use_of (mem (tmp + SMALL_LEGAL_CONST))
 
  The idea behind the optimization is that LEGAL_BIG_CONST will need fewer
  instructions to generate than TOO_BIG_CONST.  It's unlikely (probably
  impossible in ARM state) that pushing the bottom two bits of the address
  into LEGAL_BIG_CONST part of the offset is going to lead to a better
  code sequence.  
 
  So I think it would be more sensible to just return false if we have a
  DImode address with the bottom 2 bits non-zero and then let the normal
  reload recovery mechanism take over.
 
  It is supposed to provide better utilization of the full range of
  LEGAL_BIG_CONST+SMALL_LEGAL_CONST.  I am not sure, but I guess reload
  will resolve it with the reg+LEGAL_BIG_CONST part only, using only (mem
  (reg)) for the load/store (correct me if I'm wrong).
 
  Also, the new code slighty improves the reloading, for example currently
  [reg+64] is broken into [(reg+72)-8], creating an additional unneeded
  reload, which is certainly not good when we have ldrd/strd available.
 
 
  Sorry, didn't 

Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing

2011-03-31 Thread Chung-Lin Tang
On 2011/3/31 06:14 PM, Richard Earnshaw wrote:
 
 On Thu, 2011-03-31 at 11:33 +0800, Chung-Lin Tang wrote:
 On 2011/3/30 05:28 PM, Richard Earnshaw wrote:

 On Wed, 2011-03-30 at 15:35 +0800, Chung-Lin Tang wrote:
 On 2011/3/30 上午 12:23, Richard Earnshaw wrote:

 On Tue, 2011-03-29 at 22:53 +0800, Chung-Lin Tang wrote:
 On 2011/3/29 下午 10:26, Richard Earnshaw wrote:
 On Tue, 2011-03-29 at 18:25 +0800, Chung-Lin Tang wrote:
 On 2011/3/24 06:51 PM, Richard Earnshaw wrote:

 On Thu, 2011-03-24 at 12:56 +0900, Chung-Lin Tang wrote:
 Hi,
 PR48250 happens under TARGET_NEON, where DImode is included within 
 the
 valid NEON modes. This turns the range of legitimate constant 
 indexes to
 step-4 (coproc load/store), thus arm_legitimize_reload_address() when
 trying to decompose the [reg+index] reload address into
 [(reg+index_high)+index_low], can cause an ICE later when 'index_low'
 part is not aligned to 4.

 I'm not sure why the current DImode index is computed as:
 low = ((val  0xf) ^ 0x8) - 0x8;  the sign-extending into negative
 values, then subtracting back, actually creates further off indexes.
 e.g. in the supplied testcase, [sp+13] was turned into [(sp+16)-3].


 Hysterical Raisins... the code there was clearly written for the days
 before we had LDRD in the architecture.  At that time the most 
 efficient
 way to load a 64-bit object was to use the LDM{ia,ib,da,db}
 instructions.  The computation here was (I think), intended to try and
 make the most efficient use of an add/sub instruction followed by
 LDM/STM offsetting.  At that time the architecture had no unaligned
 access either, so dealing with 64-bit that were less than 32-bit 
 aligned
 (in those days 32-bit was the maximum alignment) probably wasn't
 considered, or couldn't even get through to reload.


 I see it now. The code in output_move_double() returning assembly for
 ldm/stm(db/da/ib) for offsets -8/-4/+4 seems to confirm this.

 I have changed the patch to let the new code handle the TARGET_LDRD 
 case
 only.  The pre-LDRD case is still handled by the original code, with an
 additional  ~0x3 for aligning the offset to 4.

 I've also added a comment for the pre-TARGET_LDRD case. Please see if
 the description is accurate enough.

 My patch changes the index decomposing to a more straightforward 
 way; it
 also sort of outlines the way the other reload address indexes are
 broken by using and-masks, is not the most effective.  The address is
 computed by addition, subtracting away the parts to obtain low+high
 should be the optimal way of giving the largest computable index 
 range.

 I have included a few Thumb-2 bits in the patch; I know currently
 arm_legitimize_reload_address() is only used under TARGET_ARM, but I
 guess it might eventually be turned into TARGET_32BIT.


 I think this needs to be looked at carefully on ARMv4/ARMv4T to check
 that it doesn't cause regressions there when we don't have LDRD in the
 instruction set.

 I'll be testing the modified patch under an ARMv4/ARMv4T configuration.
 Okay for trunk if no regressions?

 Thanks,
 Chung-Lin

PR target/48250
* config/arm/arm.c (arm_legitimize_reload_address): Adjust
DImode constant index decomposing under TARGET_LDRD. Clear
lower two bits for NEON, Thumb-2, and !TARGET_LDRD. Add
comment for !TARGET_LDRD case.

 This looks technically correct, but I can't help feeling that trying to
 deal with the bottom two bits being non-zero is not really worthwhile.
 This hook is an optimization that's intended to generate better code
 than the default mechanisms that reload provides.  It is allowed to
 fail, but it must say so if it does (by returning false).

 What this hook is trying to do for DImode is to take an address of the
 form (reg + TOO_BIG_CONST) and turn it into two instructions that are
 legitimate:

 tmp = (REG + LEGAL_BIG_CONST)
 some_use_of (mem (tmp + SMALL_LEGAL_CONST))

 The idea behind the optimization is that LEGAL_BIG_CONST will need fewer
 instructions to generate than TOO_BIG_CONST.  It's unlikely (probably
 impossible in ARM state) that pushing the bottom two bits of the address
 into LEGAL_BIG_CONST part of the offset is going to lead to a better
 code sequence.  

 So I think it would be more sensible to just return false if we have a
 DImode address with the bottom 2 bits non-zero and then let the normal
 reload recovery mechanism take over.

 It is supposed to provide better utilization of the full range of
 LEGAL_BIG_CONST+SMALL_LEGAL_CONST.  I am not sure, but I guess reload
 will resolve it with the reg+LEGAL_BIG_CONST part only, using only (mem
 (reg)) for the load/store (correct me if I'm wrong).

 Also, the new code slighty improves the reloading, for example currently
 [reg+64] is broken into [(reg+72)-8], creating an additional unneeded
 reload, which is certainly not good when we have ldrd/strd available.


 Sorry, didn't mean to suggest that we don't want to do something better
 for 

Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing

2011-03-31 Thread Richard Earnshaw

On Thu, 2011-03-31 at 22:18 +0800, Chung-Lin Tang wrote:
 On 2011/3/31 06:14 PM, Richard Earnshaw wrote:
  
  On Thu, 2011-03-31 at 11:33 +0800, Chung-Lin Tang wrote:
  On 2011/3/30 05:28 PM, Richard Earnshaw wrote:
 
  On Wed, 2011-03-30 at 15:35 +0800, Chung-Lin Tang wrote:
  On 2011/3/30 上午 12:23, Richard Earnshaw wrote:
 
  On Tue, 2011-03-29 at 22:53 +0800, Chung-Lin Tang wrote:
  On 2011/3/29 下午 10:26, Richard Earnshaw wrote:
  On Tue, 2011-03-29 at 18:25 +0800, Chung-Lin Tang wrote:
  On 2011/3/24 06:51 PM, Richard Earnshaw wrote:
 
  On Thu, 2011-03-24 at 12:56 +0900, Chung-Lin Tang wrote:
  Hi,
  PR48250 happens under TARGET_NEON, where DImode is included within 
  the
  valid NEON modes. This turns the range of legitimate constant 
  indexes to
  step-4 (coproc load/store), thus arm_legitimize_reload_address() 
  when
  trying to decompose the [reg+index] reload address into
  [(reg+index_high)+index_low], can cause an ICE later when 
  'index_low'
  part is not aligned to 4.
 
  I'm not sure why the current DImode index is computed as:
  low = ((val  0xf) ^ 0x8) - 0x8;  the sign-extending into negative
  values, then subtracting back, actually creates further off 
  indexes.
  e.g. in the supplied testcase, [sp+13] was turned into [(sp+16)-3].
 
 
  Hysterical Raisins... the code there was clearly written for the 
  days
  before we had LDRD in the architecture.  At that time the most 
  efficient
  way to load a 64-bit object was to use the LDM{ia,ib,da,db}
  instructions.  The computation here was (I think), intended to try 
  and
  make the most efficient use of an add/sub instruction followed by
  LDM/STM offsetting.  At that time the architecture had no unaligned
  access either, so dealing with 64-bit that were less than 32-bit 
  aligned
  (in those days 32-bit was the maximum alignment) probably wasn't
  considered, or couldn't even get through to reload.
 
 
  I see it now. The code in output_move_double() returning assembly for
  ldm/stm(db/da/ib) for offsets -8/-4/+4 seems to confirm this.
 
  I have changed the patch to let the new code handle the TARGET_LDRD 
  case
  only.  The pre-LDRD case is still handled by the original code, with 
  an
  additional  ~0x3 for aligning the offset to 4.
 
  I've also added a comment for the pre-TARGET_LDRD case. Please see if
  the description is accurate enough.
 
  My patch changes the index decomposing to a more straightforward 
  way; it
  also sort of outlines the way the other reload address indexes are
  broken by using and-masks, is not the most effective.  The address 
  is
  computed by addition, subtracting away the parts to obtain low+high
  should be the optimal way of giving the largest computable index 
  range.
 
  I have included a few Thumb-2 bits in the patch; I know currently
  arm_legitimize_reload_address() is only used under TARGET_ARM, but 
  I
  guess it might eventually be turned into TARGET_32BIT.
 
 
  I think this needs to be looked at carefully on ARMv4/ARMv4T to 
  check
  that it doesn't cause regressions there when we don't have LDRD in 
  the
  instruction set.
 
  I'll be testing the modified patch under an ARMv4/ARMv4T 
  configuration.
  Okay for trunk if no regressions?
 
  Thanks,
  Chung-Lin
 
   PR target/48250
   * config/arm/arm.c (arm_legitimize_reload_address): Adjust
   DImode constant index decomposing under TARGET_LDRD. Clear
   lower two bits for NEON, Thumb-2, and !TARGET_LDRD. Add
   comment for !TARGET_LDRD case.
 
  This looks technically correct, but I can't help feeling that trying 
  to
  deal with the bottom two bits being non-zero is not really worthwhile.
  This hook is an optimization that's intended to generate better code
  than the default mechanisms that reload provides.  It is allowed to
  fail, but it must say so if it does (by returning false).
 
  What this hook is trying to do for DImode is to take an address of the
  form (reg + TOO_BIG_CONST) and turn it into two instructions that are
  legitimate:
 
tmp = (REG + LEGAL_BIG_CONST)
some_use_of (mem (tmp + SMALL_LEGAL_CONST))
 
  The idea behind the optimization is that LEGAL_BIG_CONST will need 
  fewer
  instructions to generate than TOO_BIG_CONST.  It's unlikely (probably
  impossible in ARM state) that pushing the bottom two bits of the 
  address
  into LEGAL_BIG_CONST part of the offset is going to lead to a better
  code sequence.  
 
  So I think it would be more sensible to just return false if we have a
  DImode address with the bottom 2 bits non-zero and then let the normal
  reload recovery mechanism take over.
 
  It is supposed to provide better utilization of the full range of
  LEGAL_BIG_CONST+SMALL_LEGAL_CONST.  I am not sure, but I guess reload
  will resolve it with the reg+LEGAL_BIG_CONST part only, using only (mem
  (reg)) for the load/store (correct me if I'm wrong).
 
  Also, the new code slighty improves the reloading, for example 
  currently
  [reg+64] is broken into 

Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing

2011-03-31 Thread Chung-Lin Tang
On 2011/4/1 01:33 AM, Richard Earnshaw wrote:
 
 On Thu, 2011-03-31 at 22:18 +0800, Chung-Lin Tang wrote:
 On 2011/3/31 06:14 PM, Richard Earnshaw wrote:

 On Thu, 2011-03-31 at 11:33 +0800, Chung-Lin Tang wrote:
 On 2011/3/30 05:28 PM, Richard Earnshaw wrote:

 On Wed, 2011-03-30 at 15:35 +0800, Chung-Lin Tang wrote:
 On 2011/3/30 上午 12:23, Richard Earnshaw wrote:

 On Tue, 2011-03-29 at 22:53 +0800, Chung-Lin Tang wrote:
 On 2011/3/29 下午 10:26, Richard Earnshaw wrote:
 On Tue, 2011-03-29 at 18:25 +0800, Chung-Lin Tang wrote:
 On 2011/3/24 06:51 PM, Richard Earnshaw wrote:

 On Thu, 2011-03-24 at 12:56 +0900, Chung-Lin Tang wrote:
 Hi,
 PR48250 happens under TARGET_NEON, where DImode is included within 
 the
 valid NEON modes. This turns the range of legitimate constant 
 indexes to
 step-4 (coproc load/store), thus arm_legitimize_reload_address() 
 when
 trying to decompose the [reg+index] reload address into
 [(reg+index_high)+index_low], can cause an ICE later when 
 'index_low'
 part is not aligned to 4.

 I'm not sure why the current DImode index is computed as:
 low = ((val  0xf) ^ 0x8) - 0x8;  the sign-extending into negative
 values, then subtracting back, actually creates further off 
 indexes.
 e.g. in the supplied testcase, [sp+13] was turned into [(sp+16)-3].


 Hysterical Raisins... the code there was clearly written for the 
 days
 before we had LDRD in the architecture.  At that time the most 
 efficient
 way to load a 64-bit object was to use the LDM{ia,ib,da,db}
 instructions.  The computation here was (I think), intended to try 
 and
 make the most efficient use of an add/sub instruction followed by
 LDM/STM offsetting.  At that time the architecture had no unaligned
 access either, so dealing with 64-bit that were less than 32-bit 
 aligned
 (in those days 32-bit was the maximum alignment) probably wasn't
 considered, or couldn't even get through to reload.


 I see it now. The code in output_move_double() returning assembly for
 ldm/stm(db/da/ib) for offsets -8/-4/+4 seems to confirm this.

 I have changed the patch to let the new code handle the TARGET_LDRD 
 case
 only.  The pre-LDRD case is still handled by the original code, with 
 an
 additional  ~0x3 for aligning the offset to 4.

 I've also added a comment for the pre-TARGET_LDRD case. Please see if
 the description is accurate enough.

 My patch changes the index decomposing to a more straightforward 
 way; it
 also sort of outlines the way the other reload address indexes are
 broken by using and-masks, is not the most effective.  The address 
 is
 computed by addition, subtracting away the parts to obtain low+high
 should be the optimal way of giving the largest computable index 
 range.

 I have included a few Thumb-2 bits in the patch; I know currently
 arm_legitimize_reload_address() is only used under TARGET_ARM, but 
 I
 guess it might eventually be turned into TARGET_32BIT.


 I think this needs to be looked at carefully on ARMv4/ARMv4T to 
 check
 that it doesn't cause regressions there when we don't have LDRD in 
 the
 instruction set.

 I'll be testing the modified patch under an ARMv4/ARMv4T 
 configuration.
 Okay for trunk if no regressions?

 Thanks,
 Chung-Lin

  PR target/48250
  * config/arm/arm.c (arm_legitimize_reload_address): Adjust
  DImode constant index decomposing under TARGET_LDRD. Clear
  lower two bits for NEON, Thumb-2, and !TARGET_LDRD. Add
  comment for !TARGET_LDRD case.

 This looks technically correct, but I can't help feeling that trying 
 to
 deal with the bottom two bits being non-zero is not really worthwhile.
 This hook is an optimization that's intended to generate better code
 than the default mechanisms that reload provides.  It is allowed to
 fail, but it must say so if it does (by returning false).

 What this hook is trying to do for DImode is to take an address of the
 form (reg + TOO_BIG_CONST) and turn it into two instructions that are
 legitimate:

   tmp = (REG + LEGAL_BIG_CONST)
   some_use_of (mem (tmp + SMALL_LEGAL_CONST))

 The idea behind the optimization is that LEGAL_BIG_CONST will need 
 fewer
 instructions to generate than TOO_BIG_CONST.  It's unlikely (probably
 impossible in ARM state) that pushing the bottom two bits of the 
 address
 into LEGAL_BIG_CONST part of the offset is going to lead to a better
 code sequence.  

 So I think it would be more sensible to just return false if we have a
 DImode address with the bottom 2 bits non-zero and then let the normal
 reload recovery mechanism take over.

 It is supposed to provide better utilization of the full range of
 LEGAL_BIG_CONST+SMALL_LEGAL_CONST.  I am not sure, but I guess reload
 will resolve it with the reg+LEGAL_BIG_CONST part only, using only (mem
 (reg)) for the load/store (correct me if I'm wrong).

 Also, the new code slighty improves the reloading, for example 
 currently
 [reg+64] is broken into [(reg+72)-8], creating an additional unneeded
 reload, which is certainly not 

Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing

2011-03-30 Thread Chung-Lin Tang
On 2011/3/30 上午 12:23, Richard Earnshaw wrote:
 
 On Tue, 2011-03-29 at 22:53 +0800, Chung-Lin Tang wrote:
 On 2011/3/29 下午 10:26, Richard Earnshaw wrote:
 On Tue, 2011-03-29 at 18:25 +0800, Chung-Lin Tang wrote:
 On 2011/3/24 06:51 PM, Richard Earnshaw wrote:

 On Thu, 2011-03-24 at 12:56 +0900, Chung-Lin Tang wrote:
 Hi,
 PR48250 happens under TARGET_NEON, where DImode is included within the
 valid NEON modes. This turns the range of legitimate constant indexes to
 step-4 (coproc load/store), thus arm_legitimize_reload_address() when
 trying to decompose the [reg+index] reload address into
 [(reg+index_high)+index_low], can cause an ICE later when 'index_low'
 part is not aligned to 4.

 I'm not sure why the current DImode index is computed as:
 low = ((val  0xf) ^ 0x8) - 0x8;  the sign-extending into negative
 values, then subtracting back, actually creates further off indexes.
 e.g. in the supplied testcase, [sp+13] was turned into [(sp+16)-3].


 Hysterical Raisins... the code there was clearly written for the days
 before we had LDRD in the architecture.  At that time the most efficient
 way to load a 64-bit object was to use the LDM{ia,ib,da,db}
 instructions.  The computation here was (I think), intended to try and
 make the most efficient use of an add/sub instruction followed by
 LDM/STM offsetting.  At that time the architecture had no unaligned
 access either, so dealing with 64-bit that were less than 32-bit aligned
 (in those days 32-bit was the maximum alignment) probably wasn't
 considered, or couldn't even get through to reload.


 I see it now. The code in output_move_double() returning assembly for
 ldm/stm(db/da/ib) for offsets -8/-4/+4 seems to confirm this.

 I have changed the patch to let the new code handle the TARGET_LDRD case
 only.  The pre-LDRD case is still handled by the original code, with an
 additional  ~0x3 for aligning the offset to 4.

 I've also added a comment for the pre-TARGET_LDRD case. Please see if
 the description is accurate enough.

 My patch changes the index decomposing to a more straightforward way; it
 also sort of outlines the way the other reload address indexes are
 broken by using and-masks, is not the most effective.  The address is
 computed by addition, subtracting away the parts to obtain low+high
 should be the optimal way of giving the largest computable index range.

 I have included a few Thumb-2 bits in the patch; I know currently
 arm_legitimize_reload_address() is only used under TARGET_ARM, but I
 guess it might eventually be turned into TARGET_32BIT.


 I think this needs to be looked at carefully on ARMv4/ARMv4T to check
 that it doesn't cause regressions there when we don't have LDRD in the
 instruction set.

 I'll be testing the modified patch under an ARMv4/ARMv4T configuration.
 Okay for trunk if no regressions?

 Thanks,
 Chung-Lin

PR target/48250
* config/arm/arm.c (arm_legitimize_reload_address): Adjust
DImode constant index decomposing under TARGET_LDRD. Clear
lower two bits for NEON, Thumb-2, and !TARGET_LDRD. Add
comment for !TARGET_LDRD case.

 This looks technically correct, but I can't help feeling that trying to
 deal with the bottom two bits being non-zero is not really worthwhile.
 This hook is an optimization that's intended to generate better code
 than the default mechanisms that reload provides.  It is allowed to
 fail, but it must say so if it does (by returning false).

 What this hook is trying to do for DImode is to take an address of the
 form (reg + TOO_BIG_CONST) and turn it into two instructions that are
 legitimate:

 tmp = (REG + LEGAL_BIG_CONST)
 some_use_of (mem (tmp + SMALL_LEGAL_CONST))

 The idea behind the optimization is that LEGAL_BIG_CONST will need fewer
 instructions to generate than TOO_BIG_CONST.  It's unlikely (probably
 impossible in ARM state) that pushing the bottom two bits of the address
 into LEGAL_BIG_CONST part of the offset is going to lead to a better
 code sequence.  

 So I think it would be more sensible to just return false if we have a
 DImode address with the bottom 2 bits non-zero and then let the normal
 reload recovery mechanism take over.

 It is supposed to provide better utilization of the full range of
 LEGAL_BIG_CONST+SMALL_LEGAL_CONST.  I am not sure, but I guess reload
 will resolve it with the reg+LEGAL_BIG_CONST part only, using only (mem
 (reg)) for the load/store (correct me if I'm wrong).

 Also, the new code slighty improves the reloading, for example currently
 [reg+64] is broken into [(reg+72)-8], creating an additional unneeded
 reload, which is certainly not good when we have ldrd/strd available.

 
 Sorry, didn't mean to suggest that we don't want to do something better
 for addresses that are a multiple of 4, just that for addresses that
 aren't (at least) word-aligned that we should just bail as the code in
 that case won't benefit from the optimization.  So something like
 
if (mode == DImode || (mode == DFmode  

Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing

2011-03-30 Thread Richard Earnshaw

On Wed, 2011-03-30 at 15:35 +0800, Chung-Lin Tang wrote:
 On 2011/3/30 上午 12:23, Richard Earnshaw wrote:
  
  On Tue, 2011-03-29 at 22:53 +0800, Chung-Lin Tang wrote:
  On 2011/3/29 下午 10:26, Richard Earnshaw wrote:
  On Tue, 2011-03-29 at 18:25 +0800, Chung-Lin Tang wrote:
  On 2011/3/24 06:51 PM, Richard Earnshaw wrote:
 
  On Thu, 2011-03-24 at 12:56 +0900, Chung-Lin Tang wrote:
  Hi,
  PR48250 happens under TARGET_NEON, where DImode is included within the
  valid NEON modes. This turns the range of legitimate constant indexes 
  to
  step-4 (coproc load/store), thus arm_legitimize_reload_address() when
  trying to decompose the [reg+index] reload address into
  [(reg+index_high)+index_low], can cause an ICE later when 'index_low'
  part is not aligned to 4.
 
  I'm not sure why the current DImode index is computed as:
  low = ((val  0xf) ^ 0x8) - 0x8;  the sign-extending into negative
  values, then subtracting back, actually creates further off indexes.
  e.g. in the supplied testcase, [sp+13] was turned into [(sp+16)-3].
 
 
  Hysterical Raisins... the code there was clearly written for the days
  before we had LDRD in the architecture.  At that time the most efficient
  way to load a 64-bit object was to use the LDM{ia,ib,da,db}
  instructions.  The computation here was (I think), intended to try and
  make the most efficient use of an add/sub instruction followed by
  LDM/STM offsetting.  At that time the architecture had no unaligned
  access either, so dealing with 64-bit that were less than 32-bit aligned
  (in those days 32-bit was the maximum alignment) probably wasn't
  considered, or couldn't even get through to reload.
 
 
  I see it now. The code in output_move_double() returning assembly for
  ldm/stm(db/da/ib) for offsets -8/-4/+4 seems to confirm this.
 
  I have changed the patch to let the new code handle the TARGET_LDRD case
  only.  The pre-LDRD case is still handled by the original code, with an
  additional  ~0x3 for aligning the offset to 4.
 
  I've also added a comment for the pre-TARGET_LDRD case. Please see if
  the description is accurate enough.
 
  My patch changes the index decomposing to a more straightforward way; 
  it
  also sort of outlines the way the other reload address indexes are
  broken by using and-masks, is not the most effective.  The address is
  computed by addition, subtracting away the parts to obtain low+high
  should be the optimal way of giving the largest computable index range.
 
  I have included a few Thumb-2 bits in the patch; I know currently
  arm_legitimize_reload_address() is only used under TARGET_ARM, but I
  guess it might eventually be turned into TARGET_32BIT.
 
 
  I think this needs to be looked at carefully on ARMv4/ARMv4T to check
  that it doesn't cause regressions there when we don't have LDRD in the
  instruction set.
 
  I'll be testing the modified patch under an ARMv4/ARMv4T configuration.
  Okay for trunk if no regressions?
 
  Thanks,
  Chung-Lin
 
   PR target/48250
   * config/arm/arm.c (arm_legitimize_reload_address): Adjust
   DImode constant index decomposing under TARGET_LDRD. Clear
   lower two bits for NEON, Thumb-2, and !TARGET_LDRD. Add
   comment for !TARGET_LDRD case.
 
  This looks technically correct, but I can't help feeling that trying to
  deal with the bottom two bits being non-zero is not really worthwhile.
  This hook is an optimization that's intended to generate better code
  than the default mechanisms that reload provides.  It is allowed to
  fail, but it must say so if it does (by returning false).
 
  What this hook is trying to do for DImode is to take an address of the
  form (reg + TOO_BIG_CONST) and turn it into two instructions that are
  legitimate:
 
tmp = (REG + LEGAL_BIG_CONST)
some_use_of (mem (tmp + SMALL_LEGAL_CONST))
 
  The idea behind the optimization is that LEGAL_BIG_CONST will need fewer
  instructions to generate than TOO_BIG_CONST.  It's unlikely (probably
  impossible in ARM state) that pushing the bottom two bits of the address
  into LEGAL_BIG_CONST part of the offset is going to lead to a better
  code sequence.  
 
  So I think it would be more sensible to just return false if we have a
  DImode address with the bottom 2 bits non-zero and then let the normal
  reload recovery mechanism take over.
 
  It is supposed to provide better utilization of the full range of
  LEGAL_BIG_CONST+SMALL_LEGAL_CONST.  I am not sure, but I guess reload
  will resolve it with the reg+LEGAL_BIG_CONST part only, using only (mem
  (reg)) for the load/store (correct me if I'm wrong).
 
  Also, the new code slighty improves the reloading, for example currently
  [reg+64] is broken into [(reg+72)-8], creating an additional unneeded
  reload, which is certainly not good when we have ldrd/strd available.
 
  
  Sorry, didn't mean to suggest that we don't want to do something better
  for addresses that are a multiple of 4, just that for addresses that
  aren't (at least) word-aligned that 

Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing

2011-03-30 Thread Chung-Lin Tang
On 2011/3/30 05:28 PM, Richard Earnshaw wrote:
 
 On Wed, 2011-03-30 at 15:35 +0800, Chung-Lin Tang wrote:
 On 2011/3/30 上午 12:23, Richard Earnshaw wrote:

 On Tue, 2011-03-29 at 22:53 +0800, Chung-Lin Tang wrote:
 On 2011/3/29 下午 10:26, Richard Earnshaw wrote:
 On Tue, 2011-03-29 at 18:25 +0800, Chung-Lin Tang wrote:
 On 2011/3/24 06:51 PM, Richard Earnshaw wrote:

 On Thu, 2011-03-24 at 12:56 +0900, Chung-Lin Tang wrote:
 Hi,
 PR48250 happens under TARGET_NEON, where DImode is included within the
 valid NEON modes. This turns the range of legitimate constant indexes 
 to
 step-4 (coproc load/store), thus arm_legitimize_reload_address() when
 trying to decompose the [reg+index] reload address into
 [(reg+index_high)+index_low], can cause an ICE later when 'index_low'
 part is not aligned to 4.

 I'm not sure why the current DImode index is computed as:
 low = ((val  0xf) ^ 0x8) - 0x8;  the sign-extending into negative
 values, then subtracting back, actually creates further off indexes.
 e.g. in the supplied testcase, [sp+13] was turned into [(sp+16)-3].


 Hysterical Raisins... the code there was clearly written for the days
 before we had LDRD in the architecture.  At that time the most efficient
 way to load a 64-bit object was to use the LDM{ia,ib,da,db}
 instructions.  The computation here was (I think), intended to try and
 make the most efficient use of an add/sub instruction followed by
 LDM/STM offsetting.  At that time the architecture had no unaligned
 access either, so dealing with 64-bit that were less than 32-bit aligned
 (in those days 32-bit was the maximum alignment) probably wasn't
 considered, or couldn't even get through to reload.


 I see it now. The code in output_move_double() returning assembly for
 ldm/stm(db/da/ib) for offsets -8/-4/+4 seems to confirm this.

 I have changed the patch to let the new code handle the TARGET_LDRD case
 only.  The pre-LDRD case is still handled by the original code, with an
 additional  ~0x3 for aligning the offset to 4.

 I've also added a comment for the pre-TARGET_LDRD case. Please see if
 the description is accurate enough.

 My patch changes the index decomposing to a more straightforward way; 
 it
 also sort of outlines the way the other reload address indexes are
 broken by using and-masks, is not the most effective.  The address is
 computed by addition, subtracting away the parts to obtain low+high
 should be the optimal way of giving the largest computable index range.

 I have included a few Thumb-2 bits in the patch; I know currently
 arm_legitimize_reload_address() is only used under TARGET_ARM, but I
 guess it might eventually be turned into TARGET_32BIT.


 I think this needs to be looked at carefully on ARMv4/ARMv4T to check
 that it doesn't cause regressions there when we don't have LDRD in the
 instruction set.

 I'll be testing the modified patch under an ARMv4/ARMv4T configuration.
 Okay for trunk if no regressions?

 Thanks,
 Chung-Lin

  PR target/48250
  * config/arm/arm.c (arm_legitimize_reload_address): Adjust
  DImode constant index decomposing under TARGET_LDRD. Clear
  lower two bits for NEON, Thumb-2, and !TARGET_LDRD. Add
  comment for !TARGET_LDRD case.

 This looks technically correct, but I can't help feeling that trying to
 deal with the bottom two bits being non-zero is not really worthwhile.
 This hook is an optimization that's intended to generate better code
 than the default mechanisms that reload provides.  It is allowed to
 fail, but it must say so if it does (by returning false).

 What this hook is trying to do for DImode is to take an address of the
 form (reg + TOO_BIG_CONST) and turn it into two instructions that are
 legitimate:

   tmp = (REG + LEGAL_BIG_CONST)
   some_use_of (mem (tmp + SMALL_LEGAL_CONST))

 The idea behind the optimization is that LEGAL_BIG_CONST will need fewer
 instructions to generate than TOO_BIG_CONST.  It's unlikely (probably
 impossible in ARM state) that pushing the bottom two bits of the address
 into LEGAL_BIG_CONST part of the offset is going to lead to a better
 code sequence.  

 So I think it would be more sensible to just return false if we have a
 DImode address with the bottom 2 bits non-zero and then let the normal
 reload recovery mechanism take over.

 It is supposed to provide better utilization of the full range of
 LEGAL_BIG_CONST+SMALL_LEGAL_CONST.  I am not sure, but I guess reload
 will resolve it with the reg+LEGAL_BIG_CONST part only, using only (mem
 (reg)) for the load/store (correct me if I'm wrong).

 Also, the new code slighty improves the reloading, for example currently
 [reg+64] is broken into [(reg+72)-8], creating an additional unneeded
 reload, which is certainly not good when we have ldrd/strd available.


 Sorry, didn't mean to suggest that we don't want to do something better
 for addresses that are a multiple of 4, just that for addresses that
 aren't (at least) word-aligned that we should just bail as the code in
 that case won't benefit 

Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing

2011-03-29 Thread Chung-Lin Tang
On 2011/3/24 06:51 PM, Richard Earnshaw wrote:
 
 On Thu, 2011-03-24 at 12:56 +0900, Chung-Lin Tang wrote:
 Hi,
 PR48250 happens under TARGET_NEON, where DImode is included within the
 valid NEON modes. This turns the range of legitimate constant indexes to
 step-4 (coproc load/store), thus arm_legitimize_reload_address() when
 trying to decompose the [reg+index] reload address into
 [(reg+index_high)+index_low], can cause an ICE later when 'index_low'
 part is not aligned to 4.

 I'm not sure why the current DImode index is computed as:
 low = ((val  0xf) ^ 0x8) - 0x8;  the sign-extending into negative
 values, then subtracting back, actually creates further off indexes.
 e.g. in the supplied testcase, [sp+13] was turned into [(sp+16)-3].

 
 Hysterical Raisins... the code there was clearly written for the days
 before we had LDRD in the architecture.  At that time the most efficient
 way to load a 64-bit object was to use the LDM{ia,ib,da,db}
 instructions.  The computation here was (I think), intended to try and
 make the most efficient use of an add/sub instruction followed by
 LDM/STM offsetting.  At that time the architecture had no unaligned
 access either, so dealing with 64-bit that were less than 32-bit aligned
 (in those days 32-bit was the maximum alignment) probably wasn't
 considered, or couldn't even get through to reload.


I see it now. The code in output_move_double() returning assembly for
ldm/stm(db/da/ib) for offsets -8/-4/+4 seems to confirm this.

I have changed the patch to let the new code handle the TARGET_LDRD case
only.  The pre-LDRD case is still handled by the original code, with an
additional  ~0x3 for aligning the offset to 4.

I've also added a comment for the pre-TARGET_LDRD case. Please see if
the description is accurate enough.

 My patch changes the index decomposing to a more straightforward way; it
 also sort of outlines the way the other reload address indexes are
 broken by using and-masks, is not the most effective.  The address is
 computed by addition, subtracting away the parts to obtain low+high
 should be the optimal way of giving the largest computable index range.

 I have included a few Thumb-2 bits in the patch; I know currently
 arm_legitimize_reload_address() is only used under TARGET_ARM, but I
 guess it might eventually be turned into TARGET_32BIT.

 
 I think this needs to be looked at carefully on ARMv4/ARMv4T to check
 that it doesn't cause regressions there when we don't have LDRD in the
 instruction set.

I'll be testing the modified patch under an ARMv4/ARMv4T configuration.
Okay for trunk if no regressions?

Thanks,
Chung-Lin

PR target/48250
* config/arm/arm.c (arm_legitimize_reload_address): Adjust
DImode constant index decomposing under TARGET_LDRD. Clear
lower two bits for NEON, Thumb-2, and !TARGET_LDRD. Add
comment for !TARGET_LDRD case.
Index: config/arm/arm.c
===
--- config/arm/arm.c(revision 171652)
+++ config/arm/arm.c(working copy)
@@ -6420,7 +6420,32 @@
   HOST_WIDE_INT low, high;
 
   if (mode == DImode || (mode == DFmode  TARGET_SOFT_FLOAT))
-   low = ((val  0xf) ^ 0x8) - 0x8;
+   {
+ if (TARGET_LDRD)
+   {
+ /* ??? There may be more adjustments later for Thumb-2,
+which has a ldrd insn with +-1020 index range.  */
+ int max_idx = 255;
+
+ /* low == val, if val is within range [-max_idx, +max_idx].
+If not, val is set to the boundary +-max_idx.  */
+ low = (-max_idx = val  val = max_idx
+? val : (val  0 ? max_idx : -max_idx));
+
+ /* Thumb-2 ldrd, and NEON coprocessor load/store indexes
+are in steps of 4, so the least two bits need to be
+cleared to zero.  */
+ if (TARGET_NEON || TARGET_THUMB2)
+   low = ~0x3;
+   }
+ else
+   {
+ /* For pre-ARMv5TE (without ldrd), we use ldm/stm(db/da/ib)
+to access doublewords. The supported load/store offsets are
+-8, -4, and 4, which we try to produce here.  */
+ low = (((val  0xf) ^ 0x8) - 0x8)  ~0x3;
+   }
+   }
   else if (TARGET_MAVERICK  TARGET_HARD_FLOAT)
/* Need to be careful, -256 is not a valid offset.  */
low = val = 0 ? (val  0xff) : -((-val)  0xff);


Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing

2011-03-29 Thread Richard Earnshaw
On Tue, 2011-03-29 at 18:25 +0800, Chung-Lin Tang wrote:
 On 2011/3/24 06:51 PM, Richard Earnshaw wrote:
  
  On Thu, 2011-03-24 at 12:56 +0900, Chung-Lin Tang wrote:
  Hi,
  PR48250 happens under TARGET_NEON, where DImode is included within the
  valid NEON modes. This turns the range of legitimate constant indexes to
  step-4 (coproc load/store), thus arm_legitimize_reload_address() when
  trying to decompose the [reg+index] reload address into
  [(reg+index_high)+index_low], can cause an ICE later when 'index_low'
  part is not aligned to 4.
 
  I'm not sure why the current DImode index is computed as:
  low = ((val  0xf) ^ 0x8) - 0x8;  the sign-extending into negative
  values, then subtracting back, actually creates further off indexes.
  e.g. in the supplied testcase, [sp+13] was turned into [(sp+16)-3].
 
  
  Hysterical Raisins... the code there was clearly written for the days
  before we had LDRD in the architecture.  At that time the most efficient
  way to load a 64-bit object was to use the LDM{ia,ib,da,db}
  instructions.  The computation here was (I think), intended to try and
  make the most efficient use of an add/sub instruction followed by
  LDM/STM offsetting.  At that time the architecture had no unaligned
  access either, so dealing with 64-bit that were less than 32-bit aligned
  (in those days 32-bit was the maximum alignment) probably wasn't
  considered, or couldn't even get through to reload.
 
 
 I see it now. The code in output_move_double() returning assembly for
 ldm/stm(db/da/ib) for offsets -8/-4/+4 seems to confirm this.
 
 I have changed the patch to let the new code handle the TARGET_LDRD case
 only.  The pre-LDRD case is still handled by the original code, with an
 additional  ~0x3 for aligning the offset to 4.
 
 I've also added a comment for the pre-TARGET_LDRD case. Please see if
 the description is accurate enough.
 
  My patch changes the index decomposing to a more straightforward way; it
  also sort of outlines the way the other reload address indexes are
  broken by using and-masks, is not the most effective.  The address is
  computed by addition, subtracting away the parts to obtain low+high
  should be the optimal way of giving the largest computable index range.
 
  I have included a few Thumb-2 bits in the patch; I know currently
  arm_legitimize_reload_address() is only used under TARGET_ARM, but I
  guess it might eventually be turned into TARGET_32BIT.
 
  
  I think this needs to be looked at carefully on ARMv4/ARMv4T to check
  that it doesn't cause regressions there when we don't have LDRD in the
  instruction set.
 
 I'll be testing the modified patch under an ARMv4/ARMv4T configuration.
 Okay for trunk if no regressions?
 
 Thanks,
 Chung-Lin
 
   PR target/48250
   * config/arm/arm.c (arm_legitimize_reload_address): Adjust
   DImode constant index decomposing under TARGET_LDRD. Clear
   lower two bits for NEON, Thumb-2, and !TARGET_LDRD. Add
   comment for !TARGET_LDRD case.

This looks technically correct, but I can't help feeling that trying to
deal with the bottom two bits being non-zero is not really worthwhile.
This hook is an optimization that's intended to generate better code
than the default mechanisms that reload provides.  It is allowed to
fail, but it must say so if it does (by returning false).

What this hook is trying to do for DImode is to take an address of the
form (reg + TOO_BIG_CONST) and turn it into two instructions that are
legitimate:

tmp = (REG + LEGAL_BIG_CONST)
some_use_of (mem (tmp + SMALL_LEGAL_CONST))

The idea behind the optimization is that LEGAL_BIG_CONST will need fewer
instructions to generate than TOO_BIG_CONST.  It's unlikely (probably
impossible in ARM state) that pushing the bottom two bits of the address
into LEGAL_BIG_CONST part of the offset is going to lead to a better
code sequence.  

So I think it would be more sensible to just return false if we have a
DImode address with the bottom 2 bits non-zero and then let the normal
reload recovery mechanism take over.




Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing

2011-03-29 Thread Chung-Lin Tang
On 2011/3/29 下午 10:26, Richard Earnshaw wrote:
 On Tue, 2011-03-29 at 18:25 +0800, Chung-Lin Tang wrote:
 On 2011/3/24 06:51 PM, Richard Earnshaw wrote:

 On Thu, 2011-03-24 at 12:56 +0900, Chung-Lin Tang wrote:
 Hi,
 PR48250 happens under TARGET_NEON, where DImode is included within the
 valid NEON modes. This turns the range of legitimate constant indexes to
 step-4 (coproc load/store), thus arm_legitimize_reload_address() when
 trying to decompose the [reg+index] reload address into
 [(reg+index_high)+index_low], can cause an ICE later when 'index_low'
 part is not aligned to 4.

 I'm not sure why the current DImode index is computed as:
 low = ((val  0xf) ^ 0x8) - 0x8;  the sign-extending into negative
 values, then subtracting back, actually creates further off indexes.
 e.g. in the supplied testcase, [sp+13] was turned into [(sp+16)-3].


 Hysterical Raisins... the code there was clearly written for the days
 before we had LDRD in the architecture.  At that time the most efficient
 way to load a 64-bit object was to use the LDM{ia,ib,da,db}
 instructions.  The computation here was (I think), intended to try and
 make the most efficient use of an add/sub instruction followed by
 LDM/STM offsetting.  At that time the architecture had no unaligned
 access either, so dealing with 64-bit that were less than 32-bit aligned
 (in those days 32-bit was the maximum alignment) probably wasn't
 considered, or couldn't even get through to reload.


 I see it now. The code in output_move_double() returning assembly for
 ldm/stm(db/da/ib) for offsets -8/-4/+4 seems to confirm this.

 I have changed the patch to let the new code handle the TARGET_LDRD case
 only.  The pre-LDRD case is still handled by the original code, with an
 additional  ~0x3 for aligning the offset to 4.

 I've also added a comment for the pre-TARGET_LDRD case. Please see if
 the description is accurate enough.

 My patch changes the index decomposing to a more straightforward way; it
 also sort of outlines the way the other reload address indexes are
 broken by using and-masks, is not the most effective.  The address is
 computed by addition, subtracting away the parts to obtain low+high
 should be the optimal way of giving the largest computable index range.

 I have included a few Thumb-2 bits in the patch; I know currently
 arm_legitimize_reload_address() is only used under TARGET_ARM, but I
 guess it might eventually be turned into TARGET_32BIT.


 I think this needs to be looked at carefully on ARMv4/ARMv4T to check
 that it doesn't cause regressions there when we don't have LDRD in the
 instruction set.

 I'll be testing the modified patch under an ARMv4/ARMv4T configuration.
 Okay for trunk if no regressions?

 Thanks,
 Chung-Lin

  PR target/48250
  * config/arm/arm.c (arm_legitimize_reload_address): Adjust
  DImode constant index decomposing under TARGET_LDRD. Clear
  lower two bits for NEON, Thumb-2, and !TARGET_LDRD. Add
  comment for !TARGET_LDRD case.
 
 This looks technically correct, but I can't help feeling that trying to
 deal with the bottom two bits being non-zero is not really worthwhile.
 This hook is an optimization that's intended to generate better code
 than the default mechanisms that reload provides.  It is allowed to
 fail, but it must say so if it does (by returning false).
 
 What this hook is trying to do for DImode is to take an address of the
 form (reg + TOO_BIG_CONST) and turn it into two instructions that are
 legitimate:
 
   tmp = (REG + LEGAL_BIG_CONST)
   some_use_of (mem (tmp + SMALL_LEGAL_CONST))
 
 The idea behind the optimization is that LEGAL_BIG_CONST will need fewer
 instructions to generate than TOO_BIG_CONST.  It's unlikely (probably
 impossible in ARM state) that pushing the bottom two bits of the address
 into LEGAL_BIG_CONST part of the offset is going to lead to a better
 code sequence.  
 
 So I think it would be more sensible to just return false if we have a
 DImode address with the bottom 2 bits non-zero and then let the normal
 reload recovery mechanism take over.

It is supposed to provide better utilization of the full range of
LEGAL_BIG_CONST+SMALL_LEGAL_CONST.  I am not sure, but I guess reload
will resolve it with the reg+LEGAL_BIG_CONST part only, using only (mem
(reg)) for the load/store (correct me if I'm wrong).

Also, the new code slighty improves the reloading, for example currently
[reg+64] is broken into [(reg+72)-8], creating an additional unneeded
reload, which is certainly not good when we have ldrd/strd available.

Thanks,
Chung-Lin


Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing

2011-03-24 Thread Richard Earnshaw

On Thu, 2011-03-24 at 12:56 +0900, Chung-Lin Tang wrote:
 Hi,
 PR48250 happens under TARGET_NEON, where DImode is included within the
 valid NEON modes. This turns the range of legitimate constant indexes to
 step-4 (coproc load/store), thus arm_legitimize_reload_address() when
 trying to decompose the [reg+index] reload address into
 [(reg+index_high)+index_low], can cause an ICE later when 'index_low'
 part is not aligned to 4.
 
 I'm not sure why the current DImode index is computed as:
 low = ((val  0xf) ^ 0x8) - 0x8;  the sign-extending into negative
 values, then subtracting back, actually creates further off indexes.
 e.g. in the supplied testcase, [sp+13] was turned into [(sp+16)-3].
 

Hysterical Raisins... the code there was clearly written for the days
before we had LDRD in the architecture.  At that time the most efficient
way to load a 64-bit object was to use the LDM{ia,ib,da,db}
instructions.  The computation here was (I think), intended to try and
make the most efficient use of an add/sub instruction followed by
LDM/STM offsetting.  At that time the architecture had no unaligned
access either, so dealing with 64-bit that were less than 32-bit aligned
(in those days 32-bit was the maximum alignment) probably wasn't
considered, or couldn't even get through to reload.

 My patch changes the index decomposing to a more straightforward way; it
 also sort of outlines the way the other reload address indexes are
 broken by using and-masks, is not the most effective.  The address is
 computed by addition, subtracting away the parts to obtain low+high
 should be the optimal way of giving the largest computable index range.
 
 I have included a few Thumb-2 bits in the patch; I know currently
 arm_legitimize_reload_address() is only used under TARGET_ARM, but I
 guess it might eventually be turned into TARGET_32BIT.
 

I think this needs to be looked at carefully on ARMv4/ARMv4T to check
that it doesn't cause regressions there when we don't have LDRD in the
instruction set.

 Cross-tested on QEMU without regressions, is this okay?
 
 Thanks,
 Chung-Lin
 
 2011-03-24  Chung-Lin Tang  clt...@codesourcery.com
 
   PR target/48250
   * config/arm/arm.c (arm_legitimize_reload_address): Adjust
   DImode constant index decomposing. Mask out lower 2-bits for
   NEON and Thumb-2.
 
   testsuite/
   * gcc.target/arm/pr48250.c: New.

R.