Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing

On Thu, 2011-03-24 at 12:56 +0900, Chung-Lin Tang wrote:

Hi, PR48250 happens under TARGET_NEON, where DImode is included in the valid NEON modes. This restricts the range of legitimate constant indexes to multiples of 4 (coproc load/store), so when arm_legitimize_reload_address() tries to decompose the [reg+index] reload address into [(reg+index_high)+index_low], it can cause an ICE later when the 'index_low' part is not aligned to 4.

I'm not sure why the current DImode index is computed as:

low = ((val & 0xf) ^ 0x8) - 0x8;

Sign-extending into negative values and then subtracting back actually creates indexes that are further off; e.g. in the supplied testcase, [sp+13] was turned into [(sp+16)-3].

On 2011/3/24 06:51 PM, Richard Earnshaw replied:

Hysterical Raisins... the code there was clearly written in the days before we had LDRD in the architecture. At that time the most efficient way to load a 64-bit object was to use the LDM{ia,ib,da,db} instructions. The computation here was (I think) intended to make the most efficient use of an add/sub instruction followed by LDM/STM offsetting. The architecture had no unaligned access either at that time, so 64-bit objects that were less than 32-bit aligned (in those days 32-bit was the maximum alignment) probably weren't considered, or couldn't even get through to reload.

[Chung-Lin Tang:]

I see it now. The code in output_move_double() returning assembly for ldm/stm(db/da/ib) for offsets -8/-4/+4 seems to confirm this. I have changed the patch to let the new code handle the TARGET_LDRD case only. The pre-LDRD case is still handled by the original code, with an additional ~0x3 mask to align the offset to 4. I've also added a comment for the pre-TARGET_LDRD case; please see if the description is accurate enough.

My patch changes the index decomposition to a more straightforward form. It also suggests that the way the other reload address indexes are decomposed, using AND masks, is not the most effective: since the address is computed by addition, subtracting away the low part to obtain low+high should give the largest computable index range. I have included a few Thumb-2 bits in the patch; I know arm_legitimize_reload_address() is currently only used under TARGET_ARM, but I guess it might eventually be extended to TARGET_32BIT.

[Richard Earnshaw:]

I think this needs to be looked at carefully on ARMv4/ARMv4T to check that it doesn't cause regressions there, where we don't have LDRD in the instruction set.

[Chung-Lin Tang:]

I'll be testing the modified patch under an ARMv4/ARMv4T configuration. Okay for trunk if no regressions?

Thanks, Chung-Lin

PR target/48250
* config/arm/arm.c (arm_legitimize_reload_address): Adjust DImode constant index decomposing under TARGET_LDRD. Clear lower two bits for NEON, Thumb-2, and !TARGET_LDRD. Add comment for !TARGET_LDRD case.

[Richard Earnshaw:]

This looks technically correct, but I can't help feeling that trying to deal with the bottom two bits being non-zero is not really worthwhile. This hook is an optimization that's intended to generate better code than the default mechanisms that reload provides. It is allowed to fail, but it must say so if it does (by returning false). What this hook is trying to do for DImode is take an address of the form (reg + TOO_BIG_CONST) and turn it into two instructions that are legitimate:

tmp = REG + LEGAL_BIG_CONST
some_use_of (mem (tmp + SMALL_LEGAL_CONST))

The idea behind the optimization is that LEGAL_BIG_CONST will need fewer instructions to generate than TOO_BIG_CONST. It's unlikely (probably impossible in ARM state) that pushing the bottom two bits of the address into the LEGAL_BIG_CONST part of the offset will lead to a better code sequence. So I think it would be more sensible to just return false if we have a DImode address with the bottom two bits non-zero, and then let the normal reload recovery mechanism take over.

[Chung-Lin Tang:]

It is supposed to provide better utilization of the full range of LEGAL_BIG_CONST+SMALL_LEGAL_CONST. I am not sure, but I guess reload will resolve it with the reg+LEGAL_BIG_CONST part only, using only (mem (reg)) for the load/store (correct me if I'm wrong). Also, the new code slightly improves the reloading: for example, currently [reg+64] is broken into [(reg+72)-8], creating an additional unneeded reload, which is certainly not good when we have ldrd/strd available.
Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing

[Richard Earnshaw, replying; quoted thread snipped:]

Sorry, didn't mean to suggest that we don't want to do something better for addresses that are a multiple of 4, just that for addresses that aren't (at least) word-aligned we should just bail, as the code in that case won't benefit from the optimization. So something like

if (mode == DImode || (mode == DFmode
Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
On 2011/3/30 05:28 PM, Richard Earnshaw wrote: On Wed, 2011-03-30 at 15:35 +0800, Chung-Lin Tang wrote: On 2011/3/30 上午 12:23, Richard Earnshaw wrote: On Tue, 2011-03-29 at 22:53 +0800, Chung-Lin Tang wrote: On 2011/3/29 下午 10:26, Richard Earnshaw wrote: On Tue, 2011-03-29 at 18:25 +0800, Chung-Lin Tang wrote: On 2011/3/24 06:51 PM, Richard Earnshaw wrote: On Thu, 2011-03-24 at 12:56 +0900, Chung-Lin Tang wrote: Hi, PR48250 happens under TARGET_NEON, where DImode is included within the valid NEON modes. This turns the range of legitimate constant indexes to step-4 (coproc load/store), thus arm_legitimize_reload_address() when trying to decompose the [reg+index] reload address into [(reg+index_high)+index_low], can cause an ICE later when 'index_low' part is not aligned to 4. I'm not sure why the current DImode index is computed as: low = ((val 0xf) ^ 0x8) - 0x8; the sign-extending into negative values, then subtracting back, actually creates further off indexes. e.g. in the supplied testcase, [sp+13] was turned into [(sp+16)-3]. Hysterical Raisins... the code there was clearly written for the days before we had LDRD in the architecture. At that time the most efficient way to load a 64-bit object was to use the LDM{ia,ib,da,db} instructions. The computation here was (I think), intended to try and make the most efficient use of an add/sub instruction followed by LDM/STM offsetting. At that time the architecture had no unaligned access either, so dealing with 64-bit that were less than 32-bit aligned (in those days 32-bit was the maximum alignment) probably wasn't considered, or couldn't even get through to reload. I see it now. The code in output_move_double() returning assembly for ldm/stm(db/da/ib) for offsets -8/-4/+4 seems to confirm this. I have changed the patch to let the new code handle the TARGET_LDRD case only. The pre-LDRD case is still handled by the original code, with an additional ~0x3 for aligning the offset to 4. 
I've also added a comment for the pre-TARGET_LDRD case. Please see if the description is accurate enough. My patch changes the index decomposing to a more straightforward way; it also sort of outlines the way the other reload address indexes are broken by using and-masks, is not the most effective. The address is computed by addition, subtracting away the parts to obtain low+high should be the optimal way of giving the largest computable index range. I have included a few Thumb-2 bits in the patch; I know currently arm_legitimize_reload_address() is only used under TARGET_ARM, but I guess it might eventually be turned into TARGET_32BIT. I think this needs to be looked at carefully on ARMv4/ARMv4T to check that it doesn't cause regressions there when we don't have LDRD in the instruction set. I'll be testing the modified patch under an ARMv4/ARMv4T configuration. Okay for trunk if no regressions? Thanks, Chung-Lin PR target/48250 * config/arm/arm.c (arm_legitimize_reload_address): Adjust DImode constant index decomposing under TARGET_LDRD. Clear lower two bits for NEON, Thumb-2, and !TARGET_LDRD. Add comment for !TARGET_LDRD case. This looks technically correct, but I can't help feeling that trying to deal with the bottom two bits being non-zero is not really worthwhile. This hook is an optimization that's intended to generate better code than the default mechanisms that reload provides. It is allowed to fail, but it must say so if it does (by returning false). What this hook is trying to do for DImode is to take an address of the form (reg + TOO_BIG_CONST) and turn it into two instructions that are legitimate: tmp = (REG + LEGAL_BIG_CONST) some_use_of (mem (tmp + SMALL_LEGAL_CONST)) The idea behind the optimization is that LEGAL_BIG_CONST will need fewer instructions to generate than TOO_BIG_CONST. 
It's unlikely (probably impossible in ARM state) that pushing the bottom two bits of the address into the LEGAL_BIG_CONST part of the offset is going to lead to a better code sequence. So I think it would be more sensible to just return false if we have a DImode address with the bottom two bits non-zero, and then let the normal reload recovery mechanism take over.

It is supposed to provide better utilization of the full range of LEGAL_BIG_CONST+SMALL_LEGAL_CONST. I am not sure, but I guess reload will resolve it with the reg+LEGAL_BIG_CONST part only, using only (mem (reg)) for the load/store (correct me if I'm wrong). Also, the new code slightly improves the reloading: for example, [reg+64] is currently broken into [(reg+72)-8], creating an additional unneeded reload, which is certainly not good when we have ldrd/strd available.

Sorry, didn't mean to suggest that we don't want to do something better for addresses that are a multiple of 4, just that for addresses that aren't (at least) word-aligned we should just bail out, as the code in that case won't benefit.
Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
I'll be testing the modified patch under an ARMv4/ARMv4T configuration. Okay for trunk if no regressions? Thanks, Chung-Lin

	PR target/48250
	* config/arm/arm.c (arm_legitimize_reload_address): Adjust
	DImode constant index decomposing under TARGET_LDRD.  Clear
	lower two bits for NEON, Thumb-2, and !TARGET_LDRD.  Add
	comment for !TARGET_LDRD case.

Index: config/arm/arm.c
===================================================================
--- config/arm/arm.c	(revision 171652)
+++ config/arm/arm.c	(working copy)
@@ -6420,7 +6420,32 @@
       HOST_WIDE_INT low, high;
 
       if (mode == DImode || (mode == DFmode && TARGET_SOFT_FLOAT))
-	low = ((val & 0xf) ^ 0x8) - 0x8;
+	{
+	  if (TARGET_LDRD)
+	    {
+	      /* ??? There may be more adjustments later for Thumb-2,
+		 which has a ldrd insn with +-1020 index range.  */
+	      int max_idx = 255;
+
+	      /* low == val, if val is within range [-max_idx, +max_idx].
+		 If not, val is set to the boundary +-max_idx.  */
+	      low = (-max_idx <= val && val <= max_idx
+		     ? val : (val > 0 ? max_idx : -max_idx));
+
+	      /* Thumb-2 ldrd, and NEON coprocessor load/store indexes
+		 are in steps of 4, so the least two bits need to be
+		 cleared to zero.  */
+	      if (TARGET_NEON || TARGET_THUMB2)
+		low &= ~0x3;
+	    }
+	  else
+	    {
+	      /* For pre-ARMv5TE (without ldrd), we use ldm/stm(db/da/ib)
+		 to access doublewords.  The supported load/store offsets
+		 are -8, -4, and 4, which we try to produce here.  */
+	      low = (((val & 0xf) ^ 0x8) - 0x8) & ~0x3;
+	    }
+	}
       else if (TARGET_MAVERICK && TARGET_HARD_FLOAT)
 	/* Need to be careful, -256 is not a valid offset.  */
 	low = val >= 0 ? (val & 0xff) : -((-val) & 0xff);
Re: [patch, ARM] Fix PR48250, adjust DImode reload address legitimizing
Cross-tested on QEMU without regressions, is this okay? Thanks, Chung-Lin

2011-03-24  Chung-Lin Tang  <clt...@codesourcery.com>

	PR target/48250
	* config/arm/arm.c (arm_legitimize_reload_address): Adjust
	DImode constant index decomposing.  Mask out lower 2-bits for
	NEON and Thumb-2.

	testsuite/
	* gcc.target/arm/pr48250.c: New.

R.