Re: [wide-int] Handle more add and sub cases inline
Kenneth Zadeck <zad...@naturalbridge.com> writes:
> I would like to see some comment to the effect that this is to allow
> inlining for the common case for widest_int and offset_int without
> inlining the uncommon case for regular wide_int.

OK, how about:

  /* If the precision is known at compile time to be greater than
     HOST_BITS_PER_WIDE_INT, we can optimize the single-HWI case
     knowing that (a) all bits in those HWIs are significant and
     (b) the result has room for at least two HWIs.  This provides
     a fast path for things like offset_int and widest_int.

     The STATIC_CONSTANT_P test prevents this path from being used
     for wide_ints.  wide_ints with precisions greater than
     HOST_BITS_PER_WIDE_INT are relatively rare and there's not much
     point handling them inline.  */

Thanks,
Richard
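For context, STATIC_CONSTANT_P is GCC's wrapper around
__builtin_constant_p.  The sketch below is not part of the patch: the
macro definition is only approximate and FIXED_PRECISION is an
illustrative stand-in for a fixed-precision type's precision.  It shows
why the guard folds to a compile-time constant for fixed-precision
types like offset_int and widest_int, but not for wide_int, whose
precision is a runtime value:

/* Roughly how the guard is defined (sketch, not GCC source).  */
#define STATIC_CONSTANT_P(X) (__builtin_constant_p (X) && (X))

enum { HOST_BITS_PER_WIDE_INT = 64 };  /* typical 64-bit host */
enum { FIXED_PRECISION = 128 };        /* stand-in for offset_int's precision */

int
gate_fixed (void)
{
  /* The operand is a constant expression, so this folds to 1 at
     compile time and the guarded fast path is compiled in.  */
  return STATIC_CONSTANT_P (FIXED_PRECISION > HOST_BITS_PER_WIDE_INT);
}

int
gate_runtime (unsigned int precision)
{
  /* For a runtime value, __builtin_constant_p typically yields 0, so
     this is 0 and the guarded branch is removed as dead code -- the
     wide_int case.  */
  return STATIC_CONSTANT_P (precision > HOST_BITS_PER_WIDE_INT);
}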
[wide-int] Handle more add and sub cases inline
Currently add and sub have no fast path for offset_int and widest_int;
they just call the out-of-line version.  This patch handles the
single-HWI cases inline.  At least on x86_64, this only adds one branch
per call; the fast path itself is straight-line code.

On the same fold-const.ii testcase, this reduces the number of
add_large calls from 877507 to 42459.  It reduces the number of
sub_large calls from 25707 to 148.

Tested on x86_64-linux-gnu.  OK to install?

Thanks,
Richard

Index: gcc/wide-int.h
===================================================================
--- gcc/wide-int.h	2013-11-28 13:34:19.596839877 +0000
+++ gcc/wide-int.h	2013-11-28 16:08:11.387731775 +0000
@@ -2234,6 +2234,17 @@ wi::add (const T1 &x, const T2 &y)
       val[0] = xi.ulow () + yi.ulow ();
       result.set_len (1);
     }
+  else if (STATIC_CONSTANT_P (precision > HOST_BITS_PER_WIDE_INT)
+	   && xi.len + yi.len == 2)
+    {
+      unsigned HOST_WIDE_INT xl = xi.ulow ();
+      unsigned HOST_WIDE_INT yl = yi.ulow ();
+      unsigned HOST_WIDE_INT resultl = xl + yl;
+      val[0] = resultl;
+      val[1] = (HOST_WIDE_INT) resultl < 0 ? 0 : -1;
+      result.set_len (1 + (((resultl ^ xl) & (resultl ^ yl))
+			   >> (HOST_BITS_PER_WIDE_INT - 1)));
+    }
   else
     result.set_len (add_large (val, xi.val, xi.len,
			       yi.val, yi.len, precision,
@@ -2288,6 +2299,17 @@ wi::sub (const T1 &x, const T2 &y)
       val[0] = xi.ulow () - yi.ulow ();
       result.set_len (1);
     }
+  else if (STATIC_CONSTANT_P (precision > HOST_BITS_PER_WIDE_INT)
+	   && xi.len + yi.len == 2)
+    {
+      unsigned HOST_WIDE_INT xl = xi.ulow ();
+      unsigned HOST_WIDE_INT yl = yi.ulow ();
+      unsigned HOST_WIDE_INT resultl = xl - yl;
+      val[0] = resultl;
+      val[1] = (HOST_WIDE_INT) resultl < 0 ? 0 : -1;
+      result.set_len (1 + (((resultl ^ xl) & (xl ^ yl))
+			   >> (HOST_BITS_PER_WIDE_INT - 1)));
+    }
   else
     result.set_len (sub_large (val, xi.val, xi.len,
			       yi.val, yi.len, precision,
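The core of both fast paths is the branch-free length computation.  The
self-contained sketch below is not part of the patch: it assumes a
64-bit HOST_WIDE_INT and uses hypothetical names (result2,
add_fast_path).  It shows why (resultl ^ xl) & (resultl ^ yl) has its
sign bit set exactly when the single-HWI addition overflows, so the
shift yields a length of 1 or 2:

/* Sketch of the add fast path; compile with any C99 compiler and run
   to check the no-overflow and overflow cases.  */
#include <assert.h>
#include <stdint.h>

#define HWI_BITS 64

struct result2
{
  uint64_t val[2];   /* least-significant HWI first */
  unsigned int len;  /* number of significant HWIs: 1 or 2 */
};

static struct result2
add_fast_path (uint64_t xl, uint64_t yl)
{
  struct result2 r;
  uint64_t resultl = xl + yl;
  r.val[0] = resultl;
  /* val[1] only matters when the addition overflowed, in which case
     the true sign is the opposite of resultl's apparent sign.  */
  r.val[1] = (int64_t) resultl < 0 ? 0 : (uint64_t) -1;
  /* Signed overflow occurred iff xl and yl have the same sign bit and
     resultl's differs; the AND below then has its top bit set, so the
     shift contributes 1 and len becomes 2.  */
  r.len = 1 + (unsigned int) (((resultl ^ xl) & (resultl ^ yl))
			      >> (HWI_BITS - 1));
  return r;
}

int
main (void)
{
  /* No overflow: the sum fits in one HWI.  */
  struct result2 a = add_fast_path (1, 2);
  assert (a.len == 1 && a.val[0] == 3);

  /* Overflow: INT64_MAX + 1 wraps negative; val[1] == 0 records that
     the mathematical result is actually positive.  */
  struct result2 b = add_fast_path ((uint64_t) INT64_MAX, 1);
  assert (b.len == 2 && b.val[1] == 0);
  return 0;
}

The subtraction path is identical except that the overflow test is
(resultl ^ xl) & (xl ^ yl): x - y can only overflow when x and y have
opposite signs and the result's sign differs from x's.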
Re: [wide-int] Handle more add and sub cases inline
I would like to see some comment to the effect that this is to allow
inlining for the common case for widest_int and offset_int without
inlining the uncommon case for regular wide_int.

On 11/28/2013 12:38 PM, Richard Sandiford wrote:
> Currently add and sub have no fast path for offset_int and widest_int;
> they just call the out-of-line version.  This patch handles the
> single-HWI cases inline.  At least on x86_64, this only adds one
> branch per call; the fast path itself is straight-line code.
>
> On the same fold-const.ii testcase, this reduces the number of
> add_large calls from 877507 to 42459.  It reduces the number of
> sub_large calls from 25707 to 148.
>
> Tested on x86_64-linux-gnu.  OK to install?