Kenneth Zadeck <zad...@naturalbridge.com> writes:
> On 12/14/2013 06:40 AM, Richard Sandiford wrote:
>> Hi Kenny,
>>
>> Sorry for the slow response.
>>
>> Kenneth Zadeck <zad...@naturalbridge.com> writes:
>>> Index: gcc/wide-int.cc
>>> ===================================================================
>>> --- gcc/wide-int.cc	(revision 205765)
>>> +++ gcc/wide-int.cc	(working copy)
>>> @@ -1275,23 +1275,31 @@ wi::mul_internal (HOST_WIDE_INT *val, co
>>>    unsigned int j;
>>>    unsigned int blocks_needed = BLOCKS_NEEDED (prec);
>>>    unsigned int half_blocks_needed = blocks_needed * 2;
>>> -  /* The sizes here are scaled to support a 2x largest mode by 2x
>>> -     largest mode yielding a 4x largest mode result.  This is what is
>>> -     needed by vpn.  */
>>>
>>> +  /* The sizes here are scaled to support a 2*largest mode + a little
>>> +     by 2*largest mode + a little yielding a 4*largest mode + a
>>> +     little.  This is what is needed by vpn.  */
>>>    unsigned HOST_HALF_WIDE_INT
>>> -    u[4 * MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_HALF_WIDE_INT];
>>> +    u[2 * WIDE_INT_MAX_PRECISION / HOST_BITS_PER_HALF_WIDE_INT];
>>> +  unsigned int ulen;
>>>    unsigned HOST_HALF_WIDE_INT
>>> -    v[4 * MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_HALF_WIDE_INT];
>>> -  /* The '2' in 'R' is because we are internally doing a full
>>> +    v[2 * WIDE_INT_MAX_PRECISION / HOST_BITS_PER_HALF_WIDE_INT];
>>> +  unsigned int vlen;
>>> +  unsigned int uvlen;
>>> +  unsigned int vallen;
>>> +
>>> +  /* The '4' in 'R' is because we are internally doing a wide
>>>       multiply.  */
>>>    unsigned HOST_HALF_WIDE_INT
>>> -    r[2 * 4 * MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_HALF_WIDE_INT];
>>> -  HOST_WIDE_INT mask = ((HOST_WIDE_INT)1 << HOST_BITS_PER_HALF_WIDE_INT) - 1;
>>> +    r[4 * WIDE_INT_MAX_PRECISION / HOST_BITS_PER_HALF_WIDE_INT];
>>
>> While we're here I think we should convert these arrays into allocas,
>> so that we're not hard-coding the specialness of vpn in wide-int.cc.
>> I.e.
>> rather than having arrays of maximum size, use XALLOCAVEC to
>> allocate a specific number of stack HWIs, once we know how many
>> we actually need.  That includes the new arrays added in the patch.

> I had thought of doing this, but there is a part of me that has always
> thought of it as being more expensive.  In the scheme of things it is
> just noise here, though, so I will do it.
Thanks.

>>> +  /* True if the result needs to be negated.  */
>>> +  bool is_neg = false;
>>>
>>>    /* If the top level routine did not really pass in an overflow, then
>>>       just make sure that we never attempt to set it.  */
>>>    bool needs_overflow = (overflow != 0);
>>> +  const HOST_WIDE_INT zero = 0;
>>>    if (needs_overflow)
>>>      *overflow = false;
>>>
>>> @@ -1382,61 +1392,96 @@ wi::mul_internal (HOST_WIDE_INT *val, co
>>>        return 1;
>>>      }
>>>
>>> -  /* We do unsigned mul and then correct it.  */
>>> -  wi_unpack (u, (const unsigned HOST_WIDE_INT *) op1val, op1len,
>>> -	     half_blocks_needed, prec, SIGNED);
>>> -  wi_unpack (v, (const unsigned HOST_WIDE_INT *) op2val, op2len,
>>> -	     half_blocks_needed, prec, SIGNED);
>>> -
>>> -  /* The 2 is for a full mult.  */
>>> -  memset (r, 0, half_blocks_needed * 2
>>> -	  * HOST_BITS_PER_HALF_WIDE_INT / CHAR_BIT);
>>> +  ulen = op1len * 2;
>>> +  vlen = op2len * 2;
>>> +
>>> +  if (sgn == SIGNED)
>>> +    {
>>> +      if (wi::neg_p (op1))
>>> +	{
>>> +	  HOST_WIDE_INT nop1[2 * WIDE_INT_MAX_ELTS];
>>> +
>>
>> E.g. following on from the above:
>>
>>   /* Add an extra HWI so that we can represent the negative of
>>      the minimum.  */
>>   HOST_WIDE_INT *nop1 = XALLOCAVEC (HOST_WIDE_INT, op1len + 1);
>>
>>> +	  int nop1len
>>> +	    = wi::sub_large (nop1, &zero, 1, op1val, op1len, prec + 1, SIGNED, 0);
>>
>> Not sure about the "prec + 1" here, since it could cause nop1len to be
>> greater than the maximum length for "prec".  And you go on to unpack
>> nop1 in "prec" rather than "prec + 1".

> the unpacking with prec is the wrong part.  That should have been
> prec + 1.

Hmm, I still think it's the opposite, because...

>> AIUI, once you've taken the absolutes, you've got two unsigned numbers
>> of precision "prec" and create a "prec * 2" result, so maybe:
>>   sub_large (..., prec, UNSIGNED, 0) instead?

> the signed and unsigned does not matter here.  the prec + 1 means we
> never overflow.
...my point is that after this we treat the number as unsigned, so it
isn't a question of whether the absolute value overflows the precision
as a signed number but whether it overflows as an unsigned number.
And it never will.  It might need 1 extra HWI than the original input
if op1len < blocks_needed, but it shouldn't need extra precision.

E.g. say for 8-bit HWIs we start out with a signed -0x80 * -0x80.
We convert that into an unsigned 0x80 * 0x80, which is still
precision 8 * precision 8 -> precision 16.  And we'd need to be able
to handle those precision-8 values correctly (with no extra zero HWI)
if we had an unsigned 0x80 * 0x80 from the outset.

>>> +	  is_neg = !is_neg;
>>> +	  ulen = nop1len * 2;
>>> +	  wi_unpack (u, (const unsigned HOST_WIDE_INT*)nop1, nop1len,
>>> +		     ulen, prec, SIGNED);
>>
>> Back on the alloca theme, we'd have:
>>
>>   u = XALLOCAVEC (unsigned HOST_HALF_WIDE_INT, ulen);
>>
>> before this and other unpacks.
>>
>> Note that there's supposed to be a space after ")" in casts.  Lots of
>> instances in the patch :-)
>>
>>> +  /* UNSIGNED mul.  */
>>> +  /* For large positive integers inside regular wide-ints, we must
>>> +     explictly expand out the 1s to full width for the mul to be
>>> +     correct.  Note that the top bit will never be set for a
>>> +     unsigned number in offset_int of widest-int.  */
>>
>> s/of/or/.  But I think the comment is confusing because the unsignedness
>> is a property of the multiplication rather than the inputs.  We should never
>> be doing an unsigned operation on widest_int to begin with.  And if
>> offset_ints are being treated as having a sign then the same is true there.
>>
>> If we ever do have an unsigned multiplication on those types for some reason
>> then -1 will (rightly) be treated as the maximum unsigned value.

> The real question here is "do you want to do an assert or do you want to
> explain what happens if the input comes in that way?"

Sorry, to be clear, I was only talking about the "Note..."
part of the comment rather than the full thing.  I think the first part
of the comment and the code itself is fine.

> The current world is actually structured so that we never ask about
> overflow for the two larger classes, because the reason that you used
> those classes was that you never wanted to have this discussion.  So
> if you never ask about overflow, then it really does not matter,
> because we are not going to return enough bits for you to care what
> happened on the inside.  Of course that could change and someone
> could say that they wanted overflow on widest-int.  Then the comment
> makes sense, with revisions, unless your review of the code that
> wants overflow on widest int suggests that they are just being
> stupid.

But widest_int is now supposed to be at least 1 bit wider than the
widest input type (unlike previously, where it was double the widest
input type).  So I can definitely see cases where we'd want to know
whether a widest_int * widest_int result overflows.

My point is that a widest_int * widest_int multiplication would
normally be a signed multiplication rather than an unsigned
multiplication, since the extra 1 bit of precision allows every
operation to be signed.  So it isn't a case of whether the top bit of
a widest_int will be set, but whether we ever reach here for
widest_int in the first place.  We should be going down the
sgn == SIGNED path rather than the sgn == UNSIGNED path.

widest_int can represent an all-1s value, usually interpreted as -1.
If we do go down this sgn == UNSIGNED path for widest_int then we will
instead treat the all-1s value as the maximum unsigned number, just
like for any other kind of wide int.

As far as this function goes there really is no difference between
wide_int, offset_int and widest_int.  Which is good, because
offset_int and widest_int should just be wide_ints that are optimised
for a specific and fixed precision.

Thanks,
Richard