I will discuss this with Mike when he wakes up.  He lives on the west pole, so that will not be until after you go to bed.

The one point that I will take exception to is the claim that the copying operation is, in practice, any more expensive than the pointer copy.  I never bother to initialize the storage in the array; I only copy the elements that are live, and that is almost always 1 HWI, because either most types are small or most constants of large types compress to 1 HWI.  So even if a compilation does a zillion ::from_trees, you will most likely never see the difference in time.

kenny


On 11/27/2012 05:03 AM, Richard Biener wrote:
On Tue, Nov 27, 2012 at 1:06 AM, Kenneth Zadeck
<zad...@naturalbridge.com> wrote:
Richard,

I spent a good part of the afternoon talking to Mike about this.  He is on
the C++ standards committee and is a much more seasoned C++ programmer than
I am.

He convinced me that with a large amount of engineering and C++
"foolishness" it would indeed be possible to get your proposal to POSSIBLY
work as well as what we did.

But now the question is: why would anyone want to do this?

At the very least you are talking about instantiating two instances of
wide-int, one for the stack-allocated uses and one for the places where we
just move a pointer from the tree or the rtx.  Then you are talking about
creating connectors so that the stack-allocated functions can take
parameters of the pointer version and vice versa.

Then there is the issue that rather than just saying that something is a
wide int, the programmer is going to have to track its origin.  In
particular, where in the code right now I say

wide_int foo = wide_int::from_rtx (r1);
wide_int bar = wide_int::from_rtx (r2) + foo;

now I would have to say

wide_int_ptr foo = wide_int_ptr::from_rtx (r1);
wide_int_stack bar = wide_int_ptr::from_rtx (r2) + foo;
No, you'd say

wide_int foo = wide_int::from_rtx (r1);

and the static, non-templated from_rtx method would automagically
return (always!) a "wide_int_ptr" kind.  The initialization then would
use the assignment operator that mediates between wide_int and
"wide_int_ptr", doing the copying.

The user should get a 'stack' kind by default when specifying wide_int,
like implemented with

struct wide_int_storage_stack;
struct wide_int_storage_ptr;

template <class storage = wide_int_storage_stack>
class wide_int : public storage
{
  ...
  static wide_int <wide_int_storage_ptr> from_rtx (rtx);
};

the whole point of the exercise is to make from_rtx and from_tree avoid
the copying (and excessive stack space allocation) for the rvalue case
like in

  wide_int res = wide_int::from_rtx (x) + 1;

if you save the result into a wide_int temporary first then you lose,
of course (modulo some magic GCC optimization being able to elide
the copy somehow).

And of course code like VRP, which keeps a lattice of wide_ints, would
be able to reduce its footprint by using ptr storage and explicit allocations
(that's a secondary concern, of course).  And VRP could specify that
it needs more than the otherwise sufficient MAX_INT_MODE_SIZE;
ptr storage would not have this arbitrary limitation, only embedded
storage (should) have it.

then when I want to call some function using a wide_int ref, that function
must now either be overloaded to take both, or I have to choose one of the
two instantiations (presumably based on which is going to be more common)
and just have the compiler fix up everything (which it is likely to do).
Nope, they'd be

class wide_int ...
{
    template <class storage1, class storage2>
    friend wide_int operator+ (wide_int <storage1> a, wide_int <storage2> b)
    {
       return wide_int::plus_worker (a.precision, a. ...., a.get_storage_ptr (),
                                     b.precision, ..., b.get_storage_ptr ());
    }


And so what is the payoff?
1) No one except the C++ elite is going to understand the code.  The rest of
the community will hate me and curse the ground that I walk on.
Maybe for the implementation - but look at hash-table and vec ... certainly
not for usage.

2) I will end up with a version of wide-int that can be used as a medium-life
container (where I define medium life as not allowed to survive a gc,
since they will contain pointers into rtxes and trees).
3) And no clients that actually want to do this!  I could use as an
example one of your favorite passes, tree-vrp.  The current double-int
could have been a medium-lifetime container, since it has a smaller
footprint, but in fact tree-vrp converts those double-ints back into trees
for medium-term storage.  Why?  Because it needs the other fields of a
tree-cst to store the entire state.  Wide-ints also "suffer" this problem:
their only state is the data and the three length fields.  They have no type
and none of the other tree info, so the most obvious client for a medium-lifetime
object is really not going to be a good match even if you "solve
the storage problem".

The fact is that wide-ints are an excellent short-term storage class that
can be very quickly converted into our two long-term storage classes.  Your
proposal requires a lot of work, will not be easy to use and, as far as I
can see, has no payoff on the horizon.  There could be future clients for
a medium-lifetime value, but asking for this with no clients in hand is
really beyond the scope of a reasonable review.

I remind you that the purpose of these patches is to solve problems that
exist in the current compiler and that we have papered over for years.  If
someone needs wide-ints in some way that is not foreseen, they can
change it.
The patches introduce a lot more temporary wide-ints (your words) and
at the same time make construction of them from tree / rtx very expensive,
both stack-space and compile-time wise.  Look at how we, for example,
compute TREE_INT_CST + 1: int_cst_binop internally uses double_ints
for the computation and then instantiates a new tree to hold the result.
Now we'd use wide_ints for this, requiring totally unnecessary copying.
Why not try to avoid that in the first place?  And try to avoid making
wide_ints 4 times as large as really necessary just for the sake of VRP!
(VRP should have a way to say "_I_ want larger wide_ints", without putting
this burden on all other users.)

Richard.

kenny


On 11/26/2012 11:30 AM, Richard Biener wrote:
On Mon, Nov 26, 2012 at 5:03 PM, Kenneth Zadeck
<zad...@naturalbridge.com> wrote:
On 11/26/2012 10:03 AM, Richard Biener wrote:
On Mon, Nov 5, 2012 at 2:59 PM, Kenneth Zadeck
<zad...@naturalbridge.com>
wrote:
On 11/04/2012 11:54 AM, Richard Biener wrote:
On Thu, Nov 1, 2012 at 2:10 PM, Richard Sandiford
<rdsandif...@googlemail.com> wrote:
Kenneth Zadeck <zad...@naturalbridge.com> writes:
I would like you to respond to at least point 1 of this email.  In it
there is code from the rtl level that was written twice, once for the
case when the size of the mode is less than the size of a HWI and once
for the case where the size of the mode is less than 2 HWIs.

My patch changes this to one instance of the code that works no matter
how large the data passed to it is.

You have made a specific requirement for wide int to be a template that
can be instantiated in several sizes, one for 1 HWI, one for 2 HWIs.  I
would like to know how this particular fragment is to be rewritten in
this model?  It seems that I would have to retain the structure where
there is one version of the code for each size that the template is
instantiated for.
I think richi's argument was that wide_int should be split into two.
There should be a "bare-metal" class that just has a length and HWIs,
and the main wide_int class should be an extension on top of that
that does things to a bit precision instead.  Presumably with some
template magic so that the length (number of HWIs) is a constant for:

      typedef foo<2> double_int;

and a variable for wide_int (because in wide_int the length would be
the number of significant HWIs rather than the size of the underlying
array).  wide_int would also record the precision and apply it after
the full HWI operation.

So the wide_int class would still provide "as wide as we need"
arithmetic,
as in your rtl patch.  I don't think he was objecting to that.
That summarizes one part of my complaints / suggestions correctly.  In
other mails I suggested to not make it a template but to use a 'bitsize'
(or maxlen) field that is constant over the object's lifetime.  Both
suggestions likely require more thought than I put into them.  The main
reason is that with C++ you can abstract away where the wide-int
information pieces are stored and thus use the arithmetic / operation
workers without copying the (source) "wide-int" objects.  Thus you
should be able to write adaptors for double-int storage, tree or RTX storage.
We had considered something along these lines and rejected it.  I am not
really opposed to doing something like this, but it is not an obviously
winning idea and is likely not to be a good one.  Here was our thought
process:

If you abstract away the storage inside a wide int, then you should be
able to copy a pointer to the block of data from either the rtl-level
integer constant or the tree-level one into the wide int.  It is certainly
true that making a wide_int from one of these is an extremely common
operation, and doing this would avoid those copies.

However, this causes two problems:
1) Mike's first cut at the CONST_WIDE_INT did two ggc allocations to make
the object: it created the base object and then it allocated the array.
Richard S noticed that we could just allocate one CONST_WIDE_INT that had
the array in it.  Doing it this way saves one ggc allocation and one
indirection when accessing the data within the CONST_WIDE_INT.  Our plan
is to use the same trick at the tree level.  So to avoid the copying, you
seem to have to have a more expensive rep for CONST_WIDE_INT and INT_CST.
I did not propose having a pointer to the data in the RTX or tree int.
Just the short-lived wide-ints (which are on the stack) would have a
pointer to the data - which can then obviously point into the RTX and
tree data.
There is then the issue of what happens if some wide-ints are not
short-lived.  It makes me nervous to create internal pointers to gc'ed
memory.
I thought they were all short-lived.

2) You are now stuck either ggcing the storage inside a wide_int when
they are created as part of an expression, or you have to play some game
to represent the two different storage plans inside of wide_int.
Hm?  wide-ints are short-lived and thus never live across a garbage
collection point.  We create non-GCed objects pointing to GCed objects
all the time and everywhere this way.
Again, this makes me nervous, but it could be done.  However, it does mean
that the wide ints that are not created from rtxes or trees will be more
expensive, because they are not going to get their storage "for free";
they are going to alloca it.
No, those would simply use the embedded storage model.

However, it still is not clear, given that 99% of the wide ints are going
to fit in a single HWI, that this would be a noticeable win.
Currently even if they fit into a HWI you will still allocate 4 times the
largest integer mode size.  You say that doesn't matter because they
are short-lived, but I say it does matter, because not all of them are
short-lived enough.  If 99% fit in a HWI, why allocate 4 times the
largest integer mode size in 99% of the cases?

Clearly this is where you think we should be going by suggesting that we
abstract away the internal storage.  However, this comes at a price:
what is currently an array access in my patches would (I believe) become
a function call.
No, the workers (that perform the array accesses) will simply get
a pointer to the first data element.  Then whether it's embedded or
external is of no interest to them.
So is your plan that the wide int constructors from rtx or tree would
just copy the pointer to the array on top of the array that is otherwise
allocated on the stack?  I can easily do this.  But as I said, the gain
seems quite small.

And of course, going the other way still does need the copy.
The proposal was to template wide_int on a storage model: the embedded
one would work as-is (embedding 4 times the largest integer mode), the
external one would have a pointer to the data.  All functions that return a
wide_int produce a wide_int with the embedded model.  To avoid
the function-call penalty you described, the storage model provides
a way to get a pointer to the first element, and the templated operations
simply dispatch to a worker that takes this pointer to the first element
(as the storage model is designed as a template, its abstraction is going
to be optimized away by means of inlining).

Richard.

From a performance point of view, I believe that this is a non-starter.
If you can figure out how to design this so that it is not a function
call, I would consider it a viable option.

On the other side of this, you are clearly correct that we are copying the
data when we are making wide ints from INT_CSTs or CONST_WIDE_INTs.  But
this is why we represent the data inside the wide_ints, the INT_CSTs and
the CONST_WIDE_INTs in a compressed form.  Even with very big types, which
are generally rare, the constants themselves are very small.  So the copy
operation is a loop that almost always copies one element, even with
tree-vrp, which doubles the sizes of every type.

There is a third option, which is that the storage inside the wide int is
just ggced storage.  We rejected this because of the functional nature of
wide-ints.  There are zillions created, they can be stack allocated, and
they last for very short periods of time.
Of course - GCing wide-ints is a non-starter.

As is probably obvious, I don't agree FWIW.  It seems like an unnecessary
complication without any clear use.  Especially since the number of
Maybe the double_int typedef is without any clear use.  Properly
abstracting from the storage / information providers will save
compile-time, memory and code though.  I don't see that any thought
was spent on how to avoid excessive copying or dealing with
long(er)-lived objects and their storage needs.
I actually disagree.  Wide ints can use a bloated amount of storage
because they are designed to be very short-lived, very low-cost objects
that are stack allocated.  For long-term storage, there is INT_CST at the
tree level and CONST_WIDE_INT at the rtl level.  Those use a very compact
storage model.  The copying entailed is only a small part of the overall
cost.
Well, but both trees and RTXen are not viable for short-lived things
because they are GCed!  double-ints were suitable for this kind of stuff
because they also have a moderate size.  With wide-ints, size becomes a
problem (or GC, if you instead use trees or RTXen).

Everything that you are suggesting along these lines is adding to the
weight of a wide-int object.
On the contrary - it lessens their weight (with external, already existing
storage) or does not do anything to it (with the embedded storage).

You have to understand there will be many more wide-ints created in a
normal compilation than were ever created with double-int.  This is
because the rtl level had no object like this at all, and at the tree
level many of the places that should have used double int short-cut the
code and only did the transformations if the types fit in a HWI.
Your argument shows that the copy-in/out from tree/RTX to/from wide-int
will become a very frequent operation and thus it is worth optimizing.

This is why we are extremely defensive about this issue.  We really did
think a lot about it.
I'm sure you did.

Richard.

