On Mon, Nov 26, 2012 at 5:03 PM, Kenneth Zadeck <zad...@naturalbridge.com> wrote:
> On 11/26/2012 10:03 AM, Richard Biener wrote:
>>
>> On Mon, Nov 5, 2012 at 2:59 PM, Kenneth Zadeck <zad...@naturalbridge.com> wrote:
>>>
>>> On 11/04/2012 11:54 AM, Richard Biener wrote:
>>>>
>>>> On Thu, Nov 1, 2012 at 2:10 PM, Richard Sandiford
>>>> <rdsandif...@googlemail.com> wrote:
>>>>>
>>>>> Kenneth Zadeck <zad...@naturalbridge.com> writes:
>>>>>>
>>>>>> I would like you to respond to at least point 1 of this email. In it
>>>>>> there is code from the rtl level that was written twice, once for the
>>>>>> case when the size of the mode is less than the size of a HWI and once
>>>>>> for the case where the size of the mode is less than 2 HWIs.
>>>>>>
>>>>>> My patch changes this to one instance of the code that works no matter
>>>>>> how large the data passed to it is.
>>>>>>
>>>>>> You have made a specific requirement for wide_int to be a template that
>>>>>> can be instantiated in several sizes, one for 1 HWI, one for 2 HWIs. I
>>>>>> would like to know how this particular fragment is to be rewritten in
>>>>>> this model? It seems that I would have to retain the structure where
>>>>>> there is one version of the code for each size that the template is
>>>>>> instantiated.
>>>>>
>>>>> I think richi's argument was that wide_int should be split into two.
>>>>> There should be a "bare-metal" class that just has a length and HWIs,
>>>>> and the main wide_int class should be an extension on top of that
>>>>> that does things to a bit precision instead. Presumably with some
>>>>> template magic so that the length (number of HWIs) is a constant for:
>>>>>
>>>>>   typedef foo<2> double_int;
>>>>>
>>>>> and a variable for wide_int (because in wide_int the length would be
>>>>> the number of significant HWIs rather than the size of the underlying
>>>>> array). wide_int would also record the precision and apply it after
>>>>> the full HWI operation.
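[Editor's sketch] The two-layer split Richard Sandiford describes above might look roughly like this. All names except the quoted `typedef foo<2> double_int` are invented for illustration, and `HOST_WIDE_INT` is simplified to `long`; this is a sketch of the idea, not GCC's actual implementation:

```cpp
typedef long HOST_WIDE_INT;   // simplified stand-in for GCC's HWI

template <int N>
struct foo {                  // "bare-metal": just a length and HWIs,
  HOST_WIDE_INT val[N];       // with N a compile-time constant
};

typedef foo<2> double_int;    // the quoted typedef: length fixed at 2

struct wide_int {             // extension layer: here the length is the
  HOST_WIDE_INT val[4];       // number of *significant* HWIs, a runtime
  unsigned len;               // variable, not the array size
  unsigned precision;         // bit precision, recorded and applied after
                              // the full-HWI operation, e.g.:
  static HOST_WIDE_INT mask_to_precision(HOST_WIDE_INT x, unsigned prec) {
    unsigned bits = 8 * sizeof(HOST_WIDE_INT);
    return prec >= bits ? x : (x & (((HOST_WIDE_INT)1 << prec) - 1));
  }
};
```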
>>>>>
>>>>> So the wide_int class would still provide "as wide as we need" arithmetic,
>>>>> as in your rtl patch. I don't think he was objecting to that.
>>>>
>>>> That summarizes one part of my complaints / suggestions correctly. In other
>>>> mails I suggested to not make it a template but a constant-over-object-lifetime
>>>> 'bitsize' (or maxlen) field. Both suggestions likely require more thought
>>>> than I put into them. The main reason is that with C++ you can abstract away
>>>> where the wide-int information pieces are stored and thus use the arithmetic /
>>>> operation workers without copying the (source) "wide-int" objects. Thus you
>>>> should be able to write adaptors for double-int storage, tree or RTX storage.
>>>
>>> We had considered something along these lines and rejected it. I am not
>>> really opposed to doing something like this, but it is not an obvious
>>> winning idea and is likely not to be a good idea. Here was our thought
>>> process:
>>>
>>> If you abstract away the storage inside a wide int, then you should be able
>>> to copy a pointer to the block of data from either the rtl-level integer
>>> constant or the tree-level one into the wide int. It is certainly true
>>> that making a wide_int from one of these is an extremely common operation,
>>> and doing this would avoid those copies.
>>>
>>> However, this causes two problems:
>>> 1) Mike's first cut at the CONST_WIDE_INT did two ggc allocations to make
>>> the object: it created the base object and then it allocated the array.
>>> Richard S. noticed that we could just allocate one CONST_WIDE_INT that had
>>> the array in it. Doing it this way saves one ggc allocation and one
>>> indirection when accessing the data within the CONST_WIDE_INT. Our plan is
>>> to use the same trick at the tree level. So to avoid the copying, you seem
>>> to have to have a more expensive rep for CONST_WIDE_INT and INT_CST.
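[Editor's sketch] The single-allocation trick in point 1 above (one object with the array stored in it, rather than a header plus a separately allocated array) is, in outline, the classic trailing-array pattern. The names below are made up, and `malloc` stands in for ggc allocation:

```cpp
#include <cstdlib>
#include <cstddef>

// One-allocation layout: the element array lives in the same block as the
// header, so reading the data needs no extra indirection.
struct const_wide_int {
  unsigned len;              // number of significant elements
  long elts[1];              // trailing array, over-allocated below
};

static const_wide_int *make_const_wide_int(const long *val, unsigned len) {
  // A single allocation covering the header plus (len - 1) extra
  // elements, instead of two allocations (header, then array).
  std::size_t size = sizeof(const_wide_int) + (len - 1) * sizeof(long);
  const_wide_int *p = static_cast<const_wide_int *>(std::malloc(size));
  p->len = len;
  for (unsigned i = 0; i < len; ++i)
    p->elts[i] = val[i];
  return p;
}
```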
>>
>> I did not propose having a pointer to the data in the RTX or tree int. Just
>> the short-lived wide-ints (which are on the stack) would have a pointer to
>> the data - which can then obviously point into the RTX and tree data.
>
> There is then the issue of what happens if some wide-ints are not short-lived.
> It makes me nervous to create internal pointers to GC'ed memory.
I thought they were all short-lived.

>>> 2) You are now stuck either ggc-ing the storage inside a wide_int when they
>>> are created as part of an expression, or you have to play some game to
>>> represent the two different storage plans inside of wide_int.
>>
>> Hm? wide-ints are short-lived and thus never live across a garbage-collection
>> point. We create non-GCed objects pointing to GCed objects all the time
>> and everywhere this way.
>
> Again, this makes me nervous, but it could be done. However, it does mean
> that now the wide-ints that are not created from rtxes or trees will be more
> expensive, because they are not going to get their storage "for free"; they
> are going to alloca it.

No, those would simply use the embedded storage model.

> However, it still is not clear, given that 99% of the wide-ints are going to
> fit in a single HWI, that this would be a noticeable win.

Currently, even if they fit into a HWI, you will still allocate 4 times the
largest integer mode size. You say that doesn't matter because they are
short-lived, but I say it does matter because not all of them are short-lived
enough. If 99% fit in a HWI, why allocate 4 times the largest integer mode
size in 99% of the cases?

>>
>>> Clearly this
>>> is where you think that we should be going by suggesting that we abstract
>>> away the internal storage. However, this comes at a price: what is
>>> currently an array access in my patches would (I believe) become a function
>>> call.
>>
>> No, the workers (that perform the array accesses) will simply get
>> a pointer to the first data element. Then whether it's embedded or
>> external is of no interest to them.
>
> So is your plan that the wide-int constructors from rtx or tree would just
> copy the pointer to the array on top of the array that is otherwise
> allocated on the stack? I can easily do this. But as I said, the gain
> seems quite small.
>
> And of course, going the other way still does need the copy.
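[Editor's sketch] The copy-in construction being discussed can be sketched as follows. All names are hypothetical; the point, made elsewhere in the thread, is that because constants are stored in compressed form, the copy loop almost always runs once:

```cpp
typedef long HOST_WIDE_INT;   // simplified stand-in for GCC's HWI

// Hypothetical stack-allocated wide-int with embedded storage.
struct wide_int_sketch {
  HOST_WIDE_INT val[4];       // 4x the largest integer mode, as discussed
  unsigned len;               // significant HWIs only

  // Construct from a compressed constant's array: since even constants of
  // very big types are usually small, this loop almost always copies a
  // single element.
  static wide_int_sketch from_array(const HOST_WIDE_INT *p, unsigned n) {
    wide_int_sketch w;
    w.len = n;
    for (unsigned i = 0; i < n; ++i)
      w.val[i] = p[i];
    return w;
  }
};
```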
The proposal was to template wide_int on a storage model; the embedded one
would work as-is (embedding 4 times the largest integer mode), the external
one would have a pointer to the data. All functions that return a wide_int
produce a wide_int with the embedded model. To avoid the function-call
penalty you described, the storage model provides a way to get a pointer to
the first element, and the templated operations simply dispatch to a worker
that takes this pointer to the first element (as the storage model is
designed as a template, its abstraction is going to be optimized away by
means of inlining).

Richard.

>>> From a performance point of view, I believe that this is a non-starter.
>>> If you can figure out how to design this so that it is not a function
>>> call, I would consider this a viable option.
>>>
>>> On the other side of this, you are clearly correct that we are copying the
>>> data when we are making wide-ints from INT_CSTs or CONST_WIDE_INTs. But
>>> this is why we represent the data inside of the wide_ints, the INT_CSTs and
>>> the CONST_WIDE_INTs in a compressed form. Even with very big types, which
>>> are generally rare, the constants themselves are very small. So the copy
>>> operation is a loop that almost always copies one element, even with
>>> tree-vrp, which doubles the sizes of every type.
>>>
>>> There is the third option, which is that the storage inside the wide int is
>>> just ggc-ed storage. We rejected this because of the functional nature of
>>> wide-ints. There are zillions created, they can be stack allocated, and
>>> they last for very short periods of time.
>>
>> Of course - GCing wide-ints is a non-starter.
>>
>>>>> As is probably obvious, I don't agree FWIW. It seems like an unnecessary
>>>>> complication without any clear use. Especially since the number of
>>>>
>>>> Maybe the double_int typedef is without any clear use.
>>>> Properly
>>>> abstracting from the storage / information providers will save
>>>> compile-time, memory and code though. I don't see that any thought
>>>> was spent on how to avoid excessive copying or dealing with
>>>> long(er)-lived objects and their storage needs.
>>>
>>> I actually disagree. Wide-ints can use a bloated amount of storage
>>> because they are designed to be very short-lived and very low-cost
>>> objects that are stack allocated. For long-term storage, there is
>>> INT_CST at the tree level and CONST_WIDE_INT at the rtl level. Those
>>> use a very compact storage model. The copying entailed is only a small
>>> part of the overall performance.
>>
>> Well, but both trees and RTXen are not viable for short-lived things
>> because they are GCed! double-ints were suitable for this kind of stuff
>> because they also have a moderate size. With wide-ints, size becomes a
>> problem (or GC, if you instead use trees or RTXen).
>>
>>> Everything that you are suggesting along these lines is adding to the
>>> weight of a wide-int object.
>>
>> On the contrary - it lessens their weight (with external already
>> existing storage) or does not do anything to it (with the embedded
>> storage).
>>
>>> You have to understand there will be many more
>>> wide-ints created in a normal compilation than were ever created with
>>> double-int. This is because the rtl level had no object like this at
>>> all, and at the tree level, many of the places that should have used
>>> double-int short-cut the code and only did the transformations if the
>>> types fit in a HWI.
>>
>> Your argument shows that the copy-in/out from tree/RTX to/from wide-int
>> will become a very frequent operation and thus it is worth optimizing it.
>>
>>> This is why we are extremely defensive about this issue. We really did
>>> think a lot about it.
>>
>> I'm sure you did.
>>
>> Richard.
>
>
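[Editor's sketch] Pulling the thread's proposal together: a storage-model template with an embedded model for results, an external model pointing into existing RTX/tree data, and workers that only ever see a pointer to the first element might be sketched as below. Every name here is invented for illustration; the actual wide-int design in GCC may differ:

```cpp
typedef long HOST_WIDE_INT;   // simplified stand-in for GCC's HWI

// Embedded model: the default, used for every operation result.
struct embedded_storage {
  HOST_WIDE_INT val[4];       // 4x the largest integer mode
  HOST_WIDE_INT *get_val() { return val; }
  const HOST_WIDE_INT *get_val() const { return val; }
};

// External model: just points into existing (e.g. RTX or tree) data,
// so construction involves no copy.
struct external_storage {
  const HOST_WIDE_INT *ptr;
  const HOST_WIDE_INT *get_val() const { return ptr; }
};

// Worker: operates on raw element pointers, so whether the storage is
// embedded or external is of no interest to it.
static void add_worker(HOST_WIDE_INT *r, const HOST_WIDE_INT *a,
                       const HOST_WIDE_INT *b, unsigned len) {
  unsigned long carry = 0;
  for (unsigned i = 0; i < len; ++i) {
    // unsigned arithmetic for well-defined wraparound
    unsigned long ua = (unsigned long)a[i];
    unsigned long s = ua + (unsigned long)b[i] + carry;
    carry = (s < ua || (carry && s == ua)) ? 1UL : 0UL;
    r[i] = (HOST_WIDE_INT)s;
  }
}

// Templated front end: dispatches to the worker. Because the storage
// model is a template parameter, get_val() inlines away and the array
// access stays an array access, not a function call.
template <typename S1, typename S2>
static embedded_storage wide_add(const S1 &a, const S2 &b, unsigned len) {
  embedded_storage r;
  add_worker(r.get_val(), a.get_val(), b.get_val(), len);
  return r;
}
```

Mixing the two models then works transparently, e.g. adding an embedded value to one viewed in place through `external_storage`.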