Richard Sandiford <rdsandif...@googlemail.com> writes:
> Trevor Saunders <tsaund...@mozilla.com> writes:
>> On Wed, May 07, 2014 at 09:52:49PM +0100, Richard Sandiford wrote:
>>> I noticed for_each_rtx showing up in profiles and thought I'd have a go
>>> at using worklist-based iterators instead.  So far I have three:
>>> 
>>>   FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
>>>   FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
>>>   FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *
>>> 
>>> with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.
>>> 
>>> I made FOR_EACH_SUBRTX the "default" (unsuffixed) version because
>>> most walks really don't modify the structure.  I think we should
>>> encourage const_rtxes to be used wherever possible.  E.g. it might
>>> make it easier to have non-GC storage for temporary rtxes in future.
>>> 
>>> I've locally replaced all for_each_rtx calls in the generic code with
>>> these iterators and they make things reproducibly faster.  The speed-up
>>> on full --enable-checking=release ./cc1 and ./cc1plus times is only
>>> about 1%, but maybe that's enough to justify the churn.
>>
>> seems pretty nice, and it seems like it'll make code a little more
>> readable too :)
>>
>>> Implementation-wise, the main observation is that most subrtxes are part
>>> of a single contiguous sequence of "e" fields.  E.g. when compiling an
>>> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
>>> subrtxes of 7,636,542 rtxes.  Of those:
>>> 
>>> (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
>>> (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
>>>                       no "E" fields, and
>>> (C)    43,532 (00.6%) are more complicated.
>>> 
>>> (A) is really a special case of (B) in which the block has zero length.
>>> Those are the only two cases that really need to be handled inline.
>>> The implementation does this by having a mapping from an rtx code to the
>>> bounds of its "e" sequence, in the form of a start index and count.
>>> 
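For anyone reading along without the patch in front of them, the table
described here could plausibly look like the sketch below.  The names are
mine rather than the patch's; the load-bearing details are just the
(start, count) pair per code and the UCHAR_MAX sentinel mentioned later:

  #include <limits.h>  /* UCHAR_MAX */

  /* Bounds of the contiguous "e" operand block for one rtx code,
     derivable from the code's format string.  */
  struct subrtx_bound_info
  {
    unsigned char start;  /* index of the first "e" operand */
    unsigned char count;  /* length of the "e" run; UCHAR_MAX marks
			     the "complicated" codes (case (C))  */
  };

  /* One entry per rtx code, indexed by GET_CODE (x).  */
  extern struct subrtx_bound_info subrtx_bounds[NUM_RTX_CODE];
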
>>> Out of (C), the vast majority (43,509) are PARALLELs.  However, as you'd
>>> probably expect, bloating the inline code with that case made things
>>> slower rather than faster.
>>> 
>>> The vast majority (in fact all in the combine.ii run above) of iterations
>>> can be done with a 16-element stack worklist.  We obviously still need a
>>> heap fallback for the pathological cases though.
>>> 
>>> I spent a bit of time trying different iterator implementations and
>>> seeing which produced the best code.  Specific results from that were:
>>> 
>>> - The storage used for the worklist is separate from the iterator,
>>>   in order to avoid capturing iterator fields.
>>> 
>>> - Although the natural type of the storage would be auto_vec <..., 16>,
>>>   that produced some overhead compared with a separate stack array and heap
>>>   vector pointer.  With the heap vector pointer, the only overhead is an
>>>   assignment in the constructor and an "if (x) release (x)"-style sequence
>>>   in the destructor.  I think the extra complication over auto_vec is worth
>>>   it because in this case the heap version is so very rarely needed.
>>
>> hm, where does the overhead come from exactly?  it seems like if it's
>> faster to use vec<T, va_heap, vl_embed> *foo; we should fix something
>> about vectors, since this isn't the only place it could matter.  does it
>> matter if you use vec<T, va_heap, vl_embed> * or vec<T>?  the second
>> is basically just a wrapper around the former, which I'd expect has no
>> effect.  I'm not saying you're doing the wrong thing here, but if we can
>> make generic vectors faster we probably should ;) or is the issue the
>> __builtin_expect()s you can add?
>
> Part of the problem is that, with the array embedded in the vec itself,
> the other fields effectively have their address taken too.
> So m_alloc, m_num and m_using_auto_storage need to be set up
> and maintained on the stack, even though we're almost sure they
> will never be used.
>
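So concretely, I take it the storage ends up shaped something like this
(my reconstruction; the member names and the vec_free call are guesses
at the style, not quotes from the patch):

  /* Storage for the worklist, declared separately from the iterator
     so that taking the storage's address doesn't force the iterator's
     own fields into memory too.  */
  struct array_type
  {
    array_type () : heap (NULL) {}
    ~array_type ()
    {
      /* Beyond the bare stack array, the only overhead is the
	 constructor's assignment and this test.  */
      if (heap)
	vec_free (heap);
    }
    rtx stack[16];			  /* covers almost every walk */
    vec<rtx, va_heap, vl_embed> *heap;	  /* rarely-needed fallback */
  };
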
>>> - The maximum number of fields in (B)-type rtxes is 3.  We get better
>>>   code by making that explicit rather than having a general loop.
>>> 
>>> - (C) codes map to an "e" count of UCHAR_MAX, so we can use a single
>>>   check to test for that and for cases where the stack worklist is
>>>   too small.
>>
>>  can we use uint8_t?
>
> We don't really use that in GCC yet.  I don't mind setting a precedent
> though :-)
>
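The single-comparison trick falls out nicely from that sentinel.
Reusing the hypothetical subrtx_bounds table from above (and a made-up
out-of-line slow path), the inline push logic would be roughly:

  static const size_t LOCAL_ELEMS = 16;  /* stack worklist size */

  /* Push X's "e" operands onto the worklist BASE, whose next free
     slot is END.  */
  static void
  add_subrtxes (rtx *base, size_t &end, rtx x)
  {
    size_t count = subrtx_bounds[GET_CODE (x)].count;
    /* One check covers both COUNT == UCHAR_MAX (a complicated code)
       and a stack worklist too full to take COUNT more entries.  */
    if (__builtin_expect (end + count > LOCAL_ELEMS, false))
      {
	add_subrtxes_slow (base, end, x);  /* hypothetical heap path */
	return;
      }
    size_t start = subrtx_bounds[GET_CODE (x)].start;
    /* At most 3 "e" fields in the inline cases, so unroll.  */
    if (count > 0) base[end++] = XEXP (x, start);
    if (count > 1) base[end++] = XEXP (x, start + 1);
    if (count > 2) base[end++] = XEXP (x, start + 2);
  }
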
>>> To give an example:
>>> 
>>> /* Callback for for_each_rtx that returns 1 upon encountering a VALUE
>>>    whose UID is greater than the int uid that D points to.  */
>>> 
>>> static int
>>> refs_newer_value_cb (rtx *x, void *d)
>>> {
>>>   if (GET_CODE (*x) == VALUE && CSELIB_VAL_PTR (*x)->uid > *(int *)d)
>>>     return 1;
>>> 
>>>   return 0;
>>> }
>>> 
>>> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
>>>    that of V.  */
>>> 
>>> static bool
>>> refs_newer_value_p (rtx expr, rtx v)
>>> {
>>>   int minuid = CSELIB_VAL_PTR (v)->uid;
>>> 
>>>   return for_each_rtx (&expr, refs_newer_value_cb, &minuid);
>>> }
>>> 
>>> becomes:
>>> 
>>> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
>>>    that of V.  */
>>> 
>>> static bool
>>> refs_newer_value_p (const_rtx expr, rtx v)
>>> {
>>>   int minuid = CSELIB_VAL_PTR (v)->uid;
>>>   subrtx_iterator::array_type array;
>>
>> some reason to not hide it in the macro?
>
> I wanted to make the macro a true for loop, with no extra baggage.
> AFAIK there's no way of declaring both the array and the iterator
> in the first part of the for loop without making them part of the
> same object (and thus capturing the iterator fields again).
> A "do ... while (0)" wrapper might be too surprising.

and of course would require an "end loop" macro to hide the while (0).
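
For reference, I'd guess the macro itself is then a bare for loop along
these lines (my reconstruction of the shape, not the patch text):

  #define FOR_EACH_SUBRTX(ITER, ARRAY, X, KIND) \
    for (subrtx_iterator ITER (ARRAY, (X), (KIND)); \
	 !ITER.at_end (); ITER.next ())

which is why ARRAY must be declared beforehand: the for-init clause can
only declare objects of a single type, and bundling the array and the
iterator into one object would re-capture the iterator fields, as you
say.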
