Richard Sandiford <rdsandif...@googlemail.com> writes:
> Trevor Saunders <tsaund...@mozilla.com> writes:
>> On Wed, May 07, 2014 at 09:52:49PM +0100, Richard Sandiford wrote:
>>> I noticed for_each_rtx showing up in profiles and thought I'd have a go
>>> at using worklist-based iterators instead.  So far I have three:
>>>
>>>   FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
>>>   FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
>>>   FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *
>>>
>>> with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.
>>>
>>> I made FOR_EACH_SUBRTX the "default" (unsuffixed) version because
>>> most walks really don't modify the structure.  I think we should
>>> encourage const_rtxes to be used wherever possible.  E.g. it might
>>> make it easier to have non-GC storage for temporary rtxes in future.
>>>
>>> I've locally replaced all for_each_rtx calls in the generic code with
>>> these iterators and they make things reproducibly faster.  The speed-up
>>> on full --enable-checking=release ./cc1 and ./cc1plus times is only
>>> about 1%, but maybe that's enough to justify the churn.
>>
>> seems pretty nice, and it seems like it'll make code a little more
>> readable too :)
>>
>>> Implementation-wise, the main observation is that most subrtxes are part
>>> of a single contiguous sequence of "e" fields.  E.g. when compiling an
>>> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
>>> subrtxes of 7,636,542 rtxes.  Of those:
>>>
>>>   (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
>>>   (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields
>>>       and no "E" fields, and
>>>   (C) 43,532 (00.6%) are more complicated.
>>>
>>> (A) is really a special case of (B) in which the block has zero length.
>>> Those are the only two cases that really need to be handled inline.
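[A minimal standalone sketch of handling (A) and (B) with one code path: each code maps to the bounds of its contiguous "e" block, with a zero-length block covering the leaf case.  All names here (subrtx_bounds, toy_rtx, push_subrtxes) are illustrative, not GCC's actual rtl.h interface.]

```cpp
#include <vector>

/* Bounds of a code's contiguous block of "e" operands.  */
struct subrtx_bounds { unsigned char start; unsigned char count; };

/* A toy code space: 0 = leaf, 1 = unary, 2 = binary, 3 = complicated.  */
static const subrtx_bounds code_bounds[] = {
  { 0, 0 },	/* case (A): leaf, zero-length block.  */
  { 0, 1 },	/* case (B): one "e" field.  */
  { 0, 2 },	/* case (B): two "e" fields.  */
  { 0, 255 }	/* case (C): count == UCHAR_MAX flags the slow path.  */
};

struct toy_rtx
{
  unsigned char code;
  const toy_rtx *ops[3];	/* At most 3 "e" fields in the common cases.  */
};

/* Push the subrtxes of X onto WORKLIST.  (A) falls out of the same loop
   as (B) because its block has zero length.  */
static void
push_subrtxes (std::vector<const toy_rtx *> &worklist, const toy_rtx *x)
{
  subrtx_bounds b = code_bounds[x->code];
  if (b.count == 255)
    /* Out-of-line handler for PARALLEL-like codes would go here.  */
    return;
  for (unsigned int i = 0; i < b.count; ++i)
    worklist.push_back (x->ops[b.start + i]);
}
```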
>>> The implementation does this by having a mapping from an rtx code to
>>> the bounds of its "e" sequence, in the form of a start index and count.
>>>
>>> Out of (C), the vast majority (43,509) are PARALLELs.  However, as you'd
>>> probably expect, bloating the inline code with that case made things
>>> slower rather than faster.
>>>
>>> The vast majority (in fact all in the combine.ii run above) of
>>> iterations can be done with a 16-element stack worklist.  We obviously
>>> still need a heap fallback for the pathological cases though.
>>>
>>> I spent a bit of time trying different iterator implementations and
>>> seeing which produced the best code.  Specific results from that were:
>>>
>>> - The storage used for the worklist is separate from the iterator,
>>>   in order to avoid capturing iterator fields.
>>>
>>> - Although the natural type of the storage would be auto_vec <..., 16>,
>>>   that produced some overhead compared with a separate stack array and
>>>   heap vector pointer.  With the heap vector pointer, the only overhead
>>>   is an assignment in the constructor and an "if (x) release (x)"-style
>>>   sequence in the destructor.  I think the extra complication over
>>>   auto_vec is worth it because in this case the heap version is so very
>>>   rarely needed.
>>
>> hm, where does the overhead come from exactly?  it seems like if it's
>> faster to use vec<T, va_heap, vl_embed> *foo; we should fix something
>> about vectors, since this isn't the only place it could matter.  does it
>> matter if you use vec<T, va_heap, vl_embed> * or vec<T>?  the second
>> is basically just a wrapper around the former, which I'd expect to have
>> no effect.  I'm not saying you're doing the wrong thing here, but if we
>> can make generic vectors faster we probably should ;) or is the issue
>> the __builtin_expect()s you can add?
>
> Part of the problem is that by having an array in the vec itself,
> the other fields effectively have their address taken too.
> So m_alloc, m_num and m_using_auto_storage need to be set up
> and maintained on the stack, even though we're almost sure that they
> will never be used.
>
>>> - The maximum number of fields in (B)-type rtxes is 3.  We get better
>>>   code by making that explicit rather than having a general loop.
>>>
>>> - (C) codes map to an "e" count of UCHAR_MAX, so we can use a single
>>>   check to test for that and for cases where the stack worklist is
>>>   too small.
>>
>> can we use uint8_t?
>
> We don't really use that in GCC yet.  I don't mind setting a precedent
> though :-)
>
>>> To give an example:
>>>
>>>   /* Callback for for_each_rtx, that returns 1 upon encountering a VALUE
>>>      whose UID is greater than the int uid that D points to.  */
>>>
>>>   static int
>>>   refs_newer_value_cb (rtx *x, void *d)
>>>   {
>>>     if (GET_CODE (*x) == VALUE && CSELIB_VAL_PTR (*x)->uid > *(int *) d)
>>>       return 1;
>>>
>>>     return 0;
>>>   }
>>>
>>>   /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
>>>      that of V.  */
>>>
>>>   static bool
>>>   refs_newer_value_p (rtx expr, rtx v)
>>>   {
>>>     int minuid = CSELIB_VAL_PTR (v)->uid;
>>>
>>>     return for_each_rtx (&expr, refs_newer_value_cb, &minuid);
>>>   }
>>>
>>> becomes:
>>>
>>>   /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
>>>      that of V.  */
>>>
>>>   static bool
>>>   refs_newer_value_p (const_rtx expr, rtx v)
>>>   {
>>>     int minuid = CSELIB_VAL_PTR (v)->uid;
>>>     subrtx_iterator::array_type array;
>>
>> some reason to not hide it in the macro?
>
> I wanted to make the macro a true for loop, with no extra baggage.
> AFAIK there's no way of declaring both the array and the iterator
> in the first part of the for loop without making them part of the
> same object (and thus capturing the iterator fields again).
> A "do ... while (0)" wrapper might be too surprising.
and of course would require an "end loop" macro to hide the while (0).
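[For reference, a simplified standalone sketch of the shape being discussed: the worklist storage is declared separately by the caller, and the macro expands to a true for loop, so no matching "end loop" macro is needed.  The names (node, node_iterator, FOR_EACH_NODE) are illustrative stand-ins, not GCC's actual subrtx_iterator.]

```cpp
#include <vector>

struct node { int value; node *left; node *right; };

class node_iterator
{
public:
  /* The worklist storage lives outside the iterator, as in the patch,
     so that the iterator's own fields stay small and register-friendly.  */
  typedef std::vector<const node *> array_type;

  node_iterator (array_type &array, const node *x) : m_array (array)
  {
    m_array.clear ();
    m_array.push_back (x);
  }

  bool more_p () const { return !m_array.empty (); }

  const node *operator* () const { return m_array.back (); }

  /* Pop the current node and push its children onto the worklist.  */
  void next ()
  {
    const node *x = m_array.back ();
    m_array.pop_back ();
    if (x->right)
      m_array.push_back (x->right);
    if (x->left)
      m_array.push_back (x->left);
  }

private:
  array_type &m_array;
};

/* A true for loop: the iterator is declared in the for header and the
   storage is passed in, so the body is just a normal statement.  */
#define FOR_EACH_NODE(ITER, ARRAY, X) \
  for (node_iterator ITER (ARRAY, X); ITER.more_p (); ITER.next ())
```

Usage then mirrors the refs_newer_value_p rewrite above: declare an `array_type`, then loop with `FOR_EACH_NODE (iter, array, x)` and test `*iter` in the body.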