On 2/12/26 10:13, Arsen Arsenović wrote:
Hi!

Matthew Malcomson <[email protected]> writes:

From: Matthew Malcomson <[email protected]>

I imagine there is a good chunk of performance to be gained by adding
more logic here:
1) Something that records whether the affinity setup has changed since
    the last non-nested team and, if it has not, skips the calculation
    of places and the assignment to the `nthr->ts.place_partition_*`
    variables.
2) I could move more members that are currently assigned once per
    secondary thread (with the same value each time) into `team`, for
    each secondary thread to read.
    - `data` pointer.
    - `num_teams`.
    - `team_num`.
    - Honestly, I should have looked into this in the patch; I only
      noticed it while writing this cover letter and will look into
      whether it is feasible relatively soon.
3) If we can identify that we're re-using a team from the last parallel
    region (and affinity ICV has not changed) it seems that we could
    avoid re-initialising some of its fields:
    - `ordered_release[i]` should already point to `nthr`?
    - `ts.team_id` should already be set.
Overall, if we can identify that we're re-using the cached team and
don't need to change anything, we should be able to get away with
drastically less work in the serial part of the call to
`GOMP_parallel`.


I've actually made much the same change for the GCN configuration of
libgomp, which semantically conflicts with this patch (I haven't
upstreamed it yet, though).

It may make sense to extract the common thread initialization
(including the initialization from this cache) into a function so that
it can be used in other libgomp configurations.


Hi Arsen,

Nice to know others see the benefit!

Having a common thread initialisation function sounds like a sensible idea.

One complication is that, once I've investigated the three improvements I mentioned above, I expect some of the thread initialisation done in the generic code will not be needed by the targets (e.g. the `num_teams` member).

If that does turn out to be the case, I guess we could accept a little sub-optimality (it would be in the parallel region, so much less critical) rather than making a generic function that is tightly tied to the current implementation details of the backends.

MM
