On Fri, 12 Nov 2021, Jakub Jelinek via Gcc-patches wrote:

> On Fri, Nov 12, 2021 at 08:47:09PM +0100, Jakub Jelinek wrote:
> > The problem is that the argument of the num_teams clause isn't always known
> > before target is launched.
> 
> There was a design mistake that the clause has been put on teams rather than
> on target (well, for host teams we need it on teams), and 5.1 actually
> partially fixes this up for thread_limit by allowing that clause on both,
> but not for num_teams.

If this is a mistake in the standard, can GCC say "the spec is bad; fix the
spec" and refuse to implement support, since it penalizes the common case?

Technically, this could be implemented without penalizing the common case via
CUDA "dynamic parallelism" where you initially launch just one block on the
device that figures out the dimensions and then performs a GPU-side launch of
the required amount of blocks, but that's a nontrivial amount of work.

I looked over your patch. I sent a small nitpick about 'nocommon' in a separate
message, and I still think it's better to adjust GOMP_OFFLOAD_run to take into
account the lower bound when it's known on the host side (otherwise you do
static scheduling of blocks which is going to be inferior to dynamic scheduling:
imagine lower bound is 3, and maximum resident blocks is 2: then you first do
teams 0 and 1 in parallel, then you do team 2 from the 0'th block, while in fact
you want to do it from whichever block finished its initial team first).

Alexander

Reply via email to