On Fri, Nov 12, 2021 at 10:16:11PM +0300, Alexander Monakov wrote:
> I suspect there may be a misunderstanding here, or maybe your explanation is
> incomplete. I don't think the intention of the standard was to force such
> complexity. You can launch as many blocks on the GPU as you like, limited only
> by the bitwidth of the indexing register used in hardware, NVIDIA guarantees
> at least INT_MAX blocks (in fact almost 1<<63 blocks if you launch a
> three-dimensional grid with INT_MAX x 65535 x 65535 blocks).
> 
> The hardware will schedule blocks automatically (so for example if the 
> hardware
> can run 40 blocks simultaneously and you launch 100, the hardware may launch
> blocks 0 to 39, then when one of those finishes it will launch the 40'th block
> and so on).
> 
> So isn't the solution simply to adjust the logic around
> nvptx_adjust_launch_bounds in GOMP_OFFLOAD_run, that is, if there's a lower
> bound specified, use it instead of what adjust_launch_bounds is computing as
> max_blocks?

The problem is that the argument of the num_teams clause isn't always known
before target is launched.
While gimplify.c tries hard to figure it out as often as possible and the
standard makes it easy for the combined target teams case where we say
that the expressions in the num_teams/thread_limit clauses are evaluated on
the host before the target construct - in that case the plugin is told the
expected number and unless CUDA decides to allocate fewer than requested,
we are fine, there are cases where target is not combined with teams where
per the spec the expressions need to be evaluated on the target, not on the
host (gimplify still tries to optimize some of those cases by e.g. seeing if
it is some simple arithmetic expression where all the vars would be
firstprivatized), and in that case we create some default number of CTAs and
only later on find out what the user asked for.
extern int foo (void);
#pragma omp declare target to (foo)
void bar (void)
{
  #pragma omp target
  #pragma omp teams num_teams (foo ())
  ;
}
is such a case, we simply don't know and foo () needs to be called in
target.  In OpenMP 5.0 we had the option to always create fewer teams if
we decided so (of course at least 1), but in 5.1 we don't have that option,
if there is just one expression, we need to create exactly that many teams,
if it is num_teams (foo () - 10 : foo () + 10), we need to be within that
range (inclusive).

        Jakub

Reply via email to