On Fri, Nov 12, 2021 at 10:16:11PM +0300, Alexander Monakov wrote:
> I suspect there may be a misunderstanding here, or maybe your explanation is
> incomplete. I don't think the intention of the standard was to force such
> complexity. You can launch as many blocks on the GPU as you like, limited only
> by the bitwidth of the indexing register used in hardware, NVIDIA guarantees
> at least INT_MAX blocks (in fact almost 1<<63 blocks if you launch a
> three-dimensional grid with INT_MAX x 65535 x 65535 blocks).
>
> The hardware will schedule blocks automatically (so for example if the
> hardware can run 40 blocks simultaneously and you launch 100, the hardware
> may launch blocks 0 to 39, then when one of those finishes it will launch
> the 40'th block and so on).
>
> So isn't the solution simply to adjust the logic around
> nvptx_adjust_launch_bounds in GOMP_OFFLOAD_run, that is, if there's a lower
> bound specified, use it instead of what adjust_launch_bounds is computing as
> max_blocks?

The problem is that the argument of the num_teams clause isn't always known
before target is launched. gimplify.c tries hard to figure it out as often as
possible, and the standard makes it easy for the combined target teams case,
where the expressions in the num_teams/thread_limit clauses are evaluated on
the host before the target construct. In that case the plugin is told the
expected number and, unless CUDA decides to allocate fewer than requested, we
are fine. But there are cases where target is not combined with teams, where
per the spec the expressions need to be evaluated on the target, not on the
host (gimplify still tries to optimize some of those cases, e.g. by seeing if
it is some simple arithmetic expression where all the vars would be
firstprivatized). In that case we create some default number of CTAs and only
later on find out what the user asked for.

extern int foo (void);
#pragma omp declare target to (foo)
void bar (void)
{
  #pragma omp target
  #pragma omp teams num_teams (foo ())
  ;
}

is such a case: we simply don't know, and foo () needs to be called in target.

In OpenMP 5.0 we had the option to always create fewer teams if we decided so
(of course at least 1), but in 5.1 we don't have that option. If there is just
one expression, we need to create exactly that many teams; if it is
num_teams (foo () - 10 : foo () + 10), we need to be within that range
(inclusive).

	Jakub
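To make the 5.0 versus 5.1 difference concrete, here is a minimal sketch in
plain C of the two acceptance rules described above. The helper names are
hypothetical and are not part of libgomp or the nvptx plugin; they only model
which team counts would satisfy the clause once its value is finally known.

```c
#include <stdbool.h>

/* Hypothetical helper, not a libgomp function.  Models the OpenMP 5.1
   rule as described above: the number of teams actually created must lie
   within [lower, upper], inclusive.  For a single expression
   num_teams (n), pass lower == upper == n, so only exactly n is accepted.  */
static bool
num_teams_ok_51 (int lower, int upper, int created)
{
  return created >= lower && created <= upper;
}

/* Hypothetical helper, not a libgomp function.  Models the OpenMP 5.0
   rule as described above: the implementation may create fewer teams
   than requested, but at least one.  */
static bool
num_teams_ok_50 (int requested, int created)
{
  return created >= 1 && created <= requested;
}
```

For example, if some default of 40 CTAs was launched and foo () later turns
out to return 100, num_teams_ok_50 (100, 40) holds, while
num_teams_ok_51 (100, 100, 40) does not.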