On 12/23/22 6:08 AM, Thomas Schwinge wrote:
Hi!
On 2022-11-11T15:35:44+0100, Richard Biener via Fortran <fort...@gcc.gnu.org>
wrote:
On Fri, Nov 11, 2022 at 3:13 PM Thomas Schwinge <tho...@codesourcery.com> wrote:
For example, for Fortran code like:
write (*,*) "Hello world"
..., 'gfortran' creates:
  struct __st_parameter_dt dt_parm.0;
  try
    {
      dt_parm.0.common.filename = &"source-gcc/libgomp/testsuite/libgomp.oacc-fortran/print-1_.f90"[1]{lb: 1 sz: 1};
      dt_parm.0.common.line = 29;
      dt_parm.0.common.flags = 128;
      dt_parm.0.common.unit = 6;
      _gfortran_st_write (&dt_parm.0);
      _gfortran_transfer_character_write (&dt_parm.0, &"Hello world"[1]{lb: 1 sz: 1}, 11);
      _gfortran_st_write_done (&dt_parm.0);
    }
  finally
    {
      dt_parm.0 = {CLOBBER(eol)};
    }
The issue: the stack object 'dt_parm.0' is a half-KiB in size (yes,
really! -- there's a lot of state in Fortran I/O apparently). That's a
problem for GPU execution -- here: OpenACC/nvptx -- where typically you
have small stacks. (For example, GCC/OpenACC/nvptx: 1 KiB per thread;
GCC/OpenMP/nvptx is an exception, because of its use of '-msoft-stack'
"Use custom stacks instead of local memory for automatic storage".)
Now, the Nvidia Driver tries to accommodate such largish stack usage,
and dynamically increases the per-thread stack as necessary (thereby
potentially reducing parallelism) -- if it manages to understand the call
graph. In case of libgfortran I/O, it evidently doesn't. Not being able
to disprove the existence of recursion is the common problem, as I've read.
At run time, via 'CU_JIT_INFO_LOG_BUFFER' you then get, for example:
warning : Stack size for entry function 'MAIN__$_omp_fn$0' cannot be
statically determined
That's still not an actual problem as long as the GPU kernel's stack usage
fits into 1 KiB. Very often it does, but if, as happens in libgfortran
I/O handling, another such 'dt_parm' is put onto the stack, the stack
overflows, resulting in a device-side SIGSEGV.
(There is, by the way, some similar analysis by Tom de Vries in
<https://gcc.gnu.org/PR85519> "[nvptx, openacc, openmp, testsuite]
Recursive tests may fail due to thread stack limit".)
Of course, you shouldn't really be doing I/O in GPU kernels, but people
do like their occasional "'printf' debugging", so we ought to make that
work (... without pessimizing any "normal" code).
I assume that generally reducing the size of 'dt_parm' etc. is out of
scope.
There are so many wiggles, turns, corner cases, and similar nightmares in
I/O that I would advise against trying to reduce the dt_parm. It could
probably be done.
For debugging on the GPU, would it not be better to have a way to signal
back to a main thread to do a print from there, like some sort of callback
in the user's code under test?
Putting this another way: recommend that users debug with a method other
than embedded print statements, rather than doing a ton of work to enable
something that is not really a legitimate use case.
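To illustrate the kind of thing I mean, here is a minimal sketch, simulated entirely on the host in plain C (the same pattern would apply in Fortran). It is not an existing GCC, libgomp, or libgfortran facility: the names 'dbg_log', 'dbg_record_value', 'dbg_flush' and the capacity 'DBG_CAPACITY' are all hypothetical. The offloaded code only stores small fixed-size records, and the host does the formatting and printing afterwards, so no 'dt_parm'-sized object is needed on the device stack.

```c
/* Hypothetical sketch: "record on the device, print on the host".
   Simulated on the host only; none of these names exist in GCC,
   libgomp, or libgfortran. */
#include <stdio.h>

#define DBG_CAPACITY 8  /* assumed upper bound on records per kernel launch */

/* One debug record: small and fixed-size. */
struct dbg_record {
    int thread_id;
    int value;
};

/* Buffer that would live in memory visible to both device and host. */
struct dbg_log {
    int count;    /* records successfully stored */
    int dropped;  /* records discarded because the buffer was full */
    struct dbg_record records[DBG_CAPACITY];
};

/* "Device side": store a record instead of calling any print routine.
   A real kernel would need an atomic update of 'count'; that is
   omitted here because this sketch is single-threaded. */
static void dbg_record_value(struct dbg_log *log, int thread_id, int value)
{
    if (log->count < DBG_CAPACITY) {
        log->records[log->count].thread_id = thread_id;
        log->records[log->count].value = value;
        log->count++;
    } else {
        log->dropped++;
    }
}

/* "Host side": format and print once the kernel has finished, then
   reset the buffer. Returns the number of records printed. */
static int dbg_flush(struct dbg_log *log, FILE *out)
{
    int printed = 0;
    for (int i = 0; i < log->count; i++) {
        fprintf(out, "thread %d: value = %d\n",
                log->records[i].thread_id, log->records[i].value);
        printed++;
    }
    if (log->dropped > 0)
        fprintf(out, "(%d records dropped)\n", log->dropped);
    log->count = 0;
    log->dropped = 0;
    return printed;
}
```

In real offloaded code the buffer would have to be mapped to the device and copied back before 'dbg_flush' is called. How that is done depends on the programming model, and it is left out of the sketch.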
FWIW,
Jerry