Re: [Patch] nvptx/mkoffload.cc: Add dummy proc for OpenMP rev-offload table [PR108098]

2023-04-04 Thread Tom de Vries via Gcc-patches

On 4/4/23 11:02, Thomas Schwinge wrote:
> Hi!
>
> Are we going to install such a work-around?

Hi,

LGTM.

Thanks,
- Tom

> Regards
>  Thomas
>
>
> On 2022-12-19T13:04:43+0100, I wrote:

> [...]

-
Siemens Electronic Design Automation GmbH; Address: Arnulfstraße 201, 80634
München; Limited liability company; Managing directors: Thomas Heurung,
Frank Thürauf; Registered office: München; Registration court: München,
HRB 106955




Re: [Patch] nvptx/mkoffload.cc: Add dummy proc for OpenMP rev-offload table [PR108098]

2023-04-04 Thread Thomas Schwinge
Hi!

Are we going to install such a work-around?


Regards
 Thomas


On 2022-12-19T13:04:43+0100, I wrote:
> [...]


Re: [Patch] nvptx/mkoffload.cc: Add dummy proc for OpenMP rev-offload table [PR108098]

2022-12-19 Thread Thomas Schwinge
Hi!

On 2022-12-16T17:19:00+0100, Tobias Burnus  wrote:
> Seems to be a CUDA JIT issue

An Nvidia Driver JIT issue, more precisely.  ;-)

> which is fixed by adding a dummy procedure.

Gah...  :-|

> Lightly tested with 4 systems at hand, where 2 failed before.

I'm happy to confirm that indeed this does resolve the issue for all
configurations that I reported in 
"OpenMP/nvptx reverse offload execution test FAILs".


As I said on IRC, #gcc, 2022-12-16:

> [...] we're unlikely to reverse-engineer the exact version/conditions
> where this got fixed, so don't have a useful means for versioning the
> workaround.  Fortunately, it doesn't "cost" anything really.  (In
> contrast to some other GCC/nvptx back end workarounds, as I
> understand.)
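
For illustration only, here is what such versioning might look like under the assumption stated in the patch's code comment, namely that CUDA 11.0 fixed the JIT behavior (which, per the remark above, nobody has confirmed). The helper `dummy_proc_needed` is hypothetical and not part of mkoffload.cc. Since sm_80 and PTX ISA 7.0 first appeared in CUDA 11.0, a module using either can only be JIT-compiled by CUDA 11.0 or newer; only when both are older would the dummy procedure be emitted. The sketch parses the numbers instead of comparing the first character of each string, so that a three-digit target such as `sm_100` is not mistaken for one older than `sm_80`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper, not part of mkoffload.cc.  ASSUMES (unverified,
   see PR 108098) that CUDA 11.0 is the first release whose PTX JIT keeps
   the function pointers without the dummy procedure.

   SM_VER is the text following ".target sm_" (e.g. "30", "80") and
   PTX_VER the text following ".version " (e.g. "3.1", "7.0"); parsing
   stops at the first character that is not part of the number.  */
static bool
dummy_proc_needed (const char *sm_ver, const char *ptx_ver)
{
  assert (sm_ver != NULL && ptx_ver != NULL);

  long sm = strtol (sm_ver, NULL, 10);          /* "86" -> 86 */
  long ptx_major = strtol (ptx_ver, NULL, 10);  /* "7.0" -> 7 */

  /* sm_80 and PTX ISA 7.0 are new in CUDA 11.0: a module using either
     can only be JIT-compiled by CUDA >= 11.0, so no workaround needed.  */
  if (sm >= 80 || ptx_major >= 7)
    return false;

  /* Otherwise the driver may be older: emit the dummy procedure.  */
  return true;
}
```

For example, `dummy_proc_needed ("30", "3.1")` (PTX ISA 3.1 as with `-mptx=3.1` on the ancient-CUDA system mentioned in this thread, with an `sm_30` target chosen here for illustration) returns true, while `dummy_proc_needed ("80", "7.0")` returns false. Since the workaround costs essentially nothing, always emitting it, as the patch does, avoids depending on this unverified assumption.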


Regards
 Thomas


> One had CUDA 10.2 and
> the other had some ancient CUDA where 'nvidia-smi' did not print a CUDA version
> and which requires -mptx=3.1.
> (I did check that offloading indeed happened and no host fallback was done.)
>
> OK for mainline?
>
> Tobias


> [...]


[Patch] nvptx/mkoffload.cc: Add dummy proc for OpenMP rev-offload table [PR108098]

2022-12-16 Thread Tobias Burnus

This seems to be a CUDA JIT issue, which is fixed by adding a dummy procedure.

Lightly tested on the 4 systems at hand, 2 of which failed before. One had
CUDA 10.2 and the other had some ancient CUDA where 'nvidia-smi' did not print
a CUDA version and which requires -mptx=3.1.
(I did check that offloading indeed happened and no host fallback was done.)

OK for mainline?

Tobias
-
nvptx/mkoffload.cc: Add dummy proc for OpenMP rev-offload table [PR108098]

Seemingly, the PTX JIT of CUDA <= 10.2 replaces function pointers in global
variables by NULL if a translation unit does not contain any executable code.
It works with CUDA 11.1.  This commit is about reverse offload: having NULL
values in the function table breaks the reverse-offload setup during image load.

The solution is the same as the one Thomas found for a related issue: adding a
dummy procedure. Cf. the PR of this issue and Thomas' patch
"nvptx: Support global constructors/destructors via 'collect2'"
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/607749.html

As that approach also works here:

Co-authored-by: Thomas Schwinge 

gcc/
	PR libgomp/108098

	* config/nvptx/mkoffload.cc (process): Emit dummy procedure
	alongside reverse-offload function table to prevent NULL values
	of the function addresses.

---
 gcc/config/nvptx/mkoffload.cc | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/gcc/config/nvptx/mkoffload.cc b/gcc/config/nvptx/mkoffload.cc
index 5d89ba8..8306aa0 100644
--- a/gcc/config/nvptx/mkoffload.cc
+++ b/gcc/config/nvptx/mkoffload.cc
@@ -357,6 +357,20 @@ process (FILE *in, FILE *out, uint32_t omp_requires)
 	fputc (sm_ver2[i], out);
   fprintf (out, "\"\n\t\".file 1 \\\"\\\"\"\n");
 
+  /* WORKAROUND - see PR 108098
+	 It seems as if older CUDA JIT compilers optimize the function pointers
+	 in offload_func_table to NULL, which can be prevented by adding a
+	 dummy procedure.  With CUDA 11.1, it seems to work fine without the
+	 workaround, while CUDA 10.2 and some ancient versions need it.
+	 Assuming CUDA 11.0 fixes it, emitting it could be restricted by the
+	 check (sm_ver2[0] < '8' && version2[0] < '7'), as sm_80 and PTX ISA
+	 7.0 are new in CUDA 11.0; for 11.1 it would be sm_86 and PTX ISA
+	 7.1.  */
+  fprintf (out, "\n\t\".func __dummy$func ( );\"\n");
+  fprintf (out, "\t\".func __dummy$func ( )\"\n");
+  fprintf (out, "\t\"{\"\n");
+  fprintf (out, "\t\"}\"\n");
+
   size_t fidx = 0;
   for (id = func_ids; id; id = id->next)
 	{
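
The four fprintf calls added by the patch write C source text, not PTX directly: each output line is a C string literal containing a piece of PTX. The sketch below copies the calls unchanged into a function and captures what they write via a temporary file; `emit_dummy_proc` and `capture_dummy_proc` are names made up for this illustration. Assuming the literals end up adjacent within a single string initializer (the hunk's context lines alone do not show this), the C compiler concatenates them; `dummy_ptx` spells out the resulting PTX fragment, an empty, parameterless `.func` that is first declared and then defined.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The four fprintf calls from the patch, unchanged.  OUT is the C source
   file that mkoffload generates; each line written is one C string
   literal holding a piece of PTX.  */
static void
emit_dummy_proc (FILE *out)
{
  fprintf (out, "\n\t\".func __dummy$func ( );\"\n");
  fprintf (out, "\t\".func __dummy$func ( )\"\n");
  fprintf (out, "\t\"{\"\n");
  fprintf (out, "\t\"}\"\n");
}

/* Hypothetical helper for inspection only: run emit_dummy_proc into a
   temporary file and copy what it wrote into BUF (NUL-terminated).
   Returns the number of bytes copied, or 0 if no temporary file could
   be created.  */
static size_t
capture_dummy_proc (char *buf, size_t size)
{
  assert (buf != NULL && size > 0);
  buf[0] = '\0';

  FILE *tmp = tmpfile ();
  if (tmp == NULL)
    return 0;

  emit_dummy_proc (tmp);
  rewind (tmp);
  size_t n = fread (buf, 1, size - 1, tmp);
  buf[n] = '\0';
  fclose (tmp);
  return n;
}

/* The same four literals as the C compiler sees them once they are
   adjacent: they concatenate into one PTX fragment that declares and
   then defines an empty function without parameters.  */
static const char dummy_ptx[] =
  ".func __dummy$func ( );"
  ".func __dummy$func ( )"
  "{"
  "}";
```

Per the commit message, giving the otherwise code-free module even this empty procedure is what keeps the affected JIT versions from replacing the entries of offload_func_table with NULL.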