[RFC] Summary of libgomp failures for offloading to nvptx from AArch64

Prathamesh Kulkarni via Gcc Thu, 25 Jul 2024 06:35:55 -0700

Hi,
I am working on enabling offloading to nvptx from AAarch64 host. As mentioned 
on wiki (https://gcc.gnu.org/wiki/Offloading#Running_.27make_check.27),
I ran make check-target-libgomp on AAarch64 host (and no GPU) with following 
results:


                === libgomp Summary ===

# of expected passes            14568
# of unexpected failures        1023
# of expected failures          309
# of untested testcases         54
# of unresolved testcases       992
# of unsupported tests          644

It seems majority of the tests fail due to the following 4 issues:

* Compiling a minimal test-case:

int main()
{
  int x;
  #pragma omp target map (to: x)
  {
    x = 0;
  }
  return x;
}

Compiling with -fopenmp -foffload=nvptx-none results in following issues:

(1) Differing values of NUM_POLY_INT_COEFFS between host and accelerator, which 
results in following ICE:

0x1a6e0a7 pp_quoted_string
        ../../gcc/gcc/pretty-print.cc:2277
 0x1a6ffb3 pp_format(pretty_printer*, text_info*, urlifier const*)
        ../../gcc/gcc/pretty-print.cc:1634
 0x1a4a3f3 diagnostic_context::report_diagnostic(diagnostic_info*)
        ../../gcc/gcc/diagnostic.cc:1612
 0x1a4a727 diagnostic_impl
        ../../gcc/gcc/diagnostic.cc:1775
 0x1a4e20b fatal_error(unsigned int, char const*, ...)
        ../../gcc/gcc/diagnostic.cc:2218
 0xb3088f lto_input_mode_table(lto_file_decl_data*)
         ../../gcc/gcc/lto-streamer-in.cc:2121
 0x6f5cdf lto_file_finalize
        ../../gcc/gcc/lto/lto-common.cc:2285
 0x6f5cdf lto_create_files_from_ids
        ../../gcc/gcc/lto/lto-common.cc:2309
 0x6f5cdf lto_file_read
        ../../gcc/gcc/lto/lto-common.cc:2364
 0x6f5cdf read_cgraph_and_symbols(unsigned int, char const**)
        ../../gcc/gcc/lto/lto-common.cc:2812
 0x6cfb93 lto_main()
        ../../gcc/gcc/lto/lto.cc:658

This is already tracked in https://gcc.gnu.org/PR96265 (and related PR's)

Streaming out mode_table:
mode = SI, mclass = 2, size = 4, prec = 32
mode = DI, mclass = 2, size = 8, prec = 64

Streaming in mode_table (in lto_input_mode_table):
mclass = 2, size = 4, prec = 0
(and then calculates the correct mode value by iterating over all modes of 
mclass starting from narrowest mode)

The issue is that the value for prec is not getting streamed-in correctly for 
SImode as seen above. While streaming out from AArch64 host,
it is 32, but while streaming in for nvptx, it is 0. This happens because of 
differing values of NUM_POLY_INT_COEFFS between AArch64 and nvptx backend.

Since NUM_POLY_INT_COEFFS is 2 for aarch64, the streamed-out values for mode, 
precision would be <4, 0> and <32, 0>
respectively (streamed-out in bp_pack_poly_value). Both zeros come from 
coeffs[1] of size and prec. While streaming in however,
NUM_POLY_INT_COEFFS is 1 for nvptx, and thus it incorrectly treats <4, 0> as 
size and precision respectively, which is why precision
gets streamed in as 0, and thus it encounters the above ICE.

Supporting non VLA code with offloading:

In the general case, it's hard to support offloading for arbitrary poly_ints 
when NUM_POLY_INT_COEFFS differs for host and accelerator.
For example, it's not possible to represent a degree-2 poly_int like 4 + 4x 
(as-is) on an accelerator with NUM_POLY_INT_COEFFS == 1.

However, IIUC, we can support offloading for restricted set of poly_ints whose 
degree <= accel's NUM_POLY_INT_COEFFS, since they can be
represented on accelerator ? For a hypothetical example, if host 
NUM_POLY_INT_COEFFS == 3 and accel NUM_POLY_INT_COEFFS == 2, then I suppose
we could represent a degree 2 poly_int on accelerator, but not a degree 3 
poly_int like 3+4x+5x^2 ?

Based on that, I have come up with following approach in attached 
"quick-and-dirty" patch (p-163-2.diff):
Stream-out host NUM_POLY_INT_COEFFS, and while streaming-in during lto1, 
compare it with accelerator's NUM_POLY_INT_COEFFS as follows:

Stream in host_num_poly_int_coeffs;
if (host_num_poly_int_coeffs == NUM_POLY_INT_COEFFS) // NUM_POLY_INT_COEFFS 
represents accelerator's value here.
{
    /* Both are equal, proceed to unpacking NUM_POLY_INT_COEFFS words from 
bitstream.  */
}
else if (host_num_poly_int_coeffs < NUM_POLY_INT_COEFFS)
{
    /* Unpack host_num_poly_int_coeffs words and zero out remaining higher 
coeffs (similar to zero-extension).  */
}
else
{
    /* Unpack host_num_poly_int_coeffs words and ensure that degree of 
streamed-out poly_int <= NUM_POLY_INT_COEFFS.  */
}

For example, with host NUM_POLY_INT_COEFFS == 2 and accel NUM_POLY_INT_COEFFS 
== 1, this will allow streaming of "degree-1" poly_ints
like 4+0x (which will degenerate to constant 4), but give an error for 
streaming degree-2 poly_int like 4+4x.

Following this approach, I am assuming we can support AArch64/nvptx offloading 
for non VLA code, since poly_ints used for representing various
artefacts like mode_size, mode_precision, vector length etc. will be degree-1 
poly_int for scalar variables and fixed-length vectors
(and thus degenerate to constants). With the patch applied, it proceeds forward 
from this point, but fails at a later stage (see below).

Does this approach look reasonable for supporting offloading for non VLA code ?
Are there any cases I may have overlooked, where offloading will still fail for 
non-VLA code due to differing NUM_POLY_INT_COEFFS issue ?

(2) nvptx mkoffload.cc passes -m64/-m32 to host compiler if 
-foffload-abi=lp64/ilp32
After applying workaround for the above assertion failure, it hits the 
following error:
gcc: error: unrecognized command-line option '-m64'
nvptx mkoffload: fatal error: ../install/bin/gcc returned 1 exit status
compilation terminated.

This happens because nvptx/mkoffload.cc:compile_native passes -m64/-m32 to host 
compiler depending on whether
offload_abi is OFFLOAD_ABI_LP64 or OFFLOAD_ABI_ILP32, and aarch64 backend 
doesn't recognize these options.

I suppose a simple solution to check if host_compiler supports a particular 
command-line option, would be to create a dummy C file and
check if the command "host_compiler <opt> dummy_file.c" returns zero exit 
status. Alternative could be to check exit status for
"host_compiler <opt> --version", once http://gcc.gnu.org/PR116050 is fixed, but 
I am not sure if either is an ideal solution.

With workarounds for these 2 issues, the minimal test builds and runs 
successfully.

(3) Assertion error in lto_read_decls during lto1:
There are several failures (~350+) in the testsuite caused due to the following
assert in lto_read_decls:
gcc_assert (data_in->reader_cache->nodes.length () == from + 1);

AFAIU, this seems to happen because of presence of LTO_null tag. The following 
workaround avoids hitting the assert,
but am not sure if it's the right fix:

        t = lto_input_tree_1 (&ib_main, data_in, tag, 0);
+       if (t == NULL_TREE)
+         continue;
        gcc_assert (data_in->reader_cache->nodes.length () == from + 1);

(FWIW, this was reproducible with the above minimal test, but has seemingly 
gone away for it after updating the sources recently,
but still reproduces with libgomp tests like for-9.c, baseptrs-4.C etc.)

(4) AAarch64 uses OImode for 256-bit size array, which is not supported on 
nvptx:
This causes ~18 tests to fail.

Can be reproduced with following simple test:
int main()
{
  long c[4];
  #pragma omp target map(c)
    c[0] = 0;
  return 0;
}

Compiling with -O2 -fopenmp -foffload=nvptx-none results in:
lto1: fatal error: nvptx-none - 256-bit integer numbers unsupported (mode 'OI')
compilation terminated.
nvptx mkoffload: fatal error: 
../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc returned 1 exit 
status
compilation terminated.

This happens with AArch64 host because, it uses OImode (256-bit integer mode) 
for ARRAY_TYPE (long c[4] fits 256-bits), which isn't
supported on nvptx. This decision is made during layout_type for 'c', which 
calls mode_for_array, and mode_for_array uses target hooks
array_mode and array_mode_supported_p to determine target-specific modes to use 
for ARRAY_TYPE.
For x86_64, AFAIK, it uses BLKmode for ARRAY_TYPE.

I have attached a "quick-and-dirty" patch (p-166-2.diff) which falls back to 
using BLKmode for ARRAY_TYPE if offloading is enabled,
and avoids streaming-out target-specific int modes in lto_write_mode_table. I 
used default_scalar_mode_supported_p check to test if
the int_mode is "generic", but not sure if that's entirely correct. The test 
compiles and runs OK with patch applied. I suppose a more general
solution would be to somehow "intersect" available AArch64 modes with nvptx 
modes, and use those for offloading ?

With local workarounds for the above 4 issues, running make 
check-target-libgomp shows following results:

                === libgomp Summary ===

# of expected passes            16604
# of unexpected failures        10
# of expected failures          309
# of untested testcases         54
# of unresolved testcases       3
# of unsupported tests          643

The remaining issues are:

(5) "error: alias definitions not supported in this configuration"
Fails for pr96390.c, and pr96390.C. This seems to be related to 
https://gcc.gnu.org/PR97102 

(6) Execution Failures:
- libgomp/pr104783.c
- libgomp/pr104783-2.c
I haven't investigated these yet.

(7) Several warnings fail for libgomp.oacc-c-c++-common/acc_prof-kernels-1.c 
and following excess errors:
acc_prof-kernels-1.c:185:9: optimized: assigned OpenACC seq loop parallelism
acc_prof-kernels-1.c:214:9: optimized: assigned OpenACC seq loop parallelism
acc_prof-kernels-1.c:245:9: optimized: assigned OpenACC seq loop parallelism

So far, I have only been testing make check-target-libgomp. Should I be testing 
any additional parts of the testsuite for offloading changes ?

My initial goals are:
(a) To get AArch64/nvptx offloading to work for above minimal test.
(b) Testsuite results for libgomp on par with x86_64 for non VLA code (as far 
as correctness is concerned).
(c) After (a) and (b) are in place, try to enable support for offloading with 
VLA/SVE.

I am planning to address these issues and will post patches for the same 
shortly. I will be grateful for any feedback or suggestions on how to proceed 
forward.

Thanks,
Prathamesh

p-166-2.diff
Description: p-166-2.diff

p-163-2.diff
Description: p-163-2.diff

[RFC] Summary of libgomp failures for offloading to nvptx from AArch64

Reply via email to