Hi, I am working on enabling offloading to nvptx from AAarch64 host. As mentioned on wiki (https://gcc.gnu.org/wiki/Offloading#Running_.27make_check.27), I ran make check-target-libgomp on AAarch64 host (and no GPU) with following results:
=== libgomp Summary ===
# of expected passes 14568
# of unexpected failures 1023
# of expected failures 309
# of untested testcases 54
# of unresolved testcases 992
# of unsupported tests 644
It seems majority of the tests fail due to the following 4 issues:
* Compiling a minimal test-case:
int main()
{
int x;
#pragma omp target map (to: x)
{
x = 0;
}
return x;
}
Compiling with -fopenmp -foffload=nvptx-none results in following issues:
(1) Differing values of NUM_POLY_INT_COEFFS between host and accelerator, which
results in following ICE:
0x1a6e0a7 pp_quoted_string
../../gcc/gcc/pretty-print.cc:2277
0x1a6ffb3 pp_format(pretty_printer*, text_info*, urlifier const*)
../../gcc/gcc/pretty-print.cc:1634
0x1a4a3f3 diagnostic_context::report_diagnostic(diagnostic_info*)
../../gcc/gcc/diagnostic.cc:1612
0x1a4a727 diagnostic_impl
../../gcc/gcc/diagnostic.cc:1775
0x1a4e20b fatal_error(unsigned int, char const*, ...)
../../gcc/gcc/diagnostic.cc:2218
0xb3088f lto_input_mode_table(lto_file_decl_data*)
../../gcc/gcc/lto-streamer-in.cc:2121
0x6f5cdf lto_file_finalize
../../gcc/gcc/lto/lto-common.cc:2285
0x6f5cdf lto_create_files_from_ids
../../gcc/gcc/lto/lto-common.cc:2309
0x6f5cdf lto_file_read
../../gcc/gcc/lto/lto-common.cc:2364
0x6f5cdf read_cgraph_and_symbols(unsigned int, char const**)
../../gcc/gcc/lto/lto-common.cc:2812
0x6cfb93 lto_main()
../../gcc/gcc/lto/lto.cc:658
This is already tracked in https://gcc.gnu.org/PR96265 (and related PR's)
Streaming out mode_table:
mode = SI, mclass = 2, size = 4, prec = 32
mode = DI, mclass = 2, size = 8, prec = 64
Streaming in mode_table (in lto_input_mode_table):
mclass = 2, size = 4, prec = 0
(and then calculates the correct mode value by iterating over all modes of
mclass starting from narrowest mode)
The issue is that the value for prec is not getting streamed-in correctly for
SImode as seen above. While streaming out from AArch64 host,
it is 32, but while streaming in for nvptx, it is 0. This happens because of
differing values of NUM_POLY_INT_COEFFS between AArch64 and nvptx backend.
Since NUM_POLY_INT_COEFFS is 2 for aarch64, the streamed-out values for mode,
precision would be <4, 0> and <32, 0>
respectively (streamed-out in bp_pack_poly_value). Both zeros come from
coeffs[1] of size and prec. While streaming in however,
NUM_POLY_INT_COEFFS is 1 for nvptx, and thus it incorrectly treats <4, 0> as
size and precision respectively, which is why precision
gets streamed in as 0, and thus it encounters the above ICE.
Supporting non VLA code with offloading:
In the general case, it's hard to support offloading for arbitrary poly_ints
when NUM_POLY_INT_COEFFS differs for host and accelerator.
For example, it's not possible to represent a degree-2 poly_int like 4 + 4x
(as-is) on an accelerator with NUM_POLY_INT_COEFFS == 1.
However, IIUC, we can support offloading for restricted set of poly_ints whose
degree <= accel's NUM_POLY_INT_COEFFS, since they can be
represented on accelerator ? For a hypothetical example, if host
NUM_POLY_INT_COEFFS == 3 and accel NUM_POLY_INT_COEFFS == 2, then I suppose
we could represent a degree 2 poly_int on accelerator, but not a degree 3
poly_int like 3+4x+5x^2 ?
Based on that, I have come up with following approach in attached
"quick-and-dirty" patch (p-163-2.diff):
Stream-out host NUM_POLY_INT_COEFFS, and while streaming-in during lto1,
compare it with accelerator's NUM_POLY_INT_COEFFS as follows:
Stream in host_num_poly_int_coeffs;
if (host_num_poly_int_coeffs == NUM_POLY_INT_COEFFS) // NUM_POLY_INT_COEFFS
represents accelerator's value here.
{
/* Both are equal, proceed to unpacking NUM_POLY_INT_COEFFS words from
bitstream. */
}
else if (host_num_poly_int_coeffs < NUM_POLY_INT_COEFFS)
{
/* Unpack host_num_poly_int_coeffs words and zero out remaining higher
coeffs (similar to zero-extension). */
}
else
{
/* Unpack host_num_poly_int_coeffs words and ensure that degree of
streamed-out poly_int <= NUM_POLY_INT_COEFFS. */
}
For example, with host NUM_POLY_INT_COEFFS == 2 and accel NUM_POLY_INT_COEFFS
== 1, this will allow streaming of "degree-1" poly_ints
like 4+0x (which will degenerate to constant 4), but give an error for
streaming degree-2 poly_int like 4+4x.
Following this approach, I am assuming we can support AArch64/nvptx offloading
for non VLA code, since poly_ints used for representing various
artefacts like mode_size, mode_precision, vector length etc. will be degree-1
poly_int for scalar variables and fixed-length vectors
(and thus degenerate to constants). With the patch applied, it proceeds forward
from this point, but fails at a later stage (see below).
Does this approach look reasonable for supporting offloading for non VLA code ?
Are there any cases I may have overlooked, where offloading will still fail for
non-VLA code due to differing NUM_POLY_INT_COEFFS issue ?
(2) nvptx mkoffload.cc passes -m64/-m32 to host compiler if
-foffload-abi=lp64/ilp32
After applying workaround for the above assertion failure, it hits the
following error:
gcc: error: unrecognized command-line option '-m64'
nvptx mkoffload: fatal error: ../install/bin/gcc returned 1 exit status
compilation terminated.
This happens because nvptx/mkoffload.cc:compile_native passes -m64/-m32 to host
compiler depending on whether
offload_abi is OFFLOAD_ABI_LP64 or OFFLOAD_ABI_ILP32, and aarch64 backend
doesn't recognize these options.
I suppose a simple solution to check if host_compiler supports a particular
command-line option, would be to create a dummy C file and
check if the command "host_compiler <opt> dummy_file.c" returns zero exit
status. Alternative could be to check exit status for
"host_compiler <opt> --version", once http://gcc.gnu.org/PR116050 is fixed, but
I am not sure if either is an ideal solution.
With workarounds for these 2 issues, the minimal test builds and runs
successfully.
(3) Assertion error in lto_read_decls during lto1:
There are several failures (~350+) in the testsuite caused due to the following
assert in lto_read_decls:
gcc_assert (data_in->reader_cache->nodes.length () == from + 1);
AFAIU, this seems to happen because of presence of LTO_null tag. The following
workaround avoids hitting the assert,
but am not sure if it's the right fix:
t = lto_input_tree_1 (&ib_main, data_in, tag, 0);
+ if (t == NULL_TREE)
+ continue;
gcc_assert (data_in->reader_cache->nodes.length () == from + 1);
(FWIW, this was reproducible with the above minimal test, but has seemingly
gone away for it after updating the sources recently,
but still reproduces with libgomp tests like for-9.c, baseptrs-4.C etc.)
(4) AAarch64 uses OImode for 256-bit size array, which is not supported on
nvptx:
This causes ~18 tests to fail.
Can be reproduced with following simple test:
int main()
{
long c[4];
#pragma omp target map(c)
c[0] = 0;
return 0;
}
Compiling with -O2 -fopenmp -foffload=nvptx-none results in:
lto1: fatal error: nvptx-none - 256-bit integer numbers unsupported (mode 'OI')
compilation terminated.
nvptx mkoffload: fatal error:
../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc returned 1 exit
status
compilation terminated.
This happens with AArch64 host because, it uses OImode (256-bit integer mode)
for ARRAY_TYPE (long c[4] fits 256-bits), which isn't
supported on nvptx. This decision is made during layout_type for 'c', which
calls mode_for_array, and mode_for_array uses target hooks
array_mode and array_mode_supported_p to determine target-specific modes to use
for ARRAY_TYPE.
For x86_64, AFAIK, it uses BLKmode for ARRAY_TYPE.
I have attached a "quick-and-dirty" patch (p-166-2.diff) which falls back to
using BLKmode for ARRAY_TYPE if offloading is enabled,
and avoids streaming-out target-specific int modes in lto_write_mode_table. I
used default_scalar_mode_supported_p check to test if
the int_mode is "generic", but not sure if that's entirely correct. The test
compiles and runs OK with patch applied. I suppose a more general
solution would be to somehow "intersect" available AArch64 modes with nvptx
modes, and use those for offloading ?
With local workarounds for the above 4 issues, running make
check-target-libgomp shows following results:
=== libgomp Summary ===
# of expected passes 16604
# of unexpected failures 10
# of expected failures 309
# of untested testcases 54
# of unresolved testcases 3
# of unsupported tests 643
The remaining issues are:
(5) "error: alias definitions not supported in this configuration"
Fails for pr96390.c, and pr96390.C. This seems to be related to
https://gcc.gnu.org/PR97102
(6) Execution Failures:
- libgomp/pr104783.c
- libgomp/pr104783-2.c
I haven't investigated these yet.
(7) Several warnings fail for libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
and following excess errors:
acc_prof-kernels-1.c:185:9: optimized: assigned OpenACC seq loop parallelism
acc_prof-kernels-1.c:214:9: optimized: assigned OpenACC seq loop parallelism
acc_prof-kernels-1.c:245:9: optimized: assigned OpenACC seq loop parallelism
So far, I have only been testing make check-target-libgomp. Should I be testing
any additional parts of the testsuite for offloading changes ?
My initial goals are:
(a) To get AArch64/nvptx offloading to work for above minimal test.
(b) Testsuite results for libgomp on par with x86_64 for non VLA code (as far
as correctness is concerned).
(c) After (a) and (b) are in place, try to enable support for offloading with
VLA/SVE.
I am planning to address these issues and will post patches for the same
shortly. I will be grateful for any feedback or suggestions on how to proceed
forward.
Thanks,
Prathamesh
p-166-2.diff
Description: p-166-2.diff
p-163-2.diff
Description: p-163-2.diff
