Hi, I am working on enabling offloading to nvptx from AAarch64 host. As mentioned on wiki (https://gcc.gnu.org/wiki/Offloading#Running_.27make_check.27), I ran make check-target-libgomp on AAarch64 host (and no GPU) with following results:
=== libgomp Summary === # of expected passes 14568 # of unexpected failures 1023 # of expected failures 309 # of untested testcases 54 # of unresolved testcases 992 # of unsupported tests 644 It seems majority of the tests fail due to the following 4 issues: * Compiling a minimal test-case: int main() { int x; #pragma omp target map (to: x) { x = 0; } return x; } Compiling with -fopenmp -foffload=nvptx-none results in following issues: (1) Differing values of NUM_POLY_INT_COEFFS between host and accelerator, which results in following ICE: 0x1a6e0a7 pp_quoted_string ../../gcc/gcc/pretty-print.cc:2277 0x1a6ffb3 pp_format(pretty_printer*, text_info*, urlifier const*) ../../gcc/gcc/pretty-print.cc:1634 0x1a4a3f3 diagnostic_context::report_diagnostic(diagnostic_info*) ../../gcc/gcc/diagnostic.cc:1612 0x1a4a727 diagnostic_impl ../../gcc/gcc/diagnostic.cc:1775 0x1a4e20b fatal_error(unsigned int, char const*, ...) ../../gcc/gcc/diagnostic.cc:2218 0xb3088f lto_input_mode_table(lto_file_decl_data*) ../../gcc/gcc/lto-streamer-in.cc:2121 0x6f5cdf lto_file_finalize ../../gcc/gcc/lto/lto-common.cc:2285 0x6f5cdf lto_create_files_from_ids ../../gcc/gcc/lto/lto-common.cc:2309 0x6f5cdf lto_file_read ../../gcc/gcc/lto/lto-common.cc:2364 0x6f5cdf read_cgraph_and_symbols(unsigned int, char const**) ../../gcc/gcc/lto/lto-common.cc:2812 0x6cfb93 lto_main() ../../gcc/gcc/lto/lto.cc:658 This is already tracked in https://gcc.gnu.org/PR96265 (and related PR's) Streaming out mode_table: mode = SI, mclass = 2, size = 4, prec = 32 mode = DI, mclass = 2, size = 8, prec = 64 Streaming in mode_table (in lto_input_mode_table): mclass = 2, size = 4, prec = 0 (and then calculates the correct mode value by iterating over all modes of mclass starting from narrowest mode) The issue is that the value for prec is not getting streamed-in correctly for SImode as seen above. While streaming out from AArch64 host, it is 32, but while streaming in for nvptx, it is 0. This happens because of differing values of NUM_POLY_INT_COEFFS between AArch64 and nvptx backend. Since NUM_POLY_INT_COEFFS is 2 for aarch64, the streamed-out values for mode, precision would be <4, 0> and <32, 0> respectively (streamed-out in bp_pack_poly_value). Both zeros come from coeffs[1] of size and prec. While streaming in however, NUM_POLY_INT_COEFFS is 1 for nvptx, and thus it incorrectly treats <4, 0> as size and precision respectively, which is why precision gets streamed in as 0, and thus it encounters the above ICE. Supporting non VLA code with offloading: In the general case, it's hard to support offloading for arbitrary poly_ints when NUM_POLY_INT_COEFFS differs for host and accelerator. For example, it's not possible to represent a degree-2 poly_int like 4 + 4x (as-is) on an accelerator with NUM_POLY_INT_COEFFS == 1. However, IIUC, we can support offloading for restricted set of poly_ints whose degree <= accel's NUM_POLY_INT_COEFFS, since they can be represented on accelerator ? For a hypothetical example, if host NUM_POLY_INT_COEFFS == 3 and accel NUM_POLY_INT_COEFFS == 2, then I suppose we could represent a degree 2 poly_int on accelerator, but not a degree 3 poly_int like 3+4x+5x^2 ? Based on that, I have come up with following approach in attached "quick-and-dirty" patch (p-163-2.diff): Stream-out host NUM_POLY_INT_COEFFS, and while streaming-in during lto1, compare it with accelerator's NUM_POLY_INT_COEFFS as follows: Stream in host_num_poly_int_coeffs; if (host_num_poly_int_coeffs == NUM_POLY_INT_COEFFS) // NUM_POLY_INT_COEFFS represents accelerator's value here. { /* Both are equal, proceed to unpacking NUM_POLY_INT_COEFFS words from bitstream. */ } else if (host_num_poly_int_coeffs < NUM_POLY_INT_COEFFS) { /* Unpack host_num_poly_int_coeffs words and zero out remaining higher coeffs (similar to zero-extension). */ } else { /* Unpack host_num_poly_int_coeffs words and ensure that degree of streamed-out poly_int <= NUM_POLY_INT_COEFFS. */ } For example, with host NUM_POLY_INT_COEFFS == 2 and accel NUM_POLY_INT_COEFFS == 1, this will allow streaming of "degree-1" poly_ints like 4+0x (which will degenerate to constant 4), but give an error for streaming degree-2 poly_int like 4+4x. Following this approach, I am assuming we can support AArch64/nvptx offloading for non VLA code, since poly_ints used for representing various artefacts like mode_size, mode_precision, vector length etc. will be degree-1 poly_int for scalar variables and fixed-length vectors (and thus degenerate to constants). With the patch applied, it proceeds forward from this point, but fails at a later stage (see below). Does this approach look reasonable for supporting offloading for non VLA code ? Are there any cases I may have overlooked, where offloading will still fail for non-VLA code due to differing NUM_POLY_INT_COEFFS issue ? (2) nvptx mkoffload.cc passes -m64/-m32 to host compiler if -foffload-abi=lp64/ilp32 After applying workaround for the above assertion failure, it hits the following error: gcc: error: unrecognized command-line option '-m64' nvptx mkoffload: fatal error: ../install/bin/gcc returned 1 exit status compilation terminated. This happens because nvptx/mkoffload.cc:compile_native passes -m64/-m32 to host compiler depending on whether offload_abi is OFFLOAD_ABI_LP64 or OFFLOAD_ABI_ILP32, and aarch64 backend doesn't recognize these options. I suppose a simple solution to check if host_compiler supports a particular command-line option, would be to create a dummy C file and check if the command "host_compiler <opt> dummy_file.c" returns zero exit status. Alternative could be to check exit status for "host_compiler <opt> --version", once http://gcc.gnu.org/PR116050 is fixed, but I am not sure if either is an ideal solution. With workarounds for these 2 issues, the minimal test builds and runs successfully. (3) Assertion error in lto_read_decls during lto1: There are several failures (~350+) in the testsuite caused due to the following assert in lto_read_decls: gcc_assert (data_in->reader_cache->nodes.length () == from + 1); AFAIU, this seems to happen because of presence of LTO_null tag. The following workaround avoids hitting the assert, but am not sure if it's the right fix: t = lto_input_tree_1 (&ib_main, data_in, tag, 0); + if (t == NULL_TREE) + continue; gcc_assert (data_in->reader_cache->nodes.length () == from + 1); (FWIW, this was reproducible with the above minimal test, but has seemingly gone away for it after updating the sources recently, but still reproduces with libgomp tests like for-9.c, baseptrs-4.C etc.) (4) AAarch64 uses OImode for 256-bit size array, which is not supported on nvptx: This causes ~18 tests to fail. Can be reproduced with following simple test: int main() { long c[4]; #pragma omp target map(c) c[0] = 0; return 0; } Compiling with -O2 -fopenmp -foffload=nvptx-none results in: lto1: fatal error: nvptx-none - 256-bit integer numbers unsupported (mode 'OI') compilation terminated. nvptx mkoffload: fatal error: ../install/bin/aarch64-unknown-linux-gnu-accel-nvptx-none-gcc returned 1 exit status compilation terminated. This happens with AArch64 host because, it uses OImode (256-bit integer mode) for ARRAY_TYPE (long c[4] fits 256-bits), which isn't supported on nvptx. This decision is made during layout_type for 'c', which calls mode_for_array, and mode_for_array uses target hooks array_mode and array_mode_supported_p to determine target-specific modes to use for ARRAY_TYPE. For x86_64, AFAIK, it uses BLKmode for ARRAY_TYPE. I have attached a "quick-and-dirty" patch (p-166-2.diff) which falls back to using BLKmode for ARRAY_TYPE if offloading is enabled, and avoids streaming-out target-specific int modes in lto_write_mode_table. I used default_scalar_mode_supported_p check to test if the int_mode is "generic", but not sure if that's entirely correct. The test compiles and runs OK with patch applied. I suppose a more general solution would be to somehow "intersect" available AArch64 modes with nvptx modes, and use those for offloading ? With local workarounds for the above 4 issues, running make check-target-libgomp shows following results: === libgomp Summary === # of expected passes 16604 # of unexpected failures 10 # of expected failures 309 # of untested testcases 54 # of unresolved testcases 3 # of unsupported tests 643 The remaining issues are: (5) "error: alias definitions not supported in this configuration" Fails for pr96390.c, and pr96390.C. This seems to be related to https://gcc.gnu.org/PR97102 (6) Execution Failures: - libgomp/pr104783.c - libgomp/pr104783-2.c I haven't investigated these yet. (7) Several warnings fail for libgomp.oacc-c-c++-common/acc_prof-kernels-1.c and following excess errors: acc_prof-kernels-1.c:185:9: optimized: assigned OpenACC seq loop parallelism acc_prof-kernels-1.c:214:9: optimized: assigned OpenACC seq loop parallelism acc_prof-kernels-1.c:245:9: optimized: assigned OpenACC seq loop parallelism So far, I have only been testing make check-target-libgomp. Should I be testing any additional parts of the testsuite for offloading changes ? My initial goals are: (a) To get AArch64/nvptx offloading to work for above minimal test. (b) Testsuite results for libgomp on par with x86_64 for non VLA code (as far as correctness is concerned). (c) After (a) and (b) are in place, try to enable support for offloading with VLA/SVE. I am planning to address these issues and will post patches for the same shortly. I will be grateful for any feedback or suggestions on how to proceed forward. Thanks, Prathamesh
p-166-2.diff
Description: p-166-2.diff
p-163-2.diff
Description: p-163-2.diff