Hi,
This patch is part of an effort to improve the performance of OpenMP offloading.
One observation is that OpenMP target constructs are significantly slower than
their OpenACC counterparts, especially on nvptx.

This performance gap can be attributed to many factors: the use of indirect
function pointers in the libgomp runtime for parallel construct functions, the
use of soft stacks as opposed to native stacks, the use of whole warps instead
of actual CUDA threads as OpenMP "threads", general administrative overhead
induced by OpenMP requirements, etc.

This patch attempts to side-step all of this with a mode, enabled by an option,
that internally handles a subset of OpenMP target regions as OpenACC parallel
regions. The subset basically includes the target, teams, parallel, distribute,
and for/do constructs, plus atomics.
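
As an illustration of the kind of region this subset is aimed at, here is a
minimal sketch of a combined target construct that uses only 'map' clauses. The
function name saxpy_target is made up for this example, and whether this exact
loop is accepted by the OMPACC path is an assumption on my part, not something
verified against the patch. It would be built with something like
'-fopenmp -fopenmp-target=acc' on an offload-enabled compiler; without -fopenmp
the pragma is ignored and the loop simply runs on the host.

```c
/* Hypothetical example: y[i] = a * x[i] + y[i], offloaded with a combined
   target/teams/distribute/parallel/for construct.  Only 'map' clauses are
   used, since the patch description says most other clauses are not yet
   handled in OMPACC mode.  */
void
saxpy_target (int n, float a, const float *x, float *y)
{
  #pragma omp target teams distribute parallel for \
	  map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}
```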

Essentially, we adjust the internal kinds to their OpenACC equivalents and let
the OpenACC code paths handle them (with various needed adjustments, of course).
The option is dubbed '-fopenmp-target=', with values 'default' (the normal
default) and 'acc', which enables this OpenMP-as-OpenACC processing. When this
"OMPACC" mode encounters a case the patch doesn't handle, it issues a warning
and reverts to normal processing for that target region.
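
To make the fallback concrete, here is a sketch of a region that I would expect
to take the "warning plus normal processing" route, on the assumption that a
'reduction' clause counts as an unhandled case (reductions are described as
disabled for now elsewhere in this message). The function name dot_target is
hypothetical. The computed result should be the same whichever path the
compiler picks; only the generated offload code would differ.

```c
/* Hypothetical example: dot product using a 'reduction' clause.  Under
   -fopenmp-target=acc this region is assumed (not verified) to be rejected
   by the OMPACC path and compiled as a normal OpenMP target region.  */
double
dot_target (int n, const double *x, const double *y)
{
  double sum = 0.0;

  #pragma omp target teams distribute parallel for reduction(+: sum) \
	  map(to: x[0:n], y[0:n]) map(tofrom: sum)
  for (int i = 0; i < n; i++)
    sum += x[i] * y[i];

  return sum;
}
```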

There are still some serious limitations. Most clauses are not yet allowed in
the handled subset, except 'map' (reductions in particular are not yet tested,
so they are disabled for now). The mode is currently hard-coded to use only 32
threads (1 warp) in each team; long-vector mode is still untested, and other
details have not been worked on yet. Do note that since we're using actual CUDA
threads as threads (like OpenACC vector), the efficiency is already much better
than normal "warp-thread" based OpenMP offloading.

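One way to observe the team size from user code is to query it inside a target
region, as in the sketch below. The function name target_team_size is made up
for illustration. Based on the description above I would expect 32 under
-fopenmp-target=acc on nvptx, but that value, and whether this particular
region is in the handled subset at all, are assumptions rather than tested
results. When built without -fopenmp the function just returns 1.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical example: report the number of threads in team 0 of a target
   region.  Only one thread of one team writes the result, so there is no
   race on 'nthreads'.  */
int
target_team_size (void)
{
  int nthreads = 1;

  #pragma omp target teams map(tofrom: nthreads)
  #pragma omp parallel
  {
#ifdef _OPENMP
    if (omp_get_team_num () == 0 && omp_get_thread_num () == 0)
      nthreads = omp_get_num_threads ();
#endif
  }

  return nthreads;
}
```
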
Our main measured benchmark here is 521.miniswp_t in SPEChpc, where we can
achieve an almost 10x performance improvement compared to normal OpenMP
offloading on Volta (although note that the measured gap between normal OpenMP
and OpenACC was around 20x, so this mode is still slower than pure OpenACC, by
roughly a factor of 2).

We think this patch should be seen as an optimization of the status quo. To
truly solve the OpenMP offloading performance issues we see (across both nvptx
and gcn), a more fundamental re-architecting will probably be needed.

This has been tested on powerpc64le-linux with nvptx offloading (Volta), and is
submitted here for trunk. I am planning to commit this to devel/omp/gcc-12 (and
probably 13 soon) after more testing concludes.

Thanks,
Chung-Lin

2023-05-19  Chung-Lin Tang  <clt...@codesourcery.com>

gcc/ChangeLog:

        * builtins.cc (expand_builtin_omp_builtins): New function.
        (expand_builtin): Add expand cases for BUILT_IN_GOMP_BARRIER,
        BUILT_IN_OMP_GET_THREAD_NUM, BUILT_IN_OMP_GET_NUM_THREADS,
        BUILT_IN_OMP_GET_TEAM_NUM, and BUILT_IN_OMP_GET_NUM_TEAMS using
        expand_builtin_omp_builtins, enabled under -fopenmp-target=acc.
        * cgraphunit.cc (analyze_functions): Add call to
        omp_ompacc_attribute_tagging, enabled under -fopenmp-target=acc.
        * common.opt (fopenmp-target=): Add new option and enums.
        * config/nvptx/mkoffload.cc (main): Handle -fopenmp-target=.
        * config/nvptx/nvptx-protos.h (nvptx_expand_omp_get_num_threads): New
        prototype.
        (nvptx_mem_shared_p): Likewise.
        * config/nvptx/nvptx.cc (omp_num_threads_sym): New global static RTX
        symbol for number of threads in team.
        (omp_num_threads_align): New var for alignment of omp_num_threads_sym.
        (need_omp_num_threads): New bool for if any function references
        omp_num_threads_sym.
        (nvptx_option_override): Initialize omp_num_threads_sym/align.
        (write_as_kernel): Disable normal OpenMP kernel entry under OMPACC mode.
        (nvptx_declare_function_name): Disable shim function under OMPACC mode.
        Disable soft-stack under OMPACC mode. Add generation of neutering init
        code under OMPACC mode.
        (nvptx_output_set_softstack): Return "" under OMPACC mode.
        (nvptx_expand_call): Set parallelism to vector for function calls with
        "ompacc for" attached.
        (nvptx_expand_oacc_fork): Set mode to GOMP_DIM_VECTOR under OMPACC mode.
        (nvptx_expand_oacc_join): Likewise.
        (nvptx_expand_omp_get_num_threads): New function.
        (nvptx_mem_shared_p): New function.
        (nvptx_mach_max_workers): Return 1 under OMPACC mode.
        (nvptx_mach_vector_length): Return 32 under OMPACC mode.
        (nvptx_single): Add adjustments for OMPACC mode, which have
        parallel-construct fork/joins, and regions of code where neutering is
        dynamically determined.
        (nvptx_reorg): Enable neutering under OMPACC mode when "ompacc for"
        attribute is attached to function. Disable uniform-simt when under
        OMPACC mode.
        (nvptx_file_end): Write __nvptx_omp_num_threads out when needed.
        (nvptx_goacc_fork_join): Return true under OMPACC mode.
        * config/nvptx/nvptx.h (struct GTY(()) machine_function): Add
        omp_parallel_predicate and omp_fn_entry_num_threads_reg fields.
        * config/nvptx/nvptx.md (unspecv): Add UNSPECV_GET_TID,
        UNSPECV_GET_NTID, UNSPECV_GET_CTAID, UNSPECV_GET_NCTAID,
        UNSPECV_OMP_PARALLEL_FORK, UNSPECV_OMP_PARALLEL_JOIN entries.
        (nvptx_shared_mem_operand): New predicate.
        (gomp_barrier): New expand pattern.
        (omp_get_num_threads): New expand pattern.
        (omp_get_num_teams): New insn pattern.
        (omp_get_thread_num): Likewise.
        (omp_get_team_num): Likewise.
        (get_ntid): Likewise.
        (nvptx_omp_parallel_fork): Likewise.
        (nvptx_omp_parallel_join): Likewise.

        * flag-types.h (omp_target_mode_kind): New flag value enum.
        * gimplify.cc (struct gimplify_omp_ctx): Add 'bool ompacc' field.
        (gimplify_scan_omp_clauses): Handle OMP_CLAUSE__OMPACC_.
        (gimplify_adjust_omp_clauses): Likewise.
        (gimplify_omp_ctx_ompacc_p): New function.
        (gimplify_omp_for): Handle combined loops under OMPACC.

        * lto-wrapper.cc (append_compiler_options): Add OPT_fopenmp_target_.
        * omp-builtins.def (BUILT_IN_OMP_GET_THREAD_NUM): Remove CONST.
        (BUILT_IN_OMP_GET_NUM_THREADS): Likewise.
        * omp-expand.cc (remove_exit_barrier): Disable addressable-var
        processing for parallel construct child functions under OMPACC mode.
        (expand_oacc_for): Add OMPACC mode handling.
        (get_target_arguments): Force thread_limit clause value to 1 under
        OMPACC mode.
        (expand_omp): Under OMPACC mode, avoid child function expanding of
        GIMPLE_OMP_PARALLEL.
        * omp-general.cc (omp_extract_for_data): Adjustments for OMPACC mode.
        * omp-low.cc (struct omp_context): Add 'bool ompacc_p' field.
        (scan_sharing_clauses): Handle OMP_CLAUSE__OMPACC_.
        (ompacc_ctx_p): New function.
        (scan_omp_parallel): Handle OMPACC mode, avoid creating child function.
        (scan_omp_target): Tag "ompacc"/"ompacc for" attributes for target
        construct child function, remove OMP_CLAUSE__OMPACC_ clauses.
        (lower_oacc_head_mark): Handle OMPACC mode cases.
        (lower_omp_for): Adjust OMP_FOR kind from OpenMP to OpenACC kinds, add
        vector/gang clauses as needed. Add other OMPACC handling.
        (lower_omp_taskreg): Add call to lower_oacc_head_tail for OMPACC case.
        (lower_omp_target): Do OpenACC gang privatization under OMPACC case.
        (lower_omp_teams): Forward OpenACC privatization variables to outer
        target region under OMPACC mode.
        (lower_omp_1): Do OpenACC gang privatization under OMPACC case for
        GIMPLE_BIND.
        * omp-offload.cc (ompacc_supported_clauses_p): New function.
        (struct target_region_data): New struct type for tree walk.
        (scan_fndecl_for_ompacc): New function.
        (scan_omp_target_region_r): New function.
        (scan_omp_target_construct_r): New function.
        (omp_ompacc_attribute_tagging): New function.
        (oacc_dim_call): Add OMPACC case handling.
        (execute_oacc_device_lower): Make parts explicitly only OpenACC enabled.
        (pass_oacc_device_lower::gate): Enable pass under OMPACC mode.
        * omp-offload.h (omp_ompacc_attribute_tagging): New prototype.
        * opts.cc (finish_options): Only allow -fopenmp-target= when -fopenmp
        and no -fopenacc.
        * target-insns.def (gomp_barrier): New defined insn pattern.
        (omp_get_thread_num): Likewise.
        (omp_get_num_threads): Likewise.
        (omp_get_team_num): Likewise.
        (omp_get_num_teams): Likewise.
        * tree-core.h (enum omp_clause_code): Add new OMP_CLAUSE__OMPACC_ entry
        for internal clause.
        * tree-nested.cc (convert_nonlocal_omp_clauses): Handle
        OMP_CLAUSE__OMPACC_.
        * tree-pretty-print.cc (dump_omp_clause): Handle OMP_CLAUSE__OMPACC_.
        * tree.cc (omp_clause_num_ops): Add OMP_CLAUSE__OMPACC_ entry.
        (omp_clause_code_name): Likewise.
        * tree.h (OMP_CLAUSE__OMPACC__FOR): New macro for OMP_CLAUSE__OMPACC_.

libgomp/ChangeLog:

        * config/nvptx/team.c (__nvptx_omp_num_threads): New global variable in
        shared memory.

Attachment: ompacc-20230519-2115.patch
