In PR83920, I encountered an nvptx bug where live predicate variables were clobbered before their values were broadcast. Apparently, certain versions of the CUDA driver had a problem where the JIT would generate wrong code for shfl broadcasts, and nvptx_single works around it by initializing the predicate register to zero first. The attached patch teaches nvptx_single not to apply that workaround if the predicate register is live.
Tom, does this patch look sane to you? I'm not sure if it defeats the purpose of your original patch. Regardless, live predicate registers shouldn't be clobbered before they are used.

Unfortunately, I cannot reproduce the runtime failure with the gemm example in the PR, so I didn't include it in the patch. However, this patch does fix the failure with da-1.c in og7.

This patch does not cause any regressions. Is it OK for trunk?

Thanks,
Cesar

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 55c7e3c..698c574 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -3957,6 +3957,7 @@ bb_first_real_insn (basic_block bb)
 static void
 nvptx_single (unsigned mask, basic_block from, basic_block to)
 {
+  bitmap live = DF_LIVE_IN (from);
   rtx_insn *head = BB_HEAD (from);
   rtx_insn *tail = BB_END (to);
   unsigned skip_mask = mask;
@@ -4126,8 +4127,9 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
              There is nothing in the PTX spec to suggest that this is wrong, or
              to explain why the extra initialization is needed.  So, we classify
              it as a JIT bug, and the extra initialization as workaround.  */
-          emit_insn_before (gen_movbi (pvar, const0_rtx),
-                            bb_first_real_insn (from));
+          if (!bitmap_bit_p (live, REGNO (pvar)))
+            emit_insn_before (gen_movbi (pvar, const0_rtx),
+                              bb_first_real_insn (from));
 #endif
           emit_insn_before (nvptx_gen_vcast (pvar), tail);
         }