https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108803
--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
--- gcc/optabs.cc.jj    2023-01-02 09:32:53.309838465 +0100
+++ gcc/optabs.cc       2023-02-16 18:04:54.794871019 +0100
@@ -596,6 +596,16 @@ expand_doubleword_shift_condmove (scalar
 {
   rtx outof_superword, into_superword;
 
+  if (shift_mask < BITS_PER_WORD - 1)
+    {
+      rtx tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1,
+                                                GET_MODE (superword_op1)),
+                                      GET_MODE (superword_op1));
+      superword_op1
+        = simplify_expand_binop (op1_mode, and_optab, superword_op1, tmp,
+                                 0, true, methods);
+    }
+
   /* Put the superword version of the output into OUTOF_SUPERWORD and
      INTO_SUPERWORD.  */
   outof_superword = outof_target != 0 ? gen_reg_rtx (word_mode) : 0;
@@ -617,6 +627,16 @@ expand_doubleword_shift_condmove (scalar
         return false;
     }
 
+  if (shift_mask < BITS_PER_WORD - 1)
+    {
+      rtx tmp = immed_wide_int_const (wi::shwi (BITS_PER_WORD - 1,
+                                                GET_MODE (subword_op1)),
+                                      GET_MODE (subword_op1));
+      subword_op1
+        = simplify_expand_binop (op1_mode, and_optab, subword_op1, tmp,
+                                 0, true, methods);
+    }
+
   /* Put the subword version directly in OUTOF_TARGET and INTO_TARGET.  */
   if (!expand_subword_shift (op1_mode, binoptab,
                              outof_input, into_input, subword_op1,

indeed fixes the miscompilation, but unfortunately for e.g.

__attribute__((noipa)) __int128
foo (__int128 a, unsigned k)
{
  return a << k;
}

__attribute__((noipa)) __int128
bar (__int128 a, unsigned k)
{
  return a >> k;
}

it results in one extra insn in each of the functions.  While the
superword_op1 case is fine because aarch64 (among other arches) has a pattern
to catch a shift with a masked count, in the subword_op1 case that doesn't
work, because expand_subword_shift actually emits 3 shifts instead of just
one, one with (BITS_PER_WORD - 1) - op1 as shift count and two with op1.
If the op1 &= (BITS_PER_WORD - 1) masking is done in the caller, then it can't
be easily merged with the shifts.
We could also do that separately in expand_subword_shift under some new bool,
and in that case instead of using

  op1 &= (BITS_PER_WORD - 1);
  shift1 by ((BITS_PER_WORD - 1) - op1);
  shift2 by op1;
  shift3 by op1

use

  tmp = ((BITS_PER_WORD - 1) - op1) & (BITS_PER_WORD - 1);
  shift1 by tmp;
  op1 &= (BITS_PER_WORD - 1);
  shift2 by op1;
  shift3 by op1

but that would be larger code if the target doesn't have those
shift-with-masked-count patterns that trigger on it.  Perhaps have some target
hook?  Or try to recog the combined instruction?
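
To make the two subword sequences above concrete, here is a minimal C model of
the word-level arithmetic for a left shift, assuming 64-bit words.  It is only
a sketch and not GCC code: the dword type, the function names and WORD_BITS
are hypothetical stand-ins, and the real expander works on RTL rather than on
C integers.  The point it illustrates is that both sequences keep every shift
count in range and compute the same values, so the choice between them is
purely about how many insns the target ends up emitting.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BITS 64u   /* assumption: stand-in for BITS_PER_WORD */

/* Hypothetical two-word value; for a left shift, lo plays the role of
   outof_input and hi the role of into_input.  */
typedef struct { uint64_t lo, hi; } dword;

/* First sequence: the count is masked once up front (as if in the caller),
   and the carry shift count is derived from the masked value.  */
dword shl_mask_in_caller(dword in, unsigned op1)
{
    op1 &= WORD_BITS - 1;
    unsigned tmp = (WORD_BITS - 1) - op1;
    dword out;
    out.hi = (in.hi << op1) | ((in.lo >> 1) >> tmp);  /* shift2 | shift1 */
    out.lo = in.lo << op1;                            /* shift3 */
    return out;
}

/* Second sequence: each shift count carries its own mask, so a target with
   shift-with-masked-count patterns could fold every AND into its shift.  */
dword shl_mask_per_shift(dword in, unsigned op1)
{
    unsigned tmp = ((WORD_BITS - 1) - op1) & (WORD_BITS - 1);
    uint64_t carries = (in.lo >> 1) >> tmp;           /* shift1 */
    op1 &= WORD_BITS - 1;
    dword out;
    out.hi = (in.hi << op1) | carries;                /* shift2 */
    out.lo = in.lo << op1;                            /* shift3 */
    return out;
}

/* Checks that both sequences agree for every count a doubleword shift can
   see, including counts >= WORD_BITS whose subword result would normally be
   discarded by the conditional move.  */
void check_equivalence(void)
{
    const dword in = { 0x8000000000000001ull, 0x0123456789abcdefull };
    for (unsigned op1 = 0; op1 < 2 * WORD_BITS; op1++) {
        dword a = shl_mask_in_caller(in, op1);
        dword b = shl_mask_per_shift(in, op1);
        assert(a.lo == b.lo && a.hi == b.hi);
    }
}
```

The equivalence holds because (63 - op1) mod 64 equals 63 - (op1 mod 64), so
masking the difference gives the same carry count as masking op1 first.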