On Tue, Apr 19, 2022 at 11:36 PM Qing Zhao <qing.z...@oracle.com> wrote:
>
>
>
> > On Apr 14, 2022, at 1:53 AM, Richard Biener <richard.guent...@gmail.com> 
> > wrote:
> >
> > On Wed, Apr 13, 2022 at 5:22 PM Qing Zhao <qing.z...@oracle.com> wrote:
> >>
> >> Hi, Richard,
> >>
> >> Thanks a lot for taking a look at this issue (and Sorry that I haven’t 
> >> fixed this one yet, I was distracted by other tasks then just forgot this 
> >> one….)
> >>
> >>> On Apr 13, 2022, at 3:41 AM, Richard Biener <richard.guent...@gmail.com> 
> >>> wrote:
> >>>
> >>> On Tue, Feb 15, 2022 at 5:31 PM Qing Zhao via Gcc-patches
> >>> <gcc-patches@gcc.gnu.org> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On Feb 15, 2022, at 3:58 AM, Jakub Jelinek <ja...@redhat.com> wrote:
> >>>>>
> >>>>> Hi!
> >>>>>
> >>>>> For IBM double double I've added in PR95450 and PR99648 verification 
> >>>>> that
> >>>>> when we at the tree/GIMPLE or RTL level interpret target bytes as a 
> >>>>> REAL_CST
> >>>>> or CONST_DOUBLE constant, we try to encode it back to target bytes and
> >>>>> verify it is the same.
> >>>>> This is because our real.c support isn't able to represent all valid 
> >>>>> values
> >>>>> of IBM double double which has variable precision.
> >>>>> In PR104522, it has been noted that we have similar problem with the
> >>>>> Intel/Motorola extended XFmode formats, our internal representation 
> >>>>> isn't
> >>>>> able to record pseudo denormals, pseudo infinities, pseudo NaNs and 
> >>>>> unnormal
> >>>>> values.
> >>>>> So, the following patch is an attempt to extend that verification to all
> >>>>> floats.
> >>>>> Unfortunately, it wasn't that straightforward, because the
> >>>>> __builtin_clear_padding code exactly for the XFmode long doubles needs 
> >>>>> to
> >>>>> discover what bits are padding and does that by interpreting memory of
> >>>>> all 1s.  That is actually a valid supported value, a qNaN with negative
> >>>>> sign with all mantissa bits set, but the verification includes also the
> >>>>> padding bits (exactly what __builtin_clear_padding wants to figure out)
> >>>>> and so fails the comparison check and so we ICE.
> >>>>> The patch fixes that case by moving that verification from
> >>>>> native_interpret_real to its caller, so that clear_padding_type can
> >>>>> call native_interpret_real and avoid that extra check.
> >>>>>
> >>>>> With this, the only thing that regresses in the testsuite is
> >>>>> +FAIL: gcc.target/i386/auto-init-4.c scan-assembler-times 
> >>>>> long\\t-16843010 5
> >>>>> because it decides to use a pattern that has non-zero bits in the 
> >>>>> padding
> >>>>> bits of the long double, so the simplify-rtx.cc change prevents folding
> >>>>> a SUBREG into a constant.  We emit (the testcase is -O0 but we emit 
> >>>>> worse
> >>>>> code at all opt levels) something like:
> >>>>>      movabsq $-72340172838076674, %rax
> >>>>>      movabsq $-72340172838076674, %rdx
> >>>>>      movq    %rax, -48(%rbp)
> >>>>>      movq    %rdx, -40(%rbp)
> >>>>>      fldt    -48(%rbp)
> >>>>>      fstpt   -32(%rbp)
> >>>>> instead of
> >>>>>      fldt    .LC2(%rip)
> >>>>>      fstpt   -32(%rbp)
> >>>>> ...
> >>>>> .LC2:
> >>>>>      .long   -16843010
> >>>>>      .long   -16843010
> >>>>>      .long   65278
> >>>>>      .long   0
> >>>>> Note, neither of those sequences actually stores the padding bits, fstpt
> >>>>> simply doesn't touch them.
> >>>>> For vars with clear_padding_real_needs_padding_p types that are 
> >>>>> allocated
> >>>>> to memory at expansion time, I'd say much better would be to do the 
> >>>>> stores
> >>>>> using integral modes rather than XFmode, so do that:
> >>>>>      movabsq $-72340172838076674, %rax
> >>>>>     movq    %rax, -32(%rbp)
> >>>>>     movq    %rax, -24(%rbp)
> >>>>> directly.  That is the only way to ensure the padding bits are 
> >>>>> initialized
> >>>>> (or expand __builtin_clear_padding, but then you initialize separately 
> >>>>> the
> >>>>> value bits and padding bits).
> >>>>>
> >>>>> Bootstrapped/regtested on x86_64-linux and i686-linux, though as 
> >>>>> mentioned
> >>>>> above, the gcc.target/i386/auto-init-4.c case is unresolved.
> >>>>
> >>>> Thanks, I will try to fix this testing case in a later patch.
> >>>
> >>> I've looked at this FAIL now and really wonder whether "pattern init" as
> >>> implemented makes any sense for non-integral types.
> >>> We end up with
> >>> initializing a register (SSA name) with
> >>>
> >>> VIEW_CONVERT_EXPR<long double>(0xfefefefefefefefefefefefefefefefe)
> >>>
> >>> as we go building a TImode constant (we verified we have a TImode SET!)
> >>> but then
> >>>
> >>>         /* Pun the LHS to make sure its type has constant size
> >>>            unless it is an SSA name where that's already known.  */
> >>>         if (TREE_CODE (lhs) != SSA_NAME)
> >>>           lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
> >>>         else
> >>>           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> >>> ...
> >>>     expand_assignment (lhs, init, false);
> >>>
> >>> and generally registers do not have any padding.  This weird expansion
> >>> then causes us to spill the TImode constant and reload the XFmode value,
> >>> which is definitely not optimal here.
> >>>
> >>> One approach to avoid the worse code generation would be to use mode
> >>> specific patterns for registers (like using a NaN or a target specific
> >>> value that
> >>> can be loaded cheaply),
> >>
> >> You mean that using “mode specific patterns” ONLY for registers?
> >> Can we use “mode specific patterns” consistently for both registers and 
> >> memory?
> >
> > The difference is that registers don't have padding while memory
> > possibly does, also
> > for aggregates using different patterns is too expensive IMHO.  For
> > registers the extra
> > complication with generic patterns is that efficient initialization
> > (without going through memory)
> > should be a priority IMHO.
> >
> > And for stack memory I still think that initializing the full
> > allocated frame (including padding
> > between variables!) is the best approach.
> >
> >> LLVM use different patterns for different types (Integral, Float, pointer 
> >> type, etc…) in order to
> >> Catch bugs easily for different types.
> >>
> >> The beginning versions of this patch tried to do similar as LLVM, but 
> >> later we gave up this and
> >> Use the same pattern for all different types.
> >>
> >> If now, we want to use “mode specific patterns” for registers, I am 
> >> wondering whether it’s
> >> Better to just use “mode specific patterns” consistently for both 
> >> registers and memory?
> >>
> >>> another would be to simply fall back to zero
> >>> initialization
> >>> when we fail to constant fold the initializer like with
> >>>
> >>> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> >>> index 8b1733e20c4..a4b995f71e4 100644
> >>> --- a/gcc/internal-fn.cc
> >>> +++ b/gcc/internal-fn.cc
> >>> @@ -3125,7 +3125,11 @@ expand_DEFERRED_INIT (internal_fn, gcall *stmt)
> >>>         if (TREE_CODE (lhs) != SSA_NAME)
> >>>           lhs = build1 (VIEW_CONVERT_EXPR, itype, lhs);
> >>>         else
> >>> -           init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), init);
> >>> +           {
> >>> +             init = fold_build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), 
> >>> init);
> >>> +             if (!CONSTANT_CLASS_P (init))
> >>> +               init = build_zero_cst (TREE_TYPE (lhs));
> >>> +           }
> >>>       }
> >>>      else
> >>>       /* Use zero-init also for variable-length sizes.  */
> >>>
> >>> note that still does not address the issue noted by Jakub that we do not
> >>> initialize padding this way - but of course that's because we expand a
> >>> special assignment from .DEFERRED_INIT and are not initializing
> >>> the storage allocated for the variable on the stack (at -O0) by RTL
> >>> expansion.  Ideally .DEFERRED_INIT "expansion" for stack variables
> >>> would take place there, simply filling the allocated frame with the
> >>> pattern or zero.  That would probably mean that RTL expansion
> >>> should scoop up .DEFERRED_INITs for variables it decides not to
> >>> expand to pseudos and not leave that to stmt expansion.
> >>
> >> Need to study this a little bit more...
>
>
> So, Is what you mean in the above a complete rewrite of the current 
> “expand_DEFERRED_INIT”:
>
> Instead of simply using “expand_builtin_memset” for “variables that are not 
> in register” and  “expand_assignment”
> for “variables that are in register”,  RTL expansion should directly expand 
> this call in a lower level:
>
> i.e,
>
>   tree lhs = gimple_call_lhs (stmt);
>   …
>
>   rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
>
>   If (MEM_P (target))   // the variable is allocated on stack
>   {
>      // filling the allocated frame with pattern or zero.   How to do it??
>   }
>   else                     // the variable is in pseudo register.
>   {
>     rtx init_rtx = …;
>     emit_move_insn (target, init_rtx)
>   }
>
>
> Is the above a correct understanding?

No, I would simply expand the automatic in-memory var .DEFERRED_INIT
calls to nothing
but instead process their effect at the time we do
expand_one_stack_var, which means at
CFG expansion time scan for the .DEFERRED_INITs and record whether and how to
initialize the variables.  We can then group the variables accordingly
and block-initialize
the stack space also covering padding inbetween variables.

Richard.

> Thanks.
>
> Qing
>
> >>
> >>>
> >>> I'm going to simply XFAIL this testcase for GCC 12 and opened
> >>> PR105259 so we can revisit the above for GCC 13.
> >>
> >> I can further study this bug and try to come up with a good solution in 
> >> gcc13.
> >>
> >> Thank you!
> >>
> >> Qing
> >>>
> >>> Richard.
> >>>
> >>>> Qing
> >>>>>
> >>>>
> >>
>

Reply via email to