Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Richard Sandiford Thu, 06 Jun 2024 15:57:51 -0700

Ajit Agarwal <aagar...@linux.ibm.com> writes:
> On 06/06/24 8:03 pm, Richard Sandiford wrote:
>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>> On 06/06/24 2:28 pm, Richard Sandiford wrote:
>>>> Hi,
>>>>
>>>> Just some comments on the fuseable_load_p part, since that's what
>>>> we were discussing last time.
>>>>
>>>> It looks like this now relies on:
>>>>
>>>> Ajit Agarwal <aagar...@linux.ibm.com> writes:
>>>>> +      /* We use DF data flow because we change location rtx
>>>>> +  which is easier to find and modify.
>>>>> +  We use mix of rtl-ssa def-use and DF data flow
>>>>> +  where it is easier.  */
>>>>> +      df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
>>>>> +      df_analyze ();
>>>>> +      df_set_flags (DF_DEFER_INSN_RESCAN);
>>>>
>>>> But please don't do this!  For one thing, building DU/UD chains
>>>> as well as rtl-ssa is really expensive in terms of compile time.
>>>> But more importantly, modifications need to happen via rtl-ssa
>>>> to ensure that the IL is kept up-to-date.  If we don't do that,
>>>> later fuse attempts will be based on stale data and so could
>>>> generate incorrect code.
>>>>
>>>
>>> Sure I have made changes to use only rtl-ssa and not to use
>>> UD/DU chains. I will send the changes in separate subsequent
>>> patch.
>> 
>> Thanks.  Before you send the patch though:
>> 
>>>>> +// Check whether load can be fusable or not.
>>>>> +// Return true if fuseable otherwise false.
>>>>> +bool
>>>>> +rs6000_pair_fusion::fuseable_load_p (insn_info *info)
>>>>> +{
>>>>> +  for (auto def : info->defs())
>>>>> +    {
>>>>> +      auto set = dyn_cast<set_info *> (def);
>>>>> +      for (auto use1 : set->nondebug_insn_uses ())
>>>>> + use1->set_is_live_out_use (true);
>>>>> +    }
>>>>
>>>> What was the reason for adding this loop?
>>>>
>>>
>>> The purpose of adding is to avoid assert failure in 
>>> gcc/rtl-ssa/changes.cc:252
>> 
>> That assert is making sure that we don't delete a definition of a
>> register (or memory) while a real insn still uses it.  If the assert
>> is firing then something has gone wrong.
>> 
>> Live-out uses are a particular kind of use that occur at the end of
>> basic blocks.  It's incorrect to mark normal insn uses as live-out.
>> 
>> When an assert fails, it's important to understand why the failure
>> occurs, rather than brute-force the assert condition to true.
>> 
>
> The above assert failure occurs when there is a debug insn and its
> use is not live-out.


Uses in debug insns are never live-out uses.

It sounds like the bug is that we're failing to update all debug uses of
the original register.  We need to do that, or "reset" the debug insn if
substitution fails for some reason.

See fixup_debug_uses for what the target-independent part of the pass
does for debug insns that are affected by movement.  Hopefully the
update needed here will be simpler than that.

>>>>> [...]
>>>>> +
>>>>> +  rtx addr = XEXP (SET_SRC (body), 0);
>>>>> +
>>>>> +  if (GET_CODE (addr) == PLUS
>>>>> +      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
>>>>> +    {
>>>>> +      if (INTVAL (XEXP (addr, 1)) == -16)
>>>>> + return false;
>>>>> +  }
>>>>
>>>> What's special about -16?
>>>>
>>>
>>> The tests like libgomp/for-8 fails with fused load with offset -16 and 0.
>>> Thats why I have added this check.
>> 
>> But why does it fail though?  It sounds like the testcase is pointing
>> out a problem in the pass (or perhaps elsewhere).  It's important that
>> we try to understand and fix the underlying problem.
>> 
>
> This check is not required anymore and will remove from subsequent patches.

OK, great.

>>>>> +
>>>>> +  df_ref use;
>>>>> +  df_insn_info *insn_info = DF_INSN_INFO_GET (info->rtl ());
>>>>> +  FOR_EACH_INSN_INFO_DEF (use, insn_info)
>>>>> +    {
>>>>> +      struct df_link *def_link = DF_REF_CHAIN (use);
>>>>> +
>>>>> +      if (!def_link || !def_link->ref
>>>>> +   || DF_REF_IS_ARTIFICIAL (def_link->ref))
>>>>> + continue;
>>>>> +
>>>>> +      while (def_link && def_link->ref)
>>>>> + {
>>>>> +   rtx_insn *insn = DF_REF_INSN (def_link->ref);
>>>>> +   if (GET_CODE (PATTERN (insn)) == PARALLEL)
>>>>> +     return false;
>>>>
>>>> Why do you need to skip PARALLELs?
>>>>
>>>
>>> vec_select with parallel give failures final.cc "can't split-up with subreg 
>>> 128 (reg OO"
>>> Thats why I have added this.
>> 
>> But in (vec_select ... (parallel ...)), the parallel won't be the 
>> PATTERN (insn).  It'll instead be a suboperand of the vec_select.
>> 
>> Here too it's important to understand why the final.cc failure occurs
>> and what the correct fix is.
>> 
>
> subreg with vec_select operand already exists before fusion pass.
> We overwrite them with subreg 128 bits from 256 OO mode operand.

But why is that wrong?  What was the full rtl of the subreg before the
pass runs, what did the subreg look like after the pass, and why is the
change not correct?

In general, there are two main ways that an rtl change can be incorrect:

(1) The new rtl isn't well-formed (such as (subreg (subreg X A) B)).
    In this case, the new rtl makes no inherent sense when viewed
    in isolation: it isn't necessary to see the old rtl to tell that
    the new rtl is wrong.

(2) The new rtl is well-formed (i.e. makes inherent sense when viewed in
    isolation) but it does not have the same semantics as the old rtl.
    In other words, the new rtl is describing a different operation
    from the old rtl.

I think we need to talk about it in those terms, rather than where
the eventual ICE occurs.

> Due to this in final.cc we couldnt splt at line 2807 and bails
> out fatal_insn.
>
> Currently we dont support already existing subreg vector operand
> to generate register pairs.
> We should bail out from fusion pass in this case.
>>>>> +
>>>>> +   rtx set = single_set (insn);
>>>>> +   if (set == NULL_RTX)
>>>>> +     return false;
>>>>> +
>>>>> +   rtx op0 = SET_SRC (set);
>>>>> +   rtx_code code = GET_CODE (op0);
>>>>> +
>>>>> +   // This check is added as register pairs are not generated
>>>>> +   // by RA for neg:V2DF (fma: V2DF (reg1)
>>>>> +   //                  (reg2)
>>>>> +   //                  (neg:V2DF (reg3)))
>>>>> +   if (GET_RTX_CLASS (code) == RTX_UNARY)
>>>>> +     return false;
>>>>
>>>> What's special about (neg (fma ...))?
>>>>
>>>
>>> I am not sure why register allocator fails allocating register pairs with
>>> NEG Unary operation with fma operand. I have not debugged register 
>>> allocator why the NEG
>>> Unary operation with fma operand. 
>> 
>
> For neg (fma ...) cases because of subreg 128 bits from OOmode 256 bits are
> set correctly. 
> IRA marked them spill candidates as spill priority is zero.
>
> Due to this LRA reload pass couldn't allocate register pairs.

I think this is just restating the symptom though.  I suppose the same
kind of questions apply here too: what was the instruction before the
pass runs, what was the instruction after the pass runs, and why is
the rtl change incorrect (by the meaning above)?

Thanks,
Richard

Re: [patch, rs6000, middle-end 0/1] v1: Add implementation for different targets for pair mem fusion

Reply via email to