Re: [PATCH v4 01/10] Initial TI PRU GCC port

Jeff Law Mon, 24 Sep 2018 10:41:09 -0700
On 9/24/18 4:38 AM, Richard Sandiford wrote:
> Dimitar Dimitrov <dimi...@dinux.eu> writes:
>> On Thursday, 9/13/2018 13:02:21 EEST Richard Sandiford wrote:
>>> Dimitar Dimitrov <dimi...@dinux.eu> writes:
>>>> +(define_insn
>>>> "sub_impl<EQD:mode><EQS0:mode><EQS1:mode>_<alu3_zext><alu3_zext_op1><alu3
>>>> _zext_op2>" +  [(set (match_operand:EQD 0 "register_operand" "=r,r,r")
>>>> +  (minus:EQD
>>>> +   (zero_extend:EQD
>>>> +    (match_operand:EQS0 1 "reg_or_ubyte_operand" "r,r,I"))
>>>> +   (zero_extend:EQD
>>>> +    (match_operand:EQS1 2 "reg_or_ubyte_operand" "r,I,r"))))]
>>>> +  ""
>>>> +  "@
>>>> +   sub\\t%0, %1, %2
>>>> +   sub\\t%0, %1, %2
>>>> +   rsb\\t%0, %2, %1"
>>>> +  [(set_attr "type" "alu")
>>>> +   (set_attr "length" "4")])
>>>
>>> By convention, subtraction patterns shouldn't accept constants for
>>> operand 2.  Constants should instead be subtracted using an addition
>>> of the negative.
>> Understood. I will remove second alternative. But I will leave the third one 
>> since it enables an important optimization:
>>
>>    unsigned test(unsigned a)
>>    {
>>         return 10-a;
>>    }
>>
>> RTL:
>> (insn 6 3 7 2 (set (reg:SI 152)
>>         (minus:SI (const_int 10 [0xa])
>>             (reg/v:SI 151 [ a ]))) "test.c":430 -1
>>      (nil))
>>
>> Assembly:
>>     test:
>>         rsb     r14, r14, 10
>>         ret
> 
> Thanks.  And yeah, the final alternative is fine.
> 
>>>> +;; Return true if OP is a text segment reference.
>>>> +;; This is needed for program memory address expressions.  Borrowed from
>>>> AVR. +(define_predicate "text_segment_operand"
>>>> +  (match_code "code_label,label_ref,symbol_ref,plus,minus,const")
>>>> +{
>>>> +  switch (GET_CODE (op))
>>>> +    {
>>>> +    case CODE_LABEL:
>>>> +      return true;
>>>> +    case LABEL_REF :
>>>> +      return true;
>>>> +    case SYMBOL_REF :
>>>> +      return SYMBOL_REF_FUNCTION_P (op);
>>>> +    case PLUS :
>>>> +    case MINUS :
>>>> +      /* Assume canonical format of symbol + constant.
>>>> +   Fall through.  */
>>>> +    case CONST :
>>>> +      return text_segment_operand (XEXP (op, 0), VOIDmode);
>>>> +    default :
>>>> +      return false;
>>>> +    }
>>>> +})
>>>
>>> This probably comes from AVR, but: no spaces before ":".
>>>
>>> Bit surprised that we can get a CODE_LABEL rather than a LABEL_REF here.
>>> Do you know if that triggers in practice, and if so where?
>> Indeed, CODE_LABEL case is never reached. I'll leave gcc_unreachable here.
>>
>>> An IMO neater and slightly more robust way of writing the body is:
>>>   poly_int64 offset:
>>>   rtx base = strip_offset (op, &offset);
>>>   switch (GET_CODE (base))
>>>   
>>>     {
>>>     
>>>     case LABEL_REF:
>>>       ...as above...
>>>     
>>>     case SYMBOL_REF:
>>>       ...as above...
>>>     
>>>     default:
>>>       return false;
>>>     
>>>     }
>>>
>>> with "plus" and "minus" not in the match_code list (since they should
>>> always appear in consts if they really are text references).
>>
>> The "plus" and "minus" are needed when handling code labels as values. Take 
>> for example the following construct:
>>
>>    int x = &&lab1 - &&lab0;
>> lab1:
>>   ...
>> lab2:
>>
>>
>> My TARGET_ASM_INTEGER callback uses the text_segment_operand predicate. In 
>> the 
>> above case it is passed the following RTL expression:
>> (minus:SI (label_ref/v:SI 20)
>>     (label_ref/v:SI 27))
>>
>> I need to detect text labels so that I annotate them with %pmem:
>>         .4byte       %pmem(.L4-(.L2))
>> Instead of the incorrect:
>>         .4byte   .L3-(.L2)
> 
> OK, thanks for the explanation.  I think the target-independent code should
> really be passing (const (minus ...)) rather than a plain (minus ...) here,
> but that's going to be difficult to change at this stage.  And like you say,
> the split_offset suggestion wouldn't have handled this anyway.
> 
> So yeah, please keep what you have now.
> 
>>>> +/* Callback for walk_gimple_seq that checks TP tree for TI ABI
>>>> compliance.  */ +static tree
>>>> +check_op_callback (tree *tp, int *walk_subtrees, void *data)
>>>> +{
>>>> +  struct walk_stmt_info *wi = (struct walk_stmt_info *) data;
>>>> +
>>>> +  if (RECORD_OR_UNION_TYPE_P (*tp) || TREE_CODE (*tp) == ENUMERAL_TYPE)
>>>> +    {
>>>> +      /* Forward declarations have NULL tree type.  Skip them.  */
>>>> +      if (TREE_TYPE (*tp) == NULL)
>>>> +  return NULL;
>>>> +    }
>>>> +
>>>> +  /* TODO - why C++ leaves INTEGER_TYPE forward declarations around?  */
>>>> +  if (TREE_TYPE (*tp) == NULL)
>>>> +    return NULL;
>>>> +
>>>> +  const tree type = TREE_TYPE (*tp);
>>>> +
>>>> +  /* Direct function calls are allowed, obviously.  */
>>>> +  if (TREE_CODE (*tp) == ADDR_EXPR && TREE_CODE (type) == POINTER_TYPE)
>>>> +    {
>>>> +      const tree ptype = TREE_TYPE (type);
>>>> +      if (TREE_CODE (ptype) == FUNCTION_TYPE)
>>>> +  return NULL;
>>>> +    }
>>>
>>> This seems like a bit of a dangerous heuristic.  Couldn't it also cause
>>> us to skip things like:
>>>
>>>    (void *) func
>>>
>>> ?  (OK, that's dubious C, but we do support it.)
>> The cast yields a "data pointer", which is 32 bits for both types of ABI. 
>> Hence it is safe to skip "(void *) func".
>>
>> The TI ABI's 16 bit function pointers become a problem when they change the 
>> layout of structures and function argument registers.
> 
> OK.  The reason this stood out is that the code doesn't obviously match
> the comment.  If the code is just trying to skip direct function calls,
> I think the gcall sequence I mentioned would be more obvious, and would
> match the existing comment.  If anything that takes the address of a
> function is OK then it might be worth changing the comment to include that.
> 
>>>> +      {
>>>> +        *total = 0;
>>>> +      }
>>>> +    else
>>>> +      {
>>>> +        /* SI move has the same cost as a QI move.  */
>>>> +        int factor = GET_MODE_SIZE (mode) / GET_MODE_SIZE (SImode);
>>>> +        if (factor == 0) 
>>>> +          factor = 1;
>>>> +        *total = factor * COSTS_N_INSNS (1);
>>>> +      }
>>>
>>> Could you explain this a bit more?  It looks like a pure QImode operation
>>> gets a cost of 1 insn but an SImode operation zero-extended from QImode
>>> gets a cost of 0.
>> I have unintentionally bumped QI cost while trying to make zero-extends 
>> cheap. 
>> I will fix by using the following simple logic:
>>
>>     case SET:
>>        mode = GET_MODE (SET_DEST (x));
>>        /* SI move has the same cost as a QI move.  Moves larger than
>>           64 bits are costly.  */
>>        int factor = CEIL (GET_MODE_SIZE (mode), GET_MODE_SIZE (SImode));
>>        *total = factor * COSTS_N_INSNS (1);
> 
> Looks good.
> 
>>>> +/* Implement TARGET_PREFERRED_RELOAD_CLASS.  */
>>>> +static reg_class_t
>>>> +pru_preferred_reload_class (rtx x ATTRIBUTE_UNUSED, reg_class_t regclass)
>>>> +{
>>>> +  return regclass == NO_REGS ? GENERAL_REGS : regclass;
>>>> +}
>>>
>>> This looks odd: PREFERRED_RELOAD_CLASS should return a subset of the
>>> original class rather than add new registers.  Can you remember why
>>> it was needed?
>> I'm not sure what target code is supposed to return when NO_REGS is passed 
>> here. I saw how other ports handle NO_REGS (c-sky, m32c, nios2, rl78). So I 
>> followed suite.
>>
>> Here is a backtrace of LRA when NO_REGS is passed:
>> 0xf629ae pru_preferred_reload_class
>>         ../../gcc/gcc/config/pru/pru.c:788
>> 0xa3d6e8 process_alt_operands
>>         ../../gcc/gcc/lra-constraints.c:2600
>> 0xa3ef7d curr_insn_transform
>>         ../../gcc/gcc/lra-constraints.c:3889
>> 0xa4301e lra_constraints(bool)
>>         ../../gcc/gcc/lra-constraints.c:4906
>> 0xa2726c lra(_IO_FILE*)
>>         ../../gcc/gcc/lra.c:2446
>> 0x9c97a9 do_reload
>>         ../../gcc/gcc/ira.c:5469
>> 0x9c97a9 execute
>>         ../../gcc/gcc/ira.c:5653
> 
> I think it should just pass NO_REGS through unmodified (which means
> spill to memory).  In some ways it would be good if it didn't have to
> handle this case, but again that's going to be work to fix.
> 
> The RA will pass ALL_REGS if it can handle any register, and wants to
> know what the target would prefer.  But for any input, the hook needs
> to stick within the registers it was given.
> 
>>>> +/* Helper function to get the starting storage HW register for an
>>>> argument, +   or -1 if it must be passed on stack.  The cum_v state is
>>>> not changed.  */ +static int
>>>> +pru_function_arg_regi (cumulative_args_t cum_v,
>>>> +                 machine_mode mode, const_tree type,
>>>> +                 bool named)
>>>> +{
>>>> +  CUMULATIVE_ARGS *cum = get_cumulative_args (cum_v);
>>>> +  size_t argsize = pru_function_arg_size (mode, type);
>>>> +  size_t i, bi;
>>>> +  int regi = -1;
>>>> +
>>>> +  if (!pru_arg_in_reg_bysize (argsize))
>>>> +    return -1;
>>>> +
>>>> +  if (!named)
>>>> +    return -1;
>>>> +
>>>> +  /* Find the first available slot that fits.  Yes, that's the PRU ABI. 
>>>> */ +  for (i = 0; regi < 0 && i < ARRAY_SIZE (cum->regs_used); i++)
>>>> +    {
>>>> +      if (mode == BLKmode)
>>>> +  {
>>>> +    /* Structs are passed, beginning at a full register.  */
>>>> +    if ((i % 4) != 0)
>>>> +      continue;
>>>> +  }
>>>
>>> The comment doesn't seem to match the code: a struct like:
>>>
>>>   struct s { short x; };
>>>
>>> will have HImode rather than BLKmode.
>> I will update the comment to clarify it is only for large structs. Code 
>> should 
>> be ok.
>>
>>> It's usually better to base ABI stuff on the type where possible,
>>> since that corresponds directly to the source language, while modes
>>> are more of an internal GCC thing.
>> I will explore this option. But ABI is already defined in data sizes, which 
>> I 
>> think neatly fit with GCC's modes.
> 
> Thanks.  And this certainly shouldn't hold up acceptance, it was just a
> suggestion.
> 
>>>> +; LRA cannot cope with clobbered op2, hence the scratch register.
>>>> +(define_insn "ashr<mode>3"
>>>> +  [(set (match_operand:QISI 0 "register_operand"      "=&r,r")
>>>> +    (ashiftrt:QISI
>>>> +      (match_operand:QISI 1 "register_operand"        "0,r")
>>>> +      (match_operand:QISI 2 "reg_or_const_1_operand"  "r,P")))
>>>> +   (clobber (match_scratch:QISI 3                     "=r,X"))]
>>>> +  ""
>>>> +  "@
>>>> +   mov %3, %2\;ASHRLP%=:\;qbeq ASHREND%=, %3, 0\; sub %3, %3, 1\;
>>>> lsr\\t%0, %0, 1\; qbbc ASHRLP%=, %0, (%S0 * 8) - 2\; set %0, %0, (%S0 *
>>>> 8) - 1\; jmp ASHRLP%=\;ASHREND%=: +   lsr\\t%0, %1, 1\;qbbc LSIGN%=, %0,
>>>> (%S0 * 8) - 2\;set %0, %0, (%S0 * 8) - 1\;LSIGN%=:" +  [(set_attr "type"
>>>> "complex,alu")
>>>> +   (set_attr "length" "28,4")])
>>>
>>> What do you mean by LRA not coping?  What did you try originally and
>>> what went wrong?
>> Better assembly could be generated if the shift count register was used for 
>> a 
>> loop counter. Pseudo code:
>>
>>    while (op2--)
>>      op0 >>= 1;
>>  
>> I originally tried to clobber operand 2:
>>  (define_insn "ashr<mode>3"
>>    [(set (match_operand:QISI 0 "register_operand" "=&r,r")
>>          (ashiftrt:QISI
>>            (match_operand:QISI 1 "register_operand" "0,r")
>>            (match_operand:QISI 2 "reg_or_const_1_operand" "+r,P"))
>>           )]
>>
>> But with the above pattern operand 2 was not clobbered. Its value was deemed 
>> untouched (i.e. live across the pattern). So I came up with  clobbering a 
>> separate register to fix this, at the expense of slightly bigger generated 
>> code.
> 
> Ah, yeah, "+" has to be on an output operand (the destination of a set
> or clobber).
> 
> In this kind of situation you might be better off expanding the loop
> as individual RTL instructions, which is what several targets do for
> things like memmove.  But if you'd prefer to keep a single pattern,
> you could use:
> 
>   [(set (match_operand:QISI 0 "register_operand" "=r")
>       (ashiftrt:QISI
>         (match_operand:QISI 2 "register_operand" "0")
>         (match_operand:QISI 3 "register_operand" "1")))
>    (set (match_operand:QISI 1 "register_operand" "=r")
>       (const_int -1))]
> 
> (assuming the optimised loop does leave operand 2 containing all-1s).
> 
> The shift by 1 would then need to be a separate pattern, with the
> define_expand choosing between them.
> 
>>>> +(define_insn "<code>di3"
>>>> +  [(set (match_operand:DI 0 "register_operand"            "=&r,&r")
>>>> +    (LOGICAL_BITOP:DI
>>>> +      (match_operand:DI 1 "register_operand"      "%r,r")
>>>> +      (match_operand:DI 2 "reg_or_ubyte_operand"  "r,I")))]
>>>> +  ""
>>>> +  "@
>>>> +   <logical_bitop_asm>\\t%F0, %F1, %F2\;<logical_bitop_asm>\\t%N0, %N1,
>>>> %N2 +   <logical_bitop_asm>\\t%F0, %F1, %2\;<logical_bitop_asm>\\t%N0,
>>>> %N1, 0" +  [(set_attr "type" "alu,alu")
>>>> +   (set_attr "length" "8,8")])
>>>> +
>>>> +
>>>> +(define_insn "one_cmpldi2"
>>>> +  [(set (match_operand:DI 0 "register_operand"            "=r")
>>>> +  (not:DI (match_operand:DI 1 "register_operand"  "r")))]
>>>> +  ""
>>>> +{
>>>> +  /* careful with overlapping source and destination regs.  */
>>>> +  gcc_assert (GP_REG_P (REGNO (operands[0])));
>>>> +  gcc_assert (GP_REG_P (REGNO (operands[1])));
>>>> +  if (REGNO (operands[0]) == (REGNO (operands[1]) + 4))
>>>> +    return "not\\t%N0, %N1\;not\\t%F0, %F1";
>>>> +  else
>>>> +    return "not\\t%F0, %F1\;not\\t%N0, %N1";
>>>> +}
>>>> +  [(set_attr "type" "alu")
>>>> +   (set_attr "length" "8")])
>>>
>>> Do you see any code improvements from defining these patterns?
>>> These days I'd expect you'd get better code by letting the
>>> target-independent code split them up into SImode operations.
>> Yes, these patterns improve code size and speed.
>>
>> Looking at expand_binop(), DI logical ops are split into WORD mode, which in 
>> PRU's peculiar case is QI. So without those patterns, a 64 bit IOR generates 
>> 8 
>> QI instructions:
>>
>>         or      r0.b0, r14.b0, r16.b0
>>         or      r0.b1, r14.b1, r16.b1
>>         or      r0.b2, r14.b2, r16.b2
>>         or      r0.b3, r14.b3, r16.b3
>>         or      r1.b0, r15.b0, r17.b0
>>         or      r1.b1, r15.b1, r17.b1
>>         or      r1.b2, r15.b2, r17.b2
>>         or      r1.b3, r15.b3, r17.b3
>>
>> Whereas with patterns defined, it is only 2 SImode instructions:
>>         or      r0, r14, r16
>>         or      r1, r15, r17
> 
> Ah, yeah, I'd forgotten about words being QImode.  Ideally the doubleword
> handling would be extended to any mode that is twice the width of a "cheap"
> mode, but again that'd be a fair amount of work.
> 
> So yeah, please keep this as-is.
> 
>>>> +;; Multiply instruction.  Idea for fixing registers comes from the AVR
>>>> backend. +
>>>> +(define_expand "mulsi3"
>>>> +  [(set (match_operand:SI 0 "register_operand" "")
>>>> +  (mult:SI (match_operand:SI 1 "register_operand" "")
>>>> +           (match_operand:SI 2 "register_operand" "")))]
>>>> +  ""
>>>> +{
>>>> +  emit_insn (gen_mulsi3_fixinp (operands[0], operands[1], operands[2]));
>>>> +  DONE;
>>>> +})
>>>> +
>>>> +
>>>> +(define_expand "mulsi3_fixinp"
>>>> +  [(set (reg:SI 112) (match_operand:SI 1 "register_operand" ""))
>>>> +   (set (reg:SI 116) (match_operand:SI 2 "register_operand" ""))
>>>> +   (set (reg:SI 104) (mult:SI (reg:SI 112) (reg:SI 116)))
>>>> +   (set (match_operand:SI 0 "register_operand" "") (reg:SI 104))]
>>>> +  ""
>>>> +{
>>>> +})
>>>
>>> This seems slightly dangerous since there's nothing to guarantee that
>>> those registers aren't already live at the point of expansion.
>>>
>>> The more usual way (and IMO correct way) would be for:
>>>> +(define_insn "*mulsi3_prumac"
>>>> +  [(set (reg:SI 104) (mult:SI (reg:SI 112) (reg:SI 116)))]
>>>> +  ""
>>>> +  "nop\;xin\\t0, r26, 4"
>>>> +  [(set_attr "type" "alu")
>>>> +   (set_attr "length" "8")])
>>>
>>> to have register classes and constraints for these three registers,
>>> like e.g. the x86 port does for %cx etc.
>>>
>>> It would be good if you could try that.  On the other hand, if AVR
>>> already does this and it worked in practice then I guess it's not
>>> something that should hold up the port.
>> This suggestion worked. It was a matter of correct predicates.
>>
>> Here is what I will put in the next patch version:
>>
>> (define_predicate "pru_muldst_operand"
>>   (and (match_code "reg")
>>        (ior (match_test "REGNO_REG_CLASS (REGNO (op)) == MULDST_REGS")
>>          (match_test "REGNO (op) >= FIRST_PSEUDO_REGISTER"))))
>> ...
>> (define_register_constraint "Rmd0" "MULDST_REGS"
>>   "@internal
>>   The multiply destination register.")
>> ...
>>
>> (define_insn "mulsi3"
>>   [(set (match_operand:SI 0 "pru_muldst_operand"        "=Rmd0")
>>      (mult:SI (match_operand:SI 1 "pru_mulsrc0_operand" "Rms0")
>>               (match_operand:SI 2 "pru_mulsrc1_operand" "Rms1")))]
> 
> Looks good.  Thanks for doing this.
> 
> You could also add a "%" constraint to operand 1 to allow the input
> registers to be swapped, just on the off chance that that's useful.
> But I suppose there's a risk that providing that alternative will
> confuse the RA cost model a bit.  Just a suggestion.
> 
>>>> @@ -23444,6 +23449,56 @@ the offset with a symbol reference to a canary in
>>>> the TLS block.> 
>>>>  @end table
>>>>
>>>> +@node PRU Options
>>>> +@subsection PRU Options
>>>> +@cindex PRU Options
>>>> +
>>>> +These command-line options are defined for PRU target:
>>>> +
>>>> +@table @gcctabopt
>>>> +@item -minrt
>>>> +@opindex minrt
>>>> +Enable the use of a minimum runtime environment---no static
>>>> +initializers or constructors.  Results in significant code size
>>>> +reduction of final ELF binaries.
>>>
>>> Up to you, but this read to me as "-m"+"inrt".  If this option is already
>>> in use then obviously keep it, but if it's new then it might be worth
>>> calling it -mminrt instead, for consistency with -mmcu.
>> I assumed this is a standard naming since MSP430 already uses it. I've seen 
>> the option being utilized by pru-gcc users, so let's keep it that way.
> 
> OK.
> 
>>> Despite the long reply (sorry), this looks in really good shape to me FWIW.
>>> Thanks for the submission.
>> I take this sentence as an indication that the PRU port is welcome for 
>> inclusion into GCC. I'll work on fixing the comments and will send a new 
>> patch 
>> version.
> 
> Well, process-wise, it's up to the steering committee to decide whether
> to accept the port.  But one of the main requirements is of course
> reviewing the patches.
> 
> Which of the other patches in the series still need review?  I think Jeff
> approved some of them earlier.
I think the only thing outstanding was the .md/.c/.h files for the port
itself.  The rest was very straightforward.
jeff
Re: [PATCH v4 01/10] Initial TI PRU GCC port

Reply via email to