On 9/24/18 4:38 AM, Richard Sandiford wrote:
> Dimitar Dimitrov <dimi...@dinux.eu> writes:
>> On Thursday, 9/13/2018 13:02:21 EEST Richard Sandiford wrote:
>>> Dimitar Dimitrov <dimi...@dinux.eu> writes:
>>>> +(define_insn
>>>> "sub_impl<EQD:mode><EQS0:mode><EQS1:mode>_<alu3_zext><alu3_zext_op1><alu3
>>>> _zext_op2>" + [(set (match_operand:EQD 0 "register_operand" "=r,r,r")
>>>> + (minus:EQD
>>>> + (zero_extend:EQD
>>>> + (match_operand:EQS0 1 "reg_or_ubyte_operand" "r,r,I"))
>>>> + (zero_extend:EQD
>>>> + (match_operand:EQS1 2 "reg_or_ubyte_operand" "r,I,r"))))]
>>>> + ""
>>>> + "@
>>>> + sub\\t%0, %1, %2
>>>> + sub\\t%0, %1, %2
>>>> + rsb\\t%0, %2, %1"
>>>> + [(set_attr "type" "alu")
>>>> + (set_attr "length" "4")])
>>>
>>> By convention, subtraction patterns shouldn't accept constants for
>>> operand 2. Constants should instead be subtracted using an addition
>>> of the negative.
>> Understood. I will remove second alternative. But I will leave the third one
>> since it enables an important optimization:
>>
>> unsigned test(unsigned a)
>> {
>> return 10-a;
>> }
>>
>> RTL:
>> (insn 6 3 7 2 (set (reg:SI 152)
>> (minus:SI (const_int 10 [0xa])
>> (reg/v:SI 151 [ a ]))) "test.c":430 -1
>> (nil))
>>
>> Assembly:
>> test:
>> rsb r14, r14, 10
>> ret
>
> Thanks. And yeah, the final alternative is fine.
>
>>>> +;; Return true if OP is a text segment reference.
>>>> +;; This is needed for program memory address expressions. Borrowed from
>>>> AVR. +(define_predicate "text_segment_operand"
>>>> + (match_code "code_label,label_ref,symbol_ref,plus,minus,const")
>>>> +{
>>>> + switch (GET_CODE (op))
>>>> + {
>>>> + case CODE_LABEL:
>>>> + return true;
>>>> + case LABEL_REF :
>>>> + return true;
>>>> + case SYMBOL_REF :
>>>> + return SYMBOL_REF_FUNCTION_P (op);
>>>> + case PLUS :
>>>> + case MINUS :
>>>> + /* Assume canonical format of symbol + constant.
>>>> + Fall through. */
>>>> + case CONST :
>>>> + return text_segment_operand (XEXP (op, 0), VOIDmode);
>>>> + default :
>>>> + return false;
>>>> + }
>>>> +})
>>>
>>> This probably comes from AVR, but: no spaces before ":".
>>>
>>> Bit surprised that we can get a CODE_LABEL rather than a LABEL_REF here.
>>> Do you know if that triggers in practice, and if so where?
>> Indeed, CODE_LABEL case is never reached. I'll leave gcc_unreachable here.
>>
>>> An IMO neater and slightly more robust way of writing the body is:
>>> poly_int64 offset:
>>> rtx base = strip_offset (op, &offset);
>>> switch (GET_CODE (base))
>>>
>>> {
>>>
>>> case LABEL_REF:
>>> ...as above...
>>>
>>> case SYMBOL_REF:
>>> ...as above...
>>>
>>> default:
>>> return false;
>>>
>>> }
>>>
>>> with "plus" and "minus" not in the match_code list (since they should
>>> always appear in consts if they really are text references).
>>
>> The "plus" and "minus" are needed when handling code labels as values. Take
>> for example the following construct:
>>
>> int x = &&lab1 - &&lab0;
>> lab1:
>> ...
>> lab2:
>>
>>
>> My TARGET_ASM_INTEGER callback uses the text_segment_operand predicate. In
>> the
>> above case it is passed the following RTL expression:
>> (minus:SI (label_ref/v:SI 20)
>> (label_ref/v:SI 27))
>>
>> I need to detect text labels so that I annotate them with %pmem:
>> .4byte %pmem(.L4-(.L2))
>> Instead of the incorrect:
>> .4byte .L3-(.L2)
>
> OK, thanks for the explanation. I think the target-independent code should
> really be passing (const (minus ...)) rather than a plain (minus ...) here,
> but that's going to be difficult to change at this stage. And like you say,
> the split_offset suggestion wouldn't have handled this anyway.
>
> So yeah, please keep what you have now.
>
>>>> +/* Callback for walk_gimple_seq that checks TP tree for TI ABI
>>>> compliance. */ +static tree
>>>> +check_op_callback (tree *tp, int *walk_subtrees, void *data)
>>>> +{
>>>> + struct walk_stmt_info *wi = (struct walk_stmt_info *) data;
>>>> +
>>>> + if (RECORD_OR_UNION_TYPE_P (*tp) || TREE_CODE (*tp) == ENUMERAL_TYPE)
>>>> + {
>>>> + /* Forward declarations have NULL tree type. Skip them. */
>>>> + if (TREE_TYPE (*tp) == NULL)
>>>> + return NULL;
>>>> + }
>>>> +
>>>> + /* TODO - why C++ leaves INTEGER_TYPE forward declarations around? */
>>>> + if (TREE_TYPE (*tp) == NULL)
>>>> + return NULL;
>>>> +
>>>> + const tree type = TREE_TYPE (*tp);
>>>> +
>>>> + /* Direct function calls are allowed, obviously. */
>>>> + if (TREE_CODE (*tp) == ADDR_EXPR && TREE_CODE (type) == POINTER_TYPE)
>>>> + {
>>>> + const tree ptype = TREE_TYPE (type);
>>>> + if (TREE_CODE (ptype) == FUNCTION_TYPE)
>>>> + return NULL;
>>>> + }
>>>
>>> This seems like a bit of a dangerous heuristic. Couldn't it also cause
>>> us to skip things like:
>>>
>>> (void *) func
>>>
>>> ? (OK, that's dubious C, but we do support it.)
>> The cast yields a "data pointer", which is 32 bits for both types of ABI.
>> Hence it is safe to skip "(void *) func".
>>
>> The TI ABI's 16 bit function pointers become a problem when they change the
>> layout of structures and function argument registers.
>
> OK. The reason this stood out is that the code doesn't obviously match
> the comment. If the code is just trying to skip direct function calls,
> I think the gcall sequence I mentioned would be more obvious, and would
> match the existing comment. If anything that takes the address of a
> function is OK then it might be worth changing the comment to include that.
>
>>>> + {
>>>> + *total = 0;
>>>> + }
>>>> + else
>>>> + {
>>>> + /* SI move has the same cost as a QI move. */
>>>> + int factor = GET_MODE_SIZE (mode) / GET_MODE_SIZE (SImode);
>>>> + if (factor == 0)
>>>> + factor = 1;
>>>> + *total = factor * COSTS_N_INSNS (1);
>>>> + }
>>>
>>> Could you explain this a bit more? It looks like a pure QImode operation
>>> gets a cost of 1 insn but an SImode operation zero-extended from QImode
>>> gets a cost of 0.
>> I have unintentionally bumped QI cost while trying to make zero-extends
>> cheap.
>> I will fix by using the following simple logic:
>>
>> case SET:
>> mode = GET_MODE (SET_DEST (x));
>> /* SI move has the same cost as a QI move. Moves larger than
>> 64 bits are costly. */
>> int factor = CEIL (GET_MODE_SIZE (mode), GET_MODE_SIZE (SImode));
>> *total = factor * COSTS_N_INSNS (1);
>
> Looks good.
>
>>>> +/* Implement TARGET_PREFERRED_RELOAD_CLASS. */
>>>> +static reg_class_t
>>>> +pru_preferred_reload_class (rtx x ATTRIBUTE_UNUSED, reg_class_t regclass)
>>>> +{
>>>> + return regclass == NO_REGS ? GENERAL_REGS : regclass;
>>>> +}
>>>
>>> This looks odd: PREFERRED_RELOAD_CLASS should return a subset of the
>>> original class rather than add new registers. Can you remember why
>>> it was needed?
>> I'm not sure what target code is supposed to return when NO_REGS is passed
>> here. I saw how other ports handle NO_REGS (c-sky, m32c, nios2, rl78). So I
>> followed suite.
>>
>> Here is a backtrace of LRA when NO_REGS is passed:
>> 0xf629ae pru_preferred_reload_class
>> ../../gcc/gcc/config/pru/pru.c:788
>> 0xa3d6e8 process_alt_operands
>> ../../gcc/gcc/lra-constraints.c:2600
>> 0xa3ef7d curr_insn_transform
>> ../../gcc/gcc/lra-constraints.c:3889
>> 0xa4301e lra_constraints(bool)
>> ../../gcc/gcc/lra-constraints.c:4906
>> 0xa2726c lra(_IO_FILE*)
>> ../../gcc/gcc/lra.c:2446
>> 0x9c97a9 do_reload
>> ../../gcc/gcc/ira.c:5469
>> 0x9c97a9 execute
>> ../../gcc/gcc/ira.c:5653
>
> I think it should just pass NO_REGS through unmodified (which means
> spill to memory). In some ways it would be good if it didn't have to
> handle this case, but again that's going to be work to fix.
>
> The RA will pass ALL_REGS if it can handle any register, and wants to
> know what the target would prefer. But for any input, the hook needs
> to stick within the registers it was given.
>
>>>> +/* Helper function to get the starting storage HW register for an
>>>> argument, + or -1 if it must be passed on stack. The cum_v state is
>>>> not changed. */ +static int
>>>> +pru_function_arg_regi (cumulative_args_t cum_v,
>>>> + machine_mode mode, const_tree type,
>>>> + bool named)
>>>> +{
>>>> + CUMULATIVE_ARGS *cum = get_cumulative_args (cum_v);
>>>> + size_t argsize = pru_function_arg_size (mode, type);
>>>> + size_t i, bi;
>>>> + int regi = -1;
>>>> +
>>>> + if (!pru_arg_in_reg_bysize (argsize))
>>>> + return -1;
>>>> +
>>>> + if (!named)
>>>> + return -1;
>>>> +
>>>> + /* Find the first available slot that fits. Yes, that's the PRU ABI.
>>>> */ + for (i = 0; regi < 0 && i < ARRAY_SIZE (cum->regs_used); i++)
>>>> + {
>>>> + if (mode == BLKmode)
>>>> + {
>>>> + /* Structs are passed, beginning at a full register. */
>>>> + if ((i % 4) != 0)
>>>> + continue;
>>>> + }
>>>
>>> The comment doesn't seem to match the code: a struct like:
>>>
>>> struct s { short x; };
>>>
>>> will have HImode rather than BLKmode.
>> I will update the comment to clarify it is only for large structs. Code
>> should
>> be ok.
>>
>>> It's usually better to base ABI stuff on the type where possible,
>>> since that corresponds directly to the source language, while modes
>>> are more of an internal GCC thing.
>> I will explore this option. But ABI is already defined in data sizes, which
>> I
>> think neatly fit with GCC's modes.
>
> Thanks. And this certainly shouldn't hold up acceptance, it was just a
> suggestion.
>
>>>> +; LRA cannot cope with clobbered op2, hence the scratch register.
>>>> +(define_insn "ashr<mode>3"
>>>> + [(set (match_operand:QISI 0 "register_operand" "=&r,r")
>>>> + (ashiftrt:QISI
>>>> + (match_operand:QISI 1 "register_operand" "0,r")
>>>> + (match_operand:QISI 2 "reg_or_const_1_operand" "r,P")))
>>>> + (clobber (match_scratch:QISI 3 "=r,X"))]
>>>> + ""
>>>> + "@
>>>> + mov %3, %2\;ASHRLP%=:\;qbeq ASHREND%=, %3, 0\; sub %3, %3, 1\;
>>>> lsr\\t%0, %0, 1\; qbbc ASHRLP%=, %0, (%S0 * 8) - 2\; set %0, %0, (%S0 *
>>>> 8) - 1\; jmp ASHRLP%=\;ASHREND%=: + lsr\\t%0, %1, 1\;qbbc LSIGN%=, %0,
>>>> (%S0 * 8) - 2\;set %0, %0, (%S0 * 8) - 1\;LSIGN%=:" + [(set_attr "type"
>>>> "complex,alu")
>>>> + (set_attr "length" "28,4")])
>>>
>>> What do you mean by LRA not coping? What did you try originally and
>>> what went wrong?
>> Better assembly could be generated if the shift count register was used for
>> a
>> loop counter. Pseudo code:
>>
>> while (op2--)
>> op0 >>= 1;
>>
>> I originally tried to clobber operand 2:
>> (define_insn "ashr<mode>3"
>> [(set (match_operand:QISI 0 "register_operand" "=&r,r")
>> (ashiftrt:QISI
>> (match_operand:QISI 1 "register_operand" "0,r")
>> (match_operand:QISI 2 "reg_or_const_1_operand" "+r,P"))
>> )]
>>
>> But with the above pattern operand 2 was not clobbered. Its value was deemed
>> untouched (i.e. live across the pattern). So I came up with clobbering a
>> separate register to fix this, at the expense of slightly bigger generated
>> code.
>
> Ah, yeah, "+" has to be on an output operand (the destination of a set
> or clobber).
>
> In this kind of situation you might be better off expanding the loop
> as individual RTL instructions, which is what several targets do for
> things like memmove. But if you'd prefer to keep a single pattern,
> you could use:
>
> [(set (match_operand:QISI 0 "register_operand" "=r")
> (ashiftrt:QISI
> (match_operand:QISI 2 "register_operand" "0")
> (match_operand:QISI 3 "register_operand" "1")))
> (set (match_operand:QISI 1 "register_operand" "=r")
> (const_int -1))]
>
> (assuming the optimised loop does leave operand 2 containing all-1s).
>
> The shift by 1 would then need to be a separate pattern, with the
> define_expand choosing between them.
>
>>>> +(define_insn "<code>di3"
>>>> + [(set (match_operand:DI 0 "register_operand" "=&r,&r")
>>>> + (LOGICAL_BITOP:DI
>>>> + (match_operand:DI 1 "register_operand" "%r,r")
>>>> + (match_operand:DI 2 "reg_or_ubyte_operand" "r,I")))]
>>>> + ""
>>>> + "@
>>>> + <logical_bitop_asm>\\t%F0, %F1, %F2\;<logical_bitop_asm>\\t%N0, %N1,
>>>> %N2 + <logical_bitop_asm>\\t%F0, %F1, %2\;<logical_bitop_asm>\\t%N0,
>>>> %N1, 0" + [(set_attr "type" "alu,alu")
>>>> + (set_attr "length" "8,8")])
>>>> +
>>>> +
>>>> +(define_insn "one_cmpldi2"
>>>> + [(set (match_operand:DI 0 "register_operand" "=r")
>>>> + (not:DI (match_operand:DI 1 "register_operand" "r")))]
>>>> + ""
>>>> +{
>>>> + /* careful with overlapping source and destination regs. */
>>>> + gcc_assert (GP_REG_P (REGNO (operands[0])));
>>>> + gcc_assert (GP_REG_P (REGNO (operands[1])));
>>>> + if (REGNO (operands[0]) == (REGNO (operands[1]) + 4))
>>>> + return "not\\t%N0, %N1\;not\\t%F0, %F1";
>>>> + else
>>>> + return "not\\t%F0, %F1\;not\\t%N0, %N1";
>>>> +}
>>>> + [(set_attr "type" "alu")
>>>> + (set_attr "length" "8")])
>>>
>>> Do you see any code improvements from defining these patterns?
>>> These days I'd expect you'd get better code by letting the
>>> target-independent code split them up into SImode operations.
>> Yes, these patterns improve code size and speed.
>>
>> Looking at expand_binop(), DI logical ops are split into WORD mode, which in
>> PRU's peculiar case is QI. So without those patterns, a 64 bit IOR generates
>> 8
>> QI instructions:
>>
>> or r0.b0, r14.b0, r16.b0
>> or r0.b1, r14.b1, r16.b1
>> or r0.b2, r14.b2, r16.b2
>> or r0.b3, r14.b3, r16.b3
>> or r1.b0, r15.b0, r17.b0
>> or r1.b1, r15.b1, r17.b1
>> or r1.b2, r15.b2, r17.b2
>> or r1.b3, r15.b3, r17.b3
>>
>> Whereas with patterns defined, it is only 2 SImode instructions:
>> or r0, r14, r16
>> or r1, r15, r17
>
> Ah, yeah, I'd forgotten about words being QImode. Ideally the doubleword
> handling would be extended to any mode that is twice the width of a "cheap"
> mode, but again that'd be a fair amount of work.
>
> So yeah, please keep this as-is.
>
>>>> +;; Multiply instruction. Idea for fixing registers comes from the AVR
>>>> backend. +
>>>> +(define_expand "mulsi3"
>>>> + [(set (match_operand:SI 0 "register_operand" "")
>>>> + (mult:SI (match_operand:SI 1 "register_operand" "")
>>>> + (match_operand:SI 2 "register_operand" "")))]
>>>> + ""
>>>> +{
>>>> + emit_insn (gen_mulsi3_fixinp (operands[0], operands[1], operands[2]));
>>>> + DONE;
>>>> +})
>>>> +
>>>> +
>>>> +(define_expand "mulsi3_fixinp"
>>>> + [(set (reg:SI 112) (match_operand:SI 1 "register_operand" ""))
>>>> + (set (reg:SI 116) (match_operand:SI 2 "register_operand" ""))
>>>> + (set (reg:SI 104) (mult:SI (reg:SI 112) (reg:SI 116)))
>>>> + (set (match_operand:SI 0 "register_operand" "") (reg:SI 104))]
>>>> + ""
>>>> +{
>>>> +})
>>>
>>> This seems slightly dangerous since there's nothing to guarantee that
>>> those registers aren't already live at the point of expansion.
>>>
>>> The more usual way (and IMO correct way) would be for:
>>>> +(define_insn "*mulsi3_prumac"
>>>> + [(set (reg:SI 104) (mult:SI (reg:SI 112) (reg:SI 116)))]
>>>> + ""
>>>> + "nop\;xin\\t0, r26, 4"
>>>> + [(set_attr "type" "alu")
>>>> + (set_attr "length" "8")])
>>>
>>> to have register classes and constraints for these three registers,
>>> like e.g. the x86 port does for %cx etc.
>>>
>>> It would be good if you could try that. On the other hand, if AVR
>>> already does this and it worked in practice then I guess it's not
>>> something that should hold up the port.
>> This suggestion worked. It was a matter of correct predicates.
>>
>> Here is what I will put in the next patch version:
>>
>> (define_predicate "pru_muldst_operand"
>> (and (match_code "reg")
>> (ior (match_test "REGNO_REG_CLASS (REGNO (op)) == MULDST_REGS")
>> (match_test "REGNO (op) >= FIRST_PSEUDO_REGISTER"))))
>> ...
>> (define_register_constraint "Rmd0" "MULDST_REGS"
>> "@internal
>> The multiply destination register.")
>> ...
>>
>> (define_insn "mulsi3"
>> [(set (match_operand:SI 0 "pru_muldst_operand" "=Rmd0")
>> (mult:SI (match_operand:SI 1 "pru_mulsrc0_operand" "Rms0")
>> (match_operand:SI 2 "pru_mulsrc1_operand" "Rms1")))]
>
> Looks good. Thanks for doing this.
>
> You could also add a "%" constraint to operand 1 to allow the input
> registers to be swapped, just on the off chance that that's useful.
> But I suppose there's a risk that providing that alternative will
> confuse the RA cost model a bit. Just a suggestion.
>
>>>> @@ -23444,6 +23449,56 @@ the offset with a symbol reference to a canary in
>>>> the TLS block.>
>>>> @end table
>>>>
>>>> +@node PRU Options
>>>> +@subsection PRU Options
>>>> +@cindex PRU Options
>>>> +
>>>> +These command-line options are defined for PRU target:
>>>> +
>>>> +@table @gcctabopt
>>>> +@item -minrt
>>>> +@opindex minrt
>>>> +Enable the use of a minimum runtime environment---no static
>>>> +initializers or constructors. Results in significant code size
>>>> +reduction of final ELF binaries.
>>>
>>> Up to you, but this read to me as "-m"+"inrt". If this option is already
>>> in use then obviously keep it, but if it's new then it might be worth
>>> calling it -mminrt instead, for consistency with -mmcu.
>> I assumed this is a standard naming since MSP430 already uses it. I've seen
>> the option being utilized by pru-gcc users, so let's keep it that way.
>
> OK.
>
>>> Despite the long reply (sorry), this looks in really good shape to me FWIW.
>>> Thanks for the submission.
>> I take this sentence as an indication that the PRU port is welcome for
>> inclusion into GCC. I'll work on fixing the comments and will send a new
>> patch
>> version.
>
> Well, process-wise, it's up to the steering committee to decide whether
> to accept the port. But one of the main requirements is of course
> reviewing the patches.
>
> Which of the other patches in the series still need review? I think Jeff
> approved some of them earlier.
I think the only thing outstanding was the .md/.c/.h files for the port
itself. The rest was very straightforward.
jeff