Re: (R5900) Implementing Vector Support
On 06/03/2016 05:54 AM, Woon yung Liu wrote:
> The problem is that gen_lowpart() doesn't seem to support casting to
> other modes of the same size.

It certainly does. The only place you get into trouble with gen_lowpart
is with CONST_INT, which is mode-less.

> But I am already doubting that I will complete this port as I can no
> longer see a favourable conclusion.

I would suggest that you post actual code to the list so that people can
help you. Simply answering questions in the abstract, as I have been
doing, can only go so far.

r~
RE: (R5900) Implementing Vector Support
Woon yung Liu writes:
> On Wednesday, June 1, 2016 5:45 AM, Richard Henderson wrote:
> > This is almost always incorrect, and certainly before reload.
> > You need to use gen_lowpart. There are examples in the code
> > fragments that I sent the other week.
>
> The problem is that gen_lowpart() doesn't seem to support casting to
> other modes of the same size.
> When I use it, the assert within gen_lowpart_general() will fail due to
> gen_lowpart_common() rejecting the operation (new mode size is not
> smaller than the old).

The conclusion we came to when developing MSA is that simplify_gen_subreg
is the way to go for converting between vector modes:

  simplify_gen_subreg (new_mode, rtx, old_mode, 0)

I'm not sure there is much need to change modes after reload, so do it
upfront at expand time or when splitting and you should be OK. See trunk
mips.c for a number of uses of this when converting vector modes.

> > You need to read the gcc internals documentation. They are all three
> > different uses, though there is some overlap between define_insn and
> > define_expand.
>
> I actually read the GCC internals documentation before I even began any
> attempt at this, but there was still a lot that I did not successfully
> grasp.
>
> I'll go with define_expand then.

define_expand only provides a way to generate an instruction or sequence
of instructions to achieve the overall goal. You must also have
define_insn definitions for any pattern you emit or the generated code
will fail to match.

A define_insn_and_split is just shorthand for a define_insn where one or
more output patterns are '#' (split) and you want to define the split
alongside the instruction rather than as a separate define_split. As far
as I understand, the difference is syntactic sugar.

> But I am already doubting that I will complete this port as I can no
> longer see a favourable conclusion.

It may take time but I'm sure we can help talk through the problems.
As a new GCC developer you are a welcome addition to the community.

Thanks,
Matthew
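To make the "same-size mode change" above concrete, here is a minimal standalone C sketch of what a subreg at byte offset 0 between two equal-sized vector modes means: the bits stay put and only the element view changes. This is not GCC code. The types v4si and v8hi and both helper functions are hypothetical stand-ins invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for two 128-bit vector modes. */
typedef struct { int32_t e[4]; } v4si;
typedef struct { int16_t e[8]; } v8hi;

/* View the 128 bits of a V4SI value as V8HI: same bits, new mode.
   Conceptually this is a subreg at byte offset 0 between modes of
   equal size; no element is moved or converted. */
v8hi v4si_as_v8hi(v4si x)
{
    v8hi r;
    assert(sizeof r == sizeof x);   /* only meaningful for equal sizes */
    memcpy(&r, &x, sizeof r);
    return r;
}

/* The reverse view, V8HI back to V4SI. */
v4si v8hi_as_v4si(v8hi x)
{
    v4si r;
    assert(sizeof r == sizeof x);
    memcpy(&r, &x, sizeof r);
    return r;
}
```

Which 16-bit element lands on which half of a 32-bit element depends on byte order, which is one reason to let the compiler's own helpers build the subreg instead of creating a fresh hard REG in the new mode.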
Re: (R5900) Implementing Vector Support
On 05/29/2016 12:59 AM, Woon yung Liu wrote:
> Hi Richard, I have solved the problems with the mulv8hi3 pattern; I
> needed to adjust the code within mips.c to allow the double-sized
> vector modes and to allow vector modes into the LO+HI accumulators.

Yes, I should have mentioned that you would need to do that.

> What is the correct way to change the mode of registers? For example,
> I am doing this to change the mode for a register to V4SI within an
> expand:
>
>   reg = gen_rtx_REG(V4SImode, REGNO (reg));

This is almost always incorrect, and certainly before reload. You need
to use gen_lowpart. There are examples in the code fragments that I sent
the other week.

> Finally, what is the difference between define_expand and
> define_insn_and_split? When should I ever use define_insn_and_split?

You need to read the gcc internals documentation. They are all three
different uses, though there is some overlap between define_insn and
define_expand.

> Are define_insn_and_split patterns used to avoid pseudo registers?

No.

r~
Re: (R5900) Implementing Vector Support
On 05/18/2016 05:16 AM, Woon yung Liu wrote:
> I didn't know that, thanks. I've re-done the instructions and expands,
> mostly based off the stuff that you shared earlier. Unfortunately, the
> test function wouldn't compile:
>
> testv.c: In function 'testv8mult':
> testv.c:87:1: error: unrecognizable insn:
>  }
>  ^
> (insn 7 4 8 2 (parallel [
>         (set (reg:V8SI 201)
>             (vec_select:V8SI (mult:V8SI (sign_extend:V8SI (reg/v:V8HI 198 [ v81 ]))
>                     (sign_extend:V8SI (reg/v:V8HI 199 [ v82 ])))
>                 (parallel [
>                         (const_int 0 [0])
>                         (const_int 1 [0x1])
>                         (const_int 4 [0x4])
>                         (const_int 5 [0x5])
>                         (const_int 2 [0x2])
>                         (const_int 3 [0x3])
>                         (const_int 6 [0x6])
>                         (const_int 7 [0x7])
>                     ])))
>         (clobber (scratch:V4SI))
>     ]) testv.c:86 -1
>      (nil))

You'd have to point me at your source to see what's gone wrong.

r~
Re: (R5900) Implementing Vector Support
On 05/14/2016 03:21 AM, Woon yung Liu wrote:
> The current constraints allow GCC to access the 64-bit LO+HI register
> pair as a single 128-bit register, so I am cheating by using both the
> x and wr (new constraint for LO1+HI1) constraints.

That doesn't seem right. The x constraint is for the hi/lo pair,
whatever size it is. You should be able to use that just fine with a
256 bit mode.

r~
Re: (R5900) Implementing Vector Support
On 05/15/2016 03:43 AM, Woon yung Liu wrote:
> testv.c:70:2: note: ==> examining statement: _5 = (int) _4;

You need to implement the vec_unpack* patterns.

> But how can I tell what operations are required by autovectorization,
> that are currently not supported?

Well, the dumps you're looking at are the start. But it also requires
that you look through tree-vect-stmts.c.

> My port is still missing the instructions for initializing vectors,
> and inserting/setting and extracting values from vectors. They aren't
> implemented yet because I haven't figured out how to implement them;
> the documentation describes them as simple operations, but yet the
> implementations within mips.c do a lot more things!

Efficient vector initialization requires that we detect some common
cases. We do that before the fully general mips_expand_vi_general.

r~
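As a rough illustration of what the signed vec_unpack patterns are expected to compute for a statement like `_5 = (int) _4`, here is a small standalone C model: half of the narrow elements are sign-extended into a vector of the same total size. The type names and the helper unpack_signed are hypothetical, and the sketch assumes that "lo" means elements 0..3; on a big-endian target the hi/lo roles are the other way around.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the 128-bit vector modes involved. */
typedef struct { int16_t e[8]; } v8hi;
typedef struct { int32_t e[4]; } v4si;

/* Sign-extend four consecutive 16-bit elements starting at `first`. */
v4si unpack_signed(v8hi x, int first)
{
    v4si r;
    assert(first == 0 || first == 4);
    for (int i = 0; i < 4; i++)
        r.e[i] = (int32_t)x.e[first + i];
    return r;
}

/* Model of the "lo" unpack: widen elements 0..3 (assumed numbering). */
v4si unpacks_lo(v8hi x) { return unpack_signed(x, 0); }

/* Model of the "hi" unpack: widen elements 4..7 (assumed numbering). */
v4si unpacks_hi(v8hi x) { return unpack_signed(x, 4); }
```

One V8HI input therefore yields two V4SI outputs, which is why the vectorizer asks for a hi and a lo pattern as a pair.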
Re: (R5900) Implementing Vector Support
On 05/11/2016 04:54 AM, Woon yung Liu wrote:
> I saw that the EE has the PMFHL.LH instruction, which loads the HI/LO
> register pairs (containing the multiplication result) into a single
> destination (i.e. truncates the multiplication result in the process),
> with the right order too. I suppose that it would be suitable for
> implementing the mulm3 operation.
>
> But if I implement mulm3, is there still a need to implement the
> vec_widen_smult_hi_m and vec_widen_smult_lo_m patterns?

Of course. They're used for different things. E.g.

  int out[100];
  short in1[100], in2[100];

  for (i = 0; i < 100; ++i)
    out[i] = in1[i] * in2[i];

will use the vec_widen_smult* patterns.

> I tried to implement the two patterns (vec_widen_smult_hi_m and
> vec_widen_smult_lo_m), but GCC wouldn't compile due to both patterns
> having the same operands.
>
> Must they be expands? If so, what sort of patterns should the pcpyld
> and pcpyud instructions be? If I don't declare them differently, I'll
> have the same compilation error again (due to them having the same
> operands).

Yes I would think they should be expands. I would expect something like

;; ??? Could describe the result in %3, if we ever find it useful.
(define_insn "pmulth_ee"
  [(set (match_operand:V8SI 0 "register_operand" "=x")
        (vec_select:V8SI
          (mult:V8SI
            (sign_extend:V8SI
              (match_operand:V8HI 1 "register_operand" "d"))
            (sign_extend:V8SI
              (match_operand:V8HI 2 "register_operand" "d")))
          (parallel [(const_int 0) (const_int 1)
                     (const_int 4) (const_int 5)
                     (const_int 2) (const_int 3)
                     (const_int 6) (const_int 7)])))
   (clobber (match_scratch:V4SI 3 "=d"))]
  "..."
  "pmulth\t%3,%1,%2"
)

(define_insn "pmfhl_lh_ee_v8hi"
  [(set (match_operand:V8HI 0 "register_operand" "=d")
        (vec_select:V8HI
          (match_operand:V16HI 1 "register_operand" "x")
          (parallel [(const_int 0) (const_int 2)
                     (const_int 8) (const_int 10)
                     (const_int 4) (const_int 6)
                     (const_int 12) (const_int 14)])))]
  "..."
  "pmfhl.lh\t%0"
)

;; ??? Maybe provide V4SI and V8HI versions too.
(define_insn "pmfhi_ee_v2di"
  [(set (match_operand:V2DI 0 "register_operand" "=d")
        (vec_select:V2DI
          (match_operand:V4DI 1 "register_operand" "x")
          (parallel [(const_int 2) (const_int 3)])))]
  "..."
  "pmfhi\t%0"
)

;; ??? Maybe provide V4SI and V8HI versions too.
(define_insn "pmflo_ee_v2di"
  [(set (match_operand:V2DI 0 "register_operand" "=d")
        (vec_select:V2DI
          (match_operand:V4DI 1 "register_operand" "x")
          (parallel [(const_int 0) (const_int 1)])))]
  "..."
  "pmflo\t%0"
)

;; ??? Maybe provide V4SI and V8HI versions too.
(define_insn "pcpyld_ee_v2di"
  [(set (match_operand:V2DI 0 "register_operand" "=d")
        (vec_select:V2DI
          (vec_concat:V4DI
            (match_operand:V2DI 1 "register_operand" "d")
            (match_operand:V2DI 2 "register_operand" "d"))
          (parallel [(const_int 0) (const_int 2)])))]
  "..."
  "pcpyld\t%0,%2,%1"
)

;; ??? Maybe provide V4SI and V8HI versions too.
(define_insn "pcpyud_ee_v2di"
  [(set (match_operand:V2DI 0 "register_operand" "=d")
        (vec_select:V2DI
          (vec_concat:V4DI
            (match_operand:V2DI 1 "register_operand" "d")
            (match_operand:V2DI 2 "register_operand" "d"))
          (parallel [(const_int 1) (const_int 3)])))]
  "..."
  "pcpyud\t%0,%1,%2"
)

(define_expand "mulv8hi3"
  [(match_operand:V8HI 0 "register_operand")
   (match_operand:V8HI 1 "register_operand")
   (match_operand:V8HI 2 "register_operand")]
  "..."
{
  rtx hilo = gen_reg_rtx (V8SImode);
  emit_insn (gen_pmulth_ee (hilo, operands[1], operands[2]));
  hilo = gen_lowpart (V16HImode, hilo);
  emit_insn (gen_pmfhl_lh_ee_v8hi (operands[0], hilo));
  DONE;
})

;; The mode suffix names the input mode, so V8HI inputs give _v8hi.
(define_expand "vec_widen_smult_lo_v8hi"
  [(match_operand:V4SI 0 "register_operand")
   (match_operand:V8HI 1 "register_operand")
   (match_operand:V8HI 2 "register_operand")]
  "..."
{
  rtx hilo = gen_reg_rtx (V8SImode);
  rtx hi = gen_reg_rtx (V2DImode);
  rtx lo = gen_reg_rtx (V2DImode);
  emit_insn (gen_pmulth_ee (hilo, operands[1], operands[2]));
  hilo = gen_lowpart (V4DImode, hilo);
  emit_insn (gen_pmfhi_ee_v2di (hi, hilo));
  emit_insn (gen_pmflo_ee_v2di (lo, hilo));
  emit_insn (gen_pcpyld_ee_v2di (gen_lowpart (V2DImode, operands[0]),
                                 lo, hi));
  DONE;
})

(define_expand "vec_widen_smult_hi_v8hi"
  [(match_operand:V4SI 0 "register_operand")
   (match_operand:V8HI 1 "register_operand")
   (match_operand:V8HI 2 "register_operand")]
  "..."
{
  rtx hilo = gen_reg_rtx (V8SImode);
  rtx hi = gen_reg_rtx (V2DImode);
  rtx lo = gen_reg_rtx (V2DImode);
  emit_insn (gen_pmulth_ee (hilo, operands[1], operands[2]));
  hilo = gen_lowpart (V4DImode, hilo);
  emit_insn (gen_pmfhi_ee_v2di (hi, hilo));
  emit_insn (gen_pmflo_ee_v2di (lo, hilo));
  emit_insn (gen_pcpyud_ee_v2di (gen_lowpart (V2DImode, operands[0]),
                                 lo, hi));
  DONE;
})
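The lane shuffling in the two widening expands above is easy to get wrong, so here is a standalone C model of it. It follows only the vec_select selectors written in the patterns in this message; it does not claim to model the real hardware, and every type and function name in it is hypothetical. Each 64-bit element is represented as a pair of adjacent 32-bit lanes so that the model does not depend on host byte order.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int16_t e[8]; } v8hi;
typedef struct { int32_t e[4]; } v4si;   /* also used as a V2DI view */
typedef struct { int32_t e[8]; } v8si;   /* 256-bit LO+HI pair / V4DI view */

/* Selector from the pmulth_ee pattern: lane i holds product sel[i]. */
const int pmulth_sel[8] = {0, 1, 4, 5, 2, 3, 6, 7};

/* Model of pmulth_ee: sign-extend, multiply, permute into LO+HI. */
v8si pmulth(v8hi a, v8hi b)
{
    v8si hilo;
    for (int i = 0; i < 8; i++) {
        int k = pmulth_sel[i];
        hilo.e[i] = (int32_t)a.e[k] * (int32_t)b.e[k];
    }
    return hilo;
}

/* vec_select of 64-bit elements s0 and s1 out of a 4 x 64-bit value. */
v4si select_v2di(v8si x, int s0, int s1)
{
    assert(s0 >= 0 && s0 < 4 && s1 >= 0 && s1 < 4);
    v4si r = {{x.e[2 * s0], x.e[2 * s0 + 1], x.e[2 * s1], x.e[2 * s1 + 1]}};
    return r;
}

/* vec_concat of two 2 x 64-bit values. */
v8si concat_v2di(v4si first, v4si second)
{
    v8si r;
    for (int i = 0; i < 4; i++) {
        r.e[i] = first.e[i];
        r.e[i + 4] = second.e[i];
    }
    return r;
}

/* Model of the vec_widen_smult_lo expand: pmflo [0,1], pmfhi [2,3],
   then pcpyld selecting elements [0,2] of (concat lo hi). */
v4si widen_smult_lo(v8hi a, v8hi b)
{
    v8si hilo = pmulth(a, b);
    v4si lo = select_v2di(hilo, 0, 1);
    v4si hi = select_v2di(hilo, 2, 3);
    return select_v2di(concat_v2di(lo, hi), 0, 2);
}

/* Model of the vec_widen_smult_hi expand: same, but pcpyud [1,3]. */
v4si widen_smult_hi(v8hi a, v8hi b)
{
    v8si hilo = pmulth(a, b);
    v4si lo = select_v2di(hilo, 0, 1);
    v4si hi = select_v2di(hilo, 2, 3);
    return select_v2di(concat_v2di(lo, hi), 1, 3);
}
```

Under these selectors the "lo" expand ends up with the widened products of elements 0..3 and the "hi" expand with those of elements 4..7, which is the in-order split the vectorizer wants on a little-endian target.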
Re: (R5900) Implementing Vector Support
On 05/06/2016 09:28 PM, Woon yung Liu wrote:
> Regarding multiplication of vectors, is there a way to work with a
> multiplication operation that results in something like this (the
> result is spread across these 3 registers), without re-ordering any
> elements:
>
>   RD: A6xB6, A4xB4, A2xB2, A0xA0
>   LO: A7xB7, A6xB6, A3xB3, A2xA2
>   HI: A5xB5, A4xB4, A1xB1, A0xA0
>
> A0-A7 and B0-B7 are the 8 elements of two V8HI vectors, which are
> multiplied together to produce a widened multiplication result.
>
> It looks like the vector hi/lo multiplication pattern would work with
> the values in HI and LO, but the order of the elements doesn't seem to
> be in a way that GCC expects.
>
> Assuming that it is possible to put this pattern to use, does GCC
> allow the vec_widen_smult_hi and vec_widen_smult_lo patterns to be
> combined together? Like for the divmod (division + modulus) patterns.
> The instruction described above (PMULTH) will result in calculation of
> both the hi and lo parts of the result, in one instruction. Hence
> combining the two patterns would be more efficient.

You can use this if you reshuffle the results.

Since it appears that PMULTH naturally produces even results in RD, it
would seem to make the most sense to attempt to construct the odd
results from LO+HI. However, I don't see anything in the TX79 isa that's
particularly helpful there.

That said,

  pmulth r0, x, y
  pmflo  t1
  pmfhi  t2
  pcpyld r1, t1, t2
  pcpyud r2, t2, t1

would appear to produce the results gcc expects for the hi/lo multiplies.

Don't worry overmuch about initially generating two copies of the pmulth
instruction. We have a similar problem with the ia64 patterns. Rely on
the rtl CSE pass to remove the duplicate instructions.

r~
Re: (R5900) Implementing Vector Support
On 04/29/2016 07:54 AM, Liu Woon Yung wrote:
> I've done something like that, but GCC still doesn't select the
> pattern to use:
>
> (define_insn "vec_cmp"

Because you've used the wrong name. The patterns are:

  OPTAB_CD(vec_cmp_optab, "vec_cmp$a$b")
  OPTAB_CD(vec_cmpu_optab, "vec_cmpu$a$b")

I see where the confusion is though. These:

  i386/sse.md:(define_expand "vec_cmp"
  i386/sse.md:(define_expand "vec_cmp"
  i386/sse.md:(define_expand "vec_cmp"
  i386/sse.md:(define_expand "vec_cmp"
  i386/sse.md:(define_expand "vec_cmpv2div2di"
  i386/sse.md:(define_expand "vec_cmp"
  i386/sse.md:(define_expand "vec_cmp"
  i386/sse.md:(define_expand "vec_cmpu"
  i386/sse.md:(define_expand "vec_cmpu"
  i386/sse.md:(define_expand "vec_cmpu"
  i386/sse.md:(define_expand "vec_cmpu"
  i386/sse.md:(define_expand "vec_cmpuv2div2di"

are the only usage examples within the gcc tree. All of the other
"vec_cmp" stuff that you're seeing is internal to the rs6000 and s390
ports, for implementing builtins and/or vcond.

> > rs6000 doesn't implement bare comparisons, but only implements the
> > "vcond" conditional move, which uses the comparison. Many of the
> > other targets do the same thing.
>
> Is there a reason why implementing only vcond is preferred?

I believe that's just history. IIRC, only vcond was present originally.

Amusingly, I believe that was because vcond was designed to handle one
of the other MIPS vector extensions (MDMX?) wherein the comparison
results are placed in (a set of) condition code registers, and thus
producing a per-element {0,-1} vector result requires extra
instructions.

r~
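For readers unsure how a bare vector comparison relates to vcond, the following standalone C sketch models both on four 32-bit lanes. The comparison produces the per-element {0,-1} mask mentioned above, and the vcond-style helper uses such a mask to pick between two value vectors. All names are hypothetical; this is a model of the semantics, not of any port's implementation.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t e[4]; } v4si;

/* Bare comparison: each lane becomes -1 (all ones) if equal, else 0. */
v4si vec_cmp_eq(v4si a, v4si b)
{
    v4si m;
    for (int i = 0; i < 4; i++)
        m.e[i] = (a.e[i] == b.e[i]) ? -1 : 0;
    return m;
}

/* Bare comparison: signed greater-than, same {0,-1} mask convention. */
v4si vec_cmp_gt(v4si a, v4si b)
{
    v4si m;
    for (int i = 0; i < 4; i++)
        m.e[i] = (a.e[i] > b.e[i]) ? -1 : 0;
    return m;
}

/* vcond-style conditional move: r[i] = (c[i] > d[i]) ? x[i] : y[i].
   Built here from the mask, as a target with mask compares would do. */
v4si vcond_gt(v4si x, v4si y, v4si c, v4si d)
{
    v4si m = vec_cmp_gt(c, d);
    v4si r;
    for (int i = 0; i < 4; i++)
        r.e[i] = (x.e[i] & m.e[i]) | (y.e[i] & ~m.e[i]);
    return r;
}
```

A target whose compare instructions already produce such masks gets the bare comparison for free, while one that writes condition codes has to materialize the mask first, which matches the history described above.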
Re: (R5900) Implementing Vector Support
On 04/03/2016 09:12 PM, Woon yung Liu wrote:
> I can't figure out how to implement comparison operations
> (specifically, equals and the greater than operators). The GCC
> documentation mentions that the pattern for comparison (==) should be
> vec_cmp, but I don't understand why it has 4 operands and what they
> are used for.

The second operand is the comparison operator. So given

  (set (reg:V4SI x) (eq:V4SI (reg:V4SI y) (reg:V4SI z)))

operand 0 is x, operand 1 is the entire (eq ...) expression, operand 2
is y, operand 3 is z. This is exactly the same as the normal integer
cbranch patterns.

> I've implemented it anyway, but GCC does not use it. I've taken a look
> at the rs6000 and Loongson ports, and they seem to be implementing
> their comparison operators with some non-standard pattern name and the
> pattern operations are different too (i.e. Loongson uses unspec, while
> rs6000 uses gt and eq).

rs6000 doesn't implement bare comparisons, but only implements the
"vcond" conditional move, which uses the comparison. Many of the other
targets do the same thing.

> What happens when multiplication or division is performed on a vector?
> For example:
>
>   c = a * b;
>
> Whereby a and b are both V4SI vectors. What vector type would c be?
> Would it become another V4SI (meaning that the multiplication result
> is truncated) or V4DI?

It would be the truncated V4SI mode.

There are other named patterns that implement widening multiply. Which
you choose depends on how the hardware selects which operands to include
in the multiply. Let { A, B, C, D } and { W, X, Y, Z } be V4SI inputs,
then

  Optab                               Result
  vec_widen_{s,u}mult_hi_<mode>       { A * W, B * X }
  vec_widen_{s,u}mult_lo_<mode>       { C * Y, D * Z }
  vec_widen_{s,u}mult_even_<mode>     { A * W, C * Y }
  vec_widen_{s,u}mult_odd_<mode>      { B * X, D * Z }

> I also would like to ask about implementing bitwise-shifting. The
> R5900's vector-shifting instructions are like the MIPS sll, srl and
> sra instructions, whereby they use an immediate to shift all elements
> within the vector. Based on the GCC documentation, a scalar can be
> used, but it will be first converted into a similarly-sized vector

There are three different types of shifting: by a scalar (all elements
shifted by the same amount), by a vector (every element receives its own
shift amount), and full vector (shifting is not restricted to the
element boundaries).

  Scalar shifts: ashr<mode>3, ashl<mode>3, lshr<mode>3.
  Vector shifts: vashr<mode>3, vashl<mode>3, vlshr<mode>3.
  Full shifts:   vec_shl_<mode>, vec_shr_<mode>.

> Finally, what should I be modifying, if I want to implement extraction
> and packing of the upper 64-bits of the 128-bit vector? Right now, GCC
> will just generate multiple shifts (i.e. dsll32) to access the upper
> 64-bits, which is not legal. This means that using any operation that
> requires unimplemented patterns will not work correctly.

You want to implement vec_init, vec_extract, and vec_set.

You also want to implement as many vec_perm_const patterns as you can.
The existing mips_expand_vec_perm_const_1 code for loongson should be a
good starting point. The most important patterns that you'll want to be
sure that you can handle are interleave, even/odd, and broadcast. These
are generated by the vectorizer. You may wish to examine the aarch64
code for additional ideas; it all depends on what sort of instructions
you have available.

r~
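To show what the interleave, even/odd, and broadcast permutations mentioned above look like as constant selectors, here is a small standalone C model of a two-input permute on V4SI. The selector indexes the 8-element concatenation of the two inputs. The function and selector names are hypothetical, and the "lo"/"hi" labels assume that element 0 is the low end of the vector.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t e[4]; } v4si;

/* Two-input constant permute: lane i of the result is element sel[i]
   of the concatenation { a[0..3], b[0..3] }. */
v4si vec_perm(v4si a, v4si b, const int sel[4])
{
    v4si r;
    for (int i = 0; i < 4; i++) {
        assert(sel[i] >= 0 && sel[i] < 8);
        r.e[i] = sel[i] < 4 ? a.e[sel[i]] : b.e[sel[i] - 4];
    }
    return r;
}

/* Selectors for the permutations the vectorizer most commonly asks for. */
const int sel_interleave_lo[4] = {0, 4, 1, 5};  /* a0 b0 a1 b1 */
const int sel_interleave_hi[4] = {2, 6, 3, 7};  /* a2 b2 a3 b3 */
const int sel_even[4]          = {0, 2, 4, 6};  /* a0 a2 b0 b2 */
const int sel_odd[4]           = {1, 3, 5, 7};  /* a1 a3 b1 b3 */
const int sel_broadcast0[4]    = {0, 0, 0, 0};  /* a0 a0 a0 a0 */
```

A vec_perm_const expander would typically compare the incoming selector against shapes like these and emit a single instruction when one matches, falling back to a slower general sequence otherwise.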