Re: (R5900) Implementing Vector Support

2016-06-03 Thread Richard Henderson

On 06/03/2016 05:54 AM, Woon yung Liu wrote:

The problem is that gen_lowpart() doesn't seem to support casting to other 
modes of the same size.


It certainly does.  The only place you get into trouble with gen_lowpart is 
with CONST_INT, which is mode-less.



But I am already doubting that I will complete this port as I can no longer see 
a favourable conclusion.


I would suggest that you post actual code to the list so that people can help 
you.  Simply answering questions in the abstract, as I have been doing, can 
only go so far.



r~



RE: (R5900) Implementing Vector Support

2016-06-03 Thread Matthew Fortune
Woon yung Liu writes:
> On Wednesday, June 1, 2016 5:45 AM, Richard Henderson wrote:
> > This is almost always incorrect, and certainly before reload.
> > You need to use gen_lowpart.  There are examples in the code
> > fragments that I sent the other week.
> 
> The problem is that gen_lowpart() doesn't seem to support casting to
> other modes of the same size.
> When I use it, the assert within gen_lowpart_general() will fail due to
> gen_lowpart_common() rejecting the operation (new mode size is not
> smaller than the old).

The conclusion we came to when developing MSA is that simplify_gen_subreg
is the way to go for converting between vector modes:

simplify_gen_subreg (new_mode, rtx, old_mode, 0)

I'm not sure there is much need to change modes after reload so do it
upfront at expand time or when splitting and you should be OK.

See trunk mips.c for a number of uses of this when converting vector modes.
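
As a rough analogy only (plain C, not GCC internals): a same-size mode
change such as V4SI to V8HI reinterprets the same 128 bits, it does not
convert any values. The sketch below models that with memcpy. The type
and function names are hypothetical stand-ins, not anything from mips.c.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the V4SI and V8HI machine modes.  */
typedef struct { int32_t e[4]; } v4si;
typedef struct { int16_t e[8]; } v8hi;

/* Model of a same-size mode change: the 128 bits are copied unchanged,
   only the element boundaries are reinterpreted.  */
static v8hi v4si_as_v8hi (v4si x)
{
  v8hi r;
  memcpy (&r, &x, sizeof r);
  return r;
}

/* The reverse view; a round trip gives back the original bits.  */
static v4si v8hi_as_v4si (v8hi x)
{
  v4si r;
  memcpy (&r, &x, sizeof r);
  return r;
}
```

Which halfword lands in which element depends on the endianness of the
host, which is the same kind of detail the subreg offset has to account
for in real RTL.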
 
> > You need to read the gcc internals documentation.  They are all three
> > different uses, though there is some overlap between define_insn and
> > define_expand.
> 
> I actually read the GCC internals documentation before I even began any
> attempt at this, but there was still a lot that I did not successfully
> grasp.
> 
> I'll go with define_expand then.

define_expand only provides a way to generate an instruction or sequence
of instructions to achieve the overall goal. You must also have
define_insn definitions for any pattern you emit or the generated code
will fail to match.

A define_insn_and_split is just shorthand for a define_insn where one
or more output patterns are '#' (split) and you want to define the
split alongside the instruction rather than as a separate define_split.
As far as I understand the difference is syntactic sugar.

> But I am already doubting that I will complete this port as I can no
> longer see a favourable conclusion.

It may take time but I'm sure we can help talk through the problems. As
a new GCC developer you are a welcome addition to the community.

Thanks,
Matthew



Re: (R5900) Implementing Vector Support

2016-05-31 Thread Richard Henderson

On 05/29/2016 12:59 AM, Woon yung Liu wrote:

Hi Richard,

I have solved the problems with the mulv8hi3 pattern; I needed to adjust the 
code within mips.c to allow the double-sized vector modes and to allow vector 
modes into the LO+HI accumulators.


Yes, I should have mentioned that you would need to do that.


What is the correct way to change the mode of registers? For example, I am 
doing this to change the mode for a register to V4SI within an expand:
   reg = gen_rtx_REG(V4SImode, REGNO (reg));


This is almost always incorrect, and certainly before reload.
You need to use gen_lowpart.  There are examples in the code
fragments that I sent the other week.


Finally, what is the difference between define_expand and 
define_insn_and_split? When should I ever use define_insn_and_split?


You need to read the gcc internals documentation.  They are all three different 
uses, though there is some overlap between define_insn and define_expand.



Are define_insn_and_split patterns used to avoid pseudo registers?


No.


r~



Re: (R5900) Implementing Vector Support

2016-05-18 Thread Richard Henderson

On 05/18/2016 05:16 AM, Woon yung Liu wrote:

I didn't know that, thanks.

I've re-done the instructions and expands, mostly based off the stuff that you 
shared earlier. Unfortunately, the test function wouldn't compile:


testv.c: In function 'testv8mult':
testv.c:87:1: error: unrecognizable insn:
}
^
(insn 7 4 8 2 (parallel [
            (set (reg:V8SI 201)
                (vec_select:V8SI (mult:V8SI (sign_extend:V8SI (reg/v:V8HI 198 [ v81 ]))
                        (sign_extend:V8SI (reg/v:V8HI 199 [ v82 ])))
                    (parallel [
                            (const_int 0 [0])
                            (const_int 1 [0x1])
                            (const_int 4 [0x4])
                            (const_int 5 [0x5])
                            (const_int 2 [0x2])
                            (const_int 3 [0x3])
                            (const_int 6 [0x6])
                            (const_int 7 [0x7])
                        ])))
            (clobber (scratch:V4SI))
        ]) testv.c:86 -1
     (nil))


You'd have to point me at your source to see what's gone wrong.


r~



Re: (R5900) Implementing Vector Support

2016-05-16 Thread Richard Henderson

On 05/14/2016 03:21 AM, Woon yung Liu wrote:

The current constraints allow GCC to access the 64-bit LO+HI register pair
as a single 128-bit register, so I am cheating by using both the x and wr
(new constraint for LO1+HI1) constraints.


That doesn't seem right.

The x constraint is for the hi/lo pair, whatever size it is.  You should be able 
to use that just fine with a 256 bit mode.



r~



Re: (R5900) Implementing Vector Support

2016-05-16 Thread Richard Henderson

On 05/15/2016 03:43 AM, Woon yung Liu wrote:

  testv.c:70:2: note: ==> examining statement: _5 = (int) _4;


You need to implement the vec_unpack* patterns.
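
The statement in that dump converts a short to an int, so the vectorizer
needs a way to widen each half of a V8HI into a V4SI. A rough C model of
what the signed unpack patterns (vec_unpacks_lo_m and vec_unpacks_hi_m in
the internals manual) compute is below. Which half counts as "lo" depends
on endianness; this sketch assumes the low-numbered elements are the low
half, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Sign-extend the four low-numbered halfwords (models the "lo" unpack
   under the little-endian numbering assumed above).  */
static void unpacks_lo (int32_t out[4], const int16_t in[8])
{
  for (int i = 0; i < 4; ++i)
    out[i] = in[i];
}

/* Sign-extend the four high-numbered halfwords (models the "hi" unpack
   under the same assumption).  */
static void unpacks_hi (int32_t out[4], const int16_t in[8])
{
  for (int i = 0; i < 4; ++i)
    out[i] = in[i + 4];
}
```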


But how can I tell what operations are required by autovectorization, that are 
currently not supported?


Well, the dumps you're looking at are the start.  But it also requires that you 
look through tree-vect-stmts.c.




My port is still missing the instructions for initializing vectors, and
inserting/setting and extracting values from vectors. They aren't
implemented yet because I haven't figured out how to implement them; the
documentation describes them as simple operations, but yet the
implementations within mips.c do a lot more things!


Efficient vector initialization requires that we detect some common cases.  We 
do that before the fully general mips_expand_vi_general.
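
To illustrate the idea of checking for a cheap case before falling back
to a general path, here is a minimal sketch in plain C. The single case
shown (every element equal, so one broadcast would do) and all names are
assumptions for illustration; this is not a description of what mips.c
actually checks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical strategy labels; not names from mips.c.  */
enum init_strategy { INIT_BROADCAST, INIT_GENERAL };

/* Pick the cheap path when every element has the same value, so that a
   single broadcast-style instruction could build the vector.  Anything
   else goes to the general element-by-element path.  */
static enum init_strategy choose_init (const int32_t *elts, size_t n)
{
  bool all_same = true;
  for (size_t i = 1; i < n; ++i)
    if (elts[i] != elts[0])
      all_same = false;
  return all_same ? INIT_BROADCAST : INIT_GENERAL;
}
```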



r~


Re: (R5900) Implementing Vector Support

2016-05-11 Thread Richard Henderson

On 05/11/2016 04:54 AM, Woon yung Liu wrote:

I saw that the EE has the PMFHL.LH instruction, which loads the HI/LO
register pairs (containing the multiplication result) into a single destination
(i.e. truncates the multiplication result in the process), with the right order
too.  I suppose that it would be suitable for implementing the mulm3 operation.
But  if I implement mulm3, is there still a need to implement the
vec_widen_smult_hi_m and vec_widen_smult_lo_m patterns?


Of course.  They're used for different things.  E.g.

  int out[100];
  short in1[100], in2[100];
  int i;

  for (i = 0; i < 100; ++i)
    out[i] = in1[i] * in2[i];

will use the vec_widen_smult* patterns.
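
For contrast, a small self-contained C sketch of the two kinds of loop.
The function names are hypothetical. The first keeps the result the same
width as the inputs, which is the shape a plain vector multiply such as
mulv8hi3 serves; the second widens, which is the shape that needs the
vec_widen_smult* patterns.

```c
#include <stdint.h>

#define N 100

/* Same-width product: only the low 16 bits of each product are kept.
   The cast through uint16_t makes the wrap explicit; the final
   conversion back to int16_t wraps on the usual two's complement
   implementations.  */
static void mul_trunc (int16_t *out, const int16_t *a, const int16_t *b)
{
  for (int i = 0; i < N; ++i)
    out[i] = (int16_t) (uint16_t) (a[i] * b[i]);
}

/* Widened product: each 16x16 multiply produces a full 32-bit result.  */
static void mul_widen (int32_t *out, const int16_t *a, const int16_t *b)
{
  for (int i = 0; i < N; ++i)
    out[i] = (int32_t) a[i] * b[i];
}
```

With both inputs equal to 300, the widened result is 90000 while the
same-width result keeps only 90000 mod 65536.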


I tried to implement the two patterns (vec_widen_smult_hi_m and
vec_widen_smult_lo_m), but GCC wouldn't compile due to both patterns having
the same operands. Must they be expands? If so, what sort of patterns should
the pcpyld and pcpyud instructions be? If I don't declare them differently,
I'll have the same compilation error again (due to them having the same
operands).


Yes I would think they should be expands.  I would expect something like

;; ??? Could describe the result in %3, if we ever find it useful.
(define_insn "pmulth_ee"
  [(set (match_operand:V8SI 0 "register_operand" "=x")
        (vec_select:V8SI
          (mult:V8SI
            (sign_extend:V8SI (match_operand:V8HI 1 "register_operand" "d"))
            (sign_extend:V8SI (match_operand:V8HI 2 "register_operand" "d")))
          (parallel
            [(const_int 0) (const_int 1) (const_int 4) (const_int 5)
             (const_int 2) (const_int 3) (const_int 6) (const_int 7)])))
   (clobber (match_scratch:V4SI 3 "=d"))]
  "..."
  "pmulth\t%3,%1,%2"
)

(define_insn "pmfhl_lh_ee_v8hi"
  [(set (match_operand:V8HI 0 "register_operand" "=d")
        (vec_select:V8HI
          (match_operand:V16HI 1 "register_operand" "x")
          (parallel
            [(const_int 0) (const_int 2)
             (const_int 8) (const_int 10)
             (const_int 4) (const_int 6)
             (const_int 12) (const_int 14)])))]
  "..."
  "pmfhl.lh\t%0"
)

;; ??? Maybe provide V4SI and V8HI versions too.
(define_insn "pmfhi_ee_v2di"
  [(set (match_operand:V2DI 0 "register_operand" "=d")
        (vec_select:V2DI
          (match_operand:V4DI 1 "register_operand" "x")
          (parallel [(const_int 2) (const_int 3)])))]
  "..."
  "pmfhi\t%0"
)

;; ??? Maybe provide V4SI and V8HI versions too.
(define_insn "pmflo_ee_v2di"
  [(set (match_operand:V2DI 0 "register_operand" "=d")
        (vec_select:V2DI
          (match_operand:V4DI 1 "register_operand" "x")
          (parallel [(const_int 0) (const_int 1)])))]
  "..."
  "pmflo\t%0"
)

;; ??? Maybe provide V4SI and V8HI versions too.
(define_insn "pcpyld_ee_v2di"
  [(set (match_operand:V2DI 0 "register_operand" "=d")
        (vec_select:V2DI
          (vec_concat:V4DI
            (match_operand:V2DI 1 "register_operand" "d")
            (match_operand:V2DI 2 "register_operand" "d"))
          (parallel [(const_int 0) (const_int 2)])))]
  "..."
  "pcpyld\t%0,%2,%1"
)

;; ??? Maybe provide V4SI and V8HI versions too.
(define_insn "pcpyud_ee_v2di"
  [(set (match_operand:V2DI 0 "register_operand" "=d")
        (vec_select:V2DI
          (vec_concat:V4DI
            (match_operand:V2DI 1 "register_operand" "d")
            (match_operand:V2DI 2 "register_operand" "d"))
          (parallel [(const_int 1) (const_int 3)])))]
  "..."
  "pcpyud\t%0,%1,%2"
)

(define_expand "mulv8hi3"
  [(match_operand:V8HI 0 "register_operand")
   (match_operand:V8HI 1 "register_operand")
   (match_operand:V8HI 2 "register_operand")]
  "..."
{
  rtx hilo = gen_reg_rtx (V8SImode);
  emit_insn (gen_pmulth_ee (hilo, operands[1], operands[2]));
  hilo = gen_lowpart (V16HImode, hilo);
  emit_insn (gen_pmfhl_lh_ee_v8hi (operands[0], hilo));
  DONE;
})

(define_expand "vec_widen_smult_lo_v8hi"
  [(match_operand:V4SI 0 "register_operand")
   (match_operand:V8HI 1 "register_operand")
   (match_operand:V8HI 2 "register_operand")]
  "..."
{
  rtx hilo = gen_reg_rtx (V8SImode);
  rtx hi = gen_reg_rtx (V2DImode);
  rtx lo = gen_reg_rtx (V2DImode);

  emit_insn (gen_pmulth_ee (hilo, operands[1], operands[2]));
  hilo = gen_lowpart (V4DImode, hilo);
  emit_insn (gen_pmfhi_ee_v2di (hi, hilo));
  emit_insn (gen_pmflo_ee_v2di (lo, hilo));
  emit_insn (gen_pcpyld_ee_v2di (gen_lowpart (V2DImode, operands[0]), lo, hi));
  DONE;
})

(define_expand "vec_widen_smult_hi_v8hi"
  [(match_operand:V4SI 0 "register_operand")
   (match_operand:V8HI 1 "register_operand")
   (match_operand:V8HI 2 "register_operand")]
  "..."
{
  rtx hilo = gen_reg_rtx (V8SImode);
  rtx hi = gen_reg_rtx (V2DImode);
  rtx lo = gen_reg_rtx (V2DImode);

  emit_insn (gen_pmulth_ee (hilo, operands[1], operands[2]));
  hilo = gen_lowpart (V4DImode, hilo);
  emit_insn (gen_pmfhi_ee_v2di (hi, hilo));
  emit_insn (gen_pmflo_ee_v2di (lo, hilo));
  emit_insn (gen_pcpyud_ee_v2di (gen_lowpart (V2DImode, operands[0]), lo, hi));
  DONE;
})

Re: (R5900) Implementing Vector Support

2016-05-09 Thread Richard Henderson

On 05/06/2016 09:28 PM, Woon yung Liu wrote:

Regarding multiplication of vectors, is there a way to work with a 
multiplication operation that results in something like this (the result is 
spread across these 3 registers), without re-ordering any elements:

RD: A6xB6, A4xB4, A2xB2, A0xB0

LO: A7xB7, A6xB6, A3xB3, A2xB2
HI: A5xB5, A4xB4, A1xB1, A0xB0

A0-A7 and B0-B7 are the 8 elements of two V8HI vectors, which are multiplied 
together to produce a widened multiplication result.

It looks like the vector hi/lo multiplication pattern would work with the 
values in HI and LO, but the elements don't seem to be in the order that 
GCC expects.

Assuming that it is possible to put this pattern to use, does GCC allow the 
vec_widen_smult_hi and
vec_widen_smult_lo patterns to be combined together? Like for the divmod 
(division + modulus) patterns.
The instruction described above (PMULTH) will result in calculation of both the 
hi and lo parts of the result, in one instruction. Hence combining the two 
patterns would be more efficient.


You can use this if you reshuffle the results.

Since it appears that PMULTH naturally produces even results in RD, it would 
seem to make the most sense to attempt to construct the odd results from LO+HI.
However, I don't see anything in the TX79 ISA that's particularly helpful there.


That said,

pmulth  r0, x, y
pmflo   t1
pmfhi   t2
pcpyld  r1, t1, t2
pcpyud  r2, t2, t1

would appear to produce the results gcc expects for the hi/lo multiples.

Don't worry overmuch about initially generating two copies of the pmulth 
instruction.  We have a similar problem with the ia64 patterns.  Rely on the 
rtl CSE pass to remove the duplicate instructions.
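
A plain C model of that reshuffle may make the element order easier to
follow. It takes the layout from the RTL patterns quoted elsewhere in
this thread (the pmulth_ee selector 0 1 4 5 | 2 3 6 7, pcpyld selecting
the low doubleword of each input and pcpyud the high one), not from the
hardware manual, and passes the inputs as (lo, hi) the way the expands
do. All function names are hypothetical.

```c
#include <stdint.h>

/* Products p[i] = a[i] * b[i], laid out across LO then HI in the order
   given by the pmulth_ee selector: LO = p0 p1 p4 p5, HI = p2 p3 p6 p7.  */
static void model_pmulth (int32_t lo[4], int32_t hi[4],
                          const int16_t a[8], const int16_t b[8])
{
  static const int sel[8] = { 0, 1, 4, 5, 2, 3, 6, 7 };
  for (int i = 0; i < 4; ++i)
    {
      lo[i] = (int32_t) a[sel[i]] * b[sel[i]];
      hi[i] = (int32_t) a[sel[i + 4]] * b[sel[i + 4]];
    }
}

/* Low doubleword (two words) of x, then low doubleword of y.  */
static void model_pcpyld (int32_t out[4], const int32_t x[4],
                          const int32_t y[4])
{
  out[0] = x[0]; out[1] = x[1]; out[2] = y[0]; out[3] = y[1];
}

/* High doubleword of x, then high doubleword of y.  */
static void model_pcpyud (int32_t out[4], const int32_t x[4],
                          const int32_t y[4])
{
  out[0] = x[2]; out[1] = x[3]; out[2] = y[2]; out[3] = y[3];
}
```

Under this model, combining the low doublewords of (lo, hi) gives
products 0..3 in order, and combining the high doublewords gives
products 4..7.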



r~


Re: (R5900) Implementing Vector Support

2016-05-02 Thread Richard Henderson

On 04/29/2016 07:54 AM, Liu Woon Yung wrote:

I've done something like that, but GCC still doesn't select the pattern to use:
(define_insn "vec_cmp"


Because you've used the wrong name.  The patterns are:

OPTAB_CD(vec_cmp_optab, "vec_cmp$a$b")
OPTAB_CD(vec_cmpu_optab, "vec_cmpu$a$b")

I see where the confusion is though.  These:

i386/sse.md:(define_expand "vec_cmp"
i386/sse.md:(define_expand "vec_cmp"
i386/sse.md:(define_expand "vec_cmp"
i386/sse.md:(define_expand "vec_cmp"
i386/sse.md:(define_expand "vec_cmpv2div2di"
i386/sse.md:(define_expand "vec_cmp"
i386/sse.md:(define_expand "vec_cmp"
i386/sse.md:(define_expand "vec_cmpu"
i386/sse.md:(define_expand "vec_cmpu"
i386/sse.md:(define_expand "vec_cmpu"
i386/sse.md:(define_expand "vec_cmpu"
i386/sse.md:(define_expand "vec_cmpuv2div2di"

are the only usage examples within the gcc tree.

All of the other "vec_cmp" stuff that you're seeing are internal to the 
rs6000 and s390 ports, for implementing builtins and/or vcond.



rs6000 doesn't implement bare comparisons, but only implements the "vcond"
conditional move, which uses the comparison.  Many of the other targets
do the same thing.


Is there a reason why implementing only vcond is preferred?


I believe that's just history.  IIRC, only vcond was present originally.

Amusingly, I believe that was because vcond was designed to handle one of the 
other MIPS vector extensions (MDMX?) wherein the comparison results are placed 
in (a set of) condition code registers, and thus producing a per-element {0,-1} 
vector result requires extra instructions.



r~


Re: (R5900) Implementing Vector Support

2016-04-04 Thread Richard Henderson

On 04/03/2016 09:12 PM, Woon yung Liu wrote:

I can't figure out how to implement comparison operations (specifically,
equals and the greater than operators). The GCC documentation mentions that
the pattern for comparison (==) should be vec_cmp, but I don't understand
why it has 4 operands and what they are used for.


The second operand is the comparison operator.  So given

  (set (reg:V4SI x) (eq:V4SI (reg:V4SI y) (reg:V4SI z)))

operand 0 is x,
operand 1 is the entire (eq ...) expression,
operand 2 is y,
operand 3 is z.

This is exactly the same as the normal integer cbranch patterns.
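
As a sketch of what the comparison is expected to produce, here is a
plain C model with made-up function names: each result element is all
ones (-1) where the comparison holds and 0 where it does not. The second
function shows why that form is convenient, since a vcond-style select
is then just bitwise masking.

```c
#include <stdint.h>

/* Model of a vector compare with an "eq" operator: -1 where the inputs
   match, 0 elsewhere.  */
static void cmp_eq (int32_t mask[4], const int32_t y[4], const int32_t z[4])
{
  for (int i = 0; i < 4; ++i)
    mask[i] = (y[i] == z[i]) ? -1 : 0;
}

/* Model of the select step of a conditional move: take a[i] where the
   mask element is -1 and b[i] where it is 0.  */
static void select_by_mask (int32_t out[4], const int32_t mask[4],
                            const int32_t a[4], const int32_t b[4])
{
  for (int i = 0; i < 4; ++i)
    out[i] = (a[i] & mask[i]) | (b[i] & ~mask[i]);
}
```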


I've implemented it
anyway, but GCC does not use it. I've taken a look at the rs6000 and
Loongson ports, and they seem to be implementing their comparison operators
with some non-standard pattern name and the pattern operations are different
too (i.e. Loongson uses unspec, while rs6000 uses gt and eq).


rs6000 doesn't implement bare comparisons, but only implements the "vcond" 
conditional move, which uses the comparison.  Many of the other targets do 
the same thing.



What happens when multiplication or division is performed on a vector? For
example: c = a * b; where a and b are both V4SI vectors. What vector type
would c be? Would it become another V4SI (meaning that the multiplication
result is truncated) or V4DI?


It would be the truncated V4SI mode.

There are other named patterns that implement widening multiply.  Which you 
choose depends on how the hardware selects which operands to include in the 
multiply.  Let { A, B, C, D } and { W, X, Y, Z} be V4SI inputs, then


  Optab                              Result
  vec_widen_{s,u}mult_hi_<mode>      { A * W, B * X }
  vec_widen_{s,u}mult_lo_<mode>      { C * Y, D * Z }
  vec_widen_{s,u}mult_even_<mode>    { A * W, C * Y }
  vec_widen_{s,u}mult_odd_<mode>     { B * X, D * Z }
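
The table can be written out as a small C model, taking { A, B, C, D }
and { W, X, Y, Z } as elements 0..3 of the two inputs. This only
restates the table for the signed case; the enum and function names are
hypothetical.

```c
#include <stdint.h>

/* One label per row of the table above.  */
enum widen_kind { WIDEN_HI, WIDEN_LO, WIDEN_EVEN, WIDEN_ODD };

/* Multiply the two selected element pairs and widen each product from
   32 to 64 bits.  */
static void widen_smult (int64_t out[2], const int32_t a[4],
                         const int32_t b[4], enum widen_kind kind)
{
  /* Source element indices for each variant, in table order.  */
  static const int idx[4][2] = {
    { 0, 1 },   /* hi:   A*W, B*X */
    { 2, 3 },   /* lo:   C*Y, D*Z */
    { 0, 2 },   /* even: A*W, C*Y */
    { 1, 3 }    /* odd:  B*X, D*Z */
  };
  for (int i = 0; i < 2; ++i)
    {
      int k = idx[kind][i];
      out[i] = (int64_t) a[k] * b[k];
    }
}
```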


I also would like to ask about implementing bitwise-shifting. The R5900's
vector-shifting instructions are like the MIPS sll, srl and sra instructions,
whereby they use an immediate to shift all elements within the vector. Based on
the GCC documentation, a scalar can be used, but it will be first converted
into a similarly-sized vector


There are three different types of shifting: by a scalar (all elements shifted 
by the same amount), by a vector (every element receives its own shift amount), 
and full vector (shifting is not restricted to the element boundaries).


Scalar shifts: ashr<mode>3, ashl<mode>3, lshr<mode>3.
Vector shifts: vashr<mode>3, vashl<mode>3, vlshr<mode>3.
Full shifts: vec_shl_<mode>, vec_shr_<mode>.
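
A plain C sketch of the three kinds of shift, for a left shift on four
32-bit elements. The function names are made up. For the whole-vector
case the direction and the choice of whole-element units are only
illustrative assumptions; the point is that data crosses element
boundaries.

```c
#include <stdint.h>

/* Scalar shift: one count applied to every element (count < 32).  */
static void shl_scalar (uint32_t v[4], unsigned count)
{
  for (int i = 0; i < 4; ++i)
    v[i] <<= count;
}

/* Vector shift: every element has its own count (each < 32).  */
static void shl_vector (uint32_t v[4], const uint32_t counts[4])
{
  for (int i = 0; i < 4; ++i)
    v[i] <<= counts[i];
}

/* Whole-vector shift, modelled here in units of whole elements: contents
   move toward higher element numbers and zeros are shifted in.  */
static void shl_whole (uint32_t v[4], unsigned elts)
{
  for (int i = 3; i >= 0; --i)
    v[i] = (i >= (int) elts) ? v[i - (int) elts] : 0;
}
```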


Finally, what should I be modifying, if I want to implement extraction and
packing of the upper 64-bits of the 128-bit vector? Right now, GCC will just
generate multiple shifts (i.e. dsll32) to access the upper 64-bits, which is
not legal. This means that using any operation that requires unimplemented
patterns will not work correctly.


You want to implement vec_init, vec_extract, and vec_set.

You also want to implement as many vec_perm_const patterns as you can. 
The existing mips_expand_vec_perm_const_1 code for loongson should be a good 
starting point.  The most important patterns that you'll want to be sure that 
you can handle are interleave, even/odd, and broadcast.  These are generated by 
the vectorizer.  You may wish to examine the aarch64 code for additional ideas; 
it all depends on what sort of instructions you have available.
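
To make the three permutation shapes concrete, here is a small C model
of a two-input constant permutation on four-element vectors. Selector
values 0..3 pick from the first input and 4..7 from the second. The
function and the selector names are hypothetical, and only the low
interleave is shown.

```c
#include <stdint.h>

/* Apply a constant permutation: sel[i] indexes the 8-element
   concatenation of a (0..3) and b (4..7).  */
static void vec_perm (int32_t out[4], const int32_t a[4],
                      const int32_t b[4], const unsigned char sel[4])
{
  for (int i = 0; i < 4; ++i)
    out[i] = sel[i] < 4 ? a[sel[i]] : b[sel[i] - 4];
}

/* Selectors for the shapes mentioned above.  */
static const unsigned char sel_interleave_lo[4] = { 0, 4, 1, 5 };
static const unsigned char sel_even[4]          = { 0, 2, 4, 6 };
static const unsigned char sel_odd[4]           = { 1, 3, 5, 7 };
static const unsigned char sel_broadcast0[4]    = { 0, 0, 0, 0 };
```

A port would try to recognise selectors like these and emit a single
instruction for each, falling back to something slower for arbitrary
selectors.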



r~