Is this patch now OK for trunk? Thank you!
thanks,
Cong

On Tue, Nov 5, 2013 at 9:58 AM, Cong Hou <co...@google.com> wrote:
> Thank you for your detailed explanation.
>
> Once GCC detects a reduction operation, it will automatically
> accumulate all elements in the vector after the loop.  In the loop the
> reduction variable is always a vector whose elements are reductions of
> corresponding values from other vectors.  Therefore in your case the
> only instruction you need to generate is:
>
>   VABAL ops[3], ops[1], ops[2]
>
> It is OK if you accumulate the elements into one in the vector inside
> the loop (if one instruction can do this), but you have to make sure
> the other elements in the vector remain zero so that the final result
> is correct.
>
> If you are confused about the documentation, check the one for
> udot_prod (just above usad in md.texi), as it has very similar
> behavior to usad.  Actually I copied the text from there and made some
> changes.  As those two instruction patterns are both for vectorization,
> their behavior should not be difficult to explain.
>
> If you have more questions or think that the documentation is still
> improper, please let me know.
>
> Thank you very much!
>
>
> Cong
>
>
> On Tue, Nov 5, 2013 at 1:53 AM, James Greenhalgh
> <james.greenha...@arm.com> wrote:
>> On Mon, Nov 04, 2013 at 06:30:55PM +0000, Cong Hou wrote:
>>> On Mon, Nov 4, 2013 at 2:06 AM, James Greenhalgh
>>> <james.greenha...@arm.com> wrote:
>>> > On Fri, Nov 01, 2013 at 04:48:53PM +0000, Cong Hou wrote:
>>> >> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
>>> >> index 2a5a2e1..8f5d39a 100644
>>> >> --- a/gcc/doc/md.texi
>>> >> +++ b/gcc/doc/md.texi
>>> >> @@ -4705,6 +4705,16 @@ wider mode, is computed and added to operand 3.
>>> >>  Operand 3 is of a mode equal or
>>> >>  wider than the mode of the product.  The result is placed in operand 0,
>>> >>  which is of the same mode as operand 3.
>>> >>
>>> >> +@cindex @code{ssad@var{m}} instruction pattern
>>> >> +@item @samp{ssad@var{m}}
>>> >> +@cindex @code{usad@var{m}} instruction pattern
>>> >> +@item @samp{usad@var{m}}
>>> >> +Compute the sum of absolute differences of two signed/unsigned elements.
>>> >> +Operand 1 and operand 2 are of the same mode.  Their absolute
>>> >> +difference, which is of a wider mode, is computed and added to operand 3.
>>> >> +Operand 3 is of a mode equal or wider than the mode of the absolute
>>> >> +difference.  The result is placed in operand 0, which is of the same
>>> >> +mode as operand 3.
>>> >> +
>>> >>  @cindex @code{ssum_widen@var{m3}} instruction pattern
>>> >>  @item @samp{ssum_widen@var{m3}}
>>> >>  @cindex @code{usum_widen@var{m3}} instruction pattern
>>> >> diff --git a/gcc/expr.c b/gcc/expr.c
>>> >> index 4975a64..1db8a49 100644
>>> >
>>> > I'm not sure I follow, and if I do - I don't think it matches what
>>> > you have implemented for i386.
>>> >
>>> > From your text description I would guess the series of operations to be:
>>> >
>>> >   v1 = widen (operands[1])
>>> >   v2 = widen (operands[2])
>>> >   v3 = abs (v1 - v2)
>>> >   operands[0] = v3 + operands[3]
>>> >
>>> > But if I understand the behaviour of PSADBW correctly, what you have
>>> > actually implemented is:
>>> >
>>> >   v1 = widen (operands[1])
>>> >   v2 = widen (operands[2])
>>> >   v3 = abs (v1 - v2)
>>> >   v4 = reduce_plus (v3)
>>> >   operands[0] = v4 + operands[3]
>>> >
>>> > To my mind, synthesizing the reduce_plus step will be wasteful for
>>> > targets which do not get this for free with their absolute-difference
>>> > step.  Imagine a simple loop where we have synthesized the reduce_plus:
>>> > we compute partial sums on each loop iteration, though we would be
>>> > better off leaving the reduce_plus step until after the loop.
>>> > "REDUC_PLUS_EXPR" would be the appropriate tree code for this.
>>>
>>> What do you mean when you use "synthesizing" here?
>>> For each pattern,
>>> the only synthesized operation is the one being returned from the
>>> pattern recognizer; in this case, it is USAD_EXPR.  The recognition of
>>> the reduction sum is necessary as we need the corresponding prolog and
>>> epilog for reductions, which is already done before pattern recognition.
>>> Note that a reduction is not a pattern but a type of vector definition.
>>> A vectorization pattern can still be a reduction operation as long as
>>> STMT_VINFO_RELATED_STMT of this pattern is a reduction operation.  You
>>> can check the other two reduction patterns, widen_sum_pattern and
>>> dot_prod_pattern, for reference.
>>
>> My apologies for not being clear.  What I mean is: for a target which
>> does not have a dedicated PSADBW instruction, the individual steps of
>> 'usad<m>' must be "synthesized" in such a way as to match the expected
>> behaviour of the tree code.
>>
>> So, I must expand 'usad<m>' to a series of equivalent instructions
>> as USAD_EXPR expects.
>>
>> If USAD_EXPR requires me to emit a reduction on each loop iteration,
>> I think that will be inefficient compared to performing the reduction
>> after the loop body.
>>
>> To a first approximation on ARM, I would expect from your description
>> of 'usad<m>' that generating
>>
>>   VABAL ops[3], ops[1], ops[2]
>>   (Vector widening Absolute difference and Accumulate)
>>
>> would fulfil the requirements.
>>
>> But to match the behaviour you have implemented in the i386
>> backend, I would be required to generate:
>>
>>   VABAL ops[3], ops[1], ops[2]
>>   VPADD ops[3], ops[3], ops[3]   (add one set of pairs)
>>   VPADD ops[3], ops[3], ops[3]   (and the other)
>>   VAND  ops[0], ops[3], MASK     (clear high lanes)
>>
>> which additionally performs the (redundant) vector reduction
>> and high-lane zeroing on each loop iteration.
>>
>> My comment is that your documentation and implementation are
>> inconsistent, so I am not sure which behaviour you intend for USAD_EXPR.
>>
>> Additionally, I think it would be more generic to choose the first
>> behaviour, rather than requiring a wasteful decomposition to match
>> a very particular i386 opcode.
>>
>> Thanks,
>> James
>>