Re: PING: Re: [patch] implement simd loops in trunk (OMP_SIMD)

Jakub Jelinek Mon, 19 Aug 2013 02:36:34 -0700

On Thu, Aug 01, 2013 at 09:38:31AM -1000, Richard Henderson wrote:
> > +      if (simd
> > +         /*
> > +         || (fd->sched_kind == OMP_CLAUSE_SCHEDULE_STATIC
> > +         && !fd->have_ordered)*/)
> 
> Debugging leftovers or what?


gomp-4_0-branch contains also the || there, but for the trunk merge
I think Aldy can leave that comment out, I'll deal with it when merging the
trunk changes back to gomp-4_0-branch.

> > +  /* Enforce simdlen 1 in simd loops with data sharing clauses referencing
> > +     variable sized vars.  That is unnecessarily hard to support and very
> > +     unlikely to result in vectorized code anyway.  */
> > +  if (is_simd)
> > +    for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c))
> > +      switch (OMP_CLAUSE_CODE (c))
> > +       {
> > +       case OMP_CLAUSE_REDUCTION:
> > +         if (OMP_CLAUSE_REDUCTION_PLACEHOLDER (c))
> > +           max_vf = 1;
> > +         /* FALLTHRU */
> > +       case OMP_CLAUSE_PRIVATE:
> > +       case OMP_CLAUSE_FIRSTPRIVATE:
> > +       case OMP_CLAUSE_LASTPRIVATE:
> > +       case OMP_CLAUSE_LINEAR:
> > +         if (is_variable_sized (OMP_CLAUSE_DECL (c)))
> > +           max_vf = 1;
> > +         break;
> > +       default:
> > +         continue;
> > +       }
> 
> Comment is wrong, code is right, adjusting max_vf not simdlen.

max_vf = 1 causes
          c = build_omp_clause (UNKNOWN_LOCATION, OMP_CLAUSE_SAFELEN);
          OMP_CLAUSE_SAFELEN_EXPR (c) = build_int_cst (integer_type_node,
                                                       max_vf);
so I think the comment should just be changed s/simdlen/safelen/, always get
those 2 wrong...  simdlen is for elemental functions, safelen for simd
loops.
> 
> > +  /* If max_vf is non-NULL, then we can use only vectorization factor
> > +     up to the max_vf we chose.  So stick it into safelen clause.  */
> > +  if (max_vf)
> > +    {
> > +      tree c = find_omp_clause (gimple_omp_for_clauses (ctx->stmt),
> > +                               OMP_CLAUSE_SAFELEN);
> 
> First, non-zero, not non-null -- max_vf is not a pointer.

Ack.

> Second, I don't understand why we're adjusting safelen.  The VF we choose
> for optimizing the loop (even 1), does not change the fact that there are
> no dependencies between loop iterations larger than that.

No, it adds depenencies.  We choose some size of the per-simd lane arrays,
and we must not allow larger vectorization factor than that, as every simd
lane should have different object.  The size is normally the maximum
possible value for a target, so it should be big enough, but we really need
it to be recorded, so that we don't generate invalid code.  max_vf = 1
is for the cases where we punt on handling it properly, those loops for now
can't be vectorized just because they had safelen clause (normal
vectorization rules will probably also punt on those).  In the future if
e.g. privatized per-SIMD lane VLAs are common we can improve that, but right
now I somehow doubt that would be the case.

> Did you want to be adding a _max_vf_ attribute to record stuff that we
> learned while examining the omp loop?  E.g. the max_vf=1 reduction above?
> 
> Given the only adjustment we make to max_vf is to disable vectorization,
> did we in fact want to discard the simd attribute instead, making this a
> "regular" openmp loop?

See above, it is not only about disabling it, we record the sizes of the
arrays there.  And, for SAFELEN clause with 1, we actually don't set
loop->force_vect and set loop->safelen to 0 (i.e. normal loop).

> > +       if (__builtin_expect (N32 cond3 N31, 0)) goto ZERO_ITER_BB;
> > +       if (cond3 is <)
> > +         adj = STEP3 - 1;
> > +       else
> > +         adj = STEP3 + 1;
> > +       count3 = (adj + N32 - N31) / STEP3;
> > +       if (__builtin_expect (N22 cond2 N21, 0)) goto ZERO_ITER_BB;
> > +       if (cond2 is <)
> > +         adj = STEP2 - 1;
> > +       else
> > +         adj = STEP2 + 1;
> > +       count2 = (adj + N22 - N21) / STEP2;
> > +       if (__builtin_expect (N12 cond1 N11, 0)) goto ZERO_ITER_BB;
> 
> CEIL_DIV_EXPR, instead of TRUNC_DIV_EXPR and adjusting by hand?  Unless we
> can't generate the same code in the end because generically we don't know that
> the values involved must be negative for GT?

This is just moving code around (and merging multiple copies of it), the old
code was also using TRUNC_DIV_EXPR.  I think we can try CEIL_DIV_EXPR, but
I'd strongly prefer to only do it incrementally and separately from these
changes.

> I wonder if we do better mooshing all of the BBs together, creating one larger
> BB with all the computation and the unexpected jump at the end.  I.e.
> 
>   bool zero3, zero2, zero1, zero;
> 
>   zero3 = N32 c3 N31;
>   count3 = (N32 - N31) /[cl] STEP3;
>   zero2 = N22 c2 N21;
>   count2 = (N22 - N21) /[cl] STEP2;
>   zero1 = N12 c1 N11;
>   count1 = (N12 - N11) /[cl] STEP1;
>   zero = zero3 || zero2 || zero1;
>   count = count1 * count2 * count3;
>   if (__builtin_expect(zero, false)) goto zero_iter_bb;
> 
> After all, we expect the zero=false, and thus we expect to have to evaluate 
> all
> of the comparison expressions, so short-circuiting oughtn't be a win.  Since
> the condition isn't protecting a denominator, we're not concerned about
> divide-by-zero, so we can fully evaluate count even if a numerator turned out
> to be wrong.
> 
> It seems like putting this all together would create much better scheduling
> opportunities, and less pressure on the chip's branch predictor.

Ditto this.  Though, from my experience, on i?86/x86_64 the above results in
tons of setne/sete insns and it would surprise me if the generated code was
faster.

> > @@ -291,6 +300,15 @@ vect_analyze_data_ref_dependence (struct 
> > data_dependence_re
> 
> You've two adjustments in this function.  First hunk for "dont_know", which is
> fine; second hunk for "data is bad", which I suppose is really a form of
> dont_know and thus also fine.
> 
> But it seems like there should be a third hunk in the "know" section, in which
> we perform some MAX(dist, loop->safelen) computation inside that loop.

The user guarantee IMHO is useful only if you don't know or would give up
otherwise, if you can compute the exact distance, it is better to just use
what you've computed.

> Alternately, why are we patching this function at all, rather than whatever 
> bit
> of code creates the data_dependence_relation data?

This is the spot where we first see *max_vf and can tweak it, in
tree-data-ref.c vectorization factor can't be adjusted.

> > +  hash_table <simduid_to_vf> simduid_to_vf_htab;
> > +  hash_table <decl_to_simduid> decl_to_simduid_htab;
> 
> Why two hashes?  Seems like you can map from decl to vf directly.  At what
> point do you have a simduid integer, but not the decl from whence it came?

simduid_to_vf is for mapping simduid to vectorization factor of that loop.
decl_to_simduid maps the array VAR_DECLs with "omp declare simd" attribute
to the simduid of the loop they are used in; you can have several such
arrays in a loop, or none.

        Jakub

Re: PING: Re: [patch] implement simd loops in trunk (OMP_SIMD)

Reply via email to