Re: [PATCH] Support for SPARC M7 and VIS 4.0

Eric Botcazou Wed, 01 Jun 2016 02:28:25 -0700

> This patch adds support for -mcpu=niagara7, corresponding to the SPARC
> M7 CPU as documented in the Oracle SPARC Architecture 2015 and the M7
> Processor Supplement.  The patch also includes intrinsics support for
> all the VIS 4.0 instructions.


Thanks for contributing this.

> This patch has been tested in sparc64-*-linux-gnu, sparcv9-*-linux-gnu
> and sparc-sun-solaris2.11 targets.
> 
> The support for the new instructions/registers/isas/etc of the M7 is
> already committed upstream in binutils.

For a while, so I think that we should put the patch on the 6 branch too.

>   The niagara7 pipeline description models the V3 pipe using a bypass
>   with latency 3, from-to any instruction executing in the V3 pipe.  The
>   instructions are identified by mean of a new instruction attribute
>   "v3pipe", that has been added to the proper define_insns in sparc.md.
> 
>   However, I am not sure how well this will scale ni the future, as in
>   future cpu models the subset of instruction executing in the V3 pipe
>   will be different than in the M7.  So we need a way to define that
>   instruction class that is processor-specific: using v3pipe, v3pipe_m8,
>   v3pipe_m9, etc seems a bit messy to me.  Any idea?

I guess it depends on whether the set of instructions executing in the V3 pipe 
is (or will become) kind of arbitrary or not.  The usual way to support a new 
scheduling model is to define new (sub-)types of instructions and assign the 
appropriate (sub-)types to the scheduling units but, of course, if affected 
instructions are selected randomly, the model falls apart.

>   Note that the reason why the empirically observed latencies in the T4
>   were different than the documented ones in the T4 supplement (as Dave
>   found and fixed in
>   https://gcc.gnu.org/ml/gcc-patches/2012-10/msg00934.html) was that the
>   HW chaps didn't feel it necessary to document the complexity in the
>   PRM, and just assigned a latency of 11 to the VIS instructions.

Very interesting insight.  It would be nice to add a blurb about that in 
config/sparc/niagara4.md then.

> - Changes in the cache parameters for niagara processors
> 
>   The Oracle SPARC Architecture (previously the UltraSPARC Architecture)
>   specification states that when a PREFETCH[A] instruction is executed
>   an implementation-specific amount of data is prefetched, and that it
>   is at least 64 bytes long (aligned to at least 64 bytes).
> 
>   However, this is not correct.  The M7 (and implementations prior to
>   that) does not guarantee a 64B prefetch into a cache if the line size
>   is smaller.  A single cache line is all that is ever prefetched.  So
>   for the M7, where the L1D$ has 32B lines and the L2D$ and L3 have 64B
>   lines, a prefetch will prefetch 64B into the L2 and L3, but only 32B
>   are brought into the L1D$. (Assuming it is a read_n prefetch, which is
>   the only type which allocates to the L1.)

Adding a comment to sparc_option_override about that would also be nice.

>   Another change is setting PARAM_SIMULTANEOUS_PREFETCHES to 32, as
>   opposed to its previous value of 2 for niagara processors.  Unlike in
>   the UltraSPARC III, which featured a documented prefetch queue with a
>   size of 8, in niagara processors prefetches are handled much like
>   regular loads.  The L1 miss buffer is 32 entries, but prefetches start
>   getting affected when 30 entries become occupied.  That occupation
>   could be a mix of regular loads and prefetches though.  And that
>   buffer is shared by all threads.  Once the threshold is reached, if
>   the core is running a single thread the prefetch will retry.  If more
>   than one thread is running, the prefetch will be dropped.  So, as you
>   can see, all this makes it very difficult to determine how many
>   simultaneous prefetches can be issued simultaneously, even in a
>   single-threaded program.  However, 2 is clearly too low, and
>   experimental results show that setting this parameter to 32 works well
>   when the number of threads is not high.  (Note that I didn't change
>   this parameter for niagara2, niagara3 and niagara4, but we probably
>   want to do that.)

Fine with me, let's do that, but with a comment too.

>   All above together makes a difference when running STREAM in a M7 with
>   OMP_NUM_THREADS=2:
> 
>   With -O3 -mcpu=niagara4 -fprefetch-loop-arrays:
> 
>   Function    Best Rate MB/s  Avg time     Min time     Max time
>   Copy:        6336.6573       5.4323       5.4224       5.4455
>   Scale:       5796.6113       5.9289       5.9276       5.9309
>   Add:         7517.6836       6.8760       6.8558       6.8927
>   Triad:       7781.0785       6.6364       6.6237       6.6549
> 
>   With -O3 -mcpu=niagara7 -fprefetch-loop-arrays:
> 
>   Function    Best Rate MB/s  Avg time     Min time     Max time
>   Copy:       10743.8074       3.2052       3.1981       3.2132
>   Scale:      10763.5906       3.1995       3.1922       3.2078
>   Add:        11866.4764       4.3537       4.3433       4.3688
>   Triad:      12856.0593       4.0129       4.0090       4.0178
> 
>   (The impact on performance quickly decreases as more threads are used
>   in the benchmark, as the prefetches start getting dropped.)

Impressive indeed.

The patch looks good to me, but here are a few nits:

> 2016-05-31  Jose E. Marchesi  <jose.march...@oracle.com>
> 
>       * config/sparc/sparc.md (cpu): Added niagara7 cpu type.
>       Include the M7 SPARC DFA scheduler.
>       New attribute v3pipe.
>       Annotate insns with v3pipe where appropriate.
>       Define cpu_feature vis4.

"Add lzd instruction type and set it on clzdi_sp64 and clzsi_sp64."

>       Added (V8QI "8") to vbits.
>       Added insns {add,sub}v8qi3
>       Added insns ss{add,sub}v8qi3
>       Added insns us{add,sub}{v8qi,v4hi}3
>       Added insns {min,max}{v8qi,v4hi,v2si}3
>       Added insns {minu,maxu}{v8qi,v4hi,v2si}3
>       Added insns fpcmp{le,gt,ule,ug,ule,ugt}{8,16,32}_vis.

Only present tense in ChangeLog: "Add ..."

>       * config/sparc/niagara7.md: New file.
>       * configure.ac (HAVE_AS_SPARC5_VIS4): Define if the assembler
>       supports SPARC5 and VIS 4.0 instructions.

You need to add:

        * configure: Regenerate.
        * config.in: Likewise.

IOW the number of '*' must be equal to the number of added/modified files.

>       * config.gcc: niagara7 is a supported cpu in sparc*-*-* targets.
>       * config/sparc/sol2.h (ASM_CPU32_DEFAUILT_SPEC): Set for
>       TARGET_CPU_niagara7.
>       (ASM_CPU64_DEFAULT_SPEC): Likewise.
>       (CPP_CPU_SPEC): Handle niagara7.
>       (ASM_CPU_SPEC): Likewise.
>       * config/sparc/sparc-opts.h (processor_type): Add
>       PROCESSOR_NIAGARA7.
>       (mvis4): Added.
>       * config/sparc/sparc.h (TARGET_CPU_niagara7): Defined.
>       (AS_NIAGARA7_FLAG): Defined.
>       (ASM_CPU64_DEFAULT_SPEC): Set for niagara7.
>       (CPP_CPU64_DEFAULT_SPEC): Likewise.
>       (CPP_CPU_SPEC): Handle niagara7.
>       (ASM_CPU_SPEC): Likewise.

Likewise for the present tense: "Add", "Define".

>       * config/sparc/sparc.c (niagara7_costs): Defined.
>       (sparc_option_override): Set L2 cache size for niagara*.
>       (sparc_option_override): Ditto for the L1 cache size.
>       (sparc_option_override): Handle niagara7.
>       (sparc32_initialize_trampoline): Likewise.
>       (sparc_use_sched_lookahead): Likewise.
>       (sparc_issue_rate): Likewise.
>       (sparc_register_move_cost): Likewise.
>       (dump_target_flag_bits): Support VIS4.
>       (sparc_option_override): Likewise.
>       (sparc_vis_init_builtins): Likewise.
>       (sparc_builtins): Likewise.

Only one entry per function (sparc_option_override).

> @@ -1583,10 +1626,29 @@ sparc_option_override (void)
> 
>                          || sparc_cpu == PROCESSOR_NIAGARA
>                          || sparc_cpu == PROCESSOR_NIAGARA2
>                          || sparc_cpu == PROCESSOR_NIAGARA3
> 
> -                        || sparc_cpu == PROCESSOR_NIAGARA4)
> -                       ? 64 : 32),
> +                        || sparc_cpu == PROCESSOR_NIAGARA4
> +                        || sparc_cpu == PROCESSOR_NIAGARA7)
> +                       ? 32 : 64),
> +                      global_options.x_param_values,
> +                      global_options_set.x_param_values);
> +  maybe_set_param_value (PARAM_L1_CACHE_SIZE,
> +                      ((sparc_cpu == PROCESSOR_ULTRASPARC
> +                        || sparc_cpu == PROCESSOR_ULTRASPARC3
> +                        || sparc_cpu == PROCESSOR_NIAGARA
> +                        || sparc_cpu == PROCESSOR_NIAGARA2
> +                        || sparc_cpu == PROCESSOR_NIAGARA3
> +                        || sparc_cpu == PROCESSOR_NIAGARA4
> +                        || sparc_cpu == PROCESSOR_NIAGARA7)
> +                       ? 16 : 64),
>                        global_options.x_param_values,
>                        global_options_set.x_param_values);
> +  maybe_set_param_value (PARAM_L2_CACHE_SIZE,
> +                      (sparc_cpu == PROCESSOR_NIAGARA4
> +                       ? 128 : (sparc_cpu == PROCESSOR_NIAGARA7
> +                                ? 256 : 512)),
> +                      global_options.x_param_values,
> +                      global_options_set.x_param_values);
> +

Please add a blank line between the statements.  Why swapping 32 and 64 for 
PARAM_L1_CACHE_LINE_SIZE?  If the 32 default is universally OK, then let's 
just remove the statement.

>    /* Disable save slot sharing for call-clobbered registers by default.
>       The IRA sharing algorithm works on single registers only and this
> @@ -9178,7 +9240,8 @@ sparc32_initialize_trampoline (rtx m_tramp, rtx
> fnaddr, rtx cxt) && sparc_cpu != PROCESSOR_NIAGARA
>        && sparc_cpu != PROCESSOR_NIAGARA2
>        && sparc_cpu != PROCESSOR_NIAGARA3
> -      && sparc_cpu != PROCESSOR_NIAGARA4)
> +      && sparc_cpu != PROCESSOR_NIAGARA4
> +      && sparc_cpu != PROCESSOR_NIAGARA7) /* XXX */
>      emit_insn (gen_flushsi (validize_mem (adjust_address (m_tramp, SImode,
> 8))));

What does the "XXX" mean?

>    /* Call __enable_execute_stack after writing onto the stack to make sure
> @@ -9419,7 +9483,8 @@ sparc_use_sched_lookahead (void)
> 
>        || sparc_cpu == PROCESSOR_NIAGARA2
>        || sparc_cpu == PROCESSOR_NIAGARA3)
> 
>      return 0;
> -  if (sparc_cpu == PROCESSOR_NIAGARA4)
> +  if (sparc_cpu == PROCESSOR_NIAGARA4
> +      || sparc_cpu == PROCESSOR_NIAGARA7) /* XXX */
>      return 2;
>    if (sparc_cpu == PROCESSOR_ULTRASPARC
> 
>        || sparc_cpu == PROCESSOR_ULTRASPARC3)

Likewise.

> @@ -9442,6 +9507,7 @@ sparc_issue_rate (void)
>      default:
>        return 1;
>      case PROCESSOR_NIAGARA4:
> +    case PROCESSOR_NIAGARA7:
>      case PROCESSOR_V9:
>        /* Assume V9 processors are capable of at least dual-issue.  */
>        return 2;

Trailing spaces.

> diff --git a/gcc/config/sparc/sparc.h b/gcc/config/sparc/sparc.h
> index ebfe87d..d91496a 100644
> --- a/gcc/config/sparc/sparc.h
> +++ b/gcc/config/sparc/sparc.h
> @@ -142,6 +142,7 @@ extern enum cmodel sparc_cmodel;
>  #define TARGET_CPU_niagara2  14
>  #define TARGET_CPU_niagara3  15
>  #define TARGET_CPU_niagara4  16
> +#define TARGET_CPU_niagara7  19

Any contribution to plug the hole is of course welcome. :-)

> @@ -8628,6 +8678,14 @@
>    "fp<plusminus_insn>64\t%1, %2, %0"
>    [(set_attr "type" "fga")])
> 
> +(define_insn "<plusminus_insn>v8qi3"
> +  [(set (match_operand:V8QI 0 "register_operand" "=e")
> +        (plusminus:V8QI (match_operand:V8QI 1 "register_operand" "e")
> +                        (match_operand:V8QI 2 "register_operand" "e")))]
> +  "TARGET_VIS4"
> +  "fp<plusminus_insn>8\t%1, %2, %0"
> +  [(set_attr "type" "fga")])
> +
>  (define_mode_iterator VASS [V4HI V2SI V2HI V1SI])
>  (define_code_iterator vis3_addsub_ss [ss_plus ss_minus])
>  (define_code_attr vis3_addsub_ss_insn

Trailing spaces.

> +(define_insn "<vis3_addsub_ss_patname>v8qi3"
> +  [(set (match_operand:V8QI 0 "register_operand" "=e")
> +        (vis3_addsub_ss:V8QI (match_operand:V8QI 1 "register_operand" "e")
> +                             (match_operand:V8QI 2 "register_operand"
> "e")))] +  "TARGET_VIS4"
> +  "<vis3_addsub_ss_insn>8\t%1, %2, %0"
> +  [(set_attr "type" "fga")])

If the mix of VIS4 and vis3_ is correct, then this requires a comment.

-- 
Eric Botcazou

Re: [PATCH] Support for SPARC M7 and VIS 4.0

Reply via email to