> This patch adds support for -mcpu=niagara7, corresponding to the SPARC > M7 CPU as documented in the Oracle SPARC Architecture 2015 and the M7 > Processor Supplement. The patch also includes intrinsics support for > all the VIS 4.0 instructions.
Thanks for contributing this. > This patch has been tested in sparc64-*-linux-gnu, sparcv9-*-linux-gnu > and sparc-sun-solaris2.11 targets. > > The support for the new instructions/registers/isas/etc of the M7 is > already committed upstream in binutils. For a while, so I think that we should put the patch on the 6 branch too. > The niagara7 pipeline description models the V3 pipe using a bypass > with latency 3, from-to any instruction executing in the V3 pipe. The > instructions are identified by mean of a new instruction attribute > "v3pipe", that has been added to the proper define_insns in sparc.md. > > However, I am not sure how well this will scale ni the future, as in > future cpu models the subset of instruction executing in the V3 pipe > will be different than in the M7. So we need a way to define that > instruction class that is processor-specific: using v3pipe, v3pipe_m8, > v3pipe_m9, etc seems a bit messy to me. Any idea? I guess it depends on whether the set of instructions executing in the V3 pipe is (or will become) kind of arbitrary or not. The usual way to support a new scheduling model is to define new (sub-)types of instructions and assign the appropriate (sub-)types to the scheduling units but, of course, if affected instructions are selected randomly, the model falls apart. > Note that the reason why the empirically observed latencies in the T4 > were different than the documented ones in the T4 supplement (as Dave > found and fixed in > https://gcc.gnu.org/ml/gcc-patches/2012-10/msg00934.html) was that the > HW chaps didn't feel it necessary to document the complexity in the > PRM, and just assigned a latency of 11 to the VIS instructions. Very interesting insight. It would be nice to add a blurb about that in config/sparc/niagara4.md then. > - Changes in the cache parameters for niagara processors > > The Oracle SPARC Architecture (previously the UltraSPARC Architecture) > specification states that when a PREFETCH[A] instruction is executed > an implementation-specific amount of data is prefetched, and that it > is at least 64 bytes long (aligned to at least 64 bytes). > > However, this is not correct. The M7 (and implementations prior to > that) does not guarantee a 64B prefetch into a cache if the line size > is smaller. A single cache line is all that is ever prefetched. So > for the M7, where the L1D$ has 32B lines and the L2D$ and L3 have 64B > lines, a prefetch will prefetch 64B into the L2 and L3, but only 32B > are brought into the L1D$. (Assuming it is a read_n prefetch, which is > the only type which allocates to the L1.) Adding a comment to sparc_option_override about that would also be nice. > Another change is setting PARAM_SIMULTANEOUS_PREFETCHES to 32, as > opposed to its previous value of 2 for niagara processors. Unlike in > the UltraSPARC III, which featured a documented prefetch queue with a > size of 8, in niagara processors prefetches are handled much like > regular loads. The L1 miss buffer is 32 entries, but prefetches start > getting affected when 30 entries become occupied. That occupation > could be a mix of regular loads and prefetches though. And that > buffer is shared by all threads. Once the threshold is reached, if > the core is running a single thread the prefetch will retry. If more > than one thread is running, the prefetch will be dropped. So, as you > can see, all this makes it very difficult to determine how many > simultaneous prefetches can be issued simultaneously, even in a > single-threaded program. However, 2 is clearly too low, and > experimental results show that setting this parameter to 32 works well > when the number of threads is not high. (Note that I didn't change > this parameter for niagara2, niagara3 and niagara4, but we probably > want to do that.) Fine with me, let's do that, but with a comment too. > All above together makes a difference when running STREAM in a M7 with > OMP_NUM_THREADS=2: > > With -O3 -mcpu=niagara4 -fprefetch-loop-arrays: > > Function Best Rate MB/s Avg time Min time Max time > Copy: 6336.6573 5.4323 5.4224 5.4455 > Scale: 5796.6113 5.9289 5.9276 5.9309 > Add: 7517.6836 6.8760 6.8558 6.8927 > Triad: 7781.0785 6.6364 6.6237 6.6549 > > With -O3 -mcpu=niagara7 -fprefetch-loop-arrays: > > Function Best Rate MB/s Avg time Min time Max time > Copy: 10743.8074 3.2052 3.1981 3.2132 > Scale: 10763.5906 3.1995 3.1922 3.2078 > Add: 11866.4764 4.3537 4.3433 4.3688 > Triad: 12856.0593 4.0129 4.0090 4.0178 > > (The impact on performance quickly decreases as more threads are used > in the benchmark, as the prefetches start getting dropped.) Impressive indeed. The patch looks good to me, but here are a few nits: > 2016-05-31 Jose E. Marchesi <jose.march...@oracle.com> > > * config/sparc/sparc.md (cpu): Added niagara7 cpu type. > Include the M7 SPARC DFA scheduler. > New attribute v3pipe. > Annotate insns with v3pipe where appropriate. > Define cpu_feature vis4. "Add lzd instruction type and set it on clzdi_sp64 and clzsi_sp64." > Added (V8QI "8") to vbits. > Added insns {add,sub}v8qi3 > Added insns ss{add,sub}v8qi3 > Added insns us{add,sub}{v8qi,v4hi}3 > Added insns {min,max}{v8qi,v4hi,v2si}3 > Added insns {minu,maxu}{v8qi,v4hi,v2si}3 > Added insns fpcmp{le,gt,ule,ug,ule,ugt}{8,16,32}_vis. Only present tense in ChangeLog: "Add ..." > * config/sparc/niagara7.md: New file. > * configure.ac (HAVE_AS_SPARC5_VIS4): Define if the assembler > supports SPARC5 and VIS 4.0 instructions. You need to add: * configure: Regenerate. * config.in: Likewise. IOW the number of '*' must be equal to the number of added/modified files. > * config.gcc: niagara7 is a supported cpu in sparc*-*-* targets. > * config/sparc/sol2.h (ASM_CPU32_DEFAUILT_SPEC): Set for > TARGET_CPU_niagara7. > (ASM_CPU64_DEFAULT_SPEC): Likewise. > (CPP_CPU_SPEC): Handle niagara7. > (ASM_CPU_SPEC): Likewise. > * config/sparc/sparc-opts.h (processor_type): Add > PROCESSOR_NIAGARA7. > (mvis4): Added. > * config/sparc/sparc.h (TARGET_CPU_niagara7): Defined. > (AS_NIAGARA7_FLAG): Defined. > (ASM_CPU64_DEFAULT_SPEC): Set for niagara7. > (CPP_CPU64_DEFAULT_SPEC): Likewise. > (CPP_CPU_SPEC): Handle niagara7. > (ASM_CPU_SPEC): Likewise. Likewise for the present tense: "Add", "Define". > * config/sparc/sparc.c (niagara7_costs): Defined. > (sparc_option_override): Set L2 cache size for niagara*. > (sparc_option_override): Ditto for the L1 cache size. > (sparc_option_override): Handle niagara7. > (sparc32_initialize_trampoline): Likewise. > (sparc_use_sched_lookahead): Likewise. > (sparc_issue_rate): Likewise. > (sparc_register_move_cost): Likewise. > (dump_target_flag_bits): Support VIS4. > (sparc_option_override): Likewise. > (sparc_vis_init_builtins): Likewise. > (sparc_builtins): Likewise. Only one entry per function (sparc_option_override). > @@ -1583,10 +1626,29 @@ sparc_option_override (void) > > || sparc_cpu == PROCESSOR_NIAGARA > || sparc_cpu == PROCESSOR_NIAGARA2 > || sparc_cpu == PROCESSOR_NIAGARA3 > > - || sparc_cpu == PROCESSOR_NIAGARA4) > - ? 64 : 32), > + || sparc_cpu == PROCESSOR_NIAGARA4 > + || sparc_cpu == PROCESSOR_NIAGARA7) > + ? 32 : 64), > + global_options.x_param_values, > + global_options_set.x_param_values); > + maybe_set_param_value (PARAM_L1_CACHE_SIZE, > + ((sparc_cpu == PROCESSOR_ULTRASPARC > + || sparc_cpu == PROCESSOR_ULTRASPARC3 > + || sparc_cpu == PROCESSOR_NIAGARA > + || sparc_cpu == PROCESSOR_NIAGARA2 > + || sparc_cpu == PROCESSOR_NIAGARA3 > + || sparc_cpu == PROCESSOR_NIAGARA4 > + || sparc_cpu == PROCESSOR_NIAGARA7) > + ? 16 : 64), > global_options.x_param_values, > global_options_set.x_param_values); > + maybe_set_param_value (PARAM_L2_CACHE_SIZE, > + (sparc_cpu == PROCESSOR_NIAGARA4 > + ? 128 : (sparc_cpu == PROCESSOR_NIAGARA7 > + ? 256 : 512)), > + global_options.x_param_values, > + global_options_set.x_param_values); > + Please add a blank line between the statements. Why swapping 32 and 64 for PARAM_L1_CACHE_LINE_SIZE? If the 32 default is universally OK, then let's just remove the statement. > /* Disable save slot sharing for call-clobbered registers by default. > The IRA sharing algorithm works on single registers only and this > @@ -9178,7 +9240,8 @@ sparc32_initialize_trampoline (rtx m_tramp, rtx > fnaddr, rtx cxt) && sparc_cpu != PROCESSOR_NIAGARA > && sparc_cpu != PROCESSOR_NIAGARA2 > && sparc_cpu != PROCESSOR_NIAGARA3 > - && sparc_cpu != PROCESSOR_NIAGARA4) > + && sparc_cpu != PROCESSOR_NIAGARA4 > + && sparc_cpu != PROCESSOR_NIAGARA7) /* XXX */ > emit_insn (gen_flushsi (validize_mem (adjust_address (m_tramp, SImode, > 8)))); What does the "XXX" mean? > /* Call __enable_execute_stack after writing onto the stack to make sure > @@ -9419,7 +9483,8 @@ sparc_use_sched_lookahead (void) > > || sparc_cpu == PROCESSOR_NIAGARA2 > || sparc_cpu == PROCESSOR_NIAGARA3) > > return 0; > - if (sparc_cpu == PROCESSOR_NIAGARA4) > + if (sparc_cpu == PROCESSOR_NIAGARA4 > + || sparc_cpu == PROCESSOR_NIAGARA7) /* XXX */ > return 2; > if (sparc_cpu == PROCESSOR_ULTRASPARC > > || sparc_cpu == PROCESSOR_ULTRASPARC3) Likewise. > @@ -9442,6 +9507,7 @@ sparc_issue_rate (void) > default: > return 1; > case PROCESSOR_NIAGARA4: > + case PROCESSOR_NIAGARA7: > case PROCESSOR_V9: > /* Assume V9 processors are capable of at least dual-issue. */ > return 2; Trailing spaces. > diff --git a/gcc/config/sparc/sparc.h b/gcc/config/sparc/sparc.h > index ebfe87d..d91496a 100644 > --- a/gcc/config/sparc/sparc.h > +++ b/gcc/config/sparc/sparc.h > @@ -142,6 +142,7 @@ extern enum cmodel sparc_cmodel; > #define TARGET_CPU_niagara2 14 > #define TARGET_CPU_niagara3 15 > #define TARGET_CPU_niagara4 16 > +#define TARGET_CPU_niagara7 19 Any contribution to plug the hole is of course welcome. :-) > @@ -8628,6 +8678,14 @@ > "fp<plusminus_insn>64\t%1, %2, %0" > [(set_attr "type" "fga")]) > > +(define_insn "<plusminus_insn>v8qi3" > + [(set (match_operand:V8QI 0 "register_operand" "=e") > + (plusminus:V8QI (match_operand:V8QI 1 "register_operand" "e") > + (match_operand:V8QI 2 "register_operand" "e")))] > + "TARGET_VIS4" > + "fp<plusminus_insn>8\t%1, %2, %0" > + [(set_attr "type" "fga")]) > + > (define_mode_iterator VASS [V4HI V2SI V2HI V1SI]) > (define_code_iterator vis3_addsub_ss [ss_plus ss_minus]) > (define_code_attr vis3_addsub_ss_insn Trailing spaces. > +(define_insn "<vis3_addsub_ss_patname>v8qi3" > + [(set (match_operand:V8QI 0 "register_operand" "=e") > + (vis3_addsub_ss:V8QI (match_operand:V8QI 1 "register_operand" "e") > + (match_operand:V8QI 2 "register_operand" > "e")))] + "TARGET_VIS4" > + "<vis3_addsub_ss_insn>8\t%1, %2, %0" > + [(set_attr "type" "fga")]) If the mix of VIS4 and vis3_ is correct, then this requires a comment. -- Eric Botcazou