Re: [PATCH] Support for SPARC M7 and VIS 4.0
> > I will also fix all the other points you raised.  Thanks!
>
> You're welcome.  And if you want to have commit rights to the SVN
> repository, you can put me as your sponsor (Eric Botcazou).

I just sent a request.  Thank you.
Re: [PATCH] Support for SPARC M7 and VIS 4.0
> I think I will change v3pipe to v3pipe_m7 in this patch, to make it
> more explicit that the insn<->v3pipe association is
> processor-dependent.

No strong opinion, IOW it's your call.

> Yes, it is intended.  I will add a little note.  Thanks.

> I will also fix all the other points you raised.  Thanks!

You're welcome.  And if you want to have commit rights to the SVN
repository, you can put me as your sponsor (Eric Botcazou).

-- 
Eric Botcazou
Re: [PATCH] Support for SPARC M7 and VIS 4.0
Hi Eric.

>> The niagara7 pipeline description models the V3 pipe using a bypass
>> with latency 3, from-to any instruction executing in the V3 pipe.
>> The instructions are identified by means of a new instruction
>> attribute "v3pipe", which has been added to the proper define_insns
>> in sparc.md.
>>
>> However, I am not sure how well this will scale in the future, as in
>> future cpu models the subset of instructions executing in the V3 pipe
>> will be different than in the M7.  So we need a way to define that
>> instruction class that is processor-specific: using v3pipe,
>> v3pipe_m8, v3pipe_m9, etc seems a bit messy to me.  Any idea?
>
> I guess it depends on whether the set of instructions executing in the
> V3 pipe is (or will become) kind of arbitrary or not.  The usual way
> to support a new scheduling model is to define new (sub-)types of
> instructions and assign the appropriate (sub-)types to the scheduling
> units but, of course, if affected instructions are selected randomly,
> the model falls apart.

Yes, it is arbitrary.  My first attempt was indeed to follow the
instruction type approach, but then I couldn't even find proper names
for the categories.  I think I will change v3pipe to v3pipe_m7 in this
patch, to make it more explicit that the insn<->v3pipe association is
processor-dependent.  Wdyt?

>> @@ -1583,10 +1626,29 @@ sparc_option_override (void)
>>  	    || sparc_cpu == PROCESSOR_NIAGARA
>>  	    || sparc_cpu == PROCESSOR_NIAGARA2
>>  	    || sparc_cpu == PROCESSOR_NIAGARA3
>> -	    || sparc_cpu == PROCESSOR_NIAGARA4)
>> -	   ? 64 : 32),
>> +	    || sparc_cpu == PROCESSOR_NIAGARA4
>> +	    || sparc_cpu == PROCESSOR_NIAGARA7)
>> +	   ? 32 : 64),
>> +	  global_options.x_param_values,
>> +	  global_options_set.x_param_values);
>> +  maybe_set_param_value (PARAM_L1_CACHE_SIZE,
>> +	 ((sparc_cpu == PROCESSOR_ULTRASPARC
>> +	   || sparc_cpu == PROCESSOR_ULTRASPARC3
>> +	   || sparc_cpu == PROCESSOR_NIAGARA
>> +	   || sparc_cpu == PROCESSOR_NIAGARA2
>> +	   || sparc_cpu == PROCESSOR_NIAGARA3
>> +	   || sparc_cpu == PROCESSOR_NIAGARA4
>> +	   || sparc_cpu == PROCESSOR_NIAGARA7)
>> +	  ? 16 : 64),
>>	  global_options.x_param_values,
>>	  global_options_set.x_param_values);
>> +  maybe_set_param_value (PARAM_L2_CACHE_SIZE,
>> +	 (sparc_cpu == PROCESSOR_NIAGARA4
>> +	  ? 128 : (sparc_cpu == PROCESSOR_NIAGARA7
>> +		   ? 256 : 512)),
>> +	  global_options.x_param_values,
>> +	  global_options_set.x_param_values);
>> +
>
> Please add a blank line between the statements.
>
> Why swapping 32 and 64 for PARAM_L1_CACHE_LINE_SIZE?  If the 32
> default is universally OK, then let's just remove the statement.

Yes, I agree, I will just remove the statement.  (When I wrote that I
was having in mind future models of the CPU that will actually feature
64-byte L1D$ lines.)

>> /* Disable save slot sharing for call-clobbered registers by default.
>>    The IRA sharing algorithm works on single registers only and this
>> @@ -9178,7 +9240,8 @@ sparc32_initialize_trampoline (rtx m_tramp, rtx fnaddr, rtx cxt)
>>       && sparc_cpu != PROCESSOR_NIAGARA
>>       && sparc_cpu != PROCESSOR_NIAGARA2
>>       && sparc_cpu != PROCESSOR_NIAGARA3
>> -     && sparc_cpu != PROCESSOR_NIAGARA4)
>> +     && sparc_cpu != PROCESSOR_NIAGARA4
>> +     && sparc_cpu != PROCESSOR_NIAGARA7) /* XXX */
>>     emit_insn (gen_flushsi (validize_mem (adjust_address (m_tramp, SImode, 8))));
>
> What does the "XXX" mean?

Oops, these are unintended leftovers, sorry.

>> diff --git a/gcc/config/sparc/sparc.h b/gcc/config/sparc/sparc.h
>> index ebfe87d..d91496a 100644
>> --- a/gcc/config/sparc/sparc.h
>> +++ b/gcc/config/sparc/sparc.h
>> @@ -142,6 +142,7 @@ extern enum cmodel sparc_cmodel;
>>  #define TARGET_CPU_niagara2	14
>>  #define TARGET_CPU_niagara3	15
>>  #define TARGET_CPU_niagara4	16
>> +#define TARGET_CPU_niagara7	19
>
> Any contribution to plug the hole is of course welcome. :-)

:) Maybe some day.

>> +(define_insn "v8qi3"
>> +  [(set (match_operand:V8QI 0 "register_operand" "=e")
>> +        (vis3_addsub_ss:V8QI (match_operand:V8QI 1 "register_operand" "e")
>> +                             (match_operand:V8QI 2 "register_operand" "e")))]
>> +  "TARGET_VIS4"
>> +  "8\t%1, %2, %0"
>> +  [(set_attr "type" "fga")])
>
> If the mix of VIS4 a
Re: [PATCH] Support for SPARC M7 and VIS 4.0
> This patch adds support for -mcpu=niagara7, corresponding to the SPARC
> M7 CPU as documented in the Oracle SPARC Architecture 2015 and the M7
> Processor Supplement.  The patch also includes intrinsics support for
> all the VIS 4.0 instructions.

Thanks for contributing this.

> This patch has been tested on the sparc64-*-linux-gnu,
> sparcv9-*-linux-gnu and sparc-sun-solaris2.11 targets.
>
> The support for the new instructions/registers/isas/etc of the M7 is
> already committed upstream in binutils.

For a while, so I think that we should put the patch on the 6 branch
too.

> The niagara7 pipeline description models the V3 pipe using a bypass
> with latency 3, from-to any instruction executing in the V3 pipe.  The
> instructions are identified by means of a new instruction attribute
> "v3pipe", which has been added to the proper define_insns in sparc.md.
>
> However, I am not sure how well this will scale in the future, as in
> future cpu models the subset of instructions executing in the V3 pipe
> will be different than in the M7.  So we need a way to define that
> instruction class that is processor-specific: using v3pipe, v3pipe_m8,
> v3pipe_m9, etc seems a bit messy to me.  Any idea?

I guess it depends on whether the set of instructions executing in the
V3 pipe is (or will become) kind of arbitrary or not.  The usual way to
support a new scheduling model is to define new (sub-)types of
instructions and assign the appropriate (sub-)types to the scheduling
units but, of course, if affected instructions are selected randomly,
the model falls apart.

> Note that the reason why the empirically observed latencies in the T4
> were different from the documented ones in the T4 supplement (as Dave
> found and fixed in
> https://gcc.gnu.org/ml/gcc-patches/2012-10/msg00934.html) was that the
> HW chaps didn't feel it necessary to document the complexity in the
> PRM, and just assigned a latency of 11 to the VIS instructions.

Very interesting insight.  It would be nice to add a blurb about that
in config/sparc/niagara4.md then.

> - Changes in the cache parameters for niagara processors
>
> The Oracle SPARC Architecture (previously the UltraSPARC Architecture)
> specification states that when a PREFETCH[A] instruction is executed,
> an implementation-specific amount of data is prefetched, and that it
> is at least 64 bytes long (aligned to at least 64 bytes).
>
> However, this is not correct.  The M7 (and implementations prior to
> that) does not guarantee a 64B prefetch into a cache if the line size
> is smaller.  A single cache line is all that is ever prefetched.  So
> for the M7, where the L1D$ has 32B lines and the L2D$ and L3 have 64B
> lines, a prefetch will prefetch 64B into the L2 and L3, but only 32B
> are brought into the L1D$.  (Assuming it is a read_n prefetch, which
> is the only type which allocates to the L1.)

Adding a comment to sparc_option_override about that would also be
nice.

> Another change is setting PARAM_SIMULTANEOUS_PREFETCHES to 32, as
> opposed to its previous value of 2 for niagara processors.  Unlike the
> UltraSPARC III, which featured a documented prefetch queue with a size
> of 8, niagara processors handle prefetches much like regular loads.
> The L1 miss buffer is 32 entries, but prefetches start getting
> affected when 30 entries become occupied.  That occupation could be a
> mix of regular loads and prefetches though.  And that buffer is shared
> by all threads.  Once the threshold is reached, if the core is running
> a single thread the prefetch will retry.  If more than one thread is
> running, the prefetch will be dropped.  So, as you can see, all this
> makes it very difficult to determine how many prefetches can be issued
> simultaneously, even in a single-threaded program.  However, 2 is
> clearly too low, and experimental results show that setting this
> parameter to 32 works well when the number of threads is not high.
> (Note that I didn't change this parameter for niagara2, niagara3 and
> niagara4, but we probably want to do that.)

Fine with me, let's do that, but with a comment too.

> All above together makes a difference when running STREAM on an M7
> with OMP_NUM_THREADS=2:
>
> With -O3 -mcpu=niagara4 -fprefetch-loop-arrays:
>
> Function    Best Rate MB/s  Avg time  Min time  Max time
> Copy:            6336.6573    5.4323    5.4224    5.4455
> Scale:           5796.6113    5.9289    5.9276    5.9309
> Add:             7517.6836    6.8760    6.8558    6.8927
> Triad:           7781.0785    6.6364    6.6237    6.6549
>
> With -O3 -mcpu=niagara7 -fprefetch-loop-arrays:
>
> Function    Best Rate MB/s  Avg time  Min time  Max time
> Copy:           10743.8074    3.2052    3.1981    3.2132
> Scale:          10763.5906    3.1995    3.1922    3.2078
> Add:            11866.4764
[PATCH] Support for SPARC M7 and VIS 4.0
This patch adds support for -mcpu=niagara7, corresponding to the SPARC
M7 CPU as documented in the Oracle SPARC Architecture 2015 and the M7
Processor Supplement.  The patch also includes intrinsics support for
all the VIS 4.0 instructions.

This patch has been tested on the sparc64-*-linux-gnu,
sparcv9-*-linux-gnu and sparc-sun-solaris2.11 targets.

The support for the new instructions/registers/isas/etc of the M7 is
already committed upstream in binutils.

Some notes on the patch:

- Pipeline description for the M7

The pipeline description for the M7 is similar to the niagara4
description, with one exception: the latencies of many VIS instructions
are different because of the introduction of the so-called "V3 pipe".

There is a 3-cycle pipeline through which a number of instructions
flow.  That pipeline was born to process some of the short crypto
instructions in T4/T5/M5/M6.  Then in M7/T7 some instructions that used
to have ~11-cycle latency in the FP pipeline (notably, some non-FP VIS
instructions) were moved to the V3 pipeline (and we will be moving more
instructions to the V3 in future M* models).  The result is that the
latency of quite a few previously 11-cycle operations has dropped to 3
cycles.

But then, in order to obtain the 3-cycle latency, the destination
register of the instruction must be "consumed" by another instruction
that also executes in the V3 pipeline (in order for the register-bypass
path in V3 to keep the latency down to 3 cycles).  If the instruction
that consumes Rd does not execute in the V3 pipeline, then it has to
wait longer to see the result, because the 3-cycle bypass path only
works within the V3 pipeline.

The instructions using the V3 pipe are marked with a (1) superscript in
the latencies table (A-1) in the M7 processor supplement.

The niagara7 pipeline description models the V3 pipe using a bypass
with latency 3, from-to any instruction executing in the V3 pipe.
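The consumer-dependent latency described above can be sketched as a
tiny C helper.  This is purely illustrative (the function name is ours,
not GCC's, and the 11-cycle fallback is an assumed ballpark based on
the pre-M7 FP-pipe latency; the supplement only says the consumer "has
to wait longer"):

```c
#include <stdbool.h>

/* Hypothetical sketch: effective producer->consumer latency under the
   V3-pipe bypass model.  The 3-cycle figure applies only when both the
   producer and the consumer of Rd execute in the V3 pipe; otherwise
   the result goes the long way and we assume roughly the old ~11-cycle
   FP-pipe latency.  */
static int
v3_effective_latency (bool producer_in_v3, bool consumer_in_v3)
{
  if (producer_in_v3 && consumer_in_v3)
    return 3;   /* register-bypass path inside the V3 pipe */
  return 11;    /* assumed ballpark; the bypass does not apply */
}
```

This is also exactly the shape the define_bypass in the niagara7
pipeline description expresses: the short latency holds only between
two insns that both carry the "v3pipe" attribute.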
The instructions are identified by means of a new instruction attribute
"v3pipe", which has been added to the proper define_insns in sparc.md.

However, I am not sure how well this will scale in the future, as in
future cpu models the subset of instructions executing in the V3 pipe
will be different than in the M7.  So we need a way to define that
instruction class that is processor-specific: using v3pipe, v3pipe_m8,
v3pipe_m9, etc seems a bit messy to me.  Any idea?

Note that the reason why the empirically observed latencies in the T4
were different from the documented ones in the T4 supplement (as Dave
found and fixed in
https://gcc.gnu.org/ml/gcc-patches/2012-10/msg00934.html) was that the
HW chaps didn't feel it necessary to document the complexity in the
PRM, and just assigned a latency of 11 to the VIS instructions.
Fortunately, the policy changed (as these implementation details are
obviously very useful to compiler writers) and in the M7 supplement
these details have been documented, as they will be documented in the
PRMs of future models.

- Changes in the cache parameters for niagara processors

The Oracle SPARC Architecture (previously the UltraSPARC Architecture)
specification states that when a PREFETCH[A] instruction is executed,
an implementation-specific amount of data is prefetched, and that it is
at least 64 bytes long (aligned to at least 64 bytes).

However, this is not correct.  The M7 (and implementations prior to
that) does not guarantee a 64B prefetch into a cache if the line size
is smaller.  A single cache line is all that is ever prefetched.  So
for the M7, where the L1D$ has 32B lines and the L2D$ and L3 have 64B
lines, a prefetch will prefetch 64B into the L2 and L3, but only 32B
are brought into the L1D$.  (Assuming it is a read_n prefetch, which is
the only type which allocates to the L1.)
We will be fixing the OSA specification to say that PREFETCH prefetches
64B or one cache line (whichever is larger) from memory into the
last-level cache, and that whether and how much data a given PREFETCH
function code causes to be prefetched from the LLC to higher (closer to
the processor) levels of caches is implementation-dependent.  This
accurately describes the behavior of every processor at least as far
back as the T2.

So, by changing the L1_CACHE_LINE_SIZE parameter to 32B, we get the
tree SSA loop prefetch pass to unroll loops so that in every iteration
one L1D$ line is processed and prefetched ahead.

Another change is setting PARAM_SIMULTANEOUS_PREFETCHES to 32, as
opposed to its previous value of 2 for niagara processors.  Unlike the
UltraSPARC III, which featured a documented prefetch queue with a size
of 8, niagara processors handle prefetches much like regular loads.
The L1 miss buffer is 32 entries, but prefetches start getting affected
when 30 entries become occupied.  That occupation could be a mix of
regular loads and prefetches though.
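For illustration, the kind of code the prefetch pass aims to generate
with a 32-byte L1D$ line looks roughly like the hand-written C sketch
below: the loop is unrolled so each iteration consumes one L1 line
(four doubles), with a prefetch issued a fixed distance ahead.  The
function name and the PF_DIST value are made up for the example, not
what GCC actually computes:

```c
#include <stddef.h>

#define PF_DIST 8  /* lines to prefetch ahead; hypothetical value */

/* Sum an array, consuming one 32B L1D$ line (4 doubles) per iteration
   and prefetching PF_DIST lines ahead with GCC's __builtin_prefetch
   (read, high temporal locality).  */
static double
sum_array (const double *a, size_t n)
{
  double s = 0.0;
  size_t i = 0;

  for (; i + 4 <= n; i += 4)
    {
      __builtin_prefetch (a + i + 4 * PF_DIST, 0, 3);
      s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    }
  for (; i < n; i++)   /* remainder iterations, no prefetch */
    s += a[i];
  return s;
}
```

With a 64B line size the pass would instead unroll by eight doubles per
iteration, issuing half as many prefetches for the same data, which is
why getting the L1 line size right matters here.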