Re: [PATCH] Support for SPARC M7 and VIS 4.0

2016-06-02 Thread Jose E. Marchesi

> I will also fix all the other points you raised.
> Thanks!

You're welcome.  And if you want to have commit rights to the SVN repository,
you can put me as your sponsor (Eric Botcazou ).

I just sent a request.  Thank you.


Re: [PATCH] Support for SPARC M7 and VIS 4.0

2016-06-02 Thread Eric Botcazou
> I think I will change v3pipe to v3pipe_m7 in this patch, to make it more
> explicit that the insn<->v3pipe association is processor-dependent.

No strong opinion, IOW it's your call.

> Yes, it is intended.  I will add a little note.

Thanks.

> I will also fix all the other points you raised.
> Thanks!

You're welcome.  And if you want to have commit rights to the SVN repository, 
you can put me as your sponsor (Eric Botcazou ).

-- 
Eric Botcazou


Re: [PATCH] Support for SPARC M7 and VIS 4.0

2016-06-01 Thread Jose E. Marchesi

Hi Eric.

>   The niagara7 pipeline description models the V3 pipe using a bypass
>   with latency 3 from and to any instruction executing in the V3 pipe.
>   The instructions are identified by means of a new instruction
>   attribute, "v3pipe", which has been added to the appropriate
>   define_insns in sparc.md.
> 
>   However, I am not sure how well this will scale in the future, as in
>   future cpu models the subset of instructions executing in the V3 pipe
>   will be different than in the M7.  So we need a way to define that
>   instruction class in a processor-specific way: using v3pipe,
>   v3pipe_m8, v3pipe_m9, etc. seems a bit messy to me.  Any ideas?

I guess it depends on whether the set of instructions executing in the V3 pipe
is (or will become) kind of arbitrary or not.  The usual way to support a new
scheduling model is to define new (sub-)types of instructions and assign the
appropriate (sub-)types to the scheduling units but, of course, if affected
instructions are selected randomly, the model falls apart.

Yes, it is arbitrary.  My first attempt was indeed to follow the
instruction type approach, but then I couldn't even find proper names
for the categories.

I think I will change v3pipe to v3pipe_m7 in this patch, to make it more
explicit that the insn<->v3pipe association is processor-dependent.
Wdyt?

> @@ -1583,10 +1626,29 @@ sparc_option_override (void)
>                           || sparc_cpu == PROCESSOR_NIAGARA
>                           || sparc_cpu == PROCESSOR_NIAGARA2
>                           || sparc_cpu == PROCESSOR_NIAGARA3
> -                         || sparc_cpu == PROCESSOR_NIAGARA4)
> -                        ? 64 : 32),
> +                         || sparc_cpu == PROCESSOR_NIAGARA4
> +                         || sparc_cpu == PROCESSOR_NIAGARA7)
> +                        ? 32 : 64),
> +                         global_options.x_param_values,
> +                         global_options_set.x_param_values);
> +  maybe_set_param_value (PARAM_L1_CACHE_SIZE,
> +                         ((sparc_cpu == PROCESSOR_ULTRASPARC
> +                           || sparc_cpu == PROCESSOR_ULTRASPARC3
> +                           || sparc_cpu == PROCESSOR_NIAGARA
> +                           || sparc_cpu == PROCESSOR_NIAGARA2
> +                           || sparc_cpu == PROCESSOR_NIAGARA3
> +                           || sparc_cpu == PROCESSOR_NIAGARA4
> +                           || sparc_cpu == PROCESSOR_NIAGARA7)
> +                          ? 16 : 64),
>                           global_options.x_param_values,
>                           global_options_set.x_param_values);
> +  maybe_set_param_value (PARAM_L2_CACHE_SIZE,
> +                         (sparc_cpu == PROCESSOR_NIAGARA4
> +                          ? 128 : (sparc_cpu == PROCESSOR_NIAGARA7
> +                                   ? 256 : 512)),
> +                         global_options.x_param_values,
> +                         global_options_set.x_param_values);
> +

Please add a blank line between the statements.  Why swap 32 and 64 for
PARAM_L1_CACHE_LINE_SIZE?  If the 32 default is universally OK, then let's
just remove the statement.

Yes, I agree, I will just remove the statement.  (When I wrote that I had
in mind future models of the CPU that will actually feature 64-byte L1D$
lines.)

>/* Disable save slot sharing for call-clobbered registers by default.
>   The IRA sharing algorithm works on single registers only and this
> @@ -9178,7 +9240,8 @@ sparc32_initialize_trampoline (rtx m_tramp, rtx fnaddr, rtx cxt)
>        && sparc_cpu != PROCESSOR_NIAGARA
>        && sparc_cpu != PROCESSOR_NIAGARA2
>        && sparc_cpu != PROCESSOR_NIAGARA3
> -      && sparc_cpu != PROCESSOR_NIAGARA4)
> +      && sparc_cpu != PROCESSOR_NIAGARA4
> +      && sparc_cpu != PROCESSOR_NIAGARA7) /* XXX */
>      emit_insn (gen_flushsi (validize_mem (adjust_address (m_tramp, SImode,
>                                                             8))));

What does the "XXX" mean?

Oops, these are unintended leftovers, sorry.

> diff --git a/gcc/config/sparc/sparc.h b/gcc/config/sparc/sparc.h
> index ebfe87d..d91496a 100644
> --- a/gcc/config/sparc/sparc.h
> +++ b/gcc/config/sparc/sparc.h
> @@ -142,6 +142,7 @@ extern enum cmodel sparc_cmodel;
>  #define TARGET_CPU_niagara2  14
>  #define TARGET_CPU_niagara3  15
>  #define TARGET_CPU_niagara4  16
> +#define TARGET_CPU_niagara7  19

Any contribution to plug the hole is of course welcome. :-)

:) Maybe some day.

> +(define_insn "<vis3_addsub_ss_patname>v8qi3"
> +  [(set (match_operand:V8QI 0 "register_operand" "=e")
> +        (vis3_addsub_ss:V8QI (match_operand:V8QI 1 "register_operand" "e")
> +                             (match_operand:V8QI 2 "register_operand" "e")))]
> +  "TARGET_VIS4"
> +  "<vis3_addsub_ss_insn>8\t%1, %2, %0"
> +  [(set_attr "type" "fga")])

If the mix of VIS4 

Re: [PATCH] Support for SPARC M7 and VIS 4.0

2016-06-01 Thread Eric Botcazou
> This patch adds support for -mcpu=niagara7, corresponding to the SPARC
> M7 CPU as documented in the Oracle SPARC Architecture 2015 and the M7
> Processor Supplement.  The patch also includes intrinsics support for
> all the VIS 4.0 instructions.

Thanks for contributing this.

> This patch has been tested in sparc64-*-linux-gnu, sparcv9-*-linux-gnu
> and sparc-sun-solaris2.11 targets.
> 
> The support for the new instructions/registers/isas/etc of the M7 is
> already committed upstream in binutils.

It has been for a while, so I think that we should put the patch on the 6
branch too.

>   The niagara7 pipeline description models the V3 pipe using a bypass
>   with latency 3 from and to any instruction executing in the V3 pipe.
>   The instructions are identified by means of a new instruction
>   attribute, "v3pipe", which has been added to the appropriate
>   define_insns in sparc.md.
> 
>   However, I am not sure how well this will scale in the future, as in
>   future cpu models the subset of instructions executing in the V3 pipe
>   will be different than in the M7.  So we need a way to define that
>   instruction class in a processor-specific way: using v3pipe,
>   v3pipe_m8, v3pipe_m9, etc. seems a bit messy to me.  Any ideas?

I guess it depends on whether the set of instructions executing in the V3 pipe 
is (or will become) kind of arbitrary or not.  The usual way to support a new 
scheduling model is to define new (sub-)types of instructions and assign the 
appropriate (sub-)types to the scheduling units but, of course, if affected 
instructions are selected randomly, the model falls apart.

>   Note that the reason why the empirically observed latencies in the T4
>   were different than the documented ones in the T4 supplement (as Dave
>   found and fixed in
>   https://gcc.gnu.org/ml/gcc-patches/2012-10/msg00934.html) was that the
>   HW chaps didn't feel it necessary to document the complexity in the
>   PRM, and just assigned a latency of 11 to the VIS instructions.

Very interesting insight.  It would be nice to add a blurb about that in 
config/sparc/niagara4.md then.

> - Changes in the cache parameters for niagara processors
> 
>   The Oracle SPARC Architecture (previously the UltraSPARC Architecture)
>   specification states that when a PREFETCH[A] instruction is executed
>   an implementation-specific amount of data is prefetched, and that it
>   is at least 64 bytes long (aligned to at least 64 bytes).
> 
>   However, this is not correct.  The M7 (and implementations prior to
>   that) does not guarantee a 64B prefetch into a cache if the line size
>   is smaller.  A single cache line is all that is ever prefetched.  So
>   for the M7, where the L1D$ has 32B lines and the L2D$ and L3 have 64B
>   lines, a prefetch will prefetch 64B into the L2 and L3, but only 32B
>   are brought into the L1D$. (Assuming it is a read_n prefetch, which is
>   the only type which allocates to the L1.)

Adding a comment to sparc_option_override about that would also be nice.

>   Another change is setting PARAM_SIMULTANEOUS_PREFETCHES to 32, as
>   opposed to its previous value of 2 for niagara processors.  Unlike in
>   the UltraSPARC III, which featured a documented prefetch queue with a
>   size of 8, in niagara processors prefetches are handled much like
>   regular loads.  The L1 miss buffer is 32 entries, but prefetches start
>   getting affected when 30 entries become occupied.  That occupation
>   could be a mix of regular loads and prefetches though.  And that
>   buffer is shared by all threads.  Once the threshold is reached, if
>   the core is running a single thread the prefetch will retry.  If more
>   than one thread is running, the prefetch will be dropped.  So, as you
>   can see, all this makes it very difficult to determine how many
>   prefetches can be issued simultaneously, even in a
>   single-threaded program.  However, 2 is clearly too low, and
>   experimental results show that setting this parameter to 32 works well
>   when the number of threads is not high.  (Note that I didn't change
>   this parameter for niagara2, niagara3 and niagara4, but we probably
>   want to do that.)

Fine with me, let's do that, but with a comment too.

>   All of the above together makes a difference when running STREAM on an M7 with
>   OMP_NUM_THREADS=2:
> 
>   With -O3 -mcpu=niagara4 -fprefetch-loop-arrays:
> 
>   Function    Best Rate MB/s  Avg time  Min time  Max time
>   Copy:            6336.6573    5.4323    5.4224    5.4455
>   Scale:           5796.6113    5.9289    5.9276    5.9309
>   Add:             7517.6836    6.8760    6.8558    6.8927
>   Triad:           7781.0785    6.6364    6.6237    6.6549
> 
>   With -O3 -mcpu=niagara7 -fprefetch-loop-arrays:
> 
>   Function    Best Rate MB/s  Avg time  Min time  Max time
>   Copy:           10743.8074    3.2052    3.1981    3.2132
>   Scale:          10763.5906    3.1995    3.1922    3.2078
>   Add:            11866.4764

[PATCH] Support for SPARC M7 and VIS 4.0

2016-05-31 Thread Jose E. Marchesi

This patch adds support for -mcpu=niagara7, corresponding to the SPARC
M7 CPU as documented in the Oracle SPARC Architecture 2015 and the M7
Processor Supplement.  The patch also includes intrinsics support for
all the VIS 4.0 instructions.

This patch has been tested in sparc64-*-linux-gnu, sparcv9-*-linux-gnu
and sparc-sun-solaris2.11 targets.

The support for the new instructions/registers/isas/etc of the M7 is
already committed upstream in binutils.

Some notes on the patch:

- Pipeline description for the M7

  The pipeline description for the M7 is similar to the niagara4
  description, with one exception: the latencies of many VIS instructions
  are different because of the introduction of the so-called "V3 pipe".

  There is a 3-cycle pipeline through which a number of instructions
  flow.  That pipeline was born to process some of the short crypto
  instructions in T4/T5/M5/M6.  Then in M7/T7 some instructions that
  used to have ~11-cycle latency in the FP pipeline (notably, some
  non-FP VIS instructions) were moved to the V3 pipeline (and we will be
  moving more instructions to the V3 in future M* models).  The result
  is that latency for quite a few previously 11-cycle operations has
  dropped to 3 cycles.

  But then, in order to obtain the 3-cycle latency, the destination
  register of the instruction must be "consumed" by another instruction
  that also executes in the V3 pipeline (so that the register-bypass
  path in V3 can keep the latency down to 3 cycles).  If the instruction
  that consumes Rd does not execute in the V3 pipeline, then it has to
  wait longer to see the result, because the 3-cycle bypass path only
  works within the V3 pipeline.  The instructions using the V3 pipe are
  marked with a (1) superscript in the latencies table (A-1) in the M7
  processor supplement.
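
  To make this concrete, here is a toy GNU C illustration (not part of
  the patch).  fpadd32 is used purely as a placeholder for "a VIS
  instruction that executes in the V3 pipe"; whether a given instruction
  really does so on the M7 must be checked against table A-1:

    /* Toy sketch of the two cases described above; "e" is the SPARC
       constraint for any floating-point register.  Build with a
       VIS-enabled -mcpu/-mvis setting.  */

    double
    chain_in_v3 (double a, double b)
    {
      double t, r;
      /* Producer writing t, assumed to execute in the V3 pipe.  */
      __asm__ ("fpadd32 %1, %2, %0" : "=e" (t) : "e" (a), "e" (b));
      /* Consumer also in the V3 pipe: sees t after ~3 cycles.  */
      __asm__ ("fpadd32 %1, %2, %0" : "=e" (r) : "e" (t), "e" (b));
      return r;
    }

    double
    chain_out_of_v3 (double a, double b)
    {
      double t;
      __asm__ ("fpadd32 %1, %2, %0" : "=e" (t) : "e" (a), "e" (b));
      /* Consumer in the FP pipeline: it waits the full latency, since
         the 3-cycle bypass only exists within the V3 pipe.  */
      return t + b;
    }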

  The niagara7 pipeline description models the V3 pipe using a bypass
  with latency 3 from and to any instruction executing in the V3 pipe.
  The instructions are identified by means of a new instruction
  attribute, "v3pipe", which has been added to the appropriate
  define_insns in sparc.md.

  However, I am not sure how well this will scale in the future, as in
  future cpu models the subset of instructions executing in the V3 pipe
  will be different than in the M7.  So we need a way to define that
  instruction class in a processor-specific way: using v3pipe,
  v3pipe_m8, v3pipe_m9, etc. seems a bit messy to me.  Any ideas?

  Note that the reason why the empirically observed latencies in the T4
  were different than the documented ones in the T4 supplement (as Dave
  found and fixed in
  https://gcc.gnu.org/ml/gcc-patches/2012-10/msg00934.html) was that the
  HW chaps didn't feel it necessary to document the complexity in the
  PRM, and just assigned a latency of 11 to the VIS instructions.
  Fortunately, the policy changed (as these implementation details are
  obviously very useful to compiler writers) and in the M7 supplement
  these details have been documented, as they will be in the PRMs of
  future models.

- Changes in the cache parameters for niagara processors

  The Oracle SPARC Architecture (previously the UltraSPARC Architecture)
  specification states that when a PREFETCH[A] instruction is executed
  an implementation-specific amount of data is prefetched, and that it
  is at least 64 bytes long (aligned to at least 64 bytes).

  However, this is not correct.  The M7 (and implementations prior to
  that) does not guarantee a 64B prefetch into a cache if the line size
  is smaller.  A single cache line is all that is ever prefetched.  So
  for the M7, where the L1D$ has 32B lines and the L2D$ and L3 have 64B
  lines, a prefetch will prefetch 64B into the L2 and L3, but only 32B
  are brought into the L1D$. (Assuming it is a read_n prefetch, which is
  the only type which allocates to the L1.)
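
  For illustration, here is a small standalone C example (not part of
  the patch) of what this means for a manual prefetch.  The assumption
  that a read prefetch with a high locality hint ends up as the read_n
  function code is based on the existing sparc.md prefetch pattern, not
  on anything this patch changes:

    #include <stddef.h>

    /* Each __builtin_prefetch below is expected to become a PREFETCH
       with the read_n function code, so on the M7 it brings one 64B
       line into the L2/L3 but only one 32B line into the L1D$.  */
    double
    sum_with_prefetch (const double *a, size_t n)
    {
      double s = 0.0;
      for (size_t i = 0; i < n; i++)
        {
          /* The distance of 8 elements (64 bytes) is illustrative.  */
          if (i + 8 < n)
            __builtin_prefetch (&a[i + 8], 0 /* read */, 3 /* locality */);
          s += a[i];
        }
      return s;
    }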

  We will be fixing the OSA specification to say that PREFETCH
  prefetches 64B or one cache line (whichever is larger) from memory
  into the last-level cache, and that whether and how much data a given
  PREFETCH function code causes to be prefetched from the LLC to higher
  (closer to the processor) levels of caches is
  implementation-dependent.  This accurately describes the behavior of
  every processor at least as far back as the T2.

  So, by changing the L1_CACHE_LINE_SIZE parameter to 32B, we get the
  tree SSA loop prefetch pass to unroll loops so that every iteration
  processes (and prefetches ahead) one L1D$ line's worth of data.
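
  As a rough, hand-written approximation (not actual compiler output) of
  what the pass aims for with 32-byte lines and 8-byte elements: unroll
  by 32/8 = 4 so that each iteration consumes one L1D$ line, and
  prefetch a fixed number of lines ahead.  The real unroll factor and
  distance come from the other prefetch parameters; PF_DIST below is a
  made-up value:

    #define PF_DIST 8   /* prefetch 8 lines (256 bytes) ahead */

    void
    scale (double *restrict dst, const double *restrict src,
           long n, double k)
    {
      long i;
      for (i = 0; i + 4 <= n; i += 4)   /* one 32B line per iteration */
        {
          __builtin_prefetch (&src[i + 4 * PF_DIST], 0, 3);
          __builtin_prefetch (&dst[i + 4 * PF_DIST], 1, 3);
          dst[i + 0] = k * src[i + 0];
          dst[i + 1] = k * src[i + 1];
          dst[i + 2] = k * src[i + 2];
          dst[i + 3] = k * src[i + 3];
        }
      for (; i < n; i++)                /* remainder */
        dst[i] = k * src[i];
    }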

  Another change is setting PARAM_SIMULTANEOUS_PREFETCHES to 32, as
  opposed to its previous value of 2 for niagara processors.  Unlike in
  the UltraSPARC III, which featured a documented prefetch queue with a
  size of 8, in niagara processors prefetches are handled much like
  regular loads.  The L1 miss buffer is 32 entries, but prefetches start
  getting affected when 30 entries become occupied.  That occupation
  could be a mix of regular loads and prefetches