Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL

2019-03-01 Thread Zhang, Hong via petsc-users


On Mar 1, 2019, at 11:00 AM, Sajid Ali <sajidsyed2...@u.northwestern.edu> wrote:


Hi Hong,

So, the speedup was coming from increased DRAM bandwidth and not the usage of 
MCDRAM.

Certainly the speedup was coming from the usage of MCDRAM (which has much 
higher bandwidth than DRAM). What I meant is that your code is still using MCDRAM, 
but in cache mode MCDRAM acts like an L3 cache.
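
(A quick way to confirm which mode a node booted in, as a minimal sketch: in flat mode MCDRAM appears as its own NUMA node, while in cache mode it is transparent to the OS.)

numactl -H   # flat mode: a CPU-less ~16 GB NUMA node (node 1 on a quad,flat node) is the MCDRAM
             # cache mode: only the DDR node is listed; MCDRAM does not show up at all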

Hong



There is moderate MPI imbalance, a large number of back-end stalls, and good 
vectorization.

I'm attaching my submit script, PETSc log file, and Intel APS summary (all as 
non-HTML text). I can give a more detailed analysis via Intel VTune if needed.


Thank You,
Sajid Ali
Applied Physics
Northwestern University




Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL

2019-03-01 Thread Sajid Ali via petsc-users
Hi Hong,

So, the speedup was coming from increased DRAM bandwidth and not the usage
of MCDRAM.

There is moderate MPI imbalance, a large number of back-end stalls, and good
vectorization.

I'm attaching my submit script, PETSc log file, and Intel APS summary (all
as non-HTML text). I can give a more detailed analysis via Intel VTune if
needed.


Thank You,
Sajid Ali
Applied Physics
Northwestern University


submit_script
Description: Binary data


intel_aps_report
Description: Binary data


knl_petsc
Description: Binary data


Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL

2019-02-28 Thread Zhang, Hong via petsc-users


On Feb 28, 2019, at 6:10 AM, Sajid Ali <sajidsyed2...@u.northwestern.edu> wrote:

Hi Hong,

Thanks for the advice. I see that the example takes ~180 seconds to run but I 
can't see the DRAM vs MCDRAM info from Intel APS. I'll try to fix the profiling 
and get back with further questions.

MCDRAM has 4x higher bandwidth than DRAM, so the improvement you see in your 
example looks very reasonable. Note that in cache mode MCDRAM acts as an L3 cache, 
while in flat mode it is used as another level of memory.
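
(If you do try flat mode, a minimal sketch of the two usual numactl placements, assuming MCDRAM shows up as NUMA node 1 on a quad,flat node:)

srun numactl -m 1 ./ex_modify ...   # --membind: strict; allocations fail once the ~16 GB of MCDRAM is exhausted
srun numactl -p 1 ./ex_modify ...   # --preferred: fill MCDRAM first, spill the remainder to DRAM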

Hong (Mr.)



Also, the Intel MPI man pages say that the use of tmi is now deprecated: 
https://software.intel.com/en-us/mpi-developer-guide-linux-fabrics-control


Thank You,
Sajid Ali
Applied Physics
Northwestern University



Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL

2019-02-28 Thread Sajid Ali via petsc-users
Hi Hong,

Thanks for the advice. I see that the example takes ~180 seconds to run but
I can't see the DRAM vs MCDRAM info from Intel APS. I'll try to fix the
profiling and get back with further questions.

Also, the Intel MPI man pages say that the use of tmi is now deprecated:
https://software.intel.com/en-us/mpi-developer-guide-linux-fabrics-control


Thank You,
Sajid Ali
Applied Physics
Northwestern University


Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL

2019-02-27 Thread Zhang, Hong via petsc-users
For LCRC, it is better to follow the example on their website 
(http://www.lcrc.anl.gov/for-users/using-lcrc/running-jobs/running-jobs-on-bebop/)

#!/bin/bash

#SBATCH -J mympijobname
#SBATCH -A myaccount
#SBATCH -p knlall
#SBATCH -C knl,quad,cache
#SBATCH -N 2
#SBATCH --ntasks-per-node=64
#SBATCH -t 00:15:00

export I_MPI_FABRICS=shm:tmi
srun ./ex_modify

I would not recommend using flat mode unless you really want to seriously 
optimize your code for the best performance on KNL.

When configuring PETSc on KNL, adding the options --with-avx512-kernels 
and --with-memalign=64 may speed up your code a bit.
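
(For illustration, a hedged sketch of a configure line with those options; the compiler wrappers, complex scalars, and optimization flags are assumptions based on this thread and should be adjusted for your site:)

./configure --with-cc=mpiicc --with-cxx=mpiicpc --with-fc=mpiifort \
  --with-scalar-type=complex \
  --with-avx512-kernels=1 --with-memalign=64 \
  COPTFLAGS='-g -O3 -xMIC-AVX512' CXXOPTFLAGS='-g -O3 -xMIC-AVX512' FOPTFLAGS='-g -O3 -xMIC-AVX512'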

Hong (Mr.)

On Feb 27, 2019, at 5:02 PM, Sajid Ali via petsc-users <petsc-users@mcs.anl.gov> wrote:


Hi Junchao,

I'm confused by the syntax. If I submit the following as my job script, I get 
an error:

#!/bin/bash
#SBATCH --job-name=petsc_test
#SBATCH -N 1
#SBATCH -C knl,quad,flat
#SBATCH -p apsxrmd
#SBATCH --time=1:00:00

module load intel/18.0.3-d6gtsxs
module load intel-parallel-studio/cluster.2018.3-xvnfrfz
module load numactl-2.0.12-intel-18.0.3-wh44iog
srun -n 64 -c 64 --cpu_bind=cores numactl -m 1 aps ./ex_modify -ts_type cn 
-prop_steps 25 -pc_type gamg -ts_monitor -log_view


The error is:
srun: cluster configuration lacks support for cpu binding
srun: error: Unable to create step for job 916208: More processors requested 
than permitted

I'm following the advice given on slide 33 of 
https://www.nersc.gov/assets/Uploads/02-using-cori-knl-nodes-20170609.pdf

For further info, I’m using LCRC at ANL.

Thank You,
Sajid Ali
Applied Physics
Northwestern University



Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL

2019-02-27 Thread Zhang, Junchao via petsc-users



On Wed, Feb 27, 2019 at 7:03 PM Sajid Ali <sajidsyed2...@u.northwestern.edu> wrote:

Hi Junchao,

I'm confused by the syntax. If I submit the following as my job script, I get 
an error:

#!/bin/bash
#SBATCH --job-name=petsc_test
#SBATCH -N 1
#SBATCH -C knl,quad,flat
#SBATCH -p apsxrmd
#SBATCH --time=1:00:00

module load intel/18.0.3-d6gtsxs
module load intel-parallel-studio/cluster.2018.3-xvnfrfz
module load numactl-2.0.12-intel-18.0.3-wh44iog
srun -n 64 -c 64 --cpu_bind=cores numactl -m 1 aps ./ex_modify -ts_type cn 
-prop_steps 25 -pc_type gamg -ts_monitor -log_view


The error is:
srun: cluster configuration lacks support for cpu binding

This cluster does not support cpu binding. You need to remove 
--cpu_bind=cores. In addition, I don't know what the 'aps' argument is.


srun: error: Unable to create step for job 916208: More processors requested 
than permitted

I remember that the product of -n and -c has to be 256. You can try srun -n 64 -c 4 
numactl -m 1 ./ex_modify ...
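
(As a sketch of the accounting on one KNL node: 64 cores x 4 hardware threads = 256 logical CPUs, so 64 tasks with -c 4 fill the node:)

srun -n 64 -c 4 numactl -m 1 ./ex_modify -ts_type cn -prop_steps 25 -pc_type gamg -ts_monitor -log_view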


I'm following the advice given on slide 33 of 
https://www.nersc.gov/assets/Uploads/02-using-cori-knl-nodes-20170609.pdf

For further info, I’m using LCRC at ANL.

Thank You,
Sajid Ali
Applied Physics
Northwestern University


Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL

2019-02-27 Thread Sajid Ali via petsc-users
Hi Junchao,

I'm confused by the syntax. If I submit the following as my job script, I
get an error:

#!/bin/bash
#SBATCH --job-name=petsc_test
#SBATCH -N 1
#SBATCH -C knl,quad,flat
#SBATCH -p apsxrmd
#SBATCH --time=1:00:00

module load intel/18.0.3-d6gtsxs
module load intel-parallel-studio/cluster.2018.3-xvnfrfz
module load numactl-2.0.12-intel-18.0.3-wh44iog
srun -n 64 -c 64 --cpu_bind=cores numactl -m 1 aps ./ex_modify
-ts_type cn -prop_steps 25 -pc_type gamg -ts_monitor -log_view

The error is:
srun: cluster configuration lacks support for cpu binding
srun: error: Unable to create step for job 916208: More processors
requested than permitted

I'm following the advice given on slide 33 of
https://www.nersc.gov/assets/Uploads/02-using-cori-knl-nodes-20170609.pdf

For further info, I’m using LCRC at ANL.

Thank You,
Sajid Ali
Applied Physics
Northwestern University


Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL

2019-02-27 Thread Zhang, Junchao via petsc-users
Use srun numactl -m 1 ./app OR srun numactl -p 1 ./app
See bottom of 
https://www.nersc.gov/users/computational-systems/cori/configuration/knl-processor-modes/

--Junchao Zhang


On Wed, Feb 27, 2019 at 4:16 PM Sajid Ali via petsc-users <petsc-users@mcs.anl.gov> wrote:
Hi,

I ran a TS integrator for 25 steps on a Broadwell-Xeon and Xeon-Phi (KNL). The 
problem size is 5000x5000 and I'm using scalar=complex.

The program takes 125 seconds to run on the Xeon and 451 seconds on the KNL!

The first thing I want to change is to direct the program's memory accesses on 
KNL to MCDRAM instead of DRAM. I did run the problem in an interactive 
SLURM job and specified -C quad,flat, yet I see that DRAM is being used.

I'm attaching the PETSc log files and Intel APS reports as well. Any help on 
how I should change my runtime parameters on KNL will be highly appreciated. 
Thanks in advance.

--
Sajid Ali
Applied Physics
Northwestern University