Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL
On Mar 1, 2019, at 11:00 AM, Sajid Ali <sajidsyed2...@u.northwestern.edu> wrote:

> Hi Hong,
>
> So, the speedup was coming from increased DRAM bandwidth and not the usage of MCDRAM.

Certainly the speedup was coming from the usage of MCDRAM (which has much higher bandwidth than DRAM). What I meant is that your code is still using MCDRAM, but in cache mode MCDRAM acts like an L3 cache.

Hong

> There is moderate MPI imbalance, a large number of back-end stalls, and good vectorization. I'm attaching my submit script, PETSc log file, and Intel APS summary (all as non-HTML text). I can give a more detailed analysis via Intel VTune if needed.
>
> Thank You,
> Sajid Ali
> Applied Physics
> Northwestern University
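A quick way to confirm which memory mode a node actually booted in is to inspect its NUMA topology; a minimal sketch (node numbering and sizes vary by machine):

  # In flat mode, MCDRAM shows up as a CPU-less NUMA node (typically node 1);
  # in cache mode it does not appear at all, since it acts as a memory-side cache.
  numactl -H
  # flat,quad (roughly):  node 0: cpus 0-255, DRAM
  #                       node 1: no cpus,    16 GB MCDRAM
  # cache,quad:           node 0 only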
Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL
Hi Hong,

So, the speedup was coming from increased DRAM bandwidth and not the usage of MCDRAM.

There is moderate MPI imbalance, a large number of back-end stalls, and good vectorization. I'm attaching my submit script, PETSc log file, and Intel APS summary (all as non-HTML text). I can give a more detailed analysis via Intel VTune if needed.

Thank You,
Sajid Ali
Applied Physics
Northwestern University

[Attachments: submit_script, intel_aps_report, knl_petsc]
Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL
On Feb 28, 2019, at 6:10 AM, Sajid Ali <sajidsyed2...@u.northwestern.edu> wrote:

> Hi Hong,
>
> Thanks for the advice. I see that the example takes ~180 seconds to run, but I can't see the DRAM vs MCDRAM info from Intel APS. I'll try to fix the profiling and get back with further questions.

MCDRAM has 4x higher bandwidth than DRAM, so the improvement you see in your example looks very reasonable. Note that in cache mode MCDRAM acts as an L3 cache, while in flat mode it is used as another level of memory.

Hong (Mr.)

> Also, the intel-mpi manpages say that the use of tmi is now deprecated: https://software.intel.com/en-us/mpi-developer-guide-linux-fabrics-control
>
> Thank You,
> Sajid Ali
> Applied Physics
> Northwestern University
Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL
Hi Hong,

Thanks for the advice. I see that the example takes ~180 seconds to run, but I can't see the DRAM vs MCDRAM info from Intel APS. I'll try to fix the profiling and get back with further questions.

Also, the intel-mpi manpages say that the use of tmi is now deprecated: https://software.intel.com/en-us/mpi-developer-guide-linux-fabrics-control

Thank You,
Sajid Ali
Applied Physics
Northwestern University
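For reference, a rough sketch of the usual APS workflow (the exact flags and the name of the result directory, shown here as a placeholder, may differ between APS versions):

  # collect a profile by wrapping the application with aps under the launcher
  srun -n 64 aps ./ex_modify -ts_type cn -prop_steps 25 -pc_type gamg -ts_monitor -log_view
  # then generate the summary from the result directory APS writes out
  aps --report=aps_result_<date>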
Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL
For LCRC, it is better to follow the example on their website (http://www.lcrc.anl.gov/for-users/using-lcrc/running-jobs/running-jobs-on-bebop/):

#!/bin/bash
#SBATCH -J mympijobname
#SBATCH -A myaccount
#SBATCH -p knlall
#SBATCH -C knl,quad,cache
#SBATCH -N 2
#SBATCH --ntasks-per-node=64
#SBATCH -t 00:15:00
export I_MPI_FABRICS=shm:tmi
srun ./ex_modify

I would not recommend using flat mode unless you really want to seriously optimize your code for the best performance on KNL. When configuring PETSc on KNL, adding the options --with-avx512-kernels --with-memalign=64 may speed up your code a bit.

Hong (Mr.)

On Feb 27, 2019, at 5:02 PM, Sajid Ali via petsc-users <petsc-users@mcs.anl.gov> wrote:

> Hi Junchao,
>
> I'm confused with the syntax. If I submit the following as my job script, I get an error:
>
> #!/bin/bash
> #SBATCH --job-name=petsc_test
> #SBATCH -N 1
> #SBATCH -C knl,quad,flat
> #SBATCH -p apsxrmd
> #SBATCH --time=1:00:00
> module load intel/18.0.3-d6gtsxs
> module load intel-parallel-studio/cluster.2018.3-xvnfrfz
> module load numactl-2.0.12-intel-18.0.3-wh44iog
> srun -n 64 -c 64 --cpu_bind=cores numactl -m 1 aps ./ex_modify -ts_type cn -prop_steps 25 -pc_type gamg -ts_monitor -log_view
>
> The error is:
>
> srun: cluster configuration lacks support for cpu binding
> srun: error: Unable to create step for job 916208: More processors requested than permitted
>
> I'm following the advice as given at slide 33 of https://www.nersc.gov/assets/Uploads/02-using-cori-knl-nodes-20170609.pdf
> For further info, I'm using LCRC at ANL.
>
> Thank You,
> Sajid Ali
> Applied Physics
> Northwestern University
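A minimal sketch of a PETSc configure line with the options Hong mentions above (--with-scalar-type=complex matches the problem described later in the thread; the Intel optimization flags are illustrative assumptions, not LCRC-specific advice):

  ./configure --with-avx512-kernels --with-memalign=64 \
              --with-scalar-type=complex \
              COPTFLAGS="-O3 -xMIC-AVX512" CXXOPTFLAGS="-O3 -xMIC-AVX512"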
Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL
On Wed, Feb 27, 2019 at 7:03 PM Sajid Ali <sajidsyed2...@u.northwestern.edu> wrote:

> Hi Junchao,
>
> I'm confused with the syntax. If I submit the following as my job script, I get an error:
>
> #!/bin/bash
> #SBATCH --job-name=petsc_test
> #SBATCH -N 1
> #SBATCH -C knl,quad,flat
> #SBATCH -p apsxrmd
> #SBATCH --time=1:00:00
> module load intel/18.0.3-d6gtsxs
> module load intel-parallel-studio/cluster.2018.3-xvnfrfz
> module load numactl-2.0.12-intel-18.0.3-wh44iog
> srun -n 64 -c 64 --cpu_bind=cores numactl -m 1 aps ./ex_modify -ts_type cn -prop_steps 25 -pc_type gamg -ts_monitor -log_view
>
> The error is:
>
> srun: cluster configuration lacks support for cpu binding

This cluster does not support cpu binding. You need to remove --cpu_bind=cores. In addition, I don't know what the 'aps' argument is.

> srun: error: Unable to create step for job 916208: More processors requested than permitted

I remember the product of -n and -c has to be 256. You can try:

srun -n 64 -c 4 numactl -m 1 ./ex_modify ...

> I'm following the advice as given at slide 33 of https://www.nersc.gov/assets/Uploads/02-using-cori-knl-nodes-20170609.pdf
> For further info, I'm using LCRC at ANL.
>
> Thank You,
> Sajid Ali
> Applied Physics
> Northwestern University
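The arithmetic behind that limit, as a sketch (assuming the KNL nodes expose 64 cores with 4 hardware threads each):

  # 64 cores x 4 hardware threads = 256 logical CPUs per node, and the product
  # of ntasks and cpus-per-task cannot exceed what one node provides:
  #   -n 64 -c 64  ->  4096 logical CPUs  ->  "More processors requested than permitted"
  #   -n 64 -c 4   ->   256 logical CPUs  ->  fits on one node
  srun -n 64 -c 4 numactl -m 1 ./ex_modify -ts_type cn -prop_steps 25 -pc_type gamg -ts_monitor -log_view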
Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL
Hi Junchao,

I'm confused with the syntax. If I submit the following as my job script, I get an error:

#!/bin/bash
#SBATCH --job-name=petsc_test
#SBATCH -N 1
#SBATCH -C knl,quad,flat
#SBATCH -p apsxrmd
#SBATCH --time=1:00:00
module load intel/18.0.3-d6gtsxs
module load intel-parallel-studio/cluster.2018.3-xvnfrfz
module load numactl-2.0.12-intel-18.0.3-wh44iog
srun -n 64 -c 64 --cpu_bind=cores numactl -m 1 aps ./ex_modify -ts_type cn -prop_steps 25 -pc_type gamg -ts_monitor -log_view

The error is:

srun: cluster configuration lacks support for cpu binding
srun: error: Unable to create step for job 916208: More processors requested than permitted

I'm following the advice as given at slide 33 of https://www.nersc.gov/assets/Uploads/02-using-cori-knl-nodes-20170609.pdf
For further info, I'm using LCRC at ANL.

Thank You,
Sajid Ali
Applied Physics
Northwestern University
Re: [petsc-users] Direct PETSc to use MCDRAM on KNL and other optimizations for KNL
Use:

srun numactl -m 1 ./app

OR

srun numactl -p 1 ./app

See the bottom of https://www.nersc.gov/users/computational-systems/cori/configuration/knl-processor-modes/

--Junchao Zhang

On Wed, Feb 27, 2019 at 4:16 PM Sajid Ali via petsc-users <petsc-users@mcs.anl.gov> wrote:

> Hi,
>
> I ran a TS integrator for 25 steps on a Broadwell Xeon and on a Xeon Phi (KNL). The problem size is 5000x5000 and I'm using scalar=complex. The program takes 125 seconds to run on the Xeon and 451 seconds on the KNL!
>
> The first thing I want to change is to convert the memory access for the program on KNL from DRAM to MCDRAM. I did run the problem in an interactive SLURM job and specified -C quad,flat, and yet I see DRAM is being used. I'm attaching the PETSc log files and Intel APS reports as well.
>
> Any help on how I should change my runtime parameters on KNL will be highly appreciated. Thanks in advance.
>
> --
> Sajid Ali
> Applied Physics
> Northwestern University
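As a footnote on the two numactl invocations Junchao suggests above (semantics per the linked NERSC page; in flat,quad mode MCDRAM is NUMA node 1):

  # -m / --membind=1   : allocate only from MCDRAM (node 1); allocations fail
  #                      once its 16 GB is exhausted
  # -p / --preferred=1 : prefer MCDRAM, but fall back to DRAM instead of failing
  srun numactl -m 1 ./app
  srun numactl -p 1 ./app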