Thanks everyone for the helpful advice. I tried all the suggestions, including using LibSci. The performance did not improve for my particular runs, which I think suggests that the problem parameters I chose for my tests (SNES ex48) are not optimal for KNL. Does anyone have example test runs I could reproduce that compare performance between KNL and Haswell/Ivy Bridge/etc.?

On Mon, Apr 3, 2017 at 3:06 PM Richard Mills <[email protected]> wrote:
> Yes, one should rely on MKL (or Cray LibSci, if using the Cray toolchain)
> on Cori. But I'm guessing that this will make no noticeable difference for
> what Justin is doing.
>
> --Richard
>
> On Mon, Apr 3, 2017 at 12:57 PM, murat keçeli <[email protected]> wrote:
>
> How about replacing --download-fblaslapack with vendor specific
> BLAS/LAPACK?
>
> Murat
>
> On Mon, Apr 3, 2017 at 2:45 PM, Richard Mills <[email protected]> wrote:
>
> On Mon, Apr 3, 2017 at 12:24 PM, Zhang, Hong <[email protected]> wrote:
>
> On Apr 3, 2017, at 1:44 PM, Justin Chang <[email protected]> wrote:
>
> Richard,
>
> This is what my job script looks like:
>
> #!/bin/bash
> #SBATCH -N 16
> #SBATCH -C knl,quad,flat
> #SBATCH -p regular
> #SBATCH -J knlflat1024
> #SBATCH -L SCRATCH
> #SBATCH -o knlflat1024.o%j
> #SBATCH --mail-type=ALL
> #SBATCH [email protected]
> #SBATCH -t 00:20:00
>
> #run the application:
> cd $SCRATCH/Icesheet
> sbcast --compress=lz4 ./ex48cori /tmp/ex48cori
> srun -n 1024 -c 4 --cpu_bind=cores numactl -p 1 /tmp/ex48cori -M 128 -N 128 -P 16 -thi_mat_type baij -pc_type mg -mg_coarse_pc_type gamg -da_refine 1
>
> Maybe it is a typo. It should be numactl -m 1.
>
> "-p 1" will also work. "-p" means to "prefer" NUMA node 1 (the MCDRAM),
> whereas "-m" means to use only NUMA node 1. In the former case, MCDRAM
> will be used for allocations until the available memory there has been
> exhausted, and then things will spill over into the DRAM. One would think
> that "-m" would be better for doing performance studies, but on systems
> where the nodes have swap space enabled, you can get terrible performance
> if your code's working set exceeds the size of the MCDRAM, as the system
> will obediently obey your wishes to not use the DRAM and go straight to the
> swap disk! I assume the Cori nodes don't have swap space, though I could
> be wrong.
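
To keep the "-p" / "-m" distinction above straight, the three launch variants could be lined up roughly as follows. This is only a sketch: mcdram_prefix is a made-up helper name (not part of numactl or SLURM), the application arguments are copied from the job script quoted above, and the script prints the launch lines rather than running anything.

```shell
#!/bin/bash
# Sketch only: prints one srun line per MCDRAM policy; nothing is launched.

# Application options copied from the job script quoted above.
APP_ARGS="-M 128 -N 128 -P 16 -thi_mat_type baij -pc_type mg -mg_coarse_pc_type gamg -da_refine 1"

# mcdram_prefix is a hypothetical helper, not a numactl or SLURM feature.
mcdram_prefix() {
  case "$1" in
    none)   echo "" ;;              # baseline: MCDRAM not explicitly requested
    prefer) echo "numactl -p 1" ;;  # prefer NUMA node 1, spill over to DRAM when full
    bind)   echo "numactl -m 1" ;;  # allocate only on NUMA node 1
    *)      echo "unknown policy: $1" >&2; return 1 ;;
  esac
}

for policy in none prefer bind; do
  prefix=$(mcdram_prefix "$policy") || exit 1
  # ${prefix:+...} adds the numactl prefix plus a space only when it is non-empty.
  echo "[$policy] srun -n 1024 -c 4 --cpu_bind=cores ${prefix:+$prefix }/tmp/ex48cori $APP_ARGS"
done
```

Pasting each printed line into a copy of the job script (with the node count and -n adjusted as needed) would give one run per policy, so the timings can be compared side by side.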
>
> According to the NERSC info pages, they say to add the "numactl" if using
> flat mode. Previously I tried cache mode but the performance seems to be
> unaffected.
>
> Using cache mode should give similar performance to using flat mode with
> the numactl option. But both approaches should be significantly faster
> than using flat mode without the numactl option. I usually see over 3X
> speedup. You can also do such a comparison to see if the high-bandwidth
> memory is working properly.
>
> I also compared 256 Haswell nodes vs 256 KNL nodes and Haswell is nearly
> 4-5x faster. Though I suspect this drastic change has much to do with the
> initial coarse grid size now being extremely small.
>
> I think you may be right about why you see such a big difference. The KNL
> nodes need enough work to be able to use the SIMD lanes effectively. Also,
> if your problem gets small enough, then it's going to be able to fit in the
> Haswell's L3 cache. Although KNL has MCDRAM and this delivers *a lot* more
> memory bandwidth than the DDR4 memory, it will deliver a lot less bandwidth
> than the Haswell's L3.
>
> I'll give the COPTFLAGS a try and see what happens
>
> Make sure to use --with-memalign=64 for data alignment when configuring
> PETSc.
>
> Ah, yes, I forgot that. Thanks for mentioning it, Hong!
>
> The option -xMIC-AVX512 would improve the vectorization performance. But
> it may cause problems for the MPIBAIJ format for some unknown reason.
> MPIAIJ should work fine with this option.
>
> Hmm. Try both, and, if you see worse performance with MPIBAIJ, let us
> know and I'll try to figure this out.
>
> --Richard
>
> Hong (Mr.)
>
> Thanks,
> Justin
>
> On Mon, Apr 3, 2017 at 1:36 PM, Richard Mills <[email protected]> wrote:
>
> Hi Justin,
>
> How is the MCDRAM (on-package "high-bandwidth memory") configured for your
> KNL runs? And if it is in "flat" mode, what are you doing to ensure that
> you use the MCDRAM?
> Doing this wrong seems to be one of the most common reasons for
> unexpectedly poor performance on KNL.
>
> I'm not that familiar with the environment on Cori, but I think that if
> you are building for KNL, you should add "-xMIC-AVX512" to your compiler
> flags to explicitly instruct the compiler to use the AVX512 instruction
> set. I usually use something along the lines of
>
> 'COPTFLAGS=-g -O3 -fp-model fast -xMIC-AVX512'
>
> (The "-g" just adds symbols, which make the output from performance
> profiling tools much more useful.)
>
> That said, if you are comparing 1024 Haswell cores vs. 1024 KNL cores (so
> double the number of Haswell nodes), I'm not surprised that the
> simulations are almost twice as fast using the Haswell nodes. Keep in
> mind that an individual KNL core is much less powerful than an individual
> Haswell core. You are also using roughly twice the power footprint (a
> dual-socket Haswell node should be roughly equivalent to a KNL node, I
> believe). How do things look when you compare equal numbers of nodes?
>
> Cheers,
> Richard
>
> On Mon, Apr 3, 2017 at 11:13 AM, Justin Chang <[email protected]> wrote:
>
> Hi all,
>
> On NERSC's Cori I have the following configure options for PETSc:
>
> ./configure --download-fblaslapack --with-cc=cc --with-clib-autodetect=0
> --with-cxx=CC --with-cxxlib-autodetect=0 --with-debugging=0 --with-fc=ftn
> --with-fortranlib-autodetect=0 --with-mpiexec=srun --with-64-bit-indices=1
> COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 PETSC_ARCH=arch-cori-opt
>
> Where I swapped out the default Intel programming environment for that of
> Cray (e.g., 'module switch PrgEnv-intel/6.0.3 PrgEnv-cray/6.0.3'). I want
> to document the performance difference between Cori's Haswell and KNL
> processors.
>
> When I run a PETSc example like SNES ex48 on 1024 cores (32 Haswell and 16
> KNL nodes), the simulations are almost twice as fast on Haswell nodes.
> Which leads me to suspect that I am not doing something right for KNL.
> Does anyone know what are some "optimal" configure options for running
> PETSc on KNL?
>
> Thanks,
> Justin
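
For anyone trying to reproduce this, the configure suggestions scattered through the thread could be collected along these lines. Treat it as a sketch under assumptions, not a tested recipe: the -xMIC-AVX512 and -fp-model flags are Intel compiler flags, so this assumes PrgEnv-intel is loaded rather than PrgEnv-cray; the arch name arch-cori-knl-opt is made up; and the BLAS/LAPACK option is deliberately left out because the thread does not settle which vendor library option to pass. The script only prints the command.

```shell
#!/bin/bash
# Sketch only: assembles a configure line from the suggestions in this
# thread and prints it; nothing is configured or built.

# Intel-compiler flags quoted from Richard's message. They assume
# PrgEnv-intel is loaded rather than PrgEnv-cray.
KNL_OPTFLAGS="-g -O3 -fp-model fast -xMIC-AVX512"

# Hypothetical arch name, chosen here only to keep KNL builds separate.
KNL_ARCH="arch-cori-knl-opt"

# --download-fblaslapack is dropped on purpose; which vendor BLAS/LAPACK
# option to use instead (MKL or LibSci) is not settled in the thread.
CONFIGURE_CMD="./configure \
--with-cc=cc --with-clib-autodetect=0 \
--with-cxx=CC --with-cxxlib-autodetect=0 \
--with-fc=ftn --with-fortranlib-autodetect=0 \
--with-mpiexec=srun --with-debugging=0 \
--with-64-bit-indices=1 \
--with-memalign=64 \
'COPTFLAGS=$KNL_OPTFLAGS' CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 \
PETSC_ARCH=$KNL_ARCH"

echo "$CONFIGURE_CMD"
```

Only COPTFLAGS is changed here because that is the only variable the thread gives a concrete value for; CXXOPTFLAGS and FOPTFLAGS keep the -O3 from the original configure line.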
