"Zhang, Hong" <[email protected]> writes: > On Apr 4, 2017, at 10:45 PM, Justin Chang > <[email protected]<mailto:[email protected]>> wrote: > > So I tried the following options: > > -M 40 > -N 40 > -P 5 > -da_refine 1/2/3/4 > -log_view > -mg_coarse_pc_type gamg > -mg_levels_0_pc_type gamg > -mg_levels_1_sub_pc_type cholesky > -pc_type mg > -thi_mat_type baij > > Performance improved dramatically. However, Haswell still beats out KNL but > only by a little. Now it seems like MatSOR is taking some time (though I > can't really judge whether it's significant or not). Attached are the log > files. > > > MatSOR takes only 3% of the total time. Most of the time is spent on PCSetUp > (~30%) and PCApply (~11%).

I don't see any of your conclusions in the actual data, unless you only
looked at the smallest size that Justin tested.  For example, from the
largest problem size in Justin's logs:

KNL:

MatSOR            2688 1.0 2.3942e+02 1.1 4.47e+10 1.0 0.0e+00 0.0e+00 0.0e+00  36 45  0  0  0  36 45  0  0  0  11946
KSPSolve             8 1.0 4.3837e+02 1.0 9.87e+10 1.0 1.5e+06 8.8e+03 5.0e+03  68 99 98 61 98  68 99 98 61 98  14409
SNESSolve            1 1.0 6.1583e+02 1.0 9.95e+10 1.0 1.6e+06 1.4e+04 5.1e+03  96100100100 99  96100100100 99  10338
SNESFunctionEval     9 1.0 3.8730e+01 1.0 0.00e+00 0.0 9.2e+03 3.2e+04 0.0e+00   6  0  1  1  0   6  0  1  1  0      0
SNESJacobianEval    40 1.0 1.5628e+02 1.0 0.00e+00 0.0 4.4e+04 2.5e+05 1.4e+02  24  0  3 49  3  24  0  3 49  3      0
PCSetUp             16 1.0 3.4525e+01 1.0 6.52e+07 1.0 2.8e+05 1.0e+04 3.8e+03   5  0 18 13 74   5  0 18 13 74    119
PCSetUpOnBlocks     60 1.0 9.5716e-01 1.1 1.41e+05 0.0 0.0e+00 0.0e+00 0.0e+00   0  0  0  0  0   0  0  0  0  0      0
PCApply             60 1.0 3.8705e+02 1.0 9.32e+10 1.0 1.2e+06 8.0e+03 1.1e+03  60 94 79 45 21  60 94 79 45 21  15407
MatMult           2860 1.0 1.4578e+02 1.1 4.92e+10 1.0 1.2e+06 8.8e+03 0.0e+00  21 49 77 48  0  21 49 77 48  0  21579

Haswell:

MatSOR            2262 1.0 2.2116e+02 1.1 7.56e+10 1.0 0.0e+00 0.0e+00 0.0e+00  48 45  0  0  0  48 45  0  0  0  10936
KSPSolve             7 1.0 3.5937e+02 1.0 1.67e+11 1.0 6.7e+05 1.3e+04 4.5e+03  81 99 98 60 98  81 99 98 60 98  14828
SNESSolve            1 1.0 4.3749e+02 1.0 1.68e+11 1.0 6.8e+05 2.1e+04 4.5e+03  99100100100 99  99100100100 99  12280
SNESFunctionEval     8 1.0 1.5460e+01 1.0 0.00e+00 0.0 4.1e+03 4.7e+04 0.0e+00   3  0  1  1  0   3  0  1  1  0      0
SNESJacobianEval    35 1.0 6.8994e+01 1.0 0.00e+00 0.0 1.9e+04 3.8e+05 1.3e+02  16  0  3 50  3  16  0  3 50  3      0
PCSetUp             14 1.0 1.0860e+01 1.0 1.15e+08 1.0 1.3e+05 1.4e+04 3.4e+03   2  0 19 13 74   2  0 19 13 74    335
PCSetUpOnBlocks     50 1.0 4.5601e-02 1.6 2.89e+05 0.0 0.0e+00 0.0e+00 0.0e+00   0  0  0  0  0   0  0  0  0  0      6
PCApply             50 1.0 3.3545e+02 1.0 1.57e+11 1.0 5.3e+05 1.2e+04 9.7e+02  75 94 77 44 21  75 94 77 44 21  15017
MatMult           2410 1.0 1.2050e+02 1.1 8.28e+10 1.0 5.1e+05 1.3e+04 0.0e+00  27 49 75 46  0  27 49 75 46  0  21983

> If ex48 has SSE2 intrinsics, does that mean Haswell would almost always be
> better?
>
> The Jacobian evaluation (which has SSE2 intrinsics) on Haswell is about two
> times as fast as on KNL, but it eats only 3%-4% of the total time.

SNESJacobianEval alone accounts for 90 seconds of the 180 second
difference between KNL and Haswell.

> According to your logs, the compute-intensive kernels such as MatMult,
> MatSOR, PCApply run faster (~2X) on Haswell.

They run almost the same speed.

> But since the setup time dominates in this test,

It doesn't dominate on the larger sizes.

> Haswell would not show much benefit. If you increase the problem size,
> it could be expected that the performance gap would also increase.

Backwards.  Haswell is great for low latency on small problem sizes
while KNL offers higher theoretical throughput (often not realized due
to lack of vectorization) for sufficiently large problem sizes
(especially if they don't fit in Haswell L3 cache but do fit in MCDRAM).
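
The figures in the replies can be checked with a little arithmetic on the quoted rows. The following is a minimal sketch in Python, assuming the third numeric column of each row is the maximum time in seconds, as in standard -log_view output. The dictionary names and the helper functions `time_gap` and `time_ratio` are illustrative and not part of PETSc.

```python
# Max time in seconds per event, copied from the -log_view rows quoted
# in the message (largest problem size). Names here are illustrative.
knl = {
    "MatSOR": 239.42,
    "MatMult": 145.78,
    "PCApply": 387.05,
    "PCSetUp": 34.525,
    "SNESJacobianEval": 156.28,
    "SNESSolve": 615.83,
}
haswell = {
    "MatSOR": 221.16,
    "MatMult": 120.50,
    "PCApply": 335.45,
    "PCSetUp": 10.860,
    "SNESJacobianEval": 68.994,
    "SNESSolve": 437.49,
}


def time_gap(event):
    """Seconds by which KNL is slower than Haswell for one event."""
    return knl[event] - haswell[event]


def time_ratio(event):
    """KNL time divided by Haswell time for one event."""
    return knl[event] / haswell[event]


# Use the SNESSolve gap as the overall difference between the two runs.
total_gap = time_gap("SNESSolve")

# Log events can nest (for example, PCApply time includes smoother work),
# so the percentages below are not expected to sum to 100.
for event in knl:
    gap = time_gap(event)
    share = 100.0 * gap / total_gap
    print(f"{event:18s} KNL/Haswell = {time_ratio(event):4.2f}  "
          f"gap = {gap:7.2f} s  ({share:5.1f}% of SNESSolve gap)")
```

With these inputs, the SNESJacobianEval gap comes out at about 87 s of a roughly 178 s SNESSolve gap, and the KNL/Haswell time ratios for MatSOR, MatMult and PCApply are about 1.08, 1.21 and 1.15.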
