Hi Hong,

So the speedup was coming from increased DRAM bandwidth and not from the use of MCDRAM.

There is moderate MPI imbalance, a large amount of back-end stalls, and good vectorization. I'm attaching my submit script, the PETSc log file, and the Intel APS summary (all as non-HTML text). I can provide a more detailed analysis via Intel VTune if needed.

Thank you,
Sajid Ali
Applied Physics
Northwestern University

Attachments: submit_script, intel_aps_report, knl_petsc