Mr Hong Zhang has found that removing the manual unrolling from MatMult_SeqAIJ_Inode() (at least with inode size 2) gives a good bump in performance on KNL. He pointed me to the Intel gospel https://software.intel.com/en-us/articles/avoid-manual-loop-unrolling, which we've always ignored in the past. It would be good to try the unrolled and non-unrolled versions on Xeon as well.
We've never done a good job of managing our unrolling (where, how, and when we do it), including the macros for unrolling such as PetscSparseDensePlusDot. Intel would say just throw it all away.

Barry