I'm probably screwing up some sort of history by jumping into dev, but this is a dev comment ...
> (1) -matptap_via hypre: This calls the hypre package to do the PtAP through
> an all-at-once triple product. In our experience, it is the most memory
> efficient, but could be slow.

FYI, I visited LLNL in about 1997 and told them how I did RAP. Simple 4 nested loops. They were very interested. Clearly they did it this way after I talked to them. This approach came up here a while back (e.g., we should offer this as an option).

Anecdotally, I don't see a noticeable difference in performance on my 3D elasticity problems between my old code (still used by the bone modeling people) and ex56 ... My kernel is an unrolled dense matrix triple product. I doubt Hypre did this. It ran at about 2x+ the flop rate of the mat-vec at scale on the SP3 in 2004.

Mark
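
For anyone following along who has not seen the idea, here is a rough sketch of what an "all-at-once" triple product with four nested loops can look like. It is illustrative only: the function name `ptap_all_at_once`, the list-of-dicts sparse rows, and the tiny test problem are assumptions made up for this sketch, not the actual kernel described above, hypre's implementation, or anything in PETSc. The point is that each nonzero A[i][j] is scattered straight into the coarse matrix, so the intermediate product A*P is never stored.

```python
def ptap_all_at_once(A, P, n_coarse):
    """Compute C = P^T A P without forming A*P.

    A and P are lists of sparse rows; each row is a dict {column: value}.
    (Hypothetical storage chosen for brevity, not a PETSc or hypre format.)
    Returns C as a dense n_coarse x n_coarse list of lists.
    """
    C = [[0.0] * n_coarse for _ in range(n_coarse)]
    for i, a_row in enumerate(A):              # loop 1: fine rows of A
        for j, a_ij in a_row.items():          # loop 2: nonzeros in row i of A
            for k, p_ik in P[i].items():       # loop 3: nonzeros in row i of P
                for l, p_jl in P[j].items():   # loop 4: nonzeros in row j of P
                    C[k][l] += p_ik * a_ij * p_jl
    return C


# Toy problem: 1D Laplacian on 4 fine points, aggregated pairwise to 2 coarse points.
A = [
    {0: 2.0, 1: -1.0},
    {0: -1.0, 1: 2.0, 2: -1.0},
    {1: -1.0, 2: 2.0, 3: -1.0},
    {2: -1.0, 3: 2.0},
]
P = [{0: 1.0}, {0: 1.0}, {1: 1.0}, {1: 1.0}]

C = ptap_all_at_once(A, P, 2)
```

The trade-off in this form is that memory stays at the size of the coarse matrix, while the innermost update is a scattered write, which is one plausible reason an all-at-once product can be memory efficient but slow.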