>   Let's stop arguing about whether MPI is or is not a good base for the
> next generation of HPC software but instead start a new conversation on
> what API (implemented on top of or not on top of MPI/pthreads etc etc) we
> want to build PETSc on to scale PETSc up to millions of cores with large
> NUMA nodes and GPU like accelerators.
>    What do you want in the API?

Let's start with the "lowest" level, or at least the smallest. I think the
only sane way to program for portable performance here
is using CUDA-type vectorization. This SIMT style is explained well here
I think this is much easier and more portable than the intrinsics for
Intel, and more performant and less error-prone than threads.
I think you can show that it will accomplish anything we want to do. OpenCL
seems to have capitulated on this point. Do we agree?
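
To make the SIMT style concrete, here is a minimal sketch of what I mean, written as plain C++ so it runs on a host without a GPU. The kernel is written from the point of view of one thread that handles one vector entry, and a launcher supplies the thread indices. The names ThreadId, axpy_kernel, and launch_axpy are made up for illustration and are not from PETSc or CUDA. The nested loops in the launcher only emulate what a real SIMT runtime would do by mapping threads onto hardware lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's blockIdx / blockDim / threadIdx.
struct ThreadId {
    std::size_t block;      // index of this thread's block
    std::size_t block_dim;  // number of threads per block
    std::size_t thread;     // index of this thread within its block
};

// "Kernel": scalar code for ONE thread updating ONE entry, y[i] += alpha*x[i].
// No intrinsics and no explicit vector width appear here.
inline void axpy_kernel(ThreadId id, std::size_t n, double alpha,
                        const double* x, double* y) {
    const std::size_t i = id.block * id.block_dim + id.thread;  // global index
    if (i < n) {  // guard: the last block may be only partly full
        y[i] += alpha * x[i];
    }
}

// Host-side emulation of a kernel launch. A real SIMT runtime would execute
// groups of these threads in lockstep on vector lanes; here we simply loop.
inline void launch_axpy(std::size_t block_dim, double alpha,
                        const std::vector<double>& x, std::vector<double>& y) {
    assert(block_dim > 0);
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const std::size_t blocks = (n + block_dim - 1) / block_dim;  // round up
    for (std::size_t b = 0; b < blocks; ++b) {
        for (std::size_t t = 0; t < block_dim; ++t) {
            axpy_kernel(ThreadId{b, block_dim, t}, n, alpha, x.data(), y.data());
        }
    }
}
```

The point of the sketch is that the kernel body never mentions the vector width: calling launch_axpy with a block size of 4 or of 32 leaves axpy_kernel unchanged, which is where I would expect the portability to come from.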


