>> Let's start with the "lowest" level, or at least the smallest. I think
>> the only sane way to program for portable performance here
>> is using CUDA-type vectorization. This SIMT style is explained well here
>> http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
>> I think this is much easier and more portable than the intrinsics for
>> Intel, and more performant and less error prone than threads.
>> I think you can show that it will accomplish anything we want to do.
>> OpenCL seems to have capitulated on this point. Do we agree
>> here?
> Moving from the other thread, I asked how far we could get with an API for
> high-level data movement combined with CUDA/OpenCL kernels. Matt wrote
> I think it will get you quite far, and the point for me will be
> how will the user describe a communication pattern, and how will we
> automate the generation of MPI
> from that specification. Sieve has an attempt to do this buried in it
> inspired by the "manifold" idea.
>
> Now that CUDA supports function pointers and similar, we can write real
> code in it. Whenever OpenCL gets around to supporting them, we'll be able
> to write real code for multicore and see how it performs. To unify the
> distributed and manycore aspects, we need some sort of hierarchical
> abstraction for NUMA and a communicator-like object to maintain scope.
> After applying a local-distribution filter, we might be able to express
> this using coloring plus the parallel primitives that I have been
> suggesting in the other thread.

One key operation which has not yet been discussed is the "push forward" of
a mapping, as Dmitry put it. Here is a scenario:
We understand a matching of mesh points between processes. In order to
construct a ghost communication (VecScatter), I
need to compose the mapping between mesh points and the mapping of mesh
points to data. I think this operation is generic
and important. For example, it turns a mesh point partition into a topology
distribution, or if you like, a row partition into a matrix
distribution. I think this might be the right operation to take any
partition to a data distribution algorithm.
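
To make the composition concrete, here is a rough sketch in plain Python.
It is only an illustration under assumptions: the function name
push_forward, the dictionary layouts, and the (offset, ndof) convention are
all hypothetical and are not Sieve's or PETSc's API. It also assumes the
data layout on the owning process is already known locally, which in
practice would itself require communication.

```python
def push_forward(point_match, local_section, remote_sections):
    """Compose a point matching with point-to-data layouts.

    point_match:     {local_point: (remote_rank, remote_point)}
    local_section:   {point: (offset, ndof)} on this process
    remote_sections: {rank: {point: (offset, ndof)}} on the owning processes
    Returns (local_index, remote_rank, remote_index) triples, i.e. the
    index pairs a ghost update would need.
    """
    pairs = []
    for lp, (rank, rp) in sorted(point_match.items()):
        loff, lndof = local_section[lp]
        roff, rndof = remote_sections[rank][rp]
        if lndof != rndof:
            raise ValueError(f"dof count mismatch on point {lp}")
        # Each matched point expands into one pair per unknown on it.
        for d in range(lndof):
            pairs.append((loff + d, rank, roff + d))
    return pairs


# Hypothetical data: local points 3 and 4 are ghosts of points 0 and 2
# owned by rank 1.
point_match = {3: (1, 0), 4: (1, 2)}
local_section = {3: (6, 2), 4: (8, 1)}
remote_sections = {1: {0: (0, 2), 2: (5, 1)}}

ghost_map = push_forward(point_match, local_section, remote_sections)
print(ghost_map)  # [(6, 1, 0), (7, 1, 1), (8, 1, 5)]
```

The point matching never mentions data; the layout never mentions
processes. Only the composition knows both, which is why the same routine
could in principle serve a row partition and a matrix layout just as well.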


> I'll think more on this and see if I can put together a concrete API
> proposal.

