Let's see if we can lift this discussion up another level and treat multi-core 
threading more specifically (though Karl's subject line is "Unification 
approach for OpenMP/Threads/...", he largely ignores the multi-core/ 
multi-socket aspect). 

    Abstractly a node has 

1)  a bunch of memories (some may be "nested", as caches "standing in" for 
parts of larger caches, which "stand in" for parts of "main memory").  In 
general, even without GPUs there are multiple memory sockets (though these are 
generally handled by the OS as a single unified address space), and

2) a bunch of compute "thingies". In general, even without GPUs there are 
multiple CPUs, and each of those likely has "regular" floating point units 
plus SIMD units.


A) Shri has started coding up a runtime dispatch system for computations on 
multiple cores (which hides the differences between PThreads and OpenMP) that 
(currently) assumes Vecs are stored in a single array: each thread accesses the 
array pointer via VecGetArray() and then "its" part of the array by an offset. 
(BTW: what if each of these VecGetArray() calls triggered a copy up from a 
GPU? Probably a mess.)  When using PThreads, Shri's model allows (to some 
degree) the asynchronous launching of computational tasks. 
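
For concreteness, here is a minimal sketch of that pattern (ThreadKernelCtx and 
VecScaleKernel_Thread are made-up names; trstarts is as in Shri's code):

  #include <petscvec.h>

  typedef struct {
    Vec            x;
    const PetscInt *trstarts;  /* trstarts[t] = first entry assigned to thread t */
    PetscInt       tid;
    PetscScalar    alpha;
  } ThreadKernelCtx;

  /* Scale "this thread's" slice of the vector in place */
  PetscErrorCode VecScaleKernel_Thread(ThreadKernelCtx *ctx)
  {
    PetscScalar    *a;
    PetscInt       i,start = ctx->trstarts[ctx->tid],end = ctx->trstarts[ctx->tid+1];
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = VecGetArray(ctx->x,&a);CHKERRQ(ierr); /* the copy-up-from-GPU worry applies here */
    for (i=start; i<end; i++) a[i] *= ctx->alpha;
    ierr = VecRestoreArray(ctx->x,&a);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }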

B) We have a different dispatch system for using a single GPU accelerator via 
CUDA that "automagically" handles copying data back and forth between memories 
via VecXXXGetArray().  It is synchronous in that it always blocks on the 
GetArray() until the data is there and only then moves on to the computation. 
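
In usage this looks roughly like (VecCUSPGetArrayReadWrite is quoted in full at 
the bottom of this message; I'm assuming the matching Restore call here):

  CUSPARRAY      *xgpu;
  PetscErrorCode ierr;

  ierr = VecCUSPGetArrayReadWrite(x,&xgpu);CHKERRQ(ierr); /* blocks in VecCUSPCopyToGPU() until x is on the GPU */
  /* ... launch the CUSP/CUDA kernel on xgpu ... */
  ierr = VecCUSPRestoreArrayReadWrite(x,&xgpu);CHKERRQ(ierr);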

C) We are considering options for using OpenCL kernels. 

D) We have not seriously considered utilizing both GPUs and CPU cores for 
floating point intensive computations at the same time, either on the "same" 
object computation or on completely different object computations.  (Note that 
DOE bought this huge machine at ORNL that seems to require this.)

  Ideally we'd have a "single" high-performing programming model for utilizing 
the resources of (1) and (2) regardless of the details.


   Now, let's go to Karl's "Part 1: Memory", which is a good place to start. 
In PETSc we basically have two data types: a Vec, which is relatively easy to 
reason about abstractly, and a Mat, which is not.  Let's focus just on the Vec 
for now, because Mats are hard. 

   We need to "divide up" the computation on a Vec (or several Vecs and Mats) 
so that the different compute "thingies" can work on their "piece", this 
division of the computation naturally is associated with a "division" of the 
data  (the division may actually be only abstract with pthreads or it may be 
concrete with two GPUs when "half" of the vector is copied to each GPU's memory 
(sorry Jed, I agree with Karl that we likely shouldn't hide this issue behind 
MPI)).  The "division" is non-overlapping in simple cases (like axpy()) or may 
require "ghosting" for  sparse matrix-vector products (again the division my 
only be abstract).  With multi-memory-socket multi-core we actually divide the 
vector data across physical memories but access it via virtual memory as not 
divided up for ghost points etc.  I think the "special cases" like virtual 
memory make it harder for us to think about this abstractly then it should be. 

   In PETSc we use the abstract object IS to indicate parts of Vecs (see the 
footnote below).  Thus if a computation requires part of a vector, it is 
natural to pass into the function the Vec AND THE IS indicating the part of 
the Vec that is needed.  Note that Shri's use of code such as 
i=trstarts[thread_id] is actually a particular type of IS (hardwired for 
performance). 
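
For example, the hardwired offsets could be expressed as a stride-1 IS using 
the existing ISCreateStride() (the wrapper function here is just a sketch):

  #include <petscis.h>

  /* Express "thread tid's" contiguous slice [trstarts[tid], trstarts[tid+1]) as an IS */
  PetscErrorCode ThreadSliceToIS(const PetscInt *trstarts,PetscInt tid,IS *slice)
  {
    PetscErrorCode ierr;

    PetscFunctionBegin;
    ierr = ISCreateStride(PETSC_COMM_SELF,trstarts[tid+1]-trstarts[tid],trstarts[tid],1,slice);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }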

   So, could we use a single kernel launcher for multi-core, CUDA, and OpenCL 
based on this principle?  Then VecCUDAGetArray()-type things would keep track 
of parts of Vecs based on IS, instead of all the entries in the Vec.  Similarly 
there would be a VecMultiCoreGetArray().  Whenever possible the 
VecXXXGetArray() would not require copies.  As part of this model I'd also 
like to separate the "moving the needed data" part of the kernel from the 
"computation on the data" part, so that everything doesn't block while data is 
being moved around. 
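
A rough sketch of what such an interface might look like (every name below is 
hypothetical; none of this exists in PETSc today):

  /* Get/restore only the entries of v indicated by "part", copying
     (asynchronously where possible) only the entries that are stale */
  PetscErrorCode VecCUDAGetSubArray(Vec v,IS part,PetscScalar **a);
  PetscErrorCode VecCUDARestoreSubArray(Vec v,IS part,PetscScalar **a);
  PetscErrorCode VecMultiCoreGetSubArray(Vec v,IS part,PetscScalar **a);

  typedef PetscErrorCode (*PetscKernelFn)(void *ctx);

  /* Single launcher: first start the data motion for each (Vec,IS) pair,
     then run the kernel when its data has arrived, so that other work
     need not block while data is in flight */
  PetscErrorCode PetscKernelLaunch(PetscInt nv,Vec v[],IS part[],PetscKernelFn kernel,void *ctx);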

   Ok, how about moving this same model up to the MPI level?  We already do 
this with an IS converted to a VecScatter (for performance) for updating ghost 
points (for matrix-vector products, for PDE ghost points, etc.).  (Note we can 
hide the VecScatter inside the IS and have it created as needed.) 
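
That existing pattern, roughly (VecScatterCreate/Begin/End are the real API; 
the lazy-creation wrapper is just a sketch of "created as needed"):

  /* Update the local (ghosted) form of a vector from the global vector */
  PetscErrorCode UpdateGhosts(Vec global,IS ghosts,Vec local,VecScatter *ctx)
  {
    PetscErrorCode ierr;

    PetscFunctionBegin;
    if (!*ctx) {ierr = VecScatterCreate(global,ghosts,local,PETSC_NULL,ctx);CHKERRQ(ierr);}
    ierr = VecScatterBegin(*ctx,global,local,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
    /* ... computation that needs no ghost values could overlap here ... */
    ierr = VecScatterEnd(*ctx,global,local,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }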

   Note I intend this to continue the conversation, not end it. Thoughts?

  Barry

Footnote: Except when some people forget and build other unneeded, complicated 
constructs that reproduce the functionality of IS. 




On Oct 6, 2012, at 9:09 AM, Jed Brown <jedbrown at mcs.anl.gov> wrote:

> On Sat, Oct 6, 2012 at 8:51 AM, Karl Rupp <rupp at mcs.anl.gov> wrote:
> Hmm, I thought that spptr is some 'special pointer' as commented in Mat, but 
> not supposed to be a generic pointer to a derived class's data structure 
> (spptr is only injected with #define PETSC_HAVE_CUSP).
> 
> Look at cuspvecimpl.h, for example.
> 
> #undef __FUNCT__
> #define __FUNCT__ "VecCUSPGetArrayReadWrite"
> PETSC_STATIC_INLINE PetscErrorCode VecCUSPGetArrayReadWrite(Vec v, 
> CUSPARRAY** a)
> {
>   PetscErrorCode ierr;
> 
>   PetscFunctionBegin;
>   *a   = 0;
>   ierr = VecCUSPCopyToGPU(v);CHKERRQ(ierr);
>   *a   = ((Vec_CUSP *)v->spptr)->GPUarray;
>   PetscFunctionReturn(0);
> }
> 
> Vec is following the convention from Mat where spptr points to the 
> Mat_UMFPACK, Mat_SuperLU, Mat_SeqBSTRM, etc, which hold the extra information 
> needed for that derived class (note that "derivation" for Mat is typically 
> done at run time, after the object has been created, due to a call to 
> MatGetFactor).
>  
> 
> With different devices, we could simply have a valid flag for each
> device. When someone does VecDevice1GetArrayWrite(), the flag for all
> other devices is marked invalid. When VecDevice2GetArrayRead() is
> called, the implementation copies from any valid device to device2.
> Packing all those flags as bits in a single int is perhaps convenient,
> but not necessary.
> 
> I think that the most common way of handling GPUs will be an overlapping 
> decomposition of the host array, similar to how a vector is distributed via 
> MPI (locally owned, writeable, vs ghost values with read-only). Assigning the 
> full vector exclusively to just one device is more a single-GPU scenario 
> rather than a multi-GPU use case.
> 
> Okay, the matrix will have to partition itself. What is the advantage of 
> having a single CPU process addressing multiple GPUs? Why not use different 
> MPI processes? (We can have the MPI processes sharing a node create a subcomm 
> so they can decide which process is driving which device.)
>  
>  
> I think this stuff (which allows for segmenting the array on the device)
> can go in Vec_CUDA and Vec_OpenCL, basically just replacing the GPUarray
> member of Vec_CUSP. Why have a different PetscAcceleratorData struct?
> 
> If spptr is intended to be a generic pointer to data of the derived class, 
> then this is also a possibility. However, this would lead to
> Vec_CUDA, Vec_OpenCL, and Vec_CUDA_OpenCL, with the number of implementations 
> rapidly increasing as one may eventually add other frameworks. The 
> PetscAcceleratorData would essentially allow for a unification of Vec_CUDA, 
> Vec_OpenCL, and Vec_CUDA_OpenCL, avoiding code duplication problems.
> 
> How would the user decide which device they wanted computation to run on? 
> (Also, is OpenCL really the right name in an environment where there may be 
> multiple devices using OpenCL?) Currently, the type indicates where native 
> operations should "prefer" to compute, copying data there when necessary. The 
> Vec operations have different implementations for CUDA and OpenCL so I don't 
> see the problem with making them different derived classes. If we wanted a 
> hybrid CUDA/OpenCL class, it would contain the logic for deciding where to do 
> things followed by dispatch into the device-specific implementation, thus it 
> doesn't seem like duplication to me.
>  
> 
> 
> 
> 
>     Here, the PetscXYZHandleDescriptor holds
>       - the memory handle,
>       - the device ID the handles are valid for, and
>       - a flag whether the data is valid
>         (cf. valid_GPU_array, but with a much finer granularity).
>     Additional metainformation such as index ranges can be extended as
>     needed, cf. Vec_Seq vs Vec_MPI. Different types
>     Petsc*HandleDescriptors are expected to be required because the
>     various memory handle types are not guaranteed to have a particular
>     maximum size among different accelerator platforms.
> 
> 
> It sounds like you want to support marking only part of an array as
> stale. We could keep one top-level (_p_Vec) flag indicating
> whether the CPU part was current, then in the specific implementation
> (Vec_OpenCL), you can hold finer granularity. Then when
> vec->ops->UpdateCPUArray() is called, you can look at the finer
> granularity flags to copy only what needs to be copied.
> 
> 
> Yes, I also thought of such a top-level-flag. This is, however, rather an 
> optimization flag (similar to what is done in VecGetArray for petscnative), 
> so I refrained from a separate discussion.
> 
> Aside from that, yes, I want to support parts of an array as stale, as the 
> best multi-GPU use I've experienced so far is for block-based preconditioners 
> (cf. Block-ILU-variants, parallel AMG flavors, etc.). A multi-GPU sparse 
> matrix-vector product is handled similarly.
> 
