Jean-Loup,

> I am currently trying to use HPX to offload computationally intensive
> tasks to remote GPU nodes. In idiomatic HPX, this would typically be done
> by invoking a remote action:
> 
>     OutputData compute(InputData input_data)
>     {
>         /* Asynchronously copy `input_data` to device using DMA */
>         /* Do work on GPU */
>         /* Copy back the results to host */
>         return results;
>     }
> 
>     HPX_PLAIN_ACTION(compute, compute_action);
> 
>     // In sender code
>     auto fut = hpx::async(compute_action(), remote_locality_with_gpu,
>                           std::move(input_data));
> 
> So far, so good.
> 
> However, an important requirement is that the memory allocated for the
> input data on the receiver end be pinned, to enable asynchronous copy
> between the host and the GPU. This can of course always be done by copying
> the argument `input_data` to pinned memory within the function body, but I
> would prefer to avoid any superfluous copy in order to minimize the
> overhead.
> 
> Do you know if it is possible to control within HPX where the memory for
> the input data will be allocated (on the receiver end)? I tried to use
> the `pinned_allocator` from the Thrust library for the data members of
> `InputData`, and although it did its job as expected, it also requires
> allocating pinned memory on the sender side (for the construction of the
> object), as well as the presence of the Thrust library and the CUDA
> runtime on both machines. This led me to think that there should be a
> better way.
> 
> Ideally, I would be able to deserialize the incoming data directly into
> pinned memory. Do you know if there is a way to do this or something
> similar in HPX? If not, do you think it is possible to emulate such
> functionality by directly using the low-level constructs / internals of
> HPX? This is for a prototype, so it is okay to use unstable /
> undocumented code as long as it allows me to prove the feasibility of the
> approach.
> 
> I would greatly appreciate any input / suggestions on how to approach this
> issue. If anyone has experience using HPX with GPUs or on heterogeneous
> clusters, I would be very interested in hearing about it as well.

The solution for this all depends on the InputData type you're using. 

All arguments to actions are created by the (de-)serialization layer in HPX.
Normally they are first default-constructed and then 'filled in' by the
corresponding (de-)serialization function.

That means that in order to place the data into pinned memory, your
InputData type needs to know how to access that pinned memory and has to
perform the allocation itself before the actual data is deserialized.
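
Just to illustrate that mechanism: a hand-rolled InputData could allocate
its pinned storage during deserialization via HPX's Boost-style save/load
split. This is only a sketch: pinned_alloc is a hypothetical helper (e.g.
wrapping cudaHostAlloc), ownership/cleanup is omitted, and the header and
macro names may differ slightly between HPX versions:

    #include <hpx/runtime/serialization/serialize.hpp>
    #include <hpx/runtime/serialization/split_member.hpp>
    #include <cstddef>

    // hypothetical helper returning a pointer to pinned memory,
    // e.g. implemented on top of cudaHostAlloc
    double* pinned_alloc(std::size_t count);

    struct InputData
    {
        double* data = nullptr;     // lives in pinned memory on the receiver
        std::size_t size = 0;

        template <typename Archive>
        void save(Archive& ar, unsigned) const
        {
            ar << size;
            for (std::size_t i = 0; i != size; ++i)
                ar << data[i];
        }

        template <typename Archive>
        void load(Archive& ar, unsigned)
        {
            ar >> size;
            data = pinned_alloc(size);      // allocate pinned storage first ...
            for (std::size_t i = 0; i != size; ++i)
                ar >> data[i];              // ... then deserialize straight into it
        }

        HPX_SERIALIZATION_SPLIT_MEMBER()
    };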

The easiest option would probably be to use HPX's serialize_buffer [1],
which can be customized with a C++ allocator. The other benefit of
serialize_buffer is that it gives you almost perfect zero-copy semantics,
avoiding internal data copies (it is 100% zero-copy on the sender side, but
for arcane reasons still requires one additional copy on the receiver). A
fully zero-copy solution would require John's RDMA object.

Using serialize_buffer should be straightforward if you have the liberty to
use it as your InputData type. You'd still need to write a C++ allocator
which allocates from your pinned memory (John has one buried inside his
RDMA object; it might be possible to extract it, but then again, writing
such an allocator is not too difficult; see the sketch below).
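
For reference, a minimal sketch of such an allocator written directly
against the CUDA runtime (cudaHostAlloc / cudaFreeHost) could look like
this. It is plain C++11 allocator boilerplate; depending on your HPX
version, serialize_buffer may expect a few more members, but this covers
the essentials:

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <new>

    template <typename T>
    struct pinned_memory_allocator
    {
        using value_type = T;

        pinned_memory_allocator() = default;
        template <typename U>
        pinned_memory_allocator(pinned_memory_allocator<U> const&) noexcept {}

        T* allocate(std::size_t n)
        {
            void* p = nullptr;
            // page-locked (pinned) host memory, usable for asynchronous DMA
            if (cudaHostAlloc(&p, n * sizeof(T), cudaHostAllocDefault) !=
                cudaSuccess)
            {
                throw std::bad_alloc();
            }
            return static_cast<T*>(p);
        }

        void deallocate(T* p, std::size_t) noexcept
        {
            cudaFreeHost(p);
        }
    };

    template <typename T, typename U>
    bool operator==(pinned_memory_allocator<T> const&,
        pinned_memory_allocator<U> const&) noexcept { return true; }
    template <typename T, typename U>
    bool operator!=(pinned_memory_allocator<T> const&,
        pinned_memory_allocator<U> const&) noexcept { return false; }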

So, assuming you have a pinned_memory_allocator like the one sketched
above, the code would look roughly like this:

    using allocator_t = pinned_memory_allocator<double>;
    using buffer_t = hpx::serialization::serialize_buffer<double, allocator_t>;

    OutputData compute(buffer_t && input_data)
    {
        /* Asynchronously copy `input_data` to device using DMA */
        async_transfer(input_data.data(), input_data.size());
        ...
        /* Do work on GPU */
        /* Copy back the results to host */
        return results;
    }
 
    HPX_PLAIN_ACTION(compute, compute_action);

    // In sender code

    // allocate array of 1000 doubles using 'alloc'
    allocator_t alloc;
    buffer_t buff(1000, alloc);

    // fill buff by accessing the underlying array using buff.data() 
    // and buff.size()
    // serialize_buffer also supports operator[]

    auto fut = hpx::async(compute_action(),
        remote_locality_with_gpu, std::move(buff));

This will allocate both the sender-side buff and the received input_data
using your pinned memory allocator (if invoked remotely), and will simply
turn into a pointer copy for local invocations.

Note that serialize_buffer behaves very much like a shared_ptr, i.e. it
exposes reference-counted, shallow-copy semantics.
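
For instance (reusing buffer_t and alloc from the snippet above), copying a
buffer only bumps the reference count instead of duplicating the underlying
array:

    buffer_t b1(1000, alloc);
    buffer_t b2 = b1;    // shallow copy: b1 and b2 share the same array
    b2[0] = 42.0;        // the write is visible through b1[0] as well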

HTH
Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

[1] https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/runtime/serialization/serialize_buffer.hpp

