Jean-Loup,

> I am currently trying to use HPX to offload computationally intensive
> tasks to remote GPU nodes. In idiomatic HPX, this would typically be done
> by invoking a remote action:
>
>     OutputData compute(InputData input_data)
>     {
>         /* Asynchronously copy `input_data` to device using DMA */
>         /* Do work on GPU */
>         /* Copy back the results to host */
>         return results;
>     }
>
>     HPX_PLAIN_ACTION(compute, compute_action);
>
>     // In sender code
>     auto fut = hpx::async(compute_action(), remote_locality_with_gpu,
>         std::move(input_data));
>
> So far, so good.
>
> However, an important requirement is that the memory allocated for the
> input data on the receiver end be pinned, to enable asynchronous copies
> between the host and the GPU. This can of course always be done by copying
> the argument `input_data` to pinned memory within the function body, but I
> would prefer to avoid any superfluous copy in order to minimize the
> overhead.
>
> Do you know if it is possible to control within HPX where the memory for
> the input data will be allocated (on the receiver end)? I tried to use
> the `pinned_allocator` from the Thrust library for the data members of
> `InputData`, and although it did its job as expected, it also requires
> allocating pinned memory on the sender side (for the construction of the
> object), as well as the presence of the Thrust library and the CUDA
> runtime on both machines. This led me to think that there should be a
> better way.
>
> Ideally, I would be able to directly deserialize the incoming data into
> pinned memory. Do you know if there is a way to do this or something
> similar in HPX? If not, do you think it is possible to emulate such
> functionality by directly using the low-level constructs / internals of
> HPX? This is for a prototype, so it is okay to use unstable /
> undocumented code as long as it allows me to prove the feasibility of the
> approach.
>
> I would greatly appreciate any input / suggestions on how to approach this
> issue. If anyone has experience using HPX with GPUs or on heterogeneous
> clusters, I would be very interested in hearing about it as well.
The solution for this all depends on the InputData type you're using. All
arguments to actions are created by the (de-)serialization layer in HPX.
Normally those are first default-constructed and then 'filled' by the
corresponding (de-)serialization function. That means that in order for you
to place the data into pinned memory, your InputData type needs to know how
to access your pinned memory and needs to perform the allocation itself
before the actual data is deserialized.

The easiest would probably be to use HPX's serialize_buffer [1], which can
be customized using a C++ allocator. The other benefit of serialize_buffer
is that it supports almost perfect zero-copy semantics, avoiding internal
data copies (it is 100% zero-copy on the sender side, but for arcane
reasons still requires one additional copy on the receiver). A 100%
zero-copy solution would require John's RDMA object.

Using serialize_buffer should be straightforward if you have the liberty to
use it as your InputData type. You'd still need to create a C++ allocator
which allocates memory from your pinned memory (John has that buried inside
his RDMA object; it might be possible to extract it, otoh writing an
allocator is not too difficult).

So, assuming you have a pinned_memory_allocator, the code would roughly
look like:

    using allocator_t = pinned_memory_allocator;
    using buffer_t =
        hpx::serialization::serialize_buffer<double, allocator_t>;

    OutputData compute(buffer_t&& input_data)
    {
        /* Asynchronously copy `input_data` to device using DMA */
        async_transfer(input_data.data(), input_data.size());
        ...
        /* Do work on GPU */
        /* Copy back the results to host */
        return results;
    }

    HPX_PLAIN_ACTION(compute, compute_action);

    // In sender code

    // allocate array of 1000 doubles using 'alloc'
    allocator_t alloc;
    buffer_t buff(1000, alloc);

    // fill buff by accessing the underlying array using buff.data()
    // and buff.size()
    // serialize_buffer also supports operator[]

    auto fut = hpx::async(compute_action(), remote_locality_with_gpu,
        std::move(buff));

This will allocate both the sender-side buff and the received input_data
using your pinned-memory allocator (if invoked remotely), and will simply
turn into a pointer copy for local invocations. Note that serialize_buffer
behaves very similarly to a shared_ptr, i.e. it exposes reference-counted,
shallow-copy semantics.

HTH
Regards Hartmut

---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

[1] https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/runtime/serialization/serialize_buffer.hpp

_______________________________________________
hpx-users mailing list
hpx-users@stellar.cct.lsu.edu
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users