That is exactly the issue. Part of the reason I have argued against MPI_SHORT_INT usage in RMA is that, even though the type is padded for alignment, we are still not allowed to operate on the bits between the short and the int. We could correct that one in the standard by adding the same language as C (padding bits are undefined), but when a user gives us their own datatype we have no options.
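To make the gap concrete, here is a minimal sketch (assuming a typical ABI where short is 2 bytes and int requires 4-byte alignment; the struct below mirrors MPI_SHORT_INT's typemap):

// Minimal illustration of the gap inside an MPI_SHORT_INT-shaped pair.
// Assumes a typical ABI: short is 2 bytes, int is 4 bytes with 4-byte alignment.
#include <cstddef>
#include <cstdio>

struct short_int {
    short val;  // offset 0, 2 bytes
    int   loc;  // offset 4 on most ABIs; bytes 2-3 are padding
};

int main() {
    std::printf("short at %zu, int at %zu, sizeof = %zu\n",
                offsetof(short_int, val), offsetof(short_int, loc),
                sizeof(short_int));
    // Typical output: "short at 0, int at 4, sizeof = 8". Bytes 2-3 are
    // the padding bits an MPI implementation is not allowed to touch.
    return 0;
}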
Yes, the best usage for the user is to keep the transfer completely contiguous. osc/rdma will break it down otherwise, and with tcp that will be really horrible since each request becomes essentially a BTL active message.

-Nathan
On Mar 30, 2023, at 1:19 PM, Joseph Schuchart via users <users@lists.open-mpi.org> wrote:

Hi Antoine,

That's an interesting result. I believe the problem with datatypes with gaps is that MPI is not allowed to touch the gaps. My guess is that for the RMA version of the benchmark the implementation either has to fall back to an active message that packs the data at the target and sends it back, or (which seems more likely in your case) transfer each object separately and skip the gaps. Without more information on your setup (are you using UCX?) and the benchmark itself (how many elements? what does the target do?) it's hard to be more precise.

A possible fix would be to drop the MPI datatype for the RMA use and transfer the vector as a whole, using MPI_BYTE. I think there is also a way to modify the upper bound of the MPI type to remove the gap, using MPI_TYPE_CREATE_RESIZED. I expect that this would allow MPI to touch the gap and transfer the vector as a whole. I'm not sure about the details there; maybe someone can shed some light.
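For the MPI_BYTE route, an untested sketch of what that could look like (the helper and the window name are illustrative; it assumes the window exposes the target's vector and that all ranks share one architecture, since the padding bytes are copied verbatim):

#include <mpi.h>
#include <vector>

struct Elem { double d; int i; };  // 12 bytes of data + 4 bytes of padding

// Fetch `count` elements from `target`'s window as raw bytes, bypassing the
// gap-aware datatype path entirely. Untested sketch: assumes `win` was created
// over the target's vector and that the system is homogeneous, because the
// padding bytes are transferred verbatim along with the data.
void get_raw(std::vector<Elem>& buf, int count, int target, MPI_Win win) {
    buf.resize(count);
    const int nbytes = count * static_cast<int>(sizeof(Elem));
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(buf.data(), nbytes, MPI_BYTE,
            target, 0 /* target_disp */, nbytes, MPI_BYTE, win);
    MPI_Win_unlock(target, win);
}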
HTH
Joseph

On 3/30/23 18:34, Antoine Motte via users wrote:

Hello everyone,

I recently had to code an MPI application where I send std::vector contents in a distributed environment. In order to try different approaches I coded both 1-sided and 2-sided communication schemes: the first one uses MPI_Win and MPI_Get, the second one uses MPI_Sendrecv.

I had a hard time figuring out why my implementation with MPI_Get was between 10 and 100 times slower, and I finally found out that MPI_Get is abnormally slow when one tries to send custom datatypes that include padding.

Attached is a short example where I send a struct {double, int} (12 bytes of data + 4 bytes of padding) vs a struct {double, int, int} (16 bytes of data, 0 bytes of padding) with both MPI_Sendrecv and MPI_Get. I got these results:

mpirun -np 4 ./compareGetWithSendRecv
{double, int} SendRecv : 0.0303547 s
{double, int} Get : 1.9196 s
{double, int, int} SendRecv : 0.0164659 s
{double, int, int} Get : 0.0147757 s

I ran it with both Open MPI 4.1.2 and Intel MPI 2021.6 and got the same results.

Is this result normal? Do I have any solution other than adding garbage at the end of the struct or at the end of the MPI_Datatype to avoid padding?

Regards,
Antoine Motte
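For reference, a minimal sketch of how such a padded struct type is typically built with MPI_Type_create_struct (illustrative only, not the attached benchmark; the resize sets the extent to sizeof(Elem) so consecutive array elements line up, while the 4 padding bytes remain a gap MPI may not touch):

#include <mpi.h>
#include <cstddef>

struct Elem { double d; int i; };  // 12 bytes of data + 4 bytes of padding

// Build an MPI datatype matching Elem. The resized extent covers the whole
// struct so arrays of Elem are addressed correctly, but the padding stays
// outside the typemap, which is what forces gap-aware transfers.
MPI_Datatype make_elem_type() {
    int blocklens[2] = {1, 1};
    MPI_Aint displs[2] = {static_cast<MPI_Aint>(offsetof(Elem, d)),
                          static_cast<MPI_Aint>(offsetof(Elem, i))};
    MPI_Datatype types[2] = {MPI_DOUBLE, MPI_INT};
    MPI_Datatype tmp, elem;
    MPI_Type_create_struct(2, blocklens, displs, types, &tmp);
    MPI_Type_create_resized(tmp, 0, sizeof(Elem), &elem);
    MPI_Type_free(&tmp);
    MPI_Type_commit(&elem);
    return elem;
}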
