Indeed there are many potential solutions, but they all require too much
intervention in the code to be generic enough. As we discussed
privately in the middle of last year, the "flattened datatype" approach
seems to me to be the most profitable. It is simple to implement and it
is also generic: a single change will make all pipelined collectives
work (not only tuned but all the others as well).

The idea is to use a flattened datatype instead of the one provided by the
MPI application. The flattened datatype will have the same type map as the
original data, but expressed in a single level. Since the MPI standard
requires every collective to use a datatype*count combination with the same
type signature on all ranks, this flattened datatype will give all the peers
in a collective a consistent view of the operations to be done, and, as a
result, let them use the same sane pipelining boundaries.
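
A minimal, hand-built sketch of what such a flattening looks like (the block
list is hard-coded here for a toy two-level type; the real implementation
would of course derive it from the ompi_datatype_t description):

/* Illustration only: flatten a two-level datatype by hand (assumes 4-byte int).
 * The real code would walk the ompi_datatype_t description instead of
 * hard-coding the block list. */
#include <mpi.h>

static void build_flattened_example(MPI_Datatype *nested, MPI_Datatype *flattened)
{
    /* Two-level type: a contiguous pair of vectors (2 blocks of 2 MPI_INT,
     * stride 3 ints, so the inner extent is 20 bytes). */
    MPI_Datatype inner;
    MPI_Type_vector(2, 2, 3, MPI_INT, &inner);
    MPI_Type_contiguous(2, inner, nested);
    MPI_Type_commit(nested);

    /* Same type map, but expressed in a single level: 4 blocks of 2 MPI_INT
     * at byte displacements 0, 12, 20 and 32. */
    int      blocklens[4] = { 2, 2, 2, 2 };
    MPI_Aint displs[4]    = { 0, 12, 20, 32 };
    MPI_Type_create_hindexed(4, blocklens, displs, MPI_INT, flattened);
    MPI_Type_commit(flattened);

    MPI_Type_free(&inner);
}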

  George.

On Thu, Apr 17, 2014 at 5:02 AM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> Dear OpenMPI developers,
>
> I just created #4531 in order to track this issue:
> https://svn.open-mpi.org/trac/ompi/ticket/4531
>
> Basically, the coll/tuned implementation of MPI_Bcast does not work when
> two tasks use datatypes of different sizes.
> For example, if the root sends two large vectors of MPI_INT and the
> non-root tasks receive many MPI_INT, then MPI_Bcast will crash.
> But if the root sends many MPI_INT and the non-root tasks receive two
> large vectors of MPI_INT, then MPI_Bcast will silently fail.
> (the Trac ticket has test cases attached)
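>
> To make that concrete, here is a minimal sketch of the failing pattern
> (illustrative sizes, not the attached test case):
>
> /* Root broadcasts 2 large contiguous vectors of MPI_INT, the other ranks
>  * receive the same data as 2*n plain MPI_INT: same type signature,
>  * but different datatype/count on each side. */
> #include <mpi.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     int rank, n = 1 << 20;                 /* size is illustrative */
>     int *buf;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     buf = calloc(2 * (size_t)n, sizeof(int));
>
>     if (rank == 0) {
>         MPI_Datatype vec;
>         MPI_Type_contiguous(n, MPI_INT, &vec);
>         MPI_Type_commit(&vec);
>         MPI_Bcast(buf, 2, vec, 0, MPI_COMM_WORLD);
>         MPI_Type_free(&vec);
>     } else {
>         MPI_Bcast(buf, 2 * n, MPI_INT, 0, MPI_COMM_WORLD);
>     }
>
>     free(buf);
>     MPI_Finalize();
>     return 0;
> }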
>
> I believe this kind of issue could occur in all/most collectives of the
> coll/tuned module, so it is not limited to MPI_Bcast.
>
>
> I am wondering what the best way to solve this would be.
>
> One solution I can think of would be to generate temporary datatypes in
> order to send messages whose size is exactly the segment_size.
>
> Another solution I can think of would be to add new send/recv functions.
> If we consider the send function:
> int mca_pml_ob1_send(void *buf,
>                      size_t count,
>                      ompi_datatype_t * datatype,
>                      int dst,
>                      int tag,
>                      mca_pml_base_send_mode_t sendmode,
>                      ompi_communicator_t * comm)
>
> we could imagine an xsend function:
> int mca_pml_ob1_xsend(void *buf,
>                       size_t count,
>                       ompi_datatype_t * datatype,
>                       size_t offset,
>                       size_t size,
>                       int dst,
>                       int tag,
>                       mca_pml_base_send_mode_t sendmode,
>                       ompi_communicator_t * comm)
>
> where offset is the number of bytes that should be skipped from the
> beginning of buf, and size is the (maximum) number of bytes to be sent
> (i.e. the message will be "truncated" to size bytes if
> (count*size(datatype) - offset) > size).
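>
> With such a function, the pipelined collectives could segment by raw bytes,
> independently of the datatype layout used on each peer. A purely
> hypothetical sketch (xsend does not exist today):
>
> /* Hypothetical pipelining loop on top of the proposed xsend:
>  * segment the message by raw bytes, whatever the datatype layout is. */
> static void pipelined_send_sketch(void *buf, size_t count,
>                                   ompi_datatype_t *datatype, int dst, int tag,
>                                   size_t segment_size, ompi_communicator_t *comm)
> {
>     size_t total, offset;
>
>     ompi_datatype_type_size(datatype, &total);   /* bytes per datatype */
>     total *= count;                              /* total payload in bytes */
>
>     for (offset = 0; offset < total; offset += segment_size) {
>         size_t len = (total - offset < segment_size) ? total - offset
>                                                      : segment_size;
>         mca_pml_ob1_xsend(buf, count, datatype, offset, len,
>                           dst, tag, MCA_PML_BASE_SEND_STANDARD, comm);
>     }
> }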
>
> Or we could use an intermediate buffer if needed, and send/recv with the
> MPI_PACKED datatype (this is less efficient, and would it even work on
> heterogeneous nodes?).
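>
> A rough sketch of the sender side (segment_size, dst and tag assumed to come
> from the collective; whether splitting a packed buffer like this is legal on
> heterogeneous nodes is exactly the open question):
>
> /* Pack the whole user buffer once, then pipeline plain MPI_PACKED segments. */
> #include <mpi.h>
> #include <stdlib.h>
>
> static void send_packed_segments(void *buf, int count, MPI_Datatype dtype,
>                                  int dst, int tag, int segment_size,
>                                  MPI_Comm comm)
> {
>     int packed_size = 0, pos = 0;
>     char *scratch;
>
>     MPI_Pack_size(count, dtype, comm, &packed_size);
>     scratch = malloc(packed_size);
>     MPI_Pack(buf, count, dtype, scratch, packed_size, &pos, comm);
>
>     for (int off = 0; off < pos; off += segment_size) {
>         int len = (pos - off < segment_size) ? pos - off : segment_size;
>         MPI_Send(scratch + off, len, MPI_PACKED, dst, tag, comm);
>     }
>     free(scratch);
> }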
>
> Or we could simply consider this a limitation of coll/tuned (coll/basic
> works fine) and do nothing.
>
> Or something else I did not think of...
>
>
> Thanks in advance for your feedback,
>
> Gilles
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/04/14556.php
