Tom,

This makes perfect sense. However, the fact that one of the network devices (a BTL, in Open MPI terms) is not available at runtime should not change the behavior of the application. At least that is the theory :) Changing from named receives to unnamed ones definitely modifies the signature (i.e., the communication pattern) of the application, and in many cases can introduce mismatches if the same tag is used. However, with osu_latency only two ranks are involved in the communication (ranks 0 and 1), so the communication pattern should stay the same whether you use MPI_ANY_SOURCE or not, since the MPI standard enforces message ordering between any pair of peers.
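
To make the ordering argument concrete, here is a minimal sketch of the two-rank ping-pong pattern in question, loosely modeled on osu_latency (LOOP and MSG_SIZE are placeholders of mine, not the benchmark's actual parameters):

/* Two-rank ping-pong on a single tag. With only ranks 0 and 1 talking
 * to each other, MPI's pairwise ordering guarantee means that the
 * MPI_ANY_SOURCE receive below must match exactly the same message a
 * named receive from the peer would have matched. */
#include <mpi.h>

#define LOOP     100
#define MSG_SIZE 4

int main(int argc, char *argv[])
{
    char s_buf[MSG_SIZE] = {0}, r_buf[MSG_SIZE];
    int rank, i;
    MPI_Status reqstat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < LOOP; i++) {
        if (0 == rank) {
            MPI_Send(s_buf, MSG_SIZE, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
            /* Wildcard source: still matches rank 1's reply, in order. */
            MPI_Recv(r_buf, MSG_SIZE, MPI_CHAR, MPI_ANY_SOURCE, 1,
                     MPI_COMM_WORLD, &reqstat);
        } else if (1 == rank) {
            MPI_Recv(r_buf, MSG_SIZE, MPI_CHAR, MPI_ANY_SOURCE, 1,
                     MPI_COMM_WORLD, &reqstat);
            MPI_Send(s_buf, MSG_SIZE, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}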

Now, let me explain a little of the internal black magic behind Open MPI. When we discover that a BTL is overloaded, we reroute new messages into a local "pending" queue until some space on the device becomes available. Once we start queueing messages we still have to enforce the MPI logical ordering, so all subsequent messages also go into the "pending" queue until the device is able to send data again, and then the messages are delivered in order to their respective destinations.

What might happen, and this is only speculation at this point, is that somehow a message bypasses this "pending" queue and goes onto the wire too early. Because this message carries the same tag, Open MPI might match it when it arrives at the destination, and that can generate a TRUNCATE error if the message belongs to the next iteration of the osu_latency loop. As you can see, there are many ifs in the previous paragraph, so for now let's treat this as pure speculation. Please upgrade to the latest version of Open MPI, and if you encounter the same problem then we will try to dig a little deeper into this "speculation".
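
To illustrate the book-keeping idea described above (this is NOT Open MPI's actual code, just a toy sketch of the ordering constraint: once one send has been queued because the device is full, every later send must queue behind it, or ordering is lost):

#include <stdbool.h>
#include <stdio.h>

typedef struct msg { int seq; struct msg *next; } msg_t;
typedef struct { msg_t *head, *tail; } queue_t;

static int device_credits = 2;            /* pretend the BTL has 2 slots */

static bool device_try_send(msg_t *m)     /* true if the device took it  */
{
    if (device_credits == 0) return false;
    device_credits--;
    printf("sent message %d\n", m->seq);
    return true;
}

static bool queue_empty(queue_t *q) { return q->head == NULL; }

static void queue_push(queue_t *q, msg_t *m)
{
    m->next = NULL;
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
}

/* Every new outgoing message goes through here. The key point: if
 * anything is already pending, this message lines up behind it instead
 * of going straight to the device and overtaking older messages. */
static void send_msg(queue_t *pending, msg_t *m)
{
    if (!queue_empty(pending) || !device_try_send(m))
        queue_push(pending, m);
}

/* Called when the device frees space: drain the queue in FIFO order. */
static void device_progress(queue_t *pending, int freed)
{
    device_credits += freed;
    while (!queue_empty(pending) && device_try_send(pending->head)) {
        pending->head = pending->head->next;
        if (!pending->head) pending->tail = NULL;
    }
}

int main(void)
{
    queue_t pending = { NULL, NULL };
    msg_t m[4] = { {0,0}, {1,0}, {2,0}, {3,0} };
    int i;
    for (i = 0; i < 4; i++)
        send_msg(&pending, &m[i]);     /* 0 and 1 sent, 2 and 3 queued  */
    device_progress(&pending, 2);      /* space freed: 2, 3 go in order */
    return 0;
}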

  Thanks,
    george.

On Aug 19, 2008, at 12:36 AM, Tom Riddle wrote:

Thanks George, I will update and try the latest repo. However, I'd like to describe our use case a bit more to see if there is something improper in our development approach. Forgive me if this is repetitious...

We originally configured and built OpenMPI on a machine with Infinipath / PSM installed. Since we want a flexible software development environment across a number of machines (most of them without the Infinipath hardware), we run these same OpenMPI binaries from a shared user area. That means other developers' machines, which do not have Infinipath / PSM installed locally, simulate multi-machine communication by running in shared memory mode. But again, these OpenMPI binaries were configured with Infinipath support.

So we see the error when running in shared memory mode on machines that don't have Infinipath. Is there a way at runtime to force shared memory mode exclusively? We are wondering if designating MPI_ANY_SOURCE directs OpenMPI to look at every possible communication mode, which would probably cause conflicts if the PSM libs weren't present.
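
For example, is it something along these lines? (The parameter names below are just our guess at the right knob, not something we have verified:)

> mpirun --mca btl self,sm -np 2 ./osu_latency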

Hope this makes sense, Tom



Things were working without issue until we switched to the wildcard MPI_ANY_SOURCE on our receives, but the failure appears only on machines without Infinipath. I guess I'm wondering what the mechanism is in wildcard mode.

--- On Sun, 8/17/08, George Bosilca <bosi...@eecs.utk.edu> wrote:
From: George Bosilca <bosi...@eecs.utk.edu>
Subject: Re: [OMPI users] MPI_ERR_TRUNCATE with MPI_Recv without Infinipath
To: rarebit...@yahoo.com, "Open MPI Users" <us...@open-mpi.org>
Date: Sunday, August 17, 2008, 2:42 PM

Tom,

I made the same modification as you to osu_latency, and the resulting application runs to completion. I don't get any TRUNCATE error messages. I'm using the latest version of Open MPI (1.4a1r19313).

There was a bug that might be related to your problem, but our commit log shows it was fixed by commit 18830 on July 9.

   george.

On Aug 13, 2008, at 5:49 PM, Tom Riddle wrote:

> Hi,
>
> A bit more info wrt the question below. I have run other releases of
> OpenMPI and they seem to be fine. The reason I need to run the latest
> is that it fully supports valgrind.
>
> openmpi-1.2.4
> openmpi-1.3a1r18303
>
> TIA, Tom
>
> --- On Tue, 8/12/08, Tom Riddle <rarebit...@yahoo.com> wrote:
>
> Hi,
>
> I am getting a curious error on a simple communications test. I have
> altered the standard mvapich osu_latency test to accept receives from
> any source, and I get the following error:
>
> [d013.sc.net:15455] *** An error occurred in MPI_Recv
> [d013.sc.net:15455] *** on communicator MPI_COMM_WORLD
> [d013.sc.net:15455] *** MPI_ERR_TRUNCATE: message truncated
> [d013.sc.net:15455] *** MPI_ERRORS_ARE_FATAL (goodbye)
>
> the code change was...
>
>  MPI_Recv(r_buf, size, MPI_CHAR, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD,
> &reqstat);
>
> the command line I run was
>
> > mpirun -np 2 ./osu_latency
>
> Now I run this on two types of host machine configurations: one that
> has Infinipath HCAs installed and another that doesn't. I run both of
> these in shared memory mode, i.e., dual processes on the same node. I
> have verified that when I am on the host with Infinipath I am actually
> running the OpenMPI mpirun, not the MPI that comes with the HCA.
>
> I have built OpenMPI with psm support from a fairly recent svn pull
> and run the same bins on both host machines... The config was as
> follows:
> > $ ../configure --prefix=/opt/wkspace/openmpi-1.3 CC=gcc CXX=g++
> > --disable-mpi-f77 --enable-debug --enable-memchecker
> > --with-psm=/usr/include --with-valgrind=/opt/wkspace/valgrind-3.3.0/
> > mpirun --version
> mpirun (Open MPI) 1.4a1r18908
>
> The error presents itself only on the host that does not have
> Infinipath installed. I have combed through the mca args to see if
> there is a setting I am missing, but I cannot see anything obvious.
>
> Any input would be appreciated. Thanks. Tom


