Sorry for the delay in replying; the SC'18 show and then the US Thanksgiving 
holiday got in the way.  More below.



> On Nov 16, 2018, at 10:50 PM, Weicheng Xue <weic...@vt.edu> wrote:
> 
> Hi Jeff,
> 
>      Thank you very much for your reply! I am now using a cluster at my 
> university (https://www.arc.vt.edu/computing/newriver/). I cannot find any 
> info. about the use of Unified Communications X (or UCX) there so I would 
> guess the cluster does not use it (not exactly sure though).

You might want to try compiling UCX yourself (it's just a user-level library -- 
it can even be installed under your $HOME) and then try compiling Open MPI 
against it and using that.  Make sure to configure/compile UCX with CUDA 
support -- I believe you need a very recent version of UCX for that.

> Actually, I called MPI_Test functions at several places in my code where the 
> communication activity was supposed to finish, but communication did not 
> finish until the code finally called MPI_WAITALL.

You might want to try calling MPI_TEST in a loop many times, just to see what 
happens.

Specifically: in Open MPI (and probably in other MPI implementations), MPI_TEST 
dips into the MPI progression engine (essentially) once, whereas MPI_WAIT dips 
into the MPI progression engine as many times as necessary in order to complete 
the request(s).  So it's just a difference of looping.

How large is the message you're sending?

> I got to know this by using the Nvidia profiler (The profiling result showed 
> that the kernel on GPUs right after MPI_WAITALL only started after CPUs 
> finished communication. However, there is enough time for CPUs to finish this 
> task in the background before MPI_WAITALL).  If the communication overhead is 
> not hidden, then it does not make any sense to write the code in an 
> overlapping way. I am wondering whether the openmpi on the cluster was 
> compiled with asynchronous progression enabled, as "OMPI progress: no, ORTE 
> progress: yes" is obtained by using "ompi_info". I really do not know the 
> difference between "OMPI progress" and "ORTE progress" as I am not a CS guy.

I applaud your initiative to find that phrase in the ompi_info output!

However, don't get caught up in it -- that line of ompi_info output isn't 
really about the issue you're describing here, i.e., whether your non-blocking 
communication progresses in the background (for lack of a longer explanation).

> Also, I am wondering whether MVAPICH2 is worthwhile to be tried as it 
> provides an environment variable to control the progression of operation, 
> which is easier. I would greatly appreciate your help!

Sure, try MVAPICH2 -- that's kinda the strength of the MPI ecosystem (that 
there are multiple different MPI implementations to try).

-- 
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
