Once I read your comment, Ralph, about this being "orders of magnitude worse than anything we measure", I knew it had to be our problem.

We already had some debug code in place to measure when we send and when we receive over MPI. I turned this code on and ran with 12 slaves instead of 4.
Our debug showed that once an SP does a send, the message is received at the GP in less than 1 ms. I then took a closer look at when each SP was sending a message.
It turns out that the first 9 slaves send out messages at very regular intervals, but the last 3 slaves have 200 - 600 ms delays in sending out a message.
It could be that our SPs have a problem when many are running at once. It is also interesting to note that the first 9 slaves run on the same blade chassis as the GP, while the last 3 SPs run on our second blade chassis. I will experiment later with the placement of our SPs across chassis to see whether this is an important factor.
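One way to pin things down for that experiment would be an explicit hostfile, so ranks land on the chassis in a known order. A rough sketch with made-up host names (chassis1-bladeNN / chassis2-bladeNN), not our actual nodes:

# hostfile.txt - one slot per node, hosts listed in rank order (MP, GP, then SPs)
chassis1-blade01 slots=1    # MP
chassis1-blade02 slots=1    # GP
chassis1-blade03 slots=1    # SP 1
# ... remaining chassis1 blades carry SPs 2 through 9 ...
chassis2-blade01 slots=1    # SP 10
chassis2-blade02 slots=1    # SP 11
chassis2-blade03 slots=1    # SP 12

mpirun --hostfile hostfile.txt -np 14 ./our_app

With one slot per host and the hosts listed in rank order, the default by-slot mapping places one process per blade in that order, so I could force particular SPs onto the second chassis and see whether the sending delays follow the chassis or the rank.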

When I first reported this problem, I had only turned on debug in the receiving GP process. The latency I was seeing then was the difference between when I received a message from the 10th slave and when I received the last message from the 10th slave. The time we use for our debug comes from an MPI_Wtime call.
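In case it helps anyone reproduce this, here is a minimal C sketch of that kind of timestamping (the variable names and debug output format are illustrative, not our actual debug code):

/* SP side: timestamp immediately before handing the message to MPI */
double t_send = MPI_Wtime();
MPI_Send(buf, count, MPI_BYTE, gp_rank, TAG, MPI_COMM_WORLD);
fprintf(dbg, "job %d sent at %.6f\n", job_id, t_send);

/* GP side: timestamp when MPI_Test reports the message complete */
MPI_Test(&req, &flag, &status);
if (flag) {
    fprintf(dbg, "job %d received at %.6f\n", job_id, MPI_Wtime());
}

One caveat: MPI_Wtime is not guaranteed to be synchronized across nodes (the MPI_WTIME_IS_GLOBAL attribute says whether it is), so comparing a send timestamp on an SP against a receive timestamp on the GP assumes the node clocks are reasonably well aligned.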

Ralph, for my future reference, could you share how many processes were sending to a single process in your testing, and what size the messages were?

Hristo, thanks for your input; I had already spent a few days searching the FAQs and tuning guides before posting.

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Wednesday, October 03, 2012 4:01 PM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: unacceptable latency in gathering 
process

Hmmm...you probably can't without digging down into the diagnostics.

Perhaps we could help more if we had some idea how you are measuring this 
"latency". I ask because that is orders of magnitude worse than anything we 
measure - so I suspect the problem is in your app (i.e., that the time you are 
measuring is actually how long it takes you to get around to processing a 
message that was received some time ago).


On Oct 3, 2012, at 11:52 AM, "Hodge, Gary C" <gary.c.ho...@lmco.com> wrote:


How do I tell the difference between when the message was received and when the message was picked up in MPI_Test?

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, October 03, 2012 1:00 PM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] unacceptable latency in gathering process

Out of curiosity, have you logged the time when the SP called "send" and 
compared it to the time when the message was received, and when that message is 
picked up in MPI_Test? In other words, have you actually verified that the 
delay is in the MPI library as opposed to in your application?


On Oct 3, 2012, at 9:40 AM, "Hodge, Gary C" <gary.c.ho...@lmco.com> wrote:



Hi all,
I am running on an IBM BladeCenter, using Open MPI 1.4.1 and the opensm subnet manager for InfiniBand.

Our application has real-time requirements, and it has recently been proven that it does not scale to meet future requirements.
Presently, I am re-organizing the application to process work in a more parallel manner than it does now.

Jobs arrive at a rate of 200 per second and are sub-divided into groups of objects by a master process (MP) on its own node.
The MP then assigns the object groups to 20 slave processes (SP), each running on its own node, to do the expensive computational work in parallel.
The SPs then send their results to a gatherer process (GP) on its own node, which merges the results for the job and sends it onward for final processing.
The highest latency for the last 1024 jobs that were processed is then written to a log file that is displayed by a GUI.
Each process uses the same controller method for sending and receiving messages, as follows:

For (each CPU that sends us input)
{
    MPI_Irecv(...)
}

While (true)
{
    For (each CPU that sends us input)
    {
        MPI_Test(...)
        If (message was received)
        {
            Copy the message
            Queue the copy to our input queue
            MPI_Irecv(...)
        }
    }
    If (there are messages on our input queue)
    {
        ... process the FIRST message on queue (this may queue messages for output) ...

        For (each message on our output queue)
        {
            MPI_Send(...)
        }
    }
}
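For what it's worth, here is a more concrete C sketch of that controller loop (the buffer size, tag, and queue helper functions are placeholders, not our real code):

#include <mpi.h>
#include <stdlib.h>

#define MSG_SIZE 4096   /* placeholder maximum message size */
#define TAG      1      /* placeholder tag */

/* placeholder queue hooks standing in for the application's own queues */
void enqueue_input(const char *msg, int len);
int  have_queued_input(void);
void process_first_input(void);   /* may queue messages for output */
void send_outputs(void);          /* MPI_Send for each message on the output queue */

void controller(int num_senders, const int *sender_ranks)
{
    MPI_Request *reqs = malloc(num_senders * sizeof(MPI_Request));
    char *bufs = malloc((size_t)num_senders * MSG_SIZE);

    /* post one receive per upstream process */
    for (int i = 0; i < num_senders; i++)
        MPI_Irecv(bufs + (size_t)i * MSG_SIZE, MSG_SIZE, MPI_BYTE,
                  sender_ranks[i], TAG, MPI_COMM_WORLD, &reqs[i]);

    while (1) {
        /* poll every sender once */
        for (int i = 0; i < num_senders; i++) {
            int flag = 0;
            MPI_Status status;
            MPI_Test(&reqs[i], &flag, &status);
            if (flag) {
                int len;
                MPI_Get_count(&status, MPI_BYTE, &len);
                enqueue_input(bufs + (size_t)i * MSG_SIZE, len);   /* copy the message */
                /* immediately re-post the receive for this sender */
                MPI_Irecv(bufs + (size_t)i * MSG_SIZE, MSG_SIZE, MPI_BYTE,
                          sender_ranks[i], TAG, MPI_COMM_WORLD, &reqs[i]);
            }
        }
        /* process at most one queued job per pass, then drain our output queue */
        if (have_queued_input()) {
            process_first_input();
            send_outputs();
        }
    }
}

One thing the sketch makes explicit: each sender has exactly one outstanding MPI_Irecv, and an arrival is only noticed when the loop comes back around to its MPI_Test, so time spent processing a job or blocking in MPI_Send delays when the next message is actually picked up.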

My problem is that I do not meet our application's performance requirement for a job (~20 ms) until I reduce the number of SPs from 20 to 4 or fewer.
I added some debug into the GP and found that there are never more than 14 messages received in the for loop that calls MPI_Test.
The messages sent from the other 6 SPs eventually arrive at the GP in a long stream, after experiencing high latency (over 600 ms).

Going forward, we need to handle more objects per job and will need to have 
more than 4 SPs to keep up.
My thought is that I have to obey this 4 SPs to 1 GP ratio and create 
intermediate GPs to gather results from every 4 slaves.

Is this a contention problem at the GP?
Is there debugging or logging I can turn on in the MPI to prove that contention 
is occurring?
Can I configure MPI receive processing to improve upon the 4 to 1 ratio?
Can I improve the controller method (listed above) to gain a performance 
improvement?

Thanks for any suggestions.
Gary Hodge


_______________________________________________
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/users
