We don't support MPI_THREAD_MULTIPLE, I'm afraid, only MPI_THREAD_FUNNELED, so
you'll have to architect things so that each process performs all of its
MPI calls from within a single thread.
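
For reference, a minimal sketch of requesting MPI_THREAD_FUNNELED at startup
and checking the level the library actually provides (error handling trimmed
to the essentials):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* FUNNELED: the process may be multi-threaded, but only the thread
           that called MPI_Init_thread may make MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        if (provided < MPI_THREAD_FUNNELED) {
            fprintf(stderr, "MPI only provides thread level %d\n", provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* ... all MPI_Irecv/MPI_Test/MPI_Send calls stay on this thread ... */

        MPI_Finalize();
        return 0;
    }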


On Tue, Oct 9, 2012 at 6:10 AM, Hodge, Gary C <gary.c.ho...@lmco.com> wrote:

> FYI, I implemented the harvesting thread but found out quickly that my
> installation of Open MPI does not have MPI_THREAD_MULTIPLE support.
>
> My worker thread still does MPI_Send calls to move the data to the next
> process.
>
> So I am going to download 1.6.2 today, configure it with
> --enable-mpi-thread-multiple, and try again.
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> Sent: Thursday, October 04, 2012 8:10 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] EXTERNAL: Re: unacceptable latency in
> gathering process
>
> Sorry for the delayed response - been on the road all day.
>
> Usually we use the standard NetPIPE, IMB, and other benchmarks to measure
> latency. IIRC, these are all point-to-point measurements - i.e., they
> measure the latency for a single process sending to one other process
> (typically on the order of a couple of microseconds). The tests may have
> multiple processes running, but they don't have one process receiving
> messages from multiple senders.
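>
> For reference, these point-to-point benchmarks are usually run between just
> two ranks; for example (binary names and paths depend on how the benchmarks
> were built and installed):
>
>     mpirun -np 2 ./NPmpi                 # NetPIPE's MPI ping-pong
>     mpirun -np 2 ./IMB-MPI1 PingPong     # Intel MPI Benchmarks ping-pong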
>
> You will, of course, see increased delays in that scenario just due to
> cycle time - we give you a message, but cannot give you another one until
> you return from our delivery callback. So the longer you spend in the
> callback, the slower we go.
>
> In one use-case I recently helped with, we had a "harvesting" thread that
> simply reaped the messages from the MPI callback and stuffed them into a
> multi-threaded processing queue. This minimized the MPI "latency", but of
> course the overall throughput depended on the speed of the follow-on queue.
> In our case, we only had one process running on each node (like you), and had
> lots of cores on the node - so we cranked up the threads in the processing
> queue and rammed the data through the pipe.
>
> Your design looks similar, so you might benefit from a similar approach.
> Just don't try to have multiple MPI callbacks each sitting in a separate
> thread as thread support in MPI isn't good - better to have a single thread
> handling the MPI stuff, and then push it into a queue that multiple threads
> can access.
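>
> A minimal sketch of that pattern, assuming MPI_THREAD_FUNNELED, fixed-size
> byte messages, and senders sitting on ranks 1..N; the queue, buffer sizes,
> and helper names here are illustrative, not from the original application:
>
>     #include <mpi.h>
>     #include <pthread.h>
>     #include <stdlib.h>
>     #include <string.h>
>
>     enum { MAX_MSG = 65536, QDEPTH = 1024, NWORKERS = 8, MAX_SENDERS = 64 };
>
>     /* Tiny thread-safe queue of message pointers (assumes it never fills). */
>     static void *slots[QDEPTH];
>     static int qhead, qtail, qcount;
>     static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
>     static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;
>
>     static void queue_push(void *msg)
>     {
>         pthread_mutex_lock(&qlock);
>         slots[qtail] = msg;
>         qtail = (qtail + 1) % QDEPTH;
>         qcount++;
>         pthread_cond_signal(&qcond);
>         pthread_mutex_unlock(&qlock);
>     }
>
>     static void *queue_pop(void)
>     {
>         pthread_mutex_lock(&qlock);
>         while (qcount == 0)
>             pthread_cond_wait(&qcond, &qlock);
>         void *msg = slots[qhead];
>         qhead = (qhead + 1) % QDEPTH;
>         qcount--;
>         pthread_mutex_unlock(&qlock);
>         return msg;
>     }
>
>     /* Worker threads do the expensive processing and never touch MPI. */
>     static void *worker(void *arg)
>     {
>         (void)arg;
>         for (;;) {
>             void *msg = queue_pop();
>             /* ... expensive per-message work goes here ... */
>             free(msg);
>         }
>         return NULL;
>     }
>
>     /* The single MPI thread: reap messages and repost receives immediately.
>        Sketch assumption: senders are ranks 1..nsenders, nsenders <= MAX_SENDERS. */
>     static void harvest(int nsenders)
>     {
>         static char bufs[MAX_SENDERS][MAX_MSG];
>         MPI_Request reqs[MAX_SENDERS];
>
>         for (int i = 0; i < nsenders; i++)
>             MPI_Irecv(bufs[i], MAX_MSG, MPI_BYTE, i + 1, MPI_ANY_TAG,
>                       MPI_COMM_WORLD, &reqs[i]);
>
>         for (;;) {
>             for (int i = 0; i < nsenders; i++) {
>                 int done = 0, len;
>                 MPI_Status st;
>                 MPI_Test(&reqs[i], &done, &st);
>                 if (!done)
>                     continue;
>                 MPI_Get_count(&st, MPI_BYTE, &len);
>                 void *copy = malloc(len);
>                 memcpy(copy, bufs[i], len);
>                 queue_push(copy);            /* hand off; no processing here */
>                 MPI_Irecv(bufs[i], MAX_MSG, MPI_BYTE, i + 1, MPI_ANY_TAG,
>                           MPI_COMM_WORLD, &reqs[i]);
>             }
>         }
>     }
>
>     int main(int argc, char **argv)
>     {
>         int provided, size;
>         MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
>
>         pthread_t tids[NWORKERS];
>         for (int i = 0; i < NWORKERS; i++)
>             pthread_create(&tids[i], NULL, worker, NULL);
>
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>         harvest(size - 1);                   /* never returns in this sketch */
>
>         MPI_Finalize();
>         return 0;
>     }
>
> Getting results back out follows the same rule: workers push finished results
> onto a second queue, and the one MPI thread drains it with MPI_Send.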
>
> Anyway, glad that helped diagnose the issue.
> Ralph
>
>
> On Thu, Oct 4, 2012 at 6:55 AM, Hodge, Gary C <gary.c.ho...@lmco.com>
> wrote:
>
> Once I read your comment, Ralph, about this being “orders of magnitude
> worse than anything we measure”, I knew it had to be our problem.
>
> We already had some debug code in place to measure when we send and when
> we receive over MPI.  I turned this code on and ran with 12 slaves instead
> of 4.
>
> Our debug showed that once an SP does a send, it is received at the GP in
> less than 1 ms.  I then decided to take a close look at when each SP was
> sending a message.
>
> It turns out that the first 9 slaves send out messages at very regular
> intervals, but the last 3 slaves have 200-600 ms delays in sending out a
> message.
>
> It could be that our SPs have a problem when many are running at once.  It
> is also interesting to note that the first 9 slaves run on the same blade
> chassis as the GP and the last 3 SPs run on our second blade chassis.  I
> will later experiment with the placement of our SPs across chassis to see
> whether this is an important factor.
>
> When I first reported this problem, I had only turned on debug in the
> receiving GP process.  The latency I was seeing then was the difference
> between when I received a message from the 10th slave and when I received
> the last message from the 10th slave.  The time we use for our debug comes
> from an MPI_Wtime call.
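>
> As an aside, a minimal sketch of this kind of timestamping: the sender stamps
> the message with MPI_Wtime and the receiver compares against its own clock
> (two ranks only, names illustrative). Note that MPI_Wtime is not synchronized
> across nodes unless MPI_WTIME_IS_GLOBAL is set, so cross-node differences are
> only approximate:
>
>     #include <mpi.h>
>     #include <stdio.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank;
>         double stamp;
>
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         if (rank == 0) {                    /* "SP" side: stamp and send */
>             stamp = MPI_Wtime();
>             MPI_Send(&stamp, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
>         } else if (rank == 1) {             /* "GP" side: receive and compare */
>             MPI_Recv(&stamp, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
>                      MPI_STATUS_IGNORE);
>             printf("send-to-receive delta: %f s\n", MPI_Wtime() - stamp);
>         }
>
>         MPI_Finalize();
>         return 0;
>     }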
>
> Ralph, for my future reference, could you share how many processes were
> sending to a single process in your testing, and what the sizes of the
> messages sent were?
>
> Hristo, thanks for your input; I had already spent a few days searching
> the FAQs and tuning guides before posting.
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> Sent: Wednesday, October 03, 2012 4:01 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] EXTERNAL: Re: unacceptable latency in
> gathering process
>
> Hmmm...you probably can't without digging down into the diagnostics.
>
> Perhaps we could help more if we had some idea how you are measuring this
> "latency". I ask because that is orders of magnitude worse than anything we
> measure - so I suspect the problem is in your app (i.e., that the time you
> are measuring is actually how long it takes you to get around to processing
> a message that was received some time ago).
>
> On Oct 3, 2012, at 11:52 AM, "Hodge, Gary C" <gary.c.ho...@lmco.com>
> wrote:
>
> How do I tell the difference between when the message was received and
> when the message was picked up in MPI_Test?
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> Sent: Wednesday, October 03, 2012 1:00 PM
> To: Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] unacceptable latency in gathering
> process
>
> Out of curiosity, have you logged the time when the SP called "send" and
> compared it to the time when the message was received, and when that
> message is picked up in MPI_Test? In other words, have you actually
> verified that the delay is in the MPI library as opposed to in your
> application?
>
> On Oct 3, 2012, at 9:40 AM, "Hodge, Gary C" <gary.c.ho...@lmco.com> wrote:
>
> Hi all,
>
> I am running on an IBM BladeCenter, using Open MPI 1.4.1 and the opensm
> subnet manager for InfiniBand.
>
> Our application has real-time requirements, and it has recently been proven
> that it does not scale to meet future requirements.
>
> Presently, I am re-organizing the application to process work in a more
> parallel manner than it does now.
>
> Jobs arrive at the rate of 200 per second and are sub-divided into groups
> of objects by a master process (MP) on its own node.
>
> The MP then assigns the object groups to 20 slave processes (SP), each
> running on its own node, to do the expensive computational work in
> parallel.
>
> The SPs then send their results to a gatherer process (GP) on its own node,
> which merges the results for the job and sends the merged result onward for
> final processing.
>
> The highest latency for the last 1024 jobs that were processed is then
> written to a log file that is displayed by a GUI.
>
> Each process uses the same controller method for sending and receiving
> messages as follows:
>
> For (each CPU that sends us input)
> {
>     MPI_Irecv(...)
> }
>
> While (true)
> {
>     For (each CPU that sends us input)
>     {
>         MPI_Test(...)
>         If (message was received)
>         {
>             Copy the message
>             Queue the copy to our input queue
>             MPI_Irecv(...)
>         }
>     }
>
>     If (there are messages on our input queue)
>     {
>         ... process the FIRST message on the queue (this may queue
>             messages for output) ...
>
>         For (each message on our output queue)
>         {
>             MPI_Send(...)
>         }
>     }
> }
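>
> A possible C rendering of that controller, as a function-level sketch only
> (MPI_Init and the surrounding program are omitted), assuming fixed-size byte
> messages and using MPI_Testsome to poll all outstanding receives in a single
> call; process_msg() and sender_ranks are illustrative stand-ins for the
> application's own logic:
>
>     #include <mpi.h>
>     #include <string.h>
>
>     enum { MSG_MAX = 4096, MAX_SENDERS = 64 };
>
>     /* Application hook: consume one input message and optionally produce
>        one output message (returns nonzero if out/out_dest were filled). */
>     typedef int (*process_fn)(const char *in, int in_len,
>                               char *out, int *out_len, int *out_dest);
>
>     static void controller(int nsenders, const int *sender_ranks,
>                            process_fn process_msg)
>     {
>         static char bufs[MAX_SENDERS][MSG_MAX];
>         MPI_Request reqs[MAX_SENDERS];
>         int         indices[MAX_SENDERS];
>         MPI_Status  stats[MAX_SENDERS];
>
>         /* Pre-post one receive per sender, as in the pseudocode above. */
>         for (int i = 0; i < nsenders; i++)
>             MPI_Irecv(bufs[i], MSG_MAX, MPI_BYTE, sender_ranks[i],
>                       MPI_ANY_TAG, MPI_COMM_WORLD, &reqs[i]);
>
>         for (;;) {
>             int ndone = 0;
>             MPI_Testsome(nsenders, reqs, &ndone, indices, stats);
>
>             for (int k = 0; k < ndone; k++) {
>                 int i = indices[k], in_len;
>                 char copy[MSG_MAX], out[MSG_MAX];
>                 int  out_len = 0, out_dest = -1;
>
>                 MPI_Get_count(&stats[k], MPI_BYTE, &in_len);
>                 memcpy(copy, bufs[i], in_len);
>
>                 /* Repost before doing any work so the sender always has a
>                    matching receive waiting. */
>                 MPI_Irecv(bufs[i], MSG_MAX, MPI_BYTE, sender_ranks[i],
>                           MPI_ANY_TAG, MPI_COMM_WORLD, &reqs[i]);
>
>                 if (process_msg(copy, in_len, out, &out_len, &out_dest))
>                     MPI_Send(out, out_len, MPI_BYTE, out_dest, 0,
>                              MPI_COMM_WORLD);
>             }
>         }
>     }
>
> Note that MPI_Testsome checks every outstanding receive in one call, rather
> than testing them one at a time as in the loop above.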
>
> My problem is that I do not meet our application's performance requirements
> for a job (~20 ms) until I reduce the number of SPs from 20 to 4 or fewer.
>
> I added some debug into the GP and found that there are never more than 14
> messages received in the for loop that calls MPI_Test.
>
> The messages that were sent from the other 6 SPs will eventually arrive at
> the GP in a long stream after experiencing high latency (over 600 ms).
>
> Going forward, we need to handle more objects per job and will need to
> have more than 4 SPs to keep up.
>
> My thought is that I have to obey this 4-SPs-to-1-GP ratio and create
> intermediate GPs to gather results from every 4 slaves.
>
> Is this a contention problem at the GP?
>
> Is there debugging or logging I can turn on in the MPI library to prove that
> contention is occurring?
>
> Can I configure MPI receive processing to improve upon the 4-to-1 ratio?
>
> Can I improve the controller method (listed above) to gain a performance
> improvement?
>
> Thanks for any suggestions.
>
> Gary Hodge
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
