I don't think it would be very hard - I would have to create a patch for it, but the fix is completely contained in one file and location.

I would like to have someone else test it, though, before we move it across. It worked for me, but since it is a race condition, that isn't entirely convincing.


On Jun 9, 2009, at 5:41 AM, Jeff Squyres wrote:

I'd be in favor of bringing this to v1.3. Are there other dependencies / would it be difficult?


Begin forwarded message:

From: "Open MPI" <b...@open-mpi.org>
Date: June 8, 2009 11:31:20 AM PDT
Cc: <b...@osl.iu.edu>
Subject: Re: [Open MPI] #1927: v1.3 COMM_SPAWN loop test fails after ~120 spawns

#1927: v1.3 COMM_SPAWN loop test fails after ~120 spawns
------------------------+----------------------------------------------------
 Reporter:  jsquyres    |        Owner:  rhc
     Type:  defect      |       Status:  closed
 Priority:  critical    |    Milestone:  Open MPI 1.3.4
  Version:  1.3 branch  |   Resolution:  fixed
 Keywords:              |
------------------------+----------------------------------------------------
Changes (by rhc):

 * status:  new => closed
 * resolution:  => fixed


Comment:

This was due to a very tight loop on comm_spawn not giving enough time for
the prior proc to completely terminate (and thus free its file
descriptors) before the next proc was launched. Eventually, we built up a
backlog of terminations to process and ran out of fd's.

We introduced a check-and-delay: when the launch code detects that we
don't have enough fd's to launch another proc, it waits a second to see
if enough become free before aborting.
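
As a rough illustration of the check-and-delay idea described above, here
is a minimal sketch in C. It is not the actual Open MPI change: the function
name reserve_fds_with_delay, its parameters, and the use of pipe() as a
stand-in for "the fd's a launch needs" are all assumptions made for the
example. It retries only when the failure is fd exhaustion (EMFILE or
ENFILE) and gives up after a bounded number of waits.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper (not an Open MPI function): try to obtain the pair
 * of descriptors a child launch would need. If the attempt fails because
 * the process or system is out of fd's, wait delay_sec seconds and retry,
 * up to max_retries times, so that pending terminations have a chance to
 * release their descriptors.
 *
 * Returns 0 on success (fds[0], fds[1] are open), -1 on failure with
 * errno left as set by the last pipe() call.
 */
int reserve_fds_with_delay(int fds[2], int max_retries, unsigned delay_sec)
{
    assert(max_retries >= 0);

    for (int attempt = 0; ; ++attempt) {
        if (pipe(fds) == 0) {
            return 0;               /* enough fd's were available */
        }
        if (errno != EMFILE && errno != ENFILE) {
            return -1;              /* unrelated error: do not retry */
        }
        if (attempt >= max_retries) {
            return -1;              /* still exhausted: caller aborts */
        }
        sleep(delay_sec);           /* give prior procs time to exit */
    }
}
```

With max_retries = 1 and delay_sec = 1 this mirrors the "wait a second, then
abort" behavior in the comment; a caller would close both descriptors once
it no longer needs them.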

Fixed in trunk - can see if we want to bring it to 1.3.

--
Ticket URL: <https://svn.open-mpi.org/trac/ompi/ticket/1927#comment:3>
Open MPI <http://www.open-mpi.org/>




--
Jeff Squyres
Cisco Systems

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
