Ralph Castain wrote:
>
> On Oct 3, 2008, at 7:14 AM, Roberto Fichera wrote:
>
>> Ralph Castain wrote:
>>> I committed something to the trunk yesterday. Given the complexity of
>>> the fix, I don't plan to bring it over to the 1.3 branch until
>>> sometime mid-to-end next week so it can be adequately tested.
>> Ok! So it means that I can check out the SVN trunk to get your fix,
>> right?
>
> Yes, though note that I don't claim it is fully correct yet. Still
> needs testing. However, I have tested it a fair amount and it seems okay.
>
> If you do test it, please let me know how it goes.

I ran my test on the SVN trunk revision below:

Open MPI: 1.4a1r19677
Open MPI SVN revision: r19677
Open MPI release date: Unreleased developer copy
Open RTE: 1.4a1r19677
Open RTE SVN revision: r19677
Open RTE release date: Unreleased developer copy
OPAL: 1.4a1r19677
OPAL SVN revision: r19677
OPAL release date: Unreleased developer copy
Ident string: 1.4a1r19677

Below is the output, which seems to freeze just after the second spawn.

[roberto@master TestOpenMPI]$ mpirun --verbose --debug-daemons --hostfile $PBS_NODEFILE -wdir "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received add_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0 arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
Initializing MPI ...
[master.tekno-soft.it:30063] [[19516,0],0] orted_recv: received sync+nidmap from local proc [[19516,1],0]
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
Loading the node's ring from file '/var/torque/aux//932.master.tekno-soft.it'
... adding node #1 host is 'cluster4.tekno-soft.it'
... adding node #2 host is 'cluster3.tekno-soft.it'
... adding node #3 host is 'cluster2.tekno-soft.it'
... adding node #4 host is 'cluster1.tekno-soft.it'
A 4 node's ring has been made
At least one node is available, let's start to distribute 100000 job across 4 nodes!!!
Setting up the host as 'cluster4.tekno-soft.it'
Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
Daemon was launched on cluster4.tekno-soft.it - beginning to initialize
Daemon [[19516,0],1] checking in as pid 25123 on host cluster4.tekno-soft.it
Daemon [[19516,0],1] not using static ports
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted: up and running - waiting for commands!
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received add_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0 arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon 1 arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received add_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[0].name master daemon 0 arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[1].name cluster4 daemon 1 arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[2].name cluster3 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[3].name cluster2 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[4].name cluster1 daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_recv: received sync+nidmap from local proc [[19516,2],0]
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received collective data cmd
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received message_local_procs

Let me know if you need my test program.

>
> Thanks
> Ralph
>
>>
>>> Ralph
>>>
>>> On Oct 3, 2008, at 5:02 AM, Roberto Fichera wrote:
>>>
>>>> Ralph Castain wrote:
>>>>> Actually, it just occurred to me that you may be seeing a problem in
>>>>> comm_spawn itself that I am currently chasing down. It is in the 1.3
>>>>> branch and has to do with comm_spawning procs on subsets of nodes
>>>>> (instead of across all nodes). Could be related to this - you might
>>>>> want to give me a chance to complete the fix. I have identified the
>>>>> problem and should have it fixed later today in our trunk - probably
>>>>> won't move to the 1.3 branch for several days.
>>>> Do you have any news about the above fix? Is the fix already
>>>> available for testing?
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users