Ralph Castain wrote:
>
> On Oct 3, 2008, at 7:14 AM, Roberto Fichera wrote:
>
>> Ralph Castain wrote:
>>> I committed something to the trunk yesterday. Given the complexity of
>>> the fix, I don't plan to bring it over to the 1.3 branch until
>>> sometime mid-to-end next week so it can be adequately tested.
>> OK! So that means I can check out SVN/trunk to get your fix,
>> right?
>
> Yes, though note that I don't claim it is fully correct yet. Still
> needs testing. However, I have tested it a fair amount and it seems okay.
>
> If you do test it, please let me know how it goes.
I ran my test against the svn/trunk build below:

                Open MPI: 1.4a1r19677
   Open MPI SVN revision: r19677
   Open MPI release date: Unreleased developer copy
                Open RTE: 1.4a1r19677
   Open RTE SVN revision: r19677
   Open RTE release date: Unreleased developer copy
                    OPAL: 1.4a1r19677
       OPAL SVN revision: r19677
       OPAL release date: Unreleased developer copy
            Ident string: 1.4a1r19677

Below is the output; the run seems to freeze just after the second spawn.

[roberto@master TestOpenMPI]$ mpirun --verbose --debug-daemons
--hostfile $PBS_NODEFILE -wdir "`pwd`" -np 1 testmaster 100000 $PBS_NODEFILE
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
add_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0
arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon
INVALID arch ffc91200
Initializing MPI ...
[master.tekno-soft.it:30063] [[19516,0],0] orted_recv: received
sync+nidmap from local proc [[19516,1],0]
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
message_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
message_local_procs
Loading the node's ring from file
'/var/torque/aux//932.master.tekno-soft.it'
... adding node #1 host is 'cluster4.tekno-soft.it'
... adding node #2 host is 'cluster3.tekno-soft.it'
... adding node #3 host is 'cluster2.tekno-soft.it'
... adding node #4 host is 'cluster1.tekno-soft.it'
A 4 node's ring has been made
At least one node is available, let's start to distribute 100000 job
across 4 nodes!!!
Setting up the host as 'cluster4.tekno-soft.it'
Setting the work directory as '/data/roberto/MPI/TestOpenMPI'
Spawning a task 'testslave.sh' on node 'cluster4.tekno-soft.it'
Daemon was launched on cluster4.tekno-soft.it - beginning to initialize
Daemon [[19516,0],1] checking in as pid 25123 on host cluster4.tekno-soft.it
Daemon [[19516,0],1] not using static ports
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted: up and running -
waiting for commands!
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
add_local_procs
[master.tekno-soft.it:30063] [[19516,0],0] node[0].name master daemon 0
arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[1].name cluster4 daemon
1 arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[2].name cluster3 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[3].name cluster2 daemon
INVALID arch ffc91200
[master.tekno-soft.it:30063] [[19516,0],0] node[4].name cluster1 daemon
INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
add_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[0].name master daemon
0 arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[1].name cluster4
daemon 1 arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[2].name cluster3
daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[3].name cluster2
daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] node[4].name cluster1
daemon INVALID arch ffc91200
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_recv: received
sync+nidmap from local proc [[19516,2],0]
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
collective data cmd
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
collective data cmd
[master.tekno-soft.it:30063] [[19516,0],0] orted_cmd: received
message_local_procs
[cluster4.tekno-soft.it:25123] [[19516,0],1] orted_cmd: received
message_local_procs

Let me know if you need my test program.
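For reference, the spawning part of my master does essentially the following. This is only a simplified sketch, not the actual code; the command name, host, and working directory are taken from the output above, and I'm assuming the usual MPI_Comm_spawn pattern with the "host" and "wdir" info keys:

/* Sketch only: spawn one slave on a chosen node picked from the ring
 * built out of $PBS_NODEFILE, then hand it a job. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* target node and working directory for this spawn */
    MPI_Info_set(info, "host", "cluster4.tekno-soft.it");
    MPI_Info_set(info, "wdir", "/data/roberto/MPI/TestOpenMPI");

    MPI_Comm child;
    MPI_Comm_spawn("testslave.sh", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

    /* ... send the job to the slave, collect the result ... */

    MPI_Comm_disconnect(&child);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}

The real program loops over the nodes and jobs, repeating the spawn for each one.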

>
> Thanks
> Ralph
>
>>
>>> Ralph
>>>
>>> On Oct 3, 2008, at 5:02 AM, Roberto Fichera wrote:
>>>
>>>> Ralph Castain wrote:
>>>>> Actually, it just occurred to me that you may be seeing a problem in
>>>>> comm_spawn itself that I am currently chasing down. It is in the 1.3
>>>>> branch and has to do with comm_spawning procs on subsets of nodes
>>>>> (instead of across all nodes). Could be related to this - you might
>>>>> want to give me a chance to complete the fix. I have identified the
>>>>> problem and should have it fixed later today in our trunk - probably
>>>>> won't move to the 1.3 branch for several days.
>>>> Do you have any news about the above fix? Is it already
>>>> available for testing?
