[OMPI devel] DDT for v1.2 branch

2007-10-10 Thread Jeff Squyres
George has proposed to bring the DDT over from the trunk to the v1.2  
branch before v1.2.5 in order to fix some pending bugs.


I do not think that this has been tested yet, but are there any
knee-jerk reactions against doing this?


--
Jeff Squyres
Cisco Systems




Re: [OMPI devel] DDT for v1.2 branch

2007-10-10 Thread Terry Dontje

Jeff Squyres wrote:
George has proposed to bring the DDT over from the trunk to the v1.2
branch before v1.2.5 in order to fix some pending bugs.

What does this entail (i.e., does this affect the PML interface at all)?
Also, by saying "before v1.2.5", I assume you mean this fix is to be
put into v1.2.5, since v1.2.4 has already been released, right?
I do not think that this has been tested yet, but are there any
knee-jerk reactions against doing this?

Can this be done in a tmp branch and tested out before committing to the
1.2 branch?
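
Something along these lines would presumably do it, assuming the usual
SVN layout (the repository URLs, temporary branch name, and REV1:REV2
below are placeholders, not the real values):

   # copy the v1.2 branch to a temporary branch for testing
   svn copy https://svn.open-mpi.org/svn/ompi/branches/v1.2 \
            https://svn.open-mpi.org/svn/ompi/tmp/v1.2-ddt \
            -m "Temp branch to test bringing the trunk DDT to v1.2"

   # check it out, merge the DDT changes over from the trunk, then
   # build and run the usual tests before touching the real v1.2 branch
   svn checkout https://svn.open-mpi.org/svn/ompi/tmp/v1.2-ddt
   cd v1.2-ddt
   svn merge -r REV1:REV2 https://svn.open-mpi.org/svn/ompi/trunk .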


--td



Re: [OMPI devel] problem in runing MPI job through XGrid

2007-10-10 Thread Jinhui Qin
Hi Brian,
 I found the problem. It looks like XGrid needs to do more work on fault
tolerance. It seems that the XGrid controller distributes jobs to the
available agents in a certain fixed order, and if one of those agents has
a problem communicating with the controller, all jobs fail, even when
there are still other agents available.
  In my case, the third node the controller always contacted was node6,
which could not be reached properly (I noticed this when I tried to check
each node over remote desktop; I could not reach that node, while the rest
of the nodes were fine). After I turned off the agent on node6, the
previous problem was solved and everything works fine.
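
For anyone hitting the same thing, one way to work around a bad node
without touching XGrid is to fall back to a plain hostfile that simply
leaves it out. A rough sketch (the hostnames and slot counts here are
made up for illustration):

   # hosts.txt -- list only the agents that are reachable; node6 omitted
   node1 slots=2
   node2 slots=2
   node3 slots=2

   # launch with the hostfile instead of the XGrid allocation
   mpirun -n 3 -hostfile hosts.txt ~/openMPI_stuff/Hello

If mpirun still tries to launch through XGrid, forcing the rsh launcher
with "-mca pls rsh" should take it out of the picture entirely.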

Thank you.
Jinhui


On 10/9/07, Brian Barrett wrote:
>
> On Oct 4, 2007, at 3:06 PM, Jinhui Qin wrote:
> > sib:sharcnet$ mpirun -n 3 ~/openMPI_stuff/Hello
> >
> > Process 0.1.1 is unable to reach 0.1.2 for MPI communication.
> > If you specified the use of a BTL component, you may have
> > forgotten a component (such as "self") in the list of
> > usable components.
> >
>
> This is very odd -- it looks like two of the processes don't think
> they can talk to each other.  Can you try running with:
>
>mpirun -n 3 -mca btl tcp,self 
>
> If that fails, then the next piece of information that would be
> useful is the IP addresses and netmasks for all the nodes in your
> cluster.  We have some logic in our TCP communication system that can
> cause some interesting results for some network topologies.
>
> Just to verify it's not an XGrid problem, you might want to try
> running with a hostfile -- I think you'll find that the results are
> the same, but it's always good to verify.
>
> Brian
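
For reference, the address/netmask information Brian asks about can be
pulled from each node with standard tools; a minimal sketch (the exact
interface names and addresses will vary per machine):

   # on each node, show every interface with its address and netmask
   ifconfig -a | grep "inet "

   # typical Mac OS X output looks like:
   #   inet 192.168.1.16 netmask 0xffffff00 broadcast 192.168.1.255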