All right - I’ll surrender and remove the timeout. Will release rc4 later 
tonight.

Sorry for putting you thru this Paul - for some reason, these problems aren’t 
showing up elsewhere.


> On Dec 12, 2014, at 3:37 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> 
> 
> 
> On Fri, Dec 12, 2014 at 2:58 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Aha! You are the first to fall thru the timeout. How interesting.
> 
> When it comes to the release candidates, I seem to own a lot of "firsts".
> It is not as fun as one might imagine :-).
> 
> Can you please try adding “-mca oob_tcp_connect_timeout 5:0”?
> 
> That appeared to produce a timeout of about 5 SECONDS ("time mpirun" reports 
> 5.8s elapsed).  Was that really the intent?   No difference if I change "5:0" 
> to "5:00".  So, you might have an "extra" bug lurking there.
> 
> 
> New stderr attached for
>   $ mpirun -mca oob_tcp_if_include bge0 -mca oob_tcp_connect_timeout 5:0 -mca 
> oob_base_verbose 20 -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 
> examples/ring_c
> 
> Assuming "5:0" was intended to get a 5 MINUTE timeout, I also tried "-mca 
> oob_tcp_connect_timeout 300", and have also attached the resulting stderr.
> 
> No joy for either timeout value.
> 
> -Paul
> 
>  
> 
> On Dec 12, 2014, at 8:53 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> 
>> 
>> First, I want to ask what became of the issue discussed in this thread?
>>    http://www.open-mpi.org/community/lists/devel/2014/11/16160.php
>> I thought we had concluded that one just needed -D_REENTRANT.
>> I mention that only for completeness, because I think my current problem is 
>> different.
>> 
>> The following works fine with 1.8.3, making the current behavior a 
>> regression.
>> 
>> I am still on the same system as that previous report, and still/again see a 
>> message like the following:
>> 
>> ------------------------------------------------------------
>> A process or daemon was unable to complete a TCP connection
>> to another process:
>>   Local host:    pcp-j-19
>>   Remote host:   172.18.0.120
>> This is usually caused by a firewall on the remote host. Please
>> check that any firewall (e.g., iptables) has been disabled and
>> try again.
>> ------------------------------------------------------------
>> --------------------------------------------------------------------------
>> ORTE was unable to reliably start one or more daemons.
>> This usually is caused by:
>> [...etc...]
>> 
>> It may be worth noting that the hostname pcp-j-19 (172.16.0.119) and the 
>> address 172.18.0.120 are on different subnets.
>> 
>> I CANNOT resolve the issue this time by adding -D_REENTRANT to CFLAGS at 
>> configure time (I didn't bother to check if it is there by default now or not).
>> 
>> NOR can I resolve it by using "-mca oob_tcp_if_include bge0" to allow only 
>> the 172.16.0.120 subnet.
>> IN FACT, the message is the same with that option, other than "172.18" 
>> changing to "172.16".
>> 
>> I've attached the output generated by "-mca oob_base_verbose 20" both with 
>> and without the oob_tcp_if_include.
>> 
>> I should also note that the following is my full mpirun command, which 
>> excludes the tcp BTL.
>> pcp-j-20$ mpirun -mca oob_tcp_if_include bge0 -mca oob_base_verbose 20 -mca 
>> btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c
>> 
>> 
>> -Paul
>> 
>> -- 
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>> <stdout-inc.txt><stderr-2if.txt>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16551.php
> 
> _______________________________________________
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16561.php
> 
> 
> 
> -- 
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
> <stderr-inc-5_0.txt><stderr-inc-300.txt>
> _______________________________________________
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16565.php
