Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Jeff Squyres
To that point, where exactly in the openib BTL init / query sequence is it returning an error for you, Sylvain? Is it just a matter of tidying something up properly before returning the error? On May 28, 2010, at 2:21 PM, George Bosilca wrote: > On May 28, 2010, at 10:03 , Sylvain Jeaugey wro

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread George Bosilca
On May 28, 2010, at 10:03 , Sylvain Jeaugey wrote: > On Fri, 28 May 2010, Jeff Squyres wrote: > >> On May 28, 2010, at 9:32 AM, Jeff Squyres wrote: >> >>> Understood, and I agreed that the bug should be fixed. Patches would be >>> welcome. :-) > I sent a patch on the bml layer in my first e-m

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey
On Fri, 28 May 2010, Jeff Squyres wrote: On May 28, 2010, at 9:32 AM, Jeff Squyres wrote: Understood, and I agreed that the bug should be fixed. Patches would be welcome. :-) I sent a patch on the bml layer in my first e-mail. We will apply it on our tree, but as always we're trying to send

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Jeff Squyres
On May 28, 2010, at 9:32 AM, Jeff Squyres wrote: >> So please, fix the bug first, then if you want that "automatic failover to >> TCP" feature, develop it. Put a parameter for an eventual notification, or >> abort, or whatever you want. But it doesn't exist today. It just doesn't >> work, with any

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Jeff Squyres
On May 28, 2010, at 7:19 AM, Sylvain Jeaugey wrote: > So please, fix the bug first, then if you want that "automatic failover to > TCP" feature, develop it. Put a parameter for an eventual notification, or > abort, or whatever you want. But it doesn't exist today. It just doesn't > work, with any

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey
On Fri, 28 May 2010, Jeff Squyres wrote: Herein lies the quandary: we don't/can't know the user or sysadmin intent. They may not care if the IB is borked -- they might just want the job to fall over to TCP and continue. But they may care a lot if IB is borked -- they might want the job to ab

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Jeff Squyres
On May 28, 2010, at 6:04 AM, Sylvain Jeaugey wrote: > Having errors on add_procs stop the application seems a good thing in all > cases, so why not do it? That would solve my original problem which led > to this discussion. > > Yes, the openib BTL may be suboptimal (stopping the application ins

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey
On Thu, 27 May 2010, Jeff Squyres wrote: On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote: That's pretty much my first proposal: abort when an error arises, because if we don't, we'll crash soon afterwards. That's my original concern and this should really be fixed. Now, if you want to

[OMPI devel] Some questions about checkpoint/restart (13),(14)

2010-05-28 Thread Takayuki Seki
The 13th and 14th questions are as follows: (13) Some messages are not shown even though the --mca snapc_base_verbose parameter is used. Framework: snapc Component: full Source file: orte/mca/snapc/base/snapc_base_open.c Function name: orte_snapc_base_open I think that the fo

[OMPI devel] Some questions about checkpoint/restart (12)

2010-05-28 Thread Takayuki Seki
Hi, Josh > https://svn.open-mpi.org/trac/ompi/ticket/2397 Thank you very much for filing my questions in the ticket system. I now have 3 new questions and will post them. Regards, Takayuki Seki The 12th question is as follows: (12) Checkpointing of an MPI job which uses two (or more?) openib btl modu