Re: [OMPI devel] Vprotocol pessimist - Open MPI 1.4.1 and 1.4.2a1r22558

2010-02-24 Thread Aurélien Bouteiller
Hi, 

The instructions you found are now obsolete. I'll update them; thank you for 
pointing that out.

The new procedure for using uncoordinated checkpointing is:
mpirun -mca vprotocol pessimist -mca pml ob1,v [regular arguments]

The version available in trunk does not support actual restart because the 
required runtime support is missing, so it is limited to evaluating the 
performance cost of fault tolerance in failure-free runs. There is an ongoing 
proposal to include such support in the main branch. However, we do have a 
branched version of Open MPI that includes all the necessary support, which I 
can provide on request. Please also keep in mind that this is an ongoing 
research effort that has not yet matured enough to be used in a production 
environment. 

Aurelien Bouteiller
--
Dr. Aurelien Bouteiller
Innovative Computing Laboratory at the University of Tennessee



On Feb 6, 2010, at 10:21, Caciano Machado wrote:
> Hi,
> 
> I'm following the instructions found at
> https://svn.open-mpi.org/trac/ompi/wiki/EventLog_CR to run an
> application with the vprotocol pessimist enabled. I believe that I'm
> doing something wrong but I can't figure out the problem.
> 
> I have compiled Open MPI 1.4.1 and 1.4.2a1r22558 with the parameters:
> ./configure --prefix=/usr/local/openmpi-v/ --with-ft=cr
> --with-blcr=/usr/local/blcr/
> 
> Here is my configuration file:
> vprotocol_pessimist_priority=10
> pml_base_verbose=10
> pbl_v_verbose=500
> 
> The command line:
> mpirun -am /etc/v -np 2 -machinefile /etc/machinefile ep.B.8
> 
> And the mpirun output:
> ##3
> [xiru-10:03440] mca: base: components_open: Looking for pml components
> [xiru-10:03440] mca: base: components_open: opening pml components
> [xiru-10:03440] mca: base: components_open: found loaded component cm
> [xiru-10:03440] mca: base: components_open: component cm has no
> register function
> [xiru-10:03440] mca: base: component_find: unable to open
> /usr/local/openmpi-v/lib/openmpi/mca_mtl_mx: perhaps a missing symbol,
> or compiled for a different version of Open MPI? (ignored)
> 
> [xiru-10:03440] mca: base: components_open: component cm open function
> successful
> [xiru-10:03440] mca: base: components_open: found loaded component crcpw
> [xiru-10:03440] mca: base: components_open: component crcpw has no
> register function
> [xiru-10:03440] mca: base: components_open: component crcpw open
> function successful
> [xiru-10:03440] mca: base: components_open: found loaded component csum
> [xiru-10:03440] mca: base: components_open: component csum has no
> register function
> [xiru-10:03440] mca: base: component_find: unable to open
> /usr/local/openmpi-v/lib/openmpi/mca_btl_mx: perhaps a missing symbol,
> or compiled for a different version of Open MPI? (ignored)
> [xiru-10:03440] mca: base: components_open: component csum open
> function successful
> [xiru-10:03440] mca: base: components_open: found loaded component ob1
> [xiru-10:03440] mca: base: components_open: component ob1 has no
> register function
> [xiru-10:03440] mca: base: components_open: component ob1 open
> function successful
> [xiru-10:03440] mca: base: components_open: found loaded component v
> [xiru-10:03440] mca: base: components_open: component v has no register 
> function
> [xiru-10:03440] mca: base: components_open: component v open function 
> successful
> --
> [[65326,1],0]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
> 
> Module: OpenFabrics (openib)
>  Host: xiru-10.portoalegre.grenoble.grid5000.fr
> 
> Another transport will be used instead, although this may result in
> lower performance.
> --
> [xiru-10:03440] select: initializing pml component cm
> [xiru-10:03440] select: init returned failure for component cm
> [xiru-10:03440] select: component crcpw not in the include list
> [xiru-10:03440] select: component csum not in the include list
> [xiru-10:03440] select: initializing pml component ob1
> [xiru-10:03440] select: init returned priority 20
> [xiru-10:03440] select: component v not in the include list
> [xiru-10:03440] selected ob1 best priority 20
> [xiru-10:03440] select: component ob1 selected
> [xiru-10:03440] mca: base: close: component cm closed
> [xiru-10:03440] mca: base: close: unloading component cm
> [xiru-10:03440] mca: base: close: component crcpw closed
> [xiru-10:03440] mca: base: close: unloading component crcpw
> [xiru-10:03440] mca: base: close: component csum closed
> [xiru-10:03440] mca: base: close: unloading component csum
> [xiru-10:03440] mca: base: close: component v closed
> [xiru-10:03440] mca: base: close: unloading component v
> ...
> 
> #3
> 
> It seems that the vprotocol module is not loading properly. Does
> anyone have a solution to 

[OMPI devel] what's the relationship between proc, endpoint and btl?

2010-02-24 Thread hu yaohui
Could someone tell me the relationship between proc,endpoint and btl?
 thanks & regards


Re: [OMPI devel] what's the relationship between proc, endpoint and btl?

2010-02-24 Thread Aurélien Bouteiller
A btl is the component responsible for a particular type of fabric. An endpoint 
is roughly the instantiation of a btl used to reach a particular destination on 
a particular fabric. A proc is the generic name and properties of a destination. 

Aurelien

On Feb 24, 2010, at 09:59, hu yaohui wrote:

> Could someone tell me the relationship between proc,endpoint and btl?
>  thanks & regards
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




[OMPI devel] OFED 1.4.1-2ofed SRPM

2010-02-24 Thread Jeff Squyres
As discussed on the call yesterday, Pasha and I are preparing a new SRPM 
specifically for OFED 1.5.1.

The reason is that they need some updates to the openib INI file, but we're 
just not ready for a v1.4.2 release.  Hence, it makes sense to slightly modify 
the default OMPI spec file, put in some patches for OFED, and call it OMPI 
1.4.1-2ofed.  This is fairly common practice, I think.

So far, I have INI file updates from Mellanox, Intel, and Chelsio (might get 
another Chelsio update in the next day or three).  

We should record this specfile somewhere, though, just for posterity.  Two 
questions:

1. Should I commit this stuff in the 1.4 contrib/dist/linux branch?  (if I hear 
nothing back, I assume "yes")
2. Should I post this custom SRPM on 
http://www.open-mpi.org/software/ompi/v1.4/? (if I hear nothing back, I assume 
"no" -- treat it like any other downstream packager that has their own custom 
OMPI package)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] OFED 1.4.1-2ofed SRPM

2010-02-24 Thread Barrett, Brian W
On Feb 24, 2010, at 4:05 PM, Jeff Squyres wrote:

> We should record this specfile somewhere, though, just for posterity.  Two 
> questions:
> 
> 1. Should I commit this stuff in the 1.4 contrib/dist/linux branch?  (if I 
> hear nothing back, I assume "yes")

Which stuff?  If it's just the updated INI file, I'd say no -- there's no need 
to record what OFED did to the tarball (just like we don't record what Red Hat 
did to the tarball).  If there were some changes to the RPM generation script 
which would be useful in the future (such as making it easier to dump a new INI 
file into the SRPM), then I would say yes.

Brian

--
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories







Re: [OMPI devel] OFED 1.4.1-2ofed SRPM

2010-02-24 Thread Jeff Squyres
On Feb 24, 2010, at 5:05 PM, Barrett, Brian W wrote:

> > We should record this specfile somewhere, though, just for posterity.  Two 
> > questions:
> >
> > 1. Should I commit this stuff in the 1.4 contrib/dist/linux branch?  (if I 
> > hear nothing back, I assume "yes")
> 
> Which stuff?  If it's just the updated INI file, I'd say no -- there's no 
> need to record what OFED did to the tarball (just like we don't record what 
> Red Hat did to the tarball).  If there were some changes to the RPM 
> generation script which would be useful in the future (such as making it 
> easier to dump a new INI file into the SRPM), then I would say yes.

The stuff is a few updates to the specfile and a slightly modified buildrpm.sh 
script to copy 3 *.patch files to the SOURCES directory so that they can be 
used in Patch[012]: and %patch[012] clauses in the specfile.  I didn't bother 
making it generic.  These 3 patches update the .ini file included in the 1.4.1 
tarball.

So if it's not worthwhile, I don't need to commit this stuff.  All the INI 
changes are on the trunk and slated to go over to the branches; it'll just take 
time to get a formal release out with these patches.

-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI devel] what's the relationship between proc, endpoint and btl?

2010-02-24 Thread Jeff Squyres
On Feb 24, 2010, at 12:16 PM, Aurélien Bouteiller wrote:

> btl is the component responsible for a particular type of fabric. Endpoint is 
> somewhat the instantiation of a btl to reach a particular destination on a 
> particular fabric, proc is the generic name and properties of a destination.

A few more words here...

btl = Byte Transfer Layer.  It's our name for the framework that governs one 
flavor of point-to-point communications in the MPI layer.  Components in this 
framework are used by the ob1 and csum PMLs to effect MPI point-to-point 
communications (they're used in other ways, too, but let's start at the 
beginning here...).  There are several btl components: tcp, sm (shared memory), 
self (process loopback), openib (OpenFabrics), ...etc.  Each one of these 
effects communications over a different network type.  For purposes of this 
discussion, "component" == "plugin".  

The btl plugin is loaded into an MPI process and its component open/query 
functions are called.  If the btl component determines that it wants to run, it 
returns one or more modules.  Typically, btls return a module for every 
interface that they find.  For example, if the openib component finds 2 
OpenFabrics device ports, it'll return 2 modules.  

Hence, we typically describe components as analogous to a C++ class; modules 
are analogous to instances of that C++ class.
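
To make the analogy concrete, here is a minimal Python sketch.  It is a 
conceptual model only, not Open MPI's actual C interfaces: the class names, the 
query() method, and the port names are all made up for illustration.  It shows 
a component returning one module per interface it finds:

```python
from dataclasses import dataclass


@dataclass
class BtlModule:
    """One instance of a BTL component, bound to a single interface."""
    component_name: str
    port: str


class BtlComponent:
    """Stands in for a BTL plugin such as tcp, sm, self, or openib."""

    def __init__(self, name, ports):
        self.name = name
        # Interfaces this component can see (hypothetical names).
        self.ports = list(ports)

    def query(self):
        # Return one module per interface; an empty list models a
        # component that decides it does not want to run.
        return [BtlModule(self.name, port) for port in self.ports]


openib = BtlComponent("openib", ["port0", "port1"])
modules = openib.query()
print(len(modules))  # prints 2: one module per device port
```

A component that finds no usable interface returns no modules in this sketch, 
which mirrors a component that declines to run.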

Note that comments and variable/field names in many BTL components use 
shorthand language such as, "The btl then does this..."  Such language 
almost always refers to a specific module of that btl component.

Modules are marshalled by the bml and ob1/csum to make an ordered list of who 
can talk to whom.

Endpoints are data structures used to represent a module's connection to a 
remote MPI process (proc).  Hence, a BTL component can create multiple modules; 
each module can create lots of endpoints.  Each endpoint is tied to a specific 
remote proc.
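
The module/endpoint/proc relationship can be sketched the same way.  Again, 
this is an illustrative Python model, not the real data structures: Proc, 
Endpoint, Module, and add_proc() are hypothetical names, and keeping exactly 
one endpoint per (module, proc) pair is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Proc:
    """Generic name and properties of a remote MPI process."""
    rank: int


@dataclass(frozen=True)
class Endpoint:
    """One module's connection to one specific remote proc."""
    module_name: str
    proc: Proc


@dataclass
class Module:
    """One instance of a BTL component (e.g. one device port)."""
    name: str
    endpoints: dict = field(default_factory=dict)

    def add_proc(self, proc):
        # Assumption of this sketch: at most one endpoint per remote proc.
        if proc.rank not in self.endpoints:
            self.endpoints[proc.rank] = Endpoint(self.name, proc)
        return self.endpoints[proc.rank]


# A component that returned two modules, each reaching the same three procs.
modules = [Module("openib-port0"), Module("openib-port1")]
procs = [Proc(rank) for rank in range(1, 4)]
for module in modules:
    for proc in procs:
        module.add_proc(proc)

total = sum(len(m.endpoints) for m in modules)
print(total)  # 2 modules x 3 procs = 6 endpoints
```

So one component fans out into modules, and each module fans out into 
endpoints, one per remote proc it can reach.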

> Aurelien
> 
> On Feb 24, 2010, at 09:59, hu yaohui wrote:
> 
> > Could someone tell me the relationship between proc,endpoint and btl?
> >  thanks & regards


-- 
Jeff Squyres
jsquy...@cisco.com