Re: [OMPI devel] Vprotocol pessimist - Open MPI 1.4.1 and 1.4.2a1r22558
Hi,

The instructions you found are now obsolete. I'll update them; thank you for pointing this out. The new procedure for using uncoordinated checkpointing is:

mpirun -mca vprotocol pessimist -mca pml ob1,v [regular arguments]

The version available in trunk does not support actual restart because the runtime support is missing, so it is limited to evaluating the performance cost of FT in failure-free runs. There is an ongoing proposal to include such support in the main branch. However, we do have a branched version of Open MPI including all the necessary support, which can be provided on request. Please also consider that this is an ongoing research effort that has not yet matured enough to be used in a production environment.

Aurelien Bouteiller
--
Dr. Aurelien Bouteiller
Innovative Computing Laboratory at the University of Tennessee

On Feb 6, 2010, at 10:21, Caciano Machado wrote:

> Hi,
>
> I'm following the instructions found at
> https://svn.open-mpi.org/trac/ompi/wiki/EventLog_CR to run an
> application with the vprotocol pessimist enabled. I believe that I'm
> doing something wrong but I can't figure out the problem.
>
> I have compiled Open MPI 1.4.1 and 1.4.2a1r22558 with the parameters:
> ./configure --prefix=/usr/local/openmpi-v/ --with-ft=cr
> --with-blcr=/usr/local/blcr/
>
> Here is my configuration file:
> vprotocol_pessimist_priority=10
> pml_base_verbose=10
> pbl_v_verbose=500
>
> The command line:
> mpirun -am /etc/v -np 2 -machinefile /etc/machinefile ep.B.8
>
> And the mpirun output:
> ##3
> [xiru-10:03440] mca: base: components_open: Looking for pml components
> [xiru-10:03440] mca: base: components_open: opening pml components
> [xiru-10:03440] mca: base: components_open: found loaded component cm
> [xiru-10:03440] mca: base: components_open: component cm has no
> register function
> [xiru-10:03440] mca: base: component_find: unable to open
> /usr/local/openmpi-v/lib/openmpi/mca_mtl_mx: perhaps a missing symbol,
> or compiled for a different version of Open MPI? (ignored)
>
> [xiru-10:03440] mca: base: components_open: component cm open function
> successful
> [xiru-10:03440] mca: base: components_open: found loaded component crcpw
> [xiru-10:03440] mca: base: components_open: component crcpw has no
> register function
> [xiru-10:03440] mca: base: components_open: component crcpw open
> function successful
> [xiru-10:03440] mca: base: components_open: found loaded component csum
> [xiru-10:03440] mca: base: components_open: component csum has no
> register function
> [xiru-10:03440] mca: base: component_find: unable to open
> /usr/local/openmpi-v/lib/openmpi/mca_btl_mx: perhaps a missing symbol,
> or compiled for a different version of Open MPI? (ignored)
> [xiru-10:03440] mca: base: components_open: component csum open
> function successful
> [xiru-10:03440] mca: base: components_open: found loaded component ob1
> [xiru-10:03440] mca: base: components_open: component ob1 has no
> register function
> [xiru-10:03440] mca: base: components_open: component ob1 open
> function successful
> [xiru-10:03440] mca: base: components_open: found loaded component v
> [xiru-10:03440] mca: base: components_open: component v has no register
> function
> [xiru-10:03440] mca: base: components_open: component v open function
> successful
> --
> [[65326,1],0]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
>
> Module: OpenFabrics (openib)
> Host: xiru-10.portoalegre.grenoble.grid5000.fr
>
> Another transport will be used instead, although this may result in
> lower performance.
> --
> [xiru-10:03440] select: initializing pml component cm
> [xiru-10:03440] select: init returned failure for component cm
> [xiru-10:03440] select: component crcpw not in the include list
> [xiru-10:03440] select: component csum not in the include list
> [xiru-10:03440] select: initializing pml component ob1
> [xiru-10:03440] select: init returned priority 20
> [xiru-10:03440] select: component v not in the include list
> [xiru-10:03440] selected ob1 best priority 20
> [xiru-10:03440] select: component ob1 selected
> [xiru-10:03440] mca: base: close: component cm closed
> [xiru-10:03440] mca: base: close: unloading component cm
> [xiru-10:03440] mca: base: close: component crcpw closed
> [xiru-10:03440] mca: base: close: unloading component crcpw
> [xiru-10:03440] mca: base: close: component csum closed
> [xiru-10:03440] mca: base: close: unloading component csum
> [xiru-10:03440] mca: base: close: component v closed
> [xiru-10:03440] mca: base: close: unloading component v
> ...
>
> #3
>
> It seems that the vprotocol module is not loading properly. Does
> anyone have a solution to
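
The "select:" lines in the quoted log can be read as a two-step process: components that are not in the include list are skipped, and among the remaining ones the component whose init succeeds with the highest priority wins. The following is only a toy model of that reading, not Open MPI's actual MCA selection code. The function names are hypothetical, and the include list {"cm", "ob1"} is inferred from which components the log reports as initialized; the thread does not show it.

```python
def split_by_include_list(opened, include):
    """Split opened components into (considered, skipped), keeping open order."""
    considered = [name for name in opened if name in include]
    skipped = [name for name in opened if name not in include]
    return considered, skipped


def select_best(init_results):
    """Return the component with the highest priority.

    init_results maps component name -> priority, with None marking an
    init failure. Returns None if every init failed.
    """
    ok = {name: prio for name, prio in init_results.items() if prio is not None}
    if not ok:
        return None
    return max(ok, key=ok.get)


# pml components in the order the log reports them as opened.
opened = ["cm", "crcpw", "csum", "ob1", "v"]

# Assumption: inferred from the log, where only cm and ob1 were initialized.
logged_include = {"cm", "ob1"}
considered, skipped = split_by_include_list(opened, logged_include)

# Init outcomes as reported in the log: cm failed, ob1 returned priority 20.
best = select_best({"cm": None, "ob1": 20})

# With "-mca pml ob1,v" from the reply, v is no longer filtered out.
considered_new, skipped_new = split_by_include_list(opened, {"ob1", "v"})

print(considered, skipped, best)
print(considered_new, skipped_new)
```

In this model, v never reaches the init step in the logged run because it is filtered out first, which matches the "component v not in the include list" line. What the v component does once it is considered is outside the scope of the sketch.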
[OMPI devel] what's the relationship between proc, endpoint and btl?
Could someone tell me the relationship between proc, endpoint, and btl?

Thanks & regards
Re: [OMPI devel] what's the relationship between proc, endpoint and btl?
A btl is the component responsible for a particular type of fabric. An endpoint is roughly the instantiation of a btl used to reach a particular destination on a particular fabric. A proc holds the generic name and properties of a destination.

Aurelien

On Feb 24, 2010, at 09:59, hu yaohui wrote:

> Could someone tell me the relationship between proc,endpoint and btl?
> thanks & regards
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] OFED 1.4.1-2ofed SRPM
As discussed on the call yesterday, Pasha and I are preparing a new SRPM specifically for OFED 1.5.1. The reason is that they need some updates to the openib INI file, but we're just not ready for a v1.4.2 release. Hence, it makes sense to slightly modify the default OMPI spec file, put in some patches for OFED, and call it OMPI 1.4.1-2ofed. This is fairly common practice, I think.

So far, I have INI file updates from Mellanox, Intel, and Chelsio (might get another Chelsio update in the next day or three).

We should record this specfile somewhere, though, just for posterity. Two questions:

1. Should I commit this stuff in the 1.4 contrib/dist/linux branch? (if I hear nothing back, I assume "yes")

2. Should I post this custom SRPM on http://www.open-mpi.org/software/ompi/v1.4/? (if I hear nothing back, I assume "no" -- treat it like any other downstream packager that has their own custom OMPI package)

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] OFED 1.4.1-2ofed SRPM
On Feb 24, 2010, at 4:05 PM, Jeff Squyres wrote:

> We should record this specfile somewhere, though, just for posterity. Two
> questions:
>
> 1. Should I commit this stuff in the 1.4 contrib/dist/linux branch? (if I
> hear nothing back, I assume "yes")

Which stuff? If it's just the updated INI file, I'd say no -- there's no need to record what OFED did to the tarball (just like we don't record what Red Hat did to the tarball). If there were some changes to the RPM generation script which would be useful in the future (such as making it easier to dump a new INI file into the SRPM), then I would say yes.

Brian

--
Brian W. Barrett
Dept. 1423: Scalable System Software
Sandia National Laboratories
Re: [OMPI devel] OFED 1.4.1-2ofed SRPM
On Feb 24, 2010, at 5:05 PM, Barrett, Brian W wrote:

> > We should record this specfile somewhere, though, just for posterity. Two
> > questions:
> >
> > 1. Should I commit this stuff in the 1.4 contrib/dist/linux branch? (if I
> > hear nothing back, I assume "yes")
>
> Which stuff? If it's just the updated INI file, I'd say no -- there's no
> need to record what OFED did to the tarball (just like we don't record what
> Red Hat did to the tarball). If there were some changes to the RPM
> generation script which would be useful in the future (such as making it
> easier to dump a new INI file into the SRPM), then I would say yes.

The stuff is a few updates to the specfile and a slightly modified buildrpm.sh script that copies 3 *.patch files to the SOURCES directory so that they can be used in Patch[012]: and %patch[012] clauses in the specfile. I didn't bother making it generic. These 3 patches update the .ini file included in the 1.4.1 tarball.

So if it's not worthwhile, I don't need to commit this stuff. All the INI changes are on the trunk and slated to go over to the branches; it'll just take time to get a formal release out with these patches.

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] what's the relationship between proc, endpoint and btl?
On Feb 24, 2010, at 12:16 PM, Aurélien Bouteiller wrote:

> btl is the component responsible for a particular type of fabric. Endpoint is
> somewhat the instantiation of a btl to reach a particular destination on a
> particular fabric, proc is the generic name and properties of a destination.

A few more words here...

btl = Byte Transfer Layer. It's our name for the framework that governs one flavor of point-to-point communications in the MPI layer. Components in this framework are used by the ob1 and csum PMLs to effect MPI point-to-point communications (they're used in other ways, too, but let's start at the beginning here...).

There are several btl components: tcp, sm (shared memory), self (process loopback), openib (OpenFabrics), etc. Each one of these effects communications over a different network type. For purposes of this discussion, "component" == "plugin".

The btl plugin is loaded into an MPI process and its component open/query functions are called. If the btl component determines that it wants to run, it returns one or more modules. Typically, btls return a module for every interface that they find. For example, if the openib component finds 2 OpenFabrics device ports, it'll return 2 modules. Hence, we typically describe components as analogous to a C++ class; modules are analogous to instances of that C++ class.

Note that comments and variable/field names in many BTL components typically use shorthand language such as, "The btl then does this..." Such language almost always refers to a specific module of that btl component.

Modules are marshalled by the bml and ob1/csum to make an ordered list of who can talk to whom.

Endpoints are data structures used to represent a module's connection to a remote MPI process (proc). Hence, a BTL component can create multiple modules; each module can create lots of endpoints. Each endpoint is tied to a specific remote proc.

> Aurelien
>
> On Feb 24, 2010, at 09:59, hu yaohui wrote:
>
> > Could someone tell me the relationship between proc,endpoint and btl?
> > thanks & regards

--
Jeff Squyres
jsquy...@cisco.com
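
The component / module / endpoint / proc relationship described in this message can be sketched as a small object model. This is an illustration only: the class names, fields, and the interface labels "port0" and "port1" are hypothetical and do not correspond to Open MPI's actual C structures. It follows the class/instance analogy from the message: one component yields one module per interface, and each module holds one endpoint per remote proc.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Proc:
    """A remote MPI process (hypothetical minimal form: just a rank)."""
    rank: int


@dataclass
class Endpoint:
    """One module's connection to one specific remote proc."""
    proc: Proc


@dataclass
class Module:
    """One instance returned by a component, typically one per interface."""
    interface: str
    endpoints: dict = field(default_factory=dict)

    def add_proc(self, proc):
        # Each module keeps its own endpoint for every proc it can reach.
        endpoint = Endpoint(proc)
        self.endpoints[proc.rank] = endpoint
        return endpoint


@dataclass
class Component:
    """A plugin for one network type, e.g. tcp, sm, self, openib."""
    name: str

    def query(self, interfaces):
        # Return one module per interface that the component found.
        return [Module(interface) for interface in interfaces]


# An openib component that finds 2 device ports returns 2 modules.
openib = Component("openib")
modules = openib.query(["port0", "port1"])

# Every module creates its own endpoint for each remote proc.
peers = [Proc(1), Proc(2), Proc(3)]
for module in modules:
    for peer in peers:
        module.add_proc(peer)

print(len(modules), [len(m.endpoints) for m in modules])
```

Note that the two modules hold distinct endpoint objects for the same proc: the proc is shared, while the endpoint is per module.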