Hi,

I got the openmpi-1.4a1r21095 tarball
(http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r21095.tar.gz), but
unfortunately my command doesn't work:

cat rankf:
rank 0=node1 slot=*
rank 1=node2 slot=*

cat hostf:
node1 slots=2
node2 slots=2

mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 hostname : --host node2 -n 1 hostname

Error, invalid rank (1) in the rankfile (rankf)
--------------------------------------------------------------------------
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 403
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 86
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 86
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 1016

Ralph, could you tell me whether my command syntax is correct or not? If not, could you give me the expected one?
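For reference, if I follow your earlier explanation that the ranks in a rankfile are counted across all app_contexts of the job (rank 0 for the first -n 1, rank 1 for the second), the same layout without the per-context --host restrictions would be something like the following. This is only a sketch of what I understood, not something I have verified against this tarball:

cat rankf:
rank 0=node1 slot=*
rank 1=node2 slot=*

mpirun --rankfile rankf --hostfile hostf -n 1 hostname : -n 1 hostname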
Regards

Geoffroy


2009/4/30 Geoffroy Pignot <geopig...@gmail.com>

> Immediately Sir !!! :)
>
> Thanks again Ralph
>
> Geoffroy
>
>
>> ------------------------------
>>
>> Message: 2
>> Date: Thu, 30 Apr 2009 06:45:39 -0600
>> From: Ralph Castain <r...@open-mpi.org>
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To: Open MPI Users <us...@open-mpi.org>
>>
>> I believe this is fixed now in our development trunk - you can download
>> any tarball starting from last night and give it a try, if you like.
>> Any feedback would be appreciated.
>>
>> Ralph
>>
>>
>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>>
>> Ah now, I didn't say it -worked-, did I? :-)
>>
>> Clearly a bug exists in the program. I'll try to take a look at it (if
>> Lenny doesn't get to it first), but it won't be until later in the week.
>>
>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>
>> I agree with you Ralph, and that's what I expect from openmpi, but my
>> second example shows that it's not working:
>>
>> cat hostfile.0
>> r011n002 slots=4
>> r011n003 slots=4
>>
>> cat rankfile.0
>> rank 0=r011n002 slot=0
>> rank 1=r011n003 slot=1
>>
>> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
>> ### CRASHED
>>
>> > > --------------------------------------------------------------------------
>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>> > > --------------------------------------------------------------------------
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > rmaps_rank_file.c at line 404
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > base/rmaps_base_map_job.c at line 87
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > base/plm_base_launch_support.c at line 77
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > plm_rsh_module.c at line 985
>> > > --------------------------------------------------------------------------
>> > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>> > > launch so we are aborting.
>> > >
>> > > There may be more information reported by the environment (see above).
>> > >
>> > > This may be because the daemon was unable to find all the needed shared
>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> > > location of the shared libraries on the remote nodes and this will
>> > > automatically be forwarded to the remote nodes.
>> > > --------------------------------------------------------------------------
>> > > --------------------------------------------------------------------------
>> > > orterun noticed that the job aborted, but has no info as to the process
>> > > that caused that situation.
>> > > --------------------------------------------------------------------------
>> > > orterun: clean termination accomplished
>>
>>
>> Message: 4
>> Date: Tue, 14 Apr 2009 06:55:58 -0600
>> From: Ralph Castain <r...@lanl.gov>
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To: Open MPI Users <us...@open-mpi.org>
>>
>> The rankfile cuts across the entire job - it isn't applied on an
>> app_context basis. So the ranks in your rankfile must correspond to
>> the eventual rank of each process in the cmd line.
>>
>> Unfortunately, that means you have to count ranks. In your case, you
>> only have four, so that makes life easier. Your rankfile would look
>> something like this:
>>
>> rank 0=r001n001 slot=0
>> rank 1=r001n002 slot=1
>> rank 2=r001n001 slot=1
>> rank 3=r001n002 slot=2
>>
>> HTH
>> Ralph
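Spelling that out against the four-process master/slave command quoted just below: rank 0 is the first app_context, rank 1 the second, and so on, so the rankfile above would pair with a single command line roughly like the following. This is only my reading of Ralph's explanation, not something taken from his mail or verified, and "myrankfile" is just a placeholder name:

# "myrankfile" is a placeholder for the four-rank file Ralph sketched above
mpirun -rf myrankfile \
    -n 1 -host r001n001 master.x options1 : \
    -n 1 -host r001n002 master.x options2 : \
    -n 1 -host r001n001 slave.x options3 : \
    -n 1 -host r001n002 slave.x options4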
>>
>> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>>
>> > Hi,
>> >
>> > I agree that my examples are not very clear. What I want to do is to
>> > launch a multi-executable application (masters and slaves) and benefit
>> > from processor affinity.
>> > Could you show me how to convert this command, using the -rf option
>> > (whatever the affinity is):
>> >
>> > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002
>> > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host
>> > r001n002 slave.x options4
>> >
>> > Thanks for your help
>> >
>> > Geoffroy
>> >
>> >
>> > Message: 2
>> > Date: Sun, 12 Apr 2009 18:26:35 +0300
>> > From: Lenny Verkhovsky <lenny.verkhov...@gmail.com>
>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> > To: Open MPI Users <us...@open-mpi.org>
>> >
>> > Hi,
>> >
>> > The first "crash" is OK, since your rankfile has ranks 0 and 1 defined,
>> > while n=1, which means only rank 0 is present and can be allocated.
>> >
>> > NP must be >= the largest rank in rankfile.
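In other words, a small worked example of Lenny's point, reusing the hostfile.0/rankfile.0 shown above (this is my reading, not text from Lenny's mail): rankfile.0 names ranks 0 and 1, so the job as a whole must start at least two processes for the mapping to be valid.

# rankfile.0 defines ranks 0 and 1, so the total np must be at least 2
mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname    # both ranks can be allocated
mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname    # only rank 0 exists; rank 1 in the rankfile is invalid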
>> >
>> > What exactly are you trying to do?
>> >
>> > I tried to recreate your segv but all I got was
>> >
>> > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0
>> > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
>> > [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
>> > uses an MCA interface that is not recognized (component MCA v1.0.0 !=
>> > supported MCA v2.0.0) -- ignored
>> > --------------------------------------------------------------------------
>> > It looks like opal_init failed for some reason; your parallel process is
>> > likely to abort. There are many reasons that a parallel process can
>> > fail during opal_init; some of which are due to configuration or
>> > environment problems. This failure appears to be an internal failure;
>> > here's some additional information (which may only be relevant to an
>> > Open MPI developer):
>> >
>> >   opal_carto_base_select failed
>> >   --> Returned value -13 instead of OPAL_SUCCESS
>> > --------------------------------------------------------------------------
>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>> > ../../orte/runtime/orte_init.c at line 78
>> > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>> > ../../orte/orted/orted_main.c at line 344
>> > --------------------------------------------------------------------------
>> > A daemon (pid 11629) died unexpectedly with status 243 while attempting
>> > to launch so we are aborting.
>> >
>> > There may be more information reported by the environment (see above).
>> >
>> > This may be because the daemon was unable to find all the needed shared
>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> > location of the shared libraries on the remote nodes and this will
>> > automatically be forwarded to the remote nodes.
>> > --------------------------------------------------------------------------
>> > --------------------------------------------------------------------------
>> > mpirun noticed that the job aborted, but has no info as to the process
>> > that caused that situation.
>> > --------------------------------------------------------------------------
>> > mpirun: clean termination accomplished
>> >
>> >
>> > Lenny.
>> >
>> >
>> > On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote:
>> > >
>> > > Hi,
>> > >
>> > > I am currently testing the process affinity capabilities of openmpi and
>> > > I would like to know if the rankfile behaviour I describe below is
>> > > normal or not.
>> > >
>> > > cat hostfile.0
>> > > r011n002 slots=4
>> > > r011n003 slots=4
>> > >
>> > > cat rankfile.0
>> > > rank 0=r011n002 slot=0
>> > > rank 1=r011n003 slot=1
>> > >
>> > > ##################################################################################
>> > >
>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname    ### OK
>> > > r011n002
>> > > r011n003
>> > >
>> > > ##################################################################################
>> > > but
>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
>> > > ### CRASHED
>> > >
>> > > --------------------------------------------------------------------------
>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>> > > --------------------------------------------------------------------------
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > rmaps_rank_file.c at line 404
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > base/rmaps_base_map_job.c at line 87
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > base/plm_base_launch_support.c at line 77
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > plm_rsh_module.c at line 985
>> > > --------------------------------------------------------------------------
>> > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>> > > launch so we are aborting.
>> > >
>> > > There may be more information reported by the environment (see above).
>> > >
>> > > This may be because the daemon was unable to find all the needed shared
>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> > > location of the shared libraries on the remote nodes and this will
>> > > automatically be forwarded to the remote nodes.
>> > > --------------------------------------------------------------------------
>> > > --------------------------------------------------------------------------
>> > > orterun noticed that the job aborted, but has no info as to the process
>> > > that caused that situation.
>> > > --------------------------------------------------------------------------
>> > > orterun: clean termination accomplished
>> > >
>> > > It seems that the rankfile option is not propagated to the second command
>> > > line; there is no global understanding of the ranking inside an mpirun
>> > > command.
>> > >
>> > > ##################################################################################
>> > >
>> > > Assuming that, I tried to provide a rankfile to each command line:
>> > >
>> > > cat rankfile.0
>> > > rank 0=r011n002 slot=0
>> > >
>> > > cat rankfile.1
>> > > rank 0=r011n003 slot=1
>> > >
>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1
>> > > -n 1 hostname    ### CRASHED
>> > >
>> > > [r011n002:28778] *** Process received signal ***
>> > > [r011n002:28778] Signal: Segmentation fault (11)
>> > > [r011n002:28778] Signal code: Address not mapped (1)
>> > > [r011n002:28778] Failing at address: 0x34
>> > > [r011n002:28778] [ 0] [0xffffe600]
>> > > [r011n002:28778] [ 1] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d) [0x5557decd]
>> > > [r011n002:28778] [ 2] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117) [0x555842a7]
>> > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so [0x556098c0]
>> > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
>> > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
>> > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
>> > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
>> > > [r011n002:28778] *** End of error message ***
>> > > Segmentation fault (core dumped)
>> > >
>> > > I hope that I've found a bug, because this kind of capability would be
>> > > very important for me: launching a multi-executable mpirun command line
>> > > and being able to bind my executables and sockets together.
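To make that last point concrete, the kind of binding meant here would look roughly like the snippet below. This is only a sketch: it assumes the socket:core form of the slot= syntax described in the 1.3-series mpirun man page, and it reuses the r001n001/r001n002 master/slave layout from earlier in the thread.

cat rankfile
rank 0=r001n001 slot=0:0    # master.x pinned to socket 0, core 0
rank 1=r001n002 slot=0:0
rank 2=r001n001 slot=1:0    # slave.x pinned to socket 1, core 0
rank 3=r001n002 slot=1:0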
>> > >
>> > > Thanks in advance for your help
>> > >
>> > > Geoffroy
>> >
>> > _______________________________________________
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> ------------------------------
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> End of users Digest, Vol 1218, Issue 2
>> **************************************
>>