Hi,

The result is: everything works fine with MPI executables. Logical!

What I was trying to do was to run non-MPI executables through mpirun. In that case, Open MPI is not able to bind these processes to a particular CPU. My conclusion is that processor affinity is set in MPI_Init, right? Would it be possible to have the paffinity features work without any MPI_Init call, using taskset for example? I agree that it's not your job to support running every kind of executable, but it would be nice!

Thanks again for all your efforts, I really appreciate them. I am looking forward to downloading, trying, and deploying the next official release.

Regards

Geoffroy

2009/5/4 Geoffroy Pignot <geopig...@gmail.com>

> Hi Ralph
>
> Thanks for your extra tests. Before leaving, I just pointed out a problem
> coming from running plpa across different RH distributions (i.e. different
> Linux kernels). Indeed, I configure and compile Open MPI on RHEL 4, then I
> run on RHEL 5. I think my problem comes from that mismatch. I'll do a few
> more tests tomorrow morning (France) and keep you informed.
>
> Regards
>
> Geoffroy
>
>>
>> Message: 2
>> Date: Mon, 4 May 2009 13:34:40 -0600
>> From: Ralph Castain <r...@open-mpi.org>
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To: Open MPI Users <us...@open-mpi.org>
>> Message-ID:
>> <71d2d8cc0905041234m76eb5a9dx57a773997779d...@mail.gmail.com>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Hmmm...I'm afraid I can't replicate the problem. All seems to be working
>> just fine on the RHEL systems available to me. The procs indeed bind to
>> the specified processors in every case.
>> [rhc@odin ~/trunk]$ cat rankfile
>> rank 0=odin001 slot=0
>> rank 1=odin002 slot=1
>>
>> [rhc@odin mpi]$ mpirun -rf ../../../rankfile -n 2 --leave-session-attached
>>     -mca paffinity_base_verbose 5 ./mpi_spin
>> [odin001.cs.indiana.edu:09297] paffinity slot assignment: slot_list == 0
>> [odin001.cs.indiana.edu:09297] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>> [odin002.cs.indiana.edu:13566] paffinity slot assignment: slot_list == 1
>> [odin002.cs.indiana.edu:13566] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>>
>> Suspended
>> [rhc@odin mpi]$ ssh odin001
>> [rhc@odin001 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
>> S    rhc  0  9296  0.0  orted
>> RLl  rhc  0  9297  100  mpi_spin
>>
>> [rhc@odin mpi]$ ssh odin002
>> [rhc@odin002 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
>> S    rhc  0  13562  0.0  orted
>> RLl  rhc  1  13566  102  mpi_spin
>>
>> Not sure where to go from here...perhaps someone else can spot the problem?
>> Ralph
>>
>> On Mon, May 4, 2009 at 8:28 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> > Unfortunately, I didn't write any of that code - I was just fixing the
>> > mapper so it would properly map the procs. From what I can tell, the
>> > proper things are happening there.
>> >
>> > I'll have to dig into the code that specifically deals with parsing the
>> > results to bind the processes. Afraid that will take awhile longer -
>> > pretty dark in that hole.
>> >
>> > On Mon, May 4, 2009 at 8:04 AM, Geoffroy Pignot <geopig...@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> So, there are no more crashes with my "crazy" mpirun command. But the
>> >> paffinity feature seems to be broken. Indeed I am not able to pin my
>> >> processes.
>> >> Simple test with a program using your plpa library:
>> >>
>> >> r011n006% cat hostf
>> >> r011n006 slots=4
>> >>
>> >> r011n006% cat rankf
>> >> rank 0=r011n006 slot=0    ----> bind to CPU 0, exact?
>> >>
>> >> r011n006% /tmp/HALMPI/openmpi-1.4a/bin/mpirun --hostfile hostf --rankfile rankf --wdir /tmp -n 1 a.out
>> >> >>> PLPA Number of processors online: 4
>> >> >>> PLPA Number of processor sockets: 2
>> >> >>> PLPA Socket 0 (ID 0): 2 cores
>> >> >>> PLPA Socket 1 (ID 3): 2 cores
>> >>
>> >> Ctrl+Z
>> >> r011n006% bg
>> >>
>> >> r011n006% ps axo stat,user,psr,pid,pcpu,comm | grep gpignot
>> >> R+  gpignot  3  9271  97.8  a.out
>> >>
>> >> In fact, whatever slot number I put in my rankfile, a.out always runs
>> >> on CPU 3. I was expecting it on CPU 0 according to my cpuinfo file
>> >> (see below). The result is the same if I try another syntax
>> >> (rank 0=r011n006 slot=0:0, bind to socket 0 - core 0, exact?).
>> >>
>> >> Thanks in advance
>> >>
>> >> Geoffroy
>> >>
>> >> PS: I run on rhel5
>> >>
>> >> r011n006% uname -a
>> >> Linux r011n006 2.6.18-92.1.1NOMAP32.el5 #1 SMP Sat Mar 15 01:46:39 CDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>> >>
>> >> My configure is:
>> >> ./configure --prefix=/tmp/openmpi-1.4a --libdir='${exec_prefix}/lib64' --disable-dlopen --disable-mpi-cxx --enable-heterogeneous
>> >>
>> >> r011n006% cat /proc/cpuinfo
>> >> processor       : 0
>> >> vendor_id       : GenuineIntel
>> >> cpu family      : 6
>> >> model           : 15
>> >> model name      : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
>> >> stepping        : 6
>> >> cpu MHz         : 2660.007
>> >> cache size      : 4096 KB
>> >> physical id     : 0
>> >> siblings        : 2
>> >> core id         : 0
>> >> cpu cores       : 2
>> >> fpu             : yes
>> >> fpu_exception   : yes
>> >> cpuid level     : 10
>> >> wp              : yes
>> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>> >> bogomips        : 5323.68
>> >> clflush size    : 64
>> >> cache_alignment : 64
>> >> address sizes   : 36 bits physical, 48 bits virtual
>> >> power management:
>> >>
>> >> processor       : 1
>> >> vendor_id       : GenuineIntel
>> >> cpu family      : 6
>> >> model           : 15
>> >> model name      : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
>> >> stepping        : 6
>> >> cpu MHz         : 2660.007
>> >> cache size      : 4096 KB
>> >> physical id     : 3
>> >> siblings        : 2
>> >> core id         : 0
>> >> cpu cores       : 2
>> >> fpu             : yes
>> >> fpu_exception   : yes
>> >> cpuid level     : 10
>> >> wp              : yes
>> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>> >> bogomips        : 5320.03
>> >> clflush size    : 64
>> >> cache_alignment : 64
>> >> address sizes   : 36 bits physical, 48 bits virtual
>> >> power management:
>> >>
>> >> processor       : 2
>> >> vendor_id       : GenuineIntel
>> >> cpu family      : 6
>> >> model           : 15
>> >> model name      : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
>> >> stepping        : 6
>> >> cpu MHz         : 2660.007
>> >> cache size      : 4096 KB
>> >> physical id     : 0
>> >> siblings        : 2
>> >> core id         : 1
>> >> cpu cores       : 2
>> >> fpu             : yes
>> >> fpu_exception   : yes
>> >> cpuid level     : 10
>> >> wp              : yes
>> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>> >> bogomips        : 5319.39
>> >> clflush size    : 64
>> >> cache_alignment : 64
>> >> address sizes   : 36 bits physical, 48 bits virtual
>> >> power management:
>> >>
>> >> processor       : 3
>> >> vendor_id       : GenuineIntel
>> >> cpu family      : 6
>> >> model           : 15
>> >> model name      : Intel(R) Xeon(R) CPU 5150 @ 2.66GHz
>> >> stepping        : 6
>> >> cpu MHz         : 2660.007
>> >> cache size      : 4096 KB
>> >> physical id     : 3
>> >> siblings        : 2
>> >> core id         : 1
>> >> cpu cores       : 2
>> >> fpu             : yes
>> >> fpu_exception   : yes
>> >> cpuid level     : 10
>> >> wp              : yes
>> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>> >> bogomips        : 5320.03
>> >> clflush size    : 64
>> >> cache_alignment : 64
>> >> address sizes   : 36 bits physical, 48 bits virtual
>> >> power management:
>> >>
>> >>> ------------------------------
>> >>>
>> >>> Message: 2
>> >>> Date: Mon, 4 May 2009 04:45:57 -0600
>> >>> From: Ralph Castain <r...@open-mpi.org>
>> >>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> >>> To: Open MPI Users <us...@open-mpi.org>
>> >>> Message-ID: <d01d7b16-4b47-46f3-ad41-d1a90b2e4...@open-mpi.org>
>> >>> Content-Type: text/plain; charset="us-ascii"; Format="flowed"; DelSp="yes"
>> >>>
>> >>> My apologies - I wasn't clear enough. You need a tarball from r21111
>> >>> or greater...such as:
>> >>>
>> >>> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r21142.tar.gz
>> >>>
>> >>> HTH
>> >>> Ralph
>> >>>
>> >>> On May 4, 2009, at 2:14 AM, Geoffroy Pignot wrote:
>> >>>
>> >>> > Hi,
>> >>> >
>> >>> > I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my
>> >>> > command doesn't work
>> >>> >
>> >>> > cat rankf:
>> >>> > rank 0=node1 slot=*
>> >>> > rank 1=node2 slot=*
>> >>> >
>> >>> > cat hostf:
>> >>> > node1 slots=2
>> >>> > node2 slots=2
>> >>> >
>> >>> > mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 hostname : --host node2 -n 1 hostname
>> >>> >
>> >>> > Error, invalid rank (1) in the rankfile (rankf)
>> >>> > --------------------------------------------------------------------------
>> >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> >>> > rmaps_rank_file.c at line 403
>> >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
base/rmaps_base_map_job.c at line 86 >> >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file >> >>> > base/plm_base_launch_support.c at line 86 >> >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file >> >>> > plm_rsh_module.c at line 1016 >> >>> > >> >>> > >> >>> > Ralph, could you tell me if my command syntax is correct or not ? if >> >>> > not, give me the expected one ? >> >>> > >> >>> > Regards >> >>> > >> >>> > Geoffroy >> >>> > >> >>> > >> >>> > >> >>> > >> >>> > 2009/4/30 Geoffroy Pignot <geopig...@gmail.com> >> >>> > Immediately Sir !!! :) >> >>> > >> >>> > Thanks again Ralph >> >>> > >> >>> > Geoffroy >> >>> > >> >>> > >> >>> > >> >>> > >> >>> > >> >>> > ------------------------------ >> >>> > >> >>> > Message: 2 >> >>> > Date: Thu, 30 Apr 2009 06:45:39 -0600 >> >>> > From: Ralph Castain <r...@open-mpi.org> >> >>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? >> >>> > To: Open MPI Users <us...@open-mpi.org> >> >>> > Message-ID: >> >>> > <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com >> > >> >>> > Content-Type: text/plain; charset="iso-8859-1" >> >>> > >> >>> > I believe this is fixed now in our development trunk - you can >> >>> > download any >> >>> > tarball starting from last night and give it a try, if you like. Any >> >>> > feedback would be appreciated. >> >>> > >> >>> > Ralph >> >>> > >> >>> > >> >>> > On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote: >> >>> > >> >>> > Ah now, I didn't say it -worked-, did I? :-) >> >>> > >> >>> > Clearly a bug exists in the program. I'll try to take a look at it >> >>> > (if Lenny >> >>> > doesn't get to it first), but it won't be until later in the week. 
>> >>> > >> >>> > On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote: >> >>> > >> >>> > I agree with you Ralph , and that 's what I expect from openmpi but >> my >> >>> > second example shows that it's not working >> >>> > >> >>> > cat hostfile.0 >> >>> > r011n002 slots=4 >> >>> > r011n003 slots=4 >> >>> > >> >>> > cat rankfile.0 >> >>> > rank 0=r011n002 slot=0 >> >>> > rank 1=r011n003 slot=1 >> >>> > >> >>> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 >> >>> > hostname >> >>> > ### CRASHED >> >>> > >> >>> > > > Error, invalid rank (1) in the rankfile (rankfile.0) >> >>> > > > >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >> >>> > file >> >>> > > > rmaps_rank_file.c at line 404 >> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >> >>> > file >> >>> > > > base/rmaps_base_map_job.c at line 87 >> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >> >>> > file >> >>> > > > base/plm_base_launch_support.c at line 77 >> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >> >>> > file >> >>> > > > plm_rsh_module.c at line 985 >> >>> > > > >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > > A daemon (pid unknown) died unexpectedly on signal 1 while >> >>> > > attempting to >> >>> > > > launch so we are aborting. >> >>> > > > >> >>> > > > There may be more information reported by the environment (see >> >>> > > above). >> >>> > > > >> >>> > > > This may be because the daemon was unable to find all the needed >> >>> > > shared >> >>> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH >> to >> >>> > > have the >> >>> > > > location of the shared libraries on the remote nodes and this >> will >> >>> > > > automatically be forwarded to the remote nodes. 
>> >>> > > > >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > > >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > > orterun noticed that the job aborted, but has no info as to the >> >>> > > process >> >>> > > > that caused that situation. >> >>> > > > >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > > orterun: clean termination accomplished >> >>> > >> >>> > >> >>> > >> >>> > Message: 4 >> >>> > Date: Tue, 14 Apr 2009 06:55:58 -0600 >> >>> > From: Ralph Castain <r...@lanl.gov> >> >>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? >> >>> > To: Open MPI Users <us...@open-mpi.org> >> >>> > Message-ID: <f6290ada-a196-43f0-a853-cbcb802d8...@lanl.gov> >> >>> > Content-Type: text/plain; charset="us-ascii"; Format="flowed"; >> >>> > DelSp="yes" >> >>> > >> >>> > The rankfile cuts across the entire job - it isn't applied on an >> >>> > app_context basis. So the ranks in your rankfile must correspond to >> >>> > the eventual rank of each process in the cmd line. >> >>> > >> >>> > Unfortunately, that means you have to count ranks. In your case, you >> >>> > only have four, so that makes life easier. Your rankfile would look >> >>> > something like this: >> >>> > >> >>> > rank 0=r001n001 slot=0 >> >>> > rank 1=r001n002 slot=1 >> >>> > rank 2=r001n001 slot=1 >> >>> > rank 3=r001n002 slot=2 >> >>> > >> >>> > HTH >> >>> > Ralph >> >>> > >> >>> > On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote: >> >>> > >> >>> > > Hi, >> >>> > > >> >>> > > I agree that my examples are not very clear. What I want to do is >> to >> >>> > > launch a multiexes application (masters-slaves) and benefit from >> the >> >>> > > processor affinity. 
>> >>> > > Could you show me how to convert this command , using -rf option >> >>> > > (whatever the affinity is) >> >>> > > >> >>> > > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host >> r001n002 >> >>> > > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 - >> >>> > > host r001n002 slave.x options4 >> >>> > > >> >>> > > Thanks for your help >> >>> > > >> >>> > > Geoffroy >> >>> > > >> >>> > > >> >>> > > >> >>> > > >> >>> > > >> >>> > > Message: 2 >> >>> > > Date: Sun, 12 Apr 2009 18:26:35 +0300 >> >>> > > From: Lenny Verkhovsky <lenny.verkhov...@gmail.com> >> >>> > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ?? >> >>> > > To: Open MPI Users <us...@open-mpi.org> >> >>> > > Message-ID: >> >>> > > < >> 453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com> >> >>> > > Content-Type: text/plain; charset="iso-8859-1" >> >>> > > >> >>> > > Hi, >> >>> > > >> >>> > > The first "crash" is OK, since your rankfile has ranks 0 and 1 >> >>> > > defined, >> >>> > > while n=1, which means only rank 0 is present and can be >> allocated. >> >>> > > >> >>> > > NP must be >= the largest rank in rankfile. >> >>> > > >> >>> > > What exactly are you trying to do ? >> >>> > > >> >>> > > I tried to recreate your seqv but all I got was >> >>> > > >> >>> > > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile >> >>> > > hostfile.0 >> >>> > > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname >> >>> > > [witch19:30798] mca: base: component_find: paffinity >> >>> > > "mca_paffinity_linux" >> >>> > > uses an MCA interface that is not recognized (component MCA >> >>> > v1.0.0 != >> >>> > > supported MCA v2.0.0) -- ignored >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > It looks like opal_init failed for some reason; your parallel >> >>> > > process is >> >>> > > likely to abort. 
There are many reasons that a parallel process >> can >> >>> > > fail during opal_init; some of which are due to configuration or >> >>> > > environment problems. This failure appears to be an internal >> >>> > failure; >> >>> > > here's some additional information (which may only be relevant to >> an >> >>> > > Open MPI developer): >> >>> > > >> >>> > > opal_carto_base_select failed >> >>> > > --> Returned value -13 instead of OPAL_SUCCESS >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in >> >>> > file >> >>> > > ../../orte/runtime/orte_init.c at line 78 >> >>> > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in >> >>> > file >> >>> > > ../../orte/orted/orted_main.c at line 344 >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > A daemon (pid 11629) died unexpectedly with status 243 while >> >>> > > attempting >> >>> > > to launch so we are aborting. >> >>> > > >> >>> > > There may be more information reported by the environment (see >> >>> > above). >> >>> > > >> >>> > > This may be because the daemon was unable to find all the needed >> >>> > > shared >> >>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to >> >>> > > have the >> >>> > > location of the shared libraries on the remote nodes and this will >> >>> > > automatically be forwarded to the remote nodes. >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > mpirun noticed that the job aborted, but has no info as to the >> >>> > process >> >>> > > that caused that situation. 
>> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > mpirun: clean termination accomplished >> >>> > > >> >>> > > >> >>> > > Lenny. >> >>> > > >> >>> > > >> >>> > > On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote: >> >>> > > > >> >>> > > > Hi , >> >>> > > > >> >>> > > > I am currently testing the process affinity capabilities of >> >>> > > openmpi and I >> >>> > > > would like to know if the rankfile behaviour I will describe >> below >> >>> > > is normal >> >>> > > > or not ? >> >>> > > > >> >>> > > > cat hostfile.0 >> >>> > > > r011n002 slots=4 >> >>> > > > r011n003 slots=4 >> >>> > > > >> >>> > > > cat rankfile.0 >> >>> > > > rank 0=r011n002 slot=0 >> >>> > > > rank 1=r011n003 slot=1 >> >>> > > > >> >>> > > > >> >>> > > > >> >>> > > >> >>> > >> >>> >> ################################################################################## >> >>> > > > >> >>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname ### >> OK >> >>> > > > r011n002 >> >>> > > > r011n003 >> >>> > > > >> >>> > > > >> >>> > > > >> >>> > > >> >>> > >> >>> >> ################################################################################## >> >>> > > > but >> >>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 >> >>> > > hostname >> >>> > > > ### CRASHED >> >>> > > > * >> >>> > > > >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > > Error, invalid rank (1) in the rankfile (rankfile.0) >> >>> > > > >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >> >>> > file >> >>> > > > rmaps_rank_file.c at line 404 >> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >> >>> > file >> >>> > > > base/rmaps_base_map_job.c at line 87 >> >>> > > > [r011n002:25129] [[63976,0],0] 
ORTE_ERROR_LOG: Bad parameter in >> >>> > file >> >>> > > > base/plm_base_launch_support.c at line 77 >> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in >> >>> > file >> >>> > > > plm_rsh_module.c at line 985 >> >>> > > > >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > > A daemon (pid unknown) died unexpectedly on signal 1 while >> >>> > > attempting to >> >>> > > > launch so we are aborting. >> >>> > > > >> >>> > > > There may be more information reported by the environment (see >> >>> > > above). >> >>> > > > >> >>> > > > This may be because the daemon was unable to find all the needed >> >>> > > shared >> >>> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH >> to >> >>> > > have the >> >>> > > > location of the shared libraries on the remote nodes and this >> will >> >>> > > > automatically be forwarded to the remote nodes. >> >>> > > > >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > > >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > > orterun noticed that the job aborted, but has no info as to the >> >>> > > process >> >>> > > > that caused that situation. >> >>> > > > >> >>> > > >> >>> > >> >>> >> -------------------------------------------------------------------------- >> >>> > > > orterun: clean termination accomplished >> >>> > > > * >> >>> > > > It seems that the rankfile option is not propagted to the second >> >>> > > command >> >>> > > > line ; there is no global understanding of the ranking inside a >> >>> > > mpirun >> >>> > > > command. 
>> >>> > > > >> >>> > > > >> >>> > > > >> >>> > > >> >>> > >> >>> >> ################################################################################## >> >>> > > > >> >>> > > > Assuming that , I tried to provide a rankfile to each command >> >>> > line: >> >>> > > > >> >>> > > > cat rankfile.0 >> >>> > > > rank 0=r011n002 slot=0 >> >>> > > > >> >>> > > > cat rankfile.1 >> >>> > > > rank 0=r011n003 slot=1 >> >>> > > > >> >>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf >> >>> > > rankfile.1 >> >>> > > > -n 1 hostname ### CRASHED >> >>> > > > *[r011n002:28778] *** Process received signal *** >> >>> > > > [r011n002:28778] Signal: Segmentation fault (11) >> >>> > > > [r011n002:28778] Signal code: Address not mapped (1) >> >>> > > > [r011n002:28778] Failing at address: 0x34 >> >>> > > > [r011n002:28778] [ 0] [0xffffe600] >> >>> > > > [r011n002:28778] [ 1] >> >>> > > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so. >> >>> > > 0(orte_odls_base_default_get_add_procs_data+0x55d) >> >>> > > > [0x5557decd] >> >>> > > > [r011n002:28778] [ 2] >> >>> > > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so. >> >>> > > 0(orte_plm_base_launch_apps+0x117) >> >>> > > > [0x555842a7] >> >>> > > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/ >> >>> > > mca_plm_rsh.so >> >>> > > > [0x556098c0] >> >>> > > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun >> >>> > > [0x804aa27] >> >>> > > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun >> >>> > > [0x804a022] >> >>> > > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) >> >>> > > [0x9f1dec] >> >>> > > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun >> >>> > > [0x8049f71] >> >>> > > > [r011n002:28778] *** End of error message *** >> >>> > > > Segmentation fault (core dumped)* >> >>> > > > >> >>> > > > >> >>> > > > >> >>> > > > I hope that I've found a bug because it would be very important >> >>> > > for me to >> >>> > > > have this kind of capabiliy . 
>> >>> > > > Launch a multiexe mpirun command line and be able to bind my exes
>> >>> > > > and sockets together.
>> >>> > > >
>> >>> > > > Thanks in advance for your help
>> >>> > > >
>> >>> > > > Geoffroy
>> >>> > > _______________________________________________
>> >>> > > users mailing list
>> >>> > > us...@open-mpi.org
>> >>> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> End of users Digest, Vol 1221, Issue 17
>> ***************************************
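[Editor's note on the taskset question raised at the top of this thread: on Linux, the same pinning that taskset performs can be requested from (or for) any process, MPI or not, via the sched_setaffinity system call. The sketch below is purely illustrative and is not Open MPI's paffinity code; the helper name `pin_to_cpu` is an assumption made up for this example.]

```python
import os

def pin_to_cpu(cpu_id):
    """Pin the calling process to a single CPU, much as `taskset -c` would.

    Hypothetical helper for illustration only; Linux-specific, since it
    relies on os.sched_setaffinity (available in Python 3.3+).
    """
    os.sched_setaffinity(0, {cpu_id})   # pid 0 means "the calling process"
    return os.sched_getaffinity(0)      # read back the effective CPU mask

if __name__ == "__main__":
    # Pick a CPU this process is already allowed to run on, then pin to it.
    target = min(os.sched_getaffinity(0))
    print(pin_to_cpu(target))
```

A wrapper script calling something like this before exec'ing a non-MPI binary would give each launched copy a fixed CPU without any MPI_Init-time paffinity, though choosing a distinct CPU per rank would still require information from the launcher.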