Re: [OMPI users] torque integration when tm ras/plm isn't compiled in.
OMPI_MCA_orte_leave_session_attached=1

Note: this does set limits on scale, though, if the system uses an ssh launcher. There are system limits on the number of open ssh sessions you can have at any one time. For all other launchers, no limit issues exist that I know about.

HTH
Ralph

On Oct 22, 2009, at 5:18 PM, Roy Dragseth wrote:
> Is it also possible to disable the backgrounding of the orted daemons? When
> they fork into the background one loses the feedback about cpu usage in the
> job.
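Putting the two MCA knobs from this thread together, a Torque job script could set both up front. This is only a sketch: it assumes Open MPI 1.3.x (where these parameter names apply), and `./my_app` is a placeholder for the real application.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=4
# Tell mpirun to use the Torque node list without an explicit --hostfile:
export OMPI_MCA_orte_default_hostfile=$PBS_NODEFILE
# Keep the orteds attached so per-node cpu usage stays visible in the job:
export OMPI_MCA_orte_leave_session_attached=1
mpirun ./my_app
```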
Re: [OMPI users] torque integration when tm ras/plm isn't compiled in.
On Friday 23 October 2009 00:50:00 Ralph Castain wrote:
> Why not just
>
> setenv OMPI_MCA_orte_default_hostfile $PBS_NODEFILE
>
> assuming you are using 1.3.x, of course. If not, then you can use the
> equivalent for 1.2 - ompi_info would tell you the name of it.

THANKS! Just what I was looking for. I'd been looking up and down for it, but couldn't find the right swear words.

Is it also possible to disable the backgrounding of the orted daemons? When they fork into the background one loses the feedback about cpu usage in the job. Not really a big issue though...

Regards,
r.

--
The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
phone: +47 77 64 41 07, fax: +47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.drags...@uit.no
Re: [OMPI users] torque integration when tm ras/plm isn't compiled in.
Why not just

setenv OMPI_MCA_orte_default_hostfile $PBS_NODEFILE

assuming you are using 1.3.x, of course. If not, then you can use the equivalent for 1.2 - ompi_info would tell you the name of it.

On Oct 22, 2009, at 4:29 PM, Roy Dragseth wrote:
> What I want now is a way to get rid of the --hostfile $PBS_NODEFILE in the
> mpirun command. Is there an environment variable that I can set so that
> mpirun grabs the right nodelist?
[OMPI users] torque integration when tm ras/plm isn't compiled in.
Hi all.

I'm trying to create a tight integration between torque and openmpi for cases where the tm ras and plm aren't compiled into openmpi. This scenario is common for linux distros that ship openmpi. Of course the ideal solution is to recompile openmpi with torque support, but this isn't always feasible, since I do not want to support my own version of openmpi for the stuff I'm distributing to others. We also see some proprietary applications shipping their own embedded openmpi libraries where the tm plm/ras is either missing or non-functional with the torque installation on our system.

So far I've created a pbsdshwrapper.py that mimics ssh behaviour closely enough that starting the orteds on all the hosts works as expected and the application starts correctly when I use

setenv OMPI_MCA_plm_rsh_agent "pbsdshwrapper.py"
mpirun --hostfile $PBS_NODEFILE ...

What I want now is a way to get rid of the --hostfile $PBS_NODEFILE in the mpirun command. Is there an environment variable that I can set so that mpirun grabs the right nodelist?

By spelunking the code I found that the rsh plm has support for SGE: it automatically picks up PE_HOSTFILE if it detects that it is launched within an SGE job. Would it be possible to have the same functionality for torque? The code looks a bit too complex at first sight for me to fix this myself.

Best regards,
Roy.
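The poster's pbsdshwrapper.py itself isn't shown in the thread. As a rough illustration of the idea, here is a hypothetical shell equivalent: mpirun invokes the rsh agent like ssh, as `agent <host> <command ...>`, and the wrapper hands the command to Torque's `pbsdsh` instead (the function name and this exact calling convention are assumptions, not the poster's code).

```shell
# Hypothetical sketch of an ssh-replacement launch agent for Torque.
# mpirun calls the rsh agent as: agent <hostname> <command and args...>
launch_via_pbsdsh() {
  host="$1"; shift
  # pbsdsh -h runs the command on the named host through Torque's TM
  # interface, so the orted stays under the batch system's control.
  pbsdsh -h "$host" "$@"
}
```

In a real wrapper this function body would be the whole script, installed somewhere on $PATH and named via OMPI_MCA_plm_rsh_agent.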
Re: [OMPI users] [OMPI devel] processor affinity -- OpenMPI / batch system integration
SGE might want to be aware that PLPA has now been deprecated -- we're doing all future work on "hwloc" (hardware locality). That is, hwloc represents the merger of PLPA and libtopology from INRIA. The majority of the initial code base came from libtopology; more PLPA-like features will come in over time (e.g., embedding capabilities).

hwloc provides all kinds of topology information about the machine. The first release of hwloc -- v0.9.1 -- will be "soon" (we're in rc status right now), but it will not include PLPA-like embedding capabilities. Embedding is slated for v1.0.

Come join our mailing lists if you're interested: http://www.open-mpi.org/projects/hwloc/

On Oct 22, 2009, at 11:26 AM, Rayson Ho wrote:
> Yes, on page 14 of the presentation: "Support for OpenMPI and OpenMP through
> -binding [pe|env] linear|striding" -- SGE performs no binding, but instead
> it outputs the binding decision to OpenMPI. Support for OpenMPI's binding
> is part of the "Job to Core Binding" project.

--
Jeff Squyres
Cisco Systems
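For the Open MPI side of this binding discussion, the 1.2/1.3 series already exposes a coarse knob of its own. A sketch only: `./my_app` is a placeholder, and this assumes the job owns the whole node.

```shell
# Ask Open MPI itself to bind one process per processor; mpi_paffinity_alone
# is the MCA parameter for this in the 1.2/1.3 series. Only appropriate when
# no other job shares the node.
mpirun --mca mpi_paffinity_alone 1 -np 8 ./my_app
```

From 1.3.4 onward (as Ralph notes in this thread), an externally imposed binding, e.g. one set by grid engine, is detected and respected instead.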
Re: [OMPI users] wrong rank and process number
Typically, errors like this mean that you've got a mismatch between the Open MPI that you compiled your application with, the mpirun that you're using to launch the application, and/or the LD_LIBRARY_PATH used to load the dynamic library libmpi.so (as Ralph stated/alluded). Ensure that you have an exact match between all three of these things: compile your MPI app with the exact same version of Open MPI that you use for mpirun to launch the app, and ensure that LD_LIBRARY_PATH is pointing to the same libmpi.so (or the right libmpi.so is in your system's default linker search paths).

On Oct 22, 2009, at 4:48 PM, Ralph Castain wrote:
> If you google this list for entries about ubuntu, you will find a bunch of
> threads discussing problems on that platform.

--
Jeff Squyres
jsquy...@cisco.com
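A quick way to check that all three pieces line up is sketched below. Illustrative commands only: `hello.c` is the program from this thread, and the grep pattern assumes the usual ompi_info output format.

```shell
# 1. Which compiler wrapper and launcher are actually first in $PATH?
which mpicc mpirun

# 2. Which Open MPI version do the tools report?
ompi_info | grep "Open MPI:"

# 3. Which libmpi.so will the compiled binary load at run time?
mpicc hello.c -o hello
ldd ./hello | grep libmpi
```

If step 3 shows a libmpi.so from a different installation than the mpirun in step 1, you have found the mismatch.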
Re: [OMPI users] wrong rank and process number
If you google this list for entries about ubuntu, you will find a bunch of threads discussing problems on that platform. This sounds like one of the more common ones - I forget all the issues, but most dealt with ubuntu shipping a very old OMPI version, and with ensuring your path and ld_library_path are pointing to the correct distribution. I also believe there were issues with figuring out that we are launching locally, and other such things.

On Oct 22, 2009, at 1:40 PM, Victor Rosas Garcia wrote:
> Any ideas as to why I keep getting wrong numbers for rank and number of
> processes?
[OMPI users] wrong rank and process number
Hello everybody,

I have just installed openmpi v. 1.2.5 under ubuntu 8.04 and I have compiled the following "hello world" program:

--start code--
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, len;
    char hostname[256] = "";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(hostname, &len);

    printf("Hello World! I am %d of %d running on %s\n", rank, size, hostname);

    MPI_Finalize();

    return 0;
}
--end code--

which I think is pretty basic stuff. Anyway, when I run it (single node, three cores), I get the following output:

mpiexec.openmpi -np 6 hello

Hello World! I am 0 of 1 running on el-torito
Hello World! I am 0 of 1 running on el-torito
Hello World! I am 0 of 1 running on el-torito
Hello World! I am 0 of 1 running on el-torito
Hello World! I am 0 of 1 running on el-torito
Hello World! I am 0 of 1 running on el-torito

where "el-torito" is the hostname. Shouldn't it be something like:

Hello World! I am 0 of 6 running on el-torito
Hello World! I am 1 of 6 running on el-torito
Hello World! I am 2 of 6 running on el-torito
...
etc.

Any ideas as to why I keep getting wrong numbers for rank and number of processes?

Greetings from Monterrey, Mexico
--
Victor M. Rosas García
Re: [OMPI users] MPI_File_open return error code 16
On Thu, Aug 27, 2009 at 09:23:20AM +0800, Changsheng Jiang wrote:
> Hi List,
>
> I am learning MPI.

Welcome! Sorry for the several-months lateness of my reply: I check in on OpenMPI only occasionally, looking for MPI-IO questions.

> A small code snippet that tries to open a file with MPI_File_open gets
> error 16 (an internal error code) on a single server with OpenMPI 1.3.3,
> but it runs correctly on another server (with OpenMPI 1.3.2).
>
> How can I fix this problem? Thanks.
>
> This is the snippet:
>
> int main(int argc, char *argv[]) {
>     MPI_File fh;
>     MPI_Init(&argc, &argv);
>     int ret = MPI_File_open(
>         MPI_COMM_WORLD, "temp",
>         MPI_MODE_RDWR | MPI_MODE_CREATE,
>         MPI_INFO_NULL, &fh);
>     if (ret != MPI_SUCCESS) {
>         fprintf(stderr, "open file failed, code=%d\n", ret);
>     } else {
>         MPI_File_close(&fh);
>     }
>     MPI_Finalize();
>     return 0;
> }

The error code itself isn't very interesting, but if you turn it into a human-readable string with the MPI_Error_string() routine, you may get a hint as to what is causing the problem.

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
[OMPI users] installation with two different compilers
Hi:

Surely there is a howto for this somewhere, though I have not found it.

With Debian Linux amd64 lenny I have a working installation of openmpi-1.2.6, intel-compiled, with mpif90 etc. in /usr/local/bin. For the program OCTOPUS I need a gfortran-compiled openmpi. I did so with openmpi-1.3.3, with mpif90 etc. (as symlinks to /opt/bin/opal_wrapper) in /opt/bin. The compilation was carried out as follows:

./configure --prefix=/opt FC=/usr/bin/gfortran F77=/usr/bin/gfortran CC=/usr/bin/gcc CXX=/usr/bin/g++ --with-libnuma=/usr/lib

As I am the only user, I tend to place everything in my .bashrc. Specifically:

# For openmpi-1.2.6, Intel compiler
if [ "$LD_LIBRARY_PATH" ] ; then
    export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/lib"
else
    export LD_LIBRARY_PATH="/usr/local/lib"
fi

# For openmpi-1.3.3, gnu (gfortran) compiled
if [ "$LD_LIBRARY_PATH" ] ; then
    export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/opt/lib"
else
    export LD_LIBRARY_PATH="/opt/lib"
fi

The command ompi_info always reports on the intel-compiled installation. How can I get info about the gfortran-compiled one (and its runtime)?
If relevant:

francesco@tya64:~$ $PATH
bash: /opt/etsf/bin:/usr/local/delphi_4.1.1:/opt/intel/cce/10.1.015/bin:/opt/intel/fce/10.1.015/bin:/usr/local/bin/vmd:/usr/local/chimera/bin:/opt/etsf/bin:/usr/local/delphi_4.1.1:/opt/intel/cce/10.1.015/bin:/opt/intel/fce/10.1.015/bin:/usr/local/bin/vmd:/usr/local/chimera/bin:/opt/etsf/bin:/usr/local/delphi_4.1.1:/opt/intel/cce/10.1.015/bin:/opt/intel/fce/10.1.015/bin:/usr/local/bin/vmd:/usr/local/chimera/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/home/francesco/hole2/exe:/usr/local/amber10/exe:/usr/local/dock6/bin:/home/francesco/hole2/exe:/usr/local/amber10/exe:/usr/local/dock6/bin:/home/francesco/hole2/exe:/usr/local/amber10/exe:/usr/local/dock6/bin: No such file or directory

francesco@tya64:~$ $LD_LIBRARY_PATH
bash: /opt/intel/mkl/10.0.1.014/lib/em64t:/opt/intel/cce/10.1.015/lib:/opt/intel/fce/10.1.015/lib:/opt/intel/mkl/10.0.1.014/lib/em64t:/opt/intel/cce/10.1.015/lib:/opt/intel/fce/10.1.015/lib:/usr/local/lib:/opt/lib:/opt/acml4.2.0/gfortran64_mp_int64/lib:/usr/local/lib:/opt/lib:/opt/acml4.2.0/gfortran64_mp_int64/lib: No such file or directory

I do not understand this "No such file or directory"; the libraries exist. Surely there is something wrong with my handling.

Under these conditions, the parallel ./configure of OCTOPUS with

export FC=/opt/bin/mpif90
export CC=/opt/bin/mpicc

(C++ is not used) reports errors of the type:

configure:5478: /opt/bin/mpicc -E conftest.c
conftest.c:13:28: error: ac_nonexistent.h: No such file or directory

and make fails, finding a lot of problems with mpi:
/usr/lib/libmpi.so: undefined reference to `orte_data_value_t_class'
/usr/lib/libmpi.so: undefined reference to `opal_progress_event_increment'
/usr/lib/libmpi_f77.so: undefined reference to `mpi_conversion_fn_null__'
/usr/lib/libmpi.so: undefined reference to `opal_progress_events'
/usr/lib/libmpi.so: undefined reference to `orte_hash_table_set_proc'
/usr/lib/libmpi_f77.so: undefined reference to `mpi_conversion_fn_null'
/usr/lib/libmpi.so: undefined reference to `orte_gpr'
/usr/lib/libmpi.so: undefined reference to `orte_init_stage2'
/usr/lib/libmpi.so: undefined reference to `orte_schema'
/usr/lib/libmpi.so: undefined reference to `orte_gpr_trigger_t_class'
/usr/lib/libmpi_f77.so: undefined reference to `MPI_CONVERSION_FN_NULL'
/usr/lib/libmpi.so: undefined reference to `orte_ns'
/usr/lib/libmpi_f77.so: undefined reference to `ompi_registered_datareps'
/usr/lib/libmpi_f77.so: undefined reference to `mpi_conversion_fn_null_'
/usr/lib/libmpi.so: undefined reference to `opal_progress_mpi_init'
/usr/lib/libmpi.so: undefined reference to `orte_gpr_base_selected_component'
/usr/lib/libmpi.so: undefined reference to `orte_smr'
/usr/lib/libmpi.so: undefined reference to `opal_progress_mpi_disable'
/usr/lib/libmpi.so: undefined reference to `orte_app_context_map_t_class'
/usr/lib/libmpi.so: undefined reference to `orte_ns_name_wildcard'
/usr/lib/libmpi.so: undefined reference to `orte_system_info'
/usr/lib/libmpi.so: undefined reference to `orte_buffer_t_class'
/usr/lib/libmpi.so: undefined reference to `orte_rmgr'
/usr/lib/libmpi.so: undefined reference to `opal_progress_event_decrement'
/usr/lib/libmpi.so: undefined reference to `orte_hash_table_get_proc'
/usr/lib/libmpi.so: undefined reference to `orte_init_stage1'
/usr/lib/libmpi.so: undefined reference to `orte_dss'
/usr/lib/libmpi.so: undefined reference to `orte_gpr_subscription_t_class'
/usr/lib/libmpi.so: undefined reference to `opal_progress_mpi_enable'
make[3]: *** [octopus_mpi] Error 1
make[3]: Leaving directory `/home/francesco/octopus/octopus-3.1.0/src/main'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/francesco/octopus/octopus-3.1.0/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/francesco/octopus/octopus-3.1.0'
make: *** [all] Error 2
francesco@tya64:~/octopus/octopus-3
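Two points in the transcript above have simple explanations. The "bash: ...: No such file or directory" message appears because typing `$PATH` at the prompt asks the shell to *execute* the expanded value as a command name; variables are inspected with echo. And ompi_info reports on the intel build simply because /usr/local/bin precedes /opt/bin in $PATH. A sketch using the /opt paths from the post:

```shell
# Inspect variables with echo instead of executing their expansion:
echo "$PATH"
echo "$LD_LIBRARY_PATH"

# Make ompi_info (and mpif90/mpicc) resolve to the gfortran build in /opt
# by putting its directories first:
export PATH=/opt/bin:$PATH
export LD_LIBRARY_PATH=/opt/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```

Alternatively, call /opt/bin/ompi_info by its full path without touching $PATH at all.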
Re: [OMPI users] Any way to make "btl_tcp_if_exclude" option system wide?
You can also see the FAQ entry: http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

It shows all the ways to set MCA parameters.

On Oct 22, 2009, at 11:31 AM, Mike Hanby wrote:
> We install OpenMPI using a custom built RPM, so I may need to add the option
> to the openmpi-mca-params.conf file when building the RPM. Decisions...

--
Jeff Squyres
jsquy...@cisco.com
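Among the ways the FAQ lists, the environment-variable route follows the standard OMPI_MCA_<param> naming convention; a minimal sketch (the mpirun line is shown as a comment for comparison, since it needs a cluster to run):

```shell
# Environment-variable form: any MCA parameter becomes OMPI_MCA_<name>.
export OMPI_MCA_btl_tcp_if_exclude=lo,eth1

# One-off command-line equivalent:
#   mpirun --mca btl_tcp_if_exclude lo,eth1 -np 4 ./app
```

This suits a module file; for a site-wide default the mca-params.conf file discussed in this thread is the better fit.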
Re: [OMPI users] Any way to make "btl_tcp_if_exclude" option system wide?
Thanks for the link to the Sun HPC ClusterTools manual. I'll read through that.

I'll have to consider which approach is best. Our users are 'supposed' to load the environment module for OpenMPI to properly configure their environment, so the module file would be an easy location to add the variable. That isn't always the case, however, as some users like to do it old school and specify all of the variables in their job script. :-)

We install OpenMPI using a custom built RPM, so I may need to add the option to the openmpi-mca-params.conf file when building the RPM. Decisions...

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Eugene Loh
Sent: Thursday, October 22, 2009 10:12 AM
To: Open MPI Users
Subject: Re: [OMPI users] Any way to make "btl_tcp_if_exclude" option system wide?

Mike Hanby wrote:

> Howdy,
>
> My users are having to use this option with mpirun, otherwise the jobs will
> normally fail with a 111 communication error:
>
> --mca btl_tcp_if_exclude lo,eth1
>
> Is there a way for me to set that MCA option system wide, perhaps via an
> environment variable so that they don't have to remember to use it?

Yes. Maybe you want to use a system-wide configuration file. I don't know where this is "best" documented, but it is at least discussed in the Sun HPC ClusterTools User Guide. (ClusterTools is an Open MPI distribution.) E.g., http://dlc.sun.com/pdf/821-0225-10/821-0225-10.pdf . Look at Chapter 7. The section "Using MCA Parameters as Environment Variables" starts on page 69, but I'm not sure environment variables are really the way to go. I think you want the section "To Specify MCA Parameters Using a Text File", on page 71. The file would look like this:

% cat $OPAL_PREFIX/lib/openmpi-mca-params.conf
btl_tcp_if_exclude = lo,eth1

where $OPAL_PREFIX is where users will be getting OMPI. I'm not 100% sure on the name of that file, but need to run right now.
Re: [OMPI users] Any way to make "btl_tcp_if_exclude" option system wide?
Eugene Loh wrote:
> The file would look like this:
>
> % cat $OPAL_PREFIX/lib/openmpi-mca-params.conf
> btl_tcp_if_exclude = lo,eth1
>
> where $OPAL_PREFIX is where users will be getting OMPI. I'm not 100% sure
> on the name of that file, but need to run right now.

Ah, I think the file name is actually shown on the following page; it's $prefix/etc/openmpi-mca-params.conf (.../etc/... rather than .../lib/...).
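With that filename correction, a minimal sketch of both the system-wide and the per-user placement; the install prefix here is an assumption, so point $OMPI_PREFIX at wherever your Open MPI actually lives:

```shell
# Assumed install prefix for illustration; substitute your real one.
OMPI_PREFIX=${OMPI_PREFIX:-$PWD/opt/openmpi}

# System-wide default: every mpirun under this installation picks it up.
mkdir -p "$OMPI_PREFIX/etc"
echo "btl_tcp_if_exclude = lo,eth1" >> "$OMPI_PREFIX/etc/openmpi-mca-params.conf"

# Per-user alternative, read from the user's home directory:
mkdir -p "$HOME/.openmpi"
echo "btl_tcp_if_exclude = lo,eth1" >> "$HOME/.openmpi/mca-params.conf"
```

Command-line MCA options still override both files, so individual users can experiment without editing anything.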
Re: [OMPI users] [OMPI devel] processor affinity -- OpenMPI / batch system integration
Yes, on page 14 of the presentation: "Support for OpenMPI and OpenMP Through -binding [pe|env] linear|striding" -- SGE performs no binding, but instead it outputs the binding decision to OpenMPI. Support for OpenMPI's binding is part of the "Job to Core Binding" project. Rayson On Thu, Oct 22, 2009 at 10:16 AM, Ralph Castain wrote: > Hi Rayson > > You're probably aware: starting with 1.3.4, OMPI will detect and abide by > external bindings. So if grid engine sets a binding, we'll follow it. > > Ralph > > On Oct 22, 2009, at 9:03 AM, Rayson Ho wrote: > >> The code for the Job to Core Binding (aka. thread binding, or CPU >> binding) feature was checked into the Grid Engine project cvs. It uses >> OpenMPI's Portable Linux Processor Affinity (PLPA) library, and is >> topology and NUMA aware. >> >> The presentation from HPC Software Workshop '09: >> http://wikis.sun.com/download/attachments/170755116/job2core.pdf >> >> The design doc: >> >> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=213897 >> >> Initial support is planned for 6.2 update 5 (current release is update >> 4, so update 5 is likely to be released in the next 2 or 3 months). >> >> Rayson >> >> >> >> On Tue, Sep 30, 2008 at 2:23 PM, Ralph Castain wrote: >>> >>> Note that we would also have to modify OMPI to: >>> >>> 1. recognize these environmental variables, and >>> >>> 2. use them to actually set the binding, instead of using OMPI-internal >>> directives >>> >>> Not a big deal to do, but not something currently in the system. Since we >>> launch through our own daemons (something that isn't likely to change in >>> your time frame), these changes would be required. >>> >>> Otherwise, we could come up with some method by which you could provide >>> mapper information we use. 
While I agree with Jeff that having you tell >>> us >>> which cores to use for each rank would generally be better, it does raise >>> issues when users want specific mapping algorithms that you might not >>> support. For example, we are working on mappers that will take input from >>> the user regarding comm topology plus system info on network wiring >>> topology >>> and generate a near-optimal mapping of ranks. As part of that, users may >>> request some number of cores be reserved for that rank for threading or >>> other purposes. >>> >>> So perhaps both options would be best - give us the list of cores >>> available >>> to us so we can map and do affinity, and pass in your own mapping. Maybe >>> with some logic so we can decide which to use based on whether OMPI or GE >>> did the mapping?? >>> >>> Not sure here - just thinking out loud. >>> Ralph >>> >>> On Sep 30, 2008, at 12:58 PM, Jeff Squyres wrote: >>> On Sep 30, 2008, at 2:51 PM, Rayson Ho wrote: > Restarting this discussion. A new update version of Grid Engine 6.2 > will come out early next year [1], and I really hope that we can get > at least the interface defined. Great! > At the minimum, is it enough for the batch system to tell OpenMPI via > an env variable which core (or virtual core, in the SMT case) to start > binding the first MPI task?? I guess an added bonus would be > information about the number of processors to skip (the stride) > between the sibling tasks?? Stride of one is usually the case, but > something larger than one would allow the batch system to control the > level of cache and memory bandwidth sharing between the MPI tasks... Wouldn't it be better to give us a specific list of cores to bind to? As core counts go up in servers, I think we may see a re-emergence of having multiple MPI jobs on a single server. And as core counts go even *higher*, then fragmentation of available cores over time is possible/likely. 
Would you be giving us a list of *relative* cores to bind to (i.e., "bind to the Nth online core on the machine" -- which may be different than the OS's ID for that processor) or will you be giving us the actual OS virtual processor ID(s) to bind to? -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI users] [OMPI devel] processor affinity -- OpenMPI / batch system integration
Hi Rayson You're probably aware: starting with 1.3.4, OMPI will detect and abide by external bindings. So if grid engine sets a binding, we'll follow it. Ralph On Oct 22, 2009, at 9:03 AM, Rayson Ho wrote: The code for the Job to Core Binding (aka. thread binding, or CPU binding) feature was checked into the Grid Engine project cvs. It uses OpenMPI's Portable Linux Processor Affinity (PLPA) library, and is topology and NUMA aware. The presentation from HPC Software Workshop '09: http://wikis.sun.com/download/attachments/170755116/job2core.pdf The design doc: http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=213897 Initial support is planned for 6.2 update 5 (current release is update 4, so update 5 is likely to be released in the next 2 or 3 months). Rayson On Tue, Sep 30, 2008 at 2:23 PM, Ralph Castain wrote: Note that we would also have to modify OMPI to: 1. recognize these environmental variables, and 2. use them to actually set the binding, instead of using OMPI- internal directives Not a big deal to do, but not something currently in the system. Since we launch through our own daemons (something that isn't likely to change in your time frame), these changes would be required. Otherwise, we could come up with some method by which you could provide mapper information we use. While I agree with Jeff that having you tell us which cores to use for each rank would generally be better, it does raise issues when users want specific mapping algorithms that you might not support. For example, we are working on mappers that will take input from the user regarding comm topology plus system info on network wiring topology and generate a near-optimal mapping of ranks. As part of that, users may request some number of cores be reserved for that rank for threading or other purposes. So perhaps both options would be best - give us the list of cores available to us so we can map and do affinity, and pass in your own mapping. 
Maybe with some logic so we can decide which to use based on whether OMPI or GE did the mapping?? Not sure here - just thinking out loud. Ralph On Sep 30, 2008, at 12:58 PM, Jeff Squyres wrote: On Sep 30, 2008, at 2:51 PM, Rayson Ho wrote: Restarting this discussion. A new update version of Grid Engine 6.2 will come out early next year [1], and I really hope that we can get at least the interface defined. Great! At the minimum, is it enough for the batch system to tell OpenMPI via an env variable which core (or virtual core, in the SMT case) to start binding the first MPI task?? I guess an added bonus would be information about the number of processors to skip (the stride) between the sibling tasks?? Stride of one is usually the case, but something larger than one would allow the batch system to control the level of cache and memory bandwidth sharing between the MPI tasks... Wouldn't it be better to give us a specific list of cores to bind to? As core counts go up in servers, I think we may see a re-emergence of having multiple MPI jobs on a single server. And as core counts go even *higher*, then fragmentation of available cores over time is possible/likely. Would you be giving us a list of *relative* cores to bind to (i.e., "bind to the Nth online core on the machine" -- which may be different than the OS's ID for that processor) or will you be giving us the actual OS virtual processor ID(s) to bind to? -- Jeff Squyres Cisco Systems
Re: [OMPI users] Any way to make "btl_tcp_if_exclude" option system wide?
Mike Hanby wrote: Howdy, My users are having to use this option with mpirun, otherwise the jobs will normally fail with a 111 communication error: --mca btl_tcp_if_exclude lo,eth1 Is there a way for me to set that MCA option system-wide, perhaps via an environment variable so that they don't have to remember to use it? Yes. Maybe you want to use a system-wide configuration file. I don't know where this is "best" documented, but it is at least discussed in the Sun HPC ClusterTools User Guide. (ClusterTools is an Open MPI distribution.) E.g., http://dlc.sun.com/pdf/821-0225-10/821-0225-10.pdf . Look at Chapter 7. The section "Using MCA Parameters as Environment Variables" starts on page 69, but I'm not sure environment variables are really the way to go. I think you want the section "To Specify MCA Parameters Using a Text File", on page 71. The file would look like this: % cat $OPAL_PREFIX/lib/openmpi-mca-params.conf btl_tcp_if_exclude = lo,eth1 where $OPAL_PREFIX is where users will be getting OMPI. I'm not 100% sure of the name of that file, but I need to run right now.
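As a concrete sketch of the text-file approach described above: the paths here are assumptions (recent Open MPI installs typically keep the system-wide file under etc/ rather than lib/; adjust the prefix to wherever your users get OMPI):

```shell
# Sketch: install a system-wide MCA parameter file so users no longer
# need "--mca btl_tcp_if_exclude lo,eth1" on every mpirun command line.
# PREFIX is an assumption -- point it at your actual Open MPI install.
PREFIX="${OPAL_PREFIX:-$PWD/openmpi-prefix}"
mkdir -p "$PREFIX/etc"

cat > "$PREFIX/etc/openmpi-mca-params.conf" <<'EOF'
# Exclude the loopback and non-routable eth1 interfaces from the TCP BTL
btl_tcp_if_exclude = lo,eth1
EOF
```

Command-line `--mca` arguments and `OMPI_MCA_*` environment variables still override anything set in this file, so individual users can experiment without touching the system-wide defaults.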
Re: [OMPI users] [OMPI devel] processor affinity -- OpenMPI / batch system integration
The code for the Job to Core Binding (aka. thread binding, or CPU binding) feature was checked into the Grid Engine project cvs. It uses OpenMPI's Portable Linux Processor Affinity (PLPA) library, and is topology and NUMA aware. The presentation from HPC Software Workshop '09: http://wikis.sun.com/download/attachments/170755116/job2core.pdf The design doc: http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=213897 Initial support is planned for 6.2 update 5 (current release is update 4, so update 5 is likely to be released in the next 2 or 3 months). Rayson On Tue, Sep 30, 2008 at 2:23 PM, Ralph Castain wrote: > Note that we would also have to modify OMPI to: > > 1. recognize these environmental variables, and > > 2. use them to actually set the binding, instead of using OMPI-internal > directives > > Not a big deal to do, but not something currently in the system. Since we > launch through our own daemons (something that isn't likely to change in > your time frame), these changes would be required. > > Otherwise, we could come up with some method by which you could provide > mapper information we use. While I agree with Jeff that having you tell us > which cores to use for each rank would generally be better, it does raise > issues when users want specific mapping algorithms that you might not > support. For example, we are working on mappers that will take input from > the user regarding comm topology plus system info on network wiring topology > and generate a near-optimal mapping of ranks. As part of that, users may > request some number of cores be reserved for that rank for threading or > other purposes. > > So perhaps both options would be best - give us the list of cores available > to us so we can map and do affinity, and pass in your own mapping. Maybe > with some logic so we can decide which to use based on whether OMPI or GE > did the mapping?? > > Not sure here - just thinking out loud. 
> Ralph > > On Sep 30, 2008, at 12:58 PM, Jeff Squyres wrote: > >> On Sep 30, 2008, at 2:51 PM, Rayson Ho wrote: >> >>> Restarting this discussion. A new update version of Grid Engine 6.2 >>> will come out early next year [1], and I really hope that we can get >>> at least the interface defined. >> >> Great! >> >>> At the minimum, is it enough for the batch system to tell OpenMPI via >>> an env variable which core (or virtual core, in the SMT case) to start >>> binding the first MPI task?? I guess an added bonus would be >>> information about the number of processors to skip (the stride) >>> between the sibling tasks?? Stride of one is usually the case, but >>> something larger than one would allow the batch system to control the >>> level of cache and memory bandwidth sharing between the MPI tasks... >> >> Wouldn't it be better to give us a specific list of cores to bind to? As >> core counts go up in servers, I think we may see a re-emergence of having >> multiple MPI jobs on a single server. And as core counts go even *higher*, >> then fragmentation of available cores over time is possible/likely. >> >> Would you be giving us a list of *relative* cores to bind to (i.e., "bind >> to the Nth online core on the machine" -- which may be different than the >> OS's ID for that processor) or will you be giving us the actual OS virtual >> processor ID(s) to bind to? >> >> -- >> Jeff Squyres >> Cisco Systems
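For readers unfamiliar with the binding primitive being debated in the thread above, here is a minimal Linux-only sketch of what "bind to OS virtual processor ID 0" means, using taskset from util-linux. This is the same underlying mechanism (sched_setaffinity) that OMPI's PLPA library wraps, minus the portability and topology awareness the thread discusses:

```shell
# Run a command pinned to OS cpu 0; the child inherits the affinity mask,
# so its own /proc/self/status shows the restricted Cpus_allowed_list.
taskset -c 0 grep Cpus_allowed_list /proc/self/status
# (typically prints "Cpus_allowed_list: 0")

# For comparison, the unbound shell usually reports the full cpu range:
grep Cpus_allowed_list /proc/self/status
```

The "relative vs. actual OS ID" question in the thread matters here: `taskset -c 0` takes the kernel's virtual processor ID, which on a machine with offline or non-contiguous CPUs need not match "the first online core".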
[OMPI users] Any way to make "btl_tcp_if_exclude" option system wide?
Howdy, My users are having to use this option with mpirun, otherwise the jobs will normally fail with a 111 communication error: --mca btl_tcp_if_exclude lo,eth1 Is there a way for me to set that MCA option system wide, perhaps via an environment variable so that they don't have to remember to use it? Thanks, Mike Mike Hanby mha...@uab.edu Information Systems Specialist II IT HPCS / Research Computing
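On the environment-variable route raised in the question: Open MPI reads any variable named `OMPI_MCA_<param>`, so the option can be exported once instead of passed to every mpirun. A sketch (the profile.d path is an assumption; adjust for your site):

```shell
# Sketch: set the MCA parameter through the environment instead of --mca.
# Open MPI picks up any variable of the form OMPI_MCA_<param_name>.
export OMPI_MCA_btl_tcp_if_exclude=lo,eth1

# Dropping the export into a login-shell profile makes it effectively
# system-wide (path is an assumption -- adjust for your distribution):
# echo 'export OMPI_MCA_btl_tcp_if_exclude=lo,eth1' > /etc/profile.d/openmpi-mca.sh
```

Note that environment variables override the system-wide openmpi-mca-params.conf file but are themselves overridden by explicit `--mca` arguments on the mpirun command line.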