Re: [OMPI users] OpenMPI binding all tasks to cpu0, leaving cpu1 idle. (2-cpu system)
Hi, On 10/3/07, jody wrote: > Hi Miguel > I don't know if it's a typo - but actually it should be > mpiexec -np 2 ./mpi-app config.ini > and not > > mpiexec -n 2 ./mpi-app config.ini

Thanks for the remark, you're right, but the man page says -n is a synonym for -np. Kind regards, -- Miguel Sousa Filipe
[OMPI users] OpenMPI binding all tasks to cpu0, leaving cpu1 idle. (2-cpu system)
Hi there, I have a 2-cpu system (linux/x86-64), running openmpi-1.1. I do not specify a hostfile. Lately I'm having performance problems when running my mpi-app this way: mpiexec -n 2 ./mpi-app config.ini

Both mpi-app processes are running on cpu0, leaving cpu1 idle. After reading the mpirun manpage, it seems that openmpi binds tasks to CPUs in a round-robin way, meaning that this should not happen. Given my problem, I assume that it's not detecting that this is a 2-way SMP system (it assumes a UP system) and is binding both tasks to cpu0. Is this correct?

The openmpi-default-hostfile says I should not specify localhost in there, and should let the job dispatcher/rca "detect" the single-node setup. Where should I define/configure, system-wide, that this is a single-node, 2-slot system? I would like to avoid obliging the system's users to pass a hostfile to mpirun/mpiexec. I simply want mpiexec -n N ./mpi-task to do the proper job of _really_ spreading the processes evenly across all the system's CPUs.

Best regards, waiting for your answer. P.S.: should I upgrade to the latest Open MPI to have my problem "automagically" solved? -- Miguel Sousa Filipe
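For reference, a hostfile entry that declares one node with two slots uses Open MPI's standard "slots" syntax, sketched below; whether dropping such an entry into the system-wide openmpi-default-hostfile makes version 1.1 spread the two processes over both CPUs is not confirmed in this thread, so treat this as a sketch of the configuration being asked about rather than a verified fix.

    # a hostfile declaring a single node with 2 slots
    # (could live in $prefix/etc/openmpi-default-hostfile for a system-wide default)
    localhost slots=2

    # then a plain launch, with no per-user hostfile
    mpiexec -n 2 ./mpi-app config.ini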
Re: [OMPI users] efficient memory to memory transfer
On 11/8/06, Miguel Figueiredo Mascarenhas Sousa Filipe wrote: Hi, On 11/8/06, Greg Lindahl wrote: > On Tue, Nov 07, 2006 at 05:02:54PM +0000, Miguel Figueiredo Mascarenhas Sousa Filipe wrote: > > > if your application is on one given node, sharing data is better than > > copying data. > > Unless sharing data repeatedly leads you to false sharing and a loss > in performance.

What does that mean? I did not understand that.

> > > the MPI model assumes you don't have a "shared memory" system.. > > therefore it is "message passing" oriented, and not designed to > > perform optimally on shared memory systems (like SMPs, or numa-CCs). > > For many programs with both MPI and shared memory implementations, the > MPI version runs faster on SMPs and numa-CCs. Why? See the previous > paragraph...

I misunderstood what you said. The reasons I see for that are: 1) having an application designed with MPI in mind and adapting that design to a shared memory implementation afterwards (i.e. using an MPI programming model in a shared memory application); 2) cases where the problem space is better solved using an MPI-oriented programming model.

Shared memory, or multi-threaded, program development requires a different programming model. The MPI model can be better suited to some tasks than the multi-threading approach, and vice-versa. But, for instance, try to benchmark real applications with MPI and POSIX threads implementations on the same numa-cc or big SMP machine: my bet is that the POSIX threads implementation is going to be faster. There are always exceptions, like having a very well designed MPI application but a terrible POSIX threads one, or a design that's just not that adaptable to a POSIX threads programming model (or to an MPI model). -- Miguel Sousa Filipe
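For readers who, like the poster, are unsure what "false sharing" is, here is a minimal illustration (not from the original mail): two threads update two logically independent counters that happen to live on the same cache line, so the line bounces between the CPUs' caches and the "shared" version runs slower than expected; padding the struct so each counter gets its own cache line makes the effect disappear.

    /* false_sharing.c -- compile with e.g. gcc -O2 -std=c99 -pthread */
    #include <pthread.h>
    #include <stdio.h>

    /* a and b are logically independent but sit on the same cache line */
    struct { long a; long b; } counters;
    /* padded variant that avoids false sharing:
       struct { long a; char pad[64]; long b; } counters; */

    static void *bump_a(void *arg) {
        (void)arg;
        for (long i = 0; i < 100000000; i++) counters.a++;
        return NULL;
    }
    static void *bump_b(void *arg) {
        (void)arg;
        for (long i = 0; i < 100000000; i++) counters.b++;
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("a=%ld b=%ld\n", counters.a, counters.b);
        return 0;
    }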
Re: [OMPI users] efficient memory to memory transfer
Hi, On 11/8/06, Greg Lindahl wrote: On Tue, Nov 07, 2006 at 05:02:54PM +0000, Miguel Figueiredo Mascarenhas Sousa Filipe wrote: > if your application is on one given node, sharing data is better than > copying data. Unless sharing data repeatedly leads you to false sharing and a loss in performance.

What does that mean? I did not understand that.

> the MPI model assumes you don't have a "shared memory" system.. > therefore it is "message passing" oriented, and not designed to > perform optimally on shared memory systems (like SMPs, or numa-CCs). For many programs with both MPI and shared memory implementations, the MPI version runs faster on SMPs and numa-CCs. Why? See the previous paragraph...

Of course it does... it's faster to copy data in main memory than it is to do it through any kind of network interface. You can optimize your message passing implementation down to a couple of memory-to-memory copies when ranks are on the same node. In the worst case, even when using local IP addresses to communicate between peers/ranks (on the same node), the operating system doesn't even touch the interface: it will just copy data from a TCP sender buffer to a TCP receiver buffer. In the end, that's always faster than going through a physical network link. But you still have a message passing API that is doing memory-to-memory copies; it's a worse framework for doing memory copies than an API designed just for that. One could argue that MPI is more than a message passing API, since it also provides APIs to apply operators to the data.

But, for instance, try to benchmark real applications with MPI and POSIX threads implementations on the same numa-cc or big SMP machine: my bet is that the POSIX threads implementation is going to be faster. There are always exceptions, like having a very well designed MPI application but a terrible POSIX threads one, or a design that's just not that adaptable to a POSIX threads programming model (or to an MPI model). -- Miguel Sousa Filipe
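As a concrete illustration of the point that a message passing implementation can reduce same-node traffic to memory copies: Open MPI ships a shared-memory BTL ("sm"), and selecting it explicitly (plus "self" for a rank talking to itself) looks roughly like the line below. This is a sketch; the thread does not say which BTLs were actually in use.

    mpirun --mca btl self,sm -np 2 ./mpi-app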
Re: [OMPI users] efficient memory to memory transfer
Hi, On 11/7/06, Chevchenkovic Chevchenkovic wrote: Hi, I had the following setup: a rank 0 process on node 1 wants to send an array of a particular size to the rank 1 process on the same node. 1. What are the optimisations that can be done/invoked while running mpirun to perform this memory to memory transfer efficiently? 2. Is there any performance gain if 2 processes that are exchanging data arrays are kept on the same node rather than on different nodes connected by infiniband?

If your application is on one given node, sharing data is better than copying data. You can do this with the Unix shared memory API, or with the POSIX threads API. If applications share the same address space, and if a copy is necessary, memcpy() is probably the fastest way (while ensuring that data is aligned in memory). However, this by definition does not work for multi-computer applications/systems.

If you can have one application per node, with several threads per node, consider using MPI only between applications, and set up your MPI framework to launch one application per node. Program your application to use #threads per rank (node), use the POSIX threading model for parallel execution within each node (for instance, with #threads == NCPUS), and use MPI for communicating between nodes.

The MPI model assumes you don't have a "shared memory" system; therefore it is "message passing" oriented, and not designed to perform optimally on shared memory systems (like SMPs, or numa-CCs). best regards, -- Miguel Sousa Filipe
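A minimal sketch of the hybrid layout described above: one MPI rank per node, POSIX threads inside the node, and MPI calls kept outside the threaded region. NTHREADS, do_local_work and the launch line are illustrative assumptions, not something taken from the original mail.

    /* hybrid.c -- compile with e.g. mpicc -std=c99 -pthread hybrid.c */
    #include <mpi.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 2                 /* e.g. NCPUS on each node */

    static void *do_local_work(void *arg) {
        int tid = *(int *)arg;
        (void)tid;
        /* ... compute on this thread's slice of the node-local data ... */
        return NULL;
    }

    int main(int argc, char **argv) {
        int rank, size, i;
        pthread_t th[NTHREADS];
        int ids[NTHREADS];

        MPI_Init(&argc, &argv);        /* MPI is used only between nodes */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("rank %d of %d starting %d threads\n", rank, size, NTHREADS);

        for (i = 0; i < NTHREADS; i++) {   /* threads share memory inside the node */
            ids[i] = i;
            pthread_create(&th[i], NULL, do_local_work, &ids[i]);
        }
        for (i = 0; i < NTHREADS; i++)
            pthread_join(th[i], NULL);

        /* exchange per-node results here with MPI_Send/MPI_Recv or collectives */
        MPI_Finalize();
        return 0;
    }

Launched with something like "mpirun -np <number_of_nodes> -bynode ./hybrid" so that the ranks land on distinct nodes.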
Re: [OMPI users] dual Gigabit ethernet support
Hello, On 10/23/06, Jayanta Roy wrote: Hi, Some time ago I posted doubts about using dual gigabit support fully. See, I get a ~140MB/s full duplex transfer rate in each of the following runs.

That's impressive, since it's _more_ than the theoretical limit of 1Gb Ethernet: 140 MB/s = 140 x 8 Mbit/s = 1120 Megabit/s > 1 Gigabit/s...

mpirun --mca btl_tcp_if_include eth0 -n 4 -bynode -hostfile host a.out
mpirun --mca btl_tcp_if_include eth1 -n 4 -bynode -hostfile host a.out

How do I combine these two ports, or use a proper routing table in place of the host file? I am using the openmpi-1.1 version. -Jayanta

-- Miguel Sousa Filipe
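On the question of combining the two ports: Open MPI's btl_tcp_if_include parameter accepts a comma-separated list of interfaces, so a single run can be pointed at both NICs as sketched below. Whether the 1.1 TCP BTL actually stripes traffic across both links, rather than just using whichever interface reaches each peer, is not confirmed here.

    mpirun --mca btl_tcp_if_include eth0,eth1 -n 4 -bynode -hostfile host a.out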
Re: [OMPI users] using both 64 and 32 bit mpi
Hi, On 9/28/06, Glenn Johnson wrote: I have an 8-way AMD64 system. I built a 64 bit open-mpi-1.1 implementation and then compiled software to use it. That all works fine. In addition, I have a 32 bit binary program (Schrodinger Jaguar) that I would like to run on this machine with MPI. Schrodinger provides source code to build an MPI compatibility layer. This compatibility layer allows Jaguar to use a different MPI implementation than the one the software was compiled with. I do not want to give up the 64 bit open-mpi that I already have and am using. So my questions are: 1. Can I build/install a 32 bit version of open-mpi even though I already have a 64 bit version installed? 2. What "tricks" might I need to do to make sure a program calls the correct version of MPI (32 or 64 bit)? 3. Would I do better running Jaguar in a 32 bit chroot environment?

I'm a simple user who is using Open MPI in both 32-bit and 64-bit mode on an amd64 Linux (64-bit) system, and I also have Open MPI and MPICH2 in a 32-bit chroot. I've installed the 32-bit Open MPI in /opt/openmpi-1.1.1-32bits (the amd64 one is installed in /opt/openmpi-1.1.1-amd64), and it's easy to switch between the two, either with symlinks, PATHs or scripts, to use one or the other. The chroot was a lot more trouble, and I did it because I wasn't able to compile Open MPI (1.0 at the time) in 32-bit mode on my 64-bit system.

Thanks. -- Glenn Johnson USDA, ARS, SRRC -- Miguel Sousa Filipe
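A sketch of the side-by-side install described above, assuming a multilib gcc on the 64-bit host; the exact flags used on that system are not in the mail, so the -m32 settings and paths below are assumptions.

    # 32-bit build into its own prefix
    ./configure --prefix=/opt/openmpi-1.1.1-32bits \
        CFLAGS=-m32 CXXFLAGS=-m32 FFLAGS=-m32
    make && make install

    # switch installs per shell session (or wrap this in a small script)
    export PATH=/opt/openmpi-1.1.1-32bits/bin:$PATH
    export LD_LIBRARY_PATH=/opt/openmpi-1.1.1-32bits/lib:$LD_LIBRARY_PATH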
Re: [OMPI users] Does current Intel dual processor support MPI?
Hi all, Well, I just wanted to say that from a software engineering (and also computer science) point of view, OMP, MPI and threads are completely different models for parallel computation/concurrent programming. I do not believe that any capable engineer (or good programmer for all I know) can know in advance which is the "best" one to use without knowing the problem space, the design requirements, etc. It's not just about "portability" or code "readability/maintainability". Deciding which to use will depend on (and therefore influence) the application architecture. Should I use OMP on a web server such as Apache or Tomcat, providing that way better portability and code readability? Should I use OMP or threading for a massively parallel system such as Blue Gene/L? What about an SGI Altix 3000? Should I use threading on a 2-CPU shared memory system for a sequential application where I just need to speed up some highly vectorizable loops?

For instance, my thesis dealt with parallelizing a seismic simulation application; I did a threaded and an MPI version. The threaded version, since "tasks" could share massive amounts of data with very little lock contention, could work on bigger data sets than the MPI version (given the same total amount of RAM). But the MPI version could run on clusters, while with threading I needed a single system image. OpenMP was inadequate since it would have had a much bigger sequential execution time, providing inadequate speedup for an algorithm which was very parallel. Speedups measured for the threaded version and the MPI version were about 1.99 in 2-cpu mode (<1% of sequential computation). In MPI, with 16 cpus (1 gigabit link for 8 x 2-cpu nodes), the measured speedup was 14.8. My threaded version would never achieve a 14.8 speedup, even on an "SSI" 8-node cluster. The effort applied to make the MPI version so scalable was _much_ bigger than for the threaded one (designing a new concurrent, distributed algorithm to replace one that was sequential and that, in the sequential application, amounted to 1% of the computation time). The MPI version uses more RAM per process, but can scale up to 64/128 nodes, depending on the problem size, and it doesn't require a shared memory system. My threaded version, on a shared memory system with lots of cpus, would scale quite a lot, but probably the aggregate bandwidth would be inferior to a cluster with the same amount of cpus/RAM (normally, big SMP or NUMA systems have higher RAM latency and bandwidth that does not scale proportionally). Basically, I can't predict which performs better. So, I hope it's understandable that choosing the right parallel computing model isn't just a matter of "taste".

On 9/6/06, George Bosilca wrote: From my perspective some (let's say #1 and #2) of the most important features of an application that has to last for a while are readability and portability. And OMP code is far more readable than pthread code. The loops look like loops, the critical sections are obvious and the sequential meaning of the program is preserved.

On Sep 5, 2006, at 7:52 PM, Durga Choudhury wrote: > My opinion would be to use pthreads, for a couple of reasons: > > 1. You don't need an OMP aware compiler; any old compiler would do.

Compilers can be downloaded for free these days, and most of them now have OMP support, on all operating systems (i.e. even the free Microsoft compiler now has OMP support, and Windows was definitely not the platform I expect to use for my OMP tasks).

> 2. The pthread library is more well adapted and hence might be more > optimized than the code emitted from an OMP compiler.

The pthread library adds a huge overhead to all operations. At this level of granularity you quite often need atomic locks and operations, not critical sections protected by mutexes. Unfortunately, there is no portable library that gives you a common interface to atomic operations (there was a BSD one at one point). Moreover, using threads instead of OMP directives moves the burden onto the programmer. Most people just cannot afford a one-year student who has to first understand the code and then add the correct pthread calls. And for what result... you don't even know that you will get the fastest version. On the other side, OMP compilers are getting smarter and smarter every day. Today the results are quite impressive; just imagine what will happen in a few years.

> > If your operating system is Linux, you may use the clone() system > call directly; this would add further optimization at the expense > of portability.

It's always a trade-off between performance and portability. What do you want to lose in order to get the 1% performance gain... And in this case the only performance gain you will get is when you start the threads; otherwise you will not improve anything. Generally, people prefer to use thread pools in order to avoid the overhead of creating and destroying threads all the time. george. > > Durga > > > On 9/5/06, George Bosilca wrote: > On Sep 5, 2006,
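To make the readability point concrete, here is a tiny sketch (not taken from the thread): with OMP the parallel loop still reads as a loop, while a pthreads version of the same thing needs explicit thread creation, work partitioning and joining.

    /* compile with e.g. gcc -fopenmp -std=c99 */
    void scale(double *v, int n, double s)
    {
        #pragma omp parallel for       /* the loop is still visibly a loop */
        for (int i = 0; i < n; i++)
            v[i] *= s;
    }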
Re: [OMPI users] OpenMPI error in the simplest configuration...
Hi there, On 8/27/06, Durga Choudhury wrote: Hi all, I am getting an error (details follow) in the simplest of possible test scenarios: two identical regular Dell PCs connected back-to-back via an ethernet switch on 10/100 ethernet. Both run Fedora Core 4. An identical version (1.1) of Open MPI is compiled and installed on both of them *without* a --prefix option (i.e. installed in the default location of /usr/local). The hostfile on both machines is the same:

cat ~/hostfile
192.168.22.29
192.168.22.103

I can run Open MPI on either of these two machines by forking two processes: mpirun -np 2 osu_acc_latency <-- This runs fine on either of the two machines. However, when I try to launch the same program across the two machines, I get an error:

mpirun --hostfile ~/hostfile -np 2 /home/durga/openmpi-1.1/osu_benchmarks/osu_acc_latency
durga@192.168.22.29's password: foobar
/home/durga/openmpi-1.1/osu_benchmarks/osu_acc_latency: error while loading shared libraries: libmpi.so.0: cannot open shared object file: No such file or directory.

However, the file *does exist* in /usr/local/lib: ls -l /usr/local/lib/libmpi.so.0 libmpi.so.0 -> libmpi.so.0.0.0 I have also tried adding /usr/local/lib to my LD_LIBRARY_PATH on *both* machines, to no avail.

First off: I'm not from the Open MPI team. Where did you add your LD_LIBRARY_PATH? Make sure it is in one of the profile files for your shell, such as .bash_profile, and _not_ .bashrc; that's a typical error in these kinds of configurations. best regards

Any help is greatly appreciated. Thanks Durga -- Miguel Sousa Filipe
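Two common ways to make libmpi.so.0 findable on the remote node are sketched below. Note that for non-interactive ssh commands bash typically reads ~/.bashrc rather than ~/.bash_profile, so it can be worth exporting the path in both; the mpirun --prefix option, which prepends the matching bin/ and lib/ directories on the remote nodes, is an alternative, assuming the installed version supports it.

    # in the remote account's shell startup file(s)
    export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

    # or point mpirun at the install prefix used on the remote nodes
    mpirun --prefix /usr/local --hostfile ~/hostfile -np 2 \
        /home/durga/openmpi-1.1/osu_benchmarks/osu_acc_latency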
Re: [OMPI users] OpenMPI-1.1 virtual memory overhead
Hi, On 8/25/06, Sven Stork wrote: Hello Miguel, this is caused by the shared memory mempool. By default this shared memory mapping has a size of 512 MB. You can use the "mpool_sm_size" parameter to reduce its size, e.g. mpirun -mca mpool_sm_size ...

Is using mpirun -mca mpool_sm_size 0 acceptable? To what will it fall back? Sockets? Pipes? TCP? Smoke signals? Thank you very much for the fast answer.

Thanks, Sven

On Friday 25 August 2006 15:04, Miguel Figueiredo Mascarenhas Sousa Filipe wrote: > Hi there, > I'm using openmpi-1.1 on a linux-amd64 machine and also a linux-32bit x86 > chroot environment on that same machine. > (distro is gentoo, compilers: gcc-4.1.1 and gcc-3.4.6) > > In both cases openmpi-1.1 shows a +/-400MB overhead in virtual memory usage > (virtual address space usage) for each MPI process. > > In my case this is quite troublesome because my application in 32bit mode is > counting on using the whole 4GB address space for the problem set size and > associated data. > This means that I have a reduction in the size of the problems which it can > solve. > (my application isn't 64bit safe yet, so I need to run in 32bit mode, and use > the 4GB address space effectively) > > > Is there a way to tweak this overhead, by configuring openmpi to use smaller > buffers, or anything else? > > I do not see this with mpich2. > > Best regards, > > -- > Miguel Sousa Filipe

-- Miguel Sousa Filipe
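A sketch of how the suggested parameter is passed on the command line; whether the value is interpreted in bytes in this version, and what exactly a value of 0 falls back to, is not stated in the thread, so the number below only illustrates the syntax.

    # ask for a smaller shared-memory mempool (64 MB, if the value is in bytes)
    mpirun -mca mpool_sm_size 67108864 -np 2 ./mpi-app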
[OMPI users] OpenMPI-1.1 virtual memory overhead
Hi there, I'm using openmpi-1.1 on a linux-amd64 machine and also in a linux-32bit x86 chroot environment on that same machine. (distro is gentoo; compilers: gcc-4.1.1 and gcc-3.4.6)

In both cases openmpi-1.1 shows a +/-400MB overhead in virtual memory usage (virtual address space usage) for each MPI process. In my case this is quite troublesome because my application in 32bit mode is counting on using the whole 4GB address space for the problem set size and associated data. This means that I have a reduction in the size of the problems which it can solve. (my application isn't 64bit safe yet, so I need to run in 32bit mode and use the 4GB address space effectively)

Is there a way to tweak this overhead, by configuring openmpi to use smaller buffers, or anything else? I do not see this with mpich2. Best regards, -- Miguel Sousa Filipe