Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-08-18 Thread Gus Correa
Hi Craig, list. Independent of any issues with your GigE switch, which you may need to address, you may want to take a look at the performance of the default OpenMPI MPI_Alltoall algorithm, which you say is a cornerstone of VASP. You can perhaps try alternative algorithms for different message siz
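As a rough sketch of what Gus suggests (parameter names are those of Open MPI's "tuned" collective component; confirm the exact names and algorithm numbers on your build with `ompi_info`, since they can differ between releases):

```shell
# List the tunable parameters of the "tuned" collective component,
# including the available MPI_Alltoall algorithm choices.
ompi_info --param coll tuned

# Override the MPI_Alltoall algorithm at run time: 0 picks the
# built-in decision logic, other values select a fixed algorithm.
mpirun --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_alltoall_algorithm 2 \
       -np 48 ./vasp
```

Benchmarking each algorithm number against your actual VASP message sizes is the only reliable way to pick one; the default decision rules are tuned for typical, not worst-case, networks.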

Re: [OMPI users] users Digest, Vol 1321, Issue 6

2009-08-18 Thread Oskar Enoksson
Patrick Geoffray wrote: > Hi Oskar, > Oskar Enoksson wrote: >> The reason in this case was that cl120 had some kind of hardware problem, perhaps memory error or myrinet NIC hardware error. The system hung. >> I will try MX_ZOMBIE_SEND=0, thanks for the hint! > I would not recommen

Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-08-18 Thread Gerry Creager
Most of that bandwidth is in marketing... Sorry, but it's not a high performance switch. Craig Plaisance wrote: The switch we are using (Dell Powerconnect 6248) has a switching fabric capacity of 184 Gb/s, which should be more than adequate for the 48 ports. Is this the same as backplane ban

Re: [OMPI users] orte_launch_agent usage?

2009-08-18 Thread Kenneth Yoshimoto
Thanks, I'll let you know how it works out. Kenneth On Tue, 18 Aug 2009, Ralph Castain wrote: Date: Tue, 18 Aug 2009 15:23:23 -0600 From: Ralph Castain To: Kenneth Yoshimoto , Open MPI Users Subject: Re: [OMPI users] orte_launch_agent usage? I believe I found the problem on this and fixed

Re: [OMPI users] orte_launch_agent usage?

2009-08-18 Thread Ralph Castain
I believe I found the problem on this and fixed it. If you can, you might try installing the nightly tarball and see if that solves the problem. Anything on or after r21827 will include the fix. Ralph On Aug 12, 2009, at 4:04 PM, Kenneth Yoshimoto wrote: If I use -mca orte_launch_agent /h

[OMPI users] --rankfile

2009-08-18 Thread Nulik Nol
Hi, I get this error when I use --rankfile: "There are not enough slots available in the system to satisfy the 2 slots". What could be the problem? I have tried using '*' for the 'slot' param and many other configs without any luck. Without --rankfile everything works fine. Will appreciate any help. ma
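For reference, a minimal working pairing of hostfile and rankfile looks roughly like this (hostnames and the executable are hypothetical); the "not enough slots" error typically means the hostfile does not grant as many slots as the rankfile tries to consume:

```shell
# Hostfile must grant at least as many slots as the rankfile uses.
cat > myhosts <<'EOF'
node01 slots=2
node02 slots=2
EOF

# Rankfile pins each rank to an explicit host and slot.
cat > myrankfile <<'EOF'
rank 0=node01 slot=0
rank 1=node02 slot=0
EOF

mpirun -np 2 --hostfile myhosts --rankfile myrankfile ./a.out
```

Every host named in the rankfile must also appear in the hostfile (or the allocation), otherwise mpirun has no slot to satisfy the mapping.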

Re: [OMPI users] MPI loop problem

2009-08-18 Thread Julia He
Thanks. I will try your suggestions and get back to you as soon as possible. Julia --- On Tue, 8/18/09, Eugene Loh wrote: From: Eugene Loh Subject: Re: [OMPI users] MPI loop problem To: "Open MPI Users" Date: Tuesday, August 18, 2009, 2:38 PM Is t

Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-08-18 Thread Patrick Geoffray
Craig Plaisance wrote: So is this a problem with the physical switch (we need a better switch) or with the configuration of the switch (we need to configure the switch or configure the os to work with the switch)? You may want to look if you are dropping packets somewhere. You can look at the
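A sketch of the kind of per-node checks Patrick is pointing at (interface name is an assumption; run on each node, and check the switch's own port counters if it exposes them):

```shell
# Kernel-level view: RX/TX errors and drops per interface.
ip -s link show eth0

# NIC/driver counters, which often break drops down by cause
# (ring-buffer overruns, FIFO errors, CRC errors, ...).
ethtool -S eth0 | grep -iE 'drop|err|fifo'

# Protocol-level symptom of loss: TCP segment retransmits.
netstat -s | grep -i retrans
```

Steadily climbing drop or retransmit counters during a run, on the nodes or on the switch ports, would point at congestion or flow-control problems rather than at the MPI layer.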

Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-08-18 Thread Craig Plaisance
So is this a problem with the physical switch (we need a better switch) or with the configuration of the switch (we need to configure the switch, or configure the OS to work with the switch)?

Re: [OMPI users] How to make a job abort when one host dies?

2009-08-18 Thread Patrick Geoffray
Hi Oskar, Oskar Enoksson wrote: The reason in this case was that cl120 had some kind of hardware problem, perhaps memory error or myrinet NIC hardware error. The system hung. I will try MX_ZOMBIE_SEND=0, thanks for the hint! I would not recommend to use that setting. It will affect performa

Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-08-18 Thread Joe Landman
Craig Plaisance wrote: The switch we are using (Dell Powerconnect 6248) has a switching fabric capacity of 184 Gb/s, which should be more than adequate for the 48 ports. Is this the same as backplane bandwidth? Yes. If you are getting the behavior you describe, you are not getting all that

Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-08-18 Thread Craig Plaisance
The switch we are using (Dell Powerconnect 6248) has a switching fabric capacity of 184 Gb/s, which should be more than adequate for the 48 ports. Is this the same as backplane bandwidth?

Re: [OMPI users] MPI loop problem

2009-08-18 Thread Eugene Loh
Is the problem independent of the number of MPI processes? (You suggest this is the case.) If so, does this problem show up even with np=1 (a single MPI process)? If so, does the problem show up even if you turn MPI off? If so, the problem would seem to be unrelated to the MPI implement

Re: [OMPI users] MPI loop problem

2009-08-18 Thread Julia He
The OpenMPI version is [julia.he@bob bin]$ mpirun --version mpirun (Open MPI) 1.2.8 Report bugs to http://www.open-mpi.org/community/help/ The platform is [julia.he@bob bin]$ uname -a Linux bob.csi.cuny.edu 2.6.18-92.1.13.el5 #1 SMP Wed Sep 24 19:32:05 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-08-18 Thread Joe Landman
Craig Plaisance wrote: mpich2 now and post the results. So, does anyone know what causes the wild oscillations in the throughput at larger message sizes and higher network traffic? Thanks! Your switch can't handle this amount of traffic on its backplane. We have seen this often in similar

Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-08-18 Thread Craig Plaisance
I ran a test of tcp using NetPIPE and got throughput of 850 Mb/s at message sizes of 128 Kb. The latency was 50 us. At message sizes above 1000 Kb, the throughput oscillated wildly between 850 Mb/s and values as low as 200 Mb/s. This test was done with no other network traffic. I then ran f
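For anyone reproducing Craig's measurement, NetPIPE's TCP and MPI modes can be run roughly as follows (binary names as shipped with NetPIPE; hostnames are placeholders):

```shell
# Raw TCP test: start the receiver on one node...
NPtcp                    # on node01 (listens and waits)

# ...then point the transmitter at it from a second node.
# Results (message size vs. throughput) are written to np.out.
NPtcp -h node01          # on node02

# The same ping-pong measurement through the MPI layer,
# useful for separating MPI overhead from raw TCP behavior.
mpirun -np 2 --host node01,node02 NPmpi
```

Comparing the two curves at the sizes where the oscillation appears helps decide whether the instability lives in the network stack or in the MPI transport.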

[OMPI users] OPEN MPI with self

2009-08-18 Thread Jean Potsam
Dear all, I am trying to checkpoint an MPI application using the self component. I had a look at the Open MPI FT user's guide Draft 1.4 but am still unsure. I have installed openmpi as follows: jean$ ./configure --prefix=/home/jean/openmpi/ --enable-debug --enable-mpi-profile -

Re: [OMPI users] How to make a job abort when one host dies?

2009-08-18 Thread Scott Atchley
On Aug 18, 2009, at 10:59 AM, Oskar Enoksson wrote: The question is, however, why is cl120 not acking messages? What is the application? What MPI calls does this application use? Scott The reason in this case was that cl120 had some kind of hardware problem, perhaps memory error or myrin

Re: [OMPI users] How to make a job abort when one host dies?

2009-08-18 Thread Oskar Enoksson
Scott Atchley wrote: Long answer: The messages below indicate that these processes were all trying to send to cl120. It did not ack their messages after 1000 resend attempts (each retry is attempted with a 0.5 second interval) which is about 8.3 minutes (500 seconds). The messages also

Re: [OMPI users] Invalid info object in MPI_Comm_spawn_multiple

2009-08-18 Thread Ralph Castain
BTW: I did see one issue in your program. In the program that isn't working, you declare the various input arrays for MPI_Comm_spawn_multiple, but only the manager rank=0 ever initializes them. Thus, the other manager ranks were passing random garbage down to the function. Even though only the roo
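A minimal sketch of the point Ralph makes (worker names and counts are hypothetical): per the MPI standard, only the root's arguments to MPI_Comm_spawn_multiple are significant, but initializing the arrays on every rank costs nothing and guarantees no rank ever hands uninitialized memory to the call:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Initialize on EVERY rank, not just root, so non-root ranks
       never pass random garbage down to the collective call. */
    char *cmds[2]     = { "./worker_a", "./worker_b" };  /* hypothetical */
    int   maxprocs[2] = { 2, 2 };
    MPI_Info infos[2] = { MPI_INFO_NULL, MPI_INFO_NULL };

    MPI_Comm intercomm;
    int errcodes[4];
    MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, maxprocs, infos,
                            0 /* root */, MPI_COMM_WORLD,
                            &intercomm, errcodes);

    MPI_Finalize();
    return 0;
}
```

The alternative, filling the arrays only on root, is legal by the standard, but as the thread shows it exposes you to any implementation bug that inspects the arguments on non-root ranks.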

Re: [OMPI users] Invalid info object in MPI_Comm_spawn_multiple

2009-08-18 Thread Ralph Castain
By any chance did you have the flag on to check MPI parameters? I think we have a bug in there that might be causing what you saw, but it would only be active if you had requested that OMPI check parameters. Thanks Ralph 2009/8/18 Federico Golfrè Andreasi > That's is you've done: the worker pr

Re: [OMPI users] MPI loop problem

2009-08-18 Thread Ralph Castain
Sorry, but there is no way to answer this question with what is given. What is "my_sub" doing? Which version of OpenMPI are you talking about, and on what platform? On Tue, Aug 18, 2009 at 8:28 AM, Julia He wrote: > Hi, > > I found that the subroutine call inside a loop did not return correct

[OMPI users] MPI loop problem

2009-08-18 Thread Julia He
Hi, I found that the subroutine call inside a loop did not return the correct value after a certain number of iterations. In order to simplify the problem, the inputs to the subroutine are chosen to be constant, so the output should be the same for every iteration on every computing node. It is a Fortran program,

[OMPI users] (no subject)

2009-08-18 Thread Julia He
Hi, I found that the subroutine call inside a loop did not return the correct value after a certain number of iterations. In order to simplify the problem, the inputs to the subroutine are chosen to be constant, so the output should be the same for every iteration on every computing node. It is a Fortran program,

Re: [OMPI users] Invalid info object in MPI_Comm_spawn_multiple

2009-08-18 Thread Federico Golfrè Andreasi
That is what I've sent you: the worker program that gets spawned and the two versions of the manager that call the spawn. If you find something wrong, please let me know. Thank you, Federico 2009/8/18 Ralph Castain > > > Only the root process needs to provide the info keys for spawning anything. >

Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-08-18 Thread Joe Landman
Craig Plaisance wrote: Hi - I have compiled vasp 4.6.34 using the Intel fortran compiler 11.1 with openmpi 1.3.3 on a cluster of 104 nodes running Rocks 5.2 with two quad core opterons connected by a Gbit ethernet. Running in parallel on Latency of gigabit is likely your issue. Lower qualit

Re: [OMPI users] very bad parallel scaling of vasp using openmpi

2009-08-18 Thread jimkress_58
Gbit Ethernet is well known to perform poorly for fine grained code like VASP. The latencies for Gbit Ethernet are much too high. If you want good scaling in a cluster for VASP, you'll need to run InfiniBand or some other high speed/ low latency network. Jim -Original Message- From: use

Re: [OMPI users] Invalid info object in MPI_Comm_spawn_multiple

2009-08-18 Thread Ralph Castain
Only the root process needs to provide the info keys for spawning anything. If that isn't correct, then we have a bug. Could you send us a code snippet that shows what you were doing? Thanks Ralph 2009/8/18 Federico Golfrè Andreasi > I think I've solved my problem: > > in the previous c

Re: [OMPI users] Invalid info object in MPI_Comm_spawn_multiple

2009-08-18 Thread Federico Golfrè Andreasi
I think I've solved my problem: in the previous code the arguments of MPI_Comm_spawn_multiple were filled in only by the "root" process, not by all the processes in the group. Now all the ranks have all that information and the spawn is done correctly. But I read on http://www.mpi-forum.org/docs/mp

Re: [OMPI users] Invalid info object in MPI_Comm_spawn_multiple

2009-08-18 Thread Jeff Squyres
On Aug 18, 2009, at 5:12 AM, Federico Golfrè Andreasi wrote: In the info object I only set the "host" key (after creating the object with MPI_Info_create). I've modified my code to leave out that request and created the array of Info objects as an array of MPI_INFO_NULL but the problem is

[OMPI users] Invalid info object in MPI_Comm_spawn_multiple

2009-08-18 Thread Federico Golfrè Andreasi
In the info object I only set the "host" key (after creating the object with MPI_Info_create). I've modified my code to leave out that request and created the array of Info objects as an array of MPI_INFO_NULL but the problem is still the same. The error is thrown only when running with more tha