Hi,

Just out of curiosity, what happens when you add

-mca shmem posix

to your mpirun command line using 1.5.5?

Can you also please try:

-mca shmem sysv

I'm shooting in the dark here, but I want to make sure that the failure isn't 
due to a small backing store.
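For example, a sketch assuming the global hostfile and the hello_c example mentioned later in this thread:

  # try the POSIX shared-memory component
  mpirun -mca shmem posix --hostfile path/hostfile -np 12 ./hello_c

  # try the System V shared-memory component
  mpirun -mca shmem sysv --hostfile path/hostfile -np 12 ./hello_c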

Thanks,

Sam

On Apr 16, 2012, at 8:57 AM, Gutierrez, Samuel K wrote:

Hi,

Sorry about the lag.  I'll take a closer look at this ASAP.

Appreciate your patience,

Sam
________________________________
From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
Sent: Monday, April 16, 2012 8:52 AM
To: Seyyed Mohtadin Hashemi
Cc: us...@open-mpi.org
Subject: Re: [OMPI users] OpenMPI fails to run with -np larger than 10

No earthly idea. As I said, I'm afraid Sam is pretty much unavailable for the 
next two weeks, so we probably don't have much hope of fixing it.

I see in your original note that you tried the 1.5.5 beta rc and got the same 
results, so I assume this must be something in your system config that is
causing the issue. I'll file a bug for him (pointing to this thread) so this 
doesn't get lost, but would suggest you run ^sm for now unless someone else has 
other suggestions.


On Apr 16, 2012, at 2:57 AM, Seyyed Mohtadin Hashemi wrote:

I recompiled everything from scratch with GCC 4.4.5 and 4.7 using the OMPI 1.4.5
tarball.

I did some tests, and it does not seem that I can make it work. I tried these:

btl_sm_num_fifos 4
btl_sm_free_list_num 1000
btl_sm_free_list_max 1000000
mpool_sm_min_size 1500000000
mpool_sm_max_size 7500000000

but nothing helped. I started out by varying one parameter at a time from its
default up to 1000000 (except the FIFO count, which I only varied up to 100, and
sm_min and sm_max, which I varied from 67 MB [the default was set to 67xxxxxx] up
to 7.5 GB) to see what reactions I could get. When running with -np 10 everything
worked, but as soon as I went to -np 11 it crashed with the same old error.
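For reference, a command line along these lines is what I mean (the values shown
are just one combination I tried; the hostfile and binary paths are as before):

  mpirun --mca btl_sm_num_fifos 4 \
         --mca mpool_sm_min_size 1500000000 \
         --mca mpool_sm_max_size 7500000000 \
         --hostfile path/hostfile -np 11 \
         path/mdrun_mpi -s path/topol.tpr -o path/output.trr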

On Fri, Apr 13, 2012 at 6:41 PM, Ralph Castain <r...@open-mpi.org> wrote:

On Apr 13, 2012, at 10:36 AM, Seyyed Mohtadin Hashemi wrote:

That fixed the issue, but it raises a big question mark over why this happened.

I'm pretty sure it's not a system memory issue; the node with the least RAM has
8 GB, which I would think is more than enough.

Do you think that adjusting btl_sm_eager_limit, mpool_sm_min_size, and
mpool_sm_max_size can help fix the problem? (Found this at
http://www.open-mpi.org/faq/?category=sm .) I ask because, compared to -np 10,
the performance of -np 18 is worse when running with the command you suggested.
I'll try playing around with the parameters and see what works.

Yes, performance will definitely be worse - I was just trying to isolate the 
problem. I would play a little with those sizes and see what you can do. Our 
shared memory person is pretty much unavailable for the next two weeks, but the 
rest of us will at least try to get you working.

We typically do run with more than 10 ppn, so I know the base sm code works at
that scale. However, those nodes usually have 32 GB of RAM, and the default
sm params are scaled accordingly.
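If it helps, you can see what defaults your build is actually using with
ompi_info (a sketch; the exact output differs a bit between releases):

  ompi_info --param btl sm      # shows the btl_sm_* defaults, e.g. free list sizes
  ompi_info --param mpool sm    # shows the mpool_sm_min_size / mpool_sm_max_size defaults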



On Fri, Apr 13, 2012 at 5:44 PM, Ralph Castain <r...@open-mpi.org> wrote:
Afraid I have no idea how those packages were built, what release they 
correspond to, etc. I would suggest sticking with the tarballs.

Your output indicates a problem with shared memory when you completely fill the 
machine. Could be a couple of things, like running out of memory - but for now, 
try adding -mca btl ^sm to your cmd line. Should work.
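Something along these lines, using the command from your original note as an
assumed template:

  # ^sm disables the shared-memory BTL; on-node messages fall back to TCP
  path/mpirun -mca btl ^sm --hostfile path/hostfile -np 12 \
      path/mdrun_mpi -s path/topol.tpr -o path/output.trr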


On Apr 13, 2012, at 5:09 AM, Seyyed Mohtadin Hashemi wrote:

Hi,

Sorry that it took so long to answer; I didn't get any return mails and had to
check the digest for replies.

Anyway, when I compiled from scratch I did use the tarballs from
open-mpi.org. GROMACS is not the problem (or at least I don't think so); I just
used it as a check to see if I could run parallel jobs. I am now using the OSU
benchmarks because I can't be sure that the problem is not with GROMACS.

On the new installation I have not installed (nor compiled) OMPI from the
official tarballs, but rather installed the openmpi-bin, openmpi-common,
libopenmpi1.3, openmpi-checkpoint, and libopenmpi-dev packages using apt-get.
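In case it matters, this is roughly how I check which installation is actually
picked up (a sketch):

  which mpirun                    # should point at the apt-installed binary, e.g. under /usr/bin
  mpirun --version
  ompi_info | grep "Open MPI:"    # reports the runtime version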

As for the simple examples (i.e., ring_c, hello_c, and connectivity_c extracted
from the 1.4.2 official tarball), I get the exact same behavior as with
GROMACS/the OSU benchmarks.
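For completeness, a sketch of how I built and ran them (exact paths assumed):

  mpicc hello_c.c -o hello_c      # from the examples/ directory of the tarball
  mpicc ring_c.c -o ring_c
  mpirun --hostfile path/hostfile -np 12 ./hello_c    # fine with -np 10, fails with 11 and up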

I suspect you'll have to ask someone familiar with GROMACS about that specific 
package. As for testing OMPI, can you run the codes in the examples directory - 
e.g., "hello" and "ring"? I assume you are downloading and installing OMPI from 
our tarballs?

On Apr 12, 2012, at 7:04 AM, Seyyed Mohtadin Hashemi wrote:

> Hello,
>
> I have a very peculiar problem: I have a micro cluster with three nodes (18
> cores total); the nodes are clones of each other, connected to a frontend
> via Ethernet, with Debian Squeeze as the OS on all nodes. When I run parallel
> jobs I can use up to "-np 10"; if I go further the job crashes. I have primarily
> done tests with GROMACS (because that is what I will be running) but have
> also used OSU Micro-Benchmarks 3.5.2.
>
> For a simple parallel job I use: "path/mpirun --hostfile path/hostfile -np XX
> -d --display-map path/mdrun_mpi -s path/topol.tpr -o path/output.trr"
>
> (path is global.) For -np XX smaller than or equal to 10 it works; however, as
> soon as I use 11 or larger the whole thing crashes. The terminal dump
> is attached to this mail: when_working.txt is for "-np 10", when_crash.txt is
> for "-np 12", and OpenMPI_info.txt is the output from "path/mpirun --bynode
> --hostfile path/hostfile --tag-output ompi_info -v ompi full --parsable".
>
> I have tried OpenMPI v.1.4.2 all the way up to beta v1.5.5, and all yield the 
> same result.
>
> The output files are from a new install I did today: I formatted all nodes 
> and started from a fresh minimal install of Squeeze and used "apt-get install 
> gromacs gromacs-openmpi" and installed all dependencies. Then I ran two jobs 
> using the parameters described above; I also ran one with the OSU benchmarks (data
> is not included), and it also crashed with "-np" larger than 10.
>
> I hope somebody can help figure out what is wrong and how I can fix it.
>
> Best regards,
> Mohtadin
>
>
> <Archive.zip>




--
Kindest regards/I am, yours most sincerely
Seyyed Mohtadin Hashemi






