No earthly idea. As I said, I'm afraid Sam is pretty much unavailable for the 
next two weeks, so we probably don't have much hope of fixing it.

I see in your original note that you tried the 1.5.5 beta rc and got the same 
results, so I assume this must be something in your system config that is 
causing the issue. I'll file a bug for him (pointing to this thread) so this 
doesn't get lost, but I would suggest you run with -mca btl ^sm for now unless 
someone else has other suggestions.
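
For reference, the full command would look something like this (the hostfile 
name and application paths are placeholders based on the earlier messages, not 
verbatim):

  mpirun -mca btl ^sm --hostfile path/hostfile -np 12 path/mdrun_mpi \
      -s path/topol.tpr -o path/output.trr

The "^sm" tells Open MPI to use every BTL except shared memory, so on-node 
traffic falls back to TCP over loopback - slower, but it avoids the sm mpool.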


On Apr 16, 2012, at 2:57 AM, Seyyed Mohtadin Hashemi wrote:

> I recompiled everything from scratch with GCC 4.4.5 and 4.7 using the OMPI 
> 1.4.5 tarball.
> 
> I did some tests, and it does not seem that I can make it work. I tried these:
> 
> btl_sm_num_fifos 4
> btl_sm_free_list_num 1000
> btl_sm_free_list_max 1000000
> mpool_sm_min_size 1500000000
> mpool_sm_max_size 7500000000 
> 
> but nothing helped. I started out by varying one parameter at a time from the 
> default up to 1000000 (except for the FIFO count, which I only varied up to 
> 100, and sm_min and sm_max, which I varied from 67 MB [the default was set to 
> 67xxxxxx] to 7.5 GB) to see what reactions I could get. When running with 
> -np 10 everything worked, but as soon as I went to -np 11 it crashed with 
> the same old error.
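> 
> (For reference, MCA parameters like these can be passed on the mpirun command 
> line; the values below are just the ones from the list above, not a 
> recommended setting:
> 
>   mpirun -mca btl_sm_num_fifos 4 -mca btl_sm_free_list_num 1000 \
>       -mca mpool_sm_min_size 1500000000 --hostfile path/hostfile -np 11 ...
> 
> They can also go, one "name = value" per line, in 
> $HOME/.openmpi/mca-params.conf.)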
> 
> On Fri, Apr 13, 2012 at 6:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> On Apr 13, 2012, at 10:36 AM, Seyyed Mohtadin Hashemi wrote:
> 
>> That fixed the issue, but it has raised a big question mark over why this 
>> happened.
>> 
>> I'm pretty sure it's not a system memory issue; the node with the least RAM 
>> has 8 GB, which I would think is more than enough.
>> 
>> Do you think adjusting the btl_sm_eager_limit, mpool_sm_min_size, and 
>> mpool_sm_max_size parameters could help fix the problem? (I found these at 
>> http://www.open-mpi.org/faq/?category=sm .) I ask because, compared to 
>> -np 10, the performance at -np 18 is worse when running with the cmd you 
>> suggested. I'll try playing around with the parameters and see what works. 
> 
> Yes, performance will definitely be worse - I was just trying to isolate the 
> problem. I would play a little with those sizes and see what you can do. Our 
> shared memory person is pretty much unavailable for the next two weeks, but 
> the rest of us will at least try to get you working.
> 
> We typically do run with more than 10 ppn, so I know the base sm code works 
> at that scale. However, those nodes usually have 32 GB of RAM, and the 
> default sm params are scaled accordingly.
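> 
> (If it helps, the defaults on a given build can be checked with ompi_info, 
> e.g.:
> 
>   ompi_info --param mpool sm
>   ompi_info --param btl sm
> 
> which should list the mpool_sm_* and btl_sm_* parameters along with their 
> current/default values.)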
> 
> 
>> 
>> On Fri, Apr 13, 2012 at 5:44 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> Afraid I have no idea how those packages were built, what release they 
>> correspond to, etc. I would suggest sticking with the tarballs.
>> 
>> Your output indicates a problem with shared memory when you completely fill 
>> the machine. Could be a couple of things, like running out of memory - but 
>> for now, try adding -mca btl ^sm to your cmd line. Should work.
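>> 
>> (If the workaround pans out, one way to make it stick without editing every 
>> cmd line is an MCA params file - a sketch, assuming the default per-user 
>> location:
>> 
>>   echo "btl = ^sm" >> $HOME/.openmpi/mca-params.conf
>> 
>> mpirun will then exclude the sm BTL for every run by that user.)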
>> 
>> 
>> On Apr 13, 2012, at 5:09 AM, Seyyed Mohtadin Hashemi wrote:
>> 
>>> Hi,
>>> 
>>> Sorry it took so long to answer; I didn't get any return mails and had to 
>>> check the digest for replies.
>>> 
>>> Anyway, when I compiled from scratch I did use the tarballs from 
>>> open-mpi.org. GROMACS is not the problem (or at least I don't think so); I 
>>> just used it as a check to see if I could run parallel jobs. I am now 
>>> using the OSU benchmarks because I can't be sure that the problem is not 
>>> with GROMACS.
>>> 
>>> On the new installation I have not installed (nor compiled) OMPI from the 
>>> official tarballs, but rather installed the "openmpi-bin, openmpi-common, 
>>> libopenmpi1.3, openmpi-checkpoint, and libopenmpi-dev" packages using 
>>> apt-get.
>>> 
>>> As for the simple examples (i.e., ring_c, hello_c, and connectivity_c 
>>> extracted from the 1.4.2 official tarball), I get the exact same behavior 
>>> as with GROMACS and the OSU benchmarks.
>>> 
>>> I suspect you'll have to ask someone familiar with GROMACS about that 
>>> specific package. As for testing OMPI, can you run the codes in the 
>>> examples directory - e.g., "hello" and "ring"? I assume you are downloading 
>>> and installing OMPI from our tarballs?
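>>> 
>>> (A minimal sketch of building and running one of those examples, assuming 
>>> an extracted 1.4.x tarball and mpicc/mpirun on the PATH:
>>> 
>>>   cd openmpi-1.4.2/examples
>>>   mpicc hello_c.c -o hello_c
>>>   mpirun --hostfile path/hostfile -np 12 ./hello_c
>>> 
>>> If hello_c also fails above -np 10, that rules out GROMACS and the OSU 
>>> benchmarks entirely.)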
>>> 
>>> On Apr 12, 2012, at 7:04 AM, Seyyed Mohtadin Hashemi wrote:
>>> 
>>> > Hello,
>>> >
>>> > I have a very peculiar problem: I have a micro cluster with three nodes 
>>> > (18 cores total); the nodes are clones of each other, connected to a 
>>> > frontend via Ethernet, with Debian Squeeze as the OS on all nodes. When I 
>>> > run parallel jobs I can use up to "-np 10"; if I go any further the job 
>>> > crashes. I have primarily done tests with GROMACS (because that is what I 
>>> > will be running) but have also used OSU Micro-Benchmarks 3.5.2.
>>> >
>>> > For a simple parallel job I use: "path/mpirun --hostfile path/hostfile 
>>> > -np XX -d --display-map path/mdrun_mpi -s path/topol.tpr -o 
>>> > path/output.trr"
>>> >
>>> > (path is global.) For "-np XX" less than or equal to 10 it works; 
>>> > however, as soon as I use 11 or larger the whole thing crashes. The 
>>> > terminal dump is attached to this mail: when_working.txt is for "-np 10", 
>>> > when_crash.txt is for "-np 12", and OpenMPI_info.txt is the output from 
>>> > "path/mpirun --bynode --hostfile path/hostfile --tag-output ompi_info -v 
>>> > ompi full --parsable"
>>> >
>>> > I have tried OpenMPI v.1.4.2 all the way up to beta v1.5.5, and all yield 
>>> > the same result.
>>> >
>>> > The output files are from a new install I did today: I formatted all 
>>> > nodes, started from a fresh minimal install of Squeeze, used 
>>> > "apt-get install gromacs gromacs-openmpi", and installed all 
>>> > dependencies. Then I ran two jobs using the parameters described above. 
>>> > I also did one run with the OSU benchmarks (data not included); it also 
>>> > crashed with "-np" larger than 10.
>>> >
>>> > I hope somebody can help figure out what is wrong and how I can fix it.
>>> >
>>> > Best regards,
>>> > Mohtadin
>>> >
>>> >
>>> > <Archive.zip>
>> 
>> 
>> 
>> 
>> -- 
>> De venligste hilsner/I am, yours most sincerely
>> Seyyed Mohtadin Hashemi
> 
> 
> 
> 
> -- 
> De venligste hilsner/I am, yours most sincerely
> Seyyed Mohtadin Hashemi
