On Aug 28, 2007, at 9:05 AM, Li-Ta Lo wrote:

On Mon, 2007-08-27 at 15:10 -0400, Rolf vandeVaart wrote:
We are running into a problem when running on one of our larger SMPs
using the latest Open MPI v1.2 branch.  We are trying to run a job
with np=128 within a single node.  We are seeing the following error:

"SM failed to send message due to shortage of shared memory."

We then increased the allowable maximum size of the shared segment to
2 gigabytes minus 1 byte, which is the maximum a 32-bit application
allows.  We used the MCA parameter to increase it, as shown here:

-mca mpool_sm_max_size 2147483647
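
(For reference, the full command line looks something like the
following; ./a.out here stands in for the actual application binary.)

  mpirun -np 128 -mca mpool_sm_max_size 2147483647 ./a.out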

This allowed the program to run to completion.  Therefore, we would
like to increase the default maximum from 512 megabytes to 2 gigabytes
minus 1 byte.
Does anyone have an objection to this change?  Soon we are going to
have larger CPU counts and would like to increase the odds that things
work "out of the box" on these large SMPs.



There is a serious problem with the 1.2 branch: it does not allocate
any SM area for each process at the beginning.  SM areas are allocated
on demand, and if some of the processes are more aggressive than the
others, they will starve the rest.  This problem is fixed in the trunk
by assigning at least one SM area to each process.  I think this is
what you saw (starvation), and an increase of the max size may not be
necessary.
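
To make the failure mode concrete, here is a toy model of the two
allocation policies (just a sketch with made-up names and sizes, not
the actual mpool code):

#include <stdio.h>

#define NPROCS     4
#define POOL_FRAGS 4

static int free_frags = POOL_FRAGS;   /* fragments left in the pool */
static int reserved[NPROCS];          /* per-process reserved fragment */

/* 1.2 behavior: first come, first served.  An aggressive sender can
 * drain the pool before a quiet peer ever asks for a fragment. */
static int alloc_on_demand(int rank)
{
    (void)rank;
    if (free_frags > 0) { free_frags--; return 0; }
    return -1;                        /* "shortage of shared memory" */
}

/* Trunk behavior as described above: every process has one fragment
 * reserved at startup, so it can always make progress. */
static void reserve_at_startup(void)
{
    int i;
    for (i = 0; i < NPROCS; i++) {
        reserved[i] = 1;
        free_frags--;
    }
}

static int alloc_with_reservation(int rank)
{
    if (free_frags > 0) { free_frags--; return 0; }
    if (reserved[rank]) { reserved[rank] = 0; return 0; }
    return -1;
}

int main(void)
{
    int i;

    /* rank 0 grabs every free fragment ... */
    for (i = 0; i < POOL_FRAGS; i++) alloc_on_demand(0);
    /* ... so rank 1 starves on its very first request (prints -1). */
    printf("on demand:   rank 1 -> %d\n", alloc_on_demand(1));

    free_frags = POOL_FRAGS;          /* reset the toy pool */
    reserve_at_startup();
    for (i = 0; i < POOL_FRAGS; i++) alloc_with_reservation(0);
    /* rank 1 still succeeds thanks to its reservation (prints 0). */
    printf("reservation: rank 1 -> %d\n", alloc_with_reservation(1));
    return 0;
}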

Although I'm pretty sure this is fixed in the v1.2 branch already.

I don't think we should raise that ceiling at this point. We create the file in /tmp, and if someone does -np 32 on a single, small node (not unheard of), it'll do really evil things.
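
(If memory serves, the backing file lives under the Open MPI session
directory, so on a machine with a cramped /tmp one workaround is to
point the session directory at a filesystem with room.  I believe the
parameter is orte_tmpdir_base; check ompi_info to be sure:

  mpirun -mca orte_tmpdir_base /path/with/space -np 32 ./a.out

where ./a.out again stands in for the real binary.)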

Personally, I don't think we need nearly as much shared memory as we're using. The design is bad in that its memory usage is unbounded. We should fix that rather than making the file bigger. But I'm not going to fix it, so take my opinion with a grain of salt.
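
For what it's worth, the bounded alternative I have in mind looks
roughly like this (a sketch under my own assumptions, not a patch
against the sm BTL): give each sender/receiver pair a fixed-depth
circular queue and push back on the sender when it fills, so the total
footprint is fixed at startup no matter how the application behaves.

#include <stdbool.h>
#include <stddef.h>

#define QUEUE_LEN 64          /* fixed depth per sender/receiver pair */

/* Single-producer/single-consumer ring per (sender, receiver) pair.
 * A real implementation would also need memory barriers here. */
typedef struct {
    volatile size_t head;     /* advanced only by the sender   */
    volatile size_t tail;     /* advanced only by the receiver */
    void *frag[QUEUE_LEN];
} sm_ring_t;

/* Sender side: returns false when the ring is full, so the caller
 * retries later instead of growing the shared file. */
static bool ring_push(sm_ring_t *r, void *f)
{
    size_t next = (r->head + 1) % QUEUE_LEN;
    if (next == r->tail) return false;    /* full: back-pressure */
    r->frag[r->head] = f;
    r->head = next;
    return true;
}

/* Receiver side: returns NULL when there is nothing to drain. */
static void *ring_pop(sm_ring_t *r)
{
    void *f;
    if (r->tail == r->head) return NULL;  /* empty */
    f = r->frag[r->tail];
    r->tail = (r->tail + 1) % QUEUE_LEN;
    return f;
}

With np processes that is np * (np - 1) rings of QUEUE_LEN slots each,
sized once at startup, instead of a file that grows with traffic.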

Brian
