Here is a v1.6 port of what was committed to the trunk. Let me know if/how it 
works for you. The option you will want to use is:

mpirun -mca opal_set_max_sys_limits stacksize:unlimited

or whatever value you want to give (see ulimit for the units). Note that you 
won't see any effect if you run it with a non-OMPI executable like "sh ulimit", 
since the limit is only applied during MPI_Init.
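
If you want to double-check that the new limit actually took effect, a rank can
query it right after MPI_Init. A minimal sketch (getrlimit() is plain POSIX; the
file name is just for illustration):

/* stackcheck.c - print the stack soft limit each rank sees after MPI_Init */
#include <stdio.h>
#include <sys/resource.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    struct rlimit rl;

    MPI_Init(&argc, &argv);               /* the limits are adjusted in here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    getrlimit(RLIMIT_STACK, &rl);
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("rank %d: stack soft limit = unlimited\n", rank);
    else
        printf("rank %d: stack soft limit = %llu kB\n",
               rank, (unsigned long long)rl.rlim_cur / 1024);

    MPI_Finalize();
    return 0;
}

Build it with mpicc and run it under the mpirun line above; each rank should
report "unlimited" (or whatever value you passed) instead of the login shell's
default.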


On Apr 2, 2013, at 9:48 AM, Duke Nguyen <duke.li...@gmx.com> wrote:

> On 4/2/13 11:03 PM, Gus Correa wrote:
>> On 04/02/2013 11:40 AM, Duke Nguyen wrote:
>>> On 3/30/13 8:46 PM, Patrick Bégou wrote:
>>>> Ok, so your problem is identified as a stack size problem. I ran into
>>>> these limitations using Intel Fortran compilers on large data problems.
>>>> 
>>>> First, it seems you can increase your stack size, since "ulimit -s
>>>> unlimited" works (the system hard limit is not enforced). The best
>>>> way is to put this setting in your .bashrc file so it works on
>>>> every node.
>>>> But setting it to unlimited may not really be safe. For example, a
>>>> badly coded recursive function calling itself without a stop
>>>> condition can request all the system memory and crash the node. So
>>>> set a large but finite value; it's safer.
>>>> 
>>> 
>>> Now I feel the pain you mentioned :). With -s unlimited, some of our
>>> nodes easily go down completely and need a hard reset!!!
>>> (We never had any node go down like that before, even with
>>> killed or badly coded jobs.)
>>> 
>>> Looking for a safer value of ulimit -s other than "unlimited" now... :(
>>> 
>> 
>> In my opinion this is a trade-off over who feels the pain.
>> It can be you (the sys admin) feeling the pain of having
>> to power up offline nodes,
>> or it can be the user feeling the pain of having
>> his/her code killed by a segmentation fault because too little
>> memory is available for the stack.
> 
> ... in case that user is at a large institute that promises to provide the best 
> service and unlimited resources/unlimited *everything* to end users. If not, the 
> user should really think about how to make the best use of the available resources. 
> Unfortunately many (most?) end users don't.
> 
>> There is only so much that can be done to make everybody happy.
> 
> So true... especially since HPC resources are still a luxury here in Vietnam, and we 
> have a quite small (and not-so-powerful) cluster.
> 
>> If you share the nodes among jobs, you could set the
>> stack size limit to
>> some fraction of the physical_memory divided by the number_of_cores,
>> setting aside some memory for the OS etc. beforehand.
>> However, this can be a straitjacket for jobs that could run with
>> a bit more memory, and won't run because of this limit.
>> If you do not share the nodes, then you could make the stacksize
>> closer to the physical memory.
> 
> Great. Thanks for this advice Gus.
> 
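Just to make that arithmetic concrete: on the 16GB, 8-core node from the original
post, reserving ~2GB for the OS gives roughly (16 - 2) / 8 = 1.75GB of stack per
core. A rough sketch of the same calculation in code (the 2GB reserve and the idea
of raising the soft limit in-process are only assumptions for illustration; on a
real cluster the limit would normally come from the resource manager or
limits.conf):

/* percore_stack.c - per-core stack limit as (physical memory - OS reserve) / cores */
#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

int main(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);        /* glibc extensions, fine on Linux nodes */
    long psize = sysconf(_SC_PAGESIZE);
    long cores = sysconf(_SC_NPROCESSORS_ONLN);

    unsigned long long phys    = (unsigned long long)pages * (unsigned long long)psize;
    unsigned long long reserve = 2ULL << 30;     /* assume ~2GB kept for the OS */
    unsigned long long percore = (phys - reserve) / (unsigned long long)cores;

    struct rlimit rl;
    getrlimit(RLIMIT_STACK, &rl);
    rl.rlim_cur = percore;                       /* raise the soft limit ...             */
    if (rl.rlim_cur > rl.rlim_max)
        rl.rlim_cur = rl.rlim_max;               /* ... but never past the hard limit    */
    if (setrlimit(RLIMIT_STACK, &rl) != 0)
        perror("setrlimit");

    printf("per-core stack limit: %llu MB\n", percore >> 20);
    return 0;
}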
>> 
>> Anyway, this is less of an OpenMPI conversation than a
>> resource manager / queuing system one.
> 
> Yeah, and I have learned a lot here beyond just OpenMPI stuff :)
> 
>> 
>> Best,
>> Gus Correa
>> 
>>>> I manage a cluster and I always set a maximum value for the stack size.
>>>> I also limit the memory available per core, for system stability.
>>>> If a user requests only one of the 12 cores of a node, he can only
>>>> access 1/12 of the node's memory. If he needs more memory he has
>>>> to request 2 cores, even if he runs a sequential code. This avoids
>>>> his memory requirements crashing other users' jobs on the same
>>>> node. But this is not configured on your node.
>>>> 
>>>> Duke Nguyen wrote:
>>>>> On 3/30/13 3:13 PM, Patrick Bégou wrote:
>>>>>> I do not know about your code but:
>>>>>> 
>>>>>> 1) did you check stack limits? Typically, Intel Fortran codes
>>>>>> need a large amount of stack when the problem size increases.
>>>>>> Check ulimit -a
>>>>> 
>>>>> This is the first time I've heard of stack limits. Anyway, ulimit -a gives
>>>>> 
>>>>> $ ulimit -a
>>>>> core file size          (blocks, -c) 0
>>>>> data seg size           (kbytes, -d) unlimited
>>>>> scheduling priority             (-e) 0
>>>>> file size               (blocks, -f) unlimited
>>>>> pending signals                 (-i) 127368
>>>>> max locked memory       (kbytes, -l) unlimited
>>>>> max memory size         (kbytes, -m) unlimited
>>>>> open files                      (-n) 1024
>>>>> pipe size            (512 bytes, -p) 8
>>>>> POSIX message queues     (bytes, -q) 819200
>>>>> real-time priority              (-r) 0
>>>>> stack size              (kbytes, -s) 10240
>>>>> cpu time               (seconds, -t) unlimited
>>>>> max user processes              (-u) 1024
>>>>> virtual memory          (kbytes, -v) unlimited
>>>>> file locks                      (-x) unlimited
>>>>> 
>>>>> So the stack size is only 10MB??? Could that cause the problem? How do I
>>>>> change it?
>>>>> 
>>>>>> 
>>>>>> 2) does your node use cpusets and memory limits, like fake NUMA, to
>>>>>> set the maximum amount of memory available to a job?
>>>>> 
>>>>> I don't really understand (this is also the first time I've heard of fake
>>>>> NUMA), but I am pretty sure we do not have such things. The server I tried
>>>>> was a dedicated server with 2 x5420 CPUs and 16GB of physical memory.
>>>>> 
>>>>>> 
>>>>>> Patrick
>>>>>> 
>>>>>> Duke Nguyen wrote:
>>>>>>> Hi folks,
>>>>>>> 
>>>>>>> I am sorry if this question has been asked before, but after ten
>>>>>>> days of searching/working on the system, I surrender :(. We are trying
>>>>>>> to use mpirun to run abinit (abinit.org), which in turn reads an
>>>>>>> input file to run a simulation. The command to run is pretty simple:
>>>>>>> 
>>>>>>> $ mpirun -np 4 /opt/apps/abinit/bin/abinit < input.files >& output.log
>>>>>>> 
>>>>>>> We ran this command on a server with two quad-core x5420s and 16GB
>>>>>>> of memory. I used only 4 cores, and I guess in theory each
>>>>>>> core should be able to take up to 2GB.
>>>>>>> 
>>>>>>> In the output log, there is a note about memory:
>>>>>>> 
>>>>>>> P This job should need less than 717.175 Mbytes of memory.
>>>>>>> Rough estimation (10% accuracy) of disk space for files :
>>>>>>> WF disk file : 69.524 Mbytes ; DEN or POT disk file : 14.240 Mbytes.
>>>>>>> 
>>>>>>> So basically it reported that the above job should not need more
>>>>>>> than about 718MB per core.
>>>>>>> 
>>>>>>> But I still have the Segmentation Fault error:
>>>>>>> 
>>>>>>> mpirun noticed that process rank 0 with PID 16099 on node biobos
>>>>>>> exited on signal 11 (Segmentation fault).
>>>>>>> 
>>>>>>> The system already has the memlock limits set to unlimited:
>>>>>>> 
>>>>>>> $ cat /etc/security/limits.conf | grep -v '#'
>>>>>>> * soft memlock unlimited
>>>>>>> * hard memlock unlimited
>>>>>>> 
>>>>>>> I also tried to run
>>>>>>> 
>>>>>>> $ ulimit -l unlimited
>>>>>>> 
>>>>>>> before the mpirun command above, but it did not help at all.
>>>>>>> 
>>>>>>> If we adjust the parameters in input.files so that the reported
>>>>>>> memory per core is less than 512MB, then the job runs fine.
>>>>>>> 
>>>>>>> Please help,
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> D.