Here is a v1.6 port of what was committed to the trunk. Let me know if/how it works for you. The option you will want to use is:

  mpirun -mca opal_set_max_sys_limits stacksize:unlimited

or whatever number you want to give (see ulimit for the units). Note that you won't see any impact if you run it with a non-OMPI executable like "sh ulimit", since the limits are only set during MPI_Init.
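For example, with hypothetical values (the 102400 kB cap and the ./my_mpi_app executable name are placeholders, not from this thread):

  # cap the stack of every launched MPI process at 100 MB (102400 kB)
  $ mpirun -np 4 -mca opal_set_max_sys_limits stacksize:102400 ./my_mpi_app

  # or lift the cap entirely (see the discussion below before doing this)
  $ mpirun -np 4 -mca opal_set_max_sys_limits stacksize:unlimited ./my_mpi_app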
On Apr 2, 2013, at 9:48 AM, Duke Nguyen <duke.li...@gmx.com> wrote:

> On 4/2/13 11:03 PM, Gus Correa wrote:
>> On 04/02/2013 11:40 AM, Duke Nguyen wrote:
>>> On 3/30/13 8:46 PM, Patrick Bégou wrote:
>>>> OK, so your problem is identified as a stack size problem. I ran into
>>>> these limitations using Intel Fortran compilers on large data problems.
>>>>
>>>> First, it seems you can increase your stack size, since "ulimit -s
>>>> unlimited" works (you didn't hit the system hard limit). The best way
>>>> is to put this setting in your .bashrc file so it works on every node.
>>>> But setting it to unlimited may not be really safe. I.e., if you get
>>>> into a badly coded recursive function calling itself without a stop
>>>> condition, it can request all the system memory and crash the node.
>>>> So set a large but limited value; it's safer.
>>>
>>> Now I feel the pain you mentioned :). With -s unlimited, some of our
>>> nodes now go down easily (completely) and need to be hard reset!!!
>>> (whereas we never had any node go down like that before, even with
>>> killed or badly coded jobs).
>>>
>>> Looking for a safer value of ulimit -s other than "unlimited" now... :(
>>
>> In my opinion this is a trade-off between who feels the pain.
>> It can be you (the sysadmin) feeling the pain of having
>> to power up offline nodes,
>> or it could be the user feeling the pain of having
>> her/his code killed by a segmentation fault due to the small
>> amount of memory available for the stack.
>
> ... in case that user is at a large institute that promises to provide the
> best service and unlimited resources/unlimited *everything* to end users.
> If not, users should really think about how to make the best use of the
> available resources. Unfortunately many (most?) end users don't.
>
>> There is only so much that can be done to make everybody happy.
>
> So true... especially since HPC resources are still a luxury here in
> Vietnam, and we have a quite small (and not-so-strong) cluster.
>
>> If you share the nodes among jobs, you could set the stack size limit to
>> some part of the physical_memory divided by the number_of_cores,
>> saving some memory for the OS etc. beforehand [a sketch of this
>> calculation follows at the end of the thread].
>> However, this can be a straitjacket for jobs that could run with
>> a bit more memory, and won't because of this limit.
>> If you do not share the nodes, then you could make the stack size
>> closer to the physical memory.
>
> Great. Thanks for this advice, Gus.
>
>> Anyway, this is less of an OpenMPI than of a
>> resource manager / queuing system conversation.
>
> Yeah, and I have learned a lot here other than just openmpi stuff :)
>
>> Best,
>> Gus Correa
>>
>>>> I'm managing a cluster and I always set a maximum value for the stack
>>>> size. I also limit the memory available to each core, for system
>>>> stability. If a user requests only one of the 12 cores of a node, he
>>>> can only access 1/12 of the node's memory. If he needs more memory,
>>>> he has to request 2 cores, even if he runs a sequential code. This
>>>> prevents a job with large memory requirements from crashing other
>>>> users' jobs on the same node. But this is not configured on your node.
>>>>
>>>> Duke Nguyen wrote:
>>>>> On 3/30/13 3:13 PM, Patrick Bégou wrote:
>>>>>> I do not know about your code, but:
>>>>>>
>>>>>> 1) Did you check stack limitations? Typically, Intel Fortran codes
>>>>>> need a large amount of stack when the problem size increases.
>>>>>> Check ulimit -a.
>>>>>
>>>>> First time I've heard of stack limitations. Anyway, ulimit -a gives:
>>>>>
>>>>> $ ulimit -a
>>>>> core file size          (blocks, -c) 0
>>>>> data seg size           (kbytes, -d) unlimited
>>>>> scheduling priority             (-e) 0
>>>>> file size               (blocks, -f) unlimited
>>>>> pending signals                 (-i) 127368
>>>>> max locked memory       (kbytes, -l) unlimited
>>>>> max memory size         (kbytes, -m) unlimited
>>>>> open files                      (-n) 1024
>>>>> pipe size            (512 bytes, -p) 8
>>>>> POSIX message queues     (bytes, -q) 819200
>>>>> real-time priority              (-r) 0
>>>>> stack size              (kbytes, -s) 10240
>>>>> cpu time               (seconds, -t) unlimited
>>>>> max user processes              (-u) 1024
>>>>> virtual memory          (kbytes, -v) unlimited
>>>>> file locks                      (-x) unlimited
>>>>>
>>>>> So the stack size is 10MB??? Does this create the problem? How do I
>>>>> change this? [The usual ways to change it are sketched after the
>>>>> thread.]
>>>>>
>>>>>> 2) Does your node use cpusets and memory limitation, like fake NUMA,
>>>>>> to set the maximum amount of memory available to a job?
>>>>>
>>>>> I don't really understand (this is also the first time I've heard of
>>>>> fake NUMA), but I am pretty sure we do not have such things. The
>>>>> server I tried was a dedicated server with 2 x X5420 and 16GB of
>>>>> physical memory.
>>>>>
>>>>>> Patrick
>>>>>>
>>>>>> Duke Nguyen wrote:
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> I am sorry if this question has been asked before, but after ten
>>>>>>> days of searching/working on the system, I surrender :(. We try to
>>>>>>> use mpirun to run abinit (abinit.org), which in turn reads an input
>>>>>>> file to run some simulation. The command is pretty simple:
>>>>>>>
>>>>>>> $ mpirun -np 4 /opt/apps/abinit/bin/abinit < input.files >& output.log
>>>>>>>
>>>>>>> We ran this command on a server with two quad-core X5420s and 16GB
>>>>>>> of memory. I requested only 4 cores, and I guess in theory each
>>>>>>> core should be able to take up to 2GB.
>>>>>>>
>>>>>>> In the output log, there is something about memory:
>>>>>>>
>>>>>>> P This job should need less than 717.175 Mbytes of memory.
>>>>>>> Rough estimation (10% accuracy) of disk space for files :
>>>>>>> WF disk file : 69.524 Mbytes ; DEN or POT disk file : 14.240 Mbytes.
>>>>>>>
>>>>>>> So basically it reported that the job should not need more than
>>>>>>> 718MB per core.
>>>>>>>
>>>>>>> But I still get the segmentation fault error:
>>>>>>>
>>>>>>> mpirun noticed that process rank 0 with PID 16099 on node biobos
>>>>>>> exited on signal 11 (Segmentation fault).
>>>>>>>
>>>>>>> The system already has limits set to unlimited:
>>>>>>>
>>>>>>> $ cat /etc/security/limits.conf | grep -v '#'
>>>>>>> * soft memlock unlimited
>>>>>>> * hard memlock unlimited
>>>>>>>
>>>>>>> I also tried to run
>>>>>>>
>>>>>>> $ ulimit -l unlimited
>>>>>>>
>>>>>>> before the mpirun command above, but it did not help at all.
>>>>>>>
>>>>>>> If we adjust the parameters of input.files so that the reported
>>>>>>> memory per core is less than 512MB, then the job runs fine.
>>>>>>>
>>>>>>> Please help,
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> D.
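A minimal sketch of Gus's physical_memory / number_of_cores suggestion above (Linux-only, since it reads /proc/meminfo; the 2GB OS reserve is an arbitrary placeholder to tune for your nodes):

  #!/bin/bash
  # Derive a per-core stack limit from the node's physical memory,
  # e.g. from a job prolog or ~/.bashrc on every node.
  mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)  # total RAM in kB
  ncores=$(nproc)                                        # online cores
  reserve_kb=$((2 * 1024 * 1024))                        # keep ~2GB for the OS (placeholder)
  ulimit -s $(( (mem_kb - reserve_kb) / ncores ))        # ulimit -s takes kbytes

On a 16GB, 8-core node like the one in this thread, that works out to roughly 1.75GB of stack per core.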
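And for the "how do I change this?" question above, the 10240 kB default can be raised either per shell or system-wide (the 1048576 kB value is an example only; pick a large but finite number, as Patrick advises):

  # per user, e.g. in ~/.bashrc on every node (value in kbytes)
  ulimit -s 1048576

  # system-wide, in /etc/security/limits.conf, alongside the memlock
  # lines quoted above (values in kbytes)
  * soft stack 1048576
  * hard stack 1048576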