Hi,

To increase the maximum number of open files, we have set execd_params via qconf -mconf and 
also raised the limits at the OS level:
execd_params                 H_DESCRIPTORS=262144,H_LOCKS=262144,H_MAXPROC=262144
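
For reference, the OS-level side is roughly the following; this is a minimal sketch assuming 
pam_limits with a RHEL-style /etc/security/limits.conf (file locations, and whether the kernel 
ceiling needs raising at all, vary by distro):

    # /etc/security/limits.conf -- per-process open-file limit for login sessions
    *    soft    nofile    262144
    *    hard    nofile    262144

    # kernel-wide ceiling on open files, only needed if the current value is lower
    sysctl -w fs.file-max=262144

The execd_params value itself can be confirmed with: qconf -sconf | grep execd_params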

On our execution nodes we can see that SGE sets a soft limit of 65535 even though we told 
it to use 262144.
After qlogin:
[root@p2node01 ~]# cat /proc/104694/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             262144               262144               processes
Max open files            65535                262144               files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            262144               262144               locks
Max pending signals       15023                15023                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
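
For a quicker check from inside the qlogin session itself, the shell's own ulimit shows the 
same mismatch (assuming bash):

    $ ulimit -Sn    # soft limit on open files -> 65535
    $ ulimit -Hn    # hard limit on open files -> 262144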

When running a PE smp job requesting 2 slots, the soft limit is set to 65535 × 2 = 131070. 
The number of slots appears to act as a multiplier on the soft limit. If we request more 
than 4 slots, the scaled value exceeds the hard limit and the max open files falls back to 
the default of 1024. Our workaround is to set H_DESCRIPTORS=9362, because some of our exec 
nodes have 28 cores and 28 × 9362 = 262136, which keeps a full-node job just under the 
262144 ceiling. I was wondering if there is a better way of doing this?
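
To spell out the scaling we are seeing (the per-slot multiplication is our observation from 
/proc, not something we found documented):

    slots requested    soft limit applied          result
    1                  65535 x 1 =  65535          within the 262144 hard limit
    2                  65535 x 2 = 131070          within the hard limit
    4                  65535 x 4 = 262140          still within the hard limit
    5                  65535 x 5 = 327675          exceeds it, max open files resets to 1024

With the per-slot workaround the execd_params line becomes something like this (only 
H_DESCRIPTORS changed from the original values):

    execd_params                 H_DESCRIPTORS=9362,H_LOCKS=262144,H_MAXPROC=262144
    # 28-core node, full-node job: 28 x 9362 = 262136, just under 262144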

You might ask why we need 200k+ open files. It is because someone is using software that 
has a file handle leak and does not fclose() properly. Their workaround is a dirty hack 
where the job sshes onto localhost and bypasses the ulimit set by SGE.
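
For completeness, that hack looks roughly like this inside the job script (leaky_program is 
a made-up name; the real script differs):

    #!/bin/bash
    #$ -cwd
    #$ -pe smp 4
    # ssh-ing back into the same node starts a fresh session under sshd rather
    # than sge_shepherd, so it picks up the limits.conf values instead of the
    # ulimits SGE applied to the job
    ssh localhost "cd $PWD && ./leaky_program"

Getting the descriptor limits set correctly by SGE is meant to make that hack unnecessary.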

Many thanks,

Luis