Hi all,
I'm trying to get cgroups working with SoGE 8.1.8 on CentOS 7. Basic
cgroup functionality works in the OS, and the cgred and cgconfig
services are enabled.
I modified the "setup-cgroups-etc" script to mount the cgroup filesystem
(with the cpuset controller) instead of the cpuset filesystem:
#mount -t cpuset none $cpuset_mnt >/dev/null 2>&1
mount -t cgroup -ocpuset cpuset $cpuset_mnt
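A quick way to confirm that the mount above took effect is to look for a cgroup entry with the cpuset option in /proc/mounts. The helper below is only a sketch: the function name cpuset_cgroup_mounted is made up for illustration, and /dev/cpuset is assumed to be the value of $cpuset_mnt on this node.

```shell
# Hypothetical helper (not part of SoGE): succeed if a cgroup mount carrying
# the cpuset controller exists at the given mount point.
# Reads /proc/mounts-format text on stdin so it can be tried on sample data.
cpuset_cgroup_mounted() {
    mnt=$1
    awk -v mnt="$mnt" '
        # Fields: device, mount point, fs type, options, dump, pass
        $2 == mnt && $3 == "cgroup" && $4 ~ /(^|,)cpuset(,|$)/ { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Usage on a live node (assumes $cpuset_mnt is /dev/cpuset):
#   cpuset_cgroup_mounted /dev/cpuset < /proc/mounts && echo "cpuset cgroup mounted"
```

The check only looks at the mount table; it says nothing about whether sge_execd can write to the hierarchy.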
The "setup-cgroups-etc" script is called by sgeexecd at startup:
$bin_dir/sge_execd
/usr/local/sge/util/resources/scripts/setup-cgroups-etc start
After rebooting the test node, /proc/self/cgroup exists, and the proper
directories are created under /dev/cpuset/sge:
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cgroup.clone_children
--w--w--w- 1 sgeadmin root 0 Aug 10 16:54 cgroup.event_control
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cgroup.procs
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.cpu_exclusive
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.cpus
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mem_exclusive
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mem_hardwall
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_migrate
-r--r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_pressure
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_spread_page
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.memory_spread_slab
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.mems
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.sched_load_balance
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 cpuset.sched_relax_domain_level
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 notify_on_release
-rw-r--r-- 1 sgeadmin root 0 Aug 10 16:54 tasks
qconf -sconf shows:
execd_params ENABLE_ADDGRP_KILL=TRUE ENABLE_BINDING=true \
USE_CGROUPS
The problem is that when we submit a job, the queue on that node goes
into an error state, and the sge messages for that node show:
08/15/2016 14:50:55| main|moose11|E|shepherd of job 222673.1 died through signal = 6
08/15/2016 14:50:55| main|moose11|E|abnormal termination of shepherd for job 222673.1: no "exit_status" file
08/15/2016 14:50:55| main|moose11|E|can't open file active_jobs/222673.1/error: No such file or directory
08/15/2016 14:50:55| main|moose11|E|can't open pid file "active_jobs/222673.1/pid" for job 222673.1
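For reference, signal 6 on Linux is SIGABRT, so the shepherd aborted rather than being killed by the kernel. To pull only the error-level entries for one job out of the execd messages file, something like the following may help. It is a rough sketch: the function name job_errors is hypothetical, and it assumes each entry sits on a single line with "|"-separated fields as in the excerpt above.

```shell
# Hypothetical helper (not part of SoGE): print error-level ("E") entries
# that mention the given job id, reading execd messages-format text on stdin.
# Assumed field layout: timestamp|thread|host|level|message
job_errors() {
    awk -F'|' -v job="$1" '$4 == "E" && index($0, job) { print }'
}

# Usage (the path to the messages file is site-specific):
#   job_errors 222673.1 < /path/to/execd/messages
```

The match is a plain substring test, so a short job id can also match longer ids that contain it.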
Thoughts?
-Dj