Re: [gridengine users] Jobs on qw state and exec node on au state

2016-05-30 Thread Christopher Heiny
On Mon, 2016-05-30 at 14:14 -0400, Bill Bryce wrote:
> Okay,
> [snip]
>
> Other things you can check is to see if all nodes can contact the
> qmaster machine i.e. the networking is configured properly. You can
> also make sure that the host naming is correct, either configure DNS
> properly
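
The host-naming check suggested above can be sketched in shell. This is only a rough sketch under stated assumptions: it assumes the usual Grid Engine layout in which the cell's common directory holds an act_qmaster file naming the current master, and it assumes a Linux host where getent is available. The function names qmaster_name and resolves_to are made up for illustration and are not part of Grid Engine.

```shell
#!/bin/sh
# Print the qmaster host name recorded in an act_qmaster file.
# $1 is the path to the file; the default path is an ASSUMPTION based on
# the common $SGE_ROOT/<cell>/common layout and may differ on your install.
qmaster_name() {
    file="${1:-${SGE_ROOT:-}/${SGE_CELL:-default}/common/act_qmaster}"
    # The file is expected to hold the host name on its first line.
    awk 'NR == 1 { print $1 }' "$file"
}

# Print the first address that a host name resolves to on this machine,
# using the system resolver (/etc/hosts, DNS, ...). Prints nothing if the
# name does not resolve.
resolves_to() {
    getent hosts "$1" | awk 'NR == 1 { print $1 }'
}

# Possible use on a compute node (not run here):
#   m=$(qmaster_name) && echo "qmaster: $m -> $(resolves_to "$m")"
```

Running this on each exec node and on the master shows whether every machine agrees on the master's name and address. A name that resolves to a loopback address on a compute node would be worth a closer look.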

Re: [gridengine users] Jobs on qw state and exec node on au state

2016-05-30 Thread Radhouane Aniba
Hello Bill,
Thank you for your reply. Everything looks OK as far as I can tell:
ubuntu@compute010:~$ hostname
compute010
ubuntu@compute010:~$ cat /etc/hosts
# THIS FILE IS CONTROLLED BY ANSIBLE
# any local modifications will be overwritten!
#
# This file is managed by Ansible.
127.0.0.1 localhost.

Re: [gridengine users] Jobs on qw state and exec node on au state

2016-05-30 Thread Bill Bryce
So typically with Grid Engine you need to select one machine as the ‘master’ machine in the cluster (you can have backups but they are running a ‘shadow_master’ so don’t worry about that for now). The qmaster needs to be on one host that all the nodes can communicate with over the network. Eac

Re: [gridengine users] Jobs on qw state and exec node on au state

2016-05-30 Thread Radhouane Aniba
OK, here is what I have. Connected to one node, compute010, qconf -sconf gives me this:
#global:
execd_spool_dir /var/spool/gridengine/execd
mailer /usr/bin/mail
xterm /usr/bin/xterm
load_sensor none
prolog

Re: [gridengine users] Jobs on qw state and exec node on au state

2016-05-30 Thread Bill Bryce
Okay, can you run any qconf commands, such as ‘qconf -sconf’? Try having a look at the messages files for the execution daemons. They should be in $SGE_ROOT/default/spool/, and in there are directories for the master and exec hosts (if you have this installed in a shared filesystem environment
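
For scanning those messages files, a small filter helps. This is a minimal sketch that assumes the usual pipe-separated line layout (timestamp|thread|host|level|text), with E marking errors and C marking critical entries. The function name sge_errors is hypothetical, and the layout should be checked against your own files before relying on it.

```shell
#!/bin/sh
# Print only error (E) and critical (C) lines from Grid Engine messages
# files. ASSUMPTION: fields are separated by "|" and the fourth field is
# the single-letter severity level.
# Reads the files given as arguments, or stdin when none are given.
sge_errors() {
    awk -F'|' '$4 == "E" || $4 == "C"' "$@"
}

# Possible use (not run here); the spool path depends on the install,
# see execd_spool_dir in the output of qconf -sconf:
#   sge_errors "$SGE_ROOT/default/spool/$(hostname)/messages" | tail -n 20
```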

Re: [gridengine users] Jobs on qw state and exec node on au state

2016-05-30 Thread Radhouane Aniba
I killed all sge_* processes on the exec nodes and tried to restart execd, but got this message:
root@compute010:/home/ubuntu# /usr/lib/gridengine/sge_execd
error: can't find connection
error: can't get configuration from qmaster -- backgrounding
On Mon, May 30, 2016 at 10:36 AM, Radhouane Aniba wrot

Re: [gridengine users] Jobs on qw state and exec node on au state

2016-05-30 Thread Radhouane Aniba
Hi Bill,
Yes, I am sure. This is what I have when I log in to one of the nodes and do:
ubuntu@compute010:~$ ps -ef | grep sge_
sgeadmin 1254 1 0 May28 ? 00:00:39 /usr/lib/gridengine/sge_qmaster
sgeadmin 1446 1 0 May28 ? 00:00:22 /usr/lib/gridengine/sge_execd
ubuntu 2552

Re: [gridengine users] Jobs on qw state and exec node on au state

2016-05-30 Thread Bill Bryce
Hi Rad, Are you sure that the execution daemons are running on your compute nodes? Can you log in to one of the nodes, say ‘compute001’, and do a ps looking for the execd? When an execd is functioning normally it reports the load, memory, etc. to the qmaster… none of your nodes are showing that. Regards,
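
The ps check described above can be wrapped in a small helper so it is easy to repeat on every node. This is just a sketch: count_execd is a made-up name, and it assumes ps -ef style output in which the daemon's command path ends in sge_execd.

```shell
#!/bin/sh
# Count sge_execd processes in `ps -ef`-style output read from stdin.
# Matching on the last field being the sge_execd binary keeps the
# "grep sge_" helper process itself out of the count.
count_execd() {
    awk '$NF ~ /(^|\/)sge_execd$/ { n++ } END { print n + 0 }'
}

# Possible use on a compute node (not run here):
#   ps -ef | count_execd
```

A count of 0 means no execution daemon is running on that node. A count of 1 only shows that the process exists, not that it has managed to register with the qmaster.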

[gridengine users] Jobs on qw state and exec node on au state

2016-05-30 Thread Radhouane Aniba
Hello all, I am trying to submit a simple "hello world" job to test a gridengine setup (I have used it before with no problems). The problem is that my job waits in the queue forever. The qhost command shows a weird state for the compute nodes:
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE