On Mon, 2016-05-30 at 14:14 -0400, Bill Bryce wrote:
> Okay,
>
[snip]
>
>
> Another thing you can check is whether all nodes can contact the
> qmaster machine, i.e. that the networking is configured properly. You can
> also make sure that the host naming is correct, either configure DNS
> properly
Hello Bill, thank you for your reply.
Everything looks OK as far as I can tell:
ubuntu@compute010:~$ hostname
compute010
ubuntu@compute010:~$ cat /etc/hosts
# THIS FILE IS CONTROLLED BY ANSIBLE
# any local modifications will be overwritten!
#
# This file is managed by Ansible.
127.0.0.1 localhost.
So typically with Grid Engine you need to select one machine as the ‘master’
machine in the cluster (you can have backups but they are running a
‘shadow_master’ so don’t worry about that for now). The qmaster needs to be on
one host that all the nodes can communicate with over the network. Eac
OK, here is what I have.
I connected to one node, compute010.
qconf -sconf gives me this:
#global:
execd_spool_dir /var/spool/gridengine/execd
mailer /usr/bin/mail
xterm /usr/bin/xterm
load_sensor none
prolog
Okay,
can you run any qconf commands such as ‘qconf -sconf’? Try having a look at
the messages files for the execution daemons. They should be in
$SGE_ROOT/default/spool/ and in there are directories for the master and exec
hosts (if you have this installed in a shared filesystem environment
I killed all sge_* processes on the exec nodes and tried to restart execd, but
got this message:
root@compute010:/home/ubuntu# /usr/lib/gridengine/sge_execd
error: can't find connection
error: can't get configuration from qmaster -- backgrounding
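
The "can't get configuration from qmaster" error above suggests that the execd could not reach a qmaster. One thing that may be worth checking is which host this node believes is the qmaster, and whether that name resolves. The sketch below assumes the usual $SGE_ROOT/$SGE_CELL/common/act_qmaster layout, with /var/lib/gridengine and "default" as fallback values (an assumption based on the Ubuntu packages, adjust for your install). read_qmaster is a hypothetical helper, not a Grid Engine command.

```shell
#!/bin/sh
# Assumed fallbacks; set SGE_ROOT and SGE_CELL yourself if your install differs.
SGE_ROOT="${SGE_ROOT:-/var/lib/gridengine}"
SGE_CELL="${SGE_CELL:-default}"
ACT_QMASTER="$SGE_ROOT/$SGE_CELL/common/act_qmaster"

# read_qmaster FILE: print the first line of FILE (the qmaster host name).
# Fails if FILE is not readable.
read_qmaster() {
    [ -r "$1" ] && head -n 1 "$1"
}

if master=$(read_qmaster "$ACT_QMASTER") && [ -n "$master" ]; then
    echo "this node expects the qmaster on: $master"
    # getent consults /etc/hosts and DNS the way the resolver would.
    getent hosts "$master" || echo "cannot resolve $master"
else
    echo "cannot read a qmaster name from $ACT_QMASTER"
fi
```

If every node reports its own host name here, the nodes are probably not pointing at a single shared qmaster, which would be consistent with the execds failing to register.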
On Mon, May 30, 2016 at 10:36 AM, Radhouane Aniba wrot
Hi Bill,
Yes, I am sure.
This is what I get when I log in to one of the nodes and run:
ubuntu@compute010:~$ ps -ef | grep sge_
sgeadmin 1254 1 0 May28 ? 00:00:39 /usr/lib/gridengine/sge_qmaster
sgeadmin 1446 1 0 May28 ? 00:00:22 /usr/lib/gridengine/sge_execd
ubuntu 2552
Hi Rad,
Are you sure that the execution daemons are running on your compute nodes? Can
you log in to one of the nodes, say ‘compute001’, and run ps to look for the
execd? When an execd is functioning normally it reports the load, memory,
etc. None of your nodes are showing that.
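
The ps check described above could be scripted roughly as follows, so that it can be repeated on each node. This is only a sketch: has_process is a hypothetical helper, and the daemon names are the ones that appear elsewhere in this thread.

```shell
#!/bin/sh
# has_process NAME: read process names (one per line) on stdin and
# succeed only if NAME matches a whole line exactly.
has_process() {
    grep -qxF -- "$1"
}

for d in sge_execd sge_qmaster; do
    # "ps -e -o comm=" prints one command name per process, without a header.
    if ps -e -o comm= 2>/dev/null | has_process "$d"; then
        echo "$d: running"
    else
        echo "$d: not running"
    fi
done
```

Matching whole lines avoids the usual problem of "ps -ef | grep sge_" also listing the grep process itself.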
Regards,
Hello all,
I am trying to submit a simple "hello world" job to test a gridengine setup (I
have used it before with no problems).
The problem is that my job waits in the queue forever.
The qhost command shows a weird state for the compute nodes:
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE