From: Reuti <re...@staff.uni-marburg.de>
Subject: Re: [gridengine users] Configure gridengine on CentOS 6.3
Date: Wed, 7 Nov 2012 20:26:54 +0100

> On 07.11.2012 at 18:49, Petter Gustad wrote:
> 
>> From: Reuti <re...@staff.uni-marburg.de>
>> Subject: Re: [gridengine users] Configure gridengine on CentOS 6.3
>> Date: Wed, 7 Nov 2012 16:37:22 +0100
>> 
>>> On 07.11.2012 at 15:46, Petter Gustad wrote:
>>> 
>>>>> From: Reuti <re...@staff.uni-marburg.de>
>>>>> Subject: Re: [gridengine users] Configure gridengine on CentOS 6.3
>>>>> Date: Tue, 30 Oct 2012 11:27:49 +0100
>>>>> 
>>>>>> Just use the version you have already in the shared /usr/sge or your
>>>>>> particular mountpoint.
>>>>> 
>>>>> I should probably try this first, at least to verify that it's
>>>>> working. But later I would like to migrate to CentOS on all my
>>>>> exechosts and leave the installation to somebody else.
>>>> 
>>>> I did this and it worked out fine on the first machine I migrated.
>>>> However, on the next set of machines I ran into a problem where a
>>>> submitted job causes the queue to go into the error state.
>>>> 
>>>> I observe that:
>>>> 
>>>> 1) The job will not start
>>>> 2) The queue will be marked with the 'E' state
>>>> 3) I get an e-mail saying
>>>>   Shepherd pe_hostfile:
>>>>   node 1 queue@node UNDEFINED
>>>> 4) The node will log the following in the spool/node/messages file:
>>>>   11/07/2012 15:33:07|  main|node|E|shepherd of job 48548.1 exited with exit status = 11
>>>> 5) qstat -j jobnumber returns
>>>> 
>>>>   error reason    1:          11/07/2012 15:33:06 [555:29681]: unable to find job file "/work/gridengine/spool/node/job_scr
>> 
>> Is this output always truncated,
> 
> Yes.

OK. Good.

> 
>> or could this be the source of the problem?
> 
> No.
> 
> 
>>> This looks like an unusual path for the spool directory. The name of the 
>>> node should be included.
>> 
>> I've substituted the string "node" for the actual node name. It appears
>> to be the same for all the nodes, hence I just used "node".
> 
> Good.
> 
> 
>>> $ qconf -sconf
>>> 
>>> (at the top, something like: execd_spool_dir              /var/spool/sge; 
>>> the directory for the particular node will be created automatically when 
>>> the execd starts up)
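>>> 
>>> E.g., to pull out just that setting (grep being the obvious helper here):
>>> 
>>> $ qconf -sconf | grep execd_spool_dir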
>> 
>> This will show the spool directory on the qmaster, which is different
> 
> No, it's the global setting for the execd spool directory. This can be 
> overridden per host, in case you have different paths on the nodes.
> 
> If all nodes use the same path, you can even delete all the local 
> definitions listed by `qconf -sconfl`.
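> 
> E.g., dropping the local definition of one node (node42 being a placeholder) 
> would be something like:
> 
> $ qconf -dconf node42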
> 
> NB: The location of the qmaster spool directory is in 
> "/usr/sge/default/common/bootstrap" (adjust the path for your installation): 
> like for me "qmaster_spool_dir       /var/spool/sge/qmaster"
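> 
> E.g., a quick way to look it up (the output below is from my installation, 
> yours may differ):
> 
> $ grep qmaster_spool_dir /usr/sge/default/common/bootstrap
> qmaster_spool_dir       /var/spool/sge/qmaster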
> 
> 
>> from the above. But for all the nodes this is /work/gridengine/spool.
> 
> Yes, but if you check the directory /work/gridengine/spool there should be a 
> level for the node, i.e. /work/gridengine/spool/node001 or whatever. Is this 
> directory readable by the sgeadmin user account?
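> 
> A quick check would be something like (node001 and sgeadmin being 
> placeholders for your node name and admin account):
> 
> $ ls -ld /work/gridengine/spool/node001
> $ su - sgeadmin -c "ls /work/gridengine/spool/node001"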

That was the problem. Thanks! This directory was readable by the
gridengine account only. By making it world readable I managed to
submit a job. The permissions also differed between the working and
non-working nodes.
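
For the record, the fix was roughly this (with "node" again standing in
for the actual node name):

$ chmod o+rx /work/gridengine/spool/node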


> 
>>> $ qconf -sconfl
>>> 
>>> (list all hosts that have a local configuration [if any are present at all]), then for the 
>>> particular node:
>>> 
>>> $ qconf -sconf node42
>>> 
>>> and check the path to the execd_spool_dir.
>> 
>> They are all identical. If I do something like:
>> 
>> qconf -sconf good-node > /tmp/good-node
>> qconf -sconf bad-node > /tmp/bad-node
>> 
>> and diff the two, the only difference is the hostname part.
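>> 
>> Concretely, the last step is just:
>> 
>> $ diff /tmp/good-node /tmp/bad-node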
>> 
>> All the nodes are using spool on a local filesystem located at
>> /work/gridengine/spool
>> 
>> 
>> The only difference I see on the bad nodes is that there is a "." at
>> the end of the permissions in the spool directory, so I think this
>> might be related to SELinux. I'll have to investigate this further.
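>> 
>> E.g. (with "node" as the placeholder again), the context behind that dot
>> should show up with:
>> 
>> $ ls -ldZ /work/gridengine/spool/node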
> 
> Yep. It means access is limited by some other facility, just as a "+" indicates an ACL.
> 
> I suggest switching off SELinux.
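> 
> On CentOS that would be something like (setenforce only lasts until the 
> next reboot):
> 
> $ getenforce
> Enforcing
> $ setenforce 0
> 
> plus setting SELINUX=disabled in /etc/selinux/config to make it permanent.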
> -- Reuti


Best regards
//Petter