Hi all,

I just replaced my prolog script with a fairly simple one:

#!/bin/sh

exit 0


And yes, the qrsh session can now go through successfully. Now the
questions I really need to deal with are:

1) which part of my real prolog script conflicts with the scheduling process?
2) why does this happen only on SGE 8.1.9 and not on SGE 2011.11p1?

Maybe it is worth mentioning that the new SGE cluster is CentOS 7 based
and the old one is CentOS 6. Not sure whether this matters as well.
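One way to attack (1) is to bisect the prolog: starting from the trivial
script above, re-add one block of the real prolog at a time and re-test
qrsh after each step. A minimal sketch of the first step (the step
numbering is mine, not anything SGE provides):

#!/bin/sh
# Bisection step 1: only the exec redirections from the real prolog.
exec >> /tmp/prolog_exec_"$JOB_ID"_"$SGE_TASK_ID".log
exec 2>&1
exit 0
# Step 2 would re-add the PE_HOSTFILE/quota arithmetic, step 3 the
# xfs_quota calls, and so on.

If qrsh already hangs at step 1, then the stdout/stderr redirection
itself, rather than xfs_quota, is what conflicts with the builtin IJS
connection.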

Cheers,
Derrick

On Thu, Jan 10, 2019 at 9:39 AM Derrick Lin <klin...@gmail.com> wrote:

> Hi Reuti and Iyad,
>
> Here is my prolog script; it does just one thing: it sets a quota on the
> XFS volume for each job:
>
> The prolog_exec_xx_xx.log file was generated, so I assume the first exec
> command was executed.
>
> Since the generated log file is empty, I think nothing was executed after
> that.
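> One way to see exactly how far it gets would be to turn on shell tracing
> right after the redirections (a sketch; set -x writes its trace to
> stderr, which the second exec sends into the same log):
>
> exec >> /tmp/prolog_exec_"$JOB_ID"_"$SGE_TASK_ID".log
> exec 2>&1
> set -x
>
> If the log still stays empty with tracing on, the script is most likely
> blocking in or before the exec lines themselves.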
>
> Cheers
>
> [root@zeta-4-12 common]# cat prolog_exec.sh
> #!/bin/sh
>
> # Send all prolog output (stdout and stderr) to a per-job log file.
> exec >> /tmp/prolog_exec_"$JOB_ID"_"$SGE_TASK_ID".log
> exec 2>&1
>
> SGE_TMP_ROOT="/scratch_local"
>
> # Number of slots granted on this host (empty for non-PE jobs).
> pe_num=$(grep "$HOSTNAME" "$PE_HOSTFILE" | awk '{print $2}')
>
> # Split the requested tmp space into value and unit, e.g. "10G" -> "10" + "G".
> tmp_req_var=$(echo "$tmp_requested" | grep -o -E '[0-9]+')
> tmp_req_unit=$(echo "$tmp_requested" | sed 's/[0-9]*//g')
>
> # For PE jobs, scale the per-slot request by the number of slots.
> if [ -z "$pe_num" ]; then
>         quota=$tmp_requested
> else
>         quota=$((tmp_req_var * pe_num))$tmp_req_unit
> fi
>
> echo "############################# [$HOSTNAME PROLOG] - JOB_ID:$JOB_ID TASK_ID:$SGE_TASK_ID #############################"
> echo "$(date) [$HOSTNAME PROLOG]: xfs_quota -x -c 'project -s -p $TMP $JOB_ID' $SGE_TMP_ROOT"
> echo "$(date) [$HOSTNAME PROLOG]: xfs_quota -x -c 'limit -p bhard=$quota $JOB_ID' $SGE_TMP_ROOT"
>
> xfs_quota_rc=0
>
> # Attach the job's $TMP directory to an XFS project named after the job
> # ID, then put a hard block limit on that project.
> /usr/sbin/xfs_quota -x -c "project -s -p $TMP $JOB_ID" "$SGE_TMP_ROOT"
> xfs_quota_rc=$((xfs_quota_rc + $?))
>
> /usr/sbin/xfs_quota -x -c "limit -p bhard=$quota $JOB_ID" "$SGE_TMP_ROOT"
> xfs_quota_rc=$((xfs_quota_rc + $?))
>
> if [ "$xfs_quota_rc" -eq 0 ]; then
>         exit 0
> else
>         exit 100 # Put the job in an error state
> fi
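>
> If the exec redirections turn out to be the culprit, one variant worth
> testing keeps the script's own stdin/stdout/stderr untouched and
> redirects a command group instead (a sketch of the same logic, not a
> known fix):
>
> #!/bin/sh
> {
>         echo "$(date) [$HOSTNAME PROLOG] JOB_ID:$JOB_ID"
>         # ... PE_HOSTFILE parsing and xfs_quota calls as above ...
> } >> /tmp/prolog_exec_"$JOB_ID"_"$SGE_TASK_ID".log 2>&1
> # exit 0 / exit 100 decision as above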
>
>
> On Wed, Jan 9, 2019 at 7:36 PM Reuti <re...@staff.uni-marburg.de> wrote:
>
>> Hi,
>>
>> > Am 09.01.2019 um 01:14 schrieb Derrick Lin <klin...@gmail.com>:
>> >
>> > Hi guys,
>> >
>> > I just brought up a new SGE cluster, but somehow the qrsh session does
>> not work:
>> >
>> > tester@login-gpu:~$ qrsh
>> > ^Cerror: error while waiting for builtin IJS connection: "got select timeout"
>> >
>> > After I hit Enter, the session just got stuck there forever instead of
>> > bringing me to a compute node. I had to press Ctrl+C to terminate it,
>> > which gave the above error.
>> >
>> > I noticed that SGE did send my qrsh request to a compute node, as I
>> > could tell from qstat:
>> >
>> >
>> > ---------------------------------------------------------------------------------
>> > short.q@zeta-4-15.local        BIP   0/1/80         0.01     lx-amd64
>> >      15 0.55500 QRLOGIN    tester       r    01/09/2019 10:47:13     1
>> >
>> > We have a prolog script configured globally; it deals with the local
>> > disk quota and writes all of its output to a per-job log file. So I
>> > went to that compute node and checked: a log file had been created,
>> > but it was empty.
>> >
>> > So my thinking so far is that my qrsh gets stuck because the prolog
>> > script is not fully executed.
>>
>> Is there any statement in the prolog which could wait for stdin? In a
>> batch job there is just no stdin, hence it continues. This could be
>> tested by passing "-i" to a batch job.
>>
>> -- Reuti
>>
>>
>> > qsub jobs are working fine.
>> >
>> > Any ideas will be appreciated.
>> >
>> > Cheers,
>> > Derrick
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
