Hi all,

I just replaced my prolog script with a fairly simple one:

#!/bin/sh
exit 0

And yes, the qrsh session now goes through successfully. The questions I
really need to answer now are:

1) Which part of my real prolog script conflicts with the scheduling process?
2) Why does this happen only on SGE 8.1.9 and not on SGE 2011.11p1?

It may be worth mentioning that the new SGE cluster is CentOS 7 based and the
old one is CentOS 6. I am not sure whether this matters.

Cheers,
Derrick

On Thu, Jan 10, 2019 at 9:39 AM Derrick Lin <klin...@gmail.com> wrote:

> Hi Reuti and Iyad,
>
> Here is my prolog script. It does just one thing: it sets a quota on the
> XFS volume for each job.
>
> The prolog_exec_xx_xx.log file was generated, so I assume the first exec
> command was executed.
>
> Since the generated log file is empty, I think nothing was executed after
> that.
>
> Cheers
>
> [root@zeta-4-12 common]# cat prolog_exec.sh
> #!/bin/sh
>
> exec >> /tmp/prolog_exec_"$JOB_ID"_"$SGE_TASK_ID".log
> exec 2>&1
>
> SGE_TMP_ROOT="/scratch_local"
>
> pe_num=$(cat $PE_HOSTFILE | grep $HOSTNAME | awk '{print $2}')
>
> tmp_req_var=$(echo "$tmp_requested" | grep -o -E '[0-9]+')
> tmp_req_unit=$(echo "$tmp_requested" | sed 's/[0-9]*//g')
>
> if [ -z "$pe_num" ]; then
>     quota=$tmp_requested
> else
>     quota=$(expr $tmp_req_var \* $pe_num)$tmp_req_unit
> fi
>
> echo "############################# [$HOSTNAME PROLOG] - JOB_ID:$JOB_ID TASK_ID:$SGE_TASK_ID #############################"
> echo "`date` [$HOSTNAME PROLOG]: xfs_quota -x -c 'project -s -p $TMP $JOB_ID' $SGE_TMP_ROOT"
> echo "`date` [$HOSTNAME PROLOG]: xfs_quota -x -c 'limit -p bhard=$quota $JOB_ID' $SGE_TMP_ROOT"
>
> xfs_quota_rc=0
>
> /usr/sbin/xfs_quota -x -c "project -s -p $TMP $JOB_ID" $SGE_TMP_ROOT
> ((xfs_quota_rc+=$?))
>
> /usr/sbin/xfs_quota -x -c "limit -p bhard=$quota $JOB_ID" $SGE_TMP_ROOT
> ((xfs_quota_rc+=$?))
>
> if [ $xfs_quota_rc -eq 0 ]; then
>     exit 0
> else
>     exit 100 # Put job in error state
> fi
>
>
> On Wed, Jan 9, 2019 at 7:36 PM Reuti <re...@staff.uni-marburg.de> wrote:
>
>> Hi,
>>
>> > On 09.01.2019 at 01:14, Derrick Lin <klin...@gmail.com> wrote:
>> >
>> > Hi guys,
>> >
>> > I just brought up a new SGE cluster, but somehow the qrsh session
>> > does not work:
>> >
>> > tester@login-gpu:~$ qrsh
>> > ^Cerror: error while waiting for builtin IJS connection: "got select timeout"
>> >
>> > After I hit Enter, the session just hangs forever instead of bringing
>> > me to a compute node. I have to press Ctrl+C to terminate it, and it
>> > then gives the above error.
>> >
>> > I noticed that SGE did send my qrsh request to a compute node, as I
>> > could tell from qstat:
>> >
>> > ---------------------------------------------------------------------------------
>> > short.q@zeta-4-15.local BIP 0/1/80 0.01 lx-amd64
>> > 15 0.55500 QRLOGIN tester r 01/09/2019 10:47:13 1
>> >
>> > We have a prolog script configured globally. The script deals with
>> > local disk quota and writes all output to a log file for each job. So
>> > I went to that compute node and checked, and found that a log file
>> > had been created but was empty.
>> >
>> > So my thinking so far is that my qrsh hangs because the prolog script
>> > is not fully executed.
>>
>> Is there any statement in the prolog which could wait for stdin – and in
>> a batch job there is just no stdin, hence it continues? Could be tested
>> with "-i" to a batch job.
>>
>> -- Reuti
>>
>>
>> > qsub jobs are working fine.
>> >
>> > Any idea will be appreciated.
>> >
>> > Cheers,
>> > Derrick
>> > _______________________________________________
>> > users mailing list
>> > users@gridengine.org
>> > https://gridengine.org/mailman/listinfo/users
>>
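
One candidate for the stdin wait Reuti asks about, offered as a hypothesis and not confirmed anywhere in this thread: if PE_HOSTFILE is unset or empty (assumed here to be the case for a qrsh session started without a parallel environment), the unquoted `cat $PE_HOSTFILE` in the prolog expands to a bare `cat`, which reads from stdin. A minimal sketch of a guarded lookup follows. `slots_for_host` is a hypothetical helper name, not part of the original prolog, and the exact match on the first hostfile column is an assumption about how host names appear there.

```shell
#!/bin/sh
# Hypothetical helper (not from the original prolog): print the slot count
# for one host from a PE hostfile, without ever reading from stdin.
slots_for_host() {
    hostfile=$1
    host=$2
    # No readable hostfile: print nothing and return, instead of letting an
    # empty, unquoted variable turn "cat $PE_HOSTFILE" into a bare "cat".
    [ -n "$hostfile" ] && [ -r "$hostfile" ] || return 0
    # Assumption: column 1 is the host name and column 2 the slot count.
    # stdin is redirected from /dev/null as an extra safeguard.
    awk -v h="$host" '$1 == h { print $2; exit }' "$hostfile" </dev/null
}

# HOSTNAME is not set by every sh implementation, so fall back to uname -n.
pe_num=$(slots_for_host "${PE_HOSTFILE:-}" "${HOSTNAME:-$(uname -n)}")
```

With this, pe_num stays empty when there is no hostfile, so the existing `[ -z "$pe_num" ]` branch in the prolog would still apply unchanged.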