> On 11.01.2019 at 00:30, Derrick Lin <klin...@gmail.com> wrote:
>
> Hi Reuti
>
> Thanks for the input. But how does this help with troubleshooting the prolog script?
You asked for the meaning of the "-i" option, and I tried to outline its behavior.

-- Reuti

> I will also troubleshoot the prolog script line by line and see which line is causing the problem.
>
> Cheers,
> Derrick
>
> On Thu, Jan 10, 2019 at 7:42 PM Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
>
> On 09.01.2019 at 23:35, Derrick Lin wrote:
>
> > Hi Reuti,
> >
> > I have to say I am still not familiar with the "-i" option of qsub after reading the man page. What does it do?
>
> The file will be fed as stdin to the jobscript. Hence:
>
> $ qsub -i myfile foo.sh
>
> is like:
>
> $ foo.sh < myfile
>
> but in batch.
>
> -- Reuti
>
>
> > There is no useful/interesting output in the qmaster messages or the exec node messages log. The only information I could find is from the job's trace file:
> >
> > [root@zeta-4-12 381.1]# ls
> > config environment error exit_status pe_hostfile pid trace
> > [root@zeta-4-12 381.1]# cat trace
> > 01/10/2019 09:12:07 [997:307578]: shepherd called with uid = 0, euid = 997
> > 01/10/2019 09:12:07 [997:307578]: qlogin_daemon = builtin
> > 01/10/2019 09:12:07 [997:307578]: starting up 8.1.9
> > 01/10/2019 09:12:07 [997:307578]: setpgid(307578, 307578) returned 0
> > 01/10/2019 09:12:07 [997:307578]: do_core_binding: "binding" parameter not found in config file
> > 01/10/2019 09:12:07 [997:307578]: calling fork_pty()
> > 01/10/2019 09:12:07 [997:307578]: parent: forked "prolog" with pid 307579
> > 01/10/2019 09:12:07 [997:307578]: using signal delivery delay of 120 seconds
> > 01/10/2019 09:12:07 [997:307578]: parent: prolog-pid: 307579
> > 01/10/2019 09:12:07 [997:307579]: child: starting son(prolog, root@/opt/gridengine/default/common/prolog_exec.sh, 0, 10000);
> > 01/10/2019 09:12:07 [997:307579]: pid=307579 pgrp=307579 sid=307579 old pgrp=307579 getlogin()=<no login set>
> > 01/10/2019 09:12:07 [997:307579]: reading passwd information for user 'root'
> > 01/10/2019 09:12:07 [997:307579]: setting limits
> > 01/10/2019 09:12:07 [997:307579]: setting environment
> > 01/10/2019 09:12:07 [997:307579]: Initializing error file
> > 01/10/2019 09:12:07 [997:307579]: switching to intermediate/target user
> > 01/10/2019 09:12:07 [997:307579]: setting additional gid=0
> > 01/10/2019 09:12:07 [6782:307579]: closing all filedescriptors
> > 01/10/2019 09:12:07 [6782:307579]: further messages are in "error" and "trace"
> > 01/10/2019 09:12:07 [997:307578]: Poll received POLLHUP (Hang up). Unregister the FD.
> > 01/10/2019 09:12:07 [6782:307579]: using "/bin/bash" as shell of user "root"
> > 01/10/2019 09:12:07 [0:307579]: now running with uid=0, euid=0
> > 01/10/2019 09:12:07 [0:307579]: execvlp(/opt/gridengine/default/common/prolog_exec.sh, "/opt/gridengine/default/common/prolog_exec.sh")
> > ### The process just got stuck at the line above
> >
> > Here is the trace file for a qsub/batch job. Apparently the prolog script got executed and the process proceeded:
> >
> > [root@zeta-4-12 383.1]# ls
> > addgrpid config environment error exit_status job_pid pe_hostfile pid trace
> > [root@zeta-4-12 383.1]# cat trace
> > 01/10/2019 09:20:22 [997:315329]: shepherd called with uid = 0, euid = 997
> > 01/10/2019 09:20:22 [997:315329]: starting up 8.1.9
> > 01/10/2019 09:20:22 [997:315329]: setpgid(315329, 315329) returned 0
> > 01/10/2019 09:20:22 [997:315329]: do_core_binding: "binding" parameter not found in config file
> > 01/10/2019 09:20:22 [997:315329]: parent: forked "prolog" with pid 315330
> > 01/10/2019 09:20:22 [997:315329]: using signal delivery delay of 120 seconds
> > 01/10/2019 09:20:22 [997:315329]: parent: prolog-pid: 315330
> > 01/10/2019 09:20:22 [997:315330]: child: starting son(prolog, root@/opt/gridengine/default/common/prolog_exec.sh, 0, 10000);
> > 01/10/2019 09:20:22 [997:315330]: pid=315330 pgrp=315330 sid=315330 old pgrp=315329 getlogin()=<no login set>
> > 01/10/2019 09:20:22 [997:315330]: reading passwd information for user 'root'
> > 01/10/2019 09:20:22 [997:315330]: setting limits
> > 01/10/2019 09:20:22 [997:315330]: setting environment
> > 01/10/2019 09:20:22 [997:315330]: Initializing error file
> > 01/10/2019 09:20:22 [997:315330]: switching to intermediate/target user
> > 01/10/2019 09:20:22 [997:315330]: setting additional gid=0
> > 01/10/2019 09:20:22 [6782:315330]: closing all filedescriptors
> > 01/10/2019 09:20:22 [6782:315330]: further messages are in "error" and "trace"
> > 01/10/2019 09:20:22 [6782:315330]: using "/bin/bash" as shell of user "root"
> > 01/10/2019 09:20:22 [6782:315330]: using stdout as stderr
> > 01/10/2019 09:20:22 [0:315330]: now running with uid=0, euid=0
> > 01/10/2019 09:20:22 [0:315330]: execvlp(/opt/gridengine/default/common/prolog_exec.sh, "/opt/gridengine/default/common/prolog_exec.sh")
> > 01/10/2019 09:20:22 [997:315329]: wait3 returned 315330 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
> > 01/10/2019 09:20:22 [997:315329]: prolog exited with exit status 0
> > 01/10/2019 09:20:22 [997:315329]: reaped "prolog" with pid 315330
> > 01/10/2019 09:20:22 [997:315329]: prolog exited not due to signal
> > 01/10/2019 09:20:22 [997:315329]: prolog exited with status 0
> > 01/10/2019 09:20:22 [997:315329]: parent: forked "job" with pid 315345
> > 01/10/2019 09:20:22 [997:315329]: parent: job-pid: 315345
> > 01/10/2019 09:20:22 [997:315345]: child: starting son(job, sleep, 0, 4096);
> > 01/10/2019 09:20:22 [997:315345]: pid=315345 pgrp=315345 sid=315345 old pgrp=315329 getlogin()=<no login set>
> > 01/10/2019 09:20:22 [997:315345]: reading passwd information for user 'derlin'
> > 01/10/2019 09:20:22 [997:315345]: setosjobid: uid = 0, euid = 997
> > 01/10/2019 09:20:22 [997:315345]: setting limits
> > 01/10/2019 09:20:22 [997:315345]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> > 01/10/2019 09:20:22 [997:315345]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> > 01/10/2019 09:20:22 [997:315345]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> > 01/10/2019 09:20:22 [997:315345]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> > 01/10/2019 09:20:22 [997:315345]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> > 01/10/2019 09:20:22 [997:315345]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 8257536000 hard 8257536000) resulting: (soft 8257536000 hard 8257536000)
> > 01/10/2019 09:20:22 [997:315345]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> > 01/10/2019 09:20:22 [997:315345]: setting environment
> > 01/10/2019 09:20:22 [997:315345]: Initializing error file
> > 01/10/2019 09:20:22 [997:315345]: switching to intermediate/target user
> > 01/10/2019 09:20:22 [997:315345]: setting additional gid=20011
> > 01/10/2019 09:20:22 [6782:315345]: closing all filedescriptors
> > 01/10/2019 09:20:22 [6782:315345]: further messages are in "error" and "trace"
> > 01/10/2019 09:20:22 [6782:315345]: using stdout as stderr
> > 01/10/2019 09:20:22 [6782:315345]: now running with uid=6782, euid=6782
> > 01/10/2019 09:20:22 [6782:315345]: execvlp(/bin/csh, "-csh" "-c" "sleep 10m ")
> >
> > I will attach my prolog script in the next post.
> >
> > Cheers
> > Derrick
> >
> >
> > On Wed, Jan 9, 2019 at 7:36 PM Reuti <re...@staff.uni-marburg.de> wrote:
> > Hi,
> >
> > > On 09.01.2019 at 01:14, Derrick Lin <klin...@gmail.com> wrote:
> > >
> > > Hi guys,
> > >
> > > I just brought up a new SGE cluster, but somehow the qrsh session does not work:
> > >
> > > tester@login-gpu:~$ qrsh
> > > ^Cerror: error while waiting for builtin IJS connection: "got select timeout"
> > >
> > > After I hit Enter, the session just hung there forever instead of bringing me to a compute node. I had to press Ctrl+C to terminate it, and it gave the above error.
> > >
> > > I noticed that SGE did send my qrsh request to a compute node, as I could tell from qstat:
> > >
> > > ---------------------------------------------------------------------------------
> > > short.q@zeta-4-15.local BIP 0/1/80 0.01 lx-amd64
> > > 15 0.55500 QRLOGIN tester r 01/09/2019 10:47:13 1
> > >
> > > We have a prolog script configured globally. The script deals with local disk quota and writes all output to a log file for each job. So I went to that compute node and checked, and found that a log file had been created but was empty.
> > >
> > > So my thinking so far is that my qrsh is stuck because the prolog script is not fully executed.
> >
> > Is there any statement in the prolog which could wait for stdin? In a batch job there is just no stdin, hence it continues. This could be tested by passing "-i" to a batch job.
> >
> > -- Reuti
> >
> >
> > > qsub jobs are working fine.
> > >
> > > Any ideas will be appreciated.
> > >
> > > Cheers,
> > > Derrick
> > > _______________________________________________
> > > users mailing list
> > > users@gridengine.org
> > > https://gridengine.org/mailman/listinfo/users
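
The stdin hypothesis discussed in this thread can be tried outside of Grid Engine first. The following is only a minimal sketch: `fake_prolog` is a hypothetical stand-in, not the real prolog_exec.sh, and it assumes the prolog contains some statement that reads stdin. With stdin at end-of-file (assumed here to resemble the batch case) the read returns immediately; with a file attached, as with `foo.sh < myfile`, the input is consumed. On an open terminal or pty with nothing typed, the same read would presumably block, which would be consistent with the qrsh trace that shows `calling fork_pty()` while the batch trace does not.

```shell
#!/bin/sh
# Hypothetical stand-in for a prolog that (perhaps unintentionally) reads stdin.
fake_prolog() {
    if read -r line; then
        echo "prolog consumed: $line"
    else
        echo "prolog saw EOF and continued"
    fi
}

# Batch-like case (assumption): nothing on stdin, so read returns at once.
batch_result=$(fake_prolog < /dev/null)

# "-i"-like case: a file is fed as stdin, like `foo.sh < myfile` in the thread.
tmpfile=$(mktemp)
echo "hello" > "$tmpfile"
input_result=$(fake_prolog < "$tmpfile")
rm -f "$tmpfile"

echo "$batch_result"
echo "$input_result"
```

If the real prolog behaves like the second case when a batch job is submitted with `qsub -i myfile`, that would point to a statement reading stdin; stepping through the script line by line should then locate it.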