I compared the two trace files closely and found that one line appears in the qrsh job's trace but not in the qsub job's:
01/10/2019 09:12:07 [997:307578]: Poll received POLLHUP (Hang up). Unregister the FD.

Does this mean anything important?

We also have another SGE cluster running an older version of SGE (and based on CentOS 6). The same prolog script runs fine for qrsh there, and its trace file is included below. So in the older SGE cluster, the qrsh trace file does not have that POLLHUP line either.

[luffy@omega-6-16 ~]$ cat /opt/gridengine/default/spool/omega-6-16/active_jobs/1430156.1/trace
01/10/2019 09:40:09 [400:58636]: shepherd called with uid = 0, euid = 400
01/10/2019 09:40:09 [400:58636]: rlogin_daemon = builtin
01/10/2019 09:40:09 [400:58636]: starting up 2011.11p1
01/10/2019 09:40:09 [400:58636]: setpgid(58636, 58636) returned 0
01/10/2019 09:40:09 [400:58636]: do_core_binding: "binding" parameter not found in config file
01/10/2019 09:40:09 [400:58636]: parent: forked "prolog" with pid 58637
01/10/2019 09:40:09 [400:58636]: using signal delivery delay of 120 seconds
01/10/2019 09:40:09 [400:58636]: parent: prolog-pid: 58637
01/10/2019 09:40:09 [400:58637]: child: starting son(prolog, root@/opt/gridengine/default/common/prolog_exec.sh, 0);
01/10/2019 09:40:09 [400:58637]: pid=58637 pgrp=58637 sid=58637 old pgrp=58636 getlogin()=<no login set>
01/10/2019 09:40:09 [400:58637]: reading passwd information for user 'root'
01/10/2019 09:40:09 [400:58637]: setting limits
01/10/2019 09:40:09 [400:58637]: setting environment
01/10/2019 09:40:09 [400:58637]: Initializing error file
01/10/2019 09:40:09 [400:58637]: switching to intermediate/target user
01/10/2019 09:40:09 [500:58637]: closing all filedescriptors
01/10/2019 09:40:09 [500:58637]: further messages are in "error" and "trace"
01/10/2019 09:40:09 [500:58637]: using "/bin/bash" as shell of user "root"
01/10/2019 09:40:09 [0:58637]: now running with uid=0, euid=0
01/10/2019 09:40:09 [0:58637]: execvp(/opt/gridengine/default/common/prolog_exec.sh, "/opt/gridengine/default/common/prolog_exec.sh")
01/10/2019 09:40:09 [400:58636]: wait3 returned 58637 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
01/10/2019 09:40:09 [400:58636]: prolog exited with exit status 0
01/10/2019 09:40:09 [400:58636]: reaped "prolog" with pid 58637
01/10/2019 09:40:09 [400:58636]: prolog exited not due to signal
01/10/2019 09:40:09 [400:58636]: prolog exited with status 0
01/10/2019 09:40:09 [400:58636]: pipe to child uses fds 4 and 5
01/10/2019 09:40:09 [400:58636]: calling fork_pty()
01/10/2019 09:40:09 [400:58636]: parent: forked "job" with pid 58672
01/10/2019 09:40:09 [400:58636]: parent: job-pid: 58672
01/10/2019 09:40:09 [400:58636]: parent: closing childs end of the pipe
01/10/2019 09:40:09 [400:58636]: csp = 0
01/10/2019 09:40:09 [400:58636]: parent: starting parent loop with remote_host = dice01.local, remote_port = 44009, job_owner = luffy, fd_pty_master = 6, fd_pipe_in = -1, fd_pipe_out = -1, fd_pipe_err = -1, fd_pipe_to_child = 5
### more lines omitted

On Thu, Jan 10, 2019 at 9:35 AM Derrick Lin <klin...@gmail.com> wrote:
> Hi Reuti,
>
> I have to say I am still not familiar with the "-i" in qsub after reading
> the man page, what does it do?
>
> There is no useful/interesting output in qmaster message or exec node
> message log.
> The only information I could find is from job's trace file:
>
> [root@zeta-4-12 381.1]# ls
> config environment error exit_status pe_hostfile pid trace
> [root@zeta-4-12 381.1]# cat trace
> 01/10/2019 09:12:07 [997:307578]: shepherd called with uid = 0, euid = 997
> 01/10/2019 09:12:07 [997:307578]: qlogin_daemon = builtin
> 01/10/2019 09:12:07 [997:307578]: starting up 8.1.9
> 01/10/2019 09:12:07 [997:307578]: setpgid(307578, 307578) returned 0
> 01/10/2019 09:12:07 [997:307578]: do_core_binding: "binding" parameter not found in config file
> 01/10/2019 09:12:07 [997:307578]: calling fork_pty()
> 01/10/2019 09:12:07 [997:307578]: parent: forked "prolog" with pid 307579
> 01/10/2019 09:12:07 [997:307578]: using signal delivery delay of 120 seconds
> 01/10/2019 09:12:07 [997:307578]: parent: prolog-pid: 307579
> 01/10/2019 09:12:07 [997:307579]: child: starting son(prolog, root@/opt/gridengine/default/common/prolog_exec.sh, 0, 10000);
> 01/10/2019 09:12:07 [997:307579]: pid=307579 pgrp=307579 sid=307579 old pgrp=307579 getlogin()=<no login set>
> 01/10/2019 09:12:07 [997:307579]: reading passwd information for user 'root'
> 01/10/2019 09:12:07 [997:307579]: setting limits
> 01/10/2019 09:12:07 [997:307579]: setting environment
> 01/10/2019 09:12:07 [997:307579]: Initializing error file
> 01/10/2019 09:12:07 [997:307579]: switching to intermediate/target user
> 01/10/2019 09:12:07 [997:307579]: setting additional gid=0
> 01/10/2019 09:12:07 [6782:307579]: closing all filedescriptors
> 01/10/2019 09:12:07 [6782:307579]: further messages are in "error" and "trace"
> 01/10/2019 09:12:07 [997:307578]: Poll received POLLHUP (Hang up). Unregister the FD.
> 01/10/2019 09:12:07 [6782:307579]: using "/bin/bash" as shell of user "root"
> 01/10/2019 09:12:07 [0:307579]: now running with uid=0, euid=0
> 01/10/2019 09:12:07 [0:307579]: execvlp(/opt/gridengine/default/common/prolog_exec.sh, "/opt/gridengine/default/common/prolog_exec.sh")
> ### The process just stuck in the line above
>
> Here is the trace file for a qsub/batch job, apparently the prolog script
> got executed and the process proceeded:
>
> [root@zeta-4-12 383.1]# ls
> addgrpid config environment error exit_status job_pid pe_hostfile pid trace
> [root@zeta-4-12 383.1]# cat trace
> 01/10/2019 09:20:22 [997:315329]: shepherd called with uid = 0, euid = 997
> 01/10/2019 09:20:22 [997:315329]: starting up 8.1.9
> 01/10/2019 09:20:22 [997:315329]: setpgid(315329, 315329) returned 0
> 01/10/2019 09:20:22 [997:315329]: do_core_binding: "binding" parameter not found in config file
> 01/10/2019 09:20:22 [997:315329]: parent: forked "prolog" with pid 315330
> 01/10/2019 09:20:22 [997:315329]: using signal delivery delay of 120 seconds
> 01/10/2019 09:20:22 [997:315329]: parent: prolog-pid: 315330
> 01/10/2019 09:20:22 [997:315330]: child: starting son(prolog, root@/opt/gridengine/default/common/prolog_exec.sh, 0, 10000);
> 01/10/2019 09:20:22 [997:315330]: pid=315330 pgrp=315330 sid=315330 old pgrp=315329 getlogin()=<no login set>
> 01/10/2019 09:20:22 [997:315330]: reading passwd information for user 'root'
> 01/10/2019 09:20:22 [997:315330]: setting limits
> 01/10/2019 09:20:22 [997:315330]: setting environment
> 01/10/2019 09:20:22 [997:315330]: Initializing error file
> 01/10/2019 09:20:22 [997:315330]: switching to intermediate/target user
> 01/10/2019 09:20:22 [997:315330]: setting additional gid=0
> 01/10/2019 09:20:22 [6782:315330]: closing all filedescriptors
> 01/10/2019 09:20:22 [6782:315330]: further messages are in "error" and "trace"
> 01/10/2019 09:20:22 [6782:315330]: using "/bin/bash" as shell of user "root"
> 01/10/2019 09:20:22 [6782:315330]: using stdout as stderr
> 01/10/2019 09:20:22 [0:315330]: now running with uid=0, euid=0
> 01/10/2019 09:20:22 [0:315330]: execvlp(/opt/gridengine/default/common/prolog_exec.sh, "/opt/gridengine/default/common/prolog_exec.sh")
> 01/10/2019 09:20:22 [997:315329]: wait3 returned 315330 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
> 01/10/2019 09:20:22 [997:315329]: prolog exited with exit status 0
> 01/10/2019 09:20:22 [997:315329]: reaped "prolog" with pid 315330
> 01/10/2019 09:20:22 [997:315329]: prolog exited not due to signal
> 01/10/2019 09:20:22 [997:315329]: prolog exited with status 0
> 01/10/2019 09:20:22 [997:315329]: parent: forked "job" with pid 315345
> 01/10/2019 09:20:22 [997:315329]: parent: job-pid: 315345
> 01/10/2019 09:20:22 [997:315345]: child: starting son(job, sleep, 0, 4096);
> 01/10/2019 09:20:22 [997:315345]: pid=315345 pgrp=315345 sid=315345 old pgrp=315329 getlogin()=<no login set>
> 01/10/2019 09:20:22 [997:315345]: reading passwd information for user 'derlin'
> 01/10/2019 09:20:22 [997:315345]: setosjobid: uid = 0, euid = 997
> 01/10/2019 09:20:22 [997:315345]: setting limits
> 01/10/2019 09:20:22 [997:315345]: RLIMIT_CPU setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 01/10/2019 09:20:22 [997:315345]: RLIMIT_FSIZE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 01/10/2019 09:20:22 [997:315345]: RLIMIT_DATA setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 01/10/2019 09:20:22 [997:315345]: RLIMIT_STACK setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 01/10/2019 09:20:22 [997:315345]: RLIMIT_CORE setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 01/10/2019 09:20:22 [997:315345]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 8257536000 hard 8257536000) resulting: (soft 8257536000 hard 8257536000)
> 01/10/2019 09:20:22 [997:315345]: RLIMIT_RSS setting: (soft INFINITY hard INFINITY) resulting: (soft INFINITY hard INFINITY)
> 01/10/2019 09:20:22 [997:315345]: setting environment
> 01/10/2019 09:20:22 [997:315345]: Initializing error file
> 01/10/2019 09:20:22 [997:315345]: switching to intermediate/target user
> 01/10/2019 09:20:22 [997:315345]: setting additional gid=20011
> 01/10/2019 09:20:22 [6782:315345]: closing all filedescriptors
> 01/10/2019 09:20:22 [6782:315345]: further messages are in "error" and "trace"
> 01/10/2019 09:20:22 [6782:315345]: using stdout as stderr
> 01/10/2019 09:20:22 [6782:315345]: now running with uid=6782, euid=6782
> 01/10/2019 09:20:22 [6782:315345]: execvlp(/bin/csh, "-csh" "-c" "sleep 10m ")
>
> I will attach my prolog script in the next post.
>
> Cheers
> Derrick
>
>
> On Wed, Jan 9, 2019 at 7:36 PM Reuti <re...@staff.uni-marburg.de> wrote:
>
>> Hi,
>>
>> > On 09.01.2019 at 01:14, Derrick Lin <klin...@gmail.com> wrote:
>> >
>> > Hi guys,
>> >
>> > I just brought up a new SGE cluster, but somehow the qrsh session does
>> > not work:
>> >
>> > tester@login-gpu:~$ qrsh
>> > ^Cerror: error while waiting for builtin IJS connection: "got select timeout"
>> >
>> > After I hit Enter, the session just got stuck there forever instead of
>> > bringing me to a compute node. I had to press Ctrl+C to terminate it, and
>> > it gave the above error.
>> >
>> > I noticed that SGE did send my qrsh request to a compute node, as I
>> > could tell from qstat:
>> >
>> > ---------------------------------------------------------------------------------
>> > short.q@zeta-4-15.local BIP 0/1/80 0.01 lx-amd64
>> > 15 0.55500 QRLOGIN tester r 01/09/2019 10:47:13 1
>> >
>> > We have a prolog script configured globally. The script deals with
>> > local disk quota and keeps all output in a log file for each job. So I
>> > went to that compute node and checked, and found that a log file was
>> > created but it was empty.
>> >
>> > So my thinking so far is, my qrsh stuck because the prolog script is
>> > not fully executed.
>>
>> Is there any statement in the prolog, which could wait for stdin – and in
>> a batch job there is just no stdin, hence it continues? Could be tested
>> with "-i" to a batch job.
>>
>> -- Reuti
>>
>>
>> > qsub job are working fine.
>> >
>> > Any idea will be appreciated
>> >
>> > Cheers,
>> > Derrick
>> > _______________________________________________
>> > users mailing list
>> > users@gridengine.org
>> > https://gridengine.org/mailman/listinfo/users
>>
>>
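In case it helps anyone repeating the trace comparison from this thread, here is a minimal Python sketch of how it could be scripted. It assumes the shepherd trace format seen in this thread (date, time, then `[uid:pid]:`), and it masks all numbers so that differing PIDs and timestamps do not show up as differences. The function names `normalize` and `only_in_first` are made up for illustration and are not part of SGE.

```python
import re

# Prefix as it appears in the traces in this thread, e.g.
# "01/10/2019 09:12:07 [997:307578]: " (format assumed from those samples).
PREFIX = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} \[\d+:\d+\]: ")
NUMBER = re.compile(r"\d+")


def normalize(line):
    """Drop the timestamp/[uid:pid] prefix and mask every number with 'N'."""
    message = PREFIX.sub("", line.strip())
    return NUMBER.sub("N", message)


def only_in_first(first_lines, second_lines):
    """Return the lines of the first trace whose normalized message
    never occurs in the second trace (blank lines are ignored)."""
    seen = {normalize(line) for line in second_lines}
    return [
        line.strip()
        for line in first_lines
        if line.strip() and normalize(line) not in seen
    ]
```

To use it, read both trace files with `open(path).read().splitlines()` and pass the qrsh trace first and the qsub trace second. Because numbers are masked, lines that differ only in PID, uid, or a numeric argument are treated as equal, so this is a coarse filter rather than a real diff.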