> On 13.05.2015 at 12:35, <[email protected]> <[email protected]> wrote:
>
> Hi Reuti,
>
> I did some testing again and now the process is killed after deleting the job
> using qdel job_id. Please find the test results.
>
> After starting the job, the process started on the execution host
>
> qstat -j 8150628
> =================================================
> job_number: 8150628
> exec_file: job_scripts/8150628
> submission_time: Wed May 13 13:00:08 2015
> owner: spenmets
> uid: 78566
> group: newgrp1
> gid: 1018
>
> =================================================
> [spenmets@node2 homes/users/spenmets]$ps -au spenmets
> PID TTY TIME CMD
> 10837 pts/12 00:00:00 qrsh_starter
> 10911 pts/12 00:00:00 xterm
As long as the process stays attached to the `qrsh_starter`, it will be killed
too, as SGE kills the complete process group. The problem arises when a process
jumps out of the process tree and must be detected by the additional group ID.
In that case "execd_params ENABLE_ADDGRP_KILL=TRUE" must also be set in SGE's
configuration to allow this facility to kick in.

-- Reuti

> =================================================
>
> [spenmets@node2 proc/10837]$cat status
> Name: qrsh_starter
> Gid: 1018 1018 1018 1018
> Utrace: 0
> FDSize: 64
> Groups: 1000 1018 1025 1030 27000 27001 27007 27010 27014 27017 27025
> ================================================
>
> gridnode @ /xxxxx/xxxxx/xxxxx : qdel 8150628
> registered the job 8150628 for deletion
> gridnode @ /xxxxx/xxxxx/xxxxx : qstat -j 8150628
> Following jobs do not exist:
> 8150628
>
> ===============================================
>
> [spenmets@node2 homes/users/spenmets]$ps 10837
> PID TTY STAT TIME COMMAND
> [spenmets@node2 homes/users/spenmets]$cd /proc/10837
> -bash: cd: /proc/10837: No such file or directory
>
> Does this mean it is not an issue with the tight integration of SSH into SGE?
>
> Regards,
> Sudha
>
> -----Original Message-----
> From: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
> Sent: Wednesday, May 13, 2015 1:15 PM
> To: 'Reuti'
> Cc: [email protected]
> Subject: RE: [gridengine users] grid jobs not visible with qstat output
>
> Hi Reuti,
>
> The value in /opt/sge/default/spool/active_jobs/8143543.1/addgrpid is not
> there in /proc/
>
> But the child processes of the job are available in /proc/.
>
> Can you please suggest a solution.
>
> Regards,
> Sudha
>
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Tuesday, May 12, 2015 8:53 PM
> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
> Cc: [email protected]; [email protected]
> Subject: Re: [gridengine users] grid jobs not visible with qstat output
>
>
>> On 12.05.2015 at 17:03, <[email protected]> <[email protected]> wrote:
>>
>> Hi Reuti,
>>
>> In the link suggested by you
>> (https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html ) it
>> is mentioned as below
>>
>> "To have a tight integration of SSH into SGE, the started sshd needs an
>> additional group ID to be attached."
>>
>> Checked the configuration from our side and the addgrpid is generated
>>
>> /opt/sge/default/spool/active_jobs/8143543.1 : ls addgrpid
>
> Yes, but it is not attached to all processes. Processes running in a tight
> integration need it attached, as can be seen in /proc:
>
> reuti@node:/proc/24989> cat status
> ...
> Groups: 20082 24000 25000
>
> And the 20082 is the additional one.
>
> -- Reuti
>
>
>>
>> Regards,
>> Sudha
>>
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Monday, May 11, 2015 2:08 AM
>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>> Cc: [email protected]; [email protected]
>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>> output
>>
>>
>> On 10.05.2015 at 19:30, <[email protected]> <[email protected]> wrote:
>>
>>> Hi Reuti,
>>>
>>> The startup mechanism is as below
>>>
>>> qlogin_daemon /usr/sbin/sshd -i
>>> qlogin_command /gridapl1/HWEE_ge6/new/qssh
>>
>> Then it's most likely that the `ssh` is not tightly integrated into SGE.
>> Please have a look at:
>>
>> https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html
>>
>> section "SSH TIGHT INTEGRATION".
>>
>> -- Reuti
>>
>>
>>> Regards,
>>> Sudha
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:[email protected]]
>>> Sent: Friday, May 08, 2015 10:50 PM
>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>> Cc: [email protected]; [email protected]
>>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>>> output
>>>
>>>
>>>> On 08.05.2015 at 16:57, [email protected] wrote:
>>>>
>>>> Hi Zhang,
>>>>
>>>> Please find the o/p
>>>>
>>>> 32682 61457200 27020 karppa 32682
>>>> /applic36/grid/HWEE_ge6/utilbin/lx24-amd64/qrsh_starter
>>>> /gridapl1/HWEE_ge6/default/spo
>>>> 32734 61457200 27020 karppa 32734 \_ /bin/ksh ./run_it_file.vcs
>>>> 33043 61457200 27020 karppa 32734 \_ /bin/ksh ./vcs.start.dh.no_gui
>>>> 33059 61457200 27020 karppa 32734 \_
>>>> ./vcs/tb_bin/hdl_top_rtldhsim/simv -licqueue -cm line+cond+fsm+branch+tgl+
>>>> 38048 61457200 27020 karppa 32734 \_ [target.bin] <defunct>
>>>> 5049 61457200 27020 karppa 5049
>>>> /applic36/grid/HWEE_ge6/utilbin/lx24-amd64/qrsh_starter
>>>> /gridapl1/HWEE_ge6/default/spoo
>>>> 5101 61457200 27020 karppa 5101 \_ /bin/ksh ./run_it_file.vcs
>>>> 5408 61457200 27020 karppa 5101 \_ /bin/ksh ./vcs.start.dh.no_gui
>>>> 5424 61457200 27020 karppa 5101 \_
>>>> ./vcs/tb_bin/hdl_top_rtldhsim/simv -licqueue -cm line+cond+fsm+branch+tgl+a
>>>> 9089 61457200 27020 karppa 5101 \_ [target.bin] <defunct>
>>>
>>> The problem seems to be that the `qrsh_starter` is no longer bound to the
>>> "sge_shepherd". Was this after the job? What does it look like while SGE
>>> still knows about the job? What is the startup mechanism:
>>>
>>> $ qconf -sconf
>>> ...
>>> qlogin_command builtin
>>> qlogin_daemon builtin
>>> rlogin_command builtin
>>> rlogin_daemon builtin
>>> rsh_command builtin
>>> rsh_daemon builtin
>>>
>>> -- Reuti
>>>
>>>
>>>> Regards,
>>>> Sudha
>>>>
>>>> -----Original Message-----
>>>> From: Feng Zhang [mailto:[email protected]]
>>>> Sent: Friday, May 08, 2015 7:35 PM
>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>>>> output
>>>>
>>>> Sudha,
>>>>
>>>> Can you run "ps -e f -o pid,ppid,command", which can show more details?
>>>>
>>>> On Fri, May 8, 2015 at 4:09 AM, <[email protected]> wrote:
>>>>> Hi Reuti,
>>>>>
>>>>> The processes are not bound to sge_shepherd anymore.
>>>>>
>>>>> Below are the qrsh_starter processes running still
>>>>>
>>>>> 5049 ? 00:00:00 qrsh_starter
>>>>> 5101 ? 00:00:00 run_it_file.vcs
>>>>> 5408 ? 00:00:00 vcs.start.dh.no
>>>>> 5424 ? 8-20:57:02 simv
>>>>> 9089 ? 00:00:00 target.bin <defunct>
>>>>> 16868 ? 00:00:00 sshd
>>>>> 16913 pts/9 00:00:00 bash
>>>>> 17371 pts/9 00:00:00 ps
>>>>> 32682 ? 00:00:00 qrsh_starter
>>>>> 32734 ? 00:00:00 run_it_file.vcs
>>>>> 33043 ? 00:00:00 vcs.start.dh.no
>>>>> 33059 ? 8-21:19:03 simv
>>>>> 38048 ? 00:00:00 target.bin <defunct>
>>>>>
>>>>> Regards,
>>>>> Sudha
>>>>>
>>>>> -----Original Message-----
>>>>> From: Reuti [mailto:[email protected]]
>>>>> Sent: Thursday, May 07, 2015 9:52 PM
>>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>>> Cc: [email protected]; [email protected]
>>>>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>>>>> output
>>>>>
>>>>> Are the processes still bound to the sge_shepherd or did they jump out
>>>>> of the process tree? By what method were they started by qrsh_starter:
>>>>> "builtin" or by defining `ssh`?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> On 07.05.2015 at 18:00, <[email protected]> <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> No, the slots are not being used anymore.
>>>>>>
>>>>>> According to qstat I seem not to have any jobs on the host. However,
>>>>>> there are my processes running on that specific host (launched by
>>>>>> qrsh_starter) that are altogether consuming 200% of CPU and licenses.
>>>>>> The problem here is that the processes have been running there over a
>>>>>> week and I haven't been aware of those. I thought that the processes
>>>>>> were killed when the job was killed with qdel.
>>>>>>
>>>>>> What could be the reason for this?
>>>>>>
>>>>>> Regards,
>>>>>> Sudha
>>>>>>
>>>>>> From: Srirangam Addepalli [mailto:[email protected]]
>>>>>> Sent: Wednesday, May 06, 2015 7:52 PM
>>>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>>>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>>>>>> output
>>>>>>
>>>>>> That would be strange. Do the slots on the host show as being used?
>>>>>>
>>>>>> qhost -j -h hostname should list the jobs that Grid Engine is aware of,
>>>>>> unless qrsh somehow spawned a process that is not bound by sge_execd.
>>>>>> On the client/execution host, what info do you have in the active_jobs
>>>>>> and jobs directories? It is more likely that the qrsh session was
>>>>>> terminated but left resident processes.
>>>>>>
>>>>>> Rangam
>>>>>>
>>>>>> On Wed, May 6, 2015 at 9:05 AM, <[email protected]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I noticed that I've had two grid jobs running over a week on a machine
>>>>>> of which I haven't been aware. Both of the jobs have been launched
>>>>>> with qrsh but they are not visible with qstat, thus for one reason or
>>>>>> another they are no longer included in grid book-keeping. This issue
>>>>>> causes grid resources to be wasted on ghost jobs; for example,
>>>>>> both of my jobs seem to consume 100% CPU on the host.
>>>>>>
>>>>>> Can anyone please explain on this.
>>>>>>
>>>>>> Regards,
>>>>>> Sudha
>>>>>>
>>>>>> The information contained in this electronic message and any
>>>>>> attachments to this message are intended for the exclusive use of
>>>>>> the addressee(s) and may contain proprietary, confidential or
>>>>>> privileged information. If you are not the intended recipient, you
>>>>>> should not disseminate, distribute or copy this e-mail. Please
>>>>>> notify the sender immediately and destroy all copies of this
>>>>>> message and any attachments. WARNING: Computer viruses can be
>>>>>> transmitted via email. The recipient should check this email and
>>>>>> any attachments for the presence of viruses. The company accepts
>>>>>> no liability for any damage caused by any virus transmitted by
>>>>>> this email. www.wipro.com
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> [email protected]
>>>>>> https://gridengine.org/mailman/listinfo/users
>>>>
>>>>
>>>>
>>>> --
>>>> Best,
>>>>
>>>> Feng
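
The check discussed in this thread (does a process carry the job's additional group ID in the `Groups:` line of `/proc/<pid>/status`?) could be scripted roughly as follows. This is only a sketch under assumptions: the function names are made up for illustration, and it assumes the `addgrpid` file in the job's `active_jobs` directory holds a single integer, which the thread implies but does not show.

```python
from pathlib import Path


def parse_groups(status_text):
    """Return the set of group IDs from the 'Groups:' line of a
    /proc/<pid>/status dump (empty set if the line is missing)."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            # Everything after the 'Groups:' label is a whitespace-separated GID list.
            return {int(tok) for tok in line.split()[1:]}
    return set()


def has_additional_gid(status_text, addgrpid):
    """True if the additional group ID is attached to the process."""
    return addgrpid in parse_groups(status_text)


def check_pid(pid, addgrpid_file):
    """Hypothetical helper: read the job's addgrpid file and the live
    /proc entry, then compare. Assumes the file contains one integer."""
    addgrpid = int(Path(addgrpid_file).read_text().strip())
    status = Path(f"/proc/{pid}/status").read_text()
    return has_additional_gid(status, addgrpid)
```

For example, `check_pid(10837, "/opt/sge/default/spool/active_jobs/8143543.1/addgrpid")` would report whether that process is tagged with the job's additional group ID. A process that has left the process tree and lacks this tag cannot be found through the additional group ID when the job is deleted.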
