> On 13.05.2015 at 13:36, <[email protected]>
> <[email protected]> wrote:
>
> Hi Reuti,
>
> In qconf -sconf we have the configuration as follows:
> execd_params enable_windomacc=true
>
> Can you please confirm whether we can add it as below, or should it be defined in a
> different way?
>
> execd_params enable_windomacc=true ENABLE_ADDGRP_KILL=TRUE

It's correct.

-- Reuti

> Regards,
> Sudha
>
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Wednesday, May 13, 2015 4:17 PM
> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
> Cc: [email protected]
> Subject: Re: [gridengine users] grid jobs not visible with qstat output
>
>
>> On 13.05.2015 at 12:35, <[email protected]>
>> <[email protected]> wrote:
>>
>> Hi Reuti,
>>
>> I did some testing again, and now the process is killed after deleting the
>> job using qdel job_id. Please find the test results.
>>
>> After starting the job, the process started on the execution host:
>>
>> qstat -j 8150628
>> =================================================
>> job_number: 8150628
>> exec_file: job_scripts/8150628
>> submission_time: Wed May 13 13:00:08 2015
>> owner: spenmets
>> uid: 78566
>> group: newgrp1
>> gid: 1018
>>
>> =================================================
>> [spenmets@node2 homes/users/spenmets]$ps -au spenmets
>> PID TTY TIME CMD
>> 10837 pts/12 00:00:00 qrsh_starter
>> 10911 pts/12 00:00:00 xterm
>
> As long as the process stays attached to the `qrsh_starter`, it will be
> killed too, as SGE kills the complete process group. The problem arises
> when a process jumps out of the process tree and must be detected by the
> additional group ID. In that case "execd_params ENABLE_ADDGRP_KILL=TRUE" must
> also be set in SGE's configuration to allow this facility to kick in.
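
A process that has left the process tree can only be tied back to its job through the additional group ID, which Linux lists on the `Groups:` line of /proc/<pid>/status. As a minimal sketch (not something posted in the thread), the check can be written as below. The function names `supplementary_gids` and `carries_gid` are illustrative, and the sample text reuses only the `Groups:` values quoted in this thread.

```python
def supplementary_gids(status_text):
    """Return the GIDs on the 'Groups:' line of a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            # Everything after the 'Groups:' label is a whitespace-separated GID.
            return [int(token) for token in line.split()[1:]]
    return []  # no Groups line found


def carries_gid(status_text, gid):
    """True if gid is among the supplementary groups of the process."""
    return gid in supplementary_gids(status_text)


# Sample values taken from the 'Groups:' line quoted in this thread.
sample = "Groups:\t20082 24000 25000\n"
print(carries_gid(sample, 20082))
```

If the job's additional group ID is missing from that line, the process was most likely not started under a tight integration and would not be found by a kill based on the group ID.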
>
> -- Reuti
>
>
>> =================================================
>>
>> [spenmets@node2 proc/10837]$cat status
>> Name: qrsh_starter
>> Gid: 1018 1018 1018 1018
>> Utrace: 0
>> FDSize: 64
>> Groups: 1000 1018 1025 1030 27000 27001 27007 27010 27014 27017 27025
>> ================================================
>>
>> gridnode @ /xxxxx/xxxxx/xxxxx : qdel 8150628
>> registered the job 8150628 for deletion
>> gridnode @ /xxxxx/xxxxx/xxxxx : qstat -j 8150628
>> Following jobs do not exist:
>> 8150628
>>
>> ===============================================
>>
>> [spenmets@node2 homes/users/spenmets]$ps 10837
>> PID TTY STAT TIME COMMAND
>> [spenmets@node2 homes/users/spenmets]$cd /proc/10837
>> -bash: cd: /proc/10837: No such file or directory
>>
>> Does this mean it is not an issue with the tight integration of SSH into SGE?
>>
>> Regards,
>> Sudha
>>
>> -----Original Message-----
>> From: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>> Sent: Wednesday, May 13, 2015 1:15 PM
>> To: 'Reuti'
>> Cc: [email protected]
>> Subject: RE: [gridengine users] grid jobs not visible with qstat
>> output
>>
>> Hi Reuti,
>>
>> The value in /opt/sge/default/spool/active_jobs/8143543.1/addgrpid is
>> not present in /proc/.
>>
>> But the child processes of the job are available in /proc/.
>>
>> Can you please suggest a solution?
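
The manual comparison described here (the value in the job's addgrpid file against what the processes under /proc carry) could be automated roughly as follows. This is only a sketch under two assumptions: that the addgrpid file holds a single integer, and that the host exposes a Linux-style `Groups:` line in each status file. `read_addgrpid` and `pids_carrying_gid` are hypothetical helper names, not SGE tools.

```python
import os


def read_addgrpid(path):
    """Read a job's additional group ID (assumes the file holds one integer)."""
    with open(path) as fh:
        return int(fh.read().strip())


def pids_carrying_gid(gid, proc_root="/proc"):
    """Return sorted PIDs whose status file lists gid on its 'Groups:' line."""
    hits = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                lines = fh.read().splitlines()
        except OSError:
            continue  # process exited meanwhile, or status is unreadable
        for line in lines:
            if line.startswith("Groups:"):
                if str(gid) in line.split()[1:]:
                    hits.append(int(entry))
                break
    return sorted(hits)
```

On an execution host this would be called as `pids_carrying_gid(read_addgrpid(path))` with `path` pointing at the job's addgrpid file. An empty result while the job's child processes are still running matches the situation described in the message above.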
>>
>> Regards,
>> Sudha
>>
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Tuesday, May 12, 2015 8:53 PM
>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>> Cc: [email protected]; [email protected]
>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>> output
>>
>>
>>> On 12.05.2015 at 17:03, <[email protected]>
>>> <[email protected]> wrote:
>>>
>>> Hi Reuti,
>>>
>>> The link you suggested
>>> (https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html)
>>> states the following:
>>>
>>> "To have a tight integration of SSH into SGE, the started sshd needs an
>>> additional group ID to be attached."
>>>
>>> We checked the configuration on our side, and the addgrpid is generated:
>>>
>>> /opt/sge/default/spool/active_jobs/8143543.1 : ls addgrpid
>>
>> Yes, but it is not attached to all processes. Processes running under a tight
>> integration need it attached, which looks something like this in /proc:
>>
>> reuti@node:/proc/24989> cat status
>> ...
>> Groups: 20082 24000 25000
>>
>> And the 20082 is the additional one.
>>
>> -- Reuti
>>
>>
>>>
>>> Regards,
>>> Sudha
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:[email protected]]
>>> Sent: Monday, May 11, 2015 2:08 AM
>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>> Cc: [email protected]; [email protected]
>>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>>> output
>>>
>>>
>>> On 10.05.2015 at 19:30, <[email protected]>
>>> <[email protected]> wrote:
>>>
>>>> Hi Reuti,
>>>>
>>>> The startup mechanism is as below:
>>>>
>>>> qlogin_daemon /usr/sbin/sshd -i
>>>> qlogin_command /gridapl1/HWEE_ge6/new/qssh
>>>
>>> Then it's most likely that the `ssh` is not tightly integrated into SGE.
>>> Please have a look at:
>>>
>>> https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html
>>>
>>> section "SSH TIGHT INTEGRATION".
>>>
>>> -- Reuti
>>>
>>>
>>>> Regards,
>>>> Sudha
>>>>
>>>> -----Original Message-----
>>>> From: Reuti [mailto:[email protected]]
>>>> Sent: Friday, May 08, 2015 10:50 PM
>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>> Cc: [email protected]; [email protected]
>>>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>>>> output
>>>>
>>>>
>>>>> On 08.05.2015 at 16:57, [email protected] wrote:
>>>>>
>>>>> Hi Zhang,
>>>>>
>>>>> Please find the output:
>>>>>
>>>>> 32682 61457200 27020 karppa 32682
>>>>> /applic36/grid/HWEE_ge6/utilbin/lx24-amd64/qrsh_starter
>>>>> /gridapl1/HWEE_ge6/default/spo
>>>>> 32734 61457200 27020 karppa 32734 \_ /bin/ksh ./run_it_file.vcs
>>>>> 33043 61457200 27020 karppa 32734 \_ /bin/ksh ./vcs.start.dh.no_gui
>>>>> 33059 61457200 27020 karppa 32734 \_
>>>>> ./vcs/tb_bin/hdl_top_rtldhsim/simv -licqueue -cm line+cond+fsm+branch+tgl+
>>>>> 38048 61457200 27020 karppa 32734 \_ [target.bin] <defunct>
>>>>> 5049 61457200 27020 karppa 5049
>>>>> /applic36/grid/HWEE_ge6/utilbin/lx24-amd64/qrsh_starter
>>>>> /gridapl1/HWEE_ge6/default/spoo
>>>>> 5101 61457200 27020 karppa 5101 \_ /bin/ksh ./run_it_file.vcs
>>>>> 5408 61457200 27020 karppa 5101 \_ /bin/ksh ./vcs.start.dh.no_gui
>>>>> 5424 61457200 27020 karppa 5101 \_
>>>>> ./vcs/tb_bin/hdl_top_rtldhsim/simv -licqueue -cm
>>>>> line+cond+fsm+branch+tgl+a
>>>>> 9089 61457200 27020 karppa 5101 \_ [target.bin] <defunct>
>>>>
>>>> The problem seems to be that the `qrsh_starter` is no longer bound to the
>>>> "sge_shepherd". Was this after the job? What does it look like while SGE
>>>> still knows about the job? What is the startup mechanism:
>>>>
>>>> $ qconf -sconf
>>>> ...
>>>> qlogin_command builtin
>>>> qlogin_daemon builtin
>>>> rlogin_command builtin
>>>> rlogin_daemon builtin
>>>> rsh_command builtin
>>>> rsh_daemon builtin
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> Regards,
>>>>> Sudha
>>>>>
>>>>> -----Original Message-----
>>>>> From: Feng Zhang [mailto:[email protected]]
>>>>> Sent: Friday, May 08, 2015 7:35 PM
>>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>>>>> output
>>>>>
>>>>> Sudha,
>>>>>
>>>>> Can you run "ps -e f -o pid,ppid,command", which can show more details?
>>>>>
>>>>> On Fri, May 8, 2015 at 4:09 AM, <[email protected]> wrote:
>>>>>> Hi Reuti,
>>>>>>
>>>>>> The processes are not bound to sge_shepherd anymore.
>>>>>>
>>>>>> Below are the qrsh_starter processes that are still running:
>>>>>>
>>>>>> 5049 ? 00:00:00 qrsh_starter
>>>>>> 5101 ? 00:00:00 run_it_file.vcs
>>>>>> 5408 ? 00:00:00 vcs.start.dh.no
>>>>>> 5424 ? 8-20:57:02 simv
>>>>>> 9089 ? 00:00:00 target.bin <defunct>
>>>>>> 16868 ? 00:00:00 sshd
>>>>>> 16913 pts/9 00:00:00 bash
>>>>>> 17371 pts/9 00:00:00 ps
>>>>>> 32682 ? 00:00:00 qrsh_starter
>>>>>> 32734 ? 00:00:00 run_it_file.vcs
>>>>>> 33043 ? 00:00:00 vcs.start.dh.no
>>>>>> 33059 ? 8-21:19:03 simv
>>>>>> 38048 ? 00:00:00 target.bin <defunct>
>>>>>>
>>>>>> Regards,
>>>>>> Sudha
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Reuti [mailto:[email protected]]
>>>>>> Sent: Thursday, May 07, 2015 9:52 PM
>>>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>>>> Cc: [email protected]; [email protected]
>>>>>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>>>>>> output
>>>>>>
>>>>>> Are the processes still bound to the sge_shepherd, or did they jump out
>>>>>> of the process tree? By what method were they started by qrsh_starter:
>>>>>> "builtin" or by defining `ssh`?
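
Whether a process is still bound to the shepherd can be read off the parent chain in `ps -e -o pid,ppid,command` output. The following is a rough sketch of that check, not an SGE facility: `parse_ps` and `descends_from` are hypothetical names, and the sample process table is invented for illustration (it is not output from the hosts in this thread).

```python
def parse_ps(text):
    """Map pid -> (ppid, command) from 'ps -e -o pid,ppid,command' output."""
    procs = {}
    for line in text.splitlines():
        parts = line.split(None, 2)
        # The header row and malformed rows fail the digit check and are skipped.
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            procs[int(parts[0])] = (int(parts[1]), parts[2])
    return procs


def descends_from(pid, procs, ancestor_name="sge_shepherd"):
    """True if the process itself or any ancestor's command contains ancestor_name."""
    seen = set()
    while pid in procs and pid not in seen:  # 'seen' guards against loops
        seen.add(pid)
        ppid, command = procs[pid]
        if ancestor_name in command:
            return True
        pid = ppid
    return False


# Invented sample: 960 hangs below a shepherd, 970 was re-parented to init.
sample = """  PID  PPID COMMAND
    1     0 /sbin/init
  900     1 sge_execd
  950   900 sge_shepherd-42 -bg
  960   950 qrsh_starter
  970     1 qrsh_starter
"""
table = parse_ps(sample)
print(descends_from(960, table), descends_from(970, table))
```

A `qrsh_starter` for which this returns False has jumped out of the process tree, which is the case where only the additional group ID can still tie it to the job.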
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> On 07.05.2015 at 18:00, <[email protected]>
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> No, the slots are not being used anymore.
>>>>>>>
>>>>>>> According to qstat, I seem not to have any jobs on the host. However,
>>>>>>> there are processes of mine running on that specific host (launched by
>>>>>>> qrsh_starter) that are altogether consuming 200% of CPU and licenses.
>>>>>>> The problem here is that the processes have been running there for over a
>>>>>>> week and I haven't been aware of them. I thought that the processes
>>>>>>> were killed when the job was killed with qdel.
>>>>>>>
>>>>>>> What could be the reason for this?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Sudha
>>>>>>>
>>>>>>> From: Srirangam Addepalli [mailto:[email protected]]
>>>>>>> Sent: Wednesday, May 06, 2015 7:52 PM
>>>>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>>>>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>>>>>>> output
>>>>>>>
>>>>>>> That would be strange. Do the slots on the host show as being used?
>>>>>>>
>>>>>>> qhost -j -h hostname should list the jobs that Grid Engine is aware of,
>>>>>>> unless qrsh somehow spawned a process that is not bound by sge_execd.
>>>>>>> On the client/execution host, what info do you have in the active_jobs and
>>>>>>> jobs directories? It is more likely that the qrsh session was
>>>>>>> terminated but left resident processes.
>>>>>>>
>>>>>>> Rangam
>>>>>>>
>>>>>>> On Wed, May 6, 2015 at 9:05 AM, <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I noticed that I've had two grid jobs running for over a week on a machine
>>>>>>> without being aware of them. Both of the jobs were launched
>>>>>>> with qrsh, but they are not visible with qstat, so for one reason or
>>>>>>> another they are no longer included in grid book-keeping.
>>>>>>> This issue will cause grid resources to be wasted on ghost jobs; for example,
>>>>>>> both of my jobs seem to consume 100% CPU on the host.
>>>>>>>
>>>>>>> Can anyone please explain this?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Sudha
>>>>>>>
>>>>>>> The information contained in this electronic message and any
>>>>>>> attachments to this message are intended for the exclusive use of
>>>>>>> the addressee(s) and may contain proprietary, confidential or
>>>>>>> privileged information. If you are not the intended recipient,
>>>>>>> you should not disseminate, distribute or copy this e-mail.
>>>>>>> Please notify the sender immediately and destroy all copies of
>>>>>>> this message and any attachments. WARNING: Computer viruses can
>>>>>>> be transmitted via email. The recipient should check this email
>>>>>>> and any attachments for the presence of viruses. The company
>>>>>>> accepts no liability for any damage caused by any virus
>>>>>>> transmitted by this email. www.wipro.com
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> [email protected]
>>>>>>> https://gridengine.org/mailman/listinfo/users
>>>>>
>>>>> --
>>>>> Best,
>>>>>
>>>>> Feng
