> On 13.05.2015 at 12:35, <[email protected]> 
> <[email protected]> wrote:
> 
> Hi Reuti,
> 
> I did some more testing, and the process is now killed after deleting the job 
> with qdel job_id. Please find the test results below.
> 
> After starting the job, the process started on the execution host:
> 
> qstat -j 8150628
> =================================================
> job_number:                 8150628
> exec_file:                  job_scripts/8150628
> submission_time:            Wed May 13 13:00:08 2015
> owner:                      spenmets
> uid:                        78566
> group:                      newgrp1
> gid:                        1018
> 
> =================================================
> [spenmets@node2 homes/users/spenmets]$ps -au spenmets
>  PID TTY          TIME CMD
> 10837 pts/12   00:00:00 qrsh_starter
> 10911 pts/12   00:00:00 xterm

As long as the process stays attached to `qrsh_starter`, it will be killed 
too, as SGE kills the complete process group. The problem arises when a 
process jumps out of the process tree and must be detected by the additional 
group ID. In that case "execd_params ENABLE_ADDGRP_KILL=TRUE" must also be set 
in SGE's configuration so that this facility can step in.
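
As a rough illustration (a sketch only, not part of SGE): on Linux the 
supplementary groups of a process appear on the "Groups:" line of 
/proc/<pid>/status, so a job's additional group ID can be searched for there. 
The helper names `supplementary_groups` and `pids_with_group` are made up for 
this example, and it assumes you already know the numeric GID, e.g. from the 
job's addgrpid file in the active_jobs spool directory.

```python
import os


def supplementary_groups(status_text):
    """Return the GIDs on the 'Groups:' line of a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            # Everything after the 'Groups:' label is a whitespace-separated GID list.
            return [int(gid) for gid in line.split()[1:]]
    return []


def pids_with_group(gid, proc_root="/proc"):
    """List PIDs whose supplementary groups include gid (Linux only)."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                if gid in supplementary_groups(fh.read()):
                    matches.append(int(entry))
        except OSError:
            continue  # the process exited or is not readable
    return sorted(matches)
```

For example, `pids_with_group(20082)` would list the PIDs on the node that 
still carry GID 20082, whether or not they are still children of 
`qrsh_starter`.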

-- Reuti


> =================================================
> 
> [spenmets@node2 proc/10837]$cat status
> Name:   qrsh_starter
> Gid:    1018    1018    1018    1018
> Utrace: 0
> FDSize: 64
> Groups: 1000 1018 1025 1030 27000 27001 27007 27010 27014 27017 27025
> ================================================
> 
> gridnode @ /xxxxx/xxxxx/xxxxx : qdel 8150628
> registered the job 8150628 for deletion
> gridnode @ /xxxxx/xxxxx/xxxxx : qstat -j 8150628
> Following jobs do not exist:
> 8150628
> 
> ===============================================
> 
> [spenmets@node2 homes/users/spenmets]$ps 10837
>  PID TTY      STAT   TIME COMMAND
> [spenmets@node2 homes/users/spenmets]$cd /proc/10837
> -bash: cd: /proc/10837: No such file or directory
> 
> Does this mean it is not an issue with the tight integration of SSH into SGE?
> 
> Regards,
> Sudha
> 
> -----Original Message-----
> From: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
> Sent: Wednesday, May 13, 2015 1:15 PM
> To: 'Reuti'
> Cc: [email protected]
> Subject: RE: [gridengine users] grid jobs not visible with qstat output
> 
> Hi Reuti,
> 
> The value in /opt/sge/default/spool/active_jobs/8143543.1/addgrpid does not 
> appear in /proc/, but the child processes of the job are present in /proc/.
> 
> Can you please suggest a solution?
> 
> Regards,
> Sudha
> 
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Tuesday, May 12, 2015 8:53 PM
> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
> Cc: [email protected]; [email protected]
> Subject: Re: [gridengine users] grid jobs not visible with qstat output
> 
> 
>> On 12.05.2015 at 17:03, <[email protected]> 
>> <[email protected]> wrote:
>> 
>> Hi Reuti,
>> 
>> The page you suggested
>> (https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html ) 
>> states the following:
>> 
>> "To  have a tight integration of SSH into SGE, the started sshd needs an 
>> additional group ID to be attached."
>> 
>> We checked the configuration on our side, and the addgrpid file is generated:
>> 
>> /opt/sge/default/spool/active_jobs/8143543.1 : ls addgrpid
> 
> Yes, but it is not attached to all processes. Processes running in a tight 
> integration need it attached, as can be seen in /proc:
> 
> reuti@node:/proc/24989> cat status
> ...
> Groups: 20082 24000 25000
> 
> And the 20082 is the additional one.
> 
> -- Reuti
> 
> 
>> 
>> Regards,
>> Sudha
>> 
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Monday, May 11, 2015 2:08 AM
>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>> Cc: [email protected]; [email protected]
>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>> output
>> 
>> 
>> On 10.05.2015 at 19:30, <[email protected]> 
>> <[email protected]> wrote:
>> 
>>> Hi Reuti,
>>> 
>>> The startup mechanism is as below
>>> 
>>> qlogin_daemon                /usr/sbin/sshd -i
>>> qlogin_command               /gridapl1/HWEE_ge6/new/qssh
>> 
>> Then most likely `ssh` is not tightly integrated into SGE. 
>> Please have a look at:
>> 
>> https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html
>> 
>> section "SSH TIGHT INTEGRATION".
>> 
>> -- Reuti
>> 
>> 
>>> Regards,
>>> Sudha
>>> 
>>> -----Original Message-----
>>> From: Reuti [mailto:[email protected]]
>>> Sent: Friday, May 08, 2015 10:50 PM
>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>> Cc: [email protected]; [email protected]
>>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>>> output
>>> 
>>> 
>>>> On 08.05.2015 at 16:57, [email protected] wrote:
>>>> 
>>>> Hi Zhang,
>>>> 
>>>> Please find the output below:
>>>> 
>>>> 32682 61457200 27020 karppa 32682 /applic36/grid/HWEE_ge6/utilbin/lx24-amd64/qrsh_starter /gridapl1/HWEE_ge6/default/spo
>>>> 32734 61457200 27020 karppa 32734  \_ /bin/ksh ./run_it_file.vcs
>>>> 33043 61457200 27020 karppa 32734      \_ /bin/ksh ./vcs.start.dh.no_gui
>>>> 33059 61457200 27020 karppa 32734          \_ ./vcs/tb_bin/hdl_top_rtldhsim/simv -licqueue -cm line+cond+fsm+branch+tgl+
>>>> 38048 61457200 27020 karppa 32734              \_ [target.bin] <defunct>
>>>> 5049 61457200 27020 karppa 5049 /applic36/grid/HWEE_ge6/utilbin/lx24-amd64/qrsh_starter /gridapl1/HWEE_ge6/default/spoo
>>>> 5101 61457200 27020 karppa 5101  \_ /bin/ksh ./run_it_file.vcs
>>>> 5408 61457200 27020 karppa 5101      \_ /bin/ksh ./vcs.start.dh.no_gui
>>>> 5424 61457200 27020 karppa 5101          \_ ./vcs/tb_bin/hdl_top_rtldhsim/simv -licqueue -cm line+cond+fsm+branch+tgl+a
>>>> 9089 61457200 27020 karppa 5101              \_ [target.bin] <defunct>
>>> 
>>> The problem seems to be that `qrsh_starter` is no longer bound to the 
>>> "sge_shepherd". Was this taken after the job ended? What does it look like 
>>> while SGE still knows about the job? What is the startup mechanism:
>>> 
>>> $ qconf -sconf
>>> ...
>>> qlogin_command               builtin
>>> qlogin_daemon                builtin
>>> rlogin_command               builtin
>>> rlogin_daemon                builtin
>>> rsh_command                  builtin
>>> rsh_daemon                   builtin
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> Regards,
>>>> Sudha
>>>> 
>>>> -----Original Message-----
>>>> From: Feng Zhang [mailto:[email protected]]
>>>> Sent: Friday, May 08, 2015 7:35 PM
>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>>>> output
>>>> 
>>>> Sudha,
>>>> 
>>>> Can you run "ps -e f -o pid,ppid,command", which can show more details?
>>>> 
>>>> On Fri, May 8, 2015 at 4:09 AM,  <[email protected]> wrote:
>>>>> Hi Reuti,
>>>>> 
>>>>> The processes are not bound to sge_shepherd anymore.
>>>>> 
>>>>> Below are the qrsh_starter processes that are still running:
>>>>> 
>>>>> 5049 ?        00:00:00 qrsh_starter
>>>>> 5101 ?        00:00:00 run_it_file.vcs
>>>>> 5408 ?        00:00:00 vcs.start.dh.no
>>>>> 5424 ?        8-20:57:02 simv
>>>>> 9089 ?        00:00:00 target.bin <defunct>
>>>>> 16868 ?        00:00:00 sshd
>>>>> 16913 pts/9    00:00:00 bash
>>>>> 17371 pts/9    00:00:00 ps
>>>>> 32682 ?        00:00:00 qrsh_starter
>>>>> 32734 ?        00:00:00 run_it_file.vcs
>>>>> 33043 ?        00:00:00 vcs.start.dh.no
>>>>> 33059 ?        8-21:19:03 simv
>>>>> 38048 ?        00:00:00 target.bin <defunct>
>>>>> 
>>>>> Regards,
>>>>> Sudha
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Reuti [mailto:[email protected]]
>>>>> Sent: Thursday, May 07, 2015 9:52 PM
>>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>>> Cc: [email protected]; [email protected]
>>>>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>>>>> output
>>>>> 
>>>>> Are the processes still bound to the sge_shepherd, or did they jump out 
>>>>> of the process tree? By what method were they started by qrsh_starter: 
>>>>> "builtin" or by defining `ssh`?
>>>>> 
>>>>> -- Reuti
>>>>> 
>>>>> 
>>>>>> On 07.05.2015 at 18:00, <[email protected]> 
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> No, the slots are not being used anymore.
>>>>>> 
>>>>>> According to qstat I do not seem to have any jobs on the host. However, 
>>>>>> my processes are still running on that specific host (launched by 
>>>>>> qrsh_starter) and together consume 200% of CPU as well as licenses. 
>>>>>> The problem is that the processes have been running there for over a 
>>>>>> week without my being aware of them. I thought the processes 
>>>>>> were killed when the job was killed with qdel.
>>>>>> 
>>>>>> What could be the reason for this?
>>>>>> 
>>>>>> Regards,
>>>>>> Sudha
>>>>>> 
>>>>>> From: Srirangam Addepalli [mailto:[email protected]]
>>>>>> Sent: Wednesday, May 06, 2015 7:52 PM
>>>>>> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
>>>>>> Subject: Re: [gridengine users] grid jobs not visible with qstat
>>>>>> output
>>>>>> 
>>>>>> That would be strange. Do the slots on the host show as being used?
>>>>>> 
>>>>>> qhost -j -h hostname should list the jobs that Grid Engine is aware of, 
>>>>>> unless qrsh somehow spawned a process that is not bound to sge_execd. 
>>>>>> On the client/execution host, what info do you have in the active_jobs 
>>>>>> and jobs directories? It is more likely that the qrsh session was 
>>>>>> terminated but left resident processes behind.
>>>>>> 
>>>>>> Rangam
>>>>>> 
>>>>>> On Wed, May 6, 2015 at 9:05 AM, <[email protected]> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I noticed that two of my grid jobs have been running for over a week on a 
>>>>>> machine without my being aware of them. Both jobs were launched with 
>>>>>> qrsh, but they are not visible with qstat, so for one reason or another 
>>>>>> they are no longer included in the grid's book-keeping. As a result, grid 
>>>>>> resources are wasted on ghost jobs; for example, both of my jobs seem to 
>>>>>> consume 100% CPU on the host.
>>>>>> 
>>>>>> Can anyone please explain this?
>>>>>> 
>>>>>> Regards,
>>>>>> Sudha
>>>>>> 
>>>>>> The information contained in this electronic message and any
>>>>>> attachments to this message are intended for the exclusive use of
>>>>>> the addressee(s) and may contain proprietary, confidential or
>>>>>> privileged information. If you are not the intended recipient, you
>>>>>> should not disseminate, distribute or copy this e-mail. Please
>>>>>> notify the sender immediately and destroy all copies of this
>>>>>> message and any attachments. WARNING: Computer viruses can be
>>>>>> transmitted via email. The recipient should check this email and
>>>>>> any attachments for the presence of viruses. The company accepts
>>>>>> no liability for any damage caused by any virus transmitted by
>>>>>> this email. www.wipro.com
>>>>>> 
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> [email protected]
>>>>>> https://gridengine.org/mailman/listinfo/users
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Best,
>>>> 
>>>> Feng


