[gridengine users] access definition in grid

2016-09-20 Thread sudha.penmetsa
Hi, Regarding access rules in the grid: a user's primary UNIX group must be the one defined in the ACL for the user to get access. Would it be possible to configure it so that the user only needs to belong to the defined UNIX group, whatever the primary gid is? Regards, Sudha The information contained in

Re: [gridengine users] Projects addition in resource quota set

2016-08-18 Thread sudha.penmetsa
Hi Reuti, Thanks for your response. I was not specifying the project name while submitting the job. Specifying -P in the job helped. Regards, Sudha -Original Message- From: Reuti [mailto:re...@staff.uni-marburg.de] Sent: Thursday, August 18, 2016 6:35 PM To: Sudha Padmini Penmetsa

[gridengine users] Projects addition in resource quota set

2016-08-18 Thread sudha.penmetsa
Hi, I am trying to limit a defined project by adding it to the resource quota set but I am not able to limit the users defined in the project to 2 slots at a time. We are able to run more than 2 jobs at a time. Here IT_test is the project name. Can anyone correct me if there is anything wrong
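
For reference, a resource quota set that caps a project at 2 slots could look roughly like the sketch below. The rule name limit_IT_test and the temporary file are assumptions made for illustration; only the project name IT_test comes from the post. A job is matched by a project rule only when it is submitted under that project, for example with qsub -P IT_test.

```shell
# Write a hypothetical resource quota set to a temporary file.
rqs_file=$(mktemp)
cat > "$rqs_file" <<'EOF'
{
   name         limit_IT_test
   description  "at most 2 slots for project IT_test"
   enabled      TRUE
   limit        projects IT_test to slots=2
}
EOF
cat "$rqs_file"

# On a host with SGE admin rights the file could then be loaded with:
#   qconf -Arqs "$rqs_file"
# and a job charged to the project with:
#   qsub -P IT_test job.sh      (job.sh is a placeholder script name)
```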

[gridengine users] Grid efficiency from the point of view of memory usage

2016-07-21 Thread sudha.penmetsa
Hi, We see that most of the cores in our grid queue are idle, but my jobs are not getting cores because there is insufficient h_vmem in the queue: the memory is already used by other jobs. Would it be possible to somehow take memory requirements into account in the scheduling? Regards,

Re: [gridengine users] Error message- failed receiving gdi request when calling qsub, but job is started

2016-06-22 Thread sudha.penmetsa
Hi, We have added the below qmaster params in the SGE configuration: qmaster_params gdi_timeout=240 gdi_retries=-1 cl_ping=true Could you let me know the difference between gdi_timeout and gdi_retries? Why is there a gdi_retries parameter? Why can't we use gdi_timeout alone to retry

[gridengine users] Error message- failed receiving gdi request when calling qsub, but job is started

2016-06-21 Thread sudha.penmetsa
Hi, Since this morning, sometimes users are facing an issue in grid while submitting qsub jobs. When submitting the job, it displays error message: "Unable to run job: failed receiving gdi request. Exiting" But the job runs successfully when it is seen later with qstat. We tried to find the

[gridengine users] Defining slots in grid queue

2016-05-06 Thread sudha.penmetsa
Hi, I have only one host defined in a queue and want to allot 2 slots per core instead of one slot per core. How should we define the slots to allocate more than 1 slot per core in the queue? Regards, Sudha

[gridengine users] Specifying operating system version in qsub command

2016-04-15 Thread sudha.penmetsa
Hi, Is it possible to specify the operating system version for a grid job submitted with the qsub command? This would help work around the problem that arises when the queue has hosts running two different OS types. Regards, Sudha

[gridengine users] alias for queue

2016-03-10 Thread sudha.penmetsa
Hi, Is it possible to keep the name of an old queue we have already deleted (test.q) as an alias for the newly created queue (prod.q)? This way we could continue to work with old scripts. Regards, Sudha

[gridengine users] Jobs dies with signal BUS (7)

2016-02-18 Thread sudha.penmetsa
Hi, My job gets aborted after a while with exit status 135 failed 100 : assumedly after job exit_status 135 02/17/2016 11:25:10|qmaster|master1|W|job 428284.1 failed on host test1 assumedly after job because: job 428284.1 died through signal BUS (7) Job submitted with same resources
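
The exit status and the signal in the log line are consistent with each other: shells commonly report a process killed by signal N as exit status 128 + N. A small sketch of that arithmetic follows; the variable names are illustrative.

```shell
exit_status=135                     # value reported in the post
signal=$((exit_status - 128))       # 128 + N convention for death by signal N
echo "exit status $exit_status -> signal $signal"
# On Linux, `kill -l 7` typically names this signal BUS (SIGBUS).
```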

[gridengine users] Same job consumes memory in a different way

2016-01-25 Thread sudha.penmetsa
Hi, We launched the same job at different times, with h_vmem defined as 12GB. One run consumed only 10.5G and succeeded, while the other consumed 18.3G and was therefore killed. 01/13/2016 00:13:18|execd|test1|W|job 33452 exceeds job hard limit "h_vmem" of queue

[gridengine users] Same job consumes memory in a different way

2016-01-13 Thread sudha.penmetsa
Hi, We launched the same job at different times, with h_vmem defined as 12GB. One run consumed only 10.5G and succeeded, while the other consumed 18.3G and was therefore killed. 01/13/2016 00:13:18|execd|test1|W|job 33452 exceeds job hard limit "h_vmem" of queue "test.q@test1"

[gridengine users] Job getting killed in the middle

2016-01-05 Thread sudha.penmetsa
Hi, Can you please help me understand why the job gets killed for the reason below? 01/02/2016 06:18:10|execd|host1|W|job 267713 exceeds job hard limit "h_vmem" of queue "test.q@host1" (11510681600.0 > limit:8589934592.0) - sending SIGKILL Regards, Sudha
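
The two numbers in the log line are byte counts. Converting them to GiB makes the comparison easier to read: the limit is exactly 8 GiB, while the job's usage was about 10.72 GiB. A minimal sketch of the conversion, with to_gib as an illustrative helper name:

```shell
# Convert a byte count to GiB (1 GiB = 1024^3 bytes) with two decimals.
to_gib() {
    awk -v bytes="$1" 'BEGIN { printf "%.2f\n", bytes / (1024 * 1024 * 1024) }'
}

echo "limit: $(to_gib 8589934592) GiB"
echo "usage: $(to_gib 11510681600) GiB"
```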

[gridengine users] Failure in launching jobs with qrsh

2015-11-09 Thread sudha.penmetsa
Hi, Facing the below issue while submitting jobs using qrsh to the queue. qrsh -V -cwd -q preprod.q@host1 -l h_vmem=1G -l h_stack=128M xterm ssh_exchange_identification: Connection closed by remote host can't open file /tmp/9572476.1.preprod.q/pid: No such file or directory Can you please let

[gridengine users] sharedmem parallel environment in grid

2015-09-21 Thread sudha.penmetsa
Hi, The parallel environment sharedmem in our grid environment is defined as follows,
pe_name            sharedmem
slots              4
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE

[gridengine users] reason for the queue going into error state due to failed job

2015-06-12 Thread sudha.penmetsa
Hi, The qacct output for a job says that the job failed with code 11 (failed 11 : before job). It seems this error occurs mostly when the user is too slow and misses the time slot for entering the password after launching the job. But when the user launches the job again the job

[gridengine users] memory consumption while running the jobs in parallel environment

2015-06-04 Thread sudha.penmetsa
Hi, When we want to run a job in a parallel environment on the grid using 4 cores with a total memory consumption of 40G, we define it as, for example, qrsh -V -cwd -q test.q -l mem_free=40G,h_vmem=10G -pe sharedmem 4 sleep 40 However this assumes that each of the threads consumes max
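
The request in the post splits the 40G total evenly over the 4 slots. A sketch of that arithmetic follows, assuming h_vmem is requested per slot (as the post's own numbers imply: 10G x 4 slots = 40G). The helper name per_slot_gib is illustrative, and the script only prints the command rather than running it.

```shell
# Per-slot h_vmem (in G) so that slots * h_vmem covers the total, rounded up.
# $1 = total memory in G, $2 = number of slots
per_slot_gib() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

slots=4
h_vmem=$(per_slot_gib 40 "$slots")
echo "qrsh -V -cwd -q test.q -l mem_free=40G,h_vmem=${h_vmem}G -pe sharedmem $slots sleep 40"
```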

Re: [gridengine users] frequent errors from the GRID messages

2015-06-01 Thread sudha.penmetsa
Hi Zhang, I couldn't list them as I get the error ls: cannot access /tmp/8319689.1.rhel6.q/: No such file or directory. The files are not available under /tmp Regards, Sudha -Original Message- From: Feng Zhang [mailto:prod.f...@gmail.com] Sent: Friday, May 29, 2015 11:56 PM To: Sudha

Re: [gridengine users] frequent errors from the GRID messages

2015-05-28 Thread sudha.penmetsa
Yes Hugh, Users have permissions for the directory drwxrwxrwt. 48 root root 163840 May 29 08:48 /tmp Regards, Sudha From: MacMullan, Hugh [mailto:hugh...@wharton.upenn.edu] Sent: Thursday, May 28, 2015 8:45 PM To: Sudha Padmini Penmetsa (WT01 - Global Media Telecom); users@gridengine.org

[gridengine users] memory consumption while running the jobs in parallel environment

2015-05-28 Thread sudha.penmetsa
Hi, When we want to run a job in a parallel environment on the grid using 4 cores with a total memory consumption of 40G, we define it as qrsh -V -cwd -q test.q -l mem_free=40G,h_vmem=10G -pe sharedmem 4 sleep 40 However this assumes that each of the threads consumes max 10G mem, the

[gridengine users] frequent errors from the GRID messages

2015-05-28 Thread sudha.penmetsa
Hi, The below errors appear quite often in the GRID messages file. Though there are sufficient permissions on the /tmp directory on the host, the messages show that the jobs failed because of this error. Can you please help in understanding the reason for these errors? 05/27/2015

Re: [gridengine users] Grid queue goes into an error state due to one job

2015-05-18 Thread sudha.penmetsa
Hi Gavin, I clear the error state using qmod -c '*'. I wanted to know the root cause and the solution to fix the issue permanently. Regards, Sudha -Original Message- From: Gavin W. Burris [mailto:b...@wharton.upenn.edu] Sent: Monday, May 18, 2015 6:08 PM To: Sudha Padmini Penmetsa (WT01 -

[gridengine users] Grid queue goes into an error state due to one job

2015-05-18 Thread sudha.penmetsa
Hi, We have a few hosts added to a queue. Because of one single job submitted to the queue, the whole queue goes into error state. As a result, no new jobs can be submitted to the queue unless we clear the error state. Can anyone please let me know what could be the reason for this and how to fix

[gridengine users] error :: execvp(/xxxx/xxxx/default/spool/node2/job_scripts/8

2015-05-14 Thread sudha.penmetsa
Hi, I have been submitting some jobs on grid. Apparently some are going through and many could not and staying in Error queue state. The error reason is as follows error :: execvp(///default/spool/node2/job_scripts/8 the failed reason seems to be failed 27 : searching

Re: [gridengine users] grid jobs not visible with qstat output

2015-05-13 Thread sudha.penmetsa
Hi Reuti, I did some testing again and now the process is killed after deleting the job using qdel job_id. Please find the test results. After starting the job, the process started on the execution host qstat -j 8150628 = job_number:

Re: [gridengine users] grid jobs not visible with qstat output

2015-05-13 Thread sudha.penmetsa
Hi Reuti, The value in /opt/sge/default/spool/active_jobs/8143543.1/addgrpid is not there in /proc/ But the child processes of the job are available in /proc/. Can you please suggest a solution. Regards, Sudha -Original Message- From: Reuti [mailto:re...@staff.uni-marburg.de]

Re: [gridengine users] grid jobs not visible with qstat output

2015-05-12 Thread sudha.penmetsa
Hi Reuti, In the link suggested by you (https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html ) it is mentioned as below To have a tight integration of SSH into SGE, the started sshd needs an additional group ID to be attached. Checked the configuration from our side and the

Re: [gridengine users] grid jobs not visible with qstat output

2015-05-08 Thread sudha.penmetsa
Hi Reuti, The processes are not bound to sge_shepherd anymore. Below are the qrsh_starter processes running still
 5049 ?        00:00:00 qrsh_starter
 5101 ?        00:00:00 run_it_file.vcs
 5408 ?        00:00:00 vcs.start.dh.no
 5424 ?        8-20:57:02 simv
 9089 ?        00:00:00 target.bin

[gridengine users] unable to contact qmaster using port 536 on host

2015-05-08 Thread sudha.penmetsa
Hi, We have an issue with our grid env today. The grid environment didn't load and while running grid commands we got the below errors error: unable to contact qmaster using port 536 on host (grid master server) error: can't unpack gdi request error: error unpacking gdi request: bad argument

Re: [gridengine users] grid jobs not visible with qstat output

2015-05-08 Thread sudha.penmetsa
Hi Zhang, Please find the o/p
32682 61457200 27020 karppa 32682 /applic36/grid/HWEE_ge6/utilbin/lx24-amd64/qrsh_starter /gridapl1/HWEE_ge6/default/spo
32734 61457200 27020 karppa 32734 \_ /bin/ksh ./run_it_file.vcs
33043 61457200 27020 karppa 32734 \_ /bin/ksh ./vcs.start.dh.no_gui
33059

Re: [gridengine users] grid jobs not visible with qstat output

2015-05-07 Thread sudha.penmetsa
Hi, No, the slots are not being used anymore. According to qstat I seem not to have any jobs on the host. However, my processes are still running on that specific host (launched by qrsh_starter) and are altogether consuming 200% of CPU and licenses. The problem here is that the processes have

[gridengine users] grid jobs not visible with qstat output

2015-05-06 Thread sudha.penmetsa
Hi, I noticed that two of my grid jobs have been running for over a week on a machine without my being aware of it. Both jobs were launched with qrsh, but they are not visible with qstat, so for one reason or another they are no longer included in the grid book-keeping. This issue will cause

[gridengine users] Qalter in grid

2015-04-29 Thread sudha.penmetsa
Hi, Can we alter the resources of a submitted job using qalter for interactive jobs (jobs submitted using qrsh) as well? For example: qalter -l h_vmem=5G job_id Can you let me know if it is applicable only to pending jobs or to running jobs also? Regards, Sudha

[gridengine users] array jobs in grid

2015-04-26 Thread sudha.penmetsa
Hi, Can you please give an example of submitting array jobs to the grid in a real environment? Regards, Sudha
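
As a starting point, an array job is submitted with qsub -t, and each task reads its index from the SGE_TASK_ID environment variable. The sketch below shows a possible job script body; the file naming scheme input.N.dat and the script name array_job.sh are made up for illustration.

```shell
# Body of a hypothetical job script, array_job.sh.
# Submitted as:  qsub -t 1-10 array_job.sh   (runs tasks 1..10)

# Map a task index to the input file that task should process.
input_for_task() {
    printf 'input.%s.dat\n' "$1"
}

# SGE sets SGE_TASK_ID for each array task; default to 1 so the script also runs by hand.
task_id="${SGE_TASK_ID:-1}"
echo "task $task_id would process $(input_for_task "$task_id")"
```

Each task then works on its own slice of the input, so the ten tasks can be scheduled independently.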

[gridengine users] Doubts regarding the h_vmem allocation

2015-04-24 Thread sudha.penmetsa
Hi, I am submitting a job to the grid parallel environment using the command qrsh -V -cwd -q 'queuename' -l h_vmem=2G -pe 'parallel env name' 4 sleep 20 There are two servers in the queue, for example host1 and host2. The total h_vmem configured on the servers is 250G. After submitting the job

Re: [gridengine users] Doubts regarding the h_vmem allocation

2015-04-24 Thread sudha.penmetsa
Hi, I am submitting a job to the grid parallel environment using the command qrsh -V -cwd -q 'queuename' -l h_vmem=2G -pe 'parallel env name' 4 sleep 20 There are two servers in the queue, for example host1 and host2. The total h_vmem configured on the servers is 250G. After submitting the job