Hi,
Regarding access rules in the grid: a user's primary UNIX group must be the one
defined in the ACL for the user to get access.
Would it be possible to configure it so that the user just needs to belong to the
defined UNIX group, whatever their primary gid is?
Regards,
Sudha
The information contained in
Hi Reuti,
Thanks for your response.
I was not specifying the project name while submitting the job.
Specifying -P in the job helped.
Regards,
Sudha
-Original Message-
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Thursday, August 18, 2016 6:35 PM
To: Sudha Padmini Penmetsa
Hi,
I am trying to limit a defined project by adding it to a resource quota set,
but I am not able to limit the users defined in the project to 2 slots at a
time; we are still able to run more than 2 jobs at once.
Here IT_test is the project name. Can anyone correct me if there is anything
wrong.
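For reference, a resource quota set that caps a project's concurrent slots usually looks like the fragment below (a sketch only; the rule name is made up, and it assumes the limit should apply cluster-wide). It would be edited with `qconf -mrqs`:

```
{
   name         max_slots_IT_test
   description  "Limit project IT_test to 2 concurrent slots"
   enabled      TRUE
   limit        projects IT_test to slots=2
}
```

Note that jobs must actually be submitted under the project (e.g. with `-P IT_test`) for the rule to match them.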
Hi,
We see that most of the cores in our grid queue are idle, but my jobs are not
getting cores because there is insufficient h_vmem in the queue; the memory
is already in use by other jobs.
Would it be possible to somehow take memory requirements into account in the
scheduling?
Regards,
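For the scheduler to account for memory, h_vmem typically has to be marked consumable in the complex definition and given a per-host capacity. A sketch, assuming example values (the 2G default and 64G capacity are illustrative, not from the original mail):

```
# qconf -mc: make h_vmem a consumable resource with a per-slot default
#name    shortcut  type    relop  requestable  consumable  default  urgency
h_vmem   h_vmem    MEMORY  <=     YES          YES         2G       0

# qconf -me host1: give the execution host a finite h_vmem capacity
complex_values  h_vmem=64G
```

With this in place, the scheduler subtracts each job's h_vmem request from the host's capacity and stops dispatching when it would be exceeded.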
Hi,
We have added the below qmaster params in the SGE configuration
qmaster_params gdi_timeout=240 gdi_retries=-1 cl_ping=true
Could you let me know the difference between gdi_timeout and gdi_retries? Why
is there a gdi_retries parameter? Why can't we use gdi_timeout alone to retry
Hi,
Since this morning, users are sometimes facing an issue in the grid while
submitting qsub jobs.
When submitting a job, it displays the error message: "Unable to run job: failed
receiving gdi request. Exiting"
But the job runs successfully when checked later with qstat.
We tried to find the
Hi,
I have only one host defined in a queue and want to allot 2 slots per core
instead of one slot per core.
How do we need to define the slots attribute to allocate more than one slot per
core in the queue?
Regards,
Sudha
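One common approach (a sketch; the queue name and core count are examples): the queue's slots attribute is just an integer per host, so setting it to twice the physical core count gives two slots per core. For an 8-core host:

```
# give test.q 16 slots on its host, i.e. 2 slots per core on 8 cores
qconf -mattr queue slots 16 test.q
```

SGE does not itself tie slots to cores; the slot count is purely the oversubscription limit you choose.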
Hi,
Is it possible to specify the operating system version for a grid job submitted
with the qsub command? This would help with the problem that the hosts in the
queue run two different OS versions.
Regards,
Sudha
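One way this is often handled (a sketch; the complex name os_version and the value rhel6 are assumptions, not from the original mail): define a requestable string complex, tag each execution host with its value, and request it at submission time:

```
# qconf -mc: add a requestable string complex
#name       shortcut  type    relop  requestable  consumable  default  urgency
os_version  osv       STRING  ==     YES          NO          NONE     0

# qconf -me host1: tag the host with its OS version
complex_values  os_version=rhel6

# request it when submitting:
qsub -l os_version=rhel6 job.sh
```

The scheduler then only considers hosts whose complex_values match the request.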
Hi,
Is it possible to keep an old queue name (test.q), which we have already deleted,
as an alias for the newly created queue (prod.q)?
This way we could continue to work with old scripts.
Regards,
Sudha
Hi,
My job gets aborted after a while with exit status 135
failed 100 : assumedly after job
exit_status 135
02/17/2016 11:25:10|qmaster|master1|W|job 428284.1 failed on host test1
assumedly after job because: job 428284.1 died through signal BUS (7)
Job submitted with same resources
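Exit statuses above 128 conventionally mean the process was killed by a signal, the signal number being the status minus 128. A quick check against the log line:

```shell
# exit_status 135 => signal 135 - 128 = 7, i.e. SIGBUS on Linux,
# which matches "died through signal BUS (7)" in the qmaster log.
sig=$((135 - 128))
echo "killed by signal $sig"
```

So the 135 is not an SGE error code itself; the job's process died from a bus error.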
Hi,
We have launched the same job at different times; h_vmem is defined as 12 GB.
One run consumed only 10.5 GB and succeeded, while the other consumed 18.3 GB
and was therefore killed.
01/13/2016 00:13:18|execd|test1|W|job 33452 exceeds job hard limit "h_vmem" of
queue "test.q@test1"
Hi,
Can you please help me understand why the job gets killed for the
reason below.
01/02/2016 06:18:10|execd|host1|W|job 267713 exceeds job hard limit "h_vmem" of
queue "test.q@host1" (11510681600.0 > limit:8589934592.0) - sending
SIGKILL
Regards,
Sudha
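The two byte counts in the log line translate to GiB as follows (plain unit conversion, to make the overshoot visible):

```shell
# 11510681600 bytes used vs. the 8589934592-byte h_vmem limit:
used_gib=$(awk 'BEGIN { printf "%.2f", 11510681600 / 1024^3 }')
limit_gib=$(awk 'BEGIN { printf "%.2f", 8589934592 / 1024^3 }')
echo "used ${used_gib} GiB, limit ${limit_gib} GiB"
```

So the job used roughly 10.7 GiB against an 8 GiB h_vmem hard limit, and execd killed it.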
Hi,
We are facing the issue below while submitting jobs to the queue using qrsh.
qrsh -V -cwd -q preprod.q@host1 -l h_vmem=1G -l h_stack=128M xterm
ssh_exchange_identification: Connection closed by remote host
can't open file /tmp/9572476.1.preprod.q/pid: No such file or directory
Can you please let
Hi,
The parallel environment sharedmem in our grid environment is defined as
follows,
pe_name            sharedmem
slots              4
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
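With allocation_rule $pe_slots, all requested slots must come from a single host, so a request like the one below (queue and script names are illustrative) only starts once one host has 4 free slots:

```
qrsh -V -cwd -q test.q -pe sharedmem 4 ./my_threaded_job.sh
```

This is the usual setup for shared-memory (threaded) jobs, as opposed to $round_robin or $fill_up, which may spread slots across hosts.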
Hi,
The qacct output for a job says that the job failed with code 11 (failed 11
: before job).
This error seems to occur mostly when the user was too slow and missed the
time slot for entering the password after launching the job. But when the
user launches the job again, the job
Hi Zhang,
I couldn't list them, as I get the error:
ls: cannot access /tmp/8319689.1.rhel6.q/: No such file or directory
The files are not available under /tmp.
Regards,
Sudha
-Original Message-
From: Feng Zhang [mailto:prod.f...@gmail.com]
Sent: Friday, May 29, 2015 11:56 PM
To: Sudha
Yes Hugh, users have permissions for the directory:
drwxrwxrwt. 48 root root 163840 May 29 08:48 /tmp
Regards,
Sudha
From: MacMullan, Hugh [mailto:hugh...@wharton.upenn.edu]
Sent: Thursday, May 28, 2015 8:45 PM
To: Sudha Padmini Penmetsa (WT01 - Global Media Telecom); users@gridengine.org
Hi,
While running jobs in a parallel environment, if we want to run a job in the
grid using 4 cores with a total memory consumption of 40G, we define it as
qrsh -V -cwd -q test.q -l mem_free=40G,h_vmem=10G -pe sharedmem 4 sleep 40
However, this assumes that each of the threads consumes at most 10G mem, the
Hi,
The errors below appear quite often in the GRID messages file. Although there
are sufficient permissions on the /tmp directory of the host, the messages show
that jobs failed because of this error.
Can you please help in understanding the reason for these errors.
05/27/2015
Hi Gavin,
I cleared the error state using qmod -c '*'.
I wanted to know the root cause and a solution to fix the issue permanently.
Regards,
Sudha
-Original Message-
From: Gavin W. Burris [mailto:b...@wharton.upenn.edu]
Sent: Monday, May 18, 2015 6:08 PM
To: Sudha Padmini Penmetsa (WT01 -
Hi,
We have a few hosts added to a queue. Due to one single job submitted to the
queue, the whole queue goes into an error state.
As a result, no new jobs can be submitted to the queue unless we clear the
error state.
Can anyone please let me know what could be the reason for this and how to fix
Hi,
I have been submitting some jobs on the grid. Apparently some go through, but
many do not and stay in the Error queue state.
The error reason is as follows:
error :: execvp(///default/spool/node2/job_scripts/8
The failure reason seems to be:
failed 27 : searching
Hi Reuti,
I did some testing again, and now the process is killed after deleting the job
using qdel job_id. Please find the test results below.
After starting the job, the process started on the execution host
qstat -j 8150628
=
job_number:
Hi Reuti,
The value in /opt/sge/default/spool/active_jobs/8143543.1/addgrpid is not present
in /proc/,
but the child processes of the job are available in /proc/.
Can you please suggest a solution?
Regards,
Sudha
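To check whether a running process still carries the job's additional group ID, the GID stored in the addgrpid file can be compared against the Groups: line in the process's /proc status entry (a sketch; the spool path is taken from the mail above, while PID 5424 is only an example):

```
# Read the additional GID SGE recorded for the job ...
addgrpid=$(cat /opt/sge/default/spool/active_jobs/8143543.1/addgrpid)
# ... and see whether the suspect process is tagged with it:
grep "^Groups:.*${addgrpid}" /proc/5424/status
```

If the GID is missing from the process, SGE cannot associate the process with the job, which is why qdel leaves it behind.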
-Original Message-
From: Reuti [mailto:re...@staff.uni-marburg.de]
Hi Reuti,
In the link you suggested
(https://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html) it is
mentioned:
To have a tight integration of SSH into SGE, the started sshd needs an
additional group ID to be attached.
We checked the configuration on our side, and the
Hi Reuti,
The processes are no longer bound to sge_shepherd.
Below are the qrsh_starter processes still running:
5049 ?00:00:00 qrsh_starter
5101 ?00:00:00 run_it_file.vcs
5408 ?00:00:00 vcs.start.dh.no
5424 ?8-20:57:02 simv
9089 ?00:00:00 target.bin
Hi,
We have an issue with our grid environment today.
The grid environment didn't load, and while running grid commands we got the
errors below:
error: unable to contact qmaster using port 536 on host (grid master server)
error: can't unpack gdi request
error: error unpacking gdi request: bad argument
Hi Zhang,
Please find the output:
32682 61457200 27020 karppa 32682
/applic36/grid/HWEE_ge6/utilbin/lx24-amd64/qrsh_starter
/gridapl1/HWEE_ge6/default/spo
32734 61457200 27020 karppa 32734 \_ /bin/ksh ./run_it_file.vcs
33043 61457200 27020 karppa 32734 \_ /bin/ksh ./vcs.start.dh.no_gui
33059
Hi,
No, the slots are not being used anymore.
According to qstat I seem not to have any jobs on the host. However, there are
processes of mine running on that specific host (launched by qrsh_starter) that
are altogether consuming 200% of CPU and licenses. The problem here is that the
processes have
Hi,
I noticed that I've had two grid jobs running for over a week on a machine that
I wasn't aware of. Both jobs were launched with qrsh, but they are not visible
with qstat; for one reason or another they are no longer included in the grid
book-keeping. This issue will cause
Hi,
Can we alter the resources of a submitted job using qalter for interactive jobs
(jobs submitted using qrsh) as well?
For example:
qalter -l h_vmem=5G job_id
Can you let me know whether it is applicable only to pending jobs, or to running
jobs as well.
Regards,
Sudha
Hi,
Can you please give an example of how to submit array jobs in the grid in a
real environment.
Regards,
Sudha
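A minimal array-job sketch (the script name and task range are hypothetical): submit N copies of one script with qsub -t, and let each task pick its work item from the SGE_TASK_ID environment variable that the execd sets:

```shell
#!/bin/sh
# run_chunk.sh -- one array task; submit 10 of them with:
#   qsub -t 1-10 -cwd run_chunk.sh
# SGE sets SGE_TASK_ID to the task index (1..10); default to 1 here
# so the script can also be dry-run outside the grid.
chunk="${SGE_TASK_ID:-1}"
echo "processing input chunk ${chunk}"
```

Each task appears in qstat as job_id.task_id (e.g. 4711.3) and can be deleted or altered individually.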
Hi,
I am submitting a job to the grid parallel environment using the command
qrsh -V -cwd -q 'queuename' -l h_vmem=2G -pe 'parallel env name' 4 sleep 20
There are two servers in the queue, for example host1 and host2.
The total h_vmem configured on the servers is 250G.
After submitting the job
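For context (general SGE behaviour, not stated in the mail above): h_vmem is requested per slot, so with 4 PE slots the job-wide hard limit is four times the per-slot value:

```shell
# -l h_vmem=2G with "-pe ... 4" => 2G * 4 = 8G job-wide hard limit
per_slot_g=2
slots=4
echo "job-wide h_vmem: $((per_slot_g * slots))G"
```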