>> Are they all failing to start on the same host? Might be worth disabling
>> the queues on that host so the scheduler looks for another place to put it.
>> Have a look at the host to see if something is eating virtual memory there.

No, it wasn't one particular host or queue. I even created a new queue, but I still couldn't submit a new job. From the qstat -F output all hosts had enough free virtual memory.

-----Original Message-----
From: William Hay [mailto:[email protected]]
Sent: Wednesday, July 26, 2017 4:18
To: John_Tai
Cc: [email protected]
Subject: Re: [gridengine users] complex error

On Tue, Jul 25, 2017 at 12:57:47AM +0000, John_Tai wrote:
> I have configured virtual_free as a requestable resource:
>
> virtual_free mem MEMORY <= YES JOB 0 0
>
> And it's been working great for months.
>
> However today all of a sudden I got this error in messages:
>
> 07/25/2017 08:45:41|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 95945748480.262146, job 5983416 requests additional 268000000000.000000
> 07/25/2017 08:45:41|worker|ibm068|E|cannot start job 5983416.1, as resources have changed during a scheduling run
> 07/25/2017 08:45:41|worker|ibm068|W|Skipping remaining 7 orders
>
> And any job would not get scheduled at all, they'd be in waiting state "qw", no matter how many resources it's requesting:

Are they all failing to start on the same host? Might be worth disabling the queues on that host so the scheduler looks for another place to put it. Have a look at the host to see if something is eating virtual memory there.

William

> # qstat -j 5983416
> ==============================================================
> job_number:                 5983416
> exec_file:                  job_scripts/5983416
> submission_time:            Tue Jul 25 08:18:46 2017
> owner:                      jumbo
> uid:                        986
> group:                      memory
> gid:                        41
> sge_o_home:                 /home/jumbo
> sge_o_log_name:             jumbo
> sge_o_path:                 /home/eda/cadence/IC616.500.3_20131102/tools/bin:/home/eda/cadence/IC616.500.3_20131102/tools/dfII/bin:/home/eda/cadence/IC616.500.3_20131102/tools/plot/bin:/home/eda/cadence/Spectre161ISR2/tools/bin:/home/sge/sge6.2u6/bin/lx24-amd64:/bin:/usr/bin:/usr/local/bin:.:/home/sge/bin:/home/DI/TOOLS/bin:.:/home/IPproj/IOproject/quan/Flatten
> sge_o_shell:                /bin/csh
> sge_o_workdir:              /home/memorytemp/jumbo/180G_RK/S018DP/design_review
> sge_o_host:                 ibm041
> account:                    sge
> cwd:                        /home/memorytemp/jumbo/180G_RK/S018DP/design_review
> merge:                      y
> hard resource_list:         virtual_free=2000m
> mail_list:                  jumbo@ibm041
> notify:                     FALSE
> job_name:                   run.pl
> jobshare:                   0
> hard_queue_list:            256g.q
> env_list:                   REMOTEHOST=dsls11,MANPATH=/home/sge/sge6.2u6/man:/opt/SUNWspro/man:/usr/man:/usr/openwin/man:/usr/dt/man:/usr/local/man:/usr/local/mysql/man:/usr/local/samba/man,VNCDESKTOP=ibm041:344 (jumbo),HOSTNAME=ibm041,HOST=ibm041,SHELL=/bin/csh,TERM=xterm,GROUP=memory,USER=jumbo,LD_LIBRARY_PATH=/usr/lib:/usr/openwin/lib:/usr/dt/lib:/usr/ccs/lib:/usr/local/lib:/usr/local/mysql/lib,LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:,HOSTTYPE=x86_64-linux,MAIL=/var/spool/mail/jumbo,PATH=/home/eda/cadence/IC616.500.3_20131102/tools/bin:/home/eda/cadence/IC616.500.3_20131102/tools/dfII/bin:/home/eda/cadence/IC616.500.3_20131102/tools/plot/bin:/home/eda/cadence/Spectre161ISR2/tools/bin:/home/sge/sge6.2u6/bin/lx24-amd64:/bin:/usr/bin:/usr/local/bin:.:/home/sge/bin:/home/DI/TOOLS/bin:.:/home/IPproj/IOproject/quan/Flatten,INPUTRC=/etc/inputrc,PWD=/home/memorytemp/jumbo/180G_RK/S018DP/design_review,EDITOR=xterm -e vi,LANG=en_US.UTF-8,SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,SHLVL=6,HOME=/home/jumbo,OSTYPE=linux,VENDOR=unknown,MACHTYPE=x86_64,LOGNAME=jumbo,LESSOPEN=|/usr/bin/lesspipe.sh %s,DISPLAY=:344.0,G_BROKEN_FILENAMES=1,_=/usr/bin/gnome-session,GTK_RC_FILES=/etc/gtk/gtkrc:/home/jumbo/.gtkrc-1.2-gnome2,SESSION_MANAGER=local/ibm041:/tmp/.ICE-unix/17118,GNOME_KEYRING_SOCKET=/tmp/keyring-FJMO4E/socket,GNOME_DESKTOP_SESSION_ID=Default,DESKTOP_STARTUP_ID=NONE,COLORTERM=gnome-terminal,WINDOWID=38263354,SGE_ROOT=/home/sge/sge6.2u6,SGE_CELL=cell1,SGE_CLUSTER_NAME=p5098,IC61=/home/eda/cadence/IC616.500.3_20131102,MMSIMHOME=/home/eda/cadence/Spectre161ISR2,LM_LICENSE_FILE=5280@ibm041:5280@ibm001:5280@ibm002:5280@ibm003:5260@cadlic:5280@cadlic:5280@dsw3:5280@dsw7:5280@ibm004:5280@ibm005:5280@ibm006:5280@10.224.172.252
> script_file:                ./run.pl
> scheduling info:            queue instance "gui.q@dsbm05" dropped because
>                             it is overloaded: mem_used=269814435839.737854 (no load adjustment) >= 200g
>                             queue instance "192g.q@dsbm10" dropped because
>                             it is temporarily not available
>                             queue instance "gui.q@dsbm10" dropped because
>                             it is temporarily not available
>                             queue instance "gui.q@dsbm10" dropped because
>                             it is temporarily not available
>
> And clearly there are available resources:
>
> # qstat -F mem
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> gmig.q@ibm044                  BIP   0/0/2          1.27     lx24-amd64
>         hc:virtual_free=24.000G
> ---------------------------------------------------------------------------------
> gui.q@dsbm04                   BIP   0/59/70        10.01    lx24-amd64
>         hc:virtual_free=256.000G
> ---------------------------------------------------------------------------------
> gui.q@dsbm05                   BIP   0/56/70        7.14     lx24-amd64    a
>         hc:virtual_free=90.705G
> ---------------------------------------------------------------------------------
> gui.q@dsbm08                   BIP   0/11/45        9.96     lx24-amd64
>         hc:virtual_free=192.000G
> ---------------------------------------------------------------------------------
> gui.q@dsbm09                   BIP   0/7/45         9.84     lx24-amd64
>         hc:virtual_free=192.000G
> ---------------------------------------------------------------------------------
> gui.q@dsbm10                   BIP   0/2/45         0.82     lx24-amd64    o
>         hc:virtual_free=192.000G
> ---------------------------------------------------------------------------------
> gui.q@dsbm11                   BIP   0/41/45        3.13     lx24-amd64
>         hc:virtual_free=192.000G
> ---------------------------------------------------------------------------------
> lc.q@ibm071                    BIP   0/0/50         0.21     lx24-amd64
>         hc:virtual_free=48.000G
> ---------------------------------------------------------------------------------
> lc.q@ibm072                    BIP   0/0/50         0.00     lx24-amd64
>         hc:virtual_free=48.000G
> ---------------------------------------------------------------------------------
> lc.q@ibm073                    BIP   0/0/50         24.09    lx24-amd64
>         hc:virtual_free=48.000G
> ---------------------------------------------------------------------------------
> lc.q@ibm074                    BIP   0/5/50         0.05     lx24-amd64
>         hc:virtual_free=48.000G
> ---------------------------------------------------------------------------------
> lc.q@ibm075                    BIP   0/0/50         24.43    lx24-amd64
>         hc:virtual_free=48.000G
>
> Not sure what happened there. I had to disable this complex, so now jobs are being scheduled again. I wonder if there was one job that was submitted improperly that caused this?
>
> ----------------------------------------------------------------------
> This email (including its attachments, if any) may be confidential and proprietary information of SMIC, and intended only for the use of the named recipient(s) above. Any unauthorized use or disclosure of this email is strictly prohibited. If you are not the intended recipient(s), please notify the sender immediately and delete this email from your computer.
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
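
For anyone comparing the numbers in the "exceeded" message against the qstat -F output, here is a minimal Python sketch that pulls the capacity and the requested amount out of that log line and prints both in GiB. The regular expression is only an assumption based on the single line quoted in this thread, not a documented log format, and the helper name parse_exceeded is made up for illustration. It makes it easier to see that the request reported by the scheduler is far larger than the virtual_free=2000m shown in the job's hard resource_list.

```python
import re

# Log line copied from the qmaster messages quoted in this thread.
LINE = ('07/25/2017 08:45:41|worker|ibm068|E|host load value "virtual_free" '
        'exceeded: capacity is 95945748480.262146, job 5983416 requests '
        'additional 268000000000.000000')

# Assumed pattern, derived only from the line above.
PATTERN = re.compile(
    r'host load value "(?P<name>[^"]+)" exceeded: '
    r'capacity is (?P<capacity>[\d.]+), '
    r'job (?P<job>\d+) requests additional (?P<request>[\d.]+)'
)

GIB = 1024 ** 3  # bytes per GiB


def parse_exceeded(line):
    """Return the resource name, job id, capacity and request (bytes), or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "job": int(m.group("job")),
        "capacity": float(m.group("capacity")),
        "request": float(m.group("request")),
    }


info = parse_exceeded(LINE)
print(f'{info["name"]}: capacity {info["capacity"] / GIB:.1f} GiB, '
      f'job {info["job"]} asks for {info["request"] / GIB:.1f} GiB')
```

Running the same function over every "exceeded" line in the messages file would show whether the oversized request always comes from the same job or host.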
