On Tue, Jul 25, 2017 at 12:57:47AM +0000, John_Tai wrote:
> I have configured virtual_free as a requestable resource:
>
> virtual_free        mem        MEMORY      <=    YES         JOB        0        0
>
> And it's been working great for months.
>
> However today all of a sudden I got this error in messages:
>
> 07/25/2017 08:45:41|worker|ibm068|E|host load value "virtual_free" exceeded: capacity is 95945748480.262146, job 5983416 requests additional 268000000000.000000
> 07/25/2017 08:45:41|worker|ibm068|E|cannot start job 5983416.1, as resources have changed during a scheduling run
> 07/25/2017 08:45:41|worker|ibm068|W|Skipping remaining 7 orders
>
> And any job would not get scheduled at all, they'd be in waiting state
> "qw", no matter how many resources it's requesting:

Are they all failing to start on the same host? Might be worth disabling
the queues on that host so the scheduler looks for another place to put
it. Have a look at the host to see if something is eating virtual memory
there.

William

> # qstat -j 5983416
> ==============================================================
> job_number:                 5983416
> exec_file:                  job_scripts/5983416
> submission_time:            Tue Jul 25 08:18:46 2017
> owner:                      jumbo
> uid:                        986
> group:                      memory
> gid:                        41
> sge_o_home:                 /home/jumbo
> sge_o_log_name:             jumbo
> sge_o_path:                 /home/eda/cadence/IC616.500.3_20131102/tools/bin:/home/eda/cadence/IC616.500.3_20131102/tools/dfII/bin:/ho
> me/eda/cadence/IC616.500.3_20131102/tools/plot/bin:/home/eda/cadence/Spectre161ISR2/tools/bin:/home/sge/sge6.2u6/bin/lx24-amd64:/bin:/
> usr/bin:/usr/local/bin:.:/home/sge/bin:/home/DI/TOOLS/bin:.:/home/IPproj/IOproject/quan/Flatten
> sge_o_shell:                /bin/csh
> sge_o_workdir:              /home/memorytemp/jumbo/180G_RK/S018DP/design_review
> sge_o_host:                 ibm041
> account:                    sge
> cwd:                        /home/memorytemp/jumbo/180G_RK/S018DP/design_review
> merge:                      y
> hard resource_list:         virtual_free=2000m
> mail_list:                  jumbo@ibm041
> notify:                     FALSE
> job_name:                   run.pl
> jobshare:                   0
> hard_queue_list:            256g.q
> env_list:                   REMOTEHOST=dsls11,MANPATH=/home/sge/sge6.2u6/man:/opt/SUNWspro/man:/usr/man:/usr/openwin/man:/usr/dt/man:/
> usr/local/man:/usr/local/mysql/man:/usr/local/samba/man,VNCDESKTOP=ibm041:344 (jumbo),HOSTNAME=ibm041,HOST=ibm041,SHELL=/bin/csh,TERM=
> xterm,GROUP=memory,USER=jumbo,LD_LIBRARY_PATH=/usr/lib:/usr/openwin/lib:/usr/dt/lib:/usr/ccs/lib:/usr/local/lib:/usr/local/mysql/lib,L
> S_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.
> exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip
> =00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*
> .xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:,HOSTTYPE=x86_64-linux,MAIL=/var/spool/mail/jumbo,PATH=/home/eda/cadence/IC616.500.3_20
> 131102/tools/bin:/home/eda/cadence/IC616.500.3_20131102/tools/dfII/bin:/home/eda/cadence/IC616.500.3_20131102/tools/plot/bin:/home/eda
> /cadence/Spectre161ISR2/tools/bin:/home/sge/sge6.2u6/bin/lx24-amd64:/bin:/usr/bin:/usr/local/bin:.:/home/sge/bin:/home/DI/TOOLS/bin:.:
> /home/IPproj/IOproject/quan/Flatten,INPUTRC=/etc/inputrc,PWD=/home/memorytemp/jumbo/180G_RK/S018DP/design_review,EDITOR=xterm -e vi,LA
> NG=en_US.UTF-8,SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,SHLVL=6,HOME=/home/jumbo,OSTYPE=linux,VENDOR=unknown,MACHTYPE=x86_64
> ,LOGNAME=jumbo,LESSOPEN=|/usr/bin/lesspipe.sh %s,DISPLAY=:344.0,G_BROKEN_FILENAMES=1,_=/usr/bin/gnome-session,GTK_RC_FILES=/etc/gtk/gt
> krc:/home/jumbo/.gtkrc-1.2-gnome2,SESSION_MANAGER=local/ibm041:/tmp/.ICE-unix/17118,GNOME_KEYRING_SOCKET=/tmp/keyring-FJMO4E/socket,GN
> OME_DESKTOP_SESSION_ID=Default,DESKTOP_STARTUP_ID=NONE,COLORTERM=gnome-terminal,WINDOWID=38263354,SGE_ROOT=/home/sge/sge6.2u6,SGE_CELL
> =cell1,SGE_CLUSTER_NAME=p5098,IC61=/home/eda/cadence/IC616.500.3_20131102,MMSIMHOME=/home/eda/cadence/Spectre161ISR2,LM_LICENSE_FILE=5
> 280@ibm041:5280@ibm001:5280@ibm002:5280@ibm003:5260@cadlic:5280@cadlic:5280@dsw3:5280@dsw7:5280@ibm004:5280@ibm005:5280@ibm006:5280@10
> .224.172.252
> script_file:                ./run.pl
> scheduling info:            queue instance "gui.q@dsbm05" dropped because it is overloaded: mem_used=269814435839.737854 (no load adjustment) >= 200g
>                             queue instance "192g.q@dsbm10" dropped because it is temporarily not available
>                             queue instance "gui.q@dsbm10" dropped because it is temporarily not available
>                             queue instance "gui.q@dsbm10" dropped because it is temporarily not available
>
> And clearly there are available resources:
>
> # qstat -F mem
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> gmig.q@ibm044                  BIP   0/0/2          1.27     lx24-amd64
>         hc:virtual_free=24.000G
> ---------------------------------------------------------------------------------
> gui.q@dsbm04                   BIP   0/59/70        10.01    lx24-amd64
>         hc:virtual_free=256.000G
> ---------------------------------------------------------------------------------
> gui.q@dsbm05                   BIP   0/56/70        7.14     lx24-amd64    a
>         hc:virtual_free=90.705G
> ---------------------------------------------------------------------------------
> gui.q@dsbm08                   BIP   0/11/45        9.96     lx24-amd64
>         hc:virtual_free=192.000G
> ---------------------------------------------------------------------------------
> gui.q@dsbm09                   BIP   0/7/45         9.84     lx24-amd64
>         hc:virtual_free=192.000G
> ---------------------------------------------------------------------------------
> gui.q@dsbm10                   BIP   0/2/45         0.82     lx24-amd64    o
>         hc:virtual_free=192.000G
> ---------------------------------------------------------------------------------
> gui.q@dsbm11                   BIP   0/41/45        3.13     lx24-amd64
>         hc:virtual_free=192.000G
> ---------------------------------------------------------------------------------
> lc.q@ibm071                    BIP   0/0/50         0.21     lx24-amd64
>         hc:virtual_free=48.000G
> ---------------------------------------------------------------------------------
> lc.q@ibm072                    BIP   0/0/50         0.00     lx24-amd64
>         hc:virtual_free=48.000G
> ---------------------------------------------------------------------------------
> lc.q@ibm073                    BIP   0/0/50         24.09    lx24-amd64
>         hc:virtual_free=48.000G
> ---------------------------------------------------------------------------------
> lc.q@ibm074                    BIP   0/5/50         0.05     lx24-amd64
>         hc:virtual_free=48.000G
> ---------------------------------------------------------------------------------
> lc.q@ibm075                    BIP   0/0/50         24.43    lx24-amd64
>         hc:virtual_free=48.000G
>
> Not sure what happened there. I had to disable this complex, so now jobs
> are being scheduled again. I wonder if there was one job that was
> submitted improperly that caused this?
>
> ----------------------------------------------------------------------
> This email (including its attachments, if any) may be confidential and
> proprietary information of SMIC, and intended only for the use of the
> named recipient(s) above. Any unauthorized use or disclosure of this
> email is strictly prohibited. If you are not the intended recipient(s),
> please notify the sender immediately and delete this email from your
> computer.
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
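
For anyone reading this thread later, here is a minimal sketch (not from either poster) of one way to follow up on the two suggestions above: pull the host, capacity and requested amount out of the "host load value ... exceeded" line in messages, then print the commands one might run against that host. Several things here are assumptions: the regular expression is written only against the single log line quoted in this thread, the suffix multipliers are my reading of sge_types(1) (lowercase as powers of 1000, uppercase as powers of 1024), and the `qhost -h ... -F ...` and `qmod -d '*@host'` invocations are how I understand qhost(1) and qmod(1), so check them against your own installation. The function names are hypothetical.

```python
import re
import shlex

# Matches the text part of the qmaster "messages" line quoted in this thread.
# Assumption: other Grid Engine versions may word this message differently.
LOAD_EXCEEDED = re.compile(
    r'host load value "(?P<resource>[^"]+)"\s+exceeded: capacity is\s+'
    r'(?P<capacity>[0-9.]+), job (?P<job>\d+) requests additional\s+'
    r'(?P<requested>[0-9.]+)'
)

# Memory suffixes as I understand them from sge_types(1); treat as an assumption.
SUFFIX = {
    "k": 1000, "K": 1024,
    "m": 1000 ** 2, "M": 1024 ** 2,
    "g": 1000 ** 3, "G": 1024 ** 3,
}


def to_bytes(spec):
    """Convert a request such as '2000m' into a number of bytes."""
    if spec and spec[-1] in SUFFIX:
        return float(spec[:-1]) * SUFFIX[spec[-1]]
    return float(spec)


def parse_line(line):
    """Return a dict for a 'load value exceeded' line, or None if it does not match."""
    fields = line.split("|", 4)  # timestamp|thread|host|level|text
    if len(fields) < 5:
        return None
    match = LOAD_EXCEEDED.search(fields[4])
    if match is None:
        return None
    return {
        "host": fields[2],
        "resource": match.group("resource"),
        "job": int(match.group("job")),
        "capacity": float(match.group("capacity")),
        "requested": float(match.group("requested")),
    }


def suggested_commands(host, resource):
    """Commands to inspect the host and disable its queue instances (not executed here)."""
    return [
        ["qhost", "-h", host, "-F", resource],
        ["qmod", "-d", "*@" + host],
    ]


if __name__ == "__main__":
    line = ('07/25/2017 08:45:41|worker|ibm068|E|host load value "virtual_free" '
            'exceeded: capacity is 95945748480.262146, job 5983416 requests '
            'additional 268000000000.000000')
    rec = parse_line(line)
    print("host:", rec["host"])
    print("capacity  (GiB): %.2f" % (rec["capacity"] / 1024 ** 3))
    print("requested (GiB): %.2f" % (rec["requested"] / 1024 ** 3))
    # How many times the job's own 'virtual_free=2000m' fits into the request.
    print("multiple of 2000m:", rec["requested"] / to_bytes("2000m"))
    for argv in suggested_commands(rec["host"], rec["resource"]):
        print(shlex.join(argv))
```

Under the suffix assumption above, the amount in the error line is 134 times the job's `virtual_free=2000m` request, so it may be worth comparing that factor with whatever slot or task count the job was submitted with. The sketch only prints the `qhost` and `qmod` commands rather than running them, since disabling queue instances is something to do deliberately.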
