In one of our clusters, which has homogeneous compute nodes (64 GB RAM each), I 
have set mem_free as a requestable and consumable resource. Following the 
mailing list archives, I ran

  for x in $(qconf -sel)
  do
    qconf -mattr exechost complex_values mem_free=60G "$x"
  done
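
After the loop, the setting can be spot-checked per host, e.g. (the host name 
below is just a placeholder, not one of our actual nodes):

  # Verify the consumable on one execution host:
  qconf -se node001 | grep complex_values

  # Or watch mem_free availability cluster-wide as jobs consume it:
  qhost -F mem_free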

Every job submitted by every user includes the following line in its 
submission script:

  #$ -hard -l mem_free=2G

for single-processor jobs, and

  #$ -hard -l mem_free=(2/NPROCS)G

for a parallel job using NPROCS processors.
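
As a concrete instance of the parallel case, the per-slot figure above can be 
computed like this (NPROCS=8 is only an illustrative value, and the awk call is 
there because POSIX sh arithmetic is integer-only):

  #!/bin/sh
  # Sketch: compute the per-slot mem_free request, i.e. (2/NPROCS)G.
  NPROCS=8
  # awk handles the floating-point division
  PER_SLOT=$(awk -v n="$NPROCS" 'BEGIN { printf "%.3f", 2 / n }')
  echo "#$ -hard -l mem_free=${PER_SLOT}G"

which for NPROCS=8 emits the directive with mem_free=0.250G.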


All single-processor jobs run just fine, and so do many parallel jobs. But 
some parallel jobs, specifically those whose assigned slots are spread across 
multiple compute nodes, never leave the pending state.

When I inspect such a job with 'qstat -j JOB_ID', I see that it is requesting 
(2 * NPROCS)G of mem_free on each compute node. How would I go about resolving 
this issue? If additional information is needed from my end, please let me know.

Thank you for your time and help.

Best regards,
g

--
Gowtham, PhD
Director of Research Computing, IT
Adj. Asst. Professor, Physics/ECE
Michigan Technological University

(906) 487-3593
http://it.mtu.edu
http://hpc.mtu.edu

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
