In one of our clusters, which has homogeneous compute nodes (64 GB RAM each), I
have set mem_free as a requestable and consumable resource. Following the
mailing list archives, I ran:
for x in $(qconf -sel)
do
  qconf -mattr exechost complex_values mem_free=60G "$x"
done
Every job submitted by every user includes one of the following lines in its
submission script:
#$ -hard -l mem_free=2G
for single-processor jobs, and
#$ -hard -l mem_free=(2/NPROCS)G
for parallel jobs using NPROCS processors.
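For completeness, the per-slot figure in that second request is simply the 2G
total divided by the slot count. A minimal sketch of that arithmetic (NPROCS=8
is an arbitrary example value, and 2G is taken as 2048 MB):

```shell
# Sketch: per-slot mem_free so the job's total request stays at 2G.
# NPROCS=8 is an arbitrary example value; 2048 MB = 2G.
NPROCS=8
TOTAL_MB=2048
PER_SLOT_MB=$(( TOTAL_MB / NPROCS ))
# The per-slot value would then go into the request, e.g. -l mem_free=${PER_SLOT_MB}M
echo "${PER_SLOT_MB}M"
```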
All single-processor jobs run just fine, and so do many parallel jobs. But some
parallel jobs, when the participating processors are spread across multiple
compute nodes, remain waiting indefinitely.
When I inspect such a job with 'qstat -j JOB_ID', I notice that it is
requesting (2 * NPROCS)G of RAM on each compute node. How would I go about
resolving this issue? If additional information is needed from my end, please
let me know.
Thank you for your time and help.
Best regards,
g
--
Gowtham, PhD
Director of Research Computing, IT
Adj. Asst. Professor, Physics/ECE
Michigan Technological University
(906) 487-3593
http://it.mtu.edu
http://hpc.mtu.edu
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users