I think I found the mistake in my submission script:

  hard resource_list: mem_free=128.00G

should be

  hard resource_list: mem_free=2.00G

so that the job with 64 processors requests 128 GB of RAM in total. Correct?
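In other words, the relevant part of the corrected submission script would look something like the sketch below ('mpi' is just a stand-in for whatever parallel environment the job actually uses):

  #!/bin/bash
  # 64-slot parallel job ('mpi' is a placeholder PE name)
  #$ -pe mpi 64
  # Per-slot memory request: SGE multiplies the consumable by the slot
  # count, so 2 GB per slot comes to 128 GB for the job as a whole
  #$ -hard -l mem_free=2G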
Best regards,
g

--
Gowtham, PhD
Director of Research Computing, IT
Adj. Asst. Professor, Physics/ECE
Michigan Technological University

(906) 487/3593
http://it.mtu.edu
http://hpc.mtu.edu


On Tue, 31 Mar 2015, Gowtham wrote:

| Hi Reuti,
|
| It's a 64-processor job, and my hope/plan is that it requests 2 GB per processor for a total of 128 GB. But each compute node has only 64 GB of RAM in total (60 GB of which is set as requestable/consumable).
|
| I could be mistaken, but I think the job is looking for 128 GB of RAM per node? Please correct me if I am wrong.
|
| Best regards,
| g
|
| --
| Gowtham, PhD
| Director of Research Computing, IT
| Adj. Asst. Professor, Physics/ECE
| Michigan Technological University
|
| (906) 487/3593
| http://it.mtu.edu
| http://hpc.mtu.edu
|
|
| On Tue, 31 Mar 2015, Reuti wrote:
|
| | > On 31.03.2015 at 14:17, Gowtham <[email protected]> wrote:
| | >
| | > Please find it here:
| | >
| | >   http://sgowtham.com/downloads/qstat_j_74545.txt
| |
| | OK, but where is SGE looking for 256 GB? For now, each slot will get 128G as requested.
| |
| | -- Reuti
| |
| | > Best regards,
| | > g
| | >
| | > --
| | > Gowtham, PhD
| | > Director of Research Computing, IT
| | > Adj. Asst. Professor, Physics/ECE
| | > Michigan Technological University
| | >
| | > (906) 487/3593
| | > http://it.mtu.edu
| | > http://hpc.mtu.edu
| | >
| | >
| | > On Tue, 31 Mar 2015, Reuti wrote:
| | >
| | > | Hi,
| | > |
| | > | > On 31.03.2015 at 13:13, Gowtham <[email protected]> wrote:
| | > | >
| | > | > In one of our clusters, which has homogeneous compute nodes (64 GB RAM each), I have set mem_free as a requestable and consumable resource. Following the mailing list archives, I have done
| | > | >
| | > | >   for x in `qconf -sel`
| | > | >   do
| | > | >     qconf -mattr exechost complex_values mem_free=60G $x
| | > | >   done
| | > | >
| | > | > Every job submitted by every user has the following line in its submission script:
| | > | >
| | > | >   #$ -hard -l mem_free=2G
| | > | >
| | > | > for single-processor jobs, and
| | > | >
| | > | >   #$ -hard -l mem_free=(2*NPROCS)G
| | > | >
| | > | > for a parallel job using NPROCS processors.
| | > | >
| | > | > All single-processor jobs run just fine, and so do many parallel jobs. But some parallel jobs, when the participating processors are spread across multiple compute nodes, keep on waiting.
| | > | >
| | > | > When inspected with 'qstat -j JOB_ID', I notice that the job is looking for (2 * NPROCS)G of RAM in each compute node. How would I go about resolving this issue? If additional information is necessary from my end, please let me know.
| | > |
| | > | Can you please post the output of `qstat -j JOB_ID` of such a job.
| | > |
| | > | -- Reuti
| | > |
| | > | > Thank you for your time and help.
| | > | >
| | > | > Best regards,
| | > | > g
| | > | >
| | > | > --
| | > | > Gowtham, PhD
| | > | > Director of Research Computing, IT
| | > | > Adj. Asst. Professor, Physics/ECE
| | > | > Michigan Technological University
| | > | >
| | > | > (906) 487/3593
| | > | > http://it.mtu.edu
| | > | > http://hpc.mtu.edu

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
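(A follow-up note for anyone hitting the same behaviour: the per-slot accounting discussed above applies because mem_free was made consumable in the complex configuration. A rough sketch of what such an entry looks like when edited with `qconf -mc`; treat the column layout and defaults as illustrative, since they can differ between Grid Engine versions:

  #name      shortcut  type    relop  requestable  consumable  default  urgency
  mem_free   mf        MEMORY  <=     YES          YES         0        0

With consumable set to YES, a hard request such as '-l mem_free=128G' is charged once per slot on every execution host the job lands on, so it can never fit within the 60G advertised per node via 'qconf -mattr exechost complex_values mem_free=60G'; requesting 2G per slot keeps the per-host total within that limit.)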
