Well, you definitely came up with something interesting. The NODEAVAILABILITYPOLICY looks as if it should help me resolve this issue (but it hasn't... yet).
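For reference, this is the kind of maui.cfg line I'm talking about (a sketch only; whether a resource type needs to be appended after the colon, and which policy value is actually appropriate here, are open questions):

```
# maui.cfg fragment: base node availability on utilized resources
# (load) instead of only the dedicated/requested resources
NODEAVAILABILITYPOLICY  UTILIZED
```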
I've made the following tests to figure out what's going on behind the scenes in the cluster:

1. I listed all the nodes for which diagnose -n says: "has more processors utilized than dedicated".

2. I then submitted several very short jobs (2 minutes) and targeted each one at one of the nodes listed above, using -l host={nodename} -l walltime=00:00:02. (The short walltime is there to make sure Maui will not activate any reservation policies on the jobs. In fact the cluster had many free CPUs at the time I ran the test, so no reservations were expected.) I expected the jobs *not* to go into the R state, because each and every job was targeted at a node that "has more processors utilized than dedicated".

3. That is indeed what happened: none of the jobs went from the Q state to the R state. They have been waiting there for a very long time (hours).

4. I then checked the load average on each of the nodes listed above, and found that it was indeed higher than the node's configured resources would suggest. For example, if the 'nodes' file says 'node22 np=4', I checked its load average while it had the "has more processors utilized than dedicated" warning. Although this node was running only 2 jobs at the time, its load average was above that (about 2.70). I expect this node to run 4 jobs at the same time.

> Are these 2 jobs multithreaded? Is the load ~4 while it should be ~2?

I'm not sure if they are multithreaded (this needs further checking with the developers), but you're right: the load should be no more than 2 for 2 jobs, yet in fact it is >2. The jobs are C++ programs compiled with g++. Maybe a compilation switch would help reduce the load average to 1 per job?

I then moved to the next step and set NODEAVAILABILITYPOLICY to UTILIZED. The showconfig command now says:

NODEAVAILABILITYPOLICY[0]  UTILIZED:[DEFAULT]

As this didn't make the jobs run, perhaps it's a matter of another tweak to the NODEAVAILABILITYPOLICY?
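In case it helps reproduce the test, the per-node submission in steps 1-2 can be scripted roughly like this (a sketch: the node names, the job script name sleep.sh, and the grep over diagnose -n output are assumptions, and echo is left in front of qsub so it runs as a dry run):

```shell
#!/bin/sh
# Dry-run sketch of the per-node test. In practice NODES would come from
# something like:
#   diagnose -n | grep 'more processors utilized than dedicated'
NODES="node22 node23"    # assumed node names, for illustration only

for n in $NODES; do
    # Tiny walltime so Maui should not create a reservation for the job.
    # Remove 'echo' to actually submit.
    echo qsub -l host=$n -l walltime=00:00:02 sleep.sh
done
```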
And yet another thing, about the diagnose -j output: I'm not sure if and how I should treat the warning "WARNING: job '{job_id}' utilizes more memory than dedicated (xxxx > 512)". A vmstat test shows that jobs are indeed swapping heavily on the node.

Thanks,
Itay.

On Jan 30, 2008 12:26 AM, Jan Ploski <[EMAIL PROTECTED]> wrote:
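The vmstat observation above can be double-checked against the kernel's cumulative swap counters (a sketch; reading /proc/vmstat is Linux-specific, which is an assumption about the compute nodes):

```shell
#!/bin/sh
# Sample the cumulative swap-in/swap-out page counters twice, one second
# apart; a positive delta means the node swapped during the interval.
in1=$(awk '/^pswpin /{print $2}' /proc/vmstat)
out1=$(awk '/^pswpout /{print $2}' /proc/vmstat)
sleep 1
in2=$(awk '/^pswpin /{print $2}' /proc/vmstat)
out2=$(awk '/^pswpout /{print $2}' /proc/vmstat)

delta=$(( (in2 - in1) + (out2 - out1) ))
if [ "$delta" -gt 0 ]; then
    echo "node swapped $delta pages in the last second"
else
    echo "no swap activity in this sample"
fi
```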
_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers