"Itay M" <[EMAIL PROTECTED]> schrieb am 02/01/2008 04:01:26 AM:
> Well, you definitely came up with something interesting. The
> NODEAVAILABILITYPOLICY looks as if it should help me to resolve this
> issue (but currently it didn't... yet).
>
> I've made the following tests trying to figure out what's behind the
> scenes of the cluster:
>
> 1. I listed all the nodes for which diagnose -n says "has more
> processors utilized than dedicated".
> 2. Then I submitted several very short jobs (2 minutes) and
> designated each one of them to one of the nodes listed above. I
> used -l host={nodename} -l walltime=00:00:02.

It looks more like two seconds to me?... (walltime is HH:MM:SS, so two
minutes would be 00:02:00.)

> (The purpose of the short walltime is to make sure MAUI will not
> activate any reservation policies on the jobs; in fact the cluster
> had many free CPUs at the time I made the test, so no reservations
> are expected.) I expected the jobs *not* to go into R state, because
> each and every job was targeted to a node that "has more processors
> utilized than dedicated".
> 3. Indeed, that's what happened! None of the jobs went from Q state
> to R state. They have been waiting there for a very long time (hours).
> 4. I then checked the load average on each of the nodes listed above,
> and I indeed found that their load average is higher than their
> configured resources. For example, if the 'nodes' file says
> 'node22 np=4', I checked its load average at the time it showed the
> "has more processors utilized than dedicated" warning. I found that
> though this node runs only 2 jobs at the moment, the load average is
> above that (about 2.70). I expect this node to run 4 jobs at the same
> time.
>
> > Are these 2 jobs multithreaded? Is the load ~4 while it should be ~2?
>
> I'm not sure if they are multithreaded (needs further checking with
> the developers) - but you're right. The load should be no more than
> 2 for 2 jobs, but in fact it's >2. The jobs are C++ programs compiled
> with the g++ compiler. Maybe a compilation switch will help with
> reducing the load average to 1 per job?

Ask the programmers. The load average could also be caused by other
(non-job) activity on the node. Just run top and see how much CPU%
each of the job executables consumes. For a multithreaded job (and a
2.6.x kernel) you will see >100% CPU usage in top.

> I then moved to the next step, and set the NODEAVAILABILITYPOLICY to
> UTILIZED. The showconfig command now says:
> NODEAVAILABILITYPOLICY[0] UTILIZED:[DEFAULT]
> As this didn't make the jobs run, perhaps it's a matter of another
> tweak in the NODEAVAILABILITY policy?

You can set it to DEDICATED for testing. Then the load average should
not matter at all in scheduling decisions; only the number of jobs
assigned to the node should be taken into account. However, you
probably don't want to keep this setting in the long term, because
then you risk overcommitting the nodes and decreasing overall
performance.
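For illustration, and assuming your configuration lives in the usual
maui.cfg (adjust if your setup differs), the change would be roughly:

    # Count only resources dedicated to running jobs; ignore the
    # observed load average when deciding whether a node is available.
    NODEAVAILABILITYPOLICY  DEDICATED

    # The load-aware setting you have now, for comparison:
    # NODEAVAILABILITYPOLICY  UTILIZED

Maui also accepts a per-resource form such as DEDICATED:PROCS. Restart
Maui after editing the file and verify the result with showconfig, as
you did for UTILIZED.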
> And yet another thing about the diagnose -j output: I'm not sure if
> and how I should treat the 'WARNING: job '{job_id}' utilizes more
> memory than dedicated (xxxx > 512)' message. A vmstat test shows
> that jobs are indeed heavily swapping on the node.

It seems that memory is the bottleneck in your setup. You should make
the jobs ask for as much memory as they need on average, rather than
the default 512MB (a sketch of such a request follows below). Of
course, with the UTILIZED or COMBINED policy this will mean that idle
jobs won't get assigned to a node where such a memory-hog job is
running. However, that's probably still better for your throughput
than allowing swapping to happen (much depends on the memory access
patterns in the programs - if they need lots of memory but seldom
access all of it, swapping may be acceptable). You could also buy
more memory or tell people to write more memory-efficient code...

Regards,
Jan Ploski
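PS: Purely as an illustration - the memory amount, walltime, and
script name below are placeholders, not values from your site - an
explicit request at submission time could look like this:

    # Ask for one processor, roughly the job's real memory footprint,
    # and a realistic walltime (HH:MM:SS, so 00:02:00 is two minutes).
    qsub -l nodes=1:ppn=1 -l mem=900mb -l walltime=00:02:00 myjob.sh

The same requests can also be put into the job script as #PBS -l
directives. Once the dedicated memory matches what the jobs really
use, the diagnose -j warning should disappear and the scheduler can
avoid packing two memory-hungry jobs onto the same node.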