Brian Smith <[email protected]> writes: > We found h_vmem to be highly unpredictable, especially with java-based > applications. Stack settings were screwed up, certain applications > wouldn't launch (segfaults), and hard limits were hard to determine > for things like MPI applications. When your master has to launch 1024 > MPI sub-tasks (qrsh), it generally eats up more VMEM than the slave > tasks do. It was just hard to get right.
I've never really understood this business.  To start with, why does the
master need to launch so many qrshs?  It seems it would be worth moving
to an MPI which supports tree spawning for such cases.  Also, I get the
impression that other people see higher qrsh usage than I do.  I know
this is a much smaller job -- we envy you 1024 nodes, if they're
reliable! -- but the overhead on this 32-node job is in the noise; you
wouldn't guess which is the master.

$ qhost -h `nodes-in-job 154667`|tail -15
node224  lx-amd64   4  2  4  4  4.01  7.7G  210.6M  3.7G     0.0
node228  lx-amd64   4  2  4  4  4.02  7.6G  218.1M  3.7G  288.0K
node229  lx-amd64   4  2  4  4  4.04  7.7G  204.7M  3.7G     0.0
node230  lx-amd64   4  2  4  4  5.19  7.6G  251.5M  3.7G   24.0K
node232  lx-amd64   4  2  4  4  4.14  7.3G  205.9M  3.7G     0.0
node233  lx-amd64   4  2  4  4  4.10  7.3G  208.8M  3.7G     0.0
node234  lx-amd64   4  2  4  4  4.13  7.3G  212.6M  3.7G     0.0
node235  lx-amd64   4  2  4  4  4.03  7.3G  246.1M  3.7G   25.9M
node236  lx-amd64   4  2  4  4  4.19  7.3G  230.2M  3.7G    3.7M
node237  lx-amd64   4  2  4  4  4.14  7.3G  210.8M  3.7G     0.0
node239  lx-amd64   4  2  4  4  4.03  7.3G  206.2M  3.7G     0.0
node241  lx-amd64   4  2  4  4  4.21  7.3G  282.0M  3.7G   70.3M
node242  lx-amd64   4  2  4  4  4.11  7.3G  205.0M  3.7G     0.0
node243  lx-amd64   4  2  4  4  4.10  7.3G  210.7M  3.7G     0.0
node244  lx-amd64   4  2  4  4  4.20  7.3G  206.7M  3.7G     0.0

That doesn't mean we shouldn't support different limits on the master, of
course, and I'll look at excluding particular processes from the
cpuset/cgroup-based accounting.  [It looks as if that's potentially
duplicating two other lots of work, sigh, though I'm not sure anyone else
is interested in the Red Hat 5-ish systems that are required for the
largest scientific computing effort.]

--
Community Grid Engine:  http://arc.liv.ac.uk/SGE/
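[On the cpuset/cgroup point above, a minimal sketch of the mechanism such
an exclusion would rely on, assuming a cgroup-v1 style memory controller
mounted at /sys/fs/cgroup/memory -- the group names, the job directory
and MASTER_PID are invented placeholders, not the actual accounting code:]

# Illustration only: paths, group names and MASTER_PID are placeholders.
# A task's future memory use is charged to whichever cgroup lists its
# PID, so excluding a process from a job's accounting amounts to writing
# its PID into a different group's tasks file.
$ cat /sys/fs/cgroup/memory/sge/job_154667/tasks      # PIDs charged to the job
$ echo $MASTER_PID > /sys/fs/cgroup/memory/sge/exempt/tasks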
