-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Jeff, Ralph,

On 29/08/13 23:30, Jeff Squyres (jsquyres) wrote:

> Let me try to understand this test:
>
> - you're simulating a 1GB memory limit via ulimit of virtual
> memory ("ulimit -v $((1*1024*1024))"), or 1,048,576 bytes.

Yeah, basically doing by hand what Torque/Slurm do by default for
jobs (unless the user asks for more). When this happens for Dalton
(compiled with the Intel compilers) it just sits there spinning its
wheels at startup.

> - you're trying to alloc 1070*10^6 = 1,070,000,000 bytes in an MPI
> app

That was the developer trying to simulate the failure in Dalton.

> - OMPI is barfing in the ptmalloc allocator

Sounds like it.

> Meaning: you're trying to allocate 1,000x memory than you're
> allowing in virtual memory -- so I guess part of this test depends
> on how much physical RAM you have, because you're limiting virtual
> memory, right?

No, it only depends on the memory limits for the job in Slurm.

The reason for the test is that he was trying to see whether those
limits were successfully being propagated to MPI ranks in Slurm
(and it appears not). However, in the process he found he could also
replicate this livelock/deadlock in Dalton.

> It's quite possible that the ptmalloc included in OMPI doesn't
> guard well against a failed mmap. FWIW, I've seen all kinds of
> random badness (not just with OMPI) when malloc/mmap/etc. start
> failing due to lack of memory.

OK, so I'll try testing again with a larger limit to see if that
will ameliorate this issue. I'm also wondering where this is
happening in OMPI; I've a sneaking suspicion this is at MPI_INIT().

> Do you get the same behavior if you disable ptmalloc in OMPI?
> (your IB large message bandwidth will suffer a bit, though)

Not tried that, but I'll take a look at it if it doesn't seem
possible to fix it with a change to the default memory limits
(that'll be the least intrusive).

Thanks!
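
As a side note on the numbers above: bash's "ulimit -v" takes its
argument in KiB, so $((1*1024*1024)) is a 1 GiB limit (1,073,741,824
bytes), and the 1,070,000,000 byte request sits just under it. A
minimal Python sketch of that arithmetic follows; the variable names
are mine and purely illustrative, and it assumes the bash convention
of 1024-byte units for "ulimit -v".

```python
# Assumption: bash's "ulimit -v" argument is in units of 1024 bytes (KiB).
KIB = 1024

# The limit from the test: ulimit -v $((1*1024*1024))
limit_kib = 1 * 1024 * 1024
limit_bytes = limit_kib * KIB

# The allocation the test program attempts: 1070 * 10^6 bytes
alloc_bytes = 1070 * 10**6

# Address space left over for everything else in the process
headroom_bytes = limit_bytes - alloc_bytes

print(f"limit:    {limit_bytes:,} bytes")
print(f"request:  {alloc_bytes:,} bytes")
print(f"headroom: {headroom_bytes:,} bytes "
      f"({headroom_bytes / 2**20:.2f} MiB)")
```

That leaves under 4 MiB of address space for everything else in the
process (the binary, shared libraries, stacks), which would be
consistent with the allocation failing even though it is nominally
below the limit.
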

Chris
- -- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlIf2lMACgkQO2KABBYQAh/JrACfRKATdmD3hbSX0mHWtAt2cBP6
1wYAn31EjuS37inIaD151n1DxuAH4GAM
=yaYe
-----END PGP SIGNATURE-----