Prentice Bisbal wrote:
> I've got a new problem with my cluster. Some of this problem may be with
> my queuing system (SGE), but I figured I'd post here first.
>
> I've been using HPL to test my new cluster. I generally run a small
> problem size (Ns=60000) so the job only runs 15-20 minutes. Last night, I
> upped the problem size by a factor of 10 (Ns=600000). Shortly after
> submitting the job, half the nodes were shown as down in Ganglia.
>
> I killed the job with qdel, and the majority of the nodes came back, but
> about 1/3 did not. When I came in this morning, there were kernel
> panic/OOM-type messages on the consoles of the systems that never came
> back.
>
> I used to run HPL jobs much bigger than this on my cluster without a
> problem. There's nothing I actively changed, but there might have been
> some updates to the OS (kernel, libs, etc.) since the last time I ran a
> job this big. Any ideas where I should begin looking?
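(A quick sanity check on the numbers, not from the original thread: HPL factors one dense double-precision N x N matrix, so its footprint is roughly 8*N^2 bytes across all participating nodes. A tenfold increase in Ns is a hundredfold increase in memory:)

```shell
# Rough HPL matrix footprint: 8 * Ns^2 bytes total across the cluster.
# Illustrative arithmetic only; ignores HPL's small per-process overhead.
for ns in 60000 600000; do
    bytes=$((8 * ns * ns))
    gib=$((bytes / 1073741824))          # integer GiB (2^30 bytes)
    echo "Ns=$ns: ~${gib} GiB total"
done
```

So the old Ns=60000 runs needed on the order of 27 GiB cluster-wide, while Ns=600000 needs around 2.7 TiB, which would readily drive nodes into the OOM killer if the per-node share exceeds RAM+swap.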
I've run into similar problems, and traced them to the way Linux overcommits RAM. What are your vm.overcommit_memory and vm.overcommit_ratio sysctls set to, and how much swap and RAM do the nodes have?

-- 
-- Skylar Thompson ([email protected])
-- http://www.cs.earlham.edu/~skylar/
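(A minimal sketch of how to check those values on each node, reading them straight from /proc; equivalent to `sysctl vm.overcommit_memory vm.overcommit_ratio`:)

```shell
# vm.overcommit_memory: 0 = heuristic overcommit (the usual default),
#                       1 = always allow, 2 = strict commit accounting.
cat /proc/sys/vm/overcommit_memory
# vm.overcommit_ratio: percentage of RAM counted toward the commit
# limit when overcommit_memory is 2 (swap is counted in full).
cat /proc/sys/vm/overcommit_ratio
# RAM and swap totals as the kernel reports them.
grep -E '^(MemTotal|SwapTotal)' /proc/meminfo
```

With the default heuristic mode (0), a job that allocates far more than RAM+swap can succeed at malloc time and only fail later, when pages are actually touched, via the OOM killer, which matches the "nodes die mid-run" symptom above.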
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
