Skylar Thompson wrote:
> Prentice Bisbal wrote:
>> I've got a new problem with my cluster. Some of this problem may be with
>> my queuing system (SGE), but I figured I'd post here first.
>>
>> I've been using hpl to test my new cluster. I generally run a small
>> problem size (Ns=60000) so the job only runs 15-20 minutes. Last night, I
>> upped the problem size by a factor of 10 to Ns=600000. Shortly after
>> submitting the job, half the nodes were shown as down in Ganglia.
>>
>> I killed the job with qdel, and the majority of the nodes came back, but
>> about 1/3 did not. When I came in this morning, there were kernel
>> panic/OOM type messages on the consoles of the systems that never came
>> back.
>>
>> I used to run hpl jobs much bigger than this on my cluster w/o a
>> problem. There's nothing I actively changed, but there might have been
>> some updates to the OS (kernel, libs, etc) since the last time I ran a
>> job this big. Any ideas where I should begin looking?
>
> I've run into similar problems, and traced it to the way Linux
> overcommits RAM. What are your vm.overcommit_memory and
> vm.overcommit_ratio sysctls set to, and how much swap and RAM do the
> nodes have?
>
I found the problem - it was me. I never ran HPL problems with Ns=600k. The
largest job I ran was ~320k. I figured this out after checking my notes.
Sorry for the trouble.

However, I did want to configure my systems so that they handle requests
for more memory more gracefully, so I added this to my sysctl.conf file
(thanks for the reminder, Skylar!):

vm.overcommit_memory=2
vm.overcommit_ratio=100

I am actually using this on many of my other computational servers to
prevent OOM crashes, but forgot to add it to my cluster nodes.

Thanks to everyone for the replies.

--
Prentice
_______________________________________________
Beowulf mailing list
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
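
For anyone sizing similar runs, here is a rough sketch of the arithmetic behind the two points above. It assumes HPL's memory use is dominated by the Ns x Ns matrix of 8-byte doubles, and that with vm.overcommit_memory=2 the kernel's commit limit is roughly swap plus RAM scaled by vm.overcommit_ratio (huge pages ignored). The function names and the 16 GiB / 4 GiB node are hypothetical, chosen only for illustration; they are not taken from the cluster discussed in this thread.

```python
def hpl_matrix_bytes(n: int) -> int:
    """Bytes needed for an n x n matrix of 8-byte doubles.

    Assumption: this matrix dominates HPL's memory use; workspace and
    MPI buffers are ignored. The total is spread across all nodes.
    """
    return 8 * n * n


def commit_limit_bytes(ram_bytes: int, swap_bytes: int, overcommit_ratio: int) -> int:
    """Approximate per-node commit limit under vm.overcommit_memory=2.

    Assumption: limit = swap + RAM * overcommit_ratio / 100, with huge
    pages left out of the calculation.
    """
    return swap_bytes + ram_bytes * overcommit_ratio // 100


if __name__ == "__main__":
    # Problem sizes mentioned in the thread.
    for n in (60_000, 320_000, 600_000):
        gb = hpl_matrix_bytes(n) / 1e9
        print(f"Ns={n}: about {gb:.1f} GB for the matrix, cluster-wide")

    # Hypothetical node: 16 GiB RAM, 4 GiB swap, vm.overcommit_ratio=100.
    gib = 1024 ** 3
    limit = commit_limit_bytes(16 * gib, 4 * gib, 100)
    print(f"Commit limit per node: {limit / gib:.1f} GiB")
```

Because the matrix grows with the square of Ns, going from 60000 to 600000 multiplies the memory needed by 100, not 10: roughly 28.8 GB becomes about 2.88 TB, compared with about 819 GB at Ns=320000. With strict accounting enabled, an allocation past the commit limit should fail at request time rather than leave the node to the OOM killer later.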
