It's a High Performance Computing (HPC) setup. To elaborate, we have a ThinAnywhere setup that users log in to and use to submit jobs to the compute nodes. The management node then pushes those jobs to the available compute nodes.
A cron job is not an option because we cannot just kill, restart, or power-cycle jobs or servers on the compute nodes without informing the job owner. The suggestion to pull out the problematic nodes and test them in a separate environment has already been done. We attempted to recreate the load that causes the servers to hang, but the problem does not show up the way it did when the low-end servers were in the cluster.

What I want is an opinion on whether it is safe to say that an upgrade is needed for those low-end compute nodes. This is actually a matter of how to defend my case to the boss :)

On 10/14/07, Ariz Jacinto <[EMAIL PROTECTED]> wrote:
>
> can you be more specific about your setup? is it an HPC or HTC?
> can you also elaborate on your problem? does the job stay
> idle on the low-end node? the way you deal with the problem
> is the typical way of responding to such, but it can be done automatically
> via the job scheduler. and since you've already identified those
> problematic nodes, you might want to pull them out of the
> cluster, place them in a sandbox and then troubleshoot them
> further.
>
> On 10/13/07, Michael Calizo <[EMAIL PROTECTED]> wrote:
> >
> > Hi Guys,
> >
> > A newbie here needs an expert opinion regarding Linux HPC.
> >
> > In my current company we have a Linux (Red Hat) cluster implementation,
> > say 100 nodes per cluster.
> > I noticed that on the problematic cluster, some nodes are low-end servers
> > with, say, 2GB of memory, while the other nodes have 4GB of memory.
> > These past few weeks I noticed that user problems keep growing and,
> > based on my investigation, the leftover jobs are always on the compute
> > nodes that are "low end".
> > We manage to stop/kill/restart the jobs, but I know that this is only a
> > temporary solution and I want a permanent one.
> >
> > 1. I suspect that this might be a hardware-related problem, but I
> > am not 100% sure. I want to get opinions/suggestions first from HPC gurus
> > before I approach management and make my case that a
> > hardware upgrade is needed.
> >
> > 2. Or can this problem be attributed to cluster misconfiguration?
> >
> > Thanks in advance.
> >
> > --
> > Mike Calizo
> > Registered Linux User # 365113
> >
> _________________________________________________
> Philippine Linux Users' Group (PLUG) Mailing List
> plug@lists.linux.org.ph (#PLUG @ irc.free.net.ph)
> Read the Guidelines: http://linux.org.ph/lists
> Searchable Archives: http://archives.free.net.ph
>

--
Mike Calizo
Registered Linux User # 365113
_________________________________________________
Even the longest journey has to start with a small first-step