It's a High Performance Computing (HPC) cluster. To elaborate on the setup:
users connect through ThinAnywhere and submit jobs to the compute nodes.
The management node then pushes those jobs to the available compute
nodes.

A cron job is not an option because we cannot just kill, restart, or
power-cycle jobs or servers on the compute nodes without informing the job
owner. We have already followed the suggestion to pull the problematic
nodes out and test them in a separate environment. When we tried to
recreate the load that causes the servers to hang, the hang did not show
up the way it did when the low-end servers were in the cluster.

What I want is an opinion on whether it is safe to say that an upgrade is
needed for those low-end compute nodes. This is really a matter of how to
defend my case to the boss :)
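
One way to back up such a case is to compare how often jobs hang on the
2GB nodes versus the 4GB nodes. The following is only a minimal sketch
under stated assumptions: the `jobs` list and the node names are
hypothetical illustration data, not measurements from this cluster, and in
practice the records would have to come from whatever accounting logs the
job scheduler keeps.

```python
from collections import defaultdict

# Hypothetical records: (node name, node memory in GB, did the job hang?).
# Illustrative values only, NOT measurements from the cluster.
jobs = [
    ("node01", 2, True),
    ("node01", 2, False),
    ("node02", 2, True),
    ("node03", 4, False),
    ("node04", 4, False),
    ("node04", 4, True),
    ("node05", 4, False),
]


def hang_rate_by_memory(records):
    """Return {memory_gb: fraction of jobs that hung} per memory class."""
    total = defaultdict(int)
    hung = defaultdict(int)
    for _node, mem_gb, was_hung in records:
        total[mem_gb] += 1
        if was_hung:
            hung[mem_gb] += 1
    return {mem: hung[mem] / total[mem] for mem in total}


rates = hang_rate_by_memory(jobs)
for mem_gb in sorted(rates):
    print(f"{mem_gb} GB nodes: {rates[mem_gb]:.0%} of jobs hung")
```

A clearly higher hang rate on the 2GB nodes would support the upgrade
request, although it would not by itself rule out a misconfiguration that
only affects those nodes.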


On 10/14/07, Ariz Jacinto <[EMAIL PROTECTED]> wrote:
>
> Can you be more specific about your setup? Is it HPC or HTC?
> Can you also elaborate on your problem? Do the jobs stay
> idle on the low-end nodes? The way you deal with the problem
> is the typical response, but it can be done automatically
> via the job scheduler. And since you've already identified the
> problematic nodes, you might want to pull them out of the
> cluster, place them in a sandbox and then troubleshoot them
> further.
>
>
>
>
> On 10/13/07, Michael Calizo <[EMAIL PROTECTED]> wrote:
> >
> > Hi Guys,
> >
> > A newbie here needs an expert opinion regarding Linux HPC.
> >
> > In my current company we have a Linux (Red Hat) cluster implementation,
> > say 100 nodes per cluster. I noticed that on the problematic cluster,
> > some nodes are low-end servers with, say, 2GB of memory, while the other
> > nodes have 4GB. Over the past few weeks the number of user problems has
> > kept growing and, based on my investigation, the leftover jobs are
> > always on the "low end" compute nodes. We managed to stop/kill/restart
> > the jobs, but I know this is only a temporary solution and I want a
> > permanent one.
> >
> > 1. I suspect that this might be a hardware-related problem, but I am
> > not 100% sure. I want to get opinions/suggestions from HPC gurus first,
> > before I approach management and make my case that a hardware upgrade
> > is needed.
> >
> > 2. Or can this problem be attributed to a cluster misconfiguration?
> >
> > Thanks in advance.
> >
> > --
> > Mike Calizo
> > Registered Linux User # 365113
> >
>
> _________________________________________________
> Philippine Linux Users' Group (PLUG) Mailing List
> plug@lists.linux.org.ph (#PLUG @ irc.free.net.ph)
> Read the Guidelines: http://linux.org.ph/lists
> Searchable Archives: http://archives.free.net.ph
>



-- 
Mike Calizo
Registered Linux User # 365113

_________________________________________________
Even the longest journey has to start with a small first-step
