"A. Podstawka" <adam.podsta...@dsmz.de> writes:

> Hello,
>
> we have an aggregated cluster (ScaleMP vSMP system) with 192 cores and
> 2 TB of RAM, and we have some trouble with a simple:

I'm afraid I don't know ScaleMP, other than roughly what it does.

>  for i in `seq 1 300`; do qsub simple.sh; done
>
> mostly it hangs after around 120 submitted jobs

What exactly hangs?  The qmaster?

> and the
> sge_shepherds are all using 100% CPU load and simple.sh isn't
> executed. How could I solve this?

Do you mean they only do that when you have a lot of them running?
William's advice is likely to be useful.  (You can attach strace to a
running process.)  Are there any useful messages in syslog or the SGE
messages file with the log level set to info?  Do the jobs actually
start?  If so, what's in the trace file in the job directory under
"active_jobs" in the spool area?

If you want to get really serious, it's possible to run the shepherd
under gdb by using a suitable shepherd_cmd in the configuration and
starting the execd by hand with SGE_ND=1 in the environment.
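
Untested, but just to show the shape of it, the wrapper script named by
shepherd_cmd could look something like

  #!/bin/sh
  # run the real shepherd under gdb and dump a backtrace when it exits
  exec gdb -batch -ex run -ex bt \
    --args "$SGE_ROOT/bin/$($SGE_ROOT/util/arch)/sge_shepherd" "$@"

and then, with the normal execd stopped, on the exec host:

  SGE_ND=1 $SGE_ROOT/bin/$($SGE_ROOT/util/arch)/sge_execd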

> the second problem we have, where I would need help:
> we need to use "numactl --physcpubind" for the shell scripts submitted
> to qsub; they need to run bound to a specific core (due to the huge
> size of this aggregated machine), but I don't see how I can put the
> numactl in front of the submitted script for qsub, so that users don't
> need to bother with it, or with which core is free, etc. Any
> suggestions?  Since qsub mostly needs scripts to be submitted.

That's definitely not the right way to do it.  You want to get the SGE
core binding working.  The hwloc library that SGE uses is supposed to
work on ScaleMP, but if there's a problem with it (which version are you
using?) the developers will be interested and probably fix it reasonably
quickly.  If you have the hwloc utilities, do they work, e.g. can you do
something like

  hwloc-bind core:1-2 hwloc-ps

and get sensible output?
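
For the binding itself, assuming a core-binding-aware version (6.2u5 or
later, or Son of Grid Engine), a submission along the lines of

  qsub -binding linear:1 simple.sh

should pin each job to a free core without the users touching numactl at
all.  lstopo (or hwloc-ls) from the hwloc utilities will also show
whether hwloc sees the vSMP topology sensibly, and lstopo --version
tells you which hwloc release you have.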

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/