Re: [gridengine users] sge_shepherd 100% cpuload problem and running all jobs with numactl ?

William Hay Fri, 24 Aug 2012 03:39:20 -0700

On 24 August 2012 11:16, A. Podstawka <[email protected]> wrote:
> Hello,
>
> we have an aggregated cluster (ScaleMP vSMP System) with 192 Cores and
> 2TB of RAM and have some trouble with a simple:
>
>   for i in `seq 1 300`; do qsub simple.sh; done
>
> It mostly hangs after roughly 120 submitted jobs: the sge_shepherd
> processes all use 100% CPU and simple.sh isn't executed. How could I
> solve this?
I'd start with an strace of the shepherd to see what it was up to...
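Something along these lines would print an strace command for each
running shepherd. It is only a rough sketch: the trace_cmd helper and the
output path under /tmp are placeholders of mine, and it assumes pgrep and
strace are installed on the execution host.

```shell
#!/bin/sh
# Sketch: print one strace invocation per running sge_shepherd process.
# trace_cmd and the /tmp output path are illustrative, not part of Grid Engine.

# Build the strace command line for a single PID.
#   -f  follow child processes, -tt  timestamp each call,
#   -p  attach to a running process, -o  write the trace to a file
trace_cmd() {
    echo "strace -f -tt -p $1 -o /tmp/shepherd.$1.strace"
}

# List the commands for every shepherd; run the ones you care about as root.
for pid in $(pgrep sge_shepherd); do
    trace_cmd "$pid"
done
```

A shepherd that spins at 100% CPU usually shows the same system call
repeating in the trace, which narrows down where it is stuck.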
>
> the second problem we have, where i would need help:
> we need to use "numactl --physcpubind" for the shellscripts submitted to
You could use a starter_method. That said, SoGE is a recent version of
Grid Engine, and I think it can bind cores itself, so you may not need to.
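A starter_method wrapper could look roughly like the sketch below. It
rests on assumptions I have not tested on vSMP: that jobs are submitted
with "-binding env ..." so that Grid Engine exports the chosen processor
numbers in SGE_BINDING as a space-separated list, and that the job command
arrives as the wrapper's arguments. The function names cpu_list and
start_job are my own.

```shell
#!/bin/sh
# Sketch of a starter_method wrapper that puts numactl in front of each job.
# Assumption: SGE_BINDING holds a space-separated list of OS processor
# numbers (as exported when the job is submitted with "-binding env ...").

# Turn "4 5 6" into "4,5,6", the form numactl --physcpubind expects.
cpu_list() {
    echo "$1" | tr -s ' ' ',' | sed -e 's/^,//' -e 's/,$//'
}

# Run the job command under numactl when a binding is available,
# otherwise start it unchanged.
start_job() {
    cpus=$(cpu_list "${SGE_BINDING:-}")
    if [ -n "$cpus" ]; then
        exec numactl --physcpubind="$cpus" "$@"
    else
        exec "$@"
    fi
}

# The job script and its arguments are passed to the starter method.
# With no arguments the script does nothing.
if [ "$#" -gt 0 ]; then
    start_job "$@"
fi
```

The script would be set as the queue's starter_method (for example via
qconf -mq on the queue), so users keep submitting plain scripts with qsub
and never see numactl.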


> qsub; they need to run bound to a specific core (due to the huge size of
> this aggregated machine), but I don't see how I can put numactl
> in front of the submitted script for qsub, so that users don't need to
> bother with it or with which cores are unused, etc. Any suggestions? Since
> qsub mostly needs scripts to be submitted.
>
>
> the facts:
> we have SoGE 8.1.1
> we use centos 6.2 on the system
> all CPUs are Xeons with 6 Cores (HT disabled) (16 Nodes, 32 Sockets, 192
> Cores) and 128GB RAM
>
> thanks
> Adam
>
> --
> Adam Podstawka
> Leibniz-Institut DSMZ-Deutsche Sammlung von Mikro-
> organismen und Zellkulturen GmbH
> Inhoffenstraße 7 B
> 38124 Braunschweig
> Germany
> http://www.dsmz.de
>
> Director: Prof. Dr. Jörg Overmann
> Local court: Braunschweig HRB 2570
> Chairman of the supervisory board: MR Dr. Axel Kollatschny
>
> DSMZ - A member of the Leibniz Association (WGL)
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>
>

