Hi William,

On 24.08.2012 12:38, William Hay wrote:
> On 24 August 2012 11:16, A. Podstawka <adam.podsta...@dsmz.de> wrote:
>> Hello,
>>
>> we have an aggregated cluster (ScaleMP vSMP System) with 192 cores and
>> 2 TB of RAM and have some trouble with a simple:
>>
>>    for i in `seq 1 300`; do qsub simple.sh; done
>>
>> Usually it hangs after roughly 120 submitted jobs: the sge_shepherd
>> processes all run at 100% CPU load and simple.sh is never executed.
>> How could I solve this?
> I'd start with an strace of the shepherd to see what it was up to...
OK, will try. Nice tip, I hadn't thought of strace.
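A minimal sketch of that strace step, assuming it is run as root on the execution host while the shepherds are spinning. The function name trace_shepherds, the 10-second default and the /tmp output directory are hypothetical choices. It attaches strace to every sge_shepherd, also follows processes they fork after attaching (-f), timestamps each call (-tt), and stops after the given number of seconds via timeout:

```shell
#!/bin/sh
# trace_shepherds (hypothetical helper): attach strace to every running
# sge_shepherd for a limited time and write one log file per process.
# Usage: trace_shepherds [output_dir] [seconds]
trace_shepherds() {
  outdir="${1:-/tmp}"
  secs="${2:-10}"
  # pgrep -x matches the exact process name and fails when nothing matches.
  pids=$(pgrep -x sge_shepherd) || {
    echo "no sge_shepherd processes found"
    return 0
  }
  for pid in $pids; do
    # -f: also trace children forked after attaching; -tt: microsecond timestamps.
    # timeout sends SIGTERM after $secs seconds, which makes strace detach.
    timeout "$secs" strace -f -tt -p "$pid" -o "$outdir/shepherd.$pid.strace" &
  done
  wait
  echo "wrote traces for: $(echo $pids) in $outdir"
}
```

For example, trace_shepherds /var/tmp 5. If a log shows the same few calls repeating, or stays empty while the process burns CPU (which would point to a loop in user space), that is a first hint at where the shepherd is stuck.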

>> The second problem we have, where I would need help:
>> we need to use "numactl --physcpubind" for the shell scripts submitted to
> You could use a starter_method. As a recent version of Grid Engine
OK, will look at it.
Another idea of mine was a "wrapper" for qsub: the wrapper would call the original qsub with an extra script that runs the job under numactl. Just an idea, though; I prefer native functions, so thanks (:
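A minimal sketch of the starter_method route. Several assumptions are not confirmed in this thread: that the job is submitted with qsub -binding env linear:N, so Grid Engine exports the chosen cores in SGE_BINDING as a space-separated list; that the job script and its arguments reach the starter as "$@"; and that SGE_STARTER_SHELL_PATH names the shell to run it with (see queue_conf(5)). The file name numa_starter.sh and the helper to_cpulist are hypothetical. Check all of this against the installed version's man pages before relying on it.

```shell
#!/bin/sh
# numa_starter.sh (hypothetical starter_method): pin the job with numactl.
# Assumes "qsub -binding env linear:N" filled SGE_BINDING with a space-separated
# core list and that the job script plus its arguments arrive as "$@".

# Turn "0 1 2 3" into "0,1,2,3", the list form numactl --physcpubind accepts.
to_cpulist() {
  (
    set -- $1   # word-split on whitespace (intentionally unquoted)
    IFS=,
    echo "$*"   # "$*" joins the words with the first character of IFS
  )
}

# Shell configured for the queue or requested with -S; /bin/sh if unset.
# (Assumes a posix_compliant-style start; with unix_behavior the script's
# own #! line may be meant to apply instead.)
job_shell="${SGE_STARTER_SHELL_PATH:-/bin/sh}"

# Nothing to start unless a job script was passed in.
if [ "$#" -gt 0 ]; then
  if [ -n "${SGE_BINDING:-}" ] && command -v numactl >/dev/null 2>&1; then
    exec numactl --physcpubind="$(to_cpulist "$SGE_BINDING")" "$job_shell" "$@"
  fi
  # No binding information (or no numactl): start the job unpinned.
  exec "$job_shell" "$@"
fi
```

The script would be registered per queue with qconf -mq <queue>, setting starter_method to its absolute path, and jobs submitted with, e.g., qsub -binding env linear:4 simple.sh. The idea behind this split, compared with a plain qsub wrapper, is that Grid Engine picks the cores (and presumably keeps track of which ones are taken) while numactl only applies them.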

> though I think
> SoGE has the ability to bind cores itself, so you may not need to.
I have tried it, but the binding does not seem to work. Because of the sge_shepherd problem I can't say this with 100% certainty, though.
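One way to check whether SoGE's own binding takes effect, once the shepherd problem is out of the way, is a small test job that reports its CPU affinity. This is a sketch assuming a Linux execution host (it reads /proc/self/status) and a version that supports the -binding submit option; the script name check_binding.sh and the linear:1 request are illustrative choices.

```shell
#!/bin/sh
# check_binding.sh (hypothetical test job): print the host and the CPUs
# this job is allowed to run on.
#$ -binding linear:1

show_affinity() {
  echo "host: $(uname -n)"
  # grep inherits the job's affinity; the kernel reports it as a CPU list
  # or range (for example "0-3") on the Cpus_allowed_list line.
  grep '^Cpus_allowed_list:' /proc/self/status
}

show_affinity
```

Submitted with qsub check_binding.sh, the job's output should list a single core (or its hyper-thread siblings) if binding works, and the host's full CPU range if it does not.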


thanks
Adam
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
