Hello,
thank you for the clarification. I must have misunderstood you.
Now I have done it. In the example I am sending now, the master node was
exec-node01 (it varied from attempt to attempt). The ps output from that
node is in the master-node file. The qstat file is the output of
qstat -g t -u '*'
That looks normal to me.
Next I created a simple C file with an endless loop:
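#include <stdio.h>

int main(void)
{
    int x;

    /* Note: "x = 10" is an assignment, not a comparison. It always
       evaluates to 10 (true), so this loop never terminates. */
    for (x = 0; x = 10; x = x + 1)
    {
        puts("Hello");
    }
    return 0;
}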
and compiled it:
mpicc mpihello.c -o mpihello
and started qsub:
qsub -pe orte 300 -j yes -cwd -S /bin/bash <<< "mpiexec -n 300 mpihello"
The output looks the same as for the sleep command.
But this time I counted the SLAVE entries:
qstat -g t -u '*' | grep -ic slave
This gives 300, as I expected.
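
The total of 300 can also be broken down per host, which makes it easier to compare against the process counts on each node. The following is only a sketch: it assumes that every granted slot shows up as a "queue@host SLAVE" pair, as in the attached qstat file, and the function name slots_per_host and the shortened sample are made up for illustration.

```shell
#!/bin/sh
# Count SLAVE entries per host in `qstat -g t` style output read from stdin.
# Assumption: each granted slot appears as "<queue>@<host> SLAVE".
slots_per_host() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /@/ && $(i + 1) == "SLAVE") {
                split($i, q, "@")   # q[1] = queue name, q[2] = host name
                n[q[2]]++
            }
    }
    END { for (h in n) print h, n[h] }' | sort
}

# Hypothetical, shortened sample in the layout of the attached qstat file.
sample='273 0.60500 STDIN ulrich r 08/15/2016 16:01:17
all.q@exec-node01 MASTER
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node02 SLAVE'

per_host=$(printf '%s\n' "$sample" | slots_per_host)
printf '%s\n' "$per_host"
```

On the cluster the function would be fed with qstat -g t -u '*' instead of the sample. If several parallel jobs are running at the same time, their SLAVE entries would be mixed together, so the sketch is only meaningful with a single job in the queue.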
On the execution nodes I ran:
ps -ef | grep mpihello | grep -v grep | grep -vc mpiexec
(that is, I counted the 'mpihello' processes)
This is the result:
exec-node01: 43
exec-node02: 82
exec-node03: 83
exec-node04: 82
exec-node05: 82
exec-node06: 80
exec-node07: 64
exec-node08: 64
That adds up to 580 processes.
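
The sum can be double-checked, and compared with the 300 requested slots, with a small script. This is a minimal sketch: the per-node numbers are the ones listed above, while the function name sum_counts and the "host: count" input format are only chosen for illustration.

```shell
#!/bin/sh
# Add up the second field of "host: count" lines read from stdin.
sum_counts() {
    awk -F': *' '{ total += $2 } END { print total + 0 }'
}

# Per-node mpihello process counts as reported above.
counts='exec-node01: 43
exec-node02: 82
exec-node03: 83
exec-node04: 82
exec-node05: 82
exec-node06: 80
exec-node07: 64
exec-node08: 64'

granted=300   # slots requested with "-pe orte 300"
total=$(printf '%s\n' "$counts" | sum_counts)
echo "running=$total granted=$granted excess=$((total - granted))"
# prints: running=580 granted=300 excess=280
```

So 280 more mpihello processes are running than slots were granted.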
When I add up the free slots (from 'qhost -q') I also get 300, as
expected.
Where do the extra processes on the nodes come from?
The difference is reproducible.
The libgomp.so.1.0.0 library is installed, but apart from that nothing
related to OpenMP.
With kind regards, ulrich
On 08/15/2016 02:30 PM, Ulrich Hiller wrote:
> Hello,
>
>> The other issue seems to be, that in fact your job is using only one
>> machine, which means that it is essentially ignoring any granted slot
>> allocation. While the job is running, can you please execute on the
>> master node of the parallel job:
>>
>> $ ps -e f
>>
>> (f w/o -) and post the relevant lines belonging to either sge_execd or
>> just running as kids of the init process, in case they jumped out of the
>> process tree. Maybe a good start would be to execute something like
>> `mpiexec sleep 300` in the jobscript.
>>
>
> i invoked
> qsub -pe orte 160 -j yes -cwd -S /bin/bash <<< "mpiexec -n 160 sleep 300"
>
> the only line ('ps -e f') on the master node was:
> 55722 ? Sl 3:42 /opt/sge/bin/lx-amd64/sge_qmaster
>
> No other sge lines, no child processes of it, and no other init
> processes leading to sge, while at the same time the sleep processes were
> running on the nodes (checked with the ps command on the nodes).
>
> The qstat command gave :
> 264 0.60500 STDIN ulrich r 08/15/2016 11:33:02
> all.q@exec-node01 MASTER
>
> all.q@exec-node01 SLAVE
>
> all.q@exec-node01 SLAVE
>
> all.q@exec-node01 SLAVE
> [ ...]
>
> 264 0.60500 STDIN ulrich r 08/15/2016 11:33:02
> all.q@exec-node03 SLAVE
>
> all.q@exec-node03 SLAVE
>
> all.q@exec-node03 SLAVE
> [ ... ]
> 264 0.60500 STDIN ulrich r 08/15/2016 11:33:02
> all.q@exec-node05 SLAVE
>
> all.q@exec-node05 SLAVE
> [ ...]
>
>
> Because only the master daemon was running on the master node, and
> you were talking about child processes: is this normal behaviour for my
> cluster, or is there something wrong?
>
> Kind regards, ulrich
>
>
>
> On 08/12/2016 07:11 PM, Reuti wrote:
>> Hi,
>>
>>> Am 12.08.2016 um 18:48 schrieb Ulrich Hiller <[email protected]>:
>>>
>>> Hello,
>>>
>>> i have a strange effect, where i am not sure whether it is "only" a
>>> misconfiguration or a bug.
>>>
>>> First: I run son of gridengine 8.1.9-1.el6.x86_64 (i installed the rhel
>>> rpm on an opensuse 13.1 machine. This should not matter in this case,
>>> and it is reported to be able to run on opensuse).
>>>
>>> mpirun and mpiexec are from openmpi-1.10.3 (no other mpi was installed,
>>> neither on master, nor on slaves). The installation was made with:
>>> ./configure --prefix=`pwd`/build --disable-dlopen --disable-mca-dso
>>> --with-orte --with-sge --with-x --enable-mpi-thread-multiple
>>> --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default
>>> --enable-orte-static-ports --enable-mpi-cxx --enable-mpi-cxx-seek
>>> --enable-oshmem --enable-java --enable-mpi-java
>>> make
>>> make install
>>>
>>> I attached the outputs of 'qconf -ap all.q' , 'qconf -sconf' and 'qconf
>>> -sp orte' as textfiles.
>>>
>>> Now my problem:
>>> I asked for 20 cores, and if I run qstat -u '*' it shows that this job
>>> is running on slave07 using 20 cores, but that is not true! If I run qstat
>>> -f -u '*' I see that this job is only using 3 cores on slave07 and
>>> there are 17 cores on other nodes allocated to this job which are in fact
>>> unused!
>>
>> qstat will list only the master node of the parallel job and the number of
>> overall slots. The granted allocation you can check with:
>>
>> $ qstat -g t -u '*'
>>
>> The other issue seems to be, that in fact your job is using only one
>> machine, which means that it is essentially ignoring any granted slot
>> allocation. While the job is running, can you please execute on the master
>> node of the parallel job:
>>
>> $ ps -e f
>>
>> (f w/o -) and post the relevant lines belonging to either sge_execd or just
>> running as kids of the init process, in case they jumped out of the process
>> tree. Maybe a good start would be to execute something like `mpiexec sleep
>> 300` in the jobscript.
>>
>> Next step could be a `mpihello.c` where you put an almost endless loop
>> inside and switch off all optimizations during compilations to check whether
>> these slave processes are distributed in the correct way.
>>
>> Note that some applications will check the number of cores they are running
>> on and start by OpenMP (not Open MPI) as many threads as cores are found.
>> Could this be the case for your application too?
>>
>> -- Reuti
>>
>>
>>> Or other example:
>>> My job took say 6 CPUs on slave07 and 14 on slave06, but nothing was
>>> running on 06, so resources are wasted on 06 and an overload on
>>> 07 becomes highly possible (the numbers are made up).
>>> If I ran many independent 1-CPU jobs that would not be an issue, but
>>> imagine I now request 60 CPUs on slave07; that would seriously overload
>>> the node in many cases.
>>>
>>> Or other example:
>>> if i ask for say 50 CPUs, the job will start on one node, e.g,
>>> slave01, but only reserving say 15 CPUs out of 64 and reserve the rest
>>> on many other nodes (obviously wasting space doing nothing).
>>> This has the bad consequence of allocating many more CPUs than available
>>> when many jobs are running, imagine you have 10 jobs like this one...
>>> some nodes will run maybe 3 even if they only have 24 CPUs...
>>>
>>> I hope that i have made clear what the issue is.
>>>
>>> I also see that the `qstat` and `qstat -f` are in disagreement. The
>>> latter is correct, i checked the processes running on the nodes.
>>>
>>>
>>> Did somebody already encounter such a problem? Does somebody have an
>>> idea where to look into or what to test?
>>>
>>> With kind regards, ulrich
>>>
>>>
>>>
>>> <qhost.txt><qconf-sconf.txt><qconf-mp-orte.txt><qconf-all.q>
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>>
job-ID prior name user state submit/start at queue
master ja-task-ID
------------------------------------------------------------------------------------------------------------------
273 0.60500 STDIN ulrich r 08/15/2016 16:01:17
all.q@exec-node01 MASTER
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node01 SLAVE
all.q@exec-node02 SLAVE 1442
273 0.60500 STDIN ulrich r 08/15/2016 16:01:17
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
all.q@exec-node02 SLAVE
273 0.60500 STDIN ulrich r 08/15/2016 16:01:17
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
all.q@exec-node03 SLAVE
273 0.60500 STDIN ulrich r 08/15/2016 16:01:17
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
all.q@exec-node04 SLAVE
273 0.60500 STDIN ulrich r 08/15/2016 16:01:17
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
all.q@exec-node05 SLAVE
273 0.60500 STDIN ulrich r 08/15/2016 16:01:17
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
all.q@exec-node06 SLAVE
273 0.60500 STDIN ulrich r 08/15/2016 16:01:17
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
all.q@exec-node07 SLAVE
273 0.60500 STDIN ulrich r 08/15/2016 16:01:17
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
all.q@exec-node08 SLAVE
sgeadmin 49137 1 0 Aug09 ? Sl 23:46
/opt/sge/bin/lx-amd64/sge_execd
sgeadmin 54455 49137 1 16:01 ? S 0:00 \_ sge_shepherd-273 -bg
ulrich 54456 54455 0 16:01 ? Ss 0:00 \_ -bash
/opt/sge/default/spool/exec-node01/job_scripts/273
ulrich 54463 54456 1 16:01 ? Sl 0:00 \_ mpiexec -n 160
sleep 300
ulrich 54465 54463 0 16:01 ? Sl 0:00 \_
/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V exec-node02
PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/lib64:$LD_LIBRARY_PATH
; export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/lib64:$DYLD_LIBRARY_PATH
; export DYLD_LIBRARY_PATH ;
/usr/local/misc/openmpi/openmpi-1.10.3/build/bin/orted --hnp-topo-sig
8N:4S:8L3:32L2:64L1:64C:64H:x86_64 -mca ess "env" -mca orte_ess_jobid
"4060020736" -mca orte_ess_vpid 1 -mca orte_ess_num_procs "8" -mca orte_hnp_uri
"4060020736.0;tcp://192.168.117.1:35028" -mca plm "rsh" --tree-spawn
ulrich 54466 54463 0 16:01 ? Sl 0:00 \_
/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V exec-node03
PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/lib64:$LD_LIBRARY_PATH
; export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/lib64:$DYLD_LIBRARY_PATH
; export DYLD_LIBRARY_PATH ;
/usr/local/misc/openmpi/openmpi-1.10.3/build/bin/orted --hnp-topo-sig
8N:4S:8L3:32L2:64L1:64C:64H:x86_64 -mca ess "env" -mca orte_ess_jobid
"4060020736" -mca orte_ess_vpid 2 -mca orte_ess_num_procs "8" -mca orte_hnp_uri
"4060020736.0;tcp://192.168.117.1:35028" -mca plm "rsh" --tree-spawn
ulrich 54467 54463 0 16:01 ? Sl 0:00 \_
/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V exec-node05
PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/lib64:$LD_LIBRARY_PATH
; export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/lib64:$DYLD_LIBRARY_PATH
; export DYLD_LIBRARY_PATH ;
/usr/local/misc/openmpi/openmpi-1.10.3/build/bin/orted --hnp-topo-sig
8N:4S:8L3:32L2:64L1:64C:64H:x86_64 -mca ess "env" -mca orte_ess_jobid
"4060020736" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "8" -mca orte_hnp_uri
"4060020736.0;tcp://192.168.117.1:35028" -mca plm "rsh" --tree-spawn
ulrich 54468 54463 0 16:01 ? Sl 0:00 \_
/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V exec-node04
PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/lib64:$LD_LIBRARY_PATH
; export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/lib64:$DYLD_LIBRARY_PATH
; export DYLD_LIBRARY_PATH ;
/usr/local/misc/openmpi/openmpi-1.10.3/build/bin/orted --hnp-topo-sig
8N:4S:8L3:32L2:64L1:64C:64H:x86_64 -mca ess "env" -mca orte_ess_jobid
"4060020736" -mca orte_ess_vpid 4 -mca orte_ess_num_procs "8" -mca orte_hnp_uri
"4060020736.0;tcp://192.168.117.1:35028" -mca plm "rsh" --tree-spawn
ulrich 54469 54463 0 16:01 ? Sl 0:00 \_
/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V exec-node06
PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/lib64:$LD_LIBRARY_PATH
; export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/lib64:$DYLD_LIBRARY_PATH
; export DYLD_LIBRARY_PATH ;
/usr/local/misc/openmpi/openmpi-1.10.3/build/bin/orted --hnp-topo-sig
8N:4S:8L3:32L2:64L1:64C:64H:x86_64 -mca ess "env" -mca orte_ess_jobid
"4060020736" -mca orte_ess_vpid 5 -mca orte_ess_num_procs "8" -mca orte_hnp_uri
"4060020736.0;tcp://192.168.117.1:35028" -mca plm "rsh" --tree-spawn
ulrich 54470 54463 0 16:01 ? Sl 0:00 \_
/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V exec-node08
PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/lib64:$LD_LIBRARY_PATH
; export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/lib64:$DYLD_LIBRARY_PATH
; export DYLD_LIBRARY_PATH ;
/usr/local/misc/openmpi/openmpi-1.10.3/build/bin/orted --hnp-topo-sig
8N:4S:8L3:32L2:64L1:64C:64H:x86_64 -mca ess "env" -mca orte_ess_jobid
"4060020736" -mca orte_ess_vpid 6 -mca orte_ess_num_procs "8" -mca orte_hnp_uri
"4060020736.0;tcp://192.168.117.1:35028" -mca plm "rsh" --tree-spawn
ulrich 54471 54463 0 16:01 ? Sl 0:00 \_
/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V exec-node07
PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/lib64:$LD_LIBRARY_PATH
; export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/usr/local/misc/openmpi/openmpi-1.10.3/build/lib64:$DYLD_LIBRARY_PATH
; export DYLD_LIBRARY_PATH ;
/usr/local/misc/openmpi/openmpi-1.10.3/build/bin/orted --hnp-topo-sig
8N:4S:8L3:32L2:64L1:64C:64H:x86_64 -mca ess "env" -mca orte_ess_jobid
"4060020736" -mca orte_ess_vpid 7 -mca orte_ess_num_procs "8" -mca orte_hnp_uri
"4060020736.0;tcp://192.168.117.1:35028" -mca plm "rsh" --tree-spawn
ulrich 54486 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54487 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54488 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54489 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54490 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54491 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54492 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54493 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54494 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54495 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54496 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54497 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54498 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54499 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54500 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54501 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54502 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54503 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54504 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54505 54463 0 16:01 ? S 0:00 \_ sleep 300
ulrich 54506 54463 0 16:01 ? S 0:00 \_ sleep 300