Hi,

I checked his job while it was running.
Looking at it with 'ps -ef', I found that his job is using
"mpiexec.hydra" and that 'qrsh' is invoked with the '-inherit' option.
Here are the details:

p012chm  21424 21398  0 13:20 ?        00:00:00 bash
/opt/sge/default/spool/lion07/job_scripts/46651
p012chm  21431 21424  0 13:20 ?        00:00:00 /bin/bash /opt/intel/impi/
4.0.3.008/intel64/bin/mpirun -np 12
/home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
p012chm  21442 21431  0 13:20 ?        00:00:00 mpiexec.hydra -machinefile
/tmp/sge_machinefile_21431 -np 12
/home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
p012chm  21443 21442  0 13:20 ?        00:00:00 /opt/sge/bin/lx24-amd64/qrsh
-inherit lion07 /home/p012pnj/intel/impi/intel64/bin/pmi_proxy
--control-port lion07:54060 --pmi-connect lazy-cache --pmi-aggregate
--bootstrap rsh --bootstrap-exec rsh --demux poll --pgid 0 --enable-stdin 1
--proxy-id 0
root     21452 21451  0 13:20 ?        00:00:00 sshd: p012chm [priv]
p012chm  21453 21443  0 13:20 ?        00:00:00 /usr/bin/ssh -p 60725
lion07 exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter'
'/opt/sge/default/spool/lion07/active_jobs/46651.1/1.lion07'
p012chm  21457 21452  0 13:20 ?        00:00:00 sshd: p012chm@notty
p012chm  21458 21457  0 13:20 ?        00:00:00
/opt/sge/utilbin/lx24-amd64/qrsh_starter
/opt/sge/default/spool/lion07/active_jobs/46651.1/1.lion07
p012chm  21548 21458  0 13:20 ?        00:00:00
/home/p012pnj/intel/impi/intel64/bin/pmi_proxy --control-port lion07:54060
--pmi-connect lazy-cache --pmi-aggregate --bootstrap rsh --bootstrap-exec
rsh --demux poll --pgid 0 --enable-stdin 1 --proxy-id 0
p012chm  21549 21548 99 13:20 ?        00:22:04
/home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
p012chm  21550 21548 99 13:20 ?        00:22:10
/home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
p012chm  21551 21548 99 13:20 ?        00:22:10
/home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
p012chm  21552 21548 99 13:20 ?        00:22:10
/home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
p012chm  21553 21548 99 13:20 ?        00:22:10
/home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
p012chm  21554 21548 99 13:20 ?        00:22:10
/home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
p012chm  21555 21548 99 13:20 ?        00:22:10
/home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
p012chm  21556 21548 99 13:20 ?        00:22:10
/home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
p012chm  21557 21548 99 13:20 ?        00:22:10
/home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
p012chm  21558 21548 99 13:20 ?        00:22:10
/home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
p012chm  21559 21548 99 13:20 ?        00:22:10
/home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
p012chm  21560 21548 99 13:20 ?        00:22:10
/home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
smpark   21728 21638  0 13:43 pts/0    00:00:00 grep chm
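
For what it's worth, a quick way to double-check the tight integration on
the node is to look at the process tree, e.g. (assuming pstree is installed
on lion07; 21398, the parent of the job-script bash above, should be the
sge_shepherd):

  pstree -ap 21398

or, without pstree:

  ps -ef --forest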

--Sangmin


On Thu, Jun 12, 2014 at 8:04 PM, Reuti <[email protected]> wrote:

> On 12.06.2014 at 04:23, Sangmin Park wrote:
>
> > I've checked the version of Intel MPI. He uses Intel MPI version
> > 4.0.3.008.
> > Our system uses rsh to access the computing nodes; SGE does, too.
> >
> > Please let me know how to check which one is used, 'mpiexec.hydra' or
> > 'mpiexec'.
>
> Do you have both files somewhere in a "bin" directory inside the Intel
> MPI installation? You could rename "mpiexec" and create a symbolic link
> "mpiexec" pointing to "mpiexec.hydra". The old startup needs daemons
> running on the node (which are outside of SGE's control and accounting*),
> whereas "mpiexec.hydra" starts the processes as its own children and
> should hence be under SGE's control. As long as you stay on one and the
> same node, this should already work without further setup. To avoid a
> later surprise when you compute across nodes, the `rsh`/`ssh` call should
> nevertheless be caught and redirected to `qrsh -inherit ...`, as outlined
> in "$SGE_ROOT/mpi".
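>
> A minimal sketch of the rename (the install path is only an example here,
> adjust it to the actual Intel MPI location):
>
>   cd /path/to/impi/intel64/bin
>   mv mpiexec mpiexec.mpd        # keep the old MPD-based starter around
>   ln -s mpiexec.hydra mpiexec   # "mpiexec" now launches the Hydra starter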
>
> -- Reuti
>
> *) It's even possible to force the daemons to be started under SGE, but
> it's convoluted and not recommended.
>
>
> > Sangmin
> >
> >
> > On Wed, Jun 11, 2014 at 6:46 PM, Reuti <[email protected]>
> wrote:
> > Hi,
> >
> > On 11.06.2014 at 02:38, Sangmin Park wrote:
> >
> > > For the best performance, we recommend that users use 8 cores on a
> > > single node rather than spreading a job across multiple nodes.
> > > As I said before, he runs VASP compiled with Intel MPI, so he is
> > > using Intel MPI now.
> >
> > Which version of Intel MPI? Even with the latest one it's not tightly
> > integrated by default (despite the fact that MPICH3 [on which it is
> > based] is tightly integrated by default).
> >
> > Depending on the version it might be necessary to make some adjustments
> > - IIRC mainly to use `mpiexec.hydra` instead of `mpiexec` and to supply
> > a wrapper to catch the `rsh`/`ssh` call (like in the MPI demo in SGE's
> > directory).
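> >
> > A rough sketch of such a wrapper, in the spirit of the one shipped in
> > $SGE_ROOT/mpi (simplified; the shipped script also handles rsh options
> > like -l and -n):
> >
> >   #!/bin/sh
> >   # Hand the remote start over to `qrsh -inherit` so the slave
> >   # processes run under SGE's shepherd and show up in the accounting.
> >   host=$1; shift
> >   ARC=`$SGE_ROOT/util/arch`
> >   exec $SGE_ROOT/bin/$ARC/qrsh -inherit "$host" "$@"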
> >
> > -- Reuti
> >
> >
> > > --Sangmin
> > >
> > >
> > > On Tue, Jun 10, 2014 at 5:58 PM, Reuti <[email protected]>
> wrote:
> > > Hi,
> > >
> > > On 10.06.2014 at 10:21, Sangmin Park wrote:
> > >
> > > > This user always runs parallel jobs using the VASP application.
> > > > Usually he uses 8 cores per job, and he has submitted a lot of jobs
> > > > of this kind.
> > >
> > > 8 cores on a particular node or 8 slots across the cluster? What MPI
> implementation does he use?
> > >
> > > -- Reuti
> > >
> > > NB: Please keep the list posted.
> > >
> > >
> > > > Sangmin
> > > >
> > > >
> > > > On Tue, Jun 10, 2014 at 3:42 PM, Reuti <[email protected]>
> wrote:
> > > > On 10.06.2014 at 08:00, Sangmin Park wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I'm very confused about the output of the qacct command.
> > > > > I thought the CPU column is the best way to measure resource
> > > > > usage by users, based on this web page:
> > > > > https://wiki.duke.edu/display/SCSC/Checking+SGE+Usage
> > > > >
> > > > > But I have a problem.
> > > > > One of the users at my institution, actually one of our heavy
> > > > > users, uses a lot of HPC resources. To get this user's resource
> > > > > usage for billing, I ran qacct; the output below is just for May.
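> > > > >
> > > > > (A per-owner summary restricted to May can be pulled with
> > > > > something like
> > > > >   qacct -o p012chm -b 201405010000 -e 201406010000
> > > > > where -b and -e take [[CC]YY]MMDDhhmm, see qacct(1).)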
> > > > >
> > > > > OWNER      WALLCLOCK    UTIME    STIME      CPU  MEMORY     IO    IOW
> > > > > ===========================================================================
> > > > > p012chm      2980810   28.485   35.012  100.634   4.277  0.576  0.000
> > > > >
> > > > > The CPU time is much too small. Because he is a very heavy user
> > > > > at our institution, I cannot accept this result. The WALLCLOCK
> > > > > time, however, is very large.
> > > > >
> > > > > How do I get correct per-user resource usage information via
> > > > > qacct?
> > > >
> > > > This may happen if you have parallel jobs which are not tightly
> > > > integrated into SGE. What types of jobs is the user running?
> > > >
> > > > -- Reuti
> > > >
> > > >
> > > > > ===========================
> > > > > Sangmin Park
> > > > > Supercomputing Center
> > > > > Ulsan National Institute of Science and Technology(UNIST)
> > > > > Ulsan, 689-798, Korea
> > > > >
> > > > > phone : +82-52-217-4201
> > > > > mobile : +82-10-5094-0405
> > > > > fax : +82-52-217-4209
> > > > > ===========================
> > > > > _______________________________________________
> > > > > users mailing list
> > > > > [email protected]
> > > > > https://gridengine.org/mailman/listinfo/users
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > ===========================
> > > > Sangmin Park
> > > > Supercomputing Center
> > > > Ulsan National Institute of Science and Technology(UNIST)
> > > > Ulsan, 689-798, Korea
> > > >
> > > > phone : +82-52-217-4201
> > > > mobile : +82-10-5094-0405
> > > > fax : +82-52-217-4209
> > > > ===========================
> > >
> > >
> > >
> > >
> > > --
> > > ===========================
> > > Sangmin Park
> > > Supercomputing Center
> > > Ulsan National Institute of Science and Technology(UNIST)
> > > Ulsan, 689-798, Korea
> > >
> > > phone : +82-52-217-4201
> > > mobile : +82-10-5094-0405
> > > fax : +82-52-217-4209
> > > ===========================
> >
> >
> >
> >
> > --
> > ===========================
> > Sangmin Park
> > Supercomputing Center
> > Ulsan National Institute of Science and Technology(UNIST)
> > Ulsan, 689-798, Korea
> >
> > phone : +82-52-217-4201
> > mobile : +82-10-5094-0405
> > fax : +82-52-217-4209
> > ===========================
>
>


-- 
===========================
Sangmin Park
Supercomputing Center
Ulsan National Institute of Science and Technology(UNIST)
Ulsan, 689-798, Korea

phone : +82-52-217-4201
mobile : +82-10-5094-0405
fax : +82-52-217-4209
===========================
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
