On 19.06.2014 at 03:08, Sangmin Park wrote:

> Hi,
> 
> Do you mean that I have to recompile SGE? Doesn't that remove all log data that 
> was generated before?

The "aacounting" file? No. But the memory of the share-tree-usage will be gone.


> If I have to, I will.

Or make the changes in the PAM configuration instead; then there is no need to recompile SGE.
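For reference, the PAM route amounts to loading SGE's PAM module in sshd's PAM stack on every execution host. A rough sketch only - the exact module type, name and path must be taken from the "SSH TIGHT INTEGRATION" section of remote_startup(5) for your installation:

  # /etc/pam.d/sshd on the execution hosts (illustrative only)
  session    required    pam_sge-qrsh-setup.so

together with an sshd built with PAM support, so that processes started via ssh are tracked by SGE and show up in the accounting.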

-- Reuti


> And the reason why the load is 12 even though no slots are used is that we have 
> several queues.
> "all.q" does not allow it, but the other queues do. This user used another 
> queue. That's why.
> 
> --Sangmin
> 
> 
> On Tue, Jun 17, 2014 at 9:45 PM, Reuti <[email protected]> wrote:
> Hi,
> 
> On 17.06.2014 at 03:51, Sangmin Park wrote:
> 
> > It looks okay. But the usage reporting still does not work.
> > This is the 'ps -e f' result.
> >
> > 11151 ?        Sl     0:14 /opt/sge/bin/lx24-amd64/sge_execd
> > 16851 ?        S      0:00  \_ sge_shepherd-46865 -bg
> > 16877 ?        Ss     0:00  |   \_ bash 
> > /opt/sge/default/spool/lion20/job_scripts/46865
> > 16884 ?        S      0:00  |       \_ /bin/bash 
> > /opt/intel/impi/4.0.3.008/intel64/bin/mpirun -np 12 
> > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIB
> > 16895 ?        S      0:00  |           \_ mpiexec.hydra -machinefile 
> > /tmp/sge_machinefile_16884 -np 12 
> > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.M
> > 16896 ?        S      0:00  |               \_ /opt/sge/bin/lx24-amd64/qrsh 
> > -inherit lion20 /home/p012pnj/intel/impi/intel64/bin/pmi_proxy 
> > --control-port li
> > 16906 ?        S      0:00  |                   \_ /usr/bin/ssh -p 42593 
> > lion20 exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter' 
> > '/opt/sge/default/spool/lion
> > 16904 ?        S      0:00  \_ sge_shepherd-46865 -bg
> > 16905 ?        Ss     0:00      \_ sshd: p012chm [priv]
> > 16911 ?        S      0:00          \_ sshd: p012chm@notty
> 
> Aha, you are using SSH. Please have a look here to enable proper accounting:
> 
> http://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html
> 
> section "SSH TIGHT INTEGRATION". The location in OpenSSH is now:
> 
> http://gridengine.org/pipermail/users/2013-December/006974.html
> 
> 
> > 16912 ?        Ss     0:00              \_ 
> > /opt/sge/utilbin/lx24-amd64/qrsh_starter 
> > /opt/sge/default/spool/lion20/active_jobs/46865.1/1.lion20
> > 17001 ?        S      0:00                  \_ 
> > /home/p012pnj/intel/impi/intel64/bin/pmi_proxy --control-port lion20:57442 
> > --pmi-connect lazy-cache --pmi-agg
> > 17002 ?        Rl     0:11                      \_ 
> > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > <snip>
> > > queuename                      qtype resv/used/tot. load_avg arch          states
> > > ---------------------------------------------------------------------------------
> > > all.q@lion01                   BIP   0/0/12         2.03     lx24-amd64
> > > ---------------------------------------------------------------------------------
> > > all.q@lion02                   BIP   0/0/12         0.00     lx24-amd64
> > > ---------------------------------------------------------------------------------
> > > all.q@lion03                   BIP   0/0/12         12.00    lx24-amd64
> 
> Why is the load 12, when there are no slots used?
> 
> -- Reuti
> 
> 
> > > ---------------------------------------------------------------------------------
> > > all.q@lion04                   BIP   0/0/12         0.03     lx24-amd64
> > >
> > >
> > > FYI,
> > > Our cluster has 37 computing nodes, lion01 ~ lion37.
> > > SGE is installed in the /opt directory on the master node, called 'lion', 
> > > and only the master node is a 'submit host'.
> >
> > Good, but does it now work correctly according to the tree output of the 
> > processes?
> >
> > -- Reuti
> >
> >
> > >
> > > --Sangmin
> > >
> > >
> > > On Fri, Jun 13, 2014 at 4:11 PM, Reuti <[email protected]> wrote:
> > > On 13.06.2014 at 06:50, Sangmin Park wrote:
> > >
> > > > Hi,
> > > >
> > > > I've checked his job while it was running.
> > > > I checked it via the 'ps -ef' command and found that his job is using 
> > > > "mpiexec.hydra".
> > >
> > > Putting a blank between "-e" and "f" will give a nice process tree.
> > >
> > >
> > > > And 'qrsh' is using '-inherit' option. Here's details.
> > > >
> > > > p012chm  21424 21398  0 13:20 ?        00:00:00 bash 
> > > > /opt/sge/default/spool/lion07/job_scripts/46651
> > > > p012chm  21431 21424  0 13:20 ?        00:00:00 /bin/bash 
> > > > /opt/intel/impi/4.0.3.008/intel64/bin/mpirun -np 12 
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm  21442 21431  0 13:20 ?        00:00:00 mpiexec.hydra 
> > > > -machinefile /tmp/sge_machinefile_21431 -np 12
> > >
> > > What creates this "sge_machinefile_21431"? Often such a file is put into 
> > > $TMPDIR, i.e. the temporary directory of the job, as you can then always use 
> > > the same name and it will be removed after the job for sure.
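> > > A common pattern inside the job script (just a sketch, assuming the usual
> > > $PE_HOSTFILE format of "host slots queue range" and that $TMPDIR and
> > > $NSLOTS are set by SGE; the binary name is a placeholder) is:
> > >
> > >   awk '{for (i = 0; i < $2; i++) print $1}' "$PE_HOSTFILE" > "$TMPDIR/machines"
> > >   mpirun -machinefile "$TMPDIR/machines" -np "$NSLOTS" ./your_program
> > >
> > > so the machinefile lives in the job's private temporary directory and is
> > > removed together with it when the job ends.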
> > >
> > >
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm  21443 21442  0 13:20 ?        00:00:00 
> > > > /opt/sge/bin/lx24-amd64/qrsh -inherit lion07
> > >
> > > Ok, on the one hand this looks good and should give proper accounting. 
> > > But maybe there is an issue with the hostname resolution, as AFAIK on 
> > > the local machine "lion07" it should just fork instead of making a local 
> > > `qrsh -inherit ...`.
> > >
> > > Does `qstat -f` list the short names only, or are the FQDN in the output 
> > > for the queue instances?
> > >
> > > -- Reuti
> > >
> > >
> > > > /home/p012pnj/intel/impi/intel64/bin/pmi_proxy --control-port 
> > > > lion07:54060 --pmi-connect lazy-cache --pmi-aggregate --bootstrap rsh 
> > > > --bootstrap-exec rsh --demux poll --pgid 0 --enable-stdin 1 --proxy-id 0
> > > > root     21452 21451  0 13:20 ?        00:00:00 sshd: p012chm [priv]
> > > > p012chm  21453 21443  0 13:20 ?        00:00:00 /usr/bin/ssh -p 60725 
> > > > lion07 exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter' 
> > > > '/opt/sge/default/spool/lion07/active_jobs/46651.1/1.lion07'
> > > > p012chm  21457 21452  0 13:20 ?        00:00:00 sshd: p012chm@notty
> > > > p012chm  21458 21457  0 13:20 ?        00:00:00 
> > > > /opt/sge/utilbin/lx24-amd64/qrsh_starter 
> > > > /opt/sge/default/spool/lion07/active_jobs/46651.1/1.lion07
> > > > p012chm  21548 21458  0 13:20 ?        00:00:00 
> > > > /home/p012pnj/intel/impi/intel64/bin/pmi_proxy --control-port 
> > > > lion07:54060 --pmi-connect lazy-cache --pmi-aggregate --bootstrap rsh 
> > > > --bootstrap-exec rsh --demux poll --pgid 0 --enable-stdin 1 --proxy-id 0
> > > > p012chm  21549 21548 99 13:20 ?        00:22:04 
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm  21550 21548 99 13:20 ?        00:22:10 
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm  21551 21548 99 13:20 ?        00:22:10 
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm  21552 21548 99 13:20 ?        00:22:10 
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm  21553 21548 99 13:20 ?        00:22:10 
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm  21554 21548 99 13:20 ?        00:22:10 
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm  21555 21548 99 13:20 ?        00:22:10 
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm  21556 21548 99 13:20 ?        00:22:10 
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm  21557 21548 99 13:20 ?        00:22:10 
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm  21558 21548 99 13:20 ?        00:22:10 
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm  21559 21548 99 13:20 ?        00:22:10 
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm  21560 21548 99 13:20 ?        00:22:10 
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > smpark   21728 21638  0 13:43 pts/0    00:00:00 grep chm
> > > >
> > > > --Sangmin
> > > >
> > > >
> > > > On Thu, Jun 12, 2014 at 8:04 PM, Reuti <[email protected]> 
> > > > wrote:
> > > > On 12.06.2014 at 04:23, Sangmin Park wrote:
> > > >
> > > > > I've checked the version of Intel MPI. He uses Intel MPI version 
> > > > > 4.0.3.008.
> > > > > Our system uses rsh to access the computing nodes. SGE does, too.
> > > > >
> > > > > Please let me know how to check which one is used, 'mpiexec.hydra' or 
> > > > > 'mpiexec'.
> > > >
> > > > Do you have both files somewhere in a "bin" directory inside the Intel 
> > > > MPI installation? You could rename "mpiexec" and create a symbolic link 
> > > > "mpiexec" pointing to "mpiexec.hydra". The old startup will need some 
> > > > daemons running on the node (which are outside of SGE's control and 
> > > > accounting*), but "mpiexec.hydra" will start up the child processes on 
> > > > its own as its own kids and should hence be under SGE's control. And 
> > > > as long as you are staying on one and the same node, this should work 
> > > > already without further setup. To avoid a later surprise when you 
> > > > compute across nodes, the `rsh`/`ssh` calls should nevertheless be 
> > > > caught and redirected to `qrsh -inherit ...` as outlined in "$SGE_ROOT/mpi".
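> > > > Illustrative only (the directory matches the path seen in your `ps` output;
> > > > the backup name is arbitrary):
> > > >
> > > >   cd /opt/intel/impi/4.0.3.008/intel64/bin
> > > >   mv mpiexec mpiexec.mpd        # keep the old MPD-based starter around
> > > >   ln -s mpiexec.hydra mpiexec   # plain "mpiexec" now uses the Hydra starter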
> > > >
> > > > -- Reuti
> > > >
> > > > *) It's even possible to force the daemons to be started under SGE, but 
> > > > it's convoluted and not recommended.
> > > >
> > > >
> > > > > Sangmin
> > > > >
> > > > >
> > > > > On Wed, Jun 11, 2014 at 6:46 PM, Reuti <[email protected]> 
> > > > > wrote:
> > > > > Hi,
> > > > >
> > > > > On 11.06.2014 at 02:38, Sangmin Park wrote:
> > > > >
> > > > > > For the best performance, we recommend that users use 8 cores on a 
> > > > > > single node, not distributed across multiple nodes.
> > > > > > As I said before, he uses the VASP application compiled with Intel MPI. 
> > > > > > So he uses Intel MPI now.
> > > > >
> > > > > Which version of Intel MPI? Even with the latest one it's not tightly 
> > > > > integrated by default (despite the fact that MPICH3 [on which it is 
> > > > > based] is tightly integrated by default).
> > > > >
> > > > > Depending on the version it might be necessary to make some 
> > > > > adjustments - IIRC mainly to use `mpiexec.hydra` instead of `mpiexec` 
> > > > > and to supply a wrapper to catch the `rsh`/`ssh` call (like in the MPI 
> > > > > demo in SGE's directory).
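> > > > > The wrapper is just a script named `rsh` (or `ssh`) placed first in the 
> > > > > job's PATH which forwards the call to `qrsh -inherit`. A minimal sketch - 
> > > > > the real wrapper shipped in $SGE_ROOT/mpi also handles extra options:
> > > > >
> > > > >   #!/bin/sh
> > > > >   # first argument: the remote host, remainder: the command to run there
> > > > >   host=$1; shift
> > > > >   exec qrsh -inherit "$host" "$@"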
> > > > >
> > > > > -- Reuti
> > > > >
> > > > >
> > > > > > --Sangmin
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 10, 2014 at 5:58 PM, Reuti <[email protected]> 
> > > > > > wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On 10.06.2014 at 10:21, Sangmin Park wrote:
> > > > > >
> > > > > > > This user always runs parallel jobs using the VASP application.
> > > > > > > Usually, he uses 8 cores per job. Lots of jobs of this kind have been 
> > > > > > > submitted by the user.
> > > > > >
> > > > > > 8 cores on a particular node or 8 slots across the cluster? What 
> > > > > > MPI implementation does he use?
> > > > > >
> > > > > > -- Reuti
> > > > > >
> > > > > > NB: Please keep the list posted.
> > > > > >
> > > > > >
> > > > > > > Sangmin
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Jun 10, 2014 at 3:42 PM, Reuti 
> > > > > > > <[email protected]> wrote:
> > > > > > > On 10.06.2014 at 08:00, Sangmin Park wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I'm very confused about the output of the qacct command.
> > > > > > > > I thought the CPU column is the best way to measure resource usage by 
> > > > > > > > users, according to this web page: 
> > > > > > > > https://wiki.duke.edu/display/SCSC/Checking+SGE+Usage
> > > > > > > >
> > > > > > > > But I have a situation.
> > > > > > > > One of the users in my institution, actually one of our heavy users, 
> > > > > > > > uses lots of HPC resources. To get the resource usage of this user, 
> > > > > > > > which is required for billing, I ran qacct and the output is below; 
> > > > > > > > this is just for May.
> > > > > > > >
> > > > > > > > OWNER       WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW
> > > > > > > > ========================================================================================================================
> > > > > > > > p012chm       2980810        28.485        35.012       100.634              4.277              0.576              0.000
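> > > > > > > > For reference, a per-owner summary limited to one month can be requested 
> > > > > > > > with something like the following (the dates are only an example; -b and 
> > > > > > > > -e take timestamps in [[CC]YY]MMDDhhmm format):
> > > > > > > >
> > > > > > > >   qacct -o p012chm -b 201405010000 -e 201406010000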
> > > > > > > >
> > > > > > > > The CPU time is much too small. Because he is a very heavy user at our 
> > > > > > > > institution, I cannot accept this result. However, the WALLCLOCK time 
> > > > > > > > is very large.
> > > > > > > >
> > > > > > > > How do I get correct information about resource usage by users 
> > > > > > > > via qacct?
> > > > > > >
> > > > > > > This may happen in case you have parallel jobs which are not 
> > > > > > > tightly integrated into SGE. What types of jobs is the user 
> > > > > > > running?
> > > > > > >
> > > > > > > -- Reuti
> > > > > > >
> > > > > > >
> -- 
> ===========================
> Sangmin Park 
> Supercomputing Center
> Ulsan National Institute of Science and Technology(UNIST)
> Ulsan, 689-798, Korea 
> 
> phone : +82-52-217-4201
> mobile : +82-10-5094-0405
> fax : +82-52-217-4209
> ===========================


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
