Hi,

Do you mean that I have to compile SGE? Won't that remove all the log data that was generated before? If I have to, I will do it.
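Just to check that I understand the "SSH TIGHT INTEGRATION" section correctly: I assume the Grid Engine side boils down to pointing the rsh hooks of the cluster configuration (`qconf -mconf`) at SSH, roughly like this (only my reading of the page, not yet tried on our cluster):

  rsh_command    /usr/bin/ssh
  rsh_daemon     /usr/sbin/sshd -i

and the compiling question above is about getting an sshd/SGE combination whose remotely started processes remain visible to SGE's accounting.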
And the reason why the load is 12 even though no slots are used is that we have several queues on that node. "all.q" does not allow these jobs, but another queue does, and this user used that other queue. That's why.

--Sangmin

On Tue, Jun 17, 2014 at 9:45 PM, Reuti <[email protected]> wrote:
> Hi,
>
> On 17.06.2014 at 03:51, Sangmin Park wrote:
>
> > It looks okay. But the usage reporting still does not work.
> > This is the 'ps -e f' result.
> >
> > 11151 ?  Sl  0:14 /opt/sge/bin/lx24-amd64/sge_execd
> > 16851 ?  S   0:00  \_ sge_shepherd-46865 -bg
> > 16877 ?  Ss  0:00  |   \_ bash /opt/sge/default/spool/lion20/job_scripts/46865
> > 16884 ?  S   0:00  |       \_ /bin/bash /opt/intel/impi/4.0.3.008/intel64/bin/mpirun -np 12 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIB
> > 16895 ?  S   0:00  |           \_ mpiexec.hydra -machinefile /tmp/sge_machinefile_16884 -np 12 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.M
> > 16896 ?  S   0:00  |               \_ /opt/sge/bin/lx24-amd64/qrsh -inherit lion20 /home/p012pnj/intel/impi/intel64/bin/pmi_proxy --control-port li
> > 16906 ?  S   0:00  |                   \_ /usr/bin/ssh -p 42593 lion20 exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter' '/opt/sge/default/spool/lion
> > 16904 ?  S   0:00  \_ sge_shepherd-46865 -bg
> > 16905 ?  Ss  0:00      \_ sshd: p012chm [priv]
> > 16911 ?  S   0:00          \_ sshd: p012chm@notty
>
> Aha, you are using SSH. Please have a look here to enable proper accounting:
>
> http://arc.liv.ac.uk/SGE/htmlman/htmlman5/remote_startup.html
>
> section "SSH TIGHT INTEGRATION". The location in OpenSSH is now:
>
> http://gridengine.org/pipermail/users/2013-December/006974.html
>
> > 16912 ?  Ss  0:00              \_ /opt/sge/utilbin/lx24-amd64/qrsh_starter /opt/sge/default/spool/lion20/active_jobs/46865.1/1.lion20
> > 17001 ?  S   0:00                  \_ /home/p012pnj/intel/impi/intel64/bin/pmi_proxy --control-port lion20:57442 --pmi-connect lazy-cache --pmi-agg
> > 17002 ?  Rl  0:11                      \_ /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
>
> <snip>
>
> > queuename                      qtype resv/used/tot. load_avg arch          states
> > ---------------------------------------------------------------------------------
> > all.q@lion01                   BIP   0/0/12         2.03     lx24-amd64
> > ---------------------------------------------------------------------------------
> > all.q@lion02                   BIP   0/0/12         0.00     lx24-amd64
> > ---------------------------------------------------------------------------------
> > all.q@lion03                   BIP   0/0/12         12.00    lx24-amd64
>
> Why is the load 12, when there are no slots used?
>
> -- Reuti
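On the load question: as I wrote at the top, the load on lion03 comes from jobs running in one of the other queues on that node. If I read the qhost man page correctly, something like

  qhost -j -h lion03

lists the jobs on that host regardless of the queue they belong to (the exact options are my guess, still to be verified).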
> > ---------------------------------------------------------------------------------
> > all.q@lion04                   BIP   0/0/12         0.03     lx24-amd64
> >
> >
> > FYI,
> > Our cluster has 37 computing nodes, lion01 ~ lion37.
> > SGE is installed in the /opt directory on the master node called 'lion',
> > and only the master node is a 'submit host'.
>
> Good, but does it now work correctly according to the tree output of the processes?
>
> -- Reuti
>
> > --Sangmin
> >
> > On Fri, Jun 13, 2014 at 4:11 PM, Reuti <[email protected]> wrote:
> > > On 13.06.2014 at 06:50, Sangmin Park wrote:
> > >
> > > > Hi,
> > > >
> > > > I checked his job while it was running.
> > > > I checked it via the 'ps -ef' command and found that his job is using "mpiexec.hydra".
> > >
> > > Putting a blank between "-e" and "f" will give a nice process tree.
> > >
> > > > And 'qrsh' is using the '-inherit' option. Here are the details.
> > > >
> > > > p012chm 21424 21398  0 13:20 ?  00:00:00 bash /opt/sge/default/spool/lion07/job_scripts/46651
> > > > p012chm 21431 21424  0 13:20 ?  00:00:00 /bin/bash /opt/intel/impi/4.0.3.008/intel64/bin/mpirun -np 12 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm 21442 21431  0 13:20 ?  00:00:00 mpiexec.hydra -machinefile /tmp/sge_machinefile_21431 -np 12
> > >
> > > What creates this "sge_machinefile_21431"? Often it's put into $TMPDIR, i.e. the temporary directory of the job, as you can always use the same name and it will be removed after the job for sure.
> > >
> > > > /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm 21443 21442  0 13:20 ?  00:00:00 /opt/sge/bin/lx24-amd64/qrsh -inherit lion07
> > >
> > > Ok, on the one hand this looks good and should give a proper accounting. But maybe there is something about the hostname resolution, as AFAIK on the local machine "lion07" it should just fork instead of making a local `qrsh -inherit...`.
> > >
> > > Does `qstat -f` list the short names only, or are the FQDNs in the output for the queue instances?
> > >
> > > -- Reuti
> > >
> > > > /home/p012pnj/intel/impi/intel64/bin/pmi_proxy --control-port lion07:54060 --pmi-connect lazy-cache --pmi-aggregate --bootstrap rsh --bootstrap-exec rsh --demux poll --pgid 0 --enable-stdin 1 --proxy-id 0
> > > > root    21452 21451  0 13:20 ?  00:00:00 sshd: p012chm [priv]
> > > > p012chm 21453 21443  0 13:20 ?  00:00:00 /usr/bin/ssh -p 60725 lion07 exec '/opt/sge/utilbin/lx24-amd64/qrsh_starter' '/opt/sge/default/spool/lion07/active_jobs/46651.1/1.lion07'
> > > > p012chm 21457 21452  0 13:20 ?  00:00:00 sshd: p012chm@notty
> > > > p012chm 21458 21457  0 13:20 ?  00:00:00 /opt/sge/utilbin/lx24-amd64/qrsh_starter /opt/sge/default/spool/lion07/active_jobs/46651.1/1.lion07
> > > > p012chm 21548 21458  0 13:20 ?  00:00:00 /home/p012pnj/intel/impi/intel64/bin/pmi_proxy --control-port lion07:54060 --pmi-connect lazy-cache --pmi-aggregate --bootstrap rsh --bootstrap-exec rsh --demux poll --pgid 0 --enable-stdin 1 --proxy-id 0
> > > > p012chm 21549 21548 99 13:20 ?  00:22:04 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm 21550 21548 99 13:20 ?  00:22:10 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm 21551 21548 99 13:20 ?  00:22:10 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm 21552 21548 99 13:20 ?  00:22:10 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm 21553 21548 99 13:20 ?  00:22:10 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm 21554 21548 99 13:20 ?  00:22:10 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm 21555 21548 99 13:20 ?  00:22:10 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm 21556 21548 99 13:20 ?  00:22:10 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm 21557 21548 99 13:20 ?  00:22:10 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm 21558 21548 99 13:20 ?  00:22:10 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm 21559 21548 99 13:20 ?  00:22:10 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > p012chm 21560 21548 99 13:20 ?  00:22:10 /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x
> > > > smpark  21728 21638  0 13:43 pts/0  00:00:00 grep chm
> > > >
> > > > --Sangmin
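About the sge_machinefile question above: I still have to check what creates /tmp/sge_machinefile_<pid> (probably the user's job script or a wrapper around mpirun). If it is on our side, we could build it in the job's $TMPDIR from the hosts SGE assigns, something along these lines ($PE_HOSTFILE, $TMPDIR and $NSLOTS are set by SGE inside the job; untested sketch):

  # one entry per granted slot, so a host with 4 slots appears 4 times
  awk '{ for (i = 0; i < $2; i++) print $1 }' $PE_HOSTFILE > $TMPDIR/machines
  mpirun -machinefile $TMPDIR/machines -np $NSLOTS /home/p012chm/Binary4intelMPI/vasp.5.2.12_GRAPE.O3.MPIBLOCK5000.mpi.x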
> > > > On Thu, Jun 12, 2014 at 8:04 PM, Reuti <[email protected]> wrote:
> > > > On 12.06.2014 at 04:23, Sangmin Park wrote:
> > > >
> > > > > I've checked the version of Intel MPI. He uses Intel MPI 4.0.3.008.
> > > > > Our system uses rsh to access the computing nodes. SGE does, too.
> > > > >
> > > > > Please let me know how to check which one is used, 'mpiexec.hydra' or 'mpiexec'.
> > > >
> > > > Do you have both files somewhere in a "bin" directory inside the Intel MPI? You could rename "mpiexec" and create a symbolic link "mpiexec" pointing to "mpiexec.hydra". The old startup will need some daemons running on the node (which are outside of SGE's control and accounting*), but "mpiexec.hydra" will start up the child processes as its own children, and they should hence be under SGE's control. And as long as you are staying on one and the same node, this should already work without further setup. To avoid a later surprise when you compute across nodes, the `rsh`/`ssh` should nevertheless be caught and redirected to `qrsh -inherit ...`, as outlined in "$SGE_ROOT/mpi".
> > > >
> > > > -- Reuti
> > > >
> > > > *) It's even possible to force the daemons to be started under SGE, but it's convoluted and not recommended.
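If I follow this, the change on our side would look roughly like the sketch below. The Intel MPI path is taken from the user's job above, and the real "$SGE_ROOT/mpi" rsh wrapper handles more options than this stripped-down version:

  # let "mpiexec" start the hydra launcher, keeping the old starter around
  cd /opt/intel/impi/4.0.3.008/intel64/bin
  mv mpiexec mpiexec.mpd.orig
  ln -s mpiexec.hydra mpiexec

plus a minimal "rsh" wrapper placed first in the job's PATH, so remote starts are handed to SGE:

  #!/bin/sh
  # forward "rsh <host> <command>" to a tightly integrated SGE remote start
  # (assumes qrsh is in the PATH of the job)
  host=$1; shift
  exec qrsh -inherit "$host" "$@"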
> > > > > Sangmin
> > > > >
> > > > > On Wed, Jun 11, 2014 at 6:46 PM, Reuti <[email protected]> wrote:
> > > > > Hi,
> > > > >
> > > > > On 11.06.2014 at 02:38, Sangmin Park wrote:
> > > > >
> > > > > > For the best performance, we recommend users to use 8 cores on a single particular node, not distributed across multiple nodes.
> > > > > > As I said before, he uses the VASP application compiled with Intel MPI. So he uses Intel MPI now.
> > > > >
> > > > > Which version of Intel MPI? Even with the latest one it's not tightly integrated by default (despite the fact that MPICH3 [on which it is based] is tightly integrated by default).
> > > > >
> > > > > Depending on the version it might be necessary to make some adjustments - IIRC mainly use `mpiexec.hydra` instead of `mpiexec` and supply a wrapper to catch the `rsh`/`ssh` call (like in the MPI demo in SGE's directory).
> > > > >
> > > > > -- Reuti
> > > > >
> > > > > > --Sangmin
> > > > > >
> > > > > > On Tue, Jun 10, 2014 at 5:58 PM, Reuti <[email protected]> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On 10.06.2014 at 10:21, Sangmin Park wrote:
> > > > > >
> > > > > > > This user always runs parallel jobs using the VASP application.
> > > > > > > Usually, he uses 8 cores per job. Lots of jobs of this kind have been submitted by this user.
> > > > > >
> > > > > > 8 cores on a particular node or 8 slots across the cluster? What MPI implementation does he use?
> > > > > >
> > > > > > -- Reuti
> > > > > >
> > > > > > NB: Please keep the list posted.
> > > > > >
> > > > > > > Sangmin
> > > > > > >
> > > > > > > On Tue, Jun 10, 2014 at 3:42 PM, Reuti <[email protected]> wrote:
> > > > > > > On 10.06.2014 at 08:00, Sangmin Park wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I'm very confused about the output of the qacct command.
> > > > > > > > I thought the CPU column is the best way to measure resource usage by users, following this web page: https://wiki.duke.edu/display/SCSC/Checking+SGE+Usage
> > > > > > > >
> > > > > > > > But I have a situation. One of the users at my institution, actually one of our heavy users, uses lots of HPC resources. To get this user's resource usage for billing, I ran qacct and the output is below; this covers just May.
> > > > > > > >
> > > > > > > > OWNER     WALLCLOCK    UTIME    STIME       CPU  MEMORY     IO    IOW
> > > > > > > > ========================================================================
> > > > > > > > p012chm     2980810   28.485   35.012   100.634   4.277  0.576  0.000
> > > > > > > >
> > > > > > > > The CPU time is much too small. Because he is a very heavy user at our institution, I cannot accept this result. However, the WALLCLOCK time is very large.
> > > > > > > >
> > > > > > > > How do I get correct information about the resources used by users via qacct?
> > > > > > >
> > > > > > > This may happen in case you have parallel jobs which are not tightly integrated into SGE. What types of jobs is the user running?
> > > > > > >
> > > > > > > -- Reuti
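For reference, the per-owner summary for May above comes from a qacct call roughly like this one (the begin/end times are whatever covers the month):

  # accounting summary for one owner, May 2014
  qacct -o p012chm -b 201405010000 -e 201406010000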
--
===========================
Sangmin Park
Supercomputing Center
Ulsan National Institute of Science and Technology(UNIST)
Ulsan, 689-798, Korea

phone : +82-52-217-4201
mobile : +82-10-5094-0405
fax : +82-52-217-4209
===========================
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
