crud - sorry about that! old man can't even remember his own param name....sigh
Thanks for checking it
Ralph

On Feb 6, 2014, at 9:47 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:

> Ralph,
>
> It worked on my second try, when I spelled it "ras_tm_smp" :-)
>
> Thanks,
> -Paul
>
>
> On Wed, Feb 5, 2014 at 11:59 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> Ralph,
>
> I will try to build tonight's trunk tarball and then test a run tomorrow.
> Please ping me if I don't post my results by Thu evening (PST).
>
> -Paul
>
>
> On Wed, Feb 5, 2014 at 7:52 AM, Ralph Castain <r...@open-mpi.org> wrote:
> I added this to the trunk in r30568 - a new MCA param "ras_tm_smp_mode" will
> tell us to use the PBS_PPN envar to get the number of slots allocated per
> node. We then just use the PBS_NODEFILE to read the names of the nodes, which
> I expect will be one for each partition.
>
> Let me know if this solves the problem - I scheduled it for 1.7.5
>
> Thanks!
> Ralph
>
> On Jan 31, 2014, at 4:33 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> No worries about PBS itself - better to allow you to just run this way. Easy
>> to add a switch for this purpose.
>>
>> For now, just add --oversubscribe to the command line
>>
>> On Jan 31, 2014, at 3:32 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>>> Ralph,
>>>
>>> The mods may have been done by the staff at PSC rather than by SGI.
>>> Note the "_psc" suffix:
>>>   $ which pbsnodes
>>>   /usr/local/packages/torque/2.3.13_psc/bin/pbsnodes
>>>
>>> Their sources appear to be available in the f/s too.
>>> Using "tar -d" to compare that to the pristine torque-2.3.13 tarball shows
>>> the following files were modified:
>>>   torque-2.3.13/src/resmom/job_func.c
>>>   torque-2.3.13/src/resmom/mom_main.c
>>>   torque-2.3.13/src/resmom/requests.c
>>>   torque-2.3.13/src/resmom/linux/mom_mach.h
>>>   torque-2.3.13/src/resmom/linux/mom_mach.c
>>>   torque-2.3.13/src/resmom/linux/cpuset.c
>>>   torque-2.3.13/src/resmom/start_exec.c
>>>   torque-2.3.13/src/scheduler.tcl/pbs_sched.c
>>>   torque-2.3.13/src/cmds/qalter.c
>>>   torque-2.3.13/src/cmds/qsub.c
>>>   torque-2.3.13/src/cmds/qstat.c
>>>   torque-2.3.13/src/server/resc_def_all.c
>>>   torque-2.3.13/src/server/req_quejob.c
>>>   torque-2.3.13/torque.spec
>>>
>>> I'll provide what assistance I can in testing.
>>> That includes providing (off-list) the actual diffs of PSC's torque against
>>> the tarball, if desired.
>>>
>>> In the meantime, since -npernode didn't work, what is the right way to say:
>>> "I have 1 slot but I want to overcommit and run 16 mpi ranks".
>>>
>>> -Paul
>>>
>>>
>>> On Fri, Jan 31, 2014 at 3:20 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> On Jan 31, 2014, at 3:13 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>>> Ralph,
>>>>
>>>> As I said this is NOT a cluster - it is a 4k-core shared memory machine.
>>>
>>> I understood - that wasn't the nature of my question
>>>
>>>> TORQUE is allocating cpus (time-shared mode, IIRC), not nodes.
>>>> So, there is always exactly one line in $PBS_NODEFILE.
>>>
>>> Interesting - because that isn't the standard way Torque behaves. It is
>>> supposed to put one line/slot in the nodefile, each line containing the
>>> name of the node. Clearly, SGI has reconfigured Torque to do something
>>> different.
>>>
>>>>
>>>> The system runs as 2 partitions of 2k-cores each.
>>>> So, the contents of $PBS_NODEFILE have exactly 2 possible values, each 1
>>>> line.
>>>>
>>>> The values of PBS_PPN and PBS_NCPUS both reflect the size of the
>>>> allocation.
>>>>
>>>> At a minimum, shouldn't Open MPI be multiplying the lines in
>>>> $PBS_NODEFILE by the value of $PBS_PPN?
>>>
>>> No, as above, that isn't the way Torque generally behaves. It would appear
>>> that we need a "switch" here to handle SGI's modifications. Should be
>>> doable - just haven't had anyone using an SGI machine before :-)
>>>
>>>>
>>>> Additionally, when I try "mpirun -npernode 16 ./ring_c" I am still told
>>>> there are not enough slots.
>>>> Shouldn't that be working with 1 line in $PBS_NODEFILE?
>>>>
>>>> -Paul
>>>>
>>>>
>>>> On Fri, Jan 31, 2014 at 2:47 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> We read the nodes from the PBS_NODEFILE, Paul - can you pass that along?
>>>>
>>>> On Jan 31, 2014, at 2:33 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>
>>>>> I am trying to test the trunk on an SGI UV (to validate Nathan's port of
>>>>> btl:vader to SGI's variant of xpmem).
>>>>>
>>>>> At configure time, PBS's TM support was correctly located.
>>>>>
>>>>> My PBS batch script includes
>>>>>   #PBS -l ncpus=16
>>>>> because that is what this installation requires (not nodes, mppnodes, or
>>>>> anything like that).
>>>>> One is allocating cpus on a large shared-memory machine, not a set of
>>>>> nodes in a cluster.
>>>>>
>>>>> However, this appears to be causing mpirun to think I have just 1 slot:
>>>>>
>>>>> + mpirun -np 2 ./ring_c
>>>>> --------------------------------------------------------------------------
>>>>> There are not enough slots available in the system to satisfy the 2 slots
>>>>> that were requested by the application:
>>>>>   ./ring_c
>>>>>
>>>>> Either request fewer slots for your application, or make more slots
>>>>> available for use.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> In case they contain useful info, here are the PBS env vars in the job:
>>>>>
>>>>> PBS_HT_NCPUS=32
>>>>> PBS_VERSION=TORQUE-2.3.13
>>>>> PBS_JOBNAME=qs
>>>>> PBS_ENVIRONMENT=PBS_BATCH
>>>>> PBS_HOME=/var/spool/torque
>>>>> PBS_O_WORKDIR=/usr/users/6/hargrove/SCRATCH/OMPI/openmpi-trunk-linux-x86_64-uv-trunk/BLD/examples
>>>>> PBS_PPN=16
>>>>> PBS_TASKNUM=1
>>>>> PBS_O_HOME=/usr/users/6/hargrove
>>>>> PBS_MOMPORT=15003
>>>>> PBS_O_QUEUE=debug
>>>>> PBS_O_LOGNAME=hargrove
>>>>> PBS_O_LANG=en_US.UTF-8
>>>>> PBS_JOBCOOKIE=9EEF5DF75FA705A241FEF66EDFE01C5B
>>>>> PBS_NODENUM=0
>>>>> PBS_O_SHELL=/usr/psc/shells/bash
>>>>> PBS_SERVER=tg-login1.blacklight.psc.teragrid.org
>>>>> PBS_JOBID=314827.tg-login1.blacklight.psc.teragrid.org
>>>>> PBS_NCPUS=16
>>>>> PBS_O_HOST=tg-login1.blacklight.psc.teragrid.org
>>>>> PBS_VNODENUM=0
>>>>> PBS_QUEUE=debug_r1
>>>>> PBS_O_MAIL=/var/mail/hargrove
>>>>> PBS_NODEFILE=/var/spool/torque/aux//314827.tg-login1.blacklight.psc.teragrid.org
>>>>> PBS_O_PATH=[...removed...]
>>>>>
>>>>> If any additional info is needed to help make mpirun "just work", please
>>>>> let me know.
>>>>>
>>>>> However, at this point I am mostly interested in any work-arounds that
>>>>> will let me run something other than a singleton on this system.
>>>>>
>>>>> -Paul
>>>>>
>>>>> --
>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>> Future Technologies Group
>>>>> Computer and Data Sciences Department     Tel: +1-510-495-2352
>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
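
For anyone reading this thread later, here is a rough shell sketch of the slot arithmetic discussed above: take each distinct host name from the nodefile and give it PBS_PPN slots. It is only an illustration of the idea, not the code that went into the trunk. The function name `smp_slots` and the "host slots=N" output format are assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: mimics the slot count described in this thread for an
# SMP-style Torque allocation. This is NOT Open MPI's actual implementation.

# smp_slots NODEFILE PPN   (hypothetical helper name)
# Prints one "host slots=PPN" line per distinct host listed in NODEFILE.
smp_slots() {
    nodefile=$1
    ppn=$2
    # sort -u collapses repeated host lines, so a standard Torque nodefile
    # (one line per slot) does not get its slot count multiplied again.
    sort -u "$nodefile" | while IFS= read -r host; do
        if [ -n "$host" ]; then
            printf '%s slots=%s\n' "$host" "$ppn"
        fi
    done
}

# Inside a batch job, use the variables Torque exports (see the env dump above).
if [ -n "${PBS_NODEFILE:-}" ] && [ -n "${PBS_PPN:-}" ]; then
    smp_slots "$PBS_NODEFILE" "$PBS_PPN"
fi
```

With the environment from the original report (a one-line nodefile and PBS_PPN=16), this prints a single line ending in "slots=16", which matches the 16 cpus requested with "#PBS -l ncpus=16".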