Crud - sorry about that! This old man can't even remember his own param name... sigh

Thanks for checking it
Ralph

On Feb 6, 2014, at 9:47 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:

> Ralph,
> 
> It worked on my second try, when I spelled it "ras_tm_smp" :-)
> 
> Thanks,
> -Paul
> 
> 
> 
> On Wed, Feb 5, 2014 at 11:59 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> Ralph,
> 
> I will try to build tonight's trunk tarball and then test a run tomorrow.
> Please ping me if I don't post my results by Thu evening (PST).
> 
> -Paul
> 
> 
> On Wed, Feb 5, 2014 at 7:52 AM, Ralph Castain <r...@open-mpi.org> wrote:
> I added this to the trunk in r30568 - a new MCA param "ras_tm_smp_mode" will 
> tell us to use the PBS_PPN envar to get the number of slots allocated per 
> node. We then just read the node names from the PBS_NODEFILE, which I expect 
> will contain one entry for each partition.
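> 
> A quick usage sketch (assuming the param is enabled as a boolean; per the 
> correction at the top of this thread, the spelling that actually worked was 
> "ras_tm_smp"):
> 
>   mpirun --mca ras_tm_smp 1 -np 16 ./ring_c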
> 
> Let me know if this solves the problem - I scheduled it for 1.7.5
> 
> Thanks!
> Ralph
> 
> On Jan 31, 2014, at 4:33 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
>> No worries about PBS itself - better to allow you to just run this way. Easy 
>> to add a switch for this purpose.
>> 
>> For now, just add --oversubscribe to the command line
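>> 
>> e.g., something along these lines (the rank count of 16 is illustrative):
>> 
>>   mpirun --oversubscribe -np 16 ./ring_c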
>> 
>> On Jan 31, 2014, at 3:32 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> 
>>> Ralph,
>>> 
>>> The mods may have been done by the staff at PSC rather than by SGI.
>>> Note the "_psc" suffix:
>>> $ which pbsnodes
>>> /usr/local/packages/torque/2.3.13_psc/bin/pbsnodes
>>> 
>>> Their sources appear to be available in the filesystem too.
>>> Using "tar -d" to compare that to the pristine torque-2.3.13 tarball shows 
>>> the following files were modified (a sketch of the command follows the list):
>>> torque-2.3.13/src/resmom/job_func.c
>>> torque-2.3.13/src/resmom/mom_main.c
>>> torque-2.3.13/src/resmom/requests.c
>>> torque-2.3.13/src/resmom/linux/mom_mach.h
>>> torque-2.3.13/src/resmom/linux/mom_mach.c
>>> torque-2.3.13/src/resmom/linux/cpuset.c
>>> torque-2.3.13/src/resmom/start_exec.c
>>> torque-2.3.13/src/scheduler.tcl/pbs_sched.c
>>> torque-2.3.13/src/cmds/qalter.c
>>> torque-2.3.13/src/cmds/qsub.c
>>> torque-2.3.13/src/cmds/qstat.c
>>> torque-2.3.13/src/server/resc_def_all.c
>>> torque-2.3.13/src/server/req_quejob.c
>>> torque-2.3.13/torque.spec
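>>> 
>>> For reference, the comparison was along these lines (paths hypothetical; 
>>> GNU tar's -d/--diff checks archive members against the filesystem):
>>> 
>>>   cd /path/to/parent/of/psc-torque-2.3.13   # hypothetical location
>>>   tar -df /path/to/torque-2.3.13.tar.gz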
>>> 
>>> I'll provide what assistance I can in testing.
>>> That includes providing (off-list) the actual diffs of PSC's torque against 
>>> the tarball, if desired.
>>> 
>>> In the meantime, since -npernode didn't work, what is the right way to say:
>>>   "I have 1 slot but I want to overcommit and run 16 MPI ranks"?
>>> 
>>> -Paul
>>> 
>>> 
>>> On Fri, Jan 31, 2014 at 3:20 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>> On Jan 31, 2014, at 3:13 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>> 
>>>> Ralph,
>>>> 
>>>> As I said this is NOT a cluster - it is a 4k-core shared memory machine.
>>> 
>>> I understood - that wasn't the nature of my question
>>> 
>>>> TORQUE is allocating cpus (time-shared mode, IIRC), not nodes.
>>>> So, there is always exactly one line in $PBS_NODEFILE.
>>> 
>>> Interesting - because that isn't the standard way Torque behaves. It is 
>>> supposed to put one line per slot in the nodefile, each line containing the 
>>> name of the node. Clearly, SGI has reconfigured Torque to do something 
>>> different.
>>> 
>>>> 
>>>> The system runs as 2 partitions of 2k cores each.
>>>> So, the contents of $PBS_NODEFILE have exactly 2 possible values, each a 
>>>> single line.
>>>> 
>>>> The values of PBS_PPN and PBS_NCPUS both reflect the size of the 
>>>> allocation.
>>>> 
>>>> At a minimum, shouldn't Open MPI be multiplying the lines in 
>>>> $PBS_NODEFILE by the value of $PBS_PPN?
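>>>> 
>>>> i.e., something like this shell arithmetic (illustrative only):
>>>> 
>>>>   slots=$(( $(wc -l < "$PBS_NODEFILE") * ${PBS_PPN:-1} ))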
>>> 
>>> No, as above, that isn't the way Torque generally behaves. It would appear 
>>> that we need a "switch" here to handle SGI's modifications. Should be 
>>> doable - just haven't had anyone using an SGI machine before :-)
>>> 
>>>> 
>>>> Additionally, when I try "mpirun -npernode 16 ./ring_c" I am still told 
>>>> there are not enough slots.
>>>> Shouldn't that work with 1 line in $PBS_NODEFILE?
>>>> 
>>>> -Paul
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Fri, Jan 31, 2014 at 2:47 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> We read the nodes from the PBS_NODEFILE, Paul - can you pass that along?
>>>> 
>>>> On Jan 31, 2014, at 2:33 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>> 
>>>>> I am trying to test the trunk on an SGI UV (to validate Nathan's port of 
>>>>> btl:vader to SGI's variant of xpmem).
>>>>> 
>>>>> At configure time, PBS's TM support was correctly located.
>>>>> 
>>>>> My PBS batch script includes
>>>>>   #PBS -l ncpus=16
>>>>> because that is what this installation requires (not nodes, mppnodes, or 
>>>>> anything like that).
>>>>> One is allocating cpus on a large shared-memory machine, not a set of 
>>>>> nodes in a cluster.
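>>>>> 
>>>>> For context, the relevant part of my batch script is roughly as follows 
>>>>> (queue name and workdir handling are illustrative):
>>>>> 
>>>>>   #PBS -l ncpus=16
>>>>>   #PBS -q debug
>>>>>   cd $PBS_O_WORKDIR
>>>>>   mpirun -np 2 ./ring_c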
>>>>> 
>>>>> However, this appears to be causing mpirun to think I have just 1 slot:
>>>>> 
>>>>> + mpirun -np 2 ./ring_c
>>>>> --------------------------------------------------------------------------
>>>>> There are not enough slots available in the system to satisfy the 2 slots 
>>>>> that were requested by the application:
>>>>>   ./ring_c
>>>>> 
>>>>> Either request fewer slots for your application, or make more slots 
>>>>> available
>>>>> for use.
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> In case they contain useful info, here are the PBS env vars in the job:
>>>>> 
>>>>> PBS_HT_NCPUS=32
>>>>> PBS_VERSION=TORQUE-2.3.13
>>>>> PBS_JOBNAME=qs
>>>>> PBS_ENVIRONMENT=PBS_BATCH
>>>>> PBS_HOME=/var/spool/torque
>>>>> PBS_O_WORKDIR=/usr/users/6/hargrove/SCRATCH/OMPI/openmpi-trunk-linux-x86_64-uv-trunk/BLD/examples
>>>>> PBS_PPN=16
>>>>> PBS_TASKNUM=1
>>>>> PBS_O_HOME=/usr/users/6/hargrove
>>>>> PBS_MOMPORT=15003
>>>>> PBS_O_QUEUE=debug
>>>>> PBS_O_LOGNAME=hargrove
>>>>> PBS_O_LANG=en_US.UTF-8
>>>>> PBS_JOBCOOKIE=9EEF5DF75FA705A241FEF66EDFE01C5B
>>>>> PBS_NODENUM=0
>>>>> PBS_O_SHELL=/usr/psc/shells/bash
>>>>> PBS_SERVER=tg-login1.blacklight.psc.teragrid.org
>>>>> PBS_JOBID=314827.tg-login1.blacklight.psc.teragrid.org
>>>>> PBS_NCPUS=16
>>>>> PBS_O_HOST=tg-login1.blacklight.psc.teragrid.org
>>>>> PBS_VNODENUM=0
>>>>> PBS_QUEUE=debug_r1
>>>>> PBS_O_MAIL=/var/mail/hargrove
>>>>> PBS_NODEFILE=/var/spool/torque/aux//314827.tg-login1.blacklight.psc.teragrid.org
>>>>> PBS_O_PATH=[...removed...]
>>>>> 
>>>>> If any additional info is needed to help make mpirun "just work", please 
>>>>> let me know.
>>>>> 
>>>>> However, at this point I am mostly interested in any work-arounds that 
>>>>> will let me run something other than a singleton on this system.
>>>>> 
>>>>> -Paul
>>>>> 
>>>>> -- 
>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>> Future Technologies Group
>>>>> Computer and Data Sciences Department     Tel: +1-510-495-2352
>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900