Can you try a newer version of OMPI, say the 3.0.0 release? Just curious to 
know if we perhaps “fixed” something relevant.


> On Oct 3, 2017, at 5:33 PM, Anthony Thyssen <a.thys...@griffith.edu.au> wrote:
> 
> FYI...
> 
> The problem is discussed further in 
> 
> Redhat Bugzilla: Bug 1321154 - numa enabled torque don't work
>    https://bugzilla.redhat.com/show_bug.cgi?id=1321154
> 
> I'd seen this previously, as it required me to add "num_node_boards=1" to
> each node in /var/lib/torque/server_priv/nodes to get torque to work at all.
> I had worked around the scheduling problem by munging "$PBS_NODES" (which
> comes out correct) into a host list containing the correct "slots=" counts.
> But of course, now that I have compiled OpenMPI using "--with-tm", that
> workaround should not be needed, as the host list is in fact ignored by
> OpenMPI in a Torque-PBS environment.
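(An illustrative sketch of the "munging" described above, not the author's
actual script: Torque's $PBS_NODEFILE lists each allocated host once per
slot, so per-host "slots=" counts for an Open MPI hostfile can be derived by
counting duplicate lines. The sample nodefile contents below are hypothetical.)

```shell
# Hypothetical stand-in for the file named by $PBS_NODEFILE:
# Torque writes one line per allocated slot.
cat > nodefile.txt <<'EOF'
node21.emperor
node21.emperor
node22.emperor
EOF

# Count duplicate hostnames to derive each host's "slots=" value,
# producing a hostfile usable as "mpirun --hostfile hostfile.txt ...".
sort nodefile.txt | uniq -c |
    awk '{printf "%s slots=%d\n", $2, $1}' > hostfile.txt

cat hostfile.txt
```

With a tm-aware Open MPI build ("--with-tm"), mpirun obtains the allocation
directly from Torque, so such a hostfile should not be necessary at all.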
> 
> However, it seems that ever since "NUMA" support was added to the Torque
> RPMs, it has been causing the current problems, which are still continuing.
> The last action on that bug is a new EPEL "test" version (August 2017),
> which I will try shortly.
> 
> Thank you for your help, though I am still open to suggestions for a
> replacement.
> 
>   Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>  --------------------------------------------------------------------------
>    Encryption... is a powerful defensive weapon for free people.
>    It offers a technical guarantee of privacy, regardless of who is
>    running the government... It's hard to think of a more powerful,
>    less dangerous tool for liberty.            --  Esther Dyson
>  --------------------------------------------------------------------------
> 
> 
> 
> On Wed, Oct 4, 2017 at 9:02 AM, Anthony Thyssen <a.thys...@griffith.edu.au> wrote:
> Thank you Gilles.  At least I now have something to follow through with.
> 
> As an FYI, the torque is the pre-built version from the Red Hat Extras (EPEL)
> archive:
> torque-4.2.10-10.el7.x86_64
> 
> Normally pre-built packages cause no problems, but not in this case.
> 
> 
> 
> 
> On Tue, Oct 3, 2017 at 3:39 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> Anthony,
> 
> 
> we had a similar issue reported some time ago (e.g. Open MPI ignores torque
> allocation),
> 
> and after quite some troubleshooting, we ended up with the same behavior 
> (e.g. pbsdsh is not working as expected).
> 
> see https://www.mail-archive.com/users@lists.open-mpi.org/msg29952.html
> for the last email.
> 
> 
> from an Open MPI point of view, i would consider the root cause is with your 
> torque install.
> 
> this case was reported at
> http://www.clusterresources.com/pipermail/torqueusers/2016-September/018858.html
> 
> and no conclusion was reached.
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> 
> On 10/3/2017 2:02 PM, Anthony Thyssen wrote:
> The stderr and stdout are saved to separate channels.
> 
> It is interesting that the output from pbsdsh is node21.emperor 5 times, even
> though $PBS_NODES lists the 5 individual nodes.
> 
> Attached are the two compressed files, as well as the pbs_hello batch used.
> 
> Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>  --------------------------------------------------------------------------
>   There are two types of encryption:
>     One that will prevent your sister from reading your diary, and
>     One that will prevent your government from reading it.  -- Bruce Schneier
>  --------------------------------------------------------------------------
> 
> 
> 
> 
> On Tue, Oct 3, 2017 at 2:39 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
>     Anthony,
> 
> 
>     in your script, can you
> 
> 
>     set -x
> 
>     env
> 
>     pbsdsh hostname
> 
>     mpirun --display-map --display-allocation --mca ess_base_verbose
>     10 --mca plm_base_verbose 10 --mca ras_base_verbose 10 hostname
> 
> 
>     and then compress and send the output ?
> 
> 
>     Cheers,
> 
> 
>     Gilles
> 
> 
>     On 10/3/2017 1:19 PM, Anthony Thyssen wrote:
> 
>         I noticed that too.  Though the submitting host for torque is
>         a different host (the main head node, "shrek"), "node21" is the
>         host on which torque runs the batch script (and the mpirun
>         command), it being the first node in the "dualcore" resource group.
> 
>         Adding option...
> 
>         It fixed the hostname in the allocation map, though had no
>         effect on the outcome.  The allocation is still simply ignored.
> 
>         =======8<--------CUT HERE----------
>         PBS Job Number       9000
>         PBS batch run on     node21.emperor
>         Time it was started  2017-10-03_14:11:20
>         Current Directory    /net/shrek.emperor/home/shrek/anthony
>         Submitted work dir   /home/shrek/anthony/mpi-pbs
>         Number of Nodes      5
>         Nodefile List       /var/lib/torque/aux//9000.shrek.emperor
>         node21.emperor
>         node25.emperor
>         node24.emperor
>         node23.emperor
>         node22.emperor
>         ---------------------------------------
> 
>         ======================  ALLOCATED NODES  ======================
>         node21.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>         node25.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>         node24.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>         node23.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>         node22.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>         =================================================================
>         node21.emperor
>         node21.emperor
>         node21.emperor
>         node21.emperor
>         node21.emperor
>         =======8<--------CUT HERE----------
> 
> 
>           Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>          --------------------------------------------------------------------------
>            The equivalent of an armoured car should always be used to
>            protect any secret kept in a cardboard box.
>            -- Anthony Thyssen, On the use of Encryption
>          --------------------------------------------------------------------------
> 
> 
> 
> 
>         _______________________________________________
>         users mailing list
>         users@lists.open-mpi.org
>         https://lists.open-mpi.org/mailman/listinfo/users
> 
> 
> 
> 
> 
> 
> 
> 

