Sorry  r...@open-mpi.org  -- as Gilles Gouaillardet pointed out to me, the
problem wasn't OpenMPI but the specific EPEL Torque packaging (see
Redhat Bugzilla 1321154).

Today the server could finally be taken down for maintenance, and I wanted
to try a few things.

After installing torque-4.2.10-11.el7 from the EPEL Testing repository,
I found that all the nodes were reported as 'down', even though everything
appeared to be running and there were no errors in the error logs.

After a lot of trial, error and research, I eventually (on a whim) decided
to remove the "num_node_boards=1" entry from the
"torque/server_priv/nodes" file and restart the server and scheduler.
 Suddenly the nodes were "free" and my initial test job ran.
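
For the record, the change amounted to roughly the following (a sketch of
what I ran; the node entry format and the systemd unit names are from my
CentOS 7 / EPEL install and may differ elsewhere):

    # drop the num_node_boards=1 attribute from every node entry,
    # e.g.  "node21.emperor np=2 dualcore num_node_boards=1"
    #   ->  "node21.emperor np=2 dualcore"
    sed -i 's/ *num_node_boards=1//' /var/lib/torque/server_priv/nodes

    # restart the Torque server and scheduler so the nodes file is re-read
    systemctl restart pbs_server pbs_sched

    # the nodes should now report "state = free" instead of "state = down"
    pbsnodes -a | grep -E '^[^ ]|state ='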

Perhaps the EPEL-Testing Torque 4.2.10-11 build does not include NUMA support?
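
I don't know of a definitive way to confirm that from the installed
package, but these are the quick (and admittedly crude) checks I would
try; the package name and binary path are guesses based on the EPEL
packaging:

    # look for NUMA-related entries in the package changelog
    rpm -q --changelog torque-server | grep -i numa

    # crude check of the server binary for NUMA-related strings
    strings /usr/sbin/pbs_server | grep -i numa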

All later tests (with OpenMPI, the RHEL SRPM 1.10.6-2 re-compiled
"--with-tm") now respond to the Torque node allocation correctly and no
longer simply run all the jobs on the first node.

That is, $PBS_NODEFILE, "pbsdsh hostname" and "mpirun hostname" are all in
agreement.
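
For anyone wanting to reproduce the check, something like this in the
batch script is enough (a sketch, not my exact pbs_hello script):

    # nodes Torque allocated to this job, with per-node slot counts
    sort $PBS_NODEFILE | uniq -c

    # where Torque's own task launcher actually runs things
    pbsdsh hostname | sort | uniq -c

    # where Open MPI (built --with-tm) launches its processes
    mpirun hostname | sort | uniq -c

    # all three should list the same nodes with the same counts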

Thank you all for your help, and for putting up with me.

  Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
 --------------------------------------------------------------------------
  "Around here we've got a name for people what talks to dragons."
  "Traitor?"  Wiz asked apprehensively.
  "No.  Lunch."                     -- Rick Cook, "Wizadry Consulted"
 --------------------------------------------------------------------------


On Wed, Oct 4, 2017 at 11:43 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:

> Can you try a newer version of OMPI, say the 3.0.0 release? Just curious
> to know if we perhaps “fixed” something relevant.
>
>
> On Oct 3, 2017, at 5:33 PM, Anthony Thyssen <a.thys...@griffith.edu.au>
> wrote:
>
> FYI...
>
> The problem is discussed further in
>
> Redhat Bugzilla: Bug 1321154 - numa enabled torque don't work
>    https://bugzilla.redhat.com/show_bug.cgi?id=1321154
>
> I'd seen this previously, as it required me to add "num_node_boards=1" to
> each node in /var/lib/torque/server_priv/nodes just to get torque to work
> at all.  Specifically, I worked around the allocation problem by munging
> $PBS_NODES (which comes out correct) into a host list containing the
> correct "slots=" counts.  But of course, now that I have compiled OpenMPI
> using "--with-tm", that should not have been needed, as the host list is
> in fact ignored by OpenMPI in a Torque-PBS
> environment.
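>
> (The munging itself was nothing special: just turning the PBS nodefile
> into an Open MPI hostfile with slots= counts, roughly along the lines of
> the sketch below; the program name is only a placeholder.)
>
>     # count how often each node appears in the PBS nodefile and
>     # write an Open MPI hostfile with matching slots= entries
>     sort $PBS_NODEFILE | uniq -c | awk '{print $2" slots="$1}' > my_hostfile
>     mpirun --hostfile my_hostfile ./my_mpi_program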
>
> However, it seems that ever since "NUMA" support was added to the Torque
> RPMs it has also been causing the current problems, and that is still
> continuing.   The last action on the bug is a new EPEL "test" version
> (August 2017), which I will try shortly.
>
> Thank you for your help, though I am still open to suggestions for a
> replacement.
>
>   Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>  --------------------------------------------------------------------------
>    Encryption... is a powerful defensive weapon for free people.
>    It offers a technical guarantee of privacy, regardless of who is
>    running the government... It's hard to think of a more powerful,
>    less dangerous tool for liberty.            --  Esther Dyson
>  --------------------------------------------------------------------------
>
>
>
> On Wed, Oct 4, 2017 at 9:02 AM, Anthony Thyssen <a.thys...@griffith.edu.au> wrote:
>
>> Thank you Gilles.  At least I now have something to follow though with.
>>
>> As an FYI, the torque in use is the pre-built version from the EPEL
>> (Extra Packages for Enterprise Linux) repository:
>> torque-4.2.10-10.el7.x86_64
>>
>> Normally pre-built packages give no problems, but not in this case.
>>
>>
>>
>>
>> On Tue, Oct 3, 2017 at 3:39 PM, Gilles Gouaillardet <gil...@rist.or.jp>
>> wrote:
>>
>>> Anthony,
>>>
>>>
>>> we had a similar issue reported some time ago (e.g. Open MPI ignores
>>> the torque allocation),
>>>
>>> and after quite some troubleshooting, we ended up with the same behavior
>>> (e.g. pbsdsh is not working as expected).
>>>
>>> see https://www.mail-archive.com/users@lists.open-mpi.org/msg29952.html
>>> for the last email.
>>>
>>>
>>> from an Open MPI point of view, I would consider the root cause to be
>>> with your torque install.
>>>
>>> this case was reported at
>>> http://www.clusterresources.com/pipermail/torqueusers/2016-September/018858.html
>>>
>>> and no conclusion was reached.
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>>>
>>> On 10/3/2017 2:02 PM, Anthony Thyssen wrote:
>>>
>>>> The stdout and stderr are saved to separate channels.
>>>>
>>>> It is interesting that the output from pbsdsh is node21.emperor 5
>>>> times, even though $PBS_NODES lists the 5 individual nodes.
>>>>
>>>> Attached are the two compressed files, as well as the pbs_hello batch
>>>> used.
>>>>
>>>> Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>>>>  --------------------------------------------------------------------------
>>>>   There are two types of encryption:
>>>>     One that will prevent your sister from reading your diary, and
>>>>     One that will prevent your government.           -- Bruce Schneier
>>>>  --------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Oct 3, 2017 at 2:39 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>
>>>>     Anthony,
>>>>
>>>>
>>>>     in your script, can you
>>>>
>>>>
>>>>     set -x
>>>>
>>>>     env
>>>>
>>>>     pbsdsh hostname
>>>>
>>>>     mpirun --display-map --display-allocation --mca ess_base_verbose
>>>>     10 --mca plm_base_verbose 10 --mca ras_base_verbose 10 hostname
>>>>
>>>>
>>>>     and then compress and send the output ?
>>>>
>>>>
>>>>     Cheers,
>>>>
>>>>
>>>>     Gilles
>>>>
>>>>
>>>>     On 10/3/2017 1:19 PM, Anthony Thyssen wrote:
>>>>
>>>>         I noticed that too.  Though the submitting host for torque is
>>>>         a different host (the main head node, "shrek"), "node21" is the
>>>>         host on which torque runs the batch script (and the mpirun
>>>>         command), it being the first node in the "dualcore" resource
>>>>         group.
>>>>
>>>>         Adding option...
>>>>
>>>>         It fixed the hostname in the allocation map, though it had no
>>>>         effect on the outcome.  The allocation is still simply ignored.
>>>>
>>>>         =======8<--------CUT HERE----------
>>>>         PBS Job Number       9000
>>>>         PBS batch run on     node21.emperor
>>>>         Time it was started  2017-10-03_14:11:20
>>>>         Current Directory    /net/shrek.emperor/home/shrek/anthony
>>>>         Submitted work dir   /home/shrek/anthony/mpi-pbs
>>>>         Number of Nodes      5
>>>>         Nodefile List       /var/lib/torque/aux//9000.shrek.emperor
>>>>         node21.emperor
>>>>         node25.emperor
>>>>         node24.emperor
>>>>         node23.emperor
>>>>         node22.emperor
>>>>         ---------------------------------------
>>>>
>>>>         ======================  ALLOCATED NODES  ======================
>>>>         node21.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>>         node25.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>>         node24.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>>         node23.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>>         node22.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>>         =================================================================
>>>>         node21.emperor
>>>>         node21.emperor
>>>>         node21.emperor
>>>>         node21.emperor
>>>>         node21.emperor
>>>>         =======8<--------CUT HERE----------
>>>>
>>>>
>>>>           Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>>>>          --------------------------------------------------------------------------
>>>>            The equivalent of an armoured car should always be used to
>>>>            protect any secret kept in a cardboard box.
>>>>            -- Anthony Thyssen, On the use of Encryption
>>>>          --------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>>