Sorry r...@open-mpi.org, as Gilles Gouaillardet pointed out to me, the problem
wasn't OpenMPI but the specific EPEL implementation (see Redhat Bugzilla 1321154).
Today the server was able to be taken down for maintenance, and I wanted to
try a few things. After installing the EPEL Testing Repo torque-4.2.10-11.el7,
I found that all the nodes were 'down' even though everything appeared to be
running, with no errors in the error logs. After a lot of trial, error and
research, I eventually (on a whim) decided to remove the "num_node_boards=1"
entry from the "torque/server_priv/nodes" file and restart the server &
scheduler. Suddenly the nodes were "free" and my initial test job ran.
Perhaps the EPEL-Test Torque 4.2.10-11 does not contain NUMA support?

ALL later tests (with OpenMPI, the RHEL SRPM 1.10.6-2 re-compiled
"--with-tm") now respond to the Torque node allocation correctly, and no
longer simply run all the jobs on the first node. That is, $PBS_NODEFILE,
pbsdsh hostname and mpirun hostname are all in agreement.

Thank you all for your help, and for putting up with me.

  Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
 --------------------------------------------------------------------------
   "Around here we've got a name for people what talks to dragons."
   "Traitor?" Wiz asked apprehensively.
   "No. Lunch."                        -- Rick Cook, "Wizardry Consulted"
 --------------------------------------------------------------------------

On Wed, Oct 4, 2017 at 11:43 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:

> Can you try a newer version of OMPI, say the 3.0.0 release? Just curious
> to know if we perhaps “fixed” something relevant.
>
> On Oct 3, 2017, at 5:33 PM, Anthony Thyssen <a.thys...@griffith.edu.au>
> wrote:
>
> FYI...
>
> The problem is discussed further in
>
>   Redhat Bugzilla: Bug 1321154 - numa enabled torque don't work
>   https://bugzilla.redhat.com/show_bug.cgi?id=1321154
>
> I'd seen this previously, as it required me to add "num_node_boards=1" to
> each node in the /var/lib/torque/server_priv/nodes file to get torque to
> at least work.
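[For context, a sketch of the kind of nodes-file entry being discussed. The
hostname comes from this thread; the np value is an assumption, and
num_node_boards=1 is the attribute whose removal brought the nodes back up:]

```
# /var/lib/torque/server_priv/nodes -- one line per compute node
node21.emperor np=2 num_node_boards=1
```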
> Specifically by munging the $PBS_NODEFILE (which comes out correct) into a
> host list containing the correct "slots=" counts. But of course now that I
> have compiled OpenMPI using "--with-tm" that should not have been needed,
> as it is in fact now ignored by OpenMPI in a Torque-PBS environment.
>
> However it seems that ever since "NUMA" support was added into the Torque
> RPMs, it has caused the current problems, which are still continuing. The
> latest action is a new EPEL "test" version (August 2017), which I will try
> shortly.
>
> Thank you for your help, though I am still open to suggestions for a
> replacement.
>
>   Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>  --------------------------------------------------------------------------
>   Encryption... is a powerful defensive weapon for free people.
>   It offers a technical guarantee of privacy, regardless of who is
>   running the government... It's hard to think of a more powerful,
>   less dangerous tool for liberty.                  -- Esther Dyson
>  --------------------------------------------------------------------------
>
> On Wed, Oct 4, 2017 at 9:02 AM, Anthony Thyssen <a.thys...@griffith.edu.au>
> wrote:
>
>> Thank you Gilles. At least I now have something to follow through with.
>>
>> As a FYI, the torque is the pre-built version from the Redhat Extras
>> (EPEL) archive.
>>   torque-4.2.10-10.el7.x86_64
>>
>> Normally pre-built packages have no problems, but not in this case.
>>
>> On Tue, Oct 3, 2017 at 3:39 PM, Gilles Gouaillardet <gil...@rist.or.jp>
>> wrote:
>>
>>> Anthony,
>>>
>>> we had a similar issue reported some time ago (e.g. Open MPI ignores
>>> torque allocation),
>>>
>>> and after quite some troubleshooting, we ended up with the same behavior
>>> (e.g. pbsdsh is not working as expected).
>>>
>>> see https://www.mail-archive.com/users@lists.open-mpi.org/msg29952.html
>>> for the last email.
>>>
>>> from an Open MPI point of view, I would consider the root cause to be
>>> with your torque install.
>>>
>>> this case was reported at
>>> http://www.clusterresources.com/pipermail/torqueusers/2016-September/018858.html
>>>
>>> and no conclusion was reached.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 10/3/2017 2:02 PM, Anthony Thyssen wrote:
>>>
>>>> The stdin and stdout are saved to separate channels.
>>>>
>>>> It is interesting that the output from pbsdsh is node21.emperor 5
>>>> times, even though $PBS_NODEFILE lists the 5 individual nodes.
>>>>
>>>> Attached are the two compressed files, as well as the pbs_hello batch
>>>> script used.
>>>>
>>>>   Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>>>>  --------------------------------------------------------------------------
>>>>   There are two types of encryption:
>>>>   One that will prevent your sister from reading your diary, and
>>>>   One that will prevent your government.        -- Bruce Schneier
>>>>  --------------------------------------------------------------------------
>>>>
>>>> On Tue, Oct 3, 2017 at 2:39 PM, Gilles Gouaillardet <gil...@rist.or.jp>
>>>> wrote:
>>>>
>>>>   Anthony,
>>>>
>>>>   in your script, can you
>>>>
>>>>     set -x
>>>>     env
>>>>     pbsdsh hostname
>>>>     mpirun --display-map --display-allocation --mca ess_base_verbose 10 \
>>>>            --mca plm_base_verbose 10 --mca ras_base_verbose 10 hostname
>>>>
>>>>   and then compress and send the output?
>>>>
>>>>   Cheers,
>>>>
>>>>   Gilles
>>>>
>>>>   On 10/3/2017 1:19 PM, Anthony Thyssen wrote:
>>>>
>>>>     I noticed that too. Though the submitting host for torque is
>>>>     a different host (the main head node, "shrek"), "node21" is the
>>>>     host on which torque runs the batch script (and the mpirun
>>>>     command), it being the first node in the "dualcore" resource
>>>>     group.
>>>>
>>>>     Adding option...
>>>>
>>>>     It fixed the hostname in the allocation map, though it had no
>>>>     effect on the outcome. The allocation is still simply ignored.
>>>>
>>>>     =======8<--------CUT HERE----------
>>>>     PBS Job Number       9000
>>>>     PBS batch run on     node21.emperor
>>>>     Time it was started  2017-10-03_14:11:20
>>>>     Current Directory    /net/shrek.emperor/home/shrek/anthony
>>>>     Submitted work dir   /home/shrek/anthony/mpi-pbs
>>>>     Number of Nodes      5
>>>>     Nodefile List        /var/lib/torque/aux//9000.shrek.emperor
>>>>     node21.emperor
>>>>     node25.emperor
>>>>     node24.emperor
>>>>     node23.emperor
>>>>     node22.emperor
>>>>     ---------------------------------------
>>>>
>>>>     ======================   ALLOCATED NODES   ======================
>>>>     node21.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>>     node25.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>>     node24.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>>     node23.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>>     node22.emperor: slots=1 max_slots=0 slots_inuse=0 state=UP
>>>>     =================================================================
>>>>     node21.emperor
>>>>     node21.emperor
>>>>     node21.emperor
>>>>     node21.emperor
>>>>     node21.emperor
>>>>     =======8<--------CUT HERE----------
>>>>
>>>>   Anthony Thyssen ( System Programmer )    <a.thys...@griffith.edu.au>
>>>>  --------------------------------------------------------------------------
>>>>   The equivalent of an armoured car should always be used to
>>>>   protect any secret kept in a cardboard box.
>>>>                  -- Anthony Thyssen, On the use of Encryption
>>>>  --------------------------------------------------------------------------
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users@lists.open-mpi.org
>>>> https://lists.open-mpi.org/mailman/listinfo/users
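[Editorial sketch: earlier in the thread Anthony mentions a workaround of
munging $PBS_NODEFILE into a host list with "slots=" counts, for OpenMPI
builds without tm support. A minimal, standalone version of that munging is
below; the nodefile contents are fabricated so the snippet runs outside
Torque, whereas a real job would read the file named by $PBS_NODEFILE:]

```shell
#!/bin/sh
# Sketch: turn a Torque nodefile (one hostname per allocated slot) into
# an OpenMPI-style "host slots=N" list. The hostnames below are invented
# for illustration; inside a real job, use "$PBS_NODEFILE" instead.
NODEFILE=$(mktemp)
cat > "$NODEFILE" <<'EOF'
node21.emperor
node21.emperor
node22.emperor
node23.emperor
EOF
# Count repeated hostnames into slot counts.
sort "$NODEFILE" | uniq -c | awk '{printf "%s slots=%d\n", $2, $1}'
# Prints:
#   node21.emperor slots=2
#   node22.emperor slots=1
#   node23.emperor slots=1
rm -f "$NODEFILE"
```

[Such a list could be fed to mpirun via its --hostfile option; as the thread
concludes, an OpenMPI built "--with-tm" reads the Torque allocation directly
and makes this munging unnecessary.]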