Hi Ralph,

After studying PRRTE a bit, I tried something new and followed
the description here, using OpenMPI 4:
https://pmix.org/code/building-the-pmix-reference-server/

I configured OpenMPI 4.0.0rc3 as follows:

../configure --enable-debug --prefix [...] --with-pmix=[...] \
  --with-libevent=/usr --with-ompi-mpix-rte
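Open MPI's configure is generated by autoconf. Autoconf does not stop on a --with-*/--enable-* option it does not know. It only prints a single `configure: WARNING: unrecognized options: ...` line near the end of its output, which is easy to miss. Before reading too much into the result of a command line like the one above, it may be worth ruling that out. The following is a minimal sketch for pulling that line out of saved configure output. The helper name is made up for illustration.

```python
import re
from typing import List

# Autoconf-generated configure scripts report unknown --with-*/--enable-*
# flags in one warning line of the form:
#   configure: WARNING: unrecognized options: --with-foo, --enable-bar
_UNRECOGNIZED = re.compile(r"WARNING: unrecognized options: (.*)$", re.MULTILINE)


def unrecognized_options(configure_output: str) -> List[str]:
    """Return every option configure reported as unrecognized (empty if none)."""
    found: List[str] = []
    for match in _UNRECOGNIZED.finditer(configure_output):
        # autoconf joins the offending option names with ", "
        found.extend(opt.strip() for opt in match.group(1).split(",") if opt.strip())
    return found
```

Usage: capture the output with `../configure ... 2>&1 | tee configure.out`, then call `unrecognized_options(open("configure.out").read())`. Re-running configure with `--enable-option-checking=fatal` turns the same warning into a hard error.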

(I also tried setting --with-orte=no, but configure then complains
that there is no suitable RTE and does not finish.)

I then started my own PMIx server and spawned a client, compiled with
the mpicc of the new OpenMPI installation, with this environment:

PMIX_NAMESPACE=namespace_3228_0_0
PMIX_RANK=0
PMIX_SERVER_URI2=pmix-server.3234;tcp4://127.0.0.1:49637
PMIX_SERVER_URI21=pmix-server.3234;tcp4://127.0.0.1:49637
PMIX_SERVER_URI=pmix-server:3234:/tmp/pmix-3234
PMIX_SERVER_URI2USOCK=pmix-server:3234:/tmp/pmix-3234
PMIX_SECURITY_MODE=native,none
PMIX_PTL_MODULE=tcp,usock
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_FULLY_DESC
PMIX_GDS_MODULE=ds12,hash
PMIX_DSTORE_ESH_BASE_PATH=/tmp/pmix_dstor_3234
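The spawn step itself comes down to giving the client the launcher's environment with the PMIX_* variables above overlaid on it. The following is a minimal Python sketch of that step, for illustration only. `build_client_env` and `launch_client` are hypothetical names. In a real host daemon, the PMIX_* entries would come from the PMIx server library rather than from a hand-written dict; in C, `PMIx_server_setup_fork()` fills in the environment for a given process.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional


def build_client_env(base: Mapping[str, str],
                     pmix_vars: Mapping[str, str]) -> Dict[str, str]:
    """Copy the launcher's environment and overlay the PMIx variables for one rank."""
    env = dict(base)
    env.update(pmix_vars)  # PMIx-provided values take precedence over inherited ones
    return env


def launch_client(argv: List[str], pmix_vars: Mapping[str, str],
                  base: Optional[Mapping[str, str]] = None) -> subprocess.Popen:
    """Hypothetical launcher step: start one client process with the overlaid environment."""
    env = build_client_env(os.environ if base is None else base, pmix_vars)
    return subprocess.Popen(argv, env=env)


# A subset of the values from the listing above (rank 0 of namespace_3228_0_0).
# Assumption: in a real server these come from PMIx_server_setup_fork(), not a literal.
PMIX_VARS = {
    "PMIX_NAMESPACE": "namespace_3228_0_0",
    "PMIX_RANK": "0",
    "PMIX_SERVER_URI2": "pmix-server.3234;tcp4://127.0.0.1:49637",
    "PMIX_SERVER_URI": "pmix-server:3234:/tmp/pmix-3234",
}
```

Usage: `launch_client(["./hello_env"], PMIX_VARS).wait()`. Overlaying the variables, rather than building a fresh environment, keeps PATH, LD_LIBRARY_PATH and similar settings that the client still needs.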

The client does not connect to my PMIx server, and its environment
after MPI_Init looks like this:

PMIX_SERVER_URI2USOCK=pmix-server:3234:/tmp/pmix-3234
PMIX_RANK=0
PMIX_PTL_MODULE=tcp,usock
PMIX_SERVER_URI=pmix-server:3234:/tmp/pmix-3234
PMIX_MCA_mca_base_component_show_load_errors=1
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_FULLY_DESC
PMIX_DSTORE_ESH_BASE_PATH=/tmp/ompi.landroval.1001/pid.3243/pmix_dstor_3243
PMIX_SERVER_URI2=864157696.0;tcp4://127.0.0.1:33619
PMIX_SERVER_URI21=864157696.0;tcp4://127.0.0.1:33619
PMIX_SECURITY_MODE=native,none
PMIX_NAMESPACE=864157697
PMIX_GDS_MODULE=ds12,hash
ORTE_SCHIZO_DETECTION=ORTE
OMPI_COMMAND=./hello_env
OMPI_MCA_orte_precondition_transports=f28d6577f6b6ac08-d92c0e73869e1cfa
OMPI_MCA_orte_launch=1
OMPI_APP_CTX_NUM_PROCS=1
OMPI_MCA_pmix=^s1,s2,cray,isolated
OMPI_MCA_ess=singleton
OMPI_MCA_orte_ess_num_procs=1
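Diffing the two listings makes the override explicit. The following is a small sketch that parses both dumps (one KEY=VALUE per line, as above) and lists the PMIX_* variables that changed, appeared or vanished. `server_id` assumes, based only on the values shown here, that the text before the ';' in PMIX_SERVER_URI2 identifies the server. That reading is not taken from the PMIx documentation.

```python
from typing import Dict, List, Tuple


def parse_env_dump(text: str) -> Dict[str, str]:
    """Parse 'KEY=VALUE' lines; blank or malformed lines are skipped."""
    env: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            env[key] = value  # split on the first '=' only
    return env


def changed_pmix_vars(before: Dict[str, str],
                      after: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (name, old, new) for each PMIX_* variable that changed, appeared or vanished."""
    names = sorted(k for k in set(before) | set(after) if k.startswith("PMIX_"))
    return [(k, before.get(k, "<unset>"), after.get(k, "<unset>"))
            for k in names if before.get(k) != after.get(k)]


def server_id(uri2: str) -> str:
    """Text before the first ';' of PMIX_SERVER_URI2 (assumed to name the server)."""
    return uri2.split(";", 1)[0]
```

When run on the two full listings above, it should report PMIX_DSTORE_ESH_BASE_PATH, PMIX_MCA_mca_base_component_show_load_errors, PMIX_NAMESPACE, PMIX_SERVER_URI2 and PMIX_SERVER_URI21. PMIX_SERVER_URI and PMIX_SERVER_URI2USOCK are identical in both listings. `server_id` changes from pmix-server.3234 to 864157696.0.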

So something goes wrong, but I have no idea what I am missing. Do
you know what I need to change? Do I have to set an MCA parameter to
tell OpenMPI not to start orted, or does the client need another hint
in its environment besides what comes from the PMIx server helper
library?
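For reference on the MCA side: OpenMPI picks up MCA parameters from environment variables named OMPI_MCA_<param>. The post-MPI_Init listing above therefore already shows what was set for the client, for example ess=singleton, pmix=^s1,s2,cray,isolated and a few orte_* values. The following is a small sketch for extracting those settings and for building such variables for a launcher. Which parameter, if any, avoids the extra server is exactly the open question here. The `ess=pmi` value used in the check below is only an illustration taken from the srun observation, not a known fix.

```python
from typing import Dict, Mapping

PREFIX = "OMPI_MCA_"  # OpenMPI reads MCA parameters from OMPI_MCA_<name> variables


def mca_params_from_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Return {param_name: value} for every OMPI_MCA_* variable in env."""
    return {k[len(PREFIX):]: v for k, v in env.items() if k.startswith(PREFIX)}


def mca_env(params: Mapping[str, str]) -> Dict[str, str]:
    """Inverse: turn {param_name: value} into OMPI_MCA_* variables for a child environment."""
    return {PREFIX + name: value for name, value in params.items()}
```

Non-MCA variables such as ORTE_SCHIZO_DETECTION or PMIX_RANK are ignored by `mca_params_from_env`.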


Stephan


On Tuesday, Oct 10 2018, 08:33 -0700 Ralph H Castain wrote:
> Hi Stephan
> 
> Thanks for the clarification - that helps a great deal. You are
> correct that OMPI’s orted daemons do more than just host the PMIx
> server library. However, they are only active if you launch the OMPI
> processes using mpirun. This is probably the source of the trouble
> you are seeing.
> 
> Since you have a process launcher and have integrated the PMIx server
> support into your RM’s daemons, you really have no need for mpirun at
> all. You should just be able to launch the processes directly using
> your own launcher. The PMIx support will take care of the startup
> requirements. The application procs will not use the orted in such
> cases.
> 
> So if your system is working fine with the PMIx example programs,
> then just launch the OMPI apps the same way and it should just work.
> 
> On the Slurm side: I’m surprised that it doesn’t work without the
> --with-slurm option. An application proc doesn’t care about any of the
> Slurm-related code if PMIx is available. I might have access to a
> machine where I can check it…
> 
> Ralph
> 
> 
> > On Oct 9, 2018, at 3:26 AM, Stephan Krempel <krem...@par-tec.com>
> > wrote:
> > 
> > Ralph, Gilles,
> > 
> > thanks for your input.
> > 
> > Before I answer, let me briefly explain my general intention.
> > We have our own resource manager and process launcher, which
> > supports different MPI implementations in different ways. I want to
> > adapt it to PMIx to cleanly support OpenMPI, and hopefully other MPI
> > implementations supporting PMIx in the future, too.
> > 
> > > It sounds like what you really want to do is replace the orted,
> > > and have your orted open your PMIx server? In other words, you
> > > want to use the PMIx reference library to handle all the PMIx
> > > stuff, and provide your own backend functions to support the PMIx
> > > server calls?
> > 
> > You are right, that was my original plan, and I have already come
> > quite far with it. In my environment I can already launch processes
> > that successfully call PMIx client functions like put, get, fence
> > and so on, all handled by my servers using the PMIx server helper
> > library. With the server functions I have implemented so far, all
> > the example programs that come with the PMIx library work fine.
> > 
> > Then I tried to use that with OpenMPI and ran into trouble.
> > My first idea was to simply replace orted, but after taking a closer
> > look at OpenMPI it seems to me that it uses/needs orted not only for
> > spawning and exchanging process information, but also for its
> > general communication and collectives. Am I wrong about that?
> > 
> > So replacing it completely is perhaps not what I want, since I do
> > not intend to replace all of OpenMPI's communication machinery. But
> > perhaps I am mixing up orte and orted here; I am not certain about
> > that.
> > 
> > > If so, then your best bet would be to edit the PRRTE code in
> > > orte/orted/pmix and replace it with your code. You’ll have to deal
> > > with the ORTE data objects and PRRTE’s launch procedure, but that
> > > is likely easier than trying to write your own version of “orted”
> > > from scratch.
> > 
> > I think one problem here is that I do not really understand which
> > purposes orted fulfills overall, especially besides implementing the
> > PMIx server side. Could you please give me a short overview?
> > 
> > > As for Slurm: it behaves the same way as PRRTE. It has a plugin
> > > that implements the server backend functions, and the Slurm daemons
> > > “host” the plugin. What you would need to do is replace that plugin
> > > with your own.
> > 
> > I understand that, but it also seems to need some special support
> > from the various Slurm modules on the OpenMPI side that I do not
> > understand yet. At least when I tried OpenMPI without Slurm support
> > and `srun --mpi=pmix_v2`, it did not work and generated a message
> > saying that Slurm support in OpenMPI is missing.
> > 
> > 
> > 
> > Stephan
> > 
> > 
> > 
> > > > On Oct 8, 2018, at 5:36 PM, Gilles Gouaillardet <gil...@rist.or.jp>
> > > > wrote:
> > > > 
> > > > Stephan,
> > > > 
> > > > 
> > > > Have you already checked https://github.com/pmix/prrte ?
> > > > 
> > > > 
> > > > This is the PMIx Reference RunTime Environment (PRRTE), which
> > > > was built on top of orted.
> > > > 
> > > > Long story short, it deploys the PMIx server and then you start
> > > > your MPI app with prun.
> > > > An example is available at
> > > > https://github.com/pmix/prrte/blob/master/contrib/travis/test_client.sh
> > > > 
> > > > 
> > > > Cheers,
> > > > 
> > > > Gilles
> > > > 
> > > > 
> > > > On 10/9/2018 8:45 AM, Stephan Krempel wrote:
> > > > > Hello everyone,
> > > > > 
> > > > > I am currently implementing a PMIx server and am trying to use
> > > > > it with OpenMPI. I have my own mpiexec, which starts my PMIx
> > > > > server and launches the processes.
> > > > > 
> > > > > If I launch an executable linked against OpenMPI, during
> > > > > MPI_Init() the ORTE layer starts another PMIx server and
> > > > > overrides my PMIX_* environment, so this new server is used
> > > > > instead of mine.
> > > > > 
> > > > > So I am looking for a way to prevent orte(d) from starting a
> > > > > PMIx server.
> > > > > 
> > > > > I already tried to understand what the Slurm support is doing,
> > > > > since this is (at least in part) what I think I need. Somehow,
> > > > > when starting a job with srun --mpi=pmix_v2, the ess module pmi
> > > > > is started, but I was not able to force that manually by setting
> > > > > an MCA parameter (ess should be the correct one?!?).
> > > > > And I do not yet have a clue how the Slurm support works.
> > > > > 
> > > > > So does anyone have a hint where I can find documentation or
> > > > > information about this, or is there an easy way to achieve what
> > > > > I am trying to do that I missed?
> > > > > 
> > > > > Thank you in advance.
> > > > > 
> > > > > Regards,
> > > > > 
> > > > > Stephan
> > > > > _______________________________________________
> > > > > devel mailing list
> > > > > devel@lists.open-mpi.org
> > > > > https://lists.open-mpi.org/mailman/listinfo/devel
> > > > > 
> > > > 
> > > 
> 