Brian,

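As Ralph already stated, this is likely an hwloc API issue (the two
hosts run different hwloc versions). To check, run this from debian9:

lstopo --of xml | ssh debian8 lstopo --if xml -i -

If lstopo on debian8 fails to load the XML, that will likely confirm the
API error.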
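
A minimal sketch of the same check split into two steps, so the exit
status of the import is easy to see. The function name check_topo_import
and the temporary file are illustrative assumptions, not part of hwloc or
Open MPI; only the lstopo options from the one-liner are used.

```shell
#!/bin/sh
# Hypothetical helper (not part of hwloc or Open MPI): export the local
# topology as XML, then ask lstopo on a remote host to load it.
# Usage: check_topo_import <remote-host>
check_topo_import() {
    remote="$1"
    xml=$(mktemp) || return 2

    # Step 1: dump the local topology (same options as the one-liner).
    if ! lstopo --of xml > "$xml"; then
        echo "local export failed"
        rm -f "$xml"
        return 2
    fi

    # Step 2: feed the XML to the remote lstopo on stdin.
    if ssh "$remote" lstopo --if xml -i - < "$xml" > /dev/null 2>&1; then
        echo "import ok on $remote"
        status=0
    else
        echo "import FAILED on $remote"
        status=1
    fi

    rm -f "$xml"
    return "$status"
}
```

For instance, running check_topo_import deb8host on deb9host reports
whether the older lstopo accepted the XML produced by the newer one.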

If you are willing to dig a bit deeper, you can add some printf calls
to opal_hwloc_unpack (in opal/mca/hwloc/base/hwloc_base_dt.c) to
figure out exactly where the failure occurs.

Meanwhile, you can move forward by using the embedded hwloc on both
distros (--with-hwloc=internal or no --with-hwloc option at all).


Note that we strongly discourage configuring with --with-FOO=/usr
(it explicitly adds /usr/include and /usr/lib[64] to the search path,
and might hide other external libraries installed in a non-standard
location). To force the use of the external hwloc library installed
in the default location, --with-hwloc=external is what you need (the
same applies to libevent and pmix).
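
As a rough sketch of that substitution, the helper below rewrites a list
of configure options, replacing =/usr with =external for hwloc, libevent
and pmix only. The name fix_with_usr is hypothetical; it is not an Open
MPI tool, and every other option is passed through unchanged.

```shell
#!/bin/sh
# Hypothetical helper: print each configure option on its own line,
# turning --with-{hwloc,libevent,pmix}=/usr into --with-...=external.
# All other options are printed as given.
fix_with_usr() {
    for arg in "$@"; do
        case "$arg" in
            --with-hwloc=/usr|--with-libevent=/usr|--with-pmix=/usr)
                # Strip the "=/usr" suffix and append "=external".
                printf '%s\n' "${arg%=/usr}=external" ;;
            *)
                printf '%s\n' "$arg" ;;
        esac
    done
}
```

For example, fix_with_usr --with-verbs --with-hwloc=/usr prints
--with-verbs and --with-hwloc=external, one per line.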


Cheers,

Gilles
On Sun, Jul 22, 2018 at 7:52 AM r...@open-mpi.org <r...@open-mpi.org> wrote:
>
> More than likely the problem is the difference in hwloc versions - sounds 
> like the topology to/from xml is different between the two versions, and the 
> older one doesn’t understand the new one.
>
> > On Jul 21, 2018, at 12:04 PM, Brian Smith <bsm...@systemfabricworks.com> 
> > wrote:
> >
> > Greetings,
> >
> > I'm having trouble getting openmpi 2.1.2 to work when launching a
> > process from debian 8 on a remote debian 9 host. To keep things simple
> > in this example, I'm just launching date on the remote host.
> >
> > deb8host$ mpirun -H deb9host date
> > [deb8host:01552] [[32763,0],0] ORTE_ERROR_LOG: Error in file
> > base/plm_base_launch_support.c at line 954
> >
> > It works fine when executed from debian 9:
> > deb9host$ mpirun -H deb8host date
> > Sat Jul 21 13:40:43 CDT 2018
> >
> > Also works when executed from debian 8 against debian 8:
> > deb8host:~$ mpirun -H deb8host2 date
> > Sat Jul 21 13:55:57 CDT 2018
> >
> > The failure results from an error code returned by:
> > opal_dss.unpack(buffer, &topo, &idx, OPAL_HWLOC_TOPO)
> >
> > openmpi was built with the same configure flags on both hosts.
> >
> >        --prefix=$(PREFIX) \
> >        --with-verbs \
> >        --with-libfabric \
> >        --disable-silent-rules \
> >        --with-hwloc=/usr \
> >        --with-libltdl=/usr \
> >        --with-devel-headers \
> >        --with-slurm \
> >        --with-sge \
> >        --without-tm \
> >        --disable-heterogeneous \
> >        --with-contrib-vt-flags=--disable-iotrace \
> >        --sysconfdir=$(PREFIX)/etc         \
> >        --libdir=$(PREFIX)/lib    \
> >        --includedir=$(PREFIX)/include
> >
> >
> > deb9host libhwloc and libhwloc-plugins are 1.11.5-1
> > deb8host libhwloc and libhwloc-plugins are 1.10.0-3
> >
> > I've been trying to debug this for the past few days and would
> > appreciate any help on determining why this failure is occurring
> > and/or resolving the problem.
> >
> > --
> > Brian T. Smith
> > System Fabric Works
> > Senior Technical Staff
> > bsm...@systemfabricworks.com
> > GPG Key: B3C2C7B73BA3CD7F
> > _______________________________________________
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
