Yeah, it's failing when trying to unpack the topology obtained from hwloc. My 
guess is that one of the following calls changed in hwloc-1.4.3:

        if (0 != hwloc_topology_set_xmlbuffer(t, xmlbuffer, strlen(xmlbuffer))) {
            rc = OPAL_ERROR;
            free(xmlbuffer);
            hwloc_topology_destroy(t);
            goto cleanup;
        }
        /* since we are loading this from an external source, we have to
         * explicitly set a flag so hwloc sets things up correctly
         */
        if (0 != hwloc_topology_set_flags(t, HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM)) {
            free(xmlbuffer);
            rc = OPAL_ERROR;
            goto cleanup;
        }

The only other calls in that routine are hwloc_topology_init and 
hwloc_topology_load, and those haven't changed in a while.


On Jul 23, 2013, at 11:12 AM, Kevin H. Hobbs <hob...@ohio.edu> wrote:

> On 07/23/2013 09:54 AM, Jeff Squyres (jsquyres) wrote:
>> 
>> I don't know if Fedora RPMs include -g in their builds, or if Fedora
>> includes a debuginfo RPM that you could install such that you can attach
>> a debugger and be able to dig into OMPI's internals yourself.
>> 
> 
> There is a debuginfo package.
> 
> Since I removed all of fedora's openmpi packages and installed from
> source into /opt/openmpi-1.6.5 and /opt/openmpi-1.6.5_hwloc-1.4.3 to
> narrow down on this problem, I now have to re-install the rpms with yum.
> 
> sudo yum install openmpi openmpi-devel openmpi-debuginfo
> 
> These don't put anything into my PATH or LD_LIBRARY_PATH so I have to :
> 
> module load mpi/openmpi-x86_64
> 
> I compiled my simple program with :
> 
> mpicc -g -o mpi_simple mpi_simple.c
> 
> The program links to fedora's copies of the libraries of interest :
> 
> mpirun -n 1 ldd mpi_simple | grep hwloc
>  libhwloc.so.5 => /lib64/libhwloc.so.5 (0x0000003c57600000)
> mpirun -n 1 ldd mpi_simple | grep mpi
>  libmpi.so.1 => /usr/lib64/openmpi/lib/libmpi.so.1 (0x00007f7207e29000)
> 
> I started the debugger with :
> 
> mpirun -n 1 gdb mpi_simple
> 
> When run in the debugger I got the error I described.
> 
> I reran and in gdb did :
> 
> set breakpoint pending on
> break util/nidmap.c:146
> run
> step
> 
> That took me into 'opal_dss_unpack'. Then I did 'next' until I got past
> 'opal_dss_unpack_buffer', which returned the -1 we see outside.
> 
> 
> 