Looks like you failed to build the shared memory component. The system isn't seeing a comm path between procs on the same node.
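A quick way to confirm that (just a suggested check, not something from the original thread) is to ask ompi_info which BTL components the install actually has, e.g.:

    ompi_info | grep btl

If the sm BTL doesn't show up there, the shared memory component never made it into the install, which would match the "libmca_common_sm.so.0: cannot open shared object file" lines in the log below and explain why only the "self" BTL was attempted.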
Sent from my iPad

On Apr 2, 2012, at 7:47 AM, Alex Margolin <alex.margo...@mail.huji.ac.il> wrote:

> I found the problem(s) - it was more than just the type redefinition, but I fixed
> that too. I also added some code to btl/base to prevent/detect a similar
> problem in the future. A newer version of my MOSIX patch (odls + btl + fix)
> is attached. The BTL still doesn't work, though, and when I try to use
> valgrind it fails with some Open MPI internal problems, which are most likely
> unrelated to my patch. I'll keep working on it, but maybe someone who knows this
> part of the code should look at it...
>
> alex@singularity:~/huji/benchmarks/simple$ mpirun -mca btl self,mosix -n 2 valgrind simple
> ==22752== Memcheck, a memory error detector
> ==22752== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==22752== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
> ==22752== Command: simple
> ==22752==
> ==22753== Memcheck, a memory error detector
> ==22753== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==22753== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright info
> ==22753== Command: simple
> ==22753==
> ==22753== Invalid read of size 8
> ==22753==    at 0x5ACBE0D: _wordcopy_fwd_dest_aligned (wordcopy.c:205)
> ==22753==    by 0x5AC5A6B: __GI_memmove (memmove.c:76)
> ==22753==    by 0x5ACD000: argz_insert (argz-insert.c:55)
> ==22753==    by 0x520A39A: lt_argz_insert (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520A537: lt_argz_insertinorder (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520A808: lt_argz_insertdir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520A985: list_files_by_dir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520AA0A: foreachfile_callback (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x52086AA: foreach_dirinpath (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520AADB: lt_dlforeachfile (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
> ==22753==    by 0x5215EB6: mca_base_component_find (mca_base_component_find.c:186)
> ==22753== Address 0x68d9570 is 32 bytes inside a block of size 38 alloc'd
> ==22753==    at 0x4C28F9F: malloc (vg_replace_malloc.c:236)
> ==22753==    by 0x52071CA: lt__malloc (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520A73D: lt_argz_insertdir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520A985: list_files_by_dir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520AA0A: foreachfile_callback (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x52086AA: foreach_dirinpath (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520AADB: lt_dlforeachfile (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
> ==22753==    by 0x5215EB6: mca_base_component_find (mca_base_component_find.c:186)
> ==22753==    by 0x5219AA3: mca_base_components_open (mca_base_components_open.c:129)
> ==22753==    by 0x5246183: opal_paffinity_base_open (paffinity_base_open.c:129)
> ==22753==    by 0x523C013: opal_init (opal_init.c:361)
> ==22753==
> ==22752== Invalid read of size 8
> ==22752==    at 0x5ACBE0D: _wordcopy_fwd_dest_aligned (wordcopy.c:205)
> ==22752==    by 0x5AC5A6B: __GI_memmove (memmove.c:76)
> ==22752==    by 0x5ACD000: argz_insert (argz-insert.c:55)
> ==22752==    by 0x520A39A: lt_argz_insert (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520A537: lt_argz_insertinorder (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520A808: lt_argz_insertdir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520A985: list_files_by_dir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520AA0A: foreachfile_callback (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x52086AA: foreach_dirinpath (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520AADB: lt_dlforeachfile (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
> ==22752==    by 0x5215EB6: mca_base_component_find (mca_base_component_find.c:186)
> ==22752== Address 0x68d9570 is 32 bytes inside a block of size 38 alloc'd
> ==22752==    at 0x4C28F9F: malloc (vg_replace_malloc.c:236)
> ==22752==    by 0x52071CA: lt__malloc (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520A73D: lt_argz_insertdir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520A985: list_files_by_dir (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520AA0A: foreachfile_callback (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x52086AA: foreach_dirinpath (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520AADB: lt_dlforeachfile (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
> ==22752==    by 0x5215EB6: mca_base_component_find (mca_base_component_find.c:186)
> ==22752==    by 0x5219AA3: mca_base_components_open (mca_base_components_open.c:129)
> ==22752==    by 0x5246183: opal_paffinity_base_open (paffinity_base_open.c:129)
> ==22752==    by 0x523C013: opal_init (opal_init.c:361)
> ==22752==
> [singularity:22753] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
> [singularity:22752] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
> [singularity:22753] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
> [singularity:22752] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open shared object file: No such file or directory (ignored)
> ==22753== Warning: invalid file descriptor 207618048 in syscall open()
> ==22752== Warning: invalid file descriptor 207618048 in syscall open()
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[59806,1],0]) is on host: singularity
>   Process 2 ([[59806,1],1]) is on host: singularity
>   BTLs attempted: self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is unreachable
> from another.  This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used.
> Your MPI job will now abort.
>
> You may wish to try to narrow down the problem;
>
>   * Check the output of ompi_info to see which BTL/MTL plugins are
>     available.
>   * Run your application with MPI_THREAD_SINGLE.
>   * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>     if using MTL-based communications) to see exactly which
>     communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> ==22752== Use of uninitialised value of size 8
> ==22752==    at 0x5A8631B: _itoa_word (_itoa.c:195)
> ==22752==    by 0x5A8AE43: vfprintf (vfprintf.c:1622)
> ==22752==    by 0x5AADB83: vasprintf (vasprintf.c:64)
> ==22752==    by 0x524C150: opal_show_help_vstring (show_help.c:309)
> ==22752==    by 0x51786F1: orte_show_help (show_help.c:648)
> ==22752==    by 0x50B4693: backend_fatal_aggregate (errhandler_predefined.c:205)
> ==22752==    by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
> ==22752==    by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:68)
> ==22752==    by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
> ==22752==    by 0x50FD446: PMPI_Init (pinit.c:95)
> ==22752==    by 0x40A128: MPI::Init(int&, char**&) (in /home/alex/huji/benchmarks/simple/simple)
> ==22752==    by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
> ==22752==
> ==22752== Conditional jump or move depends on uninitialised value(s)
> ==22752==    at 0x5A86325: _itoa_word (_itoa.c:195)
> ==22752==    by 0x5A8AE43: vfprintf (vfprintf.c:1622)
> ==22752==    by 0x5AADB83: vasprintf (vasprintf.c:64)
> ==22752==    by 0x524C150: opal_show_help_vstring (show_help.c:309)
> ==22752==    by 0x51786F1: orte_show_help (show_help.c:648)
> ==22752==    by 0x50B4693: backend_fatal_aggregate (errhandler_predefined.c:205)
> ==22752==    by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
> ==22752==    by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:68)
> ==22752==    by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
> ==22752==    by 0x50FD446: PMPI_Init (pinit.c:95)
> ==22752==    by 0x40A128: MPI::Init(int&, char**&) (in /home/alex/huji/benchmarks/simple/simple)
> ==22752==    by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
> ==22752==
> [singularity:22752] *** An error occurred in MPI_Init
> [singularity:22752] *** reported by process [3919446017,0]
> [singularity:22752] *** on a NULL communicator
> [singularity:22752] *** Unknown error
> [singularity:22752] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [singularity:22752] ***    and potentially your MPI job)
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
>
>   Reason:     Before MPI_INIT completed
>   Local host: singularity
>   PID:        22752
> --------------------------------------------------------------------------
> ==22753== Use of uninitialised value of size 8
> ==22753==    at 0x5A8631B: _itoa_word (_itoa.c:195)
> ==22753==    by 0x5A8AE43: vfprintf (vfprintf.c:1622)
> ==22753==    by 0x5AADB83: vasprintf (vasprintf.c:64)
> ==22753==    by 0x524C150: opal_show_help_vstring (show_help.c:309)
> ==22753==    by 0x51786F1: orte_show_help (show_help.c:648)
> ==22753==    by 0x50B4693: backend_fatal_aggregate (errhandler_predefined.c:205)
> ==22753==    by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
> ==22753==    by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:68)
> ==22753==    by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
> ==22753==    by 0x50FD446: PMPI_Init (pinit.c:95)
> ==22753==    by 0x40A128: MPI::Init(int&, char**&) (in /home/alex/huji/benchmarks/simple/simple)
> ==22753==    by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
> ==22753==
> ==22753== Conditional jump or move depends on uninitialised value(s)
> ==22753==    at 0x5A86325: _itoa_word (_itoa.c:195)
> ==22753==    by 0x5A8AE43: vfprintf (vfprintf.c:1622)
> ==22753==    by 0x5AADB83: vasprintf (vasprintf.c:64)
> ==22753==    by 0x524C150: opal_show_help_vstring (show_help.c:309)
> ==22753==    by 0x51786F1: orte_show_help (show_help.c:648)
> ==22753==    by 0x50B4693: backend_fatal_aggregate (errhandler_predefined.c:205)
> ==22753==    by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
> ==22753==    by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:68)
> ==22753==    by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
> ==22753==    by 0x50FD446: PMPI_Init (pinit.c:95)
> ==22753==    by 0x40A128: MPI::Init(int&, char**&) (in /home/alex/huji/benchmarks/simple/simple)
> ==22753==    by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
> ==22753==
> ==22752==
> ==22752== HEAP SUMMARY:
> ==22752==     in use at exit: 730,332 bytes in 2,844 blocks
> ==22752==   total heap usage: 4,959 allocs, 2,115 frees, 11,353,797 bytes allocated
> ==22752==
> ==22753==
> ==22753== HEAP SUMMARY:
> ==22753==     in use at exit: 730,332 bytes in 2,844 blocks
> ==22753==   total heap usage: 4,970 allocs, 2,126 frees, 11,354,058 bytes allocated
> ==22753==
> ==22752== LEAK SUMMARY:
> ==22752==    definitely lost: 2,138 bytes in 52 blocks
> ==22752==    indirectly lost: 7,440 bytes in 12 blocks
> ==22752==      possibly lost: 0 bytes in 0 blocks
> ==22752==    still reachable: 720,754 bytes in 2,780 blocks
> ==22752==         suppressed: 0 bytes in 0 blocks
> ==22752== Rerun with --leak-check=full to see details of leaked memory
> ==22752==
> ==22752== For counts of detected and suppressed errors, rerun with: -v
> ==22752== Use --track-origins=yes to see where uninitialised values come from
> ==22752== ERROR SUMMARY: 47 errors from 3 contexts (suppressed: 4 from 4)
> ==22753== LEAK SUMMARY:
> ==22753==    definitely lost: 2,138 bytes in 52 blocks
> ==22753==    indirectly lost: 7,440 bytes in 12 blocks
> ==22753==      possibly lost: 0 bytes in 0 blocks
> ==22753==    still reachable: 720,754 bytes in 2,780 blocks
> ==22753==         suppressed: 0 bytes in 0 blocks
> ==22753== Rerun with --leak-check=full to see details of leaked memory
> ==22753==
> ==22753== For counts of detected and suppressed errors, rerun with: -v
> ==22753== Use --track-origins=yes to see where uninitialised values come from
> ==22753== ERROR SUMMARY: 47 errors from 3 contexts (suppressed: 4 from 4)
> -------------------------------------------------------
> While the primary job terminated normally, 2 processes returned
> non-zero exit codes. Further examination may be required.
> -------------------------------------------------------
> [singularity:22751] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
> [singularity:22751] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [singularity:22751] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
> [singularity:22751] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
> [singularity:22751] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
> alex@singularity:~/huji/benchmarks/simple$
>
>
> On 04/01/2012 04:59 PM, Ralph Castain wrote:
>> I suspect the problem is here:
>>
>>  /**
>> + * MOSIX BTL component.
>> + */
>> +struct mca_btl_base_component_t {
>> +    mca_btl_base_component_2_0_0_t super;   /**< base BTL component */
>> +    mca_btl_mosix_module_t mosix_module;    /**< local module */
>> +};
>> +typedef struct mca_btl_base_component_t mca_btl_mosix_component_t;
>> +
>> +OMPI_MODULE_DECLSPEC extern mca_btl_mosix_component_t mca_btl_mosix_component;
>> +
>>
>> You redefined the mca_btl_base_component_t struct. What we usually do is define a new struct:
>>
>>  struct mca_btl_mosix_component_t {
>>      mca_btl_base_component_t super;          /**< base BTL component */
>>      mca_btl_mosix_module_t mosix_module;     /**< local module */
>>  };
>>  typedef struct mca_btl_mosix_component_t mca_btl_mosix_component_t;
>>
>> You can then overload that component with your additional info, leaving the
>> base component to contain the required minimal elements.
>>
>>
>> On Apr 1, 2012, at 1:59 AM, Alex Margolin wrote:
>>
>>> I traced the problem to the BML component:
>>>
>>>  Index: ompi/mca/bml/r2/bml_r2.c
>>>  ===================================================================
>>>  --- ompi/mca/bml/r2/bml_r2.c   (revision 26191)
>>>  +++ ompi/mca/bml/r2/bml_r2.c   (working copy)
>>>  @@ -105,6 +105,8 @@
>>>               }
>>>           }
>>>           if (NULL == btl_names_argv || NULL == btl_names_argv[i]) {
>>>  +            printf("\n\nR1: %p\n\n", btl->btl_component->btl_version.mca_component_name);
>>>  +            printf("\n\nR2: %s\n\n", btl->btl_component->btl_version.mca_component_name);
>>>               opal_argv_append_nosize(&btl_names_argv,
>>>                                       btl->btl_component->btl_version.mca_component_name);
>>>           }
>>>
>>> I get (white-spaces removed) for a normal run:
>>>  R1: 0x7f820e3c31d8
>>>  R2: self
>>>  R1: 0x7f820e13c598
>>>  R2: tcp
>>>  ...
>>> and for my module:
>>>  R1: 0x38
>>> - and then the segmentation fault.
>>> I guess it has something to do with the way I initialize my component -
>>> I'll resume debugging after lunch.
>>>
>>> Alex
>
> <mosix_components.diff>
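As a follow-up to the "narrow down the problem" suggestions in the quoted help text, a verbose rerun along these lines (just a sketch - Alex's original command minus valgrind, plus the btl_base_verbose parameter named in that help message) would show exactly which BTL components were considered and/or discarded:

    mpirun -mca btl_base_verbose 100 -mca btl self,mosix -n 2 simple

That output should make it clear whether the mosix component is being opened and then discarded during selection, or never loaded at all.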