It looks like you failed to build the shared memory component, so the system 
isn't seeing a communication path between processes on the same node.

Sent from my iPad

On Apr 2, 2012, at 7:47 AM, Alex Margolin <alex.margo...@mail.huji.ac.il> wrote:

> I found the problem(s): it was more than just the type redefinition, but I 
> fixed that too. I also added some code to btl/base to prevent/detect a 
> similar problem in the future. A newer version of my MOSIX patch 
> (odls + btl + fix) is attached. The BTL still doesn't work, though, and when 
> I run it under valgrind it reports some Open MPI-internal problems, which are 
> most likely unrelated to my patch. I'll keep working on it, but maybe someone 
> who knows this part of the code should take a look...
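
A self-contained sketch, in plain C, of the kind of defensive check mentioned
above. The attached patch isn't reproduced in this message, so this is only a
guess at the idea, not Alex's actual btl/base code: before the framework
dereferences a freshly opened component any further, verify that its embedded
name looks like a short, printable, NUL-terminated string, since a
mis-declared component struct rarely passes that test.

/* Hypothetical illustration only, not the code in the attached patch. */
#include <stdio.h>
#include <stddef.h>

#define NAME_LIMIT 64   /* assumed upper bound on a component name's length */

/* Return 1 if the name looks like a plausible component name, 0 otherwise. */
static int component_name_is_sane(const char *name)
{
    if (NULL == name) {
        return 0;
    }
    for (size_t i = 0; i < NAME_LIMIT; ++i) {
        if ('\0' == name[i]) {
            return i > 0;            /* non-empty and NUL-terminated */
        }
        if (name[i] < ' ' || name[i] > '~') {
            return 0;                /* non-printable byte: layout is probably off */
        }
    }
    return 0;                        /* never terminated within the limit */
}

int main(void)
{
    printf("%d\n", component_name_is_sane("self"));         /* prints 1 */
    printf("%d\n", component_name_is_sane("\x01garbage"));  /* prints 0 */
    return 0;
}
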
> 
> alex@singularity:~/huji/benchmarks/simple$ mpirun -mca btl self,mosix -n 2 
> valgrind simple
> ==22752== Memcheck, a memory error detector
> ==22752== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==22752== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright 
> info
> ==22752== Command: simple
> ==22752==
> ==22753== Memcheck, a memory error detector
> ==22753== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
> ==22753== Using Valgrind-3.6.1-Debian and LibVEX; rerun with -h for copyright 
> info
> ==22753== Command: simple
> ==22753==
> ==22753== Invalid read of size 8
> ==22753==    at 0x5ACBE0D: _wordcopy_fwd_dest_aligned (wordcopy.c:205)
> ==22753==    by 0x5AC5A6B: __GI_memmove (memmove.c:76)
> ==22753==    by 0x5ACD000: argz_insert (argz-insert.c:55)
> ==22753==    by 0x520A39A: lt_argz_insert (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520A537: lt_argz_insertinorder (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520A808: lt_argz_insertdir (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520A985: list_files_by_dir (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520AA0A: foreachfile_callback (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x52086AA: foreach_dirinpath (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520AADB: lt_dlforeachfile (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
> ==22753==    by 0x5215EB6: mca_base_component_find 
> (mca_base_component_find.c:186)
> ==22753==  Address 0x68d9570 is 32 bytes inside a block of size 38 alloc'd
> ==22753==    at 0x4C28F9F: malloc (vg_replace_malloc.c:236)
> ==22753==    by 0x52071CA: lt__malloc (in /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520A73D: lt_argz_insertdir (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520A985: list_files_by_dir (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520AA0A: foreachfile_callback (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x52086AA: foreach_dirinpath (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x520AADB: lt_dlforeachfile (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22753==    by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
> ==22753==    by 0x5215EB6: mca_base_component_find 
> (mca_base_component_find.c:186)
> ==22753==    by 0x5219AA3: mca_base_components_open 
> (mca_base_components_open.c:129)
> ==22753==    by 0x5246183: opal_paffinity_base_open 
> (paffinity_base_open.c:129)
> ==22753==    by 0x523C013: opal_init (opal_init.c:361)
> ==22753==
> ==22752== Invalid read of size 8
> ==22752==    at 0x5ACBE0D: _wordcopy_fwd_dest_aligned (wordcopy.c:205)
> ==22752==    by 0x5AC5A6B: __GI_memmove (memmove.c:76)
> ==22752==    by 0x5ACD000: argz_insert (argz-insert.c:55)
> ==22752==    by 0x520A39A: lt_argz_insert (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520A537: lt_argz_insertinorder (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520A808: lt_argz_insertdir (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520A985: list_files_by_dir (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520AA0A: foreachfile_callback (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x52086AA: foreach_dirinpath (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520AADB: lt_dlforeachfile (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
> ==22752==    by 0x5215EB6: mca_base_component_find 
> (mca_base_component_find.c:186)
> ==22752==  Address 0x68d9570 is 32 bytes inside a block of size 38 alloc'd
> ==22752==    at 0x4C28F9F: malloc (vg_replace_malloc.c:236)
> ==22752==    by 0x52071CA: lt__malloc (in /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520A73D: lt_argz_insertdir (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520A985: list_files_by_dir (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520AA0A: foreachfile_callback (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x52086AA: foreach_dirinpath (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x520AADB: lt_dlforeachfile (in 
> /usr/local/lib/libmpi.so.0.0.0)
> ==22752==    by 0x52162DD: find_dyn_components (mca_base_component_find.c:319)
> ==22752==    by 0x5215EB6: mca_base_component_find 
> (mca_base_component_find.c:186)
> ==22752==    by 0x5219AA3: mca_base_components_open 
> (mca_base_components_open.c:129)
> ==22752==    by 0x5246183: opal_paffinity_base_open 
> (paffinity_base_open.c:129)
> ==22752==    by 0x523C013: opal_init (opal_init.c:361)
> ==22752==
> [singularity:22753] mca: base: component_find: unable to open 
> /usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot open 
> shared object file: No such file or directory (ignored)
> [singularity:22752] mca: base: component_find: unable to open 
> /usr/local/lib/openmpi/mca_mpool_sm: libmca_common_sm.so.0: cannot open 
> shared object file: No such file or directory (ignored)
> [singularity:22753] mca: base: component_find: unable to open 
> /usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open shared 
> object file: No such file or directory (ignored)
> [singularity:22752] mca: base: component_find: unable to open 
> /usr/local/lib/openmpi/mca_coll_sm: libmca_common_sm.so.0: cannot open shared 
> object file: No such file or directory (ignored)
> ==22753== Warning: invalid file descriptor 207618048 in syscall open()
> ==22752== Warning: invalid file descriptor 207618048 in syscall open()
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> 
>  Process 1 ([[59806,1],0]) is on host: singularity
>  Process 2 ([[59806,1],1]) is on host: singularity
>  BTLs attempted: self
> 
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is unreachable
> from another.  This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used.  Your MPI job will now abort.
> 
> You may wish to try to narrow down the problem;
> 
> * Check the output of ompi_info to see which BTL/MTL plugins are
>   available.
> * Run your application with MPI_THREAD_SINGLE.
> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>   if using MTL-based communications) to see exactly which
>   communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> ==22752== Use of uninitialised value of size 8
> ==22752==    at 0x5A8631B: _itoa_word (_itoa.c:195)
> ==22752==    by 0x5A8AE43: vfprintf (vfprintf.c:1622)
> ==22752==    by 0x5AADB83: vasprintf (vasprintf.c:64)
> ==22752==    by 0x524C150: opal_show_help_vstring (show_help.c:309)
> ==22752==    by 0x51786F1: orte_show_help (show_help.c:648)
> ==22752==    by 0x50B4693: backend_fatal_aggregate 
> (errhandler_predefined.c:205)
> ==22752==    by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
> ==22752==    by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler 
> (errhandler_predefined.c:68)
> ==22752==    by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
> ==22752==    by 0x50FD446: PMPI_Init (pinit.c:95)
> ==22752==    by 0x40A128: MPI::Init(int&, char**&) (in 
> /home/alex/huji/benchmarks/simple/simple)
> ==22752==    by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
> ==22752==
> ==22752== Conditional jump or move depends on uninitialised value(s)
> ==22752==    at 0x5A86325: _itoa_word (_itoa.c:195)
> ==22752==    by 0x5A8AE43: vfprintf (vfprintf.c:1622)
> ==22752==    by 0x5AADB83: vasprintf (vasprintf.c:64)
> ==22752==    by 0x524C150: opal_show_help_vstring (show_help.c:309)
> ==22752==    by 0x51786F1: orte_show_help (show_help.c:648)
> ==22752==    by 0x50B4693: backend_fatal_aggregate 
> (errhandler_predefined.c:205)
> ==22752==    by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
> ==22752==    by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler 
> (errhandler_predefined.c:68)
> ==22752==    by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
> ==22752==    by 0x50FD446: PMPI_Init (pinit.c:95)
> ==22752==    by 0x40A128: MPI::Init(int&, char**&) (in 
> /home/alex/huji/benchmarks/simple/simple)
> ==22752==    by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
> ==22752==
> [singularity:22752] *** An error occurred in MPI_Init
> [singularity:22752] *** reported by process [3919446017,0]
> [singularity:22752] *** on a NULL communicator
> [singularity:22752] *** Unknown error
> [singularity:22752] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
> will now abort,
> [singularity:22752] ***    and potentially your MPI job)
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
> 
>  Reason:     Before MPI_INIT completed
>  Local host: singularity
>  PID:        22752
> --------------------------------------------------------------------------
> ==22753== Use of uninitialised value of size 8
> ==22753==    at 0x5A8631B: _itoa_word (_itoa.c:195)
> ==22753==    by 0x5A8AE43: vfprintf (vfprintf.c:1622)
> ==22753==    by 0x5AADB83: vasprintf (vasprintf.c:64)
> ==22753==    by 0x524C150: opal_show_help_vstring (show_help.c:309)
> ==22753==    by 0x51786F1: orte_show_help (show_help.c:648)
> ==22753==    by 0x50B4693: backend_fatal_aggregate 
> (errhandler_predefined.c:205)
> ==22753==    by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
> ==22753==    by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler 
> (errhandler_predefined.c:68)
> ==22753==    by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
> ==22753==    by 0x50FD446: PMPI_Init (pinit.c:95)
> ==22753==    by 0x40A128: MPI::Init(int&, char**&) (in 
> /home/alex/huji/benchmarks/simple/simple)
> ==22753==    by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
> ==22753==
> ==22753== Conditional jump or move depends on uninitialised value(s)
> ==22753==    at 0x5A86325: _itoa_word (_itoa.c:195)
> ==22753==    by 0x5A8AE43: vfprintf (vfprintf.c:1622)
> ==22753==    by 0x5AADB83: vasprintf (vasprintf.c:64)
> ==22753==    by 0x524C150: opal_show_help_vstring (show_help.c:309)
> ==22753==    by 0x51786F1: orte_show_help (show_help.c:648)
> ==22753==    by 0x50B4693: backend_fatal_aggregate 
> (errhandler_predefined.c:205)
> ==22753==    by 0x50B4A9B: backend_fatal (errhandler_predefined.c:329)
> ==22753==    by 0x50B3FAF: ompi_mpi_errors_are_fatal_comm_handler 
> (errhandler_predefined.c:68)
> ==22753==    by 0x50B38FA: ompi_errhandler_invoke (errhandler_invoke.c:41)
> ==22753==    by 0x50FD446: PMPI_Init (pinit.c:95)
> ==22753==    by 0x40A128: MPI::Init(int&, char**&) (in 
> /home/alex/huji/benchmarks/simple/simple)
> ==22753==    by 0x409118: main (in /home/alex/huji/benchmarks/simple/simple)
> ==22753==
> ==22752==
> ==22752== HEAP SUMMARY:
> ==22752==     in use at exit: 730,332 bytes in 2,844 blocks
> ==22752==   total heap usage: 4,959 allocs, 2,115 frees, 11,353,797 bytes 
> allocated
> ==22752==
> ==22753==
> ==22753== HEAP SUMMARY:
> ==22753==     in use at exit: 730,332 bytes in 2,844 blocks
> ==22753==   total heap usage: 4,970 allocs, 2,126 frees, 11,354,058 bytes 
> allocated
> ==22753==
> ==22752== LEAK SUMMARY:
> ==22752==    definitely lost: 2,138 bytes in 52 blocks
> ==22752==    indirectly lost: 7,440 bytes in 12 blocks
> ==22752==      possibly lost: 0 bytes in 0 blocks
> ==22752==    still reachable: 720,754 bytes in 2,780 blocks
> ==22752==         suppressed: 0 bytes in 0 blocks
> ==22752== Rerun with --leak-check=full to see details of leaked memory
> ==22752==
> ==22752== For counts of detected and suppressed errors, rerun with: -v
> ==22752== Use --track-origins=yes to see where uninitialised values come from
> ==22752== ERROR SUMMARY: 47 errors from 3 contexts (suppressed: 4 from 4)
> ==22753== LEAK SUMMARY:
> ==22753==    definitely lost: 2,138 bytes in 52 blocks
> ==22753==    indirectly lost: 7,440 bytes in 12 blocks
> ==22753==      possibly lost: 0 bytes in 0 blocks
> ==22753==    still reachable: 720,754 bytes in 2,780 blocks
> ==22753==         suppressed: 0 bytes in 0 blocks
> ==22753== Rerun with --leak-check=full to see details of leaked memory
> ==22753==
> ==22753== For counts of detected and suppressed errors, rerun with: -v
> ==22753== Use --track-origins=yes to see where uninitialised values come from
> ==22753== ERROR SUMMARY: 47 errors from 3 contexts (suppressed: 4 from 4)
> -------------------------------------------------------
> While the primary job  terminated normally, 2 processes returned
> non-zero exit codes.. Further examination may be required.
> -------------------------------------------------------
> [singularity:22751] 1 more process has sent help message help-mca-bml-r2.txt 
> / unreachable proc
> [singularity:22751] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
> all help / error messages
> [singularity:22751] 1 more process has sent help message help-mpi-runtime / 
> mpi_init:startup:pml-add-procs-fail
> [singularity:22751] 1 more process has sent help message help-mpi-errors.txt 
> / mpi_errors_are_fatal unknown handle
> [singularity:22751] 1 more process has sent help message help-mpi-runtime.txt 
> / ompi mpi abort:cannot guarantee all killed
> alex@singularity:~/huji/benchmarks/simple$
> 
> 
> On 04/01/2012 04:59 PM, Ralph Castain wrote:
>> I suspect the problem is here:
>> 
>> /**
>> + * MOSIX BTL component.
>> + */
>> +struct mca_btl_base_component_t {
>> +    mca_btl_base_component_2_0_0_t super;  /**<  base BTL component */
>> +    mca_btl_mosix_module_t mosix_module;   /**<  local module */
>> +};
>> +typedef struct mca_btl_base_component_t mca_btl_mosix_component_t;
>> +
>> +OMPI_MODULE_DECLSPEC extern mca_btl_mosix_component_t 
>> mca_btl_mosix_component;
>> +
>> 
>> 
>> You redefined the mca_btl_base_component_t struct. What we usually do is 
>> define a new struct:
>> 
>> struct mca_btl_mosix_component_t {
>>    mca_btl_base_component_t super;  /**<  base BTL component */
>>    mca_btl_mosix_module_t mosix_module;   /**<  local module */
>> };
>> typedef struct mca_btl_mosix_component_t mca_btl_mosix_component_t;
>> 
>> You can then overload that component with your additional info, leaving the 
>> base component to contain the required minimal elements.
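
A stand-alone illustration of the overloading pattern described above, using
plain C with stand-in types rather than Open MPI's real definitions: the base
struct is left untouched, the component-specific struct embeds it as its first
member, and framework-style code that only knows the base type keeps working.

#include <stdio.h>

struct base_component {              /* stand-in for the real base component type */
    const char *name;
};

struct mosix_component {             /* stand-in for mca_btl_mosix_component_t */
    struct base_component super;     /* base fields first, at offset 0 */
    int mosix_extra;                 /* component-specific additions follow */
};

/* Framework-style code: only knows about the base type. */
static void print_name(const struct base_component *c)
{
    printf("component name: %s\n", c->name);
}

int main(void)
{
    struct mosix_component comp = {
        .super = { .name = "mosix" },
        .mosix_extra = 42,
    };

    /* Because super comes first, &comp.super is a valid base_component
     * pointer; nothing in the base header had to be redefined to make
     * room for the extra field. */
    print_name(&comp.super);
    return 0;
}

Redefining the base struct instead gives the component a layout the rest of
the code base does not expect, which fits the garbage component-name pointer
("R1: 0x38") seen in the bml/r2 debug output further down.
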
>> 
>> 
>> On Apr 1, 2012, at 1:59 AM, Alex Margolin wrote:
>> 
>>> I traced the problem to the BML component:
>>> Index: ompi/mca/bml/r2/bml_r2.c
>>> ===================================================================
>>> --- ompi/mca/bml/r2/bml_r2.c    (revision 26191)
>>> +++ ompi/mca/bml/r2/bml_r2.c    (working copy)
>>> @@ -105,6 +105,8 @@
>>>             }
>>>         }
>>>         if (NULL == btl_names_argv || NULL == btl_names_argv[i]) {
>>> +            printf("\n\nR1: %p\n\n", 
>>> btl->btl_component->btl_version.mca_component_name);
>>> +            printf("\n\nR2: %s\n\n", 
>>> btl->btl_component->btl_version.mca_component_name);
>>>             opal_argv_append_nosize(&btl_names_argv,
>>>                                     
>>> btl->btl_component->btl_version.mca_component_name);
>>>         }
>>> 
>>> I get (whitespace removed) for a normal run:
>>> R1: 0x7f820e3c31d8
>>> R2: self
>>> R1: 0x7f820e13c598
>>> R2: tcp
>>> ... and for my module:
>>> R1: 0x38
>>> - and then the segmentation fault.
>>> I guess it has something to do with the way I initialize my component - 
>>> I'll resume debugging after lunch.
>>> 
>>> Alex
>>> 
>> 
> 
> <mosix_components.diff>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
