Re: [OMPI devel] RFC: Rename --enable-*-threads and ENABLE*THREAD* (take 2)

2010-03-05 Thread Terry Dontje

A couple comments:
1.  I assume the timeout is really March 5th, not February.
2.  As to keeping the deprecated variables, I think you really need to 
ditch --enable-mpi-threads, because if you make it a synonym for 
--enable-mpi-thread-multiple you are not mimicking what it did before but 
redefining it, IMHO.  (I am OK with ditching it personally.)


--td

Jeff Squyres wrote:

WHAT: Rename the --enable-*-threads configure switches and ENABLE*THREAD* 
macros.
  (see previous RFC: 
http://www.open-mpi.org/community/lists/devel/2010/01/7366.php)

WHY: The fact that thread safety in OPAL and ORTE requires a configure switch with 
"mpi" in the name is very non-intuitive.  Additionally, MPI_THREAD_MULTIPLE 
support is not necessarily the same thing as OPAL thread support (MTM needs OPAL thread 
support, but not the other way around), and we are seeing a growing advantage/need for 
ORTE to utilize threads in mpirun and orted irrespective of the MPI layer's threading 
abilities.

WHERE: Mostly opal/config/opal_config_threads.m4, something new in 
ompi/config/*.m4, and wherever the ENABLE*THREAD* macros are used in the 
current code base.

WHEN: Next Friday COB

TIMEOUT: COB, Friday, Feb 5, 2010



More details:

Cisco is starting to investigate using ORTE and OPAL in various threading scenarios.  The 
fact that you need to enable thread safety in ORTE/OPAL with a configure switch that has 
the word "mpi" in it is extremely counter-intuitive (it bit some of our 
engineers very badly, and they were mighty annoyed!!).  In addition, we ran into problems 
where it was advantageous to have threads in ORTE, but we couldn't do it without forcing 
thread support into the MPI layer because the switch is universal.

Since this functionality has nothing to do with MPI (it's actually the other 
way around -- MPI_THREAD_MULTIPLE needs this functionality), we really should 
divorce MPI threading functionality from whether threading machinery is 
enabled in OPAL or not.


These names were proposed at the end of the previous RFC and no one objected, 
so I'm sending this around as a new RFC to ensure we're all on the same sheet 
of music:

--enable-opal-progress-threads: enables progress thread machinery in opal
 --> this is just a renaming from --enable-progress-threads
 --> the corresponding #define stays the same: OPAL_ENABLE_PROGRESS_THREADS

--enable-opal-multi-threads: enables multi threaded machinery in opal
 --> this is just a renaming from --enable-mpi-threads
 --> the corresponding #define also renames; from OPAL_ENABLE_MPI_THREADS to 
OPAL_ENABLE_MULTI_THREADS

--enable-mpi-thread-multiple: enables the use of MPI_THREAD_MULTIPLE; *ONLY* 
affects the MPI layer
 --> use of this switch explicitly implies --enable-opal-multi-threads
 --> new #define: OMPI_ENABLE_THREAD_MULTIPLE

We can keep and deprecate the old configure options if desired:

--enable-mpi-threads: deprecated synonym for --enable-mpi-thread-multiple
--enable-progress-threads: deprecated synonym for --enable-opal-progress-threads

...although I'm somewhat inclined to ditch them unless someone has strong 
feelings about keeping them.

Doing the name change in OPAL and ORTE is fairly straightforward -- it's 
essentially an s/foo/bar/g kind of operation.  It'll likely take a little more 
effort in the MPI layer because the places where the current #defines are used 
may need to switch to the new name or to the new OMPI_ENABLE_THREAD_MULTIPLE 
name (and maybe some new logic?  I am not sure without looking into it closer).
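
As a rough illustration of how the proposed #defines relate, here is a minimal sketch. It assumes configure sets each macro to 0 or 1; the fallback values and the two helper functions are hypothetical and are not taken from the Open MPI tree.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>

/* Stand-ins for values configure would normally write into a generated
 * header. These defaults are assumptions chosen for illustration only. */
#ifndef OPAL_ENABLE_MULTI_THREADS
#define OPAL_ENABLE_MULTI_THREADS 1
#endif
#ifndef OMPI_ENABLE_THREAD_MULTIPLE
#define OMPI_ENABLE_THREAD_MULTIPLE 0
#endif

/* --enable-mpi-thread-multiple implies --enable-opal-multi-threads, so the
 * opposite combination is rejected at compile time. */
static_assert(!(OMPI_ENABLE_THREAD_MULTIPLE) || (OPAL_ENABLE_MULTI_THREADS),
              "MPI_THREAD_MULTIPLE requires OPAL multi-thread support");

/* Hypothetical helper: is the threading machinery compiled into OPAL? */
static bool opal_layer_uses_threads(void)
{
    return OPAL_ENABLE_MULTI_THREADS != 0;
}

/* Hypothetical helper: may the MPI layer offer MPI_THREAD_MULTIPLE? */
static bool mpi_layer_allows_thread_multiple(void)
{
    return OMPI_ENABLE_THREAD_MULTIPLE != 0;
}
```

With these defaults the sketch models the case the RFC cares about: threads are available to OPAL and ORTE while the MPI layer stays without MPI_THREAD_MULTIPLE.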

  




[OMPI devel] Missing Symbol

2010-03-05 Thread Leonardo Fialho
Hi,

I know that libtool does not help us find the source of this error, but 
what can cause the following error?

[aoclsb-clus.uab.es:11724] mca: base: component_find: unable to open 
/home/lfialho/lib/openmpi/mca_vprotocol_receiver: perhaps a missing symbol, or 
compiled for a different version of Open MPI? (ignored)

1) yes, the file exists
2) yes, it has been compiled among all other components
3) yes, it is the same Open MPI version
4) this component is a copy of the pessimist component implemented by Aurelien
5) Aurelien's component presents the same error

The question is: what kind of mistake would cause an error during module loading?

Thanks in advance,
Leonardo


Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Terry Dontje
Possibly there is an external symbol in the .so that is being loaded 
that cannot be resolved. 


--td
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
  




Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Terry Dontje
Sorry, meant to add this: you might be able to find the symbol causing 
the issue by twiddling with LD_DEBUG.


--td




Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Leonardo Fialho
I see... but it is really strange because this module is clean; it does not use 
anything. This is the output of the nm command; I can't see any symbol that is 
not available.

[lfialho@aoclsb-clus openmpi]$ nm mca_vprotocol_receiver.so 
00201208 a _DYNAMIC
00201408 a _GLOBAL_OFFSET_TABLE_
 w _Jv_RegisterClasses
002011e0 d __CTOR_END__
002011d8 d __CTOR_LIST__
002011f0 d __DTOR_END__
002011e8 d __DTOR_LIST__
11d0 r __FRAME_END__
002011f8 d __JCR_END__
002011f8 d __JCR_LIST__
00201640 A __bss_start
 w __cxa_finalize@@GLIBC_2.2.5
0d40 t __do_global_ctors_aux
07c0 t __do_global_dtors_aux
00201200 d __dso_handle
 w __gmon_start__
00201640 A _edata
00201648 A _end
0d78 T _fini
0750 T _init
07a0 t call_gmon_start
00201640 b completed.6115
0810 t frame_dummy
 U mca_pml_v
00201460 D mca_vprotocol_receiver
0c71 t mca_vprotocol_receiver_add_comm
0a5f t mca_vprotocol_receiver_add_procs
00201540 D mca_vprotocol_receiver_component
0cc3 t mca_vprotocol_receiver_component_close
0d18 t mca_vprotocol_receiver_component_finalize
0cce t mca_vprotocol_receiver_component_init
0cb8 t mca_vprotocol_receiver_component_open
0c93 t mca_vprotocol_receiver_del_comm
0a89 t mca_vprotocol_receiver_del_procs
083c t mca_vprotocol_receiver_dump
0d23 t mca_vprotocol_receiver_enable
09e7 t mca_vprotocol_receiver_iprobe
0b9a t mca_vprotocol_receiver_irecv
0ab3 t mca_vprotocol_receiver_isend
0a29 t mca_vprotocol_receiver_probe
0c00 t mca_vprotocol_receiver_recv
0b21 t mca_vprotocol_receiver_send
09bd T mca_vprotocol_receiver_start
0864 t mca_vprotocol_receiver_test
0896 t mca_vprotocol_receiver_test_all
08d0 t mca_vprotocol_receiver_test_any
0950 t mca_vprotocol_receiver_test_some
0916 t mca_vprotocol_receiver_wait_any
098a t mca_vprotocol_receiver_wait_some
 U ompi_request_null
 U opal_output
00201440 d p.6113
[lfialho@aoclsb-clus openmpi]$





Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Ralph Castain
You said this component was a copy of Aurelien's component? Did you rename the 
critical elements (e.g., component, module) inside it to avoid name confusion?





Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Leonardo Fialho
Yes,

I renamed all references to Aurelien's component name and removed all code 
related to the component itself. There are only functions which return 
OMPI_SUCCESS. No other function is called.

I'm debugging with LD_DEBUG=symbols, but the output is really huge! Probably 
the error is in the mca_pml_v symbol:

19643:  /home/lfialho/lib/openmpi/mca_vprotocol_receiver.so: error: symbol 
lookup error: undefined symbol: mca_pml_v (fatal)
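
Since the LD_DEBUG output is so large, a small filter can help pick out the one line that matters. The following is only a sketch: the function name is hypothetical, it assumes the wrapped log line above has been joined back into a single line, and it relies on the "undefined symbol: " wording seen in this one message, which other loader versions may phrase differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the symbol name that follows "undefined symbol: " into out.
 * Returns 1 on success, 0 if the marker is absent or out is too small. */
static int ld_debug_undefined_symbol(const char *line, char *out, size_t out_size)
{
    static const char marker[] = "undefined symbol: ";
    const char *start;
    size_t len;

    assert(line != NULL && out != NULL);

    start = strstr(line, marker);
    if (start == NULL) {
        return 0;
    }
    start += sizeof(marker) - 1;          /* skip past the marker text */
    len = strcspn(start, " \t\r\n");      /* the name ends at whitespace */
    if (len == 0 || len >= out_size) {
        return 0;
    }
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

Applied to each line of the debug output, this leaves only the names the loader could not resolve, here mca_pml_v.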

Leonardo


Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Ralph Castain
It's probably a visibility issue - check for an OMPI_DECLSPEC missing from the 
declaration of a symbol.
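
To make the visibility point concrete, here is a minimal sketch of what an export macro of this kind typically does. MY_DECLSPEC and the symbol names are hypothetical stand-ins, not the real OMPI_DECLSPEC definition, and the sketch assumes GCC or Clang with the library built using -fvisibility=hidden.

```c
/* Hypothetical stand-in for an export macro such as OMPI_DECLSPEC. */
#if defined(__GNUC__)
#define MY_DECLSPEC __attribute__((visibility("default")))
#else
#define MY_DECLSPEC
#endif

/* Annotated: stays in the dynamic symbol table even when the library is
 * compiled with -fvisibility=hidden, so other shared objects can bind to it. */
MY_DECLSPEC int my_exported_value = 42;

/* Not annotated: with -fvisibility=hidden this symbol is hidden, and a
 * component that references it fails with an undefined-symbol error at
 * load time, even though the code compiled and linked into the library. */
int my_internal_value = 7;

/* Code inside the same library can use both symbols without trouble. */
MY_DECLSPEC int my_exported_sum(void)
{
    return my_exported_value + my_internal_value;
}
```

If the declaration of mca_pml_v lacks such an annotation, the component would see exactly the kind of lookup failure reported above.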


Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Terry Dontje
I would also start nm'ing the .so's in which you think the U symbols are 
resolved, to make sure they are exposed.  Luckily you only have 3 symbols 
to look for.
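
For picking the U entries out of a long nm listing, a small helper along these lines might do. It is a sketch only: the function name is hypothetical, and it assumes the default nm line format shown earlier in this thread, where undefined symbols have no address column.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns a pointer to the symbol name if the nm output line describes an
 * undefined ("U") symbol, or NULL for any other kind of line. */
static const char *nm_undefined_symbol(const char *line)
{
    assert(line != NULL);

    /* Undefined symbols carry no address, so the type letter comes first
     * once the leading padding is skipped. */
    while (isspace((unsigned char)*line)) {
        line++;
    }
    if (strncmp(line, "U ", 2) != 0) {
        return NULL;
    }
    line += 2;
    while (*line == ' ') {
        line++;
    }
    return (*line != '\0') ? line : NULL;
}
```

Each name it returns should then appear with a global type letter such as D or T in the `nm -D` output of the library expected to provide it; if it does not, the symbol is not exposed.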


--td


Re: [OMPI devel] Missing Symbol

2010-03-05 Thread George Bosilca
This might be an issue with the [new] way libtool loads the symbols, i.e., in a 
private space and not in a global one. Try turning off the visibility feature 
and see if you get the same error.

  george.


Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Terry Dontje

Leonardo Fialho wrote:

> Yes,
> 
> I renamed all references to Aurelien's component name and removed all code 
> related to the component itself. There are only functions which return 
> OMPI_SUCCESS. No other function is called.
> 
> I'm debugging with LD_DEBUG=symbols, but the output is really huge! Probably 
> the error is in the mca_pml_v symbol:
> 
> 19643:  /home/lfialho/lib/openmpi/mca_vprotocol_receiver.so: error: symbol 
> lookup error: undefined symbol: mca_pml_v (fatal)

That looks like the culprit to me.

--td

4) this component is a copy of the pessimist component implemented by Aurelien
5) Aurelien's component presents the same error

The question is: what mistake should generate an error during module loading?

Thanks in advance,
Leonardo
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel








Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Leonardo Fialho
Yeah, probably ompi_request_null and opal_output are not good candidates. I'm 
trying with mca_pml_v. But I'm not familiar with this framework, although it 
is really small.

George, you said to change this (opal/mca/base/mca_base_component_find.c):

#if OPAL_HAVE_LTDL_ADVISE
  component_handle = lt_dlopenadvise(target_file->filename, opal_mca_dladvise);
#else
  component_handle = lt_dlopenext(target_file->filename);
#endif

to use lt_dladvise_global instead of lt_dladvise_local?

Leonardo

On Mar 5, 2010, at 7:47 PM, Terry Dontje wrote:

> I would also start nm'ing the .so's you think the U symbols are resolved in 
> to make sure they are exposed.  Luckily you only have 3 symbols to look for.
> 
> --td
> 
> Ralph Castain wrote:
>> It's probably a visibility issue - check for an OMPI_DECLSPEC missing from 
>> the declaration of a symbol.
>> 
>> On Mar 5, 2010, at 11:40 AM, Leonardo Fialho wrote:
>> 
>>  
>>> Yes,
>>> 
>>> I renamed all references to Aurelien's componant name and removed all code 
>>> regarding to the component itself. There are only functions which returns 
>>> OMPI_SUCCESS. No other function is called.
>>> 
>>> I'm debugging with LD_DEBUG=symbols, but the output is really huge! 
>>> Probably the error is in the mca_pml_v symbol:
>>> 
>>> 19643:  /home/lfialho/lib/openmpi/mca_vprotocol_receiver.so: error: 
>>> symbol lookup error: undefined symbol: mca_pml_v (fatal)
>>> 
>>> Leonardo
>>> 
>>> On Mar 5, 2010, at 7:35 PM, Ralph Castain wrote:
>>> 
>>>
 You said this component was a copy of Aurelien's component? Did you rename 
 the critical elements (e.g., component, module) inside it to avoid name 
 confusion?
 
 On Mar 5, 2010, at 11:27 AM, Leonardo Fialho wrote:
 
  
> I see... but it is really strange because this module is clean, it does 
> not use nothing. This is the output of the nm command, I can't see any 
> symbol which is not available.
> 
> [lfialho@aoclsb-clus openmpi]$ nm mca_vprotocol_receiver.so 
> 00201208 a _DYNAMIC
> 00201408 a _GLOBAL_OFFSET_TABLE_
>  w _Jv_RegisterClasses
> 002011e0 d __CTOR_END__
> 002011d8 d __CTOR_LIST__
> 002011f0 d __DTOR_END__
> 002011e8 d __DTOR_LIST__
> 11d0 r __FRAME_END__
> 002011f8 d __JCR_END__
> 002011f8 d __JCR_LIST__
> 00201640 A __bss_start
>  w __cxa_finalize@@GLIBC_2.2.5
> 0d40 t __do_global_ctors_aux
> 07c0 t __do_global_dtors_aux
> 00201200 d __dso_handle
>  w __gmon_start__
> 00201640 A _edata
> 00201648 A _end
> 0d78 T _fini
> 0750 T _init
> 07a0 t call_gmon_start
> 00201640 b completed.6115
> 0810 t frame_dummy
>  U mca_pml_v
> 00201460 D mca_vprotocol_receiver
> 0c71 t mca_vprotocol_receiver_add_comm
> 0a5f t mca_vprotocol_receiver_add_procs
> 00201540 D mca_vprotocol_receiver_component
> 0cc3 t mca_vprotocol_receiver_component_close
> 0d18 t mca_vprotocol_receiver_component_finalize
> 0cce t mca_vprotocol_receiver_component_init
> 0cb8 t mca_vprotocol_receiver_component_open
> 0c93 t mca_vprotocol_receiver_del_comm
> 0a89 t mca_vprotocol_receiver_del_procs
> 083c t mca_vprotocol_receiver_dump
> 0d23 t mca_vprotocol_receiver_enable
> 09e7 t mca_vprotocol_receiver_iprobe
> 0b9a t mca_vprotocol_receiver_irecv
> 0ab3 t mca_vprotocol_receiver_isend
> 0a29 t mca_vprotocol_receiver_probe
> 0c00 t mca_vprotocol_receiver_recv
> 0b21 t mca_vprotocol_receiver_send
> 09bd T mca_vprotocol_receiver_start
> 0864 t mca_vprotocol_receiver_test
> 0896 t mca_vprotocol_receiver_test_all
> 08d0 t mca_vprotocol_receiver_test_any
> 0950 t mca_vprotocol_receiver_test_some
> 0916 t mca_vprotocol_receiver_wait_any
> 098a t mca_vprotocol_receiver_wait_some
>  U ompi_request_null
>  U opal_output
> 00201440 d p.6113
> [lfialho@aoclsb-clus openmpi]$
> 
> On Mar 5, 2010, at 7:00 PM, Terry Dontje wrote:
> 
>
>> Sorry meant to add this, but you might be able to try and find the 
>> symbol causing the issue by twiddling with LD_DEBUG
>> 
>> --td
>> Terry Dontje wrote:
>>  
>>> Possibly there is an external symbol in the .so that is being loaded 
>>> that cannot be resolved.
>>> --td
>>> Leonardo Fialho wrote:
>>>
 Hi,
 
 I know that

Re: [OMPI devel] RFC: Rename --enable-*-threads and ENABLE*THREAD* (take 2)

2010-03-05 Thread Ralph Castain
I'm not going to commit this today - I think it would be a little quick :-)

However, I do have it all building on a Mercurial branch with the new options. 
It would be REALLY GOOD if people interested in thread support were to check 
this out prior to me bringing it to the trunk.

The branch can be cloned with:

hg clone https://r...@bitbucket.org/rhc/ompi-threads/
Please let me know if you test it and what you find.
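
For anyone testing the branch, here is a rough model of how the three 
proposed switches relate, based only on the RFC text quoted below. It is a 
sketch, not the configure logic itself: the struct and function names are 
hypothetical, and the only rule encoded is the one the RFC states, namely 
that --enable-mpi-thread-multiple implies --enable-opal-multi-threads and 
not the other way around.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the three proposed configure switches. */
struct thread_config {
    bool opal_progress_threads;  /* --enable-opal-progress-threads */
    bool opal_multi_threads;     /* --enable-opal-multi-threads    */
    bool mpi_thread_multiple;    /* --enable-mpi-thread-multiple   */
};

/* Apply the implication stated in the RFC: asking for MPI_THREAD_MULTIPLE
 * also turns on the OPAL multi-thread machinery, never the reverse. */
struct thread_config resolve_thread_config(struct thread_config requested)
{
    struct thread_config resolved = requested;

    if (requested.mpi_thread_multiple) {
        resolved.opal_multi_threads = true;
    }

    /* Invariant that must hold after resolution. */
    assert(!resolved.mpi_thread_multiple || resolved.opal_multi_threads);
    return resolved;
}
```

With this model, requesting only the MPI switch yields a configuration with 
both the MPI and OPAL multi-thread flags set, while requesting only the OPAL 
switch leaves the MPI layer untouched.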

Thanks
Ralph


On Mar 5, 2010, at 4:18 AM, Terry Dontje wrote:

> A couple comments:
> 1.  I really assume the timeout is March 5th not February.
> 2.  As to keeping the deprecated variables I think you really need to ditch 
> the --enable-mpi-threads because if you synonym it with 
> --enable-mpi-thread-multiple you are not mimicing what it did before but 
> redefining it IMHO.  (I am ok with the ditching personally).
> 
> --td
> 
> Jeff Squyres wrote:
>> WHAT: Rename the --enable-*-threads configure switches and ENABLE*THREAD* 
>> macros.
>>  (see previous RFC: 
>> http://www.open-mpi.org/community/lists/devel/2010/01/7366.php)
>> 
>> WHY: The fact that thread safety in OPAL and ORTE requires a configure 
>> switch with "mpi" in the name is very non-intuitive.  Additionally, 
>> MPI_THREAD_MULTIPLE support is not necessarily the same thing as OPAL thread 
>> support (MTM needs OPAL thread support, but not the other way around), and 
>> we are seeing a growing advantage/need for ORTE to utilize threads in mpirun 
>> and orted irrespective of the MPI layer's threading abilities.
>> 
>> WHERE: Mostly opal/config/opal_config_threads.m4, something new in 
>> ompi/config/*.m4, and wherever the current ENABLE*THREAD* macros are 
>> currently used in the current code base.
>> 
>> WHEN: Next Friday COB
>> 
>> TIMEOUT: COB, Friday, Feb 5, 2010
>> 
>> 
>> 
>> More details:
>> 
>> Cisco is starting to investigate using ORTE and OPAL in various threading 
>> scenarios.  The fact that you need to enable thread safety in ORTE/OPAL with 
>> a configure switch that has the word "mpi" in it is extremely 
>> counter-intuitive (it bit some of our engineers very badly, and they were 
>> mighty annoyed!!).  In addition, we ran into problems where it was 
>> advantageous to have threads in ORTE, but we couldn't do it without forcing 
>> thread support into the MPI layer because the switch is universal.
>> 
>> Since this functionality actually has nothing to do with MPI (it's actually 
>> the other way around -- MPI_THREAD_MULTIPLE needs this functionality), we 
>> really should divorce MPI threading functionality from whether threading 
>> machinery is enabled in OPAL or not. 
>> These names were proposed at the end of the previous RFC and no one 
>> objected, so I'm sending this around as a new RFC to ensure we're all on the 
>> same sheet of music:
>> 
>> --enable-opal-progress-threads: enables progress thread machinery in opal
>> --> this is just a renaming from --enable-progress-threads
>> --> the corresponding #define stays the same: OPAL_ENABLE_PROGRESS_THREADS
>> 
>> --enable-opal-multi-threads: enables multi threaded machinery in opal
>> --> this is just a renaming from --enable-mpi-threads
>> --> the corresponding #define also renames; from OPAL_ENABLE_MPI_THREADS to 
>> OPAL_ENABLE_MULTI_THREADS
>> 
>> --enable-mpi-thread-multiple: enables the use of MPI_THREAD_MULTIPLE; *ONLY* 
>> affects the MPI layer
>> --> use of this switch explicitly implies --enable-opal-multi-threads
>> --> new #define: OMPI_ENABLE_THREAD_MULTIPLE
>> 
>> We can keep and deprecate the old configure options if desired:
>> 
>> --enable-mpi-threads: deprecated synonym for --enable-mpi-thread-multiple
>> --enable-progress-threads: deprecated synonym for 
>> --enable-opal-progress-threads
>> 
>> ..although I'm somewhat inclined to ditch them unless someone has strong 
>> feelings about keeping them.
>> 
>> Doing the name change in OPAL and ORTE is fairly straightforward -- it's 
>> essentially an s/foo/bar/g kind of operation.  It'll likely take a little 
>> more effort in the MPI layer because the places where the current #defines 
>> are used may need to switch to the new name or to the new 
>> OMPI_ENABLE_THREAD_MULTIPLE name (and maybe some new logic?  I am not sure 
>> without looking into it closer).
>> 
>>  
> 



Re: [OMPI devel] Missing Symbol

2010-03-05 Thread George Bosilca
I would first try the Open MPI configure option --disable-visibility. If this 
doesn't fix it, you should make sure that dlopen is called with the GLOBAL flag 
on (I don't remember where exactly in the code, and unfortunately I can't check 
right now). Use gdb and set a breakpoint on dlopen and you will find it.
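
To illustrate what symbol visibility means here, a minimal sketch follows. 
It assumes a GCC-compatible compiler. EXAMPLE_DECLSPEC and EXAMPLE_HIDDEN 
are hypothetical stand-ins for an export macro such as OMPI_DECLSPEC; the 
real definition lives in the Open MPI headers and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical export/hide macros (assumption: GCC-style attributes). */
#if defined(__GNUC__)
#  define EXAMPLE_DECLSPEC __attribute__((visibility("default")))
#  define EXAMPLE_HIDDEN   __attribute__((visibility("hidden")))
#else
#  define EXAMPLE_DECLSPEC
#  define EXAMPLE_HIDDEN
#endif

struct example_module {
    int version;
};

/* Exported data symbol: stays in the dynamic symbol table even when the
 * library is compiled with -fvisibility=hidden. */
EXAMPLE_DECLSPEC struct example_module example_module_instance = { 1 };

/* Hidden helper: callable inside this library only. Another shared object
 * referencing it would fail with an "undefined symbol" error at load time. */
EXAMPLE_HIDDEN int example_internal_helper(int x)
{
    return x + 1;
}

/* Exported function that other shared objects may call. */
EXAMPLE_DECLSPEC int example_module_version(const struct example_module *m)
{
    assert(m != NULL);
    return m->version;
}
```

If a symbol that another component needs is missing its export marker, nm on 
the defining library would typically show it with a lowercase type letter 
(local) instead of an uppercase one.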

  george.

On Mar 5, 2010, at 14:00 , Leonardo Fialho wrote:

> Yeah, probably ompi_request_null and opal_output are not good candidates. I'm 
> trying with mca_pml_v. But I'm not familiarized with this framework although 
> it is really small.
> 
> George, you said to change this (opal/mca/base/mca_base_component_find.c):
> 
> #if OPAL_HAVE_LTDL_ADVISE
>  component_handle = lt_dlopenadvise(target_file->filename, opal_mca_dladvise);
> #else
>  component_handle = lt_dlopenext(target_file->filename);
> #endif
> 
> to use lt_dladvise_global instead of lt_dladvise_local?
> 
> Leonardo
> 
> On Mar 5, 2010, at 7:47 PM, Terry Dontje wrote:
> 
>> I would also start nm'ing the .so's you think the U symbols are resolved in 
>> to make sure they are exposed.  Luckily you only have 3 symbols to look for.
>> 
>> --td
>> 
>> Ralph Castain wrote:
>>> It's probably a visibility issue - check for an OMPI_DECLSPEC missing from 
>>> the declaration of a symbol.
>>> 
>>> On Mar 5, 2010, at 11:40 AM, Leonardo Fialho wrote:
>>> 
>>> 
 Yes,
 
 I renamed all references to Aurelien's componant name and removed all code 
 regarding to the component itself. There are only functions which returns 
 OMPI_SUCCESS. No other function is called.
 
 I'm debugging with LD_DEBUG=symbols, but the output is really huge! 
 Probably the error is in the mca_pml_v symbol:
 
 19643: /home/lfialho/lib/openmpi/mca_vprotocol_receiver.so: error: 
 symbol lookup error: undefined symbol: mca_pml_v (fatal)
 
 Leonardo
 
 On Mar 5, 2010, at 7:35 PM, Ralph Castain wrote:
 
 
> You said this component was a copy of Aurelien's component? Did you 
> rename the critical elements (e.g., component, module) inside it to avoid 
> name confusion?
> 
> On Mar 5, 2010, at 11:27 AM, Leonardo Fialho wrote:
> 
> 
>> I see... but it is really strange because this module is clean, it does 
>> not use nothing. This is the output of the nm command, I can't see any 
>> symbol which is not available.
>> 
>> [lfialho@aoclsb-clus openmpi]$ nm mca_vprotocol_receiver.so 
>> 00201208 a _DYNAMIC
>> 00201408 a _GLOBAL_OFFSET_TABLE_
>> w _Jv_RegisterClasses
>> 002011e0 d __CTOR_END__
>> 002011d8 d __CTOR_LIST__
>> 002011f0 d __DTOR_END__
>> 002011e8 d __DTOR_LIST__
>> 11d0 r __FRAME_END__
>> 002011f8 d __JCR_END__
>> 002011f8 d __JCR_LIST__
>> 00201640 A __bss_start
>> w __cxa_finalize@@GLIBC_2.2.5
>> 0d40 t __do_global_ctors_aux
>> 07c0 t __do_global_dtors_aux
>> 00201200 d __dso_handle
>> w __gmon_start__
>> 00201640 A _edata
>> 00201648 A _end
>> 0d78 T _fini
>> 0750 T _init
>> 07a0 t call_gmon_start
>> 00201640 b completed.6115
>> 0810 t frame_dummy
>> U mca_pml_v
>> 00201460 D mca_vprotocol_receiver
>> 0c71 t mca_vprotocol_receiver_add_comm
>> 0a5f t mca_vprotocol_receiver_add_procs
>> 00201540 D mca_vprotocol_receiver_component
>> 0cc3 t mca_vprotocol_receiver_component_close
>> 0d18 t mca_vprotocol_receiver_component_finalize
>> 0cce t mca_vprotocol_receiver_component_init
>> 0cb8 t mca_vprotocol_receiver_component_open
>> 0c93 t mca_vprotocol_receiver_del_comm
>> 0a89 t mca_vprotocol_receiver_del_procs
>> 083c t mca_vprotocol_receiver_dump
>> 0d23 t mca_vprotocol_receiver_enable
>> 09e7 t mca_vprotocol_receiver_iprobe
>> 0b9a t mca_vprotocol_receiver_irecv
>> 0ab3 t mca_vprotocol_receiver_isend
>> 0a29 t mca_vprotocol_receiver_probe
>> 0c00 t mca_vprotocol_receiver_recv
>> 0b21 t mca_vprotocol_receiver_send
>> 09bd T mca_vprotocol_receiver_start
>> 0864 t mca_vprotocol_receiver_test
>> 0896 t mca_vprotocol_receiver_test_all
>> 08d0 t mca_vprotocol_receiver_test_any
>> 0950 t mca_vprotocol_receiver_test_some
>> 0916 t mca_vprotocol_receiver_wait_any
>> 098a t mca_vprotocol_receiver_wait_some
>> U ompi_request_null
>> U opal_output
>> 00201440 d p.6113
>> [lfialho@aoclsb-clus openmpi]$
>> 

[OMPI devel] Adding error/verbose messages to the TCP BTL

2010-03-05 Thread Jeff Squyres
From https://svn.open-mpi.org/trac/ompi/ticket/2045, I have added a lot more 
diagnostic error and verbose messages to the TCP BTL that detail what 
endpoints it creates, what IP addresses and ports it's trying to connect to, 
etc.  As part of this, I also added a magic ID string into the TCP BTL socket 
handshake so that processes can identify if the socket peer is an OMPI process.
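
The idea behind the magic ID string can be sketched as below. This is only 
an illustration and not the code in the changeset: the magic value and the 
function name are hypothetical, and a real handshake would also have to deal 
with partial reads.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical magic string; the value used by the TCP BTL is not given
 * in this message. */
static const char example_magic[] = "EXAMPLE-PEER-ID";

/* Returns 1 when the first bytes received from a peer match the magic
 * string (including its trailing NUL), 0 otherwise. */
int peer_sent_magic(const void *buf, size_t len)
{
    const size_t magic_len = sizeof(example_magic);

    assert(buf != NULL);
    if (len < magic_len) {
        return 0;   /* too short to contain the magic string */
    }
    return memcmp(buf, example_magic, magic_len) == 0;
}
```

A receiver would call peer_sent_magic() on the first bytes read from a newly 
accepted socket and drop the connection when it returns 0, for example when 
some unrelated program connects to the port.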

http://bitbucket.org/jsquyres/tcp-debug-printf/

The initial commit with all the new messages and whatnot is here:

http://bitbucket.org/jsquyres/tcp-debug-printf/changeset/9efe756cda30/

There are now multiple levels of TCP BTL verbosity: 10, 20, and 30.  Each level 
gives successively more information.
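
As a rough sketch of how level-gated verbose output works in general (this 
is not the TCP BTL code, and example_verbose is a hypothetical helper), a 
message tagged with a level is emitted only when the configured verbosity is 
at least that level. Which messages belong to level 10, 20 or 30 is an 
assumption left to the caller.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Formats a message into `out` only when the configured verbosity is at
 * least `msg_level`. Returns 1 if the message was emitted, 0 if it was
 * suppressed (in which case `out` is set to the empty string). */
int example_verbose(int configured, int msg_level,
                    char *out, size_t outlen, const char *fmt, ...)
{
    va_list ap;

    assert(out != NULL && outlen > 0);
    if (configured < msg_level) {
        out[0] = '\0';
        return 0;
    }

    va_start(ap, fmt);
    vsnprintf(out, outlen, fmt, ap);   /* always NUL-terminates */
    va_end(ap);
    return 1;
}
```

With a configured verbosity of 10, a level-20 message is suppressed; raising 
the verbosity to 30 lets level-10, level-20 and level-30 messages through.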

This ended up in a lot more code addition than I thought it would, so I'm a 
little uncomfortable just committing it to the trunk.  Can anyone who cares 
have a look at this before I commit?  If possible, it would be good to get some 
testing on Solaris and Windows at a minimum before I commit, too -- just to 
minimize the chance of trunk breakage.

If I hear nothing back by next Friday (12 March 2010), I'll commit.

Thanks!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Adding error/verbose messages to the TCP BTL

2010-03-05 Thread George Bosilca
Being user friendly is good, being way too user friendly is less so (but I 
guess this is the price we have to pay for production-quality code, isn't it?).

I have a few comments:

- In several places you replaced the BTL_ERROR (which was the way BTLs are 
supposed to complain) by a call directly to orte_show_help. This presents 
several inconveniences: drifting away from something more or less consistent 
across all BTLs, adding more dependencies between the BTLs and ORTE.

- There are a lot of places where you just indented the code or split a 
medium-sized line into several lines. I find the code more difficult to read.

Thanks,
  george.

On Mar 5, 2010, at 14:16 , Jeff Squyres wrote:

>> From https://svn.open-mpi.org/trac/ompi/ticket/2045, I have added a lot more 
>> diagnostic error and verbose messages to the TCP BTL that detail what 
>> endpoints it creates, what IP addresses and ports its trying to connect to, 
>> etc.  As part of this, I also added a magic ID string into the TCP BTL 
>> socket handshake so that processes can identify if the socket peer is an 
>> OMPI process.
> 
>http://bitbucket.org/jsquyres/tcp-debug-printf/
> 
> The initial commit with all the new messages and whatnot is here:
> 
>http://bitbucket.org/jsquyres/tcp-debug-printf/changeset/9efe756cda30/
> 
> There are now multiple levels of TCP BTL verbosity: 10, 20, and 30.  Each 
> level gives successively more information.
> 
> This ended up in a lot more code addition than I thought it would, so I'm a 
> little uncomfortable just committing it to the trunk.  Can anyone who cares 
> have a look at this before I commit?  If possible, it would be good to get 
> some testing on Solaris and Windows at a minimum before I commit, too -- just 
> to minimize the chance of trunk breakage.
> 
> If I hear nothing back by next Friday (12 March 2010), I'll commit.
> 
> Thanks!
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 




Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Leonardo Fialho
No George, this trick does not change the problem. I'm looking for the problem 
in the mca_pml_v declaration, but I still can't figure out the reason why it 
doesn't work.

Leonardo

On Mar 5, 2010, at 8:12 PM, George Bosilca wrote:

> I would first try the Open MPI configure option --disable-visibility. If this 
> doesn't fix it, you should make sure that dlopen is called with the GLOBAL 
> flag on (don't remember where exactly in the code and unfortunately I can't 
> check right now). Use gdb and set a breakpoint to dlopen and you will find it.
> 
>  george.
> 
> On Mar 5, 2010, at 14:00 , Leonardo Fialho wrote:
> 
>> Yeah, probably ompi_request_null and opal_output are not good candidates. 
>> I'm trying with mca_pml_v. But I'm not familiarized with this framework 
>> although it is really small.
>> 
>> George, you said to change this (opal/mca/base/mca_base_component_find.c):
>> 
>> #if OPAL_HAVE_LTDL_ADVISE
>> component_handle = lt_dlopenadvise(target_file->filename, opal_mca_dladvise);
>> #else
>> component_handle = lt_dlopenext(target_file->filename);
>> #endif
>> 
>> to use lt_dladvise_global instead of lt_dladvise_local?
>> 
>> Leonardo
>> 
>> On Mar 5, 2010, at 7:47 PM, Terry Dontje wrote:
>> 
>>> I would also start nm'ing the .so's you think the U symbols are resolved in 
>>> to make sure they are exposed.  Luckily you only have 3 symbols to look for.
>>> 
>>> --td
>>> 
>>> Ralph Castain wrote:
 It's probably a visibility issue - check for an OMPI_DECLSPEC missing from 
 the declaration of a symbol.
 
 On Mar 5, 2010, at 11:40 AM, Leonardo Fialho wrote:
 
 
> Yes,
> 
> I renamed all references to Aurelien's componant name and removed all 
> code regarding to the component itself. There are only functions which 
> returns OMPI_SUCCESS. No other function is called.
> 
> I'm debugging with LD_DEBUG=symbols, but the output is really huge! 
> Probably the error is in the mca_pml_v symbol:
> 
> 19643:/home/lfialho/lib/openmpi/mca_vprotocol_receiver.so: error: 
> symbol lookup error: undefined symbol: mca_pml_v (fatal)
> 
> Leonardo
> 
> On Mar 5, 2010, at 7:35 PM, Ralph Castain wrote:
> 
> 
>> You said this component was a copy of Aurelien's component? Did you 
>> rename the critical elements (e.g., component, module) inside it to 
>> avoid name confusion?
>> 
>> On Mar 5, 2010, at 11:27 AM, Leonardo Fialho wrote:
>> 
>> 
>>> I see... but it is really strange because this module is clean, it does 
>>> not use nothing. This is the output of the nm command, I can't see any 
>>> symbol which is not available.
>>> 
>>> [lfialho@aoclsb-clus openmpi]$ nm mca_vprotocol_receiver.so 
>>> 00201208 a _DYNAMIC
>>> 00201408 a _GLOBAL_OFFSET_TABLE_
>>>w _Jv_RegisterClasses
>>> 002011e0 d __CTOR_END__
>>> 002011d8 d __CTOR_LIST__
>>> 002011f0 d __DTOR_END__
>>> 002011e8 d __DTOR_LIST__
>>> 11d0 r __FRAME_END__
>>> 002011f8 d __JCR_END__
>>> 002011f8 d __JCR_LIST__
>>> 00201640 A __bss_start
>>>w __cxa_finalize@@GLIBC_2.2.5
>>> 0d40 t __do_global_ctors_aux
>>> 07c0 t __do_global_dtors_aux
>>> 00201200 d __dso_handle
>>>w __gmon_start__
>>> 00201640 A _edata
>>> 00201648 A _end
>>> 0d78 T _fini
>>> 0750 T _init
>>> 07a0 t call_gmon_start
>>> 00201640 b completed.6115
>>> 0810 t frame_dummy
>>>U mca_pml_v
>>> 00201460 D mca_vprotocol_receiver
>>> 0c71 t mca_vprotocol_receiver_add_comm
>>> 0a5f t mca_vprotocol_receiver_add_procs
>>> 00201540 D mca_vprotocol_receiver_component
>>> 0cc3 t mca_vprotocol_receiver_component_close
>>> 0d18 t mca_vprotocol_receiver_component_finalize
>>> 0cce t mca_vprotocol_receiver_component_init
>>> 0cb8 t mca_vprotocol_receiver_component_open
>>> 0c93 t mca_vprotocol_receiver_del_comm
>>> 0a89 t mca_vprotocol_receiver_del_procs
>>> 083c t mca_vprotocol_receiver_dump
>>> 0d23 t mca_vprotocol_receiver_enable
>>> 09e7 t mca_vprotocol_receiver_iprobe
>>> 0b9a t mca_vprotocol_receiver_irecv
>>> 0ab3 t mca_vprotocol_receiver_isend
>>> 0a29 t mca_vprotocol_receiver_probe
>>> 0c00 t mca_vprotocol_receiver_recv
>>> 0b21 t mca_vprotocol_receiver_send
>>> 09bd T mca_vprotocol_receiver_start
>>> 0864 t mca_vprotocol_receiver_test
>>> 0896 t mca_vprotocol_receiver_test_all
>>> 08d0 t mca_vprotocol_

Re: [OMPI devel] Missing Symbol

2010-03-05 Thread George Bosilca
I guess that is because it is declared by another module that is loaded 
dynamically at runtime. Since libtool does not load the symbols into a global 
scope, mca_pml_v will not be visible to other modules trying to use it.
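
A minimal sketch of the scope difference, using plain POSIX dlopen() rather 
than the libltdl calls Open MPI actually goes through: the function names 
are hypothetical, and on older glibc the program has to be linked with -ldl.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Build the dlopen() mode for a component. RTLD_GLOBAL makes the
 * component's symbols available to libraries loaded later; RTLD_LOCAL
 * keeps them private to the returned handle. */
int component_dlopen_mode(int want_global)
{
    return RTLD_LAZY | (want_global ? RTLD_GLOBAL : RTLD_LOCAL);
}

/* Open a component with the chosen scope. Returns NULL on failure;
 * dlerror() then describes the reason. */
void *open_component(const char *path, int want_global)
{
    assert(path != NULL);
    return dlopen(path, component_dlopen_mode(want_global));
}
```

If the module that defines a symbol is opened with the local scope, a 
component loaded afterwards that refers to the same symbol cannot resolve it 
through the global lookup, which would match the "undefined symbol" error 
reported earlier in this thread.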

  george.

On Mar 5, 2010, at 14:35 , Leonardo Fialho wrote:

> No George, this trick does not change the problem. I'm looking for the 
> problem in the mca_pml_v declaration, but I still can't figure out the reason 
> why it doesn't work.
> 
> Leonardo
> 
> On Mar 5, 2010, at 8:12 PM, George Bosilca wrote:
> 
>> I would first try the Open MPI configure option --disable-visibility. If 
>> this doesn't fix it, you should make sure that dlopen is called with the 
>> GLOBAL flag on (don't remember where exactly in the code and unfortunately I 
>> can't check right now). Use gdb and set a breakpoint to dlopen and you will 
>> find it.
>> 
>> george.
>> 
>> On Mar 5, 2010, at 14:00 , Leonardo Fialho wrote:
>> 
>>> Yeah, probably ompi_request_null and opal_output are not good candidates. 
>>> I'm trying with mca_pml_v. But I'm not familiarized with this framework 
>>> although it is really small.
>>> 
>>> George, you said to change this (opal/mca/base/mca_base_component_find.c):
>>> 
>>> #if OPAL_HAVE_LTDL_ADVISE
>>> component_handle = lt_dlopenadvise(target_file->filename, 
>>> opal_mca_dladvise);
>>> #else
>>> component_handle = lt_dlopenext(target_file->filename);
>>> #endif
>>> 
>>> to use lt_dladvise_global instead of lt_dladvise_local?
>>> 
>>> Leonardo
>>> 
>>> On Mar 5, 2010, at 7:47 PM, Terry Dontje wrote:
>>> 
 I would also start nm'ing the .so's you think the U symbols are resolved 
 in to make sure they are exposed.  Luckily you only have 3 symbols to look 
 for.
 
 --td
 
 Ralph Castain wrote:
> It's probably a visibility issue - check for an OMPI_DECLSPEC missing 
> from the declaration of a symbol.
> 
> On Mar 5, 2010, at 11:40 AM, Leonardo Fialho wrote:
> 
> 
>> Yes,
>> 
>> I renamed all references to Aurelien's componant name and removed all 
>> code regarding to the component itself. There are only functions which 
>> returns OMPI_SUCCESS. No other function is called.
>> 
>> I'm debugging with LD_DEBUG=symbols, but the output is really huge! 
>> Probably the error is in the mca_pml_v symbol:
>> 
>> 19643:   /home/lfialho/lib/openmpi/mca_vprotocol_receiver.so: error: 
>> symbol lookup error: undefined symbol: mca_pml_v (fatal)
>> 
>> Leonardo
>> 
>> On Mar 5, 2010, at 7:35 PM, Ralph Castain wrote:
>> 
>> 
>>> You said this component was a copy of Aurelien's component? Did you 
>>> rename the critical elements (e.g., component, module) inside it to 
>>> avoid name confusion?
>>> 
>>> On Mar 5, 2010, at 11:27 AM, Leonardo Fialho wrote:
>>> 
>>> 
 I see... but it is really strange because this module is clean, it 
 does not use nothing. This is the output of the nm command, I can't 
 see any symbol which is not available.
 
 [lfialho@aoclsb-clus openmpi]$ nm mca_vprotocol_receiver.so 
 00201208 a _DYNAMIC
 00201408 a _GLOBAL_OFFSET_TABLE_
   w _Jv_RegisterClasses
 002011e0 d __CTOR_END__
 002011d8 d __CTOR_LIST__
 002011f0 d __DTOR_END__
 002011e8 d __DTOR_LIST__
 11d0 r __FRAME_END__
 002011f8 d __JCR_END__
 002011f8 d __JCR_LIST__
 00201640 A __bss_start
   w __cxa_finalize@@GLIBC_2.2.5
 0d40 t __do_global_ctors_aux
 07c0 t __do_global_dtors_aux
 00201200 d __dso_handle
   w __gmon_start__
 00201640 A _edata
 00201648 A _end
 0d78 T _fini
 0750 T _init
 07a0 t call_gmon_start
 00201640 b completed.6115
 0810 t frame_dummy
   U mca_pml_v
 00201460 D mca_vprotocol_receiver
 0c71 t mca_vprotocol_receiver_add_comm
 0a5f t mca_vprotocol_receiver_add_procs
 00201540 D mca_vprotocol_receiver_component
 0cc3 t mca_vprotocol_receiver_component_close
 0d18 t mca_vprotocol_receiver_component_finalize
 0cce t mca_vprotocol_receiver_component_init
 0cb8 t mca_vprotocol_receiver_component_open
 0c93 t mca_vprotocol_receiver_del_comm
 0a89 t mca_vprotocol_receiver_del_procs
 083c t mca_vprotocol_receiver_dump
 0d23 t mca_vprotocol_receiver_enable
 09e7 t mca_vprotocol_receiver_iprobe
 0b9a t mca_vprotocol_receiver_irecv
 0ab3 t mca_vpr

Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Terry Dontje
Have you found the symbol being exposed by another .so (i.e., have you run 
nm on the .so that shows the symbol)?  And are you sure that .so is 
loaded by the time your .so is being loaded?
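
One way to check the second question from inside the process is sketched 
below. It is an assumption-laden illustration, not Open MPI code: the helper 
name is hypothetical, and it relies on the POSIX behaviour that dlopen(NULL, 
...) returns a handle for the global symbol scope of the running program.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Returns 1 if `name` can be resolved from the global scope of the running
 * process (the executable, its dependencies, and anything opened with
 * RTLD_GLOBAL), 0 otherwise. A symbol whose value is NULL is reported as
 * missing, which is acceptable for this sketch. */
int symbol_in_global_scope(const char *name)
{
    void *self;
    int found;

    assert(name != NULL);
    self = dlopen(NULL, RTLD_LAZY);   /* handle for the global scope */
    if (self == NULL) {
        return 0;
    }
    found = (dlsym(self, name) != NULL);
    dlclose(self);
    return found;
}
```

Calling this just before the failing component is opened, with the name of 
the unresolved symbol, would tell whether the defining .so is both loaded 
and visible at that point.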


--td
Leonardo Fialho wrote:

No George, this trick does not change the problem. I'm looking for the problem 
in the mca_pml_v declaration, but I still can't figure out the reason why it 
doesn't work.

Leonardo

On Mar 5, 2010, at 8:12 PM, George Bosilca wrote:

  

I would first try the Open MPI configure option --disable-visibility. If this 
doesn't fix it, you should make sure that dlopen is called with the GLOBAL flag 
on (don't remember where exactly in the code and unfortunately I can't check 
right now). Use gdb and set a breakpoint to dlopen and you will find it.

 george.

On Mar 5, 2010, at 14:00 , Leonardo Fialho wrote:



Yeah, probably ompi_request_null and opal_output are not good candidates. I'm 
trying with mca_pml_v. But I'm not familiar with this framework, although it 
is really small.

George, you said to change this (opal/mca/base/mca_base_component_find.c):

#if OPAL_HAVE_LTDL_ADVISE
component_handle = lt_dlopenadvise(target_file->filename, opal_mca_dladvise);
#else
component_handle = lt_dlopenext(target_file->filename);
#endif

to use lt_dladvise_global instead of lt_dladvise_local?

Leonardo

On Mar 5, 2010, at 7:47 PM, Terry Dontje wrote:

  

I would also start nm'ing the .so's you think the U symbols are resolved in to 
make sure they are exposed.  Luckily you only have 3 symbols to look for.

--td

Ralph Castain wrote:


It's probably a visibility issue - check for an OMPI_DECLSPEC missing from the 
declaration of a symbol.

On Mar 5, 2010, at 11:40 AM, Leonardo Fialho wrote:


  

Yes,

I renamed all references to Aurelien's component name and removed all code 
related to the component itself. There are only functions that return 
OMPI_SUCCESS. No other function is called.

I'm debugging with LD_DEBUG=symbols, but the output is really huge! Probably 
the error is in the mca_pml_v symbol:

19643:  /home/lfialho/lib/openmpi/mca_vprotocol_receiver.so: error: symbol 
lookup error: undefined symbol: mca_pml_v (fatal)

Leonardo

On Mar 5, 2010, at 7:35 PM, Ralph Castain wrote:




You said this component was a copy of Aurelien's component? Did you rename the 
critical elements (e.g., component, module) inside it to avoid name confusion?

On Mar 5, 2010, at 11:27 AM, Leonardo Fialho wrote:


  

I see... but it is really strange because this module is clean, it does not use 
anything. This is the output of the nm command, I can't see any symbol which is 
not available.

[lfialho@aoclsb-clus openmpi]$ nm mca_vprotocol_receiver.so 
00201208 a _DYNAMIC
00201408 a _GLOBAL_OFFSET_TABLE_
   w _Jv_RegisterClasses
002011e0 d __CTOR_END__
002011d8 d __CTOR_LIST__
002011f0 d __DTOR_END__
002011e8 d __DTOR_LIST__
11d0 r __FRAME_END__
002011f8 d __JCR_END__
002011f8 d __JCR_LIST__
00201640 A __bss_start
   w __cxa_finalize@@GLIBC_2.2.5
0d40 t __do_global_ctors_aux
07c0 t __do_global_dtors_aux
00201200 d __dso_handle
   w __gmon_start__
00201640 A _edata
00201648 A _end
0d78 T _fini
0750 T _init
07a0 t call_gmon_start
00201640 b completed.6115
0810 t frame_dummy
   U mca_pml_v
00201460 D mca_vprotocol_receiver
0c71 t mca_vprotocol_receiver_add_comm
0a5f t mca_vprotocol_receiver_add_procs
00201540 D mca_vprotocol_receiver_component
0cc3 t mca_vprotocol_receiver_component_close
0d18 t mca_vprotocol_receiver_component_finalize
0cce t mca_vprotocol_receiver_component_init
0cb8 t mca_vprotocol_receiver_component_open
0c93 t mca_vprotocol_receiver_del_comm
0a89 t mca_vprotocol_receiver_del_procs
083c t mca_vprotocol_receiver_dump
0d23 t mca_vprotocol_receiver_enable
09e7 t mca_vprotocol_receiver_iprobe
0b9a t mca_vprotocol_receiver_irecv
0ab3 t mca_vprotocol_receiver_isend
0a29 t mca_vprotocol_receiver_probe
0c00 t mca_vprotocol_receiver_recv
0b21 t mca_vprotocol_receiver_send
09bd T mca_vprotocol_receiver_start
0864 t mca_vprotocol_receiver_test
0896 t mca_vprotocol_receiver_test_all
08d0 t mca_vprotocol_receiver_test_any
0950 t mca_vprotocol_receiver_test_some
0916 t mca_vprotocol_receiver_wait_any
098a t mca_vprotocol_receiver_wait_some
   U ompi_request_null
   U opal_output
00201440 d p.6113
[lfialho@aoclsb-clus openmpi]$

On Mar 5, 2010, at 7:00 PM, Terry Dontje wrote:




Sorry meant to add this, but you might be

Re: [OMPI devel] Adding error/verbose messages to the TCP BTL

2010-03-05 Thread Jeff Squyres
On Mar 5, 2010, at 2:34 PM, George Bosilca wrote:

> Being user friendly is good, being way too user friendly is less (but I guess 
> this is the price we have to pay for a production-quality code isn't it).

Agreed.  None of these messages appear except in error cases or if you crank up 
the verbosity.  The use case for this was a user (more than one, actually) who 
had problems with the TCP BTL deciding not to connect to peers for some reason. 
 But there was no way to know exactly what the BTL was *trying* to do -- all 
you got was (effectively), "Sorry, I couldn't connect."  So the main impetus 
for this was to give some visibility into what the TCP BTL is doing when it 
tries to connect -- you can see if it's trying to use private IP addresses by 
mistake, or somesuch.

> I have a few comments:
> 
> - In several places you replaced BTL_ERROR (which was the way BTLs are 
> supposed to complain) with a direct call to orte_show_help. This presents 
> several inconveniences: drifting away from something more or less consistent 
> across all BTLs, adding more dependencies between the BTLs and ORTE.

I have never found BTL_ERROR to be terribly helpful.  All it is is essentially 
an fprintf -- it doesn't propagate errors upward or anything.  I tend to prefer 
show_help because then you can provide a meaningful error message that way -- 
and duplicate messages are not displayed (a feature that many people have told 
me they love).  BTL_ERROR just guarantees that the user will have to 
email us to figure out what's going on because the messages aren't meaningful 
to anyone other than an OMPI developer.

> - There are a lot of places where you just indented the code or split a 
> medium-sized line into several lines. I find the code more difficult to read.

Ja; I did re-indent some code because I found it hard to read the super-long 
lines while trying to figure out the TCP BTL code.  Sorry about that.  

You do the same thing sometimes, too.  ;-)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Adding error/verbose messages to the TCP BTL

2010-03-05 Thread Ralph Castain

On Mar 5, 2010, at 12:55 PM, Jeff Squyres wrote:

> On Mar 5, 2010, at 2:34 PM, George Bosilca wrote:
> 
>> Being user friendly is good, being way too user friendly is less (but I 
>> guess this is the price we have to pay for a production-quality code isn't 
>> it).
> 
> Agreed.  None of these messages appear except in error cases or if you crank 
> up the verbosity.  The use case for this was a user (more than one, actually) 
> who had problems with the TCP BTL deciding not to connect to peers for some 
> reason.  But there was no way to know exactly what the BTL was *trying* to do 
> -- all you got was (effectively), "Sorry, I couldn't connect."  So the main 
> impetus for this was to give some visibility into what the TCP BTL is doing 
> when it tries to connect -- you can see if it's trying to use private IP 
> addresses by mistake, or somesuch.
> 
>> I have a few comments:
>> 
>> - In several places you replaced BTL_ERROR (which was the way BTLs are 
>> supposed to complain) with a direct call to orte_show_help. This presents 
>> several inconveniences: drifting away from something more or less consistent 
>> across all BTLs, adding more dependencies between the BTLs and ORTE.
> 
> I have never found BTL_ERROR to be terribly helpful.  All it is is 
> essentially an fprintf -- it doesn't propagate errors upward or anything.  I 
> tend to prefer show_help because then you can provide a meaningful error 
> message that way -- and duplicate messages are not displayed (a feature many 
> people have told me that they love).  BTL_ERROR just guarantees 
> that the user will have to email us to figure out what's going on because the 
> messages aren't meaningful to anyone other than an OMPI developer.

I'm not sure I understand this concern either, especially the latter one about 
orte dependency. There already are 5 calls to orte_show_help in this btl, along 
with several references to orte_process_info and other orte elements. What harm 
is done by adding more calls to orte_show_help?

I better understand the BTL_ERROR issue, but it raises the question as to 
whether BTL_ERROR should be defined as an orte_show_help call. That might help 
reduce the flood of duplicate messages when an error occurs.

> 
>> - There are a lot of places where you just indented the code or split a 
>> medium-sized line into several lines. I find the code more difficult to read.
> 
> Ja; I did re-indent some code because I found it hard to read the super-long 
> lines while trying to figure out the TCP BTL code.  Sorry about that.  
> 
> You do the same thing sometimes, too.  ;-)
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Adding error/verbose messages to the TCP BTL

2010-03-05 Thread George Bosilca

On Mar 5, 2010, at 14:59 , Ralph Castain wrote:

>> I have never found BTL_ERROR to be terribly helpful.  All it is is 
>> essentially an fprintf -- it doesn't propagate errors upward or anything.  I 
>> tend to prefer show_help because then you can provide a meaningful error 
>> message that way -- and duplicate messages are not displayed (a feature many 
>> people have told me that they love). BTL_ERROR just guarantees 
>> that the user will have to email us to figure out what's going on because 
>> the messages aren't meaningful to anyone other than an OMPI developer.
> 
> I'm not sure I understand this concern either, especially the latter one 
> about orte dependency. There already are 5 calls to orte_show_help in this 
> btl, along with several references to orte_process_info and other orte 
> elements. What harm is done by adding more calls to orte_show_help?
> 
> I better understand the BTL_ERROR issue, but it raises the question as to 
> whether BTL_ERROR should be defined as an orte_show_help call. That might 
> help reduce the flood of duplicate messages when an error occurs.

The project where we planned to use the BTL in another context is not dead yet. 
We didn't have much help progressing on that front, but we haven't given up 
[yet].

I agree with Jeff's comments about the BTL_ERROR. How about a middle ground 
here? We let the BTLs use BTL_ERROR, eventually with some modifications, and we 
redirect the BTL_ERROR to a more advanced macro including support for 
orte_show_help? This will require going over all the BTLs, but on the bright 
side it will give us 100% consistency in reporting errors.
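
A rough sketch of what that middle ground could look like. This is purely 
illustrative: MY_BTL_ERROR and my_show_help are hypothetical stand-ins, not the 
real BTL_ERROR macro or orte_show_help, and the duplicate suppression is 
assumed to be a simple per-topic table.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 32

/* Topics that were already reported. Hypothetical stand-in for the
 * duplicate suppression that show_help provides. The topic pointers must
 * outlive the table (string literals in this sketch). */
static const char *seen_topics[MAX_TOPICS];
static int num_seen = 0;

/* Print a message once per topic.
 * Returns 1 if the message was printed, 0 if it was suppressed. */
static int my_show_help(const char *topic, const char *fmt, ...)
{
    va_list ap;
    int i;

    assert(topic != NULL && fmt != NULL);

    for (i = 0; i < num_seen; ++i) {
        if (0 == strcmp(seen_topics[i], topic)) {
            return 0;               /* duplicate: stay quiet */
        }
    }
    if (num_seen < MAX_TOPICS) {
        seen_topics[num_seen++] = topic;
    }

    fprintf(stderr, "[%s] ", topic);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
    return 1;
}

/* The BTLs keep calling a single macro; only the macro knows how the
 * error is actually reported. */
#define MY_BTL_ERROR(topic, ...) my_show_help((topic), __VA_ARGS__)
```

A BTL would then call something like MY_BTL_ERROR("tcp-connect", "connect to 
%s failed", addr), and a second failure with the same topic would be silent. 
The point of the indirection is that the BTLs themselves never name the 
reporting backend.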

  Thanks,
george.




Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Jeff Squyres
Ick. 

I wondered aloud on IM to Terry after your earlier emails if we should just 
custom-patch ltdl in OMPI to fix this issue.  The problem is that libltdl is 
effectively reporting the "wrong" error back to OMPI, so the error string that 
we get to print out ends up not being very useful (e.g., not showing which 
symbol was missing, or what the problem was with the dlopen).  Fixing this 
properly in libltdl is actually somewhat tricky -- which is why it hasn't been 
fixed yet.  But given that OMPI's use of libltdl is pretty specific, we might 
be able to get away with a simple fix that works just for OMPI (but wouldn't 
necessarily be suitable for all other libltdl users).

Hmmm...

This looks do-able.  I'll commit in a bit.



On Mar 5, 2010, at 1:27 PM, Leonardo Fialho wrote:

> I see... but it is really strange because this module is clean, it does not 
> use anything. This is the output of the nm command, I can't see any symbol 
> which is not available.
> 
> [lfialho@aoclsb-clus openmpi]$ nm mca_vprotocol_receiver.so
> 00201208 a _DYNAMIC
> 00201408 a _GLOBAL_OFFSET_TABLE_
>  w _Jv_RegisterClasses
> 002011e0 d __CTOR_END__
> 002011d8 d __CTOR_LIST__
> 002011f0 d __DTOR_END__
> 002011e8 d __DTOR_LIST__
> 11d0 r __FRAME_END__
> 002011f8 d __JCR_END__
> 002011f8 d __JCR_LIST__
> 00201640 A __bss_start
>  w __cxa_finalize@@GLIBC_2.2.5
> 0d40 t __do_global_ctors_aux
> 07c0 t __do_global_dtors_aux
> 00201200 d __dso_handle
>  w __gmon_start__
> 00201640 A _edata
> 00201648 A _end
> 0d78 T _fini
> 0750 T _init
> 07a0 t call_gmon_start
> 00201640 b completed.6115
> 0810 t frame_dummy
>  U mca_pml_v
> 00201460 D mca_vprotocol_receiver
> 0c71 t mca_vprotocol_receiver_add_comm
> 0a5f t mca_vprotocol_receiver_add_procs
> 00201540 D mca_vprotocol_receiver_component
> 0cc3 t mca_vprotocol_receiver_component_close
> 0d18 t mca_vprotocol_receiver_component_finalize
> 0cce t mca_vprotocol_receiver_component_init
> 0cb8 t mca_vprotocol_receiver_component_open
> 0c93 t mca_vprotocol_receiver_del_comm
> 0a89 t mca_vprotocol_receiver_del_procs
> 083c t mca_vprotocol_receiver_dump
> 0d23 t mca_vprotocol_receiver_enable
> 09e7 t mca_vprotocol_receiver_iprobe
> 0b9a t mca_vprotocol_receiver_irecv
> 0ab3 t mca_vprotocol_receiver_isend
> 0a29 t mca_vprotocol_receiver_probe
> 0c00 t mca_vprotocol_receiver_recv
> 0b21 t mca_vprotocol_receiver_send
> 09bd T mca_vprotocol_receiver_start
> 0864 t mca_vprotocol_receiver_test
> 0896 t mca_vprotocol_receiver_test_all
> 08d0 t mca_vprotocol_receiver_test_any
> 0950 t mca_vprotocol_receiver_test_some
> 0916 t mca_vprotocol_receiver_wait_any
> 098a t mca_vprotocol_receiver_wait_some
>  U ompi_request_null
>  U opal_output
> 00201440 d p.6113
> [lfialho@aoclsb-clus openmpi]$
> 
> On Mar 5, 2010, at 7:00 PM, Terry Dontje wrote:
> 
> > Sorry meant to add this, but you might be able to try and find the symbol 
> > causing the issue by twiddling with LD_DEBUG
> >
> > --td
> > Terry Dontje wrote:
> >> Possibly there is an external symbol in the .so that is being loaded that 
> >> cannot be resolved.
> >> --td
> >> Leonardo Fialho wrote:
> >>> Hi,
> >>>
> >>> I know that libtool does not help us to find the source of this error, 
> >>> but, what can generate the following error?
> >>>
> >>> [aoclsb-clus.uab.es:11724] mca: base: component_find: unable to open 
> >>> /home/lfialho/lib/openmpi/mca_vprotocol_receiver: perhaps a missing 
> >>> symbol, or compiled for a different version of Open MPI? (ignored)
> >>>
> >>> 1) yes, the file exists
> >>> 2) yes, it has been compiled among all other components
> >>> 3) yes, it is the same Open MPI version
> >>> 4) this component is a copy of the pessimist component implemented by 
> >>> Aurelien
> >>> 5) Aurelien's component presents the same error
> >>>
> >>> The question is: what mistake should generate an error during module 
> >>> loading?
> >>>
> >>> Thanks in advance,
> >>> Leonardo
> 
> 
> _

Re: [OMPI devel] Adding error/verbose messages to the TCP BTL

2010-03-05 Thread Ralph Castain

On Mar 5, 2010, at 3:52 PM, George Bosilca wrote:

> 
> On Mar 5, 2010, at 14:59 , Ralph Castain wrote:
> 
>>> I have never found BTL_ERROR to be terribly helpful.  All it is is 
>>> essentially an fprintf -- it doesn't propagate errors upward or anything.  
>>> I tend to prefer show_help because then you can provide a meaningful error 
>>> message that way -- and duplicate messages are not displayed (a feature many 
>>> people have told me that they love). BTL_ERROR just guarantees 
>>> that the user will have to email us to figure out what's going on because 
>>> the messages aren't meaningful to anyone other than an OMPI developer.
>> 
>> I'm not sure I understand this concern either, especially the latter one 
>> about orte dependency. There already are 5 calls to orte_show_help in this 
>> btl, along with several references to orte_process_info and other orte 
>> elements. What harm is done by adding more calls to orte_show_help?
>> 
>> I better understand the BTL_ERROR issue, but it raises the question as to 
>> whether BTL_ERROR should be defined as an orte_show_help call. That might 
>> help reduce the flood of duplicate messages when an error occurs.
> 
> The project where we planned to use the BTL in another context is not dead 
> yet. We didn't have much help progressing on that front, but we haven't given 
> up [yet].
> 
> I agree with Jeff's comments about the BTL_ERROR. How about a middle ground 
> here? We let the BTLs use BTL_ERROR, eventually with some modifications, and 
> we redirect the BTL_ERROR to a more advanced macro including support for 
> orte_show_help? This will require going over all the BTLs, but on the bright 
> side it will give us 100% consistency in reporting errors.

Sounds reasonable to me - I'm happy to help do it, assuming Jeff also concurs. 
I assume we would then replace all the show_help calls as well? Otherwise, I'm 
not sure what we gain as the direct orte_show_help dependency will remain. Or 
are those calls too specialized to be replaced with BTL_ERROR?

> 
>  Thanks,
>george.
> 
> 




Re: [OMPI devel] Missing Symbol

2010-03-05 Thread George Bosilca
Unfortunately this will not fix his issues ;( I'm pretty sure that his problem is 
related to the fact that mca_pml_v is exported by another dynamic module, and 
therefore not available via dlsym. I don't think there is a simple solution for 
this problem, except going back to GLOBAL symbols.

  george.

On Mar 5, 2010, at 18:02 , Jeff Squyres wrote:

> Ick. 
> 
> I wondered aloud on IM to Terry after your earlier emails if we should just 
> custom-patch ltdl in OMPI to fix this issue.  The problem is that libltdl is 
> effectively reporting the "wrong" error back to OMPI, so the error string 
> that we get to print out ends up not being very useful (e.g., not showing 
> which symbol was missing, or what the problem was with the dlopen).  Fixing 
> this properly in libltdl is actually somewhat tricky -- which is why it 
> hasn't been fixed yet.  But given that OMPI's use of libltdl is pretty 
> specific, we might be able to get away with a simple fix that works just for 
> OMPI (but wouldn't necessarily be suitable for all other libltdl users).
> 
> Hmmm...
> 
> This looks do-able.  I'll commit in a bit.
> 
> 
> 
> On Mar 5, 2010, at 1:27 PM, Leonardo Fialho wrote:
> 
>> I see... but it is really strange because this module is clean, it does not 
>> use anything. This is the output of the nm command, I can't see any symbol 
>> which is not available.
>> 
>> [lfialho@aoclsb-clus openmpi]$ nm mca_vprotocol_receiver.so
>> 00201208 a _DYNAMIC
>> 00201408 a _GLOBAL_OFFSET_TABLE_
>> w _Jv_RegisterClasses
>> 002011e0 d __CTOR_END__
>> 002011d8 d __CTOR_LIST__
>> 002011f0 d __DTOR_END__
>> 002011e8 d __DTOR_LIST__
>> 11d0 r __FRAME_END__
>> 002011f8 d __JCR_END__
>> 002011f8 d __JCR_LIST__
>> 00201640 A __bss_start
>> w __cxa_finalize@@GLIBC_2.2.5
>> 0d40 t __do_global_ctors_aux
>> 07c0 t __do_global_dtors_aux
>> 00201200 d __dso_handle
>> w __gmon_start__
>> 00201640 A _edata
>> 00201648 A _end
>> 0d78 T _fini
>> 0750 T _init
>> 07a0 t call_gmon_start
>> 00201640 b completed.6115
>> 0810 t frame_dummy
>> U mca_pml_v
>> 00201460 D mca_vprotocol_receiver
>> 0c71 t mca_vprotocol_receiver_add_comm
>> 0a5f t mca_vprotocol_receiver_add_procs
>> 00201540 D mca_vprotocol_receiver_component
>> 0cc3 t mca_vprotocol_receiver_component_close
>> 0d18 t mca_vprotocol_receiver_component_finalize
>> 0cce t mca_vprotocol_receiver_component_init
>> 0cb8 t mca_vprotocol_receiver_component_open
>> 0c93 t mca_vprotocol_receiver_del_comm
>> 0a89 t mca_vprotocol_receiver_del_procs
>> 083c t mca_vprotocol_receiver_dump
>> 0d23 t mca_vprotocol_receiver_enable
>> 09e7 t mca_vprotocol_receiver_iprobe
>> 0b9a t mca_vprotocol_receiver_irecv
>> 0ab3 t mca_vprotocol_receiver_isend
>> 0a29 t mca_vprotocol_receiver_probe
>> 0c00 t mca_vprotocol_receiver_recv
>> 0b21 t mca_vprotocol_receiver_send
>> 09bd T mca_vprotocol_receiver_start
>> 0864 t mca_vprotocol_receiver_test
>> 0896 t mca_vprotocol_receiver_test_all
>> 08d0 t mca_vprotocol_receiver_test_any
>> 0950 t mca_vprotocol_receiver_test_some
>> 0916 t mca_vprotocol_receiver_wait_any
>> 098a t mca_vprotocol_receiver_wait_some
>> U ompi_request_null
>> U opal_output
>> 00201440 d p.6113
>> [lfialho@aoclsb-clus openmpi]$
>> 
>> On Mar 5, 2010, at 7:00 PM, Terry Dontje wrote:
>> 
>>> Sorry meant to add this, but you might be able to try and find the symbol 
>>> causing the issue by twiddling with LD_DEBUG
>>> 
>>> --td
>>> Terry Dontje wrote:
>>>> Possibly there is an external symbol in the .so that is being loaded that 
>>>> cannot be resolved.
>>>> --td
>>>> Leonardo Fialho wrote:
>>>>> Hi,
>>>>> 
>>>>> I know that libtool does not help us to find the source of this error, 
>>>>> but, what can generate the following error?
>>>>> 
>>>>> [aoclsb-clus.uab.es:11724] mca: base: component_find: unable to open 
>>>>> /home/lfialho/lib/openmpi/mca_vprotocol_receiver: perhaps a missing 
>>>>> symbol, or compiled for a different version of Open MPI? (ignored)
>>>>> 
>>>>> 1) yes, the file exists
>>>>> 2) yes, it has been compiled among all other components
>>>>> 3) yes, it is the same Open MPI version
>>>>> 4) this component is a copy of the pessimist component implemented by 
>>>>> Aurelien
>>>>> 5) Aurelien's component presents the same error
>>>>> 
>>>>> The question is: what mistake should generate an error during module 
>>>>> loading?
>>>>> 
>>>>> Thanks in advance,
>>>>> Leonardo
>>>>> _

Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Jeff Squyres
On Mar 5, 2010, at 6:02 PM, Jeff Squyres (jsquyres) wrote:

> I wondered aloud on IM to Terry after your earlier emails if we should just 
> custom-patch ltdl in OMPI to fix this issue.  The problem is that libltdl is 
> effectively reporting the "wrong" error back to OMPI, so the error string 
> that we get to print out ends up not being very useful (e.g., not showing 
> which symbol was missing, or what the problem was with the dlopen).  Fixing 
> this properly in libltdl is actually somewhat tricky -- which is why it 
> hasn't been fixed yet.  But given that OMPI's use of libltdl is pretty 
> specific, we might be able to get away with a simple fix that works just for 
> OMPI (but wouldn't necessarily be suitable for all other libltdl users).

I made a patch for exactly what I described: it comments out the preopen 
module's setting of FILE_NOT_FOUND.  But now I'm getting foiled by the use of 
RTLD_LAZY.  For example, if I add a bogus symbol that can't be resolved into 
the TCP BTL, I get this when I run ompi_info:

-
...lots of ompi_info config output...
   MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
dyld: lazy symbol binding failed: Symbol not found: 
_jeffs_symbol_that_does_not_exist
  Referenced from: /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so
  Expected in: flat namespace
[ ompi_info aborts ]
-

This is happening because libltdl's dlopen() is being invoked with RTLD_LAZY so 
the fact that a symbol can't be resolved at dlopen() time is not a problem.  It 
becomes a fatal problem later when the component's open function is invoked and 
my unresolved symbol is exposed in all of its glory.

If I manually change the LT_LAZY_OR_NOW to RTLD_NOW in the 
libltdl/loaders/dlopen.c, then I get the behavior I was expecting:

--
...lots of ompi_info config output...
   MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
[rtp-jsquyres-8717.cisco.com:89384] mca: base: component_find: unable to open 
/Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp: 
dlopen(/Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so, 10): Symbol not found: 
_jeffs_symbol_that_does_not_exist
  Referenced from: /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so
  Expected in: flat namespace
 in /Users/jsquyres/bogus/lib/openmpi/mca_btl_tcp.so (ignored)
   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.7)
   MCA paffinity: darwin (MCA v2.0, API v2.0, Component v1.7)
...lots of ompi_info config output...
-

I.e., the dlopen() fails and my patch causes us to actually get a reasonable 
error message from libltdl.
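
To make the difference concrete outside of libltdl, here is a minimal sketch 
using plain POSIX dlopen(). It assumes that libltdl's dlopen loader ends up 
passing one of these two flags; try_open is a hypothetical helper, not OMPI or 
libltdl code. With RTLD_NOW, an unresolved symbol makes dlopen() itself fail 
and dlerror() says why; with RTLD_LAZY, an unresolved function reference may go 
unnoticed until the function is first called.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Try to dlopen() `path` with the given binding mode (RTLD_LAZY or RTLD_NOW).
 * Returns 0 on success and -1 on failure. On failure, `err` receives the
 * text reported by dlerror(). */
static int try_open(const char *path, int mode, char *err, size_t errlen)
{
    void *handle;
    const char *msg;

    assert(path != NULL && err != NULL && errlen > 0);
    err[0] = '\0';

    handle = dlopen(path, mode);
    if (NULL == handle) {
        /* dlerror() describes the most recent dl* failure */
        msg = dlerror();
        snprintf(err, errlen, "%s", msg ? msg : "unknown dlopen error");
        return -1;
    }

    dlclose(handle);
    return 0;
}
```

Calling try_open(path, RTLD_NOW, ...) on a component before loading it for 
real would surface a missing symbol at open time, which is the behavior 
described above. On some systems this needs to be linked with -ldl.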

So:

1. Given that I'm seeing this on both Linux (RHEL4) and OSX, the LT_LAZY_OR_NOW 
must be resolving to RTLD_LAZY on both Linux and OSX -- so how are you getting 
the error message that you're getting?  Is your system somehow using RTLD_NOW?

2. If OSX and Linux both use RTLD_LAZY, is my patch useful?  I'm hesitant to 
add it if it's only partially (or not at all) useful...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Missing Symbol

2010-03-05 Thread Jeff Squyres
We already use global symbols; mca_base_component_repository.c invokes:

    if (lt_dladvise_global(&opal_mca_dladvise)) {
        return OPAL_ERROR;
    }
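
For readers less familiar with libltdl: on dlopen-based loaders, the global 
advice corresponds to opening the module with RTLD_GLOBAL, so that its exported 
symbols can satisfy undefined references in modules loaded later. A small 
sketch of that idea with plain POSIX calls follows; open_global and 
symbol_is_visible are hypothetical helpers, not OMPI functions.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Open `path` so that its exported symbols become available to modules
 * that are loaded afterwards. Returns the handle, or NULL on failure. */
static void *open_global(const char *path)
{
    assert(path != NULL);
    return dlopen(path, RTLD_NOW | RTLD_GLOBAL);
}

/* Report whether `symbol` can be found in the global scope of the running
 * program: the executable, its startup libraries, and anything opened with
 * RTLD_GLOBAL. Returns 1 if found, 0 otherwise. */
static int symbol_is_visible(const char *symbol)
{
    void *self;
    int found;

    assert(symbol != NULL);

    /* A NULL path yields a handle for the main program's global scope */
    self = dlopen(NULL, RTLD_NOW);
    if (NULL == self) {
        return 0;
    }
    found = (NULL != dlsym(self, symbol));
    dlclose(self);
    return found;
}
```

A check such as symbol_is_visible("mca_pml_v") just before opening the 
vprotocol component would show whether the module that exports that symbol has 
actually been loaded globally at that point.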


On Mar 5, 2010, at 6:18 PM, George Bosilca wrote:

> Unfortunately this will not fix his issues ;( I'm pretty sure that his problem 
> is related to the fact that mca_pml_v is exported by another dynamic module, 
> and therefore not available via dlsym. I don't think there is a simple 
> solution for this problem, except going back to GLOBAL symbols.
> 
>   george.
> 
> On Mar 5, 2010, at 18:02 , Jeff Squyres wrote:
> 
> > Ick.
> >
> > I wondered aloud on IM to Terry after your earlier emails if we should just 
> > custom-patch ltdl in OMPI to fix this issue.  The problem is that libltdl 
> > is effectively reporting the "wrong" error back to OMPI, so the error 
> > string that we get to print out ends up not being very useful (e.g., not 
> > showing which symbol was missing, or what the problem was with the dlopen). 
> >  Fixing this properly in libltdl is actually somewhat tricky -- which is 
> > why it hasn't been fixed yet.  But given that OMPI's use of libltdl is 
> > pretty specific, we might be able to get away with a simple fix that works 
> > just for OMPI (but wouldn't necessarily be suitable for all other libltdl 
> > users).
> >
> > Hmmm...
> >
> > This looks do-able.  I'll commit in a bit.
> >
> >
> >
> > On Mar 5, 2010, at 1:27 PM, Leonardo Fialho wrote:
> >
> >> I see... but it is really strange because this module is clean, it does 
> >> not use anything. This is the output of the nm command, I can't see any 
> >> symbol which is not available.
> >>
> >> [lfialho@aoclsb-clus openmpi]$ nm mca_vprotocol_receiver.so
> >> 00201208 a _DYNAMIC
> >> 00201408 a _GLOBAL_OFFSET_TABLE_
> >> w _Jv_RegisterClasses
> >> 002011e0 d __CTOR_END__
> >> 002011d8 d __CTOR_LIST__
> >> 002011f0 d __DTOR_END__
> >> 002011e8 d __DTOR_LIST__
> >> 11d0 r __FRAME_END__
> >> 002011f8 d __JCR_END__
> >> 002011f8 d __JCR_LIST__
> >> 00201640 A __bss_start
> >> w __cxa_finalize@@GLIBC_2.2.5
> >> 0d40 t __do_global_ctors_aux
> >> 07c0 t __do_global_dtors_aux
> >> 00201200 d __dso_handle
> >> w __gmon_start__
> >> 00201640 A _edata
> >> 00201648 A _end
> >> 0d78 T _fini
> >> 0750 T _init
> >> 07a0 t call_gmon_start
> >> 00201640 b completed.6115
> >> 0810 t frame_dummy
> >> U mca_pml_v
> >> 00201460 D mca_vprotocol_receiver
> >> 0c71 t mca_vprotocol_receiver_add_comm
> >> 0a5f t mca_vprotocol_receiver_add_procs
> >> 00201540 D mca_vprotocol_receiver_component
> >> 0cc3 t mca_vprotocol_receiver_component_close
> >> 0d18 t mca_vprotocol_receiver_component_finalize
> >> 0cce t mca_vprotocol_receiver_component_init
> >> 0cb8 t mca_vprotocol_receiver_component_open
> >> 0c93 t mca_vprotocol_receiver_del_comm
> >> 0a89 t mca_vprotocol_receiver_del_procs
> >> 083c t mca_vprotocol_receiver_dump
> >> 0d23 t mca_vprotocol_receiver_enable
> >> 09e7 t mca_vprotocol_receiver_iprobe
> >> 0b9a t mca_vprotocol_receiver_irecv
> >> 0ab3 t mca_vprotocol_receiver_isend
> >> 0a29 t mca_vprotocol_receiver_probe
> >> 0c00 t mca_vprotocol_receiver_recv
> >> 0b21 t mca_vprotocol_receiver_send
> >> 09bd T mca_vprotocol_receiver_start
> >> 0864 t mca_vprotocol_receiver_test
> >> 0896 t mca_vprotocol_receiver_test_all
> >> 08d0 t mca_vprotocol_receiver_test_any
> >> 0950 t mca_vprotocol_receiver_test_some
> >> 0916 t mca_vprotocol_receiver_wait_any
> >> 098a t mca_vprotocol_receiver_wait_some
> >> U ompi_request_null
> >> U opal_output
> >> 00201440 d p.6113
> >> [lfialho@aoclsb-clus openmpi]$
> >>
> >> On Mar 5, 2010, at 7:00 PM, Terry Dontje wrote:
> >>
> >>> Sorry meant to add this, but you might be able to try and find the symbol 
> >>> causing the issue by twiddling with LD_DEBUG
> >>>
> >>> --td
> >>> Terry Dontje wrote:
> >>>> Possibly there is an external symbol in the .so that is being loaded 
> >>>> that cannot be resolved.
> >>>> --td
> >>>> Leonardo Fialho wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I know that libtool does not help us to find the source of this error, 
> >>>>> but, what can generate the following error?
> >>>>>
> >>>>> [aoclsb-clus.uab.es:11724] mca: base: component_find: unable to open 
> >>>>> /home/lfialho/lib/openmpi/mca_vprotocol_receiver: perhaps a missing 
> >>>>> symbol, or compiled for a different version of Open MPI? (ignored)
> >>>>>
> >>>>> 1) yes, the file exists
> >>>>> 2) yes, it has been c

Re: [OMPI devel] Adding error/verbose messages to the TCP BTL

2010-03-05 Thread Jeff Squyres
On Mar 5, 2010, at 6:10 PM, Ralph Castain wrote:

> > I agree with Jeff's comments about the BTL_ERROR. How about a middle ground 
> > here? We let the BTLs use BTL_ERROR, eventually with some modifications, 
> > and we redirect the BTL_ERROR to a more advanced macro including support 
> > for orte_show_help? This will require going over all the BTLs, but on the 
> > bright side it will give us 100% consistency in reporting errors.
> 
> Sounds reasonable to me - I'm happy to help do it, assuming Jeff also 
> concurs. I assume we would then replace all the show_help calls as well? 
> Otherwise, I'm not sure what we gain as the direct orte_show_help dependency 
> will remain. Or are those calls too specialized to be replaced with BTL_ERROR?

Should this kind of thing wait for OPAL_SOS?

(I mention this because the OPAL_SOS RFC will be sent to devel Real Soon Now...)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] Adding error/verbose messages to the TCP BTL

2010-03-05 Thread Ralph Castain

On Mar 5, 2010, at 7:22 PM, Jeff Squyres wrote:

> On Mar 5, 2010, at 6:10 PM, Ralph Castain wrote:
> 
>>> I agree with Jeff's comments about the BTL_ERROR. How about a middle ground 
>>> here? We let the BTLs use BTL_ERROR, eventually with some modifications, 
>>> and we redirect the BTL_ERROR to a more advanced macro including support 
>>> for orte_show_help? This will require going over all the BTLs, but on the 
>>> bright side it will give us 100% consistency in reporting errors.
>> 
>> Sounds reasonable to me - I'm happy to help do it, assuming Jeff also 
>> concurs. I assume we would then replace all the show_help calls as well? 
>> Otherwise, I'm not sure what we gain as the direct orte_show_help dependency 
>> will remain. Or are those calls too specialized to be replaced with 
>> BTL_ERROR?
> 
> Should this kind of thing wait for OPAL_SOS?
> 
> (I mention this because the OPAL_SOS RFC will be sent to devel Real Soon 
> Now...)

Sure - OPAL_SOS will supersede all this anyway.

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 