Re: [OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Ralph Castain
Aha - I see what happened. I have that param set to false in my default mca 
param file. If I set it to true on the cmd line, then I run without segfaulting.

Thanks!
Ralph


> On Nov 26, 2014, at 5:55 PM, Gilles Gouaillardet 
>  wrote:
> 
> Ralph,
> 
> let me correct and enhance my previous statement :
> 
> - i cannot reproduce your crash in my environment (RHEL6 like vs your RHEL7 
> like)
> (i configured with --enable-debug --enable-picky)
> 
> - i can reproduce the crash with
> mpirun --mca mpi_param_check false
> 
> - if you configured with --without-mpi-param-check, i assume you would get 
> the same crash
> (and if i understand correctly, there would be no way to --mca 
> mpi_param_check true)
> 
> here is the relevant part of my config.status :
> $ grep MPI_PARAM_CHECK config.status 
> D["MPI_PARAM_CHECK"]=" ompi_mpi_param_check"
> D["OMPI_PARAM_CHECK"]=" 1"
> 
> i will try on a centos7 box now.
> in the meantime, can you check your config.status and try again with 
> mpirun --mca mpi_param_check true
> 
> Cheers,
> 
> Gilles
> 
> On 2014/11/27 10:06, Gilles Gouaillardet wrote:
>> I will double check this (afk right now)
>> Are you running on a rhel6 like distro with gcc ?
>> 
>> Iirc, crash vs mpi error is ruled by --with-param-check or something like 
>> this...
>> 
>> Cheers,
>> 
>> Gilles 
>> 
>> Ralph Castain wrote:
>>> I tried it with both the fortran and c versions - got the same result.
>>> 
>>> 
>>> This was indeed with a debug build. I wouldn’t expect a segfault even with 
>>> an optimized build, though - I would expect an MPI error, yes?
>>> 
>>> 
>>> 
>>> 
>>> On Nov 26, 2014, at 4:26 PM, Gilles Gouaillardet 
>>>   
>>> wrote:
>>> 
>>> 
>>> I will have a look
>>> 
>>> Btw, i was running the fortran version, not the c one.
>>> Did you configure with --enable-debug ?
>>> The program sends to a rank *not* in the communicator, so this behavior 
>>> could make some sense on an optimized build.
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> Ralph Castain wrote:
>>> Ick - I’m getting a segfault when trying to run that test:
>>> 
>>> 
>>> MPITEST info  (0): Starting MPI_Errhandler_fatal test
>>> 
>>> MPITEST info  (0): This test will abort after printing the results message
>>> 
>>> MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
>>> 
>>> [bend001:07714] *** Process received signal ***
>>> 
>>> [bend001:07714] Signal: Segmentation fault (11)
>>> 
>>> [bend001:07714] Signal code: Address not mapped (1)
>>> 
>>> [bend001:07714] Failing at address: 0x50
>>> 
>>> [bend001:07715] *** Process received signal ***
>>> 
>>> [bend001:07715] Signal: Segmentation fault (11)
>>> 
>>> [bend001:07715] Signal code: Address not mapped (1)
>>> 
>>> [bend001:07715] Failing at address: 0x50
>>> 
>>> [bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
>>> 
>>> [bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
>>> 
>>> [bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
>>> 
>>> [bend001:07713] *** Process received signal ***
>>> 
>>> [bend001:07713] Signal: Segmentation fault (11)
>>> 
>>> [bend001:07713] Signal code: Address not mapped (1)
>>> 
>>> [bend001:07713] Failing at address: 0x50
>>> 
>>> [bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
>>> 
>>> [bend001:07713] [ 1] 
>>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
>>> 
>>> [bend001:07713] [ 2] [bend001:07714] [ 0] 
>>> /usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
>>> 
>>> [bend001:07714] [ 1] 
>>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
>>> 
>>> [bend001:07714] [ 2] [bend001:07715] [ 0] 
>>> /usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
>>> 
>>> [bend001:07715] [ 1] 
>>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
>>> 
>>> [bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED 
>>> (3)
>>> 
>>> 
>>> 
>>> This is with the head of the 1.8 branch. Any suggestions?
>>> 
>>> Ralph
>>> 
>>> 
>>> 
>>> On Nov 26, 2014, at 8:46 AM, Ralph Castain  
>>>  wrote:
>>> 
>>> 
>>> Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks 
>>> like the fix indeed failed to come over to 1.8. I’ll take a gander (pretty 
>>> sure I remember how I fixed it) - thanks!
>>> 
>>> On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
>>>   
>>> wrote:
>>> 
>>> Ralph,
>>> 
>>> i noted several hangs in mtt with the v1.8 branch.
>>> 
>>> a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>>> from the intel_tests suite,
>>> invoke mpirun on one node and 

Re: [OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Gilles Gouaillardet
Ralph,

let me correct and enhance my previous statement :

- i cannot reproduce your crash in my environment (RHEL6 like vs your
RHEL7 like)
(i configured with --enable-debug --enable-picky)

- i can reproduce the crash with
mpirun --mca mpi_param_check false

- if you configured with --without-mpi-param-check, i assume you would
get the same crash
(and if i understand correctly, there would be no way to --mca
mpi_param_check true)

here is the relevant part of my config.status :
$ grep MPI_PARAM_CHECK config.status
D["MPI_PARAM_CHECK"]=" ompi_mpi_param_check"
D["OMPI_PARAM_CHECK"]=" 1"

i will try on a centos7 box now.
in the meantime, can you check your config.status and try again with
mpirun --mca mpi_param_check true
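
A minimal standalone sketch (illustration only, not the intel_tests source) of the
pattern under discussion: a send to a rank that is not in the communicator. With
mpi_param_check enabled, MPI_Send is rejected with MPI_ERR_RANK and the default
MPI_ERRORS_ARE_FATAL handler aborts the job cleanly; with the check disabled or
compiled out, ob1 can end up looking up an invalid peer, which matches the
"invalid peer index" / segfault output quoted below.

/* errhandler_fatal_sketch.c -- illustration only, not the intel_tests code */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, payload = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* rank "size" does not exist in MPI_COMM_WORLD */
    MPI_Send(&payload, 1, MPI_INT, size, 0, MPI_COMM_WORLD);

    printf("rank %d: should not get here\n", rank);
    MPI_Finalize();
    return 0;
}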

Cheers,

Gilles

On 2014/11/27 10:06, Gilles Gouaillardet wrote:
> I will double check this (afk right now)
> Are you running on a rhel6 like distro with gcc ?
>
> Iirc, crash vs mpi error is ruled by --with-param-check or something like 
> this...
>
> Cheers,
>
> Gilles 
>
> Ralph Castain wrote:
>> I tried it with both the fortran and c versions - got the same result.
>>
>>
>> This was indeed with a debug build. I wouldn't expect a segfault even with 
>> an optimized build, though - I would expect an MPI error, yes?
>>
>>
>>
>>
>> On Nov 26, 2014, at 4:26 PM, Gilles Gouaillardet 
>>  wrote:
>>
>>
>> I will have a look
>>
>> Btw, i was running the fortran version, not the c one.
>> Did you configure with --enable-debug ?
>> The program sends to a rank *not* in the communicator, so this behavior 
>> could make some sense on an optimized build.
>>
>> Cheers,
>>
>> Gilles
>>
>> Ralph Castain wrote:
>> Ick - I'm getting a segfault when trying to run that test:
>>
>>
>> MPITEST info  (0): Starting MPI_Errhandler_fatal test
>>
>> MPITEST info  (0): This test will abort after printing the results message
>>
>> MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
>>
>> [bend001:07714] *** Process received signal ***
>>
>> [bend001:07714] Signal: Segmentation fault (11)
>>
>> [bend001:07714] Signal code: Address not mapped (1)
>>
>> [bend001:07714] Failing at address: 0x50
>>
>> [bend001:07715] *** Process received signal ***
>>
>> [bend001:07715] Signal: Segmentation fault (11)
>>
>> [bend001:07715] Signal code: Address not mapped (1)
>>
>> [bend001:07715] Failing at address: 0x50
>>
>> [bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
>>
>> [bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
>>
>> [bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
>>
>> [bend001:07713] *** Process received signal ***
>>
>> [bend001:07713] Signal: Segmentation fault (11)
>>
>> [bend001:07713] Signal code: Address not mapped (1)
>>
>> [bend001:07713] Failing at address: 0x50
>>
>> [bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
>>
>> [bend001:07713] [ 1] 
>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
>>
>> [bend001:07713] [ 2] [bend001:07714] [ 0] 
>> /usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
>>
>> [bend001:07714] [ 1] 
>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
>>
>> [bend001:07714] [ 2] [bend001:07715] [ 0] 
>> /usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
>>
>> [bend001:07715] [ 1] 
>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
>>
>> [bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED 
>> (3)
>>
>>
>>
>> This is with the head of the 1.8 branch. Any suggestions?
>>
>> Ralph
>>
>>
>>
>> On Nov 26, 2014, at 8:46 AM, Ralph Castain  wrote:
>>
>>
>> Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks 
>> like the fix indeed failed to come over to 1.8. I'll take a gander (pretty 
>> sure I remember how I fixed it) - thanks!
>>
>> On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
>>  wrote:
>>
>> Ralph,
>>
>> i noted several hangs in mtt with the v1.8 branch.
>>
>> a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>> from the intel_tests suite,
>> invoke mpirun on one node and run the tasks on another node :
>>
>> node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
>>
>> /* since this is a race condition, you might need to run this in a loop
>> in order to hit the bug */
>>
>> the attached tarball contains a patch (add debug + temporary hack) and
>> some log files obtained with
>> --mca errmgr_base_verbose 100 --mca odls_base_verbose 100
>>
>> without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
>> the hack, i can still reproduce the hang (though it might
>> be a different one) with -np 16 (log.ko.2.txt)
>>
>> i remember some similar hangs were fixed on the trunk/master a few
>> months ago.
>> i tried to 

Re: [OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Ralph Castain

> On Nov 26, 2014, at 5:06 PM, Gilles Gouaillardet 
>  wrote:
> 
> I will double check this (afk right now)
> Are you running on a rhel6 like distro with gcc ?

Yeah, I’m running CentOS7 and gcc 4.8.2

> 
> Iirc, crash vs mpi error is ruled by --with-param-check or something like 
> this…

Sounds right - just surprised that you wouldn’t get it but I would. Makes 
debugging this problem a tad difficult, so I may need a different error to 
debug the problem.
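
One possible way to keep the failure observable while debugging (a sketch, assuming
a debug build with parameter checking enabled, not anything from the test suite) is
to switch the communicator to MPI_ERRORS_RETURN and inspect the error class the send
returns; this only helps when the parameter check actually catches the bad rank,
otherwise the segfault path discussed above is still reachable.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int size, payload = 0, rc, eclass, len;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* return error codes instead of invoking the default fatal handler */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Send(&payload, 1, MPI_INT, size, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        MPI_Error_class(rc, &eclass);
        MPI_Error_string(rc, msg, &len);
        printf("MPI_Send failed: class %d (%s)\n", eclass, msg);
    }

    MPI_Finalize();
    return 0;
}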

> 
> Cheers,
> 
> Gilles 
> 
> Ralph Castain wrote:
> I tried it with both the fortran and c versions - got the same result.
> 
> This was indeed with a debug build. I wouldn’t expect a segfault even with an 
> optimized build, though - I would expect an MPI error, yes?
> 
> 
> 
>> On Nov 26, 2014, at 4:26 PM, Gilles Gouaillardet 
>> > wrote:
>> 
>> I will have a look
>> 
>> Btw, i was running the fortran version, not the c one.
>> Did you configure with --enable-debug ?
>> The program sends to a rank *not* in the communicator, so this behavior 
>> could make some sense on an optimized build.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> Ralph Castain wrote:
>> Ick - I’m getting a segfault when trying to run that test:
>> 
>> MPITEST info  (0): Starting MPI_Errhandler_fatal test
>> MPITEST info  (0): This test will abort after printing the results message
>> MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
>> [bend001:07714] *** Process received signal ***
>> [bend001:07714] Signal: Segmentation fault (11)
>> [bend001:07714] Signal code: Address not mapped (1)
>> [bend001:07714] Failing at address: 0x50
>> [bend001:07715] *** Process received signal ***
>> [bend001:07715] Signal: Segmentation fault (11)
>> [bend001:07715] Signal code: Address not mapped (1)
>> [bend001:07715] Failing at address: 0x50
>> [bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
>> [bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
>> [bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
>> [bend001:07713] *** Process received signal ***
>> [bend001:07713] Signal: Segmentation fault (11)
>> [bend001:07713] Signal code: Address not mapped (1)
>> [bend001:07713] Failing at address: 0x50
>> [bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
>> [bend001:07713] [ 1] 
>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
>> [bend001:07713] [ 2] [bend001:07714] [ 0] 
>> /usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
>> [bend001:07714] [ 1] 
>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
>> [bend001:07714] [ 2] [bend001:07715] [ 0] 
>> /usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
>> [bend001:07715] [ 1] 
>> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
>> [bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED 
>> (3)
>> 
>> 
>> This is with the head of the 1.8 branch. Any suggestions?
>> Ralph
>> 
>> 
>>> On Nov 26, 2014, at 8:46 AM, Ralph Castain >> > wrote:
>>> 
>>> Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks 
>>> like the fix indeed failed to come over to 1.8. I’ll take a gander (pretty 
>>> sure I remember how I fixed it) - thanks!
>>> 
 On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
 > 
 wrote:
 
 Ralph,
 
 i noted several hangs in mtt with the v1.8 branch.
 
 a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
 from the intel_tests suite,
 invoke mpirun on one node and run the tasks on another node :
 
 node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
 
 /* since this is a race condition, you might need to run this in a loop
 in order to hit the bug */
 
 the attached tarball contains a patch (add debug + temporary hack) and
 some log files obtained with
 --mca errmgr_base_verbose 100 --mca odls_base_verbose 100
 
 without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
 the hack, i can still reproduce the hang (though it might
 be a different one) with -np 16 (log.ko.2.txt)
 
 i remember some similar hangs were fixed on the trunk/master a few
 months ago.
 i tried to backport some commits but it did not help :-(
 
 could you please have a look at this ?
 
 Cheers,
 
 Gilles

Re: [OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Gilles Gouaillardet
I will double check this (afk right now)
Are you running on a rhel6 like distro with gcc ?

Iirc, crash vs mpi error is ruled by --with-param-check or something like 
this...

Cheers,

Gilles 

Ralph Castain wrote:
>I tried it with both the fortran and c versions - got the same result.
>
>
>This was indeed with a debug build. I wouldn’t expect a segfault even with an 
>optimized build, though - I would expect an MPI error, yes?
>
>
>
>
>On Nov 26, 2014, at 4:26 PM, Gilles Gouaillardet 
> wrote:
>
>
>I will have a look
>
>Btw, i was running the fortran version, not the c one.
>Did you configure with --enable-debug ?
>The program sends to a rank *not* in the communicator, so this behavior could 
>make some sense on an optimized build.
>
>Cheers,
>
>Gilles
>
>Ralph Castain wrote:
>Ick - I’m getting a segfault when trying to run that test:
>
>
>MPITEST info  (0): Starting MPI_Errhandler_fatal test
>
>MPITEST info  (0): This test will abort after printing the results message
>
>MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
>
>[bend001:07714] *** Process received signal ***
>
>[bend001:07714] Signal: Segmentation fault (11)
>
>[bend001:07714] Signal code: Address not mapped (1)
>
>[bend001:07714] Failing at address: 0x50
>
>[bend001:07715] *** Process received signal ***
>
>[bend001:07715] Signal: Segmentation fault (11)
>
>[bend001:07715] Signal code: Address not mapped (1)
>
>[bend001:07715] Failing at address: 0x50
>
>[bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07713] *** Process received signal ***
>
>[bend001:07713] Signal: Segmentation fault (11)
>
>[bend001:07713] Signal code: Address not mapped (1)
>
>[bend001:07713] Failing at address: 0x50
>
>[bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
>
>[bend001:07713] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
>
>[bend001:07713] [ 2] [bend001:07714] [ 0] 
>/usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
>
>[bend001:07714] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
>
>[bend001:07714] [ 2] [bend001:07715] [ 0] 
>/usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
>
>[bend001:07715] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
>
>[bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED (3)
>
>
>
>This is with the head of the 1.8 branch. Any suggestions?
>
>Ralph
>
>
>
>On Nov 26, 2014, at 8:46 AM, Ralph Castain  wrote:
>
>
>Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks like 
>the fix indeed failed to come over to 1.8. I’ll take a gander (pretty sure I 
>remember how I fixed it) - thanks!
>
>On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
> wrote:
>
>Ralph,
>
>i noted several hangs in mtt with the v1.8 branch.
>
>a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>from the intel_tests suite,
>invoke mpirun on one node and run the tasks on another node :
>
>node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
>
>/* since this is a race condition, you might need to run this in a loop
>in order to hit the bug */
>
>the attached tarball contains a patch (add debug + temporary hack) and
>some log files obtained with
>--mca errmgr_base_verbose 100 --mca odls_base_verbose 100
>
>without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
>the hack, i can still reproduce the hang (though it might
>be a different one) with -np 16 (log.ko.2.txt)
>
>i remember some similar hangs were fixed on the trunk/master a few
>months ago.
>i tried to backport some commits but it did not help :-(
>
>could you please have a look at this ?
>
>Cheers,
>
>Gilles


Re: [OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Ralph Castain
I tried it with both the fortran and c versions - got the same result.

This was indeed with a debug build. I wouldn’t expect a segfault even with an 
optimized build, though - I would expect an MPI error, yes?



> On Nov 26, 2014, at 4:26 PM, Gilles Gouaillardet 
>  wrote:
> 
> I will have a look
> 
> Btw, i was running the fortran version, not the c one.
> Did you configure with --enable-debug ?
> The program sends to a rank *not* in the communicator, so this behavior could 
> make some sense on an optimized build.
> 
> Cheers,
> 
> Gilles
> 
> Ralph Castain wrote:
> Ick - I’m getting a segfault when trying to run that test:
> 
> MPITEST info  (0): Starting MPI_Errhandler_fatal test
> MPITEST info  (0): This test will abort after printing the results message
> MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
> [bend001:07714] *** Process received signal ***
> [bend001:07714] Signal: Segmentation fault (11)
> [bend001:07714] Signal code: Address not mapped (1)
> [bend001:07714] Failing at address: 0x50
> [bend001:07715] *** Process received signal ***
> [bend001:07715] Signal: Segmentation fault (11)
> [bend001:07715] Signal code: Address not mapped (1)
> [bend001:07715] Failing at address: 0x50
> [bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
> [bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
> [bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
> [bend001:07713] *** Process received signal ***
> [bend001:07713] Signal: Segmentation fault (11)
> [bend001:07713] Signal code: Address not mapped (1)
> [bend001:07713] Failing at address: 0x50
> [bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
> [bend001:07713] [ 1] 
> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
> [bend001:07713] [ 2] [bend001:07714] [ 0] 
> /usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
> [bend001:07714] [ 1] 
> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
> [bend001:07714] [ 2] [bend001:07715] [ 0] 
> /usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
> [bend001:07715] [ 1] 
> /home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
> [bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED 
> (3)
> 
> 
> This is with the head of the 1.8 branch. Any suggestions?
> Ralph
> 
> 
>> On Nov 26, 2014, at 8:46 AM, Ralph Castain > > wrote:
>> 
>> Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks 
>> like the fix indeed failed to come over to 1.8. I’ll take a gander (pretty 
>> sure I remember how I fixed it) - thanks!
>> 
>>> On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
>>> > 
>>> wrote:
>>> 
>>> Ralph,
>>> 
>>> i noted several hangs in mtt with the v1.8 branch.
>>> 
>>> a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>>> from the intel_tests suite,
>>> invoke mpirun on one node and run the tasks on another node :
>>> 
>>> node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
>>> 
>>> /* since this is a race condition, you might need to run this in a loop
>>> in order to hit the bug */
>>> 
>>> the attached tarball contains a patch (add debug + temporary hack) and
>>> some log files obtained with
>>> --mca errmgr_base_verbose 100 --mca odls_base_verbose 100
>>> 
>>> without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
>>> the hack, i can still reproduce the hang (though it might
>>> be a different one) with -np 16 (log.ko.2.txt)
>>> 
>>> i remember some similar hangs were fixed on the trunk/master a few
>>> monthes ago.
>>> i tried to backport some commits but it did not help :-(
>>> 
>>> could you please have a look at this ?
>>> 
>>> Cheers,
>>> 
>>> Gilles



Re: [OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Gilles Gouaillardet
I will have a look

Btw, i was running the fortran version, not the c one.
Did you configure with --enable-debug ?
The program sends to a rank *not* in the communicator, so this behavior could 
make some sense on an optimized build.

Cheers,

Gilles

Ralph Castain wrote:
>Ick - I’m getting a segfault when trying to run that test:
>
>
>MPITEST info  (0): Starting MPI_Errhandler_fatal test
>
>MPITEST info  (0): This test will abort after printing the results message
>
>MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
>
>[bend001:07714] *** Process received signal ***
>
>[bend001:07714] Signal: Segmentation fault (11)
>
>[bend001:07714] Signal code: Address not mapped (1)
>
>[bend001:07714] Failing at address: 0x50
>
>[bend001:07715] *** Process received signal ***
>
>[bend001:07715] Signal: Segmentation fault (11)
>
>[bend001:07715] Signal code: Address not mapped (1)
>
>[bend001:07715] Failing at address: 0x50
>
>[bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
>
>[bend001:07713] *** Process received signal ***
>
>[bend001:07713] Signal: Segmentation fault (11)
>
>[bend001:07713] Signal code: Address not mapped (1)
>
>[bend001:07713] Failing at address: 0x50
>
>[bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
>
>[bend001:07713] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
>
>[bend001:07713] [ 2] [bend001:07714] [ 0] 
>/usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
>
>[bend001:07714] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
>
>[bend001:07714] [ 2] [bend001:07715] [ 0] 
>/usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
>
>[bend001:07715] [ 1] 
>/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
>
>[bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED (3)
>
>
>
>This is with the head of the 1.8 branch. Any suggestions?
>
>Ralph
>
>
>
>On Nov 26, 2014, at 8:46 AM, Ralph Castain  wrote:
>
>
>Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks like 
>the fix indeed failed to come over to 1.8. I’ll take a gander (pretty sure I 
>remember how I fixed it) - thanks!
>
>On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
> wrote:
>
>Ralph,
>
>i noted several hangs in mtt with the v1.8 branch.
>
>a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>from the intel_tests suite,
>invoke mpirun on one node and run the tasks on another node :
>
>node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
>
>/* since this is a race condition, you might need to run this in a loop
>in order to hit the bug */
>
>the attached tarball contains a patch (add debug + temporary hack) and
>some log files obtained with
>--mca errmgr_base_verbose 100 --mca odls_base_verbose 100
>
>without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
>the hack, i can still reproduce the hang (though it might
>be a different one) with -np 16 (log.ko.2.txt)
>
>i remember some similar hangs were fixed on the trunk/master a few
>months ago.
>i tried to backport some commits but it did not help :-(
>
>could you please have a look at this ?
>
>Cheers,
>
>Gilles


Re: [OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Ralph Castain
Ick - I’m getting a segfault when trying to run that test:

MPITEST info  (0): Starting MPI_Errhandler_fatal test
MPITEST info  (0): This test will abort after printing the results message
MPITEST info  (0): If it does not, then a f.a.i.l.u.r.e will be noted
[bend001:07714] *** Process received signal ***
[bend001:07714] Signal: Segmentation fault (11)
[bend001:07714] Signal code: Address not mapped (1)
[bend001:07714] Failing at address: 0x50
[bend001:07715] *** Process received signal ***
[bend001:07715] Signal: Segmentation fault (11)
[bend001:07715] Signal code: Address not mapped (1)
[bend001:07715] Failing at address: 0x50
[bend001:07714] ompi_comm_peer_lookup: invalid peer index (3)
[bend001:07713] ompi_comm_peer_lookup: invalid peer index (3)
[bend001:07715] ompi_comm_peer_lookup: invalid peer index (3)
[bend001:07713] *** Process received signal ***
[bend001:07713] Signal: Segmentation fault (11)
[bend001:07713] Signal code: Address not mapped (1)
[bend001:07713] Failing at address: 0x50
[bend001:07713] [ 0] /usr/lib64/libpthread.so.0(+0xf130)[0x7f4485ecb130]
[bend001:07713] [ 1] 
/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7f4480f74ca6]
[bend001:07713] [ 2] [bend001:07714] [ 0] 
/usr/lib64/libpthread.so.0(+0xf130)[0x7ff457885130]
[bend001:07714] [ 1] 
/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ff44e8dbca6]
[bend001:07714] [ 2] [bend001:07715] [ 0] 
/usr/lib64/libpthread.so.0(+0xf130)[0x7ffa97ff6130]
[bend001:07715] [ 1] 
/home/common/openmpi/build/ompi-release/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5d)[0x7ffa8ca6]
[bend001:07715] [ 2] MPITEST_results: MPI_Errhandler_fatal all tests PASSED (3)


This is with the head of the 1.8 branch. Any suggestions?
Ralph


> On Nov 26, 2014, at 8:46 AM, Ralph Castain  wrote:
> 
> Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks 
> like the fix indeed failed to come over to 1.8. I’ll take a gander (pretty 
> sure I remember how I fixed it) - thanks!
> 
>> On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
>> > wrote:
>> 
>> Ralph,
>> 
>> i noted several hangs in mtt with the v1.8 branch.
>> 
>> a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
>> from the intel_tests suite,
>> invoke mpirun on one node and run the tasks on another node :
>> 
>> node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
>> 
>> /* since this is a race condition, you might need to run this in a loop
>> in order to hit the bug */
>> 
>> the attached tarball contains a patch (add debug + temporary hack) and
>> some log files obtained with
>> --mca errmgr_base_verbose 100 --mca odls_base_verbose 100
>> 
>> without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
>> the hack, i can still reproduce the hang (though it might
>> be a different one) with -np 16 (log.ko.2.txt)
>> 
>> i remember some similar hangs were fixed on the trunk/master a few
>> months ago.
>> i tried to backport some commits but it did not help :-(
>> 
>> could you please have a look at this ?
>> 
>> Cheers,
>> 
>> Gilles


Re: [OMPI devel] question to OMPI_DECLSPEC

2014-11-26 Thread Edgar Gabriel

On 11/26/2014 11:02 AM, George Bosilca wrote:


We had similar problems in the PML V, and we decided to try to minimize
the increase in size of the main library. Thus, instead of moving
everything in the base, we added a structure in the base that will
contain all the pointer to the functions we would need. This structure
is only initialized when our main module is loaded, and all sub-modules
will use this structure to get access to the pointers provided.


That is an interesting option, let me think about it. What it would give 
us is that we do not have to artificially 'force' some code into the base 
of other frameworks, since in my opinion the ompio component is still 
the best place for these functions.


Thanks
Edgar




   George.

>
> 2. I will have to extend the io framework interfaces a bit ( I will try 
to minimize the number of new functions as much as I can), but those function 
pointers will be NULL for ROMIO. Just want to make sure this is ok with everybody.

I’ll have to let others chime in here, but that would seem to fit
the OMPI architecture.

 >
 > Thanks
 > Edgar
 >
 > On 11/25/2014 11:43 AM, Ralph Castain wrote:
 >>
 >>> On Nov 25, 2014, at 9:36 AM, Edgar Gabriel > wrote:
 >>>
 >>> On 11/25/2014 11:31 AM, Ralph Castain wrote:
 
 > On Nov 25, 2014, at 8:24 AM, Edgar Gabriel 
 > >> wrote:
 >
 > On 11/25/2014 10:18 AM, Ralph Castain wrote:
 >> Hmmm…no, nothing has changed with regard to declspec that I know
 >> about. I’ll ask the obvious things to check:
 >>
 >> * does that component have the proper include to find this
function?
 >> Could be that it used to be found thru some chain, but the
chain is
 >> now broken and it needs to be directly included
 >
 > header is included, I double checked.
 >
 >> * is that function in the base code, or down in a component?
If the
 >> latter, then that’s a problem, but I’m assuming you didn’t
make that
 >> mistake.
 >
 >
 > I am not sure what you mean. The function is in a component,
but I am
 > not aware that it is illegal to call a function of a
component from
 > another component.
 
 
  Of course that is illegal - you can only access a function via the
  framework interface, not directly. You have no way of knowing
that the
  other component has been loaded. Doing it directly violates the
  abstraction rules.
 >>>
 >>> well, ok. I know that the other component has been loaded
because that component triggered the initialization of these
sub-frameworks.
 >>
 >> I think we’ve seen that before, and run into problems with that
approach (i.e., components calling framework opens).
 >>
 >>>
 >>> I can move that functionality to the base, however, none of the
20+ functions are required for the other components of the io
framework (i.e. ROMIO). So I would basically add functionality
required for one component only into the base.
 >>
 >> Sounds like you’ve got an abstraction problem. If the fcoll
component requires certain functions from another framework, then
the framework should be exposing those APIs. If ROMIO doesn’t
provide them, then it needs to return an error if someone attempts
to call it.
 >>
 >> You are welcome to bring this up on next week’s call if you
like. IIRC, this has come up before when people have tried this hard
links between components. Maybe someone else will have a better
solution, but is just seems to me like you have to go thru the
framework to avoid the problem.
 >>
 >>>
 >>> Nevertheless, I think the original question is still valid. We
did not see this problem before, but it is now showing on all of our
platforms, and I am still wondering why that is the case. I *know* that
the ompio component is loaded, and I still get the error message
about the missing symbol from the ompio component. I do not
understand why that happens.
 >>
 >> Probably because the fcoll component didn’t explicitly link
against the ompio component. You were likely getting away with it
out of pure luck.
 >>
 >>>
 >>>
 >>> Thanks
 >>> Edgar
 >>>
 
 
 >
 > Thanks
 > Edgar
 >
 >
 >
 >>
 >>
 >>> On Nov 25, 2014, at 8:07 AM, Edgar Gabriel

 >>> >>
 >>> wrote:
 >>>
 >>> Has something changed recently on the trunk/master regarding
 

Re: [OMPI devel] question to OMPI_DECLSPEC

2014-11-26 Thread George Bosilca
Edgar,

The restriction you are facing doesn't come from Open MPI, but instead it
comes from the default behavior of how dlopen loads the .so files. As we do
not manually force the RTLD_GLOBAL flag, the scope of our modules is local,
which means that the symbols defined in this library are not made available
to resolve references in subsequently loaded libraries.
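
A minimal sketch of that dlopen behavior (the file names below are hypothetical, not
the actual mca component names): a library opened with RTLD_LOCAL keeps its symbols
out of the global namespace, so a second library that references them fails to load
with an undefined-symbol error, which is the same class of failure as the missing
ompi_io_ompio_decode_datatype symbol.

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* first plugin defines some_shared_symbol(); RTLD_LOCAL keeps it private */
    void *first = dlopen("./libplugin_a.so", RTLD_NOW | RTLD_LOCAL);
    if (first == NULL) {
        fprintf(stderr, "dlopen(plugin_a): %s\n", dlerror());
        return 1;
    }

    /* second plugin references some_shared_symbol(): with RTLD_LOCAL above this
     * dlopen fails with "undefined symbol"; opening the first plugin with
     * RTLD_GLOBAL instead would make the reference resolvable */
    void *second = dlopen("./libplugin_b.so", RTLD_NOW);
    if (second == NULL) {
        fprintf(stderr, "dlopen(plugin_b): %s\n", dlerror());
        return 1;
    }

    puts("both plugins loaded");
    dlclose(second);
    dlclose(first);
    return 0;
}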

On Wed, Nov 26, 2014 at 11:27 AM, Ralph Castain  wrote:

>
> > On Nov 26, 2014, at 7:16 AM, Edgar Gabriel  wrote:
> >
> > ok, so I thought about it a bit, and while I am still baffled by the
> actual outcome and the missing symbol (for the main reason that the
> function of the fcoll component is being called from the ompio module, so
> the function of the ompio that was called from the fcoll component is
> guaranteed to be loaded, and does have the proper OMPI_DECLSPEC), I will do
> some restructuring of the code to handle that.
> >
> > As an explanation on why there are so many functions in ompio that are
> being called from the sub-frameworks directly, ompio is more or less the
> glue between all the other frameworks, and contains a lot of the code that
> is jointly used by the fbtl, fcoll and the sharedfp components (fs to a
> lesser extent as well).
> >
> > Before I start to move code around however, just want to confirm two
> things:
> >
> > 1. I can move some of functionality of ompio to the base of various
> frameworks (fcoll, fbtl and io). Just want to confirm that this will work,
> e.g. I can call without restrictions a function of the fcoll base from an
> fbtl or the io component.
>
> Yes - the base functions of any framework are contained in the core
> library and thus always available.
>

These functions will be available to any module in the application, and
will increase the size of the main Open MPI library.

We had similar problems in the PML V, and we decided to try to minimize the
increase in size of the main library. Thus, instead of moving everything in
the base, we added a structure in the base that will contain all the
pointer to the functions we would need. This structure is only initialized
when our main module is loaded, and all sub-modules will use this structure
to get access to the pointers provided.
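
A rough sketch of that pattern (all names below are hypothetical, not the actual
PML V or ompio code): the framework base holds only a small struct of function
pointers, the main component fills it in when it is loaded, and sub-components call
through the struct instead of referencing the component's symbols directly.

#include <stddef.h>

/* lives in the framework base, so it is always present in the core library */
struct ompio_base_hooks {
    int (*decode_datatype)(void *datatype, void *decoded);
    int (*write_at)(void *fh, const void *buf, size_t count);
};

struct ompio_base_hooks ompio_hooks = { NULL, NULL };

/* called from the main component's init once it has been loaded */
void ompio_register_hooks(int (*decode)(void *, void *),
                          int (*write_fn)(void *, const void *, size_t))
{
    ompio_hooks.decode_datatype = decode;
    ompio_hooks.write_at = write_fn;
}

/* a sub-component (e.g. an fcoll module) goes through the struct */
int fcoll_example_decode(void *dtype, void *decoded)
{
    if (ompio_hooks.decode_datatype == NULL) {
        return -1;  /* main component not loaded */
    }
    return ompio_hooks.decode_datatype(dtype, decoded);
}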

  George.



> >
> > 2. I will have to extend the io framework interfaces a bit ( I will try
> to minimize the number of new functions as much as I can), but those
> function pointers will be NULL for ROMIO. Just want to make sure this is ok
> with everybody.
>
> I’ll have to let others chime in here, but that would seem to fit the OMPI
> architecture.
>
> >
> > Thanks
> > Edgar
> >
> > On 11/25/2014 11:43 AM, Ralph Castain wrote:
> >>
> >>> On Nov 25, 2014, at 9:36 AM, Edgar Gabriel  wrote:
> >>>
> >>> On 11/25/2014 11:31 AM, Ralph Castain wrote:
> 
> > On Nov 25, 2014, at 8:24 AM, Edgar Gabriel  > > wrote:
> >
> > On 11/25/2014 10:18 AM, Ralph Castain wrote:
> >> Hmmm…no, nothing has changed with regard to declspec that I know
> >> about. I’ll ask the obvious things to check:
> >>
> >> * does that component have the proper include to find this function?
> >> Could be that it used to be found thru some chain, but the chain is
> >> now broken and it needs to be directly included
> >
> > header is included, I double checked.
> >
> >> * is that function in the base code, or down in a component? If the
> >> latter, then that’s a problem, but I’m assuming you didn’t make that
> >> mistake.
> >
> >
> > I am not sure what you mean. The function is in a component, but I am
> > not aware that it is illegal to call a function of a component from
> > another component.
> 
> 
>  Of course that is illegal - you can only access a function via the
>  framework interface, not directly. You have no way of knowing that the
>  other component has been loaded. Doing it directly violates the
>  abstraction rules.
> >>>
> >>> well, ok. I know that the other component has been loaded because that
> component triggered the initialization of these sub-frameworks.
> >>
> >> I think we’ve seen that before, and run into problems with that
> approach (i.e., components calling framework opens).
> >>
> >>>
> >>> I can move that functionality to the base, however, none of the 20+
> functions are required for the other components of the io framework (i.e.
> ROMIO). So I would basically add functionality required for one component
> only into the base.
> >>
> >> Sounds like you’ve got an abstraction problem. If the fcoll component
> requires certain functions from another framework, then the framework
> should be exposing those APIs. If ROMIO doesn’t provide them, then it needs
> to return an error if someone attempts to call it.
> >>
> >> You are welcome to bring this up on next week’s call if you like. IIRC,
> this has come up before when 

Re: [OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Ralph Castain
Hmmm….yeah, I know we saw this and resolved it in the trunk, but it looks like 
the fix indeed failed to come over to 1.8. I’ll take a gander (pretty sure I 
remember how I fixed it) - thanks!

> On Nov 26, 2014, at 12:03 AM, Gilles Gouaillardet 
>  wrote:
> 
> Ralph,
> 
> i noted several hangs in mtt with the v1.8 branch.
> 
> a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
> from the intel_tests suite,
> invoke mpirun on one node and run the tasks on another node :
> 
> node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f
> 
> /* since this is a race condition, you might need to run this in a loop
> in order to hit the bug */
> 
> the attached tarball contains a patch (add debug + temporary hack) and
> some log files obtained with
> --mca errmgr_base_verbose 100 --mca odls_base_verbose 100
> 
> without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
> the hack, i can still reproduce the hang (though it might
> be a different one) with -np 16 (log.ko.2.txt)
> 
> i remember some similar hangs were fixed on the trunk/master a few
> months ago.
> i tried to backport some commits but it did not help :-(
> 
> could you please have a look at this ?
> 
> Cheers,
> 
> Gilles



Re: [OMPI devel] question to OMPI_DECLSPEC

2014-11-26 Thread Ralph Castain

> On Nov 26, 2014, at 7:16 AM, Edgar Gabriel  wrote:
> 
> ok, so I thought about it a bit, and while I am still baffled by the actual 
> outcome and the missing symbol (for the main reason that the function of the 
> fcoll component is being called from the ompio module, so the function of the 
> ompio that was called from the fcoll component is guaranteed to be loaded, 
> and does have the proper OMPI_DECLSPEC), I will do some restructuring of the 
> code to handle that.
> 
> As an explanation on why there are so many functions in ompio that are being 
> called from the sub-frameworks directly, ompio is more or less the glue 
> between all the other frameworks, and contains a lot of the code that is 
> jointly used by the fbtl, fcoll and the sharedfp components (fs to a lesser 
> extent as well).
> 
> Before I start to move code around however, just want to confirm two things:
> 
> 1. I can move some of functionality of ompio to the base of various 
> frameworks (fcoll, fbtl and io). Just want to confirm that this will work, 
> e.g. I can call without restrictions a function of the fcoll base from an 
> fbtl or the io component.

Yes - the base functions of any framework are contained in the core library and 
thus always available.

> 
> 2. I will have to extend the io framework interfaces a bit ( I will try to 
> minimize the number of new functions as much as I can), but those function 
> pointers will be NULL for ROMIO. Just want to make sure this is ok with 
> everybody.

I’ll have to let others chime in here, but that would seem to fit the OMPI 
architecture.
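
A hedged sketch of what such an extended io interface could look like (the struct
and field names are hypothetical, not the real mca_io module definition): the new
entries are plain function pointers that a component such as ROMIO leaves NULL, and
callers check before dispatching so an unsupported operation becomes an error return
rather than a crash.

#include <mpi.h>
#include <stddef.h>

typedef int (*io_extra_fn_t)(void *file_handle);

struct io_module_example {
    /* ... existing io interface entries ... */
    io_extra_fn_t extra_shared_fp_op;   /* NULL when the component has no support */
};

int io_call_extra_op(struct io_module_example *mod, void *fh)
{
    if (mod->extra_shared_fp_op == NULL) {
        /* component (e.g. ROMIO) did not provide this entry point */
        return MPI_ERR_UNSUPPORTED_OPERATION;
    }
    return mod->extra_shared_fp_op(fh);
}
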

> 
> Thanks
> Edgar
> 
> On 11/25/2014 11:43 AM, Ralph Castain wrote:
>> 
>>> On Nov 25, 2014, at 9:36 AM, Edgar Gabriel  wrote:
>>> 
>>> On 11/25/2014 11:31 AM, Ralph Castain wrote:
 
> On Nov 25, 2014, at 8:24 AM, Edgar Gabriel  > wrote:
> 
> On 11/25/2014 10:18 AM, Ralph Castain wrote:
>> Hmmm…no, nothing has changed with regard to declspec that I know
>> about. I’ll ask the obvious things to check:
>> 
>> * does that component have the proper include to find this function?
>> Could be that it used to be found thru some chain, but the chain is
>> now broken and it needs to be directly included
> 
> header is included, I double checked.
> 
>> * is that function in the base code, or down in a component? If the
>> latter, then that’s a problem, but I’m assuming you didn’t make that
>> mistake.
> 
> 
> I am not sure what you mean. The function is in a component, but I am
> not aware that it is illegal to call a function of a component from
> another component.
 
 
 Of course that is illegal - you can only access a function via the
 framework interface, not directly. You have no way of knowing that the
 other component has been loaded. Doing it directly violates the
 abstraction rules.
>>> 
>>> well, ok. I know that the other component has been loaded because that 
>>> component triggered the initialization of these sub-frameworks.
>> 
>> I think we’ve seen that before, and run into problems with that approach 
>> (i.e., components calling framework opens).
>> 
>>> 
>>> I can move that functionality to the base, however, none of the 20+ 
>>> functions are required for the other components of the io framework (i.e. 
>>> ROMIO). So I would basically add functionality required for one component 
>>> only into the base.
>> 
>> Sounds like you’ve got an abstraction problem. If the fcoll component 
>> requires certain functions from another framework, then the framework should 
>> be exposing those APIs. If ROMIO doesn’t provide them, then it needs to 
>> return an error if someone attempts to call it.
>> 
>> You are welcome to bring this up on next week’s call if you like. IIRC, this 
>> has come up before when people have tried these hard links between 
>> components. Maybe someone else will have a better solution, but it just 
>> seems to me like you have to go thru the framework to avoid the problem.
>> 
>>> 
>>> Nevertheless, I think the original question is still valid. We did not see 
>>> this problem before, but it is now showing on all of our platforms, and I 
>>> am still wondering why that is the case. I *know* that the ompio component is 
>>> loaded, and I still get the error message about the missing symbol from the 
>>> ompio component. I do not understand why that happens.
>> 
>> Probably because the fcoll component didn’t explicitly link against the 
>> ompio component. You were likely getting away with it out of pure luck.
>> 
>>> 
>>> 
>>> Thanks
>>> Edgar
>>> 
 
 
> 
> Thanks
> Edgar
> 
> 
> 
>> 
>> 
>>> On Nov 25, 2014, at 8:07 AM, Edgar Gabriel >> >
>>> wrote:
>>> 
>>> Has something changed recently on the trunk/master regarding
>>> 

Re: [OMPI devel] question to OMPI_DECLSPEC

2014-11-26 Thread Edgar Gabriel
ok, so I thought about it a bit, and while I am still baffled by the 
actual outcome and the missing symbol (for the main reason that the 
function of the fcoll component is being called from the ompio module, 
so the function of the ompio that was called from the fcoll component is 
guaranteed to be loaded, and does have the proper OMPI_DECLSPEC), I will 
do some restructuring of the code to handle that.


As an explanation on why there are so many functions in ompio that are 
being called from the sub-frameworks directly, ompio is more or less the 
glue between all the other frameworks, and contains a lot of the code 
that is jointly used by the fbtl, fcoll and the sharedfp components (fs 
to a lesser extent as well).


Before I start to move code around however, just want to confirm two things:

1. I can move some of functionality of ompio to the base of various 
frameworks (fcoll, fbtl and io). Just want to confirm that this will 
work, e.g. I can call without restrictions a function of the fcoll base 
from an fbtl or the io component.


2. I will have to extend the io framework interfaces a bit ( I will try 
to minimize the number of new functions as much as I can), but those 
function pointers will be NULL for ROMIO. Just want to make sure this is 
ok with everybody.


Thanks
Edgar

On 11/25/2014 11:43 AM, Ralph Castain wrote:



On Nov 25, 2014, at 9:36 AM, Edgar Gabriel  wrote:

On 11/25/2014 11:31 AM, Ralph Castain wrote:



On Nov 25, 2014, at 8:24 AM, Edgar Gabriel > wrote:

On 11/25/2014 10:18 AM, Ralph Castain wrote:

Hmmm…no, nothing has changed with regard to declspec that I know
about. I’ll ask the obvious things to check:

* does that component have the proper include to find this function?
Could be that it used to be found thru some chain, but the chain is
now broken and it needs to be directly included


header is included, I double checked.


* is that function in the base code, or down in a component? If the
latter, then that’s a problem, but I’m assuming you didn’t make that
mistake.



I am not sure what you mean. The function is in a component, but I am
not aware that it is illegal to call a function of a component from
another component.



Of course that is illegal - you can only access a function via the
framework interface, not directly. You have no way of knowing that the
other component has been loaded. Doing it directly violates the
abstraction rules.


well, ok. I know that the other component has been loaded because that component 
triggered the initialization of these sub-frameworks.


I think we’ve seen that before, and run into problems with that approach (i.e., 
components calling framework opens).



I can move that functionality to the base, however, none of the 20+ functions 
are required for the other components of the io framework (i.e. ROMIO). So I 
would basically add functionality required for one component only into the base.


Sounds like you’ve got an abstraction problem. If the fcoll component requires 
certain functions from another framework, then the framework should be exposing 
those APIs. If ROMIO doesn’t provide them, then it needs to return an error if 
someone attempts to call it.

You are welcome to bring this up on next week’s call if you like. IIRC, this 
has come up before when people have tried these hard links between components. 
Maybe someone else will have a better solution, but it just seems to me like 
you have to go thru the framework to avoid the problem.



Nevertheless, I think the original question is still valid. We did not see this 
problem before, but it is now showing on all of our platforms, and I am still 
wondering why that is the case. I *know* that the ompio component is loaded, and I 
still get the error message about the missing symbol from the ompio component. 
I do not understand why that happens.


Probably because the fcoll component didn’t explicitly link against the ompio 
component. You were likely getting away with it out of pure luck.




Thanks
Edgar






Thanks
Edgar







On Nov 25, 2014, at 8:07 AM, Edgar Gabriel >
wrote:

Has something changed recently on the trunk/master regarding
OMPI_DECLSPEC? The reason I ask is because we now get errors about
unresolved symbols, e.g.

symbol lookup error:
/home/gabriel/OpenMPI/lib64/openmpi/mca_fcoll_dynamic.so: undefined
symbol: ompi_io_ompio_decode_datatype


and that problem was not there roughly two weeks back the last time
I tested. I did verify that the the function listed there has an
OMPI_DECLSPEC before its definition.

Thanks
Edgar
--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335

[OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-11-26 Thread Gilles Gouaillardet
Ralph,

i noted several hangs in mtt with the v1.8 branch.

a simple way to reproduce it is to use the MPI_Errhandler_fatal_f test
from the intel_tests suite,
invoke mpirun on one node and run the taks on an other node :

node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f

/* since this is a race condition, you might need to run this in a loop
in order to hit the bug */

the attached tarball contains a patch (add debug + temporary hack) and
some log files obtained with
--mca errmgr_base_verbose 100 --mca odls_base_verbose 100

without the hack, i can reproduce the bug with -np 3 (log.ko.txt) , with
the hack, i can still reproduce the hang (though it might
be a different one) with -np 16 (log.ko.2.txt)

i remember some similar hangs were fixed on the trunk/master a few
monthes ago.
i tried to backport some commits but it did not help :-(

could you please have a look at this ?

Cheers,

Gilles


abort_hang.tar.gz
Description: application/gzip