Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-02-02 Thread Adrian Reber
https://github.com/open-mpi/ompi/issues/372

On Sat, Jan 31, 2015 at 01:38:54PM +, Jeff Squyres (jsquyres) wrote:
> Adrian --
> 
> Can you file this as a Github issue?  Thanks.
> 
> 
> > On Jan 17, 2015, at 12:58 PM, Adrian Reber  wrote:
> > 
> > This time my bug report is not PSM related:
> > 
> > I was able to reproduce the MTT error from 
> > http://mtt.open-mpi.org/index.php?do_redir=2228
> > on my system with openmpi-dev-720-gf4693c9:
> > 
> > mpi_test_suite: btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 
> > 255' failed.
> > [n050409:06796] *** Process received signal ***
> > [n050409:06796] Signal: Aborted (6)
> > [n050409:06796] Signal code:  (-6)
> > [n050409:06796] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b036d501710]
> > [n050409:06796] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b036d741635]
> > [n050409:06796] [ 2] /lib64/libc.so.6(abort+0x175)[0x2b036d742e15]
> > [n050409:06796] [ 3] /lib64/libc.so.6(+0x2b75e)[0x2b036d73a75e]
> > [n050409:06796] [ 4] 
> > /lib64/libc.so.6(__assert_perror_fail+0x0)[0x2b036d73a820]
> > [n050409:06796] [ 5] 
> > /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x77)[0x2b03730cf6d0]
> > [n050409:06796] [ 6] 
> > /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0x5e5)[0x2b03730d1ae9]
> > [n050409:06796] [ 7] 
> > /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xd407)[0x2b0373961407]
> > [n050409:06796] [ 8] 
> > /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xde45)[0x2b0373961e45]
> > [n050409:06796] [ 9] 
> > /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x1ce)[0x2b0373962501]
> > [n050409:06796] [10] 
> > /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/libmpi.so.0(PMPI_Send+0x2b4)[0x2b036d20d1bb]
> > [n050409:06796] [11] mpi_test_suite[0x464424]
> > [n050409:06796] [12] mpi_test_suite[0x470304]
> > [n050409:06796] [13] mpi_test_suite[0x444a72]
> > [n050409:06796] [14] 
> > /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b036d72dd5d]
> > [n050409:06796] [15] mpi_test_suite[0x4051a9]
> > [n050409:06796] *** End of error message ***
> > --
> > mpirun noticed that process rank 0 with PID 0 on node n050409 exited on 
> > signal 6 (Aborted).
> > --
> > 
> > Core was generated by `mpi_test_suite -t p2p'.
> > Program terminated with signal 6, Aborted.
> > (gdb) bt
> > #0  0x2b036d741635 in raise () from /lib64/libc.so.6
> > #1  0x2b036d742d9d in abort () from /lib64/libc.so.6
> > #2  0x2b036d73a75e in __assert_fail_base () from /lib64/libc.so.6
> > #3  0x2b036d73a820 in __assert_fail () from /lib64/libc.so.6
> > #4  0x2b03730cf6d0 in mca_btl_openib_alloc (btl=0x224e740, 
> > ep=0x22b66a0, order=255 '\377', size=73014, flags=3) at btl_openib.c:1200
> > #5  0x2b03730d1ae9 in mca_btl_openib_sendi (btl=0x224e740, 
> > ep=0x22b66a0, convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, 
> > header_size=14, payload_size=73000, order=255 '\377', flags=3, 
> >tag=65 'A', descriptor=0x7fff2c527ce8) at btl_openib.c:1829
> > #6  0x2b0373961407 in mca_bml_base_sendi (bml_btl=0x2198850, 
> > convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, header_size=14, 
> > payload_size=73000, order=255 '\377', flags=3, tag=65 'A', 
> >descriptor=0x7fff2c527ce8) at ../../../../ompi/mca/bml/bml.h:305
> > #7  0x2b0373961e45 in mca_pml_ob1_send_inline (buf=0x2b7b760, count=1, 
> > datatype=0x2b97440, dst=1, tag=37, seqn=3639, dst_proc=0x21c2940, 
> > endpoint=0x22dff00, comm=0x6939e0) at pml_ob1_isend.c:107
> > #8  0x2b0373962501 in mca_pml_ob1_send (buf=0x2b7b760, count=1, 
> > datatype=0x2b97440, dst=1, tag=37, sendmode=MCA_PML_BASE_SEND_STANDARD, 
> > comm=0x6939e0) at pml_ob1_isend.c:214
> > #9  0x2b036d20d1bb in PMPI_Send (buf=0x2b7b760, count=1, 
> > type=0x2b97440, dest=1, tag=37, comm=0x6939e0) at psend.c:78
> > #10 0x00464424 in tst_p2p_simple_ring_xsend_run 
> > (env=0x7fff2c528530) at p2p/tst_p2p_simple_ring_xsend.c:97
> > #11 0x00470304 in tst_test_run_func (env=0x7fff2c528530) at 
> > tst_tests.c:1463
> > #12 0x00444a72 in main (argc=3, argv=0x7fff2c5287f8) at 
> > mpi_test_suite.c:639
> > 
> > This is with --enable-debug. Without --enable-debug I get a
> > segmentation fault, but not always. Using fewer cores it works most
> > of the time. With 32 cores on 4 nodes it happens almost
> > all the time. If it does not crash using fewer cores I get messages like:
> > 
> > [n050409][[36216,1],1][btl_openib_xrc.c:58:mca_btl_openib_xrc_check_api] 
> > XRC error: bad XRC API (require XRC from OFED pre 3.12).
> > 
> > Adrian
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel

Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-31 Thread Jeff Squyres (jsquyres)
Adrian --

Can you file this as a Github issue?  Thanks.


> On Jan 17, 2015, at 12:58 PM, Adrian Reber  wrote:
> 
> This time my bug report is not PSM related:
> 
> I was able to reproduce the MTT error from 
> http://mtt.open-mpi.org/index.php?do_redir=2228
> on my system with openmpi-dev-720-gf4693c9:
> 
> mpi_test_suite: btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 
> 255' failed.
> [n050409:06796] *** Process received signal ***
> [n050409:06796] Signal: Aborted (6)
> [n050409:06796] Signal code:  (-6)
> [n050409:06796] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b036d501710]
> [n050409:06796] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b036d741635]
> [n050409:06796] [ 2] /lib64/libc.so.6(abort+0x175)[0x2b036d742e15]
> [n050409:06796] [ 3] /lib64/libc.so.6(+0x2b75e)[0x2b036d73a75e]
> [n050409:06796] [ 4] 
> /lib64/libc.so.6(__assert_perror_fail+0x0)[0x2b036d73a820]
> [n050409:06796] [ 5] 
> /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x77)[0x2b03730cf6d0]
> [n050409:06796] [ 6] 
> /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0x5e5)[0x2b03730d1ae9]
> [n050409:06796] [ 7] 
> /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xd407)[0x2b0373961407]
> [n050409:06796] [ 8] 
> /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xde45)[0x2b0373961e45]
> [n050409:06796] [ 9] 
> /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x1ce)[0x2b0373962501]
> [n050409:06796] [10] 
> /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/libmpi.so.0(PMPI_Send+0x2b4)[0x2b036d20d1bb]
> [n050409:06796] [11] mpi_test_suite[0x464424]
> [n050409:06796] [12] mpi_test_suite[0x470304]
> [n050409:06796] [13] mpi_test_suite[0x444a72]
> [n050409:06796] [14] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b036d72dd5d]
> [n050409:06796] [15] mpi_test_suite[0x4051a9]
> [n050409:06796] *** End of error message ***
> --
> mpirun noticed that process rank 0 with PID 0 on node n050409 exited on 
> signal 6 (Aborted).
> --
> 
> Core was generated by `mpi_test_suite -t p2p'.
> Program terminated with signal 6, Aborted.
> (gdb) bt
> #0  0x2b036d741635 in raise () from /lib64/libc.so.6
> #1  0x2b036d742d9d in abort () from /lib64/libc.so.6
> #2  0x2b036d73a75e in __assert_fail_base () from /lib64/libc.so.6
> #3  0x2b036d73a820 in __assert_fail () from /lib64/libc.so.6
> #4  0x2b03730cf6d0 in mca_btl_openib_alloc (btl=0x224e740, ep=0x22b66a0, 
> order=255 '\377', size=73014, flags=3) at btl_openib.c:1200
> #5  0x2b03730d1ae9 in mca_btl_openib_sendi (btl=0x224e740, ep=0x22b66a0, 
> convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, header_size=14, 
> payload_size=73000, order=255 '\377', flags=3, 
>tag=65 'A', descriptor=0x7fff2c527ce8) at btl_openib.c:1829
> #6  0x2b0373961407 in mca_bml_base_sendi (bml_btl=0x2198850, 
> convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, header_size=14, 
> payload_size=73000, order=255 '\377', flags=3, tag=65 'A', 
>descriptor=0x7fff2c527ce8) at ../../../../ompi/mca/bml/bml.h:305
> #7  0x2b0373961e45 in mca_pml_ob1_send_inline (buf=0x2b7b760, count=1, 
> datatype=0x2b97440, dst=1, tag=37, seqn=3639, dst_proc=0x21c2940, 
> endpoint=0x22dff00, comm=0x6939e0) at pml_ob1_isend.c:107
> #8  0x2b0373962501 in mca_pml_ob1_send (buf=0x2b7b760, count=1, 
> datatype=0x2b97440, dst=1, tag=37, sendmode=MCA_PML_BASE_SEND_STANDARD, 
> comm=0x6939e0) at pml_ob1_isend.c:214
> #9  0x2b036d20d1bb in PMPI_Send (buf=0x2b7b760, count=1, type=0x2b97440, 
> dest=1, tag=37, comm=0x6939e0) at psend.c:78
> #10 0x00464424 in tst_p2p_simple_ring_xsend_run (env=0x7fff2c528530) 
> at p2p/tst_p2p_simple_ring_xsend.c:97
> #11 0x00470304 in tst_test_run_func (env=0x7fff2c528530) at 
> tst_tests.c:1463
> #12 0x00444a72 in main (argc=3, argv=0x7fff2c5287f8) at 
> mpi_test_suite.c:639
> 
> This is with --enable-debug. Without --enable-debug I get a
> segmentation fault, but not always. Using fewer cores it works most
> of the time. With 32 cores on 4 nodes it happens almost
> all the time. If it does not crash using fewer cores I get messages like:
> 
> [n050409][[36216,1],1][btl_openib_xrc.c:58:mca_btl_openib_xrc_check_api] XRC 
> error: bad XRC API (require XRC from OFED pre 3.12).
> 
>   Adrian
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/01/16797.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-21 Thread Jeff Squyres (jsquyres)
On Jan 21, 2015, at 11:00 AM, George Bosilca  wrote:
> 
> As I said in my previous email it is legal to use such an overlapping 
> datatype for send operations. Thus, the datatype engine cannot prevent one 
> from creating them.

Ah, right.  Gotcha.

> We had some degree of overlap detection at some point in the past, but the 
> algorithm is quadratic in time and memory with the number of datatype 
> entries, so the cost was prohibitive.

Gotcha.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-21 Thread George Bosilca
As I said in my previous email it is legal to use such an overlapping
datatype for send operations. Thus, the datatype engine cannot prevent one
from creating them.

We had some degree of overlap detection at some point in the past, but the
algorithm is quadratic in time and memory with the number of datatype
entries, so the cost was prohibitive.
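
A rough sketch of why pairwise overlap detection is quadratic, assuming the datatype has been flattened into hypothetical (offset, length) entries (this is not the old Open MPI code, just an illustration):

#include <stddef.h>

struct dt_entry { ptrdiff_t offset; ptrdiff_t length; };

/* Returns 1 if any two entries of a flattened datatype overlap.
 * Every pair has to be compared, hence O(n^2) work, and storing the
 * fully flattened representation is itself costly for large datatypes. */
static int datatype_has_overlap(const struct dt_entry *e, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (e[i].offset < e[j].offset + e[j].length &&
                e[j].offset < e[i].offset + e[i].length) {
                return 1;
            }
        }
    }
    return 0;
}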

  George.


On Wed, Jan 21, 2015 at 9:43 AM, Jeff Squyres (jsquyres)  wrote:

> On Jan 20, 2015, at 10:10 PM, George Bosilca  wrote:
> >
> > Receiving with such a datatype is illegal in MPI (sending is allowed, as
> > the buffer is supposed to be read-only during the operation). In fact, any
> > datatype that spans the same memory region twice is illegal to use for
> > receive operations. The reason is simple: an MPI implementation can move
> > the data in any order it wants, and since MPI guarantees only the FIFO
> > ordering of the matching, such a datatype would break the determinism of
> > the application.
>
> George: does the DDT engine check for this kind of condition?  Shouldn't
> it refuse to generate a datatype like this?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/01/16810.php
>


Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-21 Thread Jeff Squyres (jsquyres)
On Jan 20, 2015, at 10:10 PM, George Bosilca  wrote:
> 
> Receiving with such a datatype is illegal in MPI (sending is allowed, as the 
> buffer is supposed to be read-only during the operation). In fact, any 
> datatype that spans the same memory region twice is illegal to use for 
> receive operations. The reason is simple: an MPI implementation can move the 
> data in any order it wants, and since MPI guarantees only the FIFO ordering 
> of the matching, such a datatype would break the determinism of the 
> application.

George: does the DDT engine check for this kind of condition?  Shouldn't it 
refuse to generate a datatype like this?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-20 Thread George Bosilca
On Tue, Jan 20, 2015 at 10:01 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

> 2) the mpi_test_suite uses a weird type (e.g. it artificially sends 20k
> integers to the wire when sending one would produce the very same result)
> i briefly checked the mpi_test_suite source code, and the weird type is
> used for send/recv with buffers whose size is one element.
> i can only guess the authors wanted to send a large message to the wire
> (e.g. to create traffic) without a pointlessly large memory allocation.
> at this stage, i am tempted to conclude the authors did what they intended.
>

Receiving with such a datatype is illegal in MPI (sending is allowed, as the
buffer is supposed to be read-only during the operation). In fact, any
datatype that spans the same memory region twice is illegal to use for
receive operations. The reason is simple: an MPI implementation can move the
data in any order it wants, and since MPI guarantees only the FIFO ordering
of the matching, such a datatype would break the determinism of the
application.

We should ping the authors of the test code to address this.
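
A minimal sketch of the distinction George describes, assuming two ranks and illustrative buffer/tag values (not taken from mpi_test_suite):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Datatype t;
    int rank, buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* two MPI_INTs with stride 0: both elements map onto the same address */
    MPI_Type_create_hvector(2, 1, 0, MPI_INT, &t);
    MPI_Type_commit(&t);

    if (rank == 0) {
        buf = 42;
        /* sending is allowed: the buffer is only read, the overlapping
         * element is simply packed twice onto the wire */
        MPI_Send(&buf, 1, t, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receiving into the overlapping type is the illegal case: the
         * incoming elements may be unpacked in any order, so overlapping
         * regions make the final contents non-deterministic in general */
        MPI_Recv(&buf, 1, t, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", buf);
    }

    MPI_Type_free(&t);
    MPI_Finalize();
    return 0;
}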

  George.



>
> Cheers,
>
> Gilles
>
> On 2015/01/21 3:00, Jeff Squyres (jsquyres) wrote:
> > George is right -- Gilles: was this the correct solution?
> >
> > Put differently: the extent of the 20K vector created below is 4 (bytes).
> >
> >
> >
> >> On Jan 19, 2015, at 2:39 AM, George Bosilca  wrote:
> >>
> >> Btw,
> >>
> >> MPI_Type_hvector(2, 1, 0, MPI_INT, &type);
> >>
> >> is just a weird datatype. Because the stride is 0, this datatype has a
> >> memory layout that includes the same int twice. I'm not sure this was
> >> indeed intended...
> >>
> >>   George.
> >>
> >>
> >> On Mon, Jan 19, 2015 at 12:17 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
> >> Adrian,
> >>
> >> i just fixed this in the master
> >> (https://github.com/open-mpi/ompi/commit/d14daf40d041f7a0a8e9d85b3bfd5eb570495fd2)
> >>
> >> the root cause is a corner case that was not handled correctly:
> >>
> >> MPI_Type_hvector(2, 1, 0, MPI_INT, &type);
> >>
> >> type has extent = 4 *but* size = 8
> >> ob1 used to test only the extent to determine whether the message should
> >> be sent inline or not
> >> extent <= 256 means try to send the message inline
> >> that meant a fragment big enough for the whole packed message had to be
> >> allocated (73014 bytes in Adrian's backtrace, greater than 65536, the
> >> default max size for IB), and that allocation failed.
> >>
> >> now both extent and size are tested, so the message is not sent inline,
> >> and it just works.
> >>
> >> Cheers,
> >>
> >> Gilles
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/01/16798.php
> >>
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/01/16801.php
> >
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/01/16808.php
>


Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-20 Thread Gilles Gouaillardet
Jeff,

There are two things here:

1) ompi was crashing, and this was fixed (ob1 now uses the type size instead
of the type extent to figure out whether the btl should try to send the
message inline or not). and yes, George is right (e.g. use the size instead
of the extent, or both the size and the extent)

2) the mpi_test_suite uses a weird type (e.g. it artificially sends 20k
integers to the wire when sending one would produce the very same result)
i briefly checked the mpi_test_suite source code, and the weird type is
used for send/recv with buffers whose size is one element.
i can only guess the authors wanted to send a large message to the wire
(e.g. to create traffic) without a pointlessly large memory allocation.
at this stage, i am tempted to conclude the authors did what they intended.
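
A hypothetical reconstruction of the pattern Gilles describes (the element count is illustrative; 18250 MPI_INTs would match the 73000-byte payload in Adrian's backtrace):

#include <mpi.h>

/* Assumes MPI is already initialized and a rank 1 exists to receive. */
static void send_traffic_without_a_big_buffer(void)
{
    MPI_Datatype traffic_type;
    int one_int = 0;

    /* stride 0: every element overlaps the same 4-byte int */
    MPI_Type_create_hvector(18250, 1, 0, MPI_INT, &traffic_type);
    MPI_Type_commit(&traffic_type);

    /* extent = 4 bytes, size = 18250 * 4 = 73000 bytes: one tiny buffer
     * still produces a large message on the wire */
    MPI_Send(&one_int, 1, traffic_type, 1, 0, MPI_COMM_WORLD);

    MPI_Type_free(&traffic_type);
}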

Cheers,

Gilles

On 2015/01/21 3:00, Jeff Squyres (jsquyres) wrote:
> George is right -- Gilles: was this the correct solution?
>
> Put differently: the extent of the 20K vector created below is 4 (bytes).
>
>
>
>> On Jan 19, 2015, at 2:39 AM, George Bosilca  wrote:
>>
>> Btw,
>>
>> MPI_Type_hvector(2, 1, 0, MPI_INT, &type);
>>
>> is just a weird datatype. Because the stride is 0, this datatype has a memory 
>> layout that includes the same int twice. I'm not sure this was indeed 
>> intended...
>>
>>   George.
>>
>>
>> On Mon, Jan 19, 2015 at 12:17 AM, Gilles Gouaillardet 
>>  wrote:
>> Adrian,
>>
>> i just fixed this in the master
>> (https://github.com/open-mpi/ompi/commit/d14daf40d041f7a0a8e9d85b3bfd5eb570495fd2)
>>
>> the root cause is a corner case that was not handled correctly:
>>
>> MPI_Type_hvector(2, 1, 0, MPI_INT, &type);
>>
>> type has extent = 4 *but* size = 8
>> ob1 used to test only the extent to determine whether the message should
>> be sent inline or not
>> extent <= 256 means try to send the message inline
>> that meant a fragment big enough for the whole packed message had to be
>> allocated (73014 bytes in Adrian's backtrace, greater than 65536, the
>> default max size for IB), and that allocation failed.
>>
>> now both extent and size are tested, so the message is not sent inline,
>> and it just works.
>>
>> Cheers,
>>
>> Gilles
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2015/01/16798.php
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2015/01/16801.php
>



Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-20 Thread Adrian Reber
Using today's nightly snapshot (openmpi-dev-730-g06d3b57) both errors
are gone. Thanks!

On Mon, Jan 19, 2015 at 02:38:42PM +0900, Gilles Gouaillardet wrote:
> Adrian,
> 
> about the
> "[n050409][[36216,1],1][btl_openib_xrc.c:58:mca_btl_openib_xrc_check_api] XRC
> error: bad XRC API (require XRC from OFED pre 3.12). " message.
> 
> this means ompi was built on a system with OFED 3.12 or greater, and you
> are running on a system with an earlier OFED release.
> 
> please note that Jeff recently pushed a patch related to this, and the
> message might be a false positive.
> 
> Cheers,
> 
> Gilles
> 
> On 2015/01/19 14:17, Gilles Gouaillardet wrote:
> > Adrian,
> >
> > i just fixed this in the master
> > (https://github.com/open-mpi/ompi/commit/d14daf40d041f7a0a8e9d85b3bfd5eb570495fd2)
> >
> > the root cause is a corner case that was not handled correctly:
> >
> > MPI_Type_hvector(2, 1, 0, MPI_INT, &type);
> >
> > type has extent = 4 *but* size = 8
> > ob1 used to test only the extent to determine whether the message should
> > be sent inline or not
> > extent <= 256 means try to send the message inline
> > that meant a fragment big enough for the whole packed message had to be
> > allocated (73014 bytes in Adrian's backtrace, greater than 65536, the
> > default max size for IB), and that allocation failed.
> >
> > now both extent and size are tested, so the message is not sent inline,
> > and it just works.
> >
> > Cheers,
> >
> > Gilles
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2015/01/16798.php
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/01/16799.php


Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-19 Thread George Bosilca
Btw,

MPI_Type_hvector(2, 1, 0, MPI_INT, &type);

is just a weird datatype. Because the stride is 0, this datatype has a memory
layout that includes the same int twice. I'm not sure this was indeed
intended...

  George.


On Mon, Jan 19, 2015 at 12:17 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

> Adrian,
>
> i just fixed this in the master
> (
> https://github.com/open-mpi/ompi/commit/d14daf40d041f7a0a8e9d85b3bfd5eb570495fd2
> )
>
> the root cause is a corner case that was not handled correctly:
>
> MPI_Type_hvector(2, 1, 0, MPI_INT, &type);
>
> type has extent = 4 *but* size = 8
> ob1 used to test only the extent to determine whether the message should
> be sent inline or not
> extent <= 256 means try to send the message inline
> that meant a fragment big enough for the whole packed message had to be
> allocated (73014 bytes in Adrian's backtrace, greater than 65536, the
> default max size for IB), and that allocation failed.
>
> now both extent and size are tested, so the message is not sent inline,
> and it just works.
>
> Cheers,
>
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/01/16798.php
>


Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-19 Thread George Bosilca
The extent should not be part of the decision; what matters is the amount
of data to be pushed on the wire, not its span in memory.
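
A simplified illustration of that point (not the actual ob1 code; the names and the 256-byte threshold just follow Gilles's description):

#include <mpi.h>
#include <stddef.h>

/* Decide whether a send is small enough to try the inline/eager path,
 * based on the bytes that actually go on the wire rather than the
 * datatype's span in memory. */
static int try_send_inline(MPI_Datatype dtype, int count, size_t threshold)
{
    int type_size;
    MPI_Type_size(dtype, &type_size);

    size_t wire_bytes = (size_t)type_size * (size_t)count;
    return wire_bytes <= threshold;   /* e.g. threshold = 256 */
}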

  George.


On Mon, Jan 19, 2015 at 12:17 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

> Adrian,
>
> i just fixed this in the master
> (
> https://github.com/open-mpi/ompi/commit/d14daf40d041f7a0a8e9d85b3bfd5eb570495fd2
> )
>
> the root cause is a corner case that was not handled correctly:
>
> MPI_Type_hvector(2, 1, 0, MPI_INT, &type);
>
> type has extent = 4 *but* size = 8
> ob1 used to test only the extent to determine whether the message should
> be sent inline or not
> extent <= 256 means try to send the message inline
> that meant a fragment big enough for the whole packed message had to be
> allocated (73014 bytes in Adrian's backtrace, greater than 65536, the
> default max size for IB), and that allocation failed.
>
> now both extent and size are tested, so the message is not sent inline,
> and it just works.
>
> Cheers,
>
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/01/16798.php
>


Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-19 Thread Gilles Gouaillardet
Adrian,

about the
"[n050409][[36216,1],1][btl_openib_xrc.c:58:mca_btl_openib_xrc_check_api] XRC
error: bad XRC API (require XRC from OFED pre 3.12). " message.

this means ompi was built on a system with OFED 3.12 or greater, and you
are running on a system with an earlier OFED release.

please note that Jeff recently pushed a patch related to this, and the
message might be a false positive.

Cheers,

Gilles

On 2015/01/19 14:17, Gilles Gouaillardet wrote:
> Adrian,
>
> i just fixed this in the master
> (https://github.com/open-mpi/ompi/commit/d14daf40d041f7a0a8e9d85b3bfd5eb570495fd2)
>
> > the root cause is a corner case that was not handled correctly:
> >
> > MPI_Type_hvector(2, 1, 0, MPI_INT, &type);
> >
> > type has extent = 4 *but* size = 8
> > ob1 used to test only the extent to determine whether the message should
> > be sent inline or not
> > extent <= 256 means try to send the message inline
> > that meant a fragment big enough for the whole packed message had to be
> > allocated (73014 bytes in Adrian's backtrace, greater than 65536, the
> > default max size for IB), and that allocation failed.
>
> now both extent and size are tested, so the message is not sent inline,
> and it just works.
>
> Cheers,
>
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/01/16798.php



Re: [OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-19 Thread Gilles Gouaillardet
Adrian,

i just fixed this in the master
(https://github.com/open-mpi/ompi/commit/d14daf40d041f7a0a8e9d85b3bfd5eb570495fd2)

the root cause is a corner case that was not handled correctly:

MPI_Type_hvector(2, 1, 0, MPI_INT, &type);

type has extent = 4 *but* size = 8
ob1 used to test only the extent to determine whether the message should
be sent inline or not
extent <= 256 means try to send the message inline
that meant a fragment big enough for the whole packed message had to be
allocated (73014 bytes in Adrian's backtrace, greater than 65536, the
default max size for IB), and that allocation failed.

now both extent and size are tested, so the message is not sent inline,
and it just works.
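
A minimal sketch that exhibits the corner case, assuming nothing beyond standard MPI calls (MPI_Type_create_hvector is the current spelling of the deprecated MPI_Type_hvector used by the test):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Datatype type;
    MPI_Aint lb, extent;
    int size;

    MPI_Init(&argc, &argv);

    /* stride 0: both ints overlap at the same address */
    MPI_Type_create_hvector(2, 1, 0, MPI_INT, &type);
    MPI_Type_commit(&type);

    MPI_Type_get_extent(type, &lb, &extent);
    MPI_Type_size(type, &size);

    /* typically prints: extent = 4, size = 8 */
    printf("extent = %ld, size = %d\n", (long)extent, size);

    MPI_Type_free(&type);
    MPI_Finalize();
    return 0;
}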

Cheers,

Gilles


[OMPI devel] btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed

2015-01-17 Thread Adrian Reber
This time my bug report is not PSM related:

I was able to reproduce the MTT error from 
http://mtt.open-mpi.org/index.php?do_redir=2228
on my system with openmpi-dev-720-gf4693c9:

mpi_test_suite: btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' 
failed.
[n050409:06796] *** Process received signal ***
[n050409:06796] Signal: Aborted (6)
[n050409:06796] Signal code:  (-6)
[n050409:06796] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b036d501710]
[n050409:06796] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b036d741635]
[n050409:06796] [ 2] /lib64/libc.so.6(abort+0x175)[0x2b036d742e15]
[n050409:06796] [ 3] /lib64/libc.so.6(+0x2b75e)[0x2b036d73a75e]
[n050409:06796] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x2b036d73a820]
[n050409:06796] [ 5] 
/lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x77)[0x2b03730cf6d0]
[n050409:06796] [ 6] 
/lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0x5e5)[0x2b03730d1ae9]
[n050409:06796] [ 7] 
/lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xd407)[0x2b0373961407]
[n050409:06796] [ 8] 
/lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xde45)[0x2b0373961e45]
[n050409:06796] [ 9] 
/lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x1ce)[0x2b0373962501]
[n050409:06796] [10] 
/lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/libmpi.so.0(PMPI_Send+0x2b4)[0x2b036d20d1bb]
[n050409:06796] [11] mpi_test_suite[0x464424]
[n050409:06796] [12] mpi_test_suite[0x470304]
[n050409:06796] [13] mpi_test_suite[0x444a72]
[n050409:06796] [14] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b036d72dd5d]
[n050409:06796] [15] mpi_test_suite[0x4051a9]
[n050409:06796] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 0 on node n050409 exited on signal 
6 (Aborted).
--

Core was generated by `mpi_test_suite -t p2p'.
Program terminated with signal 6, Aborted.
(gdb) bt
#0  0x2b036d741635 in raise () from /lib64/libc.so.6
#1  0x2b036d742d9d in abort () from /lib64/libc.so.6
#2  0x2b036d73a75e in __assert_fail_base () from /lib64/libc.so.6
#3  0x2b036d73a820 in __assert_fail () from /lib64/libc.so.6
#4  0x2b03730cf6d0 in mca_btl_openib_alloc (btl=0x224e740, ep=0x22b66a0, 
order=255 '\377', size=73014, flags=3) at btl_openib.c:1200
#5  0x2b03730d1ae9 in mca_btl_openib_sendi (btl=0x224e740, ep=0x22b66a0, 
convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, header_size=14, 
payload_size=73000, order=255 '\377', flags=3, 
tag=65 'A', descriptor=0x7fff2c527ce8) at btl_openib.c:1829
#6  0x2b0373961407 in mca_bml_base_sendi (bml_btl=0x2198850, 
convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, header_size=14, 
payload_size=73000, order=255 '\377', flags=3, tag=65 'A', 
descriptor=0x7fff2c527ce8) at ../../../../ompi/mca/bml/bml.h:305
#7  0x2b0373961e45 in mca_pml_ob1_send_inline (buf=0x2b7b760, count=1, 
datatype=0x2b97440, dst=1, tag=37, seqn=3639, dst_proc=0x21c2940, 
endpoint=0x22dff00, comm=0x6939e0) at pml_ob1_isend.c:107
#8  0x2b0373962501 in mca_pml_ob1_send (buf=0x2b7b760, count=1, 
datatype=0x2b97440, dst=1, tag=37, sendmode=MCA_PML_BASE_SEND_STANDARD, 
comm=0x6939e0) at pml_ob1_isend.c:214
#9  0x2b036d20d1bb in PMPI_Send (buf=0x2b7b760, count=1, type=0x2b97440, 
dest=1, tag=37, comm=0x6939e0) at psend.c:78
#10 0x00464424 in tst_p2p_simple_ring_xsend_run (env=0x7fff2c528530) at 
p2p/tst_p2p_simple_ring_xsend.c:97
#11 0x00470304 in tst_test_run_func (env=0x7fff2c528530) at 
tst_tests.c:1463
#12 0x00444a72 in main (argc=3, argv=0x7fff2c5287f8) at 
mpi_test_suite.c:639

This is with --enable-debug. Without --enable-debug I get a
segmentation fault, but not always. Using fewer cores it works most
of the time. With 32 cores on 4 nodes it happens almost
all the time. If it does not crash using fewer cores I get messages like:

[n050409][[36216,1],1][btl_openib_xrc.c:58:mca_btl_openib_xrc_check_api] XRC 
error: bad XRC API (require XRC from OFED pre 3.12).

Adrian