This happens at MPI_Init. I've attached the full error message.

The sysadmin mentioned that the InfiniBand utility tests ran OK. I'll contact
him for more details and let you know.

Thank you,
Saliya

On Sun, Dec 28, 2014 at 3:18 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Where does the error occur?
> MPI_Init?
> MPI_Finalize?
> In between?
>
> In the first case, the crash is likely a mishandled error case, which means
> Open MPI is probably not the root cause of the failure.
>
> Did you check that InfiniBand is up and running on your cluster?
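>
> For what it's worth, a few quick sanity checks on the compute nodes would be
> something like the following (a rough sketch, assuming the usual InfiniBand
> diagnostic utilities are installed; package names vary by distribution):
>
>     ibstat            # each port should report "State: Active"
>     ibv_devinfo       # the HCA should be visible through the verbs layer
>     ibv_rc_pingpong   # run between two nodes to test a basic RC connection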
>
> Cheers,
>
> Gilles
>
> Message from Saliya Ekanayake <esal...@gmail.com>:
> It's been a while since the last update on this, but we are still having
> trouble getting Open MPI to work with InfiniBand on this cluster. We tried
> the latest 1.8.4 as well, but the result is the same.
>
> To recap, we get the following error when MPI initializes (in the simple
> hello world C example) with InfiniBand. Everything works fine if we
> explicitly turn off openib with --mca btl ^openib.
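>
> For reference, the commands involved are roughly the following (a sketch; the
> binary name is just illustrative, hello_c being the compiled examples/hello_c.c):
>
>     mpicc hello_c.c -o hello_c
>     mpirun -np 2 ./hello_c                    # aborts in MPI_Init with the assertion below
>     mpirun --mca btl ^openib -np 2 ./hello_c  # runs fine with openib disabled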
>
> This is the error I got after debugging with gdb as you suggested.
>
> hello_c: connect/btl_openib_connect_udcm.c:736: udcm_module_finalize:
> Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *)
> (&m->cm_recv_msg_queue))->obj_magic_id' failed.
>
> Thank you,
> Saliya
>
> On Mon, Nov 10, 2014 at 10:01 AM, Saliya Ekanayake <esal...@gmail.com>
> wrote:
>
>> Thank you, Jeff. I'll try this and let you know.
>>
>> Saliya
>> On Nov 10, 2014 6:42 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>
>> wrote:
>>
>>> I am sorry for the delay; I've been caught up in SC deadlines.  :-(
>>>
>>> I don't see anything blatantly wrong in this output.
>>>
>>> Two things:
>>>
>>> 1. Can you try a nightly v1.8.4 snapshot tarball?  This will check whether
>>> the bug, whatever it is, has already been fixed for the upcoming release:
>>>
>>>     http://www.open-mpi.org/nightly/v1.8/
>>>
>>> 2. Build Open MPI with the --enable-debug option (note that this adds a
>>> slight-but-noticeable performance penalty).  When you run, it should dump a
>>> core file.  Load that core file in a debugger and see where it is failing
>>> (i.e., file and line in the OMPI source).
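>>>
>>> Something along these lines (a rough sketch; the prefix path is just an
>>> example, and the core file name depends on your system's settings):
>>>
>>>     ./configure --prefix=$HOME/ompi-debug --enable-debug && make install
>>>     mpicc hello_c.c               # recompile the example against the debug build
>>>     ulimit -c unlimited           # so the core file is actually written
>>>     mpirun -np 2 ./a.out          # should now dump core when it aborts
>>>     gdb ./a.out <corefile>        # then "bt" gives the file/line in the OMPI source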
>>>
>>> We don't usually have to resort to asking users to perform #2, but
>>> there's no additional information to give a clue as to what is happening.
>>> :-(
>>>
>>>
>>>
>>> On Nov 9, 2014, at 11:43 AM, Saliya Ekanayake <esal...@gmail.com> wrote:
>>>
>>> > Hi Jeff,
>>> >
>>> > You are probably busy, but just checking if you had a chance to look
>>> at this.
>>> >
>>> > Thanks,
>>> > Saliya
>>> >
>>> > On Thu, Nov 6, 2014 at 9:19 AM, Saliya Ekanayake <esal...@gmail.com>
>>> wrote:
>>> > Hi Jeff,
>>> >
>>> > I've attached a tar file with information.
>>> >
>>> > Thank you,
>>> > Saliya
>>> >
>>> > On Tue, Nov 4, 2014 at 4:18 PM, Jeff Squyres (jsquyres) <
>>> jsquy...@cisco.com> wrote:
>>> > Looks like it's failing in the openib BTL setup.
>>> >
>>> > Can you send the info listed here?
>>> >
>>> >     http://www.open-mpi.org/community/help/
>>> >
>>> >
>>> >
>>> > On Nov 4, 2014, at 1:10 PM, Saliya Ekanayake <esal...@gmail.com>
>>> wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > I am using Open MPI 1.8.1 on a Linux cluster that we recently set up.
>>> > > It builds fine, but when I try to run even the simplest hello_c.c
>>> > > program, it segfaults. Any suggestions on how to correct this?
>>> > >
>>> > > The steps I did and error message are below.
>>> > >
>>> > > 1. Built Open MPI 1.8.1 on the cluster. The ompi_info output is attached.
>>> > > 2. cd to the examples directory and compile with mpicc hello_c.c
>>> > > 3. mpirun -np 2 ./a.out
>>> > > 4. Error text is attached.
>>> > >
>>> > > Please let me know if you need more info.
>>> > >
>>> > > Thank you,
>>> > > Saliya
>>> > >
>>> > >
>>> > > --
>>> > > Saliya Ekanayake esal...@gmail.com
>>> > > Cell 812-391-4914 Home 812-961-6383
>>> > > http://saliya.org
>>> > >
>>> > > <ompi_info.txt><error.txt>
>>> >
>>> >
>>> > --
>>> > Jeff Squyres
>>> > jsquy...@cisco.com
>>> > For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Saliya Ekanayake esal...@gmail.com
>>> > Cell 812-391-4914 Home 812-961-6383
>>> > http://saliya.org
>>> >
>>>
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>>
>>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
>
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org
hello_c: connect/btl_openib_connect_udcm.c:736: udcm_module_finalize: Assertion 
`((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) 
(&m->cm_recv_msg_queue))->obj_magic_id' failed.
[tempest:22559] *** Process received signal ***
[tempest:22559] Signal: Aborted (6)
[tempest:22559] Signal code:  (-6)
[tempest:22559] [ 0] /lib64/libpthread.so.0[0x35df00eca0]
[tempest:22559] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x35de82ffc5]
[tempest:22559] [ 2] /lib64/libc.so.6(abort+0x110)[0x35de831a70]
[tempest:22559] [ 3] /lib64/libc.so.6(__assert_fail+0xf6)[0x35de829466]
[tempest:22559] [ 4] 
/N/u/sekanaya/buildompi-1.8.4/lib/openmpi/mca_btl_openib.so[0x2b269fc5860a]
[tempest:22559] [ 5] 
/N/u/sekanaya/buildompi-1.8.4/lib/openmpi/mca_btl_openib.so[0x2b269fc57234]
[tempest:22559] [ 6] 
/N/u/sekanaya/buildompi-1.8.4/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x182)[0x2b269fc564cb]
[tempest:22559] [ 7] 
/N/u/sekanaya/buildompi-1.8.4/lib/openmpi/mca_btl_openib.so[0x2b269fc3f47e]
[tempest:22559] [ 8] 
/N/u/sekanaya/buildompi-1.8.4/lib/libmpi.so.1(mca_btl_base_select+0x1a4)[0x2b269c10b480]
[tempest:22559] [ 9] 
/N/u/sekanaya/buildompi-1.8.4/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x2b269fa25b06]
[tempest:22559] [10] 
/N/u/sekanaya/buildompi-1.8.4/lib/libmpi.so.1(mca_bml_base_init+0xd5)[0x2b269c10a73d]
[tempest:22559] [11] 
/N/u/sekanaya/buildompi-1.8.4/lib/openmpi/mca_pml_ob1.so[0x2b26a0d75e3a]
[tempest:22559] [12] 
/N/u/sekanaya/buildompi-1.8.4/lib/libmpi.so.1(mca_pml_base_select+0x28d)[0x2b269c13323d]
[tempest:22559] [13] 
/N/u/sekanaya/buildompi-1.8.4/lib/libmpi.so.1(ompi_mpi_init+0x616)[0x2b269c07fdd7]
[tempest:22559] [14] 
/N/u/sekanaya/buildompi-1.8.4/lib/libmpi.so.1(MPI_Init+0x181)[0x2b269c0c0ee5]
[tempest:22559] [15] ./hello_c[0x4007d3]
[tempest:22559] [16] /lib64/libc.so.6(__libc_start_main+0xf4)[0x35de81d9f4]
[tempest:22559] [17] ./hello_c[0x4006f9]
[tempest:22559] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 22559 on node tempest exited on 
signal 6 (Aborted).
--------------------------------------------------------------------------
