Re: [OMPI users] What could cause a segfault in OpenMPI?
It's been a while on this, but we are still having trouble getting OpenMPI to work with InfiniBand on this cluster. We tried with the latest 1.8.4 as well, but it's still the same.

To recap, we get the following error when MPI initializes (in the simple Hello world C example) with InfiniBand. Everything works fine if we explicitly turn off openib with --mca btl ^openib.

This is the error I got after debugging with gdb as you suggested:

hello_c: connect/btl_openib_connect_udcm.c:736: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.

Thank you,
Saliya

On Mon, Nov 10, 2014 at 10:01 AM, Saliya Ekanayake wrote:
> Thank you Jeff, I'll try this and let you know.
>
> Saliya

--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org
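The assertion text above compares a field called obj_magic_id against a fixed constant. Checks of this shape are commonly used in debug builds to catch an object that is torn down before it was ever constructed, or torn down twice. The following is only a minimal sketch of that general pattern, not Open MPI's actual code: the sketch_* names and the struct layout are hypothetical, and only the constant is taken from the message above.

```c
#include <assert.h>
#include <stdint.h>

/* Constant copied from the assertion message; everything else is hypothetical. */
#define SKETCH_MAGIC_ID ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

/* Hypothetical stand-in for a debug-build object header. */
typedef struct {
    uint64_t magic_id; /* stamped on construct, cleared on destruct */
} sketch_object_t;

/* Mark the object as constructed. */
void sketch_construct(sketch_object_t *obj)
{
    obj->magic_id = SKETCH_MAGIC_ID;
}

/* Return 1 if the object currently carries the magic id, 0 otherwise. */
int sketch_is_live(const sketch_object_t *obj)
{
    return obj->magic_id == SKETCH_MAGIC_ID;
}

/*
 * Tear the object down. Returns 0 on success and -1 when the object is not
 * live, which is the situation where a debug build would fire an assert.
 */
int sketch_destruct(sketch_object_t *obj)
{
    if (!sketch_is_live(obj)) {
        return -1;
    }
    obj->magic_id = 0;
    return 0;
}
```

Read this way, the failure in udcm_module_finalize would be consistent with cm_recv_msg_queue being finalized without having been set up first, for example on an error path during initialization. That is an inference from the message text alone; the thread does not confirm it.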
Re: [OMPI users] What could cause a segfault in OpenMPI?
Thank you Jeff, I'll try this and let you know.

Saliya

On Nov 10, 2014 6:42 AM, "Jeff Squyres (jsquyres)" wrote:
> I am sorry for the delay; I've been caught up in SC deadlines. :-(
>
> I don't see anything blatantly wrong in this output.
>
> Two things:
>
> 1. Can you try a nightly v1.8.4 snapshot tarball? This will check to see
>    if whatever the bug is has been fixed for the upcoming release:
>
>    http://www.open-mpi.org/nightly/v1.8/
>
> 2. Build Open MPI with the --enable-debug option (note that this adds a
>    slight-but-noticeable performance penalty). When you run, it should dump a
>    core file. Load that core file in a debugger and see where it is failing
>    (i.e., file and line in the OMPI source).
>
> We don't usually have to resort to asking users to perform #2, but there's
> no additional information to give a clue as to what is happening. :-(
Re: [OMPI users] What could cause a segfault in OpenMPI?
I am sorry for the delay; I've been caught up in SC deadlines. :-(

I don't see anything blatantly wrong in this output.

Two things:

1. Can you try a nightly v1.8.4 snapshot tarball? This will check to see if whatever the bug is has been fixed for the upcoming release:

   http://www.open-mpi.org/nightly/v1.8/

2. Build Open MPI with the --enable-debug option (note that this adds a slight-but-noticeable performance penalty). When you run, it should dump a core file. Load that core file in a debugger and see where it is failing (i.e., file and line in the OMPI source).

We don't usually have to resort to asking users to perform #2, but there's no additional information to give a clue as to what is happening. :-(

On Nov 9, 2014, at 11:43 AM, Saliya Ekanayake wrote:
> Hi Jeff,
>
> You are probably busy, but just checking if you had a chance to look at this.
>
> Thanks,
> Saliya

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
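Step 2 above relies on the failing process writing a core file. Whether it does depends on the process's core-size limit, which on many systems defaults to zero (an assumption about the environment, not something stated in the thread). As a rough sketch, a test program could raise its own soft limit to the hard limit before calling MPI_Init; the function name sketch_enable_core_dumps is hypothetical, while getrlimit, setrlimit, and RLIMIT_CORE are the standard POSIX interfaces.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Raise the soft core-file size limit to the hard limit for this process.
 * Returns 0 on success and -1 on failure. It cannot exceed the hard limit,
 * so if an administrator has set that to zero no core file will be written.
 */
int sketch_enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return -1;
    }
    lim.rlim_cur = lim.rlim_max; /* soft limit := hard limit */
    return setrlimit(RLIMIT_CORE, &lim);
}
```

Calling this at the top of main in hello_c.c has roughly the effect of running ulimit -c with the hard limit in the launching shell, except that it applies only to the calling process.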
Re: [OMPI users] What could cause a segfault in OpenMPI?
Hi Jeff,

You are probably busy, but just checking if you had a chance to look at this.

Thanks,
Saliya

On Thu, Nov 6, 2014 at 9:19 AM, Saliya Ekanayake wrote:
> Hi Jeff,
>
> I've attached a tar file with information.
>
> Thank you,
> Saliya

--
Saliya Ekanayake esal...@gmail.com
Cell 812-391-4914 Home 812-961-6383
http://saliya.org
Re: [OMPI users] What could cause a segfault in OpenMPI?
Hi Jeff,

I've attached a tar file with information.

Thank you,
Saliya

On Tue, Nov 4, 2014 at 4:18 PM, Jeff Squyres (jsquyres) wrote:
> Looks like it's failing in the openib BTL setup.
>
> Can you send the info listed here?
>
> http://www.open-mpi.org/community/help/

ompi-segfault.tar.bz2
Description: BZip2 compressed data
Re: [OMPI users] What could cause a segfault in OpenMPI?
Looks like it's failing in the openib BTL setup.

Can you send the info listed here?

http://www.open-mpi.org/community/help/

On Nov 4, 2014, at 1:10 PM, Saliya Ekanayake wrote:
> Hi,
>
> I am using OpenMPI 1.8.1 in a Linux cluster that we recently set up. It builds
> fine, but when I try to run even the simplest hello.c program it'll cause a
> segfault. Any suggestions on how to correct this?
>
> The steps I did and error message are below.
>
> 1. Built OpenMPI 1.8.1 on the cluster. The ompi_info is attached.
> 2. cd to examples directory and mpicc hello_c.c
> 3. mpirun -np 2 ./a.out
> 4. Error text is attached.
>
> Please let me know if you need more info.
>
> Thank you,
> Saliya

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] What could cause a segfault in OpenMPI?
Hi Howard,

I just tried with 1.8.3 as well and it produces the same error. We have another cluster where both versions work fine, which is why I was curious as to what kind of things could cause this.

Thank you,
Saliya

On Tue, Nov 4, 2014 at 1:31 PM, Howard Pritchard wrote:
> Hello Saliya,
>
> Would you mind trying to reproduce the problem using the latest 1.8
> release - 1.8.3?
>
> Thanks,
>
> Howard
Re: [OMPI users] What could cause a segfault in OpenMPI?
Hello Saliya,

Would you mind trying to reproduce the problem using the latest 1.8 release - 1.8.3?

Thanks,

Howard

2014-11-04 11:10 GMT-07:00 Saliya Ekanayake:
> Hi,
>
> I am using OpenMPI 1.8.1 in a Linux cluster that we recently set up. It
> builds fine, but when I try to run even the simplest hello.c program it'll
> cause a segfault. Any suggestions on how to correct this?
>
> The steps I did and error message are below.
>
> 1. Built OpenMPI 1.8.1 on the cluster. The ompi_info is attached.
> 2. cd to examples directory and mpicc hello_c.c
> 3. mpirun -np 2 ./a.out
> 4. Error text is attached.
>
> Please let me know if you need more info.
>
> Thank you,
> Saliya
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/11/25668.php