Re: [OMPI users] What could cause a segfault in OpenMPI?

2014-12-28 Thread Ralph Castain
So you are saying the test worked, but you are still encountering an error when 
executing an MPI job? Or are you saying things now work?


> On Dec 28, 2014, at 5:58 PM, Saliya Ekanayake wrote:
> 
> Thank you Ralph. This produced a warning about memory limits similar to [1],
> and setting ulimit -l unlimited fixed it.
> 
> [1] http://lists.openfabrics.org/pipermail/general/2007-June/036941.html 
> 
> 
> Saliya

Re: [OMPI users] What could cause a segfault in OpenMPI?

2014-12-28 Thread Saliya Ekanayake
Thank you Ralph. This produced a warning about memory limits similar to [1],
and setting ulimit -l unlimited fixed it.

[1] http://lists.openfabrics.org/pipermail/general/2007-June/036941.html
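
For reference, a minimal sketch of making that fix persistent (assuming the
locked-memory limit is enforced through pam_limits; the exact policy and the
"unlimited" value are illustrative, not a vetted site setting):

    # check the limit the MPI processes actually see on each compute node
    ulimit -l
    # make it persistent by adding to /etc/security/limits.conf on every node:
    #   *  soft  memlock  unlimited
    #   *  hard  memlock  unlimited

Raising the limit only in an interactive shell is not enough; any daemons
(resource manager, sshd) that launch the MPI processes must also pick up the
new limit, which usually means restarting them.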

Saliya

On Sun, Dec 28, 2014 at 5:57 PM, Ralph Castain wrote:

> Have the admin try running the ibv_ud_pingpong test - that will exercise
> the portion of the system under discussion.

Re: [OMPI users] What could cause a segfault in OpenMPI?

2014-12-28 Thread Ralph Castain
Have the admin try running the ibv_ud_pingpong test - that will exercise the 
portion of the system under discussion.
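
A minimal sketch of that test (assuming the libibverbs example programs are
installed on both nodes; "nodeA" is a placeholder hostname):

    # on node A, start the server side:
    ibv_ud_pingpong
    # on node B, point the client at node A:
    ibv_ud_pingpong nodeA

If this hangs or errors out, the UD path itself is broken, independently of
Open MPI.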


> On Dec 28, 2014, at 2:31 PM, Saliya Ekanayake wrote:
> 
> What I heard from the administrator is that, 
> 
> "The tests that work are the simple utilities ib_read_lat and ib_read_bw
> that measure latency and bandwidth between two nodes. They are part of
> the "perftest" repo package."

Re: [OMPI users] What could cause a segfault in OpenMPI?

2014-12-28 Thread Saliya Ekanayake
What I heard from the administrator is that,

"The tests that work are the simple utilities ib_read_lat and ib_read_bw
that measure latency and bandwidth between two nodes. They are part of
the "perftest" repo package."
On Dec 28, 2014 10:20 AM, "Saliya Ekanayake" wrote:

> This happens at MPI_Init. I've attached the full error message.
>
> The sys admin mentioned Infiniband utility tests ran OK. I'll contact him
> for more details and let you know.
>
> Thank you,
> Saliya

Re: [OMPI users] What could cause a segfault in OpenMPI?

2014-12-28 Thread Saliya Ekanayake
This happens at MPI_Init. I've attached the full error message.

The sys admin mentioned Infiniband utility tests ran OK. I'll contact him
for more details and let you know.

Thank you,
Saliya

On Sun, Dec 28, 2014 at 3:18 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Where does the error occur?
> MPI_Init?
> MPI_Finalize?
> In between?
>
> In the first case, the bug is likely a mishandled error case,
> which means OpenMPI is unlikely to be the root cause of the crash.
>
> Did you check InfiniBand is up and running on your cluster?
>
> Cheers,
>
> Gilles

[OMPI users] Increasing OFED registerable memory

2014-12-28 Thread Waleed Lotfy
I have a bunch of 8 GB memory nodes in a cluster that were recently
upgraded to 16 GB. When I run any job I get the following warning:
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel
module parameters:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:          comp022.local
  Registerable memory: 8192 MiB
  Total memory:        16036 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------

Searching for a fix to this issue, I found that I have to set
log_num_mtt within the kernel module, so I added this line to
modprobe.conf:

options mlx4_core log_num_mtt=21

But then the ib0 interface fails to start, showing this error:
ib_ipoib device ib0 does not seem to be present, delaying initialization.

Reducing log_num_mtt to 20 allows ib0 to start, but the 8 GB
registerable-memory warning comes back.
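
For reference, the arithmetic behind the limit, as a sketch (assuming the mlx4
driver and 4 KiB pages; check which parameters your OFED build actually
accepts before changing anything):

    # registerable memory = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE
    # 8192 MiB with log_num_mtt=20 and 4 KiB pages suggests log_mtts_per_seg=1,
    # so raising log_mtts_per_seg instead of log_num_mtt may cover 16 GB
    # without whatever breaks ib0 at log_num_mtt=21, e.g. (illustrative values):
    #   options mlx4_core log_num_mtt=20 log_mtts_per_seg=3
    modinfo mlx4_core | grep -E 'log_num_mtt|log_mtts_per_seg'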

I am using OFED 1.3.1; I know it is pretty old, and we are planning to
upgrade soon.

Output on all nodes for 'ompi_info -v ompi full --parsable':

ompi:version:full:1.2.7
ompi:version:svn:r19401
orte:version:full:1.2.7
orte:version:svn:r19401
opal:version:full:1.2.7
opal:version:svn:r19401

Any help would be appreciated.

Waleed Lotfy
Bibliotheca Alexandrina


Re: [OMPI users] What could cause a segfault in OpenMPI?

2014-12-28 Thread Ralph Castain
Might also be worth checking to ensure that UD is enabled on your IB 
installation as we depend upon it for wireup of IB connections.


> On Dec 28, 2014, at 12:18 AM, Gilles Gouaillardet wrote:
> 
> Where does the error occur?
> MPI_Init?
> MPI_Finalize?
> In between?
>
> In the first case, the bug is likely a mishandled error case,
> which means OpenMPI is unlikely to be the root cause of the crash.
>
> Did you check InfiniBand is up and running on your cluster?
> 
> Cheers,
> 
> Gilles 

Re: [OMPI users] What could cause a segfault in OpenMPI?

2014-12-28 Thread Gilles Gouaillardet
Where does the error occur?
MPI_Init?
MPI_Finalize?
In between?

In the first case, the bug is likely a mishandled error case,
which means OpenMPI is unlikely to be the root cause of the crash.

Did you check InfiniBand is up and running on your cluster?

Cheers,

Gilles 
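
For the last point, a couple of quick checks, assuming the standard InfiniBand
userspace tools are installed on the nodes:

    ibstat                        # port State should be Active, Physical state LinkUp
    ibv_devinfo | grep -i state   # the same information from the verbs layer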

Saliya Ekanayake wrote:
>It's been a while on this, but we are still having trouble getting OpenMPI to
>work with InfiniBand on this cluster. We tried with the latest 1.8.4 as well, but
>it's still the same.
>
>
>To recap, we get the following error when MPI initializes (in the simple Hello
>world C example) with InfiniBand. Everything works fine if we explicitly turn
>off openib with --mca btl ^openib
>
>
>This is the error I got after debugging with gdb as you suggested.
>
>
>hello_c: connect/btl_openib_connect_udcm.c:736: udcm_module_finalize: 
>Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) 
>(&m->cm_recv_msg_queue))->obj_magic_id' failed.
>
>
>Thank you,
>
>Saliya
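
For completeness, the two runs described above look roughly like this (the btl
list on the second line is illustrative, not the exact command that was used):

    mpirun -np 2 --mca btl ^openib ./a.out          # openib excluded: runs fine
    mpirun -np 2 --mca btl openib,self,sm ./a.out   # openib forced: hits the udcm assertion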
>
>
>On Mon, Nov 10, 2014 at 10:01 AM, Saliya Ekanayake wrote:
>
>Thank you Jeff, I'll try this and let you know.
>
>Saliya
>
>On Nov 10, 2014 6:42 AM, "Jeff Squyres (jsquyres)" wrote:
>
>I am sorry for the delay; I've been caught up in SC deadlines.  :-(
>
>I don't see anything blatantly wrong in this output.
>
>Two things:
>
>1. Can you try a nightly v1.8.4 snapshot tarball?  This will check to see if 
>whatever the bug is has been fixed for the upcoming release:
>
>    http://www.open-mpi.org/nightly/v1.8/
>
>2. Build Open MPI with the --enable-debug option (note that this adds a 
>slight-but-noticeable performance penalty).  When you run, it should dump a 
>core file.  Load that core file in a debugger and see where it is failing 
>(i.e., file and line in the OMPI source).
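
One way to carry out step 2, as a sketch (the install prefix, job size, and
core-file name are illustrative and depend on the system's core_pattern):

    ./configure --prefix=$HOME/ompi-debug --enable-debug && make -j8 all install
    ulimit -c unlimited           # allow core files to be written
    mpirun -np 2 ./a.out          # the failing rank should now leave a core file
    gdb ./a.out <corefile>        # then "bt" gives the file and line of the failure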
>
>We don't usually have to resort to asking users to perform #2, but there's no 
>additional information to give a clue as to what is happening.  :-(
>
>
>
>On Nov 9, 2014, at 11:43 AM, Saliya Ekanayake wrote:
>
>> Hi Jeff,
>>
>> You are probably busy, but just checking if you had a chance to look at this.
>>
>> Thanks,
>> Saliya
>>
>> On Thu, Nov 6, 2014 at 9:19 AM, Saliya Ekanayake wrote:
>> Hi Jeff,
>>
>> I've attached a tar file with information.
>>
>> Thank you,
>> Saliya
>>
>> On Tue, Nov 4, 2014 at 4:18 PM, Jeff Squyres (jsquyres) wrote:
>> Looks like it's failing in the openib BTL setup.
>>
>> Can you send the info listed here?
>>
>>     http://www.open-mpi.org/community/help/
>>
>>
>>
>> On Nov 4, 2014, at 1:10 PM, Saliya Ekanayake wrote:
>>
>> > Hi,
>> >
>> > I am using OpenMPI 1.8.1 on a Linux cluster that we recently set up. It
>> > builds fine, but when I try to run even the simplest hello.c program it
>> > segfaults. Any suggestions on how to correct this?
>> >
>> > The steps I did and error message are below.
>> >
>> > 1. Built OpenMPI 1.8.1 on the cluster. The ompi_info is attached.
>> > 2. cd to examples directory and mpicc hello_c.c
>> > 3. mpirun -np 2 ./a.out
>> > 4. Error text is attached.
>> >
>> > Please let me know if you need more info.
>> >
>> > Thank you,
>> > Saliya
>> >
>> >
>> > --
>> > Saliya Ekanayake esal...@gmail.com
>> > Cell 812-391-4914 Home 812-961-6383
>> > http://saliya.org
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>>
>>
>> --
>> Saliya Ekanayake esal...@gmail.com
>> Cell 812-391-4914 Home 812-961-6383
>> http://saliya.org
>
>
>--
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to: 
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
>
>-- 
>
>Saliya Ekanayake
>
>Ph.D. Candidate | Research Assistant
>
>School of Informatics and