Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Terry Dontje
So it sounds like coalescing is not your issue and that the problem has 
something to do with the queue sizes.  It would be helpful if we could 
detect the hdr->tag == 0 issue on the sending side and get at least a 
stack trace.  There is something really odd going on here.


--td

Eloi Gaudry wrote:

Hi Terry,

I'm sorry to say that I might have missed a point here.

I've lately been relaunching all previously failing computations with 
the message coalescing feature being switched off, and I saw the same 
hdr->tag=0 error several times, always during a collective call 
(MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as 
soon as I switched to the peer queue option I was previously using 
(--mca btl_openib_receive_queues P,65536,256,192,128 instead of using 
--mca btl_openib_use_message_coalescing 0), all computations ran 
flawlessly.


As for the reproducer, I've already tried to write something but I 
haven't succeeded so far at reproducing the hdr->tag=0 issue with it.


Eloi

On 24/09/2010 18:37, Terry Dontje wrote:

Eloi Gaudry wrote:

Terry,

You were right, the error indeed seems to come from the message coalescing 
feature.
If I turn it off using the "--mca btl_openib_use_message_coalescing 0", I'm not able to 
observe the "hdr->tag=0" error.

There are some Trac tickets associated with a very similar error (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352, 
which might be related), aren't they? What would you suggest, Terry?


  
Interesting, though it looks to me like the segv in ticket 2352 would 
have happened on the send side instead of the receive side like you 
have.  As to what to do next it would be really nice to have some 
sort of reproducer that we can try and debug what is really going 
on.  The only other thing to do without a reproducer is to inspect 
the code on the send side to figure out what might make it generate 
a 0 hdr->tag.  Or maybe instrument the send side to stop when it is 
about ready to send a 0 hdr->tag and see if we can see how the code 
got there.


I might have some cycles to look at this Monday.

--td

Eloi


On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
  

Eloi Gaudry wrote:


Terry,

No, I haven't tried any other values than P,65536,256,192,128 yet.

The reason why is quite simple. I've been reading and reading again this
thread to understand the btl_openib_receive_queues meaning and I can't
figure out why the default values seem to induce the hdr->tag=0 issue
(http://www.open-mpi.org/community/lists/users/2009/01/7808.php).


Yeah, the size of the fragments and number of them really should not
cause this issue.  So I too am a little perplexed about it.



Do you think that the default shared receive queue parameters are
erroneous for this specific Mellanox card? Any help on finding the
proper parameters would actually be much appreciated.
  

I don't necessarily think it is the queue size for a specific card but
more so the handling of the queues by the BTL when using certain sizes.
At least that is one gut feel I have.

In my mind, the tag being 0 means either that something below OMPI is polluting
the data fragment or that OMPI's internal protocol is somehow getting messed
up.  I can imagine (no empirical data here) the queue sizes could change
how the OMPI protocol sets things up.  Another thing may be the
coalescing feature in the openib BTL which tries to gang multiple
messages into one packet when resources are running low.   I can see
where changing the queue sizes might affect the coalescing.  So, it
might be interesting to turn off the coalescing.  You can do that by
setting "--mca btl_openib_use_message_coalescing 0" in your mpirun line.

If that doesn't solve the issue then obviously there must be something
else going on :-).

Note, the reason I am interested in this is that I am seeing a similar error
condition (hdr->tag == 0) on a development system.  Though my failing
case fails with np=8 using the connectivity test program, which is mainly
point to point, and there is not a significant amount of data transfer
going on either.

--td



Eloi

On Friday 24 September 2010 14:27:07 you wrote:
  

That is interesting.  So does the number of processes affect your runs
at all?  The times I've seen hdr->tag be 0 it has usually been due to protocol
issues.  The tag should never be 0.  Have you tried receive_queue
settings other than the default and the one you mention?

I wonder whether a combination of the two receive queues would cause a
failure or not.  Something like

P,128,256,192,128:P,65536,256,192,128

I am wondering if it is the first queuing definition causing the issue
or possibly the SRQ defined in the default.
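(That would just be your usual mpirun invocation with "--mca btl_openib_receive_queues 
P,128,256,192,128:P,65536,256,192,128" added, leaving everything else on the command 
line unchanged.)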

--td

Eloi Gaudry wrote:


Hi Terry,

The messages being sent/received can be of any size, but the error
seems to happen more often with small messages (such as an int being
broadcast or allreduced

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Eloi Gaudry
Hi Terry,

Do you have any patch that I could apply to be able to do so? I'm remotely
working on a cluster (with a terminal) and I cannot use any parallel debugger
or sequential debugger (with a call to xterm...). I can track the frag->hdr->tag
value in ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the
SEND/RDMA_WRITE case, but this is all I can think of alone.
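Something along these lines is what I have in mind (just a sketch, using plain fprintf 
so it does not depend on anything else):

/* sketch only: in handle_wc(), in the IBV_WC_SEND / IBV_WC_RDMA_WRITE case,
 * once the fragment has been recovered from the work completion */
if (0 == frag->hdr->tag) {
    fprintf(stderr,
            "handle_wc: completed a send fragment with hdr->tag == 0 (frag=%p)\n",
            (void *) frag);
}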

You'll find a stacktrace (receive side) in this thread (10th or 11th message) 
but it might be pointless.

Regards,
Eloi


On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
> So it sounds like coalescing is not your issue and that the problem has
> something to do with the queue sizes.  It would be helpful if we could
> detect the hdr->tag == 0 issue on the sending side and get at least a
> stack trace.  There is something really odd going on here.
> 
> --td
> 
> Eloi Gaudry wrote:
> > Hi Terry,
> > 
> > I'm sorry to say that I might have missed a point here.
> > 
> > I've lately been relaunching all previously failing computations with
> > the message coalescing feature being switched off, and I saw the same
> > hdr->tag=0 error several times, always during a collective call
> > (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as
> > soon as I switched to the peer queue option I was previously using
> > (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using
> > --mca btl_openib_use_message_coalescing 0), all computations ran
> > flawlessly.
> > 
> > As for the reproducer, I've already tried to write something but I
> > haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
> > 
> > Eloi
> > 
> > On 24/09/2010 18:37, Terry Dontje wrote:
> >> Eloi Gaudry wrote:
> >>> Terry,
> >>> 
> >>> You were right, the error indeed seems to come from the message
> >>> coalescing feature. If I turn it off using the "--mca
> >>> btl_openib_use_message_coalescing 0", I'm not able to observe the
> >>> "hdr->tag=0" error.
> >>> 
> >>> There are some trac requests associated to very similar error
> >>> (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are
> >>> all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352 that
> >>> might be related), aren't they ? What would you suggest Terry ?
> >> 
> >> Interesting, though it looks to me like the segv in ticket 2352 would
> >> have happened on the send side instead of the receive side like you
> >> have.  As to what to do next it would be really nice to have some
> >> sort of reproducer that we can try and debug what is really going
> >> on.  The only other thing to do without a reproducer is to inspect
> >> the code on the send side to figure out what might make it generate
> >> at 0 hdr->tag.  Or maybe instrument the send side to stop when it is
> >> about ready to send a 0 hdr->tag and see if we can see how the code
> >> got there.
> >> 
> >> I might have some cycles to look at this Monday.
> >> 
> >> --td
> >> 
> >>> Eloi
> >>> 
> >>> On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
>  Eloi Gaudry wrote:
> > Terry,
> > 
> > No, I haven't tried any other values than P,65536,256,192,128 yet.
> > 
> > The reason why is quite simple. I've been reading and reading again
> > this thread to understand the btl_openib_receive_queues meaning and
> > I can't figure out why the default values seem to induce the hdr-
> > 
> >> tag=0 issue
> >> (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
>  
>  Yeah, the size of the fragments and number of them really should not
>  cause this issue.  So I too am a little perplexed about it.
>  
> > Do you think that the default shared received queue parameters are
> > erroneous for this specific Mellanox card ? Any help on finding the
> > proper parameters would actually be much appreciated.
>  
>  I don't necessarily think it is the queue size for a specific card but
>  more so the handling of the queues by the BTL when using certain
>  sizes. At least that is one gut feel I have.
>  
>  In my mind the tag being 0 is either something below OMPI is polluting
>  the data fragment or OMPI's internal protocol is some how getting
>  messed up.  I can imagine (no empirical data here) the queue sizes
>  could change how the OMPI protocol sets things up.  Another thing may
>  be the coalescing feature in the openib BTL which tries to gang
>  multiple messages into one packet when resources are running low.   I
>  can see where changing the queue sizes might affect the coalescing. 
>  So, it might be interesting to turn off the coalescing.  You can do
>  that by setting "--mca btl_openib_use_message_coalescing 0" in your
>  mpirun line.
>  
>  If that doesn't solve the issue then obviously there must be something
>  else going on :-).
>  
>  Note, the reason I am interested in this is I am seeing a similar
>  error condition (hdr->ta

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Terry Dontje
I am thinking of checking the value of *frag->hdr right before the return 
in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h.  
It is line 548 in the trunk:

https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_openib_endpoint.h#548
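I don't have a formal patch handy, but the check I have in mind is roughly the 
following (only frag->hdr->tag comes from the code we have been discussing; the 
fprintf/abort is simply the easiest way to get a core file and a stack trace on 
the send side):

/* sketch: placed right before the return at the end of post_send() */
if (0 == frag->hdr->tag) {
    fprintf(stderr, "post_send: about to post a fragment with hdr->tag == 0\n");
    abort();   /* die on the send side so we get the stack trace there */
}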

--td

Eloi Gaudry wrote:

Hi Terry,

Do you have any patch that I could apply to be able to do so ? I'm remotely working on a cluster (with a terminal) and I cannot use any parallel debugger or sequential debugger (with a call to 
xterm...). I can track frag->hdr->tag value in ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the SEND/RDMA_WRITE case, but this is all I can think of alone.


You'll find a stacktrace (receive side) in this thread (10th or 11th message) 
but it might be pointless.

Regards,
Eloi


On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
  

So it sounds like coalescing is not your issue and that the problem has
something to do with the queue sizes.  It would be helpful if we could
detect the hdr->tag == 0 issue on the sending side and get at least a
stack trace.  There is something really odd going on here.

--td

Eloi Gaudry wrote:


Hi Terry,

I'm sorry to say that I might have missed a point here.

I've lately been relaunching all previously failing computations with
the message coalescing feature being switched off, and I saw the same
hdr->tag=0 error several times, always during a collective call
(MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as
soon as I switched to the peer queue option I was previously using
(--mca btl_openib_receive_queues P,65536,256,192,128 instead of using
--mca btl_openib_use_message_coalescing 0), all computations ran
flawlessly.

As for the reproducer, I've already tried to write something but I
haven't succeeded so far at reproducing the hdr->tag=0 issue with it.

Eloi

On 24/09/2010 18:37, Terry Dontje wrote:
  

Eloi Gaudry wrote:


Terry,

You were right, the error indeed seems to come from the message
coalescing feature. If I turn it off using the "--mca
btl_openib_use_message_coalescing 0", I'm not able to observe the
"hdr->tag=0" error.

There are some trac requests associated to very similar error
(https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are
all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352 that
might be related), aren't they ? What would you suggest Terry ?
  

Interesting, though it looks to me like the segv in ticket 2352 would
have happened on the send side instead of the receive side like you
have.  As to what to do next it would be really nice to have some
sort of reproducer that we can try and debug what is really going
on.  The only other thing to do without a reproducer is to inspect
the code on the send side to figure out what might make it generate
at 0 hdr->tag.  Or maybe instrument the send side to stop when it is
about ready to send a 0 hdr->tag and see if we can see how the code
got there.

I might have some cycles to look at this Monday.

--td



Eloi

On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
  

Eloi Gaudry wrote:


Terry,

No, I haven't tried any other values than P,65536,256,192,128 yet.

The reason why is quite simple. I've been reading and reading again
this thread to understand the btl_openib_receive_queues meaning and
I can't figure out why the default values seem to induce the hdr->tag=0 issue
(http://www.open-mpi.org/community/lists/users/2009/01/7808.php).


Yeah, the size of the fragments and number of them really should not
cause this issue.  So I too am a little perplexed about it.



Do you think that the default shared received queue parameters are
erroneous for this specific Mellanox card ? Any help on finding the
proper parameters would actually be much appreciated.
  

I don't necessarily think it is the queue size for a specific card but
more so the handling of the queues by the BTL when using certain
sizes. At least that is one gut feel I have.

In my mind, the tag being 0 means either that something below OMPI is polluting
the data fragment or that OMPI's internal protocol is somehow getting
messed up.  I can imagine (no empirical data here) the queue sizes
could change how the OMPI protocol sets things up.  Another thing may
be the coalescing feature in the openib BTL which tries to gang
multiple messages into one packet when resources are running low.   I
can see where changing the queue sizes might affect the coalescing. 
So, it might be interesting to turn off the coalescing.  You can do

that by setting "--mca btl_openib_use_message_coalescing 0" in your
mpirun line.

If that doesn't solve the issue then obviously there must be something
else going on :-).

Note, the reason I am interested in this is I am seeing a similar
error condition (hdr->tag == 0) on a development system.  Though my
failing case fails with np=8 using the connectivity test program
whic

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Eloi Gaudry
Terry,

Please find enclosed the requested check outputs (using the -output-filename 
stdout.tag.null option).

For information, Nysal, in his first message, referred to 
ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was wrong on 
the receiving side.
#define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
#define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
#define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
#define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
#define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
#define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
#define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
#define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
#define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
and in ompi/mca/btl/btl.h:
#define MCA_BTL_TAG_PML 0x40
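So, with MCA_BTL_TAG_PML = 0x40, the valid ob1 header types are 0x41 through 0x49, 
and a tag of 0 is well outside that range. A small helper along these lines (my own 
sketch, based only on the defines above) makes the printed tags easier to read:

/* sketch: map a received ob1 tag to a name; values are MCA_BTL_TAG_PML (0x40) + 1..9 */
static const char *ob1_tag_name(unsigned char tag)
{
    switch (tag) {
    case 0x41: return "MATCH";
    case 0x42: return "RNDV";
    case 0x43: return "RGET";
    case 0x44: return "ACK";
    case 0x45: return "NACK";
    case 0x46: return "FRAG";
    case 0x47: return "GET";
    case 0x48: return "PUT";
    case 0x49: return "FIN";
    default:   return "INVALID (a tag of 0 lands here)";
    }
}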

Eloi

On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
> I am thinking checking the value of *frag->hdr right before the return
> in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h.
> It is line 548 in the trunk
> https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_ope
> nib_endpoint.h#548
> 
> --td
> 
> Eloi Gaudry wrote:
> > Hi Terry,
> > 
> > Do you have any patch that I could apply to be able to do so ? I'm
> > remotely working on a cluster (with a terminal) and I cannot use any
> > parallel debugger or sequential debugger (with a call to xterm...). I
> > can track frag->hdr->tag value in
> > ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the
> > SEND/RDMA_WRITE case, but this is all I can think of alone.
> > 
> > You'll find a stacktrace (receive side) in this thread (10th or 11th
> > message) but it might be pointless.
> > 
> > Regards,
> > Eloi
> > 
> > On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
> >> So it sounds like coalescing is not your issue and that the problem has
> >> something to do with the queue sizes.  It would be helpful if we could
> >> detect the hdr->tag == 0 issue on the sending side and get at least a
> >> stack trace.  There is something really odd going on here.
> >> 
> >> --td
> >> 
> >> Eloi Gaudry wrote:
> >>> Hi Terry,
> >>> 
> >>> I'm sorry to say that I might have missed a point here.
> >>> 
> >>> I've lately been relaunching all previously failing computations with
> >>> the message coalescing feature being switched off, and I saw the same
> >>> hdr->tag=0 error several times, always during a collective call
> >>> (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as
> >>> soon as I switched to the peer queue option I was previously using
> >>> (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using
> >>> --mca btl_openib_use_message_coalescing 0), all computations ran
> >>> flawlessly.
> >>> 
> >>> As for the reproducer, I've already tried to write something but I
> >>> haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
> >>> 
> >>> Eloi
> >>> 
> >>> On 24/09/2010 18:37, Terry Dontje wrote:
>  Eloi Gaudry wrote:
> > Terry,
> > 
> > You were right, the error indeed seems to come from the message
> > coalescing feature. If I turn it off using the "--mca
> > btl_openib_use_message_coalescing 0", I'm not able to observe the
> > "hdr->tag=0" error.
> > 
> > There are some trac requests associated to very similar error
> > (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are
> > all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352
> > that might be related), aren't they ? What would you suggest Terry ?
>  
>  Interesting, though it looks to me like the segv in ticket 2352 would
>  have happened on the send side instead of the receive side like you
>  have.  As to what to do next it would be really nice to have some
>  sort of reproducer that we can try and debug what is really going
>  on.  The only other thing to do without a reproducer is to inspect
>  the code on the send side to figure out what might make it generate
>  at 0 hdr->tag.  Or maybe instrument the send side to stop when it is
>  about ready to send a 0 hdr->tag and see if we can see how the code
>  got there.
>  
>  I might have some cycles to look at this Monday.
>  
>  --td
>  
> > Eloi
> > 
> > On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
> >> Eloi Gaudry wrote:
> >>> Terry,
> >>> 
> >>> No, I haven't tried any other values than P,65536,256,192,128 yet.
> >>> 
> >>> The reason why is quite simple. I've been reading and reading again
> >>> this thread to understand the btl_openib_receive_queues meaning and
> >>> I can't figure out why the default values seem to induce the hdr-
> >>> 
>  tag=0 issue
>  (http://www.open-mpi.org/community/lists/users/2009/01/7808.php).
> >> 
> >> Yeah, the size of the fragments and number

Re: [OMPI users] Porting Open MPI to ARM: How essential is the opal_sys_timer_get_cycles() function?

2010-09-27 Thread Jeff Squyres
On Sep 23, 2010, at 1:24 PM, Ken Mighell wrote:

> Would a hack written in C suffice?

Assembly is always better, but C should be fine.  If you really want to, you could 
write it in C and have the compiler generate optimized assembly for you.
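For example, a generic C stand-in (just the idea, not the actual opal interface) could 
read a monotonic clock and report nanoseconds in place of cycles:

#include <stdint.h>
#include <time.h>

/* generic C fallback for a "get cycles" routine: returns monotonic
 * nanoseconds rather than real CPU cycles */
static inline uint64_t sys_timer_get_cycles_c(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;
}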

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Terry Dontje

Eloi, sorry, can you print out frag->hdr->tag?

Unfortunately from your last email I think it will still all have 
non-zero values.
If that ends up being the case then there must be something odd with the 
descriptor pointer to the fragment.


--td

Eloi Gaudry wrote:

Terry,

Please find enclosed the requested check outputs (using -output-filename 
stdout.tag.null option).

For information, Nysal, in his first message, referred to 
ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was wrong on 
the receiving side.
#define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML + 1)
#define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
#define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
 #define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
#define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
#define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
#define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
 #define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
#define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
and in ompi/mca/btl/btl.h 
#define MCA_BTL_TAG_PML 0x40


Eloi

On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
  

I am thinking checking the value of *frag->hdr right before the return
in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h.
It is line 548 in the trunk
https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_ope
nib_endpoint.h#548

--td

Eloi Gaudry wrote:


Hi Terry,

Do you have any patch that I could apply to be able to do so ? I'm
remotely working on a cluster (with a terminal) and I cannot use any
parallel debugger or sequential debugger (with a call to xterm...). I
can track frag->hdr->tag value in
ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the
SEND/RDMA_WRITE case, but this is all I can think of alone.

You'll find a stacktrace (receive side) in this thread (10th or 11th
message) but it might be pointless.

Regards,
Eloi

On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
  

So it sounds like coalescing is not your issue and that the problem has
something to do with the queue sizes.  It would be helpful if we could
detect the hdr->tag == 0 issue on the sending side and get at least a
stack trace.  There is something really odd going on here.

--td

Eloi Gaudry wrote:


Hi Terry,

I'm sorry to say that I might have missed a point here.

I've lately been relaunching all previously failing computations with
the message coalescing feature being switched off, and I saw the same
hdr->tag=0 error several times, always during a collective call
(MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as
soon as I switched to the peer queue option I was previously using
(--mca btl_openib_receive_queues P,65536,256,192,128 instead of using
--mca btl_openib_use_message_coalescing 0), all computations ran
flawlessly.

As for the reproducer, I've already tried to write something but I
haven't succeeded so far at reproducing the hdr->tag=0 issue with it.

Eloi

On 24/09/2010 18:37, Terry Dontje wrote:
  

Eloi Gaudry wrote:


Terry,

You were right, the error indeed seems to come from the message
coalescing feature. If I turn it off using the "--mca
btl_openib_use_message_coalescing 0", I'm not able to observe the
"hdr->tag=0" error.

There are some trac requests associated to very similar error
(https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they are
all closed (except https://svn.open-mpi.org/trac/ompi/ticket/2352
that might be related), aren't they ? What would you suggest Terry ?
  

Interesting, though it looks to me like the segv in ticket 2352 would
have happened on the send side instead of the receive side like you
have.  As to what to do next it would be really nice to have some
sort of reproducer that we can try and debug what is really going
on.  The only other thing to do without a reproducer is to inspect
the code on the send side to figure out what might make it generate
at 0 hdr->tag.  Or maybe instrument the send side to stop when it is
about ready to send a 0 hdr->tag and see if we can see how the code
got there.

I might have some cycles to look at this Monday.

--td



Eloi

On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
  

Eloi Gaudry wrote:


Terry,

No, I haven't tried any other values than P,65536,256,192,128 yet.

The reason why is quite simple. I've been reading and reading again
this thread to understand the btl_openib_receive_queues meaning and
I can't figure out why the default values seem to induce the hdr->tag=0 issue
(http://www.open-mpi.org/community/lists/users/2009/01/7808.php).


Yeah, the size of the fragments and number of them really should not
cause this issue.  So I too am a little perplexed about it.



Do you think that the default shared received queue pa

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Eloi Gaudry
Terry,

Please find enclosed the requested check outputs (using -output-filename 
stdout.tag.null option).
I'm displaying frag->hdr->tag here.

Eloi

On Monday 27 September 2010 16:29:12 Terry Dontje wrote:
> Eloi, sorry can you print out frag->hdr->tag?
> 
> Unfortunately from your last email I think it will still all have
> non-zero values.
> If that ends up being the case then there must be something odd with the
> descriptor pointer to the fragment.
> 
> --td
> 
> Eloi Gaudry wrote:
> > Terry,
> > 
> > Please find enclosed the requested check outputs (using -output-filename
> > stdout.tag.null option).
> > 
> > For information, Nysal In his first message referred to
> > ompi/mca/pml/ob1/pml_ob1_hdr.h and said that hdr->tg value was wrnong on
> > receiving side. #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML
> > + 1)
> > #define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
> > #define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)
> > 
> >  #define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)
> > 
> > #define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
> > #define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
> > #define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)
> > 
> >  #define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)
> > 
> > #define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
> > and in ompi/mca/btl/btl.h
> > #define MCA_BTL_TAG_PML 0x40
> > 
> > Eloi
> > 
> > On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
> >> I am thinking checking the value of *frag->hdr right before the return
> >> in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h.
> >> It is line 548 in the trunk
> >> https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_
> >> ope nib_endpoint.h#548
> >> 
> >> --td
> >> 
> >> Eloi Gaudry wrote:
> >>> Hi Terry,
> >>> 
> >>> Do you have any patch that I could apply to be able to do so ? I'm
> >>> remotely working on a cluster (with a terminal) and I cannot use any
> >>> parallel debugger or sequential debugger (with a call to xterm...). I
> >>> can track frag->hdr->tag value in
> >>> ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the
> >>> SEND/RDMA_WRITE case, but this is all I can think of alone.
> >>> 
> >>> You'll find a stacktrace (receive side) in this thread (10th or 11th
> >>> message) but it might be pointless.
> >>> 
> >>> Regards,
> >>> Eloi
> >>> 
> >>> On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
>  So it sounds like coalescing is not your issue and that the problem
>  has something to do with the queue sizes.  It would be helpful if we
>  could detect the hdr->tag == 0 issue on the sending side and get at
>  least a stack trace.  There is something really odd going on here.
>  
>  --td
>  
>  Eloi Gaudry wrote:
> > Hi Terry,
> > 
> > I'm sorry to say that I might have missed a point here.
> > 
> > I've lately been relaunching all previously failing computations with
> > the message coalescing feature being switched off, and I saw the same
> > hdr->tag=0 error several times, always during a collective call
> > (MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as
> > soon as I switched to the peer queue option I was previously using
> > (--mca btl_openib_receive_queues P,65536,256,192,128 instead of using
> > --mca btl_openib_use_message_coalescing 0), all computations ran
> > flawlessly.
> > 
> > As for the reproducer, I've already tried to write something but I
> > haven't succeeded so far at reproducing the hdr->tag=0 issue with it.
> > 
> > Eloi
> > 
> > On 24/09/2010 18:37, Terry Dontje wrote:
> >> Eloi Gaudry wrote:
> >>> Terry,
> >>> 
> >>> You were right, the error indeed seems to come from the message
> >>> coalescing feature. If I turn it off using the "--mca
> >>> btl_openib_use_message_coalescing 0", I'm not able to observe the
> >>> "hdr->tag=0" error.
> >>> 
> >>> There are some trac requests associated to very similar error
> >>> (https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they
> >>> are all closed (except
> >>> https://svn.open-mpi.org/trac/ompi/ticket/2352 that might be
> >>> related), aren't they ? What would you suggest Terry ?
> >> 
> >> Interesting, though it looks to me like the segv in ticket 2352
> >> would have happened on the send side instead of the receive side
> >> like you have.  As to what to do next it would be really nice to
> >> have some sort of reproducer that we can try and debug what is
> >> really going on.  The only other thing to do without a reproducer
> >> is to inspect the code on the send side to figure out what might
> >> make it generate at 0 hdr->tag.  Or maybe instrument the send side
> >> to stop when it is about ready to send a 0 hdr->tag and see if we

Re: [OMPI users] [openib] segfault when using openib btl

2010-09-27 Thread Terry Dontje
OK, there were no 0-value tags in your files.  Are you running this with 
eager RDMA disabled?  If not, can you set the following options: "-mca 
btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0 -mca 
btl_openib_flags 1"?


thanks,

--td

Eloi Gaudry wrote:

Terry,

Please find enclosed the requested check outputs (using -output-filename 
stdout.tag.null option).
I'm displaying frag->hdr->tag here.

Eloi

On Monday 27 September 2010 16:29:12 Terry Dontje wrote:
  

Eloi, sorry can you print out frag->hdr->tag?

Unfortunately from your last email I think it will still all have
non-zero values.
If that ends up being the case then there must be something odd with the
descriptor pointer to the fragment.

--td

Eloi Gaudry wrote:


Terry,

Please find enclosed the requested check outputs (using -output-filename
stdout.tag.null option).

For information, Nysal, in his first message, referred to
ompi/mca/pml/ob1/pml_ob1_hdr.h and said that the hdr->tag value was wrong on the
receiving side. #define MCA_PML_OB1_HDR_TYPE_MATCH (MCA_BTL_TAG_PML
+ 1)
#define MCA_PML_OB1_HDR_TYPE_RNDV  (MCA_BTL_TAG_PML + 2)
#define MCA_PML_OB1_HDR_TYPE_RGET  (MCA_BTL_TAG_PML + 3)

 #define MCA_PML_OB1_HDR_TYPE_ACK   (MCA_BTL_TAG_PML + 4)

#define MCA_PML_OB1_HDR_TYPE_NACK  (MCA_BTL_TAG_PML + 5)
#define MCA_PML_OB1_HDR_TYPE_FRAG  (MCA_BTL_TAG_PML + 6)
#define MCA_PML_OB1_HDR_TYPE_GET   (MCA_BTL_TAG_PML + 7)

 #define MCA_PML_OB1_HDR_TYPE_PUT   (MCA_BTL_TAG_PML + 8)

#define MCA_PML_OB1_HDR_TYPE_FIN   (MCA_BTL_TAG_PML + 9)
and in ompi/mca/btl/btl.h
#define MCA_BTL_TAG_PML 0x40

Eloi

On Monday 27 September 2010 14:36:59 Terry Dontje wrote:
  

I am thinking checking the value of *frag->hdr right before the return
in the post_send function in ompi/mca/btl/openib/btl_openib_endpoint.h.
It is line 548 in the trunk
https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/openib/btl_
ope nib_endpoint.h#548

--td

Eloi Gaudry wrote:


Hi Terry,

Do you have any patch that I could apply to be able to do so ? I'm
remotely working on a cluster (with a terminal) and I cannot use any
parallel debugger or sequential debugger (with a call to xterm...). I
can track frag->hdr->tag value in
ompi/mca/btl/openib/btl_openib_component.c::handle_wc in the
SEND/RDMA_WRITE case, but this is all I can think of alone.

You'll find a stacktrace (receive side) in this thread (10th or 11th
message) but it might be pointless.

Regards,
Eloi

On Monday 27 September 2010 11:43:55 Terry Dontje wrote:
  

So it sounds like coalescing is not your issue and that the problem
has something to do with the queue sizes.  It would be helpful if we
could detect the hdr->tag == 0 issue on the sending side and get at
least a stack trace.  There is something really odd going on here.

--td

Eloi Gaudry wrote:


Hi Terry,

I'm sorry to say that I might have missed a point here.

I've lately been relaunching all previously failing computations with
the message coalescing feature being switched off, and I saw the same
hdr->tag=0 error several times, always during a collective call
(MPI_Comm_create, MPI_Allreduce and MPI_Broadcast, so far). And as
soon as I switched to the peer queue option I was previously using
(--mca btl_openib_receive_queues P,65536,256,192,128 instead of using
--mca btl_openib_use_message_coalescing 0), all computations ran
flawlessly.

As for the reproducer, I've already tried to write something but I
haven't succeeded so far at reproducing the hdr->tag=0 issue with it.

Eloi

On 24/09/2010 18:37, Terry Dontje wrote:
  

Eloi Gaudry wrote:


Terry,

You were right, the error indeed seems to come from the message
coalescing feature. If I turn it off using the "--mca
btl_openib_use_message_coalescing 0", I'm not able to observe the
"hdr->tag=0" error.

There are some trac requests associated to very similar error
(https://svn.open-mpi.org/trac/ompi/search?q=coalescing) but they
are all closed (except
https://svn.open-mpi.org/trac/ompi/ticket/2352 that might be
related), aren't they ? What would you suggest Terry ?
  

Interesting, though it looks to me like the segv in ticket 2352
would have happened on the send side instead of the receive side
like you have.  As to what to do next it would be really nice to
have some sort of reproducer that we can try and debug what is
really going on.  The only other thing to do without a reproducer
is to inspect the code on the send side to figure out what might
make it generate at 0 hdr->tag.  Or maybe instrument the send side
to stop when it is about ready to send a 0 hdr->tag and see if we
can see how the code got there.

I might have some cycles to look at this Monday.

--td



Eloi

On Friday 24 September 2010 16:00:26 Terry Dontje wrote:
  

Eloi Gaudry wrote:


Terry,

No, I haven't tried any other values than P,65536,256,192,128
yet.

The reason 

[OMPI users] Memory affinity

2010-09-27 Thread Gabriele Fatigati
Dear  OpenMPI users,

If Open MPI is compiled with NUMA support, is memory affinity enabled by default?
I ask because I didn't find a standalone memory affinity (or similar) parameter to set to 1.

Thanks a lot.

-- 
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it    Tel: +39 051 6171722

g.fatigati [AT] cineca.it


Re: [OMPI users] Memory affinity

2010-09-27 Thread Gabriele Fatigati
Sorry,

I meant: is memory affinity enabled by default when setting mprocessor_affinity=1 in
a NUMA-enabled Open MPI build?

2010/9/27 Gabriele Fatigati 

> Dear  OpenMPI users,
>
> if OpenMPI is numa-compiled, memory affinity is enabled by default? Because
> I didn't find memory affinity alone ( similar)  parameter to set at 1.
>
> Thanks a lot.
>
> --
> Ing. Gabriele Fatigati
>
> Parallel programmer
>
> CINECA Systems & Tecnologies Department
>
> Supercomputing Group
>
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>
> www.cineca.itTel:   +39 051 6171722
>
> g.fatigati [AT] cineca.it
>



-- 
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it    Tel: +39 051 6171722

g.fatigati [AT] cineca.it


[OMPI users] error on mpiexec

2010-09-27 Thread Kraus Philipp

Hi,

I have compiled Open MPI 1.4.2 and use it with Boost.MPI. I can 
compile and run my first example. If I run it without mpiexec, 
everything works fine. If I run it with mpiexec -np 1 or 2, I get 
messages like:


[node:05126] [[582,0],0] ORTE_ERROR_LOG: Error in file  
ess_hnp_module.c at line 230

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ras_base_open failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS


  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS


[node:05125] [[581,0],0] ORTE_ERROR_LOG: Error in file orted/ 
orted_main.c at line 325
[node:05124] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a  
daemon on the local node in file ess_singleton_module.c at line 381
[node:05124] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a  
daemon on the local node in file ess_singleton_module.c at line 143
[node:05124] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a  
daemon on the local node in file runtime/orte_init.c at line 132



--
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.

OpenMPI and Boost are installed in a directory under /opt and set with  
the environmental variables. I'm using the first example at http://www.boost.org/doc/libs/1_44_0/doc/html/mpi/tutorial.html


I'm not sure whether the problem is the Boost call or the MPI configuration.

Does anyone have some ideas for solving my problem?

Thanks a lot

Phil




Re: [OMPI users] Memory affinity

2010-09-27 Thread Tim Prince

 On 9/27/2010 9:01 AM, Gabriele Fatigati wrote:


if OpenMPI is numa-compiled, memory affinity is enabled by default? 
Because I didn't find memory affinity alone ( similar)  parameter to 
set at 1.



 The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity 
has a useful introduction to affinity.  It's available in a default 
build, but not enabled by default.
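If I read that FAQ right, turning processor affinity on is just a matter of adding an 
MCA parameter such as "--mca mpi_paffinity_alone 1" to the mpirun line (check the exact 
parameter name for your version with ompi_info).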


If you mean something other than this, explanation is needed as part of 
your question.
taskset() or numactl() might be relevant, if you require more detailed 
control.
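For example, outside of Open MPI entirely, "taskset -c 0-3 ./your_app" pins a run to 
cores 0-3, and "numactl --cpunodebind=0 --membind=0 ./your_app" pins both the CPUs and 
the memory allocations to NUMA node 0.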


--
Tim Prince



Re: [OMPI users] Memory affinity

2010-09-27 Thread Gabriele Fatigati
HI Tim,

I have read that link, but I haven't understood whether enabling processor
affinity also enables memory affinity, because it is written that:

"Note that memory affinity support is enabled only when processor
affinity is enabled"

Can I set processor affinity without memory affinity? This is my question.


2010/9/27 Tim Prince 
>
>  On 9/27/2010 9:01 AM, Gabriele Fatigati wrote:
>>
>> if OpenMPI is numa-compiled, memory affinity is enabled by default? Because 
>> I didn't find memory affinity alone ( similar)  parameter to set at 1.
>>
>>
>  The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity has a 
> useful introduction to affinity.  It's available in a default build, but not 
> enabled by default.
>
> If you mean something other than this, explanation is needed as part of your 
> question.
> taskset() or numactl() might be relevant, if you require more detailed 
> control.
>
> --
> Tim Prince
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



--
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it                    Tel:   +39 051 6171722

g.fatigati [AT] cineca.it



Re: [OMPI users] Memory affinity

2010-09-27 Thread Tim Prince

 On 9/27/2010 12:21 PM, Gabriele Fatigati wrote:

HI Tim,

I have read that link, but I haven't understood if enabling processor
affinity are enabled also memory affinity because is written that:

"Note that memory affinity support is enabled only when processor
affinity is enabled"

Can i set processory affinity without memory affinity? This is my question..


2010/9/27 Tim Prince

  On 9/27/2010 9:01 AM, Gabriele Fatigati wrote:

if OpenMPI is numa-compiled, memory affinity is enabled by default? Because I 
didn't find memory affinity alone ( similar)  parameter to set at 1.



  The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity has a 
useful introduction to affinity.  It's available in a default build, but not 
enabled by default.

Memory affinity is implied by processor affinity.  Your system libraries 
are set up so that any memory allocated is made local to the 
processor, if possible.  That's one of the primary benefits of processor 
affinity.  Not being an expert in Open MPI, and in the absence of 
further easily accessible documentation, I assume there's no useful explicit 
way to disable maffinity while using paffinity on platforms other than the 
specified legacy platforms.


--
Tim Prince



Re: [OMPI users] Memory affinity

2010-09-27 Thread David Singleton

On 09/28/2010 06:52 AM, Tim Prince wrote:

On 9/27/2010 12:21 PM, Gabriele Fatigati wrote:

HI Tim,

I have read that link, but I haven't understood if enabling processor
affinity are enabled also memory affinity because is written that:

"Note that memory affinity support is enabled only when processor
affinity is enabled"

Can i set processory affinity without memory affinity? This is my
question..


2010/9/27 Tim Prince

On 9/27/2010 9:01 AM, Gabriele Fatigati wrote:

if OpenMPI is numa-compiled, memory affinity is enabled by default?
Because I didn't find memory affinity alone ( similar) parameter to
set at 1.



The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity
has a useful introduction to affinity. It's available in a default
build, but not enabled by default.


Memory affinity is implied by processor affinity. Your system libraries
are set up so as to cause any memory allocated to be made local to the
processor, if possible. That's one of the primary benefits of processor
affinity. Not being an expert in openmpi, I assume, in the absence of
further easily accessible documentation, there's no useful explicit way
to disable maffinity while using paffinity on platforms other than the
specified legacy platforms.



Memory allocation policy really needs to be independent of processor
binding policy.  The default memory policy (memory affinity) of "attempt
to allocate to the NUMA node of the cpu that made the allocation request
but fallback as needed" is flawed in a number of situations.  This is true
even when MPI jobs are given dedicated access to processors.  A common one is
where the local NUMA node is full of pagecache pages (from the checkpoint
of the last job to complete).  For those sites that support suspend/resume
based scheduling, NUMA nodes will generally contain pages from suspended
jobs. Ideally, the new (suspending) job should suffer a little bit of paging
overhead (pushing out the suspended job) to get ideal memory placement for
the next 6 or whatever hours of execution.

An mbind (MPOL_BIND) policy of binding to the one local NUMA node will not
work in the case of one process requiring more memory than that local NUMA
node.  One scenario is a master-slave where you might want:
  master (rank 0) bound to processor 0 but not memory bound
  slave (rank i) bound to processor i and memory bound to the local memory
of processor i.

They really are independent requirements.

Cheers,
David



[OMPI users] mpi_in_place not working in mpi_allreduce

2010-09-27 Thread David Zhang
Dear all:

I ran this simple Fortran code and got an unexpected result:

!
program reduce
implicit none

include 'mpif.h'

integer :: ierr, rank
real*8 :: send(5)

call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world,rank,ierr)

send = real(rank)

print *, rank,':',send
call mpi_allreduce(MPI_IN_PLACE,send,size(send),mpi_real8,mpi_sum,mpi_comm_world,ierr)
print *, rank,'#',send

call mpi_finalize(ierr)

end program reduce

!

When running with 3 processes

mpirun -np 3 reduce

The result I'm expecting is the sum of all 3 vectors (i.e. 3. in every
entry), but I got this unexpected result:

 0 :   0.  0.  0.  0.  0.
 2 :   2.  2.  2.  2.  2.
 1 :   1.  1.  1.  1.  1.
 0 #   0.  0.  0.  0.  0.
 1 #   0.  0.  0.  0.  0.
 2 #   0.  0.  0.  0.  0.


During compilation and running there were no errors or warnings.  I installed
Open MPI via fink, and I believe fink somehow messed up the installation.
Instead of installing MPI from source (which takes hours on my machine), I
would like to know if there is a better way to find out what the problem
is, so that I could fix my current installation rather than reinstall MPI
from scratch.

-- 
David Zhang
University of California, San Diego


Re: [OMPI users] Memory affinity

2010-09-27 Thread Tim Prince

 On 9/27/2010 2:50 PM, David Singleton wrote:

On 09/28/2010 06:52 AM, Tim Prince wrote:

On 9/27/2010 12:21 PM, Gabriele Fatigati wrote:

HI Tim,

I have read that link, but I haven't understood if enabling processor
affinity are enabled also memory affinity because is written that:

"Note that memory affinity support is enabled only when processor
affinity is enabled"

Can i set processory affinity without memory affinity? This is my
question..


2010/9/27 Tim Prince

On 9/27/2010 9:01 AM, Gabriele Fatigati wrote:

if OpenMPI is numa-compiled, memory affinity is enabled by default?
Because I didn't find memory affinity alone ( similar) parameter to
set at 1.



The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity
has a useful introduction to affinity. It's available in a default
build, but not enabled by default.


Memory affinity is implied by processor affinity. Your system libraries
are set up so as to cause any memory allocated to be made local to the
processor, if possible. That's one of the primary benefits of processor
affinity. Not being an expert in openmpi, I assume, in the absence of
further easily accessible documentation, there's no useful explicit way
to disable maffinity while using paffinity on platforms other than the
specified legacy platforms.



Memory allocation policy really needs to be independent of processor
binding policy.  The default memory policy (memory affinity) of "attempt
to allocate to the NUMA node of the cpu that made the allocation request
but fallback as needed" is flawed in a number of situations.  This is true
even when MPI jobs are given dedicated access to processors.  A common one is
where the local NUMA node is full of pagecache pages (from the checkpoint
of the last job to complete).  For those sites that support suspend/resume
based scheduling, NUMA nodes will generally contain pages from suspended
jobs. Ideally, the new (suspending) job should suffer a little bit of paging
overhead (pushing out the suspended job) to get ideal memory placement for
the next 6 or whatever hours of execution.

An mbind (MPOL_BIND) policy of binding to the one local NUMA node will not
work in the case of one process requiring more memory than that local NUMA
node.  One scenario is a master-slave where you might want:
  master (rank 0) bound to processor 0 but not memory bound
  slave (rank i) bound to processor i and memory bound to the local memory
of processor i.

They really are independent requirements.

Cheers,
David

Interesting; I agree with those of your points on which I have enough 
experience to have an opinion.
However, the original question was not whether it would be desirable to 
have independent memory affinity, but whether it is possible currently 
within openmpi to avoid memory placements being influenced by processor 
affinity.
I have seen the case you mention, where performance of a long job 
suffers because the state of memory from a previous job results in an 
abnormal number of allocations falling over to other NUMA nodes, but I 
don't know the practical solution.


--
Tim Prince



Re: [OMPI users] Running on crashing nodes

2010-09-27 Thread Randolph Pullen
I have successfully used a perl program to start mpirun and record its PID.
The monitor can then watch the output from MPI and terminate the mpirun
command with a series of kills or something if it is having trouble.

One method of doing this is to prefix all legal output from your MPI program
with a known short string; if the monitor does not see this string prefixed on
a line, it can terminate MPI, check the available nodes and recast the job
accordingly.
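Very roughly, the monitor looks something like this (a sketch in C rather than perl; 
the command line and the "OK:" prefix are placeholders for whatever you actually use):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* start the MPI job and read its combined output line by line */
    FILE *job = popen("mpirun -np 4 ./my_app 2>&1", "r");   /* placeholder command */
    if (NULL == job) { perror("popen"); return 1; }

    char line[4096];
    while (fgets(line, sizeof line, job) != NULL) {
        if (0 != strncmp(line, "OK:", 3)) {
            /* a line without the agreed prefix: assume the job is in trouble */
            fprintf(stderr, "monitor: unexpected output, terminating job: %s", line);
            system("pkill -TERM -f './my_app'");   /* crude, but does the job */
            break;
        }
        fputs(line, stdout);
    }
    return pclose(job);
}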
Hope this helps,
Randolph
--- On Fri, 24/9/10, Joshua Hursey  wrote:

From: Joshua Hursey 
Subject: Re: [OMPI users] Running on crashing nodes
To: "Open MPI Users" 
Received: Friday, 24 September, 2010, 10:18 PM

As one of the Open MPI developers actively working on the MPI layer 
stabilization/recover feature set, I don't think we can give you a specific 
timeframe for availability, especially availability in a stable release. Once 
the initial functionality is finished, we will open it up for user testing by 
making a public branch available. After addressing the concerns highlighted by 
public testing, we will attempt to work this feature into the mainline trunk 
and eventual release.

Unfortunately it is difficult to assess the time needed to go through these 
development stages. What I can tell you is that the work to this point on the 
MPI layer is looking promising, and that as soon as we feel that the code is 
ready we will make it available to the public for further testing.

-- Josh

On Sep 24, 2010, at 3:37 AM, Andrei Fokau wrote:

> Ralph, could you tell us when this functionality will be available in the 
> stable version? A rough estimate will be fine.
> 
> 
> On Fri, Sep 24, 2010 at 01:24, Ralph Castain  wrote:
> In a word, no. If a node crashes, OMPI will abort the currently-running job 
> if it had processes on that node. There is no current ability to "ride-thru" 
> such an event.
> 
> That said, there is work being done to support "ride-thru". Most of that is 
> in the current developer's code trunk, and more is coming, but I wouldn't 
> consider it production-quality just yet.
> 
> Specifically, the code that does what you specify below is done and works. It 
> is recovery of the MPI job itself (collectives, lost messages, etc.) that 
> remains to be completed.
> 
> 
> On Thu, Sep 23, 2010 at 7:22 AM, Andrei Fokau  
> wrote:
> Dear users,
> 
> Our cluster has a number of nodes which have high probability to crash, so it 
> happens quite often that calculations stop due to one node getting down. May 
> be you know if it is possible to block the crashed nodes during run-time when 
> running with OpenMPI? I am asking about principal possibility to program such 
> behavior. Does OpenMPI allow such dynamic checking? The scheme I am curious 
> about is the following:
> 
> 1. A code starts its tasks via mpirun on several nodes
> 2. At some moment one node gets down
> 3. The code realizes that the node is down (the results are lost) and 
> excludes it from the list of nodes to run its tasks on
> 4. At later moment the user restarts the crashed node
> 5. The code notices that the node is up again, and puts it back to the list 
> of active nodes
> 
> 
> Regards,
> Andrei
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey

