Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Nathan Hjelm

I tested my grdma mpool with the openib btl and IMB Alltoall/Alltoallv on a 
system that consistently hangs. If I give the connection module the ability to 
evict from the lru, grdma prevents both the out-of-registered-memory hang AND 
the problems creating QPs (due to exhaustion of registered memory).
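The eviction loop described above can be sketched roughly as follows. This is an illustrative model only: `reg_try` and `lru_evict_one` are stand-in names with counters in place of real verbs/uGNI registration calls, not actual mpool or openib symbols.

```c
#include <assert.h>
#include <stdbool.h>

#define REG_LIMIT 4          /* pretend the NIC can hold only 4 registrations */

static int active_regs = 0;  /* registrations currently pinned */
static int lru_depth   = 0;  /* idle registrations eligible for eviction */

/* Stand-in for ibv_reg_mr()/GNI_MemRegister(): fails at the limit. */
static bool reg_try(void) {
    if (active_regs >= REG_LIMIT) return false;
    active_regs++;
    return true;
}

/* Evict one idle registration from the lru, freeing a slot. */
static bool lru_evict_one(void) {
    if (lru_depth == 0) return false;
    lru_depth--;
    active_regs--;
    return true;
}

/* Register with eviction: keep evicting until registration succeeds
 * or the lru is empty (the case that previously led to a hang). */
static bool register_with_eviction(void) {
    while (!reg_try()) {
        if (!lru_evict_one()) return false;  /* nothing left to evict */
    }
    return true;
}
```

Giving the connection module access to the same loop is presumably why both failure modes (send-path registration and QP creation) disappear together.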

-Nathan


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Nathan Hjelm



On Fri, 9 Mar 2012, George Bosilca wrote:



On Mar 9, 2012, at 14:23 , Nathan Hjelm wrote:


BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t 
instead of defining mca_mpool_blah_resources_t. The current design makes it 
impossible to support more than one mpool in a btl. I can delete a bunch of 
code if I can make a btl fall back on the rdma mpool if leave_pinned is not set.


I guess you can name them as you like as long as you do the right cast to avoid 
compiler complaints.

Why can't you support multiple mpools in the same BTL?


Because if I include mpool_rdma.h and mpool_grdma.h (or mpool_sm.h) from the 
same file we get a name collision since all mpool components define 
mca_mpool_base_resources_t.
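The collision Nathan describes is purely a naming one: every mpool header defines the same typedef name, so two mpool headers cannot be included in one translation unit. A per-component naming scheme (the names and fields below are hypothetical) avoids it:

```c
#include <assert.h>
#include <stddef.h>

/* Today every header does, in effect:
 *   typedef struct { ... } mca_mpool_base_resources_t;   // in mpool_rdma.h
 *   typedef struct { ... } mca_mpool_base_resources_t;   // in mpool_sm.h
 * so including both headers is a redefinition error.  Renaming per
 * component (illustrative names below) removes the clash: */

typedef struct mca_mpool_rdma_resources_t {
    void *reg_data;              /* e.g. a device protection domain */
} mca_mpool_rdma_resources_t;

typedef struct mca_mpool_sm_resources_t {
    size_t size;                 /* e.g. shared-memory segment size */
} mca_mpool_sm_resources_t;
```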

-Nathan


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread George Bosilca

On Mar 9, 2012, at 14:23 , Nathan Hjelm wrote:

> BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t 
> instead of defining mca_mpool_blah_resources_t. The current design makes it 
> impossible to support more than one mpool in a btl. I can delete a bunch of 
> code if I can make a btl fall back on the rdma mpool if leave_pinned is not 
> set.

I guess you can name them as you like as long as you do the right cast to avoid 
compiler complaints.

Why can't you support multiple mpools in the same BTL?

  george.




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Rolf vandeVaart
[Comment at bottom]
>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Nathan Hjelm
>Sent: Friday, March 09, 2012 2:23 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
>
>
>
>On Fri, 9 Mar 2012, Jeffrey Squyres wrote:
>
>> On Mar 9, 2012, at 1:32 PM, Nathan Hjelm wrote:
>>
>>> An mpool that is aware of local processes' lrus will solve the problem in
>most cases (all that I have seen)
>>
>> I agree -- don't let words in my emails make you think otherwise.  I think 
>> this
>will fix "most" problems, but undoubtedly, some will still occur.
>>
>> What's your timeline for having this ready -- should it go to 1.5.5, or 1.6?
>>
>> More specifically: if it's imminent, and can go to v1.5, then the openib
>message is irrelevant and should not be used (and backed out of the trunk).  If
>it's going to take a little bit, I'm ok leaving the message in v1.5.5 for now.
>
>I wrote the prototype yesterday (after finding that limiting the lru doesn't
>work for uGNI-- @256 pes we could only register ~1400 items instead of the
>3600 max we saw @128). I should have a version ready for review next week
>and a final version by the end of the month.
>
>
>BTW, can anyone tell me why each mpool defines
>mca_mpool_base_resources_t instead of defining
>mca_mpool_blah_resources_t. The current design makes it impossible to
>support more than one mpool in a btl. I can delete a bunch of code if I can
>make a btl fall back on the rdma mpool if leave_pinned is not set.
>
>-Nathan

I ran into this same issue about wanting to use more than one mpool in a btl.  
I expected that there might be a base resource structure that was extended by 
each mpool.  I talked with Jeff, and he told me (if I recall correctly) that the 
reason was that there was no common information in any of the 
mca_mpool_base_resources_t structures, so there was no need for a base 
structure.  I do not think there is any reason we cannot do it as you suggest.

[The one other place I have seen it done like this in the library is the 
mca_btl_base_endpoint_t which is defined differently for each BTL]
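The extension pattern Rolf expected (and which mca_btl_base_endpoint_t approximates only by convention) might look like this sketch. The field names are hypothetical, since the actual resources structs share no common members today:

```c
#include <assert.h>
#include <string.h>

/* A common base embedded as the first member of each component's
 * resources struct; C guarantees the first member sits at offset 0,
 * so a pointer to the derived struct may be cast to the base type. */

typedef struct mca_mpool_base_resources_t {
    const char *pool_name;              /* whatever turns out to be common */
} mca_mpool_base_resources_t;

typedef struct mca_mpool_rdma_resources_t {
    mca_mpool_base_resources_t super;   /* base must come first */
    void *reg_data;                     /* component-specific fields follow */
} mca_mpool_rdma_resources_t;
```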





Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Shamis, Pavel
>> Depending on the timing, this might go to 1.6 (1.5.5 has waited for too 
>> long, and this is not a regression).  Keep in mind that the problem has been 
>> around for *a long, long time*, which is why I approved the diag message 
>> (i.e., because a real solution is still nowhere in sight).  The real issue 
>> is that we can still run out of registered memory *and there is nothing left 
>> to deregister*.  The real solution there is that the PML should fall back to 
>> a different protocol, but I'm told that doesn't happen and will require a 
>> bunch of work to make work properly.
> 
> An mpool that is aware of local processes' lrus will solve the problem in 
> most cases (all that I have seen) but yes, we need to rework the pml to 
> handle the remaining cases. There are two things that need to be changed 
> (from what I can tell):
> 
>  1) allow rget to fall back to send/put depending on the failure (I have 
> fallback on put implemented in my branch-- and in my btl).
>  2) need to devise new criteria on when we should progress the rdma_pending 
> list to avoid deadlock.
> 
> #1 is fairly simple and I haven't given much thought to #2.


But #1 will be a good start in the right direction. Agreed about #2.

> 
> -Nathan
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Nathan Hjelm



On Fri, 9 Mar 2012, Jeffrey Squyres wrote:


On Mar 9, 2012, at 1:32 PM, Nathan Hjelm wrote:


An mpool that is aware of local processes' lrus will solve the problem in most 
cases (all that I have seen)


I agree -- don't let words in my emails make you think otherwise.  I think this will fix 
"most" problems, but undoubtedly, some will still occur.

What's your timeline for having this ready -- should it go to 1.5.5, or 1.6?

More specifically: if it's imminent, and can go to v1.5, then the openib 
message is irrelevant and should not be used (and backed out of the trunk).  If 
it's going to take a little bit, I'm ok leaving the message in v1.5.5 for now.


I wrote the prototype yesterday (after finding that limiting the lru doesn't 
work for uGNI-- @256 pes we could only register ~1400 items instead of the 3600 
max we saw @128). I should have a version ready for review next week and a 
final version by the end of the month.


BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t 
instead of defining mca_mpool_blah_resources_t. The current design makes it 
impossible to support more than one mpool in a btl. I can delete a bunch of 
code if I can make a btl fall back on the rdma mpool if leave_pinned is not set.

-Nathan


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Jeffrey Squyres
On Mar 9, 2012, at 1:32 PM, Nathan Hjelm wrote:

> An mpool that is aware of local processes' lrus will solve the problem in 
> most cases (all that I have seen)

I agree -- don't let words in my emails make you think otherwise.  I think this 
will fix "most" problems, but undoubtedly, some will still occur.

What's your timeline for having this ready -- should it go to 1.5.5, or 1.6?

More specifically: if it's imminent, and can go to v1.5, then the openib 
message is irrelevant and should not be used (and backed out of the trunk).  If 
it's going to take a little bit, I'm ok leaving the message in v1.5.5 for now.

> but yes, we need to rework the pml to handle the remaining cases. There are 
> two things that need to be changed (from what I can tell):
> 
> 1) allow rget to fall back to send/put depending on the failure (I have 
> fallback on put implemented in my branch-- and in my btl).
> 2) need to devise new criteria on when we should progress the rdma_pending 
> list to avoid deadlock.
> 
> #1 is fairly simple and I haven't given much thought to #2.
> 
> -Nathan


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Nathan Hjelm



On Fri, 9 Mar 2012, Jeffrey Squyres wrote:


On Mar 9, 2012, at 1:14 PM, George Bosilca wrote:


The hang occurs because there is nothing on the lru to deregister and 
ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the 
request on its rdma pending list and continues. If any message comes in the 
rdma pending list is progressed. If not it hangs indefinitely!


Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood, 
then there _is_ a fix, and that fix should be the target of any efforts.


The fix that Nathan proposes is not a complete fix -- we can still run out of 
memory and hang.  You should read the open tickets and prior emails we have 
sent about this -- Nathan's fix merely delays when we will run out of 
registered memory.  It does not solve the underlying problem.


Correct.


In general I have found the underlying cause of the hang is an imbalance of 
registrations between processes on a node, i.e., the hung process has an empty 
lru but other processes could deregister. I am working on a new mpool (grdma) to 
handle the imbalance. The new mpool will allow a process to request that one of 
its peers deregister from its lru if possible. I have a working proof-of-concept 
implementation that uses a posix shmem segment and a progress function to handle 
signaling and deregistering. With it I no longer see hangs with IMB 
Alltoall/Alltoallv on uGNI (without putting an artificial limit on the number of 
registrations). I will test the mpool on infiniband later today.
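A minimal sketch of the signaling just described, under the assumption that one flag per local process lives in a POSIX shared-memory segment and is polled from a progress function. All names and the layout are illustrative, not the actual grdma code:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_LOCAL_PROCS 16

typedef struct {
    atomic_int evict_requested[MAX_LOCAL_PROCS];  /* one flag per local rank */
} grdma_shared_t;

static grdma_shared_t *seg;

/* Map (and, in the first process, create) the shared segment. */
static int grdma_shared_attach(const char *name) {
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, sizeof(grdma_shared_t)) != 0) { close(fd); return -1; }
    seg = mmap(NULL, sizeof(grdma_shared_t), PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);
    close(fd);
    return seg == MAP_FAILED ? -1 : 0;
}

/* A starved process asks local rank `peer` to shrink its lru. */
static void grdma_request_evict(int peer) {
    atomic_store(&seg->evict_requested[peer], 1);
}

/* Called from the progress loop: did a peer ask us to evict? */
static bool grdma_check_evict(int my_rank) {
    return atomic_exchange(&seg->evict_requested[my_rank], 0) != 0;
}
```

In the real mpool the check would presumably be followed by deregistering idle lru entries, plus some back-pressure so the same peer is not asked repeatedly.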


If a solution already exists I don't see why we have to have the message code. 
Based on its urgency, I'm confident your patch will make its way into the 1.5 
quite easily.



Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long, 
and this is not a regression).  Keep in mind that the problem has been around 
for *a long, long time*, which is why I approved the diag message (i.e., 
because a real solution is still nowhere in sight).  The real issue is that we 
can still run out of registered memory *and there is nothing left to 
deregister*.  The real solution there is that the PML should fall back to a 
different protocol, but I'm told that doesn't happen and will require a bunch 
of work to make work properly.


An mpool that is aware of local processes' lrus will solve the problem in most 
cases (all that I have seen), but yes, we need to rework the pml to handle the 
remaining cases. There are two things that need to be changed (from what I can 
tell):

 1) allow rget to fall back to send/put depending on the failure (I have 
fallback on put implemented in my branch-- and in my btl).
 2) need to devise new criteria on when we should progress the rdma_pending 
list to avoid deadlock.

#1 is fairly simple and I haven't given much thought to #2.
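Change #1 amounts to a protocol downgrade at request start. A hedged sketch follows; the probe functions and names are stand-ins, not actual ob1 symbols:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { PROTO_RGET, PROTO_PUT, PROTO_SEND } proto_t;

/* Stand-ins for "could the registration/descriptor for this protocol
 * be obtained?" -- in ob1 these would be real btl prepare/get calls. */
static bool rget_ok, put_ok;
static bool try_start_rget(void) { return rget_ok; }
static bool try_start_put(void)  { return put_ok;  }

/* Pick the best protocol that can actually start right now. */
static proto_t start_request(void) {
    if (try_start_rget()) return PROTO_RGET;
    if (try_start_put())  return PROTO_PUT;   /* the fallback described above */
    return PROTO_SEND;                        /* copy in/out: always possible */
}
```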

-Nathan


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Jeffrey Squyres
On Mar 9, 2012, at 1:14 PM, George Bosilca wrote:

>> The hang occurs because there is nothing on the lru to deregister and 
>> ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts 
>> the request on its rdma pending list and continues. If any message comes in 
>> the rdma pending list is progressed. If not it hangs indefinitely!
> 
> Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood, 
> then there _is_ a fix, and that fix should be the target of any efforts.

The fix that Nathan proposes is not a complete fix -- we can still run out of 
memory and hang.  You should read the open tickets and prior emails we have 
sent about this -- Nathan's fix merely delays when we will run out of 
registered memory.  It does not solve the underlying problem.

>> In general I have found the underlying cause of the hang is due to an 
>> imbalance of registrations between processes on a node, i.e., the hung process 
>> has an empty lru but other processes could deregister. I am working on a new 
>> mpool (grdma) to handle the imbalance. The new mpool will allow a process to 
>> request that one of its peers deregister from its lru if possible. I have a 
>> working proof of concept implementation that uses a posix shmem segment and 
>> a progress function to handle signaling and deregistering. With it I no 
>> longer see hangs with IMB Alltoall/Alltoallv on uGNI (without putting an 
>> artificial limit on the number of registrations). I will test the mpool on 
>> infiniband later today.
> 
> If a solution already exists I don't see why we have to have the message 
> code. Based on its urgency, I'm confident your patch will make its way into 
> the 1.5 quite easily.


Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long, 
and this is not a regression).  Keep in mind that the problem has been around 
for *a long, long time*, which is why I approved the diag message (i.e., 
because a real solution is still nowhere in sight).  The real issue is that we 
can still run out of registered memory *and there is nothing left to 
deregister*.  The real solution there is that the PML should fall back to a 
different protocol, but I'm told that doesn't happen and will require a bunch 
of work to make work properly.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread George Bosilca

On Mar 9, 2012, at 12:59 , Nathan Hjelm wrote:

> Not exactly, the PML invokes the mpool which invokes the registration 
> function. If registration fails the mpool will deregister from its lru (if 
> possible) and try again. So, it is not an error if ibv_reg_mr fails unless it 
> fails because the process is starved of registered memory (or has truly run out).
> 
> The hang occurs because there is nothing on the lru to deregister and 
> ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the 
> request on its rdma pending list and continues. If any message comes in the 
> rdma pending list is progressed. If not it hangs indefinitely!

Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood, 
then there _is_ a fix, and that fix should be the target of any efforts.

> In general I have found the underlying cause of the hang is due to an 
> imbalance of registrations between processes on a node, i.e., the hung process 
> has an empty lru but other processes could deregister. I am working on a new 
> mpool (grdma) to handle the imbalance. The new mpool will allow a process to 
> request that one of its peers deregister from its lru if possible. I have a 
> working proof of concept implementation that uses a posix shmem segment and a 
> progress function to handle signaling and deregistering. With it I no longer 
> see hangs with IMB Alltoall/Alltoallv on uGNI (without putting an artificial 
> limit on the number of registrations). I will test the mpool on infiniband 
> later today.

If a solution already exists I don't see why we have to have the message code. 
Based on its urgency, I'm confident your patch will make its way into the 1.5 
quite easily.

  george.

> 
> -Nathan
> 
> On Fri, 9 Mar 2012, Jeffrey Squyres wrote:
> 
>> George --
>> 
>> I believe that this is the subject of a few long-standing tickets (i.e., 
>> what to do when running out of registered memory -- right now, we hang, for 
>> a few reasons).  I think that this is Mellanox's attempt to at least warn 
>> the user that we have run out of registered memory, and will therefore hang.
>> 
>> Once the hangs have been fixed, I'm assuming this message can be removed.
>> 
>> Note, too, that this is in the BTL registration code (openib_reg_mr), not in 
>> the directly-invoked-by-the-PML code.  So it's the mpool's fault -- not the 
>> PML's fault.
>> 
>> 
>> 
>> On Mar 6, 2012, at 10:05 AM, George Bosilca wrote:
>> 
>>> I didn't check the code thoroughly, but OMPI_ERR_OUT_OF_RESOURCES is not an 
>>> error. If the registration returns out of resources, the BTL will return 
>>> OUT_OF_RESOURCE (as an example via mca_btl_openib_prepare_src). At the 
>>> upper level, the PML (in the mca_pml_ob1_send_request_start function) 
>>> intercepts it and inserts the request into a pending list. Later on this 
>>> pending list will be examined and the resource request re-issued.
>>> 
>>> Why do we need to trigger a BTL_ERROR for OUT_OF_RESOURCES?
>>> 
>>>  george.
>>> 
>>> On Mar 6, 2012, at 09:48 , Jeffrey Squyres wrote:
>>> 
 Mike --
 
 I would make this a bit better of an error.  I.e., use orte_show_help(), 
 so you can explain the issue more, and also remove all duplicates (i.e., 
 if it fails to register multiple times).
 
 
 On Mar 6, 2012, at 8:25 AM, mi...@osl.iu.edu wrote:
 
> Author: miked
> Date: 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> New Revision: 26106
> URL: https://svn.open-mpi.org/trac/ompi/changeset/26106
> 
> Log:
> print error which is ignored on upper layer
> Text files modified:
> trunk/ompi/mca/btl/openib/btl_openib_component.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
> 
> Modified: trunk/ompi/mca/btl/openib/btl_openib_component.c
> ==
> --- trunk/ompi/mca/btl/openib/btl_openib_component.c (original)
> +++ trunk/ompi/mca/btl/openib/btl_openib_component.c 2012-03-06 09:25:56 
> EST (Tue, 06 Mar 2012)
> @@ -569,6 +569,8 @@
>   openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, access_flag);
> 
>   if (NULL == openib_reg->mr) {
> +BTL_ERROR(("%s: error pinning openib memory errno says %s",
> +   __func__, strerror(errno)));
>   return OMPI_ERR_OUT_OF_RESOURCE;
>   }
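Jeff's suggestion, a full explanation via orte_show_help() with duplicate suppression, can be modeled as below. show_help_once() is a toy stand-in and the topic strings are hypothetical; the real call would reference a help text file shipped with the component.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static const char *seen[16];   /* topics already reported */
static int n_seen;

/* Print a full explanation the first time a topic fires; suppress
 * duplicates, the way orte_show_help() aggregates across the job. */
static int show_help_once(const char *topic) {
    for (int i = 0; i < n_seen; i++)
        if (strcmp(seen[i], topic) == 0)
            return 0;                        /* duplicate: suppressed */
    if (n_seen < 16) seen[n_seen++] = topic;
    printf("ERROR [%s]: failed to register memory; the job may hang.\n", topic);
    return 1;
}
```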
> 
> ___
> svn-full mailing list
> svn-f...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/svn-full
 
 
 --
 Jeff Squyres
 jsquy...@cisco.com
 For corporate legal information go to: 
 http://www.cisco.com/web/about/doing_business/legal/cri/
 
 
 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Nathan Hjelm

Not exactly, the PML invokes the mpool which invokes the registration function. 
If registration fails the mpool will deregister from its lru (if possible) and 
try again. So, it is not an error if ibv_reg_mr fails unless it fails because 
the process is starved of registered memory (or has truly run out).

The hang occurs because there is nothing on the lru to deregister and 
ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the 
request on its rdma pending list and continues. If any message comes in the 
rdma pending list is progressed. If not it hangs indefinitely!

In general I have found the underlying cause of the hang is an imbalance of 
registrations between processes on a node, i.e., the hung process has an empty 
lru but other processes could deregister. I am working on a new mpool (grdma) to 
handle the imbalance. The new mpool will allow a process to request that one of 
its peers deregister from its lru if possible. I have a working proof-of-concept 
implementation that uses a posix shmem segment and a progress function to handle 
signaling and deregistering. With it I no longer see hangs with IMB 
Alltoall/Alltoallv on uGNI (without putting an artificial limit on the number of 
registrations). I will test the mpool on infiniband later today.

-Nathan

On Fri, 9 Mar 2012, Jeffrey Squyres wrote:


George --

I believe that this is the subject of a few long-standing tickets (i.e., what 
to do when running out of registered memory -- right now, we hang, for a few 
reasons).  I think that this is Mellanox's attempt to at least warn the user 
that we have run out of registered memory, and will therefore hang.

Once the hangs have been fixed, I'm assuming this message can be removed.

Note, too, that this is in the BTL registration code (openib_reg_mr), not in 
the directly-invoked-by-the-PML code.  So it's the mpool's fault -- not the 
PML's fault.



On Mar 6, 2012, at 10:05 AM, George Bosilca wrote:


I didn't check the code thoroughly, but OMPI_ERR_OUT_OF_RESOURCES is not an 
error. If the registration returns out of resources, the BTL will return 
OUT_OF_RESOURCE (as an example via mca_btl_openib_prepare_src). At the 
upper level, the PML (in the mca_pml_ob1_send_request_start function) intercepts 
it and inserts the request into a pending list. Later on this pending list will 
be examined and the resource request re-issued.

Why do we need to trigger a BTL_ERROR for OUT_OF_RESOURCES?

  george.

On Mar 6, 2012, at 09:48 , Jeffrey Squyres wrote:


Mike --

I would make this a bit better of an error.  I.e., use orte_show_help(), so you 
can explain the issue more, and also remove all duplicates (i.e., if it fails 
to register multiple times).


On Mar 6, 2012, at 8:25 AM, mi...@osl.iu.edu wrote:


Author: miked
Date: 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
New Revision: 26106
URL: https://svn.open-mpi.org/trac/ompi/changeset/26106

Log:
print error which is ignored on upper layer
Text files modified:
 trunk/ompi/mca/btl/openib/btl_openib_component.c | 2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

Modified: trunk/ompi/mca/btl/openib/btl_openib_component.c
==
--- trunk/ompi/mca/btl/openib/btl_openib_component.c (original)
+++ trunk/ompi/mca/btl/openib/btl_openib_component.c 2012-03-06 09:25:56 EST 
(Tue, 06 Mar 2012)
@@ -569,6 +569,8 @@
   openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, access_flag);

   if (NULL == openib_reg->mr) {
+BTL_ERROR(("%s: error pinning openib memory errno says %s",
+   __func__, strerror(errno)));
   return OMPI_ERR_OUT_OF_RESOURCE;
   }

___
svn-full mailing list
svn-f...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn-full



--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/





Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Jeffrey Squyres
George --

I believe that this is the subject of a few long-standing tickets (i.e., what 
to do when running out of registered memory -- right now, we hang, for a few 
reasons).  I think that this is Mellanox's attempt to at least warn the user 
that we have run out of registered memory, and will therefore hang.

Once the hangs have been fixed, I'm assuming this message can be removed.

Note, too, that this is in the BTL registration code (openib_reg_mr), not in 
the directly-invoked-by-the-PML code.  So it's the mpool's fault -- not the 
PML's fault.



On Mar 6, 2012, at 10:05 AM, George Bosilca wrote:

> I didn't check the code thoroughly, but OMPI_ERR_OUT_OF_RESOURCES is not an 
> error. If the registration returns out of resources, the BTL will return 
> OUT_OF_RESOURCE (as an example via mca_btl_openib_prepare_src). At the 
> upper level, the PML (in the mca_pml_ob1_send_request_start function) 
> intercepts it and inserts the request into a pending list. Later on this 
> pending list will be examined and the resource request re-issued.
> 
> Why do we need to trigger a BTL_ERROR for OUT_OF_RESOURCES?
> 
>   george.
> 
> On Mar 6, 2012, at 09:48 , Jeffrey Squyres wrote:
> 
> > Mike --
> >
> > I would make this a bit better of an error.  I.e., use orte_show_help(), so 
> > you can explain the issue more, and also remove all duplicates (i.e., if it 
> > fails to register multiple times).
> >
> >
> > On Mar 6, 2012, at 8:25 AM, mi...@osl.iu.edu wrote:
> >
> >> Author: miked
> >> Date: 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> >> New Revision: 26106
> >> URL: https://svn.open-mpi.org/trac/ompi/changeset/26106
> >>
> >> Log:
> >> print error which is ignored on upper layer
> >> Text files modified:
> >>  trunk/ompi/mca/btl/openib/btl_openib_component.c | 2 ++   
> >>   
> >>  1 files changed, 2 insertions(+), 0 deletions(-)
> >>
> >> Modified: trunk/ompi/mca/btl/openib/btl_openib_component.c
> >> ==
> >> --- trunk/ompi/mca/btl/openib/btl_openib_component.c (original)
> >> +++ trunk/ompi/mca/btl/openib/btl_openib_component.c 2012-03-06 09:25:56 
> >> EST (Tue, 06 Mar 2012)
> >> @@ -569,6 +569,8 @@
> >>openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, access_flag);
> >>
> >>if (NULL == openib_reg->mr) {
> >> +BTL_ERROR(("%s: error pinning openib memory errno says %s",
> >> +   __func__, strerror(errno)));
> >>return OMPI_ERR_OUT_OF_RESOURCE;
> >>}
> >>
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to: 
> > http://www.cisco.com/web/about/doing_business/legal/cri/


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] poor btl sm latency

2012-03-09 Thread Matthias Jurenz
I just made an interesting observation:

When binding the processes to two neighboring cores (L2 sharing) NetPIPE shows 
*sometimes* pretty good results: ~0.5us

$ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 
10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0
using object #0 depth 6 below cpuset 0x,0x
using object #1 depth 6 below cpuset 0x,0x
adding 0x0001 to 0x0
adding 0x0001 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x0001
adding 0x0002 to 0x0
adding 0x0002 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x0002
Using no perturbations

0: n035
Using no perturbations

1: n035
Now starting the main loop
  0:   1 bytes 10 times -->  6.01 Mbps in   1.27 usec
  1:   2 bytes 10 times --> 12.04 Mbps in   1.27 usec
  2:   3 bytes 10 times --> 18.07 Mbps in   1.27 usec
  3:   4 bytes 10 times --> 24.13 Mbps in   1.26 usec

$ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 
10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0
using object #0 depth 6 below cpuset 0x,0x
adding 0x0001 to 0x0
adding 0x0001 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x0001
using object #1 depth 6 below cpuset 0x,0x
adding 0x0002 to 0x0
adding 0x0002 to 0x0
assuming the command starts at ./NPmpi_ompi1.5.5
binding on cpu set 0x0002
Using no perturbations

0: n035
Using no perturbations

1: n035
Now starting the main loop
  0:   1 bytes 10 times --> 12.96 Mbps in   0.59 usec
  1:   2 bytes 10 times --> 25.78 Mbps in   0.59 usec
  2:   3 bytes 10 times --> 38.62 Mbps in   0.59 usec
  3:   4 bytes 10 times --> 52.88 Mbps in   0.58 usec

I can reproduce that approximately every tenth run.

When binding the processes to cores with exclusive L2 caches (e.g. cores 0 and 
2) I get constant latencies of ~1.1us.

Matthias

On Monday 05 March 2012 09:52:39 Matthias Jurenz wrote:
> Here the SM BTL parameters:
> 
> $ ompi_info --param btl sm
> MCA btl: parameter "btl_base_verbose" (current value: <0>, data source:
> default value) Verbosity level of the BTL framework
> MCA btl: parameter "btl" (current value: , data source:
> file
> [/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf])
> Default selection set of components for the btl framework ( means
> use all components that can be found)
> MCA btl: information "btl_sm_have_knem_support" (value: <1>, data source:
> default value) Whether this component supports the knem Linux kernel module
> or not
> MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source:
> default value) Whether knem support is desired or not (negative = try to
> enable knem support, but continue even if it is not available, 0 = do not
> enable knem support, positive = try to enable knem support and fail if it
> is not available)
> MCA btl: parameter "btl_sm_knem_dma_min" (current value: <0>, data source:
> default value) Minimum message size (in bytes) to use the knem DMA mode;
> ignored if knem does not support DMA mode (0 = do not use the knem DMA
> mode) MCA btl: parameter "btl_sm_knem_max_simultaneous" (current value:
> <0>, data source: default value) Max number of simultaneous ongoing knem
> operations to support (0 = do everything synchronously, which probably
> gives the best large message latency; >0 means to do all operations
> asynchronously, which supports better overlap for simultaneous large
> message sends)
> MCA btl: parameter "btl_sm_free_list_num" (current value: <8>, data source:
> default value)
> MCA btl: parameter "btl_sm_free_list_max" (current value: <-1>, data
> source: default value)
> MCA btl: parameter "btl_sm_free_list_inc" (current value: <64>, data
> source: default value)
> MCA btl: parameter "btl_sm_max_procs" (current value: <-1>, data source:
> default value)
> MCA btl: parameter "btl_sm_mpool" (current value: , data source:
> default value)
> MCA btl: parameter "btl_sm_fifo_size" (current value: <4096>, data source:
> default value)
> MCA btl: parameter "btl_sm_num_fifos" (current value: <1>, data source:
> default value)
> MCA btl: parameter "btl_sm_fifo_lazy_free" (current value: <120>, data
> source: default value)
> MCA btl: parameter "btl_sm_sm_extra_procs" (current value: <0>, data
> source: default value)
> MCA btl: parameter "btl_sm_exclusivity" (current value: <65535>, data
> source: default value) BTL exclusivity (must be >= 0)
> MCA btl: parameter "btl_sm_flags" (current value: <5>, data source: default
> value) BTL bit flags (general flags: SEND=1, PUT=2, GET=4, SEND_INPLACE=8,
> RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only used by the "dr" PML
> (ignored by others): ACK=16, CHECKSUM=32, RDMA_COMPLETION=128; flags

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26118

2012-03-09 Thread Josh Hursey
Fixed in r26122. I tested locally with the ibm test suite, and it looks
good. MTT should highlight if there are any other issues - but I doubt
there will be.

-- Josh

On Thu, Mar 8, 2012 at 5:16 PM, Josh Hursey  wrote:

> Good point (I did not even look at ompi_comm_compare, I was using this for
> something else). I'll take a pass at converting the ompi_comm_compare to
> use the ompi_group_compare functionality - it is good code reuse.
>
> Thanks,
> Josh
>
>
> On Thu, Mar 8, 2012 at 4:08 PM, George Bosilca wrote:
>
>> Josh,
>>
>> Open MPI already has a similar function in the communicator part, a
>> function that is not exposed to the upper layer. I think that using the
>> code in ompi_comm_compare (the second part, which compares the groups) is a
>> sound approach. Moreover, if we now have an ompi_group_compare function you
>> should use it in ompi_comm_compare to ease the readability of the code.
>>
>>  Regards,
>>george.
>>
>>
>>
>> On Mar 8, 2012, at 16:57 , jjhur...@osl.iu.edu wrote:
>>
>> > Author: jjhursey
>> > Date: 2012-03-08 16:57:45 EST (Thu, 08 Mar 2012)
>> > New Revision: 26118
>> > URL: https://svn.open-mpi.org/trac/ompi/changeset/26118
>> >
>> > Log:
>> > Abstract MPI_Group_compare to an OMPI function for internal use (point
>> the MPI interface to the internal function).
>>
>>
>>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey


Re: [OMPI devel] MCA BTL Fragment lists

2012-03-09 Thread George Bosilca

On Mar 9, 2012, at 08:38 , Alex Margolin wrote:

> Hi,
> 
> I'm implementing a new BTL component, and
> 
> 1. I read the TCP code and ran into the three fragment lists:
> 
>/* free list of fragment descriptors */
>ompi_free_list_t tcp_frag_eager;
>ompi_free_list_t tcp_frag_max;
>ompi_free_list_t tcp_frag_user;
> 
> I've looked it up, and found that the documentation for OpenIB refers to the 
> eager term as (in short) the first chunk of a long message, after which the 
> buffer is registered and in the meanwhile chunks from the end of the buffer 
> (beyond a limit much higher than the eager limit) are sent. I didn't find any 
> references relevant to plain TCP. I'm not sure I understand how this is 
> applicable with TCP (and I've seen it in other components as well). For a 
> long message - why would I treat chunks separately?

An eager fragment can be received by the peer eagerly (this means without the 
corresponding receive posted). This is not the case for larger fragments.
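For context, the way a BTL typically routes a fragment to one of the three lists can be sketched as follows; the limits and routing rule are illustrative, not the TCP BTL's actual defaults:

```c
#include <assert.h>
#include <stddef.h>

#define EAGER_LIMIT  4096   /* fits in one eagerly-delivered message */
#define MAX_SEND    65536   /* largest pipelined copy-in/copy-out chunk */

typedef enum { FRAG_EAGER, FRAG_MAX, FRAG_USER } frag_list_t;

/* Eager frags may arrive before the receive is posted; max frags carry
 * larger pipelined chunks; user frags describe user memory directly
 * (zero-copy), which mainly pays off when the transport can do RDMA. */
static frag_list_t pick_frag_list(size_t size, int rdma_capable) {
    if (rdma_capable && size > MAX_SEND) return FRAG_USER;
    if (size <= EAGER_LIMIT)             return FRAG_EAGER;
    return FRAG_MAX;
}
```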

> In the TCP BTL code, when the fragment is created - shorter chunks are sent 
> to eager while the rest are sent to max. Were the two lists treated 
> differently?
> 
> Thanks,
> Alex
> 
> P.S. what is the role of mca_btl_*_component_control()?

Amazing, that's an archeological piece of Open MPI history. Fixed in r26121.

  george.





[OMPI devel] MCA BTL Fragment lists

2012-03-09 Thread Alex Margolin

Hi,

I'm implementing a new BTL component, and

1. I read the TCP code and ran into the three fragment lists:

/* free list of fragment descriptors */
ompi_free_list_t tcp_frag_eager;
ompi_free_list_t tcp_frag_max;
ompi_free_list_t tcp_frag_user;

I've looked it up, and found that the documentation for OpenIB refers to 
the eager term as (in short) the first chunk of a long message, after 
which the buffer is registered and in the meanwhile chunks from the end 
of the buffer (beyond a limit much higher than the eager limit) are 
sent. I didn't find any references relevant to plain TCP. I'm not sure I 
understand how this is applicable with TCP (and I've seen it in other 
components as well). For a long message - why would I treat chunks 
separately?
In the TCP BTL code, when the fragment is created - shorter chunks are 
sent to eager while the rest are sent to max. Were the two lists 
treated differently?


Thanks,
Alex

P.S. what is the role of mca_btl_*_component_control()?