Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-13 Thread Jeffrey Squyres
I would like to understand this more.  Let's talk about it tomorrow on the 
weekly teleconf.

On Mar 9, 2012, at 5:55 PM, Nathan Hjelm wrote:

> I tested my grdma mpool with the openib btl and IMB Alltoall/Alltoallv on a 
> system that consistently hangs. If I give the connection module the ability 
> to evict from the lru, grdma prevents both the out-of-registered-memory hang 
> AND problems creating QPs (due to exhaustion of registered memory).
> 
> -Nathan
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Nathan Hjelm

I tested my grdma mpool with the openib btl and IMB Alltoall/Alltoallv on a 
system that consistently hangs. If I give the connection module the ability to 
evict from the lru, grdma prevents both the out-of-registered-memory hang AND 
problems creating QPs (due to exhaustion of registered memory).

-Nathan


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Nathan Hjelm



On Fri, 9 Mar 2012, George Bosilca wrote:

> On Mar 9, 2012, at 14:23 , Nathan Hjelm wrote:
>
>> BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t 
>> instead of defining mca_mpool_blah_resources_t. The current design makes it 
>> impossible to support more than one mpool in a btl. I can delete a bunch of 
>> code if I can make a btl fall back on the rdma mpool if leave_pinned is not 
>> set.
>
> I guess you can name them as you like as long as you do the right cast to 
> avoid compiler complaints.
>
> Why can't you support multiple mpools in the same BTL?

Because if I include mpool_rdma.h and mpool_grdma.h (or mpool_sm.h) from the 
same file we get a name collision since all mpool components define 
mca_mpool_base_resources_t.

-Nathan
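
A minimal sketch of the renaming discussed above, for illustration only. The two struct names follow the mca_mpool_blah_resources_t pattern from the thread, but their fields, the mpool_kind_t enum, and btl_choose_mpool() are hypothetical and are not the real Open MPI definitions. Choosing grdma when leave_pinned is set is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-component resource types. The field layouts are
 * illustrative only; they are not the real Open MPI definitions. */
typedef struct mca_mpool_rdma_resources_t {
    void *reg_data;          /* opaque handle passed to register callbacks */
    size_t sizeof_reg;       /* size of one registration descriptor */
} mca_mpool_rdma_resources_t;

typedef struct mca_mpool_grdma_resources_t {
    void *reg_data;
    size_t sizeof_reg;
    const char *pool_name;   /* assumed: name of a node-wide shared pool */
} mca_mpool_grdma_resources_t;

/* With distinct type names, one BTL source file can include both
 * definitions without a collision and pick an mpool at run time. */
typedef enum { MPOOL_RDMA, MPOOL_GRDMA } mpool_kind_t;

static mpool_kind_t btl_choose_mpool(int leave_pinned)
{
    /* Assumption: fall back on the rdma mpool when leave_pinned is unset. */
    return leave_pinned ? MPOOL_GRDMA : MPOOL_RDMA;
}
```

Because each component owns its own type name, both headers could be included from the same BTL file, which is the case the shared name mca_mpool_base_resources_t rules out.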


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread George Bosilca

On Mar 9, 2012, at 14:23 , Nathan Hjelm wrote:

> BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t 
> instead of defining mca_mpool_blah_resources_t. The current design makes it 
> impossible to support more than one mpool in a btl. I can delete a bunch of 
> code if I can make a btl fall back on the rdma mpool if leave_pinned is not 
> set.

I guess you can name them as you like as long as you do the right cast to avoid 
compiler complaints.

Why can't you support multiple mpools in the same BTL?

  george.




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Rolf vandeVaart
[Comment at bottom]
>-----Original Message-----
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Nathan Hjelm
>Sent: Friday, March 09, 2012 2:23 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
>
>
>
>On Fri, 9 Mar 2012, Jeffrey Squyres wrote:
>
>> On Mar 9, 2012, at 1:32 PM, Nathan Hjelm wrote:
>>
>>> An mpool that is aware of local processes lru's will solve the problem in
>>> most cases (all that I have seen)
>>
>> I agree -- don't let words in my emails make you think otherwise.  I think
>> this will fix "most" problems, but undoubtedly, some will still occur.
>>
>> What's your timeline for having this ready -- should it go to 1.5.5, or 1.6?
>>
>> More specifically: if it's imminent, and can go to v1.5, then the openib
>> message is irrelevant and should not be used (and backed out of the trunk).
>> If it's going to take a little bit, I'm ok leaving the message in v1.5.5 for
>> now.
>
>I wrote the prototype yesterday (after finding that limiting the lru doesn't
>work for uGNI-- @256 pes we could only register ~1400 items instead of the
>3600 max we saw @128). I should have a version ready for review next week
>and a final version by the end of the month.
>
>
>BTW, can anyone tell me why each mpool defines
>mca_mpool_base_resources_t instead of defining
>mca_mpool_blah_resources_t. The current design makes it impossible to
>support more than one mpool in a btl. I can delete a bunch of code if I can
>make a btl fall back on the rdma mpool if leave_pinned is not set.
>
>-Nathan

I ran into this same issue when I wanted to use more than one mpool in a btl.  
I expected that there might be a base resource structure that was extended by 
each mpool.  I talked with Jeff and he told me (if I recall correctly) that the 
reason was that there was no common information in any of the 
mca_mpool_base_resources_t structures, so there was no need to have a base 
structure.  I do not think there is any reason we cannot do it as you suggest.

[The one other place I have seen it done like this in the library is the 
mca_btl_base_endpoint_t which is defined differently for each BTL]
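
For comparison, here is a sketch of the "base structure extended by each mpool" layout that I expected to find. Every name in it is hypothetical. Since the real structures reportedly share no fields, the `kind` tag is an assumption added only so that generic code has something to dispatch on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical common base; `kind` exists only for illustration. */
typedef enum { RES_RDMA = 1, RES_GRDMA = 2 } res_kind_t;

typedef struct mpool_base_resources {
    res_kind_t kind;
} mpool_base_resources_t;

typedef struct mpool_rdma_resources {
    mpool_base_resources_t super;   /* must be the first member */
    size_t sizeof_reg;
} mpool_rdma_resources_t;

typedef struct mpool_grdma_resources {
    mpool_base_resources_t super;   /* must be the first member */
    size_t sizeof_reg;
    const char *pool_name;
} mpool_grdma_resources_t;

/* Generic code receives the base pointer and downcasts after checking
 * `kind`. The cast is valid because `super` is the first member. */
static size_t resources_sizeof_reg(const mpool_base_resources_t *base)
{
    assert(base != NULL);
    switch (base->kind) {
    case RES_RDMA:
        return ((const mpool_rdma_resources_t *) base)->sizeof_reg;
    case RES_GRDMA:
        return ((const mpool_grdma_resources_t *) base)->sizeof_reg;
    }
    return 0;   /* unknown kind */
}
```

The alternative of simply giving each component its own type name needs no shared base at all, which fits the observation that the structures have nothing in common.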





Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Shamis, Pavel
>> Depending on the timing, this might go to 1.6 (1.5.5 has waited for too 
>> long, and this is not a regression).  Keep in mind that the problem has been 
>> around for *a long, long time*, which is why I approved the diag message 
>> (i.e., because a real solution is still nowhere in sight).  The real issue 
>> is that we can still run out of registered memory *and there is nothing left 
>> to deregister*.  The real solution there is that the PML should fall back to 
>> a different protocol, but I'm told that doesn't happen and will require a 
>> bunch of work to make work properly.
> 
> An mpool that is aware of local processes lru's will solve the problem in 
> most cases (all that I have seen) but yes, we need to rework the pml to 
> handle the remaining cases. There are two things that need to be changed 
> (from what I can tell):
> 
>  1) allow rget to fallback to send/put depending on the failure (I have 
> fallback on put implemented in my branch-- and in my btl).
>  2) need to devise new criteria on when we should progress the rdma_pending 
> list to avoid deadlock.
> 
> #1 is fairly simple and I haven't given much thought to #2.

But #1 will be a good start in the right direction. Agree about #2.

> 
> -Nathan




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Nathan Hjelm



On Fri, 9 Mar 2012, Jeffrey Squyres wrote:

> On Mar 9, 2012, at 1:32 PM, Nathan Hjelm wrote:
>
>> An mpool that is aware of local processes lru's will solve the problem in 
>> most cases (all that I have seen)
>
> I agree -- don't let words in my emails make you think otherwise.  I think 
> this will fix "most" problems, but undoubtedly, some will still occur.
>
> What's your timeline for having this ready -- should it go to 1.5.5, or 1.6?
>
> More specifically: if it's imminent, and can go to v1.5, then the openib 
> message is irrelevant and should not be used (and backed out of the trunk).  
> If it's going to take a little bit, I'm ok leaving the message in v1.5.5 for 
> now.

I wrote the prototype yesterday (after finding that limiting the lru doesn't 
work for uGNI-- @256 pes we could only register ~1400 items instead of the 3600 
max we saw @128). I should have a version ready for review next week and a 
final version by the end of the month.


BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t 
instead of defining mca_mpool_blah_resources_t. The current design makes it 
impossible to support more than one mpool in a btl. I can delete a bunch of 
code if I can make a btl fall back on the rdma mpool if leave_pinned is not set.

-Nathan


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Jeffrey Squyres
On Mar 9, 2012, at 1:32 PM, Nathan Hjelm wrote:

> An mpool that is aware of local processes lru's will solve the problem in 
> most cases (all that I have seen)

I agree -- don't let words in my emails make you think otherwise.  I think this 
will fix "most" problems, but undoubtedly, some will still occur.

What's your timeline for having this ready -- should it go to 1.5.5, or 1.6?

More specifically: if it's imminent, and can go to v1.5, then the openib 
message is irrelevant and should not be used (and backed out of the trunk).  If 
it's going to take a little bit, I'm ok leaving the message in v1.5.5 for now.

> but yes, we need to rework the pml to handle the remaining cases. There are 
> two things that need to be changed (from what I can tell):
> 
> 1) allow rget to fallback to send/put depending on the failure (I have 
> fallback on put implemented in my branch-- and in my btl).
> 2) need to devise new criteria on when we should progress the rdma_pending 
> list to avoid deadlock.
> 
> #1 is fairly simple and I haven't given much thought to #2.
> 
> -Nathan




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Nathan Hjelm



On Fri, 9 Mar 2012, Jeffrey Squyres wrote:

> On Mar 9, 2012, at 1:14 PM, George Bosilca wrote:
>
>>> The hang occurs because there is nothing on the lru to deregister and 
>>> ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts 
>>> the request on its rdma pending list and continues. If any message comes 
>>> in, the rdma pending list is progressed. If not, it hangs indefinitely!
>>
>> Unlike Jeff, I'm not in favor of adding bandages. If the cause is 
>> understood, then there _is_ a fix, and that fix should be the target of any 
>> efforts.
>
> The fix that Nathan proposes is not a complete fix -- we can still run out of 
> memory and hang.  You should read the open tickets and prior emails we have 
> sent about this -- Nathan's fix merely delays when we will run out of 
> registered memory.  It does not solve the underlying problem.

Correct.

>>> In general I have found the underlying cause of the hang is due to an 
>>> imbalance of registrations between processes on a node, i.e., the hung 
>>> process has an empty lru but other processes could deregister. I am working 
>>> on a new mpool (grdma) to handle the imbalance. The new mpool will allow a 
>>> process to request that one of its peers deregisters from its lru if 
>>> possible. I have a working proof of concept implementation that uses a 
>>> posix shmem segment and a progress function to handle signaling and 
>>> dereferencing. With it I no longer see hangs with IMB Alltoall/Alltoallv on 
>>> uGNI (without putting an artificial limit on the number of registrations). 
>>> I will test the mpool on infiniband later today.
>>
>> If a solution already exists I don't see why we have to have the message 
>> code. Based on its urgency, I'm confident your patch will make its way into 
>> the 1.5 quite easily.
>
> Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long, 
> and this is not a regression).  Keep in mind that the problem has been around 
> for *a long, long time*, which is why I approved the diag message (i.e., 
> because a real solution is still nowhere in sight).  The real issue is that 
> we can still run out of registered memory *and there is nothing left to 
> deregister*.  The real solution there is that the PML should fall back to a 
> different protocol, but I'm told that doesn't happen and will require a bunch 
> of work to make work properly.

An mpool that is aware of local processes lru's will solve the problem in most 
cases (all that I have seen) but yes, we need to rework the pml to handle the 
remaining cases. There are two things that need to be changed (from what I can 
tell):

 1) allow rget to fall back to send/put depending on the failure (I have 
fallback on put implemented in my branch-- and in my btl).
 2) need to devise new criteria on when we should progress the rdma_pending 
list to avoid deadlock.

#1 is fairly simple and I haven't given much thought to #2.
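
Item #1 could look roughly like the sketch below. This is not the ob1 code. The enums, the try_proto_fn hook, and the rget, then put, then send ordering are assumptions made for illustration, and fail_mask_hook() exists only so the fallback can be exercised.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PROTO_RGET, PROTO_PUT, PROTO_SEND, PROTO_NONE } proto_t;
typedef enum { RC_OK, RC_OUT_OF_RESOURCE, RC_FATAL } rc_t;

/* Hypothetical transport hook: attempts one protocol, reports status. */
typedef rc_t (*try_proto_fn)(proto_t proto, void *ctx);

/* Try rget first, then put, then plain send. Only an out-of-resource
 * failure triggers the fallback; any other failure stops the search. */
static proto_t start_with_fallback(try_proto_fn try_proto, void *ctx,
                                   rc_t *rc_out)
{
    static const proto_t order[] = { PROTO_RGET, PROTO_PUT, PROTO_SEND };
    rc_t rc = RC_FATAL;
    assert(try_proto != NULL && rc_out != NULL);
    for (size_t i = 0; i < sizeof(order) / sizeof(order[0]); ++i) {
        rc = try_proto(order[i], ctx);
        if (RC_OK == rc) {
            *rc_out = rc;
            return order[i];
        }
        if (RC_OUT_OF_RESOURCE != rc) {
            break;
        }
    }
    *rc_out = rc;
    return PROTO_NONE;
}

/* Example hook: ctx points to a bitmask of protocols that should
 * report out-of-resource (bit n corresponds to proto_t value n). */
static rc_t fail_mask_hook(proto_t proto, void *ctx)
{
    unsigned mask = *(unsigned *) ctx;
    return (mask & (1u << proto)) ? RC_OUT_OF_RESOURCE : RC_OK;
}
```

When every protocol reports out-of-resource the function returns PROTO_NONE, which is the case where the request would still have to sit on the rdma_pending list (item #2).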

-Nathan


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Jeffrey Squyres
On Mar 9, 2012, at 1:14 PM, George Bosilca wrote:

>> The hang occurs because there is nothing on the lru to deregister and 
>> ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts 
>> the request on its rdma pending list and continues. If any message comes in 
>> the rdma pending list is progressed. If not it hangs indefinitely!
> 
> Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood, 
> then there _is_ a fix, and that fix should be the target of any efforts.

The fix that Nathan proposes is not a complete fix -- we can still run out of 
memory and hang.  You should read the open tickets and prior emails we have 
sent about this -- Nathan's fix merely delays when we will run out of 
registered memory.  It does not solve the underlying problem.

>> In general I have found the underlying cause of the hang is due to an 
>> imbalance of registrations between processes on a node, i.e., the hung 
>> process has an empty lru but other processes could deregister. I am working 
>> on a new mpool (grdma) to handle the imbalance. The new mpool will allow a 
>> process to request that one of its peers deregisters from its lru if 
>> possible. I have a working proof of concept implementation that uses a 
>> posix shmem segment and a progress function to handle signaling and 
>> dereferencing. With it I no longer see hangs with IMB Alltoall/Alltoallv on 
>> uGNI (without putting an artificial limit on the number of registrations). 
>> I will test the mpool on infiniband later today.
> 
> If a solution already exists I don't see why we have to have the message 
> code. Based on its urgency, I'm confident your patch will make its way into 
> the 1.5 quite easily.


Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long, 
and this is not a regression).  Keep in mind that the problem has been around 
for *a long, long time*, which is why I approved the diag message (i.e., 
because a real solution is still nowhere in sight).  The real issue is that we 
can still run out of registered memory *and there is nothing left to 
deregister*.  The real solution there is that the PML should fall back to a 
different protocol, but I'm told that doesn't happen and will require a bunch 
of work to make work properly.





Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread George Bosilca

On Mar 9, 2012, at 12:59 , Nathan Hjelm wrote:

> Not exactly, the PML invokes the mpool which invokes the registration 
> function. If registration fails the mpool will deregister from its lru (if 
> possible) and try again. So, it is not an error if ibv_reg_mr fails unless it 
> fails because the process is starved of registered memory (or has truly run 
> out).
> 
> The hang occurs because there is nothing on the lru to deregister and 
> ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the 
> request on its rdma pending list and continues. If any message comes in, the 
> rdma pending list is progressed. If not, it hangs indefinitely!

Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood, 
then there _is_ a fix, and that fix should be the target of any efforts.

> In general I have found the underlying cause of the hang is due to an 
> imbalance of registrations between processes on a node, i.e., the hung 
> process has an empty lru but other processes could deregister. I am working 
> on a new mpool (grdma) to handle the imbalance. The new mpool will allow a 
> process to request that one of its peers deregisters from its lru if 
> possible. I have a working proof of concept implementation that uses a posix 
> shmem segment and a progress function to handle signaling and dereferencing. 
> With it I no longer see hangs with IMB Alltoall/Alltoallv on uGNI (without 
> putting an artificial limit on the number of registrations). I will test the 
> mpool on infiniband later today.

If a solution already exists I don't see why we have to have the message code. 
Based on its urgency, I'm confident your patch will make its way into the 1.5 
quite easily.

  george.


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Nathan Hjelm

Not exactly, the PML invokes the mpool which invokes the registration function. 
If registration fails the mpool will deregister from its lru (if possible) and 
try again. So, it is not an error if ibv_reg_mr fails unless it fails because 
the process is starved of registered memory (or has truly run out).

The hang occurs because there is nothing on the lru to deregister and 
ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the 
request on its rdma pending list and continues. If any message comes in, the 
rdma pending list is progressed. If not, it hangs indefinitely!
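
The register/evict/retry behaviour described above can be modelled with a toy pool. This is a sketch under stated assumptions, not the mpool code: toy_mpool_t and every toy_* function are hypothetical, toy_reg_mr() merely stands in for ibv_reg_mr, and the numeric value of ERR_OUT_OF_RESOURCE is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ERR_OUT_OF_RESOURCE (-2)   /* stand-in for OMPI_ERR_OUT_OF_RESOURCE */

/* Toy model: the device holds at most `capacity` registrations, and
 * `lru_len` of the live ones are idle and may be evicted. */
typedef struct {
    size_t capacity;
    size_t live;
    size_t lru_len;
} toy_mpool_t;

/* Stands in for ibv_reg_mr: fails once the device is full. */
static bool toy_reg_mr(toy_mpool_t *p)
{
    if (p->live >= p->capacity) {
        return false;
    }
    p->live++;
    return true;
}

/* Deregister one idle entry from the lru, if there is one. */
static bool toy_evict_one(toy_mpool_t *p)
{
    if (0 == p->lru_len) {
        return false;
    }
    p->lru_len--;
    p->live--;
    return true;
}

/* Register, evicting idle registrations and retrying on failure.
 * Returns 0 on success, or ERR_OUT_OF_RESOURCE once the lru is empty. */
static int toy_mpool_register(toy_mpool_t *p)
{
    assert(p != NULL && p->lru_len <= p->live);
    while (!toy_reg_mr(p)) {
        if (!toy_evict_one(p)) {
            return ERR_OUT_OF_RESOURCE;
        }
    }
    return 0;
}
```

The second outcome, a failed registration with an empty lru, is the one that leaves the request on the rdma pending list.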

In general I have found the underlying cause of the hang is due to an imbalance 
of registrations between processes on a node, i.e., the hung process has an 
empty lru but other processes could deregister. I am working on a new mpool 
(grdma) to handle the imbalance. The new mpool will allow a process to request 
that one of its peers deregisters from its lru if possible. I have a working 
proof of concept implementation that uses a posix shmem segment and a progress 
function to handle signaling and dereferencing. With it I no longer see hangs 
with IMB Alltoall/Alltoallv on uGNI (without putting an artificial limit on the 
number of registrations). I will test the mpool on infiniband later today.
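
The peer-eviction idea can be sketched as a single-process model. This is not the grdma implementation: node_table_t stands in for what would be a shared-memory segment, all names are hypothetical, and picking the peer with the longest lru is an assumed policy. A real version would also need atomic updates to the flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LOCAL 8

/* Single-process model of the node-wide table. In a real implementation
 * this would live in shared memory; here it is a plain struct. */
typedef struct {
    size_t lru_len[MAX_LOCAL];         /* idle registrations per local rank */
    bool evict_requested[MAX_LOCAL];   /* set by a starved peer */
    size_t nlocal;
} node_table_t;

/* Called by a starved rank: flag the peer with the longest lru.
 * Returns the peer's index, or -1 if no peer can deregister. */
static int request_peer_eviction(node_table_t *t, size_t self)
{
    int best = -1;
    assert(t != NULL && t->nlocal <= MAX_LOCAL);
    for (size_t i = 0; i < t->nlocal; ++i) {
        if (i == self || 0 == t->lru_len[i]) {
            continue;
        }
        if (best < 0 || t->lru_len[i] > t->lru_len[best]) {
            best = (int) i;
        }
    }
    if (best >= 0) {
        t->evict_requested[best] = true;
    }
    return best;
}

/* Called from each rank's progress function: honor a pending request by
 * dropping one idle registration. Returns true if one was released. */
static bool progress_eviction(node_table_t *t, size_t self)
{
    if (!t->evict_requested[self]) {
        return false;
    }
    t->evict_requested[self] = false;
    if (0 == t->lru_len[self]) {
        return false;
    }
    t->lru_len[self]--;
    return true;
}
```

A return of -1 from request_peer_eviction() corresponds to the remaining case in which nobody on the node has anything left to deregister.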

-Nathan


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-09 Thread Jeffrey Squyres
George --

I believe that this is the subject of a few long-standing tickets (i.e., what 
to do when running out of registered memory -- right now, we hang, for a few 
reasons).  I think that this is Mellanox's attempt to at least warn the user 
that we have run out of registered memory, and will therefore hang.

Once the hangs have been fixed, I'm assuming this message can be removed.

Note, too, that this is in the BTL registration code (openib_reg_mr), not in 
the directly-invoked-by-the-PML code.  So it's the mpool's fault -- not the 
PML's fault.



On Mar 6, 2012, at 10:05 AM, George Bosilca wrote:

> I didn't check the code thoroughly, but OMPI_ERR_OUT_OF_RESOURCE is not an 
> error. If the registration returns out of resources, the BTL will return 
> OUT_OF_RESOURCE (for example, via mca_btl_openib_prepare_src). At the 
> upper level, the PML (in the mca_pml_ob1_send_request_start function) 
> intercepts it and inserts the request into a pending list. Later on, this 
> pending list will be examined and the request for resources re-issued.
> 
> Why do we need to trigger a BTL_ERROR for OUT_OF_RESOURCES?
> 
>   george.
> 
> On Mar 6, 2012, at 09:48 , Jeffrey Squyres wrote:
> 
> > Mike --
> >
> > I would make this a bit better of an error.  I.e., use orte_show_help(), so 
> > you can explain the issue more, and also remove all duplicates (i.e., if it 
> > fails to register multiple times).
> >
> >
> > On Mar 6, 2012, at 8:25 AM, mi...@osl.iu.edu wrote:
> >
> >> Author: miked
> >> Date: 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> >> New Revision: 26106
> >> URL: https://svn.open-mpi.org/trac/ompi/changeset/26106
> >>
> >> Log:
> >> print error which is ignored on upper layer
> >> Text files modified:
> >>  trunk/ompi/mca/btl/openib/btl_openib_component.c | 2 ++
> >>  1 files changed, 2 insertions(+), 0 deletions(-)
> >>
> >> Modified: trunk/ompi/mca/btl/openib/btl_openib_component.c
> >> ==
> >> --- trunk/ompi/mca/btl/openib/btl_openib_component.c (original)
> >> +++ trunk/ompi/mca/btl/openib/btl_openib_component.c 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> >> @@ -569,6 +569,8 @@
> >>      openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, access_flag);
> >>
> >>      if (NULL == openib_reg->mr) {
> >> +        BTL_ERROR(("%s: error pinning openib memory errno says %s",
> >> +                   __func__, strerror(errno)));
> >>          return OMPI_ERR_OUT_OF_RESOURCE;
> >>      }
> >>
> >> ___
> >> svn-full mailing list
> >> svn-f...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/svn-full




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-06 Thread George Bosilca
I didn't check the code thoroughly, but OMPI_ERR_OUT_OF_RESOURCE is not an 
error. If the registration returns out of resources, the BTL will return 
OUT_OF_RESOURCE (for example, via mca_btl_openib_prepare_src). At the 
upper level, the PML (in the mca_pml_ob1_send_request_start function) intercepts 
it and inserts the request into a pending list. Later on, this pending list will 
be examined and the request for resources re-issued.
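
A toy model of the pending-list flow described above, assuming hypothetical names throughout. toy_pml_t and the toy_* functions are not the ob1 API, and the free_slots counter stands in for whatever resource the BTL has run out of.

```c
#include <assert.h>
#include <stddef.h>

#define ERR_OUT_OF_RESOURCE (-2)   /* stand-in for OMPI_ERR_OUT_OF_RESOURCE */
#define MAX_PENDING 16

typedef struct {
    int pending[MAX_PENDING];  /* request ids waiting for resources */
    size_t npending;
    size_t free_slots;         /* toy resource counter */
} toy_pml_t;

/* Stands in for the BTL prepare step: consumes one slot if available. */
static int toy_btl_prepare(toy_pml_t *pml)
{
    if (0 == pml->free_slots) {
        return ERR_OUT_OF_RESOURCE;
    }
    pml->free_slots--;
    return 0;
}

/* Start a request; on out-of-resource, queue it instead of failing. */
static int toy_send_start(toy_pml_t *pml, int req_id)
{
    int rc = toy_btl_prepare(pml);
    if (ERR_OUT_OF_RESOURCE == rc) {
        assert(pml->npending < MAX_PENDING);
        pml->pending[pml->npending++] = req_id;
        return 0;              /* not reported to the caller as an error */
    }
    return rc;
}

/* Re-issue queued requests in order; returns how many were started. */
static size_t toy_progress_pending(toy_pml_t *pml)
{
    size_t started = 0;
    while (started < pml->npending && 0 == toy_btl_prepare(pml)) {
        started++;
    }
    /* Shift the requests that are still waiting to the front. */
    for (size_t i = started; i < pml->npending; ++i) {
        pml->pending[i - started] = pml->pending[i];
    }
    pml->npending -= started;
    return started;
}
```

If nothing ever frees a slot, toy_progress_pending() keeps returning 0 and the queued request never starts, so the retry only helps when resources eventually become available.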

Why do we need to trigger a BTL_ERROR for OUT_OF_RESOURCES?

  george.

On Mar 6, 2012, at 09:48 , Jeffrey Squyres wrote:

> Mike --
> 
> I would make this a bit better of an error.  I.e., use orte_show_help(), so 
> you can explain the issue more, and also remove all duplicates (i.e., if it 
> fails to register multiple times).
> 
> 
> On Mar 6, 2012, at 8:25 AM, mi...@osl.iu.edu wrote:
> 
>> Author: miked
>> Date: 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
>> New Revision: 26106
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/26106
>> 
>> Log:
>> print error which is ignored on upper layer
>> Text files modified: 
>>  trunk/ompi/mca/btl/openib/btl_openib_component.c | 2 ++
>>  1 files changed, 2 insertions(+), 0 deletions(-)
>> 
>> Modified: trunk/ompi/mca/btl/openib/btl_openib_component.c
>> ==
>> --- trunk/ompi/mca/btl/openib/btl_openib_component.c (original)
>> +++ trunk/ompi/mca/btl/openib/btl_openib_component.c 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
>> @@ -569,6 +569,8 @@
>>      openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, access_flag);
>> 
>>      if (NULL == openib_reg->mr) {
>> +        BTL_ERROR(("%s: error pinning openib memory errno says %s",
>> +                   __func__, strerror(errno)));
>>          return OMPI_ERR_OUT_OF_RESOURCE;
>>      }
>> 




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106

2012-03-06 Thread Jeffrey Squyres
Mike --

I would make this a better error message.  I.e., use orte_show_help(), so you 
can explain the issue more, and also remove all duplicates (i.e., if it fails 
to register multiple times).
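
To illustrate only the "remove all duplicates" part of that suggestion, here is a standalone warn-once helper. It does not call orte_show_help(). The function name, the message text, and the fixed-size table are assumptions, and the helper is not thread-safe.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Print a longer explanation the first time registration fails with a
 * given errno, and stay silent on repeats. Returns true if a message
 * was printed. Once the small table is full, new errno values are
 * reported every time. Not thread-safe; a sketch only. */
static bool warn_reg_failure_once(int err)
{
    static int seen[8];
    static size_t nseen = 0;

    for (size_t i = 0; i < nseen; ++i) {
        if (seen[i] == err) {
            return false;          /* duplicate: suppress */
        }
    }
    if (nseen < sizeof(seen) / sizeof(seen[0])) {
        seen[nseen++] = err;
    }
    fprintf(stderr,
            "Memory registration failed: %s.\n"
            "The process may be out of registered memory; identical\n"
            "failures will not be reported again.\n",
            strerror(err));
    return true;
}
```

In the openib registration path this would be called with errno right after a NULL return from ibv_reg_mr, in place of the unconditional BTL_ERROR.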


On Mar 6, 2012, at 8:25 AM, mi...@osl.iu.edu wrote:

> Author: miked
> Date: 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> New Revision: 26106
> URL: https://svn.open-mpi.org/trac/ompi/changeset/26106
> 
> Log:
> print error which is ignored on upper layer
> Text files modified: 
>   trunk/ompi/mca/btl/openib/btl_openib_component.c | 2 ++
>   1 files changed, 2 insertions(+), 0 deletions(-)
> 
> Modified: trunk/ompi/mca/btl/openib/btl_openib_component.c
> ==
> --- trunk/ompi/mca/btl/openib/btl_openib_component.c  (original)
> +++ trunk/ompi/mca/btl/openib/btl_openib_component.c  2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> @@ -569,6 +569,8 @@
>      openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, access_flag);
> 
>      if (NULL == openib_reg->mr) {
> +        BTL_ERROR(("%s: error pinning openib memory errno says %s",
> +                   __func__, strerror(errno)));
>          return OMPI_ERR_OUT_OF_RESOURCE;
>      }
> 