Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
I would like to understand this more. Let's talk about it tomorrow on the weekly teleconf.

On Mar 9, 2012, at 5:55 PM, Nathan Hjelm wrote:

> I tested my grdma mpool with the openib btl and IMB Alltoall/Alltoallv on a
> system that consistently hangs. If I give the connection module the ability
> to evict from the lru grdma prevents both the out of registered memory hang
> AND problems creating QPs (due to exhaustion of registered memory).
>
> -Nathan

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
I tested my grdma mpool with the openib btl and IMB Alltoall/Alltoallv on a system that consistently hangs. If I give the connection module the ability to evict from the lru, grdma prevents both the out-of-registered-memory hang AND problems creating QPs (due to exhaustion of registered memory).

-Nathan
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
On Fri, 9 Mar 2012, George Bosilca wrote:

> On Mar 9, 2012, at 14:23 , Nathan Hjelm wrote:
>
>> BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t
>> instead of defining mca_mpool_blah_resources_t. The current design makes it
>> impossible to support more than one mpool in a btl. I can delete a bunch of
>> code if I can make a btl fall back on the rdma mpool if leave_pinned is not
>> set.
>
> I guess you can name them as you like as long as you do the right cast to
> avoid compiler complaints. Why can't you support multiple mpools in the same
> BTL?

Because if I include mpool_rdma.h and mpool_grdma.h (or mpool_sm.h) from the same file we get a name collision since all mpool components define mca_mpool_base_resources_t.

-Nathan
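A minimal sketch of the renaming Nathan asks about, assuming each component declares its own resources type. Every identifier here (the `example_` types, fields, and functions) is hypothetical and is not taken from the Open MPI headers. It only shows that distinct names let one file see both types, with the cast George mentions done inside each component.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-component resource types. In the thread, every mpool
 * header declares the same name, mca_mpool_base_resources_t, so two such
 * headers cannot be included from one file. Distinct names avoid that. */
struct example_mpool_rdma_resources_t {
    size_t sizeof_reg;      /* illustrative field */
};

struct example_mpool_grdma_resources_t {
    size_t sizeof_reg;      /* illustrative field */
    const char *pool_name;  /* illustrative field */
};

enum example_mpool_kind { EXAMPLE_MPOOL_RDMA, EXAMPLE_MPOOL_GRDMA };

/* Generic code hands over an opaque pointer; each component casts it back
 * to its own type. */
static size_t example_rdma_reg_size(void *resources)
{
    struct example_mpool_rdma_resources_t *r = resources;
    assert(NULL != r);
    return r->sizeof_reg;
}

static const char *example_grdma_pool_name(void *resources)
{
    struct example_mpool_grdma_resources_t *r = resources;
    assert(NULL != r);
    return r->pool_name;
}

/* With both types visible, a BTL could fall back on the rdma mpool when
 * leave_pinned is not set. */
static enum example_mpool_kind example_btl_pick_mpool(int leave_pinned)
{
    return leave_pinned ? EXAMPLE_MPOOL_GRDMA : EXAMPLE_MPOOL_RDMA;
}
```

The selection function is the part that a shared type name currently blocks: it needs both declarations in scope at once.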
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
On Mar 9, 2012, at 14:23 , Nathan Hjelm wrote:

> BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t
> instead of defining mca_mpool_blah_resources_t. The current design makes it
> impossible to support more than one mpool in a btl. I can delete a bunch of
> code if I can make a btl fall back on the rdma mpool if leave_pinned is not
> set.

I guess you can name them as you like as long as you do the right cast to avoid compiler complaints. Why can't you support multiple mpools in the same BTL?

  george.
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
[Comment at bottom]

>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Nathan Hjelm
>Sent: Friday, March 09, 2012 2:23 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
>
>On Fri, 9 Mar 2012, Jeffrey Squyres wrote:
>
>> On Mar 9, 2012, at 1:32 PM, Nathan Hjelm wrote:
>>
>>> An mpool that is aware of local processes lru's will solve the problem in
>>> most cases (all that I have seen)
>>
>> I agree -- don't let words in my emails make you think otherwise. I think
>> this will fix "most" problems, but undoubtedly, some will still occur.
>>
>> What's your timeline for having this ready -- should it go to 1.5.5, or 1.6?
>>
>> More specifically: if it's imminent, and can go to v1.5, then the openib
>> message is irrelevant and should not be used (and backed out of the trunk).
>> If it's going to take a little bit, I'm ok leaving the message in v1.5.5
>> for now.
>
>I wrote the prototype yesterday (after finding that limiting the lru doesn't
>work for uGNI-- @256 pes we could only register ~1400 items instead of the
>3600 max we saw @128). I should have a version ready for review next week
>and a final version by the end of the month.
>
>BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t
>instead of defining mca_mpool_blah_resources_t. The current design makes it
>impossible to support more than one mpool in a btl. I can delete a bunch of
>code if I can make a btl fall back on the rdma mpool if leave_pinned is not
>set.
>
>-Nathan

I ran into this same issue about wanting to use more than one mpool in a btl. I expected that there might be a base resource structure that was extended by each mpool. I talked with Jeff and he told me (if I recall correctly) that the reason was that there was no common information in any of the mca_mpool_base_resources_t structures, so there was no need to have a base structure.

I do not think there is any reason we cannot do it as you suggest. [The one other place I have seen it done like this in the library is the mca_btl_base_endpoint_t, which is defined differently for each BTL.]
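For comparison, here is a sketch of the "base structure extended by each mpool" layout described above. It is an assumption-heavy illustration: the thread says the real structures had no common fields, so the `kind` tag and every `example_` name below are invented purely to show the pattern.

```c
#include <assert.h>
#include <stddef.h>

enum example_res_kind { EXAMPLE_RES_RDMA, EXAMPLE_RES_SM };

/* Hypothetical common base; the tag exists only for this illustration. */
struct example_resources_base_t {
    enum example_res_kind kind;
};

struct example_rdma_resources_t {
    struct example_resources_base_t super;  /* must be the first member */
    size_t sizeof_reg;                      /* illustrative field */
};

struct example_sm_resources_t {
    struct example_resources_base_t super;  /* must be the first member */
    size_t segment_size;                    /* illustrative field */
};

/* Generic code sees only the base pointer and downcasts after checking
 * the tag. The cast is valid because the base is the first member. */
static size_t example_resources_size(const struct example_resources_base_t *base)
{
    assert(NULL != base);
    switch (base->kind) {
    case EXAMPLE_RES_RDMA:
        return ((const struct example_rdma_resources_t *) base)->sizeof_reg;
    case EXAMPLE_RES_SM:
        return ((const struct example_sm_resources_t *) base)->segment_size;
    }
    return 0;
}
```

If the components really share nothing, the simpler option of giving each type its own name achieves the same goal without a base structure.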
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
>> Depending on the timing, this might go to 1.6 (1.5.5 has waited for too
>> long, and this is not a regression). Keep in mind that the problem has been
>> around for *a long, long time*, which is why I approved the diag message
>> (i.e., because a real solution is still nowhere in sight). The real issue
>> is that we can still run out of registered memory *and there is nothing left
>> to deregister*. The real solution there is that the PML should fall back to
>> a different protocol, but I'm told that doesn't happen and will require a
>> bunch of work to make work properly.
>
> An mpool that is aware of local processes lru's will solve the problem in
> most cases (all that I have seen) but yes, we need to rework the pml to
> handle the remaining cases. There are two things that need to be changed
> (from what I can tell):
>
> 1) allow rget to fallback to send/put depending on the failure (I have
>    fallback on put implemented in my branch-- and in my btl).
> 2) need to devise new criteria on when we should progress the rdma_pending
>    list to avoid deadlock.
>
> #1 is fairly simple and I haven't given much thought to #2.

But #1 will be a good start in the right direction. Agree about #2.

> -Nathan
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
On Fri, 9 Mar 2012, Jeffrey Squyres wrote:

> On Mar 9, 2012, at 1:32 PM, Nathan Hjelm wrote:
>
>> An mpool that is aware of local processes lru's will solve the problem in
>> most cases (all that I have seen)
>
> I agree -- don't let words in my emails make you think otherwise. I think
> this will fix "most" problems, but undoubtedly, some will still occur.
>
> What's your timeline for having this ready -- should it go to 1.5.5, or 1.6?
>
> More specifically: if it's imminent, and can go to v1.5, then the openib
> message is irrelevant and should not be used (and backed out of the trunk).
> If it's going to take a little bit, I'm ok leaving the message in v1.5.5 for
> now.

I wrote the prototype yesterday (after finding that limiting the lru doesn't work for uGNI-- @256 pes we could only register ~1400 items instead of the 3600 max we saw @128). I should have a version ready for review next week and a final version by the end of the month.

BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t instead of defining mca_mpool_blah_resources_t. The current design makes it impossible to support more than one mpool in a btl. I can delete a bunch of code if I can make a btl fall back on the rdma mpool if leave_pinned is not set.

-Nathan
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
On Mar 9, 2012, at 1:32 PM, Nathan Hjelm wrote:

> An mpool that is aware of local processes lru's will solve the problem in
> most cases (all that I have seen)

I agree -- don't let words in my emails make you think otherwise. I think this will fix "most" problems, but undoubtedly, some will still occur.

What's your timeline for having this ready -- should it go to 1.5.5, or 1.6?

More specifically: if it's imminent, and can go to v1.5, then the openib message is irrelevant and should not be used (and backed out of the trunk). If it's going to take a little bit, I'm ok leaving the message in v1.5.5 for now.

> but yes, we need to rework the pml to handle the remaining cases. There are
> two things that need to be changed (from what I can tell):
>
> 1) allow rget to fallback to send/put depending on the failure (I have
>    fallback on put implemented in my branch-- and in my btl).
> 2) need to devise new criteria on when we should progress the rdma_pending
>    list to avoid deadlock.
>
> #1 is fairly simple and I haven't given much thought to #2.
>
> -Nathan

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
On Fri, 9 Mar 2012, Jeffrey Squyres wrote:

> On Mar 9, 2012, at 1:14 PM, George Bosilca wrote:
>
>>> The hang occurs because there is nothing on the lru to deregister and
>>> ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts
>>> the request on its rdma pending list and continues. If any message comes in
>>> the rdma pending list is progressed. If not it hangs indefinitely!
>>
>> Unlike Jeff, I'm not in favor of adding bandages. If the cause is
>> understood, then there _is_ a fix, and that fix should be the target of any
>> efforts.
>
> The fix that Nathan proposes is not a complete fix -- we can still run out of
> memory and hang. You should read the open tickets and prior emails we have
> sent about this -- Nathan's fix merely delays when we will run out of
> registered memory. It does not solve the underlying problem.

Correct.

>>> In general I have found the underlying cause of the hang is due to an
>>> imbalance of registrations between processes on a node. i.e., the hung
>>> process has an empty lru but other processes could deregister. I am working
>>> on a new mpool (grdma) to handle the imbalance. The new mpool will allow a
>>> process to request that one of its peers deregisters from its lru if
>>> possible. I have a working proof of concept implementation that uses a
>>> posix shmem segment and a progress function to handle signaling and
>>> dereferencing. With it I no longer see hangs with IMB Alltoall/Alltoallv on
>>> uGNI (without putting an artificial limit on the number of registrations).
>>> I will test the mpool on infiniband later today.
>>
>> If a solution already exists I don't see why we have to have the message
>> code. Based on its urgency, I'm confident your patch will make its way into
>> the 1.5 quite easily.
>
> Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long,
> and this is not a regression). Keep in mind that the problem has been around
> for *a long, long time*, which is why I approved the diag message (i.e.,
> because a real solution is still nowhere in sight).
>
> The real issue is that we can still run out of registered memory *and there
> is nothing left to deregister*. The real solution there is that the PML
> should fall back to a different protocol, but I'm told that doesn't happen
> and will require a bunch of work to make work properly.

An mpool that is aware of local processes lru's will solve the problem in most cases (all that I have seen) but yes, we need to rework the pml to handle the remaining cases. There are two things that need to be changed (from what I can tell):

1) allow rget to fallback to send/put depending on the failure (I have fallback on put implemented in my branch-- and in my btl).
2) need to devise new criteria on when we should progress the rdma_pending list to avoid deadlock.

#1 is fairly simple and I haven't given much thought to #2.

-Nathan
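Item 1 above (fall back from rget to put or send, depending on the failure) can be sketched roughly as follows. This is not the ob1 code: the `example_` names, the status codes, and the idea of passing an ordered list of start functions are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum {
    EXAMPLE_SUCCESS = 0,
    EXAMPLE_ERR_OUT_OF_RESOURCE = -1,
    EXAMPLE_ERR_FATAL = -2
};

/* Hypothetical protocol entry point: returns one of the codes above. */
typedef int (*example_start_fn)(void);

/* Try protocols in preference order (e.g. rget, put, send). Only an
 * out-of-resource failure triggers the fallback; any other failure is
 * returned immediately. On success *chosen holds the index used. */
static int example_start_with_fallback(const example_start_fn *starts,
                                       int count, int *chosen)
{
    int rc = EXAMPLE_ERR_OUT_OF_RESOURCE;
    assert(NULL != starts && NULL != chosen);
    *chosen = -1;
    for (int i = 0; i < count; ++i) {
        rc = starts[i]();
        if (EXAMPLE_SUCCESS == rc) {
            *chosen = i;
            return rc;
        }
        if (EXAMPLE_ERR_OUT_OF_RESOURCE != rc) {
            return rc;  /* not a registration shortage: do not fall back */
        }
    }
    return rc;  /* every protocol ran out of resources */
}

/* Stand-ins used to exercise the function. */
static int example_rget_no_memory(void) { return EXAMPLE_ERR_OUT_OF_RESOURCE; }
static int example_put_ok(void)         { return EXAMPLE_SUCCESS; }
static int example_send_ok(void)        { return EXAMPLE_SUCCESS; }
static int example_fatal(void)          { return EXAMPLE_ERR_FATAL; }
```

When every protocol reports out-of-resource, the request would still have to be queued, which is where item 2 (when to progress the rdma_pending list) comes in.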
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
On Mar 9, 2012, at 1:14 PM, George Bosilca wrote:

>> The hang occurs because there is nothing on the lru to deregister and
>> ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts
>> the request on its rdma pending list and continues. If any message comes in
>> the rdma pending list is progressed. If not it hangs indefinitely!
>
> Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood,
> then there _is_ a fix, and that fix should be the target of any efforts.

The fix that Nathan proposes is not a complete fix -- we can still run out of memory and hang. You should read the open tickets and prior emails we have sent about this -- Nathan's fix merely delays when we will run out of registered memory. It does not solve the underlying problem.

>> In general I have found the underlying cause of the hang is due to an
>> imbalance of registrations between processes on a node. i.e the hung process
>> has an empty lru but other processes could deregister. I am working on a new
>> mpool (grdma) to handle the imbalance. The new mpool will allow a process to
>> request that one of its peers deregisters from it lru if possible. I have a
>> working proof of concept implementation that uses a posix shmem segment and
>> a progress function to handle signaling and dereferencing. With it I no
>> longer see hangs with IMB Alltoall/Alltoallv on uGNI (without putting an
>> artificial limit on the number of registrations). I will test the mpool on
>> infiniband later today.
>
> If a solution already exists I don't see why we have to have the message
> code. Based on its urgency, I'm confident your patch will make its way into
> the 1.5 quite easily.

Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long, and this is not a regression). Keep in mind that the problem has been around for *a long, long time*, which is why I approved the diag message (i.e., because a real solution is still nowhere in sight).

The real issue is that we can still run out of registered memory *and there is nothing left to deregister*. The real solution there is that the PML should fall back to a different protocol, but I'm told that doesn't happen and will require a bunch of work to make work properly.

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
On Mar 9, 2012, at 12:59 , Nathan Hjelm wrote:

> Not exactly, the PML invokes the mpool which invokes the registration
> function. If registration fails the mpool will deregister from its lru (if
> possible) and try again. So, it is not an error if ibv_reg_mr fails unless it
> fails because the process is starved of registered memory (or truly run out).
>
> The hang occurs because there is nothing on the lru to deregister and
> ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the
> request on its rdma pending list and continues. If any message comes in the
> rdma pending list is progressed. If not it hangs indefinitely!

Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood, then there _is_ a fix, and that fix should be the target of any efforts.

> In general I have found the underlying cause of the hang is due to an
> imbalance of registrations between processes on a node. i.e., the hung
> process has an empty lru but other processes could deregister. I am working
> on a new mpool (grdma) to handle the imbalance. The new mpool will allow a
> process to request that one of its peers deregisters from its lru if
> possible. I have a working proof of concept implementation that uses a posix
> shmem segment and a progress function to handle signaling and dereferencing.
> With it I no longer see hangs with IMB Alltoall/Alltoallv on uGNI (without
> putting an artificial limit on the number of registrations). I will test the
> mpool on infiniband later today.

If a solution already exists I don't see why we have to have the message code. Based on its urgency, I'm confident your patch will make its way into the 1.5 quite easily.

  george.
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
Not exactly, the PML invokes the mpool which invokes the registration function. If registration fails the mpool will deregister from its lru (if possible) and try again. So, it is not an error if ibv_reg_mr fails unless it fails because the process is starved of registered memory (or has truly run out).

The hang occurs because there is nothing on the lru to deregister and ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the request on its rdma pending list and continues. If any message comes in, the rdma pending list is progressed. If not, it hangs indefinitely!

In general I have found the underlying cause of the hang is due to an imbalance of registrations between processes on a node, i.e., the hung process has an empty lru but other processes could deregister. I am working on a new mpool (grdma) to handle the imbalance. The new mpool will allow a process to request that one of its peers deregisters from its lru if possible. I have a working proof of concept implementation that uses a posix shmem segment and a progress function to handle signaling and dereferencing. With it I no longer see hangs with IMB Alltoall/Alltoallv on uGNI (without putting an artificial limit on the number of registrations). I will test the mpool on infiniband later today.

-Nathan

On Fri, 9 Mar 2012, Jeffrey Squyres wrote:

> George --
>
> I believe that this is the subject of a few long-standing tickets (i.e.,
> what to do when running out of registered memory -- right now, we hang, for
> a few reasons). I think that this is Mellanox's attempt to at least warn the
> user that we have run out of registered memory, and will therefore hang.
>
> Once the hangs have been fixed, I'm assuming this message can be removed.
>
> Note, too, that this is in the BTL registration code (openib_reg_mr), not in
> the directly-invoked-by-the-PML code. So it's the mpool's fault -- not the
> PML's fault.
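The retry-after-eviction behaviour and the grdma idea described in this message can be modelled with a small sketch. It makes several assumptions: registered memory is reduced to a node-wide slot counter, `example_reg_mem` merely stands in for `ibv_reg_mr` or `GNI_MemRegister`, and the peer eviction is a direct function call rather than the shared-memory segment and progress function of the actual proof of concept. All `example_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define EXAMPLE_SUCCESS 0
#define EXAMPLE_ERR_OUT_OF_RESOURCE (-1)

/* Hypothetical model: one budget of registrations shared by a node. */
struct example_node { int free_slots; };

struct example_mpool {
    struct example_node *node;
    int in_use;   /* registrations held by active transfers */
    int lru_len;  /* idle cached registrations that may be evicted */
};

/* Stand-in for the real registration call: fails when the budget is gone. */
static int example_reg_mem(struct example_node *node)
{
    if (node->free_slots <= 0) return EXAMPLE_ERR_OUT_OF_RESOURCE;
    node->free_slots--;
    return EXAMPLE_SUCCESS;
}

/* Deregister one idle entry from this process's lru; 1 if one was freed. */
static int example_evict_one(struct example_mpool *mp)
{
    if (mp->lru_len <= 0) return 0;
    mp->lru_len--;
    mp->node->free_slots++;
    return 1;
}

/* Register, evicting from the local lru and retrying on failure. */
static int example_mpool_register(struct example_mpool *mp)
{
    assert(NULL != mp && NULL != mp->node);
    for (;;) {
        if (EXAMPLE_SUCCESS == example_reg_mem(mp->node)) {
            mp->in_use++;
            return EXAMPLE_SUCCESS;
        }
        if (!example_evict_one(mp)) {
            return EXAMPLE_ERR_OUT_OF_RESOURCE;  /* local lru is empty */
        }
    }
}

/* grdma idea: when the local lru is empty, ask a peer on the node to evict. */
static int example_grdma_register(struct example_mpool *mp,
                                  struct example_mpool *peers, int npeers)
{
    for (;;) {
        int evicted = 0;
        if (EXAMPLE_SUCCESS == example_mpool_register(mp)) {
            return EXAMPLE_SUCCESS;
        }
        for (int i = 0; i < npeers && !evicted; ++i) {
            evicted = example_evict_one(&peers[i]);
        }
        if (!evicted) {
            return EXAMPLE_ERR_OUT_OF_RESOURCE;  /* nothing left anywhere */
        }
    }
}
```

The last return is the case Jeff raises elsewhere in the thread: memory is exhausted and nothing is left to deregister, so only a protocol fallback in the PML can help.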
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
George --

I believe that this is the subject of a few long-standing tickets (i.e., what to do when running out of registered memory -- right now, we hang, for a few reasons). I think that this is Mellanox's attempt to at least warn the user that we have run out of registered memory, and will therefore hang.

Once the hangs have been fixed, I'm assuming this message can be removed.

Note, too, that this is in the BTL registration code (openib_reg_mr), not in the directly-invoked-by-the-PML code. So it's the mpool's fault -- not the PML's fault.

On Mar 6, 2012, at 10:05 AM, George Bosilca wrote:

> I din't check thoroughly the code, but OMPI_ERR_OUT_OF_RESOURCES is not an
> error. If the registration returns out of resources, the BTL will returns
> OUT_OF_RESOURCE (as an example via the mca_btl_openib_prepare_src). At the
> upper level, the PML (in the mca_pml_ob1_send_request_start function)
> intercept it and insert the request into a pending list. Later on this
> pending list will be examined and the request for resource re-issued.
>
> Why do we need to trigger a BTL_ERROR for OUT_OF_RESOURCES?
>
>   george.

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
I didn't check the code thoroughly, but OMPI_ERR_OUT_OF_RESOURCE is not an error. If the registration returns out of resources, the BTL will return OUT_OF_RESOURCE (as an example, via mca_btl_openib_prepare_src). At the upper level, the PML (in the mca_pml_ob1_send_request_start function) intercepts it and inserts the request into a pending list. Later on this pending list will be examined and the request for resources re-issued.

Why do we need to trigger a BTL_ERROR for OUT_OF_RESOURCE?

  george.

On Mar 6, 2012, at 09:48 , Jeffrey Squyres wrote:

> Mike --
>
> I would make this a bit better of an error. I.e., use orte_show_help(), so
> you can explain the issue more, and also remove all duplicates (i.e., if it
> fails to register multiple times).
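The queue-and-re-issue flow George describes can be sketched as below. This is a toy model under stated assumptions, not the ob1 implementation: resources are a plain counter, the pending list is a fixed array, and all `example_` names are hypothetical.

```c
#include <assert.h>

#define EXAMPLE_SUCCESS 0
#define EXAMPLE_ERR_OUT_OF_RESOURCE (-1)
#define EXAMPLE_MAX_PENDING 8

struct example_pml {
    int resources;                     /* registrations available right now */
    int pending[EXAMPLE_MAX_PENDING];  /* ids of queued requests, oldest first */
    int npending;
    int started;                       /* requests that got their resources */
};

/* Stand-in for the BTL prepare step: reports a shortage, not a hard error. */
static int example_btl_prepare(struct example_pml *pml)
{
    if (pml->resources <= 0) return EXAMPLE_ERR_OUT_OF_RESOURCE;
    pml->resources--;
    return EXAMPLE_SUCCESS;
}

/* Start a request; on out-of-resource, queue it instead of failing. */
static int example_send_start(struct example_pml *pml, int req_id)
{
    int rc = example_btl_prepare(pml);
    if (EXAMPLE_SUCCESS == rc) {
        pml->started++;
        return rc;
    }
    if (pml->npending >= EXAMPLE_MAX_PENDING) return rc;
    pml->pending[pml->npending++] = req_id;
    return EXAMPLE_SUCCESS;  /* queued for later, so not reported as an error */
}

/* Re-issue queued requests, oldest first; returns how many were started. */
static int example_progress_pending(struct example_pml *pml)
{
    int count = 0;
    while (pml->npending > 0 && EXAMPLE_SUCCESS == example_btl_prepare(pml)) {
        for (int i = 1; i < pml->npending; ++i) {
            pml->pending[i - 1] = pml->pending[i];
        }
        pml->npending--;
        pml->started++;
        count++;
    }
    return count;
}
```

If nothing ever frees a resource, or nothing ever calls the progress function, the queued request stays queued, which is the hang discussed later in the thread.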
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
Mike --

I would make this a bit better of an error. I.e., use orte_show_help(), so you can explain the issue more, and also remove all duplicates (i.e., if it fails to register multiple times).

On Mar 6, 2012, at 8:25 AM, mi...@osl.iu.edu wrote:

> Author: miked
> Date: 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> New Revision: 26106
> URL: https://svn.open-mpi.org/trac/ompi/changeset/26106
>
> Log:
> print error which is ignored on upper layer
>
> Text files modified:
>    trunk/ompi/mca/btl/openib/btl_openib_component.c | 2 ++
>    1 files changed, 2 insertions(+), 0 deletions(-)
>
> Modified: trunk/ompi/mca/btl/openib/btl_openib_component.c
> ==
> --- trunk/ompi/mca/btl/openib/btl_openib_component.c (original)
> +++ trunk/ompi/mca/btl/openib/btl_openib_component.c 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> @@ -569,6 +569,8 @@
>      openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, access_flag);
>
>      if (NULL == openib_reg->mr) {
> +        BTL_ERROR(("%s: error pinning openib memory errno says %s",
> +                   __func__, strerror(errno)));
>          return OMPI_ERR_OUT_OF_RESOURCE;
>      }

--
Jeff Squyres
jsquy...@cisco.com
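The "remove all duplicates" part of the suggestion above can be illustrated with a small warn-once helper. To be clear about the assumptions: this is not orte_show_help() and does not use its API. The function name, the fixed-size table, and the message text are all made up for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define EXAMPLE_MAX_SEEN 16

/* errno values that have already been reported by this process */
static int example_seen[EXAMPLE_MAX_SEEN];
static int example_nseen = 0;

/* Hypothetical helper: print a fuller explanation the first time a given
 * errno is seen and stay quiet on repeats. Returns 1 if it printed and
 * 0 if the message was suppressed as a duplicate. */
static int example_warn_reg_failure_once(int err)
{
    for (int i = 0; i < example_nseen; ++i) {
        if (example_seen[i] == err) return 0;
    }
    /* If the table is full, new values are printed but not remembered. */
    if (example_nseen < EXAMPLE_MAX_SEEN) {
        example_seen[example_nseen++] = err;
    }
    fprintf(stderr,
            "memory registration failed: %s\n"
            "The process may have run out of registered memory.\n",
            strerror(err));
    return 1;
}
```

A repeated registration failure with the same errno then produces one message instead of one per attempt.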