Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
I tested my grdma mpool with the openib btl and IMB Alltoall/Alltoallv on a system that consistently hangs. If I give the connection module the ability to evict from the lru, grdma prevents both the out-of-registered-memory hang AND the problems creating QPs (due to exhaustion of registered memory). -Nathan
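For illustration, the "let the connection module evict from the lru" idea reduces to a retry loop like the minimal sketch below. mpool_grdma_evict_one() is an assumed name for the eviction hook, not the actual grdma interface; only ibv_create_qp() is real verbs API.

    #include <stdbool.h>
    #include <infiniband/verbs.h>

    /* Hypothetical eviction hook: returns true if one lru registration was
     * released, false if nothing was left to evict.  The real grdma
     * interface may look different. */
    bool mpool_grdma_evict_one(void *grdma_mpool);

    static struct ibv_qp *create_qp_with_eviction(struct ibv_pd *pd,
                                                  struct ibv_qp_init_attr *attr,
                                                  void *grdma_mpool)
    {
        do {
            struct ibv_qp *qp = ibv_create_qp(pd, attr);
            if (NULL != qp) {
                return qp;                    /* success */
            }
            /* QP creation failed, typically because registered/locked memory
             * is exhausted: release one lru registration and retry.  Give up
             * once there is nothing left to evict. */
        } while (mpool_grdma_evict_one(grdma_mpool));
        return NULL;
    }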
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
On Fri, 9 Mar 2012, George Bosilca wrote: On Mar 9, 2012, at 14:23 , Nathan Hjelm wrote: BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t instead of defining mca_mpool_blah_resources_t. The current design makes it impossible to support more than one mpool in a btl. I can delete a bunch of code if I can make a btl fall back on the rdma mpool if leave_pinned is not set. I guess you can name them as you like as long as you do the right cast to avoid compiler complaints. Why can't you support multiple mpools in the same BTL? Because if I include mpool_rdma.h and mpool_grdma.h (or mpool_sm.h) from the same file we get a name collision since all mpool components define mca_mpool_base_resources_t. -Nathan
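To make the collision concrete: every mpool header currently defines the same type name, so a btl cannot include two of them at once. The sketch below shows the shape of the problem and of the per-component rename Nathan is suggesting; the field names are placeholders for illustration only.

    /* mpool/rdma/mpool_rdma.h (today) */
    struct mca_mpool_base_resources_t {
        int rdma_specific_field;          /* placeholder for the rdma fields */
    };
    typedef struct mca_mpool_base_resources_t mca_mpool_base_resources_t;

    /* mpool/grdma/mpool_grdma.h (today) -- redefines the SAME name, so a
     * btl that includes both headers gets a redefinition error:
     *   struct mca_mpool_base_resources_t { ... };
     */

    /* Suggested instead: one uniquely named struct per component, e.g. */
    struct mca_mpool_grdma_resources_t {
        int grdma_specific_field;         /* placeholder for the grdma fields */
    };
    typedef struct mca_mpool_grdma_resources_t mca_mpool_grdma_resources_t;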
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
On Mar 9, 2012, at 14:23 , Nathan Hjelm wrote: > BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t > instead of defining mca_mpool_blah_resources_t. The current design makes it > impossible to support more than one mpool in a btl. I can delete a bunch of > code if I can make a btl fall back on the rdma mpool if leave_pinned is not > set. I guess you can name them as you like as long as you do the right cast to avoid compiler complaints. Why can't you support multiple mpools in the same BTL? george.
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
[Comment at bottom] >-----Original Message----- >From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] >On Behalf Of Nathan Hjelm >Sent: Friday, March 09, 2012 2:23 PM >To: Open MPI Developers >Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106 > >On Fri, 9 Mar 2012, Jeffrey Squyres wrote: > >> On Mar 9, 2012, at 1:32 PM, Nathan Hjelm wrote: >> >>> An mpool that is aware of local processes' lrus will solve the problem in most cases (all that I have seen) >> >> I agree -- don't let words in my emails make you think otherwise. I think this will fix "most" problems, but undoubtedly, some will still occur. >> >> What's your timeline for having this ready -- should it go to 1.5.5, or 1.6? >> >> More specifically: if it's imminent, and can go to v1.5, then the openib message is irrelevant and should not be used (and backed out of the trunk). If it's going to take a little bit, I'm ok leaving the message in v1.5.5 for now. > >I wrote the prototype yesterday (after finding that limiting the lru doesn't work for uGNI -- @256 pes we could only register ~1400 items instead of the 3600 max we saw @128). I should have a version ready for review next week and a final version by the end of the month. > >BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t instead of defining mca_mpool_blah_resources_t? The current design makes it impossible to support more than one mpool in a btl. I can delete a bunch of code if I can make a btl fall back on the rdma mpool if leave_pinned is not set. > >-Nathan I ran into this same issue when I wanted to use more than one mpool in a btl. I expected that there might be a base resource structure that was extended by each mpool. I talked with Jeff, and he told me (if I recall correctly) that the reason was that there is no common information in any of the mca_mpool_base_resources_t structures, so there was no need for a base structure. I do not think there is any reason we cannot do it as you suggest. [The one other place I have seen it done like this in the library is mca_btl_base_endpoint_t, which is defined differently for each BTL.]
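The "base resource structure extended by each mpool" alternative mentioned above would follow the usual base/derived convention; a rough sketch, with all field names invented purely for illustration:

    #include <stddef.h>

    /* Hypothetical common base that could live in mpool/base: */
    struct mca_mpool_base_resources_t {
        size_t sizeof_reg;                        /* invented common field */
    };
    typedef struct mca_mpool_base_resources_t mca_mpool_base_resources_t;

    /* Per-component extension -- safe to use alongside other mpools because
     * each component gets its own type name: */
    struct mca_mpool_grdma_resources_t {
        mca_mpool_base_resources_t super;         /* base must come first */
        const char *pool_name;                    /* invented grdma field */
    };
    typedef struct mca_mpool_grdma_resources_t mca_mpool_grdma_resources_t;

    /* A caller holding the base pointer downcasts once the component is known:
     *   mca_mpool_grdma_resources_t *res =
     *       (mca_mpool_grdma_resources_t *) base_resources;
     */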
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
>> Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long, and this is not a regression). Keep in mind that the problem has been around for *a long, long time*, which is why I approved the diag message (i.e., because a real solution is still nowhere in sight). The real issue is that we can still run out of registered memory *and there is nothing left to deregister*. The real solution there is that the PML should fall back to a different protocol, but I'm told that doesn't happen and will require a bunch of work to make work properly. > > An mpool that is aware of local processes' lrus will solve the problem in most cases (all that I have seen), but yes, we need to rework the pml to handle the remaining cases. There are two things that need to be changed (from what I can tell): > > 1) allow rget to fall back to send/put depending on the failure (I have fallback on put implemented in my branch -- and in my btl). > 2) need to devise new criteria on when we should progress the rdma_pending list to avoid deadlock. > > #1 is fairly simple and I haven't given much thought to #2. But #1 will be a good start in the right direction. Agree about #2. > > -Nathan
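For #1, the shape of the fallback is just a cascade of protocol attempts. The sketch below is schematic only; the helper names are stand-ins, not the actual ob1 functions.

    #include "ompi/constants.h"

    /* Stand-in declarations -- not the real ob1 symbols. */
    typedef struct large_request large_request_t;
    int try_rget(large_request_t *req);            /* RDMA get protocol        */
    int try_put(large_request_t *req);             /* RDMA put pipeline        */
    int start_send_protocol(large_request_t *req); /* copy-in/copy-out fallback */

    /* Idea behind #1: if registering the buffer for an rget fails, fall back
     * to put, and if that also fails fall back to the send protocol instead
     * of parking the request and hoping something else drives progress. */
    int start_large_transfer(large_request_t *req)
    {
        if (OMPI_SUCCESS == try_rget(req)) {
            return OMPI_SUCCESS;
        }
        if (OMPI_SUCCESS == try_put(req)) {
            return OMPI_SUCCESS;
        }
        return start_send_protocol(req);
    }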
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
On Fri, 9 Mar 2012, Jeffrey Squyres wrote: On Mar 9, 2012, at 1:32 PM, Nathan Hjelm wrote: An mpool that is aware of local processes' lrus will solve the problem in most cases (all that I have seen) I agree -- don't let words in my emails make you think otherwise. I think this will fix "most" problems, but undoubtedly, some will still occur. What's your timeline for having this ready -- should it go to 1.5.5, or 1.6? More specifically: if it's imminent, and can go to v1.5, then the openib message is irrelevant and should not be used (and backed out of the trunk). If it's going to take a little bit, I'm ok leaving the message in v1.5.5 for now. I wrote the prototype yesterday (after finding that limiting the lru doesn't work for uGNI -- @256 pes we could only register ~1400 items instead of the 3600 max we saw @128). I should have a version ready for review next week and a final version by the end of the month. BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t instead of defining mca_mpool_blah_resources_t? The current design makes it impossible to support more than one mpool in a btl. I can delete a bunch of code if I can make a btl fall back on the rdma mpool if leave_pinned is not set. -Nathan
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
On Mar 9, 2012, at 1:32 PM, Nathan Hjelm wrote: > An mpool that is aware of local processes' lrus will solve the problem in most cases (all that I have seen) I agree -- don't let words in my emails make you think otherwise. I think this will fix "most" problems, but undoubtedly, some will still occur. What's your timeline for having this ready -- should it go to 1.5.5, or 1.6? More specifically: if it's imminent, and can go to v1.5, then the openib message is irrelevant and should not be used (and backed out of the trunk). If it's going to take a little bit, I'm ok leaving the message in v1.5.5 for now. > but yes, we need to rework the pml to handle the remaining cases. There are two things that need to be changed (from what I can tell): > > 1) allow rget to fall back to send/put depending on the failure (I have fallback on put implemented in my branch -- and in my btl). > 2) need to devise new criteria on when we should progress the rdma_pending list to avoid deadlock. > > #1 is fairly simple and I haven't given much thought to #2. > > -Nathan -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
On Fri, 9 Mar 2012, Jeffrey Squyres wrote: On Mar 9, 2012, at 1:14 PM, George Bosilca wrote: The hang occurs because there is nothing on the lru to deregister and ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the request on its rdma pending list and continues. If any message comes in, the rdma pending list is progressed. If not, it hangs indefinitely! Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood, then there _is_ a fix, and that fix should be the target of any efforts. The fix that Nathan proposes is not a complete fix -- we can still run out of memory and hang. You should read the open tickets and prior emails we have sent about this -- Nathan's fix merely delays when we will run out of registered memory. It does not solve the underlying problem. Correct. In general I have found the underlying cause of the hang is due to an imbalance of registrations between processes on a node, i.e. the hung process has an empty lru but other processes could deregister. I am working on a new mpool (grdma) to handle the imbalance. The new mpool will allow a process to request that one of its peers deregisters from its lru if possible. I have a working proof of concept implementation that uses a posix shmem segment and a progress function to handle signaling and dereferencing. With it I no longer see hangs with IMB Alltoall/Alltoallv on uGNI (without putting an artificial limit on the number of registrations). I will test the mpool on infiniband later today. If a solution already exists I don't see why we have to have the message code. Based on its urgency, I'm confident your patch will make its way into the 1.5 quite easily. Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long, and this is not a regression). Keep in mind that the problem has been around for *a long, long time*, which is why I approved the diag message (i.e., because a real solution is still nowhere in sight). The real issue is that we can still run out of registered memory *and there is nothing left to deregister*. The real solution there is that the PML should fall back to a different protocol, but I'm told that doesn't happen and will require a bunch of work to make work properly. An mpool that is aware of local processes' lrus will solve the problem in most cases (all that I have seen), but yes, we need to rework the pml to handle the remaining cases. There are two things that need to be changed (from what I can tell): 1) allow rget to fall back to send/put depending on the failure (I have fallback on put implemented in my branch -- and in my btl). 2) need to devise new criteria on when we should progress the rdma_pending list to avoid deadlock. #1 is fairly simple and I haven't given much thought to #2. -Nathan
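The shmem signaling described above needs very little machinery. The following is only a minimal sketch of the idea (one "please evict" flag per local rank), assuming a simple POSIX shm segment; it is not the actual grdma implementation and omits error handling, atomics, and jobid-unique naming:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* One flag per local rank: a peer that fails to register sets the flag,
     * and the owner checks it from its progress function and deregisters one
     * lru entry when asked. */
    static volatile int *evict_flags;   /* shared array, one int per local rank */

    int setup_eviction_segment(const char *name, int nlocal)
    {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0) return -1;
        if (ftruncate(fd, nlocal * sizeof(int)) < 0) return -1;
        evict_flags = mmap(NULL, nlocal * sizeof(int),
                           PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return (MAP_FAILED == evict_flags) ? -1 : 0;
    }

    /* called by a starved process against a peer's slot */
    void request_eviction(int peer_local_rank)
    {
        evict_flags[peer_local_rank] = 1;
    }

    /* called from my own progress function */
    void check_eviction_requests(int my_local_rank)
    {
        if (evict_flags[my_local_rank]) {
            evict_flags[my_local_rank] = 0;
            /* deregister one entry from my lru here, if any */
        }
    }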
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
On Mar 9, 2012, at 1:14 PM, George Bosilca wrote: >> The hang occurs because there is nothing on the lru to deregister and ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the request on its rdma pending list and continues. If any message comes in, the rdma pending list is progressed. If not, it hangs indefinitely! > Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood, then there _is_ a fix, and that fix should be the target of any efforts. The fix that Nathan proposes is not a complete fix -- we can still run out of memory and hang. You should read the open tickets and prior emails we have sent about this -- Nathan's fix merely delays when we will run out of registered memory. It does not solve the underlying problem. >> In general I have found the underlying cause of the hang is due to an imbalance of registrations between processes on a node, i.e. the hung process has an empty lru but other processes could deregister. I am working on a new mpool (grdma) to handle the imbalance. The new mpool will allow a process to request that one of its peers deregisters from its lru if possible. I have a working proof of concept implementation that uses a posix shmem segment and a progress function to handle signaling and dereferencing. With it I no longer see hangs with IMB Alltoall/Alltoallv on uGNI (without putting an artificial limit on the number of registrations). I will test the mpool on infiniband later today. > If a solution already exists I don't see why we have to have the message code. Based on its urgency, I'm confident your patch will make its way into the 1.5 quite easily. Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long, and this is not a regression). Keep in mind that the problem has been around for *a long, long time*, which is why I approved the diag message (i.e., because a real solution is still nowhere in sight). The real issue is that we can still run out of registered memory *and there is nothing left to deregister*. The real solution there is that the PML should fall back to a different protocol, but I'm told that doesn't happen and will require a bunch of work to make work properly. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
On Mar 9, 2012, at 12:59 , Nathan Hjelm wrote: > Not exactly, the PML invokes the mpool which invokes the registration function. If registration fails the mpool will deregister from its lru (if possible) and try again. So, it is not an error if ibv_reg_mr fails unless it fails because the process is starved of registered memory (or has truly run out). > > The hang occurs because there is nothing on the lru to deregister and ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the request on its rdma pending list and continues. If any message comes in, the rdma pending list is progressed. If not, it hangs indefinitely! Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood, then there _is_ a fix, and that fix should be the target of any efforts. > In general I have found the underlying cause of the hang is due to an imbalance of registrations between processes on a node, i.e. the hung process has an empty lru but other processes could deregister. I am working on a new mpool (grdma) to handle the imbalance. The new mpool will allow a process to request that one of its peers deregisters from its lru if possible. I have a working proof of concept implementation that uses a posix shmem segment and a progress function to handle signaling and dereferencing. With it I no longer see hangs with IMB Alltoall/Alltoallv on uGNI (without putting an artificial limit on the number of registrations). I will test the mpool on infiniband later today. If a solution already exists I don't see why we have to have the message code. Based on its urgency, I'm confident your patch will make its way into the 1.5 quite easily. george. > > -Nathan > > On Fri, 9 Mar 2012, Jeffrey Squyres wrote: > >> George -- >> >> I believe that this is the subject of a few long-standing tickets (i.e., what to do when running out of registered memory -- right now, we hang, for a few reasons). I think that this is Mellanox's attempt to at least warn the user that we have run out of registered memory, and will therefore hang. >> >> Once the hangs have been fixed, I'm assuming this message can be removed. >> >> Note, too, that this is in the BTL registration code (openib_reg_mr), not in the directly-invoked-by-the-PML code. So it's the mpool's fault -- not the PML's fault. >> >> On Mar 6, 2012, at 10:05 AM, George Bosilca wrote: >>> I didn't check thoroughly the code, but OMPI_ERR_OUT_OF_RESOURCES is not an error. If the registration returns out of resources, the BTL will return OUT_OF_RESOURCE (as an example via mca_btl_openib_prepare_src). At the upper level, the PML (in the mca_pml_ob1_send_request_start function) intercepts it and inserts the request into a pending list. Later on this pending list will be examined and the request for resources re-issued. >>> >>> Why do we need to trigger a BTL_ERROR for OUT_OF_RESOURCES? >>> >>> george. >>> >>> On Mar 6, 2012, at 09:48 , Jeffrey Squyres wrote: >>>> Mike -- I would make this a bit better of an error. I.e., use orte_show_help(), so you can explain the issue more, and also remove all duplicates (i.e., if it fails to register multiple times).
On Mar 6, 2012, at 8:25 AM, mi...@osl.iu.edu wrote:
> Author: miked
> Date: 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> New Revision: 26106
> URL: https://svn.open-mpi.org/trac/ompi/changeset/26106
>
> Log:
> print error which is ignored on upper layer
> Text files modified:
>    trunk/ompi/mca/btl/openib/btl_openib_component.c | 2 ++
>    1 files changed, 2 insertions(+), 0 deletions(-)
>
> Modified: trunk/ompi/mca/btl/openib/btl_openib_component.c
> ==============================================================================
> --- trunk/ompi/mca/btl/openib/btl_openib_component.c (original)
> +++ trunk/ompi/mca/btl/openib/btl_openib_component.c 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> @@ -569,6 +569,8 @@
>      openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, access_flag);
>
>      if (NULL == openib_reg->mr) {
> +        BTL_ERROR(("%s: error pinning openib memory errno says %s",
> +                   __func__, strerror(errno)));
>          return OMPI_ERR_OUT_OF_RESOURCE;
>      }
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
Not exactly, the PML invokes the mpool which invokes the registration function. If registration fails the mpool will deregister from its lru (if possible) and try again. So, it is not an error if ibv_reg_mr fails unless it fails because the process is starved of registered memory (or has truly run out). The hang occurs because there is nothing on the lru to deregister and ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the request on its rdma pending list and continues. If any message comes in, the rdma pending list is progressed. If not, it hangs indefinitely! In general I have found the underlying cause of the hang is due to an imbalance of registrations between processes on a node, i.e. the hung process has an empty lru but other processes could deregister. I am working on a new mpool (grdma) to handle the imbalance. The new mpool will allow a process to request that one of its peers deregisters from its lru if possible. I have a working proof of concept implementation that uses a posix shmem segment and a progress function to handle signaling and dereferencing. With it I no longer see hangs with IMB Alltoall/Alltoallv on uGNI (without putting an artificial limit on the number of registrations). I will test the mpool on infiniband later today. -Nathan On Fri, 9 Mar 2012, Jeffrey Squyres wrote: George -- I believe that this is the subject of a few long-standing tickets (i.e., what to do when running out of registered memory -- right now, we hang, for a few reasons). I think that this is Mellanox's attempt to at least warn the user that we have run out of registered memory, and will therefore hang. Once the hangs have been fixed, I'm assuming this message can be removed. Note, too, that this is in the BTL registration code (openib_reg_mr), not in the directly-invoked-by-the-PML code. So it's the mpool's fault -- not the PML's fault. On Mar 6, 2012, at 10:05 AM, George Bosilca wrote: I didn't check thoroughly the code, but OMPI_ERR_OUT_OF_RESOURCES is not an error. If the registration returns out of resources, the BTL will return OUT_OF_RESOURCE (as an example via mca_btl_openib_prepare_src). At the upper level, the PML (in the mca_pml_ob1_send_request_start function) intercepts it and inserts the request into a pending list. Later on this pending list will be examined and the request for resources re-issued. Why do we need to trigger a BTL_ERROR for OUT_OF_RESOURCES? george. On Mar 6, 2012, at 09:48 , Jeffrey Squyres wrote: Mike -- I would make this a bit better of an error. I.e., use orte_show_help(), so you can explain the issue more, and also remove all duplicates (i.e., if it fails to register multiple times).
On Mar 6, 2012, at 8:25 AM, mi...@osl.iu.edu wrote:
Author: miked
Date: 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
New Revision: 26106
URL: https://svn.open-mpi.org/trac/ompi/changeset/26106
Log: print error which is ignored on upper layer
Text files modified:
   trunk/ompi/mca/btl/openib/btl_openib_component.c | 2 ++
   1 files changed, 2 insertions(+), 0 deletions(-)
Modified: trunk/ompi/mca/btl/openib/btl_openib_component.c
==============================================================================
--- trunk/ompi/mca/btl/openib/btl_openib_component.c (original)
+++ trunk/ompi/mca/btl/openib/btl_openib_component.c 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
@@ -569,6 +569,8 @@
     openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, access_flag);

     if (NULL == openib_reg->mr) {
+        BTL_ERROR(("%s: error pinning openib memory errno says %s",
+                   __func__, strerror(errno)));
         return OMPI_ERR_OUT_OF_RESOURCE;
     }
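The local flow Nathan describes (try to register, evict from the lru on failure, retry) boils down to the loop below. This is a paraphrased sketch in C, not the actual mpool source; register_with_btl() and deregister_lru_entry() are stand-in names.

    #include <stddef.h>
    #include "ompi/constants.h"

    /* Stand-in helpers -- not the real mpool functions. */
    int register_with_btl(void *base, size_t size, void **reg_out); /* e.g. wraps ibv_reg_mr */
    int deregister_lru_entry(void);   /* 0 on success, nonzero if the lru is empty */

    int mpool_register_sketch(void *base, size_t size, void **reg_out)
    {
        for (;;) {
            if (0 == register_with_btl(base, size, reg_out)) {
                return OMPI_SUCCESS;
            }
            if (0 != deregister_lru_entry()) {
                /* nothing left to evict locally -- the caller (PML) queues
                 * the request on its rdma pending list, which is where the
                 * hang can start if no other traffic drives progress */
                return OMPI_ERR_OUT_OF_RESOURCE;
            }
        }
    }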
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
George -- I believe that this is the subject of a few long-standing tickets (i.e., what to do when running out of registered memory -- right now, we hang, for a few reasons). I think that this is Mellanox's attempt to at least warn the user that we have run out of registered memory, and will therefore hang. Once the hangs have been fixed, I'm assuming this message can be removed. Note, too, that this is in the BTL registration code (openib_reg_mr), not in the directly-invoked-by-the-PML code. So it's the mpool's fault -- not the PML's fault. On Mar 6, 2012, at 10:05 AM, George Bosilca wrote: > I didn't check thoroughly the code, but OMPI_ERR_OUT_OF_RESOURCES is not an error. If the registration returns out of resources, the BTL will return OUT_OF_RESOURCE (as an example via mca_btl_openib_prepare_src). At the upper level, the PML (in the mca_pml_ob1_send_request_start function) intercepts it and inserts the request into a pending list. Later on this pending list will be examined and the request for resources re-issued. > > Why do we need to trigger a BTL_ERROR for OUT_OF_RESOURCES? > > george. > > On Mar 6, 2012, at 09:48 , Jeffrey Squyres wrote: > > Mike -- > > I would make this a bit better of an error. I.e., use orte_show_help(), so you can explain the issue more, and also remove all duplicates (i.e., if it fails to register multiple times). > > On Mar 6, 2012, at 8:25 AM, mi...@osl.iu.edu wrote:
> >> Author: miked
> >> Date: 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> >> New Revision: 26106
> >> URL: https://svn.open-mpi.org/trac/ompi/changeset/26106
> >>
> >> Log:
> >> print error which is ignored on upper layer
> >> Text files modified:
> >>    trunk/ompi/mca/btl/openib/btl_openib_component.c | 2 ++
> >>    1 files changed, 2 insertions(+), 0 deletions(-)
> >>
> >> Modified: trunk/ompi/mca/btl/openib/btl_openib_component.c
> >> ==============================================================================
> >> --- trunk/ompi/mca/btl/openib/btl_openib_component.c (original)
> >> +++ trunk/ompi/mca/btl/openib/btl_openib_component.c 2012-03-06 09:25:56 EST (Tue, 06 Mar 2012)
> >> @@ -569,6 +569,8 @@
> >>     openib_reg->mr = ibv_reg_mr(device->ib_pd, base, size, access_flag);
> >>
> >>     if (NULL == openib_reg->mr) {
> >> +        BTL_ERROR(("%s: error pinning openib memory errno says %s",
> >> +                   __func__, strerror(errno)));
> >>         return OMPI_ERR_OUT_OF_RESOURCE;
> >>     }
-- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
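The OUT_OF_RESOURCE handling George describes follows the usual "queue and retry from progress" pattern. A simplified illustration, not the literal ob1 code (send_pending and try_start() are stand-ins):

    #include "ompi/constants.h"
    #include "opal/class/opal_list.h"

    /* In real code the list is OBJ_CONSTRUCTed and the items are send
     * requests; these names are stand-ins for illustration. */
    static opal_list_t send_pending;
    int try_start(opal_list_item_t *sendreq);   /* returns an OMPI_* status */

    /* OUT_OF_RESOURCE is not treated as an error: the request is parked. */
    static int start_or_queue(opal_list_item_t *sendreq)
    {
        int rc = try_start(sendreq);
        if (OMPI_ERR_OUT_OF_RESOURCE == rc) {
            opal_list_append(&send_pending, sendreq);
            rc = OMPI_SUCCESS;
        }
        return rc;
    }

    /* ...and retried from the progress path.  Note this only runs when
     * something else drives progress, which is exactly how the reported
     * hang can occur. */
    static void retry_pending(void)
    {
        opal_list_item_t *item;
        while (NULL != (item = opal_list_remove_first(&send_pending))) {
            if (OMPI_ERR_OUT_OF_RESOURCE == try_start(item)) {
                opal_list_append(&send_pending, item);   /* still starved */
                break;
            }
        }
    }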
Re: [OMPI devel] poor btl sm latency
I just made an interesting observation: When binding the processes to two neighboring cores (L2 sharing) NetPIPE shows *sometimes* pretty good results: ~0.5us $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 using object #0 depth 6 below cpuset 0x,0x using object #1 depth 6 below cpuset 0x,0x adding 0x0001 to 0x0 adding 0x0001 to 0x0 assuming the command starts at ./NPmpi_ompi1.5.5 binding on cpu set 0x0001 adding 0x0002 to 0x0 adding 0x0002 to 0x0 assuming the command starts at ./NPmpi_ompi1.5.5 binding on cpu set 0x0002 Using no perturbations 0: n035 Using no perturbations 1: n035 Now starting the main loop 0: 1 bytes 10 times --> 6.01 Mbps in 1.27 usec 1: 2 bytes 10 times --> 12.04 Mbps in 1.27 usec 2: 3 bytes 10 times --> 18.07 Mbps in 1.27 usec 3: 4 bytes 10 times --> 24.13 Mbps in 1.26 usec $ mpirun -mca btl sm,self -np 1 hwloc-bind -v core:0 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 : -np 1 hwloc-bind -v core:1 ./NPmpi_ompi1.5.5 -u 4 -n 10 -p 0 using object #0 depth 6 below cpuset 0x,0x adding 0x0001 to 0x0 adding 0x0001 to 0x0 assuming the command starts at ./NPmpi_ompi1.5.5 binding on cpu set 0x0001 using object #1 depth 6 below cpuset 0x,0x adding 0x0002 to 0x0 adding 0x0002 to 0x0 assuming the command starts at ./NPmpi_ompi1.5.5 binding on cpu set 0x0002 Using no perturbations 0: n035 Using no perturbations 1: n035 Now starting the main loop 0: 1 bytes 10 times --> 12.96 Mbps in 0.59 usec 1: 2 bytes 10 times --> 25.78 Mbps in 0.59 usec 2: 3 bytes 10 times --> 38.62 Mbps in 0.59 usec 3: 4 bytes 10 times --> 52.88 Mbps in 0.58 usec I can reproduce that approximately every tenth run. When binding the processes for exclusive L2 caches (e.g. core 0 and 2) I get constant latencies ~1.1us Matthias On Monday 05 March 2012 09:52:39 Matthias Jurenz wrote: > Here the SM BTL parameters: > > $ ompi_info --param btl sm > MCA btl: parameter "btl_base_verbose" (current value: <0>, data source: > default value) Verbosity level of the BTL framework > MCA btl: parameter "btl" (current value: , data source: > file > [/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf]) > Default selection set of components for the btl framework ( means > use all components that can be found) > MCA btl: information "btl_sm_have_knem_support" (value: <1>, data source: > default value) Whether this component supports the knem Linux kernel module > or not > MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source: > default value) Whether knem support is desired or not (negative = try to > enable knem support, but continue even if it is not available, 0 = do not > enable knem support, positive = try to enable knem support and fail if it > is not available) > MCA btl: parameter "btl_sm_knem_dma_min" (current value: <0>, data source: > default value) Minimum message size (in bytes) to use the knem DMA mode; > ignored if knem does not support DMA mode (0 = do not use the knem DMA > mode) MCA btl: parameter "btl_sm_knem_max_simultaneous" (current value: > <0>, data source: default value) Max number of simultaneous ongoing knem > operations to support (0 = do everything synchronously, which probably > gives the best large message latency; >0 means to do all operations > asynchronously, which supports better overlap for simultaneous large > message sends) > MCA btl: parameter "btl_sm_free_list_num" (current value: <8>, data source: > default value) > MCA btl: parameter "btl_sm_free_list_max" 
(current value: <-1>, data > source: default value) > MCA btl: parameter "btl_sm_free_list_inc" (current value: <64>, data > source: default value) > MCA btl: parameter "btl_sm_max_procs" (current value: <-1>, data source: > default value) > MCA btl: parameter "btl_sm_mpool" (current value: , data source: > default value) > MCA btl: parameter "btl_sm_fifo_size" (current value: <4096>, data source: > default value) > MCA btl: parameter "btl_sm_num_fifos" (current value: <1>, data source: > default value) > MCA btl: parameter "btl_sm_fifo_lazy_free" (current value: <120>, data > source: default value) > MCA btl: parameter "btl_sm_sm_extra_procs" (current value: <0>, data > source: default value) > MCA btl: parameter "btl_sm_exclusivity" (current value: <65535>, data > source: default value) BTL exclusivity (must be >= 0) > MCA btl: parameter "btl_sm_flags" (current value: <5>, data source: default > value) BTL bit flags (general flags: SEND=1, PUT=2, GET=4, SEND_INPLACE=8, > RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only used by the "dr" PML > (ignored by others): ACK=16, CHECKSUM=32, RDMA_COMPLETION=128; flags
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26118
Fixed in r26122. I tested locally with the ibm test suite, and it looks good. MTT should highlight if there are any other issues - but I doubt there will be. -- Josh On Thu, Mar 8, 2012 at 5:16 PM, Josh Hursey wrote: > Good point (I did not even look at ompi_comm_compare, I was using this for something else). I'll take a pass at converting ompi_comm_compare to use the ompi_group_compare functionality - it is good code reuse. > > Thanks, > Josh > > On Thu, Mar 8, 2012 at 4:08 PM, George Bosilca wrote: > >> Josh, >> >> Open MPI already has a similar function in the communicator part, a function that is not exposed to the upper layer. I think that using the code in ompi_comm_compare (the second part, which compares groups) is sound. Moreover, now that we have an ompi_group_compare function, you should use it in ompi_comm_compare to ease the readability of the code. >> >> Regards, >> george. >> >> On Mar 8, 2012, at 16:57 , jjhur...@osl.iu.edu wrote: >> > Author: jjhursey >> > Date: 2012-03-08 16:57:45 EST (Thu, 08 Mar 2012) >> > New Revision: 26118 >> > URL: https://svn.open-mpi.org/trac/ompi/changeset/26118 >> > Log: Abstract MPI_Group_compare to an OMPI function for internal use (point the MPI interface to the internal function). -- Joshua Hursey Postdoctoral Research Associate Oak Ridge National Laboratory http://users.nccs.gov/~jjhursey
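The reuse being discussed amounts to something like the sketch below. It is deliberately simplified: the internal signatures and the accessor are assumptions, intercommunicators are ignored, and the real ompi_comm_compare also has its own handle/context checks.

    #include "mpi.h"

    /* Assumed internal signatures -- the real ones in r26118/r26122 may differ. */
    struct ompi_communicator_t;
    struct ompi_group_t;
    int ompi_group_compare(struct ompi_group_t *g1, struct ompi_group_t *g2,
                           int *result);
    /* stand-in accessor for the communicator's local group */
    struct ompi_group_t *comm_local_group(struct ompi_communicator_t *comm);

    int comm_compare_sketch(struct ompi_communicator_t *c1,
                            struct ompi_communicator_t *c2, int *result)
    {
        int grp;

        if (c1 == c2) {                 /* same handle */
            *result = MPI_IDENT;
            return MPI_SUCCESS;
        }

        /* delegate the group part instead of duplicating it */
        ompi_group_compare(comm_local_group(c1), comm_local_group(c2), &grp);

        /* identical groups but distinct communicators => congruent;
         * same members in a different order => similar; otherwise unequal */
        *result = (MPI_IDENT   == grp) ? MPI_CONGRUENT :
                  (MPI_SIMILAR == grp) ? MPI_SIMILAR   : MPI_UNEQUAL;
        return MPI_SUCCESS;
    }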
Re: [OMPI devel] MCA BTL Fragment lists
On Mar 9, 2012, at 08:38 , Alex Margolin wrote: > Hi, > > I'm implementing a new BTL component, and > > 1. I read the TCP code and ran into the three fragment lists: > > /* free list of fragment descriptors */ > ompi_free_list_t tcp_frag_eager; > ompi_free_list_t tcp_frag_max; > ompi_free_list_t tcp_frag_user; > > I've looked it up, and found that the documentation for OpenIB refers to the eager term as (in short) the first chunk of a long message, after which the buffer is registered and in the meanwhile chunks from the end of the buffer (beyond a limit much higher than the eager limit) are sent. I didn't find any references relevant to plain TCP. I'm not sure I understand how this is applicable with TCP (and I've seen it in other components as well). For a long message - why would I treat chunks separately? An eager fragment can be received by the peer eagerly (this means without the corresponding receive posted). This is not the case for larger fragments. > In the TCP BTL code, when the fragment is created - shorter chunks are sent to eager while the rest are sent to max. Were the two lists treated differently? > > Thanks, > Alex > > P.S. what is the role of mca_btl_*_component_control()? Amazing, that's an archeological piece of Open MPI history. Fixed in r26121. george.
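To make the distinction concrete: a btl typically picks the free list by payload size when it builds a descriptor, roughly as below. The module and field names here are made up (modeled on the tcp btl); btl_eager_limit and btl_max_send_size are the standard base-module fields behind the usual MCA parameters.

    #include <stddef.h>
    #include "ompi/class/ompi_free_list.h"
    #include "ompi/mca/btl/btl.h"

    /* Hypothetical btl module layout modeled on the tcp btl. */
    struct mca_btl_mybtl_module_t {
        mca_btl_base_module_t super;   /* carries btl_eager_limit etc. */
        ompi_free_list_t frag_eager;   /* small frags, deliverable without a posted recv */
        ompi_free_list_t frag_max;     /* chunks of a pipelined long message */
        ompi_free_list_t frag_user;    /* descriptors pointing at user buffers */
    };

    static ompi_free_list_t *pick_frag_list(struct mca_btl_mybtl_module_t *btl,
                                            size_t payload_len)
    {
        if (payload_len <= btl->super.btl_eager_limit) {
            return &btl->frag_eager;
        }
        if (payload_len <= btl->super.btl_max_send_size) {
            return &btl->frag_max;
        }
        return &btl->frag_user;
    }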
[OMPI devel] MCA BTL Fragment lists
Hi, I'm implementing a new BTL component, and 1. I read the TCP code and ran into the three fragment lists: /* free list of fragment descriptors */ ompi_free_list_t tcp_frag_eager; ompi_free_list_t tcp_frag_max; ompi_free_list_t tcp_frag_user; I've looked it up, and found that the documentation for OpenIB refers to the eager term as (in short) the first chunk of a long message, after which the buffer is registered and in the meanwhile chunks from the end of the buffer (beyond a limit much higher than the eager limit) are sent. I didn't find any references relevant to plain TCP. I'm not sure I understand how this is applicable with TCP (and I've seen it in other components as well). For a long message - why would I treat chunks separately? In the TCP BTL code, when the fragment is created - shorter chunks are sent to eager while the rest are sent to max. Were the two lists treated differently? Thanks, Alex P.S. what is the role of mca_btl_*_component_control()?