I would like to understand this more. Let's talk about it tomorrow on the
weekly teleconf.
On Mar 9, 2012, at 5:55 PM, Nathan Hjelm wrote:
> I tested my grdma mpool with the openib btl and IMB Alltoall/Alltoallv on a
> system that consistently hangs. If I give the connection module the ability
I tested my grdma mpool with the openib btl and IMB Alltoall/Alltoallv on a
system that consistently hangs. If I give the connection module the ability to
evict from the lru, grdma prevents both the out of registered memory hang AND
problems creating QPs (due to exhaustion of registered memory).
On Fri, 9 Mar 2012, George Bosilca wrote:
On Mar 9, 2012, at 14:23 , Nathan Hjelm wrote:
BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t
instead of defining mca_mpool_blah_resources_t. The current design makes it
impossible to support more than one mpool in a btl
On Mar 9, 2012, at 14:23 , Nathan Hjelm wrote:
> BTW, can anyone tell me why each mpool defines mca_mpool_base_resources_t
> instead of defining mca_mpool_blah_resources_t. The current design makes it
> impossible to support more than one mpool in a btl. I can delete a bunch of
> code if I can
[Comment at bottom]
>-----Original Message-----
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Nathan Hjelm
>Sent: Friday, March 09, 2012 2:23 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
>
>
>> Depending on the timing, this might go to 1.6 (1.5.5 has waited for too
>> long, and this is not a regression). Keep in mind that the problem has been
>> around for *a long, long time*, which is why I approved the diag message
>> (i.e., because a real solution is still nowhere in sight). Th
On Fri, 9 Mar 2012, Jeffrey Squyres wrote:
On Mar 9, 2012, at 1:32 PM, Nathan Hjelm wrote:
An mpool that is aware of local processes lru's will solve the problem in most
cases (all that I have seen)
I agree -- don't let words in my emails make you think otherwise. I think this will fix
On Mar 9, 2012, at 1:32 PM, Nathan Hjelm wrote:
> An mpool that is aware of local processes lru's will solve the problem in
> most cases (all that I have seen)
I agree -- don't let words in my emails make you think otherwise. I think this
will fix "most" problems, but undoubtedly, some will st
On Fri, 9 Mar 2012, Jeffrey Squyres wrote:
On Mar 9, 2012, at 1:14 PM, George Bosilca wrote:
The hang occurs because there is nothing on the lru to deregister and
ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the
request on its rdma pending list and continues. I
On Mar 9, 2012, at 1:14 PM, George Bosilca wrote:
>> The hang occurs because there is nothing on the lru to deregister and
>> ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts
>> the request on its rdma pending list and continues. If any message comes in
>> the rdma pend
On Mar 9, 2012, at 12:59 , Nathan Hjelm wrote:
> Not exactly, the PML invokes the mpool which invokes the registration
> function. If registration fails the mpool will deregister from its lru (if
> possible) and try again. So, it is not an error if ibv_reg_mr fails unless it
> fails because th
Not exactly, the PML invokes the mpool which invokes the registration function.
If registration fails the mpool will deregister from its lru (if possible) and
try again. So, it is not an error if ibv_reg_mr fails unless it fails because
the process is starved of registered memory (or truly run
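
The evict-and-retry behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions: every name here (sketch_mpool_t, sketch_reg_mr, sketch_evict_one, SKETCH_MAX_PINNED) is hypothetical and is not part of Open MPI or the verbs API, and the "registration" is a simple counter standing in for ibv_reg_mr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model only: none of these names exist in Open MPI. */
#define SKETCH_MAX_PINNED 4          /* pretend limit on registered regions */

typedef enum { SKETCH_SUCCESS = 0, SKETCH_OUT_OF_RESOURCE = -1 } sketch_rc_t;

typedef struct {
    int pinned;   /* regions currently registered (in use plus idle) */
    int lru_len;  /* idle registrations cached on the lru, i.e. evictable */
} sketch_mpool_t;

/* Stand-in for the real registration call; fails once the limit is hit. */
static bool sketch_reg_mr(sketch_mpool_t *mp)
{
    if (mp->pinned >= SKETCH_MAX_PINNED) {
        return false;
    }
    mp->pinned++;
    return true;
}

/* Deregister one idle entry from the lru, if there is one. */
static bool sketch_evict_one(sketch_mpool_t *mp)
{
    if (mp->lru_len == 0) {
        return false;
    }
    assert(mp->pinned > 0);
    mp->lru_len--;
    mp->pinned--;
    return true;
}

/* Try to register; on failure evict from the lru and try again.
 * Only when nothing is left to evict is the failure reported upward. */
static sketch_rc_t sketch_mpool_register(sketch_mpool_t *mp)
{
    while (!sketch_reg_mr(mp)) {
        if (!sketch_evict_one(mp)) {
            return SKETCH_OUT_OF_RESOURCE;
        }
    }
    return SKETCH_SUCCESS;
}
```

In this model a failed registration is not an error in itself: the caller only sees SKETCH_OUT_OF_RESOURCE once the lru is empty, which corresponds to the starved case discussed in the thread.
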
George --
I believe that this is the subject of a few long-standing tickets (i.e., what
to do when running out of registered memory -- right now, we hang, for a few
reasons). I think that this is Mellanox's attempt to at least warn the user
that we have run out of registered memory, and will t
I didn't check the code thoroughly, but OMPI_ERR_OUT_OF_RESOURCES is not an
error. If the registration returns out of resources, the BTL will return
OUT_OF_RESOURCE (as an example via the mca_btl_openib_prepare_src). At the
upper level, the PML (in the mca_pml_ob1_send_request_start function) in
Mike --
I would make this a bit better of an error. I.e., use orte_show_help(), so you
can explain the issue more, and also remove all duplicates (i.e., if it fails
to register multiple times).
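
The "explain once, suppress duplicates" idea can be sketched as below. This is not orte_show_help() and does not reproduce its interface; sketch_warn_once, sketch_seen and SKETCH_MAX_TOPICS are hypothetical names used only to illustrate the behaviour being asked for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not an Open MPI or ORTE API. */
#define SKETCH_MAX_TOPICS 8

typedef struct {
    const char *topic;    /* must outlive the table, e.g. a string literal */
    unsigned suppressed;  /* how many repeats were swallowed */
} sketch_seen_t;

static sketch_seen_t sketch_seen[SKETCH_MAX_TOPICS];
static size_t sketch_nseen = 0;

/* Print a longer explanation the first time a topic is reported and
 * count, rather than print, every later report of the same topic.
 * Returns true if the message was actually printed. */
static bool sketch_warn_once(const char *topic, const char *explanation)
{
    for (size_t i = 0; i < sketch_nseen; i++) {
        if (strcmp(sketch_seen[i].topic, topic) == 0) {
            sketch_seen[i].suppressed++;
            return false;
        }
    }
    /* Remember the topic if there is room; if the table is full the
     * message is still printed, it just cannot be de-duplicated. */
    if (sketch_nseen < SKETCH_MAX_TOPICS) {
        sketch_seen[sketch_nseen].topic = topic;
        sketch_seen[sketch_nseen].suppressed = 0;
        sketch_nseen++;
    }
    fprintf(stderr, "WARNING [%s]: %s\n", topic, explanation);
    return true;
}
```

A registration failure path could then call sketch_warn_once("reg-failed", "...") every time it fails, and the user would see the explanation only once.
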
On Mar 6, 2012, at 8:25 AM, mi...@osl.iu.edu wrote:
> Author: miked
> Date: 2012-03-06 09:25:56 ES