WHAT: New PML called "bfo" (Btl Fail Over) that supports failover between
two or more openib BTLs. New configurable code in openib BTL that works
with the bfo to do failover. Note this only works when we have two or more
openib BTLs. This does not failover to another BTL, like tcp.
TO CONFIGUR
Jeff Squyres wrote:
Yes, I think the mmap code in the sm btl actually has a sync point inside add_procs that when the
root allocs and sets up the area, it'll locally broadcast a "yes, we're good -- mmap attach
and let's continue" or "bad things happened; sm btl is broke" message.
Yes, that'
Good point - I forgot about that.
--
Samuel K. Gutierrez
Los Alamos National Laboratory
On Jun 2, 2010, at 11:40 AM, Jeff Squyres wrote:
Don't forget that the RML is also used to broadcast the success/
failure of the creation of the shared memory segment.
If the RML goes away, be sure that y
Don't forget that the RML is also used to broadcast the success/failure of the
creation of the shared memory segment.
If the RML goes away, be sure that you can still determine that without hanging.
Personally, I still don't see the problem with using the RML stuff...
On Jun 2, 2010, at 1:07 P
Yes, I think the mmap code in the sm btl actually has a sync point inside
add_procs that when the root allocs and sets up the area, it'll locally
broadcast a "yes, we're good -- mmap attach and let's continue" or "bad things
happened; sm btl is broke" message.
But I am not confident about the o
On Jun 2, 2010, at 12:42 PM, George Bosilca wrote:
> > 1. In this case, the openib BTL was not finalized, so there was a stub
> > still there listening on the RDMACM CPC. When another process tried to
> > connect to X's RDMACM CPC port, Bad Things happened (because it was only
> > half setup)
Hi George,
That may work - I'll try it.
Thanks!
--
Samuel K. Gutierrez
Los Alamos National Laboratory
On Jun 2, 2010, at 10:59 AM, George Bosilca wrote:
How about ftok ? The init function takes a file_name as argument,
and this file name is unique per instance of the shared memory
region
How about ftok ? The init function takes a file_name as argument, and this file
name is unique per instance of the shared memory region we want to create. We
can use this file name with ftok to create a unique key_t that can be used by
shmget to retrieve the shared memory identifier.
george.
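
A minimal sketch of the ftok/shmget idea described above, not taken from the Open MPI tree. The helper name segment_id_from_file, the project id 'O', and the 0600 permission bits are hypothetical choices made for illustration. It assumes the backing file already exists, since ftok() fails otherwise.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper (not Open MPI code): derive a System V IPC key from
 * the per-segment backing file, then look up (or create) the segment.
 * Returns the shared memory identifier, or -1 on failure. */
static int segment_id_from_file(const char *file_name, size_t size, int create)
{
    /* ftok() needs an existing, accessible file; 'O' is an arbitrary
     * project id picked for this sketch. */
    key_t key = ftok(file_name, 'O');
    if (key == (key_t)-1) {
        perror("ftok");
        return -1;
    }

    /* Owner read/write only; add IPC_CREAT when this caller is the one
     * expected to create the segment. */
    int flags = 0600;
    if (create) {
        flags |= IPC_CREAT;
    }

    int shmid = shmget(key, size, flags);
    if (shmid == -1) {
        perror("shmget");
    }
    return shmid;
}
```

Every process that passes the same file name gets the same key, so they can all call shmget and reach the same segment without exchanging the identifier. One caveat: ftok does not strictly guarantee unique keys (two different files can map to the same key), so a real implementation would probably still want to validate the segment after attaching.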
George Bosilca wrote:
We did assume that at least the errors are symmetric, i.e. if A fails
to connect to B then B will fail when trying to connect to A.
I've not been following this thread closely, but thought I'd add a comment.
It used to be that the sm BTL could fail asymmetrically. A sha
On Jun 2, 2010, at 12:18 , Jeff Squyres wrote:
> On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote:
>
>>> Ah, this is the key. If I have one process (out of many) fail the
>>> create_cq() function, I get a segv during finalize. I'll dig.
>>
>> Is there an assumption that if process A claims
On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote:
> > Ah, this is the key. If I have one process (out of many) fail the
> > create_cq() function, I get a segv during finalize. I'll dig.
>
> Is there an assumption that if process A claims to be able to communicate
> with process B that proces
On 2 Jun 2010, at 16:49, Jeff Squyres wrote:
> On Jun 2, 2010, at 11:29 AM, Sylvain Jeaugey wrote:
>
>> But it made me progress on why I'm crashing : in my case, only a subset of
>> processes have their create_cq fail.
>
> Ah, this is the key. If I have one process (out of many) fail the
> cr
On Jun 2, 2010, at 8:49 AM, Jeff Squyres wrote:
On Jun 2, 2010, at 10:44 AM, George Bosilca wrote:
Not sure what you mean here. common/sm may create new shmem
segments at any time (e.g., during coll sm). The RML message
exchange is to ensure that only 1 process creates and initializes
t
On Jun 2, 2010, at 11:29 AM, Sylvain Jeaugey wrote:
> But it made me progress on why I'm crashing : in my case, only a subset of
> processes have their create_cq fail.
Ah, this is the key. If I have one process (out of many) fail the create_cq()
function, I get a segv during finalize. I'll dig
On Wed, 2 Jun 2010, Jeff Squyres wrote:
Don't you mean return NULL? This function is supposed to return a (struct
ibv_cq *).
Oops. My bad. Yes, it should return NULL. And it seems that if I make
ibv_create_cq always return NULL, the scenario described by George works
smoothly : returned OMPI
On Jun 2, 2010, at 10:44 AM, George Bosilca wrote:
> > Not sure what you mean here. common/sm may create new shmem segments at
> > any time (e.g., during coll sm). The RML message exchange is to ensure
> > that only 1 process creates and initializes the segment and then all the
> > others jus
On Jun 2, 2010, at 09:28 , Jeff Squyres wrote:
> On Jun 2, 2010, at 5:38 AM, George Bosilca wrote:
>
>> I think adding support for sysv shared memory is a good thing. However, I
>> have some strong objections over the implementation in the hg tree. Here are
>> 2 of the major ones:
>>
>> 1) th
On Jun 2, 2010, at 7:28 AM, Jeff Squyres wrote:
On Jun 2, 2010, at 5:38 AM, George Bosilca wrote:
I think adding support for sysv shared memory is a good thing.
However, I have some strong objections over the implementation in
the hg tree. Here are 2 of the major ones:
1) the sysv shared
To follow up on this RFC...
This RFC also got discussed on the weekly call (and in several other
discussions). Again, no one seemed to hate it. That being said, hwloc still
needs a bit more soak time; I just committed the 32 bit fix the other day.
So this one will happen eventually (i.e., #1,
To follow up on this RFC...
We discussed this RFC on the weekly call and no one seemed to hate it. But
there was a desire to:
a) be able to compile out hwloc for environments that don't want/need it (e.g.,
embedded environments)
b) have some degree of isolation in case hwloc ever dies
c) have
On Jun 2, 2010, at 5:38 AM, George Bosilca wrote:
> I think adding support for sysv shared memory is a good thing. However, I
> have some strong objections over the implementation in the hg tree. Here are
> 2 of the major ones:
>
> 1) the sysv shared memory creation is __atomic__ based on the f
On Jun 2, 2010, at 5:08 AM, Sylvain Jeaugey wrote:
> It must be because create_cq actually creates cqs. Try to apply this
> patch, which makes create_cq_compat() *not* create the cqs and return an
> error instead :
>
> diff -
Absolutely correct. I've fixed it on the dev trunk and filed tickets to get
the fix moved into the release branches.
Thanks!
On Jun 2, 2010, at 4:41 AM, Number Cruncher wrote:
> I'm working on some intercommunicator stuff at the moment. According to
> MPI-2.2 standard:
> "An inter-communicati
I think adding support for sysv shared memory is a good thing. However, I have
some strong objections over the implementation in the hg tree. Here are 2 of
the major ones:
1) the sysv shared memory creation is __atomic__ based on the flags used.
Therefore, all the RML messages exchange is total
I don't have any IB nodes, but I'm interested to see how this happens. What I
would like to understand here is how do we get back in the OpenIB code if the
add_procs failed for the BTL ...
george.
On Jun 2, 2010, at 05:08 , Sylvain Jeaugey wrote:
> On Tue, 1 Jun 2010, Jeff Squyres wrote:
>
On Tue, 1 Jun 2010, Jeff Squyres wrote:
On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote:
In my case, the error happens in :
mca_btl_openib_add_procs()
mca_btl_openib_size_queues()
adjust_cq()
ibv_create_cq_compat()
ibv_create_cq()
Can you nail this down
I'm working on some intercommunicator stuff at the moment. According to
MPI-2.2 standard:
"An inter-communication is a point-to-point communication between
processes in different groups" [Section 6.6]
yet the "man" page for MPI_Comm_size reads:
"If the communicator is an **intra-communicator
Couldn't explain it better. Thanks Jeff for the summary !
On Tue, 1 Jun 2010, Jeff Squyres wrote:
On May 31, 2010, at 10:27 AM, Ralph Castain wrote:
Just curious - your proposed fix sounds exactly like what was done in
the OPAL SOS work. Are you therefore proposing to use SOS to provide a
mo