Couldn't explain it better. Thanks, Jeff, for the summary!
On Tue, 1 Jun 2010, Jeff Squyres wrote:
On May 31, 2010, at 10:27 AM, Ralph Castain wrote:
Just curious - your proposed fix sounds exactly like what was done in
the OPAL SOS work. Are you therefore proposing to use SOS to provide a
mo
I'm working on some intercommunicator stuff at the moment. According to
MPI-2.2 standard:
"An inter-communication is a point-to-point communication between
processes in different groups" [Section 6.6]
yet the "man" page for MPI_Comm_size reads:
"If the communicator is an **intra-communicator
On Tue, 1 Jun 2010, Jeff Squyres wrote:
On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote:
In my case, the error happens in:
mca_btl_openib_add_procs()
mca_btl_openib_size_queues()
adjust_cq()
ibv_create_cq_compat()
ibv_create_cq()
Can you nail this down
I don't have any IB nodes, but I'm interested to see how this happens. What I
would like to understand here is how we get back into the OpenIB code if
add_procs failed for the BTL ...
george.
On Jun 2, 2010, at 05:08 , Sylvain Jeaugey wrote:
> On Tue, 1 Jun 2010, Jeff Squyres wrote:
>
I think adding support for sysv shared memory is a good thing. However, I have
some strong objections over the implementation in the hg tree. Here are 2 of
the major ones:
1) the sysv shared memory creation is __atomic__ based on the flags used.
Therefore, all the RML messages exchange is total
Absolutely correct. I've fixed it on the dev trunk and filed tickets to get
the fix moved into the release branches.
Thanks!
On Jun 2, 2010, at 4:41 AM, Number Cruncher wrote:
> I'm working on some intercommunicator stuff at the moment. According to
> MPI-2.2 standard:
> "An inter-communicati
On Jun 2, 2010, at 5:08 AM, Sylvain Jeaugey wrote:
> It must be because create_cq actually creates cqs. Try to apply this
> patch, which makes create_cq_compat() *not* create the cqs and return an
> error instead:
>
> diff -
On Jun 2, 2010, at 5:38 AM, George Bosilca wrote:
> I think adding support for sysv shared memory is a good thing. However, I
> have some strong objections over the implementation in the hg tree. Here are
> 2 of the major ones:
>
> 1) the sysv shared memory creation is __atomic__ based on the f
To follow up on this RFC...
We discussed this RFC on the weekly call and no one seemed to hate it. But
there was a desire to:
a) be able to compile out hwloc for environments that don't want/need it (e.g.,
embedded environments)
b) have some degree of isolation in case hwloc ever dies
c) have
To follow up on this RFC...
This RFC also got discussed on the weekly call (and in several other
discussions). Again, no one seemed to hate it. That being said, hwloc still
needs a bit more soak time; I just committed the 32 bit fix the other day.
So this one will happen eventually (i.e., #1,
On Jun 2, 2010, at 7:28 AM, Jeff Squyres wrote:
On Jun 2, 2010, at 5:38 AM, George Bosilca wrote:
I think adding support for sysv shared memory is a good thing.
However, I have some strong objections over the implementation in
the hg tree. Here are 2 of the major ones:
1) the sysv shared
On Jun 2, 2010, at 09:28 , Jeff Squyres wrote:
> On Jun 2, 2010, at 5:38 AM, George Bosilca wrote:
>
>> I think adding support for sysv shared memory is a good thing. However, I
>> have some strong objections over the implementation in the hg tree. Here are
>> 2 of the major ones:
>>
>> 1) th
On Jun 2, 2010, at 10:44 AM, George Bosilca wrote:
> > Not sure what you mean here. common/sm may create new shmem segments at
> > any time (e.g., during coll sm). The RML message exchange is to ensure
> > that only 1 process creates and initializes the segment and then all the
> > others jus
On Wed, 2 Jun 2010, Jeff Squyres wrote:
Don't you mean return NULL? This function is supposed to return a (struct
ibv_cq *).
Oops. My bad. Yes, it should return NULL. And it seems that if I make
ibv_create_cq always return NULL, the scenario described by George works
smoothly: returned OMPI
On Jun 2, 2010, at 11:29 AM, Sylvain Jeaugey wrote:
> But it helped me make progress on why I'm crashing: in my case, only a
> subset of processes have their create_cq fail.
Ah, this is the key. If I have one process (out of many) fail the create_cq()
function, I get a segv during finalize. I'll dig
On Jun 2, 2010, at 8:49 AM, Jeff Squyres wrote:
On Jun 2, 2010, at 10:44 AM, George Bosilca wrote:
Not sure what you mean here. common/sm may create new shmem
segments at any time (e.g., during coll sm). The RML message
exchange is to ensure that only 1 process creates and initializes
t
On 2 Jun 2010, at 16:49, Jeff Squyres wrote:
> On Jun 2, 2010, at 11:29 AM, Sylvain Jeaugey wrote:
>
>> But it helped me make progress on why I'm crashing: in my case, only a
>> subset of processes have their create_cq fail.
>
> Ah, this is the key. If I have one process (out of many) fail the
> cr
On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote:
> > Ah, this is the key. If I have one process (out of many) fail the
> > create_cq() function, I get a segv during finalize. I'll dig.
>
> Is there an assumption that if process A claims to be able to communicate
> with process B that proces
On Jun 2, 2010, at 12:18 , Jeff Squyres wrote:
> On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote:
>
>>> Ah, this is the key. If I have one process (out of many) fail the
>>> create_cq() function, I get a segv during finalize. I'll dig.
>>
>> Is there an assumption that if process A claims
George Bosilca wrote:
We did assume that at least the errors are symmetric, i.e. if A fails
to connect to B then B will fail when trying to connect to A.
I've not been following this thread closely, but thought I'd add a comment.
It used to be that the sm BTL could fail asymmetrically. A sha
How about ftok? The init function takes a file_name as argument, and this file
name is unique per instance of the shared memory region we want to create. We
can use this file name with ftok to create a unique key_t that can be used by
shmget to retrieve the shared memory identifier.
george.
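
The ftok suggestion above could be sketched roughly as follows. This is only a
minimal illustration, not the code from the hg tree: the helper names
sysv_segment_open() and sysv_segment_remove(), the project id 'O', and the 0600
permissions are all assumptions made for the example. It relies on two
documented behaviors: ftok() needs file_name to refer to an existing file, and
shmget() with IPC_CREAT | IPC_EXCL either creates the segment or fails with
EEXIST, so exactly one caller ends up as the creator. Note that ftok() keys are
not strictly guaranteed to be unique, so a real implementation may want an
extra sanity check on the attached segment.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Hypothetical helper (not an Open MPI function): map file_name to a SysV
 * key with ftok(), then create the segment or look up the existing one.
 * Returns the shmid, or -1 on error. *created is set to 1 only for the
 * single caller that actually created the segment.
 */
static int sysv_segment_open(const char *file_name, size_t size, int *created)
{
    /* file_name must already exist; 'O' is an arbitrary project id. */
    key_t key = ftok(file_name, 'O');
    if (key == (key_t) -1) {
        perror("ftok");
        return -1;
    }

    /* IPC_CREAT | IPC_EXCL makes creation atomic: one caller wins. */
    int shmid = shmget(key, size, IPC_CREAT | IPC_EXCL | 0600);
    if (shmid >= 0) {
        *created = 1;
        return shmid;
    }
    if (errno != EEXIST) {
        perror("shmget(create)");
        return -1;
    }

    /* Someone else created it first: just look up the identifier. */
    *created = 0;
    return shmget(key, size, 0600);
}

/* Hypothetical helper: mark the segment for removal once it is unused. */
static int sysv_segment_remove(int shmid)
{
    return shmctl(shmid, IPC_RMID, NULL);
}
```

With this shape, every local process can call sysv_segment_open() with the same
file name; the one that sees *created == 1 would initialize the segment, and
the others would only attach with shmat().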
Hi George,
That may work - I'll try it.
Thanks!
--
Samuel K. Gutierrez
Los Alamos National Laboratory
On Jun 2, 2010, at 10:59 AM, George Bosilca wrote:
How about ftok? The init function takes a file_name as argument,
and this file name is unique per instance of the shared memory
region
On Jun 2, 2010, at 12:42 PM, George Bosilca wrote:
> > 1. In this case, the openib BTL was not finalized, so there was a stub
> > still there listening on the RDMACM CPC. When another process tried to
> > connect to X's RDMACM CPC port, Bad Things happened (because it was only
> > half setup)
Yes, I think the mmap code in the sm btl actually has a sync point inside
add_procs: when the root allocs and sets up the area, it locally broadcasts
either a "yes, we're good -- mmap attach and let's continue" or a "bad things
happened; sm btl is broke" message.
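
The sync point described above can be illustrated with a small sketch. It is
not the sm BTL code and does not use the RML; plain pipes stand in for the
local broadcast, and the names root_announce(), peer_wait(), SEG_READY and
SEG_FAILED are made up for the example. The one property it tries to show is
that a peer gets a definite answer in both cases: a status byte if the root
reports in, or end-of-file if the root went away before announcing, so the peer
does not hang waiting.

```c
#include <unistd.h>

enum { SEG_FAILED = 0, SEG_READY = 1 };

/*
 * Root side (hypothetical): after trying to allocate and set up the shared
 * area, tell every local peer whether it worked. write_fds holds one pipe
 * write end per peer. Returns 0 if every peer was notified, -1 otherwise.
 */
static int root_announce(const int *write_fds, int npeers, int ok)
{
    unsigned char status = ok ? SEG_READY : SEG_FAILED;
    int rc = 0;

    for (int i = 0; i < npeers; i++) {
        if (write(write_fds[i], &status, 1) != 1) {
            rc = -1;   /* keep going so the remaining peers still hear */
        }
    }
    return rc;
}

/*
 * Peer side (hypothetical): block until the root reports. EOF or a read
 * error (for example, the root exited and its write end was closed) is
 * treated as failure, so the peer disables the component instead of
 * attaching to a half-initialized area.
 */
static int peer_wait(int read_fd)
{
    unsigned char status;
    ssize_t n = read(read_fd, &status, 1);

    if (n != 1) {
        return SEG_FAILED;
    }
    return status == SEG_READY ? SEG_READY : SEG_FAILED;
}
```

A peer would attach only when peer_wait() returns SEG_READY and otherwise mark
the shared-memory path as unusable.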
But I am not confident about the o
Don't forget that the RML is also used to broadcast the success/failure of the
creation of the shared memory segment.
If the RML goes away, be sure that you can still determine that without hanging.
Personally, I still don't see the problem with using the RML stuff...
On Jun 2, 2010, at 1:07 P
Good point - I forgot about that.
--
Samuel K. Gutierrez
Los Alamos National Laboratory
On Jun 2, 2010, at 11:40 AM, Jeff Squyres wrote:
Don't forget that the RML is also used to broadcast the success/
failure of the creation of the shared memory segment.
If the RML goes away, be sure that y
Jeff Squyres wrote:
Yes, I think the mmap code in the sm btl actually has a sync point inside add_procs: when the
root allocs and sets up the area, it locally broadcasts either a "yes, we're good -- mmap attach
and let's continue" or a "bad things happened; sm btl is broke" message.
Yes, that'
WHAT: New PML called "bfo" (Btl Fail Over) that supports failover between
two or more openib BTLs. New configurable code in the openib BTL that works
with the bfo to do failover. Note that this only works when we have two or more
openib BTLs. It does not fail over to another BTL, like tcp.
TO CONFIGUR