Re: [OMPI devel] pointer_array

2007-12-21 Thread George Bosilca
The opal_pointer_array is now committed. I tried to update all BTLs to use the opal_pointer_array instead of the orte_pointer_array. Now the OMPI layer only uses opal_pointer_array. Unfortunately, I cannot test most of the BTLs, so I hope I didn't miss anything.


  Thanks,
george.

On Dec 17, 2007, at 2:22 PM, George Bosilca wrote:

Sounds good. I will replace all references to ompi_pointer_array and orte_pointer_array in the ompi layer (some BTLs use the orte_pointer_array) with the opal_pointer_array. I'll avoid any modification of the orte layer.


I'll commit tomorrow morning.

 Thanks,
   george.

On Dec 17, 2007, at 12:04 PM, Ralph H Castain wrote:

That would be fine with me - I can grab that out of the trunk and adjust ORTE in my branch instead.

Thanks
Ralph


On 12/17/07 9:54 AM, "Tim Mattox"  wrote:


How about this as a suggested compromise: George, could you just do half the patch, where you leave orte alone and just move the ompi pointer array implementation down into opal? That way, any new code can make use of it from opal, and only orte would need to be adjusted later, after Ralph is done with his changes.


On Dec 17, 2007 9:18 AM, Ralph H Castain  wrote:
It would require extensive modification as use of the pointer array has spread over a wide range of the code base. I would really appreciate it if we didn't do this right now.

The differences are historic in nature - several years ago, the folks working on the OMPI layer needed to insert some Fortran-specific limits and type definitions into the opal_pointer_array code. Unfortunately, that caused type conflicts across a swath of the ORTE code. After a ton of discussion and debate, there was no way the OMPI folks could guarantee that they wouldn't need to change those definitions again at some time in the future - which would again force the ORTE layer to make major changes to their code.

In addition, the use of an int as the array index in the opal_pointer_array raised concerns in the ORTE world, as we really didn't want to pass generic variable types between processes. At the time, we weren't sure if the index in a pointer array was going to need to be passed somewhere in the future - in fact, the code did pass it at the time in several cases.

So we agreed to simply create separate code that, even though it duplicated the functionality, ensured that the two could operate semi-independently.

In the intervening time, the OMPI folks seem to have been able to leave the opal_pointer_array definitions pretty much alone. There have been a few changes along the way, but nothing overwhelming. In addition, we have found that the ORTE code no longer needs to pass the array index when sending an object's data to a remote process - at least, this is true at the moment.

So making the change might be reasonable. If we are going to do that, though, we need to ensure that all the functionality is replicated (there are, I believe, a couple of extensions in the orte_pointer_array class), and we should similarly review the other orte/opal class overlaps.

However, doing all this right now would be a disaster on the tmp branch where we are revising ORTE. It would be much better to do it after that branch merges to the trunk, or just make the change in the tmp branch first. That branch makes much more extensive use of the orte_pointer_array object than is in the trunk, and it would be a royal pain of conflicts to resolve it - all for little, if any, gain.

Thanks
Ralph




On 12/17/07 6:35 AM, "Jeff Squyres"  wrote:


Adding RHC to the thread...

I'm guessing that the patch will have to be modified for the ORTE tmp branch.



On Dec 16, 2007, at 6:18 PM, George Bosilca wrote:


Right, I wonder why it didn't show in the patch file. Anyway, it completely removes the orte_pointer_array.[ch] as well as the ompi_pointer_array.[ch] files.

Thanks,
 george.

On Dec 16, 2007, at 12:01 AM, Tim Mattox wrote:


The patch looks good to my eyeballs, though I've not done any testing with it.
I presume a follow-on patch would remove the orte_pointer_array.[ch] files?

On Dec 15, 2007 4:01 PM, George Bosilca  wrote:
I have a patch that unifies the pointer array implementations into just one. Right now, we have two pointer array implementations: one for ORTE and one for OMPI. The differences are small and mostly insignificant (there is no way to add more than 2^31 elements in the pointer array anyway). The following patch proposes to merge these two pointer arrays into one, implemented in OPAL (and called opal_pointer_array).
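
For readers who haven't looked at these classes, here is a minimal sketch of the kind of interface being unified - a hypothetical simplification, not the actual OPAL implementation (which also handles locking, growth policy, and the Fortran-related limits discussed elsewhere in this thread). The relevant point is that the add operation returns an int index, which is where both the 2^31-element limit and the "passing an int index between processes" concern come from.

/* Hypothetical, simplified sketch of a pointer-array class of the kind
 * being merged here -- not the actual OPAL code. */
#include <stdlib.h>
#include <string.h>

typedef struct {
    void **addr;        /* table of pointers                 */
    int    size;        /* current allocated capacity        */
    int    lowest_free; /* hint for where to start searching */
} sketch_pointer_array_t;

/* Store a pointer and return the index it was placed at, or -1 on error. */
static int sketch_pointer_array_add(sketch_pointer_array_t *t, void *ptr)
{
    for (int i = t->lowest_free; i < t->size; ++i) {
        if (NULL == t->addr[i]) {
            t->addr[i] = ptr;
            t->lowest_free = i + 1;
            return i;
        }
    }
    /* no free slot: grow the table and use the first new entry */
    int old = t->size;
    int newsize = (0 == old) ? 16 : 2 * old;
    void **tmp = (void **) realloc(t->addr, (size_t) newsize * sizeof(void *));
    if (NULL == tmp) {
        return -1;
    }
    memset(tmp + old, 0, (size_t)(newsize - old) * sizeof(void *));
    t->addr = tmp;
    t->size = newsize;
    t->addr[old] = ptr;
    t->lowest_free = old + 1;
    return old;
}

static void *sketch_pointer_array_get(const sketch_pointer_array_t *t, int index)
{
    return (index >= 0 && index < t->size) ? t->addr[index] : NULL;
}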

If nobody has complained before Wednesday noon, I'll commit the patch.

Thanks,
 george.









--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org

Re: [OMPI devel] openib xrc CPC minor nit

2007-12-21 Thread Gleb Natapov
On Thu, Dec 20, 2007 at 05:39:36PM -0500, Jeff Squyres wrote:
> Pasha --
> 
> I notice in the port info struct that you have a member for the lid, but only #if HAVE_XRC.  Per a comment in the code, this is supposed to save bytes when we're using OOB (because we don't need this value in the OOB CPC).
> 
> I think we should remove this #if and always have this struct member.  ~4 extra bytes (because it's DSS packed) is no big deal.  It's packed in with all the other modex info, so the message is already large.  4 more bytes per port won't make a difference (IMHO).
> 
> And keep in mind that #if HAVE_XRC is true if XRC is supported -- we still send the extra bytes if XRC is supported and not used (which is the default when compiling for OFED 1.3, no?).
> 
> So I think we should remove those #if's and just always have that data member there.  It's up to the CPC's if they want to use that info or not.
> 
> Any objections to me removing this #if on the openib-cpc branch?  (and eventual merge back up to the trunk)
> 
Remove it, and add a capability mask to the port info structure. The capability mask will contain the types of CPCs supported by a port. I may need this before openib-cpc is merged back to the trunk.
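
To make the proposal concrete, here is a rough sketch of such a per-port entry - the field and type names are hypothetical, not the actual openib BTL modex structure - with the lid always present and a CPC capability bitmask of the kind suggested above.

/* Hypothetical sketch of a per-port modex entry -- illustrative names,
 * not the real openib BTL structure.  The lid is unconditionally present,
 * and a capability bitmask advertises which CPCs the port supports, so
 * receivers always decode the same layout whether or not XRC is compiled
 * in or used. */
#include <stdint.h>

enum {
    PORT_CPC_CAP_OOB    = 0x1,  /* out-of-band connection setup        */
    PORT_CPC_CAP_XOOB   = 0x2,  /* XRC-over-OOB connection setup       */
    PORT_CPC_CAP_RDMACM = 0x4   /* RDMA CM based setup (example value) */
};

typedef struct {
    uint16_t lid;       /* always sent: ~4 packed bytes is noise next to
                           the rest of the modex message */
    uint16_t cpc_caps;  /* bitmask of PORT_CPC_CAP_* values */
    uint8_t  port_num;
    uint8_t  subnet_idx;
} port_info_sketch_t;

/* Each CPC decides for itself whether it cares about the lid. */
static inline int port_supports_xrc(const port_info_sketch_t *p)
{
    return (p->cpc_caps & PORT_CPC_CAP_XOOB) != 0;
}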

--
Gleb.


Re: [OMPI devel] [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process

2007-12-21 Thread Tang, Changqing


> -Original Message-
> From: Jack Morgenstein [mailto:ja...@dev.mellanox.co.il]
> Sent: Friday, December 21, 2007 2:32 AM
> To: Tang, Changqing
> Cc: pa...@dev.mellanox.co.il;
> mvapich-disc...@cse.ohio-state.edu;
> gene...@lists.openfabrics.org; Open MPI Developers
> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> independent of any one user process
>
> On Thursday 20 December 2007 18:24, Tang, Changqing wrote:
> > If I have an MPI server process on a node, many other MPI client processes will dynamically connect/disconnect with the server. The server uses the same XRC domain.
> >
> > Will this cause the "kernel" QPs to accumulate for such an application? We want the server to run 365 days a year.
>
> Yes, it will.  I have no way of knowing when a given receiving XRC QP is no longer needed -- except when the domain it belongs to is finally closed.
>
> I don't see that adding a userspace "destroy" verb for this QP will help:

This kernel QP is for receiving only, so when there is no activity on this QP, can the kernel send a heartbeat message to check whether the remote sending QP is still there (still connected)? If not, the kernel is safe to clean up this QP.

So whenever the RC connection is broken, the kernel can destroy this QP.


>
> The only one who actually knows that the XRC QP is no longer required is the userspace process which created the QP at the remote end of the RC connection of the receiving XRC QP.
>
> This remote process can only send a request to destroy the QP to some local process (via its own private protocol). However, you pointed out that the process which originally created the QP may not be around any more (this was the source of the problem which led to the RFC in this thread) -- and sending the destroy request to all the remote processes on that node which it communicates with is REALLY ugly.
>
> I'm not familiar with MPI, so this may be a silly question: Can the MPI server process create a new domain for each client process, and destroy that domain when the client process is done (i.e., is this MPI server process a supervisor of resources for distributed computations (but is not a participant in these computations)?).

The server could be a process group across multiple nodes - a parallel database search engine, for example.


>
> (Actually, what I'm asking -- is it possible to allocate a new XRC domain for a distributed computation, and destroy that domain at the end of that computation?)

Yes, it could, but it makes the MPI code harder to manage. And we also have a connect/accept speed concern.

We hope not to do it this way.


--CQ


>
>
> -- Jack
>



Re: [OMPI devel] pointer_array

2007-12-21 Thread Jeff Squyres
I'm unfortunately getting some test failures after r17007 when converting between C and C++/F77 handles, such as the following (I checked; trunk/r17006 works fine):


-
[9:17] svbu-mpi:~/svn/ompi-tests/cxx-test-suite % mpirun -np 4 src/mpi2c++_test


Since we made it this far, we will assume that
MPI::Init() worked properly.
--
...snipped
* MPI::misc...
  - MPI::Query_thread...  PASS
  - Commname...   PASS
  - Commgethandler... PASS
  - Handle conversion...  mpi2c++_test: class/opal_pointer_array.c:131: opal_pointer_array_add: Assertion `table->addr[index] == ((void *)0)' failed.

-

And other failures in the simple/basic/f90/attr_f test. I'll start poking around shortly to see if I can figure out the problem.



On Dec 21, 2007, at 1:31 AM, George Bosilca wrote:

The opal_pointer_array is now committed. I tried to update all BTLs to use the opal_pointer_array instead of the orte_pointer_array. Now the OMPI layer only uses opal_pointer_array. Unfortunately, I cannot test most of the BTLs, so I hope I didn't miss anything.


 Thanks,
   george.


Re: [OMPI devel] [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process

2007-12-21 Thread Tang, Changqing

What we do for heartbeat is use a zero-byte rdma_write: the message goes to the peer QP only, so there is no need to post anything on the remote side and no need for pinned memory.
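
For readers unfamiliar with the mechanism, a minimal sketch of such a zero-byte rdma_write probe using libibverbs might look like the following. This is illustrative only: QP/CQ setup and the remote address/rkey exchange are assumed to have happened elsewhere, and the function name is hypothetical.

/* Zero-byte RDMA write heartbeat sketch.  A zero-length write consumes no
 * receive WQE and touches no remote memory, so nothing needs to be posted
 * or pinned on the peer; if the RC connection is gone, the completion
 * comes back with a retry-exceeded error instead of success. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the peer answered the heartbeat, 0 if the connection
 * appears dead, -1 on a local error. */
static int heartbeat_probe(struct ibv_qp *qp, struct ibv_cq *cq,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_wc wc;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 0xbeefULL;          /* tag so we recognize the completion */
    wr.sg_list    = NULL;               /* zero-byte payload ...              */
    wr.num_sge    = 0;                  /* ... nothing to gather locally      */
    wr.opcode     = IBV_WR_RDMA_WRITE;
    wr.send_flags = IBV_SEND_SIGNALED;  /* we need the completion status      */
    wr.wr.rdma.remote_addr = remote_addr;  /* from the initial exchange       */
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr) != 0) {
        return -1;
    }

    /* Busy-poll for the completion; a real implementation would use a
     * completion channel or fold this into its normal progress loop. */
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (0 == n);

    if (n < 0 || wc.wr_id != 0xbeefULL) {
        return -1;
    }
    return (IBV_WC_SUCCESS == wc.status) ? 1 : 0;
}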


--CQ



> -Original Message-
> From: Jack Morgenstein [mailto:ja...@dev.mellanox.co.il]
> Sent: Friday, December 21, 2007 12:09 PM
> To: Tang, Changqing
> Cc: pa...@dev.mellanox.co.il;
> mvapich-disc...@cse.ohio-state.edu;
> gene...@lists.openfabrics.org; Open MPI Developers
> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> independent of any one user process
>
> On Friday 21 December 2007 19:13, Tang, Changqing wrote:
> > This kernel QP is for receiving only, so when there is no activity on this QP, can the kernel send a heartbeat message to check whether the remote sending QP is still there (still connected)? If not, the kernel is safe to clean up this QP.
> >
> > So whenever the RC connection is broken, the kernel can destroy this QP.
> >
> This increases the XRC complexity considerably:
>
> 1. Need to have a separate kernel thread which will scan ALL xrc domains on this host for XRC receive QPs.
>    This thread will need to do some form of RDMA_READ/WRITE, because otherwise it will interfere with the remote (sending side) operation.  Furthermore, the sending-side XRC QP may not have anyone listening on an associated XRC SRQ qp -- it is not meant to be set up to receive.  We only need an operation that will yield a RETRY_EXCEEDED error completion if the connection has broken.
>
> 2. This opens the door for all sorts of nasty race conditions, since we will now have a bi-directional protocol. For example, what if this feature is being combined with APM (valid for RC QPs), and we are simply in the middle of a migration, and maybe communication is temporarily interrupted. We will be killing off the QP without allowing any error recovery mechanism to work.
>
> 3. The application complexity goes up -- we now need the sending-side QP to declare a memory region and send this region's address to the receiving side so that the receiving side (the kernel thread mentioned above) can periodically try to read from this region.
>
> Still, I'll give this some thought.  For example, maybe we can rdma_read some random (illegal) address -- if the connection is alive, we'll get a "remote access error" completion, while if it's dead, we'll get retry exceeded (need to check that the bad rdma read request does not cause the QPs to enter an error state).
>
> - Jack
>



Re: [OMPI devel] pointer_array

2007-12-21 Thread Jeff Squyres
Fixed in https://svn.open-mpi.org/trac/ompi/changeset/17021 -- a simple off-by-one error caused by some logic moving around.
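
For anyone following along, the assertion that fired checks that the slot handed back by the add routine is still empty. The sketch below is hypothetical code - not the actual r17007/r17021 diff - showing how an off-by-one in the free-slot bookkeeping produces exactly that failure.

/* Hypothetical illustration of how an off-by-one in the "next free slot"
 * bookkeeping trips an assertion of the form
 *     assert(NULL == table->addr[index])
 * in the add routine. */
#include <assert.h>
#include <stddef.h>

typedef struct {
    void **addr;
    int    size;
    int    lowest_free;   /* index of the first slot that might be free */
} table_t;

static int table_add(table_t *t, void *ptr)
{
    int index = t->lowest_free;

    assert(index < t->size);          /* caller guarantees capacity here    */
    assert(NULL == t->addr[index]);   /* the slot we hand out must be empty */
    t->addr[index] = ptr;

    /* Correct bookkeeping advances PAST the slot we just used.  A buggy
     * variant such as
     *     t->lowest_free = index;        (off by one)
     * leaves the hint pointing at the now-occupied slot, so the very next
     * add() fires the assertion above. */
    t->lowest_free = index + 1;
    return index;
}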



On Dec 21, 2007, at 12:19 PM, Jeff Squyres wrote:


I'm unfortunately getting some test failures after r17007 when converting between C and C++/F77 handles, such as the following (I checked; trunk/r17006 works fine):

-
[9:17] svbu-mpi:~/svn/ompi-tests/cxx-test-suite % mpirun -np 4 src/mpi2c++_test

Since we made it this far, we will assume that
MPI::Init() worked properly.
--
...snipped
* MPI::misc...
  - MPI::Query_thread...  PASS
  - Commname...   PASS
  - Commgethandler... PASS
  - Handle conversion...  mpi2c++_test: class/opal_pointer_array.c:131: opal_pointer_array_add: Assertion `table->addr[index] == ((void *)0)' failed.

-

And other failures in the simple/basic/f90/attr_f test. I'll start poking around shortly to see if I can figure out the problem.

