[OMPI devel] Open IB BTL and iWARP

2008-07-09 Thread Don Kerr
Last I looked the OpenIB BTL relied on the short eager rdma buffers 
being written in order?   Is this still the case?


If so, how is this handled when iWARP is underneath the User Verb API 
and not Mellanox IB HCAs?


Re: [OMPI devel] Open IB BTL and iWARP

2008-07-09 Thread Jeff Squyres

On Jul 9, 2008, at 4:08 PM, Don Kerr wrote:

Last I looked the OpenIB BTL relied on the short eager rdma buffers  
being written in order?   Is this still the case?


The eager rdma optimization relies on the last byte of the short  
message being written last.  I.e., when we see the last byte in the  
target buffer, we assume the rest of the message is there.
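The last-byte convention can be illustrated with a small simulation. This is only a sketch, not the openib BTL code: the slot layout, the TAIL_MARKER value, and all function names below are assumptions made for illustration. It shows why the receiver's check is only safe when the final byte really is written last.

```python
TAIL_MARKER = 0xFF  # hypothetical "message complete" flag value


def make_slot(size):
    """Return a zeroed receive slot; its last byte is the completion flag."""
    return bytearray(size)


def rdma_write(slot, payload, in_order=True):
    """Simulate a remote write of payload plus trailing marker into slot.

    With in_order=True the marker byte lands last, which is the guarantee
    the optimization depends on. With in_order=False the marker lands
    first, modelling hardware that cannot give that guarantee.
    """
    assert len(payload) == len(slot) - 1
    body = list(enumerate(payload))
    tail = (len(slot) - 1, TAIL_MARKER)
    order = body + [tail] if in_order else [tail] + body
    for index, value in order:
        slot[index] = value
        yield  # let the receiver poll between individual byte writes


def poll(slot):
    """Receiver side: treat the message as complete once the tail is set."""
    if slot[-1] == TAIL_MARKER:
        return bytes(slot[:-1])
    return None


def first_seen(payload, in_order):
    """Return what the receiver reads the first time poll() reports success."""
    slot = make_slot(len(payload) + 1)
    for _ in rdma_write(slot, payload, in_order):
        message = poll(slot)
        if message is not None:
            return message
    return None
```

With in-order writes, first_seen(b"hello", True) returns the full payload. With the marker written first, the receiver picks up a buffer that is still all zeros, which is the failure mode the guarantee rules out.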


If so, how is this handled when iWARP is underneath the User Verb  
API and not Mellanox IB HCAs?


There's an MCA parameter that disables this optimization if the  
underlying hardware can't provide that guarantee.  We also have this  
field in the INI file so that specific adapters can disable it  
automatically if they want/need to.
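As a rough sketch of how a per-adapter INI setting and a user-supplied parameter might combine: the section names, the key name use_eager_rdma, the default, and the precedence rule (user setting wins over the INI file) are all assumptions for illustration, not taken from the Open MPI source.

```python
import configparser

# Hypothetical INI contents; real adapter sections and keys may differ.
SAMPLE_INI = """
[Adapter A]
use_eager_rdma = 1

[Adapter B]
use_eager_rdma = 0
"""


def eager_rdma_enabled(ini_text, adapter, user_setting=None):
    """Decide whether to use eager RDMA for the named adapter.

    Assumed precedence: an explicit user setting wins, then the adapter's
    INI entry, then a default of "enabled".
    """
    if user_setting is not None:
        return bool(user_setting)
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if parser.has_section(adapter) and parser.has_option(adapter, "use_eager_rdma"):
        return parser.getboolean(adapter, "use_eager_rdma")
    return True  # assumed default when the adapter is not listed
```

For example, eager_rdma_enabled(SAMPLE_INI, "Adapter B") is False, while passing user_setting=1 turns the optimization back on for that adapter.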


Chelsio T3, NetEffect NE020, and NetXen adapters can all provide that  
guarantee (I asked those vendors).  You can see this in:


https://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mca/btl/openib/mca-btl-openib-hca-params.ini

--
Jeff Squyres
Cisco Systems



[OMPI devel] IOF repair

2008-07-09 Thread Ralph Castain
I have been investigating Ticket #1135 - stdin is read twice if rank=0
shares the node with mpirun. Repairing this problem is going to be quite
difficult due to the rather terrible spaghetti code in the IOF, and the fact
that the IOF in the HNP actually rml.sends the IO to itself multiple times
as it cycles through the spaghetti.

Unfortunately, this problem -is- a regression from 1.2. Rather than spending
weeks trying to fix it, I see two approaches we could pursue. First, I could
repair the problem by essentially returning the IOF to its 1.2 state. This
will have to be done by hand as most of the differences are in function
calls to utilities that have changed due to the removal of the old NS
framework. However, there are a few places where the logic itself has been
modified - and the problem must stem from somewhere in there.

If I make this change, then we will be no better, and no worse, than 1.2.
Note that we currently advise people to read from a file instead of from
stdin to avoid other issues that were present in 1.2.

Alternatively, we could ship 1.3 as-is, and warn users (similar to 1.2) that
they should avoid reading from stdin if there is any chance that rank=0
could be co-located with mpirun. Note that most of our clusters do not allow
such co-location - but it is permitted by default by OMPI.

We already plan to revisit the IOF at next week's technical meeting, with a
goal of redefining the IOF's API to a more reduced set that reflects a less
ambitious requirement. I expect to implement those changes fairly soon
thereafter, but that would be targeted to 1.4 - not 1.3.

Any thoughts on which way we should go?
Ralph




Re: [OMPI devel] IOF repair

2008-07-09 Thread Jeff Squyres
I'd like to have a look at the diff between the two, but I can't do so  
until tomorrow at the earliest.


On Jul 9, 2008, at 7:26 PM, Ralph Castain wrote:


I have been investigating Ticket #1135 - stdin is read twice if rank=0
shares the node with mpirun. Repairing this problem is going to be quite
difficult due to the rather terrible spaghetti code in the IOF, and the fact
that the IOF in the HNP actually rml.sends the IO to itself multiple times
as it cycles through the spaghetti.

Unfortunately, this problem -is- a regression from 1.2. Rather than spending
weeks trying to fix it, I see two approaches we could pursue. First, I could
repair the problem by essentially returning the IOF to its 1.2 state. This
will have to be done by hand as most of the differences are in function
calls to utilities that have changed due to the removal of the old NS
framework. However, there are a few places where the logic itself has been
modified - and the problem must stem from somewhere in there.

If I make this change, then we will be no better, and no worse, than 1.2.
Note that we currently advise people to read from a file instead of from
stdin to avoid other issues that were present in 1.2.

Alternatively, we could ship 1.3 as-is, and warn users (similar to 1.2) that
they should avoid reading from stdin if there is any chance that rank=0
could be co-located with mpirun. Note that most of our clusters do not allow
such co-location - but it is permitted by default by OMPI.

We already plan to revisit the IOF at next week's technical meeting, with a
goal of redefining the IOF's API to a more reduced set that reflects a less
ambitious requirement. I expect to implement those changes fairly soon
thereafter, but that would be targeted to 1.4 - not 1.3.

Any thoughts on which way we should go?
Ralph





--
Jeff Squyres
Cisco Systems



[OMPI devel] v1.3 RM: need a ruling

2008-07-09 Thread Jeff Squyres
v1.3 RMs: Due to some recent work, the MCA parameter  
mpi_paffinity_alone disappeared -- it was moved and renamed to be  
opal_paffinity_alone.  This is Bad because we have a lot of historical  
precedent based on the MCA param name "mpi_paffinity_alone" (FAQ, PPT  
presentations, e-mails on public lists, etc.).  So it needed to be  
restored for v1.3.  I just noticed that I hadn't opened a ticket on  
this -- sorry -- I opened #1383 tonight.


For a variety of reasons described in the commit message r1383, Lenny  
and I first decided that it would be best to fix this problem by the  
functionality committed in r18770 (have the ability to find out where  
an MCA parameter was set).  This would allow us to register two MCA  
params: mpi_paffinity_alone and opal_paffinity_alone, and generally do  
the Right Thing (because we could then tell if a user had set a value  
or whether it was a default MCA param value).  This functionality will  
also be useful in the openib BTL, where there is a blend of MCA  
parameters and INI file parameters.
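To make the "where was it set" idea concrete, here is a minimal sketch. The Source enum, the Param class, and the rule that the new name wins when both are set are assumptions for illustration only; none of this is the actual MCA parameter API.

```python
from enum import Enum


class Source(Enum):
    DEFAULT = "default"          # nobody set it; value is the built-in default
    ENVIRONMENT = "environment"  # set by the user via the environment
    FILE = "file"                # set by the user via a parameter file


class Param:
    """A parameter that remembers where its current value came from."""

    def __init__(self, name, default):
        self.name = name
        self.value = default
        self.source = Source.DEFAULT

    def set(self, value, source):
        self.value = value
        self.source = source


def resolve(old, new):
    """Pick the effective value when two names control one setting.

    Assumed rule: a user-set value beats a default, and the new name
    wins if the user set both.
    """
    if new.source is not Source.DEFAULT:
        return new.value
    if old.source is not Source.DEFAULT:
        return old.value
    return new.value
```

Without the source field, a value of 0 on the new name could not be told apart from "the user never touched it", so a user-set value on the old name could be ignored by mistake.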


However, after doing that, it seemed like only a few more steps to  
implement an overall better solution: implement "synonyms" for MCA  
parameters.  I.e., register the name "mpi_paffinity_alone" as a  
synonym for opal_paffinity_alone.  Along the way, it was trivial to  
add a "deprecated" flag for MCA parameters that we no longer want to  
use (this deprecated flag is also useful in the OB1 PML and  
openib BTL).
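The synonym and deprecation idea might look roughly like the following. This is a hypothetical model, not the Open MPI implementation: the ParamRegistry class and its method names are invented for illustration.

```python
import warnings


class ParamRegistry:
    """Toy parameter registry with synonyms and a deprecated flag."""

    def __init__(self):
        self._values = {}    # canonical name -> current value
        self._synonyms = {}  # synonym -> (canonical name, deprecated flag)

    def register(self, name, default):
        self._values[name] = default

    def register_synonym(self, canonical, synonym, deprecated=False):
        if canonical not in self._values:
            raise KeyError(canonical)
        self._synonyms[synonym] = (canonical, deprecated)

    def _canonical(self, name):
        """Map a synonym to its canonical name, warning if deprecated."""
        if name in self._synonyms:
            canonical, deprecated = self._synonyms[name]
            if deprecated:
                warnings.warn(
                    f"{name} is deprecated; use {canonical}",
                    DeprecationWarning,
                    stacklevel=3,
                )
            return canonical
        if name not in self._values:
            raise KeyError(name)
        return name

    def set(self, name, value):
        self._values[self._canonical(name)] = value

    def get(self, name):
        return self._values[self._canonical(name)]
```

Usage, under the same assumptions:

```python
reg = ParamRegistry()
reg.register("opal_paffinity_alone", 0)
reg.register_synonym("opal_paffinity_alone", "mpi_paffinity_alone")
reg.set("mpi_paffinity_alone", 1)   # old name still works
reg.get("opal_paffinity_alone")     # reads the same underlying value
```

Because both names resolve to one stored value, there is no need to reconcile two separately registered parameters.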


So to fix a problem that needed to be fixed for v1.3 (restore the MCA  
parameter "mpi_paffinity_alone"), I ended up implementing new  
functionality.


Can this go into v1.3, or do we need to implement some kind of  
alternate fix?  (I admit to not having thought through what it would  
take to fix without the new MCA parameter functionality -- it might be  
kinda wonky)


--
Jeff Squyres
Cisco Systems