[OMPI devel] [Patch] make ompi recognize new ib (connectx/mlx4)

2007-05-10 Thread Peter Kjellstrom
I recently tried ompi on early ConnectX hardware/software.
The good news, it works =)

However, ompi needs a chunk of options set to recognize the
card so I made a small patch (setting it up like old Arbel
style hardware).

Patch against openmpi-1.3a1r14635

/Peter

---

--- ompi/mca/btl/openib/mca-btl-openib-hca-params.ini.old   2007-05-10 13:47:58.0 +0200
+++ ompi/mca/btl/openib/mca-btl-openib-hca-params.ini   2007-05-10 13:55:21.0 +0200
@@ -100,6 +100,14 @@

 

+[Mellanox ConnectX]
+vendor_id = 0x2c9
+vendor_part_id = 25408
+use_eager_rdma = 1
+mtu = 2048
+
+
+
 [IBM eHCA 4x and 12x ]
 vendor_id = 0x5076
 vendor_part_id = 0
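
For reference, a minimal libibverbs sketch (not part of the patch; it just
assumes OFED's libibverbs is installed) that prints the two values the ini
file keys on, so you can confirm which stanza -- e.g. vendor_id 0x2c9 /
vendor_part_id 25408 for ConnectX -- matches the installed HCA:

/* Print the vendor_id / vendor_part_id that
 * mca-btl-openib-hca-params.ini matches on.
 * Compile with: gcc query_hca.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num, i;
    struct ibv_device **devs = ibv_get_device_list(&num);

    if (NULL == devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (i = 0; i < num; ++i) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        struct ibv_device_attr attr;

        if (ctx && 0 == ibv_query_device(ctx, &attr)) {
            printf("%s: vendor_id = 0x%x, vendor_part_id = %u\n",
                   ibv_get_device_name(devs[i]),
                   attr.vendor_id, attr.vendor_part_id);
        }
        if (ctx) ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}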


Re: [OMPI devel] [Patch] make ompi recognize new ib (connectx/mlx4)

2007-05-10 Thread Jeff Squyres

On May 10, 2007, at 8:08 AM, Peter Kjellstrom wrote:


I recently tried ompi on early ConnectX hardware/software.
The good news, it works =)


We've seen some really great 1-switch latency using the early access  
ConnectX hardware.  I have a pair of ConnectX's in my MPI development  
cluster at Cisco, but am awaiting various software pieces before I  
can start playing with them.


We're also quite excited to add some of the new features of the  
ConnectX hardware (Roland Dreier is working on the verbs interface  
and Mellanox is working on the firmware).  I don't see Mellanox's  
presentation from last week's OpenFabrics Sonoma Workshop on the  
openfabrics.org web site that describes the features; I'll ping them  
and ask where it is.



However, ompi needs a chunk of options set to recognize the
card so I made a small patch (setting it up like old Arbel
style hardware).


Good point; I can't believe we forgot to commit that...  Thanks!

BTW, you copied the MTU from Sinai, not Arbel -- is that what you meant?


(FWIW: the internal Mellanox code name for ConnectX is Hermon,  
another mountain in Israel, just like Sinai, Arbel, ...etc.).


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [ewg] Re: [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-10 Thread Jeff Squyres

On May 10, 2007, at 8:23 AM, Or Gerlitz wrote:

A different approach which you might want to consider is to have, at
the btl level, --two-- connections per pair of ranks: if A wants to
send to B it does so through the A --> B connection, and if B wants to
send to A it does so through the B --> A connection. To some extent,
this is the approach taken by IPoIB-CM (I am not deep enough into the
RFC to understand the reasoning, but I am quite sure this was the
approach in the initial implementation). At first thought it might seem
not very elegant, but once you work through the details (projected onto
the ompi env) you might find it even nice.


What is the advantage of this approach?

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [Patch] make ompi recognize new ib (connectx/mlx4)

2007-05-10 Thread Gleb Natapov
On Thu, May 10, 2007 at 08:22:41AM -0400, Jeff Squyres wrote:
> (FWIW: the internal Mellanox code name for ConnectX is Hermon,  
> another mountain in Israel, just like Sinai, Arbel, ...etc.).
> 
Yes, but Hermon is the highest one, so theoretically Mellanox can only
go downhill from there :)

--
Gleb.


Re: [OMPI devel] [ewg] Re: [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-10 Thread Jeff Squyres

On May 10, 2007, at 9:02 AM, Or Gerlitz wrote:

A different approach which you might want to consider is to have, at
the btl level, --two-- connections per pair of ranks: if A wants to
send to B it does so through the A --> B connection, and if B wants to
send to A it does so through the B --> A connection. To some extent,
this is the approach taken by IPoIB-CM (I am not deep enough into the
RFC to understand the reasoning, but I am quite sure this was the
approach in the initial implementation). At first thought it might seem
not very elegant, but once you work through the details (projected onto
the ompi env) you might find it even nice.

What is the advantage of this approach?


To start with, my hope here is at least to be able to play defense here,
that is, to convince you that the disadvantages are minor; only if this
fails would I schedule myself some reading into the ipoib-cm rfc to dig
out the advantages.


I ask about the advantages because OMPI currently treats QP's as
bi-directional.  Having OMPI treat them as unidirectional would be a
change.  I'm not against such a change, but I think we'd need to be
convinced that there are good reasons to do so.  For example, on the
surface, it seems like this scheme would simply consume more QPs and
potentially more registered memory (and is therefore unattractive).


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [ewg] Re: [ofa-general] Re: OMPI over ofed udapl - bugs opened

2007-05-10 Thread Gleb Natapov
On Thu, May 10, 2007 at 04:30:27PM +0300, Or Gerlitz wrote:
> Jeff Squyres wrote:
> >On May 10, 2007, at 9:02 AM, Or Gerlitz wrote:
> 
> >>To start with, my hope here is at least to be able to play defense
> >>here, that is, to convince you that the disadvantages are minor; only
> >>if this fails would I schedule myself some reading into the ipoib-cm
> >>rfc to dig out the advantages.
> 
> >I ask about the advantages because OMPI currently treats QP's as 
> >bi-directional.  Having OMPI treat them as unidirectional would be a
> >change.  I'm not against such a change, but I think we'd need to be 
> >convinced that there are good reasons to do so.  For example, on the 
> >surface, it seems like this scheme would simply consume more QPs and 
> >potentially more registered memory (and is therefore unattractive).
> 
> Indeed you would need two QPs per btl connection; however, for each
> direction you can make the relevant QP consume ~zero resources for the
> other direction, i.e. on side A:
> 
> for the A --> B QP : RX WR num = 0, RX SG size = 0
> for the B --> A QP : TX WR num = 0, TX SG size = 0
> 
> and on side B the other way around. I think that IB disallows a
> zero-length WR num, so you actually set it to 1. Note that since you use
> SRQ for large jobs, you have zero overhead for RX resources and only this
> one TX WR of overhead for the "RX" connection on each side. This is the
> only memory-related overhead, since you don't have to allocate any extra
> buffers over what you do now.
> 
QPs are a limited resource and we already have 2 per connection (and many
more if LMC is in use), so I don't see any reason to use this scheme only
to overcome the brain-damaged design of iWARP.

--
Gleb.
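
For reference, a rough verbs-level sketch (not OMPI code; names and depths
are illustrative) of the asymmetric sizing Or describes above: a send-only
QP for the A --> B direction and a receive-only QP for B --> A, with the
"practically zero" side set to 1 since some HCAs reject a cap of 0:

/* Asymmetric RC QP sizing sketch: one direction carries only sends,
 * the other only receives (optionally backed by an SRQ). */
#include <infiniband/verbs.h>

struct ibv_qp *create_send_only_qp(struct ibv_pd *pd,
                                   struct ibv_cq *send_cq,
                                   struct ibv_cq *recv_cq,
                                   unsigned int send_depth)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = send_cq,
        .recv_cq = recv_cq,
        .cap = {
            .max_send_wr  = send_depth,  /* real TX depth for A --> B */
            .max_recv_wr  = 1,           /* ~zero RX resources        */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = IBV_QPT_RC,
    };
    return ibv_create_qp(pd, &attr);
}

struct ibv_qp *create_recv_only_qp(struct ibv_pd *pd,
                                   struct ibv_cq *send_cq,
                                   struct ibv_cq *recv_cq,
                                   struct ibv_srq *srq,
                                   unsigned int recv_depth)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = send_cq,
        .recv_cq = recv_cq,
        .srq     = srq,                  /* with an SRQ, RX buffers are shared */
        .cap = {
            .max_send_wr  = 1,           /* ~zero TX resources for B --> A */
            .max_recv_wr  = srq ? 0 : recv_depth,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = IBV_QPT_RC,
    };
    return ibv_create_qp(pd, &attr);
}

Whether the receive/send resources saved this way outweigh the cost of
doubling the QP count per peer (Gleb's objection) is exactly the trade-off
being argued in this thread.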


Re: [OMPI devel] [ewg] Re: Re: OMPI over ofed udapl - bugs opened

2007-05-10 Thread Jeff Squyres

On May 10, 2007, at 10:28 AM, Michael S. Tsirkin wrote:


What is the advantage of this approach?


Current ipoib cm uses this approach to simplify the implementation.
Overhead seems insignificant.


I think MPI's requirements are a bit different than IPoIB.  See  
Gleb's response. It is not uncommon for MPI apps to have connections  
open to many peers simultaneously.  Registered memory / internal  
buffering usage is a Big Deal in the MPI / HPC community.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [ofa-general] Re: [ewg] Re: Re: OMPI over ofed udapl - bugs opened

2007-05-10 Thread Gleb Natapov
On Thu, May 10, 2007 at 05:56:13PM +0300, Michael S. Tsirkin wrote:
> > Quoting Jeff Squyres :
> > Subject: Re: [ewg] Re: [OMPI devel] Re: OMPI over ofed udapl - bugs opened
> > 
> > On May 10, 2007, at 10:28 AM, Michael S. Tsirkin wrote:
> > 
> > >>What is the advantage of this approach?
> > >
> > >Current ipoib cm uses this approach to simplify the implementation.
> > >Overhead seems insignificant.
> > 
> > I think MPI's requirements are a bit different than IPoIB.  See  
> > Gleb's response. It is not uncommon for MPI apps to have connections  
> > open to many peers simultaneously.
> 
> You mean, hundreds of QPs between the same pair of hosts?
> Yes, in this case you might start running out of QPs.

Why does it matter whether the QPs are between the same pair of hosts or not?
QPs are a global resource, aren't they?

> 
> > Registered memory / internal  
> > buffering usage is a Big Deal in the MPI / HPC community.
> 
> I don't see the connection with the # of QPs.
> They are very cheap in memory.
> 
4K is cheap?

--
Gleb.


Re: [OMPI devel] OMPI over ofed udapl over iwarp

2007-05-10 Thread Steve Wise

> 
> There are two new issues so far:
> 
> 1) this has uncovered a connection migration issue in the Chelsio
> driver/firmware.  We are developing and testing a fix for this now.
> Should be ready tomorrow hopefully.
> 

I have a fix for the above issue and I can continue with OMPI testing.

To work around the client-must-send issue, I put a nice fat sleep in the
udapl btl right after it calls dat_cr_accept(), in
mca_btl_udapl_accept_connect().  This, however, exposes another issue
with the udapl btl:

Neither the client nor the server side of the udapl btl connection setup
pre-post RECV buffers before connecting.  This can allow a SEND to
arrive before a RECV buffer is available.  I _think_ IB will handle this
issue by retransmitting the SEND.  Chelsio's iWARP device, however,
TERMINATEs the connection.  My sleep() makes this condition happen every
time.  

From what I can tell, the udapl btl exchanges memory info as a first
order of business after connection establishment
(mca_btl_udapl_sendrecv()).  The RECV buffer post for this exchange,
however, should really be done _before_ the dat_ep_connect() on the
active side, and _before_ the dat_cr_accept() on the server side.
Currently it's done after the ESTABLISHED event is dequeued, thus
allowing the race condition.

I believe the rules are that the ULP must ensure a RECV is posted before
the client can post a SEND for that buffer.  And further, the ULP must
enforce flow control somehow so that a SEND never arrives without a RECV
buffer being available.

Perhaps this is just a bug and I opened it up with my sleep()

Or is the uDAPL btl assuming the transport will deal with lack of RECV
buffer at the time a SEND arrives?

Also: given that there is _always_ a message exchange after connection setup,
we can change that exchange to support the client-must-send-first
issue...


Steve.
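
For reference, a hypothetical sketch (not the actual udapl btl code; the DAT
prototypes follow the uDAPL 1.2 spec as I recall them, and error handling is
trimmed) of the ordering Steve describes for the active side -- post the RECV
for the initial exchange before dat_ep_connect(), with the passive side doing
the same before dat_cr_accept():

/* Post the RECV for the post-connect address exchange, then connect,
 * so a fast SEND from the peer can never arrive without a buffer. */
#include <dat/udat.h>

DAT_RETURN connect_with_preposted_recv(DAT_EP_HANDLE ep,
                                       DAT_IA_ADDRESS_PTR remote_addr,
                                       DAT_CONN_QUAL conn_qual,
                                       DAT_LMR_TRIPLET *recv_iov)
{
    DAT_DTO_COOKIE cookie;
    DAT_RETURN rc;

    cookie.as_ptr = NULL;

    /* 1. Post the receive buffer *before* dat_ep_connect(), not after
     *    the ESTABLISHED event is dequeued. */
    rc = dat_ep_post_recv(ep, 1, recv_iov, cookie,
                          DAT_COMPLETION_DEFAULT_FLAG);
    if (DAT_SUCCESS != rc) {
        return rc;
    }

    /* 2. Only now initiate the connection; the passive side would
     *    similarly post its RECV before calling dat_cr_accept(). */
    return dat_ep_connect(ep, remote_addr, conn_qual,
                          DAT_TIMEOUT_INFINITE,
                          0, NULL,                /* no private data */
                          DAT_QOS_BEST_EFFORT,
                          DAT_CONNECT_DEFAULT_FLAG);
}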






Re: [OMPI devel] OMPI over ofed udapl over iwarp

2007-05-10 Thread Caitlin Bestler
devel-boun...@open-mpi.org wrote:
>> There are two new issues so far:
>> 
>> 1) this has uncovered a connection migration issue in the Chelsio
>> driver/firmware.  We are developing and testing a fix for this now.
>> Should be ready tomorrow hopefully.
>> 
> 
> I have a fix for the above issue and I can continue with OMPI testing.
> 
> To work around the client-must-send issue, I put a nice fat
> sleep in the udapl btl right after it calls dat_cr_accept(),
> in mca_btl_udapl_accept_connect().  This, however, exposes
> another issue with the udapl btl:
> 
> Neither the client nor the server side of the udapl btl
> connection setup pre-post RECV buffers before connecting.
> This can allow a SEND to arrive before a RECV buffer is
> available.  I _think_ IB will handle this issue by
> retransmitting the SEND.  Chelsio's iWARP device, however,
> TERMINATEs the connection.  My sleep() makes this condition
> happen every time.
> 

A compliant DAPL program also ensures that there are adequate
receive buffers in place before the remote peer Sends. It is
explicitly noted that failure to follow this rule will invoke
a transport/device dependent penalty. It may be that the sendq
will be fenced, or it may be that the connection will be terminated.

So any RDMA BTL should pre-post recv buffers before initiating or
accepting a connection.



>> From what I can tell, the udapl btl exchanges memory info as a first
> order of business after connection establishment
> (mca_btl_udapl_sendrecv()).  The RECV buffer post for this
> exchange, however, should really be done _before_ the
> dat_ep_connect() on the active side, and _before_ the
> dat_cr_accept() on the server side.
> Currently its done after the ESTABLISHED event is dequeued,
> thus allowing the race condition.
> 
> I believe the rules are the ULP must ensure that a RECV is
> posted before the client can post a SEND for that buffer.
> And further, the ULP must enforce flow control somehow so
> that a SEND never arrives without a RECV buffer being available.
> 
> Perhaps this is just a bug and I opened it up with my sleep()
> 
> Or is the uDAPL btl assuming the transport will deal with
> lack of RECV buffer at the time a SEND arrives?
> 

No. uDAPL *allows* a provider to compensate for this through
unspecified means, but the application MUST NOT rely on it
(on the flip side the application MUST NOT rely on any
mistake generating a fault. That's akin to relying on
a state trooper pulling you over when you exceed the
speed limit. It is always possible that your application
has too many buffers in flight but this is never detected
because the new buffers are posted before the messages
actually arrive. You're not supposed to do that, but you
have a good chance of getting away with it).

As a general rule DAPL *never* requires a provider to
check anything that the provider does not need to check
on its own (other than memory access rights). So typically
the provider will complain about too many buffers when it
actually runs out of buffers, not when the application's
end-to-end credits are theoretically negative. A "fast
path" interface becomes a lot less so if every work 
request is validated dynamically against every relevant
restriction.
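
For reference, a generic credit-counting sketch (illustrative only, not
OMPI's actual flow-control code) of the rule Steve and Caitlin describe:
never post a SEND unless the peer is known to have a matching RECV posted,
and return credits as the peer re-posts buffers:

/* Sender-side send credits: one credit == one RECV known to be posted
 * at the peer.  Credits are returned piggybacked on reverse traffic. */
#include <stdbool.h>

struct send_credits {
    int available;               /* RECVs the peer is known to have posted */
};

static bool try_consume_credit(struct send_credits *c)
{
    if (c->available > 0) {
        c->available--;          /* one RECV at the peer is now spoken for */
        return true;             /* caller may post the SEND               */
    }
    return false;                /* queue the fragment until credits return */
}

static void return_credits(struct send_credits *c, int n)
{
    c->available += n;           /* peer reported n freshly posted RECVs */
}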




Re: [OMPI devel] OMPI over ofed udapl over iwarp

2007-05-10 Thread Jeff Squyres

Steve --

Can you file a trac bug about this?


On May 10, 2007, at 6:15 PM, Steve Wise wrote:


There are two new issues so far:

1) this has uncovered a connection migration issue in the Chelsio
driver/firmware.  We are developing and testing a fix for this now.
Should be ready tomorrow hopefully.


I have a fix for the above issue and I can continue with OMPI testing.

To work around the client-must-send issue, I put a nice fat sleep in the
udapl btl right after it calls dat_cr_accept(), in
mca_btl_udapl_accept_connect().  This, however, exposes another issue
with the udapl btl:

Neither the client nor the server side of the udapl btl connection setup
pre-posts RECV buffers before connecting.  This can allow a SEND to
arrive before a RECV buffer is available.  I _think_ IB will handle this
issue by retransmitting the SEND.  Chelsio's iWARP device, however,
TERMINATEs the connection.  My sleep() makes this condition happen every
time.

From what I can tell, the udapl btl exchanges memory info as a first
order of business after connection establishment
(mca_btl_udapl_sendrecv()).  The RECV buffer post for this exchange,
however, should really be done _before_ the dat_ep_connect() on the
active side, and _before_ the dat_cr_accept() on the server side.
Currently it's done after the ESTABLISHED event is dequeued, thus
allowing the race condition.

I believe the rules are that the ULP must ensure a RECV is posted before
the client can post a SEND for that buffer.  And further, the ULP must
enforce flow control somehow so that a SEND never arrives without a RECV
buffer being available.

Perhaps this is just a bug and I opened it up with my sleep()

Or is the uDAPL btl assuming the transport will deal with lack of RECV
buffer at the time a SEND arrives?

Also: given that there is _always_ a message exchange after connection
setup, we can change that exchange to support the client-must-send-first
issue...


Steve.





--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] OMPI over ofed udapl over iwarp

2007-05-10 Thread Donald Kerr



Caitlin Bestler wrote:


devel-boun...@open-mpi.org wrote:
 


There are two new issues so far:

1) this has uncovered a connection migration issue in the Chelsio
driver/firmware.  We are developing and testing a fix for this now.
Should be ready tomorrow hopefully.

 


I have a fix for the above issue and I can continue with OMPI testing.

To work around the client-must-send issue, I put a nice fat
sleep in the udapl btl right after it calls dat_cr_accept(),
in mca_btl_udapl_accept_connect().  This, however, exposes
another issue with the udapl btl:
   

Sleeping after accept? What are you trying to do here, force a race
condition?



Neither the client nor the server side of the udapl btl
connection setup pre-post RECV buffers before connecting.
This can allow a SEND to arrive before a RECV buffer is
available.  I _think_ IB will handle this issue by
retransmitting the SEND.  Chelsio's iWARP device, however,
TERMINATEs the connection.  My sleep() makes this condition
happen every time.

   



A compliant DAPL program also ensures that there are adequate
receive buffers in place before the remote peer Sends. It is
explicitly noted that failure to follow this rule will invoke
a transport/device dependent penalty. It may be that the sendq
will be fenced, or it may be that the connection will be terminated.

So any RDMA BTL should pre-post recv buffers before initiating or
accepting a connection.

 


I know of no udapl restriction saying a recv must be posted before a send.
And yes, we do pre-post recv buffers, but since the BTL creates 2
connections per peer, one for eager-size messages and one for max-size
messages, the BTL needs to know which connection the current endpoint is
to service so that it can post the proper-size recv buffer.


Also, I agree that in theory the btl could potentially post the recv which
currently occurs in mca_btl_udapl_sendrecv before the connect or accept,
but I think in practice we had issues doing this and had to wait until
a DAT_CONNECTION_EVENT_ESTABLISHED was received.




 


From what I can tell, the udapl btl exchanges memory info as a first
 


order of business after connection establishment
(mca_btl_udapl_sendrecv()).  The RECV buffer post for this
exchange, however, should really be done _before_ the
dat_ep_connect() on the active side, and _before_ the
dat_cr_accept() on the server side.
Currently its done after the ESTABLISHED event is dequeued,
thus allowing the race condition.

I believe the rules are the ULP must ensure that a RECV is
posted before the client can post a SEND for that buffer.
And further, the ULP must enforce flow control somehow so
that a SEND never arrives without a RECV buffer being available.
   


Maybe this is a rule iWARP imposes on its ULPs, but not uDAPL.


Perhaps this is just a bug and I opened it up with my sleep()

Or is the uDAPL btl assuming the transport will deal with
lack of RECV buffer at the time a SEND arrives?
   

There may be a race condition here but you really have to try hard to 
see it.


From Steve previously:
"Also: given that there is _always_ a message exchange after connection
setup, we can change that exchange to support the
client-must-send-first issue..."


I agree; I am sure we can do something, but if it includes an additional
message we should consider an MCA parameter to govern this, because the
connection wireup is already costly enough.


-DON


   



No. uDAPL *allows* a provider to compensate for this through
unspecified means, but the application MUST NOT rely on it
(on the flip side the application MUST NOT rely on any
mistake generating a fault. That's akin to relying on
a state trooper pulling you over when you exceed the
speed limit. It is always possible that your application
has too many buffers in flight but this is never detected
because the new buffers are posted before the messages
actually arrive. You're not supposed to do that, but you
have a good chance of getting away with it).

As a general rule DAPL *never* requires a provider to
check anything that the provider does not need to check
on its own (other than memory access rights). So typically
the provider will complain about too many buffers when it
actually runs out of buffers, not when the application's
end-to-end credits are theoretically negative. A "fast
path" interface becomes a lot less so if every work 
request is validated dynamically against every relevant

restriction.

